
3. Model Benchmarks

3.1 Filter By Benchmarks

For RAG architectures, we need a chat completion model and an embedding model. To select a model for prototyping, we'll filter by inference task, then look for models with benchmarks, then compare a few by available metrics to make a decision. Let's find our chat model:

  1. Filter by Chat Completion → see: 62 models
  2. Now, Filter by Benchmark Results → see: 51 models
  3. You should see something like this:

    FIGURE: example screenshot

  4. Click Compare Models → see: Assess model performance with evaluated metrics

Let's use this page to compare the model options by available benchmarks.
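
Before moving on, here is a minimal sketch of why a RAG prototype needs both model types: the embedding model turns the question (and, at index time, your documents) into vectors for retrieval, while the chat completion model generates the grounded answer. This assumes an OpenAI-compatible endpoint; the environment variables and model names below are placeholders, not selections made in this walkthrough.

```python
# Minimal RAG sketch showing where each model type fits.
# Endpoint, key, and model names are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url=os.environ["MODEL_ENDPOINT"],   # placeholder: your OpenAI-compatible endpoint
    api_key=os.environ["MODEL_API_KEY"],     # placeholder credential
)

question = "How do I rotate my API keys?"

# Embedding model: converts the question (and, at index time, your documents)
# into vectors used for similarity search / retrieval.
query_vector = client.embeddings.create(
    model="text-embedding-3-small",          # placeholder embedding model name
    input=question,
).data[0].embedding

# ... vector search over your index goes here; assume it returns the most
# relevant chunk(s) as plain text ...
retrieved_context = "Rotate keys from the portal under Settings > Keys."

# Chat completion model: answers the question, grounded in the retrieved context.
response = client.chat.completions.create(
    model="gpt-4o-mini",                     # the chat model we select later in this unit
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{retrieved_context}\n\nQuestion: {question}"},
    ],
)
print(response.choices[0].message.content)
```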


3.2 Compare By Benchmarks

In the previous step, we saw 51 choices that included the 4 models below.

  • gpt-4o, gpt-4o-mini, AI21-Jamba-1.5-Mini, and Phi-3-mini-128k-instruct.

Let's use these as a sample for an exercise in using benchmarks for model selection.

  1. The Benchmarks Compare View will have default models selected. Delete the defaults.
  2. Now, add the 4 models above (one at a time) using the + Model to compare button.
  3. You should see something like this:

    FIGURE: example screenshot

  4. Explore the available criteria for comparison (click each drop-down in the chart).

    • Criteria include: quality, embeddings, cost and latency.
  5. Select Accuracy for the x-axis and Cost for the y-axis, as shown in the figure above.
    • The chart will update to show where each model falls on this comparison.
    • Higher accuracy and lower cost are better.
  6. Observe the chart. We can see:
    • The AI21-Jamba-1.5-Mini model costs the least but is also the least accurate.
    • The gpt-4o model has the highest accuracy but also the highest cost.
    • The gpt-4o-mini model has a cost comparable to AI21-Jamba-1.5-Mini and is second only to gpt-4o in accuracy.
  7. Make an informed decision: select gpt-4o-mini (a short sketch after this list shows one way to codify this tradeoff).
    • We'll review the Model card in the next section to determine next steps.
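
One way to make the decision in step 7 concrete is to encode the tradeoff as a simple rule: pick the most accurate model whose cost fits your budget. The sketch below does exactly that; the accuracy and cost numbers are illustrative placeholders (not benchmark results), so substitute the values you read off the compare chart.

```python
# Toy decision rule for step 7: highest accuracy within a cost budget.
# The accuracy/cost numbers are illustrative placeholders -- replace them
# with the values shown on the benchmarks compare chart.
candidates = {
    "gpt-4o":                   {"accuracy": 0.90, "cost": 10.0},
    "gpt-4o-mini":              {"accuracy": 0.80, "cost": 1.0},
    "AI21-Jamba-1.5-Mini":      {"accuracy": 0.60, "cost": 0.8},
    "Phi-3-mini-128k-instruct": {"accuracy": 0.65, "cost": 0.9},
}

cost_budget = 2.0  # whatever your prototype can afford, in the chart's cost units

affordable = {name: m for name, m in candidates.items() if m["cost"] <= cost_budget}
choice = max(affordable, key=lambda name: affordable[name]["accuracy"])
print(choice)  # with these placeholder numbers: gpt-4o-mini
```

With real numbers from the chart, this is just a sanity check on the judgment we made by eye.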

HOMEWORK: Walk through a similar process to select an embedding model.
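
For the homework, benchmarks will narrow the field; a quick hands-on sanity check of a shortlisted embedding model is to confirm it ranks an obviously relevant passage above an irrelevant one for a sample query. A rough sketch, again assuming an OpenAI-compatible endpoint with placeholder names:

```python
# Sanity check for a candidate embedding model: does it rank a clearly
# relevant passage above an irrelevant one? Endpoint, key, and model
# name are placeholders.
import math
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["MODEL_ENDPOINT"], api_key=os.environ["MODEL_API_KEY"])

def embed(text: str, model: str = "text-embedding-3-small") -> list[float]:
    # model name is a placeholder -- swap in each candidate you are evaluating
    return client.embeddings.create(model=model, input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

query = embed("How do I reset my password?")
relevant = embed("To reset your password, open Settings and choose 'Reset password'.")
unrelated = embed("The cafeteria serves lunch between 11am and 2pm.")

# A usable embedding model should rank the relevant passage higher.
print(cosine(query, relevant) > cosine(query, unrelated))  # expect: True
```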


3.3 List By Benchmarks

The compare view above lets you assess model choices relative to each other based on specific criteria like accuracy and cost. The list view provides more detailed metrics for each model, giving insight into its effectiveness for various tasks. Learn more:

  1. Benchmarking of LLMs and SLMs.

  2. Benchmarking of embedding models.

Let's explore this briefly for the gpt-4o-mini model we selected earlier.

  1. Search for the model by name as shown below. You should see:

    • A list of benchmark rows for that model, each with a model version and an associated dataset
    • Each row has columns for relevant quality metrics (with values, where assessed)
    • The top row provides the average for each metric across all assessed benchmarks
    FIGURE: example screenshot

  2. We see this model ranks well on accuracy and on prompt-based metrics like coherence, fluency, and groundedness, but does less well on GPTSimilarity. See: Quality docs for explainers on what each metric means. Overall, the selected model's quality is acceptable. (A sketch after the examples below illustrates how such prompt-based metrics are typically scored.)

  3. Each row of benchmarks for a model identifies a dataset and a task. The dataset contains example inputs relevant to the task, along with the information needed to assess the quality of the model's response to each input. The resulting quality metrics are listed in that row. Click on a dataset to get more details on what it assesses, and how.

    • Ex 1: Click human_eval, which assesses accuracy for Text generation tasks.

      • It assesses the functional correctness of code generated from a given word problem.
      • It scores this model at 0.841 accuracy for this text generation task (a short sketch at the end of this section shows the idea behind this kind of check).
      FIGURE: Dataset details for HumanEval

    • Ex 2: Click squad_v2, which assesses groundedness and relevance for QA tasks.

      • It assesses reading comprehension using questions over a set of Wikipedia articles.
      • It scores this model at 4.146 for Groundedness and 3.753 for GPTSimilarity.
      FIGURE: Dataset details for squad_v2
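
As an aside, prompt-based metrics like groundedness are typically produced by an LLM judge that rates the response against the source context on a fixed scale (a 1-5 scale, consistent with the scores shown above). The sketch below shows that general pattern; it is not the exact grading prompt the benchmark uses, and the endpoint, key, and judge model name are placeholders.

```python
# Rough sketch of how a prompt-based metric like groundedness is scored:
# an LLM judge rates the answer against the source context on a 1-5 scale.
# General pattern only -- not the benchmark's actual grading prompt.
import os
from openai import OpenAI

client = OpenAI(base_url=os.environ["MODEL_ENDPOINT"], api_key=os.environ["MODEL_API_KEY"])

GRADER_PROMPT = """You are grading groundedness on a 1-5 scale, where 5 means
every claim in the ANSWER is supported by the CONTEXT and 1 means none are.
Reply with a single integer.

CONTEXT: {context}
QUESTION: {question}
ANSWER: {answer}"""

def groundedness(question: str, context: str, answer: str, judge: str = "gpt-4o") -> int:
    reply = client.chat.completions.create(
        model=judge,  # placeholder judge model
        messages=[{"role": "user", "content": GRADER_PROMPT.format(
            context=context, question=question, answer=answer)}],
    )
    # A real evaluator parses the reply more defensively and averages over many samples.
    return int(reply.choices[0].message.content.strip())
```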

This gives us a quick sense of the general suitability of the selected model based on benchmarks. The next step is to explore the model card.
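
Finally, as promised under Ex 1: the HumanEval accuracy number reflects functional correctness, meaning the model's generated code is executed against unit tests and a sample only counts as correct if those tests pass. A stripped-down sketch of that idea (real harnesses sandbox the execution and aggregate pass@k across the full problem set):

```python
# Stripped-down idea behind a HumanEval-style functional-correctness check:
# run the model-generated code, then execute the problem's unit tests.

def passes_tests(generated_code: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(generated_code, namespace)   # define the candidate function
        exec(test_code, namespace)        # run the problem's asserts against it
        return True
    except Exception:
        return False

# One toy problem: the "word problem" is the docstring, the tests are asserts.
candidate = '''
def add(a, b):
    """Return the sum of two numbers."""
    return a + b
'''
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"

print(passes_tests(candidate, tests))  # True -> this sample counts toward accuracy
```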