
4.1 Create Evaluation Dataset

This begins Part 3 of the tutorial. This stage is completed using the Azure AI Foundry SDK.

At the end of this section, you should have created an evaluation input dataset, written an evaluation script for AI-assisted evaluation, and run that script to assess the quality of the chat AI app. You will then learn how to explore the evaluation results locally and in the Azure AI Foundry portal, and how to customize and automate the process to iterate rapidly and improve your application's quality.


1. Evaluating Generative AI Apps

The evaluation phase is critical to assessing the quality and safety of your generative AI application. Azure AI Foundry provides a comprehensive set of tools and guidance to support evaluation across three dimensions:

  1. Risk and safety evaluators: Evaluate potential risks associated with AI-generated content. Ex: assessing an AI's predisposition toward generating harmful or inappropriate content.
  2. Performance and quality evaluators: Assess the accuracy, groundedness, and relevance of generated content using robust AI-assisted and Natural Language Processing (NLP) metrics.
  3. Custom evaluators: Tailored evaluation metrics for more detailed and specific analyses. Ex: addressing app-specific requirements not covered by standard metrics.

AI-Assisted Evaluators are available only in select regions (Recommended: East US 2)
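As an illustration of how a quality evaluator is invoked, here is a minimal sketch using the azure-ai-evaluation Python package. The environment variable names and the choice of RelevanceEvaluator are assumptions for this example, not part of the tutorial's own setup.

```python
import os
from azure.ai.evaluation import RelevanceEvaluator

# Configuration for the "judge" model deployment (assumed environment variable names).
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"],
}

# AI-assisted evaluator: asks the judge model to score how relevant a response is to its query.
relevance = RelevanceEvaluator(model_config)

score = relevance(
    query="Which tent is the most waterproof?",
    response="The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m.",
)
print(score)  # a dict with a 1-5 relevance score (key names vary by SDK version)
```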


2. AI-Assisted Evaluation Flow

So far, we've tested the chat application interactively (command-line) using a single test prompt. Now, we want to evaluate its responses on a larger and more diverse set of test prompts, including edge cases. We can then use those results to iterate on our application (e.g., prompt template, data sources) until the evaluations pass our acceptance criteria.

To do this we use AI-assisted evaluation - also referred to as LLM-as-a-judge - where we ask a second AI model (the "judge") to grade the responses of the first AI model (the "target"). The workflow is as shown below, with a minimal code sketch following the figure:

  1. First, create an evaluation dataset that consists of diverse queries for testing.
  2. Next, have the target AI (app) generate a response for each query.
  3. Next, have the judge AI (assessor) grade the {query, response} pairs.
  4. Finally, visualize results (individual vs. aggregate metrics) for review.

Figure: Generation Quality Metrics
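To make the flow concrete, here is a minimal sketch of that loop in Python. The get_chat_response and judge functions are hypothetical placeholders standing in for your target chat app and an AI-assisted evaluator; they are not defined by this tutorial.

```python
import json

def run_ai_assisted_eval(dataset_path, get_chat_response, judge):
    """Generate responses with the target app, then grade them with the judge evaluator."""
    results = []
    with open(dataset_path, encoding="utf-8") as f:
        for line in f:                                     # 1. read each test query from the JSONL dataset
            record = json.loads(line)
            response = get_chat_response(record["query"])  # 2. target AI generates a response
            scores = judge(query=record["query"], response=response)  # 3. judge AI grades the pair
            results.append({**record, "response": response, **scores})
    return results                                         # 4. review individual and aggregate metrics

# Hypothetical usage:
# results = run_ai_assisted_eval("src/api/assets/chat_eval_data.jsonl", get_chat_response, relevance)
```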


3. Create Evaluation Dataset

Let's copy the chat_eval_data.jsonl dataset into our assets folder.

cp src.sample/api/assets/chat_eval_data.jsonl src/api/assets/.
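As an optional sanity check (not a tutorial step), you can load the copied file as JSON Lines and inspect a record:

```python
import json
from pathlib import Path

# Each line of the JSONL file is an independent JSON record with a query and its ground truth.
lines = Path("src/api/assets/chat_eval_data.jsonl").read_text(encoding="utf-8").splitlines()
records = [json.loads(line) for line in lines if line.strip()]
print(len(records), "evaluation queries loaded")
print(records[0])
```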

4. Review Evaluation Dataset

Azure AI Foundry supports different data formats for evaluation, including:

  1. Query/Response - each record has the query, the response, and the ground truth.
  2. Conversation (single/multi-turn) - a list of messages (each with content, role, and optional context); a sketch of this format is shown after the list.
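For reference, a single-turn record in the conversation format might look like the sketch below; the exact field names are an assumption based on common azure-ai-evaluation conventions, so check the documentation for your SDK version.

```json
{
  "conversation": {
    "messages": [
      {"role": "user", "content": "Which tent is the most waterproof?"},
      {"role": "assistant", "content": "<chat app response>", "context": "<retrieved product context>"}
    ]
  }
}
```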
src/api/assets/chat_eval_data.jsonl:
{"query": "Which tent is the most waterproof?", "truth": "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m"}
{"query": "Which camping table holds the most weight?", "truth": "The Adventure Dining Table has a higher weight capacity than all of the other camping tables mentioned"}
{"query": "How much do the TrailWalker Hiking Shoes cost? ", "truth": "The Trailewalker Hiking Shoes are priced at $110"}
{"query": "What is the proper care for trailwalker hiking shoes? ", "truth": "After each use, remove any dirt or debris by brushing or wiping the shoes with a damp cloth."}
{"query": "What brand is TrailMaster tent? ", "truth": "OutdoorLiving"}
{"query": "How do I carry the TrailMaster tent around? ", "truth": " Carry bag included for convenient storage and transportation"}
{"query": "What is the floor area for Floor Area? ", "truth": "80 square feet"}
{"query": "What is the material for TrailBlaze Hiking Pants?", "truth": "Made of high-quality nylon fabric"}
{"query": "What color does TrailBlaze Hiking Pants come in?", "truth": "Khaki"}
{"query": "Can the warrenty for TrailBlaze pants be transfered? ", "truth": "The warranty is non-transferable and applies only to the original purchaser of the TrailBlaze Hiking Pants. It is valid only when the product is purchased from an authorized retailer."}
{"query": "How long are the TrailBlaze pants under warranty for? ", "truth": " The TrailBlaze Hiking Pants are backed by a 1-year limited warranty from the date of purchase."}
{"query": "What is the material for PowerBurner Camping Stove? ", "truth": "Stainless Steel"}
{"query": "Is France in Europe?", "truth": "Sorry, I can only queries related to outdoor/camping gear and equipment"}

Our dataset reflects the first format: each test prompt contains a query along with the ground truth for evaluating responses. The chat AI will then generate a response (based on the query) that gets added to each record, creating the evaluation dataset that is sent to the "judge" AI.

{
    "query": "Which tent is the most waterproof?",
    "truth": "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m"
}
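After the response-generation step, each record additionally carries the response produced by the chat AI. The value below is a placeholder for illustration, not actual app output:

```json
{
    "query": "Which tent is the most waterproof?",
    "truth": "The Alpine Explorer Tent has the highest rainfly waterproof rating at 3000m",
    "response": "<response generated by the chat app>"
}
```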
Next, let's look at the evaluation script that orchestrates this workflow.