6️⃣ | Evaluate Chat Agent

You can unit test your flow. However, Prompt flow also provides a gallery of sample evaluation flows you can use to test your flow in bulk, for example Classification Accuracy, QnA Groundedness, QnA Relevance, QnA Similarity, and QnA F1 Score. These let you measure how well your LLM is performing. You can also examine which of your prompt variants performs better. In this example, we’ll use the QnA Groundedness Evaluation and QnA Relevance Evaluation templates to test our flow.

Evaluate chatbot flow

In the Outputs section, click the + Add output button. Then enter context for the Name field and ${chat.output} for the Value field.

Click the Save button.
Click the Evaluate button at the top right of the screen.

VARIANT FOR EVALUATION

You can evaluate the default variant of your prompt or select the variants you prefer. This is a good way to compare different variants and see which prompts perform better.

On the Batch run & Evaluate page, select Select a node to run variants. Choose variant_1 (default) and click on Next.

Under Data, select the test-contoso-dental-data dataset you created earlier. A preview of the top 5 rows of the data should be displayed at the bottom of the page.
Under Input mapping, enter empty square brackets [] as the value for chat_history.
Click in the Value textbox for the question field and enter ${data.question}.

Click the Next button.
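
The input mapping above can be pictured as a per-row lookup. The following is only a minimal sketch of that idea and not Prompt flow's actual implementation: the helper resolve_inputs and the sample row are hypothetical, and the only details taken from the steps above are the chat_history value [] and the ${data.question} reference.

```python
import re

# Hypothetical pattern for a "${data.<column>}" reference, as typed in the Value textbox.
DATA_REFERENCE = re.compile(r"^\$\{data\.(\w+)\}$")


def resolve_inputs(mapping, row):
    """Build the flow inputs for one dataset row (illustrative helper, not a Prompt flow API)."""
    resolved = {}
    for name, value in mapping.items():
        match = DATA_REFERENCE.match(value) if isinstance(value, str) else None
        if match:
            # Replace the reference with the value of that column in the current row.
            resolved[name] = row[match.group(1)]
        else:
            # Literal values such as [] are passed through unchanged.
            resolved[name] = value
    return resolved


# The mapping entered in the steps above.
mapping = {"chat_history": [], "question": "${data.question}"}

# Assumed example row; the real dataset rows will differ.
row = {"question": "What are your opening hours?"}

inputs = resolve_inputs(mapping, row)
print(inputs)
```

Each row of the dataset produces one set of inputs in this way, so the flow is run once per row with an empty chat history.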

MULTIPLE EVALUATION

You can select one or more evaluation templates to validate your flow. It depends on your use case and which performance insights you want to get.

On the Select evaluation page, select the checkboxes for the QnA Groundedness Evaluation and QnA Relevance Evaluation.

Click the Next button.
Click on the right arrow “>” to expand the QnA Groundedness Evaluation settings.

Select the test-contoso-dental-dataset dataset you uploaded earlier for the Choose data asset for evaluation field.
Enter ${run.outputs.answer} for the answer field.
Click on the Data Source textbox and enter ${data.question} for the question field.
Enter ${run.outputs.context} for the context field.
On the right-hand side of the page, scroll down to the bottom of the page.
Select your AzureOpenAI connection name (e.g. azure-openai-conn) for the Connection fields.
The Deployment name / Model field should automatically populate with your AzureOpenAI deployment name.
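
As a rough illustration of what the evaluation mapping expresses, the sketch below pairs each dataset row with the batch run output for the same row. It is not Prompt flow code: the resolve helper and all sample rows are hypothetical, and only the three references (${data.question}, ${run.outputs.answer}, ${run.outputs.context}) come from the steps above.

```python
# Assumed sample data: two dataset rows and the batch run outputs for the same rows.
data_rows = [
    {"question": "Do you offer teeth whitening?"},
    {"question": "What are your opening hours?"},
]
run_outputs = [
    {"answer": "Yes, we offer whitening.", "context": "Services: whitening, cleaning."},
    {"answer": "We open at 9 am.", "context": "Hours: 9 am to 5 pm."},
]

# The mapping entered in the evaluation settings.
evaluation_mapping = {
    "question": "${data.question}",
    "answer": "${run.outputs.answer}",
    "context": "${run.outputs.context}",
}

DATA_PREFIX = "${data."
OUTPUT_PREFIX = "${run.outputs."


def resolve(reference, data_row, output_row):
    """Look up one reference in the dataset row or the run output (illustrative helper)."""
    if reference.startswith(DATA_PREFIX) and reference.endswith("}"):
        return data_row[reference[len(DATA_PREFIX):-1]]
    if reference.startswith(OUTPUT_PREFIX) and reference.endswith("}"):
        return output_row[reference[len(OUTPUT_PREFIX):-1]]
    return reference  # anything else is treated as a literal value


# One evaluation input per row: the question from the data, the answer and context from the run.
evaluation_inputs = [
    {field: resolve(ref, data_row, output_row) for field, ref in evaluation_mapping.items()}
    for data_row, output_row in zip(data_rows, run_outputs)
]

for item in evaluation_inputs:
    print(item)
```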

Click on the right arrow “>” to expand the QnA Relevance Evaluation settings.
Repeat the same selections you made for the QnA Groundedness section.
Click the Next button.
Finally, click on the Submit button.
To monitor the run progress, click on the Prompt flow navigation option. Then click on the Runs tab.

Click the Refresh button to update the run status. The run should take ~60 minutes.
Click on the radio button for the QnA RAG Evaluation, then press the Visualize outputs button to view the results.

The Runs & metrics section shows a summary score for gpt_groundedness and gpt_relevance. The Outputs section shows the detailed results for each of the 2 metrics.

The score will range from 1 to 5, where 1 is the worst and 5 is the best performance.
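
To make the relationship between the detailed results and the summary score concrete, here is a minimal sketch. It assumes the summary is a simple mean of the per-row scores, which is an assumption for illustration and not something stated in this lab. The rows and the summarize helper are hypothetical; only the metric names and the 1 to 5 range come from the text above.

```python
from statistics import mean

# Assumed per-row scores, similar in shape to what the Outputs section lists.
rows = [
    {"gpt_groundedness": 5, "gpt_relevance": 4},
    {"gpt_groundedness": 3, "gpt_relevance": 5},
    {"gpt_groundedness": 4, "gpt_relevance": 5},
]


def summarize(rows, metrics=("gpt_groundedness", "gpt_relevance")):
    """Average each metric over all rows (illustrative helper, assumes a plain mean)."""
    summary = {}
    for metric in metrics:
        scores = [row[metric] for row in rows]
        # Scores outside the documented 1 to 5 range indicate a problem with the run.
        if any(not 1 <= score <= 5 for score in scores):
            raise ValueError(f"{metric} contains a score outside the 1 to 5 range")
        summary[metric] = mean(scores)
    return summary


summary = summarize(rows)
print(summary)
```

Comparing these per-metric averages across runs is one way to judge which prompt variant performs better.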

Summary

In this lab, we learned that when building a generative AI solution, it is important to apply responsible AI principles. We learned that even when an AI app provides the correct answer, it is important to validate that the answer is grounded in the context of its data source. Even when the answer is grounded, it is important to validate that the answer is relevant to the question. Finally, it is important to validate that the answer is similar to the answer provided by the data source. In the context of the Contoso dental clinic, we learned the importance of the chatbot giving out information that pertains to their specific clinic.

Next, we learned how vector indexes are useful for storing and retrieving custom data, instead of relying only on a pre-trained LLM whose data may be out of date or not relevant to your unique use case.

Finally, we learned how vector embeddings are useful for converting text to a numeric representation. This makes them useful for storing data based on relationship distance and similarity. Search is quicker and more accurate when using vector embeddings.
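
The idea of distance and similarity between embeddings can be sketched with cosine similarity, one common measure. The three-dimensional vectors below are made up for illustration. Real embeddings come from an embedding model and have far more dimensions, and the lab's vector index may use a different similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means very similar direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for three pieces of clinic content.
documents = {
    "opening hours": [0.9, 0.1, 0.0],
    "teeth whitening": [0.1, 0.8, 0.3],
    "parking": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a user question about when the clinic is open.
query = [0.8, 0.2, 0.1]

# Retrieve the document whose embedding is most similar to the query embedding.
best = max(documents, key=lambda name: cosine_similarity(query, documents[name]))
print(best)
```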
