4.3 Run Batch Evaluation
In the previous section, we assessed a single answer for a single metric, running one Prompty at a time. In reality, we will need to run assessments automatically across a large set of test inputs, with all custom evaluators, before we can judge if the application is ready for production use. In this exercise, we'll run a batch evaluation on our Contoso Chat application, using a Jupyter notebook.
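Conceptually, the batch run boils down to a loop: for every test input, get the application's answer, then score that answer with each custom evaluator. The sketch below illustrates the idea only; `get_response` and `evaluate_metric` are placeholder stubs, not the notebook's actual code.

```python
# Minimal sketch of a batch-evaluation loop (illustrative, not the notebook's code).
# `get_response` and `evaluate_metric` are placeholder stubs standing in for the
# chat function and the Prompty-based evaluators used in this workshop.

METRICS = ["groundedness", "fluency", "coherence", "relevance"]

def get_response(customer_id: str, question: str, chat_history: list) -> dict:
    """Placeholder for the chat application's answer function."""
    return {"question": question, "answer": f"(answer for customer {customer_id})", "context": ""}

def evaluate_metric(metric: str, question: str, answer: str, context: str) -> int:
    """Placeholder for a Prompty evaluator that returns a 1-5 score."""
    return 5

def run_batch_evaluation(test_cases: list[dict]) -> list[dict]:
    """Score every test case on every metric and collect the results."""
    results = []
    for case in test_cases:
        response = get_response(case["customerId"], case["question"], chat_history=[])
        scores = {m: evaluate_metric(m, case["question"], response["answer"], response["context"])
                  for m in METRICS}
        results.append({"question": case["question"], "answer": response["answer"], **scores})
    return results

if __name__ == "__main__":
    cases = [{"customerId": "1", "question": "What tent should I buy for 4 people?"}]
    print(run_batch_evaluation(cases))
```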
1. Run Evaluation Notebook
- Navigate to the `src/api` folder in Visual Studio Code.
- Click: `evaluate-chat-flow.ipynb` - see: a Jupyter notebook opens in the editor.
- Click: Select Kernel - choose "Python Environments" - pick the recommended `Python 3.11.x`.
- Click: `Run all` - this kickstarts the multi-step evaluation flow.
You may see a pop-up alert saying "The notebook is not displayed in the notebook editor because it is very large", with two options to proceed. Select the `Open Anyway` (default) option.
2. Watch Evaluation Runs
One of the benefits of using Prompty is the built-in `Tracer` feature that captures execution traces for the entire workflow. These trace runs are stored in `.tracy` files in the `api/.runs/` folder, as shown in the figure below.
- Keep this explorer sidebar open while the evaluation notebook runs.
- You see: `get_response` traces when our chat application is running.
- You see: `groundedness` traces when its groundedness is evaluated.
- You see: similar `fluency`, `coherence`, and `relevance` traces.
These are live trace runs, so you should be able to make the following observations once the run completes:

- There will be 12 `get_response` traces, corresponding to the 12 chat prompts executed by our chat AI.
- For each of these responses, you should see 4 more traces, one for each of the 4 custom evaluators we have defined.
- Clicking a `.tracy` file should open the Trace Viewer window, allowing you to dive into the data visually.
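If you want to sanity-check those counts without clicking through the sidebar, a small script can tally the trace files. This is a sketch that assumes each `.tracy` file name starts with the name of the traced function; the actual naming convention may differ.

```python
# Tally .tracy trace files by name prefix (illustrative; run from the src/api folder).
# Assumes file names begin with the traced function's name (e.g. get_response,
# groundedness) - adjust the split logic if the real naming convention differs.
from collections import Counter
from pathlib import Path

runs_dir = Path(".runs")
counts = Counter(path.name.split(".")[0] for path in runs_dir.glob("*.tracy"))

for name, count in sorted(counts.items()):
    print(f"{name}: {count} trace(s)")
# Expected here: 12 get_response traces, plus 12 traces per evaluator metric.
```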
3. Explore: Evaluation Trace
OPTIONAL: Explore `.tracy` files with Trace Viewer

The Prompty runtime generates `.tracy` files (an underlying JSON format) that capture the execution trace from prompt (input) to response (output). This section explains how you can use these traces to view or debug workflows.
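Because the files are JSON under the hood, you can also peek at a trace directly from Python, which is handy if the Trace Viewer is unavailable. The file name and the `__frames` child key below are assumptions about the schema; open a real file from `api/.runs/` to confirm.

```python
# Inspect a .tracy file as plain JSON (illustrative).
import json
from pathlib import Path

trace_path = Path(".runs") / "get_response.tracy"   # placeholder file name
trace = json.loads(trace_path.read_text())

print("top-level keys:", list(trace.keys()))

def print_span_names(node: dict, depth: int = 0) -> None:
    """Recursively print span names, assuming nested spans live under '__frames'."""
    print("  " * depth + str(node.get("name", "<unnamed>")))
    for child in node.get("__frames", []):
        print_span_names(child, depth + 1)

print_span_names(trace)
```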
To explore the evaluation trace:

- Wait until the batch evaluation process completes.
- Click a `.tracy` file to launch the Trace Viewer (see figure above).
The trace viewer feature is experimental. You may need to click, wait, and retry a few times before the viewer loads the file successfully. Skip this section and revisit it at home if time is limited.
- Observe the Trace View.
    - You should see a waterfall view on the left and a detail view on the right.
    - The waterfall view shows the sequence of steps in the orchestrated flow.
        - "Prompty" icons show asset execution (load-prepare-run).
        - "Open AI" icons show model invocations (chat, embeddings).
        - Cube icons represent Python function invocations (code).
    - Click an item on the left to see its detailed trace on the right.
- Explore the `get_response` root trace.
    - Click the `get_response` node on the left.
    - Observe the trace details on the right. You should see:
        - The input query (question, customerId, chat_history)
        - The output response (question, answer)
        - Total time taken for execution
        - Total tokens used in execution
        - The token split between prompt and completion (a programmatic sketch follows this list)
- Explore a Prompty execution trace.
- Explore the Prompty tracer code.
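If you would rather pull those headline numbers out of a root trace programmatically, the sketch below shows the idea. The key names (`__time`, `__usage`, `prompt_tokens`, `completion_tokens`) are assumptions about the trace schema, so check a real `.tracy` file first and adjust.

```python
# Read timing and token usage from a root trace (illustrative; key names are assumed).
import json
from pathlib import Path

trace = json.loads((Path(".runs") / "get_response.tracy").read_text())  # placeholder name

usage = trace.get("__usage", {})                      # assumed key for token accounting
print("inputs:            ", trace.get("inputs"))     # question, customerId, chat_history
print("output:            ", trace.get("result"))     # question, answer
print("elapsed:           ", trace.get("__time"))     # assumed key for execution time
print("prompt tokens:     ", usage.get("prompt_tokens"))
print("completion tokens: ", usage.get("completion_tokens"))
print("total tokens:      ", usage.get("total_tokens"))
```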
Want to learn more about Prompty Tracing? Explore the documentation to learn how to configure your application for traces, and how to view and publish traces for debugging and observability.
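As a taste of what that configuration looks like, the Prompty Python samples wire up tracing with helpers from `prompty.tracer`, roughly as sketched below; treat the exact imports, names, and parameters as assumptions and confirm them against the documentation linked above.

```python
# Sketch of enabling Prompty tracing in your own code (names assumed from the
# Prompty samples; verify against the current documentation before relying on them).
from prompty.tracer import Tracer, PromptyTracer, console_tracer, trace

# Emit traces to the console, and to .tracy JSON files (written to a .runs folder).
Tracer.add("console", console_tracer)
json_tracer = PromptyTracer()
Tracer.add("PromptyTracer", json_tracer.tracer)

@trace
def get_response(customer_id: str, question: str, chat_history: list) -> dict:
    # Orchestration code goes here; each @trace-decorated call becomes a span.
    return {"question": question, "answer": "..."}
```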
CONGRATULATIONS. You ran a batch evaluation on the chat AI application responses!