4.3 Run Batch Evaluation¶
In the previous section, we assessed a single answer for a single metric, running one Prompty at a time. In reality, we will need to run assessments automatically across a large set of test inputs, with all custom evaluators, before we can judge if the application is ready for production use. In this exercise, we'll run a batch evaluation on our Contoso Chat application, using a Jupyter notebook.
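To see what the notebook automates, here is a minimal sketch of such a batch evaluation loop. It assumes a JSONL test file with `question` and `customerId` fields, the `get_response` helper from `chat_request.py` (argument names inferred from the trace inputs described later in this section), and one evaluator `.prompty` asset per metric; the notebook's actual code will differ.

```python
# Minimal sketch of a batch evaluation loop (illustrative, not the notebook's code).
# Assumptions: a JSONL test file, and evaluator assets named "<metric>.prompty".
import json

import prompty
import prompty.azure
from chat_request import get_response  # defined in src/api/chat_request.py; argument names assumed

EVALUATORS = ["groundedness", "fluency", "coherence", "relevance"]

def evaluate_all(test_file: str) -> list[dict]:
    results = []
    with open(test_file, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            # 1. Get the chat application's answer for this test input
            response = get_response(customerId=row["customerId"], question=row["question"], chat_history=[])
            # 2. Score the answer with each custom evaluator Prompty
            scores = {
                name: prompty.execute(f"{name}.prompty",  # hypothetical evaluator asset path
                                      inputs={"question": row["question"], "answer": response["answer"]})
                for name in EVALUATORS
            }
            results.append({"question": row["question"], "answer": response["answer"], **scores})
    return results
```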
1. Run Evaluation Notebook¶
- Navigate to the `src/api` folder in Visual Studio Code.
- Click: `evaluate-chat-flow.ipynb` - see: A Jupyter notebook
- Click: Select Kernel - choose "Python Environments" - pick the recommended `Python 3.11.x`
- Click: Run all - this kickstarts the multi-step evaluation flow.

You may see a pop-up alert: "The notebook is not displayed in the notebook editor because it is very large", with two options to proceed. Select the Open Anyway (default) option.
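If you later want to run the same notebook from a script or CI pipeline instead of clicking Run all, one option (not part of this workshop) is the papermill library:

```python
# Optional alternative to "Run all": execute the notebook programmatically.
# Requires `pip install papermill`; run from the src/api folder.
import papermill as pm

pm.execute_notebook(
    "evaluate-chat-flow.ipynb",          # the notebook opened above
    "evaluate-chat-flow.output.ipynb",   # executed copy, with cell outputs
    kernel_name="python3",
)
```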
2. Watch Evaluation Runs¶
One of the benefits of using Prompty is the built-in `Tracer` feature that captures execution traces for the entire workflow. These trace runs are stored in `.tracy` files in the `api/.runs/` folder as shown in the figure below.
- Keep this explorer sidebar open while the evaluation notebook runs.
- You see: `get_response` traces when our chat application is running.
- You see: `groundedness` traces when its groundedness is evaluated.
- You see: similar `fluency`, `coherence`, and `relevance` traces.
These are live trace runs! Observe the following:
- There are 12 `get_response` traces → these correspond to the 12 prompts in our test file.
- There are 4 eval traces for each response → one for each of the 4 custom evaluators defined. (A small sketch for checking these counts programmatically follows the note below.)
- Clicking on a `.tracy` file opens a Trace Viewer window → it should look like this:
The trace viewer feature is experimental. You may need to click, wait, and retry a few times before the viewer loads the file successfully. Skip this section and revisit it at home if time is limited.
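If you want to sanity-check those counts without opening the Trace Viewer, a small sketch like this can list the trace files. It assumes the `api/.runs/` layout described above and that a trace file's name contains the traced function's name, which may not hold exactly in your version:

```python
# Sketch: count the .tracy files produced by the batch evaluation run.
# Adjust runs_dir if your traces landed elsewhere (e.g. src/api/.runs).
from pathlib import Path

runs_dir = Path("api/.runs")
tracy_files = sorted(runs_dir.glob("**/*.tracy"))
print(f"Total .tracy files: {len(tracy_files)}")

# File naming is an assumption; expect 12 get_response traces, one per test prompt.
get_response_traces = [p for p in tracy_files if "get_response" in p.name]
print(f"get_response traces: {len(get_response_traces)}")
```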
3. Explore: Evaluation Trace¶
The Trace Viewer is an experimental feature. For optimal results you should:
- Wait till the batch evaluation process completes.
- Click on a `.tracy` file to launch the Trace Viewer (see figure above).
- You may need to try this a few times till the viewer launches.

To save time in-venue, we recommend completing these tasks later, at home.
OPTIONAL: Explore the .tracy files with Trace Viewer
The Prompty runtime generates `.tracy` files (underlying JSON format) that capture the execution trace from prompt (input) to response (output). This section explains how you can use the traces to view or debug workflows.
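Because a `.tracy` file is JSON underneath, you can also peek at one directly in code before (or instead of) using the Trace Viewer. The trace schema isn't documented here, so this sketch only prints the top-level structure:

```python
# Sketch: inspect the raw JSON inside a .tracy trace file.
# Assumes at least one trace exists under api/.runs (adjust the path if needed).
import json
from pathlib import Path

trace_path = next(Path("api/.runs").glob("**/*.tracy"))
with open(trace_path, encoding="utf-8") as f:
    trace = json.load(f)

# The schema is undocumented here, so just print enough to get oriented.
if isinstance(trace, dict):
    print(f"{trace_path.name}: top-level keys = {list(trace.keys())}")
else:
    print(f"{trace_path.name}: {type(trace).__name__} with {len(trace)} entries")
```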
- Observe the Trace View
    - You should see a waterfall view on the left, and a detail view on the right.
    - The waterfall view shows the sequence of steps in the orchestrated flow.
    - "Prompty" icons show asset execution (load-prepare-run)
    - "Open AI" icons show model invocations (chat, embeddings)
    - Cube icons represent Python function invocations (code)
    - Click an item on the left to see its detailed trace on the right.
- Explore the `get_response` root trace
    - Click the `get_response` node on the left
    - Observe the trace details on the right
    - You should see:
        - The Input query (question, customerId, chat_history)
        - The Output response (question, answer)
        - Total time taken for execution
        - Total tokens used in execution
        - Token split between prompt and completion
- Explore the Prompty `execute` trace
    - Look at the overall performance in time and tokens
    - Select a Prompty segment in the waterfall trace
    - Understand what the `load` operation does
    - Understand what the `prepare` operation does
    - Understand what the `run` operation does
- Explore the Prompty tracer code
    - Open the `src/api/chat_request.py` file
    - Look for this code. It sets up the Tracer to log to console and JSON formats.

      ```python
      import prompty
      import prompty.azure
      from prompty.tracer import trace, Tracer, console_tracer, PromptyTracer

      # add console and json tracer:
      # this only has to be done once
      # at application startup
      Tracer.add("console", console_tracer)
      json_tracer = PromptyTracer()
      Tracer.add("PromptyTracer", json_tracer.tracer)
      ```

    - Look for this code. The `@trace` decorator identifies the functions that will be traced in this run.

      ```python
      # (os, CosmosClient, and DefaultAzureCredential are imported earlier
      #  in chat_request.py; this is an excerpt, not the whole file)
      @trace
      def get_customer(customerId: str) -> str:
          try:
              url = os.environ["COSMOS_ENDPOINT"]
              client = CosmosClient(url=url, credential=DefaultAzureCredential())
              db = client.get_database_client("contoso-outdoor")
              container = db.get_container_client("customers")
              response = container.read_item(item=str(customerId), partition_key=str(customerId))
              response["orders"] = response["orders"][:2]
              return response
          except Exception as e:
              print(f"Error retrieving customer: {e}")
              return None
      ```

    - See the sketch after this list for a self-contained example that combines this tracer setup with a traced helper function.
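To tie the pieces above together, here is a minimal, self-contained sketch that reuses the tracer setup from `chat_request.py`, puts `@trace` on a hypothetical helper, and executes a placeholder Prompty asset. The asset name, helper, and inputs are assumptions for illustration; executing a Prompty asset is what produces the load-prepare-run spans you explored in the waterfall view.

```python
# Sketch: combine the tracer setup with your own traced helper.
# "my_helper", "chat.prompty", and the inputs below are placeholders.
import prompty
import prompty.azure
from prompty.tracer import trace, Tracer, console_tracer, PromptyTracer

# Register tracers once at startup: console output plus .tracy JSON files
Tracer.add("console", console_tracer)
json_tracer = PromptyTracer()
Tracer.add("PromptyTracer", json_tracer.tracer)

@trace
def my_helper(question: str) -> str:
    # Functions decorated with @trace appear as their own spans in the
    # waterfall view, with inputs and outputs captured.
    return question.strip()

@trace
def answer(question: str) -> str:
    cleaned = my_helper(question)
    # Executing the Prompty asset covers the load / prepare / run phases
    return prompty.execute("chat.prompty", inputs={"question": cleaned})

if __name__ == "__main__":
    print(answer("What tent should I buy for winter camping?"))
```

Run with a valid Prompty asset and model configuration, this should write a new trace file under the `.runs` folder that you can open in the Trace Viewer, just like the notebook's traces.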
Want to learn more about Prompty Tracing? Explore the documentation to learn how to configure your application for traces, and how to view and publish traces for debugging and observability.
CONGRATULATIONS. You ran a batch evaluation on the chat AI application responses!