
4.3 Run Batch Evaluation

In the previous section, we assessed a single answer for a single metric, running one Prompty at a time. In reality, we will need to run assessments automatically across a large set of test inputs, with all custom evaluators, before we can judge if the application is ready for production use. In this exercise, we'll run a batch evaluation on our Contoso Chat application, using a Jupyter notebook.
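
Before running the notebook, it helps to see the general shape of such a batch evaluation loop. The sketch below is illustrative only: it assumes a JSONL test file, the get_response function from src/api/chat_request.py, and one .prompty asset per custom evaluator; the file paths, input names, and evaluator signatures are assumptions, and the actual notebook may differ in its details.

    import json
    import prompty
    import prompty.azure  # registers the Azure OpenAI invoker

    from chat_request import get_response  # chat flow under test (assumed import)

    # Names and paths below are illustrative assumptions, not the notebook's exact code.
    EVALUATORS = ["groundedness", "fluency", "coherence", "relevance"]

    results = []
    with open("evaluators/data.jsonl") as f:          # one test prompt per line (assumed path)
        for line in f:
            row = json.loads(line)
            response = get_response(row["customerId"], row["question"], [])  # run the chat flow

            scores = {}
            for metric in EVALUATORS:
                # Each metric is an AI-assisted judge defined as its own .prompty asset.
                scores[metric] = prompty.execute(
                    f"evaluators/{metric}.prompty",
                    inputs={
                        "question": row["question"],
                        "answer": response["answer"],
                        "context": response.get("context", ""),
                    },
                )
            results.append({"question": row["question"], **scores})

This is the pattern the notebook automates: every test prompt is answered once and then scored by each custom evaluator, which is what produces the trace pattern described in the next step.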


1. Run Evaluation Notebook

Navigate to the src/api folder in Visual Studio Code.

  • Click: evaluate-chat-flow.ipynb - this opens a Jupyter notebook
  • Click: Select Kernel - choose "Python Environments" - pick the recommended Python 3.11.x
  • Click: Run all - this starts the multi-step evaluation flow.

You may see a pop-up alert saying "The notebook is not displayed in the notebook editor because it is very large", with two options to proceed. Select the default Open Anyway option.


2. Watch Evaluation Runs

One of the benefits of using Prompty is the built-in Tracer feature that captures execution traces for the entire workflow. These trace runs are stored as .tracy files in the api/.runs/ folder, which you can see in the Explorer sidebar.

  • Keep the Explorer sidebar open while the evaluation notebook runs.
  • You see: get_response traces when our chat application is running
  • You see: groundedness traces when each response's groundedness is evaluated
  • You see: similar traces for the fluency, coherence, and relevance evaluators

These are live trace runs! Observe the following:

  • There are 12 get_response traces → one for each of the 12 prompts in our test file.
  • There are 4 evaluation traces for each response → one for each of the 4 custom evaluators defined.
  • Clicking a .tracy file opens a Trace Viewer window → it should look like this:

(Figure: Trace Viewer window showing an evaluation trace)

The trace viewer feature is experimental. You may need to click, wait, and retry a few times before the viewer loads the file successfully. Skip this section and revisit it at home if time is limited.
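
If you want to sanity-check those counts once the run finishes, the short sketch below tallies the .tracy files in the runs folder by the trace name it finds in each file name. Both the folder path and the naming convention (each file name containing its root function name, such as get_response or groundedness) are assumptions; adjust the matching if your files are named differently.

    from collections import Counter
    from pathlib import Path

    runs_dir = Path("src/api/.runs")   # assumed location of the trace output folder
    names = ["get_response", "groundedness", "fluency", "coherence", "relevance"]

    counts = Counter()
    for trace_file in runs_dir.glob("*.tracy"):
        for name in names:
            if name in trace_file.name:   # assumes the root trace name appears in the file name
                counts[name] += 1
                break

    print(counts)   # expect 12 get_response traces and 12 traces per evaluator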


3. Explore: Evaluation Trace

The Trace Viewer is an experimental feature. For optimal results you should:

  • Wait until the batch evaluation process completes.
  • Click on a .tracy file to launch the trace viewer (see figure above).
  • You may need to retry a few times until the viewer launches.

To save time in-venue, we recommend completing these tasks later, at home.

OPTIONAL: Explore the .tracy files with Trace Viewer

The Prompty runtime generates .tracy files (underlying JSON format) that capture the execution trace from prompt (input) to response (output). This section explains how you can use the traces to view or debug workflows.
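
Because a .tracy file is JSON underneath, you can also peek at it directly without the Trace Viewer. The sketch below simply opens the most recently written trace file and prints its top-level structure; it assumes nothing about the trace schema beyond it being valid JSON, and the runs folder path is an assumption.

    import json
    from pathlib import Path

    runs_dir = Path("src/api/.runs")   # assumed location of the trace output folder
    latest = max(runs_dir.glob("*.tracy"), key=lambda p: p.stat().st_mtime)

    with latest.open() as f:
        trace = json.load(f)

    # Print just enough to get oriented before opening the Trace Viewer.
    print(latest.name)
    print(list(trace) if isinstance(trace, dict) else f"list of {len(trace)} entries")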

  1. Observe the Trace View

    • You should see a waterfall view on the left, and a detail view on the right.
    • The waterfall view shows the sequence of steps in the orchestrated flow.
    • "Prompty" icons show asset execution (load-prepare-run)
    • "Open AI" icons show model invocations (chat, embeddings)
    • Cube icons represent Python function invocations (code)
    • Click an item on the left to see detailed trace on the right.
  2. Explore the get_response root trace

    • Click the get_response node on the left
    • Observe the trace details on the right
    • You should see:
      • The Input query (question, customerId, chat_history)
      • The Output response (question, answer)
      • Total time taken for execution
      • Total tokens used in execution
      • Token split between prompt and completion
  3. Explore the Prompty execute trace

    • Look at the overall performance in time and tokens
    • Select a Prompty segment in the waterfall trace
    • Understand what the load operation does
    • Understand what the prepare operation does
    • Understand what the run operation does (the sketch after this list shows how these load-prepare-run segments map to code)
  4. Explore the Prompty tracer code

    • Open the src/api/chat_request.py file
    • Look for this code. It sets up Tracer to log to console and JSON formats.
          import prompty
          import prompty.azure
          from prompty.tracer import trace, Tracer, console_tracer, PromptyTracer

          # add console and json tracer:
          # this only has to be done once
          # at application startup
          Tracer.add("console", console_tracer)
          json_tracer = PromptyTracer()
          Tracer.add("PromptyTracer", json_tracer.tracer)

    • Look for this code. The @trace decorator identifies the functions that will be traced in this run.
          # supporting imports for this fragment (already present in chat_request.py)
          import os

          from azure.cosmos import CosmosClient
          from azure.identity import DefaultAzureCredential

          @trace
          def get_customer(customerId: str) -> str:
              try:
                  # Connect to Cosmos DB using the endpoint from the environment
                  # and the developer's Azure credential.
                  url = os.environ["COSMOS_ENDPOINT"]
                  client = CosmosClient(url=url, credential=DefaultAzureCredential())
                  db = client.get_database_client("contoso-outdoor")
                  container = db.get_container_client("customers")
                  # Read the customer record and keep only the first two orders.
                  response = container.read_item(item=str(customerId), partition_key=str(customerId))
                  response["orders"] = response["orders"][:2]
                  return response
              except Exception as e:
                  print(f"Error retrieving customer: {e}")
                  return None

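
To connect the trace segments from step 3 back to code: each "Prompty" node in the waterfall corresponds to a single Prompty asset execution - in this application, a prompty.execute call - which loads the .prompty asset, prepares the prompt with the supplied inputs, and runs the model. Those are the load, prepare, and run segments nested under it. The call below is a hedged illustration of how such an execution might look inside get_response; the input names are assumptions based on the trace fields described earlier, not a verbatim copy of chat_request.py.

    import prompty
    import prompty.azure  # registers the Azure OpenAI invoker used by the .prompty asset

    # One prompty.execute call appears in the Trace Viewer as a Prompty node with
    # load (parse the asset), prepare (render the prompt with these inputs), and
    # run (invoke the model) segments beneath it.
    # The input names below are illustrative assumptions.
    result = prompty.execute(
        "chat.prompty",
        inputs={
            "question": "What tent do you recommend for camping in the rain?",
            "customer": {"firstName": "Jane", "orders": []},   # illustrative customer record
            "documentation": "...retrieved product context...",
        },
    )
    print(result)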

Want to learn more about Prompty Tracing? Explore the documentation to learn how to configure your application for traces, and how to view and publish traces for debugging and observability.


CONGRATULATIONS. You ran a batch evaluation on the chat AI application responses!