
4.3 Run Batch Evaluation

In the previous section, we assessed a single answer for a single metric, running one Prompty at a time. In practice, we need to run assessments automatically across a large set of test inputs, using all of our custom evaluators, before we can judge whether the application is ready for production use. In this exercise, we'll run a batch evaluation on our Contoso Chat application, using a Jupyter notebook.
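The notebook automates this loop for us, but the idea is simple: for every test input, run the chat application, then score the response with each evaluator. The sketch below illustrates the pattern only; it assumes the `prompty` Python package's load/execute API, and the file paths, evaluator names, and input fields are hypothetical placeholders rather than the notebook's actual code.

```python
# Minimal sketch of what the evaluation notebook automates: score every test
# input against every custom evaluator. Paths and input fields are illustrative.
import json
import prompty

EVALUATORS = ["groundedness", "fluency", "coherence", "relevance"]

# Hypothetical test set: one JSON object per line with question/answer/context.
with open("evaluators/data.jsonl") as f:
    test_cases = [json.loads(line) for line in f]

results = []
for case in test_cases:
    scores = {}
    for name in EVALUATORS:
        evaluator = prompty.load(f"evaluators/{name}.prompty")  # hypothetical asset path
        scores[name] = prompty.execute(
            evaluator,
            inputs={
                "question": case["question"],
                "answer": case["answer"],
                "context": case["context"],
            },
        )
    results.append({"question": case["question"], **scores})
```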


1. Run Evaluation Notebook

Navigate to the src/api folder in Visual Studio Code.

  • Click: evaluate-chat-flow.ipynb - this opens a Jupyter notebook in the editor
  • Click: Select Kernel - choose "Python Environments" - pick the recommended Python 3.11.x
  • Click: Run all - this kickstarts the multi-step evaluation flow.

You may see a pop-up alert saying "The notebook is not displayed in the notebook editor because it is very large", with two options to proceed. Select the default Open Anyway option.
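For the workshop, running the notebook in the editor is all you need. If you later want to run it headlessly (for example, in CI), one option is sketched below using the nbformat and nbclient packages; it assumes you launch it from the repository root and that the kernel environment has the notebook's dependencies installed.

```python
# Optional: execute the evaluation notebook outside the editor.
import nbformat
from nbclient import NotebookClient

# Load the notebook, execute all cells, and save the executed copy for review.
nb = nbformat.read("src/api/evaluate-chat-flow.ipynb", as_version=4)
client = NotebookClient(nb, timeout=600, kernel_name="python3")
client.execute()
nbformat.write(nb, "src/api/evaluate-chat-flow.executed.ipynb")
```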


2. Watch Evaluation Runs

One of the benefits of using Prompty is the built-in Tracer feature that captures execution traces for the entire workflow. These trace runs are stored in .tracy files in the api/.runs/ folder as shown in the figure below.

  • Keep the Explorer sidebar open while the evaluation notebook runs.
  • You see: get_response traces when our chat application is running
  • You see: groundedness traces when each response's groundedness is evaluated
  • You see: similar traces for fluency, coherence, and relevance

These are live trace runs, so you should be able to make the following observations once the run completes:

  • There will be 12 get_response traces, corresponding to the 12 chat prompts executed by our chat AI.
  • For each of these responses, you should see 4 more traces, one for each of the 4 custom evaluators we have defined.
  • Clicking a .tracy file should open the Trace Viewer window, allowing you to dive into the data visually.

Figure: Evaluation trace runs in the Explorer sidebar
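If you want to confirm these counts without opening each file, you can tally the trace files from a terminal or a notebook cell. The sketch below assumes the .runs folder location used in this lesson and that each file name begins with the traced function's name; adjust the path and parsing if your layout differs.

```python
# Count .tracy files in the run folder, grouped by trace name.
from collections import Counter
from pathlib import Path

runs = Path("src/api/.runs")  # adjust if you run this from a different folder
counts = Counter(f.name.split(".")[0] for f in runs.glob("*.tracy"))

for name, count in sorted(counts.items()):
    print(f"{name}: {count} trace(s)")
# Expect roughly 12 get_response traces plus 12 traces per evaluator.
```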


3. Explore: Evaluation Trace

OPTIONAL: Explore .tracy files with Trace Viewer

The Prompty runtime generates .tracy files (underlying JSON format) that capture the execution trace from prompt (input) to response (output). This section explains how you can use the traces to view or debug workflows.
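Because a .tracy file is JSON under the hood, you can also inspect one directly if the Trace Viewer is slow to load. The exact schema may vary between Prompty versions, so the sketch below simply pretty-prints whatever the file contains.

```python
# Peek inside a .tracy trace file (JSON) without the Trace Viewer.
import json
from pathlib import Path

trace_file = next(Path("src/api/.runs").glob("*.tracy"))  # pick any trace run
with open(trace_file) as f:
    trace = json.load(f)

print(f"Inspecting: {trace_file.name}")
print(json.dumps(trace, indent=2)[:2000])  # preview the first ~2000 characters
```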

To explore the evaluation trace:

  • Wait until the batch evaluation process completes.
  • Click on a .tracy file to launch the Trace Viewer (see figure above).

The trace viewer feature is experimental. You may need to click, wait, and retry a few times before the viewer loads the file successfully. Skip this section and revisit it at home if time is limited.

  1. Observe the Trace View

    • You should see a waterfall view on the left, and a detail view on the right.
    • The waterfall view shows the sequence of steps in the orchestrated flow.
    • "Prompty" icons show asset execution (load-prepare-run)
    • "Open AI" icons show model invocations (chat, embeddings)
    • Cube icons represent Python function invocations (code)
    • Click an item on the left to see detailed trace on the right.
  2. Explore the get_response root trace

    • Click the get_response node on the left
    • Observe the trace details on the right
    • You should see:
      • The Input query (question, customerId, chat_history)
      • The Output response (question, answer)
      • Total time taken for execution
      • Total tokens used in execution
      • Token split between prompt and completion
  3. Explore a Prompty execution trace

  4. Explore the Prompty tracer code

Want to learn more about Prompty Tracing? Explore the documentation to learn how to configure your application for traces, and how to view and publish traces for debugging and observability.
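As a preview of what that configuration looks like, the sketch below is based on the tracing setup shown in the Prompty documentation at the time of writing (verify the import names and Tracer API against the current docs before using it). Any function decorated with @trace, such as the get_response orchestrator, appears as a span in the resulting .tracy files.

```python
# Sketch: enable Prompty tracing in application code (check current Prompty docs).
from prompty.tracer import trace, Tracer, console_tracer, PromptyTracer

# Log spans to the console and to .tracy JSON files (written under .runs/).
Tracer.add("console", console_tracer)
Tracer.add("PromptyTracer", PromptyTracer().tracer)

@trace
def get_response(question: str, customer_id: str, chat_history: list) -> dict:
    """Decorated functions show up as nodes in the Trace Viewer waterfall."""
    # ... orchestration logic elided in this sketch ...
    return {"question": question, "answer": "..."}
```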


CONGRATULATIONS. You ran a batch evaluation on the chat AI application responses!