
4.2 Understand Evaluators

The "scoring" task could be performed by a human, but this does not scale. Instead, we use AI-assisted evaluation by using one AI application ("evaluator") to grade the other ("chat"). And just like we used a chat.prompty to define our chat application, we can design evaluator.prompty instances that define the grading application - with a custom evaluator for each assessed metric.


ACTIVATE WORD WRAP: Press Alt-Z (Option-Z on Mac) to toggle word wrap. This makes the prompts in the .prompty files easier to read within the limited screen view.

1. View/Run all evaluators.

Use F5 (or Fn-F5 on some keyboards) in a Prompty file to run it. Alternatively, click the Play icon in the editor.

  1. Navigate to the src/api/evaluators/custom_evals folder in VS Code.
  2. Open each of the four .prompty files located there in the VS Code editor.
    • fluency.prompty
    • coherence.prompty
    • groundedness.prompty
    • relevance.prompty
  3. Run each file and observe the output from its Prompty execution.

    Run a Prompty file by clicking the play icon or pressing F5 on your keyboard

  4. Check: You see Prompty assets for Coherence, Fluency, Relevance, and Groundedness.

  5. Check: Running the Prompty assets gives scores between 1 and 5.

Let's understand how this works, taking one of these custom evaluators as an example.


2. View Coherence Prompty

  1. Open the file coherence.prompty and look at its structure

    1. You should see: system task is

      You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.

    2. You should see: inputs expected are

      • question = user input to the chat model
      • answer = response provided by the chat model
      • context = support knowledge that the chat model was given
    3. You should see: meta-prompt guidance for the task:

      Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer between one to five stars using the following rating scale:

      • One star: the answer completely lacks coherence
      • Two stars: the answer mostly lacks coherence
      • Three stars: the answer is partially coherent
      • Four stars: the answer is mostly coherent
      • Five stars: the answer has perfect coherency
    4. You should see: examples that provide guidance for the scoring.

      This rating value should always be an integer between 1 and 5. So the rating produced should be 1 or 2 or 3 or 4 or 5. (See the examples for question-answer-context inputs that reflect scores of 1 through 5.) The sketch after this list turns this input-output contract into code.
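
Putting these pieces together: the evaluator's contract is three string inputs in, one integer out. Here is a minimal sketch of that contract, again assuming the prompty Python package; the helper name and validation logic are ours, not from the repo:

    Python
    import prompty
    import prompty.azure  # assumed: registers the Azure OpenAI invoker

    def score_coherence(question: str, answer: str, context: str) -> int:
        """Hypothetical helper: run coherence.prompty and return its rating."""
        raw = prompty.execute(
            "coherence.prompty",
            inputs={"question": question, "answer": answer, "context": context},
        )
        rating = int(str(raw).strip())  # the system prompt promises a bare integer
        assert 1 <= rating <= 5, f"unexpected rating: {raw!r}"
        return rating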


3. Run Coherence Prompty

  1. You see: sample input for testing

    • question: What feeds all the fixtures in low voltage tracks instead of each light having a line-to-low voltage transformer?
    • answer: The main transformer is the object that feeds all the fixtures in low voltage tracks.
    • context: Track lighting, invented by Lightolier, was popular at one period of time because it was much easier to install than recessed lighting, and individual fixtures are decorative and can be easily aimed at a wall. It has regained some popularity recently in low-voltage tracks, which often look nothing like their predecessors because they do not have the safety issues that line-voltage systems have, and are therefore less bulky and more ornamental in themselves. A master transformer feeds all of the fixtures on the track or rod with 12 or 24 volts, instead of each light fixture having its own line-to-low voltage transformer. There are traditional spots and floods, as well as other small hanging fixtures. A modified version of this is cable lighting, where lights are hung from or clipped to bare metal cables under tension.
  2. Run the prompty file. You see output like this. This means the evaluator "assessed" this ANSWER as being very coherent (score=5).

    Bash
    2024-09-16 21:35:43.602 [info] Loading /workspaces/contoso-chat/.env
    2024-09-16 21:35:43.678 [info] Calling ...
    2024-09-16 21:35:44.488 [info] 5
    
  3. Observe: Recall that coherence is about how well the sentences fit together.

    • Read the question (input)
    • Read the answer (output)
    • Review the context (support knowledge)
    • Based on this review, do you agree with the Coherence assessment?
  4. Change Answer

    • Replace sample answer with: Lorem ipsum orci dictumst aliquam diam
    • Run the prompty again. How did the score change? (A scripted version of this comparison is sketched after this list.)
    • Undo the change, returning the prompty to its original state for the next step.
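
If you prefer to script this before-and-after experiment, here is a hedged sketch under the same prompty-package assumption; paste in the full sample context where abbreviated:

    Python
    import prompty
    import prompty.azure  # assumed: registers the Azure OpenAI invoker

    QUESTION = ("What feeds all the fixtures in low voltage tracks instead of "
                "each light having a line-to-low voltage transformer?")
    CONTEXT = "Track lighting, invented by Lightolier, ..."  # abbreviated: use the full sample context

    answers = {
        "original": ("The main transformer is the object that feeds all the "
                     "fixtures in low voltage tracks."),
        "scrambled": "Lorem ipsum orci dictumst aliquam diam",
    }

    for label, answer in answers.items():
        score = prompty.execute(
            "coherence.prompty",
            inputs={"question": QUESTION, "answer": answer, "context": CONTEXT},
        )
        print(label, score)  # expect the scrambled answer to score far lower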

Repeat this exercise for the other evaluators on your own (e.g., run groundedness.prompty and see if the responses reflect the knowledge provided in the support context). Use this to build your intuition for each metric and how it defines and assesses response quality.
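
To compare metrics quickly, you could also score a single question-answer-context triple against all four evaluators in one pass; a sketch under the same assumptions as above:

    Python
    import prompty
    import prompty.azure  # assumed: registers the Azure OpenAI invoker

    METRICS = ["fluency", "coherence", "groundedness", "relevance"]
    sample = {"question": "...", "answer": "...", "context": "..."}  # substitute your own triple

    for metric in METRICS:
        score = prompty.execute(
            f"src/api/evaluators/custom_evals/{metric}.prompty",
            inputs=sample,
        )
        print(f"{metric}: {score}")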

Note the several examples of answers given in the Prompty file, one for each star rating. This is an example of few-shot learning, a common technique used to guide AI models.
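
Schematically, few-shot prompting just means prepending worked examples to the real input so the model can imitate them. The sketch below is illustrative only (the "..." placeholders stand in for real example text) and does not reproduce the actual file contents:

    Python
    # Illustrative few-shot pattern, not the actual coherence.prompty contents.
    FEW_SHOT_EXAMPLES = [
        {"question": "...", "answer": "...", "context": "...", "stars": 1},
        {"question": "...", "answer": "...", "context": "...", "stars": 5},
    ]

    def build_prompt(question: str, answer: str, context: str) -> str:
        parts = []
        for ex in FEW_SHOT_EXAMPLES:
            parts.append(
                f"question: {ex['question']}\nanswer: {ex['answer']}\n"
                f"context: {ex['context']}\nstars: {ex['stars']}"
            )
        # The real input follows the worked examples, with "stars" left blank
        # for the model to complete.
        parts.append(f"question: {question}\nanswer: {answer}\ncontext: {context}\nstars:")
        return "\n\n".join(parts)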


CONGRATULATIONS. You just learned how to use custom quality evaluators with Prompty!