Empirical performance evaluations

Check out our evals cookbook to go straight to code examples.

Optimizing Claude to give you the highest possible accuracy on a task is an empirical science and a process of continuous improvement. Whether you are trying to determine if a change to your prompt has improved Claude's performance, testing different Claude models against each other, or assessing if your use case is ready for production, a well-designed evaluation system is critical for success.

In this guide, we'll walk you through the prompt development lifecycle and the different types of evaluations (evals) you can use, cover their pros and cons, and offer guidelines on choosing the best eval for your use case.

How to use evals

Evals should be an integral part of your entire production lifecycle when working with LLMs. They provide a quantitative measure of performance that allows you to track progress, identify issues, and make data-driven decisions. Here's how evals fit into the different stages of the production lifecycle:

  1. Prompt engineering: The prompt engineering process should begin with building a rigorous set of evals, not writing a prompt. These evals will serve as the foundation for measuring the effectiveness of your prompts and help you iterate and improve them over time.

  2. Development: As you develop your application or workflow with Claude, use the evals you designed during the prompt engineering phase to regularly test the performance of your prompts, even if the prompts themselves have not changed. Parts of the workflow outside and downstream of the prompt can inadvertently affect model performance. This will help you catch any issues early on and ensure that your workflows are performing as expected.

  3. Final testing: Before deploying your application or workflow to production, create at least one additional set of evals that you have not used during the development phase. This held-out set of evals will help you assess the true performance of your prompts and ensure that they have not been overfit to the evals used during development.

  4. Production: Once your application or workflow is in production, continue to use evals to monitor performance and identify any potential issues. You can also use evals to compare the performance of different Claude models or versions of your prompts to make data-driven decisions about updates and improvements.

By incorporating evals throughout the production lifecycle, you can ensure that your prompts are performing optimally and that your application or workflow is delivering the best possible results.

Parts of an eval

Evals typically have four parts:

  1. Input prompt: The prompt that is fed to the model. Claude generates a completion (a.k.a. output) based on this prompt. Often, when designing evals, the input column will contain a set of variable inputs that get fed into a prompt template at test time.

  2. Output: The text generated by running the input prompt through the model being evaluated.

  3. Golden answer: The correct answer to which the model output is compared. The golden answer could be a mandatory exact match or an example of a perfect answer meant to give a grader (human or LLM) a point of comparison for scoring.

  4. Score: A numerical value, generated by one of the grading methods discussed below, that represents how well the model performed on the question.
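The four parts above can be sketched as a plain data structure. This is a minimal illustration, not a required schema; the field names, prompt template, and questions are all made up for the example:

```python
# A minimal eval set: each item pairs the variable inputs for a prompt
# template with a golden answer used at grading time. Field names and
# questions are illustrative only.
PROMPT_TEMPLATE = (
    "What is the capital of {country}? Answer with the city name only."
)

eval_set = [
    {"input": {"country": "France"}, "golden_answer": "Paris"},
    {"input": {"country": "Japan"}, "golden_answer": "Tokyo"},
]

def build_prompt(item: dict) -> str:
    """Render the input prompt by filling the template with the item's variables."""
    return PROMPT_TEMPLATE.format(**item["input"])
```

Running each rendered prompt through the model yields the output, which a grader then scores against the golden answer.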

Eval grading methods

There are two aspects of evals that can be time-consuming and expensive: writing the question and golden answer pairs, and grading. While writing questions and golden answers is typically a one-time fixed cost, grading is a cost you will incur every time you re-run your eval, which you will likely do frequently. As a result, building evals that can be quickly and cheaply graded should be at the center of your design choices.

There are three common ways to grade evals:

  1. Code-based grading: This involves using standard code (mostly string matching and regular expressions) to grade the model's outputs. Common versions include checking for an exact match against an answer or checking that a string contains some key phrase(s). This is the best grading method if you can design an eval that allows for it, as it is fast and highly reliable. However, many evaluations do not allow for this style of grading.
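Both common versions of code-based grading fit in a few lines. The sketch below assumes a 0/1 scoring convention, which is a design choice rather than a requirement:

```python
import re

def grade_exact_match(output: str, golden_answer: str) -> int:
    """Score 1 if the output equals the golden answer (ignoring case and
    surrounding whitespace), else 0."""
    return int(output.strip().lower() == golden_answer.strip().lower())

def grade_contains(output: str, key_phrases: list[str]) -> int:
    """Score 1 if every key phrase appears somewhere in the output, else 0."""
    return int(all(
        re.search(re.escape(phrase), output, re.IGNORECASE)
        for phrase in key_phrases
    ))
```

For example, `grade_exact_match(" Paris\n", "paris")` scores 1, while `grade_exact_match("Paris, France", "Paris")` scores 0, which is exactly why exact match can be too strict for answers with acceptable variations.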

  2. Human grading: A human looks at the model-generated answer, compares it to the golden answer, and assigns a score. This is the most capable grading method, as it can be used on almost any task, but it is also extremely slow and expensive, particularly if you've built a large eval. Avoid designing evals that require human grading whenever you can.

  3. Model-based grading: Claude is highly capable of grading itself and can be used to grade a wide variety of tasks that might have historically required humans, such as analysis of tone in creative writing or accuracy in free-form question answering. You can do this by writing a grader prompt for Claude.
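A model-based grader needs two pieces: a grader prompt and a way to turn the grader's response into a score. The sketch below shows one possible shape; the prompt wording and the one-word "correct"/"incorrect" convention are illustrative assumptions, and the actual call sending the grader prompt to Claude (e.g. via the Messages API) is omitted:

```python
# Sketch of model-based grading: build a grader prompt and parse the grader's
# one-word verdict into a score. The template wording and the
# "correct"/"incorrect" convention are illustrative choices, not a fixed API.
GRADER_TEMPLATE = """You will be shown a question, a golden answer, and an \
answer produced by the model under evaluation.

<question>{question}</question>
<golden_answer>{golden_answer}</golden_answer>
<model_answer>{model_answer}</model_answer>

Is the model's answer consistent with the golden answer? Reply with exactly \
one word: correct or incorrect."""

def build_grader_prompt(question: str, golden_answer: str, model_answer: str) -> str:
    """Fill the grader template with one eval item's contents."""
    return GRADER_TEMPLATE.format(
        question=question, golden_answer=golden_answer, model_answer=model_answer
    )

def parse_grade(grader_output: str) -> int:
    """Map the grader's verdict to a score: 1 for 'correct', 0 otherwise."""
    return int(grader_output.strip().lower().startswith("correct"))
```

As the next section's "Test model-based grading" advice suggests, read a sample of the grader's verdicts yourself before trusting scores produced this way.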

Types of evaluations

There are several types of evaluations you can use to measure Claude's performance on a task. Each type has its own strengths and weaknesses.

| Eval type | Description | Pros | Cons |
| --- | --- | --- | --- |
| Multiple choice question (MCQ) | Closed-form questions with multiple answer options, at least one of which is correct | - Easy to automate<br>- Assesses general knowledge of a topic<br>- Clear answer key<br>- Easy to know what accurate looks like | - Potential training leakage if the test is public<br>- Limited in assessing more complex or open-ended tasks |
| Exact match (EM) | Checks whether the model's answer is exactly the same string as the correct answer | - Easy to automate<br>- High precision in assessing specific knowledge or tasks<br>- Easy to know what accurate looks like | - Limited in assessing more complex or open-ended tasks<br>- May not capture variations in correct answers |
| String match | Checks whether the model's answer contains the answer string | - Easy to automate<br>- Assesses the presence of specific information in the model's output | - May not capture the full context or meaning of the model's response<br>- Can result in false positives or negatives |
| Open answer (OA) | Open-ended questions that can have multiple possible solutions or require multi-step processes to assess | - Great for assessing advanced knowledge, tacit knowledge, or qualitative open-ended performance<br>- Can be graded by humans or models | - More difficult to automate<br>- Requires a clear rubric for grading<br>- Model-based grading may be less accurate than human grading |
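Of the types above, MCQ is the easiest to automate end to end: format the options into the prompt, then grade the returned letter by exact match. A sketch (the question and answer-letter convention are made up for illustration):

```python
# Sketch of an automatable MCQ eval item. The question content and the
# "(A)/(B)/..." formatting convention are illustrative only.
mcq_item = {
    "question": "Which HTTP status code means 'Not Found'?",
    "options": {"A": "200", "B": "301", "C": "404", "D": "500"},
    "golden_answer": "C",
}

def format_mcq(item: dict) -> str:
    """Render the question and its lettered options as a single prompt."""
    lines = [item["question"]]
    lines += [f"({letter}) {text}" for letter, text in sorted(item["options"].items())]
    lines.append("Answer with the letter only.")
    return "\n".join(lines)

def grade_mcq(model_output: str, golden_letter: str) -> int:
    """Score 1 if the first letter of the output matches the answer key."""
    cleaned = model_output.strip().lstrip("(").upper()
    return int(cleaned[:1] == golden_letter)
```

This pattern is also the basis of the "reformat into multiple choice" tactic mentioned in the best practices below: an open-ended question often becomes automatable once its acceptable answers are enumerated as options.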

Best practices for designing evals

When designing evals for your specific use case, keep the following best practices in mind:

  1. Task-specific evals: Make your evals specific to your task whenever possible, and try to have the distribution in your eval represent the real-life distribution of questions and question difficulties.

  2. Test model-based grading: The only way to know if a model-based grader can do a good job grading your task is to try it out and read some samples to see if your task is a good candidate.

  3. Automate when possible: Often, clever design can make an eval automatable. Try to structure questions in a way that allows for automated grading while still staying true to the task. Reformatting questions into multiple choice is a common tactic.

  4. Prioritize volume over quality: In general, prefer a larger number of questions of slightly lower quality over a very small number of meticulously crafted, high-quality questions.

  5. Use the evals cookbook: Our evals cookbook provides implemented examples of various types of human- and model-graded evals, including guidance and code you can copy.
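Since you'll re-run evals at every lifecycle stage, it can help to separate the runner from the model and the grader so each can be swapped out. A minimal sketch; `get_completion` stands in for whatever function sends a prompt to the model, and the stub below exists only so the example runs without an API call:

```python
# Sketch of a reusable eval runner, decoupled from the model and the grader.
def run_eval(eval_set: list[dict], get_completion, grade) -> float:
    """Grade every item and return the average score (accuracy for 0/1 grades)."""
    scores = [
        grade(get_completion(item["prompt"]), item["golden_answer"])
        for item in eval_set
    ]
    return sum(scores) / len(scores)

def exact_match(output: str, golden: str) -> int:
    return int(output.strip().lower() == golden.strip().lower())

# Illustrative stub: pretend the model always answers "Paris".
stub_model = lambda prompt: "Paris"

eval_set = [
    {"prompt": "Capital of France? City only.", "golden_answer": "Paris"},
    {"prompt": "Capital of Japan? City only.", "golden_answer": "Tokyo"},
]
accuracy = run_eval(eval_set, stub_model, exact_match)  # 0.5 with this stub
```

Swapping `stub_model` for a real model call, or `exact_match` for a model-based grader, reuses the same harness across prompt engineering, development, final testing, and production.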

By following these best practices and selecting the appropriate eval type for your use case, you can effectively measure Claude's performance and make data-driven decisions to improve your prompts and workflows.