Define your success criteria

Building a successful LLM-based application starts with clearly defining your success criteria. How will you know when your application is good enough to publish?

Having clear success criteria ensures that your prompt engineering & optimization efforts are focused on achieving specific, measurable goals.

Building strong criteria

Good success criteria are:

Specific: Clearly define what you want to achieve. Instead of “good performance,” specify “accurate sentiment classification.”
Measurable: Use quantitative metrics or well-defined qualitative scales. Numbers provide clarity and scalability, but qualitative measures can be valuable if consistently applied along with quantitative measures.
- Even “hazy” topics such as ethics and safety can be quantified:
  Safety criteria
  Bad: Safe outputs
  Good: Less than 0.1% of outputs out of 10,000 trials flagged for toxicity by our content filter.
Quantitative metrics:
- Task-specific: F1 score, BLEU score, perplexity
- Generic: Accuracy, precision, recall
- Operational: Response time (ms), uptime (%)
Quantitative methods:
- A/B testing: Compare performance against a baseline model or earlier version.
- User feedback: Implicit measures like task completion rates.
- Edge case analysis: Percentage of edge cases handled without errors.
Qualitative scales:
- Likert scales: “Rate coherence from 1 (nonsensical) to 5 (perfectly logical)”
- Expert rubrics: Linguists rating translation quality on defined criteria
Achievable: Base your targets on industry benchmarks, prior experiments, AI research, or expert knowledge. Your success metrics should be realistic given current frontier model capabilities.
Relevant: Align your criteria with your application’s purpose and user needs. Strong citation accuracy might be critical for medical apps but less so for casual chatbots.
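
As a rough illustration of how the "Good" safety criterion above could be turned into a check, the sketch below computes the share of outputs flagged by a content filter and compares it to the 0.1% threshold. The function names and the sample results are hypothetical; the filter itself is assumed to exist elsewhere and to return one boolean per trial.

```python
def flagged_rate(flags):
    """Fraction of outputs flagged by a content filter (one bool per trial)."""
    flags = list(flags)
    if not flags:
        raise ValueError("need at least one trial")
    return sum(flags) / len(flags)


def meets_safety_criterion(flags, max_rate=0.001):
    """True if strictly less than max_rate (default 0.1%) of outputs were flagged."""
    return flagged_rate(flags) < max_rate


# Hypothetical results: 10,000 trials, 7 of them flagged for toxicity.
flags = [True] * 7 + [False] * 9993
print(flagged_rate(flags))            # 0.0007
print(meets_safety_criterion(flags))  # True
```

Keeping the threshold as a parameter makes it easy to tighten the target later without changing the measurement.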

Example task fidelity criteria for sentiment analysis

	Criteria
Bad	The model should classify sentiments well
Good	Our sentiment analysis model should achieve an F1 score of at least 0.85 (Measurable, Specific) on a held-out test set* of 10,000 diverse Twitter posts (Relevant), which is a 5% improvement over our current baseline (Achievable).

*More on held-out test sets in the next section
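
To make the F1 target concrete, here is a minimal sketch of how it might be checked on a labeled test set. It assumes three sentiment labels and macro-averaged F1 (the criterion above does not say which averaging to use), and the tiny label lists are made up for illustration; a real check would run on the full held-out set.

```python
def f1_for_label(y_true, y_pred, label):
    """F1 score for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no correct positives: F1 is 0 by convention
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the labels present in y_true."""
    labels = sorted(set(y_true))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


def meets_f1_target(y_true, y_pred, target=0.85):
    return macro_f1(y_true, y_pred) >= target


# Hypothetical toy data; one "neg" post is misclassified as "neu".
y_true = ["pos", "pos", "neg", "neg", "neu", "neu"]
y_pred = ["pos", "pos", "neg", "neu", "neu", "neu"]
print(round(macro_f1(y_true, y_pred), 3))  # 0.822
print(meets_f1_target(y_true, y_pred))     # False
```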

Common success criteria to consider

Here are some criteria that might be important for your use case. This list is non-exhaustive.

Task fidelity

Consistency

Relevance and coherence

Tone and style

Privacy preservation

Context utilization

Latency

Price

Most use cases will need multidimensional evaluation along several success criteria.

Example multidimensional criteria for sentiment analysis

	Criteria
Bad	The model should classify sentiments well
Good	On a held-out test set of 10,000 diverse Twitter posts, our sentiment analysis model should achieve:
- an F1 score of at least 0.85
- 99.5% of outputs are non-toxic
- 90% of errors would cause inconvenience, not egregious error*
- 95% of responses returned in under 200 ms

*In reality, we would also define what “inconvenience” and “egregious” mean.
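
One way to track several criteria at once is to evaluate each as its own pass/fail check. The sketch below assumes the individual metrics have already been measured; `EvalResults`, its field names, and the sample numbers are hypothetical, and the latency target is interpreted as a nearest-rank 95th percentile below 200 ms.

```python
import math
from dataclasses import dataclass


@dataclass
class EvalResults:
    f1: float
    non_toxic_rate: float     # share of outputs not flagged as toxic
    minor_error_share: float  # share of errors judged "inconvenience" only
    latencies_ms: list        # one response time per request


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def check_criteria(r):
    """Return a pass/fail flag per criterion so failures are easy to spot."""
    return {
        "f1 >= 0.85": r.f1 >= 0.85,
        "non-toxic >= 99.5%": r.non_toxic_rate >= 0.995,
        "minor errors >= 90%": r.minor_error_share >= 0.90,
        "p95 latency < 200 ms": percentile(r.latencies_ms, 95) < 200,
    }


# Hypothetical measurements: everything passes except the error-severity target.
results = EvalResults(
    f1=0.87,
    non_toxic_rate=0.997,
    minor_error_share=0.88,
    latencies_ms=[120] * 96 + [250] * 4,
)
report = check_criteria(results)
for name, passed in report.items():
    print(f"{name}: {'pass' if passed else 'FAIL'}")
print(all(report.values()))  # False
```

Reporting each criterion separately, rather than a single combined score, shows which dimension needs work.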

Next steps

Brainstorm criteria

Brainstorm success criteria for your use case with Claude on claude.ai.

Tip: Drop this page into the chat as guidance for Claude!

Design evaluations

Learn to build strong test sets to gauge Claude’s performance against your criteria.
