Example edge cases
Task fidelity (sentiment analysis) - exact match evaluation
Consistency (FAQ bot) - cosine similarity evaluation
Relevance and coherence (summarization) - ROUGE-L evaluation
Tone and style (customer service) - LLM-based Likert scale
Privacy preservation (medical chatbot) - LLM-based binary classification
Context utilization (conversation assistant) - LLM-based ordinal scale
output == golden_answer
key_phrase in output
Example: LLM-based grading