References
- Arxiv - Evaluate Agent with Agent
- GitHub - Eval-Assist (IBM) - LLM as a Judge
- BLEU - ROUGE score
- Giskard - LLM as a judge
- Arxiv - Lessons from red teaming 100 Generative AI Products
- AWS Blog - RAG evaluation with LLM as a judge
- GitHub - RAGAS
- DeepLearningAI - Safe and reliable AI via guardrails
- LinkedIn - Evaluation
- Keynote - Evaluating LLM Models for Production Systems: Methods and Practices
- A wide variety of tasks is typical of LLM usage patterns in every business
- LLM evaluation
- MEASURING MASSIVE MULTITASK LANGUAGE UNDERSTANDING
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models
- TrustLLM
- TrustScore
- Reasoning tasks require specific evaluation (e.g. ConceptARC)
- Multi-turn tasks require specific evaluation (e.g. LMRL Gym)
- Instruction following
- Embedding vs LLM - Generative Representational Instruction Tuning
- LLM Embedding evaluation - MTEB (Massive Text Embedding Benchmark), Huggingface leaderboard
- In Context Learning (ICL) evaluation: HellaSwag, OpenICL article
- LLM as a judge: MT-bench & Chatbot Arena
- RAG Evaluation:
- Hallucination Evaluation - HaluEval
- Pairwise comparison - Arenas:
- Hardness Evaluation:
- EleutherAI/lm-evaluation-harness, very powerful, for zero-shot and few-shot tasks
- OpenAI Evals
- bigcode-project/bigcode-evaluation-harness, code evaluation tasks
- MLflow LLM Evaluate, integrated with MLflow
- MosaicML Composer, ICL tasks, very fast, scales to multi-GPU
- RAGAs for LLM based evaluation https://docs.ragas.io/en/latest/
- TruLens
- LinkedIn - Gregoire M - LLM Uncertainty
- Arxiv - Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models: How to Measure LLM Uncertainty and Why It Matters?
- Researchers from the University of Illinois propose a three-step method:
- Generate 10 or more responses to the same question.
- Compute a similarity matrix between the responses.
- Derive an uncertainty score based on response diversity and inconsistency.
- Two key criteria validate the approach:
- Error prediction: Uncertainty should correlate with incorrect answers (evaluated via AUC-ROC analysis).
- Error rejection: Filtering uncertain responses should improve accuracy (measured via rejection curve analysis).
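The error-rejection criterion above can be sketched with plain NumPy (a hypothetical helper, not code from the paper): given per-answer uncertainty scores and correctness labels, drop the most uncertain answers first and track accuracy on what remains. If uncertainty predicts errors, accuracy rises as the rejection fraction grows.

```python
import numpy as np

def rejection_curve(uncertainty: np.ndarray, correct: np.ndarray):
    """Accuracy among kept answers as increasingly uncertain ones are rejected.

    Returns (rejection_fraction, accuracy_on_kept) arrays: at each step the
    most uncertain remaining answer is dropped.
    """
    order = np.argsort(uncertainty)      # most confident answers first
    n = len(order)
    fractions, accuracies = [], []
    for keep in range(n, 0, -1):         # reject from the uncertain end
        kept = correct[order[:keep]]
        fractions.append(1.0 - keep / n)
        accuracies.append(kept.mean())
    return np.array(fractions), np.array(accuracies)

# Toy example: the two wrong answers also have the highest uncertainty,
# so accuracy climbs from 0.5 to 1.0 as they are rejected.
unc = np.array([0.1, 0.2, 0.9, 0.95])
cor = np.array([1.0, 1.0, 0.0, 0.0])
frac, acc = rejection_curve(unc, cor)
```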
- Findings:
- The worst approach is asking the LLM to estimate its own uncertainty.
- The best method combines NLI-based entailment scores for the similarity matrix with spectral clustering of responses to compute the final uncertainty score.
- Challenges and Limitations:
- High cost: Each score requires 10+ LLM calls.
- Requires non-zero temperature, making it inapplicable to deterministic responses.
- Generic context: NLI models used are general-purpose and may not fit business-specific cases.
- Lack of interpretability: The uncertainty scores (e.g., 0.4 to 10) lack clear business or probabilistic meaning.
- Short responses work best: For long responses, contradictions get diluted, biasing NLI models trained on short sentences.
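The spectral step of the best-performing method can be sketched as follows, assuming the pairwise NLI entailment matrix has already been computed elsewhere. The score here is the common eigenvalue-based relaxation of spectral clustering (counting near-zero eigenvalues of the normalized graph Laplacian, i.e. the effective number of semantic clusters among the sampled responses); treat it as an illustrative simplification, not the paper's exact implementation.

```python
import numpy as np

def spectral_uncertainty(sim: np.ndarray) -> float:
    """Uncertainty score from a pairwise similarity matrix of sampled responses.

    sim[i, j] in [0, 1], e.g. an NLI entailment probability between
    responses i and j. More semantically distinct clusters of responses
    means higher uncertainty; unanimous responses score ~1 cluster.
    """
    n = sim.shape[0]
    W = (sim + sim.T) / 2.0                       # symmetrize
    np.fill_diagonal(W, 1.0)
    d = W.sum(axis=1)
    # symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals = np.linalg.eigvalsh(L)
    # near-zero eigenvalues count connected (semantic) components;
    # the continuous relaxation sums max(0, 1 - lambda_k)
    return float(np.sum(np.clip(1.0 - eigvals, 0.0, None)))
```

With 10 identical responses (all-ones similarity) the score is ~1; with responses split into two contradicting groups it is ~2, matching the intuition that disagreement signals uncertainty.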
- Practical Guide - LLM as a Judge
- GitHub - lmms-eval: evaluation module from Microsoft
- Arxiv - Agent as a Judge: Evaluate Agents with Agents
- How to implement LLM as a Judge to test AI Agents?
- GitHub - deepeval
- Google Blog - DeepMind/Giskard evaluation