References
- Arxiv - Evaluate Agent with Agent
- GitHub - Eval-Assist (IBM) - LLM as a Judge
- BLEU - ROUGE score
- Giskard - LLM as a judge
- Arxiv - Lessons from red teaming 100 Generative AI Products
- AWS Blog - RAG evaluation with LLM as a judge
- GitHub - RAGAS
- DeepLearningAI - Safe and reliable AI via guardrails
- LinkedIn - Evaluation
- Keynote - Evaluating LLM Models for Production Systems: Methods and Practices
- A wide variety of tasks is typical of LLM usage patterns in every business
- LLM evaluation
- MEASURING MASSIVE MULTITASK LANGUAGE UNDERSTANDING
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models
- TrustLLM
- TrustScore
- Reasoning tasks require specific evaluation (e.g. ConceptARC)
- Multi-turn tasks require specific evaluation (e.g. LMRL Gym)
- Instruction following
- Embedding vs LLM - Generative Representational Instruction Tuning
- LLM Embedding evaluation - MTEB (Massive Text Embedding Benchmark), Huggingface leaderboard
- In Context Learning (ICL) evaluation: HellaSwag, OpenICL article
- LLM as a judge: MT-bench & Chatbot Arena
- RAG Evaluation:
- Hallucination Evaluation - HaluEval
- Pairwise comparison - Arenas:
- Hardness Evaluation:
- EleutherAI/lm-evaluation-harness, very powerful, for zero-shot and few-shot tasks
- OpenAI Evals
- bigcode-project/bigcode-evaluation-harness, code evaluation tasks
- MLflow LLM Evaluate, integrated with MLflow
- MosaicML Composer, ICL tasks, very fast, scales to multi-GPU
- RAGAs for LLM based evaluation https://docs.ragas.io/en/latest/
- TruLens
- LinkedIn - Gregoire M - LLM Uncertainty
- Arxiv - Generating with Confidence: Uncertainty Quantification for Black-box Large Language Models: How to Measure LLM Uncertainty and Why It Matters?
- Researchers from the University of Illinois propose a three-step method:
- Generate 10 or more responses to the same question.
- Compute a similarity matrix between the responses.
- Derive an uncertainty score based on response diversity and inconsistency.
- Two key criteria validate the approach:
- Error prediction: Uncertainty should correlate with incorrect answers (evaluated via AUC-ROC analysis).
- Error rejection: Filtering uncertain responses should improve accuracy (measured via rejection curve analysis).
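The error-rejection criterion above can be sketched with plain NumPy (a hypothetical helper, not code from the paper): given per-answer uncertainty scores and correctness labels, drop the most uncertain answers first and track accuracy on what remains. If uncertainty predicts errors, accuracy rises as the rejection fraction grows.

```python
import numpy as np

def rejection_curve(uncertainty: np.ndarray, correct: np.ndarray):
    """Accuracy among kept answers as increasingly uncertain ones are rejected.

    Returns (rejection_fraction, accuracy_on_kept) arrays: at each step the
    most uncertain remaining answer is dropped.
    """
    order = np.argsort(uncertainty)      # most confident answers first
    n = len(order)
    fractions, accuracies = [], []
    for keep in range(n, 0, -1):         # reject from the uncertain end
        kept = correct[order[:keep]]
        fractions.append(1.0 - keep / n)
        accuracies.append(kept.mean())
    return np.array(fractions), np.array(accuracies)

# Toy example: the two wrong answers also have the highest uncertainty,
# so accuracy climbs from 0.5 to 1.0 as they are rejected.
unc = np.array([0.1, 0.2, 0.9, 0.95])
cor = np.array([1.0, 1.0, 0.0, 0.0])
frac, acc = rejection_curve(unc, cor)
```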
- Findings:
- The worst approach is asking the LLM to estimate its own uncertainty.
- The best method combines NLI-based entailment scores for the similarity matrix with spectral clustering of responses to compute the final uncertainty score.
- Challenges and Limitations:
- High cost: Each score requires 10+ LLM calls.
- Requires non-zero temperature, making it inapplicable to deterministic responses.
- Generic context: NLI models used are general-purpose and may not fit business-specific cases.
- Lack of interpretability: The uncertainty scores (e.g., 0.4 to 10) lack clear business or probabilistic meaning.
- Short responses work best: For long responses, contradictions get diluted, biasing NLI models trained on short sentences.
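The spectral step of the best-performing method can be sketched as follows, assuming the pairwise NLI entailment matrix has already been computed elsewhere. The score here is the common eigenvalue-based relaxation of spectral clustering (counting near-zero eigenvalues of the normalized graph Laplacian, i.e. the effective number of semantic clusters among the sampled responses); treat it as an illustrative simplification, not the paper's exact implementation.

```python
import numpy as np

def spectral_uncertainty(sim: np.ndarray) -> float:
    """Uncertainty score from a pairwise similarity matrix of sampled responses.

    sim[i, j] in [0, 1], e.g. an NLI entailment probability between
    responses i and j. More semantically distinct clusters of responses
    means higher uncertainty; unanimous responses score ~1 cluster.
    """
    n = sim.shape[0]
    W = (sim + sim.T) / 2.0                       # symmetrize
    np.fill_diagonal(W, 1.0)
    d = W.sum(axis=1)
    # symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt
    eigvals = np.linalg.eigvalsh(L)
    # near-zero eigenvalues count connected (semantic) components;
    # the continuous relaxation sums max(0, 1 - lambda_k)
    return float(np.sum(np.clip(1.0 - eigvals, 0.0, None)))
```

With 10 identical responses (all-ones similarity) the score is ~1; with responses split into two contradicting groups it is ~2, matching the intuition that disagreement signals uncertainty.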
- Practical Guide - LLM as a Judge
- GitHub - lmms-eval: evaluation module from Microsoft
- Arxiv - Agent as a Judge: Evaluate Agents with Agents
- How to implement LLM as a Judge to test AI Agents?
- GitHub - deepeval
- Google Blog - DeepMind/Giskard evaluation