LLM Evals

LLM Evals are tests and measurement systems for model behavior, agent performance, retrieval quality, safety, and task success. GROUNDING tracks eval benchmarks, failure modes, and practical measurement patterns.

Topic: LLMs
Related: RAG Evaluation, Agents, Code Agents

Recent Updates

2026-06-08: UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs (cs.CL updates on arXiv.org) · arxiv.org — Amirhossein Abaskohi
2026-06-08: Anthropic’s AI Code Generation and the Blind Spots of Automated Verification (Artificial Intelligence – AI, ANN, and other forms of artificial mind) · habr.com — Anthropic METR Claude
2026-06-08: Introducing FrontierCode: A Benchmark for Code Quality and Mergeability (Hacker News) · cognition.ai — Cognition Claude Opus 4.8 GPT 5.5 Gemini-3.1-Pro kimi-k2.6
2026-06-09: LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (cs.AI updates on arXiv.org) · arxiv.org
2026-06-09: From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape (cs.CL updates on arXiv.org) · arxiv.org
2026-06-09: Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models (cs.CL updates on arXiv.org) · arxiv.org — Qwen3-14B
2026-06-09: Artificial Intelligence for Mathematical Reasoning: A Unified Survey (cs.AI updates on arXiv.org) · arxiv.org — Syed Rifat Raiyan
2026-06-09: Scaffold Effects on GAIA: A Controlled Comparison (cs.AI updates on arXiv.org) · arxiv.org — Anthropic Google OpenAI Claude Opus 4.7 Claude Sonnet 4.6 Claude Haiku 4.5 Gemini 3.1 Pro Preview GPT 5.5
2026-06-09: Constrained Paraphrase Consistency for LLM Hallucination Detection (cs.CL updates on arXiv.org) · arxiv.org — DeBERTa Flan-T5
2026-06-09: Cross Paraphrastic Invariance Learning for Hallucination Detection (cs.CL updates on arXiv.org) · arxiv.org
2026-06-09: When Languages Disagree: Self-Evolving Multilingual LLM Judges (cs.CL updates on arXiv.org) · arxiv.org — Fu Liu
2026-06-09: Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs (cs.AI updates on arXiv.org) · arxiv.org — arXiv Zeamanuel Tesfaye
2026-06-09: Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics (cs.CL updates on arXiv.org) · arxiv.org — arXiv
2026-06-09: Benchmarking Open-Ended Multi-Agent Coordination in Language Agents (cs.AI updates on arXiv.org) · arxiv.org — arXiv Gemini-3.1-Pro-High GPT-5.4-High
2026-06-09: To Nuke or Not to Nuke: Evaluating Ethical Reasoning in Agentic Decision-Making (cs.AI updates on arXiv.org) · arxiv.org
2026-06-09: Beyond English Benchmarks: Clinical LLM Evaluation in Brazilian Portuguese (cs.CL updates on arXiv.org) · arxiv.org — SciELO Giordano De Pinho Souza MedGemma-27B Sabiá-4 DeepSeek R1 o3-mini
2026-06-09: Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation (cs.LG updates on arXiv.org) · arxiv.org
2026-06-09: PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents (cs.AI updates on arXiv.org) · arxiv.org — arXiv Qwen2.5
2026-06-09: SLMJury: Framework for Small Language Model Evaluation (cs.CL updates on arXiv.org) · arxiv.org — Phi-4
2026-06-09: Scaling Down: Efficient Merchant Information Extraction with Small Fine-Tuned Models (cs.AI updates on arXiv.org) · arxiv.org — Databricks Gemma 3 Qwen-3.5 Aya Llama 3.1
2026-06-09: SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows (cs.AI updates on arXiv.org) · arxiv.org — GitLab
2026-06-09: Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora (cs.CL updates on arXiv.org) · arxiv.org — Qwen2.5 32B E5
2026-06-09: Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories (cs.LG updates on arXiv.org) · arxiv.org — Anthropic Google Alibaba Claude Sonnet 4.6 Qwen3.5-35B-A3B Gemma4-31B
2026-06-09: Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation (cs.CL updates on arXiv.org) · arxiv.org
2026-06-09: Multilingual Refusal Alignment for Safer Large Language Models (cs.CL updates on arXiv.org) · arxiv.org — Aleksandra Krasnodębska
2026-06-09: Evaluating Hallucinations in Domain-Adapted Large Language Models (cs.CL updates on arXiv.org) · arxiv.org — Lamini LLaMA-2
2026-06-09: TinyJudge: Improving LLM Instruction Following via Lightweight Specialist Ensembles (cs.CL updates on arXiv.org) · arxiv.org — arXiv
2026-06-09: Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models (cs.CL updates on arXiv.org) · arxiv.org
2026-06-09: BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection (cs.CL updates on arXiv.org) · arxiv.org — Shoaib Sadiq Salehmohamed
2026-06-09: Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity (cs.LG updates on arXiv.org) · arxiv.org — arXiv Hugging Face Florian E. Dorner
2026-06-09: SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models (cs.CL updates on arXiv.org) · arxiv.org — arXiv Hugging Face Qwen2.5
2026-06-09: The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers (cs.LG updates on arXiv.org) · arxiv.org
2026-06-09: The AI Epistemic Deference Index: A Continuous Measure of Sycophancy (cs.AI updates on arXiv.org) · arxiv.org — Paul de Font-Reaulx Claude Grok Gemini
2026-06-09: Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs (cs.AI updates on arXiv.org) · arxiv.org — LeetCode Sayed Erfan Arefin Yi-Coder-9B-Chat Qwen2.5-Coder-14B-Instruct Gemma-2-27B-IT
2026-06-09: Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy (cs.AI updates on arXiv.org) · arxiv.org
2026-06-09: When Behavioral Safety Evaluation Fails: A Representation-Level Perspective (cs.LG updates on arXiv.org) · arxiv.org
2026-06-09: Evaluating AI Coding Agents on Neuroscience Data Pipelines (cs.AI updates on arXiv.org) · arxiv.org
2026-06-09: Summarization is Not Dead Yet (cs.CL updates on arXiv.org) · arxiv.org
2026-06-09: ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research (cs.LG updates on arXiv.org) · arxiv.org — Anthropic arXiv Claude Opus 4.7
2026-06-09: Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators (cs.AI updates on arXiv.org) · arxiv.org
2026-06-09: Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning (cs.LG updates on arXiv.org) · arxiv.org
2026-06-09: Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Honest Evaluation and Benchmark Contamination (cs.CL updates on arXiv.org) · arxiv.org — OpenAI NVIDIA Whisper large-v3 Phi-4-multimodal
2026-06-09: Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems (cs.AI updates on arXiv.org) · arxiv.org — arXiv alphaXiv CatalyzeX DagsHub Gotit.pub Hugging Face ScienceCast
2026-06-09: Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents (cs.AI updates on arXiv.org) · arxiv.org
2026-06-09: Hacking Generative Perplexity: A Critique of Unconditional Text Evaluation (cs.CL updates on arXiv.org) · arxiv.org — gpt2-large
2026-06-09: Building Comparative Motivation Profiles with Instrumental Interventions (cs.CL updates on arXiv.org) · arxiv.org — arXiv Hugging Face Llama-3.1-70B Llama-3.1-405B Qwen-2.5-72B
2026-06-09: FrontierCode: Benchmarking for Code Quality over Slop (Latent.Space) · latent.space — Cognition OpenAI METR Latent.Space Apple Google Scott Wu Omar Sar0 Graham Neubig Hamel Husain Opus 4.8
2026-06-09: Exploiting Tiny LLMs: Manipulating Persona and Safety via Token Sensitivity (All articles / Artificial Intelligence / Habr) · habr.com — Alibaba vast.ai Qwen3.5-0.8B
2026-06-09: Data Scientist’s Revenge: Why Data Science Remains Critical in the LLM Era (All articles / Artificial Intelligence / Habr) · habr.com — OpenAI Harvard Business Review JosH100 Andrej Karpathy
2026-06-09: Anthropic Launches Claude Fable 5 and Mythos 5 (AI News) · anthropic.com — Anthropic Stripe Cognition Hebbia IMC US Government Claude Fable 5 Claude Mythos 5 Claude Opus 4.8 Claude Mythos Preview

FAQ

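What is LLM Evals?

LLM Evals are tests and measurement systems for model behavior, agent performance, retrieval quality, safety, and task success.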

Which topic does LLM Evals belong to?

On the GROUNDING radar, LLM Evals is grouped under the LLMs topic.

Which concepts are related to LLM Evals?

Related concepts tracked by the radar include RAG Evaluation, Agents, and Code Agents.
