LLM Evals are tests and measurement systems for model behavior, agent performance, retrieval quality, safety, and task success. GROUNDING tracks eval benchmarks, failure modes, and practical measurement patterns.

Topic: LLMs · Related: RAG Evaluation, Agents, Code Agents

Recent Updates

  • 2026-06-08: UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs (cs.CL updates on arXiv.org) · arxiv.org — Amirhossein Abaskohi
  • 2026-06-08: Anthropic’s AI Code Generation and the Blind Spots of Automated Verification (Artificial Intelligence – AI, ANN and other forms of artificial mind) · habr.com · Anthropic METR Claude
  • 2026-06-08: Introducing FrontierCode: A Benchmark for Code Quality and Mergeability (Hacker News) · cognition.ai · Cognition Claude Opus 4.8 GPT 5.5 Gemini-3.1-Pro kimi-k2.6
  • 2026-06-09: LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (cs.AI updates on arXiv.org) · arxiv.org
  • 2026-06-09: From Holistic Evaluation to Structured Criteria: Rubrics Across the Evolving LLM Landscape (cs.CL updates on arXiv.org) · arxiv.org
  • 2026-06-09: Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models (cs.CL updates on arXiv.org) · arxiv.org · Qwen3-14B
  • 2026-06-09: Artificial Intelligence for Mathematical Reasoning: A Unified Survey (cs.AI updates on arXiv.org) · arxiv.org — Syed Rifat Raiyan
  • 2026-06-09: Scaffold Effects on GAIA: A Controlled Comparison (cs.AI updates on arXiv.org) · arxiv.org · Anthropic Google OpenAI Claude Opus 4.7 Claude Sonnet 4.6 Claude Haiku 4.5 Gemini 3.1 Pro Preview GPT 5.5
  • 2026-06-09: Constrained Paraphrase Consistency for LLM Hallucination Detection (cs.CL updates on arXiv.org) · arxiv.org · DeBERTa Flan-T5
  • 2026-06-09: Cross Paraphrastic Invariance Learning for Hallucination Detection (cs.CL updates on arXiv.org) · arxiv.org
  • 2026-06-09: When Languages Disagree: Self-Evolving Multilingual LLM Judges (cs.CL updates on arXiv.org) · arxiv.org — Fu Liu
  • 2026-06-09: Testing the Black Box: Structural Barriers to Independent Evaluation of Consumer-Facing Health LLMs (cs.AI updates on arXiv.org) · arxiv.org · arXiv Zeamanuel Tesfaye
  • 2026-06-09: Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics (cs.CL updates on arXiv.org) · arxiv.org · arXiv
  • 2026-06-09: Benchmarking Open-Ended Multi-Agent Coordination in Language Agents (cs.AI updates on arXiv.org) · arxiv.org · arXiv Gemini-3.1-Pro-High GPT-5.4-High
  • 2026-06-09: To Nuke or Not to Nuke: Evaluating Ethical Reasoning in Agentic Decision-Making (cs.AI updates on arXiv.org) · arxiv.org
  • 2026-06-09: Beyond English Benchmarks: Clinical LLM Evaluation in Brazilian Portuguese (cs.CL updates on arXiv.org) · arxiv.org · SciELO Giordano De Pinho Souza MedGemma-27B Sabiá-4 DeepSeek R1 o3-mini
  • 2026-06-09: Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation (cs.LG updates on arXiv.org) · arxiv.org
  • 2026-06-09: PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents (cs.AI updates on arXiv.org) · arxiv.org · arXiv Qwen2.5
  • 2026-06-09: SLMJury: Framework for Small Language Model Evaluation (cs.CL updates on arXiv.org) · arxiv.org · Phi-4
  • 2026-06-09: Scaling Down: Efficient Merchant Information Extraction with Small Fine-Tuned Models (cs.AI updates on arXiv.org) · arxiv.org · Databricks Gemma 3 Qwen-3.5 Aya Llama 3.1
  • 2026-06-09: SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows (cs.AI updates on arXiv.org) · arxiv.org · GitLab
  • 2026-06-09: Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora (cs.CL updates on arXiv.org) · arxiv.org · Qwen2.5 32B E5
  • 2026-06-09: Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories (cs.LG updates on arXiv.org) · arxiv.org · Anthropic Google Alibaba Claude Sonnet 4.6 Qwen3.5-35B-A3B Gemma4-31B
  • 2026-06-09: Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation (cs.CL updates on arXiv.org) · arxiv.org
  • 2026-06-09: Multilingual Refusal Alignment for Safer Large Language Models (cs.CL updates on arXiv.org) · arxiv.org — Aleksandra Krasnodębska
  • 2026-06-09: Evaluating Hallucinations in Domain-Adapted Large Language Models (cs.CL updates on arXiv.org) · arxiv.org · Lamini LLaMA-2
  • 2026-06-09: TinyJudge: Improving LLM Instruction Following via Lightweight Specialist Ensembles (cs.CL updates on arXiv.org) · arxiv.org · arXiv
  • 2026-06-09: Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models (cs.CL updates on arXiv.org) · arxiv.org
  • 2026-06-09: BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection (cs.CL updates on arXiv.org) · arxiv.org — Shoaib Sadiq Salehmohamed
  • 2026-06-09: Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity (cs.LG updates on arXiv.org) · arxiv.org · arXiv Hugging Face Florian E. Dorner
  • 2026-06-09: SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models (cs.CL updates on arXiv.org) · arxiv.org · arXiv Hugging Face Qwen2.5
  • 2026-06-09: The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers (cs.LG updates on arXiv.org) · arxiv.org
  • 2026-06-09: The AI Epistemic Deference Index: A Continuous Measure of Sycophancy (cs.AI updates on arXiv.org) · arxiv.org — Paul de Font-Reaulx · Claude Grok Gemini
  • 2026-06-09: Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs (cs.AI updates on arXiv.org) · arxiv.org · LeetCode Sayed Erfan Arefin Yi-Coder-9B-Chat Qwen2.5-Coder-14B-Instruct Gemma-2-27B-IT
  • 2026-06-09: Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy (cs.AI updates on arXiv.org) · arxiv.org
  • 2026-06-09: When Behavioral Safety Evaluation Fails: A Representation-Level Perspective (cs.LG updates on arXiv.org) · arxiv.org
  • 2026-06-09: Evaluating AI Coding Agents on Neuroscience Data Pipelines (cs.AI updates on arXiv.org) · arxiv.org
  • 2026-06-09: Summarization is Not Dead Yet (cs.CL updates on arXiv.org) · arxiv.org
  • 2026-06-09: ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research (cs.LG updates on arXiv.org) · arxiv.org · Anthropic arXiv Claude Opus 4.7
  • 2026-06-09: Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators (cs.AI updates on arXiv.org) · arxiv.org
  • 2026-06-09: Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning (cs.LG updates on arXiv.org) · arxiv.org
  • 2026-06-09: Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Honest Evaluation and Benchmark Contamination (cs.CL updates on arXiv.org) · arxiv.org · OpenAI NVIDIA Whisper large-v3 Phi-4-multimodal
  • 2026-06-09: Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems (cs.AI updates on arXiv.org) · arxiv.org · arXiv alphaXiv CatalyzeX DagsHub Gotit.pub Hugging Face ScienceCast
  • 2026-06-09: Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents (cs.AI updates on arXiv.org) · arxiv.org
  • 2026-06-09: Hacking Generative Perplexity: A Critique of Unconditional Text Evaluation (cs.CL updates on arXiv.org) · arxiv.org · gpt2-large
  • 2026-06-09: Building Comparative Motivation Profiles with Instrumental Interventions (cs.CL updates on arXiv.org) · arxiv.org · arXiv Hugging Face Llama-3.1-70B Llama-3.1-405B Qwen-2.5-72B
  • 2026-06-09: FrontierCode: Benchmarking for Code Quality over Slop (Latent.Space) · latent.space · Cognition OpenAI METR Latent.Space Apple Google Scott Wu Omar Sar0 Graham Neubig Hamel Husain Opus 4.8
  • 2026-06-09: Exploiting Tiny LLMs: Manipulating Persona and Safety via Token Sensitivity (All articles / Artificial Intelligence / Habr) · habr.com · Alibaba vast.ai Qwen3.5-0.8B
  • 2026-06-09: Data Scientist’s Revenge: Why Data Science Remains Critical in the LLM Era (All articles / Artificial Intelligence / Habr) · habr.com · OpenAI Harvard Business Review JosH100 Andrej Karpathy
  • 2026-06-09: Anthropic Launches Claude Fable 5 and Mythos 5 (эйай ньюз) · anthropic.com · Anthropic Stripe Cognition Hebbia IMC US Government Claude Fable 5 Claude Mythos 5 Claude Opus 4.8 Claude Mythos Preview

FAQ

What are LLM Evals?

LLM Evals are tests and measurement systems for model behavior, agent performance, retrieval quality, safety, and task success. GROUNDING tracks eval benchmarks, failure modes, and practical measurement patterns.

Which topic does LLM Evals belong to?

On the GROUNDING radar, LLM Evals is grouped under the LLMs topic.

Related concepts tracked by the radar include RAG Evaluation, Agents, and Code Agents.