LLMs

LLMs, or large language models, are neural models that generate and transform language, code, and multimodal content. GROUNDING tracks model releases, evaluations, inference, tooling, and builder impact.

Key Developments

2026-06-09: Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics (cs.CL updates on arXiv.org) · arxiv.org LLM Evals arXiv
2026-06-09: Benchmarking Open-Ended Multi-Agent Coordination in Language Agents (cs.AI updates on arXiv.org) · arxiv.org LLM Evals arXiv Gemini-3.1-Pro-High GPT-5.4-High
2026-06-09: Customer-Agent: Overcoming Context Limitations in Ultra-Long Shopping Trajectories via Tool-Augmented Agents and RLVR (cs.CL updates on arXiv.org) · arxiv.org Context Engineering
2026-06-09: To Nuke or Not to Nuke: Evaluating Ethical Reasoning in Agentic Decision-Making (cs.AI updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: Beyond English Benchmarks: Clinical LLM Evaluation in Brazilian Portuguese (cs.CL updates on arXiv.org) · arxiv.org LLM Evals SciELO Giordano De Pinho Souza MedGemma-27B Sabiá-4 DeepSeek R1 o3-mini
2026-06-09: Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation (cs.LG updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents (cs.AI updates on arXiv.org) · arxiv.org LLM Evals arXiv Qwen2.5
2026-06-09: SLMJury: Framework for Small Language Model Evaluation (cs.CL updates on arXiv.org) · arxiv.org LLM Evals Phi-4
2026-06-09: Scaling Down: Efficient Merchant Information Extraction with Small Fine-Tuned Models (cs.AI updates on arXiv.org) · arxiv.org LLM Evals Databricks Gemma 3 Qwen-3.5 Aya Llama 3.1
2026-06-09: SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows (cs.AI updates on arXiv.org) · arxiv.org LLM Evals GitLab
2026-06-09: Unlocking Latent Value: Taxonomy-Guided Recovery of High-Performing Data from Low-Tier Web Corpora (cs.CL updates on arXiv.org) · arxiv.org LLM Evals Qwen2.5 32B E5
2026-06-09: Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories (cs.LG updates on arXiv.org) · arxiv.org LLM Evals Anthropic Google Alibaba Claude Sonnet 4.6 Qwen3.5-35B-A3B Gemma4-31B
2026-06-09: Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation (cs.CL updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents (cs.AI updates on arXiv.org) · arxiv.org Context Engineering arXiv Opus Qwen Codex GPT 5.5 Qwen-QLoRA Qwen3.6-Plus Gemini-3.1-Pro-High Qwen3.5-122B-A10B
2026-06-09: Multilingual Refusal Alignment for Safer Large Language Models (cs.CL updates on arXiv.org) · arxiv.org LLM Evals Aleksandra Krasnodębska
2026-06-09: Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models (cs.AI updates on arXiv.org) · arxiv.org Context Engineering Google Anthropic Alibaba OpenAI Gemma-4-31B-IT Qwen3.6-35B-A3B Claude Sonnet 4.6 GPT-5.3
2026-06-09: Evaluating Hallucinations in Domain-Adapted Large Language Models (cs.CL updates on arXiv.org) · arxiv.org LLM Evals Lamini LLaMA-2
2026-06-09: TinyJudge: Improving LLM Instruction Following via Lightweight Specialist Ensembles (cs.CL updates on arXiv.org) · arxiv.org LLM Evals arXiv
2026-06-09: Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models (cs.CL updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection (cs.CL updates on arXiv.org) · arxiv.org LLM Evals Shoaib Sadiq Salehmohamed
2026-06-09: Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity (cs.LG updates on arXiv.org) · arxiv.org LLM Evals arXiv Hugging Face Florian E. Dorner
2026-06-09: SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models (cs.CL updates on arXiv.org) · arxiv.org LLM Evals arXiv Hugging Face Qwen2.5
2026-06-09: The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers (cs.LG updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: The AI Epistemic Deference Index: A Continuous Measure of Sycophancy (cs.AI updates on arXiv.org) · arxiv.org LLM Evals Paul de Font-Reaulx Claude Grok Gemini
2026-06-09: Beyond Pass Rate: A Multilingual, Execution-Grounded Evaluation of Open Code LLMs (cs.AI updates on arXiv.org) · arxiv.org LLM Evals LeetCode Sayed Erfan Arefin Yi-Coder-9B-Chat Qwen2.5-Coder-14B-Instruct Gemma-2-27B-IT
2026-06-09: Stress-testing medical large language models reveals latent safety pathology beyond benchmark accuracy (cs.AI updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: When Behavioral Safety Evaluation Fails: A Representation-Level Perspective (cs.LG updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: Evaluating AI Coding Agents on Neuroscience Data Pipelines (cs.AI updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: Summarization is Not Dead Yet (cs.CL updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research (cs.LG updates on arXiv.org) · arxiv.org LLM Evals Anthropic arXiv Claude Opus 4.7
2026-06-09: Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators (cs.AI updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning (cs.LG updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Honest Evaluation and Benchmark Contamination (cs.CL updates on arXiv.org) · arxiv.org LLM Evals OpenAI NVIDIA Whisper large-v3 Phi-4-multimodal
2026-06-09: Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems (cs.AI updates on arXiv.org) · arxiv.org LLM Evals arXiv alphaXiv CatalyzeX DagsHub Gotit.pub Hugging Face ScienceCast
2026-06-09: Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents (cs.AI updates on arXiv.org) · arxiv.org LLM Evals
2026-06-09: Hacking Generative Perplexity: A Critique of Unconditional Text Evaluation (cs.CL updates on arXiv.org) · arxiv.org LLM Evals gpt2-large
2026-06-09: Building Comparative Motivation Profiles with Instrumental Interventions (cs.CL updates on arXiv.org) · arxiv.org LLM Evals arXiv Hugging Face Llama-3.1-70B Llama-3.1-405B Qwen-2.5-72B
2026-06-09: FrontierCode: Benchmarking for Code Quality over Slop (Latent.Space) · latent.space LLM Evals Cognition OpenAI METR Latent.Space Apple Google Scott Wu Omar Sar0 Graham Neubig Hamel Husain Opus 4.8
2026-06-09: Building a Custom Billing System for Multi-tenant AI Agents (Artificial Intelligence – AI, ANN, and other forms of artificial intelligence) · habr.com Context Engineering LLMStart.ru 1C Ayton OpenRouter LangChain Sergey Smirnov Gemini Pro Gemini Flash
2026-06-09: Building a Practical Harness for Coding Agents: A Real-World Perspective (All articles / Artificial Intelligence / Habr) · habr.com Context Engineering Anthropic Google Vercel Redis Ltd.
2026-06-09: Exploiting Tiny LLMs: Manipulating Persona and Safety via Token Sensitivity (All articles / Artificial Intelligence / Habr) · habr.com LLM Evals Alibaba vast.ai Qwen3.5-0.8B
2026-06-09: Spring Explore Skill for AI-Assisted Development (All articles / Artificial Intelligence / Habr) · habr.com Context Engineering Amplicode Spring Anthropic Google Josh Long
2026-06-09: Analyzing the Claude Code Best Practices Repository (All articles / Artificial Intelligence / Habr) · habr.com Context Engineering Anthropic GitHub Boris Cherny shanraisshan Claude
2026-06-09: Building an Advanced RAG Pipeline for Corporate AI Assistants (All articles / Artificial Intelligence / Habr) · habr.com Context Engineering Confluence Jira GitLab Саша
2026-06-09: Data Scientist’s Revenge: Why Data Science Remains Critical in the LLM Era (All articles / Artificial Intelligence / Habr) · habr.com LLM Evals OpenAI Harvard Business Review JosH100 Andrej Karpathy
2026-06-09: Navigating On-Premises LLM Deployment Challenges (All articles / Artificial Intelligence / Habr) · habr.com Context Engineering
2026-06-09: Anthropic Launches Claude Fable 5 and Mythos 5 (эйай ньюз) · anthropic.com LLM Evals Context Engineering Anthropic Stripe Cognition Hebbia IMC US Government Claude Fable 5 Claude Mythos 5 Claude Opus 4.8 Claude Mythos Preview
2026-06-09: Anthropic Launches Claude Fable 5 and Mythos 5 Models (Data Science by ODS.ai 🦜) · anthropic.com Long Context Anthropic Stripe Cognition Hebbia IMC Claude Fable 5 Claude Mythos 5 Claude Opus 4.8 Claude Mythos Preview
2026-06-09: FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention (alphaXiv) · twitter.com Long Context Context Engineering DeepSeek DeepSeek V4
2026-06-10: Anthropic Releases Claude Fable 5 and Mythos 5 (alphaXiv) · alphaxiv.org Context Engineering Anthropic Stripe Cognition US Government Claude Fable 5 Claude Mythos 5 Claude Opus 4.8 Claude Mythos Preview

FAQ

What is the LLMs topic?

What does the LLMs topic page track?

This page tracks the key developments that the GROUNDING radar has mapped to LLMs, updated through 2026-06-10.

How current is this page?

The most recent LLMs development listed here is dated 2026-06-10; the radar refreshes hourly.
