🛰 AI Brief — 12 June 2026
🥇 Compiling User Corrections into Runtime Enforcement for Coding Agents (TRACE) ·
prio 13 · Current agent memory solutions often fail to enforce behavioral corrections consistently across sessions. This research offers builders a practical, data-driven approach to turning user feedback into enforceable runtime constraints for coding agents.
arxiv.org · Agent Memory Code Agents
🥈 Understanding LLM Context Windows and Memory Limitations ·
prio 13 · Understanding the architectural limits of context windows and memory is essential for builders developing reliable agentic workflows and effective prompt structures, particularly when managing complex, long-running interactions.
habr.com · Context Engineering Long Context OpenAI Anthropic Google GPT-3 GPT-4 Claude
🥉 Learning What to Remember: A Cognitively Grounded Multi-Factor Value Model for Agentic Memory ·
prio 12 · This research provides a more effective and interpretable alternative to simple recency-based memory management, addressing a critical bottleneck for builders of long-running autonomous agents with limited context budgets.
arxiv.org · Agent Memory arXiv
4️⃣ The Containment Gap: How Deployed Agentic AI Frameworks Fail Public-Facing Safety Requirements ·
prio 12 · This paper highlights a critical lack of safety mechanisms in widely used agentic frameworks and demonstrates how easily persistent agent memory can be compromised in high-stakes environments. It gives builders actionable, low-overhead architectural recommendations for improving memory integrity and policy adherence in agent deployments.
arxiv.org · Agent Memory Agents LangChain OpenAI
5️⃣ MemRefine: LLM-Guided Compression for Long-Term Agent Memory ·
prio 12 · Managing agent memory is a crucial bottleneck for production-grade, long-running agents. MemRefine offers a practical, LLM-driven approach to budget-constrained memory, moving away from simple rule-based systems to maintain agent performance while controlling storage costs.
arxiv.org · Agent Memory arXiv
⚠️ Knowledge Gaps
🚀 Models & Releases (5)
8 · Kimi K2.7-Code Release Announcement · huggingface.co · Code Agents Moonshot AI Hugging Face Google Kaggle
7 · Google Launches Gemini 3.5 Live Translate for Real-Time Voice Translation · qudata.com · Google Agora Fishjam LiveKit Pipecat
6 · Kimi.ai Releases Open-Source Coding Model Kimi-K2.7-Code · kimi.com · Kimi.ai Moonshot AI Kimi-K2.7-Code K2.6
6 · Moonshot AI Announces Kimi K2.7-Code Model · recipes.vllm.ai · Code Agents Moonshot AI vLLM Kimi-K2.7-Code K2.6
6 · Anthropic’s New Fable 5 and Mythos 5 Models Featuring Dynamic Model Routing · habr.com · Anthropic OpenAI Vellum Fable 5 Mythos 5
🧪 Research Papers (83)
11 · Uncertainty-Aware Hybrid Retrieval for Long-Document RAG · arxiv.org · RAG Hybrid Search Chunking
11 · (Human) Attention Is (Still) All You Need: Human oversight makes AI-assisted social science reliable · arxiv.org · Agents
11 · Berkeley’s ‘Agents’ Last Exam’ Benchmark Highlights Practical Agent Limitations · qbitai.com · Agents LLM Evals UC Berkeley Siemens Adobe
10 · MTG Bench: Testing LLM Agent Reasoning with Magic: The Gathering · mtgautodeck.com · Agents Tool Use MCP Context Engineering OpenAI
10 · TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum · arxiv.org · RAG Vector Database Grand Egyptian Museum YOLOv8n Gemma 4 E2B
10 · ToolSense: A Diagnostic Framework for Auditing Parametric Tool Knowledge in LLMs · arxiv.org · Tool Use LLM Evals
10 · Multi-Turn Reasoning and Memory-Augmented RL for Fragmented Context · arxiv.org · Agent Memory Context Engineering arXiv
10 · SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation · arxiv.org · Embeddings RAG arXiv multilingual E5 e5-sk-small
10 · SafeLLM: Extraction as a Hallucination-Resistant Alternative to Rewriting in Safety-Critical Settings · arxiv.org · RAG RAG Evaluation NHS NICE
10 · How Fine-Grained Should a RAG Benchmark Be? A Hierarchical Framework for Synthetic Question Generation · arxiv.org · RAG Evaluation RAG Falcon-3-10B
10 · A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures · arxiv.org · Agent Memory Agents LLM Evals Gemini 2.5 Llama-3.3-70b
10 · LEDGER: A Long-Context Benchmark for Financial Retrieval and Extraction · arxiv.org · RAG Evaluation Long Context arXiv
9 · Arbor: Tree Search as a Cognition Layer for Autonomous Agents · arxiv.org · Agents Agent Memory arXiv
9 · MiniMax Sparse Attention: Optimizing Sparse Attention for Long-Context LLMs · arxiv.org · Long Context Context Engineering
9 · The Illusion of Multi-Agent Advantage · arxiv.org · Agents
9 · Evaluating Web Content Pollution in Generative Recommenders · arxiv.org · RAG RAG Evaluation arXiv
9 · EvoBrowseComp: Benchmarking Search Agents on Evolving Knowledge · arxiv.org · LLM Evals Agents arXiv arXivLabs
9 · LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling · arxiv.org · LLM Evals RAG Evaluation Wikipedia
9 · Evoflux: Inference-Time Evolution of Executable Tool Workflows for Compact Agents · arxiv.org · Agents Tool Use
9 · HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents · arxiv.org · Tool Use Agents Qwen3-32B Qwen3-8B GPT-OSS
8 · SciR: A Controllable Benchmark for Scientific Reasoning · arxiv.org · LLM Evals DeepSeek R1
8 · RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue · arxiv.org · Agents LLM Evals
8 · Multi-Field Hybrid Retrieval-Augmented Generation for Maritime Accident Root Cause Analysis · arxiv.org · RAG Hybrid Search Korea Maritime Safety Tribunal
8 · EvoArena and EvoMem for Evolving Environments · arxiv.org · Agents Agent Memory
8 · Can I Buy Your KV Cache? · arxiv.org · Context Engineering arXiv Qwen3-4B
8 · EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery · arxiv.org · Agents
8 · AgentBeats: A Framework for Open and Standardized Agent Assessment · arxiv.org · Agents LLM Evals MCP
8 · Polar: A Benchmark for Evaluating Political Bias in LLMs · arxiv.org · LLM Evals Manifesto Project
8 · sebis at CRF Filling 2026: A Two-Stage Local LLM Pipeline for Medical CRF Filling · arxiv.org · Open Source LLMs MedGemma-27B
8 · Small LLMs for Biomedical Claim Verification: Cost-Effective Fine-Tuning, Structural Dataset Shortcuts, and Cross-Domain Generalization · arxiv.org · LLM Evals Open Source LLMs Phi-3-mini Qwen2.5-3B Mistral-7B
8 · WISE: A Long-Horizon Agent in Minecraft with Why-Which Reasoning · arxiv.org · Agent Memory Agents
8 · GENIE: A Fine-Grained Measure for Novelty · arxiv.org · LLM Evals arXiv
8 · Rethinking RAG in Long Videos: What to Retrieve and How to Use It? · arxiv.org · RAG Reranking RAG Evaluation
8 · MiniPIC: Flexible Position-Independent Caching in <100LOC · arxiv.org · Context Engineering RAG arXiv vLLM
8 · Deployment-Centered Evaluation: Predicting Query-Level Rejection Risk in a Clinical LLM System · arxiv.org · LLM Evals
8 · Analyzing Query Embedding Interpolation in Multilingual Dense Retrieval · arxiv.org · Embeddings RAG BGE-M3
8 · SENTINEL: Failure-Driven Reinforcement Learning for Training Tool-Using Language Model Agents · arxiv.org · Agents Tool Use Qwen Qwen3-4B-Thinking-2507
8 · G-Long: Graph-Enhanced Memory Management for Efficient Long-Term Dialogue Agents · arxiv.org · Agent Memory RAG T5
8 · ERTS: Adversarial Robustness Testing of Ethical AI · arxiv.org · LLM Evals Gemini 2.0 Flash Llama 3.2
8 · ReSum: Synergizing LLM Reasoning and Summarization with Reinforcement Learning · arxiv.org · Context Engineering arXiv
7 · An Explainable AI Assistant for Introductory Programming Education · arxiv.org · LLM Evals
7 · CloudCons: An End-to-End Benchmark for Cloud Resource Consolidation · arxiv.org · LLM Evals Huawei Microsoft Google
7 · Detecting Functional Memorization in Code Language Models · arxiv.org · LLM Evals Olmo-3-32B
7 · Rethinking Psychometric Evaluation of LLMs: When and Why Self-Reports Predict Behavior · arxiv.org · LLM Evals
7 · Keep Policy Gradient in Charge: Sibling-Guided Credit Distillation for Long-Horizon Tool-Use Agents · arxiv.org · Agents Tool Use
7 · PersonaDrive: Human-Style Retrieval-Augmented VLA Agents for Closed-Loop Driving Simulation · arxiv.org · RAG Embeddings CARLA
7 · Occupational Prompting Reveals Cultural Bias in Large Language Models · arxiv.org · LLM Evals
7 · Ontology Memory-Augmented ASR Correction for Long Text-Speech Interleaved Conversations · arxiv.org · Agent Memory RAG Context Engineering
7 · Proactive Scientific Peer Review via Agent Investigation · arxiv.org · Agents Agent Memory
7 · SkillCAT: Contrastive Assessment and Topology-Aware Skill Self-Evolution for LLM Agents · arxiv.org · Agents arXiv
7 · PRISMR: Overcoming Parse Collapse in Multimodal Listwise Ranking via Parameterized Representation Internalization · arxiv.org · Long Context
7 · Shopping Reasoning Bench: A New Benchmark for Multi-Turn Conversational Shopping Assistants · arxiv.org · LLM Evals OpenAI Anthropic Google
7 · LLMs Can Better Capture Human Judgments—With the Right Prompts · arxiv.org · LLM Evals International Social Survey Programme
7 · Zero-source LLM Hallucination Detection with Human-like Criteria Probing · arxiv.org · Agents LLM Evals
7 · Rigel: Reverse-Engineering Apple M4 Max GPU Metal Tensor Compute Path · arxiv.org · Apple
7 · SICI: A Semantic-Pragmatic Complexity Index Reveals Regime Shifts in LLM Stance Detection · arxiv.org · LLM Evals GPT-3.5 gpt-4o-mini DeepSeek-V3 GPT-4o
7 · No Hidden Prompts Needed! You Can Game AI Peer Review with Presentation-Only Revisions · arxiv.org · LLM Evals arXiv
7 · GeoNatureAgent Benchmark for Geospatial AI Agents · arxiv.org · Agents Tool Use LLM Evals DeepSeek Claude Sonnet 4
7 · Brick: Spatial Capability Routing for the Mixture-of-Models (MoM) Paradigm · arxiv.org
7 · Reward Modeling for Multi-Agent Orchestration · arxiv.org · Agents
7 · Constructing Evaluation Datasets for Procedural Reasoning · arxiv.org · RAG Evaluation
7 · Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science · arxiv.org · LLM Evals Zephyr Mistral-Instruct Qwen2.5-Instruct
7 · LLM-as-an-Investigator: Evidence-First Reasoning for Robust Interactive Problem Diagnosis · arxiv.org · Agents
6 · Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering · arxiv.org · LLM Evals Context Engineering Anthropic OpenAI Claude Sonnet 4
6 · Auditing Joint-Distribution Fidelity in Synthetic Persona Datasets · arxiv.org · LLM Evals NVIDIA Nemotron-Personas-Korea
6 · IVIE: A Neuro-symbolic Approach to Incremental and Validated Generation of Interactive Fiction Worlds · arxiv.org · Agents
6 · HarnessBridge: Learnable Bidirectional Controller for LLM Agent Harness · arxiv.org · Agents arXiv
6 · Reasoning for Mobile User Experience with Multimodal LLMs: Task, Benchmark, and Approach · arxiv.org · LLM Evals Qwen3-VL-4B-Thinking Claude 4.5 Sonnet
6 · ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm · arxiv.org · Agents Tool Use
6 · “Did you lie?” Evaluating Lie Detectors across Model Scale and Belief-Verified Model Organisms · arxiv.org · LLM Evals
6 · Does AI Reviewer See the Full Picture? Attacking and Defending Multimodal Peer Review · arxiv.org · RAG Long Context
6 · MARD: Mirror-Augmented Reasoning Distillation for Mechanism-Level Drug-Drug Interaction Prediction · arxiv.org · RAG RAG Evaluation DrugBank OpenAI MARD-7B
6 · Agents-K1: Towards Agent-native Knowledge Orchestration · arxiv.org · Agents arXiv
6 · SkillChain: Automating Lifecycle Management for E-Commerce AI Assistants · arxiv.org · Agents LLM Evals Tool Use
6 · Topical Phase Transitions in Artificial Intelligence Research · arxiv.org
6 · Multiagent Protocols with Aggregated Confidence Signals · arxiv.org · Agents
6 · MÖVE: A Holistic LLM Benchmark for the German Public Sector · arxiv.org · LLM Evals
6 · Fantastic Scientific Agents and How to Build Them: AgentBuild for Rietveld Refinement · arxiv.org · Agents Code Agents MCP Tool Use
6 · MLUBench: A Benchmark for Lifelong Unlearning Evaluation in MLLMs · arxiv.org · LLM Evals Hugging Face
6 · Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension · arxiv.org · Context Engineering
6 · LAUKIN: A Multi-jurisdictional Common Law Contract Dataset · arxiv.org · LLM Evals
6 · The Hidden Power of Scaling Factor in LoRA Optimization · arxiv.org
6 · Hallucination in Medical Imaging AI: A Cross-Modality Analytical Framework · arxiv.org · LLM Evals FDA
🛠 Tools & Frameworks (6)
11 · Enhancing PDF Data Extraction for LLMs via Semantic Replacement Text · sgaud.com · RAG Adobe Google OpenAI Anthropic
10 · Building AI Agents within Jmix Enterprise Applications · habr.com · Agents Tool Use Jmix CUBA Spring
10 · olmo-eval: An evaluation workbench for the model development loop · huggingface.co · LLM Evals Hugging Face Olmo Tulu
8 · Lightweight Multi-Provider LLM Router · habr.com · OpenAI Groq Mistral DeepSeek xAI
7 · Why Removing ‘Um’ from Audio Recordings is Complex · doug.sh · OpenAI Whisper
6 · Building an Autonomous Server Log Analyzer with Local LLMs · habr.com · Agents Open Source LLMs NVIDIA Telegram
🏢 Industry / Business (3)
9 · Economic Analysis of Code Agent Subscription Models as of June 2026 · habr.com · OpenAI Anthropic Cursor Moonshot Alibaba
7 · WhatsApp Business API Pricing 2026: Cost Structure and Avoiding Markups · wexio.io · Meta Twilio Wati 360dialog Respond.io
7 · RPA is Dead: The Shift to Agentic AI Paradigms · habr.com · Agents MCP Ovations Technologies
💬 Opinions (12)
9 · If You Are Asking for Human Attention, Demonstrate Human Effort · tombedor.dev
9 · Why We Argue About Memory for AI Agents · habr.com · Agent Memory Obsidian
8 · Empirical Hardware Resource Estimation for On-Premise LLMs · habr.com · LLMStart.ru GPT-OSS 120B
8 · Reducing ‘slop’ in AI-generated frontends using specific design styles · envs.net · OpenAI Axios gpt-5.5-thinking
8 · Strategies for Running Autonomous Long-Running Coding Agents · twitter.com · Agents Context Engineering DAIR.AI Opus 4.8
7 · How a New DSL May Survive in the Era of LLMs · williamcotton.com · Code Agents Context Engineering
7 · What to do if you realize you are a ‘vibecoder’ · habr.com · Code Agents Cursor
7 · The Most Dangerous AI Agent Error Is Not Bad Code · habr.com · Agents Agent Memory
6 · AI Agent Incurs Heavy AWS Bill While Attempting to Index DN42 Network · lantian.pub · Agents AWS DN42
6 · Practical Workflow for Marketing Strategy with AI · habr.com · Anthropic OpenAI Google Telegram Yandex
6 · Cognitive Debt: The New Challenge in AI-Assisted Development · habr.com · Code Agents Profi.ru
6 · I Am Not a Reverse Centaur · blog.miguelgrinberg.com