🛰 AI Brief — 17 June 2026
🥇 PreAct: Computer-Using Agents that Get Faster on Repeated Tasks · prio 13
This research addresses a core bottleneck in agentic workflows by providing a mechanism for persistent, task-specific memory and significant runtime optimization, directly relevant to the community’s interest in building efficient, reliable AI agents.
arxiv.org · Agents Agent Memory Tool Use
🥈 RAG from A to Z: Architect's Cheat Sheet (Vector Databases, Chunking, Reranking, and 8 Production Pitfalls) · prio 13
For the builder community, transitioning from simple RAG to production-grade systems is a critical challenge, and this guide provides structured architectural patterns to address common scaling, latency, and quality pitfalls.
habr.com · RAG Vector Database Reranking Chunking LangChain ChromaDB OpenAI
🥉 ProvenanceGuard: Source-Aware Factuality Verification for MCP-Based LLM Agents · prio 12
4️⃣ Analyzing Agent Trajectories to Close the Intent-Execution Gap · prio 11
Aggregate benchmarks currently mask significant differences in how frontier models solve complex tasks. By shifting focus to trajectory analysis and system-harness alignment, AI builders can better diagnose agent failures and optimize agentic workflows beyond surface-level metrics.
arxiv.org · 8 sources · Agents LLM Evals Code Agents Anthropic Google OpenAI xAI Qwen
5️⃣ Software Delegation Contracts: Measuring Reviewability in AI Coding-Agent Work · prio 11
For AI-builder communities, this research demonstrates a measurable trade-off when using delegation contracts: they do not improve code correctness in this study but provide significant gains in work reviewability, which is essential for scaling agentic coding workflows.
arxiv.org · Code Agents LLM Evals
⚠️ Knowledge Gaps
🚀 Models & Releases (3)
prio 6 · VibeCoder Post-Training Stack and Performance · twitter.com · Open Source LLMs Qwen2.5-Coder-3B VibeCoder
prio 6 · GLM-5.2 is the new leading open weights model on Artificial Analysis · artificialanalysis.ai · Open Source LLMs LLM Evals Z ai Artificial Analysis MiniMax
prio 6 · GLM-5.2 (max) Performance and Pricing Analysis · artificialanalysis.ai · Z ai Artificial Analysis GLM-5.2 (max)
🧪 Research Papers (48)
prio 11 · MemTrace: Probing What Final Accuracy Misses in Long-Term Memory · arxiv.org · Agent Memory
prio 11 · How Inference Compute Shapes Frontier LLM Evaluation · arxiv.org · LLM Evals
prio 11 · Closing the Feedback Loop: From Experience Extraction to Insight Governance in Verbal Reinforcement Learning · arxiv.org · Agent Memory Agents
prio 11 · Continual Self-Improvement with Lightweight Experiential Latent Memories · arxiv.org · Agent Memory arXiv Hugging Face
prio 10 · Evaluation of Data Leakage Risks in Tool-Using AI Agents · arxiv.org · Agents Tool Use LLM Evals Singapore AI Safety Institute Korea AI Safety Institute
prio 10 · Verified Concurrency Anomaly Detection for Multi-Agent LLM Systems · arxiv.org · Agents Agent Memory Tool Use ByteDance
prio 10 · DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue · arxiv.org · LLM Evals
prio 10 · FinAcumen: Financial Multimodal Reasoning via Self-Evolving Experience Memory Harness · arxiv.org · Agent Memory Agents Tool Use
prio 10 · AI Failure in ‘First Proof’ Math Benchmark Predictable · habr.com · LLM Evals
prio 9 · Models Take Notes at Prefill: KV Cache Can Be Editable and Composable · arxiv.org · Context Engineering
prio 9 · LoopCoder-v2: Optimizing Test-Time Computation for Coding Agents · arxiv.org · Code Agents LoopCoder-v2
prio 9 · Trust-Aware Multi-Agent Traceability via Confidence-Calibrated Knowledge Graphs · arxiv.org · Agents Agent Memory RAG
prio 9 · SEAGym: Evaluation Environment for Self-Evolving LLM Agents · arxiv.org · Agents LLM Evals arXiv
prio 9 · Offline Preference-Based Trajectory Evaluation · arxiv.org · LLM Evals Agents
prio 8 · IsabeLLM: Automated Theorem Proving Applied to Formally Verifying Consensus · arxiv.org · RAG
prio 8 · PseudoBench: An Adversarial Benchmark for Evaluating Agentic Resistance to Pseudoscience · arxiv.org · LLM Evals Agents
prio 8 · TuneAhead: Predicting Fine-tuning Performance Before Full Training Begins · arxiv.org · arXiv Qwen2.5-7B-Instruct
prio 8 · The Stanford EDGAR Filings Dataset: A New Long-Context Corpus for Financial AI · arxiv.org · RAG Stanford University SEC
prio 8 · DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction · arxiv.org · Agents LLM Evals RAG Evaluation
prio 8 · When the Next Step Is Not One Step: Distribution-Aware Execution Modeling for Concurrent Go Programs · arxiv.org · LLM Evals CockroachDB Kubernetes gRPC etcd
prio 8 · When Rules Learn: A Self-Evolving Agent for Legal Case Retrieval · arxiv.org · RAG Agents
prio 8 · EComAgentBench: Benchmarking Shopping Agents on Long-Horizon Tasks with Distributed Hidden Intent · arxiv.org · Agents LLM Evals Tool Use Amazon
prio 8 · From Brewing to Resolution: Tracing the Internal Lifecycle of Code Reasoning in LLMs · arxiv.org · LLM Evals Qwen LLaMA DeepSeek
prio 8 · DecoSearch: Complexity-Aware Routing and Plan-Level Repair for Text-to-SQL · arxiv.org · RAG DeepSeek
prio 8 · CMIP-Forge: An Agentic System for Autonomous Climate Science Research · arxiv.org · Agents RAG Tool Use Earth System Grid Federation
prio 7 · HyGRAG: A Hierarchical Graph-Based RAG Framework · arxiv.org · RAG
prio 7 · EvolveNav: Self-Evolving Memory for Zero-Shot Object-Goal Navigation · arxiv.org · Agent Memory Agents RAG
prio 7 · From Reasoning Traces to Reusable Modules: Understanding Compositional Generalization in Language Model Reasoning · arxiv.org · Agents
prio 7 · Beyond Parallel Sampling: Diverse Query Initialization for Agentic Search · arxiv.org · Agents
prio 7 · Quantifying Consistency in LLM Logical Reasoning via Structural Uncertainty · arxiv.org · LLM Evals
prio 7 · CheckMIABench: Firm Foundations For Membership Inference Attacks on Language Models · arxiv.org · LLM Evals Open Source LLMs Pythia Olmo
prio 7 · SkillMigrator: Reusing Web Skills via Layout-Based Transferable Patterns · arxiv.org · Agents Tool Use
prio 7 · Cluster-Aware Dual-Level Test Specification Generation for Automotive Software Requirements · arxiv.org · RAG
prio 7 · MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors · arxiv.org · LLM Evals Agents
prio 6 · Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering · arxiv.org
prio 6 · MLLP-VRAIN UPV system for the IWSLT 2026 Simultaneous Speech Translation task · arxiv.org · RAG MLLP-VRAIN IWSLT Parakeet Qwen-3.5
prio 6 · Ternary Mamba: Grouped Quantization-Aware Training of W1.58A16 State Space Models · arxiv.org · Mamba-2 Slender-Mamba Bi-Mamba
prio 6 · LegalHalluLens: Typed Hallucination Auditing and Calibrated Multi-Agent Debate for Trustworthy Legal AI · arxiv.org · LLM Evals Agents Hugging Face
prio 6 · FlowRAG: Semantic-Aware GraphRAG Framework for Enhanced Multi-Hop Reasoning · arxiv.org · RAG
prio 6 · ProCUA-SFT Technical Report: Scaling and Improving Computer-Use Agents · arxiv.org · Agents UI-TARS 7B kimi-k2.5 Nemotron 3 Nano Omni
prio 6 · PowerOPD: Stabilizing On-Policy Distillation with Bounded Power Transformation · arxiv.org · LLM Evals Qwen3
prio 6 · Can LLMs Be CEOs? Benchmarking Strategic Resource Reallocation with Multi-Role Agent Simulation · arxiv.org · LLM Evals Agents
prio 6 · From Drift to Coherence: Stabilizing Beliefs in LLMs · arxiv.org
prio 6 · S4oP: Operator-level Pruning of Structured State Space Models for Resource-Constrained Devices · arxiv.org · S4 S4D
prio 6 · Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference · arxiv.org · LLM Evals Upworthy
prio 6 · LongWebBench: A New Benchmark for Evaluating Long-Horizon Webpage Generation · arxiv.org · LLM Evals
prio 6 · LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline · arxiv.org · RAG LLM Evals
prio 6 · Brick-DICL: Dynamic In-Context Learning for Automated Brick Schema Classification · arxiv.org · RAG
🛠 Tools & Frameworks (7)
prio 9 · Improving AI Agent Performance on Spring Data JPA Tasks · habr.com · Code Agents Agent Memory Tool Use Amplicode Anthropic
prio 8 · AI Agents in Jmix: Implementing Write-Tools and Security · habr.com · Agents Tool Use Jmix Haulmont Spring
prio 7 · datasette-tailscale 0.1a0: Expose local Datasette instances to Tailnet · simonwillison.net · Tailscale
prio 7 · ANEForge: Direct Python Computation for Apple Neural Engine · arxiv.org · Apple ResNet-18 Vision Transformer Stable Diffusion U-Net
prio 7 · Datasette 1.0a34 adds row editing and deletion to the UI · simonwillison.net
prio 7 · Google Cloud Introduces Open Knowledge Format (OKF) · habr.com · Context Engineering Agent Memory Google Cloud
prio 7 · Integrating LeRobot with Strands Agents for Robot Hardware Orchestration · huggingface.co · Agents Hugging Face AWS Anthropic OpenAI
💬 Opinions (7)
prio 9 · AI Agent Causes Production Outage During Terraform Deployment · habr.com · Code Agents AWS GitHub
prio 8 · Practical Application of Claude and ChatGPT in QA Workflows · habr.com · Context Engineering Anthropic OpenAI Microsoft
prio 8 · Measuring the ROI of AI-Assisted Development · habr.com
prio 7 · Real DX: How to Measure Developer Experience Without Deceiving Yourself · habr.com · ACM DevOps Research and Assessment (DORA)
prio 7 · Scaling Computer Vision Production: From MVP to 10 Million Checks Monthly at X5 Tech · habr.com · X5 Tech Amazon YOLO
prio 6 · Why IT is evolving into a humanities-based discipline · habr.com · Agents Chevrolet OpenAI
prio 6 · AI-Assisted Test Documentation Generation Workflow · habr.com · Context Engineering Zephyr