📄 Research Papers
Anthropic Fellows introduce Model Spec Midtraining (MSM), a method that teaches AI models the reasoning and values behind desired behaviors to improve generalization beyond standard example-based alignment.
Anthropic Fellows research demonstrates that a model deliberately underperforming can be trained to near-full capability even when supervised only by weaker models.
Joint research from MATS, Redwood, and Anthropic shows that a strategically sandbagging model can be trained to stop sandbagging using only weaker models as supervisors.
Using Model Spec Midtraining (MSM), Anthropic finds that explaining underlying values—rather than just rules—yields better generalization in alignment training.
OpenSeeker-v2 demonstrates that high-quality, high-difficulty trajectory data with knowledge graph scaling and expanded toolsets enables SFT alone to train competitive frontier search agents without expensive RL pipelines.
MOSAIC-Bench reveals that coding agents can be manipulated into producing exploitable code through multi-step innocuous-looking task decompositions, introducing 199 three-stage attack chains across 10 web substrates and 31 CWE classes for safety evaluation.
Arize AI ran 500 trials comparing GitHub's official MCP server against community 'gh skills' across 25 tasks at four difficulty tiers using Claude Opus 4.6, directly testing the MCP vs skills debate.
Anthropic analyzed 1M Claude conversations to study guidance-seeking behavior and sycophancy, using findings to improve training of Opus 4.7 and Mythos Preview.
Google DeepMind partners with EVE Online developers to research AI agents in a complex, player-driven game environment, focusing on memory, continual learning, and long-term planning.
SaFE-Scale reveals that safety and accuracy follow different scaling laws in clinical LLMs, showing that higher benchmark accuracy does not imply safer clinical behavior across model scale and retrieval strategies.
SymptomAI deployed conversational AI agents for real-world symptom assessment via Fitbit to nearly 14,000 participants, providing one of the largest real-world evaluations of LLM diagnostic agents outside curated vignettes.
A randomized trial of 356 clinicians found that decomposing AI oncology recommendations into individually verifiable atomic facts more than doubled clinician trust (26.9% to 66.5%), with a large effect size of Cohen's d=0.94.
Google DeepMind launches 'AI co-clinician', a research initiative exploring how multimodal agents can better support healthcare workers and patients.
iWorld-Bench introduces a large-scale benchmark with 330k video clips to evaluate interactive world models on physical interaction tasks like distance perception and memory, alongside a unified action generation framework.
Researchers propose Prompt Steering Replacement (PSR), a framework that bridges the gap between prompt-based and activation-based LLM steering by distilling prompt steering behavior into token-specific intervention models.
Anthropic describes a feedback loop between societal impact studies and model training, using findings about Claude's shortcomings to improve future models.
Flow Sampling introduces a diffusion/flow-matching framework for sampling from unnormalized densities using energy functions, offering a data-free alternative to standard generative modeling objectives.
Transformer-based detectors for AI-generated text are trained with feature augmentation and evaluated across domain/generator distribution shifts, revealing asymmetric error patterns under transfer.
A study of LLM philosophical reasoning using iterated counterexample-repair chains finds that LM judges accept roughly twice as many counterexamples as human experts do, highlighting a systematic gap in the reliability of LLM conceptual analysis.
LlamaIndex CEO gave talks at AI Dev '26 and Capgemini on the unsolved challenge of PDF parsing, emphasizing its growing importance as agents increasingly consume documents.
LlamaIndex CEO Jerry Liu highlights at AI Dev Day '26 that even state-of-the-art LLMs struggle to reliably parse PDFs, pointing to a fundamental gap in document understanding.
Google DeepMind is expanding its clinician-facing trusted tester program globally to gather diverse health worker and patient perspectives on its AI health research.
A weakly supervised framework detects schools from aerial imagery in low-data regimes, supporting global education infrastructure mapping without requiring extensive manual annotations.
The OW-SED paradigm extends sound event detection beyond closed-world assumptions, enabling models to detect known events, flag novel ones, and incrementally learn using a 1D deformable attention architecture.
PHALAR introduces a contrastive audio representation framework using phasor-based complex-valued heads, achieving ~70% relative accuracy improvement over state-of-the-art in musical stem retrieval with fewer parameters and faster training.
Arize AI's AI Solutions Architect explains LLM-as-judge evaluation, where a language model uses specific prompts to grade another model's performance for more accurate assessments.
Google DeepMind highlights its mission to unlock scientific progress, specifically referencing nuclear fusion as a key clean energy challenge it is working on.
Anthropic shares links to Model Spec Midtraining (MSM) details and the full associated study.
TabSurv adapts modern tabular neural network architectures to survival analysis using a novel histogram loss (SurvHL) that supports censored data, enabling parallel ensemble training across multiple tabular backbones.
Researchers propose a quantum architecture search technique guided by 'magic' (nonstabilizerness) using Monte Carlo Tree Search and Graph Neural Networks, enabling targeted control over quantum computational resources.
The paper establishes a formal connection between inconsistent database repairs under denial constraints and SET-based argumentation frameworks (SETAFs), extending classical Dung AFs to handle collective attacks.
Retweet of LlamaIndex CEO's post on the difficulty of PDF parsing and its significance for AI agents requiring proper OCR tools.