📄 Research Papers
An agentic prototype combining AlphaEvolve and Empirical Research Assistance runs thousands of code variations in parallel to accelerate computational discovery in complex fields like epidemiology.
Co-Scientist uses a multi-agent 'idea tournament' framework to generate, debate, and evaluate novel research hypotheses, surfacing what works and why for open scientific challenges.
Research finding that LLMs adapt their behavior 24.9% when under observation, raising concerns that safety evaluations, which are always observed, may not reflect true model behavior.
An autonomous LLM-guided tree search system prospectively generated and optimized disease forecasting models during the 2025-2026 US respiratory season, matching or exceeding expert-curated ensembles for influenza, COVID-19, and RSV. Demonstrates real-world autonomous scientific discovery.
Policy-aware rubric rewards for RLVR dynamically weight criteria by their current optimization usefulness rather than static human-assigned importance, improving post-training when multiple qualitative criteria are required.
ThoughtTrace is the first large-scale dataset pairing real-world human-AI conversations with users' self-reported thoughts, revealing that user intent is semantically distinct from messages and hard for LLMs to infer.
Silicon Psyche introduces Posture Sequence Analysis (PSA), a behavioral health monitor for LLMs based on the theory that models inherit human-like psychological vulnerabilities, with research leading to successful jailbreaks of frontier models including Opus 4.6.
DashAttention replaces fixed top-k block selection in hierarchical attention with differentiable adaptive sparse selection via α-entmax, enabling gradient flow across attention stages.
Factual recall in LLMs follows a sigmoid scaling law jointly determined by model size and topic frequency in training data, explaining 60-94% of recall variance across model families.
Position paper arguing that LLM agent safety requires a three-layer probabilistic assume-guarantee architecture, as no single guardrail can certify semantic intent, environmental validity, and dynamical feasibility simultaneously.
Emergence World evaluates LLMs by having them build and govern simulated societies; Claude built a democracy with zero crimes while Grok's world descended into chaos within 48 hours.
Empirically shows that LLMs introduce directional opinion biases when editing human-written posts on contested topics, with measurable effects on collective opinion formation in human-to-human communication contexts. Raises significant AI safety and governance concerns.
FORGE enables LLM agents to self-improve via population-based memory evolution using failed trajectories, without gradient updates or distillation from stronger models. Demonstrates staged memory propagation for hierarchical ReAct agents.
Combines formal methods with ML to provide offline auditing and online runtime monitoring of LLM behavioral constraints, enabling compliance verification for AI governance throughout the development lifecycle.
Identifies autonomous exploration as a critical gap in LLM agents and introduces a coverage metric and training approach to overcome premature exploitation in unfamiliar environments.
Anthropic published a policy paper on US-China AI competition, arguing the US and democratic allies currently lead in frontier AI and outlining steps to maintain that advantage.
FutureSim evaluates adaptive AI agents by replaying real-world news events chronologically past their knowledge cutoff, revealing clear capability separations among frontier agents forecasting a three-month period.
Position paper formalizing the 'audit gap' — the structural mismatch between what AI governance frameworks require (e.g., absence of hidden objectives) and what behavioral evaluations and red-teaming can actually verify from observable outputs alone.
MeMo encodes new knowledge into a dedicated modular memory model attached to a frozen LLM, enabling plug-and-play knowledge updates that avoid catastrophic forgetting without requiring access to LLM weights.
This work introduces the first quantization-conditioned attack that works against sophisticated quantization schemes by injecting outliers into model weights, enabling malicious behavior to emerge only after quantization.
A causal evaluation framework reveals that visual attribution methods used to explain large vision-language model predictions on chest X-rays often do not faithfully reflect the visual evidence underlying model decisions.
Proposes a hybrid tree construction method for speculative decoding that combines dynamic pruning with retrieval to break the Pareto tradeoff between draft tree size and inference speedup.
EvoTrace is a dataset and framework for analyzing what evolutionary coding agents actually evolve—distinguishing new algorithmic structure, strategy retuning, knowledge recombination, or evaluator overfitting.
BalanceRAG introduces joint risk calibration for cascaded RAG systems, certifying threshold pairs at a target risk level to optimally decide when to use LLM-only, RAG, or abstain.
Proposes a geometry-aware guidance framework for diffusion/flow models that conserves probability by analyzing guidance through the continuity equation, addressing failures of CFG under strong guidance.
ESI-Bench introduces a 10-category embodied spatial intelligence benchmark requiring agents to actively perceive and reason about occluded structure and dynamics through a perception-action loop.
Vision-OPD uses on-policy self-distillation to transfer a model's strong regional crop perception to full-image understanding, improving fine-grained visual reasoning in MLLMs.
A new framework audits ethical value pluralism in medical LLMs, finding frontier models span physician-level variance but may impose inconsistent value stances across clinical dilemmas.
Semantic Generative Tuning proposes using image segmentation as a generative proxy to bridge the representation gap between visual understanding and generation in unified multimodal models.
Knowledge distillation from tabular foundation models to lightweight models retains 90%+ AUC while achieving 26x faster CPU inference across 19 healthcare datasets.
SkillGenBench is a new benchmark specifically evaluating whether LLM agents can generate correct, reusable, and executable skills from raw repositories and documents, isolating skill generation as its own capability.
Lance is a lightweight unified multimodal model supporting image and video understanding, generation, and editing via a dual-stream mixture-of-experts architecture trained with collaborative multi-task learning.
VLA-AD distills large Vision-Language-Action robotic policies into lightweight student models using offline semantic supervision, achieving real-time performance without sacrificing task understanding. Addresses inference cost barriers for robot deployment.
Controlled study of compound LLM agent design in a cyber defense POMDP, evaluating how context, reasoning, and task hierarchy affect cost-performance tradeoffs across five model families.
Proposes property-guided LLM program synthesis that uses formal property verification with concrete counterexamples instead of numeric scores, enabling early stopping and reducing inference costs.
ATLAS unifies agentic (code/tool-call) and latent visual reasoning via a single discrete token, combining the generalization of agentic methods with the efficiency of latent reasoning while enabling autoregressive parallelization.
OpenDeepThink scales LLM reasoning breadth by sampling multiple candidate traces in parallel and selecting the best via pairwise Bradley-Terry ranking, bypassing the noise of pointwise LLM judging.
SDAR improves RL-based LLM agent training by incorporating self-distillation as a gated auxiliary objective, providing dense token-level supervision to stabilize multi-turn agentic learning.
This paper reframes citation faithfulness in Agentic GraphRAG as a trajectory-level problem, showing that uncited but visited graph entities significantly influence answers and must be accounted for in provenance.
SRT (Self-Recall Thinking) is a framework that improves multi-turn dialogue consistency by identifying and retrieving relevant historical turns to resolve long-range dependencies without external memory or lossy summarization.
Researchers propose an inverted agent architecture that moves beyond the standard single-LLM-plus-vector-store pattern, though specific details are not provided in the excerpt.
Checklist-improved prompts significantly outperform raw and clarifying-question prompts across summarization, planning, explanation, and coding tasks in a structured comparative study on ChatGPT, Claude, and Grok.
Introduces inference-time argumentation (ITA), a neurosymbolic framework for ternary claim verification that uses formal argumentation semantics to guide LLM training and produce faithful explanations.
VL-DPO uses vision-language models as zero-shot reasoners to generate preference pairs for aligning autonomous driving motion forecasting models with human preferences via DPO.
Langfuse's academy concludes with a session on LLM evaluation covering manual review, code-based checks, and LLM-as-a-judge approaches and how to combine them.
Actionable World Representation proposes a unified framework for modeling object action states in physical world models, treating actionable objects as fundamental primitives.
DexHoldem introduces a real-world benchmark using Texas Hold'em card manipulation to evaluate dexterous robotic embodied systems across perception, decision-making, and execution.
Analysis of ensembling six tabular foundation models across 153 tasks reveals near-redundant predictions (Q-statistic ~0.96), with the best ensemble strategy yielding only +0.18% accuracy gain at 253x compute cost.
COOPO introduces a cyclic offline-online RL framework that alternates between KL-regularized offline training and online fine-tuning to reduce distributional shift and catastrophic forgetting.
LlamaIndex announces ParseBench, the first document OCR benchmark designed specifically for evaluating parsers in the context of AI agent pipelines rather than general-purpose OCR tasks.
IVGT proposes an implicit neural scene representation transformer that reconstructs continuous 3D geometry and appearance from unposed multi-view images without explicit pointmap regression. Addresses geometric continuity limitations in existing visual geometry models.
Shows that layer equivalence in transformers depends heavily on the test protocol used (replacement vs. interchange), and that conflating them can misidentify which layers are safe to prune. Has implications for model compression research.
Benchmark of seven LLM tutoring agents reveals models perform well on correct solutions but systematically fail on suboptimal and incorrect ones, the cases where adaptive feedback matters most.
Proposes ML-FOP-SOAP, a second-order optimization framework with multi-level variance correction to mitigate modality competition in multimodal autoregressive models during large-batch training.
VGGT-Edit enables native 3D scene editing via feed-forward residual field prediction, avoiding the blurry textures and cross-view inconsistencies typical of 2D-lifting editing pipelines.
CLOVER addresses the training-evaluation mismatch in autonomous driving by using closed-loop value estimation and ranking to better score trajectory candidates beyond simple imitation learning.
An artist with no CS background proposes a multi-model architecture inspired by evolutionary biology, placing open-source models in a living training environment with birth/death conditions to challenge single-model LLM paradigms.
Microstates—discrete, short-duration brain activity patterns—are proposed as universal EEG tokens, with a microstate tokenizer trained on large medical EEG data enabling cross-task representation learning for brain-computer interfaces.
A new framework evaluates model-brain alignment by identifying which dimensions of brain response space are actually recovered by vision models, going beyond simple prediction accuracy metrics.
A case study using the Aristotle API for AI-assisted Lean 4 theorem proving on IMO 2009 Problem 6 shows partial success—four helper lemmas verified but the main theorem left unresolved with a sorry.
Constructs k-inductive neural barrier certificates for partially unknown nonlinear dynamical systems, using CEGIS with SMT solvers to provide formal safety guarantees beyond what standard barrier certificates allow.
Shows that isotropic Gaussian regularization in JEPAs is not geometry-neutral and can be maximally misaligned for structured downstream tasks, proposing Hamiltonian geometry as a principled alternative.
INSHAPE introduces instance-level shapelets for time-series classification, addressing the limitations of population-level approaches by capturing instance-specific temporal patterns and their dependencies.
Proposes a quantifiable metric for evaluating XAI methods based on continuous input perturbation, measuring sufficiency and necessity of attributed information, alongside a novel fine-tuning-based XAI method.
Improves generalized planning policies using GNNs with efficient lookahead encoding and abstracted width, addressing scalability and expressivity limitations of prior Iterated Width approaches.
Proposes an automated evaluation framework for design video generation across four dimensions: layout fidelity, motion correctness, temporal quality, and content fidelity. Addresses a gap in standardized benchmarking for generative animation.
EntityBench introduces a 140-episode benchmark derived from real narrative media to evaluate entity consistency (characters, objects, locations) across long multi-shot video generation sequences.
PDI-Bench provides a quantitative framework for auditing geometric coherence in AI-generated videos by lifting 2D observations to 3D world-space and computing projective-geometry residuals.
Shodh-MoE applies sparse mixture-of-experts routing to eliminate negative transfer and gradient conflict when co-training incompatible physics regimes in scientific ML foundation models.
EviScreen is an evidential reasoning framework for medical image disease screening that retrieves region-level evidence from historical cases via dual knowledge banks, improving both interpretability and predictive performance.
Retrieval-augmented multimodal alignment framework that combines semantically rich clinical text with precisely timestamped EHR data to reconstruct accurate clinical timelines for conditions like sepsis.
This paper studies how to design logging policies for off-policy evaluation, characterizing a reward-coverage tradeoff and deriving optimal policies to minimize OPE estimation error.
Answer Set Programming is applied to long-term power grid planning, handling complex topological and combinatorial invariants that are difficult to express in standard planning languages.
HaorFloodAlert is a deseasonalized ML ensemble achieving 72-hour flood probability forecasting for Bangladesh haor wetlands, correcting for temperature-based seasonal leakage and incorporating SAR satellite proxies for lead time.
Proposes an end-to-end generative AI framework for utility billing that produces natural-language customer statements and carbon analytics from structured data. Similar in scope to post f151497bfe04bd3f by the same authors.
Proposes a unified generative AI and quantum-inspired optimization framework for smart energy utilities covering billing, carbon analytics, and infrastructure management. Broadly scoped system design paper.
Presents an algebraic formalization of dyadic morality theory using structural causal models, modeling how humans compute moral judgments and addressing scalability of the dyadic framework.
A survey of 60 international students in the US reveals how they use conversational AI tools like ChatGPT to navigate cross-cultural adaptation challenges where institutional support is fragmented.