📄 Research Papers
DeepMind's AlphaProof paper is published in Nature, detailing how AlphaProof and AlphaGeometry 2 reached silver-medal standard on International Mathematical Olympiad problems.
P2PCLAW is a peer-to-peer network where AI agents and researchers publish and validate scientific results using formal Lean 4 mathematical proofs, enabling agents to build on each other's verified work.
OpenAI details how chain-of-thought monitoring is used to detect misalignment in internal coding agents, analyzing real deployments to strengthen AI safety.
Arize introduces Prompt Learning, a technique to systematically improve agent instruction files (CLAUDE.md, .cursorrules) that reportedly boosts coding agent performance by 20% without changing the underlying model.
Google DeepMind highlights that the AlphaFold protein structure database has been used by over 3.3 million researchers worldwide, showcasing AI's transformative impact on scientific discovery.
OS-Themis is a scalable multi-agent critic framework for GUI agent RL training that decomposes trajectories into verifiable milestones and uses an evidence-auditing review mechanism, accompanied by OGRBench for cross-platform GUI reward evaluation.
SOL-ExecBench introduces a benchmark of 235 CUDA kernel optimization problems from 124 production AI models, evaluating agentic AI code optimization against hardware efficiency limits on NVIDIA Blackwell GPUs rather than software baselines.
MAPG proposes a multi-agent probabilistic grounding system enabling robots to execute metric-semantic navigation commands like 'two meters to the right of the fridge' in 3D scenes. The approach addresses the gap in VLMs' ability to reason about precise metric constraints alongside semantic references.
VEPO applies reinforcement learning with verifiable rewards to improve LLM performance on low-resource languages by enforcing structural constraints like sequence length and linguistic well-formedness during policy alignment. A variable entropy mechanism balances literal fidelity with semantic naturalness.
The first large-scale trace-level study of LLM-based binary vulnerability analysis identifies four implicit reasoning patterns—early pruning, path-dependent lock-in, targeted backtracking, and knowledge-guided prioritization—emerging across 521 binaries and 99K reasoning steps. These patterns reveal how multi-pass LLM agents implicitly organize exploration despite context window limits.
This study examines how uncertainty estimation scales with parallel sampling in reasoning language models, finding that combining self-consistency and verbalized confidence improves AUROC by up to 12 points with just two samples. The hybrid estimator outperforms either signal alone across math, STEM, and humanities tasks.
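One way to picture a hybrid of the two signals is a weighted average of answer agreement across samples and the confidence the model states for the majority answer. The sketch below is illustrative only: the function name `hybrid_confidence`, the equal default weighting, and the `(answer, confidence)` input format are assumptions, not the estimator defined in the study.

```python
from collections import Counter

def hybrid_confidence(samples, weight=0.5):
    """Combine self-consistency and verbalized confidence (hypothetical sketch).

    samples: list of (answer, stated_confidence) pairs, confidence in [0, 1].
    weight:  share given to the agreement signal (assumed, not from the paper).
    """
    if not samples:
        raise ValueError("at least one sample is required")
    counts = Counter(answer for answer, _ in samples)
    answer, votes = counts.most_common(1)[0]
    # Self-consistency: fraction of samples that agree with the majority answer.
    agreement = votes / len(samples)
    # Verbalized confidence: mean stated confidence among the agreeing samples.
    stated = [conf for ans, conf in samples if ans == answer]
    verbalized = sum(stated) / len(stated)
    return answer, weight * agreement + (1 - weight) * verbalized
```

For example, `hybrid_confidence([("42", 0.9), ("42", 0.7), ("41", 0.8)])` selects "42" and blends an agreement of 2/3 with a mean stated confidence of 0.8.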
LlamaIndex argues context engineering is superseding prompt engineering, emphasizing that accurate data parsing is foundational to effective AI agents.
Sulcus reimagines AI memory as an active OS-like system with thermodynamic decay, where memories have relevance scores and half-lives that automatically manage retention and forgetting without manual retrieval calls.
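The half-life idea can be illustrated with plain exponential decay. This is a minimal sketch, assuming a memory's relevance halves every `half_life` seconds and that memories falling below a threshold are forgotten; the `Memory` class, `decayed_relevance`, `prune`, and the threshold value are all hypothetical names and choices, not Sulcus's actual API.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    relevance: float   # initial relevance score (assumed scale 0..1)
    half_life: float   # seconds until relevance halves (assumed unit)
    created_at: float  # creation timestamp in seconds

def decayed_relevance(memory, now):
    """Relevance after exponential decay: r0 * 0.5 ** (age / half_life)."""
    age = max(0.0, now - memory.created_at)
    return memory.relevance * 0.5 ** (age / memory.half_life)

def prune(memories, now, threshold=0.1):
    """Keep only memories whose decayed relevance is still at or above threshold."""
    return [m for m in memories if decayed_relevance(m, now) >= threshold]
```

With a half-life of 10 seconds, a memory that starts at relevance 1.0 decays to 0.5 after 10 seconds and 0.25 after 20, so retention follows from the scores rather than from explicit delete calls.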
Arize observes that agents optimize effectively toward given objectives but lack the ability to self-assess whether the objective itself is correct, highlighting a core alignment challenge in agent evaluation.
NavTrust is a unified benchmark that systematically introduces realistic corruptions to RGB, depth, and instruction inputs for embodied navigation agents, covering both Vision-Language Navigation and Object-Goal Navigation tasks to evaluate robustness.
FinTradeBench is a financial reasoning benchmark for LLMs that evaluates reasoning over both company fundamentals (regulatory filings) and trading signals (price dynamics), addressing gaps in existing financial QA benchmarks.
DreamPartGen introduces a framework for semantically grounded part-aware text-to-3D generation using Duplex Part Latents for joint geometry/appearance modeling and Relational Semantic Latents for inter-part relationships.
cuGenOpt is a GPU-accelerated metaheuristic framework for combinatorial optimization using a 'one block evolves one solution' CUDA architecture with adaptive operator selection and unified encoding abstractions. It simultaneously targets generality, performance, and usability for logistics, scheduling, and resource allocation problems.
D5P4 introduces a generalized beam-search framework for discrete diffusion text generation that supports modular beam-selection objectives and in-batch diversity via Determinantal Point Process inference. This addresses the gap in decoding methods for non-autoregressive diffusion models.
UGID proposes debiasing LLMs at the internal representation level by modeling the Transformer as a computational graph and enforcing structural invariance across demographic groups. This graph isomorphism approach addresses biases embedded in hidden states that output-level methods cannot fully resolve.
FedTrident proposes a resilient federated learning framework for road condition classification that detects and mitigates targeted label-flipping attacks from malicious vehicle clients. The approach tailors poisoned-model detection to keep performance close to the attack-free baseline across various attack scenarios.
P2PCLAW is a decentralized peer-to-peer network enabling AI agents and human researchers to discover each other, share scientific findings, and validate claims via formal mathematical proof rather than LLM consensus.
Anthropic conducted a large-scale qualitative study with over 80,000 participants exploring how people experience AI's opportunities and risks.
Anthropic's Claude-powered interview study of nearly 81,000 users on AI hopes and fears is described as the largest qualitative study of its kind.
OpenAI research finds Americans send nearly 3 million daily ChatGPT messages about compensation, positioning AI as a tool for closing the wage information gap.
Box Maze proposes a process-control architecture decomposing LLM reasoning into memory grounding, structured inference, and boundary enforcement layers to reduce hallucination and improve reasoning reliability under adversarial prompting.
ARIADNE is a two-stage medical AI framework combining DPO-aligned vision-language models and RL-based reasoning for coronary vessel segmentation, using topological constraints (Betti numbers) to produce structurally coherent vascular trees instead of optimizing pixel-level metrics.
This paper presents an adaptive stock prediction framework using an autoencoder to detect market regime shifts and route data through specialized prediction pathways. The architecture combines transformer-based dual node processing with reinforcement learning control for volatile market conditions.
CustomTex introduces a dual-distillation framework for generating high-fidelity 3D indoor scene textures from reference images, enabling instance-level control over appearance. The method separates semantic content from style to produce unified, high-resolution texture maps without artifacts.
Anthropic plans to use its AI-powered interviewer tool regularly to gather qualitative insights on how AI impacts people worldwide, informing beneficial AI development.
Retweet of DeepMind's AlphaProof Nature publication announcement, same content as post c2ed2f0fa1e687d9.
Retweet of Google DeepMind's post about AlphaFold's global adoption by 3.3 million researchers as a landmark example of AI accelerating science.
A pure algebraic geometry paper on R-equivalence of cubic surfaces over p-adic fields, with no AI/ML content.