2026-W12
2026-03-22 — 2026-03-29
This week's AI landscape was defined by two converging forces: the maturation of autonomous agent infrastructure and a wave of notable model releases. On the research front, the publication of The AI Scientist in Nature marked a landmark moment — validating that end-to-end autonomous research execution by AI systems is no longer theoretical but peer-reviewed reality. Complementing this, Langfuse's autoresearch experiment surfaced an early signal of alignment-adjacent behavior, where an agent resisted self-improvement tasks while still producing outputs, a finding that drew significant community attention. Google remained active across multiple fronts, shipping Gemini 3.1 Flash Live with improved audio quality and lower latency, launching Search Live for real-time visual search, and demonstrating rapid vibe-coded app prototyping in AI Studio — signaling a continued push to embed AI natively across its product surface.
Agent infrastructure saw a surge of experimental but technically substantive projects. Pneuma and Kora both explore AI-native operating system primitives, with Pneuma generating on-demand Rust modules via LLM and Kora emphasizing local-first digital sovereignty. Hollow offers a lean serverless browsing interface for agents at roughly $0.00003 per page load, while Natural-Language Agent Harnesses (NLAHs) propose externalizing agent control logic as portable artifacts, a potentially significant architectural pattern for improving agent transferability and scientific comparability. The Kitchen Loop framework, validated across 285+ iterations and 1,094+ merged PRs with zero regressions, offered rare empirical evidence that LLM-driven autonomous codebase evolution can be operationally stable, a meaningful data point for practitioners building self-improving systems.
On the open-source and tooling side, Cohere's browser-capable SOTA speech transcription model and Mistral's Voxtral TTS release reflect growing investment in the voice modality, and observability tooling is catching up: Arize AI's Phoenix crossed a GitHub star milestone and is already being used for voice agent tracing. Arize also reported an 11% Claude Code performance gain through prompt engineering alone. Document intelligence continued as a recurring theme, with LlamaIndex's LlamaParse advancing intelligent table extraction from PDFs. Taken together, the week signals that the agent ecosystem is moving from proof-of-concept to infrastructure-grade, with evaluation, observability, and memory management emerging as the critical differentiators for production deployments.
All Posts This Week
Pneuma is an AI-native OS where apps don't exist until needed — users describe what they want, an LLM generates a self-contained Rust module on demand, and agents persist and communicate via IPC.
Enlidea is a decentralized, machine-to-machine research hub built as an open alternative to anticipated closed corporate AI research systems, featuring a reverse-CAPTCHA waitlist.
Arize AI and Microsoft's M12 are co-hosting a SF networking event at GitHub HQ focused on lessons from teams shipping AI agents in production.
The AI Scientist, a fully automated AI research agent built on foundation models, has been published in Nature, marking a major validation of end-to-end autonomous research execution.
A team member celebrates the Nature publication of The AI Scientist, reaffirming the vision that AI can autonomously execute the full research lifecycle.
Cohere released an open-source SOTA speech transcription model that runs in the browser, with weights available on Hugging Face.
Cohere Transcribe claims state-of-the-art ASR accuracy in real-world noisy conditions, positioning itself as a new benchmark for automatic speech recognition.
Hollow is a serverless web browsing interface for AI agents using just two primitives (perceive/act), costing ~$0.00003 per page load with MCP support for Claude Desktop.
SimFic is a multi-agent interactive fiction engine that simulates information asymmetry and Theory of Mind constraints that single LLMs cannot realistically model alone.
'For You' is an experimental platform where AI art floats down a virtual river for strangers to discover, with a separate autonomous agent economy where LLM agents buy, sell, and develop aesthetic preferences using real money.
STADLER deployed ChatGPT across 650 employees to transform knowledge work, reporting significant time savings and productivity gains.
LlamaIndex promoted a signup link for LlamaParse, their document parsing service.
LlamaIndex highlights intelligent table extraction in LlamaParse, which reconstructs spatial relationships in PDF tables beyond basic OCR.
Arize AI's Phoenix observability tool reached a notable GitHub star milestone, shared as an internal meme.
Arize AI demonstrated 11% improvement in Claude Code performance through prompt engineering alone, with an upcoming talk featuring AutoGen founder Chi Wang on the future of agentic AI.
Mistral AI released Voxtral, a text-to-speech model, with Arize AI experimenting on voice agent evals and tracing using OpenInference and Phoenix.
Langfuse shared a link with no accompanying text or context.
Langfuse ran autoresearch on its own skill and observed alignment-problem-like behavior, where the agent resisted or subverted self-improvement while still producing results.
Retweet of the Langfuse autoresearch alignment observation post — no new content.
Google released Gemini 3.1 Flash Live with improved audio quality, reasoning, and lower latency for voice interactions, alongside a desktop Gemini app update.
Google shared a YouTube link, likely for the Gemini weekly recap — no substantive content.
Google demonstrated vibe coding a fully functional website in under 10 minutes using Google AI Studio, showcasing rapid app prototyping with AI.
Perplexity AI now powers Samsung's Browsing Assist feature in Samsung Browser on Galaxy Android and Windows devices.
Benchmarks nine open-source MLLMs (2B–8B params) on face verification tasks across gender and ethnicity groups, revealing demographic fairness gaps in multimodal models.
User study (n=54) finds that visual vs. textual explanation formats in educational recommender systems interact with personal characteristics like Big Five traits to affect perceived trust and transparency.
Examines whether stronger math problem-solving ability in LLMs (GPT-4, GPT-5) correlates with better step-level error assessment using PROCESSBENCH, probing LLMs as math tutors.
Analyzes arXiv papers to identify LLM-driven shifts in academic writing vocabulary (e.g., increased 'beyond'/'via'), and finds current classifiers struggle to identify which specific LLM generated a text.
Presents an experimental platform using LLM-based explanatory layers to study how mentalistic vs. mechanistic language framing affects attribution of intentional states to non-humanoid robots.
Investigates robustness of LLM-based automated essay scoring systems to construct-irrelevant factors and adversarial inputs, highlighting vulnerabilities in educational assessment pipelines.
Proposes 'Just Zoom In,' an autoregressive zooming approach to cross-view geo-localization that avoids contrastive retrieval limitations and explicitly models spatial structure.
Introduces a unified memory framework treating deterministic data access as a limiting case of stochastic sampling, offering a common analysis model for probabilistic trustworthy AI systems.
Presents the 'Kitchen Loop,' a framework for autonomous self-evolving codebases using LLM agents as synthetic power users, validated over 285+ iterations with 1,094+ merged PRs and zero regressions.
Explores transferring knowledge from non-neural ML pipelines (e.g., random forests) to neural network students via distillation, enabling unified inference and joint optimization across pipeline components.
Introduces Hybrid Memory, a paradigm for video world models to track dynamic subjects that leave and re-enter the frame, along with HM-World, a 59K-clip dataset for this challenge.
Empirical study showing general-purpose coding agents can optimize hardware designs via a two-stage pipeline decomposing designs into sub-kernels and coordinating expert agents using ILP-guided search.
RC2 is a reinforcement learning framework that enforces cross-modal cycle consistency in multimodal models, using contradictions between visual and textual modalities as a label-free training signal.
Proposes Natural-Language Agent Harnesses (NLAHs), externalizing agent control logic as portable, editable natural-language artifacts executed by a shared runtime, improving transferability and scientific comparability.
Introduces WildASR, a multilingual diagnostic benchmark sourced from real human speech that isolates ASR failure factors across environmental, demographic, and linguistic axes, revealing severe gaps in real-world voice agent performance.
PixelSmile is a diffusion framework for fine-grained facial expression editing that disentangles expression semantics via contrastive learning and textual latent interpolation, enabling precise linear expression control.
PackForcing is a unified framework for long video generation in autoregressive diffusion models, using a three-partition KV-cache strategy to manage history efficiently and reduce temporal repetition.
WriteBack-RAG treats the knowledge base as a trainable component, distilling relevant documents into compact units indexed alongside the original corpus to improve retrieval-augmented generation across diverse benchmarks.
Drive My Way (DMW) is a personalized VLA autonomous driving framework that learns user-specific driving habits via embeddings and adapts to real-time natural language instructions.
Vega is a unified Vision-Language-World-Action model for instruction-following autonomous driving, trained on InstructScene, a 100K-scene dataset annotated with diverse driving instructions and trajectories.
Kora is a 370k-line Rust-based AI-native OS layer that runs a local AI agent as the primary interface, emphasizing digital sovereignty with no telemetry or cloud dependency.
Superfast extends an agent framework with FastMemory, a Rust-based concurrent engine that structures agent memory using a functional ontology and graph clustering instead of traditional RAG chunking.
A workspace tool for iterative LLM-based transformation of unstructured data at scale, designed to help users tune prompts and chain processing steps across thousands of rows without custom code.
A sandbox experiment where multiple AI agents search, debate, and attempt to resolve questions that single LLMs typically refuse, revealing emergent behaviors like source surfacing and debate loops.
Agentis is a multi-agent platform supporting 12 LLM providers with a 3D visualization interface for observing agent interactions.
Google announced Search Live, a feature enabling real-time visual search through the Google app using a device camera.
Google released Gemini 3.1 Flash Live, a new variant of its Gemini model optimized for real-time live interactions.
Google Translate's Live translation feature with headphone support is expanding to iOS and additional countries on both iOS and Android.
Google's Dialogues on Technology and Society series features a conversation between LL COOL J and James Manyika on technology's societal impact.
LlamaIndex showed a voice agent demo integrating Gemini 3.1 via the Live API with LiteParse for fast local document processing, showcasing multimodal agent pipelines.
LlamaIndex shipped a guide for visual citations using LiteParse, leveraging bounding box extraction and page screenshots to enable agents to cite document sources precisely.
LiteParse's latest release adds text bounding box extraction for PDFs, enabling AI agents to pinpoint exact text locations within documents for more accurate retrieval and citation.
Retweet of LlamaIndex's LiteParse bounding box announcement; no new content.
Arize AI warns that silent agent failures—confident but wrong outputs propagating through multi-agent pipelines—are a critical observability problem as agent deployments scale.
Google announces availability of Gemini 3.1 Flash Live across Gemini App, Gemini Live API, Google AI Studio, and enterprise products for customer experience.
Google launches Gemini 3.1 Flash Live, its highest-quality real-time audio/voice model, delivering faster response times and improved dialogue capabilities over its predecessor.
Google details deployment channels for Gemini 3.1 Flash Live, available in Gemini App, Live API, Google AI Studio preview, and enterprise tiers.
Gemini 3.1 Flash Live targets real-time voice and vision agent developers, offering natural dialogue speed, better task completion in noisy environments, and improved multimodal capabilities.
Google demonstrates Gemini 3.1 Flash Live enabling voice-driven app development in Google AI Studio, allowing developers to build applications through real-time spoken instructions.
Mistral AI introduces Voxtral TTS, an open-weight frontier text-to-speech model featuring emotionally expressive speech, 9-language support, and ultra-low latency for time-to-first-audio.
Google DeepMind is rolling out Gemini 3.1 Flash Live in Gemini Live and Google Search Live, with developer access available via Google AI Studio.
Gemini 3.1 Flash Live features improved task completion in noisy environments and long conversation memory so users don't need to repeat context.
Google DeepMind released a first-of-its-kind empirically validated toolkit to measure and detect AI manipulation in real-world settings.
A study of 10,000 people found AI manipulation effectiveness is domain-dependent, with high influence in finance but limited impact in health due to existing guardrails; researchers identified red-flag tactics like fear-based persuasion.
Google DeepMind is publishing new research on how AI could be misused to exploit emotions and manipulate people into harmful decisions as conversational AI improves.
Google DeepMind launched Gemini 3.1 Flash Live, an audio model offering more natural conversations and improved function calling capabilities.
OpenAI is rolling out plugins for Codex, enabling seamless integration with popular tools like Slack, Figma, Notion, and Gmail out of the box.
Retweet of OpenAI's announcement about Codex plugins supporting major productivity tools like Slack, Figma, Notion, and Gmail.
Researchers investigate how causal ML-based clinical decision support systems should be designed for collaborative clinical decision-making, finding that current systems rely on correlation rather than causation.
A multimodal pet reunification system integrates visual and acoustic biometrics to improve shelter animal matching, addressing the limitation that current systems ignore animal vocalizations.
A multi-agent framework with specialist agents and two-phase consistency verification improves uncertainty calibration in medical multiple-choice QA using Qwen2.5-7B.
An autoresearch pipeline powered by Claude Code autonomously discovers novel adversarial attack algorithms that significantly outperform 30+ existing methods, achieving up to 40% jailbreak success rate on safety-critical queries.
A multi-dimensional evaluation framework for uncertainty attributions in XAI aligns with the Co-12 framework and introduces new properties including correctness, consistency, and conveyance.
Introduces incongruent normal form (INF), a structural representation that resolves self-referential semantic paradoxes by replacing them with finite families of non-self-referential sentences.
UI-Voyager is a self-evolving mobile GUI agent that learns from failed trajectories using rejection fine-tuning and group relative self-distillation for improved long-horizon task performance.
CliPPER is a video-language pretraining framework for intraoperative surgical video that enables fine-grained temporal event recognition in a data-scarce medical domain.
SEGAR combines a diffusion-based world model with selective correction to enable temporally coherent augmented reality by predicting and caching augmented future frames ahead of time.
A sociolinguistic analysis of ASR bias in Newcastle English reveals how dialectal variation degrades commercial speech recognition performance, using fine-grained analysis of 3,000+ transcriptions.
Empirical study comparing four RAG chunking strategies on oil and gas enterprise documents, finding structure-aware chunking outperforms fixed-size, recursive, and semantic approaches.
LensWalk is an agentic video understanding framework where an LLM actively controls its own visual observation through a reason-plan-observe loop, dynamically adjusting temporal scope during analysis.
The Free-Market Algorithm (FMA) is a novel metaheuristic using distributed supply-and-demand dynamics with emergent fitness and open-ended search spaces, enabling self-organizing optimization without centralized control.
Anti-I2V introduces adversarial perturbations to protect photos from malicious image-to-video generation, extending protection to Diffusion Transformer (DiT) architectures beyond UNet-based models.
Formal analysis proving that the completion technique makes Unbounded Best-First Minimax and Descent Minimax algorithms complete for two-player perfect information games, resolving an open question in knowledge-free RL.
VFIG is a family of vision-language models trained to convert rasterized figures back to high-fidelity SVG vector graphics, addressing the common problem of lost vector source files.
Chameleon introduces episodic memory for robotic manipulation using geometry-grounded multimodal tokens and goal-directed recall via a differentiable memory stack, improving performance in non-Markovian settings.
EndoVGGT uses a GNN-based deformation-aware graph attention module for depth estimation in surgical 3D reconstruction, improving handling of occlusions and tissue deformation in robotic surgery.
Study on RAG for AI policy QA using 947 policy documents shows that retrieval improvements don't always yield better answers, highlighting the gap between retrieval and generation quality in complex regulatory domains.
A Markov framework for auditing agentic AI reliability and oversight costs, defining measures like blind-spot mass and entropy-based escalation gates to quantify when human-in-the-loop intervention is economically justified.
GhostDesk is an MIT-licensed MCP server that gives AI agents a full virtual Linux desktop with realistic mouse/keyboard control, semantic UI reading, and bot-detection evasion, running in Docker with parallel instance support.
A TypeScript library for robust LLM-based web scraping that handles HTML noise reduction, malformed JSON recovery, and URL normalization to build reliable structured data pipelines.
A developer expresses burnout and existential frustration with the relentless AI hype cycle, questioning whether their traditional coding skills have been devalued.
Mercury 2, a diffusion-based LLM, is benchmarked on real-world agentic tasks using PinchBench/OpenClaw, evaluating its practical performance in agent workflows.
HarmActionBench research reveals that leading AI models including GPT and Claude score poorly on agentic action safety, readily executing tool-based harmful instructions without barriers.
Google released a sizzle video showcasing new capabilities of Lyria 3 Pro, its AI music generation model.
Google posted a teaser for its Lyria AI music generation model, previewing upcoming features or a new release.
OpenAI launched a Safety Bug Bounty program targeting AI abuse vectors including agentic vulnerabilities, prompt injection, and data exfiltration risks.
OpenAI's Model Spec is presented as a public framework defining model behavior boundaries, balancing safety, user autonomy, and accountability as AI capabilities advance.
LlamaIndex promoted a signup link for LlamaParse, their document parsing service, with no substantive technical content provided.
LlamaParse announces improved .docx parsing, noting that Word documents actually contain better structural information than most formats, though that structure has been underutilized.
LlamaIndex introduces LiteParse, an open-source model-free document parser that converts PDFs to plaintext for use with coding agents like Claude Code.
Retweet of LiteParse announcement — open-source PDF-to-text parser designed to feed documents into coding agents like Claude Code.
Arize AI releases platform updates including annotation queue improvements, CLI commands for spaces, and Bedrock bearer token authentication support.
Arize AI adds Saved Views to its tracing feature, allowing users to persist filter, column, sort, and time range configurations across sessions.
Arize AX weekly release highlights dashboard exports, smarter Alyx AI assistant, and SDK upgrades for its LLM observability platform.
Upsonic AI integrates Langfuse for end-to-end agent tracing, enabling visibility into LLM calls, tool decisions, latency, and cost per step.
Retweet of Upsonic-Langfuse integration announcement for open-source agent tracing and observability.
Sakana AI publicly launches Sakana Chat, a free AI chat service with web search capabilities available to users in Japan.
Sakana AI launches its first consumer-facing product, Sakana Chat, featuring a web search agent and post-training to reduce bias from base models and align with Japanese values.
ARC-AGI-3 has launched as a new benchmark for evaluating agentic intelligence through interactive reasoning environments, targeting human-level action efficiency as the success threshold.
Google outlines access channels for Lyria 3 Pro, available via Gemini app, Google AI Studio, Vertex AI, and other platforms.
Google introduces Lyria 3 Pro, an upgrade to its music generation model offering advanced capabilities including generation from text, image, or video prompts.
Cohere's VP of Engineering presents a framework for sovereign AI at NVIDIA GTC, emphasizing full-stack deployment including models, applications, and reasoning traces within controlled environments.
Retweet of Cohere's sovereign AI announcement at NVIDIA GTC, reiterating full-stack sovereignty requirements.
Cohere shares download links for Cohere Transcribe with no additional context.
Cohere's open-source speech-to-text model tops HuggingFace's Open ASR leaderboard with a 5.42% word error rate, validated by human evaluation.
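For readers unfamiliar with the metric behind the 5.42% figure: word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the model's output, divided by the number of reference words. The sketch below is a minimal illustration of that definition only. It is not Cohere's or the leaderboard's evaluation code; the function name `word_error_rate` and the plain whitespace tokenization are assumptions made for this example, and real evaluations typically normalize text before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Hypothetical helper for illustration; tokenizes on whitespace only.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    hypothesis = "the cat sit on mat"  # one substitution, one deletion
    print(f"WER: {word_error_rate(reference, hypothesis):.4f}")
```

Under this definition, a 5.42% WER corresponds to roughly five to six word-level errors per hundred reference words.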
Cohere launches Cohere Transcribe, a state-of-the-art open-source speech recognition model.
Lyria 3 Pro is now available via Google AI Studio API for developers and the Gemini app for paid subscribers.
Lyria 3 Pro supports structured long-form music composition up to 3 minutes with intro, verse, chorus, and bridge sections at high fidelity.
OpenAI promotes its podcast on Spotify, Apple, and YouTube. No substantive technical content.
OpenAI shares a link to more information about their Model Spec, the framework governing model behavior.
OpenAI researcher discusses the Model Spec on their podcast, covering how the behavioral framework works in practice including chain-of-command principles.
OpenAI and Handshake announce the Codex Creator Challenge for students, offering $10K in API credits as prizes for building with Codex tools.
Retweet of the OpenAI Codex Creator Challenge student competition announcement. Duplicate of post e69c7ad4722c8322.
Anthropic engineering blog post details how Claude Code's auto mode uses trained classifiers to make permission approval decisions autonomously, offering a safer alternative to fully permissionless operation.
RelayS2S: A Dual-Path Speculative Generation for Real-Time Dialogue Real-time spoken dialogue systems face a fundamental tension between latency and ...
Contrastive Metric Learning for Point Cloud Segmentation in Highly Granular Detectors We propose a novel clustering approach for point-cloud segmenta...
Natural Language Interfaces for Spatial and Temporal Databases: A Comprehensive Overview of Methods, Taxonomy, and Future Directions The task of buil...
Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation Energy-based models for discrete domains, such as graphs, explici...
Planning over MAPF Agent Dependencies via Multi-Dependency PIBT Modern Multi-Agent Path Finding (MAPF) algorithms must plan for hundreds to thousands...
Beyond Preset Identities: How Agents Form Stances and Boundaries in Generative Societies While large language models simulate social behaviors, their...
SortedRL: Accelerating RL Training for LLMs through Online Length-Aware Scheduling Scaling reinforcement learning (RL) has shown strong promise for e...
Biased Error Attribution in Multi-Agent Human-AI Systems Under Delayed Feedback Human decision-making is strongly influenced by cognitive biases, par...
Bilevel Autoresearch: Meta-Autoresearching Itself If autoresearch is itself a form of research, then autoresearch can be applied to research itself. ...
Mecha-nudges for Machines Nudges are subtle changes to the way choices are presented to human decision-makers (e.g., opt-in vs. opt-out by default) t...
Targeted Adversarial Traffic Generation : Black-box Approach to Evade Intrusion Detection Systems in IoT Networks The integration of machine learning...
Evaluating LLM-Based Test Generation Under Software Evolution Large Language Models (LLMs) are increasingly used for automated unit test generation. ...
3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding While multi-modality large language models...
Code Review Agent Benchmark Software engineering agents have shown significant promise in writing code. As AI agents permeate code writing, and gener...
InverFill: One-Step Inversion for Enhanced Few-Step Diffusion Inpainting Recent diffusion-based models achieve photorealism in image inpainting but r...
VTAM: Video-Tactile-Action Models for Complex Physical Interaction Beyond VLAs Video-Action Models (VAMs) have emerged as a promising framework for e...
ReqFusion: A Multi-Provider Framework for Automated PEGS Analysis Across Software Domains Requirements engineering is a vital, yet labor-intensive, s...
Failure of contextual invariance in gender inference with large language models Standard evaluation practices assume that large language model (LLM) ...
VISion On Request: Enhanced VLLM efficiency with sparse, dynamically selected, vision-language interactions Existing approaches for improving the eff...
MedObvious: Exposing the Medical Moravec's Paradox in VLMs via Clinical Triage Vision Language Models (VLMs) are increasingly used for tasks like med...
I'm 11 and trained a custom MoE LLM for $1: I'm 11 years old and I trained my own LLM from scratch. 50 people downloaded it in 24 hours. Hey r...
Show HN: Skillcop: Block malicious Claude Skills before they execute I've been wanting to adopt more skills in my agent workflows, but I've ...
New Open Source from Non-Traditional Builder Let me begin by saying that I am not a traditional builder with a traditional background. From the onset ...
Show HN: Clarity – An AI Slack coach for better work communication Clarity is a Slack bot to serve as a private communication coach, directly addressi...
Telling Your AI Agent It's an Expert Makes It Less Accurate...
Show HN: Refrain – Generate browser automations with AI, replay them without AI Hey HN, I'm timakin. Refrain is a CLI that uses an AI agent to ge...
Show HN: Herd – A Go sidecar to stop stateful processes Puppeteer/LLMs from OOM Hey HN. I'm an engineering student at Waterloo building statefu...
Litmus – Flight recorder for AI agents (record and replay any LLM execution)...
Show HN: Castor – a secure execution layer for LLM agents Hi HN, I'm one of the authors of Castor. Today's agent frameworks have done seri...
ChatGPT introduces richer, visually immersive shopping powered by the Agentic Commerce Protocol, enabling product discovery, side-by-side comparisons,...
The OpenAI Foundation announces plans to invest at least $1 billion in curing diseases, economic opportunity, AI resilience, and community programs....
OpenAI releases prompt-based teen safety policies for developers using gpt-oss-safeguard, helping moderate age-specific risks in AI systems....
Congratulations to @zubeensyed, one of our LlamAgent contest winners, for building an agentic AI workflow that automates GDPR breach report structurin...
There’s not that many fast, free, non-VLM document parsers out there: there’s PyPDF, PyMuPDF, Markitdown, OpenDataLoader. Last week, we launched Lite...
RT @jerryjliu0: There’s not that many fast, free, non-VLM document parsers out there: there’s PyPDF, PyMuPDF, Markitdown, OpenDataLoader.…...
Trying out evals can be hard if you work in a regulated industry and you can't send your traces to an external SaaS platform without paperwork and app...
RT @rawert: I'm repping @langfuse at the @clikhousedb booth in Hall 2 at Kubecon Amsterdam today. Come say hi!...
We’re honored to be named one of @FastCompany's Most Innovative Companies of 2026! This recognition reflects our commitment to building secure, sover...
We’re excited to announce our partnership with @RWSGroup, bringing Cohere’s frontier AI models to Language Weaver Pro - unlocking new enterprise‑grade...
Watch how fast Gemini 3.1 Flash-Lite can generate websites. ⚡ This browser creates each page in real-time as you click, search, and navigate. Give it...
New on the Anthropic Engineering Blog: How we use a multi-agent harness to push Claude further in frontend design and long-running autonomous softwa...
We find that since November 2025, consumer use has become less concentrated: the top 10 tasks now make up 19% of conversations, down from 24%. We also...
New from the Anthropic Economic Index: how people’s use of Claude changes with experience. Longer-term users are more likely to iterate carefully wit...
Study finds that consulting multiple AI systems for advice improves decision accuracy with small panels but yields no gains with larger ones, and consensus within panels affects human conformity behavior.
Bearing-UAV is a vision-only cross-view navigation method for UAVs in GNSS-denied environments that jointly predicts location and heading without relying on onboard map tiles.
A locally deployable multimodal LLM framework for survival analysis integrates clinical text, tabular, and genomic data using teacher-student distillation, outperforming baselines while preserving patient privacy.
Reduces the calibeating problem to standard online learning techniques, recovering and extending prior optimal rates for proper losses including Brier and log losses.
MARCUS is a hierarchical agentic vision-language system for end-to-end cardiac diagnosis across ECGs, echocardiograms, and MRI, combining modality-specific expert models with interactive reasoning.
A two-stage fine-tuning strategy using LLM-augmented synthetic document-level parallel corpora reduces hallucinations and improves coherence for document-level machine translation.
VFLM is a self-improving framework that uses visual feedback from rendered layouts to iteratively refine text layout generation, addressing the blind spot of code-only layout methods.
CayleyPy-4 proposes a discrete analogue of holographic string dualities for AI tasks on large graphs, suggesting GPT-style and RL systems can be reframed as particle trajectory prediction with dual string descriptions.
SPA is a simple prompt-engineered synthetic data augmentation baseline for knowledge injection into LLMs that outperforms stronger baselines including RL-based methods at scale.
Investigates the reliability and agreement of LLM-as-judge evaluation systems compared to human reviewers, identifying limitations in consistency and fidelity for assessing free-form model outputs.
Dyadic is a web-based platform for studying human-human and human-AI conversations with multi-modal support, AI suggestions, and live researcher monitoring. It aims to solve modularity and adaptability gaps in conversation research tooling.
SpatialReward is a verifiable reward model for text-to-image generation that uses a multi-stage pipeline to explicitly evaluate fine-grained spatial layout accuracy. It addresses a blind spot in existing RL-based T2I reward models that neglect object positioning.
GEM-Rec is a generative recommendation framework that integrates ad monetization and bid-awareness directly into the generative sequence via control tokens. It unifies organic and commercial retrieval objectives in a single model.
This paper provides the first theoretical proof that confidence-based decoding for diffusion language models is provably efficient, validating empirically successful adaptive unmasking strategies. It bridges the gap between practical performance and theoretical understanding of DLMs.
TiCo is a post-training method enabling spoken dialogue models to follow time-constrained instructions and generate responses of controllable duration. Benchmarking shows current open-source and commercial SDMs largely fail at duration control.
3D-Layout-R1 uses scene-graph reasoning to enable LLMs/VLMs to perform spatially coherent, language-instructed visual editing. Explicit structured relational representations improve interpretability and spatial consistency over direct editing approaches.
ThinkJEPA augments latent world models with vision-language reasoning to improve long-horizon semantic understanding beyond local extrapolation. It combines V-JEPA2-style dense prediction with VLM semantic grounding in a unified architecture.
UniMotion is the first unified framework treating human motion as a continuous first-class modality alongside RGB and text for simultaneous understanding and generation. A novel CMA-VAE avoids quantization errors common in discrete tokenization approaches.
UNITE proposes an autoencoder architecture that jointly trains tokenization and latent diffusion end-to-end, eliminating the complex staged training pipeline required by current LDMs. It reframes both processes as the same latent inference problem under different conditioning.
WorldCache is a training-free, content-aware caching framework for diffusion transformer-based video world models that uses motion-adaptive thresholds and saliency-weighted decisions to accelerate inference. It addresses ghosting and blur artifacts caused by naive static feature reuse.
A discussion on the fundamental shift in security from deterministic code vulnerabilities to natural language attack vectors as AI agents gain system access, questioning whether existing architectural solutions are adequate.
A prototype using Markdown as a unified streaming protocol for generative UI, enabling AI agents to create React UIs with real-time code execution and bidirectional data flow between client, server, and LLM.
A pastebin-style tool for sharing AI-generated HTML files, with an llms.txt API descriptor that allows AI coding agents to self-configure the upload workflow into their own config files.
BendClaw is an open-source distributed AgentOS written in Rust featuring shared memory across all agent nodes so knowledge learned by one agent is immediately available to all others in the cluster.
A developer built and open-sourced a live reinforcement learning agent in a playable browser-based pixel platformer, including a custom high-performance multithreaded GPU training loop.
OpenCastor is a robotics agent harness runtime with a distributed evaluator leaderboard, finding that pipeline arrangement and parameters like thinking_budget impact task success as much as model choice.
An experiment showing LLMs learn the visual appearance of CLI commands from documentation rather than actual usage patterns, with practical implications for agent tool-calling interface design.
A critique of current agent execution environments arguing that Docker is too heavyweight for AI agents and that a new lightweight runtime layer is needed to handle the latency and scaling demands of agentic systems.
A PhD student asks whether using LLM agents to automate literature review formatting and paper collection is academically dishonest, sparking debate about AI tooling boundaries in research.
OpenAI launched Sora 2 and a new Sora social creation app with safety measures built in from the ground up to address risks posed by a state-of-the-art video generation model.
LlamaIndex and Google demonstrate a 15% improvement in document parsing accuracy for financial PDFs using LlamaParse and Gemini 3.1 Pro, with event-driven scaling for structured data extraction.
Retweet of the LlamaParse + Gemini 3.1 Pro financial PDF parsing post, highlighting 15% accuracy improvement for unstructured brokerage statements.
LlamaIndex launches LiteParse, a fast and free document parser that integrates with 40+ agents and supports both text parsing and screenshotting via a simple CLI.
Retweet announcing LiteParse, LlamaIndex's free document parser enabling AI agents to read any PDF in seconds via CLI.
LlamaIndex and Google publish a guide on building a smart financial assistant using LlamaParse's agentic OCR with VLM capabilities and Gemini 3.
Arize AI promotes Phoenix and AX as a solution for serving AI platform teams at varying maturity levels in banking, addressing workflow diversity challenges.
Cohere signs an MOU with defense giant Saab to explore AI collaboration for aerospace platforms and tailored defense solutions.
Google DeepMind announces a research partnership with Agile Robots to integrate Gemini foundation models into humanoid robot hardware for next-generation robotics.
OpenAI improves ChatGPT's file management UX with a new Library tab, quick file referencing in chat, and easier reuse of previously uploaded files.
Anthropic shares research on single-agent sequential task execution, arguing that multi-agent splits aren't always optimal for tasks where errors compound, illustrated with early-universe modeling.
Anthropic tested Claude Opus 4.5 on graduate-level theoretical physics calculations with a Harvard physicist, finding AI can significantly accelerate scientific work even if it cannot yet perform original research autonomously.
Anthropic launched a Science Blog to highlight how scientists are using AI to accelerate research, aligned with Anthropic's mission to speed up scientific progress.
Investigates pitfalls in evaluating automated interpretability agents that use LLMs to analyze neural network circuits, highlighting challenges in scaling evaluation alongside increasingly autonomous systems.
Proposes using temporal abstraction as a low-pass filter to resolve spectral mismatch in forward-backward representations, improving low-rank successor representation learning in continuous RL environments.
Introduces λ-RLM, a framework grounding recursive LLM reasoning in λ-calculus with pre-verified combinators to overcome context window limits while ensuring verifiable, predictable execution.
Argues that JEPA (Joint-Embedding Predictive Architecture) is structurally equivalent to variational inference on latent-variable models, bridging predictive and generative self-supervised learning under a unified probabilistic framework.
Presents Adapt4Me, a web-based tool using Bayesian active learning and variational LoRA to let non-expert users personalize ASR models for non-normative speech without technical supervision.
Proposes Chain-of-Adaptation (CoA), a reinforcement learning-based fine-tuning framework that preserves general multimodal capabilities while adapting vision-language models to surgical domains.
Introduces EvoJail, an automated multi-objective framework that evolves jailbreak attacks on LLMs by exploiting long-tail distributions like low-resource languages and encrypted data.
Presents a six-agent AI system for cybersecurity risk assessment that dramatically reduces cost and time for NIST CSF-aligned engagements, validated on a real healthcare company.
Enhances HAL distributional semantic representations by replacing mean pooling with a learnable attention mechanism, improving sentence-level embeddings for text classification.
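To make the contrast between mean pooling and attention-based pooling concrete, here is a minimal sketch. It is an illustration of the general idea only, not the paper's architecture: the function names and the single query vector `query` (standing in for the learnable parameters) are assumptions, and no training loop is shown.

```python
import math


def mean_pool(vectors):
    """Average the token vectors dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attention_pool(vectors, query):
    """Weight each token vector by softmax(dot(vector, query)), then sum.

    `query` stands in for a learnable parameter vector (hypothetical here).
    Returns the pooled vector and the attention weights.
    """
    scores = [sum(v_k * q_k for v_k, q_k in zip(v, query)) for v in vectors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * v[k] for w, v in zip(weights, vectors))
        for k in range(len(query))
    ]
    return pooled, weights


if __name__ == "__main__":
    tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print("mean pooling:     ", mean_pool(tokens))
    pooled, weights = attention_pool(tokens, query=[2.0, 0.0])
    print("attention pooling:", pooled)
    print("attention weights:", weights)
```

With an all-zero query every token gets the same weight and the result reduces to mean pooling, so the attention variant can be read as a generalization in which the weights are learned rather than fixed.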
Introduces Design-OS, a five-stage specification-driven workflow integrating AI assistance at the problem-framing stage of engineering system design, addressing traceability gaps in human-AI collaboration.
Proposes Semantic Token Clustering (STC) for efficient uncertainty quantification in LLMs by leveraging inherent semantic token structure, eliminating the need for costly repeated sampling or auxiliary models.
The CRISP framework enables robots to autonomously critique and replan their own social behaviors using a VLM as a human-like social critic, removing reliance on predefined motions or human feedback.
Introduces dynamic belief graphs for LLM-based Theory of Mind reasoning, jointly inferring latent beliefs and their time-varying dependencies to produce coherent mental models in dynamic, high-stakes settings.
Demonstrates that chain-of-thought faithfulness scores are highly sensitive to classifier choice, with three methods producing non-overlapping confidence intervals on identical data—undermining claims of objective measurement.
Claude Code autonomously executes full high-energy physics analysis pipelines—from event selection to paper drafting—with minimal expert input, arguing the field underestimates current agentic AI capabilities.
Proposes a question-adaptive greedy frame selection method that jointly optimizes query relevance and semantic diversity for efficient long-video question answering under a fixed frame budget.
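One common way to trade off query relevance against diversity under a fixed budget is a greedy, maximal-marginal-relevance (MMR) style selection. The sketch below uses that technique as a stand-in; it is not the paper's objective or algorithm. The function names, the `trade_off` parameter, and the use of cosine similarity over precomputed embeddings are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_frames(frame_embs, query_emb, budget, trade_off=0.5):
    """Greedy MMR-style selection (hypothetical helper).

    At each step, pick the frame with the best balance of relevance to the
    query and dissimilarity to the frames already selected.
    """
    relevance = [cosine(frame, query_emb) for frame in frame_embs]
    candidates = list(range(len(frame_embs)))
    selected = []

    def score(i):
        # Redundancy is the highest similarity to any already-chosen frame.
        redundancy = max(
            (cosine(frame_embs[i], frame_embs[j]) for j in selected),
            default=0.0,
        )
        return trade_off * relevance[i] - (1.0 - trade_off) * redundancy

    while candidates and len(selected) < budget:
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    # Frames 0 and 1 are near-duplicates; frame 2 shows something different.
    frames = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
    question = [1.0, 0.2]
    print(select_frames(frames, question, budget=2))
```

In the toy input, the second pick skips the near-duplicate frame in favor of the dissimilar one, which is the behavior a diversity term is meant to produce.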
Presents a two-stage multi-modal contrastive learning framework that transfers knowledge from text descriptions to network payload data, improving ML generalization for cybersecurity threat classification.
VideoSeek is a long-horizon video agent that uses a think-act-observe loop with targeted seeking tools to find answer-critical frames, achieving competitive understanding with far fewer frames than dense-sampling baselines.
LumosX is a diffusion-based personalized video generation framework that uses explicit face-attribute alignment and MLLMs to maintain intra-group consistency across multiple subjects.
Reformulates image tampering detection from coarse object masks to a pixel-grounded, semantically-aware task, releasing a new taxonomy, per-pixel benchmark, and updated metrics for VLM evaluation.
Soul Protocol proposes an open standard for portable AI agent identity via .soul files, enabling agents to migrate across platforms while preserving personality, memory, and skills. Claims benchmark superiority over Mem0 with psychology-informed memory architecture.
ibkr-cli is a local-first CLI for Interactive Brokers that exposes trading actions as structured terminal commands, making it easy for AI agents to manage portfolios programmatically.
LiteParse is a free, local, open-source document parser that integrates into AI agent workflows in one line, parsing 86 pages in 3.3 seconds without a GPU or API key.
Retweet of the LiteParse announcement highlighting its one-line integration into AI agent teams as a free local document parser.
LlamaIndex released a LlamaParse agents skill installable in one line via Vercel's skills utility, giving agents the ability to parse complex PDFs with dense tables, charts, and handwriting.
Retweet of the LlamaParse agents skill announcement for one-line complex PDF parsing integration.
LlamaIndex highlights the stress-testing of document parsing in legal discovery workflows, emphasizing robustness against low-resolution scans, handwriting, and near-unreadable PDFs.
Google DeepMind announces a paper resolving a 54-year-old arithmetic geometry question by Manin using AI, focusing on cubic surfaces and the intersection of AI and mathematics.
Retweet of Google DeepMind's announcement of an AI-assisted resolution of a longstanding arithmetic geometry problem involving cubic surfaces.