2026-W19
2026-05-10 — 2026-05-17
The week of May 10–17, 2026 was dominated by two converging forces: the rapid productionization of AI agents and an intensifying focus on enterprise observability. Multi-agent infrastructure emerged as the week's defining theme, with ArizeAI partnering with Deloitte Canada to help organizations move complex agent systems into production, and Langfuse announcing an in-person training session in San Francisco on bringing agents to production. The broader industry surfaced a consistent set of challenges at this frontier (context loss during agent handoffs, token consumption at scale, and the difficulty of evaluating multi-turn consistency), and research papers such as SRT (Self-Recall Thinking) and SDAR addressed parts of these gaps, the former by retrieving relevant historical turns and the latter by stabilizing RL-based agent training. OpenAI continued its enterprise push with Codex use case showcases across sales, data science, and business operations, while Databricks adopted GPT-5.5 for enterprise agent workflows after the model achieved state-of-the-art results on the OfficeQA Pro benchmark, signaling growing confidence in deploying frontier models within production data environments.
On the model and research side, open-weight releases remained active. INF's Infinity-Parser2-Pro (35B) and Flash (2B) topped ParseBench for document understanding, and Cohere's Compass demonstrated retrieval from unstructured data, including scans of handwritten and typed documents; both point toward growing demand for high-fidelity document ingestion in enterprise pipelines. A notable research result from Emergence World drew attention by evaluating LLMs through simulated society governance, finding stark behavioral differences between models: Claude built a democracy with zero crimes, while Grok's simulation descended into chaos within 48 hours. Separately, Nous Research's Hermes Agent added X post search and support for X Premium and Grok subscriptions, reflecting broader competition to embed AI agents deeper into consumer platforms.
Looking at the macro trends, the week reinforced that the industry has moved decisively from model capability debates toward deployment, governance, and evaluation infrastructure. The convergence of observability tooling (ArizeAI, Langfuse), agentic framework maturation (LlamaIndex finance agents, APWA distributed architecture), and governance research (the "audit gap" position paper on AI evaluation limits) suggests practitioners are grappling seriously with the gap between what agent systems can do and what can be reliably verified and monitored in production. Google I/O on May 19 loomed as a near-term catalyst, with the industry widely anticipating a new wave of multimodal and agentic announcements from DeepMind that could reshape the competitive landscape heading into summer.
All Posts This Week
A developer built OpenClaw, a minimalist self-hosted Telegram bot interfacing with a Pi AI agent harness, supporting shell commands, cron tasking, and session switching from mobile.
LlamaIndex outlines two categories of finance AI agents: repetitive back-office automation (invoices, KYC) and assistive agents, both requiring high-quality context engineering from documents.
Retweet of the LlamaIndex finance AI agents post about context engineering categories in back-office and assistive use cases.
LlamaIndex wrapped up participation at AI Engineer Singapore with a workshop, keynote, and executive dinner, previewing an upcoming SF World Fair appearance.
Langfuse shared a link with no accompanying text, content unknown.
Langfuse highlighted a recommended read on monitoring for LLM applications.
xAI prompted users to connect their X account to an unspecified service, likely Grok or Hermes Agent.
Hermes Agent (from Nous Research) now supports X Premium subscriptions and can search X posts, expanding its real-time data access capabilities.
Google DeepMind announced Google I/O on May 19, teasing new product updates and AI breakthroughs from the event.
Retweet of Google's reminder that Google I/O takes place on May 19, promising AI product announcements and breakthroughs.
Revkit.ai founder argues AI will replace the entire human-and-spreadsheet layer around Salesforce, not just improve it, representing a massive enterprise opportunity.
Emergence World evaluates LLMs by having them build and govern simulated societies; Claude built a democracy with zero crimes while Grok's world descended into chaos within 48 hours.
OpenAI showcases how sales teams can use Codex to automate pipeline briefs, meeting prep, and deal analysis from real work inputs.
ChatGPT is introducing a personal finance feature for Pro users in the US, allowing secure connection of financial accounts for AI-powered insights.
OpenAI demonstrates Codex use cases for data science teams, including root-cause briefs, KPI memos, and dashboard specs from real work inputs.
Databricks adopts GPT-5.5 for enterprise agent workflows after the model achieved state-of-the-art results on the OfficeQA Pro benchmark.
OpenAI highlights Codex capabilities for business operations teams, enabling automated creation of strategy updates, initiative briefs, and leadership decision packets.
OpenAI partners with Malta to expand AI access by offering ChatGPT Plus subscriptions and training programs to help citizens develop practical AI skills.
INF released two open-weight models (Infinity-Parser2-Pro 35B and Flash 2B) that top the ParseBench leaderboard for document understanding, trained on expanded synthetic data.
Repost of the INF Infinity-Parser2 model release announcement topping the document understanding leaderboard on HuggingFace's ParseBench.
ArizeAI announces partnership with Deloitte Canada to help enterprises move complex AI systems from experimentation into production-grade workflows with better observability.
ArizeAI highlights key challenges of scaling multi-agent systems in production, including context loss during handoffs and excessive token consumption.
ArizeAI partners with Deloitte Canada to help enterprises operationalize agent systems with tracing, evaluation, monitoring, and governance tooling.
ArizeAI joins MistralAI, CoderHQ, and Workato at the AWS Agentic AI Partner Showcase in SF to discuss what it takes to ship agents to production.
Langfuse is hosting an in-person training session in San Francisco on May 26th covering how to bring agents to production using Langfuse observability tools.
Retweet of Langfuse's in-person SF training announcement for bringing agents to production using Langfuse.
Langfuse shared a link with no accompanying text, providing minimal context about the content.
Langfuse posted a brief informal message expressing enthusiasm for traces, likely referencing their tracing/observability product.
Cohere highlights how its Compass product can search and retrieve information from unstructured data including scans of handwritten and typed declassified documents.
Grok subscribers can now use their subscription within the Nous Research Hermes Agent, expanding Grok's integration into third-party agent frameworks.
SRT (Self-Recall Thinking) is a framework that improves multi-turn dialogue consistency by identifying and retrieving relevant historical turns to resolve long-range dependencies without external memory or lossy summarization.
This paper studies how to design logging policies for off-policy evaluation, characterizing a reward-coverage tradeoff and deriving optimal policies to minimize OPE estimation error.
This paper reframes citation faithfulness in Agentic GraphRAG as a trajectory-level problem, showing that uncited but visited graph entities significantly influence answers and must be accounted for in provenance.
CLOVER addresses the training-evaluation mismatch in autonomous driving by using closed-loop value estimation and ranking to better score trajectory candidates beyond simple imitation learning.
A survey of 60 international students in the US reveals how they use conversational AI tools like ChatGPT to navigate cross-cultural adaptation challenges where institutional support is fragmented.
APWA introduces a distributed multi-agent architecture that enables high-throughput parallel processing of complex agentic workloads, addressing coordination and scaling bottlenecks in LLM-based multi-agent systems.
This work introduces the first quantization-conditioned attack that works against sophisticated quantization schemes by injecting outliers into model weights, enabling malicious behavior to emerge only after quantization.
Pelican-Unified 1.0 is an embodied foundation model that uses a single VLM for unified understanding, reasoning, and action, jointly generating future videos and actions in a single forward pass.
SDAR improves RL-based LLM agent training by incorporating self-distillation as a gated auxiliary objective, providing dense token-level supervision to stabilize multi-turn agentic learning.
MeMo encodes new knowledge into a dedicated modular memory model attached to a frozen LLM, enabling plug-and-play knowledge updates that avoid catastrophic forgetting without requiring access to LLM weights.
Position paper formalizing the 'audit gap' — the structural mismatch between what AI governance frameworks require (e.g., absence of hidden objectives) and what behavioral evaluations and red-teaming can actually verify from observable outputs alone.
Retrieval-augmented multimodal alignment framework that combines semantically rich clinical text with precisely timestamped EHR data to reconstruct accurate clinical timelines for conditions like sepsis.
EviScreen is an evidential reasoning framework for medical image disease screening that retrieves region-level evidence from historical cases via dual knowledge banks, improving both interpretability and predictive performance.
OpenDeepThink scales LLM reasoning breadth by sampling multiple candidate traces in parallel and selecting the best via pairwise Bradley-Terry ranking, bypassing the noise of pointwise LLM judging.
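To make the pairwise-ranking idea concrete, the following is a minimal sketch of fitting Bradley-Terry strengths from a table of pairwise wins and picking the top candidate. It is not OpenDeepThink's implementation: the `bradley_terry` function, the fixed iteration count, and the example win counts are all assumptions chosen for illustration, and in practice the wins would come from an LLM judge comparing two reasoning traces at a time.

```python
def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from pairwise outcomes.

    wins[i][j] is the number of comparisons in which candidate i beat
    candidate j. Returns strengths normalized to sum to 1.
    (Hypothetical helper, using the standard iterative MM update.)
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each opponent contributes (games played vs j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(updated)
        p = [x / norm for x in updated]
    return p


# Illustrative data only: three candidate traces, four comparisons per pair.
example_wins = [
    [0, 3, 4],  # candidate 0 beat candidate 1 three times, candidate 2 four times
    [1, 0, 3],
    [0, 1, 0],
]
strengths = bradley_terry(example_wins)
best = max(range(len(strengths)), key=strengths.__getitem__)
```

Ranking from pairwise outcomes asks the judge only "which of these two is better", which is the noise argument the summary makes against pointwise scoring.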
Shodh-MoE applies sparse mixture-of-experts routing to eliminate negative transfer and gradient conflict when co-training incompatible physics regimes in scientific ML foundation models.
PDI-Bench provides a quantitative framework for auditing geometric coherence in AI-generated videos by lifting 2D observations to 3D world-space and computing projective-geometry residuals.
VGGT-Edit enables native 3D scene editing via feed-forward residual field prediction, avoiding the blurry textures and cross-view inconsistencies typical of 2D-lifting editing pipelines.
FutureSim evaluates adaptive AI agents by replaying real-world news events chronologically past their knowledge cutoff, revealing clear capability separations among frontier agents forecasting a three-month period.
ATLAS unifies agentic (code/tool-call) and latent visual reasoning via a single discrete token, combining the generalization of agentic methods with the efficiency of latent reasoning while enabling autoregressive parallelization.
EntityBench introduces a 140-episode benchmark derived from real narrative media to evaluate entity consistency (characters, objects, locations) across long multi-shot video generation sequences.
A developer built orobot.io, a curated directory of 61 3D-printable robots with AI-generated descriptions and tips, supporting multiple hardware platforms like Raspberry Pi and Arduino.
OpenAI updated ChatGPT's safety systems to improve context-aware risk detection in sensitive conversations, enabling more nuanced and safer responses over time.
OpenAI's Codex is now accessible via the ChatGPT mobile app, allowing users to monitor, steer, and approve coding tasks in real time across remote environments.
Sea Limited's CPO shares how the company is rolling out Codex across engineering teams to drive AI-native software development across Asia.
MagicPath 2.0 launches as a multiplayer canvas enabling humans and AI coding agents like Codex and Claude Code to collaboratively design and build functional prototypes in real time.
LlamaIndex hosted two sold-out back-to-back developer events in NYC covering AI engineering, including a hands-on workshop led by founders.
Arize AI shares how their marketing team built a content engine that clones founder voices using AI trained on years of historical content.
Arize AI discusses how Cursor integrates AI observability into the developer workflow, highlighting the operational challenges at Cursor's scale.
Langfuse shared a link with no accompanying text, providing no analyzable content.
Langfuse launches Langfuse Academy, a free open educational resource covering the full AI engineering lifecycle including tracing, monitoring, evaluation, and experimentation.
Langfuse retweet containing only a URL with no additional context or content.
Langfuse post containing only a URL with no additional context or content.
Langfuse introduces the 'AI Engineering Loop', a structured process the best AI teams use to ship complex AI systems to production, with a supporting academy series.
Cohere posts a cryptic teaser ('the truth is out there') with URLs, suggesting an upcoming product or model announcement with no explicit details.
Perplexity AI expands its Snowflake integration to support dashboard and automation building for pipeline analysis and customer segmentation, with admin-level access controls.
Perplexity AI's 'Computer' product now connects to Snowflake, enabling natural-language querying of live warehouse data with SQL, source tables, and metrics — functioning as an on-call data science assistant.
xAI launches an early beta of Grok Build, an agentic CLI tool for coding, app building, and workflow automation, initially available to SuperGrok Heavy subscribers.
Google DeepMind and Kaggle announce a free 5-day AI Agents intensive course (June 15–19) featuring a new simulated capstone challenge called Kaggriculture, designed by Google researchers.
Retweet of Google DeepMind's announcement of the Kaggle AI Agents intensive course and Kaggriculture capstone challenge — no additional content.
OpenAI post containing only a URL with no additional context or content.
OpenAI shared a link via retweet from their developer account, but the content is a URL with no additional context.
OpenAI is rolling out the Codex mobile app preview on iOS and Android globally, with Windows phone-to-desktop support coming soon.
OpenAI launches Codex in the ChatGPT mobile app, enabling users to start tasks, review outputs, and steer agent execution remotely while Codex runs on a local machine.
Anthropic is partnering with the Gates Foundation, committing $200M in grants, Claude credits, and technical support toward global health, education, agriculture, and economic mobility.
Anthropic published a policy paper on US-China AI competition, arguing the US and democratic allies currently lead in frontier AI and outlining steps to maintain that advantage.
Arrivl is a free analytics tool that parses raw server logs to track AI agent/LLM crawler traffic, filling the gap left by JS-based analytics tools that agents bypass entirely.
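As a rough illustration of log-based agent analytics, the sketch below counts requests from AI crawlers by scanning user-agent strings in Combined Log Format lines. It is not Arrivl's code: the marker list, the regular expression, the `count_ai_hits` helper, and the sample log lines are assumptions made for this example, and a real deployment would need a maintained list of agent signatures.

```python
import re
from collections import Counter

# Illustrative subset of crawler tokens that appear in user-agent strings.
AI_AGENT_MARKERS = ("GPTBot", "ClaudeBot", "PerplexityBot", "CCBot")

# Matches the request, status, size, referrer, and user-agent fields of a
# Combined Log Format line (assumed format; adjust for other servers).
LOG_LINE = re.compile(
    r'"(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def count_ai_hits(lines):
    """Return a Counter of requests per AI agent marker (hypothetical helper)."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        user_agent = match.group("ua").lower()
        for marker in AI_AGENT_MARKERS:
            if marker.lower() in user_agent:
                hits[marker] += 1
                break
    return hits


# Made-up sample lines for demonstration.
SAMPLE_LINES = [
    '203.0.113.5 - - [12/May/2026:10:00:00 +0000] "GET /docs HTTP/1.1" 200 512 "-" '
    '"Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2)"',
    '203.0.113.9 - - [12/May/2026:10:00:05 +0000] "GET /pricing HTTP/1.1" 200 2048 "-" '
    '"Mozilla/5.0 (compatible; ClaudeBot/1.0)"',
    '198.51.100.7 - - [12/May/2026:10:00:09 +0000] "GET / HTTP/1.1" 200 1024 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Safari/605.1.15"',
]
hits = count_ai_hits(SAMPLE_LINES)
```

Because the counts come from server logs rather than a JavaScript snippet, requests from agents that never execute page scripts are still recorded.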
A Show HN submission presenting a multi-LLM AI trading agent harness, with no further details provided in the post.
OpenAI disclosed its response to the TanStack 'Mini Shai-Hulud' supply chain attack, detailing protective measures for signing certificates and urging macOS users to update OpenAI apps by June 12, 2026.
OpenAI built a secure sandboxed environment for Codex on Windows, enabling coding agents to operate with controlled file access and network restrictions for safety.
Post contains only a URL with no extractable content or context to summarize.
ArizeAI shares a write-up on their approach to agent feedback loops, linking to a detailed article on closing the observability-to-development cycle.
ArizeAI outlines their agent development feedback loop—trace, diagnose, change prompt/code, eval, redeploy—integrated between observability tooling and the IDE to reduce manual context-switching.
ArizeAI describes the challenge of debugging agents across many spans, drawing on lessons from building their internal Alyx AI engineering agent to define a structured feedback loop.
Factory CTO Eno Reyes is presenting at ArizeAI's Observe conference on production patterns for fully autonomous AI product engineering teams, covering the human-agent division of labor.
Perplexity CEO highlights their computer/agent security architecture featuring hardware-isolated per-task sandboxes with VPC-level separation and short-lived proxy tokens instead of raw API keys.
Perplexity is developing a secure, scalable agent runtime sandbox featuring proxy API key management, real-time safety detection, and encrypted connector data for enterprise agents.
PayPal runs 74,000 weekly tasks through Perplexity Enterprise for use cases including model validation, market research, and competitive intelligence, highlighting strong enterprise AI adoption.
Perplexity details its agent security stack: parallel ML classifiers and a BrowseSafe model scan external content before agents act, while file connector data is encrypted and auto-deleted after 7 days.
Perplexity's Computer product runs every task in a hardware-isolated sandbox with VPC-level separation and authenticates agents via short-lived proxy tokens instead of raw API keys.
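The general pattern of short-lived proxy tokens can be sketched as follows: a proxy hands the agent a signed token that names one task and expires quickly, and the real API key never enters the sandbox. This is a generic illustration, not Perplexity's design; the token layout, the `issue_token` and `verify_token` helpers, and the demo secret are all hypothetical.

```python
import hashlib
import hmac
import time

# Hypothetical signing key held only by the proxy, never by the agent.
PROXY_SECRET = b"demo-secret-for-illustration-only"


def issue_token(task_id, ttl_seconds, now=None):
    """Create a token of the form 'task_id:expiry:signature'."""
    now = time.time() if now is None else now
    expires = int(now) + ttl_seconds
    payload = f"{task_id}:{expires}"
    signature = hmac.new(PROXY_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}:{signature}"


def verify_token(token, now=None):
    """Return True only if the signature is valid and the token has not expired."""
    now = time.time() if now is None else now
    try:
        task_id, expires, signature = token.rsplit(":", 2)
    except ValueError:
        return False  # malformed token
    expected = hmac.new(
        PROXY_SECRET, f"{task_id}:{expires}".encode(), hashlib.sha256
    ).hexdigest()
    # Constant-time comparison avoids leaking signature bytes via timing.
    if not hmac.compare_digest(signature, expected):
        return False
    return now < int(expires)


# Fixed clock values keep the demonstration deterministic.
token = issue_token("task-1", ttl_seconds=60, now=1000)
still_valid = verify_token(token, now=1030)
expired = verify_token(token, now=1061)
tampered = verify_token(token.replace("task-1", "task-2"), now=1030)
```

A leaked token of this kind is useful only for one task and only until it expires, which is the advantage over placing a long-lived API key inside the sandbox.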
OpenAI is promoting Codex to enterprise customers, offering two free months to new users who switch within 30 days.
OpenAI teases an additional reason to adopt Codex, but the post content is incomplete with no specific details provided.
SEMIR introduces a graph-based representation learning framework that decouples visual segmentation inference from native image grids, improving handling of small/sparse structures with topology-preserving latent representations.
A Random Matrix Theory method detects the onset of overfitting in neural networks without requiring train/test data, identifying 'Correlation Traps' that emerge during an 'anti-grokking' phase.
OGLS-SD improves on-policy self-distillation for LLM reasoning by using outcome-guided logit steering to correct teacher-student calibration mismatches caused by reflection-induced bias.
This paper introduces 'Semantic Reward Collapse' (SRC) to explain how scalarized RLHF optimization conflates distinct failure modes—like sycophancy and hallucination—into undifferentiated signals, undermining epistemic integrity.
Proposes a text-tabular modeling approach for predicting decisions of unknown AI agents in negotiation scenarios from limited interactions, with implications for multi-agent systems.
LLMs' in-context learning is framed as Bayesian inference over a low-dimensional 'conceptual belief space,' with belief updates forming structured trajectories on geometric manifolds.
A new benchmark (CP-SynC-XL) shows LLMs should use declarative constraint modeling (MiniZinc) rather than optimizing Python heuristics when synthesizing combinatorial solvers, revealing a key design principle for neuro-symbolic systems.
CAAFC is a chronological automated fact-checking framework that outperforms state-of-the-art systems on both misinformation detection and hallucination correction, better aligning with real-world fact-checking workflows.
Temporarily switching encoder pretraining from MLM to CLM before a short MLM decay phase yields consistent downstream gains (+0.3–2.8pp) on biomedical NLP tasks, suggesting CLM provides richer low-layer supervision.
A large-scale audit of 1.7M posts across nine crisis events finds that LLM-generated political discourse exhibits systematic statistical deviations from real online populations, enabling detection beyond surface-level token cues.
Researchers present a real-world dataset collected from commercially deployed 5G networks across multiple mobility modes to support AI/ML-based beam management and handover optimization.
A Gymnasium reinforcement learning environment is introduced for optimizing electric utility demand-response programs, addressing the gap between offline historical data and dynamic real-world grid interactions.
Attractor Models are proposed as an alternative to looped transformers, using implicit differentiation to find fixed points in latent representations, achieving constant training memory and adaptive iteration depth with strong language modeling and reasoning results.
KV-Fold is a training-free protocol for long-context inference that treats the KV cache as a left-fold accumulator over sequence chunks, enabling efficient recurrent-style processing without model retraining.
This work studies reward hacking in rubric-based RL post-training, identifying two failure modes—verifier failure and rubric-design limitations—and showing weak verifiers lead to poor generalization across medical and science domains.
OmniNFT applies reinforcement learning to joint audio-video generation, addressing multi-objective advantage inconsistency and cross-modal gradient imbalance to improve per-modality fidelity and synchronization.
ToolCUA is an end-to-end computer use agent that learns optimal selection between GUI actions and high-level tool calls through a staged training paradigm using interleaved GUI-Tool trajectories.
The paper proposes a sparse-to-dense reward principle for LLM post-training, arguing that GRPO-style sparse RL and dense on-policy distillation should be applied at different stages based on reward density rather than treated as separate recipes.
A fast-slow learning framework for LLMs is introduced that combines in-context (fast) and in-weights (slow) adaptation to enable continual learning, mitigating catastrophic forgetting while retaining the benefits of parameter updates.
AlphaGRPO applies GRPO to unified multimodal models to enable reasoning-driven text-to-image generation and self-reflective output correction, using a Decompositional Verifiable Reward for stable supervision.
HYPD is an AI co-pilot for Google Ads marketers that connects to ad accounts to run audits, analyze data through natural-language queries, and generate ad copy. Built by a founder with prior ad-tech exits, it targets PPC freelancers and agencies.
A CS student reflects on how AI coding agents have changed the emotional and intellectual experience of programming, expressing a sense that deep understanding and grounded engineering are being lost. A personal essay on developer identity in the LLM era.
Torrix is a self-hosted LLM observability tool that runs as a single Docker container backed by SQLite, requiring no Postgres or Redis. It lowers the barrier to monitoring AI agents in production.
Gigacatalyst provides an embedded AI builder layer for SaaS products, allowing non-engineers to create custom features via natural language. Targets long-tail enterprise workflow customization without engineering overhead.
Statewright uses visual state machines to constrain AI agent behavior, improving reliability by shrinking solution spaces rather than scaling up model size. Built by a veteran engineer with NVIDIA/AMD background.
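The general idea of shrinking an agent's solution space with a state machine can be sketched as a transition table that lists which actions are legal in each state. The states, actions, and function names below are hypothetical illustrations and do not reflect Statewright's actual schema or API.

```python
# Hypothetical states and actions; not Statewright's actual schema or API.
TRANSITIONS = {
    "triage": {"read_ticket": "triage", "open_editor": "editing"},
    "editing": {"write_file": "editing", "run_tests": "testing"},
    "testing": {"run_tests": "testing", "open_editor": "editing", "submit": "done"},
    "done": {},
}

def allowed_actions(state):
    """Return the only actions the agent may propose in this state."""
    return sorted(TRANSITIONS[state])

def step(state, action):
    """Apply an agent-proposed action, rejecting anything off the graph."""
    if action not in TRANSITIONS[state]:
        raise ValueError(f"action {action!r} not allowed in state {state!r}")
    return TRANSITIONS[state][action]

state = step("triage", "open_editor")  # moves to "editing"
```

In a setup like this, the model would only ever be shown `allowed_actions(state)`, so an action such as submitting before running tests cannot be selected at all.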
Voker (YC S24) is an LLM-stack-agnostic analytics SDK for AI agent products, giving engineering teams visibility into what users ask and whether agents are delivering in production. Addresses the gap in agent performance observability.
AutoScout24 Group uses OpenAI's Codex and ChatGPT to accelerate development cycles and improve code quality across their engineering organization. A case study in enterprise AI coding adoption.
OpenAI's Parameter Golf competition drew 1,000+ participants to explore AI-assisted ML research, coding agents, quantization, and novel model design under strict parameter constraints. Highlights community innovation in efficient model design.
OpenAI highlights teams using Codex with GPT-5.5 to ship production systems and accelerate research-to-experiment pipelines. Positions Codex as a key tool for both engineering and research workflows.
OpenAI demonstrates Codex being used by finance teams to automate reporting workflows including MBRs, variance bridges, and planning scenarios from real work inputs. Expands Codex use cases beyond engineering into finance.
LlamaIndex introduces liteparse-server, a self-hostable open-source HTTP server for parsing PDFs, Office files, and images locally without sending data to external services.
Arize AI advocates a hybrid evaluation strategy combining LLM-as-a-judge for nuanced assessment, code-based evals for speed, and human annotators for ground truth rather than relying on any single method.
Arize AI shared a link with no substantive text content available for analysis.
Retweet of a link-only post with no substantive text content available for analysis.
Cohere's Chief AI Officer Joelle Pineau highlights the stark disparity in academic entrepreneurship between California and Canada, suggesting Canada lags in translating AI research into startups.
Retweet of Cohere CAO Joelle Pineau's comment on the entrepreneurship gap between California and Canadian universities in AI.
Perplexity AI highlights NVIDIA's GB200 (Blackwell) as the leading platform for large-model inference, citing prefill/decode disaggregation, Blackwell-native quantization, and NVLink rack-scale networking for lower serving costs.
Benchmark data shows NVIDIA GB200 cuts NVLink all-reduce latency nearly in half versus H200 (313µs vs 586µs) and significantly improves MoE prefill and decode throughput, demonstrating a major generational leap in inference hardware.
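The "nearly in half" figure follows directly from the two latencies quoted above. The short calculation below uses only those two numbers; the variable names are illustrative.

```python
# All-reduce latencies quoted in the item above, in microseconds.
h200_us = 586.0
gb200_us = 313.0

reduction = 1 - gb200_us / h200_us   # fraction of latency removed
speedup = h200_us / gb200_us         # how many times faster GB200 is

print(f"{reduction:.1%} lower latency, {speedup:.2f}x faster")
# prints "46.6% lower latency, 1.87x faster"
```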
Perplexity AI published research on serving Qwen3 235B MoE models on NVIDIA GB200 NVL72 Blackwell racks, demonstrating GB200's superiority for high-throughput inference beyond just training workloads.
Google DeepMind teases experimental AI-enabled mouse pointer capabilities available to try in Google AI Studio, hinting at next-generation UI interactions.
Google DeepMind demonstrates an AI-powered mouse pointer that understands the context of what is being pointed at, enabling interactions like converting scribbled notes into to-do lists or paused video frames into booking links.
Google DeepMind announces an experimental reimagining of the mouse pointer using Gemini, enabling users to direct AI on-screen via motion, speech, and natural shorthand.
OpenAI's 'parameter golf' competition attracted 2,000+ submissions exploring techniques like quantization, TTT LoRA, SSMs, and JEPA, with autoresearch tooling accelerating iteration and enabling emergent collaboration.
Retweet of the OpenAI parameter golf competition post summarizing community participation and research directions explored.
Proposes a practical evaluation protocol for AI pentesting agents that shifts from task completion metrics to validated vulnerability discovery, better reflecting real-world complexity.
Introduces Clin-JEPA, a co-training framework applying JEPA-style predictive pretraining to EHR patient trajectories for trajectory forecasting and downstream risk prediction without per-task fine-tuning.
DISCA is a training-free, black-box inference-time method that uses within-country sociodemographic disagreement signals from the World Values Survey to culturally align LLMs without fine-tuning.
Pi-Serini demonstrates that BM25 lexical retrieval paired with capable frontier LLMs (e.g., GPT-5.5) can achieve 83.1% accuracy on deep research benchmarks, questioning the necessity of dense retrieval in agentic search.
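To make "BM25 lexical retrieval" concrete, here is a minimal, dependency-free sketch of Okapi BM25 scoring over whitespace-tokenized text. It is not Pi-Serini's implementation: the `k1` and `b` defaults, the smoothed IDF variant, the tokenizer, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Assumptions: lowercase whitespace tokenization and the smoothed,
    non-negative IDF form log((N - df + 0.5) / (df + 0.5) + 1).
    """
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "sparse retrieval with bm25 ranks documents by term overlap",
    "dense retrieval embeds queries and documents into vectors",
    "agents call a search tool many times during deep research",
]
scores = bm25_scores("bm25 term overlap", docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Because scoring depends only on term counts, no embedding model or vector index is needed, which is the trade-off the item above calls into question for agentic search.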
Introduces the Generalized Turing Test (GTT), a formal dataset- and task-agnostic framework for comparing agent intelligence via indistinguishability, with analysis of transitivity and ordering properties.
BenchCAD presents a benchmark of 17,900 verified CadQuery programs across 106 industrial part families to evaluate MLLMs on realistic parametric CAD code generation tasks.
BEACON is a large-scale multimodal dataset (~430 GB) capturing behavioral biometrics from competitive Valorant gameplay to support continuous authentication research.
Proposes a decision-centric rate-distortion framework for agent memory that determines what can be safely forgotten based on impact to decision quality rather than descriptive relevance.
Attractor-Vascular Coupling Theory provides a mathematical framework linking cardiac attractor geometry to blood pressure estimation from smartphone PPG, validated to AAMI standards using LightGBM.
CADBench unifies multimodal CAD program generation evaluation with 18,000 samples across six benchmark families, five input modalities, and six metrics covering geometry, executability, and compactness.
AssayBench introduces a new benchmark for evaluating LLMs and agents on in silico phenotypic screens, filling a gap in virtual cell modeling evaluation.
LoKA proposes a system-model co-design approach to apply FP8 low-precision arithmetic to large recommendation models, overcoming numerical sensitivity and training inefficiencies.
This paper formalizes probabilistic safety shielding for autonomous agents in MDPs, proving impossibility of classical guarantees while providing weaker but practical alternatives.
A training-free diagnostic framework for on-policy distillation analyzes per-token supervision signals to clarify when teacher distillation helps or hurts reasoning model training.
DataMaster is an autonomous agent that handles the full data engineering pipeline for ML—discovery, selection, cleaning, and transformation—without modifying the learning algorithm.
This paper argues that AI agents built on rapid on-the-fly synthesis bypass rigorous software engineering practices, proposing an AI Workflow Store paradigm to embed SE discipline into agentic systems.
Shepherd is a meta-agent runtime that records typed execution traces in a Git-like structure, enabling fast forking and replay of agent states, and significantly improving pair coding pass rates via runtime intervention.
A confidence-guided diffusion augmentation framework synthesizes training data for handwritten Bangla compound character recognition, improving generalization across writing styles.
A neural exponential tilting framework enables scalable variational inference for Lévy-driven SDEs, capturing heavy-tailed and jump phenomena beyond Gaussian assumptions.
ELF proposes continuous-space diffusion language models using Flow Matching in embedding space, showing competitive performance with minimal adaptation from the discrete token domain.
AI agents discovered a reasoning strategy that reduces LLM token usage by 70%, which could meaningfully lower inference cost if the result holds up.
A developer shares Origami, a terminal workspace manager built with AI assistance, offering a grounded perspective on AI coding tools as accelerators rather than replacements.
VibeServe explores whether AI agents can autonomously design and build bespoke LLM serving infrastructure, probing the limits of agentic software engineering.
Graft introduces semantic memory for AI agents that operates without requiring an LLM, offering a lightweight and cost-effective memory layer.
JetBrains launched Junie, an LLM-agnostic AI coding agent integrated into their IDE ecosystem, broadening developer tooling options.
ChatGPT saw its fastest Q1 2026 growth among users over 35 with more balanced gender demographics, indicating mainstream AI adoption beyond early adopters.
sandboxed-lit is a Rust CLI agent combining LiteParse for multi-format document parsing with a secure Bash sandbox, enabling safe and powerful file-handling agents.
OpenAI's Stuart Sy will present on a voice-of-the-customer agent that distills millions of customer interactions into actionable insights at Arize's Observe conference.
Arize Phoenix is evolving beyond human-facing observability into a context platform accessible to both humans and agents for building AI-native software.
Arize argues that AI observability must shift from human-readable dashboards to API/CLI and agent-facing interfaces as agents increasingly consume operational context.
Arize AI proposes that AI observability is evolving into a collaborative context platform where both humans and agents can debug and improve AI systems together.
Arize AI is partnering with Google Cloud for the Rapid Agent Hackathon, focusing on bridging the gap between agents that demo well and agents that execute reliably in production.
Langfuse now supports running LLM experiments in CI/CD pipelines via a GitHub Action, enabling teams to catch quality regressions and gate releases on evaluation metrics.
Google DeepMind collaborated with The Sainsbury Lab on AI-guided discovery of atypical protein assemblies, advancing AI applications in structural biology.
OpenAI shared a link with no accompanying text, providing no analyzable content.
OpenAI announced Daybreak, a security automation platform for detecting, validating, and responding to threats using frontier AI models.
OpenAI launched Daybreak, a cyber defense platform combining its most capable models and Codex to help security teams accelerate threat detection and continuously secure software.
Anthropic released Claude's Constitution as an audiobook narrated by authors Amanda Askell and Joe Carlsmith, including discussion of the philosophies behind the document and how it may evolve.
Researchers use probing and activation patching to locate where language models form internal representations of future tokens, finding that planning signals are linearly decodable and scale-dependent, with only Gemma-3-27B causally relying on this encoding.
Dooly is a configuration-agnostic LLM inference simulator that avoids redundant re-profiling by exploiting structural redundancy across model configurations, making hardware and serving engine exploration significantly cheaper.
NIST researchers propose a structured methodology for AI evaluation scenarios grounded in real-world use cases, promoting methodological transparency and human-centered design to enable apples-to-apples comparisons across AI benchmarks.
Tool selection in LLM agents is linearly readable and steerable via internal activations, enabling 77-100% accuracy in switching tool choices and allowing error prediction before execution across 12 models.
PSP-HDC applies graph-structured hyperdimensional computing to process-structure-property prediction in materials science, achieving data-efficient and explainable results where conventional ML fails due to sparse data.
A probabilistic framework for abductive commonsense reasoning is proposed that explicitly models variation in human commonsense beliefs, moving beyond binary truth assumptions in neurosymbolic LLM systems.
A position paper auditing 30 mechanistic interpretability studies finds that causal claims consistently lack explicit identification assumptions, with validation metrics incorrectly treated as causal evidence.
RL-trained CLI agents are studied with a focus on structured action credit assignment and selective observation, addressing two core bottlenecks: evidence localization in large codebases and sparse reward attribution over long trajectories.
Frontier large reasoning models (LRMs) are evaluated against human game learners using behavioral data and fMRI recordings, jointly assessing gameplay performance, learning behavior alignment, and brain activity prediction.
A parameter reconstruction algorithm for spiking neural networks is proposed that avoids surrogate gradient approximation errors by extending convexification theory to parallel recurrent threshold networks.
MPD²-Router introduces a mask-aware multi-expert learning-to-defer framework for glaucoma screening that routes uncertain cases to appropriate human experts while enforcing availability constraints and handling workload imbalance.
GraphDPO generalizes Direct Preference Optimization to operate over directed acyclic preference graphs from ranked rollouts, better exploiting multi-response training data and avoiding instability from collapsing rankings into pairs.
SCOPE is a skill orchestration framework for complex text-to-image generation that maintains semantic commitments in a structured specification throughout the full generation lifecycle to reduce conceptual drift.
Fast Byte Latent Transformer introduces diffusion-based parallel byte generation and speculative decoding extensions to address the slow autoregressive bottleneck of byte-level language models.
CA-SQL dynamically scales solution space exploration based on estimated query complexity and uses prompt seeding via evolutionary principles to improve LLM performance on hard Text-to-SQL benchmarks.
Expanding LLM context windows in multi-agent social dilemmas systematically degrades cooperation across 7 models and 4 games—termed the 'memory curse'—driven by erosion of forward-looking intent rather than increased paranoia.
Rubric-grounded RL decomposes rewards into weighted, verifiable criteria scored by a frozen LLM judge, providing partial-credit optimization signals that improve generalizable reasoning over binary or holistic rewards.
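A rubric-style reward of this kind can be pictured as a weighted average of per-criterion scores. In the sketch below the criterion names, weights, and judge scores are all hypothetical, and the scores are hard-coded where the paper would use a frozen LLM judge; the exact aggregation the authors use may differ.

```python
def rubric_reward(criterion_scores, weights):
    """Weighted average of per-criterion scores in [0, 1].

    Assumption: simple normalized weighted sum; not the paper's exact formula.
    """
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * criterion_scores[name] for name in weights)
    return weighted / total_weight

# Hypothetical rubric for a math word problem.
weights = {"correct_final_answer": 3.0, "shows_work": 1.0, "states_units": 1.0}

# Stand-in for scores a frozen LLM judge would assign to one response.
judge_scores = {"correct_final_answer": 0.0, "shows_work": 1.0, "states_units": 1.0}

reward = rubric_reward(judge_scores, weights)  # 0.4, versus 0.0 under a binary reward
```

The response with a wrong final answer still earns partial credit for the criteria it satisfies, which is the denser optimization signal the item describes.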
Flow-OPD is the first post-training framework integrating on-policy distillation into flow matching text-to-image models, using specialized teacher models and a two-stage strategy to mitigate reward hacking and the seesaw effect.
VecCISC improves confidence-informed self-consistency by clustering reasoning traces to reduce redundant critic LLM calls, lowering inference cost while maintaining or improving accuracy relative to majority voting.
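As background, confidence-informed self-consistency can be sketched as a vote in which each sampled answer is weighted by a confidence score instead of counting once. The snippet below illustrates only that baseline idea with made-up answers and confidences; it does not implement VecCISC's trace clustering, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def confidence_weighted_vote(samples):
    """Pick the answer with the largest summed confidence."""
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)

# Hypothetical (answer, critic confidence) pairs from three reasoning traces.
samples = [("41", 0.3), ("41", 0.3), ("42", 0.9)]

majority = Counter(answer for answer, _ in samples).most_common(1)[0][0]
weighted = confidence_weighted_vote(samples)
# majority is "41" (two votes), weighted is "42" (0.9 versus 0.6)
```

Each confidence score would normally cost a critic call, which is the expense the item above says clustering is meant to reduce.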
EmambaIR applies visual state space models (Mamba) to event-guided image reconstruction, combining sparse cross-modal attention with linear complexity to outperform CNN and ViT baselines at high resolutions.
AI agents are showing performance gains from Long Context Models (LCM), enabling more specialized and capable applications that leverage extended context windows.
WUPHF is an open-source, local-first multi-agent framework that prevents context drift across agent handoffs using a shared markdown+git wiki and cross-agent peer review rather than just shared memory.
Google has integrated an AI-powered experience into Google Finance, though details of the specific features or capabilities are not provided in this post.
OpenAI has launched DeployCo, an enterprise-focused deployment company designed to help organizations move frontier AI from experimentation into production with measurable business outcomes.
OpenAI is launching the OpenAI Campus Network to connect student clubs globally with AI tools and resources for building campus AI communities.
OpenAI outlines a framework for how enterprises can scale AI adoption through trust-building, governance structures, deliberate workflow design, and maintaining quality at scale.
OpenAI is acquiring Tomoro to immediately staff its new Deployment Company with 150 experienced Forward Deployed Engineers and Deployment Specialists from day one.
OpenAI officially launched the OpenAI Deployment Company, a majority-owned subsidiary that unites 19 investment firms, consultancies, and system integrators to help businesses deploy frontier AI to production.