Agent-desktop is a CLI tool for AI agents that uses native OS accessibility APIs (instead of screenshot-based pixel clicking).
Over the last few months, a lot of computer-use agents have come out: Codex, Claude Code, CUA, and others. Most of them seem to work roughly like this:

1. Take a screenshot
2. Have the model predict pixel coordinates
3. Click x,y
4. Take another screenshot
5. Repeat
That works, but it's slow, expensive in tokens, and fragile. If the UI shifts a few pixels, things break. And the model still doesn't know what any element actually is.
But the OS already exposes structured UI information:
- macOS: Accessibility API
- Windows: UI Automation
- Linux: AT-SPI
Screen readers have used these APIs for years. On the web, Playwright beat screenshot scraping for the same reason: structured access is simply a better abstraction than pixels.

So I built a desktop equivalent: agent-desktop.
It's a cross-platform CLI for structured desktop automation through the accessibility tree. One Rust binary, about 15 MB, no runtime dependencies. It exposes 53 commands with JSON output, so an LLM can inspect and operate native apps without screenshots or vision models. Inspired by agent-browser by Vercel Labs.
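Because the output is JSON, an agent can select elements by role and name rather than by coordinates. Here is a hedged sketch of what filtering a snapshot might look like; the schema below (`elements`, `ref`, `role`, `name`) is my own guess at the shape, not the tool's documented format.

```python
import json

# Hypothetical snapshot output -- the real field names may differ.
snapshot = json.loads("""
{
  "elements": [
    {"ref": "@e5",  "role": "TextField", "name": "Message input"},
    {"ref": "@e12", "role": "Button",    "name": "Send"}
  ]
}
""")

# Pick an element by semantic role/name instead of pixel coordinates.
send = next(e for e in snapshot["elements"]
            if e["role"] == "Button" and e["name"] == "Send")
print(send["ref"])  # → @e12
```

The resulting ref (`@e12`) is stable across layout shifts in a way a pixel coordinate is not.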
A typical loop looks like this:

```shell
agent-desktop snapshot --app Slack -i --compact
agent-desktop click @e12
agent-desktop type @e5 "ship it"
agent-d
```
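The loop above can be driven from a thin harness. Below is a minimal Python sketch: the element refs (`@e5`, `@e12`) and flags are taken straight from the example, but the assumption that every command prints JSON on stdout is mine, and the `runner` parameter is injectable purely so the harness can be exercised without the binary installed.

```python
import json
import subprocess

def run(args, runner=subprocess.run):
    """Invoke agent-desktop and parse its JSON stdout.
    `runner` is injectable for testing without the real binary."""
    proc = runner(["agent-desktop", *args], capture_output=True, text=True)
    return json.loads(proc.stdout)

def send_message(app, text, runner=subprocess.run):
    # 1. Inspect: grab the accessibility tree as structured JSON.
    tree = run(["snapshot", "--app", app, "-i", "--compact"], runner)
    # 2. Act on element refs from the snapshot (hypothetical refs here),
    #    not on pixel coordinates.
    run(["type", "@e5", text], runner)
    run(["click", "@e12"], runner)
    return tree
```

In a real agent, the model would read the snapshot between steps and choose the refs itself rather than hard-coding them.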