Your shortcut to every AI CLI. Spawn Claude, Codex, Gemini, OpenCode side‑by‑side. Benchmark them. Ship the winner. Real CLIs. Real tools. Your judgment.

Warp bets on one cloud agent. iTerm2 adds a chat sidebar. Ghostty stays pure. None of them let you spawn Claude, Codex, Gemini and OpenCode side-by-side, watch them coordinate in a shared chat, score them against each other, and ship the winner. AnvilTerm does. It treats multi-agent as a first-class primitive — live tiles, MCP control plane, SwarmRoom, Arena, Forge marketplace — not a feature on the roadmap.
Watching one AI work is productivity.
Watching four of them compete for your prompt is the future.
One prompt, N agents. Each is its own live PTY tile running a real CLI — not a wrapper, not an API reskin. Auto-grid re-flows as agents finish. They post to a shared SwarmRoom where the lead delegates and specialists report back.
Lead agent delegates subtasks. Specialists pick them up, report back in the same room. You watch the transcript live in a chat tile. Works with any agent that speaks MCP — which is all of them.
Your agents are literal claude, codex, gemini binaries. You see exactly what the TUI sees.
swarm_route(["multimodal"]) → Gemini. ["reasoning"] → Claude. ["refactor"] → Codex.
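Capability routing can be pictured as a tag-to-vendor match. A minimal sketch, assuming a capability table like the one below — the table and the tie-breaking rule are illustrative, not AnvilTerm's actual registry:

```python
# Hypothetical capability table — an assumption for illustration,
# not AnvilTerm's real vendor registry.
CAPABILITIES = {
    "claude":   {"reasoning", "long-context"},
    "codex":    {"refactor", "codegen"},
    "gemini":   {"multimodal", "long-context"},
    "opencode": {"codegen"},
}

def swarm_route(tags):
    """Pick the vendor whose capability set covers the most requested tags."""
    def score(vendor):
        return len(CAPABILITIES[vendor] & set(tags))
    best = max(CAPABILITIES, key=score)
    # No vendor matched: fall back to the lead agent.
    return best if score(best) > 0 else "claude"
```

With this table, `["multimodal"]` routes to Gemini, `["reasoning"]` to Claude, and `["refactor"]` to Codex, matching the examples above.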
One agent = one tab. Four = 2×2. Eight = 3×3. Each tile stops, closes, exports on its own.
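The tile counts above follow a smallest-square rule. A one-line sketch of that layout logic, assuming the grid is always square:

```python
import math

def grid_for(n_agents):
    """Smallest square grid that fits n agents: 1 -> 1x1, 4 -> 2x2, 8 -> 3x3."""
    side = math.ceil(math.sqrt(n_agents))
    return side, side
```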
Spawn a single Claude lead. It reads the prompt, calls swarm_vendors, picks specialists, delegates, merges the result. You just watch and approve. No team-building UI required.
Every tile streams its token counter scraped from the TUI. The toolbar aggregates daily + weekly spend across Claude + Codex + OpenCode. Forecast tells you when you'll hit the cap.
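The forecast can be as simple as extrapolating the current burn rate. A minimal sketch, assuming a linear model over token counts — the function name and units are illustrative, not the shipped forecaster:

```python
def hours_until_cap(tokens_used, cap, hours_elapsed):
    """Linear forecast: hours until the cap is hit at the current burn rate.

    Returns 0.0 if the cap is already exceeded, None if no rate can be
    computed yet.
    """
    if hours_elapsed <= 0 or tokens_used <= 0:
        return None
    if cap - tokens_used <= 0:
        return 0.0
    rate = tokens_used / hours_elapsed  # tokens per hour
    return (cap - tokens_used) / rate
```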
Pick contestants. Paste a prompt. Hit start. Each tile streams live output, renders deliverables as iframes when the agent calls arena_push_artifact. Judge on quality, speed, correctness. Export the fight as markdown + JSONL for reproducibility — or as a 9:16 battle video for your timeline.
Every battle is stored at ~/.anvil/arena/<id>.jsonl. Bring it back six months later, re-run on new model versions, watch the winner flip. Reproducible benchmarking — for the laptop era.
Arena tiles stream live output side-by-side. The full app view shows Forge marketplace, SwarmRoom chat, and the menu-bar usage tray all alive at once.
AnvilTerm ships with a Model Context Protocol server. One command registers it across Claude Code, Codex, Gemini, OpenCode. Every agent gains — for free — a browser, a PTY, inter-agent chat, artifact rendering, usage tracking, and screenshot capture.
```shell
# wire AnvilTerm into every installed agent at once
npx anvilterm-doctor --install

# now available from any agent:
terminal_create() · terminal_write() · terminal_screen()
tui_type() · tui_interrupt() · tui_choose()
swarm_spawn() · swarm_route() · swarm_vendors()
swarm_room_post() · swarm_room_listen() · swarm_room_thread()
arena_push_artifact() · arena_current()
```
A tab created via MCP appears as a live tile in the AnvilTerm window. You watch — and intervene if needed. Standalone fallback via node-pty if the UI isn't running.
Every MCP-speaking client. Stdio transport. Register once, the toolset travels with you.
Curated catalog, star-ranked, kind-filtered. Pick a server, hit install, choose Claude / Codex / Gemini / OpenCode — or all of them. Forge writes directly to each agent's config with a _forge:true tag so it can manage updates and removal cleanly.
Skills get git-cloned to ~/.anvil/skills/ and symlinked into each agent's skills dir. MCP servers get registered in each vendor's config. Installed view shows per-agent status chips so you always know what's actually wired.
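The `_forge:true` tag is what makes clean updates and removal possible: Forge only ever touches entries it wrote. A minimal sketch, assuming a vendor config with a `mcpServers` map (a common MCP client layout, not AnvilTerm's confirmed schema):

```python
def forge_install(config, name, command):
    """Add a server entry tagged _forge:true so Forge can manage it later."""
    config.setdefault("mcpServers", {})[name] = {
        "command": command,
        "_forge": True,
    }
    return config

def forge_remove_all(config):
    """Strip only Forge-managed entries; hand-written ones are left alone."""
    servers = config.get("mcpServers", {})
    config["mcpServers"] = {
        name: entry for name, entry in servers.items()
        if not entry.get("_forge")
    }
    return config
```

Uninstalling Forge-managed servers then becomes a filter, never a guess about which entries the user added by hand.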
Cmd+K opens an Ollama-backed assistant with native function calling. Gemma 3, Qwen 2.5/3, Llama 3.1, Mistral Nemo. Prompts never leave the machine. Ideal for regulated work, sensitive repos, a plane, a café with no Wi-Fi.
Code, prompts, context — none of it leaves your laptop.
Models that speak tools return structured calls that render as one-click run buttons in the chat.
The assistant reads the visible terminal screen so suggestions are grounded in what's actually running.
Claude's built-in subagents are great when one model is enough. A real swarm — Claude + Codex + Gemini + OpenCode running in parallel as live CLIs — wins on parallelism, model diversity and visibility. Here's the honest breakdown on 40 mixed refactor / research / UI tasks.
| Dimension | AnvilTerm Swarm | Claude Subagents |
|---|---|---|
| Parallelism | N real PTYs in parallel | Sequential within parent turn |
| Context window per worker | Full · 1M tokens each · no compaction | Shared parent, compacted on dispatch |
| Model diversity | Claude · Codex · Gemini · OpenCode · Copilot · Ollama | Claude family only |
| Live output | Dedicated live tile per agent | Opaque spinner until result |
| Human-in-the-loop | Type into any tile, paste refs, interrupt | None once dispatched |
| Artifact rendering | Iframes · SVG · markdown · live | Text summary returned to parent |
| Failure isolation | One agent fails, N-1 keep working | Subagent failure blocks parent turn |
| Token cost routing | Per-vendor metered, route cheap tasks to Ollama | All charged against parent's Claude quota |
| Determinism | Fresh PTY state per agent | Inherits parent compaction |
| Tool access | Each agent carries its own MCP toolkit | Parent's MCP toolkit only |
| Reproducibility | Session JSONL + Arena replay | No persistent transcript |
| Best for | Cross-vendor compare, parallel research, long refactors | Tightly-coupled chains in one model family |
Note: wall-clock multipliers are normalized to swarm = 1.0 across the 40 mixed tasks. Subagents win when the task is inherently sequential (each step depends on the prior result); the swarm wins when the work fans out. Use both.
Every other terminal is excellent at its thing. The matrix below is the proof — not marketing. Hover any row to see only AnvilTerm light up.
| Capability | AnvilTerm | Warp | iTerm2 | Ghostty | WezTerm | Kitty | Alacritty | Tabby |
|---|---|---|---|---|---|---|---|---|
| Multi-agent swarm · live tiles | ● | — | — | — | — | — | — | — |
| MCP server · agents drive the terminal | ● | partial | — | — | — | — | — | — |
| SwarmRoom · inter-agent chat | ● | — | — | — | — | — | — | — |
| Arena · head-to-head benchmark | ● | — | — | — | — | — | — | — |
| Forge · MCP + Skills marketplace | ● | catalog | — | — | — | — | — | — |
| Capability routing (task → best model) | ● | — | — | — | — | — | — | — |
| Cross-vendor usage tracking | ● | — | — | — | — | — | — | — |
| Usage forecast · threshold alerts | ● | — | — | — | — | — | — | — |
| Inline images · SVG render | ● | partial | ✓ | — | partial | ✓ | — | — |
| Inline video · PDF · audio waveform | ● | — | — | — | — | — | — | — |
| YouTube embed · hover previews | ● | — | — | — | — | — | — | — |
| Markdown table → spreadsheet | ● | — | — | — | — | — | — | — |
| Local AI · offline assistant | ● | — | plugin | — | — | — | — | — |
| Voice input · push-to-talk | ● | — | — | — | — | — | — | — |
| Session recording · ANSI + plain | ● | partial | — | — | — | — | — | — |
| Interactive screenshot → MCP | ● | — | — | — | — | — | — | — |
| Automation API (DevTools · Playwright) | ● | — | AppleScript | — | Lua | RC | — | plugins |
| Works offline · no cloud lock-in | ● | — | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Open source | Apache + FSL | closed | GPL | MIT | MIT | GPL | Apache | MIT |
| macOS · Linux · Windows | ● ● ● | ✓ ✓ ✓ | mac | ✓ ✓ β | ✓ ✓ ✓ | ✓ ✓ — | ✓ ✓ ✓ | ✓ ✓ ✓ |
| Native GPU renderer | xterm.js | ✓ | Metal | ✓ | ✓ | ✓ | OpenGL | — |
Every AI CLI in one workspace. Benchmark them on real work. Pick the best tool for each job. Never break your flow.
A browser, a PTY, an artifact renderer, image / PDF / YouTube understanding, a room to talk to their siblings. MCP, done for you.
Arena runs head-to-head on real CLIs, real tasks. Your model's wins become shareable 9:16 battle videos. Free distribution.