Keyword explainer for real terminal agents
Instead of grading toy prompts, Terminal-Bench combines a dataset of practical tasks with an execution harness that drops agents into a sandboxed terminal, verifies the result with tests, and makes end-to-end performance comparable.
Current shape
Beta benchmark with ~100 public tasks
Core design
Task dataset + execution harness
Verification
Test scripts plus oracle solutions
Overview
Terminal-Bench focuses on agents that need to use shells, files, tools, and local services the way engineers and operators do. The benchmark is meant for real workflows, not isolated puzzle prompts.
Agent builders, evaluation teams, and framework authors use Terminal-Bench to see whether a model can autonomously finish a job from instruction to verified completion under repeatable conditions.
The public project describes each task with an instruction, a verification script, and an oracle solution, which makes the benchmark easier to inspect, extend, and critique than vibe-based claims about agent skill.
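To make that structure concrete, here is a minimal sketch of a task record in that shape. The field and file names are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    # Illustrative field names, not Terminal-Bench's real schema.
    instruction: str           # natural-language job given to the agent
    verification_script: str   # test that checks the final state
    oracle_solution: str       # reference solution grounding the target

task = TaskRecord(
    instruction="Fix the failing build and make the test suite pass.",
    verification_script="tests/check_end_state.py",
    oracle_solution="solution/solve.sh",
)
```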
Why It Matters
Terminal-Bench matters because it tests whether an agent can handle ambiguity, unfamiliar file layouts, command errors, partial progress, and verification pressure without a human quietly patching the path behind the scenes.
Compilers, package managers, training jobs, and server processes produce friction. That friction is the point: it surfaces whether the agent can recover, inspect state, and adapt.
Success is checked with scripts instead of anecdotes. The benchmark asks whether the work is actually done, not whether the final message sounded competent.
Because the harness and dataset are explicit, teams can rerun agents against known task versions and track what genuinely improved rather than arguing from selective examples.
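As a sketch of that workflow, the snippet below diffs pass/fail results between two runs against the same pinned task version; the task names and outcomes are made up for illustration.

```python
# Hypothetical per-task results from two runs against the same
# pinned task version; real runs would load these from run logs.
baseline = {"fix-build": True, "train-model": False, "setup-service": False}
candidate = {"fix-build": True, "train-model": True, "setup-service": False}

for task_id, passed_before in baseline.items():
    passed_now = candidate[task_id]
    if passed_now != passed_before:
        print(f"{task_id}: {'fixed' if passed_now else 'regressed'}")
```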
How It Works
Step 1
Terminal-Bench publishes versioned task collections, including the beta `terminal-bench-core` release that powers the public leaderboard.
Step 2
The official CLI links a model or agent runtime to a sandboxed terminal so the benchmark can observe actions, files, and command outcomes in one place.
Step 3
The agent has to inspect the environment, choose tools, edit files, launch commands, and drive the task to a completed state instead of only generating a plausible answer.
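The loop below is a deliberately minimal sketch of that observe-act-verify cycle, not the Terminal-Bench harness itself; `propose_command` is a stub standing in for a model call, and the verification path is an assumption.

```python
import subprocess

def propose_command(instruction: str, history: list) -> str:
    # Stub for the model: a real agent would generate the next shell
    # command from the instruction and everything observed so far.
    return "ls -la"

def run(cmd: str, cwd: str) -> subprocess.CompletedProcess:
    # Capture what a harness wants to observe: stdout, stderr, exit code.
    return subprocess.run(cmd, shell=True, cwd=cwd,
                          capture_output=True, text=True, timeout=120)

def drive_task(instruction: str, workdir: str, max_steps: int = 30) -> bool:
    history = []
    for _ in range(max_steps):
        cmd = propose_command(instruction, history)
        result = run(cmd, workdir)
        history.append((cmd, result.stdout, result.stderr, result.returncode))
        # Done only when the task's verification script passes.
        if run("python tests/check_end_state.py", workdir).returncode == 0:
            return True
    return False
```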
Step 4
A task is considered complete when the verification script passes. Oracle solutions exist to ground the task definition and make the target state concrete.
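A verification script can be as plain as a few pytest-style assertions about the end state. The paths and expected contents below are assumptions for illustration; each real task ships its own checks.

```python
from pathlib import Path

# Illustrative checks only; a real task defines its own target state.
def test_report_exists():
    # Completion means the artifact exists, not that the agent said so.
    assert Path("/app/report.csv").exists()

def test_report_has_expected_header():
    header = Path("/app/report.csv").read_text().splitlines()[0]
    assert header == "id,score"
```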
Representative Task Types
The official site and repository show tasks across coding, file operations, security, system work, data processing, and data science. These examples are paraphrased from public descriptions and homepage task excerpts.
Some tasks force agents to configure systems and deliver a verified result, not merely describe a setup plan. This is where sloppy shell habits get punished quickly.
One public example asks the agent to train a fastText model on Yelp data while staying under a size limit and still hitting an accuracy target on a hidden test distribution.
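A solution in that spirit might look like the sketch below, using the real fastText Python API; the file names, the size cap, and the accuracy floor are illustrative assumptions, not the task's actual thresholds.

```python
import os
import fasttext  # official fastText Python bindings

SIZE_LIMIT_BYTES = 150 * 1024 * 1024  # assumed cap for illustration
ACCURACY_FLOOR = 0.90                 # assumed target for illustration

# Training file uses fastText's supervised format: "__label__X text".
model = fasttext.train_supervised(input="yelp_train.txt", epoch=5)
# Quantization is one genuine fastText lever for meeting a size budget.
model.quantize(input="yelp_train.txt", retrain=True)
model.save_model("model.ftz")

assert os.path.getsize("model.ftz") <= SIZE_LIMIT_BYTES
n, precision_at_1, _ = model.test("yelp_test.txt")  # (samples, p@1, r@1)
assert precision_at_1 >= ACCURACY_FLOOR
```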
Terminal-Bench also covers the mixed reality of repos, local services, and environment state, where the agent has to inspect, modify, run, and verify instead of stopping at code generation.
FAQ
Is Terminal-Bench the official name?
It is the official project name. The public website, GitHub repository, docs, and leaderboard all use Terminal-Bench branding.
What does Terminal-Bench actually evaluate?
The benchmark evaluates whether an agent can complete practical terminal work inside a sandbox and then pass a verification script. That is a stricter bar than producing a good-looking transcript.
How is success scored?
Each task carries a test script. The verifier determines whether the expected end state exists, which helps keep scoring consistent across runs and agents.
How do I get started?
Start with the official docs and repo. The project distributes a CLI called `tb`, and the public quickstart explains the install path, harness usage, and dataset selection flow.