Keyword explainer for real terminal agents
Instead of grading toy prompts, Terminal-Bench combines a dataset of practical tasks with an execution harness that drops agents into a sandboxed terminal, verifies the result with tests, and makes end-to-end performance comparable.
Current shape
Beta benchmark with ~100 public tasks
Core design
Task dataset + execution harness
Verification
Test scripts plus oracle solutions
Overview
Terminal-Bench focuses on agents that need to use shells, files, tools, and local services the way engineers and operators do. The benchmark is meant for real workflows, not isolated puzzle prompts.
Agent builders, evaluation teams, and framework authors use Terminal-Bench to see whether a model can autonomously finish a job from instruction to verified completion under repeatable conditions.
The public project describes each task with an instruction, a verification script, and an oracle solution, which makes the benchmark easier to inspect, extend, and critique than vibe-based claims about agent skill.
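To make that structure concrete, here is a minimal sketch of a task record in that shape. The field and file names are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    # Illustrative field names, not Terminal-Bench's real schema.
    instruction: str           # natural-language job given to the agent
    verification_script: str   # test that checks the final state
    oracle_solution: str       # reference solution grounding the target

task = TaskRecord(
    instruction="Fix the failing build and make the test suite pass.",
    verification_script="tests/check_end_state.py",
    oracle_solution="solution/solve.sh",
)
```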
Why It Matters
Terminal-Bench matters because it tests whether an agent can handle ambiguity, unfamiliar file layouts, command errors, partial progress, and verification pressure without a human quietly patching the path behind the scenes.
Compilers, package managers, training jobs, and server processes produce friction. That friction is the point: it surfaces whether the agent can recover, inspect state, and adapt.
Success is checked with scripts instead of anecdotes. The benchmark asks whether the work is actually done, not whether the final message sounded competent.
Because the harness and dataset are explicit, teams can rerun agents against known task versions and track what genuinely improved rather than arguing from selective examples.
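As a sketch of that workflow, the snippet below diffs pass/fail results between two runs against the same pinned task version; the task names and outcomes are made up for illustration.

```python
# Hypothetical per-task results from two runs against the same
# pinned task version; real runs would load these from run logs.
baseline = {"fix-build": True, "train-model": False, "setup-service": False}
candidate = {"fix-build": True, "train-model": True, "setup-service": False}

for task_id, passed_before in baseline.items():
    passed_now = candidate[task_id]
    if passed_now != passed_before:
        print(f"{task_id}: {'fixed' if passed_now else 'regressed'}")
```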
How It Works
Step 1
Terminal-Bench publishes versioned task collections, including the beta `terminal-bench-core` release that powers the public leaderboard.
Step 2
The official CLI links a model or agent runtime to a sandboxed terminal so the benchmark can observe actions, files, and command outcomes in one place.
Step 3
The agent has to inspect the environment, choose tools, edit files, launch commands, and drive the task to a completed state instead of only generating a plausible answer.
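The loop below is a deliberately minimal sketch of that observe-act-verify cycle, not the Terminal-Bench harness itself; `propose_command` is a stub standing in for a model call, and the verification path is an assumption.

```python
import subprocess

def propose_command(instruction: str, history: list) -> str:
    # Stub for the model: a real agent would generate the next shell
    # command from the instruction and everything observed so far.
    return "ls -la"

def run(cmd: str, cwd: str) -> subprocess.CompletedProcess:
    # Capture what a harness wants to observe: stdout, stderr, exit code.
    return subprocess.run(cmd, shell=True, cwd=cwd,
                          capture_output=True, text=True, timeout=120)

def drive_task(instruction: str, workdir: str, max_steps: int = 30) -> bool:
    history = []
    for _ in range(max_steps):
        cmd = propose_command(instruction, history)
        result = run(cmd, workdir)
        history.append((cmd, result.stdout, result.stderr, result.returncode))
        # Done only when the task's verification script passes.
        if run("python tests/check_end_state.py", workdir).returncode == 0:
            return True
    return False
```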
Step 4
A task is considered complete when the verification script passes. Oracle solutions exist to ground the task definition and make the target state concrete.
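A verification script can be as plain as a few pytest-style assertions about the end state. The paths and expected contents below are assumptions for illustration; each real task ships its own checks.

```python
from pathlib import Path

# Illustrative checks only; a real task defines its own target state.
def test_report_exists():
    # Completion means the artifact exists, not that the agent said so.
    assert Path("/app/report.csv").exists()

def test_report_has_expected_header():
    header = Path("/app/report.csv").read_text().splitlines()[0]
    assert header == "id,score"
```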
Representative Task Types
The official site and repository show tasks across coding, file operations, security, system work, data processing, and data science. These examples are paraphrased from public descriptions and homepage task excerpts.
Some tasks force agents to configure systems and deliver a verified result, not merely describe a setup plan. This is where sloppy shell habits get punished quickly.
One public example asks the agent to train a fastText model on Yelp data while staying under a size limit and still hitting an accuracy target on a hidden test distribution.
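A solution in that spirit might look like the sketch below, using the real fastText Python API; the file names, the size cap, and the accuracy floor are illustrative assumptions, not the task's actual thresholds.

```python
import os
import fasttext  # official fastText Python bindings

SIZE_LIMIT_BYTES = 150 * 1024 * 1024  # assumed cap for illustration
ACCURACY_FLOOR = 0.90                 # assumed target for illustration

# Training file uses fastText's supervised format: "__label__X text".
model = fasttext.train_supervised(input="yelp_train.txt", epoch=5)
# Quantization is one genuine fastText lever for meeting a size budget.
model.quantize(input="yelp_train.txt", retrain=True)
model.save_model("model.ftz")

assert os.path.getsize("model.ftz") <= SIZE_LIMIT_BYTES
n, precision_at_1, _ = model.test("yelp_test.txt")  # (samples, p@1, r@1)
assert precision_at_1 >= ACCURACY_FLOOR
```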
Terminal-Bench also covers the mixed reality of repos, local services, and environment state, where the agent has to inspect, modify, run, and verify instead of stopping at code generation.
FAQ
Is Terminal-Bench the official name?
It is the official project name. The public website, GitHub repository, docs, and leaderboard all use Terminal-Bench branding.
What does Terminal-Bench actually evaluate?
The benchmark evaluates whether an agent can complete practical terminal work inside a sandbox and then pass a verification script. That is a stricter bar than producing a good-looking transcript.
How is success scored?
Each task carries a test script. The verifier determines whether the expected end state exists, which helps keep scoring consistent across runs and agents.
How do I get started?
Start with the official docs and repo. The project distributes a CLI called `tb`, and the public quickstart explains the install path, harness usage, and dataset selection flow.