
The Anatomy of an Agent Harness

https://www.langchain.com/blog/the-anatomy-of-an-agent-harness


TL;DR: Agent = Model + Harness. Harness engineering is how we build systems around models to turn them into work engines. The model contains the intelligence, and the harness makes that intelligence useful. We define what a harness is and derive the core components today's and tomorrow's agents need.

Can Someone Please Define a "Harness"?

Agent = Model + Harness

If you're not the model, you're the harness.

A harness is every piece of code, configuration, and execution logic that isn't the model itself. A raw model is not an agent. But it becomes one when a harness gives it things like state, tool execution, feedback loops, and enforceable constraints.

Concretely, a harness includes things like:

  • System Prompts
  • Tools, Skills, MCPs, and their descriptions
  • Bundled Infrastructure (filesystem, sandbox, browser)
  • Orchestration Logic (subagent spawning, handoffs, model routing)
  • Hooks/Middleware for deterministic execution (compaction, continuation, lint checks)
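As a concrete sketch, the components above can be bundled into one structure. This is purely illustrative: the names `Harness` and `ToolFn` are hypothetical, not from any particular library.

```python
from dataclasses import dataclass, field
from typing import Callable

# A tool is any callable the agent loop can dispatch to; it returns the
# observation as text. (Hypothetical type alias, for illustration only.)
ToolFn = Callable[..., str]

@dataclass
class Harness:
    system_prompt: str                                        # system prompts
    tools: dict[str, ToolFn] = field(default_factory=dict)    # tools + descriptions
    workspace: str = "/tmp/agent"                             # bundled infrastructure (filesystem)
    subagents: list["Harness"] = field(default_factory=list)  # orchestration logic
    hooks: list[Callable] = field(default_factory=list)       # deterministic middleware

harness = Harness(system_prompt="You are a coding agent.")
harness.tools["echo"] = lambda text: text  # register a trivial tool
```

Everything in this structure lives outside the model; swapping the model while keeping the harness (or vice versa) is exactly the separation the definition above draws.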

There are many messy ways to split the boundaries of an agent system between the model and the harness. But in my opinion, this is the cleanest definition because it forces us to think about designing systems around model intelligence.

The rest of this post walks through core harness components and derives why each piece exists, working backwards from the core primitive of a model.

Why Do We Need Harnesses? From a Model's Perspective

There are things we want an agent to do that a model cannot do out of the box. This is where a harness comes in. Models (mostly) take in data like text, images, audio, and video, and they output text. That's it. Out of the box they cannot:

  • Maintain durable state across interactions
  • Execute code
  • Access realtime knowledge
  • Set up environments and install packages to complete work

These are all harness-level features. The structure of LLMs requires some sort of machinery that wraps them to do useful work. For example, to get a product UX like "chatting", we wrap the model in a while loop to track previous messages and append new user messages. Everyone reading this has already used this kind of harness. The main idea is that we want to convert a desired agent behavior into an actual feature in the harness.
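The chat loop described above can be sketched in a few lines. Here `fake_model` is a stand-in assumption for a real LLM completion call; the point is that the message list lives entirely in the harness, not the model.

```python
# Minimal chat harness: a loop that tracks history and appends each new
# user turn. `fake_model` is a placeholder for any completion API.
def fake_model(messages):
    last = messages[-1]["content"]
    return f"You said: {last}"

def chat_harness(user_turns):
    messages = [{"role": "system", "content": "You are a helpful assistant."}]
    replies = []
    for turn in user_turns:
        messages.append({"role": "user", "content": turn})
        reply = fake_model(messages)  # the model only sees what the harness sends
        messages.append({"role": "assistant", "content": reply})
        replies.append(reply)
    return messages, replies

history, replies = chat_harness(["hello", "what's a harness?"])
```

The model itself is stateless between calls; the illusion of a conversation is a harness feature.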

Working Backwards from Desired Agent Behavior to Harness Engineering

Harness Engineering helps humans inject useful priors to guide agent behavior. And as models have gotten more capable, harnesses have been used to surgically extend and correct models to complete previously impossible tasks.

We won’t go over an exhaustive list of every harness feature. The goal is to derive a set of features from the starting point of helping models do useful work. We’ll follow a pattern like this:

Behavior we want (or want to fix) → Harness Design to help the model achieve this.

Filesystems for Durable Storage and Context Management

We want agents to have durable storage to interface with real data, offload information that doesn't fit in context, and persist work across sessions.

Models can only directly operate on knowledge within their context window. Before filesystems, users had to copy/paste content directly to the model; that's clunky UX, and it doesn't work for autonomous agents. The world was already using filesystems to do work, so models were naturally trained on billions of tokens of how to use them. The natural solution became:

Harnesses ship with filesystem abstractions and tools for fs-ops.

The filesystem is arguably the most foundational harness primitive because of what it unlocks:

  • Agents get a workspace to read data, code, and documentation.
  • Work can be incrementally added and offloaded instead of holding everything in context. Agents can store intermediate outputs and maintain state that outlasts a single session.
  • The filesystem is a natural collaboration surface. Multiple agents and humans can coordinate through shared files. Architectures like Agent Teams rely on this.

Git adds versioning to the filesystem so agents can track work, roll back errors, and branch experiments. We revisit the filesystem below, because it turns out to be a key harness primitive for other features we need.
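A minimal sketch of the fs-op tools a harness might expose, assuming a simple read/write pair rooted in an agent workspace (the function names are illustrative, not a real library's API):

```python
import os
import tempfile

workspace = tempfile.mkdtemp()  # the agent's durable workspace directory

def write_file(path: str, content: str) -> str:
    """Tool: write content to a path inside the workspace."""
    full = os.path.join(workspace, path)
    os.makedirs(os.path.dirname(full), exist_ok=True)
    with open(full, "w") as f:
        f.write(content)
    return f"wrote {len(content)} bytes to {path}"

def read_file(path: str) -> str:
    """Tool: read a file from the workspace back into context."""
    with open(os.path.join(workspace, path)) as f:
        return f.read()

write_file("notes/plan.md", "# Plan\n1. read repo\n2. run tests\n")
```

Because the workspace outlives any single context window, intermediate outputs written here persist across sessions, which is exactly the durable-state property described above.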

Bash + Code as a General Purpose Tool

We want agents to autonomously solve problems without humans needing to pre-design every tool.

The main agent execution pattern today is a ReAct loop, where a model reasons, takes an action via a tool call, observes the result, and repeats in a while loop. But harnesses can only execute the tools they have logic for. Instead of forcing users to build tools for every possible action, a better solution is to give agents a general purpose tool like bash.

Harnesses ship with a bash tool so models can solve problems autonomously by writing & executing code.

Bash + code exec is a big step towards giving models a computer and letting them figure out the rest autonomously. The model can design its own tools on the fly via code instead of being constrained to a fixed set of pre-configured tools.

Harnesses still ship with other tools, but code execution has become the default general-purpose strategy for autonomous problem solving.
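A bash tool can be sketched as a thin subprocess wrapper: the harness runs the model's command and returns stdout/stderr as the observation. This is a minimal version; a production harness would add sandboxing, allow-listing, and output truncation on top.

```python
import subprocess

def bash_tool(command: str, timeout: int = 30) -> str:
    """Run a shell command and return its combined output as the observation."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    output = result.stdout + result.stderr
    return output if output else f"(exit code {result.returncode}, no output)"

observation = bash_tool("echo hello | tr a-z A-Z")
```

One generic tool like this replaces an open-ended set of special-purpose tools, because the model can compose arbitrary commands and scripts itself.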

Sandboxes and Tools to Execute & Verify Work

Agents need an environment with the right defaults so they can safely act, observe results, and make progress.

We've given models storage and the ability to execute code, but all of that needs to happen somewhere. Running agent-generated code locally is risky and a single local environment doesn’t scale to large agent workloads.

Sandboxes give agents safe operating environments. Instead of executing locally, the harness can connect to a sandbox to run code, inspect files, install dependencies, and complete tasks. This creates secure, isolated execution of code. For more security, harnesses can allow-list commands and enforce network isolation. Sandboxes also unlock scale because environments can be created on demand, fanned out across many tasks, and torn down when the work is done.

Good environments come with good default tooling. Harnesses are responsible for configuring tooling so agents can do useful work. This includes pre-installing language runtimes and packages, CLIs for git and testing, and browsers for web interaction and verification.

Tools like browsers, logs, screenshots, and test runners give agents a way to observe and analyze their work. This helps them create self-verification loops where they can write application code, run tests, inspect logs, and fix errors.

The model doesn’t configure its own execution environment out of the box. Deciding where the agent runs, what tools are available, what it can access, and how it verifies its work are all harness-level design decisions.
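The command allow-listing mentioned above can be sketched as a check the harness runs before handing anything to the sandbox. The specific allowed commands here are illustrative assumptions:

```python
import shlex

# Hypothetical allow-list; a real deployment would tune this per task.
ALLOWED = {"ls", "cat", "git", "python", "pytest"}

def check_command(command: str) -> bool:
    """Return True only if the command's first token is on the allow-list."""
    try:
        tokens = shlex.split(command)
    except ValueError:  # unbalanced quotes etc. -> reject
        return False
    return bool(tokens) and tokens[0] in ALLOWED
```

Note this is a design decision made entirely in the harness: the model never sees the check, only the refusal message when a command is rejected.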

Memory & Search for Continual Learning

Agents should remember what they've seen and access information that didn't exist when they were trained.

Models have no additional knowledge beyond their weights and what's in their current context. Without access to edit model weights, the only way to "add knowledge" is via context injection.

For memory, the filesystem is again a core primitive. Harnesses support memory file standards like AGENTS.md which get injected into context on agent start. As agents add and edit this file, harnesses load the updated file into context. This is a form of continual learning where agents durably store knowledge from one session and inject that knowledge into future sessions.

Knowledge cutoffs mean that models can't directly access new data, like updated library versions, without the user providing it. For up-to-date knowledge, Web Search and MCP tools like Context7 help agents access information beyond the cutoff, such as new library versions or current data that didn't exist when training stopped.

Web Search and tools for querying up-to-date context are useful primitives to bake into a harness.
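The AGENTS.md injection described above can be sketched as a session-start step: the harness reads the memory file, if present, and prepends it to the system prompt. The header text is an illustrative choice:

```python
import os
import tempfile

def build_system_prompt(workspace: str, base_prompt: str) -> str:
    """Inject the workspace's AGENTS.md memory file into the system prompt."""
    memory_path = os.path.join(workspace, "AGENTS.md")
    if os.path.exists(memory_path):
        with open(memory_path) as f:
            memory = f.read()
        return f"{base_prompt}\n\n# Project memory (AGENTS.md)\n{memory}"
    return base_prompt  # no memory file yet: first session

ws = tempfile.mkdtemp()
with open(os.path.join(ws, "AGENTS.md"), "w") as f:
    f.write("Run tests with `pytest -q`.")
prompt = build_system_prompt(ws, "You are a coding agent.")
```

Because the agent can also edit AGENTS.md through its file tools, each session's lessons become the next session's starting context.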

Battling Context Rot

Agent performance shouldn’t degrade over the course of work.

Context Rot describes how models become worse at reasoning and completing tasks as their context window fills up. Context is a precious and scarce resource, so harnesses need strategies to manage it.

Harnesses today are largely delivery mechanisms for good context engineering.

Compaction addresses what to do when the context window is close to filling up. Without compaction, what happens when a conversation exceeds the context window? One option is for the API to throw an error; that's not good. The harness has to use some strategy for this case. Compaction intelligently offloads and summarizes the existing context window so the agent can continue working.

Tool call offloading helps reduce the impact of large tool outputs that can noisily clutter the context window without providing useful information. For any tool output above a threshold token count, the harness keeps the head and tail tokens in context and offloads the full output to the filesystem so the model can access it if needed.
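The head-and-tail strategy can be sketched directly. This version truncates by character count rather than tokens for simplicity, and the threshold is an illustrative choice:

```python
import os
import tempfile

workspace = tempfile.mkdtemp()

def offload_output(output: str, name: str, keep: int = 200) -> str:
    """Keep the head and tail of a large tool output in context; write the
    full output to the filesystem and point the model at the file."""
    if len(output) <= 2 * keep:
        return output  # small enough to keep in context verbatim
    path = os.path.join(workspace, f"{name}.log")
    with open(path, "w") as f:
        f.write(output)
    return (output[:keep]
            + f"\n... [truncated; full output at {path}] ...\n"
            + output[-keep:])

short = offload_output("ok", "build")
long = offload_output("x" * 10_000, "build")
```

The model still sees enough to decide whether the full output matters, and can `cat` the file if it does; the context window pays for 400-odd characters instead of 10,000.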

Skills address the issue of too many tools or MCP servers loaded into context on agent start, which degrades performance before the agent can even begin working. Skills are a harness-level primitive that solve this via progressive disclosure. The model didn't choose to have Skill front-matter loaded into context on start, but the harness can support this to protect the model against context rot.

Long Horizon Autonomous Execution

We want agents to complete complex work, autonomously, correctly, over long time horizons.

Autonomous software creation is the holy grail for coding agents. But today's models suffer from early stopping, issues decomposing complex problems, and incoherence as work stretches across multiple context windows. A good harness has to design around all of this.

This is where the earlier harness primitives start to compound. Long-horizon work requires durable state, planning, observation, and verification to keep working across multiple context windows.

Filesystems and git for tracking work across sessions. Agents produce millions of tokens over a long task, so the filesystem durably captures work to track progress over time. Adding git allows new agents to quickly get up to speed on the latest work and history of the project. For multiple agents working together, the filesystem also acts as a shared ledger of work where agents can collaborate.

Ralph Loops for continuing work. The Ralph Loop is a harness pattern that intercepts the model's exit attempt via a hook and reinjects the original prompt in a clean context window, forcing the agent to continue its work against a completion goal. The filesystem makes this possible because each iteration starts with fresh context but reads state from the previous iteration.
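A Ralph-style loop can be sketched as follows. `run_agent_once`, `is_done`, and `read_state` are stand-ins for a real model invocation, a verification hook, and a filesystem read; only the loop structure is the point:

```python
def ralph_loop(prompt, run_agent_once, is_done, read_state, max_iters=10):
    """Rerun the original prompt in a fresh context until the completion
    check passes, carrying state forward only through durable storage."""
    for i in range(max_iters):
        context = [prompt, read_state()]  # clean window + state from last iteration
        run_agent_once(context)
        if is_done():                     # harness-level completion check
            return i + 1                  # iterations used
    return max_iters

# Toy run: the "agent" makes one unit of progress per iteration and the
# goal is reached after three. All names here are illustrative.
state = {"steps": 0}
iters = ralph_loop(
    "Finish the TODO list.",
    run_agent_once=lambda ctx: state.update(steps=state["steps"] + 1),
    is_done=lambda: state["steps"] >= 3,
    read_state=lambda: f"progress: {state['steps']} steps",
)
```

Each iteration is blind to the previous context window; coherence comes entirely from what was persisted, which is why the filesystem is the enabling primitive here.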

Planning and self-verification to stay on track. Planning is when a model decomposes a goal into a series of steps. Harnesses support this via good prompting and by injecting reminders on how to use a plan file in the filesystem. After completing each step, agents benefit from checking the correctness of their work via self-verification. Hooks in the harness can run a pre-defined test suite and loop back to the model with the error message on failure, or models can be prompted to self-evaluate their code independently. Verification grounds solutions in tests and creates a feedback signal for self-improvement.

The Future of Harnesses

The Coupling of Model Training and Harness Design

Today's agent products like Claude Code and Codex are post-trained with models and harnesses in the loop. This helps models improve at actions that the harness designers think they should be natively good at like filesystem operations, bash execution, planning, or parallelizing work with subagents.

This creates a feedback loop. Useful primitives are discovered, added to the harness, and then used when training the next generation of models. As this cycle repeats, models become more capable within the harness they were trained in.

But this co-evolution has interesting side effects for generalization. It shows up in ways like how changing tool logic leads to worse model performance. A good example is described in the Codex-5.3 prompting guide with the apply_patch tool logic for editing files. A truly intelligent model should have little trouble switching between patch methods, but training with a harness in the loop creates this overfitting.

But this doesn't mean that the best harness for your task is the one a model was post-trained with. The Terminal Bench 2.0 Leaderboard is a good example: Opus 4.6 in Claude Code scores far below Opus 4.6 in other harnesses. In a previous blog, we showed how we improved our coding agent from Top 30 to Top 5 on Terminal Bench 2.0 by changing only the harness. There's a lot of juice to be squeezed out of optimizing the harness for your task.

Where Harness Engineering is Going

As models get more capable, some of what lives in the harness today will get absorbed into the model. Models will get better at planning, self-verification, and long horizon coherence natively, thus requiring less context injection for example.

That suggests harnesses should matter less over time. But just as prompt engineering continues to be valuable today, it’s likely that harness engineering will continue to be useful for building good agents.

It’s true that harnesses today patch over model deficiencies, but they also engineer systems around model intelligence to make them more effective. A well-configured environment, the right tools, durable state, and verification loops make any model more efficient regardless of its base intelligence.

Harness engineering is a very active area of research that we use to improve our harness building library deepagents at LangChain. Here are a few open and interesting problems we’re exploring today:

  • orchestrating hundreds of agents working in parallel on a shared codebase
  • agents that analyze their own traces to identify and fix harness-level failure modes
  • harnesses that dynamically assemble the right tools and context just-in-time for a given task instead of being pre-configured

This blog was an exercise in defining what a harness is and how it’s shaped by the work we want models to do.

The model contains the intelligence and the harness is the system that makes that intelligence useful.

To more harness building, better systems, and better agents.
