# What is a Harness?
**Core insight:** Models are commoditizing: GPT, Claude, and Gemini are converging in capability. The harness is the real moat: how you orchestrate context, memory, tools, and agent lifecycle determines whether you ship a chatbot or a production agent.
## Definition
A **harness** is the runtime wrapper that turns a bare language model into an **agent**: an autonomous system that can perceive its environment, make decisions, and take actions over multiple steps to achieve goals.
It's important to distinguish "agent" here from earlier usage. In 2023-2024, "agent" typically meant a model plus tools: you gave GPT a web search tool and called it an agent. The agents that harness engineering targets are fundamentally more complex:
| Component | 2023 "Agent" | Harness-era Agent |
|---|---|---|
| Model | ✓ LLM | ✓ LLM |
| Tools | ✓ Function calling | ✓ Dynamic tool system |
| Memory | ✗ Stateless | ✓ Persistent cross-session memory |
| Context management | ✗ Naive | ✓ Priority-based context assembly |
| Orchestration | ✗ Single-turn | ✓ Agentic loop with error recovery |
| Execution environment | ✗ Host process | ✓ Sandboxed runtime |
| Guardrails | ✗ Minimal | ✓ Permission model + trust boundaries |
The harness is the engineering layer that provides all of this. Without it, you have a chatbot that can call functions. With it, you have an agent that can navigate a codebase, fix bugs across multiple files, and commit the result, all autonomously.
## Anatomy of a Harness
Every harness, regardless of implementation, has four subsystems:
```
┌────────────────────────────────────────────────┐
│                    HARNESS                     │
│                                                │
│  ┌──────────┐  ┌──────────┐  ┌──────────────┐  │
│  │ Agentic  │  │   Tool   │  │   Memory &   │  │
│  │   Loop   │  │  System  │  │   Context    │  │
│  └──────────┘  └──────────┘  └──────────────┘  │
│                                                │
│  ┌──────────────────────────────────────────┐  │
│  │                Guardrails                │  │
│  └──────────────────────────────────────────┘  │
└────────────────────────────────────────────────┘
```
- **Agentic Loop**: The think → act → observe cycle that drives all agent behavior. The model reasons, invokes a tool, observes the result, and loops until the task is complete.
- **Tool System**: The registry of capabilities available to the agent: file I/O, shell execution, web search, API calls. Tools can be static (loaded at startup) or dynamic (loaded on demand via skill menus).
- **Memory & Context**: The system that decides what the model can see. This encompasses three distinct concerns:
  - **Context**: what goes into the current API call (system prompt, tools, files, conversation history)
  - **Memory**: what persists across sessions (MEMORY.md, daily logs, learned preferences)
  - **Session**: the boundary of a single agent run (message history, tool results, scratch state)
- **Guardrails**: Permission boundaries, sandbox enforcement, and safety constraints. These define what the agent can and cannot do, and how to prevent prompt injection from bypassing those boundaries.
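A minimal sketch of the guardrails idea is a deny-by-default permission gate in front of tool dispatch. All names here (`ALLOWED_TOOLS`, `PROTECTED_PATHS`, `guarded_execute`, the `execute_tool` callable) are illustrative, not the API of any particular harness:

```python
ALLOWED_TOOLS = {"read_file", "web_search"}   # explicit allowlist
PROTECTED_PATHS = ("/etc", "/root")           # trust boundary for file tools

class PermissionDenied(Exception):
    pass

def guarded_execute(name: str, arguments: dict, execute_tool) -> str:
    """Deny-by-default gate: a tool call runs only if it passes every check."""
    if name not in ALLOWED_TOOLS:
        raise PermissionDenied(f"tool {name!r} is not allowlisted")
    # Path-based trust boundary for file-touching tools
    path = str(arguments.get("path", ""))
    if any(path.startswith(p) for p in PROTECTED_PATHS):
        raise PermissionDenied(f"path {path!r} crosses a trust boundary")
    return execute_tool(name, arguments)
```

Because the gate sits between the agentic loop and tool execution, even a prompt-injected tool call cannot reach a capability that was never allowlisted.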
These four subsystems are explored in depth in the Core Concepts section.
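To make the tool system concrete, here is a minimal sketch of a registry that supports both static registration at startup and on-demand registration later. `ToolRegistry` and its methods are hypothetical names for illustration, not an API from any framework:

```python
from typing import Any, Callable

class ToolRegistry:
    """Illustrative registry mapping tool names to (schema, handler) pairs."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[dict, Callable[..., str]]] = {}

    def register(self, name: str, schema: dict, handler: Callable[..., str]) -> None:
        # Static tools call this at startup; dynamic tools call it on demand
        self._tools[name] = (schema, handler)

    def schemas(self) -> list[dict]:
        # Exactly what gets passed as the `tools` parameter of an API call
        return [schema for schema, _ in self._tools.values()]

    def dispatch(self, name: str, arguments: dict[str, Any]) -> str:
        # Invoked by the agentic loop when the model requests a tool
        _schema, handler = self._tools[name]
        return handler(**arguments)

registry = ToolRegistry()
registry.register(
    "read_file",
    {"type": "function",
     "function": {"name": "read_file",
                  "parameters": {"type": "object",
                                 "properties": {"path": {"type": "string"}}}}},
    lambda path: open(path).read(),
)
```

Keeping schema and handler together means the model's view of a tool and its implementation can never drift apart.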
## A Minimal Example
The simplest harness is a loop. The following is not production-ready, but it is structurally correct:
```python
import openai

client = openai.OpenAI()

# Tool schemas passed to the API (full JSON schema elided)
tools = [{"type": "function", "function": {"name": "read_file", ...}}]

messages = [{"role": "system", "content": "You are a coding agent."}]
user_input = input("> ")  # or however the task arrives
messages.append({"role": "user", "content": user_input})

# The agentic loop
while True:
    response = client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    msg = response.choices[0].message
    messages.append(msg)

    if not msg.tool_calls:
        print(msg.content)  # Done: the model has no more actions
        break

    for call in msg.tool_calls:
        # execute_tool(name, json_arguments) -> str is supplied by the harness
        result = execute_tool(call.function.name, call.function.arguments)
        messages.append({
            "role": "tool",
            "tool_call_id": call.id,
            "content": result,
        })
    # Loop back: the model sees the tool results and decides the next action
```
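The loop leaves `execute_tool` undefined. A minimal sketch, assuming tool arguments arrive JSON-encoded and that every call must return a string (errors included, so the model can observe the failure and recover); the `TOOL_HANDLERS` table is illustrative:

```python
import json

def read_file(path: str) -> str:
    with open(path) as f:
        return f.read()

TOOL_HANDLERS = {"read_file": read_file}

def execute_tool(name: str, arguments: str) -> str:
    """Dispatch one tool call. Always returns a string, because the result
    goes back into the conversation as a tool message, errors included."""
    try:
        args = json.loads(arguments or "{}")
        return str(TOOL_HANDLERS[name](**args))
    except Exception as e:
        # Surface the failure to the model so it can observe and recover
        return f"ERROR: {type(e).__name__}: {e}"
```

Returning errors as strings rather than raising keeps the loop alive: a bad tool call becomes something the model can react to on the next turn.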
Every harness, from a 50-line script to Claude Code, is a variation of this loop. The complexity comes from what you build around it: context assembly, memory persistence, skill orchestration, error recovery, and sandboxing.
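As one example of that surrounding complexity, context assembly can be sketched as a function that merges the three concerns named earlier: persistent memory, the system prompt, and bounded session history. The `assemble_context` function, its MEMORY.md handling, and the limits are illustrative assumptions:

```python
from pathlib import Path

def assemble_context(session_messages: list[dict],
                     memory_file: str = "MEMORY.md",
                     max_history: int = 50) -> list[dict]:
    """Build the message list for one API call from three layers:
    system prompt, persistent memory, and bounded session history."""
    system = "You are a coding agent."
    memory_path = Path(memory_file)
    if memory_path.exists():
        # Memory persists across sessions; inject it into every call
        system += "\n\n## Persistent memory\n" + memory_path.read_text()
    messages = [{"role": "system", "content": system}]
    # Session state is bounded: only the most recent turns survive
    messages.extend(session_messages[-max_history:])
    return messages
```

The key property is that the model never sees raw state: every API call receives a freshly assembled view, so the harness decides, on every turn, what exists.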
## Harness vs. Framework vs. Runtime
These three terms are often confused. They are different layers:
| Term | Role | Examples |
|---|---|---|
| Harness | The orchestration code that wraps a model into an agent | Claude Code, Codex CLI, OpenClaw |
| Framework | A library that provides building blocks for constructing harnesses | LangChain, CrewAI, AutoGen |
| Runtime | The persistent process that keeps a harness running, manages its lifecycle, and connects it to the outside world | OpenClaw runtime, Docker container, systemd service |
A framework helps you build a harness. A runtime hosts a harness: keeping it alive, handling reconnection, scheduling heartbeats, and routing messages to it. The harness itself is the orchestration logic: how context is assembled, which tools are loaded, and how the agentic loop behaves.
## Common Pitfalls
- **Blaming the model for harness problems.** When an agent fails, it's usually a context issue (wrong files loaded, missing instructions) or a tool issue (incorrect schema, silent errors), not a model capability problem.
- **Over-engineering from day one.** Start with the minimal loop above. Add memory when you need cross-session state. Add skills when you have too many tools. Add guardrails when you move to production.
- **Treating the context window as unlimited.** The model can only reason about what's in its context. If critical information isn't assembled into the prompt, it effectively doesn't exist.
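The last pitfall has a mechanical consequence: the harness must budget its context. A rough sketch of history trimming, using character counts as a stand-in for a real tokenizer (`trim_history` and the budget value are illustrative):

```python
def trim_history(messages: list[dict], budget_chars: int = 20_000) -> list[dict]:
    """Keep the system prompt plus the most recent messages that fit the
    budget. Character counts approximate tokens; a real harness would use
    the model's tokenizer."""
    system, rest = messages[:1], messages[1:]
    kept: list[dict] = []
    used = 0
    for msg in reversed(rest):          # walk newest-first
        size = len(msg.get("content") or "")
        if used + size > budget_chars:
            break                       # older messages fall out of context
        kept.append(msg)
        used += size
    return system + list(reversed(kept))
```

Dropping the oldest turns first is the crudest policy; priority-based context assembly replaces it with explicit decisions about what deserves the space.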
## Further Reading
- OpenAI: *Harness Engineering*, the blog post that named the discipline
- Anthropic: *Building Effective Agents*, patterns for production agents