Memory & Context

Core Insight: The model only knows what's in its context window. Memory is how you bridge the gap between what the model needs to know and what it can see in a single API call. Getting this right is the highest-leverage problem in harness engineering.

Three Distinct Concepts

These terms are often conflated but serve different purposes:

Concept   Scope                        Persistence                  Example
Context   Single API call              None; rebuilt every turn     System prompt + tools + recent messages + relevant files
Session   Single conversation or task  In-memory, lost on restart   Message history, tool call results, working state
Memory    Cross-session, indefinite    Written to disk              MEMORY.md, daily logs, learned preferences

Context is the model's "working memory": everything assembled into a single prompt. Session is the state of an ongoing interaction. Memory is what survives after the session ends.

Context Assembly

Every turn of the agentic loop starts with assembling the context. This is a prioritized packing problem β€” you have a fixed token budget and must decide what goes in:

Context Window (e.g., 128K tokens)
┌─────────────────────────────────┐
│  System Prompt        (~500)    │  ← Always included, highest priority
│  Tool Schemas         (~2000)   │  ← Active tools only
│  Memory Summary       (~1000)   │  ← Compressed long-term memory
│  Relevant Files       (~5000)   │  ← Task-specific context
│  Conversation History (~varies) │  ← Grows over time, needs pruning
│  [Remaining Budget]             │  ← Available for new content
└─────────────────────────────────┘

A priority system determines what gets included when space is tight:

class ContextAssembler:
    def __init__(self, max_tokens: int = 128_000):
        self.max_tokens = max_tokens
        self.sections = []  # (priority, name, content)

    def add(self, priority: int, name: str, content: str):
        self.sections.append((priority, name, content))

    def build(self) -> list[dict]:
        # Sort by priority (lower = higher priority)
        self.sections.sort(key=lambda x: x[0])
        messages = []
        used_tokens = 0
        for priority, name, content in self.sections:
            token_count = estimate_tokens(content)
            if used_tokens + token_count > self.max_tokens:
                break  # Budget exceeded; all remaining sections are lower priority
            messages.append({"role": "system", "content": f"[{name}]\n{content}"})
            used_tokens += token_count
        return messages
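The `estimate_tokens` helper above is assumed rather than defined; a minimal sketch is the common four-characters-per-token heuristic (a real harness would use the model's actual tokenizer):

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text.

    A heuristic placeholder; swap in the model's tokenizer for accuracy.
    """
    return max(1, len(text) // 4)
```

This keeps the assembler dependency-free, at the cost of over- or under-counting for code-heavy or non-English content.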

Session Management

A session is the boundary of a single agent run. It holds:

  • Message history: the full conversation, including tool calls and results
  • Working state: which files are open, which skills are loaded, current task progress
  • Scratch space: temporary data the agent generated but hasn't committed

The critical session design choice is when to clear it. Some options:

Strategy          Behavior                                   Use case
Per-task          New session per user request               Stateless assistant
Per-conversation  Session persists across turns in one chat  Interactive coding
Persistent        Session survives process restart           Long-running background agent

Persistent sessions require serialization β€” writing session state to disk so it can be restored. This is where session and memory overlap: anything worth keeping across restarts should be written to a memory file rather than kept in session state.
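A minimal serialization sketch, assuming the session state is JSON-safe; the `Session` fields and function names here are illustrative, not taken from any particular harness:

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class Session:
    messages: list = field(default_factory=list)    # conversation history
    open_files: list = field(default_factory=list)  # working state
    scratch: dict = field(default_factory=dict)     # uncommitted temp data

def save_session(session: Session, path: str) -> None:
    # Serialize the full session state to disk as JSON
    with open(path, "w") as f:
        json.dump(asdict(session), f)

def load_session(path: str) -> Session:
    # Restore a session from disk after a process restart
    with open(path) as f:
        return Session(**json.load(f))
```

Real session state often holds non-JSON-safe objects (open file handles, in-flight tool calls), which is one more reason to push anything durable into memory files instead.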

Memory Architecture

The proven memory architecture uses two tiers:

Tier 1: Daily Logs

Raw, chronological records of what happened. Written during the session, not curated:

<!-- memory/2026-04-15.md -->
# 2026-04-15

## 14:30 - Refactored auth module
- Moved JWT validation from middleware to dedicated service
- Tests passing (23/23)
- User prefers explicit error messages over error codes

## 16:00 - Deploy to staging
- Used blue-green deployment
- Rollback plan: revert commit abc123

Tier 2: Long-term Memory

Curated, distilled knowledge. Updated periodically (not every session):

<!-- MEMORY.md -->
# Long-term Memory

## User Preferences
- Prefers explicit error messages over error codes
- Uses pytest, not unittest
- Deploy strategy: blue-green with rollback plan

## Project Knowledge
- Auth module: JWT validation in /src/services/auth.py
- Database: PostgreSQL 15, migrations in /db/migrations/
- CI: GitHub Actions, ~3min build time

## Lessons Learned
- Always run tests before committing (broke build on 4/10)
- User dislikes verbose output; keep summaries under 5 lines

The key insight: daily logs are cheap to write (just append). Long-term memory requires judgment (what's worth keeping?). Production harnesses write daily logs automatically and curate MEMORY.md periodically β€” either on a schedule or when the agent detects significant learnings.

Memory Read/Write Cycle

import os
from datetime import datetime, timedelta

def session_startup(memory_dir: str) -> str:
    """Read memory at session start."""
    sections = []
    # Always read long-term memory (MEMORY.md at the memory root)
    memory_path = os.path.join(memory_dir, "MEMORY.md")
    if os.path.exists(memory_path):
        with open(memory_path) as f:
            sections.append(f.read())
    # Read recent daily logs (today + yesterday) from the memory/ subdirectory
    for days_ago in [0, 1]:
        date = (datetime.now() - timedelta(days=days_ago)).strftime("%Y-%m-%d")
        daily_path = os.path.join(memory_dir, "memory", f"{date}.md")
        if os.path.exists(daily_path):
            with open(daily_path) as f:
                sections.append(f.read())
    return "\n---\n".join(sections)
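The write side is symmetric. A sketch of appending a timestamped entry to today's daily log; the `append_daily_log` name is illustrative, and the heading format follows the daily-log example above:

```python
import os
from datetime import datetime

def append_daily_log(memory_dir: str, title: str, bullets: list[str]) -> None:
    """Append one timestamped entry to today's daily log, creating it if needed."""
    date = datetime.now().strftime("%Y-%m-%d")
    log_dir = os.path.join(memory_dir, "memory")
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"{date}.md")
    is_new = not os.path.exists(path)
    with open(path, "a") as f:
        if is_new:
            f.write(f"# {date}\n")  # top-level date heading on first write
        f.write(f"\n## {datetime.now().strftime('%H:%M')} - {title}\n")
        for bullet in bullets:
            f.write(f"- {bullet}\n")
```

Because it only ever appends, this is cheap enough to call after every significant tool action; curation into MEMORY.md happens separately.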

The AGENTS.md Pattern

A related but distinct file is AGENTS.md β€” a plain-text file that defines how an agent should behave (not what it remembers). Place it in any directory and a compatible harness reads it automatically:

<!-- AGENTS.md -->
# Behavior

- You are a Python backend engineer
- Use pytest for all tests
- Follow Google style docstrings
- Never modify files in /config/ without asking

# Tools

- Prefer `ruff` over `pylint` for linting
- Use `uv` for package management

AGENTS.md is declarative (what to do) while MEMORY.md is experiential (what happened). Both are injected into context at session startup but serve different purposes.
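How a harness discovers AGENTS.md files varies; one plausible sketch walks from the working directory up to a stop directory (e.g., the repository root), collecting every AGENTS.md along the way so that deeper, more specific files can take precedence:

```python
import os

def find_agents_files(start_dir: str, stop_dir: str) -> list[str]:
    """Collect AGENTS.md paths from start_dir up to stop_dir, outermost first."""
    found = []
    current = os.path.abspath(start_dir)
    stop = os.path.abspath(stop_dir)
    while True:
        candidate = os.path.join(current, "AGENTS.md")
        if os.path.exists(candidate):
            found.append(candidate)
        if current == stop:
            break
        parent = os.path.dirname(current)
        if parent == current:  # filesystem root reached
            break
        current = parent
    # Reverse so outermost comes first and the nearest file is injected last,
    # letting deeper directories override general rules.
    return list(reversed(found))
```

The concatenated contents would then be injected alongside MEMORY.md at session startup.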

Common Pitfalls

  • Treating context as unlimited: even 128K tokens fill up fast with tool schemas, file contents, and conversation history. Plan your token budget explicitly.
  • Never pruning session history: a 50-turn conversation accumulates redundant content. Compress or summarize older turns to reclaim space.
  • Writing memory too eagerly: not every turn produces knowledge worth persisting. Over-writing creates noise that dilutes useful information.
  • Forgetting to read memory at startup: an agent that never reads its memory is effectively amnesiac. This is the most common configuration bug.
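For the pruning pitfall, a minimal sketch: keep the first message (typically the system prompt) and the most recent turns, and replace the middle with a placeholder. A production harness would summarize the dropped turns with a model call instead of a bare count:

```python
def prune_history(messages: list[dict], keep_recent: int = 10) -> list[dict]:
    """Keep the first message and the last `keep_recent`; stub out the middle."""
    if len(messages) <= keep_recent + 1:
        return messages  # nothing worth pruning yet
    dropped = len(messages) - keep_recent - 1
    summary = {"role": "system",
               "content": f"[{dropped} earlier messages pruned to save tokens]"}
    return [messages[0], summary] + messages[-keep_recent:]
```

Run this before context assembly each turn so the reclaimed budget is available for new files and tool results.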

Further Reading