Workshop — Agent Harness Engineering

The premise

A CLAUDE.md is context, not enforced configuration.

Anthropic's own documentation is explicit about it: the instruction file is context the model is given, with no compliance guarantee. Most teams' mental model is often at variance — they treat it like a config that binds the agent's behavior.

That gap is where production breaks: the agent quietly writes raw SQL into the handler, bypasses the repository layer, ignores the rule you care about most. This session makes the difference measurable on your stack — then makes it controllable.

The workshop · two ways to run it

Half-day teaching, or the full day on your own repo.

The half-day teaches the framework. The full day is the complete option — the same teaching, plus a pre-session Claude Code Diagnostic and an afternoon strengthening your repo with us.

½half-day · instructional

Half-day workshop

Four 60-minute blocks (~4 hours plus breaks). Remote or in-person. Taught and demonstrated, built around examples from 300+ analyzed Claude Code setups. No repo work in the room.

01How Claude Code actually works — what the agent sees each turn, how it walks your file tree, and the distinction the day rests on: suggestion vs instruction.
02Your Control Surfaces, in practice — the levers you actually have, each with its effective use and its costly failure mode.
03The one choice — every failure collapses to a single decision: gate the rule, or make it win the attention competition.
04Measure it, don't trust it — why output passing review is not evidence the rule was followed.
leaveA shared model of how the agent works, the method to map your own Control Surfaces, and a verification discipline you can keep running.
bringOne recent example of the agent ignoring an instruction you cared about. We use it on the day.

1full day · the complete option

Full-day intensive

Everything in the half-day, plus a Diagnostic we run on your most important repo before the session, and an afternoon working that repo with you. In-person by default; remote by arrangement.

05The full field guide — the complete taxonomy of failure modes, mapped onto the three moves from the morning.
06Read your own results — we open the analysis we ran on your most-shared repo: a scored map of its Control Surfaces and your specific gaps. As a room you triage by leverage and pick the targets.
07Strengthen it with us — hands on keyboards, in small groups, working on your two or three highest-leverage gaps on that repo.
08Verify each other's work — each group checks another's change, and wires the checks so drift surfaces early, in CI.
leaveA scored Control-Surface Map, two or three gaps strengthened hands-on, an errors-only scan of your other services, and the verification discipline running on your own changes.
platformRunning this continuously across every repo — scan, fix, measure, catch drift — is the platform; the day proves the method on the repo that matters most.

Your facilitator

Philip Forshaw

Olinave · ex-Apple AI Strategy & Operations

Ex-Apple AI Strategy & Operations lead, with decades of training and workshop delivery behind him. The framework behind the workshop is patent-pending.

The method is a named taxonomy of the recurring failure modes, grounded in controlled experiments and Anthropic's own published guidance — not theory. The room works from the evidence.

Why us · the research

A named taxonomy — then measured across 300+ real setups.

Our framework is a named taxonomy of the recurring failure modes, grounded in controlled experiments and Anthropic's own published guidance. Separately, we applied it to 300+ real Claude Code setups to measure how common each failure is.

88%

bury critical rules in prose, with nothing enforcing them.

across 300+ measured setups

2.1

critical-severity gaps in the average setup.

across 300+ measured setups

46%

of the teams that use hooks at all ship at least one that doesn't actually block.

of hook-adopting setups

Aggregate findings from Olinave's measurement of public Claude Code setups. No company named.

Questions

Before you request a session

No. It's intermediate — we assume daily Claude Code use and start everyone on one shared model. The value is in making the agent's behaviour measurable and controllable on a real codebase.

For the full day, yes — the afternoon is hands-on, strengthening the repo your team shares, and a toy repo won't surface the gaps worth fixing. The half-day is instructional and needs only one real example of the agent ignoring you.

The half-day runs remote or in-person. The full day is in-person by default (remote by arrangement), because the afternoon is hands-on work on your repo.

Because agent behaviour is tied to versions. A working install on your own machine lets you reproduce exactly what we run rather than take our word for it — so check claude --version returns before you arrive.

12–15 senior engineers or architects. Small on purpose: we open with the room's real failures and spend most of the time hands-on, not lecturing.

You leave with a stronger setup and a way to verify it. Running the method continuously across every repo — scan, fix, measure, catch drift — is the platform; the day proves it on the repo that matters most.

Get started

Tell us which engagement fits.

The half-day teaches the framework; the full day does it with you on your own repo. Tell us about your team and we'll set up a call.

Start the conversation →

Half-day or full-day · intermediate · for teams already running Claude Code