Work · 01 · SimPilot · Solo project · 2025 – present · simpilot.dev

A multi-agent system
for engineering automation.

SimPilot is what I've been building to find out how far agent engineering actually goes when the work is real. You describe a problem in plain English, and a fleet of typed, tool-using agents plans the work, runs it on sandboxed compute, diagnoses its own failures, and hands back a report you can audit. It's also a stress test for the patterns I keep coming back to: long-horizon orchestration, typed protocols, durable memory, and being honest about evaluation.

§ 01 · By the numbers

The shape of the system.

04 / facts
155+ · Tools registered
23 · Lifecycle phases
3 · LLM providers
15+ · Internal packages
§ 02 · What's actually interesting

Six design ideas doing most of the work.

06 / patterns
01

A typed phase loop, not vibes

The agent moves through 23 numbered lifecycle phases. Each has a Zod-typed evidence gate, so it can't move forward without producing the artifact the next phase needs. No hidden state, no fuzzy checkpoint — a long-horizon run stays auditable end to end.
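The gate idea can be sketched in a few lines. This is a minimal illustration, not SimPilot's actual code: the phase names and evidence fields are hypothetical, and the real system uses Zod schemas where this sketch uses plain predicate functions.

```typescript
type Evidence = Record<string, unknown>;

interface Phase {
  id: number;
  name: string;
  // Gate: passes only if the artifact the next phase needs is present.
  gate: (evidence: Evidence) => boolean;
}

// Three stand-in phases; the real loop has 23.
const phases: Phase[] = [
  { id: 1, name: "parse-intent", gate: (e) => typeof e.spec === "string" },
  { id: 2, name: "plan", gate: (e) => Array.isArray(e.plan) },
  { id: 3, name: "execute", gate: (e) => e.report !== undefined },
];

// Advancement is a pure function of evidence: either the gate passes
// and the run moves to the next numbered phase, or it throws. There is
// no hidden state to make progress ambiguous.
function advance(current: number, evidence: Evidence): number {
  const phase = phases.find((p) => p.id === current);
  if (!phase) throw new Error(`unknown phase ${current}`);
  if (!phase.gate(evidence)) {
    throw new Error(`phase ${phase.name}: missing required evidence`);
  }
  return current + 1;
}
```

Because every transition either validates or throws, replaying the evidence log reproduces the exact phase history, which is what makes a long run auditable.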

02

Typed protocol between intent and execution

User intent lives in a mutable spec. The moment it's committed, it freezes into an immutable input package that every downstream tool consumes. Upstream of the freeze, anything can change; downstream, every change is auditable. Prompt-invalidation tracking catches stale assumptions.
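A sketch of that freeze boundary, with illustrative field names (the real spec shape is not shown here, and a real system would use a proper content hash where this uses a JSON string as a stand-in):

```typescript
interface Spec {
  geometry?: string;
  solver?: string;
  meshResolution?: number;
}

interface InputPackage {
  readonly spec: Readonly<Spec>;
  readonly committedAt: string;
  // Content fingerprint: downstream tools compare this to detect
  // stale assumptions (prompt invalidation). JSON is a stand-in for
  // a real content hash.
  readonly fingerprint: string;
}

// Upstream of commit(), the Spec is freely mutable. The moment it is
// committed, the package is frozen and every downstream consumer sees
// the same immutable input.
function commit(spec: Spec): InputPackage {
  const frozen = Object.freeze({ ...spec });
  return Object.freeze({
    spec: frozen,
    committedAt: new Date().toISOString(),
    fingerprint: JSON.stringify(frozen),
  });
}
```

The useful property is that "who changed what" only has two answers: before the freeze, anyone; after the freeze, nobody without producing a new package with a new fingerprint.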

03

Sandboxed tool execution, never on the host

Every tool runs inside a Docker image with a strict workspace mount and a command-policy admission gate. The same images run locally via Docker Compose and in production on AWS Batch / ECS, so there is one debugging surface in both places.
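The admission-gate idea, sketched with an illustrative allowlist and mount path (the real policy is not shown here; the binaries and checks below are assumptions):

```typescript
// Hypothetical policy: only these binaries may run in the sandbox.
const ALLOWED_BINARIES = new Set(["gmsh", "SU2_CFD", "python3"]);
const WORKSPACE = "/workspace";

interface ToolCommand {
  binary: string;
  args: string[];
}

// The gate runs before anything reaches the container. It rejects
// non-allowlisted binaries and any absolute path that escapes the
// workspace mount; relative paths resolve inside the mount anyway.
function admit(cmd: ToolCommand): void {
  if (!ALLOWED_BINARIES.has(cmd.binary)) {
    throw new Error(`policy: binary "${cmd.binary}" is not allowlisted`);
  }
  for (const arg of cmd.args) {
    if (arg.startsWith("/") && !arg.startsWith(WORKSPACE + "/")) {
      throw new Error(`policy: path "${arg}" escapes the workspace`);
    }
  }
}
```

Putting the check in front of the container rather than inside it means a bad command never even starts, which keeps failure logs about real tool failures instead of policy noise.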

04

A Guardian for when things break

Real tools fail constantly. The Guardian subsystem collects forensic evidence on failure and spawns specialist subagents — a debugger, a UI verifier, a report reviewer — to diagnose and propose a fix. Every recovery is journaled, so retries don't quietly re-introduce the same bug.
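The journaling half of that loop can be sketched directly. The specialist roles come from the description above; the journal's shape and the signature scheme are assumptions for illustration.

```typescript
interface RecoveryEntry {
  // A stable fingerprint of the failure, e.g. derived from stderr
  // and exit code (scheme is an assumption here).
  failureSignature: string;
  specialist: "debugger" | "ui-verifier" | "report-reviewer";
  proposedFix: string;
}

class RecoveryJournal {
  private entries: RecoveryEntry[] = [];

  record(entry: RecoveryEntry): void {
    this.entries.push(entry);
  }

  // Before a retry, the orchestrator asks whether this failure has
  // been seen and fixed before, so it does not quietly re-introduce
  // the same bug with the same "fix".
  seen(failureSignature: string): RecoveryEntry | undefined {
    return this.entries.find((e) => e.failureSignature === failureSignature);
  }
}
```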

05

Memory that means something

A durable memory layer stores domain gotchas, validated precedents, and project patterns. The agent stops relearning the same lesson on every project — a small idea with a large effect on long-horizon competence.
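The record categories below come from the text; everything else is an illustrative sketch, with an in-memory map standing in for the durable store (Postgres in the stack section).

```typescript
// The three memory kinds named above.
type MemoryKind = "gotcha" | "precedent" | "pattern";

interface MemoryRecord {
  kind: MemoryKind;
  topic: string;
  lesson: string;
  // How many runs have confirmed this lesson (field is an assumption).
  validatedRuns: number;
}

const memory = new Map<string, MemoryRecord>();

// Re-encountering a known lesson strengthens it instead of relearning
// it from scratch: the count accumulates across projects.
function remember(rec: MemoryRecord): void {
  const key = `${rec.kind}:${rec.topic}`;
  const prior = memory.get(key);
  memory.set(
    key,
    prior
      ? { ...prior, validatedRuns: prior.validatedRuns + rec.validatedRuns }
      : rec
  );
}
```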

06

Deep research, not just chat

A supervisor / researcher multi-agent loop reads technical literature, vendor docs, and the agent's own memory, and synthesizes a grounded answer streamed back with provenance — not hallucinated citations.
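The grounding rule in that loop is simple to state in code. This is a toy sketch of the topology, not the streaming implementation: the types and the source-filtering rule are illustrative assumptions.

```typescript
interface Finding {
  claim: string;
  // Provenance: where this claim came from (doc, paper, memory entry).
  source: string;
}

type Researcher = (question: string) => Finding[];

// The supervisor fans the question out to researchers and keeps only
// findings that carry provenance, so every citation in the synthesized
// answer traces back to a real source rather than being invented.
function supervise(question: string, researchers: Researcher[]): Finding[] {
  return researchers
    .flatMap((research) => research(question))
    .filter((finding) => finding.source.length > 0);
}
```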

§ 03 · Stack

Boring tech, where boring means proven.

06 / layers
Web
Next.js 16 · React 19 · AI SDK v7 · tRPC · Zod · Tailwind
Compute
AWS Batch / ECS · Docker sandboxes · Vercel Workflows · Redis
Data
Postgres · Drizzle ORM · S3 / MinIO · Better Auth
Solvers
OpenFOAM · SU2 · CalculiX · Gmsh · FreeCAD · Trame / PyVista
Observability
OpenTelemetry · Sentry · Langfuse
Billing
Stripe (usage-based credits)
§ 04 · Why I'm building it

The cleanest test case I know.

Manifesto

Most agent demos work because the tasks are short, the tools forgive everything, and a wrong answer is cheap. I wanted to see what happens at the other extreme.

Engineering automation is the cleanest test case I know. The ground truth exists, and you can't talk your way around a bad result.

SimPilot is my bet on what a system like that actually needs. The protocol is typed where most demos rely on free-form prompts. Memory is durable instead of stuffed into context. Tools run in sandboxes rather than on trust. The multi-agent topology is scoped narrowly enough that each agent does one job. The domain is engineering simulation, but I care more about the shape of the system underneath.