When OpenClaw started going viral, I watched with a mixture of awe and dread. It enabled compelling use cases: managing your calendar, replying to messages, controlling your browser, searching flights, all from WhatsApp. The first thing that popped into my head was the Terminator movie and Skynet. We are handing agents enormous power over our digital lives with almost no thought about what happens when things go wrong.
And things have already gone wrong. Security researchers found over 40,000 exposed OpenClaw instances leaking API keys, credentials, and months of private conversation history. CrowdStrike published a taxonomy of prompt injection attacks that could hijack an OpenClaw agent and turn it against its owner. An OpenClaw agent published a hit piece on an open source maintainer who rejected its pull request. These happened within weeks of the project going viral.
So I asked myself: how would I build a personal AI assistant while taking security seriously from the start?
The Architectural Approach
The agent needs a chokepoint: a single place every action passes through, where you can enforce policy. A post by Kenton Varda and Sunil Pai at Cloudflare called “Code Mode: the better way to use MCP” pointed me toward a good foundation. Their insight was that LLMs are excellent at writing TypeScript, since training data for real-world code dwarfs the synthetic examples used to teach tool-calling, so letting the LLM write code against a typed API works better than having it emit structured tool calls one at a time.
I realized that this approach also gives you the chokepoint. You put an MCP proxy between the agent and all the MCP servers it talks to. The proxy exposes one tool, execute_code, and the agent writes TypeScript against a typed API instead of calling tools directly. The proxy maps those function calls back to MCP operations. Every action funnels through that single proxy, and that is where you enforce policy.
Code Mode is one answer to sandboxing, but some agents need a full shell. Claude Code, for instance, runs commands, edits files, and manages its own workflow. You cannot reduce that to TypeScript snippets against a typed API.
So IronCurtain supports two sandbox architectures.
Code Mode runs the LLM’s TypeScript in a V8 isolate with no filesystem, network, or environment access. The only thing the code can do is issue typed function calls that map to MCP operations.
Docker Mode puts a full autonomous agent inside a container with --network=none and no elevated capabilities. The agent gets its own shell and filesystem but has exactly two ways out: a Unix domain socket to the MCP proxy, and another to a TLS-terminating MITM proxy that handles LLM API requests.
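A sketch of how that container invocation might be assembled. The exact flags, socket paths, environment variable name, and image name are assumptions for illustration; the essentials are the disabled network stack, dropped capabilities, and the two socket mounts as the only exits.

```typescript
// Illustrative docker arguments for Docker Mode. Flags, paths, and the
// variable name LLM_API_KEY are assumptions, not IronCurtain's actual config.
function dockerArgs(image: string): string[] {
  return [
    "run", "--rm",
    "--network=none",  // no network stack inside the container
    "--cap-drop=ALL",  // no elevated capabilities
    // The container's only two ways out: unix domain sockets to the
    // MCP proxy and to the TLS-terminating MITM proxy for LLM traffic.
    "-v", "/run/ic/mcp.sock:/sockets/mcp.sock",
    "-v", "/run/ic/llm.sock:/sockets/llm.sock",
    // The agent sees a fake key; the MITM proxy swaps in the real one.
    "-e", "LLM_API_KEY=sk-fake-000000000000",
    image,
  ];
}
```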
Both modes funnel every action through a trusted process that acts as an MCP proxy. That process holds the policy engine and decides: allow, deny, or escalate to the human. Approved calls go to MCP servers running in their own OS-level sandboxes with minimum permissions.
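The decision the trusted process makes can be sketched as a pure function over the intercepted call. This is a toy stand-in with hypothetical types and hard-coded rules, not IronCurtain's actual engine, but it shows the three-way verdict and why determinism matters: the same call always gets the same answer.

```typescript
// Illustrative shape of the policy engine's verdict; types and rules
// are a toy example, not IronCurtain's real implementation.
type Verdict = "allow" | "deny" | "escalate";

interface McpCall {
  server: string; // e.g. "git", "filesystem"
  tool: string;   // e.g. "push", "read_file"
  args: Record<string, unknown>;
}

// A deterministic toy policy: local git reads pass, pushes go to the
// human, and anything unrecognized is denied by default.
function decide(call: McpCall): Verdict {
  if (call.server === "git" && ["status", "log", "diff"].includes(call.tool)) {
    return "allow";
  }
  if (call.server === "git" && call.tool === "push") {
    return "escalate";
  }
  return "deny";
}
```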
Credential separation falls out of this architecture. In Code Mode, the agent has no access to credentials because it has no access to anything except the typed API. Docker Mode is more interesting: the container receives a fake API key that passes format validation but does nothing. The MITM proxy intercepts outbound LLM requests, swaps the fake key for the real one, and forwards upstream. The real key never enters the container.
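The key swap itself is a small transformation at the proxy boundary. A minimal sketch, assuming a bearer-token header and a particular fake-key format (both assumptions; the real proxy's mechanics may differ):

```typescript
// Toy sketch of the MITM proxy's credential swap. The header name and
// fake-key format are assumptions for illustration.
const FAKE_KEY = "sk-fake-000000000000";

function rewriteAuthHeader(
  headers: Record<string, string>,
  realKey: string,
): Record<string, string> {
  const out = { ...headers };
  // Only requests carrying the expected fake key receive the real one;
  // anything else is forwarded untouched and will fail upstream.
  if (out["authorization"] === `Bearer ${FAKE_KEY}`) {
    out["authorization"] = `Bearer ${realKey}`;
  }
  return out;
}
```

The nice property is that exfiltrating the container's environment gains an attacker nothing: the only key in there is the fake one.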
Policy in Plain English
Writing security policy is genuinely hard. Even experts struggle with it. The languages are difficult, edge cases multiply fast, and most people give up and open everything. That is how most agent frameworks end up with all-or-nothing permissions: full access or sandboxed, pick one.
You also cannot delegate policy enforcement to the LLM itself. LLMs are stochastic. The same prompt that blocks a dangerous action today might approve it tomorrow. Security policy requires determinism, and determinism has to live outside the model.
It does not need to be this way. One of my favorite papers, by Microsoft Research about automating privacy compliance at Bing, introduced a policy language called LEGALEASE that expressed policy in something close to natural language and compiled it into enforceable rules, dramatically lowering the barrier to good security hygiene.
IronCurtain applies this idea to agent security. You write a constitution for your agent in plain English. No DSL, no YAML, no regex. Something like: “the agent may read and write files in the project directory, may perform read-only git operations without approval, and must ask me before pushing to any remote.” IronCurtain compiles that into deterministic rules the trusted process enforces on every MCP call. Two rules from such a constitution:

“The agent is allowed to perform all local read and write git operations within the sandbox.”

“The agent must ask for human approval for all other git operations.”
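One hypothetical compiled form of the two constitution lines above, to make the English-to-rules step concrete. The rule schema, the first-match-wins semantics, and the default-deny fallback are all assumptions for illustration; IronCurtain's actual compiled representation may differ.

```typescript
// Hypothetical compiled rules; schema and semantics are illustrative,
// not IronCurtain's actual representation.
interface Rule {
  match: { server: string; tools: string[] | "*" };
  verdict: "allow" | "escalate";
}

const compiled: Rule[] = [
  // "all local read and write git operations within the sandbox"
  {
    match: { server: "git", tools: ["status", "log", "diff", "add", "commit", "branch"] },
    verdict: "allow",
  },
  // "human approval for all other git operations"
  { match: { server: "git", tools: "*" }, verdict: "escalate" },
];

// First matching rule wins, so ordering encodes specificity.
function evaluate(server: string, tool: string): "allow" | "escalate" | "deny" {
  for (const r of compiled) {
    if (
      r.match.server === server &&
      (r.match.tools === "*" || r.match.tools.includes(tool))
    ) {
      return r.verdict;
    }
  }
  return "deny"; // default-deny for anything the constitution does not cover
}
```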
For cases where the policy gives no clear answer, an auto-approver can recognize when the user’s explicit instructions already cover the requested action. “Push my changes to origin” approves a git push without interrupting the user. “Go ahead” or “continue” always escalates to a human. Every decision is logged.
The auto-approver reduces alert fatigue but risks wrongly approving an action. To limit exposure, it sees only the human’s prompt and sanitized escalation information, makes a single-turn decision, and requires explicit user intent toward the specific action. This keeps the prompt injection surface small, but the feature remains an explicit opt-in.
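The shape of that narrow decision can be sketched as follows. This toy version substitutes simple word matching for the real model call, and the input fields are hypothetical; the point is the deliberately small input surface and the refusal to treat vague instructions as intent.

```typescript
// Toy sketch of the auto-approver's narrow, single-turn input. The field
// names are hypothetical, and the matching logic stands in for the real
// model-based decision.
interface Escalation {
  userPrompt: string;    // what the human originally asked for
  actionSummary: string; // sanitized, e.g. "git push to origin"
}

function autoApprove(e: Escalation): boolean {
  const prompt = e.userPrompt.toLowerCase();
  // Vague instructions never count as intent toward a specific action.
  if (["go ahead", "continue", "yes"].includes(prompt.trim())) return false;
  // Require the prompt to explicitly name the action's key words.
  return e.actionSummary
    .toLowerCase()
    .split(" ")
    .filter((w) => !["git", "to"].includes(w))
    .every((w) => prompt.includes(w));
}
```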
What It Can Do Today
IronCurtain is a research prototype. The current feature set covers filesystem access, git operations, web fetching with HTML-to-markdown conversion, and web search through providers like Brave and SerpAPI. You can also connect IronCurtain to Signal, turning it into a bot that is securely paired with your phone so you can send tasks and receive results from anywhere over end-to-end encrypted messaging. More MCP servers will come. The Claude Code integration is working but still has rough edges.
The core architecture is solid: sandbox isolation, policy engine, credential separation, audit log. Those are the foundations everything else builds on.
The Threat Model
A word about what “secure” means here. I tend to mistrust anyone who uses that word without qualification. Every system I have seen described as secure has eventually been broken. “Secure” in the context of an agent relates to whether its actions match the intentions of the human who prompted it, even after the agent has gone on a long multi-turn journey. That turns out to be hard for reasons beyond malice.
Prompt injection is an unsolved problem, but it is the acute form of a chronic one. LLMs drift from their instructions over multi-turn conversations even without adversarial input. Injection exploits and accelerates this natural tendency. IronCurtain cannot prevent either. It can contain the blast radius through sandbox isolation, and when the policy engine denies an action, it returns the constitutional reason for the denial. That corrective signal helps re-anchor the model toward the original intent, counteracting drift rather than just blocking the immediate request.
The goal is to keep the gap between “yes, exactly what I wanted” and “how did this happen” as small as possible, and when the agent does cross a line, to contain the damage. This is where the name comes from. In theater, an iron curtain is a fireproof safety barrier between the stage and the audience. If something catches fire on stage, the curtain drops and contains the disaster. The agent performs on stage. Your files, your credentials, your systems are in the audience. IronCurtain is the barrier.
What’s Next
I have been fortunate to get early feedback from people whose judgment I trust. I asked Dino Dai Zovi and Michal Zalewski to look at the overarching approach and both of them were conceptually aligned and provided good feedback that I plan to integrate.
AI assistants are going to keep getting more capable and more deeply embedded in daily life. That trajectory is clear. If we treat security as an afterthought, something to bolt on once the features are built, we will end up with powerful agents running on brittle foundations. IronCurtain is my proposal for a different starting point: an architecture where security is baked in from the beginning, where sandboxing and policy enforcement make the agent more trustworthy and therefore more usable. Security and usability should reinforce each other. I want people to take this approach, improve it, and prove that we can build agents that are both capable and safe.
If you want to try it: npx @provos/ironcurtain
The code is at github.com/provos/ironcurtain and there is more detail on the architecture at ironcurtain.dev. I am genuinely interested in feedback, especially from people who find the approach wrong or incomplete.
