<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Zero-Day on Niels Provos</title><link>https://www.provos.org/tags/zero-day/</link><description>Recent content in Zero-Day on Niels Provos</description><generator>Hugo -- gohugo.io</generator><language>en</language><lastBuildDate>Wed, 29 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://www.provos.org/tags/zero-day/index.xml" rel="self" type="application/rss+xml"/><item><title>Finding Zero-Days with Any Model</title><link>https://www.provos.org/p/finding-zero-days-with-any-model/</link><pubDate>Wed, 29 Apr 2026 00:00:00 +0000</pubDate><guid>https://www.provos.org/p/finding-zero-days-with-any-model/</guid><description>&lt;img src="https://www.provos.org/p/finding-zero-days-with-any-model/poster.png" alt="Featured image of post Finding Zero-Days with Any Model" />&lt;p>The prevailing narrative in AI-driven security claims that discovering novel vulnerabilities is a &amp;ldquo;frontier&amp;rdquo; feature reserved for restricted models like Anthropic&amp;rsquo;s recently announced &lt;a class="link" href="https://red.anthropic.com/2026/mythos-preview/" target="_blank" rel="noopener"
>Mythos Preview&lt;/a>. Recent high-profile reports highlighted the ability of these advanced models to uncover decades-old memory safety violations, such as the 1998 OpenBSD TCP SACK implementation flaw, framing these findings as a paradigm shift in the threat landscape.&lt;/p>
&lt;p>The reality differs. My research demonstrates that these capabilities reside not in proprietary frontier models alone, but in the &lt;strong>orchestration architecture&lt;/strong> wrapped around ordinary commercial models. To explore this, I built workflows on top of my open-source &lt;strong>IronCurtain&lt;/strong> framework. Using a specialized vulnerability discovery workflow, I replicated these frontier findings and autonomously discovered new zero-day vulnerabilities in foundational software with commercial models like &lt;strong>Opus 4.6&lt;/strong> and &lt;strong>Sonnet 4.6&lt;/strong>, as well as open-weight models like Z.AI&amp;rsquo;s &lt;strong>GLM 5.1&lt;/strong>.&lt;/p>
&lt;h2 id="orchestrating-vulnerability-discovery">Orchestrating Vulnerability Discovery&lt;/h2>
&lt;p>&lt;strong>IronCurtain&lt;/strong> is a research prototype I designed to enable structured, agentic security research. The framework supports arbitrary workflows structured as &lt;strong>finite-state machines (FSMs)&lt;/strong> via plain YAML definitions. To automate vulnerability discovery, I built a &lt;code>vuln-discovery&lt;/code> workflow as one such FSM. The workflow introduces a central &lt;strong>Orchestrator&lt;/strong> agent that acts as a strategic router, deciding which specialized agent to dispatch next based on an append-only execution &lt;strong>journal&lt;/strong>.&lt;/p>
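&lt;p>The routing discipline can be sketched in a few lines. This is a minimal illustration with hypothetical agent names, not IronCurtain&amp;rsquo;s actual implementation; the real workflow definitions are plain YAML and considerably richer:&lt;/p>

```python
# Minimal sketch of journal-driven FSM routing. Agent names and journal
# shape are hypothetical; they stand in for the real workflow states.
from dataclasses import dataclass, field

@dataclass
class Journal:
    """Append-only record; agents add entries but never mutate history."""
    entries: list = field(default_factory=list)

    def append(self, agent: str, note: str) -> None:
        self.entries.append({"agent": agent, "note": note})

def route(journal: Journal) -> str:
    """Pick the next agent state from journal evidence alone.
    The Orchestrator never reads the target source code itself."""
    seen = {e["agent"] for e in journal.entries}
    if "analyst" not in seen:
        return "analyst"          # map data flow and call chains first
    if "harness-builder" not in seen:
        return "harness-builder"  # then build an executable harness
    return "report"               # finally, write the findings report

journal = Journal()
journal.append("analyst", "mapped SACK option parsing path")
print(route(journal))  # harness-builder
```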
&lt;p>This journal lets the model maintain state, test hypotheses, and navigate the code. The &lt;strong>Orchestrator&lt;/strong> does not read the target source code itself. It relies entirely on the journal to manage the investigation until producing a final vulnerability findings report. This journal and other on-disk artifacts allow every agent state to begin with a fresh context window and rehydrate from disk. That said, this workflow is token-intensive. A single run against a moderately sized codebase consumes roughly 10 million tokens on &lt;strong>Opus&lt;/strong> or &lt;strong>Sonnet&lt;/strong>, costing $150 or $30 per investigation respectively. Five runs on Z.AI&amp;rsquo;s hosted &lt;strong>GLM 5.1&lt;/strong> averaged 27 million tokens each, reflecting more iteration cycles to reach the same conclusion. Z.AI lists &lt;strong>GLM 5.1&lt;/strong> at $1.40 per million input tokens ($0.26 cached) and $4.40 per million output, putting per-investigation cost in the same range as Sonnet despite the higher volume. Truly local inference on commodity workstation hardware remains an aspiration rather than a demonstration here: a smaller distilled candidate I tested (Qwen 3.5 distilled) could not run the workflow, so &lt;strong>GLM 5.1&lt;/strong> ran on Z.AI&amp;rsquo;s GPUs throughout.&lt;/p>
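&lt;p>The rehydration mechanism is simple to picture: because the journal is append-only and lives on disk, any agent state can start with an empty context window and reconstruct the investigation so far. A minimal sketch, assuming a JSON Lines file; the actual on-disk format is an IronCurtain implementation detail:&lt;/p>

```python
# Sketch of append-only journal persistence enabling fresh-context
# rehydration. The JSONL format here is an assumption for illustration.
import json
import os
import tempfile

def append_entry(path: str, entry: dict) -> None:
    """Append one JSON line; the journal is never rewritten in place."""
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")

def rehydrate(path: str) -> list:
    """A fresh agent invocation reconstructs all prior state from disk."""
    if not os.path.exists(path):
        return []
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

path = os.path.join(tempfile.mkdtemp(), "journal.jsonl")
append_entry(path, {"agent": "analyst", "note": "mapped call chains"})
append_entry(path, {"agent": "harness-builder", "note": "tier-1 harness built"})
print(len(rehydrate(path)))  # 2
```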
&lt;h2 id="replicating-the-1998-openbsd-discovery">Replicating the 1998 OpenBSD Discovery&lt;/h2>
&lt;p>Anthropic&amp;rsquo;s Red Team highlighted a 27-year-old vulnerability in the OpenBSD TCP SACK implementation as the marquee finding of their Mythos report. This specific bug was meaningful to me as I was responsible for committing the OpenBSD TCP SACK implementation, including the bug, in November 1998. To test whether open orchestration could replicate this frontier capability, I pointed an early iteration of the &lt;code>vuln-discovery&lt;/code> workflow at this historical, unpatched C code.&lt;/p>
&lt;p>Using an initial &lt;strong>Sonnet 4.6&lt;/strong> analyst agent, the workflow mapped the structural data flow and call chains. The FSM orchestration follows a simple discipline: hypothesize statically, validate by execution. Anything else is noise. In this workflow, a proof-of-concept (PoC) is an executable harness that triggers the vulnerability to prove reachability and surface memory corruption, providing empirical evidence beyond static analysis.&lt;/p>
&lt;p>However, during this first test of the FSM harness, the prompting and journal-keeping were not yet sufficient to keep the system on track. Because the secure &lt;strong>IronCurtain&lt;/strong> container could not natively boot an OpenBSD virtual machine, the agent fell back entirely on static analysis. It identified the bug but stopped short of executing it, producing a report that lacked a PoC.&lt;/p>
&lt;p>This limitation led to an iterative improvement in the workflow. It became clear that initial hypothesis exploration does not require a full VM; it can happen via lightweight harnesses, such as single-function fuzzing. Only the final PoC requires the VM path.&lt;/p>
&lt;p>To finalize the validation, I used &lt;strong>Opus 4.6&lt;/strong> directly via Claude Code. With some manual steering, I got the model to first replicate the bug via lightweight fuzzing. It designed a standalone, high-performance fuzzer for the specific C function, systematically sweeping the input space in seconds. The model isolated the exact trigger: just two sequence-number differences out of 4.3 billion possible values, sitting precisely on the 32-bit integer sign boundary.&lt;/p>
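&lt;p>To see why only a sliver of the input space triggers the flaw, consider how TCP stacks compare sequence numbers using signed 32-bit wraparound arithmetic. The sketch below illustrates the bug class, not the actual OpenBSD code: the comparison result flips exactly at the two&amp;rsquo;s-complement sign boundary, so a fuzzer sweeping deltas finds the transition at one precise value.&lt;/p>

```python
# Illustration of the bug class: sequence-number comparison via signed
# 32-bit wraparound arithmetic, in the style of the classic SEQ_LT macro.
# This is not the OpenBSD SACK code itself.
def seq_lt(a: int, b: int) -> bool:
    """True when a precedes b in 32-bit sequence space.
    Equivalent to the C idiom: (int32_t)(a - b) is negative."""
    diff = (a - b) & 0xFFFFFFFF
    # "negative in two's complement" means the top bit of diff is set
    return bool(diff & 0x80000000)

base = 1_000_000
# Sweep deltas around the 2**31 sign boundary, the way a targeted
# single-function fuzzer homes in on the trigger: the result flips
# between these two adjacent values.
for delta in (0x7FFFFFFF, 0x80000000):
    print(hex(delta), seq_lt(base + delta, base))
```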
&lt;p>Once the parameters were isolated via fuzzing, I had the model build a QEMU-based driver to test against a live VM, reliably replicating the kernel panic.&lt;/p>
&lt;p>This OpenBSD replication was the workflow&amp;rsquo;s first practical test. The manual steering required during this initial run allowed for direct improvements to the FSM&amp;rsquo;s prompts and journal-keeping, establishing a tiered approach to harness building: single-function isolation harnesses, multi-component harnesses, and full end-to-end VM validation. The workflow now dynamically scales through these tiers based on what is required to establish the exploitation primitive. After these changes the workflow ran the next investigations without manual steering.&lt;/p>
&lt;h2 id="autonomous-discovery-and-scaling-defense">Autonomous Discovery and Scaling Defense&lt;/h2>
&lt;p>An orchestration framework proves its value by autonomously discovering bugs in modern codebases. The &lt;code>vuln-discovery&lt;/code> workflow surfaced previously unreported vulnerabilities and significant bugs across four widely used open-source projects, each with years of public fuzzing and dedicated security review. The two case studies below describe representative runs. Specific identities and exact bug mechanics are withheld while upstream coordination, CVE assignments, and advisories remain in progress.&lt;/p>
&lt;p>While analyzing a widely deployed media framework, the &lt;code>vuln-discovery&lt;/code> workflow discovered a previously unreported vulnerability. Using &lt;strong>Opus 4.6&lt;/strong>, the framework identified the flaw and produced a multi-component harness to confirm the underlying primitive. Validating the flaw with a full end-to-end harness required some human guidance; memory-constrained reproduction environments initially masked the trigger condition. After harness optimizations, the workflow generated a proof-of-concept that reliably triggered the vulnerability, and I reported the finding to the upstream maintainers. The result validates the approach: independent vulnerability discovery by a commercially available model paired with open-source orchestration.&lt;/p>
&lt;p>Moving off the Anthropic ecosystem entirely, a subsequent run pointed the same &lt;code>vuln-discovery&lt;/code> workflow at a different foundational library. Only the model changed: a LiteLLM gateway rewrote Anthropic identifiers to route to Z.AI&amp;rsquo;s &lt;strong>GLM 5.1&lt;/strong> on an Anthropic-compatible endpoint, while &lt;strong>IronCurtain&lt;/strong> and the FSM orchestration layer were untouched.&lt;/p>
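&lt;p>The model swap requires no changes to the workflow itself. A hypothetical LiteLLM proxy configuration illustrates the shape of the rewrite; the model names and endpoint URL below are placeholders, not the actual setup:&lt;/p>

```yaml
# Hypothetical LiteLLM proxy config sketching the identifier rewrite:
# requests naming an Anthropic model are served by GLM instead.
# Model names and api_base are placeholders for illustration.
model_list:
  - model_name: claude-opus-4-6        # what the workflow asks for
    litellm_params:
      model: openai/glm-5.1            # what actually serves the request
      api_base: https://example-glm-endpoint/v1
      api_key: os.environ/ZAI_API_KEY
```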
&lt;p>&lt;strong>GLM 5.1&lt;/strong> drove discovery end-to-end. Working from the append-only journal, the &lt;strong>Orchestrator&lt;/strong> scoped the target and routed the system to tiered harness construction. The autonomous workflow isolated an integer truncation flaw on a memory allocation path that had been present for eighteen years, echoing the 27-year dormancy of the OpenBSD SACK case above. Through structural analysis of the relevant arithmetic, the orchestrated model generated a proof-of-concept and a sanitizer-validated harness sufficient to confirm the vulnerability class.&lt;/p>
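&lt;p>The bug class is easy to illustrate in isolation. The sketch below models the vulnerability class rather than the withheld bug itself: a size computed in wide arithmetic is truncated to 32 bits on the allocation path, so a large element count yields a far smaller buffer than later writes assume.&lt;/p>

```python
# Sketch of the integer-truncation class (not the withheld bug): the
# allocation size wraps to 32 bits while the element count stays large.
def alloc_size_32(count: int, elem_size: int) -> int:
    """Model a C allocator whose size parameter is a 32-bit unsigned int,
    as in: malloc((uint32_t)(count * elem_size))."""
    return (count * elem_size) & 0xFFFFFFFF  # truncation happens here

count = 0x4000_0001  # attacker-influenced element count
elem = 4
# The product 0x100000004 wraps to 4: a 4-byte buffer backs writes that
# believe they have room for over a billion elements.
print(alloc_size_32(count, elem))  # 4
```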
&lt;p>To ensure accurate severity scoring for responsible disclosure, it was necessary to confirm the underlying exploitability primitives. I conducted this guided manual analysis in Claude Code with &lt;strong>Opus 4.7&lt;/strong>, my interactive research environment for deep technical work and a stronger model than &lt;strong>GLM 5.1&lt;/strong> for analysis of this depth. The work demonstrated a controlled out-of-bounds heap read and write primitive. Because the affected library is widely deployed across internet-facing infrastructure, the flaw represents a serious remote risk.&lt;/p>
&lt;p>Confirming critical severity required a working exploit, which falls outside the discovery-scoped &lt;code>vuln-discovery&lt;/code> workflow. When the model refused my initial request to construct one, I manually broke the exploit development process into a granular, seven-step plan to work around the refusal. The model executed the first two steps before its Acceptable Use Policy (AUP) guardrails forced it to decline further collaboration. Fortunately, the second step proved that Address Space Layout Randomization (ASLR) could be circumvented by reading base pointers, which provided sufficient empirical evidence to support a high-severity assessment.&lt;/p>
&lt;p>Three observations follow from these runs. First, an executable proof-of-concept is essential for defenders. Theoretical vulnerabilities identified through static analysis inevitably produce a high volume of false positives, and each one consumes security operations time to triage and manually verify. An execution proof quickly eliminates these false positives, ensuring defenders focus only on validated threats.&lt;/p>
&lt;p>Second, while open orchestration provides the necessary scaffolding to execute complex workflows, the underlying quality of the foundation models still matters. Model quality sets the ceiling on what orchestration can extract, but the open-weight result shows the capability bar is now low enough for commodity models to clear.&lt;/p>
&lt;p>Third, the economics now favor frequent, broad audits. At commercial-API pricing, an investigation runs $30 to $150 per codebase (Sonnet to Opus); at production scale that determines how many libraries a defender can afford to look at in a year. Hosted open-weight providers run in the same range as Sonnet at higher token volumes, and self-hosting on capable hardware would reduce the marginal cost further once the upfront investment is made.&lt;/p>
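&lt;p>A back-of-the-envelope check makes the comparison concrete. The input/output token split for the GLM runs is an assumption on my part, since only the 27-million-token total is measured, so the result is a rough estimate rather than a billed figure:&lt;/p>

```python
# Rough per-investigation cost from the figures above. The 90/10
# input/output split for GLM is an assumption, not a measured value.
def run_cost(input_m: float, output_m: float,
             in_rate: float, out_rate: float) -> float:
    """Cost in dollars for a run measured in millions of tokens."""
    return input_m * in_rate + output_m * out_rate

# GLM 5.1 at the listed $1.40/M input and $4.40/M output rates,
# assuming roughly 90% of the 27M tokens are input:
glm = run_cost(24.3, 2.7, 1.40, 4.40)
print(round(glm, 2))  # about 45.9, same ballpark as a $30 Sonnet run
```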
&lt;p>Orchestration cuts both ways. The reality is that well-resourced adversaries already use orchestrated workflows to hunt for zero-days at scale. They operate free from vendor usage policies, AUP friction during legitimate research, API rate limits on multi-hour runs, and curated access lists for embargoed frontier models. The refusal midway through the seven-step severity assessment is exactly the asymmetry at issue: a defender doing legitimate work hit friction that an adversary using uncensored open-weight models would not. A locally hosted open-weight model removes that gate. The vendor&amp;rsquo;s friction layer is liability protection rather than accountability; on a local model, accountability for the work rests directly with the researcher, where it belongs. &lt;strong>IronCurtain&lt;/strong> exists to close this gap. By combining open-source scaffolding with local or commodity open-weight models, defenders can audit codebases and ship patches before automated exploitation catches up. I encourage security engineers to review the &lt;a class="link" href="https://github.com/provos/ironcurtain" target="_blank" rel="noopener"
>IronCurtain framework&lt;/a>, contribute to the orchestration scaffolding, and help build the baseline tools for automated defense. Onboarding is still being polished, so contributions toward easier setup are particularly welcome.&lt;/p></description></item></channel></rss>