Why the Spec-to-Code Gap Cannot Be Closed
Read OpenAI’s Symphony README or GitHub’s Spec Kit and you’ll get the same idea: the spec is the hard part. The code is output.
Gabriella Gonzalez tested this. Her diagnosis: the spec was not really a spec; it was pseudocode in markdown, and even then it wasn’t enough. Making a spec detailed enough to actually work is the same work as writing the code.
This is correct, and not just for this particular case; it is true in general. A sufficiently detailed spec is code, because the only way to close the gap is to specify every dependency, every coupling, every shared-state decision, every error-propagation path. At that point you are no longer writing a spec but literal code. That’s not closing the gap; that’s conceding it.
I argue that this gap is structurally inevitable for three reasons.
1. Behavior vs. Topology: A Phase Transition
A spec describes intent and behavior: what the system should do. Log in. Show dashboard. Send email. Show products. Process payment.
Code creates topology: the coupling between modules, the shared state, the call chains, the dependency graph. When a developer writes import TTLCache or injects a shared database connection, they draw an edge on a graph that didn’t exist before.
Under success, the system is happily a sequence of steps, including the anticipated error cases. Under failure, the same system is rigidly a graph of dependencies.
When something breaks, the blast radius follows the edges of the dependency graph: who shares state with whom, who calls whom, what is coupled to what. The failure follows the topology, not the spec. The same bug in the same function produces a minor error in one codebase and a total collapse in another, depending entirely on what depends on what.
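This claim is mechanical enough to sketch. In a toy model where the dependency graph is an adjacency list (the module names below are invented for illustration), the blast radius of a failure is just graph reachability:

```python
from collections import deque

def blast_radius(dependents, failed):
    """Return every module transitively affected when `failed` breaks.
    `dependents` maps each module to the modules that depend on it."""
    affected = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for dep in dependents.get(node, ()):
            if dep not in affected:
                affected.add(dep)
                queue.append(dep)
    return affected

# The same bug (the cache dies) in two hypothetical topologies.
tight = {"cache": ["auth", "billing"], "auth": ["dashboard"], "billing": ["reports"]}
loose = {"cache": ["auth"], "auth": [], "billing": ["reports"]}

print(sorted(blast_radius(tight, "cache")))  # everything downstream falls over
print(sorted(blast_radius(loose, "cache")))  # the failure stays contained
```

Nothing in a behavioral spec distinguishes `tight` from `loose`; the difference exists only in the adjacency lists, which is to say, in the implementation.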
And the topology is an emergent property of the implementation.
It comes into being through specific implementation choices: which library to use, which module to import, which state to share, etc. These choices are not in the spec. They cannot be. The spec was written before they were made.
In normal mode, the system is liquid, flowing along the path the spec describes, but once in failure mode, the same system crystallizes. The cascading propagation, invisible while everything flowed, is suddenly the only thing that matters.
We are reduced to trying to describe the physics of a building (how something falls down) using only the floor plan (what the rooms are for). The physics of load-bearing walls only emerges once the bricks are laid.
2. Underdetermination
Underdetermination means that a set of observations is compatible with multiple theories.
The spec-to-code relationship has exactly this structure. A single spec is compatible with multiple correct implementations.
Consider a high-level spec: “handle authentication.” Code A uses a shared Redis-backed session cache, with the billing module reading from the same cache. Code B uses stateless JWT tokens with no shared infrastructure between auth and billing. Both satisfy the spec. A reviewer checking against the spec approves either one. An LLM could generate either one. Under the high-level spec, they are identical.
Their topologies are completely different. Code A couples authentication to billing through shared infrastructure. A Redis failure cascades into both systems simultaneously. Code B has no such coupling. The same infrastructure failure that collapses the entire application in Code A is a contained, single-service outage in Code B.
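The contrast can be sketched directly. The classes and names below are hypothetical stand-ins, not a real auth system, but they show how two implementations can be behaviorally interchangeable while drawing entirely different dependency edges:

```python
import hashlib
import hmac

# Code A (hypothetical): session state lives in a shared cache, and billing
# reads the same cache. Auth and billing are now coupled through one
# piece of infrastructure.
class SharedCache:  # stand-in for Redis
    def __init__(self):
        self.sessions = {}

class AuthA:
    def __init__(self, cache):
        self.cache = cache  # edge: auth -> cache
    def login(self, user):
        self.cache.sessions[user] = {"user": user}
        return user

class BillingA:
    def __init__(self, cache):
        self.cache = cache  # edge: billing -> cache (the hidden coupling)
    def can_charge(self, user):
        return user in self.cache.sessions

# Code B (hypothetical): stateless signed tokens. Auth and billing share
# nothing; the token itself carries the session.
SECRET = b"demo-only-secret"

def login_b(user):
    sig = hmac.new(SECRET, user.encode(), hashlib.sha256).hexdigest()
    return f"{user}.{sig}"

def can_charge_b(token):
    user, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, user.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

A reviewer checking against the spec sees two systems that both log users in. Only the constructor arguments reveal that in Code A, a cache outage takes billing down with it.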
The spec underdetermines the topology: Multiple valid topologies satisfy the same spec. The failure dynamics, the thing that actually matters in production, are a property of the selected topology, not the behavior, and that selection is underdetermined.
3. Prose Is Graph-Blind. Code Is Graph-Explicit.
There is a structural reason the spec cannot see the topology, and it is not a matter of adding more detail.
In a word processor, you write “Step 1, then Step 2.” The medium does not force you to see that Step 1 and Step 2 might share a database connection pool. The dependencies between components are invisible in the text. You can describe a system’s goals comprehensively and never once represent a dependency edge, because prose has no syntax for dependencies. It has syntax for intent.
In an IDE, dependencies are literal. You import them. You @Inject them. You await them. Every dependency is a line of code, a visible, explicit edge on a graph. The compiler complains if a dependency is missing. The type system forces explicitness.
The medium of code forces a dependency graph into your field of vision whether you want to see it or not.
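This explicitness is concrete enough that trivial tooling can recover the graph. A sketch (the module name and source string are invented) using Python’s `ast` module to extract the dependency edges that no prose description would surface:

```python
import ast

def import_edges(module_name, source):
    """Return (importer, imported) edges found in a module's source."""
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            edges.extend((module_name, alias.name) for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            edges.append((module_name, node.module))
    return edges

# A hypothetical auth module: two lines of code, two edges on the graph.
src = "import redis\nfrom billing import invoice\n"
print(import_edges("auth", src))  # [('auth', 'redis'), ('auth', 'billing')]
```

No equivalent tool exists for a prose spec, because there is nothing in the prose to parse: the edges were never written down.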
This isn’t about the writer’s intelligence or experience. The spec writer and the developer can be the same person, equally brilliant, equally experienced. Put that person in front of a word processor and they will produce a document that describes goals. Put them in front of an IDE and they will start seeing connections (imports, shared state, coupling, edge cases) that simply don’t appear in prose. Not because they couldn’t think of them, but because the medium wasn’t representing them. A Sapir-Whorf effect for tooling.
You could hand a developer a finished codebase and ask them to write a spec, and the spec would still be graph-blind. The medium of prose can’t represent dependency edges whether they exist yet or not.
The spec will be reliably incomplete even when written by brilliant people, not because they’re incapable, but because the medium they’re working in doesn’t surface the gaps.
What This Means for AI
The promise of agentic coding is that an LLM can read a spec and produce working code. And it can, mostly. LLMs are trained overwhelmingly on functional code, and they generate implementations that satisfy behavioral requirements competently. Given “handle authentication”, an LLM will produce something that authenticates.
But the LLM has no basis for choosing between topologies. Code A and Code B both satisfy the spec. The training data contains both patterns. The LLM picks whichever is more common, and common patterns are common precisely because they handle common cases.
The failure regime is where uncommon, project-specific topological interactions govern. The dependency graph of any real codebase is unique, a product of hundreds of implementation decisions made over months or years. A new piece of code doesn’t just need to satisfy the spec. It needs to fit into an existing graph without creating cascade paths that the existing architecture can’t absorb. The training data can’t help here, because the graph doesn’t exist in any training set. It exists only in this codebase, right now.
When a human reviewer checks AI-generated code against the spec, they check behavior: does it do what the spec says? If the code works and the tests pass, the review is approved.
But the topology the AI chose (the specific coupling it introduced, the shared state it created, the dependency edges it drew) is invisible in behavioral review. The cascade paths hide in implementation choices that satisfy the spec perfectly while creating failure dynamics that can only be evaluated by architectural review grounded in knowledge of the codebase, not just the spec.
Conclusion
The developer’s job is not just translating specs into code. Any competent developer can do that. A capable LLM can do that.
The developer’s job is resolving the underdetermination: choosing, among the multiple valid implementations that satisfy the spec, the topology whose failure dynamics are manageable.
This is judgment about a structure that the spec cannot represent, the reviewer cannot see in a behavioral check, and the LLM has no basis for preferring. It is the choice of which graph to live with, which web of dependencies to build, knowing that when something eventually breaks, the blast radius will follow the edges of that graph.
The spec-to-code gap is unbridgeable not because specs lack detail, but because the act of specifying is graph-blind while the act of coding creates the graph. Correctness is a property of the output. Reliability is a property of the architecture. The spec can fully determine the first. It is structurally blind to the second.
The gap is categorical.