Why AI app builders break on the second prompt — and how diff-aware multi-agent pipelines fix it

The first prompt almost always works. You type something like “build a fitness app with daily workouts, a streak counter, and a paywall,” the model thinks for a few seconds, and a screen renders. Buttons tap. Navigation moves between tabs. The demo gif you screen-record looks legitimate. This is the moment that sells AI app builders, and it is also the moment most of them peak.

Then you send the second prompt. Maybe it is something modest: change the primary button to green. Maybe it is structural: add a friends tab with shared streaks. Either way, you watch the same model regenerate the project, and somewhere between the third and fifth file something quietly snaps. The navigator loses a screen. The icon library swaps from one package to another. The paywall copy that took you twenty minutes to tune is back to the placeholder. The preview goes red.

This pattern is consistent enough across single-pass LLM app builders that it deserves a name. Call it the second-prompt problem. The first generation is a greenfield request to a stateless model and the model gets to invent a coherent project from scratch. The second generation is asking that same stateless model to hold the previous project in its head, identify the minimal change, and emit it without disturbing anything else. Those are different jobs. The first one is generative; the second one is surgical. Most builders ship a generative tool and then route every prompt through it, which is why prompt two looks like a brand-new app every time.

Why single-pass loops fail

A single-pass loop is structurally simple: the user prompt and (optionally) the prior code go into one big context window, the model produces the next version of the project, and the new project replaces the old one. There is no router deciding what kind of change this is. There is no contract on what the model is allowed to touch. There is no validator between the model and the live preview. The model is trusted to do the right thing because there is nothing else in the loop that could.
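
In code, the naive shape is roughly this (a minimal sketch with hypothetical names; llm, serialize, and parseProject stand in for whatever a given builder actually calls):

// illustrative: the single-pass shape, helper names assumed
type Project = Record<string, string>; // file path -> file contents

declare const SYSTEM_INSTRUCTIONS: string;
declare const llm: { complete(context: string): Promise<string> };
declare function serialize(p: Project): string;
declare function parseProject(raw: string): Project;

async function refine(prompt: string, prior: Project | null): Promise<Project> {
  // One context window holds everything: instructions, prior code, new ask.
  const context = [SYSTEM_INSTRUCTIONS, prior ? serialize(prior) : "", prompt].join("\n\n");

  // One call, no router, no diff contract, no validator:
  // whatever comes back replaces the whole project.
  return parseProject(await llm.complete(context));
}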

Three failure modes follow from that shape. First, context saturation: by the fourth or fifth refinement, the prior project plus the new prompt plus the system instructions are crowding the window, and the model starts forgetting which screens it was supposed to keep. Second, no structural diff awareness: the model has no way to say “the user wants the button color changed, so I will edit one StyleSheet entry and emit nothing else.” Without a diff representation, every refinement is implicitly a full rewrite. Third, no validation gate: a hallucinated import or a missing entrypoint goes straight to the preview, and the user sees a red screen instead of a controlled retry.

The deeper issue is that there are no contracts between turns. A working app is a graph of files, dependencies, navigation routes, and assumed runtime libraries. When the model regenerates that graph from natural language alone, it has to guess the graph again, and small guessing differences accumulate into broken builds.

What “diff-aware multi-agent” means

The fix is to stop treating “generate an app” as one job. AppGenie splits the work across six specialized stages, with a router in front that decides how much of the pipeline each prompt actually needs to run. The stages are A1 through A6, and each one exists because of a specific failure mode that hits single-pass loops.

A1 — Intent classifier. The first thing the pipeline does with a prompt is decide what kind of request it is: a brand-new app, a cosmetic tweak, a feature addition, or a structural rebuild. A1 reads the prompt in isolation and emits an intent label. Doing this as a discrete step means downstream stages can be skipped when they are not needed, which keeps refinements cheap and bounded.
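
A minimal sketch of what A1 emits, assuming a label set the essay only names in prose:

// illustrative: one plausible shape for A1's output
type Intent = "new_app" | "cosmetic_tweak" | "feature_addition" | "structural_rebuild";

// A1 reads the prompt in isolation: no project context, just a label
declare function classifyIntent(prompt: string): Promise<Intent>;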

DiffRouter — Cosmetic / Feature / FullRegen. The router takes A1's intent and the existence of a prior project and decides which scope to run. Cosmetic scope means a small visual change: the pipeline skips the architect entirely and emits a tiny patch. Feature scope means new screens or new data: the architect plans a delta against the existing manifest. FullRegen is reserved for greenfield prompts. The router is what stops a button-color request from ever reaching the regeneration codepath.
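
Continuing the sketch, the routing decision can be a small pure function of that intent plus whether a prior project exists (the real rules live in the pipeline; this is the shape, not the source):

// illustrative: scope selection, assumes the Intent type above
type Scope = "cosmetic" | "feature" | "full_regen";

function route(intent: Intent, hasPriorProject: boolean): Scope {
  if (!hasPriorProject || intent === "new_app") return "full_regen"; // greenfield only
  if (intent === "cosmetic_tweak") return "cosmetic"; // skip the architect, emit a tiny patch
  // feature additions and structural rebuilds both reach the architect;
  // how aggressively it re-plans against the old manifest is the pipeline's call
  return "feature";
}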

A2 — Prompt expander. For new generations, A2 turns a one-line user idea into a richer brief: implied screens, plausible data shapes, the integrations a user of this category will assume. This stage exists because vague prompts produce vague apps, and it is much cheaper to expand the prompt deterministically than to recover from an under-specified architecture three stages later.

A3 — PRD writer. A3 commits the expanded brief to a structured product requirements document: screens, flows, data models, acceptance criteria. On refinement runs, A3 also writes against the existing PRD so it can describe the delta rather than restate the whole product. This is the contract that the architect and code generator will obey, and it is the artifact a human can read before approving the build.
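
As a data shape, the PRD might look something like this (field names are assumptions; the essay only names the categories):

// illustrative: a plausible PRD shape, field names assumed
interface PRD {
  screens: { name: string; purpose: string }[];
  flows: string[]; // e.g. "onboarding -> paywall -> home"
  dataModels: Record<string, Record<string, string>>; // model -> field -> type
  acceptanceCriteria: string[];
  delta?: string; // on refinement runs: what changed relative to the prior PRD
}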

A4 — Architect. A4 turns the PRD into a manifest: the file tree, the component graph, the navigation map, the dependencies, and the per-file responsibilities. Crucially, A4 only runs when the router decides the change is big enough to need it. A cosmetic patch goes straight from A1/router to the code generator with the existing manifest, untouched. A feature patch runs A4 in delta mode against the previous manifest so it can add a screen without re-planning the app.
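
The manifest, sketched the same way (again an assumed shape, matching the pieces this paragraph lists):

// illustrative: a plausible manifest shape from A4
interface Manifest {
  files: { path: string; responsibility: string }[]; // per-file responsibilities
  componentGraph: Record<string, string[]>; // component -> components it renders
  navigation: Record<string, string>; // route name -> screen file
  dependencies: string[]; // must stay inside the Snack allowlist
}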

A5 — Code generator (Patch vs Full). A5 is the only stage that writes code, and it runs in one of two modes depending on the router's scope decision. In Full mode it emits the entire project file by file against a locked starter template. In Patch mode it emits a structured Patch JSON: a list of PatchOps that each describe a single edit to a single file. The patch contract is what makes prompt two safe — there is literally no way for the code generator to accidentally rewrite a screen it was not asked to touch, because the schema does not let it.
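
As a type, that contract can be as small as a discriminated union (a sketch of the real schema: the edit shape matches the example further down, and create ops appear in Feature-scoped patches):

// illustrative: a PatchOp union consistent with the examples in this piece
type PatchOp =
  | { op: "edit"; file: string; find: string; replace: string }
  | { op: "create"; file: string; content: string };

interface Patch {
  ops: PatchOp[];
}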

A6 — Validator. A6 runs static checks on whatever A5 produced before it ever reaches the live preview. It parses every emitted file, verifies the Expo entrypoint is present, checks every import against the locked dependency allowlist Snack supports, and rejects output that does not pass. When validation fails, the orchestrator can retry the offending stage with a corrective message rather than push broken code to the user. This is the gate that turns a prompt-two disaster into a transient error the pipeline recovers from.

The patch contract

A patch contract is the part most people miss when they try to bolt “refinement” onto a single-pass builder. The reason refinements can be cheap and surgical is that the project starts from a locked template — in AppGenie's case a starter called rn-base-v1 baked into the backend image — and the code generator is constrained to express edits as structured operations against that template, not as a freeform code blob.

Concretely, when the router classifies a prompt as Cosmetic, A5 is invoked in patch mode and asked for something shaped like:

// illustrative — actual schema lives in apps/api/src/appgenie/patch
{
  "ops": [
    {
      "op": "edit",
      "file": "src/screens/HomeScreen.tsx",
      "find": "bg-blue-500",
      "replace": "bg-emerald-500"
    }
  ]
}

That JSON is parseable and validatable, and it applies deterministically to the existing project on disk. Nothing else in the file tree moves. A Feature-scoped prompt produces a longer Patch with create operations for new files plus edit operations that wire them into the navigator, but the same property holds: the model was never given the option to silently rewrite an unrelated screen. The patch contract is the structural reason refinements stay safe across many turns.
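
Applying such a patch is mechanical. A sketch, assuming the Patch type from earlier and treating the project as an in-memory file map:

// illustrative: deterministic patch application over a file map
function applyPatch(project: Record<string, string>, patch: Patch): Record<string, string> {
  const next = { ...project };
  for (const op of patch.ops) {
    if (op.op === "create") {
      next[op.file] = op.content; // new file; accompanying edit ops wire it in
    } else {
      const current = next[op.file];
      if (current === undefined || !current.includes(op.find)) {
        throw new Error(`patch does not apply cleanly to ${op.file}`); // fail loud, never guess
      }
      next[op.file] = current.replace(op.find, op.replace);
    }
  }
  return next; // every untouched file is byte-identical to the input
}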

The locked template matters here too. Because every project starts from the same starter — same Expo version, same navigation library, same styling layer (twrnc), same entrypoint — the architect and code generator share a stable mental model of what already exists. They are not negotiating the framework on every prompt. That stability is what lets the patch contract be small.

Validation

Even with a router and a patch contract, the model can still produce something wrong: a hallucinated import, a typo in a component name, a dependency Snack cannot resolve. A6 exists to catch those before they hit the preview. It runs cheap static checks: it parses each emitted file as JSON or TypeScript, it verifies the Expo entrypoint is intact, it checks every imported package against the Snack allowlist baked into the starter template, and it confirms imports actually resolve to files in the manifest.
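
A sketch of those checks, assuming a parse helper and treating the allowlist and manifest as simple sets (the real validator is presumably stricter):

// illustrative: static checks in the A6 spirit; parseOk stands in for JSON.parse / a TS parser
declare function parseOk(path: string, source: string): boolean;

function validate(
  files: Record<string, string>,
  allowlist: Set<string>, // package names the Snack runtime can resolve
  manifestPaths: Set<string>, // file paths the manifest says exist
): string[] {
  const errors: string[] = [];
  // Snack projects expect an App.* entrypoint
  if (!files["App.tsx"] && !files["App.js"]) errors.push("missing Expo entrypoint");
  for (const [path, source] of Object.entries(files)) {
    if (!parseOk(path, source)) errors.push(`${path}: does not parse`);
    for (const m of source.matchAll(/from\s+["']([^"']+)["']/g)) {
      const spec = m[1];
      if (spec.startsWith(".")) {
        // relative import: a real resolver would normalize against `path`;
        // the sketch just asks whether any manifest file matches the tail
        const tail = spec.split("/").pop()!;
        if (![...manifestPaths].some((p) => p.includes(tail))) {
          errors.push(`${path}: unresolved import ${spec}`);
        }
      } else {
        // bare import: the package (including scope) must be on the allowlist
        const pkg = spec.startsWith("@") ? spec.split("/").slice(0, 2).join("/") : spec.split("/")[0];
        if (!allowlist.has(pkg)) errors.push(`${path}: ${pkg} is not on the Snack allowlist`);
      }
    }
  }
  return errors; // non-empty means retry or raise, never preview
}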

When something fails, the orchestrator does not just throw. It distinguishes transient failures (retry the stage with corrective feedback) from validation failures (raise a structured pipeline error with a user-facing message) from provider failures (surface the upstream error and stop). The frontend sees a specific reason, not a generic error toast. That distinction is what turns “the AI builder broke” into “the AI builder caught a problem and told you what it was,” which is a very different user experience.
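
One way to encode that taxonomy (names hypothetical, behavior as described above):

// illustrative: three failure classes, three different exits
type PipelineFailure =
  | { kind: "transient"; stage: string; feedback: string } // retry the stage with corrective feedback
  | { kind: "validation"; errors: string[]; userMessage: string } // structured, user-facing
  | { kind: "provider"; upstream: string }; // surface the upstream error and stop

function handle(failure: PipelineFailure, retriesLeft: number): "retry" | "raise" | "abort" {
  switch (failure.kind) {
    case "transient":
      return retriesLeft > 0 ? "retry" : "raise"; // bounded retries, then fail loudly
    case "validation":
      return "raise"; // the frontend gets a specific reason, not a generic toast
    case "provider":
      return "abort";
  }
}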

What you actually ship

The output of all this is not a screenshot or a sandbox illusion. It is a real Expo + React Native project sitting in your project workspace. The same code runs in the live Snack preview, gets versioned into a snapshot on every successful generation, and is yours to export. You can edit it by hand in the file tree, save it back, and the next refinement will route around your hand edits rather than clobber them.

Two things are worth being explicit about, because the AI app builder space has promised them often and shipped them rarely. Native build and store submission flows (EAS, TestFlight, Play Console) are not part of AppGenie's in-product loop today; the export path is real code you can take into your own EAS configuration. Subscription billing wiring is a separate concern from app generation; the generated apps include hooks where billing belongs, not a turnkey merchant integration. Both are deliberately scoped narrowly so the part the pipeline owns — the codegen loop — stays honest about what it does.

Try it on your own prompt

The fastest way to feel the difference is to run a refinement chain that would break a single-pass builder. Pick a category, generate the first version, and then issue three or four prompts in a row that touch different layers: copy change, color change, new screen, new data field. Watch the router pick a different scope each time, and watch the unrelated parts of the project stay exactly where you left them.

The second-prompt problem is not a model problem. It is a pipeline problem, and pipelines are the part you can engineer. Routing the prompt by intent, contracting the generator to patches, validating before preview, and snapshotting every version is what turns an AI demo into something you can iterate on for an afternoon without losing what you built in the morning.

Run a four-prompt refinement chain

Open the builder, generate any app, and send four refinements back to back. Watch the diff-aware pipeline keep the parts you did not ask to change.