You Can’t Prompt Away Your LLM Problems

Author(s): Venkat Peri

Originally published on Towards AI.

When an LLM feature breaks in production, the first instinct in the room is to open the prompt and reword it. We had that instinct too. Then we built a production assistant for financial advisors and kept a record of every LLM-related failure the system hit, along with the fix that actually closed each one. Across the whole build, almost nothing that mattered was fixable by editing a prompt. The durable fixes were architectural. The single time we tried a prompt-only fix on the hardest problem, it made things measurably worse, and we reverted it. This is the case for treating the model as one untrusted component in a larger system, written from the failures that taught us to do it.

The fix that made it worse

Routing was our most unstable surface, and it was unstable in a way that prompt editing could not reach. The same question routed one way on one run and a different way on the next, with no code change in between. On the ambiguous edges, routing accuracy sat around 56 to 64 percent and was non-deterministic from run to run. A question like “rank my households by AUM” came back as a clarification request on one run and a confident answer on the next.

The obvious move was to explain the boundary better in the routing prompt. We added a block of guidance describing how to tell the categories apart. The classifier got less stable, not more, so we took the prose back out.

The reason mattered more than the symptom. The flakiness was not bad luck. It was structural, and it got worse every time we added a domain. The router guessed an abstract category first, and then code mapped that category to a concrete tool. The category was a lossy middle step. When the guess was wrong, the right tool was unreachable, and there was no signal left to recover from. The fix was to delete the abstract category.
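To make the lossy middle step concrete, here is a minimal sketch of the two-stage shape described above. The category names, the tool names, and the `CATEGORY_TOOLS` mapping are hypothetical, invented for illustration, and are not the system's actual catalog. The model's category guess is passed in as a plain string so the sketch runs without an LLM.

```python
from typing import Optional

# Hypothetical catalog: each abstract category exposes only its own tools.
CATEGORY_TOOLS: dict[str, list[str]] = {
    "households": ["rank_households_by_aum", "list_households"],
    "clarification": [],  # no tools: the user is asked a follow-up question
}


def reachable_tools(guessed_category: str) -> list[str]:
    """Stage two: code maps the model's guessed category to concrete tools."""
    return CATEGORY_TOOLS.get(guessed_category, [])


def route_two_stage(guessed_category: str, wanted_tool: str) -> Optional[str]:
    """Return the wanted tool only if the guessed category can reach it.

    A wrong guess in stage one yields None: the right tool is simply not in
    the candidate set, and nothing downstream can recover it.
    """
    tools = reachable_tools(guessed_category)
    return wanted_tool if wanted_tool in tools else None
```

In this sketch, a guess of `"clarification"` for a ranking question leaves `rank_households_by_aum` unreachable no matter how good the second stage is, which is the structural problem the paragraph describes.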
We collapsed routing into a single stage that picks one concrete tool directly from the catalog, with the scope derived from the tool it picked rather than guessed up front. We narrowed what each decision had to weigh, so any one call discriminates among a handful of tools instead of the full set. We grounded each tool with a few example utterances carried as structured data, not as prose inside the prompt. Accuracy moved from a flaky 98 percent on the old design, down through a real regression to 72 percent right after the rewrite, and up to 100 percent on both of our evaluation suites once the grounding was in place.

The improvement came from removing decisions the model did not need to make. That set the pattern for most of what followed: take work away from the model wherever code can do it, and catch what is left in deterministic guardrails.

A related failure looked like a bad model decision and was actually a missing option. Asked to “show the first account,” the model picked the nearest available tool, a holdings lookup, because no account-listing tool existed, and it invented an ordinal to make the answer fit. The model was choosing the closest thing within reach. We fixed it by building the tool it needed, not by writing a prompt that apologized for the gap.

A value the model invented is not the value you computed

A model fills a blank confidently, and confidence is the dangerous part. “Create a task for 2pm” put the string “2pm” into a field our code expected to hold a computed timestamp. The parser tried to read “2pm” as an ISO instant, threw, and the user saw a generic server error.

This crash only reproduced with real model output. Every offline test we had written passed an empty argument map, and an empty map never triggers the bug. The lesson we wrote down for the next person chasing a live crash they cannot reproduce: vary the arguments the model actually produces, because empty-argument mocks hide a whole class of failure.
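The testing lesson above can be sketched as a small sweep that feeds a handler the kinds of arguments a model actually emits, not only an empty map. Everything named here is hypothetical: `parse_due_at`, the `due_at` field, and the sample argument maps are illustrative stand-ins, not the system's real handler or schema.

```python
from datetime import datetime
from typing import Optional


def parse_due_at(args: dict) -> Optional[datetime]:
    """Return the parsed timestamp, or None if it is missing or not ISO 8601.

    Validating before trusting the value turns a model-invented string such
    as "2pm" into a recoverable None instead of an unhandled exception.
    """
    raw = args.get("due_at")
    if not isinstance(raw, str):
        return None
    try:
        return datetime.fromisoformat(raw)
    except ValueError:
        return None


# Hypothetical argument maps, from the empty mock to model-like output.
REALISTIC_ARGS = [
    {},                                 # the empty mock that hid the bug
    {"due_at": "2pm"},                  # a string the model invented
    {"due_at": "2024-05-01T14:00:00"},  # a well-formed ISO 8601 timestamp
    {"due_at": 1400},                   # wrong type entirely
]


def run_argument_sweep() -> list[Optional[datetime]]:
    """Call the parser on every sample; none of them should raise."""
    return [parse_due_at(args) for args in REALISTIC_ARGS]
```

A parser without the `try` block would pass on the first sample and crash on the second, which is the gap an empty-argument mock cannot reveal.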
The crash was one instance of a wider habit. When asked to create a task with no subject, the model invented a subject that was a bare echo of the tool’s own name, and because the field was now non-empty, it sent the task straight to confirmation instead of asking. It presented a specific row as “the first” on a list our schema never ordered, reading fixture order as if it were real order. Given a comparison of two figures, it would do the arithmetic itself and sometimes get it wrong.

These were not instruction problems, and the fixes had the same shape every time. Detect the value the model invented and refuse it. A matcher catches tool-name-echo placeholders and forces a clarification. A check drops a non-ISO time argument and re-derives the time from the user’s words in code. A flag on unordered lists, plus a guard that returns the full list with a notice instead of asserting an ordinal. Percentages, deltas, and filtered totals moved into code, with the prompt reduced to one line: do not compute this yourself. The principle underneath all of them is to never trust a model-populated value as if it were the typed or computed value. Validate before you parse.

The guardrail is also code

We run a deterministic grounding check over every answer that cites figures. It compares the numbers in the rendered answer against the numbers the tool actually returned, and when the answer cites a figure with no support, it withholds the answer behind a message saying we could not fully verify it. This is the reason a fabricated “that’s 5% of your $8.3M,” which the renderer invented on three or four runs out of six, never reached a user. The render prompt already banned invented statistics. The ban alone did not hold, so deterministic shaping that […]