Dynamic Prompt Composition: Lessons from Shipping LLM Analysis in Production

Most LLM demos hard-code a single prompt. Production rarely gets that luxury. When you ship an analysis product where each customer defines what “analysis” even means, the prompt stops being a string you write and becomes an artifact your system composes at runtime. That shift — from authored prompts to composed prompts — is where the interesting engineering lives.

The problem: one pipeline, many schemas

XTRACT analyses call recordings, but no two teams want the same thing. A collections team cares about promise-to-pay signals and objection handling. A SaaS sales team cares about discovery quality and next-step commitment. Same audio, completely different questions. So the analysis dimensions are user-configurable: a team picks the signals they want, and the system has to deliver them reliably — for configurations the engineers never saw.

That means the prompt cannot be static. It is assembled per request from a set of user-defined parameters, then dispatched to the model, then parsed back into a structured result. Three stages, each with its own failure modes.

Stage 1 — Composition

A composed prompt is built from parts: a stable system preamble, the transcript, the user’s selected dimensions, and an output contract. The discipline is keeping the variable part (user config) cleanly separated from the invariant part (instructions, format rules). Treat the prompt like a function call: invariant logic in the body, user config as typed arguments. When something goes wrong, you want to know immediately whether the bug is in the template or the input.

The practical rule that saved the most time: never let user text silently become instruction text. User-provided labels and descriptions get fenced into clearly delimited blocks so a customer can’t accidentally (or deliberately) rewrite the system’s behaviour. Prompt injection isn’t a hypothetical when the prompt is half user-authored.

Stage 2 — Stability across configurations

The hardest requirement isn’t quality on one config — it’s consistent quality across arbitrary ones. A prompt that scores well on a five-dimension sales config can fall apart on a twelve-dimension collections config that asks for subtler judgments. Output drifts: fields get skipped, formats wander, confidence collapses on long transcripts.

Two things help. First, a rigid output contract — a fixed structure the model fills, regardless of how many dimensions the user added, so parsing never has to guess. Second, bounding the variability: cap dimensions per call, chunk long transcripts, and keep the instruction scaffold identical no matter the config. The model should only see more data, never more instructions, as configs grow.

Stage 3 — Evaluation when the spec is user-defined

You cannot hand-write a golden set for configurations that don’t exist yet. So evaluation becomes structural rather than semantic: does every requested dimension appear? Does the output parse? Are confidence and formatting within bounds? Layer targeted spot-checks on representative configs on top of that, and you can ship changes without manually reviewing thousands of outputs.

The throughput tax

All of this runs under a real-time constraint. STT latency is variable, the LLM call dominates wall-clock time, and the pipeline still has to sustain high throughput across thousands of calls a day. The lesson that generalises: in an LLM system, the prompt is not a creative-writing problem — it is an interface, and you engineer it like one. Versioned, contract-bound, injection-safe, and measured.