Build
Implementation patterns: how to scaffold around probabilistic output, surface failure modes early, and keep human judgment in the loop while the code lands.
-
When: You’re designing a user-facing AI system and choosing how to present the model’s reasoning, recommendations, or confidence.
Use: Interfaces that prompt active interpretation rather than passive consumption. Give users enough to act on and enough feedback to test their hypotheses, but stop short of presenting the system’s output as authoritative. Build in friction that requires users to evaluate whether the AI’s explanation is trustworthy in the current context.
Evidence: Qualitative analysis of Arknights showed that an interface that withholds and occasionally misleads, when paired with rich feedback, produced a more robust user-system relationship than one offering full transparency. Players developed working mental models through action, failure, and revision rather than through dashboards. The same pattern applies to XAI interfaces, where comprehensive explanation often fails to produce comprehension.
-
When: You’re building a system that depends on LLM output for any progression, scoring, or downstream action.
Use: Failure-mode prototyping before happy-path implementation. Simulate hallucinated answers, malformed outputs, schema violations, and difficulty misfires up front. Decide how the system should respond to each before you build the success path, so that fallbacks, retries, and validation are part of the architecture rather than patches added under pressure.
Evidence: University of Calgary developers building two LLM-driven games reported that incorrect outputs were not bugs but fairness violations. They documented cases like a math question with no correct answer option and patterned outputs (correct answer always in the same multiple-choice slot) that broke the implicit contract with the player. The team explicitly recommends prototyping failure modes before the happy path.
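As a rough illustration, here is a minimal Python sketch of failure-mode-first prototyping. The fixtures and the handle_question_payload handler are hypothetical, not the Calgary teams' code; the point is that structural checks catch some of these cases up front, and the ones that slip through (no correct option, positional bias) are exactly the gaps this exercise is meant to surface before the success path exists.

```python
import json

# Hypothetical failure fixtures: canned "model outputs" for the failure modes
# the system must survive, written before any happy-path code exists.
FAILURE_FIXTURES = {
    "malformed_json": "Sure! Here is the question: {question: ...",  # unparseable
    "schema_violation": json.dumps({"question": "2 + 2?", "choices": ["4"]}),  # missing fields
    "no_correct_option": json.dumps({
        "question": "What is 7 * 8?",
        "choices": ["54", "55", "57", "58"],  # 56 is not offered
        "answer_index": 2,
    }),
    "positional_bias": json.dumps({
        "question": "What is 3 + 5?",
        "choices": ["8", "6", "7", "9"],
        "answer_index": 0,  # correct answer always lands in slot 0
    }),
}


def handle_question_payload(raw: str) -> dict:
    """Hypothetical handler: return a usable question or an explicit fallback, never crash."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"status": "retry", "reason": "malformed output"}
    if not {"question", "choices", "answer_index"}.issubset(data) or len(data["choices"]) < 2:
        return {"status": "retry", "reason": "schema violation"}
    # Structural checks pass for no_correct_option and positional_bias; deciding
    # what catches those (answer verification, slot shuffling) is the point of
    # running this exercise before building the success path.
    return {"status": "ok", "payload": data}


for name, raw in FAILURE_FIXTURES.items():
    result = handle_question_payload(raw)
    print(f"{name:18} -> {result['status']}: {result.get('reason', 'accepted')}")
```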
-
When: Any LLM call whose output flows into downstream code, a database, a UI, or another model.
Use: Strict output schemas plus a validation pipeline. Define the exact format you require, parse against it, and reject or retry on schema violations. Constrain the model’s output space wherever possible. Treat free-form text from the model with the same skepticism you’d apply to a public API response or a form submission from an untrusted client.
Evidence: Calgary developers building Wizdom Run and Sena consistently described “building scaffolding around the model’s outputs”: structured schemas, validation pipelines, strict output formats. One reflection noted that ensuring LLM responses were formatted exactly as expected was what kept the back-end design coherent. Without that scaffolding, the probabilistic output broke deterministic gameplay rules.
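A minimal sketch of that scaffolding, assuming Pydantic v2 for schema validation; QuizItem, the call_llm parameter, and the retry wording are illustrative stand-ins, not the Wizdom Run or Sena pipeline.

```python
from pydantic import BaseModel, Field, ValidationError  # assumes Pydantic v2


class QuizItem(BaseModel):
    """The exact shape downstream code accepts; anything else is rejected."""
    question: str
    choices: list[str] = Field(min_length=2, max_length=6)
    answer_index: int = Field(ge=0)


def get_validated_item(call_llm, prompt: str, max_retries: int = 3) -> QuizItem:
    """Parse model output against the schema; retry on violations instead of patching downstream."""
    for _ in range(max_retries):
        raw = call_llm(prompt)
        try:
            item = QuizItem.model_validate_json(raw)
        except ValidationError as err:
            # Treat free-form model text like an untrusted client submission.
            prompt += f"\n\nYour last reply had {err.error_count()} schema errors. Return only JSON matching the schema."
            continue
        if not 0 <= item.answer_index < len(item.choices):
            prompt += "\n\nanswer_index must point at one of the choices."
            continue
        return item
    raise RuntimeError("No schema-valid output within budget; fall back to a curated question bank.")
```

Rejecting at the boundary like this keeps the deterministic rules downstream free of special cases for malformed model output.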
-
When: A developer is in a vibe-coding or agent-driven coding session where the AI is writing or modifying many files in rapid succession.
Use: External version control with frequent, small commits, or an explicit instruction to the AI to log its own changes to a file. Commit before each new conversational turn that touches code, not at the end of the session. If you cannot commit per turn, ask the model to summarize the diff so you have a recoverable trail.
Evidence: Across the qualitative study’s 190,000 words of practitioner data, runaway code changes were one of the two highest-severity pain points. Practitioners reported sessions where 30+ files accumulated in the change log with hours of uncommitted work, leading to “fuckup cascades” that were difficult to unwind. External version control was one of the two most universal community-derived best practices.
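One way to make the per-turn checkpoint cheap is a small helper run before each conversational turn. The sketch below shells out to ordinary git; the checkpoint function and its commit-message format are hypothetical, and it assumes the session already runs inside a git repository.

```python
import subprocess
from datetime import datetime, timezone


def checkpoint(turn_summary: str) -> None:
    """Commit the working tree before the next AI turn touches code."""
    subprocess.run(["git", "add", "--all"], check=True)
    # `git diff --cached --quiet` exits with 1 when something is staged.
    if subprocess.run(["git", "diff", "--cached", "--quiet"]).returncode != 0:
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        subprocess.run(
            ["git", "commit", "-m", f"checkpoint before AI turn ({stamp}): {turn_summary}"],
            check=True,
        )
```

Calling this before every turn keeps each conversational step one revert away, rather than hours of uncommitted work deep in a cascade.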
-
When: You’re designing AI guidance, recommendations, or copilots where users will rely on the model’s stated reasoning to make their own decisions.
Use: Interfaces that mark explanations as provisional and give the user a low-cost way to disagree with them. Show the model’s reasoning, but also show the user the cost of taking it on faith. Pair recommendations with the option to override and observe consequences, so users practice judgment instead of compliance.
Evidence: Arknights reframes player agency from “take meaningful action” to “evaluate whether the system’s explanations are trustworthy.” When the in-game AI deliberately offered misleading deployment suggestions, players who had built independent mental models through earlier play could reject the recommendation and succeed. Players who deferred failed. The game design treated continuous evaluation, not blind trust, as the skill.
-
When: You’re using an LLM as a critic, judge, or self-evaluator on its own (or another model’s) output.
Use: A critic prompt that asks “how much did this step gain compared to the previous state?” instead of “how good is this overall?” Score marginal change, not absolute quality. Where possible, surface concrete pre/post artifacts (the answer at step N-1 versus step N) so the comparison is grounded in observable change rather than vibes.
Evidence: Budget-Aware Value Trees rely on this distinction as a core technique. LLMs are well-documented to be overconfident when scoring their own absolute reasoning quality. The authors found that scoring the delta (the change) was much harder for the model to inflate, and this made step-level critics reliable enough to drive pruning decisions.
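A minimal sketch of a delta-scoring critic; the prompt wording and the score_step helper are illustrative, not the paper’s exact critic, but they show the shift from “how good is this overall?” to “what did this step add?”

```python
# Illustrative prompt for a step-level critic that scores the delta between
# states rather than absolute quality.
DELTA_CRITIC_PROMPT = """You are judging a single reasoning step for this question:
{question}

State before the step:
{previous_state}

State after the step:
{current_state}

How much closer is the final answer after this step than before it?
Reply with one number from 0 (nothing new) to 10 (decisive progress), then one
sentence naming the concrete new fact or elimination the step produced.
"""


def score_step(call_llm, question: str, previous_state: str, current_state: str) -> float:
    """Return a 0-1 marginal-gain score grounded in the pre/post artifacts."""
    reply = call_llm(DELTA_CRITIC_PROMPT.format(
        question=question,
        previous_state=previous_state,
        current_state=current_state,
    ))
    try:
        return float(reply.strip().split()[0].rstrip(".,")) / 10.0
    except (IndexError, ValueError):
        return 0.0  # unparseable critique: treat as no measurable gain
```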
-
When: Your agent is mid-task, has already spent some compute on a particular line of reasoning or tool calls, and the most recent step yielded little or no new information.
Use: An explicit early-exit rule on doomed paths. If the marginal gain from a step falls below a threshold, abandon the branch and try a different approach. Build the escape mechanism into the agent loop. Do not rely on the model to notice it is stuck, because LLMs are subject to sunk-cost behavior and will keep exploring failed paths.
Evidence: Budget-Aware Value Trees treat this as a first-class principle. Over four multi-hop QA benchmarks, the technique outperformed standard agents partly by pruning low-gain branches early, freeing compute for more promising ones. Spending 4x more tool calls on standard agents did not produce 4x better answers, indicating that without an early-exit mechanism, additional compute often goes to doomed paths.
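A minimal sketch of the escape mechanism built into the loop itself. MIN_GAIN and MAX_STEPS are placeholders you would tune, not values from the paper, and score_step can be any marginal-gain critic (for example, the delta critic from the previous sketch with its LLM client bound in).

```python
MIN_GAIN = 0.15   # marginal-gain floor below which a branch is abandoned (tune per task)
MAX_STEPS = 8     # hard budget per branch


def run_branch(take_step, score_step, question: str, state: str) -> tuple[str, bool]:
    """Run one line of reasoning; bail out as soon as a step stops paying for itself."""
    for _ in range(MAX_STEPS):
        new_state = take_step(question, state)
        gain = score_step(question, state, new_state)  # marginal gain of this step
        if gain < MIN_GAIN:
            # Don't wait for the model to notice it is stuck: prune this branch
            # and let the caller spend the remaining budget on another approach.
            return state, False
        state = new_state
    return state, True
```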