    Use Cases · May 26, 2026 · Updated May 25, 2026 · 9 min read

    When AI fails: postmortem template + recovery playbook

    AI will go wrong in production. A postmortem template, a recovery playbook, and the cultural moves that turn failures into compounding improvements.

    TL;DR
    • AI will go wrong in production. The question is whether your team learns from it or hides from it.
    • Six categories of AI failure — hallucination, prompt injection, drift, runaway cost, escalation overload, scope creep.
    • A postmortem template that covers what happened, why, what changes, what to test. Blameless. Specific.
    • The cultural moves that turn AI failures into compounding improvements instead of recurring crises.
    Quick answer
    What do you do when AI fails in production?
    When AI fails in production, treat it like any other production incident: contain the blast radius first (disable the feature if needed), investigate the root cause within a week, then write a blameless postmortem covering what happened, why, what changes, and what to test. Categorise the failure type (hallucination, prompt injection, drift, runaway cost, escalation overload, scope creep) and add a regression case to your eval suite so it cannot recur silently. Teams that postmortem AI failures compound improvements; teams that hide them repeat the same failures every quarter.

    AI in production will go wrong. The model will hallucinate a customer order. A prompt injection will surface in your logs. Cost will spike on a Saturday for reasons nobody understands. The escalation queue will overflow.

    How a team responds to these moments separates teams that get good at AI from teams that quietly stop trusting it. Below is the working frame.

    The six failure categories

    1. Hallucination

    The AI invented something. A non-existent product, a wrong policy, a fictional historical event, a fabricated source citation.

    Diagnostic: did the model have the grounding data it needed? If yes, the prompt let it stray. If no, the architecture skipped retrieval.

    2. Prompt injection

    Untrusted input contained instructions; the AI followed them. The AI revealed something it should not have or took an action outside its scope.

    Diagnostic: where did the injected text come from? What capability did it exploit? See prompt injection explained.

    3. Drift

    The AI's output quality declined over weeks. Edge cases that used to work no longer do. Customers noticed before you did.

    Diagnostic: model version changed? User patterns shifted? Prompt mutated through small tweaks? Eval set is stale?

    4. Runaway cost

    AI spend doubled on a weekday for no obvious reason. Worse: spend has been climbing 10% week-over-week and nobody noticed.

    Diagnostic: which feature drove the spend? A loop with no exit condition? Increased retries due to a quality regression? Test traffic hitting production?
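
    The slow climb is the easier of the two patterns to check mechanically. A minimal sketch, assuming you can export weekly spend totals per feature; the function names, the 10% threshold, and the three-week window are illustrative choices, not a prescribed policy:

```python
from typing import Sequence


def week_over_week_growth(weekly_spend: Sequence[float]) -> list[float]:
    """Fractional change between consecutive weeks (0.10 means +10%).

    Weeks that follow a zero-spend week are skipped to avoid dividing by zero.
    """
    return [
        (current - previous) / previous
        for previous, current in zip(weekly_spend, weekly_spend[1:])
        if previous > 0
    ]


def sustained_climb(
    weekly_spend: Sequence[float],
    threshold: float = 0.10,  # assumed alert threshold
    weeks: int = 3,           # assumed number of consecutive weeks
) -> bool:
    """True when each of the last `weeks` changes meets or exceeds `threshold`."""
    recent = week_over_week_growth(weekly_spend)[-weeks:]
    return len(recent) == weeks and all(g >= threshold for g in recent)


# Hypothetical weekly totals for one feature.
print(sustained_climb([100, 112, 126, 142]))  # True: three weeks above +10%
print(sustained_climb([100, 101, 100, 102]))  # False: roughly flat
```

    Running the check per feature rather than on total spend answers the first diagnostic question (which feature drove it) at the same time.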

    5. Escalation overload

    The AI is escalating 60% of cases instead of the expected 15%. The human queue is drowning. Customers wait. Agents burn out.

    Diagnostic: confidence calibration off? Edge cases changed? A prompt update made the AI more cautious?
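
    A rate check against a known baseline is enough to catch this early. The sketch below assumes you count escalated and total cases per time window; the 15% baseline comes from the example above, while the tolerance multiplier and function names are hypothetical:

```python
def escalation_rate(escalated: int, total: int) -> float:
    """Share of cases handed to a human; 0.0 when there were no cases."""
    if total == 0:
        return 0.0
    return escalated / total


def escalation_alert(
    escalated: int,
    total: int,
    baseline: float = 0.15,   # expected rate, taken from the example above
    tolerance: float = 2.0,   # assumed: alert at twice the baseline
) -> bool:
    """True when the observed rate exceeds baseline * tolerance."""
    return escalation_rate(escalated, total) > baseline * tolerance


print(escalation_alert(escalated=60, total=100))  # True: 60% vs a 30% alert line
print(escalation_alert(escalated=15, total=100))  # False: at baseline
```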

    6. Scope creep

    The AI is doing things it was not supposed to do. It started as a draft assistant; now it is writing customer commitments. It started as a classifier; now it is generating refunds.

    Diagnostic: how does the current behaviour differ from the documented behaviour? Who added the new behaviour? Was the eval set updated?
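
    If incidents are tagged in a tracker, the six categories and their opening questions can live in code so that triage starts from the same checklist every time. This is one possible encoding; the names are illustrative and the questions are paraphrased from the sections above:

```python
from enum import Enum


class FailureCategory(Enum):
    HALLUCINATION = "hallucination"
    PROMPT_INJECTION = "prompt injection"
    DRIFT = "drift"
    RUNAWAY_COST = "runaway cost"
    ESCALATION_OVERLOAD = "escalation overload"
    SCOPE_CREEP = "scope creep"


# Opening diagnostic questions per category, paraphrased from the article.
DIAGNOSTICS: dict[FailureCategory, list[str]] = {
    FailureCategory.HALLUCINATION: [
        "Did the model have the grounding data it needed?",
    ],
    FailureCategory.PROMPT_INJECTION: [
        "Where did the injected text come from?",
        "What capability did it exploit?",
    ],
    FailureCategory.DRIFT: [
        "Did the model version change?",
        "Did user patterns shift?",
        "Did the prompt mutate through small tweaks?",
        "Is the eval set stale?",
    ],
    FailureCategory.RUNAWAY_COST: [
        "Which feature drove the spend?",
        "Is there a loop with no exit condition?",
        "Did retries increase because of a quality regression?",
        "Is test traffic hitting production?",
    ],
    FailureCategory.ESCALATION_OVERLOAD: [
        "Is confidence calibration off?",
        "Did the edge cases change?",
        "Did a prompt update make the AI more cautious?",
    ],
    FailureCategory.SCOPE_CREEP: [
        "How does current behaviour differ from documented behaviour?",
        "Who added the new behaviour?",
        "Was the eval set updated?",
    ],
}


def triage_checklist(category: FailureCategory) -> list[str]:
    """Return the opening questions for an incident of this category."""
    return DIAGNOSTICS[category]


for question in triage_checklist(FailureCategory.DRIFT):
    print("-", question)
```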

    The postmortem template

    Use this structure. Keep it one page where possible.

    1. Summary (2-3 sentences)

    What happened, when, what was the user-visible impact.

    2. Timeline (chronological)

    Time-stamped entries for the first signal, when it was noticed, when it was mitigated, and when it was resolved. Be specific.

    3. Root cause

    What enabled this failure. Not just the trigger — the underlying gap.

    4. Contributing factors

    What else went wrong that made this worse than it could have been.

    5. What worked

    Be honest: which part of the system caught this, slowed it, or made the recovery faster?

    6. Actions

    Specific things to change. Each has an owner and a date. Includes:

    • Add a regression case to the eval set.
    • Update prompts / guardrails / capability boundaries.
    • Add monitoring / alerts.
    • Update documentation / runbook.

    7. Lessons learned

    What we now know that we did not before.
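
    Teams that store postmortems as structured records can check a draft for empty sections and unowned actions before it circulates. A minimal sketch of the seven-section template as a Python dataclass; the class names, field names, and the example draft are hypothetical:

```python
from dataclasses import dataclass, field, fields
from datetime import date


@dataclass
class Action:
    description: str
    owner: str
    due: date


@dataclass
class Postmortem:
    # One field per section of the template, in the same order.
    summary: str = ""
    timeline: list[str] = field(default_factory=list)
    root_cause: str = ""
    contributing_factors: list[str] = field(default_factory=list)
    what_worked: list[str] = field(default_factory=list)
    actions: list[Action] = field(default_factory=list)
    lessons_learned: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, in template order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

    def unowned_actions(self) -> list[str]:
        """Descriptions of actions that have no owner yet."""
        return [a.description for a in self.actions if not a.owner.strip()]


# Illustrative draft; the incident details are made up.
draft = Postmortem(
    summary="Assistant quoted a refund policy that does not exist.",
    root_cause="Deploy process did not require an eval pass before promote.",
    actions=[
        Action("Add a regression case to the eval set", owner="", due=date(2026, 6, 2)),
    ],
)

print(draft.missing_sections())
# ['timeline', 'contributing_factors', 'what_worked', 'lessons_learned']
print(draft.unowned_actions())
# ['Add a regression case to the eval set']
```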

    The blameless principle

    Postmortems work only if they are blameless. "Sarah deployed a bad prompt" is not a finding. "Our deploy process did not require eval pass before promote" is a finding.

    Three rules:

    • Talk about systems, not individuals.
    • The person closest to the incident leads the writeup. They have the context, and removing them creates an "outsiders investigate insiders" dynamic.
    • Distribute the postmortem internally. Hiding it kills the learning.

    The recovery playbook

    Before the postmortem, contain the damage. Standard sequence:

    1. Detect. Monitoring alerts you, or a user reports it. The faster the detection, the smaller the blast radius.
    2. Contain. Disable the affected feature, route around it, or roll back the recent change. Bias toward containment over diagnosis at this stage.
    3. Communicate. Internal team. Affected customers if needed. Be specific about what you know and what you do not.
    4. Diagnose. Now that the bleeding has stopped, find the root cause.
    5. Mitigate. Patch the immediate issue.
    6. Postmortem. Within 5-7 days.
    7. Verify. Confirm the regression case in the eval set catches the original failure. Confirm the new monitoring would have detected it earlier.
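
    The verify step can be made concrete: a regression case only proves something if it fails on the behaviour that caused the incident and passes on the fixed behaviour. The sketch below assumes the system under test can be called as a plain function from prompt to output; `Model`, `RegressionCase`, and the two stand-in models are hypothetical:

```python
from dataclasses import dataclass
from typing import Callable

Model = Callable[[str], str]  # assumed interface: prompt in, answer out


@dataclass(frozen=True)
class RegressionCase:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # True when the output is acceptable


def catches_original_failure(case: RegressionCase, before: Model, after: Model) -> bool:
    """True when the case fails on the pre-fix system and passes on the fixed one."""
    failed_before = not case.passes(before(case.prompt))
    passes_after = case.passes(after(case.prompt))
    return failed_before and passes_after


# Stand-ins for the system before and after the fix (illustrative only).
def model_before_fix(prompt: str) -> str:
    return "Your order #99999 shipped yesterday."  # invented order


def model_after_fix(prompt: str) -> str:
    return "I could not find an order on this account."


case = RegressionCase(
    name="no-invented-order",
    prompt="Where is my order?",
    passes=lambda output: "#99999" not in output,
)

print(catches_original_failure(case, model_before_fix, model_after_fix))  # True
```

    A case that passes against the pre-fix system would not have caught the incident, so it should be rewritten before it joins the eval set.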

    The cultural moves

    Make postmortems normal

    The first postmortem feels embarrassing. By the tenth, it is routine. Routine postmortems are the marker of a team that learns.

    Read postmortems team-wide

    Every team member should read every AI postmortem. Patterns compound across systems; lessons cross-pollinate.

    Reward postmortem authors

    Writing a postmortem about your own bug should be celebrated, not punished. The team that catches its own failures publicly is healthier than the team that hides them.

    Track actions to closure

    Every action item gets an owner and a due date. Track open postmortem actions like you track open bugs. Things that get tracked get fixed.
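
    If your issue tracker cannot filter postmortem actions directly, a small report is enough to start with. A sketch, assuming actions are exported as records with an owner, a due date, and a closed flag; all names, IDs, and dates here are made up:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class TrackedAction:
    postmortem_id: str
    description: str
    owner: str
    due: date
    closed: bool = False


def overdue(actions: list[TrackedAction], today: date) -> list[TrackedAction]:
    """Open actions past their due date, oldest first."""
    late = [a for a in actions if not a.closed and a.due < today]
    return sorted(late, key=lambda a: a.due)


# Illustrative data.
actions = [
    TrackedAction("PM-7", "Add cost alert per feature", "ops", date(2026, 5, 20)),
    TrackedAction("PM-5", "Add regression case", "ml", date(2026, 5, 10)),
    TrackedAction("PM-6", "Update runbook", "support", date(2026, 5, 15), closed=True),
    TrackedAction("PM-8", "Tighten capability boundary", "ml", date(2026, 6, 30)),
]

for action in overdue(actions, today=date(2026, 6, 1)):
    print(action.postmortem_id, action.owner, action.description)
# PM-5 ml Add regression case
# PM-7 ops Add cost alert per feature
```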

    The compounding pattern

    Each postmortem adds:

    • One or more regression cases to the eval set.
    • One or more new monitoring rules.
    • One or more documentation updates.
    • Sometimes: a structural change to the system.

    After 20 postmortems, your eval suite is comprehensive, your monitoring catches obvious issues fast, your docs are sharp. Your system fails less often and recovers faster when it does.

    Teams that do not postmortem do not get this compounding effect. They have the same incidents repeatedly.

    What this means for you

    • AI will fail in production. Plan the response now, not when it happens.
    • Six failure categories. Each has a different diagnostic path.
    • Postmortems are the learning mechanism. Blameless. Distributed. Specific actions.
    • Recovery sequence: detect → contain → communicate → diagnose → mitigate → postmortem → verify.
    • Compounding: each postmortem makes the next failure less likely.
    • Read properties of production AI and human-in-the-loop AI.

    Want help setting up the AI postmortem rhythm? Book a 30-minute call. We will share the template and the cultural moves that make it work.