Post Mortem Report

[Incident Title]

1. Incident Summary

Incident Title: [Short descriptive title]
Incident ID: [INC-XXXX]
Severity: [S0 / S1 / S2 / S3]
Date of Incident: [YYYY-MM-DD]
Duration of Impact: [HH:MM start – HH:MM end, timezone]
Time to Detection: [Minutes/hours from onset to first alert]
Time to Mitigation: [Minutes/hours from detection to mitigation]
Time to Resolution: [Minutes/hours from detection to full resolution]
Affected Services: [List services, APIs, regions impacted]
Customer Impact: [# users affected, error rates, revenue impact if known]
Incident Commander: [Name]
Post Mortem Author: [Name]
Post Mortem Date: [YYYY-MM-DD]
Post Mortem Status: [Draft / In Review / Final]

2. Executive Summary

[Provide a 3–5 sentence high-level summary of the incident suitable for senior leadership. Cover what happened, who was affected, how it was resolved, and the most critical follow-up actions. Keep this non-technical and focused on business impact.]

3. Impact Assessment

3.1 User / Customer Impact

[Quantify the impact: number of users affected, error rates observed, failed transactions, degraded functionality, SLA breaches, etc. Be specific with metrics.]

3.2 Business / Financial Impact

[Revenue loss (estimated or actual), contractual penalties, reputational damage, compliance implications. If not yet quantified, note that and assign an owner to calculate.]

3.3 Internal Impact

[Teams diverted, planned work delayed, on-call burden, employee morale considerations.]

4. Timeline of Events

Record all significant events in UTC. Include automated alerts, human actions, escalations, communications, and status changes. Aim for completeness — the timeline is the factual backbone of this document.

Timestamp (UTC) | Event / Action | Actor / System
[HH:MM] | [First anomaly detected by monitoring] | [System/Person]
[HH:MM] | [Alert fired / on-call paged] | [System/Person]
[HH:MM] | [Investigation began] | [Person]
[HH:MM] | [Root cause identified] | [Person]
[HH:MM] | [Mitigation applied] | [Person]
[HH:MM] | [Service fully restored] | [Person]
[HH:MM] | [Add additional rows as needed] | [...]
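
The three duration fields in the Incident Summary (Time to Detection, Time to Mitigation, Time to Resolution) can be derived from the timeline instead of being estimated by hand. The following is a minimal Python sketch under stated assumptions: the function names, the dictionary keys, and the sample timestamps are all hypothetical, and the incident is assumed to start and end on the same UTC date. Adapt it to however your timeline is actually stored.

```python
from datetime import datetime, timedelta, timezone


def parse_utc(date: str, hhmm: str) -> datetime:
    """Combine a YYYY-MM-DD date and an HH:MM time into a UTC datetime."""
    parsed = datetime.strptime(f"{date} {hhmm}", "%Y-%m-%d %H:%M")
    return parsed.replace(tzinfo=timezone.utc)


def incident_durations(
    onset: datetime,
    detected: datetime,
    mitigated: datetime,
    resolved: datetime,
) -> dict[str, timedelta]:
    """Compute the summary durations as this template defines them."""
    if not (onset <= detected <= mitigated <= resolved):
        raise ValueError("timeline events are out of order")
    return {
        # Onset to first alert.
        "time_to_detection": detected - onset,
        # Detection to mitigation.
        "time_to_mitigation": mitigated - detected,
        # Detection to full resolution.
        "time_to_resolution": resolved - detected,
    }


# Hypothetical sample values, for illustration only.
day = "2024-01-15"
durations = incident_durations(
    onset=parse_utc(day, "14:02"),
    detected=parse_utc(day, "14:09"),
    mitigated=parse_utc(day, "14:41"),
    resolved=parse_utc(day, "15:30"),
)

for name, value in durations.items():
    print(f"{name}: {int(value.total_seconds() // 60)} min")
```

The ordering check is deliberate: a timeline whose events are out of sequence usually signals a transcription or timezone error, which is better caught before the numbers reach the summary.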

5. Root Cause Analysis

5.1 Root Cause

[Describe the underlying technical or process cause. Be precise. Distinguish between the trigger (what initiated the incident) and the root cause (why the system was vulnerable to that trigger). Reference ISO/IEC 20000-1 problem management principles: identify the cause, not just the symptom.]

5.2 Contributing Factors

[List factors that amplified the impact or delayed detection/resolution. Examples: missing monitoring, unclear runbooks, insufficient test coverage, configuration drift, single points of failure, communication gaps. Per NIST SP 800-34 guidance, consider preventive controls that were absent or ineffective.]

5.3 Five Whys Analysis

Drill down from symptom to systemic cause. Stop when you reach an actionable organizational or process-level cause.

  • Why did the outage occur? [Answer]

  • Why did that happen? [Answer]

  • Why did that happen? [Answer]

  • Why did that happen? [Answer]

  • Why did that happen? [Answer / Root systemic cause]

6. Detection and Response Assessment

6.1 Detection

[How was the incident first detected? Was it automated monitoring, customer report, internal observation? Evaluate whether detection time was acceptable. Identify gaps in observability.]

6.2 Incident Response

[Evaluate how the response process worked. Was the right team engaged quickly? Were escalation paths clear? Was the incident commander role effective? Were communication channels (internal status page, customer comms) activated promptly?]

6.3 What Went Well

[Explicitly call out things that worked. This is critical for morale and for reinforcing good practices. Examples: fast paging, effective runbook, good cross-team collaboration, quick rollback.]

6.4 What Could Be Improved

[Identify gaps without assigning blame. Focus on systems and processes, not individuals. Examples: monitoring blind spots, missing documentation, unclear ownership, slow escalation.]

7. Remediation and Action Items

Every action item must have an owner and a target date. Categorize by type to ensure coverage across prevention, detection, mitigation, and process improvement. Track these items to closure in your project management system.

# | Type | Action Item | Owner | Due / Status
1 | Prevent | [Description of action] | [Name] | [Date / Open]
2 | Detect | [Description of action] | [Name] | [Date / Open]
3 | Mitigate | [Description of action] | [Name] | [Date / Open]
4 | Process | [Description of action] | [Name] | [Date / Open]

Action item types:

  • Prevent — Eliminate or reduce the probability of this class of incident recurring.

  • Detect — Improve the speed and accuracy of detection for similar issues.

  • Mitigate — Reduce the blast radius or duration of impact when similar issues occur.

  • Process — Improve incident response, communication, documentation, or training.
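
The two rules in this section (every action item has an owner and a target date, and the items together cover all four types) lend themselves to an automated check before the document moves to review. The following is a minimal Python sketch, not a prescribed tool: the `ActionItem` class, its field names, and the sample items are hypothetical and would need to be mapped onto your project management system's export.

```python
from dataclasses import dataclass
from typing import Optional

# The four action item types defined in this section.
REQUIRED_TYPES = {"Prevent", "Detect", "Mitigate", "Process"}


@dataclass
class ActionItem:
    number: int
    kind: str                 # one of REQUIRED_TYPES
    description: str
    owner: Optional[str]      # person accountable for the item
    due: Optional[str]        # target date, YYYY-MM-DD
    status: str = "Open"


def validate_action_items(items: list[ActionItem]) -> list[str]:
    """Return human-readable problems; an empty list means all checks pass."""
    problems = []
    for item in items:
        if item.kind not in REQUIRED_TYPES:
            problems.append(f"item {item.number}: unknown type '{item.kind}'")
        if not item.owner:
            problems.append(f"item {item.number}: missing owner")
        if not item.due:
            problems.append(f"item {item.number}: missing target date")
    # Check coverage across all four types.
    covered = {item.kind for item in items}
    for missing in sorted(REQUIRED_TYPES - covered):
        problems.append(f"no action item of type '{missing}'")
    return problems


# Hypothetical sample items, for illustration only.
sample = [
    ActionItem(1, "Prevent", "Add config validation to deploy pipeline", "A. Owner", "2024-02-01"),
    ActionItem(2, "Detect", "Alert on queue depth", None, "2024-02-08"),
    ActionItem(3, "Mitigate", "Document rollback procedure", "B. Owner", None),
]

for problem in validate_action_items(sample):
    print(problem)
```

Missing type coverage is reported as a problem here, but a team may reasonably treat it as a prompt for discussion instead of a hard failure, since not every incident yields an action of every type.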

8. Lessons Learned

[Capture broader organizational insights. What does this incident reveal about architectural patterns, testing practices, deployment processes, on-call structures, or organizational culture? These are the insights that go beyond the specific fix and inform longer-term strategy. Reference NIST SP 800-34 after-action report practices: document findings, update plans, and distribute to stakeholders.]

9. External Communication

9.1 Customer Communication Summary

[Summarize what was communicated to customers, when, and through which channels (status page, email, support tickets, social media). Attach or link to actual communications sent.]

9.2 Regulatory / Compliance Notifications

[Were any regulatory notifications required (e.g., data breach notification, SLA reporting)? If so, document what was sent and to whom. If not applicable, state so.]

10. Appendices

[Attach supporting material as needed:]

  • Relevant graphs, dashboards, or monitoring screenshots

  • Log excerpts (sanitized of sensitive data)

  • Configuration diffs or deploy artifacts

  • Links to related incidents or previous post mortems

  • Communication transcripts (Slack threads, war room notes)
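
Log excerpts and chat transcripts should be sanitized before they are attached. The following Python sketch shows one possible first pass that redacts email addresses and IPv4 addresses with regular expressions. The patterns, the placeholder strings, and the sample log line are illustrative assumptions, and they are not exhaustive: tokens, account IDs, customer names, and other sensitive fields still require organization-specific rules and a manual review.

```python
import re

# Illustrative patterns only; they do not cover every sensitive field.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def sanitize(line: str) -> str:
    """Replace email and IPv4 addresses in a log line with placeholders."""
    line = EMAIL.sub("[REDACTED_EMAIL]", line)
    line = IPV4.sub("[REDACTED_IP]", line)
    return line


# Hypothetical log line, using reserved example addresses.
raw = "login failed for alice@example.com from 203.0.113.7"
print(sanitize(raw))
```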

11. Review and Sign-Off

This post mortem should be reviewed collaboratively in a blameless post mortem meeting. All parties involved in the incident should have the opportunity to contribute corrections and context before the document is finalized.

Role | Name | Date Reviewed
Post Mortem Author | [Name] | [Date]
Incident Commander | [Name] | [Date]
Engineering Manager | [Name] | [Date]
VP/Director of Engineering | [Name] | [Date]