Incident Analysis Intelligence
● System Healthy
Total Agents: 6 (3 extraction · 3 decision)
Tests Passing: 12 / 12 (100% pass rate)
Incidents Analyzed: 8 (last 30 days)
Avg MTTR: 47 min (across SEV1 incidents)
Agent Confidence Scores
[Chart: per-agent confidence for IncidentParser, TimelineExtractor, ActionExtractor, TaskPrioritizer, OwnerAssigner, and EscalationDecider]
Recent Incident Analysis

| Incident | Severity | Root Cause | Status |
| --- | --- | --- | --- |
| INC-2024-087 | SEV1 | DB connection pool exhaustion | Resolved |
| INC-2024-083 | SEV2 | Cache invalidation race condition | Resolved |
| INC-2024-079 | SEV1 | Certificate expiration | Resolved |
| INC-2024-076 | SEV3 | Memory leak in worker | Monitoring |
Action Items from Postmortems
IDAction ItemPriorityOwnerStatus
ACT-001Implement connection pool monitoring and alertingP0SRE TeamIn Progress
ACT-002Add circuit breakers to all database connectionsP0Platform TeamComplete
ACT-003Create automated certificate rotation pipelineP1DevOpsIn Progress
ACT-004Update runbook for DB connection exhaustionP1On-callComplete

🔥 What is POSTMORTEM?

POSTMORTEM (Incident Analysis Intelligence) is a multi-agent system that processes post-incident reports and outage documentation. It automatically reconstructs incident timelines, identifies root causes, extracts remediation action items, assigns follow-up owners, and escalates unresolved risks — turning chaotic incident data into structured, trackable improvements.
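
As a mental model, the structured record each incident is distilled into might look like the Python sketch below. These dataclasses are illustrative assumptions, not POSTMORTEM's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class TimelineEvent:
    timestamp: datetime        # when the event occurred (UTC)
    description: str           # e.g. "On-call engineer paged"

@dataclass
class ActionItem:
    action_id: str             # e.g. "ACT-001"
    description: str
    priority: str              # "P0" | "P1" | "P2"
    owner: str | None = None   # filled in later by OwnerAssigner

@dataclass
class Incident:
    incident_id: str           # e.g. "INC-2024-087"
    severity: str              # "SEV1" | "SEV2" | "SEV3"
    root_cause: str | None = None
    timeline: list[TimelineEvent] = field(default_factory=list)
    actions: list[ActionItem] = field(default_factory=list)
    needs_escalation: bool = False
```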

⚡ How It Works

Submit an incident report or post-mortem document. POSTMORTEM runs 6 agents in sequence:

Step 1 — Extraction
📄 IncidentParser
Parses the incident report structure — identifies severity, affected services, detection method, impact scope, and resolution status from the raw documentation.
Step 2 — Extraction
⏰ TimelineExtractor
Reconstructs the full incident timeline — detection, alerting, response, mitigation, and resolution timestamps. Identifies gaps where response was delayed.
Step 3 — Extraction
✅ ActionExtractor
Extracts all remediation actions, follow-up tasks, and process improvements recommended in the post-mortem. Captures deadlines and acceptance criteria.
Step 4 — Decision
📈 TaskPrioritizer
Prioritizes remediation tasks based on recurrence risk, blast radius, and customer impact. Systemic infrastructure fixes get highest priority.
Step 5 — Decision
👤 OwnerAssigner
Assigns each remediation action to the responsible team — SRE for infrastructure, backend for code fixes, platform for tooling improvements.
Step 6 — Decision
🚨 EscalationDecider
Flags incidents requiring executive attention — recurring outages, data loss events, SLA breaches with financial impact, or unresolved systemic risks.
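
A minimal sketch of how such a strictly sequential pipeline can be wired: each agent is a function that reads and enriches a shared state dict. Only three of the six agents are shown, and every name and heuristic here is an illustrative assumption, not POSTMORTEM's internals:

```python
# Each "agent" enriches the shared analysis state; real agents would call
# an LLM or a rule engine rather than the toy heuristics used here.

def incident_parser(state: dict) -> dict:
    # Toy heuristic: read severity straight out of the report text.
    state["severity"] = "SEV1" if "SEV1" in state["report"] else "SEV3"
    return state

def task_prioritizer(state: dict) -> dict:
    # Toy rule: systemic infrastructure fixes get P0, everything else P1.
    state["priorities"] = [
        ("P0" if "index" in action or "pool" in action else "P1", action)
        for action in state.get("actions", [])
    ]
    return state

def escalation_decider(state: dict) -> dict:
    # SEV1 incidents get flagged for executive review.
    state["escalate"] = state["severity"] == "SEV1"
    return state

PIPELINE = [incident_parser, task_prioritizer, escalation_decider]

def analyze(report: str) -> dict:
    state = {"report": report,
             "actions": ["add missing index", "query cost guard in CI"]}
    for agent in PIPELINE:
        state = agent(state)  # strictly sequential, as in steps 1-6 above
    return state

print(analyze("SEV1: database cluster failover at 03:42 UTC"))
```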

🎬 Live Agent Pipeline Demo

Watch POSTMORTEM process an incident report through all 6 agents in real time.

INCIDENT: Database cluster failover at 03:42 UTC
Severity: SEV1   Duration: 2h 17m   Impact: 100% of write operations failed
Root Cause: Disk I/O saturation triggered by unindexed query in batch job deployed at 03:30 UTC
Remediation: Rollback batch job, add missing index, implement query cost guards in CI pipeline
Step 1 · 📄 IncidentParser → SEV1 · 2h17m · Database cluster
Step 2 · 📅 TimelineExtractor → 4 events: deploy → saturation → failover → fix
Step 3 · ✅ ActionExtractor → 3 actions: rollback, index, CI guard
Step 4 · 📈 TaskPrioritizer → P0: Index fix · P1: CI guard
Step 5 · 👤 OwnerAssigner → DBA → Index · Platform → CI guard
Step 6 · 🚨 EscalationDecider → ⚠ SEV1 → VP Eng review required
✅ Pipeline Complete — 6 agents processed in 3.9s
Timeline Events: 4 · Actions: 3 · Assigned: 2 · Escalations: 1

🎯 Real-World Use Cases

SEV1/SEV2 Post-Incident Reviews
Automatically structure post-mortem documents into actionable remediation plans with clear owners, deadlines, and priority levels.
Root Cause Analysis
Extract and categorize root causes across hundreds of incidents to identify systemic patterns — infrastructure, deployment, or human process failures.
Remediation Tracking
Track follow-up action item completion across SRE, engineering, and platform teams. Ensure post-mortem commitments are actually delivered.
Incident Pattern Detection
Aggregate data across all incidents to detect recurring failure modes, measure MTTR improvements, and track reliability trends over quarters.
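
Avg MTTR here is just the mean detection-to-recovery interval across incidents; a minimal sketch with made-up timestamps:

```python
from datetime import datetime

# Illustrative (detected, recovered) pairs; not real incident data.
incidents = [
    (datetime(2024, 6, 1, 3, 14), datetime(2024, 6, 1, 4, 1)),    # 47 min
    (datetime(2024, 6, 9, 12, 0), datetime(2024, 6, 9, 12, 47)),  # 47 min
]

# MTTR = mean minutes from first detection to confirmed recovery.
mttr = sum(
    (recovered - detected).total_seconds() / 60
    for detected, recovered in incidents
) / len(incidents)

print(f"Avg MTTR: {mttr:.0f} min")  # -> Avg MTTR: 47 min
```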

🏗 Architecture

🔄
Orchestration Engine
Coordinates all 6 agents sequentially from parsing through escalation, managing incident state transitions.
🛡
Circuit Breaker Recovery
Per-agent fault isolation ensures timeline extraction failures don't block action item processing or escalation.
📋
Full Audit Trail
Every extraction and decision is logged with timestamps and confidence scores for post-incident accountability.
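
A minimal sketch of the per-agent circuit breaker pattern described above, assuming illustrative thresholds (three consecutive failures trip the breaker, 30 s cool-down). This is the generic pattern, not POSTMORTEM's actual implementation:

```python
import time

class CircuitBreaker:
    """Per-agent breaker: after max_failures consecutive errors the circuit
    OPENs and calls fail fast until reset_after seconds have elapsed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at: float | None = None

    @property
    def state(self) -> str:
        if self.opened_at is None:
            return "CLOSED"
        if time.monotonic() - self.opened_at >= self.reset_after:
            return "HALF_OPEN"  # allow one trial call through
        return "OPEN"

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            # Fail fast so the orchestrator can skip this agent and move on.
            raise RuntimeError("circuit open: agent temporarily disabled")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0  # a success fully closes the circuit
            self.opened_at = None
            return result
```

Giving each agent its own breaker is what lets a TimelineExtractor failure trip only its own circuit while ActionExtractor and EscalationDecider keep running.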
🔥 Submit Incident Report for Analysis

Drag & drop an incident report, or click to browse (.txt and .md supported). Live agent results and extracted action items (action · priority · owner) appear as the pipeline runs. Alternatively, analyze a sample incident.
Total Tests: 12 · Passed: 12 · Failed: 0 · Duration: 2.01s
Agent Tests

| Suite | Test | Duration | Result |
| --- | --- | --- | --- |
| TestIncidentParser | test_incident_parsing | 0.24s | PASS |
| TestIncidentParser | test_empty_report | 0.08s | PASS |
| TestTimelineExtractor | test_timeline | 0.19s | PASS |
| TestActionExtractor | test_action_items | 0.16s | PASS |
| TestTaskPrioritizer | test_prioritization | 0.11s | PASS |
| TestOwnerAssigner | test_assignment | 0.14s | PASS |

Infrastructure Tests

| Suite | Test | Duration | Result |
| --- | --- | --- | --- |
| TestRecoveryStrategies | test_retry_strategy | 0.05s | PASS |
| TestRecoveryStrategies | test_recovery_manager | 0.07s | PASS |
| TestCircuitBreaker | test_circuit_closed | 0.04s | PASS |
| TestCircuitBreaker | test_circuit_opens | 0.03s | PASS |
| TestAuditLogger | test_log_event | 0.09s | PASS |
| TestAuditLogger | test_audit_export | 0.10s | PASS |
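
For flavor, a hypothetical pytest-style version of test_incident_parsing and test_empty_report above. The toy IncidentParser stub exists only so the sketch runs; the real parser is richer:

```python
import re
from dataclasses import dataclass

@dataclass
class ParsedIncident:
    severity: str | None

class IncidentParser:
    # Toy stand-in: extract severity with a regex; the real agent also
    # pulls affected services, detection method, and impact scope.
    def parse(self, report: str) -> ParsedIncident:
        match = re.search(r"SEV[1-3]", report)
        return ParsedIncident(severity=match.group(0) if match else None)

def test_incident_parsing():
    incident = IncidentParser().parse("SEV1: API Gateway outage, 47 min")
    assert incident.severity == "SEV1"

def test_empty_report():
    # Empty input must degrade gracefully rather than raise.
    assert IncidentParser().parse("").severity is None
```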
Total Events: 36 · Incidents Analyzed: 8 · Avg Confidence: 84% · Escalations: 3
Analysis Audit Timeline
03:14:00 · alert
SEV1 Declared: API Gateway Outage
PagerDuty alert triggered · On-call engineer paged
03:14:22 · agent
IncidentParser → Complete
DB connection pool exhaustion identified as root cause
03:14:45 · agent
TimelineExtractor → 8 events mapped
Impact window: 03:14 – 04:01 UTC (47 min)
03:15:01 · agent
ActionExtractor → 5 action items
2 P0 (immediate), 2 P1 (this sprint), 1 P2 (next quarter)
03:15:20 · decision
Priority: 2 items escalated to VP Engineering
Connection pool limits + monitoring gaps flagged
03:15:32 · workflow
Analysis Complete
Full timeline, root cause, and remediation plan generated
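
An append-only logger backing a timeline like this could be as simple as the sketch below; the event shape and method names (which echo test_log_event and test_audit_export above) are assumptions:

```python
import json
from datetime import datetime, timezone

class AuditLogger:
    """Append-only audit trail: every agent event carries a UTC timestamp
    and a confidence score, and the whole trail exports as JSON."""

    def __init__(self):
        self.events: list[dict] = []

    def log_event(self, agent: str, message: str, confidence: float) -> None:
        self.events.append({
            "ts": datetime.now(timezone.utc).isoformat(),
            "agent": agent,
            "message": message,
            "confidence": confidence,
        })

    def export(self) -> str:
        return json.dumps(self.events, indent=2)

audit = AuditLogger()
audit.log_event("IncidentParser", "root cause: DB connection pool exhaustion", 0.92)
print(audit.export())
```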
Root Cause Classification
🔥 Root Cause: DB Connection Pool Exhaustion
• Connection pool max: 100, active: 100/100
• Triggered by: Traffic spike + connection leak in auth-service
• Contributing: No pool monitoring alerts configured
• Impact: 100% API failure for 47 minutes
✅ Remediation Steps
• Increased pool max to 200
• Fixed connection leak in auth-service v2.4.1
• Added connection pool utilization alerts
• Implemented circuit breaker on DB connections
🕑 Impact Timeline
• 03:14 UTC — First errors detected
• 03:18 UTC — On-call paged
• 03:28 UTC — Root cause identified
• 03:45 UTC — Fix deployed
• 04:01 UTC — Full recovery confirmed
Circuit Breakers: 6 (all closed) · Recoveries: 4 · Retry Rate: 100% · Uptime: 99.9%
Circuit Breaker Status

| Agent | State | Failures |
| --- | --- | --- |
| IncidentParser | CLOSED | 0 |
| TimelineExtractor | CLOSED | 0 |
| ActionExtractor | CLOSED | 0 |
| TaskPrioritizer | CLOSED | 0 |
| OwnerAssigner | CLOSED | 0 |
| EscalationDecider | CLOSED | 0 |