Incident Communication Playbook
Templates, scripts, and workflows for communicating during service incidents. Covers every stage from detection to post-mortem, with ready-to-use templates for status pages, email, Slack, and social media.
Introduction
How you communicate during an incident often matters more than the incident itself. A 30-minute outage with clear, timely communication is forgiven; a 5-minute blip with no communication erodes trust.
This playbook provides ready-to-use templates and workflows for every stage of incident communication. It is designed for SaaS teams of any size, from solo founders to enterprise engineering organizations.
The Incident Communication Timeline
Every incident follows a predictable communication arc:
| Phase | Time | Action |
|---|---|---|
| Detection | T+0 | Acknowledge the issue internally |
| Initial Update | T+5 min | Post first public status update |
| Investigation | T+5-30 min | Regular updates every 15-30 min |
| Identification | When root cause found | Update with cause and ETA |
| Fix Deployed | When fix is live | Update status to monitoring |
| Resolution | After stability confirmed | Mark incident resolved |
| Post-Mortem | T+24-48 hours | Publish detailed analysis |
The single most important rule: never go more than 30 minutes without an update during an active incident.
Phase 1: Detection and Acknowledgment
Internal Alert (Slack/Teams)
INCIDENT DETECTED
What: [Brief description of the issue]
Impact: [Who is affected and how]
Severity: [P1/P2/P3]
On-call: [Name of person investigating]
Status page: [Link to status page]
Thread below for updates. Do NOT communicate externally
until the first status page update is posted.
Status Page: Investigating
Investigating - [Component Name]
We are investigating reports of [brief, user-facing description].
Some users may experience [specific symptom: errors, slow loading,
failed transactions, etc.].
We are actively working to identify the root cause and will provide
updates every 15 minutes.
Posted at [time] [timezone]
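A template like the one above can be filled programmatically so the on-call engineer supplies only the variables. This is a minimal sketch using Python's `string.Template`; the placeholder names and incident details are illustrative, not part of any real tool.

```python
from string import Template

# Hypothetical "Investigating" template; placeholder names are illustrative.
INVESTIGATING = Template(
    "Investigating - $component\n"
    "We are investigating reports of $description. "
    "Some users may experience $symptom.\n"
    "We are actively working to identify the root cause and will "
    "provide updates every 15 minutes.\n"
    "Posted at $time $tz"
)

# substitute() raises KeyError if a variable is missing, which is what
# you want: a half-filled status update should never be posted.
update = INVESTIGATING.substitute(
    component="Checkout API",
    description="failed payment submissions",
    symptom="errors when completing a purchase",
    time="14:05",
    tz="UTC",
)
print(update)
```

Keeping templates as data rather than prose in a wiki makes them usable by both humans and automation.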
Email to Subscribers
Subject: [Service Name] - Investigating issues with [Component]
We are currently investigating an issue affecting [component/feature].
What is happening:
[1-2 sentences describing the user-visible impact]
What we are doing:
Our engineering team is actively investigating. We will send
updates as we learn more.
Current status: Investigating
Follow live updates: [status page URL]
Phase 2: Investigation Updates
15-Minute Update (No New Info)
Update - [Component Name]
We are continuing to investigate the issue affecting [component].
Our engineering team is actively working on this. We do not have
additional information at this time but will provide another
update within 15 minutes.
Posted at [time] [timezone]
15-Minute Update (Progress)
Update - [Component Name]
We have narrowed down the issue to [general area: database,
third-party service, network, etc.]. Our team is working on
[specific action: rolling back a deployment, scaling
infrastructure, contacting the provider, etc.].
We expect to have more information within the next 15 minutes.
Posted at [time] [timezone]
Third-Party Issue Identified
Update - [Component Name]
We have identified that this issue is related to [third-party
service name], which is currently experiencing [their reported
status]. This is affecting our [specific feature/component].
We are monitoring [third-party]'s status page for updates and
will communicate any changes. [Workaround if available].
[Third-party status page URL]
Posted at [time] [timezone]
Phase 3: Root Cause Identified
Status Page Update
Identified - [Component Name]
We have identified the root cause of the issue affecting
[component]. [One sentence plain-language explanation].
Our engineering team is [specific remediation action]. We
expect this to be resolved by approximately [time estimate]
[timezone].
[If applicable: Workaround: Users can [specific workaround]
in the meantime.]
Posted at [time] [timezone]
Email to Subscribers
Subject: [Service Name] - Root cause identified for [Component] issue
We have identified the root cause of the issue affecting
[component/feature].
What happened:
[2-3 sentences explaining in plain language]
What we are doing:
[Specific remediation steps]
Expected resolution:
We expect this to be resolved by [time] [timezone].
[If applicable]
Workaround:
[Steps users can take to work around the issue]
Current status: Identified
Follow live updates: [status page URL]
Phase 4: Fix Deployed
Status Page Update
Monitoring - [Component Name]
A fix has been deployed for the issue affecting [component].
We are monitoring the system to confirm stability.
If you continue to experience issues, please contact our
support team at [support email/URL].
We will provide a final update once we have confirmed
the fix is stable.
Posted at [time] [timezone]
Phase 5: Resolution
Status Page Update
Resolved - [Component Name]
The incident affecting [component] has been fully resolved.
All systems are now operating normally.
Duration: [start time] to [end time] ([total duration])
Impact: [brief summary of what was affected]
We will publish a detailed post-incident report within 48 hours.
We apologize for any inconvenience this may have caused.
Posted at [time] [timezone]
Email to Subscribers
Subject: [Service Name] - [Component] issue resolved
The incident affecting [component/feature] has been fully resolved.
Summary:
- Duration: [total duration]
- Impact: [what was affected]
- Root cause: [one sentence]
- Resolution: [one sentence]
All systems are now operating normally. We will publish a
detailed post-incident report within 48 hours.
We apologize for any inconvenience and appreciate your patience.
[status page URL]
Social Media (X/Twitter)
Update: The issue affecting [feature] has been resolved.
All systems are operating normally.
Duration: [X] minutes
Root cause: [brief]
Full details: [status page URL]
We apologize for the disruption.
Phase 6: Post-Incident Report
Template
Post-Incident Report: [Incident Title]
Date: [Date]
Duration: [Start time] to [End time] ([Total duration])
Severity: [P1/P2/P3]
Impact: [Number of affected users/requests/transactions]
## Summary
[2-3 paragraph summary of what happened, written for a
non-technical audience]
## Timeline
[Chronological list of key events]
- HH:MM - [Event description]
- HH:MM - [Event description]
- HH:MM - [Event description]
- HH:MM - [Event description]
## Root Cause
[Technical explanation of what caused the incident.
Be specific but accessible.]
## Resolution
[What was done to fix the immediate issue]
## Preventive Measures
[What changes are being made to prevent recurrence]
| Action Item | Owner | Target Date | Status |
|-------------|-------|-------------|--------|
| [Action 1] | [Name]| [Date] | In progress |
| [Action 2] | [Name]| [Date] | Planned |
| [Action 3] | [Name]| [Date] | Planned |
## Lessons Learned
- [Key takeaway 1]
- [Key takeaway 2]
- [Key takeaway 3]
Severity Classification
P1 - Critical
Criteria: Core functionality unavailable for all or most users. Revenue-impacting. Data integrity risk.
Communication cadence: Updates every 10-15 minutes. All hands on deck. Executive notification.
Channels: Status page, email, Slack/Discord webhooks, social media.
P2 - Major
Criteria: Significant functionality degraded. Subset of users affected. Workaround available.
Communication cadence: Updates every 15-30 minutes. On-call engineer plus backup.
Channels: Status page, email, Slack/Discord webhooks.
P3 - Minor
Criteria: Minor functionality affected. Small user impact. Easy workaround.
Communication cadence: Updates every 30-60 minutes. On-call engineer.
Channels: Status page only.
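The severity matrix above can be encoded as a lookup table so paging and notification tooling reads the policy from one place. A minimal sketch; the key names and channel identifiers are illustrative assumptions, not a real schema.

```python
# Severity policy as data: update cadence (minutes) and channels per level.
# Names are illustrative; adapt them to your own tooling.
SEVERITY_POLICY = {
    "P1": {"update_interval_min": (10, 15),
           "channels": ["status_page", "email", "chat_webhooks", "social"]},
    "P2": {"update_interval_min": (15, 30),
           "channels": ["status_page", "email", "chat_webhooks"]},
    "P3": {"update_interval_min": (30, 60),
           "channels": ["status_page"]},
}

def channels_for(severity: str) -> list:
    """Which channels to notify for a given severity level."""
    return SEVERITY_POLICY[severity]["channels"]

print(channels_for("P3"))  # ['status_page']
```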
Channel-Specific Guidelines
Status Page
- Always the primary source of truth
- Update before any other channel
- Use the standardized status levels (Investigating, Identified, Monitoring, Resolved)
- Include timestamps with timezone
Email
- Only send for P1 and P2 incidents
- Keep subject lines factual, not alarming
- Include a link to the status page for live updates
- Send at most 3 emails per incident (initial, identified, resolved)
Slack / Discord / Telegram
- Use for real-time updates to subscribed users
- Keep messages concise (under 280 characters for the summary)
- Include status page link for details
- Use appropriate formatting (bold for status, code blocks for technical details)
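The concise-message guideline above can be sketched as a small helper that builds the update and posts it via a Slack incoming webhook (which accepts a JSON body with a `text` field). The webhook URL, component name, and status page URL below are placeholders.

```python
import json
import urllib.request

def build_incident_message(status: str, component: str, status_page_url: str) -> str:
    """Concise chat update: bold status, affected component, link for details."""
    return f"*{status}* - {component}. Details: {status_page_url}"

def post_to_slack(webhook_url: str, text: str) -> None:
    # Slack incoming webhooks take a JSON payload with a "text" field.
    req = urllib.request.Request(
        webhook_url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)  # raises on non-2xx responses

msg = build_incident_message(
    "Investigating", "Checkout API", "https://status.example.com")
print(len(msg))  # keep the summary tweet-length (under 280 characters)
```

In production you would call `post_to_slack(webhook_url, msg)` with the webhook URL from your workspace configuration.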
Social Media
- Only post for P1 incidents affecting a large user base
- Be factual, not defensive
- Do not engage with angry replies during an active incident
- Post resolution update once confirmed
Customer Success / Sales
- Prepare talking points before CS/Sales teams are asked
- Include: what happened, who is affected, ETA, workaround
- Update talking points with each status change
- Provide the post-incident report for follow-up conversations
Tone and Language Guide
Do
- Use plain language ("the payment system is slow" not "elevated P99 latencies")
- Be specific about impact ("some users cannot log in" not "we are experiencing issues")
- Give time estimates when possible ("we expect resolution within 1 hour")
- Acknowledge the inconvenience
- Use active voice ("we identified the issue" not "the issue was identified")
Do Not
- Blame third parties without confirmation (do not say "this appears to be a Stripe issue" until the provider has confirmed an incident)
- Use jargon (P99, 5xx, pod, cluster, shard)
- Minimize the impact ("a small number of users" when it is 30%)
- Promise it will never happen again
- Use humor during active incidents
- Share internal details (server names, IP addresses, code snippets)
Scheduled Maintenance Communication
7-Day Advance Notice
Subject: Scheduled maintenance: [Component] on [Date]
We will be performing scheduled maintenance on [component]
on [date] from [start time] to [end time] [timezone].
What to expect:
- [Specific impact: "the dashboard will be unavailable",
"API response times may be slower", etc.]
- Duration: approximately [X] hours
- [Workaround if applicable]
Why:
[Brief explanation: database migration, security update,
infrastructure upgrade, etc.]
No action is required from your side. We will send a
reminder 24 hours before the maintenance window.
Questions? Contact [support email/URL].
24-Hour Reminder
Reminder: Scheduled maintenance on [component] begins
tomorrow at [time] [timezone].
Expected duration: [X] hours
Impact: [brief description]
[status page URL]
Maintenance Started
Maintenance in progress - [Component]
Scheduled maintenance on [component] has begun. This is
expected to last approximately [X] hours.
[Impact description]
We will update this status when maintenance is complete.
Posted at [time] [timezone]
Maintenance Completed
Maintenance complete - [Component]
Scheduled maintenance on [component] has been completed
successfully. All systems are operating normally.
Thank you for your patience.
Posted at [time] [timezone]
Building Your Incident Response Team
Roles
| Role | Responsibility |
|---|---|
| Incident Commander | Coordinates response, makes decisions, manages timeline |
| Technical Lead | Diagnoses and implements the fix |
| Communications Lead | Writes and posts all external updates |
| Customer Success Liaison | Handles direct customer inquiries |
| Scribe | Documents the timeline for post-mortem |
For Small Teams (1-5 people)
One person handles both technical response and communication. Use templates to reduce cognitive load during incidents. Automate initial detection and status updates with tools like StatusDrop.
For Larger Teams (5+ people)
Separate the Communication Lead role from the Technical Lead. The person writing status updates should not be the person debugging the issue. This separation improves both response speed and communication quality.
Automation Opportunities
What to Automate
- Initial detection and alerting
- First status page update ("Investigating" based on monitoring triggers)
- Subscriber notifications when status changes
- Escalation when no update is posted within 30 minutes
- Post-incident report template generation
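The 30-minute escalation check is one of the simplest automations on this list. A minimal sketch, assuming incident update timestamps are stored as timezone-aware datetimes; the times shown are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Escalate when an active incident has no public update for 30 minutes,
# matching the playbook's "never go more than 30 minutes" rule.
ESCALATION_THRESHOLD = timedelta(minutes=30)

def needs_escalation(last_update: datetime, now: datetime) -> bool:
    """True when the incident has gone too long without a public update."""
    return now - last_update > ESCALATION_THRESHOLD

last = datetime(2024, 5, 1, 14, 0, tzinfo=timezone.utc)
print(needs_escalation(last, datetime(2024, 5, 1, 14, 20, tzinfo=timezone.utc)))  # False
print(needs_escalation(last, datetime(2024, 5, 1, 14, 45, tzinfo=timezone.utc)))  # True
```

A cron job or monitoring check running this against the last status-page timestamp is enough to page a human before the silence becomes noticeable.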
What Not to Automate
- Root cause explanation (requires human judgment)
- Time estimates (too risky to automate)
- Post-mortem analysis (requires reflection)
- Social media responses (too nuanced)
StatusDrop Automation
StatusDrop automates the detection and status update pipeline:
- Monitors 550+ third-party services every 1-5 minutes
- Automatically updates status when a dependency goes down
- Sends notifications via email, Slack, Discord, and Telegram
- Updates the embedded widget in real-time
- Provides a hosted status page with zero manual intervention
Measuring Communication Effectiveness
Key Metrics
- Time to first update: Target under 5 minutes
- Update frequency during incidents: Target every 15-30 minutes
- Support ticket volume during incidents: Compare with and without status updates
- Customer satisfaction post-incident: Survey affected users
- Post-mortem publication rate: Target 100% for P1/P2 incidents
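Time to first update is straightforward to compute from an incident log. A minimal sketch with illustrative timestamps and field names:

```python
from datetime import datetime

def time_to_first_update(detected_at: datetime, first_post_at: datetime) -> float:
    """Minutes between internal detection and the first public status post."""
    return (first_post_at - detected_at).total_seconds() / 60

# Illustrative incident: detected 14:00, first status post 14:04.
detected = datetime(2024, 5, 1, 14, 0)
first_post = datetime(2024, 5, 1, 14, 4)
print(time_to_first_update(detected, first_post))  # 4.0
```

Tracking this per incident over a quarter shows whether the under-5-minute target is actually being met rather than assumed.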
Benchmarks
| Metric | Good | Great | Elite |
|---|---|---|---|
| Time to first update | Under 15 min | Under 5 min | Under 2 min |
| Update frequency | Every 30 min | Every 15 min | Every 10 min |
| Ticket deflection | 20% | 35% | 50%+ |
| Post-mortem rate | 80% | 95% | 100% |
| Customer satisfaction | 3.5/5 | 4.0/5 | 4.5/5 |
Conclusion
Incident communication is a skill that improves with practice and preparation. The templates in this playbook give you a starting point, but the most important factor is consistency: always communicate, always be honest, and always follow up.
Use StatusDrop to automate the detection and notification pipeline so your team can focus on what matters most: resolving the issue and communicating clearly with your users.
Published by StatusDrop - Drop-in status monitoring for SaaS applications.