
Incident Communication Playbook

Templates, scripts, and workflows for communicating during service incidents. Covers every stage from detection to post-mortem, with ready-to-use templates for status pages, email, Slack, and social media.

By StatusDrop

Introduction

How you communicate during an incident matters more than the incident itself. A 30-minute outage with clear, timely communication is forgiven. A 5-minute blip with no communication erodes trust.

This playbook provides ready-to-use templates and workflows for every stage of incident communication. It is designed for SaaS teams of any size, from solo founders to enterprise engineering organizations.


The Incident Communication Timeline

Every incident follows a predictable communication arc:

| Phase | Time | Action |
|-------|------|--------|
| Detection | T+0 | Acknowledge the issue internally |
| Initial Update | T+5 min | Post first public status update |
| Investigation | T+5-30 min | Regular updates every 15-30 min |
| Identification | When root cause found | Update with cause and ETA |
| Fix Deployed | When fix is live | Update status to monitoring |
| Resolution | After stability confirmed | Mark incident resolved |
| Post-Mortem | T+24-48 hours | Publish detailed analysis |

The single most important rule: never go more than 30 minutes without an update during an active incident.


Phase 1: Detection and Acknowledgment

Internal Alert (Slack/Teams)

INCIDENT DETECTED

What: [Brief description of the issue]
Impact: [Who is affected and how]
Severity: [P1/P2/P3]
On-call: [Name of person investigating]
Status page: [Link to status page]

Thread below for updates. Do NOT communicate externally
until the first status page update is posted.

Status Page: Investigating

Investigating - [Component Name]

We are investigating reports of [brief, user-facing description].
Some users may experience [specific symptom: errors, slow loading,
failed transactions, etc.].

We are actively working to identify the root cause and will provide
updates every 15 minutes.

Posted at [time] [timezone]
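Templates like the one above can be stored as plain strings and filled in programmatically, so the on-call engineer only supplies the blanks. A minimal sketch using Python's standard-library `string.Template`; the field names are illustrative, not a StatusDrop API:

```python
from datetime import datetime, timezone
from string import Template

# Illustrative template; placeholder names are hypothetical.
INVESTIGATING = Template(
    "Investigating - $component\n\n"
    "We are investigating reports of $description. "
    "Some users may experience $symptom.\n\n"
    "We are actively working to identify the root cause and will "
    "provide updates every 15 minutes.\n\n"
    "Posted at $posted_at"
)

def render_investigating(component, description, symptom):
    # Timestamp with an explicit timezone.
    posted_at = datetime.now(timezone.utc).strftime("%H:%M UTC")
    return INVESTIGATING.substitute(
        component=component,
        description=description,
        symptom=symptom,
        posted_at=posted_at,
    )

update = render_investigating("API", "elevated error rates", "failed requests")
print(update.splitlines()[0])  # Investigating - API
```

Keeping templates in code (or config) means the wording is reviewed once, calmly, rather than improvised mid-incident.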

Email to Subscribers

Subject: [Service Name] - Investigating issues with [Component]

We are currently investigating an issue affecting [component/feature].

What is happening:
[1-2 sentences describing the user-visible impact]

What we are doing:
Our engineering team is actively investigating. We will send
updates as we learn more.

Current status: Investigating
Follow live updates: [status page URL]

Phase 2: Investigation Updates

15-Minute Update (No New Info)

Update - [Component Name]

We are continuing to investigate the issue affecting [component].
Our engineering team is actively working on this. We do not have
additional information at this time but will provide another
update within 15 minutes.

Posted at [time] [timezone]

15-Minute Update (Progress)

Update - [Component Name]

We have narrowed down the issue to [general area: database,
third-party service, network, etc.]. Our team is working on
[specific action: rolling back a deployment, scaling
infrastructure, contacting the provider, etc.].

We expect to have more information within the next 15 minutes.

Posted at [time] [timezone]

Third-Party Issue Identified

Update - [Component Name]

We have identified that this issue is related to [third-party
service name], which is currently experiencing [their reported
status]. This is affecting our [specific feature/component].

We are monitoring [third-party]'s status page for updates and
will communicate any changes. [Workaround if available].

[Third-party status page URL]

Posted at [time] [timezone]

Phase 3: Root Cause Identified

Status Page Update

Identified - [Component Name]

We have identified the root cause of the issue affecting
[component]. [One sentence plain-language explanation].

Our engineering team is [specific remediation action]. We
expect this to be resolved by approximately [time estimate]
[timezone].

[If applicable: Workaround: Users can [specific workaround]
in the meantime.]

Posted at [time] [timezone]

Email to Subscribers

Subject: [Service Name] - Root cause identified for [Component] issue

We have identified the root cause of the issue affecting
[component/feature].

What happened:
[2-3 sentences explaining in plain language]

What we are doing:
[Specific remediation steps]

Expected resolution:
We expect this to be resolved by [time] [timezone].

[If applicable]
Workaround:
[Steps users can take to work around the issue]

Current status: Identified
Follow live updates: [status page URL]

Phase 4: Fix Deployed

Status Page Update

Monitoring - [Component Name]

A fix has been deployed for the issue affecting [component].
We are monitoring the system to confirm stability.

If you continue to experience issues, please contact our
support team at [support email/URL].

We will provide a final update once we have confirmed
the fix is stable.

Posted at [time] [timezone]

Phase 5: Resolution

Status Page Update

Resolved - [Component Name]

The incident affecting [component] has been fully resolved.
All systems are now operating normally.

Duration: [start time] to [end time] ([total duration])
Impact: [brief summary of what was affected]

We will publish a detailed post-incident report within 48 hours.
We apologize for any inconvenience this may have caused.

Posted at [time] [timezone]

Email to Subscribers

Subject: [Service Name] - [Component] issue resolved

The incident affecting [component/feature] has been fully resolved.

Summary:
- Duration: [total duration]
- Impact: [what was affected]
- Root cause: [one sentence]
- Resolution: [one sentence]

All systems are now operating normally. We will publish a
detailed post-incident report within 48 hours.

We apologize for any inconvenience and appreciate your patience.

[status page URL]

Social Media (X/Twitter)

Update: The issue affecting [feature] has been resolved.
All systems are operating normally.

Duration: [X] minutes
Root cause: [brief]

Full details: [status page URL]

We apologize for the disruption.

Phase 6: Post-Incident Report

Template

Post-Incident Report: [Incident Title]
Date: [Date]
Duration: [Start time] to [End time] ([Total duration])
Severity: [P1/P2/P3]
Impact: [Number of affected users/requests/transactions]

## Summary

[2-3 paragraph summary of what happened, written for a
non-technical audience]

## Timeline

[Chronological list of key events]

- HH:MM - [Event description]
- HH:MM - [Event description]
- HH:MM - [Event description]
- HH:MM - [Event description]

## Root Cause

[Technical explanation of what caused the incident.
Be specific but accessible.]

## Resolution

[What was done to fix the immediate issue]

## Preventive Measures

[What changes are being made to prevent recurrence]

| Action Item | Owner | Target Date | Status |
|-------------|-------|-------------|--------|
| [Action 1]  | [Name]| [Date]      | In progress |
| [Action 2]  | [Name]| [Date]      | Planned |
| [Action 3]  | [Name]| [Date]      | Planned |

## Lessons Learned

- [Key takeaway 1]
- [Key takeaway 2]
- [Key takeaway 3]

Severity Classification

P1 - Critical

Criteria: Core functionality unavailable for all or most users. Revenue-impacting. Data integrity risk.

Communication cadence: Updates every 10-15 minutes. All hands on deck. Executive notification.

Channels: Status page, email, Slack/Discord webhooks, social media.

P2 - Major

Criteria: Significant functionality degraded. Subset of users affected. Workaround available.

Communication cadence: Updates every 15-30 minutes. On-call engineer plus backup.

Channels: Status page, email, Slack/Discord webhooks.

P3 - Minor

Criteria: Minor functionality affected. Small user impact. Easy workaround.

Communication cadence: Updates every 30-60 minutes. On-call engineer.

Channels: Status page only.
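The three severity levels above map cleanly onto a small policy table, so tooling can pick the cadence and channels automatically once a severity is assigned. A sketch; the structure and names are illustrative, not a StatusDrop API:

```python
# Illustrative severity policy; values mirror the classification above.
SEVERITY_POLICY = {
    "P1": {
        "update_interval_min": (10, 15),  # update every 10-15 minutes
        "channels": ["status_page", "email", "chat_webhooks", "social"],
    },
    "P2": {
        "update_interval_min": (15, 30),
        "channels": ["status_page", "email", "chat_webhooks"],
    },
    "P3": {
        "update_interval_min": (30, 60),
        "channels": ["status_page"],
    },
}

def channels_for(severity):
    """Return the notification channels required for a given severity."""
    return SEVERITY_POLICY[severity]["channels"]

print(channels_for("P3"))  # ['status_page']
```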


Channel-Specific Guidelines

Status Page

  • Always the primary source of truth
  • Update before any other channel
  • Use the standardized status levels (Investigating, Identified, Monitoring, Resolved)
  • Include timestamps with timezone

Email

  • Only send for P1 and P2 incidents
  • Keep subject lines factual, not alarming
  • Include a link to the status page for live updates
  • Send at most 3 emails per incident (initial, identified, resolved)

Slack / Discord / Telegram

  • Use for real-time updates to subscribed users
  • Keep messages concise (under 280 characters for the summary)
  • Include status page link for details
  • Use appropriate formatting (bold for status, code blocks for technical details)
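Chat updates are typically pushed through an incoming webhook. A hedged sketch using only the Python standard library; the webhook URL is a placeholder, and the payload assumes a Slack-style webhook that accepts a top-level "text" field:

```python
import json
import urllib.request

def build_message(status, summary, status_page_url):
    # Bold status, concise summary, link to the status page for details.
    return f"*{status}*: {summary}\nDetails: {status_page_url}"

def post_update(webhook_url, status, summary, status_page_url):
    """Post a concise incident update to a Slack-style incoming webhook."""
    payload = json.dumps({"text": build_message(status, summary, status_page_url)})
    req = urllib.request.Request(
        webhook_url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status

# Example call (placeholder URL, not sent here):
# post_update("https://hooks.slack.com/services/T000/B000/XXXX",
#             "Monitoring", "Fix deployed for API errors",
#             "https://status.example.com")
```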

Social Media

  • Only post for P1 incidents affecting a large user base
  • Be factual, not defensive
  • Do not engage with angry replies during an active incident
  • Post resolution update once confirmed

Customer Success / Sales

  • Prepare talking points before CS/Sales teams are asked
  • Include: what happened, who is affected, ETA, workaround
  • Update talking points with each status change
  • Provide the post-incident report for follow-up conversations

Tone and Language Guide

Do

  • Use plain language ("the payment system is slow" not "elevated P99 latencies")
  • Be specific about impact ("some users cannot log in" not "we are experiencing issues")
  • Give time estimates when possible ("we expect resolution within 1 hour")
  • Acknowledge the inconvenience
  • Use active voice ("we identified the issue" not "the issue was identified")

Do Not

  • Blame third parties without confirming ("this appears to be a Stripe issue")
  • Use jargon (P99, 5xx, pod, cluster, shard)
  • Minimize the impact ("a small number of users" when it is 30%)
  • Promise it will never happen again
  • Use humor during active incidents
  • Share internal details (server names, IP addresses, code snippets)

Scheduled Maintenance Communication

7-Day Advance Notice

Subject: Scheduled maintenance: [Component] on [Date]

We will be performing scheduled maintenance on [component]
on [date] from [start time] to [end time] [timezone].

What to expect:
- [Specific impact: "the dashboard will be unavailable",
  "API response times may be slower", etc.]
- Duration: approximately [X] hours
- [Workaround if applicable]

Why:
[Brief explanation: database migration, security update,
infrastructure upgrade, etc.]

No action is required from your side. We will send a
reminder 24 hours before the maintenance window.

Questions? Contact [support email/URL].

24-Hour Reminder

Reminder: Scheduled maintenance on [component] begins
tomorrow at [time] [timezone].

Expected duration: [X] hours
Impact: [brief description]

[status page URL]

Maintenance Started

Maintenance in progress - [Component]

Scheduled maintenance on [component] has begun. This is
expected to last approximately [X] hours.

[Impact description]

We will update this status when maintenance is complete.

Posted at [time] [timezone]

Maintenance Completed

Maintenance complete - [Component]

Scheduled maintenance on [component] has been completed
successfully. All systems are operating normally.

Thank you for your patience.

Posted at [time] [timezone]

Building Your Incident Response Team

Roles

| Role | Responsibility |
|------|----------------|
| Incident Commander | Coordinates response, makes decisions, manages timeline |
| Technical Lead | Diagnoses and implements the fix |
| Communications Lead | Writes and posts all external updates |
| Customer Success Liaison | Handles direct customer inquiries |
| Scribe | Documents the timeline for post-mortem |

For Small Teams (1-5 people)

One person handles both technical response and communication. Use templates to reduce cognitive load during incidents. Automate initial detection and status updates with tools like StatusDrop.

For Larger Teams (5+ people)

Separate the Communication Lead role from the Technical Lead. The person writing status updates should not be the person debugging the issue. This separation improves both response speed and communication quality.


Automation Opportunities

What to Automate

  • Initial detection and alerting
  • First status page update ("Investigating" based on monitoring triggers)
  • Subscriber notifications when status changes
  • Escalation when no update is posted within 30 minutes
  • Post-incident report template generation
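The escalation rule above ("no update posted within 30 minutes") is simple to check mechanically. A sketch, assuming you track the timestamp of each incident's last public update:

```python
from datetime import datetime, timedelta, timezone

# Mirrors the playbook rule: never go more than 30 minutes without an update.
UPDATE_DEADLINE = timedelta(minutes=30)

def needs_escalation(last_update, now=None):
    """True when an active incident has gone too long without a public update."""
    now = now or datetime.now(timezone.utc)
    return now - last_update > UPDATE_DEADLINE

last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(needs_escalation(last, last + timedelta(minutes=45)))  # True
print(needs_escalation(last, last + timedelta(minutes=10)))  # False
```

A check like this can run on a schedule and page a backup engineer when it returns True.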

What Not to Automate

  • Root cause explanation (requires human judgment)
  • Time estimates (too risky to automate)
  • Post-mortem analysis (requires reflection)
  • Social media responses (too nuanced)

StatusDrop Automation

StatusDrop automates the detection and status update pipeline:

  1. Monitors 550+ third-party services every 1-5 minutes
  2. Automatically updates status when a dependency goes down
  3. Sends notifications via email, Slack, Discord, and Telegram
  4. Updates the embedded widget in real-time
  5. Provides a hosted status page with zero manual intervention

Measuring Communication Effectiveness

Key Metrics

  • Time to first update: Target under 5 minutes
  • Update frequency during incidents: Target every 15-30 minutes
  • Support ticket volume during incidents: Compare with and without status updates
  • Customer satisfaction post-incident: Survey affected users
  • Post-mortem publication rate: Target 100% for P1/P2 incidents
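Most of these metrics fall out of timestamps you already record. A sketch computing time-to-first-update and the average gap between updates from an incident's event log; the field names are illustrative:

```python
from datetime import datetime, timezone

def minutes_between(a, b):
    return (b - a).total_seconds() / 60

def incident_metrics(detected_at, update_times):
    """Time to first update and mean gap between updates, in minutes."""
    ttfu = minutes_between(detected_at, update_times[0])
    gaps = [minutes_between(a, b) for a, b in zip(update_times, update_times[1:])]
    avg_gap = sum(gaps) / len(gaps) if gaps else None
    return ttfu, avg_gap

# Detected at 12:00; updates posted at 12:04, 12:19, 12:34.
t0 = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
updates = [t0.replace(minute=m) for m in (4, 19, 34)]
ttfu, gap = incident_metrics(t0, updates)
print(ttfu, gap)  # 4.0 15.0
```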

Benchmarks

| Metric | Good | Great | Elite |
|--------|------|-------|-------|
| Time to first update | Under 15 min | Under 5 min | Under 2 min |
| Update frequency | Every 30 min | Every 15 min | Every 10 min |
| Ticket deflection | 20% | 35% | 50%+ |
| Post-mortem rate | 80% | 95% | 100% |
| Customer satisfaction | 3.5/5 | 4.0/5 | 4.5/5 |

Conclusion

Incident communication is a skill that improves with practice and preparation. The templates in this playbook give you a starting point, but the most important factor is consistency: always communicate, always be honest, and always follow up.

Use StatusDrop to automate the detection and notification pipeline so your team can focus on what matters most: resolving the issue and communicating clearly with your users.


Published by StatusDrop - Drop-in status monitoring for SaaS applications.

Ready to add a status page to your SaaS?

StatusDrop monitors 550+ services with one script tag. Free plan available, Pro at $14.99/month.