The Incident Communication Playbook
Everything your team needs to communicate effectively during outages — severity classification, status page templates, notification matrices, and post-incident review guides.
Why incident communication matters
When your service goes down, silence is your worst enemy. Customers don’t just want to know what’s happening — they need to know you’re on it. Without a communication plan, your team scrambles to draft messages while support tickets pile up and trust erodes.
The data backs this up: companies with proactive incident communication see a 60% reduction in support tickets during outages and retain 95% of customers who would otherwise churn from the experience. The difference between keeping and losing a customer during an outage often comes down to how quickly and clearly you communicate.
This playbook gives your team a ready-made framework. No more drafting messages from scratch at 3 AM. No more debating who to notify. Every severity level, every stakeholder, every phase of an incident — covered with templates you can copy, paste, and customize.
Step 1
Severity classification guide
Classify every incident immediately. The severity level determines response time, update frequency, and who gets notified.
| Severity | Impact | Response time | Update frequency | Who’s involved | Example |
|---|---|---|---|---|---|
| SEV1 | Complete service outage. All users affected. | Immediate (< 5 minutes) | Every 15 minutes | Engineering lead, exec team, support lead | “All API endpoints returning 500 errors” |
| SEV2 | Significant degradation. Many users affected. | Within 15 minutes | Every 30 minutes | Engineering lead, support team | “Payment processing failing for 40% of transactions” |
| SEV3 | Partial degradation. Some users affected. | Within 1 hour | Every 1–2 hours | On-call engineer, support team | “Email notifications delayed by 10 minutes” |
| SEV4 | Minimal impact. Workaround available. | Next business day | As needed | Assigned engineer | “Dark mode rendering incorrectly on Safari” |
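The classification guide above can be encoded as data so that paging bots and status tooling pick up the right policy automatically. A minimal Python sketch; the `Policy` and `SEV_POLICY` names are illustrative, not from any real library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    response: str       # how fast someone must acknowledge the incident
    update_every: str   # status page update cadence
    notify: tuple       # roles involved at this severity

# Lookup table mirroring the severity classification guide above.
SEV_POLICY = {
    "SEV1": Policy("immediate (< 5 min)", "every 15 min",
                   ("engineering lead", "exec team", "support lead")),
    "SEV2": Policy("within 15 min", "every 30 min",
                   ("engineering lead", "support team")),
    "SEV3": Policy("within 1 hour", "every 1-2 hours",
                   ("on-call engineer", "support team")),
    "SEV4": Policy("next business day", "as needed",
                   ("assigned engineer",)),
}

def policy_for(severity: str) -> Policy:
    """Return the communication policy for a severity level."""
    return SEV_POLICY[severity.upper()]
```

Keeping the policy in one table means the on-call runbook, the paging integration, and the status page cadence can never drift apart.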
Step 2
Status page update templates
Copy-paste templates for every phase of an incident. Replace the bracketed text with your specifics.
Investigating:
We are currently investigating reports of [issue description]. Our engineering team is actively looking into this. We will provide an update within [timeframe]. We apologize for the inconvenience.
Identified:
We have identified the root cause of [issue description]. The issue is related to [brief technical explanation]. Our team is working on a fix. Next update in [timeframe].
Monitoring:
A fix has been deployed for [issue description]. We are currently monitoring the situation to ensure stability. If you continue to experience issues, please contact support at [support email].
Resolved:
The incident affecting [component/service] has been resolved. [Brief explanation of what happened and what was done]. Total downtime: [duration]. We apologize for the disruption and will be conducting a post-incident review.
Scheduled maintenance:
We will be performing scheduled maintenance on [component] on [date] from [start time] to [end time] [timezone]. During this window, [expected impact]. No action is required on your part.
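The bracketed placeholders map directly onto Python’s `str.format` fields, so the templates can be stored once and filled in at incident time. A minimal sketch, assuming hypothetical template and field names:

```python
# Status page templates keyed by incident phase; {fields} stand in for the
# bracketed placeholders in the playbook. Names here are illustrative.
TEMPLATES = {
    "investigating": (
        "We are currently investigating reports of {issue}. Our engineering "
        "team is actively looking into this. We will provide an update "
        "within {timeframe}. We apologize for the inconvenience."
    ),
    "identified": (
        "We have identified the root cause of {issue}. The issue is related "
        "to {cause}. Our team is working on a fix. Next update in {timeframe}."
    ),
}

def render(phase: str, **fields: str) -> str:
    """Substitute concrete details into a status page template."""
    return TEMPLATES[phase].format(**fields)

msg = render("investigating",
             issue="elevated API error rates",
             timeframe="30 minutes")
```

A missing field raises `KeyError` immediately, which is preferable to publishing an update with a leftover `[timeframe]` placeholder in it.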
Step 3
Stakeholder notification matrix
Who to notify, when, and how — based on severity level.
| Stakeholder | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Status Page | Immediate | Within 15 min | Within 1 hr | As needed |
| Email Subscribers | Immediate | Within 15 min | Within 1 hr | — |
| Slack / Teams | Immediate | Immediate | Within 30 min | — |
| Executive Team | Immediate | Within 30 min | Daily summary | — |
| Customer Support | Immediate | Immediate | Within 1 hr | Next standup |
| Account Managers | Within 15 min | Within 1 hr | — | — |
| Social Media | Within 30 min | As needed | — | — |
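The matrix above can also be expressed as deadlines in minutes, so a bot can list which channels are due in what order. A minimal sketch; it encodes only the hard time-based deadlines (the “as needed”, “daily summary”, and “next standup” cells are deliberately omitted), and all names are illustrative:

```python
# Minutes after incident start by which each channel must be notified,
# per severity. Channels without a hard deadline at a severity are omitted.
NOTIFY_MINUTES = {
    "status_page":      {"SEV1": 0,  "SEV2": 15, "SEV3": 60},
    "email":            {"SEV1": 0,  "SEV2": 15, "SEV3": 60},
    "chat":             {"SEV1": 0,  "SEV2": 0,  "SEV3": 30},
    "executive_team":   {"SEV1": 0,  "SEV2": 30},
    "customer_support": {"SEV1": 0,  "SEV2": 0,  "SEV3": 60},
    "account_managers": {"SEV1": 15, "SEV2": 60},
    "social_media":     {"SEV1": 30},
}

def channels_due(severity: str) -> list:
    """Channels to notify at a severity, ordered by deadline (minutes)."""
    due = [(mins[severity], channel)
           for channel, mins in NOTIFY_MINUTES.items()
           if severity in mins]
    return [channel for _, channel in sorted(due)]
```

For example, `channels_due("SEV3")` returns the chat channel first (30-minute deadline) followed by the one-hour channels, and excludes social media entirely.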
Step 4
Post-incident review template
Run a blameless post-incident review within 48 hours. Use this template to structure the conversation.
Incident Summary
- Incident title:
- [Title]
- Severity:
- SEV[1-4]
- Duration:
- [Start time] to [End time]
- Impact:
- [Who/what was affected]
Timeline
- [HH:MM] — Issue first detected
- [HH:MM] — Engineering notified
- [HH:MM] — Status page updated
- [HH:MM] — Root cause identified
- [HH:MM] — Fix deployed
- [HH:MM] — Service restored
- [HH:MM] — All-clear confirmed
Root Cause
- What happened:
- [Description]
- Why it happened:
- [Description]
- Why it wasn’t caught earlier:
- [Description]
Action Items
- [Preventive action] — Owner: [Name] — Due: [Date]
- [Detection improvement] — Owner: [Name] — Due: [Date]
- [Process change] — Owner: [Name] — Due: [Date]
Lessons Learned
- What went well:
- [Description]
- What could be improved:
- [Description]
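The timeline section and the total-downtime figure can be generated from the raw event log instead of assembled by hand. A minimal Python sketch, assuming same-day `HH:MM` timestamps in chronological order; the `timeline_section` helper is hypothetical:

```python
from datetime import datetime

def timeline_section(events):
    """Build the review's Timeline bullets and compute downtime in minutes.

    events: list of ("HH:MM", description) tuples in chronological order.
    Assumes all timestamps fall on the same day (no midnight wraparound).
    """
    lines = [f"- [{time}] — {desc}" for time, desc in events]
    first = datetime.strptime(events[0][0], "%H:%M")
    last = datetime.strptime(events[-1][0], "%H:%M")
    downtime = int((last - first).total_seconds() // 60)
    return "\n".join(lines), downtime

section, minutes = timeline_section([
    ("02:14", "Issue first detected"),
    ("02:20", "Engineering notified"),
    ("03:01", "Service restored"),
])
```

Deriving the duration from the same timestamps that appear in the timeline keeps the “Total downtime” line in the summary consistent with the event log.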
Step 5
Sample language for common scenarios
Real-world examples for the most common incident types.
Database outage
Investigating:
“We are investigating issues with our primary database that are affecting all application services. Users may experience errors loading pages or saving data. Our database team is actively working on restoration. Next update in 15 minutes.”
Identified:
“We’ve identified that a failed database migration caused our primary database cluster to become unresponsive. We are rolling back the migration and restoring from our latest backup. Estimated resolution: 30 minutes.”
Resolved:
“The database has been fully restored and all services are operating normally. The failed migration has been rolled back and will be reworked before re-deployment. Total downtime: 47 minutes. We sincerely apologize for the disruption.”
API rate limiting errors
Investigating:
“Some API users are experiencing increased 429 (rate limit) responses. We are investigating unexpected rate limiting behavior that appears to be affecting accounts below their plan limits. API dashboard and webhooks are unaffected.”
Resolved:
“The rate limiting issue has been resolved. A configuration change incorrectly lowered rate limit thresholds for Pro plan users. Limits have been restored to their correct values. No data was lost. We apologize for any failed API calls during this window.”
Third-party email provider outage
Investigating:
“Email notifications are currently delayed due to an issue with our email delivery provider. We are in contact with their team and monitoring their status page for updates. Push notifications and webhook notifications are unaffected.”
Resolved:
“Our email delivery provider has resolved their infrastructure issue. Email notifications are being delivered normally. Any queued emails during the outage have been sent. We are evaluating backup email providers to prevent future impact from third-party outages.”
Scheduled maintenance
Before (48 hours):
“Scheduled maintenance: We will be upgrading our database infrastructure on Saturday, March 28 from 2:00 AM to 4:00 AM UTC. During this window, the application will be in read-only mode. No data will be lost. We recommend completing any pending changes before the maintenance window.”
During:
“Scheduled maintenance is in progress. The application is currently in read-only mode while we upgrade our database infrastructure. We expect to complete this by 4:00 AM UTC.”
After:
“Scheduled maintenance is complete. All services are fully operational. The database upgrade was successful and users should see improved query performance. Thank you for your patience.”
Security incident
Initial disclosure:
“We are investigating a security concern that was brought to our attention. Out of an abundance of caution, we have taken [specific protective action]. We are working with our security team to fully assess the situation. We will provide a detailed update within [timeframe].”
Resolution:
“Our investigation into the security concern reported on [date] is complete. [Summary of findings]. [Whether user data was affected]. [Actions taken]. [Whether users need to take any action, e.g., password reset]. We take security seriously and have implemented additional safeguards to prevent similar issues.”
Performance degradation
Investigating:
“We are aware of slower than normal response times across our application. Pages may take longer to load and API responses may be delayed. Our infrastructure team is investigating the cause. All data operations are functioning correctly — only speed is affected.”
Resolved:
“Performance has been restored to normal levels. The slowdown was caused by a spike in background job processing that consumed excessive database connections. We’ve optimized our job queue configuration to prevent recurrence. Response times are back to under 200ms.”
Put this playbook into action with CheckStatus
CheckStatus gives your team the tools to execute this playbook — incident templates, subscriber notifications, and status pages your customers trust.