The Incident Communication Playbook
Everything your team needs to communicate effectively during outages — severity classification, status page templates, notification matrices, and post-incident review guides.
Why incident communication matters
When your service goes down, silence is your worst enemy. Customers don’t just want to know what’s happening — they need to know you’re on it. Without a communication plan, your team scrambles to draft messages while support tickets pile up and trust erodes.
The data backs this up: companies with proactive incident communication see a 60% reduction in support tickets during outages and retain 95% of customers who would otherwise churn from the experience. The difference between keeping and losing a customer during an outage often comes down to how quickly and clearly you communicate.
This playbook gives your team a ready-made framework. No more drafting messages from scratch at 3 AM. No more debating who to notify. Every severity level, every stakeholder, every phase of an incident — covered with templates you can copy, paste, and customize.
Step 1
Severity classification guide
Classify every incident immediately. The severity level determines response time, update frequency, and who gets notified.
| Severity | Impact | Response time | Update frequency | Who’s involved | Example |
|---|---|---|---|---|---|
| SEV1 | Complete service outage. All users affected. | Immediate (< 5 minutes) | Every 15 minutes | Engineering lead, exec team, support lead | “All API endpoints returning 500 errors” |
| SEV2 | Significant degradation. Many users affected. | Within 15 minutes | Every 30 minutes | Engineering lead, support team | “Payment processing failing for 40% of transactions” |
| SEV3 | Partial degradation. Some users affected. | Within 1 hour | Every 1–2 hours | On-call engineer, support team | “Email notifications delayed by 10 minutes” |
| SEV4 | Minimal impact. Workaround available. | Next business day | As needed | Assigned engineer | “Dark mode rendering incorrectly on Safari” |
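The classification guide above can be encoded as data so that paging bots and status tooling pick up the right policy automatically. A minimal Python sketch; the `Policy` and `SEV_POLICY` names are illustrative, not from any real library:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Policy:
    response: str       # how fast someone must acknowledge the incident
    update_every: str   # status page update cadence
    notify: tuple       # roles involved at this severity

# Lookup table mirroring the severity classification guide above.
SEV_POLICY = {
    "SEV1": Policy("immediate (< 5 min)", "every 15 min",
                   ("engineering lead", "exec team", "support lead")),
    "SEV2": Policy("within 15 min", "every 30 min",
                   ("engineering lead", "support team")),
    "SEV3": Policy("within 1 hour", "every 1-2 hours",
                   ("on-call engineer", "support team")),
    "SEV4": Policy("next business day", "as needed",
                   ("assigned engineer",)),
}

def policy_for(severity: str) -> Policy:
    """Return the communication policy for a severity level."""
    return SEV_POLICY[severity.upper()]
```

Keeping the policy in one table means the on-call runbook, the paging integration, and the status page cadence can never drift apart.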
Step 2
Status page update templates
Copy-paste templates for every phase of an incident. Replace the bracketed text with your specifics.
Investigating:
We are currently investigating reports of [issue description]. Our engineering team is actively looking into this. We will provide an update within [timeframe]. We apologize for the inconvenience.
Identified:
We have identified the root cause of [issue description]. The issue is related to [brief technical explanation]. Our team is working on a fix. Next update in [timeframe].
Monitoring:
A fix has been deployed for [issue description]. We are currently monitoring the situation to ensure stability. If you continue to experience issues, please contact support at [support email].
Resolved:
The incident affecting [component/service] has been resolved. [Brief explanation of what happened and what was done]. Total downtime: [duration]. We apologize for the disruption and will be conducting a post-incident review.
Scheduled maintenance:
We will be performing scheduled maintenance on [component] on [date] from [start time] to [end time] [timezone]. During this window, [expected impact]. No action is required on your part.
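The bracketed placeholders map directly onto Python’s `str.format` fields, so the templates can be stored once and filled in at incident time. A minimal sketch, assuming hypothetical template and field names:

```python
# Status page templates keyed by incident phase; {fields} stand in for the
# bracketed placeholders in the playbook. Names here are illustrative.
TEMPLATES = {
    "investigating": (
        "We are currently investigating reports of {issue}. Our engineering "
        "team is actively looking into this. We will provide an update "
        "within {timeframe}. We apologize for the inconvenience."
    ),
    "identified": (
        "We have identified the root cause of {issue}. The issue is related "
        "to {cause}. Our team is working on a fix. Next update in {timeframe}."
    ),
}

def render(phase: str, **fields: str) -> str:
    """Substitute concrete details into a status page template."""
    return TEMPLATES[phase].format(**fields)

msg = render("investigating",
             issue="elevated API error rates",
             timeframe="30 minutes")
```

A missing field raises `KeyError` immediately, which is preferable to publishing an update with a leftover `[timeframe]` placeholder in it.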
Step 3
Stakeholder notification matrix
Who to notify, when, and how — based on severity level.
| Stakeholder | SEV1 | SEV2 | SEV3 | SEV4 |
|---|---|---|---|---|
| Status Page | Immediate | Within 15 min | Within 1 hr | As needed |
| Email Subscribers | Immediate | Within 15 min | Within 1 hr | — |
| Slack / Teams | Immediate | Immediate | Within 30 min | — |
| Executive Team | Immediate | Within 30 min | Daily summary | — |
| Customer Support | Immediate | Immediate | Within 1 hr | Next standup |
| Account Managers | Within 15 min | Within 1 hr | — | — |
| Social Media | Within 30 min | As needed | — | — |
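The matrix above can also be expressed as deadlines in minutes, so a bot can list which channels are due in what order. A minimal sketch; it encodes only the hard time-based deadlines (the “as needed”, “daily summary”, and “next standup” cells are deliberately omitted), and all names are illustrative:

```python
# Minutes after incident start by which each channel must be notified,
# per severity. Channels without a hard deadline at a severity are omitted.
NOTIFY_MINUTES = {
    "status_page":      {"SEV1": 0,  "SEV2": 15, "SEV3": 60},
    "email":            {"SEV1": 0,  "SEV2": 15, "SEV3": 60},
    "chat":             {"SEV1": 0,  "SEV2": 0,  "SEV3": 30},
    "executive_team":   {"SEV1": 0,  "SEV2": 30},
    "customer_support": {"SEV1": 0,  "SEV2": 0,  "SEV3": 60},
    "account_managers": {"SEV1": 15, "SEV2": 60},
    "social_media":     {"SEV1": 30},
}

def channels_due(severity: str) -> list:
    """Channels to notify at a severity, ordered by deadline (minutes)."""
    due = [(mins[severity], channel)
           for channel, mins in NOTIFY_MINUTES.items()
           if severity in mins]
    return [channel for _, channel in sorted(due)]
```

For example, `channels_due("SEV3")` returns the chat channel first (30-minute deadline) followed by the one-hour channels, and excludes social media entirely.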
Step 4
Post-incident review template
Run a blameless post-incident review within 48 hours. Use this template to structure the conversation.
Incident Summary
- Incident title:
- [Title]
- Severity:
- SEV[1-4]
- Duration:
- [Start time] to [End time]
- Impact:
- [Who/what was affected]
Timeline
- [HH:MM] — Issue first detected
- [HH:MM] — Engineering notified
- [HH:MM] — Status page updated
- [HH:MM] — Root cause identified
- [HH:MM] — Fix deployed
- [HH:MM] — Service restored
- [HH:MM] — All-clear confirmed
Root Cause
- What happened:
- [Description]
- Why it happened:
- [Description]
- Why it wasn’t caught earlier:
- [Description]
Action Items
- [Preventive action] — Owner: [Name] — Due: [Date]
- [Detection improvement] — Owner: [Name] — Due: [Date]
- [Process change] — Owner: [Name] — Due: [Date]
Lessons Learned
- What went well:
- [Description]
- What could be improved:
- [Description]
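The timeline section and the total-downtime figure can be generated from the raw event log instead of assembled by hand. A minimal Python sketch, assuming same-day `HH:MM` timestamps in chronological order; the `timeline_section` helper is hypothetical:

```python
from datetime import datetime

def timeline_section(events):
    """Build the review's Timeline bullets and compute downtime in minutes.

    events: list of ("HH:MM", description) tuples in chronological order.
    Assumes all timestamps fall on the same day (no midnight wraparound).
    """
    lines = [f"- [{time}] — {desc}" for time, desc in events]
    first = datetime.strptime(events[0][0], "%H:%M")
    last = datetime.strptime(events[-1][0], "%H:%M")
    downtime = int((last - first).total_seconds() // 60)
    return "\n".join(lines), downtime

section, minutes = timeline_section([
    ("02:14", "Issue first detected"),
    ("02:20", "Engineering notified"),
    ("03:01", "Service restored"),
])
```

Deriving the duration from the same timestamps that appear in the timeline keeps the “Total downtime” line in the summary consistent with the event log.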
Step 5
Sample language for common scenarios
Real-world examples for the most common incident types.
Database outage
Investigating:
“We are investigating issues with our primary database that are affecting all application services. Users may experience errors loading pages or saving data. Our database team is actively working on restoration. Next update in 15 minutes.”
Identified:
“We’ve identified that a failed database migration caused our primary database cluster to become unresponsive. We are rolling back the migration and restoring from our latest backup. Estimated resolution: 30 minutes.”
Resolved:
“The database has been fully restored and all services are operating normally. The failed migration has been rolled back and will be reworked before re-deployment. Total downtime: 47 minutes. We sincerely apologize for the disruption.”
API rate limiting errors
Investigating:
“Some API users are experiencing increased 429 (rate limit) responses. We are investigating unexpected rate limiting behavior that appears to be affecting accounts below their plan limits. API dashboard and webhooks are unaffected.”
Resolved:
“The rate limiting issue has been resolved. A configuration change incorrectly lowered rate limit thresholds for Pro plan users. Limits have been restored to their correct values. No data was lost. We apologize for any failed API calls during this window.”
Third-party email provider outage
Investigating:
“Email notifications are currently delayed due to an issue with our email delivery provider. We are in contact with their team and monitoring their status page for updates. Push notifications and webhook notifications are unaffected.”
Resolved:
“Our email delivery provider has resolved their infrastructure issue. Email notifications are being delivered normally. Any queued emails during the outage have been sent. We are evaluating backup email providers to prevent future impact from third-party outages.”
Scheduled maintenance
Before (48 hours):
“Scheduled maintenance: We will be upgrading our database infrastructure on Saturday, March 28 from 2:00 AM to 4:00 AM UTC. During this window, the application will be in read-only mode. No data will be lost. We recommend completing any pending changes before the maintenance window.”
During:
“Scheduled maintenance is in progress. The application is currently in read-only mode while we upgrade our database infrastructure. We expect to complete this by 4:00 AM UTC.”
After:
“Scheduled maintenance is complete. All services are fully operational. The database upgrade was successful and users should see improved query performance. Thank you for your patience.”
Security incident
Initial disclosure:
“We are investigating a security concern that was brought to our attention. Out of an abundance of caution, we have taken [specific protective action]. We are working with our security team to fully assess the situation. We will provide a detailed update within [timeframe].”
Resolution:
“Our investigation into the security concern reported on [date] is complete. [Summary of findings]. [Whether user data was affected]. [Actions taken]. [Whether users need to take any action, e.g., password reset]. We take security seriously and have implemented additional safeguards to prevent similar issues.”
Performance degradation
Investigating:
“We are aware of slower than normal response times across our application. Pages may take longer to load and API responses may be delayed. Our infrastructure team is investigating the cause. All data operations are functioning correctly — only speed is affected.”
Resolved:
“Performance has been restored to normal levels. The slowdown was caused by a spike in background job processing that consumed excessive database connections. We’ve optimized our job queue configuration to prevent recurrence. Response times are back to under 200ms.”
Put this playbook into action with CheckStatus
CheckStatus gives your team the tools to execute this playbook — incident templates, subscriber notifications, and status pages your customers trust.