Tabletop Scenario Template: Cloud Outage
This scenario simulates a major outage affecting your primary cloud provider (e.g., AWS, Azure, GCP) and is intended for SaaS companies with remote teams and cloud-hosted infrastructure.
Exercise Setup
| Field | Value |
|---|---|
| Scenario Name | Cloud Infrastructure Outage |
| Exercise Type | Discussion-based Tabletop |
| Facilitator | Name/Role of Facilitator |
| Date | MM/DD/YYYY |
| Duration | 60–90 minutes |
| Participants | Engineering, Security, Customer Success, Product, Executive |
| Objective | Test the response to a sustained cloud infrastructure failure and assess communications, data resilience, and decision-making under pressure. |
Scenario Briefing
At 8:30 AM EST, your monitoring system detects elevated latency and timeout errors for your primary SaaS application hosted in AWS (us-east-1).
By 8:45 AM, your status page shows a partial outage. Users are reporting login failures and data that will not load.
Timeline of Escalation
| Time | New Information |
|---|---|
| T+0 | Customers report your app is down. Monitoring shows service errors. |
| T+15 min | AWS status page confirms “connectivity issues” in us-east-1. |
| T+30 min | You discover that your backups are stored in the same region. Users escalate via social media. |
| T+45 min | A key enterprise client emails saying their team is blocked. |
| T+60 min | Internal tools (Slack, CI/CD) are also affected. Engineers are asking for direction. |
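
Facilitators may find it easier to deliver injects against wall-clock times than against T+ offsets. The sketch below converts the timeline above into a clock schedule. It assumes T+0 corresponds to the 8:30 AM detection time in the briefing; the exercise date and the names `T_ZERO`, `INJECTS`, and `build_schedule` are hypothetical and only illustrate the idea.

```python
from datetime import datetime, timedelta

# Assumption: T+0 is the 8:30 AM detection time from the Scenario Briefing.
# The calendar date is a placeholder; only the time of day matters here.
T_ZERO = datetime(2025, 1, 15, 8, 30)

# (minutes after T+0, inject summary) taken from the Timeline of Escalation.
INJECTS = [
    (0, "Customers report your app is down. Monitoring shows service errors."),
    (15, "AWS status page confirms connectivity issues in us-east-1."),
    (30, "Backups are stored in the same region. Users escalate via social media."),
    (45, "A key enterprise client emails saying their team is blocked."),
    (60, "Internal tools (Slack, CI/CD) also affected. Engineers ask for direction."),
]


def build_schedule(t_zero, injects):
    """Return (HH:MM clock time, inject text) pairs in timeline order."""
    return [
        ((t_zero + timedelta(minutes=offset)).strftime("%H:%M"), text)
        for offset, text in sorted(injects)
    ]


schedule = build_schedule(T_ZERO, INJECTS)
for clock, text in schedule:
    print(f"{clock}  {text}")
```

Changing `T_ZERO` to the actual start of the session shifts every inject accordingly, which helps when the exercise starts late or is run in a different time zone.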
Discussion Prompts
| Category | Questions |
|---|---|
| Detection | How do we first become aware of the issue? Who confirms it? |
| Communication | When and how do we notify users? Who approves the message? |
| Internal Coordination | What tools do we use to coordinate? Who leads incident command? |
| Business Impact | What services are down? What SLAs are being violated? |
| Recovery Plan | Do we have cross-region failover? Can we restore from backup? |
| Escalation | Who is authorized to declare this a major incident? Do we contact AWS? |
| Postmortem Readiness | Are logs and actions being recorded? What needs to be reviewed later? |
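
The Recovery Plan prompt ("Can we restore from backup?") is easier to answer with a concrete inventory in hand. The sketch below flags datasets whose backup copies all sit in the primary region, which is the gap the scenario exposes at T+30. The inventory, dataset names, and the function `datasets_without_offregion_copy` are hypothetical; a real check would read this data from your own backup tooling.

```python
from collections import defaultdict

# Primary region from the scenario briefing.
PRIMARY_REGION = "us-east-1"

# Hypothetical inventory: (dataset name, region holding a backup copy).
BACKUP_COPIES = [
    ("customer-db", "us-east-1"),
    ("customer-db", "us-west-2"),
    ("file-uploads", "us-east-1"),
    ("audit-logs", "us-east-1"),
]


def datasets_without_offregion_copy(copies, primary_region):
    """Return sorted dataset names that have no backup outside primary_region."""
    regions_by_dataset = defaultdict(set)
    for dataset, region in copies:
        regions_by_dataset[dataset].add(region)
    return sorted(
        dataset
        for dataset, regions in regions_by_dataset.items()
        if regions <= {primary_region}  # every copy lives in the primary region
    )


at_risk = datasets_without_offregion_copy(BACKUP_COPIES, PRIMARY_REGION)
print("No off-region backup:", at_risk)
```

Any dataset in the resulting list cannot be restored while the primary region is unavailable, so it is a natural candidate for an after-action item.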
Key Documents to Reference
After-Action Review (Facilitator Use)
| Question | Notes |
|---|---|
| What went well? | E.g., “Quick detection via monitoring.” |
| What needs improvement? | E.g., “Confusion over who owns customer messaging.” |
| What were the biggest delays or gaps? | E.g., “Backups were not regionally redundant.” |
| Which documents or procedures need updating? | E.g., “IR Plan did not include Slack fallback channels.” |
| Who owns each follow-up action? | List action items and owners. |