cotalks.dev

Plan for Unplanned Work: Game Days with Chaos Engineering by Daniel Afonso

Channel: Devoxx

Summary

Daniel Afonso explains how teams can **plan for unplanned work** by using **chaos engineering** and **game days** to practice incident response before a real outage happens. The talk defines chaos engineering as a disciplined way to test system behavior against a hypothesis, not as breaking things at random. It walks through the core loop of steady state, hypothesis, experiment, verify, and improve, then shows how game days add the human side: incident commanders, scribes, escalation, communications, troubleshooting, and coordinated recovery. The talk emphasizes starting small, minimizing blast radius, and moving to production only when the organization is ready. It advises against running experiments when error budgets are at risk, during reorganizations, or right before critical business periods. A major theme is turning findings into action: write post-incident reviews, share them openly across the company, turn fixes into backlog items, and revisit SLOs when they prove unrealistic. The payoff is better resilience, better onboarding, and more confidence when real incidents occur.

Key Takeaways

  • Chaos engineering is about running strategic experiments against a hypothesis to build confidence in system resilience.
  • Game days extend chaos engineering by practicing real incident response roles, coordination, communication, and recovery.
  • Start with a clear steady state, measurable signals, and a defined rollback or stop plan before experimenting.
  • Use real-world failure modes such as crashes, latency, network loss, and failovers to keep experiments relevant.
  • Avoid game days when error budgets are already strained, during reorganizations, or right before peak business periods.
  • Share post-incident reviews and turn findings into backlog items so lessons become organizational knowledge.

Sections

What chaos engineering is and why it matters

The talk frames chaos engineering as a way to learn about system behavior before users discover failures. Instead of waiting for incidents, teams run controlled experiments to validate resilience, expose weak points, and build confidence in how services behave under stress. The goal is not random destruction; it is to improve knowledge and system reliability.

Core principles: steady state, hypothesis, experiment, verify, improve

A useful chaos engineering workflow starts by defining the system's steady state: what normal looks like and which metrics prove that the service is healthy. From there, teams write a hypothesis about how the system should behave under a failure scenario, inject a real-world variable, verify whether the steady state changed, and then improve based on the result. The added fifth step, improve, means the experiment should lead to action, not just observation.
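
The five steps can be sketched as a small loop. This is only an illustration under stated assumptions, not tooling shown in the talk: the `Experiment` class, the `run_experiment` function, and the choice of an error rate as the steady-state metric are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Experiment:
    hypothesis: str                 # what we expect to stay true during the failure
    measure: Callable[[], float]    # reads the steady-state metric (assumed: an error rate)
    threshold: float                # highest value still counted as "steady"
    inject: Callable[[], None]      # introduces the real-world failure
    rollback: Callable[[], None]    # stop plan: removes the failure again


def run_experiment(exp: Experiment) -> dict:
    # 1. Steady state: refuse to start if the system is already unhealthy.
    baseline = exp.measure()
    if baseline > exp.threshold:
        return {"ran": False, "baseline": baseline}

    # 2-3. Hypothesis and experiment: inject the failure, and always roll back.
    exp.inject()
    try:
        observed = exp.measure()
    finally:
        exp.rollback()

    # 4. Verify: did the steady state survive the failure?
    held = observed <= exp.threshold

    # 5. Improve: a refuted hypothesis becomes a follow-up item.
    follow_up: Optional[str] = None if held else f"Hypothesis refuted: {exp.hypothesis}"
    return {"ran": True, "baseline": baseline, "observed": observed,
            "hypothesis_held": held, "follow_up": follow_up}
```

Checking the baseline first mirrors the point that a steady state must be defined and healthy before anything is injected, and the `finally` clause keeps the rollback from being skipped if measurement fails.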

Designing useful experiments

Experiments should reflect realistic conditions that users actually encounter, such as server crashes, network failures, increased latency, or storage issues. The talk stresses identifying the right hypothesis from past incidents, postmortems, and service ownership knowledge. It also recommends minimizing blast radius by starting in staging or QA when teams are new to the practice, then moving toward production only when the organization is ready.

Game days as practiced incident response

Game days are presented as a structured way to run chaos experiments while treating the event like a real incident. Teams bring in an incident commander, subject matter experts, scribes, and communications roles to rehearse the full response process. The practice helps test alerting, paging, incident room coordination, access to dashboards and repositories, and the ability to execute operational fixes such as restarts, failovers, and scale-ups.

What to test during a game day

The speaker highlights practical checks: whether alerts fire, whether the correct people are paged, whether responders know where to go, whether documentation and runbooks are current, and whether troubleshooting access exists. Game days are also useful for onboarding because they reveal who can do what, which systems are connected, and where hidden silos or outdated instructions exist.

Post-incident review and organizational learning

After the experiment or incident simulation, teams should run a post-incident review (postmortem) to capture the timeline, recovery time, user impact, and blind spots. The talk strongly recommends two follow-ups: share the results openly across the organization so they do not remain tribal knowledge, and turn any code or process fixes into prioritized backlog work. This is also the point at which teams revisit SLOs that have proven too aggressive or unrealistic.
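
As a rough illustration of how a scribe's timeline could feed the review, the sketch below derives durations from timestamped entries. The event labels (`failure_injected`, `alert_fired`, `recovered`) and the `review_durations` helper are assumptions made for this example; the talk does not prescribe a format.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

Event = Tuple[datetime, str]  # (timestamp, label) as written down by the scribe


def review_durations(timeline: List[Event]) -> Dict[str, timedelta]:
    # Keep the first occurrence of each label, in chronological order.
    first: Dict[str, datetime] = {}
    for ts, label in sorted(timeline):
        first.setdefault(label, ts)

    start = first["failure_injected"]
    return {
        "time_to_detect": first["alert_fired"] - start,
        "time_to_recover": first["recovered"] - start,
    }
```

Numbers like these give the review something concrete to compare against the team's SLOs.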

When to run game days and when not to

Good times for game days are when SLOs and error budgets are safe, the organization is calm, or a new feature has been released and enough telemetry exists to define normal behavior. Bad times include periods where the error budget is already at risk, when there is a known production issue to fix directly, during reorganizations, or before peak business periods such as major retail events. The talk also argues that surprise game days should be rare and only used by teams that are already comfortable with the practice.
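
These go/no-go conditions can be written down as a simple pre-flight check. The sketch below is one possible encoding, not a rule from the talk: the `Conditions` fields, the 50% error budget floor, and the 14-day window before a peak period are illustrative assumptions that each team would set for itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Conditions:
    error_budget_remaining: float          # fraction of the budget left, 0.0 to 1.0
    known_production_issue: bool           # something that should be fixed directly
    reorg_in_progress: bool
    days_until_peak_period: Optional[int]  # None if no peak period is scheduled


def game_day_blockers(c: Conditions,
                      min_budget: float = 0.5,      # assumed threshold
                      freeze_days: int = 14) -> List[str]:  # assumed window
    """Return the reasons not to run a game day now; an empty list means go."""
    blockers = []
    if c.error_budget_remaining < min_budget:
        blockers.append("error budget at risk")
    if c.known_production_issue:
        blockers.append("known production issue should be fixed directly")
    if c.reorg_in_progress:
        blockers.append("reorganization in progress")
    if c.days_until_peak_period is not None and c.days_until_peak_period <= freeze_days:
        blockers.append("too close to a peak business period")
    return blockers
```

Returning the list of reasons, rather than a bare yes or no, makes the decision easy to record in the game day plan.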

How PagerDuty plans a game day

The planning process described starts with scheduling a date and inviting the relevant on-call stakeholders. The team then fills out a structured game day plan that records the target system, hypothesis, rollback path, attendance, customer impact, steady-state metrics, preparation steps, and attack plan. The session ends with a post-incident review. The broader message is to start small, document carefully, and expand the practice across teams once the value is proven.
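
A structured plan of this kind could be modeled as a small record that reports which sections are still empty. This is a hypothetical sketch: the field names follow the items listed above, but the `GameDayPlan` class and its `missing_sections` method are not PagerDuty's actual template.

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class GameDayPlan:
    target_system: str = ""
    hypothesis: str = ""
    rollback_path: str = ""
    customer_impact: str = ""
    attendees: List[str] = field(default_factory=list)
    steady_state_metrics: List[str] = field(default_factory=list)
    preparation_steps: List[str] = field(default_factory=list)
    attack_plan: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        # A section counts as missing while it is an empty string or empty list.
        return [f.name for f in fields(self) if not getattr(self, f.name)]


plan = GameDayPlan(target_system="checkout service",
                   hypothesis="Losing one replica does not raise the error rate")
print(plan.missing_sections())
```

Running the check before the scheduled date flags gaps such as an undefined rollback path while there is still time to fill them in.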

Keywords: chaos engineering, game days, incident response, pagerduty, incident commander, postmortem, post-incident review, slos, error budgets, observability, alerting, paging, microservices resilience, fault injection, production experiments, staging environment, rollback plan, runbooks, failover testing, latency injection, network failure simulation, netflix chaos monkey, chaos kong, operational readiness, site reliability engineering
