SREcon21

Videos

1 — SREcon21 - Games We Play to Improve Incident Response Effectiveness
2 — SREcon21 - Taking Control of Metrics Growth and Cardinality: Tips for Maximizing Your Observability
3 — SREcon21 - Panel: Unsolved Problems in SRE
4 — SREcon21 - SRE "Power Words"—the Lexicon of SRE as an Industry
5 — SREcon21 - Nine Questions to Build Great Infrastructure Automation Pipelines
6 — SREcon21 - SRE for ML: The First 10 Years and the Next 10
7 — SREcon21 - Practical TLS Advice for Large Infrastructure
8 — SREcon21 - A Political Scientist's View on Site Reliability
9 — SREcon21 - Designing an Autonomous Workbench for Data Science on AWS
10 — SREcon21 - What To Do When SRE is Just a New Job Title?
11 — SREcon21 - Demystifying Machine Learning in Production: Reasoning about a Large-Scale ML Platform
12 — SREcon21 - DevOps Ten Years After: Review of a Failure with John Allspaw and Paul Hammond
13 — SREcon21 - How We Built Out Our SRE Department to Support over 100 Million Users for the World's 3rd
14 — SREcon21 - Grand National 2021: Managing Extreme Online Demand at William Hill
15 — SREcon21 - From 15,000 Database Connections to under 100—A Tech Debt Tale
16 — SREcon21 - Watching the Watchers: Generating Absent Alerts for Prometheus
17 — SREcon21 - Take Me Down to the Paradise City Where the Metric Is Green and Traces Are Pretty
18 — SREcon21 - Don't Follow Leaders or "All Models Are Wrong (and So Am I)"
19 — SREcon21 - Panel: Observability
20 — SREcon21 - When Linux Memory Accounting Goes Wrong
21 — SREcon21 - Let the Chaos Begin—SRE Chaos Engineering Meets Cybersecurity
22 — SREcon21 - 10 Lessons Learned in 10 Years of SRE
23 — SREcon21 - Elephant in the Blameless War Room—Accountability
24 — SREcon21 - Rethinking the SDLC
25 — SREcon21 - How LinkedIn Performs Maintenances at Scale
26 — SREcon21 - Need for SPEED: Site Performance Efficiency, Evaluation and Decision
27 — SREcon21 - SLX: An Extended SLO Framework to Expedite Incident Recovery
28 — SREcon21 - A Principled Approach to Monitoring Streaming Data Infrastructure at Scale
29 — SREcon21 - Let's Bring System Dynamics Back to CS!
30 — SREcon21 - Capacity Management for Fun & Profit
31 — SREcon21 - MySQL and InnoDB Performance for the Rest of Us
32 — SREcon21 - Cache Strategies with Best Practices
33 — SREcon21 - Optimizing Cost and Performance with arm64
34 — SREcon21 - Ceci N'est Pas un CPU Load
35 — SREcon21 - Horizontal Data Freshness Monitoring in Complex Pipelines
36 — SREcon21 - Microservices above the Cloud—Designing the International Space Station for Reliability
37 — SREcon21 - What's the Cost of a Millisecond?
38 — SREcon21 - You've Lost That Process Feeling: Some Lessons from Resilience Engineering
39 — SREcon21 - Scaling for a Pandemic: How We Keep Ahead of Demand for Google Meet during COVID-19
40 — SREcon21 - What If the Promise of AIOps Was True?
41 — SREcon21 - Model Monitoring: Detecting and Analyzing Data Issues
42 — SREcon21 - Leveraging ML to Detect Application HotSpots [@scale, of Course!]
43 — SREcon21 - When Systems Flatline—Enhancing Incident Response with Learnings from the Medical Field
44 — SREcon21 - Panel: OpML
45 — SREcon21 - Evolution of Incident Management at Slack
46 — SREcon21 - Hacking ML into Your Organization
47 — SREcon21 - Automating Performance Tuning with Machine Learning
48 — SREcon21 - Panel: Engineering Onboarding
49 — SREcon21 - Of Mice & Elephants
50 — SREcon21 - Learning More from Complex Systems
51 — SREcon21 - Nothing to Recommend It: An Interactive ML Outage Fable
52 — SREcon21 - User Uptime in Practice
53 — SREcon21 - Improving Observability in Your Observability: Simple Tips for SREs
54 — SREcon21 - Lessons Learned Using the Operator Pattern to Build a Kubernetes Platform
55 — SREcon21 - Hard Problems We Handle in Incidents but Aren't Recognized
56 — SREcon21 - Sparking Joy for Engineers with Observability
57 — SREcon21 - Reliable Data Processing with Minimal Toil
58 — SREcon21 - Experiments for SRE
59 — SREcon21 - How Our SREs Safeguard Nanosecond Performance—at Scale—in an Environment Built to Fail
60 — SREcon21 - Beyond Goldilocks Reliability
61 — SREcon21 - The Origins of USAA's Postmortem of the Week
62 — SREcon21 - A Retrospective: Five Years Later, Was Chaos Engineering Worth It?
63 — SREcon21 - Cache for Cash—Speeding Up Production with Kafka and MySQL binlog
64 — SREcon21 - Food for Thought: What Restaurants Can Teach Us about Reliability
65 — SREcon21 - Latency Distributions and Micro-Benchmarking to Identify and Characterize Kernel Hotspots
66 — SREcon21 - Trustworthy Graceful Degradation: Fault Tolerance across Service Boundaries
67 — SREcon21 - Spike Detection in Alert Correlation at LinkedIn