Site Reliability Engineering (SRE)
Keep the internet running: reliability, observability, and on-call excellence
What you'll be able to do
- Define and track SLIs, SLOs, and error budgets
- Instrument systems with metrics, logs, and traces
- Design for reliability and run incident response
- Automate toil and operate at scale
Before you start
- Comfort with Linux and the command line
- Basic scripting (Bash or Python)
- Experience running or deploying an application
Phase 1 · SRE Foundations
Linux Deep Dive for SRE
Process internals, memory model, file descriptors, networking stack, kernel tuning, and performance analysis with perf, strace, and eBPF.
- The Linux Command Line (free book) [doc, free]
- Brendan Gregg: Systems Performance (book) [course, paid]
- Linux Performance Analysis in 60 Seconds (Netflix blog) [article, free]
- Diagnose CPU bottleneck with top/perf
- Trace system calls with strace
- Analyse memory leak with valgrind / /proc
- Network tuning: sysctl TCP parameters
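For the /proc half of the memory-leak exercise, a minimal sketch: it reads the `VmRSS` field from `/proc/<pid>/status` and applies a crude growth heuristic to a series of samples. The function names and the 1024 kB threshold are illustrative assumptions, and steadily growing RSS is only a hint of a leak (caches and allocator behaviour can look the same), so treat a positive result as a reason to reach for valgrind, not as a diagnosis.

```python
import re
from typing import Optional, Sequence

# /proc/<pid>/status reports resident memory as a line like "VmRSS:    12345 kB".
_VMRSS = re.compile(r"^VmRSS:\s+(\d+)\s+kB$", re.MULTILINE)


def parse_vm_rss_kb(status_text: str) -> Optional[int]:
    """Extract VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    match = _VMRSS.search(status_text)
    return int(match.group(1)) if match else None


def read_rss_kb(pid: int) -> Optional[int]:
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status", encoding="ascii", errors="replace") as fh:
        return parse_vm_rss_kb(fh.read())


def looks_like_leak(samples_kb: Sequence[int], min_growth_kb: int = 1024) -> bool:
    """Heuristic: RSS never drops between samples and grows by min_growth_kb overall."""
    if len(samples_kb) < 3:
        return False
    non_decreasing = all(b >= a for a, b in zip(samples_kb, samples_kb[1:]))
    return non_decreasing and samples_kb[-1] - samples_kb[0] >= min_growth_kb
```

Sampling `read_rss_kb(pid)` every few seconds while the workload runs and passing the list to `looks_like_leak` is one way to decide whether a deeper investigation is worthwhile.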
SRE Principles: Google SRE Book
SLIs, SLOs, SLAs, error budgets, toil elimination, and the SRE vs DevOps mental model. Chapter-by-chapter reading with applied exercises.
- Google SRE Book (free online) [doc, free]
- Google SRE Workbook (free online) [doc, free]
- School of SRE: LinkedIn Engineering (free) [course, free]
- Write SLOs for a hypothetical e-commerce service
- Calculate error budget burn rate
- Identify toil in a daily workflow and automate it
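The burn-rate exercise comes down to a little arithmetic, sketched below. The error budget is one minus the SLO target, and the burn rate is the observed error ratio divided by that budget. The function names and the 30-day default window are assumptions chosen for illustration, not something the roadmap prescribes.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    1.0 means the budget is being spent exactly as fast as the SLO allows;
    anything above 1.0 exhausts it before the SLO window ends.
    """
    if total <= 0:
        return 0.0
    return (failed / total) / error_budget(slo_target)


def hours_until_exhausted(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone, assuming the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_hours / rate
```

For example, 50 failures in 10,000 requests against a 99.9% SLO is a burn rate of roughly 5, which would use up a 30-day budget in about 144 hours.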
Python & Go for SRE Automation
Scripting, tooling, and automation: Python for ops scripts, Go for reliable CLI tools and services. Focus on practical SRE use cases.
- Automate the Boring Stuff with Python [doc, free]
- Go Tour (free official) [course, free]
- Python for DevOps (O'Reilly) [course, paid]
- Python: health-check script that pages on failure
- Go: build a simple HTTP load-test tool
- Automate runbook steps as a script
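One possible shape for the health-check exercise, using only the standard library. The `page` callback is a hypothetical stand-in for whatever paging integration you use (the roadmap later mentions PagerDuty, but no real paging API is called here), and the retry count and delay are arbitrary defaults.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def check_health(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # HTTP errors, refused connections, timeouts, and malformed URLs all count as unhealthy.
        return False


def run_checks(
    url: str,
    page: Callable[[str], None],
    probe: Callable[[str], bool] = check_health,
    attempts: int = 3,
    delay: float = 1.0,
) -> bool:
    """Probe up to `attempts` times and call `page` only if every attempt fails."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)  # brief pause so one blip does not page anyone
    page(f"health check failed {attempts} times for {url}")
    return False
```

Retrying before paging is a deliberate choice: a single failed probe is often noise, and paging on it trains people to ignore alerts. Passing `probe` and `page` as parameters keeps the script testable without a network or a pager.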
Phase 2 · Observability Stack
Metrics: Prometheus & Grafana
PromQL, recording rules, alerting rules, Grafana dashboards, and the RED and USE methods for service health.
- Prometheus Official Docs [doc, free]
- Grafana: Getting Started [doc, free]
- Robust Perception: PromQL for Humans [article, free]
- Instrument a Node/Python service with client library
- PromQL: p99 latency, error rate, saturation
- Grafana: RED dashboard (Rate, Errors, Duration)
- Alertmanager: page on SLO breach
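Before writing the PromQL, it can help to see what the three RED numbers mean on raw data. The sketch below computes them from an in-memory list of requests; the `Request` type and `red_summary` function are hypothetical, and the percentile uses the nearest-rank method, which differs from how Prometheus estimates quantiles from histogram buckets.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 1]."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(math.ceil(q * len(ordered)), 1)
    return ordered[rank - 1]


def red_summary(requests: Sequence[Request], window_s: float) -> dict:
    """Rate, Errors, Duration for the requests observed in one time window."""
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    p99: Optional[float] = None
    if total:
        p99 = percentile([r.duration_s for r in requests], 0.99)
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p99_duration_s": p99,
    }
```

In production these values would come from counters and histograms scraped by Prometheus rather than from a list held in memory; the sketch only pins down the definitions.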
Logging: ELK / Loki + Structured Logs
Structured JSON logging, log aggregation with Loki or Elasticsearch, Logstash pipelines, Kibana/Grafana queries, and log-based alerting.
- Elastic: Getting Started with ELK Stack [doc, free]
- Grafana Loki Docs [doc, free]
- Structured log format: trace_id, service, level
- LogQL: find all 5xx errors in last 1h
- Correlation: trace a request across 3 services
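A minimal sketch of the structured log format from the first exercise, built on the standard `logging` module. The field names `trace_id`, `service`, and `level` come from the task list; `ts` and `message`, and the `JsonFormatter` and `build_logger` names, are assumptions added for illustration.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            # trace_id is supplied per call via `extra`; None when the caller omits it.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def build_logger(service: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines for the given service name."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate plain-text lines from the root logger
    return logger
```

Usage: `build_logger("checkout").error("upstream returned %d", 502, extra={"trace_id": "abc123"})`. Emitting the same `trace_id` from every service a request touches is what makes the cross-service correlation exercise possible.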
Distributed Tracing: OpenTelemetry & Jaeger
Spans, traces, context propagation, sampling strategies, and correlating traces with metrics and logs (the three pillars).
- OpenTelemetry Docs [doc, free]
- Jaeger Tracing Docs [doc, free]
- Instrument a service with OTel SDK
- View trace in Jaeger: identify slow span
- Correlate a trace to a Prometheus spike
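Context propagation is easier to reason about once you have seen the header that carries it. The sketch below parses and generates a W3C `traceparent` header (version `00`) by hand; in practice the OpenTelemetry SDK does this for you, and the `SpanContext`, `parse_traceparent`, and `child_headers` names here are hypothetical helpers, not part of any SDK.

```python
import re
import secrets
from typing import NamedTuple, Optional

# traceparent: version "00", 32-hex trace id, 16-hex parent span id, 2-hex flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse an incoming traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero ids are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_headers(parent: Optional[SpanContext]) -> dict:
    """Headers for an outgoing call: keep the trace id, mint a fresh span id."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else True  # assumption: sample new traces
    span_id = secrets.token_hex(8)
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

The trace id staying constant across hops while each hop gets its own span id is what lets Jaeger stitch the spans into a single trace and show which one was slow.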
Phase 3 · Reliability Engineering
Incident Management & On-Call
Incident response lifecycle, postmortem culture, on-call best practices, PagerDuty setup, runbooks, and blameless retrospectives.
- PagerDuty Incident Response Docs (free) [doc, free]
- Google SRE Book: Chapter 14 - Managing Incidents [doc, free]
- Increment: On-Call Issue [article, free]
- Write a postmortem for a real or simulated incident
- Create a runbook for top-3 alert types
- Configure PagerDuty escalation policy
Chaos Engineering & Resilience Testing
Game days, fault injection, Chaos Monkey, Gremlin, blast radius limiting, and recovery testing.
- Chaos Engineering: O'Reilly Book (free preview) [doc, free]
- Netflix TechBlog: Chaos Monkey [article, free]
- Chaos Toolkit (open source) [repo, free]
- Run a game day: kill a service instance
- Verify graceful degradation under load
- Chaos experiment: latency injection on a dependency
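For the latency-injection experiment, a small in-process sketch: a decorator that delays a fraction of calls to a dependency. It is not how Chaos Monkey, Gremlin, or Chaos Toolkit work internally; `inject_latency` and its parameters are hypothetical, and `probability` acts as a simple blast-radius limit.

```python
import functools
import random
import time
from typing import Callable


def inject_latency(
    delay_s: float,
    probability: float,
    rng: Callable[[], float] = random.random,
    sleep: Callable[[float], None] = time.sleep,
):
    """Decorator that adds `delay_s` seconds to roughly `probability` of calls."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng() < probability:
                sleep(delay_s)  # simulate a slow dependency
            return func(*args, **kwargs)

        return wrapper

    return decorator
```

Wrapping a client call with `@inject_latency(delay_s=2.0, probability=0.05)` in a staging environment is one way to check that timeouts and fallbacks actually fire. Start with a low probability and raise it only after the system's behaviour at the smaller blast radius is understood.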
Kubernetes for SRE
Pod autoscaling (HPA/VPA/KEDA), disruption budgets, priority classes, resource quotas, node affinity, and SRE-focused Kubernetes patterns.
- Kubernetes Docs: Production Best Practices [doc, free]
- Learnk8s: Production Kubernetes Checklist [article, free]
- Mumshad Mannambeth: CKA (Udemy) [course, paid]
- Configure HPA on a deployment
- PodDisruptionBudget: survive a node drain
- Resource requests/limits: avoid OOMKill
- Pass CKA exam (target)
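When configuring an HPA it helps to know the core scaling rule: the desired replica count is the current count scaled by the ratio of the observed metric to its target, rounded up, with no change when that ratio is within a tolerance of 1.0 (0.1 by default). The sketch below covers only that rule; the real controller also accounts for stabilization windows, missing metrics, and pods that are not yet ready. The function name and the default replica bounds are assumptions.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Approximate the HPA rule: ceil(current_replicas * current / target)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the bounds configured on the autoscaler.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% CPU against a 60% target scale to 6, while 63% against the same target stays at 4 because the ratio is within tolerance.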
Capstone: SRE for a Production Service
Apply everything to a real service: define SLOs, instrument metrics + logs + traces, build dashboards, configure alerts, write runbooks, and run a chaos game day.
- SLO document approved by stakeholders
- Dashboards covering RED + USE methods
- Alerting with no false positives for 2 weeks
- Game day executed and postmortem written
Frequently asked
Is the Site Reliability Engineering (SRE) roadmap free?
Yes. The Site Reliability Engineering (SRE) roadmap is free to follow on Commit, and most of the curated resources are free; the few paid books and courses are labelled as such. You can track your progress, keep a daily streak, and earn a shareable certificate at no cost. There is no paywall.
How long does the Site Reliability Engineering (SRE) roadmap take to complete?
About 150 hours of focused study across 10 courses and 3 phases. At roughly one hour a day that is about 5 months; studying more each day shortens it.
Do I get a certificate for finishing the Site Reliability Engineering (SRE) roadmap?
Yes. When you complete the roadmap on Commit you receive a verifiable certificate of completion that you can add to LinkedIn and your public Commit profile as proof of what you finished.
Make it stick
Copy this roadmap into Commit and turn it into a tracked program with a streak graph, study logging, and a shareable certificate when you finish. Free forever.