Site Reliability Engineering (SRE)

Keep the internet running: reliability, observability, and on-call excellence

150h total · 10 courses · 3 stages


What you'll be able to do

Define and track SLIs, SLOs, and error budgets
Instrument systems with metrics, logs, and traces
Design for reliability and run incident response
Automate toil and operate at scale

Before you start

Comfort with Linux and the command line
Basic scripting (Bash or Python)
Experience running or deploying an application

Phase 1 · SRE Foundations

Linux Deep Dive for SRE

beginner · 18h

Process internals, memory model, file descriptors, networking stack, kernel tuning, and performance analysis with perf, strace, and eBPF.

The Linux Command Line (free book) · doc · free
Brendan Gregg: Systems Performance (book) · doc · paid
Linux Performance Analysis in 60 Seconds (Netflix blog) · article · free

Diagnose a CPU bottleneck with top and perf
Trace system calls with strace
Analyse a memory leak with valgrind and /proc
Tune the network stack: sysctl TCP parameters

SRE Principles: Google SRE Book

beginner · 14h

SLIs, SLOs, SLAs, error budgets, toil elimination, and the SRE vs DevOps mental model. Chapter-by-chapter reading with applied exercises.

Google SRE Book (free online) · doc · free
Google SRE Workbook (free online) · doc · free
School of SRE: LinkedIn Engineering (free) · course · free

Write SLOs for a hypothetical e-commerce service
Calculate error budget burn rate
Identify toil in a daily workflow and automate it
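
The burn-rate exercise above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function names and the 30-day default window are assumptions made for the example. Burn rate is taken here as the observed error ratio divided by the error budget (1 minus the SLO target).

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget.

    A value of 1.0 spends the budget exactly as fast as the SLO allows;
    anything above 1.0 exhausts it before the SLO window ends.
    """
    if total <= 0:
        return 0.0  # no traffic, so no budget was spent
    return (failed / total) / error_budget(slo_target)


def hours_to_exhaustion(rate: float, window_hours: float = 30 * 24) -> float:
    """Hours until a full budget is gone at a constant burn rate.

    The 30-day window is an assumed default for this sketch.
    """
    if rate <= 0:
        return float("inf")
    return window_hours / rate
```

For example, 50 failures out of 10,000 requests against a 99.9% SLO gives a burn rate of roughly 5, which would use up a 30-day budget in about 144 hours.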

Python & Go for SRE Automation

beginner · 16h

Scripting, tooling, and automation: Python for ops scripts, Go for reliable CLI tools and services. Focus on practical SRE use cases.

Automate the Boring Stuff with Python · doc · free
Go Tour (free official) · course · free
Python for DevOps (O'Reilly) · course · paid

Python: health-check script that pages on failure
Go: build a simple HTTP load-test tool
Automate runbook steps as a script
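
One possible shape for the health-check exercise, using only the Python standard library. The `page` callback is a stand-in for a real paging integration and is hypothetical, as are the function names, the retry count, and the example URL; a real script would call your paging provider's API instead of `print`.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def check_health(url: str, timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, TimeoutError, OSError):
        # Covers DNS failures, refused connections, timeouts, and HTTP 4xx/5xx.
        return False


def run_checks(
    url: str,
    attempts: int = 3,
    delay_seconds: float = 0.0,
    probe: Callable[[str], bool] = check_health,
    page: Callable[[str], None] = print,  # hypothetical pager hook
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Probe up to `attempts` times; page once if every attempt fails."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    page(f"ALERT: {url} failed {attempts} consecutive health checks")
    return False
```

Requiring several consecutive failures before paging is a deliberate choice in this sketch: it trades a slightly slower alert for fewer pages caused by a single dropped request.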

Phase 2 · Observability Stack

Metrics: Prometheus & Grafana

intermediate · 16h

PromQL, recording rules, alerting rules, Grafana dashboards, and the RED and USE methods for service health.

Prometheus Official Docs · doc · free
Grafana: Getting Started · doc · free
Robust Perception: PromQL for Humans · article · free

Instrument a Node/Python service with a Prometheus client library
PromQL: p99 latency, error rate, saturation
Grafana: RED dashboard (Rate, Errors, Duration)
Alertmanager: page on SLO breach
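
To make the RED signals concrete before writing PromQL, the sketch below computes rate, error ratio, and p99 duration from raw request records in plain Python. It is an illustration only: the `Request` type and function names are invented for the example, and Prometheus estimates quantiles from histogram buckets rather than from raw samples, so its numbers will generally differ from an exact nearest-rank percentile.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Request:
    duration_seconds: float
    status: int  # HTTP status code


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, float]:
    """Rate, Errors, Duration for one observation window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_seconds for r in requests]
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "p99_seconds": percentile(durations, 99) if total else 0.0,
    }
```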

Logging: ELK / Loki + Structured Logs

intermediate · 12h

Structured JSON logging, log aggregation with Loki or Elasticsearch, Logstash pipelines, Kibana/Grafana queries, and log-based alerting.

Elastic: Getting Started with ELK Stack · doc · free
Grafana Loki Docs · doc · free

Structured log format: trace_id, service, level
LogQL: find all 5xx errors in the last hour
Correlation: trace a request across 3 services
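
A structured log line with `trace_id`, `service`, and `level` fields could be produced with Python's standard `logging` module along these lines. The `JsonFormatter` class, the `build_logger` helper, and the extra `timestamp` and `message` field names are assumptions made for this sketch, not a required schema.

```python
import json
import logging
from typing import TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "service": self.service,
            # trace_id is supplied per call through `extra`; None if absent.
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(service: str, stream: TextIO) -> logging.Logger:
    """Create a logger that writes JSON lines for `service` to `stream`."""
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate lines from the root logger
    return logger
```

A call such as `logger.error("upstream returned 503", extra={"trace_id": "4bf92f35"})` then emits a single JSON line that a log aggregator can filter by field instead of by regular expression.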

Distributed Tracing: OpenTelemetry & Jaeger

intermediate · 10h

Spans, traces, context propagation, sampling strategies, and correlating traces with metrics and logs (the three pillars).

OpenTelemetry Docs · doc · free
Jaeger Tracing Docs · doc · free

Instrument a service with OTel SDK
View trace in Jaeger: identify slow span
Correlate a trace to a Prometheus spike
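
Context propagation is easier to reason about once you have seen the header that carries it. The sketch below hand-parses and regenerates a W3C Trace Context `traceparent` header (version `00`) using only the standard library. It is for illustration: in practice the OpenTelemetry SDK's propagators handle this for you, and the `SpanContext` class and function names here are invented for the example.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# version - 32 hex trace-id - 16 hex parent-id - 2 hex flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


@dataclass(frozen=True)
class SpanContext:
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a version-00 traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    trace_id, span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"
```

The trace ID stays constant across every hop while each service mints its own span ID, which is what lets a backend such as Jaeger stitch the spans of one request into a single trace.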

Phase 3 · Reliability Engineering

Incident Management & On-Call

intermediate · 12h

Incident response lifecycle, postmortem culture, on-call best practices, PagerDuty setup, runbooks, and blameless retrospectives.

PagerDuty Incident Response Docs (free) · doc · free
Google SRE Book: Chapter 14 - Managing Incidents · doc · free
Increment: On-Call Issue · article · free

Write a postmortem for a real or simulated incident
Create a runbook for top-3 alert types
Configure PagerDuty escalation policy

Chaos Engineering & Resilience Testing

advanced · 12h

Game days, fault injection, Chaos Monkey, Gremlin, blast radius limiting, and recovery testing.

Chaos Engineering: O'Reilly Book (free preview) · doc · free
Netflix TechBlog: Chaos Monkey · article · free
Chaos Toolkit (open source) · repo · free

Run a game day: kill a service instance
Verify graceful degradation under load
Chaos experiment: latency injection on a dependency
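
For the latency-injection experiment, a small in-process wrapper is often enough to get started before reaching for a dedicated tool. The sketch below is one way to do it in Python; `with_injected_latency` and its parameters are hypothetical names for this example and do not correspond to the API of Chaos Toolkit or Gremlin. The `probability` argument acts as a simple blast-radius limit.

```python
import random
import time
from typing import Any, Callable, TypeVar

T = TypeVar("T")


def with_injected_latency(
    call: Callable[..., T],
    delay_seconds: float,
    probability: float,
    rng: Callable[[], float] = random.random,
    sleep: Callable[[float], None] = time.sleep,
) -> Callable[..., T]:
    """Wrap `call` so that a fraction of invocations are delayed first.

    `probability` is the share of calls affected (0.0 to 1.0). `rng` and
    `sleep` are injectable so experiments and tests stay deterministic.
    """
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0.0 and 1.0")

    def wrapper(*args: Any, **kwargs: Any) -> T:
        if rng() < probability:
            sleep(delay_seconds)  # simulate a slow dependency
        return call(*args, **kwargs)

    return wrapper
```

Wrapping the client function for one dependency with, say, a 2-second delay on 10% of calls lets you observe whether timeouts, retries, and fallbacks behave as the runbook claims.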

Kubernetes for SRE

advanced · 16h

Pod autoscaling (HPA/VPA/KEDA), disruption budgets, priority classes, resource quotas, node affinity, and SRE-focused Kubernetes patterns.

Kubernetes Docs: Production Best Practices · doc · free
Learnk8s: Production Kubernetes Checklist · article · free
Mumshad Mannambeth: CKA (Udemy) · course · paid

Configure HPA on a deployment
PodDisruptionBudget: survive a node drain
Resource requests/limits: avoid OOMKill
Pass CKA exam (target)
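
Before configuring an HPA, it helps to work through the scaling arithmetic by hand. The Kubernetes documentation gives the core rule as desired replicas = ceil(current replicas × current metric / target metric), with no change when the ratio is within a tolerance of 1.0 (0.1 by default). The Python sketch below models only that rule plus min/max clamping; the function name and default bounds are assumptions for the example, and the real controller also handles stabilization windows, unready pods, and missing metrics.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified model of the HPA replica calculation."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas  # close enough to target: do nothing
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas averaging 90% CPU against a 60% target, the ratio is 1.5 and the result is 6 replicas; at 63% the ratio is within tolerance and the count stays at 4.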

Capstone: SRE for a Production Service

advanced · 24h

Apply everything to a real service: define SLOs, instrument metrics + logs + traces, build dashboards, configure alerts, write runbooks, and run a chaos game day.

Google SRE Workbook: Implementing SLOs · doc · free
awesome-sre: Curated SRE resources (GitHub) · repo · free

SLO document approved by stakeholders
Dashboards covering RED + USE methods
Alerting with no false positives for 2 weeks
Game day executed and postmortem written

Frequently asked

Is the Site Reliability Engineering (SRE) roadmap free?

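Yes. The Site Reliability Engineering (SRE) roadmap is free to follow on Commit: you can track your progress, keep a daily streak, and earn a shareable certificate at no cost, with no paywall. Most curated resources are free as well; the few marked paid (Systems Performance, Python for DevOps, and the Udemy CKA course) are third-party purchases.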

How long does the Site Reliability Engineering (SRE) roadmap take to complete?

About 150 hours of focused study across 10 courses and 3 stages. At roughly one hour a day that is about 5 months; you can move faster by studying more each day.

Do I get a certificate for finishing the Site Reliability Engineering (SRE) roadmap?

Yes. When you complete the roadmap on Commit you receive a verifiable certificate of completion that you can add to LinkedIn and your public Commit profile as proof of what you finished.

Make it stick

Copy this roadmap into Commit and turn it into a tracked program with a streak graph, study logging, and a shareable certificate when you finish. Free forever.

