Data Engineering
Build the pipelines that make data useful at scale
What you'll be able to do
- Build batch and streaming data pipelines
- Model and warehouse data for analytics
- Orchestrate workflows with tools like Airflow
- Operate reliable, monitored data systems
Before you start
- Python fundamentals
- Basic SQL
- Comfort with the command line
Level 1 · Programming & SQL Mastery
Python for Data Engineering
Python beyond basics: file I/O, subprocess, requests, and writing production scripts.
- Automate the Boring Stuff with Python [doc, free]
- Kaggle: Python Course [course, free]
- Parse CSV & JSON from disk and APIs
- Context managers & error handling
- Write a file processing pipeline script
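The practice items above (parsing CSV and JSON, context managers, error handling, a file processing script) can be combined into one small exercise. What follows is a minimal sketch, not a reference solution: the file names, the `amount` column, and the function name `csv_to_json_lines` are all hypothetical and chosen only for illustration.

```python
import csv
import json
import tempfile
from pathlib import Path


def csv_to_json_lines(src: Path, dest: Path) -> int:
    """Convert a CSV file to JSON Lines, skipping rows that fail validation.

    Returns the number of rows written.
    """
    written = 0
    # Both files are opened with context managers, so they are closed
    # even if an unexpected exception is raised part-way through.
    with src.open(newline="", encoding="utf-8") as fin, \
            dest.open("w", encoding="utf-8") as fout:
        for row in csv.DictReader(fin):
            try:
                # Hypothetical rule: every row needs a numeric "amount".
                row["amount"] = float(row["amount"])
            except (KeyError, TypeError, ValueError):
                continue  # missing, empty, or non-numeric amount: skip row
            fout.write(json.dumps(row) + "\n")
            written += 1
    return written


def demo() -> list:
    """Run the conversion on a tiny sample file and return the parsed output."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "orders.csv"
        dest = Path(tmp) / "orders.jsonl"
        src.write_text("order_id,amount\n1,9.50\n2,oops\n3,12\n", encoding="utf-8")
        csv_to_json_lines(src, dest)
        lines = dest.read_text(encoding="utf-8").splitlines()
        return [json.loads(line) for line in lines]


if __name__ == "__main__":
    print(demo())
```

Skipping bad rows silently keeps the sketch short; a production script would more likely log them or write them to a separate reject file.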
SQL: Advanced Querying & Data Modelling
CTEs, window functions, dimensional modelling, and query optimisation.
- Mode SQL Tutorial [course, free]
- SQLBolt: Interactive SQL lessons [course, free]
- PostgreSQL Tutorial [doc, free]
- Window functions: ROW_NUMBER, LAG, LEAD
- CTEs & recursive queries
- Star schema design for a sales dataset
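To try a CTE together with ROW_NUMBER and LAG without installing a database server, Python's built-in sqlite3 module is enough. This is a sketch under assumptions: the `sales` table and its columns are made up for the example, and window functions need the bundled SQLite to be version 3.25 or newer, which is the case for most recent Python builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (sale_date TEXT, region TEXT, amount REAL);
INSERT INTO sales VALUES
  ('2024-01-01', 'north', 100),
  ('2024-01-02', 'north', 150),
  ('2024-01-03', 'north', 120),
  ('2024-01-01', 'south', 80),
  ('2024-01-02', 'south', 60);
""")

QUERY = """
WITH daily AS (                      -- CTE: one total per region and day
  SELECT region, sale_date, SUM(amount) AS total
  FROM sales
  GROUP BY region, sale_date
)
SELECT
  region,
  sale_date,
  total,
  -- number the days within each region
  ROW_NUMBER() OVER (PARTITION BY region ORDER BY sale_date) AS day_no,
  -- difference from the previous day in the same region (NULL on day 1)
  total - LAG(total) OVER (PARTITION BY region ORDER BY sale_date) AS delta
FROM daily
ORDER BY region, sale_date
"""

rows = conn.execute(QUERY).fetchall()

if __name__ == "__main__":
    for row in rows:
        print(row)
```

The same query should run largely unchanged on PostgreSQL; only the connection code would differ.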
dbt: Data Build Tool
Transform data in your warehouse with version-controlled, tested SQL models.
- dbt Fundamentals Course (free) [course, free]
- dbt Docs [doc, free]
- Staging → intermediate → mart model layers
- dbt tests: unique, not_null, relationships
- Generate dbt docs site
Level 2 · Data Pipeline Tools
Apache Airflow: Workflow Orchestration
DAGs, operators, sensors, XComs, and managing complex pipeline dependencies.
- ETL DAG: extract from API → load to Postgres
- Sensor that waits for a file to land
- TaskGroup & dynamic task mapping
Kafka: Event Streaming
Producers, consumers, topics, partitions, and stream processing with Kafka Streams.
- Confluent Developer: Kafka Tutorials (free) [course, free]
- Kafka Docs [doc, free]
- Producer & consumer in Python
- Topic partitioning & consumer groups
- Stream a real-time clickstream
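Before running a real broker, it can help to see why keys matter for partitioning and consumer groups. The sketch below is a pure-Python simulation, not Kafka client code: `partition_for` uses CRC32 as a stand-in hash (Kafka's own default partitioner uses a different hash function), and the round-robin `assign_partitions` is a simplification of what a group coordinator does. All names and the partition count are hypothetical.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical topic with three partitions


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a record key to a partition.

    CRC32 is a stand-in here: it is stable across runs, unlike Python's
    built-in hash() for strings. It is NOT the hash Kafka itself uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(consumers: list, num_partitions: int = NUM_PARTITIONS) -> dict:
    """Give each partition exactly one owner in the group, round-robin."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


# A tiny simulated clickstream: (key, action) pairs in arrival order.
events = [
    ("user-1", "view"),
    ("user-2", "view"),
    ("user-1", "click"),
    ("user-1", "purchase"),
]

# Append each event to the log of the partition its key maps to.
log = defaultdict(list)
for key, action in events:
    log[partition_for(key)].append((key, action))

if __name__ == "__main__":
    print(dict(log))
    print(assign_partitions(["consumer-a", "consumer-b"]))
```

Because every event with the same key lands in the same partition, and each partition has a single owner within a group, one user's events are read in the order they were produced.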
Apache Spark: Distributed Processing
PySpark DataFrames, SQL, UDFs, and processing large datasets at scale.
- Spark by Examples: PySpark Tutorial [article, free]
- Frank Kane: Taming Big Data with Spark (Udemy) [course, paid]
- Load & transform 1M-row CSV with PySpark
- Spark SQL join & aggregation
- Write partitioned Parquet to S3
Level 3 · Cloud Data Platforms & Capstone
BigQuery & Snowflake Data Warehousing
Cloud DWH architecture, cost management, partitioning, clustering, and BI integration.
- Load data from GCS to BigQuery
- Partition by date, cluster by user_id
- Connect Looker Studio for visualisation
Data Quality & Great Expectations
Validate, document, and profile your data with Great Expectations.
- Great Expectations Docs [doc, free]
- Expectation suite for a pipeline output
- Integrate GX into Airflow DAG
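The idea behind an expectation suite can be tried in plain Python before learning the library. The sketch below deliberately does not use the Great Expectations API: `expect_not_null`, `expect_unique`, and `validate` are hypothetical helpers, and the column names are invented, so treat it as an illustration of the concept only.

```python
def expect_not_null(rows: list, column: str) -> dict:
    """Check that no row has a missing value in the given column."""
    failed = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {
        "expectation": f"not_null({column})",
        "success": not failed,
        "failed_rows": failed,
    }


def expect_unique(rows: list, column: str) -> dict:
    """Check that no value in the given column repeats."""
    seen = set()
    failed = []
    for i, row in enumerate(rows):
        value = row.get(column)
        if value in seen:
            failed.append(i)  # index of the repeated occurrence
        seen.add(value)
    return {
        "expectation": f"unique({column})",
        "success": not failed,
        "failed_rows": failed,
    }


def validate(rows: list) -> list:
    """A hypothetical 'suite': every check that pipeline output must pass."""
    return [
        expect_not_null(rows, "user_id"),
        expect_unique(rows, "order_id"),
    ]


sample_rows = [
    {"order_id": 1, "user_id": "a"},
    {"order_id": 2, "user_id": None},   # fails not_null(user_id)
    {"order_id": 1, "user_id": "c"},    # fails unique(order_id)
]
results = validate(sample_rows)

if __name__ == "__main__":
    for result in results:
        print(result)
```

In an orchestrated pipeline, a task could run checks like these on each batch and fail the run when any result has `success` set to False.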
Capstone: End-to-End Data Platform
Ingest → transform → model → visualise: a complete modern data stack.
- Kafka → Spark → BigQuery pipeline
- dbt models on BigQuery
- Airflow orchestrating the full flow
- Dashboard in Looker Studio or Metabase
Frequently asked
Is the Data Engineering roadmap free?
Yes. The Data Engineering roadmap is free to follow on Commit, and nearly all of the curated resources are free; the one exception listed here is the Udemy Spark course, which is marked paid. You can track your progress, keep a daily streak, and earn a shareable certificate at no cost. There is no paywall.
How long does the Data Engineering roadmap take to complete?
About 160 hours of focused study across 9 courses and 3 stages. At roughly one hour a day that is about 160 days, or a little over 5 months; you can move faster by studying more each day.
Do I get a certificate for finishing the Data Engineering roadmap?
Yes. When you complete the roadmap on Commit you receive a verifiable certificate of completion that you can add to LinkedIn and your public Commit profile as proof of what you finished.