DATA ENGINEER COURSE
Build Production-Ready Data Warehouses at Scale
Before You Start
Prerequisites:
In order to succeed in this program, we recommend having intermediate SQL and Python programming skills.
Educational Objectives
● Create user-friendly relational and NoSQL data models
● Create scalable and efficient data warehouses
● Identify the appropriate use cases for different big data technologies
● Work efficiently with massive datasets
● Build and interact with a cloud-based data lake
● Automate and monitor data pipelines
● Develop proficiency in Spark, Airflow, and AWS tools
Course 1
Data Modeling
In this course, you’ll learn to create relational and NoSQL data models to fit the diverse needs of data consumers.
You’ll understand the differences between data models and how to choose the appropriate one for a given situation.
You’ll also build fluency in PostgreSQL and Apache Cassandra.
Project 1: Data Modeling with Postgres and Apache Cassandra
In this project, you’ll model user activity data for a music streaming app called Sparkify.
You’ll create a database and ETL pipeline, in both Postgres and Apache Cassandra, designed to optimize queries for understanding what songs users are listening to.
For PostgreSQL, you will also define Fact and Dimension tables and insert data into your new tables.
For Apache Cassandra, you will model your data so you can run specific queries provided by the analytics team at Sparkify.
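To give a feel for the two modeling styles, here is a minimal sketch in Python, assuming the psycopg2 and cassandra-driver packages with local Postgres and Cassandra instances available; the table and column names are illustrative placeholders, not the project’s exact schema.

```python
# Minimal sketch of both modeling styles. Assumes psycopg2 and
# cassandra-driver are installed, a local Postgres database "sparkifydb"
# exists, and a Cassandra keyspace "sparkify" has been created. All table
# and column names here are illustrative placeholders.
import psycopg2
from cassandra.cluster import Cluster

# --- Relational (Postgres): a star schema with fact and dimension tables ---
pg = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = pg.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS users (          -- dimension table
        user_id INT PRIMARY KEY,
        first_name TEXT,
        last_name TEXT,
        level TEXT
    );
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS songplays (      -- fact table
        songplay_id SERIAL PRIMARY KEY,
        start_time TIMESTAMP,
        user_id INT REFERENCES users (user_id),
        song_id TEXT,
        session_id INT
    );
""")
pg.commit()

# --- NoSQL (Cassandra): one table per query, keyed by how it is read ---
session = Cluster(["127.0.0.1"]).connect("sparkify")
# Target query: "which songs did a user play in a given session?"
# Partitioning by (session_id, user_id) keeps each answer on one partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_session (
        session_id INT,
        user_id INT,
        item_in_session INT,
        song TEXT,
        PRIMARY KEY ((session_id, user_id), item_in_session)
    );
""")
```

The design difference to notice: the relational schema is organized around entities and joins, while the Cassandra table is organized around a single query, with the partition key chosen so each query reads from one partition.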
Course 2
Cloud Data Warehouses
In this course, you’ll learn to create cloud-based data warehouses.
You’ll sharpen your data warehousing skills, deepen your understanding of data infrastructure, and be introduced to data engineering on the cloud using Amazon Web Services (AWS).
Project 2: Data Infrastructure on the Cloud
In this project, you’ll build an ELT pipeline that extracts Sparkify’s data from S3, stages it in Redshift, and transforms it into a set of dimensional tables so the analytics team can continue finding insights into what songs users are listening to.
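As a sketch of what ELT means here: Redshift loads the raw files from S3 with COPY, and the transformation then happens as SQL inside the warehouse. The sketch below uses psycopg2 (Redshift speaks the Postgres wire protocol); the cluster endpoint, bucket, IAM role, and table names are placeholders.

```python
# Minimal sketch of the ELT flow: COPY raw JSON from S3 into a staging
# table, then transform with SQL inside Redshift. The endpoint, bucket,
# IAM role, and table names below are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-west-2.redshift.amazonaws.com",  # placeholder
    dbname="dev", user="awsuser", password="...", port=5439,
)
cur = conn.cursor()

# Extract + Load: Redshift pulls the raw event files straight from S3.
cur.execute("""
    COPY staging_events
    FROM 's3://my-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/myRedshiftRole'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""")

# Transform: shape the staged rows into a dimensional table with plain SQL.
cur.execute("""
    INSERT INTO users (user_id, first_name, last_name, level)
    SELECT DISTINCT user_id, first_name, last_name, level
    FROM staging_events
    WHERE user_id IS NOT NULL;
""")
conn.commit()
```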
Course 3
Data Lakes with Spark
In this course, you’ll learn about the big data ecosystem and how to use Spark to work with massive datasets.
You’ll also learn how to store big data in a data lake on the cloud and query it with Spark.
Project 3: Big Data with Spark
In this project, you’ll build an ETL pipeline for a data lake.
The data resides in S3, in a directory of JSON logs of user activity on the app, as well as a directory of JSON metadata on the songs in the app.
You will load data from S3, process the data into analytics tables using Spark, and load them back into S3. You’ll deploy this Spark process on a cluster using AWS.
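Here is a minimal sketch of that ETL in PySpark, assuming a SparkSession configured with S3 access; the bucket paths and column names are illustrative, not the project’s exact ones.

```python
# Minimal sketch of the data lake ETL in PySpark. Assumes a SparkSession
# with S3 access configured; bucket paths and column names are
# illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sparkify-etl").getOrCreate()

# Extract: read the raw JSON logs and song metadata from S3.
log_df = spark.read.json("s3a://my-bucket/log_data/*.json")
song_df = spark.read.json("s3a://my-bucket/song_data/*/*/*/*.json")

# Transform: keep only song-play events and join the logs to song metadata.
songplays = (
    log_df.filter(F.col("page") == "NextSong")
    .join(song_df, log_df.song == song_df.title, "left")
    .select("userId", "sessionId", "song_id", "artist_id", "ts")
    .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
)

# Load: write the analytics table back to S3 as partitioned Parquet files.
(songplays
    .withColumn("year", F.year("start_time"))
    .withColumn("month", F.month("start_time"))
    .write.partitionBy("year", "month")
    .mode("overwrite")
    .parquet("s3a://my-bucket/analytics/songplays/"))
```

Partitioning the output by year and month is the design choice to notice: downstream queries that filter on those columns read only the matching directories instead of scanning the whole table.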
Course 4
Automate Data Pipelines
In this course, you’ll learn to schedule, automate, and monitor data pipelines using Apache Airflow.
You’ll learn to run data quality checks, track data lineage, and work with data pipelines in production.
Project 4: Data Pipelines with Airflow
In this project, you’ll continue your work on the music streaming company’s data infrastructure by creating and automating a set of data pipelines.
You’ll configure and schedule data pipelines with Airflow, then monitor and debug them in production.
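As a taste of what a scheduled pipeline with a data quality check looks like, here is a minimal sketch written against the Airflow 2.x API; the DAG, task, and table names are illustrative placeholders.

```python
# Minimal sketch of an automated pipeline with a data quality check,
# written against the Airflow 2.x API. The DAG, task, and table names
# are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_songplays():
    # Placeholder for the real load step (e.g. an S3 -> Redshift copy).
    pass

def check_row_count():
    # Placeholder quality check: fail the task if the target table came
    # out empty. A real check would query the warehouse for the count.
    rows = 42  # pretend result of SELECT COUNT(*) FROM songplays
    if rows == 0:
        raise ValueError("Data quality check failed: songplays is empty")

with DAG(
    dag_id="sparkify_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",  # run once an hour
    default_args={"retries": 3, "retry_delay": timedelta(minutes=5)},
    catchup=False,
) as dag:
    load = PythonOperator(task_id="load_songplays", python_callable=load_songplays)
    quality = PythonOperator(task_id="check_row_count", python_callable=check_row_count)
    load >> quality  # the check runs only after the load succeeds
```

Because the quality check is an ordinary task downstream of the load, a failed check halts the pipeline, and with retries and alerting configured it surfaces the problem before bad data reaches analysts.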