This content has been archived. It may no longer be relevant.


Build Production-Ready Data Warehouses at Scale



Prerequisites

To succeed in this program, we recommend intermediate SQL and Python programming skills.

Educational Objectives

● Create user-friendly relational and NoSQL data models

● Create scalable and efficient data warehouses

● Identify the appropriate use cases for different big data technologies

● Work efficiently with massive datasets

● Build and interact with a cloud-based data lake

● Automate and monitor data pipelines

● Develop proficiency in Spark, Airflow, and AWS tools



Course 1

Data Modeling

In this course, you’ll learn to create relational and NoSQL data models to fit the diverse needs of data consumers.

You’ll understand the differences between these data models, and how to choose the appropriate model for a given situation.

You’ll also build fluency in PostgreSQL and Apache Cassandra.

Project 1: Data Modeling with Postgres and Apache Cassandra

In this project, you’ll model user activity data for a music streaming app called Sparkify.

You’ll create a database and ETL pipeline, in both Postgres and Apache Cassandra, designed to optimize queries for understanding what songs users are listening to.

For PostgreSQL, you will also define fact and dimension tables and insert data into your new tables.
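As a rough sketch of what fact and dimension tables can look like, consider the following, using SQLite as a lightweight stand-in for PostgreSQL. The table and column names are illustrative, not the project’s exact schema:

```python
import sqlite3

# In-memory database standing in for the project's PostgreSQL instance.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: one row per user, with descriptive attributes.
cur.execute("""
    CREATE TABLE users (
        user_id INTEGER PRIMARY KEY,
        first_name TEXT,
        level TEXT
    )
""")

# Dimension table: one row per song.
cur.execute("""
    CREATE TABLE songs (
        song_id TEXT PRIMARY KEY,
        title TEXT,
        duration REAL
    )
""")

# Fact table: one row per listening event, referencing the dimensions.
cur.execute("""
    CREATE TABLE songplays (
        songplay_id INTEGER PRIMARY KEY,
        start_time TEXT,
        user_id INTEGER REFERENCES users (user_id),
        song_id TEXT REFERENCES songs (song_id)
    )
""")

# Insert a user, a song, and one play event linking them.
cur.execute("INSERT INTO users VALUES (1, 'Ada', 'free')")
cur.execute("INSERT INTO songs VALUES ('S1', 'Imagine', 183.0)")
cur.execute("INSERT INTO songplays VALUES (1, '2018-11-01T12:00:00', 1, 'S1')")

# An analytics query joining the fact table to its dimensions.
rows = cur.execute("""
    SELECT u.first_name, s.title
    FROM songplays sp
    JOIN users u ON sp.user_id = u.user_id
    JOIN songs s ON sp.song_id = s.song_id
""").fetchall()
print(rows)  # [('Ada', 'Imagine')]
```

The fact table stays narrow (IDs and measures), while descriptive attributes live in the dimensions, which is what keeps analytical joins simple.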

For Apache Cassandra, you will model your data so you can run specific queries provided by the analytics team at Sparkify.
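Cassandra modeling works the other way around from relational modeling: you start from the query and design one table per query. A toy Python illustration of that idea (no Cassandra required; the query shape and field values are hypothetical examples):

```python
# Query to support: "give me the artist and song for a given session and item."
# In Cassandra, you'd create a table whose PRIMARY KEY is exactly
# (session_id, item_in_session); here a dict keyed the same way stands in.
session_item_plays = {}

def insert_play(session_id, item_in_session, artist, song):
    # The key *is* the lookup path -- no joins, no full scans.
    session_item_plays[(session_id, item_in_session)] = {
        "artist": artist,
        "song": song,
    }

insert_play(338, 4, "Faithless", "Music Matters")

# The one query this "table" was designed for:
result = session_item_plays[(338, 4)]
print(result)
```

If the analytics team later needs a different lookup (say, by user), Cassandra's answer is a second, differently keyed table holding the same data, not a join.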


Course 2

Cloud Data Warehouses

In this course, you’ll learn to create cloud-based data warehouses.

You’ll sharpen your data warehousing skills, deepen your understanding of data infrastructure, and be introduced to data engineering on the cloud using Amazon Web Services (AWS).

Project 2: Data Infrastructure on the Cloud

In this project, you are tasked with building an ELT pipeline that extracts Sparkify’s data from S3, stages it in Redshift, and transforms it into a set of dimensional tables for the analytics team to continue finding insights into what songs users are listening to.
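The staging step typically relies on Redshift’s COPY command to bulk-load the S3 files. A minimal sketch of building such a statement in Python; the bucket, table name, and IAM role ARN below are made-up placeholders:

```python
def build_copy_statement(table, s3_path, iam_role_arn):
    """Build a Redshift COPY statement that bulk-loads JSON files from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"FORMAT AS JSON 'auto';"
    )

stmt = build_copy_statement(
    "staging_events",                          # staging table name (assumed)
    "s3://example-bucket/log_data",            # placeholder S3 prefix
    "arn:aws:iam::123456789012:role/example",  # placeholder IAM role
)
print(stmt)
```

COPY loads the files in parallel across the cluster’s slices, which is why staging in Redshift first, then transforming with SQL, scales better than row-by-row inserts.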


Course 3

Data Lakes with Spark

In this course, you’ll learn about the big data ecosystem and how to work efficiently with massive datasets using Apache Spark.

You’ll also learn how to store big data in a cloud-based data lake and interact with it using Spark.

Project 3: Big Data with Spark

In this project, you’ll build an ETL pipeline for a data lake. 

The data resides in S3, in a directory of JSON logs on user activity on the app, as well as a directory with JSON metadata on the songs in the app.

You will load data from S3, process the data into analytics tables using Spark, and load them back into S3. You’ll deploy this Spark process on a cluster using AWS.
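The core transformation — filtering raw activity logs down to analytics rows — can be sketched in plain Python; in the project you would express the same logic with Spark DataFrames over the S3 data. The log records and field names below are illustrative:

```python
import json

# Raw JSON log lines, shaped like the app's activity logs (made-up records).
raw_logs = [
    '{"page": "NextSong", "userId": "1", "song": "Imagine", "ts": 1541106106796}',
    '{"page": "Home", "userId": "1", "ts": 1541106106900}',
]

# Keep only actual song plays and project the columns the analytics tables need.
songplays = [
    {"user_id": e["userId"], "song": e["song"], "ts": e["ts"]}
    for e in map(json.loads, raw_logs)
    if e["page"] == "NextSong"
]
print(songplays)  # one row: the "Home" page visit is filtered out
```

In Spark, the same filter-and-project step would be a `filter` followed by a `select` on a DataFrame, distributed across the cluster rather than run on one machine.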


Course 4

Automate Data Pipelines

In this course, you’ll learn to schedule, automate, and monitor data pipelines using Apache Airflow.

You’ll learn to run data quality checks, track data lineage, and work with data pipelines in production.
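A data quality check of the kind you would wrap in an Airflow task can be as simple as counting rows and failing loudly when a freshly loaded table comes back empty. A sketch — the function name is illustrative, not part of Airflow’s API:

```python
def check_has_rows(table_name, row_count):
    """Fail the pipeline run if a freshly loaded table came back empty."""
    if row_count < 1:
        raise ValueError(f"Data quality check failed: {table_name} has no rows")
    print(f"{table_name}: {row_count} rows -- check passed")

# In Airflow, a PythonOperator task would call a function like this after
# each load, with row_count coming from a SELECT COUNT(*) on the warehouse.
check_has_rows("songplays", 6820)
```

Raising an exception is the point: Airflow marks the task as failed, halts downstream tasks, and surfaces the failure in its UI, which is what makes the pipeline monitorable.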

Project 4: Data Pipelines with Airflow

In this project, you’ll continue your work on the music streaming company’s data infrastructure by creating and automating a set of data pipelines.

You’ll configure and schedule data pipelines with Airflow, and monitor and debug production pipelines.