Let's Build An ELT Pipeline Part I: Introduction

Jonathan Duran
Dec 22, 2020

*Alert 🚨* For the sake of everyone’s attention span, I will be breaking up this project over multiple articles… Not sure how many, I am kind of making things up as I go 😊.

Prerequisites

  • High level knowledge of Docker, Dockerfiles, and Docker Compose.
  • Some familiarity with Airflow.
  • Ability to provision an S3 bucket and Redshift cluster.
  • Awareness of Redshift charges and AWS billing.

Overview

Why ELT? It works really well for smaller teams with few or no engineers to spare for analytics. A well-rounded data team is expensive, especially for startups whose number one focus is building out the product.

Taking the more traditional ETL approach means pulling a software engineer or two away from building the application and toward maintaining pipelines and transforming data to meet the needs of data analysts and data scientists.

With the ELT paradigm, the data engineering stage ends once data is loaded into the data warehouse in its raw, unaltered form. A tool like DBT then makes it really simple for data analysts to take over the transformations and database modeling their analysis requires.

ELT will be the star of this project.

Extract

We will begin by using the NYC Open Data API to pull some Motor Vehicle Collisions data into our local project.
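To give a rough idea of what that extract step could look like, here is a minimal sketch using the Socrata API behind NYC Open Data. The resource ID and paging values are assumptions on my part (confirm them on the dataset's page), and we'll wire the real version into Airflow later:

```python
import json

import requests

# Socrata endpoint for the Motor Vehicle Collisions - Crashes dataset.
# The resource ID (h9gi-nx95) is an assumption -- confirm it on the
# dataset's NYC Open Data page before relying on it.
API_URL = "https://data.cityofnewyork.us/resource/h9gi-nx95.json"


def fetch_collisions(limit=1000, offset=0):
    """Pull one page of collision records using SoQL paging parameters."""
    response = requests.get(API_URL, params={"$limit": limit, "$offset": offset})
    response.raise_for_status()
    return response.json()


if __name__ == "__main__":
    records = fetch_collisions()
    # Write newline-delimited JSON -- one record per line keeps the file
    # friendly for a later S3 upload and Redshift COPY.
    with open("collisions.json", "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    print(f"Fetched {len(records)} records")
```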

Load

The collisions data will be uploaded to an S3 bucket and subsequently loaded into a Redshift cluster.

⚠️ WARNING ⚠️ I do not believe Redshift is part of the AWS free tier, so please research the cost and make sure to pause your cluster whenever you are done with it. Alternatively, you can use an RDBMS like Postgres or MySQL for this project. ⚠️
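At a high level, the load step boils down to two calls: a boto3 upload and a Redshift COPY. Here is a minimal sketch under assumed names — the bucket, key, table, IAM role, and connection details are all placeholders, not the ones this series will actually use:

```python
import boto3
import psycopg2

# Placeholder names -- swap in your own bucket, key, table, and IAM role.
BUCKET = "my-collisions-bucket"
KEY = "raw/collisions.json"
IAM_ROLE = "arn:aws:iam::123456789012:role/my-redshift-copy-role"

# 1. Upload the raw extract to S3.
s3 = boto3.client("s3")
s3.upload_file("collisions.json", BUCKET, KEY)

# 2. COPY the newline-delimited JSON from S3 into a raw table in Redshift.
copy_sql = f"""
    COPY raw_collisions
    FROM 's3://{BUCKET}/{KEY}'
    IAM_ROLE '{IAM_ROLE}'
    FORMAT AS JSON 'auto';
"""

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="dev",
    user="awsuser",
    password="<redshift-password>",
)
with conn:  # commits on success
    with conn.cursor() as cur:
        cur.execute(copy_sql)
conn.close()
```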

Transform

Once the raw data is in our data warehouse, we will use DBT to create a few data models and materialize them as views in Redshift. We will also create some schema and data tests against our data.
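The dbt work itself comes later in the series, but to sketch how it slots in: a model is just a SQL SELECT over the raw table that dbt materializes (as a view, in our case), and `dbt test` runs the schema and data tests declared next to it. The snippet below simply shells out to the dbt CLI the way an orchestration task eventually might; the project path is an assumption:

```python
import subprocess

# Assumed location of the dbt project inside the container -- placeholder path.
DBT_PROJECT_DIR = "/usr/local/airflow/dbt"

# Build the models (materialized as views in Redshift), then run the
# schema and data tests defined against them.
for command in (["dbt", "run"], ["dbt", "test"]):
    subprocess.run(command, cwd=DBT_PROJECT_DIR, check=True)
```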

Looking Ahead

This was an introduction to what should be a fun project. In the next article, we will set up our Docker environment and make sure Airflow, Postgres, and DBT are configured correctly.

Check out the second article in this series.
