Introduction
This is a sample project that showcases basic data engineering skills. The subject matter is discussed in the Motivation section below.
Demonstrated skills include streaming data processing, batch data analysis, data transformation, data scraping, database modelling, and more.
Motivation
My biggest motivation was to build an end-to-end pipeline with which I could demonstrate my data engineering skills.
One of my biggest pet peeves when grocery shopping was not knowing how much money I would end up spending. I always wanted to know how much a recipe would cost me to make. Sure, I could ballpark it, but sometimes I would have to buy another container of a spice, which would increase my bill, or I wouldn't know which store had the best deals on certain items. This made me think it might be interesting to build a project that helped me solve this problem.
Architecture
The following image shows the architecture of the end-to-end pipeline.
Technologies
All components listed below run in Docker so that the environment stays consistent from one machine to another.
Flask
A REST API, created in Flask, is one of the first components of the pipeline. Querying certain endpoints with certain parameters sends a request to a grocery store's website to fetch data for that store and grocery "type" (e.g., "apples", "chocolate", etc.).
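A minimal sketch of what such an endpoint could look like is shown below; the route, parameter names, and the scrape_store_prices helper are illustrative assumptions, not the project's actual API.

```python
# Hypothetical sketch of a scraping endpoint; route, parameters, and the
# scrape_store_prices helper are assumptions, not the project's real API.
from flask import Flask, jsonify, request

app = Flask(__name__)

def scrape_store_prices(store: str, grocery_type: str) -> list:
    """Placeholder for the logic that would query the store's website."""
    return [{"store": store, "type": grocery_type, "item": "example", "price": 0.0}]

@app.route("/prices", methods=["GET"])
def prices():
    # Example query: /prices?store=some_store&type=apples
    store = request.args.get("store")
    grocery_type = request.args.get("type")
    if not store or not grocery_type:
        return jsonify({"error": "both 'store' and 'type' are required"}), 400
    return jsonify(scrape_store_prices(store, grocery_type))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```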
Airflow
Details on Airflow can be found on the Automation page.
Kafka
Kafka is the main communication channel between services, such as the Flask container and a Beam pipeline. It is also the source for the Druid sink, with that data likewise coming from a Beam pipeline.
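As a rough sketch of how a service might publish scraped results to Kafka (the broker address, topic name, and message shape below are assumptions, not the project's actual configuration):

```python
# Sketch of publishing a scraped record to Kafka with kafka-python;
# broker address, topic name, and record shape are assumptions.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# A record like the Flask service might emit after scraping a store page.
record = {"store": "some_store", "type": "apples", "item": "Gala Apples", "price": 1.99}
producer.send("raw-grocery-data", value=record)  # assumed topic name
producer.flush()
```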
MySQL
Details on MySQL can be found on the relevant pages.
Flink/Beam
Apache Beam pipelines parse raw data from the Flask container and store it in the MySQL database. Beam pipelines also push analytic data, derived from the MySQL database, to Apache Kafka, from which it is stored in the Druid database.
Beam pipelines run on Apache Flink.
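A minimal Beam sketch of the parse-and-store idea is shown below. Given the Gradle/Java requirement, the project's real pipelines may well be written in Java; the record shape, Flink address, and write step here are assumptions for illustration only.

```python
# Illustrative Beam pipeline sketch: parse raw JSON records and hand them to a
# storage step. All names and addresses are assumptions, not the project's code.
import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_record(raw: str) -> dict:
    """Turn a raw JSON string from the scraper into a flat row for MySQL."""
    data = json.loads(raw)
    return {"store": data["store"], "item": data["item"], "price": float(data["price"])}

# FlinkRunner assumes a Flink cluster is reachable at the (assumed) address below.
options = PipelineOptions(runner="FlinkRunner", flink_master="localhost:8081")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        # beam.Create stands in for the Kafka source used in the real pipeline.
        | "Read raw records" >> beam.Create(['{"store": "a", "item": "apples", "price": "1.99"}'])
        | "Parse" >> beam.Map(parse_record)
        # The real pipeline would write these rows to MySQL (e.g. via a DoFn
        # holding a DB connection); printing stands in for that step here.
        | "Store" >> beam.Map(print)
    )
```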
Druid
Druid houses historical and analytic grocery data, including the raw data from the MySQL database.
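As an illustration, data stored in Druid can be read back through Druid's SQL HTTP endpoint; the router address and datasource name below are assumptions.

```python
# Sketch of querying analytic data out of Druid via its SQL API;
# the router address and datasource name are assumptions.
import requests

query = {
    "query": "SELECT item, AVG(price) AS avg_price "
             "FROM grocery_analytics "  # assumed datasource name
             "WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '7' DAY "
             "GROUP BY item"
}
response = requests.post("http://localhost:8888/druid/v2/sql", json=query)
print(response.json())
```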
Requirements
- Java 8 – OpenJDK works as well if you prefer it on Linux distros
- Gradle 6.3+ – Other major versions (5, 7, etc.) are not guaranteed to work
- Docker – Install the main Docker engine
- Docker-compose – Install Docker-compose. On Windows this is installed automatically with Docker Desktop
- Python 3.8 – Install Python 3.8. This version ships with a pip version required by Airflow