Data Transformation Pipeline
This post is part 2 of my cloud data architecture series. It covers the design choices we made when building a sales data pipeline in the cloud.
What data is stored in each data lake tier?
We use Azure Data Lake Storage (ADLS) as our data lake and separate it into three tiers: Bronze, Silver, and Gold. Bronze stores raw data, Silver stores preprocessed data, and Gold stores data ready to be ingested into the data warehouse.
Using this three-tier framework, we first land the raw sales data ingested from the POS system API in the Bronze tier. We then build ETL pipelines that transform the raw data into a model-ready format and store it in the Silver tier. A machine learning pipeline feeds this data to our model, and the predicted sales are written back to the Gold tier, from which the data is loaded into our data warehouse.
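To make this concrete, here is a minimal sketch of the Bronze-to-Silver step as it might look in one of our Synapse Notebooks. The storage account, container paths, and column names (`store_id`, `sold_at`, `amount`, `status`) are hypothetical, and `spark` is the session Synapse Notebooks provide by default.

```python
import pandas as pd

# Paths are hypothetical placeholders for our Bronze and Silver containers.
bronze_path = "abfss://bronze@salesdatalake.dfs.core.windows.net/pos_sales/2024-01-01/"
silver_path = "abfss://silver@salesdatalake.dfs.core.windows.net/pos_sales/2024-01-01/"

# Read the raw POS extract with Spark, then hand off to Pandas for transformation.
raw_df = spark.read.json(bronze_path).toPandas()

# Example preprocessing: parse timestamps, drop cancelled orders,
# and aggregate sales per store per day.
raw_df["sold_at"] = pd.to_datetime(raw_df["sold_at"])
clean_df = raw_df[raw_df["status"] != "cancelled"]
daily_sales = (
    clean_df.groupby(["store_id", clean_df["sold_at"].dt.date])["amount"]
    .sum()
    .reset_index()
    .rename(columns={"sold_at": "date"})
)

# Write the model-ready table to the Silver tier.
spark.createDataFrame(daily_sales).write.mode("overwrite").parquet(silver_path)
```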
What tools should we use for data transformation?
My favorite framework for choosing data transformation tools is the one by Olivier and Breght.
We are not migrating data from on-premises, as historical data is stored in another SQL database. We do have data engineering skill sets on the team, and some complex transformations are required. We also prefer a Synapse architecture, and no Databricks-specific functionality is needed since most transformations are done with the Pandas library. Hence, the best data transformation tool for us is the Synapse Notebook.
How to orchestrate data pipelines?
We need a data orchestration tool to schedule our ingestion and transformation pipelines to run daily. There are many free, open-source tools such as Airflow and Prefect that are widely used in the data engineering community. In the end, however, we decided to go with Azure Synapse, which also allows us to schedule our pipelines using time and event triggers.
The main reason is that our team currently does not have a data engineer to maintain and debug orchestration tools. As a lean team, we want our data scientists to focus on tasks that have an impact on the revenue line. This also makes future hiring easier, as candidates do not need experience with Airflow or Prefect. Azure Synapse provides us with a low-code tool that is maintained by Microsoft. A sample schedule trigger definition is shown below.
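For illustration, this is roughly what the JSON behind a daily schedule trigger looks like in Synapse. The trigger and pipeline names are hypothetical; in practice we author this through the Synapse Studio UI rather than by hand.

```json
{
  "name": "DailySalesTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2024-01-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "IngestPosSales",
          "type": "PipelineReference"
        }
      }
    ]
  }
}
```

Event triggers follow the same structure, firing on blob creation or deletion in a storage container instead of on a recurrence.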
How to monitor data pipelines?
We use Azure Synapse Studio to monitor the status of all our data pipelines. Whenever a pipeline fails, an email is sent to every team member. We also set up a webhook to our Microsoft Teams channel that posts the running status of each pipeline. This allows us to monitor the pipelines closely and minimize downtime.
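As a rough sketch, posting a status update to Teams comes down to sending a small JSON payload to the channel's incoming webhook. The webhook URL, pipeline name, and run ID below are placeholders; in our setup the equivalent call is made from the pipeline itself (for example, a Web activity that runs on failure), but the same payload works from Python.

```python
import requests

# Placeholder: the incoming webhook URL configured on our Teams channel.
TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/webhookb2/..."

def post_pipeline_status(pipeline_name: str, status: str, run_id: str) -> None:
    """Send a simple text message to the Teams channel via the incoming webhook."""
    payload = {
        "text": f"Pipeline **{pipeline_name}** finished with status **{status}** (run {run_id})."
    }
    resp = requests.post(TEAMS_WEBHOOK_URL, json=payload, timeout=10)
    resp.raise_for_status()

# Hypothetical usage with placeholder values.
post_pipeline_status("IngestPosSales", "Failed", "3f2a-example-run-id")
```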