From Zero to Production!
In this guide, you will learn how to build a Data Lakehouse from scratch using Datazone.
Prerequisites
Before you start building your Data Lakehouse, make sure you have the following prerequisites:
- A Datazone account. If you don't have one, you can sign up here.
- Datazone CLI installed on your local machine. You can install it by following the instructions here.
Task List
To understand how to build a Data Lakehouse from scratch using Datazone, let's follow these steps:
- Connect your data source: Start by connecting AWS S3 as a data source.
- Initialize first project: Set up your first project and add an Extract component.
- Run first execution: Launch your first execution to fetch data from the source.
- Create first pipeline: Design a simple pipeline to process the data.
- Run first pipeline: Execute the pipeline to transform your data.
- Create first schedule: Configure periodic runs for automated processing.
- Access the data: Learn how to query and use the processed data.
Connect Source
- Go to the Sources page by clicking on the Sources tab in the sidebar.
- Click on the Create Source button.
- Fill in the required fields and click on the Create button. And you are done! You have successfully connected your source; you can check it on the Sources page.
In the next steps, we will create an Extract entity to fetch data from this source.
Create Project
- Go to the Projects page by clicking on the Projects tab in the sidebar.
- Click on the Create Project button.
- Fill in the required fields and click on the Create button. Boom! You have successfully created your first project.
Define your Extract
- On your project page, click on the Add button in the top right corner to add a new entity.
- Select Extract as the entity type.
- Fill in the required fields and click on the Create button. You have successfully created your first Extract entity.
Base Attributes
- name: The name of the extract.
- source: The source you want to extract data from. (It is already selected.)
- mode: The mode of the extract. Options are:
  - Overwrite: Fetch all the data from the source every time.
  - Append: Fetch only the new data from the source.

Source-Dependent Attributes (in this case, AWS S3; check the AWS S3 page for more details):

- search_prefix: The prefix you want to search for in the bucket.
- search_pattern: The pattern you want to search for in the bucket.
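For example, a search_prefix of raw/orders/ combined with a search_pattern of *.csv would pick up the CSV files under that prefix; these values are only illustrative, and the exact pattern syntax is described on the AWS S3 page.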
Run First Execution and Check the Data
- Click on the created Extract entity and move to the Executions tab. By clicking the Run button, you can start your first execution.
- Meanwhile, you can check the execution logs and other details in the Logs tab, and cancel the execution if you want. After a while, the execution will complete and you will notice the new dataset in the left explorer. You can check the data by clicking on the dataset.
- In the dataset drawer, you can see the data fetched from the source. You can also check the schema and run queries on the data to explore it.
- In the same way, we can fetch the other CSV files from the source and create a dataset for each of them.
Click Less, Code More: Create First Pipeline
If you have already created your project in the UI, you can clone it to your local machine using the Datazone CLI and then check the contents of your project folder.
- We can create our pipeline file in the project folder. Let's create a new file named order_reports.py, along the lines of the sketch below.
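Here is a minimal sketch of what order_reports.py could contain, based purely on the description that follows. The datazone import path, the exact @transform decorator signature, the dict shapes passed to input_mapping and output_mapping, and all column names (order_id, customer_id, country, amount, quantity, price, product_id) are assumptions for illustration; check the SDK reference for the real API.

```python
# order_reports.py -- illustrative sketch only; the SDK-facing parts are assumptions.
from datazone import transform  # assumed import path


@transform(
    input_mapping={"orders": "orders", "order_lines": "order_lines", "customers": "customers"},
    output_mapping={"joined_orders": "joined_orders"},
)
def join_tables(orders, order_lines, customers):
    # Join the three source tables into one wide table (column names are assumed).
    return orders.merge(order_lines, on="order_id").merge(customers, on="customer_id")


@transform(
    input_mapping={"joined_orders": "joined_orders"},
    output_mapping={"sales_by_country": "sales_by_country"},
)
def sales_by_country(joined_orders):
    # Report 1: total sales and order count by country.
    return (
        joined_orders.groupby("country")
        .agg(total_sales=("amount", "sum"), order_count=("order_id", "nunique"))
        .reset_index()
    )


@transform(
    input_mapping={"joined_orders": "joined_orders"},
    output_mapping={"most_popular_products": "most_popular_products"},
)
def most_popular_products(joined_orders):
    # Report 2: total sales, total quantity, average price, and order count by product.
    return (
        joined_orders.groupby("product_id")
        .agg(
            total_sales=("amount", "sum"),
            total_quantity=("quantity", "sum"),
            average_price=("price", "mean"),
            order_count=("order_id", "nunique"),
        )
        .reset_index()
    )
```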
The code above is a simple pipeline that joins three tables and creates two reports.
- You can see that we have three functions decorated with @transform. These functions are the steps of the pipeline. You can specify the input datasets and the output dataset of each function by using the input_mapping and output_mapping classes.
- The first function, join_tables, joins the orders, order_lines, and customers tables.
- The second function, sales_by_country, calculates the total sales and order count by country.
- The third function, most_popular_products, calculates the total sales, total quantity, average price, and order count by product.
You can create your own pipeline according to your needs. You can also check the Transform Functions page to see the available functions.
- Then you need to reference this pipeline in the config.yml file.
- Let's deploy the pipeline using the Datazone CLI from the project folder.
Once the deployment finishes, you will see the deployed pipeline when you click the newly created pipeline on the project page.
- Let's execute the pipeline, again using the Datazone CLI from the project folder.
- While the execution is running, you can follow the logs both in the terminal and in the UI. After the execution completes, you can check the logs and the output dataset in the Executions tab.
- Our new Dataset is ready to use. You can check and explore the data in the dataset drawer.
Orchestrate Your Pipeline
- Select the pipeline you want to schedule in the Explorer.
- Open the Schedules tab and click on the + Set Schedule button.
- Attributes are:
  - pipeline: The pipeline you want to schedule. (It is already selected.)
  - name: The name of the schedule.
  - expression: The cron expression for the schedule. You can use the presets or write your own.
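For example, a cron expression of 0 6 * * * would run the pipeline once a day at 06:00; this value is only an illustration, so pick whatever cadence fits your data.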
Access the Data
SQL Interfaces
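As shown earlier in the dataset drawer, you can explore any dataset with SQL; for example, a query along the lines of SELECT country, total_sales FROM sales_by_country ORDER BY total_sales DESC would rank the countries in the report we just built (the exact table name depends on how you mapped the pipeline outputs).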
REST API Access
You can use the Datazone API to access your data programmatically. You can check the API Reference page for more details.
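As a rough sketch of what programmatic access could look like: the base URL, endpoint path, and authentication header below are placeholders invented for illustration, not the documented Datazone API; the API Reference page has the real endpoints.

```python
import requests

# All of these values are hypothetical placeholders -- take the real base URL,
# dataset identifier, and token from your Datazone account and the API Reference.
BASE_URL = "https://api.datazone.example"   # placeholder base URL
DATASET_ID = "sales_by_country"             # the dataset created by our pipeline
API_TOKEN = "your-api-token"

# Fetch rows from a dataset over HTTP; the /datasets/{id}/rows path is illustrative.
response = requests.get(
    f"{BASE_URL}/datasets/{DATASET_ID}/rows",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```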