Prerequisites
Before you start building your Data Lakehouse, make sure you have the following prerequisites:

- A Datazone account. If you don't have one, you can sign up here.
- Datazone CLI installed on your local machine. You can install it by following the instructions here.
Task List
To understand how to build a Data Lakehouse from scratch using Datazone, let's follow these steps:

- Connect your data source: Start by connecting AWS S3 as a data source.
- Initialize your first project: Set up your first project and add an Extract component.
- Run your first execution: Launch your first execution to fetch data from the source.
- Create your first pipeline: Design a simple pipeline to process the data.
- Run your first pipeline: Execute the pipeline to transform your data.
- Create your first schedule: Configure periodic runs for automated processing.
- Access the data: Learn how to query and use the processed data.
Connect Source
- Go to the Sources page by clicking on the Sources tab in the sidebar.

- Click on the Create Source button.

- Fill in the required fields and click on the Create button. And you are done! You have successfully connected your source. You can see it listed on the Sources page.

In the next steps, we will create an Extract entity to fetch data from this source.
Create Project
- Go to the Projects page by clicking on the Projects tab in the sidebar.

- Click on the Create Project button.

- Fill in the required fields and click on the Create button. Boom! You have successfully created your first project.

Define your Extract
- On your project page, click on the Add button in the top right corner to add a new entity.
- Select Extract as the entity type.

- Fill in the required fields and click on the Create button. You have successfully created your first Extract entity.

Attributes are:

- name: The name of the extract.
- source: The source you want to extract data from. (It is already selected.)
- mode: The mode of the extract. Options are:
  - Overwrite: Fetch all the data from the source every time.
  - Append: Fetch only the new data from the source.
- search_prefix: The prefix you want to search for in the bucket.
- search_pattern: The pattern you want to search for in the bucket (see the sketch below for how prefix and pattern work together).
Run First Execution and Check the Data
- Click on the created Extract entity and move to the Executions tab. By clicking the Run button, you can start your first execution.

- Simultaneously, you can check the execution logs and other details in the Logs tab. You can cancel the execution if you want. After a while, the execution will complete and you will notice the new dataset in the left explorer. You can check the data by clicking on the dataset.

- In the dataset drawer, you can see the data fetched from the source. You can also check the schema and run queries on the data to explore it.

- In the same way, we can fetch the other CSV files from the source and create a dataset for each of them.
Click Less, Code More: Create First Pipeline
If you have already created your project in the UI, you can clone it to your local machine using the Datazone CLI.

- We can create our pipeline file in the project folder. Let's create a new file named order_reports.py.
order_reports.py
- You can see that we have three functions decorated with @transform. These functions are the steps of the pipeline. You can specify the input datasets and the output dataset of each function by using the input_mapping and output_mapping classes.
- The first function, join_tables, joins the orders, order_lines, and customers tables.
- The second function, sales_by_country, calculates the total sales and order count by country.
- The third function, most_popular_products, calculates the total sales, total quantity, average price, and order count by product (see the plain-pandas sketch below for the logic these steps implement).
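The Datazone-specific decorator wiring is not reproduced in this guide, so the following is only a rough plain-pandas sketch of what the three transform steps compute. The column names (order_id, customer_id, country, price, quantity, product_id) are assumptions about the example data, and the @transform / input_mapping / output_mapping boilerplate is deliberately left out.

```python
import pandas as pd


def join_tables(orders: pd.DataFrame, order_lines: pd.DataFrame, customers: pd.DataFrame) -> pd.DataFrame:
    # Attach each order line to its order, then enrich with customer attributes.
    enriched = order_lines.merge(orders, on="order_id", how="inner")
    return enriched.merge(customers, on="customer_id", how="left")


def sales_by_country(order_details: pd.DataFrame) -> pd.DataFrame:
    # Total sales amount and distinct order count per customer country.
    with_totals = order_details.assign(line_total=order_details["price"] * order_details["quantity"])
    return with_totals.groupby("country", as_index=False).agg(
        total_sales=("line_total", "sum"),
        order_count=("order_id", "nunique"),
    )


def most_popular_products(order_details: pd.DataFrame) -> pd.DataFrame:
    # Sales, quantity, average price, and distinct order count per product.
    with_totals = order_details.assign(line_total=order_details["price"] * order_details["quantity"])
    return with_totals.groupby("product_id", as_index=False).agg(
        total_sales=("line_total", "sum"),
        total_quantity=("quantity", "sum"),
        average_price=("price", "mean"),
        order_count=("order_id", "nunique"),
    )
```

In the actual order_reports.py, each of these functions is decorated with @transform and wired to its input and output datasets through input_mapping and output_mapping.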
- Then you need to reference this pipeline in the config.yml file.
config.yml
- Let's deploy the pipeline by running the following command in the project folder.

- Let's execute the pipeline by running the following command in the project folder.
There are many ways to do things in Datazone. You can run your pipeline via the UI, CLI, or API.
- While the execution is running, you can check the logs both in the terminal and in the UI. After the execution completes, you can review the logs and the output dataset in the Executions tab.

- Our new Dataset is ready to use. You can check and explore the data in the dataset drawer.

Orchestrate Your Pipeline
- Select the pipeline you want to schedule in the Explorer.
- Open the Schedules tab and click on the + Set Schedule button.
- Attributes are:
  - pipeline: The pipeline you want to schedule. (It is already selected.)
  - name: The name of the schedule.
  - expression: The cron expression for the schedule. You can use the presets or write your own (see the example below).
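As a quick, Datazone-agnostic illustration of a cron expression: 0 6 * * * fires every day at 06:00. The snippet below uses the third-party croniter package (an assumption about local tooling, not something Datazone requires) to preview the run times an expression would produce.

```python
from datetime import datetime

from croniter import croniter  # third-party: pip install croniter

# Hypothetical schedule expression: every day at 06:00.
expression = "0 6 * * *"
itr = croniter(expression, datetime(2024, 1, 1))

# Preview the next three run times this expression would produce.
for _ in range(3):
    print(itr.get_next(datetime))
```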
