In this guide, you will learn how to build a Data Lakehouse from scratch using Datazone.
In the next steps, we will create an Extract entity to fetch data from this source.
name
: The name of the extract.
source
: The source you want to extract data from. (It is already selected.)
mode
: The mode of the extract. Options are:
Overwrite
: Fetch all the data from the source every time.
Append
: Fetch only the new data from the source.
search_prefix
: The prefix you want to search for in the bucket.
search_pattern
: The pattern you want to search for in the bucket.
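To make the two search fields concrete, here is a small sketch of how a prefix plus a glob-style pattern could narrow down object keys in a bucket. The exact matching semantics Datazone applies are an assumption here; the key names are invented for illustration.

```python
from fnmatch import fnmatch

# Hypothetical object keys in the source bucket.
keys = [
    "exports/orders/2024-01.csv",
    "exports/orders/2024-02.csv",
    "exports/customers/2024-01.csv",
    "logs/app.log",
]

search_prefix = "exports/orders/"  # only look under this key prefix
search_pattern = "*.csv"           # then match the remainder against this glob

matched = [
    k for k in keys
    if k.startswith(search_prefix) and fnmatch(k[len(search_prefix):], search_pattern)
]
print(matched)  # the two orders CSV files
```

In this sketch the prefix acts as a cheap first filter (useful because object stores can list by prefix) and the pattern refines the result.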
Create a file named `order_reports.py` in the project folder. In it, define functions decorated with `@transform`. These functions are the steps of the pipeline.
You can specify the input datasets and the output dataset of a function by using the `input_mapping` and `output_mapping` classes.
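The decorator-plus-mapping pattern described above can be sketched as follows. This is a minimal illustrative mock, not Datazone's real API: the `transform`, `InputMapping`, and `OutputMapping` names, their signatures, and the dataset names are all assumptions made for the example.

```python
# Illustrative mock of a @transform step with dataset mappings.
# None of these classes are Datazone's actual API.

class InputMapping:
    """Maps function parameters to input dataset names (mock)."""
    def __init__(self, **datasets):
        self.datasets = datasets

class OutputMapping:
    """Maps the function's return value to an output dataset name (mock)."""
    def __init__(self, dataset):
        self.dataset = dataset

def transform(input_mapping=None, output_mapping=None):
    """Register a function as a pipeline step with its mappings (mock)."""
    def wrap(fn):
        fn.input_mapping = input_mapping
        fn.output_mapping = output_mapping
        return fn
    return wrap

@transform(
    input_mapping=InputMapping(orders="orders", order_lines="order_lines"),
    output_mapping=OutputMapping("joined_orders"),
)
def join_tables(orders, order_lines):
    # A real step would join the datasets; this stub just pairs the inputs.
    return list(zip(orders, order_lines))

print(join_tables.input_mapping.datasets["orders"])  # orders
```

The point of the pattern is that the mapping objects carry the wiring (which datasets feed which parameters, and where the result lands), so the function body only deals with data.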
`join_tables` joins the `orders`, `order_lines`, and `customers` tables.
`sales_by_country` calculates the total sales and order count by country.
`most_popular_products` calculates the total sales, total quantity, average price, and order count by product.
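As a sketch of the kind of aggregation `sales_by_country` performs, here is the computation in plain Python. The column names (`country`, `amount`) and the sample rows are assumptions for illustration, not the actual schema.

```python
# Sketch of a "total sales and order count by country" aggregation.
# Column names and values are invented for this example.
joined_rows = [
    {"order_id": 1, "country": "DE", "amount": 120.0},
    {"order_id": 2, "country": "DE", "amount": 80.0},
    {"order_id": 3, "country": "FR", "amount": 50.0},
]

sales_by_country = {}
for row in joined_rows:
    stats = sales_by_country.setdefault(
        row["country"], {"total_sales": 0.0, "order_count": 0}
    )
    stats["total_sales"] += row["amount"]
    stats["order_count"] += 1

print(sales_by_country)
# {'DE': {'total_sales': 200.0, 'order_count': 2},
#  'FR': {'total_sales': 50.0, 'order_count': 1}}
```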
The schedule is defined in the `config.yml` file.

pipeline
: The pipeline you want to schedule. (It is already selected.)
name
: The name of the schedule.
expression
: The cron expression for the schedule. You can use the presets or write your own.
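A standard cron expression has five space-separated fields: minute, hour, day of month, month, and day of week. The sketch below splits an expression into those named fields; the preset shown ("every day at midnight") is an example, not necessarily one of Datazone's presets.

```python
# Standard five-field cron layout: minute hour day-of-month month day-of-week.
FIELDS = ["minute", "hour", "day_of_month", "month", "day_of_week"]

def describe(expression):
    """Split a cron expression into its named fields."""
    parts = expression.split()
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return dict(zip(FIELDS, parts))

# "0 0 * * *" means every day at midnight.
print(describe("0 0 * * *"))
# {'minute': '0', 'hour': '0', 'day_of_month': '*',
#  'month': '*', 'day_of_week': '*'}
```

Reading an expression field by field like this is a quick way to sanity-check a custom schedule before saving it.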