Prerequisites

Before you start building your Data Lakehouse, make sure you have the following prerequisites:

  • A Datazone account. If you don’t have one, you can sign up here.

  • Datazone CLI installed on your local machine. You can install it by following the instructions here.
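Once both are in place, you can verify that the CLI is available by running the datazone command in your terminal.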

Task List

To understand how to build a Data Lakehouse from scratch using Datazone, let’s follow these steps:

  1. πŸ”Œ Connect your data source: Start by connecting AWS S3 as a data source.

  2. πŸ“ Initialize first project: Set up your first project and add an Extract component.

  3. πŸš€ Run first execution: Launch your first execution to fetch data from the source.

  4. πŸ“„ Create first pipeline: Design a simple pipeline to process the data.

  5. πŸš‚ Run first pipeline: Execute the pipeline to transform your data.

  6. ⏰ Create first schedule: Configure periodic runs for automated processing.

  7. πŸ† Access the data: Learn how to query and use the processed data.

πŸ”Œ Connect Source

  1. Go to the Sources page by clicking on the Sources tab in the sidebar.
  2. Click on the Create Source button.
  3. Fill in the required fields and click on the Create button. And you are done! You have successfully connected your source. You can now see it listed on the Sources page.

In the next steps, we will create an Extract entity to fetch data from this source.

πŸ“ Create Project

  1. Go to the Projects page by clicking on the Projects tab in the sidebar.
  2. Click on the Create Project button.
  3. Fill in the required fields and click on the Create button. Boom! πŸš€ You have successfully created your first project.

Define your Extract

  1. On your project page, click on the Add button in the top right corner to add a new entity.

  2. Select Extract as the entity type.

  3. Fill in the required fields and click on the Create button. You have successfully created your first Extract entity.

Base Attributes

  • name: The name of the extract.

  • source: The source you want to extract data from. (It is already selected)

  • mode: The mode of the extract. Options are:

    • Overwrite: Fetch all the data from the source every time.

    • Append: Fetch only the new data from the source.

Source-Dependent Attributes (in this case, AWS S3). Check the AWS S3 page for more details.

  • search_prefix: The prefix you want to search for in the bucket.

  • search_pattern: The pattern you want to search for in the bucket.
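For example, for a bucket of daily CSV exports you might point search_prefix at the folder that holds them and use search_pattern to match only the .csv files, then pick the Append mode so each run ingests just the newly added files. These values are only illustrative; check the AWS S3 page for the exact syntax each field expects.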

πŸš€ Run First Execution and Check the Data

  1. Click on the created Extract entity and open the Executions tab. Click the Run button to start your first execution.
  2. While it runs, you can follow the execution logs and other details in the Logs tab, and cancel the execution if you need to. After a while, the execution will complete and you will notice the new dataset in the left explorer. You can check the data by clicking on the dataset.
  3. In the dataset drawer, you can see the data fetched from the source. You can also check the schema and run queries on the data to explore it.
  4. In the same way, you can fetch the other CSV files from the source and create a dataset for each of them.
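Note that each dataset created this way has an alias (such as orders_c90dc0 in the pipeline below), and that alias is how you will reference the dataset from code.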

⌨️ Click Less, Code More: Create First Pipeline

If you have already created your project on the UI, you can clone it to your local machine using the Datazone CLI.

datazone project clone <project-id>

and you will see:

Repository has initialized
πŸ‘‰ Go to repository directory: cd ecommerce-project/

Check your project folder

> cd ecommerce-project/
> ls -ll
ecommerce-project/
β”œβ”€β”€ README.md
β”œβ”€β”€ hello-world.py
β”œβ”€β”€ config.yml
  1. Now we can create our pipeline file. Let’s create a new file named order_reports.py in the project folder.
order_reports.py
from datazone import transform, Input, Dataset
from pyspark.sql import functions as F


@transform(
    input_mapping={
        "orders": Input(Dataset(alias="orders_c90dc0")),
        "order_lines": Input(Dataset(alias="order_lines_8c5238")),
        "customers": Input(Dataset(alias="customers_9cd9ab")),
    }
)
def join_tables(orders, order_lines, customers):
    return orders.join(order_lines, on="OrderID", how="inner").join(
        customers, on="CustomerID", how="inner"
    )


@transform(input_mapping={"joined": Input(join_tables)}, materialized=True)
def sales_by_country(joined):
    country_report = (
        joined.groupBy("Country")
        .agg(
            F.sum("TotalAmount").alias("TotalSales"),
            F.count("OrderID").alias("OrderCount"),
        )
        .orderBy("TotalSales", ascending=False)
    )

    return country_report


@transform(input_mapping={"joined": Input(join_tables)}, materialized=True)
def most_popular_products(joined):
    product_report = (
        joined.groupBy("ProductID")
        .agg(
            F.sum("TotalAmount").alias("TotalSales"),
            F.sum("Quantity").alias("TotalQuantity"),
            F.avg("UnitPrice").alias("AveragePrice"),
            F.count("OrderID").alias("OrderCount"),
        )
        .orderBy("TotalSales", ascending=False)
    )

    return product_report

The code above is a simple pipeline that joins three tables and creates two reports.

  • You can see that we have three functions decorated with @transform. These functions are the steps of the pipeline. You specify the input datasets of a function with the input_mapping argument, and you can persist its output as a dataset with materialized=True.

  • The first function join_tables joins the orders, order_lines, and customers tables.

  • The second function sales_by_country calculates the total sales and order count by country.

  • The third function most_popular_products calculates the total sales, total quantity, average price, and order count by product.

You can create your own pipeline according to your needs. You can also check the Transform Functions page to see the available functions.
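If you need another report, you can follow the same pattern: take an upstream step as input with Input(<function>) and set materialized=True to persist the result as a dataset. The snippet below is a hypothetical sales_by_customer step you could add to order_reports.py; it is only an illustration, and it reuses columns that already appear in the joined dataset above.

@transform(input_mapping={"joined": Input(join_tables)}, materialized=True)
def sales_by_customer(joined):
    # Hypothetical extra report: total sales and order count per customer,
    # built from the same joined dataset as the other two reports.
    return (
        joined.groupBy("CustomerID")
        .agg(
            F.sum("TotalAmount").alias("TotalSales"),
            F.count("OrderID").alias("OrderCount"),
        )
        .orderBy("TotalSales", ascending=False)
    )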

  2. Then you need to reference this pipeline in the config.yml file.
config.yml
project_name: ecommerce-project
project_id: 673f8ef62a466524757a7de1
pipelines:
- alias: order_reports
  path: order_reports.py
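Each entry under pipelines maps an alias to a Python file in the project, so if you add more pipeline files later you can register each of them as an additional alias/path pair in the same list.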
  3. Let’s deploy the pipeline by running the following command in the project folder.
datazone project deploy

You can see the deployed pipeline by clicking on the newly created pipeline on the project page.

  4. Let’s execute the pipeline by running the following command in the project folder.
datazone execution run --pipeline-id <pipeline-id>
There is often more than one way to do things in Datazone: you can run your pipeline via the UI, the CLI, or the API.
  5. While the execution is running, you can follow the logs both in the terminal and in the UI. After it completes, you can review the logs and the output dataset in the Executions tab.
  6. Our new dataset is ready to use. You can check and explore the data in the dataset drawer.

⏰ Orchestrate Your Pipeline

  1. Select the pipeline you want to schedule in the Explorer.

  2. Open the Schedules tab and click on the + Set Schedule button.

  3. Attributes are:

    • pipeline: The pipeline you want to schedule. (It is already selected)

    • name: The name of the schedule.

    • expression: The cron expression for the schedule. You can use the presets or write your own.
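For example, the standard cron expression 0 2 * * * would run the pipeline every day at 02:00.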

πŸ† Access the Data

SQL Interfaces

REST API Access

You can use the Datazone API to access your data programmatically. You can check the API Reference page for more details.

curl -X GET "https://api.datazone.co/v1/datasets/<dataset>/data" -H "Authorization: <your-token>"
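If you prefer to call the API from code, you can make the same request from Python. The sketch below assumes the requests library and a JSON response; the dataset id and token are placeholders, just as in the curl example.

import requests

DATASET_ID = "<dataset>"    # placeholder, as in the curl example above
API_TOKEN = "<your-token>"  # placeholder access token

# Fetch the dataset's data from the Datazone REST API.
response = requests.get(
    f"https://api.datazone.co/v1/datasets/{DATASET_ID}/data",
    headers={"Authorization": API_TOKEN},
)
response.raise_for_status()

# Assuming the endpoint returns JSON; adjust if your response format differs.
print(response.json())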