Pipelines are the main building blocks of your project. You can define your data processing steps in a pipeline file and deploy it to Datazone.
Use the `depends` or `input_mapping` parameter to create a directed acyclic graph (DAG) of your processing steps. The `@transform` decorator enables you to define data transformation functions efficiently.
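To make the DAG idea concrete, here is a minimal sketch of how a `depends`-style parameter can wire transform functions into a dependency graph. Datazone's real `@transform` decorator is assumed to work along these lines; the registry and scheduler below are illustrative stand-ins, not its actual API.

```python
# Stand-in for a `depends`-based DAG: each decorated function registers
# itself plus the names of its upstream transforms, and the pipeline is
# executed in topological (dependency) order.
from graphlib import TopologicalSorter

REGISTRY = {}  # transform name -> (function, list of upstream names)

def transform(depends=()):
    """Illustrative stand-in for a Datazone-style @transform decorator."""
    def wrap(fn):
        REGISTRY[fn.__name__] = (fn, list(depends))
        return fn
    return wrap

@transform()
def extract():
    return [1, 2, 3]

@transform(depends=["extract"])
def clean(extract):
    return [x * 10 for x in extract]

@transform(depends=["extract", "clean"])
def report(extract, clean):
    return {"raw": sum(extract), "cleaned": sum(clean)}

def run_pipeline():
    """Run every registered transform after its dependencies."""
    graph = {name: deps for name, (_, deps) in REGISTRY.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        fn, deps = REGISTRY[name]
        results[name] = fn(*(results[d] for d in deps))
    return results

print(run_pipeline()["report"])  # {'raw': 6, 'cleaned': 60}
```

Because every transform declares its upstreams explicitly, the scheduler can derive the execution order instead of relying on the order the functions appear in the file.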
Each function should take its input datasets as arguments and return a DataFrame.

The `clean_data` function takes `input_data` as input. You can check a dataset's alias in the Datazone UI or use the `datazone dataset list` command to list all datasets. `clean_data` returns the cleaned PySpark DataFrame; like all Spark transformations, it is lazily evaluated.

The `aggregate_data` function takes the cleaned data as input, aggregates it, and returns the aggregated data. Because its `materialized` parameter is set to `True`, the output of `aggregate_data` is materialized and creates a new dataset in Datazone.
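The pipeline described above can be sketched end to end. Datazone's real decorator operates on PySpark DataFrames and persists materialized outputs as datasets; the stand-in below uses plain lists of dicts and an in-memory store so the shape of the pipeline stays visible without a Spark session. Everything except the names `clean_data`, `aggregate_data`, `input_mapping`, and `materialized` (which come from the text) is an assumption for illustration.

```python
# Hedged sketch of the clean_data -> aggregate_data pipeline.
MATERIALIZED = {}  # stand-in for datasets persisted back to Datazone

def transform(input_mapping=None, materialized=False):
    """Illustrative stand-in for a Datazone-style @transform decorator."""
    def wrap(fn):
        def run(**upstream):
            out = fn(**upstream)
            if materialized:  # materialized=True -> create a new dataset
                MATERIALIZED[fn.__name__] = out
            return out
        run.input_mapping = input_mapping or {}
        return run
    return wrap

@transform(input_mapping={"input_data": "raw_events"})
def clean_data(input_data):
    # Drop rows with missing values (in real code: a lazy PySpark plan).
    return [row for row in input_data if row["value"] is not None]

@transform(input_mapping={"cleaned": "clean_data"}, materialized=True)
def aggregate_data(cleaned):
    # Sum values per user.
    totals = {}
    for row in cleaned:
        totals[row["user"]] = totals.get(row["user"], 0) + row["value"]
    return totals

raw = [
    {"user": "a", "value": 1},
    {"user": "a", "value": None},
    {"user": "b", "value": 5},
]
aggregate_data(cleaned=clean_data(input_data=raw))
print(MATERIALIZED["aggregate_data"])  # {'a': 1, 'b': 5}
```

Only `aggregate_data` lands in the materialized store: `clean_data` is an intermediate step, which mirrors how a lazily evaluated transform produces a persisted dataset only when `materialized` is set to `True`.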