Pipeline
Pipelines are the main building blocks of your project. You can define your data processing steps in a pipeline file and deploy it to Datazone.
- Each pipeline should have a unique alias and should be defined in a separate file.
- A pipeline should have at least one transform function.
- You can define dependencies between pipelines using the `depends` or `input_mapping` parameter to create a directed acyclic graph (DAG).
Complex Pipeline Example
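The original example code is not reproduced on this page, so the sketch below reconstructs it from the walkthrough in the Data Flow Management section. The `datazone` import path and the `Input`/`Dataset` helpers (and the `"raw_events"` alias) are assumptions; only `@transform`, `input_mapping`, and `materialized` come from this page, so check the SDK reference for exact signatures.

```python
# Minimal sketch of a pipeline file, reconstructed from the walkthrough below.
# The import path and the Input/Dataset helpers are assumptions.
from datazone import transform, Input, Dataset  # assumed import path

@transform(
    # Map the function argument to a dataset by alias ("raw_events" is hypothetical).
    input_mapping={"input_data": Input(Dataset(alias="raw_events"))},
)
def clean_data(input_data):
    # input_data is a PySpark DataFrame; drop rows with missing values.
    return input_data.dropna()

@transform(
    materialized=True,  # persist the result as a new dataset in Datazone
    # Depend on the clean_data transform to form the DAG.
    input_mapping={"cleaned": Input(clean_data)},
)
def aggregate_data(cleaned):
    # Aggregate the cleaned DataFrame; evaluation stays lazy until materialization.
    return cleaned.groupBy("category").count()
```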
Data Flow Management
The `@transform` decorator enables you to define data transformation functions efficiently.
Each function should:
- Accept input data as arguments
- Process the data
- Return the transformed data
Data is handled as PySpark DataFrames both for input and output operations.
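As an illustration of that contract, here is a minimal transform that receives and returns a PySpark DataFrame. The `orders_raw` alias, the `amount` column, and the `Input`/`Dataset` helpers are hypothetical, as in the sketch above.

```python
from datazone import transform, Input, Dataset  # assumed import path, as above

@transform(input_mapping={"orders": Input(Dataset(alias="orders_raw"))})  # hypothetical alias
def filter_orders(orders):
    # `orders` arrives as a PySpark DataFrame; return a transformed DataFrame.
    return orders.filter(orders.amount > 0)
```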
In the example above:
- The `clean_data` function takes `input_data` as input. You can check the dataset alias in the Datazone UI or use the `datazone dataset list` command to list all datasets.
- After cleaning the data, the `clean_data` function returns the cleaned data as a lazily evaluated PySpark DataFrame.
- The `aggregate_data` function takes the cleaned data as input, aggregates it, and returns the aggregated data.
- Since the `materialized` parameter is set to `True`, the `aggregate_data` function will be materialized and create a new dataset in Datazone.
Check the Transform section for more information on how to define transform functions with the `@transform` decorator.