- Each pipeline should have a unique alias and should be defined in a separate file.
- A pipeline should have at least one transform function.
- You can define dependencies between pipelines using the `depends` or `input_mapping` parameter to create a directed acyclic graph (DAG).
## Complex Pipeline Example
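Below is a minimal sketch of a multi-step pipeline whose transforms are wired into a DAG by feeding each transform's output into the next via `input_mapping`. The import path, the `Input`/`Dataset` helpers, the decorator signature, and the `orders_raw` dataset alias are assumptions for illustration, not a verbatim Datazone API reference.

```python
# Sketch only: the import path, Input/Dataset helpers, and decorator
# signature are assumed for illustration; "orders_raw" is a hypothetical
# dataset alias.
from datazone import transform, Input, Dataset


@transform(input_mapping={"orders": Input(Dataset(alias="orders_raw"))})
def filter_completed(orders):
    # Keep only completed orders; the PySpark DataFrame stays lazy.
    return orders.filter(orders.status == "completed")


@transform(input_mapping={"completed": Input(filter_completed)})
def add_totals(completed):
    # Derive a line-total column from quantity and unit price.
    return completed.withColumn("total", completed.quantity * completed.unit_price)


@transform(
    input_mapping={"with_totals": Input(add_totals)},
    materialized=True,  # persist the aggregate as a new dataset in Datazone
)
def revenue_by_day(with_totals):
    # Mapping each transform's input to the previous one yields a DAG:
    # filter_completed -> add_totals -> revenue_by_day.
    return with_totals.groupBy("order_date").sum("total")
```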
## Data Flow Management
The `@transform` decorator enables you to define data transformation functions.
Each function should:
- Accept input data as arguments
- Process the data
- Return the transformed data
Data is handled as PySpark DataFrames for both input and output.
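The notes below walk through a `clean_data`/`aggregate_data` pair. A sketch of what such transforms might look like follows; the decorator signature and the `Input`/`Dataset` helpers are assumptions, and the `raw_data` alias is hypothetical.

```python
# Sketch only: the decorator signature, Input/Dataset helpers, and the
# "raw_data" alias are assumptions for illustration.
from datazone import transform, Input, Dataset


@transform(input_mapping={"input_data": Input(Dataset(alias="raw_data"))})
def clean_data(input_data):
    # Drop incomplete rows and duplicates; the result is still a lazy
    # PySpark DataFrame (no Spark job runs yet).
    return input_data.dropna().dropDuplicates()


@transform(
    input_mapping={"cleaned": Input(clean_data)},
    materialized=True,  # creates a new dataset in Datazone
)
def aggregate_data(cleaned):
    # Aggregate the cleaned data per category.
    return cleaned.groupBy("category").count()
```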
- The `clean_data` function takes `input_data` as input. You can check the dataset alias in the Datazone UI or use the `datazone dataset list` command to list all datasets.
- After cleaning the data, the `clean_data` function returns the cleaned data as a lazily evaluated PySpark DataFrame.
- The `aggregate_data` function takes the cleaned data as input, aggregates it, and returns the aggregated data.
- Since the `materialized` parameter is set to `True`, the `aggregate_data` function is materialized and creates a new dataset in Datazone.
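The lazy-evaluation point above is standard PySpark behavior: transformations only build a query plan, and nothing executes until an action runs. A self-contained illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, None), (2, None)], ["id", "value"])

# Transformations are lazy: these lines only build a query plan.
cleaned = df.dropna().dropDuplicates()

# An action such as count() triggers actual execution.
print(cleaned.count())  # -> 1
```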
Check the Transform section for more information on how to define transform functions.