- Each pipeline should have a unique alias and should be defined in a separate file.
- A pipeline should have at least one transform function.
- You can define dependencies between pipelines using the `depends` or `input_mapping` parameter to create a directed acyclic graph (DAG).
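As a rough illustration of how such dependencies form a DAG, here is a plain-Python sketch. The `Pipeline` class and `depends` keyword below are hypothetical stand-ins for illustration, not the real Datazone SDK:

```python
# Hypothetical sketch: how `depends` between pipelines can form a DAG.
# The Pipeline class here is an illustrative stand-in, not the Datazone SDK.
from graphlib import TopologicalSorter

class Pipeline:
    def __init__(self, alias, depends=None):
        self.alias = alias
        self.depends = depends or []

ingest = Pipeline("ingest")
clean = Pipeline("clean", depends=[ingest])
report = Pipeline("report", depends=[clean])

# Map each pipeline to the aliases it depends on, then resolve a run order.
graph = {p.alias: [d.alias for d in p.depends] for p in (ingest, clean, report)}
order = list(TopologicalSorter(graph).static_order())
print(order)  # ['ingest', 'clean', 'report']
```

A cycle in the graph would make `static_order()` raise, which is the same reason the dependency graph must be acyclic.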
Complex Pipeline Example
Data Flow Management
The `@transform` decorator enables you to define data transformation functions efficiently.
Each function should:
- Accept input data as arguments
- Process the data
- Return the transformed data
Data is handled as PySpark DataFrames for both input and output operations.
- The `clean_data` function takes `input_data` as input. You can check the dataset alias in the Datazone UI or use the `datazone dataset list` command to list all datasets.
- After cleaning the data, the `clean_data` function returns the cleaned data as a lazily evaluated PySpark DataFrame.
- The `aggregate_data` function takes the cleaned data as input, aggregates it, and returns the aggregated data.
- Since the `materialized` parameter is set to `True`, the `aggregate_data` function is materialized and creates a new dataset in Datazone.
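The flow above can be sketched with plain-Python stand-ins. The `transform` decorator and the list-of-dicts "DataFrame" below are hypothetical illustrations of the pattern, not the real Datazone API or PySpark:

```python
# Hypothetical sketch of the clean -> aggregate flow described above.
# `transform` and the list-of-dicts "DataFrame" are illustrative stand-ins.
def transform(materialized=False):
    def wrap(fn):
        fn.materialized = materialized
        return fn
    return wrap

@transform()
def clean_data(input_data):
    # Drop rows with missing values.
    return [row for row in input_data if row["value"] is not None]

@transform(materialized=True)
def aggregate_data(cleaned):
    # Sum values per key.
    totals = {}
    for row in cleaned:
        totals[row["key"]] = totals.get(row["key"], 0) + row["value"]
    return totals

raw = [{"key": "a", "value": 1}, {"key": "a", "value": None}, {"key": "b", "value": 2}]
result = aggregate_data(clean_data(raw))
print(result)                         # {'a': 1, 'b': 2}
print(aggregate_data.materialized)    # True
```

Only the function marked `materialized=True` would persist its output as a new dataset; the intermediate `clean_data` result stays transient.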
Check the Transform section for more information on how to define transforms with the `@transform` decorator.
Transform Selection
When executing a pipeline, you can selectively run specific transforms using the `transform_selection` parameter in the "Run with Config" modal. This allows you to execute only the transforms you need, along with their dependencies if required.
Selection Patterns
| Pattern | Description |
|---|---|
| `some_transform` | Select the transform only |
| `*some_transform` | Select the transform and all ancestors (upstream dependencies) |
| `some_transform*` | Select the transform and all descendants (downstream dependencies) |
| `*some_transform*` | Select the transform with all ancestors and descendants |
| `+some_transform` | Select the transform and its direct parents |
| `some_transform+` | Select the transform and its direct children |
| `some_transform++` | Select the transform and 2 levels of children |
| `some_transform+++` | Select the transform and 3 levels of children |
Use transform selection to optimize execution time by running only the necessary parts of your pipeline during development and testing.
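The selection semantics in the table can be sketched in plain Python over a small DAG. The graph representation and helper functions below are illustrative assumptions, not Datazone internals:

```python
# Hypothetical sketch of the selection-pattern semantics from the table above.
# The children-adjacency graph and helpers are illustrative, not Datazone internals.
def ancestors(graph, node, depth=None):
    """Collect upstream nodes; depth=None means unlimited levels."""
    found, frontier, level = set(), {node}, 0
    while frontier and (depth is None or level < depth):
        frontier = {p for p in graph for c in frontier
                    if c in graph[p] and p not in found}
        found |= frontier
        level += 1
    return found

def descendants(graph, node, depth=None):
    """Collect downstream nodes; depth=None means unlimited levels."""
    found, frontier, level = set(), {node}, 0
    while frontier and (depth is None or level < depth):
        frontier = {c for n in frontier for c in graph.get(n, ()) if c not in found}
        found |= frontier
        level += 1
    return found

def select(graph, pattern):
    """Resolve a selection pattern against a children-adjacency graph."""
    core = pattern.strip("*+")
    selected = {core}
    if pattern.startswith("*"):
        selected |= ancestors(graph, core)            # all upstream
    elif pattern.startswith("+"):
        selected |= ancestors(graph, core, depth=1)   # direct parents
    if pattern.endswith("*"):
        selected |= descendants(graph, core)          # all downstream
    elif pattern.endswith("+"):
        levels = len(pattern) - len(pattern.rstrip("+"))
        selected |= descendants(graph, core, depth=levels)
    return selected

graph = {"a": ["b"], "b": ["c"], "c": []}  # a -> b -> c
print(sorted(select(graph, "*b")))   # ['a', 'b']
print(sorted(select(graph, "b*")))   # ['b', 'c']
print(sorted(select(graph, "a++")))  # ['a', 'b', 'c']
```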
Usage Examples
For the pipeline example above with `prepare`, `build_project`, `build_report`, and `notify_email`:
- `build_project` - Runs only the `build_project` transform
- `*build_project` - Runs `prepare` and `build_project` (the transform with all ancestors)
- `prepare*` - Runs `prepare`, `build_project`, and `build_report` (the transform with all descendants)
- `*notify_email*` - Runs the entire pipeline (the transform with all ancestors and descendants)