Simple Pipeline Example

from datazone import transform

@transform
def say_hello():
    print("Hello, World!")

@transform(depends=[say_hello])
def say_goodbye():
    print("Goodbye, World!")
  • Each pipeline should have a unique alias and should be defined in its own file.
  • A pipeline should have at least one transform function.
  • You can define dependencies between transforms using the depends or input_mapping parameter to create a directed acyclic graph (DAG); both styles are shown in the sketch below.
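
As a quick reference, here is a minimal sketch of both dependency styles side by side. The extract/validate/load names are hypothetical; Input is used exactly as in the Data Flow Management example later on this page:

from datazone import transform, Input

@transform
def extract():
    ...  # returns a PySpark DataFrame in a real pipeline

# Control-flow dependency: run extract before validate; no data is passed.
@transform(depends=[extract])
def validate():
    ...

# Data dependency: extract's output DataFrame arrives as the `data` argument.
@transform(input_mapping={'data': Input(extract)})
def load(data):
    return data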

Complex Pipeline Example

from datazone import transform

@transform
def prepare():
    print("Preparing data...")

@transform(depends=[prepare])
def build_project():
    print("Building project...")

@transform(depends=[prepare])
def build_report():
    print("Building report...")

@transform(depends=[build_project, build_report])
def notify_email():
    print("Sending email notification...")

Data Flow Management

The @transform decorator enables you to define data transformation functions efficiently. Each function should:
  • Accept input data as arguments
  • Process the data
  • Return the transformed data
Data is handled as PySpark DataFrames for both input and output operations.
from datazone import transform, Input, Dataset

@transform(input_mapping={'data': Input(Dataset(alias='input_data'))})
def clean_data(data):
    # Keep only rows with a positive value in 'column'.
    return data.filter(data['column'] > 0)

@transform(input_mapping={'clean_data': Input(clean_data)}, materialized=True)
def aggregate_data(clean_data):
    # Sum 'column' per distinct value; materialized=True persists the result.
    return clean_data.groupBy('column').agg({'column': 'sum'})
In the above example:
  1. The clean_data function takes the input_data dataset as input. You can check a dataset's alias in the Datazone UI or run the datazone dataset list command to list all datasets.
  2. After cleaning the data, the clean_data function returns the cleaned data as a lazily evaluated PySpark DataFrame.
  3. The aggregate_data function takes the cleaned data as input, aggregates it, and returns the result.
  4. Because the materialized parameter is set to True, the aggregate_data function is materialized and creates a new dataset in Datazone.
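
To see what lazy evaluation means here, the same DataFrame operations can be run as plain PySpark outside Datazone. The sample data and session setup below are illustrative assumptions, not part of the Datazone API:

from pyspark.sql import SparkSession

# Illustrative stand-in for the 'input_data' dataset.
spark = SparkSession.builder.appName("lazy-eval-demo").getOrCreate()
data = spark.createDataFrame([(1,), (-2,), (3,), (3,)], ["column"])

# Transformations are lazy: these lines only build an execution plan.
cleaned = data.filter(data["column"] > 0)
aggregated = cleaned.groupBy("column").agg({"column": "sum"})

# Actions trigger execution; this is roughly what happens when Datazone
# materializes aggregate_data's output as a dataset.
aggregated.show()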
Check the Transform section for more information on how to define a transform decorator.

Transform Selection

When executing a pipeline, you can selectively run specific transforms using the transform_selection parameter in the “Run with Config” modal. This allows you to execute only the transforms you need, along with their dependencies if required.

Selection Patterns

Pattern              Description
some_transform       Select the transform only
*some_transform      Select the transform and all ancestors (upstream dependencies)
some_transform*      Select the transform and all descendants (downstream dependencies)
*some_transform*     Select the transform with all ancestors and descendants
+some_transform      Select the transform and its direct parents
some_transform+      Select the transform and its direct children
some_transform++     Select the transform and 2 levels of children
some_transform+++    Select the transform and 3 levels of children
Use transform selection to optimize execution time by running only the necessary parts of your pipeline during development and testing.

Usage Examples

For the pipeline example above with prepare, build_project, build_report, and notify_email:
  • build_project - Runs only the build_project transform
  • *build_project - Runs prepare and build_project (transform with all ancestors)
  • prepare* - Runs prepare, build_project, and build_report (transform with all descendants)
  • *notify_email* - Runs the entire pipeline (transform with all ancestors and descendants)
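
To make the pattern semantics concrete, here is a small standalone Python sketch, not Datazone's implementation, that resolves selection patterns against the example DAG above:

# Direct parents of each transform in the example pipeline.
deps = {
    "prepare": [],
    "build_project": ["prepare"],
    "build_report": ["prepare"],
    "notify_email": ["build_project", "build_report"],
}
# Invert the parent map to get each transform's direct children.
children = {n: [c for c, ps in deps.items() if n in ps] for n in deps}

def walk(start, edges, levels=None):
    """Collect nodes reachable from `start` via `edges`, up to `levels` hops."""
    seen, frontier, depth = set(), {start}, 0
    while frontier and (levels is None or depth < levels):
        frontier = {m for n in frontier for m in edges[n]} - seen
        seen |= frontier
        depth += 1
    return seen

def select(pattern):
    """Resolve a selection pattern to the set of transforms to run."""
    all_up, all_down = pattern.startswith("*"), pattern.endswith("*")
    name = pattern.strip("*")
    down_levels = len(name) - len(name.rstrip("+"))
    name = name.rstrip("+")
    up_levels = len(name) - len(name.lstrip("+"))
    name = name.lstrip("+")

    selected = {name}
    if all_up:
        selected |= walk(name, deps)
    elif up_levels:
        selected |= walk(name, deps, up_levels)
    if all_down:
        selected |= walk(name, children)
    elif down_levels:
        selected |= walk(name, children, down_levels)
    return selected

print(select("*build_project"))  # {'prepare', 'build_project'}
print(select("prepare*"))        # prepare, build_project, build_report, notify_email
print(select("*notify_email*"))  # the entire pipeline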