Transform functions are the building blocks of your pipeline. You can break your data processing into small, single-purpose functions and chain them together to build a pipeline.
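
As a plain-Python illustration of this idea (no Datazone API is involved; the function names and sample rows are made up for this sketch), small single-purpose steps can be composed into one pipeline:

```python
from functools import reduce


def drop_missing(rows):
    """Keep only rows that have a non-null amount."""
    return [r for r in rows if r.get("amount") is not None]


def add_tax(rows, rate=0.2):
    """Return new rows with a tax column derived from amount."""
    return [{**r, "tax": round(r["amount"] * rate, 2)} for r in rows]


def total_due(rows):
    """Collapse the rows into a single total."""
    return sum(r["amount"] + r["tax"] for r in rows)


def chain(*steps):
    """Compose steps left to right into one callable."""
    return lambda data: reduce(lambda acc, step: step(acc), steps, data)


pipeline = chain(drop_missing, add_tax, total_due)
sample = [{"amount": 100.0}, {"amount": None}, {"amount": 50.0}]
result = pipeline(sample)
```

Each step can be tested on its own, and reordering or replacing a step does not touch the others. In a real pipeline the chaining is expressed through the transform parameters listed below rather than a helper like `chain`.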
Parameter Name | Default | Description |
---|---|---|
compute_fn | - | The main computation function to be transformed. |
name | - | Name of the transform function. If not provided, uses the function name. |
description | - | Description of the transform function for documentation purposes. |
materialized | False | If True, the output will be persisted/cached for reuse. |
input_mapping | - | Dictionary defining input dependencies and their sources. Maps input parameter names to their corresponding datasets or transforms. |
depends | - | List of transform functions that must complete before this transform can run. |
partition_by | - | List of column names to partition the output data by. |
output_mapping | - | Dictionary defining how the output should be mapped or stored. |
tags | - | List of tags for categorizing and organizing transforms. |
engine | pyspark | The computation engine to use. Options are `pyspark` or `pandas`. |

The computation engine is selected with the `engine` parameter.
When transforms are connected (through `input_mapping` or `depends`), all connected transforms must use the same engine.
For example, if a transform uses the Pandas engine, all transforms it depends on or that depend on it must also use the Pandas engine.

Use `context.resources["pyspark"].spark` to access the PySpark session. For more information, check the Context section.

The output of the `fetch_data` function will be stored in Datazone as a dataset.
You can check the dataset alias in the Datazone UI or use the `datazone dataset list` command to list all datasets.
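
The same-engine rule for connected transforms can be checked before a pipeline is submitted. The following is a minimal sketch of such a check, not part of Datazone: `find_engine_conflicts`, the transform names other than `fetch_data`, and the edge list are all hypothetical. It assumes you can list each transform's engine and the links created through `input_mapping` or `depends`.

```python
def find_engine_conflicts(engines, edges):
    """Return sorted (upstream, downstream) pairs whose engines differ.

    engines: dict mapping transform name -> engine name
    edges:   iterable of (upstream, downstream) pairs, one per link
             created through input_mapping or depends
    """
    return sorted(
        (up, down) for up, down in edges if engines[up] != engines[down]
    )


# Hypothetical pipeline: two Pandas transforms feeding a PySpark one.
engines = {
    "fetch_data": "pandas",
    "clean_data": "pandas",
    "build_report": "pyspark",
}
edges = [("fetch_data", "clean_data"), ("clean_data", "build_report")]

conflicts = find_engine_conflicts(engines, edges)
for up, down in conflicts:
    print(f"{up} ({engines[up]}) -> {down} ({engines[down]}): engine mismatch")
```

Checking every direct link is enough: if no link joins two different engines, every group of connected transforms uses a single engine. Because `engine` defaults to `pyspark`, a transform that omits it while sitting downstream of a Pandas transform would show up in this check.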
Parameter Name | Default | Description |
---|---|---|
entity | - | The entity to be used as input. It can be a dataset or another transform function. |
output_name | - | If you use a transform that has multiple outputs, you can specify the output name. |
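
To show how the two parameters in the table fit together, here is a sketch built on stand-in classes. The `Dataset` and `Input` classes below are hypothetical look-alikes defined only for this example, not the Datazone classes; the `alias` field, the `describe` helper, and the names `orders_raw` and `split_data` are assumptions as well.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Union


@dataclass(frozen=True)
class Dataset:
    """Hypothetical stand-in for a dataset reference (not the Datazone class)."""
    alias: str


@dataclass(frozen=True)
class Input:
    """Hypothetical stand-in mirroring the entity/output_name parameters."""
    entity: Union[Dataset, Callable[..., Any]]
    output_name: Optional[str] = None


def describe(inp):
    """Return a readable description of where an input comes from."""
    if isinstance(inp.entity, Dataset):
        source = f"dataset:{inp.entity.alias}"
    else:
        source = f"transform:{inp.entity.__name__}"
    # output_name only matters when the upstream has several outputs
    return f"{source}[{inp.output_name}]" if inp.output_name else source


def split_data():
    """Hypothetical upstream transform with two named outputs."""
    return {"train": [], "test": []}


input_mapping = {
    "orders": Input(Dataset(alias="orders_raw")),
    "train": Input(split_data, output_name="train"),
}
summary = {name: describe(inp) for name, inp in input_mapping.items()}
```

The keys of `input_mapping` are the parameter names of the transform function, and each value says where that parameter's data comes from.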

Inputs are defined with the `Input` class. The `Input` class accepts a `Dataset` or another transform function as an argument.

Outputs are configured with the `output_mapping` parameter.
Parameter Name | Default | Description |
---|---|---|
dataset | - | The dataset where the output should be stored. |
materialized | False | If True, the output will be stored as a materialized dataset. |
partition_by | - | List of column names to partition the output data by. |
mode | overwrite | The write mode for the output data. Options are `overwrite` and `append`. |
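
The difference between the two `mode` values can be illustrated without any storage backend. This is a plain-Python sketch with a hypothetical `write` helper that treats a dataset as a list of rows; it only models the semantics of the two modes and says nothing about how Datazone stores data.

```python
def write(existing, new_rows, mode="overwrite"):
    """Return the dataset contents after writing new_rows with the given mode."""
    if mode == "overwrite":
        # Previous contents are discarded.
        return list(new_rows)
    if mode == "append":
        # Previous contents are kept and the new rows are added after them.
        return list(existing) + list(new_rows)
    raise ValueError(f"unsupported mode: {mode!r}")


stored = [{"id": 1}, {"id": 2}]
batch = [{"id": 3}]

after_overwrite = write(stored, batch)
after_append = write(stored, batch, mode="append")
```

With `append`, running the same transform twice on the same batch would store its rows twice, which is worth keeping in mind when a run is retried.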

Outputs are described with the `Output` class.

Success and failure handling is configured with the `on_success` and `on_failure` parameters in the `transform` decorator.

Output data can be partitioned with the `partition_by` parameter.
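
To make the effect of `partition_by` concrete, here is a plain-Python sketch that groups rows by the values of the listed columns. It does not use Datazone: `partition_rows` and the sample rows are hypothetical, and how partitions are laid out in storage is not covered here.

```python
from collections import defaultdict


def partition_rows(rows, partition_by):
    """Group rows by the tuple of values found in the partition columns."""
    partitions = defaultdict(list)
    for row in rows:
        key = tuple(row[col] for col in partition_by)
        partitions[key].append(row)
    return dict(partitions)


rows = [
    {"country": "DE", "year": 2023, "sales": 10},
    {"country": "DE", "year": 2024, "sales": 12},
    {"country": "FR", "year": 2024, "sales": 7},
    {"country": "DE", "year": 2024, "sales": 3},
]
partitions = partition_rows(rows, partition_by=["country", "year"])
```

Each distinct combination of column values becomes one partition, so columns with a small number of distinct values (such as a country or a year) tend to make better partition keys than unique identifiers.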