Transform functions are the building blocks of your pipeline. You can break your data processing steps into small functions and chain them together to build a pipeline. The `transform` decorator accepts the following parameters:
| Parameter Name | Default | Description |
|---|---|---|
| `compute_fn` | - | The main computation function to be transformed. |
| `name` | - | Name of the transform function. If not provided, the function name is used. |
| `description` | - | Description of the transform function for documentation purposes. |
| `materialized` | `False` | If `True`, the output will be persisted/cached for reuse. |
| `input_mapping` | - | Dictionary defining input dependencies and their sources. Maps input parameter names to their corresponding datasets or transforms. |
| `depends` | - | List of transform functions that must complete before this transform can run. |
| `partition_by` | - | List of column names to partition the output data by. |
| `output_mapping` | - | Dictionary defining how the output should be mapped or stored. |
| `tags` | - | List of tags for categorizing and organizing transforms. |
| `engine` | `pyspark` | The computation engine to use. Options are `pyspark` or `pandas`. |
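To show how these parameters fit together, here is a minimal sketch of a decorated transform. The import path (`from datazone import transform`) and the single-DataFrame function signature are assumptions based on the patterns described on this page, not verbatim API:

```python
from datazone import transform  # assumed import path

@transform(
    name="clean_orders",
    description="Drop cancelled orders from the raw feed.",
    materialized=True,
    partition_by=["order_date"],
    tags=["orders", "cleaning"],
    engine="pyspark",
)
def clean_orders(orders):
    # `orders` would be wired up via input_mapping (see below);
    # with the pyspark engine it arrives as a Spark DataFrame.
    return orders.filter(orders.status != "cancelled")
```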
The transform decorator supports two computation engines: PySpark and Pandas. You can specify the engine using the `engine` parameter.
When using transforms with dependencies (via `input_mapping` or `depends`), all connected transforms must use the same engine. For example, if a transform uses the Pandas engine, all transforms it depends on or that depend on it must also use the Pandas engine.
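The sketch below illustrates this rule with a small Pandas chain. The `Input` usage and import paths are assumptions based on the input mapping patterns described later on this page:

```python
import pandas as pd
from datazone import transform, Input  # assumed import path

@transform(engine="pandas")
def load_prices() -> pd.DataFrame:
    # Build a small in-memory DataFrame for illustration.
    return pd.DataFrame({"symbol": ["AAA", "BBB"], "price": [10.0, 20.0]})

# Because load_prices uses the pandas engine, this downstream
# transform must also use the pandas engine.
@transform(
    engine="pandas",
    input_mapping={"prices": Input(load_prices)},
)
def add_tax(prices: pd.DataFrame) -> pd.DataFrame:
    prices["price_with_tax"] = prices["price"] * 1.2
    return prices
```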
Materialization is the process of storing the output of a transform function for reuse. This can be useful when a transform function is computationally expensive and its output is used multiple times in the pipeline.
Inside a transform, you can use `context.resources["pyspark"].spark` to access the PySpark session. For more information, check the Context section.

After running the pipeline, the output of the `fetch_data` function will be stored in Datazone as a dataset. You can check the dataset alias in the Datazone UI or use the `datazone dataset list` command to list all datasets.
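To make this concrete, here is a minimal sketch of what such a materialized `fetch_data` transform might look like. Whether the context is passed as a function argument is an assumption, and the source path is a placeholder:

```python
from datazone import transform  # assumed import path

@transform(materialized=True)
def fetch_data(context):
    # Access the PySpark session through the transform context,
    # as described above.
    spark = context.resources["pyspark"].spark
    # Placeholder source path for illustration only.
    return spark.read.format("delta").load("/data/raw/events")
```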
You can define input mappings to specify the data sources and dependencies for your transform functions. Each input mapping accepts the following parameters:
| Parameter Name | Default | Description |
|---|---|---|
| `entity` | - | The entity to be used as input. It can be a dataset or another transform function. |
| `output_name` | - | If you use a transform that has multiple outputs, you can specify the output name. |
Here are the common input mapping patterns. Each input is wrapped in the `Input` class, and the `Input` class accepts a `Dataset` or another transform function as an argument, as shown in the sketch below.
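The following sketch shows both patterns. The `Dataset(alias=...)` construction and the aliases themselves are illustrative assumptions:

```python
from datazone import transform, Input, Dataset  # assumed import paths

# Reference an existing dataset; the alias value is a placeholder.
raw_orders = Dataset(alias="raw_orders")

@transform(input_mapping={"orders": Input(raw_orders)})
def clean_orders(orders):
    # `orders` holds the data read from the raw_orders dataset.
    return orders.filter(orders.status != "cancelled")

# A transform can also take another transform as its input,
# which chains the two together in the pipeline.
@transform(input_mapping={"orders": Input(clean_orders)})
def daily_totals(orders):
    return orders.groupBy("order_date").count()
```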
Output mapping defines how the output of a transform function should be stored or mapped. You can specify the output mapping using the `output_mapping` parameter.
| Parameter Name | Default | Description |
|---|---|---|
| `dataset` | - | The dataset where the output should be stored. |
| `materialized` | `False` | If `True`, the output will be stored as a materialized dataset. |
| `partition_by` | - | List of column names to partition the output data by. |
| `mode` | `overwrite` | The write mode for the output data. Options are `overwrite` and `append`. |
You can define the output mapping using the `Output` class.
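Here is a sketch of an output mapping that uses the parameters from the table above. The mapping key (`"orders_clean"`) and the dataset aliases are illustrative assumptions:

```python
from datazone import transform, Input, Output, Dataset  # assumed import paths

@transform(
    input_mapping={"orders": Input(Dataset(alias="raw_orders"))},
    output_mapping={
        "orders_clean": Output(
            dataset=Dataset(alias="orders_clean"),
            materialized=True,
            partition_by=["order_date"],
            mode="append",  # append new rows instead of overwriting
        )
    },
)
def clean_orders(orders):
    return orders.filter(orders.status != "cancelled")
```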
Partitioning helps organize and optimize your datasets in our data platform. When you create a transform function, you can specify partition columns using the `partition_by` parameter.
All transformed datasets are stored in Delta Lake format. Choose partition columns based on your most common filtering needs, typically date-based or categorical columns with reasonable cardinality. If you partition by a high-cardinality column, it may lead to a large number of small files, which can impact query performance.
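As a sketch, the transform below partitions its output by a derived date column; the import path, context argument, and source path are assumptions carried over from the earlier examples:

```python
from datazone import transform  # assumed import path

# Partitioning by a date column keeps related rows together, so queries
# that filter on event_date scan only the matching Delta Lake partitions.
@transform(partition_by=["event_date"])
def daily_events(context):
    spark = context.resources["pyspark"].spark
    events = spark.read.format("delta").load("/data/raw/events")  # placeholder path
    return events.withColumn("event_date", events["event_time"].cast("date"))
```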