Overview

A Datazone project is a Git repository containing your data pipelines, actions, intelligent apps, and endpoints. All project components are defined in a central config.yml file.

Project Structure

my-project/
├── config.yml
└── pipelines/
    ├── hello_world.py
    ├── data_processing.py
    └── utils.py
  • Each project must have a config.yml file
  • Each pipeline must have a unique alias and be defined in its own file
  • You can organize your code with utility files for shared logic
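The rules above can be checked mechanically. The sketch below is illustrative only (it is not a Datazone tool, and it takes the config as an already-parsed Python dict rather than reading YAML):

```python
def validate_project(config: dict) -> list[str]:
    """Return a list of problems found in a project config (illustrative)."""
    problems = []
    # Every project needs project_name and project_id in config.yml.
    for field in ("project_name", "project_id"):
        if field not in config:
            problems.append(f"missing required field: {field}")
    pipelines = config.get("pipelines", [])
    # Pipeline aliases must be unique.
    aliases = [p.get("alias") for p in pipelines]
    for alias in set(aliases):
        if aliases.count(alias) > 1:
            problems.append(f"duplicate pipeline alias: {alias}")
    # Each pipeline should live in its own file.
    paths = [p.get("path") for p in pipelines]
    for path in set(paths):
        if paths.count(path) > 1:
            problems.append(f"pipelines share a file: {path}")
    return problems
```

An empty result means the config passes these structural checks; anything returned is a human-readable problem description.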

Configuration File

The config.yml file defines all project resources:
config.yml
project_name: my-pretty-project
project_id: 67280ba2f4a0960d02159675

pipelines:
  - alias: hello_world
    path: pipelines/hello_world.py
    compute: LARGE
    spark_config:
      deploy_mode: client
      executor_instances: 3
    python_dependencies:
      - name: pandas
        version: 1.3.3

actions:
  - path: actions/send_notification.py

apps:
  - path: apps/my_app.py

endpoints:
  - path: endpoints/api_config.yml

Configuration Reference

Project Fields

project_name
string
required
Name of your project. Used for display and identification.
project_name: data-processing-platform
project_id
string
required
Unique identifier for your project. Generated when you create a project.
project_id: 67280ba2f4a0960d02159675
pipelines
array
List of data pipeline definitions. Each pipeline processes and transforms data.
pipelines:
  - alias: etl_pipeline
    path: pipelines/etl.py
actions
array
List of serverless action functions. Actions can be triggered by endpoints or used by AI agents.
actions:
  - path: actions/send_email.py
Learn more in the Actions documentation.
apps
array
List of intelligent AI applications.
apps:
  - path: apps/customer_support.py
endpoints
array
List of API endpoint configurations.
endpoints:
  - path: endpoints/webhooks.yml
Learn more in the Endpoints documentation.

Pipeline Configuration

alias
string
required
Short, unique identifier for the pipeline. Used in CLI commands and UI.
alias: daily_etl
path
string
required
Relative path to the pipeline Python file from project root.
path: pipelines/etl_pipeline.py
compute
string
default:"SMALL"
Compute instance size for pipeline execution. Available sizes:
  • XSMALL - 2 vCPU, 8 GB RAM
  • SMALL - 4 vCPU, 16 GB RAM
  • MEDIUM - 8 vCPU, 32 GB RAM
  • LARGE - 16 vCPU, 64 GB RAM
  • XLARGE - 32 vCPU, 128 GB RAM (Enterprise only)
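Each tier doubles the resources of the one below it. A small lookup (values taken directly from the table above; this helper is illustrative, not part of the platform) makes the mapping concrete:

```python
# (vCPU, RAM in GB) per compute tier, as listed above.
COMPUTE_SIZES = {
    "XSMALL": (2, 8),
    "SMALL": (4, 16),
    "MEDIUM": (8, 32),
    "LARGE": (16, 64),
    "XLARGE": (32, 128),  # Enterprise only
}

def resources(size: str = "SMALL") -> tuple[int, int]:
    """Return (vCPU, RAM GB) for a tier; the default mirrors the config default."""
    return COMPUTE_SIZES[size.upper()]
```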
compute: LARGE
spark_config.deploy_mode
string
default:"local"
Spark deployment mode for distributed processing. Options:
  • local - Runs on a single machine (default)
  • client - Driver runs in the submitting process; executors run on separate workers
  • cluster - Both driver and executors run in separate processes (Enterprise only)
spark_config:
  deploy_mode: client
spark_config.executor_instances
integer
Number of Spark executors for parallel processing. Only applies when deploy_mode is client or cluster.
spark_config:
  deploy_mode: client
  executor_instances: 5
spark_config.extra_spark_config
object
Additional Spark configuration properties. Pass any custom Spark configuration key-value pairs.
spark_config:
  deploy_mode: client
  executor_instances: 3
  extra_spark_config:
    spark.sql.shuffle.partitions: "200"
    spark.default.parallelism: "100"
    spark.sql.adaptive.enabled: "true"
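To see how a spark_config block relates to raw Spark properties, here is an illustrative flattening. The property names spark.master, spark.submit.deployMode, and spark.executor.instances are standard Spark keys, but how Datazone performs this translation internally is an assumption:

```python
def to_spark_properties(spark_config: dict) -> dict:
    """Flatten a spark_config block into Spark property key-value pairs
    (illustrative sketch; not Datazone's actual implementation)."""
    mode = spark_config.get("deploy_mode", "local")
    if mode == "local":
        # Single-machine mode corresponds to a local Spark master.
        props = {"spark.master": "local[*]"}
    else:
        props = {"spark.submit.deployMode": mode}
    if "executor_instances" in spark_config:
        props["spark.executor.instances"] = str(spark_config["executor_instances"])
    # extra_spark_config entries are passed through verbatim.
    props.update(spark_config.get("extra_spark_config", {}))
    return props
```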
python_dependencies
array
Python packages required by the pipeline. Installed before execution.
python_dependencies:
  - name: pandas
    version: 2.0.0
  - name: requests
    version: 2.31.0
    index_url: https://pypi.org/simple

Python Dependency Fields

name
string
required
Python package name from PyPI or custom index.
- name: numpy
version
string
Specific package version. If omitted, installs the latest version.
- name: pandas
  version: 2.0.0
index_url
string
Custom Python package index URL. Useful for private packages or mirrors.
- name: my-private-package
  version: 1.2.3
  index_url: https://pypi.mycompany.com/simple
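A dependency entry maps closely onto a pip invocation. The helper below (illustrative only; the platform's actual installer may differ) renders the equivalent command:

```python
def pip_command(dep: dict) -> str:
    """Render a python_dependencies entry as an equivalent pip command
    (illustrative; not how Datazone necessarily installs packages)."""
    spec = dep["name"]
    if "version" in dep:
        spec += f'=={dep["version"]}'  # pin the exact version
    cmd = ["pip", "install"]
    if "index_url" in dep:
        cmd += ["--index-url", dep["index_url"]]  # custom package index
    cmd.append(spec)
    return " ".join(cmd)
```

For example, the private-package entry above corresponds to pip install --index-url https://pypi.mycompany.com/simple my-private-package==1.2.3.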

Action Configuration

path
string
required
Relative path to the action Python file containing an @action decorated function.
actions:
  - path: actions/send_notification.py
  - path: workflows/data_validator.py
Each file should contain one action function. Learn more in the Actions documentation.
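As a sketch of what such a file might look like: the real import path of the @action decorator is not shown in this reference, so the snippet defines a minimal stand-in with the same shape purely for illustration:

```python
# Stand-in for the platform's @action decorator (the actual import path
# is not documented here, so we define a minimal equivalent inline).
def action(func):
    func.is_action = True  # mark the function as a registered action
    return func

# A file like actions/send_notification.py would contain one such function:
@action
def send_notification(payload: dict) -> str:
    """Send a notification for the given payload (illustrative body)."""
    user = payload.get("user", "unknown")
    return f"notification sent to {user}"
```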

App Configuration

path
string
required
Relative path to the intelligent app Python file.
apps:
  - path: apps/customer_assistant.py

Endpoint Configuration

path
string
required
Relative path to the endpoint configuration YAML file.
endpoints:
  - path: endpoints/webhooks.yml
  - path: endpoints/api_routes.yml
Learn more in the Endpoints documentation.

Example Projects

Data Processing Pipeline

project_name: sales-analytics
project_id: 67280ba2f4a0960d02159675

pipelines:
  - alias: daily_sales_etl
    path: pipelines/sales_pipeline.py
    compute: MEDIUM
    spark_config:
      deploy_mode: client
      executor_instances: 3
    python_dependencies:
      - name: pandas
        version: 2.0.0
      - name: sqlalchemy
        version: 2.0.0

Multi-Component Project

project_name: customer-platform
project_id: 67280ba2f4a0960d02159675

pipelines:
  - alias: customer_data_sync
    path: pipelines/sync.py
    compute: SMALL
    python_dependencies:
      - name: requests
      - name: pydantic
        version: 2.0.0

actions:
  - path: actions/send_welcome_email.py
  - path: actions/validate_customer.py

apps:
  - path: apps/support_chatbot.py

endpoints:
  - path: endpoints/customer_api.yml

Next Steps