Project Structure

my-project-directory
├── pipeline.py
├── utils.py
└── config.yml
  • Each project should have a config.yml file.
  • A project can contain multiple pipeline files. Every pipeline file referenced in the configuration file must exist in the project directory.
  • Each pipeline should have a unique alias and be defined in its own file.
  • A project can contain multiple utility files.
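
The rules above can be checked locally before deploying. The following is a minimal sketch, not part of the Datazone CLI: check_project is a hypothetical helper that takes the configuration as an already-parsed dictionary (for example, loaded with a YAML library) and reports a missing config.yml, duplicate aliases, pipelines that share a file, and pipeline files that do not exist.

```python
from collections import Counter
from pathlib import Path


def check_project(project_dir, config):
    """Return a list of problems; an empty list means the layout looks fine.

    Hypothetical helper for illustration. `config` is the parsed config.yml.
    """
    root = Path(project_dir)
    problems = []

    if not (root / "config.yml").is_file():
        problems.append("config.yml is missing")

    pipelines = config.get("pipelines") or []

    # Every alias must be unique within the project.
    for alias, count in Counter(p.get("alias") for p in pipelines).items():
        if count > 1:
            problems.append(f"alias {alias!r} is used {count} times")

    # Each pipeline should live in its own file.
    for path, count in Counter(p.get("path") for p in pipelines).items():
        if count > 1:
            problems.append(f"path {path!r} is shared by {count} pipelines")

    # Every referenced pipeline file must exist in the project directory.
    for p in pipelines:
        path = p.get("path")
        if not path or not (root / path).is_file():
            problems.append(f"pipeline file {path!r} not found")

    return problems
```

Calling check_project(".", config) from the project directory returns an empty list when all of the rules hold.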

When you change your project files, the recommended way to deploy them is with the Datazone CLI. This lets Datazone check your changes and notify you of any errors before they are deployed. Deploy your changes with the following command:

datazone project deploy

After you deploy your changes, you can check your project status by running the following command:

datazone project summary
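
If you want to script these two steps, they can be chained so that the summary is shown only after a successful deploy. This is a minimal sketch under one assumption that the text does not confirm: that the CLI exits with a non-zero status when the deploy fails. The function name deploy_and_summarize is hypothetical, and no flags beyond the two commands shown above are used.

```python
import subprocess

# Commands taken from the text above; no extra flags are assumed.
DEPLOY = ["datazone", "project", "deploy"]
SUMMARY = ["datazone", "project", "summary"]


def deploy_and_summarize(run=subprocess.run):
    """Deploy, then show the project summary only if the deploy succeeded.

    Assumption: the CLI returns a non-zero exit status on failure.
    `run` is injectable so the flow can be exercised without the CLI installed.
    """
    deploy = run(DEPLOY)
    if deploy.returncode != 0:
        return deploy.returncode
    return run(SUMMARY).returncode
```

Run deploy_and_summarize() from the project directory; the return value is the exit status of the last command that ran.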

Configuration File

project_name: my-pretty-project
project_id: 67280ba2f4a0960d02159675
pipelines:
- alias: hello_world
  path: hello-world.py
  compute: LARGE
  spark_config:
    deploy_mode: client
    executor_instances: 3
  python_dependencies:
    - name: pandas
      version: 1.3.3

Configuration File Schema

Basic Fields

  • project_name: Name of your project (required)
  • project_id: Unique ID for your project (required)
  • pipelines: List of pipelines in your project
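
The two required fields can be checked with a few lines of Python. This is an illustrative sketch, not Datazone's own validation; missing_required_fields is a hypothetical name, and the configuration is assumed to be already parsed into a dictionary.

```python
# Top-level fields marked "(required)" in the list above.
REQUIRED_FIELDS = ("project_name", "project_id")


def missing_required_fields(config):
    """Return the required top-level fields that are absent or empty."""
    return [name for name in REQUIRED_FIELDS if not config.get(name)]


print(missing_required_fields({"project_name": "my-pretty-project"}))
# prints ['project_id']
```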

Pipeline

  • alias: Short name for your pipeline
  • path: Location of your pipeline file
  • compute: Computing instance type. Options are:
    • XSMALL: 2 vCPU, 8 GB RAM
    • SMALL: 4 vCPU, 16 GB RAM
    • MEDIUM: 8 vCPU, 32 GB RAM
    • LARGE: 16 vCPU, 64 GB RAM
    • XLARGE: 32 vCPU, 128 GB RAM (Enterprise only)
  • spark_config.deploy_mode: How Spark runs. Default is local. Other options are:
    • client: Spark runs in the same process as the driver
    • cluster: Spark runs in a separate process
  • spark_config.executor_instances: Number of executors to use in PySpark. If spark_config.deploy_mode is local, this field is ignored.
  • python_dependencies: List of Python dependencies for your pipeline
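
The documented options for a pipeline entry can be expressed as a small lookup table and check. This is a sketch for illustration only: validate_pipeline is a hypothetical helper, it treats compute as optional (the text does not say whether it is required), and it assumes local, client, and cluster are the only accepted deploy modes.

```python
# Instance sizes as listed above: name -> (vCPU, RAM in GB).
COMPUTE_SIZES = {
    "XSMALL": (2, 8),
    "SMALL": (4, 16),
    "MEDIUM": (8, 32),
    "LARGE": (16, 64),
    "XLARGE": (32, 128),  # Enterprise only
}

# Assumption: these three values are the complete set of deploy modes.
DEPLOY_MODES = {"local", "client", "cluster"}


def validate_pipeline(pipeline):
    """Return (errors, warnings) for one entry of the `pipelines` list."""
    errors, warnings = [], []

    compute = pipeline.get("compute")
    if compute is not None and compute not in COMPUTE_SIZES:
        errors.append(f"unknown compute {compute!r}")

    spark = pipeline.get("spark_config") or {}
    mode = spark.get("deploy_mode", "local")  # documented default
    if mode not in DEPLOY_MODES:
        errors.append(f"unknown deploy_mode {mode!r}")
    elif mode == "local" and "executor_instances" in spark:
        warnings.append("executor_instances is ignored when deploy_mode is local")

    return errors, warnings
```

For the hello_world pipeline in the example configuration (LARGE, client, 3 executors), this returns no errors and no warnings.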

Python Dependencies

  • name: Name of the Python package
  • version: Version of the Python package (optional)
  • index_url: URL of the Python package index (optional)
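
To see how the three fields fit together, it can help to map a dependency entry to the equivalent pip arguments. This is a sketch under assumptions the text does not confirm: that dependencies are installed with pip, and that version means an exact pin. The helper name pip_install_args and the index URL in the second call are hypothetical.

```python
def pip_install_args(dependency):
    """Build pip arguments for one entry of `python_dependencies`.

    Assumption: `version` is an exact pin (name==version).
    """
    name = dependency["name"]
    version = dependency.get("version")
    requirement = f"{name}=={version}" if version else name

    args = ["install", requirement]
    if dependency.get("index_url"):
        args += ["--index-url", dependency["index_url"]]
    return args


print(pip_install_args({"name": "pandas", "version": "1.3.3"}))
# prints ['install', 'pandas==1.3.3']

# Hypothetical private index, for illustration only.
print(pip_install_args({"name": "pandas", "index_url": "https://pypi.example.com/simple"}))
# prints ['install', 'pandas', '--index-url', 'https://pypi.example.com/simple']
```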