Development
Project Repository
You can manage your projects however you like.
Project Structure
- Each project should have a `config.yml` file.
- You can have multiple pipeline files in your project. Any pipeline file referenced in your configuration file must exist in your project directory.
- Each pipeline should have a unique alias and should be defined in a separate file.
- You can have multiple utility files in your project.
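As a sketch, a project that follows these rules might be laid out like this (all file and directory names below are illustrative, not required):

```
my-project/
├── config.yml            # project configuration (required)
├── ingest_pipeline.py    # one pipeline per file, each with a unique alias
├── report_pipeline.py
└── utils/
    └── helpers.py        # utility files shared by pipelines
```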
When you change your project files, the recommended way to deploy them is with the Datazone CLI: it checks your changes first and notifies you of any errors before deployment. You can deploy your changes with the following command.
After you deploy your changes, you can check your project status by running the following command:
Configuration File
Configuration File Schema
Basic Fields
- `project_name`: Name of your project (required)
- `project_id`: Unique ID for your project (required)
- `pipelines`: List of pipelines in your project
Pipeline
- `alias`: Short name for your pipeline
- `path`: Location of your pipeline file
- `compute`: Computing instance type. Options are:
  - `XSMALL`: 2 vCPU, 8 GB RAM
  - `SMALL`: 4 vCPU, 16 GB RAM
  - `MEDIUM`: 8 vCPU, 32 GB RAM
  - `LARGE`: 16 vCPU, 64 GB RAM
  - `XLARGE`: 32 vCPU, 128 GB RAM (Enterprise only)
- `spark_config.deploy_mode`: How Spark runs. Default is `local`. Options are:
  - `client`: Spark runs in the same process as the driver
  - `cluster`: Spark runs in a separate process
- `spark_config.executor_instances`: Number of executors to use in PySpark. If `spark_config.deploy_mode` is `local`, this field is ignored.
- `python_dependencies`: List of Python dependencies for your pipeline
Python Dependencies
- `name`: Name of the Python package
- `version`: Version of the Python package (optional)
- `index_url`: URL of the Python package index (optional)
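Putting the schema together, a complete `config.yml` might look like the sketch below. The field names follow the schema above; the project name, ID, paths, and package details are illustrative assumptions, not values from a real project:

```yaml
project_name: sales-analytics      # required
project_id: prj-12345              # required; hypothetical ID
pipelines:
  - alias: daily-ingest            # unique alias per pipeline
    path: ingest_pipeline.py       # must exist in the project directory
    compute: SMALL                 # 4 vCPU, 16 GB RAM
    spark_config:
      deploy_mode: cluster         # default is local
      executor_instances: 4        # ignored when deploy_mode is local
    python_dependencies:
      - name: requests
        version: 2.31.0                     # optional
        index_url: https://pypi.org/simple  # optional
```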