Dataset

Overview

The Dataset entity represents a structured collection of data, usually in tabular form. It is the primary format for storing, manipulating, and retrieving data within the platform.

Properties

  • ID: A unique identifier for the dataset.
  • Schema: Describes the structure of the dataset, including columns, data types, and lengths.
  • Transactions: Each transform execution creates a transaction in the dataset. A transaction either overwrites the existing data or appends to it.
  • Write Behavior: Determines how new data is added to the dataset (e.g., append, overwrite).
  • Source: The origin of the data, such as transforms, other datasets, databases, files, or external APIs.
  • Format: The storage format of the dataset. Currently, this is the Delta Lake format.
  • Index:
  • View:
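
To make the relationship between these properties concrete, here is a minimal sketch in Python. It is not the platform's API: the class names (Dataset, Column, Transaction, WriteBehavior), the method names, and the sample values are all hypothetical and chosen only for illustration. It shows how replaying transactions could produce the current data, with an overwrite resetting the data and an append extending it.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class WriteBehavior(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"


@dataclass
class Column:
    # One schema entry: column name, data type, and optional length.
    name: str
    data_type: str
    length: Optional[int] = None


@dataclass
class Transaction:
    # One execution's write: its behavior and the rows it carried.
    behavior: WriteBehavior
    rows: List[dict]


@dataclass
class Dataset:
    # Hypothetical in-memory model of the properties listed above.
    id: str
    schema: List[Column]
    source: str
    format: str = "delta"
    transactions: List[Transaction] = field(default_factory=list)

    def write(self, rows, behavior):
        """Record one execution as a transaction."""
        self.transactions.append(Transaction(behavior, list(rows)))

    def current_rows(self):
        """Replay transactions: overwrite resets the data, append extends it."""
        rows = []
        for tx in self.transactions:
            if tx.behavior is WriteBehavior.OVERWRITE:
                rows = list(tx.rows)
            else:
                rows.extend(tx.rows)
        return rows


# Illustrative usage with made-up values.
orders = Dataset(
    id="ds-001",
    schema=[Column("order_id", "integer"), Column("customer", "string", 64)],
    source="transform",
)
orders.write([{"order_id": 1, "customer": "a"}], WriteBehavior.OVERWRITE)
orders.write([{"order_id": 2, "customer": "b"}], WriteBehavior.APPEND)
rows_after_append = orders.current_rows()
orders.write([{"order_id": 3, "customer": "c"}], WriteBehavior.OVERWRITE)
rows_after_overwrite = orders.current_rows()
```

In this sketch the append leaves two rows in the dataset, while the later overwrite leaves only the row it wrote. The full transaction list is kept either way, so earlier states remain reconstructible.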

Usage

  • Data Integration: Datasets are created and populated by data extraction processes and by transform executions. The data can originate from sources such as relational databases, flat files, or APIs.
  • Data Transformation: Datasets can be used as inputs and outputs for data transformation processes, where they undergo various operations like filtering, aggregation, or splitting.
  • Data Analysis: Once transformed, datasets are ready for analysis and can be consumed in data analytics tools or used for further data processing.
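
The transformation operations named above (filtering, aggregation, splitting) can be sketched in plain Python over a list of rows. This illustrates the operations themselves, not the platform's transform interface; the function names and the sample rows are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical input dataset, represented as a list of row dictionaries.
rows = [
    {"region": "north", "amount": 120},
    {"region": "south", "amount": 80},
    {"region": "north", "amount": 30},
]


def filter_rows(rows, predicate):
    """Filtering: keep only the rows for which the predicate holds."""
    return [r for r in rows if predicate(r)]


def aggregate_sum(rows, key, value):
    """Aggregation: sum the `value` column per distinct `key` value."""
    totals = defaultdict(int)
    for r in rows:
        totals[r[key]] += r[value]
    return dict(totals)


def split_by(rows, key):
    """Splitting: produce one output row list per distinct `key` value."""
    parts = defaultdict(list)
    for r in rows:
        parts[r[key]].append(r)
    return dict(parts)


large = filter_rows(rows, lambda r: r["amount"] >= 50)
totals = aggregate_sum(rows, "region", "amount")
by_region = split_by(rows, "region")
```

Each function takes rows in and returns new rows without modifying its input, which mirrors the idea of a dataset serving as the input of one transformation and the output of another.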

Versioning

Each significant change to a dataset, whether to its schema or to its data, should be version-controlled so that the dataset's evolution can be tracked and a history of changes is maintained.
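
One possible way to track such changes is to fingerprint the schema and data together and record a new version only when the fingerprint differs from the previous one. The sketch below assumes this approach for illustration; it does not describe how the platform implements versioning, and the names fingerprint and VersionHistory are hypothetical.

```python
import hashlib
import json


def fingerprint(schema, rows):
    """Hash schema and data together so a change to either is detected."""
    payload = json.dumps({"schema": schema, "rows": rows}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class VersionHistory:
    """Hypothetical change log: one entry per distinct dataset state."""

    def __init__(self):
        self.versions = []  # list of (version_number, fingerprint, note)

    def commit(self, schema, rows, note=""):
        fp = fingerprint(schema, rows)
        # Skip the commit if nothing changed since the last version.
        if self.versions and self.versions[-1][1] == fp:
            return self.versions[-1][0]
        number = len(self.versions) + 1
        self.versions.append((number, fp, note))
        return number


# Illustrative usage with made-up schema and data.
history = VersionHistory()
schema_v1 = [{"name": "order_id", "type": "integer"}]
data = [{"order_id": 1}]

v_initial = history.commit(schema_v1, data, "initial load")
v_repeat = history.commit(schema_v1, data, "no change")
schema_v2 = schema_v1 + [{"name": "customer", "type": "string"}]
v_schema_change = history.commit(schema_v2, data, "added customer column")
```

Hashing the full data is only practical for small examples; a real system would more likely derive versions from its transaction log.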