Dataset
Overview
The Dataset entity in the data platform represents a structured collection of data, usually in tabular form. It is a crucial component of data management and processing within the platform, serving as the primary format for storing, manipulating, and retrieving data.
Properties
- ID: A unique identifier for the dataset.
- Schema: Describes the structure of the dataset, including columns, data types, and lengths.
- Transactions: Each execution creates a transaction in the dataset. A transaction either overwrites the dataset or appends to it.
- Write Behavior: Determines how new data is added to the dataset (e.g., append, overwrite).
- Source: The origin of the data, such as transforms, other datasets, databases, files, or external APIs.
- Format: The data format used to store the dataset. Currently, this is the Delta Lake format.
- Index:
- View:
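To make the relationship between these properties concrete, the following is a minimal Python sketch of how a dataset and its transactions might be modeled. It is not the platform's API: the class names (`Dataset`, `Column`, `Transaction`, `WriteBehavior`), the `commit` and `current_rows` methods, and the sample values are all hypothetical, chosen only to illustrate how append and overwrite transactions determine the dataset's current contents.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class WriteBehavior(Enum):
    APPEND = "append"
    OVERWRITE = "overwrite"


@dataclass
class Column:
    name: str
    data_type: str
    length: Optional[int] = None  # not every data type has a length


@dataclass
class Transaction:
    behavior: WriteBehavior
    rows: List[Dict[str, Any]]


@dataclass
class Dataset:
    # Hypothetical model of the properties listed above.
    id: str
    schema: List[Column]
    write_behavior: WriteBehavior
    source: str
    format: str = "delta"
    transactions: List[Transaction] = field(default_factory=list)

    def commit(self, rows: List[Dict[str, Any]]) -> None:
        """Record one execution as a transaction, using the current write behavior."""
        self.transactions.append(Transaction(self.write_behavior, list(rows)))

    def current_rows(self) -> List[Dict[str, Any]]:
        """Replay transactions in order: overwrite resets the contents, append extends them."""
        rows: List[Dict[str, Any]] = []
        for tx in self.transactions:
            if tx.behavior is WriteBehavior.OVERWRITE:
                rows = list(tx.rows)
            else:
                rows.extend(tx.rows)
        return rows


orders = Dataset(
    id="orders",
    schema=[Column("order_id", "string", 36), Column("amount", "double")],
    write_behavior=WriteBehavior.APPEND,
    source="transform",
)
orders.commit([{"order_id": "a1", "amount": 10.0}])
orders.commit([{"order_id": "a2", "amount": 5.0}])

# Switching to overwrite means the next execution replaces everything before it.
orders.write_behavior = WriteBehavior.OVERWRITE
orders.commit([{"order_id": "b1", "amount": 7.5}])
print(orders.current_rows())
```

In this sketch the transaction log is the source of truth and the visible rows are derived from it, which is one plausible way to reconcile the Transactions and Write Behavior properties.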
Usage
- Data Integration: Datasets are created and populated by data extraction and by the execution of transforms. The data can originate from sources such as relational databases, flat files, or APIs.
- Data Transformation: Datasets can be used as inputs and outputs for data transformation processes, where they undergo various operations like filtering, aggregation, or splitting.
- Data Analysis: Once transformed, datasets are ready for analysis and can be consumed in data analytics tools or used for further data processing.
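The transformation step described above can be sketched with plain Python lists of rows standing in for datasets. The helper names (`filter_rows`, `sum_by`) and the sample data are assumptions made for illustration; a real transform on the platform would read from and write to datasets rather than in-memory lists.

```python
from collections import defaultdict


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate returns True."""
    return [row for row in rows if predicate(row)]


def sum_by(rows, key, value):
    """Aggregate: sum the `value` column for each distinct `key`."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[value]
    return [{key: k, value: total} for k, total in totals.items()]


# Input dataset (hypothetical sample rows).
orders = [
    {"region": "EU", "amount": 10.0},
    {"region": "US", "amount": 4.0},
    {"region": "EU", "amount": 2.5},
    {"region": "US", "amount": 0.0},
]

# Transform: drop zero-amount orders, then total the amounts per region.
non_zero = filter_rows(orders, lambda row: row["amount"] > 0)
totals = sum_by(non_zero, "region", "amount")
print(totals)
```

The output rows form a new dataset that can in turn serve as input to another transform or be consumed by an analytics tool.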
Versioning
Each significant change to a dataset, either in its schema or data, should be version controlled to track the dataset's evolution and maintain a history of changes.
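One way to track such changes is to record a new version whenever the schema or the data differs from the last recorded state. The sketch below is an assumption about how this could work, not a description of the platform's versioning mechanism: `fingerprint` and `VersionHistory` are hypothetical names, and the sketch treats every detected change as significant.

```python
import hashlib
import json


def fingerprint(schema, rows):
    """Hash the schema and data together so any change yields a new digest."""
    payload = json.dumps({"schema": schema, "rows": rows}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class VersionHistory:
    def __init__(self):
        self.versions = []  # list of (version_number, digest, note)

    def record(self, schema, rows, note):
        """Add a version only if the schema or data changed; return the version number."""
        digest = fingerprint(schema, rows)
        if self.versions and self.versions[-1][1] == digest:
            return self.versions[-1][0]
        number = len(self.versions) + 1
        self.versions.append((number, digest, note))
        return number


history = VersionHistory()
schema = [{"name": "order_id", "type": "string"}]

v1 = history.record(schema, [{"order_id": "a1"}], "initial load")
v1_again = history.record(schema, [{"order_id": "a1"}], "rerun with no changes")

# A schema change (new column) produces a new version.
schema_v2 = schema + [{"name": "amount", "type": "double"}]
v2 = history.record(schema_v2, [{"order_id": "a1", "amount": 10.0}], "added amount column")
```

Hashing the full contents is only practical for small datasets; for larger ones, a versioning scheme would more likely rely on metadata such as the transaction log.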