Extract

Overview

An Extract defines how data is retrieved from a source in the data platform. It targets a specific object within that source, such as a database table or a file path in a storage system. Extracts handle the initial stage of data ingestion and preprocessing.

Properties

  • ID: A unique identifier for the extract.
  • Source ID: The identifier of the source from which data is extracted.
  • Data Path: Specifies the exact location of the data within the source, such as a database table name or a file path in a storage system.
  • Schema Definition: Describes the structure of the data to be extracted, including column names, data types, and lengths. For certain file types (e.g., CSV, text), the schema can be inferred automatically.
  • Extract Type: The format or method used for extraction, which may vary based on the source type (e.g., SQL query for databases, file pattern matching for file storage systems).
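
To make the properties above concrete, here is a minimal sketch of how an extract could be modeled in Python. The class names, field names, and example values (`Column`, `Extract`, `"sql_query"`, `"src-postgres"`) are hypothetical and chosen for illustration; they are not the platform's actual API or storage format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Column:
    """One entry of a schema definition (hypothetical shape)."""
    name: str
    data_type: str
    length: Optional[int] = None  # only meaningful for sized types


@dataclass
class Extract:
    """Illustrative model of the properties listed above."""
    id: str
    source_id: str
    data_path: str      # e.g. a table name or a file path
    extract_type: str   # e.g. "sql_query" or "file_pattern" (assumed values)
    # An empty schema stands for "to be inferred" in this sketch.
    schema: List[Column] = field(default_factory=list)


orders = Extract(
    id="ext-001",
    source_id="src-postgres",
    data_path="public.orders",
    extract_type="sql_query",
    schema=[
        Column("order_id", "integer"),
        Column("customer_name", "varchar", 255),
    ],
)
```

Keeping the schema as a list of small column records mirrors the description of the Schema Definition property: each column carries a name, a data type, and an optional length.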

Usage

  • Data Ingestion: Extracts are used to ingest data from various sources into the data platform for further processing.
  • Schema Mapping: The schema definition on an extract describes the structure of incoming data, which data integration and transformation depend on.
  • Configurability: Each extract can be configured for the data format and source type it reads from.
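
Schema inference for CSV files can be pictured with the sketch below. It is a deliberately naive illustration, not the platform's inference logic (which is not documented here): it samples rows, tries `int` and then `float` on each column, and falls back to `string`. The function names and the output shape are assumptions.

```python
import csv
import io


def _all_parse(values, cast):
    """True if every value can be converted with `cast`."""
    for value in values:
        try:
            cast(value)
        except ValueError:
            return False
    return True


def infer_column_type(values):
    """Guess 'integer', 'float', or 'string' from sample values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    if _all_parse(non_empty, int):
        return "integer"
    if _all_parse(non_empty, float):
        return "float"
    return "string"


def infer_schema(csv_text, sample_rows=100):
    """Infer column names, types, and max observed lengths from CSV text."""
    # restval="" fills missing trailing fields so every value is a string.
    reader = csv.DictReader(io.StringIO(csv_text), restval="")
    rows = [row for _, row in zip(range(sample_rows), reader)]
    return [
        {
            "name": name,
            "data_type": infer_column_type([r[name] for r in rows]),
            "length": max((len(r[name]) for r in rows), default=0),
        }
        for name in (reader.fieldnames or [])
    ]


sample = "order_id,amount,customer\n1,9.99,Ada\n2,12.50,Grace\n"
schema = infer_schema(sample)
```

Inference from a sample can guess wrong (for example, a column that is numeric in the first rows but contains text later), which is one reason an explicit schema definition is still useful.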

Best Practices

  • Precision in Data Path: Point the data path at the exact table or file pattern you need, so the extract does not pick up unrelated data.
  • Performance Optimization: Optimize extraction processes to manage large datasets effectively, minimizing resource consumption and extraction time.
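
The effect of a precise data path can be illustrated with shell-style file patterns. This sketch uses Python's standard `fnmatch` module as a stand-in; the platform's own pattern syntax may differ, and the file names are made up for the example.

```python
from fnmatch import fnmatchcase

# Hypothetical listing of objects in a storage system.
files = [
    "sales/2024/orders_01.csv",
    "sales/2024/orders_02.csv",
    "sales/2024/orders_01.csv.bak",
    "sales/2024/returns_01.csv",
]


def match(pattern):
    """Return the files selected by a case-sensitive shell-style pattern."""
    return [name for name in files if fnmatchcase(name, pattern)]


# A broad pattern also selects the backup file and the returns file.
broad = match("sales/2024/*")

# A narrow pattern selects only the order files the extract is meant to read.
narrow = match("sales/2024/orders_*.csv")
```

Here the broad pattern selects all four files, while the narrow one selects only the two order files, which also means less data to scan.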

Integration with Other Entities

An extract is typically followed by the creation of 'executions' and 'datasets' in the data platform. Together they form a pipeline that transforms raw data into structured, usable formats.