With three datasets (`orders`, `SKUs`, and `customers`) as our foundation, we will explore various PySpark operations. Each example demonstrates a practical data transformation, showcasing the versatility and power of PySpark in a Datazone environment.
From basic operations like projection and filtering to more advanced techniques such as joins, unions, and aggregations, each section of this guide offers concise yet comprehensive insights into PySpark’s capabilities. Whether you’re a beginner or an experienced user, these examples provide a clear pathway to enhance your data processing workflows in Datazone using PySpark.
In the upcoming examples, the terms *dataset* and *dataframe* are used interchangeably; both refer to the same concept of a structured collection of data within our context.
Please note that `Dataset(id="<dataset_orders_id>")` in the provided code examples is merely an illustrative placeholder. Replace the string `<dataset_orders_id>` with the actual identifier of your orders dataset when implementing these examples in your environment, so that the code correctly references and interacts with your dataset.
**Projection**

Projection selects a subset of columns from a DataFrame, analogous to the `SELECT` statement in SQL. It is useful when you want to work with only a subset of columns in your dataset.
Example Use Case: Selecting only the `order_id` and `order_date` columns from an orders DataFrame.
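A minimal sketch of projection with `select`; the `SparkSession` setup and the inline sample rows are illustrative stand-ins for data you would load from your Datazone orders dataset:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Sample rows standing in for the orders dataset loaded from Datazone.
orders = spark.createDataFrame(
    [
        (1, "C1", "2023-01-15", 100.0),
        (2, "C1", "2023-02-20", 250.0),
        (3, "C2", "2023-02-25", 80.0),
    ],
    ["order_id", "customer_id", "order_date", "order_value"],
)

# Projection: keep only the columns we need.
orders.select("order_id", "order_date").show()
```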
**Filter**

Filtering restricts rows based on a condition, analogous to the `WHERE` clause in SQL. It allows for both simple and complex filtering criteria.
Example Use Case: Fetching orders made after a certain date, i.e., filtering records based on a condition.
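A short sketch, reusing the sample `orders` DataFrame from the projection example; the cutoff date is an arbitrary assumption:

```python
from pyspark.sql import functions as F

# Keep only orders placed after February 1, 2023.
recent_orders = orders.filter(F.col("order_date") > "2023-02-01")
recent_orders.show()
```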
**Rename**

The `withColumnRenamed` function renames an existing column. Example Use Case: Renaming `order_date` to `date_of_order` for clarity.
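A quick sketch, again on the sample `orders` DataFrame:

```python
# Rename order_date to date_of_order for clarity.
renamed_orders = orders.withColumnRenamed("order_date", "date_of_order")
renamed_orders.printSchema()
```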
**Sort**

The `orderBy` function sorts the DataFrame by one or more columns, in ascending or descending order.

Detailed Usage:

- `orderBy("column")`: Sorts the DataFrame in ascending order based on the specified column.
- `orderBy("column", ascending=False)`: Sorts in descending order.
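Both variants, sketched on the sample `orders` DataFrame:

```python
# Ascending sort (the default).
orders.orderBy("order_date").show()

# Descending sort on order value.
orders.orderBy("order_value", ascending=False).show()
```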
**Group By and Aggregate**

Detailed Usage:

- `groupBy("customer_id")`: Groups the data by the `customer_id` column.
- `agg(...)`: Performs the specified aggregation function; here, counting the number of `order_id` values for each group.
- `alias("total_orders")`: Renames the result of the aggregation to `total_orders` for clarity.
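A sketch of exactly this grouping, assuming the sample `orders` DataFrame with its `customer_id` column:

```python
from pyspark.sql import functions as F

# Count the number of orders per customer.
orders_per_customer = orders.groupBy("customer_id").agg(
    F.count("order_id").alias("total_orders")
)
orders_per_customer.show()
```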
**Pivot**

Detailed Usage:

- `groupBy("column").pivot("pivot_column")`: Groups the data by one column and pivots on another.
- Aggregation functions such as `sum()`, `count()`, etc., can then be applied to the pivoted data.
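A sketch of a pivot on the sample `orders` DataFrame; deriving an `order_month` column from `order_date` is an illustrative choice:

```python
from pyspark.sql import functions as F

# One row per customer, one column per order month, summing order values.
monthly_totals = (
    orders
    .withColumn("order_month", F.month(F.to_date("order_date")))
    .groupBy("customer_id")
    .pivot("order_month")
    .sum("order_value")
)
monthly_totals.show()
```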
**Window Functions**

Detailed Usage:

- `Window.partitionBy("column").orderBy("other_column")`: Defines a window specification that partitions the data by one column and orders rows within each partition by another.
- Window functions such as `rank()`, `row_number()`, `lead()`, `lag()`, etc., are then applied over this specification.
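A sketch using `row_number()` over a window partitioned by customer; the window and column names follow the sample `orders` DataFrame:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Number each customer's orders chronologically.
order_window = Window.partitionBy("customer_id").orderBy("order_date")
orders.withColumn("order_rank", F.row_number().over(order_window)).show()
```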
**User-Defined Functions (UDFs)**

A UDF wraps a plain Python function so it can be applied to a DataFrame column.
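A minimal sketch; the `categorize_order` function, its name, and the 200.0 threshold are all illustrative assumptions:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

# Plain Python function to be wrapped as a UDF; the threshold is arbitrary.
def categorize_order(value):
    return "high" if value >= 200.0 else "low"

categorize_udf = F.udf(categorize_order, StringType())

# Apply the UDF to a DataFrame column.
orders.withColumn("order_category", categorize_udf("order_value")).show()
```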
**Handling Missing Values**

- `fillna()`: Fills null values with specified value(s).
- `dropna()`: Drops rows with null values.
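Both methods sketched on the sample `orders` DataFrame; the fill value and subset column are arbitrary choices:

```python
# Replace missing order values with 0.0.
filled_orders = orders.fillna({"order_value": 0.0})
filled_orders.show()

# Drop rows where order_date is null.
complete_orders = orders.dropna(subset=["order_date"])
complete_orders.show()
```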
**Combined Example**

This example chains several of the operations above (see the sketch after this list):

- Filtering: Only includes orders from the year 2023.
- Sorting: Orders are sorted by their date.
- UDF Application: A user-defined function categorizes each order based on its value.
- Aggregation: The total value of orders is computed for each customer.
- Pivoting: The data is then pivoted to show the sum of order values per month for each customer.
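A condensed sketch of this pipeline on the sample `orders` DataFrame; the UDF threshold and variable names are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

categorize_udf = F.udf(lambda v: "high" if v >= 200.0 else "low", StringType())

orders_2023 = (
    orders
    .withColumn("order_month", F.month(F.to_date("order_date")))
    # Filtering: keep only orders from 2023.
    .filter(F.year(F.to_date("order_date")) == 2023)
    # Sorting: order rows by date (for inspection; the later groupBy
    # does not preserve this order).
    .orderBy("order_date")
    # UDF Application: categorize each order by its value.
    .withColumn("order_category", categorize_udf("order_value"))
)

# Aggregation: total order value per customer.
totals = orders_2023.groupBy("customer_id").agg(
    F.sum("order_value").alias("total_order_value")
)
totals.show()

# Pivoting: sum of order values per month for each customer.
monthly = orders_2023.groupBy("customer_id").pivot("order_month").sum("order_value")
monthly.show()
```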
Ensure that `dataset_orders_id` is replaced with the actual dataset ID in your implementation, and that the column names used here (`order_date`, `order_value`, etc.) match those in your orders DataFrame.