DAZL Documentation | Data Analytics A-to-Z Processing Language


Contents

combine

data management

slug: step-combine

Combine Step

Purpose

Integrates multiple datasets into a single unified dataset using either vertical stacking (append) or relational joins. Provides flexible options for merging data from disparate sources.

When to Use

  • Consolidate data from multiple sources
  • Integrate data across different time periods
  • Combine complementary datasets for comprehensive analysis
  • Enrich base data with additional attributes from lookup tables
  • Create master datasets from specialized extracts
  • Implement SQL-like joins in the data pipeline

How It Works

  1. Takes the base dataset and one or more additional datasets
  2. Applies one of two combination methods:
    • Append: Stacks datasets vertically (adding rows)
    • Join: Merges datasets horizontally (adding columns) based on key relationships
  3. Handles metadata merging for PDV (Physical Data View)
  4. Tracks detailed statistics about the combination process
  5. Returns a unified dataset with consolidated structure

Parameters

Required

  • datasets (array) - Datasets to combine with the base dataset

Optional

  • method (string) - Combination method: append (default) or join
  • join_type (string) - For join method: inner, left, right, or full (default: inner)
  • join_on (string|array) - Join key(s) or conditions:
    • Simple string: "customer_id" (assumes same column name in both datasets)
    • Explicit mapping: {"left": "id", "right": "customer_id"}
    • Multiple conditions: [{"left": "id", "right": "customer_id"}, {"left": "region", "right": "territory"}]
  • join_prefix (boolean) - Add dataset name prefix to columns to prevent conflicts (default: true)
  • keep_duplicates (boolean) - For append method: keep duplicate records (default: true)
  • strict_mode (boolean) - Throw exceptions on errors (default: false)
  • preserve_types (boolean) - Maintain data types during combination (default: true)

Input Requirements

  • Base dataset must be properly structured
  • Additional datasets must be available and compatible for the chosen combination method
  • For joins, specified join keys must exist in respective datasets
  • At least one additional dataset must be specified in datasets parameter

Output

Data

  • For append: Vertically stacked records from all datasets
  • For join: Base dataset enriched with columns from other datasets based on join relationships

PDV

  • Combined metadata for all columns in the result dataset
  • For joins with join_prefix: true, column names include dataset prefixes
  • Latest metadata definition wins for overlapping column names

Extras

  • combineData_method - Method used (append or join)
  • combineData_applied - Timestamp when operation was performed
  • records_in - Number of records in the base dataset
  • records_out - Number of records in the result dataset
  • datasets_combined - Number of datasets combined

For append operations:

  • append_stats - Detailed statistics about the append operation:
    • base_records - Count of records in base dataset
    • datasets_appended - Array of appended datasets with record counts

For join operations:

  • join_type - Type of join performed
  • join_stats - Detailed statistics about the join operation:
    • base_records - Count of records in base dataset
    • joins_performed - Array of join operations with before/after counts

Example Usage

Append Example

combine:
  method: append
  datasets:
    - q1_sales
    - q2_sales
    - q3_sales
    - q4_sales
  keep_duplicates: false

Join Example

combine:
  method: join
  datasets:
    customers: customers_dataset
    orders: orders_dataset
    products: products_lookup
  join_type: left
  join_on:
    - {left: "customer_id", right: "id"}
    - {left: "product_id", right: "sku"}
  join_prefix: true

Example Output

Append Example

Input Datasets

Base Dataset (q1_sales) quarter region product revenue
Q1 North Widgets 12500
Q1 South Gadgets 8750
Additional Dataset (q2_sales) quarter region product revenue
Q2 North Widgets 15000
Q2 East Gadgets 11250

Appended Output

quarter region product revenue
Q1 North Widgets 12500
Q1 South Gadgets 8750
Q2 North Widgets 15000
Q2 East Gadgets 11250

Join Example

Input Datasets

Base Dataset (orders) order_id customer_id product_id quantity
1001 C123 P456 2
1002 C789 P345 1
Additional Dataset (customers) id name segment
C123 John Smith Premium
C789 Sarah Jones Standard

Joined Output (with join_prefix: true)

order_id customer_id product_id quantity customers_name customers_segment
1001 C123 P456 2 John Smith Premium
1002 C789 P345 1 Sarah Jones Standard

Related Documentation

  • load step - Load data before combining
  • filter step - Filter combined datasets
  • calculate step - Add calculated columns to combined data