DAZL Documentation | Data Analytics A-to-Z Processing Language


Contents

Combining Data

data management

slug: topic-map-data-management-combining-data

vocabulary:

  • Join: Operation that combines rows from two or more datasets based on related columns
  • Append: Stacking datasets vertically by adding rows from one dataset below another
  • Inner Join: Returns only rows where the join key matches in both datasets
  • Left Join: Returns all rows from left dataset plus matched rows from right dataset
  • Right Join: Returns all rows from right dataset plus matched rows from left dataset
  • Outer Join: Returns all rows from both datasets, matching where possible
  • Full Join: Same as outer join - all rows from both datasets with NULLs for non-matches
  • Join Key: Column(s) used to match records between datasets
  • Match Key: Same as join key - the identifier for linking records
  • Cardinality: Relationship between rows in joined tables (one-to-one, one-to-many, many-to-many)
  • Left Table: Primary or first dataset in join operation (master table)
  • Right Table: Secondary or second dataset in join operation (lookup table)
  • Matched Records: Rows where join key exists in both datasets
  • Unmatched Records: Rows where join key exists in only one dataset
  • NULL Values: Missing data markers inserted when no matching record exists
  • Composite Key: Join key made from multiple columns combined
  • Cross Join: Cartesian product of two datasets (all possible row combinations)
  • Anti-Join: Returns rows from left table that have no match in right table
  • Semi-Join: Returns rows from left table that have at least one match in right table
  • Equi-Join: Join using equality operator on join keys
  • Self-Join: Joining a table to itself
  • Referential Integrity: Consistency of relationships between datasets
  • Orphaned Record: Row with foreign key value that doesn't exist in referenced table
  • Many-to-Many: Relationship where multiple rows in each table can match multiple rows in the other

concepts:

  • Set Theory Foundation: Joins implement set operations - inner join is intersection, outer joins are unions with different preservation rules for non-matching elements
  • Data Preservation Strategy: Join type determines which records survive the operation based on business rules about completeness versus accuracy
  • Referential Integrity: Joins expose data quality by revealing missing relationships, duplicates, and orphaned records between related datasets

procedures:

  • Execute Inner Join: Identify common key columns, specify equality condition, return only matching rows, combine columns from both datasets
  • Execute Left Join: Identify left dataset to preserve, specify join key, return all left rows plus matching right columns, insert NULLs for non-matches
  • Execute Append: Verify matching column structures, stack datasets vertically, preserve all rows from both sources, handle column mismatches
  • Diagnose Join Results: Count pre-join records, execute join, compare result count to expected cardinality, investigate duplicates or missing records, check NULL patterns
  • Build Composite Key Join: Identify multiple columns needed for unique match, concatenate or combine in join condition, test for uniqueness

topics:

  • Inner join mechanics and use cases
  • Left join for preserving primary dataset
  • Right join for preserving lookup dataset
  • Full outer join for complete data inventory
  • Append operations for combining similar datasets
  • Cross join for generating combinations
  • Self-join for hierarchical or sequential data
  • Anti-join for finding orphaned records
  • Semi-join for existence checking
  • Multi-column composite keys
  • Join ordering and performance
  • Handling duplicate keys in joins
  • NULL value propagation in outer joins
  • One-to-many join expansion
  • Many-to-many join explosion risks
  • Cartesian product dangers

categories:

  • Matching Joins: Inner, left, right, full outer
  • Set Operations: Union (append), intersection (inner), difference (anti)
  • Lookup Patterns: Enriching data with reference tables
  • Data Quality: Finding orphans, duplicates, gaps
  • Aggregation Support: Pre-join vs post-join calculations
  • Temporal Joins: Time-based matching, as-of joins

themes:

  • Data integration and consolidation
  • Relational database theory in practice
  • Master data management patterns
  • Data quality validation through relationships
  • ETL and data pipeline design
  • Query optimization strategies
  • Business rule enforcement via joins
  • Dimensional modeling (fact-dimension joins)

trends:

  • Cloud data warehouse join optimization
  • Distributed join strategies for big data
  • Streaming join operations
  • Graph database alternatives to joins
  • Fuzzy matching and approximate joins
  • JSON and nested data joins
  • Zero-copy joins in columnar databases
  • Push-down join optimization in data virtualization

use_cases:

  • Customer Enrichment: Left join customer transactions to customer profile to add demographic attributes
  • Sales Analysis: Inner join orders to products to analyze only completed, valid sales
  • Data Quality Audit: Anti-join to find customers with orders but no profile record
  • Inventory Reconciliation: Full outer join warehouse data to accounting system to find discrepancies
  • Time Series Append: Stack monthly sales files vertically to create annual dataset
  • Product Hierarchy: Self-join product table to build parent-child category relationships
  • Reference Data Lookup: Left join transactions to currency exchange rates to standardize values
  • Many-to-Many Resolution: Inner join orders to order_items to products for detailed product performance
  • Deduplication Prep: Self-join on fuzzy match keys to identify potential duplicate records
  • Completeness Check: Full outer join expected customer list to actual customer list to find gaps