DAZL Documentation | Data Analytics A-to-Z Processing Language

Quick Index Pages (1)

Welcome to DAZL

Recipes (24)

Topic Maps (18)

Examples (18)

Tutorials (6)

Reference (7)

Combining Data

data management

slug: topic-map-data-management-combining-data

vocabulary:

Join: Operation that combines rows from two or more datasets based on related columns
Append: Stacking datasets vertically by adding rows from one dataset below another
Inner Join: Returns only rows where the join key matches in both datasets
Left Join: Returns all rows from left dataset plus matched rows from right dataset
Right Join: Returns all rows from right dataset plus matched rows from left dataset
Outer Join: Returns all rows from both datasets, matching where possible
Full Join: Same as outer join - all rows from both datasets with NULLs for non-matches
Join Key: Column(s) used to match records between datasets
Match Key: Same as join key - the identifier for linking records
Cardinality: Relationship between rows in joined tables (one-to-one, one-to-many, many-to-many)
Left Table: Primary or first dataset in join operation (master table)
Right Table: Secondary or second dataset in join operation (lookup table)
Matched Records: Rows where join key exists in both datasets
Unmatched Records: Rows where join key exists in only one dataset
NULL Values: Missing data markers inserted when no matching record exists
Composite Key: Join key made from multiple columns combined
Cross Join: Cartesian product of two datasets (all possible row combinations)
Anti-Join: Returns rows from left table that have no match in right table
Semi-Join: Returns rows from left table that have at least one match in right table
Equi-Join: Join using equality operator on join keys
Self-Join: Joining a table to itself
Referential Integrity: Consistency of relationships between datasets
Orphaned Record: Row with foreign key value that doesn't exist in referenced table
Many-to-Many: Relationship where multiple rows in each table can match multiple rows in the other

concepts:

Set Theory Foundation: Joins implement set operations - inner join is intersection, outer joins are unions with different preservation rules for non-matching elements
Data Preservation Strategy: Join type determines which records survive the operation based on business rules about completeness versus accuracy
Referential Integrity: Joins expose data quality by revealing missing relationships, duplicates, and orphaned records between related datasets

procedures:

Execute Inner Join: Identify common key columns, specify equality condition, return only matching rows, combine columns from both datasets
Execute Left Join: Identify left dataset to preserve, specify join key, return all left rows plus matching right columns, insert NULLs for non-matches
Execute Append: Verify matching column structures, stack datasets vertically, preserve all rows from both sources, handle column mismatches
Diagnose Join Results: Count pre-join records, execute join, compare result count to expected cardinality, investigate duplicates or missing records, check NULL patterns
Build Composite Key Join: Identify multiple columns needed for unique match, concatenate or combine in join condition, test for uniqueness

topics:

Inner join mechanics and use cases
Left join for preserving primary dataset
Right join for preserving lookup dataset
Full outer join for complete data inventory
Append operations for combining similar datasets
Cross join for generating combinations
Self-join for hierarchical or sequential data
Anti-join for finding orphaned records
Semi-join for existence checking
Multi-column composite keys
Join ordering and performance
Handling duplicate keys in joins
NULL value propagation in outer joins
One-to-many join expansion
Many-to-many join explosion risks
Cartesian product dangers

categories:

Matching Joins: Inner, left, right, full outer
Set Operations: Union (append), intersection (inner), difference (anti)
Lookup Patterns: Enriching data with reference tables
Data Quality: Finding orphans, duplicates, gaps
Aggregation Support: Pre-join vs post-join calculations
Temporal Joins: Time-based matching, as-of joins

themes:

Data integration and consolidation
Relational database theory in practice
Master data management patterns
Data quality validation through relationships
ETL and data pipeline design
Query optimization strategies
Business rule enforcement via joins
Dimensional modeling (fact-dimension joins)

trends:

Cloud data warehouse join optimization
Distributed join strategies for big data
Streaming join operations
Graph database alternatives to joins
Fuzzy matching and approximate joins
JSON and nested data joins
Zero-copy joins in columnar databases
Push-down join optimization in data virtualization

use_cases:

Customer Enrichment: Left join customer transactions to customer profile to add demographic attributes
Sales Analysis: Inner join orders to products to analyze only completed, valid sales
Data Quality Audit: Anti-join to find customers with orders but no profile record
Inventory Reconciliation: Full outer join warehouse data to accounting system to find discrepancies
Time Series Append: Stack monthly sales files vertically to create annual dataset
Product Hierarchy: Self-join product table to build parent-child category relationships
Reference Data Lookup: Left join transactions to currency exchange rates to standardize values
Many-to-Many Resolution: Inner join orders to order_items to products for detailed product performance
Deduplication Prep: Self-join on fuzzy match keys to identify potential duplicate records
Completeness Check: Full outer join expected customer list to actual customer list to find gaps

DAZL Documentation | Data Analytics A-to-Z Processing Language

Contents

Quick Index Pages (1)

Steps (34)

Recipes (24)

Topic Maps (18)

Examples (18)

Tutorials (6)

Reference (7)

Combining Data