DAZL Documentation | Data Analytics A-to-Z Processing Language


Keep Step

Purpose

Simplifies a dataset by retaining only specified columns, removing all others. Creates a focused view with only the fields needed for downstream analysis.

When to Use

  • Reduce dataset complexity for better performance
  • Remove unnecessary or sensitive columns
  • Prepare data for specific analysis needs
  • Create lightweight extracts for reporting
  • Focus on key variables for visualization
  • Simplify datasets before exporting or sharing

How It Works

  1. Takes a list of columns to retain
  2. Filters the dataset to include only the specified columns
  3. Updates the PDV (Physical Data View) metadata to reflect only kept columns
  4. Preserves the row count but reduces column count
  5. Tracks metadata about the column reduction process
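The steps above can be modelled in a few lines of Python. This is only an illustrative sketch, not DAZL's implementation: the function name `keep_columns`, the row representation (a list of dicts), and the shape of the PDV entries (`{"type": ...}`) are all assumptions made for the example.

```python
from typing import Any

Row = dict[str, Any]


def keep_columns(rows: list[Row], pdv: dict[str, dict], columns: list[str]):
    """Hypothetical model of the keep step (not DAZL's actual code)."""
    # Documented behaviour: with no columns specified, nothing changes.
    if not columns:
        return rows, pdv
    # Steps 1-2: retain requested columns that exist, in the listed order.
    # Names that are not in the dataset are skipped without an error.
    kept = [name for name in columns if name in pdv]
    new_rows = [{name: row.get(name) for name in kept} for row in rows]
    # Step 3: keep metadata only for the kept columns, entries untouched.
    new_pdv = {name: pdv[name] for name in kept}
    # Step 4: one output row per input row.
    return new_rows, new_pdv


rows = [
    {"customer_id": 1001, "email": "john@example.com", "phone": "555-123-4567"},
    {"customer_id": 1002, "email": "sarah@example.com", "phone": "555-987-6543"},
]
# Assumed PDV layout: one metadata dict per column name.
pdv = {
    "customer_id": {"type": "integer"},
    "email": {"type": "string"},
    "phone": {"type": "string"},
}

kept_rows, kept_pdv = keep_columns(rows, pdv, ["email", "customer_id"])
print(kept_rows[0])    # {'email': 'john@example.com', 'customer_id': 1001}
print(list(kept_pdv))  # ['email', 'customer_id']
```

Step 5 (the Extras metadata) is left out of this sketch to keep it short.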

Parameters

Required

  • columns - Specifies which columns to keep, using either:
    • String: Single column name (e.g., "customer_id")
    • Array: Multiple column names (e.g., ["customer_id", "purchase_date", "total"])
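Because `columns` accepts either a string or an array, a step like this presumably normalizes both forms to a list before doing any work. A minimal sketch of that idea, assuming a hypothetical helper named `normalize_columns` (not part of DAZL):

```python
def normalize_columns(columns):
    """Return the columns parameter as a list of names (hypothetical helper)."""
    if isinstance(columns, str):
        # String form: a single column name.
        return [columns]
    # Array form: copy so the caller's list is never modified.
    return list(columns)


print(normalize_columns("customer_id"))
# ['customer_id']
print(normalize_columns(["customer_id", "purchase_date", "total"]))
# ['customer_id', 'purchase_date', 'total']
```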

Input Requirements

  • Any dataset with columns
  • Specified columns that don't exist in the dataset are silently ignored (no error is raised)
  • If the columns list is empty, the dataset is returned unchanged
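Since missing columns are ignored without an error, a misspelled name simply disappears from the output. One way to catch this is to check the requested names before running the step. The helper below is a hypothetical pre-flight check written outside DAZL, shown only as a sketch:

```python
def missing_columns(available, requested):
    """List requested names that are absent from the dataset (hypothetical check)."""
    known = set(available)
    return [name for name in requested if name not in known]


available = ["customer_id", "email", "purchase_date", "amount"]
requested = ["customer_id", "e_mail", "amount"]  # "e_mail" is a deliberate typo

print(missing_columns(available, requested))  # ['e_mail']
```

An empty result means every requested column exists and will be kept.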

Output

Data

  • Same number of rows as the input, with fewer columns
  • Only specified columns are retained, in the order they were listed

PDV

  • Contains metadata only for the kept columns
  • Original column metadata structure is preserved

Extras

  • keep_applied - Timestamp when the operation was performed
  • columns_before - Number of columns before the operation
  • columns_after - Number of columns after the operation
  • columns_kept - List of column names that were kept
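To make the four fields concrete, the sketch below builds an extras record for the example on this page (eight input columns, four kept). The helper name `build_extras` and the ISO 8601 timestamp format are assumptions; the page does not state how DAZL formats `keep_applied`.

```python
from datetime import datetime, timezone


def build_extras(columns_before, columns_kept):
    """Assemble the extras fields listed above (hypothetical helper)."""
    return {
        # Timestamp format is an assumption; the real format may differ.
        "keep_applied": datetime.now(timezone.utc).isoformat(),
        "columns_before": len(columns_before),
        "columns_after": len(columns_kept),
        "columns_kept": list(columns_kept),
    }


before = ["customer_id", "first_name", "last_name", "email",
          "phone", "purchase_date", "amount", "status"]
kept = ["customer_id", "email", "purchase_date", "amount"]

extras = build_extras(before, kept)
print(extras["columns_before"], extras["columns_after"])  # 8 4
print(extras["columns_kept"])
# ['customer_id', 'email', 'purchase_date', 'amount']
```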

Example Usage

# Keep only customer identifier columns
keep:
  columns:
    - "customer_id"
    - "email"
    - "account_number"

# Keep only the columns needed for financial analysis
keep:
  columns:
    - "transaction_date"
    - "amount"
    - "category"
    - "account_type"

Example Output

Input Data

customer_id  first_name  last_name  email              phone         purchase_date  amount  status
1001         John        Smith      john@example.com   555-123-4567  2023-10-15     125.99  Completed
1002         Sarah       Jones      sarah@example.com  555-987-6543  2023-10-16     89.50   Pending
1003         Michael     Brown      mike@example.com   555-456-7890  2023-10-17     215.75  Completed

Kept Output (Using columns: ["customer_id", "email", "purchase_date", "amount"])

customer_id  email              purchase_date  amount
1001         john@example.com   2023-10-15     125.99
1002         sarah@example.com  2023-10-16     89.50
1003         mike@example.com   2023-10-17     215.75

Related Documentation

  • drop step - Remove specific columns (opposite of keep)