DAZL Documentation | Data Analytics A-to-Z Processing Language


Contents

filter

data management

slug: step-filter

Filter Data Step

Purpose

Filters a dataset based on specified conditions, removing rows that don't meet the criteria. Provides detailed tracking of how many records were removed by each filter.

When to Use

  • Remove outliers or unwanted data points
  • Focus analysis on specific segments or conditions
  • Pre-process data before applying other transformations
  • Create targeted subsets of your data
  • Implement business rules for data inclusion/exclusion

How It Works

  1. Takes input data and applies one or more filter conditions
  2. Each filter is processed sequentially as a waterfall
  3. Records are removed if they don't match the specified criteria
  4. Tracks the impact of each filter on the dataset size
  5. Returns the filtered dataset along with detailed waterfall statistics

Parameters

Required

  • where (array) - Filter conditions to apply. Each condition should be a string expression.

Optional

  • strict_mode (boolean) - When true, throws exceptions for invalid filter expressions. Default: false

Input Requirements

  • Any dataset with columns referenced in the filter conditions
  • Columns referenced in filter conditions must exist in the dataset

Filter Expression Syntax

Expressions follow the pattern: column operator value

Supported Operators

  • = - Equals
  • != - Not equals
  • < - Less than
  • > - Greater than
  • <= - Less than or equal to
  • >= - Greater than or equal to
  • contains - String contains
  • in - Value is in list (e.g., status in [active, pending])
  • not in - Value is not in list

Output

Data

  • Filtered version of the input dataset, containing only rows that match all conditions

Extras

  • waterfall - Detailed statistics tracking the impact of each filter:
    • initial_count - Number of rows before filtering
    • filters - Array with details for each filter:
    • expression - The filter condition applied
    • before_count - Row count before this filter
    • after_count - Row count after this filter
    • removed_count - Number of rows removed by this filter
    • final_count - Number of rows in the final dataset

Example Usage

filter:
  where:
    - "age > 18"
    - "status = active"
    - "region in [North, South, East]"
    - "name contains Smith"

Alternative Syntax

Filter conditions can also be passed as a direct array:

filter:
  - "age > 18" 
  - "status = active"
  - "region in [North, South, East]"

Example Output

Data (filtered rows only)

id name age status region
1 John Smith 35 active North
4 Jane Smith 42 active East

Extras (waterfall statistics)

extras.waterfall = {
  "initial_count": 100,
  "filters": [
    {
      "expression": "age > 18",
      "before_count": 100,
      "after_count": 82,
      "removed_count": 18
    },
    {
      "expression": "status = active",
      "before_count": 82,
      "after_count": 45,
      "removed_count": 37
    },
    {
      "expression": "region in [North, South, East]",
      "before_count": 45,
      "after_count": 38,
      "removed_count": 7
    },
    {
      "expression": "name contains Smith",
      "before_count": 38,
      "after_count": 2,
      "removed_count": 36
    }
  ],
  "final_count": 2
}

Related Documentation

  • calculate step - Create new columns based on calculated values, and categorize/classify segments
  • sort step - Arrange filtered data in specific order