Quick Index Pages (1)

Welcome to DAZL

Recipes (24)

Topic Maps (18)

Examples (18)

Tutorials (6)

Reference (7)

filter

data management

slug: step-filter

Filter Data Step

Purpose

Filters a dataset based on specified conditions, removing rows that don't meet the criteria. Provides detailed tracking of how many records were removed by each filter.

When to Use

Remove outliers or unwanted data points
Focus analysis on specific segments or conditions
Pre-process data before applying other transformations
Create targeted subsets of your data
Implement business rules for data inclusion/exclusion

How It Works

Takes input data and applies one or more filter conditions
Each filter is processed sequentially as a waterfall
Records are removed if they don't match the specified criteria
Tracks the impact of each filter on the dataset size
Returns the filtered dataset along with detailed waterfall statistics

Parameters

Required

where (array) - Filter conditions to apply. Each condition should be a string expression.

Optional

strict_mode (boolean) - When true, throws exceptions for invalid filter expressions. Default: false

Input Requirements

Any dataset with columns referenced in the filter conditions
Columns referenced in filter conditions must exist in the dataset

Filter Expression Syntax

Expressions follow the pattern: column operator value

Supported Operators

= - Equals
!= - Not equals
< - Less than
> - Greater than
<= - Less than or equal to
>= - Greater than or equal to
contains - String contains
in - Value is in list (e.g., status in [active, pending])
not in - Value is not in list

Output

Data

Filtered version of the input dataset, containing only rows that match all conditions

Extras

waterfall - Detailed statistics tracking the impact of each filter:
- initial_count - Number of rows before filtering
- filters - Array with details for each filter:
- expression - The filter condition applied
- before_count - Row count before this filter
- after_count - Row count after this filter
- removed_count - Number of rows removed by this filter
- final_count - Number of rows in the final dataset

Example Usage

filter:
  where:
    - "age > 18"
    - "status = active"
    - "region in [North, South, East]"
    - "name contains Smith"

Alternative Syntax

Filter conditions can also be passed as a direct array:

filter:
  - "age > 18" 
  - "status = active"
  - "region in [North, South, East]"

Example Output

Data (filtered rows only)

id	name	age	status	region
1	John Smith	35	active	North
4	Jane Smith	42	active	East

Extras (waterfall statistics)

extras.waterfall = {
  "initial_count": 100,
  "filters": [
    {
      "expression": "age > 18",
      "before_count": 100,
      "after_count": 82,
      "removed_count": 18
    },
    {
      "expression": "status = active",
      "before_count": 82,
      "after_count": 45,
      "removed_count": 37
    },
    {
      "expression": "region in [North, South, East]",
      "before_count": 45,
      "after_count": 38,
      "removed_count": 7
    },
    {
      "expression": "name contains Smith",
      "before_count": 38,
      "after_count": 2,
      "removed_count": 36
    }
  ],
  "final_count": 2
}

DAZL Documentation | Data Analytics A-to-Z Processing Language

Contents