DAZL Documentation | Data Analytics A-to-Z Processing Language


Contents

dataset statement

data management

slug: reference-dataset-statement

Dataset Command

Purpose

Manages dataset loading and handling across all DAZL steps, providing a unified interface to access different data sources. Acts as the core data provider mechanism for the entire DAZL processing pipeline.

When to Use

  • Reference an existing dataset in any step
  • Load data from various sources (arrays, SQL, APIs, files)
  • Pass data between steps in a standardized way
  • Create composite workflows with multiple data sources
  • Switch between data sources without changing step logic
  • Load reference data for enrichment or comparison
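
For instance, moving a step from in-memory data to a database query only changes the dataset specification; the step's own logic is untouched. A hypothetical sketch (the salesData array and sales table are illustrative):

# Before: the filter reads from a work array
filter:
  dataset: salesData
  where:
    - "region = 'North'"

# After: the same filter, now fed by SQL; the where clause is unchanged
filter:
  dataset:
    source: "SELECT * FROM sales"
    type: sql
    options:
      connection: main_db
  where:
    - "region = 'North'"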

How It Works

  1. Each DAZL step can include a dataset parameter to specify its input data
  2. The dataset command processes this parameter and loads the appropriate data
  3. It handles multiple data source types through a consistent interface
  4. It supports both simple string references and complex configuration objects
  5. It resolves references to data in the DAZL work environment
  6. It provides standardized error handling for missing or invalid datasets

Dataset Specification

Simple Reference (String)

dataset: customerData  # References $work['customerData']

Full Configuration (Object)

dataset:
  source: source_identifier
  type: array|sql|api|json|yaml
  options: {}  # Type-specific options

Supported Data Sources

Work Arrays (type: array)

  • Source: Name of an array in the DAZL work environment
  • Options: None required
  • Example:
    dataset:
      source: customerData
      type: array
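
Because work arrays require no options, the object form above should behave the same as the simple string reference:

dataset: customerData  # shorthand for source: customerData, type: array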

SQL Queries (type: sql)

  • Source: SQL query or reference to a stored query
  • Options:
    • connection: Database connection identifier
    • parameters: Query parameters for prepared statements
    • cache: Cache settings (duration, key)
  • Example:
    dataset:
      source: "SELECT * FROM customers WHERE region = :region"
      type: sql
      options:
        connection: main_db
        parameters:
          region: "North"
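
The cache option listed above can be added to avoid re-running the query. A hypothetical sketch based on the (duration, key) settings named above; the exact keys and units are illustrative:

dataset:
  source: "SELECT * FROM customers WHERE region = :region"
  type: sql
  options:
    connection: main_db
    parameters:
      region: "North"
    cache:
      duration: 300            # how long to keep the result (illustrative)
      key: customers_north     # cache key for this query (illustrative)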

API Endpoints (type: api)

  • Source: API endpoint URL or endpoint identifier
  • Options:
    • method: HTTP method (GET, POST, etc.)
    • headers: HTTP headers
    • body: Request body for POST/PUT
    • auth: Authentication details
  • Example:
    dataset:
      source: "https://api.example.com/v1/products"
      type: api
      options:
        method: GET
        headers:
          Authorization: "Bearer ${API_TOKEN}"
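
For endpoints that require a request body, the body option listed above comes into play. A hypothetical POST sketch (the endpoint, payload shape, and field names are illustrative):

dataset:
  source: "https://api.example.com/v1/products/search"
  type: api
  options:
    method: POST
    headers:
      Content-Type: "application/json"
    body:
      category: "electronics"
      limit: 50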

JSON Files (type: json)

  • Source: File path or JSON string
  • Options:
    • path: Alternate file path specification
    • jsonPath: JSON path expression for data extraction
  • Example:
    dataset:
      source: "/data/products.json"
      type: json
      options:
        jsonPath: "$.products[*]"
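
Since source accepts a JSON string as well as a file path, small fixture datasets can be supplied inline. A hypothetical sketch (the records are illustrative):

dataset:
  source: '[{"id": 1, "name": "Widget"}, {"id": 2, "name": "Gadget"}]'
  type: json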

YAML Files (type: yaml)

  • Source: File path or YAML string
  • Options:
    • path: Alternate file path specification
    • node: Path to specific node in YAML document
  • Example:
    dataset:
      source: "/data/config.yaml"
      type: yaml
      options:
        node: "settings.defaults"
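
Likewise, source accepts a YAML string as well as a file path, so the node option can extract from an embedded document. A hypothetical sketch, assuming node takes a dot-separated path as in the example above:

dataset:
  source: |
    settings:
      defaults:
        currency: USD
  type: yaml
  options:
    node: "settings.defaults"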

Example Usage

In a Filter Step

filter:
  dataset: salesData
  where:
    - "region = 'North'"
    - "sales > 1000"

In a Chart Step with SQL Source

chart:
  dataset:
    source: "SELECT product, SUM(revenue) as total FROM sales GROUP BY product ORDER BY total DESC LIMIT 10"
    type: sql
    options:
      connection: reporting_db
  type: bar
  x_axis: product
  y_axis: total
  title: "Top 10 Products by Revenue"

In a Combine Step with Multiple Sources

combine:
  method: join
  datasets:
    customers:
      source: customers
      type: array
    orders:
      source: "SELECT * FROM orders WHERE date >= :start_date"
      type: sql
      options:
        connection: orders_db
        parameters:
          start_date: "2024-01-01"
  join_on:
    left: customer_id
    right: customer_id

Related Documentation

  • work-environment - How datasets are stored in the DAZL work environment
  • sql-connection-configuration - Setting up database connections for SQL datasets
  • api-configuration - Configuring API endpoints for data access
  • json-path-syntax - Using JSON path expressions for data extraction
  • cache-configuration - Optimizing data access with caching