DAZL Documentation | Data Analytics A-to-Z Processing Language


trainModel

machine learning

slug: step-trainmodel

Purpose

Trains a machine learning model on a dataset, using a specified target column and a set of feature columns. This step lets a workflow perform predictive modeling (classification or regression) directly within the pipeline. The trained model is stored in extras['ml'] for downstream use, such as scoring, evaluation, or visualization.

When to Use

  • Build predictive models for classification or regression
  • Evaluate relationships between features and an outcome
  • Generate models for scoring new data in the workflow
  • Integrate ML workflows without leaving the nollejScript pipeline

How It Works

  1. Extracts input components from the pipeline: data, pdv, and extras.
  2. Merges default ML parameters with those provided in $params.
  3. Uses the MachineLearning class to train the model with:
    • target: The column to predict
    • features: List of columns used as predictors
    • modelType: Type of model (e.g., logistic, linear, knn)
    • params: Optional model hyperparameters (learning rate, regularization, etc.)
  4. Stores the trained model in extras['ml'] for downstream scoring or analysis.
  5. Returns the original dataset unchanged.
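
The flow described in the steps above can be outlined in a few lines. The following Python sketch is illustrative only and is not DAZL's actual implementation: the function name train_model_step, the DEFAULT_PARAMS table, and the trainer callable (standing in for the MachineLearning class) are all hypothetical. The default values are copied from the Optional parameters listed on this page.

```python
from typing import Any, Callable

# Defaults copied from the Optional parameters documented on this page.
DEFAULT_PARAMS: dict[str, Any] = {
    "params": {},
    "categorical": "auto",
    "missing_values": "error",
    "normalize": True,
    "test_size": None,
    "random_state": 42,
    "k": 5,
    "distance_metric": "euclidean",
}


def train_model_step(
    pipeline_input: dict[str, Any],
    params: dict[str, Any],
    trainer: Callable[[list, dict[str, Any]], Any],
) -> dict[str, Any]:
    """Hypothetical outline of the trainModel flow (not DAZL's real code)."""
    # Extract the input components from the pipeline.
    data = pipeline_input.get("data", [])
    pdv = pipeline_input.get("pdv", {})
    extras = dict(pipeline_input.get("extras", {}))  # shallow copy; input stays untouched

    # Merge defaults with user-supplied parameters (user values win).
    merged = {**DEFAULT_PARAMS, **params}

    # Train the model; `trainer` stands in for the MachineLearning class.
    extras["ml"] = trainer(data, merged)

    # Pass data and pdv through unchanged; the model travels in extras['ml'].
    return {"data": data, "pdv": pdv, "extras": extras, "outputType": "array"}
```

Passing the trainer in as a callable keeps the sketch self-contained; in the real step that role is played by the MachineLearning class, whose interface is not documented here.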

Parameters

Required

  • target (string) — Column name of the outcome variable to predict.
  • features (array) — List of predictor columns.
  • modelType (string) — Type of ML model to train (e.g., logistic, linear, knn).

Optional

  • params (array) — Additional hyperparameters for the model (learning rate, regularization, etc.)
  • categorical (string or array) — Columns to treat as categorical; default 'auto'
  • missing_values (string) — How to handle missing values; default 'error'
  • normalize (boolean) — Whether to normalize numeric features; default true
  • test_size (float) — Fraction of data reserved for testing; default null
  • random_state (int) — Random seed for reproducibility; default 42
  • k (int) — Number of neighbors for k-NN models; default 5
  • distance_metric (string) — Metric for distance-based models; default 'euclidean'
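
To illustrate how test_size and random_state typically interact, here is a minimal Python sketch of a seeded hold-out split. It rests on assumptions this page does not confirm: that a null test_size means all rows are used for training, and that the split is a simple shuffle. The function name holdout_split is hypothetical.

```python
import random


def holdout_split(rows, test_size=None, random_state=42):
    """Return (train_rows, test_rows). Illustrative only, not DAZL's code."""
    if test_size is None:
        # Assumption: without a test_size, every row is used for training.
        return list(rows), []
    if not 0 < test_size < 1:
        raise ValueError("test_size must be a fraction between 0 and 1")
    shuffled = list(rows)
    # A seeded generator makes the shuffle, and so the split, repeatable.
    random.Random(random_state).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_size))
    return shuffled[n_test:], shuffled[:n_test]
```

With the same random_state, repeated calls on the same rows produce the same split, which is the point of fixing the seed.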

Input Requirements

  • Dataset (data) must be an array of associative arrays (rows).
  • Columns listed in features and target must exist and contain valid numeric or categorical values appropriate for the model type.
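
These requirements can be checked before a dataset is handed to the step. The Python sketch below is a hypothetical pre-flight check, not part of DAZL: it only verifies that the data is a list of rows and that every feature and target column is present in each row, and it does not validate value types.

```python
def validate_training_rows(rows, features, target):
    """Raise ValueError if rows do not meet the stated input requirements."""
    if not isinstance(rows, list) or not all(isinstance(r, dict) for r in rows):
        raise ValueError("data must be a list of rows (column -> value mappings)")
    required = list(features) + [target]
    for index, row in enumerate(rows):
        # Collect every required column that this row lacks.
        missing = [col for col in required if col not in row]
        if missing:
            raise ValueError(f"row {index} is missing columns: {missing}")
```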

Output

Data

  • Returns the original dataset unchanged.

PDV

  • Passed through unchanged from input.

Extras

  • ml — Contains the trained model object returned by the MachineLearning class.

Output Structure

Key          Description
data         Original dataset array
pdv          Metadata about dataset columns
extras       Contains ml with the trained model
outputType   "array" (indicates structured array output)

Example Usage

steps:
  - loadInline:
      data:
        - {age: 22, income: 38000, spend: 800, outcome: 1}
        - {age: 25, income: 45000, spend: 1200, outcome: 0}
        - {age: 29, income: 56000, spend: 1800, outcome: 1}
      output: trainingData

  - trainModel:
      dataset: trainingData
      target: outcome
      features: [age, income, spend]
      modelType: logistic
      params:
        learning_rate: 0.01
      output: trainedModel

Example Output

{
  "data": [
    {"age":22,"income":38000,"spend":800,"outcome":1},
    {"age":25,"income":45000,"spend":1200,"outcome":0},
    {"age":29,"income":56000,"spend":1800,"outcome":1}
  ],
  "pdv": {},
  "extras": {
    "ml": {
      "modelType": "logistic",
      "coefficients": {"age":0.12,"income":0.0003,"spend":0.01},
      "intercept": -1.23,
      "training_metrics": {"accuracy":0.67,"loss":0.52}
    }
  },
  "outputType": "array"
}
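
As a rough illustration of how a downstream step might use a logistic model shaped like the one in the example output, the Python sketch below scores the three training rows with the listed coefficients and intercept. It assumes the coefficients apply directly to the raw feature values and that 0.5 is the decision threshold; neither is confirmed by this page, and the helper name predict_probability is hypothetical.

```python
import math

# Values copied from the example output above.
model = {
    "coefficients": {"age": 0.12, "income": 0.0003, "spend": 0.01},
    "intercept": -1.23,
}
rows = [
    {"age": 22, "income": 38000, "spend": 800, "outcome": 1},
    {"age": 25, "income": 45000, "spend": 1200, "outcome": 0},
    {"age": 29, "income": 56000, "spend": 1800, "outcome": 1},
]


def predict_probability(model, row):
    """Logistic function applied to the linear combination of features."""
    z = model["intercept"] + sum(
        weight * row[name] for name, weight in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Assumption: a probability of 0.5 or more is classified as 1.
predictions = [1 if predict_probability(model, r) >= 0.5 else 0 for r in rows]
accuracy = sum(p == r["outcome"] for p, r in zip(predictions, rows)) / len(rows)
print(predictions, round(accuracy, 2))  # [1, 1, 1] 0.67
```

Under these assumptions every row is predicted as 1, so two of the three outcomes match, which agrees with the accuracy of 0.67 shown in training_metrics.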

Related Documentation

  • predict-step – Score new datasets using a trained model
  • calculate-step – Generate features before model training
  • filter-step – Prepare or subset data for training
  • univariate-step – Explore numeric features before modeling