DAZL Documentation | Data Analytics A-to-Z Processing Language

trainModel

machine learning

slug: step-trainmodel

Purpose

Trains a machine learning model on a dataset, using the specified feature columns to predict a target column. This step lets a workflow include classification or regression modeling directly within the pipeline. The trained model is stored in extras under the ml key for downstream use, such as scoring, evaluation, or visualization.

When to Use

Build predictive models for classification or regression
Evaluate relationships between features and an outcome
Generate models for scoring new data in the workflow
Integrate ML workflows without leaving the nollejScript pipeline

How It Works

Extracts input components from the pipeline: data, pdv, and extras.
Merges default ML parameters with those provided in $params.
Uses the MachineLearning class to train the model:
- target: The column to predict
- features: List of columns used as predictors
- modelType: Type of model (e.g., logistic, linear, knn)
- params: Optional model hyperparameters (learning rate, regularization, etc.)
Returns the original dataset unchanged.
Stores the trained model in extras['ml'] for downstream scoring or analysis.
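
The flow above can be sketched in Python. This is only an illustrative outline, not DAZL's implementation: the function names, the payload shape, and the trainer(data, params) call are assumptions made for this sketch, and trainer merely stands in for the MachineLearning class. The default values are the ones listed under Optional on this page.

```python
# Defaults as listed under "Optional" on this page.
DEFAULTS = {
    "categorical": "auto",
    "missing_values": "error",
    "normalize": True,
    "test_size": None,
    "random_state": 42,
    "k": 5,
    "distance_metric": "euclidean",
}


def merge_params(user_params):
    """Overlay user-supplied values on the defaults (user values win)."""
    return {**DEFAULTS, **user_params}


def train_model_step(payload, user_params, trainer):
    """Hypothetical outline; `trainer` stands in for the MachineLearning class."""
    data = payload["data"]
    pdv = payload.get("pdv", {})
    extras = payload.get("extras", {})

    params = merge_params(user_params)
    model = trainer(data, params)  # assumed call shape, not DAZL's real API

    return {
        "data": data,                       # original dataset, returned as-is
        "pdv": pdv,                         # passed through unchanged
        "extras": {**extras, "ml": model},  # trained model for later steps
        "outputType": "array",
    }
```

Existing entries in extras are kept in this sketch and the model is added under ml; whether DAZL preserves other extras the same way is not stated here.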

Parameters

Required

target (string) — Column name of the outcome variable to predict.
features (array) — List of predictor columns.
modelType (string) — Type of ML model to train (e.g., logistic, linear, knn).

Optional

params (array) — Additional hyperparameters for the model (learning rate, regularization, etc.)
categorical (string or array) — Columns to treat as categorical; default 'auto'
missing_values (string) — How to handle missing values; default 'error'
normalize (boolean) — Whether to normalize numeric features; default true
test_size (float) — Fraction of data reserved for testing; default null
random_state (int) — Random seed for reproducibility; default 42
k (int) — Number of neighbors for k-NN models; default 5
distance_metric (string) — Metric for distance-based models; default 'euclidean'
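
To illustrate how test_size and random_state usually interact, here is a minimal Python sketch of a seeded holdout split. It is an assumption about typical behavior, not a description of how DAZL splits data: split_rows is a hypothetical helper, and treating a null test_size as "no holdout" is a guess based on the documented default.

```python
import random


def split_rows(rows, test_size=None, random_state=42):
    """Return (train_rows, test_rows) using a seeded shuffle."""
    if test_size is None:
        # Assumed meaning of the default: train on every row.
        return list(rows), []
    if not 0 < test_size < 1:
        raise ValueError("test_size must be a fraction between 0 and 1")

    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible per seed.
    random.Random(random_state).shuffle(shuffled)

    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]
```

Calling the helper twice with the same rows, fraction, and seed yields the same split, which is the point of fixing random_state.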

Input Requirements

Dataset (data) must be an array of associative arrays (rows).
Columns listed in features and target must exist and contain valid numeric or categorical values appropriate for the model type.
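
These requirements can be checked before training. The following Python sketch is a hypothetical pre-flight check, not part of DAZL: validate_training_input is an invented name, and rejecting an empty dataset is an added assumption. It verifies only the row shape and column presence, not whether values suit the model type.

```python
def validate_training_input(data, features, target):
    """Raise ValueError if rows or required columns are missing."""
    if not isinstance(data, list) or not data:
        # Assumption: an empty dataset is treated as invalid.
        raise ValueError("data must be a non-empty list of rows")

    required = list(features) + [target]
    for index, row in enumerate(data):
        if not isinstance(row, dict):
            raise ValueError(f"row {index} is not a mapping of column to value")
        missing = [col for col in required if col not in row]
        if missing:
            raise ValueError(f"row {index} is missing columns: {missing}")
    return True
```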

Output

Data

Returns the original dataset unchanged.

PDV

Passed through unchanged from input.

Extras

ml — Contains the trained model object returned by the MachineLearning class.

Output Structure

Key	Description
`data`	Original dataset array
`pdv`	Metadata about dataset columns
`extras`	Contains `ml` with the trained model
`outputType`	`"array"` — Indicates structured array output

Example Usage

steps:
  - loadInline:
      data:
        - {age: 22, income: 38000, spend: 800, outcome: 1}
        - {age: 25, income: 45000, spend: 1200, outcome: 0}
        - {age: 29, income: 56000, spend: 1800, outcome: 1}
      output: trainingData

  - trainModel:
      dataset: trainingData
      target: outcome
      features: [age, income, spend]
      modelType: logistic
      params:
        learning_rate: 0.01
      output: trainedModel

Example Output

{
  "data": [
    {"age":22,"income":38000,"spend":800,"outcome":1},
    {"age":25,"income":45000,"spend":1200,"outcome":0},
    {"age":29,"income":56000,"spend":1800,"outcome":1}
  ],
  "pdv": {},
  "extras": {
    "ml": {
      "modelType": "logistic",
      "coefficients": {"age":0.12,"income":0.0003,"spend":0.01},
      "intercept": -1.23,
      "training_metrics": {"accuracy":0.67,"loss":0.52}
    }
  },
  "outputType": "array"
}
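
As a rough illustration of downstream scoring, the Python sketch below applies the logistic function to the coefficients and intercept from the example output. It assumes the coefficients apply directly to raw column values, which may not hold when normalize is true, and predict_proba is a hypothetical helper rather than a DAZL function.

```python
import math

# Values copied from extras["ml"] in the example output above.
model = {
    "coefficients": {"age": 0.12, "income": 0.0003, "spend": 0.01},
    "intercept": -1.23,
}

rows = [
    {"age": 22, "income": 38000, "spend": 800, "outcome": 1},
    {"age": 25, "income": 45000, "spend": 1200, "outcome": 0},
    {"age": 29, "income": 56000, "spend": 1800, "outcome": 1},
]


def predict_proba(model, row):
    """Logistic probability: sigmoid(intercept + sum(coef * value))."""
    z = model["intercept"] + sum(
        coef * row[name] for name, coef in model["coefficients"].items()
    )
    return 1.0 / (1.0 + math.exp(-z))


# Classify with a 0.5 threshold (an assumed cut-off) and compare to outcome.
predictions = [1 if predict_proba(model, r) >= 0.5 else 0 for r in rows]
accuracy = sum(p == r["outcome"] for p, r in zip(predictions, rows)) / len(rows)
```

Under these assumptions all three rows score above 0.5, so two of the three predictions match the outcome column, which is consistent with the 0.67 accuracy shown in the example output.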
