DAZL Documentation | Data Analytics A-to-Z Processing Language

Understanding trainModel Parameters

machine learning

slug: tutorial-understanding-trainmodel-parameters

Tutorial: Understanding `trainModel` Parameters

This tutorial explains each trainModel parameter: what it does, how it affects the trained model, and when to use it.

The trainModel step trains a machine learning model as part of your workflow. Its parameters let you choose the model, control how the data is handled, and tune performance.

Required Parameters

`dataset`

Type: Array of associative arrays (rows)
Description: The input dataset used to train the model.
Use Case: Must contain the features and target variable you intend to use for training.

`target`

Type: String
Description: The name of the column to predict.
Use Case: Specifies the outcome variable (dependent variable) for regression or classification tasks.
Example: "spend" or "churn"

`features`

Type: Array of strings
Description: List of columns used as predictors.
Use Case: Select only relevant features to improve model accuracy and reduce noise.
Example: ["age", "income", "previous_purchases"]

`modelType`

Type: String
Description: Type of model to train. Common options include:
- "linear" – Linear regression
- "logistic" – Logistic regression
- "knn" – k-nearest neighbors
Use Case: Choose a model appropriate for your target variable type, for example "linear" for a numeric target such as "spend" or "logistic" for a categorical target such as "churn".

Optional Parameters

`params`

Type: Object / associative array
Description: Model-specific hyperparameters (e.g., learning rate, regularization).
Use Case: Fine-tune the model for better performance.
Example: {"learning_rate": 0.01, "max_iterations": 1000}

`categorical`

Type: 'auto' | Array of column names
Description: Specifies which columns should be treated as categorical.
Use Case: Ensures categorical features are encoded for models that require numeric input.
Default: 'auto' (categorical columns are detected automatically).
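
To make the idea of encoding concrete, the following Python sketch shows one common scheme, one-hot encoding, applied to a list of row dictionaries. This is an illustration only: the tutorial does not state which encoding trainModel uses, and the `one_hot` function and the `plan` column are hypothetical names chosen for the example.

```python
def one_hot(rows, column):
    """Replace one categorical column with 0/1 indicator columns.

    Hypothetical helper for illustration; not part of DAZL.
    """
    # Collect the distinct category values in a stable order.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every column except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for cat in categories:
            new_row[f"{column}_{cat}"] = 1 if row[column] == cat else 0
        encoded.append(new_row)
    return encoded


rows = [
    {"age": 30, "plan": "basic"},
    {"age": 41, "plan": "premium"},
]
encoded = one_hot(rows, "plan")
```

After encoding, every column is numeric, which is what regression and distance-based models generally expect as input.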

`missing_values`

Type: 'error' | 'ignore' | 'impute'
Description: How to handle missing values in the dataset.
Use Case: Prevent model training errors due to missing data.
- 'error' → Throws an error if any missing values exist
- 'ignore' → Skips rows with missing values
- 'impute' → Fills missing values using a strategy (mean, median, mode)
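
The three modes can be sketched in Python as follows. This is a minimal illustration of the behavior described above, not DAZL's implementation: the `handle_missing` function is a hypothetical name, missing values are assumed to be represented as `None`, and mean imputation is assumed for the 'impute' mode.

```python
from statistics import mean


def handle_missing(rows, columns, mode="error"):
    """Mimic the three missing_values modes on a list of row dictionaries.

    Hypothetical helper for illustration; not part of DAZL.
    """
    def is_missing(row, col):
        return row.get(col) is None

    if mode == "error":
        # Fail as soon as any missing value is found.
        for i, row in enumerate(rows):
            for col in columns:
                if is_missing(row, col):
                    raise ValueError(f"missing value in row {i}, column {col!r}")
        return rows

    if mode == "ignore":
        # Skip every row that has at least one missing value.
        return [r for r in rows if not any(is_missing(r, c) for c in columns)]

    if mode == "impute":
        # Assumed strategy: fill each gap with the mean of the observed values.
        fill = {
            c: mean(r[c] for r in rows if not is_missing(r, c))
            for c in columns
        }
        return [
            {**r, **{c: fill[c] for c in columns if is_missing(r, c)}}
            for r in rows
        ]

    raise ValueError(f"unknown mode: {mode!r}")


rows = [
    {"age": 30, "income": 50000},
    {"age": None, "income": 62000},
    {"age": 40, "income": 58000},
]
kept = handle_missing(rows, ["age", "income"], mode="ignore")
imputed = handle_missing(rows, ["age", "income"], mode="impute")
```

With this data, 'ignore' keeps two of the three rows, while 'impute' keeps all three and fills the missing age with 35, the mean of 30 and 40.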

`normalize`

Type: Boolean
Description: Whether numeric features should be normalized (scaled).
Use Case: Ensures features with large ranges don’t dominate the model, especially important for distance-based models (like k-NN) or gradient-based optimization.
Default: true
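
As an illustration of what scaling does, the Python sketch below applies min-max scaling, which maps each column to the range 0 to 1. The tutorial does not say which scaling method trainModel applies, so treat the method and the `min_max_scale` name as assumptions made for this example.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1].

    Hypothetical helper for illustration; not part of DAZL.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


ages = [20, 30, 40]
incomes = [30000, 60000, 90000]

scaled_ages = min_max_scale(ages)        # [0.0, 0.5, 1.0]
scaled_incomes = min_max_scale(incomes)  # [0.0, 0.5, 1.0]
```

Before scaling, a difference of 30000 in income would swamp a difference of 10 in age in any distance calculation. After scaling, both columns span the same range and contribute comparably.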

`test_size`

Type: Float between 0 and 1
Description: Fraction of the dataset reserved for testing.
Use Case: Split the data into training and testing sets for model evaluation.
Example: 0.2 reserves 20% for testing.

`random_state`

Type: Integer
Description: Seed for random number generators used in splitting or model initialization.
Use Case: Ensures reproducible results across runs.
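
The interaction between test_size and random_state can be sketched in Python as a seeded shuffle followed by a split. This is a generic illustration under the assumption that the split is a simple random one; the `train_test_split` function here is a hypothetical stand-in, not DAZL's internal code.

```python
import random


def train_test_split(rows, test_size=0.2, random_state=None):
    """Shuffle rows with a seeded generator, then split off a test fraction.

    Hypothetical helper for illustration; not part of DAZL.
    """
    rng = random.Random(random_state)  # same seed -> same shuffle order
    shuffled = list(rows)              # copy so the input is left untouched
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))
train_a, test_a = train_test_split(rows, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(rows, test_size=0.2, random_state=42)
```

Both calls use the same seed, so they produce identical training and test sets. With 10 rows and a test_size of 0.2, 2 rows are held out and 8 are used for training.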

`k`

Type: Integer
Description: Number of neighbors for k-NN models.
Use Case: Controls model complexity in k-NN: a small k follows local patterns closely and is more sensitive to noise, while a large k gives smoother predictions.

`distance_metric`

Type: String ('euclidean', 'manhattan', etc.)
Description: Distance metric used in distance-based models like k-NN.
Use Case: Choice of metric can affect neighbor selection and model accuracy.
Default: 'euclidean'
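
To show how k and distance_metric work together, here is a small Python sketch of k-NN classification with two interchangeable distance functions. It is a generic textbook version, not DAZL's implementation; the function names, the sample points, and the "stay"/"churn" labels are made up for the example.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train, query, k=3, distance=euclidean):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs.
    Hypothetical helper for illustration; not part of DAZL.
    """
    nearest = sorted(train, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


train = [
    ((0.0, 0.0), "stay"),
    ((0.1, 0.2), "stay"),
    ((0.2, 0.1), "stay"),
    ((1.0, 1.0), "churn"),
    ((0.9, 0.8), "churn"),
]
query = (0.95, 0.9)

local_vote = knn_predict(train, query, k=1)   # nearest neighbor only
global_vote = knn_predict(train, query, k=5)  # every point votes
```

With k=1 the query takes the label of its single nearest neighbor, "churn". With k=5 all five points vote and the three "stay" points win, even though they are far away. The two metrics also measure the same pair of points differently: between (0, 0) and (3, 4), the Euclidean distance is 5.0 and the Manhattan distance is 7.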

Example: Parameter Usage

steps:
  - trainModel:
      dataset: customerData
      target: spend
      features: [age, income]
      modelType: linear
      params:
        learning_rate: 0.01
        max_iterations: 1000
      categorical: auto
      missing_values: impute
      normalize: true
      test_size: 0.2
      random_state: 42
      output: spendModel

Explanation:

dataset: customerData containing historical spend
target: "spend" column to predict
features: "age" and "income" as predictors
modelType: Linear regression
params: Learning rate and max iterations for gradient descent
categorical: Auto-detect categorical columns
missing_values: Impute missing values
normalize: Scale numeric features
test_size: Reserve 20% for testing
random_state: Ensure reproducibility
output: Store trained model as spendModel

Tips & Best Practices

Always ensure that features exist in both training and prediction datasets.
Normalize numeric features for distance-based or gradient-based models.
Handle missing values appropriately before training.
Tune hyperparameters in params to improve model performance.
Use test_size to hold out data for evaluation and random_state to make the split reproducible, so that results can be compared across runs.
