DAZL Documentation | Data Analytics A-to-Z Processing Language


Tutorial: Understanding trainModel Parameters

This tutorial explains what each trainModel parameter does, how it affects the model, and when to use it.

The trainModel step trains a machine learning model as part of your workflow. Its parameters control which data the model learns from, how that data is prepared, and how the model is fitted and evaluated.


Required Parameters

dataset

  • Type: Array of associative arrays (rows)
  • Description: The input dataset used to train the model.
  • Use Case: Must contain the features and target variable you intend to use for training.

target

  • Type: String
  • Description: The name of the column to predict.
  • Use Case: Specifies the outcome variable (dependent variable) for regression or classification tasks.
  • Example: "spend" (a numeric target, for regression) or "churn" (a categorical target, for classification)

features

  • Type: Array of strings
  • Description: List of columns used as predictors.
  • Use Case: Select only relevant features to improve model accuracy and reduce noise.
  • Example: ["age", "income", "previous_purchases"]
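
To make the shapes of dataset, target, and features concrete, the following minimal Python sketch represents a dataset as an array of rows and checks that the chosen target and features appear in every row. It is illustrative only: the sample rows and the check_columns helper are hypothetical and are not part of DAZL.

```python
# Hypothetical sample rows: each row is an associative array (a dict in Python).
customer_data = [
    {"age": 34, "income": 52000, "previous_purchases": 3, "spend": 410.0},
    {"age": 51, "income": 78000, "previous_purchases": 9, "spend": 980.5},
]

def check_columns(dataset, target, features):
    """Return the sorted names of required columns missing from at least one row."""
    required = set(features) | {target}
    missing = set()
    for row in dataset:
        missing |= required - set(row)
    return sorted(missing)

print(check_columns(customer_data, "spend", ["age", "income"]))  # []
print(check_columns(customer_data, "spend", ["age", "region"]))  # ['region']
```

A check like this, run before training, catches a misspelled column name early.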

modelType

  • Type: String
  • Description: Type of model to train. Common options include:

    • "linear" – Linear regression
    • "logistic" – Logistic regression
    • "knn" – k-nearest neighbors
  • Use Case: Choose a model that matches the type of your target variable, for example "linear" for a numeric target and "logistic" for a categorical one.

Optional Parameters

params

  • Type: Object / associative array
  • Description: Model-specific hyperparameters (e.g., learning rate, regularization).
  • Use Case: Fine-tune the model for better performance.
  • Example: {"learning_rate": 0.01, "max_iterations": 1000}
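
To show what hyperparameters such as learning_rate and max_iterations typically control, here is a small Python sketch of gradient descent fitting a straight line. It is a generic illustration under the assumption that the model is fitted iteratively; fit_line is a hypothetical helper and says nothing about how DAZL trains models internally.

```python
def fit_line(xs, ys, learning_rate=0.01, max_iterations=1000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_iterations):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) * 2 / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) * 2 / n
        # learning_rate sets the size of each update step.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

few = fit_line(xs, ys, learning_rate=0.01, max_iterations=10)
many = fit_line(xs, ys, learning_rate=0.05, max_iterations=5000)
print(few)   # still far from (2, 1): too few small steps
print(many)  # very close to (2, 1)
```

Too few iterations or too small a learning rate stops before the fit has converged, while a learning rate that is too large can make the updates diverge.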

categorical

  • Type: 'auto' | Array of column names
  • Description: Specifies which columns should be treated as categorical.
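  • Use Case: Lets categorical features be encoded automatically for models that expect numeric input. Pass an explicit list of column names when you want to control which columns are treated as categorical.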
  • Default: 'auto' detects categorical columns automatically.
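
The tutorial does not state which encoding DAZL applies to categorical columns. As an illustration of what encoding a categorical feature means, the Python sketch below uses one-hot encoding, a common choice; the one_hot helper and the "plan" column are hypothetical.

```python
def one_hot(dataset, column):
    """Replace a categorical column with one 0/1 column per distinct value."""
    categories = sorted({row[column] for row in dataset})
    encoded = []
    for row in dataset:
        # Copy every column except the one being encoded.
        new_row = {k: v for k, v in row.items() if k != column}
        for category in categories:
            new_row[f"{column}={category}"] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded

rows = [
    {"age": 34, "plan": "basic"},
    {"age": 51, "plan": "premium"},
]
encoded_rows = one_hot(rows, "plan")
print(encoded_rows[0])  # {'age': 34, 'plan=basic': 1, 'plan=premium': 0}
```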

missing_values

  • Type: 'error' | 'ignore' | 'impute'
  • Description: How to handle missing values in the dataset. Options:

    • 'error' → Throws an error if any missing values exist
    • 'ignore' → Skips rows with missing values
    • 'impute' → Fills missing values using a strategy (mean, median, mode)
  • Use Case: Prevent model training errors due to missing data.
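
The three modes can be sketched in plain Python as follows. This is a simplified illustration for a single numeric column, assuming missing values are represented as None and that imputation uses the mean; handle_missing is a hypothetical helper, not DAZL's implementation.

```python
from statistics import mean

def handle_missing(dataset, column, mode):
    """Apply one of the three documented modes to a single numeric column."""
    has_missing = any(row[column] is None for row in dataset)
    if mode == "error":
        if has_missing:
            raise ValueError(f"missing values in column '{column}'")
        return dataset
    if mode == "ignore":
        # Drop every row where the value is missing.
        return [row for row in dataset if row[column] is not None]
    if mode == "impute":
        # Fill gaps with the mean of the values that are present.
        fill = mean(row[column] for row in dataset if row[column] is not None)
        return [
            {**row, column: fill if row[column] is None else row[column]}
            for row in dataset
        ]
    raise ValueError(f"unknown mode: {mode}")

rows = [{"income": 40000.0}, {"income": None}, {"income": 60000.0}]
ignored = handle_missing(rows, "income", "ignore")
imputed = handle_missing(rows, "income", "impute")
print(len(ignored))          # 2
print(imputed[1]["income"])  # 50000.0
```

Note the trade-off: 'ignore' shrinks the training set, while 'impute' keeps every row at the cost of introducing estimated values.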

normalize

  • Type: Boolean
  • Description: Whether numeric features should be normalized (scaled).
  • Use Case: Prevents features with large ranges from dominating the model. This matters most for distance-based models (like k-NN) and for gradient-based optimization.
  • Default: true
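
The tutorial does not say which scaling method normalize applies. As one possible illustration, the Python sketch below uses min-max scaling, which maps each column to the range 0 to 1; min_max_scale is a hypothetical helper.

```python
def min_max_scale(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 40, 60]
incomes = [30000, 60000, 90000]
print(min_max_scale(ages))     # [0.0, 0.5, 1.0]
print(min_max_scale(incomes))  # [0.0, 0.5, 1.0]
```

Before scaling, a difference of 30000 in income would swamp a difference of 20 in age in any distance calculation; after scaling, both columns contribute on the same footing.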

test_size

  • Type: Float between 0 and 1
  • Description: Fraction of the dataset reserved for testing.
  • Use Case: Split the data into training and testing sets for model evaluation.
  • Example: 0.2 reserves 20% for testing.

random_state

  • Type: Integer
  • Description: Seed for random number generators used in splitting or model initialization.
  • Use Case: Ensures reproducible results across runs.
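
How test_size and random_state work together can be sketched in Python. The split_rows helper below is hypothetical and only illustrates the general idea of a seeded shuffle followed by a split; DAZL's own splitting logic may differ.

```python
import random

def split_rows(dataset, test_size=0.2, random_state=None):
    """Shuffle a copy of the rows with a seeded generator, then split them."""
    rows = list(dataset)
    random.Random(random_state).shuffle(rows)
    n_test = round(len(rows) * test_size)
    # The first n_test shuffled rows form the test set, the rest the training set.
    return rows[n_test:], rows[:n_test]

data = list(range(10))
train_a, test_a = split_rows(data, test_size=0.2, random_state=42)
train_b, test_b = split_rows(data, test_size=0.2, random_state=42)
print(len(train_a), len(test_a))  # 8 2
print(test_a == test_b)           # True
```

With the same seed, both calls produce the same split, so evaluation results can be compared across runs.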

k

  • Type: Integer
  • Description: Number of neighbors for k-NN models.
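  • Use Case: Controls model complexity and local sensitivity in k-NN. Small values follow local patterns closely but are sensitive to noise; larger values give smoother predictions.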

distance_metric

  • Type: String ('euclidean', 'manhattan', etc.)
  • Description: Distance metric used in distance-based models like k-NN.
  • Use Case: Choice of metric can affect neighbor selection and model accuracy.
  • Default: 'euclidean'
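
To illustrate how k and distance_metric interact, here is a minimal Python sketch of k-NN classification with two distance functions. The helper names and sample points are hypothetical, and the sketch is a textbook version of the technique rather than DAZL's implementation.

```python
from collections import Counter
import math

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences along each axis."""
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(points, labels, query, k=3, distance=euclidean):
    """Classify `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(points, labels), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1), (1.2, 0.8)]
labels = ["stay", "stay", "churn", "churn", "churn"]

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
print(knn_predict(points, labels, (0.2, 0.1), k=1))  # stay
print(knn_predict(points, labels, (0.2, 0.1), k=5))  # churn
```

With k=1 the prediction follows the single closest point; with k=5 every point votes and the majority class wins, which shows why k should be tuned rather than guessed.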

Example: Parameter Usage

steps:
  - trainModel:
      dataset: customerData
      target: spend
      features: [age, income]
      modelType: linear
      params:
        learning_rate: 0.01
        max_iterations: 1000
      categorical: auto
      missing_values: impute
      normalize: true
      test_size: 0.2
      random_state: 42
      output: spendModel

Explanation:

  • dataset: customerData containing historical spend
  • target: "spend" column to predict
  • features: "age" and "income" as predictors
  • modelType: Linear regression
  • params: Learning rate and max iterations for gradient descent
  • categorical: Auto-detect categorical columns
  • missing_values: Impute missing values
  • normalize: Scale numeric features
  • test_size: Reserve 20% for testing
  • random_state: Ensure reproducibility
  • output: Store trained model as spendModel

Tips & Best Practices

  • Always ensure that features exist in both training and prediction datasets.
  • Normalize numeric features for distance-based or gradient-based models.
  • Decide how missing values should be handled before training, and set missing_values to 'error', 'ignore', or 'impute' accordingly.
  • Tune hyperparameters in params to improve model performance.
  • Use test_size to hold out data for evaluating the model, and set random_state so the split is reproducible across runs.