DAZL Documentation | Data Analytics A-to-Z Processing Language


Recipe: Clustering analysis using k-means

category: exploratory statistics

Problem

You need to identify natural groupings in your data, for example to:

  • segment customers based on behavior or demographics
  • detect patterns or clusters in transactional or numeric data
  • inform targeted marketing, promotions, or analysis

Solution

Follow these steps to perform k-means clustering:

  • load the dataset
  • select numeric fields or derived features for clustering
  • apply k-means with a chosen number of clusters
  • review cluster assignments and characteristics
  • optionally visualize clusters
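
To make the clustering step concrete, here is a minimal sketch of what k-means does internally (Lloyd's algorithm), written in plain Python rather than DAZL syntax. It is an illustration only, not DAZL's implementation: the `kmeans` function, its parameters, and the sample `(amount, frequency)` rows are assumptions made up for this example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: return (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Initialise centroids with k distinct observations picked at random.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # assignments are stable, so the algorithm has converged
        centroids = new_centroids
    return centroids, labels


# Hypothetical (amount, frequency) rows standing in for transactions_clean.
rows = [(10.0, 1.0), (12.0, 2.0), (11.0, 1.0),
        (200.0, 9.0), (210.0, 10.0), (190.0, 8.0)]
centroids, labels = kmeans(rows, k=2)
print(labels)
```

The result depends on the random starting centroids, which is why the sketch takes a `seed`. Recording the seed alongside the chosen number of clusters keeps a run reproducible.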

Step Sequence

load step -> [step-kmeans] -> calculate step -> chart step

Input Datasets

  • transactions_clean — cleaned transactional data
  • Notes: numeric fields like amount, frequency, recency, or other derived metrics

Output Dataset

  • clustered_data — dataset with cluster assignments for each observation
  • Notes: clusters can be used for segmentation, targeting, or further analysis

Step-By-Step Explanation

Step | Purpose | Notes
load step | Load dataset | Supports local file, database, or API sources
[step-kmeans] | Apply k-means clustering | Example: segment customers into 3–5 clusters based on purchase behavior
calculate step | Compute cluster statistics or derived metrics | Example: cluster centroids, average spend per cluster
chart step | Visualize clusters | Optional scatterplot, 2D/3D projection, or cluster distribution chart
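
The kind of per-cluster summary the calculate step might produce can be sketched in plain Python. This is an assumption-laden illustration, not DAZL syntax: the `(cluster, amount)` row layout, the `cluster_summary` helper, and the sample values are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical clustered_data rows: (cluster label, amount).
clustered = [(0, 10.0), (0, 12.0), (0, 11.0),
             (1, 200.0), (1, 210.0), (1, 190.0)]


def cluster_summary(rows):
    """Return {cluster: {"size": n, "avg_amount": mean}} for (label, amount) rows."""
    amounts = defaultdict(list)
    for label, amount in rows:
        amounts[label].append(amount)
    return {
        label: {"size": len(values), "avg_amount": mean(values)}
        for label, values in sorted(amounts.items())
    }


summary = cluster_summary(clustered)
for label, stats in summary.items():
    print(label, stats)
```

Comparing cluster sizes and averages side by side is usually the quickest way to decide whether the clusters are distinct enough to act on.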

Variations & Extensions

  • Experiment with different numbers of clusters and compare the resulting segments
  • Preprocess data using calculate step or [step-standardize] for scaling
  • Combine with classify step to assign new observations to existing clusters
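
Assigning a new observation to an existing cluster amounts to finding its nearest centroid. The sketch below shows that idea in plain Python. It is not the classify step's actual behavior, which the source does not describe: the `assign_to_cluster` helper, the centroid values, and the new customer rows are hypothetical.

```python
import math


def assign_to_cluster(observation, centroids):
    """Return the index of the centroid closest to the observation (Euclidean)."""
    return min(
        range(len(centroids)),
        key=lambda c: math.dist(observation, centroids[c]),
    )


# Hypothetical centroids from an earlier k-means run on (amount, frequency).
centroids = [(11.0, 1.3), (200.0, 9.0)]
new_customers = [(15.0, 2.0), (180.0, 7.0)]
assignments = [assign_to_cluster(obs, centroids) for obs in new_customers]
print(assignments)
```

If the original data was scaled before clustering, new observations need the same scaling parameters applied before this lookup. Otherwise the distances are not comparable.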

Concepts Demonstrated

  • Unsupervised clustering with k-means
  • Data segmentation and pattern detection
  • Integration of clustering results with analytics workflow
  • Sequencing analytics and visualization steps

Related Recipes

  • Time series analysis
  • Regression analysis

Notes & Best Practices

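  • Standardize numeric features before clustering so that fields with large ranges (such as amount) do not dominate the distance calculation
  • Evaluate cluster quality using silhouette scores or other metrics, such as the within-cluster sum of squares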
  • Document clustering parameters and rationale for reproducibility
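
The first two practices can be sketched in plain Python: z-score standardization followed by a mean silhouette score. Both functions are illustrative assumptions, not DAZL built-ins, and the sample rows and labels are made up. Silhouette values range from -1 to 1, with higher values indicating tighter, better-separated clusters.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std deviation."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sigmas = [pstdev(c) or 1.0 for c in cols]  # constant column: avoid dividing by 0
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]


def silhouette_score(points, labels):
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            mean([math.dist(p, q) for q in members])
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical (amount, frequency) rows with cluster labels from a prior run.
rows = [(10.0, 1.0), (12.0, 2.0), (11.0, 1.0),
        (200.0, 9.0), (210.0, 10.0), (190.0, 8.0)]
labels = [0, 0, 0, 1, 1, 1]
scaled = standardize(rows)
score = silhouette_score(scaled, labels)
print(round(score, 3))
```

Computing this score for several candidate cluster counts and keeping the one with the highest value is one common way to choose the number of clusters.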

Metadata


title: "Clustering analysis using k-means"
category: "exploratory statistics"
difficulty: "Intermediate"
tags: [clustering, k-means, segmentation, EDA]
inputs: [transactions_clean]
outputs: [clustered_data]
steps: [step-load, step-kmeans, step-calculate, step-chart]
author: "Tom Argiro"
last_updated: "2025-10-25"
doc_type: "recipe"