K-means is an iterative algorithm that partitions data into K distinct clusters.
1Initialize: Choose K random points as initial cluster centers
2Assign: Assign each data point to its nearest cluster center
3Update: Recalculate cluster centers based on assigned points
4Repeat: Continue steps 2-3 until cluster centers stop changing
KAPLAN BUSINESS SCHOOL
K-means in Action
Watch how clusters form through iterations
Click "Start Animation" to begin
KAPLAN BUSINESS SCHOOL
Quick Check: K-means Algorithm
Question 2
In the K-means algorithm, what does the "K" represent?
A) The number of iterations the algorithm runs
B) The number of clusters we want to create
C) The number of features in the dataset
D) The number of data points in the dataset
Correct! K is the number of clusters we specify before running the algorithm.
This is why choosing the right K value is crucial - we'll explore methods for this next.
KAPLAN BUSINESS SCHOOL
Choosing the Right K
One of the biggest challenges in K-means: How many clusters should we create?
Three Main Methods
1. Elbow Method
Plot inertia vs. K and look for the "elbow" where improvement slows
2. Silhouette Method
Measures how well-separated clusters are
3. Domain Knowledge
Use business understanding (e.g., we want 3 customer tiers: Bronze, Silver, Gold)
KAPLAN BUSINESS SCHOOL
The Elbow Method
Run K-means with different K values and plot the inertia (within-cluster sum of squares).
Inertia = Σ (distance from each point to its cluster center)²
Interpretation: Choose K where the curve starts to level off (the "elbow"). Here, K=4 is optimal.
KAPLAN BUSINESS SCHOOL
The Silhouette Method
Measures how similar a point is to its own cluster compared to other clusters.
Silhouette Score = (b - a) / max(a, b)
Where:
a = average distance to points in same cluster
b = average distance to points in nearest other cluster
Score Range
+1: Point is well-matched to its cluster
0: Point is on the border between clusters
-1: Point might be in the wrong cluster
Choose K that maximizes average silhouette score
KAPLAN BUSINESS SCHOOL
Understanding Inertia
Inertia measures how compact and tight our clusters are.
Calculation Process
1 For each point, calculate its distance to its cluster center
2 Square each distance
3 Sum all squared distances
Low Inertia (Good)
Points are close to their cluster centers
Tight, well-defined clusters
High Inertia (Poor)
Points are spread out from centers
Loose, poorly-defined clusters
KAPLAN BUSINESS SCHOOL
Interactive: Finding the Elbow
Hover over points to see inertia values
Key Insight: The rate of decrease in inertia slows significantly after K=4, suggesting this is our optimal number of clusters.
KAPLAN BUSINESS SCHOOL
Quick Check: Choosing K
Question 3
You run K-means with K=2 through K=10. The inertia values are: K=2 (500), K=3 (300), K=4 (200), K=5 (180), K=6 (170). Which K should you choose using the elbow method?
A) K=2 (lowest K value)
B) K=3 (biggest single drop)
C) K=4 (where decreases start to slow)
D) K=6 (lowest inertia)
Correct! K=4 is the elbow point.
The drop from K=2→3→4 is significant (500→300→200), but K=4→5→6 shows much smaller improvements (200→180→170). This diminishing return indicates K=4 is optimal.
KAPLAN BUSINESS SCHOOL
Real-World Clustering Applications
Customer Segmentation
Retail: Group customers by purchasing behavior
Streaming: Identify viewer preferences for recommendations
Banking: Segment customers for targeted products
Medical Imaging
Diagnostics: Group similar medical images
Pattern Recognition: Identify disease markers
Treatment: Cluster patient responses
Social Networks
Community Detection: Find groups with similar interests
Influence: Identify key network positions
Anomaly Detection
Fraud: Detect unusual transactions
Quality Control: Identify defective products
KAPLAN BUSINESS SCHOOL
K-means: Advantages and Limitations
Advantages
Conceptually simple and easy to understand
Scales well to large datasets
Easy to visualize results
Efficient computationally
Adapts quickly to new data
Limitations
Must specify K in advance
Sensitive to initial centroid placement
Struggles with clusters of different sizes/densities
Affected by outliers
Assumes spherical clusters
Best Practice: Run K-means multiple times with different random initializations and choose the result with lowest inertia.
KAPLAN BUSINESS SCHOOL
Alternative: Hierarchical Clustering
Unlike K-means, hierarchical clustering doesn't require specifying K upfront.
Agglomerative (Bottom-Up)
Start: Each point is its own cluster
Repeatedly merge closest clusters
End: One big cluster containing all points
Creates a dendrogram (tree diagram)
Distance Linkage Methods
Average: Average distance between all pairs
Complete: Maximum distance between points
Single: Minimum distance between points
Ward: Minimizes within-cluster variance
Advantage: Provides a hierarchy of clusters; choose K by "cutting" the dendrogram at desired level
KAPLAN BUSINESS SCHOOL
Interpreting Clustering Results
After running clustering, ask these critical business questions:
1. Can you understand the data without clustering?
Is there obvious structure, or do clusters reveal hidden patterns?
2. How many clusters is reasonable?
Balance statistical metrics (elbow, silhouette) with business needs
3. What characterizes each cluster?
Examine average feature values per cluster - give them meaningful names
4. What actions should we take?
How do insights translate to marketing strategies, product development, or operations?
KAPLAN BUSINESS SCHOOL
Quick Check: Business Application
Question 4
You've identified 3 customer clusters: Cluster A (high income, low purchase frequency), Cluster B (medium income, high purchase frequency), Cluster C (low income, medium purchase frequency). Which cluster should receive discount promotions?
A) Cluster A - they have money to spend
B) Cluster C - discounts might increase their frequency
C) Cluster B - reward loyal customers
D) All clusters equally
Correct! Cluster C is price-sensitive and might respond well to discounts.
Cluster A may not need discounts (already high-value), and Cluster B is already purchasing frequently. Cluster C has potential to increase engagement through targeted promotions.
KAPLAN BUSINESS SCHOOL
The Curse of Dimensionality
As the number of features increases, distances become less meaningful and clustering becomes harder.
The Problem
2 features: Nearest neighbor at distance 0.2
10 features: Nearest neighbor at distance 2.8
100 features: All points seem equally distant!
Consequences
Clusters become harder to find
Distance measures lose meaning
Computation becomes expensive
Solution
Dimension reduction (PCA)
Feature selection
Alternative distance metrics
KAPLAN BUSINESS SCHOOL
Principal Component Analysis (PCA)
PCA is a dimension reduction technique that transforms data into a new coordinate system while preserving the most important information.
Goal
Reduce number of features from many (e.g., 20) to few (e.g., 2-3) while retaining maximum variance
Benefits
Visualize high-dimensional data
Remove noise
Speed up algorithms
Address curse of dimensionality
Analogy: Taking a photograph of a 3D object creates a 2D image that captures most of the important information. PCA does this mathematically.
KAPLAN BUSINESS SCHOOL
PCA: Finding the Best View
PCA finds new axes (principal components) that capture maximum variance in the data.
First Principal Component (PC1)
Direction of maximum variance in the data
Captures the most information
Second Principal Component (PC2)
Perpendicular to PC1
Captures next most variance
KAPLAN BUSINESS SCHOOL
The PCA Process
1Standardize: Scale all features to have mean=0 and standard deviation=1
2Compute Covariance Matrix: Measure how features vary together
3Calculate Eigenvectors: Find directions of maximum variance (principal components)
4Sort by Eigenvalues: Rank components by variance explained
5Transform Data: Project original data onto top principal components
KAPLAN BUSINESS SCHOOL
Visualizing Variance Explained
Each principal component captures some portion of total variance
Decision Rule: Keep enough components to explain 80-95% of variance. Here, 2 components capture 85% of information.
KAPLAN BUSINESS SCHOOL
Understanding Biplots
A biplot combines two visualizations:
1. Scatter Plot
Shows data points in PC1-PC2 space
Points close together are similar
Reveals natural groupings
2. Loading Vectors
Arrows show original features
Direction indicates correlation with PCs
Length indicates importance
Interpretation Guide
Long arrows: Feature contributes strongly to the PCs
Arrows pointing same direction: Features are positively correlated
Arrows pointing opposite directions: Features are negatively correlated
Perpendicular arrows: Features are uncorrelated
KAPLAN BUSINESS SCHOOL
Interactive PCA Biplot
Observation: Income and TransactionValue point in similar directions (positively correlated). Age points perpendicular (independent). Three distinct customer clusters emerge.
KAPLAN BUSINESS SCHOOL
Quick Check: PCA Concepts
Question 5
You perform PCA on a dataset with 15 features. PC1 explains 40% of variance, PC2 explains 25%, PC3 explains 15%, and PC4 explains 10%. How many components should you keep to retain 80% of variance?
A) 1 component
B) 2 components
C) 3 components
D) 4 components
Correct! You need 3 components.
PC1 + PC2 = 65% (not enough)
PC1 + PC2 + PC3 = 80% (meets our target)
This reduces from 15 features to just 3 while retaining 80% of information!
KAPLAN BUSINESS SCHOOL
Clustering vs. PCA Comparison
Aspect
K-means Clustering
PCA
Purpose
Group similar observations
Reduce number of features
Output
Cluster labels for each point
New transformed features
Use When
Need to segment data
Too many features to visualize/process
Interpretability
High - clusters have clear meaning
Medium - components are combinations
Specify in Advance
Number of clusters (K)
Variance to retain (or # components)
Best For
Customer segmentation, anomaly detection
Visualization, noise reduction
They Work Together: Often use PCA first to reduce dimensions, then cluster in the reduced space!