DATA4800

Artificial Intelligence and Machine Learning

Workshop 4

Clustering and Principal Component Analysis

Learning Outcomes

1 Investigate unsupervised learning and understand when to apply it

2 Explore K-means clustering algorithm and its business applications

3 Apply principal components analysis for dimension reduction

Types of Machine Learning

Supervised Learning

  • Has labeled training data
  • Learns from examples
  • Predicts outcomes
  • Examples: Classification, Regression

Unsupervised Learning

  • No labeled training data
  • Finds hidden patterns
  • Discovers structure
  • Examples: Clustering, PCA
Today's Focus: Unsupervised learning - finding patterns in data without predefined labels

What is Unsupervised Learning?

Unsupervised learning applies algorithms to unlabeled data to discover patterns and natural groupings.

Two Main Applications

Clustering

Separating data into meaningful groups

Example: Customer segmentation - grouping similar customers for targeted marketing

Dimension Reduction

Reducing the number of features while preserving information

Example: PCA - simplifying complex datasets for visualization

Business Motivation: Retail Campaign

The Challenge

You are a retail analyst with a dataset of 200 customers containing:

  • Age, Income, Average Transaction Value
  • Previous Purchases, Days Since Last Purchase
  • Email Open Rate, Loyalty Status, Preferred Channel

The Question

How can we identify natural customer segments without manually labeling each customer?

Traditional approach: Manual categorization (time-consuming, subjective)

Machine learning approach: Clustering algorithm discovers groups automatically

Feature Scaling: Why It Matters

Before clustering, we must ensure all features are on comparable scales.

The Problem

Consider measuring customer similarity with:

  • Age: ranges from 18 to 93 years
  • Account Balance: ranges from $0 to $80,000

A difference of 17 years in age appears equal to a difference of $17 in balance, but they have very different business meanings!

Solution: Transform all features to a common scale (typically 0 to 1) before calculating distances

Feature Scaling Methods

Min-Max Scaling

scaled_value = (value - minimum) / (maximum - minimum)

Transforms values to range [0, 1]

Z-Score Normalization

z_score = (value - mean) / standard_deviation

Centers data around mean = 0, standard deviation = 1

Quick Check: Feature Scaling

Question 1

You have customer ages ranging from 20 to 80. Using Min-Max scaling, what would an age of 50 become?

A) 0.25
B) 0.40
C) 0.50
D) 0.75
Correct! Using the formula: (50 - 20) / (80 - 20) = 30 / 60 = 0.50

Age 50 is exactly halfway between 20 and 80, so it scales to 0.50.

Measuring Similarity: Distance

Clustering requires a way to measure how similar or different data points are. We use distance measures.

Euclidean Distance (Most Common)

The straight-line distance between two points in space.

Distance = √[(x₂ - x₁)² + (y₂ - y₁)²]

Distance Calculation: Worked Example

Scenario: Comparing Two Customers

Customer A: Age = 0.35 (scaled), Income = 0.45 (scaled)

Customer B: Age = 0.52 (scaled), Income = 0.48 (scaled)

1 Calculate differences:

Age difference: 0.52 - 0.35 = 0.17

Income difference: 0.48 - 0.45 = 0.03

2 Square the differences:

Age: (0.17)² = 0.0289

Income: (0.03)² = 0.0009

3 Sum and take square root:

Distance = √(0.0289 + 0.0009) = √0.0298 = 0.173

Cluster Center (Centroid)

The centroid is the center point of a cluster, calculated by averaging all feature values of points in that cluster.

Example: Finding Centroid

Customer Feature 1 Feature 2 Feature 3
11.000.400.25
20.800.350.40
31.100.370.27
Centroid0.970.370.31

Calculation: Feature 1 centroid = (1.00 + 0.80 + 1.10) / 3 = 0.97

The K-means Algorithm

K-means is an iterative algorithm that partitions data into K distinct clusters.

1 Initialize: Choose K random points as initial cluster centers

2 Assign: Assign each data point to its nearest cluster center

3 Update: Recalculate cluster centers based on assigned points

4 Repeat: Continue steps 2-3 until cluster centers stop changing

K-means in Action

Watch how clusters form through iterations

Click "Start Animation" to begin

Quick Check: K-means Algorithm

Question 2

In the K-means algorithm, what does the "K" represent?

A) The number of iterations the algorithm runs
B) The number of clusters we want to create
C) The number of features in the dataset
D) The number of data points in the dataset
Correct! K is the number of clusters we specify before running the algorithm.

This is why choosing the right K value is crucial - we'll explore methods for this next.

Choosing the Right K

One of the biggest challenges in K-means: How many clusters should we create?

Three Main Methods

1. Elbow Method

Plot inertia vs. K and look for the "elbow" where improvement slows

2. Silhouette Method

Measures how well-separated clusters are

3. Domain Knowledge

Use business understanding (e.g., we want 3 customer tiers: Bronze, Silver, Gold)

The Elbow Method

Run K-means with different K values and plot the inertia (within-cluster sum of squares).

Inertia = Σ (distance from each point to its cluster center)²
Interpretation: Choose K where the curve starts to level off (the "elbow"). Here, K=4 is optimal.

The Silhouette Method

Measures how similar a point is to its own cluster compared to other clusters.

Silhouette Score = (b - a) / max(a, b)

Where:

Score Range

  • +1: Point is well-matched to its cluster
  • 0: Point is on the border between clusters
  • -1: Point might be in the wrong cluster

Choose K that maximizes average silhouette score

Understanding Inertia

Inertia measures how compact and tight our clusters are.

Calculation Process

1 For each point, calculate its distance to its cluster center

2 Square each distance

3 Sum all squared distances

Low Inertia (Good)

Points are close to their cluster centers

Tight, well-defined clusters

High Inertia (Poor)

Points are spread out from centers

Loose, poorly-defined clusters

Interactive: Finding the Elbow

Hover over points to see inertia values

Key Insight: The rate of decrease in inertia slows significantly after K=4, suggesting this is our optimal number of clusters.

Quick Check: Choosing K

Question 3

You run K-means with K=2 through K=10. The inertia values are: K=2 (500), K=3 (300), K=4 (200), K=5 (180), K=6 (170). Which K should you choose using the elbow method?

A) K=2 (lowest K value)
B) K=3 (biggest single drop)
C) K=4 (where decreases start to slow)
D) K=6 (lowest inertia)
Correct! K=4 is the elbow point.

The drop from K=2→3→4 is significant (500→300→200), but K=4→5→6 shows much smaller improvements (200→180→170). This diminishing return indicates K=4 is optimal.

Real-World Clustering Applications

Customer Segmentation

Retail: Group customers by purchasing behavior

Streaming: Identify viewer preferences for recommendations

Banking: Segment customers for targeted products

Medical Imaging

Diagnostics: Group similar medical images

Pattern Recognition: Identify disease markers

Treatment: Cluster patient responses

Social Networks

Community Detection: Find groups with similar interests

Influence: Identify key network positions

Anomaly Detection

Fraud: Detect unusual transactions

Quality Control: Identify defective products

K-means: Advantages and Limitations

Advantages

  • Conceptually simple and easy to understand
  • Scales well to large datasets
  • Easy to visualize results
  • Efficient computationally
  • Adapts quickly to new data

Limitations

  • Must specify K in advance
  • Sensitive to initial centroid placement
  • Struggles with clusters of different sizes/densities
  • Affected by outliers
  • Assumes spherical clusters
Best Practice: Run K-means multiple times with different random initializations and choose the result with lowest inertia.

Alternative: Hierarchical Clustering

Unlike K-means, hierarchical clustering doesn't require specifying K upfront.

Agglomerative (Bottom-Up)

  • Start: Each point is its own cluster
  • Repeatedly merge closest clusters
  • End: One big cluster containing all points
  • Creates a dendrogram (tree diagram)

Distance Linkage Methods

  • Average: Average distance between all pairs
  • Complete: Maximum distance between points
  • Single: Minimum distance between points
  • Ward: Minimizes within-cluster variance
Advantage: Provides a hierarchy of clusters; choose K by "cutting" the dendrogram at desired level

Interpreting Clustering Results

After running clustering, ask these critical business questions:

1. Can you understand the data without clustering?

Is there obvious structure, or do clusters reveal hidden patterns?

2. How many clusters is reasonable?

Balance statistical metrics (elbow, silhouette) with business needs

3. What characterizes each cluster?

Examine average feature values per cluster - give them meaningful names

4. What actions should we take?

How do insights translate to marketing strategies, product development, or operations?

Quick Check: Business Application

Question 4

You've identified 3 customer clusters: Cluster A (high income, low purchase frequency), Cluster B (medium income, high purchase frequency), Cluster C (low income, medium purchase frequency). Which cluster should receive discount promotions?

A) Cluster A - they have money to spend
B) Cluster C - discounts might increase their frequency
C) Cluster B - reward loyal customers
D) All clusters equally
Correct! Cluster C is price-sensitive and might respond well to discounts.

Cluster A may not need discounts (already high-value), and Cluster B is already purchasing frequently. Cluster C has potential to increase engagement through targeted promotions.

The Curse of Dimensionality

As the number of features increases, distances become less meaningful and clustering becomes harder.

The Problem

2 features: Nearest neighbor at distance 0.2

10 features: Nearest neighbor at distance 2.8

100 features: All points seem equally distant!

Consequences

  • Clusters become harder to find
  • Distance measures lose meaning
  • Computation becomes expensive

Solution

  • Dimension reduction (PCA)
  • Feature selection
  • Alternative distance metrics

Principal Component Analysis (PCA)

PCA is a dimension reduction technique that transforms data into a new coordinate system while preserving the most important information.

Goal

Reduce number of features from many (e.g., 20) to few (e.g., 2-3) while retaining maximum variance

Benefits

  • Visualize high-dimensional data
  • Remove noise
  • Speed up algorithms
  • Address curse of dimensionality
Analogy: Taking a photograph of a 3D object creates a 2D image that captures most of the important information. PCA does this mathematically.

PCA: Finding the Best View

PCA finds new axes (principal components) that capture maximum variance in the data.

First Principal Component (PC1)

Direction of maximum variance in the data

Captures the most information

Second Principal Component (PC2)

Perpendicular to PC1

Captures next most variance

The PCA Process

1 Standardize: Scale all features to have mean=0 and standard deviation=1

2 Compute Covariance Matrix: Measure how features vary together

3 Calculate Eigenvectors: Find directions of maximum variance (principal components)

4 Sort by Eigenvalues: Rank components by variance explained

5 Transform Data: Project original data onto top principal components

Visualizing Variance Explained

Each principal component captures some portion of total variance

Decision Rule: Keep enough components to explain 80-95% of variance. Here, 2 components capture 85% of information.

Understanding Biplots

A biplot combines two visualizations:

1. Scatter Plot

Shows data points in PC1-PC2 space

Points close together are similar

Reveals natural groupings

2. Loading Vectors

Arrows show original features

Direction indicates correlation with PCs

Length indicates importance

Interpretation Guide

  • Long arrows: Feature contributes strongly to the PCs
  • Arrows pointing same direction: Features are positively correlated
  • Arrows pointing opposite directions: Features are negatively correlated
  • Perpendicular arrows: Features are uncorrelated

Interactive PCA Biplot

Observation: Income and TransactionValue point in similar directions (positively correlated). Age points perpendicular (independent). Three distinct customer clusters emerge.

Quick Check: PCA Concepts

Question 5

You perform PCA on a dataset with 15 features. PC1 explains 40% of variance, PC2 explains 25%, PC3 explains 15%, and PC4 explains 10%. How many components should you keep to retain 80% of variance?

A) 1 component
B) 2 components
C) 3 components
D) 4 components
Correct! You need 3 components.

PC1 + PC2 = 65% (not enough)
PC1 + PC2 + PC3 = 80% (meets our target)
This reduces from 15 features to just 3 while retaining 80% of information!

Clustering vs. PCA Comparison

Aspect K-means Clustering PCA
Purpose Group similar observations Reduce number of features
Output Cluster labels for each point New transformed features
Use When Need to segment data Too many features to visualize/process
Interpretability High - clusters have clear meaning Medium - components are combinations
Specify in Advance Number of clusters (K) Variance to retain (or # components)
Best For Customer segmentation, anomaly detection Visualization, noise reduction
They Work Together: Often use PCA first to reduce dimensions, then cluster in the reduced space!

Practical Implementation

Python (scikit-learn)

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# K-means
kmeans = KMeans(n_clusters=4)
labels = kmeans.fit_predict(data)

# PCA
pca = PCA(n_components=2)
transformed = pca.fit_transform(data)
                    

Google Colab: Interactive notebooks for practice

Orange Data Mining

Workflow:

  1. File widget → Load CSV
  2. K-means widget → Set K
  3. Scatter Plot → Visualize
  4. Data Table → Examine clusters

Key Settings:

  • Normalize features: Yes
  • Optimization runs: 10+
  • Distance metric: Euclidean

Workshop Summary

Key Takeaways

  • Unsupervised learning finds patterns in unlabeled data
  • K-means clustering groups similar observations using distance measures
  • Feature scaling is essential before calculating distances
  • Choosing K requires elbow method, silhouette analysis, or domain knowledge
  • PCA reduces dimensions while preserving variance
  • Biplots visualize both data points and feature contributions

Next Steps

  • Practice with retail dataset
  • Experiment with different K values
  • Interpret business meaning of clusters

Resources

  • Google Colab notebooks
  • Orange Data Mining tutorials
  • scikit-learn documentation

Questions?

1 / 36