DATA4800

Artificial Intelligence and Machine Learning

Workshop 4

Clustering and Principal Component Analysis

KAPLAN BUSINESS SCHOOL

Learning Outcomes

1 Investigate unsupervised learning and understand when to apply it

2 Explore K-means clustering algorithm and its business applications

3 Apply principal components analysis for dimension reduction

KAPLAN BUSINESS SCHOOL

Types of Machine Learning

Supervised Learning

Has labeled training data
Learns from examples
Predicts outcomes
Examples: Classification, Regression

Unsupervised Learning

No labeled training data
Finds hidden patterns
Discovers structure
Examples: Clustering, PCA

                Today's Focus: Unsupervised learning - finding patterns in data without predefined labels
            

KAPLAN BUSINESS SCHOOL

What is Unsupervised Learning?

Unsupervised learning applies algorithms to unlabeled data to discover patterns and natural groupings.

Two Main Applications

Clustering

Separating data into meaningful groups

Example: Customer segmentation - grouping similar customers for targeted marketing

Dimension Reduction

Reducing the number of features while preserving information

Example: PCA - simplifying complex datasets for visualization

KAPLAN BUSINESS SCHOOL

Business Motivation: Retail Campaign

The Challenge

You are a retail analyst with a dataset of 200 customers containing:

Age, Income, Average Transaction Value
Previous Purchases, Days Since Last Purchase
Email Open Rate, Loyalty Status, Preferred Channel

The Question

How can we identify natural customer segments without manually labeling each customer?

Traditional approach: Manual categorization (time-consuming, subjective)

Machine learning approach: Clustering algorithm discovers groups automatically

KAPLAN BUSINESS SCHOOL

Feature Scaling: Why It Matters

Before clustering, we must ensure all features are on comparable scales.

The Problem

Consider measuring customer similarity with:

Age: ranges from 18 to 93 years
Account Balance: ranges from $0 to $80,000

A difference of 17 years in age appears equal to a difference of $17 in balance, but they have very different business meanings!

                Solution: Transform all features to a common scale (typically 0 to 1) before calculating distances
            

KAPLAN BUSINESS SCHOOL

Feature Scaling Methods

Min-Max Scaling

scaled_value = (value - minimum) / (maximum - minimum)

Transforms values to range [0, 1]

Z-Score Normalization

z_score = (value - mean) / standard_deviation

Centers data around mean = 0, standard deviation = 1

KAPLAN BUSINESS SCHOOL

Quick Check: Feature Scaling

Question 1

You have customer ages ranging from 20 to 80. Using Min-Max scaling, what would an age of 50 become?

A) 0.25

B) 0.40

C) 0.50

D) 0.75

Correct! Using the formula: (50 - 20) / (80 - 20) = 30 / 60 = 0.50

Age 50 is exactly halfway between 20 and 80, so it scales to 0.50.

KAPLAN BUSINESS SCHOOL

Measuring Similarity: Distance

Clustering requires a way to measure how similar or different data points are. We use distance measures.

Euclidean Distance (Most Common)

The straight-line distance between two points in space.

Distance = √[(x₂ - x₁)² + (y₂ - y₁)²]

KAPLAN BUSINESS SCHOOL

Distance Calculation: Worked Example

Scenario: Comparing Two Customers

Customer A: Age = 0.35 (scaled), Income = 0.45 (scaled)

Customer B: Age = 0.52 (scaled), Income = 0.48 (scaled)

1 Calculate differences:

Age difference: 0.52 - 0.35 = 0.17

Income difference: 0.48 - 0.45 = 0.03

2 Square the differences:

Age: (0.17)² = 0.0289

Income: (0.03)² = 0.0009

3 Sum and take square root:

Distance = √(0.0289 + 0.0009) = √0.0298 = 0.173

KAPLAN BUSINESS SCHOOL

Cluster Center (Centroid)

The centroid is the center point of a cluster, calculated by averaging all feature values of points in that cluster.

Example: Finding Centroid

Customer	Feature 1	Feature 2	Feature 3
1	1.00	0.40	0.25
2	0.80	0.35	0.40
3	1.10	0.37	0.27
Centroid	0.97	0.37	0.31

Calculation: Feature 1 centroid = (1.00 + 0.80 + 1.10) / 3 = 0.97

KAPLAN BUSINESS SCHOOL

The K-means Algorithm

K-means is an iterative algorithm that partitions data into K distinct clusters.

1 Initialize: Choose K random points as initial cluster centers

2 Assign: Assign each data point to its nearest cluster center

3 Update: Recalculate cluster centers based on assigned points

4 Repeat: Continue steps 2-3 until cluster centers stop changing

KAPLAN BUSINESS SCHOOL

K-means in Action

Watch how clusters form through iterations

Click "Start Animation" to begin

KAPLAN BUSINESS SCHOOL

Quick Check: K-means Algorithm

Question 2

In the K-means algorithm, what does the "K" represent?

A) The number of iterations the algorithm runs

B) The number of clusters we want to create

C) The number of features in the dataset

D) The number of data points in the dataset

Correct! K is the number of clusters we specify before running the algorithm.

This is why choosing the right K value is crucial - we'll explore methods for this next.

KAPLAN BUSINESS SCHOOL

Choosing the Right K

One of the biggest challenges in K-means: How many clusters should we create?

Three Main Methods

1. Elbow Method

Plot inertia vs. K and look for the "elbow" where improvement slows

2. Silhouette Method

Measures how well-separated clusters are

3. Domain Knowledge

Use business understanding (e.g., we want 3 customer tiers: Bronze, Silver, Gold)

KAPLAN BUSINESS SCHOOL

The Elbow Method

Run K-means with different K values and plot the inertia (within-cluster sum of squares).

Inertia = Σ (distance from each point to its cluster center)²

                Interpretation: Choose K where the curve starts to level off (the "elbow"). Here, K=4 is optimal.
            

KAPLAN BUSINESS SCHOOL

The Silhouette Method

Measures how similar a point is to its own cluster compared to other clusters.

Silhouette Score = (b - a) / max(a, b)

Where:

a = average distance to points in same cluster
b = average distance to points in nearest other cluster

Score Range

+1: Point is well-matched to its cluster
0: Point is on the border between clusters
-1: Point might be in the wrong cluster

Choose K that maximizes average silhouette score

KAPLAN BUSINESS SCHOOL

Understanding Inertia

Inertia measures how compact and tight our clusters are.

Calculation Process

1 For each point, calculate its distance to its cluster center

2 Square each distance

3 Sum all squared distances

Low Inertia (Good)

Points are close to their cluster centers

Tight, well-defined clusters

High Inertia (Poor)

Points are spread out from centers

Loose, poorly-defined clusters

KAPLAN BUSINESS SCHOOL

Interactive: Finding the Elbow

Hover over points to see inertia values

                Key Insight: The rate of decrease in inertia slows significantly after K=4, suggesting this is our optimal number of clusters.
            

KAPLAN BUSINESS SCHOOL

Quick Check: Choosing K

Question 3

You run K-means with K=2 through K=10. The inertia values are: K=2 (500), K=3 (300), K=4 (200), K=5 (180), K=6 (170). Which K should you choose using the elbow method?

A) K=2 (lowest K value)

B) K=3 (biggest single drop)

C) K=4 (where decreases start to slow)

D) K=6 (lowest inertia)

Correct! K=4 is the elbow point.

The drop from K=2→3→4 is significant (500→300→200), but K=4→5→6 shows much smaller improvements (200→180→170). This diminishing return indicates K=4 is optimal.

KAPLAN BUSINESS SCHOOL

Real-World Clustering Applications

Customer Segmentation

Retail: Group customers by purchasing behavior

Streaming: Identify viewer preferences for recommendations

Banking: Segment customers for targeted products

Medical Imaging

Diagnostics: Group similar medical images

Pattern Recognition: Identify disease markers

Treatment: Cluster patient responses

Social Networks

Community Detection: Find groups with similar interests

Influence: Identify key network positions

Anomaly Detection

Fraud: Detect unusual transactions

Quality Control: Identify defective products

KAPLAN BUSINESS SCHOOL

K-means: Advantages and Limitations

Advantages

Conceptually simple and easy to understand
Scales well to large datasets
Easy to visualize results
Efficient computationally
Adapts quickly to new data

Limitations

Must specify K in advance
Sensitive to initial centroid placement
Struggles with clusters of different sizes/densities
Affected by outliers
Assumes spherical clusters

                Best Practice: Run K-means multiple times with different random initializations and choose the result with lowest inertia.
            

KAPLAN BUSINESS SCHOOL

Alternative: Hierarchical Clustering

Unlike K-means, hierarchical clustering doesn't require specifying K upfront.

Agglomerative (Bottom-Up)

Start: Each point is its own cluster
Repeatedly merge closest clusters
End: One big cluster containing all points
Creates a dendrogram (tree diagram)

Distance Linkage Methods

Average: Average distance between all pairs
Complete: Maximum distance between points
Single: Minimum distance between points
Ward: Minimizes within-cluster variance

                Advantage: Provides a hierarchy of clusters; choose K by "cutting" the dendrogram at desired level
            

KAPLAN BUSINESS SCHOOL

Interpreting Clustering Results

After running clustering, ask these critical business questions:

1. Can you understand the data without clustering?

Is there obvious structure, or do clusters reveal hidden patterns?

2. How many clusters is reasonable?

Balance statistical metrics (elbow, silhouette) with business needs

3. What characterizes each cluster?

Examine average feature values per cluster - give them meaningful names

4. What actions should we take?

How do insights translate to marketing strategies, product development, or operations?

KAPLAN BUSINESS SCHOOL

Quick Check: Business Application

Question 4

You've identified 3 customer clusters: Cluster A (high income, low purchase frequency), Cluster B (medium income, high purchase frequency), Cluster C (low income, medium purchase frequency). Which cluster should receive discount promotions?

A) Cluster A - they have money to spend

B) Cluster C - discounts might increase their frequency

C) Cluster B - reward loyal customers

D) All clusters equally

Correct! Cluster C is price-sensitive and might respond well to discounts.

Cluster A may not need discounts (already high-value), and Cluster B is already purchasing frequently. Cluster C has potential to increase engagement through targeted promotions.

KAPLAN BUSINESS SCHOOL

The Curse of Dimensionality

As the number of features increases, distances become less meaningful and clustering becomes harder.

The Problem

2 features: Nearest neighbor at distance 0.2

10 features: Nearest neighbor at distance 2.8

100 features: All points seem equally distant!

                    Consequences
                    Clusters become harder to find
Distance measures lose meaning
Computation becomes expensive

                

                    Solution
                    Dimension reduction (PCA)
Feature selection
Alternative distance metrics

                

KAPLAN BUSINESS SCHOOL

Principal Component Analysis (PCA)

PCA is a dimension reduction technique that transforms data into a new coordinate system while preserving the most important information.

Goal

Reduce number of features from many (e.g., 20) to few (e.g., 2-3) while retaining maximum variance

                    Benefits
                    Visualize high-dimensional data
Remove noise
Speed up algorithms
Address curse of dimensionality

                

Analogy: Taking a photograph of a 3D object creates a 2D image that captures most of the important information. PCA does this mathematically.

KAPLAN BUSINESS SCHOOL

PCA: Finding the Best View

PCA finds new axes (principal components) that capture maximum variance in the data.

First Principal Component (PC1)

Direction of maximum variance in the data

Captures the most information

Second Principal Component (PC2)

Perpendicular to PC1

Captures next most variance

KAPLAN BUSINESS SCHOOL

The PCA Process

1 Standardize: Scale all features to have mean=0 and standard deviation=1

2 Compute Covariance Matrix: Measure how features vary together

3 Calculate Eigenvectors: Find directions of maximum variance (principal components)

4 Sort by Eigenvalues: Rank components by variance explained

5 Transform Data: Project original data onto top principal components

KAPLAN BUSINESS SCHOOL

Visualizing Variance Explained

Each principal component captures some portion of total variance

                Decision Rule: Keep enough components to explain 80-95% of variance. Here, 2 components capture 85% of information.
            

KAPLAN BUSINESS SCHOOL

Understanding Biplots

A biplot combines two visualizations:

1. Scatter Plot

Shows data points in PC1-PC2 space

Points close together are similar

Reveals natural groupings

2. Loading Vectors

Arrows show original features

Direction indicates correlation with PCs

Length indicates importance

Interpretation Guide

Long arrows: Feature contributes strongly to the PCs
Arrows pointing same direction: Features are positively correlated
Arrows pointing opposite directions: Features are negatively correlated
Perpendicular arrows: Features are uncorrelated

KAPLAN BUSINESS SCHOOL

Interactive PCA Biplot

                Observation: Income and TransactionValue point in similar directions (positively correlated). Age points perpendicular (independent). Three distinct customer clusters emerge.
            

KAPLAN BUSINESS SCHOOL

Quick Check: PCA Concepts

Question 5

You perform PCA on a dataset with 15 features. PC1 explains 40% of variance, PC2 explains 25%, PC3 explains 15%, and PC4 explains 10%. How many components should you keep to retain 80% of variance?

A) 1 component

B) 2 components

C) 3 components

D) 4 components

Correct! You need 3 components.

PC1 + PC2 = 65% (not enough)
PC1 + PC2 + PC3 = 80% (meets our target)
This reduces from 15 features to just 3 while retaining 80% of information!

KAPLAN BUSINESS SCHOOL

Clustering vs. PCA Comparison

Aspect	K-means Clustering	PCA
Purpose	Group similar observations	Reduce number of features
Output	Cluster labels for each point	New transformed features
Use When	Need to segment data	Too many features to visualize/process
Interpretability	High - clusters have clear meaning	Medium - components are combinations
Specify in Advance	Number of clusters (K)	Variance to retain (or # components)
Best For	Customer segmentation, anomaly detection	Visualization, noise reduction

                They Work Together: Often use PCA first to reduce dimensions, then cluster in the reduced space!
            

KAPLAN BUSINESS SCHOOL

Practical Implementation

Python (scikit-learn)

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# K-means
kmeans = KMeans(n_clusters=4)
labels = kmeans.fit_predict(data)

# PCA
pca = PCA(n_components=2)
transformed = pca.fit_transform(data)

Google Colab: Interactive notebooks for practice

Orange Data Mining

Workflow:

File widget → Load CSV
K-means widget → Set K
Scatter Plot → Visualize
Data Table → Examine clusters

Key Settings:

Normalize features: Yes
Optimization runs: 10+
Distance metric: Euclidean

KAPLAN BUSINESS SCHOOL

Workshop Summary

                Key Takeaways
                Unsupervised learning finds patterns in unlabeled data
K-means clustering groups similar observations using distance measures
Feature scaling is essential before calculating distances
Choosing K requires elbow method, silhouette analysis, or domain knowledge
PCA reduces dimensions while preserving variance
Biplots visualize both data points and feature contributions

            

Next Steps

Practice with retail dataset
Experiment with different K values
Interpret business meaning of clusters

Resources

Google Colab notebooks
Orange Data Mining tutorials
scikit-learn documentation

Questions?

KAPLAN BUSINESS SCHOOL