Convolutional Neural Networks for Image Classification

Week 8: DATA4800 - AI and Machine Learning

Understanding How Machines Learn to See

Learning Objectives

By the end of this week, you will be able to:

Business Problem: Manufacturing Quality Control

The Challenge

An electronics manufacturer produces 50,000 circuit boards daily. Each board must be inspected for defects before shipment.

Traditional Manual Inspection

Human Inspector Performance:
• Speed: 100 boards per hour
• Accuracy: 85-90% (fatigue affects performance)
• Cost: $25 per hour × 500 inspectors × 24 hours = $300,000 daily
• Miss rate: 10-15% of defects go undetected

The Business Impact

Missed defects result in:

CNN-Powered Solution: Automated Visual Inspection

Performance Comparison

| Metric | Manual Inspection | CNN System | Improvement |
|---|---|---|---|
| Inspection Speed | 100 boards/hour | 10,000 boards/hour | 100× faster |
| Accuracy | 85-90% | 98.5% | +8.5 to 13.5 percentage points |
| Consistency | Varies with fatigue | Constant 24/7 | No degradation |
| Annual Cost | $7.5M (labor) | $500K (system) | 93% cost reduction |

Return on Investment: System pays for itself in 3 weeks through reduced labor costs and avoided defect costs.

Why CNNs Matter for Business

Healthcare Diagnostics

Use Case: Automated screening of medical images (X-rays, MRIs, CT scans)

Impact: Radiologists process 5× more cases with 15% higher detection rate for early-stage diseases

Value: Early detection saves lives and reduces treatment costs by 60%

Retail & E-commerce

Use Case: Automated product tagging and visual search

Impact: Catalog 100,000+ products automatically, enable customer image search

Value: 40% increase in product discovery, 25% boost in conversion rates

Agriculture

Use Case: Crop disease detection and yield prediction

Impact: Identify plant diseases 2 weeks earlier than traditional methods

Value: Prevent 30% crop loss, increase farm profitability by $150K annually

Security & Surveillance

Use Case: Facial recognition and anomaly detection

Impact: Monitor 1,000+ cameras simultaneously with real-time alerts

Value: Reduce security incidents by 70%, enable touchless access control

The Challenge: Why Traditional Neural Networks Fail for Images

Understanding Image Data

Let's examine what a computer "sees" when processing an image:

Small Image (28 × 28 pixels, grayscale):
• Total numbers: 28 × 28 = 784 pixel values
• Each pixel: Single number (0-255 for brightness)
• Network needs: 784 input neurons

Realistic Business Image (224 × 224 pixels, color):
• Total numbers: 224 × 224 × 3 (RGB channels) = 150,528 pixel values
• Each pixel: Three numbers (Red, Green, Blue values 0-255)
• Network needs: 150,528 input neurons

The Problem Scales Rapidly

If we connect these inputs to just 1,000 neurons in the first hidden layer:

150,528 inputs × 1,000 neurons = 150,528,000 parameters (just in the first layer!)

The Parameter Explosion Problem

Network Comparison

| Network Type | Input Size | Hidden Layer Size | Parameters | Issues |
|---|---|---|---|---|
| Traditional NN | 784 (28×28) | 128 neurons | 100,352 | Manageable |
| Traditional NN | 150,528 (224×224×3) | 128 neurons | 19,267,584 | Severe overfitting |
| Traditional NN | 150,528 (224×224×3) | 1,000 neurons | 150,528,000 | Impractical to train |
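
The parameter counts in this comparison come from multiplying the number of inputs by the number of neurons. A minimal sketch of that arithmetic follows; the function name `dense_weight_count` is illustrative, and biases are left out so the results match the figures quoted here.

```python
def dense_weight_count(n_inputs: int, n_neurons: int) -> int:
    """Connection weights in a fully connected layer (biases excluded)."""
    return n_inputs * n_neurons


# Input sizes taken from the slides above.
small_gray = 28 * 28          # 784 pixel values
business_rgb = 224 * 224 * 3  # 150,528 pixel values

for n_inputs, n_neurons in [(small_gray, 128), (business_rgb, 128), (business_rgb, 1000)]:
    count = dense_weight_count(n_inputs, n_neurons)
    print(f"{n_inputs:>7,} inputs x {n_neurons:>5,} neurons = {count:,} weights")
```

The count grows with the product of image size and layer width, which is why moving from a 28×28 grayscale image to a 224×224 color image multiplies the first layer's size by 192.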

Why This Fails

Learning from Human Vision: How Do We Recognize Objects?

The Human Approach

Consider how you recognize a cat in a photograph. You don't analyze every pixel individually. Instead, you follow a hierarchical process:

Step 1: Edges

Detect basic lines and boundaries

Step 2: Shapes

Combine edges into simple geometric forms

Step 3: Parts

Identify object components (ears, eyes, whiskers)

🐱

Step 4: Object

Recognize complete object: "This is a cat"

Key Insight: CNNs mimic this hierarchical process. They start with simple patterns (edges) and progressively build up to complex concepts (whole objects).

The CNN Solution: Hierarchical Feature Learning

Three Key Innovations

1. Local Connectivity

Instead of connecting to all pixels, each neuron only examines a small region (e.g., 3×3 pixels)

Benefit: Dramatically reduces parameters from millions to thousands

2. Parameter Sharing

Use the same "filter" (pattern detector) across the entire image

Benefit: Learns to detect patterns regardless of where they appear in the image

3. Hierarchical Learning

Stack multiple layers that learn increasingly complex features

Benefit: Automatically discovers relevant patterns without manual feature engineering

The Result

A CNN with 1 million parameters can achieve what would require 150+ million parameters in a traditional neural network, while also learning better, more generalizable representations.
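
To make local connectivity and parameter sharing concrete, the sketch below compares a convolutional layer with a fully connected layer of the same width. The choice of 32 filters and 32 neurons is an assumption made for illustration, and both helper functions are hypothetical names rather than library calls.

```python
def conv_param_count(kernel_size: int, in_channels: int, n_filters: int) -> int:
    """Weights plus one bias per filter. The image size does not appear,
    because the same small filter is reused at every position."""
    return (kernel_size * kernel_size * in_channels + 1) * n_filters


def dense_param_count(n_inputs: int, n_neurons: int) -> int:
    """Weights plus one bias per neuron; every neuron sees every input."""
    return (n_inputs + 1) * n_neurons


conv = conv_param_count(kernel_size=3, in_channels=3, n_filters=32)
dense = dense_param_count(n_inputs=224 * 224 * 3, n_neurons=32)
print(f"3x3 conv, 32 filters: {conv:,} parameters")   # 896
print(f"dense, 32 neurons:    {dense:,} parameters")  # 4,816,928
```

Doubling the image resolution would leave the convolutional count unchanged, while the dense count would roughly quadruple.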

Knowledge Check: Neural Networks and Images

Why do traditional fully-connected neural networks struggle with image classification tasks?
A) Images contain too little information for neural networks to learn from
B) The massive number of parameters leads to overfitting and ignores spatial relationships between pixels
C) Neural networks can only process numerical data, not images
D) Images are too expensive to process with neural networks

The Convolution Operation: Core Building Block

What is Convolution?

Convolution is a mathematical operation that applies a small filter (also called a kernel) across an image to detect specific patterns.

Business Analogy

Think of convolution like a quality inspector with a checklist:

Key Concept: Instead of looking at the entire image at once (which requires millions of parameters), convolution examines small regions one at a time using the same filter, dramatically reducing computational requirements.

How Convolution Works: Step-by-Step

The Convolution Process

Example: Detecting Vertical Edges

Input Image (5×5)

0    0    255  255  255
0    0    255  255  255
0    0    255  255  255
0    0    255  255  255
0    0    255  255  255

Dark (0) on left, Bright (255) on right

Vertical Edge Filter (3×3)

-1   0   1
-1   0   1
-1   0   1

Detects left-to-right brightness change

Output (3×3)

765  765  0
765  765  0
765  765  0

High values = vertical edge detected

Calculation for Top-Left Position

Filter slides over image, performs element-wise multiplication and sums:

(0×-1 + 0×0 + 255×1) + (0×-1 + 0×0 + 255×1) + (0×-1 + 0×0 + 255×1) = 765

This high value indicates a strong vertical edge was detected.
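
The sliding-window calculation can be reproduced in a few lines of plain Python. This is a minimal sketch, assuming no padding and a stride of 1; `convolve2d_valid` is an illustrative name, not a library function.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


image = [[0, 0, 255, 255, 255] for _ in range(5)]  # dark on the left, bright on the right
vertical_edge = [[-1, 0, 1] for _ in range(3)]

feature_map = convolve2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)  # each row prints [765, 765, 0]
```

The filter responds with 765 wherever its 3×3 window straddles the dark-to-bright boundary and with 0 where the window covers only bright pixels.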

Interactive Convolution Demonstration

Business Insight: Different filters detect different patterns. In quality control, edge detectors find boundaries and defects, blur filters reduce noise, and sharpening filters enhance details. CNNs automatically learn the optimal filters for each task.

Using Multiple Filters for Comprehensive Detection

Why Multiple Filters?

A single filter can only detect one type of pattern. Real-world classification requires detecting many different features simultaneously.

Example: Circuit Board Inspection

Filter 1: Vertical Lines

Detects vertical traces and component edges

Business Value: Identifies misaligned components

Filter 2: Horizontal Lines

Detects horizontal traces and solder points

Business Value: Finds disconnected circuits

Filter 3: Circular Shapes

Detects capacitors and mounting holes

Business Value: Verifies component presence

Filter 4: Texture Patterns

Detects surface roughness and burn marks

Business Value: Identifies manufacturing defects

Typical CNN Layer: Uses 32-512 different filters simultaneously, creating a multi-dimensional representation of the image. Each filter learns to detect patterns that are useful for the classification task.

Feature Maps: The Output of Convolution

Understanding Feature Maps

When a filter slides across an image, it produces a feature map (also called an activation map) that shows where and how strongly the pattern was detected.

Transformation Through Convolutional Layer

Input Image: 224 × 224 × 3 (Height × Width × Channels)

Feature Maps (4 shown): 224 × 224 × 64 (64 different filters applied)

Interpretation: Each feature map highlights regions where its corresponding filter detected its pattern. Bright areas = strong detection, dark areas = weak/no detection.

Dimensionality

Input: Height × Width × Color Channels (3 for RGB)

Output: Height × Width × Number of Filters

The number of filters (typically 32, 64, 128, 256, or 512) becomes the new "depth" dimension.
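
Whether height and width are preserved depends on padding and stride. The sketch below computes the output shape under two common conventions; the names "same" and "valid" follow Keras usage, and `conv_output_shape` itself is a hypothetical helper.

```python
import math


def conv_output_shape(height, width, n_filters, kernel_size=3, stride=1, padding="same"):
    """Output (height, width, depth) of a convolutional layer.

    padding="same"  pads the border so that, at stride 1, height and width are kept.
    padding="valid" uses no padding, so the filter must fit entirely inside the image.
    """
    if padding == "same":
        out_h = math.ceil(height / stride)
        out_w = math.ceil(width / stride)
    elif padding == "valid":
        out_h = (height - kernel_size) // stride + 1
        out_w = (width - kernel_size) // stride + 1
    else:
        raise ValueError("padding must be 'same' or 'valid'")
    return out_h, out_w, n_filters


print(conv_output_shape(224, 224, n_filters=64))               # (224, 224, 64)
print(conv_output_shape(5, 5, n_filters=1, padding="valid"))   # (3, 3, 1)
```

The first call matches the 224 × 224 × 64 feature maps shown above, which implies "same" padding; the second matches the 5×5 edge-detection example, where no padding was used.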

Knowledge Check: Convolution Fundamentals

A convolutional layer applies 64 different 3×3 filters to a 224×224 RGB image. What is the shape of the resulting feature maps (assuming no padding and a stride of 1)?
A) 224 × 224 × 3
B) 222 × 222 × 64
C) 64 × 64 × 224
D) 224 × 224 × 64

Hint: Consider that each filter produces one feature map with spatial dimensions slightly smaller than the input, and all filters are applied to the same input.

Complete CNN Architecture: Building Blocks

Three Main Components

1. Convolutional Layers

Function: Apply filters to detect patterns

Output: Feature maps showing where patterns were found

Parameters: Filter weights (learned during training)

2. Pooling Layers

Function: Reduce spatial dimensions while preserving important features

Output: Downsampled feature maps

Parameters: None (fixed operation)

3. Fully Connected (Dense) Layers

Function: Combine learned features to make final classification decision

Output: Class probabilities

Parameters: Connection weights (learned during training)

Design Pattern: Modern CNNs typically alternate between convolutional and pooling layers multiple times, progressively extracting more abstract features, before feeding into dense layers for final classification.

Hierarchical Feature Learning: From Edges to Objects

Progressive Abstraction Through Layers

Layer 1

Low-Level Features

Detects: Edges, lines, gradients, simple textures

Example: Horizontal/vertical boundaries, color transitions

Layer 2-3

Mid-Level Features

Detects: Corners, curves, simple shapes

Example: Circles, rectangles, T-junctions

Layer 4-5

(Diagram labels: component outlines, repeated patterns, object parts, complex textures)

High-Level Features

Detects: Object parts and assemblies

Example: Wheels, faces, logos, product components

Dense Layers

(Diagram labels: complete objects, classification)

Classification

Combines: All learned features

Output: "Defective Product" or "Quality Pass"

Key Insight: CNNs automatically learn this hierarchy without manual feature engineering. Early layers learn generic patterns useful for many tasks, while later layers learn task-specific features.

Example CNN Architecture: Product Quality Classifier

Layer-by-Layer Transformation

| Layer | Operation | Output Shape | Parameters | What It Learns |
|---|---|---|---|---|
| Input | Product image | 224 × 224 × 3 | 0 | Raw pixel data (RGB) |
| Conv1 | 32 filters (3×3) | 224 × 224 × 32 | 896 | Basic edges and color gradients |
| Pool1 | Max pooling (2×2) | 112 × 112 × 32 | 0 | Downsample while keeping features |
| Conv2 | 64 filters (3×3) | 112 × 112 × 64 | 18,496 | Simple shapes and corners |
| Pool2 | Max pooling (2×2) | 56 × 56 × 64 | 0 | Further dimensionality reduction |
| Conv3 | 128 filters (3×3) | 56 × 56 × 128 | 73,856 | Component parts and patterns |
| Pool3 | Max pooling (2×2) | 28 × 28 × 128 | 0 | Compact representation |
| Flatten | Reshape to vector | 100,352 | 0 | Prepare for dense layers |
| Dense1 | 128 neurons | 128 | 12,845,184 | Combine all features |
| Output | 2 neurons (softmax) | 2 | 258 | Class probabilities: Defective/Pass |

Total Parameters: ~13 million (vs. 150+ million for fully connected network)

Training Time: 2 hours on GPU vs. weeks for traditional approach

Accuracy: 98.5% vs. 75% for manual feature engineering
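
The shapes and parameter counts in this architecture can be checked with a short calculation. The sketch below assumes "same" padding and stride 1 for the convolutions (which is what the unchanged 224 × 224 output implies) and counts one bias per filter or neuron; all three helper functions are illustrative.

```python
def conv_layer(shape, n_filters, kernel_size=3):
    """'Same'-padded, stride-1 convolution: keeps height/width, changes depth."""
    height, width, channels = shape
    params = (kernel_size * kernel_size * channels + 1) * n_filters
    return (height, width, n_filters), params


def max_pool(shape, size=2):
    """Non-overlapping pooling: divides height and width by `size`, no parameters."""
    height, width, channels = shape
    return (height // size, width // size, channels), 0


def dense_layer(n_inputs, n_neurons):
    """Fully connected layer: one weight per input plus one bias, per neuron."""
    return n_neurons, (n_inputs + 1) * n_neurons


shape = (224, 224, 3)
total = 0
for name, n_filters in [("Conv1", 32), ("Conv2", 64), ("Conv3", 128)]:
    shape, params = conv_layer(shape, n_filters)
    total += params
    print(f"{name}:  {shape}, {params:,} parameters")
    shape, _ = max_pool(shape)
    print(f"Pool:   {shape}")

flat = shape[0] * shape[1] * shape[2]
units, params = dense_layer(flat, 128)
total += params
print(f"Dense1: {units} neurons, {params:,} parameters")
units, params = dense_layer(units, 2)
total += params
print(f"Output: {units} neurons, {params:,} parameters")
print(f"Total:  {total:,} parameters")
```

Almost all of the roughly 13 million parameters sit in the first dense layer; the three convolutional layers together contribute fewer than 100,000.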

Pooling Layers: Efficient Dimensionality Reduction

Why Do We Need Pooling?

As we add more convolutional layers, the number of feature maps and the computational cost grow rapidly. Pooling layers address this by:

Business Analogy

Think of pooling like creating an executive summary from a detailed report. You preserve the key findings and critical information while reducing the overall document size by 75%. The executive doesn't need every data point, just the most significant ones.

Common Pooling Operations

Max Pooling: Visual Demonstration

Example: 2×2 Max Pooling with Stride 2

Input Feature Map (4×4)

12  20   5   8
 8  34  15  22
 3   7  42  18
 6  11   9  28

Each non-overlapping 2×2 block of this grid is one pooling window

Output After Max Pooling (2×2)

34  22
11  42

Maximum from each 2×2 region

Step-by-Step Calculation

Top-Left Region

Values: 12, 20, 8, 34

Max: 34

Top-Right Region

Values: 5, 8, 15, 22

Max: 22

Bottom-Left Region

Values: 3, 7, 6, 11

Max: 11

Bottom-Right Region

Values: 42, 18, 9, 28

Max: 42

Result: Spatial dimensions reduced from 4×4 to 2×2 (75% reduction) while preserving the strongest activations (most important features detected by filters).
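
The same pooling step can be written as a short function. This is a minimal sketch in plain Python; `max_pool_2d` is an illustrative name, and the input is the 4×4 feature map from the example above.

```python
def max_pool_2d(feature_map, size=2, stride=2):
    """Take the maximum of each size x size window, moving `stride` cells at a time."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [12, 20, 5, 8],
    [8, 34, 15, 22],
    [3, 7, 42, 18],
    [6, 11, 9, 28],
]
pooled = max_pool_2d(feature_map)
print(pooled)  # [[34, 22], [11, 42]]
```

Sixteen values go in and four come out, one per window, which is the 75% reduction described above.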

Why Pooling Improves CNN Performance

Key Benefits

1. Dimensionality Reduction

Impact: 2×2 pooling with stride 2 halves the height and width, cutting the number of activations by 75%

Business Value: Faster processing enables real-time applications (e.g., live quality control on production lines)

Example: 224×224 image → 112×112 → 56×56 → 28×28

2. Translation Invariance

Impact: Small shifts in feature location don't change output

Business Value: Product can be slightly off-center in image, model still classifies correctly

Example: Logo detected whether left, center, or right

3. Feature Selection

Impact: Keeps only the strongest activations (most confident detections)

Business Value: Focuses on most distinctive features, improves classification accuracy

Example: Retains clear defects, discards noise

4. Computational Efficiency

Impact: Reduces memory usage and processing time

Business Value: Deploy models on edge devices (mobile, embedded systems) for on-site inspection

Example: Smartphone app for field inspections

Trade-off Consideration

Pooling discards some spatial information. Modern architectures (like ResNet) use techniques like strided convolutions as alternatives, but pooling remains widely used for its simplicity and effectiveness.

Knowledge Check: Pooling Operations

A feature map of size 64×64×128 passes through a 2×2 max pooling layer with stride 2. What is the output size?
A) 32 × 32 × 128
B) 64 × 64 × 64
C) 32 × 32 × 64
D) 62 × 62 × 128

Hint: Pooling reduces spatial dimensions (height and width) but does not change the depth (number of feature maps/channels).

Putting It Together: Complete CNN Forward Pass

Data Flow Through Network

Input: Product Image

224 × 224 × 3 (RGB image)

Original photo from production line camera

Convolutional Block 1

Conv Layer: 32 filters (3×3) → 224×224×32

ReLU Activation: Remove negative values

Max Pool (2×2): → 112×112×32

Learns: Basic edges, color transitions

Convolutional Block 2

Conv Layer: 64 filters (3×3) → 112×112×64

ReLU Activation: Remove negative values

Max Pool (2×2): → 56×56×64

Learns: Corners, simple shapes, textures

Convolutional Block 3

Conv Layer: 128 filters (3×3) → 56×56×128

ReLU Activation: Remove negative values

Max Pool (2×2): → 28×28×128

Learns: Component parts, assemblies, defect patterns

Flatten Layer

28 × 28 × 128 = 100,352 values

Convert 3D tensor to 1D vector for dense layers

Dense Layer

128 neurons with ReLU activation

Combines all learned features for decision-making

Output Layer (Softmax)

2 neurons → 2 class probabilities

Class 0 (Defective): 2%
Class 1 (Pass): 98%

Prediction: Product Passes Quality Control ✓
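
The final softmax step turns the two raw output scores into probabilities. The sketch below shows one way to compute it; the raw scores are hypothetical values chosen so that the result lands close to the 2% / 98% split shown above.

```python
import math


def softmax(logits):
    """Turn raw output-layer scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class_names = ["Defective", "Pass"]
logits = [0.0, 3.9]  # hypothetical raw scores, not taken from a trained model
probs = softmax(logits)

for name, p in zip(class_names, probs):
    print(f"{name}: {p:.0%}")

prediction = class_names[probs.index(max(probs))]
print("Prediction:", prediction)
```

The predicted class is simply the one with the highest probability; in practice a business may set a stricter threshold before letting a board pass.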

Activation Functions: Introducing Non-Linearity

Why Activation Functions?

Without activation functions, stacking multiple convolutional layers would be mathematically equivalent to a single layer. Activation functions introduce non-linearity, enabling CNNs to learn complex patterns.

ReLU: The Standard Choice

ReLU (Rectified Linear Unit) is the most common activation function in CNNs.

ReLU Operation: f(x) = max(0, x)

How ReLU Works

  • Positive values: Pass through unchanged
  • Negative values: Converted to zero
  • Effect: Keeps strong feature activations, suppresses weak/irrelevant ones

Why ReLU?

  • Computationally efficient (simple comparison)
  • Helps mitigate the vanishing gradient problem
  • Introduces sparsity (many zeros), which can improve generalization

Business Analogy: ReLU is like a quality filter that only passes signals above a threshold. Weak, noisy detections are zeroed out, while strong, confident feature detections are preserved. This makes the network focus on the most discriminative patterns.
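
ReLU is small enough to write out directly. In this sketch the feature-map values are hypothetical filter responses, and `relu_map` is an illustrative helper that applies f(x) = max(0, x) to every entry.

```python
def relu(x):
    """f(x) = max(0, x): positives pass through, negatives become zero."""
    return max(0, x)


def relu_map(feature_map):
    """Apply ReLU to every value of a 2D feature map."""
    return [[relu(value) for value in row] for row in feature_map]


# Hypothetical filter responses, including negative ones.
responses = [[-765, 0, 765], [120, -30, 15]]
activated = relu_map(responses)
print(activated)  # [[0, 0, 765], [120, 0, 15]]
```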

Training CNNs: Learning from Data

Training Process Overview

CNNs learn filter weights and dense layer parameters through supervised learning using labeled training data.

1. Forward Pass

Input image flows through network to produce prediction

Example: Image → Network → Predicts "Defective" with 75% confidence

2. Calculate Loss

Measure how wrong the prediction was compared to true label

Example: True label was "Pass" → Large error (prediction was wrong)

3. Backpropagation

Calculate how each parameter contributed to the error

Technical: Compute gradients of loss with respect to all weights

4. Update Weights

Adjust filter weights and dense layer parameters to reduce error

Goal: Improve prediction accuracy on next iteration

5. Repeat

Process thousands of images over multiple epochs

Result: Filters learn to detect task-relevant patterns

Training Data Requirements: A typical CNN trained from scratch needs 1,000-10,000+ labeled examples per class. For quality control with 2 classes (defective/pass), that means at least 2,000-20,000 labeled images.
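
The five training steps apply to any model trained by gradient descent. To keep the example short, the sketch below trains a single neuron (logistic regression) rather than a CNN; the toy data, learning rate, and epoch count are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy data: one "defect score" per image, label 1 = defective.
features = [0.1, 0.4, 0.5, 1.6, 1.9, 2.3]
labels = [0, 0, 0, 1, 1, 1]

weight, bias = 0.0, 0.0
learning_rate = 0.5
losses = []

for epoch in range(200):                     # 5. repeat over many epochs
    grad_w = grad_b = loss = 0.0
    for x, y in zip(features, labels):
        p = sigmoid(weight * x + bias)       # 1. forward pass
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))  # 2. cross-entropy loss
        grad_w += (p - y) * x                # 3. gradient of the loss w.r.t. the weight
        grad_b += p - y                      #    and w.r.t. the bias
    n = len(features)
    weight -= learning_rate * grad_w / n     # 4. update parameters against the gradient
    bias -= learning_rate * grad_b / n
    losses.append(loss / n)

print(f"average loss: {losses[0]:.3f} at the start, {losses[-1]:.3f} at the end")
```

In a real CNN the gradients for every filter weight are computed automatically by the framework's backpropagation, but the loop has the same shape.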

Transfer Learning: Leveraging Pre-Trained Models

The Challenge of Training from Scratch

The Solution: Transfer Learning

Core Idea: Start with a CNN already trained on millions of images (e.g., ImageNet with 1.2M images, 1,000 categories). The early layers have learned universal visual features (edges, textures, shapes) that transfer to new tasks.

How Transfer Learning Works

Step 1: Start with Pre-Trained Model

Use model trained on ImageNet (e.g., VGG16, ResNet50, EfficientNet)

Benefit: Proven feature extractors already learned

Step 2: Remove Final Layers

Discard the original classification head (1,000 ImageNet classes)

Keep: All convolutional layers (learned features)

Step 3: Add Custom Classifier

Add new dense layers for your specific task (e.g., 2 classes: defective/pass)

Initialize: Only these new layers need training

Step 4: Fine-Tune on Your Data

Train on your smaller dataset (1,000-5,000 images often sufficient)

Result: Task-specific classifier in hours instead of weeks
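
In Keras, the four steps might look like the sketch below. It assumes TensorFlow is installed and that ImageNet weights can be downloaded; the function name, the two-class default, and the choice of a global average pooling layer before the classifier are illustrative decisions, not something the slides prescribe.

```python
def build_transfer_model(num_classes: int = 2, image_size: int = 224):
    """Steps 1-3: pre-trained base, original head removed, new classifier added."""
    # Imported here so this sketch can be loaded without TensorFlow installed.
    import tensorflow as tf

    # Steps 1-2: ResNet50 with ImageNet weights, without its 1,000-class head.
    base = tf.keras.applications.ResNet50(
        weights="imagenet",
        include_top=False,
        input_shape=(image_size, image_size, 3),
    )
    base.trainable = False  # freeze the pre-trained convolutional layers

    # Step 3: a small new classifier for the task-specific classes.
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # expects integer class labels
        metrics=["accuracy"],
    )
    return model
```

Step 4 would then be a call such as `model.fit(train_ds, validation_data=val_ds, epochs=10)`, where `train_ds` and `val_ds` are hypothetical datasets of labeled images that have been passed through `tf.keras.applications.resnet50.preprocess_input`. Once the new classifier has converged, some of the later base layers can be unfrozen and trained with a small learning rate.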

Transfer Learning Business Impact

Comparison: Training from Scratch vs. Transfer Learning

| Factor | Training from Scratch | Transfer Learning | Advantage |
|---|---|---|---|
| Training Data Required | 100,000+ images per class | 500-5,000 images per class | 95% reduction |
| Training Time | 5-14 days on GPU | 2-8 hours on GPU | 50× faster |
| Computational Cost | $2,000-$5,000 | $50-$200 | 95% cost savings |
| Final Accuracy | 85-90% (limited data) | 93-98% (pre-learned features) | +8 percentage points |
| Time to Production | 3-6 months | 2-4 weeks | 10× faster deployment |

Real-World Success Story

Medical Imaging Startup: A company developing diabetic retinopathy detection needed to classify eye images. Using transfer learning with ResNet50:

• Dataset: 3,500 labeled images (vs. 100,000+ needed from scratch)
• Training time: 6 hours (vs. estimated 2 weeks from scratch)
• Accuracy: 96.8% (exceeding ophthalmologist performance)
• Time to market: 1 month (vs. 6+ months estimated)
• Result: FDA approval and deployment to 50+ clinics

Popular Pre-Trained CNN Architectures

Leading Models for Transfer Learning

VGG16 / VGG19

Released: 2014

Depth: 16-19 layers

Parameters: 138M (VGG16)

Strengths: Simple architecture, easy to understand, excellent for teaching

Use Case: Good baseline for many tasks

ResNet50 / ResNet101

Released: 2015

Depth: 50-101 layers (the ResNet family extends to 152)

Parameters: 25M (ResNet50)

Strengths: Skip connections enable very deep networks, excellent accuracy

Use Case: Industry standard for most applications

InceptionV3

Released: 2015

Depth: 48 layers

Parameters: 24M

Strengths: Multi-scale processing, efficient computation

Use Case: Balance between accuracy and speed

EfficientNet

Released: 2019

Depth: Varies (B0-B7)

Parameters: 5-66M

Strengths: State-of-the-art accuracy with fewer parameters, scalable

Use Case: Best for production deployment, mobile devices

Selecting the Right Architecture

For Learning: Start with VGG16 (simple, interpretable)

For Production: ResNet50 or EfficientNet (best accuracy-efficiency trade-off)

For Mobile/Edge: EfficientNet-B0 or MobileNet (optimized for constrained devices)

For Research: Latest models (e.g., EfficientNetV2, ConvNeXt)

Knowledge Check: Transfer Learning

Your company needs to classify 5 types of manufacturing defects. You have 2,000 labeled images. Which approach is most appropriate?
A) Train a CNN from scratch with random weight initialization
B) Use transfer learning with a pre-trained model like ResNet50, replacing the final layer with 5 output neurons
C) Use a traditional machine learning algorithm like logistic regression on raw pixel values
D) Manually design edge detection filters and use a decision tree

Evaluating CNN Performance

Key Metrics for Image Classification

Accuracy

Definition: Percentage of correct predictions

Formula: Correct Predictions / Total Predictions

When to Use: Balanced datasets (similar number of examples per class)

Limitation: Misleading for imbalanced data

Precision

Definition: Of all positive predictions, how many were actually positive?

Formula: True Positives / (True Positives + False Positives)

Business Meaning: "When I flag a product as defective, how often am I right?"

Critical When: False positives are costly (wasted inspection time)

Recall (Sensitivity)

Definition: Of all actual positives, how many did we detect?

Formula: True Positives / (True Positives + False Negatives)

Business Meaning: "Of all actual defects, how many did I catch?"

Critical When: False negatives are costly (defects reach customers)

F1-Score

Definition: Harmonic mean of precision and recall

Formula: 2 × (Precision × Recall) / (Precision + Recall)

Business Meaning: Balanced measure when both false positives and false negatives matter

Use: Standard metric for imbalanced classification

Business Decision Making

Example: Quality control system with 95% accuracy, 90% precision, 98% recall

Interpretation: System catches 98% of defects (high recall) but also flags some good products as defective (90% precision). This trade-off may be acceptable if manual verification is cheaper than defects reaching customers.
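
All four metrics follow from the counts in a confusion matrix. In the sketch below, "defective" is treated as the positive class, and the counts are hypothetical values chosen so that they reproduce the 95% accuracy, 90% precision, and 98% recall of the example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Hypothetical inspection run of 1,160 boards, 450 of them truly defective.
metrics = classification_metrics(tp=441, fp=49, fn=9, tn=661)
for name, value in metrics.items():
    print(f"{name:<9} {value:.1%}")
```

With these counts, 9 defective boards slip through (false negatives) while 49 good boards are flagged for manual review (false positives), which is the trade-off the interpretation describes.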

Common CNN Challenges and Solutions

Practical Issues in Deployment

Challenge: Overfitting
Cause: Model memorizes training data, poor generalization to new images
Solutions:
• Data augmentation (rotations, flips, brightness changes)
• Dropout layers
• More training data
• Regularization techniques

Challenge: Class Imbalance
Cause: Far more examples of one class than others (e.g., 95% pass, 5% defective)
Solutions:
• Weighted loss function
• Oversample minority class
• Undersample majority class
• Use precision/recall instead of accuracy

Challenge: Limited Training Data
Cause: Insufficient labeled examples to train effectively
Solutions:
• Transfer learning (primary solution)
• Data augmentation
• Synthetic data generation
• Active learning to prioritize labeling

Challenge: Computational Cost
Cause: Real-time inference requirements, limited hardware
Solutions:
• Model compression (pruning, quantization)
• Use efficient architectures (EfficientNet, MobileNet)
• Cloud-based inference
• Batch processing when real-time not required

Challenge: Domain Shift
Cause: Training data differs from production data (lighting, angles, quality)
Solutions:
• Collect data from actual production environment
• Domain adaptation techniques
• Regular model retraining
• Data augmentation to simulate variations
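
For the class imbalance challenge above, a weighted loss needs one weight per class. A common heuristic, and the one sketched here, is n_samples / (n_classes × class_count), so that rarer classes count for more; the 950 / 50 split mirrors the 95% pass, 5% defective example, and `balanced_class_weights` is an illustrative name.

```python
from collections import Counter


def balanced_class_weights(labels):
    """weight = n_samples / (n_classes * class_count): rare classes weigh more."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label list with 95% pass and 5% defective.
labels = ["pass"] * 950 + ["defective"] * 50
weights = balanced_class_weights(labels)
for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

Here a mistake on a defective board costs the model about 19 times as much as a mistake on a passing one, which offsets the 19:1 imbalance in the data.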

CNN Applications Across Industries

Transformative Business Use Cases

Manufacturing

Applications:

  • Automated quality inspection
  • Defect classification
  • Surface finish analysis
  • Assembly verification

ROI: 70-90% labor cost reduction, 15-30% quality improvement

Healthcare

Applications:

  • Medical image diagnosis (X-ray, MRI, CT)
  • Cancer detection
  • Retinopathy screening
  • Skin lesion classification

Impact: Earlier detection, radiologist efficiency gains of 5×

Retail & E-commerce

Applications:

  • Visual product search
  • Automated tagging
  • Inventory monitoring
  • Cashierless checkout

Impact: 40% increase in product discovery, 25% conversion lift

Agriculture

Applications:

  • Crop disease detection
  • Yield prediction
  • Weed identification
  • Livestock monitoring

Impact: 30% reduction in crop loss, pesticide savings of 40%

Autonomous Vehicles

Applications:

  • Object detection (pedestrians, vehicles)
  • Lane detection
  • Traffic sign recognition
  • Obstacle classification

Impact: Foundation of self-driving technology

Security

Applications:

  • Facial recognition
  • Anomaly detection
  • License plate reading
  • Crowd analysis

Impact: 70% reduction in security incidents, real-time alerting

Week 8 Summary: CNNs for Image Classification

Key Takeaways

Next Week Preview

Week 9: Advanced CNN Topics
• Object detection (finding and localizing multiple objects)
• Semantic segmentation (pixel-level classification)
• Model interpretability (understanding what CNNs learn)
• Deployment strategies (edge devices, cloud, mobile)

Practical Assignment

In your lab session, you will: