Week 8: DATA4800 - AI and Machine Learning
Understanding How Machines Learn to See
By the end of this week, you will be able to:
An electronics manufacturer produces 50,000 circuit boards daily. Each board must be inspected for defects before shipment.
Missed defects result in:
| Metric | Manual Inspection | CNN System | Improvement |
|---|---|---|---|
| Inspection Speed | 100 boards/hour | 10,000 boards/hour | 100× faster |
| Accuracy | 85-90% | 98.5% | +8.5 to 13.5 percentage points |
| Consistency | Varies with fatigue | Constant 24/7 | No degradation |
| Annual Cost | $7.5M (labor) | $500K (system) | 93% cost reduction |
Use Case: Automated screening of medical images (X-rays, MRIs, CT scans)
Impact: Radiologists process 5× more cases with 15% higher detection rate for early-stage diseases
Value: Early detection saves lives and reduces treatment costs by 60%
Use Case: Automated product tagging and visual search
Impact: Catalog 100,000+ products automatically, enable customer image search
Value: 40% increase in product discovery, 25% boost in conversion rates
Use Case: Crop disease detection and yield prediction
Impact: Identify plant diseases 2 weeks earlier than traditional methods
Value: Prevent 30% crop loss, increase farm profitability by $150K annually
Use Case: Facial recognition and anomaly detection
Impact: Monitor 1,000+ cameras simultaneously with real-time alerts
Value: Reduce security incidents by 70%, enable touchless access control
Let's examine what a computer "sees" when processing an image:
If we connect these inputs to just 1,000 neurons in the first hidden layer:
150,528 inputs × 1,000 neurons = 150,528,000 parameters (just in the first layer!)
| Network Type | Input Size | Hidden Layer Size | Parameters | Issues |
|---|---|---|---|---|
| Traditional NN | 784 (28×28) | 128 neurons | 100,352 | Manageable |
| Traditional NN | 150,528 (224×224×3) | 128 neurons | 19,267,584 | Severe overfitting |
| Traditional NN | 150,528 | 1,000 neurons | 150,528,000 | Impractical: memory-heavy and prone to overfitting |
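The parameter counts in the table can be reproduced with a few lines of Python. This is a minimal sketch that counts weights only (biases are left out, which matches the figures above). The helper name `dense_weight_count` is chosen here for illustration.

```python
def dense_weight_count(n_inputs: int, n_neurons: int) -> int:
    """Weights in a fully connected layer (biases excluded, as in the table)."""
    return n_inputs * n_neurons


mnist_inputs = 28 * 28          # 784 grayscale pixels
rgb_inputs = 224 * 224 * 3      # height x width x RGB channels

counts = {
    "28x28 -> 128 neurons": dense_weight_count(mnist_inputs, 128),
    "224x224x3 -> 128 neurons": dense_weight_count(rgb_inputs, 128),
    "224x224x3 -> 1,000 neurons": dense_weight_count(rgb_inputs, 1000),
}

for name, n_weights in counts.items():
    print(f"{name}: {n_weights:,} weights")
```

Every input connects to every neuron, so the count grows with the product of the two sizes. That is why image-sized inputs become expensive so quickly.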
Consider how you recognize a cat in a photograph. You don't analyze every pixel individually. Instead, you follow a hierarchical process:
Detect basic lines and boundaries
Combine edges into simple geometric forms
Identify object components (ears, eyes, whiskers)
Recognize complete object: "This is a cat"
Instead of connecting to all pixels, each neuron only examines a small region (e.g., 3×3 pixels)
Benefit: Dramatically reduces parameters from millions to thousands
Use the same "filter" (pattern detector) across the entire image
Benefit: Learns to detect patterns regardless of where they appear in the image
Stack multiple layers that learn increasingly complex features
Benefit: Automatically discovers relevant patterns without manual feature engineering
Convolution is a mathematical operation that applies a small filter (also called a kernel) across an image to detect specific patterns.
Think of convolution like a quality inspector with a checklist:
Dark (0) on left, Bright (255) on right
Detects left-to-right brightness change
High values = vertical edge detected
Filter slides over image, performs element-wise multiplication and sums:
(0×-1 + 0×0 + 255×1) + (0×-1 + 0×0 + 255×1) + (0×-1 + 0×0 + 255×1) = 765
This high value indicates a strong vertical edge was detected.
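The sliding-window calculation above can be sketched in plain Python. The 4×6 test image (dark left half, bright right half) and the function name `correlate2d_valid` are assumptions made for this illustration. As in most deep learning libraries, the filter is applied without flipping, and no padding is used here.

```python
def correlate2d_valid(image, kernel):
    """Slide the kernel over the image (stride 1, no padding).

    At each position, multiply element-wise and sum the products.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            total = 0
            for di in range(k_h):
                for dj in range(k_w):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Vertical edge filter from the worked example
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

# Hypothetical image: dark (0) on the left, bright (255) on the right
image = [[0, 0, 0, 255, 255, 255] for _ in range(4)]

feature_map = correlate2d_valid(image, vertical_edge)
for row in feature_map:
    print(row)
```

The response is 765 wherever the window straddles the dark-to-bright boundary and 0 over the uniform regions. A feature map records exactly this pattern.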
A single filter can only detect one type of pattern. Real-world classification requires detecting many different features simultaneously.
Detects vertical traces and component edges
Business Value: Identifies misaligned components
Detects horizontal traces and solder points
Business Value: Finds disconnected circuits
Detects capacitors and mounting holes
Business Value: Verifies component presence
Detects surface roughness and burn marks
Business Value: Identifies manufacturing defects
When a filter slides across an image, it produces a feature map (also called an activation map) that shows where and how strongly the pattern was detected.
Input Image
224 × 224 × 3
(Height × Width × Channels)
Feature Maps (4 shown)
224 × 224 × 64
(64 different filters applied)
Interpretation: Each feature map highlights regions where its corresponding filter detected its pattern. Bright areas = strong detection, dark areas = weak/no detection.
Input: Height × Width × Color Channels (3 for RGB)
Output: Height × Width × Number of Filters
The number of filters (typically 32, 64, 128, 256, or 512) becomes the new "depth" dimension.
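The shape rule above can be written as a small helper. This is a sketch under the common convention that "same" padding keeps the spatial size at stride 1, while "valid" (no padding) shrinks it. The function name and defaults are illustrative, not taken from any particular library.

```python
def conv_output_shape(height, width, n_filters, kernel=3, stride=1, padding="same"):
    """Output shape (height, width, depth) of a convolutional layer.

    The input channel count does not appear: the output depth is set
    only by the number of filters.
    """
    if padding == "same":
        out_h = -(-height // stride)   # ceiling division
        out_w = -(-width // stride)
    elif padding == "valid":
        out_h = (height - kernel) // stride + 1
        out_w = (width - kernel) // stride + 1
    else:
        raise ValueError("padding must be 'same' or 'valid'")
    return (out_h, out_w, n_filters)


print(conv_output_shape(224, 224, 64))                    # padded: size preserved
print(conv_output_shape(224, 224, 64, padding="valid"))   # unpadded: slightly smaller
```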
Hint: Consider that each filter produces one feature map, with the same spatial dimensions as the input when padding is used (as in this week's examples) or slightly smaller without padding, and that all filters are applied to the same input.
Function: Apply filters to detect patterns
Output: Feature maps showing where patterns were found
Parameters: Filter weights (learned during training)
Function: Reduce spatial dimensions while preserving important features
Output: Downsampled feature maps
Parameters: None (fixed operation)
Function: Combine learned features to make final classification decision
Output: Class probabilities
Parameters: Connection weights (learned during training)
Detects: Edges, lines, gradients, simple textures
Example: Horizontal/vertical boundaries, color transitions
Detects: Corners, curves, simple shapes
Example: Circles, rectangles, T-junctions
Detects: Object parts and assemblies
Example: Wheels, faces, logos, product components
Combines: All learned features
Output: "Defective Product" or "Quality Pass"
| Layer | Operation | Output Shape | Parameters | What It Learns |
|---|---|---|---|---|
| Input | Product image | 224 × 224 × 3 | 0 | Raw pixel data (RGB) |
| Conv1 | 32 filters (3×3) | 224 × 224 × 32 | 896 | Basic edges and color gradients |
| Pool1 | Max pooling (2×2) | 112 × 112 × 32 | 0 | Downsample while keeping features |
| Conv2 | 64 filters (3×3) | 112 × 112 × 64 | 18,496 | Simple shapes and corners |
| Pool2 | Max pooling (2×2) | 56 × 56 × 64 | 0 | Further dimensionality reduction |
| Conv3 | 128 filters (3×3) | 56 × 56 × 128 | 73,856 | Component parts and patterns |
| Pool3 | Max pooling (2×2) | 28 × 28 × 128 | 0 | Compact representation |
| Flatten | Reshape to vector | 100,352 | 0 | Prepare for dense layers |
| Dense1 | 128 neurons | 128 | 12,845,184 | Combine all features |
| Output | 2 neurons (softmax) | 2 | 258 | Class probabilities: Defective/Pass |
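The shapes and parameter counts in the table can be checked by tracing the network layer by layer. This is a minimal sketch that assumes 3×3 filters with padding that preserves height and width, one bias per filter or neuron, and 2×2 max pooling with stride 2. The helper names are chosen for illustration.

```python
def conv_params(kernel, in_channels, n_filters):
    """Each filter has kernel*kernel*in_channels weights plus one bias."""
    return (kernel * kernel * in_channels + 1) * n_filters


def dense_params(n_inputs, n_neurons):
    """Each neuron has one weight per input plus one bias."""
    return (n_inputs + 1) * n_neurons


shape = (224, 224, 3)   # input image: height, width, channels
report = []             # (layer name, output shape, parameter count)

for name, n_filters in [("Conv1", 32), ("Conv2", 64), ("Conv3", 128)]:
    h, w, c = shape
    params = conv_params(3, c, n_filters)
    shape = (h, w, n_filters)               # padded conv keeps height and width
    report.append((name, shape, params))
    shape = (h // 2, w // 2, n_filters)     # pooling halves height and width only
    report.append((name.replace("Conv", "Pool"), shape, 0))

flat = shape[0] * shape[1] * shape[2]
report.append(("Flatten", (flat,), 0))
report.append(("Dense1", (128,), dense_params(flat, 128)))
report.append(("Output", (2,), dense_params(128, 2)))

total = sum(params for _, _, params in report)
for name, out_shape, params in report:
    print(f"{name:8s} {str(out_shape):18s} {params:>12,}")
print(f"Total parameters: {total:,}")
```

Almost all of the parameters sit in Dense1, not in the convolutional layers. This is the practical payoff of local connectivity and parameter sharing.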
As we add more convolutional layers, the number of feature maps and the computational cost grow rapidly. Pooling layers address this by:
Pink regions: 2×2 windows
Maximum from each 2×2 region
Values: 12, 20, 8, 34
Max: 34
Values: 5, 8, 15, 22
Max: 22
Values: 3, 7, 6, 11
Max: 11
Values: 42, 18, 9, 28
Max: 42
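Max pooling over the four windows listed above can be sketched as follows. The placement of the windows in a 4×4 grid (top-left, top-right, bottom-left, bottom-right) is an assumption, since only the window contents are given. The function name is illustrative.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2: keep the largest value in each window."""
    pooled = []
    for i in range(0, len(grid) - 1, 2):
        row = []
        for j in range(0, len(grid[0]) - 1, 2):
            window = [grid[i][j], grid[i][j + 1],
                      grid[i + 1][j], grid[i + 1][j + 1]]
            row.append(max(window))
        pooled.append(row)
    return pooled


# Assumed layout of the four 2x2 windows from the example
feature_map = [
    [12, 20,  5,  8],
    [ 8, 34, 15, 22],
    [ 3,  7, 42, 18],
    [ 6, 11,  9, 28],
]

pooled = max_pool_2x2(feature_map)
print(pooled)
```

The 16 input values become 4 outputs, one maximum per window.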
Impact: 2×2 pooling (stride 2) halves height and width, cutting the number of activations per feature map by 75%
Business Value: Faster processing enables real-time applications (e.g., live quality control on production lines)
Example: 224×224 image → 112×112 → 56×56 → 28×28
Impact: Small shifts in feature location don't change output
Business Value: Product can be slightly off-center in image, model still classifies correctly
Example: Logo still detected when shifted by a few pixels
Impact: Keeps only the strongest activations (most confident detections)
Business Value: Focuses on most distinctive features, improves classification accuracy
Example: Retains clear defects, discards noise
Impact: Reduces memory usage and processing time
Business Value: Deploy models on edge devices (mobile, embedded systems) for on-site inspection
Example: Smartphone app for field inspections
Pooling discards some spatial information. Modern architectures (like ResNet) use techniques such as strided convolutions as alternatives, but pooling remains widely used for its simplicity and effectiveness.
Hint: Pooling reduces spatial dimensions (height and width) but does not change the depth (number of feature maps/channels).
224 × 224 × 3 (RGB image)
Original photo from production line camera
Conv Layer: 32 filters (3×3) → 224×224×32
ReLU Activation: Set negative values to zero
Max Pool (2×2): → 112×112×32
Learns: Basic edges, color transitions
Conv Layer: 64 filters (3×3) → 112×112×64
ReLU Activation: Set negative values to zero
Max Pool (2×2): → 56×56×64
Learns: Corners, simple shapes, textures
Conv Layer: 128 filters (3×3) → 56×56×128
ReLU Activation: Set negative values to zero
Max Pool (2×2): → 28×28×128
Learns: Component parts, assemblies, defect patterns
28 × 28 × 128 = 100,352 values
Convert 3D tensor to 1D vector for dense layers
128 neurons with ReLU activation
Combines all learned features for decision-making
2 neurons → 2 class probabilities
Class 0 (Defective): 2%
Class 1 (Pass): 98%
Prediction: Product Passes Quality Control ✓
Without activation functions, stacking multiple convolutional layers would be mathematically equivalent to a single linear layer. Activation functions introduce non-linearity, enabling CNNs to learn complex patterns.
ReLU (Rectified Linear Unit) is the most common activation function in CNNs.
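A short sketch of both points: ReLU itself, and why layers without an activation collapse into one. The two one-dimensional "layers" and their coefficients are hypothetical stand-ins for real convolutional layers.

```python
def relu(x):
    """ReLU: pass positive values through, set negative values to zero."""
    return max(0.0, x)


activations = [-3.0, -0.5, 0.0, 2.0, 7.5]
rectified = [relu(a) for a in activations]
print(rectified)


# Two linear "layers" (hypothetical coefficients)
def layer1(x):
    return 2.0 * x + 1.0


def layer2(x):
    return -3.0 * x + 4.0


def collapsed(x):
    # -3 * (2x + 1) + 4 simplifies to -6x + 1: a single linear layer
    return -6.0 * x + 1.0


x = -2.0
without_activation = layer2(layer1(x))    # identical to collapsed(x)
with_relu = layer2(relu(layer1(x)))       # no single linear layer reproduces this
print(without_activation, collapsed(x), with_relu)
```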
CNNs learn filter weights and dense layer parameters through supervised learning using labeled training data.
Input image flows through network to produce prediction
Example: Image → Network → Predicts "Defective" with 75% confidence
Measure how wrong the prediction was compared to true label
Example: True label was "Pass" → Large error (prediction was wrong)
Calculate how each parameter contributed to the error
Technical: Compute gradients of loss with respect to all weights
Adjust filter weights and dense layer parameters to reduce error
Goal: Improve prediction accuracy on next iteration
Process thousands of images over multiple epochs
Result: Filters learn to detect task-relevant patterns
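The five training steps can be illustrated on a deliberately tiny model. This sketch replaces the CNN with a single sigmoid neuron and uses a made-up one-feature dataset (a hypothetical "defect score" per image), so that the forward pass, loss, gradient, and update each fit on one line. Real CNN training applies the same loop to millions of weights through backpropagation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Toy dataset (assumed values): (defect score, label), label 1 = defective
data = [(0.2, 0), (0.4, 0), (0.5, 0), (1.5, 1), (1.8, 1), (2.1, 1)]

w, b = 0.0, 0.0          # parameters to learn
learning_rate = 0.5
losses = []

for epoch in range(200):                                  # 5. repeat over epochs
    grad_w = grad_b = total_loss = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)                            # 1. forward pass
        total_loss += -(y * math.log(p)
                        + (1 - y) * math.log(1 - p))      # 2. cross-entropy loss
        grad_w += (p - y) * x                             # 3. gradient of loss w.r.t. w
        grad_b += (p - y)                                 #    and w.r.t. b
    n = len(data)
    w -= learning_rate * grad_w / n                       # 4. update parameters
    b -= learning_rate * grad_b / n
    losses.append(total_loss / n)

print(f"first loss: {losses[0]:.3f}, last loss: {losses[-1]:.3f}")
print(f"P(defective | score=0.2) = {sigmoid(w * 0.2 + b):.2f}")
print(f"P(defective | score=2.1) = {sigmoid(w * 2.1 + b):.2f}")
```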
Use model trained on ImageNet (e.g., VGG16, ResNet50, EfficientNet)
Benefit: Proven feature extractors already learned
Discard the original classification head (1,000 ImageNet classes)
Keep: All convolutional layers (learned features)
Add new dense layers for your specific task (e.g., 2 classes: defective/pass)
Initialize: Only these new layers need training
Train on your smaller dataset (1,000-5,000 images often sufficient)
Result: Task-specific classifier in hours instead of weeks
| Factor | Training from Scratch | Transfer Learning | Advantage |
|---|---|---|---|
| Training Data Required | 100,000+ images per class | 500-5,000 images per class | 95% reduction |
| Training Time | 5-14 days on GPU | 2-8 hours on GPU | 50× faster |
| Computational Cost | $2,000-$5,000 | $50-$200 | 95% cost savings |
| Final Accuracy | 85-90% (limited data) | 93-98% (pre-learned features) | +8 percentage points |
| Time to Production | 3-6 months | 2-4 weeks | 10× faster deployment |
Released: 2014
Depth: 16-19 layers
Parameters: 138M (VGG16)
Strengths: Simple architecture, easy to understand, excellent for teaching
Use Case: Good baseline for many tasks
Released: 2015
Depth: 50-152 layers
Parameters: 25M (ResNet50)
Strengths: Skip connections enable very deep networks, excellent accuracy
Use Case: Industry standard for most applications
Released: 2015
Depth: 48 layers
Parameters: 24M
Strengths: Multi-scale processing, efficient computation
Use Case: Balance between accuracy and speed
Released: 2019
Depth: Varies (B0-B7)
Parameters: 5-66M
Strengths: State-of-the-art accuracy with fewer parameters, scalable
Use Case: Best for production deployment, mobile devices
For Learning: Start with VGG16 (simple, interpretable)
For Production: ResNet50 or EfficientNet (best accuracy-efficiency trade-off)
For Mobile/Edge: EfficientNet-B0 or MobileNet (optimized for constrained devices)
For Research: Latest models (e.g., EfficientNetV2, ConvNeXt)
Definition: Percentage of correct predictions
Formula: Correct Predictions / Total Predictions
When to Use: Balanced datasets (similar number of examples per class)
Limitation: Misleading for imbalanced data
Definition: Of all positive predictions, how many were actually positive?
Formula: True Positives / (True Positives + False Positives)
Business Meaning: "When I flag a product as defective, how often am I right?"
Critical When: False positives are costly (wasted inspection time)
Definition: Of all actual positives, how many did we detect?
Formula: True Positives / (True Positives + False Negatives)
Business Meaning: "Of all actual defects, how many did I catch?"
Critical When: False negatives are costly (defects reach customers)
Definition: Harmonic mean of precision and recall
Formula: 2 × (Precision × Recall) / (Precision + Recall)
Business Meaning: Balanced measure when both false positives and false negatives matter
Use: Standard metric for imbalanced classification
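The four formulas can be computed directly from confusion-matrix counts. The counts below are hypothetical (1,000 inspected boards, 50 of them defective) and are chosen to show how accuracy can look strong on imbalanced data while recall is poor.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical model: flags 40 boards as defective, 30 of them correctly
model = classification_metrics(tp=30, fp=10, fn=20, tn=940)

# Hypothetical baseline that labels every board "pass"
always_pass = classification_metrics(tp=0, fp=0, fn=50, tn=950)

for name, m in [("model", model), ("always pass", always_pass)]:
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The "always pass" baseline reaches 95% accuracy while catching no defects at all. For imbalanced inspection data, this is why recall and F1 are more informative than accuracy.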
| Challenge | Cause | Solution |
|---|---|---|
| Overfitting | Model memorizes training data, poor generalization to new images | • Data augmentation (rotations, flips, brightness changes) • Dropout layers • More training data • Regularization techniques |
| Class Imbalance | Far more examples of one class than others (e.g., 95% pass, 5% defective) | • Weighted loss function • Oversample minority class • Undersample majority class • Use precision/recall instead of accuracy |
| Limited Training Data | Insufficient labeled examples to train effectively | • Transfer learning (primary solution) • Data augmentation • Synthetic data generation • Active learning to prioritize labeling |
| Computational Cost | Real-time inference requirements, limited hardware | • Model compression (pruning, quantization) • Use efficient architectures (EfficientNet, MobileNet) • Cloud-based inference • Batch processing when real-time not required |
| Domain Shift | Training data differs from production data (lighting, angles, quality) | • Collect data from actual production environment • Domain adaptation techniques • Regular model retraining • Data augmentation to simulate variations |
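For the class imbalance row, one common way to set up a weighted loss is to weight each class inversely to its frequency. This is a minimal sketch using the heuristic n_samples / (n_classes × class_count), applied to the 95% pass / 5% defective split mentioned in the table. The function name and the 100-label dataset are assumptions for illustration.

```python
def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count)
            for cls, count in counts.items()}


# Hypothetical dataset matching the 95% / 5% split
labels = ["pass"] * 95 + ["defective"] * 5
weights = inverse_frequency_weights(labels)

for cls, weight in weights.items():
    print(f"{cls}: {weight:.3f}")
```

With these weights, an error on a defective board counts about 19 times as much as an error on a passing board, which offsets the imbalance during training.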
Applications:
ROI: 70-90% labor cost reduction, 15-30% quality improvement
Applications:
Impact: Earlier detection, radiologist efficiency gains of 5×
Applications:
Impact: 40% increase in product discovery, 25% conversion lift
Applications:
Impact: 30% reduction in crop loss, pesticide savings of 40%
Applications:
Impact: Foundation of self-driving technology
Applications:
Impact: 70% reduction in security incidents, real-time alerting
In your lab session, you will: