Week 4 · Assessment Quiz

Neural Networks & Gradient Descent

25 multiple-choice questions on neurons, backpropagation, PyTorch and gradient descent from scratch, plus 5 short-answer questions.

📋 30 questions total ⭐ 30 marks 🕐 No time limit 🔒 Answers not revealed

Q1

In a neural network, a single neuron computes:

A. A random sample from a normal distribution scaled by its weights
B. A weighted sum of its inputs followed by a non-linear activation function
C. The mean of all inputs passed to it during training
D. The gradient of the loss with respect to its position in the network
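Review note: the computation inside a single neuron can be sketched in a few lines of plain Python. This is a minimal illustration only; the names `neuron` and `relu` and the example numbers are assumptions made for this sketch and are not part of the Week 4 material.

```python
import math


def relu(z):
    """Rectified linear unit: max(0, z)."""
    return max(0.0, z)


def neuron(inputs, weights, bias, activation=math.tanh):
    """Combine inputs linearly, add a bias, then apply an activation."""
    # Linear part: one weight per input, plus a single bias term.
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    # Non-linear part: the activation function.
    return activation(z)


# Hypothetical example values: two inputs, two weights, one bias.
out = neuron([1.0, 2.0], [0.5, -0.25], bias=0.1, activation=relu)

# With all inputs at zero, only the bias reaches the activation.
out_zero_inputs = neuron([0.0, 0.0], [0.5, -0.25], bias=0.1, activation=relu)
```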

Q2

The purpose of an activation function after a linear layer is to:

A. Normalise the output to have zero mean and unit variance
B. Scale the output into a fixed range such as [0, 255]
C. Introduce non-linearity so the network can learn complex patterns beyond linear functions
D. Prevent the loss from increasing during backpropagation

Q3

The ReLU activation function outputs:

A. A value between −1 and +1 (like tanh)
B. The input if positive, and 0 otherwise, i.e. max(0, x)
C. A probability between 0 and 1 (like sigmoid)
D. The absolute value of the input |x|

Q4

Backpropagation is used to:

A. Compute the gradient of the loss with respect to each parameter using the chain rule
B. Forward pass the input through the network to produce a prediction
C. Update the model architecture based on the validation performance
D. Initialise the model weights before training begins

Q5

Gradient descent updates each weight by:

A. Adding the gradient to increase the loss
B. Multiplying the weight by the gradient directly
C. Subtracting the learning rate times the gradient to reduce the loss
D. Dividing the weight by the total number of parameters
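Review note: a from-scratch gradient descent loop on a one-dimensional toy loss, f(w) = (w − 3)², may help when revising this topic. The loss function, the starting point, and the names `grad` and `gradient_descent` are assumptions chosen for illustration.

```python
def grad(w):
    """Derivative of the toy loss f(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


def gradient_descent(w, lr=0.1, steps=100):
    """Repeatedly step against the gradient, scaled by the learning rate."""
    for _ in range(steps):
        w = w - lr * grad(w)
    return w


# Starting from w = 0, the iterates approach the minimiser w = 3.
w_final = gradient_descent(0.0)
```

With `lr=0.1` the distance to the minimum shrinks by a factor of 0.8 per step on this particular loss; a learning rate above 1.0 would make the same loop diverge.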

Q6

The vanishing gradient problem occurs when:

A. The learning rate is set too high, causing weights to overflow
B. Gradients shrink exponentially as they are propagated back through many layers, making early layers learn very slowly
C. The loss function is not differentiable at the minimum
D. The batch size is too large, causing gradient estimates to be noisy
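Review note: the effect of depth on backpropagated gradients can be illustrated by multiplying one activation derivative per layer. The sketch below is a deliberate simplification: it ignores the weight matrices and uses the same pre-activation value at every layer, and the helper name `backprop_factor` is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def backprop_factor(local_grad, z, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    factor = 1.0
    for _ in range(depth):
        factor *= local_grad(z)
    return factor


# Twenty sigmoid layers at their steepest point: 0.25 ** 20.
sig_factor = backprop_factor(sigmoid_grad, 0.0, 20)
# Twenty ReLU layers with positive pre-activations: 1.0 ** 20.
relu_factor = backprop_factor(relu_grad, 1.0, 20)
```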

Q7

A fully connected layer with 10 inputs and 5 output neurons contains how many weight parameters (excluding biases)?

A. 15 (10 + 5)
B. 5 (one per output neuron)
C. 50 (10 × 5: one weight per input–output pair)
D. 2 (one matrix per direction)

Q8

The learning rate controls:

A. How many layers are used in the network
B. How large a step is taken in the direction of the negative gradient during each weight update
C. How many training examples are in each mini-batch
D. The fraction of neurons dropped out during each forward pass

Q9

A bias term in a neuron allows:

A. The activation to fire even when all inputs are zero, shifting the decision boundary
B. The gradient to propagate backwards without diminishing
C. Neurons in the same layer to share information
D. The model to learn without needing any labelled data

Q10

"Depth" in a deep neural network refers to:

A. The total number of parameters in the model
B. The number of hidden layers between input and output
C. The dimensionality of the input data
D. The number of neurons in the widest layer

Q11

Which activation function is most commonly used in hidden layers of modern deep networks?

A. Sigmoid: outputs values in (0, 1)
B. Tanh: outputs values in (−1, 1)
C. ReLU: simple, mitigates vanishing gradients, computationally efficient
D. Softmax: produces probability distributions

Q12

Stochastic Gradient Descent (SGD) updates weights:

A. Once per epoch using the gradient averaged over the entire training set
B. After each mini-batch, using the gradient estimated from that batch only
C. Once at the very end of training using the accumulated gradient
D. After each layer, not after each batch

Q13

The forward pass through a neural network computes:

A. The predicted output for a given input by applying layers sequentially
B. The gradient of the loss with respect to every weight
C. The optimal weight values that minimise the loss
D. A random permutation of the data for the next epoch

Q14

Mean Squared Error (MSE) loss measures:

A. The probability that the model assigns to the correct class
B. The average of the squared differences between predictions and true values
C. The cross-entropy between the predicted and true distributions
D. The absolute difference between predictions and true values
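Review note: Mean Squared Error is short enough to write out by hand. The function name `mse` and the example numbers below are assumptions made for this sketch.

```python
def mse(predictions, targets):
    """Mean Squared Error over two equal-length sequences of numbers."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


# Errors are 1, 0 and -2, so the squared errors are 1, 0 and 4.
loss = mse([2.0, 4.0, 6.0], [1.0, 4.0, 8.0])
```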

Q15

If a model's training loss increases consistently over epochs, this most likely indicates:

A. The model has converged to a perfect solution
B. The validation set is too small to give reliable estimates
C. The learning rate may be too high, causing gradient updates to overshoot the minimum
D. The batch size is too large, slowing convergence

Q16

The chain rule in backpropagation allows:

A. Multiple GPUs to synchronise gradients across a distributed training cluster
B. Gradients of composite functions to be computed by multiplying local gradients layer by layer
C. The learning rate to adapt automatically based on the curvature of the loss
D. Weight updates to be applied before all gradients are computed
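Review note: the chain rule can be checked numerically on a tiny one-weight model. The sketch assumes a sigmoid unit with a squared-error loss, loss = (sigmoid(w·x) − t)², and compares the hand-derived gradient with a central finite difference. All function names and example values are hypothetical.

```python
import math


def forward(w, x, t):
    """Forward pass: linear step, sigmoid activation, squared-error loss."""
    z = w * x
    a = 1.0 / (1.0 + math.exp(-z))
    loss = (a - t) ** 2
    return z, a, loss


def grad_chain_rule(w, x, t):
    """dloss/dw as a product of three local derivatives."""
    _, a, _ = forward(w, x, t)
    dloss_da = 2.0 * (a - t)   # derivative of the squared error
    da_dz = a * (1.0 - a)      # derivative of the sigmoid
    dz_dw = x                  # derivative of the linear step
    return dloss_da * da_dz * dz_dw


def grad_numeric(w, x, t, eps=1e-6):
    """Central finite-difference estimate of dloss/dw, used as a check."""
    loss_plus = forward(w + eps, x, t)[2]
    loss_minus = forward(w - eps, x, t)[2]
    return (loss_plus - loss_minus) / (2.0 * eps)


analytic = grad_chain_rule(0.5, 2.0, 1.0)
numeric = grad_numeric(0.5, 2.0, 1.0)
```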

Q17

In PyTorch, tensor.requires_grad_(True) tells PyTorch to:

A. Track operations on this tensor and build a computational graph so gradients can be computed via .backward()
B. Set the tensor's initial value to a random gradient
C. Prevent the tensor from being updated during the optimiser step
D. Convert the tensor to a GPU tensor for faster computation

Q18

loss.backward() in PyTorch:

A. Computes the forward pass and stores the loss value
B. Applies the computed gradients directly to update the model weights
C. Computes and accumulates the gradient of the loss with respect to all tensors that require grad
D. Clears the gradient buffer to zero before the next update

Q19

optimizer.step() in a PyTorch training loop:

A. Computes the loss and runs the backward pass
B. Updates the model parameters using the gradients computed by .backward()
C. Clears all accumulated gradients from the previous step
D. Evaluates the model on the validation set

Q20

optimizer.zero_grad() must be called each iteration to:

A. Prevent gradients from accumulating across multiple backward passes, which would give incorrect gradient estimates
B. Reset the learning rate to its initial value before each update
C. Re-initialise all model weights before the next forward pass
D. Synchronise gradients across GPUs in a multi-device setup
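Review note: how PyTorch handles `.grad` across repeated backward passes can be observed directly on a single scalar parameter. This is a minimal sketch assuming PyTorch is installed; the tensor `w`, the toy loss w², and the learning rate are illustrative choices only.

```python
import torch

# One scalar parameter tracked by autograd, managed by a plain SGD optimiser.
w = torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# First backward pass on loss = w^2: the gradient is 2w.
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without clearing: the new gradient is added to .grad.
(w ** 2).backward()
accumulated = w.grad.item()

# Clear the stored gradient, then run backward again.
# Depending on the PyTorch version, zero_grad() either sets .grad to None
# or fills it with zeros; in both cases the next backward pass starts fresh.
optimizer.zero_grad()
(w ** 2).backward()
after_reset = w.grad.item()
```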

Q21

Momentum in gradient descent helps by:

A. Increasing the learning rate automatically when the gradient is small
B. Accumulating a velocity from past gradients, smoothing updates and helping the optimiser escape flat regions
C. Randomly zeroing out some gradients to act as regularisation
D. Scaling each gradient by the inverse of its historical variance
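Review note: one common formulation of momentum keeps a running velocity, v ← βv + g, and then updates the weight with w ← w − lr·v. The sketch below applies it to the toy loss f(w) = (w − 3)². The function name `sgd_momentum`, the loss, and the hyperparameter values are assumptions for illustration; other textbooks and libraries scale the terms slightly differently.

```python
def sgd_momentum(grad_fn, w, lr=0.1, beta=0.9, steps=500):
    """Gradient descent with a velocity built up from past gradients."""
    velocity = 0.0
    for _ in range(steps):
        # Blend the previous velocity with the current gradient.
        velocity = beta * velocity + grad_fn(w)
        # Step along the velocity rather than the raw gradient.
        w = w - lr * velocity
    return w


# Gradient of the toy loss f(w) = (w - 3)^2.
w_final = sgd_momentum(lambda w: 2.0 * (w - 3.0), w=0.0)
```

On this loss the iterates overshoot and oscillate around w = 3 before settling, which is typical behaviour for a large `beta`.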

Q22

The difference between a parameter and a hyperparameter is:

A. Parameters are set by the programmer; hyperparameters are learned during training
B. They are synonymous: both refer to learned weight values
C. Parameters (weights, biases) are learned during training; hyperparameters (learning rate, batch size) are set before training
D. Hyperparameters are only used in the output layer

Q23

In the Excel-to-PyTorch analogy taught in Week 4, Excel formulas correspond to:

A. The training dataset: the inputs that are fed to the model
B. The forward pass computations: the mathematical operations applied to inputs to produce outputs
C. The validation metrics reported after each epoch
D. The learning rate schedule used to update weights

Q24

Weight initialisation matters because:

A. Starting all weights at exactly zero causes all neurons in a layer to compute identical gradients and learn the same features (symmetry problem)
B. Large random weights are always better because they give the model more flexibility
C. The learning rate must match the scale of the initial weights precisely
D. PyTorch cannot compute gradients unless weights are initialised to 1.0

Q25

A computational graph in PyTorch records:

A. The sequence of data augmentations applied to each training batch
B. The architecture of the network as a JSON structure for serialisation
C. The operations performed on tensors so that gradients can be computed automatically via backpropagation
D. The validation loss at each epoch for later plotting

Answer each question in 2–4 sentences. Precise technical language is expected. Code snippets are welcome where relevant.

Q26

Explain gradient descent in your own words. Why do we need iterative optimisation rather than solving directly for the optimal weights?

Q27

Describe the forward pass and backward pass of a neural network. What is computed in each, and how do they work together during training?

Q28

What is the vanishing gradient problem, and why does ReLU help address it compared to sigmoid or tanh?

Q29

Write out the steps of a single training iteration (one batch) in PyTorch and explain what each line does: forward pass, loss computation, zero_grad, backward, step.

Q30

Explain the bias-variance tradeoff in the context of neural network depth. How does adding more layers affect a model's bias and variance?

Complete all 30 questions, then click Submit. Your MCQ score (out of 25) will be shown. Short answers are marked separately.

Multiple Choice (25 marks)

Short Answer (5 marks — marked by lecturer)