Quiz: Convolutional Neural Networks¶
Test your understanding of convolutional neural networks with these questions.
1. What is the primary advantage of using convolutional layers over fully connected layers for image processing?¶
- Convolutional layers require more parameters, leading to better accuracy
- Convolutional layers preserve spatial structure through local connectivity and weight sharing
- Convolutional layers can only process grayscale images
- Convolutional layers eliminate the need for activation functions
Show Answer
The correct answer is B. Convolutional layers preserve spatial structure by connecting neurons only to small local regions (local connectivity) and using the same weights across the entire image (weight sharing). This dramatically reduces parameters compared to fully connected layers while maintaining spatial relationships between pixels. For example, a fully connected layer mapping a 224×224×3 RGB image to 1,000 neurons requires over 150 million weights, while a convolutional layer with 64 filters of size 3×3 operating on the same 3-channel input requires only 1,728 weights.
Concept Tested: Convolutional Neural Network, Local Connectivity, Weight Sharing
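To double-check this arithmetic, here is a minimal Python sketch (plain arithmetic only; the 224×224×3 input and 1,000-unit layer are just the example figures from the explanation above):

```python
# Weight counts for the example above: 224x224x3 RGB input.
h, w, c = 224, 224, 3
fc_units = 1000                      # fully connected layer with 1,000 neurons
fc_weights = h * w * c * fc_units
print(f"Fully connected weights: {fc_weights:,}")    # 150,528,000

num_filters, k = 64, 3               # 64 convolutional filters of size 3x3
conv_weights = num_filters * k * k * c
print(f"Convolutional weights:   {conv_weights:,}")  # 1,728
```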
2. Given a 7×7 input image and a 3×3 filter with stride 2 and valid padding, what is the output size?¶
- 5×5
- 3×3
- 7×7
- 2×2
Show Answer
The correct answer is B. Using the output size formula: \(\lfloor \frac{n - k}{s} + 1 \rfloor\) where n=7 (input size), k=3 (filter size), and s=2 (stride), we get: \(\lfloor \frac{7 - 3}{2} + 1 \rfloor = \lfloor \frac{4}{2} + 1 \rfloor = \lfloor 2 + 1 \rfloor = 3\). Therefore, the output is 3×3.
Concept Tested: Convolution Operation, Stride, Valid Padding
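The same computation can be written as a small helper function; here is a minimal Python sketch (the general formula with padding p reduces to the one above when p = 0):

```python
from math import floor

def conv_output_size(n: int, k: int, s: int, p: int = 0) -> int:
    """Spatial output size for an n x n input, k x k filter, stride s, padding p."""
    return floor((n + 2 * p - k) / s) + 1

print(conv_output_size(n=7, k=3, s=2))  # 3 -> the output is 3x3
```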
3. What type of padding should you use to maintain the same spatial dimensions through multiple convolutional layers when using stride 1?¶
- Valid padding
- Same padding
- Zero padding
- Reflective padding
Show Answer
The correct answer is B. Same padding adds enough zeros around the border so that the output has the same spatial dimensions as the input when using stride 1. For a k×k filter with odd k, same padding requires p = (k-1)/2 pixels of padding on each side. For example, a 3×3 filter needs 1 pixel of padding, and a 5×5 filter needs 2 pixels. Valid padding provides no padding and shrinks the output. "Zero padding" describes which values are used to pad (zeros, which is how same padding is usually implemented) rather than how much padding to add, and reflective padding is an alternative border-filling mode rather than a sizing strategy, so neither directly answers the question.
Concept Tested: Same Padding, Padding
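As a quick sanity check, here is a minimal Python sketch of the p = (k-1)/2 rule for odd filter sizes:

```python
def same_padding(k: int) -> int:
    """Per-side padding that preserves spatial size for an odd k x k filter at stride 1."""
    assert k % 2 == 1, "p = (k - 1) // 2 assumes an odd filter size"
    return (k - 1) // 2

for k in (3, 5, 7):
    print(f"{k}x{k} filter -> {same_padding(k)} pixel(s) of padding per side")
```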
4. In a CNN, what does the receptive field of a neuron in layer 3 represent?¶
- The number of filters in that layer
- The size of the feature map at that layer
- The region of the input image that affects that neuron's activation
- The stride used in that layer
Show Answer
The correct answer is C. The receptive field is the region of the input image that influences a particular neuron's activation. As you go deeper in the network, receptive fields grow larger. For example, a layer 1 neuron with a 3×3 filter has a 3×3 receptive field, but a layer 2 neuron receiving from a 3×3 region of layer 1 would have a 5×5 receptive field in the input, and a layer 3 neuron would have a 7×7 receptive field. This allows deep CNNs to capture increasingly global context while maintaining computational efficiency.
Concept Tested: Receptive Field, Spatial Hierarchies
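The layer-by-layer growth described above can be computed with a short Python sketch (stride-1, dilation-free convolutions are assumed, matching the example):

```python
def receptive_field(kernel_sizes, strides):
    """Receptive field (in input pixels) of a neuron after a stack of conv layers."""
    rf, jump = 1, 1
    for k, s in zip(kernel_sizes, strides):
        rf += (k - 1) * jump   # each layer widens the field by (k-1) * cumulative stride
        jump *= s
    return rf

# Three 3x3 convolutions with stride 1, as in the example above.
print(receptive_field([3, 3, 3], [1, 1, 1]))  # 7 -> a 7x7 receptive field
```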
5. What is the primary purpose of max pooling layers in CNNs?¶
- To increase the spatial dimensions of feature maps
- To provide translation invariance and reduce computational complexity
- To add more learnable parameters to the network
- To replace activation functions like ReLU
Show Answer
The correct answer is B. Max pooling downsamples feature maps by taking the maximum value in local regions, typically using 2×2 windows with stride 2 (which halves spatial dimensions). This provides two critical benefits: translation invariance (small shifts in the input don't significantly change the output) and computational efficiency (reduces dimensions for downstream layers). Max pooling does not add learnable parameters—it's a fixed operation—and it complements rather than replaces activation functions.
Concept Tested: Max Pooling, Translation Invariance
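Here is a minimal NumPy sketch of 2×2 max pooling with stride 2 on a 4×4 feature map (NumPy is an assumption; any array library would do):

```python
import numpy as np

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 2],
              [7, 2, 9, 1],
              [3, 4, 5, 8]], dtype=float)

# 2x2 max pooling, stride 2: take the maximum of each non-overlapping 2x2 block.
pooled = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)
# [[6. 4.]
#  [7. 9.]]
```

Note that the operation has no learnable parameters, consistent with the explanation above.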
6. You're designing a CNN for 32×32 RGB images. Your first convolutional layer uses 64 filters of size 5×5 with valid padding and stride 1. How many parameters does this layer have (including biases)?¶
- 1,600
- 4,800
- 4,864
- 65,536
Show Answer
The correct answer is C. Each filter has 5×5×3 = 75 weights (5×5 spatial dimensions × 3 input channels for RGB) plus 1 bias term = 76 parameters per filter. With 64 filters, the total is 64 × 76 = 4,864 parameters. This demonstrates the parameter efficiency of CNNs: despite processing the entire 32×32×3 = 3,072-dimensional input, we only need 4,864 parameters due to weight sharing and local connectivity.
Concept Tested: Filter, Weight Sharing, Local Connectivity
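A minimal sketch, assuming PyTorch is available (the quiz does not prescribe a framework), that confirms the count of 4,864 parameters:

```python
import torch.nn as nn

# First layer from the question: 3 input channels (RGB), 64 filters of size 5x5.
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=5, stride=1, padding=0)

n_params = sum(p.numel() for p in conv.parameters())
print(n_params)  # 4864 = 64 * (5 * 5 * 3 + 1)
```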
7. Which architectural innovation was introduced by ResNet to enable training of very deep networks (100+ layers)?¶
- Using only 3×3 filters
- Skip connections (residual connections)
- Inception modules with parallel paths
- Aggressive data augmentation
Show Answer
The correct answer is B. ResNet introduced skip connections (also called residual connections) that add the input x directly to the output: H(x) = F(x) + x. This allows gradients to flow directly through the network via the skip path, alleviating vanishing gradients and making it easier to optimize very deep networks. ResNet successfully trained networks 152 layers deep, and a ResNet ensemble achieved 3.57% top-5 error on ImageNet in 2015, surpassing estimated human-level performance (~5% error). VGG introduced using only 3×3 filters, Inception introduced parallel paths, and AlexNet emphasized data augmentation.
Concept Tested: ResNet, CNN Architecture
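A minimal PyTorch sketch of the H(x) = F(x) + x idea (real ResNet blocks also include batch normalization, omitted here for brevity; PyTorch is an assumption):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: output = F(x) + x, with F two 3x3 convolutions."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # skip connection: the input is added back unchanged

x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 32, 32])
```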
8. In the context of CNNs, what hierarchical pattern of features typically emerges across layers?¶
- Complex objects → object parts → edges
- Edges → textures/parts → complete objects
- All layers learn similar edge detectors
- Random features with no hierarchical structure
Show Answer
The correct answer is B. CNNs learn spatial hierarchies of features where early layers detect simple patterns like edges, colors, and textures; middle layers combine these into object parts like eyes, wheels, or corners; and deep layers recognize complete objects, faces, or scenes. This hierarchical organization mirrors biological vision systems and emerges naturally from the training process without explicit programming. Visualization studies consistently show this progression from simple to complex features across CNN depth.
Concept Tested: Spatial Hierarchies, CNN Architecture
9. Your CNN is overfitting on a small image dataset. Which data augmentation technique would be LEAST appropriate for natural images?¶
- Random horizontal flips
- Random vertical flips
- Random crops with padding
- Color jittering (brightness/contrast adjustment)
Show Answer
The correct answer is B. Random vertical flips are generally inappropriate for natural images because most objects have a consistent orientation (e.g., animals, vehicles, buildings). Flipping an image vertically creates unrealistic training examples that don't match real-world test data. Horizontal flips are appropriate because objects can appear on either side of an image. Random crops with padding and color jittering also preserve the semantic content while providing useful variation to reduce overfitting.
Concept Tested: Data Augmentation
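For illustration, a minimal augmentation pipeline along these lines, assuming torchvision is available (the specific parameter values are arbitrary examples, not recommendations):

```python
from torchvision import transforms

# Augmentations that are generally safe for natural images (no vertical flip).
train_transforms = transforms.Compose([
    transforms.RandomCrop(32, padding=4),                   # random crops with padding
    transforms.RandomHorizontalFlip(p=0.5),                 # left/right flips
    transforms.ColorJitter(brightness=0.2, contrast=0.2),   # color jittering
    transforms.ToTensor(),
])
```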
10. Which CNN architecture introduced the concept of multi-scale feature extraction through parallel convolutional paths with different filter sizes in the same layer?¶
- LeNet
- AlexNet
- VGG
- Inception (GoogLeNet)
Show Answer
The correct answer is D. Inception (also known as GoogLeNet) introduced the Inception module, which applies parallel convolutional paths with different filter sizes (1×1, 3×3, 5×5) and max pooling simultaneously, then concatenates the outputs. This captures patterns at multiple scales in a single layer. The architecture also uses 1×1 "bottleneck" convolutions to reduce dimensionality before expensive operations, achieving computational efficiency. Despite having roughly 5 million parameters (vs. ~60M for AlexNet), Inception won ImageNet 2014 with 6.7% top-5 error.
Concept Tested: Inception, CNN Architecture
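A simplified PyTorch sketch of the parallel-path idea (the 1×1 bottleneck convolutions of the real Inception module are omitted and the branch widths are arbitrary; PyTorch is an assumption):

```python
import torch
import torch.nn as nn

class MiniInception(nn.Module):
    """Simplified Inception-style module: parallel 1x1, 3x3, and 5x5 convolutions plus
    max pooling, concatenated along the channel dimension."""
    def __init__(self, in_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 16, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 16, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, 16, kernel_size=5, padding=2)
        self.pool = nn.MaxPool2d(kernel_size=3, stride=1, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat(
            [self.branch1(x), self.branch3(x), self.branch5(x), self.pool(x)], dim=1
        )

x = torch.randn(1, 32, 28, 28)
print(MiniInception(32)(x).shape)  # torch.Size([1, 80, 28, 28]) -> 16+16+16+32 channels
```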