Neural Networks Fundamentals¶
Summary¶
This chapter introduces neural networks, the foundation of modern deep learning. Students will learn about artificial neurons and the perceptron model, explore activation functions (ReLU, tanh, sigmoid, Leaky ReLU) and their properties, and understand the architecture of multilayer networks with input, hidden, and output layers. The chapter covers forward propagation for making predictions and backpropagation for computing gradients, introduces gradient descent and its variants (stochastic, mini-batch), and addresses loss functions (mean squared error, cross-entropy), weight initialization strategies (Xavier, He), and challenges such as vanishing and exploding gradients. Students will also learn about advanced concepts including the universal approximation theorem, network architectures, and deep learning fundamentals.
Concepts Covered¶
This chapter covers the following 38 concepts from the learning graph:
- Neural Network
- Artificial Neuron
- Perceptron
- Activation Function
- ReLU
- Tanh
- Leaky ReLU
- Weights
- Bias
- Forward Propagation
- Backpropagation
- Gradient Descent
- Stochastic Gradient Descent
- Mini-Batch Gradient Descent
- Learning Rate
- Mean Squared Error
- Epoch
- Batch Size
- Vanishing Gradient
- Exploding Gradient
- Weight Initialization
- Xavier Initialization
- He Initialization
- Fully Connected Layer
- Hidden Layer
- Output Layer
- Input Layer
- Network Architecture
- Deep Learning
- Multilayer Perceptron
- Universal Approximation
- Pooling Layer
- Freezing Layers
- Learning Rate Scheduling
- Bias-Variance Tradeoff
- Batch Processing
- Dropout
- Early Stopping
Prerequisites¶
This chapter builds on concepts from:
- Chapter 1: Introduction to Machine Learning Fundamentals
- Chapter 3: Decision Trees and Tree-Based Learning
- Chapter 5: Regularization Techniques
Introduction: Inspired by the Brain¶
Neural networks are computational models inspired by the biological neural networks in animal brains. While greatly simplified compared to actual neurons, artificial neural networks have proven remarkably effective at learning complex patterns from data, powering modern advances in computer vision, natural language processing, speech recognition, and game playing.
Unlike traditional algorithms with explicit rules, neural networks learn from examples. Show a neural network thousands of images labeled "cat" or "dog," and it learns to distinguish between them—not through programmed rules about whiskers or ears, but by discovering patterns in the pixel data itself.
This chapter builds neural networks from the ground up, starting with a single artificial neuron and progressing to deep multilayer architectures capable of solving complex real-world problems.
The Artificial Neuron¶
An artificial neuron (or simply "neuron") is the fundamental building block of neural networks. It receives inputs, combines them with learned weights, adds a bias, and applies an activation function to produce an output.
Mathematical Model¶
For a neuron with \(n\) inputs \(x_1, x_2, \ldots, x_n\):
- Weighted sum: Compute \(z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b\)
- Activation: Apply activation function \(a = f(z)\)
where:

- Weights \(w_1, \ldots, w_n\) scale the importance of each input
- Bias \(b\) shifts the activation threshold
- Activation function \(f\) introduces nonlinearity
In vector notation:

\[ z = \mathbf{w}^T \mathbf{x} + b, \qquad a = f(z) \]

The neuron learns by adjusting weights \(\mathbf{w}\) and bias \(b\) during training.
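As a concrete illustration, here is a minimal NumPy sketch of a single neuron's forward computation; the input, weight, and bias values are arbitrary, and ReLU stands in for \(f\):

import numpy as np

x = np.array([0.5, -1.0, 2.0])   # inputs x1..x3
w = np.array([0.8, 0.2, -0.5])   # learned weights w1..w3
b = 0.1                          # bias

z = w @ x + b                    # weighted sum: w1*x1 + w2*x2 + w3*x3 + b
a = max(0.0, z)                  # activation, here f = ReLU
print(z, a)                      # -0.7, 0.0 (ReLU zeroes the negative pre-activation)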
The Perceptron¶
The perceptron, introduced by Frank Rosenblatt in 1958, is the simplest neural network model. It uses a step activation function:

\[ f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases} \]

where \(z = \mathbf{w}^T \mathbf{x} + b\) as before.
For linearly separable binary classification problems, the perceptron learning algorithm is guaranteed to converge. However, perceptrons cannot solve non-linearly separable problems (like XOR), which motivated the development of multilayer networks.
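To make the learning rule concrete, here is a minimal NumPy sketch of perceptron training on the (linearly separable) AND function; the function name train_perceptron and the learning-rate default are illustrative choices, not a standard API:

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    # X: (n_samples, n_features), y: labels in {0, 1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0  # step activation
            error = yi - y_hat
            w += lr * error * xi                 # perceptron update rule
            b += lr * error
    return w, b

# AND gate: linearly separable, so the algorithm converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)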
Biological Inspiration¶
Real biological neurons:

- Receive signals through dendrites
- Integrate signals in the cell body
- Fire an electrical spike down the axon if a threshold is exceeded
- Transmit signals to other neurons via synapses
Artificial neurons capture this essence: weighted inputs (synapses), summation (cell body integration), and activation (neuron firing).
Activation Functions¶
Activation functions introduce nonlinearity into neural networks. Without nonlinearity, stacking multiple layers would be mathematically equivalent to a single layer—the network couldn't learn complex patterns.
Sigmoid¶
The sigmoid function was historically popular for its smooth, S-shaped curve:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

Properties:

- Output range: (0, 1)
- Smooth and differentiable
- Derivative: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\)
- Interpretable as a probability

Drawbacks:

- Vanishing gradients: for large \(|z|\), the gradient approaches zero, slowing learning
- Not zero-centered: outputs are always positive, causing zig-zagging in gradient descent
- Expensive computation: requires evaluating an exponential
Hyperbolic Tangent (Tanh)¶
Tanh is a scaled, shifted sigmoid:

\[ \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1 \]

Properties:

- Output range: (-1, 1)
- Zero-centered (better than sigmoid)
- Derivative: \(\tanh'(z) = 1 - \tanh^2(z)\)

Drawbacks:

- Still suffers from vanishing gradients
- Still computationally expensive
Rectified Linear Unit (ReLU)¶
ReLU has become the default activation function for hidden layers:

\[ \text{ReLU}(z) = \max(0, z) \]

Properties:

- Derivative: \(\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}\)
- Computationally cheap (a simple threshold)
- Does not saturate for positive values
- Sparse activations (many neurons output zero)

Advantages:

- Alleviates the vanishing gradient problem
- Accelerates convergence (up to 6x faster than sigmoid/tanh in some studies)
- Promotes sparse representations

Drawbacks:

- Dying ReLU problem: a neuron whose pre-activation stays negative always outputs zero, receives zero gradient, and stops learning
- Not zero-centered
- Not differentiable at \(z = 0\) (though using a subgradient works in practice)
Leaky ReLU¶
Leaky ReLU addresses the dying ReLU problem by allowing small negative values:

\[ \text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases} \]

where \(\alpha\) is a small constant (typically 0.01).

Properties:

- Derivative: \(\text{LeakyReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \leq 0 \end{cases}\)
- Prevents dying neurons
- Still computationally cheap

Variants:

- Parametric ReLU (PReLU): learn \(\alpha\) during training
- Exponential Linear Unit (ELU): smooth curve for negative values
Choosing Activation Functions¶
General guidelines:

- Hidden layers: ReLU or Leaky ReLU (default choice)
- Output layer (regression): linear (no activation)
- Output layer (binary classification): sigmoid
- Output layer (multiclass classification): softmax
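The sketch below implements these activation functions and their derivatives in NumPy so the formulas above can be checked numerically; it is a minimal illustration, not a production implementation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # sigma'(z) = sigma(z)(1 - sigma(z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2      # tanh'(z) = 1 - tanh^2(z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z).round(3))            # [0.119 0.5 0.881], squashed into (0, 1)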
Network Architecture¶
A neural network consists of layers of interconnected neurons. The network architecture defines how many layers exist, how many neurons are in each layer, and how they connect.
Layer Types¶
Input Layer: The input layer receives raw features. It has one neuron per feature dimension and performs no computation—it simply passes values to the next layer.
Hidden Layers: Hidden layers perform intermediate transformations. A network can have zero, one, or many hidden layers. Each neuron in a hidden layer connects to all neurons in the previous layer (in a fully connected layer) and applies:

\[ a_j^{(l)} = f\left( \sum_{i} w_{ji}^{(l)} a_i^{(l-1)} + b_j^{(l)} \right) \]

where:

- \(a_j^{(l)}\) is the activation of neuron \(j\) in layer \(l\)
- \(w_{ji}^{(l)}\) is the weight from neuron \(i\) in layer \(l-1\) to neuron \(j\) in layer \(l\)
- \(b_j^{(l)}\) is the bias for neuron \(j\) in layer \(l\)
- \(f\) is the activation function
Output Layer: The output layer produces final predictions. For regression, it typically has one neuron with linear activation. For \(K\)-class classification, it has \(K\) neurons with softmax activation.
Multilayer Perceptron (MLP)¶
A multilayer perceptron (MLP) is a feedforward neural network with one or more hidden layers. Despite the name, MLPs typically use nonlinear activations (not the perceptron's step function).
Example architecture:

- Input layer: 4 neurons (4 features)
- Hidden layer 1: 20 neurons (ReLU activation)
- Hidden layer 2: 30 neurons (ReLU activation)
- Hidden layer 3: 25 neurons (ReLU activation)
- Output layer: 3 neurons (softmax activation for 3-class classification)
This is a 4-20-30-25-3 architecture with 3 hidden layers.
Deep Learning¶
Deep learning refers to neural networks with multiple hidden layers (typically more than 2). Deep networks can learn hierarchical representations:

- Lower layers learn simple features (edges, textures)
- Middle layers combine features into parts (eyes, wheels)
- Upper layers recognize high-level concepts (faces, cars)
The depth allows learning complex, compositional patterns that shallow networks struggle with.
Forward Propagation¶
Forward propagation is the process of computing the network's output given an input. Activations flow forward from input through hidden layers to output.
Algorithm¶
For an \(L\)-layer network with input \(\mathbf{x}\):

1. Input layer (\(l = 0\)): \(\mathbf{a}^{(0)} = \mathbf{x}\)
2. Hidden and output layers (for \(l = 1, \ldots, L\)): \(\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\), then \(\mathbf{a}^{(l)} = f^{(l)}(\mathbf{z}^{(l)})\)
3. Output: \(\hat{\mathbf{y}} = \mathbf{a}^{(L)}\)

where \(\mathbf{W}^{(l)}\) is the weight matrix for layer \(l\) and \(f^{(l)}\) is the activation function.
Example Computation¶
For a simple 2-3-1 network (2 inputs, 3 hidden neurons, 1 output):
Input: \(\mathbf{x} = [x_1, x_2]^T\)
Hidden layer:

\[ z_1^{(1)} = w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + b_1^{(1)} \]
\[ z_2^{(1)} = w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + b_2^{(1)} \]
\[ z_3^{(1)} = w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + b_3^{(1)} \]

with activations \(a_j^{(1)} = f(z_j^{(1)})\) for \(j = 1, 2, 3\).

Output layer:

\[ z^{(2)} = w_1^{(2)} a_1^{(1)} + w_2^{(2)} a_2^{(1)} + w_3^{(2)} a_3^{(1)} + b^{(2)} \]
\[ \hat{y} = z^{(2)} \quad \text{(linear activation for regression)} \]
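As a sketch, the following NumPy code runs this 2-3-1 forward pass end to end; the weight values are randomly generated for illustration, and ReLU is assumed for the hidden layer:

import numpy as np

rng = np.random.default_rng(0)

# 2-3-1 network: weight matrices are (out, in), biases are vectors
W1 = rng.normal(size=(3, 2))   # hidden layer weights
b1 = np.zeros(3)
W2 = rng.normal(size=(1, 3))   # output layer weights
b2 = np.zeros(1)

x = np.array([1.0, 2.0])       # input

z1 = W1 @ x + b1               # hidden pre-activations
a1 = np.maximum(0.0, z1)       # ReLU
z2 = W2 @ a1 + b2              # output pre-activation
y_hat = z2                     # linear activation for regression
print(y_hat)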
Loss Functions¶
Loss functions quantify how well the network's predictions match the true labels. Training minimizes the loss.
Mean Squared Error (MSE)¶
For regression, mean squared error is commonly used:

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

MSE penalizes large errors heavily due to the squaring.
Cross-Entropy Loss¶
For classification, cross-entropy loss (also called log-loss) measures the difference between predicted and true probability distributions.
Binary cross-entropy (for 2 classes):

\[ \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] \]

Categorical cross-entropy (for \(K\) classes):

\[ \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik} \]
where \(y_{ik} = 1\) if sample \(i\) belongs to class \(k\), otherwise 0 (one-hot encoding).
Cross-entropy loss combined with softmax output forms a numerically stable, theoretically motivated framework for classification.
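For concreteness, here is a minimal NumPy sketch of both losses; clipping the predictions away from 0 and 1 is a common numerical-stability trick (the epsilon value is an arbitrary choice):

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error for regression
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])
print(binary_cross_entropy(y_true, y_pred))    # ~0.266; confident predictions cost little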
Backpropagation¶
Backpropagation (short for "backward propagation of errors") computes gradients of the loss with respect to all weights and biases. These gradients guide parameter updates during training.
The Chain Rule¶
Backpropagation applies the chain rule from calculus to efficiently compute gradients layer by layer, moving backward from output to input.
For a simple network with loss \(L\), output \(\hat{y}\), and intermediate value \(z\):

\[ \frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \]

Chaining such factors layer by layer lets the gradient of the loss reach every weight in the network.
Backpropagation Algorithm¶
Starting from the output layer and moving backward:

1. Output layer gradient: \(\delta^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot f'^{(L)}(\mathbf{z}^{(L)})\)
2. Hidden layer gradients (for \(l = L-1, \ldots, 1\)): \(\delta^{(l)} = \left[ (\mathbf{W}^{(l+1)})^T \delta^{(l+1)} \right] \odot f'^{(l)}(\mathbf{z}^{(l)})\)
3. Weight and bias gradients: \(\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{a}^{(l-1)})^T\) and \(\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}\)

where \(\odot\) denotes element-wise multiplication.
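The following sketch implements one backpropagation step for the 2-3-1 regression network from the forward-propagation example, using MSE loss and ReLU; variable names mirror the equations above, and the setup is illustrative:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([1.0, 2.0]), np.array([0.5])

# Forward pass (cache z1 and a1 for the backward pass)
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)              # ReLU
y_hat = W2 @ a1 + b2                  # linear output

# Backward pass
delta2 = 2 * (y_hat - y)              # output gradient: MSE loss, linear activation
dW2 = np.outer(delta2, a1)            # delta^(2) (a^(1))^T
db2 = delta2
delta1 = (W2.T @ delta2) * (z1 > 0)   # propagate back through ReLU (derivative is 0/1)
dW1 = np.outer(delta1, x)
db1 = delta1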
Why Backpropagation Matters¶
Before backpropagation, training neural networks required numerical gradient estimation (finite differences), which was computationally prohibitive for large networks. Backpropagation enables efficient gradient computation, making deep learning practical.
Gradient Descent¶
Gradient descent is the optimization algorithm that updates weights to minimize the loss. The update rule is:

\[ \mathbf{w} \leftarrow \mathbf{w} - \eta \, \frac{\partial L}{\partial \mathbf{w}} \]

where \(\eta\) is the learning rate, controlling the step size.
Batch Gradient Descent¶
Standard gradient descent computes gradients using the entire training set:
- Forward propagation: Compute predictions for all samples
- Compute loss: Average loss over all samples
- Backpropagation: Compute gradients averaging over all samples
- Update weights: Apply gradient descent update
This is stable but slow for large datasets.
Stochastic Gradient Descent (SGD)¶
Stochastic gradient descent updates weights after each individual sample:
- Randomly shuffle the training data
- For each sample:
    - Forward propagation
    - Compute the loss for this sample
    - Backpropagation
    - Update the weights
Advantages:

- Much faster per update
- Can escape local minima due to noise
- Enables online learning (updating as new data arrives)

Disadvantages:

- Noisy gradients cause erratic convergence
- Requires careful learning rate tuning
Mini-Batch Gradient Descent¶
Mini-batch gradient descent strikes a balance by updating on small batches of samples:
- Batch size (e.g., 32, 64, 128): the number of samples per update
- For each mini-batch:
    - Forward propagation on the batch
    - Compute the average loss over the batch
    - Backpropagation
    - Update the weights
Advantages:

- More stable than SGD, faster than full batch
- Efficient matrix operations (GPUs excel at batch processing)
- Reduces gradient variance while maintaining speed
Batch processing enables efficient use of modern hardware accelerators.
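A minimal training-loop skeleton makes the epoch/batch structure explicit; compute_gradients here stands in for a full forward-plus-backward pass and is an assumed callback, not a library function:

import numpy as np

def minibatch_sgd(X, y, params, compute_gradients, lr=0.01, batch_size=32, epochs=10):
    n = len(X)
    for epoch in range(epochs):                      # one epoch = one full pass over the data
        perm = np.random.permutation(n)              # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grads = compute_gradients(X[idx], y[idx], params)
            for key in params:                       # gradient descent update per parameter
                params[key] -= lr * grads[key]
    return params

Setting batch_size to 1 recovers SGD; setting it to n recovers full-batch gradient descent.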
Learning Rate¶
The learning rate \(\eta\) critically affects training:
- Too small: Slow convergence, may get stuck
- Too large: Oscillation, divergence, missing minimum
- Just right: Fast, stable convergence
Learning rate scheduling adaptively adjusts \(\eta\) during training:

- Step decay: reduce \(\eta\) by a factor (e.g., ×0.1) every \(N\) epochs
- Exponential decay: \(\eta(t) = \eta_0 e^{-kt}\)
- 1/t decay: \(\eta(t) = \eta_0 / (1 + kt)\)
- Adaptive methods (Adam, RMSprop): automatically adjust per-parameter learning rates
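These schedules are one-liners; the sketch below computes all three so their decay behavior can be compared, with \(\eta_0\), \(k\), and the step factor chosen arbitrarily:

import numpy as np

eta0, k = 0.1, 0.05

def step_decay(t, drop=0.1, every=20):
    return eta0 * drop ** (t // every)   # multiply by `drop` every `every` epochs

def exp_decay(t):
    return eta0 * np.exp(-k * t)         # eta(t) = eta0 * e^(-kt)

def inv_decay(t):
    return eta0 / (1 + k * t)            # eta(t) = eta0 / (1 + kt)

for t in [0, 10, 50, 100]:
    print(t, step_decay(t), round(exp_decay(t), 4), round(inv_decay(t), 4))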
Epochs¶
An epoch is one complete pass through the entire training dataset. Training typically runs for many epochs (10s to 1000s), with the network gradually improving as it sees data repeatedly.
Weight Initialization¶
Weight initialization significantly affects training dynamics. Poor initialization can prevent learning entirely.
Why Initialization Matters¶
- All zeros: Neurons in a layer behave identically (symmetry problem)
- Too large: Activations explode, gradients explode
- Too small: Activations vanish, gradients vanish
Xavier (Glorot) Initialization¶
Xavier initialization keeps the variance of activations and gradients stable across layers. For a layer with \(n_{in}\) inputs and \(n_{out}\) outputs, weights are drawn as:

\[ W \sim \mathcal{N}\left(0, \; \frac{2}{n_{in} + n_{out}}\right) \]

or, in the uniform variant:

\[ W \sim \mathcal{U}\left[-\sqrt{\frac{6}{n_{in} + n_{out}}}, \; \sqrt{\frac{6}{n_{in} + n_{out}}}\right] \]
Best for: Tanh and sigmoid activations
He Initialization¶
He initialization accounts for ReLU's characteristics (zeroing out negative values halves the activation variance):

\[ W \sim \mathcal{N}\left(0, \; \frac{2}{n_{in}}\right) \]
Best for: ReLU and Leaky ReLU activations
Proper initialization is crucial for training deep networks.
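A minimal NumPy sketch of both schemes, handy for the initialization exercise at the end of the chapter; the layer sizes are arbitrary:

import numpy as np

rng = np.random.default_rng(42)

def xavier_init(n_in, n_out):
    # Normal Xavier/Glorot: variance 2 / (n_in + n_out)
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out):
    # He: variance 2 / n_in, suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W = he_init(256, 128)
print(W.std())   # should be close to sqrt(2/256) ≈ 0.088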
Training Challenges¶
Vanishing Gradients¶
The vanishing gradient problem occurs when gradients become extremely small as they propagate backward through many layers. This causes early layers to learn very slowly or not at all.
Causes:

- Sigmoid/tanh activations saturate (gradients ≈ 0)
- Deep networks multiply many small gradients together

Solutions:

- Use ReLU activations
- Skip connections (ResNets)
- Batch normalization
- Proper weight initialization
Exploding Gradients¶
The exploding gradient problem is the opposite: gradients become extremely large, causing numerical instability and divergence.
Causes:

- Poor weight initialization
- Deep networks multiply many large gradients together

Solutions:

- Gradient clipping (cap the gradient magnitude)
- Proper weight initialization
- Batch normalization
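Gradient clipping by global norm is straightforward to implement; this NumPy sketch rescales a list of gradient arrays when their combined norm exceeds a threshold (the threshold value is illustrative):

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # grads: list of gradient arrays for all parameters
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm      # shrink all gradients proportionally
        grads = [g * scale for g in grads]
    return grads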
Regularization Techniques¶
Dropout¶
Dropout randomly sets a fraction of neuron activations to zero during training. For example, with dropout rate 0.5, each neuron has a 50% chance of being "dropped."
Effect:

- Prevents co-adaptation (neurons relying on specific other neurons)
- Acts like training an ensemble of networks
- Significantly reduces overfitting
Implementation (a minimal NumPy sketch of inverted dropout):

import numpy as np

def apply_dropout(activations, keep_prob=0.5, training=True):
    # During training: randomly zero activations and rescale the rest
    if training:
        mask = np.random.binomial(1, keep_prob, size=activations.shape)
        return activations * mask / keep_prob  # scale to maintain expected value
    # During inference: use all activations (no dropout)
    return activations
Dropout is typically applied to fully connected layers, not convolutional layers.
Early Stopping¶
Early stopping monitors validation loss during training and stops when validation performance stops improving. This prevents overfitting by avoiding overtraining.
Algorithm:

1. Train the network and evaluate on the validation set after each epoch
2. Track the best validation loss seen so far
3. If validation loss doesn't improve for \(N\) consecutive epochs (the patience), stop training
4. Return the weights from the epoch with the best validation loss
Early stopping is a simple, effective regularization technique that requires no hyperparameter tuning beyond patience.
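A skeleton of that loop is shown below; train_one_epoch and validation_loss are hypothetical helpers, and the model is assumed to expose get_weights/set_weights. Only the stopping logic itself is the point here:

def train_with_early_stopping(model, patience=10, max_epochs=1000):
    best_loss, best_weights, wait = float('inf'), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical training helper
        loss = validation_loss(model)          # hypothetical validation helper
        if loss < best_loss:
            best_loss, best_weights, wait = loss, model.get_weights(), 0
        else:
            wait += 1
            if wait >= patience:               # no improvement for `patience` epochs
                break
    model.set_weights(best_weights)            # restore the best epoch's weights
    return model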
Universal Approximation Theorem¶
The universal approximation theorem states that a neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function to arbitrary accuracy, given enough neurons.
Implications:

- Neural networks are theoretically capable of learning any continuous function
- Shallow networks can represent complex functions but may require exponentially many neurons
- Deep networks learn hierarchical representations more efficiently

Important caveats:

- The theorem guarantees existence, not learnability (training may not find the solution)
- It says nothing about generalization
- It doesn't specify how many neurons are needed
Neural Networks in Practice¶
Building a Neural Network with Scikit-Learn¶
Let's apply MLPClassifier to the Iris dataset:
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
# Load iris dataset
url = "https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/iris.csv"
iris_df = pd.read_csv(url)
# Examine feature correlations
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
correlation_matrix = iris_df[feature_names].corr().round(2)
plt.figure(figsize=(6, 6))
sns.heatmap(data=correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()
The correlation matrix reveals strong positive correlation between petal length and petal width (0.96), suggesting these features carry similar information. Sepal width and length have weak negative correlation.
Training the Network¶
# Prepare data
X = iris_df.loc[:, feature_names].values
y = iris_df.loc[:, 'species'].values
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create neural network with 3 hidden layers
# Architecture: 4 inputs → 20 neurons → 30 neurons → 25 neurons → 3 outputs
mlp = MLPClassifier(hidden_layer_sizes=(20, 30, 25),
max_iter=1000,
activation='relu',
solver='adam',
random_state=42)
# Train
mlp.fit(X_train, y_train)
# Predictions
y_pred = mlp.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.3f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
Evaluating Multiple Runs¶
Neural network training is stochastic, so results vary across runs:
# Run multiple times to assess stability
scores = []
for i in range(20):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=i)
mlp = MLPClassifier(hidden_layer_sizes=(20, 30, 25), max_iter=1000, random_state=i)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
scores.append(accuracy)
print(f"Average accuracy: {np.mean(scores):.3f}")
print(f"Std deviation: {np.std(scores):.3f}")
print(f"Min: {np.min(scores):.3f}, Max: {np.max(scores):.3f}")
This reveals the stability (or variability) of the model across different random initializations and data splits.
Hyperparameter Tuning¶
Key hyperparameters to tune:
from sklearn.model_selection import GridSearchCV
# Define hyperparameter grid
param_grid = {
'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
'activation': ['relu', 'tanh'],
'alpha': [0.0001, 0.001, 0.01], # L2 regularization strength
'learning_rate_init': [0.001, 0.01]
}
# Grid search with cross-validation
grid_search = GridSearchCV(MLPClassifier(max_iter=1000), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)
# Evaluate on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
Advanced Architectures¶
Pooling Layers¶
Pooling layers reduce spatial dimensions in convolutional networks by downsampling:
- Max pooling: Take maximum value in each region
- Average pooling: Take average value in each region
Pooling provides translation invariance and reduces computational cost.
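Max pooling is easy to express directly; this NumPy sketch pools a 2D feature map with a 2×2 window and stride 2, assuming the input dimensions are even:

import numpy as np

def max_pool_2x2(fmap):
    # fmap: (H, W) feature map with even H and W
    H, W = fmap.shape
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5.  7.] [13. 15.]]: each output is the max of a 2x2 block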
Freezing Layers¶
Freezing layers prevents weight updates during training. This is useful for:
- Transfer learning: Freeze pretrained layers, train only final layers
- Feature extraction: Use frozen network as feature extractor
- Progressive training: Gradually unfreeze layers
# Conceptual example (PyTorch): freeze the first layer of a small model
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 20), nn.ReLU(), nn.Linear(20, 3))
for param in model[0].parameters():
    param.requires_grad = False  # freeze the first Linear layer; the optimizer skips it
Interactive Visualization: Neural Network Architecture¶
Build and explore different neural network architectures:
Interactive Visualization: Activation Functions¶
Compare different activation functions and their properties:
Summary¶
Neural networks are powerful function approximators built from layers of artificial neurons. Each neuron computes a weighted sum of inputs, adds a bias, and applies an activation function. Activation functions like ReLU, tanh, and sigmoid introduce essential nonlinearity.
Forward propagation computes predictions by passing inputs through successive layers. Backpropagation efficiently computes gradients using the chain rule, enabling gradient descent optimization. Stochastic gradient descent and mini-batch variants balance speed and stability.
Weight initialization (Xavier for tanh/sigmoid, He for ReLU) prevents vanishing and exploding gradients. Regularization techniques like dropout and early stopping combat overfitting. The universal approximation theorem guarantees that neural networks can represent any function, though depth enables more efficient learning.
Modern deep learning frameworks automate much of the complexity, but understanding these fundamentals—neurons, activations, forward/back propagation, gradient descent, and training challenges—provides the foundation for effectively applying and debugging neural networks.
Key Takeaways¶
- Artificial neurons compute weighted sums plus bias, then apply activation functions
- Activation functions introduce nonlinearity; ReLU is the default for hidden layers
- Network architecture defines layers (input, hidden, output) and connections
- Forward propagation computes outputs by passing activations through layers
- Backpropagation efficiently computes gradients using the chain rule
- Gradient descent updates weights to minimize loss; SGD and mini-batch variants balance speed and stability
- Learning rate controls step size; too large causes divergence, too small causes slow convergence
- Weight initialization (Xavier, He) prevents vanishing/exploding gradients
- Dropout and early stopping prevent overfitting
- Vanishing gradients occur with sigmoid/tanh in deep networks; ReLU alleviates this
- Batch size affects gradient variance and computational efficiency
- Universal approximation theorem guarantees representation capacity
Further Reading¶
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Chapters 6-8)
- Nielsen, M. (2015). Neural Networks and Deep Learning (Free online book)
- LeCun, Y., Bengio, Y., & Hinton, G. (2015). "Deep learning." Nature, 521, 436-444
- Rumelhart, D., Hinton, G., & Williams, R. (1986). "Learning representations by back-propagating errors." Nature, 323, 533-536
- Glorot, X., & Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." AISTATS
Exercises¶
1. Manual Forward Propagation: Given a 2-3-1 network with specific weights and biases, manually compute the output for input \([1, 2]^T\) using ReLU activations.

2. Activation Function Analysis: Plot sigmoid, tanh, ReLU, and Leaky ReLU (\(\alpha = 0.01\)) and their derivatives on the same graph. At what input values does each function saturate?

3. Backpropagation by Hand: For a simple 2-2-1 network, compute gradients with respect to all weights and biases for a single training example using MSE loss.

4. Learning Rate Experiment: Train a network on a small dataset with learning rates [0.0001, 0.001, 0.01, 0.1, 1.0]. Plot training loss vs. epoch for each. Which converges fastest? Which diverges?

5. Architecture Comparison: Compare 1-layer (4-50-3), 2-layer (4-25-25-3), and 3-layer (4-20-20-20-3) networks on Iris. Which achieves the best test accuracy? Why might deeper not always be better for small datasets?

6. Dropout Impact: Train identical networks with dropout rates [0, 0.2, 0.5, 0.8]. Plot training vs. validation accuracy. How does dropout affect the train-validation gap?

7. Weight Initialization: Initialize a deep network (10 layers) with all zeros, Xavier, He, and random uniform \([-1, 1]\). Plot activation distributions after a forward pass. Which initializations cause saturation or vanishing activations?