Neural Networks Fundamentals¶
Summary¶
This chapter introduces neural networks, the foundation of modern deep learning. Students will learn about artificial neurons and the perceptron model, explore activation functions (ReLU, tanh, sigmoid, Leaky ReLU) and their properties, and understand the architecture of multilayer networks with input, hidden, and output layers. The chapter covers forward propagation for making predictions and backpropagation for computing gradients, introduces gradient descent and its variants (stochastic, mini-batch), and addresses loss functions (mean squared error, cross-entropy), weight initialization strategies (Xavier, He), and challenges such as vanishing and exploding gradients. Students will also learn about advanced concepts including the universal approximation theorem, network architectures, and deep learning fundamentals.
Concepts Covered¶
This chapter covers the following 38 concepts from the learning graph:
- Neural Network
- Artificial Neuron
- Perceptron
- Activation Function
- ReLU
- Tanh
- Leaky ReLU
- Weights
- Bias
- Forward Propagation
- Backpropagation
- Gradient Descent
- Stochastic Gradient Descent
- Mini-Batch Gradient Descent
- Learning Rate
- Mean Squared Error
- Epoch
- Batch Size
- Vanishing Gradient
- Exploding Gradient
- Weight Initialization
- Xavier Initialization
- He Initialization
- Fully Connected Layer
- Hidden Layer
- Output Layer
- Input Layer
- Network Architecture
- Deep Learning
- Multilayer Perceptron
- Universal Approximation
- Pooling Layer
- Freezing Layers
- Learning Rate Scheduling
- Bias-Variance Tradeoff
- Batch Processing
- Dropout
- Early Stopping
Prerequisites¶
This chapter builds on concepts from:
- Chapter 1: Introduction to Machine Learning Fundamentals
- Chapter 3: Decision Trees and Tree-Based Learning
- Chapter 5: Regularization Techniques
Introduction: Inspired by the Brain¶
Neural networks are computational models inspired by the biological neural networks in animal brains. While greatly simplified compared to actual neurons, artificial neural networks have proven remarkably effective at learning complex patterns from data, powering modern advances in computer vision, natural language processing, speech recognition, and game playing.
Unlike traditional algorithms with explicit rules, neural networks learn from examples. Show a neural network thousands of images labeled "cat" or "dog," and it learns to distinguish between them—not through programmed rules about whiskers or ears, but by discovering patterns in the pixel data itself.
This chapter builds neural networks from the ground up, starting with a single artificial neuron and progressing to deep multilayer architectures capable of solving complex real-world problems.
The Artificial Neuron¶
An artificial neuron (or simply "neuron") is the fundamental building block of neural networks. It receives inputs, combines them with learned weights, adds a bias, and applies an activation function to produce an output.
Mathematical Model¶
For a neuron with \(n\) inputs \(x_1, x_2, \ldots, x_n\):
- Weighted sum: Compute \(z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n + b\)
- Activation: Apply activation function \(a = f(z)\)
where:

- Weights \(w_1, \ldots, w_n\) scale the importance of each input
- Bias \(b\) shifts the activation threshold
- Activation function \(f\) introduces nonlinearity
In vector notation:

\[ z = \mathbf{w}^T \mathbf{x} + b, \qquad a = f(z) \]

The neuron learns by adjusting weights \(\mathbf{w}\) and bias \(b\) during training.
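As a concrete illustration, here is a minimal NumPy sketch of a single neuron's forward computation; the input, weight, and bias values are arbitrary, and ReLU stands in for \(f\):

import numpy as np

x = np.array([0.5, -1.0, 2.0])   # inputs x1..x3
w = np.array([0.8, 0.2, -0.5])   # learned weights w1..w3
b = 0.1                          # bias

z = w @ x + b                    # weighted sum: w1*x1 + w2*x2 + w3*x3 + b
a = max(0.0, z)                  # activation, here f = ReLU
print(z, a)                      # -0.7, 0.0 (ReLU zeroes the negative pre-activation)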
The Perceptron¶
The perceptron, introduced by Frank Rosenblatt in 1958, is the simplest neural network model. It uses a step activation function:

\[ f(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ 0 & \text{otherwise} \end{cases} \]

where \(z = \mathbf{w}^T \mathbf{x} + b\) as before.
For linearly separable binary classification problems, the perceptron learning algorithm is guaranteed to converge. However, perceptrons cannot solve non-linearly separable problems (like XOR), which motivated the development of multilayer networks.
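To make the learning rule concrete, here is a minimal NumPy sketch of perceptron training on the (linearly separable) AND function; the function name train_perceptron and the learning-rate default are illustrative choices, not a standard API:

import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=100):
    # X: (n_samples, n_features), y: labels in {0, 1}
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            y_hat = 1 if xi @ w + b >= 0 else 0  # step activation
            error = yi - y_hat
            w += lr * error * xi                 # perceptron update rule
            b += lr * error
    return w, b

# AND gate: linearly separable, so the algorithm converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(X, y)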
Biological Inspiration¶
Real biological neurons:

- Receive signals through dendrites
- Integrate signals in the cell body
- Fire an electrical spike down the axon if a threshold is exceeded
- Transmit signals to other neurons via synapses
Artificial neurons capture this essence: weighted inputs (synapses), summation (cell body integration), and activation (neuron firing).
Activation Functions¶
Activation functions introduce nonlinearity into neural networks. Without nonlinearity, stacking multiple layers would be mathematically equivalent to a single layer—the network couldn't learn complex patterns.
Sigmoid¶
The sigmoid function was historically popular for its smooth, S-shaped curve:

\[ \sigma(z) = \frac{1}{1 + e^{-z}} \]

Properties:

- Output range: (0, 1)
- Smooth and differentiable
- Derivative: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\)
- Interpretable as a probability

Drawbacks:

- Vanishing gradients: for large \(|z|\), the gradient approaches zero, slowing learning
- Not zero-centered: outputs are always positive, causing zig-zagging in gradient descent
- Expensive computation: requires evaluating an exponential
Hyperbolic Tangent (Tanh)¶
Tanh is a scaled, shifted sigmoid:

\[ \tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}} = 2\sigma(2z) - 1 \]

Properties:

- Output range: (-1, 1)
- Zero-centered (better than sigmoid)
- Derivative: \(\tanh'(z) = 1 - \tanh^2(z)\)

Drawbacks:

- Still suffers from vanishing gradients
- Still computationally expensive
Rectified Linear Unit (ReLU)¶
ReLU has become the default activation function for hidden layers:

\[ \text{ReLU}(z) = \max(0, z) \]

Properties:

- Derivative: \(\text{ReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{if } z \leq 0 \end{cases}\)
- Computationally cheap (a simple threshold)
- Does not saturate for positive values
- Sparse activations (many neurons output zero)

Advantages:

- Alleviates the vanishing gradient problem
- Accelerates convergence (up to 6x faster than sigmoid/tanh in some studies)
- Promotes sparse representations

Drawbacks:

- Dying ReLU problem: a neuron whose pre-activation stays negative always outputs zero, receives zero gradient, and stops learning
- Not zero-centered
- Not differentiable at \(z = 0\) (though using a subgradient works in practice)
Leaky ReLU¶
Leaky ReLU addresses the dying ReLU problem by allowing small negative values:

\[ \text{LeakyReLU}(z) = \begin{cases} z & \text{if } z > 0 \\ \alpha z & \text{if } z \leq 0 \end{cases} \]

where \(\alpha\) is a small constant (typically 0.01).

Properties:

- Derivative: \(\text{LeakyReLU}'(z) = \begin{cases} 1 & \text{if } z > 0 \\ \alpha & \text{if } z \leq 0 \end{cases}\)
- Prevents dying neurons
- Still computationally cheap

Variants:

- Parametric ReLU (PReLU): learn \(\alpha\) during training
- Exponential Linear Unit (ELU): smooth curve for negative values
Choosing Activation Functions¶
General guidelines:

- Hidden layers: ReLU or Leaky ReLU (default choice)
- Output layer (regression): linear (no activation)
- Output layer (binary classification): sigmoid
- Output layer (multiclass classification): softmax
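The sketch below implements these activation functions and their derivatives in NumPy so the formulas above can be checked numerically; it is a minimal illustration, not a production implementation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # sigma'(z) = sigma(z)(1 - sigma(z))

def relu(z):
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    return np.where(z > 0, z, alpha * z)

def tanh_prime(z):
    return 1.0 - np.tanh(z) ** 2      # tanh'(z) = 1 - tanh^2(z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z).round(3))            # [0.119 0.5 0.881], squashed into (0, 1)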
Network Architecture¶
A neural network consists of layers of interconnected neurons. The network architecture defines how many layers exist, how many neurons are in each layer, and how they connect.
Layer Types¶
Input Layer: The input layer receives raw features. It has one neuron per feature dimension and performs no computation—it simply passes values to the next layer.
Hidden Layers: Hidden layers perform intermediate transformations. A network can have zero, one, or many hidden layers. Each neuron in a hidden layer connects to all neurons in the previous layer (in a fully connected layer) and applies:

\[ a_j^{(l)} = f\left( \sum_{i} w_{ji}^{(l)} a_i^{(l-1)} + b_j^{(l)} \right) \]

where:

- \(a_j^{(l)}\) is the activation of neuron \(j\) in layer \(l\)
- \(w_{ji}^{(l)}\) is the weight from neuron \(i\) in layer \(l-1\) to neuron \(j\) in layer \(l\)
- \(b_j^{(l)}\) is the bias for neuron \(j\) in layer \(l\)
- \(f\) is the activation function
Output Layer: The output layer produces final predictions. For regression, it typically has one neuron with linear activation. For \(K\)-class classification, it has \(K\) neurons with softmax activation.
Multilayer Perceptron (MLP)¶
A multilayer perceptron (MLP) is a feedforward neural network with one or more hidden layers. Despite the name, MLPs typically use nonlinear activations (not the perceptron's step function).
Example architecture:

- Input layer: 4 neurons (4 features)
- Hidden layer 1: 20 neurons (ReLU activation)
- Hidden layer 2: 30 neurons (ReLU activation)
- Hidden layer 3: 25 neurons (ReLU activation)
- Output layer: 3 neurons (softmax activation for 3-class classification)
This is a 4-20-30-25-3 architecture with 3 hidden layers.
Deep Learning¶
Deep learning refers to neural networks with multiple hidden layers (typically more than 2). Deep networks can learn hierarchical representations:

- Lower layers learn simple features (edges, textures)
- Middle layers combine features into parts (eyes, wheels)
- Upper layers recognize high-level concepts (faces, cars)
The depth allows learning complex, compositional patterns that shallow networks struggle with.
Forward Propagation¶
Forward propagation is the process of computing the network's output given an input. Activations flow forward from input through hidden layers to output.
Algorithm¶
For an \(L\)-layer network with input \(\mathbf{x}\):

1. Input layer (\(l = 0\)): \(\mathbf{a}^{(0)} = \mathbf{x}\)
2. Hidden and output layers (for \(l = 1, \ldots, L\)): \(\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}\), then \(\mathbf{a}^{(l)} = f^{(l)}(\mathbf{z}^{(l)})\)
3. Output: \(\hat{\mathbf{y}} = \mathbf{a}^{(L)}\)

where \(\mathbf{W}^{(l)}\) is the weight matrix for layer \(l\) and \(f^{(l)}\) is the activation function.
Example Computation¶
For a simple 2-3-1 network (2 inputs, 3 hidden neurons, 1 output):
Input: \(\mathbf{x} = [x_1, x_2]^T\)
Hidden layer:

\[ z_1^{(1)} = w_{11}^{(1)} x_1 + w_{12}^{(1)} x_2 + b_1^{(1)} \]
\[ z_2^{(1)} = w_{21}^{(1)} x_1 + w_{22}^{(1)} x_2 + b_2^{(1)} \]
\[ z_3^{(1)} = w_{31}^{(1)} x_1 + w_{32}^{(1)} x_2 + b_3^{(1)} \]

with activations \(a_j^{(1)} = f(z_j^{(1)})\) for \(j = 1, 2, 3\).

Output layer:

\[ z^{(2)} = w_1^{(2)} a_1^{(1)} + w_2^{(2)} a_2^{(1)} + w_3^{(2)} a_3^{(1)} + b^{(2)} \]
\[ \hat{y} = z^{(2)} \quad \text{(linear activation for regression)} \]
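As a sketch, the following NumPy code runs this 2-3-1 forward pass end to end; the weight values are randomly generated for illustration, and ReLU is assumed for the hidden layer:

import numpy as np

rng = np.random.default_rng(0)

# 2-3-1 network: weight matrices are (out, in), biases are vectors
W1 = rng.normal(size=(3, 2))   # hidden layer weights
b1 = np.zeros(3)
W2 = rng.normal(size=(1, 3))   # output layer weights
b2 = np.zeros(1)

x = np.array([1.0, 2.0])       # input

z1 = W1 @ x + b1               # hidden pre-activations
a1 = np.maximum(0.0, z1)       # ReLU
z2 = W2 @ a1 + b2              # output pre-activation
y_hat = z2                     # linear activation for regression
print(y_hat)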
Loss Functions¶
Loss functions quantify how well the network's predictions match the true labels. Training minimizes the loss.
Mean Squared Error (MSE)¶
For regression, mean squared error is commonly used:

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]

MSE penalizes large errors heavily due to the squaring.
Cross-Entropy Loss¶
For classification, cross-entropy loss (also called log-loss) measures the difference between predicted and true probability distributions.
Binary cross-entropy (for 2 classes):

\[ \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right] \]

Categorical cross-entropy (for \(K\) classes):

\[ \text{Loss} = -\frac{1}{n} \sum_{i=1}^{n} \sum_{k=1}^{K} y_{ik} \log \hat{y}_{ik} \]
where \(y_{ik} = 1\) if sample \(i\) belongs to class \(k\), otherwise 0 (one-hot encoding).
Cross-entropy loss combined with softmax output forms a numerically stable, theoretically motivated framework for classification.
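For concreteness, here is a minimal NumPy sketch of both losses; clipping the predictions away from 0 and 1 is a common numerical-stability trick (the epsilon value is an arbitrary choice):

import numpy as np

def mse(y_true, y_pred):
    # Mean squared error for regression
    return np.mean((y_true - y_pred) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.2, 0.8, 0.6])
print(binary_cross_entropy(y_true, y_pred))    # ~0.266; confident predictions cost little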
Backpropagation¶
Backpropagation (short for "backward propagation of errors") computes gradients of the loss with respect to all weights and biases. These gradients guide parameter updates during training.
The Chain Rule¶
Backpropagation applies the chain rule from calculus to efficiently compute gradients layer by layer, moving backward from output to input.
For a simple network with loss \(L\), output \(\hat{y}\), and intermediate value \(z\):

\[ \frac{\partial L}{\partial z} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \]

Chaining such factors layer by layer lets the gradient of the loss reach every weight in the network.
Backpropagation Algorithm¶
Starting from the output layer and moving backward:

1. Output layer gradient: \(\delta^{(L)} = \frac{\partial L}{\partial \mathbf{a}^{(L)}} \odot f'^{(L)}(\mathbf{z}^{(L)})\)
2. Hidden layer gradients (for \(l = L-1, \ldots, 1\)): \(\delta^{(l)} = \left[ (\mathbf{W}^{(l+1)})^T \delta^{(l+1)} \right] \odot f'^{(l)}(\mathbf{z}^{(l)})\)
3. Weight and bias gradients: \(\frac{\partial L}{\partial \mathbf{W}^{(l)}} = \delta^{(l)} (\mathbf{a}^{(l-1)})^T\) and \(\frac{\partial L}{\partial \mathbf{b}^{(l)}} = \delta^{(l)}\)

where \(\odot\) denotes element-wise multiplication.
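The following sketch implements one backpropagation step for the 2-3-1 regression network from the forward-propagation example, using MSE loss and ReLU; variable names mirror the equations above, and the setup is illustrative:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
x, y = np.array([1.0, 2.0]), np.array([0.5])

# Forward pass (cache z1 and a1 for the backward pass)
z1 = W1 @ x + b1
a1 = np.maximum(0.0, z1)              # ReLU
y_hat = W2 @ a1 + b2                  # linear output

# Backward pass
delta2 = 2 * (y_hat - y)              # output gradient: MSE loss, linear activation
dW2 = np.outer(delta2, a1)            # delta^(2) (a^(1))^T
db2 = delta2
delta1 = (W2.T @ delta2) * (z1 > 0)   # propagate back through ReLU (derivative is 0/1)
dW1 = np.outer(delta1, x)
db1 = delta1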
Why Backpropagation Matters¶
Before backpropagation, training neural networks required numerical gradient estimation (finite differences), which was computationally prohibitive for large networks. Backpropagation enables efficient gradient computation, making deep learning practical.
Gradient Descent¶
Gradient descent is the optimization algorithm that updates weights to minimize the loss. The update rule is:

\[ \mathbf{w} \leftarrow \mathbf{w} - \eta \, \frac{\partial L}{\partial \mathbf{w}} \]

where \(\eta\) is the learning rate, controlling the step size.
Batch Gradient Descent¶
Standard gradient descent computes gradients using the entire training set:
- Forward propagation: Compute predictions for all samples
- Compute loss: Average loss over all samples
- Backpropagation: Compute gradients averaging over all samples
- Update weights: Apply gradient descent update
This is stable but slow for large datasets.
Stochastic Gradient Descent (SGD)¶
Stochastic gradient descent updates weights after each individual sample:
- Randomly shuffle the training data
- For each sample:
    - Forward propagation
    - Compute the loss for this sample
    - Backpropagation
    - Update the weights
Advantages:

- Much faster per update
- Can escape local minima due to noise
- Enables online learning (updating as new data arrives)

Disadvantages:

- Noisy gradients cause erratic convergence
- Requires careful learning rate tuning
Mini-Batch Gradient Descent¶
Mini-batch gradient descent strikes a balance by updating on small batches of samples:
- Batch size (e.g., 32, 64, 128): the number of samples per update
- For each mini-batch:
    - Forward propagation on the batch
    - Compute the average loss over the batch
    - Backpropagation
    - Update the weights
Advantages:

- More stable than SGD, faster than full batch
- Efficient matrix operations (GPUs excel at batch processing)
- Reduces gradient variance while maintaining speed
Batch processing enables efficient use of modern hardware accelerators.
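A minimal training-loop skeleton makes the epoch/batch structure explicit; compute_gradients here stands in for a full forward-plus-backward pass and is an assumed callback, not a library function:

import numpy as np

def minibatch_sgd(X, y, params, compute_gradients, lr=0.01, batch_size=32, epochs=10):
    n = len(X)
    for epoch in range(epochs):                      # one epoch = one full pass over the data
        perm = np.random.permutation(n)              # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            grads = compute_gradients(X[idx], y[idx], params)
            for key in params:                       # gradient descent update per parameter
                params[key] -= lr * grads[key]
    return params

Setting batch_size to 1 recovers SGD; setting it to n recovers full-batch gradient descent.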
Learning Rate¶
The learning rate \(\eta\) critically affects training:
- Too small: Slow convergence, may get stuck
- Too large: Oscillation, divergence, missing minimum
- Just right: Fast, stable convergence
Learning rate scheduling adaptively adjusts \(\eta\) during training:

- Step decay: reduce \(\eta\) by a factor (e.g., ×0.1) every \(N\) epochs
- Exponential decay: \(\eta(t) = \eta_0 e^{-kt}\)
- 1/t decay: \(\eta(t) = \eta_0 / (1 + kt)\)
- Adaptive methods (Adam, RMSprop): automatically adjust per-parameter learning rates
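These schedules are one-liners; the sketch below computes all three so their decay behavior can be compared, with \(\eta_0\), \(k\), and the step factor chosen arbitrarily:

import numpy as np

eta0, k = 0.1, 0.05

def step_decay(t, drop=0.1, every=20):
    return eta0 * drop ** (t // every)   # multiply by `drop` every `every` epochs

def exp_decay(t):
    return eta0 * np.exp(-k * t)         # eta(t) = eta0 * e^(-kt)

def inv_decay(t):
    return eta0 / (1 + k * t)            # eta(t) = eta0 / (1 + kt)

for t in [0, 10, 50, 100]:
    print(t, step_decay(t), round(exp_decay(t), 4), round(inv_decay(t), 4))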
Epochs¶
An epoch is one complete pass through the entire training dataset. Training typically runs for many epochs (10s to 1000s), with the network gradually improving as it sees data repeatedly.
Weight Initialization¶
Weight initialization significantly affects training dynamics. Poor initialization can prevent learning entirely.
Why Initialization Matters¶
- All zeros: Neurons in a layer behave identically (symmetry problem)
- Too large: Activations explode, gradients explode
- Too small: Activations vanish, gradients vanish
Xavier (Glorot) Initialization¶
Xavier initialization keeps the variance of activations and gradients stable across layers. For a layer with \(n_{in}\) inputs and \(n_{out}\) outputs, weights are drawn as:

\[ W \sim \mathcal{N}\left(0, \; \frac{2}{n_{in} + n_{out}}\right) \]

or, in the uniform variant:

\[ W \sim \mathcal{U}\left[-\sqrt{\frac{6}{n_{in} + n_{out}}}, \; \sqrt{\frac{6}{n_{in} + n_{out}}}\right] \]
Best for: Tanh and sigmoid activations
He Initialization¶
He initialization accounts for ReLU's characteristics (zeroing out negative values halves the activation variance):

\[ W \sim \mathcal{N}\left(0, \; \frac{2}{n_{in}}\right) \]
Best for: ReLU and Leaky ReLU activations
Proper initialization is crucial for training deep networks.
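A minimal NumPy sketch of both schemes, handy for the initialization exercise at the end of the chapter; the layer sizes are arbitrary:

import numpy as np

rng = np.random.default_rng(42)

def xavier_init(n_in, n_out):
    # Normal Xavier/Glorot: variance 2 / (n_in + n_out)
    return rng.normal(0.0, np.sqrt(2.0 / (n_in + n_out)), size=(n_out, n_in))

def he_init(n_in, n_out):
    # He: variance 2 / n_in, suited to ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

W = he_init(256, 128)
print(W.std())   # should be close to sqrt(2/256) ≈ 0.088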
Training Challenges¶
Vanishing Gradients¶
The vanishing gradient problem occurs when gradients become extremely small as they propagate backward through many layers. This causes early layers to learn very slowly or not at all.
Causes:

- Sigmoid/tanh activations saturate (gradients ≈ 0)
- Deep networks multiply many small gradients together

Solutions:

- Use ReLU activations
- Skip connections (ResNets)
- Batch normalization
- Proper weight initialization
Exploding Gradients¶
The exploding gradient problem is the opposite: gradients become extremely large, causing numerical instability and divergence.
Causes:

- Poor weight initialization
- Deep networks multiply many large gradients together

Solutions:

- Gradient clipping (cap the gradient magnitude)
- Proper weight initialization
- Batch normalization
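Gradient clipping by global norm is straightforward to implement; this NumPy sketch rescales a list of gradient arrays when their combined norm exceeds a threshold (the threshold value is illustrative):

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # grads: list of gradient arrays for all parameters
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm      # shrink all gradients proportionally
        grads = [g * scale for g in grads]
    return grads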
Regularization Techniques¶
Dropout¶
Dropout randomly sets a fraction of neuron activations to zero during training. For example, with dropout rate 0.5, each neuron has a 50% chance of being "dropped."
Effect:

- Prevents co-adaptation (neurons relying on specific other neurons)
- Acts like training an ensemble of networks
- Significantly reduces overfitting
Implementation (a minimal NumPy sketch of inverted dropout):

import numpy as np

def apply_dropout(activations, keep_prob=0.5, training=True):
    # During training: randomly zero activations and rescale the rest
    if training:
        mask = np.random.binomial(1, keep_prob, size=activations.shape)
        return activations * mask / keep_prob  # scale to maintain expected value
    # During inference: use all activations (no dropout)
    return activations
Dropout is typically applied to fully connected layers, not convolutional layers.
Early Stopping¶
Early stopping monitors validation loss during training and stops when validation performance stops improving. This prevents overfitting by avoiding overtraining.
Algorithm:

1. Train the network and evaluate on the validation set after each epoch
2. Track the best validation loss seen so far
3. If validation loss doesn't improve for \(N\) consecutive epochs (the patience), stop training
4. Return the weights from the epoch with the best validation loss
Early stopping is a simple, effective regularization technique that requires no hyperparameter tuning beyond patience.
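A skeleton of that loop is shown below; train_one_epoch and validation_loss are hypothetical helpers, and the model is assumed to expose get_weights/set_weights. Only the stopping logic itself is the point here:

def train_with_early_stopping(model, patience=10, max_epochs=1000):
    best_loss, best_weights, wait = float('inf'), None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)                 # hypothetical training helper
        loss = validation_loss(model)          # hypothetical validation helper
        if loss < best_loss:
            best_loss, best_weights, wait = loss, model.get_weights(), 0
        else:
            wait += 1
            if wait >= patience:               # no improvement for `patience` epochs
                break
    model.set_weights(best_weights)            # restore the best epoch's weights
    return model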
Universal Approximation Theorem¶
The universal approximation theorem states that a neural network with a single hidden layer containing a finite number of neurons can approximate any continuous function to arbitrary accuracy, given enough neurons.
Implications:

- Neural networks are theoretically capable of learning any continuous function
- Shallow networks can represent complex functions but may require exponentially many neurons
- Deep networks learn hierarchical representations more efficiently

Important caveats:

- The theorem guarantees existence, not learnability (training may not find the solution)
- It says nothing about generalization
- It doesn't specify how many neurons are needed
Neural Networks in Practice¶
Building a Neural Network with Scikit-Learn¶
Let's apply MLPClassifier to the Iris dataset:
import pandas as pd
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
# Load iris dataset
url = "https://gist.githubusercontent.com/curran/a08a1080b88344b0c8a7/raw/iris.csv"
iris_df = pd.read_csv(url)
# Examine feature correlations
feature_names = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width']
correlation_matrix = iris_df[feature_names].corr().round(2)
plt.figure(figsize=(6, 6))
sns.heatmap(data=correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()
The correlation matrix reveals strong positive correlation between petal length and petal width (0.96), suggesting these features carry similar information. Sepal width and length have weak negative correlation.
Training the Network¶
# Prepare data
X = iris_df.loc[:, feature_names].values
y = iris_df.loc[:, 'species'].values
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Create neural network with 3 hidden layers
# Architecture: 4 inputs → 20 neurons → 30 neurons → 25 neurons → 3 outputs
mlp = MLPClassifier(hidden_layer_sizes=(20, 30, 25),
max_iter=1000,
activation='relu',
solver='adam',
random_state=42)
# Train
mlp.fit(X_train, y_train)
# Predictions
y_pred = mlp.predict(X_test)
# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.3f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)
Evaluating Multiple Runs¶
Neural network training is stochastic, so results vary across runs:
# Run multiple times to assess stability
scores = []
for i in range(20):
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=i)
mlp = MLPClassifier(hidden_layer_sizes=(20, 30, 25), max_iter=1000, random_state=i)
mlp.fit(X_train, y_train)
y_pred = mlp.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
scores.append(accuracy)
print(f"Average accuracy: {np.mean(scores):.3f}")
print(f"Std deviation: {np.std(scores):.3f}")
print(f"Min: {np.min(scores):.3f}, Max: {np.max(scores):.3f}")
This reveals the stability (or variability) of the model across different random initializations and data splits.
Hyperparameter Tuning¶
Key hyperparameters to tune:
from sklearn.model_selection import GridSearchCV
# Define hyperparameter grid
param_grid = {
'hidden_layer_sizes': [(50,), (100,), (50, 50), (100, 50)],
'activation': ['relu', 'tanh'],
'alpha': [0.0001, 0.001, 0.01], # L2 regularization strength
'learning_rate_init': [0.001, 0.01]
}
# Grid search with cross-validation
grid_search = GridSearchCV(MLPClassifier(max_iter=1000), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train, y_train)
print("Best parameters:", grid_search.best_params_)
print("Best CV score:", grid_search.best_score_)
# Evaluate on test set
best_model = grid_search.best_estimator_
test_accuracy = best_model.score(X_test, y_test)
print(f"Test accuracy: {test_accuracy:.3f}")
Advanced Architectures¶
Pooling Layers¶
Pooling layers reduce spatial dimensions in convolutional networks by downsampling:
- Max pooling: Take maximum value in each region
- Average pooling: Take average value in each region
Pooling provides translation invariance and reduces computational cost.
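Max pooling is easy to express directly; this NumPy sketch pools a 2D feature map with a 2×2 window and stride 2, assuming the input dimensions are even:

import numpy as np

def max_pool_2x2(fmap):
    # fmap: (H, W) feature map with even H and W
    H, W = fmap.shape
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5.  7.] [13. 15.]]: each output is the max of a 2x2 block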
Freezing Layers¶
Freezing layers prevents weight updates during training. This is useful for:
- Transfer learning: Freeze pretrained layers, train only final layers
- Feature extraction: Use frozen network as feature extractor
- Progressive training: Gradually unfreeze layers
# Conceptual example (PyTorch): freeze the first layer of a small model
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 20), nn.ReLU(), nn.Linear(20, 3))
for param in model[0].parameters():
    param.requires_grad = False  # freeze the first Linear layer; the optimizer skips it
Interactive Visualization: Neural Network Architecture¶
Build and explore different neural network architectures:
Interactive Visualization: Activation Functions¶
Compare different activation functions and their properties:
Summary¶
Neural networks are powerful function approximators built from layers of artificial neurons. Each neuron computes a weighted sum of inputs, adds a bias, and applies an activation function. Activation functions like ReLU, tanh, and sigmoid introduce essential nonlinearity.
Forward propagation computes predictions by passing inputs through successive layers. Backpropagation efficiently computes gradients using the chain rule, enabling gradient descent optimization. Stochastic gradient descent and mini-batch variants balance speed and stability.
Weight initialization (Xavier for tanh/sigmoid, He for ReLU) prevents vanishing and exploding gradients. Regularization techniques like dropout and early stopping combat overfitting. The universal approximation theorem guarantees that neural networks can represent any function, though depth enables more efficient learning.
Modern deep learning frameworks automate much of the complexity, but understanding these fundamentals—neurons, activations, forward/back propagation, gradient descent, and training challenges—provides the foundation for effectively applying and debugging neural networks.
Key Takeaways¶
- Artificial neurons compute weighted sums plus bias, then apply activation functions
- Activation functions introduce nonlinearity; ReLU is the default for hidden layers
- Network architecture defines layers (input, hidden, output) and connections
- Forward propagation computes outputs by passing activations through layers
- Backpropagation efficiently computes gradients using the chain rule
- Gradient descent updates weights to minimize loss; SGD and mini-batch variants balance speed and stability
- Learning rate controls step size; too large causes divergence, too small causes slow convergence
- Weight initialization (Xavier, He) prevents vanishing/exploding gradients
- Dropout and early stopping prevent overfitting
- Vanishing gradients occur with sigmoid/tanh in deep networks; ReLU alleviates this
- Batch size affects gradient variance and computational efficiency
- Universal approximation theorem guarantees representation capacity
Further Reading¶
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning (Chapters 6-8)
- Nielsen, M. (2015). Neural Networks and Deep Learning (Free online book)
- LeCun, Y., Bengio, Y., & Hinton, G. (2015). "Deep learning." Nature, 521, 436-444
- Rumelhart, D., Hinton, G., & Williams, R. (1986). "Learning representations by back-propagating errors." Nature, 323, 533-536
- Glorot, X., & Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." AISTATS
Exercises¶
1. Manual Forward Propagation: Given a 2-3-1 network with specific weights and biases, manually compute the output for input \([1, 2]^T\) using ReLU activations.

2. Activation Function Analysis: Plot sigmoid, tanh, ReLU, and Leaky ReLU (\(\alpha = 0.01\)) and their derivatives on the same graph. At what input values does each function saturate?

3. Backpropagation by Hand: For a simple 2-2-1 network, compute gradients with respect to all weights and biases for a single training example using MSE loss.

4. Learning Rate Experiment: Train a network on a small dataset with learning rates [0.0001, 0.001, 0.01, 0.1, 1.0]. Plot training loss vs. epoch for each. Which converges fastest? Which diverges?

5. Architecture Comparison: Compare 1-layer (4-50-3), 2-layer (4-25-25-3), and 3-layer (4-20-20-20-3) networks on Iris. Which achieves the best test accuracy? Why might deeper not always be better for small datasets?

6. Dropout Impact: Train identical networks with dropout rates [0, 0.2, 0.5, 0.8]. Plot training vs. validation accuracy. How does dropout affect the train-validation gap?

7. Weight Initialization: Initialize a deep network (10 layers) with all zeros, Xavier, He, and random uniform \([-1, 1]\). Plot activation distributions after a forward pass. Which initializations cause saturation or vanishing activations?