Quiz: Regularization Techniques
Test your understanding of regularization methods with these questions.
1. What is the primary purpose of regularization in machine learning?
- A. To speed up model training time
- B. To prevent overfitting by penalizing model complexity
- C. To increase the number of features in the model
- D. To eliminate the need for cross-validation
Show Answer
The correct answer is B. Regularization prevents overfitting by adding a penalty term to the loss function that discourages large coefficient values. This forces the model to find simpler explanations that are more likely to generalize to unseen data. Without regularization, complex models can memorize training data including noise, leading to excellent training performance but poor test performance.
Concept Tested: Regularization
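To make the idea of a penalty term concrete, here is a minimal sketch of an L2-regularized loss written with NumPy; the names (`ridge_loss`, `alpha`, `beta`) are illustrative, not taken from scikit-learn:

```python
import numpy as np

# A minimal sketch of a regularized loss: the usual data-fit term (mean squared
# error) plus a penalty on coefficient size.
def ridge_loss(beta, X, y, alpha):
    residuals = y - X @ beta
    mse = np.mean(residuals ** 2)          # how well the model fits the data
    penalty = alpha * np.sum(beta ** 2)    # discourages large coefficients
    return mse + penalty
```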
2. What is the mathematical form of the L2 penalty term in Ridge regression?
- A. λ Σ |βⱼ|
- B. λ Σ βⱼ
- C. λ Σ log(βⱼ)
- D. λ Σ βⱼ²
Show Answer
The correct answer is D. The L2 penalty in Ridge regression is λ Σ βⱼ², the sum of squared coefficients multiplied by the regularization parameter λ. This quadratic penalty grows rapidly with coefficient magnitude, encouraging the model to use smaller, more distributed weights. The squared term ensures the penalty is never negative and is differentiable everywhere, leading to smooth, stable optimization.
Concept Tested: L2 Regularization
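As a quick worked example, with a made-up coefficient vector and a made-up λ, the penalty is just the scaled sum of squares:

```python
import numpy as np

beta = np.array([3.0, -2.0, 1.0])     # hypothetical fitted coefficients
lam = 0.5                             # hypothetical regularization strength λ

l2_penalty = lam * np.sum(beta ** 2)  # 0.5 * (9 + 4 + 1) = 7.0
print(l2_penalty)
```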
3. What distinguishes L1 regularization (Lasso) from L2 regularization (Ridge)?
- A. L1 can drive coefficients to exactly zero, performing automatic feature selection
- B. L1 always trains faster than L2
- C. L1 uses squared penalties while L2 uses absolute value penalties
- D. L1 requires fewer training examples than L2
Show Answer
The correct answer is A. L1 regularization (Lasso) uses an absolute value penalty λ Σ |βⱼ| that can drive coefficients to exactly zero, effectively removing features from the model. This automatic feature selection makes Lasso valuable for high-dimensional problems with many irrelevant features. In contrast, L2 (Ridge) uses squared penalties that shrink coefficients smoothly toward zero but rarely reach exactly zero.
Concept Tested: L1 Regularization
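A small sketch of this difference on synthetic data (the dataset size and alpha values are arbitrary choices for illustration): Lasso typically zeroes out several coefficients, while Ridge leaves them all non-zero.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

# Synthetic data in which only 5 of the 20 features are informative.
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

print("Lasso coefficients set to zero:", int(np.sum(lasso.coef_ == 0)))
print("Ridge coefficients set to zero:", int(np.sum(ridge.coef_ == 0)))  # usually 0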
4. In scikit-learn's Ridge and Lasso classes, what does the alpha parameter control?
- A. The learning rate for gradient descent
- B. The number of features to select
- C. The regularization strength (λ)
- D. The train-test split ratio
Show Answer
The correct answer is C. The alpha parameter in scikit-learn's Ridge and Lasso classes directly controls the regularization strength λ. Larger alpha values apply stronger regularization (more penalty on large coefficients), leading to simpler models with smaller coefficient magnitudes. Alpha=0 corresponds to ordinary least squares with no regularization, while very large alpha values shrink coefficients heavily toward zero, potentially causing underfitting.
Concept Tested: Ridge Regression
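A brief sketch of the effect, using synthetic data and an arbitrary alpha grid: as alpha grows, the fitted coefficients shrink.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=10, noise=5.0, random_state=0)

# Larger alpha means a stronger penalty, so the fitted coefficients shrink.
for alpha in [0.01, 1.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(f"alpha={alpha:>6}: mean |coef| = {np.mean(np.abs(model.coef_)):.3f}")
```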
5. Why is feature standardization essential before applying regularization?
- A. It speeds up the optimization algorithm
- B. Regularization penalizes coefficient magnitudes, and features on different scales lead to unfair penalization
- C. Standardization is not necessary for regularization
- D. It reduces the number of features needed
Show Answer
The correct answer is B. Regularization penalties like λ Σ βⱼ² treat all coefficients equally, but a coefficient's size depends on its feature's units: a feature measured on a small scale (e.g., age in decades) needs a much larger coefficient to have the same predictive impact as a feature measured on a large scale (e.g., income in dollars). Without standardization, the penalty therefore falls unevenly across features, shrinking small-scale features far more than large-scale ones even when they are equally important. Standardizing puts every feature on a comparable scale, so all coefficients contribute fairly to the penalty term.
Concept Tested: Regularization
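In practice, a common way to do this is to put the scaler and the regularized model in a single pipeline, as sketched below (`X_train` and `y_train` are placeholders for your own data):

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scaling inside a pipeline keeps preprocessing and model together, so the scaler
# is fit only on the training data and every feature enters the penalty on a
# comparable scale.
model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
# model.fit(X_train, y_train)
# model.predict(X_test)
```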
6. Given a dataset with 100 features where you suspect only 10 are truly predictive, which regularization method would be most appropriate?
- A. Lasso (L1) because it performs automatic feature selection
- B. Ridge (L2) because it handles all features equally
- C. No regularization because it would remove important features
- D. L2 because it's computationally faster
Show Answer
The correct answer is A. Lasso regression is ideal when you suspect many features are irrelevant because its L1 penalty drives coefficients of unimportant features to exactly zero. This automatic feature selection would identify the approximately 10 truly predictive features while eliminating the 90 noise features, resulting in a sparse, interpretable model. Ridge would keep all 100 features with small but non-zero coefficients, which doesn't solve the feature selection problem.
Concept Tested: Lasso Regression
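A sketch of this scenario with synthetic data (the sample size, noise level, and cross-validation settings are arbitrary): LassoCV chooses its own alpha and typically keeps roughly the informative features while zeroing most of the rest.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scenario: 100 features, only 10 truly predictive.
X, y = make_regression(n_samples=300, n_features=100, n_informative=10,
                       noise=10.0, random_state=0)
X = StandardScaler().fit_transform(X)

lasso = LassoCV(cv=5, random_state=0).fit(X, y)
kept = int(np.sum(lasso.coef_ != 0))
print(f"Features kept by Lasso: {kept} of {X.shape[1]} (chosen alpha = {lasso.alpha_:.3f})")
```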
7. In the geometric interpretation of Ridge regression with two coefficients, what shape does the L2 constraint region form?
- A. A square
- B. A diamond
- C. A triangle
- D. A circle
Show Answer
The correct answer is D. The L2 constraint β₁² + β₂² ≤ t defines a circle (in higher dimensions, a hypersphere) centered at the origin. The Ridge solution occurs where the smallest error contour ellipse touches this circular constraint region. Because circles have smooth boundaries with no corners, the solution typically doesn't lie exactly on an axis, which is why Ridge rarely sets coefficients to exactly zero.
Concept Tested: L2 Regularization
8. What shape does the L1 constraint region form in two-dimensional coefficient space?
- A. A circle
- B. An ellipse
- C. A diamond (rotated square)
- D. A hexagon
Show Answer
The correct answer is C. The L1 constraint |β₁| + |β₂| ≤ t defines a diamond shape (a square rotated 45 degrees) with corners aligned on the coordinate axes at points like (±t, 0) and (0, ±t). When error contours touch this diamond-shaped constraint region, they frequently contact at a corner where one coefficient is exactly zero. This geometric property explains why Lasso performs automatic feature selection—the corners correspond to sparse solutions.
Concept Tested: L1 Regularization
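To visualize the contrast between the constraint regions from questions 7 and 8, here is a small matplotlib sketch (the budget t = 1 is chosen arbitrarily) that draws the L2 circle and the L1 diamond in (β₁, β₂) space:

```python
import numpy as np
import matplotlib.pyplot as plt

# L2 region: beta1^2 + beta2^2 <= t (a disk of radius sqrt(t)).
# L1 region: |beta1| + |beta2| <= t (a diamond with corners on the axes).
t = 1.0
theta = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot(np.sqrt(t) * np.cos(theta), np.sqrt(t) * np.sin(theta),
        label=r"L2: $\beta_1^2+\beta_2^2 \leq t$")
diamond = np.array([[t, 0], [0, t], [-t, 0], [0, -t], [t, 0]])
ax.plot(diamond[:, 0], diamond[:, 1], label=r"L1: $|\beta_1|+|\beta_2| \leq t$")
ax.axhline(0, color="gray", lw=0.5)
ax.axvline(0, color="gray", lw=0.5)
ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
ax.set_aspect("equal")
ax.legend()
plt.show()
```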
9. You fit Ridge regression with alpha values [0.01, 0.1, 1.0, 10.0, 100.0] and observe the following test R² scores: [0.72, 0.78, 0.82, 0.79, 0.65]. What does this pattern suggest?
- A. Alpha should be increased further to improve performance
- B. The optimal alpha is around 1.0, balancing bias and variance
- C. Regularization is not helping this problem
- D. The model is underfitting at all alpha values
Show Answer
The correct answer is B. The test R² scores peak at alpha=1.0 (R²=0.82) and decline for both smaller and larger alpha values. This indicates that alpha=1.0 provides the optimal bias-variance trade-off: smaller alpha values (0.01, 0.1) underregularize and allow overfitting, while larger values (10.0, 100.0) overregularize and cause underfitting. Plotting test R² against alpha traces the classic inverted-U shape (a U-shape for error metrics), with the optimal alpha balancing model complexity and generalization.
Concept Tested: Ridge Regression
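A sketch of the same alpha sweep on synthetic data; the scores you obtain will differ from the illustrative numbers in the question, but the rise-then-fall pattern is what to look for.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=200, n_features=50, noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep the same alpha grid as in the question and look for the peak in test R^2.
for alpha in [0.01, 0.1, 1.0, 10.0, 100.0]:
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    print(f"alpha={alpha:>6}: test R^2 = {model.score(X_test, y_test):.3f}")
```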
10. In scikit-learn's LogisticRegression, the C parameter is the inverse of regularization strength. If C=0.1, what does this imply?
- A. Strong regularization (equivalent to large λ), encouraging simpler models
- B. Weak regularization (equivalent to small λ), allowing complex models
- C. No regularization is applied
- D. The model will automatically select 10% of features
Show Answer
The correct answer is A. Since C = 1/λ in scikit-learn's LogisticRegression, a small C value like 0.1 corresponds to large λ (λ=10), applying strong regularization. This heavily penalizes large coefficients, forcing the model toward simpler solutions with smaller weights. Strong regularization helps prevent overfitting but risks underfitting if too strong. Typical C values range from 0.001 (very strong regularization) to 100 (very weak regularization).
Concept Tested: Regularization
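A short sketch of the effect of C on synthetic classification data (the C grid and dataset are arbitrary choices for illustration): smaller C produces noticeably smaller coefficients.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = StandardScaler().fit_transform(X)

# C is the inverse of lambda, so smaller C means stronger regularization
# and smaller coefficients.
for C in [0.001, 0.1, 10.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print(f"C={C:>6}: mean |coef| = {np.mean(np.abs(clf.coef_)):.3f}")
```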