Exam SRM · ~12-16%

Exam SRM — Model Selection & Diagnostics Flashcards

Checking and choosing a regression model for SOA Exam SRM: reading residual-vs-fitted and Q-Q plots, standardized vs studentized residuals, leverage $h_{ii}$ and Cook's distance, multicollinearity and the variance inflation factor, best-subset vs forward/backward stepwise selection, the AIC/BIC/Mallows' $C_p$/adjusted $R^2$ criteria, the validation-set/$k$-fold/LOOCV approaches with the leverage shortcut, and ridge vs lasso regularization with the bias-variance trade-off — with fully worked calculations.

44 cards · 6 topics · Free · fact-checked · LaTeX math

Browse all 44 cards as a list
  1. Residual diagnostics
    What does a **residual-vs-fitted** plot diagnose, and what patterns signal trouble?
    Plot the residuals $e_i=y_i-\hat y_i$ on the vertical axis against the fitted values $\hat y_i$ on the horizontal axis. A well-specified linear model shows a **random horizontal band** centered at $0$ with constant spread. **Curvature** (a U-shape or arch) signals a missing nonlinear term — the linearity assumption fails. A **funnel/fan** shape (spread growing or shrinking with $\hat y_i$) signals **non-constant variance** (heteroscedasticity).
  2. Residual diagnostics
    What is a **normal Q-Q plot** of residuals used for, and how do you read departures from the line?
    It plots the ordered standardized residuals against the quantiles of a standard normal. If the errors are normal, the points fall near the $45^{\circ}$ reference line. **Heavy tails** bend the points away from the line at both ends (low points below, high points above). **Right skew** curves the upper end above the line; **left skew** curves the lower end below. Normality of errors matters for the validity of $t$- and $F$-based inference, especially in small samples.
  3. Residual diagnostics
    Distinguish a raw **residual**, a **standardized** residual, and a **studentized** (deleted) residual.
    Raw residual: $e_i=y_i-\hat y_i$, on the scale of $y$. **Standardized (internally studentized):** $r_i=\frac{e_i}{\hat\sigma\sqrt{1-h_{ii}}}$ — divides by the estimated SD of that residual, which depends on its leverage $h_{ii}$. **Studentized (externally studentized / deleted):** $t_i=\frac{e_i}{\hat\sigma_{(i)}\sqrt{1-h_{ii}}}$, where $\hat\sigma_{(i)}$ is estimated with observation $i$ **removed**. This makes a true outlier stand out because the point can't inflate its own scale estimate. Values with $|r_i|$ or $|t_i|>2$ (or $3$) flag possible outliers.
  4. Residual diagnostics
    A fitted regression has $\hat\sigma=4.0$. Observation $i$ has residual $e_i=9.2$ and leverage $h_{ii}=0.16$. Compute its **standardized** residual and judge whether it is an outlier.
    $r_i=\dfrac{e_i}{\hat\sigma\sqrt{1-h_{ii}}}=\dfrac{9.2}{4.0\sqrt{1-0.16}}$. $\sqrt{0.84}\approx 0.916515$, so the denominator is $4.0(0.916515)\approx 3.666060$. $r_i=\dfrac{9.2}{3.666060}\approx 2.510$. Since $|r_i|>2$, observation $i$ is a candidate **outlier** in the $y$-direction.
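
A minimal Python sketch of the calculation on this card. The function name `standardized_residual` is illustrative only and does not come from any library.

```python
import math

def standardized_residual(e, sigma_hat, h):
    """Internally studentized residual: e / (sigma_hat * sqrt(1 - h))."""
    if not 0 <= h < 1:
        raise ValueError("leverage must satisfy 0 <= h < 1")
    return e / (sigma_hat * math.sqrt(1 - h))

# Values from the card: sigma_hat = 4.0, e_i = 9.2, h_ii = 0.16
r_i = standardized_residual(9.2, 4.0, 0.16)
is_outlier_candidate = abs(r_i) > 2  # the |r_i| > 2 rule of thumb

print(f"{r_i:.3f}", is_outlier_candidate)
```
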
  5. Residual diagnostics
    Why do we standardize residuals by $\hat\sigma\sqrt{1-h_{ii}}$ rather than just by $\hat\sigma$?
    Even when the true errors $\epsilon_i$ all have variance $\sigma^2$, the **fitted** residuals do not: $\operatorname{Var}(e_i)=\sigma^2(1-h_{ii})$. High-leverage points pull the fitted line toward themselves, shrinking their own residual variance. Dividing by $\hat\sigma\sqrt{1-h_{ii}}$ puts every residual on a common unit-variance scale, so a fixed cutoff like $2$ means the same thing at every leverage. Without the $(1-h_{ii})$ correction a high-leverage outlier can hide behind a deceptively small raw residual.
  6. Leverage & influence
    Define the **leverage** $h_{ii}$ of an observation and state its key properties ($\sum h_{ii}$, average, range).
    Leverage is the $i$-th diagonal of the hat matrix $H=X(X^{\top}X)^{-1}X^{\top}$, which satisfies $\hat y=Hy$; $h_{ii}=\partial\hat y_i/\partial y_i$ measures how much $y_i$ pulls its own fit. Properties: $0\le h_{ii}\le 1$, the leverages sum to the number of estimated coefficients, $\sum_{i=1}^{n} h_{ii}=p+1$ (intercept plus $p$ slopes), so the **average** leverage is $\frac{p+1}{n}$. Leverage depends only on the $X$-values, not on $y$ — it measures how unusual a point's predictor values are.
  7. Leverage & influence
    State the common **rule of thumb** for flagging a high-leverage point and apply it: a model with $p=3$ predictors fit to $n=50$ observations.
    Average leverage is $\frac{p+1}{n}$. A common rule flags a point as high-leverage when $h_{ii}>2\cdot\frac{p+1}{n}$ (some texts use $3\times$). Here $p+1=4$ and $n=50$, so the average leverage is $\frac{4}{50}=0.08$. Threshold $=2(0.08)=0.16$. Any observation with $h_{ii}>0.16$ is flagged as having unusually extreme predictor values.
  8. Leverage & influence
    Distinguish an **outlier**, a **high-leverage** point, and an **influential** point.
    An **outlier** has an unusual **response** $y$ given its predictors — a large studentized residual. A **high-leverage** point has unusual **predictor** values (large $h_{ii}$), regardless of its $y$. An **influential** point is one whose removal substantially changes the fitted coefficients. Influence combines the two: a point is most influential when it has **both** high leverage **and** a large residual. Cook's distance measures this combined effect.
  9. Leverage & influence
    State **Cook's distance** $D_i$ in terms of the standardized residual and leverage, and the usual cutoff.
    $D_i=\dfrac{r_i^{2}}{p+1}\cdot\dfrac{h_{ii}}{1-h_{ii}}$, where $r_i$ is the standardized residual, $h_{ii}$ the leverage, and $p+1$ the number of estimated coefficients. $D_i$ measures how much **all** fitted values shift when observation $i$ is deleted — it blends outlyingness ($r_i^2$) with leverage ($\frac{h_{ii}}{1-h_{ii}}$). A common rule flags $D_i>1$ (or $D_i>\frac{4}{n}$) as influential.
  10. Leverage & influence
    An observation in a model with $p=3$ predictors has standardized residual $r_i=2.2$ and leverage $h_{ii}=0.30$. Compute **Cook's distance** and judge influence.
    $D_i=\dfrac{r_i^{2}}{p+1}\cdot\dfrac{h_{ii}}{1-h_{ii}}=\dfrac{2.2^{2}}{4}\cdot\dfrac{0.30}{0.70}$. $\dfrac{4.84}{4}=1.21$ and $\dfrac{0.30}{0.70}\approx 0.428571$. $D_i=1.21(0.428571)\approx 0.519$. Since $D_i<1$, the point is **not** strongly influential by the usual cutoff. Under the looser $\frac{4}{n}$ rule, however, it would be flagged for any $n\ge 8$ (because $\frac{4}{8}=0.5<0.519$), so it still warrants a look.
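
The same arithmetic as a short Python sketch. The helper name `cooks_distance` is hypothetical. Because the card gives no sample size, the last lines only work out the smallest $n$ at which the $\frac{4}{n}$ rule would flag this point.

```python
import math

def cooks_distance(r, h, n_coef):
    """Cook's distance from standardized residual r, leverage h, and n_coef = p + 1."""
    return (r ** 2 / n_coef) * (h / (1 - h))

# Values from the card: r_i = 2.2, h_ii = 0.30, p + 1 = 4
d_i = cooks_distance(r=2.2, h=0.30, n_coef=4)
exceeds_one = d_i > 1

# Smallest integer n with 4 / n < D_i, i.e. n > 4 / D_i
n_min_flagged = math.floor(4 / d_i) + 1

print(f"{d_i:.3f}", exceeds_one, n_min_flagged)
```
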
  11. Leverage & influence
    A simple regression has $n=20$, $\bar x=10$, and $S_{xx}=\sum(x_i-\bar x)^2=160$. Find the **leverage** of a point with $x_i=16$.
    For simple linear regression, $h_{ii}=\dfrac{1}{n}+\dfrac{(x_i-\bar x)^2}{S_{xx}}$. $(x_i-\bar x)^2=(16-10)^2=36$. $h_{ii}=\dfrac{1}{20}+\dfrac{36}{160}=0.05+0.225=0.275$. The average leverage here is $\frac{p+1}{n}=\frac{2}{20}=0.10$, and $2(0.10)=0.20<0.275$, so this point is flagged as **high leverage**.
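
A small Python sketch of the simple-regression leverage formula. The first call reproduces the card's numbers. The `xs` list is made-up data, used only to illustrate that the leverages sum to $p+1=2$ and how the $2\cdot\frac{p+1}{n}$ flag is applied.

```python
def slr_leverage(x, n, x_bar, s_xx):
    """Leverage in simple linear regression: 1/n + (x - x_bar)^2 / S_xx."""
    return 1 / n + (x - x_bar) ** 2 / s_xx

# Card values: n = 20, x_bar = 10, S_xx = 160, x_i = 16
h_card = slr_leverage(16, n=20, x_bar=10, s_xx=160)

# Hypothetical predictor values (not from the deck)
xs = [1.0, 2.0, 3.0, 4.0, 10.0]
n = len(xs)
x_bar = sum(xs) / n
s_xx = sum((x - x_bar) ** 2 for x in xs)
leverages = [slr_leverage(x, n, x_bar, s_xx) for x in xs]

threshold = 2 * 2 / n  # 2 * (p + 1) / n with p = 1
flagged = [i for i, h in enumerate(leverages) if h > threshold]

print(round(h_card, 3), round(sum(leverages), 6), flagged)
```
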
  12. Multicollinearity
    Define **multicollinearity** and list the practical problems it causes in a multiple regression.
    Multicollinearity is a strong **linear** relationship among two or more predictors, so the columns of $X$ are nearly linearly dependent and $X^{\top}X$ is close to singular. Consequences: the coefficient estimates remain unbiased but have **inflated standard errors**, so $t$-statistics shrink and individually significant predictors can appear insignificant; estimates become **unstable** (large swings from small data changes) and can take counterintuitive signs. It does **not** hurt the model's overall predictive fit ($R^2$, predictions) — only the interpretation of individual coefficients.
  13. Multicollinearity
    Define the **variance inflation factor** $\text{VIF}_j$ and state how it is computed.
    $\text{VIF}_j=\dfrac{1}{1-R_j^{2}}$, where $R_j^{2}$ is the $R^2$ from regressing predictor $x_j$ on **all the other predictors**. It is the factor by which the variance of $\hat\beta_j$ is inflated relative to an ideal model with uncorrelated predictors. A high $R_j^2$ (that predictor is well-explained by the others) drives the VIF up. The minimum value is $1$ (when $x_j$ is uncorrelated with the others).
  14. Multicollinearity
    State the common **VIF rule-of-thumb** thresholds and what the corresponding $R_j^2$ values are.
    $\text{VIF}_j>5$ (some use $>10$) flags problematic multicollinearity. Invert $\text{VIF}=\frac{1}{1-R_j^2}$ to read the matching $R_j^2$: $\text{VIF}=5\Rightarrow R_j^2=1-\frac{1}{5}=0.80$, and $\text{VIF}=10\Rightarrow R_j^2=1-\frac{1}{10}=0.90$. So VIF $>5$ means a predictor is more than $80\%$ explained by the others; VIF $=1$ means no collinearity at all.
  15. Multicollinearity
    Regressing $x_2$ on the other predictors gives $R_2^{2}=0.86$. Compute $\text{VIF}_2$ and decide whether it crosses the usual thresholds.
    $\text{VIF}_2=\dfrac{1}{1-R_2^{2}}=\dfrac{1}{1-0.86}=\dfrac{1}{0.14}\approx 7.14$. This **exceeds** the $5$ threshold (so it is flagged under the stricter rule) but is **below** $10$. The standard error of $\hat\beta_2$ is inflated by a factor of $\sqrt{7.14}\approx 2.67$ versus an uncorrelated design.
  16. Multicollinearity
    A predictor has $\text{VIF}_j=9$. Recover the $R_j^2$ and the inflation of the **standard error** of $\hat\beta_j$.
    From $\text{VIF}_j=\frac{1}{1-R_j^2}$: $1-R_j^2=\frac{1}{9}\approx 0.1111$, so $R_j^2=1-0.1111=0.8889$ — about $88.9\%$ of $x_j$ is explained by the other predictors. VIF inflates the **variance** of $\hat\beta_j$ by $9$, so the **standard error** is inflated by $\sqrt{9}=3$. The confidence interval for $\beta_j$ is therefore three times as wide as it would be with an uncorrelated predictor.
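
The VIF conversions in the last few cards can be sketched in Python as follows. The function names are illustrative, not a library API.

```python
import math

def vif_from_r2(r2):
    """VIF_j = 1 / (1 - R_j^2)."""
    return 1 / (1 - r2)

def r2_from_vif(vif):
    """Invert the VIF formula: R_j^2 = 1 - 1 / VIF_j."""
    return 1 - 1 / vif

def se_inflation(vif):
    """The standard error is inflated by the square root of the VIF."""
    return math.sqrt(vif)

vif_2 = vif_from_r2(0.86)   # the R_2^2 = 0.86 example
r2_j = r2_from_vif(9)       # the VIF_j = 9 example
print(f"{vif_2:.2f} {se_inflation(vif_2):.2f} {r2_j:.4f} {se_inflation(9):.1f}")
```
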
  17. Multicollinearity
    List the usual **remedies** for multicollinearity.
    Drop one of the redundant predictors (keep the one that is more interpretable or cheaper to measure). Combine collinear predictors into a single index or use **principal components** as inputs. Collect more data to stabilize the estimates. Use a shrinkage method — **ridge regression** in particular tolerates collinearity well because the $\lambda$ penalty stabilizes $X^{\top}X+\lambda I$. Center/standardize predictors when the collinearity is between a variable and its interaction or polynomial term.
  18. Subset & stepwise selection
    Describe **best-subset selection** and why it is computationally expensive.
    Fit every possible model: all $\binom{p}{k}$ models for each subset size $k=0,1,\dots,p$, then pick the best model of each size by lowest RSS (or highest $R^2$), and finally choose among those $p+1$ contenders using a criterion that penalizes size (cross-validation, $C_p$, AIC, BIC, or adjusted $R^2$). With $p$ predictors there are $2^{p}$ models in total, so the search explodes — $p=20$ already gives over a million fits. This is why **stepwise** methods are used for large $p$.
  19. Subset & stepwise selection
    Contrast **forward** and **backward** stepwise selection, and note when backward is unavailable.
    **Forward stepwise** starts with the null (intercept-only) model and, at each step, **adds** the single predictor that most improves the fit, until a stopping rule triggers. **Backward stepwise** starts with the **full** model and, at each step, **removes** the least useful predictor. Each explores about $1+\frac{p(p+1)}{2}$ models — far fewer than $2^p$. They are greedy, so neither is guaranteed to find the best-subset optimum. Backward requires $n>p$ (the full model must be estimable); forward stepwise works even when $p\ge n$.
  20. Subset & stepwise selection
    How many models does **best-subset** examine versus **forward stepwise** for $p=10$ predictors?
    Best subset fits every subset: $2^{p}=2^{10}=1{,}024$ models. Forward stepwise fits the null model plus, at step $k$ (for $k=0,\dots,p-1$), the $p-k$ candidate additions: $1+\sum_{k=0}^{p-1}(p-k)=1+\frac{p(p+1)}{2}$. For $p=10$: $1+\frac{10\cdot 11}{2}=1+55=56$ models. Forward stepwise examines $56$ versus $1{,}024$ — a large saving that grows with $p$.
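
A quick Python check of these counts. The forward-stepwise count is summed step by step so it can be compared with the closed form $1+\frac{p(p+1)}{2}$. Function names are illustrative.

```python
def best_subset_count(p):
    """Every subset of p predictors is fitted."""
    return 2 ** p

def forward_stepwise_count(p):
    """Null model plus (p - k) candidate fits at step k = 0, ..., p - 1."""
    return 1 + sum(p - k for k in range(p))

for p in (10, 20):
    print(p, best_subset_count(p), forward_stepwise_count(p))
```
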
  21. Subset & stepwise selection
    Why can **forward stepwise** miss the best-subset model? Give the mechanism.
    Forward stepwise is **greedy** and **path-dependent**: once a predictor is added it stays, so an early choice constrains all later ones. If the best two-variable model uses $\{x_2,x_3\}$ but the single best one-variable model is $x_1$, forward selection locks in $x_1$ at step one and can never reach $\{x_2,x_3\}$. Best-subset, which evaluates every combination, would find it. This is the speed-vs-optimality trade-off: stepwise is fast but only explores a single nested path.
  22. Subset & stepwise selection
    Five candidate models (number of predictors $k$ and RSS) are below. By RSS alone the biggest model always wins — explain why, and what to use instead. $k=1$: RSS $=520$; $k=2$: $410$; $k=3$: $360$; $k=4$: $352$; $k=5$: $350$.
    RSS can only **decrease** (never increase) as predictors are added, because the larger model contains the smaller as a special case — so RSS always picks the full model and over-fits. The drops shrink sharply after $k=3$ ($360\to 352\to 350$), suggesting predictors $4$ and $5$ add little. To choose, apply a criterion that **penalizes complexity** — adjusted $R^2$, $C_p$, AIC, or BIC — or use cross-validation, all of which can favor the $k=3$ model over the full model.
  23. Information criteria
    Define **adjusted $R^2$** and explain how its penalty differs from ordinary $R^2$.
    $\bar R^{2}=1-\dfrac{\text{RSS}/(n-k-1)}{\text{TSS}/(n-1)}$, where $k$ is the number of predictors (so $k+1$ parameters with the intercept) and TSS is the total sum of squares. Ordinary $R^2=1-\frac{\text{RSS}}{\text{TSS}}$ never falls when a predictor is added. Adjusted $R^2$ divides RSS by the **degrees of freedom** $n-k-1$, so adding a useless predictor (tiny RSS drop, one fewer df) can **lower** $\bar R^2$. We prefer the model with the **highest** $\bar R^2$.
  24. Information criteria
    Compute **adjusted $R^2$** for a model with $n=30$, $k=4$ predictors, $\text{RSS}=180$, $\text{TSS}=900$.
    $\bar R^{2}=1-\dfrac{\text{RSS}/(n-k-1)}{\text{TSS}/(n-1)}=1-\dfrac{180/(30-4-1)}{900/(30-1)}$. $\dfrac{180}{25}=7.2$ and $\dfrac{900}{29}\approx 31.0345$. $\bar R^{2}=1-\dfrac{7.2}{31.0345}=1-0.23200\approx 0.768$. (For comparison, the unadjusted $R^2=1-\frac{180}{900}=0.80$, so the penalty pulls it down to $0.768$.)
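
The computation above as a minimal Python sketch; `adjusted_r2` is a hypothetical helper name.

```python
def adjusted_r2(rss, tss, n, k):
    """Adjusted R^2 = 1 - [RSS / (n - k - 1)] / [TSS / (n - 1)], k = number of predictors."""
    return 1 - (rss / (n - k - 1)) / (tss / (n - 1))

# Card values: n = 30, k = 4, RSS = 180, TSS = 900
adj = adjusted_r2(rss=180, tss=900, n=30, k=4)
plain = 1 - 180 / 900

print(f"{adj:.3f} {plain:.2f}")
```
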
  25. Information criteria
    State **Mallows' $C_p$** and the value it targets for a good model.
    $C_p=\dfrac{\text{SSE}_p}{\hat\sigma^{2}}-n+2(p+1)$, where $\text{SSE}_p$ is the residual sum of squares of the candidate model with $p$ predictors and $\hat\sigma^2$ is the error-variance estimate from the **full** model. A model that fits well without over-fitting has $C_p\approx p+1$ (close to its number of parameters). We prefer the model with the **smallest** $C_p$; a $C_p$ much larger than $p+1$ signals important omitted predictors (bias).
  26. Information criteria
    A candidate model has $p=4$ predictors and $\text{SSE}_p=315$. The full-model variance estimate is $\hat\sigma^2=9.0$ and $n=40$. Compute **Mallows' $C_p$** and judge it.
    $C_p=\dfrac{\text{SSE}_p}{\hat\sigma^2}-n+2(p+1)=\dfrac{315}{9.0}-40+2(5)$. $\dfrac{315}{9.0}=35.0$ and $2(5)=10$. $C_p=35.0-40+10=5.0$. The target for a good model is $C_p\approx p+1=5$. Here $C_p=5.0$ hits the target exactly, indicating the model fits well with little evidence of omitted-variable bias.
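
A short Python sketch of this calculation; the name `mallows_cp` is illustrative.

```python
def mallows_cp(sse_p, sigma2_full, n, p):
    """C_p = SSE_p / sigma2_full - n + 2 (p + 1), with p predictors in the candidate model."""
    return sse_p / sigma2_full - n + 2 * (p + 1)

# Card values: SSE_p = 315, full-model variance estimate 9.0, n = 40, p = 4
cp = mallows_cp(sse_p=315, sigma2_full=9.0, n=40, p=4)
target = 4 + 1  # p + 1

print(cp, target)
```
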
  27. Information criteria
    State the **AIC** and **BIC** formulas and say which one penalizes complexity more heavily.
    $\text{AIC}=-2\ln L+2k$ and $\text{BIC}=-2\ln L+k\ln n$, where $L$ is the maximized likelihood and $k$ counts the estimated parameters. Both reward fit (small $-2\ln L$) and penalize size; we choose the model with the **lowest** value. Because $\ln n>2$ whenever $n>e^{2}\approx 7.4$, **BIC** charges a steeper per-parameter penalty for any realistic sample size, so it tends to select **smaller** (more parsimonious) models than AIC.
  28. Information criteria
    For a linear model with Gaussian errors, how do AIC and BIC simplify, and how does $\hat\sigma^2$ enter?
    With normal errors the maximized log-likelihood gives $-2\ln L = n\ln(\hat\sigma^2)+\text{const}$ where $\hat\sigma^2=\frac{\text{RSS}}{n}$. Dropping the constant common to all models, so that each equality holds up to that additive constant: $\text{AIC}=n\ln\!\big(\tfrac{\text{RSS}}{n}\big)+2k$ and $\text{BIC}=n\ln\!\big(\tfrac{\text{RSS}}{n}\big)+k\ln n$. Lower RSS lowers the first term; more parameters raise the penalty term. Only **differences** in AIC/BIC between models matter, so the shared constant cancels.
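
To see the RSS form in action, the sketch below applies it to the five RSS values from the subset-selection card (520, 410, 360, 352, 350). Two things here are assumptions, not part of the deck: the sample size `n = 50`, and counting `k` as predictors plus the intercept (the shared error-variance parameter is left out because it adds the same amount to every model).

```python
import math

def gaussian_aic(rss, n, k):
    """AIC up to an additive constant: n ln(RSS / n) + 2k."""
    return n * math.log(rss / n) + 2 * k

def gaussian_bic(rss, n, k):
    """BIC up to an additive constant: n ln(RSS / n) + k ln n."""
    return n * math.log(rss / n) + k * math.log(n)

rss_by_size = {1: 520, 2: 410, 3: 360, 4: 352, 5: 350}  # predictors -> RSS
n = 50  # assumed sample size (not given in the deck)

aic = {size: gaussian_aic(rss, n, size + 1) for size, rss in rss_by_size.items()}
bic = {size: gaussian_bic(rss, n, size + 1) for size, rss in rss_by_size.items()}

best_by_aic = min(aic, key=aic.get)
best_by_bic = min(bic, key=bic.get)
print(best_by_aic, best_by_bic)
```

Under these assumptions both criteria pick the three-predictor model, in line with the shrinking RSS drops after $k=3$.
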
  29. Information criteria
    Two nested models fit to $n=100$ data points have $-2\ln L$ values of $310$ (model A, $k=3$) and $305$ (model B, $k=6$). Choose between them by **AIC** and by **BIC**.
    **AIC** $=-2\ln L+2k$: Model A $=310+2(3)=316$; Model B $=305+2(6)=317$. Lower AIC is **A** (by $1$). **BIC** $=-2\ln L+k\ln n$ with $\ln 100\approx 4.60517$: Model A $=310+3(4.60517)=310+13.816=323.82$; Model B $=305+6(4.60517)=305+27.631=332.63$. Lower BIC is **A** (by about $8.82$). Both criteria prefer the smaller model A; BIC's larger penalty widens the margin.
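
The comparison can be reproduced with a few lines of Python. The helpers take $-2\ln L$ directly, as the card does; their names are illustrative.

```python
import math

def aic(neg2_loglik, k):
    """AIC = -2 ln L + 2k."""
    return neg2_loglik + 2 * k

def bic(neg2_loglik, k, n):
    """BIC = -2 ln L + k ln n."""
    return neg2_loglik + k * math.log(n)

n = 100
models = {"A": (310, 3), "B": (305, 6)}  # name -> (-2 ln L, k)

aic_vals = {name: aic(dev, k) for name, (dev, k) in models.items()}
bic_vals = {name: bic(dev, k, n) for name, (dev, k) in models.items()}

print(aic_vals, {m: round(v, 2) for m, v in bic_vals.items()})
print(min(aic_vals, key=aic_vals.get), min(bic_vals, key=bic_vals.get))
```
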
  30. Information criteria
    Model B improves $-2\ln L$ by $6.0$ over model A but uses $4$ more parameters, with $n=50$. Does **AIC** favor adding them? Does **BIC**?
    Compare the **change** in each criterion ($\Delta = \Delta(-2\ln L)+\text{penalty}$); adding helps only if the criterion falls. Fit term changes by $-6.0$. AIC penalty for $4$ extra params $=2(4)=8$, so $\Delta\text{AIC}=-6.0+8=+2.0>0$ — AIC says **do not** add them. BIC penalty $=4\ln 50=4(3.91202)=15.648$, so $\Delta\text{BIC}=-6.0+15.648=+9.65>0$ — BIC also says **do not** add them, even more strongly.
  31. Cross-validation & regularization
    Describe the **validation-set approach** and its two main drawbacks.
    Randomly split the data once into a **training** set and a held-out **validation (test)** set. Fit on the training set, predict on the validation set, and estimate test error as the validation MSE. Drawbacks: (1) the estimate is **highly variable** — it depends on the particular random split, so a different split gives a different number; (2) because only part of the data trains the model, it tends to **overestimate** the test error a model fit on the full data set would have (training on fewer points hurts fit).
  32. Cross-validation & regularization
    Explain **$k$-fold cross-validation** and the standard estimate of test error.
    Split the data into $k$ roughly equal folds. For each fold $j=1,\dots,k$: train on the other $k-1$ folds and compute the validation error $\text{MSE}_j$ on fold $j$. Each observation is used for validation **exactly once**. The CV estimate is the average $\text{CV}_{(k)}=\frac{1}{k}\sum_{j=1}^{k}\text{MSE}_j$. Common choices are $k=5$ or $k=10$, which give a good bias-variance compromise and far less computation than LOOCV.
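
A minimal sketch of the $k$-fold procedure for a one-predictor least-squares fit, in plain Python. The data are made up, and the folds are assigned by index modulo $k$ purely to keep the example deterministic; in practice the observations would usually be shuffled first.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def k_fold_cv(xs, ys, k):
    """Return (average of the k fold MSEs, list of folds as index lists)."""
    n = len(xs)
    folds = [list(range(j, n, k)) for j in range(k)]  # index i goes to fold i % k
    fold_mses = []
    for fold in folds:
        held_out = set(fold)
        x_train = [x for i, x in enumerate(xs) if i not in held_out]
        y_train = [y for i, y in enumerate(ys) if i not in held_out]
        b0, b1 = fit_line(x_train, y_train)          # train on the other k - 1 folds
        mse_j = sum((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold) / len(fold)
        fold_mses.append(mse_j)
    return sum(fold_mses) / k, folds

# Hypothetical data (not from the deck)
xs = [float(i) for i in range(1, 11)]
ys = [3.1, 4.8, 7.2, 9.1, 10.9, 13.2, 14.8, 17.1, 19.0, 21.2]

cv_5, folds = k_fold_cv(xs, ys, k=5)
print(round(cv_5, 4), folds)
```
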
  33. Cross-validation & regularization
    What is **LOOCV**, and how does it compare with $k$-fold CV on bias and variance?
    Leave-one-out CV is $k$-fold with $k=n$: each fold is a single observation. Fit on the other $n-1$ points, predict the held-out point, and average the $n$ squared errors. **Bias:** LOOCV trains on nearly all the data, so its error estimate is almost unbiased — lower bias than $5$- or $10$-fold. **Variance:** the $n$ training sets are nearly identical (highly correlated fits), so the averaged estimate has **higher variance** than $10$-fold. It is also $n$ times more computation (except for the linear-model shortcut).
  34. Cross-validation & regularization
    State the **LOOCV shortcut** for least-squares linear models that avoids refitting $n$ times.
    $\text{CV}_{(n)}=\dfrac{1}{n}\sum_{i=1}^{n}\left(\dfrac{y_i-\hat y_i}{1-h_{ii}}\right)^{2}$. Here $\hat y_i$ and the leverages $h_{ii}$ come from a **single** fit on the full data. The deleted residual is recovered as $\frac{e_i}{1-h_{ii}}$, so the entire LOOCV error is obtained without ever refitting. High-leverage points (large $h_{ii}$) get up-weighted, since deleting them changes the fit the most.
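
As a sanity check on the shortcut, the sketch below computes LOOCV two ways for a simple linear regression: once from a single fit using $\frac{e_i}{1-h_{ii}}$, and once by actually refitting $n$ times. The data are hypothetical, and the leverage formula used is the simple-regression one, $h_{ii}=\frac{1}{n}+\frac{(x_i-\bar x)^2}{S_{xx}}$.

```python
def fit_line(xs, ys):
    """Return intercept, slope, x_bar, and S_xx for a one-predictor least-squares fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1, x_bar, s_xx

def loocv_shortcut(xs, ys):
    """LOOCV from one full-data fit: mean of (e_i / (1 - h_ii))^2."""
    n = len(xs)
    b0, b1, x_bar, s_xx = fit_line(xs, ys)
    total = 0.0
    for x, y in zip(xs, ys):
        e = y - (b0 + b1 * x)
        h = 1 / n + (x - x_bar) ** 2 / s_xx
        total += (e / (1 - h)) ** 2
    return total / n

def loocv_refit(xs, ys):
    """LOOCV by brute force: drop each point, refit, predict the dropped point."""
    n = len(xs)
    total = 0.0
    for i in range(n):
        b0, b1, _, _ = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (b0 + b1 * xs[i])) ** 2
    return total / n

# Hypothetical data (not from the deck)
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 8.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 15.0]

shortcut = loocv_shortcut(xs, ys)
brute_force = loocv_refit(xs, ys)
print(abs(shortcut - brute_force) < 1e-9)
```
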
  35. Cross-validation & regularization
    A $3$-observation least-squares fit has residuals and leverages: $(e_1,h_{11})=(2.0,0.40)$, $(e_2,h_{22})=(-1.0,0.30)$, $(e_3,h_{33})=(1.5,0.50)$. Compute the **LOOCV** error.
    $\text{CV}_{(n)}=\dfrac{1}{n}\sum\left(\dfrac{e_i}{1-h_{ii}}\right)^2$. Term 1: $\left(\frac{2.0}{1-0.40}\right)^2=\left(\frac{2.0}{0.60}\right)^2=(3.3333)^2\approx 11.1111$. Term 2: $\left(\frac{-1.0}{0.70}\right)^2=(-1.42857)^2\approx 2.0408$. Term 3: $\left(\frac{1.5}{0.50}\right)^2=(3.0)^2=9.0000$. $\text{CV}_{(n)}=\frac{1}{3}(11.1111+2.0408+9.0000)=\frac{22.1519}{3}\approx 7.384$.
  36. Cross-validation & regularization
    A least-squares point has raw residual $e_i=3.0$ and leverage $h_{ii}=0.25$. Find its **deleted (LOOCV) residual** and contrast it with the raw residual.
    The deleted residual is $\dfrac{e_i}{1-h_{ii}}=\dfrac{3.0}{1-0.25}=\dfrac{3.0}{0.75}=4.0$. It is **larger** in magnitude than the raw $3.0$ because removing the point lets the line move away from it, enlarging the prediction error. The higher the leverage, the bigger this inflation — at $h_{ii}=0.25$ the residual grows by a factor $\frac{1}{0.75}\approx 1.33$.
  37. Cross-validation & regularization
    Contrast **ridge** and **lasso** regression: their penalties, and which one performs variable selection.
    Both minimize $\text{RSS}$ plus a penalty term scaled by a tuning parameter $\lambda\ge 0$. **Ridge (L2):** penalty $\lambda\sum_{j=1}^{p}\beta_j^{2}$. Shrinks coefficients toward $0$ but never exactly to $0$, so it keeps **all** predictors. **Lasso (L1):** penalty $\lambda\sum_{j=1}^{p}|\beta_j|$. Its corner geometry forces some coefficients to **exactly $0$**, so it performs automatic **variable selection** and yields sparse, more interpretable models. Neither penalizes the intercept $\beta_0$.
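
The shrink-versus-select contrast is easiest to see in a special case that is not part of this card and is assumed here only for illustration: $n=p$, the design matrix is the identity, and there is no intercept, so the objective is $\sum_j (y_j-\beta_j)^2$ plus the penalty. In that case ridge scales each coefficient by $\frac{1}{1+\lambda}$, while lasso soft-thresholds at $\frac{\lambda}{2}$. A Python sketch with made-up values:

```python
def ridge_identity_design(y, lam):
    """Minimizer of sum (y_j - b_j)^2 + lam * sum b_j^2: proportional shrinkage."""
    return [yj / (1 + lam) for yj in y]

def lasso_identity_design(y, lam):
    """Minimizer of sum (y_j - b_j)^2 + lam * sum |b_j|: soft-thresholding at lam / 2."""
    coefs = []
    for yj in y:
        if yj > lam / 2:
            coefs.append(yj - lam / 2)
        elif yj < -lam / 2:
            coefs.append(yj + lam / 2)
        else:
            coefs.append(0.0)  # small coefficients are set exactly to zero
    return coefs

y = [3.0, -0.4, 0.2, -2.5]  # hypothetical least-squares coefficients
lam = 1.0

ridge = ridge_identity_design(y, lam)
lasso = lasso_identity_design(y, lam)
print(ridge, lasso)
```

Every ridge coefficient stays nonzero, while the two small lasso coefficients are dropped.
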
  38. Cross-validation & regularization
    Why must predictors be **standardized** before applying ridge or lasso?
    The penalty $\sum\beta_j^2$ or $\sum|\beta_j|$ adds up the coefficients on whatever scale each predictor happens to use. A predictor recorded in large units takes numerically small values, so it needs a large coefficient and is penalized more heavily; the fit would therefore depend on arbitrary units (the coefficient per thousand dollars is $1{,}000$ times the coefficient per dollar). Standardizing each predictor to mean $0$ and SD $1$ (e.g. $\tilde x_{ij}=\frac{x_{ij}-\bar x_j}{s_j}$) puts every coefficient on a common scale, so the penalty treats predictors even-handedly. Ordinary least squares, by contrast, is scale-invariant: rescaling a predictor rescales its coefficient but leaves the fitted values unchanged, so it needs no standardization.
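
A small Python sketch of the unit dependence, using made-up income data and a single predictor. It shows the least-squares slope changing by a factor of $1{,}000$ when dollars are re-expressed as thousands of dollars (so a squared penalty on it changes by $10^6$), and the slope on the standardized predictor coming out the same either way.

```python
import math

def ols_slope(xs, ys):
    """Least-squares slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / s_xx

def standardize(xs):
    """Center to mean 0 and scale to sample SD 1."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    return [(x - mean) / sd for x in xs]

# Hypothetical data (not from the deck)
income_dollars = [20000.0, 35000.0, 50000.0, 65000.0, 80000.0]
income_thousands = [x / 1000 for x in income_dollars]
y = [1.2, 1.9, 2.4, 3.1, 3.8]

slope_dollars = ols_slope(income_dollars, y)
slope_thousands = ols_slope(income_thousands, y)       # 1000 times larger
slope_std_dollars = ols_slope(standardize(income_dollars), y)
slope_std_thousands = ols_slope(standardize(income_thousands), y)

print(slope_thousands / slope_dollars)
print(abs(slope_std_dollars - slope_std_thousands) < 1e-9)
```
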
  39. Cross-validation & regularization
    Explain the **bias-variance trade-off** as the ridge/lasso tuning parameter $\lambda$ increases from $0$.
    At $\lambda=0$ the penalty vanishes and you get ordinary least squares: **low bias, high variance**. As $\lambda\to\infty$ coefficients shrink toward $0$ (lasso sets them to $0$), approaching the null model: **low variance, high bias**. Increasing $\lambda$ trades a small rise in bias for a larger drop in variance, so test MSE typically falls then rises — a U-shape. The best $\lambda$ (minimizing test MSE) is chosen by **cross-validation**.
  40. Cross-validation & regularization
    Ridge regression is fit at three $\lambda$ values, with $5$-fold CV mean-squared errors below. Pick the tuning parameter. $\lambda=0.1$: CV-MSE $=24.5$; $\lambda=1$: $21.2$; $\lambda=10$: $26.8$.
    Cross-validation selects the $\lambda$ with the **lowest** CV error. Comparing $24.5$, $21.2$, $26.8$, the minimum is at $\lambda=1$ (CV-MSE $=21.2$). The U-shape is visible: too little penalty ($\lambda=0.1$) leaves the fit too variable, and too much ($\lambda=10$) over-shrinks (high bias), both raising CV error above the $\lambda=1$ minimum. So choose $\lambda=1$.
  41. Residual diagnostics
    Explain how a **non-constant variance** pattern in residuals affects inference and what a common fix is.
    Heteroscedasticity (a funnel-shaped residual-vs-fitted plot) leaves the OLS coefficient estimates **unbiased** but makes the usual standard-error formulas **wrong**, so $t$-tests, $F$-tests, and confidence intervals are no longer valid. Fixes: apply a **variance-stabilizing transformation** of the response such as $\ln y$ or $\sqrt{y}$ (often when variance grows with the mean), use **weighted least squares**, or report heteroscedasticity-robust standard errors. After transforming, re-check the residual plot to confirm the funnel is gone.
  42. Leverage & influence
    In a model with $p=5$ predictors and $n=60$, the leverages of three points are $0.05$, $0.14$, and $0.22$. Which exceed the high-leverage threshold $2\frac{p+1}{n}$?
    Average leverage $=\frac{p+1}{n}=\frac{6}{60}=0.10$; threshold $=2(0.10)=0.20$. Compare each: $0.05<0.20$ (not flagged), $0.14<0.20$ (not flagged), $0.22>0.20$ (**flagged** as high leverage). Only the third point has unusually extreme predictor values; it would warrant a Cook's-distance check to see whether it is also influential.
  43. Information criteria
    Two models for the same data give: Model A ($k=2$) RSS $=240$; Model B ($k=4$) RSS $=210$, with $n=40$ and TSS $=600$. Compare them by **adjusted $R^2$**.
    $\bar R^2=1-\dfrac{\text{RSS}/(n-k-1)}{\text{TSS}/(n-1)}$ with $\frac{\text{TSS}}{n-1}=\frac{600}{39}\approx 15.3846$. **Model A:** $\frac{240}{40-2-1}=\frac{240}{37}\approx 6.4865$; $\bar R^2=1-\frac{6.4865}{15.3846}=1-0.42162\approx 0.578$. **Model B:** $\frac{210}{40-4-1}=\frac{210}{35}=6.0$; $\bar R^2=1-\frac{6.0}{15.3846}=1-0.39000\approx 0.610$. Model B has the higher adjusted $R^2$ ($0.610>0.578$), so the two extra predictors earn their place.
  44. Cross-validation & regularization
    Why is **cross-validation** often preferred over $C_p$, AIC, and BIC for model selection?
    $C_p$, AIC, and BIC estimate test error **indirectly** through analytic penalties that assume a known error structure and require an estimate of $\sigma^2$ (typically from the full model) and a count of effective parameters — assumptions that can fail for flexible or non-likelihood methods. Cross-validation estimates the test error **directly** by holding out data, making essentially no model assumptions and needing no $\sigma^2$ estimate or degrees-of-freedom count. It applies to any prediction method (trees, KNN, regularized fits), which is why it is the default when those assumptions are doubtful — at the cost of more computation.