{
  "deckName": "Exam SRM — Model Selection & Diagnostics",
  "examCode": "Exam SRM",
  "cards": [
    {
      "front": "What does a **residual-vs-fitted** plot diagnose, and what patterns signal trouble?",
      "back": "Plot the residuals $e_i=y_i-\\hat y_i$ on the vertical axis against the fitted values $\\hat y_i$ on the horizontal axis.\nA well-specified linear model shows a **random horizontal band** centered at $0$ with constant spread.\n**Curvature** (a U-shape or arch) signals a missing nonlinear term — the linearity assumption fails.\nA **funnel/fan** shape (spread growing or shrinking with $\\hat y_i$) signals **non-constant variance** (heteroscedasticity).",
      "tag": "Residual diagnostics"
    },
    {
      "front": "What is a **normal Q-Q plot** of residuals used for, and how do you read departures from the line?",
      "back": "It plots the ordered standardized residuals against the corresponding quantiles of a standard normal. If the errors are normal, the points fall near the $45^{\\circ}$ reference line.\n**Heavy tails** bend the points away from the line at both ends (low points below, high points above).\n**Right skew** bows the points into a convex curve, with both ends above the line and the upper end furthest from it; **left skew** gives the mirror image, with both ends below the line.\nNormality of errors matters for the validity of $t$- and $F$-based inference, especially in small samples.",
      "tag": "Residual diagnostics"
    },
    {
      "front": "Distinguish a raw **residual**, a **standardized** residual, and a **studentized** (deleted) residual.",
      "back": "Raw residual: $e_i=y_i-\\hat y_i$, on the scale of $y$.\n**Standardized (internally studentized):** $r_i=\\frac{e_i}{\\hat\\sigma\\sqrt{1-h_{ii}}}$ — divides by the estimated SD of that residual, which depends on its leverage $h_{ii}$.\n**Studentized (externally studentized / deleted):** $t_i=\\frac{e_i}{\\hat\\sigma_{(i)}\\sqrt{1-h_{ii}}}$, where $\\hat\\sigma_{(i)}$ is estimated with observation $i$ **removed**. This makes a true outlier stand out because the point can't inflate its own scale estimate.\nValues with $|r_i|$ or $|t_i|>2$ (or $3$) flag possible outliers.",
      "tag": "Residual diagnostics"
    },
    {
      "front": "A fitted regression has $\\hat\\sigma=4.0$. Observation $i$ has residual $e_i=9.2$ and leverage $h_{ii}=0.16$. Compute its **standardized** residual and judge whether it is an outlier.",
      "back": "$r_i=\\dfrac{e_i}{\\hat\\sigma\\sqrt{1-h_{ii}}}=\\dfrac{9.2}{4.0\\sqrt{1-0.16}}$.\n$\\sqrt{0.84}\\approx 0.916515$, so the denominator is $4.0(0.916515)\\approx 3.666060$.\n$r_i=\\dfrac{9.2}{3.666060}\\approx 2.510$.\nSince $|r_i|>2$, observation $i$ is a candidate **outlier** in the $y$-direction.",
      "tag": "Residual diagnostics"
    },
    {
      "front": "Why do we standardize residuals by $\\hat\\sigma\\sqrt{1-h_{ii}}$ rather than just by $\\hat\\sigma$?",
      "back": "Even when the true errors $\\epsilon_i$ all have variance $\\sigma^2$, the **fitted** residuals do not: $\\operatorname{Var}(e_i)=\\sigma^2(1-h_{ii})$. High-leverage points pull the fitted line toward themselves, shrinking their own residual variance.\nDividing by $\\hat\\sigma\\sqrt{1-h_{ii}}$ puts every residual on a common unit-variance scale, so a fixed cutoff like $2$ means the same thing at every leverage. Without the $(1-h_{ii})$ correction a high-leverage outlier can hide behind a deceptively small raw residual.",
      "tag": "Residual diagnostics"
    },
    {
      "front": "Define the **leverage** $h_{ii}$ of an observation and state its key properties ($\\sum h_{ii}$, average, range).",
      "back": "Leverage is the $i$-th diagonal of the hat matrix $H=X(X^{\\top}X)^{-1}X^{\\top}$, which satisfies $\\hat y=Hy$; $h_{ii}=\\partial\\hat y_i/\\partial y_i$ measures how much $y_i$ pulls its own fit.\nProperties: $0\\le h_{ii}\\le 1$ (and $h_{ii}\\ge\\frac{1}{n}$ when the model includes an intercept). The leverages sum to the number of estimated coefficients, $\\sum_{i=1}^{n} h_{ii}=p+1$ (intercept plus $p$ slopes), so the **average** leverage is $\\frac{p+1}{n}$.\nLeverage depends only on the $X$-values, not on $y$: it measures how unusual a point's predictor values are.",
      "tag": "Leverage & influence"
    },
    {
      "front": "State the common **rule of thumb** for flagging a high-leverage point and apply it: a model with $p=3$ predictors fit to $n=50$ observations.",
      "back": "Average leverage is $\\frac{p+1}{n}$. A common rule flags a point as high-leverage when $h_{ii}>2\\cdot\\frac{p+1}{n}$ (some texts use $3\\times$).\nHere $p+1=4$ and $n=50$, so the average leverage is $\\frac{4}{50}=0.08$.\nThreshold $=2(0.08)=0.16$. Any observation with $h_{ii}>0.16$ is flagged as having unusually extreme predictor values.",
      "tag": "Leverage & influence"
    },
    {
      "front": "Distinguish an **outlier**, a **high-leverage** point, and an **influential** point.",
      "back": "An **outlier** has an unusual **response** $y$ given its predictors — a large studentized residual.\nA **high-leverage** point has unusual **predictor** values (large $h_{ii}$), regardless of its $y$.\nAn **influential** point is one whose removal substantially changes the fitted coefficients. Influence combines the two: a point is most influential when it has **both** high leverage **and** a large residual. Cook's distance measures this combined effect.",
      "tag": "Leverage & influence"
    },
    {
      "front": "State **Cook's distance** $D_i$ in terms of the standardized residual and leverage, and the usual cutoff.",
      "back": "$D_i=\\dfrac{r_i^{2}}{p+1}\\cdot\\dfrac{h_{ii}}{1-h_{ii}}$, where $r_i$ is the standardized residual, $h_{ii}$ the leverage, and $p+1$ the number of estimated coefficients.\n$D_i$ measures how much **all** fitted values shift when observation $i$ is deleted — it blends outlyingness ($r_i^2$) with leverage ($\\frac{h_{ii}}{1-h_{ii}}$).\nA common rule flags $D_i>1$ (or $D_i>\\frac{4}{n}$) as influential.",
      "tag": "Leverage & influence"
    },
    {
      "front": "An observation in a model with $p=3$ predictors has standardized residual $r_i=2.2$ and leverage $h_{ii}=0.30$. Compute **Cook's distance** and judge influence.",
      "back": "$D_i=\\dfrac{r_i^{2}}{p+1}\\cdot\\dfrac{h_{ii}}{1-h_{ii}}=\\dfrac{2.2^{2}}{4}\\cdot\\dfrac{0.30}{0.70}$.\n$\\dfrac{4.84}{4}=1.21$ and $\\dfrac{0.30}{0.70}\\approx 0.428571$.\n$D_i=1.21(0.428571)\\approx 0.519$.\nSince $D_i<1$, the usual cutoff does **not** flag the point as strongly influential. The stricter $\\frac{4}{n}$ rule would flag it for any $n\\ge 8$ (since $\\frac{4}{8}=0.5<0.519$), so it still warrants a look.",
      "tag": "Leverage & influence"
    },
    {
      "front": "A simple regression has $n=20$, $\\bar x=10$, and $S_{xx}=\\sum(x_i-\\bar x)^2=160$. Find the **leverage** of a point with $x_i=16$.",
      "back": "For simple linear regression, $h_{ii}=\\dfrac{1}{n}+\\dfrac{(x_i-\\bar x)^2}{S_{xx}}$.\n$(x_i-\\bar x)^2=(16-10)^2=36$.\n$h_{ii}=\\dfrac{1}{20}+\\dfrac{36}{160}=0.05+0.225=0.275$.\nThe average leverage here is $\\frac{p+1}{n}=\\frac{2}{20}=0.10$, and $2(0.10)=0.20<0.275$, so this point is flagged as **high leverage**.",
      "tag": "Leverage & influence"
    },
    {
      "front": "Define **multicollinearity** and list the practical problems it causes in a multiple regression.",
      "back": "Multicollinearity is a strong **linear** relationship among two or more predictors, so the columns of $X$ are nearly linearly dependent and $X^{\\top}X$ is close to singular.\nConsequences: the coefficient estimates remain unbiased but have **inflated standard errors**, so $t$-statistics shrink and predictors that truly matter can appear insignificant; estimates become **unstable** (large swings from small data changes) and can take counterintuitive signs.\nIt does **not** hurt the model's overall fit ($R^2$) or its predictions within the range of the observed data; it harms only the interpretation of individual coefficients.",
      "tag": "Multicollinearity"
    },
    {
      "front": "Define the **variance inflation factor** $\\text{VIF}_j$ and state how it is computed.",
      "back": "$\\text{VIF}_j=\\dfrac{1}{1-R_j^{2}}$, where $R_j^{2}$ is the $R^2$ from regressing predictor $x_j$ on **all the other predictors**.\nIt is the factor by which the variance of $\\hat\\beta_j$ is inflated relative to an ideal model with uncorrelated predictors. A high $R_j^2$ (that predictor is well-explained by the others) drives the VIF up.\nThe minimum value is $1$ (when $x_j$ is uncorrelated with the others).",
      "tag": "Multicollinearity"
    },
    {
      "front": "State the common **VIF rule-of-thumb** thresholds and what the corresponding $R_j^2$ values are.",
      "back": "$\\text{VIF}_j>5$ (some use $>10$) flags problematic multicollinearity.\nInvert $\\text{VIF}=\\frac{1}{1-R_j^2}$ to read the matching $R_j^2$: $\\text{VIF}=5\\Rightarrow R_j^2=1-\\frac{1}{5}=0.80$, and $\\text{VIF}=10\\Rightarrow R_j^2=1-\\frac{1}{10}=0.90$.\nSo VIF $>5$ means a predictor is more than $80\\%$ explained by the others; VIF $=1$ means no collinearity at all.",
      "tag": "Multicollinearity"
    },
    {
      "front": "Regressing $x_2$ on the other predictors gives $R_2^{2}=0.86$. Compute $\\text{VIF}_2$ and decide whether it crosses the usual thresholds.",
      "back": "$\\text{VIF}_2=\\dfrac{1}{1-R_2^{2}}=\\dfrac{1}{1-0.86}=\\dfrac{1}{0.14}\\approx 7.14$.\nThis **exceeds** the $5$ threshold (so it is flagged under the stricter rule) but is **below** $10$. The standard error of $\\hat\\beta_2$ is inflated by a factor of $\\sqrt{7.14}\\approx 2.67$ versus an uncorrelated design.",
      "tag": "Multicollinearity"
    },
    {
      "front": "A predictor has $\\text{VIF}_j=9$. Recover the $R_j^2$ and the inflation of the **standard error** of $\\hat\\beta_j$.",
      "back": "From $\\text{VIF}_j=\\frac{1}{1-R_j^2}$: $1-R_j^2=\\frac{1}{9}\\approx 0.1111$, so $R_j^2=1-0.1111=0.8889$ — about $88.9\\%$ of $x_j$ is explained by the other predictors.\nVIF inflates the **variance** of $\\hat\\beta_j$ by $9$, so the **standard error** is inflated by $\\sqrt{9}=3$. The confidence interval for $\\beta_j$ is therefore three times as wide as it would be with an uncorrelated predictor.",
      "tag": "Multicollinearity"
    },
    {
      "front": "List the usual **remedies** for multicollinearity.",
      "back": "Drop one of the redundant predictors (keep the one that is more interpretable or cheaper to measure).\nCombine collinear predictors into a single index or use **principal components** as inputs.\nCollect more data to stabilize the estimates.\nUse a shrinkage method — **ridge regression** in particular tolerates collinearity well because the $\\lambda$ penalty stabilizes $X^{\\top}X+\\lambda I$.\nCenter/standardize predictors when the collinearity is between a variable and its interaction or polynomial term.",
      "tag": "Multicollinearity"
    },
    {
      "front": "Describe **best-subset selection** and why it is computationally expensive.",
      "back": "Fit every possible model: all $\\binom{p}{k}$ models for each subset size $k=0,1,\\dots,p$, then pick the best model of each size by lowest RSS (or highest $R^2$), and finally choose among those $p+1$ contenders using a criterion that penalizes size (cross-validation, $C_p$, AIC, BIC, or adjusted $R^2$).\nWith $p$ predictors there are $2^{p}$ models in total, so the search explodes — $p=20$ already gives over a million fits. This is why **stepwise** methods are used for large $p$.",
      "tag": "Subset & stepwise selection"
    },
    {
      "front": "Contrast **forward** and **backward** stepwise selection, and note when backward is unavailable.",
      "back": "**Forward stepwise** starts with the null (intercept-only) model and, at each step, **adds** the single predictor that most improves the fit, until a stopping rule triggers.\n**Backward stepwise** starts with the **full** model and, at each step, **removes** the least useful predictor.\nEach explores about $1+\\frac{p(p+1)}{2}$ models — far fewer than $2^p$. They are greedy, so neither is guaranteed to find the best-subset optimum.\nBackward requires $n>p$ (the full model must be estimable); forward stepwise works even when $p\\ge n$.",
      "tag": "Subset & stepwise selection"
    },
    {
      "front": "How many models does **best-subset** examine versus **forward stepwise** for $p=10$ predictors?",
      "back": "Best subset fits every subset: $2^{p}=2^{10}=1{,}024$ models.\nForward stepwise fits the null model plus, at step $k$ (for $k=0,\\dots,p-1$), the $p-k$ candidate additions: $1+\\sum_{k=0}^{p-1}(p-k)=1+\\frac{p(p+1)}{2}$.\nFor $p=10$: $1+\\frac{10\\cdot 11}{2}=1+55=56$ models.\nForward stepwise examines $56$ versus $1{,}024$ — a large saving that grows with $p$.",
      "tag": "Subset & stepwise selection"
    },
    {
      "front": "Why can **forward stepwise** miss the best-subset model? Give the mechanism.",
      "back": "Forward stepwise is **greedy** and **path-dependent**: once a predictor is added it stays, so an early choice constrains all later ones.\nIf the best two-variable model uses $\\{x_2,x_3\\}$ but the single best one-variable model is $x_1$, forward selection locks in $x_1$ at step one and can never reach $\\{x_2,x_3\\}$. Best-subset, which evaluates every combination, would find it.\nThis is the speed-vs-optimality trade-off: stepwise is fast but only explores a single nested path.",
      "tag": "Subset & stepwise selection"
    },
    {
      "front": "Five candidate models (number of predictors $k$ and RSS) are below. By RSS alone the biggest model always wins — explain why, and what to use instead.\n$k=1$: RSS $=520$; $k=2$: $410$; $k=3$: $360$; $k=4$: $352$; $k=5$: $350$.",
      "back": "RSS can only **decrease or stay the same** (never increase) as predictors are added, because the larger model contains the smaller as a special case. Choosing by RSS alone therefore always picks the full model and over-fits.\nThe drops shrink sharply after $k=3$ ($360\\to 352\\to 350$), suggesting predictors $4$ and $5$ add little. To choose, apply a criterion that **penalizes complexity** (adjusted $R^2$, $C_p$, AIC, or BIC) or use cross-validation; any of these can favor the $k=3$ model over the full model.",
      "tag": "Subset & stepwise selection"
    },
    {
      "front": "Define **adjusted $R^2$** and explain how its penalty differs from ordinary $R^2$.",
      "back": "$\\bar R^{2}=1-\\dfrac{\\text{RSS}/(n-k-1)}{\\text{TSS}/(n-1)}$, where $k$ is the number of predictors (so $k+1$ parameters with the intercept) and TSS is the total sum of squares.\nOrdinary $R^2=1-\\frac{\\text{RSS}}{\\text{TSS}}$ never falls when a predictor is added. Adjusted $R^2$ divides RSS by the **degrees of freedom** $n-k-1$, so adding a useless predictor (tiny RSS drop, one fewer df) can **lower** $\\bar R^2$. We prefer the model with the **highest** $\\bar R^2$.",
      "tag": "Information criteria"
    },
    {
      "front": "Compute **adjusted $R^2$** for a model with $n=30$, $k=4$ predictors, $\\text{RSS}=180$, $\\text{TSS}=900$.",
      "back": "$\\bar R^{2}=1-\\dfrac{\\text{RSS}/(n-k-1)}{\\text{TSS}/(n-1)}=1-\\dfrac{180/(30-4-1)}{900/(30-1)}$.\n$\\dfrac{180}{25}=7.2$ and $\\dfrac{900}{29}\\approx 31.0345$.\n$\\bar R^{2}=1-\\dfrac{7.2}{31.0345}=1-0.23200\\approx 0.768$.\n(For comparison, the unadjusted $R^2=1-\\frac{180}{900}=0.80$, so the penalty pulls it down to $0.768$.)",
      "tag": "Information criteria"
    },
    {
      "front": "State **Mallows' $C_p$** and the value it targets for a good model.",
      "back": "$C_p=\\dfrac{\\text{SSE}_p}{\\hat\\sigma^{2}}-n+2(p+1)$, where $\\text{SSE}_p$ is the residual sum of squares of the candidate model with $p$ predictors and $\\hat\\sigma^2$ is the error-variance estimate from the **full** model.\nA model that fits well without over-fitting has $C_p\\approx p+1$ (close to its number of parameters). We prefer the model with the **smallest** $C_p$; a $C_p$ much larger than $p+1$ signals important omitted predictors (bias).",
      "tag": "Information criteria"
    },
    {
      "front": "A candidate model has $p=4$ predictors and $\\text{SSE}_p=315$. The full-model variance estimate is $\\hat\\sigma^2=9.0$ and $n=40$. Compute **Mallows' $C_p$** and judge it.",
      "back": "$C_p=\\dfrac{\\text{SSE}_p}{\\hat\\sigma^2}-n+2(p+1)=\\dfrac{315}{9.0}-40+2(5)$.\n$\\dfrac{315}{9.0}=35.0$ and $2(5)=10$.\n$C_p=35.0-40+10=5.0$.\nThe target for a good model is $C_p\\approx p+1=5$. Here $C_p\\approx 5$ matches almost exactly, indicating the model fits well with little evidence of omitted-variable bias.",
      "tag": "Information criteria"
    },
    {
      "front": "State the **AIC** and **BIC** formulas and say which one penalizes complexity more heavily.",
      "back": "$\\text{AIC}=-2\\ln L+2k$ and $\\text{BIC}=-2\\ln L+k\\ln n$, where $L$ is the maximized likelihood and $k$ counts the estimated parameters.\nBoth reward fit (small $-2\\ln L$) and penalize size; we choose the model with the **lowest** value.\nBecause $\\ln n>2$ whenever $n>e^{2}\\approx 7.4$, **BIC** charges a steeper per-parameter penalty for any realistic sample size, so it tends to select **smaller** (more parsimonious) models than AIC.",
      "tag": "Information criteria"
    },
    {
      "front": "For a linear model with Gaussian errors, how do AIC and BIC simplify, and how does $\\hat\\sigma^2$ enter?",
      "back": "With normal errors the maximized log-likelihood gives $-2\\ln L = n\\ln(\\hat\\sigma^2)+\\text{const}$ where $\\hat\\sigma^2=\\frac{\\text{RSS}}{n}$. Dropping the constant common to all models:\n$\\text{AIC}=n\\ln\\!\\big(\\tfrac{\\text{RSS}}{n}\\big)+2k$ and $\\text{BIC}=n\\ln\\!\\big(\\tfrac{\\text{RSS}}{n}\\big)+k\\ln n$, each up to that additive constant.\nLower RSS lowers the first term; more parameters raise the penalty term. Only **differences** in AIC/BIC between models matter, so the shared constant cancels.",
      "tag": "Information criteria"
    },
    {
      "front": "Two nested models fit to $n=100$ data points have $-2\\ln L$ values of $310$ (model A, $k=3$) and $305$ (model B, $k=6$). Choose between them by **AIC** and by **BIC**.",
      "back": "**AIC** $=-2\\ln L+2k$:\nModel A $=310+2(3)=316$; Model B $=305+2(6)=317$. Lower AIC is **A** (by $1$).\n**BIC** $=-2\\ln L+k\\ln n$ with $\\ln 100\\approx 4.60517$:\nModel A $=310+3(4.60517)=310+13.815=323.82$; Model B $=305+6(4.60517)=305+27.631=332.63$.\nLower BIC is **A** (by about $8.8$). Both criteria prefer the smaller model A; BIC's larger penalty widens the margin.",
      "tag": "Information criteria"
    },
    {
      "front": "Model B improves $-2\\ln L$ by $6.0$ over model A but uses $4$ more parameters, with $n=50$. Does **AIC** favor adding them? Does **BIC**?",
      "back": "Compare the **change** in each criterion ($\\Delta = \\Delta(-2\\ln L)+\\text{penalty}$); adding helps only if the criterion falls.\nFit term changes by $-6.0$. AIC penalty for $4$ extra params $=2(4)=8$, so $\\Delta\\text{AIC}=-6.0+8=+2.0>0$ — AIC says **do not** add them.\nBIC penalty $=4\\ln 50=4(3.91202)=15.648$, so $\\Delta\\text{BIC}=-6.0+15.648=+9.65>0$ — BIC also says **do not** add them, even more strongly.",
      "tag": "Information criteria"
    },
    {
      "front": "Describe the **validation-set approach** and its two main drawbacks.",
      "back": "Randomly split the data once into a **training** set and a held-out **validation (test)** set. Fit on the training set, predict on the validation set, and estimate test error as the validation MSE.\nDrawbacks: (1) the estimate is **highly variable** — it depends on the particular random split, so a different split gives a different number; (2) because only part of the data trains the model, it tends to **overestimate** the test error a model fit on the full data set would have (training on fewer points hurts fit).",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "Explain **$k$-fold cross-validation** and the standard estimate of test error.",
      "back": "Split the data into $k$ roughly equal folds. For each fold $j=1,\\dots,k$: train on the other $k-1$ folds and compute the validation error $\\text{MSE}_j$ on fold $j$. Each observation is used for validation **exactly once**.\nThe CV estimate is the average $\\text{CV}_{(k)}=\\frac{1}{k}\\sum_{j=1}^{k}\\text{MSE}_j$.\nCommon choices are $k=5$ or $k=10$, which give a good bias-variance compromise and far less computation than LOOCV.",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "What is **LOOCV**, and how does it compare with $k$-fold CV on bias and variance?",
      "back": "Leave-one-out CV is $k$-fold with $k=n$: each fold is a single observation. Fit on the other $n-1$ points, predict the held-out point, and average the $n$ squared errors.\n**Bias:** LOOCV trains on nearly all the data, so its error estimate is almost unbiased — lower bias than $5$- or $10$-fold.\n**Variance:** the $n$ training sets are nearly identical (highly correlated fits), so the averaged estimate has **higher variance** than $10$-fold. It is also $n$ times more computation (except for the linear-model shortcut).",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "State the **LOOCV shortcut** for least-squares linear models that avoids refitting $n$ times.",
      "back": "$\\text{CV}_{(n)}=\\dfrac{1}{n}\\sum_{i=1}^{n}\\left(\\dfrac{y_i-\\hat y_i}{1-h_{ii}}\\right)^{2}$.\nHere $\\hat y_i$ and the leverages $h_{ii}$ come from a **single** fit on the full data. The deleted residual is recovered as $\\frac{e_i}{1-h_{ii}}$, so the entire LOOCV error is obtained without ever refitting.\nHigh-leverage points (large $h_{ii}$) get up-weighted, since deleting them changes the fit the most.",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "A $3$-observation least-squares fit has residuals and leverages: $(e_1,h_{11})=(2.0,0.40)$, $(e_2,h_{22})=(-1.0,0.30)$, $(e_3,h_{33})=(1.5,0.50)$. Compute the **LOOCV** error.",
      "back": "$\\text{CV}_{(n)}=\\dfrac{1}{n}\\sum\\left(\\dfrac{e_i}{1-h_{ii}}\\right)^2$.\nTerm 1: $\\left(\\frac{2.0}{1-0.40}\\right)^2=\\left(\\frac{2.0}{0.60}\\right)^2=(3.3333)^2\\approx 11.1111$.\nTerm 2: $\\left(\\frac{-1.0}{0.70}\\right)^2=(-1.42857)^2\\approx 2.0408$.\nTerm 3: $\\left(\\frac{1.5}{0.50}\\right)^2=(3.0)^2=9.0000$.\n$\\text{CV}_{(n)}=\\frac{1}{3}(11.1111+2.0408+9.0000)=\\frac{22.1519}{3}\\approx 7.384$.",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "A least-squares point has raw residual $e_i=3.0$ and leverage $h_{ii}=0.25$. Find its **deleted (LOOCV) residual** and contrast it with the raw residual.",
      "back": "The deleted residual is $\\dfrac{e_i}{1-h_{ii}}=\\dfrac{3.0}{1-0.25}=\\dfrac{3.0}{0.75}=4.0$.\nIt is **larger** in magnitude than the raw $3.0$ because removing the point lets the line move away from it, enlarging the prediction error. The higher the leverage, the bigger this inflation — at $h_{ii}=0.25$ the residual grows by a factor $\\frac{1}{0.75}\\approx 1.33$.",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "Contrast **ridge** and **lasso** regression: their penalties, and which one performs variable selection.",
      "back": "Both minimize $\\text{RSS}+\\lambda\\cdot(\\text{penalty})$ with tuning parameter $\\lambda\\ge 0$.\n**Ridge (L2):** penalty $\\sum_{j=1}^{p}\\beta_j^{2}$. Shrinks coefficients toward $0$ but never exactly to $0$, so it keeps **all** predictors.\n**Lasso (L1):** penalty $\\sum_{j=1}^{p}|\\beta_j|$. Its corner geometry forces some coefficients to **exactly $0$**, so it performs automatic **variable selection** and yields sparse, more interpretable models.\nNeither penalizes the intercept $\\beta_0$.",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "Why must predictors be **standardized** before applying ridge or lasso?",
      "back": "The penalty $\\sum\\beta_j^2$ or $\\sum|\\beta_j|$ adds up the coefficients on whatever scale each predictor happens to use. Rescaling a predictor rescales its coefficient inversely: a predictor recorded in large units (thousands of dollars rather than dollars) takes small numeric values, needs a larger coefficient, and is penalized more heavily. The fit would therefore depend on an arbitrary choice of units.\nStandardizing each predictor to mean $0$ and SD $1$ (e.g. $\\tilde x_{ij}=\\frac{x_{ij}-\\bar x_j}{s_j}$) puts every coefficient on a common scale, so the penalty treats predictors even-handedly. Ordinary least squares, by contrast, is scale-invariant and needs no standardization.",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "Explain the **bias-variance trade-off** as the ridge/lasso tuning parameter $\\lambda$ increases from $0$.",
      "back": "At $\\lambda=0$ the penalty vanishes and you get ordinary least squares: **low bias, high variance**.\nAs $\\lambda\\to\\infty$ coefficients shrink toward $0$ (lasso sets them to $0$), approaching the null model: **low variance, high bias**.\nAt first, increasing $\\lambda$ trades a small rise in bias for a larger drop in variance, so test MSE falls; past some point the growing bias dominates and test MSE rises again, giving a U-shape. The best $\\lambda$ (minimizing test MSE) is chosen by **cross-validation**.",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "Ridge regression is fit at three $\\lambda$ values, with $5$-fold CV mean-squared errors below. Pick the tuning parameter.\n$\\lambda=0.1$: CV-MSE $=24.5$; $\\lambda=1$: $21.2$; $\\lambda=10$: $26.8$.",
      "back": "Cross-validation selects the $\\lambda$ with the **lowest** CV error.\nComparing $24.5$, $21.2$, $26.8$, the minimum is at $\\lambda=1$ (CV-MSE $=21.2$).\nThe U-shape is visible: too little penalty ($\\lambda=0.1$) leaves the fit too variable, and too much ($\\lambda=10$) over-shrinks (high bias), both raising CV error above the $\\lambda=1$ minimum. So choose $\\lambda=1$.",
      "tag": "Cross-validation & regularization"
    },
    {
      "front": "Explain how a **non-constant variance** pattern in residuals affects inference and what a common fix is.",
      "back": "Heteroscedasticity (a funnel-shaped residual-vs-fitted plot) leaves the OLS coefficient estimates **unbiased** but makes the usual standard-error formulas **wrong**, so $t$-tests, $F$-tests, and confidence intervals are no longer valid.\nFixes: apply a **variance-stabilizing transformation** of the response such as $\\ln y$ or $\\sqrt{y}$ (often when variance grows with the mean), use **weighted least squares**, or report heteroscedasticity-robust standard errors. After transforming, re-check the residual plot to confirm the funnel is gone.",
      "tag": "Residual diagnostics"
    },
    {
      "front": "In a model with $p=5$ predictors and $n=60$, the leverages of three points are $0.05$, $0.14$, and $0.22$. Which exceed the high-leverage threshold $2\\frac{p+1}{n}$?",
      "back": "Average leverage $=\\frac{p+1}{n}=\\frac{6}{60}=0.10$; threshold $=2(0.10)=0.20$.\nCompare each: $0.05<0.20$ (not flagged), $0.14<0.20$ (not flagged), $0.22>0.20$ (**flagged** as high leverage).\nOnly the third point has unusually extreme predictor values; it would warrant a Cook's-distance check to see whether it is also influential.",
      "tag": "Leverage & influence"
    },
    {
      "front": "Two models for the same data give: Model A ($k=2$) RSS $=240$; Model B ($k=4$) RSS $=210$, with $n=40$ and TSS $=600$. Compare them by **adjusted $R^2$**.",
      "back": "$\\bar R^2=1-\\dfrac{\\text{RSS}/(n-k-1)}{\\text{TSS}/(n-1)}$ with $\\frac{\\text{TSS}}{n-1}=\\frac{600}{39}\\approx 15.3846$.\n**Model A:** $\\frac{240}{40-2-1}=\\frac{240}{37}\\approx 6.4865$; $\\bar R^2=1-\\frac{6.4865}{15.3846}=1-0.42162\\approx 0.578$.\n**Model B:** $\\frac{210}{40-4-1}=\\frac{210}{35}=6.0$; $\\bar R^2=1-\\frac{6.0}{15.3846}=1-0.39000\\approx 0.610$.\nModel B has the higher adjusted $R^2$ ($0.610>0.578$), so the two extra predictors earn their place.",
      "tag": "Information criteria"
    },
    {
      "front": "Why is **cross-validation** often preferred over $C_p$, AIC, and BIC for model selection?",
      "back": "$C_p$, AIC, and BIC estimate test error **indirectly** through analytic penalties that assume a known error structure and require an estimate of $\\sigma^2$ (typically from the full model) and a count of effective parameters — assumptions that can fail for flexible or non-likelihood methods.\nCross-validation estimates the test error **directly** by holding out data, making essentially no model assumptions and needing no $\\sigma^2$ estimate or degrees-of-freedom count. It applies to any prediction method (trees, KNN, regularized fits), which is why it is the default when those assumptions are doubtful — at the cost of more computation.",
      "tag": "Cross-validation & regularization"
    }
  ]
}