Exam MAS-II — Tree-Based Machine Learning Flashcards

Tree-based supervised learning for CAS Exam MAS-II: recursive binary splitting for regression and classification trees, impurity measures (Gini, cross-entropy, classification error), cost-complexity pruning, and the three ensemble families — bagging with out-of-bag error, random forests that decorrelate trees by sampling predictors, and gradient boosting with shrinkage — plus variable importance, classifier evaluation (confusion matrix, sensitivity/specificity/precision, ROC/AUC), k-fold cross-validation, and the bias-variance trade-off, with fully worked numeric examples.

44 cards · 6 topics · Free · fact-checked · LaTeX math


Decision trees
Describe **recursive binary splitting** used to grow a decision tree, and why it is a greedy top-down algorithm.
Starting with all observations in one region, the algorithm searches over every predictor $X_j$ and cutpoint $s$ and picks the split into $\{X_j<s\}$ and $\{X_j\ge s\}$ that most improves the objective (lowest RSS for regression, lowest impurity for classification). It then recurses **independently** within each resulting region. It is **top-down** (begins at the root, works toward the leaves) and **greedy** (each split is chosen to be best *at that step*, not looking ahead to splits that might be better later). Greediness makes it fast but not guaranteed to find the globally optimal tree.
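A minimal Python sketch of one greedy step, assuming a regression objective (RSS) and midpoints between consecutive distinct values as candidate cutpoints. The helper names `rss` and `best_split` and the toy data are illustrative assumptions, not part of the deck. A full tree would repeat the same search inside each resulting region.

```python
def rss(ys):
    """Residual sum of squares of ys around their own mean (0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(X, y):
    """Search every predictor j and cutpoint s; return (total RSS, j, s) of the best split."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Assumption: candidate cutpoints are midpoints between distinct values.
        for lo, hi in zip(values, values[1:]):
            s = (lo + hi) / 2
            left = [yi for row, yi in zip(X, y) if row[j] < s]
            right = [yi for row, yi in zip(X, y) if row[j] >= s]
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, j, s)
    return best


# Hypothetical toy data: two predictors, four observations.
X = [[1, 7], [2, 2], [3, 9], [4, 4]]
y = [1.0, 2.0, 8.0, 9.0]
print(best_split(X, y))  # (1.0, 0, 2.5): split predictor 0 at 2.5
```

The search only looks one split ahead, which is the greedy behavior the card describes.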
Decision trees
For a **regression tree**, what quantity does each split minimize, and what value is predicted in a terminal region?
Each split minimizes the **residual sum of squares** $\text{RSS}=\sum_{j}\sum_{i\in R_j}(y_i-\hat y_{R_j})^2$, summed over the regions created. The prediction in a terminal region (leaf) $R_j$ is the **mean of the training responses** in that region: $\hat y_{R_j}=\frac{1}{n_j}\sum_{i\in R_j}y_i$. Every test point falling in $R_j$ receives that same constant prediction.
Decision trees
A region of a regression tree holds responses $\{4,\,6,\,9,\,13\}$. A candidate split sends $\{4,6\}$ to the left leaf and $\{9,13\}$ to the right. Compute the total RSS after the split.
Left leaf mean $=\frac{4+6}{2}=5$; RSS$_L=(4-5)^2+(6-5)^2=1+1=2$. Right leaf mean $=\frac{9+13}{2}=11$; RSS$_R=(9-11)^2+(13-11)^2=4+4=8$. Total RSS $=2+8=10$. For comparison, the parent (no split) has mean $\frac{4+6+9+13}{4}=8$ and RSS $=(4-8)^2+(6-8)^2+(9-8)^2+(13-8)^2=16+4+1+25=46$, so this split cuts RSS from $46$ to $10$.
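The arithmetic above can be checked in a few lines of Python. The function name `rss` is a hypothetical helper, not something the deck defines.

```python
def rss(values):
    """Residual sum of squares of values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


parent = [4, 6, 9, 13]
left, right = [4, 6], [9, 13]

print(rss(parent))             # 46.0 (no split)
print(rss(left) + rss(right))  # 10.0 (after the split)
```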
Decision trees
State the three common node-impurity measures for a **classification tree** with class proportions $\hat p_k$ in a node.
**Gini index:** $G=\sum_{k}\hat p_k(1-\hat p_k)=1-\sum_k \hat p_k^{2}$. **Cross-entropy (deviance):** $D=-\sum_{k}\hat p_k\ln\hat p_k$. **Classification error:** $E=1-\max_k \hat p_k$. All three are $0$ when a node is pure (one class has $\hat p_k=1$) and are maximized when the classes are evenly mixed. Gini and cross-entropy are more sensitive to node purity than classification error, so they are preferred for *growing* the tree.
Decision trees
Why are **Gini index** and **cross-entropy** preferred over **classification error** when *growing* a classification tree, even though error rate is used to *prune* it?
Classification error $E=1-\max_k\hat p_k$ depends only on the most common class and is not very sensitive to changes in the other class proportions, so it often fails to reward splits that make nodes purer. Gini and cross-entropy are smooth, differentiable functions of all the $\hat p_k$ and decrease whenever a node becomes purer, giving better splits during growth. For the final goal of **prediction accuracy**, the misclassification error rate is the natural criterion, so it is typically used to choose the size of the pruned tree.
Decision trees
A binary-classification node contains $30$ observations: $18$ of class A and $12$ of class B. Compute the Gini index, the cross-entropy, and the classification error.
Proportions: $\hat p_A=\frac{18}{30}=0.6$, $\hat p_B=\frac{12}{30}=0.4$. **Gini:** $G=1-(0.6^2+0.4^2)=1-(0.36+0.16)=1-0.52=0.48$. **Cross-entropy:** $D=-(0.6\ln 0.6+0.4\ln 0.4)=-(0.6(-0.510826)+0.4(-0.916291))=-(-0.306496-0.366516)\approx 0.6730$. **Classification error:** $E=1-\max(0.6,0.4)=1-0.6=0.40$.
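As a sketch, the three impurity measures can be written as small Python functions and applied to this node. The function names are illustrative assumptions; the natural log matches the formula on the card.

```python
import math


def gini(p):
    """Gini index: 1 - sum of squared class proportions."""
    return 1 - sum(q * q for q in p)


def cross_entropy(p):
    """Cross-entropy with natural log; classes with zero proportion contribute 0."""
    return -sum(q * math.log(q) for q in p if q > 0)


def classification_error(p):
    """1 minus the proportion of the most common class."""
    return 1 - max(p)


counts = [18, 12]  # class A, class B
n = sum(counts)
p = [c / n for c in counts]

print(f"Gini          = {gini(p):.4f}")                  # 0.4800
print(f"Cross-entropy = {cross_entropy(p):.4f}")         # 0.6730
print(f"Class. error  = {classification_error(p):.4f}")  # 0.4000
```

All three functions return 0 for a pure node such as `[1.0, 0.0]`.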
Decision trees
A node of $40$ observations ($24$ class A, $16$ class B) is split into a left child ($20$: $18$ A, $2$ B) and a right child ($20$: $6$ A, $14$ B). Compute the weighted Gini index after the split and the reduction from the parent.
Parent: $\hat p_A=0.6,\hat p_B=0.4$, $G_{\text{parent}}=1-(0.36+0.16)=0.48$. Left: $\hat p_A=0.9,\hat p_B=0.1$, $G_L=1-(0.81+0.01)=0.18$. Right: $\hat p_A=0.3,\hat p_B=0.7$, $G_R=1-(0.09+0.49)=0.42$. Weighted child Gini $=\frac{20}{40}(0.18)+\frac{20}{40}(0.42)=0.5(0.18)+0.5(0.42)=0.09+0.21=0.30$. Reduction (Gini gain) $=0.48-0.30=0.18$.
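The same calculation as a short Python sketch, working directly from class counts. `gini_from_counts` and `gini_gain` are hypothetical helper names.

```python
def gini_from_counts(counts):
    """Gini index of a node given its per-class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)


def gini_gain(parent, children):
    """Parent Gini minus the size-weighted average Gini of the children."""
    n = sum(parent)
    weighted = sum(sum(ch) / n * gini_from_counts(ch) for ch in children)
    return gini_from_counts(parent) - weighted


parent = [24, 16]              # A, B
children = [[18, 2], [6, 14]]  # left child, right child

print(round(gini_gain(parent, children), 4))  # 0.18
```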
Decision trees
Explain **cost-complexity (weakest-link) pruning** and the role of the tuning parameter $\alpha$.
Rather than stopping growth early, you grow a large tree $T_0$ then prune back. For each $\alpha\ge 0$ you find the subtree $T$ minimizing $\sum_{m=1}^{|T|}\sum_{i\in R_m}(y_i-\hat y_{R_m})^2 + \alpha|T|$, where $|T|$ is the number of terminal nodes. The penalty $\alpha|T|$ trades training fit against tree size: $\alpha=0$ gives the full tree $T_0$; as $\alpha$ grows, branches are collapsed and the tree shrinks. The best $\alpha$ is chosen by **cross-validation**, and the final tree is the corresponding subtree fit on the full data.
Decision trees
A grown regression tree has training RSS and terminal-node counts: full tree RSS $=120,\,|T|=8$; a pruned subtree RSS $=160,\,|T|=3$. At $\alpha=15$, which has the lower cost-complexity score $\text{RSS}+\alpha|T|$?
Full tree: $120 + 15(8)=120+120=240$. Pruned subtree: $160 + 15(3)=160+45=205$. The **pruned** subtree wins at $\alpha=15$ ($205 < 240$). Note at $\alpha=0$ the full tree (score $120$) beats the pruned one ($160$); the larger penalty $\alpha=15$ now favors the smaller tree, illustrating how increasing $\alpha$ pushes the selection toward simpler subtrees.
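A small Python sketch of the comparison, assuming only the two candidate subtrees from the card (a real pruning path would contain a whole nested sequence of subtrees). The names are illustrative. Setting $120+8\alpha=160+3\alpha$ shows the two scores tie at $\alpha=8$, so the pruned subtree wins for any larger penalty.

```python
def cost_complexity(rss, n_leaves, alpha):
    """Penalized training error: RSS + alpha * |T|."""
    return rss + alpha * n_leaves


# (training RSS, number of terminal nodes) for each candidate subtree
subtrees = {"full": (120, 8), "pruned": (160, 3)}


def best_subtree(alpha):
    """Name of the candidate with the lowest cost-complexity score."""
    return min(subtrees, key=lambda name: cost_complexity(*subtrees[name], alpha))


for alpha in (0, 15):
    scores = {name: cost_complexity(r, t, alpha) for name, (r, t) in subtrees.items()}
    print(alpha, scores, "->", best_subtree(alpha))
```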
Decision trees
List the main **advantages and disadvantages** of a single decision tree relative to linear models.
**Advantages:** easy to explain and visualize; handle qualitative predictors with no dummy coding; can capture nonlinearities and interactions automatically; mirror human decision-making; no need to scale predictors. **Disadvantages:** lower predictive accuracy than many other methods; **high variance** — small changes in the data can produce very different trees; tend to overfit if grown deep without pruning. Ensembles (bagging, random forests, boosting) trade away some interpretability to fix the high-variance weakness and greatly improve accuracy.
Bagging & OOB
What is **bagging** (bootstrap aggregation), and why does it reduce variance for high-variance learners like trees?
Bagging fits the *same* learner to many **bootstrap samples** (samples of size $n$ drawn with replacement) of the training data, then aggregates: average the predictions for regression, or take a majority vote for classification. Averaging $B$ independent predictions, each with variance $\sigma^2$, gives variance $\sigma^2/B$. Bagged trees are only roughly independent, so the actual reduction is smaller, but combining many high-variance trees still sharply reduces variance while leaving bias roughly unchanged. Because deep unpruned trees are low-bias/high-variance, they are ideal candidates for bagging.
Bagging & OOB
In bagging, what is the **out-of-bag (OOB)** set, and how is OOB error computed?
Each bootstrap sample leaves out roughly a third of the observations; for a given observation $i$, the trees that did **not** train on $i$ form its OOB predictors. The OOB prediction for $i$ is the aggregate (average / majority vote) over just those trees, and the **OOB error** averages the loss of these predictions over all $i$. OOB error is a nearly unbiased estimate of test error that comes essentially **for free** — no separate validation set or cross-validation pass is needed, since every observation is scored only by trees that never saw it.
Bagging & OOB
Why is roughly $\tfrac{1}{3}$ of the data **out-of-bag** for each bagged tree? Derive the limiting fraction.
In one bootstrap draw of size $n$ (with replacement), the chance a specific observation is **not** picked on a single draw is $1-\frac{1}{n}$. Over $n$ independent draws it is left out entirely with probability $\left(1-\frac{1}{n}\right)^{n}$. As $n\to\infty$, $\left(1-\frac{1}{n}\right)^{n}\to e^{-1}\approx 0.368$. So about $36.8\%$ of observations are out-of-bag (and $\approx 63.2\%$ are in-bag) for each tree.
Bagging & OOB
For a training set of $n=10$ observations, compute the exact probability a given observation is out-of-bag for one bootstrap sample, and compare to the $n\to\infty$ limit.
$P(\text{out-of-bag})=\left(1-\frac{1}{10}\right)^{10}=0.9^{10}=(0.9^2)^5=0.81^5$. Multiplying out: $0.81^2=0.6561$, $0.6561\times 0.81=0.531441$, $0.531441\times 0.81\approx 0.430467$, $0.430467\times 0.81\approx 0.348678$. So $P\approx 0.3487$ (about $34.9\%$), already close to the limiting $e^{-1}\approx 0.3679$. The expected in-bag count is $\approx (1-0.3487)\times 10\approx 6.5$ distinct observations.
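A quick Python check of the exact probability and how it approaches $e^{-1}$ as $n$ grows. `oob_probability` is a hypothetical helper name, and the larger values of $n$ are chosen only for illustration.

```python
import math


def oob_probability(n):
    """Chance a given observation is never drawn in n draws with replacement."""
    return (1 - 1 / n) ** n


print(round(oob_probability(10), 4))  # 0.3487
for n in (100, 1000, 10000):
    print(n, oob_probability(n))      # climbs toward the limit
print("limit", math.exp(-1))
```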
Bagging & OOB
A bagged classifier of $5$ trees gives an observation these class predictions: $\{A,\,A,\,B,\,A,\,B\}$, of which trees $3,4,5$ are out-of-bag for it. State the bagged prediction and the OOB prediction.
**Bagged prediction (all $5$ trees, majority vote):** A appears $3$ times, B twice $\Rightarrow$ predict **A**. **OOB prediction (only trees $3,4,5$, which omitted this observation):** their votes are $\{B,A,B\}$, so B has $2$ votes vs $1$ for A $\Rightarrow$ OOB predicts **B**. The OOB prediction uses *only* the trees that never trained on this point, which is exactly why OOB error is an honest test-error estimate.
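The two votes can be reproduced with a short Python sketch. The 1-based tree numbering follows the card; `majority_vote` is an illustrative helper, and tie-breaking is not addressed because no tie occurs here.

```python
from collections import Counter


def majority_vote(votes):
    """Most frequent class label among the votes."""
    return Counter(votes).most_common(1)[0][0]


predictions = ["A", "A", "B", "A", "B"]  # trees 1..5
oob_trees = [3, 4, 5]                    # trees that did not train on this observation

bagged = majority_vote(predictions)
oob = majority_vote([predictions[t - 1] for t in oob_trees])

print(bagged)  # A
print(oob)     # B
```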
Random forests
What is a **random forest**, and the single key way it differs from plain bagging?
A random forest is bagging of decision trees with one extra randomization: at **each split**, the algorithm considers only a random subset of $m$ of the $p$ predictors (a fresh random subset per split) as candidates, rather than all $p$. This **decorrelates** the trees. In plain bagging, a few strong predictors dominate the top splits of almost every tree, making the trees highly correlated; averaging correlated trees reduces variance much less. Forcing most splits to ignore the strongest predictor lets other predictors contribute, lowering tree correlation and the variance of the averaged forest.
Random forests
Why does **decorrelating** the trees matter? Give the variance formula for the average of $B$ identically distributed variables with pairwise correlation $\rho$.
For $B$ identically distributed trees each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of their average is $\rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}$. As $B\to\infty$ the second term vanishes but the first, $\rho\sigma^2$, remains — so highly correlated trees ($\rho$ near $1$) put a floor on how much variance averaging can remove. Random forests lower $\rho$ by restricting the candidate predictors at each split, pushing that floor down and improving the ensemble.
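To see the floor numerically, here is a minimal Python sketch of the formula. The values $\sigma^2=1$, $\rho\in\{0, 0.5\}$ and the choices of $B$ are hypothetical, picked only for illustration.

```python
def ensemble_variance(sigma2, rho, B):
    """Variance of the average of B identically distributed variables
    with common variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) / B * sigma2


for B in (1, 10, 100, 1000):
    uncorrelated = ensemble_variance(1.0, 0.0, B)
    correlated = ensemble_variance(1.0, 0.5, B)
    print(B, uncorrelated, correlated)
```

With $\rho=0$ the variance keeps shrinking like $1/B$, while with $\rho=0.5$ it never drops below $0.5$.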
Random forests
State the usual default for $m$ (predictors tried per split) in a random forest for **classification** versus **regression**, and what $m=p$ corresponds to.
**Classification:** $m\approx\sqrt{p}$ (rounded). **Regression:** $m\approx p/3$. Setting $m=p$ means every predictor is a candidate at every split, which is exactly **bagging**, so bagging is the special case of a random forest with no predictor subsetting. Smaller $m$ produces more decorrelated, but individually weaker, trees; $m$ is a tuning parameter, often chosen by OOB error.
Random forests
A random-forest model has $p=16$ predictors. Give the default number of predictors $m$ tried at each split for a classification forest and for a regression forest.
**Classification:** $m\approx\sqrt{p}=\sqrt{16}=4$ predictors per split. **Regression:** $m\approx \frac{p}{3}=\frac{16}{3}\approx 5.33$, i.e. about $5$ predictors per split. With $p=16$ the classification forest tries $4$ candidate predictors at each split and the regression forest tries roughly $5$ — a small fraction of the $16$, which is what decorrelates the trees.
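A small Python sketch of the default rule. Rounding down (and never going below 1) is an assumption made here; the card only says "rounded", and implementations may differ. `default_m` is a hypothetical helper name.

```python
import math


def default_m(p, task):
    """Default number of predictors tried per split.
    Assumption: round down, with a minimum of 1."""
    if task == "classification":
        return max(1, math.isqrt(p))  # floor of sqrt(p)
    if task == "regression":
        return max(1, p // 3)         # floor of p / 3
    raise ValueError("task must be 'classification' or 'regression'")


print(default_m(16, "classification"))  # 4
print(default_m(16, "regression"))      # 5
```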
Random forests
If a random forest with $p=9$ predictors is configured with $m=9$, what model is it equivalent to, and how would you expect its OOB error to compare to an $m=3$ forest?
$m=9=p$ means **every** predictor is a split candidate, so the model reduces to **bagging**. Versus the $m=3$ (the $\sqrt{9}=3$ default) forest, the $m=9$ bagged trees are more **correlated** because the strongest predictors dominate the top splits of nearly every tree. More correlation means averaging removes less variance, so the $m=9$ bagged forest will typically show **higher** OOB error than the decorrelated $m=3$ random forest.
Boosting
Describe **boosting** for trees and how it differs structurally from bagging / random forests.
Boosting grows trees **sequentially**: each new tree is fit to the current model's residuals (its mistakes), and a shrunken version of that tree is added to the running model. The trees are typically **small** (few splits / shallow). The final model is the sum of all the trees. Unlike bagging and random forests — which fit trees **independently** (in parallel) on resampled data and *average* them to cut variance — boosting builds trees **dependently** and additively, slowly reducing **bias**. Boosting does not use bootstrap sampling; it learns from the residuals of the prior fit.
Boosting
State the three boosting **tuning parameters** ($B$, $\lambda$, $d$) and what each controls.
**$B$ — number of trees:** unlike bagging, boosting *can* overfit if $B$ is too large, so $B$ is chosen by cross-validation. **$\lambda$ — shrinkage / learning rate** (e.g. $0.01$ or $0.001$): scales each tree's contribution; smaller $\lambda$ learns more slowly and usually needs a larger $B$. **$d$ — interaction depth** (number of splits per tree): controls complexity of each tree. Often $d=1$ (a "stump", an additive model with no interactions); $d=2$ allows two-variable interactions, etc.
Boosting
Outline the **gradient boosting** algorithm for regression (squared-error loss).
1. Initialize $\hat f(x)=0$ and residuals $r_i=y_i$ for all $i$. 2. For $b=1,\dots,B$: fit a tree $\hat f^{\,b}$ with $d$ splits to the data $(x_i,r_i)$; update the model $\hat f(x)\leftarrow \hat f(x)+\lambda\hat f^{\,b}(x)$; update residuals $r_i\leftarrow r_i-\lambda\hat f^{\,b}(x_i)$. 3. Output $\hat f(x)=\sum_{b=1}^{B}\lambda\hat f^{\,b}(x)$. Each tree attacks what the current model still gets wrong; the shrinkage $\lambda$ keeps each step small so the ensemble improves gradually.
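The three steps can be sketched in plain Python for a single predictor, using stumps ($d=1$) as the base trees. This is an illustrative sketch under stated assumptions, not a production implementation: `fit_stump`, `boost`, the toy data, and the settings $B=200$, $\lambda=0.1$ are all hypothetical, and the data must contain at least two distinct $x$ values.

```python
def fit_stump(xs, rs):
    """Fit a one-split regression tree to (x, r) pairs; return it as a function."""
    best = None
    values = sorted(set(xs))
    for lo, hi in zip(values, values[1:]):
        s = (lo + hi) / 2  # assumption: midpoints as candidate cutpoints
        left = [r for x, r in zip(xs, rs) if x < s]
        right = [r for x, r in zip(xs, rs) if x >= s]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        rss = sum((r - ml) ** 2 for r in left) + sum((r - mr) ** 2 for r in right)
        if best is None or rss < best[0]:
            best = (rss, s, ml, mr)
    _, s, ml, mr = best
    return lambda x: ml if x < s else mr


def boost(xs, ys, B=200, lam=0.1):
    """Boosted stumps for squared-error loss; returns the fitted model f(x)."""
    trees = []
    residuals = list(ys)  # step 1: f(x) = 0, so r_i = y_i
    for _ in range(B):    # step 2: fit to residuals, then shrink and update
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        residuals = [r - lam * tree(x) for x, r in zip(xs, residuals)]
    return lambda x: sum(lam * t(x) for t in trees)  # step 3: sum of shrunken trees


xs = [1, 2, 3, 4]  # hypothetical toy data
ys = [4, 6, 9, 13]
model = boost(xs, ys)
print([round(model(x), 2) for x in xs])
```

Each round moves the fit only a fraction $\lambda$ of the way toward the current residuals, so the training error falls gradually as $B$ grows.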
Boosting
A boosting model starts with prediction $\hat f(x)=10$ for an observation whose true value is $y=16$. The next tree predicts the residual as $5$, and the learning rate is $\lambda=0.2$. Give the updated prediction and the new residual.
Current residual $r=y-\hat f=16-10=6$. The tree's contribution is shrunk: $\lambda\times(\text{tree prediction})=0.2\times 5=1$. **Updated prediction:** $\hat f\leftarrow 10+1=11$. **New residual:** $r\leftarrow 16-11=5$ (equivalently $6-1=5$). Small $\lambda$ means the model moved only part-way toward the residual, so several more rounds are needed to close the gap — the slow-learning behavior boosting relies on.
Boosting
How does the role of $B$ (number of trees) differ between **bagging/random forests** and **boosting**?
**Bagging / random forests:** more trees never hurt test error — adding trees only stabilizes the average. You pick $B$ large enough for the OOB/test error to flatten; overfitting from large $B$ is **not** a concern. **Boosting:** trees are added sequentially to fit residuals, so a too-large $B$ **can overfit** the training data. $B$ is a genuine tuning parameter, usually selected by cross-validation, and is paired with the learning rate $\lambda$ (smaller $\lambda$ needs larger $B$).
Tuning & importance
How is **variable (predictor) importance** measured in bagged trees and random forests?
For **regression** forests: for each predictor, sum the **decrease in RSS** from all splits on that predictor, averaged over the $B$ trees — large total decrease means an important predictor. For **classification** forests: sum the **decrease in the Gini index** (node impurity) from splits on that predictor, averaged over all trees. An alternative is **OOB permutation importance**: record OOB accuracy, randomly permute a predictor's OOB values, and measure how much accuracy drops — a big drop signals an important predictor. Importance restores some interpretability lost when moving from a single tree to an ensemble.
Tuning & importance
A regression random forest reports total RSS decrease attributable to each predictor (summed over trees): $X_1:\,540$, $X_2:\,180$, $X_3:\,60$, $X_4:\,20$. Report the relative importance scaled so the largest predictor $=100$.
Scale each by $\frac{\text{decrease}}{540}\times 100$: $X_1:\,\frac{540}{540}\times100=100.0$. $X_2:\,\frac{180}{540}\times100\approx 33.3$. $X_3:\,\frac{60}{540}\times100\approx 11.1$. $X_4:\,\frac{20}{540}\times100\approx 3.7$. $X_1$ is by far the dominant predictor; $X_4$ contributes almost nothing to reducing RSS.
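The rescaling can be reproduced in a couple of lines of Python; the dictionary layout is just one convenient way to hold the figures from the card.

```python
decrease = {"X1": 540, "X2": 180, "X3": 60, "X4": 20}  # total RSS decrease per predictor

top = max(decrease.values())
relative = {name: round(100 * d / top, 1) for name, d in decrease.items()}

print(relative)  # {'X1': 100.0, 'X2': 33.3, 'X3': 11.1, 'X4': 3.7}
```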
Tuning & importance
What are the main **tuning parameters** for each tree-based method, and how is each typically chosen?
**Single tree:** the cost-complexity penalty $\alpha$ (equivalently tree size) — chosen by cross-validation. **Bagging:** number of trees $B$ — set large; OOB error monitors convergence (no overfitting risk from $B$). **Random forest:** $m$ (predictors per split) and $B$ — $m$ tuned by OOB/CV error around the $\sqrt{p}$ or $p/3$ default. **Boosting:** $B$, shrinkage $\lambda$, and depth $d$ — all chosen by cross-validation since boosting can overfit.
Model evaluation
Define a **confusion matrix** for a binary classifier and label TP, FP, TN, FN.
A $2\times 2$ table cross-tabulating predicted class against actual class: **TP** (true positive): predicted positive, actually positive. **FP** (false positive): predicted positive, actually negative (Type I error). **FN** (false negative): predicted negative, actually positive (Type II error). **TN** (true negative): predicted negative, actually negative. The total $n=\text{TP}+\text{FP}+\text{FN}+\text{TN}$, and most performance metrics (accuracy, sensitivity, specificity, precision) are ratios formed from these four counts.
Model evaluation
Give the formulas for **accuracy**, **sensitivity (recall)**, **specificity**, and **precision** in terms of TP, FP, TN, FN.
**Accuracy** $=\frac{TP+TN}{TP+FP+TN+FN}$ — overall fraction correct. **Sensitivity (recall, true-positive rate)** $=\frac{TP}{TP+FN}$ — share of actual positives caught. **Specificity (true-negative rate)** $=\frac{TN}{TN+FP}$ — share of actual negatives correctly cleared. **Precision (positive predictive value)** $=\frac{TP}{TP+FP}$ — share of predicted positives that are truly positive. Misclassification (error) rate $=1-\text{accuracy}$; the false-positive rate $=1-\text{specificity}$.
Model evaluation
A classifier produces this confusion matrix — TP $=80$, FN $=20$, FP $=30$, TN $=170$. Compute accuracy, sensitivity, specificity, and precision.
Total $n=80+20+30+170=300$. **Accuracy** $=\frac{80+170}{300}=\frac{250}{300}\approx 0.833$. **Sensitivity** $=\frac{TP}{TP+FN}=\frac{80}{80+20}=\frac{80}{100}=0.80$. **Specificity** $=\frac{TN}{TN+FP}=\frac{170}{170+30}=\frac{170}{200}=0.85$. **Precision** $=\frac{TP}{TP+FP}=\frac{80}{80+30}=\frac{80}{110}\approx 0.727$.
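A minimal Python sketch that computes the four metrics from the four counts. The function name and the dictionary keys are illustrative choices, and it assumes none of the denominators is zero.

```python
def classification_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and precision from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # recall, true-positive rate
        "specificity": tn / (tn + fp),  # true-negative rate
        "precision": tp / (tp + fp),    # positive predictive value
    }


metrics = classification_metrics(tp=80, fn=20, fp=30, tn=170)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.833, sensitivity: 0.800, specificity: 0.850, precision: 0.727
```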
Model evaluation
A disease screen tests $1000$ people; $50$ truly have the disease. The test flags $45$ of the sick as positive and also flags $95$ healthy people as positive. Build the confusion matrix and compute sensitivity, specificity, and precision.
Sick $=50$: TP $=45$, FN $=50-45=5$. Healthy $=950$: FP $=95$, TN $=950-95=855$. **Sensitivity** $=\frac{45}{45+5}=\frac{45}{50}=0.90$. **Specificity** $=\frac{855}{855+95}=\frac{855}{950}=0.90$. **Precision** $=\frac{45}{45+95}=\frac{45}{140}\approx 0.321$. Despite $90\%$ sensitivity and specificity, precision is low ($\approx 32\%$) because the disease is rare — most positives are false positives. This is the classic base-rate effect.
Model evaluation
What is the **ROC curve**, and what does the **AUC** measure?
The **ROC (Receiver Operating Characteristic) curve** plots the true-positive rate (sensitivity) on the $y$-axis against the false-positive rate ($1-$specificity) on the $x$-axis as the classification **threshold** is varied across all values. Each threshold gives one point; sweeping the threshold traces the curve. The **AUC (area under the curve)** summarizes performance in a single number: AUC $=1$ is a perfect classifier, AUC $=0.5$ is the diagonal (no better than random guessing). AUC equals the probability that the classifier ranks a randomly chosen positive case above a randomly chosen negative one. A larger AUC indicates a better classifier across all thresholds.
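The ranking interpretation of AUC suggests a direct, if slow, way to compute it: compare every positive score with every negative score. The Python sketch below does that; the scores are hypothetical, and counting ties as half a win is a common convention assumed here.

```python
def auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs in which the positive scores higher.
    Assumption: a tie counts as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


pos = [0.9, 0.8, 0.4]  # hypothetical scores of actual positives
neg = [0.7, 0.3, 0.2]  # hypothetical scores of actual negatives

print(round(auc(pos, neg), 3))  # 0.889, i.e. 8 of the 9 pairs are ranked correctly
```

Only the positive scored $0.4$ against the negative scored $0.7$ is ranked the wrong way round.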
Model evaluation
A model is scored at two thresholds. At threshold 1: sensitivity $=0.70$, specificity $=0.95$. At threshold 2: sensitivity $=0.90$, specificity $=0.60$. Give the ROC coordinates and describe the trade-off.
ROC plots (FPR, TPR) $=(1-\text{specificity},\ \text{sensitivity})$. Threshold 1: $(1-0.95,\,0.70)=(0.05,\,0.70)$. Threshold 2: $(1-0.60,\,0.90)=(0.40,\,0.90)$. Lowering the threshold (1$\to$2) raises sensitivity ($0.70\to0.90$, catching more positives) but lowers specificity ($0.95\to0.60$, more false alarms), moving the point up and to the right along the ROC curve — the universal sensitivity-specificity trade-off.
Model evaluation
Describe **$k$-fold cross-validation** and how it estimates test error.
Randomly partition the data into $k$ roughly equal **folds**. For each fold $j=1,\dots,k$: train the model on the other $k-1$ folds and compute the error on the held-out fold $j$. The CV estimate of test error is the average of the $k$ fold errors, $\text{CV}_{(k)}=\frac{1}{k}\sum_{j=1}^{k}\text{Err}_j$. Every observation is used for validation exactly once and for training $k-1$ times. Common choices are $k=5$ or $k=10$; $k=n$ is leave-one-out CV. It is the standard tool for tuning $\alpha$, $m$, $\lambda$, $d$, and $B$.
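A standard-library Python sketch of the procedure. To keep it self-contained, the "model" is simply the mean of the training responses, which is an assumption for illustration; in practice the tree method being tuned would be fit inside the loop. The helper names, the seed, and the toy responses are hypothetical, and $k\ge 2$ is assumed.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[j::k] for j in range(k)]


def cross_validate(ys, k, seed=0):
    """k-fold CV estimate of test MSE for the 'predict the training mean' model."""
    fold_errors = []
    for held_out in k_fold_indices(len(ys), k, seed):
        held = set(held_out)
        train = [y for i, y in enumerate(ys) if i not in held]
        prediction = sum(train) / len(train)  # fit on the other k-1 folds
        mse = sum((ys[i] - prediction) ** 2 for i in held_out) / len(held_out)
        fold_errors.append(mse)               # error on the held-out fold
    return sum(fold_errors) / k               # average of the k fold errors


ys = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.5, 4.5, 6.5, 7.5]  # hypothetical responses
print(cross_validate(ys, k=5))
```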
Model evaluation
A $5$-fold cross-validation of a tree model yields fold MSEs $\{12.0,\,14.5,\,11.0,\,13.5,\,9.0\}$. Compute the CV estimate of test MSE.
$\text{CV}_{(5)}=\frac{1}{5}(12.0+14.5+11.0+13.5+9.0)$. Sum $=12.0+14.5+11.0+13.5+9.0=60.0$. $\text{CV}_{(5)}=\frac{60.0}{5}=12.0$. This averaged held-out error is the figure you compare across candidate models or tuning-parameter values; you pick the setting with the smallest CV error (often within one standard error of the minimum, for a simpler model).
Model evaluation
Explain the **bias-variance trade-off** and how expected test error decomposes.
Expected test MSE at a point decomposes as $E[(y-\hat f(x))^2]=\text{Var}(\hat f(x))+[\text{Bias}(\hat f(x))]^2+\sigma^2$, where $\sigma^2$ is irreducible error. Flexible models (deep trees) have low bias but high variance; rigid models have high bias but low variance. Total error is minimized at an intermediate flexibility, so we tune complexity to balance the two. A deep unpruned tree sits at the high-variance/low-bias extreme — which is exactly why variance-reducing ensembles help so much.
Model evaluation
Place a single deep tree, bagging, random forests, and boosting on the **bias-variance** map.
**Single deep tree:** low bias, **high variance** (overfits; unstable). **Bagging:** keeps the low bias of deep trees but **lowers variance** by averaging many bootstrap trees; limited by tree correlation. **Random forest:** further **lowers variance** than bagging by decorrelating trees (subsetting predictors $m<p$), with similar bias. **Boosting:** uses small (high-bias) trees and **reduces bias** sequentially by fitting residuals; can drive both bias and variance low but risks overfitting if $B$ is too large. Bagging/forests primarily attack **variance**; boosting primarily attacks **bias**.
Model evaluation
Compute expected test MSE for two models given the components. Model A: bias $=2$, variance $=1$, irreducible $\sigma^2=3$. Model B: bias $=0.5$, variance $=6$, irreducible $\sigma^2=3$. Which is better?
Use $\text{MSE}=\text{bias}^2+\text{variance}+\sigma^2$. Model A: $2^2+1+3=4+1+3=8$. Model B: $0.5^2+6+3=0.25+6+3=9.25$. Model A has the lower expected test MSE ($8<9.25$): its larger bias is more than offset by its much smaller variance. This is the trade-off in action — the more flexible Model B reduced bias but its variance blow-up made it worse overall.
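The comparison as a short Python check; `expected_mse` is a hypothetical helper name.

```python
def expected_mse(bias, variance, sigma2):
    """Expected test MSE = bias^2 + variance + irreducible error."""
    return bias ** 2 + variance + sigma2


model_a = expected_mse(bias=2, variance=1, sigma2=3)
model_b = expected_mse(bias=0.5, variance=6, sigma2=3)

print(model_a, model_b)  # 8 9.25
print("A" if model_a < model_b else "B")  # A
```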
Random forests
A random forest with $B=500$ trees and $p=25$ predictors reports an OOB misclassification rate of $0.18$ at the default $m=\sqrt{p}$, and $0.22$ at $m=p$. Identify $m$ in each case, name the $m=p$ model, and interpret.
Default $m=\sqrt{p}=\sqrt{25}=5$ predictors per split (the random forest); $m=p=25$ means all predictors are candidates at every split, which is **bagging**. The random forest's OOB error $0.18$ beats bagging's $0.22$. The improvement comes from **decorrelating** the trees: with $m=5$ the strongest predictors can't dominate every split, so the averaged forest removes more variance. OOB error is computed for free during training, so it is the natural quantity for choosing $m$.
Model evaluation
When comparing classifiers on **imbalanced** data, why can accuracy mislead, and which metrics are more informative?
With heavy class imbalance a model can score high **accuracy** simply by predicting the majority class. Example: if $97\%$ of cases are negative, always predicting "negative" gives $97\%$ accuracy yet $0\%$ sensitivity — useless for detecting positives. Better to report **sensitivity/recall** and **precision** (which focus on the positive class), **specificity**, and threshold-free summaries like the **ROC curve / AUC**. These reveal performance on the rare class that overall accuracy hides.
Model evaluation
A confusion matrix for a fraud detector — TP $=18$, FN $=12$, FP $=6$, TN $=964$. Compute accuracy, sensitivity, precision, and explain why accuracy looks deceptively high.
Total $n=18+12+6+964=1000$. **Accuracy** $=\frac{18+964}{1000}=\frac{982}{1000}=0.982$. **Sensitivity** $=\frac{18}{18+12}=\frac{18}{30}=0.60$. **Precision** $=\frac{18}{18+6}=\frac{18}{24}=0.75$. Accuracy is $98.2\%$ only because fraud is rare ($30$ of $1000$); the model still misses $40\%$ of fraud (sensitivity $0.60$). On imbalanced problems, sensitivity and precision tell the real story that accuracy masks.
Decision trees
A node containing $50$ observations splits $\{A,B,C\}$ as $25,\,15,\,10$. Compute the Gini index and the cross-entropy for this three-class node.
Proportions: $\hat p_A=\frac{25}{50}=0.5,\ \hat p_B=\frac{15}{50}=0.3,\ \hat p_C=\frac{10}{50}=0.2$. **Gini:** $G=1-(0.5^2+0.3^2+0.2^2)=1-(0.25+0.09+0.04)=1-0.38=0.62$. **Cross-entropy:** $D=-(0.5\ln 0.5+0.3\ln 0.3+0.2\ln 0.2)$ $=-(0.5(-0.693147)+0.3(-1.203973)+0.2(-1.609438))$ $=-(-0.346574-0.361192-0.321888)\approx 1.0297$.
Boosting
Contrast how **bagging/random forests** and **boosting** combine their trees, and the consequence for interpretability and overfitting.
**Bagging / random forests:** trees are fit **independently** on resampled data and combined by **equal-weight averaging / voting**. Adding trees cannot overfit; more trees just stabilize the estimate. **Boosting:** trees are fit **sequentially** on residuals and **summed with shrinkage** $\lambda$; the combination is weighted and order-dependent, and too many trees **can** overfit. All three sacrifice the single tree's interpretability, which is partly recovered through **variable importance** measures (mean decrease in RSS/Gini, or permutation importance).