{
  "deckName": "Exam SRM — Decision Trees & Ensembles",
  "examCode": "Exam SRM",
  "cards": [
    {
      "front": "Describe **recursive binary splitting** and why it is a *greedy*, *top-down* algorithm.",
      "back": "Starting from the full predictor space, the algorithm scans every predictor $X_j$ and every cutpoint $s$ and chooses the single split into half-planes $\\{X_j<s\\}$ and $\\{X_j\\ge s\\}$ that most improves the objective. It then repeats within each resulting region until a stopping rule fires.\n**Top-down:** it begins at the root and works downward.\n**Greedy:** at each step it takes the best split *right now*, never looking ahead to whether a worse split now would enable a better tree later — so the tree is not guaranteed globally optimal.",
      "tag": "Regression trees"
    },
    {
      "front": "State the **splitting criterion** for a regression tree and the prediction rule in each leaf.",
      "back": "A regression tree partitions the space into $M$ boxes $R_1,\\dots,R_M$ chosen to minimize the residual sum of squares\n$\\text{RSS}=\\sum_{m=1}^{M}\\sum_{i\\in R_m}(y_i-\\hat y_{R_m})^2$,\nwhere $\\hat y_{R_m}$ is the **mean response of the training observations in box $R_m$**.\nA new observation is dropped down the tree and predicted as the mean response $\\hat y_{R_m}$ of the leaf it lands in.",
      "tag": "Regression trees"
    },
    {
      "front": "At a single regression split, what quantity is minimized over the predictor $j$ and cutpoint $s$?",
      "back": "Define $R_1(j,s)=\\{X\\mid X_j<s\\}$ and $R_2(j,s)=\\{X\\mid X_j\\ge s\\}$. The split minimizes\n$\\sum_{i\\in R_1}(y_i-\\hat y_{R_1})^2+\\sum_{i\\in R_2}(y_i-\\hat y_{R_2})^2$,\nwhere $\\hat y_{R_1}$ and $\\hat y_{R_2}$ are the mean responses in the two child regions. The pair $(j,s)$ giving the smallest combined RSS is chosen.",
      "tag": "Regression trees"
    },
    {
      "front": "A regression node holds responses $\\{4,6,9,13\\}$. A candidate split sends $\\{4,6\\}$ left and $\\{9,13\\}$ right. Compute the RSS of this split.",
      "back": "Left mean $=\\frac{4+6}{2}=5$; left RSS $=(4-5)^2+(6-5)^2=1+1=2$.\nRight mean $=\\frac{9+13}{2}=11$; right RSS $=(9-11)^2+(13-11)^2=4+4=8$.\nTotal split RSS $=2+8=10$.\nCompare with the parent (no split): mean $=\\frac{4+6+9+13}{4}=8$, RSS $=(4-8)^2+(6-8)^2+(9-8)^2+(13-8)^2=16+4+1+25=46$. The split cuts RSS from $46$ to $10$.",
      "tag": "Regression trees"
    },
    {
      "front": "For the node $\\{4,6,9,13\\}$, compare splitting as $\\{4,6\\}\\mid\\{9,13\\}$ versus $\\{4,6,9\\}\\mid\\{13\\}$ by RSS. Which split wins?",
      "back": "**Split A** $\\{4,6\\}\\mid\\{9,13\\}$: means $5$ and $11$; RSS $=2+8=10$ (from the prior card).\n**Split B** $\\{4,6,9\\}\\mid\\{13\\}$: left mean $=\\frac{4+6+9}{3}=6.333$, left RSS $=(4-6.333)^2+(6-6.333)^2+(9-6.333)^2\\approx5.444+0.111+7.111=12.667$; right is a single point, RSS $=0$. Total $\\approx12.667$.\nSplit A's RSS $10<12.667$, so the greedy algorithm chooses **Split A**.",
      "tag": "Regression trees"
    },
    {
      "front": "How does a **classification tree** make predictions and what is its natural error measure for *prediction*?",
      "back": "Each leaf predicts the **most commonly occurring class** (the majority class) among its training observations. The fitted class proportions $\\hat p_{mk}$ in a leaf also give estimated class probabilities.\nThe natural *prediction* error in region $m$ is the **classification error rate** $E=1-\\max_k \\hat p_{mk}$ — the fraction not in the majority class.",
      "tag": "Classification trees"
    },
    {
      "front": "Define the three **node-impurity measures** used for classification trees, where $\\hat p_{mk}$ is the proportion of class $k$ in node $m$.",
      "back": "**Classification error:** $E=1-\\max_k \\hat p_{mk}$.\n**Gini index:** $G=\\sum_{k=1}^{K}\\hat p_{mk}(1-\\hat p_{mk})$ — a measure of total variance / node \"purity.\"\n**Cross-entropy (deviance):** $D=-\\sum_{k=1}^{K}\\hat p_{mk}\\ln \\hat p_{mk}$.\nAll three are small when a node is pure (one class near proportion $1$) and large when classes are evenly mixed.",
      "tag": "Impurity measures"
    },
    {
      "front": "Why are the **Gini index** and **cross-entropy** preferred over the classification error rate for *growing* a classification tree?",
      "back": "Gini and cross-entropy are more **sensitive to node purity** — they respond to changes in the class proportions even when the predicted majority class does not change, so they reward splits that make nodes purer. The classification error $1-\\max_k\\hat p_{mk}$ is not sufficiently sensitive (it is flat unless the majority class flips), making it a poor splitting guide.\nThe classification error rate is, however, preferred when the goal is **prediction accuracy of the final pruned tree**.",
      "tag": "Impurity measures"
    },
    {
      "front": "For a two-class node with class-1 proportion $\\hat p$, write the Gini index, cross-entropy, and classification error as functions of $\\hat p$.",
      "back": "With proportions $\\hat p$ and $1-\\hat p$:\n**Gini:** $G=2\\hat p(1-\\hat p)$.\n**Cross-entropy:** $D=-\\hat p\\ln\\hat p-(1-\\hat p)\\ln(1-\\hat p)$.\n**Classification error:** $E=1-\\max(\\hat p,\\,1-\\hat p)$.\nAll three peak at $\\hat p=0.5$ (maximal impurity) and equal $0$ at $\\hat p=0$ or $\\hat p=1$ (a pure node).",
      "tag": "Impurity measures"
    },
    {
      "front": "A node has $30$ observations of class A and $10$ of class B. Compute the **Gini index**.",
      "back": "Proportions: $\\hat p_A=\\frac{30}{40}=0.75$, $\\hat p_B=\\frac{10}{40}=0.25$.\n$G=\\sum_k \\hat p_{k}(1-\\hat p_k)=0.75(0.25)+0.25(0.75)=0.1875+0.1875=0.375$.\nEquivalently $G=2(0.75)(0.25)=0.375$.",
      "tag": "Impurity measures"
    },
    {
      "front": "For the same node ($\\hat p_A=0.75$, $\\hat p_B=0.25$), compute the **cross-entropy** and the **classification error**.",
      "back": "**Cross-entropy:** $D=-[0.75\\ln 0.75 + 0.25\\ln 0.25]$. With $\\ln 0.75\\approx-0.287682$ and $\\ln 0.25\\approx-1.386294$:\n$D=-[0.75(-0.287682)+0.25(-1.386294)]=-[-0.215762-0.346574]=0.562335$.\n**Classification error:** $E=1-\\max(0.75,0.25)=1-0.75=0.25$.\nSo for this node $E=0.25$, $G=0.375$, $D\\approx0.562$.",
      "tag": "Impurity measures"
    },
    {
      "front": "A three-class node has counts $A=20$, $B=20$, $C=10$ (total $50$). Compute the **Gini index**.",
      "back": "Proportions: $\\hat p_A=0.4$, $\\hat p_B=0.4$, $\\hat p_C=0.2$.\n$G=\\sum_k \\hat p_k(1-\\hat p_k)=0.4(0.6)+0.4(0.6)+0.2(0.8)=0.24+0.24+0.16=0.64$.\nEquivalently $G=1-\\sum_k \\hat p_k^{2}=1-(0.16+0.16+0.04)=1-0.36=0.64$.",
      "tag": "Impurity measures"
    },
    {
      "front": "A parent node of $40$ observations is $\\hat p=0.5$ class-1. A split yields a left child ($20$ obs, $\\hat p=0.9$) and a right child ($20$ obs, $\\hat p=0.1$). Compute the **weighted Gini** after the split and the reduction.",
      "back": "Parent Gini $=2(0.5)(0.5)=0.5$.\nLeft Gini $=2(0.9)(0.1)=0.18$; right Gini $=2(0.1)(0.9)=0.18$.\nWeighted child Gini $=\\frac{20}{40}(0.18)+\\frac{20}{40}(0.18)=0.18$.\n**Reduction in impurity** $=0.5-0.18=0.32$. The split sharply improves purity.",
      "tag": "Impurity measures"
    },
    {
      "front": "Compare two candidate classification splits by weighted Gini. Split 1 children: ($\\hat p=0.8$, 25 obs) and ($\\hat p=0.4$, 25 obs). Split 2 children: ($\\hat p=1.0$, 10 obs) and ($\\hat p=0.5$, 40 obs). Which is chosen?",
      "back": "**Split 1:** Ginis $2(0.8)(0.2)=0.32$ and $2(0.4)(0.6)=0.48$; weighted $=\\frac{25}{50}(0.32)+\\frac{25}{50}(0.48)=0.16+0.24=0.40$.\n**Split 2:** Ginis $2(1.0)(0)=0$ and $2(0.5)(0.5)=0.5$; weighted $=\\frac{10}{50}(0)+\\frac{40}{50}(0.5)=0+0.4=0.40$.\nThe weighted Gini is $0.40$ for **both** — a tie; the algorithm would break it by the secondary criterion or pick either.",
      "tag": "Impurity measures"
    },
    {
      "front": "What problem arises if a tree is grown until every leaf is pure, and how does this manifest in bias and variance?",
      "back": "A fully grown tree fits the training data essentially perfectly but **overfits**: it has low bias but very **high variance**, so it generalizes poorly (high test error). Small changes in the data produce very different trees.\nThe remedy is to grow a large tree and then **prune** it back, or to stop early — trading a little training fit for much lower variance and better test performance.",
      "tag": "Pruning"
    },
    {
      "front": "Why is it better to grow a large tree and then prune, rather than stop splitting early (e.g. when RSS reduction falls below a threshold)?",
      "back": "Early stopping is too short-sighted: a split that yields little improvement *now* may be followed by a very valuable split below it. A threshold-based stop would never reach that good descendant split.\nGrowing a large tree first and **pruning back** lets the algorithm keep splits whose value only appears deeper, then remove branches that don't pay off — generally giving a better tree than greedy early stopping.",
      "tag": "Pruning"
    },
    {
      "front": "State the **cost-complexity (weakest-link) pruning** objective and the role of the tuning parameter $\\alpha$.",
      "back": "For a subtree $T$ with $|T|$ terminal nodes, minimize\n$\\sum_{m=1}^{|T|}\\sum_{i\\in R_m}(y_i-\\hat y_{R_m})^2+\\alpha|T|$,\ni.e. (training RSS) $+\\;\\alpha\\,(\\text{number of leaves})$.\nThe penalty $\\alpha\\ge0$ trades fit against tree size: $\\alpha=0$ keeps the full tree; as $\\alpha$ grows, leaves are pruned, yielding a nested sequence of subtrees. $\\alpha$ is chosen by **cross-validation**.",
      "tag": "Pruning"
    },
    {
      "front": "In cost-complexity pruning, how do **small** versus **large** values of $\\alpha$ affect the chosen tree, and how is $\\alpha$ selected?",
      "back": "**Small $\\alpha$:** little penalty per leaf, so a **large** tree is favored (low bias, higher variance — risk of overfit).\n**Large $\\alpha$:** each leaf is costly, so a **small** tree (even the root) is favored (lower variance, higher bias — risk of underfit).\n$\\alpha$ is tuned by **$K$-fold cross-validation**: compute CV error for the subtree corresponding to each $\\alpha$ in the pruning sequence and pick the $\\alpha$ minimizing CV error.",
      "tag": "Pruning"
    },
    {
      "front": "A subtree has training RSS $=120$ with $|T|=8$ leaves; pruning to $|T|=5$ raises RSS to $150$. At what $\\alpha$ are the two equally good under cost-complexity pruning?",
      "back": "Set the cost-complexity criteria equal:\n$120+\\alpha(8)=150+\\alpha(5)$.\n$120+8\\alpha=150+5\\alpha\\;\\Rightarrow\\;3\\alpha=30\\;\\Rightarrow\\;\\alpha=10$.\nFor $\\alpha>10$ the **smaller** $5$-leaf tree has the lower cost-complexity score (pruning is preferred); for $\\alpha<10$ the larger $8$-leaf tree wins.",
      "tag": "Pruning"
    },
    {
      "front": "Two subtrees are candidates at $\\alpha=4$: Tree X has RSS $=200$, $|T|=10$; Tree Y has RSS $=240$, $|T|=4$. Which does cost-complexity pruning select?",
      "back": "Cost-complexity score $=\\text{RSS}+\\alpha|T|$.\nTree X: $200+4(10)=200+40=240$.\nTree Y: $240+4(4)=240+16=256$.\nTree X's score $240<256$, so **Tree X** is selected at $\\alpha=4$.",
      "tag": "Pruning"
    },
    {
      "front": "List the **advantages** and **disadvantages** of a single decision tree versus linear/regression models.",
      "back": "**Advantages:** very easy to explain and visualize; can be displayed graphically; handle qualitative predictors without dummy coding; mirror human decision-making; naturally capture interactions and non-linear effects.\n**Disadvantages:** generally **lower predictive accuracy** than the best alternatives; **non-robust** — a small data change can produce a very different tree (high variance). Linear regression is preferable when the true relationship is approximately linear.\nEnsembles (bagging, random forests, boosting) recover much of the lost accuracy at the cost of interpretability.",
      "tag": "Regression trees"
    },
    {
      "front": "Define **bagging** (bootstrap aggregation) and explain why it reduces variance.",
      "back": "Bagging fits the same model (a deep, unpruned tree) on $B$ **bootstrap samples** of the training data and aggregates: for regression it **averages** the $B$ predictions $\\hat f_{\\text{bag}}(x)=\\frac{1}{B}\\sum_{b=1}^{B}\\hat f^{*b}(x)$; for classification it takes a **majority vote**.\nAveraging a set of independent quantities each with variance $\\sigma^2$ gives variance $\\sigma^2/B$, so combining many high-variance, low-bias trees cuts variance while leaving bias roughly unchanged. The trees are grown deep and **not pruned**.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "What is the **out-of-bag (OOB) error**, and why does it approximate cross-validation \"for free\"?",
      "back": "Each bagged tree uses a bootstrap sample, so on average about $\\tfrac{1}{3}$ of observations are left out (out-of-bag) for that tree. To get the OOB prediction for observation $i$, average (or vote) only over the trees in which $i$ was **OOB**; the OOB error is the resulting prediction error across all $i$.\nBecause each observation is predicted only by trees that never saw it, the OOB error is a valid estimate of test error — essentially leave-one-out/CV-like accuracy with no extra refitting.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "Why is roughly $\\tfrac{2}{3}$ of the data \"in-bag\" and $\\tfrac{1}{3}$ \"out-of-bag\" for each bootstrap tree? Show the limiting probability.",
      "back": "A bootstrap sample draws $n$ observations with replacement. The probability a specific observation is **not** drawn in one pick is $1-\\frac1n$, so over $n$ picks it is $\\left(1-\\frac1n\\right)^{n}$.\nAs $n\\to\\infty$, $\\left(1-\\frac1n\\right)^{n}\\to e^{-1}\\approx0.368$.\nThus about $36.8\\%$ of observations are **out-of-bag** and about $63.2\\%$ are **in-bag** for any given tree.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "With $n=10$, compute the probability that a specific observation is **out-of-bag** for a given bootstrap sample.",
      "back": "$P(\\text{not drawn in one pick})=1-\\frac{1}{10}=0.9$.\nOver $n=10$ independent picks with replacement: $P(\\text{OOB})=(0.9)^{10}$.\n$(0.9)^{10}\\approx0.3487$.\nThis is already close to the large-sample limit $e^{-1}\\approx0.3679$, confirming roughly a third of observations sit out of each tree.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "How does a **random forest** differ from bagging, and what does the choice $m\\approx\\sqrt{p}$ accomplish?",
      "back": "A random forest is bagging plus an extra randomization: at **each split**, only a fresh random subset of $m$ of the $p$ predictors is considered as split candidates (typically $m\\approx\\sqrt{p}$ for classification, $m\\approx p/3$ for regression).\nRestricting the candidates **decorrelates** the trees — without it, one dominant predictor would be the top split in almost every tree, making the trees highly correlated and limiting variance reduction. Decorrelated trees average to a lower-variance ensemble than bagging.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "Why does averaging *correlated* trees reduce variance less, and how does the formula $\\rho\\sigma^2+\\frac{1-\\rho}{B}\\sigma^2$ make the random-forest motivation precise?",
      "back": "If $B$ trees each have variance $\\sigma^2$ and pairwise correlation $\\rho$, the variance of their average is\n$\\rho\\sigma^{2}+\\frac{1-\\rho}{B}\\sigma^{2}$.\nAs $B\\to\\infty$ the second term vanishes but the first, $\\rho\\sigma^2$, remains — so high correlation $\\rho$ caps the achievable variance reduction. Random forests **lower $\\rho$** by limiting each split to $m<p$ predictors, shrinking the floor $\\rho\\sigma^2$ and so reducing the ensemble variance below that of bagging.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "A classification problem has $p=16$ predictors. State the typical random-forest $m$ and contrast it with $m=p$.",
      "back": "For classification the default is $m\\approx\\sqrt{p}=\\sqrt{16}=4$ predictors considered at each split.\nIf instead $m=p=16$ (all predictors eligible at every split), the random forest reduces to ordinary **bagging** — the trees stay correlated because the strongest predictor tends to be chosen repeatedly. Choosing $m=4$ decorrelates the trees and usually lowers test error.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "A regression random forest has $p=30$ predictors. What $m$ is typically used, and what would $m=\\sqrt{p}$ give instead?",
      "back": "For **regression** forests the usual default is $m\\approx p/3=\\frac{30}{3}=10$ predictors per split.\nThe classification rule of thumb $m\\approx\\sqrt{p}=\\sqrt{30}\\approx5.5\\to6$ would be smaller. The exact $m$ is a tuning parameter — pick it by OOB or cross-validation error — but $p/3$ is the standard regression starting point.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "A bagged regression ensemble has $B=5$ trees predicting $\\{12,14,11,15,13\\}$ for a new observation. Give the bagged prediction.",
      "back": "Bagging averages the individual tree predictions:\n$\\hat f_{\\text{bag}}(x)=\\frac{12+14+11+15+13}{5}=\\frac{65}{5}=13$.\nThe ensemble predicts $13$ — a smoother, lower-variance estimate than any single tree.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "A random forest of $B=7$ classification trees votes $\\{A,A,B,A,B,A,A\\}$ for a new observation. Give the ensemble class and its estimated probability of class A.",
      "back": "Count the votes: A appears $5$ times, B appears $2$ times.\n**Majority vote** $\\Rightarrow$ predicted class **A**.\nEstimated probability of A $=\\frac{5}{7}\\approx0.714$ (the fraction of trees voting A).",
      "tag": "Bagging & random forests"
    },
    {
      "front": "How is **variable (predictor) importance** measured in bagging and random forests, given that the ensemble itself is hard to interpret?",
      "back": "**For bagged/RF regression trees:** record the total decrease in **RSS** due to splits on predictor $X_j$, averaged over the $B$ trees — large total RSS reduction means an important variable.\n**For classification trees:** record the total decrease in the **Gini index** from splits on $X_j$, averaged over the $B$ trees.\nThe results are usually plotted as a relative-importance ranking, recovering some interpretability lost when moving from a single tree to an ensemble.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "Predictor $X_1$ produces Gini decreases of $0.30$, $0.20$, and $0.40$ across the splits it makes in a $3$-tree forest (one split per tree). Predictor $X_2$ gives $0.10$, $0.15$, $0.05$. Rank their importance.",
      "back": "Average the impurity (Gini) decrease over the $B=3$ trees:\n$X_1:\\ \\frac{0.30+0.20+0.40}{3}=\\frac{0.90}{3}=0.30$.\n$X_2:\\ \\frac{0.10+0.15+0.05}{3}=\\frac{0.30}{3}=0.10$.\nMean Gini decrease is larger for $X_1$ ($0.30>0.10$), so $X_1$ is the **more important** predictor.",
      "tag": "Bagging & random forests"
    },
    {
      "front": "Describe **boosting** for regression trees and contrast its core idea with bagging.",
      "back": "Boosting grows trees **sequentially**: each new tree is fit to the **current residuals** of the model rather than to the response $Y$, then a shrunken version is added to the running fit. Trees are typically **small** (few splits), so the model learns slowly and improves where it currently fits poorly.\nUnlike bagging — independent trees on bootstrap samples, combined by averaging to cut *variance* — boosting builds dependent trees on the full data to slowly reduce *bias*. No bootstrapping is used.",
      "tag": "Boosting"
    },
    {
      "front": "Write the boosting algorithm for regression trees in terms of $\\hat f$, the residuals, and the shrinkage $\\lambda$.",
      "back": "1. Set $\\hat f(x)=0$ and residuals $r_i=y_i$ for all $i$.\n2. For $b=1,\\dots,B$: fit a tree $\\hat f^{b}$ with $d$ splits to the data $(X,r)$; update\n$\\hat f(x)\\leftarrow \\hat f(x)+\\lambda\\,\\hat f^{b}(x)$ and $r_i\\leftarrow r_i-\\lambda\\,\\hat f^{b}(x_i)$.\n3. Output $\\hat f(x)=\\sum_{b=1}^{B}\\lambda\\,\\hat f^{b}(x)$.\nThe shrinkage $\\lambda$ (small, e.g. $0.01$–$0.1$) controls the learning rate.",
      "tag": "Boosting"
    },
    {
      "front": "Name the three **tuning parameters** of boosting and the danger of getting each wrong.",
      "back": "**$B$ — number of trees:** boosting *can* overfit if $B$ is too large (unlike bagging/RF, where large $B$ is safe); choose $B$ by cross-validation.\n**$\\lambda$ — shrinkage / learning rate:** small $\\lambda$ (e.g. $0.001$–$0.1$) learns slowly and usually generalizes better, but needs a larger $B$ to reach good fit.\n**$d$ — interaction depth (splits per tree):** controls complexity; $d=1$ gives a \"stump\" (additive model, no interactions), larger $d$ captures higher-order interactions.",
      "tag": "Boosting"
    },
    {
      "front": "Why are **small trees** (low $d$, often stumps with $d=1$) effective for boosting, and what does $d$ represent?",
      "back": "Because boosting fits trees sequentially to residuals, each tree only needs to nudge the fit a little, so a small tree suffices and the ensemble **learns slowly** — which guards against overfitting.\n$d$ is the **interaction depth**: a tree with $d$ splits can involve at most $d$ variables, so $d$ bounds the order of interactions the model can represent. $d=1$ (a stump) yields a purely **additive** model; raising $d$ permits $2$-way, $3$-way, … interactions.",
      "tag": "Boosting"
    },
    {
      "front": "Boosting with $\\lambda=0.1$ starts at $\\hat f=0$. The first tree predicts $+8$ for an observation whose true $y=20$. Give the updated prediction and the new residual.",
      "back": "Update the model by the shrunken tree prediction:\n$\\hat f(x)\\leftarrow 0+\\lambda(8)=0.1(8)=0.8$.\nNew residual $=y-\\hat f(x)=20-0.8=19.2$ (equivalently $r\\leftarrow 20-\\lambda\\hat f^1=20-0.8=19.2$).\nThe model has moved only slightly toward $20$ because of the small learning rate — subsequent trees keep chipping away at the residual.",
      "tag": "Boosting"
    },
    {
      "front": "Continuing the boosting example ($\\lambda=0.1$, current $\\hat f=0.8$, residual $19.2$, true $y=20$), the second tree fit to residuals predicts $+15$. Update $\\hat f$ and the residual.",
      "back": "Add the shrunken second-tree prediction:\n$\\hat f(x)\\leftarrow 0.8+\\lambda(15)=0.8+0.1(15)=0.8+1.5=2.3$.\nNew residual $=20-2.3=17.7$ (i.e. $r\\leftarrow 19.2-0.1(15)=19.2-1.5=17.7$).\nThe fit creeps toward the target; many such small additive steps are what make boosting accurate.",
      "tag": "Boosting"
    },
    {
      "front": "Summarize the **bias–variance and interpretability trade-off** across a single tree, bagging, random forests, and boosting.",
      "back": "**Single tree:** most interpretable; low bias but **high variance** (overfits, non-robust).\n**Bagging:** averages many deep trees to cut **variance**; loses easy interpretability but gains the OOB error estimate.\n**Random forest:** bagging plus $m<p$ split candidates to **decorrelate** trees, lowering variance further; tuning $m$.\n**Boosting:** sequential small trees on residuals slowly reduce **bias**; most flexible and often most accurate, but can overfit if $B$ too large and has three parameters ($B,\\lambda,d$) to tune. All three ensembles trade interpretability for accuracy, partly recovered via variable-importance plots.",
      "tag": "Boosting"
    }
  ]
}