{
  "deckName": "Exam SRM — Clustering & KNN",
  "examCode": "Exam SRM",
  "cards": [
    {
      "front": "What is the goal of **K-means clustering**, and what objective does it minimize?",
      "back": "K-means partitions $n$ observations into $K$ **non-overlapping** clusters $C_1,\\dots,C_K$ so as to minimize the **total within-cluster variation**:\n$\\min \\sum_{k=1}^{K} W(C_k)$, where a common choice is\n$W(C_k)=\\sum_{i\\in C_k}\\lVert x_i-\\bar x_k\\rVert^{2}$ (sum of squared Euclidean distances of each point to its cluster centroid $\\bar x_k$).\nEvery observation belongs to exactly one cluster and clusters do not overlap.",
      "tag": "K-means"
    },
    {
      "front": "How is the **pairwise** within-cluster variation $\\frac{1}{|C_k|}\\sum_{i,i'\\in C_k}\\lVert x_i-x_{i'}\\rVert^{2}$ related to the **centroid** form $\\sum_{i\\in C_k}\\lVert x_i-\\bar x_k\\rVert^{2}$?",
      "back": "They are **proportional**, not literally equal. The ISLR identity is $\\frac{1}{|C_k|}\\sum_{i,i'\\in C_k}\\sum_{j}(x_{ij}-x_{i'j})^{2}=2\\sum_{i\\in C_k}\\lVert x_i-\\bar x_k\\rVert^{2}$, i.e. the pairwise form is exactly **twice** the centroid form. Because they differ only by the constant factor $2$, minimizing one minimizes the other, and in both the cluster **mean** $\\bar x_k$ is the optimal center — which is why K-means updates centroids to cluster means.",
      "tag": "K-means"
    },
    {
      "front": "State the **K-means algorithm** (Lloyd's algorithm) step by step.",
      "back": "1. **Initialize:** randomly assign each observation to one of $K$ clusters (or pick $K$ random centroids).\n2. **Update centroids:** compute each cluster centroid $\\bar x_k$ as the mean (per feature) of the points currently in cluster $k$.\n3. **Reassign:** assign each observation to the cluster whose centroid is **nearest** in Euclidean distance.\n4. **Repeat** steps 2–3 until assignments stop changing (convergence).\nThe objective $\\sum_k W(C_k)$ decreases at every step, so the algorithm converges to a **local** optimum.",
      "tag": "K-means"
    },
    {
      "front": "Why must K-means be **run multiple times with different random starts**?",
      "back": "Lloyd's algorithm only guarantees a **local** minimum of the total within-cluster variation, and the result depends on the random initial assignment. Different starts can give different final clusterings. The standard remedy is to run the algorithm many times (e.g. $20$–$50$ random initializations) and keep the solution with the **smallest** total $\\sum_k W(C_k)$.",
      "tag": "K-means"
    },
    {
      "front": "How is the number of clusters $K$ chosen in K-means, and what limitation does this create?",
      "back": "$K$ must be **specified in advance** — it is not learned from the data. Analysts try several values and inspect a plot of total within-cluster variation versus $K$ (the **elbow method**), choosing the $K$ where adding more clusters yields little further reduction. The variation always decreases as $K$ rises (reaching $0$ when $K=n$), so it cannot be the sole criterion; domain judgment is required.",
      "tag": "K-means"
    },
    {
      "front": "**Worked centroid update.** Cluster $A$ contains the points $(1,2)$, $(3,4)$, $(2,0)$. Compute its centroid.",
      "back": "Average each coordinate separately.\nFirst coordinate: $\\bar x_1=\\frac{1+3+2}{3}=\\frac{6}{3}=2$.\nSecond coordinate: $\\bar x_2=\\frac{2+4+0}{3}=\\frac{6}{3}=2$.\nCentroid $\\bar x_A=(2,\\,2)$.\nThis mean vector becomes the reference point against which observations are reassigned in the next K-means iteration.",
      "tag": "K-means"
    },
    {
      "front": "**Worked K-means reassignment.** Centroids are $\\bar x_A=(1,1)$ and $\\bar x_B=(5,5)$. To which cluster is the point $(2,3)$ assigned?",
      "back": "Compute squared Euclidean distance to each centroid (no need to take square roots — the ordering is the same).\nTo $A$: $(2-1)^{2}+(3-1)^{2}=1+4=5$.\nTo $B$: $(2-5)^{2}+(3-5)^{2}=9+4=13$.\nSince $5<13$, the point $(2,3)$ is assigned to cluster $A$ (nearest centroid).",
      "tag": "K-means"
    },
    {
      "front": "**Worked one full K-means iteration.** With $K=2$, the current clusters are $A=\\{(0,0),(2,0)\\}$ and $B=\\{(6,0),(10,0)\\}$. Update the centroids, then check whether the point $(2,0)$ stays in $A$.",
      "back": "New centroids: $\\bar x_A=\\left(\\frac{0+2}{2},0\\right)=(1,0)$ and $\\bar x_B=\\left(\\frac{6+10}{2},0\\right)=(8,0)$.\nReassign $(2,0)$: distance$^2$ to $A=(2-1)^{2}=1$; to $B=(2-8)^{2}=36$. Since $1<36$, it **stays in $A$**.\nAll four points remain with their nearest centroid, so the assignments are stable — the algorithm has converged.",
      "tag": "K-means"
    },
    {
      "front": "**Worked total within-cluster variation.** A cluster has points $(1,0)$, $(3,0)$, $(5,0)$ with centroid $(3,0)$. Compute $W(C_k)=\\sum_{i\\in C_k}\\lVert x_i-\\bar x_k\\rVert^{2}$.",
      "back": "Squared distances to the centroid $(3,0)$:\n$(1-3)^{2}=4$, $(3-3)^{2}=0$, $(5-3)^{2}=4$.\n$W(C_k)=4+0+4=8$.\nIf this cluster were split so each point sat in its own cluster, $W$ would drop to $0$ — illustrating why total within-cluster variation always falls as $K$ increases.",
      "tag": "K-means"
    },
    {
      "front": "What is **hierarchical agglomerative clustering**, and how does it differ from K-means in specifying the number of clusters?",
      "back": "Agglomerative (bottom-up) hierarchical clustering starts with each observation as its **own cluster** and repeatedly **merges the two closest clusters** until a single cluster remains, recording the merge sequence as a **dendrogram**.\nUnlike K-means, it does **not** require $K$ to be fixed in advance: a single run produces a whole nested family of clusterings, and you choose the number of clusters afterward by **cutting** the dendrogram at a chosen height.",
      "tag": "Hierarchical clustering"
    },
    {
      "front": "State the steps of the **agglomerative hierarchical clustering** algorithm.",
      "back": "1. Begin with $n$ clusters, one per observation; compute all $\\binom{n}{2}$ pairwise dissimilarities.\n2. **Merge** the two clusters with the smallest inter-cluster dissimilarity into one.\n3. **Update** the dissimilarities between the new cluster and all remaining clusters using the chosen **linkage**.\n4. Repeat steps 2–3 until only one cluster remains.\nThe heights at which merges occur are recorded to draw the dendrogram.",
      "tag": "Hierarchical clustering"
    },
    {
      "front": "Define **complete linkage** and **single linkage**, and contrast the shape of clusters they tend to produce.",
      "back": "Given clusters $G$ and $H$, inter-cluster dissimilarity is:\n**Complete linkage:** the **maximum** pairwise distance, $\\max_{i\\in G,\\,j\\in H} d(i,j)$. Tends to produce compact, balanced, roughly spherical clusters.\n**Single linkage:** the **minimum** pairwise distance, $\\min_{i\\in G,\\,j\\in H} d(i,j)$. Can produce extended, \"chained\" clusters because a single close pair triggers a merge (the chaining phenomenon).",
      "tag": "Linkage & dissimilarity"
    },
    {
      "front": "Define **average linkage** and **centroid linkage**.",
      "back": "**Average linkage:** the **mean** of all pairwise distances between the two clusters, $\\frac{1}{|G||H|}\\sum_{i\\in G}\\sum_{j\\in H} d(i,j)$. A compromise between single and complete linkage.\n**Centroid linkage:** the distance between the two cluster **centroids**, $d(\\bar x_G,\\bar x_H)$. It can produce **inversions** (a merge at a lower height than an earlier merge), which makes its dendrogram harder to interpret.",
      "tag": "Linkage & dissimilarity"
    },
    {
      "front": "What is a dendrogram **inversion**, and which linkage is prone to it?",
      "back": "An inversion occurs when a later merge happens at a **lower** height than an earlier merge, so the dendrogram is no longer monotone and fusion heights cannot be read cleanly. **Centroid linkage** is the linkage prone to inversions, because merging two clusters can move the new centroid closer to another cluster than the merged pair were to each other. Complete, single, and average linkage do not produce inversions.",
      "tag": "Linkage & dissimilarity"
    },
    {
      "front": "Contrast **Euclidean distance** and **correlation-based dissimilarity** as inputs to clustering.",
      "back": "**Euclidean distance** $d(i,i')=\\sqrt{\\sum_j (x_{ij}-x_{i'j})^{2}}$ treats observations as close when their feature *values* are close in magnitude.\n**Correlation-based dissimilarity** treats two observations as similar when their feature *profiles* are highly correlated (move up and down together), ignoring overall level. Two observations can be far apart in Euclidean terms yet have correlation $1$. The right metric depends on whether magnitude or shape of the profile matters.",
      "tag": "Linkage & dissimilarity"
    },
    {
      "front": "**Worked linkage merge.** Three single-point clusters lie on a line at $P=0$, $Q=3$, $R=4$. Using **complete linkage**, which two merge first, and at what height?",
      "back": "Pairwise distances: $d(P,Q)=3$, $d(P,R)=4$, $d(Q,R)=1$.\nThe smallest is $d(Q,R)=1$, so $Q$ and $R$ merge **first at height $1$**.\nWith complete linkage the new cluster $\\{Q,R\\}$ sits at distance $\\max(d(P,Q),d(P,R))=\\max(3,4)=4$ from $P$, so the final merge of $P$ with $\\{Q,R\\}$ occurs at **height $4$**.",
      "tag": "Linkage & dissimilarity"
    },
    {
      "front": "**Worked single vs complete linkage.** For the same points $P=0$, $Q=3$, $R=4$ on a line, at what height does $P$ join $\\{Q,R\\}$ under **single** linkage versus **complete** linkage?",
      "back": "After $Q,R$ merge at height $1$, compute the distance from $P$ to cluster $\\{Q,R\\}$.\n**Single linkage:** $\\min(d(P,Q),d(P,R))=\\min(3,4)=3$ → final merge at height $\\mathbf{3}$.\n**Complete linkage:** $\\max(d(P,Q),d(P,R))=\\max(3,4)=4$ → final merge at height $\\mathbf{4}$.\nThe choice of linkage changes the fusion heights, hence where a dendrogram cut lands.",
      "tag": "Linkage & dissimilarity"
    },
    {
      "front": "**Worked average linkage update.** Clusters $G=\\{a,b\\}$ and $H=\\{c\\}$ have pairwise distances $d(a,c)=2$ and $d(b,c)=6$. Give the average-linkage dissimilarity between $G$ and $H$.",
      "back": "Average linkage averages all cross-cluster pairwise distances: there are $|G|\\cdot|H|=2\\cdot1=2$ such pairs.\n$d(G,H)=\\frac{d(a,c)+d(b,c)}{2}=\\frac{2+6}{2}=4$.\nFor comparison, single linkage would give $\\min(2,6)=2$ and complete linkage $\\max(2,6)=6$, so average linkage sits between them.",
      "tag": "Linkage & dissimilarity"
    },
    {
      "front": "**Worked centroid distance.** Cluster $G$ has centroid $(0,0)$ and cluster $H$ has centroid $(3,4)$. Compute the centroid-linkage dissimilarity.",
      "back": "Centroid linkage uses the Euclidean distance between the two centroids:\n$d(\\bar x_G,\\bar x_H)=\\sqrt{(3-0)^{2}+(4-0)^{2}}=\\sqrt{9+16}=\\sqrt{25}=5$.\nThis single number summarizes the gap between the clusters; if a third cluster's centroid were nearer than $5$ to either, it would merge before $G$ and $H$.",
      "tag": "Linkage & dissimilarity"
    },
    {
      "front": "How do you **read a dendrogram** to determine how similar two observations are?",
      "back": "Find the height at which the two observations are **first joined** into the same cluster (the lowest horizontal bar connecting their branches). The **lower** the fusion height, the more similar they are.\nCaution: proximity along the **horizontal axis** is *not* a measure of similarity — only the **vertical** fusion height is. Branches can be flipped at any node without changing the dendrogram's meaning.",
      "tag": "Dendrograms"
    },
    {
      "front": "How does **cutting a dendrogram** produce a specific number of clusters?",
      "back": "Drawing a **horizontal line** across the dendrogram at a chosen height and counting the number of vertical branches it crosses gives that many clusters; each set of leaves below a crossed branch forms one cluster.\nA **lower** cut yields **more, smaller** clusters; a **higher** cut yields **fewer, larger** clusters. One dendrogram thus encodes the entire nested family of clusterings from $1$ to $n$ clusters.",
      "tag": "Dendrograms"
    },
    {
      "front": "What does it mean that hierarchical clustering produces **nested** clusters, and when can that be a disadvantage?",
      "back": "The clusters at any cut are obtained by merging clusters from a lower cut, so every clustering is **nested** inside the coarser one above it. This is a disadvantage when the true grouping is **not** hierarchical — e.g. the best $2$-cluster split (by gender) and the best $3$-cluster split (by nationality) need not be nested. Forcing a hierarchy can then give worse results than K-means at the relevant $K$.",
      "tag": "Dendrograms"
    },
    {
      "front": "**Worked dendrogram cut.** A dendrogram merges $\\{A,B\\}$ at height $1$, $\\{C,D\\}$ at height $2$, then joins $\\{A,B\\}$ with $\\{C,D\\}$ at height $5$. Cutting at height $3$ gives how many clusters, and which ones?",
      "back": "A horizontal line at height $3$ lies **above** the merges at heights $1$ and $2$ but **below** the merge at height $5$. It crosses $2$ vertical branches.\nResult: $\\mathbf{2}$ clusters, namely $\\{A,B\\}$ and $\\{C,D\\}$.\nCutting below height $1$ would give $4$ singleton clusters; cutting above height $5$ gives $1$ cluster.",
      "tag": "Dendrograms"
    },
    {
      "front": "**Worked dendrogram reading.** Leaves $W,X$ fuse at height $0.5$, $Y,Z$ fuse at height $0.8$, and the two pairs fuse at height $3.0$. Which pair of observations is most similar, and what is the dissimilarity between $W$ and $Z$?",
      "back": "Most similar = lowest fusion height. $W,X$ fuse at $0.5$, lower than $Y,Z$'s $0.8$, so $\\mathbf{W}$ **and** $\\mathbf{X}$ are the most similar pair.\n$W$ and $Z$ are in different sub-trees that only join at the top, so their dissimilarity is the **height at which they first share a cluster** $=\\mathbf{3.0}$.",
      "tag": "Dendrograms"
    },
    {
      "front": "Why should features usually be **standardized** before clustering or KNN?",
      "back": "Distance-based methods sum squared differences across features, so a feature measured on a **large numerical scale** (e.g. income in dollars) dominates one on a small scale (e.g. number of children) and effectively drives the clustering by itself. Standardizing each feature to mean $0$ and standard deviation $1$ via $z_{ij}=\\frac{x_{ij}-\\bar x_j}{s_j}$ puts all features on a comparable footing so each contributes equally to the distance.",
      "tag": "Scaling & bias-variance"
    },
    {
      "front": "**Worked standardization.** A feature has sample mean $\\bar x=50$ and sample standard deviation $s=10$. Compute the standardized value of an observation with $x=65$.",
      "back": "$z=\\frac{x-\\bar x}{s}=\\frac{65-50}{10}=\\frac{15}{10}=1.5$.\nThe observation lies $1.5$ standard deviations above the mean. After standardizing, every feature contributes on the same $z$-scale to any Euclidean distance, so no single high-variance feature dominates the clustering or nearest-neighbor vote.",
      "tag": "Scaling & bias-variance"
    },
    {
      "front": "Describe the **K-nearest-neighbors (KNN) classifier**.",
      "back": "To classify a new point $x_0$: find the $K$ training observations **closest** to $x_0$ (typically by Euclidean distance), call this neighborhood $\\mathcal N_0$, then assign the class that holds a **majority vote** among those neighbors. Formally the estimated probability of class $j$ is\n$\\hat\\Pr(Y=j\\mid X=x_0)=\\frac{1}{K}\\sum_{i\\in\\mathcal N_0} I(y_i=j)$,\nand $x_0$ is labeled with the class of largest estimated probability.",
      "tag": "K-nearest neighbors"
    },
    {
      "front": "How does **KNN regression** differ from KNN classification?",
      "back": "KNN **regression** predicts a quantitative response by **averaging** the responses of the $K$ nearest neighbors:\n$\\hat f(x_0)=\\frac{1}{K}\\sum_{i\\in\\mathcal N_0} y_i$.\nKNN **classification** instead takes a **majority vote** of the neighbors' class labels. Both rely on the same neighborhood $\\mathcal N_0$; only the aggregation step (mean vs. vote) differs.",
      "tag": "K-nearest neighbors"
    },
    {
      "front": "Why is KNN called a **non-parametric** method, and what does that imply?",
      "back": "KNN makes **no assumption about the form** of the decision boundary or regression function — it does not estimate parameters like a regression's $\\hat\\beta$. It simply stores the training data and looks up neighbors at prediction time (a \"lazy learner\"). This flexibility lets it capture highly non-linear boundaries, but it gives no interpretable coefficients, needs the full training set at prediction time, and is sensitive to feature scaling and to irrelevant features.",
      "tag": "K-nearest neighbors"
    },
    {
      "front": "How does the choice of $K$ in KNN control the **bias-variance tradeoff**?",
      "back": "**Small $K$** (e.g. $K=1$) gives a very **flexible** fit: low bias but high variance — the decision boundary is jagged and overfits noise.\n**Large $K$** averages over more neighbors, giving a **smoother** boundary: higher bias but lower variance, and it can underfit.\nAs $1/K$ increases, flexibility increases. The best $K$ minimizes test error and is typically chosen by cross-validation.",
      "tag": "Scaling & bias-variance"
    },
    {
      "front": "What is the **curse of dimensionality** for KNN?",
      "back": "As the number of features $p$ grows, the training data become **sparse** — even the \"nearest\" neighbors of a point are far away, so they are no longer local and the prediction degrades. The volume needed to capture a fixed fraction of observations grows so fast that neighborhoods are no longer truly nearby. KNN therefore performs poorly in high dimensions unless the number of relevant features is small or dimension is reduced first.",
      "tag": "Scaling & bias-variance"
    },
    {
      "front": "Why does adding **irrelevant features** hurt KNN more than it hurts a parametric model like linear regression?",
      "back": "KNN's distance sums squared differences over **all** features equally, so an irrelevant feature injects noise into every distance and can swamp the relevant features, distorting which points count as \"nearest.\" A parametric model can shrink or zero-out an irrelevant predictor's coefficient and largely ignore it. KNN has no such mechanism, so feature selection and scaling matter a great deal.",
      "tag": "Scaling & bias-variance"
    },
    {
      "front": "**Worked KNN classification vote ($K=3$).** A new point sits at distances $1.0$ (class A), $1.5$ (class B), $2.0$ (class A), $3.0$ (class B), $5.0$ (class A) from five training points. Classify it with $K=3$.",
      "back": "Take the $K=3$ **nearest** points by distance: $1.0$ (A), $1.5$ (B), $2.0$ (A).\nVote: class A appears $2$ times, class B appears $1$ time.\nMajority is **class A**, so the new point is classified as **A**.\nEstimated probability $\\hat\\Pr(A)=\\frac{2}{3}\\approx 0.67$.",
      "tag": "K-nearest neighbors"
    },
    {
      "front": "**Worked KNN sensitivity to $K$.** Using the same five points — distances $1.0$ (A), $1.5$ (B), $2.0$ (A), $3.0$ (B), $5.0$ (A) — classify the new point with $K=5$. Does the label change from the $K=3$ answer?",
      "back": "With $K=5$ all five neighbors vote: classes are A, B, A, B, A → class A appears $3$ times, class B $2$ times.\nMajority is **class A** ($\\hat\\Pr(A)=\\frac{3}{5}=0.6$).\nHere the label is still A, but the estimated probability fell from $0.67$ ($K=3$) to $0.60$ ($K=5$) — larger $K$ smooths the estimate toward the overall class mix.",
      "tag": "K-nearest neighbors"
    },
    {
      "front": "**Worked KNN regression ($K=3$).** The three nearest neighbors of a new point have responses $y=10$, $y=14$, $y=18$. Give the KNN regression prediction.",
      "back": "KNN regression averages the neighbors' responses:\n$\\hat f(x_0)=\\frac{10+14+18}{3}=\\frac{42}{3}=14$.\nWith $K=1$ the prediction would instead be just the single nearest response (e.g. $10$), illustrating how a larger $K$ smooths the fitted surface by averaging.",
      "tag": "K-nearest neighbors"
    },
    {
      "front": "**Worked nearest-neighbor distances.** A query point is at $(0,0)$. Find its single nearest neighbor among training points $P_1=(1,2)$, $P_2=(2,2)$, $P_3=(3,0)$.",
      "back": "Compute squared Euclidean distances from $(0,0)$:\n$d^2(P_1)=1^{2}+2^{2}=5$,\n$d^2(P_2)=2^{2}+2^{2}=8$,\n$d^2(P_3)=3^{2}+0^{2}=9$.\nThe smallest is $5$ (at $P_1$), so the **$1$-nearest neighbor is $P_1=(1,2)$**, distance $\\sqrt{5}\\approx 2.236$.",
      "tag": "K-nearest neighbors"
    },
    {
      "front": "**Worked effect of NOT scaling in KNN.** Two features: age (years) and income (dollars). Query $(40,\\ 50000)$; candidate $X=(45,\\ 50100)$, candidate $Y=(60,\\ 50000)$. Which is the nearest neighbor on raw data, and why is that misleading?",
      "back": "Raw squared distances:\nTo $X$: $(40-45)^{2}+(50000-50100)^{2}=25+10000=10025$.\nTo $Y$: $(40-60)^{2}+(50000-50000)^{2}=400+0=400$.\n$Y$ is \"nearest\" ($400<10025$) only because the tiny \\$100 income gap dwarfs a $20$-year age gap on the raw scale. Income's large units dominate the distance. **Standardizing** both features first would let the $20$-year age difference register, giving a more sensible neighbor.",
      "tag": "Scaling & bias-variance"
    },
    {
      "front": "**Worked weighted vote / tie handling.** With $K=4$, a query's four nearest neighbors are classes A, A, B, B (a tie). How is this resolved, and what does it suggest about choosing $K$?",
      "back": "A plain majority vote ties $2$–$2$. Common tie-breakers: pick the class of the **single closest** neighbor, or use **distance-weighted** votes (closer neighbors count more). Choosing an **odd $K$** for a two-class problem avoids such ties entirely.\nThe tie shows why $K$ is often taken odd in binary classification and why distance weighting is sometimes preferred to a flat vote.",
      "tag": "K-nearest neighbors"
    },
    {
      "front": "Compare **K-means** and **hierarchical clustering** on three points: pre-specifying $K$, determinism, and output.",
      "back": "**Pre-specify $K$:** K-means **requires** $K$ up front; hierarchical does **not** ($K$ is chosen later by cutting).\n**Determinism:** K-means depends on a **random** start and finds a local optimum, so reruns can differ; hierarchical agglomerative clustering is **deterministic** given the data and linkage.\n**Output:** K-means returns one flat partition; hierarchical returns a **dendrogram** encoding all granularities.",
      "tag": "Hierarchical clustering"
    },
    {
      "front": "List key **practical decisions** that strongly affect any clustering result.",
      "back": "Clustering results are highly sensitive to:\n1. **Standardizing** the features (or not).\n2. The **dissimilarity measure** (Euclidean vs. correlation-based).\n3. For hierarchical: the **linkage** (complete/single/average/centroid) and where to **cut** the dendrogram.\n4. For K-means: the choice of **$K$** and the random initialization.\nBecause clustering is **unsupervised** there is no labeled \"truth\" to validate against, so results should be examined for robustness across these choices rather than reported as definitive.",
      "tag": "Hierarchical clustering"
    },
    {
      "front": "Why is **validating** a clustering hard, and what makes clustering fundamentally different from KNN?",
      "back": "Clustering is **unsupervised**: there is no response variable, so there is no test-set error to minimize and no objective way to confirm the clusters reflect real subgroups rather than noise. KNN, by contrast, is **supervised** — it uses labeled responses $y_i$, so its accuracy can be measured on held-out data and $K$ tuned by cross-validation. The two are grouped together only because both rely on a notion of **distance** between observations.",
      "tag": "Scaling & bias-variance"
    }
  ]
}