Exam SRM — Principal Components Analysis Flashcards

Principal components analysis for SOA Exam SRM: PCA as unsupervised dimension reduction, the first component as the normalized maximal-variance linear combination, loadings as eigenvectors and scores as projections, eigenvalues and the proportion-of-variance-explained, scree plots and the elbow, why variables are standardized (correlation-matrix PCA where the eigenvalues sum to $p$), biplot reading, and principal-components regression — with fully worked PVE, cumulative-PVE, and score computations.

44 cards · 6 topics

Principal components
What problem does **principal components analysis (PCA)** solve, and is it supervised or unsupervised?
PCA is an **unsupervised** technique: there is no response $Y$, only the $p$ features. Its goal is **dimension reduction** — summarizing a large set of correlated variables by a small number of new variables (the principal components) that capture as much of the data's variation as possible. The components are uncorrelated directions, ordered so the first carries the most variance. They are used for visualization, to remove multicollinearity, and as derived predictors (PCR).
Principal components
Write the **first principal component** $Z_1$ as a linear combination of the features and state the constraint on its coefficients.
$Z_1=\phi_{11}X_1+\phi_{21}X_2+\dots+\phi_{p1}X_p=\sum_{j=1}^{p}\phi_{j1}X_j$. The coefficients $\phi_{j1}$ are the **loadings**, collected in the loading vector $\phi_1=(\phi_{11},\dots,\phi_{p1})^{\top}$. They are **normalized** so that $\sum_{j=1}^{p}\phi_{j1}^{2}=1$. Without this constraint the variance could be inflated arbitrarily by scaling up the loadings.
Principal components
In words, what defines the **first principal component direction**?
Among all normalized linear combinations of the (centered) features, $Z_1$ is the one with the **largest sample variance** — it is the direction in feature space along which the data vary most. Equivalently, it is the line closest to the data in the sense that it minimizes the sum of squared perpendicular distances from the points to the line.
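Written as an optimization problem, the definition above can be sketched as follows. This assumes the features $x_{ij}$ are already centered and uses the $1/n$ variance convention (an assumption here; using $1/(n-1)$ gives the same direction):

$$\max_{\phi_{11},\dots,\phi_{p1}}\ \frac{1}{n}\sum_{i=1}^{n}\Bigl(\sum_{j=1}^{p}\phi_{j1}x_{ij}\Bigr)^{2}\quad\text{subject to}\quad\sum_{j=1}^{p}\phi_{j1}^{2}=1.$$

The inner sum is the score $z_{i1}$, so the objective is the sample variance of the first-component scores.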
Principal components
How is the **second** principal component defined relative to the first?
$Z_2=\sum_{j=1}^{p}\phi_{j2}X_j$ is the normalized linear combination of maximal variance **subject to being uncorrelated with $Z_1$**. Uncorrelatedness of the scores is equivalent to the loading vector $\phi_2$ being **orthogonal** to $\phi_1$ ($\phi_1^{\top}\phi_2=0$). Each later component is the max-variance direction orthogonal to all earlier ones.
Principal components
Why must the features be **centered** (mean-subtracted) before computing principal components?
PCA finds directions of maximal **variance**, which is measured about the mean. If the data are not centered, the leading direction would be pulled toward the overall mean vector rather than reflecting the spread of the data. So each column is replaced by $X_j-\bar X_j$ before the components are formed; the components then pass through the centroid of the cloud.
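A minimal sketch of the centering step in plain Python. The function name `center_columns` and the toy data are hypothetical, chosen only for illustration:

```python
def center_columns(rows):
    """Subtract each column's mean so every column has mean 0."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    centered = [[row[j] - means[j] for j in range(p)] for row in rows]
    return centered, means

# Hypothetical toy data: 3 observations, 2 features.
data = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
centered, means = center_columns(data)
print(means)     # column means
print(centered)  # each column now averages to zero
```

After centering, every column sums to zero, so the centroid of the cloud sits at the origin.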
Loadings & scores
Define the **loadings** and the **scores** in PCA and how they relate to the data matrix.
**Loadings** $\phi_{jm}$ are the weights defining each component; the loading vector $\phi_m$ is the **eigenvector** of the covariance (or correlation) matrix for the $m$-th largest eigenvalue. Loadings tell you which variables a component represents. **Scores** $z_{im}$ are the values of the component for each observation: $z_{im}=\sum_{j=1}^{p}\phi_{jm}\,x_{ij}$ (on centered data). Scores are the **projections** of the observations onto the loading direction.
Loadings & scores
How are the principal-component loadings obtained from the **covariance (or correlation) matrix**?
The loading vectors are the **eigenvectors** of the sample covariance matrix $\Sigma$ (or the correlation matrix when the data are standardized). The eigenvector for the largest eigenvalue is $\phi_1$, for the second-largest is $\phi_2$, and so on. The corresponding **eigenvalue** $\lambda_m$ equals the variance of the $m$-th score $Z_m$. The eigenvectors are orthonormal, matching the normalization and orthogonality of the components.
Loadings & scores
The first loading vector for two standardized variables is $\phi_1=(0.6,0.8)^{\top}$. Verify it is normalized and compute the **score** of an observation with standardized values $x_1=1.5$, $x_2=-0.5$.
Normalization check: $0.6^{2}+0.8^{2}=0.36+0.64=1$. Good — it is a unit vector. Score: $z_1=\phi_{11}x_1+\phi_{21}x_2=0.6(1.5)+0.8(-0.5)=0.90-0.40=0.50$. So this observation projects to $0.50$ along the first principal component.
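The same check and projection can be sketched in a few lines of Python. The helper names `is_unit_vector` and `score` are illustrative, and the tolerance is an assumption that allows for rounded loadings:

```python
import math

def is_unit_vector(loadings, tol=1e-6):
    """True if the squared loadings sum to 1 (within a rounding tolerance)."""
    return math.isclose(sum(w * w for w in loadings), 1.0, abs_tol=tol)

def score(loadings, x):
    """Project one centered/standardized observation onto a loading vector."""
    return sum(w * xj for w, xj in zip(loadings, x))

phi1 = [0.6, 0.8]
z1 = score(phi1, [1.5, -0.5])
print(is_unit_vector(phi1), round(z1, 6))
```

The score is just a dot product, so the same `score` helper applies to any number of variables.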
Loadings & scores
An observation has centered values $(x_1,x_2,x_3)=(2,-1,3)$ and the first loading vector is $\phi_1=(0.5,0.5,0.707107)^{\top}$. Compute its **first-component score**.
$z_1=0.5(2)+0.5(-1)+0.707107(3)=1.0-0.5+2.121321=2.621321$. Quick normalization check on the loadings: $0.5^{2}+0.5^{2}+0.707107^{2}=0.25+0.25+0.5=1.0$, so $\phi_1$ is a valid unit loading vector. The first-component score is about $2.62$.
Loadings & scores
Loadings $\phi_1=(0.707107,0.707107)^{\top}$ and $\phi_2=(0.707107,-0.707107)^{\top}$. Confirm the components are **orthogonal** and find the second score for centered $(x_1,x_2)=(4,2)$.
Orthogonality: $\phi_1^{\top}\phi_2=0.707107(0.707107)+0.707107(-0.707107)=0.5-0.5=0$. The loading vectors are orthogonal, so $Z_1$ and $Z_2$ are uncorrelated. Second score: $z_2=0.707107(4)+(-0.707107)(2)=2.828428-1.414214=1.414214$. (For reference $z_1=0.707107(4)+0.707107(2)=4.242642$.)
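As a quick numerical check of the card above, the sketch below computes the inner product of the two loading vectors and projects the observation onto both. The `dot` helper is an illustrative name:

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

phi1 = [0.707107, 0.707107]
phi2 = [0.707107, -0.707107]
x = [4.0, 2.0]  # centered observation

inner = dot(phi1, phi2)                         # 0 means orthogonal
scores = [dot(phi, x) for phi in (phi1, phi2)]  # [z1, z2]
print(inner, [round(z, 6) for z in scores])
```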
Loadings & scores
How do you interpret the **sign and magnitude** of a loading $\phi_{jm}$?
The **magnitude** $|\phi_{jm}|$ shows how strongly variable $X_j$ contributes to component $m$ — large loadings identify the variables that the component summarizes. The **sign** shows the direction of association: variables with same-sign loadings move together along that component; opposite signs mean the component contrasts those variables. Note the overall sign of a whole loading vector is arbitrary — flipping all signs gives an equally valid component (scores just flip sign).
Variance explained
What is the relationship between the **eigenvalue** $\lambda_m$ and the variance of component $Z_m$?
The eigenvalue $\lambda_m$ of the covariance/correlation matrix **is** the sample variance of the $m$-th principal-component score: $\operatorname{Var}(Z_m)=\lambda_m$. Because components are ordered by decreasing eigenvalue, $\lambda_1\ge\lambda_2\ge\dots\ge\lambda_p\ge 0$, so $Z_1$ has the largest variance, $Z_2$ the next, and so on.
Variance explained
State the **proportion of variance explained (PVE)** by the $m$-th principal component in terms of eigenvalues.
$\text{PVE}_m=\dfrac{\lambda_m}{\sum_{k=1}^{p}\lambda_k}=\dfrac{\operatorname{Var}(Z_m)}{\text{total variance}}$. The denominator $\sum_k\lambda_k$ is the total variance of all the variables (the trace of the covariance/correlation matrix). Each PVE lies in $[0,1]$ and they sum to $1$ across all $p$ components.
Variance explained
Why does $\sum_{k=1}^{p}\lambda_k$ equal the **total variance**, and what does it equal for **standardized** data?
The eigenvalues sum to the **trace** of the covariance/correlation matrix, which is the sum of the variables' variances — the total variance in the data. For **standardized** variables PCA uses the correlation matrix, whose diagonal entries are all $1$. With $p$ variables the trace is $p$, so $\sum_{k=1}^{p}\lambda_k=p$. Then $\text{PVE}_m=\lambda_m/p$.
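A short derivation of the trace identity, using the spectral decomposition. Here $\Phi$ denotes the orthogonal matrix whose columns are the loading vectors and $\Lambda$ the diagonal matrix of eigenvalues; both symbols are notation introduced for this sketch:

$$\operatorname{tr}(\Sigma)=\operatorname{tr}(\Phi\Lambda\Phi^{\top})=\operatorname{tr}(\Lambda\Phi^{\top}\Phi)=\operatorname{tr}(\Lambda)=\sum_{k=1}^{p}\lambda_k.$$

The second equality uses the cyclic property of the trace and the third uses $\Phi^{\top}\Phi=I$.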
Variance explained
A PCA on $4$ variables gives eigenvalues $\lambda=(2.5,\,1.0,\,0.4,\,0.1)$. Compute the **PVE of each component**.
Total variance $=2.5+1.0+0.4+0.1=4.0$ (as expected for $4$ standardized variables). $\text{PVE}_1=2.5/4.0=0.625$ (62.5%). $\text{PVE}_2=1.0/4.0=0.250$ (25.0%). $\text{PVE}_3=0.4/4.0=0.100$ (10.0%). $\text{PVE}_4=0.1/4.0=0.025$ (2.5%). They sum to $1$, confirming the calculation.
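The PVE arithmetic is a one-line normalization. A minimal sketch, with `pve` as an illustrative function name:

```python
def pve(eigenvalues):
    """Proportion of variance explained by each component."""
    total = sum(eigenvalues)
    return [lam / total for lam in eigenvalues]

shares = pve([2.5, 1.0, 0.4, 0.1])
print([round(s, 3) for s in shares])
```

The same function works for covariance-matrix eigenvalues, since it divides by whatever the eigenvalues sum to rather than by $p$.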
Variance explained
For the same eigenvalues $\lambda=(2.5,1.0,0.4,0.1)$, find the **cumulative PVE** of the first two components.
Cumulative PVE of components $1$–$2$ $=\dfrac{\lambda_1+\lambda_2}{\sum_k\lambda_k}=\dfrac{2.5+1.0}{4.0}=\dfrac{3.5}{4.0}=0.875$. So the first two principal components together explain **87.5%** of the total variance — usually enough to justify reducing the four variables to two components.
Choosing components
Eigenvalues from a $5$-variable correlation-matrix PCA are $\lambda=(2.8,1.1,0.6,0.3,0.2)$. How many components are needed to explain at least **90%** of the variance?
Total $=2.8+1.1+0.6+0.3+0.2=5.0$ (matches $p=5$). Cumulative PVE: PC1: $2.8/5=0.560$. PC1–2: $3.9/5=0.780$. PC1–3: $4.5/5=0.900$. Three components reach exactly $90\%$, so you need the **first three** principal components.
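A threshold rule like this one can be sketched with a running sum. The function name `components_needed` is illustrative, and the small tolerance `tol` is an assumption added so that a boundary case such as "exactly $90\%$" is not lost to floating-point rounding:

```python
from itertools import accumulate

def components_needed(eigenvalues, target, tol=1e-9):
    """Smallest number of leading components whose cumulative PVE >= target."""
    total = sum(eigenvalues)
    for m, running in enumerate(accumulate(eigenvalues), start=1):
        if running / total >= target - tol:
            return m
    return len(eigenvalues)

m = components_needed([2.8, 1.1, 0.6, 0.3, 0.2], 0.90)
print(m)
```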
Variance explained
A covariance-matrix PCA (variables **not** standardized) gives eigenvalues $\lambda=(40,8,2)$. Find the PVE of the first component.
Here the eigenvalues need not sum to $p$ — they sum to the total (unstandardized) variance. Total $=40+8+2=50$. $\text{PVE}_1=40/50=0.80$, i.e. the first component explains $80\%$ of the total variance. The first two together: $(40+8)/50=48/50=0.96$, or $96\%$.
Choosing components
What does a **scree plot** show, and how do you use the **"elbow"** to choose the number of components?
A scree plot graphs each component's variance explained (its PVE, or eigenvalue) against the component number, from largest to smallest. You look for an **elbow** — the point after which the curve flattens and additional components add little. Keep the components before the elbow and discard the rest, since the flat tail represents mostly noise. It is a visual, somewhat subjective rule; cumulative-PVE thresholds (e.g. $80$–$90\%$) are a common complement.
Choosing components
List the common rules for **deciding how many principal components to keep**.
1. **Cumulative-PVE threshold:** keep enough components to explain a target share (e.g. $80\%$ or $90\%$) of total variance.
2. **Scree-plot elbow:** keep components up to the kink where the curve levels off.
3. **Kaiser rule (correlation PCA):** keep components with $\lambda_m>1$, since each must explain more than a single standardized variable.
4. In **PCR**, choose the number of components by cross-validation against predictive error.

The choice is partly judgment; there is no single optimal answer.
Choosing components
Under the **Kaiser (eigenvalue-greater-than-one) rule**, how many components do you keep from correlation-matrix eigenvalues $\lambda=(2.8,1.1,0.6,0.3,0.2)$?
The Kaiser rule (valid for **correlation-matrix** PCA, where each variable contributes variance $1$) keeps components with $\lambda_m>1$. Here $\lambda_1=2.8>1$ and $\lambda_2=1.1>1$, but $\lambda_3=0.6<1$. So keep the **first two** components. Rationale: a component with $\lambda<1$ explains less variance than one original standardized variable, so it is not worth retaining.
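The Kaiser count is a simple filter on the eigenvalues. A minimal sketch, assuming the eigenvalues come from a correlation-matrix PCA (`kaiser_count` is an illustrative name):

```python
def kaiser_count(eigenvalues):
    """Number of components with eigenvalue strictly greater than 1."""
    return sum(1 for lam in eigenvalues if lam > 1.0)

kept = kaiser_count([2.8, 1.1, 0.6, 0.3, 0.2])
print(kept)
```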
Scaling
Why are variables usually **standardized** before PCA, and what changes if you do not?
Variance is **scale-dependent**: a variable measured in large units (e.g. dollars) has a huge raw variance and would dominate the first component purely because of its units. Standardizing (subtract the mean, divide by the SD) gives every variable variance $1$, so PCA runs on the **correlation matrix** and components reflect genuine co-variation, not measurement scale. Without standardization you do PCA on the **covariance matrix**, which is appropriate only when all variables share comparable units/scales.
Scaling
When is it acceptable (or preferable) to run PCA on the **covariance matrix** without standardizing?
When the variables are already in the **same units and on comparable scales**, so their relative variances are meaningful and you want larger-variance variables to count more (e.g. all measurements in the same currency, or pixel intensities). In that case standardizing would throw away real information about which variables vary most. Otherwise — mixed units or very different magnitudes — standardize and use the correlation matrix.
Scaling
Two variables have variances $100$ and $1$ and covariance $5$. Why would an **unstandardized** PCA be misleading here, and what fixes it?
The covariance matrix is $\begin{pmatrix}100 & 5\\ 5 & 1\end{pmatrix}$. The first variable's variance ($100$) swamps the second's ($1$), so the leading component is almost entirely the first variable — driven by its scale, not by any real structure. **Fix:** standardize. The correlation is $\rho=\frac{5}{\sqrt{100}\sqrt{1}}=0.5$, giving correlation matrix $\begin{pmatrix}1 & 0.5\\ 0.5 & 1\end{pmatrix}$, on which both variables contribute equally.
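The conversion from covariance to correlation divides each entry by the product of the two standard deviations. A minimal sketch for a matrix stored as nested lists (`cov_to_corr` is an illustrative name):

```python
import math

def cov_to_corr(cov):
    """Convert a covariance matrix (list of lists) to a correlation matrix."""
    p = len(cov)
    sd = [math.sqrt(cov[j][j]) for j in range(p)]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(p)] for i in range(p)]

corr = cov_to_corr([[100.0, 5.0], [5.0, 1.0]])
print(corr)
```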
Variance explained
For the standardized two-variable correlation matrix $\begin{pmatrix}1 & 0.5\\ 0.5 & 1\end{pmatrix}$, find the **eigenvalues** and the PVE of the first component.
For a $2\times 2$ correlation matrix with off-diagonal $\rho$, the eigenvalues are $1+\rho$ and $1-\rho$. Here $\lambda_1=1+0.5=1.5$ and $\lambda_2=1-0.5=0.5$. Check: they sum to $2=p$. $\text{PVE}_1=\frac{1.5}{2}=0.75$, so the first component explains $75\%$ of the (standardized) variance.
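For any symmetric $2\times 2$ matrix the eigenvalues have a closed form, which makes the contrast between the correlation matrix and the raw covariance matrix $\begin{pmatrix}100 & 5\\ 5 & 1\end{pmatrix}$ easy to check. A sketch, with `eig_sym_2x2` as an illustrative helper name:

```python
import math

def eig_sym_2x2(a, b, d):
    """Eigenvalues (largest first) of the symmetric matrix [[a, b], [b, d]]."""
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean + radius, mean - radius

# Correlation matrix with rho = 0.5: eigenvalues 1 + rho and 1 - rho.
lam1, lam2 = eig_sym_2x2(1.0, 0.5, 1.0)
pve1 = lam1 / (lam1 + lam2)

# Raw covariance matrix: the first component is dominated by the variable with variance 100.
raw1, raw2 = eig_sym_2x2(100.0, 5.0, 1.0)
raw_pve1 = raw1 / (raw1 + raw2)
print(lam1, lam2, pve1, round(raw_pve1, 3))
```

On the raw covariance matrix the first component takes more than $99\%$ of the variance, against $75\%$ after standardizing.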
Loadings & scores
For the same $2\times 2$ correlation matrix $\begin{pmatrix}1 & 0.5\\ 0.5 & 1\end{pmatrix}$, find the **first loading vector**.
By symmetry the eigenvector for $\lambda_1=1+\rho$ is proportional to $(1,1)$. Normalizing to unit length divides by $\sqrt{1^{2}+1^{2}}=\sqrt{2}$: $\phi_1=\left(\tfrac{1}{\sqrt2},\tfrac{1}{\sqrt2}\right)=(0.707107,\,0.707107)$. The second eigenvector (for $\lambda_2=1-\rho$) is $\phi_2=(0.707107,\,-0.707107)$ — orthogonal to $\phi_1$.
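The symmetry argument can be verified numerically. For a symmetric matrix $\begin{pmatrix}a & b\\ b & d\end{pmatrix}$ with $b\ne 0$, the vector $(b,\ \lambda-a)$ is an eigenvector for eigenvalue $\lambda$; the sketch below normalizes it to unit length (`unit_eigenvector_2x2` is an illustrative name, and the $b\ne 0$ restriction is an assumption of this shortcut):

```python
import math

def unit_eigenvector_2x2(a, b, lam):
    """Unit eigenvector of [[a, b], [b, d]] for eigenvalue lam, assuming b != 0."""
    v = (b, lam - a)
    norm = math.hypot(v[0], v[1])
    return v[0] / norm, v[1] / norm

phi1 = unit_eigenvector_2x2(1.0, 0.5, 1.5)  # eigenvalue 1 + rho
phi2 = unit_eigenvector_2x2(1.0, 0.5, 0.5)  # eigenvalue 1 - rho
print([round(w, 6) for w in phi1], [round(w, 6) for w in phi2])
```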
Loadings & scores
What is a **biplot** in PCA, and what do its two overlaid elements represent?
A biplot displays a PCA on a single set of axes (usually PC1 vs PC2):
- **Points** are the observation **scores** projected onto the first two components.
- **Arrows** (vectors) are the variable **loadings**: each original variable is drawn from the origin with coordinates given by its loadings on PC1 and PC2.

It lets you read clusters of observations and the variables driving the components at the same time.
Loadings & scores
How do you read the **loading arrows** on a biplot?
An arrow's **direction** shows which component(s) a variable aligns with: an arrow pointing along the PC1 axis loads mainly on PC1. Arrows pointing in **similar directions** correspond to positively correlated variables; arrows roughly **opposite** correspond to negatively correlated variables; arrows at right angles are roughly uncorrelated. An arrow's **length** reflects how well that variable is represented by the two plotted components.
PCR & applications
What is **principal components regression (PCR)**, and how does it use PCA?
PCR is a regression method that first runs PCA on the predictors, then fits an **ordinary least squares** regression of the response $Y$ on the first $M$ principal-component scores $Z_1,\dots,Z_M$ instead of on the original $p$ predictors. Because the components capture most of the predictors' variation in few dimensions, PCR reduces dimensionality and **multicollinearity**, often lowering variance at the cost of a little bias.
PCR & applications
What key **assumption** underlies PCR, and when can it fail?
PCR assumes that the directions in which the **predictors vary most** (the leading principal components) are also the directions most associated with the **response** $Y$. This is often a reasonable approximation in practice, but it is not guaranteed: a low-variance component (dropped by PCR) could still be the one that predicts $Y$. When that happens PCR performs poorly, and a supervised method like **partial least squares**, which uses $Y$ in forming components, may do better.
PCR & applications
Is PCR a method that performs **feature selection**? Explain.
No. Each principal component is a linear combination of **all** $p$ original predictors, so even using only $M<p$ components, every original variable still enters the model through the loadings. PCR therefore **shrinks/regularizes** the coefficient space (like ridge regression in spirit) rather than selecting a subset of features. Methods such as the lasso, not PCR, do genuine feature selection.
PCR & applications
How is the **number of components $M$** chosen in PCR, and what is the effect of moving $M$ from small to $p$?
$M$ is typically chosen by **cross-validation**, picking the $M$ that minimizes estimated test error.
- Small $M$: large bias, small variance (a very reduced model).
- $M=p$: PCR reproduces ordinary least squares on all predictors (no dimension reduction, no benefit).

The sweet spot is an intermediate $M$ that captures the predictive structure while discarding noisy low-variance directions.
Scaling
Why must predictors be **standardized before PCR** (just as in PCA)?
PCR builds its components from a PCA of the predictors, so it inherits PCA's scale sensitivity: without standardizing, high-variance (large-unit) predictors dominate the components regardless of their relevance. Standardizing each predictor to mean $0$, variance $1$ ensures the components reflect correlation structure rather than units, so the regression on the scores is not distorted by an arbitrary choice of measurement scale.
PCR & applications
A PCR model regresses $Y$ on the first two component scores: $\hat Y=12+3Z_1-2Z_2$. The loadings are $\phi_1=(0.6,0.8)$ and $\phi_2=(0.8,-0.6)$. Predict $Y$ for standardized predictors $(x_1,x_2)=(1,-1)$.
First compute the scores from the loadings: $Z_1=0.6(1)+0.8(-1)=0.6-0.8=-0.2$. $Z_2=0.8(1)+(-0.6)(-1)=0.8+0.6=1.4$. Then plug into the fitted equation: $\hat Y=12+3(-0.2)-2(1.4)=12-0.6-2.8=8.6$. The predicted response is $8.6$.
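The two-step prediction (scores first, then the fitted equation) can be sketched as below. The function name `pcr_predict` and its argument layout are illustrative; the numbers are those of the card:

```python
def pcr_predict(intercept, coefs, loadings, x):
    """Predict from a PCR fit: project x onto each loading vector, then apply the regression."""
    scores = [sum(w * xj for w, xj in zip(phi, x)) for phi in loadings]
    y_hat = intercept + sum(b * z for b, z in zip(coefs, scores))
    return y_hat, scores

y_hat, scores = pcr_predict(
    intercept=12.0,
    coefs=[3.0, -2.0],                   # coefficients on Z1, Z2
    loadings=[[0.6, 0.8], [0.8, -0.6]],  # phi_1, phi_2
    x=[1.0, -1.0],                       # standardized predictors
)
print([round(z, 6) for z in scores], round(y_hat, 6))
```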
Choosing components
Eigenvalues from a $6$-variable correlation PCA are $\lambda=(3.0,1.5,0.6,0.5,0.3,0.1)$. Build the **cumulative-PVE table** and apply an $80\%$ threshold.
Total $=3.0+1.5+0.6+0.5+0.3+0.1=6.0$ (matches $p=6$). PC1: $3.0/6=0.500$ → cum $0.500$. PC2: $1.5/6=0.250$ → cum $0.750$. PC3: $0.6/6=0.100$ → cum $0.850$. PC4: $0.5/6=0.083$ → cum $0.933$. The first component to push cumulative PVE past $80\%$ is **PC3** (cum $85.0\%$), so keep **three** components under an $80\%$ rule.
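A full cumulative-PVE table, including the rows the card stops short of, could be built as follows. The function name `cumulative_pve_table` and the tuple layout `(component, PVE, cumulative PVE)` are choices made for this sketch:

```python
from itertools import accumulate

def cumulative_pve_table(eigenvalues):
    """Rows of (component number, PVE, cumulative PVE)."""
    total = sum(eigenvalues)
    running = accumulate(eigenvalues)
    return [
        (m, lam / total, cum / total)
        for m, (lam, cum) in enumerate(zip(eigenvalues, running), start=1)
    ]

table = cumulative_pve_table([3.0, 1.5, 0.6, 0.5, 0.3, 0.1])
for m, share, cum in table:
    print(f"PC{m}: PVE={share:.3f}  cum={cum:.3f}")
```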
Variance explained
A PCA reports that PC1 explains $45\%$ and PC2 explains $30\%$ of total variance, with a total variance of $20$. Find the **eigenvalues** $\lambda_1$ and $\lambda_2$.
Since $\text{PVE}_m=\lambda_m/\sum_k\lambda_k$ and $\sum_k\lambda_k=20$: $\lambda_1=0.45\times 20=9.0$. $\lambda_2=0.30\times 20=6.0$. The first two components carry variance $9.0+6.0=15.0$ out of $20$, i.e. cumulative PVE $=15/20=0.75$ ($75\%$).
Variance explained
Compute the **PVE of the first component** directly from the scores: the first-component scores for $5$ observations are $z_{i1}=(-3,-1,0,1,3)$ and the total variance of all variables is $6.0$.
The variance explained by PC1 equals the (sample) variance of its scores. The scores have mean $0$. Sum of squares $=(-3)^{2}+(-1)^{2}+0^{2}+1^{2}+3^{2}=9+1+0+1+9=20$. Using the $1/n$ population convention, $\operatorname{Var}(Z_1)=20/5=4.0=\lambda_1$. $\text{PVE}_1=\dfrac{4.0}{6.0}\approx 0.667$, so PC1 explains about $66.7\%$ of the total variance.
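The calculation can be reproduced with a small variance helper that uses the same $1/n$ convention as the card (`population_variance` is an illustrative name):

```python
def population_variance(values):
    """Variance with the 1/n convention."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n

lam1 = population_variance([-3, -1, 0, 1, 3])  # variance of the PC1 scores
pve1 = lam1 / 6.0                              # total variance given as 6.0
print(lam1, round(pve1, 3))
```

The ratio is only meaningful when the total variance in the denominator was computed with the same convention as the score variance.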
Principal components
State two important **limitations** of PCA to keep in mind.
1. **Interpretability:** components are linear blends of all variables, so they often lack a clean real-world meaning, making results harder to explain than original-variable models.
2. **Unsupervised:** PCA ignores any response $Y$; the highest-variance directions need not be the most predictive ones.

It is also sensitive to scaling and to outliers, and it only captures **linear** structure.
Choosing components
Eigenvalues $\lambda=(4.2,1.3,0.9,0.4,0.2)$ from a $5$-variable PCA. Compute the variance **lost** if you keep only the first **two** components.
Total variance $=4.2+1.3+0.9+0.4+0.2=7.0$. Variance retained by PC1–2 $=4.2+1.3=5.5$, so cumulative PVE $=5.5/7.0\approx 0.786$ ($78.6\%$). Variance **lost** (the dropped components) $=0.9+0.4+0.2=1.5$, a proportion $1.5/7.0\approx 0.214$, i.e. about $21.4\%$ of the total variance is discarded.
Principal components
Explain why the principal components are **orthogonal/uncorrelated** and why that property is useful.
The loading vectors are eigenvectors of a symmetric matrix (covariance or correlation), and eigenvectors of a symmetric matrix for distinct eigenvalues are **orthogonal**. Orthogonal loadings make the score variables **uncorrelated**: $\operatorname{Cov}(Z_m,Z_{m'})=0$ for $m\ne m'$. Usefulness: the total variance splits cleanly across components (so PVEs add up), and using scores as predictors removes multicollinearity in PCR.
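The step from orthogonal loadings to uncorrelated scores can be written out in one line. With $\Sigma$ the covariance (or correlation) matrix of the centered features and $\Sigma\phi_{m'}=\lambda_{m'}\phi_{m'}$:

$$\operatorname{Cov}(Z_m,Z_{m'})=\phi_m^{\top}\Sigma\,\phi_{m'}=\lambda_{m'}\,\phi_m^{\top}\phi_{m'}=0\quad(m\ne m'),\qquad \operatorname{Var}(Z_m)=\phi_m^{\top}\Sigma\,\phi_m=\lambda_m.$$

The second identity is the same calculation with $m'=m$ and $\phi_m^{\top}\phi_m=1$.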
Loadings & scores
A standardized $3$-variable PCA gives first loadings $\phi_1=(0.58,0.58,0.58)$ approximately. What does an (almost) **equal-weight** first component tell you about the variables?
Equal positive loadings (each near $1/\sqrt3\approx0.577$) mean PC1 is essentially an **average / overall-size** dimension — all three variables move together, so the dominant source of variation is their common level. A later component with mixed signs (e.g. $(0.71,-0.71,0)$) would then represent a **contrast** between variables. This pattern (size then shape) is common when variables are strongly positively correlated.
Choosing components
Covariance-matrix eigenvalues are $\lambda=(60,25,10,5)$ (variables in the same units, not standardized). Find the cumulative PVE through **three** components and state how many to keep for a $90\%$ target.
Total $=60+25+10+5=100$, so each PVE is simply the eigenvalue read as a percent. PC1: $60\%$ → cum $60\%$. PC2: $25\%$ → cum $85\%$. PC3: $10\%$ → cum $95\%$. Three components reach $95\%\ge 90\%$ (two give only $85\%$), so keep the **first three** components.
Loadings & scores
Given first-component loadings $\phi_1=(0.5,0.5,0.5,0.5)$ for four standardized variables, compute the PC1 scores for two observations $A=(1,1,1,1)$ and $B=(2,0,-1,1)$.
Normalization check: $4\times 0.5^{2}=4(0.25)=1$, so $\phi_1$ is a unit vector. Score of $A$: $0.5(1+1+1+1)=0.5(4)=2.0$. Score of $B$: $0.5(2+0-1+1)=0.5(2)=1.0$. Since all loadings are equal and positive, the PC1 score is just $0.5$ times the sum of the (standardized) variables — an overall-magnitude index.
Principal components
Distinguish **PCA** from **clustering** as unsupervised methods.
Both are unsupervised, but they answer different questions. **PCA** seeks a **low-dimensional representation** of the observations that captures most of the variance — it summarizes the variables/directions. **Clustering** seeks **subgroups** of observations that are similar to each other — it partitions the rows. PCA reduces dimensions (columns/directions); clustering groups observations (rows). They are often used together (e.g. cluster on the leading principal-component scores).