Exam MAS-II — Bayesian Analysis Flashcards
Bayesian inference for CAS Exam MAS-II: Bayes' theorem and the prior-likelihood-posterior chain, the conjugate families (beta-binomial, gamma-Poisson, normal-normal, gamma-exponential) and reading off posterior means, the prior and posterior predictive distributions, Bayes estimators under squared-error, absolute-error and 0-1 loss, credible intervals and how they differ from confidence intervals, Jeffreys/noninformative priors, and MCMC (Metropolis-Hastings, Gibbs) for intractable posteriors plus hierarchical models — with fully worked conjugate updates and predictive calculations.
Browse all 44 cards as a list
- Bayes theorem & posteriors. Q: State **Bayes' theorem** for a parameter $\theta$ given data $x$, naming each piece. A: $\pi(\theta\mid x)=\frac{L(x\mid\theta)\,\pi(\theta)}{\int L(x\mid\theta)\,\pi(\theta)\,d\theta}\propto L(x\mid\theta)\,\pi(\theta)$. $\pi(\theta)$ = **prior** (belief before data); $L(x\mid\theta)$ = **likelihood** (model for the data given $\theta$); the denominator is the **marginal / normalizing constant** $m(x)=\int L(x\mid\theta)\pi(\theta)\,d\theta$; and $\pi(\theta\mid x)$ is the **posterior**. The denominator does not depend on $\theta$, so up to proportionality **posterior $\propto$ likelihood $\times$ prior**.
- Bayes theorem & posteriors. Q: Why can you usually work with the posterior **only up to proportionality** in $\theta$? A: The marginal $m(x)=\int L(x\mid\theta)\pi(\theta)\,d\theta$ is a constant in $\theta$; it only rescales the curve so it integrates to $1$. So you can drop any factor not involving $\theta$, recognize the **kernel** of a known density (e.g. a $\text{Beta}$ or $\text{Gamma}$ shape), and supply the normalizing constant from that family. This is exactly how conjugate updates are done by inspection without computing the integral.
- Bayes theorem & posteriors. Q: A coin has unknown $\theta=P(\text{heads})$. Prior: $\theta=0.4$ with prob $0.5$ and $\theta=0.7$ with prob $0.5$ (a discrete prior). You flip once and get **heads**. Find the posterior. A: Likelihood of heads is just $\theta$. Compute joint $=\pi(\theta)L$: $\theta=0.4$: $0.5\times 0.4 = 0.20$. $\theta=0.7$: $0.5\times 0.7 = 0.35$. Marginal $m=0.20+0.35=0.55$. Posterior: $\pi(0.4\mid H)=\frac{0.20}{0.55}\approx 0.3636$, $\pi(0.7\mid H)=\frac{0.35}{0.55}\approx 0.6364$. The heads outcome shifts weight toward the larger $\theta$.
- Bayes theorem & posteriors. Q: Two urns: $A$ has $\frac{1}{3}$ defective, $B$ has $\frac{1}{6}$ defective; an urn is picked at random (prior $0.5$ each). A drawn item is **defective**. Posterior probability it came from urn $A$? A: $P(A\mid D)=\frac{P(D\mid A)P(A)}{P(D\mid A)P(A)+P(D\mid B)P(B)}$. Numerator $=\frac{1}{3}\cdot\frac{1}{2}=\frac{1}{6}$. Other term $=\frac{1}{6}\cdot\frac{1}{2}=\frac{1}{12}$. $P(A\mid D)=\frac{1/6}{1/6+1/12}=\frac{1/6}{1/4}=\frac{2}{3}\approx 0.667$. Observing a defect favors the higher-defect urn $A$.
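
The coin and urn cards follow the same recipe: multiply prior by likelihood, then divide by the sum. A minimal Python sketch of that recipe is below; the function name `discrete_posterior` and the dictionary layout are illustrative choices, not part of the deck.

```python
def discrete_posterior(prior, likelihood):
    """Return posterior probabilities for a discrete prior.

    prior: dict mapping each hypothesis to its prior probability.
    likelihood: dict mapping each hypothesis to P(data | hypothesis).
    """
    joint = {h: prior[h] * likelihood[h] for h in prior}
    marginal = sum(joint.values())  # normalizing constant m(x)
    return {h: j / marginal for h, j in joint.items()}


# Coin card: theta is 0.4 or 0.7 with equal prior weight; one head observed,
# so the likelihood of the data under each theta is theta itself.
coin = discrete_posterior({0.4: 0.5, 0.7: 0.5}, {0.4: 0.4, 0.7: 0.7})

# Urn card: defect rates 1/3 (urn A) and 1/6 (urn B), equal prior weight.
urn = discrete_posterior({"A": 0.5, "B": 0.5}, {"A": 1 / 3, "B": 1 / 6})

print(round(coin[0.7], 4))  # 0.6364
print(round(urn["A"], 4))   # 0.6667
```
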
- Conjugate priors. Q: What does it mean for a prior to be **conjugate** to a likelihood, and why is it useful? A: A prior family is **conjugate** to a likelihood if the resulting posterior is in the **same family** as the prior; only the parameters change. The update is then closed-form: read off the new parameters and the posterior mean from the family's formulas, with no integration. The conjugate families on the syllabus are beta-binomial, gamma-Poisson, normal-normal (known variance), and gamma-exponential (gamma is conjugate for the rate).
- Conjugate priors. Q: Beta-binomial: with prior $\theta\sim\text{Beta}(\alpha,\beta)$ and $x$ successes in $n$ Bernoulli trials, give the posterior and its mean. A: Posterior: $\theta\mid x \sim \text{Beta}(\alpha+x,\ \beta+n-x)$. Posterior mean $=\frac{\alpha+x}{\alpha+\beta+n}$. The data add $x$ to the first ('success') shape and $n-x$ to the second ('failure') shape. The prior acts like $\alpha$ pseudo-successes and $\beta$ pseudo-failures already observed.
- Conjugate priors. Q: Prior $\theta\sim\text{Beta}(2,3)$ on a claim probability; in $n=10$ policies you observe $x=4$ claims. Find the posterior and its mean. A: Posterior $=\text{Beta}(\alpha+x,\ \beta+n-x)=\text{Beta}(2+4,\ 3+10-4)=\text{Beta}(6,9)$. Posterior mean $=\frac{6}{6+9}=\frac{6}{15}=0.40$. For comparison, the prior mean was $\frac{2}{5}=0.40$ and the sample proportion is $\frac{4}{10}=0.40$; the posterior mean is a weighted average of the two, so it is also $0.40$.
- Conjugate priors. Q: Gamma-Poisson: with prior $\lambda\sim\text{Gamma}(\alpha,\beta)$ (rate parameterization, mean $\alpha/\beta$) and $n$ Poisson counts $x_1,\dots,x_n$, give the posterior and its mean. A: Posterior: $\lambda\mid \mathbf{x}\sim\text{Gamma}\!\left(\alpha+\textstyle\sum_{i=1}^{n} x_i,\ \beta+n\right)$. Posterior mean $=\frac{\alpha+\sum x_i}{\beta+n}$. With the rate parameterization the data add the total count $\sum x_i$ to the shape and the exposure $n$ to the rate parameter $\beta$.
- Conjugate priors. Q: Prior $\lambda\sim\text{Gamma}(\alpha=3,\beta=2)$ (rate form) for a Poisson claim frequency. You observe $n=4$ years with counts $1,0,2,3$. Find the posterior mean. A: $\sum x_i = 1+0+2+3 = 6$, $n=4$. Posterior $=\text{Gamma}(\alpha+\sum x_i,\ \beta+n)=\text{Gamma}(3+6,\ 2+4)=\text{Gamma}(9,6)$. Posterior mean $=\frac{9}{6}=1.5$. The prior mean was $\frac{3}{2}=1.5$ and the sample mean is $\frac{6}{4}=1.5$; the posterior mean is a weighted average of the two, so the Bayes estimate under squared-error loss is also $1.5$.
- Conjugate priors. Q: Normal-normal (known variance $\sigma^2$): prior $\mu\sim N(\mu_0,\tau^2)$, data mean $\bar{x}$ from $n$ obs. Give the posterior. A: Posterior is normal: $\mu\mid\mathbf{x}\sim N(\mu_n,\ \sigma_n^2)$ with **precision** adding: $\frac{1}{\sigma_n^2}=\frac{1}{\tau^2}+\frac{n}{\sigma^2}$, and $\mu_n=\sigma_n^2\!\left(\frac{\mu_0}{\tau^2}+\frac{n\bar{x}}{\sigma^2}\right)$. The posterior mean is a precision-weighted average of the prior mean $\mu_0$ and the sample mean $\bar{x}$.
- Conjugate priors. Q: Known $\sigma^2=100$. Prior $\mu\sim N(\mu_0=50,\ \tau^2=25)$. You observe $n=4$ values with mean $\bar{x}=60$. Find the posterior mean and variance. A: Precisions: prior $\frac{1}{\tau^2}=\frac{1}{25}=0.04$; data $\frac{n}{\sigma^2}=\frac{4}{100}=0.04$. Posterior precision $=0.04+0.04=0.08$, so $\sigma_n^2=\frac{1}{0.08}=12.5$. $\mu_n=12.5\left(\frac{50}{25}+\frac{4(60)}{100}\right)=12.5(2.0+2.4)=12.5(4.4)=55$. Posterior: $N(55,\ 12.5)$, halfway between $50$ and $60$ since the two precisions are equal.
- Conjugate priors. Q: Gamma-exponential: prior $\lambda\sim\text{Gamma}(\alpha,\beta)$ (rate form) on the rate of an exponential, data $x_1,\dots,x_n$. Give the posterior. A: For $X_i\sim\text{Exp}(\text{rate }\lambda)$, the likelihood is $\lambda^{n}e^{-\lambda\sum x_i}$. Posterior $\propto \lambda^{\alpha+n-1}e^{-(\beta+\sum x_i)\lambda}$, i.e. $\lambda\mid\mathbf{x}\sim\text{Gamma}\!\left(\alpha+n,\ \beta+\textstyle\sum x_i\right)$. Posterior mean of the rate $=\frac{\alpha+n}{\beta+\sum x_i}$. The data add $n$ to the shape and the total $\sum x_i$ to the rate parameter.
- Conjugate priors. Q: Prior $\lambda\sim\text{Gamma}(\alpha=2,\beta=300)$ (rate form) for an exponential claim-severity rate. You observe $n=3$ claims of sizes $100,200,300$. Find the posterior mean of $\lambda$. A: $\sum x_i = 100+200+300 = 600$, $n=3$. Posterior $=\text{Gamma}(\alpha+n,\ \beta+\sum x_i)=\text{Gamma}(2+3,\ 300+600)=\text{Gamma}(5,900)$. Posterior mean of the rate $=\frac{5}{900}\approx 0.005556$. The plug-in reciprocal $1/E[\lambda\mid\text{data}]=\frac{900}{5}=180$ is **not** the posterior mean claim size, which is $E[1/\lambda\mid\text{data}]=\frac{\beta^*}{\alpha^*-1}=\frac{900}{4}=225$.
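
The four conjugate updates above reduce to parameter arithmetic, which the following Python sketch spells out and checks against the worked cards. The function names and argument order are hypothetical; the gamma priors use the rate parameterization, as in the cards.

```python
def beta_binomial(alpha, beta, x, n):
    """Beta(alpha, beta) prior, x successes in n trials -> posterior shapes."""
    return alpha + x, beta + n - x


def gamma_poisson(alpha, beta, counts):
    """Gamma(alpha, rate beta) prior, Poisson counts -> posterior (shape, rate)."""
    return alpha + sum(counts), beta + len(counts)


def gamma_exponential(alpha, beta, sizes):
    """Gamma(alpha, rate beta) prior on an exponential rate -> posterior (shape, rate)."""
    return alpha + len(sizes), beta + sum(sizes)


def normal_normal(mu0, tau2, sigma2, xbar, n):
    """N(mu0, tau2) prior, known data variance sigma2 -> posterior (mean, variance)."""
    precision = 1 / tau2 + n / sigma2      # prior and data precisions add
    var_n = 1 / precision
    mean_n = var_n * (mu0 / tau2 + n * xbar / sigma2)
    return mean_n, var_n


a, b = beta_binomial(2, 3, x=4, n=10)              # Beta(6, 9)
print(a, b, a / (a + b))                           # posterior mean 0.4

a, b = gamma_poisson(3, 2, [1, 0, 2, 3])           # Gamma(9, 6)
print(a, b, a / b)                                 # posterior mean 1.5

print(normal_normal(50, 25, 100, xbar=60, n=4))    # approximately (55, 12.5)

a, b = gamma_exponential(2, 300, [100, 200, 300])  # Gamma(5, 900)
print(a / b, b / (a - 1))                          # rate mean 5/900, mean claim size 225
```
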
- Predictive distribution. Q: What is the **prior predictive (marginal) distribution**, and how does it differ from the **posterior predictive**? A: **Prior predictive** of a new observation $\tilde{x}$, before seeing data: $m(\tilde{x})=\int f(\tilde{x}\mid\theta)\,\pi(\theta)\,d\theta$, which averages the model over the prior. **Posterior predictive**, after observing $x$: $f(\tilde{x}\mid x)=\int f(\tilde{x}\mid\theta)\,\pi(\theta\mid x)\,d\theta$, which averages the model over the **posterior**. Both integrate out $\theta$; the posterior predictive uses updated beliefs and is what you use to predict the next claim.
- Predictive distribution. Q: Why is the predictive distribution **wider** (more variable) than just plugging the posterior mean of $\theta$ into the model? A: The predictive distribution accounts for **two** sources of uncertainty: the natural variability of the observation given $\theta$ (process/model variance) **and** the remaining uncertainty about $\theta$ itself (parameter variance). By the law of total variance, $\text{Var}(\tilde{x}\mid x)=E[\text{Var}(\tilde{x}\mid\theta)\mid x]+\text{Var}(E[\tilde{x}\mid\theta]\mid x)$. Plugging in a single $\hat\theta$ drops the second term and understates uncertainty.
- Predictive distribution. Q: Discrete prior: claim count $X\sim\text{Poisson}(\lambda)$ with $\lambda=1$ (prob $0.6$) or $\lambda=3$ (prob $0.4$). Find the **prior predictive** probability of exactly $0$ claims. A: $P(X=0\mid\lambda)=e^{-\lambda}$. Mix over the prior: $P(X=0)=0.6\,e^{-1}+0.4\,e^{-3}$. $e^{-1}\approx 0.367879$, $e^{-3}\approx 0.049787$. $P(X=0)=0.6(0.367879)+0.4(0.049787)\approx 0.220728 + 0.019915 = 0.2406$.
- Predictive distribution. Q: Continuing: with the prior $\lambda=1$ (prob $0.6$) or $\lambda=3$ (prob $0.4$), you observe **one year with $0$ claims**. Find the posterior on $\lambda$, then the **posterior predictive** probability of $0$ claims next year. A: Posterior $\propto$ prior $\times e^{-\lambda}$: weights $0.6(0.367879)=0.220728$ and $0.4(0.049787)=0.019915$; sum $=0.240643$. Posterior: $P(\lambda=1\mid 0)=\frac{0.220728}{0.240643}\approx 0.91724$, $P(\lambda=3\mid 0)\approx 0.08276$. Posterior predictive of $0$: $0.91724\,e^{-1}+0.08276\,e^{-3}\approx 0.91724(0.367879)+0.08276(0.049787)\approx 0.33743+0.00412\approx 0.3416$.
- Predictive distribution. Q: Gamma-Poisson posterior predictive: with posterior $\lambda\mid x\sim\text{Gamma}(\alpha^*,\beta^*)$ (rate form), what is the distribution of the next count $\tilde{x}$? A: Mixing a Poisson over a gamma rate gives a **negative binomial**. The next count $\tilde{x}\mid x$ has $E[\tilde{x}\mid x]=\frac{\alpha^*}{\beta^*}$ (the posterior mean of $\lambda$) and variance $\frac{\alpha^*}{\beta^*}\!\left(1+\frac{1}{\beta^*}\right)=\frac{\alpha^*}{\beta^*}\cdot\frac{\beta^*+1}{\beta^*}$. The variance exceeds the Poisson value $\frac{\alpha^*}{\beta^*}$ because parameter uncertainty in $\lambda$ adds overdispersion.
- Predictive distribution. Q: With posterior $\lambda\mid x\sim\text{Gamma}(\alpha^*=9,\beta^*=6)$ (rate form), find the mean and variance of the **posterior predictive** count $\tilde{x}$. A: Mean $=\frac{\alpha^*}{\beta^*}=\frac{9}{6}=1.5$. Variance $=\frac{\alpha^*}{\beta^*}\cdot\frac{\beta^*+1}{\beta^*}=1.5\cdot\frac{7}{6}=1.75$. The predictive variance $1.75$ exceeds the mean $1.5$ (overdispersion) because of leftover uncertainty in $\lambda$, unlike a plain Poisson where variance $=$ mean.
- Predictive distribution. Q: Beta-binomial posterior predictive: posterior $\theta\mid x\sim\text{Beta}(\alpha^*,\beta^*)$. Probability that the **next single trial** is a success? A: $P(\tilde{x}=1\mid x)=\int_0^1 \theta\,\pi(\theta\mid x)\,d\theta = E[\theta\mid x]=\frac{\alpha^*}{\alpha^*+\beta^*}$. So the predictive success probability is simply the **posterior mean** of $\theta$. (For $m$ future trials the count follows a beta-binomial distribution.)
- Predictive distribution. Q: With posterior $\theta\mid x\sim\text{Beta}(6,9)$, find the probability the next two trials (conditionally independent given $\theta$) are **both successes**. A: For two future trials, integrate $\theta^2$ against the posterior: $E[\theta^2\mid x]=\frac{\alpha^*(\alpha^*+1)}{(\alpha^*+\beta^*)(\alpha^*+\beta^*+1)}$. $=\frac{6\cdot 7}{15\cdot 16}=\frac{42}{240}=0.175$. This exceeds $(\text{posterior mean})^2=0.40^2=0.16$ because, once $\theta$ is integrated out, the two outcomes are positively correlated through the shared unknown $\theta$.
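
The predictive calculations in the cards above can be checked numerically. The sketch below uses only the standard library; the helper names are hypothetical, and the discrete prior $\{1: 0.6,\ 3: 0.4\}$ and the posteriors $\text{Gamma}(9,6)$ and $\text{Beta}(6,9)$ are taken from the cards.

```python
import math


def poisson_zero_predictive(weights):
    """P(next count = 0) when lambda has discrete weights {lambda: prob}."""
    return sum(p * math.exp(-lam) for lam, p in weights.items())


def update_on_zero_claims(weights):
    """Posterior weights on lambda after one observed year with 0 claims."""
    joint = {lam: p * math.exp(-lam) for lam, p in weights.items()}
    total = sum(joint.values())
    return {lam: j / total for lam, j in joint.items()}


def gamma_poisson_predictive_moments(alpha, beta):
    """Mean and variance of the next count under a Gamma(alpha, rate beta) posterior."""
    mean = alpha / beta
    return mean, mean * (1 + 1 / beta)  # process variance plus parameter variance


def beta_all_successes(alpha, beta, m):
    """P(next m trials are all successes) = E[theta^m] under Beta(alpha, beta)."""
    prob = 1.0
    for k in range(m):
        prob *= (alpha + k) / (alpha + beta + k)
    return prob


prior = {1: 0.6, 3: 0.4}
posterior = update_on_zero_claims(prior)
print(round(poisson_zero_predictive(prior), 4))      # 0.2406
print(round(poisson_zero_predictive(posterior), 4))  # 0.3416
print(gamma_poisson_predictive_moments(9, 6))        # mean 1.5, variance about 1.75
print(beta_all_successes(6, 9, 2))                   # about 0.175
```
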
- Loss functions & estimators. Q: State the **Bayes estimator** of $\theta$ under each of squared-error, absolute-error, and $0$-$1$ loss. A: **Squared-error loss** $(\hat\theta-\theta)^2$ → posterior **mean** $E[\theta\mid x]$. **Absolute-error loss** $|\hat\theta-\theta|$ → posterior **median**. **$0$-$1$ loss** (penalize any miss equally) → posterior **mode** (the MAP estimate). Each is the value minimizing the expected posterior loss for its loss function.
- Loss functions & estimators. Q: Prove the posterior **mean** minimizes expected squared-error loss. A: Minimize $g(\hat\theta)=E[(\theta-\hat\theta)^2\mid x]$ over $\hat\theta$. $g'(\hat\theta)=E[-2(\theta-\hat\theta)\mid x]=-2\big(E[\theta\mid x]-\hat\theta\big)=0$. Solving gives $\hat\theta=E[\theta\mid x]$, the posterior mean; $g''=2>0$ confirms a minimum. The minimized value is the posterior variance $\text{Var}(\theta\mid x)$.
- Loss functions & estimators. Q: Posterior $\theta\mid x\sim\text{Beta}(8,4)$. Give the Bayes estimate under (a) squared-error and (b) $0$-$1$ loss. A: (a) Squared-error → posterior **mean** $=\frac{\alpha^*}{\alpha^*+\beta^*}=\frac{8}{12}\approx 0.667$. (b) $0$-$1$ loss → posterior **mode** $=\frac{\alpha^*-1}{\alpha^*+\beta^*-2}=\frac{7}{10}=0.70$. The mode exceeds the mean because the $\text{Beta}(8,4)$ density is left-skewed.
- Loss functions & estimators. Q: Discrete posterior on $\theta$: $P(\theta=1)=0.2$, $P(\theta=2)=0.5$, $P(\theta=4)=0.3$. Give the Bayes estimate under squared-error, absolute-error, and $0$-$1$ loss. A: **Squared-error** → mean $=1(0.2)+2(0.5)+4(0.3)=0.2+1.0+1.2=2.4$. **Absolute-error** → median: the cumulative probability is $0.2$ at $\theta=1$ and $0.2+0.5=0.7$ at $\theta=2$, so it first reaches $0.5$ at $\theta=2$ and the median is $2$. **$0$-$1$ loss** → mode $=2$ (highest mass, $0.5$).
- Loss functions & estimators. Q: How is the **MAP (maximum a posteriori)** estimate related to the MLE, and when do they coincide? A: The MAP maximizes the posterior $\pi(\theta\mid x)\propto L(x\mid\theta)\pi(\theta)$: it is the posterior **mode**, the Bayes estimator under $0$-$1$ loss. The MLE maximizes only $L(x\mid\theta)$. They coincide when the prior is **flat (constant)** over the relevant range, since then the posterior is proportional to the likelihood. With an informative prior the MAP is pulled toward the prior's high-density region.
- Loss functions & estimators. Q: Under a **weighted** squared-error loss $w(\theta)(\hat\theta-\theta)^2$ with weight $w(\theta)>0$, what is the Bayes estimator? A: Minimizing $E[w(\theta)(\theta-\hat\theta)^2\mid x]$ over $\hat\theta$ (set the derivative $-2E[w(\theta)(\theta-\hat\theta)\mid x]$ to zero) gives the **weighted posterior mean** $\hat\theta=\frac{E[w(\theta)\,\theta\mid x]}{E[w(\theta)\mid x]}$. When $w\equiv 1$ this reduces to the ordinary posterior mean. Weighting lets you penalize errors more in regions of $\theta$ that matter more; for any fixed $\theta$ the loss is still symmetric in the error.
- Loss functions & estimators. Q: With posterior $\lambda\mid x\sim\text{Gamma}(\alpha^*=5,\beta^*=900)$ (rate form), give the Bayes estimate of $\lambda$ under squared-error loss and its posterior variance. A: Squared-error Bayes estimate $=$ posterior mean $=\frac{\alpha^*}{\beta^*}=\frac{5}{900}\approx 0.005556$. Posterior variance of a $\text{Gamma}(\alpha^*,\beta^*)$ rate $=\frac{\alpha^*}{(\beta^*)^2}=\frac{5}{900^2}=\frac{5}{810000}\approx 6.17\times 10^{-6}$. (Standard deviation $\approx 0.002484$.)
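
For a discrete posterior, the three Bayes estimators named in this group (mean, median, mode) can be read off mechanically. The Python sketch below does so for the discrete-posterior card; the function name is hypothetical, and the median rule (smallest value whose cumulative probability reaches $0.5$) is one common convention that ignores ties.

```python
def bayes_estimates(posterior):
    """Mean, median, and mode of a discrete posterior {value: probability}."""
    values = sorted(posterior)
    mean = sum(v * posterior[v] for v in values)      # squared-error loss

    cumulative = 0.0
    median = values[-1]
    for v in values:                                  # absolute-error loss
        cumulative += posterior[v]
        if cumulative >= 0.5:
            median = v
            break

    mode = max(values, key=lambda v: posterior[v])    # 0-1 loss
    return mean, median, mode


# Card values: P(1) = 0.2, P(2) = 0.5, P(4) = 0.3.
print(bayes_estimates({1: 0.2, 2: 0.5, 4: 0.3}))  # mean about 2.4, median 2, mode 2
```
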
- Credible intervals. Q: Define a **$95\%$ credible interval** for $\theta$ and contrast it with a frequentist confidence interval. A: A $95\%$ **credible interval** $(a,b)$ satisfies $P(a<\theta<b\mid x)=0.95$ under the posterior, a direct probability statement about $\theta$ given the data. A frequentist **confidence interval** treats $\theta$ as fixed: $95\%$ refers to the long-run coverage of the random interval, not to a probability that this particular interval contains $\theta$. The Bayesian statement 'there is a $95\%$ probability $\theta$ lies in $(a,b)$' is valid; the same wording is incorrect for a confidence interval.
- Credible intervals. Q: Distinguish an **equal-tailed** credible interval from a **highest posterior density (HPD)** interval. A: **Equal-tailed:** cut off $2.5\%$ posterior probability in each tail, so the bounds are the $0.025$ and $0.975$ posterior quantiles. Easy to compute from quantiles. **HPD:** the **shortest** interval containing $95\%$ posterior probability; every point inside has posterior density at least as high as any point outside. For a symmetric unimodal posterior the two coincide; for a skewed posterior the HPD is shorter and not equal-tailed.
- Credible intervals. Q: Posterior $\mu\mid x\sim N(55,\ 12.5)$. Construct a $95\%$ equal-tailed credible interval for $\mu$. A: Posterior SD $=\sqrt{12.5}\approx 3.5355$. Use $z_{0.975}=1.96$. Interval $=55\pm 1.96(3.5355)=55\pm 6.93$. So the $95\%$ credible interval is approximately $(48.07,\ 61.93)$. Because the posterior is symmetric, this is also the HPD interval.
- Credible intervals. Q: Posterior $\mu\mid x\sim N(120,\ 16)$. Find a **$90\%$** equal-tailed credible interval and state the probability $\mu>125$. A: Posterior SD $=\sqrt{16}=4$; $z_{0.95}=1.645$. $90\%$ interval $=120\pm 1.645(4)=120\pm 6.58=(113.42,\ 126.58)$. $P(\mu>125\mid x)=P\!\left(Z>\frac{125-120}{4}\right)=P(Z>1.25)\approx 1-0.8944=0.1056$.
- Credible intervals. Q: From a discrete posterior $P(\theta=1)=0.10$, $P(\theta=2)=0.55$, $P(\theta=3)=0.25$, $P(\theta=4)=0.10$, give a credible set of probability **at least $0.90$**. A: Rank by posterior mass and accumulate until $\ge 0.90$: $\theta=2$ ($0.55$) → $\theta=3$ ($+0.25=0.80$) → $\theta=1$ ($+0.10=0.90$). The set $\{1,2,3\}$ carries $0.90$ probability, so it is a $90\%$ credible set (highest-density style). Equivalently, excluding $\theta=4$ (mass $0.10$) leaves $0.90$. Because $\theta=1$ and $\theta=4$ tie at mass $0.10$, the set $\{2,3,4\}$ is an equally valid $90\%$ credible set.
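
The interval cards in this group can be reproduced with the standard library's `statistics.NormalDist`. The sketch below is illustrative: the helper names are hypothetical, and the greedy set-builder breaks ties between equal masses by input order, which is an arbitrary choice.

```python
from statistics import NormalDist


def normal_equal_tailed(mean, variance, level):
    """Equal-tailed credible interval for a normal posterior."""
    post = NormalDist(mean, variance ** 0.5)
    tail = (1 - level) / 2
    return post.inv_cdf(tail), post.inv_cdf(1 - tail)


def highest_mass_set(posterior, level):
    """Greedy credible set: add values in order of decreasing posterior mass."""
    chosen, total = [], 0.0
    for value, prob in sorted(posterior.items(), key=lambda kv: -kv[1]):
        if total >= level - 1e-12:  # small tolerance guards against float round-off
            break
        chosen.append(value)
        total += prob
    return sorted(chosen), total


print(normal_equal_tailed(55, 12.5, 0.95))  # roughly (48.07, 61.93)
print(normal_equal_tailed(120, 16, 0.90))   # roughly (113.42, 126.58)
print(1 - NormalDist(120, 4).cdf(125))      # roughly 0.1056
print(highest_mass_set({1: 0.10, 2: 0.55, 3: 0.25, 4: 0.10}, 0.90))
```
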
- Bayes theorem & posteriors. Q: What is a **noninformative (vague/diffuse) prior**, and what risk does it carry? A: A noninformative prior aims to add little or no information, letting the data dominate, e.g. a flat prior $\pi(\theta)\propto 1$ or a very wide proper prior. Risk: a flat prior can be **improper** (does not integrate to a finite value); that is acceptable only if the resulting posterior is still proper. A prior that is flat in one parameterization is generally **not** flat after a nonlinear transformation, so 'noninformative' is not invariant, which motivates Jeffreys priors.
- Bayes theorem & posteriors. Q: Define the **Jeffreys prior** and give its key property. A: Jeffreys prior is $\pi(\theta)\propto\sqrt{I(\theta)}$, where $I(\theta)=-E\!\left[\frac{\partial^2}{\partial\theta^2}\ln f(X\mid\theta)\right]$ is the Fisher information. Key property: it is **invariant under reparameterization**: applying Jeffreys' rule in any one-to-one transformed parameter gives the consistent transformed prior. For a binomial proportion it is $\pi(\theta)\propto\theta^{-1/2}(1-\theta)^{-1/2}$, i.e. $\text{Beta}(\tfrac12,\tfrac12)$.
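
As a worked check of the binomial case, here is a sketch of the Fisher-information calculation for a single Bernoulli trial $X\in\{0,1\}$ with $E[X]=\theta$. The single-trial setup is an assumption made for brevity; $n$ trials only multiply $I(\theta)$ by $n$, which does not change the shape of the prior.

$$
\begin{aligned}
\ln f(x\mid\theta) &= x\ln\theta+(1-x)\ln(1-\theta),\\
\frac{\partial^2}{\partial\theta^2}\ln f(x\mid\theta) &= -\frac{x}{\theta^2}-\frac{1-x}{(1-\theta)^2},\\
I(\theta) &= \frac{E[X]}{\theta^2}+\frac{1-E[X]}{(1-\theta)^2}=\frac{1}{\theta}+\frac{1}{1-\theta}=\frac{1}{\theta(1-\theta)},\\
\pi(\theta) &\propto \sqrt{I(\theta)}=\theta^{-1/2}(1-\theta)^{-1/2}.
\end{aligned}
$$

The last line is the kernel of a $\text{Beta}(\tfrac12,\tfrac12)$ density, matching the card.
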
- MCMC & hierarchical. Q: Why do we often need **MCMC**, and what does it produce? A: For most realistic models the posterior $\pi(\theta\mid x)\propto L\pi$ has an **intractable normalizing constant** (the integral $\int L\pi\,d\theta$ can't be done in closed form), and $\theta$ may be high-dimensional. MCMC builds a Markov chain whose stationary distribution **is** the posterior, generating dependent draws $\theta^{(1)},\theta^{(2)},\dots$. After discarding burn-in, sample averages estimate posterior means, quantiles give credible intervals, etc., all without ever computing the normalizing constant.
- MCMC & hierarchical. Q: State the **Metropolis-Hastings acceptance ratio** and the accept/reject rule. A: From current $\theta$, propose $\theta^*$ from $q(\theta^*\mid\theta)$. The acceptance probability is $\alpha=\min\!\left\{1,\ \dfrac{\pi(\theta^*\mid x)\,q(\theta\mid\theta^*)}{\pi(\theta\mid x)\,q(\theta^*\mid\theta)}\right\}$. Draw $u\sim U(0,1)$: **accept** $\theta^*$ if $u\le\alpha$, else keep $\theta$. The ratio uses the posterior only through $L\pi$, so the unknown normalizing constant cancels. For a **symmetric** proposal ($q(\theta^*\mid\theta)=q(\theta\mid\theta^*)$) it simplifies to $\alpha=\min\{1,\ \pi(\theta^*\mid x)/\pi(\theta\mid x)\}$ (Metropolis).
- MCMC & hierarchical. Q: Metropolis step: target $\pi(\theta\mid x)\propto \theta^{8}(1-\theta)^{4}$ (a $\text{Beta}(9,5)$ kernel). Current $\theta=0.5$, symmetric proposal $\theta^*=0.7$. Compute the acceptance probability. A: Symmetric proposal → $\alpha=\min\{1,\ r\}$ with $r=\frac{\theta^{*8}(1-\theta^*)^4}{\theta^{8}(1-\theta)^4}$. Numerator: $0.7^{8}(0.3)^4 = 0.05764801\times 0.0081 \approx 4.6695\times 10^{-4}$. Denominator: $0.5^{8}(0.5)^4 = 0.5^{12}\approx 2.4414\times 10^{-4}$. $r\approx \frac{4.6695\times 10^{-4}}{2.4414\times 10^{-4}}\approx 1.913$. Since $r>1$, $\alpha=\min\{1,1.913\}=1$, so the move to $0.7$ is **always accepted**.
- MCMC & hierarchical. Q: Metropolis step toward a **lower-density** point. Same target $\pi\propto\theta^{8}(1-\theta)^4$, current $\theta=0.5$, symmetric proposal $\theta^*=0.2$. Compute $\alpha$ and decide whether to accept if $u=0.10$. A: Numerator: $0.2^{8}(0.8)^4 = 2.56\times 10^{-6}\times 0.4096 \approx 1.0486\times 10^{-6}$. Denominator: $0.5^{12}\approx 2.4414\times 10^{-4}$. $r\approx \frac{1.0486\times 10^{-6}}{2.4414\times 10^{-4}}\approx 0.004295$, so $\alpha=\min\{1,0.004295\}\approx 0.0043$. With $u=0.10>0.0043$, **reject**: the chain stays at $\theta=0.5$. Downhill moves are accepted only rarely.
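
The two Metropolis cards above compute a single acceptance probability; the sketch below wraps that step in a small random-walk sampler for the same $\theta^{8}(1-\theta)^{4}$ kernel. The normal proposal, the step size `0.1`, the seed, and the burn-in length are illustrative assumptions, not values from the deck. The code works on the log scale and accepts when $u<\alpha$, which differs from $u\le\alpha$ only on an event of probability zero.

```python
import math
import random


def log_kernel(theta):
    """Log of the unnormalized target theta^8 (1 - theta)^4 on (0, 1)."""
    if not 0.0 < theta < 1.0:
        return -math.inf  # zero density outside (0, 1)
    return 8 * math.log(theta) + 4 * math.log(1 - theta)


def acceptance_probability(current, proposal):
    """Metropolis acceptance probability for a symmetric proposal."""
    log_r = log_kernel(proposal) - log_kernel(current)
    return 1.0 if log_r >= 0 else math.exp(log_r)


def metropolis(n_draws, start=0.5, step=0.1, seed=0):
    """Random-walk Metropolis chain; step and seed are illustrative choices."""
    rng = random.Random(seed)
    theta, draws = start, []
    for _ in range(n_draws):
        proposal = theta + rng.gauss(0.0, step)  # symmetric normal proposal
        if rng.random() < acceptance_probability(theta, proposal):
            theta = proposal                      # accept the move
        draws.append(theta)                       # on rejection, theta repeats
    return draws


print(acceptance_probability(0.5, 0.7))            # 1.0
print(round(acceptance_probability(0.5, 0.2), 6))  # 0.004295
draws = metropolis(20000)[2000:]                   # drop an arbitrary burn-in
print(sum(draws) / len(draws))                     # should be near 9/14
```

The sample average can be compared with the exact $\text{Beta}(9,5)$ mean $9/14\approx 0.643$ as a sanity check on the chain.
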
- MCMC & hierarchical. Q: Describe **Gibbs sampling** and when it is preferred over Metropolis-Hastings. A: Gibbs sampling updates one parameter (or block) at a time by drawing it from its **full conditional** distribution given the current values of all the others: cycle $\theta_1^{(t+1)}\sim\pi(\theta_1\mid\theta_2^{(t)},\dots,x)$, then $\theta_2^{(t+1)}\sim\pi(\theta_2\mid\theta_1^{(t+1)},\dots,x)$, and so on. Every draw is **always accepted** (it is a special case of M-H with acceptance $1$). It is preferred when the full conditionals are known standard distributions, which is common in conjugate hierarchical models.
- MCMC & hierarchical. Q: Define **burn-in**, **thinning**, and how convergence is assessed in MCMC. A: **Burn-in:** discard the first portion of the chain, which depends on the arbitrary starting value, before computing summaries. **Thinning:** keep every $k$-th draw to reduce autocorrelation and storage (optional; it discards draws, so the thinned sample carries no more information than the full chain). **Convergence diagnostics:** trace plots (should look like stationary noise), running-mean stability, autocorrelation plots, and the **Gelman-Rubin** statistic $\hat{R}$ comparing within- vs between-chain variance across multiple chains ($\hat{R}\approx 1$ suggests convergence). MCMC estimates are valid only after the chain has reached its stationary (posterior) distribution.
- MCMC & hierarchical. Q: Estimate a posterior mean and a $95\%$ credible interval from a thinned, post-burn-in MCMC sample of $\theta$: $0.30,\ 0.42,\ 0.55,\ 0.61,\ 0.48,\ 0.37,\ 0.52,\ 0.44$ (treat as the retained draws). A: Posterior mean $\approx$ sample average $=\frac{0.30+0.42+0.55+0.61+0.48+0.37+0.52+0.44}{8}=\frac{3.69}{8}\approx 0.461$. For a credible interval, use the empirical quantiles of the draws: sorted $0.30,0.37,0.42,0.44,0.48,0.52,0.55,0.61$. With only eight draws, the $2.5\%$ and $97.5\%$ empirical quantiles fall at or near the extreme retained draws, so a crude $95\%$ interval is roughly $(0.30,\ 0.61)$. A real run uses thousands of draws so that those quantiles are estimated reliably.
- MCMC & hierarchical. Q: What is a **hierarchical (multilevel) Bayesian model**, and why is it natural for actuarial credibility? A: A hierarchical model adds a layer: data depend on group-level parameters $\theta_i$, and those $\theta_i$ share a common prior governed by hyperparameters $\phi$, which in turn receive a **hyperprior** $\pi(\phi)$. Schematically $x_{ij}\mid\theta_i\sim f$, $\theta_i\mid\phi\sim g$, $\phi\sim\pi(\phi)$. This 'partial pooling' shrinks each group's estimate toward the overall mean by an amount that depends on how much data the group has, which is exactly the **credibility** idea. Empirical Bayes estimates $\phi$ from the data; full Bayes puts a prior on $\phi$ and integrates (typically via MCMC).
- MCMC & hierarchical. Q: In a hierarchical normal model, the group posterior mean is $\hat\theta_i = Z_i\,\bar{x}_i + (1-Z_i)\,\mu$ with $Z_i=\frac{n_i}{n_i+k}$, $k=\sigma^2/\tau^2$. Compute $\hat\theta_i$ for $\bar{x}_i=80$, $\mu=70$, $n_i=5$, $\sigma^2=100$, $\tau^2=25$. A: $k=\frac{\sigma^2}{\tau^2}=\frac{100}{25}=4$, so $Z_i=\frac{n_i}{n_i+k}=\frac{5}{5+4}=\frac{5}{9}\approx 0.5556$. $\hat\theta_i = Z_i\,\bar{x}_i+(1-Z_i)\,\mu = 0.5556(80)+0.4444(70)$ $\approx 44.44 + 31.11 = 75.56$. The group estimate is shrunk from its raw mean $80$ partway toward the global mean $70$; shrinkage is stronger when $n_i$ is small.
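
The credibility formula in the last card is short enough to sketch directly. The function name below is hypothetical; the second call reuses the numbers from the normal-normal card ($\mu_0=50$, $\bar{x}=60$, $n=4$, $\sigma^2=100$, $\tau^2=25$) to show that the shrinkage form and the precision-weighted form give the same answer.

```python
def credibility_estimate(xbar, mu, n, sigma2, tau2):
    """Return (Z, Z*xbar + (1 - Z)*mu) with Z = n / (n + sigma2/tau2)."""
    k = sigma2 / tau2
    z = n / (n + k)
    return z, z * xbar + (1 - z) * mu


z, theta_hat = credibility_estimate(xbar=80, mu=70, n=5, sigma2=100, tau2=25)
print(round(z, 4), round(theta_hat, 2))  # 0.5556 75.56

# Normal-normal card inputs: k = 4 and n = 4 give Z = 0.5, so the estimate is 55,
# matching the precision-weighted posterior mean computed in that card.
print(credibility_estimate(xbar=60, mu=50, n=4, sigma2=100, tau2=25))  # (0.5, 55.0)
```
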