Chenzi Xu
MPhil DPhil (Oxon)
University of York
2022/12/04 (updated: 2022-12-12)
1 2 3 4 5
1 2 3 4 5
The sampling distribution of means is normal, provided that:
Illustration from Wikipedia
1 2 3 4 5
The Exponential distribution has a parameter \(\lambda\).
1 2 3 4 5
It represents the range of plausible value of the \(\mu\) parameter.
If you take samples repeatedly and compute the CI each time, 95% of those CIs will contain the true population mean \(\mu\).
1 2 3 4 5
The probability that the null hypothesis is true, or the probability that the alternative hypothesis is false.
The probability of obtaining the observed sample statistic, or some value more extreme than that, conditional on the assumption that the null hypothesis is true.
1 2 3 4 5
Null hypothesis significance testing (NHST) is only meaningful when power is high.
1 2 3 4 5
1 2 3 4 5
1 2 3 4 5
Data: the underlying random variable (Y) producing the data
Examples:
Probability distribution \(p(Y)\):
Examples:
Probability distribution \(p(Y)\):
1 2 3 4 5
Here, \(n\) represents the total number of trials, \(k\) the number of successes, and \(\theta\) the probability of success. The term \(\binom{n}{k}\), the number of ways in which one can choose \(k\) successes out of \(n\) trials, expands to \(\frac{n!}{k!(n-k)!}\).
1 2 3 4 5
Assumption: Each data point in the vector of data y is independent of the others.
The kernel density of the normal PDF: \(g(x|\mu,\sigma)=\exp \left(-\frac{(x-\mu)^2}{2\sigma^2} \right)\), and \(k\int{g(x)dx=1}\).
Probability in a continuous distribution is the area under the curve \(P(X<u)=\int_{-\infty}^uf(x)dx\), and will always be zero at any point value \(P(X=x)=0\).
1 2 3 4 5
1 2 3 4 5
Expectation \(E(x)\): the weighted mean of the possible outcomes, weighted by the respective probabilities of each outcome.
Variance is defined as: \(Var(X) = E[X^2]-E[Y]^2\)
Expectation: \(E\left[Y\right]=\sum_y y \cdot f(y)=n\theta\)
Variance: \(Var(X)=n\theta(1-\theta)\)
Expectation: \(E[X]=\int xf(x)dx=\mu\)
Variance: \(Var(X)=\sigma^2\)
1 2 3 4 5
The likelihood function \(\mathcal{L}(\theta \mid k,n)\) refers to the PMF \(p(k|n,\theta)\), treated as a function of \(\theta\).
Suppose that we record \(n = 10\) trials, and observe \(k = 7\) successes (heads in coin tosses).
\[\begin{equation} \mathcal{L}(\theta \mid k=7,n=10)= \binom{10}{7} \theta^{7} (1-\theta)^{10-7} \end{equation}\]1 2 3 4 5
The concept of“Integrating out a parameter”
Marginal likelihood: the likelihood computed by “marginalizing” out the parameter \(\theta\). It is a kind of weighted sum of the likelihood, weighted by the possible values of the parameter.
1 2 3 4 5
We have a joint PMF \(p_{X,Y} (x, y)\) for each possible pair of values of X and Y.
The marginal distributions: \[\begin{equation} p_{X}(x)=\sum_{y\in S_{Y}}p_{X,Y}(x,y) \end{equation}\]
\[\begin{equation} p_{Y}(y)=\sum_{x\in S_{X}}p_{X,Y}(x,y) \end{equation}\]The conditional distributions: \[\begin{equation} p_{X\mid Y}(x\mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} \end{equation}\]
\[\begin{equation} p_{Y\mid X}(y\mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)} \end{equation}\]1 2 3 4 5
PDF and CDF (\(\rho=0\)):
The joint distribution of X and Y: \[\begin{equation} \begin{pmatrix} X \\ Y \end{pmatrix}\sim \mathcal N_{2}\left(\begin{pmatrix} \mu_{X} \\ \mu_{Y} \end{pmatrix}, \begin{pmatrix} \sigma _{X}^2 & \rho_{XY}\sigma _{X}\sigma _{Y}\\ \rho_{XY}\sigma _{X}\sigma _{Y} & \sigma _{Y}^2\\ \end{pmatrix} \right) \end{equation}\]
In the variance-covariance matrix, \(\rho_{XY}\) is the correlation between X and Y; their covariance is \(Cov(X,Y) = \rho_{XY}\sigma _{X}\sigma _{Y}\).
The area under the curve sums to 1. \[\begin{equation} \iint_{S_{X,Y}} f_{X,Y}(x,y)\, dx dy = 1 \end{equation}\]
The marginal distributions:
\[\begin{equation} f_X(x) = \int_{S_Y} f_{X,Y}(x,y)\, dy \quad f_Y(y) = \int_{S_X} f_{X,Y}(x,y)\, dx \end{equation}\]1 2 3 4 5
The probability of A given B: \[\begin{equation} P(A|B)=\dfrac{P(A,B)}{P(B)}=\frac{P(B|A)P(A)}{P(B)} \hbox{ where } P(B)>0 \end{equation}\]
Since the probability A and B happening is the same as the probability of B and A happening \(P(B,A)=P(A,B)\), we have:
\[\begin{equation} P(B,A)=P(B|A)P(A)=P(A|B)P(B)=P(A,B). \end{equation}\]Rearranging terms:
\[\begin{equation} P(B|A)=\frac{P(A|B)P(B)}{P(A)} \end{equation}\]This is Bayes’ rule.
1 2 3 4 5
1 2 3 4 5
If \(y\) is a vector of data points, we have:
\[\begin{equation} f(\theta\mid y) = \frac{f(y\mid \theta) f(\theta)}{f(y)} \end{equation}\]Here \(f(\cdot)\) is a PDF (continuous case) or a PMF (discrete case).
We can rewrite it as follows:
\[\begin{equation} Posterior = \frac{Likelihood \cdot Prior}{Marginal Likelihood} \end{equation}\]Prior \(f(\theta)\): the initial probability distribution of the parameter(s), before seeing the data.
Posterior \(f(\theta \mid y)\): the probability distribution of the parameters conditional on the data.
Likelihood \(f(y \mid \theta)\): the PMF (discrete case) or the PDF (continuous case) expressed as a function of parameters \(\theta\).
Marginal Likelihood \(f(y)\): It standardizes the posterior distribution to ensure that the area under the curve of the distribution sums to 1.
1 2 3 4 5
If \(y\) is a vector of data points, we have:
\[\begin{equation} f(\theta\mid y) = \frac{f(y\mid \theta) f(\theta)}{f(y)} \end{equation}\]Here \(f(\cdot)\) is a PDF (continuous case) or a PMF (discrete case).
We can rewrite it as follows:
\[\begin{equation} Posterior = \frac{Likelihood \cdot Prior}{Marginal Likelihood} \end{equation}\]Marginal Likelihood \(f(y)\): It functions as the “normalisating constant”.
\[\begin{equation} f(y)=\int f(y,\theta) d\theta = \int f(y \mid \theta)f(\theta)d\theta \end{equation}\]Without it, we have: \[\begin{equation} f(\theta\mid y) \propto f(y\mid \theta) f(\theta) \end{equation}\]
\[\begin{equation} \hbox{Posterior} \propto \hbox{Likelihood}\times \hbox{Prior} \end{equation}\]1 2 3 4 5
\(P(Data \mid Theory)\)
\(P(Theory \mid Data)\)
1 2 3 4 5
1 2 3 4 5
1. Choosing a likelihood
The responses follow a binomial distribution:
\[\begin{equation} f(x\mid n,\theta) = {x \choose n}\theta^{x} (1-\theta)^{n-x} \end{equation}\]1 2 3 4 5
2. Choosing a prior distribution
We need a distribution that can represent our uncertainty about the probability \(\theta\) of success.
The beta distribution is commonly used as prior for proportions.
\[\begin{equation*} f(\theta)= \left\{ \begin{array}{l l} \frac{1}{B(a,b)} \theta^{a - 1} (1-\theta)^{b-1} & \textrm{if } 0< \theta < 1\\ 0 & \textrm{otherwise}\\ \end{array} \right. \end{equation*}\]Where \(B(a,b)=\int_0^1 \theta^{a-1}(1-\theta)^{b-1}\, d\theta\), a normalizing constant that ensures that the area under the curve sums to 1.
The expectation and variance of Beta PDF:
\[\begin{equation} E[\theta]=\frac{a}{a+b} \end{equation}\] \[\begin{equation} Var(\theta)=\frac{ab}{\left(a+b\right)^{2}\left(a+b+1\right)} \end{equation}\]1 2 3 4 5
3. Conjugate analysis based on Bayes’ rule
To get the posterior distribution for \(\theta\), we have (ignore the normalising constant \({x \choose n}\)):
\[\begin{equation} f(\theta\mid x) \propto f(x\mid n, \theta) f(\theta) \end{equation}\]We multiply the likelihood and the prior: \[\begin{equation} \theta^{x} (1-\theta)^{n-x}\theta^{a-1}(1-\theta)^{b-1}=\theta^{a+x-1}(1-\theta)^{b+n-x-1} \end{equation}\]
We computed the posterior up to proportionality.
Thus, given a \(Binomial(n, x \mid \theta)\) likelihood, and a \(Beta(a,b)\) prior on \(\theta\), the posterior will be \(Beta(a+x, b+n-x)\).
1 2 3 4 5
The conjugate case of the beta-binomial:
In realistic data-analysis settings, the likelihood function will be very complex, and many parameters will be involved.
1 2 3 4 5
1 2 3 4 5
MCMC (Markov Chain Monte Carlo) sampling
Bayes’ rule: \[\begin{equation} \begin{aligned} p(\Theta|Y) &= \cfrac{ p(Y|\Theta) \cdot p(\Theta) }{p(Y)}\\ p(\Theta|Y) &= \cfrac{ p(Y|\Theta) \cdot p(\Theta) }{\int_{\Theta} p(Y|\Theta) \cdot p(\Theta) d\Theta } \end{aligned} \end{equation}\]
1 2 3 4 5
It works when we know the closed form of the PDF we want to simulate from and can derive the inverse of that function.
1 2 3 4 5
In trace plot, we see a characteristic “fat hairy caterpillar” shape, and we say that the sampler has converged.
The first few samples (warm-ups) were dropped, because the initial samples may not yet be sampling from the posterior.
We usually run four parallel chains with different initial values chosen randomly. They all end up sampling from the same distribution and overlap visually (mixing).
brms
1 2 3 4 5
Data:
Suppose we have data from a single subject repeatedly pressing the space bar as fast as possible, without paying attention to any stimuli. The data are response times in milliseconds in each trial.
RQ: How long it takes to press a key for this subject?
brms
1 2 3 4 5
Assumptions: \[\begin{equation} rt_n \sim \mathit{Normal}(\mu, \sigma) \end{equation}\]
\[\begin{equation} rt_n = \mu + \varepsilon \hbox{, where } \varepsilon_n \stackrel{iid}{\sim} \mathit{Normal}(0,\sigma) \end{equation}\]brms
1 2 3 4 5
library(brms)
fit_press <- brm(rt ~ 1,
data = df_spacebar,
family = gaussian(),
prior = c(
prior(uniform(0, 60000), class = Intercept, lb = 0,
ub = 60000),
prior(uniform(0, 2000), class = sigma, lb = 0,
ub = 2000)
),
chains = 4,
iter = 2000,
warmup = 1000
)
SAMPLING FOR MODEL 'f7a7fc24ef4802fd1ddd47ff5ebae03c' NOW (CHAIN 1).
Chain 1:
Chain 1: Gradient evaluation took 0.00019 seconds
Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 1.9 seconds.
Chain 1: Adjust your expectations accordingly!
Chain 1:
Chain 1:
Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 1:
Chain 1: Elapsed Time: 0.047038 seconds (Warm-up)
Chain 1: 0.032751 seconds (Sampling)
Chain 1: 0.079789 seconds (Total)
Chain 1:
SAMPLING FOR MODEL 'f7a7fc24ef4802fd1ddd47ff5ebae03c' NOW (CHAIN 2).
Chain 2:
Chain 2: Gradient evaluation took 1.8e-05 seconds
Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.18 seconds.
Chain 2: Adjust your expectations accordingly!
Chain 2:
Chain 2:
Chain 2: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 2: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 2: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 2: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 2: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 2: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 2: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 2: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 2: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 2: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 2: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 2: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 2:
Chain 2: Elapsed Time: 0.043865 seconds (Warm-up)
Chain 2: 0.037392 seconds (Sampling)
Chain 2: 0.081257 seconds (Total)
Chain 2:
SAMPLING FOR MODEL 'f7a7fc24ef4802fd1ddd47ff5ebae03c' NOW (CHAIN 3).
Chain 3:
Chain 3: Gradient evaluation took 1.5e-05 seconds
Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.15 seconds.
Chain 3: Adjust your expectations accordingly!
Chain 3:
Chain 3:
Chain 3: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 3: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 3: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 3: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 3: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 3: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 3: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 3: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 3: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 3: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 3: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 3: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 3:
Chain 3: Elapsed Time: 0.040491 seconds (Warm-up)
Chain 3: 0.042745 seconds (Sampling)
Chain 3: 0.083236 seconds (Total)
Chain 3:
SAMPLING FOR MODEL 'f7a7fc24ef4802fd1ddd47ff5ebae03c' NOW (CHAIN 4).
Chain 4:
Chain 4: Gradient evaluation took 1.6e-05 seconds
Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.16 seconds.
Chain 4: Adjust your expectations accordingly!
Chain 4:
Chain 4:
Chain 4: Iteration: 1 / 2000 [ 0%] (Warmup)
Chain 4: Iteration: 200 / 2000 [ 10%] (Warmup)
Chain 4: Iteration: 400 / 2000 [ 20%] (Warmup)
Chain 4: Iteration: 600 / 2000 [ 30%] (Warmup)
Chain 4: Iteration: 800 / 2000 [ 40%] (Warmup)
Chain 4: Iteration: 1000 / 2000 [ 50%] (Warmup)
Chain 4: Iteration: 1001 / 2000 [ 50%] (Sampling)
Chain 4: Iteration: 1200 / 2000 [ 60%] (Sampling)
Chain 4: Iteration: 1400 / 2000 [ 70%] (Sampling)
Chain 4: Iteration: 1600 / 2000 [ 80%] (Sampling)
Chain 4: Iteration: 1800 / 2000 [ 90%] (Sampling)
Chain 4: Iteration: 2000 / 2000 [100%] (Sampling)
Chain 4:
Chain 4: Elapsed Time: 0.041945 seconds (Warm-up)
Chain 4: 0.036886 seconds (Sampling)
Chain 4: 0.078831 seconds (Total)
Chain 4:
Sampling Specifications:
chains
:the number of independent runs for sampling (default: 4)
iter
: the number of iterations that the sampler makes to sample from the posterior distribution of each parameter (default: 2000)
warmup
: the number of iterations from the start of sampling that are eventually discarded (default: half of iter
)
brms
1 2 3 4 5
brms
1 2 3 4 5
# A draws_df: 3 iterations, 1 chains, and 4 variables
b_Intercept sigma lprior lp__
1 168 24 -19 -1683
2 169 24 -19 -1683
3 168 26 -19 -1683
# ... hidden reserved variables {'.chain', '.iteration', '.draw'}
[1] 168.6334
2.5% 97.5%
166.0657 171.2347
[1] 24.99585
2.5% 97.5%
23.29355 26.88645
brms
1 2 3 4 5
brms
1 2 3 4 5
This sensitivity analysis showed that the posterior is not overly affected by the choice of prior.
brms
1 2 3 4 5
This talk is dedicated to the SMLP 2022. All the materials are from https://vasishth.github.io/.