Introduction to Bayesian Data Analysis

Chenzi Xu
MPhil DPhil (Oxon)

University of York

2022-12-04 (updated: 2022-12-12)

Outline

  1. Why Bayesian data analysis?
  2. Fundamentals
    • Random variable and distribution
    • Conditional probability
    • Marginal likelihood
  3. Bayes’ rule
  4. Bayesian analysis
  5. Computational Bayesian models in R

Why Bayesian?

Frequentist Framework

The Central Limit Theorem

The sampling distribution of the means is approximately normal, provided that:

  1. the sample size is large enough
  2. \(\mu\) and \(\sigma\) are defined for the probability density or mass function that generated the data

Frequentist Framework


Hypothetical repeated sampling

The Exponential distribution has a single rate parameter \(\lambda\).

Mean: \(1/\lambda\);

Variance: \(1/\lambda^2\)

n <- 1000 # observations per experiment
k <- 2000 # number of experiments

# generate data: each column holds one simulated experiment
y_matrix <- matrix(NA_real_, nrow = n, ncol = k)

for (i in 1:k) {
  y_matrix[, i] <- rexp(n, rate = 1/10) # true mean 1/lambda = 10
}
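
As a quick check of the CLT claim above (a minimal sketch continuing this simulation): although the exponential data are right-skewed, the 2,000 sample means are approximately normal, centred on the true mean \(1/\lambda = 10\).

# sampling distribution of the mean across the k experiments
means <- colMeans(y_matrix)

hist(means, breaks = 50,
     main = "Sampling distribution of the mean",
     xlab = "Sample mean")

mean(means) # close to the true mean, 1/lambda = 10
sd(means)   # close to the standard error, (1/lambda)/sqrt(n)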

Frequentist Framework


Confidence Interval (CI)

A common misinterpretation: the CI represents the range of plausible values of the \(\mu\) parameter.

The correct (frequentist) interpretation: if you take samples repeatedly and compute the CI each time, 95% of those CIs will contain the true population mean \(\mu\).
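
A sketch of this repeated-sampling property, reusing y_matrix and n from the simulation above (approximate normal-theory CIs; the true mean is 10):

# approximate 95% CI for each of the k experiments
se    <- apply(y_matrix, 2, sd) / sqrt(n)
lower <- colMeans(y_matrix) - 1.96 * se
upper <- colMeans(y_matrix) + 1.96 * se

# proportion of CIs containing the true mean: close to 0.95
mean(lower < 10 & upper > 10)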

Frequentist Framework


\(p\)-value

A common misconception: the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false.

The correct definition: the probability of obtaining the observed sample statistic, or some value more extreme than that, conditional on the assumption that the null hypothesis is true.
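
A small illustration of the correct definition (a hypothetical simulation, not from the slides): when the null hypothesis is true, p-values are approximately uniform, so about 5% of tests fall below 0.05.

set.seed(1)

# simulate 2000 experiments in which the null (mu = 0) is true
p_values <- replicate(2000, t.test(rnorm(30, mean = 0), mu = 0)$p.value)

hist(p_values)        # approximately uniform under the null
mean(p_values < 0.05) # close to 0.05, the Type I error rate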

Frequentist Framework


Errors and Power

  • Type I, II, M, S errors
  • Power (1 - Type II error rate) is a function of:
    • the effect size
    • the standard deviation
    • the sample size.

Null hypothesis significance testing (NHST) is only meaningful when power is high.
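
A sketch of how power depends on these three quantities, using base R's power.t.test (the effect sizes, standard deviations, and sample sizes here are arbitrary illustrative values):

# power increases with effect size and sample size, decreases with sd
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
power.t.test(n = 80, delta = 0.5, sd = 1, sig.level = 0.05)$power
power.t.test(n = 20, delta = 0.5, sd = 2, sig.level = 0.05)$power

# sample size per group needed for 80% power at a given effect size
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.8)$n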

Why Bayesian?


  • Friendly to limited data
  • Intuitive and transparent model specification
  • Straightforward interpretation
  • Model flexibility
  • Not limited by optimisation constraints

Fundamentals

Random variable and distribution


Data: modelled as generated by an underlying random variable \(Y\)

Discrete random variable

Examples:

  • Acceptability ratings on a Likert scale
  • Binary grammaticality judgements

Probability distribution \(p(Y)\):

  • Probability Mass Function (PMF)
  • The PMF maps each possible outcome of \(Y\) to a probability in \([0, 1]\).

Continuous random variable

Examples:

  • Reading times or reaction times (in ms)
  • EEG signals (in microvolts)
  • f0, F1, F2…

Probability distribution \(p(Y)\):

  • Probability Density Function (PDF)
  • The PDF assigns a density to each possible outcome of \(Y\); the probability that \(Y\) falls within a given range of values is the area under the PDF over that range.

Discrete random variable: Binomial distribution


PMF for binomial distribution:

\[\begin{equation} f(k\mid n,\theta) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \end{equation}\]

Here, \(n\) represents the total number of trials, \(k\) the number of successes, and \(\theta\) the probability of success. The term \(\binom{n}{k}\), the number of ways in which one can choose \(k\) successes out of \(n\) trials, expands to \(\frac{n!}{k!(n-k)!}\).
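
As a sanity check, the PMF written out by hand agrees with R's built-in dbinom; for example, with \(n = 10\), \(k = 7\), and \(\theta = 0.5\):

# the PMF written out by hand: 120 * 0.5^10 = 0.1171875
choose(10, 7) * 0.5^7 * (1 - 0.5)^(10 - 7)

# the same value from R's built-in binomial PMF
dbinom(7, size = 10, prob = 0.5)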

Continuous random variable: Normal distribution


PDF for normal distribution \(X \stackrel{iid}{\sim} Normal(\mu,\sigma)\):

\[\begin{equation} f(x\mid\mu,\sigma)= \frac{1}{\sqrt{2\pi \sigma^2}} \exp \left(-\frac{(x-\mu)^2}{2\sigma^2} \right) \end{equation}\]

Assumption: Each data point in the vector of data y is independent of the others.

The kernel of the normal PDF is \(g(x\mid\mu,\sigma)=\exp \left(-\frac{(x-\mu)^2}{2\sigma^2} \right)\); multiplying by the normalising constant \(k=\frac{1}{\sqrt{2\pi\sigma^2}}\) ensures that \(k\int g(x)\,dx=1\).

Probability in a continuous distribution is the area under the curve, \(P(X<u)=\int_{-\infty}^{u}f(x)\,dx\), and is always zero at any single point value: \(P(X=x)=0\).
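
A short check of these definitions in R (the evaluation points are arbitrary): the hand-written density matches dnorm, and the area under the curve matches pnorm.

mu <- 0; sigma <- 1; x <- 0.5

# the PDF written out by hand vs. R's built-in dnorm
(1 / sqrt(2 * pi * sigma^2)) * exp(-(x - mu)^2 / (2 * sigma^2))
dnorm(x, mean = mu, sd = sigma)

# P(X < 1) as the area under the curve, two ways
pnorm(1, mean = mu, sd = sigma)
integrate(dnorm, lower = -Inf, upper = 1, mean = mu, sd = sigma)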


Expectation and Variance


Expectation \(E(X)\): the weighted mean of the possible outcomes, weighted by the respective probabilities of each outcome.

Variance is defined as: \(Var(X) = E[X^2]-E[X]^2\)

Discrete: Binomial

Expectation: \(E\left[Y\right]=\sum_y y \cdot f(y)=n\theta\)

Variance: \(Var(Y)=n\theta(1-\theta)\)

  • Estimated \(\hat{\theta}=\dfrac{k}{n}\) (Maximum likelihood estimate of the true parameter \(\theta\))

Continuous: Normal

Expectation: \(E[X]=\int xf(x)dx=\mu\)

Variance: \(Var(X)=\sigma^2\)

  • Estimated \(\hat{\mu}=\bar{y}=\dfrac{\sum y}{n}\) (Maximum likelihood estimate of \(\mu\))
  • Estimated \(\hat{\sigma}^2=\dfrac{\sum (y-\bar{y})^2}{n-1}\) (the unbiased estimate; the MLE divides by \(n\) rather than \(n-1\))
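
These formulas can be verified by simulation (a sketch with arbitrary parameter values):

set.seed(1)

# Binomial: E[Y] = n * theta, Var(Y) = n * theta * (1 - theta)
y <- rbinom(1e5, size = 10, prob = 0.7)
mean(y) # close to 10 * 0.7 = 7
var(y)  # close to 10 * 0.7 * 0.3 = 2.1

# Normal: E[X] = mu, Var(X) = sigma^2
x <- rnorm(1e5, mean = 500, sd = 100)
mean(x) # close to 500
var(x)  # close to 100^2 = 10000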

Likelihood Function


The likelihood function \(\mathcal{L}(\theta \mid k,n)\) refers to the PMF \(p(k|n,\theta)\), treated as a function of \(\theta\).

Suppose that we record \(n = 10\) trials, and observe \(k = 7\) successes (heads in coin tosses).

\[\begin{equation} \mathcal{L}(\theta \mid k=7,n=10)= \binom{10}{7} \theta^{7} (1-\theta)^{10-7} \end{equation}\]
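
A sketch of this likelihood function in R: evaluating it over a grid of \(\theta\) values shows a peak at the MLE, \(\hat{\theta} = k/n = 0.7\).

theta <- seq(0, 1, by = 0.001)

# likelihood of each theta given k = 7 successes in n = 10 trials
likelihood <- dbinom(7, size = 10, prob = theta)

plot(theta, likelihood, type = "l",
     xlab = expression(theta), ylab = "Likelihood")

theta[which.max(likelihood)] # the MLE, k/n = 0.7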

Marginal likelihood


The concept of “integrating out a parameter”:

\[\begin{equation} p(k=7 \mid n=10)=\int_0^1 \binom{10}{7} \theta^{7} (1-\theta)^{10-7} \, d\theta \end{equation}\]

Marginal likelihood: the likelihood computed by “marginalizing out” the parameter \(\theta\). It is a weighted sum (here, an integral) of the likelihood, weighted by the prior plausibility of each possible parameter value; the equation above implicitly assumes a Uniform(0, 1) prior, \(p(\theta)=1\).
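
This integral can be evaluated numerically; the exact value is \(1/(n+1) = 1/11 \approx 0.0909\), because the integrand is \(\binom{10}{7}\) times the kernel of a Beta(8, 4) density.

# marginal likelihood of k = 7 out of n = 10, uniform prior on theta
integrate(function(theta) dbinom(7, size = 10, prob = theta),
          lower = 0, upper = 1) # 1/11 = 0.0909...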

Bivariate distributions


Discrete case

We have a joint PMF \(p_{X,Y} (x, y)\) for each possible pair of values of X and Y.

The marginal distributions: \[\begin{equation} p_{X}(x)=\sum_{y\in S_{Y}}p_{X,Y}(x,y) \end{equation}\]

\[\begin{equation} p_{Y}(y)=\sum_{x\in S_{X}}p_{X,Y}(x,y) \end{equation}\]

The conditional distributions: \[\begin{equation} p_{X\mid Y}(x\mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} \end{equation}\]

\[\begin{equation} p_{Y\mid X}(y\mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)} \end{equation}\]
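
A toy example of these definitions in R (the joint PMF values are made up): marginals come from summing rows or columns of the joint, and conditionals from dividing the joint by a marginal.

# a made-up joint PMF for X in {0, 1} (rows) and Y in {0, 1, 2} (columns)
joint <- matrix(c(0.10, 0.20, 0.10,
                  0.25, 0.20, 0.15),
                nrow = 2, byrow = TRUE)

rowSums(joint) # marginal p_X(x)
colSums(joint) # marginal p_Y(y)

joint[, 1] / colSums(joint)[1] # conditional p_{X|Y}(x | y = 0)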

Bivariate distributions


Continuous case

PDF and CDF of the bivariate normal (\(\rho=0\)):
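
A sketch of evaluating the bivariate normal PDF and CDF in R, assuming the mvtnorm package is installed; with \(\rho = 0\), both factorise into products of the univariate marginals.

library(mvtnorm) # assumed to be installed

Sigma <- diag(2) # unit variances, rho = 0

# joint PDF at (0, 0); with rho = 0 it equals dnorm(0)^2
dmvnorm(c(0, 0), mean = c(0, 0), sigma = Sigma)
dnorm(0)^2

# joint CDF P(X < 1, Y < 1); with rho = 0 it equals pnorm(1)^2
pmvnorm(upper = c(1, 1), mean = c(0, 0), sigma = Sigma)
pnorm(1)^2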