

Bayesian Statistical Methods

Queen Mary, Autumn 2022


Contents

1 Likelihood

1.1 Maximum likelihood estimation

1.2 Standard error

2 Bayesian inference

2.1 Bayes' theorem

2.2 Bayes' theorem and Bayesian inference

2.3 Conjugate prior distributions

2.4 Point estimates and credible intervals

3 Specifying prior distributions

3.1 Informative prior distributions

3.2 Less informative prior distributions

4 Markov chain Monte Carlo methods

4.1 The Metropolis algorithm

5 Predictive distributions

5.1 Simulating the posterior predictive distribution

6 Hierarchical models

6.1 Inference for hierarchical models

6.2 Advantages of hierarchical models

7 Tests and model comparisons


Introduction

So far at Queen Mary, the statistics modules have taught the classical or frequentist approach, which is

based on the idea that probability represents a long-run limiting frequency. In the Bayesian approach,

any uncertain quantity is described by a probability distribution, and so probability represents a

degree of belief in an event which is conditional on the knowledge of the person concerned.

This course will introduce you to Bayesian statistics. These notes are self-contained but you may

want to read other accounts of Bayesian statistics as well. A useful introductory textbook is:

Bayesian Statistics: An Introduction (4th ed.) by P M Lee. (This is available as an e-book

from the library)

Parts of the following are also useful:

Bayesian Inference for Statistical Analysis by G E P Box and G C Tiao.

Bayesian Data Analysis by A Gelman, J B Carlin, H S Stern and D B Rubin.

Probability and Statistics from a Bayesian viewpoint (Vol 2) by D V Lindley.


Chapter 1

Likelihood

First we review the concept of likelihood, which is essential for Bayesian theory, but can also be used

in frequentist methods. Let y be the data that we observe, which is usually a vector. We assume

that y was generated by some probability model which we can specify. Suppose that this probability

model depends on one or more parameters θ, which we want to estimate.

Definition 1.1. If the components of y are continuous, then the likelihood is defined as the joint

probability density function of y; if y is discrete, then the likelihood is defined as the joint probability

mass function of y. In either case, we denote the likelihood as

p(y | θ).

Example 1.1. Let y = y1, . . . , yn be a random sample from a normal distribution with unknown

parameters μ and σ. Then θ is the vector (μ, σ), and the likelihood is the joint probability density

function

p(y | μ, σ) = ∏_{i=1}^{n} φ(yi | μ, σ).

Note that ∏ here is the symbol for the product:

∏_{i=1}^{n} ai = a1 × a2 × · · · × an.

φ is the normal probability density function with parameters μ and σ:

φ(yi | μ, σ) = exp(−(yi − μ)² / (2σ²)) / √(2πσ²).
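The likelihood above is straightforward to evaluate numerically as a product of normal densities. A minimal sketch in Python (the data values and parameter settings below are invented for illustration):

```python
import math

def normal_pdf(y, mu, sigma):
    """Normal density phi(y | mu, sigma)."""
    return math.exp(-(y - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def normal_likelihood(ys, mu, sigma):
    """Joint likelihood p(y | mu, sigma): product of densities over the sample."""
    like = 1.0
    for y in ys:
        like *= normal_pdf(y, mu, sigma)
    return like

# Hypothetical sample:
ys = [4.8, 5.1, 5.3]
print(normal_likelihood(ys, mu=5.0, sigma=0.5))
```

In practice the product of many small densities underflows, which is one reason the log-likelihood (Section 1.1) is preferred.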


Example 1.2. Suppose we observe k successes in n independent trials, where each trial has proba-

bility of success q. Now 牟 = q, the observed data is k, and the likelihood is the binomial probability

mass function

p(k | q) = (n choose k) q^k (1 − q)^(n−k).
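This binomial likelihood can be computed directly with Python's `math.comb` (the particular values of k, n and q below are invented for illustration):

```python
from math import comb

def binomial_likelihood(k, n, q):
    """p(k | q) = C(n, k) q^k (1 - q)^(n - k)."""
    return comb(n, k) * q ** k * (1 - q) ** (n - k)

# e.g. 3 successes in 10 trials with q = 0.5:
print(binomial_likelihood(3, 10, 0.5))  # C(10, 3) / 2^10 = 120/1024
```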

It is also possible to construct likelihoods which combine probabilities and probability density func-

tions, for example if the observed data contains both discrete and continuous components. Alter-

natively, probabilities may appear in the likelihood if continuous data is only observed to lie within

some interval.

Example 1.3. Assume that the time until failure for a certain type of light bulb is exponentially

distributed with parameter 竹, and we observe n bulbs, with failure times t = t1, . . . , tn.

The likelihood contribution for a single observation ti is the exponential probability density function

λ e^(−λ ti).

So the overall likelihood is the joint probability density function

p(t | λ) = ∏_{i=1}^{n} λ e^(−λ ti).

Suppose instead that we observe the failure time for the first m light bulbs with m < n, but for the

remaining n − m bulbs we only observe that they have not failed by time ti. Then for i ≤ m, the

likelihood contributions are as before.

For i > m, the likelihood is the probability of what we have observed. Denoting the random variable

for the failure time by Ti, we have observed that Ti > ti, so the likelihood contribution is

p(Ti > ti) = e^(−λ ti).

This is because the cumulative distribution function is p(Ti ≤ ti) = 1 − e^(−λ ti).

Hence the overall likelihood is now

p(t | λ) = ∏_{i=1}^{m} λ e^(−λ ti) × ∏_{i=m+1}^{n} e^(−λ ti).
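This censored likelihood combines density contributions for observed failures with survival probabilities for the bulbs still working. A minimal sketch in Python (times and rate invented for illustration):

```python
import math

def censored_exp_likelihood(times, m, lam):
    """Likelihood for exponential failure data: the first m entries of `times`
    are exact failure times (density lam*exp(-lam*t)); the remaining entries
    are censoring times, contributing P(T > t) = exp(-lam*t)."""
    like = 1.0
    for i, t in enumerate(times):
        if i < m:
            like *= lam * math.exp(-lam * t)   # observed failure: pdf
        else:
            like *= math.exp(-lam * t)         # censored: survival probability
    return like

# Two observed failures, one bulb still alive at time 4.0 (invented data):
print(censored_exp_likelihood([1.0, 2.0, 4.0], m=2, lam=0.5))
```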

1.1 Maximum likelihood estimation

In example 1.1, the parameters 米 and 考 of the normal distribution which generated the data are

known as population parameters. An estimator is defined as a function of the observed data which


we use as an estimate of a population parameter. For example the sample mean and variance may

be used as estimators of the population mean and variance, respectively.

To use the likelihood to estimate parameters θ, we find the vector θ̂ which maximizes the likelihood

p(y | θ), given the data y that we have observed. This is the method of maximum likelihood, and

the estimator θ̂ is known as the maximum likelihood estimator, or MLE.

Usually the parameters θ are continuous, and those are the only examples we cover, but the idea of

likelihood also makes sense if the unknown quantity is discrete.

When finding the MLE it is usually more convenient to work with the log of the likelihood: as the

log is a monotonically increasing function, the same θ will maximize p(y | θ) and log(p(y | θ)). The

log-likelihood is denoted by

ℓ(θ; y) = log(p(y | θ)).

Since the likelihood is typically a product of terms for independent observations, the log-likelihood

is a sum of terms, so using the log greatly simplifies finding the derivatives in order to find the

maximum.

Returning to the binomial example 1.2, the log-likelihood is

ℓ(q; k) = log (n choose k) + k log(q) + (n − k) log(1 − q).
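For the binomial case, ℓ(q; k) = log C(n, k) + k log q + (n − k) log(1 − q), and its maximizer is the MLE q̂ = k/n. A minimal numerical check by grid search (the values of k and n are invented for illustration):

```python
import math

def binom_loglik(q, k, n):
    """Binomial log-likelihood l(q; k), including the log C(n, k) constant."""
    return math.log(math.comb(n, k)) + k * math.log(q) + (n - k) * math.log(1 - q)

k, n = 3, 10
grid = [i / 1000 for i in range(1, 1000)]                 # q in (0, 1)
q_hat = max(grid, key=lambda q: binom_loglik(q, k, n))    # grid-search MLE
print(q_hat)  # close to k/n = 0.3
```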

We also do not cover confidence intervals, but we do cover their Bayesian counterpart, which are

called credible intervals.


Chapter 2

Bayesian inference

2.1 Bayes' theorem

Bayes' theorem is a formula from probability theory that is central to Bayesian inference. It is

named after Rev. Thomas Bayes, a nonconformist minister who lived in England in the first half of

the eighteenth century. The theorem states that:

Theorem 2.1. Let Ω be a sample space and A1, A2, . . . , Am be mutually exclusive and exhaustive

events in Ω (i.e. Ai ∩ Aj = ∅ for i ≠ j, and ∪_{i=1}^{m} Ai = Ω; the Ai form a partition of Ω). Let B

be any event with p(B) > 0. Then

p(Ai | B) = p(Ai) p(B | Ai) / p(B) = p(Ai) p(B | Ai) / Σ_{j=1}^{m} p(Aj) p(B | Aj).

The proof follows from the definition of conditional probability and the law of total probability.

Example 2.1. Suppose a test for an infection has 90% sensitivity and 95% specificity, and the

proportion of the population with the infection is q = 1/2000. Sensitivity is the probability of

detecting a genuine infection, and specificity is the probability of being correct about a non-infection.

So p(+ve test | infected) = 0.9 and p(-ve test | not infected) = 0.95. What is the probability that

someone who tests positive is infected?

Let the events be as follows:

B: test positive

A1: is infected

A2: is not infected

We want to find p(A1 | B). The probabilities we have are p(A1) = 1/2000, p(B | A1) = 0.9 and

p(B | A2) = 1 − p(Bᶜ | A2) = 1 − 0.95 = 0.05.


Applying Bayes' theorem,

p(A1 | B) = p(A1) p(B | A1) / Σ_{j=1}^{2} p(Aj) p(B | Aj)

p(A1) p(B | A1) = 1/2000 × 0.9 = 0.00045

p(A2) p(B | A2) = (1 − 1/2000) × 0.05 ≈ 0.05

Hence

p(A1 | B) ≈ 0.00045 / (0.00045 + 0.05) ≈ 0.0089

So there is a less than 1% chance that the person is infected if they test positive.
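The calculation above is easy to reproduce in code; a small sketch in Python with the sensitivity, specificity and prevalence from the example:

```python
def posterior_prob(prior, sens, spec):
    """P(infected | positive test) via Bayes' theorem for a binary test."""
    p_pos_given_inf = sens            # sensitivity
    p_pos_given_not = 1 - spec        # false-positive rate
    num = prior * p_pos_given_inf
    den = num + (1 - prior) * p_pos_given_not
    return num / den

p = posterior_prob(prior=1 / 2000, sens=0.9, spec=0.95)
print(round(p, 4))  # about 0.0089
```

The counterintuitively small answer reflects the low prior prevalence: almost all positives come from the large uninfected group.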

Bayes' theorem is also applicable to probability densities.

Theorem 2.2. Let X, Y be two continuous r.v.s (possibly multivariate) and let f(x, y) be the joint

probability density function (pdf), f(x | y) the conditional pdf, etc. Then

f(x | y) = f(y | x) f(x) / f(y) = f(y | x) f(x) / ∫ f(y | x′) f(x′) dx′.

Alternatively, Y may be discrete, in which case f(y | x) and f(y) are probability mass functions.

2.2 Bayes' theorem and Bayesian inference

In the Bayesian framework, all uncertainty is specified by probability distributions. This includes

uncertainty about the unknown parameters. So we need to start with a probability distribution for

the parameters, p(θ).

We then update the probability distribution for θ using the observed data y. This updating is done

using Bayes' theorem:

p(θ | y) = p(θ) p(y | θ) / p(y).

p(y | θ) is the likelihood for parameters θ given the observed data y.

We don't normally need to find the normalizing constant p(y), which is given by

p(y) = ∫ p(θ) p(y | θ) dθ or Σ_θ p(θ) p(y | θ)

(the integral for continuous θ, the sum for discrete θ).


So the procedure is as follows:

Start with a distribution p(θ): this is known as the prior distribution.

Combine the prior distribution with the likelihood p(y | θ) using Bayes' theorem.

The resulting probability distribution p(θ | y) is known as the posterior distribution.

We base our inferences about θ on this posterior distribution.

The use of Bayes' theorem can be summarized as

Posterior distribution ∝ prior distribution × likelihood
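This proportionality can be seen numerically: discretize θ on a grid, multiply prior by likelihood pointwise, and normalize. A sketch for a binomial likelihood with a uniform prior (the values of k and n are invented for illustration):

```python
from math import comb

# Grid approximation of: posterior ∝ prior × likelihood.
k, n = 3, 10
N = 2000
qs = [(i + 0.5) / N for i in range(N)]                         # grid midpoints on (0, 1)
prior = [1.0] * N                                              # uniform prior density
lik = [comb(n, k) * q ** k * (1 - q) ** (n - k) for q in qs]   # binomial likelihood
unnorm = [p * l for p, l in zip(prior, lik)]
total = sum(unnorm)
post = [w / total for w in unnorm]                             # normalized posterior weights

post_mean = sum(q * w for q, w in zip(qs, post))
print(post_mean)  # close to the exact Beta(k+1, n-k+1) mean (k+1)/(n+2) = 1/3
```

The grid never needs the normalizing constant p(y) explicitly: dividing by the sum of the unnormalized weights plays exactly that role.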

Example 2.2. Suppose a biased coin has probability of heads q, and we observe k heads in n

independent coin tosses. We saw the binomial likelihood for this problem:

p(k | q) = (n choose k) q^k (1 − q)^(n−k)

For Bayesian inference, we need to specify a prior distribution for q. As q is a continuous quantity

between 0 and 1, the family of beta distributions is a reasonable choice.

If X ~ Beta(α, β), its probability density function is

f(x) = x^(α−1) (1 − x)^(β−1) / B(α, β),  0 ≤ x ≤ 1,

where B is the beta function and α, β > 0 are parameters.

B(α, β) = ∫_0^1 x^(α−1) (1 − x)^(β−1) dx, and also B(α, β) = Γ(α) Γ(β) / Γ(α + β).

The uniform distribution on [0, 1] is a special case of the beta distribution with α = 1, β = 1.

Combining prior distribution and likelihood, we have the posterior distribution p(q | k) given by

p(q | k) ∝ p(q) p(k | q) = [q^(α−1) (1 − q)^(β−1) / B(α, β)] × (n choose k) q^k (1 − q)^(n−k) ∝ q^(k+α−1) (1 − q)^(n−k+β−1)

We can recognize this posterior distribution as being proportional to the pdf of a Beta(k + α, n − k + β)

distribution. Hence we do not need to explicitly work out the normalizing constant for p(q | k); we

can immediately see that q | k has a Beta(k + α, n − k + β) distribution.


For a Bayesian point estimate for q, we summarize the posterior distribution, for example by its

mean. A Beta(α, β) rv has expected value α/(α + β). Hence our posterior mean for q is

E(q | k) = (k + α) / (n + α + β).

Recall that the maximum likelihood estimator is

q̂ = k/n.

So for large values of k and n, the Bayesian estimate and MLE will be similar, whereas they differ

more for smaller sample sizes.

In the special case that the prior distribution is uniform on [0, 1], the posterior distribution is

Beta(k + 1, n − k + 1), and the posterior mean value for q is

E(q | k) = (k + 1) / (n + 2).

2.3 Conjugate prior distributions

In the binomial example 2.2 with a beta prior distribution, we saw that the posterior distribution is

also a beta distribution. When we have the same family of distributions for the prior and posterior

for one or more parameters, this is known as a conjugate family of distributions. We say that the

family of beta distributions is conjugate to the binomial likelihood.

The binomial likelihood is

p(k | q) ∝ q^k (1 − q)^(n−k).

With several parameters, the marginal posterior density of a single parameter is found by integrating

the joint posterior over the remaining parameters; for example, with three parameters,

p(θ1 | y) = ∫∫ p(θ1, θ2, θ3 | y) dθ2 dθ3.

In practical Bayesian inference, Markov Chain Monte Carlo methods, covered later in these notes,

are the most common means of approximating the posterior distribution. These methods produce an

approximate sample from the joint posterior density, and once this is done the marginal distribution

of each parameter is immediately available.

Example 2.6. Generating samples from the posterior distribution may also be helpful even if we

can calculate the exact posterior distribution. Suppose that the data are the outcome of a clinical

trial of two treatments for a serious illness, the number of deaths after each treatment. Let the data

be ki deaths out of ni patients, i = 1, 2 for the two treatments, and the two unknown parameters

are q1 and q2, the probability of death with each treatment. Assuming a binomial model for each

outcome, and independent Beta(αi, βi) priors for each parameter, the posterior distribution is

qi | ki ~ Beta(ki + αi, ni − ki + βi), i = 1, 2.

We have independent prior distributions and independent likelihood terms, so the posterior distributions

are also independent:

p(q1, q2 | k1, k2) = p(q1 | k1) p(q2 | k2) ∝ p(k1 | q1) p(q1) p(k2 | q2) p(q2)

However, it is useful to think in terms of the joint posterior density for q1 and q2, as then we can

make probability statements involving both parameters. In this case, one quantity of interest is the

probability P(q2 < q1), i.e. whether the second treatment has a lower death rate than the first. To

find this probability, we need to integrate the joint posterior density over the relevant region, which

is not possible to do exactly when it is a product of beta pdfs.

We can approximate the probability by generating a sample of (q1, q2) pairs from the joint density.

To generate the sample, we just need to generate each parameter from its beta posterior distribution,

which can be done in R using the rbeta command. Then once we have the sample, we just count

what proportion of pairs has q2 < q1 to estimate P (q2 < q1).
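The notes use R's rbeta; an equivalent sketch in Python with NumPy's beta sampler (the trial counts below are invented for illustration, with uniform Beta(1, 1) priors):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trial data: k_i deaths out of n_i patients per treatment.
k1, n1 = 8, 10
k2, n2 = 2, 10

S = 100_000
q1 = rng.beta(k1 + 1, n1 - k1 + 1, size=S)   # draws from posterior Beta(9, 3)
q2 = rng.beta(k2 + 1, n2 - k2 + 1, size=S)   # draws from posterior Beta(3, 9)

prob = np.mean(q2 < q1)                      # Monte Carlo estimate of P(q2 < q1)
print(prob)
```

Because the posteriors are independent, sampling each marginal separately gives valid draws from the joint density, and the proportion of pairs with q2 < q1 estimates the probability of interest.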


Chapter 3

Specifying prior distributions

The posterior distribution depends on both the observed data via the likelihood, and also on the

prior distribution. So far, we have taken the prior distribution as given, but now we look at how to

specify a prior.

3.1 Informative prior distributions

An informative prior distribution is one in which the probability mass is concentrated in some subset

of the possible range for the parameter(s), and is usually based on some specific information. There

may be other data that is relevant, and we might want to use this information without including all

previous data in our current model. In that case, we can use summaries of the data to find a prior

distribution.

Example 3.1. In example 2.3, the data was the lifetimes of light bulbs, t = (t1, . . . , tn), assumed to

be exponentially distributed with parameter λ (the failure rate, the reciprocal of the mean lifetime). The

gamma distribution provides a conjugate prior distribution for λ. Suppose that we had information

from several other similar types of light bulbs, which had observed failure rates r1, . . . , rk. Let the

sample mean and variance of these rates be m and v.

A Gamma(α, β) distribution has mean α/β and variance α/β². Matching these to the sample moments

gives α/β = m and α/β² = v, so we take β = m/v and α = m²/v.
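Since a Gamma(α, β) distribution has mean α/β and variance α/β², matching these moments to the sample mean m and variance v of the previously observed rates gives β = m/v and α = m²/v. A minimal sketch (the summary statistics below are invented for illustration):

```python
def gamma_prior_from_moments(m, v):
    """Moment-match a Gamma(alpha, beta) prior (mean alpha/beta,
    variance alpha/beta^2) to a sample mean m and variance v."""
    beta = m / v
    alpha = m * m / v
    return alpha, beta

# Invented summary statistics of earlier failure rates:
alpha, beta = gamma_prior_from_moments(m=2.0, v=0.5)
print(alpha, beta)  # 8.0 4.0  (mean 2.0 and variance 0.5 are recovered)
```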

