STA314 Fall 2021 Homework 4

Homework 4

Deadline: Monday, Nov. 29, at 11:59pm.

Submission: You need to submit one file through Quercus with our answers to Questions 1, 2,

and 3 as well as code requested for Question 3. You can produce the PDF file however you like

(e.g. LATEX, Microsoft Word, scanner), as long as it is readable.

Neatness Point: One point will be given for neatness. You will receive this point as long as we

don’t have a hard time reading your solutions or understanding the structure of your code.

Late Submission: 10% of the total possible marks will be deducted for each day late, up to a

maximum of 3 days. After that, no submissions will be accepted.

Computing: To install Python and required libraries, see the instructions on the course web page.

Homeworks are to be done alone or in pairs. See the Course Information handout1

for detailed

policies.

1. [6pts] Categorial Distribution. In this problem you will consider a Bayesian approach to

modelling categorical outcomes. Let’s consider fitting the categorical distribution, which is

a discrete distribution over K outcomes, which we’ll number 1 through K. The probability

of each category is explicitly represented with parameter θk. For it to be a valid probability

distribution, we clearly need θk ≥ 0 and P

k

θk = 1. We’ll represent each observation x as

a 1-of-K encoding, i.e, a vector where one of the entries is 1 and the rest are 0. Under this

model, the probability of an observation can be written in the following form:

p(x|θ) = Y

K

k=1

θ

xk

k

.

Suppose you observe a dataset,

D = {x

(i)

}

N

i=1.

Denote the count for outcome k as Nk =

PN

i=1 x

(i)

k

and N =

PK

k=1 Nk. Recall that each data

point is in the 1-of-K encoding, i.e., x

(i)

k = 1 if the ith datapoint represents an outcome k

and x

(i)

k = 0 otherwise.

(a) [2pts] First, derive ?θk, which is the maximum likelihood estimator (MLE), for the class

probabilities θk. You may assume that Nk > 0 for this question. Derivations should be

rigorous.

Hint 1: We saw in lecture that MLE can be thought of as ‘ratio of counts’ for the data,

so what should ?θk be counting?

Hint 2: Similar to the binary case, write p(x

(i)

| θ) = ΠK

k=1θ

x

(i)

k

k

. The challenge in

maximizing the log-likelihood is that this problem is a constrained maximization under

the constraint that PK

k=1 θk = 1. To overcome this, we will use a generalization of the

idea from the coin flip example in lecture. We can turn the MLE maximization problem

1

https://www.cs.toronto.edu/~cmaddis/courses/sta314_f21/sta314_f21_syllabus.pdf

1

STA314 Fall 2021 Homework 4

into an easier constrained problem by setting θK = 1 ?

PK?1

k=1 θk. Note that this is a

maximization problem over the set {(θk)k6=K|θk > 0,

P

k6=K θk < 1}. Although this is

still constrained, the log likelihood is a concave function that goes to ?∞ as we approach

the boundary of the constraint set, so the function attains its maximum in the interior

of the region when ? log p(D|θ)/?θk = 0.

(b) [2pts] Now we will take a Bayesian approach. For the prior, we’ll use the Dirichlet

distribution, which is defined over the set of probability vectors (i.e. vectors that are

nonnegative and whose entries sum to 1). Its PDF is as follows:

p(θ) ∝ θ

a1?1

1

· · · θ

ak?1

K .

What is the probability distribution of the posterior distribution p(θ | D)? Don’t just

give the density of the posterior, say which family of distributions it belongs to.

(c) [1pt] Still assuming the Dirichlet prior distribution, determine the MAP estimate of the

parameter vector θ. For this question, you may assume each ak > 1.

(d) [1pts] Now, suppose that your friend said that they had a hidden N + 1st outcome,

x

(N+1), drawn from the same distribution as the previous N outcomes. Your friend does

not want to reveal the value of x

(N+1) to you. So, you want to use your Bayesian model

to predict what you think x

(N+1) is likely to be. The “proper” Bayesian predictor is the

so-called posterior predictive distribution:

p(x

(N+1)|D) = Z

p(x

(N+1)|θ)p(θ|D) dθ

What is the probability that the N +1 outcome was k, i.e., the probability that x

(N+1)

k =

1, under your posterior predictive distribution? Hint: A useful fact is that if θ ～

Dirichlet(a1, . . . , aK), then

E[θk] = P

ak

k

0 ak

0

.

Report your answers to the above questions.

2. [5pts] Gaussian Na¨?ve Bayes. In this question, you will derive the maximum likelihood

estimates for Gaussian Na¨?ve Bayes, which is just like the na¨?ve Bayes model from lecture,

except that the features are continuous, and the conditional distribution of each feature given

the class is (univariate) Gaussian rather than Bernoulli. Start with the following generative

model for a random discrete class label t ∈ {1, 2, ..., K} and a random real valued vector of

D features x ∈ R

D:

p(t = k) = αk (0.1)

p(x|t = k) = Y

D

d=1

2πσ2

d

!?1/2

exp(

?

X

D

d=1

1

2σ

2

d

(xd ? μkd)

2

)

(0.2)

where αk ≥ 0 is the prior on class k, σ

2

d > 0 are the variances for each feature, which are

shared between all classes, and μkd ∈ R is the mean of the feature d conditioned on class k.

We write α to represent the vector with elements αk and similarly σ is the vector of variances.

The matrix of class means is written μ where the kth row of μ is the mean for class k.

(a) [1pt] Use Bayes’ rule to derive an expression for p(t = k|x). Hint: Use the law of total

probability to derive an expression for p(x).

2

STA314 Fall 2021 Homework 4

(b) [1pt] Write down an expression for the likelihood function (LL)

`(θ) = log p(t

(1)

, x

(1), t(2)

, x

(2)

, · · · , t(N)

, x

(N)

) (0.3)

of a particular dataset D = {(t

(1)

, x

(1)),(t

(2)

, x

(2)), · · · ,(t

(N)

, x

(N)

)} with parameters

θ = {α, μ,σ} and this model. (Assume the data are i.i.d.)

(c) [3pts] Take partial derivatives of the likelihood with respect to each of the parameters

μkd and with respect to the shared variances σ

2

d

. Report these partial derivatives and

find the maximum likelihood estimates for μkd and σ

2

d

. You may assume that each class

appears at least once in the dataset, i.e. the number of times Nk that class k appears

in the dataset is Nk > 0.

Report your answers to the above questions.

3. [6pts] Gaussian Discriminant Analysis. For this question you will build classifiers to

label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector

of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale

and the pixel values are between 0 and 1. The labels y are 0, 1, 2, . . . , 9 corresponding to

which character was written in the image. There are 700 training cases and 400 test cases for

each digit; they can be found in the .txt files provided.

A skeleton (q3.py) is is provided for each question that you should use to structure your code.

Starter code to help you load the data is provided (data.py). Note: the get digits by label

function in data.py returns the subset of digits that belong to a given class.

Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full

covariance matrix for each class. Remember that the conditional multivariate Gaussian probability

density is given by,

p(x |t = k) = (2π)

?D/2

|Σk|

?1/2

exp

?

1

2

(x ? μk

)

T Σ

?1

k

(x ? μk

)

(0.4)

where μk ∈ R

D, Σk ∈ R

D×D and positive-definite. You should take p(t = k) = 1

10

. You

will compute parameters μkj and Σk for k ∈ (0...9), j ∈ (1...64). You should implement the

covariance computation yourself (i.e. without the aid of ’np.cov’). Hint: To ensure numerical

stability you may have to add a small multiple of the identity to each covariance matrix. For

this assignment you should add 0.01I to each covariance matrix.

(a) [5pts] Complete the 5 functions that are not complete in the starter code ([1pt] each).

Include the function body for each completed function in your report.

(b) [1pt] Report the average conditional log-likelihood, i.e. 1

N

PN

i=1 log p(t

(i)

| x

(i)

), of the

trained model on both the train and test set. Report the accuracy, of the classifier that

selects most likely posterior class for each data point using the trained model, on the

train and test set.

3

版权所有：留学生编程辅导网 2020 All Rights Reserved 联系方式：QQ:99515681 电子信箱：99515681@qq.com

免责声明：本站部分内容从网络整理而来，只供参考！如有版权问题可联系本站删除。