
STA314 Fall 2021 Homework 4

Homework 4

Deadline: Monday, Nov. 29, at 11:59pm.

Submission: You need to submit one PDF file through Quercus with your answers to Questions 1, 2, and 3, as well as the code requested for Question 3. You can produce the PDF file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.

Neatness Point: One point will be given for neatness. You will receive this point as long as your submission is legible and clearly organized.

Late Submission: 10% of the total possible marks will be deducted for each day late, up to a maximum of 3 days. After that, no submissions will be accepted.

Computing: To install Python and required libraries, see the instructions on the course web page.

Homeworks are to be done alone or in pairs. See the Course Information handout for detailed policies.

1. [6pts] Categorical Distribution. In this problem you will consider a Bayesian approach to modelling categorical outcomes. Let's consider fitting the categorical distribution, which is a discrete distribution over K outcomes, which we'll number 1 through K. The probability of each category is explicitly represented with a parameter θ_k. For this to be a valid probability distribution, we clearly need θ_k ≥ 0 and ∑_k θ_k = 1. We'll represent each observation x as a 1-of-K encoding, i.e., a vector where one of the entries is 1 and the rest are 0. Under this model, the probability of an observation can be written in the following form:

p(x | θ) = ∏_{k=1}^K θ_k^{x_k}.

Suppose you observe a dataset,

D = {x^{(i)}}_{i=1}^N.

Denote the count for outcome k as N_k = ∑_{i=1}^N x_k^{(i)}, and let N = ∑_{k=1}^K N_k. Recall that each data point is in the 1-of-K encoding, i.e., x_k^{(i)} = 1 if the ith datapoint represents outcome k and x_k^{(i)} = 0 otherwise.

(a) [2pts] First, derive θ̂_k, the maximum likelihood estimator (MLE) for the class probabilities θ_k. You may assume that N_k > 0 for this question. Derivations should be rigorous.

Hint 1: We saw in lecture that the MLE can be thought of as a 'ratio of counts' for the data, so what should θ̂_k be counting?

Hint 2: Similar to the binary case, write p(x^{(i)} | θ) = ∏_{k=1}^K θ_k^{x_k^{(i)}}. The challenge in maximizing the log-likelihood is that this is a constrained maximization under the constraint that ∑_{k=1}^K θ_k = 1. To overcome this, we will use a generalization of the idea from the coin flip example in lecture. We can turn the MLE maximization problem into an easier constrained problem by setting θ_K = 1 − ∑_{k=1}^{K−1} θ_k. Note that this is a maximization problem over the set {(θ_k)_{k≠K} | θ_k > 0, ∑_{k≠K} θ_k < 1}. Although this is still constrained, the log-likelihood is a concave function that goes to −∞ as we approach the boundary of the constraint set, so the function attains its maximum in the interior of the region, where ∂ log p(D | θ) / ∂θ_k = 0.
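As a quick numerical illustration of Hint 1 (not a substitute for the derivation), the sketch below draws 1-of-K data, forms the count ratios N_k / N, and checks that they score at least as well under the log-likelihood as another valid probability vector. All names here (theta_true, theta_hat, etc.) are ours, not part of the assignment:

```python
import math
import random

random.seed(0)

# True class probabilities for a K = 3 categorical distribution.
theta_true = [0.5, 0.3, 0.2]

# Draw N outcomes and accumulate the counts N_k.
N = 10_000
counts = [0, 0, 0]
for _ in range(N):
    k = random.choices(range(3), weights=theta_true)[0]
    counts[k] += 1

def log_likelihood(theta, counts):
    """Log-likelihood of the counts under a categorical model: sum_k N_k log theta_k."""
    return sum(n * math.log(t) for n, t in zip(counts, theta))

# Candidate estimate: the ratio of counts N_k / N.
theta_hat = [n / N for n in counts]

# Any other valid probability vector should do no better.
theta_other = [1 / 3, 1 / 3, 1 / 3]
assert log_likelihood(theta_hat, counts) >= log_likelihood(theta_other, counts)
print(theta_hat)
```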

(b) [2pts] Now we will take a Bayesian approach. For the prior, we'll use the Dirichlet distribution, which is defined over the set of probability vectors (i.e. vectors that are nonnegative and whose entries sum to 1). Its PDF is as follows:

p(θ) ∝ θ_1^{a_1 − 1} ⋯ θ_K^{a_K − 1}.

What is the probability distribution of the posterior p(θ | D)? Don't just give the density of the posterior; say which family of distributions it belongs to.

(c) [1pt] Still assuming the Dirichlet prior distribution, determine the MAP estimate of the parameter vector θ. For this question, you may assume each a_k > 1.

(d) [1pt] Now, suppose that your friend said that they had a hidden (N+1)st outcome, x^{(N+1)}, drawn from the same distribution as the previous N outcomes. Your friend does not want to reveal the value of x^{(N+1)} to you, so you want to use your Bayesian model to predict what you think x^{(N+1)} is likely to be. The "proper" Bayesian predictor is the so-called posterior predictive distribution:

p(x^{(N+1)} | D) = ∫ p(x^{(N+1)} | θ) p(θ | D) dθ.

What is the probability that the (N+1)st outcome was k, i.e., the probability that x_k^{(N+1)} = 1, under your posterior predictive distribution? Hint: A useful fact is that if θ ~ Dirichlet(a_1, . . . , a_K), then

E[θ_k] = a_k / ∑_{k′} a_{k′}.
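The Dirichlet mean identity in the hint can be sanity-checked numerically. The sketch below samples Dirichlet vectors via the standard gamma construction (normalize independent Gamma(a_k, 1) draws) using only the standard library; the function and variable names are ours:

```python
import random

random.seed(0)

def sample_dirichlet(alphas):
    """Sample one Dirichlet(alphas) vector by normalizing independent Gamma(a_k, 1) draws."""
    gammas = [random.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

alphas = [2.0, 3.0, 5.0]
n_samples = 20_000

# Monte Carlo estimate of E[theta_k].
sums = [0.0] * len(alphas)
for _ in range(n_samples):
    theta = sample_dirichlet(alphas)
    for k, t in enumerate(theta):
        sums[k] += t
mc_mean = [s / n_samples for s in sums]

# The identity from the hint: E[theta_k] = a_k / sum_{k'} a_{k'}.
exact_mean = [a / sum(alphas) for a in alphas]
print(mc_mean, exact_mean)
```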

2. [5pts] Gaussian Naïve Bayes. In this question, you will derive the maximum likelihood estimates for Gaussian naïve Bayes, which is just like the naïve Bayes model from lecture, except that the features are continuous, and the conditional distribution of each feature given the class is (univariate) Gaussian rather than Bernoulli. Start with the following generative model for a random discrete class label t ∈ {1, 2, ..., K} and a random real-valued vector of D features x ∈ R^D:

p(t = k) = α_k    (0.1)

p(x | t = k) = (∏_{d=1}^D 2πσ_d²)^{−1/2} exp(−∑_{d=1}^D (x_d − μ_{kd})² / (2σ_d²))    (0.2)

where α_k ≥ 0 is the prior on class k, σ_d² > 0 are the variances for each feature, which are shared between all classes, and μ_{kd} ∈ R is the mean of feature d conditioned on class k. We write α to represent the vector with elements α_k, and similarly σ is the vector of variances. The matrix of class means is written μ, where the kth row of μ is the mean for class k.

(a) [1pt] Use Bayes' rule to derive an expression for p(t = k | x). Hint: Use the law of total probability to derive an expression for p(x).
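Once you have the expression, it is usually evaluated in log space for numerical stability. The sketch below shows the generic pattern (a softmax over per-class log-joints, with the evidence p(x) computed by log-sum-exp) for a toy one-feature, two-class model; all names and numbers are illustrative, not part of the assignment:

```python
import math

def posterior(x, alphas, means, var):
    """p(t = k | x) for a 1-D Gaussian naive Bayes model with shared variance,
    computed as a softmax over log alpha_k + log N(x; mu_k, var)."""
    log_joint = [
        math.log(a) - 0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
        for a, m in zip(alphas, means)
    ]
    m_max = max(log_joint)
    # log p(x) = log sum_k p(t = k) p(x | t = k), via the law of total probability.
    log_evidence = m_max + math.log(sum(math.exp(l - m_max) for l in log_joint))
    return [math.exp(l - log_evidence) for l in log_joint]

# Toy model: two equally likely classes with means -1 and +1, shared variance 1.
probs = posterior(0.0, alphas=[0.5, 0.5], means=[-1.0, 1.0], var=1.0)
print(probs)  # at the symmetric point x = 0, both posteriors are 0.5
```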


(b) [1pt] Write down an expression for the log-likelihood function

ℓ(θ) = log p(t^{(1)}, x^{(1)}, t^{(2)}, x^{(2)}, ⋯, t^{(N)}, x^{(N)})    (0.3)

of a particular dataset D = {(t^{(1)}, x^{(1)}), (t^{(2)}, x^{(2)}), ⋯, (t^{(N)}, x^{(N)})} with parameters θ = {α, μ, σ} under this model. (Assume the data are i.i.d.)

(c) [3pts] Take partial derivatives of the log-likelihood with respect to each of the parameters μ_{kd} and with respect to the shared variances σ_d². Report these partial derivatives and find the maximum likelihood estimates for μ_{kd} and σ_d². You may assume that each class appears at least once in the dataset, i.e., the number of times N_k that class k appears satisfies N_k > 0.

3. [6pts] Gaussian Discriminant Analysis. For this question you will build classifiers to label images of handwritten digits. Each image is 8 by 8 pixels and is represented as a vector of dimension 64 by listing all the pixel values in raster scan order. The images are grayscale and the pixel values are between 0 and 1. The labels y are 0, 1, 2, . . . , 9, corresponding to which character was written in the image. There are 700 training cases and 400 test cases for each digit; they can be found in the .txt files provided.

A skeleton (q3.py) is provided that you should use to structure your code. Starter code to help you load the data is provided (data.py). Note: the get_digits_by_label function in data.py returns the subset of digits that belong to a given class.

Using maximum likelihood, fit a set of 10 class-conditional Gaussians with a separate, full covariance matrix for each class. Remember that the conditional multivariate Gaussian probability density is given by

p(x | t = k) = (2π)^{−D/2} |Σ_k|^{−1/2} exp(−(1/2)(x − μ_k)^T Σ_k^{−1} (x − μ_k))    (0.4)

where μ_k ∈ R^D and Σ_k ∈ R^{D×D} is positive-definite. You should take p(t = k) = 1/10. You will compute parameters μ_{kj} and Σ_k for k ∈ {0, ..., 9}, j ∈ {1, ..., 64}. You should implement the covariance computation yourself (i.e. without the aid of np.cov). Hint: To ensure numerical stability you may have to add a small multiple of the identity to each covariance matrix. For this assignment you should add 0.01I to each covariance matrix.
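As a reference for the covariance step, the sketch below computes a sample mean and the maximum likelihood covariance by hand (no np.cov) for a tiny toy data matrix, then adds 0.01·I as the hint suggests. It is a generic illustration on made-up data, not the required q3.py structure:

```python
def mean_and_ml_covariance(X, stabilizer=0.01):
    """Sample mean and maximum likelihood covariance of the rows of X (a list of
    equal-length lists), with stabilizer * I added to the covariance."""
    n, d = len(X), len(X[0])
    mu = [sum(row[j] for row in X) / n for j in range(d)]
    # ML covariance: (1/n) sum_i (x_i - mu)(x_i - mu)^T -- note 1/n, not 1/(n-1).
    cov = [[0.0] * d for _ in range(d)]
    for row in X:
        diff = [row[j] - mu[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += diff[a] * diff[b] / n
    for j in range(d):
        cov[j][j] += stabilizer  # add 0.01 * I for numerical stability
    return mu, cov

# Toy 2-D data standing in for one "class".
X = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
mu, cov = mean_and_ml_covariance(X)
print(mu)   # sample mean of the toy class
print(cov)  # stabilized ML covariance
```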

(a) [5pts] Complete the 5 unfinished functions in the starter code ([1pt] each). Include the function body for each completed function in your report.

(b) [1pt] Report the average conditional log-likelihood, i.e. (1/N) ∑_{i=1}^N log p(t^{(i)} | x^{(i)}), of the trained model on both the train and test set. Report the accuracy, on the train and test set, of the classifier that selects the most likely posterior class for each data point using the trained model.
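Given a matrix of class posteriors and the true labels, both reported quantities reduce to a few lines. The sketch below shows the generic computation on made-up posteriors; the function name and toy numbers are ours:

```python
import math

def evaluate(posteriors, labels):
    """Average conditional log-likelihood (1/N) sum_i log p(t_i | x_i), and
    accuracy of predicting the most likely posterior class for each point."""
    n = len(labels)
    avg_ll = sum(math.log(p[t]) for p, t in zip(posteriors, labels)) / n
    preds = [max(range(len(p)), key=p.__getitem__) for p in posteriors]
    acc = sum(pred == t for pred, t in zip(preds, labels)) / n
    return avg_ll, acc

# Toy posteriors over 3 classes for 4 data points, with true labels.
posteriors = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4], [0.6, 0.3, 0.1]]
labels = [0, 1, 2, 1]
avg_ll, acc = evaluate(posteriors, labels)
print(avg_ll, acc)  # accuracy is 3/4: the last point is misclassified
```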