
Date: 2022-11-30 02:47

MTHM502 Introduction to Data Science and Statistical Modelling

Assignment

Please make sure that the submitted work is your own. This is NOT a group assignment, so approaches and solutions should not be discussed with other students. Plagiarism and collusion with other students are examples of academic misconduct and will be reported. More information on academic honesty can be found here.

1. The colour of the human eye is determined by a pair of genes. If both of these genes code the colour blue, then the given person will have blue eyes. If at least one of the genes codes the colour brown, then the person will have brown eyes. That is, if we denote by ‘A’ the gene coding the colour brown, and by ‘a’ the gene coding the colour blue, then we have the following:

Gene    Eye colour
AA      Brown
Aa      Brown
aA      Brown
aa      Blue

A child inherits one gene from each of their parents. That is, one gene is chosen randomly (with equal probability) from the gene-pair of their father, and one gene is chosen randomly (with equal probability) from the gene-pair of their mother. Below are two examples, where the entries of the tables show the possible gene-pairs of the children. Note that each of these gene-pairs has equal probability.

Example 1:

                     Father’s genes
                     A      a
Mother’s genes   A   AA     Aa
                 a   aA     aa

Example 2:

                     Father’s genes
                     A      A
Mother’s genes   A   AA     AA
                 a   aA     aA
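The inheritance rule described above can be sketched in R by enumerating the four equally likely gene-pairs of a child. This is only an illustration of the mechanism, not a solution to any part of the question; the helper names `child_pairs` and `eye_colour` are hypothetical and do not come from the assignment.

```r
# Hypothetical helper: enumerate the four equally likely gene-pairs of a child.
# Each parent is a length-2 character vector of genes, e.g. c("A", "a").
child_pairs <- function(mother, father) {
  # outer() combines every mother gene (rows) with every father gene (columns);
  # the mother's gene is written first, as in the example tables.
  tab <- outer(mother, father, paste0)
  dimnames(tab) <- list(mother = mother, father = father)
  tab
}

# Eye colour implied by a gene-pair: blue only when both genes are 'a'.
eye_colour <- function(pair) ifelse(pair == "aa", "Blue", "Brown")

# Example 1: both parents carry one brown gene and one blue gene.
ex1 <- child_pairs(mother = c("A", "a"), father = c("A", "a"))
ex1

# Proportion of the four equally likely gene-pairs giving each eye colour.
table(eye_colour(ex1)) / length(ex1)
```

Calling `child_pairs(mother = c("A", "a"), father = c("A", "A"))` reproduces the table of Example 2 in the same way.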

Assume that Aaron and both of his parents have brown eyes, but Aaron’s sister has blue eyes.

(a) [3 marks] What is the probability that Aaron has a blue eye gene?

(b) [6 marks] Assume that Aaron’s wife has blue eyes. What is the probability that their first child will have blue eyes?

(c) [10 marks] Suppose that the first child of Aaron and his wife ended up having brown eyes (and not blue). How does this information change the probability that Aaron has a blue eye gene? What is the probability that their second child will have brown eyes too?

2. Assume that a new Conservative Party leadership election has been triggered in the UK at a time when there are 361 Conservative MPs in the parliament. Two of these MPs, M and B, join the leadership contest, where the aim is to get the majority support of the remaining 359 Conservative MPs.

We further assume that on the day the leadership contest is announced 184 of these MPs support M, and the remaining 175 MPs support B in becoming the next party leader. The announcement is followed by an election campaign, during which MPs can decide to change their allegiance. In particular, we know that on any given day, there is a probability of 0.005 that an MP who has been supporting M will become a B supporter by the end of the day, while the probability that an MP who has been supporting B will become an M supporter by the end of the day is 0.004. Each MP makes their decision independently of each other, and independently of the decision they made the day before.
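The daily switching mechanism can be sketched in R for a single day. Because each MP decides independently with the same probability, the number of switchers in each camp on a given day is a binomial count. This is a minimal sketch under that reading; the helper name `step_day`, the variable names and the seed are illustrative assumptions, not part of the assignment.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible

# Starting allegiances and daily switching probabilities from the text
n_M <- 184
n_B <- 175
p_M_to_B <- 0.005
p_B_to_M <- 0.004

# Hypothetical helper: advance the supporter counts by one day.
step_day <- function(n_M, n_B, p_mb = p_M_to_B, p_bm = p_B_to_M) {
  m_leave <- rbinom(1, n_M, p_mb)  # M supporters who switch to B today
  b_leave <- rbinom(1, n_B, p_bm)  # B supporters who switch to M today
  c(M = n_M - m_leave + b_leave,
    B = n_B - b_leave + m_leave)
}

after_day1 <- step_day(n_M, n_B)
after_day1
```

Switching only moves MPs between the two camps, so the two counts always sum to 359.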

(a) [4 marks] Introduce the following random variables:

$$X_i^{(1)} = \begin{cases} 1, & \text{B supporter number } i \text{ still supports B at the end of day 1}, \\ 0, & \text{B supporter number } i \text{ changes to an M supporter at the end of day 1}, \end{cases}$$

for $i = 1, \ldots, 175$; and

$$X_i^{(2)} = \begin{cases} 1, & \text{M supporter number } i \text{ changes to a B supporter at the end of day 1}, \\ 0, & \text{M supporter number } i \text{ still supports M at the end of day 1}, \end{cases}$$

for $i = 1, \ldots, 184$.

Using these random variables, express the number of B supporters at the end of the first day, then use your formula to find the expected number of B supporters at the end of the first day. Justify every step of your argument.

(b) [3 marks] Define random variables $\tilde{X}_i^{(1)}$, $i = 1, \ldots, 175$ and $\tilde{X}_i^{(2)}$, $i = 1, \ldots, 184$ whose sum gives you the number of M supporters at the end of the first day. What is the expected number of M supporters at the end of the first day?

(c) [6 marks] R: The election campaign is set to last for 2 weeks. This means that each MP would vote according to the allegiance they have at the end of day 14, that is, the candidate they would vote for is the one they are supporting after the first 14 days of the campaign. Using simulation, find the probability that in this election B would hold the majority of the votes among the 359 MPs.

(d) [3 marks] R: Now suppose that the election had to be postponed, and with the new date, candidates now have a 60 day long campaign period (as opposed to 14 days). Adjust your code from part 2c to find the probability that B will win the delayed election. How does this probability compare to the one computed in part 2c?

3. Observations $Y_1, Y_2, \ldots, Y_n$ are assumed to be independent and identically distributed samples from a data model following a Rayleigh distribution, with probability density function:

$$f(y; \theta) = \frac{y}{\theta} e^{-y^2/(2\theta)}$$

for $\theta > 0$ and $0 < y < \infty$.

The mean of this distribution is

$$\mu = \sqrt{\frac{\pi\theta}{2}},$$

and the variance is

$$\sigma^2 = \frac{\theta(4 - \pi)}{2}.$$

(Note that here $\pi$ is not a parameter, it is the usual mathematical constant, i.e. 3.14...)
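As a sanity check on the stated density, mean and variance, the moments can be computed numerically in R. The sketch below assumes the density is read as f(y; θ) = (y/θ) exp(−y²/(2θ)); the function name `drayleigh` and the value θ = 2 are arbitrary illustrative choices, not taken from the assignment.

```r
theta <- 2  # arbitrary illustrative value of the parameter

# Rayleigh density under the parameterisation assumed in the lead-in
drayleigh <- function(y, theta) (y / theta) * exp(-y^2 / (2 * theta))

# Total probability, first moment and second moment by numerical integration
total <- integrate(drayleigh, 0, Inf, theta = theta)$value
m1 <- integrate(function(y) y * drayleigh(y, theta), 0, Inf)$value
m2 <- integrate(function(y) y^2 * drayleigh(y, theta), 0, Inf)$value

# Compare the numerical moments with the closed-form mean and variance
c(total        = total,
  mean         = m1,
  mean_formula = sqrt(pi * theta / 2),
  var          = m2 - m1^2,
  var_formula  = theta * (4 - pi) / 2)
```

The numerical and closed-form values should agree up to integration error, which is a quick way to confirm that a reconstructed formula has its signs and square roots in the right place.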

(a) [2 marks] Find the method of moments estimator $\hat{\theta}$ of $\theta$.

(b) [5 marks] Is your estimator $\hat{\theta}$ unbiased? If not, then suggest an adjustment to this estimator that would make it unbiased and report your final unbiased estimator. Hint: If $E(\hat{\theta}) = c\theta$, then the estimator $\frac{1}{c}\hat{\theta}$ is unbiased. Also, remember that we can express second moments using the formula of the variance.

(c) [4 marks] An alternative estimator is $\tilde{\theta} = \frac{1}{2n}\sum_{i=1}^{n} Y_i^2$. Is this estimator unbiased? If not, suggest an adjustment that makes it unbiased. See hints given in part 3b.

(d) [5 marks] Using the fact that the random variable $X = Y^2$ is exponentially distributed with rate $\frac{1}{2\theta}$, assess whether the estimator $\tilde{\theta}$ from part 3c is consistent.

(e) [6 marks] We have 150 samples from a Rayleigh distribution with sample mean 3.2. Using an appropriate point estimator of $\theta$, suggest a suitable estimate of the variance, and use this variance estimate to construct an approximate 95% confidence interval for the mean of the distribution. (You can use R to find the relevant quantiles).

4. Consider the data set $Y_1, Y_2, \ldots, Y_n$ that is assumed to have arisen from the data model with probability density function

$$f(y; \theta) = \begin{cases} k(1 - y)y^{\theta+1}, & 0 < y < 1, \\ 0, & \text{otherwise}, \end{cases}$$

where $\theta > 0$.

(a) [4 marks] Find the constant k that makes the above function a probability density function.

(b) [6 marks] Show that the maximum likelihood estimator, $\hat{\theta}$ of $\theta$ is given by the solution to the equation:

$$\hat{\theta}^2 \sum_{i=1}^{n} \log(Y_i) + \hat{\theta}\left[5\sum_{i=1}^{n} \log(Y_i) + 2n\right] + 6\sum_{i=1}^{n} \log(Y_i) + 5n = 0.$$

(c) [5 marks] R: Let $y_1, \ldots, y_{30}$ below correspond to 30 samples of this distribution:

0.573 0.770 0.652 0.827 0.821 0.789
0.898 0.718 0.382 0.668 0.647 0.477
0.661 0.380 0.870 0.794 0.783 0.732
0.629 0.777 0.600 0.724 0.553 0.693
0.687 0.935 0.494 0.411 0.530 0.478

To produce a maximum likelihood estimate for $\theta$ based on these data, use the polyroot function of R.

Hint: polyroot finds the roots of a polynomial. Its argument is the vector of polynomial coefficients in increasing order. For example, to find the roots of the polynomial $p(x) = x^2 + 2x - 3$ we can use

rt <- polyroot(c(-3,2,1))

Even though both roots that you will get are real, polyroot gives these roots in complex form (don’t worry about what this means). You can use the Re() function to extract the real part of complex numbers. That is, if the outcome of the polyroot function is stored in the variable rt, then we can use the following to get the desired roots.

rt_real <- Re(rt)
rt_real
## [1] 1 -3

Note that this code lists all the roots of a polynomial. You will have to check which one of these corresponds to a local maximum of the likelihood.
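One way to make such a check is the second-derivative test, sketched below on a toy problem rather than on the likelihood of this question. It assumes a hypothetical function g whose derivative is the hint's polynomial, so that the roots are the stationary points of g and the sign of the second derivative tells a local maximum from a local minimum; the names `g_second` and `check` are illustrative.

```r
# Toy setting (assumption): g'(x) = x^2 + 2x - 3, the polynomial from the hint,
# so the stationary points of g are the roots of that polynomial.
rt_real <- Re(polyroot(c(-3, 2, 1)))

# Second derivative of the toy g, i.e. the derivative of x^2 + 2x - 3
g_second <- function(x) 2 * x + 2

# A stationary point is a local maximum when the second derivative is negative
check <- data.frame(root = rt_real,
                    second_derivative = g_second(rt_real),
                    local_max = g_second(rt_real) < 0)
check
```

For an actual likelihood, the same idea applies with the second derivative of the log-likelihood, and any root outside the allowed parameter range would be discarded first.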

(d) [3 marks] R: Produce a plot of the fitted probability density function using the estimate of $\theta$ obtained from 4c.

5. The file ‘ozone.csv’, available on the course ELE page, contains information on ozone levels recorded over 111 days from May to September 1973 in New York. The variables measured were:

ozone         Ozone levels, in parts per billion (ppb)
radiation     in langleys
temperature   in degrees Fahrenheit
wind          in miles per hour (mph)

Read these data into R and answer the following questions.

(a) [6 marks] Carry out exploratory data analysis, and produce a matrix scatterplot of the dataset. Comment on your findings and what these plots suggest about the likely relationships between the response variable (ozone) and the other variables.

(b) [9 marks] Fit a multiple regression of ozone as the response variable, against radiation, temperature and wind as the explanatory variables (use all three when fitting the model). Comment on the summary of the model. What do these coefficients suggest about the relationship between ozone and the other variables? Are these findings consistent with your earlier descriptive plots? Also include suitable residual plots, commenting as appropriate.

(c) [10 marks] A colleague suggests you implement the following model,

$$\log(\text{ozone}_i) = \beta_0 + \beta_1 \log(\text{radiation}_i) + \beta_2 \log(\text{temperature}_i) + \beta_3 \log(\text{wind}_i) + \epsilon_i, \quad \text{where } \epsilon_i \sim N(0, \sigma^2).$$

Fit this new model to the data to obtain estimates for the regression coefficients. Produce a plot of the residuals against the fitted values, and a Q-Q plot of the residuals. Comment on the outputs from the modelling (comparing it to the previously fitted model), paying particular attention to the interpretation of the coefficients. Express the impact of the explanatory variables on the ozone levels, with the latter expressed on the original (untransformed) scale.
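How a coefficient of a log-log regression translates to the original scale can be illustrated on synthetic data. The sketch below does not use ‘ozone.csv’: the variables `x` and `y`, the true slope 1.5, the noise level and the seed are all made-up assumptions chosen only to show the interpretation.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Synthetic data (assumption): log(y) is linear in log(x) with slope 1.5
x <- runif(200, min = 1, max = 10)
y <- exp(0.5 + 1.5 * log(x) + rnorm(200, sd = 0.1))

# Fit the log-log model and extract the estimated slope
fit <- lm(log(y) ~ log(x))
b1 <- coef(fit)[["log(x)"]]

# On the original scale, multiplying x by a factor f multiplies the
# fitted y by f^b1; here f = 1.10 corresponds to a 10% increase in x.
c(slope = b1, multiplier_for_10pct_increase = 1.10^b1)
```

With several explanatory variables the same reading applies to each coefficient, holding the other variables fixed.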

Total for paper = 100 marks.

The submitted work should be your own work! The questions apart from Q2(c), Q2(d), Q4(c), Q4(d) and Q5 are theoretical exercises, and should be solved using results we covered in lectures. Make sure you justify each step of the theoretical reasoning by clearly stating the theorem/property you are using (marks will be awarded for these). Also make sure that you add comments to each section of your R code, explaining what you’re doing. All the relevant R output (computed probabilities, plots, etc) should be included in your submission! A pdf document with your R code, R output and the solutions to the theoretical exercises should be submitted through EBART by Noon (12pm), 2nd December. Note that late submissions will be penalised.

