School of Mathematics
MATH5714: Linear Regression, Robustness and Smoothing
Practical: 2019
There have been a number of “worked examples” using R in the module. All of these are on MINERVA, and
some of these may contain useful commands for this practical. If you have specific questions on R, please feel
free to email me (or come to my office) for assistance. Alternatively, you may want to ask me any questions
during the practical session which will take place on Monday 2nd December. If you do not have any specific
questions, you do not need to attend this session.
You should write up your practical using Word (or LaTeX), with all graphical and R output correctly incorporated.
The total length should not exceed 12 pages (but it could be shorter).
NOTES:
A. You must hand in your solutions to my pigeon-hole (NOT Minerva) by 2pm (GMT) on Thursday, 12th
December.
B. In accordance with policies of the School of Maths: For every period of 24 hours or part thereof that your
assessment is overdue, you will lose 5% of the total marks available for the assessment.
C. If you have special mitigating circumstances that lead you to ask for an extension, you should make your
request in the School of Maths Taught Student Office.
D. Within reason you may talk to your friends about this piece of work, but you should not send R code (or
output) to each other, and your report must be only your own work.
This practical is deliberately open-ended, with little guidance on how to proceed.
Q1 The DataBank of the World Bank¹ collects data (“indicators”) every year on each country of the world in
order to examine trends, relationships, effects of policies, development, etc. Two of the variables (area
and population) are given for 2010 in the data.frame which can be read in by the R command (watch out
for the ~ if you copy and paste):
dd=read.table("http://www1.maths.leeds.ac.uk/~charles/math3714/area-populaton.txt",header=TRUE)
(i) Using appropriate transformations of the data, find a linear model which can describe the relationship
between population (response) and area (explanatory).
(ii) Using appropriate diagnostics, confirm that your model is acceptable.
Guidance: In your answer, you only need to describe your final model, and ONE other model
which you have examined, but deemed less appropriate.
(iii) Using your model, obtain a 95% confidence interval for the mean (expected) population for a country
with an area of 250,000 km².
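As a reminder of the mechanics only, the following is a minimal sketch of fitting a transformed linear model and extracting a confidence interval for the mean response with predict(). It uses simulated data, not the practical's data set: the data frame toy, its values, and the choice of a log-log transformation are all assumptions made for illustration, and you should justify your own transformation.

```r
# Simulated stand-in data: names and values are illustrative assumptions only.
set.seed(1)
area <- exp(rnorm(100, mean = 11, sd = 2))            # hypothetical areas
population <- exp(2 + 0.8 * log(area) + rnorm(100))   # hypothetical populations
toy <- data.frame(area, population)

# Linear model on the log-log scale (an assumed choice of transformation):
# log(population) = b0 + b1 * log(area) + error
fit <- lm(log(population) ~ log(area), data = toy)

# Standard diagnostic plots (residuals vs fitted, normal Q-Q, ...);
# uncomment to view them.
# par(mfrow = c(2, 2)); plot(fit)

# 95% confidence interval for the mean of log(population) at area = 250000
new <- data.frame(area = 250000)
ci_log <- predict(fit, newdata = new, interval = "confidence", level = 0.95)

# Back-transform the fit and both end points to the original scale.
ci <- exp(ci_log)
print(ci)
```

Note that exponentiating the end points gives an interval on the original scale for the median (geometric mean) of the response rather than its arithmetic mean, so the scale on which the interval is reported should be stated.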
Q2 In this question we are going to consider many more variables in the database. Because there are so many
missing entries, a set of variables and countries were selected such that there were no missing values. The
file is in the same location as before, but now with file name: worlddata-indicators.txt. Note
that we now have only 149 countries.
We will take the response variable to be CO2 emissions per capita (CO2), which is column 15 of the
data frame after reading in to R.
¹ You may want to check the meaning of the variables on the World Bank website:
https://databank.worldbank.org/home.aspx
(i) With due consideration to:
– transformations,
– interactions,
– model selection,
– model checking,
– variable selection,
– etc.
obtain a model which is able to predict CO2 using the other variables.
(ii) Justify your choice by comparing at least two “competing” models. The comparison should take
note of at least (a) model selection criteria, (b) diagnostics, and (c) interpretability.
(iii) Interpret the parameters in your preferred model.
Guidance: Remember, there is probably no ONE correct answer. The important thing is that you justify
your approach.
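One possible way to set up a comparison of competing models is sketched below on simulated data. The variable names (gdp, urban, noise), the log transformations, and the use of backward elimination by AIC are all assumptions made for illustration; they are not taken from the indicators file and are not a recommended answer.

```r
# Simulated stand-in data with hypothetical variable names.
set.seed(2)
n <- 149
toy <- data.frame(gdp   = exp(rnorm(n, mean = 8, sd = 1)),
                  urban = runif(n, min = 10, max = 100),
                  noise = rnorm(n))
toy$co2 <- exp(-8 + 0.9 * log(toy$gdp) + 0.01 * toy$urban + rnorm(n, sd = 0.5))

# Two competing models on the log scale of the response.
full  <- lm(log(co2) ~ log(gdp) + urban + noise, data = toy)
small <- lm(log(co2) ~ log(gdp), data = toy)

# Backward elimination using AIC, starting from the full model.
chosen <- step(full, direction = "backward", trace = 0)

# Compare the candidates by AIC and adjusted R-squared.
print(AIC(small, chosen, full))
print(c(small  = summary(small)$adj.r.squared,
        chosen = summary(chosen)$adj.r.squared))

# Coefficients of the selected model, for interpretation.
print(coef(chosen))
```

In a model of this form the coefficient of log(gdp) can be read as an elasticity: a 1% increase in gdp is associated with approximately that percentage change in the response, with the other variables held fixed.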
Q3 In this last question we are going to fit nonparametric regression models to inflation data. The data frame
is inflation.txt (same place as previously) and consists of 3 columns. The first column is the
country code, column 2 is to be treated as the explanatory variable (Inflation, GDP deflator (annual %))
and column 3 the response (Inflation, consumer prices (annual %)).
You may find the code used in lectures to be useful for this question.
(i) Using the data (x_i, y_i) in the data frame, create a scatter plot of the data and add nonparametric
regression lines which show the fitted value m̂(x) for x in the range (−5, 50). Plot one graph which
shows the Nadaraya-Watson estimates for smoothing parameters h = 1, 2, 5, 10, and a separate
graph which shows the local linear estimates for the same four values of h. Comment on these
graphs.
(ii) For each of the 8 estimates computed in part (i), find the predicted value m̂(x) when x = −4.2.
Arrange these values in a suitable 2 × 4 table.
(iii) Using leave-one-out cross-validation, find the “optimal” choice of h in the range (0.7, 2.7) for the
Nadaraya-Watson (NW) estimate, and (1.7, 3.0) for the local linear (LL) estimate. In the same plot, draw the lines corresponding to
the cross-validation functions as a function of h.
(iv) Replot the data, and draw on the fitted nonparametric regression lines corresponding to the optimal
values of h. Comment on the fits.
Predict m̂(x) for x = −4.2 using the corresponding optimal values of h for the NW and LL
estimates respectively. Which of these predictions do you think will be better, and why?
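For reference, the two estimators and the leave-one-out criterion can be sketched as below. This is a minimal illustration on simulated data, not the code used in lectures: the function names nw, ll and loocv are hypothetical, and the Gaussian kernel is an assumption (the lectures may use a different kernel or a different scaling of h).

```r
# Nadaraya-Watson estimate at the points x0: kernel-weighted average of y.
nw <- function(x0, x, y, h) {
  sapply(x0, function(u) {
    w <- dnorm((x - u) / h)          # Gaussian kernel weights (assumed kernel)
    sum(w * y) / sum(w)
  })
}

# Local linear estimate at the points x0: weighted least squares line
# centred at u, so that the fitted intercept is the estimate at u.
ll <- function(x0, x, y, h) {
  sapply(x0, function(u) {
    w <- dnorm((x - u) / h)
    unname(coef(lm(y ~ I(x - u), weights = w))[1])
  })
}

# Leave-one-out cross-validation score: mean squared error when each
# observation is predicted from all of the others.
loocv <- function(est, x, y, h) {
  mean(sapply(seq_along(x), function(i) {
    (y[i] - est(x[i], x[-i], y[-i], h))^2
  }))
}

# Simulated stand-in data (illustrative only).
set.seed(3)
x <- runif(80, min = -5, max = 50)
y <- 2 + 0.9 * x + rnorm(80, sd = 3)

# Search each bandwidth range on a grid and pick the minimiser.
hs_nw <- seq(0.7, 2.7, by = 0.1)
hs_ll <- seq(1.7, 3.0, by = 0.1)
cv_nw <- sapply(hs_nw, function(h) loocv(nw, x, y, h))
cv_ll <- sapply(hs_ll, function(h) loocv(ll, x, y, h))
h_nw <- hs_nw[which.min(cv_nw)]
h_ll <- hs_ll[which.min(cv_ll)]

# Scatter plot with both fitted curves at their selected bandwidths.
grid <- seq(-5, 50, length.out = 200)
plot(x, y)
lines(grid, nw(grid, x, y, h_nw), col = "blue")
lines(grid, ll(grid, x, y, h_ll), col = "red")
```

Centring the regressor at u in ll is a convenience: it avoids a separate prediction step because the intercept of the weighted fit is already the estimate at u. A grid search is used here only for simplicity; a finer grid or a one-dimensional optimiser could be used instead.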