Term 1, 2019/2020
ACCT648 Applied Statistics for Data Analysis
Assignment 3
Deadline of Submission: Upload your answer file in word-format on 6 November
2019 before 5pm in e-Learn, and submit the hard copy during class on that day
1. The owner of a moving company typically has his most experienced manager predict the
total number of labor hours (Hours) that will be required to complete an upcoming move.
This approach has proved useful in the past, but the owner has the business objective
of developing a more accurate method of predicting labor hours. In a preliminary effort
to provide a more accurate method, the owner has decided to use the number of cubic
feet moved (Feet), the number of pieces of large furniture (Large) and whether there is
an elevator in the apartment building (Elevator) as the independent variables and has
collected data for moves in which the origin and destination were within the borough
of Manhattan in New York City and the travel time was an insignificant portion of the
hours worked. The data are organized and stored in Moving2019.csv.
(a) Find the multiple regression equation L1 with all the three main independent variables.
(b) Find the multiple regression equation L2 with all the three main independent variables
with the interaction effect of Feet and Elevator.
(c) Find the multiple regression equation L3 with all the three main independent variables
with the interaction effect of Large and Elevator.
(d) Find the multiple regression equation L4 with all the three main independent variables
with the interaction effect of Feet and Large.
(e) When comparing all four regression models: L1, L2, L3, L4, explain why model L3
is the best model.
(f) Perform a residual analysis on the model L3 and determine whether the regression
assumptions are valid.
(g) Construct a 95% prediction interval estimate for the labor hours for moving 420
cubic feet with 2 large furniture in an apartment building that does not have an
elevator in model L3
(h) Construct a 95% confidence interval estimate for the average labor hours for moving
400 cubic feet with 3 large furniture in an apartment building that has an elevator
in model L3
(i) True or False: For a fixed value of cubic feet and at least one large furniture
situations, the total number of labor hours to move in the building with elevator
is on average less than the number of labor hours to move in the building without
elevator under model L3. Justify your answer.
1
2. Based on data set given in Question (1),
(a) Fit the multiple regression equation to predict the total number of labor hours with
all independent variables by using the Forward Selection and BIC criterion on the
training set. Plot the graph to show the number of variables versus BIC in each
selection step.
(b) Fit the multiple regression equation to predict the total number of labor hours
with all independent variables by using the Best Subset Selection with adjusted R2
criterion on the training set. Plot the graph to show the number of variables versus
adjusted R2
in each selection step.
(c) Use the 5-fold cross-validation approach to fit the models of L1, L2, L3 and L4 and
determine which model is the best under the criterion of their associated crossvalidation
errors. (Note: use set.seed(1208))
(d) Use the Leave-One-Out cross-validation approach to fit the models of L1, L2, L3 and
L4 and determine which model is the best under the criterion of their associated
cross-validation errors. (Note: use set.seed(5623))
3. Suppose we collect data for a group of 130 students in a statistical class with two
independent variables X1 = average studying hours per week, X2 = GPA, and one
dependent variable Y = Pass (or Fail).
We fit a logistic regression model: log(odds ratio) = β0+β1X1+β2X2 to predict whether
a student will pass the course. R-outputs produce estimated coefficients, βˆ
0 = −9.5447,
βˆ
1 = 0.5709, and βˆ
2 = 1.0682. The observations of the first five students are given as
follows:
Student Y X1 X2
1 Pass 9.4 3.03
2 Pass 14.5 3.52
3 Pass 12.2 3.14
4 Fail 8.4 2.76
5 Fail 11.3 3.20
(a) Based on the estimated logistic regression model, predict the probability that a
student who studies 11 hours per week on average and has a GPA of 3.40 will pass
the course.
(b) At least how many hours would the student in part (a) need to study to have more
than 70% predicted chance of passing the course?
(c) Find the deviance residues of the first five observed students.
(d) By using the estimated logistic regression model with the threshold value being
0.55 for classification of passing the course, determine whether the model makes
any error to predict each of the above five observed students. If there is an error,
determine what type of error as well.
2
4. The stock prices of Singapore Telecommunications Limited (SingTel) with code (Z74.SI)
and Singapore Airlines Limited (SIA) with code (C6L.SI) from 27 August 2018 to 29
July 2019 are stored in SingTelSIA2019.csv. Suppose a portfolio investment has 8,000
shares of SingTel at price of $3.34 per share and 5,000 shares of SIA at price of $9.42
per share on 29 July 2019. Therefore, the portfolio investment has value of $73,820
(8, 000 × 3.34 + 5, 000 × 9.42) on 29 July 2019.
(a) Based on the historical approach without any assumption of distribution, calculate
the one-day 99% VaR for this portfolio on 29 July 2019.
(b) Without any assumption of distribution, estimate the one-day 99% VaR for this
portfolio on 29 July 2019 based on the Bootstrap approach with 100,000 repetitions.
(Note: use set.seed(5483))
(c) Obtain a 95% Bootstrap percentile confidence interval for the one-day 99% VaR for
this portfolio on 29 July 2019.
5. The director of undergraduate studies at a college of business wants to predict whether
students in a BBA program can graduate with a honor degree using independent variables,
High school grade point average (GPA), SAT score, gender, and local citizen.
Data from a random sample of 90 students, organized and stored in BBA2019.csv,
show that 46 successfully completed the program with honor degrees (coded as Yes) and
44 without honor degrees (coded as No) under the variable column Graduate.
(a) Develop a logistic regression model, L1, to predict the probability of successfully
completed the BBA program with honor degrees, based on all independent variables.
(b) Develop the other logistic regression model, L2, to predict the probability of successfully
completed the BBA program with honor degrees, based on the SAT, Gender,
and Local independent variables.
(c) Develop the other logistic regression model, L3, to predict the probability of successfully
completed the BBA program, based on the SAT and Local independent
variables.
(d) Develop the other logistic regression model, L4, to predict the probability of successfully
completed the BBA program, based on the SAT independent variables.
(e) Explain why model L4 is the best model among the four models considered. At the
0.05 level of significance, is there evidence that a logistic regression model L4 is a
good fitting model?
(f) Predict the probability of successfully completed the BBA program with honor
degree given that a male local citizen with GPA 3.45 and SAT score 1330 under
model L4.
(g) Find the confusion matrix of model L4 with the threshold value 0.6 for classifying
students successfully completed the BBA program with honor degrees.
(h) Find the sensitivity, specificity and total error rate of the model L4 with the threshold
value 0.6.
-END-
3
版权所有:留学生编程辅导网 2020 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。