1. The assignment MUST be submitted electronically to Turnitin through the QBUS6850 Canvas site. Please do NOT submit a zipped file.

2. The assignment is due at 17:00 on Monday, 3 September 2018. The late penalty for the assignment is 10% of the assigned mark per day, starting after 17:00 on the due date. The closing date, Monday, 10 September 2018 at 17:00, is the last date on which an assessment will be accepted for marking.

3. Your answers shall be provided as a word-processed report giving full explanation and interpretation of any results you obtain. Output without explanation will receive zero marks.

4. Be warned that plagiarism between individuals is always obvious to the markers of the assignment and can be easily detected by Turnitin.

5. The data sets for this assignment can be downloaded from Canvas.

6. Presentation of the assignment is part of the assignment. Markers may deduct up to 10% of the mark for poor clarity and presentation. It is recommended that you include your Python code as an appendix to your report; however, you may insert small sections of your code into the report where they aid interpretation. Think about the best and most structured way to present your work, summarise the procedures implemented, support your results/findings and prove the originality of your work.

7. Numbers with decimals should be reported to three decimal places.

8. The report should be NO more than 10 pages, including everything (text, figures, tables, small sections of inserted code, etc.) but excluding the appendix containing the Python code.

Tasks

Question 1 (50 Marks)

You will work on the UCI ML housing dataset.

A template Python program has been prepared for you. The program can help you get the dataset from the sklearn dataset repository. Please test and play with the template program to fully understand the dataset.

For further information, please visit
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.names.

(a) Suppose you are interested in using the house age AGE (proportion of owner-occupied units built prior to 1940) as the first feature $x_1$ and the full-value property-tax rate TAX as the second feature $x_2$ to predict MEDV (median value of owner-occupied homes in $1000's) as the target $t$. Write code to extract these two features and the target from the dataset.

Use the dataset (two chosen features and one target) to plot the loss function

$$\mathcal{L}(\boldsymbol{\beta}) = \frac{1}{2}\sum_{n=1}^{N}\left(f(\mathbf{x}_n,\boldsymbol{\beta}) - t_n\right)^2$$

with $f(\mathbf{x}_n,\boldsymbol{\beta}) = \beta_1 x_1 + \beta_2 x_2$.

That is, we are using a linear regression model without the intercept term $\beta_0$.

Hint: This is a 3D plot and you will need to iterate over a range of $\beta_1$ and $\beta_2$ values.
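
For illustration, a minimal sketch of how this could be coded, assuming the dataset is fetched with scikit-learn's load_boston loader (the standard loader at the time of this assignment) and plotted with matplotlib; the grid ranges for beta1 and beta2 are placeholders you would tune:

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection
from sklearn.datasets import load_boston

boston = load_boston()
age_idx = list(boston.feature_names).index('AGE')
tax_idx = list(boston.feature_names).index('TAX')
X = boston.data[:, [age_idx, tax_idx]]   # features x1 = AGE, x2 = TAX
t = boston.target                        # target MEDV

# Evaluate the no-intercept loss 0.5 * sum_n (beta1*x1 + beta2*x2 - t_n)^2
# over a grid of (beta1, beta2) values; the grid ranges are assumptions.
beta1_grid = np.linspace(-1, 1, 100)
beta2_grid = np.linspace(-1, 1, 100)
B1, B2 = np.meshgrid(beta1_grid, beta2_grid)
L = np.zeros_like(B1)
for i in range(B1.shape[0]):
    for j in range(B1.shape[1]):
        pred = B1[i, j] * X[:, 0] + B2[i, j] * X[:, 1]
        L[i, j] = 0.5 * np.sum((pred - t) ** 2)

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.plot_surface(B1, B2, L, cmap='viridis')
ax.set_xlabel('beta1'); ax.set_ylabel('beta2'); ax.set_zlabel('loss')
plt.show()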

(b) Use the linear regression model LinearRegression in the scikit-learn package to fit two linear regression models to predict the target, with and without the intercept term. You may use 90% of the data as your training data and the remaining 10% as your testing data. Compare the performance of the two models and explain the importance of the intercept term.

Hint: The argument fit_intercept of LinearRegression controls whether an intercept term is included in the model, via fit_intercept = True or fit_intercept = False.
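
A minimal sketch of one way to do this, continuing from the extraction sketch in (a); the 90/10 split via train_test_split and the random_state are assumptions:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X_train, X_test, t_train, t_test = train_test_split(X, t, test_size=0.1, random_state=0)

for use_intercept in (True, False):
    model = LinearRegression(fit_intercept=use_intercept)
    model.fit(X_train, t_train)
    mse = mean_squared_error(t_test, model.predict(X_test))
    print(f"fit_intercept={use_intercept}: coef={model.coef_}, "
          f"intercept={model.intercept_}, test MSE={mse:.3f}")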

(c) Take 90% of the data as training data. Construct the centred training dataset by conducting the following steps in your Python code:

(i) Take the mean of all the training target values, then deduct this mean from each training target value MEDV. Take the resulting target values as the new training target values $t_{new}$;

(ii) In the training data, take the mean of all the first feature values AGE, then deduct this mean from each of the first feature values. Take the result as the new first feature values $x_1^{new}$;

(iii) In the training data, do the same for the second feature TAX. The result is $x_2^{new}$.

Now build linear regressions with and without the intercept to fit to the new training data. Report and compare the coefficients and the intercept. Compare the performance of the two models over the testing data. Note that, when you take your testing data into the model to calculate performance scores, you shall subtract the relevant training means from the testing features and targets.
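
A minimal sketch of the centring steps, continuing from the split in (b); variable names are illustrative only:

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Centre the training data using the training means only.
t_mean = t_train.mean()
x_means = X_train.mean(axis=0)
t_new = t_train - t_mean          # centred training targets
X_new = X_train - x_means         # centred training features (AGE, TAX)

# Apply the SAME training means to the test data before scoring.
t_test_c = t_test - t_mean
X_test_c = X_test - x_means

for use_intercept in (True, False):
    model = LinearRegression(fit_intercept=use_intercept)
    model.fit(X_new, t_new)
    mse = mean_squared_error(t_test_c, model.predict(X_test_c))
    print(f"fit_intercept={use_intercept}: coef={model.coef_}, "
          f"intercept={model.intercept_}, test MSE={mse:.3f}")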

(d) Consider the closed-form solution of the linear regression below, see slide 25 (the number may change) of Lecture 2,

$$\boldsymbol{\beta} = (X^T X)^{-1} X^T \mathbf{t}$$

where X is the design (data) matrix whose first column is all 1s, and the first component of $\boldsymbol{\beta}$ is the intercept. Suppose that the data are centred (refer to (c)). Now prove that, in the case of centred data, the intercept $\beta_0$ in the solution above is zero.

Hint: You may need the following fact:

$$\begin{pmatrix} A & 0 \\ 0 & B \end{pmatrix}^{-1} = \begin{pmatrix} A^{-1} & 0 \\ 0 & B^{-1} \end{pmatrix}$$

where both matrices A and B are invertible.
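
One possible way to use this hint (a sketch only, under the centring in (c), with details left for your report): write the design matrix as $X = [\mathbf{1}\ X_c]$, where the columns of $X_c$ are the centred features, so $\mathbf{1}^T X_c = \mathbf{0}^T$ and

$$X^T X = \begin{pmatrix} N & \mathbf{0}^T \\ \mathbf{0} & X_c^T X_c \end{pmatrix}, \qquad X^T \mathbf{t} = \begin{pmatrix} \mathbf{1}^T \mathbf{t} \\ X_c^T \mathbf{t} \end{pmatrix} = \begin{pmatrix} 0 \\ X_c^T \mathbf{t} \end{pmatrix},$$

since the centred target satisfies $\mathbf{1}^T \mathbf{t} = 0$. Applying the block-inverse fact above to $(X^T X)^{-1} X^T \mathbf{t}$ then gives the first component $\beta_0 = N^{-1}\cdot 0 = 0$.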

Question 2 (50 Marks)

Use Logistic Regression to predict the diagnosis of breast cancer patients on the Breast Cancer Wisconsin (Diagnostic) Dataset (wdbc.data). See the section About Datasets. This question aims to test your ability to program matrix operations for Logistic Regression.

(a) Write Python code to load the data into your program. For the target feature Diagnosis, change its literal M (malignant) to 0 and B (benign) to 1. Split the data into training and validation sets (80%, 20% split). Then define and train a logistic regression model by using scikit-learn's LogisticRegression model.
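
A minimal sketch of these steps, assuming the file wdbc.data sits in the working directory and is loaded with pandas; the column layout follows the About Datasets section, and the random_state and default LogisticRegression settings are assumptions:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# wdbc.data has no header row: column 0 = ID, column 1 = Diagnosis, columns 2-31 = features
df = pd.read_csv("wdbc.data", header=None)
y = df[1].map({'M': 0, 'B': 1}).values   # M (malignant) -> 0, B (benign) -> 1
X = df.iloc[:, 2:].values                # 30 real-valued features

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print("Validation accuracy: {:.3f}".format(clf.score(X_val, y_val)))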

(b) Using the logistic regression model function below and the estimated parameters from your model, calculate the probability of sample ID 8510426 (the 20th sample) having a benign diagnosis.

$$f(\mathbf{x}_n,\boldsymbol{\beta}) = \frac{1}{1 + e^{-\boldsymbol{\beta}^T \mathbf{x}_n}}$$
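
A minimal sketch, continuing from the sketch in (a); it assumes the rows keep the original file order, so index 19 corresponds to the 20th sample (ID 8510426):

import numpy as np

x20 = X[19]                                                  # 20th sample's 30 features
beta = np.concatenate(([clf.intercept_[0]], clf.coef_[0]))   # (beta_0, beta_1, ..., beta_30)
x20_aug = np.concatenate(([1.0], x20))                       # prepend the intercept feature x_0 = 1

p_benign = 1.0 / (1.0 + np.exp(-beta @ x20_aug))             # f(x, beta); benign is coded as 1
print("P(benign) = {:.3f}".format(p_benign))
# Cross-check with scikit-learn directly (class 1 = benign):
print(clf.predict_proba(x20.reshape(1, -1))[0, 1])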

(c) The objective of logistic regression is defined, on slide 17 (the number may change) of Lecture 3, as

$$\mathcal{L}(\boldsymbol{\beta}) = -\frac{1}{N}\sum_{n=1}^{N}\left[ t_n \log f(\mathbf{x}_n,\boldsymbol{\beta}) + (1 - t_n)\log\left(1 - f(\mathbf{x}_n,\boldsymbol{\beta})\right)\right]$$

where both the parameter $\boldsymbol{\beta} = (\beta_0, \beta_1, \dots, \beta_d)^T$ and the sample $\mathbf{x}_n = (x_{n0}, x_{n1}, \dots, x_{nd})^T$ are d+1 dimensional vectors, with the intercept feature $x_{n0} = 1$. For the Wisconsin Dataset d = 30. It is easy to prove that (you don't need to prove this)

$$\frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \boldsymbol{\beta}} = \frac{1}{N} X^T\left(F(X,\boldsymbol{\beta}) - \mathbf{t}\right)$$

where $F(X,\boldsymbol{\beta}) = \left(f(\mathbf{x}_1,\boldsymbol{\beta}), f(\mathbf{x}_2,\boldsymbol{\beta}), \dots, f(\mathbf{x}_N,\boldsymbol{\beta})\right)^T$ and $\mathbf{t} = (t_1, t_2, \dots, t_N)^T$.

Write your own Python code to use this derivative formula to implement the gradient descent algorithm for the logistic regression. You may write a Python function named, for example, myLogisticGD, which accepts a data matrix X, an initial parameter beta_0, a number of GD iterations T, and other arguments you see appropriate. Your function should return the learned parameter $\boldsymbol{\beta}$.

Hint: In Python, you can use the following way to get the vector $F = F(X,\boldsymbol{\beta})$. First define the sigmoid function by

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

then

F = sigmoid(np.dot(X, beta))

or similar.
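
A minimal sketch of such a function, building on the hint above; the function name myLogisticGD follows the question, while the learning-rate argument eta and its default value are assumptions and not part of the question:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def myLogisticGD(X, t, beta_0, T, eta=0.1):
    """Gradient descent for logistic regression.

    X      : (N, d+1) data matrix whose first column is all 1s
    t      : (N,) vector of 0/1 targets
    beta_0 : (d+1,) initial parameter vector
    T      : number of GD iterations
    eta    : learning rate (assumed argument)
    """
    beta = beta_0.astype(float).copy()
    N = X.shape[0]
    for _ in range(T):
        F = sigmoid(np.dot(X, beta))          # F(X, beta), shape (N,)
        grad = np.dot(X.T, F - t) / N         # (1/N) X^T (F - t)
        beta = beta - eta * grad              # gradient descent step
    return beta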

(d) Based on task (c) and the training data used in (a), write Python code to use different initial values $\boldsymbol{\beta} = (0, 0, \dots, 0)^T$, $\boldsymbol{\beta} = (1, 1, \dots, 1)^T$, and a random initial $\boldsymbol{\beta}$ to start the gradient descent algorithm to minimise the objective of logistic regression with respect to the parameter $\boldsymbol{\beta}$. Set the number of iterations T = 200. Use each resulting $\boldsymbol{\beta}$ to re-do task (b). Compare the results and explain the major reasons why you may get different answers with different initial values for $\boldsymbol{\beta}$.

Hint: As mentioned on slide 29 of Lecture 2, it is good practice to normalise your data before you send it to your algorithm.
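
A minimal sketch, continuing from the sketches in (a) and (c); using scikit-learn's StandardScaler for the normalisation and the random seed are assumptions:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Normalise features using the training data, then add the intercept column x_0 = 1.
scaler = StandardScaler().fit(X_train)
X_train_n = np.hstack([np.ones((X_train.shape[0], 1)), scaler.transform(X_train)])

d = X_train.shape[1]                      # d = 30 for the Wisconsin dataset
np.random.seed(0)
inits = {
    "zeros":  np.zeros(d + 1),
    "ones":   np.ones(d + 1),
    "random": np.random.randn(d + 1),
}

for name, beta_0 in inits.items():
    beta = myLogisticGD(X_train_n, y_train, beta_0, T=200)
    # Re-do task (b) with this beta: probability of the 20th sample being benign.
    x20_n = np.concatenate(([1.0], scaler.transform(X[19].reshape(1, -1))[0]))
    p_benign = 1 / (1 + np.exp(-beta @ x20_n))
    print("init={}: P(benign for sample 20) = {:.3f}".format(name, p_benign))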

About Datasets

Breast Cancer Wisconsin (Diagnostic): wdbc.data

Attribute information

1: ID number

2: Diagnosis (M = malignant, B = benign)

3-32: Ten real-valued features are computed for each cell nucleus:

- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness (perimeter^2 / area - 1.0)
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension ("coastline approximation" - 1)

The mean, standard error, and "worst" or largest (mean of the three largest values) of these features were computed for each image, resulting in 30 features. For instance, field 3 is Mean Radius, field 13 is Radius SE, field 23 is Worst Radius.

