Date: 2022-03-23 11:02

DATA3888 (2022): Assignment 1

Instructions

1. Your assignment submission needs to be an HTML document that you have compiled using R Markdown.

Name your file "SIDXXX_Assignment.Rmd", where XXX is your Student ID.

2. Under author, put your Student ID at the top of the Rmd file (NOT your name).

3. For your assignment, please use set.seed(3888). Do not upload the Rmd file (i.e. the code file).

4. You must use code folding so that we can inspect your code where required.

5. Your assignment should make sense and provide all the relevant information in the text when the code is hidden. Don’t rely on the marker to understand your code.

6. Any output that you include needs to be explained in the text of the document. If your code chunk generates unnecessary output, you can suppress it by specifying results='hide' in the chunk options.

7. Start each question in a separate section.

Question 1: Brain-box

A physics instructor, Louis, has created a data set, stored in “Spiker box Louis.zip”, that contains a series of sequences of varying lengths. The file name encodes the eye movements. For example, the file “LRL_L3.wav” corresponds to left-right-left eye movements, and the file “LLRLRLRL_L.wav” corresponds to left-left-right-left-right-left-right-left eye movements. There are a total of 31 files. Build a classification rule for detecting {L, R} under streaming conditions, where the function takes a signal sequence as input.

• (i) Estimate the accuracy of your classifier. Is your value reasonable?

• (ii) Does the length of the sequence affect the classification accuracy?

Hint: (a) Consider what metric you will use to define “performance”. You will need to explain your choice and justify your answer.
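
One way to read “streaming” is that the function sees the signal one window at a time and emits a label whenever a new event begins. The sketch below illustrates that idea on a synthetic signal, not on the Spiker box recordings. The function name classify_stream, the window length, the amplitude threshold, and the rule that a left movement starts with an upward deflection are all assumptions made for illustration; the real polarity and thresholds would need to be checked against the recordings.

```r
# Hypothetical streaming rule: scan the signal in consecutive windows and
# emit one label each time the signal leaves the quiet baseline.
classify_stream <- function(signal, window = 100, thresh = 0.5) {
  labels <- character(0)
  in_event <- FALSE
  # Trailing samples that do not fill a whole window are ignored.
  starts <- seq(1, length(signal) - window + 1, by = window)
  for (s in starts) {
    w <- signal[s:(s + window - 1)]
    over <- which(abs(w) > thresh)
    if (length(over) == 0) {
      in_event <- FALSE            # quiet window: any ongoing event has ended
    } else if (!in_event) {
      in_event <- TRUE             # first active window of a new event
      # ASSUMPTION: upward first deflection = "L", downward = "R"
      labels <- c(labels, if (w[over[1]] > 0) "L" else "R")
    }
  }
  labels
}

# Synthetic test signal (not real data): one "L"-shaped and one "R"-shaped bump
bump <- sin(seq(0, 2 * pi, length.out = 200))
signal <- c(rep(0, 150), bump, rep(0, 250), -bump, rep(0, 200))
predicted <- classify_stream(signal)
print(predicted)
```

Because the rule only needs the current window and one flag (in_event), it can be applied to data as it arrives. Real recordings are noisy, so the threshold would likely need to be tuned, for example from the baseline standard deviation.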

dir("data/Spiker_box_Louis/Short")

## [1] "LLL_L1.wav" "LLL_L2.wav" "LLL_L3.wav" "LLR_L1.wav" "LLR_L2.wav"

## [6] "LLR_L3.wav" "LRL_L1.wav" "LRL_L2.wav" "LRL_L3.wav" "LRR_L1.wav"

## [11] "LRR_L2.wav" "LRR_L3.wav" "RLL_L1.wav" "RLL_L2.wav" "RLL_L3.wav"

## [16] "RLR_L1.wav" "RLR_L2.wav" "RLR_L3.wav" "RRL_L1.wav" "RRL_L2.wav"

## [21] "RRL_L3.wav" "RRR_L1.wav" "RRR_L2.wav" "RRR_L3.wav"

dir("data/Spiker_box_Louis/Medium")

## [1] "LLRLRLRL_L.wav" "LLRRLLLR_L.wav" "LLRRRLLL_L.wav" "LRRRLLRL_L.wav"

## [5] "RRRLRLLR_L.wav"

dir("data/Spiker_box_Louis/Long")

## [1] "LLLRLLLRLRRLRRRLRLLL_L.wav" "RRLRRLRLRLLLLLLRRLRL_L.wav"
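
The file names listed above carry the ground truth, so the true label sequence can be recovered by keeping everything before the first underscore. The sketch below does that with base R and scores a predicted sequence against the truth using edit distance, which tolerates predictions that are longer or shorter than the true sequence. The helper names true_labels and sequence_accuracy, and the choice of an edit-distance metric, are illustrative assumptions; other metrics may be easier to justify for this task.

```r
# Extract the movement sequence encoded before the first underscore,
# e.g. "LRL_L3.wav" -> "LRL".
true_labels <- function(files) {
  sub("_.*$", "", basename(files))
}

# Hypothetical metric: 1 - (Levenshtein distance / longer sequence length).
# Insertions, deletions and substitutions each count as one error.
sequence_accuracy <- function(truth, predicted) {
  d <- as.numeric(utils::adist(truth, predicted))
  1 - d / max(nchar(truth), nchar(predicted))
}

files <- c("LRL_L3.wav", "LLRLRLRL_L.wav")
truth <- true_labels(files)
print(truth)

# Example with a made-up prediction that gets the middle movement wrong
print(sequence_accuracy("LRL", "LLL"))
```

Grouping this score by nchar(truth) is one simple way to examine whether sequence length affects accuracy.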

Question 2: Prevalidated model

(From the Week 3 lecture.) The kidney transplant data from “GSE46474” contains the gene expression profiles of 40 blood samples. Of those, 20 patients rejected their kidney and 20 had stable grafts and will be treated as controls. Using this gene expression data, let's build a classification model incorporating two types of data using the prevalidation principle. Here, we first build a molecular signature (a set of features) from the gene expression platform to obtain a single variable known as the prevalidated outcome. Next, we model this prevalidated outcome in combination with the other clinical variables to build a classifier of the outcome of interest.

• (a) Build a classifier using a support vector machine (SVM) to predict the outcome of graft survival and generate a prevalidated outcome from the gene expression data.

• (b) Use it together with the clinical variables in a logistic regression to build a risk model. Describe your final model for classifying graft survival in different individuals and your estimate of its accuracy.

• (c) What is the final prediction based on your final model for a 70-year-old male whose transcriptomics profile is predicted to have a favourable survival outcome?
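
The prevalidation steps in (a) to (c) can be sketched as follows. The key point is that each sample's prevalidated value comes from an SVM that never saw that sample, so the value can be used as an ordinary covariate in the logistic regression. This is a minimal sketch on simulated data, not on GSE46474: the expression matrix, the clinical variables age and sex, the number of folds, and the linear kernel are all assumptions, and it relies on the e1071 package for svm().

```r
library(e1071)  # provides svm()

set.seed(3888)

# Simulated stand-in data (NOT the GSE46474 data): 40 samples x 20 genes
n <- 40
p <- 20
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("gene", seq_len(p))))
y <- factor(rep(c("Rejection", "Stable"), each = n / 2))
X[y == "Rejection", 1:3] <- X[y == "Rejection", 1:3] + 1  # weak signal

# Hypothetical clinical variables
age <- round(runif(n, 20, 75))
sex <- factor(sample(c("F", "M"), n, replace = TRUE))

# (a) Prevalidation: predict each fold with an SVM trained on the other folds
K <- 5
fold <- sample(rep(seq_len(K), length.out = n))
preval <- factor(rep(NA, n), levels = levels(y))
for (k in seq_len(K)) {
  held_out <- fold == k
  fit <- svm(x = X[!held_out, ], y = y[!held_out], kernel = "linear")
  preval[held_out] <- predict(fit, X[held_out, , drop = FALSE])
}

# (b) Logistic regression on the prevalidated outcome plus clinical variables.
# With a factor response, glm() models the probability of the second level,
# which is "Stable" here.
risk_data <- data.frame(outcome = y, preval = preval, age = age, sex = sex)
risk_model <- glm(outcome ~ preval + age + sex, family = binomial,
                  data = risk_data)

# (c) Predicted probability of a stable graft for a 70-year-old male whose
# expression profile is predicted as favourable ("Stable")
new_patient <- data.frame(
  preval = factor("Stable", levels = levels(y)),
  age = 70,
  sex = factor("M", levels = levels(sex))
)
prob <- predict(risk_model, newdata = new_patient, type = "response")
print(prob)
```

The accuracy of the final risk model would still need its own estimate, for example by cross-validating the logistic regression step as well.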

Question 3: Blood vs Biopsy Biomarker

In the data GSE46474, we estimated the accuracy of our predictive model for graft rejection from a peripheral blood gene expression dataset. However, rejection is a very active process that occurs in the kidney itself. Here we will look at a similar kidney microarray dataset. Instead of genes being isolated and sequenced from blood, we examine another dataset, GSE138043, where the samples have been sequenced from a kidney biopsy. Select the top 50 most variable genes in each of the datasets GSE138043 and GSE46474 and use the selected genes to build a classifier using randomForest to predict the outcome of graft survival. Visualize your results. We have broken this task into the following four subsections.

• (a) Select the top 50 most variable genes in each of the datasets GSE138043 and GSE46474. Combine the two sets of genes. How many genes are in the union of these two lists?

• (b) Build two classifiers using randomForest to predict the outcome of graft survival using the genes selected in part (a).

• (c) Perform repeated 5-fold cross-validation for each dataset and calculate the accuracy. What is the average accuracy for the blood vs biopsy biomarker model?

• (d) Select an appropriate graphic to communicate the difference between these two classification accuracies.
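
The four parts above can be sketched end to end as follows. This is an illustrative outline on simulated matrices, not on GSE138043 or GSE46474: the data, the helper names simulate_expr, top_variable and repeated_cv_rf, and the choice of 10 repeats are assumptions. It also assumes that both datasets use the same gene identifiers, which would need to be checked (and possibly mapped) for the real platforms. It relies on the randomForest package.

```r
library(randomForest)

set.seed(3888)

# Simulated stand-in data (genes in rows, samples in columns)
simulate_expr <- function(n_genes, n_samples) {
  m <- matrix(rnorm(n_genes * n_samples), nrow = n_genes,
              dimnames = list(paste0("gene", seq_len(n_genes)), NULL))
  m * runif(n_genes, 0.5, 2)  # scale each row so genes differ in variance
}
blood  <- simulate_expr(200, 40)
biopsy <- simulate_expr(200, 40)
y_blood  <- factor(rep(c("Rejection", "Stable"), each = 20))
y_biopsy <- factor(rep(c("Rejection", "Stable"), each = 20))

# (a) Top 50 most variable genes per dataset, then their union
top_variable <- function(expr, n = 50) {
  v <- apply(expr, 1, var)
  names(sort(v, decreasing = TRUE))[seq_len(n)]
}
genes_union <- union(top_variable(blood), top_variable(biopsy))
print(length(genes_union))

# (b) + (c) Random forest evaluated by repeated K-fold cross-validation;
# returns one accuracy value per repeat
repeated_cv_rf <- function(expr, y, genes, K = 5, repeats = 10) {
  X <- t(expr[genes, , drop = FALSE])  # samples in rows for randomForest()
  sapply(seq_len(repeats), function(r) {
    fold <- sample(rep(seq_len(K), length.out = length(y)))
    pred <- factor(rep(NA, length(y)), levels = levels(y))
    for (k in seq_len(K)) {
      held_out <- fold == k
      rf <- randomForest(x = X[!held_out, , drop = FALSE], y = y[!held_out])
      pred[held_out] <- predict(rf, X[held_out, , drop = FALSE])
    }
    mean(pred == y)
  })
}
acc <- list(
  Blood  = repeated_cv_rf(blood, y_blood, genes_union),
  Biopsy = repeated_cv_rf(biopsy, y_biopsy, genes_union)
)
print(sapply(acc, mean))  # average accuracy per model

# (d) Side-by-side boxplots of the per-repeat accuracies
boxplot(acc, ylab = "Accuracy (repeated 5-fold CV)")
```

The simulated labels carry no signal, so the accuracies here are only a placeholder for the real comparison. A boxplot shows the spread across repeats as well as the average, which a single bar per model would hide.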

Question 4: Visualisation on world map

Sully and colleagues have curated a public dataset containing characteristics linked to coral bleaching over the last two decades. The data is in the file “Reef_Check_with_cortad_variables_with_annual_rate_of_SST_change.csv”, and the authors curated coral bleaching events at 3351 locations in 81 countries from 1998 to 2017. The column “Average bleaching” records the percentage of coral reefs worldwide that were bleached during the sampling periods, while the column “ClimSST” quantifies the sea-surface temperature (SST) at various locations.

• (a) Use ggplot to visualize the percentage of bleaching in coral reefs on a world map and identify which areas of the world have the most severe coral bleaching.

• (b) The team of scientists believes that “coral bleaching is less common in localities with a high variance of sea-surface temperature anomaly (SSTA) over time.” Use one or two appropriate graphics together to demonstrate this point. Please explain your choice. Hint: Determine which column of the data measures “sea-surface temperature anomaly (SSTA)”.
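
A possible starting point for both plots is sketched below. It uses simulated placeholder rows rather than the Reef Check file, and the column names lon, lat, bleaching and ssta_sd are hypothetical; the real column names must be read from the CSV. The sketch assumes the ggplot2 package, and map_data("world") additionally requires the maps package to be installed.

```r
library(ggplot2)

set.seed(3888)

# Simulated placeholder rows (NOT the Reef Check data); column names are
# hypothetical and must be replaced by the real ones from the CSV.
reef <- data.frame(
  lon = runif(200, -180, 180),
  lat = runif(200, -30, 30),
  bleaching = pmin(rexp(200, rate = 1 / 10), 100),  # percentage, capped at 100
  ssta_sd = runif(200, 0.2, 1.5)                    # stand-in for SSTA variability
)

world <- map_data("world")  # country outlines; needs the 'maps' package

# (a) Survey locations on a world map, coloured by bleaching percentage
p_map <- ggplot() +
  geom_polygon(data = world, aes(x = long, y = lat, group = group),
               fill = "grey85", colour = "grey60") +
  geom_point(data = reef, aes(x = lon, y = lat, colour = bleaching),
             alpha = 0.7, size = 1.5) +
  scale_colour_viridis_c(name = "Average bleaching (%)") +
  coord_quickmap() +
  labs(x = "Longitude", y = "Latitude") +
  theme_minimal()

# (b) Bleaching against SSTA variability, with a fitted linear trend
p_ssta <- ggplot(reef, aes(x = ssta_sd, y = bleaching)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm", formula = y ~ x) +
  labs(x = "SSTA variability (hypothetical column)",
       y = "Average bleaching (%)") +
  theme_minimal()

print(p_map)
print(p_ssta)
```

With several thousand real survey points, overplotting is likely, so a low alpha or aggregating surveys per site before plotting may make the map easier to read.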
