
Homework Set 6

April 28, 2020

NOTICE: The homework is due on May 8 (Friday) at 11:59 pm. Please provide the R code (R Markdown is highly recommended) and the steps you used to obtain your solutions. You are allowed, and even encouraged, to discuss the homework with your classmates. However, you must write up the solutions on your own. Plagiarism and other anti-scholarly behavior will be dealt with severely.

Problem 1. Here we explore the maximal margin classifier on a toy data set.

(a) We are given n = 7 observations in p = 2 dimensions. For each observation, there is an associated class label.

(b) Sketch the optimal separating hyperplane, and provide the equation for this hyperplane.

(c) Describe the classification rule for the maximal margin classifier. It should be something along the lines of "Classify to Red if β0 + β1X1 + β2X2 > 0, and classify to Blue otherwise." Provide the values for β0, β1, and β2.

(d) On your sketch, indicate the margin for the maximal margin hyperplane.

(e) Indicate the support vectors for the maximal margin classifier.

(f) Argue that a slight movement of the seventh observation would not affect the maximal margin hyperplane.

(g) Sketch a hyperplane that is not the optimal separating hyperplane, and provide the equation for this hyperplane.

(h) Draw an additional observation on the plot so that the two classes are no longer separable by a hyperplane.
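The classification rule in part (c) can be sketched as follows. The seven observations are not reproduced in this handout, so the coefficient values below are hypothetical placeholders for illustration only, not the answer to the exercise (the homework itself expects R; this Python sketch just shows the form of the rule).

```python
# Sketch of the maximal-margin classification rule from part (c).
# beta0, beta1, beta2 below are hypothetical placeholders, since the
# seven observations are not reproduced in this handout.

def classify(x1, x2, beta0, beta1, beta2):
    """Classify to Red if beta0 + beta1*x1 + beta2*x2 > 0, else Blue."""
    return "Red" if beta0 + beta1 * x1 + beta2 * x2 > 0 else "Blue"

# Example with a hypothetical hyperplane 0.5 + x1 - x2 = 0:
beta0, beta1, beta2 = 0.5, 1.0, -1.0
print(classify(3, 2, beta0, beta1, beta2))  # point on the positive side
print(classify(1, 4, beta0, beta1, beta2))  # point on the negative side
```

Points exactly on the hyperplane (where the expression equals 0) cannot be classified by this rule, which is why the exercise asks for a strict inequality on one side.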

Problem 2. K-means.


In this problem, you will perform K-means clustering manually, with K = 2, on a small example with n = 6 observations and p = 2 features. The observations are as follows.

Obs.  X1  X2
 1     1   4
 2     1   3
 3     0   4
 4     5   1
 5     6   2
 6     4   0

(a) Plot the observations.

(b) Randomly assign a cluster label to each observation. You can use the sample() command in R to do this. Report the cluster labels for each observation.

(c) Compute the centroid for each cluster.

(d) Assign each observation to the centroid to which it is closest, in terms of Euclidean distance. Report the cluster labels for each observation.

(e) Repeat (c) and (d) until the answers obtained stop changing.

(f) In your plot from (a), color the observations according to the cluster labels obtained.
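The iteration in parts (b) through (e) can be sketched as below. The homework asks for R; this Python version just illustrates the mechanics. The initial labels stand in for a random assignment via sample() and are an arbitrary choice, so your own random start may take a different path to convergence.

```python
# Manual K-means with K = 2 on the six observations from this exercise.
# The initial labels below stand in for R's sample() and are arbitrary.

points = [(1, 4), (1, 3), (0, 4), (5, 1), (6, 2), (4, 0)]
labels = [0, 1, 0, 1, 0, 1]  # part (b): an arbitrary initial assignment

def centroid(cluster_points):
    n = len(cluster_points)
    return (sum(p[0] for p in cluster_points) / n,
            sum(p[1] for p in cluster_points) / n)

def sq_dist(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

while True:
    # part (c): compute the centroid of each cluster
    cents = [centroid([p for p, l in zip(points, labels) if l == k])
             for k in (0, 1)]
    # part (d): reassign each observation to its nearest centroid
    new_labels = [min((0, 1), key=lambda k: sq_dist(p, cents[k]))
                  for p in points]
    if new_labels == labels:  # part (e): stop when nothing changes
        break
    labels = new_labels

print(labels)  # observations 1-3 end up in one cluster, 4-6 in the other
```

With this starting assignment the algorithm converges after one reassignment pass, grouping the three points near (0-1, 3-4) and the three near (4-6, 0-2).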


Problem 3. Hierarchical clustering.


1. This problem involves the K-means clustering algorithm.

(a) Prove (10.12).

(b) On the basis of this identity, argue that the K-means clustering algorithm (Algorithm 10.1) decreases the objective (10.11) at each iteration.
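Identity (10.12) is not reproduced in this handout; assuming it is the textbook identity (1/|C|) Σ over all pairs i, i' in C of ||x_i − x_i'||² = 2 Σ over i in C of ||x_i − x̄||², a quick numerical check on a small hypothetical cluster looks like this (not a proof, just a sanity check before attempting part (a)):

```python
# Numerical check of the identity assumed to be (10.12):
#   (1/|C|) * sum over all ordered pairs i, i' in C of ||x_i - x_i'||^2
#     = 2 * sum over i in C of ||x_i - xbar||^2
# The three points below are arbitrary illustrative data.

cluster = [(1.0, 4.0), (1.0, 3.0), (0.0, 4.0)]
n = len(cluster)

# Left-hand side: average (over |C|) of all ordered pairwise squared distances
pairwise = sum((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
               for a in cluster for b in cluster) / n

# Right-hand side: twice the sum of squared deviations from the centroid
xbar = (sum(p[0] for p in cluster) / n, sum(p[1] for p in cluster) / n)
deviations = 2 * sum((p[0] - xbar[0]) ** 2 + (p[1] - xbar[1]) ** 2
                     for p in cluster)

print(pairwise, deviations)  # the two sides agree
```

Part (b) then follows because each K-means step (recomputing centroids, then reassigning points to the nearest centroid) can only decrease the right-hand side, and hence the objective.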

2. Suppose that we have four observations, for which we compute a dissimilarity matrix (the matrix itself is not reproduced in this handout). For instance, the dissimilarity between the first and second observations is 0.3, and the dissimilarity between the second and fourth observations is 0.8.

(a) On the basis of this dissimilarity matrix, sketch the dendrogram that results from hierarchically clustering these four observations using complete linkage. Be sure to indicate on the plot the height at which each fusion occurs, as well as the observations corresponding to each leaf in the dendrogram.

(b) Repeat (a), this time using single linkage clustering.

(c) Suppose that we cut the dendrogram obtained in (a) such that two clusters result. Which observations are in each cluster?

(d) Suppose that we cut the dendrogram obtained in (b) such that two clusters result. Which observations are in each cluster?

(e) It is mentioned in the chapter that at each fusion in the dendrogram, the position of the two clusters being fused can be swapped without changing the meaning of the dendrogram. Draw a dendrogram that is equivalent to the dendrogram in (a), for which two or more of the leaves are repositioned, but for which the meaning of the dendrogram is the same.


Problem 4. Hierarchical clustering.

Suppose that for a particular data set, we perform hierarchical clustering using single linkage and using complete linkage. We obtain two dendrograms.

(a) At a certain point on the single linkage dendrogram, the clusters {1, 2, 3} and {4, 5} fuse. On the complete linkage dendrogram, the clusters {1, 2, 3} and {4, 5} also fuse at a certain point. Which fusion will occur higher on the tree, or will they fuse at the same height, or is there not enough information to tell?
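A useful fact for part (a): the height at which two given clusters fuse is the minimum inter-cluster dissimilarity under single linkage and the maximum under complete linkage. The sketch below illustrates this with hypothetical dissimilarity values (the exercise itself gives none):

```python
# Fusion heights for two fixed clusters under single vs. complete linkage.
# The dissimilarity values below are hypothetical illustration data.

d = {(1, 4): 0.9, (1, 5): 0.8, (2, 4): 0.6,
     (2, 5): 0.7, (3, 4): 0.4, (3, 5): 0.5}

def fusion_height(cluster_a, cluster_b, dist, linkage):
    """Height at which cluster_a and cluster_b fuse under the given linkage."""
    between = [dist[(i, j)] for i in cluster_a for j in cluster_b]
    # single linkage: closest pair; complete linkage: farthest pair
    return min(between) if linkage == "single" else max(between)

single = fusion_height({1, 2, 3}, {4, 5}, d, "single")
complete = fusion_height({1, 2, 3}, {4, 5}, d, "complete")
print(single, complete)  # prints 0.4 0.9
```

Since the complete-linkage height is a maximum over the same set of pairwise dissimilarities whose minimum gives the single-linkage height, the complete-linkage fusion can never occur lower, though whether it occurs strictly higher depends on the data.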
