
Date: 2020-08-16 10:57

COMP9313: Big Data Management

Sample Exam Questions

Question 1 HDFS

• Explain the difference between NameNode and DataNode.

• Given a file of 500MB, let block size be 150MB, and replication factor=3. How much space do we need to store this file in HDFS? Why?
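
For the 500MB file in Question 1, a quick way to check the arithmetic is sketched below. It relies on the fact that an HDFS block smaller than the configured block size occupies only its actual data size on disk, so the final partial block is not padded. The function name `hdfs_storage_mb` is hypothetical and used only for illustration.

```python
import math

def hdfs_storage_mb(file_mb, block_mb, replication):
    """Return (number of blocks, total MB stored across all replicas)."""
    num_blocks = math.ceil(file_mb / block_mb)
    # Every block is full except possibly the last, which holds the remainder.
    block_sizes = [min(block_mb, file_mb - i * block_mb) for i in range(num_blocks)]
    return num_blocks, sum(block_sizes) * replication

blocks, total_mb = hdfs_storage_mb(500, 150, 3)
print(blocks, total_mb)
```

Under this reasoning the file splits into four blocks (150 + 150 + 150 + 50 MB), and each block is stored three times.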

Question 2 Spark

• Given a large text file, your task is to find the top-k most frequent co-occurring term pairs. The co-occurrence of (w, u) is defined as: u and w appear in the same line (this also means that (w, u) and (u, w) are treated equally). Your Spark program should generate a list of k key-value pairs ranked in descending order of frequency, where the keys are the term pairs and the values are the co-occurrence frequencies (Hint: you need to define a function that takes an array of terms as input and generates all possible pairs).

textFile = sc.textFile(inputFile)

words = textFile.map(lambda x: x.lower().split())

# fill in your code here, and store the result in a pair RDD avgLen

avgLen.collect()
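
The hint in Question 2 asks for a function that turns an array of terms into all possible pairs. The following is a minimal local sketch of that function and of the counting logic, written in plain Python so it runs without a cluster. It assumes that a repeated term counts once per line and that a term is never paired with itself; the question does not state either point. The names `term_pairs` and `top_k_pairs` are hypothetical. On an RDD, the same `term_pairs` function could be passed to `flatMap`, followed by a `map` to `(pair, 1)`, `reduceByKey`, and `takeOrdered` with a key on the negated count.

```python
from collections import Counter
from itertools import combinations

def term_pairs(terms):
    """Return every unordered pair of distinct terms from one line."""
    # Assumption: duplicates within a line count once, no (w, w) pairs.
    unique_terms = sorted(set(terms))
    # Sorting first means (w, u) and (u, w) always map to the same key.
    return list(combinations(unique_terms, 2))

def top_k_pairs(lines, k):
    """Local stand-in for the Spark pipeline: count pairs, keep the top k."""
    counts = Counter()
    for line in lines:
        counts.update(term_pairs(line.lower().split()))
    # Descending frequency; ties broken by the pair itself for determinism.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

sample_lines = ["a b c", "b a", "c a d"]
result = top_k_pairs(sample_lines, 2)
print(result)
```

Sorting the terms before pairing is what makes the two orderings of a pair collapse into one key, which is the property the question requires.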

Question 3 Finding Similar Items

Suppose we wish to find similar sets, and we apply locality-sensitive hashing with k=5 and l=2.

If two sets have Jaccard similarity 0.6, what is the probability that they will be identified by the locality-sensitive hashing as candidates (i.e. they hash at least once to the same superhash)? You may assume that there are no coincidences, where two unequal values hash to the same hash value.
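
A short sketch of the calculation for Question 3 follows. It assumes the usual reading of the parameters: each superhash combines k MinHash values (all k must agree), and there are l independent superhashes (at least one must agree). Under that assumption, two sets with Jaccard similarity s become candidates with probability 1 - (1 - s^k)^l. The function name is hypothetical.

```python
def lsh_candidate_probability(s, k, l):
    """Probability that at least one of l superhashes (k MinHashes each) matches."""
    p_superhash_match = s ** k           # all k MinHash values agree
    p_no_match = (1 - p_superhash_match) ** l  # every superhash disagrees
    return 1 - p_no_match

p = lsh_candidate_probability(0.6, 5, 2)
print(round(p, 4))
```

With s = 0.6, k = 5 and l = 2 this evaluates to roughly 0.1495.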

Question 4 Mining Data Streams

Suppose we are maintaining a count of 1s using the DGIM method. We represent a bucket by (i, t), where i is the number of 1s in the bucket and t is the bucket timestamp (time of the most recent 1).

Consider that the current time is 200, window size is 60, and the current list of buckets is: (16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (1, 197) (1, 200). At the next ten clocks, 201 through 210, the stream has 0101010101. What will the sequence of buckets be at the end of these ten inputs?
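
The bucket updates in Question 4 can be traced with a small simulation. This is a sketch under two assumptions: a bucket is dropped once its timestamp is at or before (current time - window size), and whenever three buckets of the same size exist, the two oldest are merged into one bucket of double the size that keeps the more recent timestamp. The function name `dgim_update` is hypothetical.

```python
def dgim_update(buckets, bit, now, window):
    """Apply one stream bit. buckets is a list of (size, timestamp), oldest first."""
    # Drop buckets whose most recent 1 has left the window (assumed convention).
    buckets = [(size, ts) for size, ts in buckets if ts > now - window]
    if bit != 1:
        return buckets
    buckets.append((1, now))
    size = 1
    while True:
        same = [i for i, (s, _) in enumerate(buckets) if s == size]
        if len(same) < 3:
            break
        first, second = same[0], same[1]
        # Merge the two oldest buckets of this size; keep the newer timestamp.
        buckets[first:second + 1] = [(2 * size, buckets[second][1])]
        size *= 2
    return buckets

buckets = [(16, 148), (8, 162), (8, 177), (4, 183), (2, 192), (1, 197), (1, 200)]
for offset, ch in enumerate("0101010101", start=1):
    buckets = dgim_update(buckets, int(ch), 200 + offset, 60)
print(buckets)
```

In this trace the 1s arrive at times 202, 204, 206, 208 and 210, merges cascade at 202, 206 and 210, and the bucket (16, 148) leaves the window at time 208.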

Question 5 Recommender Systems

Consider three users u1, u2, and u3, and four movies m1, m2, m3, and m4. The users rated the movies using a 4-point scale: -1: bad, 1: fair, 2: good, and 3: great. A rating of 0 means that the user did not rate the movie. The three users' ratings for the four movies are: u1 = (3, 0, 0, -1), u2 = (2, -1, 0, 3), u3 = (3, 0, 3, 1)

• Which user has more similar taste to u1 based on cosine similarity, u2 or u3? Show the detailed calculation process.

• User u1 has not yet watched movies m2 and m3. Which movie(s) are you going to recommend to user u1, based on the user-based collaborative filtering approach? Justify your answer.
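
The calculations in Question 5 can be checked with the sketch below. It makes two assumptions the question leaves open: unrated movies enter the cosine computation as 0, and the predicted rating for u1 is a similarity-weighted average over the other users who actually rated the movie. The helper names `cosine` and `predict_for_u1` are hypothetical.

```python
import math

ratings = {
    "u1": [3, 0, 0, -1],
    "u2": [2, -1, 0, 3],
    "u3": [3, 0, 3, 1],
}

def cosine(a, b):
    """Cosine similarity of two rating vectors (unrated entries count as 0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Similarity of u1 to each other user.
sims = {u: cosine(ratings["u1"], ratings[u]) for u in ("u2", "u3")}

def predict_for_u1(movie_index):
    """Similarity-weighted average over users who rated the movie (assumed scheme)."""
    num = den = 0.0
    for user, sim in sims.items():
        rating = ratings[user][movie_index]
        if rating != 0:  # 0 means "not rated", so skip it
            num += sim * rating
            den += abs(sim)
    return num / den if den else None

pred_m2 = predict_for_u1(1)
pred_m3 = predict_for_u1(2)
print(sims, pred_m2, pred_m3)
```

By hand, u1·u2 = 3 with norms √10 and √14, giving about 0.254, while u1·u3 = 8 with norms √10 and √19, giving about 0.580, so u3 is the closer user under this measure. Only u2 rated m2 (as -1) and only u3 rated m3 (as 3), so under the assumed scheme the prediction favors m3.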

