Date: 2021-12-22 09:21

Big Data Basics INFS 5095

Supp. Assessment: Working with HDFS and MapReduce

Submission

Part 1: Submit a Word document containing all your answers and screenshots. Use an 11- or 12-point font with single line spacing. Images should be clearly visible and readable at normal viewing size.

Part 2: Question-and-answer session on Zoom on Dec 22nd, 2021 or Dec 23rd, 2021.

Assessment Aims

This assessment aims to demonstrate your understanding of how big data are stored in HDFS and processed using MapReduce. This knowledge will assist you as a graduate, as distributed file systems that run on community software are often used to process large amounts of data. A popular tool for processing this distributed information is MapReduce. The results of this type of analysis will also need to be effectively communicated to clients.

The assessment addresses the following course objectives:

[CO3]. Understand and apply standard processes and industry standard tools to acquire, store and prepare big data sets.

[CO4]. Select an appropriate big data analytics tool and apply it to a problem involving big data.

[CO5]. Communicate appropriately with professional colleagues through visualisation and report.

Assessment Description

In this assessment, you will:

• Store document files in the Hadoop Distributed File System (HDFS).
• Write a MapReduce program in Python to solve a typical text-analysis problem.
• Write a report for a professional data analyst (maximum 500 words) and submit a document file containing all answers and screenshots via the submission point.
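
As a rough illustration of the first task, the sketch below builds the two standard HDFS shell commands that create a directory and copy a local file into it. The helper names and the target directory `/user/student/input` are hypothetical, chosen only for this example; the brief does not prescribe any path or upload method.

```python
import subprocess


def hdfs_upload_commands(local_path, hdfs_dir):
    """Build the two HDFS shell commands: create the directory, then copy the file in."""
    return [
        ["hdfs", "dfs", "-mkdir", "-p", hdfs_dir],
        ["hdfs", "dfs", "-put", local_path, hdfs_dir],
    ]


def upload(local_path, hdfs_dir):
    """Run the commands in order; check=True raises if either one fails."""
    for command in hdfs_upload_commands(local_path, hdfs_dir):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical target directory; print the commands rather than run them.
    for command in hdfs_upload_commands("tweets.txt", "/user/student/input"):
        print(" ".join(command))
```

Calling `upload(...)` only works on a machine where the `hdfs` command is installed and a cluster is reachable.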

Assessment Details

Text analytics is the process of deriving high-quality information from documents. Text analysis parses the contents of a document and creates structured data out of its free-text contents. A typical application in text analytics is to count the words in a set of documents and identify the trending words, i.e. the most talked-about or most referenced words.

You will write a MapReduce program that reads a document and computes the top K most frequent words in it, where K can be any integer value. The output of the program will be a text file with one word and its count per line, separated by a tab.
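
The top-K computation described above can be sketched as a small local simulation of the map, shuffle, and reduce phases. This is a minimal sketch, not a prescribed solution: the function names are illustrative, tokens are split on whitespace only, and ties are broken alphabetically (an assumption, since the brief does not say how ties should be handled). Under Hadoop Streaming, the mapper and reducer would normally be separate scripts reading from standard input.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reducer(pairs):
    """Reduce phase: sum the counts for each word.

    Sorting first imitates the shuffle step, which groups pairs by key.
    """
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def top_k(counts, k):
    """Return the k most frequent (word, count) pairs, ties broken alphabetically."""
    return sorted(counts, key=lambda kv: (-kv[1], kv[0]))[:k]


def format_output(pairs):
    """One word and its count per line, separated by a tab."""
    return "".join(f"{word}\t{count}\n" for word, count in pairs)


if __name__ == "__main__":
    sample = ["big data big ideas", "data beats opinions", "big wins"]
    print(format_output(top_k(reducer(mapper(sample)), k=2)), end="")
```

Sorting the full list of counts is simple and adequate for a 5000-message input; for much larger vocabularies, `heapq.nlargest` would avoid the full sort.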

Your program should be able to handle any document file. It should also perform preprocessing, such as removing extra spaces and symbols like punctuation marks (e.g. . ; , ! ?), and it should ignore articles (a, an, the) and non-significant words such as prepositions and conjunctions.
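
One possible shape for this preprocessing step is sketched below. The stop-word list is an assumption made for illustration (the brief names only the three articles and does not supply a list), and the regular expression keeps only ASCII letters, digits, and apostrophes, which may be too strict for some documents.

```python
import re

# Illustrative stop-word list (an assumption): the three articles plus a few
# common prepositions and conjunctions. A real run would likely need more.
STOP_WORDS = frozenset({
    "a", "an", "the",
    "in", "on", "at", "of", "to", "for", "with", "by", "from",
    "and", "or", "but", "so", "nor", "yet",
})

# Anything that is not a lowercase letter, digit, apostrophe, or whitespace
# is treated as a symbol and replaced by a space.
SYMBOLS = re.compile(r"[^a-z0-9'\s]")


def preprocess(line):
    """Lowercase a line, strip symbols, and drop stop words."""
    cleaned = SYMBOLS.sub(" ", line.lower())
    tokens = (token.strip("'") for token in cleaned.split())
    return [token for token in tokens if token and token not in STOP_WORDS]


if __name__ == "__main__":
    print(preprocess("The cat, and a dog; in the park!"))
```

Replacing symbols with a space rather than deleting them keeps words such as "cat,dog" from being fused into one token.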


Use the modified version of the tweets dataset as the input data for this practical. The dataset contains 5000 social media messages. The text document (tweets.txt) is provided in the Practical2_Resources folder on the course website.

Marking Scheme

MapReduce Program – 5 Marks

Output File – 2 Marks

Report File – 3 Marks

