ST446 Programming Help, SQL Design Programming Help - Database Assignment Help



Date: 2023-03-29 08:24

ST446 Assessment 2

Dataset

We use an English-language Wikipedia dump dataset in this assessment, as in Assignment 1. You must use the dump file available for download from here. This is a bzip2-compressed XML file.

Cluster configuration

Each problem requires a different cluster configuration (see below). You can submit separate notebooks for each question.

Remember to adjust the project name, bucket and other parameters to match your own setup.

For P1, use the following configuration:

gcloud dataproc clusters create st446-cluster --properties=^#^spark:spark.jars.packages=graphframes:graphfram

For P2, use the following configuration:

BUCKET="st446-w9-bucket"

gcloud beta dataproc clusters create st446-cluster --project st446-1t2023 --bucket ${BUCKET} --region europe-

P1 Graph data processing

In this exercise, the task is to perform graph data processing using the PySpark GraphFrames API. You should use graph queries, not dataframe/SQL queries.

P1.1 Creating a Vertex dataframe

You need to create a Vertex dataframe by first creating three Vertex dataframes and then creating the final Vertex dataframe as the union of the three Vertex dataframes (union by column name). The specification for the three Vertex dataframes is as follows.

• Collaborator Vertex dataframe vco
  ◦ Vertex ID is the md5 hash of the concatenation of the username and contributor id strings
  ◦ Attribute column name type, String type, column values = "contributor"
  ◦ Attribute column name contributorID, String type, column values are contributor id values
  ◦ Attribute column name name, String type, column values are username values

• Page Vertex dataframe vpa
  ◦ Vertex ID is the md5 hash of the concatenation of page id and page title
  ◦ Attribute column name type, String type, column values = "page"
  ◦ Attribute column name pageID, String type, column values are page id values
  ◦ Attribute column name title, String type, column values are page titles

• Category Vertex dataframe vca
  ◦ Vertex ID is the md5 hash of the category name
  ◦ Attribute column name type, String type, column values = "category"
  ◦ Attribute column name category, String type, column values are category names

The final Vertex dataframe, v, must be the union of Vertex dataframes vco, vpa and vca (union by column name).
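Because the three Vertex dataframes have different attribute columns, a union by column name has to fill the columns a row does not have. The plain-Python sketch below only illustrates that behaviour; it is not the required PySpark solution. In PySpark the corresponding call would presumably be `DataFrame.unionByName` with `allowMissingColumns=True` (Spark 3.1 or later). The helper `union_by_name`, the sample rows and the `id` column name (the GraphFrames convention for vertex IDs) are assumptions made for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def union_by_name(*tables: List[Row]) -> List[Row]:
    """Union lists of rows by column name, filling missing columns with None."""
    # Collect every column name, preserving first-seen order.
    columns: List[str] = []
    for table in tables:
        for row in table:
            for name in row:
                if name not in columns:
                    columns.append(name)
    # Re-emit each row with the full column set.
    return [
        {name: row.get(name) for name in columns}
        for table in tables
        for row in table
    ]


# Hypothetical sample rows; the real values come from the Wikipedia dump.
vco = [{"id": "c1", "type": "contributor", "contributorID": "42", "name": "Alice"}]
vpa = [{"id": "p1", "type": "page", "pageID": "12", "title": "Anarchism"}]
vca = [{"id": "k1", "type": "category", "category": "Political theories"}]

v = union_by_name(vco, vpa, vca)
for row in v:
    print(row)
```

Every row of `v` ends up with the same seven columns, with `None` standing in for attributes that do not apply to that vertex type.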

Show the schema and top 5 rows for each of the four Vertex dataframes.

Note: all the md5 hash values must be encoded in hexadecimal format.
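As a rough illustration of the vertex ID rule, the sketch below computes hexadecimal md5 digests with Python's standard `hashlib`. It assumes the strings are concatenated with no separator and encoded as UTF-8, which the specification does not state explicitly. The helper `md5_hex` and the sample values are hypothetical. In PySpark, `pyspark.sql.functions.md5` applied to a `concat` of the columns should likewise yield a 32-character hex string.

```python
import hashlib


def md5_hex(*parts: str) -> str:
    """Hex-encoded md5 of the parts concatenated with no separator (UTF-8 assumed)."""
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


# Hypothetical sample values, in the concatenation order given in P1.1.
contributor_vertex_id = md5_hex("Alice", "42")      # username + contributor id
page_vertex_id = md5_hex("12", "Anarchism")         # page id + page title
category_vertex_id = md5_hex("Political theories")  # category name only

for vertex_id in (contributor_vertex_id, page_vertex_id, category_vertex_id):
    # Each digest is a 32-character lowercase hexadecimal string.
    print(vertex_id, len(vertex_id))
```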

P1.2 Creating an Edge dataframe

You need to create an Edge dataframe by first creating two Edge dataframes and then taking their union as the final Edge dataframe. The first (contributor-page) contains edges with a contributor as the source vertex and a page as the destination vertex. The second (page-category) contains edges with a page as the source vertex and a category as the destination vertex. The specification for the two Edge dataframes is as follows:

Contributor-page Edge dataframe ep
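The sketch below shows, in plain Python, how the two kinds of edges described above could reference the vertex IDs from P1.1. It relies only on the source/destination description and on the GraphFrames convention that edge endpoints live in columns named `src` and `dst`. The helper functions, the sample values and the name `ec` for the page-category edges are hypothetical, and no edge attribute columns are shown because they are not given here.

```python
import hashlib
from typing import Dict, List


def md5_hex(text: str) -> str:
    """Hex-encoded md5 of a string (UTF-8 assumed)."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def contributor_page_edge(username: str, contributor_id: str,
                          page_id: str, title: str) -> Dict[str, str]:
    """Edge from a contributor vertex (src) to a page vertex (dst)."""
    return {"src": md5_hex(username + contributor_id),
            "dst": md5_hex(page_id + title)}


def page_category_edge(page_id: str, title: str, category: str) -> Dict[str, str]:
    """Edge from a page vertex (src) to a category vertex (dst)."""
    return {"src": md5_hex(page_id + title),
            "dst": md5_hex(category)}


# Hypothetical sample values; the real ones come from the Wikipedia dump.
ep: List[Dict[str, str]] = [contributor_page_edge("Alice", "42", "12", "Anarchism")]
ec: List[Dict[str, str]] = [page_category_edge("12", "Anarchism", "Political theories")]

# Both edge lists share the same columns, so the union is a simple concatenation.
e = ep + ec
for edge in e:
    print(edge)
```

Since both edges involve the same page, the `dst` of the contributor-page edge equals the `src` of the page-category edge, which is what lets a graph query walk from a contributor through a page to a category.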
