
Date: 2023-03-29 08:24

ST446 Assessment 2


Dataset


We use an English-language Wikipedia dump dataset in this assessment, similarly to Assignment 1. You must use the dump file available for download from here. This is a bzip2 compressed XML file.
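
As a quick local sanity check before moving to the cluster, a bzip2 compressed XML file can be streamed with the Python standard library alone. The sketch below is only an illustration, not the required Spark workflow: the helper name `iter_page_titles`, the sample XML and its namespace URI are all made up for the example, and the real dump should be checked for its actual element names.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_page_titles(stream):
    """Yield the text of every <title> element while streaming the XML."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        # Dump files usually declare an XML namespace, so compare the
        # local tag name only (the part after the closing brace).
        if elem.tag.rsplit("}", 1)[-1] == "title":
            yield elem.text
        # Free the element once it has been processed to keep memory flat.
        elem.clear()


# Tiny in-memory stand-in for the dump (hypothetical content and namespace).
sample_xml = (
    b"<mediawiki xmlns='http://example.org/ns'>"
    b"<page><title>A</title></page>"
    b"<page><title>B</title></page>"
    b"</mediawiki>"
)
compressed = bz2.compress(sample_xml)

# bz2.open accepts a file name or an existing binary file object.
with bz2.open(io.BytesIO(compressed), "rb") as fh:
    titles = list(iter_page_titles(fh))
```

For the real file, the `io.BytesIO(...)` argument would be replaced by the path of the downloaded dump.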


Cluster configuration


Each problem requires a different cluster configuration (see below). You can submit separate notebooks for each question.


Remember to adjust to your project name, bucket and other parameters.


For P1, use the following configuration:


gcloud dataproc clusters create st446-cluster --properties=^#^spark:spark.jars.packages=graphframes:graphfram


For P2, use the following configuration:


BUCKET="st446-w9-bucket"


gcloud beta dataproc clusters create st446-cluster --project st446-1t2023 --bucket ${BUCKET} --region europe-


P1 Graph data processing


In this exercise, the task is to perform graph data processing using the PySpark GraphFrames API. You should use graph queries, not dataframe/SQL queries.
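
To illustrate what a graph query looks like in GraphFrames, the sketch below uses motif finding (`GraphFrame.find`) to match contributor-to-page edges. It is a minimal example under assumptions, not a model answer: the function name `contributor_page_pairs` is hypothetical, the vertex DataFrame is assumed to have an `id` column and the edge DataFrame `src` and `dst` columns (as GraphFrames requires), and the `type`, `name` and `title` attributes are assumed to follow the vertex specification in P1.1.

```python
# Motif pattern: a vertex c connected by one directed edge e to a vertex p.
CONTRIBUTOR_TO_PAGE = "(c)-[e]->(p)"


def contributor_page_pairs(vertices, edges):
    """Return (contributor name, page title) pairs found by a motif query."""
    # Imported lazily: graphframes is only available on a cluster that was
    # created with the graphframes package.
    from graphframes import GraphFrame

    g = GraphFrame(vertices, edges)
    return (
        g.find(CONTRIBUTOR_TO_PAGE)
        # Keep only matches whose endpoints have the expected vertex types.
        .filter("c.type = 'contributor' AND p.type = 'page'")
        .select("c.name", "p.title")
    )
```

The pattern match itself is expressed against the graph structure, whereas an equivalent dataframe/SQL solution would join the edge and vertex tables by hand.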




P1.1 Creating a Vertex dataframe


You need to create a Vertex dataframe by first creating three Vertex dataframes and then creating the final Vertex dataframe as the union of the three Vertex dataframes (union by column name). The specification for the three Vertex dataframes is as follows.


  • Collaborator Vertex dataframe vco
      • Vertex ID is the md5 hash of the concatenation of the username and contributor id strings
      • Attribute column name type, String type, column values = "contributor"
      • Attribute column name contributorID, String type, column values are contributor id values
      • Attribute column name name, String type, column values are username values

  • Page Vertex dataframe vpa
      • Vertex ID is the md5 hash of the concatenation of page id and page title
      • Attribute column name type, String type, column values = "page"
      • Attribute column name pageID, String type, column values are page id values
      • Attribute column name title, String type, column values are page titles

  • Category Vertex dataframe vca
      • Vertex ID is the md5 hash of the category name
      • Attribute column name type, String type, column values = "category"
      • Attribute column name category, String type, column values are category names


The final Vertex dataframe, v, must be the union of Vertex dataframes vco, vpa and vca (union by column name).


Show the schema and top 5 rows for each of the four Vertex dataframes.


Note: all the md5 hash values must be encoded in hexadecimal format.
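
The vertex specification above can be sketched as follows. This is a hedged outline rather than a reference solution: the input DataFrames `contributors`, `pages` and `categories` and their column names (`username`, `contributor_id`, `page_id`, `page_title`, `category`) are assumptions made for the example, and `allowMissingColumns=True` needs Spark 3.1 or later. The pure-Python helper `md5_hex` is included only as a way to cross-check hex-encoded ids by hand.

```python
import hashlib


def md5_hex(*parts):
    """Hex-encoded md5 of the concatenated string parts (pure-Python check)."""
    return hashlib.md5("".join(parts).encode("utf-8")).hexdigest()


def build_vertices(contributors, pages, categories):
    """Build vco, vpa, vca and their union v from hypothetical input DataFrames."""
    # Imported lazily so the helper above can be used without Spark installed.
    from pyspark.sql import functions as F

    # F.md5 returns a 32-character hexadecimal string.
    vco = contributors.select(
        F.md5(F.concat(F.col("username"),
                       F.col("contributor_id").cast("string"))).alias("id"),
        F.lit("contributor").alias("type"),
        F.col("contributor_id").cast("string").alias("contributorID"),
        F.col("username").alias("name"),
    )
    vpa = pages.select(
        F.md5(F.concat(F.col("page_id").cast("string"),
                       F.col("page_title"))).alias("id"),
        F.lit("page").alias("type"),
        F.col("page_id").cast("string").alias("pageID"),
        F.col("page_title").alias("title"),
    )
    vca = categories.select(
        F.md5(F.col("category")).alias("id"),
        F.lit("category").alias("type"),
        F.col("category"),
    )
    # Union by column name; attributes missing from a dataframe become null.
    v = (vco.unionByName(vpa, allowMissingColumns=True)
            .unionByName(vca, allowMissingColumns=True))
    return vco, vpa, vca, v
```

Note that `F.concat` returns null when any input is null, so rows with a missing username or id may need handling before hashing.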


P1.2 Creating an Edge dataframe


You need to create an Edge dataframe by first creating two Edge dataframes and then creating the final Edge dataframe as the union of the two. The first (contributor-page) contains edges with a contributor as the source vertex and a page as the destination vertex. The second (page-category) contains edges with a page as the source vertex and a category as the destination vertex. The specification for the two Edge dataframes is as follows:


Contributor-page Edge dataframe ep

