联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Database作业Database作业

日期:2024-03-21 08:04

Submission Format

If you used databricks, please submit the published notebook link in a word or pdf document. Do not submit HTML, Jupyter notebook, or archive (DBC) formats.

If you used a local instance of spark, please submit a Jupyter notebook.

Setup Instructions

1.Read the article:

https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html (links to an external site)

2.Download Data: You an either download the JSON file from the Assignment 3 home page in Learn under 'EXPECTATIONS AND ASSIGNMENT FILES' or you can download the file from Github to your computer:

https://github.com/dmatrix/examples/blob/master/spark/databricks/notebooks/py/data/iot_devices.json (links to an external site) Click on the raw data source -> right click -> click save as. The file will download locally and now you can import it to Databricks. NOTE:

3.Import files: How to import your downloaded files (from step 3) to your Databricks cluster: https://www.projectpro.io/recipes/create-dataframe-from-json-file-read-data-from-dbfs-and-write-into-dbfs (links to an external site).

4.Import the notebook below. There is some data exploration already done in this notebook for your reference. For details on how to import the notebook above see: https://docs.databricks.com/user-guide/notebooks/index.html (links to an external site).

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/5915990090493625/411609171004360/6085673883631125/latest.html

5.Run it. NOTE: Don't forget to create a cluster and attach the imported notebook to it (left upper corner: button detached) before trying to run it.

Questions (12 marks)

1.Explain the main differences between RDDs, Dataframes and Datasets (4 marks)

2.Answer the following questions:

2.1 How many sensor pads are reported to be from Poland (2 marks)

2.2 How many different LCDs (distinct colors) are present in the dataset (2 marks)

2.3 Find 5 countries that have the largest number of MAC devices used (2 marks)

2.4 Propose and try an interesting statistical test or machine learning model you could use to gain insight from this dataset. Note, you don't have to use Machine Learning for this question. You can apply any analysis to the data even using SparkSQL, Python visualization libraries to analyze the data. Another example cloud be to apply correlation functions or other Spark functions to analyze the data. (2 marks)

NOTE: You may use MLLib in 2.4: https://spark.apache.org/docs/latest/ml-guide.html. Marks are awarded for the idea and implementation of the test/ML model.


相关文章

版权所有:留学生编程辅导网 2020 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp