Date: 2023-04-28 11:30

HW 6: Text Analysis of Harry Potter Books

Stat 133, Spring 2023

In this assignment, you will build a shiny app to visualize the results from a text analysis performed on the seven Harry Potter[1] books written by Joanne Kathleen Rowling:

1) Harry Potter and the Philosopher’s Stone

2) Harry Potter and the Chamber of Secrets

3) Harry Potter and the Prisoner of Azkaban

4) Harry Potter and the Goblet of Fire

5) Harry Potter and the Order of the Phoenix

6) Harry Potter and the Half-Blood Prince

7) Harry Potter and the Deathly Hallows

We are assuming that you have reviewed the learning materials of weeks 11, 12, and 13 (see

bCourses). Specifically, we are assuming that you have reviewed the text-mining tutorials

available in the readings/ folder:

https://bcourses.berkeley.edu/courses/1521994/files/folder/readings

Part I: Data

This section introduces the data for this assignment.

Harry Potter Data Files

The data for this assignment involves the text of Harry Potter books. This data is available

in two different presentations:

1) A single csv file (with the text of all seven Harry Potter books)

2) A set of seven rda[2] files (one file per book)

[1] Harry Potter is a series of seven fantasy novels. The novels chronicle the lives of a young wizard, Harry Potter, and his friends Hermione Granger and Ron Weasley.

[2] An rda file is a binary file that uses R's native binary format (i.e. it can be opened only in R).

All these files are located in the hw6/ folder (see bCourses folder Files/hws/hw6).

For the sake of redundancy (in case you experience any issue when trying to access the files through bCourses), you can also find the associated files in the following GitHub repository:

https://github.com/gastonstat/harry-potter-data

1.1) Single csv file harry_potter_books.csv

The data of all the books is available in csv format. You may want to use this file to perform

sentiment analysis, or word-trend analysis.

# read_csv() comes from the readr package
library(readr)

# assuming that the csv file is in your working directory
hp_books = read_csv("harry_potter_books.csv", col_types = "ccc")

The structure of this data set is fairly simple, although the text content is far from tidy. It has 95,085 rows and 3 columns:

1) text: text content

2) book: title of associated book

3) chapter: associated chapter number

1.2) Seven rda files

The data of each book is also available in its own R-Data rda file. To be more precise, the

text of each book comes in a character vector with as many elements as chapters in a book.

You may want to use these files to perform bigram analysis (or other types of n-gram analysis).

To import these files use the load() function:

# assuming that the rda files are in your working directory

load("philosophers_stone.rda")

load("chamber_of_secrets.rda")

load("prisoner_of_azkaban.rda")

load("goblet_of_fire.rda")

load("order_of_the_phoenix.rda")

load("half_blood_prince.rda")

load("deathly_hallows.rda")

Consider the first book, “Harry Potter and the Philosopher’s Stone”. Assuming that you’ve loaded the file "philosophers_stone.rda", the text of this book is available in a character vector of the same name, philosophers_stone:

length(philosophers_stone)

## [1] 17

As mentioned before, the number of elements in philosophers_stone corresponds to the

number of chapters in this book: 17 chapters.
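Before most analyses, a chapter-per-element vector like this is usually reshaped into a table with one row per word. The following is a minimal base R sketch of that step, not the course's method: `toy_book` is a made-up stand-in for `philosophers_stone` (two invented "chapters"), and `tokenize_chapters()` is a hypothetical helper. The tutorials in the readings do the same job with `tidytext::unnest_tokens()`.

```r
# toy stand-in for a book vector such as philosophers_stone
# (assumption: one element per chapter, as described above)
toy_book <- c(
  "The boy who lived. The boy slept on.",
  "The vanishing glass broke."
)

# hypothetical helper: split each chapter into lowercase word tokens
tokenize_chapters <- function(book) {
  # lowercase, then split on anything that is not a letter or an apostrophe
  words_by_chapter <- strsplit(tolower(book), "[^a-z']+")
  # drop empty strings produced by leading punctuation
  words_by_chapter <- lapply(words_by_chapter, function(w) w[w != ""])
  data.frame(
    chapter = rep(seq_along(book), lengths(words_by_chapter)),
    word = unlist(words_by_chapter),
    stringsAsFactors = FALSE
  )
}

tokens <- tokenize_chapters(toy_book)
table(tokens$chapter)  # number of word tokens per chapter
```

The regular expression is a simplification: it treats only ASCII letters and straight apostrophes as part of a word, so real book text with curly quotes or accented characters would need a more careful tokenizer.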

Part II: Text Analysis

This section provides some of the suggested text analysis that you can perform for this

assignment.

Listed below are some text analysis ideas for you to get inspiration from. We are also

including recommended readings (some available in bCourses, some available in the book

“Text Mining with R”, by Silge & Robinson).

Of the four types of text analysis listed below (2.A to 2.D), you must choose two to build your shiny app.

2.A) Sentiment Analysis

Perhaps the broadest and richest type of analysis that can be performed on the Harry Potter text data is sentiment analysis. Given the amount of text, spread across the seven books and all their chapters, and the four sentiment lexicons (bing, afinn, nrc, loughran), the options for sentiment analysis seem limitless.

For example, here are a few ideas (this is by no means an exhaustive list):

- Given a certain book, compute a sentiment score for each chapter. Note that different scores can be obtained by using different lexicons.

- Given a certain book, compute sentiment scores and visualize them across a plot trajectory. This is what Julia Silge refers to as the “track of narrative time in sections of text”. To clarify, a section of text does not necessarily have to match an entire chapter.

- Compute a sentiment score for each book, and then rank the books from most positive to most negative (or vice versa). Again, note that different scores can be obtained by using different lexicons.

- Which chapters or books have “relatively large” positive scores? And/or what words contribute the most to the score?

- Which chapters or books have “relatively large” negative scores? And/or what words contribute the most to the score?
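As an illustration of the first idea, here is a minimal base R sketch of a per-chapter score, assuming a table with one row per word and a lexicon shaped like bing (columns `word` and `sentiment`). Both `tokens` and `toy_lexicon` are made-up toy data, and the score used here (positive count minus negative count) is only one of several reasonable definitions.

```r
# toy tokens (one row per word) and a toy lexicon shaped like bing;
# both are invented for illustration
tokens <- data.frame(
  chapter = c(1, 1, 1, 2, 2, 2, 2),
  word = c("happy", "dark", "magic", "fear", "dark", "love", "abominable"),
  stringsAsFactors = FALSE
)
toy_lexicon <- data.frame(
  word = c("happy", "love", "dark", "fear", "abominable"),
  sentiment = c("positive", "positive", "negative", "negative", "negative"),
  stringsAsFactors = FALSE
)

# keep only the words found in the lexicon (an inner join)
scored <- merge(tokens, toy_lexicon, by = "word")

# count positive and negative words per chapter
counts <- table(
  chapter = scored$chapter,
  sentiment = factor(scored$sentiment, levels = c("negative", "positive"))
)

# score = number of positive words minus number of negative words
chapter_scores <- data.frame(
  chapter = as.integer(rownames(counts)),
  score = as.integer(counts[, "positive"] - counts[, "negative"])
)
chapter_scores
```

Words that are missing from the lexicon (here "magic") simply drop out of the join, so the score depends heavily on which lexicon is used.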

Sentiment Lexicons Files. For the sake of convenience, you can find rda files of all the

sentiment lexicons in the folder hws/hw6/sentiment-lexicons

https://bcourses.berkeley.edu/courses/1521994/files/folder/hws/hw6/sentiment-lexicons

If you experience any access issue through bCourses (hopefully not), you can also find the associated files in the following GitHub repository: https://github.com/gastonstat/harry-potter-data

# assuming that the rda files are in your working directory

load("bing.rda")

load("afinn.rda")

load("nrc.rda")

load("loughran.rda")

Assuming that you’ve loaded the file "bing.rda", the associated lexicon is available in a tibble of the same name, bing:

bing

## # A tibble: 6,786 x 2
##    word        sentiment
##    <chr>       <chr>
##  1 2-faces     negative
##  2 abnormal    negative
##  3 abolish     negative
##  4 abominable  negative
##  5 abominably  negative
##  6 abominate   negative
##  7 abomination negative
##  8 abort       negative
##  9 aborted     negative
## 10 aborts      negative
## # ... with 6,776 more rows

Suggested reading

text-mining-4-sentiment-analysis.html (see Files/readings in bCourses)

See also chapter 2 “Sentiment analysis with tidy data” (in “Text Mining with R”; link

below)

https://www.tidytextmining.com/sentiment.html

2.B) Word Trend Analysis

Another possibility consists of a word trend analysis.

One example of words could be the names of main characters: "harry", "ron", "hermione",

"dumbledore", "voldemort", "hagrid", etc.

With these names, one can compute their relative frequencies (i.e. proportion of occurrences) across the books, and visualize their trend. See the tutorial listed below to get a better idea of this kind of trend.

Alternatively, you can also look for one or more specific words, and see how they are used across chapters of a book, or across the seven books. For instance, how is the word "love" used across the chapters of the first book, “The Philosopher’s Stone”? Or other words such as "spell", "potion", "wand", "quidditch", to mention a few.
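The relative frequency of a word in a book is its number of occurrences divided by the total number of words in that book. A minimal base R sketch, assuming a one-row-per-word table with a `book` column; the `tokens` data and the book labels are made up for illustration:

```r
# toy tokens: one row per word, tagged with a book label (made-up data)
tokens <- data.frame(
  book = c(rep("Book 1", 5), rep("Book 2", 4)),
  word = c("harry", "ron", "harry", "wand", "hermione",
           "harry", "hermione", "hermione", "spell"),
  stringsAsFactors = FALSE
)
names_of_interest <- c("harry", "ron", "hermione")

# total number of words per book (the denominator)
totals <- table(tokens$book)

# occurrences of each name per book (the numerator)
hits <- tokens[tokens$word %in% names_of_interest, ]
counts <- table(
  word = factor(hits$word, levels = names_of_interest),
  book = hits$book
)

# relative frequency: divide each book's column by that book's total
rel_freq <- sweep(counts, 2, as.numeric(totals[colnames(counts)]), "/")
round(rel_freq, 2)
```

Dividing by the book total rather than comparing raw counts matters because the books differ a lot in length. Replacing `book` with `chapter` gives the within-book version of the same trend.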

Suggested reading

text-mining-5-pride-and-prejudice.html (see Files/readings in bCourses)

See figure 5.4 in “Text Mining with R” (link below) to get a rough idea of this type of trend over time. Obviously there is no time variable in the Harry Potter data, but you can use the sequence of chapters as a proxy for time.

https://www.tidytextmining.com/dtm.html#tidying-dfm-objects

2.C) Word and Document Frequency (tf-idf)

You can also look at a term’s inverse document frequency (idf), which decreases the weight

for commonly used words and increases the weight for words that are not used very much in

a collection of documents. This can be combined with term frequency to calculate a term’s

tf-idf (the two quantities multiplied together), the frequency of a term adjusted for how

rarely it is used.
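To make the definition concrete, here is a minimal base R sketch on a made-up count matrix with two "documents". It assumes tf is a term's share of the words in a document and idf is the natural log of (number of documents / number of documents containing the term), which is the convention used in “Text Mining with R”; in practice `tidytext::bind_tf_idf()` computes all three quantities for you.

```r
# toy counts: how many times each word appears in each document (made-up)
counts <- matrix(
  c(4, 2, 0,
    4, 0, 2),
  nrow = 2, byrow = TRUE,
  dimnames = list(book = c("Book 1", "Book 2"),
                  word = c("harry", "stone", "basilisk"))
)

# term frequency: share of a document's words taken up by each term
tf <- counts / rowSums(counts)

# inverse document frequency:
# ln(number of documents / number of documents containing the term)
idf <- log(nrow(counts) / colSums(counts > 0))

# tf-idf: multiply each column of tf by that term's idf
tf_idf <- sweep(tf, 2, idf, "*")
round(tf_idf, 3)
```

A word that appears in every document ("harry" here) gets idf = ln(1) = 0, so its tf-idf is zero however often it occurs, while words specific to one document keep a positive weight.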

Suggested reading

See chapter 3 “Analyzing word and document frequency: tf-idf” (in “Text Mining with

R”; link below)

https://www.tidytextmining.com/tfidf.html

2.D) Bigram Analysis

Another type of analysis involves studying so-called bigrams (or n-grams in general) for

answering questions like:

- What kind of words tend to be associated with other words?
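A bigram is a pair of consecutive words, so the basic computation is to pair each word with the one that follows it and count the pairs. A minimal base R sketch on a made-up sentence (the text in `chapter` is invented; a real input would be one element of a book vector):

```r
# toy chapter text (made-up)
chapter <- "the dark lord rose and the dark mark burned"

# split into lowercase words
words <- strsplit(tolower(chapter), "[^a-z']+")[[1]]
words <- words[words != ""]

# pair each word with the word that follows it
bigrams <- data.frame(
  word1 = head(words, -1),
  word2 = tail(words, -1),
  stringsAsFactors = FALSE
)

# count how often each pair occurs, most frequent first
bigram_counts <- as.data.frame(
  table(bigram = paste(bigrams$word1, bigrams$word2)),
  responseName = "n",
  stringsAsFactors = FALSE
)
bigram_counts <- bigram_counts[order(-bigram_counts$n), ]
head(bigram_counts, 3)

# which words follow "dark"?
bigrams$word2[bigrams$word1 == "dark"]
```

This sketch lets bigrams run across sentence boundaries and keeps stop words such as "the"; the suggested readings show how to filter those out before counting.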

Suggested reading

text-mining-2-pride-and-prejudice.html (see Files/readings in bCourses)

See chapter 4 “Relationships between words: n-grams and correlations” (in “Text

Mining with R”; link below)

https://www.tidytextmining.com/ngrams.html

Warning: visualizing graph networks of n-grams tends to be computationally expensive due to the size of the text data.

Part III: Shiny App

This section describes generic specifications of the shiny app.

3) Shiny App

The main data product to be delivered for this assignment is a shiny app that allows the

user to explore the results of two types of text analysis.

For example, you can choose 1) a sentiment analysis, and a 2) word trend analysis. Keep in

mind that even if two (or more) students choose to work on the same type of analyses, there

is still enough room to approach them in slightly different ways, therefore producing different

shiny apps, with different scopes, and of course different data visualizations and outputs.

Important Note: It is possible to find—online—various types of text analysis on Harry

Potter data that different authors/analysts have performed in the past. You can definitely

get inspiration from them, but we expect that you do your own work, write your own code,

and conduct yourself with academic integrity.

3.1) Layout

[Figure 1: Diagram of the overall shiny app’s layout. The diagram shows the title of your app, four groups of input widgets (you may want to arrange widgets across columns; use at least four different types of widgets), two tabs labeled analysis1 and analysis2, a graphing area, and an area for numeric/text output to help in the interpretation of the displayed graphs.]

You can find a template R script file app-template.R in the folder containing this pdf of

instructions (see bCourses folder Files/hws/hw6).

As you can tell from the above diagram, the layout of the app is different from that of the shiny app of hw5. One big difference in the app for this assignment is that it uses two tabs:

1) Analysis1: this tab is for displaying the results for one type of text analysis (for

example: sentiment analysis)

2) Analysis2: this tab is for displaying the results for another type of text analysis (for

example: word trend analysis)

From the diagram above, note that there are four distinct sections in the layout (see template file app-template.R):

- title: main title for your app (give it a meaningful name).

- input widgets: the template already contains various input widgets arranged in four columns. You can change this configuration if you want, as well as the types and number of widgets. The only condition is to use at least four different types of widgets (e.g. slider, numeric, text, and select).

- plot: an output area to display graph(s).

- stats: an output area (e.g. for a table, text, etc.) to display numeric/text output.
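For orientation, the four sections and the two tabs could be wired together roughly as follows. This is a hedged sketch, not the contents of app-template.R: the input ids (`book`, `top_n`, `bins`, `word`), the output ids, the widget labels, and the placeholder plots are all invented for illustration.

```r
library(shiny)

ui <- fluidPage(
  # title section
  titlePanel("Harry Potter Text Analysis (placeholder title)"),
  # input widgets section: four different widget types, one per column
  fluidRow(
    column(3, selectInput("book", "Book", choices = c("Book 1", "Book 2"))),
    column(3, sliderInput("top_n", "Top words", min = 1, max = 20, value = 10)),
    column(3, numericInput("bins", "Sections", value = 5, min = 1)),
    column(3, textInput("word", "Word to track", value = "love"))
  ),
  hr(),
  # two tabs, each with a plot area and a numeric/text output area
  tabsetPanel(
    tabPanel("Analysis1", plotOutput("plot1"), verbatimTextOutput("stats1")),
    tabPanel("Analysis2", plotOutput("plot2"), verbatimTextOutput("stats2"))
  )
)

server <- function(input, output) {
  output$plot1 <- renderPlot({
    # placeholder plot; replace with the graph of the first analysis
    barplot(seq_len(input$top_n), main = paste("Analysis 1:", input$book))
  })
  output$stats1 <- renderPrint({
    cat("Showing", input$top_n, "items for", input$book, "\n")
  })
  output$plot2 <- renderPlot({
    # placeholder plot; replace with the graph of the second analysis
    plot(seq_len(input$bins), type = "b",
         main = paste("Analysis 2:", input$word))
  })
  output$stats2 <- renderPrint({
    cat("Tracking the word:", input$word, "\n")
  })
}

# build the app object; printing it (or calling runApp(app)) launches the app
app <- shinyApp(ui = ui, server = server)
```

In an actual app.R file the call to `shinyApp(ui = ui, server = server)` would normally be the last expression; it is assigned to `app` here only so that the script can be sourced without immediately launching the app.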

4) Submission (and tentative points)

1) R file [7 pts]: You will have to submit the source app.R file (do NOT confuse with an

Rmd file) containing the code of your app.

2) Link of published app [1 pt]: You will also have to submit the link of your published

app in shinyapps.io (the free version). Share the link with us in the comments section

of the submission in bCourses.

3) Video [2 pts]: In addition to the app.R file and the link of your published app, you

will also have to record a video—maximum length of 4 mins—in which you show us

your published shiny app, how to use it, and a description of its outputs.

4) Important: You do NOT have to submit any Rmd or html files this time. Also,

we will not accept any content sent by email. We will only grade the app.R

file submitted to bCourses, the public link of the video, and the link of your app in

shinyapps.io.

5) Some of the things we will pay attention to

We will pay attention to the visual appearance of the graphics (e.g. type of graph, use of

colors, supporting elements such as grid lines, text, labels, legends, annotations, etc.). This

means that we will assess the effectiveness of your graph in terms of the displayed information,

taking into account good practices of data visualization.

We will also evaluate the effectiveness of the numeric and/or text output displayed in your

shiny app, in terms of providing understanding and insight for each of the analyses.

Likewise, we will also assess your video. Make sure that the image and sound quality of

your video are acceptable (avoid background noise, inaudible voice, highly pixelated images,

trembling camera movements, and things like that). You may need to rehearse what you will

say in your video a couple of times before its definitive recording.

Above all, put yourself in the place of a generic user who will use your app without you

being there to explain to them how to use it, or to tell them how to make sense of the displayed

information. We will examine your published app without necessarily watching your video at

the same time. If something needs an explanation, make sure to include it in your app (not

just in your video).

