联系方式

  • QQ:99515681
  • 邮箱:99515681@qq.com
  • 工作时间:8:00-23:00
  • 微信:codinghelp

您当前位置:首页 >> Python编程Python编程

日期:2023-10-30 09:31

University of Toronto, Department of Computer Science

CSC 485H/2501H: Computational linguistics, Fall 2023

Assignment 2

Due date: 23:59 on Friday, 3rd November, 2023.

Late assignments will not be accepted without a valid medical certificate or other documentation

of an emergency.

For CSC485 students, this assignment is worth 30% of your final grade.

For CSC2501 students, this assignment is worth 25% of your final grade.

? Read the whole assignment carefully.

? Type the written parts of your submission in no less than 12pt font.

? What you turn in must be your own work. You may not work with anyone else on any of

the problems in this assignment. If you need assistance, contact the instructor or TA for the

assignment.

? Any clarifications to the problems will be posted on the Piazza forum for the class. You will

be responsible for taking into account in your solutions any information that is posted there,

or discussed in class, so you should check the page regularly between now and the due date.

? The starter code directory for this assignment is accessible on Teaching Labs machines at

the path /u/csc485h/fall/pub/a2/. In this handout, code files we refer to are located in

that directory.

? When implementing code, make sure to read the docstrings as some of them provide important instructions, implementation details, or hints.

? Fill in your name, student number, and UTORid on the relevant lines at the top of each

file that you submit. (Do not add new lines; just replace the NAME, NUMBER, and UTORid

placeholders.)

0. Warming up with WordNet and NLTK (4 marks)

WordNet is a lexical database; like a dictionary, WordNet provides definitions and example

usages for different senses of word lemmas. But WordNet does the job of a thesaurus as well: it

provides synonyms for the senses, grouping synonymous senses together into a set called a synset.

But wait; there’s more! WordNet also provides information about semantic relationships beyond synonymy, such as antonymy, hyperonymy/hyponymy, and meronymy/holonymy. Throughout this assignment, you will be making use of WordNet via the NLTK package, so the first step is

to get acquainted with doing so. Consult sections 4.1 and 5 of chapter 2 as well as section 3.1 of

chapter 3 of the NLTK book for an introduction along with examples that you will likely find useful

for this assignment. You may also find section 3.6 is also useful for its discussion of lemmatization,

although you will not be doing any lemmatization for this assignment.

Make certain that you use Python 3 (python3). That is where NLTK lives on our machines.

You will need to import nltk. You may also need to nltk.download(’omw-1.4’).

(a) (1 mark) A root hyperonym is a synset with no hyperonyms. A synset s is said to have depth

d if there are d hyperonym links between s and a root hyperonym. Keep in mind that, because

synsets can have multiple hyperonyms, they can have multiple paths to root hyperonyms.

Implement the deepest function in q0.py that finds the synset in WordNet with the largest

maximum depth and report both the synset and its depth on each of its paths to a root hyperonym.1

(b) (2 marks) Implement the superdefn function in q0.py that takes a synset s and returns a

list consisting of all of the tokens in the definitions of s, its hyperonyms, and its hyponyms.

Use word_tokenize as shown in chapter 3 of the NLTK book.

(c) (1 mark) WordNet’s word_tokenize only tokenizes text; it doesn’t filter out any of the tokens. You will be calculating overlaps between sets of strings, so it will be important to

remove stop words and any tokens that consist entirely of punctuation symbols.

Implement the stop_tokenize function in q0.py that takes a string, tokenizes it using word_

tokenize, removes any tokens that occur in NLTK’s list of English stop words (which has

already been imported for you), and also removes any tokens that consist entirely of punctuation characters. For a list of punctuation symbols, use Python’s punctuation characters

from the string module (this has also already been imported for you). Keep in mind that

NLTK’s list contains only lower-case tokens, but the input string to stop_tokenize may

contain upper-case symbols. Maintain the original case in what you return.

1Hint: you may find the wn.all_synsets and synset.max_depth methods helpful.

2

1. The Lesk algorithm & word2vec (28 marks)

Recall the problem of word sense disambiguation (WSD): given a semantically ambiguous

word in context, determine the correct sense. A simple but surprisingly hard-to-beat baseline

method for WSD is Most Frequent Sense (MFS): just select the most frequent sense for each

ambiguous word, where sense frequencies are provided by some corpus.

(a) (1 mark) Implement the mfs function that returns the most frequent sense for a given word

in a sentence. Note that wordnet.synsets() orders its synsets by decreasing frequency.

As discussed in class, the Lesk algorithm is a venerable method for WSD. The Lesk algorithm

variant that we will be using for this assignment selects the sense with the largest largest number of

words in common with the ambiguous word’s sentence. This version is called the simplified Lesk

algorithm.

Algorithm 1: The simplified Lesk algorithm.

input : a word to disambiguate and the sentence in which it appears

best_sense ←? most_frequent_sense word

best_score ←? 0

context ←? the bag of words in sentence

for each sense of word do

signature ←? the bag of words in the definition and examples of sense

score ←? Overlap(signature, context)

if score > best_score then

best_sense ←? sense

best_score ←? score

end

end

return best_sense

Our version represents the signature and context each as bags (also known as multisets) of

words. Bags are like sets but allow for repeated elements by assigning each element with a nonnegative integer that indicates the number of instances of that element in the bag; this integer is

called multiplicity. Because of multiplicity, the bags A = {a,a,b}, B = {a,b,b}, C = {a,b}, and

D = {a,a,b,b} are all different. (This is not the case for sets, which do not allow multiplicity.) a

has multiplicity 2 in A and D but multiplicity 1 in B and C, while b has multiplicity 2 in B and D

but multiplicity 1 in A and C.

As with sets, each bag has associated with it a cardinality which is the sum of the multiplicities

of its elements; so A and B have cardinality 3 whileC has cardinality 2 and D has cardinality 4. The

intersection of two bags Y and Z is the bag where, the multiplicity of any element x is defined as the

minimum of the multiplicities of x in Y and in Z. For the bags defined above, C is the intersection

3

of A and B. The union is analogously defined using maximum instead of minimum: the bag D is

the union of A and B.

(b) (6 marks) In the lesk function, implement the simplified Lesk algorithm as specified in

Algorithm 1, including Overlap. Overlap(signature, context) returns the cardinality

of the intersection of the bags signature and context, i.e., the number of words tokens

that the signature and context have in common.

Use your stop_tokenize function to tokenize the examples and definitions.

Next, we’re going to extend the simplified Lesk algorithm so that the sense signatures are more

informative.

(c) (3 marks) In the lesk_ext function, implement a version of Algorithm 1 where, in addition to including the words in sense’s definition and examples, signature also includes

the words in the definition and examples of sense’s hyponyms, holonyms, and meronyms.

Beware that NLTK has separate methods to access member, part, and substance holonyms/

meronyms; use all of them.

Use stop_tokenize as you did for lesk.

(d) (2 mark) This extension should yield improvement in the algorithm’s accuracy. Why is this

extension helpful? Justify your answer.2

Beyond Overlap,there are other scores we could use. Recall cosine similarity from the lectures:

for vectors~v and ~w with angle θ between them, the cosine similarity CosSim is defined as:

CosSim(~v,~w) = cosθ =

~v ·~w

|~v||~w|

Cosine similarity can be applied to any two vectors in the same space. In the Lesk algorithm,

we compare contexts with sense signatures, both of which are bags of words. If, instead of bags,

we produced vectors from the relevant sources (i.e., the words in the sentence for the contexts and

the words in the relevant definitions and examples for the sense signatures), we could then use

cosine similarity to score the two.

Perhaps the simplest technique for constructing vectors from bags of words is to assign one

vector dimension for every word, setting the value for each dimension to the number of occurrences

of the associated word in the bag. So {a,a,b} might be represented with the vector

2 1

and

{a,b} with

1 1

. If we were comparing {new,buffalo,york} and {buffalo,buffalo,like},

we might use

1 0 1 1

and

2 1 0 0

, respectively.

(e) (4 marks) In the lesk_cos function, implement a variant of your lesk_ext function that

uses CosSim instead of Overlap. You will have to modify signature and context so that

they are vector-valued; construct the vectors from the relevant tokens for each in the manner

described above.

(Again, use stop_tokenize to get the tokens for the signature.)

2Hint: consider the likely sizes of the overlaps.

4

(f) (2 marks) In the lesk_cos_oneside function, implement a variant of your lesk_cos function that, when constructing the vectors for the signature and context, does not include words

that occur only in the signature. For example, if the signature has words {new,buffalo,york}

while the context has {buffalo,buffalo,like}, new and york would not be included; so

the signature might be represented with

1 0

and the context with

2 1

.

(Again, use stop_tokenize to get the tokens for the signature.)

(g) (3 marks) Compare how well lesk_cos_oneside performs compared to lesk_cos. Why do

you think this is the case? Justify your answer with examples.

(h) (1 mark) Suppose that, instead of using word counts as values for the vector elements, we

instead used binary values, so that {new,buffalo,york} and {buffalo,buffalo,like}

would be represented with

1 0 1 1

and

1 1 0 0

, respectively. This is a vector

representation of a set.

If we use CosSim for such vectors, how would this be related to the set intersection? (You do

not need to implement this.)

Finally, let’s try to incorporate modern word vectors in place of the bag of words–based method

above. Relatively simple models such as the skip-gram model of word2vec can be trained on large

amounts of unlabelled data; because of the large size of their training data, they are exposed to

many more tokens and contexts. Once trained, word vectors can be extracted from the model and

used to represent words for other tasks, usually bestowing substantial increases to performance.

They also seem to exhibit some interesting semantic properties; recall the example discussed in

class where if we take the vector for king, subtract the vector for man, add the vector for woman,

and then ask which existing word vector is closest to the result, the answer will be the vector for

queen. It stands to reason that incorporating these vectors might help improve the Lesk algorithm.

(i) (4 marks) In the lesk_w2v function, implement a variant of your lex_cos function where the

vectors for the signature and context are constructed by taking the mean of the word2vec

vectors for the words in the signature and sentence, respectively. Count each word once only;

i.e., treat the signature and context as sets rather than multisets.

(Again, use your stop_tokenize to get the tokens for the signature.)

You can run your implementations on the evaluation set that we provide by running python3

q1.py. This will skip any unimplemented functions and display scores for the implemented ones.

Report your scores in your written submission.

(j) (2 marks) Alter your code so that all tokens are lowercased before they are used for any of the

comparisons, vector lookups, etc. How does this alter the different methods’ performance?

Why?

(Do not submit this lowercased version.)

5

2. Word sense disambiguation with BERT (22 marks)

word2vec associates a vector with each word type. Newer models, such as ELMo, GPT, and BERT

instead produce vectors for each word token. Another way of saying this is that vector representations produced by the latter models are conditioned on context and will differ depending on the

context in which the word appears. Consider the two sentences Let’s play a game and Let’s see a

play. word2vec will produce the same vector for play in both cases even though they have different

senses. This is why you had to average all of the relevant vectors in the signature or sentence for

lesk_w2v rather than, for example, relying only on the vector for play. Clearly, there’s a parallel

to how ambiguous words can have their senses resolved only in context.

(a) (4 marks) Is context really necessary? Assuming all that is available are wordforms and

lemmata—no dependencies, semantic roles, parts of speech, etc.—can you give an example

of a sentence where word order–invariant methods such as those you implemented for Q1

will never be able to completely disambiguate? If so, what is the more general pattern, and

why is it impossible for the above methods to provide the correct sense for each ambiguous

word? Be clear about your reasoning.3

But BERT (as is the case with other contextual models) doesn’t produce the same vector for

every instance of a particular word sense; the same vector will be produced only in the exact

same context. So we cannot simply glance at the vector produced for an ambiguous word in order

to determine its sense. We might consider using BERT vectors in a similar manner as we did

the word2vec vectors for lesk_w2v above, taking the mean of the vectors to produce a vector

representation for a word sequence; this turns out to perform worse than any of the methods above.4

Instead, suppose that we had a vector associated with every sense. If the sense vectors are

related to corresponding word token vectors in some way, we could then compute similarities

between possible sense vectors for a particular word token and its corresponding contextual word

vector and then select the sense with the highest similarity. A simple method that works well is

to run the model (BERT, in our case) over each of the sentences in a sense-annotated training set.

We can gather the word tokens associated with all occurrences of a sense and take their average to

represent that sense.

(b) (10 marks) Implement gather_sense_vectors in q2.py to assign sense vectors as described above.

(c) (2 marks) In the docstring for gather_sense_vectors, we point out that sorting the corpus

by length before batching is much faster than leaving it as-is. Explain why this is the case.5

(d) (4 marks) Implement bert_1nn in q2.py to predict the sense for a word in a sentence given

sense vectors produced by gather_sense_vectors. Keep in mind the note in the docstring

about loop usage.

3Hint: re-read the preceding paragraph with this question in mind.

4Why?

5Hint: think about padding.

6

For this assignment, you will be using the BERT model, though the direct calls to it have been

implemented for you in given helper functions. Pay attention to the notes, etc. in q2.py, as they

will help with and fully specify proper usage.

Finally, beware that the version of word sense disambiguation in this assignment is restricted

compared to what would have to be done for the general case of disambiguating ambiguous words

in arbitrary sentences. In particular, the code you’ve written assumes that it is known which words

need to be disambiguated; the selection of ambiguous words in the sentence has already been done

for you.

(e) (2 marks) Think of at least one other issue that would come up when attempting to use the

code for this assignment to disambiguate arbitrary sentences. You may consider either the

Lesk variants from Q1 or the BERT-based method here (or both).

7

Notes

? Running the code for Q1 should not take too long, but if you would like to iterate faster,

caching tokens for signatures and contexts can be an effective speedup.

? The BERT code for Q2 can be a little slow and memory-intensive, so it’s recommended that

you run it on a GPU. (That said, it should run on CPU in less than an hour or so, so it’s definitely viable to do so if needed or preferred.) The code we’ve provided for this assignment

will automatically use the GPU if it’s available; you should never have to specify the device

of a Tensor or call .clone().detach() on one. Similar to A1, you can run your Q2 code

on a GPU-equipped machine via ssh to teach.cs.toronto.edu; use the gpu-run-q2.sh

script to do so. As a reminder, the GPU machines are shared resources, so there are limits on

how long your code is allowed to run.

? Do not alter the function signatures (i.e., the names of the functions or anything about the

parameters that they take), and do make sure that your implementations of the functions

return what the functions are specified to return (i.e., no extra return objects). Also, make

sure your submitted files are importable in Python (i.e., make sure that ‘import q1’ doesn’t

yield errors, and likewise for the other questions/files).

? Do not add any extra package or module imports. If you really would like to use a certain

package or module, ask on the Piazza forum with a brief justification.

? You may add any helper functions, global variables, etc. that you wish to the files for each

question (q0.py, q1.py, etc.), but do not allow your code to become overly convoluted and

make sure it is still readable so that we can trace and understand your logic. You will not

be submitting the other Python files, but avoid making changes to them, as your code’s

behaviour will be evaluated against those files as they are in the starter code.

8

What to submit

Submission is electronic and can be done from any Teaching Labs machine using the submit

command:

$ submit -c <course> -a A2 <filename-1> ... <filename-n>

where <course> is csc485h or csc2501h depending on which course you’re registered in, and

<filename-1> to <filename-n> are the n files you are submitting. The files you are to submit are

as follows:

? a2written.pdf: a PDF document containing your answers as applicable for each question.

This PDF must also include a typed copy of the Student Conduct declaration as on the last

page of this assignment handout. Type the declaration as it is and sign it by typing your

name.

? q0.py: the (entire) q0.py file with your implementations filled in.

? q1.py: the (entire) q1.py file with your implementations filled in. Again, do not include the

alterations that you implement for question 1h.

? q2.py: the (entire) q2.py file with your implementations filled in.

9

CSC 485H/2501H, Fall 2023: Assignment 2

Family name:

Given name:

Student #:

Date:

I declare that this assignment, both my paper and electronic submissions, is my own work, and

is in accordance with the University of Toronto Code of Behaviour on Academic Matters and the

Code of Student Conduct.

Signature:


版权所有:留学生编程辅导网 2020 All Rights Reserved 联系方式:QQ:99515681 微信:codinghelp 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。 站长地图

python代写
微信客服:codinghelp