
Date: 2021-12-16 10:33

COMP34711 Natural Language Processing

Coursework 2, Nov 2021

You are provided with the product review corpus. Check the README file and observe the content,

format and structure of the corpus. You are asked to design and evaluate solutions for two NLP

tasks using this corpus. You may use only functions that are available in the NLTK

framework and machine learning libraries specified in the instruction, e.g., Weka, scikit-learn,

PyTorch, Keras, TensorFlow, to implement your design. Overall, this coursework is marked on the

basis of

• rigorous experimentation,

• knowledge displayed in report,

• independent problem-solving skill,

• self-learning ability,

• how informative your analysis is,

• language and ease of reading of the report,

• code quality based on correctness and readability (which includes comments).

You should solve all the tasks on your own. You are not permitted to collaborate with other

students on this coursework. In lab support sessions, you can ask TAs to explain knowledge

taught in the lecture or seek advice on how to use a natural language processing or machine

learning library. But you are not permitted to ask TAs to help with the solution design, or to check

the correctness of your solution.

Your submission should include both code and report. About your code, provide comments when

you see fit and your code will be marked based on both correctness and readability (which

includes comments). About your report, use 11-point Arial. Your main report should be no more

than 3 pages, with up to 2 pages for Task 1 and up to 1 page for Task 2. If needed, you can

include up to 2 additional pages of screenshots (e.g., of your results) as an Appendix to your

report.

Task 1: Distributional Semantics (15 marks)

The following experiment is designed to evaluate the performance of a distributional semantic

approach.

• Step 1: Clean and pre-process all reviews in your text corpus as you see fit. Choose the 50

most frequently occurring words (after removing the stop words) as the target words. You are

free to use functions that are available in the NLTK framework to help your text pre-processing.

• Step 2: For each of the 50 target words, uniformly sample half of its occurrences in the corpus

and substitute these with a made-up reversed word, e.g., half of the occurrences of "canon" will

be transformed into "nonac". Refer to these 50 new words as pseudowords.

• Step 3: Construct a d-dimensional feature vector to characterise each of the 50 target words

and 50 pseudowords (N=50+50=100) using a distributional semantic approach (more detailed

requirements on this are provided later). Store your obtained feature vectors in a 100×d matrix

X.

• Step 4: Take the feature matrix X as the input, and apply a clustering algorithm to cluster the

100 words into 50 clusters. You are free to use an existing clustering algorithm implementation as

you see fit. For instance, the clustering modules from NLTK

(https://www.nltk.org/api/nltk.cluster.html), the machine-learning framework Weka

(http://www.cs.waikato.ac.nz/ml/weka/) or scikit-learn (https://scikit-learn.org/stable/modules/clustering.html#clustering).

• Step 5: For each pair of the target word and its corresponding pseudoword, if these two are

grouped into the same cluster, it is defined as a correct pair. Among the 50 pairs, check the

percentage of the correct pairs, denoted by p.

• Step 6: Repeat this whole process multiple times, e.g., 5-10, and calculate the mean and

standard deviation of the obtained percentages p.
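
The pseudoword substitution in Step 2 and the scoring in Steps 5 and 6 can be sketched roughly as follows. This is only an illustration under stated assumptions, not a prescribed implementation: the function names (make_pseudowords, pair_accuracy, summarise) are hypothetical, the corpus is assumed to be a flat list of tokens, the cluster assignment is assumed to be a dict from word to cluster id, and the sample standard deviation is used because the text does not say which variant is intended.

```python
import random
import statistics


def make_pseudowords(tokens, targets, rng):
    """Step 2 sketch: replace a uniformly sampled half of each target's
    occurrences with the reversed word (e.g. "canon" -> "nonac").

    Assumption: no reversed target collides with a real corpus word or
    with another target; palindromes would need special handling.
    """
    tokens = list(tokens)  # work on a copy, leave the input untouched
    for word in targets:
        positions = [i for i, tok in enumerate(tokens) if tok == word]
        for i in rng.sample(positions, len(positions) // 2):
            tokens[i] = word[::-1]
    return tokens


def pair_accuracy(cluster_of, targets):
    """Step 5 sketch: fraction of targets that share a cluster with
    their pseudoword (the quantity p, as a value in [0, 1])."""
    correct = sum(1 for w in targets if cluster_of[w] == cluster_of[w[::-1]])
    return correct / len(targets)


def summarise(percentages):
    """Step 6 sketch: mean and sample standard deviation over repeats."""
    return statistics.mean(percentages), statistics.stdev(percentages)


if __name__ == "__main__":
    toy = "canon lens canon zoom canon lens canon".split()
    print(make_pseudowords(toy, ["canon"], random.Random(0)))
```

Passing an explicit random.Random instance makes each repeat in Step 6 reproducible while still allowing a different seed per repeat.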

Applying what you have learned on lexical processing and distributional semantics, you should

come up with 2 different approaches for constructing the distributional semantic representations.

For instance, they can differ in ways of constructing the dictionary (e.g., stems vs. words) and of

extracting the context features, or in their underlying principles. You should aim to achieve

good clustering performance and to understand the reasons behind it.

Here are the requirements of the 2 approaches:

• They should differ significantly. For instance, the same context feature extraction approach

with different window sizes is considered as one approach.

• They should include one sparse approach and one dense approach.

• They should be evaluated and compared thoroughly, e.g., their performance, and effect of

their hyperparameter setting.
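
As one possible illustration of a sparse approach, the sketch below builds window-based co-occurrence counts for the target words and reweights them with positive pointwise mutual information (PPMI). The choice of PPMI, the function names, and the default window size of 2 are assumptions made for this example only; the coursework does not prescribe them. Sentences are assumed to be lists of already pre-processed tokens.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, targets, window=2):
    """Count context words within +/- `window` tokens of each target."""
    targets = set(targets)
    counts = defaultdict(Counter)
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok not in targets:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tok][sent[j]] += 1
    return counts


def ppmi_vectors(counts):
    """Reweight raw counts with positive PMI.

    Probabilities are estimated from the target-by-context count matrix
    itself. Returns {target: {context: weight}}, a sparse representation.
    """
    total = sum(sum(c.values()) for c in counts.values())
    row = {w: sum(c.values()) for w, c in counts.items()}
    col = Counter()
    for c in counts.values():
        col.update(c)
    vectors = {}
    for w, c in counts.items():
        vec = {}
        for ctx, n in c.items():
            pmi = math.log2(n * total / (row[w] * col[ctx]))
            if pmi > 0:  # keep only positive associations
                vec[ctx] = pmi
        vectors[w] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(x * v.get(k, 0.0) for k, x in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0
```

A target word and its pseudoword that occur in similar contexts should end up with a high cosine similarity under this scheme, which is what the clustering in Step 4 relies on. The window size is the main hyperparameter to examine here.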

1. Submission Instruction

Your implementation should be well-structured, defining a function for each step and executing the

functions in a main file.

You should submit the implementation and evaluation of your 2 approaches as 2 separate Jupyter

notebook files, named as “Task1_Approach1”, “Task1_Approach2”. The TA will run each file

separately during marking.

You should prepare a report (up to 2 pages) containing two sections:

• Methods: Explanation of your text cleaning and pre-processing steps, as well as the 2

approaches for constructing the distributional semantic representations.

• Result Analysis: Analyse and discuss the obtained clustering results for each approach.

You should discuss hyperparameter-related issues if your approach requires any

hyperparameter setting, e.g., setting context window size, determining feature

dimensionality d for a dense approach, etc.

2. Mark Allocation

Marks are allocated as below:

• 1 mark for text cleaning, pre-processing, target word selection, and pseudoword

construction.

• 10 marks for implementation, description, and result analysis of the 2 approaches, with 5

marks for each approach.

• 2 marks for clustering performance award, which means achieving a satisfactory clustering

performance exceeding a percentage threshold with at least one approach and explaining

the reason behind your success. This percentage threshold will only be released to you

after the marking.

• 2 marks for design novelty that can be either an improvement of what has been taught or a

new reasonable approach not taught in the “Distributional Semantics” Chapter. To gain these

marks, you need to highlight in the report what the novelty is.

Task 2: Neural Network for Classifying Product Reviews (10 marks)

The product review corpus contains reviews scored as positive and negative opinions. Pre-process

your text and prepare the review examples for training and evaluation. Implement, train and evaluate a

neural network that can classify an input review into either a positive or a negative class. You are

free to choose any neural network/deep learning technique taught in the Chapter “Deep Learning

for NLP”, e.g., multi-layer perceptron, LSTM, bi-directional LSTM, etc. You should evaluate your

classifier’s classification accuracy using 5-fold cross validation (CV). You are free to use PyTorch,

Keras or TensorFlow library.
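
The 5-fold cross-validation protocol can be sketched independently of the chosen deep learning library. In the illustration below, train_and_score is a hypothetical callback that trains a fresh model on the training split and returns its accuracy on the test split; the function names, the fixed seed, and the use of an unstratified shuffle are assumptions for this example.

```python
import random
import statistics


def k_fold_indices(n_examples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs so that every example is used
    for testing exactly once across the k folds."""
    indices = list(range(n_examples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_validate(examples, labels, train_and_score, k=5, seed=0):
    """Run k-fold CV and return (mean accuracy, sample std deviation).

    `train_and_score(train, test)` is a hypothetical callback: it must
    build and train a new model each time so no fold sees test data.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(examples), k, seed):
        train = [(examples[i], labels[i]) for i in train_idx]
        test = [(examples[i], labels[i]) for i in test_idx]
        scores.append(train_and_score(train, test))
    return statistics.mean(scores), statistics.stdev(scores)
```

Any vocabulary or other pre-processing statistics should be fitted inside the callback on the training split only, otherwise information from the test fold leaks into training.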

1. Submission Instruction

Your implementation should be well-structured with comments. You should submit the

implementation and its evaluation as a single file, named as “Task2”.

Prepare a report (up to 1 page) containing 2 short sections:

• Method: Explanation of your classification model design and training.

• Experiment and Result Analysis: Describe your experiment and evaluation approach. You

should discuss hyperparameter-related issues if your approach requires any

hyperparameter setting. Report and analyse classification accuracy.

2. Mark Allocation

Marks are allocated as below:

• 2 marks for text cleaning, pre-processing, and preparing the input data for the classifier.

• 7 marks for implementation, classification accuracy evaluation by 5-fold cross validation,

method description, and result analysis of the classifier.

• 1 mark for classification accuracy award, which is to achieve a satisfactory classification

accuracy exceeding an accuracy threshold. This threshold will only be released to you after

the marking.

-----------------------------

Submission Checklist

A .zip file named as “34711-Cwk-S-DeepLearning” containing

• Three code files: Task1_Approach1, Task1_Approach2, Task2.

• One .pdf file, combining reports for both Task 1 and Task 2.

