11-751/18-781 Fall 2020

Homework 4

OUT: 27th October, 11:59 PM ET

DUE: 16th November, 12:00 PM ET

Collaboration Policy

Homeworks must be completed individually. You are allowed to discuss the homework assignment with other

students and collaborate by discussing the problems at a conceptual level. However, your answers to the

questions and any code you submit must be entirely your own. If you do collaborate with other students (i.e.

discussing how to attack one of the programming problems at a conceptual level), you must report these

collaborations in Problem 1 - Question 6. Your grade for homework 4 will be reduced if it is determined that

any part of your homework submission is not your individual work. No collaborations are permitted

on Problem 1 of Homework 4.

Collaboration without full disclosure will be handled in compliance with CMU Policy on Cheating and

Plagiarism: https://www.cmu.edu/policies/student-and-student-life/academic-integrity.html

Late Day Policy

You have a total of 3 late days that you can use over the semester for the 4 homework assignments. For

homework 4, no submissions will be accepted after November 18th at 12:00pm (ET), 2 days after the

homework deadline. If you do need a one-time extension due to special circumstances, please contact the

instructor (Ian Lane) via Piazza.

11-751/18-781 Fall 2020 : Homework 4 Problem 1

Compute Resources for Homeworks

You can complete the course homeworks either using your personal computers, other compute resources you

have access to, or you can choose to use one of the GHC machines that have been assigned for this course

(ghc50.ghc.andrew.cmu.edu - ghc69.ghc.andrew.cmu.edu). The GHC machines (ghc50 - ghc69) are Red

Hat Linux machines with 8-core i7-9700 CPUs, 16 GB of RAM and a GeForce RTX 2080 GPU. You can log

into a GHC machine as shown below:

$ ssh <andrewID>@ghc<machine-id>.ghc.andrew.cmu.edu

Password: <enter-your-andrew-password>

Note that one or more of the GHC machines could be offline at any time. If you are unable to log into a specific machine, try one of the other machines in the cluster. You can also use the "w" command to see how many other students are using a particular machine, for example:

$ ssh -t <andrewID>@ghc<machine-id>.ghc.andrew.cmu.edu w

12:50:08 up 6:03, 1 user, load average: 0.01, 0.04, 0.05

USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT

<andrewID> pts/2 <machineID> 12:50 0.00s 0.05s 0.00s w

For students who are new to a Linux programming environment, here are some commonly used commands:

https://www.cmu.edu/computing/services/comm-collab/collaboration/afs/how-to/unix-commands.pdf

Problems:

Problem 1

Short Answer Questions (60 pts)

Complete the 5 questions on the online assignment in Gradescope. Answers must be entered directly within

Gradescope itself. Please be as precise and concise in your answers as possible.

Remember to answer all 5 questions.

If you do collaborate with other students on Problem 2, (i.e. discussing how to attack one of the problems

at the conceptual-level) you must report these collaborations in Question 6.

Notes on Problem 1:

1. Question 5 is the Initial Submission question for Problem 2, and requires the upload of an image file containing your solution. Make sure this is in PNG format.

Gradescope Assignment Link: https://www.gradescope.com/courses/163942/assignments/762434

Problem 2

Sequence Models with Attention

Speech recognition can be regarded as a sequence transduction task with speech feature frames as inputs

and language tokens as outputs. Sequence models have an encoder-decoder architecture, where the encoder

obtains hidden representations from the input audio features, and the decoder produces language tokens in

an autoregressive fashion. Attention can be used to obtain alignments between the encoded outputs and

the decoder outputs at every decoding time step. In this assignment, you will implement an attention based

approach to speech recognition.

The primary reference paper we recommend for this assignment is Listen, Attend and Spell (LAS). The code for this assignment will be written using PyTorch, and we recommend the following resources to students who are new to the toolkit: PyTorch Documentation, PyTorch Tutorials, and in particular, the tutorial for Machine Translation using Sequence Models with Attention.

Handout Structure and Coding Instructions

Data

The data is in the asr_data directory of the handout, which contains three directories: train, dev, and test. Data that is input to the model and dataloader is stored in JSON format. Each JSON file contains utterance dictionaries, and each utterance has input and output keys, which contain the path to the Kaldi ark files and the character-level tokenized speech transcript, respectively. The JSON file for the unknown test set contains only the input speech features and no transcription.
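As a rough illustration of the data description above, here is a minimal sketch of reading such a file. The layout is an assumption: the top-level "utts" key, the "feat", "shape" and "token" field names, the utterance IDs, and the ark paths are all hypothetical, so inspect the real JSON in asr_data before relying on any of them.

```python
import json

# Hypothetical JSON content; the real files in asr_data may be laid out differently.
EXAMPLE_JSON = """
{
  "utts": {
    "utt_train_001": {
      "input": [{"feat": "feats.ark:12", "shape": [50, 83]}],
      "output": [{"token": "h e l l o <space> w o r l d"}]
    },
    "utt_test_001": {
      "input": [{"feat": "feats.ark:4096", "shape": [72, 83]}]
    }
  }
}
"""


def read_utterances(json_text):
    """Return a list of (utt_id, feature_path, tokens_or_None) tuples."""
    data = json.loads(json_text)["utts"]
    utterances = []
    for utt_id, entry in data.items():
        feat_path = entry["input"][0]["feat"]
        # The unknown test set has no transcription, so "output" may be absent.
        outputs = entry.get("output")
        tokens = outputs[0]["token"].split() if outputs else None
        utterances.append((utt_id, feat_path, tokens))
    return utterances


utterances = read_utterances(EXAMPLE_JSON)
```

Returning None for the missing transcript keeps one code path for all three partitions; the caller decides whether a reference is required.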

Code Template

The code template contains the following structure:

1. conf- The directory has configurations in YAML format for training and decoding

(a) base.yaml- The baseline configuration that can help you get the necessary scores for this assignment.

2. models

(a) encoder.py- Has the Neural Network Modules for the basic RNNLayer, pBLSTM, and Listener,

which is the LAS Encoder. You need to complete the forward method for the pBLSTM and

Listener Modules.

(b) decoder.py- Has the Neural Network Speller Module which is the LAS Decoder. It contains the

forward and greedy_decode methods that you need to fill. Optionally, to get better performance,

you could also implement the beam_decode method in this file.

(c) las_model.py- Has the Neural Network Module Wrapper for SpeechLAS, the sequence model with

attention.

(d) attention.py- Has Neural Network Modules that implement Location Based Attention

3. train.py- Interface for training the ASR Models. Has all the training options and default values listed.

4. decode.py - Interface for decoding the trained ASR Model. This will generate the decoded_test.txt

file that you need to submit to Gradescope.

5. utils.py- Contains loss, accuracy, and WER utilities and padding utilities for the neural network code.

6. trainer.py- Contains the Trainer object that performs ASR Training

7. requirements.txt- Contains the pip installable Python environment for this assignment

Your Gradescope Code Submission: decoded_test.txt

Setup on GHC Machines

The storage in your folder on the GHC machines is limited, so we have placed the data in a course folder.

You can create a softlink from that path to your working directory, so that you can use the speech data.

To prevent the quota issue when installing packages, first comment out the torch line in requirements.txt. The GHC machines have torch preinstalled, and we will use the preinstalled version. If you forget to comment out torch and hit a disk space quota issue, run rm -rf .cache/pip to clear your pip cache and rerun the instructions.

$ ssh <andrewID>@ghc<machine-id>.ghc.andrew.cmu.edu

$ ln -s /afs/andrew.cmu.edu/course/11/751/homework-4/asr_data .

$ python3 -m venv <env-name> --system-site-packages

$ source <env-name>/bin/activate

$ vim requirements.txt   # comment out the torch line

$ pip install --ignore-installed -r requirements.txt

To download the template code and get started:

$ wget https://www.andrew.cmu.edu/user/ianlane/11751/homework-4_code.zip

$ unzip homework-4_code.zip

$ cd homework-4_code/

$ python3 train.py --tag <expt-tag-name>

...

$ python3 decode.py --model-dir exp/train_<expt-tag-name>

...

Grading Scheme

This assignment will use the Gradescope leaderboard with WER on the unknown test set as the metric. For

this assignment, we have two deadlines:

1. Preliminary Deadline November 9th (10 points)- You will submit your output transcript for the

unknown test set to Gradescope, and submit an attention plot for utterance ID "4kac030n" to Part 1

of the Gradescope assignment by this deadline.

(a) Attention Plots in Part 1: 5 points

(b) "Reasonable" WER score (<= 60) for the unknown test set on the Gradescope Leaderboard: 5 points

2. Full Deadline November 16th (40 points)- You will submit your decoded hypothesis file.

(a) If your test WER <= 20 : 20 points

(b) 10 <= Test WER <= 20 : Maximum of 20 points with your test WER being interpolated between

10 and 20

Bonus Points: +10 points for the top 5 leaderboard entries with WER < 10
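For reference, the WER thresholds above are percentages. The sketch below shows the usual definition, word-level edit distance divided by the number of reference words. It is an illustrative implementation only; the function name is hypothetical, and the scoring used by the leaderboard or by utils.py may differ in details such as text normalization.

```python
def word_error_rate(reference, hypothesis):
    """Return WER in percent: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from the hypothesis
            insertion = curr[j - 1] + 1   # extra word in the hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / max(len(ref), 1)


example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, WER can exceed 100 when the hypothesis is much longer than the reference.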

Building a Sequence Model in Pytorch

To build a neural network model in Pytorch, we have the following important steps:

1. Data Preparation: Prepare the data by extracting speech features and tokenizing the text transcript

based on the unit you use to model speech. Then compile this data into an easily readable format such as JSON files. Partition the data for training, development and evaluation. All the above steps have been

done for you. For training, and development, you will have access to the ground truth text transcript

labels and speech features. For evaluation, you will use speech features to obtain predictions of the

text transcript from your model.

2. Data Loading and Batching: Build a Pytorch Dataset object that has the __getitem__() and

__len__() methods, which return a data sample given an element key, and the total number of data

samples in a partition respectively. Then we create a DataLoader with a custom batching mechanism.

We create batches of examples based on the input size. This means that we can have batches of

different sizes such that the total number of input feature floats (batch-bins) has a maximum for each

batch.

3. Trainer: Then we need to implement a trainer that performs the training. Within each epoch of

training, we have training and validation steps. In the train step, we load data from the data loader,

run it through our model, compute the Cross Entropy Loss, and then perform backpropagation. Based

on the computed gradients, we update the model parameters by calling the step() method of the optimizer. In the validation step, we load the validation data, do forward propagation through the model, and compute statistics: loss, accuracy, perplexity, and Word Error Rate (WER). After the training and

validation steps, we log statistics for the epoch, and save models.

4. Neural Network Building: This is the focus of the current assignment. Neural networks are written

in Pytorch using torch.nn package, which contains many standard neural network layers including

Linear, Convolutions, RNN, LSTM etc. As we are building an attention based sequence model, we

have three primary components: Encoder, Decoder, and Attention. Each of these components is

written as a torch.nn.Module which has an __init__ and forward method. The former defines the

important neural network layers within the module, and the forward method describes how forward

propagation will occur through the neural network given the input to find the outputs using the neural

network layers defined in the __init__ function.

5. Decoding and Search: After training the sequence model, you will use the trained model to produce

text transcriptions for an unknown test set. Hence, this would necessitate writing a decode function

within the decoder. Typically, beam search is used, and for this assignment, you will implement greedy

search, and optionally beam search.

Data and Experimental Setup

You will use the Wall Street Journal Corpus for training with an unknown test set. Characters have been

selected as the modelling units, and a dictionary file has been provided.

Batching, data-loading, and training code has been provided in the template. You can use the "batch-bins"

option to create batches as per your GPU.

Building the LAS Model

A LAS model with attention has three important components: the Listener (encoder), the Speller (decoder), and the attention module. The Listener consists of a Pyramidal Bi-LSTM Network structure that takes in the given utterances and compresses them to produce high-level representations for the Speller network.

The Speller takes in the high-level feature output from the Listener network and uses it to compute a

probability distribution over sequences of characters using the attention mechanism. Attention intuitively

can be understood as trying to learn a mapping from a word vector to some areas of the utterance map.

The Listener produces a high-level representation of the given utterance and the Speller uses parts of the

representation (produced from the Listener) to predict the next token in the sequence.

Inputs, Batching and Padding

In sequence models that use text, videos, speech etc., each example input is two-dimensional, and has a

sequence length component and a frame feature dimension. So, a speech feature input of shape [50,83]

means we have 50 frames of audio inputs, and 83 dimensional features for each frame. In training a sequence

model, we can train and update model parameters using batches, or using individual examples. Batches are

the preferred means to train the model.

Batches can be created with a fixed batch size (all the batches have the same number of elements) or variable

batch sizes (all batches can have different numbers of elements). Variable batch sizes are more efficient for

sequence inputs, and are used in this assignment. It is efficient to create batches of examples that have

the same or similar sequence lengths, as this maximizes compute efficiency. In this assignment, we create

batches based on bins, i.e., the total number of floating point values within a batch.
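To make the idea of bin-based batching concrete, here is a minimal sketch that groups utterances so the number of padded input floats per batch stays under a limit. The function name and the exact cost rule (batch size times longest length times feature dimension) are assumptions; the batching code in the template may count bins differently.

```python
def make_batches(shapes, batch_bins):
    """Group utterance indices so that padded floats per batch stay <= batch_bins.

    shapes: list of (num_frames, feature_dim) tuples, one per utterance.
    """
    # Sorting by length puts similar-length utterances together, reducing padding.
    order = sorted(range(len(shapes)), key=lambda i: shapes[i][0])
    batches, current = [], []
    for idx in order:
        frames, dim = shapes[idx]
        # After padding, every item costs as much as the longest one in the batch.
        # Because of the sort, the newest utterance is the longest so far.
        padded_cost = (len(current) + 1) * frames * dim
        if current and padded_cost > batch_bins:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


shapes = [(50, 83), (400, 83), (60, 83), (380, 83), (55, 83)]
batches = make_batches(shapes, batch_bins=40000)
```

With these toy shapes the three short utterances share one batch while each long utterance gets its own, which is the variable-batch-size behaviour described above.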

Within a batch, it is possible that the multiple input elements have different sequence lengths. To be able to

create tensors of shape [batch_size,sequence_length,feature_dimension], we need to pad the inputs whose

length is less than the maximum sequence length of all elements of the batch along the sequence_length

dimension. We pad the audio features with zero (the parameter audio_pad in train.py) to create the audio

input xs, while remembering the actual lengths of the inputs in the parameter xlens.

The outputs for speech recognition are sequences of language tokens with one dimension representing

sequence_length. Different elements in a batch would have different output_lengths, so we also need

to pad the shorter output sequences to be able to create a tensor (ys_ref in las_model.py) of shape

[batch_size,max_sequence_length]. We pad with -1 (the parameter text_pad in train.py) in the template code. We record the actual output lengths in the variable ylen in las_model.py. We

need to mask the encoder outputs while computing the attention to not include the padded input elements,

and mask the decoded outputs while computing the loss so as to not include the padded output elements.

All of this has been done for you in the provided code template.
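Although the template already handles padding, a small sketch may help when reading that code. It uses plain Python lists in place of tensors; the helper names are hypothetical, while the pad values 0 and -1 follow the audio_pad and text_pad defaults described above.

```python
def pad_features(feature_list, pad_value=0.0):
    """Pad a list of [frames][dim] sequences to a common length; return (xs, xlens)."""
    xlens = [len(feats) for feats in feature_list]
    max_len, dim = max(xlens), len(feature_list[0][0])
    xs = [feats + [[pad_value] * dim for _ in range(max_len - len(feats))]
          for feats in feature_list]
    return xs, xlens


def pad_tokens(token_list, pad_value=-1):
    """Pad token-id sequences to a common length; return (ys, ylens)."""
    ylens = [len(tokens) for tokens in token_list]
    max_len = max(ylens)
    ys = [tokens + [pad_value] * (max_len - len(tokens)) for tokens in token_list]
    return ys, ylens


features = [[[0.1, 0.2]] * 3, [[0.3, 0.4]] * 1]   # two utterances, 2-dim features
xs, xlens = pad_features(features)
ys, ylens = pad_tokens([[5, 6, 7], [8]])
```

Keeping xlens and ylens next to the padded data is what later allows the attention and the loss to skip the padded positions.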

Pyramidal Encoder: Listener

The encoder code is to be written in models/encoder.py in the template.

The Listener contains an initial LSTM layer and Pyramidal BiLSTM layers. The Pyramidal BiLSTM

subsamples the feature input of size [Batch_size,Sequence_length,feature_dimensions] along the sequence

length axis by factor p = 2, while increasing the feature dimension by the same subsampling factor. In

order to write the code for the Listener, we need two basic components: the RNNLayer Module, and the

pBLSTM Module. The RNNLayer Module is the implementation of an RNN within Pytorch, and can

implement N layers of GRU/LSTM/RNN. While implementing an RNN Module in Pytorch over padded

variable length inputs, you can use the pack_padded_sequence and pad_packed_sequence utilities from

Pytorch to implement it efficiently. This has been done for you as an example.

The pBLSTM Module implements a single layer of LSTM with Pyramidal Subsampling. The idea here is

that if you get a sequence input of shape [batch,inp_sequence_length,hidden_dim], then you reshape the

input to reduce the sequence_length by factor p = 2 and multiply the hidden_dim by the same factor p.

Next, you pass this modified input through an RNNLayer module to get the output of the pBLSTM. You

will observe that if you are using Pyramidal Subsampling and a bidirectional encoder, the resulting

output has four times the hidden dimension you provide for the pBLSTM module.
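The reshape step described above can be sketched for a single utterance with plain lists. This is an illustration, not the required pBLSTM implementation: the function name is hypothetical, and dropping a trailing frame when the length is odd is an assumption about how odd lengths are handled.

```python
def pyramid_reshape(sequence, length, factor=2):
    """Concatenate every `factor` consecutive frames of one padded utterance.

    sequence: list of frames, each a list of floats (padded length T, dimension D).
    length:   number of valid (unpadded) frames.
    Returns (new_sequence, new_length); new_sequence has shape [T // factor][D * factor].
    """
    # Drop trailing frames that do not fill a complete group.
    usable = (len(sequence) // factor) * factor
    merged = []
    for start in range(0, usable, factor):
        stacked = []
        for frame in sequence[start:start + factor]:
            stacked.extend(frame)
        merged.append(stacked)
    return merged, length // factor


frames = [[1, 10], [2, 20], [3, 30], [4, 40], [5, 50]]  # T=5, D=2
merged, new_len = pyramid_reshape(frames, length=5)
```

On a contiguous batch tensor of shape [batch, T, D] with T divisible by 2, the same regrouping should be expressible as a single reshape to [batch, T // 2, 2 * D], after which the result is passed to the RNNLayer together with the halved lengths.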

The Listener Module can be implemented using (a) an initial RNNLayer that maps the input feature dimension to a hidden dimension, and (b) a sequence of pBLSTM layers that subsample the sequence_length dimension as mentioned.

This is what the Listener should do:

L_1, Llens_1 = RNNLayer(xs, xlens)
for i = 1, 2, ..., Ne; do
    L_{i+1}, Llens_{i+1} = pBLSTM(L_i, Llens_i)
done
hs, hlens = L_{Ne+1}, Llens_{Ne+1}

where Ne is the number of encoder pBLSTM layers (elayers in train.py), xs is the padded input feature

tensor, xlens is the lengths of the inputs before padding, the encoder outputs are hs and hlens, obtained

from the last pBLSTM layer.

Number of Encoder Layers, Subsampling factor: Generally, in training attention models with speech features extracted every 25ms with 10ms overlap, as in this case, a subsampling factor of up to 8 can be used, which means a maximum of 3 pBLSTM layers.
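The relationship between the number of pBLSTM layers and the encoder output lengths can be checked with a few lines of arithmetic. The sketch assumes each layer halves the length with floor division; the function name is hypothetical.

```python
def encoder_output_lengths(xlens, num_pblstm_layers, factor=2):
    """Lengths of the encoder outputs after stacked pyramidal subsampling."""
    hlens = list(xlens)
    for _ in range(num_pblstm_layers):
        hlens = [length // factor for length in hlens]
    return hlens


xlens = [800, 643, 50]
hlens = encoder_output_lengths(xlens, num_pblstm_layers=3)
total_factor = 2 ** 3   # three layers with factor 2 give an overall factor of 8
```

Short utterances shrink quickly: 50 input frames leave only 6 encoder states after three layers, which is one reason not to subsample further.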

Projection Layers in the Encoder: You can choose to add projection layers between the pBLSTMs that

reduce the hidden dimension. A pBLSTM layer increases the hidden dimension four-fold, so we can use a

projection layer to reduce the hidden dimension four-fold before using it as input to the next pBLSTM.

Attention Module

The attention code is provided in models/attention.py in the template.

Attention has many formulations- Content based, Location based or hybrid methods. Content based attention

was used in the LAS paper, but in this assignment, we are providing location based attention,

proposed in Attention based Models for Speech Recognition. Refer to Section 2.1 in that paper for a detailed

mathematical treatment of location based attention. Fundamentally, location based attention computes the

attention based on a projection of the encoder output (key), a projection of the decoder state (query), and a

convolution followed by projection of the previous attention weights. The code for location based attention

has been provided in attention.py.
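Whatever scoring function produces the attention energies, the last two steps are the same: normalize over the valid encoder frames only, then take the weighted sum of encoder outputs as the context. The sketch below illustrates just those steps for one utterance with plain lists. It does not implement the location based scoring itself, and the function name is hypothetical.

```python
import math


def masked_attention(energies, hs, hlen):
    """Normalize energies over the valid frames and return (weights, context).

    energies: one unnormalized score per encoder frame (length T, padding included).
    hs:       encoder outputs for one utterance, shape [T][D].
    hlen:     number of valid frames; frames at index >= hlen are padding.
    """
    valid = energies[:hlen]
    peak = max(valid)  # subtract the max for numerical stability
    exps = [math.exp(e - peak) for e in valid]
    total = sum(exps)
    # Padded frames get exactly zero weight.
    weights = [x / total for x in exps] + [0.0] * (len(energies) - hlen)
    dim = len(hs[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, hs)) for d in range(dim)]
    return weights, context


energies = [0.0, 0.0, 5.0]   # the last frame is padding and must be ignored
hs = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
weights, context = masked_attention(energies, hs, hlen=2)
```

Without the mask, the large energy on the padded frame would dominate the softmax and pull the context toward padding values.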

Alternative Attention methods: You could use different attention modules and parameters to possibly

improve your WER performance on the leaderboard.

Autoregressive Decoder: Speller

The decoder code is to be written in models/decoder.py in the template.

The Speller is an LSTM based auto-regressive decoder, i.e., we use the previously decoded outputs in the

current decoding step. At every decoding time step, we compute the Attention Context and weights based

on the current decoder state, the encoder output, and previous attention weight; then concatenate the

attention context with the character embedding of the previous output (ground-truth reference or model

output), and then use this concatenated vector as the input to the first decoder layer. Because we have to

compute a different attention context and embedding at each time step based on the previous time-step, we

cannot use torch.nn.LSTM/torch.nn.GRU as in the Listener to implement this. Instead, we have to use a

torch.nn.LSTMCell/torch.nn.GRUCell within Pytorch, and loop over all decoding time-steps during training

and decoding.

The forward method of the Speller has a loop over the maximum target token sequence length in the batch.

This is what the Speller decoding loop should do:

for i = 1, 2, ..., maxlen; do
    Embedding = EmbeddingLayer(ys_{i-1})

The RNNForward method of the decoder goes through the d layers of the LSTM decoder, with the input to

the first layer being the concatenated attention context and token embedding.

The greedy_decode method performs batch greedy decoding using the encoder outputs, and decoding parameters

maxlenratio and minlenratio. In training, we know the maximum length of tokens that we can

output in each batch, but that is not the case for decoding. Therefore, we can either (a) use a fixed maximum

token length for decoding (say 200), or (b) assume that due to the correlation between length of audio and

length of transcript, the maximum length we need for each batch is a fraction of the length of the hidden

states hlens. We recommend option (b), and maxlenratio is the fraction that defines what fraction of the

maximum value in hlens should be the maximum decoding length for the batch. In greedy decoding, we

have a loop over decoding timesteps similar to the forward method, and the main difference is that while

computing embeddings we can only use the previously decoded token. We compute the logits as in the forward method, then apply log-softmax to the logits and take the argmax over the token vocabulary. A decoded output is considered complete once <eos> has been produced. This method returns

a list of token indices corresponding to the decoded tokens for all elements in a batch.
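The control flow of greedy decoding for a single utterance can be sketched as follows. Here step_fn is a hypothetical stand-in for one full decoder step (embedding, attention, RNN, output layer), toy_step is a scripted toy model, and minlenratio is left out because its exact use in the template is not described here. A batched version additionally has to track which utterances have already finished.

```python
def greedy_decode(step_fn, hlen, eos_id, maxlenratio=0.5):
    """Greedy search for one utterance.

    step_fn(prev_token, state) -> (scores, new_state) stands in for one decoder step.
    hlen is the number of encoder states; it bounds the decoding length.
    """
    maxlen = max(1, int(maxlenratio * hlen))
    prev_token, state, decoded = eos_id, None, []   # <sos> and <eos> share one id
    for _ in range(maxlen):
        scores, state = step_fn(prev_token, state)
        # Log-softmax is monotonic, so its argmax equals the argmax of the raw scores.
        prev_token = max(range(len(scores)), key=scores.__getitem__)
        if prev_token == eos_id:
            break
        decoded.append(prev_token)
    return decoded


def toy_step(prev_token, state):
    # Toy decoder over a 4-token vocabulary: emits 3, then 1, then <eos> (id 0).
    step = 0 if state is None else state
    script = [3, 1, 0]
    target = script[min(step, len(script) - 1)]
    scores = [0.0 if tok == target else -10.0 for tok in range(4)]
    return scores, step + 1


tokens = greedy_decode(toy_step, hlen=20, eos_id=0)
```

If maxlenratio is too small, the loop stops before <eos> is produced and the transcript is cut short, so the value is worth checking against typical transcript lengths on the dev set.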

Adding SOS and EOS to reference: In the decoder, we use the start of sentence <sos> token to mark the beginning of decoding, and end of sentence <eos> to mark the end of decoding. Both of these are represented by the same token in the template code. While performing decoding for the first time step, you need to use the embedding for <sos>, which you consider the previous output. When the model produces the <eos> token, you consider decoding complete. Therefore, <eos> needs to be added to the reference while computing the loss because you want the model to produce <eos> to indicate the end of decoding. In the decoder forward method, you will use the reference ys and ylen to create a new reference with the <eos> padding and return that as the ground-truth reference for loss computation along with the logits.

Pre-computing Ground Truth Embeddings: In the decoder loop, at each time-step, you compute the embeddings of the previous ground truth token. It is recommended that you pre-compute these embeddings outside the decoding loop, and access the i-th element as needed within the loop. To pre-compute these embeddings, you need the reference token list ys, but with the <sos> token prepended to the start.
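One possible way to build the two sequences discussed above is sketched below with plain lists: decoder inputs are the reference with <sos> in front, and loss targets are the reference with <eos> at the end. The function name and the token id 30 are hypothetical. Padding the inputs with <eos> (so every position is a valid embedding index) and the targets with the ignore value -1 is an assumption about the padding scheme.

```python
def make_decoder_io(ys, ylens, sos_id, eos_id, pad_id=-1):
    """Build decoder inputs (<sos> + ys) and loss targets (ys + <eos>) per utterance."""
    max_len = max(ylens) + 1   # one extra position for <sos> / <eos>
    ys_in, ys_out = [], []
    for tokens, length in zip(ys, ylens):
        real = tokens[:length]                       # strip existing padding
        num_pad = max_len - length - 1
        # Inputs are padded with <eos> so that embedding lookups never see -1.
        ys_in.append([sos_id] + real + [eos_id] * num_pad)
        # Targets are padded with pad_id so the loss can ignore those positions.
        ys_out.append(real + [eos_id] + [pad_id] * num_pad)
    return ys_in, ys_out


ys = [[5, 6, 7], [8, -1, -1]]
ys_in, ys_out = make_decoder_io(ys, ylens=[3, 1], sos_id=30, eos_id=30)
```

With ys_in available up front, the embeddings for all time steps can be computed in one call before the decoding loop and indexed by step inside it.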

Scheduled Sampling: In the Speller, the token embeddings that you feed in at training time come from the ground truth reference tokens of the previous time-step (teacher forcing). However, at decoding time, you use the predicted tokens from the previous time-step to compute the embeddings. This mismatch between training and decoding, often called exposure bias, can be addressed by scheduled sampling. The idea is to have a parameter ssprob, where at every decoding time-step during training, with probability ssprob you use the previously decoded output, and with probability 1 - ssprob you use the ground truth reference token to compute the embedding. Using scheduled sampling speeds up model convergence, and it is beneficial to (a) use the same value of ssprob through all epochs, or (b) gradually increase the dependence on previously decoded tokens.
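The per-step choice in scheduled sampling amounts to a single biased coin flip. A minimal sketch, with a hypothetical helper name and string placeholders standing in for token ids:

```python
import random


def choose_previous_token(reference_token, predicted_token, ssprob, rng):
    """Pick the token whose embedding is fed to the next decoder step in training."""
    if rng.random() < ssprob:
        return predicted_token   # the model's own previous output
    return reference_token       # ground truth reference token


rng = random.Random(0)
choices = [choose_previous_token("ref", "pred", ssprob=0.3, rng=rng)
           for _ in range(10000)]
fraction_predicted = choices.count("pred") / len(choices)
```

Setting ssprob to 0 always feeds the reference token, and setting it to 1 always feeds the model output; option (b) above corresponds to raising ssprob over the epochs.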

Decoder Dropout: Dropout is a regularization strategy that mitigates the impact of overfitting, and we

recommend using dropout in the decoder.

Multiple Decoder Layers: It might be beneficial to add more layers to the decoder, but we recommend

that you get it working with a single decoder layer first.

Loss, Computing Statistics

In the utils file, we have a StatsCalculator function that computes the CrossEntropyLoss, the accuracy,

perplexity and Word Error Rate on the validation set. The CrossEntropy loss takes in the logits from the

decoder, and the modified reference with <eos>, and ignores the padded elements in the reference using

the ignore_id option.
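To show what ignoring padded elements means numerically, here is a list-based sketch of a cross-entropy that skips positions whose target equals ignore_id. It is for illustration only; the function name is hypothetical, and in PyTorch the same effect is normally obtained through the ignore_index argument of torch.nn.CrossEntropyLoss.

```python
import math


def masked_cross_entropy(logits, targets, ignore_id=-1):
    """Mean cross-entropy over positions whose target is not ignore_id.

    logits:  [positions][vocab] unnormalized scores.
    targets: [positions] token ids, with ignore_id marking padding.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, targets):
        if target == ignore_id:
            continue  # padded position: contributes nothing to the loss
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]   # equals -log softmax(scores)[target]
        count += 1
    return total / max(count, 1)


logits = [[0.0, 0.0, 0.0, 0.0], [2.0, 0.0, 0.0, 0.0], [100.0, 0.0, 0.0, 0.0]]
targets = [1, 0, -1]   # the third position is padding
loss = masked_cross_entropy(logits, targets)
```

The third position has a very confident score but a padded target, so it changes neither the sum nor the count used for averaging.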

We have support for Tensorboard logging of important values like loss, accuracy, and WER. We also provide

support for logging Tensorboard attention plots and gradient plots.

Decoding and Search

The file decode.py is the wrapper for decoding your trained model. It calls the decode_greedy method

in las_model.py, which in turn looks at the greedy_decode in decoder.py. The decode_greedy method in

las_model.py returns a list of length batchsize of lists, each of which contains the decoded tokens for that

element of the batch. Then, in decode.py, we convert these decoded tokens to their corresponding character

tokens, and replace "<space>" with " ". Then we write the decoded outputs in the format of <utt-id>

<text> to decoded_hyp.txt, which is what you should submit on Gradescope.
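The conversion from decoded token indices to output lines can be sketched as follows. The id_to_char mapping and utterance IDs are hypothetical stand-ins for the provided dictionary file and the real IDs, and the function names are illustrative rather than those used in decode.py.

```python
def tokens_to_text(token_ids, id_to_char):
    """Map token ids to characters and turn "<space>" into a real space."""
    chars = [id_to_char[token_id] for token_id in token_ids]
    return "".join(" " if char == "<space>" else char for char in chars)


def format_hypotheses(utt_ids, decoded_batch, id_to_char):
    """Return one "<utt-id> <text>" line per utterance."""
    return [f"{utt_id} {tokens_to_text(tokens, id_to_char)}"
            for utt_id, tokens in zip(utt_ids, decoded_batch)]


id_to_char = {1: "<space>", 2: "A", 3: "B", 4: "C"}   # hypothetical dictionary
lines = format_hypotheses(["utt1", "utt2"], [[2, 3, 1, 4], [4]], id_to_char)
```

Each returned string corresponds to one line of the hypothesis file.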

Debugging Strategies and Common Issues

Initial Value of CELoss

Print out the cross entropy loss you compute for the first batch without training. For N way classification,

the initial cross-entropy loss should be ln(N).
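The reasoning is that an untrained model assigns roughly the same probability 1/N to every class, so the loss is -ln(1/N) = ln(N). A quick check, using a hypothetical vocabulary size of 50:

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy when every class gets the same probability 1 / num_classes."""
    probability = 1.0 / num_classes
    return -math.log(probability)


expected_initial_loss = uniform_cross_entropy(50)   # 50 is a hypothetical vocabulary size
```

If the first-batch loss is far from this value, the usual suspects are the reference construction, the padding mask, or the output layer size.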

Gradient Plots

The gradient plot is an important debugging tool. The trainer file has code that outputs the gradient plots

via Tensorboard and saves it as a file. The gradient plot has min, max, and mean gradients. You want to

make sure that after a few epochs of training, you have gradients across all decoder layers, through the

attention module, and in the encoder. If you don’t have gradients for particular layers, check your forward

implementation to make sure you have written the code correctly.

Attention Plots

In ASR, attention models should learn to produce monotonic attentions across training. Monitor your

attention plots to ensure that you have reasonably trained models.

Figure 1: Example of Reasonable Gradients during training

Other Tips

1. It is challenging to debug in such large code bases. Please make sure that your code is highly modular, and you have different functions to implement what you need. Make sure all variable names are descriptive (e.g., using attention_wt rather than a), and that you have comments in your code describing what you did. We would be able to help debug your code only if the code is clean, modular, and descriptive.

2. Use small parameter sizes while building your models, as large models cannot be trained well with limited data.

3. Use regularization strategies such as weight decay and dropout to address overfitting.

4. You can also use the blog post by Andrej Karpathy here to understand how to build and tune neural

networks.

5. Monitor the training curves on Tensorboard to identify over-fitting and under-fitting, and modify

optimizer parameters as you see fit.

How to improve my WER and earn Bonus points?

Here, we list additional ways that could help you improve performance and target the bonus points for top

scores:

1. Add CTC Loss to Attention, and train with joint CTC Attention

2. Implement Beam Search on your own to improve decoding scores

3. Use your code from HW3 to train a character language model as opposed to word language model.

At each decoding step, get the probability distribution over the token vocabulary from the language

model, and from LAS, and weight both contributions to make a final decision on decoding.

4. Pretrain Decoder as a Language Model to produce character tokens, and then train LAS.

5. Use Transformer Models rather than LSTM based LAS

6. Modify the pyramidal subsampling to convolutional sub-sampling with LSTM layers

Good luck and enjoy the challenge!
