
Out: Mar 30 2019

Due: Apr 13 2019

EE 519: Speech Recognition and Processing for Multimedia

Spring 2019

Homework 5

There are 2 problems in this homework, with several questions. Please make sure to show the details of your work for each question. Answers without any justification may not receive credit.

Return:

  • A PDF file with your solutions, including any code you wrote (in an Appendix at the end of your file) and any plots asked for.
  • Your Matlab source files, which we should be able to run.

All your files should be uploaded to the corresponding dropbox on the D2L platform. If you make multiple submissions, only the last one will be evaluated.

Problem 1

Solve questions (a) and (b) of this problem either by hand or by implementing the Viterbi algorithm in Matlab.

Consider a 5-state HMM with parameters λ used to model a sequence of voiced (V) and unvoiced (U) sounds. Assume equal transition probabilities and equal initial state probabilities:

aij = 0.2, i, j = 1, 2, 3, 4, 5        πi = 0.2, ∀i = 1, 2, 3, 4, 5

The observation probabilities are

        State 1   State 2   State 3   State 4   State 5
P(V)      0.2       0.6       0.8       0.25      0.2
P(U)      0.8       0.4       0.2       0.75      0.8

We observe the sequence O = V V UUUV V V UU.

(a) What is the most probable state sequence Q*?

(b) What is the joint probability of the observation sequence and the most probable state sequence, P* = P(O, Q*|λ)?

(c) What is the probability that the observation sequence was generated only by State 1?

(d) What is the average number of steps in a given HMM state? Note that at every “step” the model emits a single observation.
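If you take the Matlab route for questions (a) and (b), a minimal log-domain Viterbi sketch for this particular HMM might look like the following (variable names are illustrative, not required by the assignment):

```matlab
% Viterbi decoding for the 5-state HMM of Problem 1, in the log domain.
A  = 0.2 * ones(5, 5);            % equal transition probabilities
pi = 0.2 * ones(5, 1);            % equal initial state probabilities
B  = [0.2 0.6 0.8 0.25 0.2;      % row 1: P(V) for states 1..5
      0.8 0.4 0.2 0.75 0.8];     % row 2: P(U) for states 1..5
O  = [1 1 2 2 2 1 1 1 2 2];      % encode V=1, U=2 for O = VVUUUVVVUU

T = numel(O); N = 5;
delta = zeros(N, T); psi = zeros(N, T);
delta(:, 1) = log(pi) + log(B(O(1), :)');
for t = 2:T
    for j = 1:N
        [delta(j, t), psi(j, t)] = max(delta(:, t-1) + log(A(:, j)));
        delta(j, t) = delta(j, t) + log(B(O(t), j));
    end
end
[logP, q] = max(delta(:, T));     % best final state, then backtrack
Q = zeros(1, T); Q(T) = q;
for t = T-1:-1:1, Q(t) = psi(Q(t+1), t+1); end
fprintf('Q* = %s, log P(O,Q*|lambda) = %.4f\n', mat2str(Q), logP);
```

Working in the log domain avoids numerical underflow; exponentiate logP if you want the probability itself for question (b).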

Problem 2

For this Problem, you will implement an ASR system for recognition of isolated spoken digits following the HMM/GMM paradigm. To that end, you will use the same dataset you used in HW4, together with the MFCCs you created. For your convenience, the MFCCs are provided in the file mfccs_hw4.mat, but you are welcome to use your own features from HW4. In mfccs_hw4.mat you will find four objects, each one containing the recordings of one speaker. For example, mfcc_jackson is a cell array where mfcc_jackson{i}{j} is a 2D array with the MFCCs for Jackson's j-th recording of the digit i. You will need to download the Matlab toolboxes found on

https://www.cs.ubc.ca/~murphyk/Software/HMM/hmm_download.html and add them to your path.

Note: A few functions in those toolboxes have the same names as built-in Matlab functions. This may cause problems if you want to use the corresponding built-in function. For example, to use the built-in function assert, you can just rename the corresponding function in KPMtools to assert2.m.

You will run three experiments:

(i) Speaker-dependent ASR (unique speaker): You will randomly partition Jackson's recordings into training (70%) and test (30%) sets, keeping them balanced in terms of the different classes (digits). You will train the ASR system using your training set and evaluate on the test set.
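The balanced 70/30 split described above can be sketched as follows (recs and the 13 x Nf layout are assumed names, not part of the assignment):

```matlab
% Balanced random 70/30 split, digit by digit (illustrative sketch).
% recs{i} is assumed to be the cell array of recordings for digit i-1.
train = {}; test = {};
for i = 1:10
    n = numel(recs{i});
    p = randperm(n);                 % random permutation of this class
    k = round(0.7 * n);              % ~70% of this digit's recordings
    train = [train, recs{i}(p(1:k))];
    test  = [test,  recs{i}(p(k+1:end))];
end
```

Splitting within each digit, rather than over the pooled recordings, is what keeps the class proportions (approximately) equal in both sets.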

(ii) Speaker-dependent ASR (multiple speakers): You will randomly partition all the recordings into training (70%) and test (30%) sets, keeping them balanced in terms of the different classes (digits). You will train the ASR system using your training set and evaluate on the test set.

(iii) Speaker-independent ASR: You will use all of Jackson's recordings as your test set and all the other recordings as your training set (leave-one-speaker-out scenario). You will train the ASR system using your training set and evaluate on the test set.

PART 1: Initialization

(a) Because of the limited vocabulary (only 10 words), instead of phonemes or triphones, we can use the entire words as the fundamental linguistic units that will be modeled through HMMs. So, 10 models need to be trained, each one with Nd states, d = 0, 1, ..., 9, corresponding to the digits 0, 1, ..., 9. Because of the different acoustic complexity of each word, Nd will not be the same ∀d. For example, the word five sounds more complex than the word two, so we will use more states to model the former. Using the CMU Pronouncing Dictionary1, find the number of phonemes that each word of your vocabulary is composed of. For example, the phonetic representation of the word six is S IH K S, so the word is composed of 4 phonemes. Use 4 states for the words with one phoneme (if such a word exists in your vocabulary). For each additional phoneme, add one state. For example, the word six should be modeled by an HMM with 4+3=7 states, so N6 = 7. Report the values of Nd, d = 0, 1, ..., 9.

(b) You will use left-right, linear HMMs, like the one shown in Figure 1.

Figure 1: 5-state linear HMM.

Given this topology, what is a reasonable initialization of the transition matrix

    [ a11  a12  a13  a14 ]
    [ a21  a22  a23  a24 ]
    [ a31  a32  a33  a34 ]
    [ a41  a42  a43  a44 ]

for a 4-state HMM? What about a 6-state HMM? For the initial state probabilities, you will use

    πi = 1 if i = 1 (1st state), and πi = 0 otherwise.

(c) A possible observation at a given state of the HMM is a vector of 13 MFCCs. The probability of an observation at a given state will be modeled by a Gaussian Mixture Model (GMM) with three 13-dimensional Gaussians. In order to initialize the models, you will need a set of data from which the sample means and covariances will be estimated.

Because of the topology used, it is not reasonable to assume that all the available features from a recording are generated from any state of the HMM with equal probability. Instead, the first few MFCC vectors (first few frames) are expected to be generated from the first state of the HMM, the subsequent few vectors from the second state, etc. A rather crude first estimate is to assume that each recording for the digit d is uniformly partitioned into Nd (almost) equal parts, where each part has about Nf/Nd MFCC vectors (frames). Remember from the notation used in HW4 that Nf is the number of frames for a particular recording. Following this procedure for all the training recordings and all the digits, you will have the data samples needed for the initialization of the distributions modeling all the states of the 10 HMMs. Report the number of data samples you got through this process for states 1, 2, ..., 7 of the HMM corresponding to the digit 6. Report those 7 numbers for the speaker-independent setting. What about the digit 3? Using a diagonal covariance matrix for each Gaussian, at this point you have everything you need to initialize your models with the function mixgauss_init.

Note: If you cannot uniformly partition your data as described, you can use all the available training data for a particular digit to initialize all the states, and continue to the next steps. That way, you may even get a higher accuracy!
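The uniform state assignment described above can be sketched as follows (the variable names and the 13 x Nf orientation of the MFCC matrices are assumptions for illustration):

```matlab
% Crude uniform partition of each training recording's frames across the
% Nd states of the HMM for one digit (illustrative sketch).
Nd = 7;                               % e.g. number of states for digit 6
state_data = cell(1, Nd);             % state_data{s}: pooled 13 x (#frames)
for r = 1:numel(train_mfccs)          % train_mfccs{r}: 13 x Nf MFCC matrix
    X  = train_mfccs{r};
    Nf = size(X, 2);
    edges = round(linspace(0, Nf, Nd + 1));   % Nd (almost) equal parts
    for s = 1:Nd
        state_data{s} = [state_data{s}, X(:, edges(s)+1:edges(s+1))];
    end
end
% Each state_data{s} can then seed the GMM for state s via mixgauss_init.
```

The sizes size(state_data{s}, 2), s = 1, ..., Nd, are exactly the per-state sample counts the question asks you to report.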

1. http://www.speech.cs.cmu.edu/cgi-bin/cmudict?


PART 2: Training

The parameters of the models are obtained as their maximum likelihood estimates. Since a closed-form solution does not exist, in practice this is done through the Expectation-Maximization (EM) algorithm. Use the function mhmm_em for that purpose, giving the right inputs, as explained in the comments of the function. You can leave the arguments thresh and max_iter at their default values. Continue using diagonal covariance matrices. Running this function for each HMM, you will have the log likelihoods, as well as the trained parameters of the HMMs and the GMMs.
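A rough sketch of one such training call, assuming the initialization from Part 1 is already in place (train_data_d and the model variable names are illustrative):

```matlab
% Train the HMM for one digit d with EM (illustrative sketch).
% prior0/transmat0 come from Part 1(b); mu0/Sigma0/mixmat0 from
% mixgauss_init in Part 1(c); train_data_d is a cell array of
% 13 x Nf MFCC matrices for digit d.
[LL, prior1, transmat1, mu1, Sigma1, mixmat1] = ...
    mhmm_em(train_data_d, prior0, transmat0, mu0, Sigma0, mixmat0, ...
            'cov_type', 'diag');      % keep diagonal covariances
% LL(k) is the log likelihood after EM iteration k -- plot it for Part 2(a).
```

Repeating this loop over d = 0, ..., 9 gives the 10 trained models and the 10 log-likelihood curves for question (a).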

(a) Plot, on the same graph, the log likelihood as a function of the iteration number (of the EM algorithm) for the 10 models, for the speaker-independent setting.

(b) Here, you will verify that the GMMs have been suitably trained to model the true underlying distribution of the training data in the 13-dimensional space of MFCCs. To do so, use the Viterbi algorithm to find the most probable state sequence for all the recordings of the training set. For that step, you will need the functions mixgauss_prob and viterbi_path. After this, for the digit 6, collect all the feature vectors (MFCCs) which have been mapped to the i-th state of the corresponding HMM. Plot the histogram of the 2nd MFCC for the 3rd state of the corresponding HMM. Plot, additionally, the probability distribution corresponding to that state, as modeled by the trained GMMs. Of course, the GMMs you have used are composed of 13-dimensional Gaussians. How can you generate your plot only for the 2nd MFCC (1-dimensional)? For your plot, you can implement your own functions or you can use the Plot_GM function found here: https://www.mathworks.com/matlabcentral/fileexchange/8793-plot_gm. Your plots should correspond to the speaker-independent setting. Does this probability distribution represent the underlying sample distribution as visualized by the histogram?
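The Viterbi alignment step for one recording can be sketched as below (X and the pooling variable are illustrative names; the trained parameters are those returned by mhmm_em):

```matlab
% Most probable state sequence for one training recording (sketch).
% X: 13 x Nf MFCC matrix of the recording for digit 6.
B    = mixgauss_prob(X, mu1, Sigma1, mixmat1);  % state-wise obs likelihoods
path = viterbi_path(prior1, transmat1, B);      % 1 x Nf state indices
state3_frames = X(:, path == 3);                % frames aligned to state 3
% Pool state3_frames over all training recordings of digit 6, then
% histogram row 2 (the 2nd MFCC) against the state-3 GMM.
```

Concatenating such pools over every recording of the digit gives the per-state sample sets needed for the histogram.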

PART 3: Evaluation

(a) For each recording in the test set, find the probability that the sequence of observations (MFCCs) of that recording was generated by each of the 10 trained HMMs, using the function mhmm_logprob. Assign to the recording the digit corresponding to the HMM which gave the maximum probability. For the three experiments (i), (ii), (iii), report the accuracy of your system.

Accuracy = (number of correctly classified recordings) / (total number of recordings)

Additionally, for experiment (iii), report the F1 score for each of the 10 classes (digits):

F1(d) = 2 · Precision(d) · Recall(d) / (Precision(d) + Recall(d))

Precision(d) = (number of correctly classified recordings in class d) / (number of recordings classified in class d)

Recall(d) = (number of correctly classified recordings in class d) / (total number of recordings in class d)
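The decision rule of question (a) can be sketched as follows (the models cell array and its field names are illustrative assumptions):

```matlab
% Classify one test recording X (13 x Nf MFCCs) by maximum log likelihood
% over the 10 trained HMMs (illustrative sketch; models{d+1} is assumed
% to hold the parameters returned by mhmm_em for digit d).
loglik = zeros(1, 10);
for d = 0:9
    m = models{d+1};
    loglik(d+1) = mhmm_logprob(X, m.prior, m.transmat, ...
                               m.mu, m.Sigma, m.mixmat);
end
[~, idx] = max(loglik);
predicted_digit = idx - 1;            % back to the 0..9 digit labels
```

Accumulating predicted_digit against the true labels over the whole test set gives the counts needed for the accuracy, precision, and recall formulas above.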

(b) - extra credit For experiment (iii), pick two digits d and d′. For d, pick 3 correctly classified recordings and plot the most probable trajectory of the observations (MFCCs) in the space of HMM states (for the HMM modeling d). In other words, your x-axis should be time (or frame index) and the y-axis should be the index of the HMM state (1, 2, ...). To do so, you will need the functions mixgauss_prob and viterbi_path, as in Part 2, Question (b). Do the same for 3 recordings of d′. Now, pick one of the 3 recordings of d′ and plot the most probable trajectory of the observations (MFCCs) in the space of HMM states for the HMM modeling d. (3+3+1=7 plots are asked for in total in this question.)

PART 4: Architecture and Parameter Tuning - Optional

If you want, try different parameters (e.g., number of states, number of Gaussians, full vs. diagonal covariance matrices, ergodic vs. linear HMMs, etc.) and report the accuracy you got with your optimal configuration.


