ASR - Automatic Speech Recognition

Scope of the Automatic Speech Recognition (ASR) module:

  • 15 lectures and 15 laboratory classes, plus work at home,
  • classes on the Hidden Markov Model Toolkit (HTK), developed at the University of Cambridge,
  • classes on developing real tools for an ASR project at AGH (?).

Students of the ASR module will be graded based on two grades for the two parts of the laboratory classes. Both parts will be evaluated considering:

  • fulfilment of the project tasks,
  • quality,
  • precision,
  • usability,
  • absence of errors.

A student is allowed to miss up to two laboratory classes. If a student misses three classes, the final grade is lowered by 0.5; four missed classes result in a negative grade.

The lecture plan is as follows:

  • (07.10) Introduction and the general scheme of ASR.
  • (14.10) The fundamentals of HTK.
  • (21.10) Introduction to discrete signal processing.
  • (28.10) Speech and language resources available for automatic speech recognition.
  • (04.11) Bayes' rule, maximum a posteriori probability (MAP).
  • (18.11) Speech parameterisation and segmentation (mel-frequency cepstral coefficients (MFCC), perceptual linear predictive analysis (PLP)).
  • (25.11) Discrete wavelet transform.
  • (02.12) Speech modelling (hidden Markov models (HMM), artificial neural networks (ANN)).
  • (09.12) Decoding, the Viterbi algorithm.
  • (16.12) Dictionaries in computer systems, the Levenshtein metric.
  • (06.01) Grammar modelling (parsers, n-grams, part-of-speech taggers).
  • (13.01) Semantic modelling (bag-of-words, WordNet, vector space model).
  • (20.01) Current research trends in ASR.
  • (28.01 ?) Academic and commercial ASR systems (HTK, Sphinx, Julius, PrimeSpeech, AGH).

The laboratory classes start on 07.10. The task is to design and create your own speech recogniser using HTK (preferably in your native language, because of possible problems with pronunciation; 20-50 words and 20-30 “sentences”). At any time you may ask a teacher to show you an example of any file that is needed; don't waste time staying stuck at any stage. Keep a simple diary of everything you did, especially every command line you ran, with comments wherever you had to check something manually in files. You may also look at a paper on HTK experiments for Polish.

HTK as an example of a state-of-the-art ASR system (labs 1-7):

1) (07.10)
Registering, downloading the software and the HTKBook, and subscribing to the users' mailing list.
Step 1 - the Task Grammar (design your own task, e.g. dictating credit card numbers or basic call-centre information).
Step 2 - the Dictionary (English: the BEEP dictionary; Polish: ask a teacher for transcriptions); you can find the list of possible phonemes here.
Step 3 - Recording the Data. Record each utterance four times (three takes for training, one for testing) as WAV files; stay away from MP3 for any speech processing task.

2) (14.10 and 21.10)
Step 4 - Creating the Transcription Files (for phonemes)
0 0 sil
50000 850000 r
900000 1650000 a
1700000 2300000 d
2350000 3300000 o
3350000 4650000 s
4700000 5350000 l_
5400000 6550000 a
6600000 8050000 f
8100000 8200000 sil
Double-check the transcriptions.

Step 5 - Coding the Data
Prepare files config1 and codetr.scp (config as in the tutorial but with SOURCEFORMAT = WAV or NOHEAD; you may want to write a script to prepare codetr.scp automatically).
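Both files can be prepared along these lines (a sketch: the data/train path is an assumption, and the parameter values follow the HTKBook tutorial defaults; adjust them to your own setup):

```shell
# config1, roughly as in the HTKBook tutorial, but reading headered WAV files:
cat > config1 <<'EOF'
SOURCEFORMAT = WAV
TARGETKIND   = MFCC_0
TARGETRATE   = 100000.0
WINDOWSIZE   = 250000.0
USEHAMMING   = T
PREEMCOEF    = 0.97
NUMCHANS     = 26
CEPLIFTER    = 22
NUMCEPS      = 12
EOF

# codetr.scp: one "source target" pair per line, mapping each wav file to the
# MFCC file that HCopy should produce next to it.
for w in data/train/*.wav; do
    printf '%s %s\n' "$w" "${w%.wav}.mfc"
done > codetr.scp
```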
HCopy -T 1 -C config1 -S codetr.scp
Creating Monophone HMMs
Step 6 - Creating Flat Start Monophones (prepare file proto and create directory hmm0)
HCompV -C config1 -f 0.01 -m -S train.scp -M hmm0 proto
Create hmmdefs by copying proto from hmm0, once for each phoneme that occurs in your recordings, remembering to change the ~h header to the name of the phone.
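One way to script this step (a sketch; it assumes the phone list sits in a file monophones0, one phone per line, and that HCompV left proto in hmm0):

```shell
# Everything between <BEGINHMM> and <ENDHMM> in the prototype is the reusable
# model body; emit it once per phone, each time under a fresh ~h header.
body=$(sed -n '/<BEGINHMM>/,/<ENDHMM>/p' hmm0/proto)
: > hmm0/hmmdefs
while read -r p; do
    printf '~h "%s"\n%s\n' "$p" "$body" >> hmm0/hmmdefs
done < monophones0
```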
Create directory hmm1
Apart from each wav file there has to be a corresponding lab file with the same name, containing the segmentation and transcription information prepared during the previous class. An example of a lab file (HTK uses time units of 100 ns):
1 2295000 sil
2320000 3151249 w
3151250 3994999 ah
3995000 5422500 n
5422501 8462500 sil
Then you should be able to run re-estimation, which is the start of proper ASR training:
HERest -C config2 -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones
HERest -C config2 -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm1/macros -H hmm1/hmmdefs -M hmm2 monophones
HERest -C config2 -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm2/macros -H hmm2/hmmdefs -M hmm3 monophones

Step 7 - Fixing the Silence Models (There are two types: sil is silence and sp is short pause)
Create directory hmm4 as a copy of hmm3.
In hmm4/hmmdefs, copy the sil model and paste the copy under the name sp.
Create file sil.hed
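The sil.hed in the HTKBook tutorial adds backward-skip transitions to sil and ties the emitting state of sp to the centre state of sil (this assumes the tutorial's 5-state sil and 3-state sp topologies; adjust the state indices if yours differ):

```
AT 2 4 0.2 {sil.transP}
AT 4 2 0.2 {sil.transP}
AT 1 3 0.3 {sp.transP}
TI silst {sil.state[3],sp.state[2]}
```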
Copy monophones0 to monophones1 and add sp to monophones1.
Finally, another two passes of HERest are applied using the phone transcriptions with sp models between words.
HHEd -H hmm4/macros -H hmm4/hmmdefs -M hmm5 sil.hed monophones1
HERest -C config2 -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm5/macros -H hmm5/hmmdefs -M hmm6 monophones
HERest -C config2 -I phones0.mlf -t 250.0 150.0 1000.0 -S train.scp -H hmm6/macros -H hmm6/hmmdefs -M hmm7 monophones
Step 8 - Realigning the Training Data
Add silence sil to the dictionary.
You may have to add all monophones to the dictionary as well.
HVite -l '*' -o SWT -b sil -C config2 -a -H hmm7/macros -H hmm7/hmmdefs -i aligned.mlf -m -t 250.0 -y lab -I words.mlf -S train.scp dict monophones

Step 9 - Making Triphones from Monophones
Create file mktri.led
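As in the HTKBook tutorial, mktri.led converts the monophone transcriptions to word-internal triphones, with sp and sil marking word boundaries:

```
WB sp
WB sil
TC
```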
HLEd -n triphones1 -l '*' -i wintri.mlf mktri.led aligned.mlf
Create file mktri.hed with the maketrihed script from samples/HTKTutorial/: maketrihed monophones1 triphones1
Create a directory hmm10
HHEd -H hmm9/macros -H hmm9/hmmdefs -M hmm10 mktri.hed monophones1
HERest -B -C config2 -I wintri.mlf -t 250.0 150.0 1000.0 -s stats -S train.scp -H hmm10/macros -H hmm10/hmmdefs -M hmm11 triphones1
HERest -B -C config2 -I wintri.mlf -t 250.0 150.0 1000.0 -s stats -S train.scp -H hmm11/macros -H hmm11/hmmdefs -M hmm12 triphones1

Step 10 - Making Tied-State Triphones
Check phoneme groups for a language you use. Polish phoneme groups will be provided by a teacher. Apply them to create tree.hed
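For reference, a tree.hed for decision-tree state tying usually has the shape below (a skeleton only: the QS questions must come from the phoneme groups of your language, the threshold 350.0 is the HTKBook tutorial value, and one TB command is needed per emitting state of each base phone):

```
RO 100.0 stats
TR 0
QS "L_Vowel" { a-*,e-*,i-*,o-*,u-* }
QS "R_Vowel" { *+a,*+e,*+i,*+o,*+u }
TR 2
TB 350.0 "ST_a_2_" {("a","*-a+*","a+*","*-a").state[2]}
TB 350.0 "ST_a_3_" {("a","*-a+*","a+*","*-a").state[3]}
TR 1
AU "fulllist"
CO "tiedlist"
ST "trees"
```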
HHEd -B -H hmm12/macros -H hmm12/hmmdefs -M hmm13 tree.hed triphones1 > log
fulllist should contain every element of triphones1 (plus many others, including sil). Check it.
HERest -B -C config2 -I wintri.mlf -t 250.0 150.0 1000.0 -s stats -S train.scp -H hmm13/macros -H hmm13/hmmdefs -M hmm14 tiedlist
HERest -B -C config2 -I wintri.mlf -t 250.0 150.0 1000.0 -s stats -S train.scp -H hmm14/macros -H hmm14/hmmdefs -M hmm15 tiedlist

Tests and evaluation
Step 11 - Recognising the Test Data
Prepare an empty file recout.mlf and a file test.scp containing the paths of the test .wav files (using a script might be helpful).
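Both files can be produced with a couple of shell lines (a sketch; data/test is an assumed location for the test recordings):

```shell
# test.scp: one path per line, listing the test recordings.
ls data/test/*.wav > test.scp
# recout.mlf only needs to exist; HVite will overwrite it with its output.
: > recout.mlf
```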
HVite -C config3 -H hmm15/macros -H hmm15/hmmdefs -S test.scp -l '*' -i recout.mlf -w wdnet -p 0.0 -s 5.0 dict tiedlist
Prepare .lab files containing the orthographic transcriptions of the words, and a list of the .lab files in testref.mlf (you may want to use scripts for this task). You may have to change the asterisks in recout.mlf to the paths of the .lab files.
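The asterisk substitution can be done with sed (a sketch; data/test is an assumed directory for the .lab files):

```shell
# Each MLF entry looks like "*/utt1.rec"; rewrite the leading '*' to the real
# directory so the entries can be matched against testref.mlf.
# The -i.bak option keeps a backup copy of the original file.
sed -i.bak 's|"\*/|"data/test/|' recout.mlf
```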
Check statistics of recognition on test data.
HResults -I testref.mlf tiedlist recout.mlf
Try live recognition.
Prepare a report on the whole process of training your own ASR system with HTK. Include the audio files, transcriptions, HMM files, and any other results of your work during the classes so far. Focus on your results and your opinions about how ASR systems can be used in commercial software, rather than copying anything from the HTKBook or the Internet.

dr inż. Bartosz Ziółko

asr/start.txt · last modified: 2013/03/01 13:48 by Bartosz Ziółko