University of York
Workshop at Newcastle University
June 11, 2023
Postdoctoral Research Associate
University of York
Person-specific Automatic Speaker Recognition: Understanding the behaviour of individuals for applications of ASR
DPhil Candidate, MPhil (Distinction)
University of Oxford
Investigating the tonal system of Plastic Mandarin: A cross-varietal comparison
Automatic Speech Recognition (ASR) or Speech-to-text (STT)
Related ASR tasks:
Multiple sources of variation
Statistical Approach (Traditional)
Input: Recorded speech, represented as a sequence of acoustic feature vectors, X
Output: A word sequence, W
Aim: To find the most likely W, given X
Statistical models are trained using a corpus of labelled training utterances \((X_n, W_n)\)
Desirable properties:
Typical acoustic features:
Phonemes:
Phones:
Typical labels: Words, phones, etc.
Aligning the sequences \(X_n\) and \(W_n\) for each training utterance
Searching over all possible output sequences W to find the most likely one
A statistical model for time series data with a set of discrete states {1,…,J}
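This description matches a hidden Markov model (HMM), the standard acoustic-model backbone in Kaldi-style systems. A minimal sketch of the textbook formulation, assuming states \(j \in \{1,\dots,J\}\), observations \(x_1,\dots,x_T\), and initial state probabilities \(\pi_j\):
\(a_{ij} = P(q_{t+1} = j \mid q_t = i)\) (transition probabilities)
\(b_j(x_t) = P(x_t \mid q_t = j)\) (emission probabilities)
\(\alpha_1(j) = \pi_j\, b_j(x_1), \quad \alpha_t(j) = b_j(x_t) \sum_{i=1}^{J} \alpha_{t-1}(i)\, a_{ij}\) (forward recursion)
\(P(X) = \sum_{j=1}^{J} \alpha_T(j)\) (likelihood of the observation sequence)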
Searching for the most probable word sequence W*:
\(W^* = \operatorname*{argmax}_W P(W \mid X)\)
Applying Bayes' Theorem (P(X) does not depend on W, so it can be dropped from the maximisation):
\(P(W \mid X) = \dfrac{P(X \mid W)\,P(W)}{P(X)} \propto P(X \mid W)\,P(W)\)
Rewriting W*:
\(W^* = \operatorname*{argmax}_W \underbrace{P(X \mid W)}_{\text{Acoustic model}} \, \underbrace{P(W)}_{\text{Language model}}\)
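In practice the two terms are combined in log space; a sketch of the standard decoding criterion, where the language model scale factor \(\lambda\) is an added assumption rather than something stated in these slides:
\(W^* = \operatorname*{argmax}_W \left[ \log P(X \mid W) + \lambda \log P(W) \right]\)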
A forced alignment system built with the Kaldi ASR toolkit.
Installation:
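A minimal sketch of the conda-based route recommended in the MFA documentation (channel and package names as of MFA 2.x; see the official installation page for platform-specific details):
conda create -n aligner -c conda-forge montreal-forced-aligner
conda activate aligner
mfa version   # check that the install worked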
You have:
MFA:
General procedure:
Sound files (.wav) paired with orthographic transcriptions (.txt / .lab or .TextGrid)
You have:
MFA:
File Structure:
+-- textgrid_corpus_directory
| --- recording1.wav
| --- recording1.TextGrid
| --- recording2.wav
| --- recording2.TextGrid
| --- ...
+-- prosodylab_corpus_directory
| +-- speaker1
| --- recording1.wav
| --- recording1.lab
| --- recording2.wav
| --- recording2.lab
| +-- speaker2
| --- recording3.wav
| --- recording3.lab
| --- ...
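In the prosodylab-style layout, each .lab file holds the orthographic transcript of its matching .wav as a single line of plain text; for example, a hypothetical recording1.lab might contain just:
this is an example utterance
(In the TextGrid layout, transcriptions are instead stored in interval tiers, typically one tier per speaker.)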
You have:
MFA:
Command syntax:
Example:
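The documented syntax is: mfa align corpus_directory dictionary_path acoustic_model_path output_directory. A sketch using the pretrained English (US) ARPA models from the MFA model repository (english_us_arpa; substitute your own corpus path, dictionary, and model):
mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa
mfa align ~/mfa_data/my_corpus english_us_arpa english_us_arpa ~/mfa_data/my_corpus_aligned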
You have:
MFA:
WORDA PHONEA PHONEB
WORDA PHONEC
WORDB PHONEB PHONEC
Dictionaries can also include pronunciation and silence probabilities; each line then gives the word, a pronunciation probability, a silence probability, two silence correction terms, and the pronunciation (see the MFA dictionary documentation for details):
the 0.16 0.08 2.17 1.13 d i
the 0.99 0.04 2.14 1.15 d ə
the 0.01 0.14 2.48 1.18 ð i
the 0.02 0.12 1.87 1.23 ð ə
the 0.11 0.15 2.99 1.15 ə
You have:
MFA:
G2P: Grapheme-to-phoneme conversion (predicting a word's pronunciation from its spelling)
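A sketch of generating dictionary entries with a pretrained G2P model (english_us_arpa is the model name in the MFA model repository; the positional argument order of mfa g2p has varied across MFA versions, so treat the second line as an assumption and confirm with mfa g2p --help):
mfa model download g2p english_us_arpa
mfa g2p ~/mfa_data/my_corpus english_us_arpa ~/mfa_data/g2p_generated_dictionary.txt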
You have:
MFA:
You have:
MFA:
Example:
# Make sure dataset is in proper format
mfa validate ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt
# Export just the trained acoustic model
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/new_acoustic_model.zip
# Export just the training alignments
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/my_corpus_aligned
# Export both trained model and alignments
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/new_acoustic_model.zip --output_directory ~/mfa_data/my_corpus_aligned
# Save the trained acoustic model
mfa model save acoustic ~/mfa_data/new_acoustic_model.zip
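Once saved, the acoustic model can (per the MFA documentation) be referenced by name in later commands; a sketch reusing the corpus and dictionary from the example above:
mfa align ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt new_acoustic_model ~/mfa_data/my_corpus_aligned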
More advanced: the official Kaldi website and Eleanor Chodroff's Kaldi tutorial
Daniel Jurafsky and James H. Martin (2008). Speech and Language Processing, Pearson Education (2nd edition).
Daniel Jurafsky and James H. Martin (2023). Speech and Language Processing, Pearson Education (3rd edition).
The University of Edinburgh ASR Lectures 2022-2023 by Professor Peter Bell
Slides by Gwénolé Lecorvé from the Research in Computer Science (SIF) master's programme
Blog by Maël Fabien