Train your first ASR model:
Introduction to ASR in linguistic research


Chenzi Xu
MPhil DPhil (Oxon)

University of York
Workshop at Newcastle University

June 11, 2023

About me

Postdoctoral Research Associate

University of York

Person-specific Automatic Speaker Recognition: Understanding the behaviour of individuals for applications of ASR


DPhil Candidate, MPhil (Distinction)

University of Oxford

Investigating the tonal system of Plastic Mandarin: A cross-varietal comparison

Outline


  1. What is ASR?
  2. Statistical Speech Recognition
  3. ASR Application in Linguistic Research
  4. Hands-on 1: Automatic Forced Alignment
  5. Hands-on 2: Adapt Existing Models
  6. Hands-on 3: Train Acoustic Models

What is Automatic Speech Recognition (ASR)?


Automatic Speech Recognition



Automatic Speech Recognition (ASR) or Speech-to-text (STT)

  • Transcription: Transform recorded speech audio into a sequence of corresponding words
  • Deal with acoustic ambiguity: “Recognise speech?” or “Wreck a nice beach?”
  • “What did they say?”


Related ASR tasks:

  • Speaker recognition: “Who spoke?”
  • Speaker diarization: “Who spoke when?”
  • Speech understanding: “What does it mean?”
  • Paralinguistic aspects: “How did they say it?”

Challenges of Speech Recognition



From a linguistic perspective:

Multiple sources of Variation

  • Linguistic aspects
  • Paralinguistic aspects
  • Accents and dialects
  • Speaker and style
  • Environment
  • Multilingual scenarios

From a machine learning perspective:


  • Classification: high dimensional output space
  • Sequence-to-sequence: long input sequence
  • Noisy and limited data (compared to text-based NLP)
  • Hierarchical and compositional nature

Statistical Speech Recognition


Input, Output, and Aim



Statistical Approach (Traditional)

  • Input: recorded speech, represented as a sequence of acoustic feature vectors, X
  • Output: the word sequence, W
  • Aim: to find the most likely W, given X

Statistical models are trained using a corpus of labelled training utterances \((X_n, W_n)\)

Representing Speech: Feature Extraction (X)



Desirable properties:

  • Robust to F0 changes (& F0 harmonics)
  • Robust across speakers
  • Robust against noise/channel effects
  • As low a dimension as possible
  • No redundancy among features

Typical acoustic features:

  • Mel-frequency cepstral coefficients (MFCC)
  • Perceptual Linear Prediction (PLP)
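
To make the idea of acoustic feature vectors concrete, below is a minimal MFCC extraction sketch in Python. It assumes the third-party librosa package and an illustrative file name recording1.wav; it is not part of any toolkit used later in the workshop.

# Minimal MFCC extraction sketch (assumes: pip install librosa)
import librosa

# Load and resample the recording to 16 kHz, a common rate for ASR
y, sr = librosa.load("recording1.wav", sr=16000)

# 13 MFCCs per frame, 25 ms analysis windows with a 10 ms hop (typical ASR settings)
mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr), hop_length=int(0.010 * sr),
)

print(mfcc.shape)  # (13, number_of_frames): one feature vector per ~10 ms frame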

Labelling Speech (W)



Phonemes:

  • Abstract units based on contrastive role in word meanings (e.g. “cat” vs “bat”)
  • 40–50 phonemes in English

Phones:

  • Speech sounds defined by acoustics
  • Many allophones of the same phoneme (e.g. /p/ in “pit” and “spit”)
  • Limitless in number

Typical labels: Words, phones, etc.

  • Labels may be time-aligned
  • No conclusive evidence that phones are the basic units in speech recognition

Two Key Challenges




1. In training the model:

Aligning the sequences Xn and Wn for each training utterance


2. In performing recognition:

Searching over all possible output sequences W to find the most likely one

Figure: a naive algorithm for collapsing an alignment between input and letters.
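
A hedged sketch of such a collapsing step: assuming a frame-level alignment in which “_” marks a blank frame, merging repeated labels and dropping blanks recovers the letter sequence. The symbols and function name are purely illustrative, not taken from any particular toolkit.

# Naively collapse a frame-level alignment into a letter sequence
# ("_" is assumed to mark a blank frame; values are illustrative)
def collapse(alignment):
    output = []
    previous = None
    for label in alignment:
        if label != previous and label != "_":
            output.append(label)
        previous = label
    return "".join(output)

print(collapse(list("hhe_lll_llooo")))  # -> "hello"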

The Hidden Markov Model



  • Mapping a sequence of continuous observations to a sequence of discrete outputs
  • Monophone models: a 3-state linear HMM per phone (beginning, middle, end)
  • Triphone models: 3/5-state linear HMM for context-dependent phones
  • Algorithms: Training (forward-backward), decoding and alignment (Viterbi)

Figure: a generative model for the observation sequence.

The Hidden Markov Model



A statistical model for time series data with a set of discrete states {1,…,J}

  • At each time step \(t\):
    • the model occupies a state \(q_t\), entered with transition probability \(P(q_t=j|q_{t-1}=k)\)
    • the model generates an observation \(x_t\) with emission probability \(P(x_t|q_t=j)\)
  • Markov assumption: the current state depends only on the previous state, \(P(q_t=j|q_1...q_{t-1})=P(q_t=j|q_{t-1})\)
  • Observation independence: each observation depends only on the current state, \(P(x_t|q_1...q_t,x_1...x_{t-1})=P(x_t|q_t)\)
  • We don’t observe which state the model is in at each time step – hence “hidden”.

Figure: the HMM as a probabilistic automaton.
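
To make decoding concrete, here is a minimal Viterbi sketch for a toy three-state HMM with discrete observations. All states, probabilities, and observation symbols are invented for illustration; real acoustic models use Gaussian mixture or neural network emission densities over feature vectors.

# Viterbi decoding for a toy 3-state left-to-right HMM (illustrative values only)
import numpy as np

states = ["beg", "mid", "end"]
log_pi = np.log([1.0, 1e-10, 1e-10])      # start in the first state
log_A = np.log([[0.6, 0.4, 1e-10],        # transition probabilities P(q_t = j | q_{t-1} = k)
                [1e-10, 0.6, 0.4],
                [1e-10, 1e-10, 1.0]])
log_B = np.log([[0.7, 0.2, 0.1],          # emission probabilities P(x_t | q_t = j)
                [0.1, 0.8, 0.1],          # over three discrete observation symbols
                [0.2, 0.2, 0.6]])

obs = [0, 0, 1, 1, 2]                      # observed symbol indices for five frames

# delta[t, j]: log probability of the best state path ending in state j at time t
delta = np.full((len(obs), len(states)), -np.inf)
back = np.zeros((len(obs), len(states)), dtype=int)
delta[0] = log_pi + log_B[:, obs[0]]
for t in range(1, len(obs)):
    for j in range(len(states)):
        scores = delta[t - 1] + log_A[:, j]
        back[t, j] = np.argmax(scores)
        delta[t, j] = scores[back[t, j]] + log_B[j, obs[t]]

# Trace back the best (hidden) state sequence, i.e. the alignment of frames to states
path = [int(np.argmax(delta[-1]))]
for t in range(len(obs) - 1, 0, -1):
    path.append(int(back[t, path[-1]]))
print([states[j] for j in reversed(path)])  # ['beg', 'beg', 'mid', 'mid', 'end']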

Fundamental Equation of Statistical Speech Recognition




Searching the most probable word sequence \(W^*\):

\(W^* = argmax_W P(W|X)\)

Applying Bayes’ Theorem:

\(\begin{align*}P(W|X) &= \frac{P(X|W)P(W)}{P(X)}\\ &\propto P(X|W)P(W) \end{align*}\)

Rewriting \(W^*\):

\(W^* = argmax_W\underbrace{P(X|W)}_\text{Acoustic model }\underbrace{P(W)}_\text{Language model}\)
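
As a toy illustration of this decomposition (with invented log-probabilities, not real model scores): given two competing hypotheses, the recogniser picks the one whose combined acoustic model and language model score is highest, which is how acoustic ambiguity such as “recognise speech” vs “wreck a nice beach” gets resolved.

# Toy illustration of W* = argmax_W P(X|W) P(W), computed in the log domain.
# The numbers are invented for illustration; real systems obtain them from
# an acoustic model and a language model.
hypotheses = {
    "recognise speech":   {"log_acoustic": -120.0, "log_lm": -8.0},
    "wreck a nice beach": {"log_acoustic": -118.0, "log_lm": -14.0},
}

best = max(hypotheses,
           key=lambda w: hypotheses[w]["log_acoustic"] + hypotheses[w]["log_lm"])
print(best)  # "recognise speech": the language model outweighs the slight acoustic preference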

Speech Recognition Components



  • Acoustic model
  • Language model
  • Lexicon

Acoustic Models



Figures: HMM-GMM and HMM-DNN acoustic model architectures (both graphs by Maël Fabien).

End-to-End models



Figure: schematic architecture for an encoder-decoder speech recognizer.

ASR Application in Linguistic Research


Phonetics and Phonology




  • Transcription of fieldwork speech data
  • Automatic forced alignment
  • Allophone distributions

Hands-on 1: Automatic forced alignment


Automatic Forced Alignment



Preparation



Data

  • Download here (credit: Eleanor Chodroff)
  • Sources: Northwestern ALLSSTAR Corpus, Mozilla Common Voice Corpus

Montreal Forced Aligner

A forced alignment system built with the Kaldi ASR toolkit.

  • Pre-trained models are available
  • Training and aligning speech data
    • The basic acoustic model recipe: GMM-HMM framework
    • Speaker adaptation

Installation:

conda create -n aligner -c conda-forge montreal-forced-aligner
conda activate aligner
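
Once the environment is activated, running a simple command such as mfa version (or mfa --help) should complete without errors, confirming that the aligner is installed and available on the path.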

Research Scenario 1



You have:

  • Speech recordings from production experiment
  • Transcripts (stimuli)

MFA:

General procedure:

  1. Prepare the audio files (.wav)
  2. Prepare the transcript files (.txt/.lab/.TextGrid)
  3. Set up input and output folders
  4. Obtain a pronunciation dictionary
  5. Run the aligner with pre-trained acoustic models

Research Scenario 1



You have:

  • Speech recordings from production experiment
  • Transcripts (stimuli)

MFA:

File Structure:

+-- textgrid_corpus_directory
|   --- recording1.wav
|   --- recording1.TextGrid
|   --- recording2.wav
|   --- recording2.TextGrid
|   --- ...

+-- prosodylab_corpus_directory
|   +-- speaker1
|       --- recording1.wav
|       --- recording1.lab
|       --- recording2.wav
|       --- recording2.lab
|   +-- speaker2
|       --- recording3.wav
|       --- recording3.lab
|   --- ...
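
Before running the aligner, it can be worth checking programmatically that every recording has a matching transcript. A minimal sketch for the prosodylab-style layout above (the directory name is illustrative):

# Report .wav files that lack a matching .lab transcript
from pathlib import Path

corpus = Path("prosodylab_corpus_directory")   # illustrative path
missing = [wav for wav in corpus.rglob("*.wav")
           if not wav.with_suffix(".lab").exists()]

print(f"{len(missing)} recording(s) without a transcript")
for wav in missing:
    print(" ", wav)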

Research Scenario 1



You have:

  • Speech recordings from production experiment
  • Transcripts (stimuli)

MFA:

Command syntax:

mfa align corpus_directory dictionary acoustic_model output_directory


Example:

mfa model download acoustic english_us_arpa
mfa model download dictionary english_us_arpa

mfa align /Users/cx936/Desktop/MFATutorial2021/ex1_english english_us_arpa english_us_arpa /Users/cx936/Desktop/MFATutorial2021/output
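
After alignment, the output directory contains one TextGrid per recording with word and phone tiers. A minimal sketch for extracting phone durations from one of them, assuming the third-party textgrid Python package (not part of MFA) and an illustrative file name:

# Read phone labels and durations from an MFA output TextGrid
# (assumes: pip install textgrid; the file name is illustrative)
import textgrid

tg = textgrid.TextGrid.fromFile("output/recording1.TextGrid")

for tier in tg.tiers:
    # MFA writes "words" and "phones" tiers (tier names may carry a speaker prefix)
    if tier.name.endswith("phones"):
        for interval in tier:
            if interval.mark:              # skip empty (silence) intervals
                duration = interval.maxTime - interval.minTime
                print(interval.mark, round(duration, 3))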

Hands-on 2: Adapt Existing Models


Research Scenario 2: Adapt Dictionary



You have:

  • Specialised speech corpora
  • Transcripts

MFA:

  1. Find OOV items in a corpus

mfa validate /Users/cx936/Desktop/MFATutorial2021/ex4_english_modify1 english_us_arpa english_us_arpa output/

  2. Create pronunciation dictionary manually

  • Non-probabilistic format

WORDA  PHONEA PHONEB
WORDA  PHONEC
WORDB  PHONEB PHONEC

  • Dictionary with silence probabilities
the    0.16    0.08    2.17    1.13    d i
the    0.99    0.04    2.14    1.15    d ə
the    0.01    0.14    2.48    1.18    ð i
the    0.02    0.12    1.87    1.23    ð ə
the    0.11    0.15    2.99    1.15    ə

Research Scenario 2: Adapt Dictionary



You have:

  • Specialised speech corpora
  • Transcripts

MFA:

  3. Create pronunciation dictionary using a G2P model

G2P: Grapheme-to-phoneme

mfa model download g2p english_us_arpa

mfa g2p ~/Desktop/MFATutorial2021/ex1_english english_us_arpa ~/Desktop/MFATutorial2021/ex1_english/oovs.txt --dictionary_path english_us_arpa

mfa model add_words english_us_arpa ~/Desktop/MFATutorial2021/ex1_english/oovs.txt

Research Scenario 2: Adapt Acoustic Models



You have:

  • Speech recordings from production experiment
  • Transcripts (stimuli)

MFA:

MFA can adapt pretrained acoustic models to a new dataset.


Example:

mfa adapt --clean /Users/cx936/Desktop/MFATutorial2021/ex1_english english_us_arpa english_us_arpa output/model/ /Users/cx936/Desktop/MFATutorial2021/output

Hands-on 3: Train Acoustic Models


Research Scenario 3



You have:

  • Speech recordings from production experiment
  • Transcripts (stimuli)
  • Pronunciation dictionary

MFA:

  • No pre-trained models are available, so train an acoustic model from scratch

Example:

# Make sure dataset is in proper format
mfa validate ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt

# Export just the trained acoustic model
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/new_acoustic_model.zip

# Export just the training alignments
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/my_corpus_aligned

# Export both trained model and alignments
mfa train ~/mfa_data/my_corpus ~/mfa_data/my_dictionary.txt ~/mfa_data/new_acoustic_model.zip --output_directory ~/mfa_data/my_corpus_aligned

# save model
mfa model save acoustic ~/mfa_data/new_acoustic_model.zip

More advanced: the official Kaldi website; Eleanor Chodroff’s Kaldi Tutorial

Resources and References

Daniel Jurafsky and James H. Martin (2008). Speech and Language Processing, Pearson Education (2nd edition).

Daniel Jurafsky and James H. Martin (2023). Speech and Language Processing, Pearson Education (3rd edition).

The University of Edinburgh ASR Lectures 2022-2023 by Professor Peter Bell

Slides by Gwénolé Lecorvé from the Research in Computer Science (SIF) master’s programme

Blog by Maël Fabien