Train your first ASR model:
Introduction to ASR in linguistic research


Chenzi Xu
MPhil DPhil (Oxon)

University of York
Workshop at Newcastle University

June 11, 2023

About me

Postdoctoral Research Associate

University of York

Person-specific Automatic Speaker Recognition: Understanding the behaviour of individuals for applications of ASR


DPhil Candidate, MPhil (Distinction)

University of Oxford

Investigating the tonal system of Plastic Mandarin: A cross-varietal comparison

Outline


  1. What is ASR?
  2. Statistical Speech Recognition
  3. ASR Application in Linguistic Research
  4. Hands-on 1: Automatic Forced Alignment
  5. Hands-on 2: Adapt Existing Models
  6. Hands-on 3: Train Acoustic Models

What is Automatic Speech Recognition (ASR)?

Automatic Speech Recognition


Automatic Speech Recognition (ASR) or Speech-to-text (STT)

  • Transcription: Transform recorded speech audio into a sequence of corresponding words
  • Deal with acoustic ambiguity: “Recognise speech?” or “Wreck a nice beach?”
  • “What did they say?”


Related ASR tasks:

  • Speaker recognition: “Who spoke?”
  • Speaker diarization: “Who spoke when?”
  • Speech understanding: “What does it mean?”
  • Paralinguistic aspects: “How did they say it?”

Challenges of Speech Recognition


From a linguistic perspective:

Multiple sources of variation

  • Linguistic aspects
  • Paralinguistic aspects
  • Accents and dialects
  • Speaker and style
  • Environment
  • Multilingual scenarios

From a machine learning perspective:


  • Classification: high dimensional output space
  • Sequence-to-sequence: long input sequence
  • Noisy and limited data (compared to text-based NLP)
  • Hierarchical and compositional nature
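
To give a feel for the "long input sequence" point, the sketch below counts how many acoustic frames one utterance produces. The 16 kHz sample rate, 25 ms window, and 10 ms shift are common defaults assumed here for illustration; they are not taken from the slides.

```python
def num_frames(duration_s, sample_rate=16000, win_ms=25, hop_ms=10):
    """Number of full analysis frames in an utterance (assumed framing defaults)."""
    n_samples = int(duration_s * sample_rate)
    win = sample_rate * win_ms // 1000   # samples per window
    hop = sample_rate * hop_ms // 1000   # samples between window starts
    if n_samples < win:
        return 0
    return 1 + (n_samples - win) // hop


# A 10-second utterance yields roughly a thousand feature vectors,
# while its transcript may contain only a few dozen words.
frames_in_ten_seconds = num_frames(10.0)
```

Under these assumptions the input is one to two orders of magnitude longer than the output word sequence.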

Statistical Speech Recognition

Input, Output, and Aim


Statistical Approach (Traditional)

  • Input: recorded speech as a sequence of acoustic feature vectors, X
  • Output: a word sequence, W
  • Aim: to find the most likely W, given X
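
Written as a formula, this aim is the standard Bayes decision rule for speech recognition. The split into an acoustic model p(X | W) and a language model P(W) is the conventional decomposition, given here as background rather than taken from the slide:

```latex
\begin{align}
W^{*} &= \arg\max_{W} P(W \mid X) \\
      &= \arg\max_{W} \frac{p(X \mid W)\, P(W)}{p(X)} \\
      &= \arg\max_{W} \underbrace{p(X \mid W)}_{\text{acoustic model}} \;
                      \underbrace{P(W)}_{\text{language model}}
\end{align}
```

The denominator p(X) can be dropped in the last step because it does not depend on W.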


Statistical models are trained using a corpus of labelled training utterances (Xn, Wn)
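
As a toy illustration of picking the most likely W given X, the sketch below combines an acoustic score with a language-model score for two competing transcriptions. All numbers and dictionary names are made up for illustration; in practice both scores would come from models trained on the labelled corpus.

```python
import math

# Hypothetical scores for one utterance X; every number here is invented.
# log p(X | W): how well each word sequence matches the audio (acoustic model).
acoustic_logprob = {
    "recognise speech": -12.1,
    "wreck a nice beach": -11.8,
}
# P(W): how plausible each word sequence is as language (language model).
language_prob = {
    "recognise speech": 1e-4,
    "wreck a nice beach": 1e-7,
}


def decode(acoustic, language):
    """Return the W that maximises log p(X | W) + log P(W)."""
    def score(w):
        return acoustic[w] + math.log(language[w])
    return max(acoustic, key=score)


best = decode(acoustic_logprob, language_prob)
```

In this toy setup the acoustically slightly better hypothesis loses, because the language model finds it far less plausible.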

Representing Speech: Feature Extraction (X)


Desirable properties:

  • Robust to F0 changes (& F0 harmonics)
  • Robust across speakers
  • Robust against noise/channel effects
  • As low-dimensional as possible
  • No redundancy among features

Typical acoustic features:

  • Mel-Frequency Cepstral Coefficients (MFCCs)
  • Perceptual Linear Prediction (PLP)
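
A minimal sketch of how MFCCs can be computed for a single frame, using only the Python standard library: window the frame, take the power spectrum, pool it with triangular mel-spaced filters, take logs, and apply a DCT. The filter count, number of coefficients, and the 2595/700 mel formula are common conventions assumed here; real toolkits add steps such as pre-emphasis and liftering, and use an FFT instead of the slow DFT below.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-window one frame and return its power spectrum (bins 0..N/2)."""
    n = len(frame)
    windowed = [s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    top = hz_to_mel(sample_rate / 2)
    edges_hz = [mel_to_hz(top * i / (n_filters + 1))
                for i in range(n_filters + 2)]
    bins = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bins
    bank = []
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            if lo < k <= mid:
                row.append((k - lo) / (mid - lo))   # rising slope
            elif mid < k < hi:
                row.append((hi - k) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc_frame(frame, sample_rate, n_filters=20, n_ceps=13):
    """MFCCs for one frame: log mel filterbank energies followed by a DCT-II."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_e = [math.log(sum(w * p for w, p in zip(row, spec)) + 1e-10)
             for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e))
            for c in range(n_ceps)]


# Usage: one 16 ms frame of a synthetic 1 kHz tone sampled at 16 kHz.
frame = [math.sin(2 * math.pi * 1000 * i / 16000) for i in range(256)]
ceps = mfcc_frame(frame, 16000)
```

The wide mel filters smooth over individual F0 harmonics, and the final DCT reduces the correlation between neighbouring filter outputs, which connects to the robustness and low-redundancy properties listed above.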

Labelling Speech (W)


Phonemes:

  • Abstract units based on contrastive role in word meanings (e.g. “cat” vs “bat”)
  • 40–50 phonemes in English

Phones:

  • Speech sounds defined by acoustics
  • Many allophones of the same phoneme (e.g. aspirated [pʰ] in “pit” vs. unaspirated [p] in “spit”, both /p/)
  • Limitless in number
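
The phoneme level is what a pronunciation lexicon typically records. The sketch below uses a hypothetical four-word lexicon with ARPAbet-style symbols (an assumption for illustration, not from the slides) to show a minimal pair, and that “pit” and “spit” share one /p/ symbol even though the phones differ.

```python
# Hypothetical mini-lexicon: word -> phoneme sequence (ARPAbet-style symbols).
LEXICON = {
    "cat": ["K", "AE", "T"],
    "bat": ["B", "AE", "T"],
    "pit": ["P", "IH", "T"],
    "spit": ["S", "P", "IH", "T"],
}


def is_minimal_pair(word_a, word_b, lexicon=LEXICON):
    """True if the two words differ in exactly one phoneme at the same position."""
    a, b = lexicon[word_a], lexicon[word_b]
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


cat_bat = is_minimal_pair("cat", "bat")        # contrast carried by /k/ vs /b/
same_p = "P" in LEXICON["pit"] and "P" in LEXICON["spit"]
```

The allophonic difference between the two /p/ sounds is invisible at this level; capturing it is left to the acoustic side of the system.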