Forced Alignment Using P2FA and MFA for Mandarin data


Acquiring a large amount of speech data can be ‘cheap’ and relatively easy today. The traditional way of manually transcribing and segmenting audio recordings is, however, very time-consuming and ‘expensive’. Automatic speech recognition (ASR) algorithms can be extremely helpful here: they can transcribe recordings automatically through speech-to-text, and they can automatically align and synchronise speech signals with phonetic units.

A forced alignment system usually takes an audio file and its corresponding transcript as input and returns a text file that is time-aligned at the phone and word levels. I employed two forced alignment systems: the Penn Phonetics Lab Forced Aligner (P2FA) and the Montreal Forced Aligner (MFA). The former is built on the HTK speech recognition toolkit, while the latter is built on a similar system, the Kaldi ASR toolkit. Many other aligners are based on one of these two toolkits. I’ll briefly walk through how to use them, from data preparation and installation to post-alignment processing, pooling relevant online resources (instead of reinventing the wheel) and adding some of my own snippets of code.

Unix Shell Python


General procedure

  1. Prepare the .wav files
  2. Prepare the transcript files (.txt/.lab/.TextGrid)
  3. Obtain a pronunciation dictionary with canonical phonetic transcription for words/characters
  4. Run the aligner with pre-trained acoustic models
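Before running an aligner, it can help to check that steps 1 and 2 line up, i.e. that every .wav file has a transcript with the same base name. The following is only a minimal sketch under my own assumptions: the function name `pair_inputs` is hypothetical (it is not part of P2FA or MFA), and it assumes the audio and transcript files sit side by side in a single folder.

```python
from pathlib import Path

# Transcript extensions from step 2; adjust to whatever your aligner expects.
TRANSCRIPT_EXTS = (".txt", ".lab", ".TextGrid")


def pair_inputs(corpus_dir):
    """Return (pairs, missing) for a folder of recordings.

    pairs   -- list of (wav_path, transcript_path) tuples
    missing -- list of wav paths that have no matching transcript
    Hypothetical helper for illustration; not part of any aligner.
    """
    corpus = Path(corpus_dir)
    pairs, missing = [], []
    for wav in sorted(corpus.glob("*.wav")):
        # Look for a transcript that shares the audio file's base name.
        candidates = [wav.with_suffix(ext) for ext in TRANSCRIPT_EXTS]
        transcript = next((c for c in candidates if c.exists()), None)
        if transcript is None:
            missing.append(wav)
        else:
            pairs.append((wav, transcript))
    return pairs, missing
```

Calling `pair_inputs("my_corpus")` (the folder name is a placeholder) and printing the `missing` list gives a quick overview of recordings that still need a transcript.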

Post-alignment options

  1. Convert .TextGrid files into a readable table format with temporal information
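To give an idea of what such a conversion involves, here is a rough sketch that pulls the intervals out of a TextGrid and turns them into table rows. It rests on several assumptions of mine: the TextGrid is in Praat's long text format, it contains only interval tiers, and the labels contain no escaped double quotes. The function names `textgrid_to_rows` and `rows_to_tsv` are hypothetical; a dedicated TextGrid library would be more robust for real data.

```python
import re

# Patterns for Praat's long text format (an assumption; short-format files differ).
TIER_RE = re.compile(r'item \[\d+\]:')
NAME_RE = re.compile(r'name = "(.*?)"')
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*xmin = ([\d.eE+-]+)\s*xmax = ([\d.eE+-]+)\s*text = "(.*?)"',
    re.DOTALL,
)


def textgrid_to_rows(textgrid_text):
    """Return a list of (tier, label, start, end) tuples, times in seconds."""
    rows = []
    # Everything before the first "item [n]:" is the file header, so skip it.
    for chunk in TIER_RE.split(textgrid_text)[1:]:
        name_match = NAME_RE.search(chunk)
        tier = name_match.group(1) if name_match else ""
        for start, end, label in INTERVAL_RE.findall(chunk):
            rows.append((tier, label, float(start), float(end)))
    return rows


def rows_to_tsv(rows):
    """Format the rows as tab-separated text with a header line."""
    lines = ["tier\tlabel\tstart\tend"]
    lines += [f"{tier}\t{label}\t{start}\t{end}" for tier, label, start, end in rows]
    return "\n".join(lines)
```

The input is the text of the file, so you would read it first (for example with `open(path, encoding="utf-8").read()`; note that Praat sometimes saves files as UTF-16) and then write the result of `rows_to_tsv` to a .tsv file that opens in R or a spreadsheet program.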

Here I describe in detail how I obtained automatically time-aligned .TextGrid files using open-source software and tools on my MacBook (macOS Mojave 10.14.6). I first introduce how to prepare the input data, including .wav files and transcript files. Then, I demonstrate how to work with the Penn Forced Aligner and the Montreal Forced Aligner respectively. Mandarin Chinese data will be used as an example. I’m trying my best to be clear and hope this is helpful for those who want to achieve similar goals, especially non-programmers and linguistics students.

Click on the chapters in the Table of Contents to START.


If you find this tutorial useful for your research, you can cite this website:

Chenzi Xu 2020, Forced Alignment Using P2FA and MFA for Mandarin data, accessed DD MONTH YEAR, https://chenzixu.rbind.io/resources/1forcedalignment/.


Feel free to share this tutorial!

DISCLAIMER

Feel free to contact me if you have a question or issue, but I’m probably unable to offer personal assistance with your problems (I’m in the middle of my dissertation). In short, this website is not responsible for any trouble you run into. Good luck!