Forced Alignment Using P2FA and Montreal

Acquiring a large amount of speech data can be ‘cheap’ and relatively easy. The traditional way of manually transcribing and segmenting audios is, however, very time-consuming and ‘expensive’. Algorithms of automatic speech recognition (ASR) can be extremely helpful in automatic transcription through speech-to-text, as well as allow for automatic alignment and synchronisation of speech signals to phonetic units.

A forced alignment system usually takes an audio file and its corresponding transcript as input and returns a text file, which is time-aligned at the phone and word levels. I employed two forced alignment systems: the Penn Forced Aligner (P2FA) and the Montreal Forced Aligner. The former is built with the HTK speech recognition toolkit, while the latter with a similar system Kaldi ASR toolkit. Many other aligners are based on one of these two toolkits. I’ll briefly walk through how to use them from data preparation and installation to post-aligning processing, pooling relevant online resources (instead of reinventing the wheels) and adding in some of my own snippets of code.

The general procedure for running the aligners include: 1. Prepare the .wav files 2. Prepare the transcript files (.txt/.lab/.TextGrid) 3. Obtain a pronunciation dictionary with canonical phonetic transcription for words/characters 4. Run the aligner with pre-trained acoustic models

Here I will describe how I managed to acquire automatic time-aligned .Textgrids using open-source softwares and tools on my Mac (Mojave 10.14.6) in details. I will first introduce how to prepare input data including .wav files and transcript files and then how to work with the Penn Forced Aligner and Montreal Forced Aligner respectively. I’m trying my best to be clear and hope this is helpful for those who want to achieve similar goals, especially for non-programmers and linguistic students.


Feel free to leave a comment if you have a question or issue, but I’m probably unable to offer personal assistance to your problems (I’m in the middle of my dissertation). In short, this website is not responsible for any troubles. Good luck!

Different aligners may have different requirements for audio input. Most work well with 16 KHz, 16-bit precision, and mono channel, so it would be safe to format all the target .wav files like that.

The Penn Forced Aligner doesn’t work with 24-bit.wav files (I stumbled on this for quite a while trying to debug).

You can reformat your .wav files in Praat using its interactive graphical interface. To acquire a single channel, select the sound > Convert > Convert to mono or Extract one channel; to change the sampling rate, select the sound > Convert > Resample; then save it as a new .wav. The default is 16-bit in Praat.

Alternatively, you can also use SoX (Sound eXchange) commands in the Terminal. SoX is a collection of handy sound processing utilities. It is also required by P2FA. You can download it here. To reformat the input.wav, we can use the following command in the Terminal having installed SoX.

$ sox input.wav -r 16k -b 16 -c 1 output.wav

The $ isn’t part of the command. It indicates that this is a Shell script in the Terminal. The flags here: -r , -b, -c define the sampling rate, precision, and number of channel of the output.wav respectively.

I’m using Mandarin Chinese here as an example. My goal here is to align each Chinese character (each syllable) with its corresponding sound interval. Thus I need to prepare transcript files in which there is a space between any two characters. Besides, there are many audio files in a corpus to be aligned. We would want to name all the transcript files accordingly (matching with the .wav filename) so that we can write a loop script to run an aligner for all of the .wav files at once. Since .txt files work for both aligners, I’ll demonstrate how to prepare a .txt file (relatively efficiently) for each .wav file. One way to do this is:

  1. Put all the .wav files in a directory and open Terminal and navigate to that directory.
  2. In Terminal:

    $ ls *.wav >> list.txt 

    we obtain a .txt file with each .wav filename being a row in one column. Then we add in the corresponding transcript of the audio on the same row follows the .wav filename. So an example of the updated list.txt file is:

1_1_101.wav 他们堆的雪人很稳当
1_2_101.wav 他们堆的城堡很稳当
1_3_101.wav 他们堆的台阶很稳当

There are many ways to achieve this depending on how you obtain your text transcripts. For my project, I have pre-designed speech stimuli, so I can just copy them over and slightly modify them manually when speakers made any variation. How to obtain orthography transcripts efficiently for spontaneous speech? I think of open-source speech-to-text APIs (e.g. Google’s Speech-to-Text or Baidu’s DeepSpeech). But there are often some restrictions. I may build a pipeline from speech-to-text to forced alignment in the near future and post it here.

  1. Then we can make use of bash shell commands to insert a space between every character:

    $ sed -e 's/./& /' list.txt

    So the list.txt is as follows:

    1_1_101.wav 他 们 堆 的 雪 人 很 稳 当
    1_2_101.wav 他 们 堆 的 城 堡 很 稳 当
    1_3_101.wav 他 们 堆 的 台 阶 很 稳 当
  2. We want the characters in each line to be an independent .txt file whose filename is the first field but with .txt extension. File-naming is very important because you need to think about the next step.

To use P2FA for a group of files in a directory, I used a shell script calling the .py in a loop. So it would be easy to use the .wav filename (e.g. 1_1_101.wav) as a common stem and a variable $i so that the corresponding .txt filename is $i.txt (e.g. 1_1_101.wav.txt).

$ cat list.txt | while read line || [ -n "$line" ]; do echo $line | awk '{$1=""}1'| awk '{$1=$1}1' > $(cut -d " " -f1 <<< $line).txt; done < list.txt

The first awk here removes the first field and the second awk removes the leading space (by redefine the beginning as the first string).

To use the Montreal aligner we want the paired .txt and .wav to have exactly the same filename (e.g. 1_1_101.txt and 1_1_101.wav). An easier way is probably change the extensions in the first column in list.txt first to make them the filenames of our output .txt files before splitting it into individual files.

$ sed 's/\.wav/\.txt/' list.txt 

(you can also use Find and Replace in your text editor. Do whichever is easier!)

$ cat list.txt | while read line || [ -n "$line" ]; do echo $line | awk '{$1=""}1'| awk '{$1=$1}1' > $(cut -d " " -f1 <<< $line); done < list.txt

By doing so, our corresponding .txt files are ready.

1.3.1 Installation The Penn Forced Aligner (P2FA) can be downloaded from here. They have a version for American English and one for Mandarin Chinese, though there’s only scant documentation. The installation of P2FA can be a bit of hassle. Fortunately we have detailed instructions: For Mac users: Check Will Styler’s post. The installation is the same for English and Chinese versions. For Windows users: Check Cong Zhang’s post. I rewrite a bit of the python code for the Mandarin version and make it a python 3 script since python 2 is getting outdated. Check my Github repository[P2FA_Mandarin_py3] for it.

1.3.2 Pronunciation Dictionary Before running the aligner, we need to make sure that the pronunciation dictionary /P2FA_Mandarin/run/model/dict contains all the characters appeared in our transcripts. Again, Bash Shell commands can help us with that (idea from here, adapted for Chinese orthography). First we obtain a wordlist from our transcripts. Continuing with the above example /list.txt, we make a copy of it, and in Terminal we navigate to this directory.

$ tr ' ' '\n' < list.txt|sort|uniq -c|sed 's/^ *//'|sort -r -n > wordlist.txt

Tip: No space in front of sort! Otherwise you might get the error message: “Command not found”, since Bash is sensitive to spaces when you’re piping.

This command generates a wordlist.txt file in which each unique Chinese character is lining up as a single column. The command also gives you the corresponding frequency count of each character in the first column. Then we want to compare it against the dictionary. We can also make a copy of the dict file /dict copy (so that we don’t ruin it by accident). If the character in wordlist.txt is also in the dictionary, then the corresponding dictionary line is extracted.

$ cut -d ' ' -f 2 wordlist.txt | sed 's/^/^/'| sed 's/$/ /' >tmp.txt 

(-d ‘’: this flag specifies the delimiter is a space. Put the column of characters into regular expression format for locating the beginning of a line(^))

$ egrep --file=tmp.txt dict\ copy > words_phones.txt

There are some duplicated rows in the dict file. So we could do the following: $ words_phones.txt|uniq -c|sed ’s/^ *//’ >words_phones2.txt The idea is to sort the Chinese characters the same way in /wordlist.txt and /words_phones2.txt so that we can use the join command to see the record(s) that do not match.

$ sort -k 2 wordlist.txt >tmp1.txt

The problem is that even if we sort the column of characters in the grepped dict file /words_phones2.txt, the sorting result is influenced by the third field of letters. So we decided to extract only the column of the Chinese characters of /words_phones2.txt and sort it.

$ awk '{print $2}' words_phones2.txt|sort> tmp2.txt

Then we find out whether there are any characters in tmp1.txt but missing in tmp2.txt:

$ join -v 1 -1 2 -2 1 tmp1.txt tmp2.txt >missingwords.txt

(-v 1: this flag displays the non-matching records of the file 1. The following -1 2 -2 1: file 1 second column or field; file 2 first column) This /missingwords.txt lists the missing Chinese characters and you can manually add them to the original /dict file in the /model.

1.3.3 Running P2FA Running P2FA is easy when you have all the input files prepared as required. You just need one single line in the Terminal calling the and filling in relevant arguments: .wav file path, .txt file path, (output) .Textgrid file path. This script returns the short form .Textgrid file. If you want to run the aligner for all of the audio files in a directory, you can make use of a loop structure:

$ for i in *.wav; do python $i $i.txt $i.TextGrid; done

In the Github repository /P2FA_Mandarin there’s also, which returns the output in .mlf with table-like form, as shown in the following example:

0 8500000 sp 3079.143311 sp
8500000 8800000 n -1.651408 你
8800000 9700000 i 73.151802

I also made a Python script to convert files in .mlf to .Textgrid (short form).

Dr Chenzi Xu
Dr Chenzi Xu
Leverhulme Early Career Fellow

My research interests include speech prosody, speech perception, and speech technology.