Use Montreal Forced Aligner
4.1 Installation
The Montreal Forced Aligner (MFA) can be downloaded from its project website. The official documentation is detailed, so I will only cover installation briefly.
After you download the .zip archive and unzip it, open a terminal window and navigate to the MFA directory. Then create an input folder for all your .wav files and their corresponding .txt files, and an output folder where the time-aligned .TextGrid files will be created.
Warning: do not use /Desktop or other important directories as the output directory! Create a new, empty directory for it, because MFA DELETES EVERYTHING in the output directory. (What a lesson for me! My desktop was cleared in a second. Oooops >.<)
The Montreal Forced Aligner has pretrained acoustic models and pretrained grapheme-to-phoneme (G2P) models for a number of languages. They are listed in the MFA documentation.
Download the pretrained acoustic model, and the G2P model if available, and put them in /MFA/pretrained_models/.
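Given how easily the output directory can wipe out existing files, a small guard like the following can help. This is only a sketch of my own, not part of MFA: the function name prepare_output_dir is hypothetical, and it simply refuses to hand back a directory that already contains anything.

```python
from pathlib import Path


def prepare_output_dir(path):
    """Create a fresh output directory, refusing to reuse a non-empty one.

    Hypothetical helper (not part of MFA): it only checks and creates the
    directory, it never deletes anything.
    """
    out = Path(path)
    if out.exists():
        if not out.is_dir():
            raise NotADirectoryError(f"{out} exists and is not a directory")
        # any() stops at the first entry, so this is cheap even for big folders.
        if any(out.iterdir()):
            raise FileExistsError(f"{out} is not empty; pick a new directory")
    # Safe to create (or reuse, if it already exists and is empty).
    out.mkdir(parents=True, exist_ok=True)
    return out
```

Calling prepare_output_dir("output") from the MFA directory before aligning would then fail loudly instead of letting MFA clear a folder you care about.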
4.2 Pronunciation Dictionary
The pronunciation dictionary here refers to a text file in which each line consists of an orthographic transcription followed by its phonetic transcription. The phones in the dictionary must match those in the acoustic model, and the orthography must match that in the transcripts.
If there is a pre-trained G2P model, we can generate a customised pronunciation dictionary from our transcripts. A Mandarin dictionary example:
bin/mfa_generate_dictionary pretrained_models/mandarin_character_g2p.zip input/ mandarin_dict.txt
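Since the orthography in the dictionary has to match the transcripts, it can be worth checking for words that have no entry before aligning. The sketch below is my own and rests on a few assumptions: each dictionary line is a word followed by whitespace-separated phones, transcripts are whitespace-separated tokens, and matching is exact (no case folding or punctuation stripping). The function names load_dictionary and missing_words are hypothetical.

```python
from pathlib import Path


def load_dictionary(dict_path):
    """Map each word to its list of pronunciations (each a list of phones)."""
    entries = {}
    for line in Path(dict_path).read_text(encoding="utf-8").splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and lines without any phones
        word, phones = fields[0], fields[1:]
        # A word may appear on several lines with different pronunciations.
        entries.setdefault(word, []).append(phones)
    return entries


def missing_words(transcript_dir, entries):
    """Return the set of transcript tokens that have no dictionary entry."""
    missing = set()
    for txt in sorted(Path(transcript_dir).glob("*.txt")):
        for token in txt.read_text(encoding="utf-8").split():
            if token not in entries:
                missing.add(token)
    return missing
```

For the Mandarin example, missing_words("input", load_dictionary("mandarin_dict.txt")) should come back empty if the dictionary was generated from the same transcripts. Keep the dictionary file outside the input folder, otherwise it would be read as a transcript too.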
4.3 Running Montreal Forced Aligner
When you have prepared the following, you’re ready to go!
- All .wav files are 16 kHz, 16-bit, mono.
- Each .wav file has a .txt transcript file with a matching filename, and both are in the /input/ folder.
- You have generated a pronunciation dictionary from all the transcripts.
- You have created an empty /output/ folder.
- You have downloaded or trained an acoustic model (as a .zip) and put it in the /pretrained_models/ folder.
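The first two items on this checklist are easy to get wrong by hand, so here is a rough pre-flight check. It is a sketch of my own using Python's standard wave module, assuming uncompressed PCM .wav files sitting directly in the input folder; the function name check_corpus is hypothetical.

```python
import wave
from pathlib import Path


def check_corpus(input_dir):
    """Return a list of human-readable problems found in the input folder."""
    problems = []
    for wav_path in sorted(Path(input_dir).glob("*.wav")):
        name = wav_path.name
        # Each recording needs a transcript with the same base filename.
        if not wav_path.with_suffix(".txt").exists():
            problems.append(f"{name}: no matching .txt transcript")
        with wave.open(str(wav_path), "rb") as wav:
            if wav.getframerate() != 16000:
                problems.append(f"{name}: sample rate is {wav.getframerate()} Hz, not 16000")
            if wav.getsampwidth() != 2:  # sample width is in bytes: 2 bytes = 16 bits
                problems.append(f"{name}: sample width is {8 * wav.getsampwidth()}-bit, not 16-bit")
            if wav.getnchannels() != 1:
                problems.append(f"{name}: {wav.getnchannels()} channels, not mono")
    return problems
```

Running print("\n".join(check_corpus("input"))) from the MFA directory lists every file that needs resampling or a transcript. An empty result means these two checklist items are satisfied.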
Continuing the Mandarin example from Section 4.2:
bin/mfa_align input/ mandarin_dict.txt pretrained_models/mandarin.zip output/
You can also use mandarin in the command above, without the .zip extension, once you have downloaded the pretrained model.
MFA can also train an acoustic model on your own data set and align that same data set with it, without a pretrained model.
bin/mfa_train_and_align input/ mandarin_dict.txt output/
Use the flag -o PATH to save the acoustic model for future use.
I assume this works best with a large data set. I tried this option, but my own data set was not big enough, so the alignments tended to derail.
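If you script your experiments, the train-and-align command with the optional -o flag can be assembled as an argument list. This is just a sketch of my own: the helper name train_and_align_command and the model filename in the usage note are hypothetical, and only the executable, the three positional arguments, and -o come from the commands shown above.

```python
def train_and_align_command(corpus_dir, dict_path, output_dir, model_path=None):
    """Build the argument list for bin/mfa_train_and_align.

    model_path is optional; when given, it is passed with -o so that the
    trained acoustic model is saved for future use.
    """
    cmd = ["bin/mfa_train_and_align", corpus_dir, dict_path, output_dir]
    if model_path is not None:
        cmd += ["-o", model_path]
    return cmd
```

For example, train_and_align_command("input/", "mandarin_dict.txt", "output/", "my_model.zip") gives a list that can be handed to subprocess.run(cmd, check=True) from inside the MFA directory. Building a list rather than a single string avoids quoting problems when paths contain spaces.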