Use Montreal Forced Aligner

The following tutorial is based on the legacy version of MFA (v1.1.0). For the most recent MFA (v3.0.0+), please see my updated guide, "New Update: A Gentle Guide to Montreal Forced Aligner".

4.1 Installation

The Montreal Forced Aligner (MFA) can be downloaded from here. Its documentation is detailed, so I will only cover installation briefly.

The stable version 1.0.1 does NOT work on my Mac (macOS 10.14.6), but version 1.1.0 Beta 2 works well.

After you download the .zip archive and unzip it, open a terminal window and navigate to the MFA directory. Then create an input folder for all your .wav files and their corresponding .txt transcripts, and an output folder where the time-aligned .TextGrid files will be written.

Never use your Desktop or any other important directory as the output directory! Create a new, dedicated directory for it, because MFA DELETES EVERYTHING in the output directory. (What a lesson for me! My desktop was cleared in a second. Oooops >.<)
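To avoid repeating my mistake, a small guard like the one below could be run before every alignment. It is only a sketch: the function name prepare_output_dir and the list of "dangerous" locations are my own assumptions, not something MFA provides.

```python
from pathlib import Path


def prepare_output_dir(path: str) -> Path:
    """Create a fresh output directory, refusing to reuse a non-empty one."""
    out = Path(path).expanduser().resolve()
    # Assumption: treat the home directory and the Desktop as off-limits.
    home = Path.home().resolve()
    if out in (home, home / "Desktop"):
        raise ValueError(f"refusing to use {out} as an MFA output directory")
    if out.exists():
        if not out.is_dir():
            raise ValueError(f"{out} exists and is not a directory")
        # Anything already in here would be lost, so stop instead.
        if any(out.iterdir()):
            raise ValueError(f"{out} is not empty; MFA would delete its contents")
    else:
        out.mkdir(parents=True)
    return out
```

Calling prepare_output_dir("output") from the MFA directory either returns an empty folder or raises an error before anything can be deleted.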

The Montreal Forced Aligner has pretrained acoustic models and pretrained grapheme-to-phoneme (G2P) models for a number of languages. You can check them out here.

Download the pretrained acoustic model and, if one is available, the G2P model, and put them in /MFA/pretrained_models/.

NO need to unzip the model files.

4.2 Pronunciation Dictionary

The pronunciation dictionary here refers to a text file in which each line consists of an orthographic transcription (a word) followed by its phonetic transcription. The phones in the dictionary must match the ones in the acoustic model, and the orthography must match that in the transcripts.
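The two matching requirements can be checked before running the aligner. The sketch below assumes the usual layout of one word followed by whitespace-separated phones per line, and assumes transcripts are tokenised by whitespace. The helper names and the ARPAbet-style sample entries are purely illustrative and are not taken from the Mandarin model.

```python
from typing import Dict, List, Set

Entries = Dict[str, List[List[str]]]


def parse_dictionary(text: str) -> Entries:
    """Map each word to its pronunciations (each one a list of phones)."""
    entries: Entries = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError(f"line {line_no}: word {fields[0]!r} has no phones")
        word, phones = fields[0], fields[1:]
        # A word may appear on several lines with different pronunciations.
        entries.setdefault(word, []).append(phones)
    return entries


def unknown_phones(entries: Entries, model_phones: Set[str]) -> Set[str]:
    """Phones used in the dictionary that the acoustic model does not know."""
    used = {p for prons in entries.values() for pron in prons for p in pron}
    return used - model_phones


def missing_words(entries: Entries, transcript: str) -> Set[str]:
    """Transcript words (split on whitespace) with no dictionary entry."""
    return {w for w in transcript.split() if w not in entries}


# Illustrative sample only; a real dictionary would come from the G2P step.
sample = "hello HH AH0 L OW1\nworld W ER1 L D\n"
entries = parse_dictionary(sample)
```

If either unknown_phones or missing_words returns a non-empty set, the dictionary and the model or transcripts are out of sync.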

If there is a pre-trained G2P model, we can generate a customised pronunciation dictionary from our transcripts. A Mandarin dictionary example:

bin/mfa_generate_dictionary pretrained_models/mandarin_character_g2p.zip input/ mandarin_dict.txt

4.3 Running Montreal Forced Aligner

When you have prepared the following, you’re ready to go!

All .wav files are 16 kHz, 16-bit, mono
Each .wav file has a .txt transcript file with a matching filename, and both are in the /input/ folder
You have generated a pronunciation dictionary from all the transcripts
You have created an empty /output/ folder
You have downloaded or trained an acoustic model (a .zip file) and put it in the /pretrained_models/ folder
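The first two items on this checklist are easy to get wrong in a large corpus, so they can be verified with a short script. This is a minimal sketch using Python's standard wave module. The function name check_corpus is my own, and it assumes all files sit directly in one folder with lowercase .wav and .txt extensions.

```python
import wave
from pathlib import Path
from typing import List


def check_corpus(input_dir: str) -> List[str]:
    """Return a list of human-readable problems found in the input folder."""
    problems: List[str] = []
    wav_paths = sorted(Path(input_dir).glob("*.wav"))
    if not wav_paths:
        problems.append(f"no .wav files found in {input_dir}")
    for wav_path in wav_paths:
        # Each recording needs a transcript with the same base name.
        if not wav_path.with_suffix(".txt").is_file():
            problems.append(f"{wav_path.name}: missing transcript")
        try:
            with wave.open(str(wav_path), "rb") as wav:
                rate = wav.getframerate()
                width_bits = wav.getsampwidth() * 8
                channels = wav.getnchannels()
        except (wave.Error, EOFError) as exc:
            problems.append(f"{wav_path.name}: unreadable ({exc})")
            continue
        # Expected format from the checklist: 16 kHz, 16-bit, mono.
        if (rate, width_bits, channels) != (16000, 16, 1):
            problems.append(
                f"{wav_path.name}: {rate} Hz, {width_bits}-bit, {channels} channel(s)"
            )
    return problems
```

An empty list from check_corpus("input") means the audio format and transcript pairing look fine. It does not check the dictionary or the acoustic model.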

Continuing from the previous example:

bin/mfa_align input/ mandarin_dict.txt pretrained_models/mandarin.zip output/

Once you have downloaded the pretrained model, you can also write mandarin in the command above, without the .zip extension.

MFA can also train an acoustic model on your own data and align with it, using only the dataset itself (no pretrained model needed).

bin/mfa_train_and_align input/ mandarin_dict.txt output/

Use the flag -o PATH to save the acoustic model for future use.
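If you run these commands often, it may help to build them in a script. The sketch below only assembles the argument lists for the two commands shown above. The function names are hypothetical, my_model.zip is a placeholder path, and appending -o after the positional arguments is my assumption about where the flag can go.

```python
from typing import List, Optional


def build_align_command(
    corpus: str, dictionary: str, acoustic_model: str, output: str
) -> List[str]:
    """Arguments for aligning with an existing acoustic model."""
    return ["bin/mfa_align", corpus, dictionary, acoustic_model, output]


def build_train_command(
    corpus: str, dictionary: str, output: str, save_model_to: Optional[str] = None
) -> List[str]:
    """Arguments for training on the corpus itself and then aligning."""
    cmd = ["bin/mfa_train_and_align", corpus, dictionary, output]
    if save_model_to is not None:
        # Assumption: the -o flag is accepted after the positional arguments.
        cmd += ["-o", save_model_to]
    return cmd


align_cmd = build_align_command(
    "input/", "mandarin_dict.txt", "pretrained_models/mandarin.zip", "output/"
)
train_cmd = build_train_command(
    "input/", "mandarin_dict.txt", "output/", save_model_to="my_model.zip"
)
```

Either list can then be passed to subprocess.run(cmd, check=True) from inside the MFA directory.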

I assume this option works best with a large dataset. I tried it, but my own dataset was not big enough, so the alignments tended to derail.

Last updated on May 8, 2024