New Update: A Gentle Guide to Montreal Forced Aligner
It has been a few years since my previous tutorial on the Montreal Forced Aligner (MFA). MFA is continuously evolving and becoming increasingly powerful. In this new tutorial, I would like to introduce a more sophisticated general workflow for producing time-aligned phone boundaries for languages that have a pretrained acoustic model and pretrained dictionary, especially if you are working with large datasets.
*The following workflow has been tested on M-chip Macs and Linux (montreal-forced-aligner v3.0.0+, May 2024).
- (Advanced) MFA Forced Alignment Workflow
6.1. Organisation of working directory and sanity check
An example structure of the working directory is as follows.
project/
├── corpus/
│ ├── recording1.wav
│ ├── recording1.TextGrid
│ ├── recording2.wav
│ ├── recording2.TextGrid
│ └── ...
├── txts/
│ ├── recording1.txt
│ ├── recording2.txt
│ └── ...
├── output/
└── text2tg.py
All .wav audio files of a corpus are in the /corpus/ directory, and all transcript files (.txt) are in the /txts/ directory. The generated input TextGrid files can be added to the /corpus/ directory.
For very large corpora, I usually write a script to sanity-check the original corpus files, so that we know whether there are any filename inconsistencies and whether each audio file has a transcript.
6.2. Setting up the Montreal Forced Aligner
Install MFA and activate conda environment
conda config --add channels conda-forge # enable the conda-forge channel by default
conda create -n aligner montreal-forced-aligner # create a new environment for forced alignment and install MFA into it
conda activate aligner # activate the environment
For Conda installation, check here. Feel free to check out the official MFA installation guide.
To update your MFA:
conda update -c conda-forge montreal-forced-aligner kalpy kaldi=*=cpu* --update-deps
Download pretrained models
mfa model download acoustic mandarin_mfa
mfa model download dictionary mandarin_china_mfa
mfa model download g2p mandarin_china_mfa # if needed
There are a few different pretrained dictionaries, covering Standard Mandarin, Beijing Mandarin, and Taiwan Mandarin; please check them out.
The Montreal Forced Aligner has many pretrained models for a number of languages. You can check them out here.
6.3. Generating input TextGrids from transcripts
In many scenarios, the transcript for each audio recording is a plain text .txt file with the same filename as the audio. I recommend converting the plain-text transcripts to input .TextGrid files (all text in a single tier), with the tier name set to the speaker ID, since MFA automatically performs speaker adaptation. This can be achieved with a customised Python script, which depends on the specific format of the transcript files.
For example, in my ASR tutorial that works on datasets from Common Voice, I created a cv15_totgs.py script that generates input TextGrids for forced alignment.
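For the simple case of one transcript covering the whole recording, the input TextGrid can be written with the standard library alone, since the TextGrid format is plain text. A sketch (the function names are mine; adapt the reading step to your transcript format):

```python
import wave
from pathlib import Path

def make_textgrid(text, speaker, duration):
    """Return the contents of a one-tier TextGrid covering the whole file.
    The tier name is the speaker ID, as MFA uses it for speaker adaptation."""
    return "\n".join([
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        '',
        'xmin = 0',
        f'xmax = {duration}',
        'tiers? <exists>',
        'size = 1',
        'item []:',
        '    item [1]:',
        '        class = "IntervalTier"',
        f'        name = "{speaker}"',
        '        xmin = 0',
        f'        xmax = {duration}',
        '        intervals: size = 1',
        '        intervals [1]:',
        '            xmin = 0',
        f'            xmax = {duration}',
        f'            text = "{text}"',
    ]) + "\n"

def txt_to_textgrid(wav_path, txt_path, speaker):
    """Write recordingX.TextGrid next to recordingX.wav."""
    with wave.open(str(wav_path), "rb") as w:
        duration = w.getnframes() / w.getframerate()
    text = Path(txt_path).read_text(encoding="utf-8").strip()
    out = Path(wav_path).with_suffix(".TextGrid")
    out.write_text(make_textgrid(text, speaker, duration), encoding="utf-8")
```

Looping txt_to_textgrid over the paired files from /corpus/ and /txts/ then produces one input TextGrid per recording.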
📋 Tips for Input Transcript Text
- Unknown sounds, laughter, and coughing can be represented as {LG}
- Non-speech vocalisations that are close to silence, such as breathing and exhalation, can be represented as {SL}
- sil and spn are two special phones for non-speech annotations recognised by pretrained MFA dictionaries: {LG} is mapped to spn, and {SL} to sil
- I tend to remove all punctuation in Chinese transcripts.
- Adding a space between Chinese characters enables syllable-level alignment.
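The last two tips can be combined in a small helper. This sketch assumes the transcript contains only Chinese characters and punctuation; markers like {LG} and {SL} would be stripped by it and need separate handling:

```python
import re

def clean_zh_transcript(text):
    """Remove punctuation (Chinese and Western) and whitespace, then
    insert a space between the remaining characters so that MFA can
    align at the syllable level."""
    chars = re.sub(r"[\W_]", "", text)  # \W matches punctuation and spaces, not CJK characters
    return " ".join(chars)
```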
6.4. Data validation and solutions for OOVs
Validate the prepared data and pretrained models before the actual alignment. The mfa validate command parses the corpus and dictionary, generates a nice report summarising information about your corpus, and logs potential issues, including out-of-vocabulary items (OOVs), i.e. any words in the transcriptions that are not in the dictionary.
cd (your project path)
mfa validate [OPTIONS] CORPUS_DIRECTORY DICTIONARY_PATH
mfa validate corpus/ mandarin_china_mfa mandarin_mfa
Obtaining the Out-Of-Vocabulary (OOV) items list
In the latest version of MFA, you can find a list of the OOVs, and the utterances they appear in, inside your MFA folder. On a Mac, you will most likely find them in ~/Documents/MFA/(your project)/oovs_found_xxx.txt. The file will be empty if there are no OOVs.
Backtracking easy-to-fix typos
Sometimes missing words in the dictionary result from spelling mistakes. Using the information in ~/Documents/MFA/(your project)/oovs_utterances.txt from the validation output, we can back-trace the words with typos and fix them in the transcript texts.
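One way to back-trace them is to search for each OOV across the transcript files. A hypothetical sketch (the file name oovs_found.txt stands in for your actual oovs_found_xxx.txt):

```python
from pathlib import Path

def locate_oovs(oov_words, transcripts):
    """Map each OOV to the transcript files whose text contains it.
    transcripts: {filename: text}."""
    return {w: sorted(f for f, text in transcripts.items() if w in text)
            for w in oov_words}

def load_transcripts(txt_dir="txts"):
    """Read all .txt transcripts in the directory into a dict."""
    return {p.name: p.read_text(encoding="utf-8")
            for p in Path(txt_dir).glob("*.txt")}

if __name__ == "__main__":
    oov_file = Path("oovs_found.txt")  # hypothetical name; use your oovs_found_xxx.txt
    if oov_file.exists():
        oovs = oov_file.read_text(encoding="utf-8").split()
        for word, files in locate_oovs(oovs, load_transcripts()).items():
            print(word, "->", ", ".join(files))
```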
Generating pronunciation dictionary for remaining OOVs
For the remaining OOVs, saved in an oovs.txt file, we can generate a dictionary entry for each OOV using the pretrained Grapheme-to-Phoneme (G2P) model.
cd ~/Documents/MFA/project # your project directory inside the MFA directory
mfa g2p [OPTIONS] INPUT_PATH G2P_MODEL_PATH OUTPUT_PATH
mfa g2p oovs.txt mandarin_china_mfa oovs.dict
# add probabilities to a dictionary (**optional**)
mfa train_dictionary [OPTIONS] CORPUS_DIRECTORY DICTIONARY_PATH ACOUSTIC_MODEL_PATH OUTPUT_DIRECTORY
mfa train_dictionary --clean ~/project/corpus/ oovs.dict mandarin_mfa ~/project/
#combine pretrained dictionary and dictionary for OOVs
cat oovs.dict ~/Documents/MFA/pretrained_models/dictionary/mandarin_china_mfa.dict > ~/project/mandarin_new.dict
6.5. MFA alignment
When there are no other issues after validation, we can start the forced alignment. You can try adjusting the parameters and comparing the output TextGrids. You don’t have to do all four of the following runs – any of them will generate a set of output TextGrids in the /output/ directory.
Initial run
This is the straight-out-of-the-box baseline output that you can compare to. This might be sufficient for small corpora with no OOVs.
mfa align [OPTIONS] CORPUS_DIRECTORY DICTIONARY_PATH ACOUSTIC_MODEL_PATH OUTPUT_DIRECTORY
mfa align corpus/ mandarin_china_mfa mandarin_mfa output/tgs0/
Second run with updated dictionary
mfa align corpus/ mandarin_new.dict mandarin_mfa output/tgs1/
Third run with a larger beam parameter
When each audio file has a very long transcript, I would suggest using a larger beam size in decoding (e.g. --beam 100, instead of the default of 10). This will of course increase the alignment time.
mfa align --clean corpus/ mandarin_new.dict mandarin_mfa output/tgs2/ --beam 100
Fourth run with adapted acoustic model to new data
You can also fine-tune the pretrained acoustic model to see if it gives better alignment results.
mfa adapt [OPTIONS] CORPUS_DIRECTORY DICTIONARY_PATH ACOUSTIC_MODEL_PATH OUTPUT_MODEL_PATH
mfa adapt --clean corpus/ mandarin_new.dict mandarin_mfa output/model/ output/tgs3/ --beam 100