ASR from Scratch III: Training Models for Bora, a Low-Resource Language (MFA)


In this chapter, we focus on Bora, an endangered indigenous language of South America spoken primarily in the western Amazon rainforest. I will demonstrate how to train a mini Bora acoustic model for forced alignment. Even with just 1.5 hours of Bora speech from Dr Jose Elias-Ulloa’s fieldwork, we can produce surprisingly decent time-aligned TextGrids.

This tutorial is designed for researchers working on low-resource languages, where open-source speech materials are scarce and pre-trained models are not available.

For the installation of MFA, please refer to §4.1 in the previous chapter or the official installation guide.

5.1 The fieldwork dataset

Phonetic fieldwork on a lesser-studied language often begins with recording word lists and short stories. The Bora data used here consist of 913 sound files (.wav), totalling about 1.55 hours of speech from one Bora speaker. The majority of these files each feature a single word repeated three times with pauses in between (see below), while 152 files contain longer utterances drawn from short stories.

An example recording of a Bora word repeated three times
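
If you want to double-check the size of your own corpus before going further, a quick sketch with the soundfile package can count the files and sum their durations (the script name corpus_stats.py is hypothetical, and it assumes the recordings sit in bora_corpus/ as laid out below):

# corpus_stats.py (hypothetical helper)
# Count the wav files in the corpus and sum their durations.
import glob
import soundfile as sf

wav_paths = sorted(glob.glob("bora_corpus/*.wav"))
total_seconds = 0.0
for path in wav_paths:
    info = sf.info(path)            # reads only the header, not the audio data
    total_seconds += info.duration  # duration in seconds

print(f"{len(wav_paths)} wav files, {total_seconds / 3600:.2f} hours in total")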

In this case, we also have the list of words that the informant read aloud. For instance, in a text file wordlist.txt, we have:

...
gábuuve
gáyaga
gayága
garága
gapáco
gááraco
gáárapo
gáápaco
gaañúro
...

For longer utterances, each audio file has a transcription roughly marked in a corresponding TextGrid file as follows:

Bora Transcription Illustration for 02_sp_Panduro_ROR001_20250212_016.wav

These utterances are ready for training an acoustic model using MFA. All sound files are organised in a directory bora_corpus/ under the project directory bora/. Temporarily, we can place all the word-list recordings in a separate working directory, words_in_isolation/, to make transcript generation easier.

├── bora
... ├── bora_dict.txt
    ├── wordlist.txt
    ├── create_input_tgs_from_wordlists.py
    ├── create_input_tgs_with_sils.praat
    ├── words_in_isolation # temporary working directory
    |   ├── 01_sp_Panduro_BOR001_20250203_001.wav
    |   ├── 01_sp_Panduro_BOR001_20250203_002.wav
    |   └── ...
    ├── ...
    └── bora_corpus
        ├── 01_sp_Panduro_BOR001_20250203_001.wav
        ├── 01_sp_Panduro_BOR001_20250203_002.wav
        ├── 01_sp_Panduro_BOR001_20250203_003.wav
        ├── ...
        └── ... #913 wav items in total

5.2 Data Preprocessing

5.2.1 Transcript preparation: initial TextGrids in two approaches

The next step is to prepare an input transcription for each word-list sound file. I recommend using the .TextGrid format with MFA (.txt or .lab should work too). Here I demonstrate two options for the input TextGrids.

Illustration of the two options of input TextGrids

Option ❶ for the word list transcription:

The following Python script, create_input_tgs_from_wordlists.py, prepares the initial transcript files: it creates a corresponding .TextGrid file for each audio recording, extracts the matching word from the word list, and writes it to a word tier (see the figure above, Tier 1). Note that the audio filenames are neatly ordered and correspond to the order of the word list.

To run this Python script, do the following in your Unix shell:

cd ~/Wip/bora # your project directory
python create_input_tgs_from_wordlists.py

You will need to install the soundfile and praatio packages in your environment first (e.g. pip install soundfile praatio).

# create_input_tgs_from_wordlists.py
# Created by Chenzi Xu on 31/07/2025

import os
import soundfile as sf
from praatio import textgrid

# Paths to the audio folder and the word list
audio_folder = "words_in_isolation"
wordlist_file = "wordlist.txt"

# Load words from the word list
with open(wordlist_file, "r", encoding="utf-8") as f:
    words = [line.strip().split()[0] for line in f if line.strip()]

# Sort audio files (make sure the order is consistent with the word list)
audio_files = sorted([f for f in os.listdir(audio_folder) if f.endswith(".wav")])

# Check count matches
if len(audio_files) != len(words):
    raise ValueError(f"Mismatch: {len(audio_files)} wav files vs {len(words)} words")

# Loop over each file and create TextGrid
for i, (wav_file, word) in enumerate(zip(audio_files, words)):
    wav_path = os.path.join(audio_folder, wav_file)

    with sf.SoundFile(wav_path) as f:
        duration = len(f) / f.samplerate

    label = f"{word} {word} {word}"  # each recording contains the word spoken three times

    tg = textgrid.Textgrid()
    tier = textgrid.IntervalTier("word", [(0.0, duration, label)], 0.0, duration)
    tg.addTier(tier)

    base_name = os.path.splitext(wav_file)[0]
    tg_name = f"{base_name}.TextGrid"
    tg.save(
        os.path.join(audio_folder, tg_name),
        format="long_textgrid",
        includeBlankSpaces=True,
    )

print("Done! TextGrids created for all audio files.")

Option ❷ for the word list transcription: Bootstrapped input TextGrids

The problem with the first approach is that, when the dataset is extremely small, the onset boundaries of words tend to be messy in the output. One way to address this is to create bootstrapped input TextGrids that provide more information (e.g. initial time boundaries) about the speech intervals. We can, for instance, use the word tier of the first-pass output as input and rerun the training and/or alignment.
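
A minimal sketch of this bootstrapping step with praatio follows (it assumes praatio ≥ 6, that the first-pass output TextGrids sit in tgs/, and that MFA named the word tier "words", its default for single-speaker output; the script name is hypothetical):

# bootstrap_word_tiers.py (hypothetical helper)
# Copy the word tier of the first-pass MFA output back into the working folder
# as new input TextGrids, so that a second pass starts from better word boundaries.
import os
from praatio import textgrid

first_pass_dir = "tgs"             # first-pass MFA output (assumed location)
corpus_dir = "words_in_isolation"  # where the input TextGrids live

for fname in os.listdir(first_pass_dir):
    if not fname.endswith(".TextGrid"):
        continue
    tg = textgrid.openTextgrid(os.path.join(first_pass_dir, fname), includeEmptyIntervals=False)
    new_tg = textgrid.Textgrid()
    new_tg.addTier(tg.getTier("words"))  # keep only the word tier
    new_tg.save(
        os.path.join(corpus_dir, fname),
        format="long_textgrid",
        includeBlankSpaces=True,
    )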

Alternatively, we can use the following Praat script to generate initial word boundaries through silence/speech detection (see the figure above, Tier 2). This method works well when the word-list recordings are highly consistent and contain little environmental noise (good, clean recordings).

In this Praat script, the key parameters we need to consider are in this line:

To TextGrid (silences): 75, 0, -35, 0.15, 0.1, "", word$
  • Parameters for the intensity analysis:
    • Pitch floor (Hz): 75
    • Time step (s): 0 (= auto)
  • Silent interval detection:
    • Silence threshold (dB): -35
    • Minimum silent interval (s): 0.15
    • Minimum sounding interval (s): 0.1
    • Silent interval label: "" (empty)
    • Sounding interval label: word$

This command takes seven arguments, among which the pitch floor, silence threshold, and minimum silent interval (in seconds) typically need to be customised to the speech data, for example the speaker’s pitch range and speech rate, the recording conditions, and the level of background noise. A practical way to determine good values is to open a few sound files in Praat and run the command from the graphical interface (Annotate > To TextGrid (silences)) with a range of parameter settings to get a feel for what works best.

# create_input_tgs_with_sils.praat
# Written by Chenzi XU (30 July 2025)

form Batch annotate wordlist
    sentence WordListFile /Users/chenzi/Wip/bora/wordlist.txt
    sentence AudioFolder /Users/chenzi/Wip/bora/words_in_isolation
endform

# Praat lowercases the first letter of form field names, hence wordListFile$ and audioFolder$
Read Strings from raw text file: wordListFile$
Rename: "wordList"
numberOfWords = Get number of strings

Create Strings as file list: "wavList", audioFolder$ + "/*.wav"
Sort
numberOfWavs = Get number of strings

if numberOfWords <> numberOfWavs
    exitScript: "Mismatch: ", numberOfWords, " words vs ", numberOfWavs, " audio files."
endif

for i from 1 to numberOfWavs
    selectObject: "Strings wavList"
    wavFile$ = Get string: i
    selectObject: "Strings wordList"
    word$ = Get string: i
    fullWavPath$ = audioFolder$ + "/" + wavFile$

    Read from file: fullWavPath$
    soundName$ = selected$("Sound")

    # Detect silent/sounding intervals; label each sounding interval with the word
    To TextGrid (silences): 75, 0, -35, 0.15, 0.1, "", word$

    selectObject: "TextGrid " + soundName$
    Set tier name: 1, "word"

    tgFile$ = replace$(fullWavPath$, ".wav", ".TextGrid", 1)
    Save as text file: tgFile$

    appendInfoLine: "Annotated: ", wavFile$, " with label: ", word$
endfor

select all
Remove
appendInfoLine: "Done! ", numberOfWavs, " TextGrids created."

5.2.2 The pronunciation dictionary prepared by linguists

We prepared a Bora pronunciation dictionary, bora_dict.txt, for the list of words collected by Jose. Each word in this dictionary is transcribed in IPA, with individual IPA symbols separated by spaces and a tab character separating the word from its transcription.

<oov> oov
{LG}  spn
{SL}  sil
aabéváa	aː p é b âː
aacu	aː kʰ u
aamédítyuváa	aː m é t í tʲʰ u b âː
aaméne	aː m é n e
aaméváa	aː m é b âː
aamɨ́nema	aː m ɨ́ n e m a
aanévané	aː n é b a n é
aanéváa	aː n é b âː
aaúváa	aː ú b âː
acháháchá	a t͡sʲʰ á ʔ á t͡sʲʰ á
adówatu	a t ó k͡p a tʰ u
adówááñé	a t ó k͡p áː ɲ é
ahdújucóváa	a ʔ t ú h u kʰ ó b âː
ajchóta	a h t͡sʲʰ ó tʰ a
allúrí	a t͡sʲ ú r í
...
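
Before handing everything to MFA, it can be reassuring to check that every word-list item actually has an entry in this dictionary. Below is a minimal sketch (the script name check_dict_coverage.py is hypothetical); mfa validate will also flag such items as OOVs later, but catching them early saves a round trip.

# check_dict_coverage.py (hypothetical helper)
# Report word-list items that have no entry in the pronunciation dictionary.

with open("bora_dict.txt", encoding="utf-8") as f:
    dict_words = {line.split()[0] for line in f if line.strip()}

with open("wordlist.txt", encoding="utf-8") as f:
    wordlist = [line.strip().split()[0] for line in f if line.strip()]

missing = [w for w in wordlist if w not in dict_words]
print(f"{len(missing)} of {len(wordlist)} word-list items are missing from bora_dict.txt")
for w in missing:
    print(w)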

Then we can tidy up the project directory. The bora_corpus/ directory now contains both the sound files and their corresponding input transcriptions, and a tgs/ directory is added to house the output TextGrids.

.
├── bora
... ├── bora_dict.txt
    ├── wordlist.txt
    ├── create_input_tgs_from_wordlists.py
    ├── create_input_tgs_with_sils.praat
    ├── ...
    ├── tgs # for output TextGrids
    └── bora_corpus
        ├── 01_sp_Panduro_BOR001_20250203_001.wav
        ├── 01_sp_Panduro_BOR001_20250203_001.TextGrid
        ├── 01_sp_Panduro_BOR001_20250203_002.wav
        ├── 01_sp_Panduro_BOR001_20250203_002.TextGrid
        └── ... 

5.3 Training acoustic models using MFA

Before we start training, run the mfa validate command to look through the training corpus (bora_corpus/ in our case) and make sure that the dataset is in the proper format for MFA.

mfa validate --clean --single_speaker bora_corpus bora_dict.txt --output_directory ~/Wip/bora

The output of this command reports on aspects of the training corpus, including the number of speakers, the number of utterances, the total duration, any missing transcriptions or audio files, any out-of-vocabulary (OOV) items, and so on. You can see some of the INFO lines printed in the Unix shell as follows:

 INFO     Corpus
 INFO     913 sound files
 INFO     913 text files
 INFO     1 speakers
 INFO     1773 utterances
 INFO     5584.087 seconds total duration
 INFO     Sound file read errors
 INFO     There were no issues reading sound files.
 INFO     Feature generation
 INFO     There were no utterances missing features.
 INFO     Files without transcriptions
 INFO     There were no sound files missing transcriptions.
 INFO     Transcriptions without sound files
 INFO     There were no transcription files missing sound files.
 INFO     Dictionary
 INFO     Out of vocabulary words
 INFO     There were no missing words from the dictionary. If you plan on using the a model
          trained on this dataset to align other datasets in the future, it is recommended that
          there be at least some missing words.        
 ...

The above output indicates that our data passed validation. If there are any missing files or the number of speakers is incorrect, fix the problems and run mfa validate again; alternate between these two steps until your data validates cleanly.

The MFA command for training a new acoustic model is mfa train, which takes three arguments:

mfa train [OPTIONS] <corpus_directory> <dictionary_path> <output_model_path>

For more details, see the official guide.

I have added the optional argument --output_directory to specify where the output TextGrids for our training data should be written.

mfa train --clean --single_speaker bora_corpus bora_dict.txt bora_model.zip --output_directory tgs --subset_word_count 1 --minimum_utterance_length 1

If you see the following lines at the end of the shell output, congratulations 🎉, you have completed training an acoustic model.

INFO     Finished exporting TextGrids to xxxxxx!                                  
INFO     Done! Everything took xxxxx seconds

An example of the TextGrid output is as follows:

Time-aligned phones for sp_Panduro_BOR001_20250217_088.wav

With such a small training corpus (~1.55 hours), the resulting alignment is surprisingly good, although in the example above the alignment of the bilabial nasal /m/ and the vowel /u/ is off.
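
To spot-check more files without opening each one in Praat, the phone intervals can be printed directly; here is a minimal sketch with praatio (it assumes praatio ≥ 6, where a tier's intervals are exposed as .entries, and MFA's default phone tier name "phones"):

# inspect_alignment.py (hypothetical helper)
# Print the time-aligned phone intervals of one output TextGrid.
from praatio import textgrid

tg = textgrid.openTextgrid("tgs/sp_Panduro_BOR001_20250217_088.TextGrid", includeEmptyIntervals=False)
for start, end, label in tg.getTier("phones").entries:
    print(f"{start:.3f}\t{end:.3f}\t{label}")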

5.4 Next steps

The Bora DoReCo dataset may help us improve the mini acoustic model with more speech data from more speakers. (coming soon)
