Assemble time-aligned transcription files

The structure of a speech corpus

An ideal way to organise our speech corpus directory is demonstrated below. The plain transcript (.txt) and the time-aligned transcription file (.TextGrid) share the same filename as the corresponding audio file (.wav). Designing a consistent, anonymous, interpretable, and information-dense filename system is always good practice.

speech_corpus
├── metadata.txt
├── audios
│   ├── b01_1_101q.wav
│   ├── b01_2_101q.wav
│   └── b01_3_101q.wav
├── textgrids
│   ├── b01_1_101q.TextGrid
│   ├── b01_2_101q.TextGrid
│   └── b01_3_101q.TextGrid
└── transcripts
    ├── b01_1_101q.txt
    ├── b01_2_101q.txt
    └── b01_3_101q.txt
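
Because everything later relies on the three folders sharing filenames, it can be worth checking that correspondence before going further. The sketch below is only an illustration and assumes the folder names and extensions shown in the tree above; the function name find_missing is made up for this example.

```python
from pathlib import Path


def find_missing(corpus_dir):
    """Map each audio stem to the companion files that do not exist."""
    corpus = Path(corpus_dir)
    # Folder name -> expected extension, as in the directory tree above.
    companions = {"textgrids": ".TextGrid", "transcripts": ".txt"}
    missing = {}
    for wav in sorted((corpus / "audios").glob("*.wav")):
        for folder, ext in companions.items():
            candidate = corpus / folder / (wav.stem + ext)
            if not candidate.exists():
                missing.setdefault(wav.stem, []).append(f"{folder}/{wav.stem}{ext}")
    return missing


if __name__ == "__main__":
    # Prints nothing when every audio file has both companion files.
    for stem, files in find_missing("speech_corpus").items():
        print(stem, "is missing", ", ".join(files))
```

An empty result means every .wav file has a matching .TextGrid and .txt file.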

The first step towards searching the corpus for target speech sequences is to assemble all time-aligned transcripts into one large text file. This gives us three key pieces of information for every audio file: 1) temporal information; 2) the symbol for the speech unit (a segment, a syllable, or a word, depending on the granularity of the segmentation); 3) the filename/path.

Assemble time-aligned transcripts

A TextGrid file, as shown below, is usually not in the most convenient format to work with. In the previous tutorial, I provided Python scripts that convert a .TextGrid file into a plain text file with tabular data. Again, they are available at my GitHub repository, and the README.md will take you from there.

If you would like to convert multiple .TextGrid files at once, you can run a for loop on the command line and then concatenate the individual .txt files into one large text file.

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0.0125 
xmax = 1.8725 
tiers? <exists> 
size = 2 
item []: 
    item [1]:
        class = "IntervalTier" 
        name = "phone" 
        xmin = 0.0125 
        xmax = 1.8725 
        intervals: size = 26 
        intervals [1]:
            xmin = 0.0125 
            xmax = 0.0925 
            text = "t"
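
To make the layout above more concrete, here is a minimal sketch of how the intervals of one tier could be pulled out of a TextGrid in this long text format. It is not the script from my repository; it only handles interval tiers laid out exactly as shown, and the function name read_tier and the trimmed sample string are made up for this illustration.

```python
import re

# One interval: its index line, then xmin, xmax and the quoted label.
# Praat writes a literal quote inside a label as two quotes ("").
INTERVAL = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = ([\d.]+)\s*'
    r'xmax = ([\d.]+)\s*'
    r'text = "((?:[^"]|"")*)"'
)


def read_tier(textgrid_text, tier_name):
    """Return (label, start, end) tuples for the named interval tier."""
    # Each tier starts at "item [n]:"; the first chunk is the file header.
    chunks = re.split(r'item \[\d+\]:', textgrid_text)[1:]
    for chunk in chunks:
        name = re.search(r'name = "(.*?)"', chunk)
        if name and name.group(1) == tier_name:
            return [
                (label.replace('""', '"'), float(xmin), float(xmax))
                for xmin, xmax, label in INTERVAL.findall(chunk)
            ]
    raise ValueError(f"no tier named {tier_name!r}")


# A trimmed, single-interval sample in the same layout as above.
sample = '''File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0.0125
xmax = 0.0925
tiers? <exists>
size = 1
item []:
    item [1]:
        class = "IntervalTier"
        name = "phone"
        xmin = 0.0125
        xmax = 0.0925
        intervals: size = 1
        intervals [1]:
            xmin = 0.0125
            xmax = 0.0925
            text = "t"
'''

print(read_tier(sample, "phone"))  # [('t', 0.0125, 0.0925)]
```

For real files, the text would first have to be read from disk with the right encoding, since Praat can save TextGrids as UTF-8 or UTF-16.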

Alternatively, I created another Python script, tg2csv.py, that loops over the textgrids directory and creates one large text file as output. It is also available for download at my GitHub repository.

The key to creating this assembled text file in tabular format is forward thinking. What information will you need in the next steps? In order to cut speech segments out of an audio file, we need the filename (path) of the audio file and the times of the targeted speech segments. Because an audio file and its transcript share the same filename, the transcript's filename gives us the path of the audio file. Therefore, the tabular data should include a filename column, and it helps to remove the file extension so that the same name can be combined with different file extensions later.
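
The role of the extension-free filename column can be sketched in a few lines. This is only an illustration under the assumptions of this post (TextGrids in textgrids, audio in audios, .wav audio files); the helper names rows_for_file and audio_path are hypothetical and the column order simply mirrors the final table shown further down.

```python
from pathlib import Path


def rows_for_file(textgrid_path, intervals):
    """Turn (label, start, end) tuples into rows ending with the bare filename."""
    # Path.stem drops both the directory and the extension.
    stem = Path(textgrid_path).stem
    return [
        [label, f"{start:.4f}", f"{end:.4f}", f"{end - start:.4f}", stem]
        for label, start, end in intervals
    ]


def audio_path(stem, audio_dir="audios"):
    """Rebuild the path of the audio file from the stem stored in the table."""
    return Path(audio_dir) / f"{stem}.wav"


rows = rows_for_file("textgrids/b01_1_101q.TextGrid", [("妈", 0.0125, 0.3325)])
print(rows[0])  # ['妈', '0.0125', '0.3325', '0.3200', 'b01_1_101q']
print(audio_path(rows[0][-1]))
```

Storing only the stem means the same column can point to the .wav, .TextGrid, or .txt file, depending on which folder and extension we attach to it.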

Demo task

Suppose that I am interested in some Mandarin syllables and my .TextGrid files are time-aligned at both the segmental level (tier name: phoneme) and the syllabic level (tier name: word). I want to convert all the word tiers into tabular format.

tg2csv.py should be placed in the project directory, i.e. /speech_corpus in the example above. It takes three arguments: 1) the name of the directory that holds the .TextGrid files, in this case textgrids; 2) the name of the tier that we are interested in, word; 3) the name of the desired output file. Let's call it words.csv.

In the Terminal or your Unix shell, we can run:

python tg2csv.py textgrids word words.csv

The final tabular output of tg2csv.py is demonstrated below. You may delete the first row of the output file, 0,1,2,3,4, which only holds placeholder column names.
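
Instead of deleting that row by hand, it can also be skipped when the table is loaded. The sketch below uses Python's csv module; the column names label, start, end, duration, and filename are my own assumption based on what the columns appear to contain, and load_table is a hypothetical helper, not part of tg2csv.py.

```python
import csv

# Assumed meaning of the five unnamed columns in the output file.
COLUMNS = ["label", "start", "end", "duration", "filename"]


def load_table(csv_path):
    """Read the assembled CSV into a list of dicts, skipping the 0,1,2,3,4 row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = [row for row in csv.reader(handle) if row]  # ignore blank lines
    # Drop the placeholder header row if it is present.
    if rows and rows[0] == ["0", "1", "2", "3", "4"]:
        rows = rows[1:]
    table = []
    for row in rows:
        record = dict(zip(COLUMNS, row))
        # Times are stored as text in the file; convert them to numbers.
        for key in ("start", "end", "duration"):
            record[key] = float(record[key])
        table.append(record)
    return table


# Example (assumes words.csv sits in the working directory):
# table = load_table("words.csv")
# ma = [row for row in table if row["label"] == "妈"]
```

Each row then carries a label, its times, and the filename needed to find the audio.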

妈,0.0125,0.3325,0.3200,b01_1_101q
妈,0.3325,0.5125,0.1800,b01_1_101q
们,0.5125,0.7925,0.2800,b01_1_101q
正,0.7925,1.1125,0.3200,b01_1_101q
看,1.1125,1.6725,0.5600,b01_1_101q
...

We now have organised information about all the syllables in our own Mandarin corpus.

Last updated on Dec 15, 2021