Speech Corpus Querying

A speech corpus is usually a large collection of audio recordings of a spoken language, most often accompanied by text transcription files, and sometimes by metadata documents describing the sources or background of these files.

Many speech corpora are publicly available, such as TIMIT (English, licensed through the Linguistic Data Consortium) and the open-source LibriSpeech (English, ~1000 hours). Open speech corpora of other languages are also available, and many of them are large enough to be used for training machine-learning models. For text corpora, there are many free or commercial corpus analysis toolkits that provide a friendly interface and a powerful concordancer, which lets you query corpora efficiently, generate frequency lists and n-grams for search tokens, and perform semantic or grammatical tagging automatically. Examples include AntConc and Sketch Engine. For spoken corpora, however, there aren’t many good options. AIKUMA may be a good choice for recording an audio source and adding translation and metadata; ELAN is an excellent annotation tool for audio and video recordings.

As linguists, we often conduct fieldwork and collect our own first-hand language data by recording speakers and transcribing their speech. Depending on our research questions, we might also want to control the speech stimuli and the speakers' socio-cultural backgrounds, instead of using a large open-source database without knowing much about the speakers and recording contexts. Corpus-building skills can be essential, and querying a speech corpus can be a crucial step in phonetic research. This tutorial introduces one way to compile a speech corpus and query it for speech intervals, using the command-line interface.

When we build our own corpus, we need audio files and corresponding time-aligned transcripts. A transcript file usually shares its filename with the audio file, differing only in the file extension. It is also useful to keep a separate metadata file that logs information about the speakers, the recording equipment and environment, and the audio file formats.

General elements for speech corpora

  1. Speech audio files (.wav)
  2. Time-aligned transcript files (.txt/.lab/.TextGrid)
  3. Metadata files*
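
The shared-filename convention described above can be checked automatically before any querying starts. The following is a minimal sketch, not part of the original tutorial scripts: it assumes all files sit in one folder (the name `my_corpus` is hypothetical) and simply reports which .wav files have no transcript with the same base name.

```python
from pathlib import Path

# Transcript extensions from the list above, tried in this order.
TRANSCRIPT_EXTS = (".txt", ".lab", ".TextGrid")


def pair_audio_with_transcripts(corpus_dir):
    """Map each .wav file to the transcript sharing its base name, or None."""
    pairs = {}
    for wav in sorted(Path(corpus_dir).glob("*.wav")):
        transcript = None
        for ext in TRANSCRIPT_EXTS:
            candidate = wav.with_suffix(ext)  # same name, different extension
            if candidate.exists():
                transcript = candidate
                break
        pairs[wav] = transcript
    return pairs


if __name__ == "__main__":
    # "my_corpus" is a hypothetical folder name; replace it with your own.
    for wav, transcript in pair_audio_with_transcripts("my_corpus").items():
        status = transcript.name if transcript else "MISSING TRANSCRIPT"
        print(f"{wav.name}\t{status}")
```

Running a check like this first makes it easier to spot recordings that were never transcribed, or transcripts that were saved under a slightly different name.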

We prefer uncompressed audio formats such as WAV in research; sometimes you might encounter lossless compressed audio formats such as FLAC. Here I won’t be covering how to record an audio file or how to get a transcription file (assuming that you already have them).

Forced-alignment tutorial

If you don’t know how to use a forced aligner, please check out my other tutorial on how to get time-aligned transcription files automatically.

General procedure for making a query

  1. Assemble all time-aligned transcripts
  2. Prepare a query script that searches the transcript files for targeted sequences
  3. Prepare a trim script for cutting the matching portions out of the relevant audio files

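The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the scripts used later in the tutorial: it assumes each transcript is a plain-text file with one interval per line in the form `start end label` (times in seconds), and the folder name, target label, and output naming are all hypothetical. It builds SoX commands using the `trim` effect (start position followed by duration) and prints them rather than running them.

```python
import shlex
from pathlib import Path


def read_intervals(transcript_path):
    """Read (start, end, label) tuples from one time-aligned transcript.

    ASSUMED format (hypothetical): one interval per line, written as
    "start_seconds end_seconds label", separated by whitespace.
    """
    intervals = []
    text = Path(transcript_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        parts = line.split(maxsplit=2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        try:
            start, end = float(parts[0]), float(parts[1])
        except ValueError:
            continue  # skip lines that do not begin with two numbers
        intervals.append((start, end, parts[2].strip()))
    return intervals


def find_target(corpus_dir, target, ext=".lab"):
    """Steps 1 and 2: go through all transcripts and collect matching intervals."""
    hits = []
    for transcript in sorted(Path(corpus_dir).glob(f"*{ext}")):
        for start, end, label in read_intervals(transcript):
            if label == target:
                # The audio file is assumed to share the transcript's base name.
                hits.append((transcript.with_suffix(".wav"), start, end))
    return hits


def sox_trim_command(wav, start, end, out_dir, index):
    """Step 3: build a SoX command that cuts [start, end] out of wav."""
    out = Path(out_dir) / f"{wav.stem}_{index:03d}.wav"
    duration = end - start  # trim takes a start position and a duration
    return ["sox", str(wav), str(out), "trim", f"{start:.3f}", f"{duration:.3f}"]


if __name__ == "__main__":
    # "my_corpus", "clips", and the pinyin label "hao3" are hypothetical examples.
    for i, (wav, start, end) in enumerate(find_target("my_corpus", "hao3"), 1):
        print(shlex.join(sox_trim_command(wav, start, end, "clips", i)))
```

Printing the commands first makes it easy to inspect them before cutting any audio; the output folder has to exist before the commands are actually run.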
In this tutorial, I’ll briefly walk through how to search a speech corpus for the syllables or phrases that are the target of your research, and how to extract them. All demonstrations were tested on my MacBook (macOS Big Sur 11.5.1). Mandarin Chinese data will be used as an example. I’m trying my best to be clear, and I hope this is helpful for those who want to achieve similar goals, especially non-programmers and linguistics students.

This online tutorial is presented by courtesy of my supervisor, Prof. John Coleman, who taught me how to query a speech corpus.

Tools used: Unix Shell, Python, SoX

Click on the chapters in the Table of Contents to START.


If you have a question or run into an issue, feel free to email me, but I probably won’t be able to offer personal assistance with your problems (I’m in the middle of my dissertation). Please note that this website is not responsible for any trouble that arises. Good luck!

Thank you for reading. Feel free to share this tutorial!