Utilising ASR in Linguistic Research

Automatic Speech Recognition (ASR), or Speech-to-Text (STT), maps a sequence of audio inputs to a sequence of text outputs. It is not only the core of applications such as voice assistants, video captioning, and minutes-taking, but can also facilitate linguistic fieldwork and speech data preprocessing.

Many open-source resources make it possible to integrate ASR into linguistic research workflows. This tutorial aims to help you understand the basic concepts of ASR and to guide you step by step through using ASR in your own linguistic research.

The tutorial has the following chapters:

Chapter 1 employs state-of-the-art pre-trained ASR models to generate transcripts for audio recordings.
Chapter 2 is coming soon.
Chapter 3 demonstrates how to train acoustic models and perform alignment from scratch using the Kaldi ASR toolkit.
Chapter 4 continues from Chapter 3 and demonstrates acoustic model training and alignment with the much simpler Montreal Forced Aligner (MFA).

The subsequent chapters will be released soon and will cover how to fine-tune ASR models and train them from scratch using PyTorch. Please stay tuned!

Forced-alignment tutorial
Automatic forced alignment is also based on ASR models. If you don’t know how to use a forced aligner, please check out my other online tutorial on how to get time-aligned transcription files automatically.

The intended audience for this online tutorial is linguistic researchers and students. All scripts were tested on my MacBook Air (M1, 2020). Please click on a chapter in the Table of Contents (left side) to start.

Unix Shell Python

DISCLAIMER
If you have a question or run into an issue, feel free to email me, but I probably cannot offer personal assistance with your problems. In short, this website is not responsible for any trouble that arises. Good luck!

Thank you for reading. Feel free to share this tutorial!

Last updated on Nov 1, 2024