Working with large-scale speech corpus for phonetic research: Pipeline and tools

Abstract

Large-scale speech datasets are increasingly central to phonetic research. This talk provides a practical overview of end-to-end workflows for working with large speech corpora, from acquisition and storage to preprocessing and querying. We will compare popular data management ecosystems, including the file-based Kaldi style and the Hugging Face Datasets API, highlighting their respective advantages for annotation, scalability, and interoperability. Participants will learn how to adapt pre-trained large speech models such as Whisper, structure and validate metadata, efficiently query and filter large datasets, and leverage modern tools for distributed processing.

Date
Aug 14, 2025 3:30 PM
Event
DiLLA Speech Workshop, Universiteit Leiden
Location
Herta Mohr Building 1.80, Leiden University
Leiden

Prerequisites

This talk is open to researchers and students who are interested in working with large-scale speech corpora. Basic knowledge of the Unix shell and Python is desirable, but not required.

Dr Chenzi Xu
Leverhulme Early Career Fellow

My research interests include speech prosody, speech perception, and speech technology.
