Large-scale speech datasets are increasingly central to speech research. This talk provides a practical overview of end-to-end workflows for working with large speech corpora, from acquisition and storage to preprocessing and querying. We will compare popular data management ecosystems, including the file-based Kaldi-style layout and the Hugging Face Datasets API, highlighting their respective advantages for annotation, scalability, and interoperability. Participants will learn how to adapt large pre-trained speech models such as Whisper, structure and validate metadata, efficiently query and filter large datasets, and leverage modern tools for distributed processing.
This talk is open to researchers and students interested in working with large-scale speech corpora. Basic knowledge of the Unix shell and Python is desirable but not required.