Working with large-scale speech corpus for phonetic research: Pipeline and tools

Start: 2025-08-14T15:30:00Z
Location: Herta Mohr Building 1.80, Leiden University


Abstract

Large-scale speech datasets are increasingly central to phonetic research. This talk gives a practical overview of end-to-end workflows for large speech corpora, from acquisition and storage to preprocessing and querying. We will compare popular data-management ecosystems, including the file-based Kaldi-style layout and the Hugging Face Datasets API, highlighting their respective advantages for annotation, scalability, and interoperability. Participants will learn how to adapt large pre-trained speech models such as Whisper, structure and validate metadata, query and filter large datasets efficiently, and use modern tools for distributed processing.

Date

Aug 14, 2025 3:30 PM

Event

DiLLA Speech Workshop, Universiteit Leiden

Location

Herta Mohr Building 1.80, Leiden University, Leiden

Prerequisite:

This talk is open to researchers and students interested in working with large-scale speech corpora. Basic knowledge of the Unix shell and Python is desirable but not required.


Dr Chenzi Xu

Leverhulme Early Career Fellow
