Working with large-scale speech corpora for phonetic research: Pipeline and tools


Chenzi Xu
MPhil, DPhil (Oxon)

University of Oxford
Workshop at Universiteit Leiden

August 11, 2025

About me

Leverhulme Trust Early Career Fellow

University of Oxford

The rise and fall of a tone


Postdoctoral Research Associate

University of York

Person-specific Automatic Speaker Recognition: Understanding the behaviour of individuals for applications of ASR


DPhil, MPhil (Distinction)

University of Oxford

Why work with large corpora for phonetic research?


  • Capturing variation and change
    • Longitudinal and synchronic variation
    • Dialectology, sociophonetics, forensic phonetics
    • Growing availability of “in-the-wild” corpora
  • Statistical power
  • Generalisability
  • Reproducibility
  • Advancing the phonetic toolset

Roadmap


Corpus Data Access


Know Your Device



5 important facts about your device or server

  • OS (operating system)
  • CPU (central processing unit): cores, threads
  • GPU (graphics processing unit): VRAM, CUDA
  • RAM (random-access memory)
  • Storage: free disk space
# macOS: hardware overview
system_profiler SPHardwareDataType | head -n 10
Hardware:

    Hardware Overview:

      Model Name: MacBook Pro
      Model Identifier: Mac15,11
      Model Number: MRW33B/A
      Chip: Apple M3 Max
      Total Number of Cores: 14 (10 performance and 4 efficiency)
      Memory: 36 GB
# Free space on the root volume
df -h /
Filesystem        Size    Used   Avail Capacity iused ifree %iused  Mounted on
/dev/disk3s1s1   926Gi    10Gi   384Gi     3%    426k  4.0G    0%   /
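The same facts can be read programmatically on any OS. A minimal sketch in Python (assumes the third-party psutil package; the PyTorch check is optional and only runs if torch is installed):

import os
import platform
import shutil

import psutil  # third-party: pip install psutil

print("OS:", platform.system(), platform.release())
print("CPU threads:", os.cpu_count())
print(f"RAM: {psutil.virtual_memory().total / 2**30:.1f} GiB")
total, used, free = shutil.disk_usage("/")
print(f"Free disk: {free / 2**30:.1f} GiB")

try:
    import torch  # optional: report CUDA availability if PyTorch is present
    print("CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed; skipping GPU check")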

Corpus Structure



Audio Formats

  • WAV
  • FLAC
  • MP3

Transcription Formats

  • Plain Text
  • TextGrid
  • ELAN
  • JSON
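Whatever combination a corpus ships with, tools further down the pipeline usually expect 16 kHz mono WAV, so format conversion is often the first preprocessing step. A minimal sketch (assumes librosa and soundfile, with an ffmpeg backend available for MP3; clip.mp3 is a hypothetical file name):

import librosa
import soundfile as sf

# Decode (MP3/FLAC/WAV), downmix to mono, and resample to 16 kHz
y, sr = librosa.load("clip.mp3", sr=16_000, mono=True)  # hypothetical input
sf.write("clip.wav", y, sr, subtype="PCM_16")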

Corpus Management Ecosystems: Kaldi-style

  • File-based metadata system
    • Plain-text files: easy to read, edit, and version-control
    • Flexible: support rich and custom annotations
  • Broad tool support
    • Out-of-the-box Kaldi and ESPnet scripts for validation, filtering, splitting, and merging datasets



Corpus Management Ecosystems: Kaldi-style

  • File-based metadata system
    • wav.scp: maps recording/utterance IDs to audio paths
    • text: transcripts for each utterance
    • utt2spk and spk2utt: link utterances to speakers
    • segments: (for long recordings) start and end times for utterances

Conventions:

  • Each file is space-separated, strictly sorted by the first field, with no duplicates.
  • IDs are opaque strings but must be consistent across files.
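These conventions are easy to break when files are edited by hand, so it pays to check them programmatically. A minimal sketch (the path is illustrative; byte-wise string comparison in Python matches the C-locale sort order Kaldi expects for ASCII IDs):

# Check a Kaldi metadata file: sorted by first field, no duplicate IDs
def check_kaldi_file(path):
    with open(path, encoding="utf-8") as f:
        keys = [line.split(maxsplit=1)[0] for line in f if line.strip()]
    if keys != sorted(keys):
        raise ValueError(f"{path} is not sorted by its first field")
    if len(keys) != len(set(keys)):
        raise ValueError(f"{path} contains duplicate IDs")

check_kaldi_file("kespeech/metadata/utt2spk")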



KeSpeech

kespeech
├── audio
│   ├── phase1
│   └── phase2
└── metadata
    ├── city2Chinese
    ├── city2subdialect
    ├── phase1.text
    ├── phase1.utt2style
    ├── phase1.utt2subdialect
    ├── phase1.wav.scp
    ├── phase2.text
    ├── phase2.utt2env
    ├── phase2.wav.scp
    ├── spk2age
    ├── spk2city
    ├── spk2gender
    ├── spk2utt
    ├── subdialect2Chinese
    └── utt2spk
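Because each metadata file is just ID–value pairs, utterance-level and speaker-level annotations can be joined in a few lines. A minimal sketch against the layout above (paths follow the tree; the helper name is my own):

def read_kv(path):
    """Read a two-column Kaldi-style file into a dict."""
    with open(path, encoding="utf-8") as f:
        return dict(line.rstrip("\n").split(maxsplit=1) for line in f if line.strip())

utt2spk = read_kv("kespeech/metadata/utt2spk")
spk2gender = read_kv("kespeech/metadata/spk2gender")

# Attach speaker gender to every utterance (skip speakers with no entry)
utt2gender = {utt: spk2gender[spk] for utt, spk in utt2spk.items() if spk in spk2gender}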


Corpus Management Ecosystems: Kaldi-style

  • Broad tool support
# Validation of datasets
utils/validate_data_dir.sh --no-feats kespeech/metadata

# Generate utterance durations
utils/data/get_utt2dur.sh kespeech/metadata

# Keep the shortest N utterances for debugging
utils/subset_data_dir.sh --shortest kespeech/metadata 100 kespeech/ks_s100

# Random 10k-utterance subset (useful for quick experiments)
utils/subset_data_dir.sh kespeech/metadata 10000 kespeech/ks_10k

# Make dev/test by speakers (avoid speaker leakage)
utils/subset_data_dir_tr_cv_spk.sh --cv-spk-percent 15 kespeech/metadata kespeech/train kespeech/dev

ESPnet: End-to-end speech processing toolkit

ESPnet bundles the same Kaldi-style data directories and utility scripts, so the commands above work there unchanged.


Corpus Management Ecosystems: Hugging Face

  • Parquet/Arrow format metadata
    • A single, columnar file structure for audio and annotations
    • Cloud hosting and distribution via Hugging Face Hub
  • Wide ecosystem integration: Python API
    • Native streaming and loading from Hugging Face Hub
    • Built-in tools for filtering, batching, and audio decoding
    • Seamless compatibility with PyTorch, TensorFlow, and other ML frameworks



Common Voice

cv-corpus-22/yue
├── clip_durations.tsv
├── invalidated.tsv
├── other.tsv
├── reported.tsv
├── unvalidated_sentences.tsv
├── validated_sentences.tsv
├── validated.tsv
└── clips
    ├── common_voice_yue_31172849.mp3
    ├── common_voice_yue_31172850.mp3
    └── ...




Corpus Management Ecosystems: Hugging Face

  • Wide ecosystem integration: Python API
import random

from datasets import load_dataset, Audio

# Stream the corpus from the Hub (no full download up front)
cv_22_yue = load_dataset("fsicoli/common_voice_22_0", "yue", split="train", streaming=True)

# Decode audio on the fly, resampled to 16 kHz
cv_22_yue = cv_22_yue.cast_column("audio", Audio(sampling_rate=16_000))

# Remove very short utterances
cv_22_yue = cv_22_yue.filter(lambda x: len(x["sentence"]) > 3)

# Make dev/test by speakers (avoid speaker leakage).
# train_test_split is unavailable on streaming datasets, and stratifying
# on client_id would place the same speakers in both splits, so split the
# speaker set itself and filter on it.
cv_map = load_dataset("fsicoli/common_voice_22_0", "yue", split="train")
speakers = sorted(set(cv_map["client_id"]))
random.Random(42).shuffle(speakers)
test_speakers = set(speakers[: max(1, int(0.15 * len(speakers)))])
train_yue = cv_map.filter(lambda x: x["client_id"] not in test_speakers)
test_yue = cv_map.filter(lambda x: x["client_id"] in test_speakers)
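Individual examples of the resulting map-style splits then decode on access; for instance (column names follow the Common Voice schema):

sample = train_yue[0]
print(sample["sentence"])
print(sample["audio"]["sampling_rate"], len(sample["audio"]["array"]))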

Data Preprocessing


Transcription



  • Open-source ASR toolkits
  • Pretrained model APIs
  • Cloud ASR APIs


Tutorial

A hands-on introductory tutorial on applying Whisper and wav2vec 2.0 is available here.

Transcription: Whisper



Advantages:

  • High accuracy: handles long-form audio natively
  • Multilingual: supports 50+ languages (trained on 98 languages)
  • Multi-tasking: transcription, translation, timestamping
  • Offline: privacy-friendly
  • Flexible: fine-tunable

[Figure: benchmark comparison of Kaldi GigaSpeech XL, facebook/wav2vec2-large-robust-ft-libri-960h, and Whisper medium.en]

Transcription: Whisper



Weaknesses:

  • May hallucinate
  • Produces “(too) clean” output (disfluencies and hesitations removed)
  • Large models can be GPU-heavy and slow
  • Accuracy depends on domain and language

[Figure: Whisper architecture]

Transcription: Advanced Whisper Techniques



Model configuration

  1. Provide language selection

  2. Initial prompt

  3. Dynamic temperature fallback

Segmentation strategies

  4. Whisper-timestamped

  5. VAD-guided chunking

import whisper_timestamped as whisper

model_size = "large-v2"
language = "en"
task = "transcribe"
# Seed the prompt with fillers so Whisper is less likely to drop disfluencies
initial_prompt = "umm uhh oh ah hm er erm urgh mm"
seed = 42  # any fixed integer, for reproducible decoding

model = whisper.load_model(model_size)
audio = whisper.load_audio("recording.wav")  # hypothetical input file

transcribe_args = {
    "task": task,
    "language": language,
    "patience": None,
    "length_penalty": None,
    "suppress_tokens": "-1",
    "initial_prompt": initial_prompt,
    "fp16": False,
    "condition_on_previous_text": False,
    "vad": True,
    "best_of": 5,
    "beam_size": 5,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
}

result = whisper.transcribe(model, audio, seed=seed, **transcribe_args)
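The temperature tuple implements the dynamic fallback (point 3): decoding starts greedily at temperature 0.0 and only retries at the next value when Whisper's internal compression-ratio or average log-probability checks flag the output as degenerate.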

Segmentation strategies 4 and 5 rely on the whisper-timestamped package, with onnxruntime and torchaudio as the voice activity detection backend:

pip3 install whisper-timestamped

# for Voice Activity Detection
pip3 install onnxruntime torchaudio


The transcribe_args shown earlier apply unchanged; "vad": True routes the audio through the voice activity detector so that only detected speech regions are passed to Whisper for decoding.
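whisper-timestamped attaches word-level timing to every segment of the result. A minimal sketch of dumping it (key names follow the whisper-timestamped output format; result comes from the transcribe call above):

# Each segment carries a "words" list with text, start, end, and confidence
for seg in result["segments"]:
    for word in seg.get("words", []):
        print(f'{word["text"]}\t{word["start"]:.2f}\t{word["end"]:.2f}\t{word["confidence"]:.2f}')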

Transcription: Adapting Whisper



Vanilla Fine-Tuning

  • \(W_{\text{fine-tuned}} = W_{\text{pre-trained}} + \Delta W\)
  • All parameters are updated

  • Fine-tuning Whisper-large-v2 with as little as 20 hours of data reduced the average WER by 54.94% across 7 low-resource languages.

  • Full fine-tuning of Whisper-large-v2 requires ~24 GB of GPU VRAM, and each fine-tuned checkpoint takes ~7 GB of storage.
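As a rough sketch of what "all parameters are updated" looks like in code, using the Hugging Face Transformers port of Whisper (checkpoint name is the public openai/whisper-large-v2; data pipeline and training loop omitted):

from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Vanilla fine-tuning: every pre-trained weight receives gradient updates
for p in model.parameters():
    p.requires_grad = True

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e9:.2f}B trainable parameters")  # ~1.5B for large-v2
# At 4 bytes per fp32 parameter this is roughly the ~7 GB checkpoint size
# quoted above; gradients and optimiser states explain the ~24 GB VRAM figure.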
