University of Oxford
Workshop at Universiteit Leiden
August 11, 2025

Leverhulme Trust Early Career Fellow, University of Oxford
The rise and fall of a tone

Postdoctoral Research Associate, University of York
Person-specific Automatic Speaker Recognition: Understanding the behaviour of individuals for applications of ASR

DPhil, MPhil (Distinction), University of Oxford
Hardware:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: Mac15,11
Model Number: MRW33B/A
Chip: Apple M3 Max
Total Number of Cores: 14 (10 performance and 4 efficiency)
Memory: 36 GB
wav.scp: maps recording/utterance IDs to audio paths
text: transcripts for each utterance
utt2spk and spk2utt: links utterances to speakers
segments: (for long recordings) start and end times for utterances
Conventions:
kespeech
├── audio
│   ├── phase1
│   └── phase2
└── metadata
    ├── city2Chinese
    ├── city2subdialect
    ├── phase1.text
    ├── phase1.utt2style
    ├── phase1.utt2subdialect
    ├── phase1.wav.scp
    ├── phase2.text
    ├── phase2.utt2env
    ├── phase2.wav.scp
    ├── spk2age
    ├── spk2city
    ├── spk2gender
    ├── spk2utt
    ├── subdialect2Chinese
    └── utt2spk
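These mapping files are plain "key value" text, one entry per line. A minimal Python sketch for loading them into dicts (the paths follow the kespeech layout above; the helper name is illustrative):

from pathlib import Path

def read_kaldi_map(path):
    # Parse a Kaldi-style mapping file: first token is the key, the rest is the value.
    mapping = {}
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            mapping[parts[0]] = parts[1]
    return mapping

wav_scp = read_kaldi_map("kespeech/metadata/phase1.wav.scp")  # utterance ID -> audio path
text = read_kaldi_map("kespeech/metadata/phase1.text")        # utterance ID -> transcript
utt2spk = read_kaldi_map("kespeech/metadata/utt2spk")         # utterance ID -> speaker ID
print(len(wav_scp), len(text), len(utt2spk))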
# Validation of datasets
utils/validate_data_dir.sh --no-feats kespeech/metadata
# Generate utterance durations
utils/data/get_utt2dur.sh kespeech/metadata
# Keep shortest/longest N for debugging
utils/subset_data_dir.sh --shortest kespeech/metadata 100 kespeech/ks_s100
# Random 10k-utterance subset (useful for quick experiments)
utils/subset_data_dir.sh kespeech/metadata 10000 kespeech/ks_10k
# Make dev/test by speakers (avoid speaker leakage)
utils/subset_data_dir_tr_cv_spk.sh --cv-spk-percent 15 kespeech/metadata kespeech/train kespeech/dev
ESPnet: End-to-end speech processing toolkit
ESPnet bundles the same Kaldi recipes and scripts.
cv-corpus-22/yue
├── clip_durations.tsv
├── invalidated.tsv
├── other.tsv
├── reported.tsv
├── unvalidated_sentences.tsv
├── validated_sentences.tsv
├── validated.tsv
└── clips
    ├── common_voice_yue_31172849.mp3
    ├── common_voice_yue_31172850.mp3
    └── ...
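The metadata TSVs can also be inspected directly; a quick sketch with pandas, assuming the standard Common Voice columns (client_id, path, sentence):

import pandas as pd

cv_root = "cv-corpus-22/yue"
validated = pd.read_csv(f"{cv_root}/validated.tsv", sep="\t")

# Resolve each row's clip filename to its location under clips/
validated["clip_path"] = cv_root + "/clips/" + validated["path"]

print(validated[["client_id", "path", "sentence"]].head())
print(len(validated), "validated clips from", validated["client_id"].nunique(), "speakers")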
import random
from datasets import load_dataset, Audio

# Load the dataset (non-streaming here, since the speaker-based split below needs the full speaker list)
cv_22_yue = load_dataset("fsicoli/common_voice_22_0", "yue", split="train")

# Decode audio at 16 kHz
cv_22_yue = cv_22_yue.cast_column("audio", Audio(sampling_rate=16_000))

# Remove very short utterances
cv_22_yue = cv_22_yue.filter(lambda x: len(x["sentence"]) > 3)

# Make dev/test by speakers (avoid speaker leakage): hold out 15% of speakers, not 15% of utterances
speakers = sorted(set(cv_22_yue["client_id"]))
random.seed(42)
held_out = set(random.sample(speakers, k=max(1, round(0.15 * len(speakers)))))
train_yue = cv_22_yue.filter(lambda x: x["client_id"] not in held_out)
test_yue = cv_22_yue.filter(lambda x: x["client_id"] in held_out)
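A quick check that the resulting splits really are speaker-disjoint:

train_spk = set(train_yue["client_id"])
test_spk = set(test_yue["client_id"])
assert train_spk.isdisjoint(test_spk), "speaker leakage between train and test"
print(len(train_yue), "train /", len(test_yue), "test utterances")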
Open-source ASR toolkits
Pretrained model APIs
Cloud ASR APIs
Tutorial
A hands-on introductory tutorial on applying Whisper and wav2vec 2.0 is available here.
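As a minimal illustration of the pretrained-model route, both families can be called through the Hugging Face pipeline API (the checkpoints and file path here are examples, not the tutorial's specific choices):

from transformers import pipeline

# Whisper via the generic ASR pipeline
whisper_asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
print(whisper_asr("example.wav")["text"])

# wav2vec 2.0 (CTC) via the same pipeline interface
w2v_asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
print(w2v_asr("example.wav")["text"])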
1. Provide language selection
2. Initial prompt
3. Dynamic temperature fallback
4. Whisper-timestamped
5. VAD-guided chunking
# Sketch assumes the whisper_timestamped package (which accepts the `vad` option and a seed);
# the import, model/audio loading, and seed lines are illustrative additions.
import whisper_timestamped as whisper

model_size = "large-v2"
language = "en"
task = "transcribe"
initial_prompt = "umm uhh oh ah hm er erm urgh mm"

model = whisper.load_model(model_size)
audio = whisper.load_audio("example.wav")  # hypothetical input file
seed = 42  # fixed seed for reproducible sampling

transcribe_args = {
    "task": task,
    "language": language,
    "patience": None,
    "length_penalty": None,
    "suppress_tokens": "-1",
    "initial_prompt": initial_prompt,
    "fp16": False,
    "condition_on_previous_text": False,
    "vad": True,
    "best_of": 5,
    "beam_size": 5,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
}
result = whisper.transcribe(model, audio, seed=seed, **transcribe_args)
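When whisper_timestamped is used as above, result also carries word-level timing; a small inspection sketch (field names follow that library's output format):

for segment in result["segments"]:
    for word in segment.get("words", []):
        print(f'{word["start"]:7.2f} {word["end"]:7.2f}  {word["text"]}  (conf {word["confidence"]:.2f})')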
Fine-tuning Whisper-large-v2 on as little as 20 hours of data reduced the average WER by 54.94% across 7 low-resource languages1. Full fine-tuning of Whisper-large-v2 requires ~24 GB of GPU VRAM, and each fine-tuned checkpoint takes ~7 GB of storage.
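A rough back-of-the-envelope check of those figures, assuming ~1.55B parameters for large-v2 and Adam-style full fine-tuning (weights, gradients, and two optimizer moments):

params = 1.55e9      # approximate parameter count of Whisper-large-v2
bytes_fp32 = 4
gib = 1024 ** 3

checkpoint_gib = params * bytes_fp32 / gib          # weights only, as saved to disk
training_state_gib = params * bytes_fp32 * 4 / gib  # weights + gradients + 2 Adam moments

print(f"fp32 checkpoint: ~{checkpoint_gib:.1f} GiB")          # ~5.8 GiB, consistent with ~7 GB per checkpoint
print(f"fp32 training state: ~{training_state_gib:.1f} GiB")  # ~23 GiB, consistent with ~24 GB of VRAM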