University of Oxford
Workshop at Universiteit Leiden
August 11, 2025
Leverhulme Trust Early Career Fellow
University of Oxford
The rise and fall of a tone
Postdoctoral Research Associate
University of York
Person-specific Automatic Speaker Recognition: Understanding the behaviour of individuals for applications of ASR
DPhil, MPhil (Distinction), University of Oxford
Hardware:
Hardware Overview:
Model Name: MacBook Pro
Model Identifier: Mac15,11
Model Number: MRW33B/A
Chip: Apple M3 Max
Total Number of Cores: 14 (10 performance and 4 efficiency)
Memory: 36 GB
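(This is the output of macOS's system_profiler SPHardwareDataType, for comparison with your own machine.)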
wav.scp: maps recording/utterance IDs to audio paths
text: transcripts for each utterance
utt2spk and spk2utt: link utterances to speakers
segments: (for long recordings) start and end times for utterances
Conventions:
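For example, with hypothetical utterance and speaker IDs (utterance IDs prefixed by the speaker ID, as Kaldi's sorting rules expect), the files might look like:

wav.scp:
spk001-utt001 /data/audio/spk001/utt001.wav
spk001-utt002 /data/audio/spk001/utt002.wav

text:
spk001-utt001 this is a sample transcript
spk001-utt002 another sample transcript

utt2spk:
spk001-utt001 spk001
spk001-utt002 spk001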
kespeech
├── audio
│ ├── phase1
│ └── phase2
└── metadata
├── city2Chinese
├── city2subdialect
├── phase1.text
├── phase1.utt2style
├── phase1.utt2subdialect
├── phase1.wav.scp
├── phase2.text
├── phase2.utt2env
├── phase2.wav.scp
├── spk2age
├── spk2city
├── spk2gender
├── spk2utt
├── subdialect2Chinese
└── utt2spk
# Validation of datasets
utils/validate_data_dir.sh --no-feats kespeech/metadata
# Generate utterance durations
utils/data/get_utt2dur.sh kespeech/metadata
# Keep shortest/longest N for debugging
utils/subset_data_dir.sh --shortest kespeech/metadata 100 kespeech/ks_s100
# Random 10k-utterance subset (useful for quick experiments)
utils/subset_data_dir.sh kespeech/metadata 10000 kespeech/ks_10k
# Make dev/test by speakers (avoid speaker leakage)
utils/subset_data_dir_tr_cv.sh --cv-spk-percent 15 kespeech/metadata kespeech/train kespeech/dev
ESPnet: End-to-end speech processing toolkit
ESPnet reuses the same Kaldi-style data-directory conventions and bundles many of the same utility scripts.
cv-corpus-22/yue
├── clip_durations.tsv
├── invalidated.tsv
├── other.tsv
├── reported.tsv
├── unvalidated_sentences.tsv
├── validated_sentences.tsv
├── validated.tsv
└── clips
    ├── common_voice_yue_31172849.mp3
    ├── common_voice_yue_31172850.mp3
    └── ...
from datasets import load_dataset, Audio
import random

# Load the Cantonese split (non-streaming: a dataset created with
# streaming=True cannot be indexed by column for the speaker split below)
cv_22_yue = load_dataset("fsicoli/common_voice_22_0", "yue", split="train")

# Decode audio to 16 kHz on access
cv_22_yue = cv_22_yue.cast_column("audio", Audio(sampling_rate=16_000))

# Remove examples with very short transcripts
cv_22_yue = cv_22_yue.filter(lambda x: len(x["sentence"]) > 3)

# Make dev/test by speakers (avoid speaker leakage): hold out whole
# speakers, so no client_id appears in both splits
speakers = sorted(set(cv_22_yue["client_id"]))
random.Random(42).shuffle(speakers)
held_out = set(speakers[: max(1, int(0.15 * len(speakers)))])
test_yue = cv_22_yue.filter(lambda x: x["client_id"] in held_out)
train_yue = cv_22_yue.filter(lambda x: x["client_id"] not in held_out)
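A quick sanity check that the two splits really have disjoint speakers (an extra assertion, not part of the original snippet):

assert set(train_yue["client_id"]).isdisjoint(set(test_yue["client_id"]))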
Open-source ASR toolkits
Pretrained model APIs
Cloud ASR APIs
Tutorial
A hands-on introductory tutorial on applying Whisper and wav2vec 2.0 is available here.
1. Provide language selection
2. Initial prompt
3. Dynamic temperature fallback
4. Whisper-timestamped
5. VAD-guided chunking

model_size = "large-v2"
language = "en"
task = "transcribe"
initial_prompt = "umm uhh oh ah hm er erm urgh mm"
transcribe_args = {
    "task": task,
    "language": language,
    "patience": None,
    "length_penalty": None,
    "suppress_tokens": "-1",
    "initial_prompt": initial_prompt,
    "fp16": False,
    "condition_on_previous_text": False,
    "vad": True,
    "best_of": 5,
    "beam_size": 5,
    "temperature": (0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
}
result = whisper.transcribe(model, audio, seed=seed, **transcribe_args)
Fine-tuning Whisper-large-v2 on as little as 20 hours of data reduced the average WER by 54.94% across 7 low-resource languages. The Whisper-large-v2 model requires ~24 GB of GPU VRAM for full fine-tuning and ~7 GB of storage for each fine-tuned checkpoint.
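Parameter-efficient methods such as LoRA (see the fine-tuning guides listed later) sidestep most of that VRAM cost by training only small adapter matrices. A minimal sketch with the peft library; the rank and target modules here are illustrative, not prescribed:

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# Illustrative hyperparameters: rank-32 adapters on the attention projections
lora_config = LoraConfig(
    r=32, lora_alpha=64, target_modules=["q_proj", "v_proj"], lora_dropout=0.05
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the small trainable fraction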
[v0 v1 ... v{m-1}] + [<sot> ... previous text tokens ...]
        │                          │
    trainable                frozen decoder
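In code, the idea amounts to prepending m trainable embedding vectors to the frozen decoder's input embeddings. A minimal PyTorch sketch with illustrative sizes (this is not Whisper's actual API):

import torch
import torch.nn as nn

d_model, m = 1280, 16                                        # illustrative sizes
soft_prompt = nn.Parameter(torch.randn(m, d_model) * 0.02)   # trainable vectors

def decoder_inputs(token_embeddings):
    # token_embeddings: (seq_len, d_model), from the frozen decoder's
    # embedding layer; only soft_prompt receives gradient updates
    return torch.cat([soft_prompt, token_embeddings], dim=0)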
Speaker Diarization: “Who is speaking and when?”
class SpeakerAligner:
    def align(self, transcription, timestamps, diarization):
        speaker_transcriptions = []
        # Find the end time of the last segment in diarization
        last_diarization_end = self.get_last_segment(diarization).end
        for chunk in timestamps:
            chunk_start = chunk["timestamp"][0]
            chunk_end = chunk["timestamp"][1]
            segment_text = chunk["text"]
            # Handle the case where chunk_end is None
            if chunk_end is None:
                # Use the end of the last diarization segment as the default end time
                chunk_end = (
                    last_diarization_end
                    if last_diarization_end is not None
                    else chunk_start
                )
            # Find the best matching speaker segment
            best_match = self.find_best_match(diarization, chunk_start, chunk_end)
            if best_match:
                speaker = best_match[2]  # Extract the speaker label
                speaker_transcriptions.append(
                    (speaker, chunk_start, chunk_end, segment_text)
                )
        # Merge consecutive segments of the same speaker
        speaker_transcriptions = self.merge_consecutive_segments(speaker_transcriptions)
        return speaker_transcriptions

    def find_best_match(self, diarization, start_time, end_time):
        best_match = None
        max_intersection = 0
        for turn, _, speaker in diarization.itertracks(yield_label=True):
            turn_start = turn.start
            turn_end = turn.end
            # Calculate intersection manually
            intersection_start = max(start_time, turn_start)
            intersection_end = min(end_time, turn_end)
            if intersection_start < intersection_end:
                intersection_length = intersection_end - intersection_start
                if intersection_length > max_intersection:
                    max_intersection = intersection_length
                    best_match = (turn_start, turn_end, speaker)
        return best_match

    def merge_consecutive_segments(self, segments):
        merged_segments = []
        previous_segment = None
        for segment in segments:
            if previous_segment is None:
                previous_segment = segment
            else:
                if segment[0] == previous_segment[0]:
                    # Merge consecutive segments of the same speaker: keep the
                    # earlier start, extend to the later end, concatenate text
                    previous_segment = (
                        previous_segment[0],
                        previous_segment[1],
                        segment[2],
                        previous_segment[3] + segment[3],
                    )
                else:
                    merged_segments.append(previous_segment)
                    previous_segment = segment
        if previous_segment:
            merged_segments.append(previous_segment)
        return merged_segments

    def get_last_segment(self, annotation):
        last_segment = None
        for segment in annotation.itersegments():
            last_segment = segment
        return last_segment
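Hypothetical usage, assuming a pyannote diarization pipeline and Whisper-style chunk timestamps of the form {"timestamp": (start, end), "text": ...}; the model name and file path are placeholders:

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")  # needs an HF token
diarization = pipeline("interview.wav")
chunks = asr_result["chunks"]  # from a chunked Whisper transcription step

aligner = SpeakerAligner()
for speaker, start, end, text in aligner.align(asr_result["text"], chunks, diarization):
    print(f"{speaker} [{start:.2f}-{end:.2f}]: {text}")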
Fast Whisper-Large-v2 Fine-Tuning with LoRA
Whisper Precision: A Comprehensive Guide to Fine-Tuning and Hyperparameter Tuning
Fine-tuning Whisper on Low-Resource Languages for Real-World Applications
sed
grep
awk
utils/subset_data_dir.sh
.filter()
.select()
.map()
.TextGrid

File type = "ooTextFile"
Object class = "TextGrid"

xmin = 0.0125
xmax = 1.8725
tiers? <exists>
size = 2
item []:
    item [1]:
        class = "IntervalTier"
        name = "phone"
        xmin = 0.0125
        xmax = 1.8725
        intervals: size = 26
        intervals [1]:
            xmin = 0.0125
            xmax = 0.0925
            text = "t"
words.txt (columns: word, start, end, duration, utterance ID)
妈,0.0125,0.3325,0.3200,b01_1_101q
妈,0.3325,0.5125,0.1800,b01_1_101q
们,0.5125,0.7925,0.2800,b01_1_101q
正,0.7925,1.1125,0.3200,b01_1_101q
看,1.1125,1.6725,0.5600,b01_1_101q
...
Create a find_de.awk snippet to find disyllabic phrases in Mandarin that end with 的. (One possible sketch follows.)
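A possible solution sketch, assuming the words.txt format above (word,start,end,duration,utterance ID) and gawk in a UTF-8 locale so that length() counts characters rather than bytes:

# find_de.awk: print two-syllable phrases ending in 的, with timings
BEGIN { FS = "," }
{
    # Current word is 的, adjacent to the previous word in the same
    # utterance, and the previous word is monosyllabic (one character)
    if ($1 == "的" && $5 == prev_utt && $2 == prev_end && length(prev_word) == 1)
        print prev_word $1, prev_start, $3, $5
    prev_word = $1; prev_start = $2; prev_end = $3; prev_utt = $5
}

Run with: gawk -f find_de.awk words.txt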
Python interfaces to Praat TextGrid files:
tgt
praatio
textgrid
parselmouth
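For instance, with the tgt package, reading the phone tier from the TextGrid shown earlier (the filename here is hypothetical):

import tgt

grid = tgt.io.read_textgrid("b01_1_101q.TextGrid")  # hypothetical path
phone_tier = grid.get_tier_by_name("phone")
for interval in phone_tier.intervals:
    print(interval.start_time, interval.end_time, interval.text)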
Large Language Models
| index | pos | word | pinyin | if_neutral | meaning | group |
|---|---|---|---|---|---|---|
| 1 | N | 地方 | dìfang | 是 | 某一区域、空间的一部分、部位 | A |
| 1 | N | 地方 | dìfāng | 否 | 中央下属的各级行政区划的统称,本地、当地 | A |
| 2 | N | 地下 | dìxia | 是 | 指地面上 | A |
| 2 | N | 地下 | dìxià | 否 | 指地面下或秘密的 | A |
| 3 | N | 东西 | dōngxi | 是 | 泛指各种事物,特指人或动物 | A |
| 3 | N | 东西 | dōngxī | 否 | 指东和西两个方向 | A |
| … | | | | | | |
system: sets high-level instructions and context
user: the actual query or input
assistant: the model's generated output
system_msg = (
    "You are a linguist specializing in Chinese semantics. "
    "Given a sentence and a target word with multiple meanings, "
    "your job is to identify the most appropriate meaning from the candidate list."
)
user_msg = (
    f"Sentence: {text}\n"
    f"Target word: {word}\n"
    f"Candidate meanings:\n"
    + "\n".join(f"{i+1}. {c}" for i, c in enumerate(candidate_strings))
    + "\n\nWhich meaning fits best? Respond only with the number (e.g., 1, 2) or 'None'."
)
response = client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": user_msg},
    ],
    temperature=0.2,
)
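The reply is then a bare number (or 'None') that can be mapped back to the candidate list; a small parsing step, added here for completeness:

answer = response.choices[0].message.content.strip()
chosen = candidate_strings[int(answer) - 1] if answer.isdigit() else None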