Corpus Phonetics | Dr Chenzi Xu

Voice quality in connected speech: annotating creaky, breathy, and whispery phonation

Mon, 22 Jun 2026 00:00:00 +0000

Voice quality (VQ) plays an important role in phonetic description and forensic speaker comparison, yet automatic approaches to non-modal phonation often transfer poorly to natural connected-speech corpora. A recent systematic review reports that acoustic studies of breathy and whispery voice in vocally healthy speakers rely heavily on sustained vowels and that whispery voice remains comparatively under-specified methodologically [1]. We present a work-in-progress annotation protocol designed to produce frame-ready supervision for creaky, breathy, and whispery phonation in connected forensic-style speech. The data comprise 62 speakers of Standard Southern British English from two style-controlled Cambridge corpora: 31 male speakers from DyViS and 31 female speakers from DIVERSE [2,3]. DyViS was designed to examine variability in forensic speaker comparison across speaking styles and telephone transmission conditions, while DIVERSE extends this design to female speakers in a mock forensic recording scenario [2,3]. We target approximately 120–150 annotated segments per speaker. The annotation protocol is perceptually guided by Vocal Profile Analysis (VPA), adapting its structured listening approach to frame-level corpus annotation [4]. Three trained phoneticians annotate a shared pilot subset, followed by calibration and adjudication sessions that refine criteria for voicing boundaries, overlap, and recurrent perceptual confusions before the main annotation phase. This workflow reflects evidence that reliable perceptual voice-quality annotation requires explicit training and calibration procedures [4].

Methodologically, the protocol separates voicing from voice quality. Annotators first label voiced versus unvoiced intervals, then annotate creaky, breathy, and whispery on separate tiers within voiced speech. This design addresses a known challenge in irregular-phonation detection: aperiodicity from unvoiced consonants and background noise can produce false positives unless voiced regions are explicitly identified [5]. Within voiced intervals, overlap across VQ categories is permitted when perceptually warranted rather than forcing a single categorical label, reflecting evidence that phonation states may co-occur or transition gradually [1,4]. Interval annotations are subsequently converted to fixed-hop frame targets for modeling. Two training representations are derived: (1) high-confidence consensus labels for categorical model training and evaluation, and (2) agreement-based soft targets that preserve overlapping or mixed phonation. This interval-to-frame pipeline produces a reusable corpus-phonetic resource and provides training data compatible with frame-based phonation detection methods [6].
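The interval-to-frame step described above can be sketched roughly as follows. This is a minimal illustration, not the project's pipeline: the 10 ms hop, the frame-centre rule, the unanimity criterion for "high-confidence" consensus, and all function names are assumptions introduced here.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_s, end_s) of one annotated interval


def intervals_to_frames(intervals: List[Interval], duration_s: float,
                        hop_s: float = 0.01) -> List[int]:
    """Mark a frame 1 if its centre falls inside any interval (assumed rule)."""
    n_frames = int(round(duration_s / hop_s))
    frames = []
    for i in range(n_frames):
        centre = (i + 0.5) * hop_s
        inside = any(start <= centre < end for start, end in intervals)
        frames.append(1 if inside else 0)
    return frames


def mask_unvoiced(vq_frames: List[int], voiced_frames: List[int]) -> List[int]:
    """Keep a voice-quality label only where the voicing tier is voiced."""
    return [q if v else 0 for q, v in zip(vq_frames, voiced_frames)]


def soft_targets(per_annotator: List[List[int]]) -> List[float]:
    """Agreement-based soft target: share of annotators marking each frame."""
    n = len(per_annotator)
    return [sum(column) / n for column in zip(*per_annotator)]


def consensus(soft: List[float]) -> List[int]:
    """1 or 0 where annotators are unanimous, -1 (excluded) otherwise."""
    return [1 if p == 1.0 else 0 if p == 0.0 else -1 for p in soft]


# Hypothetical creaky-tier intervals from three annotators on a 0.1 s clip.
creaky = [
    intervals_to_frames([(0.02, 0.06)], 0.1),
    intervals_to_frames([(0.03, 0.06)], 0.1),
    intervals_to_frames([(0.02, 0.05)], 0.1),
]
soft = soft_targets(creaky)
hard = consensus(soft)
```

Running the same conversion separately for the creaky, breathy, and whispery tiers keeps overlapping categories intact, since each tier yields its own frame vector.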

References

[1] Patman, C., Foulkes, P., & McDougall, K. 2025. Acoustic methods for analysing breathy and whispery voices: a systematic review. Phonetica.
[2] Nolan, F., McDougall, K., de Jong, G., & Hudson, T. 2009. The DyViS database: style-controlled recordings for forensic phonetics. University of Cambridge Phonetics Laboratory.
[3] University of Cambridge Phonetics Laboratory. DIVERSE: Database of Individual Variation in English by Recording style and SEx.
[4] San Segundo, E., Foulkes, P., French, P., Harrison, P., Hughes, V., & Kavanagh, C. 2019. The use of the Vocal Profile Analysis for speaker characterization: methodological proposals. Journal of the International Phonetic Association.
[5] Ishi, C. T., Sakakibara, K., Ishiguro, H., & Hagita, N. 2008. A method for automatic detection of vocal fry. IEEE Transactions on Audio, Speech, and Language Processing.
[6] Murton, O., Hillenbrand, J., & Houde, R. 2019. Identifying a creak probability threshold for an irregular pitch period detection algorithm. Journal of the Acoustical Society of America.

Performance of Montreal Forced Aligner on Cantonese Spontaneous Speech

Fri, 01 Aug 2025 00:00:00 +0000

This study presents a comprehensive evaluation of the Montreal Forced Aligner (MFA) in aligning phone boundaries of Hong Kong Cantonese (HKC) spontaneous speech. We developed two tailored Cantonese MFA models, designed to address distinct Cantonese phonetic features, such as checked syllables. These models were applied to align the same set of recordings from spontaneous interviews, and their performance was compared against human annotations. Our results reveal that the updated Cantonese MFA models achieved decent alignment accuracy on spontaneous speech, with a satisfactory level of agreement with manually adjusted boundaries in vowels. However, Cantonese-specific features and connected speech processes remain major challenges for the current models. This observation allows us to propose specific amendments to the models to improve alignment performance, as well as recommendations on manual boundary adjustments.

The processing of neutral tone in self-supervised learning speech models

Sun, 01 Jun 2025 00:00:00 +0000

The present study explores how self-supervised learning (SSL) speech models represent Mandarin neutral tone, in comparison to the four lexical tones. In Standard Mandarin, neutral tone displays greater variability in its pitch realisation than the canonical four citation tones and is largely influenced by its neighbouring tonal contexts. In the phonological literature, neutral tone has been analysed as the fifth lexical tone, a tone sandhi phenomenon, or a product of tonal neutralisation resulting from the interaction of stress [1]. While some scholars have proposed a phonological specification for neutral tone (e.g. [2, 3]), others have argued for its underspecified representation. Acoustic evidence suggests that neutral tone may often manifest post-lexical boundary tonal features, distinguishing it from the four lexical tones [4].

The main analysis was performed on two publicly available pre-trained models: wav2vec2-large-xlsr-53 [5] and wav2vec2-large-xlsr-53-chinese [6]. The base multilingual XLSR model was trained on unlabelled speech data from multiple datasets including Multilingual LibriSpeech, Common Voice, and Babel, encompassing 53 languages. The latter model was the same base model finetuned for Mandarin Chinese. Both models consist of a feature extraction network of 7 convolutional neural network (CNN) layers and a context network of 24 transformer layers. During pre-training, the CNN block output is quantised into codewords, and these discrete representations over 25 ms windows allow us to probe potential abstract representations of neutral tone at the segmental level. The study used 6393 disyllabic Mandarin words (3337 words with two full tones, 3056 words with a neutral tone) mined from recordings in the Beijing Mandarin subset of the KeSpeech corpus [7], filtered to include only speakers from Beijing to minimise the influence of regional accent on the use of neutral tone. The identification of neutral-toned words was based on a list of words with obligatory neutral tone, as defined by the Standard Mandarin (Putonghua) Proficiency Test, as well as words with a grammatical particle such as de, the modifier marker. It is worth noting that automatic tonal transcriptions of Chinese characters were often inaccurate for neutral tone and were therefore not used.

The study applied both wav2vec2 models to all words in the dataset and extracted the feature encoder (CNN) outputs and the outputs of every transformer layer. Layer-specific Multi-Layer Perceptron (MLP) classifiers were then trained to predict the tone categories, in order to assess the tonal information captured in each layer of the speech models. The classifier probes were evaluated using accuracy, F1 scores, and Matthews correlation coefficients. In addition, codevectors for all frames of each word were generated to test whether there are sets of codevectors that differentiate the lexical tone and neutral tone versions of a given vowel.
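The three probe metrics named above can be computed from a layer's predictions as in the sketch below. It is an illustration only: macro-averaging of F1, the multiclass form of the Matthews correlation coefficient, the tone labels, and the function names are assumptions, since the text does not specify these details.

```python
from collections import Counter
from math import sqrt
from typing import List


def accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Share of items whose predicted tone matches the reference tone."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: List[str], y_pred: List[str]) -> float:
    """Unweighted mean of per-class F1 (macro-averaging is assumed here)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mcc(y_true: List[str], y_pred: List[str]) -> float:
    """Multiclass Matthews correlation coefficient from class counts."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    cross = sum(true_counts[k] * pred_counts[k] for k in true_counts)
    num = correct * n - cross
    den = sqrt((n * n - sum(v * v for v in pred_counts.values()))
               * (n * n - sum(v * v for v in true_counts.values())))
    return num / den if den else 0.0


# Hypothetical predictions from one layer's probe ("N" = neutral tone).
y_true = ["T1", "T1", "T4", "T4", "N", "N"]
y_pred = ["T1", "T1", "T4", "N", "N", "N"]
layer_scores = {
    "accuracy": accuracy(y_true, y_pred),
    "macro_f1": macro_f1(y_true, y_pred),
    "mcc": mcc(y_true, y_pred),
}
```

Repeating this for the CNN output and for each transformer layer gives one score triple per layer, which is what allows layers to be compared.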

The findings suggest the following. (1) The CNN block represents neutral tone in a segment-specific manner, similar to English stress [8], with different codevectors for full tone and neutral tone versions of a given vowel. (2) The transformer-layer-based classifiers outperform the CNN-based classifiers, driven by the enriched context in the transformer block. (3) The classifiers perform best in the middle layers of the networks. (4) The finetuned model improves classifier performance in the middle and later layers, especially the last three layers. (5) The classification of all tones improved in the first 8 layers (0-7) in both models.

References

[1] L. Liu, “20 shiji hanyu qingsheng yanjiu zongshu [An overview of research on the Chinese neutral tone in the 20th century],” Yu wen yan jiu, no. 3, pp. 43–47, 2002.

[2] M. Yip, “The tonal phonology of Chinese,” Thesis, Massachusetts Institute of Technology, 1980.

[3] H. Lin, “Mandarin Neutral Tone as a Phonologically Low Tone,” Journal of Chinese Language and Computing, vol. 16, no. 2, pp. 121–134, Jan. 2006.

[4] C. Xu and C. Zhang, “A cross-linguistic review of citation tone production studies: Methodology and recommendations,” The Journal of the Acoustical Society of America, vol. 156, no. 4, pp. 2538–2565, Oct. 2024.

[5] A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised Cross-Lingual Representation Learning for Speech Recognition,” in Interspeech 2021. ISCA, Aug. 2021, pp. 2426–2430.

[6] J. Grosman, “Fine-tuned XLSR-53 large model for speech recognition in Chinese,” https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-chinese-zh-cn, 2021.

[7] Z. Tang, D. Wang, Y. Xu, J. Sun, X. Lei, S. Zhao, C. Wen, X. Tan, C. Xie, S. Zhou, R. Yan, C. Lv, Y. Han, W. Zou, and X. Li, “KeSpeech: An Open Source Speech Dataset of Mandarin and Its Eight Subdialects,” in 35th Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021, p. 12.

[8] M. Bentum, L. T. Bosch, and T. Lentz, “The Processing of Stress in End-to-End Automatic Speech Recognition Models,” in Interspeech 2024. ISCA, Sep. 2024, pp. 2350–2354.

Tone Patterns in Binumarien Noun Stems (Kainantu, Trans-New Guinea)

Sun, 01 Jun 2025 00:00:00 +0000

This study examines the tonal patterns of disyllabic noun stems in Binumarien, employing fieldwork data and phonetic-acoustic analysis. It also establishes a pipeline for developing and applying language technologies to a low-resource language, benefiting future research. Binumarien (endonym: Afaqinna Ufa) is a Kainantu language (Trans-New Guinea) spoken in the Eastern Highlands Province of Papua New Guinea [1], with around 1200 speakers. Most children grow up with Binumarien as their first language, along with Tok Pisin, the lingua franca, and regional languages.

In this study, speech data were collected between February and April 2023 in Binumarien village, from two male L1 speakers who were born and raised there. The speakers were recorded using a Tascam DR-10 audio recorder with a clip-on Rohde miniature microphone (mono), and a Marantz PMD 661 solid state audio recorder (stereo). The recorders ran simultaneously at 48 kHz, 24-bit. The Tascam recordings in WAV format were used here. The participants were asked to pronounce each noun stem four times: twice in isolation and twice with a suffix. The nouns were extracted from texts and a word list collected during a field trip in 2018–19 for a master’s thesis [3] and from a literacy booklet [4]. In addition to recording the pronunciation of each word, speakers were asked to whistle the tone of each word, for an impressionistic identification of tone. When the interviewer was in doubt, participants were asked to group words based on their tone patterns. All the annotated intervals in the recordings, totalling 1.9 hours, were used to train a preliminary acoustic model with the Montreal Forced Aligner [5], which facilitates word- and segment-level time alignment between audio and text. Upon inspection, about 15% of the word-level alignments of nouns were manually corrected. Sound intervals of disyllabic noun stems in citation form were then extracted, within which f0 measurements were obtained every 10 milliseconds in voiced regions using the Parselmouth library [6, 7] (floor: 75 Hz, ceiling: 300 Hz). The f0 measurements in Hz were normalised to semitones relative to the speaker mean, and the corresponding time values in seconds were linearly scaled to the range [0,1], to enable the comparison of contour shapes across speakers and word items.
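The two normalisation steps at the end of this paragraph can be sketched as below. This is a minimal illustration under stated assumptions: the speaker reference is taken to be the arithmetic mean of that speaker's voiced f0 values in Hz, and the function names and sample values are hypothetical.

```python
from math import log2
from statistics import mean
from typing import List


def to_semitones(f0_hz: List[float], ref_hz: float) -> List[float]:
    """Convert f0 in Hz to semitones relative to a reference frequency."""
    return [12 * log2(f / ref_hz) for f in f0_hz]


def scale_time(times_s: List[float]) -> List[float]:
    """Linearly rescale time stamps so the first is 0 and the last is 1."""
    if len(times_s) < 2:
        raise ValueError("need at least two time points to rescale")
    start, span = times_s[0], times_s[-1] - times_s[0]
    return [(t - start) / span for t in times_s]


# Hypothetical voiced frames of one word, sampled every 10 ms.
times = [0.30, 0.31, 0.32, 0.33, 0.34]
f0 = [110.0, 115.0, 120.0, 130.0, 140.0]

# Assumption: in practice the reference is the mean over all of the
# speaker's voiced frames; the mean of this one word stands in for it here.
speaker_mean_hz = mean(f0)
contour_st = to_semitones(f0, speaker_mean_hz)
contour_t = scale_time(times)
```

Because a doubling of frequency always maps to 12 semitones, contours from speakers with different pitch ranges become comparable once each is expressed relative to that speaker's own reference.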

Impressionistically, we find four tones on syllables on the surface: high (H), low (L), falling (F), and rising (R) (see poster for illustrations). F occurs only on long vowels and diphthongs, while R appears on these as well as on stem-final short vowels and occasionally between a L and a H. Whether F and R should phonologically be seen as separate tonal units or as combinations of underlying L and H is to be investigated.

Acoustically, the f0 contours of 50 disyllabic noun stems (coded in different colours) are shown in the poster, grouped in eight preliminary clusters. The eight clusters were based on the visual inspection of the surface f0 contours. Variations within each cluster can be partially attributed to the differences in syllable structure among these words. Some clusters may appear similar but serve to differentiate word meanings. For example, the f0 contours of the minimal pair of aandau are illustrated in the poster, where the two f0 patterns, LH and LF, are affiliated with different word meanings (‘white hair’ and ‘animal’). This empirical study contributes to our knowledge of the Binumarien tonal system. We are collecting more data on a variety of speech materials from additional speakers to generalise our findings and gain deeper insights into the tonal system, distinct from many Southeast Asian varieties.

References

[1] Pawley, A., & H. Hammarström. 2017. The Trans New Guinea family. In Bill Palmer (Ed.), The languages and linguistics of the New Guinea area: A comprehensive guide. Berlin: De Gruyter, 21–196.

[2] Oatridge, D. & J. Oatridge. 1965. Phonemes of Binumarien. In Frantz, Frantz, Oatridge, Oatridge, Loving, Swick, Pence, Staalsen, Boxwell & Boxwell (Eds.), Papers in New Guinea Linguistics. Canberra: Australian National University, 13-22.

[3] van Dasselaar, R. 2019. Topics in the Grammar of Binumarien: Tone and Switch-reference in a Kainantu Language of Papua New Guinea. Master’s thesis, Leiden University.

[4] Aadoo. 1973. Oosana Oosana Aandau Ufa - Animals and Birds. Summer Institute of Linguistics.

[5] McAuliffe, M., Socolof, M., Mihuc, S., Wagner, M., & Sonderegger, M. 2017. Montreal Forced Aligner: Trainable Text-Speech Alignment Using Kaldi. Interspeech.

[6] Jadoul, Y., Thompson, B., & de Boer, B. 2018. Introducing Parselmouth: A Python interface to Praat. Journal of Phonetics, 71, 1-15. https://doi.org/10.1016/j.wocn.2018.07.001

[7] Boersma, P., & Weenink, D. 2021. Praat: doing phonetics by computer [Computer program]. Version 6.1.38, retrieved 2 January 2021 from http://www.praat.org/

Exploring individual speaker performance within a forensic automatic speaker recognition system

Mon, 01 Jul 2024 00:00:00 +0000

A key issue for automatic speaker recognition (ASR), particularly for forensics, is our lack of understanding about why certain voices prove more or less of a challenge for systems. In this paper, we focus on variability in individual speaker performance within an x-vector ASR system and examine this variability as a function of the phonetic content within speech samples. The inclusion of vowels generally improved performance, but not for all speakers. Indeed, some speakers produced broadly the same Cllr irrespective of the phonetic content in the speech samples. Poor ASR performance was not well correlated with long-term laryngeal features (f0 and laryngeal voice quality) and these features may provide additional speaker discriminatory information for some speakers. We discuss the implications of these findings in terms of developing a speaker quality metric for flagging potentially problematic speakers prior to ASR comparison.

Contributions of acoustic measures to the classification of laryngeal voice quality in continuous English speech

Thu, 24 Aug 2023 00:00:00 +0000

Laryngeal voice qualities (e.g. breathy and creaky voice), variable within and across speakers, often pose a challenge in data collection. Their acoustic correlates are still inadequately understood. This study revisits the acoustics of laryngeal voice qualities in high-quality recordings of continuous British English speech produced by experienced phoneticians. Through principal component analysis and multinomial logistic regression with L1 regularisation, this study identifies contributions of a variety of acoustic measures to the classification of laryngeal voice qualities and provides a multidimensional acoustic profile for breathy, creaky, and modal voice. Classification rates as high as 90% were achieved using the first 5 principal components. The most salient acoustic correlates for creaky voice are, compared to other categories, higher mean H2*, lower mean f0 and HNR below 500 Hz, and for breathy voice, higher mean H1* and spectral tilt measures such as H1*–A1* and H1*–H2*.

Impact of the changes in long-term acoustic features upon different-speaker ASR scores

Mon, 24 Jul 2023 00:00:00 +0000

Automatic speaker recognition (ASR) systems usually take a pair of speech recordings as input, extract their speaker embeddings using deep learning (e.g. x-vectors; Snyder et al. 2018), and output through a classifier a speaker similarity score, which is in turn calibrated to a likelihood ratio. Despite the increasing accuracy of ASR predictions, relatively little is known about the relationship between voice properties and ASR outputs. It has thus been a challenge to explain the output to an end-user in a forensic context. This study aims to improve the interpretability of the scores produced by an ASR system by assessing how acoustic mismatches related to speech production impact different-speaker scores on a given evaluation corpus. Hautamäki and Kinnunen (2020) identified the most prominent factor in explaining low same-speaker scores as the difference in long-term f0 mean. This study focuses on the different-speaker scores in forensically realistic data and explores how differences in a range of acoustic features contribute to the discrimination of speakers. In particular, which acoustic similarities between speakers make discrimination more difficult?

In this experiment, we model the impact of acoustic distance on the ASR score in discriminating speakers with similar demographic profiles. The study utilised a subset of the Home Office Contest corpus* containing 155 mobile phone recordings, all from different male speakers of London English. Each recording is a single channel of a mobile phone conversation, about 15 minutes long, with an 8 kHz sampling rate. Different-speaker (DS) comparisons were conducted using the pre-trained VOCALISE 2021 ASR system (version 3.0.0.1746; Kelly et al. 2019) with x-vectors and PLDA to generate scores. The scores were calibrated using a dataset of mobile phone recordings (8 kHz, 16 bit, and single channel) from 20 speakers with a similar demographic profile – male London speakers – from the GBR-ENG corpus. We randomly selected two recordings per speaker for calibration. Bayesian calibration with Jeffreys non-informative priors was used due to the relatively small calibration set (Brümmer & Swart, 2014). The Cllr based on the DS likelihood-ratio values was 0.0152, and 0.15% of the pairs (18/11925) had a positive calibrated score (i.e. one that lends contrary-to-fact support to a same-speaker decision). A range of acoustic features including f0, formants, formant bandwidths, jitter, shimmer, and spectral tilts were extracted automatically using Praat (Boersma & Weenink, 2022) and the OpenSMILE toolkit (Eyben et al. 2013). In our regression models, the dependent variable is the calibrated score and the predictor is the acoustic distance between speakers in each comparison, represented by the absolute differences of the statistics of the selected long-term acoustic features or ensemble differences of feature groups. In general, the larger the acoustic distance, the lower the calibrated score. Specific pairs that were difficult to discriminate in the ASR system are further examined and discussed.
The findings will help us to flag or predict difficult voices for the ASR system to discriminate, and facilitate further exploration on how the discrimination may be improved with score calibration based on a dataset with acoustically similar speakers.
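For readers unfamiliar with the log-likelihood-ratio cost (Cllr) and the contrary-to-fact rate reported above, the sketch below shows one common way to compute them. It is not the study's own code: it assumes calibrated scores are base-10 log likelihood ratios, the function names are hypothetical, and how a Cllr restricted to DS comparisons is scaled is an assumption, so the DS term is exposed separately.

```python
from math import log2
from typing import List


def ds_penalty(ds_log10_lr: List[float]) -> float:
    """Mean cost over different-speaker pairs: log2(1 + LR)."""
    return sum(log2(1 + 10 ** x) for x in ds_log10_lr) / len(ds_log10_lr)


def ss_penalty(ss_log10_lr: List[float]) -> float:
    """Mean cost over same-speaker pairs: log2(1 + 1/LR)."""
    return sum(log2(1 + 10 ** (-x)) for x in ss_log10_lr) / len(ss_log10_lr)


def cllr(ss_log10_lr: List[float], ds_log10_lr: List[float]) -> float:
    """Standard Cllr: average of the same- and different-speaker costs."""
    return 0.5 * (ss_penalty(ss_log10_lr) + ds_penalty(ds_log10_lr))


def contrary_to_fact_rate(ds_log10_lr: List[float]) -> float:
    """Share of DS pairs whose score is positive, i.e. LR > 1."""
    return sum(x > 0 for x in ds_log10_lr) / len(ds_log10_lr)


# Hypothetical calibrated scores (log10 LR) for four different-speaker pairs.
ds_scores = [-2.0, -1.0, 0.5, -3.0]
rate = contrary_to_fact_rate(ds_scores)
cost = ds_penalty(ds_scores)
```

A system that always outputs LR = 1 scores a Cllr of 1, so values well below 1 indicate that the likelihood ratios carry useful, well-calibrated information.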

*Both the GBR-ENG corpus and the Home Office Contest corpus belong to a telephonic speech database collected for the UK Government for evaluating speech technologies. Further details are available on application.

References

Brümmer, N., & Swart, A. (2014). Bayesian calibration for forensic evidence reporting. arXiv preprint arXiv:1403.5997.

Eyben, Florian, Felix Weninger, Florian Gross, and Björn Schuller. 2013. “Recent Developments in openSMILE, the Munich Open-Source Multimedia Feature Extractor.” In Proceedings of the 21st ACM International Conference on Multimedia. New York, NY, USA: ACM. https://doi.org/10.1145/2502081.2502224.

Hautamäki, Rosa González, and Tomi Kinnunen. 2020. “Why Did the X-Vector System Miss a Target Speaker? Impact of Acoustic Mismatch upon Target Score on VoxCeleb Data.” In Interspeech 2020. ISCA. https://doi.org/10.21437/interspeech.2020-2715.

Kelly, F., Forth, O., Kent, S., Gerlach, L. and Alexander, A. (2019) Deep neural network based forensic automatic speaker recognition in VOCALISE using x-vectors. Proceedings of the Audio Engineering Conference: 2019 AES International Conference on Audio Forensics.

Boersma, P. & Weenink, D. (2022) Praat: doing phonetics by computer [Computer program]. Version 6.2.06, retrieved 23 January 2022 from https://www.praat.org.

Snyder, D., Garcia-Romero, D., Sell, G., Povey, D. and Khudanpur, S. (2018) X-vectors: robust DNN embeddings for speaker recognition. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, 5329–5333.

Cross-dialectal perspectives on Mandarin neutral tone

Mon, 24 Oct 2022 00:00:00 +0000

With an aim to investigate the nature of Mandarin neutral tone through the lens of language variation and change, this study examines the pitch patterns of speech sequences containing neutral tone syllables, i.e. those that do not have any of the four canonical lexical tones and are often overlooked in prior studies of tones, in two Mandarin varieties: Standard Mandarin and Plastic Mandarin spoken in Changsha, China. Using Generalised Additive Mixed Models, the study shows (a) that f0 contours of a sequence of neutral tone syllables following various lexical tones converge in the end at a low pitch in both Mandarin varieties, and (b) that the low pitch target of neutral tone syllables tends to be the same across the two Mandarin varieties. The cross-dialectal comparison favours the phonological account that neutral tone is underlyingly underspecified and attracts the boundary tone. It suggests that the constant pitch target across two Mandarin varieties with distinct lexical tone contours may be attributed to the stable transfer of prosodic structure in the Standard-Plastic variation.

Revisiting Neutral Tone in Mandarin Broadcast News Speech

Mon, 27 Apr 2020 00:00:00 +0000