Publications


S. Kacprzak and K. Kowalczyk, “HeightCeleb - an enrichment of VoxCeleb dataset with speaker height information,” arXiv preprint arXiv:2410.12668, 2024, doi: 10.48550/arXiv.2410.12668.

Prediction of a speaker’s height is of interest for voice forensics, surveillance, and automatic speaker profiling. Until now, TIMIT has been the most popular dataset for training and evaluating height estimation methods. In this paper, we introduce HeightCeleb, an extension of VoxCeleb, a dataset commonly used in speaker recognition tasks. The enrichment adds height information for all 1251 VoxCeleb speakers, extracted with an automated method from publicly available sources. Such annotated data will enable the research community to use freely available speaker embedding extractors, pre-trained on VoxCeleb, to build more efficient speaker height estimators. We describe the creation of the HeightCeleb dataset and show that it enables state-of-the-art results on the TIMIT test set using simple statistical regression methods and embeddings obtained with a popular speaker model, without any additional fine-tuning.
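
A minimal sketch of the "simple statistical regression on pre-trained embeddings" idea described above, assuming speaker embeddings (e.g. extracted with a VoxCeleb-pretrained model) and HeightCeleb height labels are already saved as NumPy arrays; the file names and the choice of ridge regression are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: predict speaker height from pre-computed speaker embeddings
# with a simple linear (ridge) regressor.
# Assumed inputs: train_emb.npy / test_emb.npy hold one embedding per row,
# train_height.npy / test_height.npy hold heights in centimetres.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

X_train = np.load("train_emb.npy")      # (n_train, emb_dim), e.g. x-vectors
y_train = np.load("train_height.npy")   # (n_train,) heights in cm
X_test = np.load("test_emb.npy")
y_test = np.load("test_height.npy")

reg = Ridge(alpha=1.0).fit(X_train, y_train)
pred = reg.predict(X_test)
print(f"MAE: {mean_absolute_error(y_test, pred):.2f} cm")
```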

M. Igras-Cybulska et al., “Towards multimodal VR trainer of voice emission and public speaking - work-in-progress,” in 2023 IEEE conference on virtual reality and 3D user interfaces abstracts and workshops (VRW), 2023, pp. 355–359. doi: 10.1109/VRW58643.2023.00079.

GlossoVR is a virtual reality (VR) application that combines training in public speaking in front of a virtual audience with voice emission training through relaxation exercises. It is accompanied by digital signal processing (DSP) and artificial intelligence (AI) modules that provide automatic feedback on the vocal performance as well as the behavior and psychophysiology of the user. In particular, we address parameters of speech emotion, prosody, and timbre, as well as the user’s hand gestures and eye movements. The prototype is in the proof-of-concept phase, and we are developing it in accordance with the user-centered design paradigm. This article reports work in progress, focusing on the approaches, datasets, and algorithms applied in the current state of the glossoVR project.

J. Bartolewska, S. Kacprzak, and K. Kowalczyk, “Causal signal-based DCCRN with overlapped-frame prediction for online speech enhancement,” in Proc. INTERSPEECH 2023, 2023, pp. 4039–4043.

The aim of speech enhancement is to improve speech signal quality and intelligibility given a noisy microphone signal. In many applications, it is crucial to enable processing with small computational complexity and minimal requirements regarding access to future signal samples (look-ahead). This paper presents a signal-based causal DCCRN that improves online single-channel speech enhancement by reducing the required look-ahead and the number of network parameters. The proposed modifications include complex filtering of the signal, application of overlapped-frame prediction, causal convolutions and deconvolutions, and modification of the loss function. Results of the performed experiments indicate that the proposed model with overlapped signal prediction and additional adjustments achieves similar or better performance than the original DCCRN in terms of various speech enhancement metrics, while it reduces the latency and the number of network parameters by around 30%.
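
One of the modifications listed above is the use of causal convolutions, which prevent the network from looking at future frames. Below is a minimal PyTorch sketch of a causal 1-D convolution implemented with left-only padding; it illustrates the general technique, not the exact complex-valued layer configuration of DCCRN.

```python
# Sketch: a causal 1-D convolution that only pads on the left, so output
# frame t depends on input frames <= t (no look-ahead). Illustrative of
# causal convolutions in general, not the DCCRN layers themselves.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # pad only the past
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))      # left-pad: no future samples used
        return self.conv(x)

x = torch.randn(1, 16, 100)             # dummy feature sequence
y = CausalConv1d(16, 32, kernel_size=3)(x)
print(y.shape)                           # torch.Size([1, 32, 100])
```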

E. Stefanowska and S. Kacprzak, “Generating melodic dictations using markov chains and LSTM neural networks,” in Audio engineering society convention 154, 2023.

Melodic dictations are aural training exercises that require students to transcribe the melody they hear into musical notation. In this paper, we propose three algorithms that generate single-voice melodies that could serve as melodic dictations. The first algorithm utilizes a higher-order Markov chain model to generate melodic patterns based on a given training set of dictations. The second algorithm employs a neural network with Long Short-Term Memory (LSTM) layers and the Bahdanau attention mechanism. The third algorithm generates melodies by choosing each note randomly. We analyzed the generated dictations using a dissimilarity index based on cross-correlation to demonstrate that the algorithms generate novel and diverse melodic dictations. To evaluate the musical quality of the melodies, we conducted a survey in which professional music theory teachers graded the dictations from the training set and those generated by the algorithms. The results indicate that some of the generated dictations are comparable in quality to those in the training set and could find potential applications in musical education.
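
For the first algorithm, a higher-order Markov chain over notes can be estimated by counting which note follows each fixed-length context. A minimal sketch, assuming melodies are represented as lists of MIDI pitch numbers; the training melodies and the chain order below are placeholders, not the dictation corpus or settings from the paper.

```python
# Sketch: generate a melody with a higher-order Markov chain estimated
# from example melodies (lists of MIDI pitch numbers).
import random
from collections import defaultdict

def train_markov(melodies, order=2):
    transitions = defaultdict(list)
    for mel in melodies:
        for i in range(len(mel) - order):
            context = tuple(mel[i:i + order])
            transitions[context].append(mel[i + order])
    return transitions

def generate(transitions, seed, length=16):
    melody = list(seed)
    for _ in range(length - len(seed)):
        context = tuple(melody[-len(seed):])
        candidates = transitions.get(context)
        if not candidates:          # unseen context: stop early
            break
        melody.append(random.choice(candidates))
    return melody

training = [[60, 62, 64, 65, 67, 65, 64, 62, 60],
            [60, 64, 67, 64, 60, 62, 64, 62, 60]]
model = train_markov(training, order=2)
print(generate(model, seed=(60, 62), length=12))
```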

S. Kacprzak, M. Rybicka, and K. Kowalczyk, “Spoken language recognition with cluster-based modeling,” in ICASSP 2022 - 2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), May 2022, pp. 6867–6871. doi: 10.1109/ICASSP43922.2022.9747515.

In this study, we analyze the incorporation of cluster-based modeling into language recognition systems in which a single utterance is represented as an embedding, deploying the widely used i-vectors and x-vectors. We compare the results obtained with Cosine Distance Scoring, a Gaussian Mixture Model, Logistic Regression, and a Mixture of von Mises-Fisher distributions with classifiers based on the proposed approach, which incorporates cluster-based sub-models. Experimental evaluation is performed on the i-vector embeddings from the NIST 2015 language recognition i-vector machine learning challenge and the x-vector embeddings from the Oriental Language Recognition 2020 Challenge (AP20-OLR). The experimental results clearly show that the proposed approach combined with a discriminatively trained Logistic Regression classifier achieves notable improvements over the baseline systems, i.e., those without language sub-models, and that our approach is competitive with other systems reported in the literature.
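
The cluster-based modeling idea can be illustrated as follows: split each language's training embeddings into k-means sub-clusters, train a discriminative classifier over the sub-cluster labels, and sum the sub-cluster posteriors back into per-language scores. A minimal sketch on synthetic data; the number of sub-clusters and the synthetic "embeddings" are illustrative assumptions, not the challenge setup.

```python
# Sketch: cluster-based sub-models for language recognition.
# Each language is split into k-means sub-clusters, a logistic regression
# is trained over the sub-cluster labels, and sub-cluster posteriors are
# summed back per language. Synthetic 2-D vectors stand in for i/x-vectors.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_lang, n_sub, per_lang, dim = 3, 2, 200, 2
X = rng.normal(size=(n_lang * per_lang, dim)) + np.repeat(
    rng.normal(scale=5.0, size=(n_lang, dim)), per_lang, axis=0)
y = np.repeat(np.arange(n_lang), per_lang)

# Assign each utterance a (language, sub-cluster) label.
sub_labels = np.empty_like(y)
for lang in range(n_lang):
    idx = np.where(y == lang)[0]
    km = KMeans(n_clusters=n_sub, n_init=10, random_state=0).fit(X[idx])
    sub_labels[idx] = lang * n_sub + km.labels_

clf = LogisticRegression(max_iter=1000).fit(X, sub_labels)

# Language score = sum of the posteriors of its sub-clusters.
post = clf.predict_proba(X)                       # (n, n_lang * n_sub)
lang_scores = post.reshape(len(X), n_lang, n_sub).sum(axis=2)
print("train accuracy:", (lang_scores.argmax(axis=1) == y).mean())
```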

J. Bartolewska, S. Kacprzak, and K. Kowalczyk, “Refining DNN-based mask estimation using CGMM-based EM algorithm for multi-channel noise reduction,” Proc. Interspeech 2022, pp. 2923–2927, 2022.

In this paper, we present a method that further improves the speech enhancement obtained with recently introduced Deep Neural Network (DNN) models. We propose a multi-channel refinement of time-frequency masks obtained with single-channel DNNs, which consists of an iterative Complex Gaussian Mixture Model (CGMM) based algorithm followed by optimum spatial filtration. We validate our approach on time-frequency masks estimated with three recent deep learning models, namely DCUnet, DCCRN, and FullSubNet. We show that the proposed mask refinement procedure improves the accuracy of the estimated masks, in terms of the Area Under the ROC Curve (AUC) measure, and as a consequence the overall quality of the enhanced speech signal, as measured by PESQ improvement, and that the improvement is consistent across all three DNN models.

S. Kacprzak and K. Kowalczyk, “Adversarial domain adaptation with paired examples for acoustic scene classification on different recording devices,” in 2021 29th european signal processing conference (EUSIPCO), Aug. 2021, pp. 1030–1034. doi: 10.23919/EUSIPCO54536.2021.9616321.

In classification tasks, classification accuracy diminishes when the data are gathered in different domains. To address this problem, in this paper we investigate several adversarial models for domain adaptation (DA) and their effect on the acoustic scene classification task. The studied models include several types of generative adversarial networks (GAN) with different loss functions, as well as the so-called cycle GAN, which consists of two interconnected GAN models. The experiments are performed on the DCASE20 challenge task 1A dataset, in which we can leverage paired examples of data recorded using different devices, i.e., the source and target domain recordings. The results of the performed experiments indicate that the best performing domain adaptation is obtained with the cycle GAN, which achieves as much as a 66% improvement.

M. Igras-Cybulska et al., “glossoVR - voice emission and public speech training system,” in 2020 IEEE conference on virtual reality and 3D user interfaces abstracts and workshops (VRW), 2020, pp. 832–833. doi: 10.1109/VRW50115.2020.00267.

A new VR application for voice and speech training has emerged from a problem observable in everyday life: anxiety about public speaking. In the design process, we incorporated both the domain knowledge of experts and research with end users in order to explore the needs and the context of the problem. The functionalities of the prototype are the result of this user-centered process, intended to best suit users’ needs and the way they interact with the VR environment.

M. Rybicka, S. Kacprzak, M. Witkowski, and K. Kowalczyk, “Description of the DSP AGH systems for the SdSV challenge,” 2020.

In the following, we describe the systems used to generate the DSP AGH submission to the Short-duration Speaker Verification Challenge 2020, in which we address the problem of speaker verification from utterances of short duration with cross-language domain mismatch between enroll and test conditions. We perform domain adaptation directly in the speaker embedding space using a cycle-consistent generative adversarial network (CycleGAN), and present a suitable network architecture and loss to operate on vector embeddings.

M. Ziółko and S. Kacprzak, “Language ranking based on frequency varieties of phones,” Multimedia Tools and Applications, pp. 1–14, 2019.

Phones for 239 non-annotated languages were selected by automatic segmentation based on changes of energy in the time-frequency representation of speech signals. Phone boundaries were set at locations of relatively large changes in the energy distribution between seven frequency bands. A vector of average energies calculated for eleven frequency bands was chosen as the representation of a single phone. We focus our research on an unsupervised comparison of phone distributions in 239 languages. Using the hierarchical clustering method, the relationship between the number of clusters and Ward’s distance was determined. A mathematical model is proposed to describe this dependency. Its four parameters are determined for each language individually to model the relationship between the number of clusters and the frequency diversity of the phones contained in the clusters. We used these relationships to compare languages and to create their ranking based on the size of phone varieties in the frequency domain.
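
The ranking rests on fitting a four-parameter model of how the number of clusters depends on Ward's distance. As a sketch of fitting such a dependency, the code below fits a sum of two exponentials (the four-parameter form mentioned in a related abstract further down this page) to synthetic (distance, cluster-count) pairs with SciPy; the data points and the exact functional form are assumptions for illustration only.

```python
# Sketch: fit a four-parameter model (sum of two exponentials) to the
# relationship between a Ward's-distance threshold and the number of
# clusters. The data points below are synthetic placeholders.
import numpy as np
from scipy.optimize import curve_fit

def model(d, a, b, c, e):
    return a * np.exp(-b * d) + c * np.exp(-e * d)

distance = np.linspace(0.1, 5.0, 30)
n_clusters = model(distance, 80, 2.0, 20, 0.3) \
    + np.random.default_rng(0).normal(0, 1, 30)

params, _ = curve_fit(model, distance, n_clusters, p0=(50, 1.0, 10, 0.1))
print("fitted parameters:", np.round(params, 3))
```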

S. Kacprzak, “Spoken language recognition in i-vector space using cluster based modeling,” PhD thesis, AGH University of Science and Technology, 2019.

This thesis investigates the use of clustering algorithms in the spoken language recognition task. The problem of clustering speech utterances into groups that correspond to languages is analysed based on recordings transformed into the i-vector space. Different clustering algorithms and their configurations are tested on the NIST i-vector LRE data set. The obtained clusterings are assessed with external and internal clustering quality measures. Experiments show that the mean shift algorithm with a cosine kernel is capable of achieving relatively pure clusters. Based on observations from the clustering experiments, a modification of the standard language recognition system is proposed. This modification consists of creating additional cluster-based models for each language with the k-means algorithm. Experiments show that the additional models with simple linear classifiers achieve results competitive with those obtained with complex non-linear classifiers. The proposed system modifications enable parallelism and can be applied in existing i-vector based language recognition systems.

K. Kowalczyk, S. Kacprzak, and M. Ziółko, “On the extraction of early reflection signals for automatic speech recognition,” in 2017 IEEE 2nd international conference on signal and image processing (ICSIP), Aug. 2017, pp. 351–355. doi: 10.1109/SIPROCESS.2017.8124563.

Room reverberation caused by multipath sound wave propagation in acoustic enclosures constitutes an unwanted distortion for automatic speech recognition systems. Multichannel speech enhancement methods often aim to enhance the signal impinging at the microphone array from the source direction while reducing late reverberation. In this paper, we investigate the applicability of spatial filters which constructively combine the direct-path signal with distinct early room reflection signals to increase the direct-to-reverberation ratio and to reduce the word error rate (WER) of automatic speech recognition systems. We present suitable filters and compare them with existing approaches. Results for the simulated acoustic environments indicate that an improvement in WER can indeed be achieved by the spatial filters which account for strong early reflections.

S. Kacprzak, B. Chwiećko, and B. Ziółko, “Speech/music discrimination for analysis of radio stations,” in 2017 international conference on systems, signals and image processing (IWSSIP), May 2017, pp. 1–4. doi: 10.1109/IWSSIP.2017.7965606.

A computationally efficient feature called Minimum Energy Density (MED) was applied to discriminate between speech and music in radio station programs. The presented binary classifier is based on testing two features: the energy distribution and the differences between energy in channels. We analyzed 240 hours of signals from 10 Polish radio stations. Our analysis enables us to provide information about the content of particular radio stations.
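
The MED feature itself is defined in the paper; the general flavour of an energy-distribution test can nonetheless be sketched: compute short-frame energies and measure how much of the signal sits near zero energy, since speech contains frequent pauses and music usually does not. The threshold values and the toy signals below are illustrative assumptions, not the exact MED definition or decision rule from the paper.

```python
# Sketch: a crude energy-distribution test for speech/music discrimination.
# Speech tends to contain many low-energy frames (pauses), so the share of
# near-silent frames is higher than for music. Thresholds are placeholders.
import numpy as np

def low_energy_ratio(signal, sr, frame_ms=20, rel_threshold=0.05):
    frame_len = int(sr * frame_ms / 1000)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).mean(axis=1)
    return (energy < rel_threshold * energy.max()).mean()

def classify(signal, sr, decision_threshold=0.2):
    return "speech" if low_energy_ratio(signal, sr) > decision_threshold else "music"

sr = 16000
t = np.arange(sr * 2) / sr
music_like = np.sin(2 * np.pi * 440 * t)                     # continuous tone
speech_like = music_like * (np.sin(2 * np.pi * 3 * t) > 0)   # gated: pauses
print(classify(music_like, sr), classify(speech_like, sr))
```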

S. Kacprzak, “Spoken language clustering in the i-vectors space,” in 2017 international conference on systems, signals and image processing (IWSSIP), May 2017, pp. 1–5. doi: 10.1109/IWSSIP.2017.7965607.

This paper presents the results of language clustering in the i-vector space, a method to determine in an unsupervised manner how many languages are in a data set and which recordings contain the same language. The densest i-vector clusters are found using the DBSCAN algorithm in a low-dimensional space obtained by the t-SNE method. The quality of clustering for spherical k-means and the proposed method is tested on the data from the NIST 2015 i-Vector Challenge. The usefulness of the obtained clustering is tested in the challenge evaluation system. The results demonstrate that the proposed method finds 109 dense clusters with low impurity for 50 target languages.
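
A minimal sketch of the clustering pipeline described above: project embeddings to a low-dimensional space with t-SNE and then run DBSCAN to find the dense clusters. Synthetic Gaussian blobs stand in for the NIST i-vectors, and the t-SNE and DBSCAN hyper-parameters are placeholders rather than the values used in the paper.

```python
# Sketch: find dense clusters of utterance embeddings by first reducing
# dimensionality with t-SNE and then clustering with DBSCAN.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# 5 "languages", 100 utterances each, 400-dimensional "i-vectors"
centers = rng.normal(scale=4.0, size=(5, 400))
X = np.vstack([c + rng.normal(size=(100, 400)) for c in centers])

X_2d = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
labels = DBSCAN(eps=3.0, min_samples=10).fit_predict(X_2d)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"found {n_clusters} dense clusters, {np.sum(labels == -1)} outliers")
```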

M. Witkowski, S. Kacprzak, P. Zelasko, K. Kowalczyk, and J. Galka, “Audio replay attack detection using high-frequency features,” in Interspeech, 2017, pp. 27–31. doi: 10.21437/Interspeech.2017-776.

This paper presents our contribution to the ASVspoof 2017 Challenge. It addresses a replay spoofing attack against a speaker recognition system by detecting that the analysed signal has passed through multiple analogue-to-digital (AD) conversions. Specifically, we show that most of the cues that enable detection of replay attacks can be found in the high-frequency band of the replayed recordings. The described anti-spoofing countermeasures are based on (1) modelling the subband spectrum and (2) using the proposed features derived from linear prediction (LP) analysis. The results of the investigated methods show a significant improvement in comparison to the baseline system of the ASVspoof 2017 Challenge. A relative equal error rate (EER) reduction of 70% was achieved.
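
As an illustration of the second type of countermeasure, the sketch below high-pass filters a signal to keep the upper band and computes linear prediction (LP) coefficients on it. The cut-off frequency, the LP order, and the use of SciPy/librosa are illustrative assumptions, not the exact feature extraction pipeline from the paper.

```python
# Sketch: LP-based features from the high-frequency band of a recording.
# Cut-off frequency and LP order are illustrative choices.
import numpy as np
import librosa
from scipy.signal import butter, sosfiltfilt

def highband_lp_features(y, sr, cutoff_hz=4000, lp_order=12):
    sos = butter(4, cutoff_hz, btype="highpass", fs=sr, output="sos")
    y_high = sosfiltfilt(sos, y)
    # LP coefficients of the high-band signal (a[0] is always 1, so drop it).
    return librosa.lpc(y_high.astype(np.float32), order=lp_order)[1:]

sr = 16000
y = np.random.default_rng(0).normal(size=sr).astype(np.float32)  # placeholder signal
print(highband_lp_features(y, sr))
```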

S. Kacprzak, M. Mąsior, and M. Ziółko, “Automatic extraction and clustering of phones,” in 2016 signal processing: Algorithms, architectures, arrangements, and applications (SPA), 2016, pp. 310–314. doi: 10.1109/SPA.2016.7763633.

Automatic segmentation and parametrization based on frequency analysis were used for comparison with manually annotated phones. Phone boundaries were set in places of relatively large changes in the energy distribution between frequency bands. Frequency parametrization and clustering enabled the division of phones into groups (clusters) according to their acoustic similarities. The results of the performed experiments showed that analysis of frequency properties alone results in correct segmentation, but the accuracy of recognition was about 20%.
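
The segmentation step places phone boundaries where the energy distribution across frequency bands changes strongly. A minimal sketch of this idea: compute per-frame band energies from an STFT, normalise them into a distribution, and mark the frames where that distribution changes the most. The band layout, frame size, and number of boundaries are illustrative assumptions.

```python
# Sketch: place candidate phone boundaries where the distribution of
# energy across frequency bands changes the most between frames.
import numpy as np
from scipy.signal import stft

def band_energy_boundaries(signal, sr, n_bands=7, n_boundaries=10):
    f, t, Z = stft(signal, fs=sr, nperseg=int(0.02 * sr))
    power = np.abs(Z) ** 2
    # Split frequency bins into n_bands groups and sum energy per band.
    bands = np.array_split(power, n_bands, axis=0)
    band_energy = np.stack([b.sum(axis=0) for b in bands])      # (bands, frames)
    dist = band_energy / (band_energy.sum(axis=0, keepdims=True) + 1e-12)
    change = np.abs(np.diff(dist, axis=1)).sum(axis=0)          # per-frame change
    idx = np.argsort(change)[-n_boundaries:]                    # largest changes
    return np.sort(t[idx + 1])                                  # boundary times (s)

sr = 16000
sig = np.random.default_rng(0).normal(size=sr)  # placeholder for a speech signal
print(band_energy_boundaries(sig, sr))
```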

J. Grzybowska and S. Kacprzak, “Speaker age classification and regression using i-vectors,” in Interspeech 2016, 2016, pp. 1402–1406. doi: 10.21437/Interspeech.2016-1118.

In this paper, we examine the use of i-vectors both for age regression and for age classification. Although i-vectors have previously been used for the age regression task, we extend this approach by applying a fusion of i-vector and acoustic feature regression to estimate the speaker’s age. With our fusion, we obtain a relative improvement of 12.6%. We also use i-vectors for age classification, which to our knowledge is the first attempt to do so. Our best results reach an unweighted accuracy of 62.9%.
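
A minimal sketch of the fusion idea: train one age regressor on i-vectors and one on acoustic features, then combine their predictions with a weighted average. The synthetic data, the choice of SVR, and the fixed fusion weight are placeholders for illustration, not the setup learned in the paper.

```python
# Sketch: score-level fusion of two age regressors, one trained on
# i-vectors and one on acoustic features.
import numpy as np
from sklearn.svm import SVR
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 300
age = rng.uniform(18, 70, n)
ivectors = rng.normal(size=(n, 50)) + age[:, None] * 0.02   # toy "i-vectors"
acoustic = rng.normal(size=(n, 10)) + age[:, None] * 0.05   # toy acoustic features

tr, te = slice(0, 200), slice(200, n)
reg_iv = SVR().fit(ivectors[tr], age[tr])
reg_ac = SVR().fit(acoustic[tr], age[tr])

w = 0.6  # fusion weight (placeholder)
fused = w * reg_iv.predict(ivectors[te]) + (1 - w) * reg_ac.predict(acoustic[te])
print(f"fused MAE: {mean_absolute_error(age[te], fused):.2f} years")
```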

S. Kacprzak and M. Ziółko, “Multilanguage modeling of phones,” in Krajowa konferencja zastosowań matematyki w biologii, 2016.

Wavelet analysis was used for speech segmentation and parameterization. The obtained frequency parameters of phones were grouped using a Gaussian Mixture Model and hierarchical clustering. The relationship between the number of clusters and the maximum distance between their centers was approximated by the sum of two exponential functions. The research was conducted for 245 of the world’s languages. In this way, four parameters were obtained for each language. A comparison of these parameters makes it possible to search for acoustic similarities between the phones of different languages.

S. Kacprzak, M. Mąsior, and M. Ziółko, “Automatyczna ekstrakcja i klasteryzacja głosek w sygnale mowy dla wielojęzykowej analizy porównawczej,” Prace Filologiczne, no. LXVI, pp. 073–083, 2015.

M. Igras, S. Kacprzak, M. Mąsior, and M. Ziółko, “The acoustic diversity in the phoneme inventories of the world’s languages,” Theoria et Historia Scientiarum, vol. 11, pp. 117–128, 2014.

M. Mąsior and S. Kacprzak, “Data analysis and management engine for signal processing,” Studia Informatica, vol. 35, no. 2, 2014.

M. Mąsior, M. Igras, M. Ziółko, and S. Kacprzak, “Database of speech recordings for comparative analysis of multi-language phonems,” Studia Informatica, vol. 34, no. 2B, pp. 79–87, 2013.

S. Kacprzak and M. Ziółko, “Speech/music discrimination via energy density analysis,” in Statistical language and speech processing: First international conference, SLSP 2013, Tarragona, Spain, July 29–31, 2013, proceedings, 2013, pp. 135–142.

S. Kacprzak, M. Ziółko, M. Mąsior, M. Igras, and K. Ruszkiewicz, “Statistical analysis of phonemic diversity in languages across the world,” in Proceedings of the XIX national conference applications of mathematics in biology and medicine, 2013, pp. 16–20.

The results of an investigation of the differences among the phonemes of 574 languages from all over the world are presented. We attempt to verify the hypothesis of an African origin of all languages and of gradual language diversification in other parts of the globe. The obtained results justify classifying languages by applying methods used in evolutionary genetics.

S. Kacprzak, “Inteligentne metody rozpoznawania dźwięku,” PhD thesis, Politechnika Łódzka, 2010.

This thesis is devoted to artificial intelligence methods for sound recognition, using a system for the recognition of isolated words as an example. The main objective of the work is the development of an isolated word recognition system.