TY - JOUR
AU - Boeddeker, Christoph
AU - Subramanian, Aswin Shanmugam
AU - Wichern, Gordon
AU - Haeb-Umbach, Reinhold
AU - Le Roux, Jonathan
ID - 52958
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
KW - Electrical and Electronic Engineering
KW - Acoustics and Ultrasonics
KW - Computer Science (miscellaneous)
KW - Computational Mathematics
SN - 2329-9290
TI - TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings
VL - 32
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 48269
T2 - European Signal Processing Conference (EUSIPCO)
TI - On the Integration of Sampling Rate Synchronization and Acoustic Beamforming
ER -
TY - CONF
AU - Cord-Landwehr, Tobias
AU - Boeddeker, Christoph
AU - Zorilă, Cătălin
AU - Doddipatla, Rama
AU - Haeb-Umbach, Reinhold
ID - 47128
T2 - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Frame-Wise and Overlap-Robust Speaker Embeddings for Meeting Diarization
ER -
TY - CONF
AU - Schmalenstroeer, Joerg
AU - Gburrek, Tobias
AU - Haeb-Umbach, Reinhold
ID - 48270
T2 - ITG Conference on Speech Communication
TI - LibriWASN: A Data Set for Meeting Separation, Diarization, and Recognition with Asynchronous Recording Devices
ER -
TY - CONF
AU - Cord-Landwehr, Tobias
AU - Boeddeker, Christoph
AU - Zorilă, Cătălin
AU - Doddipatla, Rama
AU - Haeb-Umbach, Reinhold
ID - 47129
T2 - INTERSPEECH 2023
TI - A Teacher-Student Approach for Extracting Informative Speaker Embeddings From Speech Mixtures
ER -
TY - CONF
AB - Unsupervised speech disentanglement aims at separating fast varying from slowly varying components of a speech signal. In this contribution, we take a closer look at the embedding vector representing the slowly varying signal components, commonly named the speaker embedding vector. We ask which properties of a speaker's voice are captured and investigate to what extent individual embedding vector components are responsible for them, using the concept of Shapley values. Our findings show that certain speaker-specific acoustic-phonetic properties can be fairly well predicted from the speaker embedding, while the more abstract voice quality features investigated here cannot.
AU - Rautenberg, Frederik
AU - Kuhlmann, Michael
AU - Wiechmann, Jana
AU - Seebauer, Fritz
AU - Wagner, Petra
AU - Haeb-Umbach, Reinhold
ID - 48355
T2 - ITG Conference on Speech Communication
TI - On Feature Importance and Interpretability of Speaker Representations
ER -
TY - CONF
AU - Wiechmann, Jana
AU - Rautenberg, Frederik
AU - Wagner, Petra
AU - Haeb-Umbach, Reinhold
ID - 48410
T2 - 20th International Congress of the Phonetic Sciences (ICPhS)
TI - Explaining voice characteristics to novice voice practitioners - How successful is it?
ER -
TY - CONF
AU - Aralikatti, Rohith
AU - Boeddeker, Christoph
AU - Wichern, Gordon
AU - Subramanian, Aswin
AU - Le Roux, Jonathan
ID - 48391
T2 - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Reverberation as Supervision For Speech Separation
ER -
TY - CONF
AU - Berger, Simon
AU - Vieting, Peter
AU - Boeddeker, Christoph
AU - Schlüter, Ralf
AU - Haeb-Umbach, Reinhold
ID - 48390
T2 - INTERSPEECH 2023
TI - Mixture Encoder for Joint Speech Separation and Recognition
ER -
TY - CONF
AU - Seebauer, Fritz
AU - Kuhlmann, Michael
AU - Haeb-Umbach, Reinhold
AU - Wagner, Petra
ID - 46069
T2 - 12th Speech Synthesis Workshop (SSW) 2023
TI - Re-examining the quality dimensions of synthetic speech
ER -
TY - JOUR
AB - Continuous Speech Separation (CSS) has been proposed to address speech overlaps during the analysis of realistic meeting-like conversations by eliminating any overlaps before further processing. CSS separates a recording of arbitrarily many speakers into a small number of overlap-free output channels, where each output channel may contain speech of multiple speakers. This is often done by applying a conventional separation model trained with Utterance-level Permutation Invariant Training (uPIT), which exclusively maps a speaker to an output channel, in a sliding window approach called stitching. Recently, we introduced an alternative training scheme called Graph-PIT that teaches the separation network to directly produce output streams in the required format without stitching. It can handle an arbitrary number of speakers as long as no more of them overlap at any point in time than the separator has output channels. In this contribution, we further investigate the Graph-PIT training scheme. We show in extended experiments that models trained with Graph-PIT also work in challenging reverberant conditions. Models trained in this way are able to perform segment-less CSS, i.e., without stitching, and achieve comparable and often better separation quality than the conventional CSS with uPIT and stitching. We simplify the training schedule for Graph-PIT with the recently proposed Source Aggregated Signal-to-Distortion Ratio (SA-SDR) loss. It eliminates unfavorable properties of the previously used A-SDR loss and thus enables training with Graph-PIT from scratch. Graph-PIT training relaxes the constraints w.r.t. the allowed numbers of speakers and speaking patterns, which allows using a larger variety of training data. Furthermore, we introduce novel signal-level evaluation metrics for meeting scenarios, namely the source-aggregated scale- and convolution-invariant Signal-to-Distortion Ratio (SA-SI-SDR and SA-CI-SDR), which are generalizations of the commonly used SDR-based metrics for the CSS case.
AU - von Neumann, Thilo
AU - Kinoshita, Keisuke
AU - Boeddeker, Christoph
AU - Delcroix, Marc
AU - Haeb-Umbach, Reinhold
ID - 35602
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
KW - Continuous Speech Separation
KW - Source Separation
KW - Graph-PIT
KW - Dynamic Programming
KW - Permutation Invariant Training
SN - 2329-9290
TI - Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria
VL - 31
ER -
TY - CONF
AB - We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription.
We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor under the constraint that a reference utterance must be matched consistently with one hypothesis output. This also results in an efficient implementation of the ORC WER, which previously suffered from exponential complexity. We give an overview of commonly used WER definitions for multi-speaker scenarios and show that they are specializations of the above MIMO WER tuned to particular application scenarios. We conclude with a discussion of the pros and cons of the various WER definitions and a recommendation on when to use which.
AU - von Neumann, Thilo
AU - Boeddeker, Christoph
AU - Kinoshita, Keisuke
AU - Delcroix, Marc
AU - Haeb-Umbach, Reinhold
ID - 48281
KW - Word Error Rate
KW - Meeting Recognition
KW - Levenshtein Distance
T2 - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems
ER -
TY - CONF
AB - MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER, alongside other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that words are only identified as correct when the temporal alignment is plausible. This leads to a matching of the hypothesis string to the reference string that more closely resembles the actual transcription quality, and a system is penalized if it provides poor time annotations. Since word-level timing information is often not available, we present a way to approximate exact word-level timings from segment-level timings (e.g., a sentence) and show that the approximation leads to a WER similar to that of a matching with exact word-level annotations. At the same time, the time constraint leads to a speedup of the matching algorithm, which outweighs the additional overhead caused by processing the time stamps.
AU - von Neumann, Thilo
AU - Boeddeker, Christoph
AU - Delcroix, Marc
AU - Haeb-Umbach, Reinhold
ID - 48275
KW - Speech Recognition
KW - Word Error Rate
KW - Meeting Transcription
T2 - Proc. CHiME 2023 Workshop on Speech Processing in Everyday Environments
TI - MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems
ER -
TY - CONF
AB - We propose a diarization system that estimates "who spoke when" based on spatial information, to be used as the front-end of a meeting transcription system running on the signals gathered from an acoustic sensor network (ASN). Although the spatial distribution of the microphones is advantageous, exploiting the spatial diversity for diarization and signal enhancement is challenging, because the microphones' positions are typically unknown and the recorded signals are in general initially unsynchronized. Here, we approach these issues by first blindly synchronizing the signals and then estimating time differences of arrival (TDOAs). The TDOA information is exploited to estimate the speakers' activity, even in the presence of multiple simultaneously active speakers. This speaker activity information serves as a guide for a spatial mixture model, on the basis of which the individual speakers' signals are extracted via beamforming. Finally, the extracted signals are forwarded to a speech recognizer.
Additionally, a novel initialization scheme for spatial mixture models based on the TDOA estimates is proposed. Experiments conducted on real recordings from the LibriWASN data set show that the proposed system is advantageous compared to a system using a spatial mixture model that does not make use of external diarization information.
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 49109
KW - Diarization
KW - time difference of arrival
KW - ad-hoc acoustic sensor network
KW - meeting transcription
T2 - Proc. Asilomar Conference on Signals, Systems, and Computers
TI - Spatial Diarization for Meeting Transcription with Ad-Hoc Acoustic Sensor Networks
ER -
TY - CONF
AB - Due to the high variation in the application requirements of sound event detection (SED) systems, it is not sufficient to evaluate systems only in a single operating mode. Therefore, the community recently adopted the polyphonic sound detection score (PSDS) as an evaluation metric, which is the normalized area under the PSD receiver operating characteristic (PSD-ROC). It summarizes the system performance over a range of operating modes resulting from varying the decision threshold that is used to translate the system output scores into a binary detection output. Hence, it provides a more complete picture of the overall system behavior and is less biased by specific threshold tuning. However, besides the decision threshold, the post-processing can also be changed to enter another operating mode. In this paper we propose the post-processing independent PSDS (piPSDS) as a generalization of the PSDS. Here, the post-processing independent PSD-ROC includes operating points from varying post-processings with varying decision thresholds. Thus, it summarizes even more operating modes of an SED system and allows for system comparison without the need to implement a post-processing and without a bias due to different post-processings. While piPSDS can in principle combine different types of post-processing, we here, as a first step, present median filter independent PSDS (miPSDS) results for this year's DCASE Challenge Task4a systems. Source code is publicly available in our sed_scores_eval package (https://github.com/fgnt/sed_scores_eval).
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
AU - Serizel, Romain
ID - 49111
T2 - Proceedings of the 8th Detection and Classification of Acoustic Scenes and Events 2023 Workshop (DCASE2023)
TI - Post-Processing Independent Evaluation of Sound Event Detection Systems
ER -
TY - CONF
AU - Rautenberg, Frederik
AU - Kuhlmann, Michael
AU - Ebbers, Janek
AU - Wiechmann, Jana
AU - Seebauer, Fritz
AU - Wagner, Petra
AU - Haeb-Umbach, Reinhold
ID - 44849
T2 - Fortschritte der Akustik - DAGA 2023
TI - Speech Disentanglement for Analysis and Modification of Acoustic and Perceptual Speaker Characteristics
ER -
TY - JOUR
AB - Far-field multi-speaker automatic speech recognition (ASR) has drawn increasing attention in recent years. Most existing methods feature a signal processing frontend and an ASR backend. In realistic scenarios, these modules are usually trained separately or progressively, which suffers from either inter-module mismatch or a complicated training process. In this paper, we propose an end-to-end multi-channel model that jointly optimizes the speech enhancement (including speech dereverberation, denoising, and separation) frontend and the ASR backend as a single system.
To the best of our knowledge, this is the first work that proposes to optimize dereverberation, beamforming, and multi-speaker ASR in a fully end-to-end manner. The frontend module consists of a weighted prediction error (WPE) based submodule for dereverberation and a neural beamformer for denoising and speech separation. For the backend, we adopt a widely used end-to-end (E2E) ASR architecture. It is worth noting that the entire model is differentiable and can be optimized in a fully end-to-end manner using only the ASR criterion, without the need for parallel signal-level labels. We evaluate the proposed model on several multi-speaker benchmark datasets, and experimental results show that the fully E2E ASR model can achieve competitive performance in both noisy and reverberant conditions, with over 30% relative word error rate (WER) reduction over the single-channel baseline systems.
AU - Zhang, Wangyou
AU - Chang, Xuankai
AU - Boeddeker, Christoph
AU - Nakatani, Tomohiro
AU - Watanabe, Shinji
AU - Qian, Yanmin
ID - 33669
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
SN - 2329-9290 (print), 2329-9304 (electronic)
TI - End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party
ER -
TY - CONF
AU - Boeddeker, Christoph
AU - Cord-Landwehr, Tobias
AU - von Neumann, Thilo
AU - Haeb-Umbach, Reinhold
ID - 33954
T2 - Interspeech 2022
TI - An Initialization Scheme for Meeting Separation with Spatial Mixture Models
ER -
TY - CONF
AB - The intelligibility of demodulated audio signals from analog high frequency transmissions, e.g., using single-sideband (SSB) modulation, can be severely degraded by channel distortions and/or a mismatch between modulation and demodulation carrier frequency. In this work, a neural network (NN)-based approach for carrier frequency offset (CFO) estimation from demodulated SSB signals is proposed, and a task-specific architecture is presented. Additionally, a simulation framework for SSB signals is introduced and utilized for training the NNs. The CFO estimator is combined with a speech enhancement network to investigate its influence on the enhancement performance. The NN-based system is compared to a recently proposed pitch tracking based approach on publicly available data from real high frequency transmissions. Experiments show that the NN exhibits good CFO estimation properties and results in significant improvements in speech intelligibility, especially when combined with a noise reduction network.
AU - Heitkämper, Jens
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 33471
T2 - Proceedings of the 30th European Signal Processing Conference (EUSIPCO)
TI - Neural Network Based Carrier Frequency Offset Estimation From Speech Transmitted Over High Frequency Channels
ER -
TY - CONF
AU - Afifi, Haitham
AU - Karl, Holger
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
ID - 33806
T2 - 2022 International Wireless Communications and Mobile Computing (IWCMC)
TI - Data-driven Time Synchronization in Wireless Multimedia Networks
ER -
TY - CONF
AB - Recent speaker diarization studies showed that the integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs segment-level local diarization based on an EEND module, and merges the segment-level results via clustering to form a final global diarization result.
The segmentation is done to limit the number of speakers in each segment, since the current EEND cannot handle a large number of speakers. In this paper, we argue that such an approach involving segmentation has several issues; for example, it inevitably faces the dilemma that larger segment sizes increase both the context available for enhancing the performance and the number of speakers for the local EEND module to handle. To resolve this problem, this paper proposes a novel framework that performs diarization without segmentation, yet can still handle challenging data containing many speakers and a significant amount of overlapping speech. The proposed method can take an entire meeting for inference and perform utterance-by-utterance diarization that clusters utterance activities in terms of speakers. To this end, we leverage a neural network training scheme called Graph-PIT, recently proposed for neural source separation. Experiments with simulated active-meeting-like data and CALLHOME data show the superiority of the proposed approach over the conventional methods.
AU - Kinoshita, Keisuke
AU - von Neumann, Thilo
AU - Delcroix, Marc
AU - Boeddeker, Christoph
AU - Haeb-Umbach, Reinhold
ID - 33958
T2 - Proc. Interspeech 2022
TI - Utterance-by-utterance overlap-aware neural diarization with Graph-PIT
ER -
TY - CONF
AU - von Neumann, Thilo
AU - Kinoshita, Keisuke
AU - Boeddeker, Christoph
AU - Delcroix, Marc
AU - Haeb-Umbach, Reinhold
ID - 33819
T2 - ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - SA-SDR: A Novel Loss Function for Separation of Meeting Style Data
ER -
TY - CONF
AB - The scope of speech enhancement has changed from a monolithic view of single, independent tasks to a joint processing of complex conversational speech recordings. Training and evaluation of these single tasks requires synthetic data with access to intermediate signals that is as close as possible to the evaluation scenario. As such data is often not available, many works instead use specialized databases for the training of each system component, e.g., WSJ0-mix for source separation. We present a Multi-purpose Multi-Speaker Mixture Signal Generator (MMS-MSG) for generating a variety of speech mixture signals based on any speech corpus, ranging from classical anechoic mixtures (e.g., WSJ0-mix) over reverberant mixtures (e.g., SMS-WSJ) to meeting-style data. Its highly modular and flexible structure allows for the simulation of diverse environments and dynamic mixing, while simultaneously enabling an easy extension and modification to generate new scenarios and mixture types. The generated meetings can be used for prototyping, evaluation, or training purposes. We provide example evaluation data and baseline results for meetings based on the WSJ corpus. Further, we demonstrate the usefulness for realistic scenarios by using MMS-MSG to provide training data for the LibriCSS database.
AU - Cord-Landwehr, Tobias
AU - von Neumann, Thilo
AU - Boeddeker, Christoph
AU - Haeb-Umbach, Reinhold
ID - 33847
T2 - 2022 International Workshop on Acoustic Signal Enhancement (IWAENC)
TI - MMS-MSG: A Multi-purpose Multi-Speaker Mixture Signal Generator
ER -
TY - CONF
AB - Impressive progress in neural network-based single-channel speech source separation has been made in recent years. However, those improvements have mostly been reported on anechoic data, a situation that is hardly met in practice.
Taking the SepFormer, which achieves state-of-the-art performance on anechoic mixtures, as a starting point, we gradually modify it to optimize its performance on reverberant mixtures. Although this leads to a word error rate improvement of 7 percentage points compared to the standard SepFormer implementation, the system ends up with only marginally better performance than a PIT-BLSTM separation system that is optimized with rather straightforward means. This is surprising and at the same time sobering, challenging the practical usefulness of many improvements reported in recent years for monaural source separation on nonreverberant data.
AU - Cord-Landwehr, Tobias
AU - Boeddeker, Christoph
AU - von Neumann, Thilo
AU - Zorila, Catalin
AU - Doddipatla, Rama
AU - Haeb-Umbach, Reinhold
ID - 33848
T2 - 2022 International Workshop on Acoustic Signal Enhancement (IWAENC)
TI - Monaural source separation: From anechoic to reverberant environments
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 33807
T2 - ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - On Synchronization of Wireless Acoustic Sensor Networks in the Presence of Time-Varying Sampling Rate Offsets and Speaker Changes
ER -
TY - JOUR
AB - We present an approach to automatically generate semantic labels for real recordings of automotive range-Doppler (RD) radar spectra. Such labels are required when training a neural network for object recognition from radar data. The automatic labeling approach rests on the simultaneous recording of camera and lidar data in addition to the radar spectrum. By warping radar spectra into the camera image, state-of-the-art object recognition algorithms can be applied to label relevant objects, such as cars, in the camera image. The warping operation is designed to be fully differentiable, which allows backpropagating the gradient computed on the camera image through the warping operation to the neural network operating on the radar data. As the warping operation relies on accurate scene flow estimation, we further propose a novel scene flow estimation algorithm which exploits information from camera, lidar and radar sensors. The proposed scene flow estimation approach is compared against a state-of-the-art scene flow algorithm, which it outperforms by approximately 30% w.r.t. mean average error. The feasibility of the overall framework for automatic label generation for RD spectra is verified by evaluating the performance of neural networks trained with the proposed framework for Direction-of-Arrival estimation.
AU - Grimm, Christopher
AU - Fei, Tai
AU - Warsitz, Ernst
AU - Farhoud, Ridha
AU - Breddermann, Tobias
AU - Haeb-Umbach, Reinhold
ID - 33451
IS - 9
JF - IEEE Transactions on Vehicular Technology
TI - Warping of Radar Data Into Camera Image for Cross-Modal Supervision in Automotive Applications
VL - 71
ER -
TY - GEN
AB - In this report we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2022 Challenge Task 4: Sound Event Detection in Domestic Environments. As in previous editions of the Challenge, we use forward-backward convolutional recurrent neural networks (FBCRNNs) [1, 2] for weakly labeled and semi-supervised sound event detection (SED) and eventually generate strong pseudo labels for weakly labeled and unlabeled data.
Then, (tag-conditioned) bidirectional CRNNs (Bi-CRNNs) [1, 2] are trained in a strongly supervised manner as our final SED models. In each of the training stages, we use multiple iterations of self-training. Compared to previous editions, we improved our system performance by 1) some tweaks regarding data augmentation, pseudo labeling and inference, 2) using weakly labeled AudioSet data [3] for pretraining larger networks, and 3) augmenting the DESED data [4] with strongly labeled AudioSet data [5] for fine-tuning of the networks. Source code is publicly available at https://github.com/fgnt/pb_sed.
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
ID - 49113
TI - Pre-Training And Self-Training For Sound Event Detection In Domestic Environments
ER -
TY - CONF
AU - Wiechmann, Jana
AU - Glarner, Thomas
AU - Rautenberg, Frederik
AU - Wagner, Petra
AU - Haeb-Umbach, Reinhold
ID - 33696
T2 - 18. Phonetik und Phonologie im deutschsprachigen Raum (P&P)
TI - Technically enabled explaining of voice characteristics
ER -
TY - CONF
AU - Kuhlmann, Michael
AU - Seebauer, Fritz
AU - Ebbers, Janek
AU - Wagner, Petra
AU - Haeb-Umbach, Reinhold
ID - 33857
T2 - Interspeech 2022
TI - Investigation into Target Speaking Rate Adaptation for Voice Conversion
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Heitkaemper, Jens
AU - Haeb-Umbach, Reinhold
ID - 33808
T2 - 2022 International Workshop on Acoustic Signal Enhancement (IWAENC)
TI - Informed vs. Blind Beamforming in Ad-Hoc Acoustic Sensor Networks for Meeting Transcription
ER -
TY - GEN
AU - Gburrek, Tobias
AU - Boeddeker, Christoph
AU - von Neumann, Thilo
AU - Cord-Landwehr, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 33816
TI - A Meeting Transcription System for an Ad-Hoc Acoustic Sensor Network
ER -
TY - CONF
AB - Performing an adequate evaluation of sound event detection (SED) systems is far from trivial and is still subject to ongoing research. The recently proposed polyphonic sound detection (PSD) receiver operating characteristic (ROC) and PSD score (PSDS) take an important step toward an evaluation of SED systems that is independent of a specific decision threshold. This provides a more complete picture of the overall system behavior, which is less biased by threshold tuning. Yet, the PSD-ROC is currently only approximated using a finite set of thresholds. The choice of the thresholds used in the approximation, however, can have a severe impact on the resulting PSDS. In this paper we propose a method which allows for computing system performance on an evaluation set for all possible thresholds jointly, enabling accurate computation not only of the PSD-ROC and PSDS but also of other collar-based and intersection-based performance curves. It further allows selecting the threshold which best fulfills the requirements of a given application. Source code is publicly available in our SED evaluation package sed_scores_eval.
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
AU - Serizel, Romain
ID - 34072
T2 - Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Threshold Independent Evaluation of Sound Event Detection Scores
ER -
TY - JOUR
AB - The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase of attention in science and industry, which caused or was caused by an equally significant improvement in recognition accuracy.
Meanwhile, it has entered the consumer market, with digital home assistants featuring a spoken language interface being its most prominent application. Speech recorded at a distance is affected by various acoustic distortions and, consequently, quite different processing pipelines have emerged compared to ASR for close-talk speech. A signal enhancement front-end for dereverberation, source separation and acoustic beamforming is employed to clean up the speech, and the back-end ASR engine is robustified by multi-condition training and adaptation. We also describe the so-called end-to-end approach to ASR, a promising new architecture that has recently been extended to the far-field scenario. This tutorial article gives an account of the algorithms used to enable accurate speech recognition from a distance, and it will be seen that, although deep learning has a significant share in the technological breakthroughs, a clever combination with traditional signal processing can lead to surprisingly effective solutions.
AU - Haeb-Umbach, Reinhold
AU - Heymann, Jahn
AU - Drude, Lukas
AU - Watanabe, Shinji
AU - Delcroix, Marc
AU - Nakatani, Tomohiro
ID - 21065
IS - 2
JF - Proceedings of the IEEE
TI - Far-Field Automatic Speech Recognition
VL - 109
ER -
TY - CONF
AU - Zhang, Wangyou
AU - Boeddeker, Christoph
AU - Watanabe, Shinji
AU - Nakatani, Tomohiro
AU - Delcroix, Marc
AU - Kinoshita, Keisuke
AU - Ochiai, Tsubasa
AU - Kamo, Naoyuki
AU - Haeb-Umbach, Reinhold
AU - Qian, Yanmin
ID - 28256
T2 - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend
ER -
TY - CONF
AU - Li, Chenda
AU - Shi, Jing
AU - Zhang, Wangyou
AU - Subramanian, Aswin Shanmugam
AU - Chang, Xuankai
AU - Kamo, Naoyuki
AU - Hira, Moto
AU - Hayashi, Tomoki
AU - Boeddeker, Christoph
AU - Chen, Zhuo
AU - Watanabe, Shinji
ID - 28262
T2 - 2021 IEEE Spoken Language Technology Workshop (SLT)
TI - ESPnet-SE: End-To-End Speech Enhancement and Separation Toolkit Designed for ASR Integration
ER -
TY - CONF
AU - Li, Chenda
AU - Luo, Yi
AU - Han, Cong
AU - Li, Jinyu
AU - Yoshioka, Takuya
AU - Zhou, Tianyan
AU - Delcroix, Marc
AU - Kinoshita, Keisuke
AU - Boeddeker, Christoph
AU - Qian, Yanmin
AU - Watanabe, Shinji
AU - Chen, Zhuo
ID - 28261
T2 - 2021 IEEE Spoken Language Technology Workshop (SLT)
TI - Dual-Path RNN for Long Recording Speech Separation
ER -
TY - CONF
AU - Heitkaemper, Jens
AU - Schmalenstroeer, Joerg
AU - Ion, Valentin
AU - Haeb-Umbach, Reinhold
ID - 24000
T2 - Speech Communication; 14th ITG-Symposium
TI - A Database for Research on Detection and Enhancement of Speech Transmitted over HF links
ER -
TY - CONF
AB - Unsupervised blind source separation methods do not require a training phase and thus cannot suffer from a train-test mismatch, which is a common concern in neural network based source separation. The unsupervised techniques can be categorized into two classes: those building upon the sparsity of speech in the Short-Time Fourier transform domain, and those exploiting non-Gaussianity or non-stationarity of the source signals. In this contribution, spatial mixture models, which fall into the first category, and independent vector analysis (IVA), as a representative of the second category, are compared w.r.t. their separation performance and the performance of a downstream speech recognizer on a reverberant dataset of reasonable size.
Furthermore, we introduce a serial concatenation of the two, where the result of the mixture model serves as the initialization of IVA. This combination achieves significantly better WER performance than either algorithm individually and even approaches the performance of a much more complex neural network based technique.
AU - Boeddeker, Christoph
AU - Rautenberg, Frederik
AU - Haeb-Umbach, Reinhold
ID - 44843
T2 - ITG Conference on Speech Communication
TI - A Comparison and Combination of Unsupervised Blind Source Separation Techniques
ER -
TY - CONF
AU - Boeddeker, Christoph
AU - Zhang, Wangyou
AU - Nakatani, Tomohiro
AU - Kinoshita, Keisuke
AU - Ochiai, Tsubasa
AU - Delcroix, Marc
AU - Kamo, Naoyuki
AU - Qian, Yanmin
AU - Haeb-Umbach, Reinhold
ID - 28259
T2 - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation
ER -
TY - CONF
AU - Schmalenstroeer, Joerg
AU - Heitkaemper, Jens
AU - Ullmann, Joerg
AU - Haeb-Umbach, Reinhold
ID - 23998
T2 - 29th European Signal Processing Conference (EUSIPCO)
TI - Open Range Pitch Tracking for Carrier Frequency Difference Estimation from HF Transmitted Speech
ER -
TY - JOUR
AB - Due to the ad hoc nature of wireless acoustic sensor networks, the position of the sensor nodes is typically unknown. This contribution proposes a technique to estimate the position and orientation of the sensor nodes from the recorded speech signals. The method assumes that a node comprises a microphone array with synchronously sampled microphones rather than a single microphone, but does not require the sampling clocks of the nodes to be synchronized. From the observed audio signals, the distances between the acoustic sources and arrays, as well as the directions of arrival, are estimated. They serve as input to a non-linear least squares problem, from which both the sensor nodes' positions and orientations, as well as the source positions, are alternatingly estimated in an iterative process. Given one set of unknowns, i.e., either the source positions or the sensor nodes' geometry, the other set of unknowns can be computed in closed form. The proposed approach is computationally efficient and is the first to employ both distance and directional information for geometry calibration in a common cost function. Since both distance and direction of arrival measurements suffer from outliers, e.g., caused by strong reflections of the sound waves off the surfaces of the room, we introduce measures to deemphasize or remove unreliable measurements. Additionally, we discuss modifications of our previously proposed deep neural network-based acoustic distance estimator to account not only for omnidirectional sources but also for directional sources. Simulation results show good positioning accuracy and compare very favorably with alternative approaches from the literature.
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 22528
JF - EURASIP Journal on Audio, Speech, and Music Processing
SN - 1687-4722
TI - Geometry calibration in wireless acoustic sensor networks utilizing DoA and distance information
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 23994
T2 - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Iterative Geometry Calibration from Distance Estimates for Wireless Acoustic Sensor Networks
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 23999
T2 - Speech Communication; 14th ITG-Symposium
TI - On Source-Microphone Distance Estimation Using Convolutional Recurrent Neural Networks
ER -
TY - CONF
AU - Chinaev, Aleksej
AU - Enzner, Gerald
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
ID - 23997
T2 - 29th European Signal Processing Conference (EUSIPCO)
TI - Online Estimation of Sampling Rate Offsets in Wireless Acoustic Sensor Networks with Packet Loss
ER -
TY - CONF
AB - In this work we address the disentanglement of style and content in speech signals. We propose a fully convolutional variational autoencoder employing two encoders: a content encoder and a style encoder. To foster disentanglement, we propose adversarial contrastive predictive coding. This new disentanglement method needs neither parallel data nor any supervision. We show that the proposed technique is capable of separating speaker and content traits into the two different representations and show competitive speaker-content disentanglement performance compared to other unsupervised approaches. We further demonstrate an increased robustness of the content representation against a train-test mismatch compared to spectral features, when used for phone recognition.
AU - Ebbers, Janek
AU - Kuhlmann, Michael
AU - Cord-Landwehr, Tobias
AU - Haeb-Umbach, Reinhold
ID - 29304
T2 - Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Contrastive Predictive Coding Supported Factorized Variational Autoencoder for Unsupervised Learning of Disentangled Speech Representations
ER -
TY - CONF
AB - Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and introduces the constraint that the total number of speakers must not exceed the number of output channels. When processing meeting-like data in a segment-wise manner, i.e., by separating overlapping segments independently and stitching adjacent segments to continuous output streams, this constraint has to be fulfilled for any segment. In this contribution, we show that this constraint can be significantly relaxed. We propose a novel graph-based PIT criterion, which casts the assignment of utterances to output channels as a graph coloring problem. It only requires that the number of concurrently active speakers must not exceed the number of output channels. As a consequence, the system can process an arbitrary number of speakers and arbitrarily long segments and thus can handle more diverse scenarios. Further, the stitching algorithm for obtaining a consistent output order in neighboring segments is of less importance and can even be eliminated completely, not least reducing the computational effort.
Experiments on meeting-style WSJ data show improvements in recognition performance over using the uPIT criterion.
AU - von Neumann, Thilo
AU - Kinoshita, Keisuke
AU - Boeddeker, Christoph
AU - Delcroix, Marc
AU - Haeb-Umbach, Reinhold
ID - 26770
KW - Continuous speech separation
KW - automatic speech recognition
KW - overlapped speech
KW - permutation invariant training
T2 - Interspeech 2021
TI - Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation of Arbitrary Numbers of Speakers
ER -
TY - CONF
AU - von Neumann, Thilo
AU - Boeddeker, Christoph
AU - Kinoshita, Keisuke
AU - Delcroix, Marc
AU - Haeb-Umbach, Reinhold
ID - 29173
T2 - Speech Communication; 14th ITG Conference
TI - Speeding Up Permutation Invariant Training for Source Separation
ER -
TY - CONF
AB - In this paper we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments, where it achieved the fourth rank. Our presented solution is an advancement of our system used in the previous edition of the task. We use a forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo labeling, followed by tag-conditioned sound event detection (SED) models which are trained using strong pseudo labels provided by the FBCRNN. Our advancement over our earlier model is threefold. First, we introduce a strong label loss in the objective of the FBCRNN to take advantage of the strongly labeled synthetic data during training. Second, we perform multiple iterations of self-training for both the FBCRNN and the tag-conditioned SED models. Third, while we used only tag-conditioned CNNs as our SED model in the previous edition, we here explore sophisticated tag-conditioned SED model architectures, namely bidirectional CRNNs and bidirectional convolutional transformer neural networks (CTNNs), and combine them. With metric- and class-specific tuning of median filter lengths for post-processing, our final SED model, consisting of 6 submodels (2 of each architecture), achieves polyphonic sound event detection scores (PSDS) of 0.455 for scenario 1 and 0.684 for scenario 2 on the public evaluation set, as well as a collar-based F1-score of 0.596, outperforming the baselines and our model from the previous edition by far. Source code is publicly available at https://github.com/fgnt/pb_sed.
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
ID - 29308
SN - 978-84-09-36072-7
T2 - Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)
TI - Self-Trained Audio Tagging and Sound Event Detection in Domestic Environments
ER -
TY - CONF
AB - Recently, there has been a rising interest in sound recognition via Acoustic Sensor Networks (ASNs) to support applications such as ambient assisted living or environmental habitat monitoring. With state-of-the-art sound recognition being dominated by deep-learning-based approaches, there is a high demand for labeled training data. Despite the availability of large-scale data sets such as Google's AudioSet, acquiring training data matching a certain application environment is still often a problem. In this paper we are concerned with human activity monitoring in a domestic environment using an ASN consisting of multiple nodes, each providing multichannel signals. We propose a self-training based domain adaptation approach, which only requires unlabeled data from the target environment.
Here, a sound recognition system trained on AudioSet, the teacher, generates pseudo labels for data from the target environment, on which a student network is trained. The student can furthermore glean information about the spatial arrangement of sensors and sound sources to further improve classification performance. It is shown that the student significantly improves recognition performance over the pre-trained teacher without relying on labeled data from the environment the system is deployed in.
AU - Ebbers, Janek
AU - Keyser, Moritz Curt
AU - Haeb-Umbach, Reinhold
ID - 29306
T2 - Proceedings of the 29th European Signal Processing Conference (EUSIPCO)
TI - Adapting Sound Recognition to A New Environment Via Self-Training
ER -
TY - JOUR
AB - One objective of current research in explainable intelligent systems is to implement social aspects in order to increase the relevance of explanations. In this paper, we argue that a novel conceptual framework is needed to overcome the shortcomings of existing AI systems, which pay little attention to processes of interaction and learning. Drawing from research in interaction and development, we first outline the novel conceptual framework that pushes the design of AI systems toward true interactivity, with an emphasis on the role of the partner and social relevance. We propose that AI systems will be able to provide a meaningful and relevant explanation only if the process of explaining is extended to the active contribution of both partners, which brings about dynamics that are modulated by different levels of analysis. Accordingly, our conceptual framework comprises monitoring and scaffolding as key concepts and claims that the process of explaining is not only modulated by the interaction between explainee and explainer but is embedded into a larger social context in which conventionalized and routinized behaviors are established. We discuss our conceptual framework in relation to the established objectives of transparency and autonomy that are currently raised for the design of explainable AI systems.
AU - Rohlfing, Katharina J.
AU - Cimiano, Philipp
AU - Scharlau, Ingrid
AU - Matzner, Tobias
AU - Buhl, Heike M.
AU - Buschmeier, Hendrik
AU - Esposito, Elena
AU - Grimminger, Angela
AU - Hammer, Barbara
AU - Haeb-Umbach, Reinhold
AU - Horwath, Ilona
AU - Hüllermeier, Eyke
AU - Kern, Friederike
AU - Kopp, Stefan
AU - Thommes, Kirsten
AU - Ngonga Ngomo, Axel-Cyrille
AU - Schulte, Carsten
AU - Wachsmuth, Henning
AU - Wagner, Petra
AU - Wrede, Britta
ID - 24456
IS - 3
JF - IEEE Transactions on Cognitive and Developmental Systems
KW - Explainability
KW - process of explaining and understanding
KW - explainable artificial systems
SN - 2379-8920
TI - Explanation as a Social Practice: Toward a Conceptual Framework for the Social Design of AI Systems
VL - 13
ER -