TY - JOUR AU - Boeddeker, Christoph AU - Subramanian, Aswin Shanmugam AU - Wichern, Gordon AU - Haeb-Umbach, Reinhold AU - Le Roux, Jonathan ID - 52958 JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing KW - Electrical and Electronic Engineering KW - Acoustics and Ultrasonics KW - Computer Science (miscellaneous) KW - Computational Mathematics SN - 2329-9290 TI - TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings VL - 32 ER - TY - CONF AU - Gburrek, Tobias AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 48269 T2 - European Signal Processing Conference (EUSIPCO) TI - On the Integration of Sampling Rate Synchronization and Acoustic Beamforming ER - TY - CONF AU - Cord-Landwehr, Tobias AU - Boeddeker, Christoph AU - Zorilă, Cătălin AU - Doddipatla, Rama AU - Haeb-Umbach, Reinhold ID - 47128 T2 - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - Frame-Wise and Overlap-Robust Speaker Embeddings for Meeting Diarization ER - TY - CONF AU - Schmalenstroeer, Joerg AU - Gburrek, Tobias AU - Haeb-Umbach, Reinhold ID - 48270 T2 - ITG Conference on Speech Communication TI - LibriWASN: A Data Set for Meeting Separation, Diarization, and Recognition with Asynchronous Recording Devices ER - TY - CONF AU - Cord-Landwehr, Tobias AU - Boeddeker, Christoph AU - Zorilă, Cătălin AU - Doddipatla, Rama AU - Haeb-Umbach, Reinhold ID - 47129 T2 - INTERSPEECH 2023 TI - A Teacher-Student Approach for Extracting Informative Speaker Embeddings From Speech Mixtures ER - TY - CONF AB - Unsupervised speech disentanglement aims at separating fast varying from slowly varying components of a speech signal. In this contribution, we take a closer look at the embedding vector representing the slowly varying signal components, commonly named the speaker embedding vector. We ask which properties of a speaker's voice are captured and investigate, using the concept of Shapley values, to what extent individual embedding vector components are responsible for them. Our findings show that certain speaker-specific acoustic-phonetic properties can be fairly well predicted from the speaker embedding, while the more abstract voice quality features we investigated cannot. AU - Rautenberg, Frederik AU - Kuhlmann, Michael AU - Wiechmann, Jana AU - Seebauer, Fritz AU - Wagner, Petra AU - Haeb-Umbach, Reinhold ID - 48355 T2 - ITG Conference on Speech Communication TI - On Feature Importance and Interpretability of Speaker Representations ER - TY - CONF AU - Wiechmann, Jana AU - Rautenberg, Frederik AU - Wagner, Petra AU - Haeb-Umbach, Reinhold ID - 48410 T2 - 20th International Congress of the Phonetic Sciences (ICPhS) TI - Explaining voice characteristics to novice voice practitioners - How successful is it? ER - TY - CONF AU - Berger, Simon AU - Vieting, Peter AU - Boeddeker, Christoph AU - Schlüter, Ralf AU - Haeb-Umbach, Reinhold ID - 48390 T2 - INTERSPEECH 2023 TI - Mixture Encoder for Joint Speech Separation and Recognition ER - TY - CONF AU - Seebauer, Fritz AU - Kuhlmann, Michael AU - Haeb-Umbach, Reinhold AU - Wagner, Petra ID - 46069 T2 - 12th Speech Synthesis Workshop (SSW) 2023 TI - Re-examining the quality dimensions of synthetic speech ER - TY - JOUR AB - Continuous Speech Separation (CSS) has been proposed to address speech overlaps during the analysis of realistic meeting-like conversations by eliminating any overlaps before further processing.
CSS separates a recording of arbitrarily many speakers into a small number of overlap-free output channels, where each output channel may contain speech of multiple speakers. This is often done by applying a conventional separation model trained with Utterance-level Permutation Invariant Training (uPIT), which exclusively maps a speaker to an output channel, in a sliding-window approach called stitching. Recently, we introduced an alternative training scheme called Graph-PIT that teaches the separation network to directly produce output streams in the required format without stitching. It can handle an arbitrary number of speakers as long as the number of simultaneously overlapping speakers never exceeds the number of output channels of the separator. In this contribution, we further investigate the Graph-PIT training scheme. We show in extended experiments that models trained with Graph-PIT also work in challenging reverberant conditions. Models trained in this way are able to perform segment-less CSS, i.e., without stitching, and achieve separation quality comparable to, and often better than, conventional CSS with uPIT and stitching. We simplify the training schedule for Graph-PIT with the recently proposed Source Aggregated Signal-to-Distortion Ratio (SA-SDR) loss. It eliminates unfavorable properties of the previously used A-SDR loss and thus enables training with Graph-PIT from scratch. Graph-PIT training relaxes the constraints w.r.t. the allowed numbers of speakers and speaking patterns, which allows the use of a larger variety of training data. Furthermore, we introduce novel signal-level evaluation metrics for meeting scenarios, namely the source-aggregated scale- and convolution-invariant Signal-to-Distortion Ratio (SA-SI-SDR and SA-CI-SDR), which are generalizations of the commonly used SDR-based metrics for the CSS case. AU - von Neumann, Thilo AU - Kinoshita, Keisuke AU - Boeddeker, Christoph AU - Delcroix, Marc AU - Haeb-Umbach, Reinhold ID - 35602 JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing KW - Continuous Speech Separation KW - Source Separation KW - Graph-PIT KW - Dynamic Programming KW - Permutation Invariant Training SN - 2329-9290 TI - Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation Criteria VL - 31 ER - TY - CONF AB - We propose a general framework to compute the word error rate (WER) of ASR systems that process recordings containing multiple speakers at their input and that produce multiple output word sequences (MIMO). Such ASR systems are typically required, e.g., for meeting transcription. We provide an efficient implementation based on a dynamic programming search in a multi-dimensional Levenshtein distance tensor under the constraint that a reference utterance must be matched consistently with one hypothesis output. This also results in an efficient implementation of the ORC WER, which previously suffered from exponential complexity. We give an overview of commonly used WER definitions for multi-speaker scenarios and show that they are specializations of the above MIMO WER tuned to particular application scenarios. We conclude with a discussion of the pros and cons of the various WER definitions and a recommendation as to when to use which.
AU - von Neumann, Thilo AU - Boeddeker, Christoph AU - Kinoshita, Keisuke AU - Delcroix, Marc AU - Haeb-Umbach, Reinhold ID - 48281 KW - Word Error Rate KW - Meeting Recognition KW - Levenshtein Distance T2 - ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech Recognition Systems ER - TY - CONF AB - MeetEval is an open-source toolkit to evaluate all kinds of meeting transcription systems. It provides a unified interface for the computation of commonly used Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER, alongside other WER definitions. We extend the cpWER computation by a temporal constraint to ensure that words are identified as correct only when the temporal alignment is plausible. This leads to a matching of the hypothesis string to the reference string that more closely reflects the actual transcription quality, and a system is penalized if it provides poor time annotations. Since word-level timing information is often not available, we present a way to approximate exact word-level timings from segment-level timings (e.g., a sentence) and show that the approximation leads to a WER similar to that obtained with exact word-level annotations. At the same time, the time constraint leads to a speedup of the matching algorithm, which outweighs the additional overhead caused by processing the time stamps. AU - von Neumann, Thilo AU - Boeddeker, Christoph AU - Delcroix, Marc AU - Haeb-Umbach, Reinhold ID - 48275 KW - Speech Recognition KW - Word Error Rate KW - Meeting Transcription T2 - Proc. CHiME 2023 Workshop on Speech Processing in Everyday Environments TI - MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems ER - TY - CONF AB - We propose a diarization system that estimates “who spoke when” based on spatial information, to be used as a front-end of a meeting transcription system running on the signals gathered from an acoustic sensor network (ASN). Although the spatial distribution of the microphones is advantageous, exploiting the spatial diversity for diarization and signal enhancement is challenging, because the microphones’ positions are typically unknown, and the recorded signals are in general initially unsynchronized. Here, we approach these issues by first blindly synchronizing the signals and then estimating time differences of arrival (TDOAs). The TDOA information is exploited to estimate the speakers’ activity, even in the presence of multiple simultaneously active speakers. This speaker activity information serves as a guide for a spatial mixture model, on the basis of which the individual speakers’ signals are extracted via beamforming. Finally, the extracted signals are forwarded to a speech recognizer. Additionally, a novel initialization scheme for spatial mixture models based on the TDOA estimates is proposed. Experiments conducted on real recordings from the LibriWASN data set have shown that our proposed system is advantageous compared to a system using a spatial mixture model that does not make use of external diarization information. AU - Gburrek, Tobias AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 49109 KW - Diarization KW - time difference of arrival KW - ad-hoc acoustic sensor network KW - meeting transcription T2 - Proc.
Asilomar Conference on Signals, Systems, and Computers TI - Spatial Diarization for Meeting Transcription with Ad-Hoc Acoustic Sensor Networks ER - TY - CONF AU - Rautenberg, Frederik AU - Kuhlmann, Michael AU - Ebbers, Janek AU - Wiechmann, Jana AU - Seebauer, Fritz AU - Wagner, Petra AU - Haeb-Umbach, Reinhold ID - 44849 T2 - Fortschritte der Akustik - DAGA 2023 TI - Speech Disentanglement for Analysis and Modification of Acoustic and Perceptual Speaker Characteristics ER - TY - CONF AU - Boeddeker, Christoph AU - Cord-Landwehr, Tobias AU - von Neumann, Thilo AU - Haeb-Umbach, Reinhold ID - 33954 T2 - Interspeech 2022 TI - An Initialization Scheme for Meeting Separation with Spatial Mixture Models ER - TY - CONF AB - The intelligibility of demodulated audio signals from analog high frequency transmissions, e.g., using single-sideband (SSB) modulation, can be severely degraded by channel distortions and/or a mismatch between modulation and demodulation carrier frequency. In this work a neural network (NN)-based approach for carrier frequency offset (CFO) estimation from demodulated SSB signals is proposed, for which a task-specific architecture is presented. Additionally, a simulation framework for SSB signals is introduced and utilized for training the NNs. The CFO estimator is combined with a speech enhancement network to investigate its influence on the enhancement performance. The NN-based system is compared to a recently proposed pitch tracking based approach on publicly available data from real high frequency transmissions. Experiments show that the NN exhibits good CFO estimation properties and results in significant improvements in speech intelligibility, especially when combined with a noise reduction network. AU - Heitkämper, Jens AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 33471 T2 - Proceedings of the 30th European Signal Processing Conference (EUSIPCO) TI - Neural Network Based Carrier Frequency Offset Estimation From Speech Transmitted Over High Frequency Channels ER - TY - CONF AB - Recent speaker diarization studies showed that integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs segment-level local diarization based on an EEND module, and merges the segment-level results via clustering to form a final global diarization result. The segmentation is done to limit the number of speakers in each segment since the current EEND cannot handle a large number of speakers. In this paper, we argue that such an approach involving segmentation has several issues; for example, it inevitably faces the dilemma that larger segment sizes increase both the context available for enhancing the performance and the number of speakers for the local EEND module to handle. To resolve such a problem, this paper proposes a novel framework that performs diarization without segmentation. However, it can still handle challenging data containing many speakers and a significant amount of overlapping speech. The proposed method can take an entire meeting for inference and perform utterance-by-utterance diarization that clusters utterance activities in terms of speakers. To this end, we leverage a neural network training scheme called Graph-PIT, recently proposed for neural source separation.
Experiments with simulated active-meeting-like data and CALLHOME data show the superiority of the proposed approach over the conventional methods. AU - Kinoshita, Keisuke AU - von Neumann, Thilo AU - Delcroix, Marc AU - Boeddeker, Christoph AU - Haeb-Umbach, Reinhold ID - 33958 T2 - Proc. Interspeech 2022 TI - Utterance-by-utterance overlap-aware neural diarization with Graph-PIT ER - TY - CONF AU - von Neumann, Thilo AU - Kinoshita, Keisuke AU - Boeddeker, Christoph AU - Delcroix, Marc AU - Haeb-Umbach, Reinhold ID - 33819 T2 - ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - SA-SDR: A Novel Loss Function for Separation of Meeting Style Data ER - TY - CONF AB - The scope of speech enhancement has changed from a monolithic view of single, independent tasks, to a joint processing of complex conversational speech recordings. Training and evaluation of these single tasks requires synthetic data with access to intermediate signals that is as close as possible to the evaluation scenario. As such data often is not available, many works instead use specialized databases for the training of each system component, e.g WSJ0-mix for source separation. We present a Multi-purpose Multi-Speaker Mixture Signal Generator (MMS-MSG) for generating a variety of speech mixture signals based on any speech corpus, ranging from classical anechoic mixtures (e.g., WSJ0-mix) over reverberant mixtures (e.g., SMS-WSJ) to meeting-style data. Its highly modular and flexible structure allows for the simulation of diverse environments and dynamic mixing, while simultaneously enabling an easy extension and modification to generate new scenarios and mixture types. These meetings can be used for prototyping, evaluation, or training purposes. We provide example evaluation data and baseline results for meetings based on the WSJ corpus. Further, we demonstrate the usefulness for realistic scenarios by using MMS-MSG to provide training data for the LibriCSS database. AU - Cord-Landwehr, Tobias AU - von Neumann, Thilo AU - Boeddeker, Christoph AU - Haeb-Umbach, Reinhold ID - 33847 T2 - 2022 International Workshop on Acoustic Signal Enhancement (IWAENC) TI - MMS-MSG: A Multi-purpose Multi-Speaker Mixture Signal Generator ER - TY - CONF AB - Impressive progress in neural network-based single-channel speech source separation has been made in recent years. But those improvements have been mostly reported on anechoic data, a situation that is hardly met in practice. Taking the SepFormer as a starting point, which achieves state-of-the-art performance on anechoic mixtures, we gradually modify it to optimize its performance on reverberant mixtures. Although this leads to a word error rate improvement by 7 percentage points compared to the standard SepFormer implementation, the system ends up with only marginally better performance than a PIT-BLSTM separation system, that is optimized with rather straightforward means. This is surprising and at the same time sobering, challenging the practical usefulness of many improvements reported in recent years for monaural source separation on nonreverberant data. 
AU - Cord-Landwehr, Tobias AU - Boeddeker, Christoph AU - von Neumann, Thilo AU - Zorila, Catalin AU - Doddipatla, Rama AU - Haeb-Umbach, Reinhold ID - 33848 T2 - 2022 International Workshop on Acoustic Signal Enhancement (IWAENC) TI - Monaural source separation: From anechoic to reverberant environments ER - TY - CONF AU - Gburrek, Tobias AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 33807 T2 - ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - On Synchronization of Wireless Acoustic Sensor Networks in the Presence of Time-Varying Sampling Rate Offsets and Speaker Changes ER - TY - JOUR AB - We present an approach to automatically generate semantic labels for real recordings of automotive range-Doppler (RD) radar spectra. Such labels are required when training a neural network for object recognition from radar data. The automatic labeling approach rests on the simultaneous recording of camera and lidar data in addition to the radar spectrum. By warping radar spectra into the camera image, state-of-the-art object recognition algorithms can be applied to label relevant objects, such as cars, in the camera image. The warping operation is designed to be fully differentiable, which allows backpropagating the gradient computed on the camera image through the warping operation to the neural network operating on the radar data. As the warping operation relies on accurate scene flow estimation, we further propose a novel scene flow estimation algorithm which exploits information from camera, lidar and radar sensors. The proposed scene flow estimation approach is compared against a state-of-the-art scene flow algorithm, and it outperforms it by approximately 30% w.r.t. mean average error. The feasibility of the overall framework for automatic label generation for RD spectra is verified by evaluating the performance of neural networks trained with the proposed framework for Direction-of-Arrival estimation. AU - Grimm, Christopher AU - Fei, Tai AU - Warsitz, Ernst AU - Farhoud, Ridha AU - Breddermann, Tobias AU - Haeb-Umbach, Reinhold ID - 33451 IS - 9 JF - IEEE Transactions on Vehicular Technology TI - Warping of Radar Data Into Camera Image for Cross-Modal Supervision in Automotive Applications VL - 71 ER - TY - CONF AU - Wiechmann, Jana AU - Glarner, Thomas AU - Rautenberg, Frederik AU - Wagner, Petra AU - Haeb-Umbach, Reinhold ID - 33696 T2 - 18. Phonetik und Phonologie im deutschsprachigen Raum (P&P) TI - Technically enabled explaining of voice characteristics ER - TY - CONF AU - Kuhlmann, Michael AU - Seebauer, Fritz AU - Ebbers, Janek AU - Wagner, Petra AU - Haeb-Umbach, Reinhold ID - 33857 T2 - Interspeech 2022 TI - Investigation into Target Speaking Rate Adaptation for Voice Conversion ER - TY - CONF AU - Gburrek, Tobias AU - Schmalenstroeer, Joerg AU - Heitkaemper, Jens AU - Haeb-Umbach, Reinhold ID - 33808 T2 - 2022 International Workshop on Acoustic Signal Enhancement (IWAENC) TI - Informed vs. Blind Beamforming in Ad-Hoc Acoustic Sensor Networks for Meeting Transcription ER - TY - GEN AU - Gburrek, Tobias AU - Boeddeker, Christoph AU - von Neumann, Thilo AU - Cord-Landwehr, Tobias AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 33816 TI - A Meeting Transcription System for an Ad-Hoc Acoustic Sensor Network ER - TY - CONF AB - Performing an adequate evaluation of sound event detection (SED) systems is far from trivial and is still subject to ongoing research. 
The recently proposed polyphonic sound detection (PSD)-receiver operating characteristic (ROC) and PSD score (PSDS) make an important step toward an evaluation of SED systems that is independent of a specific decision threshold. This allows obtaining a more complete picture of the overall system behavior that is less biased by threshold tuning. Yet, the PSD-ROC is currently only approximated using a finite set of thresholds. The choice of the thresholds used in the approximation, however, can have a severe impact on the resulting PSDS. In this paper we propose a method which allows for computing system performance on an evaluation set for all possible thresholds jointly, enabling accurate computation not only of the PSD-ROC and PSDS but also of other collar-based and intersection-based performance curves. It further allows selecting the threshold that best fulfills the requirements of a given application. Source code is publicly available in our SED evaluation package sed_scores_eval. AU - Ebbers, Janek AU - Haeb-Umbach, Reinhold AU - Serizel, Romain ID - 34072 T2 - Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - Threshold Independent Evaluation of Sound Event Detection Scores ER - TY - JOUR AB - The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase in attention in science and industry, which caused or was caused by an equally significant improvement in recognition accuracy. Meanwhile, it has entered the consumer market, with digital home assistants with a spoken language interface being its most prominent application. Speech recorded at a distance is affected by various acoustic distortions and, consequently, quite different processing pipelines have emerged compared to ASR for close-talk speech. A signal enhancement front-end for dereverberation, source separation and acoustic beamforming is employed to clean up the speech, and the back-end ASR engine is robustified by multi-condition training and adaptation. We will also describe the so-called end-to-end approach to ASR, a promising new architecture that has recently been extended to the far-field scenario. This tutorial article gives an account of the algorithms used to enable accurate speech recognition from a distance, and it will be seen that, although deep learning has a significant share in the technological breakthroughs, a clever combination with traditional signal processing can lead to surprisingly effective solutions.
AU - Haeb-Umbach, Reinhold AU - Heymann, Jahn AU - Drude, Lukas AU - Watanabe, Shinji AU - Delcroix, Marc AU - Nakatani, Tomohiro ID - 21065 IS - 2 JF - Proceedings of the IEEE TI - Far-Field Automatic Speech Recognition VL - 109 ER - TY - CONF AU - Zhang, Wangyou AU - Boeddeker, Christoph AU - Watanabe, Shinji AU - Nakatani, Tomohiro AU - Delcroix, Marc AU - Kinoshita, Keisuke AU - Ochiai, Tsubasa AU - Kamo, Naoyuki AU - Haeb-Umbach, Reinhold AU - Qian, Yanmin ID - 28256 T2 - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - End-to-End Dereverberation, Beamforming, and Speech Recognition with Improved Numerical Stability and Advanced Frontend ER - TY - CONF AU - Heitkaemper, Jens AU - Schmalenstroeer, Joerg AU - Ion, Valentin AU - Haeb-Umbach, Reinhold ID - 24000 T2 - Speech Communication; 14th ITG-Symposium TI - A Database for Research on Detection and Enhancement of Speech Transmitted over HF links ER - TY - CONF AB - Unsupervised blind source separation methods do not require a training phase and thus cannot suffer from a train-test mismatch, which is a common concern in neural network based source separation. The unsupervised techniques can be categorized in two classes, those building upon the sparsity of speech in the Short-Time Fourier transform domain and those exploiting non-Gaussianity or non-stationarity of the source signals. In this contribution, spatial mixture models which fall in the first category and independent vector analysis (IVA) as a representative of the second category are compared w.r.t. their separation performance and the performance of a downstream speech recognizer on a reverberant dataset of reasonable size. Furthermore, we introduce a serial concatenation of the two, where the result of the mixture model serves as initialization of IVA, which achieves significantly better WER performance than each algorithm individually and even approaches the performance of a much more complex neural network based technique. AU - Boeddeker, Christoph AU - Rautenberg, Frederik AU - Haeb-Umbach, Reinhold ID - 44843 T2 - ITG Conference on Speech Communication TI - A Comparison and Combination of Unsupervised Blind Source Separation Techniques ER - TY - CONF AU - Boeddeker, Christoph AU - Zhang, Wangyou AU - Nakatani, Tomohiro AU - Kinoshita, Keisuke AU - Ochiai, Tsubasa AU - Delcroix, Marc AU - Kamo, Naoyuki AU - Qian, Yanmin AU - Haeb-Umbach, Reinhold ID - 28259 T2 - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - Convolutive Transfer Function Invariant SDR Training Criteria for Multi-Channel Reverberant Speech Separation ER - TY - CONF AU - Schmalenstroeer, Joerg AU - Heitkaemper, Jens AU - Ullmann, Joerg AU - Haeb-Umbach, Reinhold ID - 23998 T2 - 29th European Signal Processing Conference (EUSIPCO) TI - Open Range Pitch Tracking for Carrier Frequency Difference Estimation from HF Transmitted Speech ER - TY - JOUR AB - Due to the ad hoc nature of wireless acoustic sensor networks, the position of the sensor nodes is typically unknown. This contribution proposes a technique to estimate the position and orientation of the sensor nodes from the recorded speech signals. The method assumes that a node comprises a microphone array with synchronously sampled microphones rather than a single microphone, but does not require the sampling clocks of the nodes to be synchronized. 
From the observed audio signals, the distances between the acoustic sources and arrays, as well as the directions of arrival, are estimated. They serve as input to a non-linear least squares problem, from which both the sensor nodes’ positions and orientations, as well as the source positions, are alternatingly estimated in an iterative process. Given one set of unknowns, i.e., either the source positions or the sensor nodes’ geometry, the other set of unknowns can be computed in closed-form. The proposed approach is computationally efficient and the first one, which employs both distance and directional information for geometry calibration in a common cost function. Since both distance and direction of arrival measurements suffer from outliers, e.g., caused by strong reflections of the sound waves on the surfaces of the room, we introduce measures to deemphasize or remove unreliable measurements. Additionally, we discuss modifications of our previously proposed deep neural network-based acoustic distance estimator, to account not only for omnidirectional sources but also for directional sources. Simulation results show good positioning accuracy and compare very favorably with alternative approaches from the literature. AU - Gburrek, Tobias AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 22528 JF - EURASIP Journal on Audio, Speech, and Music Processing SN - 1687-4722 TI - Geometry calibration in wireless acoustic sensor networks utilizing DoA and distance information ER - TY - CONF AU - Gburrek, Tobias AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 23994 T2 - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - Iterative Geometry Calibration from Distance Estimates for Wireless Acoustic Sensor Networks ER - TY - CONF AU - Gburrek, Tobias AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 23999 T2 - Speech Communication; 14th ITG-Symposium TI - On Source-Microphone Distance Estimation Using Convolutional Recurrent Neural Networks ER - TY - CONF AB - In this work we address disentanglement of style and content in speech signals. We propose a fully convolutional variational autoencoder employing two encoders: a content encoder and a style encoder. To foster disentanglement, we propose adversarial contrastive predictive coding. This new disentanglement method does neither need parallel data nor any supervision. We show that the proposed technique is capable of separating speaker and content traits into the two different representations and show competitive speaker-content disentanglement performance compared to other unsupervised approaches. We further demonstrate an increased robustness of the content representation against a train-test mismatch compared to spectral features, when used for phone recognition. AU - Ebbers, Janek AU - Kuhlmann, Michael AU - Cord-Landwehr, Tobias AU - Haeb-Umbach, Reinhold ID - 29304 T2 - Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - Contrastive Predictive Coding Supported Factorized Variational Autoencoder for Unsupervised Learning of Disentangled Speech Representations ER - TY - CONF AB - Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and introduces the constraint that the total number of speakers must not exceed the number of output channels. 
When processing meeting-like data in a segment-wise manner, i.e., by separating overlapping segments independently and stitching adjacent segments into continuous output streams, this constraint has to be fulfilled for any segment. In this contribution, we show that this constraint can be significantly relaxed. We propose a novel graph-based PIT criterion, which casts the assignment of utterances to output channels as a graph coloring problem. It only requires that the number of concurrently active speakers must not exceed the number of output channels. As a consequence, the system can process an arbitrary number of speakers and arbitrarily long segments and thus can handle more diverse scenarios. Further, the stitching algorithm for obtaining a consistent output order in neighboring segments is of less importance and can even be eliminated completely, not least reducing the computational effort. Experiments on meeting-style WSJ data show improvements in recognition performance over using the uPIT criterion. AU - von Neumann, Thilo AU - Kinoshita, Keisuke AU - Boeddeker, Christoph AU - Delcroix, Marc AU - Haeb-Umbach, Reinhold ID - 26770 KW - Continuous speech separation KW - automatic speech recognition KW - overlapped speech KW - permutation invariant training T2 - Interspeech 2021 TI - Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation of Arbitrary Numbers of Speakers ER - TY - CONF AU - von Neumann, Thilo AU - Boeddeker, Christoph AU - Kinoshita, Keisuke AU - Delcroix, Marc AU - Haeb-Umbach, Reinhold ID - 29173 T2 - Speech Communication; 14th ITG Conference TI - Speeding Up Permutation Invariant Training for Source Separation ER - TY - CONF AB - In this paper we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments, where it achieved the fourth rank. Our presented solution is an advancement of our system used in the previous edition of the task. We use a forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo labeling, followed by tag-conditioned sound event detection (SED) models that are trained using strong pseudo labels provided by the FBCRNN. Our advancement over our earlier model is threefold. First, we introduce a strong label loss in the objective of the FBCRNN to take advantage of the strongly labeled synthetic data during training. Second, we perform multiple iterations of self-training for both the FBCRNN and tag-conditioned SED models. Third, while we used only tag-conditioned CNNs as our SED model in the previous edition, we here explore sophisticated tag-conditioned SED model architectures, namely, bidirectional CRNNs and bidirectional convolutional transformer neural networks (CTNNs), and combine them. With metric- and class-specific tuning of median filter lengths for post-processing, our final SED model, consisting of 6 submodels (2 of each architecture), achieves on the public evaluation set polyphonic sound event detection scores (PSDS) of 0.455 for scenario 1 and 0.684 for scenario 2, as well as a collar-based F1-score of 0.596, clearly outperforming the baselines and our model from the previous edition. Source code is publicly available at https://github.com/fgnt/pb_sed.
AU - Ebbers, Janek AU - Haeb-Umbach, Reinhold ID - 29308 SN - 978-84-09-36072-7 T2 - Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021) TI - Self-Trained Audio Tagging and Sound Event Detection in Domestic Environments ER - TY - CONF AB - Recently, there has been a rising interest in sound recognition via Acoustic Sensor Networks to support applications such as ambient assisted living or environmental habitat monitoring. With state-of-the-art sound recognition being dominated by deep-learning-based approaches, there is a high demand for labeled training data. Despite the availability of large-scale data sets such as Google's AudioSet, acquiring training data matching a certain application environment is still often a problem. In this paper we are concerned with human activity monitoring in a domestic environment using an ASN consisting of multiple nodes each providing multichannel signals. We propose a self-training based domain adaptation approach, which only requires unlabeled data from the target environment. Here, a sound recognition system trained on AudioSet, the teacher, generates pseudo labels for data from the target environment on which a student network is trained. The student can furthermore glean information about the spatial arrangement of sensors and sound sources to further improve classification performance. It is shown that the student significantly improves recognition performance over the pre-trained teacher without relying on labeled data from the environment the system is deployed in. AU - Ebbers, Janek AU - Keyser, Moritz Curt AU - Haeb-Umbach, Reinhold ID - 29306 T2 - Proceedings of the 29th European Signal Processing Conference (EUSIPCO) TI - Adapting Sound Recognition to A New Environment Via Self-Training ER - TY - JOUR AB - One objective of current research in explainable intelligent systems is to implement social aspects in order to increase the relevance of explanations. In this paper, we argue that a novel conceptual framework is needed to overcome shortcomings of existing AI systems with little attention to processes of interaction and learning. Drawing from research in interaction and development, we first outline the novel conceptual framework that pushes the design of AI systems toward true interactivity with an emphasis on the role of the partner and social relevance. We propose that AI systems will be able to provide a meaningful and relevant explanation only if the process of explaining is extended to active contribution of both partners that brings about dynamics that is modulated by different levels of analysis. Accordingly, our conceptual framework comprises monitoring and scaffolding as key concepts and claims that the process of explaining is not only modulated by the interaction between explainee and explainer but is embedded into a larger social context in which conventionalized and routinized behaviors are established. We discuss our conceptual framework in relation to the established objectives of transparency and autonomy that are raised for the design of explainable AI systems currently. AU - Rohlfing, Katharina J. AU - Cimiano, Philipp AU - Scharlau, Ingrid AU - Matzner, Tobias AU - Buhl, Heike M. 
AU - Buschmeier, Hendrik AU - Esposito, Elena AU - Grimminger, Angela AU - Hammer, Barbara AU - Haeb-Umbach, Reinhold AU - Horwath, Ilona AU - Hüllermeier, Eyke AU - Kern, Friederike AU - Kopp, Stefan AU - Thommes, Kirsten AU - Ngonga Ngomo, Axel-Cyrille AU - Schulte, Carsten AU - Wachsmuth, Henning AU - Wagner, Petra AU - Wrede, Britta ID - 24456 IS - 3 JF - IEEE Transactions on Cognitive and Developmental Systems KW - Explainability KW - process of explaining and understanding KW - explainable artificial systems SN - 2379-8920 TI - Explanation as a Social Practice: Toward a Conceptual Framework for the Social Design of AI Systems VL - 13 ER - TY - CONF AU - Haeb-Umbach, Reinhold ED - Böck, Ronald ED - Siegert, Ingo ED - Wendemuth, Andreas ID - 17763 KW - Poster SN - 978-3-959081-93-1 T2 - Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2020 TI - Sprachtechnologien für Digitale Assistenten ER - TY - CONF AU - Boeddeker, Christoph AU - Cord-Landwehr, Tobias AU - Heitkaemper, Jens AU - Zorila, Catalin AU - Hayakawa, Daichi AU - Li, Mohan AU - Liu, Min AU - Doddipatla, Rama AU - Haeb-Umbach, Reinhold ID - 20700 T2 - Proc. CHiME 2020 Workshop on Speech Processing in Everyday Environments TI - Towards a speaker diarization system for the CHiME 2020 dinner party transcription ER - TY - JOUR AU - Nakatani, Tomohiro AU - Boeddeker, Christoph AU - Kinoshita, Keisuke AU - Ikeshita, Rintaro AU - Delcroix, Marc AU - Haeb-Umbach, Reinhold ID - 17598 JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing TI - Jointly optimal denoising, dereverberation, and source separation ER - TY - CONF AB - In recent years, time domain speech separation has excelled over frequency domain separation in single-channel scenarios and noise-free environments. In this paper we dissect the gains of the time-domain audio separation network (TasNet) approach by gradually replacing components of an utterance-level permutation invariant training (u-PIT) based separation system in the frequency domain until the TasNet system is reached, thus blending components of frequency domain approaches with those of time domain approaches. Some of the intermediate variants achieve comparable signal-to-distortion ratio (SDR) gains to TasNet, but retain the advantage of frequency domain processing: compatibility with classic signal processing tools such as frequency-domain beamforming and the human interpretability of the masks. Furthermore, we show that the scale invariant signal-to-distortion ratio (si-SDR) criterion used as the loss function in TasNet is related to a logarithmic mean square error criterion and that it is this criterion which contributes most reliably to the performance advantage of TasNet. Finally, we critically assess which gains in a noise-free single channel environment generalize to more realistic reverberant conditions. AU - Heitkaemper, Jens AU - Jakobeit, Darius AU - Boeddeker, Christoph AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 20504 KW - voice activity detection KW - speech activity detection KW - neural network KW - statistical speech processing T2 - ICASSP 2020 Virtual Barcelona Spain TI - Demystifying TasNet: A Dissecting Approach ER - TY - CONF AB - Speech activity detection (SAD), which often rests on the fact that the noise is "more" stationary than speech, is particularly challenging in non-stationary environments, because the time variance of the acoustic scene makes it difficult to discriminate speech from noise.
We propose two approaches to SAD, one based on statistical signal processing and the other utilizing neural networks. The former employs sophisticated signal processing to track the noise and speech energies and is meant to support the case for a resource-efficient, unsupervised signal processing approach. The latter introduces a recurrent network layer that operates on short segments of the input speech to do temporal smoothing in the presence of non-stationary noise. The systems are tested on the Fearless Steps challenge database, which consists of the transmission data from the Apollo-11 space mission. The statistical SAD achieves comparable detection performance to earlier proposed neural network based SADs, while the neural network based approach leads to a decision cost function of 1.07% on the evaluation set of the 2020 Fearless Steps Challenge, which sets a new state of the art. AU - Heitkaemper, Jens AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 20505 KW - voice activity detection KW - speech activity detection KW - neural network KW - statistical speech processing T2 - INTERSPEECH 2020 Virtual Shanghai China TI - Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments ER - TY - CONF AB - The rising interest in single-channel multi-speaker speech separation sparked the development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far. AU - von Neumann, Thilo AU - Kinoshita, Keisuke AU - Drude, Lukas AU - Boeddeker, Christoph AU - Delcroix, Marc AU - Nakatani, Tomohiro AU - Haeb-Umbach, Reinhold ID - 20762 T2 - ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) TI - End-to-End Training of Time Domain Audio Separation and Recognition ER - TY - CONF AB - Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers. Our experiments show very promising performance in counting accuracy, source separation and speech recognition on simulated clean mixtures from WSJ0-2mix and WSJ0-3mix. Among other results, we set a new state-of-the-art word error rate on the WSJ0-2mix database.
Furthermore, our system generalizes well to a larger number of speakers than it ever saw during training, as shown in experiments with the WSJ0-4mix database. AU - von Neumann, Thilo AU - Boeddeker, Christoph AU - Drude, Lukas AU - Kinoshita, Keisuke AU - Delcroix, Marc AU - Nakatani, Tomohiro AU - Haeb-Umbach, Reinhold ID - 20764 T2 - Proc. Interspeech 2020 TI - Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source Counting, Separation and ASR ER - TY - CONF AB - We present an approach to deep neural network (DNN)-based distance estimation in reverberant rooms for supporting geometry calibration tasks in wireless acoustic sensor networks. Signal diffuseness information from acoustic signals is aggregated via the coherent-to-diffuse power ratio to obtain a distance-related feature, which is mapped to a source-to-microphone distance estimate by means of a DNN. This information is then combined with direction-of-arrival estimates from compact microphone arrays to infer the geometry of the sensor network. Unlike many other approaches to geometry calibration, the proposed scheme only requires that the sampling clocks of the sensor nodes are roughly synchronized. In simulations we show that the proposed DNN-based distance estimator generalizes to unseen acoustic environments and that precise estimates of the sensor node positions are obtained. AU - Gburrek, Tobias AU - Schmalenstroeer, Joerg AU - Brendel, Andreas AU - Kellermann, Walter AU - Haeb-Umbach, Reinhold ID - 18651 T2 - European Signal Processing Conference (EUSIPCO) TI - Deep Neural Network based Distance Estimation for Geometry Calibration in Acoustic Sensor Network ER - TY - CONF AB - Recently, source separation performance has been greatly improved by time-domain audio source separation based on the dual-path recurrent neural network (DPRNN). DPRNN is a simple but effective model for long sequential data. While DPRNN is quite efficient in modeling sequential data of the length of an utterance, i.e., about 5 to 10 seconds, it is harder to apply to longer sequences such as whole conversations consisting of multiple utterances. This is simply because, in such a case, the number of time steps consumed by its internal module, the inter-chunk RNN, becomes extremely large. To mitigate this problem, this paper proposes a multi-path RNN (MPRNN), a generalized version of DPRNN, that models the input data in a hierarchical manner. In the MPRNN framework, the input data is represented at several (≥ 3) time resolutions, each of which is modeled by a specific RNN sub-module. For example, the RNN sub-module that deals with the finest resolution may model temporal relationships only within a phoneme, while the RNN sub-module handling the coarsest resolution may capture only the relationship between utterances such as speaker information. We perform experiments using simulated dialogue-like mixtures and show that MPRNN has greater model capacity, and it outperforms the current state-of-the-art DPRNN framework especially in online processing scenarios. AU - Kinoshita, Keisuke AU - von Neumann, Thilo AU - Delcroix, Marc AU - Nakatani, Tomohiro AU - Haeb-Umbach, Reinhold ID - 20766 T2 - Proc.
Interspeech 2020 TI - Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation ER - TY - CONF AB - In this paper we present our system for the detection and classification of acoustic scenes and events (DCASE) 2020 Challenge Task 4: Sound event detection and separation in domestic environments. We introduce two new models: the forward-backward convolutional recurrent neural network (FBCRNN) and the tag-conditioned convolutional neural network (CNN). The FBCRNN employs two recurrent neural network (RNN) classifiers sharing the same CNN for preprocessing. With one RNN processing a recording in the forward direction and the other in the backward direction, the two networks are trained to jointly predict audio tags, i.e., weak labels, at each time step within a recording, given that at each time step they have jointly processed the whole recording. The proposed training encourages the classifiers to tag events as soon as possible. Therefore, after training, the networks can be applied to shorter audio segments of, e.g., 200 ms, allowing sound event detection (SED). Further, we propose a tag-conditioned CNN to complement SED. It is trained to predict strong labels while using (predicted) tags, i.e., weak labels, as additional input. For training, pseudo strong labels from an FBCRNN ensemble are used. The presented system achieved fourth and third place in the systems and teams rankings, respectively. Subsequent improvements allow our system to even outperform the challenge baseline and winner systems on average by 18.0% and 2.2% event-based F1-score, respectively, on the validation set. Source code is publicly available at https://github.com/fgnt/pb_sed. AU - Ebbers, Janek AU - Haeb-Umbach, Reinhold ID - 20753 T2 - Proceedings of the Detection and Classification of Acoustic Scenes and Events 2020 Workshop (DCASE2020) TI - Forward-Backward Convolutional Recurrent Neural Networks and Tag-Conditioned Convolutional Neural Networks for Weakly Labeled Semi-Supervised Sound Event Detection ER - TY - JOUR AB - When acoustic signal processing is combined with automated learning: communications engineers are working with multiple microphones and deep neural networks toward better speech recognition under the most adverse conditions. In the long term, digital voice assistants could also benefit from such sensor networks. AU - Haeb-Umbach, Reinhold ID - 17762 IS - 1 JF - forschung TI - Lektionen für Alexa & Co?! VL - 44 ER - TY - JOUR AB - We present a multi-channel database of overlapping speech for training, evaluation, and detailed analysis of source separation and extraction algorithms: SMS-WSJ -- Spatialized Multi-Speaker Wall Street Journal. It consists of artificially mixed speech taken from the WSJ database, but unlike earlier databases we consider all WSJ0+1 utterances and take care of strictly separating the speaker sets present in the training, validation and test sets. When spatializing the data we ensure a high degree of randomness w.r.t. room size, array center and rotation, as well as speaker position. Furthermore, this paper offers a critical assessment of recently proposed measures of source separation performance. Alongside the code to generate the database we provide a source separation baseline and a Kaldi recipe with competitive word error rates to provide common ground for evaluation.
AU - Drude, Lukas AU - Heitkaemper, Jens AU - Boeddeker, Christoph AU - Haeb-Umbach, Reinhold ID - 19446 JF - ArXiv e-prints TI - SMS-WSJ: Database, performance measures, and baseline recipe for multi-channel source separation and recognition ER - TY - CONF AB - We present an unsupervised training approach for a neural network-based mask estimator in an acoustic beamforming application. The network is trained to maximize a likelihood criterion derived from a spatial mixture model of the observations. It is trained from scratch without requiring any parallel data consisting of degraded input and clean training targets. Thus, training can be carried out on real recordings of noisy speech rather than simulated ones. In contrast to previous work on unsupervised training of neural mask estimators, our approach avoids the need for a possibly pre-trained teacher model entirely. We demonstrate the effectiveness of our approach by speech recognition experiments on two different datasets: one mainly deteriorated by noise (CHiME 4) and one by reverberation (REVERB). The results show that the performance of the proposed system is on par with a supervised system using oracle target masks for training and with a system trained using a model-based teacher. AU - Drude, Lukas AU - Heymann, Jahn AU - Haeb-Umbach, Reinhold ID - 11965 T2 - INTERSPEECH 2019, Graz, Austria TI - Unsupervised training of neural mask-based beamforming ER - TY - CONF AB - We propose a training scheme to train neural network-based source separation algorithms from scratch when parallel clean data is unavailable. In particular, we demonstrate that an unsupervised spatial clustering algorithm is sufficient to guide the training of a deep clustering system. We argue that previous work on deep clustering requires strong supervision and elaborate on why this is a limitation. We demonstrate that (a) the single-channel deep clustering system trained according to the proposed scheme alone is able to achieve a similar performance as the multi-channel teacher in terms of word error rates and (b) initializing the spatial clustering approach with the deep clustering result yields a relative word error rate reduction of 26% over the unsupervised teacher. AU - Drude, Lukas AU - Hasenklever, Daniel AU - Haeb-Umbach, Reinhold ID - 12874 T2 - ICASSP 2019, Brighton, UK TI - Unsupervised Training of a Deep Clustering Model for Multichannel Blind Source Separation ER - TY - CONF AB - Signal dereverberation using the Weighted Prediction Error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. First proposed as an iterative algorithm, follow-up works have reformulated it as a recursive least squares algorithm and therefore enabled its use in online applications. For this algorithm, the estimation of the power spectral density (PSD) of the anechoic signal plays an important role and strongly influences its performance. Recently, we showed that using a neural network PSD estimator leads to improved performance for online automatic speech recognition. This, however, comes at a price. To train the network, we require parallel data, i.e., utterances simultaneously available in clean and reverberated form. Here we propose to overcome this limitation by training the network jointly with the acoustic model of the speech recognizer. 
To be specific, the gradients computed from the cross-entropy loss between the target senone sequence and the acoustic model network output are backpropagated through the complex-valued dereverberation filter estimation to the neural network for PSD estimation. Evaluation on two databases demonstrates improved performance for online processing scenarios while imposing fewer requirements on the available training data and thus widening the range of applications. AU - Heymann, Jahn AU - Drude, Lukas AU - Haeb-Umbach, Reinhold AU - Kinoshita, Keisuke AU - Nakatani, Tomohiro ID - 12875 T2 - ICASSP 2019, Brighton, UK TI - Joint Optimization of Neural Network-based WPE Dereverberation and Acoustic Model for Robust Online ASR ER - TY - CONF AB - In this paper, we present libDirectional, a MATLAB library for directional statistics and directional estimation. It supports a variety of commonly used distributions on the unit circle, such as the von Mises, wrapped normal, and wrapped Cauchy distributions. Furthermore, various distributions on higher-dimensional manifolds such as the unit hypersphere and the hypertorus are available. Based on these distributions, several recursive filtering algorithms in libDirectional allow estimation on these manifolds. The functionality is implemented in a clear, well-documented, and object-oriented structure that is both easy to use and easy to extend. AU - Kurz, Gerhard AU - Gilitschenski, Igor AU - Pfaff, Florian AU - Drude, Lukas AU - Hanebeck, Uwe D. AU - Haeb-Umbach, Reinhold AU - Siegwart, Roland Y. ID - 12876 T2 - Journal of Statistical Software 89(4) TI - Directional Statistics and Filtering Using libDirectional ER - TY - JOUR AB - We formulate a generic framework for blind source separation (BSS), which allows integrating data-driven spectro-temporal methods, such as deep clustering and deep attractor networks, with physically motivated probabilistic spatial methods, such as complex angular central Gaussian mixture models. The integrated model exploits the complementary strengths of the two approaches to BSS: the strong modeling power of neural networks, which, however, is based on supervised learning, and the ease of unsupervised learning of the spatial mixture models whose few parameters can be estimated on as little as a single segment of a real mixture of speech. Experiments are carried out on both artificially mixed speech and true recordings of speech mixtures. The experiments verify that the integrated models consistently outperform the individual components. We further extend the models to cope with noisy, reverberant speech and introduce a cross-domain teacher-student training where the mixture model serves as the teacher to provide training targets for the student neural network. AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 12890 JF - IEEE Journal of Selected Topics in Signal Processing TI - Integration of Neural Networks and Probabilistic Spatial Models for Acoustic Blind Source Separation ER - TY - CONF AB - Despite the strong modeling power of neural network acoustic models, speech enhancement has been shown to deliver additional word error rate improvements if multi-channel data is available. However, there has been a longstanding debate about whether enhancement should also be carried out on the ASR training data.
In an extensive experimental evaluation on the acoustically very challenging CHiME-5 dinner party data we show that: (i) cleaning up the training data can lead to substantial error rate reductions, and (ii) enhancement in training is advisable as long as enhancement in test is at least as strong as in training. This approach stands in contrast to, and delivers larger gains than, the common strategy reported in the literature of augmenting the training database with additional artificially degraded speech. Together with an acoustic model topology consisting of initial CNN layers followed by factorized TDNN layers, we achieve WERs of 41.6% and 43.2% on the DEV and EVAL test sets, respectively, a new single-system state-of-the-art result on the CHiME-5 data. This is an 8% relative improvement compared to the best word error rate published so far for a speech recognizer without system combination. AU - Zorila, Catalin AU - Boeddeker, Christoph AU - Doddipatla, Rama AU - Haeb-Umbach, Reinhold ID - 15816 T2 - ASRU 2019, Sentosa, Singapore TI - An Investigation Into the Effectiveness of Enhancement in ASR Training and Test for Chime-5 Dinner Party Transcription ER - TY - CONF AB - Multi-talker speech and moving speakers still pose a significant challenge to automatic speech recognition systems. Assuming an enrollment utterance of the target speaker is available, the so-called SpeakerBeam concept has been recently proposed to extract the target speaker from a speech mixture. If multi-channel input is available, spatial properties of the speaker can be exploited to support the source extraction. In this contribution we investigate different approaches to exploit such spatial information. In particular, we are interested in the question of how useful this information is if the target speaker changes his/her position. To this end, we present a SpeakerBeam-based source extraction network that is adapted to work on moving speakers by recursively updating the beamformer coefficients. Experimental results are presented on two data sets, one with artificially created room impulse responses, and one with real room impulse responses and noise recorded in a conference room. Interestingly, spatial features turn out to be advantageous even if the speaker position changes. AU - Heitkaemper, Jens AU - Feher, Thomas AU - Freitag, Michael AU - Haeb-Umbach, Reinhold ID - 14822 T2 - International Conference on Statistical Language and Speech Processing 2019, Ljubljana, Slovenia TI - A Study on Online Source Extraction in the Presence of Changing Speaker Positions ER - TY - CONF AB - This paper deals with multi-channel speech recognition in scenarios with multiple speakers. Recently, the spectral characteristics of a target speaker, extracted from an adaptation utterance, have been used to guide a neural network mask estimator to focus on that speaker. In this work we present two variants of speaker-aware neural networks, which exploit both spectral and spatial information to allow better discrimination between target and interfering speakers. Thus, we introduce either spatial preprocessing prior to the mask estimation or a spatial plus spectral speaker characterization block whose output is directly fed into the neural mask estimator. The target speaker's spectral and spatial signature is extracted from an adaptation utterance recorded at the beginning of a session. We further adapt the architecture for low-latency processing by means of block-online beamforming that recursively updates the signal statistics.
Experimental results show that the additional spatial information clearly improves source extraction, in particular in the same-gender case, and that our proposal achieves state-of-the-art performance in terms of distortion reduction and recognition accuracy. AU - Martin-Donas, Juan M. AU - Heitkaemper, Jens AU - Haeb-Umbach, Reinhold AU - Gomez, Angel M. AU - Peinado, Antonio M. ID - 14824 T2 - INTERSPEECH 2019, Graz, Austria TI - Multi-Channel Block-Online Source Extraction based on Utterance Adaptation ER - TY - CONF AB - In this paper, we present Hitachi and Paderborn University’s joint effort for automatic speech recognition (ASR) in a dinner party scenario. The main challenges of ASR systems for dinner party recordings obtained by multiple microphone arrays are (1) heavy speech overlaps, (2) severe noise and reverberation, (3) very natural conversational content, and possibly (4) insufficient training data. As an example of a dinner party scenario, we have chosen the data presented during the CHiME-5 speech recognition challenge, where the baseline ASR had a 73.3% word error rate (WER), and even the best performing system at the CHiME-5 challenge had a 46.1% WER. We extensively investigated a combination of the guided source separation-based speech enhancement technique and an already proposed strong ASR backend and found that a tight combination of these techniques provided substantial accuracy improvements. Our final system achieved WERs of 39.94% and 41.64% for the development and evaluation data, respectively, both of which are the best published results for the dataset. We also investigated training with data beyond the official small CHiME-5 corpus to assess the intrinsic difficulty of this ASR task. AU - Kanda, Naoyuki AU - Boeddeker, Christoph AU - Heitkaemper, Jens AU - Fujita, Yusuke AU - Horiguchi, Shota AU - Haeb-Umbach, Reinhold ID - 14826 T2 - INTERSPEECH 2019, Graz, Austria TI - Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR ER - TY - CONF AB - Automatic meeting analysis comprises the tasks of speaker counting, speaker diarization, and the separation of overlapped speech, followed by automatic speech recognition. This all has to be carried out on arbitrarily long sessions and, ideally, in an online or block-online manner. While significant progress has been made on individual tasks, this paper presents for the first time an all-neural approach to simultaneous speaker counting, diarization and source separation. The NN-based estimator operates in a block-online fashion and tracks speakers even if they remain silent for a number of time blocks, thus learning a stable output order for the separated sources. The neural network is recurrent over time as well as over the number of sources. The simulation experiments show that state-of-the-art separation performance is achieved, while at the same time delivering good diarization and source counting results. It even generalizes well to an unseen large number of blocks. AU - von Neumann, Thilo AU - Kinoshita, Keisuke AU - Delcroix, Marc AU - Araki, Shoko AU - Nakatani, Tomohiro AU - Haeb-Umbach, Reinhold ID - 13271 T2 - ICASSP 2019, Brighton, UK TI - All-neural Online Source Separation, Counting, and Diarization for Meeting Analysis ER - TY - JOUR AB - Once a popular theme of futuristic science fiction or far-fetched technology forecasts, digital home assistants with a spoken language interface have become a ubiquitous commodity today.
This success has been made possible by major advancements in signal processing and machine learning for so-called far-field speech recognition, where the commands are spoken at a distance from the sound capturing device. The challenges encountered are quite unique and different from many other use cases of automatic speech recognition. The purpose of this tutorial article is to describe, in a way amenable to the non-specialist, the key speech processing algorithms that enable reliable, fully hands-free speech interaction with digital home assistants. These technologies include multi-channel acoustic echo cancellation, microphone array processing and dereverberation techniques for signal enhancement, reliable wake-up word and end-of-interaction detection, high-quality speech synthesis, as well as sophisticated statistical models for speech and language, learned from large amounts of heterogeneous training data. In all these fields, deep learning has occupied a critical role. AU - Haeb-Umbach, Reinhold AU - Watanabe, Shinji AU - Nakatani, Tomohiro AU - Bacchiani, Michiel AU - Hoffmeister, Bjoern AU - Seltzer, Michael L. AU - Zen, Heiga AU - Souden, Mehrez ID - 15814 IS - 6 JF - IEEE Signal Processing Magazine SN - 1558-0792 TI - Speech Processing for Digital Home Assistants: Combining Signal Processing With Deep-Learning Techniques VL - 36 ER - TY - JOUR AB - When acoustic signal processing is combined with machine learning: communications engineers are working with multiple microphones and deep neural networks towards better speech recognition under the most adverse conditions. In the long term, digital voice assistants could also benefit from such sensor networks. AU - Haeb-Umbach, Reinhold ID - 19450 JF - DFG forschung 1/2019 TI - Lektionen für Alexa & Co?! ER - TY - CONF AB - This paper presents an approach to voice conversion, which requires neither parallel data nor speaker or phone labels for training. It can convert between speakers which are not in the training set by employing the previously proposed concept of a factorized hierarchical variational autoencoder. Here, linguistic and speaker induced variations are separated upon the notion that content induced variations change at a much shorter time scale, i.e., at the segment level, than speaker induced variations, which vary at the longer utterance level. In this contribution we propose to employ convolutional instead of recurrent network layers in the encoder and decoder blocks, which is shown to achieve better phone recognition accuracy on the latent segment variables at frame level due to their better temporal resolution. For voice conversion the mean of the utterance variables is replaced with the respective estimated mean of the target speaker. The resulting log-mel spectra of the decoder output are used as local conditions of a WaveNet which is utilized for synthesis of the speech waveforms. Experiments show both good disentanglement properties of the latent space variables and good voice conversion performance. AU - Gburrek, Tobias AU - Glarner, Thomas AU - Ebbers, Janek AU - Haeb-Umbach, Reinhold AU - Wagner, Petra ID - 15237 T2 - Proc. 10th ISCA Speech Synthesis Workshop TI - Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion ER - TY - CONF AB - In this paper we present our audio tagging system for the DCASE 2019 Challenge Task 2.
We propose a model consisting of a convolutional front end using log-mel energies as input features, a recurrent neural network sequence encoder, and a fully connected classifier network outputting an activity probability for each of the 80 considered event classes. Due to the recurrent neural network, which encodes a whole sequence into a single vector, our model is able to process sequences of varying lengths. The model is trained with only a small amount of manually labeled training data and a larger amount of automatically labeled web data, which hence suffers from label noise. To efficiently train the model with the provided data we use various data augmentation techniques to prevent overfitting and improve generalization. Our best submitted system achieves a label-weighted label-ranking average precision (lwlrap) of 75.5% on the private test set, which is an absolute improvement of 21.7% over the baseline. This system took second place in the team ranking of the DCASE 2019 Challenge Task 2 and fifth place in the Kaggle competition “Freesound Audio Tagging 2019” with more than 400 participants. After the challenge ended we further improved performance to 76.5% lwlrap, setting a new state of the art on this dataset. AU - Ebbers, Janek AU - Haeb-Umbach, Reinhold ID - 15794 T2 - DCASE2019 Workshop, New York, USA TI - Convolutional Recurrent Neural Network and Data Augmentation for Audio Tagging with Noisy Labels and Minimal Supervision ER - TY - CONF AB - In this paper we consider human daily activity recognition using an acoustic sensor network (ASN) which consists of nodes distributed in a home environment. Assuming that the ASN is permanently recording, the vast majority of recordings is silence. Therefore, we propose to employ a computationally efficient two-stage sound recognition system, consisting of an initial sound activity detection (SAD) and a subsequent sound event classification (SEC), which is only activated once sound activity has been detected. We show how a low-latency activity detector with high temporal resolution can be trained from weak labels with low temporal resolution. We further demonstrate the advantage of using spatial features for the subsequent event classification task. AU - Ebbers, Janek AU - Drude, Lukas AU - Haeb-Umbach, Reinhold AU - Brendel, Andreas AU - Kellermann, Walter ID - 15796 T2 - CAMSAP 2019, Guadeloupe, West Indies TI - Weakly Supervised Sound Activity Detection and Event Classification in Acoustic Sensor Networks ER - TY - CONF AB - In this paper we highlight the privacy risks entailed in deep neural network feature extraction for domestic activity monitoring. We employ the baseline system proposed in Task 5 of the DCASE 2018 challenge and simulate a feature interception attack by an eavesdropper who wants to perform speaker identification. We then propose to reduce the aforementioned privacy risks by introducing a variational information feature extraction scheme that allows for good activity monitoring performance while at the same time minimizing the information of the feature representation, thus restricting speaker identification attempts. We analyze the resulting model’s composite loss function and the budget scaling factor used to control the balance between the performance of the trusted and attacker tasks. It is empirically demonstrated that the proposed method reduces speaker identification privacy risks without significantly degrading the performance of domestic activity monitoring tasks.
AU - Nelus, Alexandru AU - Ebbers, Janek AU - Haeb-Umbach, Reinhold AU - Martin, Rainer ID - 15792 T2 - INTERSPEECH 2019, Graz, Austria TI - Privacy-preserving Variational Information Feature Extraction for Domestic Activity Monitoring Versus Speaker Identification ER - TY - CONF AB - Acoustic event detection, i.e., the task of assigning a human interpretable label to a segment of audio, has only recently attracted increased interest in the research community. Driven by the DCASE challenges and the availability of large-scale audio datasets, the state of the art has progressed rapidly, with deep-learning-based classifiers dominating the field. Because several potential use cases favor a realization on distributed sensor nodes, e.g. ambient assisted living applications, habitat monitoring or surveillance, we are concerned with two issues here: firstly, the classification performance of such systems, and secondly, the computing resources required to achieve a certain performance considering node-level feature extraction. In this contribution we look at the balance between the two criteria by employing traditional techniques and different deep learning architectures, including convolutional and recurrent models, in the context of real-life everyday audio recordings in realistic, yet challenging, multi-source conditions. AU - Ebbers, Janek AU - Nelus, Alexandru AU - Martin, Rainer AU - Haeb-Umbach, Reinhold ID - 11760 T2 - DAGA 2018, München TI - Evaluation of Modulation-MFCC Features and DNN Classification for Acoustic Event Detection ER - TY - CONF AB - Signal dereverberation using the weighted prediction error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. But in its original formulation, WPE requires multiple iterations over a sufficiently long utterance, rendering it unsuitable for online low-latency applications. Recently, two methods have been proposed to overcome this limitation. One utilizes a neural network to estimate the power spectral density (PSD) of the target signal and works in a block-online fashion. The other method relies on a rather simple PSD estimation which smoothes the observed PSD and utilizes a recursive formulation which enables it to work on a frame-by-frame basis. In this paper, we integrate a deep neural network (DNN) based estimator into the recursive frame-online formulation. We evaluate the performance of the recursive system with different PSD estimators in comparison to the block-online and offline variants on two distinct corpora: the REVERB challenge data, where the signal is mainly deteriorated by reverberation, and a database which combines WSJ and VoiceHome to also consider (directed) noise sources. The results show that although smoothing works surprisingly well, the more sophisticated DNN based estimator shows promising improvements and shortens the performance gap between online and offline processing. AU - Heymann, Jahn AU - Drude, Lukas AU - Haeb-Umbach, Reinhold AU - Kinoshita, Keisuke AU - Nakatani, Tomohiro ID - 11835 T2 - IWAENC 2018, Tokio, Japan TI - Frame-Online DNN-WPE Dereverberation ER - TY - CONF AB - We present a block-online multi-channel front end for automatic speech recognition in noisy and reverberated environments. It is an online version of our earlier proposed neural network supported acoustic beamformer, whose coefficients are calculated from noise and speech spatial covariance matrices which are estimated utilizing a neural mask estimator.
However, the sparsity of speech in the STFT domain causes problems for the initial estimation of the beamformer coefficients in some frequency bins due to a lack of speech observations. We propose two methods to mitigate this issue. The first is to lower the frequency resolution of the STFT, which comes with the additional advantage of a reduced time window, thus lowering the latency introduced by block processing. The second approach is to smooth the beamforming coefficients along the frequency axis, thus exploiting their high inter-frequency correlation. With both approaches the gap between offline and block-online beamformer performance, as measured by the word error rate achieved by a downstream speech recognizer, is significantly reduced. Experiments are carried out on two corpora, representing noisy (CHiME-4) and noisy reverberant (voiceHome) environments. AU - Heitkaemper, Jens AU - Heymann, Jahn AU - Haeb-Umbach, Reinhold ID - 11837 T2 - ITG 2018, Oldenburg, Germany TI - Smoothing along Frequency in Online Neural Network Supported Acoustic Beamforming ER - TY - CONF AB - The weighted prediction error (WPE) algorithm has proven to be a very successful dereverberation method for the REVERB challenge. Likewise, neural network based mask estimation for beamforming demonstrated very good noise suppression in the CHiME 3 and CHiME 4 challenges. Recently, it has been shown that this estimator can also be trained to perform dereverberation and denoising jointly. However, a comparison of a neural beamformer and WPE has so far been missing, as has an investigation into a combination of the two. Therefore, we here provide an extensive evaluation of both and consequently propose variants to integrate deep neural network based beamforming with WPE. For these integrated variants we identify a consistent word error rate (WER) reduction on two distinct databases. In particular, our study shows that deep learning based beamforming benefits from a model-based dereverberation technique (i.e. WPE) and vice versa. Our key findings are: (a) Neural beamforming yields lower WERs than WPE the more channels and noise are present. (b) Integration of WPE and a neural beamformer consistently outperforms all stand-alone systems. AU - Drude, Lukas AU - Boeddeker, Christoph AU - Heymann, Jahn AU - Kinoshita, Keisuke AU - Delcroix, Marc AU - Nakatani, Tomohiro AU - Haeb-Umbach, Reinhold ID - 11872 T2 - INTERSPEECH 2018, Hyderabad, India TI - Integrating neural network based beamforming and weighted prediction error dereverberation ER - TY - CONF AB - NARA-WPE is a Python software package providing implementations of the weighted prediction error (WPE) dereverberation algorithm. WPE has been shown to be a highly effective tool for speech dereverberation, thus improving the perceptual quality of the signal and improving the recognition performance of downstream automatic speech recognition (ASR). It is suitable both for single-channel and multi-channel applications. The package consists of (1) a Numpy implementation which can easily be integrated into a custom Python toolchain, and (2) a TensorFlow implementation which allows integration into larger computational graphs and enables backpropagation through WPE to train more advanced front-ends. The package comprises an iterative offline (batch) version, a block-online version, and a frame-online version which can be used in moderately low latency applications, e.g. digital speech assistants.
AU - Drude, Lukas AU - Heymann, Jahn AU - Boeddeker, Christoph AU - Haeb-Umbach, Reinhold ID - 11873 T2 - ITG 2018, Oldenburg, Germany TI - NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing ER - TY - JOUR AB - We present an experimental comparison of seven state-of-the-art machine learning algorithms for the task of semantic analysis of spoken input, with a special emphasis on applications for dysarthric speech. Dysarthria is a motor speech disorder, which is characterized by poor articulation of phonemes. In order to cater for these noncanonical phoneme realizations, we employed an unsupervised learning approach to estimate the acoustic models for speech recognition, which does not require a literal transcription of the training data. Even for the subsequent task of semantic analysis, only weak supervision is employed, whereby the training utterance is accompanied by a semantic label only, rather than a literal transcription. Results on two databases, one of them containing dysarthric speech, are presented, showing that Markov logic networks and conditional random fields substantially outperform other machine learning approaches. Markov logic networks have proved to be especially robust to recognition errors, which are caused by imprecise articulation in dysarthric speech. AU - Despotovic, Vladimir AU - Walter, Oliver AU - Haeb-Umbach, Reinhold ID - 11916 JF - Speech Communication 99 (2018) 242-251 (Elsevier B.V.) TI - Machine learning techniques for semantic analysis of dysarthric speech: An experimental study ER - TY - CONF AB - Deep clustering (DC) and deep attractor networks (DANs) are data-driven approaches to monaural blind source separation. Both approaches provide astonishing single channel performance but have not yet been generalized to block-online processing. When separating speech in a continuous stream with a block-online algorithm, it needs to be determined in each block which of the output streams belongs to whom. In this contribution we solve this block permutation problem by introducing an additional speaker identification embedding to the DAN model structure. We motivate this model decision by analyzing the embedding topology of DC and DANs and show that DC and DANs themselves are not sufficient for speaker identification. This model structure (a) improves the signal to distortion ratio (SDR) over a DAN baseline and (b) provides up to 61% and up to 34% relative reduction in permutation error rate and re-identification error rate compared to an i-vector baseline, respectively. AU - Drude, Lukas AU - von Neumann, Thilo AU - Haeb-Umbach, Reinhold ID - 12898 T2 - ICASSP 2018, Calgary, Canada TI - Deep Attractor Networks for Speaker Re-Identification and Blind Source Separation ER - TY - CONF AB - Deep attractor networks (DANs) are a recently introduced method to blindly separate sources from spectral features of a monaural recording using bidirectional long short-term memory networks (BLSTMs). Due to the nature of BLSTMs, this is inherently not online-ready, and resorting to operating on blocks yields a block permutation problem in that the index of each speaker may change between blocks. We here propose the joint modeling of spatial and spectral features to solve the block permutation problem and generalize DANs to multi-channel meeting recordings: The DAN acts as a spectral feature extractor for a subsequent model-based clustering approach.
We first analyze different joint models in batch-processing scenarios and finally propose a block-online blind source separation algorithm. The efficacy of the proposed models is demonstrated on reverberant mixtures corrupted by real recordings of multi-channel background noise. We demonstrate that both the proposed batch-processing and the proposed block-online system outperform (a) a spatial-only model with a state-of-the-art frequency permutation solver and (b) a spectral-only model with an oracle block permutation solver in terms of signal to distortion ratio (SDR) gains. AU - Drude, Lukas AU - Higuchi, Takuya AU - Kinoshita, Keisuke AU - Nakatani, Tomohiro AU - Haeb-Umbach, Reinhold ID - 12900 T2 - ICASSP 2018, Calgary, Canada TI - Dual Frequency- and Block-Permutation Alignment for Deep Learning Based Block-Online Blind Source Separation ER - TY - CONF AB - This work examines acoustic beamformers employing neural networks (NNs) for mask prediction as a front-end for automatic speech recognition (ASR) systems in practical scenarios like voice-enabled home devices. To test the versatility of the mask predicting network, the system is evaluated with different recording hardware, different microphone array designs, and different acoustic models of the downstream ASR system. Significant gains in recognition accuracy are obtained in all configurations despite the fact that the NN had been trained on mismatched data. Unlike previous work, the NN is trained on a feature-level objective, which gives some performance advantage over a mask-related criterion. Furthermore, different approaches for realizing online, or adaptive, NN-based beamforming are explored, where the online algorithms still show significant gains compared to the baseline performance. AU - Boeddeker, Christoph AU - Erdogan, Hakan AU - Yoshioka, Takuya AU - Haeb-Umbach, Reinhold ID - 12901 T2 - ICASSP 2018, Calgary, Canada TI - Exploring Practical Aspects of Neural Mask-Based Beamforming for Far-Field Speech Recognition ER - TY - CONF AB - This contribution presents a speech enhancement system for the CHiME-5 Dinner Party Scenario. The front-end employs multi-channel linear time-variant filtering and achieves its gains without the use of a neural network. We present an adaptation of blind source separation techniques to the CHiME-5 database which we call Guided Source Separation (GSS). Using the baseline acoustic and language model, the combination of Weighted Prediction Error based dereverberation, guided source separation, and beamforming reduces the WER by 10.54% (relative) for the single array track and by 21.12% (relative) on the multiple array track. AU - Boeddeker, Christoph AU - Heitkaemper, Jens AU - Schmalenstroeer, Joerg AU - Drude, Lukas AU - Heymann, Jahn AU - Haeb-Umbach, Reinhold ID - 12899 T2 - Proc. CHiME 2018 Workshop on Speech Processing in Everyday Environments, Hyderabad, India TI - Front-End Processing for the CHiME-5 Dinner Party Scenario ER - TY - CONF AB - Signal processing in wireless acoustic sensor networks (WASNs) is based on a software framework for hosting the algorithms as well as on a set of wirelessly connected devices representing the hardware. Each of the nodes contributes memory, processing power, communication bandwidth and some sensor information for the tasks to be solved on the network. In this paper we present our MARVELO framework for distributed signal processing. It is intended for transforming existing centralized implementations into distributed versions.
To this end, the software only needs a block-oriented implementation, which MARVELO picks up and distributes on the network. Additionally, our sensor node hardware and the audio interfaces responsible for multi-channel recordings are presented. AU - Afifi, Haitham AU - Schmalenstroeer, Joerg AU - Ullmann, Joerg AU - Haeb-Umbach, Reinhold AU - Karl, Holger ID - 6859 T2 - Speech Communication; 13th ITG-Symposium TI - MARVELO - A Framework for Signal Processing in Wireless Acoustic Sensor Networks ER - TY - CONF AB - In this paper, we present a neural network based classification algorithm for the discrimination of moving from stationary targets in the sight of an automotive radar sensor. Compared to existing algorithms, the proposed algorithm can take into account multiple local radar targets instead of performing classification inference on each target individually, resulting in superior discrimination accuracy. This is especially suitable for non-rigid objects, like pedestrians, which in general have a wide velocity spread when multiple targets are detected. AU - Grimm, Christopher AU - Breddermann, Tobias AU - Farhoud, Ridha AU - Fei, Tai AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold ID - 11747 T2 - International Conference on Microwaves for Intelligent Mobility (ICMIM) 2018 TI - Discrimination of Stationary from Moving Targets with Recurrent Neural Networks in Automotive Radar ER - TY - CONF AB - The invention of the Variational Autoencoder enables the application of Neural Networks to a wide range of tasks in unsupervised learning, including the field of Acoustic Unit Discovery (AUD). The recently proposed Hidden Markov Model Variational Autoencoder (HMMVAE) allows a joint training of a neural network based feature extractor and a structured prior for the latent space given by a Hidden Markov Model. It has been shown that the HMMVAE significantly outperforms pure GMM-HMM based systems on the AUD task. However, the HMMVAE cannot autonomously infer the number of acoustic units and thus relies on the GMM-HMM system for initialization. This paper introduces the Bayesian Hidden Markov Model Variational Autoencoder (BHMMVAE) which solves these issues by embedding the HMMVAE in a Bayesian framework with a Dirichlet Process Prior for the distribution of the acoustic units, and diagonal or full-covariance Gaussians as emission distributions. Experiments on TIMIT and Xitsonga show that the BHMMVAE is able to autonomously infer a reasonable number of acoustic units, can be initialized without supervision by a GMM-HMM system, achieves computationally efficient stochastic variational inference by using natural gradient descent, and, additionally, improves the AUD performance over the HMMVAE. AU - Glarner, Thomas AU - Hanebrink, Patrick AU - Ebbers, Janek AU - Haeb-Umbach, Reinhold ID - 11907 T2 - INTERSPEECH 2018, Hyderabad, India TI - Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery ER - TY - CONF AB - Distributed sensor data acquisition usually encompasses data sampling by the individual devices, where each of them has its own oscillator driving the local sampling process, resulting in slightly different sampling rates at the individual sensor nodes. Nevertheless, for certain downstream signal processing tasks it is important to compensate even for small sampling rate offsets. Aligning the sampling rates of oscillators which differ only by a few parts per million is, however, challenging and quite different from traditional multirate signal processing tasks.
In this paper we propose to transfer a precise but computationally demanding time domain approach, inspired by the Nyquist-Shannon sampling theorem, to an efficient frequency domain implementation. To this end a buffer control is employed which compensates for sampling offsets which are multiples of the sampling period, while a digital filter, realized by the well-known Overlap-Save method, handles the fractional part of the sampling phase offset. With experiments on artificially misaligned data we investigate the parametrization, the efficiency, and the induced distortions of the proposed resampling method. It is shown that a favorable compromise between residual distortion and computational complexity is achieved, compared to other sampling rate offset compensation techniques. AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 11838 T2 - 26th European Signal Processing Conference (EUSIPCO 2018) TI - Efficient Sampling Rate Offset Compensation - An Overlap-Save Based Approach ER - TY - CONF AB - This paper describes the systems for the single-array track and the multiple-array track of the 5th CHiME Challenge. The final system is a combination of multiple systems, using Confusion Network Combination (CNC). The different systems presented here utilize different front-ends and training sets for a Bidirectional Long Short-Term Memory (BLSTM) Acoustic Model (AM). The front-end was replaced by enhancements provided by Paderborn University [1]. The back-end has been implemented using RASR [2] and RETURNN [3]. Additionally, a system combination including the hypothesis word graphs from the system of the submission [1] has been performed, which results in the final best system. AU - Kitza, Markus AU - Michel, Wilfried AU - Boeddeker, Christoph AU - Heitkaemper, Jens AU - Menne, Tobias AU - Schlüter, Ralf AU - Ney, Hermann AU - Schmalenstroeer, Joerg AU - Drude, Lukas AU - Heymann, Jahn AU - Haeb-Umbach, Reinhold ID - 11876 T2 - Proc. CHiME 2018 Workshop on Speech Processing in Everyday Environments, Hyderabad, India TI - The RWTH/UPB System Combination for the CHiME 2018 Workshop ER - TY - CONF AB - Due to their distributed nature, wireless acoustic sensor networks offer great potential for improved signal acquisition, processing and classification for applications such as monitoring and surveillance, home automation, or hands-free telecommunication. To reduce the communication demand with a central server and to raise the privacy level, it is desirable to perform processing at node level. The limited processing and memory capabilities on a sensor node, however, stand in contrast to the compute- and memory-intensive deep learning algorithms used in modern speech and audio processing. In this work, we perform benchmarking of commonly used convolutional and recurrent neural network architectures on a Raspberry Pi based acoustic sensor node. We show that it is possible to run medium-sized neural network topologies used for speech enhancement and speech recognition in real time. For acoustic event recognition, where predictions at a lower temporal resolution are sufficient, it is even possible to run current state-of-the-art deep convolutional models with a real-time factor of 0.11.
AU - Ebbers, Janek AU - Heitkaemper, Jens AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 11836 T2 - ITG 2018, Oldenburg, Germany TI - Benchmarking Neural Network Architectures for Acoustic Sensor Networks ER - TY - CONF AB - It has been experimentally verified that sampling rate offsets (SROs) between the input channels of an acoustic beamformer have a detrimental effect on the achievable SNR gains. In this paper we derive an analytic model to study the impact of SRO on the estimation of the spatial noise covariance matrix used in MVDR beamforming. It is shown that a perfect compensation of the SRO is impossible if the noise covariance matrix is estimated by time averaging, even if the SRO is perfectly known. The SRO should therefore be compensated for prior to beamformer coefficient estimation. We present a novel scheme where SRO compensation and beamforming closely interact, saving some computational effort compared to separate SRO adjustment followed by acoustic beamforming. AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 11839 T2 - ITG 2018, Oldenburg, Germany TI - Insights into the Interplay of Sampling Rate Offsets and MVDR Beamforming ER - TY - CONF AB - In this work, we address the limited availability of large annotated databases for real-life audio event detection by utilizing the concept of transfer learning. This technique aims to transfer knowledge from a source domain to a target domain, even if source and target have different feature distributions and label sets. We hypothesize that all acoustic events share the same inventory of basic acoustic building blocks and differ only in the temporal order of these acoustic units. We then construct a deep neural network with convolutional layers for extracting the acoustic units and a recurrent layer for capturing the temporal order. Under the above hypothesis, transfer learning from a source to a target domain with a different acoustic event inventory is realized by transferring the convolutional layers from the source to the target domain. The recurrent layer is, however, learnt directly from the target domain. Experiments on the transfer from a synthetic source database to the real-life target database of DCASE 2016 demonstrate that transfer learning leads to improved detection performance on average. However, the successful transfer to detect events which are very different from what was seen in the source domain could not be verified. AU - Arora, Prerna AU - Haeb-Umbach, Reinhold ID - 11717 T2 - IEEE 19th International Workshop on Multimedia Signal Processing (MMSP) TI - A Study on Transfer Learning for Acoustic Event Detection in a Real Life Scenario ER - TY - GEN AB - This report describes the computation of gradients by algorithmic differentiation for statistically optimum beamforming operations. In particular, the differentiation of complex-valued functions is a key component of this approach. Therefore, real-valued algorithmic differentiation is extended via the complex-valued chain rule. In addition to the basic mathematical operations, the derivative of the eigenvalue problem with complex-valued eigenvectors is one of the key results of this report. The potential of this approach is shown with experimental results on the CHiME-3 challenge database. There, the beamforming task is used as a front-end for an ASR system. With the developed derivatives, a joint optimization of a speech enhancement and speech recognition system w.r.t. the recognition optimization criterion is possible.
AU - Boeddeker, Christoph AU - Hanebrink, Patrick AU - Drude, Lukas AU - Heymann, Jahn AU - Haeb-Umbach, Reinhold ID - 11735 TI - On the Computation of Complex-valued Gradients with Application to Statistically Optimum Beamforming ER - TY - CONF AB - In this paper we show how a neural network for spectral mask estimation for an acoustic beamformer can be optimized by algorithmic differentiation. Using the beamformer output SNR as the objective function to maximize, the gradient is propagated through the beamformer all the way to the neural network which provides the clean speech and noise masks from which the beamformer coefficients are estimated by eigenvalue decomposition. A key theoretical result is the derivative of an eigenvalue problem involving complex-valued eigenvectors. Experimental results on the CHiME-3 challenge database demonstrate the effectiveness of the approach. The tools developed in this paper are a key component for an end-to-end optimization of speech enhancement and speech recognition. AU - Boeddeker, Christoph AU - Hanebrink, Patrick AU - Drude, Lukas AU - Heymann, Jahn AU - Haeb-Umbach, Reinhold ID - 11736 T2 - Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP) TI - Optimizing Neural-Network Supported Acoustic Beamforming by Algorithmic Differentiation ER - TY - CONF AB - The benefits of both a logarithmic spectral amplitude (LSA) estimation and a modeling in a generalized spectral domain (where short-time amplitudes are raised to a generalized power exponent, not restricted to magnitude or power spectrum) are combined in this contribution to achieve a better tradeoff between speech quality and noise suppression in single-channel speech enhancement. A novel gain function is derived to enhance the logarithmic generalized spectral amplitudes of noisy speech. Experiments on the CHiME-3 dataset show that it outperforms the well-known minimum mean squared error (MMSE) LSA gain function of Ephraim and Malah in terms of noise suppression by 1.4 dB, while the good speech quality of the MMSE-LSA estimator is maintained. AU - Chinaev, Aleksej AU - Haeb-Umbach, Reinhold ID - 11737 T2 - Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP) TI - A Generalized Log-Spectral Amplitude Estimator for Single-Channel Speech Enhancement ER - TY - CONF AB - Recent advances in discriminatively trained mask estimation networks to extract a single source utilizing beamforming techniques demonstrate that the integration of statistical models and deep neural networks (DNNs) is a promising approach for robust automatic speech recognition (ASR) applications. In this contribution we demonstrate how discriminatively trained embeddings on spectral features can be tightly integrated into statistical model-based source separation to separate and transcribe overlapping speech. Good generalization to unseen spatial configurations is achieved by estimating a statistical model at test time, while still leveraging discriminative training of deep clustering embeddings on a separate training set. We formulate an expectation maximization (EM) algorithm which jointly estimates a model for deep clustering embeddings and complex-valued spatial observations in the short time Fourier transform (STFT) domain at test time.
Extensive simulations confirm that the integrated model outperforms (a) a deep clustering model with a subsequent beamforming step and (b) an EM-based model with a beamforming step alone in terms of signal to distortion ratio (SDR) and perceptually motivated metric (PESQ) gains. ASR results on a reverberated dataset further show that the aforementioned gains translate to reduced word error rates (WERs) even in reverberant environments. AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 11754 T2 - INTERSPEECH 2017, Stockholm, Schweden TI - Tight integration of spatial and spectral features for BSS with Deep Clustering embeddings ER - TY - CONF AB - In this contribution we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, of which a certain amount is transcribed at the word level, and additional unrelated text data, the approach is able to learn a probabilistic mapping from acoustic units to characters and utilize it to segment the audio data into words without the need for a pronunciation dictionary. This is achieved by three components: an unsupervised acoustic unit discovery system, a supervisedly trained acoustic unit-to-grapheme converter, and a word discovery system, which is initialized with a language model trained on the text data. Experiments for multiple setups show that the initialization of the language model with text data improves the word segmentation performance by a large margin. AU - Glarner, Thomas AU - Boenninghoff, Benedikt AU - Walter, Oliver AU - Haeb-Umbach, Reinhold ID - 11770 T2 - INTERSPEECH 2017, Stockholm, Schweden TI - Leveraging Text Data for Word Segmentation for Underresourced Languages ER - TY - CONF AB - This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system. A neural network which estimates masks for a statistically optimum beamformer is jointly trained with a network for acoustic modeling. To update its parameters, we propagate the gradients from the acoustic model all the way through feature extraction and the complex-valued beamforming operation. Besides avoiding a mismatch between the front-end and the back-end, this approach also eliminates the need for stereo data, i.e., the parallel availability of clean and noisy versions of the signals. Instead, it can be trained with real noisy multichannel data only. Also, relying on the signal statistics for beamforming, the approach makes no assumptions on the configuration of the microphone array. We further observe a performance gain through joint training in terms of word error rate in an evaluation of the system on the CHiME 4 dataset. AU - Heymann, Jahn AU - Drude, Lukas AU - Boeddeker, Christoph AU - Hanebrink, Patrick AU - Haeb-Umbach, Reinhold ID - 11809 T2 - Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP) TI - BEAMNET: End-to-End Training of a Beamformer-Supported Multi-Channel ASR System ER - TY - JOUR AB - Acoustic beamforming can greatly improve the performance of Automatic Speech Recognition (ASR) and speech enhancement systems when multiple channels are available. We recently proposed a way to support the model-based Generalized Eigenvalue beamforming operation with a powerful neural network for spectral mask estimation. The enhancement system has a number of desirable properties.
In particular, no assumptions need to be made about the nature of the acoustic transfer function (e.g., being anechoic), nor does the array configuration need to be known. While the system was originally developed to enhance speech in noisy environments, we show in this article that it is also effective in suppressing reverberation, thus leading to a generic trainable multi-channel speech enhancement system for robust speech processing. To support this claim, we consider two distinct datasets: the CHiME 3 challenge, which features challenging real-world noise distortions, and the Reverb challenge, which focuses on distortions caused by reverberation. We evaluate the system with respect to both a speech enhancement and a recognition task. For the first task we propose a new way to cope with the distortions introduced by the Generalized Eigenvalue beamformer by renormalizing the target energy for each frequency bin, and measure its effectiveness in terms of the PESQ score. For the latter we feed the enhanced signal to a strong DNN back-end and achieve state-of-the-art ASR results on both datasets. We further experiment with different network architectures for spectral mask estimation: one small feed-forward network with only one hidden layer, one Convolutional Neural Network and one bi-directional Long Short-Term Memory network, showing that even a small network is capable of delivering significant performance improvements. AU - Heymann, Jahn AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 11811 JF - Computer Speech and Language TI - A Generic Neural Acoustic Beamforming Architecture for Robust Multi-Channel Speech Processing ER - TY - CONF AB - In this paper, we apply a high-resolution approach, i.e. the matrix pencil method (MPM), to the FMCW automotive radar system to separate neighboring targets which share similar parameters, i.e. range, relative speed and azimuth angle, and cause overlapping in the radar spectrum. In order to adapt the 1D model of the MPM to the 2D range-velocity spectrum and simultaneously limit the computational cost, some preprocessing steps are proposed to construct a novel separation algorithm. Finally, this algorithm is evaluated on both simulated and real data, and the results indicate a promising performance. AU - Fei, Tai AU - Grimm, Christopher AU - Farhoud, Ridha AU - Breddermann, Tobias AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold ID - 11763 T2 - IEEE International Conference on Microwaves, Communications, Antennas and Electronic Systems TI - A Novel Target Separation Algorithm Applied to The Two-Dimensional Spectrum for FMCW Automotive Radar Systems ER - TY - CONF AB - In this paper, we present a hypothesis test for the classification of moving targets in the sight of an automotive radar sensor. For this purpose, a statistical model of the relative velocity between a stationary target and the radar sensor has been developed. Based on the statistical properties, a confidence interval is calculated, and targets with a relative velocity lying outside this interval are classified as moving targets. Compared to existing algorithms, our approach is able to give a robust classification independent of the number of observed moving targets and is characterized by an instantaneous classification, a simple parameterization of the model and an automatic calculation of the discriminating threshold.
AU - Grimm, Christopher AU - Breddermann, Tobias AU - Farhoud, Ridha AU - Fei, Tai AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold ID - 11772 T2 - IEEE International Conference on Microwaves, Communications, Antennas and Electronic Systems (COMCAS) TI - Hypothesis Test for the Detection of Moving Targets in Automotive Radar ER - TY - CONF AB - Variational Autoencoders (VAEs) have been shown to provide efficient neural-network-based approximate Bayesian inference for observation models for which exact inference is intractable. Their extension, the so-called Structured VAE (SVAE), allows inference in the presence of both discrete and continuous latent variables. Inspired by this extension, we developed a VAE with Hidden Markov Models (HMMs) as latent models. We applied the resulting HMM-VAE to the task of acoustic unit discovery in a zero resource scenario. Starting from an initial model based on variational inference in an HMM with Gaussian Mixture Model (GMM) emission probabilities, the accuracy of the acoustic unit discovery could be significantly improved by the HMM-VAE. In doing so we were able to demonstrate for an unsupervised learning task what is well-known in the supervised learning case: Neural networks provide superior modeling power compared to GMMs. AU - Ebbers, Janek AU - Heymann, Jahn AU - Drude, Lukas AU - Glarner, Thomas AU - Haeb-Umbach, Reinhold AU - Raj, Bhiksha ID - 11759 T2 - INTERSPEECH 2017, Stockholm, Schweden TI - Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery ER - TY - CONF AB - Multi-channel speech enhancement algorithms rely on a synchronous sampling of the microphone signals. This, however, cannot always be guaranteed, especially if the sensors are distributed in an environment. To avoid performance degradation, the sampling rate offset needs to be estimated and compensated for. In this contribution we extend the recently proposed coherence drift based method in two important directions. First, the increasing phase shift in the short-time Fourier transform domain is estimated from the coherence drift in a matched-filter-like fashion, where intermediate estimates are weighted by their instantaneous SNR. Second, an observed bias is removed by iterating a couple of times between offset estimation and compensation by resampling. The effectiveness of the proposed method is demonstrated by speech recognition results on the output of a beamformer with and without sampling rate offset compensation between the input channels. We compare MVDR and maximum-SNR beamformers in reverberant environments and further show that both benefit from a novel phase normalization, which we also propose in this contribution. AU - Schmalenstroeer, Joerg AU - Heymann, Jahn AU - Drude, Lukas AU - Boeddeker, Christoph AU - Haeb-Umbach, Reinhold ID - 11895 T2 - IEEE 19th International Workshop on Multimedia Signal Processing (MMSP) TI - Multi-Stage Coherence Drift Based Sampling Rate Synchronization for Acoustic Beamforming ER - TY - CONF AB - In this paper we present an algorithm for the detection of moving targets in sight of an automotive radar sensor which can handle distorted ego-velocity information. In situations where biased or no velocity information is provided by the ego-vehicle, the algorithm is able to estimate the ego-velocity with high accuracy based on previously detected stationary targets, which is subsequently used for the target classification.
Compared to existing ego-velocity algorithms, our approach provides fast and efficient inference without sacrificing practical classification accuracy. Beyond that, the algorithm is characterized by a simple parameterization and few but appropriate model assumptions for highly accurate production automotive radar sensors. AU - Grimm, Christopher AU - Farhoud, Ridha AU - Fei, Tai AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold ID - 11773 T2 - IEEE Microwaves, Radar and Remote Sensing Symposium (MRRS) TI - Detection of Moving Targets in Automotive Radar with Distorted Ego-Velocity Information ER -
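The ego-velocity based moving-target classification described in the last two entries lends itself to a compact illustration. The following Python sketch is purely hypothetical and not taken from the cited papers: it assumes a forward-moving sensor, per-detection radial velocities and azimuth angles, uses the stationary-world relation v_r ≈ -v_ego * cos(theta) to estimate the ego speed by least squares, and flags detections whose residual exceeds a fixed threshold as moving; the function name and the threshold value are placeholders.

import numpy as np

def detect_moving_targets(v_r, theta, threshold=0.4):
    """Hypothetical sketch of ego-velocity based moving-target detection.

    v_r:       measured radial (relative) velocities of the detections [m/s]
    theta:     azimuth angles of the detections w.r.t. the driving direction [rad]
    threshold: residual speed above which a detection is declared moving [m/s]
    """
    v_r = np.asarray(v_r, dtype=float)
    c = np.cos(np.asarray(theta, dtype=float))
    # Least-squares ego speed under the stationary-world model v_r = -v_ego * cos(theta).
    v_ego = -np.dot(c, v_r) / np.dot(c, c)
    # Residual radial speed with respect to the stationary hypothesis.
    residual = np.abs(v_r + v_ego * c)
    # Refit once using only detections consistent with the stationary hypothesis.
    stationary = residual < threshold
    if stationary.any():
        v_ego = -np.dot(c[stationary], v_r[stationary]) / np.dot(c[stationary], c[stationary])
        residual = np.abs(v_r + v_ego * c)
    return residual >= threshold, v_ego

In the cited work, the decision threshold follows from a confidence interval derived from a statistical model of the relative velocity, and the neural-network variant additionally considers multiple local targets jointly; the fixed threshold and the single refit above are simplifications for illustration only.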