TY - JOUR
AB - Due to the ad hoc nature of wireless acoustic sensor networks, the positions of the sensor nodes are typically unknown. This contribution proposes a technique to estimate the position and orientation of the sensor nodes from the recorded speech signals. The method assumes that a node comprises a microphone array with synchronously sampled microphones rather than a single microphone, but it does not require the sampling clocks of the nodes to be synchronized. From the observed audio signals, the distances between the acoustic sources and arrays, as well as the directions of arrival, are estimated. They serve as input to a non-linear least squares problem, from which both the sensor nodes’ positions and orientations and the source positions are estimated alternately in an iterative process. Given one set of unknowns, i.e., either the source positions or the sensor nodes’ geometry, the other set of unknowns can be computed in closed form. The proposed approach is computationally efficient and is the first to employ both distance and directional information for geometry calibration in a common cost function. Since both distance and direction-of-arrival measurements suffer from outliers, e.g., caused by strong reflections of the sound waves off the surfaces of the room, we introduce measures to deemphasize or remove unreliable measurements. Additionally, we discuss modifications of our previously proposed deep neural network-based acoustic distance estimator to account not only for omnidirectional sources but also for directional sources. Simulation results show good positioning accuracy and compare very favorably with alternative approaches from the literature.
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 22528
JF - EURASIP Journal on Audio, Speech, and Music Processing
SN - 1687-4722
TI - Geometry calibration in wireless acoustic sensor networks utilizing DoA and distance information
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 23994
T2 - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Iterative Geometry Calibration from Distance Estimates for Wireless Acoustic Sensor Networks
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 23999
T2 - Speech Communication; 14th ITG-Symposium
TI - On Source-Microphone Distance Estimation Using Convolutional Recurrent Neural Networks
ER -
TY - CONF
AU - Chinaev, Aleksej
AU - Enzner, Gerald
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
ID - 23997
T2 - 29th European Signal Processing Conference (EUSIPCO)
TI - Online Estimation of Sampling Rate Offsets in Wireless Acoustic Sensor Networks with Packet Loss
ER -
TY - CONF
AB - In this work we address the disentanglement of style and content in speech signals. We propose a fully convolutional variational autoencoder employing two encoders: a content encoder and a style encoder. To foster disentanglement, we propose adversarial contrastive predictive coding. This new disentanglement method requires neither parallel data nor any supervision. We show that the proposed technique is capable of separating speaker and content traits into the two different representations and achieves competitive speaker-content disentanglement performance compared to other unsupervised approaches.
We further demonstrate an increased robustness of the content representation against a train-test mismatch, compared to spectral features, when used for phone recognition.
AU - Ebbers, Janek
AU - Kuhlmann, Michael
AU - Cord-Landwehr, Tobias
AU - Haeb-Umbach, Reinhold
ID - 29304
T2 - Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Contrastive Predictive Coding Supported Factorized Variational Autoencoder for Unsupervised Learning of Disentangled Speech Representations
ER -
TY - CONF
AB - Automatic transcription of meetings requires handling of overlapped speech, which calls for continuous speech separation (CSS) systems. The uPIT criterion was proposed for utterance-level separation with neural networks and introduces the constraint that the total number of speakers must not exceed the number of output channels. When processing meeting-like data in a segment-wise manner, i.e., by separating overlapping segments independently and stitching adjacent segments into continuous output streams, this constraint has to be fulfilled for every segment. In this contribution, we show that this constraint can be significantly relaxed. We propose a novel graph-based PIT criterion, which casts the assignment of utterances to output channels as a graph coloring problem. It only requires that the number of concurrently active speakers must not exceed the number of output channels. As a consequence, the system can process an arbitrary number of speakers and arbitrarily long segments and can thus handle more diverse scenarios. Further, the stitching algorithm for obtaining a consistent output order in neighboring segments becomes less important and can even be eliminated completely, not least reducing the computational effort. Experiments on meeting-style WSJ data show improvements in recognition performance over using the uPIT criterion.
AU - von Neumann, Thilo
AU - Kinoshita, Keisuke
AU - Boeddeker, Christoph
AU - Delcroix, Marc
AU - Haeb-Umbach, Reinhold
ID - 26770
KW - Continuous speech separation
KW - automatic speech recognition
KW - overlapped speech
KW - permutation invariant training
T2 - Interspeech 2021
TI - Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation of Arbitrary Numbers of Speakers
ER -
TY - CONF
AU - von Neumann, Thilo
AU - Boeddeker, Christoph
AU - Kinoshita, Keisuke
AU - Delcroix, Marc
AU - Haeb-Umbach, Reinhold
ID - 29173
T2 - Speech Communication; 14th ITG Conference
TI - Speeding Up Permutation Invariant Training for Source Separation
ER -
TY - CONF
AB - In this paper we present our system for the Detection and Classification of Acoustic Scenes and Events (DCASE) 2021 Challenge Task 4: Sound Event Detection and Separation in Domestic Environments, where it achieved the fourth rank. Our presented solution is an advancement of the system we used in the previous edition of the task. We use a forward-backward convolutional recurrent neural network (FBCRNN) for tagging and pseudo labeling, followed by tag-conditioned sound event detection (SED) models which are trained using the strong pseudo labels provided by the FBCRNN. Our advancement over our earlier model is threefold. First, we introduce a strong label loss in the objective of the FBCRNN to take advantage of the strongly labeled synthetic data during training. Second, we perform multiple iterations of self-training for both the FBCRNN and the tag-conditioned SED models.
Third, while we used only tag-conditioned CNNs as our SED model in the previous edition, we here explore sophisticated tag-conditioned SED model architectures, namely bidirectional CRNNs and bidirectional convolutional transformer neural networks (CTNNs), and combine them. With metric- and class-specific tuning of the median filter lengths for post-processing, our final SED model, consisting of 6 submodels (2 of each architecture), achieves polyphonic sound event detection scores (PSDS) of 0.455 for scenario 1 and 0.684 for scenario 2 on the public evaluation set, as well as a collar-based F1-score of 0.596, outperforming the baselines and our model from the previous edition by far. Source code is publicly available at https://github.com/fgnt/pb_sed.
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
ID - 29308
SN - 978-84-09-36072-7
T2 - Proceedings of the 6th Detection and Classification of Acoustic Scenes and Events 2021 Workshop (DCASE2021)
TI - Self-Trained Audio Tagging and Sound Event Detection in Domestic Environments
ER -
TY - CONF
AB - Recently, there has been a rising interest in sound recognition via acoustic sensor networks (ASNs) to support applications such as ambient assisted living or environmental habitat monitoring. With state-of-the-art sound recognition being dominated by deep-learning-based approaches, there is a high demand for labeled training data. Despite the availability of large-scale data sets such as Google's AudioSet, acquiring training data matching a certain application environment is still often a problem. In this paper we are concerned with human activity monitoring in a domestic environment using an ASN consisting of multiple nodes, each providing multichannel signals. We propose a self-training-based domain adaptation approach, which only requires unlabeled data from the target environment. Here, a sound recognition system trained on AudioSet, the teacher, generates pseudo labels for data from the target environment, on which a student network is trained. The student can furthermore glean information about the spatial arrangement of sensors and sound sources to further improve classification performance. It is shown that the student significantly improves recognition performance over the pre-trained teacher without relying on labeled data from the environment the system is deployed in.
AU - Ebbers, Janek
AU - Keyser, Moritz Curt
AU - Haeb-Umbach, Reinhold
ID - 29306
T2 - Proceedings of the 29th European Signal Processing Conference (EUSIPCO)
TI - Adapting Sound Recognition to A New Environment Via Self-Training
ER -
TY - JOUR
AB - One objective of current research in explainable intelligent systems is to implement social aspects in order to increase the relevance of explanations. In this paper, we argue that a novel conceptual framework is needed to overcome the shortcomings of existing AI systems, which pay little attention to processes of interaction and learning. Drawing from research in interaction and development, we first outline the novel conceptual framework that pushes the design of AI systems toward true interactivity, with an emphasis on the role of the partner and social relevance. We propose that AI systems will be able to provide a meaningful and relevant explanation only if the process of explaining is extended to the active contribution of both partners, bringing about dynamics that are modulated by different levels of analysis.
Accordingly, our conceptual framework comprises monitoring and scaffolding as key concepts, and it claims that the process of explaining is not only modulated by the interaction between explainee and explainer but is also embedded in a larger social context in which conventionalized and routinized behaviors are established. We discuss our conceptual framework in relation to the established objectives of transparency and autonomy that are currently raised for the design of explainable AI systems.
AU - Rohlfing, Katharina J.
AU - Cimiano, Philipp
AU - Scharlau, Ingrid
AU - Matzner, Tobias
AU - Buhl, Heike M.
AU - Buschmeier, Hendrik
AU - Esposito, Elena
AU - Grimminger, Angela
AU - Hammer, Barbara
AU - Haeb-Umbach, Reinhold
AU - Horwath, Ilona
AU - Hüllermeier, Eyke
AU - Kern, Friederike
AU - Kopp, Stefan
AU - Thommes, Kirsten
AU - Ngonga Ngomo, Axel-Cyrille
AU - Schulte, Carsten
AU - Wachsmuth, Henning
AU - Wagner, Petra
AU - Wrede, Britta
ID - 24456
IS - 3
JF - IEEE Transactions on Cognitive and Developmental Systems
KW - Explainability
KW - process of explaining and understanding
KW - explainable artificial systems
SN - 2379-8920
TI - Explanation as a Social Practice: Toward a Conceptual Framework for the Social Design of AI Systems
VL - 13
ER -
TY - CONF
AU - Haeb-Umbach, Reinhold
ED - Böck, Ronald
ED - Siegert, Ingo
ED - Wendemuth, Andreas
ID - 17763
KW - Poster
SN - 978-3-959081-93-1
T2 - Studientexte zur Sprachkommunikation: Elektronische Sprachsignalverarbeitung 2020
TI - Sprachtechnologien für Digitale Assistenten
ER -
TY - CONF
AU - Boeddeker, Christoph
AU - Nakatani, Tomohiro
AU - Kinoshita, Keisuke
AU - Haeb-Umbach, Reinhold
ID - 20695
SN - 9781509066315
T2 - ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Jointly Optimal Dereverberation and Beamforming
ER -
TY - CONF
AU - Boeddeker, Christoph
AU - Cord-Landwehr, Tobias
AU - Heitkaemper, Jens
AU - Zorila, Catalin
AU - Hayakawa, Daichi
AU - Li, Mohan
AU - Liu, Min
AU - Doddipatla, Rama
AU - Haeb-Umbach, Reinhold
ID - 20700
T2 - Proc. CHiME 2020 Workshop on Speech Processing in Everyday Environments
TI - Towards a speaker diarization system for the CHiME 2020 dinner party transcription
ER -
TY - JOUR
AU - Nakatani, Tomohiro
AU - Boeddeker, Christoph
AU - Kinoshita, Keisuke
AU - Ikeshita, Rintaro
AU - Delcroix, Marc
AU - Haeb-Umbach, Reinhold
ID - 17598
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
TI - Jointly optimal denoising, dereverberation, and source separation
ER -
TY - CONF
AB - In recent years, time-domain speech separation has excelled over frequency-domain separation in single-channel scenarios and noise-free environments. In this paper we dissect the gains of the time-domain audio separation network (TasNet) approach by gradually replacing components of an utterance-level permutation invariant training (u-PIT) based separation system in the frequency domain until the TasNet system is reached, thus blending components of frequency-domain approaches with those of time-domain approaches. Some of the intermediate variants achieve signal-to-distortion ratio (SDR) gains comparable to TasNet, but retain the advantages of frequency-domain processing: compatibility with classic signal processing tools such as frequency-domain beamforming and the human interpretability of the masks.
Furthermore, we show that the scale-invariant signal-to-distortion ratio (si-SDR) criterion used as the loss function in TasNet is related to a logarithmic mean square error criterion, and that it is this criterion which contributes most reliably to the performance advantage of TasNet. Finally, we critically assess which gains in a noise-free single-channel environment generalize to more realistic reverberant conditions.
AU - Heitkaemper, Jens
AU - Jakobeit, Darius
AU - Boeddeker, Christoph
AU - Drude, Lukas
AU - Haeb-Umbach, Reinhold
ID - 20504
KW - voice activity detection
KW - speech activity detection
KW - neural network
KW - statistical speech processing
T2 - ICASSP 2020 Virtual Barcelona Spain
TI - Demystifying TasNet: A Dissecting Approach
ER -
TY - GEN
AB - Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges, we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. The speech material is the same as the previous CHiME-5 recordings, except for accurate array synchronization. The material was elicited using a dinner party scenario, with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open-source baselines providing speech enhancement, speaker diarization, and speech recognition modules.
AU - Watanabe, Shinji
AU - Mandel, Michael
AU - Barker, Jon
AU - Vincent, Emmanuel
AU - Arora, Ashish
AU - Chang, Xuankai
AU - Khudanpur, Sanjeev
AU - Manohar, Vimal
AU - Povey, Daniel
AU - Raj, Desh
AU - Snyder, David
AU - Subramanian, Aswin Shanmugam
AU - Trmal, Jan
AU - Yair, Bar Ben
AU - Boeddeker, Christoph
AU - Ni, Zhaoheng
AU - Fujita, Yusuke
AU - Horiguchi, Shota
AU - Kanda, Naoyuki
AU - Yoshioka, Takuya
AU - Ryant, Neville
ID - 28263
T2 - arXiv:2004.09249
TI - CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings
ER -
TY - CONF
AB - Speech activity detection (SAD), which often rests on the fact that the noise is "more" stationary than speech, is particularly challenging in non-stationary environments, because the time variance of the acoustic scene makes it difficult to discriminate speech from noise. We propose two approaches to SAD, one based on statistical signal processing and the other utilizing neural networks. The former employs sophisticated signal processing to track the noise and speech energies and is meant to support the case for a resource-efficient, unsupervised signal processing approach. The latter introduces a recurrent network layer that operates on short segments of the input speech to perform temporal smoothing in the presence of non-stationary noise. The systems are tested on the Fearless Steps challenge database, which consists of the transmission data from the Apollo-11 space mission.
The statistical SAD achieves detection performance comparable to earlier proposed neural network-based SADs, while the neural network-based approach leads to a decision cost function of 1.07% on the evaluation set of the 2020 Fearless Steps Challenge, which sets a new state of the art.
AU - Heitkaemper, Jens
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 20505
KW - voice activity detection
KW - speech activity detection
KW - neural network
KW - statistical speech processing
T2 - INTERSPEECH 2020 Virtual Shanghai China
TI - Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments
ER -
TY - CONF
AB - The rising interest in single-channel multi-speaker speech separation has sparked the development of end-to-end (E2E) approaches to multi-speaker speech recognition. However, state-of-the-art neural network-based time-domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time-domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly, either by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time-domain model can yield substantial improvements over the cascaded DNN-HMM and monolithic E2E frequency-domain systems proposed so far.
AU - von Neumann, Thilo
AU - Kinoshita, Keisuke
AU - Drude, Lukas
AU - Boeddeker, Christoph
AU - Delcroix, Marc
AU - Nakatani, Tomohiro
AU - Haeb-Umbach, Reinhold
ID - 20762
T2 - ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - End-to-End Training of Time Domain Audio Separation and Recognition
ER -
TY - CONF
AB - Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker speech recognizer to form the first end-to-end multi-talker automatic speech recognition system for an unknown number of active speakers. Our experiments show very promising performance in counting accuracy, source separation and speech recognition on simulated clean mixtures from WSJ0-2mix and WSJ0-3mix. Among others, we set a new state-of-the-art word error rate on the WSJ0-2mix database. Furthermore, our system generalizes well to a larger number of speakers than it ever saw during training, as shown in experiments with the WSJ0-4mix database.
AU - von Neumann, Thilo
AU - Boeddeker, Christoph
AU - Drude, Lukas
AU - Kinoshita, Keisuke
AU - Delcroix, Marc
AU - Nakatani, Tomohiro
AU - Haeb-Umbach, Reinhold
ID - 20764
T2 - Proc. Interspeech 2020
TI - Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source Counting, Separation and ASR
ER -
TY - CONF
AB - We present an approach to deep neural network-based (DNN-based) distance estimation in reverberant rooms for supporting geometry calibration tasks in wireless acoustic sensor networks.
Signal diffuseness information from the acoustic signals is aggregated via the coherent-to-diffuse power ratio to obtain a distance-related feature, which is mapped to a source-to-microphone distance estimate by means of a DNN. This information is then combined with direction-of-arrival estimates from compact microphone arrays to infer the geometry of the sensor network. Unlike many other approaches to geometry calibration, the proposed scheme only requires that the sampling clocks of the sensor nodes are roughly synchronized. In simulations we show that the proposed DNN-based distance estimator generalizes to unseen acoustic environments and that precise estimates of the sensor node positions are obtained.
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Brendel, Andreas
AU - Kellermann, Walter
AU - Haeb-Umbach, Reinhold
ID - 18651
T2 - European Signal Processing Conference (EUSIPCO)
TI - Deep Neural Network based Distance Estimation for Geometry Calibration in Acoustic Sensor Network
ER -