TY - CONF
AB - We present an unsupervised training approach for a neural network-based mask estimator in an acoustic beamforming application. The network is trained to maximize a likelihood criterion derived from a spatial mixture model of the observations. It is trained from scratch without requiring any parallel data consisting of degraded input and clean training targets. Thus, training can be carried out on real recordings of noisy speech rather than simulated ones. In contrast to previous work on unsupervised training of neural mask estimators, our approach avoids the need for a possibly pre-trained teacher model entirely. We demonstrate the effectiveness of our approach by speech recognition experiments on two different datasets: one mainly deteriorated by noise (CHiME 4) and one by reverberation (REVERB). The results show that the performance of the proposed system is on par with a supervised system using oracle target masks for training and with a system trained using a model-based teacher.
AU - Drude, Lukas
AU - Heymann, Jahn
AU - Haeb-Umbach, Reinhold
ID - 11965
T2 - INTERSPEECH 2019, Graz, Austria
TI - Unsupervised training of neural mask-based beamforming
ER -
TY - CONF
AB - We propose a training scheme to train neural network-based source separation algorithms from scratch when parallel clean data is unavailable. In particular, we demonstrate that an unsupervised spatial clustering algorithm is sufficient to guide the training of a deep clustering system. We argue that previous work on deep clustering requires strong supervision and elaborate on why this is a limitation. We demonstrate that (a) the single-channel deep clustering system trained according to the proposed scheme alone is able to achieve a similar performance as the multi-channel teacher in terms of word error rates and (b) initializing the spatial clustering approach with the deep clustering result yields a relative word error rate reduction of 26% over the unsupervised teacher.
AU - Drude, Lukas
AU - Hasenklever, Daniel
AU - Haeb-Umbach, Reinhold
ID - 12874
T2 - ICASSP 2019, Brighton, UK
TI - Unsupervised Training of a Deep Clustering Model for Multichannel Blind Source Separation
ER -
TY - CONF
AB - Signal dereverberation using the Weighted Prediction Error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. First proposed as an iterative algorithm, follow-up works have reformulated it as a recursive least squares algorithm and thereby enabled its use in online applications. For this algorithm, the estimation of the power spectral density (PSD) of the anechoic signal plays an important role and strongly influences its performance. Recently, we showed that using a neural network PSD estimator leads to improved performance for online automatic speech recognition. This, however, comes at a price. To train the network, we require parallel data, i.e., utterances simultaneously available in clean and reverberated form. Here we propose to overcome this limitation by training the network jointly with the acoustic model of the speech recognizer. To be specific, the gradients computed from the cross-entropy loss between the target senone sequence and the acoustic model network output are backpropagated through the complex-valued dereverberation filter estimation to the neural network for PSD estimation. Evaluation on two databases demonstrates improved performance for online processing scenarios while imposing fewer requirements on the available training data and thus widening the range of applications.
AU - Heymann, Jahn
AU - Drude, Lukas
AU - Haeb-Umbach, Reinhold
AU - Kinoshita, Keisuke
AU - Nakatani, Tomohiro
ID - 12875
T2 - ICASSP 2019, Brighton, UK
TI - Joint Optimization of Neural Network-based WPE Dereverberation and Acoustic Model for Robust Online ASR
ER -
TY - JOUR
AB - In this paper, we present libDirectional, a MATLAB library for directional statistics and directional estimation. It supports a variety of commonly used distributions on the unit circle, such as the von Mises, wrapped normal, and wrapped Cauchy distributions. Furthermore, various distributions on higher-dimensional manifolds such as the unit hypersphere and the hypertorus are available. Based on these distributions, several recursive filtering algorithms in libDirectional allow estimation on these manifolds. The functionality is implemented in a clear, well-documented, and object-oriented structure that is both easy to use and easy to extend.
AU - Kurz, Gerhard
AU - Gilitschenski, Igor
AU - Pfaff, Florian
AU - Drude, Lukas
AU - Hanebeck, Uwe D.
AU - Haeb-Umbach, Reinhold
AU - Siegwart, Roland Y.
ID - 12876
JF - Journal of Statistical Software 89(4)
TI - Directional Statistics and Filtering Using libDirectional
ER -
TY - JOUR
AB - We formulate a generic framework for blind source separation (BSS), which allows integrating data-driven spectro-temporal methods, such as deep clustering and deep attractor networks, with physically motivated probabilistic spatial methods, such as complex angular central Gaussian mixture models. The integrated model exploits the complementary strengths of the two approaches to BSS: the strong modeling power of neural networks, which, however, is based on supervised learning, and the ease of unsupervised learning of the spatial mixture models whose few parameters can be estimated on as little as a single segment of a real mixture of speech. Experiments are carried out on both artificially mixed speech and true recordings of speech mixtures.
The experiments verify that the integrated models consistently outperform the individual components. We further extend the models to cope with noisy, reverberant speech and introduce a cross-domain teacher–student training where the mixture model serves as the teacher to provide training targets for the student neural network.
AU - Drude, Lukas
AU - Haeb-Umbach, Reinhold
ID - 12890
JF - IEEE Journal of Selected Topics in Signal Processing
TI - Integration of Neural Networks and Probabilistic Spatial Models for Acoustic Blind Source Separation
ER -
TY - CONF
AB - In this paper we consider human daily activity recognition using an acoustic sensor network (ASN) which consists of nodes distributed in a home environment. Assuming that the ASN is permanently recording, the vast majority of recordings is silence. Therefore, we propose to employ a computationally efficient two-stage sound recognition system, consisting of an initial sound activity detection (SAD) and a subsequent sound event classification (SEC), which is only activated once sound activity has been detected. We show how a low-latency activity detector with high temporal resolution can be trained from weak labels with low temporal resolution. We further demonstrate the advantage of using spatial features for the subsequent event classification task.
AU - Ebbers, Janek
AU - Drude, Lukas
AU - Haeb-Umbach, Reinhold
AU - Brendel, Andreas
AU - Kellermann, Walter
ID - 15796
T2 - CAMSAP 2019, Guadeloupe, West Indies
TI - Weakly Supervised Sound Activity Detection and Event Classification in Acoustic Sensor Networks
ER -
TY - CONF
AB - Signal dereverberation using the weighted prediction error (WPE) method has been proven to be an effective means to raise the accuracy of far-field speech recognition. But in its original formulation, WPE requires multiple iterations over a sufficiently long utterance, rendering it unsuitable for online low-latency applications.
Recently, two methods have been proposed to overcome this limitation. One utilizes a neural network to estimate the power spectral density (PSD) of the target signal and works in a block-online fashion. The other method relies on a rather simple PSD estimation which smoothes the observed PSD and utilizes a recursive formulation which enables it to work on a frame-by-frame basis. In this paper, we integrate a deep neural network (DNN) based estimator into the recursive frame-online formulation. We evaluate the performance of the recursive system with different PSD estimators in comparison to the block-online and offline variants on two distinct corpora: the REVERB challenge data, where the signal is mainly deteriorated by reverberation, and a database which combines WSJ and VoiceHome to also consider (directed) noise sources. The results show that although smoothing works surprisingly well, the more sophisticated DNN-based estimator shows promising improvements and shortens the performance gap between online and offline processing.
AU - Heymann, Jahn
AU - Drude, Lukas
AU - Haeb-Umbach, Reinhold
AU - Kinoshita, Keisuke
AU - Nakatani, Tomohiro
ID - 11835
T2 - IWAENC 2018, Tokyo, Japan
TI - Frame-Online DNN-WPE Dereverberation
ER -
TY - CONF
AB - The weighted prediction error (WPE) algorithm has proven to be a very successful dereverberation method for the REVERB challenge. Likewise, neural network based mask estimation for beamforming demonstrated very good noise suppression in the CHiME 3 and CHiME 4 challenges. Recently, it has been shown that this estimator can also be trained to perform dereverberation and denoising jointly. However, up to now a comparison of a neural beamformer and WPE has been missing, as has an investigation into a combination of the two. Therefore, we here provide an extensive evaluation of both and consequently propose variants to integrate deep neural network based beamforming with WPE.
For these integrated variants we identify a consistent word error rate (WER) reduction on two distinct databases. In particular, our study shows that deep learning based beamforming benefits from a model-based dereverberation technique (i.e., WPE) and vice versa. Our key findings are: (a) the more channels and noise are present, the larger the WER advantage of neural beamforming over WPE; (b) integrating WPE and a neural beamformer consistently outperforms all stand-alone systems.
AU - Drude, Lukas
AU - Boeddeker, Christoph
AU - Heymann, Jahn
AU - Kinoshita, Keisuke
AU - Delcroix, Marc
AU - Nakatani, Tomohiro
AU - Haeb-Umbach, Reinhold
ID - 11872
T2 - INTERSPEECH 2018, Hyderabad, India
TI - Integrating neural network based beamforming and weighted prediction error dereverberation
ER -
TY - CONF
AB - NARA-WPE is a Python software package providing implementations of the weighted prediction error (WPE) dereverberation algorithm. WPE has been shown to be a highly effective tool for speech dereverberation, improving both the perceptual quality of the signal and the recognition performance of downstream automatic speech recognition (ASR). It is suitable both for single-channel and multi-channel applications. The package consists of (1) a Numpy implementation which can easily be integrated into a custom Python toolchain, and (2) a TensorFlow implementation which allows integration into larger computational graphs and enables backpropagation through WPE to train more advanced front-ends. The package comprises an iterative offline (batch) version, a block-online version, and a frame-online version which can be used in moderately low-latency applications, e.g. digital speech assistants.
AU - Drude, Lukas
AU - Heymann, Jahn
AU - Boeddeker, Christoph
AU - Haeb-Umbach, Reinhold
ID - 11873
T2 - ITG 2018, Oldenburg, Germany
TI - NARA-WPE: A Python package for weighted prediction error dereverberation in Numpy and Tensorflow for online and offline processing
ER -
TY - CONF
AB - Deep clustering (DC) and deep attractor networks (DANs) are a data-driven approach to monaural blind source separation. Both approaches provide astonishing single-channel performance but have not yet been generalized to block-online processing. When separating speech in a continuous stream with a block-online algorithm, it needs to be determined in each block which of the output streams belongs to which speaker. In this contribution we solve this block permutation problem by introducing an additional speaker identification embedding into the DAN model structure. We motivate this model decision by analyzing the embedding topology of DC and DANs and show that DC and DANs themselves are not sufficient for speaker identification. This model structure (a) improves the signal-to-distortion ratio (SDR) over a DAN baseline and (b) provides up to 61% and up to 34% relative reduction in permutation error rate and re-identification error rate, respectively, compared to an i-vector baseline.
AU - Drude, Lukas
AU - von Neumann, Thilo
AU - Haeb-Umbach, Reinhold
ID - 12898
T2 - ICASSP 2018, Calgary, Canada
TI - Deep Attractor Networks for Speaker Re-Identification and Blind Source Separation
ER -