TY - CONF AB - In this work, we address the limited availability of large annotated databases for real-life audio event detection by utilizing the concept of transfer learning. This technique aims to transfer knowledge from a source domain to a target domain, even if source and target have different feature distributions and label sets. We hypothesize that all acoustic events share the same inventory of basic acoustic building blocks and differ only in the temporal order of these acoustic units. We then construct a deep neural network with convolutional layers for extracting the acoustic units and a recurrent layer for capturing the temporal order. Under the above hypothesis, transfer learning from a source to a target domain with a different acoustic event inventory is realized by transferring the convolutional layers from the source to the target domain. The recurrent layer is, however, learnt directly from the target domain. Experiments on the transfer from a synthetic source database to the real-life target database of DCASE 2016 demonstrate that transfer learning leads to improved detection performance on average. However, a successful transfer to events that are very different from those seen in the source domain could not be verified. AU - Arora, Prerna AU - Haeb-Umbach, Reinhold ID - 11717 T2 - IEEE 19th International Workshop on Multimedia Signal Processing (MMSP) TI - A Study on Transfer Learning for Acoustic Event Detection in a Real Life Scenario ER - TY - GEN AB - This report describes the computation of gradients by algorithmic differentiation for statistically optimum beamforming operations. In particular, the differentiation of complex-valued functions is a key component of this approach. Therefore, real-valued algorithmic differentiation is extended via the complex-valued chain rule.
In addition to the basic mathematical operations, the derivative of the eigenvalue problem with complex-valued eigenvectors is one of the key results of this report. The potential of this approach is shown with experimental results on the CHiME-3 challenge database. There, the beamforming task is used as a front-end for an ASR system. With the developed derivatives, a joint optimization of a speech enhancement and speech recognition system w.r.t. the recognition optimization criterion is possible. AU - Boeddeker, Christoph AU - Hanebrink, Patrick AU - Drude, Lukas AU - Heymann, Jahn AU - Haeb-Umbach, Reinhold ID - 11735 TI - On the Computation of Complex-valued Gradients with Application to Statistically Optimum Beamforming ER - TY - CONF AB - In this paper we show how a neural network for spectral mask estimation for an acoustic beamformer can be optimized by algorithmic differentiation. Using the beamformer output SNR as the objective function to maximize, the gradient is propagated through the beamformer all the way to the neural network, which provides the clean speech and noise masks from which the beamformer coefficients are estimated by eigenvalue decomposition. A key theoretical result is the derivative of an eigenvalue problem involving complex-valued eigenvectors. Experimental results on the CHiME-3 challenge database demonstrate the effectiveness of the approach. The tools developed in this paper are a key component for an end-to-end optimization of speech enhancement and speech recognition. AU - Boeddeker, Christoph AU - Hanebrink, Patrick AU - Drude, Lukas AU - Heymann, Jahn AU - Haeb-Umbach, Reinhold ID - 11736 T2 - Proc. IEEE Intl. Conf.
on Acoustics, Speech and Signal Processing (ICASSP) TI - Optimizing Neural-Network Supported Acoustic Beamforming by Algorithmic Differentiation ER - TY - CONF AB - The benefits of both a logarithmic spectral amplitude (LSA) estimation and a modeling in a generalized spectral domain (where short-time amplitudes are raised to a generalized power exponent, not restricted to magnitude or power spectrum) are combined in this contribution to achieve a better tradeoff between speech quality and noise suppression in single-channel speech enhancement. A novel gain function is derived to enhance the logarithmic generalized spectral amplitudes of noisy speech. Experiments on the CHiME-3 dataset show that it outperforms the well-known minimum mean squared error (MMSE) LSA gain function of Ephraim and Malah in terms of noise suppression by 1.4 dB, while the good speech quality of the MMSE-LSA estimator is maintained. AU - Chinaev, Aleksej AU - Haeb-Umbach, Reinhold ID - 11737 T2 - Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP) TI - A Generalized Log-Spectral Amplitude Estimator for Single-Channel Speech Enhancement ER - TY - CONF AB - Recent advances in discriminatively trained mask estimation networks to extract a single source utilizing beamforming techniques demonstrate that the integration of statistical models and deep neural networks (DNNs) is a promising approach for robust automatic speech recognition (ASR) applications. In this contribution we demonstrate how discriminatively trained embeddings on spectral features can be tightly integrated into statistical model-based source separation to separate and transcribe overlapping speech. Good generalization to unseen spatial configurations is achieved by estimating a statistical model at test time, while still leveraging discriminative training of deep clustering embeddings on a separate training set.
We formulate an expectation maximization (EM) algorithm which jointly estimates a model for deep clustering embeddings and complex-valued spatial observations in the short time Fourier transform (STFT) domain at test time. Extensive simulations confirm that the integrated model outperforms (a) a deep clustering model with a subsequent beamforming step and (b) an EM-based model with a beamforming step alone in terms of signal-to-distortion ratio (SDR) and perceptually motivated metric (PESQ) gains. ASR results on a reverberated dataset further show that the aforementioned gains translate to reduced word error rates (WERs) even in reverberant environments. AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 11754 T2 - INTERSPEECH 2017, Stockholm, Sweden TI - Tight integration of spatial and spectral features for BSS with Deep Clustering embeddings ER - TY - CONF AB - In this contribution we show how to exploit text data to support word discovery from audio input in an underresourced target language. Given audio, of which a certain amount is transcribed at the word level, and additional unrelated text data, the approach is able to learn a probabilistic mapping from acoustic units to characters and utilize it to segment the audio data into words without the need of a pronunciation dictionary. This is achieved by three components: an unsupervised acoustic unit discovery system, a supervised acoustic unit-to-grapheme converter, and a word discovery system, which is initialized with a language model trained on the text data. Experiments for multiple setups show that the initialization of the language model with text data improves the word segmentation performance by a large margin.
AU - Glarner, Thomas AU - Boenninghoff, Benedikt AU - Walter, Oliver AU - Haeb-Umbach, Reinhold ID - 11770 T2 - INTERSPEECH 2017, Stockholm, Sweden TI - Leveraging Text Data for Word Segmentation for Underresourced Languages ER - TY - CONF AB - This paper presents an end-to-end training approach for a beamformer-supported multi-channel ASR system. A neural network which estimates masks for a statistically optimum beamformer is jointly trained with a network for acoustic modeling. To update its parameters, we propagate the gradients from the acoustic model all the way through feature extraction and the complex-valued beamforming operation. Besides avoiding a mismatch between the front-end and the back-end, this approach also eliminates the need for stereo data, i.e., the parallel availability of clean and noisy versions of the signals. Instead, it can be trained with real noisy multichannel data only. Also, relying on the signal statistics for beamforming, the approach makes no assumptions on the configuration of the microphone array. We further observe a performance gain through joint training in terms of word error rate in an evaluation of the system on the CHiME 4 dataset. AU - Heymann, Jahn AU - Drude, Lukas AU - Boeddeker, Christoph AU - Hanebrink, Patrick AU - Haeb-Umbach, Reinhold ID - 11809 T2 - Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP) TI - BEAMNET: End-to-End Training of a Beamformer-Supported Multi-Channel ASR System ER - TY - JOUR AB - Acoustic beamforming can greatly improve the performance of Automatic Speech Recognition (ASR) and speech enhancement systems when multiple channels are available. We recently proposed a way to support the model-based Generalized Eigenvalue beamforming operation with a powerful neural network for spectral mask estimation. The enhancement system has a number of desirable properties.
In particular, no assumptions need to be made about the nature of the acoustic transfer function (e.g., being anechoic), nor does the array configuration need to be known. While the system was originally developed to enhance speech in noisy environments, we show in this article that it is also effective in suppressing reverberation, thus leading to a generic trainable multi-channel speech enhancement system for robust speech processing. To support this claim, we consider two distinct datasets: The CHiME 3 challenge, which features challenging real-world noise distortions, and the Reverb challenge, which focuses on distortions caused by reverberation. We evaluate the system both with respect to a speech enhancement and a recognition task. For the first task we propose a new way to cope with the distortions introduced by the Generalized Eigenvalue beamformer by renormalizing the target energy for each frequency bin, and measure its effectiveness in terms of the PESQ score. For the latter we feed the enhanced signal to a strong DNN back-end and achieve state-of-the-art ASR results on both datasets. We further experiment with different network architectures for spectral mask estimation: one small feed-forward network with only one hidden layer, one Convolutional Neural Network and one bi-directional Long Short-Term Memory network, showing that even a small network is capable of delivering significant performance improvements. AU - Heymann, Jahn AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 11811 JF - Computer Speech and Language TI - A Generic Neural Acoustic Beamforming Architecture for Robust Multi-Channel Speech Processing ER - TY - GEN AB - The invention relates to a building or enclosure termination opening and/or closing apparatus having communication signed or encrypted by means of a key, and to a method for operating such an apparatus.
To allow simple, convenient and secure use by exclusively authorised users, the apparatus comprises: a first and a second user terminal, with secure forwarding of a time-limited key from the first to the second user terminal being possible. According to an alternative, individual keys are generated from a user identification and a secret device key. AU - Jacob, Florian AU - Schmalenstroeer, Joerg ID - 12081 TI - Building or Enclosure Termination Closing and/or Opening Apparatus, and Method for Operating a Building or Enclosure Termination ER - TY - CONF AB - In this paper, we apply a high-resolution approach, i.e. the matrix pencil method (MPM), to the FMCW automotive radar system to separate neighboring targets, which share similar parameters, i.e. range, relative speed and azimuth angle, and cause overlapping in the radar spectrum. In order to adapt the 1D model of MPM to the 2D range-velocity spectrum and simultaneously limit the computational cost, some preprocessing steps are proposed to construct a novel separation algorithm. Finally, this algorithm is evaluated on both simulated and real data, and the results indicate a promising performance. AU - Fei, Tai AU - Grimm, Christopher AU - Farhoud, Ridha AU - Breddermann, Tobias AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold ID - 11763 T2 - IEEE International Conference on Microwaves, Communications, Antennas and Electronic Systems TI - A Novel Target Separation Algorithm Applied to The Two-Dimensional Spectrum for FMCW Automotive Radar Systems ER - TY - CONF AB - In this paper, we present a hypothesis test for the classification of moving targets in the sight of an automotive radar sensor. For this purpose, a statistical model of the relative velocity between a stationary target and the radar sensor has been developed. With respect to the statistical properties, a confidence interval is calculated and targets with a relative velocity lying outside this interval are classified as moving targets.
Compared to existing algorithms, our approach is able to give a robust classification independent of the number of observed moving targets and is characterized by an instantaneous classification, a simple parameterization of the model and an automatic calculation of the discriminating threshold. AU - Grimm, Christopher AU - Breddermann, Tobias AU - Farhoud, Ridha AU - Fei, Tai AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold ID - 11772 T2 - IEEE International Conference on Microwaves, Communications, Antennas and Electronic Systems (COMCAS) TI - Hypothesis Test for the Detection of Moving Targets in Automotive Radar ER - TY - CONF AB - Variational Autoencoders (VAEs) have been shown to provide efficient neural-network-based approximate Bayesian inference for observation models for which exact inference is intractable. Their extension, the so-called Structured VAE (SVAE), allows inference in the presence of both discrete and continuous latent variables. Inspired by this extension, we developed a VAE with Hidden Markov Models (HMMs) as latent models. We applied the resulting HMM-VAE to the task of acoustic unit discovery in a zero resource scenario. Starting from an initial model based on variational inference in an HMM with Gaussian Mixture Model (GMM) emission probabilities, the accuracy of the acoustic unit discovery could be significantly improved by the HMM-VAE. In doing so we were able to demonstrate for an unsupervised learning task what is well-known in the supervised learning case: Neural networks provide superior modeling power compared to GMMs. AU - Ebbers, Janek AU - Heymann, Jahn AU - Drude, Lukas AU - Glarner, Thomas AU - Haeb-Umbach, Reinhold AU - Raj, Bhiksha ID - 11759 T2 - INTERSPEECH 2017, Stockholm, Sweden TI - Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery ER - TY - CONF AB - Multi-channel speech enhancement algorithms rely on a synchronous sampling of the microphone signals.
This, however, cannot always be guaranteed, especially if the sensors are distributed in an environment. To avoid performance degradation, the sampling rate offset needs to be estimated and compensated for. In this contribution we extend the recently proposed coherence drift based method in two important directions. First, the increasing phase shift in the short-time Fourier transform domain is estimated from the coherence drift in a matched-filter-like fashion, where intermediate estimates are weighted by their instantaneous SNR. Second, an observed bias is removed by iterating between offset estimation and compensation by resampling a few times. The effectiveness of the proposed method is demonstrated by speech recognition results on the output of a beamformer with and without sampling rate offset compensation between the input channels. We compare MVDR and maximum-SNR beamformers in reverberant environments and further show that both benefit from a novel phase normalization, which we also propose in this contribution. AU - Schmalenstroeer, Joerg AU - Heymann, Jahn AU - Drude, Lukas AU - Boeddeker, Christoph AU - Haeb-Umbach, Reinhold ID - 11895 T2 - IEEE 19th International Workshop on Multimedia Signal Processing (MMSP) TI - Multi-Stage Coherence Drift Based Sampling Rate Synchronization for Acoustic Beamforming ER - TY - CONF AB - In this paper we present an algorithm for the detection of moving targets in sight of an automotive radar sensor which can handle distorted ego-velocity information. In situations where biased or no velocity information is provided by the ego-vehicle, the algorithm is able to estimate the ego-velocity with high accuracy based on previously detected stationary targets; this estimate is subsequently used for the target classification. Compared to existing ego-velocity algorithms, our approach provides fast and efficient inference without sacrificing practical classification accuracy.
Beyond that, the algorithm is characterized by a simple parameterization and few but appropriate model assumptions for highly accurate production automotive radar sensors. AU - Grimm, Christopher AU - Farhoud, Ridha AU - Fei, Tai AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold ID - 11773 T2 - IEEE Microwaves, Radar and Remote Sensing Symposium (MRRS) TI - Detection of Moving Targets in Automotive Radar with Distorted Ego-Velocity Information ER - TY - CONF AB - In this contribution we investigate a priori signal-to-noise ratio (SNR) estimation, a crucial component of a single-channel speech enhancement system based on spectral subtraction. The majority of the state-of-the-art a priori SNR estimators work in the power spectral domain, which is, however, not confirmed to be the optimal domain for the estimation. Motivated by the generalized spectral subtraction rule, we show how the estimation of the a priori SNR can be formulated in the so-called generalized SNR domain. This formulation allows us to generalize the widely used decision directed (DD) approach. An experimental investigation with different noise types reveals the superiority of the generalized DD approach over the conventional DD approach in terms of both the mean opinion score listening quality objective measure and the output global SNR in the medium to high input SNR regime, while we show that the power spectrum is the optimal domain for low SNR. We further develop a parameterization which adjusts the domain of estimation automatically according to the estimated input global SNR.
AU - Chinaev, Aleksej AU - Haeb-Umbach, Reinhold ID - 11738 KW - single-channel speech enhancement KW - a priori SNR estimation KW - generalized spectral subtraction T2 - INTERSPEECH 2016, San Francisco, USA TI - A Priori SNR Estimation Using a Generalized Decision Directed Approach ER - TY - CONF AB - This contribution introduces a novel causal a priori signal-to-noise ratio (SNR) estimator for single-channel speech enhancement. To exploit the advantages of the generalized spectral subtraction, a normalized α-order magnitude (NAOM) domain is introduced where an a priori SNR estimation is carried out. In this domain, the NAOM coefficients of noise and clean speech signals are modeled by a Weibull distribution and a Weibull mixture model (WMM), respectively. While the parameters of the noise model are calculated from the noise power spectral density estimates, the speech WMM parameters are estimated from the noisy signal by applying a causal Expectation-Maximization algorithm. Further, a maximum a posteriori estimate of the a priori SNR is developed. The experiments in different noisy environments show the superiority of the proposed estimator compared to the well-known decision-directed approach in terms of estimation error, estimator variance and speech quality of the enhanced signals when used for speech enhancement. AU - Chinaev, Aleksej AU - Heitkaemper, Jens AU - Haeb-Umbach, Reinhold ID - 11743 T2 - 12. ITG Fachtagung Sprachkommunikation (ITG 2016) TI - A Priori SNR Estimation Using Weibull Mixture Model ER - TY - CONF AB - Noise power spectral density (PSD) estimation is an indispensable component of speech spectral enhancement systems. In this paper we present a noise PSD tracking algorithm which employs a noise presence probability estimate delivered by a deep neural network (DNN). The algorithm provides a causal noise PSD estimate and can thus be used in speech enhancement systems for communication purposes.
An extensive performance comparison has been carried out with ten causal state-of-the-art noise tracking algorithms taken from the literature and categorized according to the applied techniques. The experiments showed that the proposed DNN-based noise PSD tracker outperforms all competing methods with respect to all tested performance measures, which include the noise tracking performance and the performance of a speech enhancement system employing the noise tracking component. AU - Chinaev, Aleksej AU - Heymann, Jahn AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 11744 T2 - 12. ITG Fachtagung Sprachkommunikation (ITG 2016) TI - Noise-Presence-Probability-Based Noise PSD Estimation by Using DNNs ER - TY - CONF AU - Drude, Lukas AU - Boeddeker, Christoph AU - Haeb-Umbach, Reinhold ID - 11751 T2 - Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP) TI - Blind Speech Separation based on Complex Spherical k-Mode Clustering ER - TY - CONF AB - Although complex-valued neural networks (CVNNs), i.e., networks which can operate with complex arithmetic, have been around for a while, they have not been reconsidered since the breakthrough of deep network architectures. This paper presents a critical assessment of whether the novel tool set of deep neural networks (DNNs) should be extended to complex-valued arithmetic. Indeed, with DNNs making inroads in speech enhancement tasks, the use of complex-valued input data, specifically the short-time Fourier transform coefficients, is an obvious consideration. In particular when it comes to performing tasks that heavily rely on phase information, such as acoustic beamforming, complex-valued algorithms are omnipresent. In this contribution we recapitulate backpropagation in CVNNs, develop complex-valued network elements, such as the split-rectified non-linearity, and compare real- and complex-valued networks on a beamforming task.
We find that CVNNs hardly provide a performance gain and conclude that the effort of developing complex-valued counterparts of the building blocks of modern deep or recurrent neural networks is hard to justify. AU - Drude, Lukas AU - Raj, Bhiksha AU - Haeb-Umbach, Reinhold ID - 11756 T2 - INTERSPEECH 2016, San Francisco, USA TI - On the appropriateness of complex-valued neural networks for speech enhancement ER - TY - CONF AB - This paper is concerned with speech presence probability estimation employing an explicit model of the temporal and spectral correlations of speech. An undirected graphical model is introduced, based on a Factor Graph formulation. It is shown that this undirected model cures some of the theoretical issues of an earlier directed graphical model. Furthermore, we formulate a message passing inference scheme based on an approximate graph factorization, identify this inference scheme as a particular message passing schedule based on the turbo principle, and suggest further alternative schedules. The experiments show an improved performance over speech presence probability estimation based on an IID assumption, and a slightly better performance of the turbo schedule over the alternatives. AU - Glarner, Thomas AU - Mahdi Momenzadeh, Mohammad AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 11771 T2 - 12. ITG Fachtagung Sprachkommunikation (ITG 2016) TI - Factor Graph Decoding for Speech Presence Probability Estimation ER - TY - CONF AU - Heymann, Jahn AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 11812 T2 - Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP) TI - Neural Network Based Spectral Mask Estimation for Acoustic Beamforming ER - TY - CONF AB - This contribution investigates Direction of Arrival (DoA) estimation using linearly arranged microphone arrays.
We develop a model for the DoA estimation error in a reverberant scenario and show the existence of a bias that is a consequence of two effects of the linear arrangement: first, the limited field of view (FoV) leads to a clipping of the measurements, and, second, the angular distribution of the signal energy of the reflections is non-uniform. Since both issues are a consequence of the linear arrangement of the sensors, the bias arises largely independent of the kind of DoA estimator. The experimental evaluation demonstrates the existence of the bias for a selected number of DoA estimation methods and proves that the prediction from the developed theoretical model matches the simulation results. AU - Jacob, Florian AU - Haeb-Umbach, Reinhold ID - 11829 T2 - 12. ITG Fachtagung Sprachkommunikation (ITG 2016) TI - On the Bias of Direction of Arrival Estimation Using Linear Microphone Arrays ER - TY - CONF AB - We present a system for the 4th CHiME challenge which significantly increases the performance for all three tracks with respect to the provided baseline system. The front-end uses a bi-directional Long Short-Term Memory (BLSTM)-based neural network to estimate signal statistics. These then steer a Generalized Eigenvalue beamformer. The back-end consists of a 22-layer deep Wide Residual Network and two extra BLSTM layers. Working on a whole utterance instead of frames allows us to refine Batch-Normalization. We also train our own BLSTM-based language model. Adding a discriminative speaker adaptation leads to further gains. The final system achieves a word error rate on the six channel real test data of 3.48%. For the two channel track we achieve 5.96% and for the one channel track 9.34%. This is the best reported performance on the challenge achieved by a single system, i.e., a configuration which does not combine multiple systems. At the same time, our system is independent of the microphone configuration.
We can thus use the same components for all three tracks. AU - Heymann, Jahn AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 11834 T2 - Computer Speech and Language TI - Wide Residual BLSTM Network with Discriminative Speaker Adaptation for Robust Speech Recognition ER - TY - JOUR AU - Kinoshita, Keisuke AU - Delcroix, Marc AU - Gannot, Sharon AU - Habets, Emanuel A. P. AU - Haeb-Umbach, Reinhold AU - Kellermann, Walter AU - Leutnant, Volker AU - Maas, Roland AU - Nakatani, Tomohiro AU - Raj, Bhiksha AU - Sehr, Armin AU - Yoshioka, Takuya ID - 11840 JF - EURASIP Journal on Advances in Signal Processing TI - A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research ER - TY - JOUR AB - Today, we are often surrounded by devices with one or more microphones, such as smartphones, laptops, and wireless microphones. If they are part of an acoustic sensor network, their distribution in the environment can be beneficially exploited for various speech processing tasks. However, applications like speaker localization, speaker tracking, and speech enhancement by beamforming avail themselves of the geometrical configuration of the sensors. Therefore, acoustic microphone geometry calibration has recently become a very active field of research. This article provides an application-oriented, comprehensive survey of existing methods for microphone position self-calibration, which will be categorized by the measurements they use and the scenarios they can calibrate. Selected methods will be evaluated comparatively with real-world recordings. AU - Plinge, Axel AU - Jacob, Florian AU - Haeb-Umbach, Reinhold AU - Fink, Gernot A. 
ID - 11886 IS - 4 JF - IEEE Signal Processing Magazine KW - Acoustic sensors KW - Microphones KW - Portable computers KW - Smart phones KW - Wireless communication KW - Wireless sensor networks SN - 1053-5888 TI - Acoustic Microphone Geometry Calibration: An overview and experimental evaluation of state-of-the-art algorithms VL - 33 ER - TY - CONF AB - This paper describes automatic speech recognition (ASR) systems developed jointly by RWTH, UPB and FORTH for the 1ch, 2ch and 6ch track of the 4th CHiME Challenge. In the 2ch and 6ch tracks the final system output is obtained by a Confusion Network Combination (CNC) of multiple systems. The Acoustic Model (AM) is a deep neural network based on Bidirectional Long Short-Term Memory (BLSTM) units. The systems differ by front ends and training sets used for the acoustic training. The model for the 1ch track is trained without any preprocessing. For each front end we trained and evaluated individual acoustic models. We compare the ASR performance of different beamforming approaches: a conventional superdirective beamformer [1] and an MVDR beamformer as in [2], where the steering vector is estimated based on [3]. Furthermore we evaluated a BLSTM supported Generalized Eigenvalue beamformer using NN-GEV [4]. The back end is implemented using RWTH's open-source toolkits RASR [5], RETURNN [6] and rwthlm [7]. We rescore lattices with a Long Short-Term Memory (LSTM) based language model. The overall best results are obtained by a system combination that includes the lattices from the system of UPB's submission [8]. Our final submission scored second in each of the three tracks of the 4th CHiME Challenge.
AU - Menne, Tobias AU - Heymann, Jahn AU - Alexandridis, Anastasios AU - Irie, Kazuki AU - Zeyer, Albert AU - Kitza, Markus AU - Golik, Pavel AU - Kulikov, Ilia AU - Drude, Lukas AU - Schlüter, Ralf AU - Ney, Hermann AU - Haeb-Umbach, Reinhold AU - Mouchtaris, Athanasios ID - 11908 T2 - Computer Speech and Language TI - The RWTH/UPB/FORTH System Combination for the 4th CHiME Challenge Evaluation ER - TY - CONF AB - In this paper we demonstrate an algorithm to learn words from speech using non-parametric Bayesian hierarchical models in an unsupervised setting. We exploit the assumption of a hierarchical structure of speech, namely the formation of spoken words as a sequence of phonemes. We employ the Nested Hierarchical Pitman-Yor Language Model, which allows an a priori unknown and possibly unlimited number of words. We assume the n-gram probabilities of words, the m-gram probabilities of phoneme sequences in words, and the phoneme sequences of the words themselves as latent variables to be learned. We evaluate the algorithm on a cross-language task using an existing speech recognizer trained on English speech to decode speech in the Xitsonga language supplied for the 2015 ZeroSpeech challenge. We apply the learning algorithm to the resulting phoneme graphs and achieve the highest token precision and F score compared to existing systems. AU - Walter, Oliver AU - Haeb-Umbach, Reinhold ID - 11920 T2 - 38th German Conference on Pattern Recognition (GCPR 2016) TI - Unsupervised Word Discovery from Speech using Bayesian Hierarchical Models ER - TY - CONF AB - In this paper we study the influence of directional radio patterns of Bluetooth low energy (BLE) beacons on smartphone localization accuracy and beacon network planning. A two-dimensional model of the power emission characteristic is derived from measurements of the radiation pattern of BLE beacons carried out in an RF chamber.
The Cramér-Rao lower bound (CRLB) for position estimation is then derived for this directional power emission model. With this lower bound on the RMS positioning error, the coverage of different beacon network configurations can be evaluated. For near-optimal network planning, an evolutionary optimization algorithm for finding the best beacon placement is presented. AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 11890 T2 - 24th European Signal Processing Conference (EUSIPCO 2016) TI - Investigations into Bluetooth Low Energy Localization Precision Limits ER - TY - CONF AB - Noise tracking is an important component of speech enhancement algorithms. Of the many noise trackers proposed, Minimum Statistics (MS) is a particularly popular one due to its simple parameterization and at the same time excellent performance. In this paper we propose to further reduce the number of MS parameters by giving an alternative derivation of an optimal smoothing constant. At the same time, the noise tracking performance is improved, as is demonstrated by experiments employing speech degraded by various noise types and at different SNR values. AU - Chinaev, Aleksej AU - Haeb-Umbach, Reinhold ID - 11739 KW - speech enhancement KW - noise tracking KW - optimal smoothing T2 - Interspeech 2015 TI - On Optimal Smoothing in Minimum Statistics Based Noise Tracking ER - TY - CONF AB - We present a semantic analysis technique for spoken input using Markov Logic Networks (MLNs). MLNs combine graphical models with first-order logic. They are particularly suitable for providing inference in the presence of inconsistent and incomplete data, which are typical of an automatic speech recognizer's (ASR) output in the presence of degraded speech. The target application is a speech interface to a home automation system to be operated by people with speech impairments, where the ASR output is particularly noisy.
In order to cater for dysarthric speech with non-canonical phoneme realizations, acoustic representations of the input speech are learned in an unsupervised fashion. While training data transcripts are not required for the acoustic model training, the MLN training does require supervision, albeit at a rather loose and abstract level. Results on two databases, one of them for dysarthric speech, show that MLN-based semantic analysis clearly outperforms baseline approaches employing non-negative matrix factorization, multinomial naive Bayes models, or support vector machines. AU - Despotovic, Vladimir AU - Walter, Oliver AU - Haeb-Umbach, Reinhold ID - 11748 T2 - INTERSPEECH 2015 TI - Semantic Analysis of Spoken Input using Markov Logic Networks ER - TY - CONF AB - This contribution presents a Direction of Arrival (DoA) estimation algorithm based on the complex Watson distribution to incorporate both phase and level differences of captured microphone array signals. The derived algorithm is reviewed in the context of the Generalized State Coherence Transform (GSCT) on the one hand and a kernel density estimation method on the other. A thorough simulative evaluation yields insight into parameter selection and provides details on the performance for both directional and omni-directional microphones. A comparison to the well-known Steered Response Power with Phase Transform (SRP-PHAT) algorithm and a state-of-the-art DoA estimator which explicitly accounts for aliasing shows, in particular, the advantages of the presented algorithm if inter-sensor level differences are indicative of the DoA, as with directional microphones.
AU - Drude, Lukas AU - Jacob, Florian AU - Haeb-Umbach, Reinhold ID - 11755 T2 - 23rd European Signal Processing Conference (EUSIPCO 2015) TI - DOA-Estimation based on a Complex Watson Kernel Method ER - TY - CONF AU - Heymann, Jahn AU - Drude, Lukas AU - Chinaev, Aleksej AU - Haeb-Umbach, Reinhold ID - 11810 T2 - Automatic Speech Recognition and Understanding Workshop (ASRU 2015) TI - BLSTM supported GEV Beamformer Front-End for the 3rd CHiME Challenge ER - TY - CONF AB - The parametric Bayesian Feature Enhancement (BFE) and a data-driven Denoising Autoencoder (DA) both bring performance gains in severe single-channel speech recognition conditions. The former can be adjusted to different conditions by an appropriate parameter setting, while the latter needs to be trained on conditions similar to the ones expected at decoding time, making it vulnerable to a mismatch between training and test conditions. We use a DNN backend and study reverberant ASR under three types of mismatch conditions: different room reverberation times, different speaker-to-microphone distances, and the difference between artificially reverberated data and recordings in a reverberant environment. We show that for these mismatch conditions BFE can provide the targets for a DA. This unsupervised adaptation provides a performance gain over the direct use of BFE and even makes it possible to compensate for the mismatch between real and simulated reverberant data. AU - Heymann, Jahn AU - Haeb-Umbach, Reinhold AU - Golik, P. AU - Schlueter, R.
ID - 11813 KW - codecs KW - signal denoising KW - speech recognition KW - Bayesian feature enhancement KW - denoising autoencoder KW - reverberant ASR KW - single-channel speech recognition KW - speaker to microphone distances KW - unsupervised adaptation KW - Adaptation models KW - Noise reduction KW - Reverberation KW - Speech KW - Speech recognition KW - Training KW - deep neural networks KW - feature enhancement KW - robust speech recognition T2 - Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on TI - Unsupervised Adaptation of a Denoising Autoencoder by Bayesian Feature Enhancement for Reverberant ASR under Mismatch Conditions ER - TY - JOUR AB - Joint audio-visual speaker tracking requires that the locations of microphones and cameras are known and that they are given in a common coordinate system. Sensor self-localization algorithms, however, are usually developed separately for either the acoustic or the visual modality and return their positions in a modality-specific coordinate system, often with an unknown rotation, scaling and translation between the two. In this paper we propose two techniques to determine the positions of acoustic sensors in a common coordinate system, based on audio-visual correlates, i.e., events that are localized by both microphones and cameras separately. The first approach maps the output of an acoustic self-calibration algorithm to the visual coordinate system by estimating rotation, scale and translation, while the second solves a joint system of equations with acoustic and visual directions of arrival as input. The evaluation of the two strategies reveals that joint calibration outperforms the mapping approach and achieves an overall calibration error of 0.20 m even in reverberant environments.
AU - Jacob, Florian AU - Haeb-Umbach, Reinhold ID - 11830 JF - ArXiv e-prints TI - Absolute Geometry Calibration of Distributed Microphone Arrays in an Audio-Visual Sensor Network ER - TY - BOOK AU - Li, Jinyu AU - Deng, Li AU - Haeb-Umbach, Reinhold AU - Gong, Y. ID - 11868 TI - Robust Automatic Speech Recognition ER - TY - CONF AB - Only a few studies exist on automatic emotion analysis of speech from children with Autism Spectrum Conditions (ASC). Out of these, some preliminary studies have recently focused on comparing the relevance of selected prosodic features against large sets of acoustic, spectral, and cepstral features; however, no study so far has provided a comparison of performance across different languages. The present contribution aims to fill this gap in the literature and provide insight through extensive evaluations carried out on three databases of prompted phrases collected in English, Swedish, and Hebrew, inducing nine emotion categories embedded in short stories. The datasets contain speech of children with ASC and typically developing children under the same conditions. We evaluate automatic diagnosis and recognition of emotions in atypical children's voices over the nine categories, including binary valence/arousal discrimination. AU - Marchi, Erik AU - Schuller, Bjoern AU - Baron-Cohen, Simon AU - Golan, Ofer AU - Boelte, Sven AU - Arora, Prerna AU - Haeb-Umbach, Reinhold ID - 11875 T2 - INTERSPEECH 2015 TI - Typicality and Emotion in the Voice of Children with Autism Spectrum Condition: Evidence Across Three Languages ER - TY - CONF AB - In this paper we present a source counting algorithm to determine the number of speakers in a speech mixture. In our proposed method, we model the histogram of estimated directions of arrival with a nonparametric Bayesian infinite Gaussian mixture model.
As an alternative to classical model selection criteria, and to avoid specifying the maximum number of mixture components in advance, a Dirichlet process prior is employed over the mixture components. This makes it possible to automatically determine the optimal number of mixture components that most probably models the observations. We demonstrate by experiments that this model outperforms a parametric approach using a finite Gaussian mixture model with a Dirichlet distribution prior over the mixture weights. AU - Walter, Oliver AU - Drude, Lukas AU - Haeb-Umbach, Reinhold ID - 11919 T2 - 40th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015) TI - Source Counting in Speech Mixtures by Nonparametric Bayesian Estimation of an Infinite Gaussian Mixture Model ER - TY - JOUR AB - Besides the core learning algorithm itself, one major question in machine learning is how best to encode given training data such that the learning technology can learn efficiently from it and generalize to novel data. While classical approaches often rely on a hand-coded data representation, the topic of autonomous representation or feature learning plays a major role in modern learning architectures. The goal of this contribution is to give an overview of different principles of autonomous feature learning, and to exemplify two of these principles based on two recent examples: autonomous metric learning for sequences, and autonomous learning of a deep representation for spoken language, respectively. AU - Walter, Oliver AU - Haeb-Umbach, Reinhold AU - Mokbel, Bassam AU - Paassen, Benjamin AU - Hammer, Barbara ID - 11922 JF - KI - Kuenstliche Intelligenz KW - Representation learning KW - Metric learning KW - Deep representation KW - Spoken language TI - Autonomous Learning of Representations ER - TY - GEN AB - In this paper we show that recently developed algorithms for unsupervised word segmentation can be a valuable tool for the documentation of endangered languages.
We applied an unsupervised word segmentation algorithm based on a nested Pitman-Yor language model to two Austronesian languages, Wooi and Waima'a. The algorithm was then modified and parameterized to cater to the needs of linguists for high precision of lexical discovery: we obtained a lexicon precision of 69.2% and 67.5% for Wooi and Waima'a, respectively, if single-letter words and words found fewer than three times were discarded. A comparison with an English word segmentation task showed comparable performance, verifying that the assumptions underlying the Pitman-Yor language model, the universality of Zipf's law and the power of n-gram structures, also hold for languages as exotic as Wooi and Waima'a. AU - Walter, Oliver AU - Haeb-Umbach, Reinhold AU - Strunk, Jan AU - P. Himmelmann, Nikolaus ID - 11923 TI - Lexicon Discovery for Language Preservation using Unsupervised Word Segmentation with Pitman-Yor Language Models (FGNT-2015-01) ER - TY - CONF AU - Hoang, Manh Kha AU - Schmalenstroeer, Joerg AU - Haeb-Umbach, Reinhold ID - 11874 T2 - 40th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015) TI - Aligning training models with smartphone properties in WiFi fingerprinting based indoor localization ER - TY - CONF AB - A method for nonstationary noise robust automatic speech recognition (ASR) is to first estimate the changing noise statistics and then clean up the features prior to recognition accordingly. Here, the first step is accomplished by noise tracking in the spectral domain, while the second relies on Bayesian enhancement in the feature domain. In this way we take advantage of our recently proposed maximum a-posteriori based (MAP-B) noise power spectral density estimation algorithm, which is able to estimate the noise statistics even in time-frequency bins dominated by speech.
We show that MAP-B noise tracking leads to an improved noise model estimate in the feature domain compared to estimating the noise in speech absence periods only, if the bias resulting from the nonlinear transformation from the spectral to the feature domain is accounted for. Consequently, ASR results are improved, as is shown by experiments conducted on the Aurora IV database. AU - Chinaev, Aleksej AU - Puels, Marc AU - Haeb-Umbach, Reinhold ID - 11746 T2 - 11. ITG Fachtagung Sprachkommunikation (ITG 2014) TI - Spectral Noise Tracking for Improved Nonstationary Noise Robust ASR ER - TY - CONF AB - In this contribution we derive a variational EM (VEM) algorithm for model selection in complex Watson mixture models, which have recently been proposed as a model of the distribution of normalized microphone array signals in the short-time Fourier transform domain. The VEM algorithm is applied to count the number of active sources in a speech mixture by iteratively estimating the mode vectors of the Watson distributions and suppressing the signals from the corresponding directions. A key theoretical contribution is the derivation of the MMSE estimate of a quadratic form involving the mode vector of the Watson distribution. The experimental results demonstrate the effectiveness of the source counting approach at moderately low SNR. It is further shown that the VEM algorithm is more robust w.r.t. the threshold values used. AU - Drude, Lukas AU - Chinaev, Aleksej AU - Tran Vu, Dang Hai AU - Haeb-Umbach, Reinhold ID - 11752 T2 - 39th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014) TI - Source Counting in Speech Mixtures Using a Variational EM Approach for Complex Watson Mixture Models ER - TY - CONF AB - This contribution describes a step-wise source counting algorithm to determine the number of speakers in an offline scenario.
Each speaker is identified by a variational expectation maximization (VEM) algorithm for complex Watson mixture models, which therefore directly yields beamforming vectors for a subsequent speech separation process. An observation selection criterion is proposed which improves the robustness of the source counting in noise. The algorithm is compared to an alternative VEM approach with Gaussian mixture models based on directions of arrival and is shown to deliver improved source counting accuracy. The article concludes by extending the offline algorithm towards a low-latency online estimation of the number of active sources from the streaming input data. AU - Drude, Lukas AU - Chinaev, Aleksej AU - Tran Vu, Dang Hai AU - Haeb-Umbach, Reinhold ID - 11753 KW - Accuracy KW - Acoustics KW - Estimation KW - Mathematical model KW - Source separation KW - Speech KW - Vectors KW - Bayes methods KW - Blind source separation KW - Directional statistics KW - Number of speakers KW - Speaker diarization T2 - 14th International Workshop on Acoustic Signal Enhancement (IWAENC 2014) TI - Towards Online Source Counting in Speech Mixtures Applying a Variational EM for Complex Watson Mixture Models ER - TY - CONF AB - In this paper we present an algorithm for the unsupervised segmentation of a lattice produced by a phoneme recognizer into words. Using a lattice rather than a single phoneme string accounts for the uncertainty of the recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model (LM) is known. We propose a computationally efficient iterative approach, which alternates between the following two steps: First, the most probable string is extracted from the lattice using a phoneme LM learned on the segmentation result of the previous iteration.
Second, word segmentation is performed on the extracted string using a word and phoneme LM which is learned alongside the new segmentation. We present results on lattices produced by a phoneme recognizer on the WSJCAM0 dataset. We show that our approach delivers better segmentation performance than an earlier approach found in the literature, in particular for higher-order language models. AU - Heymann, Jahn AU - Walter, Oliver AU - Haeb-Umbach, Reinhold AU - Raj, Bhiksha ID - 11814 T2 - 39th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014) TI - Iterative Bayesian Word Segmentation for Unsupervised Vocabulary Discovery from Phoneme Lattices ER - TY - CONF AB - Several self-localization algorithms have been proposed that determine the positions of either acoustic or visual sensors autonomously. Usually these positions are given in a modality-specific coordinate system, with an unknown rotation, translation and scale between the different systems. For joint audiovisual tracking, where the different modalities support each other, the two modalities need to be mapped into a common coordinate system. In this paper we propose to estimate this mapping based on audiovisual correlates, i.e., a speaker that can be localized by both a microphone network and a camera network separately. The voice is tracked by a microphone network, which first had to be calibrated by a self-localization algorithm, and the head is tracked by a calibrated camera network. Unlike existing Singular Value Decomposition based approaches to estimating the coordinate system mapping, we propose to perform the estimation in the shape domain, which turns out to be computationally more efficient. Simulations of the self-localization of an acoustic sensor network and a subsequent coordinate mapping for joint speaker localization showed a significant improvement in localization performance, since the modalities were able to support each other.
AU - Jacob, Florian AU - Haeb-Umbach, Reinhold ID - 11831 T2 - 11. ITG Fachtagung Sprachkommunikation (ITG 2014) TI - Coordinate Mapping Between an Acoustic and Visual Sensor Network in the Shape Domain for a Joint Self-Calibrating Speaker Tracking ER - TY - JOUR AB - In this contribution we present a theoretical and experimental investigation into the effects of reverberation and noise on features in the logarithmic mel power spectral domain, an intermediate stage in the computation of the mel frequency cepstral coefficients, prevalent in automatic speech recognition (ASR). Gaining insight into the complex interaction between clean speech, noise, and noisy reverberant speech features is essential for any ASR system to be robust against noise and reverberation present in distant microphone input signals. The findings are gathered in a probabilistic formulation of an observation model which may be used in model-based feature compensation schemes. The proposed observation model extends previous models in three major directions: First, the contribution of additive background noise to the observation error is explicitly taken into account. Second, an energy compensation constant is introduced which ensures an unbiased estimate of the reverberant speech features, and, third, a recursive variant of the observation model is developed resulting in reduced computational complexity when used in model-based feature compensation. The experimental section is used to evaluate the accuracy of the model and to describe how its parameters can be determined from test data. 
AU - Leutnant, Volker AU - Krueger, Alexander AU - Haeb-Umbach, Reinhold ID - 11861 IS - 1 JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing KW - computational complexity KW - reverberation KW - speech recognition KW - automatic speech recognition KW - background noise KW - clean speech KW - energy compensation KW - logarithmic mel power spectral domain KW - mel frequency cepstral coefficients KW - microphone input signals KW - model-based feature compensation schemes KW - noisy reverberant speech automatic recognition KW - noisy reverberant speech features KW - Atmospheric modeling KW - Computational modeling KW - Noise KW - Noise measurement KW - Reverberation KW - Speech KW - Vectors KW - Model-based feature compensation KW - observation model for reverberant and noisy speech KW - recursive observation model KW - robust automatic speech recognition SN - 2329-9290 TI - A New Observation Model in the Logarithmic Mel Power Spectral Domain for the Automatic Recognition of Noisy Reverberant Speech VL - 22 ER - TY - JOUR AB - New waves of consumer-centric applications, such as voice search and voice interaction with mobile devices and home entertainment systems, increasingly require automatic speech recognition (ASR) to be robust to the full range of real-world noise and other acoustic distorting conditions. Despite its practical importance, however, the inherent links between and distinctions among the myriad of methods for noise-robust ASR have yet to be carefully studied in order to advance the field further. To this end, it is critical to establish a solid, consistent, and common mathematical foundation for noise-robust ASR, which is lacking at present. This article is intended to fill this gap and to provide a thorough overview of modern noise-robust techniques for ASR developed over the past 30 years.
We emphasize methods that are proven to be successful and that are likely to sustain or expand their future applicability. We distill key insights from our comprehensive overview in this field and take a fresh look at a few old problems, which nevertheless are still highly relevant today. Specifically, we have analyzed and categorized a wide range of noise-robust techniques using five different criteria: 1) feature-domain vs. model-domain processing, 2) the use of prior knowledge about the acoustic environment distortion, 3) the use of explicit environment-distortion models, 4) deterministic vs. uncertainty processing, and 5) the use of acoustic models trained jointly with the same feature enhancement or model adaptation process used in the testing stage. With this taxonomy-oriented review, we equip the reader with the insight to choose among techniques and with an awareness of the performance-complexity tradeoffs. The pros and cons of using different noise-robust ASR techniques in practical application scenarios are provided as a guide to interested practitioners. The current challenges and future research directions in this field are also carefully analyzed. AU - Li, Jinyu AU - Deng, Li AU - Gong, Yifan AU - Haeb-Umbach, Reinhold ID - 11867 IS - 4 JF - IEEE Transactions on Audio, Speech and Language Processing KW - Speech recognition KW - compensation KW - distortion modeling KW - joint model training KW - noise KW - robustness KW - uncertainty processing TI - An Overview of Noise-Robust Automatic Speech Recognition VL - 22 ER - TY - CONF AB - In this paper, we investigate unsupervised acoustic model training approaches for dysarthric-speech recognition. These models are, first, frame-based Gaussian posteriorgrams obtained from Vector Quantization (VQ); second, so-called Acoustic Unit Descriptors (AUDs), which are hidden Markov models of phone-like units trained in an unsupervised fashion; and, third, posteriorgrams computed on the AUDs.
Experiments were carried out on a database collected from a home automation task and containing nine speakers, of whom seven are considered to utter dysarthric speech. All unsupervised modeling approaches delivered significantly better recognition rates than a speaker-independent phoneme recognition baseline, showing the suitability of unsupervised acoustic model training for dysarthric speech. While the AUD models led to the most compact representation of an utterance for the subsequent semantic inference stage, posteriorgram-based representations resulted in higher recognition rates, with the Gaussian posteriorgram achieving the highest slot filling F-score of 97.02%. AU - Walter, Oliver AU - Despotovic, Vladimir AU - Haeb-Umbach, Reinhold AU - Gemmeke, Jort AU - Ons, Bart AU - Van hamme, Hugo ID - 11918 KW - unsupervised learning KW - acoustic unit descriptors KW - dysarthric speech KW - non-negative matrix factorization T2 - INTERSPEECH 2014 TI - An Evaluation of Unsupervised Acoustic Model Training for a Dysarthric Speech Interface ER - TY - JOUR AB - In this paper we present an approach for synchronizing a wireless acoustic sensor network using a two-stage procedure. First, the clock frequency and phase differences between pairs of nodes are estimated employing a two-way message exchange protocol. The estimates are further improved in a Kalman filter with a dedicated observation error model. In the second stage, network-wide synchronization is achieved by means of a gossiping algorithm which estimates the average clock frequency and phase of the sensor nodes. These averages are viewed as the frequency and phase of a virtual master clock, to which the clocks of the sensor nodes have to be adjusted. The amount of adjustment is computed in a specific control loop. While these steps are done in software, the actual sampling rate correction is carried out in hardware by using an adjustable frequency synthesizer.
Experimental results obtained from hardware devices and from software simulations of large-scale networks are presented. AU - Schmalenstroeer, Joerg AU - Jebramcik, Patrick AU - Haeb-Umbach, Reinhold ID - 11898 JF - Signal Processing KW - Gossip algorithm SN - 0165-1684 TI - A combined hardware-software approach for acoustic sensor network synchronization ER - TY - CONF AB - In this paper we present an approach for synchronizing the sampling clocks of distributed microphones over a wireless network. The proposed system uses a two-stage procedure. It first employs a two-way message exchange algorithm to estimate the clock phase and frequency difference between two nodes and then uses a gossiping algorithm to estimate a virtual master clock, to which all sensor nodes synchronize. Simulation results are presented for networks of different topology and size, showing the effectiveness of our approach. AU - Schmalenstroeer, Joerg AU - Jebramcik, Patrick AU - Haeb-Umbach, Reinhold ID - 11897 T2 - 39th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2014) TI - A Gossiping Approach to Sampling Clock Synchronization in Wireless Acoustic Sensor Networks ER -