TY - CONF
AB - The parametric Bayesian Feature Enhancement (BFE) and a data-driven Denoising Autoencoder (DA) both bring performance gains in severe single-channel speech recognition conditions. The former can be adjusted to different conditions by an appropriate parameter setting, while the latter needs to be trained on conditions similar to those expected at decoding time, making it vulnerable to a mismatch between training and test conditions. We use a DNN back-end and study reverberant ASR under three types of mismatch conditions: different room reverberation times, different speaker-to-microphone distances, and the difference between artificially reverberated data and recordings made in a reverberant environment. We show that for these mismatch conditions BFE can provide the targets for a DA. This unsupervised adaptation yields a performance gain over the direct use of BFE and even makes it possible to compensate for the mismatch between real and simulated reverberant data.
AU - Heymann, Jahn
AU - Haeb-Umbach, Reinhold
AU - Golik, P.
AU - Schlueter, R.
ID - 11813
KW - codecs
KW - signal denoising
KW - speech recognition
KW - Bayesian feature enhancement
KW - denoising autoencoder
KW - reverberant ASR
KW - single-channel speech recognition
KW - speaker-to-microphone distances
KW - unsupervised adaptation
KW - Adaptation models
KW - Noise reduction
KW - Reverberation
KW - Speech
KW - Speech recognition
KW - Training
KW - deep neural networks
KW - feature enhancement
KW - robust speech recognition
T2 - 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Unsupervised adaptation of a denoising autoencoder by Bayesian Feature Enhancement for reverberant ASR under mismatch conditions
ER -
TY - CONF
AB - Recently, substantial progress has been made in the field of reverberant speech signal processing, including both single- and multichannel dereverberation techniques and automatic speech recognition (ASR) techniques robust to reverberation. To evaluate state-of-the-art algorithms and obtain new insights regarding potential future research directions, we propose a common evaluation framework comprising datasets, tasks, and evaluation metrics for both speech enhancement and ASR techniques. The proposed framework will be used as a common basis for the REVERB (REverberant Voice Enhancement and Recognition Benchmark) challenge. This paper describes the rationale behind the challenge and provides a detailed description of the evaluation framework and benchmark results.
AU - Kinoshita, Keisuke
AU - Delcroix, Marc
AU - Yoshioka, Takuya
AU - Nakatani, Tomohiro
AU - Habets, Emanuel
AU - Haeb-Umbach, Reinhold
AU - Leutnant, Volker
AU - Sehr, Armin
AU - Kellermann, Walter
AU - Maas, Roland
AU - Gannot, Sharon
AU - Raj, Bhiksha
ID - 11841
KW - Reverberant speech
KW - dereverberation
KW - ASR
KW - evaluation
KW - challenge
T2 - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
TI - The REVERB challenge: a common evaluation framework for dereverberation and recognition of reverberant speech
ER -
TY - JOUR
AB - In this paper, we present a new technique for automatic speech recognition (ASR) in reverberant environments. Our approach is aimed at the enhancement of the logarithmic Mel power spectrum, which is computed at an intermediate stage to obtain the widely used Mel frequency cepstral coefficients (MFCCs). Given the reverberant logarithmic Mel power spectral coefficients (LMPSCs), a minimum mean square error estimate of the clean LMPSCs is computed by carrying out Bayesian inference. We employ switching linear dynamical models as an a priori model for the dynamics of the clean LMPSCs. Further, we derive a stochastic observation model which relates the clean to the reverberant LMPSCs through a simplified model of the room impulse response (RIR). This model requires only two parameters, namely the RIR energy and the reverberation time, which can be estimated from the captured microphone signal. The performance of the proposed enhancement technique is studied on the AURORA5 database and compared to that of constrained maximum-likelihood linear regression (CMLLR). Experimental results show that our approach significantly outperforms CMLLR and that up to 80% of the errors caused by the reverberation are recovered. In addition to being compatible with the standard MFCC feature vectors, the approach leaves the ASR back-end unchanged.
It is of moderate computational complexity and suitable for real-time applications.
AU - Krueger, Alexander
AU - Haeb-Umbach, Reinhold
ID - 11846
IS - 7
JF - IEEE Transactions on Audio, Speech, and Language Processing
KW - ASR
KW - AURORA5 database
KW - automatic speech recognition
KW - Bayesian inference
KW - belief networks
KW - CMLLR
KW - computational complexity
KW - constrained maximum likelihood linear regression
KW - least mean squares methods
KW - LMPSC computation
KW - logarithmic Mel power spectrum
KW - maximum likelihood estimation
KW - Mel frequency cepstral coefficients
KW - MFCC feature vectors
KW - microphone signal
KW - minimum mean square error estimation
KW - model-based feature enhancement
KW - regression analysis
KW - reverberant speech recognition
KW - reverberation
KW - RIR energy
KW - room impulse response
KW - speech recognition
KW - stochastic observation model
KW - stochastic processes
TI - Model-Based Feature Enhancement for Reverberant Speech Recognition
VL - 18
ER -
TY - JOUR
AB - In this paper, we derive an uncertainty decoding rule for automatic speech recognition (ASR) which accounts for both corrupted observations and inter-frame correlation. The conditional independence assumption, prevalent in hidden Markov model-based ASR, is relaxed to obtain a clean speech posterior that is conditioned on the complete observed feature vector sequence. This is a more informative posterior than one conditioned only on the current observation. The novel decoding rule is used to obtain a transmission-error robust remote ASR system, where the speech capturing unit is connected to the decoder via an error-prone communication network. We show how the clean speech posterior can be computed for communication links characterized by either bit errors or packet loss. Recognition results are presented for both distributed and network speech recognition, where in the latter case common voice-over-IP codecs are employed.
AU - Ion, Valentin
AU - Haeb-Umbach, Reinhold
ID - 11820
IS - 5
JF - IEEE Transactions on Audio, Speech, and Language Processing
KW - automatic speech recognition
KW - bit errors
KW - codecs
KW - communication links
KW - corrupted observations
KW - decoding
KW - distributed speech recognition
KW - error-prone communication network
KW - feature vector sequence
KW - hidden Markov model-based ASR
KW - hidden Markov models
KW - inter-frame correlation
KW - Internet telephony
KW - network speech recognition
KW - packet loss
KW - speech posterior
KW - speech recognition
KW - transmission error robust speech recognition
KW - uncertainty decoding
KW - voice-over-IP codecs
TI - A Novel Uncertainty Decoding Rule With Applications to Transmission Error Robust Speech Recognition
VL - 16
ER -