TY - CONF AB - In this paper we present a comparison of the recently proposed Soft-Feature Distributed Speech Recognition (SFDSR) with the two evaluated candidate codecs for Speech Enabled Services over wireless networks: Adaptive Multirate Codec (AMR) and the ETSI Extended Advanced Front-End for Distributed Speech Recognition (XAFE). It is shown that SFDSR achieves the best recognition performance on a simulated GSM transmission, followed by XAFE and AMR. We also present some new results concerning SFDSR which demonstrate the versatility of the approach. Further, a simple method is introduced which considerably reduces the computational effort. AU - Ion, Valentin AU - Haeb-Umbach, Reinhold ID - 11828 KW - adaptive codes KW - adaptive multirate codec KW - AMR KW - distributed speech recognition KW - ETSI KW - extended advanced front-end KW - recognition performance KW - SFDSR KW - simulated GSM transmission KW - soft-feature distributed speech recognition KW - speech codecs KW - speech coding KW - speech recognition KW - variable rate codes KW - XAFE T2 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2005) TI - A Comparison of Soft-Feature Distributed Speech Recognition with Candidate Codecs for Speech Enabled Mobile Services VL - 1 ER - TY - CONF AB - For human-machine interfaces in distant-talking environments multichannel signal processing is often employed to obtain an enhanced signal for subsequent processing. In this paper we propose a novel adaptation algorithm for a filter-and-sum beamformer to adjust the coefficients of FIR filters to changing acoustic room impulses, e.g. due to speaker movement. A deterministic and a stochastic gradient ascent algorithm are derived from a constrained optimization problem, which iteratively estimates the eigenvector corresponding to the largest eigenvalue of the cross power spectral density of the microphone signals. 
The method does not require an explicit estimation of the speaker location. The experimental results show fast adaptation and excellent robustness of the proposed algorithm. AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold ID - 11930 KW - acoustic filter-and-sum beamforming KW - acoustic room impulses KW - acoustic signal processing KW - adaptive principal component analysis KW - adaptive signal processing KW - architectural acoustics KW - constrained optimization problem KW - cross power spectral density KW - deterministic algorithm KW - deterministic algorithms KW - distant-talking environments KW - eigenvalues and eigenfunctions KW - eigenvector KW - enhanced signal KW - filter-and-sum beamformer KW - FIR filter coefficients KW - FIR filters KW - gradient methods KW - human-machine interfaces KW - iterative estimation KW - iterative methods KW - largest eigenvalue KW - microphone signals KW - multichannel signal processing KW - optimisation KW - principal component analysis KW - spectral analysis KW - stochastic gradient ascent algorithm KW - stochastic processes T2 - IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2005) TI - Acoustic filter-and-sum beamforming by adaptive principal component analysis VL - 4 ER - TY - CONF AU - Haeb-Umbach, Reinhold AU - Schmalenstroeer, Joerg ID - 11802 T2 - Interspeech 2005 TI - Speech Processing in the Networked Home Environment - A View on the Amigo Project ER - TY - CONF AU - Haeb-Umbach, Reinhold AU - Schmalenstroeer, Joerg ID - 11801 T2 - Interspeech 2005 TI - A Comparison of Particle Filtering Variants for Speech Feature Enhancement ER - TY - JOUR AB - Satellite positioning systems, such as GPS or the future European system Galileo, employ direct-sequence spread-spectrum signals. The positioning accuracy is strongly affected by the quality of the pseudo range measurements. 
These measurements necessitate code and carrier synchronization of the received signal with the internally generated reference signals. In this type of system, one major error source is the multipath phenomenon, which results in a sum of delayed and weighted copies of the original signal being present at the receiver input. This can cause a systematic error in the code tracking loop, resulting in range errors on the order of several tens of meters. In this paper we propose an extension of the standard code tracking loop capable of estimating the parameters of the line-of-sight (LOS) signal and separating the LOS from the reflected signal portions. It is based on an analysis of the cross correlation of the received signal with a locally generated code sequence in the vicinity of the tracking point of a Delay-Locked Loop (DLL). For this reason, we call this method Cross Correlation Function (CCF) Analysis. The proposed method achieves considerably more accurate estimates than a DLL. Its performance is comparable to the Multipath Estimating Delay-Locked Loop (MEDLL), which is so far considered to be the best method for reducing multipath-induced errors. However, the computational complexity of the CCF Analysis is smaller by a factor of three than that of the MEDLL. Extensive simulations have been conducted for the proposed method and the MEDLL in order to assess the robustness of the two approaches under various signal constellations. AU - Bischoff, R. AU - Haeb-Umbach, Reinhold AU - Nammi, Sai Ramesh ID - 11732 IS - 1 JF - AEUe, Int. Journal on Electronics and Communications TI - Multipath-Resistant Time of Arrival Estimation for Satellite Positioning VL - 58 ER - TY - CONF AB - A major drawback of distributed versus terminal-based speech recognition is the fact that transmission errors can lead to degraded recognition performance. 
In this paper we employ soft features to mitigate the effect of bit errors on wireless transmission links: At the receiver a posteriori probabilities of the transmitted feature vectors are computed by combining bit reliability information provided by the channel decoder and a priori knowledge about residual redundancy in the feature vectors. While the first-order moment of the a posteriori probability function is the MMSE estimate, the second-order moment is a measure of the uncertainty in the reconstructed features. We conducted realistic simulations of GSM transmission and achieved significant improvements in word accuracy compared to the error mitigation strategy described in the ETSI standard. AU - Haeb-Umbach, Reinhold AU - Ion, Valentin ID - 11790 T2 - International Conference on Spoken Language Processing (ICSLP 2004) TI - Soft Features for Improved Distributed Speech Recognition over Wireless Networks ER - TY - CONF AB - The paper is concerned with binaural signal processing for a bimodal human-robot interface with hearing and vision. The two microphone signals are processed to obtain an enhanced single-channel input signal for the subsequent speech recognizer and to localize the acoustic source, an important piece of information for establishing natural human-robot communication. We utilize a robust adaptive algorithm for filter-and-sum beamforming (FSB) and extract speaker direction information from the resulting FIR filter coefficients. Further, particle filtering is applied which conducts a nonlinear Bayesian tracking of speaker movement. Good location accuracy can be achieved even in highly reverberant environments. The results obtained outperform the conventional generalized cross correlation (GCC) method. 
AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold ID - 11931 KW - bimodal human-robot interface KW - binaural signal processing KW - enhanced single-channel input signal KW - filter-and-sum beamforming KW - filtering theory KW - FIR filter coefficient KW - generalized cross correlation method KW - microphones KW - microphone signal KW - nonlinear Bayesian tracking KW - particle filtering KW - robust adaptive algorithm KW - robust speaker direction estimation KW - signal processing KW - speech enhancement KW - speech recognition KW - speech recognizer KW - user interfaces T2 - IEEE Workshop on Multimedia Signal Processing (MMSP 2004) TI - Robust speaker direction estimation with particle filtering ER - TY - CONF AB - While the main objective of adaptive Filter-and-Sum beamforming is to obtain an enhanced speech signal for subsequent processing like speech recognition, we show how speaker localization information can be derived from the filter coefficients. To increase localization accuracy, speaker tracking is performed by non-linear Bayesian state estimation, which is realized by sequential Monte Carlo methods. Improved acquisition and tracking performance was achieved even in highly reverberant environments, in comparison with both a Kalman Filter and a recently proposed Particle Filter operating on the output of a nonadaptive Delay-and-Sum beamformer. AU - Warsitz, Ernst AU - Haeb-Umbach, Reinhold AU - Peschke, Sven ID - 11932 T2 - International Conference on Spoken Language Processing (ICSLP 2004) TI - Adaptive Beamforming Combined with Particle Filtering for Acoustic Source Localization ER - TY - JOUR AU - Haeb-Umbach, Reinhold ID - 11777 JF - Forschungsforum Paderborn TI - Auf ein Wort - Moeglichkeiten und Grenzen der automatischen Spracherkennung ER - TY - JOUR AB - Automatic speech recognition of real-life broadcast news (BN) data (Hub-4) has become a challenging research topic in recent years. 
This paper summarizes our key efforts to build a large vocabulary continuous speech recognition system for the heterogeneous BN task without incurring undesired complexity or computational cost. These key efforts included: - automatic segmentation of the audio signal into speech utterances; - efficient one-pass trigram decoding using look-ahead techniques; - optimal log-linear interpolation of a variety of acoustic and language models using discriminative model combination (DMC); - handling short-range and weak longer-range correlations in natural speech and language by the use of phrases and of distance-language models; - improving the acoustic modeling by robust feature extraction, channel normalization, and adaptation techniques, as well as automatic script selection and verification. The starting point of the system development was the Philips 64k-NAB word-internal triphone trigram system. On the speaker-independent but microphone-dependent NAB-task (transcription of read newspaper texts) we obtained a word error rate of about 10%. Now, at the conclusion of the system development, we have arrived at Philips at a DMC-interpolated phrase-based crossword-pentaphone 4-gram system. This system transcribes BN data with an overall word error rate of about 17%. AU - Beyerlein, P. AU - Aubert, X. AU - Haeb-Umbach, Reinhold AU - Harris, M. AU - Klakow, D. AU - Wendemuth, A. AU - Molau, S. AU - Ney, N. AU - Pitz, Michael AU - Sixtus, A. ID - 11727 IS - 37 JF - Speech Communication TI - Large Vocabulary Continuous Speech Recognition of Broadcast News - The Philips/RWTH Approach ER - TY - CONF AB - Currently the future satellite navigation system Galileo and the third generation mobile communications system UMTS are on their way to the market in Europe. cdma2000 is under development in the USA and, furthermore, a new civil GPS signal in L2 band and the new frequency band L5 are added. 
In a hybrid receiver for satellite navigation and mobile radio communications, the possibility of an additional usage of the mobile radio signals for navigation purposes could also be a remedy to one problem of satellite navigation systems, which is the reduced location accuracy inside of buildings and urban canyons. A hybrid receiver with two fully separated receiver branches would lead to an increased bill of material and to increased power consumption in the receiver. This paper, therefore, introduces a hybrid receiver capable of evaluating Galileo/GPS as well as UMTS/cdma2000 signals with reduced computational efforts. Furthermore, the proposed structure performs a constructive superposition of the incoming paths to improve location accuracy. AU - Bischoff, Renke AU - Haeb-Umbach, Reinhold AU - Heinrichs, Guenther ID - 11731 T2 - ION-GPS 2002 TI - A Joint Time Multiplex Receiver for UMTS and Galileo ER - TY - CONF AB - Current navigation systems like GPS (Global Positioning System) and its Russian counterpart GLONASS (Global Navigation Satellite System) only evaluate the direct signal path. The receivers treat the reflected paths also reaching the receiver antenna as disturbance which has to be suppressed. Multipath affects the tracking accuracy by resulting in a degeneration of the S-curve of the DLL (delay locked loop). Nowadays the future European systems GALILEO and GPSIIF/III with two new signals are on the way to the market and it is time to think about new receiver structures. Therefore we investigated if it is possible to use multipath for navigation constructively. 
AU - Bischoff, Renke AU - Haeb-Umbach, Reinhold AU - Schulz, Wolfgang AU - Heinrichs, Guenther ID - 11733 KW - combined GALILEO/UMTS receiver KW - delay locked loop KW - delay lock loops KW - DLL KW - Global Positioning System KW - GLONASS KW - GPS KW - GPSIIF/III KW - mobile satellite communication KW - multipath channels KW - multipath receiver structure KW - radio receivers KW - RAKE receiver KW - S-curve T2 - IEEE 55th Vehicular Technology Conference (VTC 2002 Spring) TI - Employment of a multipath receiver structure in a combined GALILEO/UMTS receiver VL - 4 ER - TY - CONF AB - Current location methods for cellular communication systems, TOA and E-OTD, exploit time delays and time differences of various base station signals measured in a mobile phone to determine its location. These methods assume line-of-sight (LOS) connections to all utilized base stations. Since mobile radio channels are mainly characterized by scattered propagation paths and non-line-of-sight (NLOS) propagation, bias errors occur when measured time delays and time differences are utilized in position calculation algorithms. In this paper, the distribution of the time error due to NLOS propagation is estimated based on the channel model proposed in [1]. In combination with actual channel measurements in [2], the NLOS time error and its probability distribution function are estimated. With this information being determined for each received signal, position calculation algorithms can utilize the reliability information to enhance positioning accuracy. [1] L. J. Greenstein, V. Erceg, Y. S. Yeh, M. V. Clark, "A new path-gain/delay-spread propagation model for digital cellular channels", IEEE Transactions on Vehicular Technology, Vol. 46, No. 2, May 1997 [2] H. 
Asplund, "Wideband Channel Measurements in Central Stockholm", T1P1.5/98-242r1 AU - Hesse, Thomas AU - Bischoff, Renke AU - Schulz, Wolfgang AU - Haeb-Umbach, Reinhold ID - 11808 T2 - International Symposium on Location Based Services for Cellular Users (LOCELLUS 2002) TI - Estimation of Bias Location Error due to Absence of the LOS-Signal in a UMTS-System ER - TY - CONF AU - Bischoff, R. AU - Haeb-Umbach, Reinhold AU - Schulz, W. AU - Heinrichs, G. ID - 11734 T2 - 1st ESA Workshop on Satellite Navigation User Equipment Technology (Navitec 2001) TI - Implementation of a Rake Receiver Architecture into a Galileo Receiver ER - TY - JOUR AB - In this paper, it is shown that a correlation criterion is the appropriate criterion for bottom-up clustering to obtain broad phonetic class regression trees for maximum likelihood linear regression (MLLR)-based speaker adaptation. The correlation structure among speech units is estimated on the speaker-independent training data. In adaptation experiments the tree outperformed a regression tree obtained from clustering according to closeness in acoustic space and achieved results comparable with those of a manually designed broad phonetic class tree. AU - Haeb-Umbach, Reinhold ID - 11778 IS - 3 JF - IEEE Transactions on Speech and Audio Processing KW - acoustic space KW - adaptation experiments KW - automatic generation KW - bottom-up clustering KW - broad phonetic class regression trees KW - correlation criterion KW - correlation methods KW - maximum likelihood estimation KW - maximum likelihood linear regression based speaker adaptation KW - MLLR adaptation KW - pattern clustering KW - phonetic regression class trees KW - speaker-independent training data KW - speech recognition KW - speech units KW - statistical analysis KW - trees (mathematics) TI - Automatic generation of phonetic regression class trees for MLLR adaptation VL - 9 ER - TY - JOUR AB - We derive a class of computationally inexpensive linear dimension 
reduction criteria by introducing a weighted variant of the well-known K-class Fisher criterion associated with linear discriminant analysis (LDA). It can be seen that LDA weights contributions of individual class pairs according to the Euclidean distance of the respective class means. We generalize upon LDA by introducing a different weighting function AU - Loog, M. AU - Duin, R.P.W. AU - Haeb-Umbach, Reinhold ID - 11870 IS - 7 JF - IEEE Transactions on Pattern Analysis and Machine Intelligence KW - approximate pairwise accuracy KW - Bayes error KW - Bayes methods KW - error statistics KW - Euclidean distance KW - Fisher criterion KW - linear dimension reduction KW - linear discriminant analysis KW - pattern classification KW - statistical analysis KW - statistical pattern classification KW - weighting function TI - Multiclass linear dimension reduction by weighted pairwise Fisher criteria VL - 23 ER - TY - CONF AB - The traditional way to find a linear solution to the feature extraction problem is based on the maximization of the between-class scatter over the within-class scatter (Fisher mapping). For the multi-class problem this is, however, sub-optimal due to class conjunctions, even for the simple situation of normally distributed classes with identical covariance matrices. We propose a novel, equally fast method, based on nonlinear PCA. Although still sub-optimal, it may avoid the class conjunction. The proposed method is experimentally compared with Fisher mapping and with a neural network based approach to nonlinear PCA. It appears to outperform both methods, the first one even in a dramatic way. AU - Duin, Robert P.W. 
AU - Loog, Marco AU - Haeb-Umbach, Reinhold ID - 11758 T2 - International Conference on Pattern Recognition (ICPR 2000) TI - Multi-class Linear Feature Extraction by Nonlinear PCA ER - TY - CONF AU - Haeb-Umbach, Reinhold ID - 11779 T2 - International Conference on Spoken Language Processing (ICSLP 2000) TI - Data-driven Phonetic Regression Class Tree Estimation for MLLR Adaptation ER - TY - CONF AB - Amongst several data driven approaches for designing filters for the time sequence of spectral parameters, the linear discriminant analysis (LDA) based method has been proposed for automatic speech recognition. Here we apply LDA-based filter design to cepstral features, which better match the inherent assumption of this method that feature vector components are uncorrelated. Extensive recognition experiments have been conducted both on the standard TIMIT phone recognition task and on a proprietary 130-word command word task under various adverse environmental conditions, including reverberant data with real-life room impulse responses and data processed by acoustic echo cancellation algorithms. Significant error rate reductions have been achieved when applying the novel long-range feature filters compared to standard approaches employing cepstral mean normalization and delta and delta-delta features, in particular when facing acoustic echo cancellation scenarios and room reverberation. For example, the phone accuracy on reverberated TIMIT data could be increased from 50.7% to 56.0%. AU - Lieb, M. 
AU - Haeb-Umbach, Reinhold ID - 11869 KW - acoustic echo cancellation algorithms KW - adverse environmental conditions KW - automatic speech recognition KW - cepstral analysis KW - cepstral features KW - cepstral mean normalization KW - command word task KW - delta-delta features KW - delta features KW - echo suppression KW - error rate reductions KW - feature vector components KW - FIR filters KW - LDA derived cepstral trajectory filters KW - linear discriminant analysis KW - long-range feature filters KW - phone accuracy KW - real-life room impulse responses KW - reverberant data KW - spectral parameters KW - speech recognition KW - standard TIMIT phone recognition task T2 - IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2000) TI - LDA derived cepstral trajectory filters in adverse environmental conditions VL - 2 ER - TY - CONF AU - Loog, M. AU - Haeb-Umbach, Reinhold ID - 11871 T2 - International Conference on Spoken Language Processing (ICSLP 2000) TI - Multi-class Linear Dimension Reduction by Generalized Fisher Criteria ER - TY - CONF AB - This paper contains a description of the Philips/RWTH 1998 HUB4 system which has been built in a joint effort of Philips Research Laboratories Aachen and Aachen University of Technology. We will focus our discussion on recent improvements compared to the original 1997 HUB4 system and evaluate them on the HUB4'97 evaluation data. The paper will deal with (1) a rough system overview including feature extraction, acoustic training, audio stream segmentation, and decoding, (2) log-linear interpolation of distance-language models, and (3) the integration of various acoustic and language models via Discriminative Model Combination (DMC). The performance of the described system is 23% (relative) better than the performance of the 1997 Philips HUB4 system. A word error rate of 17.9% was achieved on the 1997 HUB4 evaluation set, compared to 23.5% using the original 1997 system. 
AU - Beyerlein, Peter AU - Aubert, Xavier L. AU - Haeb-Umbach, Reinhold AU - Harris, Matthew J. AU - Klakow, Dietrich AU - Wendemuth, Andreas AU - Molau, Sirko AU - Pitz, Michael AU - Sixtus, Achim ID - 11728 T2 - Eurospeech TI - The Philips/RWTH system for transcription of broadcast news ER - TY - CONF AB - This paper contains a description of the Philips/RWTH 1998 HUB4 system which has been built in a joint effort of Philips Research Laboratories Aachen and Aachen University of Technology. We will focus our discussion on recent improvements compared to the original 1997 HUB4 system and evaluate them on the HUB4'97 evaluation data. The paper will deal with (1) a rough system overview including feature extraction, acoustic training, audio stream segmentation, and decoding, (2) log-linear interpolation of distance-language models, and (3) the integration of various acoustic and language models via Discriminative Model Combination (DMC). The performance of the described system is 23% (relative) better than the performance of the 1997 Philips HUB4 system. A word error rate of 17.9% was achieved on the 1997 HUB4 evaluation set, compared to 23.5% using the original 1997 system. AU - Beyerlein, Peter AU - Aubert, Xavier L. AU - Haeb-Umbach, Reinhold AU - Harris, Matthew J. AU - Klakow, Dietrich AU - Wendemuth, Andreas AU - Molau, Sirko AU - Pitz, Michael AU - Sixtus, Achim ID - 11729 T2 - Broadcast News Transcription and Understanding Workshop, Washington TI - The Philips/RWTH System for Transcription of Broadcast News ER - TY - CONF AB - We apply Fisher variate analysis to measure the effectiveness of speaker normalization techniques. A trace criterion, which measures the ratio of the variations due to different phonemes compared to variations due to different speakers, serves as a first assessment of a feature set without the need for recognition experiments. 
By using this measure and by recognition experiments we demonstrate that cepstral mean normalization also has a speaker normalization effect, in addition to the well-known channel normalization effect. Similarly, vocal tract normalization (VTN) is shown to remove inter-speaker variability. For VTN we show that normalization on a per sentence basis performs better than normalization on a per speaker basis. Recognition results are given on Wall Street Journal and Hub-4 databases. AU - Haeb-Umbach, Reinhold ID - 11780 T2 - ICASSP99 Phoenix, AZ TI - Investigations on inter-speaker variability in the feature space ER - TY - CONF AB - We examined variants of MFCC and PLP cepstral parameterisations in the context of large vocabulary continuous speech recognition under different acoustical environmental conditions: Compared to MFCC, mel-frequency PLP uses a cubic root intensity-to-loudness law, and an LPC analysis is applied to the mel-warped spectrum. In LPC-smoothed MFCC, the only difference to MFCC is the additional LPC smoothing of the warped spectrum. While neither technique was able to significantly outperform the MFCC parameterisation in our setup which includes an LDA feature transformation, feature set combination via DMC at the acoustic likelihood level and via ROVER at the recognized word level delivered small but consistent improvements. AU - Haeb-Umbach, Reinhold AU - Loog, Marco ID - 11791 T2 - Eurospeech TI - An Investigation of Cepstral Parameterisations for Large Vocabulary Speech Recognition ER - TY - CONF AB - In transcription of broadcast news, dividing the signal into homogeneous segments, and clustering together similar segments is important. Decoding a complete broadcast news program in one chunk is technically difficult. Also, through creation of homogeneous clusters of segments, improvement from adaptation can be increased. Two systems of segmentation and clustering are compared. 
The best system used the BIC algorithm to produce long, homogeneous segments, and a nearest neighbour bottom-up agglomerative clustering algorithm to produce homogeneous clusters. Adaptation brought a word error rate (WER) improvement from 23.4% to 21.0% using the automatic segmentation and clustering, compared to an improvement from 21.8% to 20.0% using a handmade "correct" segmentation and clustering. AU - Harris, Matthew J. AU - Aubert, Xavier L. AU - Haeb-Umbach, Reinhold AU - Beyerlein, Peter ID - 11805 T2 - Eurospeech TI - A study of broadcast news audio stream segmentation and segment clustering ER - TY - CONF AB - In this paper the Philips Broadcast News transcription system is described. The Broadcast News task aims at the recognition of "found" speech in radio and television broadcasts without any additional side information (e.g. speaking style, background conditions). The system was derived from the Philips continuous mixture density crossword HMM system, using MFCC features and Laplacian densities. A segmentation was performed to obtain sentence-like partitions of the broadcasts. Using data-driven clustering, the obtained segments were grouped into clusters with similar acoustic conditions for adaptation purposes. Gender independent word-internal and crossword triphone models were trained on 70 hours of the HUB4 training data. No focus condition specific training was applied. Channel and speaker normalization was done by mean and variance normalization as well as VTN and MLLR. The transcription was produced by an adaptive multiple pass decoder starting with phrase-bigram decoding using word-internal triphones and finishing with a phrase-trigram decoding using MLLR-adapted crossword models. AU - Beyerlein, Peter AU - Aubert, Xavier L. 
AU - Haeb-Umbach, Reinhold AU - Klakow, Dietrich AU - Ullrich, Meinhard AU - Wendemuth, Andreas AU - Wilcox, Patricia ID - 11730 T2 - DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne TI - Automatic Transcription of English Broadcast News ER - TY - CONF AB - In this paper we describe some characteristics of the acoustic modeling used in the Philips continuous-speech recognition system for the DARPA Hub-4 1997 evaluation, which are related to robustness issues. We aimed at a conceptually simple system: We trained two model sets on 70 hours of the Hub-4 training data, one for within-word and one for cross-word decoding. These model sets were used for both genders and all environmental conditions. In order to be able to do so, channel normalization (mean, variance normalization) and speaker normalization (vocal tract length normalization, realized by an appropriate shift of the center frequencies of the mel filter bank) have been applied, as well as adaptation techniques. MLLR-based unsupervised batch adaptation on clusters of segments was conducted both after a first within-word decoding and a cross-word decoding pass. The training strategy and the effects of the various normalization and adaptation techniques will be discussed in the paper. AU - Haeb-Umbach, Reinhold AU - Aubert, Xavier L. AU - Beyerlein, Peter AU - Klakow, Dietrich AU - Ullrich, Meinhard AU - Wendemuth, Andreas AU - Wilcox, Patricia ID - 11784 T2 - DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne TI - Acoustic Modeling in the Philips Hub-4 Continuous-Speech Recognition System ER - TY - CONF AB - In this paper we present some experiments that have been performed while developing language models for the PHILIPS Broadcast News system. Three main issues will be discussed: construction of phrases, adaptation of remote corpora to this task, and the combination of the different models. Also, perplexities on the 1997 evaluation data are reported. 
AU - Klakow, Dietrich AU - Aubert, Xavier L. AU - Haeb-Umbach, Reinhold AU - Beyerlein, Peter AU - Ullrich, Meinhard AU - Wendemuth, Andreas AU - Wilcox, Patricia ID - 11842 T2 - DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne TI - Language-Model Investigations related to Broadcast News ER - TY - CONF AB - Although speaker normalization is attempted in very different manners, vocal tract normalization (VTN) and speaker adaptive training (SAT) share many common properties. We show that both lead to more compact representations of the phonetically relevant variations of the training data and that both achieve improved error rate performance only if a complementary normalization or adaptation operation is conducted on the test data. Algorithms for fast test speaker enrollment are presented for both normalization methods: in the framework of SAT, a pre-transformation step is proposed, which alone, i.e. without subsequent unsupervised MLLR adaptation, reduces the error rate by almost 10% on the WSJ 5k test sets. For VTN, the use of a Gaussian mixture model makes obsolete a first recognition pass to obtain a preliminary transcription of the test utterance at hardly any loss in performance. AU - Welling, L. AU - Haeb-Umbach, Reinhold AU - Aubert, X. AU - Haberland, N. ID - 11936 T2 - ICASSP 1998, Seattle TI - A Study on Speaker Normalization Using Vocal Tract Normalization and Speaker Adaptive Training ER - TY - CONF AB - Addresses the problem of online, writer-independent, unconstrained handwriting recognition. Based on hidden Markov models (HMM), which are successfully employed in speech recognition tasks, we focus on representations which address scalability, recognition performance and compactness. 'Delayed' features are introduced which integrate more global, handwriting specific knowledge into the HMM representation. 
These features lead to larger error-rate reduction than 'delta' features which are known from speech recognition and even require fewer additional components. Scalability is addressed with a size-independent representation. Compactness is achieved with linear discriminant analysis. The representations are discussed and the results for a mixed-style word recognition task with vocabularies of 200 (up to 99% correct words) and 20000 words (up to 88.8% correct words) are given. AU - Dolfing, J.G.A. AU - Haeb-Umbach, Reinhold ID - 11750 T2 - ICASSP, Munich TI - Signal Representations for Hidden Markov Model Based On-Line Handwriting Recognition ER - TY - JOUR AB - This paper reports the design of a command-based speech interface for an answering machine or a voice mail system. Automatic speech recognition was integrated in order to facilitate the remote control and the retrieval of voice messages from any telephone in a speech-only dialogue. The design goal was that consumers would perceive the speech interface as a benefit compared with the common touch-tone interface. In this paper we will first describe the speech technology underlying the system. Then it will be shown how, based on this technology, the user interface was designed in a top-down approach. We started with the development of a concept and tested it by means of a Wizard-of-Oz simulation. After refining the concept in parallel design, it was implemented in a high-fidelity prototype. By means of qualitative user testing the design was improved in three iteration steps. The achievement of the design goal was finally verified with user tests in two countries. AU - Gamm, Stephan AU - Haeb-Umbach, Reinhold AU - Langmann, Detlev ID - 11766 JF - Speech Communication TI - The development of a command-based speech interface for a telephone answering machine ER - TY - CONF AB - The increased popularity of mobile telephony introduces both challenges and opportunities for automatic speech recognition. 
ASR offers ways to simplify the use of mobile phones, notably in hands- and eyes-busy situations. However, the acoustic environment can be severely degraded and the wireless network may add additional distortions to the speech signal. This paper gives an overview of the sources of degradation and of attempts at robust speech recognition for mobile communications. Emphasis is placed on approaches which are suitable for implementation in mobile terminals. Two example applications are described which illustrate the robustness issues and design considerations typical of low-cost noisy speech recognition: voice-dialling in a GSM phone and hands-free digit recognition in the car. AU - Haeb-Umbach, Reinhold ID - 11781 T2 - Eurospeech TI - Robust Speech Recognition for Wireless Networks and Mobile Telephony ER - TY - CONF AB - The SpeechDat project aims to produce speech databases for all official languages of the European Union and some major dialectal variants and minority languages resulting in 28 speech databases. They will be recorded over fixed and mobile telephone networks. This will provide a realistic basis for training and assessment of both isolated and continuous-speech utterances, employing whole-word or subword approaches, and thus can be used for developing voice driven teleservices including speaker verification. The specification of the databases has been developed jointly, and is essentially the same for each language to facilitate dissemination and use. There will be a controlled variation among the speakers concerning sex, age, dialect, environment of call, etc. The validation of all databases will be carried out centrally. The SpeechDat databases will be transferred to ELRA for distribution. The next databases to be recorded will cover East European languages. AU - Hoege, H. AU - Tropf, H. S. AU - Winsky, R. AU - van den Heuvel, H. AU - Haeb-Umbach, Reinhold AU - Choukri, K. 
ID - 11819 T2 - ICASSP, Munich TI - European Speech Databases for Telephone Applications ER - TY - CONF AB - This paper describes speaker-independent speech recognition experiments concerning acoustic front end processing on a speech database that was recorded in 3 different cars. We investigate different feature analysis approaches (mel-filter bank, mel-cepstrum, perceptually linear predictive coding) and present results with noise compensation techniques based on spectral subtraction. Although the methods employed lead to considerable error rate reduction, the error analysis shows that low signal-to-noise ratios are still a problem. AU - Langmann, Detlev AU - Fischer, Alexander AU - Wuppermann, Friedhelm AU - Haeb-Umbach, Reinhold AU - Eisele, Thomas ID - 11852 T2 - Eurospeech TI - Acoustic Front Ends for Speaker-Independent Digit Recognition in Car Environments ER - TY - CONF AU - Langmann, Detlev AU - Wuppermann, Friedhelm AU - Haeb-Umbach, Reinhold AU - Fischer, A. AU - Eisele, Thomas ID - 11855 T2 - Aachener Kolloquium on Signal Theory TI - Investigation of Acoustic Front Ends for Speaker-Independent Speech Recognition in the Car ER - TY - CONF AB - Although widely used, there are still open questions concerning which properties of linear discriminant analysis (LDA) account for its success in many speech recognition systems. In order to gain more insight into the nature of the transformation we compare LDA with mel-cepstral feature vectors with respect to the following criteria: decorrelation and ordering property; invariance under linear transforms; automatic learning of dynamical features; and data dependence of the transformation. AU - Eisele, Thomas AU - Haeb-Umbach, Reinhold AU - Langmann, Detlev ID - 11761 T2 - ICSLP, Philadelphia TI - A Comparative Study of Linear Feature Transformation Techniques for Automatic Speech Recognition ER - TY - CONF AB - This paper tells the story of the design of a command-based speech interface for a voice mail system. 
Speech recognition was integrated in the voice mail system in order to allow the remote interrogation of messages in a speech-only dialogue. Our design goal was that consumers would perceive voice control as a clear benefit versus touch-tone control. It is shown how the speech interface was designed in a top-down approach. We started with a concept development and tested it by means of a Wizard-of-Oz simulation. After refining the concept in parallel design, the design was implemented in a high-fidelity prototype. By means of qualitative user testing it was improved in three iteration steps. We verified the achievement of our design goal with tests in two countries. AU - Gamm, Stephan AU - Haeb-Umbach, Reinhold AU - Langmann, Detlev ID - 11767 T2 - IEEE Workshop on Interactive Voice Technology for Telecommunications Applications TI - Findings with the Design of a Command-Based Speech Interface for a Voice Mail System ER - TY - CONF AB - The paper describes the design, collection and postprocessing of the French SpeechDat corpus FRESCO. Being a database of approximately 35000 utterances recorded from 1000 callers over the terrestrial telephone network in France, it comprises immediately usable and relevant speech for the initial training and assessment of speaker independent phoneme model or word model based speech recognizers, as they are employed in automated telephone services. FRESCO is one of the 1000 speaker telephone speech databases produced as "case studies" within the European project SpeechDat(M). 
AU - Langmann, Detlev AU - Haeb-Umbach, Reinhold ID - 11853 T2 - ICSLP, Philadelphia TI - FRESCO: The French Telephone Speech Data Collection - Part of the European SpeechDat(M) Project ER - TY - CONF AU - Langmann, Detlev AU - Haeb-Umbach, Reinhold AU - Eisele, Thomas ID - 11854 T2 - ITG Fachtagung Sprachkommunikation, Frankfurt TI - Robust Rejection Modeling for a Small-Vocabulary Application ER - TY - CONF AB - Clustering techniques have been integrated at different levels into the training procedure of a continuous-density hidden Markov model (HMM) speech recognizer. These clustering techniques can be used in two ways. First, acoustically similar states are tied together (state-tying), which reduces the number of parameters and also allows otherwise rarely seen states to be trained together with more robust ones. Second, densities are clustered across states (density-clustering), which reduces the number of densities while maintaining the best performance of our recognizer. We have applied these techniques both to word-based small-vocabulary and phoneme-based large-vocabulary recognition tasks. On the WSJ task, we could achieve a reduction of the word error rate by 7%. On the TI/NIST-connected digit task, the number of parameters was reduced by a factor of 2-3 while keeping the same string error rate. AU - Dugast, Christian AU - Beyerlein, Peter AU - Haeb-Umbach, Reinhold ID - 11757 T2 - ICASSP, Detroit TI - Application of Clustering Techniques to Mixture Density Modelling for Continuous-Speech Recognition ER - TY - JOUR AB - Today speech recognition of a small vocabulary can be realized so cost-effectively that the technology can penetrate into consumer electronics. But, as first applications that failed on the market show, it is by no means obvious how to incorporate voice control in a user interface. This paper addresses the issue of how to design a voice control so that the user perceives it as a benefit. 
User interface guidelines that are adapted or specific to voice control are presented. Then the process of designing a voice control in the user-centred approach is described. By means of two examples, the car stereo and telephone answering machine, it is shown how this is turned into practice. AU - Gamm, Stephan AU - Haeb-Umbach, Reinhold ID - 11764 JF - Philips Journal of Research TI - User interface design of voice controlled consumer electronics ER - TY - CONF AU - Gamm, Stephan AU - Haeb-Umbach, Reinhold ID - 11765 T2 - Eurospeech, Madrid TI - Human Factors of a Voice-Controlled Car Stereo ER - TY - CONF AU - Gamm, Stephan AU - Haeb-Umbach, Reinhold AU - Langmann, Det ID - 11768 T2 - International Symposium on Human Factors in Telecommunications, Melbourne TI - The Usability Engineering of a Voice-Controlled Answering Machine ER - TY - JOUR AB - Recognition accuracy has been the primary objective of most speech recognition research, and impressive results have been obtained, e.g. less than 0.3% word error rate on a speaker-independent digit recognition task. When it comes to real-world applications, robustness and real-time response might be more important issues. For the first requirement we review some of the work on robustness and discuss one specific technique, spectral normalization, in more detail. The requirement of real-time response has to be considered in the light of the limited hardware resources in voice control applications, which are due to the tight cost constraints. In this paper we discuss in detail one specific means to reduce the processing and memory demands: a clustering technique applied at various levels within the acoustic modelling. AU - Haeb-Umbach, Reinhold AU - Beyerlein, Peter AU - Geller, Dieter ID - 11786 JF - Philips Journal of Research TI - Speech recognition algorithms for voice control interfaces ER - TY - CONF AB - We address the problem of automatically finding an acoustic representation (i.e. 
a transcription) of unknown words as a sequence of subword units, given a few sample utterances of the unknown words, and an inventory of speaker-independent subword units. The problem arises if a user wants to add his own vocabulary to a speaker-independent recognition system simply by speaking the words a few times. Two methods are investigated which are both based on a maximum-likelihood formulation of the problem. The experimental results show that both automatic transcription methods provide a good estimate of the acoustic models of unknown words. The recognition error rates obtained with such models in a speaker-independent recognition task are clearly better than those resulting from separate whole-word models. They are comparable with the performance of transcriptions drawn from a dictionary. AU - Haeb-Umbach, Reinhold AU - Beyerlein, P. AU - Thelen, E. ID - 11787 T2 - ICASSP, Detroit TI - Automatic Transcription of Unknown Words in a Speech Recognition System ER - TY - JOUR AB - This paper gives an overview of the Philips Research system for continuous-speech recognition. The recognition architecture is based on an integrated statistical approach. The system has been successfully applied to various tasks in American English and German, ranging from small vocabulary tasks to very large vocabulary tasks and from recognition only to speech understanding. Here, we concentrate on phoneme-based continuous-speech recognition for large vocabulary recognition as used for dictation, which covers a significant part of our research work on speech recognition. We describe this task and report on experimental results. In order to allow a comparison with the performance of other systems, a section with an evaluation on the standard North American Business news (NAB2) task (dictation of American English newspaper text) is supplied. AU - Steinbiss, Volker AU - Ney, Hermann J. AU - Aubert, Xavier L. 
AU - Besling, Stefan AU - Dugast, Christian AU - Essen, Ute AU - Geller, Dieter AU - Haeb-Umbach, Reinhold AU - Kneser, Reinhard AU - Meier, Hans Günter AU - Oerder, Martin AU - Tran, Bach Hiep ID - 11905 JF - Philips Journal of Research TI - The Philips Research system for continuous-speech dictation ER - TY - JOUR AB - This paper gives an overview of the Philips research system for phoneme-based, large-vocabulary, continuous-speech recognition. The system has been successfully applied to various tasks in the German and (American) English languages, ranging from small vocabulary tasks to very large vocabulary tasks. Here, we concentrate on continuous-speech recognition for dictation in real applications, the dictation of legal reports and radiology reports in German. We describe this task and report on experimental results. We also describe a commercial PC-based dictation system which includes a PC implementation of our scientific recognition prototype. In order to allow for a comparison with the performance of other systems, a section with an evaluation on the standard Wall Street Journal task (dictation of American English newspaper text) is supplied. The recognition architecture is based on an integrated statistical approach. We describe the characteristic features of the system as opposed to other systems: 1. the Viterbi criterion is consistently applied both in training and testing; 2. continuous mixture densities are used without tying or smoothing; 3. time-synchronous beam search in connection with a phoneme look-ahead is applied to a tree-organized lexicon. AU - Steinbiss, Volker AU - Ney, Hermann J. AU - Essen, Ute AU - Tran, Bach Hiep AU - Aubert, Xavier L. AU - Dugast, Christian AU - Kneser, Reinhard AU - Meier, Hans Günter AU - Oerder, Martin AU - Haeb-Umbach, Reinhold AU - Geller, Dieter AU - Hoellerbauer, W. AU - Bartosik, H. 
ID - 11948 JF - Speech Communication TI - Continuous speech dictation - From theory to practice ER - TY - JOUR AB - The authors describe the improvements in a time-synchronous beam search strategy for a 10000-word continuous-speech recognition task. Basically they introduced two measures, namely a tree organization of the pronunciation lexicon and a novel look-ahead technique at the phoneme level. The experimental tests performed showed that the number of state hypotheses could be reduced from 50000 to 3000, i.e., by a factor of about 17. At the same time, the word error rate did not increase. AU - Haeb-Umbach, Reinhold AU - Ney, Hermann ID - 11796 JF - IEEE Transactions on Speech and Audio Processing TI - Improvements in beam search for 10000-word continuous-speech recognition ER - TY - CONF AU - Ney, Hermann AU - Steinbiss, Volker AU - Aubert, Xavier L. AU - Haeb-Umbach, Reinhold ID - 11878 T2 - Artificial Intelligence, Progress and Prospects of Speech Research and Technology, Munich TI - Progress in Large-Vocabulary, Continuous Speech Recognition ER - TY - JOUR AB - This paper gives an overview of a research system for phoneme based, large vocabulary continuous speech recognition. The system to be described has been applied to the SPICOS task, the DARPA RM task and a 12000 word dictation task. Experimental results for these three tasks will be presented. Like many other systems, the recognition architecture is based on an integrated statistical approach. In this paper, we describe the characteristic features of the system as opposed to other systems: (1) The Viterbi criterion is consistently applied both in training and testing. (2) Continuous mixture densities are used without any tying or smoothing; this approach can be viewed as a sort of ‘statistical template matching’. 
(3) Time-synchronous beam search is used consistently throughout all tasks; extensions using a tree organization of the vocabulary and phoneme lookahead are presented so that a 12000 word task can be handled. AU - Ney, Hermann AU - Steinbiss, Volker AU - Haeb-Umbach, Reinhold AU - Tran, Bach Hiep ID - 11879 JF - International Journal on Pattern Recognition and Artificial Intelligence TI - An Overview of the Philips Research System for Large Vocabulary Continuous Speech Recognition ER -