TY - CONF
AB - In recent years time domain speech separation has excelled over frequency domain separation in single channel scenarios and noise-free environments. In this paper we dissect the gains of the time-domain audio separation network (TasNet) approach by gradually replacing components of an utterance-level permutation invariant training (u-PIT) based separation system in the frequency domain until the TasNet system is reached, thus blending components of frequency domain approaches with those of time domain approaches. Some of the intermediate variants achieve comparable signal-to-distortion ratio (SDR) gains to TasNet, but retain the advantage of frequency domain processing: compatibility with classic signal processing tools such as frequency-domain beamforming and the human interpretability of the masks. Furthermore, we show that the scale invariant signal-to-distortion ratio (si-SDR) criterion used as loss function in TasNet is related to a logarithmic mean square error criterion and that it is this criterion which contributes most reliably to the performance advantage of TasNet. Finally, we critically assess which gains in a noise-free single channel environment generalize to more realistic reverberant conditions.
AU - Heitkaemper, Jens
AU - Jakobeit, Darius
AU - Boeddeker, Christoph
AU - Drude, Lukas
AU - Haeb-Umbach, Reinhold
ID - 20504
KW - voice activity detection
KW - speech activity detection
KW - neural network
KW - statistical speech processing
T2 - ICASSP 2020 Virtual Barcelona Spain
TI - Demystifying TasNet: A Dissecting Approach
ER -
TY - CONF
AB - Speech activity detection (SAD), which often rests on the fact that the noise is "more" stationary than speech, is particularly challenging in non-stationary environments, because the time variance of the acoustic scene makes it difficult to discriminate speech from noise.
We propose two approaches to SAD, where one is based on statistical signal processing, while the other utilizes neural networks. The former employs sophisticated signal processing to track the noise and speech energies and is meant to support the case for a resource efficient, unsupervised signal processing approach. The latter introduces a recurrent network layer that operates on short segments of the input speech to do temporal smoothing in the presence of non-stationary noise. The systems are tested on the Fearless Steps challenge database, which consists of the transmission data from the Apollo-11 space mission. The statistical SAD achieves comparable detection performance to earlier proposed neural network based SADs, while the neural network based approach leads to a decision cost function of 1.07% on the evaluation set of the 2020 Fearless Steps Challenge, which sets a new state of the art.
AU - Heitkaemper, Jens
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 20505
KW - voice activity detection
KW - speech activity detection
KW - neural network
KW - statistical speech processing
T2 - INTERSPEECH 2020 Virtual Shanghai China
TI - Statistical and Neural Network Based Speech Activity Detection in Non-Stationary Acoustic Environments
ER -
TY - CONF
AB - In this paper we present a speech presence probability (SPP) estimation algorithm which exploits both temporal and spectral correlations of speech. To this end, the SPP estimation is formulated as the posterior probability estimation of the states of a two-dimensional (2D) Hidden Markov Model (HMM). We derive an iterative algorithm to decode the 2D-HMM which is based on the turbo principle. The experimental results show that indeed the SPP estimates improve from iteration to iteration, and further clearly outperform another state-of-the-art SPP estimation algorithm.
AU - Vu, Dang Hai Tran
AU - Haeb-Umbach, Reinhold
ID - 11917
KW - correlation methods
KW - estimation theory
KW - hidden Markov models
KW - iterative methods
KW - probability
KW - spectral analysis
KW - speech processing
KW - 2D HMM
KW - SPP estimates
KW - iterative algorithm
KW - posterior probability estimation
KW - spectral correlation
KW - speech presence probability estimation
KW - state-of-the-art SPP estimation algorithm
KW - temporal correlation
KW - turbo principle
KW - two-dimensional hidden Markov model
KW - Correlation
KW - Decoding
KW - Estimation
KW - Iterative decoding
KW - Noise
KW - Speech
KW - Vectors
SN - 1520-6149
T2 - 38th International Conference on Acoustics, Speech and Signal Processing (ICASSP 2013)
TI - Using the turbo principle for exploiting temporal and spectral correlations in speech presence probability estimation
ER -
TY - CONF
AB - This paper investigates the influence of feedback provided by an autonomous robot (BIRON) on users’ discursive behavior. A user study is described during which users show objects to the robot. The results of the experiment indicate that the robot’s verbal feedback utterances cause the humans to adapt their own way of speaking. The changes in users’ verbal behavior are due to their beliefs about the robot’s knowledge and abilities. In this paper they are identified and grouped. Moreover, the data implies variations in user behavior regarding gestures. Unlike speech, the robot was not able to give feedback with gestures. Due to the lack of feedback, users did not seem to have a consistent mental representation of the robot’s abilities to recognize gestures. As a result, changes between different gestures are interpreted to be unconscious variations accompanying speech.
AU - Lohse, Manja
AU - Rohlfing, Katharina
AU - Wrede, Britta
AU - Sagerer, Gerhard
ID - 17278
KW - discursive behavior
KW - autonomous robot
KW - BIRON
KW - man-machine systems
KW - robot abilities
KW - robot knowledge
KW - user gestures
KW - robot verbal feedback utterance
KW - speech processing
KW - user verbal behavior
KW - service robots
KW - human-robot interaction
KW - human computer interaction
KW - gesture recognition
SN - 1050-4729
TI - “Try something else!” — When users change their discursive behavior in human-robot interaction
ER -
TY - GEN
AU - Plessl, Christian
AU - Maurer, Simon
ID - 2433
KW - co-design
KW - speech processing
TI - Hardware/Software Codesign in Speech Compression Applications
ER -