TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 48269
T2 - European Signal Processing Conference (EUSIPCO)
TI - On the Integration of Sampling Rate Synchronization and Acoustic Beamforming
ER -
TY - CONF
AU - Schmalenstroeer, Joerg
AU - Gburrek, Tobias
AU - Haeb-Umbach, Reinhold
ID - 48270
T2 - ITG Conference on Speech Communication
TI - LibriWASN: A Data Set for Meeting Separation, Diarization, and Recognition with Asynchronous Recording Devices
ER -
TY - CONF
AB - We propose a diarization system that estimates “who spoke when” based on spatial information, to be used as a front-end of a meeting transcription system running on the signals gathered from an acoustic sensor network (ASN). Although the spatial distribution of the microphones is advantageous, exploiting the spatial diversity for diarization and signal enhancement is challenging, because the microphones’ positions are typically unknown and the recorded signals are in general initially unsynchronized. Here, we approach these issues by first blindly synchronizing the signals and then estimating time differences of arrival (TDOAs). The TDOA information is exploited to estimate the speakers’ activity, even when multiple speakers are simultaneously active. This speaker activity information serves as a guide for a spatial mixture model, on the basis of which the individual speakers’ signals are extracted via beamforming. Finally, the extracted signals are forwarded to a speech recognizer. Additionally, a novel initialization scheme for spatial mixture models based on the TDOA estimates is proposed. Experiments conducted on real recordings from the LibriWASN data set show that our proposed system is advantageous compared to a system using a spatial mixture model that does not make use of external diarization information.
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 49109
KW - Diarization
KW - time difference of arrival
KW - ad-hoc acoustic sensor network
KW - meeting transcription
T2 - Proc. Asilomar Conference on Signals, Systems, and Computers
TI - Spatial Diarization for Meeting Transcription with Ad-Hoc Acoustic Sensor Networks
ER -
TY - CONF
AU - Afifi, Haitham
AU - Karl, Holger
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
ID - 33806
T2 - 2022 International Wireless Communications and Mobile Computing (IWCMC)
TI - Data-driven Time Synchronization in Wireless Multimedia Networks
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 33807
T2 - ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - On Synchronization of Wireless Acoustic Sensor Networks in the Presence of Time-Varying Sampling Rate Offsets and Speaker Changes
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Heitkaemper, Jens
AU - Haeb-Umbach, Reinhold
ID - 33808
T2 - 2022 International Workshop on Acoustic Signal Enhancement (IWAENC)
TI - Informed vs. Blind Beamforming in Ad-Hoc Acoustic Sensor Networks for Meeting Transcription
ER -
TY - GEN
AU - Gburrek, Tobias
AU - Boeddeker, Christoph
AU - von Neumann, Thilo
AU - Cord-Landwehr, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 33816
TI - A Meeting Transcription System for an Ad-Hoc Acoustic Sensor Network
ER -
TY - JOUR
AB - Due to the ad hoc nature of wireless acoustic sensor networks, the positions of the sensor nodes are typically unknown. This contribution proposes a technique to estimate the position and orientation of the sensor nodes from the recorded speech signals. The method assumes that a node comprises a microphone array with synchronously sampled microphones rather than a single microphone, but does not require the sampling clocks of the nodes to be synchronized. From the observed audio signals, the distances between the acoustic sources and arrays, as well as the directions of arrival, are estimated. They serve as input to a non-linear least squares problem, from which both the sensor nodes’ positions and orientations and the source positions are alternatingly estimated in an iterative process. Given one set of unknowns, i.e., either the source positions or the sensor nodes’ geometry, the other set can be computed in closed form. The proposed approach is computationally efficient and is the first to employ both distance and directional information for geometry calibration in a common cost function. Since both distance and direction-of-arrival measurements suffer from outliers, e.g., caused by strong reflections of the sound waves off the surfaces of the room, we introduce measures to deemphasize or remove unreliable measurements. Additionally, we discuss modifications of our previously proposed deep neural network-based acoustic distance estimator to account not only for omnidirectional sources but also for directional sources. Simulation results show good positioning accuracy and compare very favorably with alternative approaches from the literature.
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 22528
JF - EURASIP Journal on Audio, Speech, and Music Processing
SN - 1687-4722
TI - Geometry calibration in wireless acoustic sensor networks utilizing DoA and distance information
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 23994
T2 - ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
TI - Iterative Geometry Calibration from Distance Estimates for Wireless Acoustic Sensor Networks
ER -
TY - CONF
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Haeb-Umbach, Reinhold
ID - 23999
T2 - Speech Communication; 14th ITG-Symposium
TI - On Source-Microphone Distance Estimation Using Convolutional Recurrent Neural Networks
ER -
TY - CONF
AU - Chinaev, Aleksej
AU - Enzner, Gerald
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
ID - 23997
T2 - 29th European Signal Processing Conference (EUSIPCO)
TI - Online Estimation of Sampling Rate Offsets in Wireless Acoustic Sensor Networks with Packet Loss
ER -
TY - CONF
AB - We present an approach to deep neural network-based (DNN-based) distance estimation in reverberant rooms for supporting geometry calibration tasks in wireless acoustic sensor networks. Signal diffuseness information from acoustic signals is aggregated via the coherent-to-diffuse power ratio to obtain a distance-related feature, which is mapped to a source-to-microphone distance estimate by means of a DNN. This information is then combined with direction-of-arrival estimates from compact microphone arrays to infer the geometry of the sensor network. Unlike many other approaches to geometry calibration, the proposed scheme only requires that the sampling clocks of the sensor nodes are roughly synchronized.
In simulations we show that the proposed DNN-based distance estimator generalizes to unseen acoustic environments and that precise estimates of the sensor node positions are obtained.
AU - Gburrek, Tobias
AU - Schmalenstroeer, Joerg
AU - Brendel, Andreas
AU - Kellermann, Walter
AU - Haeb-Umbach, Reinhold
ID - 18651
T2 - European Signal Processing Conference (EUSIPCO)
TI - Deep Neural Network based Distance Estimation for Geometry Calibration in Acoustic Sensor Network
ER -
TY - CONF
AB - This paper presents an approach to voice conversion which requires neither parallel data nor speaker or phone labels for training. It can convert between speakers that are not in the training set by employing the previously proposed concept of a factorized hierarchical variational autoencoder. Here, linguistic and speaker-induced variations are separated based on the notion that content-induced variations change at a much shorter time scale, i.e., at the segment level, than speaker-induced variations, which vary at the longer utterance level. In this contribution we propose to employ convolutional instead of recurrent network layers in the encoder and decoder blocks, which is shown to achieve better phone recognition accuracy on the latent segment variables at frame level due to their better temporal resolution. For voice conversion, the mean of the utterance variables is replaced with the respective estimated mean of the target speaker. The resulting log-mel spectra of the decoder output are used as local conditions of a WaveNet, which is utilized for synthesis of the speech waveforms. Experiments show both good disentanglement properties of the latent space variables and good voice conversion performance.
AU - Gburrek, Tobias
AU - Glarner, Thomas
AU - Ebbers, Janek
AU - Haeb-Umbach, Reinhold
AU - Wagner, Petra
ID - 15237
T2 - Proc. 10th ISCA Speech Synthesis Workshop
TI - Unsupervised Learning of a Disentangled Speech Representation for Voice Conversion
ER -