End-to-End Training of Time Domain Audio Separation and Recognition
T. von Neumann, K. Kinoshita, L. Drude, C. Boeddeker, M. Delcroix, T. Nakatani, R. Haeb-Umbach, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7004–7008.
Conference Paper | English
Author
von Neumann, Thilo;
Kinoshita, Keisuke;
Drude, Lukas;
Boeddeker, Christoph;
Delcroix, Marc;
Nakatani, Tomohiro;
Haeb-Umbach, Reinhold
Abstract
The rising interest in single-channel multi-speaker speech separation has sparked the development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network-based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer, and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.
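For orientation, the following is a minimal, self-contained PyTorch sketch of the overall idea described in the abstract: a Conv-TasNet-style time-domain separator feeding an E2E recognizer, trained jointly end-to-end with a permutation-invariant ASR loss. This is not the authors' code; the tiny module sizes, the CTC-based stand-in recognizer, and all names (TinySeparator, TinyRecognizer, permutation_invariant_ctc) are illustrative assumptions. The multi-GPU distribution and truncated back-propagation tricks mentioned in the abstract are omitted.

```python
# Minimal sketch (not the paper's implementation) of joint time-domain
# separation + recognition: Conv-TasNet-like separator -> per-source E2E
# recognizer, trained with a permutation-invariant ASR-level loss.
from itertools import permutations

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinySeparator(nn.Module):
    """Drastically reduced Conv-TasNet-style separator: learned 1-D conv
    encoder, mask estimation network, transposed-conv decoder back to waveform."""

    def __init__(self, n_src=2, n_filters=64, kernel=16, stride=8):
        super().__init__()
        self.n_src = n_src
        self.encoder = nn.Conv1d(1, n_filters, kernel, stride=stride)
        self.mask_net = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.ReLU(),
            nn.Conv1d(n_filters, n_src * n_filters, 1),
        )
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel, stride=stride)

    def forward(self, mixture):                              # (B, T)
        feats = F.relu(self.encoder(mixture.unsqueeze(1)))   # (B, F, T')
        masks = torch.sigmoid(self.mask_net(feats))
        masks = masks.view(feats.size(0), self.n_src, feats.size(1), -1)
        est = [self.decoder(masks[:, s] * feats).squeeze(1)  # mask, then decode
               for s in range(self.n_src)]
        return torch.stack(est, dim=1)                       # (B, n_src, T)


class TinyRecognizer(nn.Module):
    """Stand-in E2E recognizer (conv front-end + BLSTM + CTC output layer)."""

    def __init__(self, n_tokens=30, n_filters=64, hidden=128):
        super().__init__()
        self.frontend = nn.Conv1d(1, n_filters, 25, stride=10)
        self.blstm = nn.LSTM(n_filters, hidden, batch_first=True, bidirectional=True)
        self.output = nn.Linear(2 * hidden, n_tokens)

    def forward(self, wave):                                 # (B, T)
        x = F.relu(self.frontend(wave.unsqueeze(1))).transpose(1, 2)
        x, _ = self.blstm(x)
        return self.output(x).log_softmax(-1)                # (B, T'', n_tokens)


def permutation_invariant_ctc(logits_per_src, targets, target_lens, blank=0):
    """ASR-level PIT: evaluate every speaker/transcription assignment, keep the best."""
    best = None
    for perm in permutations(range(len(logits_per_src))):
        loss = 0.0
        for est_idx, ref_idx in enumerate(perm):
            log_probs = logits_per_src[est_idx].transpose(0, 1)   # (T'', B, V)
            in_lens = torch.full((log_probs.size(1),), log_probs.size(0),
                                 dtype=torch.long)
            loss = loss + F.ctc_loss(log_probs, targets[ref_idx],
                                     in_lens, target_lens[ref_idx], blank=blank)
        best = loss if best is None else torch.minimum(best, loss)
    return best


if __name__ == "__main__":
    separator, recognizer = TinySeparator(), TinyRecognizer()
    mixture = torch.randn(2, 16000)                          # batch of 1 s mixtures
    targets = [torch.randint(1, 30, (2, 12)) for _ in range(2)]
    target_lens = [torch.full((2,), 12, dtype=torch.long) for _ in range(2)]

    est_sources = separator(mixture)                         # (B, 2, T)
    logits = [recognizer(est_sources[:, s]) for s in range(2)]
    loss = permutation_invariant_ctc(logits, targets, target_lens)
    loss.backward()  # gradients flow through the recognizer into the separator
```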
Publishing Year
2020
Proceedings Title
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Page
7004-7008
Cite this
von Neumann T, Kinoshita K, Drude L, et al. End-to-End Training of Time Domain Audio Separation and Recognition. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2020:7004-7008. doi:10.1109/ICASSP40776.2020.9053461
von Neumann, T., Kinoshita, K., Drude, L., Boeddeker, C., Delcroix, M., Nakatani, T., & Haeb-Umbach, R. (2020). End-to-End Training of Time Domain Audio Separation and Recognition. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7004–7008. https://doi.org/10.1109/ICASSP40776.2020.9053461
@inproceedings{vonNeumann_Kinoshita_Drude_Boeddeker_Delcroix_Nakatani_Haeb-Umbach_2020, title={End-to-End Training of Time Domain Audio Separation and Recognition}, DOI={10.1109/ICASSP40776.2020.9053461}, booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, author={von Neumann, Thilo and Kinoshita, Keisuke and Drude, Lukas and Boeddeker, Christoph and Delcroix, Marc and Nakatani, Tomohiro and Haeb-Umbach, Reinhold}, year={2020}, pages={7004–7008} }
Neumann, Thilo von, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, and Reinhold Haeb-Umbach. “End-to-End Training of Time Domain Audio Separation and Recognition.” In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7004–8, 2020. https://doi.org/10.1109/ICASSP40776.2020.9053461.
T. von Neumann et al., “End-to-End Training of Time Domain Audio Separation and Recognition,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7004–7008, doi: 10.1109/ICASSP40776.2020.9053461.
von Neumann, Thilo, et al. “End-to-End Training of Time Domain Audio Separation and Recognition.” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7004–08, doi:10.1109/ICASSP40776.2020.9053461.
All files available under the following license(s):
Creative Commons Public Domain Dedication (CC0 1.0)
Main File(s)
File Name
ICASSP_2020_vonNeumann_Paper.pdf
192.53 KB
Access Level
Open Access
Last Uploaded
2020-12-16T14:09:48Z