<?xml version="1.0" encoding="UTF-8"?>

<modsCollection xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.loc.gov/mods/v3" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-3.xsd">
<mods version="3.3">

<genre>conference paper</genre>

<titleInfo><title>End-to-End Training of Time Domain Audio Separation and Recognition</title></titleInfo>




<note type="qualityControlled">yes</note>

<name type="personal">
  <namePart type="given">Thilo</namePart>
  <namePart type="family">von Neumann</namePart>
  <role><roleTerm type="text">author</roleTerm> </role><identifier type="local">49870</identifier><description xsi:type="identifierDefinition" type="orcid">https://orcid.org/0000-0002-7717-8670</description></name>
<name type="personal">
  <namePart type="given">Keisuke</namePart>
  <namePart type="family">Kinoshita</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Lukas</namePart>
  <namePart type="family">Drude</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Christoph</namePart>
  <namePart type="family">Boeddeker</namePart>
  <role><roleTerm type="text">author</roleTerm> </role><identifier type="local">40767</identifier></name>
<name type="personal">
  <namePart type="given">Marc</namePart>
  <namePart type="family">Delcroix</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Tomohiro</namePart>
  <namePart type="family">Nakatani</namePart>
  <role><roleTerm type="text">author</roleTerm> </role></name>
<name type="personal">
  <namePart type="given">Reinhold</namePart>
  <namePart type="family">Haeb-Umbach</namePart>
  <role><roleTerm type="text">author</roleTerm> </role><identifier type="local">242</identifier></name>







<name type="corporate">
  <namePart></namePart>
  <identifier type="local">54</identifier>
  <role>
    <roleTerm type="text">department</roleTerm>
  </role>
</name>





<name type="corporate">
  <namePart>Computing Resources Provided by the Paderborn Center for Parallel Computing</namePart>
  <role><roleTerm type="text">project</roleTerm></role>
</name>



<abstract lang="eng">The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network–based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far.</abstract>

<relatedItem type="constituent">
  <location>
    <url displayLabel="ICASSP_2020_vonNeumann_Paper.pdf">https://ris.uni-paderborn.de/download/20762/20763/ICASSP_2020_vonNeumann_Paper.pdf</url>
  </location>
  <physicalDescription><internetMediaType>application/pdf</internetMediaType></physicalDescription><accessCondition type="restrictionOnAccess">no</accessCondition>
</relatedItem>
<originInfo><dateIssued encoding="w3cdtf">2020</dateIssued>
</originInfo>
<language><languageTerm authority="iso639-2b" type="code">eng</languageTerm>
</language>



<relatedItem type="host"><titleInfo><title>ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</title></titleInfo><identifier type="doi">10.1109/ICASSP40776.2020.9053461</identifier>
<part><extent unit="pages">7004-7008</extent>
</part>
</relatedItem>


<extension>
<bibliographicCitation>
<ieee>T. von Neumann &lt;i&gt;et al.&lt;/i&gt;, “End-to-End Training of Time Domain Audio Separation and Recognition,” in &lt;i&gt;ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)&lt;/i&gt;, 2020, pp. 7004–7008, doi: &lt;a href=&quot;https://doi.org/10.1109/ICASSP40776.2020.9053461&quot;&gt;10.1109/ICASSP40776.2020.9053461&lt;/a&gt;.</ieee>
<chicago>Neumann, Thilo von, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, and Reinhold Haeb-Umbach. “End-to-End Training of Time Domain Audio Separation and Recognition.” In &lt;i&gt;ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)&lt;/i&gt;, 7004–8, 2020. &lt;a href=&quot;https://doi.org/10.1109/ICASSP40776.2020.9053461&quot;&gt;https://doi.org/10.1109/ICASSP40776.2020.9053461&lt;/a&gt;.</chicago>
<ama>von Neumann T, Kinoshita K, Drude L, et al. End-to-End Training of Time Domain Audio Separation and Recognition. In: &lt;i&gt;ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)&lt;/i&gt;. ; 2020:7004-7008. doi:&lt;a href=&quot;https://doi.org/10.1109/ICASSP40776.2020.9053461&quot;&gt;10.1109/ICASSP40776.2020.9053461&lt;/a&gt;</ama>
<mla>von Neumann, Thilo, et al. “End-to-End Training of Time Domain Audio Separation and Recognition.” &lt;i&gt;ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)&lt;/i&gt;, 2020, pp. 7004–08, doi:&lt;a href=&quot;https://doi.org/10.1109/ICASSP40776.2020.9053461&quot;&gt;10.1109/ICASSP40776.2020.9053461&lt;/a&gt;.</mla>
<bibtex>@inproceedings{vonNeumann_Kinoshita_Drude_Boeddeker_Delcroix_Nakatani_Haeb-Umbach_2020, title={End-to-End Training of Time Domain Audio Separation and Recognition}, DOI={&lt;a href=&quot;https://doi.org/10.1109/ICASSP40776.2020.9053461&quot;&gt;10.1109/ICASSP40776.2020.9053461&lt;/a&gt;}, booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, author={von Neumann, Thilo and Kinoshita, Keisuke and Drude, Lukas and Boeddeker, Christoph and Delcroix, Marc and Nakatani, Tomohiro and Haeb-Umbach, Reinhold}, year={2020}, pages={7004–7008} }</bibtex>
<short>T. von Neumann, K. Kinoshita, L. Drude, C. Boeddeker, M. Delcroix, T. Nakatani, R. Haeb-Umbach, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7004–7008.</short>
<apa>von Neumann, T., Kinoshita, K., Drude, L., Boeddeker, C., Delcroix, M., Nakatani, T., &amp;#38; Haeb-Umbach, R. (2020). End-to-End Training of Time Domain Audio Separation and Recognition. &lt;i&gt;ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)&lt;/i&gt;, 7004–7008. &lt;a href=&quot;https://doi.org/10.1109/ICASSP40776.2020.9053461&quot;&gt;https://doi.org/10.1109/ICASSP40776.2020.9053461&lt;/a&gt;</apa>
</bibliographicCitation>
</extension>
<recordInfo><recordIdentifier>20762</recordIdentifier><recordCreationDate encoding="w3cdtf">2020-12-16T14:07:54Z</recordCreationDate><recordChangeDate encoding="w3cdtf">2023-11-15T12:17:45Z</recordChangeDate>
</recordInfo>
</mods>
</modsCollection>
