{"status":"public","year":"2020","author":[{"last_name":"von Neumann","id":"49870","full_name":"von Neumann, Thilo","first_name":"Thilo","orcid":"https://orcid.org/0000-0002-7717-8670"},{"first_name":"Keisuke","full_name":"Kinoshita, Keisuke","last_name":"Kinoshita"},{"full_name":"Drude, Lukas","last_name":"Drude","first_name":"Lukas"},{"last_name":"Boeddeker","id":"40767","full_name":"Boeddeker, Christoph","first_name":"Christoph"},{"first_name":"Marc","last_name":"Delcroix","full_name":"Delcroix, Marc"},{"full_name":"Nakatani, Tomohiro","last_name":"Nakatani","first_name":"Tomohiro"},{"last_name":"Haeb-Umbach","id":"242","full_name":"Haeb-Umbach, Reinhold","first_name":"Reinhold"}],"title":"End-to-End Training of Time Domain Audio Separation and Recognition","_id":"20762","department":[{"_id":"54"}],"user_id":"49870","citation":{"ieee":"T. von Neumann et al., “End-to-End Training of Time Domain Audio Separation and Recognition,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7004–7008, doi: 10.1109/ICASSP40776.2020.9053461.","ama":"von Neumann T, Kinoshita K, Drude L, et al. End-to-End Training of Time Domain Audio Separation and Recognition. In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). ; 2020:7004-7008. doi:10.1109/ICASSP40776.2020.9053461","mla":"von Neumann, Thilo, et al. “End-to-End Training of Time Domain Audio Separation and Recognition.” ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7004–08, doi:10.1109/ICASSP40776.2020.9053461.","apa":"von Neumann, T., Kinoshita, K., Drude, L., Boeddeker, C., Delcroix, M., Nakatani, T., & Haeb-Umbach, R. (2020). End-to-End Training of Time Domain Audio Separation and Recognition. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7004–7008. https://doi.org/10.1109/ICASSP40776.2020.9053461","bibtex":"@inproceedings{vonNeumann_Kinoshita_Drude_Boeddeker_Delcroix_Nakatani_Haeb-Umbach_2020, title={End-to-End Training of Time Domain Audio Separation and Recognition}, DOI={10.1109/ICASSP40776.2020.9053461}, booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)}, author={von Neumann, Thilo and Kinoshita, Keisuke and Drude, Lukas and Boeddeker, Christoph and Delcroix, Marc and Nakatani, Tomohiro and Haeb-Umbach, Reinhold}, year={2020}, pages={7004–7008} }","short":"T. von Neumann, K. Kinoshita, L. Drude, C. Boeddeker, M. Delcroix, T. Nakatani, R. Haeb-Umbach, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7004–7008.","chicago":"Neumann, Thilo von, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker, Marc Delcroix, Tomohiro Nakatani, and Reinhold Haeb-Umbach. “End-to-End Training of Time Domain Audio Separation and Recognition.” In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7004–8, 2020. https://doi.org/10.1109/ICASSP40776.2020.9053461."},"language":[{"iso":"eng"}],"date_updated":"2023-11-15T12:17:45Z","type":"conference","publication":"ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","has_accepted_license":"1","project":[{"name":"Computing Resources Provided by the Paderborn Center for Parallel Computing","_id":"52"}],"quality_controlled":"1","oa":"1","abstract":[{"lang":"eng","text":"The rising interest in single-channel multi-speaker speech separation sparked development of End-to-End (E2E) approaches to multi-speaker speech recognition. However, up until now, state-of-the-art neural network–based time domain source separation has not yet been combined with E2E speech recognition. We here demonstrate how to combine a separation module based on a Convolutional Time domain Audio Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train such a model jointly by distributing it over multiple GPUs or by approximating truncated back-propagation for the convolutional front-end. To put this work into perspective and illustrate the complexity of the design space, we provide a compact overview of single-channel multi-speaker recognition systems. Our experiments show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time domain model can yield substantial improvements over cascade DNN-HMM and monolithic E2E frequency domain systems proposed so far."}],"file_date_updated":"2020-12-16T14:09:48Z","date_created":"2020-12-16T14:07:54Z","ddc":["000"],"doi":"10.1109/ICASSP40776.2020.9053461","page":"7004-7008","file":[{"file_id":"20763","file_name":"ICASSP_2020_vonNeumann_Paper.pdf","date_created":"2020-12-16T14:09:48Z","date_updated":"2020-12-16T14:09:48Z","creator":"huesera","file_size":192529,"content_type":"application/pdf","access_level":"open_access","relation":"main_file"}]}