{"publication":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","department":[{"_id":"54"}],"citation":{"short":"W. Zhang, X. Chang, C. Boeddeker, T. Nakatani, S. Watanabe, Y. Qian, IEEE/ACM Transactions on Audio, Speech, and Language Processing (2022).","apa":"Zhang, W., Chang, X., Boeddeker, C., Nakatani, T., Watanabe, S., & Qian, Y. (2022). End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party. IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://doi.org/10.1109/TASLP.2022.3209942","mla":"Zhang, Wangyou, et al. “End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party.” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, doi:10.1109/TASLP.2022.3209942.","ama":"Zhang W, Chang X, Boeddeker C, Nakatani T, Watanabe S, Qian Y. End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party. IEEE/ACM Transactions on Audio, Speech, and Language Processing. Published online 2022. doi:10.1109/TASLP.2022.3209942","bibtex":"@article{Zhang_Chang_Boeddeker_Nakatani_Watanabe_Qian_2022, title={End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party}, DOI={10.1109/TASLP.2022.3209942}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, author={Zhang, Wangyou and Chang, Xuankai and Boeddeker, Christoph and Nakatani, Tomohiro and Watanabe, Shinji and Qian, Yanmin}, year={2022} }","ieee":"W. Zhang, X. Chang, C. Boeddeker, T. Nakatani, S. Watanabe, and Y. Qian, “End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, doi: 10.1109/TASLP.2022.3209942.","chicago":"Zhang, Wangyou, Xuankai Chang, Christoph Boeddeker, Tomohiro Nakatani, Shinji Watanabe, and Yanmin Qian. 
“End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party.” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022. https://doi.org/10.1109/TASLP.2022.3209942."},"author":[{"last_name":"Zhang","full_name":"Zhang, Wangyou","first_name":"Wangyou"},{"full_name":"Chang, Xuankai","last_name":"Chang","first_name":"Xuankai"},{"last_name":"Boeddeker","id":"40767","full_name":"Boeddeker, Christoph","first_name":"Christoph"},{"last_name":"Nakatani","full_name":"Nakatani, Tomohiro","first_name":"Tomohiro"},{"first_name":"Shinji","last_name":"Watanabe","full_name":"Watanabe, Shinji"},{"last_name":"Qian","full_name":"Qian, Yanmin","first_name":"Yanmin"}],"language":[{"iso":"eng"}],"date_created":"2022-10-11T07:27:51Z","user_id":"40767","date_updated":"2022-12-05T12:35:31Z","has_accepted_license":"1","file":[{"relation":"main_file","file_name":"End-to-End_Dereverberation_Beamforming_and_Speech_Recognition_in_A_Cocktail_Party.pdf","date_updated":"2022-10-11T07:23:13Z","content_type":"application/pdf","creator":"huesera","date_created":"2022-10-11T07:23:13Z","file_id":"33674","access_level":"open_access","file_size":6167931}],"publication_identifier":{"issn":["2329-9290","2329-9304"]},"abstract":[{"lang":"eng","text":"Far-field multi-speaker automatic speech recognition (ASR) has drawn increasing attention in recent years. Most existing methods feature a signal processing frontend and an ASR backend. In realistic scenarios, these modules are usually trained separately or progressively, which leads to either inter-module mismatch or a complicated training process. In this paper, we propose an end-to-end multi-channel model that jointly optimizes the speech enhancement (including speech dereverberation, denoising, and separation) frontend and the ASR backend as a single system. 
To the best of our knowledge, this is the first work that proposes to optimize dereverberation, beamforming, and multi-speaker ASR in a fully end-to-end manner. The frontend module consists of a weighted prediction error (WPE) based submodule for dereverberation and a neural beamformer for denoising and speech separation. For the backend, we adopt a widely used end-to-end (E2E) ASR architecture. It is worth noting that the entire model is differentiable and can be optimized in a fully end-to-end manner using only the ASR criterion, without the need for parallel signal-level labels. We evaluate the proposed model on several multi-speaker benchmark datasets, and experimental results show that the fully E2E ASR model can achieve competitive performance in both noisy and reverberant conditions, with over 30% relative word error rate (WER) reduction over the single-channel baseline systems."}],"publication_status":"published","file_date_updated":"2022-10-11T07:23:13Z","oa":"1","status":"public","title":"End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party","ddc":["000"],"_id":"33669","doi":"10.1109/TASLP.2022.3209942","year":"2022","related_material":{"link":[{"url":"https://ieeexplore.ieee.org/abstract/document/9904314","relation":"confirmation"}]},"type":"journal_article"}