{"year":"2022","title":"End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party","file":[{"content_type":"application/pdf","file_name":"End-to-End_Dereverberation_Beamforming_and_Speech_Recognition_in_A_Cocktail_Party.pdf","file_size":6167931,"date_updated":"2022-10-11T07:23:13Z","date_created":"2022-10-11T07:23:13Z","access_level":"open_access","relation":"main_file","creator":"huesera","file_id":"33674"}],"language":[{"iso":"eng"}],"file_date_updated":"2022-10-11T07:23:13Z","ddc":["000"],"date_updated":"2022-12-05T12:35:31Z","publication":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","user_id":"40767","publication_status":"published","type":"journal_article","status":"public","_id":"33669","publication_identifier":{"issn":["Print ISSN: 2329-9290 Electronic ISSN: 2329-9304"]},"oa":"1","citation":{"ieee":"W. Zhang, X. Chang, C. Boeddeker, T. Nakatani, S. Watanabe, and Y. Qian, “End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, doi: 10.1109/TASLP.2022.3209942.","bibtex":"@article{Zhang_Chang_Boeddeker_Nakatani_Watanabe_Qian_2022, title={End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party}, DOI={10.1109/TASLP.2022.3209942}, journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, author={Zhang, Wangyou and Chang, Xuankai and Boeddeker, Christoph and Nakatani, Tomohiro and Watanabe, Shinji and Qian, Yanmin}, year={2022} }","chicago":"Zhang, Wangyou, Xuankai Chang, Christoph Boeddeker, Tomohiro Nakatani, Shinji Watanabe, and Yanmin Qian. “End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party.” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022. https://doi.org/10.1109/TASLP.2022.3209942.","ama":"Zhang W, Chang X, Boeddeker C, Nakatani T, Watanabe S, Qian Y. 
End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party. IEEE/ACM Transactions on Audio, Speech, and Language Processing. Published online 2022. doi:10.1109/TASLP.2022.3209942","short":"W. Zhang, X. Chang, C. Boeddeker, T. Nakatani, S. Watanabe, Y. Qian, IEEE/ACM Transactions on Audio, Speech, and Language Processing (2022).","apa":"Zhang, W., Chang, X., Boeddeker, C., Nakatani, T., Watanabe, S., & Qian, Y. (2022). End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party. IEEE/ACM Transactions on Audio, Speech, and Language Processing. https://doi.org/10.1109/TASLP.2022.3209942","mla":"Zhang, Wangyou, et al. “End-to-End Dereverberation, Beamforming, and Speech Recognition in A Cocktail Party.” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022, doi:10.1109/TASLP.2022.3209942."},"has_accepted_license":"1","department":[{"_id":"54"}],"date_created":"2022-10-11T07:27:51Z","author":[{"first_name":"Wangyou","full_name":"Zhang, Wangyou","last_name":"Zhang"},{"first_name":"Xuankai","full_name":"Chang, Xuankai","last_name":"Chang"},{"id":"40767","last_name":"Boeddeker","first_name":"Christoph","full_name":"Boeddeker, Christoph"},{"last_name":"Nakatani","first_name":"Tomohiro","full_name":"Nakatani, Tomohiro"},{"last_name":"Watanabe","first_name":"Shinji","full_name":"Watanabe, Shinji"},{"first_name":"Yanmin","full_name":"Qian, Yanmin","last_name":"Qian"}],"doi":"10.1109/TASLP.2022.3209942","abstract":[{"lang":"eng","text":"Far-field multi-speaker automatic speech recognition (ASR) has drawn increasing attention in recent years. Most existing methods feature a signal processing frontend and an ASR backend. In realistic scenarios, these modules are usually trained separately or progressively, an approach that suffers from either inter-module mismatch or a complicated training process. 
In this paper, we propose an end-to-end multi-channel model that jointly optimizes the speech enhancement (including speech dereverberation, denoising, and separation) frontend and the ASR backend as a single system. To the best of our knowledge, this is the first work that proposes to optimize dereverberation, beamforming, and multi-speaker ASR in a fully end-to-end manner. The frontend module consists of a weighted prediction error (WPE) based submodule for dereverberation and a neural beamformer for denoising and speech separation. For the backend, we adopt a widely used end-to-end (E2E) ASR architecture. It is worth noting that the entire model is differentiable and can be optimized in a fully end-to-end manner using only the ASR criterion, without the need for parallel signal-level labels. We evaluate the proposed model on several multi-speaker benchmark datasets, and experimental results show that the fully E2E ASR model can achieve competitive performance in both noisy and reverberant conditions, with over 30% relative word error rate (WER) reduction over the single-channel baseline systems."}],"related_material":{"link":[{"relation":"confirmation","url":"https://ieeexplore.ieee.org/abstract/document/9904314"}]}}