{"doi":"10.21437/Interspeech.2022-11408","page":"1486-1490","date_updated":"2023-11-15T12:17:04Z","type":"conference","publication":"Proc. Interspeech 2022","conference":{"name":"Interspeech 2022"},"language":[{"iso":"eng"}],"publication_status":"published","_id":"33958","title":"Utterance-by-utterance overlap-aware neural diarization with Graph-PIT","department":[{"_id":"54"}],"user_id":"49870","citation":{"bibtex":"@inproceedings{Kinoshita_von Neumann_Delcroix_Boeddeker_Haeb-Umbach_2022, title={Utterance-by-utterance overlap-aware neural diarization with Graph-PIT}, DOI={10.21437/Interspeech.2022-11408}, booktitle={Proc. Interspeech 2022}, publisher={ISCA}, author={Kinoshita, Keisuke and von Neumann, Thilo and Delcroix, Marc and Boeddeker, Christoph and Haeb-Umbach, Reinhold}, year={2022}, pages={1486–1490} }","mla":"Kinoshita, Keisuke, et al. “Utterance-by-Utterance Overlap-Aware Neural Diarization with Graph-PIT.” Proc. Interspeech 2022, ISCA, 2022, pp. 1486–90, doi:10.21437/Interspeech.2022-11408.","ama":"Kinoshita K, von Neumann T, Delcroix M, Boeddeker C, Haeb-Umbach R. Utterance-by-utterance overlap-aware neural diarization with Graph-PIT. In: Proc. Interspeech 2022. ISCA; 2022:1486-1490. doi:10.21437/Interspeech.2022-11408","ieee":"K. Kinoshita, T. von Neumann, M. Delcroix, C. Boeddeker, and R. Haeb-Umbach, “Utterance-by-utterance overlap-aware neural diarization with Graph-PIT,” in Proc. Interspeech 2022, 2022, pp. 1486–1490, doi: 10.21437/Interspeech.2022-11408.","short":"K. Kinoshita, T. von Neumann, M. Delcroix, C. Boeddeker, R. Haeb-Umbach, in: Proc. Interspeech 2022, ISCA, 2022, pp. 1486–1490.","apa":"Kinoshita, K., von Neumann, T., Delcroix, M., Boeddeker, C., & Haeb-Umbach, R. (2022). Utterance-by-utterance overlap-aware neural diarization with Graph-PIT. Proc. Interspeech 2022, 1486–1490. https://doi.org/10.21437/Interspeech.2022-11408","chicago":"Kinoshita, Keisuke, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker, and Reinhold Haeb-Umbach. “Utterance-by-Utterance Overlap-Aware Neural Diarization with Graph-PIT.” In Proc. Interspeech 2022, 1486–90. ISCA, 2022. https://doi.org/10.21437/Interspeech.2022-11408."},"date_created":"2022-10-28T12:07:57Z","status":"public","year":"2022","quality_controlled":"1","author":[{"full_name":"Kinoshita, Keisuke","last_name":"Kinoshita","first_name":"Keisuke"},{"full_name":"von Neumann, Thilo","id":"49870","last_name":"von Neumann","first_name":"Thilo","orcid":"https://orcid.org/0000-0002-7717-8670"},{"full_name":"Delcroix, Marc","last_name":"Delcroix","first_name":"Marc"},{"id":"40767","last_name":"Boeddeker","full_name":"Boeddeker, Christoph","first_name":"Christoph"},{"first_name":"Reinhold","id":"242","last_name":"Haeb-Umbach","full_name":"Haeb-Umbach, Reinhold"}],"publisher":"ISCA","abstract":[{"lang":"eng","text":"Recent speaker diarization studies showed that integration of end-to-end neural diarization (EEND) and clustering-based diarization is a promising approach for achieving state-of-the-art performance on various tasks. Such an approach first divides an observed signal into fixed-length segments, then performs {\\it segment-level} local diarization based on an EEND module, and merges the segment-level results via clustering to form a final global diarization result. The segmentation is done to limit the number of speakers in each segment since the current EEND cannot handle a large number of speakers. 
In this paper, we argue that such a segmentation-based approach has several issues; for example, it inevitably faces a dilemma in that a larger segment size increases both the context available for improving performance and the number of speakers that the local EEND module must handle. To resolve this problem, this paper proposes a novel framework that performs diarization without segmentation, yet can still handle challenging data containing many speakers and a significant amount of overlapping speech. The proposed method can take an entire meeting as input and perform {\\it utterance-by-utterance} diarization, i.e., it clusters utterance activities in terms of speakers. To this end, we leverage Graph-PIT, a neural network training scheme recently proposed for neural source separation. Experiments with simulated active-meeting-like data and CALLHOME data show the superiority of the proposed approach over conventional methods."}]}