{"user_id":"460","citation":{"mla":"Schmalenstroeer, Joerg, et al. “Fusing Audio and Video Information for Online Speaker Diarization.” Interspeech 2009, 2009.","ama":"Schmalenstroeer J, Kelling M, Leutnant V, Haeb-Umbach R. Fusing Audio and Video Information for Online Speaker Diarization. In: Interspeech 2009. ; 2009.","ieee":"J. Schmalenstroeer, M. Kelling, V. Leutnant, and R. Haeb-Umbach, “Fusing Audio and Video Information for Online Speaker Diarization,” 2009.","apa":"Schmalenstroeer, J., Kelling, M., Leutnant, V., & Haeb-Umbach, R. (2009). Fusing Audio and Video Information for Online Speaker Diarization. Interspeech 2009.","bibtex":"@inproceedings{Schmalenstroeer_Kelling_Leutnant_Haeb-Umbach_2009, title={Fusing Audio and Video Information for Online Speaker Diarization}, booktitle={Interspeech 2009}, author={Schmalenstroeer, Joerg and Kelling, Martin and Leutnant, Volker and Haeb-Umbach, Reinhold}, year={2009} }","chicago":"Schmalenstroeer, Joerg, Martin Kelling, Volker Leutnant, and Reinhold Haeb-Umbach. “Fusing Audio and Video Information for Online Speaker Diarization.” In Interspeech 2009, 2009.","short":"J. Schmalenstroeer, M. Kelling, V. Leutnant, R. Haeb-Umbach, in: Interspeech 2009, 2009."},"date_created":"2019-07-12T05:30:24Z","title":"Fusing Audio and Video Information for Online Speaker Diarization","_id":"11899","department":[{"_id":"54"}],"quality_controlled":"1","oa":"1","author":[{"first_name":"Joerg","id":"460","last_name":"Schmalenstroeer","full_name":"Schmalenstroeer, Joerg"},{"full_name":"Kelling, Martin","last_name":"Kelling","first_name":"Martin"},{"first_name":"Volker","full_name":"Leutnant, Volker","last_name":"Leutnant"},{"full_name":"Haeb-Umbach, Reinhold","id":"242","last_name":"Haeb-Umbach","first_name":"Reinhold"}],"abstract":[{"lang":"eng","text":"In this paper we present a system for identifying and localizing speakers using distant microphone arrays and a steerable pan-tilt-zoom camera. 
Audio and video streams are processed in real-time to obtain the diarization information “who speaks when and where” with low latency to be used in advanced videoconferencing systems or user-adaptive interfaces. A key feature of the proposed system is to first glean information about the speaker's location and identity from the audio and visual data streams separately and then to fuse these data in a probabilistic framework employing the Viterbi algorithm. Here, visual evidence of a person is utilized through a priori state probabilities, while location and speaker change information are employed via time-variant transition probabilities. Experiments show that video information yields a substantial improvement compared to pure audio-based diarization."}],"status":"public","year":"2009","date_updated":"2023-10-26T08:10:10Z","type":"conference","publication":"Interspeech 2009","language":[{"iso":"eng"}],"main_file_link":[{"url":"https://groups.uni-paderborn.de/nt/pubs/2009/ScKeLeHa09.pdf","open_access":"1"}]}