TY  - JOUR
AB  - For an environment to be perceived as smart, contextual information has to be gathered to adapt the system's behavior and its interface towards the user. Being a rich source of context information, speech can be acquired unobtrusively by microphone arrays and then processed to extract information about the user and his environment. In this paper, a system for joint temporal segmentation, speaker localization, and identification is presented, which is supported by face identification from video data obtained from a steerable camera. Special attention is paid to latency aspects and online processing capabilities, as they are important for the application under investigation, namely ambient communication. Ambient communication describes the vision of terminal-less, session-less, and multi-modal telecommunication with remote partners, where the user can move freely within his home while the communication follows him. The speaker diarization serves as a context source, which has been integrated into a service-oriented middleware architecture and provided to the application to select the most appropriate I/O device and to steer the camera towards the speaker during ambient communication.
AU  - Schmalenstroeer, Joerg
AU  - Haeb-Umbach, Reinhold
ID  - 11892
IS  - 5
JF  - IEEE Journal of Selected Topics in Signal Processing
KW  - audio streaming
KW  - audio visual data streaming
KW  - context information speech
KW  - face identification
KW  - face recognition
KW  - image segmentation
KW  - middleware
KW  - multimodal telecommunication
KW  - online diarization
KW  - service oriented middleware architecture
KW  - sessionless telecommunication
KW  - software architecture
KW  - speaker identification
KW  - speaker localization
KW  - speaker recognition
KW  - steerable camera
KW  - telecommunication computing
KW  - temporal segmentation
KW  - terminal-less telecommunication
KW  - video streaming
TI  - Online Diarization of Streaming Audio-Visual Data for Smart Environments
VL  - 4
ER  -