Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation

K. Kinoshita, T. von Neumann, M. Delcroix, T. Nakatani, R. Haeb-Umbach, in: Proc. Interspeech 2020, 2020, pp. 2652–2656.

Download
OA INTERSPEECH_2020_vonNeumann1_Paper.pdf 1.73 MB
Conference Paper | English
Author
Kinoshita, Keisuke; von Neumann, Thilo; Delcroix, Marc; Nakatani, Tomohiro; Haeb-Umbach, Reinhold
Abstract
Recently, time-domain audio source separation based on the dual-path recurrent neural network (DPRNN) has greatly improved source separation performance. DPRNN is a simple but effective model for long sequential data. While DPRNN models utterance-length sequences (about 5 to 10 seconds of data) quite efficiently, it is harder to apply to longer sequences such as whole conversations consisting of multiple utterances, simply because the number of time steps consumed by its internal module, called the inter-chunk RNN, becomes extremely large in such cases. To mitigate this problem, this paper proposes the multi-path RNN (MPRNN), a generalized version of DPRNN that models the input data in a hierarchical manner. In the MPRNN framework, the input data is represented at several (≥ 3) time resolutions, each of which is modeled by a specific RNN sub-module. For example, the RNN sub-module that deals with the finest resolution may model temporal relationships only within a phoneme, while the RNN sub-module handling the coarsest resolution may capture only relationships between utterances, such as speaker information. We perform experiments using simulated dialogue-like mixtures and show that MPRNN has greater model capacity and outperforms the current state-of-the-art DPRNN framework, especially in online processing scenarios.
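The step-count argument in the abstract can be illustrated with a small calculation. The sketch below is not taken from the paper: the function name `path_lengths`, the frame count, and the segment sizes are hypothetical, and it assumes non-overlapping segmentation with the last segment padded, which is a simplification. It only shows how adding a hierarchy level shortens the sequence seen by the coarsest RNN path.

```python
import math


def path_lengths(num_frames, segment_sizes):
    """Return the number of time steps each RNN path would process.

    segment_sizes lists the segment size at each level, finest first.
    Assumption (not from the paper): segments do not overlap and the
    last segment at each level is padded to full size.
    """
    lengths = []
    remaining = num_frames
    for size in segment_sizes:
        # Steps for the RNN that runs inside one segment at this level.
        lengths.append(min(size, remaining))
        # Number of segments handed on to the next, coarser level.
        remaining = math.ceil(remaining / size)
    # The coarsest path runs over whatever is left.
    lengths.append(remaining)
    return lengths


# Hypothetical input of 240,000 encoder frames.
dual_path = path_lengths(240_000, [100])        # two paths
multi_path = path_lengths(240_000, [100, 50])   # three paths
print(dual_path)   # [100, 2400]
print(multi_path)  # [100, 50, 48]
```

With these illustrative numbers, the coarsest path of the two-level split has to step through 2400 segments, whereas the three-level split leaves only 48 steps at the top level, at the cost of one extra 50-step path in between.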
Publishing Year
2020
Proceedings Title
Proc. Interspeech 2020
Page
2652-2656
LibreCat-ID

Cite this

Kinoshita K, von Neumann T, Delcroix M, Nakatani T, Haeb-Umbach R. Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation. In: Proc. Interspeech 2020. 2020:2652-2656. doi:10.21437/Interspeech.2020-2388
Kinoshita, K., von Neumann, T., Delcroix, M., Nakatani, T., & Haeb-Umbach, R. (2020). Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation. Proc. Interspeech 2020, 2652–2656. https://doi.org/10.21437/Interspeech.2020-2388
@inproceedings{Kinoshita_von_Neumann_Delcroix_Nakatani_Haeb-Umbach_2020, title={Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation}, DOI={10.21437/Interspeech.2020-2388}, booktitle={Proc. Interspeech 2020}, author={Kinoshita, Keisuke and von Neumann, Thilo and Delcroix, Marc and Nakatani, Tomohiro and Haeb-Umbach, Reinhold}, year={2020}, pages={2652--2656} }
Kinoshita, Keisuke, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, and Reinhold Haeb-Umbach. “Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and Its Application to Speaker Stream Separation.” In Proc. Interspeech 2020, 2652–56, 2020. https://doi.org/10.21437/Interspeech.2020-2388.
K. Kinoshita, T. von Neumann, M. Delcroix, T. Nakatani, and R. Haeb-Umbach, “Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation,” in Proc. Interspeech 2020, 2020, pp. 2652–2656, doi: 10.21437/Interspeech.2020-2388.
Kinoshita, Keisuke, et al. “Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and Its Application to Speaker Stream Separation.” Proc. Interspeech 2020, 2020, pp. 2652–56, doi:10.21437/Interspeech.2020-2388.
Main File(s)
Access Level
OA Open Access
Last Uploaded
2020-12-16T14:16:32Z

