Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation

Kinoshita, Keisuke; von Neumann, Thilo; Delcroix, Marc; Nakatani, Tomohiro; Haeb-Umbach, Reinhold

Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation

K. Kinoshita, T. von Neumann, M. Delcroix, T. Nakatani, R. Haeb-Umbach, in: Proc. Interspeech 2020, 2020, pp. 2652–2656.

Download

INTERSPEECH_2020_vonNeumann1_Paper.pdf 1.73 MB

DOI

10.21437/Interspeech.2020-2388

Conference Paper | English

Author

Kinoshita, Keisuke; von Neumann, Thilo^LibreCat ; Delcroix, Marc; Nakatani, Tomohiro; Haeb-Umbach, Reinhold^LibreCat

Department

Nachrichtentechnik (NT) / Heinz Nixdorf Institut

Abstract

Recently, the source separation performance was greatly improved by time-domain audio source separation based on dual-path recurrent neural network (DPRNN). DPRNN is a simple but effective model for a long sequential data. While DPRNN is quite efficient in modeling a sequential data of the length of an utterance, i.e., about 5 to 10 second data, it is harder to apply it to longer sequences such as whole conversations consisting of multiple utterances. It is simply because, in such a case, the number of time steps consumed by its internal module called inter-chunk RNN becomes extremely large. To mitigate this problem, this paper proposes a multi-path RNN (MPRNN), a generalized version of DPRNN, that models the input data in a hierarchical manner. In the MPRNN framework, the input data is represented at several (>_ 3) time-resolutions, each of which is modeled by a specific RNN sub-module. For example, the RNN sub-module that deals with the finest resolution may model temporal relationship only within a phoneme, while the RNN sub-module handling the most coarse resolution may capture only the relationship between utterances such as speaker information. We perform experiments using simulated dialogue-like mixtures and show that MPRNN has greater model capacity, and it outperforms the current state-of-the-art DPRNN framework especially in online processing scenarios.

Publishing Year

2020

Proceedings Title

Proc. Interspeech 2020

Page

2652-2656

LibreCat-ID

20766

Cite this

Kinoshita K, von Neumann T, Delcroix M, Nakatani T, Haeb-Umbach R. Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation. In: Proc. Interspeech 2020. ; 2020:2652-2656. doi:10.21437/Interspeech.2020-2388

Kinoshita, K., von Neumann, T., Delcroix, M., Nakatani, T., & Haeb-Umbach, R. (2020). Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation. Proc. Interspeech 2020, 2652–2656. https://doi.org/10.21437/Interspeech.2020-2388

@inproceedings{Kinoshita_von Neumann_Delcroix_Nakatani_Haeb-Umbach_2020, title={Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation}, DOI={10.21437/Interspeech.2020-2388}, booktitle={Proc. Interspeech 2020}, author={Kinoshita, Keisuke and von Neumann, Thilo and Delcroix, Marc and Nakatani, Tomohiro and Haeb-Umbach, Reinhold}, year={2020}, pages={2652–2656} }

Kinoshita, Keisuke, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani, and Reinhold Haeb-Umbach. “Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and Its Application to Speaker Stream Separation.” In Proc. Interspeech 2020, 2652–56, 2020. https://doi.org/10.21437/Interspeech.2020-2388.

K. Kinoshita, T. von Neumann, M. Delcroix, T. Nakatani, and R. Haeb-Umbach, “Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker Stream Separation,” in Proc. Interspeech 2020, 2020, pp. 2652–2656, doi: 10.21437/Interspeech.2020-2388.

Kinoshita, Keisuke, et al. “Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and Its Application to Speaker Stream Separation.” Proc. Interspeech 2020, 2020, pp. 2652–56, doi:10.21437/Interspeech.2020-2388.

All files available under the following license(s):

Creative Commons Public Domain Dedication (CC0 1.0):

https://creativecommons.org/publicdomain/zero/1.0/
https://creativecommons.org/publicdomain/zero/1.0/legalcode

Main File(s)

File Name

INTERSPEECH_2020_vonNeumann1_Paper.pdf 1.73 MB

Access Level

Open Access

Last Uploaded

2020-12-16T14:16:32Z

Export

Marked Publications

Open Data LibreCat

Search this title in

Google Scholar