---
_id: '56004'
author:
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Tobias
  full_name: Cord-Landwehr, Tobias
  id: '44393'
  last_name: Cord-Landwehr
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'von Neumann T, Boeddeker C, Cord-Landwehr T, Delcroix M, Haeb-Umbach R. Meeting
    Recognition with Continuous Speech Separation and Transcription-Supported Diarization.
    In: <i>2024 IEEE International Conference on Acoustics, Speech, and Signal Processing
    Workshops (ICASSPW)</i>. IEEE; 2024. doi:<a href="https://doi.org/10.1109/icasspw62465.2024.10625894">10.1109/icasspw62465.2024.10625894</a>'
  apa: von Neumann, T., Boeddeker, C., Cord-Landwehr, T., Delcroix, M., &#38; Haeb-Umbach,
    R. (2024). Meeting Recognition with Continuous Speech Separation and Transcription-Supported
    Diarization. <i>2024 IEEE International Conference on Acoustics, Speech, and Signal
    Processing Workshops (ICASSPW)</i>. <a href="https://doi.org/10.1109/icasspw62465.2024.10625894">https://doi.org/10.1109/icasspw62465.2024.10625894</a>
  bibtex: '@inproceedings{von Neumann_Boeddeker_Cord-Landwehr_Delcroix_Haeb-Umbach_2024,
    title={Meeting Recognition with Continuous Speech Separation and Transcription-Supported
    Diarization}, DOI={<a href="https://doi.org/10.1109/icasspw62465.2024.10625894">10.1109/icasspw62465.2024.10625894</a>},
    booktitle={2024 IEEE International Conference on Acoustics, Speech, and Signal
    Processing Workshops (ICASSPW)}, publisher={IEEE}, author={von Neumann, Thilo
    and Boeddeker, Christoph and Cord-Landwehr, Tobias and Delcroix, Marc and Haeb-Umbach,
    Reinhold}, year={2024} }'
  chicago: Neumann, Thilo von, Christoph Boeddeker, Tobias Cord-Landwehr, Marc Delcroix,
    and Reinhold Haeb-Umbach. “Meeting Recognition with Continuous Speech Separation
    and Transcription-Supported Diarization.” In <i>2024 IEEE International Conference
    on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)</i>. IEEE, 2024.
    <a href="https://doi.org/10.1109/icasspw62465.2024.10625894">https://doi.org/10.1109/icasspw62465.2024.10625894</a>.
  ieee: 'T. von Neumann, C. Boeddeker, T. Cord-Landwehr, M. Delcroix, and R. Haeb-Umbach,
    “Meeting Recognition with Continuous Speech Separation and Transcription-Supported
    Diarization,” 2024, doi: <a href="https://doi.org/10.1109/icasspw62465.2024.10625894">10.1109/icasspw62465.2024.10625894</a>.'
  mla: von Neumann, Thilo, et al. “Meeting Recognition with Continuous Speech Separation
    and Transcription-Supported Diarization.” <i>2024 IEEE International Conference
    on Acoustics, Speech, and Signal Processing Workshops (ICASSPW)</i>, IEEE, 2024,
    doi:<a href="https://doi.org/10.1109/icasspw62465.2024.10625894">10.1109/icasspw62465.2024.10625894</a>.
  short: 'T. von Neumann, C. Boeddeker, T. Cord-Landwehr, M. Delcroix, R. Haeb-Umbach,
    in: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing
    Workshops (ICASSPW), IEEE, 2024.'
date_created: 2024-09-04T07:26:02Z
date_updated: 2025-02-12T09:20:07Z
ddc:
- '000'
department:
- _id: '54'
doi: 10.1109/icasspw62465.2024.10625894
file:
- access_level: open_access
  content_type: application/pdf
  creator: tvn
  date_created: 2024-09-04T07:34:30Z
  date_updated: 2024-09-04T07:34:30Z
  file_id: '56005'
  file_name: main.pdf
  file_size: 150432
  relation: main_file
file_date_updated: 2024-09-04T07:34:30Z
has_accepted_license: '1'
language:
- iso: eng
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
- _id: '508'
  grant_number: '448568305'
  name: Automatische Transkription von Gesprächssituationen
publication: 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing
  Workshops (ICASSPW)
publication_status: published
publisher: IEEE
status: public
title: Meeting Recognition with Continuous Speech Separation and Transcription-Supported
  Diarization
type: conference
user_id: '40767'
year: '2024'
...
---
_id: '57659'
author:
- first_name: Peter
  full_name: Vieting, Peter
  last_name: Vieting
- first_name: Simon
  full_name: Berger, Simon
  last_name: Berger
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Ralf
  full_name: Schlüter, Ralf
  last_name: Schlüter
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Vieting P, Berger S, von Neumann T, Boeddeker C, Schlüter R, Haeb-Umbach R.
    Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for
    Meeting Transcription. In: <i>2024 IEEE Spoken Language Technology Workshop (SLT)</i>.
    ; 2024.'
  apa: Vieting, P., Berger, S., von Neumann, T., Boeddeker, C., Schlüter, R., &#38;
    Haeb-Umbach, R. (2024). Combining TF-GridNet and Mixture Encoder for Continuous
    Speech Separation for Meeting Transcription. <i>2024 IEEE Spoken Language Technology
    Workshop (SLT)</i>.
  bibtex: '@inproceedings{Vieting_Berger_von Neumann_Boeddeker_Schlüter_Haeb-Umbach_2024,
    title={Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation
    for Meeting Transcription}, booktitle={2024 IEEE Spoken Language Technology Workshop
    (SLT)}, author={Vieting, Peter and Berger, Simon and von Neumann, Thilo and Boeddeker,
    Christoph and Schlüter, Ralf and Haeb-Umbach, Reinhold}, year={2024} }'
  chicago: Vieting, Peter, Simon Berger, Thilo von Neumann, Christoph Boeddeker, Ralf
    Schlüter, and Reinhold Haeb-Umbach. “Combining TF-GridNet and Mixture Encoder
    for Continuous Speech Separation for Meeting Transcription.” In <i>2024 IEEE Spoken
    Language Technology Workshop (SLT)</i>, 2024.
  ieee: P. Vieting, S. Berger, T. von Neumann, C. Boeddeker, R. Schlüter, and R. Haeb-Umbach,
    “Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for
    Meeting Transcription,” 2024.
  mla: Vieting, Peter, et al. “Combining TF-GridNet and Mixture Encoder for Continuous
    Speech Separation for Meeting Transcription.” <i>2024 IEEE Spoken Language Technology
    Workshop (SLT)</i>, 2024.
  short: 'P. Vieting, S. Berger, T. von Neumann, C. Boeddeker, R. Schlüter, R. Haeb-Umbach,
    in: 2024 IEEE Spoken Language Technology Workshop (SLT), 2024.'
date_created: 2024-12-09T11:46:18Z
date_updated: 2025-02-12T09:20:59Z
department:
- _id: '54'
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://www-i6.informatik.rwth-aachen.de/publications/download/1259/VietingPeterBergerSimonNeumannThilovonBoeddekerChristophSchl%FCterRalfHaeb-UmbachReinhold--CombiningTF-GridNetMixtureEncoderforContinuousSpeechSeparationforMeetingTranscription--2024.pdf
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
- _id: '508'
  grant_number: '448568305'
  name: Automatische Transkription von Gesprächssituationen
publication: 2024 IEEE Spoken Language Technology Workshop (SLT)
status: public
title: Combining TF-GridNet and Mixture Encoder for Continuous Speech Separation for
  Meeting Transcription
type: conference
user_id: '40767'
year: '2024'
...
---
_id: '35602'
abstract:
- lang: eng
  text: "Continuous Speech Separation (CSS) has been proposed to address speech overlaps
    during the analysis of realistic meeting-like conversations by eliminating any
    overlaps before further processing.\r\nCSS separates a recording of arbitrarily
    many speakers into a small number of overlap-free output channels, where each
    output channel may contain speech of multiple speakers.\r\nThis is often done
    by applying a conventional separation model trained with Utterance-level Permutation
    Invariant Training (uPIT), which exclusively maps a speaker to an output channel,
    in a sliding window approach called stitching.\r\nRecently, we introduced an alternative
    training scheme called Graph-PIT that teaches the separation network to directly
    produce output streams in the required format without stitching.\r\nIt can handle
    an arbitrary number of speakers as long as the number of simultaneously overlapping
    speakers never exceeds the number of output channels of the separator.\r\nIn this
    contribution, we further
    investigate the Graph-PIT training scheme.\r\nWe show in extended experiments
    that models trained with Graph-PIT also work in challenging reverberant conditions.\r\nModels
    trained in this way are able to perform segment-less CSS, i.e., without stitching,
    and achieve separation quality comparable to, and often better than, the conventional
    CSS with uPIT and stitching.\r\nWe simplify the training schedule for Graph-PIT
    with the recently proposed Source Aggregated Signal-to-Distortion Ratio (SA-SDR)
    loss.\r\nIt eliminates unfavorable properties of the previously used A-SDR loss
    and thus enables training with Graph-PIT from scratch.\r\nGraph-PIT training relaxes
    the constraints w.r.t. the allowed numbers of speakers and speaking patterns, which
    allows using a larger variety of training data.\r\nFurthermore, we introduce novel
    signal-level evaluation metrics for meeting scenarios, namely the source-aggregated
    scale- and convolution-invariant Signal-to-Distortion Ratio (SA-SI-SDR and SA-CI-SDR),
    which are generalizations of the commonly used SDR-based metrics for the CSS case."
article_type: original
author:
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Keisuke
  full_name: Kinoshita, Keisuke
  last_name: Kinoshita
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'von Neumann T, Kinoshita K, Boeddeker C, Delcroix M, Haeb-Umbach R. Segment-Less
    Continuous Speech Separation of Meetings: Training and Evaluation Criteria. <i>IEEE/ACM
    Transactions on Audio, Speech, and Language Processing</i>. 2023;31:576-589. doi:<a
    href="https://doi.org/10.1109/taslp.2022.3228629">10.1109/taslp.2022.3228629</a>'
  apa: 'von Neumann, T., Kinoshita, K., Boeddeker, C., Delcroix, M., &#38; Haeb-Umbach,
    R. (2023). Segment-Less Continuous Speech Separation of Meetings: Training and
    Evaluation Criteria. <i>IEEE/ACM Transactions on Audio, Speech, and Language Processing</i>,
    <i>31</i>, 576–589. <a href="https://doi.org/10.1109/taslp.2022.3228629">https://doi.org/10.1109/taslp.2022.3228629</a>'
  bibtex: '@article{von Neumann_Kinoshita_Boeddeker_Delcroix_Haeb-Umbach_2023, title={Segment-Less
    Continuous Speech Separation of Meetings: Training and Evaluation Criteria}, volume={31},
    DOI={<a href="https://doi.org/10.1109/taslp.2022.3228629">10.1109/taslp.2022.3228629</a>},
    journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing}, publisher={Institute
    of Electrical and Electronics Engineers (IEEE)}, author={von Neumann, Thilo and
    Kinoshita, Keisuke and Boeddeker, Christoph and Delcroix, Marc and Haeb-Umbach,
    Reinhold}, year={2023}, pages={576–589} }'
  chicago: 'Neumann, Thilo von, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix,
    and Reinhold Haeb-Umbach. “Segment-Less Continuous Speech Separation of Meetings:
    Training and Evaluation Criteria.” <i>IEEE/ACM Transactions on Audio, Speech,
    and Language Processing</i> 31 (2023): 576–89. <a href="https://doi.org/10.1109/taslp.2022.3228629">https://doi.org/10.1109/taslp.2022.3228629</a>.'
  ieee: 'T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach,
    “Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation
    Criteria,” <i>IEEE/ACM Transactions on Audio, Speech, and Language Processing</i>,
    vol. 31, pp. 576–589, 2023, doi: <a href="https://doi.org/10.1109/taslp.2022.3228629">10.1109/taslp.2022.3228629</a>.'
  mla: 'von Neumann, Thilo, et al. “Segment-Less Continuous Speech Separation of Meetings:
    Training and Evaluation Criteria.” <i>IEEE/ACM Transactions on Audio, Speech,
    and Language Processing</i>, vol. 31, Institute of Electrical and Electronics
    Engineers (IEEE), 2023, pp. 576–89, doi:<a href="https://doi.org/10.1109/taslp.2022.3228629">10.1109/taslp.2022.3228629</a>.'
  short: T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, R. Haeb-Umbach,
    IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023) 576–589.
date_created: 2023-01-09T17:24:17Z
date_updated: 2023-11-15T12:16:11Z
ddc:
- '000'
department:
- _id: '54'
doi: 10.1109/taslp.2022.3228629
file:
- access_level: open_access
  content_type: application/pdf
  creator: haebumb
  date_created: 2023-01-09T17:46:05Z
  date_updated: 2023-01-11T08:50:19Z
  file_id: '35607'
  file_name: main.pdf
  file_size: 7185077
  relation: main_file
file_date_updated: 2023-01-11T08:50:19Z
has_accepted_license: '1'
intvolume: '31'
keyword:
- Continuous Speech Separation
- Source Separation
- Graph-PIT
- Dynamic Programming
- Permutation Invariant Training
language:
- iso: eng
oa: '1'
page: 576-589
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
publication: IEEE/ACM Transactions on Audio, Speech, and Language Processing
publication_identifier:
  issn:
  - 2329-9290
  - 2329-9304
publication_status: published
publisher: Institute of Electrical and Electronics Engineers (IEEE)
quality_controlled: '1'
status: public
title: 'Segment-Less Continuous Speech Separation of Meetings: Training and Evaluation
  Criteria'
type: journal_article
user_id: '49870'
volume: 31
year: '2023'
...
---
_id: '48281'
abstract:
- lang: eng
  text: "\tWe propose a general framework to compute the word error rate (WER) of
    ASR systems that process recordings containing multiple speakers at their input
    and that produce multiple output word sequences (MIMO).\r\n\tSuch ASR systems
    are typically required, e.g., for meeting transcription.\r\n\tWe provide an efficient
    implementation based on a dynamic programming search in a multi-dimensional Levenshtein
    distance tensor under the constraint that a reference utterance must be matched
    consistently with one hypothesis output. \r\n\tThis also results in an efficient
    implementation of the ORC WER which previously suffered from exponential complexity.\r\n\tWe
    give an overview of commonly used WER definitions for multi-speaker scenarios
    and show that they are specializations of the above MIMO WER tuned to particular
    application scenarios. \r\n\tWe conclude with a  discussion of the pros and cons
    of the various WER definitions and a recommendation when to use which."
author:
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Keisuke
  full_name: Kinoshita, Keisuke
  last_name: Kinoshita
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'von Neumann T, Boeddeker C, Kinoshita K, Delcroix M, Haeb-Umbach R. On Word
    Error Rate Definitions and Their Efficient Computation for Multi-Speaker Speech
    Recognition Systems. In: <i>ICASSP 2023 - 2023 IEEE International Conference on
    Acoustics, Speech and Signal Processing (ICASSP)</i>. IEEE; 2023. doi:<a href="https://doi.org/10.1109/icassp49357.2023.10094784">10.1109/icassp49357.2023.10094784</a>'
  apa: von Neumann, T., Boeddeker, C., Kinoshita, K., Delcroix, M., &#38; Haeb-Umbach,
    R. (2023). On Word Error Rate Definitions and Their Efficient Computation for
    Multi-Speaker Speech Recognition Systems. <i>ICASSP 2023 - 2023 IEEE International
    Conference on Acoustics, Speech and Signal Processing (ICASSP)</i>. <a href="https://doi.org/10.1109/icassp49357.2023.10094784">https://doi.org/10.1109/icassp49357.2023.10094784</a>
  bibtex: '@inproceedings{von Neumann_Boeddeker_Kinoshita_Delcroix_Haeb-Umbach_2023,
    title={On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker
    Speech Recognition Systems}, DOI={<a href="https://doi.org/10.1109/icassp49357.2023.10094784">10.1109/icassp49357.2023.10094784</a>},
    booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech
    and Signal Processing (ICASSP)}, publisher={IEEE}, author={von Neumann, Thilo
    and Boeddeker, Christoph and Kinoshita, Keisuke and Delcroix, Marc and Haeb-Umbach,
    Reinhold}, year={2023} }'
  chicago: Neumann, Thilo von, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix,
    and Reinhold Haeb-Umbach. “On Word Error Rate Definitions and Their Efficient
    Computation for Multi-Speaker Speech Recognition Systems.” In <i>ICASSP 2023 -
    2023 IEEE International Conference on Acoustics, Speech and Signal Processing
    (ICASSP)</i>. IEEE, 2023. <a href="https://doi.org/10.1109/icassp49357.2023.10094784">https://doi.org/10.1109/icassp49357.2023.10094784</a>.
  ieee: 'T. von Neumann, C. Boeddeker, K. Kinoshita, M. Delcroix, and R. Haeb-Umbach,
    “On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker
    Speech Recognition Systems,” 2023, doi: <a href="https://doi.org/10.1109/icassp49357.2023.10094784">10.1109/icassp49357.2023.10094784</a>.'
  mla: von Neumann, Thilo, et al. “On Word Error Rate Definitions and Their Efficient
    Computation for Multi-Speaker Speech Recognition Systems.” <i>ICASSP 2023 - 2023
    IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</i>,
    IEEE, 2023, doi:<a href="https://doi.org/10.1109/icassp49357.2023.10094784">10.1109/icassp49357.2023.10094784</a>.
  short: 'T. von Neumann, C. Boeddeker, K. Kinoshita, M. Delcroix, R. Haeb-Umbach,
    in: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and
    Signal Processing (ICASSP), IEEE, 2023.'
date_created: 2023-10-19T07:38:31Z
date_updated: 2025-02-12T09:16:34Z
ddc:
- '000'
department:
- _id: '54'
doi: 10.1109/icassp49357.2023.10094784
file:
- access_level: open_access
  content_type: application/pdf
  creator: tvn
  date_created: 2023-10-19T07:39:57Z
  date_updated: 2023-10-19T07:41:56Z
  file_id: '48282'
  file_name: ICASSP_2023_Meeting_Evaluation.pdf
  file_size: 204994
  relation: main_file
file_date_updated: 2023-10-19T07:41:56Z
has_accepted_license: '1'
keyword:
- Word Error Rate
- Meeting Recognition
- Levenshtein Distance
language:
- iso: eng
main_file_link:
- url: https://ieeexplore.ieee.org/document/10094784
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
- _id: '508'
  grant_number: '448568305'
  name: Automatische Transkription von Gesprächssituationen
publication: ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech
  and Signal Processing (ICASSP)
publication_status: published
publisher: IEEE
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/fgnt/meeteval
status: public
title: On Word Error Rate Definitions and Their Efficient Computation for Multi-Speaker
  Speech Recognition Systems
type: conference
user_id: '40767'
year: '2023'
...
---
_id: '48275'
abstract:
- lang: eng
  text: "MeetEval is an open-source toolkit to evaluate  all kinds of meeting transcription
    systems.\r\nIt provides a unified interface for the computation of commonly used
    Word Error Rates (WERs), specifically cpWER, ORC WER and MIMO WER along other
    WER definitions.\r\nWe extend the cpWER computation by a temporal constraint to
    ensure that only words are identified as correct when the temporal alignment is
    plausible.\r\nThis leads to a better quality of the matching of the hypothesis
    string to the reference string that more closely resembles the actual transcription
    quality, and a system is penalized if it provides poor time annotations.\r\nSince
    word-level timing information is often not available, we present a way to approximate
    exact word-level timings from segment-level timings (e.g., a sentence) and show
    that the approximation leads to a WER similar to that of a matching with exact word-level
    annotations.\r\nAt the same time, the time constraint leads to a speedup of the
    matching algorithm, which outweighs the additional overhead caused by processing
    the time stamps."
author:
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'von Neumann T, Boeddeker C, Delcroix M, Haeb-Umbach R. MeetEval: A Toolkit
    for Computation of Word Error Rates for Meeting Transcription Systems. In: <i>Proc.
    CHiME 2023 Workshop on Speech Processing in Everyday Environments</i>. ; 2023.'
  apa: 'von Neumann, T., Boeddeker, C., Delcroix, M., &#38; Haeb-Umbach, R. (2023).
    MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription
    Systems. <i>Proc. CHiME 2023 Workshop on Speech Processing in Everyday Environments</i>.
    CHiME 2023 Workshop on Speech Processing in Everyday Environments, Dublin.'
  bibtex: '@inproceedings{von Neumann_Boeddeker_Delcroix_Haeb-Umbach_2023, title={MeetEval:
    A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems},
    booktitle={Proc. CHiME 2023 Workshop on Speech Processing in Everyday Environments},
    author={von Neumann, Thilo and Boeddeker, Christoph and Delcroix, Marc and Haeb-Umbach,
    Reinhold}, year={2023} }'
  chicago: 'Neumann, Thilo von, Christoph Boeddeker, Marc Delcroix, and Reinhold Haeb-Umbach.
    “MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription
    Systems.” In <i>Proc. CHiME 2023 Workshop on Speech Processing in Everyday Environments</i>,
    2023.'
  ieee: 'T. von Neumann, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach, “MeetEval:
    A Toolkit for Computation of Word Error Rates for Meeting Transcription Systems,”
    presented at the CHiME 2023 Workshop on Speech Processing in Everyday Environments,
    Dublin, 2023.'
  mla: 'von Neumann, Thilo, et al. “MeetEval: A Toolkit for Computation of Word Error
    Rates for Meeting Transcription Systems.” <i>Proc. CHiME 2023 Workshop on Speech
    Processing in Everyday Environments</i>, 2023.'
  short: 'T. von Neumann, C. Boeddeker, M. Delcroix, R. Haeb-Umbach, in: Proc. CHiME
    2023 Workshop on Speech Processing in Everyday Environments, 2023.'
conference:
  location: Dublin
  name: CHiME 2023 Workshop on Speech Processing in Everyday Environments
date_created: 2023-10-19T07:24:51Z
date_updated: 2025-02-12T09:12:05Z
ddc:
- '000'
department:
- _id: '54'
file:
- access_level: open_access
  content_type: application/pdf
  creator: tvn
  date_created: 2023-10-19T07:19:59Z
  date_updated: 2023-10-19T07:19:59Z
  file_id: '48276'
  file_name: Chime_7__MeetEval.pdf
  file_size: 263744
  relation: main_file
file_date_updated: 2023-10-19T07:19:59Z
has_accepted_license: '1'
keyword:
- Speech Recognition
- Word Error Rate
- Meeting Transcription
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://arxiv.org/abs/2307.11394
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
- _id: '508'
  grant_number: '448568305'
  name: Automatische Transkription von Gesprächssituationen
publication: Proc. CHiME 2023 Workshop on Speech Processing in Everyday Environments
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/fgnt/meeteval
status: public
title: 'MeetEval: A Toolkit for Computation of Word Error Rates for Meeting Transcription
  Systems'
type: conference
user_id: '40767'
year: '2023'
...
---
_id: '54439'
author:
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Tobias
  full_name: Cord-Landwehr, Tobias
  id: '44393'
  last_name: Cord-Landwehr
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Boeddeker C, Cord-Landwehr T, von Neumann T, Haeb-Umbach R. Multi-stage diarization
    refinement for the CHiME-7 DASR scenario. In: <i>7th International Workshop on
    Speech Processing in Everyday Environments (CHiME 2023)</i>. ISCA; 2023. doi:<a
    href="https://doi.org/10.21437/chime.2023-10">10.21437/chime.2023-10</a>'
  apa: Boeddeker, C., Cord-Landwehr, T., von Neumann, T., &#38; Haeb-Umbach, R. (2023).
    Multi-stage diarization refinement for the CHiME-7 DASR scenario. <i>7th International
    Workshop on Speech Processing in Everyday Environments (CHiME 2023)</i>. <a href="https://doi.org/10.21437/chime.2023-10">https://doi.org/10.21437/chime.2023-10</a>
  bibtex: '@inproceedings{Boeddeker_Cord-Landwehr_von Neumann_Haeb-Umbach_2023, title={Multi-stage
    diarization refinement for the CHiME-7 DASR scenario}, DOI={<a href="https://doi.org/10.21437/chime.2023-10">10.21437/chime.2023-10</a>},
    booktitle={7th International Workshop on Speech Processing in Everyday Environments
    (CHiME 2023)}, publisher={ISCA}, author={Boeddeker, Christoph and Cord-Landwehr,
    Tobias and von Neumann, Thilo and Haeb-Umbach, Reinhold}, year={2023} }'
  chicago: Boeddeker, Christoph, Tobias Cord-Landwehr, Thilo von Neumann, and Reinhold
    Haeb-Umbach. “Multi-Stage Diarization Refinement for the CHiME-7 DASR Scenario.”
    In <i>7th International Workshop on Speech Processing in Everyday Environments
    (CHiME 2023)</i>. ISCA, 2023. <a href="https://doi.org/10.21437/chime.2023-10">https://doi.org/10.21437/chime.2023-10</a>.
  ieee: 'C. Boeddeker, T. Cord-Landwehr, T. von Neumann, and R. Haeb-Umbach, “Multi-stage
    diarization refinement for the CHiME-7 DASR scenario,” 2023, doi: <a href="https://doi.org/10.21437/chime.2023-10">10.21437/chime.2023-10</a>.'
  mla: Boeddeker, Christoph, et al. “Multi-Stage Diarization Refinement for the CHiME-7
    DASR Scenario.” <i>7th International Workshop on Speech Processing in Everyday
    Environments (CHiME 2023)</i>, ISCA, 2023, doi:<a href="https://doi.org/10.21437/chime.2023-10">10.21437/chime.2023-10</a>.
  short: 'C. Boeddeker, T. Cord-Landwehr, T. von Neumann, R. Haeb-Umbach, in: 7th
    International Workshop on Speech Processing in Everyday Environments (CHiME 2023),
    ISCA, 2023.'
date_created: 2024-05-23T15:16:15Z
date_updated: 2025-02-12T09:16:13Z
department:
- _id: '54'
doi: 10.21437/chime.2023-10
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://www.isca-archive.org/chime_2023/boeddeker23_chime.pdf
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
- _id: '508'
  grant_number: '448568305'
  name: Automatische Transkription von Gesprächssituationen
publication: 7th International Workshop on Speech Processing in Everyday Environments
  (CHiME 2023)
publication_status: published
publisher: ISCA
status: public
title: Multi-stage diarization refinement for the CHiME-7 DASR scenario
type: conference
user_id: '40767'
year: '2023'
...
---
_id: '33847'
abstract:
- lang: eng
  text: "The scope of speech enhancement has changed from a monolithic view of single,\r\nindependent
    tasks, to a joint processing of complex conversational speech\r\nrecordings. Training
    and evaluation of these single tasks requires synthetic\r\ndata with access to
    intermediate signals that is as close as possible to the\r\nevaluation scenario.
    As such data often is not available, many works instead\r\nuse specialized databases
    for the training of each system component, e.g\r\nWSJ0-mix for source separation.
    We present a Multi-purpose Multi-Speaker\r\nMixture Signal Generator (MMS-MSG)
    for generating a variety of speech mixture\r\nsignals based on any speech corpus,
    ranging from classical anechoic mixtures\r\n(e.g., WSJ0-mix) over reverberant
    mixtures (e.g., SMS-WSJ) to meeting-style\r\ndata. Its highly modular and flexible
    structure allows for the simulation of\r\ndiverse environments and dynamic mixing,
    while simultaneously enabling an easy\r\nextension and modification to generate
    new scenarios and mixture types. These\r\nmeetings can be used for prototyping,
    evaluation, or training purposes. We\r\nprovide example evaluation data and baseline
    results for meetings based on the\r\nWSJ corpus. Further, we demonstrate the usefulness
    for realistic scenarios by\r\nusing MMS-MSG to provide training data for the LibriCSS
    database."
author:
- first_name: Tobias
  full_name: Cord-Landwehr, Tobias
  id: '44393'
  last_name: Cord-Landwehr
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Cord-Landwehr T, von Neumann T, Boeddeker C, Haeb-Umbach R. MMS-MSG: A Multi-purpose
    Multi-Speaker Mixture Signal Generator. In: <i>2022 International Workshop on
    Acoustic Signal Enhancement (IWAENC)</i>. ; 2022.'
  apa: 'Cord-Landwehr, T., von Neumann, T., Boeddeker, C., &#38; Haeb-Umbach, R. (2022).
    MMS-MSG: A Multi-purpose Multi-Speaker Mixture Signal Generator. <i>2022 International
    Workshop on Acoustic Signal Enhancement (IWAENC)</i>. 2022 International Workshop
    on Acoustic Signal Enhancement (IWAENC), Bamberg.'
  bibtex: '@inproceedings{Cord-Landwehr_von Neumann_Boeddeker_Haeb-Umbach_2022, title={MMS-MSG:
    A Multi-purpose Multi-Speaker Mixture Signal Generator}, booktitle={2022 International
    Workshop on Acoustic Signal Enhancement (IWAENC)}, author={Cord-Landwehr, Tobias
    and von Neumann, Thilo and Boeddeker, Christoph and Haeb-Umbach, Reinhold}, year={2022}
    }'
  chicago: 'Cord-Landwehr, Tobias, Thilo von Neumann, Christoph Boeddeker, and Reinhold
    Haeb-Umbach. “MMS-MSG: A Multi-Purpose Multi-Speaker Mixture Signal Generator.”
    In <i>2022 International Workshop on Acoustic Signal Enhancement (IWAENC)</i>,
    2022.'
  ieee: 'T. Cord-Landwehr, T. von Neumann, C. Boeddeker, and R. Haeb-Umbach, “MMS-MSG:
    A Multi-purpose Multi-Speaker Mixture Signal Generator,” presented at the 2022
    International Workshop on Acoustic Signal Enhancement (IWAENC), Bamberg, 2022.'
  mla: 'Cord-Landwehr, Tobias, et al. “MMS-MSG: A Multi-Purpose Multi-Speaker Mixture
    Signal Generator.” <i>2022 International Workshop on Acoustic Signal Enhancement
    (IWAENC)</i>, 2022.'
  short: 'T. Cord-Landwehr, T. von Neumann, C. Boeddeker, R. Haeb-Umbach, in: 2022
    International Workshop on Acoustic Signal Enhancement (IWAENC), 2022.'
conference:
  location: Bamberg
  name: 2022 International Workshop on Acoustic Signal Enhancement (IWAENC)
date_created: 2022-10-20T14:02:14Z
date_updated: 2023-11-15T14:55:14Z
ddc:
- '000'
department:
- _id: '54'
external_id:
  arxiv:
  - '2209.11494'
file:
- access_level: open_access
  content_type: application/pdf
  creator: cord
  date_created: 2023-11-15T14:54:56Z
  date_updated: 2023-11-15T14:54:56Z
  file_id: '48931'
  file_name: mms_msg_camera_ready.pdf
  file_size: 177975
  relation: main_file
file_date_updated: 2023-11-15T14:54:56Z
has_accepted_license: '1'
language:
- iso: eng
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
publication: 2022 International Workshop on Acoustic Signal Enhancement (IWAENC)
quality_controlled: '1'
status: public
title: 'MMS-MSG: A Multi-purpose Multi-Speaker Mixture Signal Generator'
type: conference
user_id: '44393'
year: '2022'
...
---
_id: '33848'
abstract:
- lang: eng
  text: "Impressive progress in neural network-based single-channel speech source\r\nseparation
    has been made in recent years. But those improvements have been\r\nmostly reported
    on anechoic data, a situation that is hardly met in practice.\r\nTaking the SepFormer
    as a starting point, which achieves state-of-the-art\r\nperformance on anechoic
    mixtures, we gradually modify it to optimize its\r\nperformance on reverberant
    mixtures. Although this leads to a word error rate\r\nimprovement by 7 percentage
    points compared to the standard SepFormer\r\nimplementation, the system ends up
    with only marginally better performance than\r\na PIT-BLSTM separation system,
    that is optimized with rather straightforward\r\nmeans. This is surprising and
    at the same time sobering, challenging the\r\npractical usefulness of many improvements
    reported in recent years for monaural\r\nsource separation on nonreverberant data."
author:
- first_name: Tobias
  full_name: Cord-Landwehr, Tobias
  id: '44393'
  last_name: Cord-Landwehr
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Catalin
  full_name: Zorila, Catalin
  last_name: Zorila
- first_name: Rama
  full_name: Doddipatla, Rama
  last_name: Doddipatla
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Cord-Landwehr T, Boeddeker C, von Neumann T, Zorila C, Doddipatla R, Haeb-Umbach
    R. Monaural source separation: From anechoic to reverberant environments. In:
    <i>2022 International Workshop on Acoustic Signal Enhancement (IWAENC)</i>. IEEE;
    2022.'
  apa: 'Cord-Landwehr, T., Boeddeker, C., von Neumann, T., Zorila, C., Doddipatla,
    R., &#38; Haeb-Umbach, R. (2022). Monaural source separation: From anechoic to
    reverberant environments. <i>2022 International Workshop on Acoustic Signal Enhancement
    (IWAENC)</i>. 2022 International Workshop on Acoustic Signal Enhancement (IWAENC).'
  bibtex: '@inproceedings{Cord-Landwehr_Boeddeker_von Neumann_Zorila_Doddipatla_Haeb-Umbach_2022,
    place={Bamberg}, title={Monaural source separation: From anechoic to reverberant
    environments}, booktitle={2022 International Workshop on Acoustic Signal Enhancement
    (IWAENC)}, publisher={IEEE}, author={Cord-Landwehr, Tobias and Boeddeker, Christoph
    and von Neumann, Thilo and Zorila, Catalin and Doddipatla, Rama and Haeb-Umbach,
    Reinhold}, year={2022} }'
  chicago: 'Cord-Landwehr, Tobias, Christoph Boeddeker, Thilo von Neumann, Catalin
    Zorila, Rama Doddipatla, and Reinhold Haeb-Umbach. “Monaural Source Separation:
    From Anechoic to Reverberant Environments.” In <i>2022 International Workshop
    on Acoustic Signal Enhancement (IWAENC)</i>. Bamberg: IEEE, 2022.'
  ieee: 'T. Cord-Landwehr, C. Boeddeker, T. von Neumann, C. Zorila, R. Doddipatla,
    and R. Haeb-Umbach, “Monaural source separation: From anechoic to reverberant
    environments,” presented at the 2022 International Workshop on Acoustic Signal
    Enhancement (IWAENC), 2022.'
  mla: 'Cord-Landwehr, Tobias, et al. “Monaural Source Separation: From Anechoic to
    Reverberant Environments.” <i>2022 International Workshop on Acoustic Signal Enhancement
    (IWAENC)</i>, IEEE, 2022.'
  short: 'T. Cord-Landwehr, C. Boeddeker, T. von Neumann, C. Zorila, R. Doddipatla,
    R. Haeb-Umbach, in: 2022 International Workshop on Acoustic Signal Enhancement
    (IWAENC), IEEE, Bamberg, 2022.'
conference:
  name: 2022 International Workshop on Acoustic Signal Enhancement (IWAENC)
date_created: 2022-10-20T14:07:28Z
date_updated: 2025-02-12T09:05:25Z
ddc:
- '000'
department:
- _id: '54'
external_id:
  arxiv:
  - '2111.07578'
file:
- access_level: open_access
  content_type: application/pdf
  creator: cord
  date_created: 2023-11-15T14:52:16Z
  date_updated: 2023-11-15T14:52:16Z
  file_id: '48930'
  file_name: monaural_source_separation.pdf
  file_size: 212890
  relation: main_file
file_date_updated: 2023-11-15T14:52:16Z
has_accepted_license: '1'
language:
- iso: eng
oa: '1'
place: Bamberg
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
- _id: '508'
  grant_number: '448568305'
  name: Automatische Transkription von Gesprächssituationen
publication: 2022 International Workshop on Acoustic Signal Enhancement (IWAENC)
publisher: IEEE
status: public
title: 'Monaural source separation: From anechoic to reverberant environments'
type: conference
user_id: '40767'
year: '2022'
...
---
_id: '33819'
author:
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Keisuke
  full_name: Kinoshita, Keisuke
  last_name: Kinoshita
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'von Neumann T, Kinoshita K, Boeddeker C, Delcroix M, Haeb-Umbach R. SA-SDR:
    A Novel Loss Function for Separation of Meeting Style Data. In: <i>ICASSP 2022
    - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing
    (ICASSP)</i>. IEEE; 2022. doi:<a href="https://doi.org/10.1109/icassp43922.2022.9746757">10.1109/icassp43922.2022.9746757</a>'
  apa: 'von Neumann, T., Kinoshita, K., Boeddeker, C., Delcroix, M., &#38; Haeb-Umbach,
    R. (2022). SA-SDR: A Novel Loss Function for Separation of Meeting Style Data.
    <i>ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal
    Processing (ICASSP)</i>. <a href="https://doi.org/10.1109/icassp43922.2022.9746757">https://doi.org/10.1109/icassp43922.2022.9746757</a>'
  bibtex: '@inproceedings{von Neumann_Kinoshita_Boeddeker_Delcroix_Haeb-Umbach_2022,
    title={SA-SDR: A Novel Loss Function for Separation of Meeting Style Data}, DOI={<a
    href="https://doi.org/10.1109/icassp43922.2022.9746757">10.1109/icassp43922.2022.9746757</a>},
    booktitle={ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech
    and Signal Processing (ICASSP)}, publisher={IEEE}, author={von Neumann, Thilo
    and Kinoshita, Keisuke and Boeddeker, Christoph and Delcroix, Marc and Haeb-Umbach,
    Reinhold}, year={2022} }'
  chicago: 'Neumann, Thilo von, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix,
    and Reinhold Haeb-Umbach. “SA-SDR: A Novel Loss Function for Separation of Meeting
    Style Data.” In <i>ICASSP 2022 - 2022 IEEE International Conference on Acoustics,
    Speech and Signal Processing (ICASSP)</i>. IEEE, 2022. <a href="https://doi.org/10.1109/icassp43922.2022.9746757">https://doi.org/10.1109/icassp43922.2022.9746757</a>.'
  ieee: 'T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach,
    “SA-SDR: A Novel Loss Function for Separation of Meeting Style Data,” 2022, doi:
    <a href="https://doi.org/10.1109/icassp43922.2022.9746757">10.1109/icassp43922.2022.9746757</a>.'
  mla: 'von Neumann, Thilo, et al. “SA-SDR: A Novel Loss Function for Separation of
    Meeting Style Data.” <i>ICASSP 2022 - 2022 IEEE International Conference on Acoustics,
    Speech and Signal Processing (ICASSP)</i>, IEEE, 2022, doi:<a href="https://doi.org/10.1109/icassp43922.2022.9746757">10.1109/icassp43922.2022.9746757</a>.'
  short: 'T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, R. Haeb-Umbach,
    in: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and
    Signal Processing (ICASSP), IEEE, 2022.'
date_created: 2022-10-20T05:29:12Z
date_updated: 2025-02-12T09:08:14Z
ddc:
- '000'
department:
- _id: '54'
doi: 10.1109/icassp43922.2022.9746757
file:
- access_level: open_access
  content_type: application/pdf
  creator: tvn
  date_created: 2022-10-20T05:33:10Z
  date_updated: 2022-10-20T05:33:10Z
  file_id: '33820'
  file_name: main.pdf
  file_size: 228069
  relation: main_file
- access_level: open_access
  content_type: application/pdf
  creator: tvn
  date_created: 2022-10-20T05:35:32Z
  date_updated: 2022-10-20T05:35:32Z
  file_id: '33821'
  file_name: poster.pdf
  file_size: 229166
  relation: poster
file_date_updated: 2022-10-20T05:35:32Z
has_accepted_license: '1'
language:
- iso: eng
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
- _id: '508'
  grant_number: '448568305'
  name: Automatische Transkription von Gesprächssituationen
publication: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech
  and Signal Processing (ICASSP)
publication_status: published
publisher: IEEE
quality_controlled: '1'
related_material:
  link:
  - relation: supplementary_material
    url: https://github.com/fgnt/graph_pit
status: public
title: 'SA-SDR: A Novel Loss Function for Separation of Meeting Style Data'
type: conference
user_id: '40767'
year: '2022'
...
---
_id: '33816'
author:
- first_name: Tobias
  full_name: Gburrek, Tobias
  id: '44006'
  last_name: Gburrek
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Tobias
  full_name: Cord-Landwehr, Tobias
  id: '44393'
  last_name: Cord-Landwehr
- first_name: Joerg
  full_name: Schmalenstroeer, Joerg
  id: '460'
  last_name: Schmalenstroeer
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: Gburrek T, Boeddeker C, von Neumann T, Cord-Landwehr T, Schmalenstroeer J,
    Haeb-Umbach R. <i>A Meeting Transcription System for an Ad-Hoc Acoustic Sensor
    Network</i>. arXiv; 2022. doi:<a href="https://doi.org/10.48550/ARXIV.2205.00944">10.48550/ARXIV.2205.00944</a>
  apa: Gburrek, T., Boeddeker, C., von Neumann, T., Cord-Landwehr, T., Schmalenstroeer,
    J., &#38; Haeb-Umbach, R. (2022). <i>A Meeting Transcription System for an Ad-Hoc
    Acoustic Sensor Network</i>. arXiv. <a href="https://doi.org/10.48550/ARXIV.2205.00944">https://doi.org/10.48550/ARXIV.2205.00944</a>
  bibtex: '@book{Gburrek_Boeddeker_von Neumann_Cord-Landwehr_Schmalenstroeer_Haeb-Umbach_2022,
    title={A Meeting Transcription System for an Ad-Hoc Acoustic Sensor Network},
    DOI={<a href="https://doi.org/10.48550/ARXIV.2205.00944">10.48550/ARXIV.2205.00944</a>},
    publisher={arXiv}, author={Gburrek, Tobias and Boeddeker, Christoph and von Neumann,
    Thilo and Cord-Landwehr, Tobias and Schmalenstroeer, Joerg and Haeb-Umbach, Reinhold},
    year={2022} }'
  chicago: Gburrek, Tobias, Christoph Boeddeker, Thilo von Neumann, Tobias Cord-Landwehr,
    Joerg Schmalenstroeer, and Reinhold Haeb-Umbach. <i>A Meeting Transcription System
    for an Ad-Hoc Acoustic Sensor Network</i>. arXiv, 2022. <a href="https://doi.org/10.48550/ARXIV.2205.00944">https://doi.org/10.48550/ARXIV.2205.00944</a>.
  ieee: T. Gburrek, C. Boeddeker, T. von Neumann, T. Cord-Landwehr, J. Schmalenstroeer,
    and R. Haeb-Umbach, <i>A Meeting Transcription System for an Ad-Hoc Acoustic Sensor
    Network</i>. arXiv, 2022.
  mla: Gburrek, Tobias, et al. <i>A Meeting Transcription System for an Ad-Hoc Acoustic
    Sensor Network</i>. arXiv, 2022, doi:<a href="https://doi.org/10.48550/ARXIV.2205.00944">10.48550/ARXIV.2205.00944</a>.
  short: T. Gburrek, C. Boeddeker, T. von Neumann, T. Cord-Landwehr, J. Schmalenstroeer,
    R. Haeb-Umbach, A Meeting Transcription System for an Ad-Hoc Acoustic Sensor Network,
    arXiv, 2022.
date_created: 2022-10-18T11:10:58Z
date_updated: 2025-02-12T09:03:42Z
ddc:
- '004'
department:
- _id: '54'
doi: 10.48550/ARXIV.2205.00944
file:
- access_level: open_access
  content_type: application/pdf
  creator: tgburrek
  date_created: 2023-11-17T06:42:04Z
  date_updated: 2023-11-17T06:42:04Z
  file_id: '48992'
  file_name: meeting_transcription_22.pdf
  file_size: 199006
  relation: main_file
file_date_updated: 2023-11-17T06:42:04Z
has_accepted_license: '1'
language:
- iso: eng
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
- _id: '508'
  grant_number: '448568305'
  name: Automatische Transkription von Gesprächssituationen
publisher: arXiv
status: public
title: A Meeting Transcription System for an Ad-Hoc Acoustic Sensor Network
type: misc
user_id: '40767'
year: '2022'
...
---
_id: '33954'
author:
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Tobias
  full_name: Cord-Landwehr, Tobias
  id: '44393'
  last_name: Cord-Landwehr
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Boeddeker C, Cord-Landwehr T, von Neumann T, Haeb-Umbach R. An Initialization
    Scheme for Meeting Separation with Spatial Mixture Models. In: <i>Interspeech
    2022</i>. ISCA; 2022. doi:<a href="https://doi.org/10.21437/interspeech.2022-10929">10.21437/interspeech.2022-10929</a>'
  apa: Boeddeker, C., Cord-Landwehr, T., von Neumann, T., &#38; Haeb-Umbach, R. (2022).
    An Initialization Scheme for Meeting Separation with Spatial Mixture Models. <i>Interspeech
    2022</i>. <a href="https://doi.org/10.21437/interspeech.2022-10929">https://doi.org/10.21437/interspeech.2022-10929</a>
  bibtex: '@inproceedings{Boeddeker_Cord-Landwehr_von Neumann_Haeb-Umbach_2022, title={An
    Initialization Scheme for Meeting Separation with Spatial Mixture Models}, DOI={<a
    href="https://doi.org/10.21437/interspeech.2022-10929">10.21437/interspeech.2022-10929</a>},
    booktitle={Interspeech 2022}, publisher={ISCA}, author={Boeddeker, Christoph and
    Cord-Landwehr, Tobias and von Neumann, Thilo and Haeb-Umbach, Reinhold}, year={2022}
    }'
  chicago: Boeddeker, Christoph, Tobias Cord-Landwehr, Thilo von Neumann, and Reinhold
    Haeb-Umbach. “An Initialization Scheme for Meeting Separation with Spatial Mixture
    Models.” In <i>Interspeech 2022</i>. ISCA, 2022. <a href="https://doi.org/10.21437/interspeech.2022-10929">https://doi.org/10.21437/interspeech.2022-10929</a>.
  ieee: 'C. Boeddeker, T. Cord-Landwehr, T. von Neumann, and R. Haeb-Umbach, “An Initialization
    Scheme for Meeting Separation with Spatial Mixture Models,” 2022, doi: <a href="https://doi.org/10.21437/interspeech.2022-10929">10.21437/interspeech.2022-10929</a>.'
  mla: Boeddeker, Christoph, et al. “An Initialization Scheme for Meeting Separation
    with Spatial Mixture Models.” <i>Interspeech 2022</i>, ISCA, 2022, doi:<a href="https://doi.org/10.21437/interspeech.2022-10929">10.21437/interspeech.2022-10929</a>.
  short: 'C. Boeddeker, T. Cord-Landwehr, T. von Neumann, R. Haeb-Umbach, in: Interspeech
    2022, ISCA, 2022.'
date_created: 2022-10-28T10:53:56Z
date_updated: 2025-02-12T09:06:56Z
department:
- _id: '54'
doi: 10.21437/interspeech.2022-10929
language:
- iso: eng
main_file_link:
- open_access: '1'
  url: https://www.isca-archive.org/interspeech_2022/boeddeker22_interspeech.pdf
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
- _id: '508'
  grant_number: '448568305'
  name: Automatische Transkription von Gesprächssituationen
publication: Interspeech 2022
publication_status: published
publisher: ISCA
status: public
title: An Initialization Scheme for Meeting Separation with Spatial Mixture Models
type: conference
user_id: '40767'
year: '2022'
...
---
_id: '33958'
abstract:
- lang: eng
  text: Recent speaker diarization studies showed that integration of end-to-end neural
    diarization (EEND) and clustering-based diarization is a promising approach for
    achieving state-of-the-art performance on various tasks. Such an approach first
    divides an observed signal into fixed-length segments, then performs segment-level
    local diarization based on an EEND module, and merges the segment-level results
    via clustering to form a final global diarization result. The segmentation is
    done to limit the number of speakers in each segment since the current EEND cannot
    handle a large number of speakers. In this paper, we argue that such an approach
    involving the segmentation has several issues; for example, it inevitably faces
    a dilemma that larger segment sizes increase both the context available for enhancing
    the performance and the number of speakers for the local EEND module to handle.
    To resolve such a problem, this paper proposes a novel framework that performs
    diarization without segmentation. However, it can still handle challenging data
    containing many speakers and a significant amount of overlapping speech. The proposed
    method can take an entire meeting for inference and perform utterance-by-utterance
    diarization that clusters utterance activities in terms of speakers. To this end,
    we leverage a neural network training scheme called Graph-PIT proposed recently
    for neural source separation. Experiments with simulated active-meeting-like data
    and CALLHOME data show the superiority of the proposed approach over the conventional
    methods.
author:
- first_name: Keisuke
  full_name: Kinoshita, Keisuke
  last_name: Kinoshita
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Kinoshita K, von Neumann T, Delcroix M, Boeddeker C, Haeb-Umbach R. Utterance-by-utterance
    overlap-aware neural diarization with Graph-PIT. In: <i>Proc. Interspeech 2022</i>.
    ISCA; 2022:1486-1490. doi:<a href="https://doi.org/10.21437/Interspeech.2022-11408">10.21437/Interspeech.2022-11408</a>'
  apa: Kinoshita, K., von Neumann, T., Delcroix, M., Boeddeker, C., &#38; Haeb-Umbach,
    R. (2022). Utterance-by-utterance overlap-aware neural diarization with Graph-PIT.
    <i>Proc. Interspeech 2022</i>, 1486–1490. <a href="https://doi.org/10.21437/Interspeech.2022-11408">https://doi.org/10.21437/Interspeech.2022-11408</a>
  bibtex: '@inproceedings{Kinoshita_von Neumann_Delcroix_Boeddeker_Haeb-Umbach_2022,
    title={Utterance-by-utterance overlap-aware neural diarization with Graph-PIT},
    DOI={<a href="https://doi.org/10.21437/Interspeech.2022-11408">10.21437/Interspeech.2022-11408</a>},
    booktitle={Proc. Interspeech 2022}, publisher={ISCA}, author={Kinoshita, Keisuke
    and von Neumann, Thilo and Delcroix, Marc and Boeddeker, Christoph and Haeb-Umbach,
    Reinhold}, year={2022}, pages={1486–1490} }'
  chicago: Kinoshita, Keisuke, Thilo von Neumann, Marc Delcroix, Christoph Boeddeker,
    and Reinhold Haeb-Umbach. “Utterance-by-Utterance Overlap-Aware Neural Diarization
    with Graph-PIT.” In <i>Proc. Interspeech 2022</i>, 1486–90. ISCA, 2022. <a href="https://doi.org/10.21437/Interspeech.2022-11408">https://doi.org/10.21437/Interspeech.2022-11408</a>.
  ieee: 'K. Kinoshita, T. von Neumann, M. Delcroix, C. Boeddeker, and R. Haeb-Umbach,
    “Utterance-by-utterance overlap-aware neural diarization with Graph-PIT,” in <i>Proc.
    Interspeech 2022</i>, 2022, pp. 1486–1490, doi: <a href="https://doi.org/10.21437/Interspeech.2022-11408">10.21437/Interspeech.2022-11408</a>.'
  mla: Kinoshita, Keisuke, et al. “Utterance-by-Utterance Overlap-Aware Neural Diarization
    with Graph-PIT.” <i>Proc. Interspeech 2022</i>, ISCA, 2022, pp. 1486–90, doi:<a
    href="https://doi.org/10.21437/Interspeech.2022-11408">10.21437/Interspeech.2022-11408</a>.
  short: 'K. Kinoshita, T. von Neumann, M. Delcroix, C. Boeddeker, R. Haeb-Umbach,
    in: Proc. Interspeech 2022, ISCA, 2022, pp. 1486–1490.'
conference:
  name: Interspeech 2022
date_created: 2022-10-28T12:07:57Z
date_updated: 2025-02-12T09:09:05Z
department:
- _id: '54'
doi: 10.21437/Interspeech.2022-11408
language:
- iso: eng
main_file_link:
- url: https://www.isca-archive.org/interspeech_2022/kinoshita22_interspeech.pdf
page: 1486-1490
publication: Proc. Interspeech 2022
publication_status: published
publisher: ISCA
quality_controlled: '1'
status: public
title: Utterance-by-utterance overlap-aware neural diarization with Graph-PIT
type: conference
user_id: '40767'
year: '2022'
...
---
_id: '26770'
abstract:
- lang: eng
  text: "Automatic transcription of meetings requires handling of overlapped speech,
    which calls for continuous speech separation (CSS) systems. The uPIT criterion
    was proposed for utterance-level separation with neural networks and introduces
    the constraint that the total number of speakers must not exceed the number of
    output channels. When processing meeting-like data in a segment-wise manner, i.e.,
    by separating overlapping segments independently and stitching adjacent segments
    into continuous output streams, this constraint has to be fulfilled for any segment.
    In this contribution, we show that this constraint can be significantly relaxed.
    We propose a novel graph-based PIT criterion, which casts the assignment of utterances
    to output channels as a graph coloring problem. It only requires that the number
    of concurrently active speakers must not exceed the number of output channels.
    As a consequence, the system can process an arbitrary number of speakers and arbitrarily
    long segments and thus can handle more diverse scenarios.\r\nFurther, the stitching
    algorithm for obtaining a consistent output order in neighboring segments is of
    less importance and can even be eliminated completely, not least reducing
    the computational effort. Experiments on meeting-style WSJ data show improvements
    in recognition performance over using the uPIT criterion."
author:
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Keisuke
  full_name: Kinoshita, Keisuke
  last_name: Kinoshita
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'von Neumann T, Kinoshita K, Boeddeker C, Delcroix M, Haeb-Umbach R. Graph-PIT:
    Generalized Permutation Invariant Training for Continuous Separation of Arbitrary
    Numbers of Speakers. In: <i>Interspeech 2021</i>. ; 2021. doi:<a href="https://doi.org/10.21437/interspeech.2021-1177">10.21437/interspeech.2021-1177</a>'
  apa: 'von Neumann, T., Kinoshita, K., Boeddeker, C., Delcroix, M., &#38; Haeb-Umbach,
    R. (2021). Graph-PIT: Generalized Permutation Invariant Training for Continuous
    Separation of Arbitrary Numbers of Speakers. <i>Interspeech 2021</i>. Interspeech.
    <a href="https://doi.org/10.21437/interspeech.2021-1177">https://doi.org/10.21437/interspeech.2021-1177</a>'
  bibtex: '@inproceedings{von Neumann_Kinoshita_Boeddeker_Delcroix_Haeb-Umbach_2021,
    title={Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation
    of Arbitrary Numbers of Speakers}, DOI={<a href="https://doi.org/10.21437/interspeech.2021-1177">10.21437/interspeech.2021-1177</a>},
    booktitle={Interspeech 2021}, author={von Neumann, Thilo and Kinoshita, Keisuke
    and Boeddeker, Christoph and Delcroix, Marc and Haeb-Umbach, Reinhold}, year={2021}
    }'
  chicago: 'Neumann, Thilo von, Keisuke Kinoshita, Christoph Boeddeker, Marc Delcroix,
    and Reinhold Haeb-Umbach. “Graph-PIT: Generalized Permutation Invariant Training
    for Continuous Separation of Arbitrary Numbers of Speakers.” In <i>Interspeech
    2021</i>, 2021. <a href="https://doi.org/10.21437/interspeech.2021-1177">https://doi.org/10.21437/interspeech.2021-1177</a>.'
  ieee: 'T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, and R. Haeb-Umbach,
    “Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation
    of Arbitrary Numbers of Speakers,” presented at the Interspeech, 2021, doi: <a
    href="https://doi.org/10.21437/interspeech.2021-1177">10.21437/interspeech.2021-1177</a>.'
  mla: 'von Neumann, Thilo, et al. “Graph-PIT: Generalized Permutation Invariant Training
    for Continuous Separation of Arbitrary Numbers of Speakers.” <i>Interspeech 2021</i>,
    2021, doi:<a href="https://doi.org/10.21437/interspeech.2021-1177">10.21437/interspeech.2021-1177</a>.'
  short: 'T. von Neumann, K. Kinoshita, C. Boeddeker, M. Delcroix, R. Haeb-Umbach,
    in: Interspeech 2021, 2021.'
conference:
  name: Interspeech
date_created: 2021-10-25T08:50:01Z
date_updated: 2023-11-15T12:14:40Z
ddc:
- '000'
department:
- _id: '54'
doi: 10.21437/interspeech.2021-1177
file:
- access_level: open_access
  content_type: video/mp4
  creator: tvn
  date_created: 2021-12-06T10:39:13Z
  date_updated: 2021-12-06T10:48:30Z
  file_id: '28327'
  file_name: Interspeech 2021 voiceover-002-compressed.mp4
  file_size: 9550220
  relation: supplementary_material
  title: Video for INTERSPEECH 2021
- access_level: open_access
  content_type: application/vnd.openxmlformats-officedocument.presentationml.presentation
  creator: tvn
  date_created: 2021-12-06T10:47:01Z
  date_updated: 2021-12-06T10:47:01Z
  file_id: '28328'
  file_name: Graph-PIT-poster-presentation.pptx
  file_size: 1337297
  relation: slides
  title: Slides from INTERSPEECH 2021
- access_level: open_access
  content_type: application/pdf
  creator: tvn
  date_created: 2021-12-06T10:48:21Z
  date_updated: 2021-12-06T10:48:21Z
  file_id: '28329'
  file_name: INTERSPEECH2021_Graph_PIT.pdf
  file_size: 226589
  relation: main_file
file_date_updated: 2021-12-06T10:48:30Z
has_accepted_license: '1'
keyword:
- Continuous speech separation
- automatic speech recognition
- overlapped speech
- permutation invariant training
language:
- iso: eng
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
publication: Interspeech 2021
publication_status: published
quality_controlled: '1'
related_material:
  link:
  - relation: software
    url: https://github.com/fgnt/graph_pit
status: public
title: 'Graph-PIT: Generalized Permutation Invariant Training for Continuous Separation
  of Arbitrary Numbers of Speakers'
type: conference
user_id: '49870'
year: '2021'
...
---
_id: '29173'
author:
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Keisuke
  full_name: Kinoshita, Keisuke
  last_name: Kinoshita
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'von Neumann T, Boeddeker C, Kinoshita K, Delcroix M, Haeb-Umbach R. Speeding
    Up Permutation Invariant Training for Source Separation. In: <i>Speech Communication;
    14th ITG Conference</i>. ; 2021.'
  apa: von Neumann, T., Boeddeker, C., Kinoshita, K., Delcroix, M., &#38; Haeb-Umbach,
    R. (2021). Speeding Up Permutation Invariant Training for Source Separation. <i>Speech
    Communication; 14th ITG Conference</i>. Speech Communication; 14th ITG Conference,
    Kiel.
  bibtex: '@inproceedings{von Neumann_Boeddeker_Kinoshita_Delcroix_Haeb-Umbach_2021,
    title={Speeding Up Permutation Invariant Training for Source Separation}, booktitle={Speech
    Communication; 14th ITG Conference}, author={von Neumann, Thilo and Boeddeker,
    Christoph and Kinoshita, Keisuke and Delcroix, Marc and Haeb-Umbach, Reinhold},
    year={2021} }'
  chicago: Neumann, Thilo von, Christoph Boeddeker, Keisuke Kinoshita, Marc Delcroix,
    and Reinhold Haeb-Umbach. “Speeding Up Permutation Invariant Training for Source
    Separation.” In <i>Speech Communication; 14th ITG Conference</i>, 2021.
  ieee: T. von Neumann, C. Boeddeker, K. Kinoshita, M. Delcroix, and R. Haeb-Umbach,
    “Speeding Up Permutation Invariant Training for Source Separation,” presented
    at the Speech Communication; 14th ITG Conference, Kiel, 2021.
  mla: von Neumann, Thilo, et al. “Speeding Up Permutation Invariant Training for
    Source Separation.” <i>Speech Communication; 14th ITG Conference</i>, 2021.
  short: 'T. von Neumann, C. Boeddeker, K. Kinoshita, M. Delcroix, R. Haeb-Umbach,
    in: Speech Communication; 14th ITG Conference, 2021.'
conference:
  end_date: 2021-10-01
  location: Kiel
  name: Speech Communication; 14th ITG Conference
  start_date: 2021-09-29
date_created: 2022-01-07T10:40:56Z
date_updated: 2023-11-15T12:16:31Z
ddc:
- '000'
department:
- _id: '54'
file:
- access_level: open_access
  content_type: application/pdf
  creator: tvn
  date_created: 2022-01-06T13:23:27Z
  date_updated: 2022-01-06T13:23:27Z
  file_id: '29180'
  file_name: poster.pdf
  file_size: 191938
  relation: poster
- access_level: open_access
  content_type: application/pdf
  creator: tvn
  date_created: 2022-01-07T10:42:54Z
  date_updated: 2022-01-07T10:42:54Z
  file_id: '29181'
  file_name: ITG2021_Speeding_up_Permutation_Invariant_Training.pdf
  file_size: 236670
  relation: main_file
file_date_updated: 2022-01-07T10:42:54Z
has_accepted_license: '1'
language:
- iso: eng
oa: '1'
project:
- _id: '52'
  name: 'PC2: Computing Resources Provided by the Paderborn Center for Parallel Computing'
publication: Speech Communication; 14th ITG Conference
quality_controlled: '1'
status: public
title: Speeding Up Permutation Invariant Training for Source Separation
type: conference
user_id: '49870'
year: '2021'
...
---
_id: '20762'
abstract:
- lang: eng
  text: The rising interest in single-channel multi-speaker speech separation sparked
    development of End-to-End (E2E) approaches to multi-speaker speech recognition.
    However, up until now, state-of-the-art neural network–based time domain source
    separation has not yet been combined with E2E speech recognition. We here demonstrate
    how to combine a separation module based on a Convolutional Time domain Audio
    Separation Network (Conv-TasNet) with an E2E speech recognizer and how to train
    such a model jointly by distributing it over multiple GPUs or by approximating
    truncated back-propagation for the convolutional front-end. To put this work into
    perspective and illustrate the complexity of the design space, we provide a compact
    overview of single-channel multi-speaker recognition systems. Our experiments
    show a word error rate of 11.0% on WSJ0-2mix and indicate that our joint time
    domain model can yield substantial improvements over cascade DNN-HMM and monolithic
    E2E frequency domain systems proposed so far.
author:
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Keisuke
  full_name: Kinoshita, Keisuke
  last_name: Kinoshita
- first_name: Lukas
  full_name: Drude, Lukas
  last_name: Drude
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Tomohiro
  full_name: Nakatani, Tomohiro
  last_name: Nakatani
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'von Neumann T, Kinoshita K, Drude L, et al. End-to-End Training of Time Domain
    Audio Separation and Recognition. In: <i>ICASSP 2020 - 2020 IEEE International
    Conference on Acoustics, Speech and Signal Processing (ICASSP)</i>. ; 2020:7004-7008.
    doi:<a href="https://doi.org/10.1109/ICASSP40776.2020.9053461">10.1109/ICASSP40776.2020.9053461</a>'
  apa: von Neumann, T., Kinoshita, K., Drude, L., Boeddeker, C., Delcroix, M., Nakatani,
    T., &#38; Haeb-Umbach, R. (2020). End-to-End Training of Time Domain Audio Separation
    and Recognition. <i>ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
    Speech and Signal Processing (ICASSP)</i>, 7004–7008. <a href="https://doi.org/10.1109/ICASSP40776.2020.9053461">https://doi.org/10.1109/ICASSP40776.2020.9053461</a>
  bibtex: '@inproceedings{von Neumann_Kinoshita_Drude_Boeddeker_Delcroix_Nakatani_Haeb-Umbach_2020,
    title={End-to-End Training of Time Domain Audio Separation and Recognition}, DOI={<a
    href="https://doi.org/10.1109/ICASSP40776.2020.9053461">10.1109/ICASSP40776.2020.9053461</a>},
    booktitle={ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech
    and Signal Processing (ICASSP)}, author={von Neumann, Thilo and Kinoshita, Keisuke
    and Drude, Lukas and Boeddeker, Christoph and Delcroix, Marc and Nakatani, Tomohiro
    and Haeb-Umbach, Reinhold}, year={2020}, pages={7004–7008} }'
  chicago: Neumann, Thilo von, Keisuke Kinoshita, Lukas Drude, Christoph Boeddeker,
    Marc Delcroix, Tomohiro Nakatani, and Reinhold Haeb-Umbach. “End-to-End Training
    of Time Domain Audio Separation and Recognition.” In <i>ICASSP 2020 - 2020 IEEE
    International Conference on Acoustics, Speech and Signal Processing (ICASSP)</i>,
    7004–8, 2020. <a href="https://doi.org/10.1109/ICASSP40776.2020.9053461">https://doi.org/10.1109/ICASSP40776.2020.9053461</a>.
  ieee: 'T. von Neumann <i>et al.</i>, “End-to-End Training of Time Domain Audio Separation
    and Recognition,” in <i>ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
    Speech and Signal Processing (ICASSP)</i>, 2020, pp. 7004–7008, doi: <a href="https://doi.org/10.1109/ICASSP40776.2020.9053461">10.1109/ICASSP40776.2020.9053461</a>.'
  mla: von Neumann, Thilo, et al. “End-to-End Training of Time Domain Audio Separation
    and Recognition.” <i>ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
    Speech and Signal Processing (ICASSP)</i>, 2020, pp. 7004–08, doi:<a href="https://doi.org/10.1109/ICASSP40776.2020.9053461">10.1109/ICASSP40776.2020.9053461</a>.
  short: 'T. von Neumann, K. Kinoshita, L. Drude, C. Boeddeker, M. Delcroix, T. Nakatani,
    R. Haeb-Umbach, in: ICASSP 2020 - 2020 IEEE International Conference on Acoustics,
    Speech and Signal Processing (ICASSP), 2020, pp. 7004–7008.'
date_created: 2020-12-16T14:07:54Z
date_updated: 2023-11-15T12:17:45Z
ddc:
- '000'
department:
- _id: '54'
doi: 10.1109/ICASSP40776.2020.9053461
file:
- access_level: open_access
  content_type: application/pdf
  creator: huesera
  date_created: 2020-12-16T14:09:48Z
  date_updated: 2020-12-16T14:09:48Z
  file_id: '20763'
  file_name: ICASSP_2020_vonNeumann_Paper.pdf
  file_size: 192529
  relation: main_file
file_date_updated: 2020-12-16T14:09:48Z
has_accepted_license: '1'
language:
- iso: eng
oa: '1'
page: 7004-7008
project:
- _id: '52'
  name: Computing Resources Provided by the Paderborn Center for Parallel Computing
publication: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech
  and Signal Processing (ICASSP)
quality_controlled: '1'
status: public
title: End-to-End Training of Time Domain Audio Separation and Recognition
type: conference
user_id: '49870'
year: '2020'
...
---
_id: '20764'
abstract:
- lang: eng
  text: 'Most approaches to multi-talker overlapped speech separation and recognition
    assume that the number of simultaneously active speakers is given, but in realistic
    situations, it is typically unknown. To cope with this, we extend an iterative
    speech extraction system with mechanisms to count the number of sources and combine
    it with a single-talker speech recognizer to form the first end-to-end multi-talker
    automatic speech recognition system for an unknown number of active speakers.
    Our experiments show very promising performance in counting accuracy, source separation
    and speech recognition on simulated clean mixtures from WSJ0-2mix and WSJ0-3mix.
    Among other results, we set a new state-of-the-art word error rate on the WSJ0-2mix database.
    Furthermore, our system generalizes well to a larger number of speakers than it
    ever saw during training, as shown in experiments with the WSJ0-4mix database. '
author:
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Christoph
  full_name: Boeddeker, Christoph
  id: '40767'
  last_name: Boeddeker
- first_name: Lukas
  full_name: Drude, Lukas
  last_name: Drude
- first_name: Keisuke
  full_name: Kinoshita, Keisuke
  last_name: Kinoshita
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Tomohiro
  full_name: Nakatani, Tomohiro
  last_name: Nakatani
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'von Neumann T, Boeddeker C, Drude L, et al. Multi-Talker ASR for an Unknown
    Number of Sources: Joint Training of Source Counting, Separation and ASR. In:
    <i>Proc. Interspeech 2020</i>. ; 2020:3097-3101. doi:<a href="https://doi.org/10.21437/Interspeech.2020-2519">10.21437/Interspeech.2020-2519</a>'
  apa: 'von Neumann, T., Boeddeker, C., Drude, L., Kinoshita, K., Delcroix, M., Nakatani,
    T., &#38; Haeb-Umbach, R. (2020). Multi-Talker ASR for an Unknown Number of Sources:
    Joint Training of Source Counting, Separation and ASR. <i>Proc. Interspeech 2020</i>,
    3097–3101. <a href="https://doi.org/10.21437/Interspeech.2020-2519">https://doi.org/10.21437/Interspeech.2020-2519</a>'
  bibtex: '@inproceedings{von Neumann_Boeddeker_Drude_Kinoshita_Delcroix_Nakatani_Haeb-Umbach_2020,
    title={Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source
    Counting, Separation and ASR}, DOI={<a href="https://doi.org/10.21437/Interspeech.2020-2519">10.21437/Interspeech.2020-2519</a>},
    booktitle={Proc. Interspeech 2020}, author={von Neumann, Thilo and Boeddeker,
    Christoph and Drude, Lukas and Kinoshita, Keisuke and Delcroix, Marc and Nakatani,
    Tomohiro and Haeb-Umbach, Reinhold}, year={2020}, pages={3097–3101} }'
  chicago: 'Neumann, Thilo von, Christoph Boeddeker, Lukas Drude, Keisuke Kinoshita,
    Marc Delcroix, Tomohiro Nakatani, and Reinhold Haeb-Umbach. “Multi-Talker ASR
    for an Unknown Number of Sources: Joint Training of Source Counting, Separation
    and ASR.” In <i>Proc. Interspeech 2020</i>, 3097–3101, 2020. <a href="https://doi.org/10.21437/Interspeech.2020-2519">https://doi.org/10.21437/Interspeech.2020-2519</a>.'
  ieee: 'T. von Neumann <i>et al.</i>, “Multi-Talker ASR for an Unknown Number of
    Sources: Joint Training of Source Counting, Separation and ASR,” in <i>Proc. Interspeech
    2020</i>, 2020, pp. 3097–3101, doi: <a href="https://doi.org/10.21437/Interspeech.2020-2519">10.21437/Interspeech.2020-2519</a>.'
  mla: 'von Neumann, Thilo, et al. “Multi-Talker ASR for an Unknown Number of Sources:
    Joint Training of Source Counting, Separation and ASR.” <i>Proc. Interspeech 2020</i>,
    2020, pp. 3097–101, doi:<a href="https://doi.org/10.21437/Interspeech.2020-2519">10.21437/Interspeech.2020-2519</a>.'
  short: 'T. von Neumann, C. Boeddeker, L. Drude, K. Kinoshita, M. Delcroix, T. Nakatani,
    R. Haeb-Umbach, in: Proc. Interspeech 2020, 2020, pp. 3097–3101.'
date_created: 2020-12-16T14:12:45Z
date_updated: 2023-11-15T12:17:57Z
ddc:
- '000'
department:
- _id: '54'
doi: 10.21437/Interspeech.2020-2519
file:
- access_level: open_access
  content_type: application/pdf
  creator: huesera
  date_created: 2020-12-16T14:14:14Z
  date_updated: 2020-12-16T14:14:14Z
  file_id: '20765'
  file_name: INTERSPEECH_2020_vonNeumann_Paper.pdf
  file_size: 267893
  relation: main_file
file_date_updated: 2020-12-16T14:14:14Z
has_accepted_license: '1'
language:
- iso: eng
oa: '1'
page: 3097-3101
project:
- _id: '52'
  name: Computing Resources Provided by the Paderborn Center for Parallel Computing
publication: Proc. Interspeech 2020
quality_controlled: '1'
status: public
title: 'Multi-Talker ASR for an Unknown Number of Sources: Joint Training of Source
  Counting, Separation and ASR'
type: conference
user_id: '49870'
year: '2020'
...
---
_id: '20766'
abstract:
- lang: eng
  text: Recently, source separation performance has been greatly improved by time-domain
    audio source separation based on the dual-path recurrent neural network (DPRNN). DPRNN
    is a simple but effective model for long sequential data. While DPRNN is quite
    efficient at modeling sequential data of the length of an utterance, i.e., about
    5 to 10 seconds of data, it is harder to apply it to longer sequences such as whole
    conversations consisting of multiple utterances. This is simply because, in such
    a case, the number of time steps consumed by its internal module called inter-chunk
    RNN becomes extremely large. To mitigate this problem, this paper proposes a multi-path
    RNN (MPRNN), a generalized version of DPRNN, that models the input data in a hierarchical
    manner. In the MPRNN framework, the input data is represented at several (≥ 3)
    time-resolutions, each of which is modeled by a specific RNN sub-module. For example,
    the RNN sub-module that deals with the finest resolution may model temporal relationship
    only within a phoneme, while the RNN sub-module handling the coarsest resolution
    may capture only the relationship between utterances such as speaker information.
    We perform experiments using simulated dialogue-like mixtures and show that MPRNN
    has greater model capacity and outperforms the current state-of-the-art DPRNN
    framework, especially in online processing scenarios.
author:
- first_name: Keisuke
  full_name: Kinoshita, Keisuke
  last_name: Kinoshita
- first_name: Thilo
  full_name: von Neumann, Thilo
  id: '49870'
  last_name: von Neumann
  orcid: https://orcid.org/0000-0002-7717-8670
- first_name: Marc
  full_name: Delcroix, Marc
  last_name: Delcroix
- first_name: Tomohiro
  full_name: Nakatani, Tomohiro
  last_name: Nakatani
- first_name: Reinhold
  full_name: Haeb-Umbach, Reinhold
  id: '242'
  last_name: Haeb-Umbach
citation:
  ama: 'Kinoshita K, von Neumann T, Delcroix M, Nakatani T, Haeb-Umbach R. Multi-Path
    RNN for Hierarchical Modeling of Long Sequential Data and its Application to Speaker
    Stream Separation. In: <i>Proc. Interspeech 2020</i>. ; 2020:2652-2656. doi:<a
    href="https://doi.org/10.21437/Interspeech.2020-2388">10.21437/Interspeech.2020-2388</a>'
  apa: Kinoshita, K., von Neumann, T., Delcroix, M., Nakatani, T., &#38; Haeb-Umbach,
    R. (2020). Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and
    its Application to Speaker Stream Separation. <i>Proc. Interspeech 2020</i>, 2652–2656.
    <a href="https://doi.org/10.21437/Interspeech.2020-2388">https://doi.org/10.21437/Interspeech.2020-2388</a>
  bibtex: '@inproceedings{Kinoshita_von Neumann_Delcroix_Nakatani_Haeb-Umbach_2020,
    title={Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its
    Application to Speaker Stream Separation}, DOI={<a href="https://doi.org/10.21437/Interspeech.2020-2388">10.21437/Interspeech.2020-2388</a>},
    booktitle={Proc. Interspeech 2020}, author={Kinoshita, Keisuke and von Neumann,
    Thilo and Delcroix, Marc and Nakatani, Tomohiro and Haeb-Umbach, Reinhold}, year={2020},
    pages={2652–2656} }'
  chicago: Kinoshita, Keisuke, Thilo von Neumann, Marc Delcroix, Tomohiro Nakatani,
    and Reinhold Haeb-Umbach. “Multi-Path RNN for Hierarchical Modeling of Long Sequential
    Data and Its Application to Speaker Stream Separation.” In <i>Proc. Interspeech
    2020</i>, 2652–56, 2020. <a href="https://doi.org/10.21437/Interspeech.2020-2388">https://doi.org/10.21437/Interspeech.2020-2388</a>.
  ieee: 'K. Kinoshita, T. von Neumann, M. Delcroix, T. Nakatani, and R. Haeb-Umbach,
    “Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application
    to Speaker Stream Separation,” in <i>Proc. Interspeech 2020</i>, 2020, pp. 2652–2656,
    doi: <a href="https://doi.org/10.21437/Interspeech.2020-2388">10.21437/Interspeech.2020-2388</a>.'
  mla: Kinoshita, Keisuke, et al. “Multi-Path RNN for Hierarchical Modeling of Long
    Sequential Data and Its Application to Speaker Stream Separation.” <i>Proc. Interspeech
    2020</i>, 2020, pp. 2652–56, doi:<a href="https://doi.org/10.21437/Interspeech.2020-2388">10.21437/Interspeech.2020-2388</a>.
  short: 'K. Kinoshita, T. von Neumann, M. Delcroix, T. Nakatani, R. Haeb-Umbach,
    in: Proc. Interspeech 2020, 2020, pp. 2652–2656.'
date_created: 2020-12-16T14:15:24Z
date_updated: 2023-11-15T12:14:25Z
ddc:
- '000'
department:
- _id: '54'
doi: 10.21437/Interspeech.2020-2388
file:
- access_level: open_access
  content_type: application/pdf
  creator: huesera
  date_created: 2020-12-16T14:16:32Z
  date_updated: 2020-12-16T14:16:32Z
  file_id: '20767'
  file_name: INTERSPEECH_2020_vonNeumann1_Paper.pdf
  file_size: 1725219
  relation: main_file
file_date_updated: 2020-12-16T14:16:32Z
has_accepted_license: '1'
language:
- iso: eng
oa: '1'
page: 2652-2656
publication: Proc. Interspeech 2020
quality_controlled: '1'
status: public
title: Multi-Path RNN for Hierarchical Modeling of Long Sequential Data and its Application
  to Speaker Stream Separation
type: conference
user_id: '49870'
year: '2020'
...
