Improved Speech Separation Performance from Monaural Mixed Speech Based on Deep Embedding Network
Category: Journal (individual paper)
Group: [C] Electronics, Information and Systems Society
Publication date: 2022/06/01
Title (English): Improved Speech Separation Performance from Monaural Mixed Speech Based on Deep Embedding Network
Authors: Shaoxiang Dang (Graduate School of Informatics, Nagoya University), Tetsuya Matsumoto (Graduate School of Informatics, Nagoya University), Hiroaki Kudo (Graduate School of Informatics, Nagoya University), Yoshinori Takeuchi (School of Informatics, Daido U
Authors (English): Shaoxiang Dang (Graduate School of Informatics, Nagoya University), Tetsuya Matsumoto (Graduate School of Informatics, Nagoya University), Hiroaki Kudo (Graduate School of Informatics, Nagoya University), Yoshinori Takeuchi (School of Informatics, Daido University)
Keywords: speech separation, deep embedding network, monaural speech separation, permutation invariant training
Abstract (English): Speech separation is the task of separating mixed utterances in which multiple people speak simultaneously. The topic arose from the cocktail party problem and has recently also become a front-end procedure for speech-related research. Two types of models address it: classification methods and regression methods. Classification methods avoid the permutation problem, whereas regression methods are more precise. Deep clustering (DC) is a common deep learning classification technique that uses a deep embedding network to embed signals in the underlying manifold, so that signals with similar properties cluster together in the embedded space. The model is supervised by the ideal binary mask (IBM). Even if the speech is entirely isolated via supervised binary masking, this setting is susceptible to errors. To remedy this flaw, this paper proposes a network with two modules based on deep embedding. More precise masks, such as the ideal ratio mask (IRM), can then be used instead of the IBM, while the cascade structure may preserve DC's high performance. On average, the proposed model outperforms the original DC model by 1.55 dB in SNR, 4.45 dB in SDR, 4.41 dB in SDRi, 0.16 in STOI, and 0.3 in PESQ on mixtures of Japanese Newspaper Article Sentences (JNAS). Finally, a comparison with ChimeraNet, which has a similar structure, shows the proposed model's superiority over previous DC-related studies.
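The contrast the abstract draws between the IBM and the IRM can be illustrated with a small sketch. It is not taken from the paper: the function names are hypothetical, the IRM is assumed to be the magnitude-ratio form |A| / (|A| + |B|) (the paper may use a different definition), and the mixture magnitude is assumed to be the sum of the source magnitudes, which is only an approximation for real signals.

```python
# Illustrative sketch only; names and the IRM definition are assumptions.
# Spectrograms are plain nested lists: rows = frames, columns = frequency bins.

def ideal_binary_mask(mag_a, mag_b):
    """1.0 where source A is strictly louder than source B in a bin, else 0.0."""
    return [[1.0 if a > b else 0.0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mag_a, mag_b)]

def ideal_ratio_mask(mag_a, mag_b, eps=1e-8):
    """Share of the bin's magnitude belonging to source A (assumed IRM form)."""
    return [[a / (a + b + eps) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mag_a, mag_b)]

def apply_mask(mask, mixture):
    """Element-wise product of a mask and a mixture magnitude spectrogram."""
    return [[m * x for m, x in zip(row_m, row_x)]
            for row_m, row_x in zip(mask, mixture)]

# Toy magnitudes for two speakers (2 frames x 2 bins).
mag_a = [[3.0, 1.0], [0.5, 2.0]]
mag_b = [[1.0, 1.0], [1.5, 2.0]]
# Assumed additive mixture of magnitudes.
mixture = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(mag_a, mag_b)]

ibm = ideal_binary_mask(mag_a, mag_b)
irm = ideal_ratio_mask(mag_a, mag_b)
est_ibm = apply_mask(ibm, mixture)  # each bin is either kept whole or zeroed
est_irm = apply_mask(irm, mixture)  # each bin is scaled by A's share
```

Under these assumptions the ratio mask recovers speaker A's magnitudes almost exactly, while the binary mask assigns each bin entirely to one speaker, which is one way to read the abstract's remark that ratio masks are more precise.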
Journal: IEEJ Transactions on Electronics, Information and Systems (電気学会論文誌C) Vol.142 No.6 (2022)
Pages in journal: 643-649
Manuscript type: Paper / English
Link to electronic version: https://www.jstage.jst.go.jp/article/ieejeiss/142/6/142_643/_article/-char/ja/
