TY - GEN
T1 - Self-supervised learning based phone-fortified speech enhancement
AU - Qiu, Yuanhang
AU - Wang, Ruili
AU - Singh, Satwinder
AU - Ma, Zhizhong
AU - Hou, Feng
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - For speech enhancement, deep complex network-based methods have shown promising performance due to their effectiveness in dealing with complex-valued spectra. Recent speech enhancement methods focus on further optimizing network structures and hyperparameters but ignore inherent speech characteristics (e.g., phonetic characteristics), which are important for networks to learn and reconstruct speech information. In this paper, we propose a novel self-supervised learning based phone-fortified (SSPF) method for speech enhancement. Our method explicitly imports phonetic characteristics into a deep complex convolutional network via a Contrastive Predictive Coding (CPC) model pre-trained with self-supervised learning. This greatly improves speech representation learning and speech enhancement performance. Moreover, we apply the self-attention mechanism to our model to learn long-range dependencies of a speech sequence, which further improves speech enhancement performance. The experimental results demonstrate that our SSPF method outperforms existing methods and achieves state-of-the-art performance in terms of speech quality and intelligibility.
AB - For speech enhancement, deep complex network-based methods have shown promising performance due to their effectiveness in dealing with complex-valued spectra. Recent speech enhancement methods focus on further optimizing network structures and hyperparameters but ignore inherent speech characteristics (e.g., phonetic characteristics), which are important for networks to learn and reconstruct speech information. In this paper, we propose a novel self-supervised learning based phone-fortified (SSPF) method for speech enhancement. Our method explicitly imports phonetic characteristics into a deep complex convolutional network via a Contrastive Predictive Coding (CPC) model pre-trained with self-supervised learning. This greatly improves speech representation learning and speech enhancement performance. Moreover, we apply the self-attention mechanism to our model to learn long-range dependencies of a speech sequence, which further improves speech enhancement performance. The experimental results demonstrate that our SSPF method outperforms existing methods and achieves state-of-the-art performance in terms of speech quality and intelligibility.
KW - Contrastive predictive coding
KW - Self-attention mechanism
KW - Self-supervised learning
KW - Speech enhancement
UR - http://www.scopus.com/inward/record.url?scp=85119192598&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-734
DO - 10.21437/Interspeech.2021-734
M3 - Conference contribution
AN - SCOPUS:85119192598
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2793
EP - 2797
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -