GAN-in-GAN for Monaural Speech Enhancement

Yicun Duan; Jianfeng Ren; Heng Yu; Xudong Jiang

doi:10.1109/LSP.2023.3293758

Research output: Journal Publication › Article › peer-review

8 Citations (Scopus)

Abstract

Some generative adversarial networks (GANs) have been developed to remove background noise in real-world audio recordings. MetricGAN and its variants focus on generating a clean spectrogram from a noisy one, but the final audio quality cannot be guaranteed. SEGAN and its variants directly generate enhanced audio from noisy audio, but their over-long input representations make them less effective in identifying and removing audio noise. In this letter, a novel GAN-in-GAN framework is proposed, where the inner GAN conducts spectrogram-to-spectrogram recovery under the supervision of metric discriminators to effectively clean the audio noise, and the outer GAN conducts audio-to-audio recovery under the supervision of multi-resolution discriminators to optimize the final audio quality. To tackle the challenges of using multiple adversarial losses to train the proposed GAN-in-GAN simultaneously, a novel gradient balancing scheme is proposed to facilitate coherent training. The proposed method is compared with state-of-the-art methods on the VoiceBank+DEMAND dataset for audio denoising. It outperforms all the compared methods.

Original language	English
Pages (from-to)	853-857
Number of pages	5
Journal	IEEE Signal Processing Letters
Volume	30
DOIs	https://doi.org/10.1109/LSP.2023.3293758
Publication status	Published - 2023

Keywords

GAN-in-GAN
Generative adversarial network
gradient balancing
speech enhancement

ASJC Scopus subject areas

Signal Processing
Electrical and Electronic Engineering
Applied Mathematics

Access to Document

10.1109/LSP.2023.3293758

Cite this

@article{88bf3027db634604928e0c52a62525e7,

title = "GAN-in-GAN for Monaural Speech Enhancement",

abstract = "Some generative adversarial networks (GANs) have been developed to remove background noise in real-world audio recordings. MetricGAN and its variants focus on generating a clean spectrogram from a noisy one, but the final audio quality can't be guaranteed. SEGAN and its variants directly generate an enhanced audio from a noisy one, but their over-long input representations make it less effective in identifying and removing audio noise. In this letter, a novel GAN-in-GAN framework is proposed, where the inner GAN conducts spectrogram-to-spectrogram recovery under the supervision of metric discriminators to effectively clean the audio noise, and the outer GAN conducts an audio-to-audio recovery under the supervision of multi-resolution discriminators to optimize the final audio quality. To tackle the challenges of utilizing multiple adversarial losses for training the proposed GAN-in-GAN simultaneously, a novel gradient balancing scheme is proposed to facilitate a coherent training. The proposed method is compared with state-of-the-art methods on the VoiceBank+DEMAND dataset for audio denoising. It outperforms all the compared methods.",

keywords = "GAN-in-GAN, Generative adversarial network, gradient balancing, speech enhancement",

author = "Yicun Duan and Jianfeng Ren and Heng Yu and Xudong Jiang",

note = "Funding Information: This work was supported by the Ningbo Municipal Bureau of Science and Technology under Grants 2019B10026 and 2022Z173. Publisher Copyright: {\textcopyright} 2023 IEEE.",

year = "2023",

doi = "10.1109/LSP.2023.3293758",

language = "English",

volume = "30",

pages = "853--857",

journal = "IEEE Signal Processing Letters",

issn = "1070-9908",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - JOUR

T1 - GAN-in-GAN for Monaural Speech Enhancement

AU - Duan, Yicun

AU - Ren, Jianfeng

AU - Yu, Heng

AU - Jiang, Xudong

PY - 2023

Y1 - 2023

N2 - Some generative adversarial networks (GANs) have been developed to remove background noise in real-world audio recordings. MetricGAN and its variants focus on generating a clean spectrogram from a noisy one, but the final audio quality can't be guaranteed. SEGAN and its variants directly generate an enhanced audio from a noisy one, but their over-long input representations make it less effective in identifying and removing audio noise. In this letter, a novel GAN-in-GAN framework is proposed, where the inner GAN conducts spectrogram-to-spectrogram recovery under the supervision of metric discriminators to effectively clean the audio noise, and the outer GAN conducts an audio-to-audio recovery under the supervision of multi-resolution discriminators to optimize the final audio quality. To tackle the challenges of utilizing multiple adversarial losses for training the proposed GAN-in-GAN simultaneously, a novel gradient balancing scheme is proposed to facilitate a coherent training. The proposed method is compared with state-of-the-art methods on the VoiceBank+DEMAND dataset for audio denoising. It outperforms all the compared methods.

AB - Some generative adversarial networks (GANs) have been developed to remove background noise in real-world audio recordings. MetricGAN and its variants focus on generating a clean spectrogram from a noisy one, but the final audio quality can't be guaranteed. SEGAN and its variants directly generate an enhanced audio from a noisy one, but their over-long input representations make it less effective in identifying and removing audio noise. In this letter, a novel GAN-in-GAN framework is proposed, where the inner GAN conducts spectrogram-to-spectrogram recovery under the supervision of metric discriminators to effectively clean the audio noise, and the outer GAN conducts an audio-to-audio recovery under the supervision of multi-resolution discriminators to optimize the final audio quality. To tackle the challenges of utilizing multiple adversarial losses for training the proposed GAN-in-GAN simultaneously, a novel gradient balancing scheme is proposed to facilitate a coherent training. The proposed method is compared with state-of-the-art methods on the VoiceBank+DEMAND dataset for audio denoising. It outperforms all the compared methods.

KW - GAN-in-GAN

KW - Generative adversarial network

KW - gradient balancing

KW - speech enhancement

UR - http://www.scopus.com/inward/record.url?scp=85164704101&partnerID=8YFLogxK

U2 - 10.1109/LSP.2023.3293758

DO - 10.1109/LSP.2023.3293758

M3 - Article

AN - SCOPUS:85164704101

SN - 1070-9908

VL - 30

SP - 853

EP - 857

JO - IEEE Signal Processing Letters

JF - IEEE Signal Processing Letters

ER -
