PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition

Chengxi Lei; Satwinder Singh; Feng Hou; Xiaoyun Jia; Ruili Wang

doi:10.1145/3611380.3628555

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Most current speech data augmentation methods operate on either the raw waveform or the amplitude spectrum of speech. In this paper, we propose a novel speech data augmentation method called PhasePerturbation that operates dynamically on the phase spectrum of speech. Instead of statically rotating a phase by a constant degree, PhasePerturbation uses three dynamic phase spectrum operations (a randomization operation, a frequency masking operation, and a temporal masking operation) to enhance the diversity of speech data. We conduct experiments on wav2vec2.0 pre-trained ASR models by fine-tuning them on the PhasePerturbation-augmented TIMIT corpus. The experimental results demonstrate a 10.9% relative reduction in word error rate (WER) compared with the baseline model fine-tuned without any augmentation operation. Furthermore, the proposed method achieves additional improvements in WER (12.9% and 15.9%) by complementing Vocal Tract Length Perturbation (VTLP) and SpecAug, which are both amplitude spectrum-based augmentation methods. The results highlight the capability of PhasePerturbation to improve current amplitude spectrum-based augmentation methods.
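
The three phase-spectrum operations named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the details, so the function name `perturb_phase`, its parameters, the uniform noise range, and the choice to set masked phases to zero are all assumptions made for illustration. The input is assumed to be a complex short-time Fourier transform stored as a list of frequency-bin rows, each holding one complex value per frame.

```python
import cmath
import math
import random


def perturb_phase(spec, max_shift=math.pi / 8, freq_mask_width=2,
                  time_mask_width=2, rng=None):
    """Perturb only the phase of a complex spectrogram (hypothetical sketch).

    spec: list of rows (frequency bins); each row is a list of complex
          values, one per time frame. Magnitudes are left unchanged.
    """
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    mags = [[abs(v) for v in row] for row in spec]
    phases = [[cmath.phase(v) for v in row] for row in spec]

    # 1) Randomization (assumed form): add an independent uniform offset
    #    in [-max_shift, max_shift] to every phase value.
    for f in range(n_freq):
        for t in range(n_time):
            phases[f][t] += rng.uniform(-max_shift, max_shift)

    # 2) Frequency masking (assumed form): zero the phase in a random band
    #    of consecutive frequency bins.
    f0 = rng.randint(0, n_freq - freq_mask_width)
    for f in range(f0, f0 + freq_mask_width):
        for t in range(n_time):
            phases[f][t] = 0.0

    # 3) Temporal masking (assumed form): zero the phase in a random run
    #    of consecutive frames.
    t0 = rng.randint(0, n_time - time_mask_width)
    for t in range(t0, t0 + time_mask_width):
        for f in range(n_freq):
            phases[f][t] = 0.0

    # Recombine the untouched magnitudes with the perturbed phases.
    return [[cmath.rect(mags[f][t], phases[f][t]) for t in range(n_time)]
            for f in range(n_freq)]


# Usage: a toy 5-bin x 6-frame spectrogram with non-zero magnitudes.
toy = [[cmath.rect(1.0 + f + t, 0.3 * (f - t)) for t in range(6)]
       for f in range(5)]
augmented = perturb_phase(toy, rng=random.Random(0))
```

Because only the phase is modified, the amplitude spectrum is preserved, which is consistent with the abstract's claim that the method can complement amplitude-based augmentations such as VTLP and SpecAug.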

Original language	English
Title of host publication	Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops
Publisher	Association for Computing Machinery, Inc
ISBN (Electronic)	9798400703263
DOIs	https://doi.org/10.1145/3611380.3628555
Publication status	Published - 6 Dec 2023
Externally published	Yes
Event	5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops - Hybrid, Tainan, Taiwan, Province of China; Duration: 6 Dec 2023 → 8 Dec 2023

Publication series

Name	Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops

Conference

Conference	5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops
Country/Territory	Taiwan, Province of China
City	Hybrid, Tainan
Period	6/12/23 → 8/12/23

Keywords

data augmentation
phase spectrum augmentation
speech recognition

ASJC Scopus subject areas

Computer Graphics and Computer-Aided Design
Human-Computer Interaction

Cite this

Lei, C., Singh, S., Hou, F., Jia, X., & Wang, R. (2023). PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition. In Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops Article 2 (Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops). Association for Computing Machinery, Inc. https://doi.org/10.1145/3611380.3628555

Lei, Chengxi ; Singh, Satwinder ; Hou, Feng et al. / PhasePerturbation : Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition. Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops. Association for Computing Machinery, Inc, 2023. (Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops).

@inproceedings{8d194a1eb8ce499992e97a3fcb409370,
  title = "PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition",
  abstract = "Most of the current speech data augmentation methods operate on either the raw waveform or the amplitude spectrum of speech. In this paper, we propose a novel speech data augmentation method called PhasePerturbation that operates dynamically on the phase spectrum of speech. Instead of statically rotating a phase by a constant degree, PhasePerturbation utilizes three dynamic phase spectrum operations, i.e., a randomization operation, a frequency masking operation, and a temporal masking operation, to enhance the diversity of speech data. We conduct experiments on wav2vec2.0 pre-trained ASR models by fine-tuning them with the PhasePerturbation augmented TIMIT corpus. The experimental results demonstrate 10.9% relative reduction in the word error rate (WER) compared with the baseline model fine-tuned without any augmentation operation. Furthermore, the proposed method achieves additional improvements (12.9% and 15.9%) in WER by complementing the Vocal Tract Length Perturbation (VTLP) and the SpecAug, which are both amplitude spectrum-based augmentation methods. The results highlight the capability of PhasePerturbation to improve the current amplitude spectrum-based augmentation methods.",
  keywords = "data augmentation, phase spectrum augmentation, speech recognition",
  author = "Chengxi Lei and Satwinder Singh and Feng Hou and Xiaoyun Jia and Ruili Wang",
  note = "Publisher Copyright: {\textcopyright} 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.; 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops ; Conference date: 06-12-2023 Through 08-12-2023",
  year = "2023",
  month = dec,
  day = "6",
  doi = "10.1145/3611380.3628555",
  language = "English",
  series = "Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops",
  publisher = "Association for Computing Machinery, Inc",
  booktitle = "Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops",
}

Lei, C, Singh, S, Hou, F, Jia, X & Wang, R 2023, PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition. in Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops., 2, Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops, Association for Computing Machinery, Inc, 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops, Hybrid, Tainan, Taiwan, Province of China, 6/12/23. https://doi.org/10.1145/3611380.3628555

PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition. / Lei, Chengxi; Singh, Satwinder; Hou, Feng et al.
Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops. Association for Computing Machinery, Inc, 2023. 2 (Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops).

TY - GEN

T1 - PhasePerturbation

T2 - 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops

AU - Lei, Chengxi

AU - Singh, Satwinder

AU - Hou, Feng

AU - Jia, Xiaoyun

AU - Wang, Ruili

PY - 2023/12/6

Y1 - 2023/12/6

N2 - Most of the current speech data augmentation methods operate on either the raw waveform or the amplitude spectrum of speech. In this paper, we propose a novel speech data augmentation method called PhasePerturbation that operates dynamically on the phase spectrum of speech. Instead of statically rotating a phase by a constant degree, PhasePerturbation utilizes three dynamic phase spectrum operations, i.e., a randomization operation, a frequency masking operation, and a temporal masking operation, to enhance the diversity of speech data. We conduct experiments on wav2vec2.0 pre-trained ASR models by fine-tuning them with the PhasePerturbation augmented TIMIT corpus. The experimental results demonstrate 10.9% relative reduction in the word error rate (WER) compared with the baseline model fine-tuned without any augmentation operation. Furthermore, the proposed method achieves additional improvements (12.9% and 15.9%) in WER by complementing the Vocal Tract Length Perturbation (VTLP) and the SpecAug, which are both amplitude spectrum-based augmentation methods. The results highlight the capability of PhasePerturbation to improve the current amplitude spectrum-based augmentation methods.

AB - Most of the current speech data augmentation methods operate on either the raw waveform or the amplitude spectrum of speech. In this paper, we propose a novel speech data augmentation method called PhasePerturbation that operates dynamically on the phase spectrum of speech. Instead of statically rotating a phase by a constant degree, PhasePerturbation utilizes three dynamic phase spectrum operations, i.e., a randomization operation, a frequency masking operation, and a temporal masking operation, to enhance the diversity of speech data. We conduct experiments on wav2vec2.0 pre-trained ASR models by fine-tuning them with the PhasePerturbation augmented TIMIT corpus. The experimental results demonstrate 10.9% relative reduction in the word error rate (WER) compared with the baseline model fine-tuned without any augmentation operation. Furthermore, the proposed method achieves additional improvements (12.9% and 15.9%) in WER by complementing the Vocal Tract Length Perturbation (VTLP) and the SpecAug, which are both amplitude spectrum-based augmentation methods. The results highlight the capability of PhasePerturbation to improve the current amplitude spectrum-based augmentation methods.

KW - data augmentation

KW - phase spectrum augmentation

KW - speech recognition

UR - http://www.scopus.com/inward/record.url?scp=85182924414&partnerID=8YFLogxK

U2 - 10.1145/3611380.3628555

DO - 10.1145/3611380.3628555

M3 - Conference contribution

AN - SCOPUS:85182924414

T3 - Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops

BT - Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops

PB - Association for Computing Machinery, Inc

Y2 - 6 December 2023 through 8 December 2023

ER -

Lei C, Singh S, Hou F, Jia X, Wang R. PhasePerturbation: Speech Data Augmentation via Phase Perturbation for Automatic Speech Recognition. In Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops. Association for Computing Machinery, Inc. 2023. 2. (Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops). doi: 10.1145/3611380.3628555
