TY - GEN
T1 - PhasePerturbation
T2 - 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops
AU - Lei, Chengxi
AU - Singh, Satwinder
AU - Hou, Feng
AU - Jia, Xiaoyun
AU - Wang, Ruili
N1 - Publisher Copyright: © 2023 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2023/12/6
Y1 - 2023/12/6
N2 - Most current speech data augmentation methods operate on either the raw waveform or the amplitude spectrum of speech. In this paper, we propose a novel speech data augmentation method called PhasePerturbation that operates dynamically on the phase spectrum of speech. Instead of statically rotating the phase by a constant angle, PhasePerturbation applies three dynamic phase spectrum operations, i.e., a randomization operation, a frequency masking operation, and a temporal masking operation, to enhance the diversity of speech data. We conduct experiments on wav2vec2.0 pre-trained ASR models by fine-tuning them on the PhasePerturbation-augmented TIMIT corpus. The experimental results demonstrate a 10.9% relative reduction in word error rate (WER) compared with the baseline model fine-tuned without any augmentation. Furthermore, the proposed method achieves additional improvements (12.9% and 15.9%) in WER when combined with Vocal Tract Length Perturbation (VTLP) and SpecAug, both of which are amplitude spectrum-based augmentation methods. These results highlight the capability of PhasePerturbation to improve current amplitude spectrum-based augmentation methods.
KW - data augmentation
KW - phase spectrum augmentation
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85182924414&partnerID=8YFLogxK
U2 - 10.1145/3611380.3628555
DO - 10.1145/3611380.3628555
M3 - Conference contribution
AN - SCOPUS:85182924414
T3 - Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops
BT - Proceedings of the 5th ACM International Conference on Multimedia in Asia, MMAsia 2023 Workshops
PB - Association for Computing Machinery, Inc
Y2 - 6 December 2023 through 8 December 2023
ER -