TY - GEN
T1 - Mix-fine-tune
T2 - 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
AU - Lei, Chengxi
AU - Singh, Satwinder
AU - Hou, Feng
AU - Wang, Ruili
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/12/28
Y1 - 2024/12/28
N2 - Self-supervised Learning (SSL) using extensive unlabeled speech data has significantly improved the performance of ASR models on datasets such as LibriSpeech. However, few studies have addressed the domain mismatch between the data used to pre-train and fine-tune ASR models. Moreover, the Empirical Risk Minimization (ERM) principle, commonly used to train deep learning models, often causes trained models to exhibit undesirable behaviors such as memorizing training data and being sensitive to adversarial examples. Thus, in this paper, we propose an alternate fine-tuning strategy, called Mix-fine-tune, to address domain mismatch in ASR systems and the limitations of the ERM training principle. Mix-fine-tune uses a data-driven weighted sum of two speech sequences as input, and the corresponding text sequences are used to calculate a weighted audio-text alignment Connectionist Temporal Classification (CTC) loss for fine-tuning a pre-trained model. Additionally, Mix-fine-tune incorporates the masked Contrastive Predictive Coding (CPC) loss, previously used exclusively for pre-training, into the fine-tuning process. Our novel strategy alternates between minimizing the CTC loss and the CPC loss to address the domain mismatch between pre-training and fine-tuning. We validate our method by fine-tuning different sizes of the Wav2Vec model using the public Air Traffic Control (ATC) corpus. The experiments show that Mix-fine-tune efficiently adapts models pre-trained on general speech corpora such as LibriSpeech to a specific domain (e.g., the air traffic control domain) through fine-tuning.
AB - Self-supervised Learning (SSL) using extensive unlabeled speech data has significantly improved the performance of ASR models on datasets such as LibriSpeech. However, few studies have addressed the domain mismatch between the data used to pre-train and fine-tune ASR models. Moreover, the Empirical Risk Minimization (ERM) principle, commonly used to train deep learning models, often causes trained models to exhibit undesirable behaviors such as memorizing training data and being sensitive to adversarial examples. Thus, in this paper, we propose an alternate fine-tuning strategy, called Mix-fine-tune, to address domain mismatch in ASR systems and the limitations of the ERM training principle. Mix-fine-tune uses a data-driven weighted sum of two speech sequences as input, and the corresponding text sequences are used to calculate a weighted audio-text alignment Connectionist Temporal Classification (CTC) loss for fine-tuning a pre-trained model. Additionally, Mix-fine-tune incorporates the masked Contrastive Predictive Coding (CPC) loss, previously used exclusively for pre-training, into the fine-tuning process. Our novel strategy alternates between minimizing the CTC loss and the CPC loss to address the domain mismatch between pre-training and fine-tuning. We validate our method by fine-tuning different sizes of the Wav2Vec model using the public Air Traffic Control (ATC) corpus. The experiments show that Mix-fine-tune efficiently adapts models pre-trained on general speech corpora such as LibriSpeech to a specific domain (e.g., the air traffic control domain) through fine-tuning.
KW - domain adaptation
KW - fine-tuning
KW - self-supervised learning
KW - speech recognition
UR - http://www.scopus.com/inward/record.url?scp=85216188755&partnerID=8YFLogxK
U2 - 10.1145/3696409.3700259
DO - 10.1145/3696409.3700259
M3 - Conference contribution
AN - SCOPUS:85216188755
T3 - Proceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
BT - Proceedings of the 6th ACM International Conference on Multimedia in Asia, MMAsia 2024
PB - Association for Computing Machinery, Inc
Y2 - 3 December 2024 through 6 December 2024
ER -