TY - GEN
T1 - Free-FreeSLT
T2 - 6th ACM International Conference on Multimedia in Asia Workshops, MMAsia 2024 Workshops
AU - Sun, Weirong
AU - Ma, Yujun
AU - Wang, Ruili
N1 - Publisher Copyright:
© 2024 Copyright held by the owner/author(s).
PY - 2024/12/26
Y1 - 2024/12/26
N2 - Sign language translation (SLT) is a demanding task that integrates visual and linguistic information and requires cross-modal learning to translate visual motions into text. Current gloss-based methods employ gloss annotations for translation. However, annotated sign language video data are of limited availability, and gloss-based methods depend on labor-intensive, high-quality annotation work for sign language videos. To tackle this issue, we introduce a novel two-stage gloss-free sign language translation model with a parameter-free visual-language pre-training method that enhances visual and semantic representations without introducing extra parameters. The proposed two-stage model works as follows: (i) in the pre-training stage, Contrastive Language-Image Pre-training (CLIP) is adopted to align visual and textual features, which are then aggregated using a mean pooling mechanism; (ii) in the fine-tuning stage, the parameters of the pre-trained model are inherited to enhance sign language translation. Our proposed model surpasses the leading gloss-free SLT model on PHOENIX-2014T in BLEU scores across various n-gram levels.
AB - Sign language translation (SLT) is a demanding task that integrates visual and linguistic information and requires cross-modal learning to translate visual motions into text. Current gloss-based methods employ gloss annotations for translation. However, annotated sign language video data are of limited availability, and gloss-based methods depend on labor-intensive, high-quality annotation work for sign language videos. To tackle this issue, we introduce a novel two-stage gloss-free sign language translation model with a parameter-free visual-language pre-training method that enhances visual and semantic representations without introducing extra parameters. The proposed two-stage model works as follows: (i) in the pre-training stage, Contrastive Language-Image Pre-training (CLIP) is adopted to align visual and textual features, which are then aggregated using a mean pooling mechanism; (ii) in the fine-tuning stage, the parameters of the pre-trained model are inherited to enhance sign language translation. Our proposed model surpasses the leading gloss-free SLT model on PHOENIX-2014T in BLEU scores across various n-gram levels.
KW - Contrastive Language-Image Pre-training (CLIP)
KW - Gloss-free
KW - Sign Language Translation
UR - http://www.scopus.com/inward/record.url?scp=85216584889&partnerID=8YFLogxK
U2 - 10.1145/3700410.3702115
DO - 10.1145/3700410.3702115
M3 - Conference contribution
AN - SCOPUS:85216584889
T3 - Proceedings of the 6th ACM International Conference on Multimedia in Asia Workshops, MMAsia 2024 Workshops
BT - Proceedings of the 6th ACM International Conference on Multimedia in Asia Workshops, MMAsia 2024 Workshops
PB - Association for Computing Machinery, Inc
Y2 - 3 December 2024 through 6 December 2024
ER -