Self-attention based Text Knowledge Mining for Text Detection

Qi Wan; Haoqin Ji; Linlin Shen

doi:10.1109/CVPR46437.2021.00592

Self-attention based Text Knowledge Mining for Text Detection

Qi Wan, Haoqin Ji, Linlin Shen

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

46 Citations (Scopus)

Abstract

Pre-trained models play an important role in deep learning based text detectors. However, most methods ignore the gap between natural images and scene text images and directly apply ImageNet for pre-training. To address such a problem, some of them firstly pre-train the model using a large amount of synthetic data and then fine-tune it on target datasets, which is task-specific and has limited generalization capability. In this paper, we focus on providing general pre-trained models for text detectors. Considering the importance of exploring text contents for text detection, we propose STKM (Self-attention based Text Knowledge Mining), which consists of a CNN Encoder and a Self-attention Decoder, to learn general prior knowledge for text detection from SynthText. Given only image level text labels, Self-attention Decoder directly decodes features extracted from CNN Encoder to texts without requirement of detection, which guides the CNN backbone to explicitly learn discriminative semantic representations ignored by previous approaches. After that, the text knowledge learned by the backbone can be transferred to various text detectors to significantly improve their detection performance (e.g., 5.89% higher F-measure for EAST on ICDAR15 dataset) without bells and whistles. Pre-trained model is available at: https://github.com/CVI-SZU/STKM.

Original language	English
Title of host publication	Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021
Publisher	IEEE Computer Society
Pages	5979-5988
Number of pages	10
ISBN (Electronic)	9781665445092
DOIs	https://doi.org/10.1109/CVPR46437.2021.00592
Publication status	Published - 2021
Externally published	Yes
Event	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021 - Virtual, Online, United States Duration: 19 Jun 2021 → 25 Jun 2021

Publication series

Name	Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
ISSN (Print)	1063-6919

Conference

Conference	2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021
Country/Territory	United States
City	Virtual, Online
Period	19/06/21 → 25/06/21

ASJC Scopus subject areas

Software
Computer Vision and Pattern Recognition

Access to Document

10.1109/CVPR46437.2021.00592

Cite this

Wan, Q., Ji, H., & Shen, L. (2021). Self-attention based Text Knowledge Mining for Text Detection. In Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021 (pp. 5979-5988). (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition). IEEE Computer Society. https://doi.org/10.1109/CVPR46437.2021.00592

@inproceedings{c80927a8f3a242028d10ccb30fd26d73,

title = "Self-attention based Text Knowledge Mining for Text Detection",

abstract = "Pre-trained models play an important role in deep learning based text detectors. However, most methods ignore the gap between natural images and scene text images and directly apply ImageNet for pre-training. To address such a problem, some of them firstly pre-train the model using a large amount of synthetic data and then fine-tune it on target datasets, which is task-specific and has limited generalization capability. In this paper, we focus on providing general pre-trained models for text detectors. Considering the importance of exploring text contents for text detection, we propose STKM (Self-attention based Text Knowledge Mining), which consists of a CNN Encoder and a Self-attention Decoder, to learn general prior knowledge for text detection from SynthText. Given only image level text labels, Self-attention Decoder directly decodes features extracted from CNN Encoder to texts without requirement of detection, which guides the CNN backbone to explicitly learn discriminative semantic representations ignored by previous approaches. After that, the text knowledge learned by the backbone can be transferred to various text detectors to significantly improve their detection performance (e.g., 5.89% higher F-measure for EAST on ICDAR15 dataset) without bells and whistles. Pre-trained model is available at: https://github.com/CVI-SZU/STKM.",

author = "Qi Wan and Haoqin Ji and Linlin Shen",

note = "Publisher Copyright: {\textcopyright} 2021 IEEE; 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021 ; Conference date: 19-06-2021 Through 25-06-2021",

year = "2021",

doi = "10.1109/CVPR46437.2021.00592",

language = "English",

series = "Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition",

publisher = "IEEE Computer Society",

pages = "5979--5988",

booktitle = "Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021",

address = "United States",

}

Wan, Q, Ji, H & Shen, L 2021, Self-attention based Text Knowledge Mining for Text Detection. in Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021. Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, IEEE Computer Society, pp. 5979-5988, 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021, Virtual, Online, United States, 19/06/21. https://doi.org/10.1109/CVPR46437.2021.00592

Self-attention based Text Knowledge Mining for Text Detection. / Wan, Qi; Ji, Haoqin; Shen, Linlin.
Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021. IEEE Computer Society, 2021. p. 5979-5988 (Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition).

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

TY - GEN

T1 - Self-attention based Text Knowledge Mining for Text Detection

AU - Wan, Qi

AU - Ji, Haoqin

AU - Shen, Linlin

PY - 2021

Y1 - 2021

N2 - Pre-trained models play an important role in deep learning based text detectors. However, most methods ignore the gap between natural images and scene text images and directly apply ImageNet for pre-training. To address such a problem, some of them firstly pre-train the model using a large amount of synthetic data and then fine-tune it on target datasets, which is task-specific and has limited generalization capability. In this paper, we focus on providing general pre-trained models for text detectors. Considering the importance of exploring text contents for text detection, we propose STKM (Self-attention based Text Knowledge Mining), which consists of a CNN Encoder and a Self-attention Decoder, to learn general prior knowledge for text detection from SynthText. Given only image level text labels, Self-attention Decoder directly decodes features extracted from CNN Encoder to texts without requirement of detection, which guides the CNN backbone to explicitly learn discriminative semantic representations ignored by previous approaches. After that, the text knowledge learned by the backbone can be transferred to various text detectors to significantly improve their detection performance (e.g., 5.89% higher F-measure for EAST on ICDAR15 dataset) without bells and whistles. Pre-trained model is available at: https://github.com/CVI-SZU/STKM.

AB - Pre-trained models play an important role in deep learning based text detectors. However, most methods ignore the gap between natural images and scene text images and directly apply ImageNet for pre-training. To address such a problem, some of them firstly pre-train the model using a large amount of synthetic data and then fine-tune it on target datasets, which is task-specific and has limited generalization capability. In this paper, we focus on providing general pre-trained models for text detectors. Considering the importance of exploring text contents for text detection, we propose STKM (Self-attention based Text Knowledge Mining), which consists of a CNN Encoder and a Self-attention Decoder, to learn general prior knowledge for text detection from SynthText. Given only image level text labels, Self-attention Decoder directly decodes features extracted from CNN Encoder to texts without requirement of detection, which guides the CNN backbone to explicitly learn discriminative semantic representations ignored by previous approaches. After that, the text knowledge learned by the backbone can be transferred to various text detectors to significantly improve their detection performance (e.g., 5.89% higher F-measure for EAST on ICDAR15 dataset) without bells and whistles. Pre-trained model is available at: https://github.com/CVI-SZU/STKM.

UR - http://www.scopus.com/inward/record.url?scp=85117190324&partnerID=8YFLogxK

U2 - 10.1109/CVPR46437.2021.00592

DO - 10.1109/CVPR46437.2021.00592

M3 - Conference contribution

AN - SCOPUS:85117190324

T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition

SP - 5979

EP - 5988

BT - Proceedings - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021

PB - IEEE Computer Society

T2 - 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2021

Y2 - 19 June 2021 through 25 June 2021

ER -

Self-attention based Text Knowledge Mining for Text Detection

Abstract

Publication series

Conference

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this