FACLSTM: ConvLSTM with focused attention for scene text recognition

Qingqing Wang, Ye Huang, Wenjing Jia, Xiangjian He, Michael Blumenstein, Shujing Lyu, Yue Lu

Research output: Journal Publication › Article › peer-review

31 Citations (Scopus)

Abstract

Scene text recognition has recently been widely treated as a sequence-to-sequence prediction problem, in which the traditional fully-connected LSTM (FC-LSTM) has played a critical role. Owing to the limitations of FC-LSTM, existing methods must convert 2-D feature maps into 1-D sequential feature vectors, which severely damages the valuable spatial and structural information of text images. In this paper, we argue that scene text recognition is essentially a spatiotemporal prediction problem because its inputs are 2-D images, and we propose a convolutional LSTM (ConvLSTM)-based scene text recognizer, FACLSTM (focused attention ConvLSTM), in which the spatial correlation of pixels is fully leveraged when performing sequential prediction with LSTM. In particular, the attention mechanism is incorporated into an efficient ConvLSTM structure via convolutional operations, and additional character center masks are generated to help focus attention on the right feature areas. Experimental results on the benchmark datasets IIIT5K, SVT and CUTE demonstrate that the proposed FACLSTM performs competitively on regular, low-resolution and noisy text images, and outperforms state-of-the-art approaches on curved text images by large margins.
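For illustration, below is a minimal PyTorch sketch of the core idea the abstract describes: an LSTM cell whose gates are computed with 2-D convolutions, so the hidden state keeps its spatial layout, with a convolutional spatial attention mask re-weighting the input features. This is not the authors' released code; the class name AttnConvLSTMCell, the single-channel attention map, and all shapes and hyperparameters are assumptions, and the paper's character center masks are omitted.

import torch
import torch.nn as nn

class AttnConvLSTMCell(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, in_ch, hid_ch, ksize=3):
        super().__init__()
        pad = ksize // 2
        # One convolution produces all four gates (i, f, o, g) at once.
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, ksize, padding=pad)
        # Convolutional attention: a 1-channel spatial map over the input,
        # standing in for the paper's focused attention (details differ).
        self.attn = nn.Conv2d(in_ch + hid_ch, 1, ksize, padding=pad)

    def forward(self, x, state):
        h, c = state                      # hidden/cell states: (B, hid_ch, H, W)
        z = torch.cat([x, h], dim=1)
        a = torch.sigmoid(self.attn(z))   # (B, 1, H, W) spatial attention mask
        x = x * a                         # re-weight input features spatially
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        g = torch.tanh(g)
        c = f * c + i * g                 # convolutional analogue of the LSTM update
        h = o * torch.tanh(c)
        return h, (h, c)

# Usage: one step over a (batch, channels, height, width) feature map.
cell = AttnConvLSTMCell(in_ch=64, hid_ch=32)
x = torch.randn(2, 64, 8, 25)
h0 = torch.zeros(2, 32, 8, 25)
c0 = torch.zeros(2, 32, 8, 25)
h, (h, c) = cell(x, (h0, c0))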

Original language: English
Article number: 120103
Journal: Science China Information Sciences
Volume: 63
Issue number: 2
DOIs
Publication status: Published - 1 Feb 2020
Externally published: Yes

Keywords

  • convolutional LSTM
  • focused attention
  • scene text recognition
  • sequential prediction
  • spatial correlation

ASJC Scopus subject areas

  • General Computer Science
