Hybrid generative–discriminative hash tracking with spatio-temporal contextual cues

Manna Dai, Shuying Cheng, Xiangjian He

Research output: Journal Publication › Article › peer-review

Visual object tracking is of great practical value in video monitoring systems. Recent work on video tracking has taken into account the spatial relationship between the target object and its background. In this paper, the spatial relationship is combined with the temporal relationship between features on different video frames, so that a real-time tracker is designed based on a hash algorithm with spatio-temporal cues. Unlike most existing work on video tracking, which treats tracking as either image matching or image classification alone, we propose a hierarchical framework that performs both matching and classification to form a coarse-to-fine tracking system. We develop a generative model under a modified particle filter with hash fingerprints for coarse matching via the maximum a posteriori estimate, and a discriminative model for fine classification by maximizing a confidence map based on a context model. The confidence map reveals the spatio-temporal dynamics of the target. Because a hash fingerprint is merely a binary vector and the modified particle filter uses only a small number of particles, our tracker has a low computation cost. Experiments on eight challenging video sequences from a public benchmark demonstrate that our tracker outperforms eight state-of-the-art trackers in both accuracy and speed.
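The abstract's coarse matching step can be illustrated with a minimal sketch: compute a binary hash fingerprint for an image patch and score candidate patches (e.g., particle hypotheses) by Hamming distance, picking the closest. This sketch uses a simple average hash as a stand-in; the paper's specific hash algorithm, particle filter, and MAP formulation are not reproduced here, and all function names below are hypothetical.

```python
import numpy as np

def average_hash(patch, size=8):
    """Binary fingerprint of a grayscale patch: downsample to size x size
    block means, then threshold at the global mean. A simple stand-in
    hash, not the paper's algorithm."""
    h, w = patch.shape
    ys = np.linspace(0, h, size + 1, dtype=int)
    xs = np.linspace(0, w, size + 1, dtype=int)
    small = np.array([[patch[ys[i]:ys[i+1], xs[j]:xs[j+1]].mean()
                       for j in range(size)] for i in range(size)])
    return (small > small.mean()).astype(np.uint8).ravel()

def hamming(a, b):
    """Number of differing bits between two binary fingerprints."""
    return int(np.count_nonzero(a != b))

def coarse_match(template_hash, candidate_patches):
    """Return the index of the candidate patch whose fingerprint is
    closest to the template's, mimicking a coarse best-particle choice."""
    dists = [hamming(template_hash, average_hash(p)) for p in candidate_patches]
    return int(np.argmin(dists)), dists
```

Because each fingerprint is just a short binary vector, comparing a template against many candidates costs only a few bitwise operations per candidate, which is consistent with the low computation cost the abstract claims for the hash-based coarse stage.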

Original language: English
Pages (from-to): 389-399
Number of pages: 11
Journal: Neural Computing and Applications
Issue number: 2
Publication status: Published - 1 Jan 2018
Externally published: Yes


Keywords

  • Confidence map
  • Hash algorithm
  • Hierarchical framework
  • Maximum a posteriori (MAP)
  • Spatio-temporal cues

ASJC Scopus subject areas

  • Software
  • Artificial Intelligence

