User video summarization based on joint visual and semantic affinity graph

Zhuo Lei; Ke Sun; Qian Zhang; Guoping Qiu

doi:10.1145/2983563.2983568

School of Computer Science

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

8 Citations (Scopus)

Abstract

Automatically generating summaries of user-generated videos is very useful but challenging. User-generated videos are unedited and usually contain only a single long shot, which makes traditional video temporal segmentation methods such as shot boundary detection ineffective in producing meaningful video segments for summarization. To address this issue, we propose a novel temporal segmentation framework based on clustering the joint visual and semantic affinity graph of the video frames. Using a pre-trained deep convolutional neural network (CNN), we extract deep visual features of the frames to construct the visual affinity graph. We then construct the semantic affinity graph of the frames based on word embeddings of the frames' semantic tags, which are generated by an automatic image tagging algorithm. A dense neighbor method is then used to cluster the joint visual and semantic affinity graph, dividing the video into subshot-level segments from which a summary of the video can be generated. Experimental results show that our approach outperforms state-of-the-art methods. Furthermore, we show that the method achieves results similar to those produced manually.
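
The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the two affinity graphs are built or combined, so everything here is an assumption: cosine similarity stands in for the affinity measure, `alpha` is a hypothetical mixing weight, and `split_segments` is a simple adjacent-frame threshold used as a placeholder for the dense neighbor clustering step, not an implementation of it. The feature arrays are toy data rather than real CNN features or tag embeddings.

```python
import numpy as np


def cosine_affinity(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between row vectors, clipped to [0, 1]."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero vectors
    return np.clip(unit @ unit.T, 0.0, 1.0)


def joint_affinity(visual_feats: np.ndarray,
                   tag_embeds: np.ndarray,
                   alpha: float = 0.5) -> np.ndarray:
    """Weighted sum of the visual and semantic affinity matrices.

    `alpha` is a hypothetical parameter; the combination rule is assumed.
    """
    visual = cosine_affinity(visual_feats)
    semantic = cosine_affinity(tag_embeds)
    return alpha * visual + (1.0 - alpha) * semantic


def split_segments(affinity: np.ndarray, threshold: float = 0.5):
    """Placeholder segmentation: cut between consecutive frames whose
    joint affinity falls below `threshold`. Returns half-open index ranges."""
    n = affinity.shape[0]
    segments, start = [], 0
    for i in range(1, n):
        if affinity[i - 1, i] < threshold:
            segments.append((start, i))
            start = i
    segments.append((start, n))
    return segments


# Toy example: four frames, two visually and semantically distinct scenes.
visual = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
semantic = np.array([[1.0, 0.0, 0.0], [1.0, 0.1, 0.0],
                     [0.0, 0.0, 1.0], [0.0, 0.1, 1.0]])

joint = joint_affinity(visual, semantic, alpha=0.5)
segments = split_segments(joint, threshold=0.5)
print(segments)  # [(0, 2), (2, 4)]
```

In this toy case the first two frames and the last two frames form separate segments because the joint affinity between frames 1 and 2 is close to zero. A summary could then be assembled by picking a representative frame or clip from each segment.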

Original language	English
Title of host publication	Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016
Publisher	Association for Computing Machinery, Inc
Pages	45-52
Number of pages	8
ISBN (Electronic)	9781450345194
DOIs	https://doi.org/10.1145/2983563.2983568
Publication status	Published - 16 Oct 2016
Event	2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, Iv and L-MM 2016 - Amsterdam, Netherlands Duration: 16 Oct 2016 → …

Publication series

Name	Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016

Conference

Conference	2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, Iv and L-MM 2016
Country/Territory	Netherlands
City	Amsterdam
Period	16/10/16 → …

Keywords

Clustering
Joint affinity graph
User-generated video
Video summarization
Video temporal segmentation

ASJC Scopus subject areas

Human-Computer Interaction
Computer Graphics and Computer-Aided Design
Computer Vision and Pattern Recognition

Access to Document

10.1145/2983563.2983568

Cite this

Lei, Z., Sun, K., Zhang, Q., & Qiu, G. (2016). User video summarization based on joint visual and semantic affinity graph. In Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016 (pp. 45-52). (Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016). Association for Computing Machinery, Inc. https://doi.org/10.1145/2983563.2983568

Lei, Zhuo ; Sun, Ke ; Zhang, Qian et al. / User video summarization based on joint visual and semantic affinity graph. Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016. Association for Computing Machinery, Inc, 2016. pp. 45-52 (Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016).

@inproceedings{e3736a4a5e6143c5ab2d00c3659c1392,

title = "User video summarization based on joint visual and semantic affinity graph",

abstract = "Automatically generating summaries of user-generated videos is very useful but challenging. User-generated videos are unedited and usually only contain a long single shot which makes traditional video temporal segmentation methods such as shot boundary detection ineffective in producing meaningful video segments for summarization. To address this issue, we propose a novel temporal segmentation framework based on the clustering of joint visual and semantic affinity graph of the video frames. Based on a pre-trained deep convolutional neural network (CNN), we extract deep visual features of the frames to construct the visual affinity graph. We then construct the semantic affinity graph of the frames based on word embedding of the frames' semantic tags generated from an automatic image tagging algorithm. A dense neighbor method is then used to cluster the joint visual and semantic affinity graph to divide the video into subshot level segments and from which a summary of the video can be generated. Experimental results show that our approach outperforms state-of-the-art methods. Furthermore, we show that the method achieves results that are similar to those performed manually.",

keywords = "Clustering, Joint affinity graph, User-generated video, Video summarization, Video temporal segmentation",

author = "Zhuo Lei and Ke Sun and Qian Zhang and Guoping Qiu",

note = "Publisher Copyright: {\textcopyright} 2016 ACM.; 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, Iv and L-MM 2016 ; Conference date: 16-10-2016",

year = "2016",

month = oct,

day = "16",

doi = "10.1145/2983563.2983568",

language = "English",

series = "Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016",

publisher = "Association for Computing Machinery, Inc",

pages = "45--52",

booktitle = "Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016",

}

Lei, Z, Sun, K, Zhang, Q & Qiu, G 2016, User video summarization based on joint visual and semantic affinity graph. in Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016. Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016, Association for Computing Machinery, Inc, pp. 45-52, 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, Iv and L-MM 2016, Amsterdam, Netherlands, 16/10/16. https://doi.org/10.1145/2983563.2983568

User video summarization based on joint visual and semantic affinity graph. / Lei, Zhuo; Sun, Ke; Zhang, Qian et al.
Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016. Association for Computing Machinery, Inc, 2016. p. 45-52 (Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016).

TY - GEN

T1 - User video summarization based on joint visual and semantic affinity graph

AU - Lei, Zhuo

AU - Sun, Ke

AU - Zhang, Qian

AU - Qiu, Guoping

PY - 2016/10/16

Y1 - 2016/10/16

N2 - Automatically generating summaries of user-generated videos is very useful but challenging. User-generated videos are unedited and usually only contain a long single shot which makes traditional video temporal segmentation methods such as shot boundary detection ineffective in producing meaningful video segments for summarization. To address this issue, we propose a novel temporal segmentation framework based on the clustering of joint visual and semantic affinity graph of the video frames. Based on a pre-trained deep convolutional neural network (CNN), we extract deep visual features of the frames to construct the visual affinity graph. We then construct the semantic affinity graph of the frames based on word embedding of the frames' semantic tags generated from an automatic image tagging algorithm. A dense neighbor method is then used to cluster the joint visual and semantic affinity graph to divide the video into subshot level segments and from which a summary of the video can be generated. Experimental results show that our approach outperforms state-of-the-art methods. Furthermore, we show that the method achieves results that are similar to those performed manually.

KW - Clustering

KW - Joint affinity graph

KW - User-generated video

KW - Video summarization

KW - Video temporal segmentation

UR - http://www.scopus.com/inward/record.url?scp=84995488319&partnerID=8YFLogxK

U2 - 10.1145/2983563.2983568

DO - 10.1145/2983563.2983568

M3 - Conference contribution

AN - SCOPUS:84995488319

T3 - Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016

SP - 45

EP - 52

BT - Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016

PB - Association for Computing Machinery, Inc

T2 - 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, Iv and L-MM 2016

Y2 - 16 October 2016

ER -

Lei Z, Sun K, Zhang Q, Qiu G. User video summarization based on joint visual and semantic affinity graph. In Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016. Association for Computing Machinery, Inc. 2016. p. 45-52. (Iv and L-MM 2016 - Proceedings of the 2016 ACM Workshop on Vision and Language Integration Meets Multimedia Fusion, co-located with ACM Multimedia 2016). doi: 10.1145/2983563.2983568
