Sequential and unsupervised document authorial clustering based on hidden markov model

Khaled Aldebei; Helia Farhood; Wenjing Jia; Priyadarsi Nanda; Xiangjian He

doi:10.1109/Trustcom/BigDataSE/ICESS.2017.261

Research output: Chapter in Book/Conference proceeding › Conference contribution › peer-review

1 Citation (Scopus)

Abstract

Document clustering groups documents that share similar characteristics into the same cluster. It has shown advantages for the organization, retrieval, navigation and summarization of huge amounts of text documents on the Internet. This paper presents a novel, unsupervised approach for clustering single-author documents into groups based on authorship. The key novelty is that we extract contextual correlations among the sentences of each document to capture its hidden writing style, and use them to cluster the documents. For this purpose, we build a Hidden Markov Model (HMM) to represent the relations between sequential sentences, and construct a two-level, unsupervised framework. The proposed approach is evaluated on four benchmark datasets that are widely used for document authorship analysis. A scientific paper is also used to demonstrate the performance of the approach on clustering short segments of a text into authorial components. Experimental results show that the proposed approach outperforms state-of-the-art approaches.
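
The abstract does not give the model details, so the following is only a minimal sketch of the general idea of treating sentences as an HMM observation sequence with authors as hidden states. It is not the authors' two-level framework. Everything in it is an assumption made for illustration: the two hypothetical authors, the reduction of each sentence to a discrete "style symbol" (0 or 1), and the hand-picked probabilities in START_P, TRANS_P and EMIT_P. It uses standard Viterbi decoding to label each sentence with its most likely author.

```python
import math

# Hypothetical parameters for two authors (hidden states 0 and 1).
START_P = [0.5, 0.5]                # either author may write the first sentence
TRANS_P = [[0.9, 0.1], [0.1, 0.9]]  # consecutive sentences usually share an author
EMIT_P = [[0.8, 0.2], [0.2, 0.8]]   # author 0 mostly emits symbol 0, author 1 symbol 1


def viterbi(obs, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for a sequence of discrete symbols."""
    n_states = len(start_p)
    # Log-probability of the best path ending in each state at the first sentence.
    score = [math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in range(n_states)]
    back = []  # back[t - 1][s] is the best predecessor of state s at sentence t
    for o in obs[1:]:
        prev = score
        score, pointers = [], []
        for s in range(n_states):
            best = max(range(n_states), key=lambda r: prev[r] + math.log(trans_p[r][s]))
            pointers.append(best)
            score.append(prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][o]))
        back.append(pointers)
    # Follow the back-pointers from the best final state to the first sentence.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# One style symbol per sentence; the third sentence looks atypical for author 0.
sentence_symbols = [0, 0, 1, 0, 0, 1, 1, 1]
print(viterbi(sentence_symbols, START_P, TRANS_P, EMIT_P))
# prints [0, 0, 0, 0, 0, 1, 1, 1]
```

Because the assumed transition matrix favors staying with the same author, the single atypical third sentence is still attributed to author 0, while the sustained change at the end is attributed to author 1. This is the kind of sequential context that a per-sentence classifier would ignore.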

Original language	English
Title of host publication	Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017
Publisher	Institute of Electrical and Electronics Engineers Inc.
Pages	379-385
Number of pages	7
ISBN (Electronic)	9781509049059
DOIs	https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.261
Publication status	Published - 7 Sept 2017
Externally published	Yes
Event	16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017 - Sydney, Australia
Duration	1 Aug 2017 → 4 Aug 2017

Publication series

Name	Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017

Conference

Conference	16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017
Country/Territory	Australia
City	Sydney
Period	1/08/17 → 4/08/17

Keywords

Document Segmentation
Forensic Analysis
Intelligence Issues
Intrinsic Plagiarism Detection

ASJC Scopus subject areas

Computer Networks and Communications
Information Systems
Software
Information Systems and Management
Safety, Risk, Reliability and Quality

Cite this

Aldebei, K., Farhood, H., Jia, W., Nanda, P., & He, X. (2017). Sequential and unsupervised document authorial clustering based on hidden markov model. In Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017 (pp. 379-385). Article 8029464 (Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.261

Aldebei, Khaled ; Farhood, Helia ; Jia, Wenjing et al. / Sequential and unsupervised document authorial clustering based on hidden markov model. Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017. Institute of Electrical and Electronics Engineers Inc., 2017. pp. 379-385 (Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017).

@inproceedings{187d54835a044100980467ef9b8ee1dc,

title = "Sequential and unsupervised document authorial clustering based on hidden markov model",

abstract = "Document clustering groups documents that share similar characteristics into the same cluster. It has shown advantages for the organization, retrieval, navigation and summarization of huge amounts of text documents on the Internet. This paper presents a novel, unsupervised approach for clustering single-author documents into groups based on authorship. The key novelty is that we extract contextual correlations among the sentences of each document to capture its hidden writing style, and use them to cluster the documents. For this purpose, we build a Hidden Markov Model (HMM) to represent the relations between sequential sentences, and construct a two-level, unsupervised framework. The proposed approach is evaluated on four benchmark datasets that are widely used for document authorship analysis. A scientific paper is also used to demonstrate the performance of the approach on clustering short segments of a text into authorial components. Experimental results show that the proposed approach outperforms state-of-the-art approaches.",

keywords = "Document Segmentation, Forensic Analysis, Intelligence Issues, Intrinsic Plagiarism Detection",

author = "Khaled Aldebei and Helia Farhood and Wenjing Jia and Priyadarsi Nanda and Xiangjian He",

note = "Publisher Copyright: {\textcopyright} 2017 IEEE.; 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017 ; Conference date: 01-08-2017 Through 04-08-2017",

year = "2017",

month = sep,

day = "7",

doi = "10.1109/Trustcom/BigDataSE/ICESS.2017.261",

language = "English",

series = "Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

pages = "379--385",

booktitle = "Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017",

address = "United States",

}

Aldebei, K, Farhood, H, Jia, W, Nanda, P & He, X 2017, Sequential and unsupervised document authorial clustering based on hidden markov model. in Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017., 8029464, Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017, Institute of Electrical and Electronics Engineers Inc., pp. 379-385, 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017, Sydney, Australia, 1/08/17. https://doi.org/10.1109/Trustcom/BigDataSE/ICESS.2017.261

TY - GEN

T1 - Sequential and unsupervised document authorial clustering based on hidden markov model

AU - Aldebei, Khaled

AU - Farhood, Helia

AU - Jia, Wenjing

AU - Nanda, Priyadarsi

AU - He, Xiangjian

PY - 2017/9/7

Y1 - 2017/9/7

N2 - Document clustering groups documents that share similar characteristics into the same cluster. It has shown advantages for the organization, retrieval, navigation and summarization of huge amounts of text documents on the Internet. This paper presents a novel, unsupervised approach for clustering single-author documents into groups based on authorship. The key novelty is that we extract contextual correlations among the sentences of each document to capture its hidden writing style, and use them to cluster the documents. For this purpose, we build a Hidden Markov Model (HMM) to represent the relations between sequential sentences, and construct a two-level, unsupervised framework. The proposed approach is evaluated on four benchmark datasets that are widely used for document authorship analysis. A scientific paper is also used to demonstrate the performance of the approach on clustering short segments of a text into authorial components. Experimental results show that the proposed approach outperforms state-of-the-art approaches.

KW - Document Segmentation

KW - Forensic Analysis

KW - Intelligence Issues

KW - Intrinsic Plagiarism Detection

UR - http://www.scopus.com/inward/record.url?scp=85032336412&partnerID=8YFLogxK

U2 - 10.1109/Trustcom/BigDataSE/ICESS.2017.261

DO - 10.1109/Trustcom/BigDataSE/ICESS.2017.261

M3 - Conference contribution

AN - SCOPUS:85032336412

T3 - Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017

SP - 379

EP - 385

BT - Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017

Y2 - 1 August 2017 through 4 August 2017

ER -

Aldebei K, Farhood H, Jia W, Nanda P, He X. Sequential and unsupervised document authorial clustering based on hidden markov model. In Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017. Institute of Electrical and Electronics Engineers Inc. 2017. p. 379-385. 8029464. (Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017). doi: 10.1109/Trustcom/BigDataSE/ICESS.2017.261
