TY - GEN
T1 - Sequential and unsupervised document authorial clustering based on hidden markov model
AU - Aldebei, Khaled
AU - Farhood, Helia
AU - Jia, Wenjing
AU - Nanda, Priyadarsi
AU - He, Xiangjian
N1 - Publisher Copyright:
© 2017 IEEE.
PY - 2017/9/7
Y1 - 2017/9/7
N2 - Document clustering groups documents of certain similar characteristics in one cluster. Document clustering has shown advantages on organization, retrieval, navigation and summarization of a huge amount of text documents on Internet. This paper presents a novel, unsupervised approach for clustering single-author documents into groups based on authorship. The key novelty is that we propose to extract contextual correlations to depict the writing style hidden among sentences of each document for clustering the documents. For this purpose, we build an Hidden Markov Model (HMM) for representing the relations of sequential sentences, and a two-level, unsupervised framework is constructed. Our proposed approach is evaluated on four benchmark datasets, widely used for document authorship analysis. A scientific paper is also used to demonstrate the performance of the approach on clustering short segments of a text into authorial components. Experimental results show that the proposed approach outperforms the state-of-the-art approaches.
AB - Document clustering groups documents of certain similar characteristics in one cluster. Document clustering has shown advantages on organization, retrieval, navigation and summarization of a huge amount of text documents on Internet. This paper presents a novel, unsupervised approach for clustering single-author documents into groups based on authorship. The key novelty is that we propose to extract contextual correlations to depict the writing style hidden among sentences of each document for clustering the documents. For this purpose, we build an Hidden Markov Model (HMM) for representing the relations of sequential sentences, and a two-level, unsupervised framework is constructed. Our proposed approach is evaluated on four benchmark datasets, widely used for document authorship analysis. A scientific paper is also used to demonstrate the performance of the approach on clustering short segments of a text into authorial components. Experimental results show that the proposed approach outperforms the state-of-the-art approaches.
KW - Document Segmentation
KW - Forensic Analysis
KW - Intelligence Issues
KW - Intrinsic Plagiarism Detection
UR - http://www.scopus.com/inward/record.url?scp=85032336412&partnerID=8YFLogxK
U2 - 10.1109/Trustcom/BigDataSE/ICESS.2017.261
DO - 10.1109/Trustcom/BigDataSE/ICESS.2017.261
M3 - Conference contribution
AN - SCOPUS:85032336412
T3 - Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017
SP - 379
EP - 385
BT - Proceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017
Y2 - 1 August 2017 through 4 August 2017
ER -