Sequential and unsupervised document authorial clustering based on hidden markov model

Khaled Aldebei, Helia Farhood, Wenjing Jia, Priyadarsi Nanda, Xiangjian He

Research output: Chapter in Book/Conference proceedingConference contributionpeer-review

1 Citation (Scopus)

Abstract

Document clustering groups documents of certain similar characteristics in one cluster. Document clustering has shown advantages on organization, retrieval, navigation and summarization of a huge amount of text documents on Internet. This paper presents a novel, unsupervised approach for clustering single-author documents into groups based on authorship. The key novelty is that we propose to extract contextual correlations to depict the writing style hidden among sentences of each document for clustering the documents. For this purpose, we build an Hidden Markov Model (HMM) for representing the relations of sequential sentences, and a two-level, unsupervised framework is constructed. Our proposed approach is evaluated on four benchmark datasets, widely used for document authorship analysis. A scientific paper is also used to demonstrate the performance of the approach on clustering short segments of a text into authorial components. Experimental results show that the proposed approach outperforms the state-of-the-art approaches.

Original languageEnglish
Title of host publicationProceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages379-385
Number of pages7
ISBN (Electronic)9781509049059
DOIs
Publication statusPublished - 7 Sept 2017
Externally publishedYes
Event16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017 - Sydney, Australia
Duration: 1 Aug 20174 Aug 2017

Publication series

NameProceedings - 16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017

Conference

Conference16th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, 11th IEEE International Conference on Big Data Science and Engineering and 14th IEEE International Conference on Embedded Software and Systems, Trustcom/BigDataSE/ICESS 2017
Country/TerritoryAustralia
CitySydney
Period1/08/174/08/17

Keywords

  • Document Segmentation
  • Forensic Analysis
  • Intelligence Issues
  • Intrinsic Plagiarism Detection

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Software
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality

Fingerprint

Dive into the research topics of 'Sequential and unsupervised document authorial clustering based on hidden markov model'. Together they form a unique fingerprint.

Cite this