Context-aware sentence categorisation: word mover’s distance and character-level convolutional recurrent neural network

  • Xinyu Fu

Student thesis: PhD Thesis

Abstract

Supervised k nearest neighbour and unsupervised hierarchical agglomerative clustering algorithms can be enhanced with a word mover’s distance-based sentence distance metric to offer superior context-aware sentence categorisation performance. An advanced neural network-oriented classifier is able to achieve competitive results on the benchmark streams via an aggregated recurrent unit combined with a sophisticated convolutional layer.

The continually increasing number of textual snippets produced each year necessitates ever-improving information processing methods for searching, retrieving, and organising text. Central to these methods are sentence classification and clustering, which have become important applications of natural language processing and information retrieval. The present work proposes three novel sentence categorisation frameworks, namely hierarchical agglomerative clustering-word mover’s distance, k nearest neighbour-word mover’s distance, and convolutional recurrent neural network.

Hierarchical agglomerative clustering-word mover’s distance employs the word mover’s distance distortion function to cluster unlabelled sentences around nearby centroids. K nearest neighbour-word mover’s distance classifies testing textual snippets through word mover’s distance-based sentence similarity. Both models belong to the spectrum of count-based frameworks, since they apply term frequency statistics when building the vector space matrix. Experimental evaluation on the two unsupervised learning data-sets shows that hierarchical agglomerative clustering-word mover’s distance outperforms its competitors on mean squared error, completeness score, homogeneity score, and v-measure value. For k nearest neighbour-word mover’s distance, experiments on two benchmark textual streams verify its superior classification performance against the comparison algorithms on precision, recall, and F1 score. The performance comparison is statistically validated via the Mann-Whitney U test. Through extensive experiments and analysis of the results, each research hypothesis is confirmed.

Unlike a traditional singleton neural network, the convolutional recurrent neural network model combines a character-level convolutional network with a character-aware recurrent neural network to form a single framework. The proposed model benefits from the character-aware convolutional neural network in that only salient features are selected and fed into the integrated character-aware recurrent neural network. The character-aware recurrent neural network effectively learns long-sequence semantics via a sophisticated update mechanism. The experiment presented in this thesis compares the convolutional recurrent neural network framework against state-of-the-art text classification algorithms on four popular benchmark corpora. The present work also analyses the impact of three different recurrent neural network hidden cells on performance and runtime efficiency. It is observed that the minimal gated unit achieves the best runtime and performance comparable to that of the gated recurrent unit and long short-term memory. For term frequency-inverse document frequency-based algorithms, the experiment examines word2vec, global vectors for word representation, and sent2vec embeddings and reports their performance differences. The performance comparison is statistically validated through the Mann-Whitney U test, and the corresponding hypotheses are confirmed by the reported statistical analysis.
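
The idea behind k nearest neighbour-word mover’s distance can be illustrated with a minimal sketch. It is not the thesis implementation: it substitutes the relaxed word mover’s distance (a cheap lower bound on the exact distance, which would need an optimal transport solver), and the two-dimensional embeddings, the training sentences, and all function names are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical 2-D word embeddings (assumption, for illustration only).
EMBEDDINGS = {
    "goal": (1.0, 0.1), "match": (0.9, 0.2),
    "striker": (0.95, 0.0), "team": (0.8, 0.3),
    "vote": (0.1, 1.0), "election": (0.0, 0.9),
    "senate": (0.2, 0.95), "party": (0.3, 0.8),
}


def nbow(tokens):
    """Normalised term-frequency weights over in-vocabulary tokens."""
    counts = Counter(t for t in tokens if t in EMBEDDINGS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sentence has no in-vocabulary tokens")
    return {word: count / total for word, count in counts.items()}


def one_sided_cost(src, dst):
    """Move each source word's whole weight to its nearest target word."""
    return sum(
        weight * min(math.dist(EMBEDDINGS[w], EMBEDDINGS[v]) for v in dst)
        for w, weight in src.items()
    )


def relaxed_wmd(tokens_a, tokens_b):
    """Relaxed word mover's distance: the larger of the two one-sided costs."""
    a, b = nbow(tokens_a), nbow(tokens_b)
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


def knn_predict(query, train, k=3):
    """Majority label among the k training sentences closest to the query."""
    nearest = sorted(train, key=lambda item: relaxed_wmd(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled sentences (assumption, not thesis data).
TRAIN = [
    (["striker", "goal"], "sport"),
    (["team", "match"], "sport"),
    (["senate", "vote"], "politics"),
    (["party", "election"], "politics"),
]
```

Calling `knn_predict(["goal", "team"], TRAIN)` ranks the training sentences by distance and takes a majority vote over the three closest. Swapping `relaxed_wmd` for an exact word mover’s distance solver would leave the classifier logic unchanged.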
Date of Award: 8 Jul 2018
Original language: English
Awarding Institution
  • University of Nottingham
Supervisors: Eugene Ch'ng & Uwe Aickelin

Keywords

  • Sentence Categorisation
  • Word Mover's Distance
  • Convolutional Neural Network
  • Recurrent Neural Network
  • Sentence Similarity
  • Sentence Distance
