Similarity and degree of perplexity analysis of Chinese characters

Research output: Journal Publication, Article, peer-review

1 Citation (Scopus)


The study of Chinese scripts has long drawn scholarly attention from a variety of perspectives. With the development of computing technology, recent years have seen growing interest in the graphic patterns of Chinese characters. This paper focuses on the likelihood of orthographic confusion among any given set of Chinese characters. By defining a normalized metric based on character stroke sequences, we introduce the similarity between two Chinese characters, a unique numerical value in the interval [0, 1] that measures the extent to which one character is prone to be confused with another. For a set consisting of a large number of characters, we introduce the concept of degree of perplexity (DP), which measures the stroke-count-weighted average similarity between a given character and the rest of the characters in the set. An efficient and easy-to-implement algorithm is designed to compute the similarity and the degree of perplexity. Our formulas are calibrated with numerical experiments on the 200 most frequently used characters. Based on the numerical simulations, an exponential functional relationship between the degree of perplexity and the number of strokes is proposed and calibrated with least-squares regression. Finally, possible applications of the measures introduced are discussed.
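The abstract does not spell out the metric or the weighting, so the following is only a rough sketch of how such measures might be computed, not the paper's algorithm. It assumes that similarity is one minus the Levenshtein edit distance between stroke sequences divided by the longer sequence length, and that DP weights each similarity by the stroke count of the other character. The names `similarity`, `degree_of_perplexity`, and `STROKES`, as well as the five-letter stroke coding, are hypothetical choices made for illustration.

```python
from typing import Dict, List, Sequence

# Illustrative stroke coding (an assumption, not taken from the paper):
# h = horizontal, s = vertical, p = left-falling, n = right-falling.
STROKES: Dict[str, List[str]] = {
    "十": ["h", "s"],
    "土": ["h", "s", "h"],
    "人": ["p", "n"],
}


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two stroke sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Assumed normalized similarity in [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def degree_of_perplexity(char: str, strokes: Dict[str, List[str]]) -> float:
    """Assumed DP: average similarity to every other character in the set,
    weighted by the other character's stroke count."""
    target = strokes[char]
    weighted_sum = 0.0
    total_weight = 0.0
    for other, seq in strokes.items():
        if other == char:
            continue
        weight = len(seq)
        weighted_sum += weight * similarity(target, seq)
        total_weight += weight
    return weighted_sum / total_weight if total_weight else 0.0
```

Under these assumptions, calling `degree_of_perplexity("十", STROKES)` averages the similarity of 十 to 土 and to 人, with 土 counting more because it has more strokes. A different normalization or weighting scheme would change the values but not the overall structure of the computation.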

Original language: English
Pages (from-to): 189-206
Number of pages: 18
Journal: Journal of Quantitative Linguistics
Issue number: 3
Publication status: Published - Aug 2011
Externally published: Yes

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

