Abstract
The study of Chinese scripts has long attracted scholarly attention from various perspectives. Aided by advances in computing technology, recent years have seen growing interest in the graphic patterns of Chinese characters. This paper focuses on the likelihood of orthographic confusion among any given set of Chinese characters. By defining a normalized metric on character stroke sequences, we introduce the similarity between two Chinese characters, a unique numerical value in the interval [0, 1] that measures the extent to which one character is prone to be confused with another. For a set containing a large number of characters, we introduce the degree of perplexity (DP), defined as the stroke-count-weighted average similarity between a given character and the remaining characters in the set. An efficient and easy-to-implement algorithm is designed to compute both the similarity and the degree of perplexity. Our formulas are calibrated with numerical experiments on the 200 most frequently used characters. Based on these simulations, an exponential functional relationship between the degree of perplexity and the number of strokes is proposed and calibrated by least-squares regression. Finally, possible applications of the proposed measures are discussed.
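The abstract does not spell out the exact formulas, so the sketch below only illustrates the general idea under stated assumptions: the stroke-sequence similarity is approximated here by a normalized edit distance (an assumption, not the authors' metric), the degree of perplexity is taken as a stroke-count-weighted average of pairwise similarities, and the stroke encodings in `STROKES` are hypothetical toy data rather than the data set used in the paper.

```python
# Illustrative sketch only: normalized edit distance is a stand-in for the
# paper's stroke-sequence metric, and the stroke encodings are hypothetical.

# Each character mapped to a tuple of stroke labels
# (h = horizontal, s = vertical, p = left-falling, n = right-falling).
STROKES = {
    "人": ("p", "n"),
    "入": ("p", "n"),
    "木": ("h", "s", "p", "n"),
    "本": ("h", "s", "p", "n", "h"),
}


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance normalized by the longer sequence."""
    sa, sb = STROKES[a], STROKES[b]
    m, n = len(sa), len(sb)
    # Standard dynamic-programming edit distance.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if sa[i - 1] == sb[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1.0 - d[m][n] / max(m, n)


def degree_of_perplexity(c: str, charset: list[str]) -> float:
    """Stroke-count-weighted average similarity between c and the other characters."""
    others = [x for x in charset if x != c]
    weights = [len(STROKES[x]) for x in others]
    sims = [similarity(c, x) for x in others]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


if __name__ == "__main__":
    chars = list(STROKES)
    for c in chars:
        print(c, round(degree_of_perplexity(c, chars), 3))
```

With such a definition in hand, the exponential relationship mentioned in the abstract would amount to fitting DP(n) ≈ a·exp(b·n) against the number of strokes n by least squares over the character set.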
| Original language | English |
|---|---|
| Pages (from-to) | 189-206 |
| Number of pages | 18 |
| Journal | Journal of Quantitative Linguistics |
| Volume | 18 |
| Issue number | 3 |
| DOIs | |
| Publication status | Published - Aug 2011 |
| Externally published | Yes |
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language