Abstract
The study of Chinese scripts has long attracted scholarly attention from various perspectives. Aided by advances in computing technology, recent years have seen growing interest in the graphic patterns of Chinese characters. This paper focuses on the likelihood of orthographic confusion among any given set of Chinese characters. By defining a normalized metric on character stroke sequences, we introduce the similarity between two Chinese characters, a unique numerical value in the interval [0, 1] that measures the extent to which one character is prone to be confused with another. For a set containing a large number of characters, we introduce the degree of perplexity (DP), defined as the stroke-count-weighted average similarity between a given character and the remaining characters in the set. An efficient and easy-to-implement algorithm is designed to compute both the similarity and the degree of perplexity. Our formulas are calibrated with numerical experiments on the 200 most frequently used characters. Based on these simulations, an exponential functional relationship between the degree of perplexity and the number of strokes is proposed and calibrated by least-squares regression. Finally, possible applications of the proposed measures are discussed.
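The abstract does not spell out the exact formulas, so the sketch below only illustrates the general idea under stated assumptions: the stroke-sequence similarity is approximated here by a normalized edit distance (an assumption, not the authors' metric), the degree of perplexity is taken as a stroke-count-weighted average of pairwise similarities, and the stroke encodings in `STROKES` are hypothetical toy data rather than the data set used in the paper.

```python
# Illustrative sketch only: normalized edit distance is a stand-in for the
# paper's stroke-sequence metric, and the stroke encodings are hypothetical.

# Each character mapped to a tuple of stroke labels
# (h = horizontal, s = vertical, p = left-falling, n = right-falling).
STROKES = {
    "人": ("p", "n"),
    "入": ("p", "n"),
    "木": ("h", "s", "p", "n"),
    "本": ("h", "s", "p", "n", "h"),
}


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus edit distance normalized by the longer sequence."""
    sa, sb = STROKES[a], STROKES[b]
    m, n = len(sa), len(sb)
    # Standard dynamic-programming edit distance.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if sa[i - 1] == sb[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 1.0 - d[m][n] / max(m, n)


def degree_of_perplexity(c: str, charset: list[str]) -> float:
    """Stroke-count-weighted average similarity between c and the other characters."""
    others = [x for x in charset if x != c]
    weights = [len(STROKES[x]) for x in others]
    sims = [similarity(c, x) for x in others]
    return sum(w * s for w, s in zip(weights, sims)) / sum(weights)


if __name__ == "__main__":
    chars = list(STROKES)
    for c in chars:
        print(c, round(degree_of_perplexity(c, chars), 3))
```

With such a definition in hand, the exponential relationship mentioned in the abstract would amount to fitting DP(n) ≈ a·exp(b·n) against the number of strokes n by least squares over the character set.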
| Original language | English |
|---|---|
| Pages (from-to) | 189-206 |
| Number of pages | 18 |
| Journal | Journal of Quantitative Linguistics |
| Volume | 18 |
| Issue number | 3 |
| DOIs | |
| Publication status | Published - Aug 2011 |
| Externally published | Yes |
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language