Some statistics about the structure of Chinese Characters

Posted by mark October 8, 2010 in the Group General Discussion.

I now have some statistics to illustrate some facts about written Chinese.  For the past year or so, I have been working on a Web site that contains structural data about characters (see if you are curious).  I generated some statistics about character structure based on the characters in the old HSK vocabulary list (a larger set than the new HSK list, and a well-defined sub-universe of Chinese characters).

The first column in the table below is the number of times a character can be broken down into simpler component characters or radicals.  For example, 好 -> 女 + 子.  子 can in turn be broken down into 了 and 一, giving 好 a count of 2.  The 11 characters that can't be broken down at all are characters like 一 and 乙, which are already about as simple as they get.

The second column is the number of times that a character participates in the formation of a more complex character.  In my example above, 女 and 子 would each get a count for their participation in 好.  Interestingly, while some characters are active joiners, about 78% (2,219 of the 2,842 characters) are stay-at-homes that don't participate in character formation at all, at least not within the sub-universe of characters I chose to work with, and not until someone needs to invent a new character.
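
For anyone who wants to try this on their own data, here is a minimal sketch of how the two counts could be computed. It is not the code behind my site. The `DECOMPOSITION` table is hypothetical toy data holding only the 好 example, the names `levels` and `usage_counts` are made up for illustration, and I am assuming that the first count is the number of levels you can go down and that the second count only credits direct components.

```python
from collections import Counter

# Hypothetical toy data: each character maps to its direct components.
# Characters missing from the table are treated as not decomposable
# (女 is left as a basic element here purely to keep the example small).
DECOMPOSITION = {
    "好": ["女", "子"],
    "子": ["了", "一"],
}

def levels(ch):
    """Number of decomposition levels below ch (assumed meaning of column 1)."""
    parts = DECOMPOSITION.get(ch)
    if not parts:
        return 0
    return 1 + max(levels(p) for p in parts)

def usage_counts(decomposition):
    """How many characters each element directly helps to form (column 2)."""
    counts = Counter()
    for parts in decomposition.values():
        # set() so a component repeated inside one character counts once
        counts.update(set(parts))
    return counts

characters = ["好", "女", "子", "了", "一"]
uses = usage_counts(DECOMPOSITION)
for ch in characters:
    print(ch, levels(ch), uses[ch])
```

With this toy data 好 gets a level count of 2, matching the example above, and 好 itself gets a usage count of 0 because nothing in the table is built from it.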

Count         Characters that break      Characters used in this
              down this many times       many other characters
0                      11                        2219
1                     533                         218
2                     845                          94
3                     951                          69
4                     432                          60
5                      66                          38
6                       4                          23
7                       0                          15
8                       0                          13
9                       0                          15
10 or more              0                          78
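
As a quick sanity check on the figures above, the snippet below just re-adds the two columns and works out a couple of shares. The numbers are copied straight from the table; the variable names are my own, and the last entry in each list is the "10 or more" row.

```python
# Column 1: how many characters break down 0, 1, 2, ... times.
decomposed = [11, 533, 845, 951, 432, 66, 4, 0, 0, 0, 0]
# Column 2: how many characters are used in 0, 1, 2, ... other characters.
used_in = [2219, 218, 94, 69, 60, 38, 23, 15, 13, 15, 78]

total = sum(decomposed)
# Both columns describe the same set of characters, so they must agree.
assert total == sum(used_in)

never_used_share = used_in[0] / total
two_or_three_share = (decomposed[2] + decomposed[3]) / total

print("characters counted:", total)
print("never used as a component: {:.0%}".format(never_used_share))
print("break down 2 or 3 times: {:.0%}".format(two_or_three_share))
```

Both columns add up to the same total, which is reassuring, since every character should land in exactly one row of each column.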

While these statistics are only for a limited subset of all Chinese characters, I am confident you would see a similar pattern with any other reasonably sized set of Chinese characters.  Also, another person might decompose characters differently than I have.  However, it is only the last level of decomposition into "simplest" elements that is more of an art than a science.  Most decompositions are from one commonly used character into a couple of other commonly used characters, like my example with 好.

There seem to be two forces operating.  One is combining a relatively small number of fixed elements to make a large number of characters.  The other is a limit on the acceptable complexity of a character.  The typical character has gone through two or three levels of compounding, and is a leaf node in the formation process.

Well, I don't actually know how characters were formed.  This is just my hypothesis from analyzing their appearance.  Perhaps it is a useful observation.

