Some statistics about the structure of Chinese characters (汉字)
I now have some statistics to illustrate a few facts about written Chinese. For the past year or so, I have been working on a Web site that contains structural data about characters (see http://huamake.com/huamakefaq.htm if you are curious). I generated these statistics from the characters in the old HSK vocabulary list, which is a larger set than the new HSK list and a well-defined sub-universe of Chinese characters.
The first column in the table below is the number of times a character can be broken down into simpler component characters or radicals. For example, 好 -> 女 + 子, and 子 can in turn be broken down into 了 and 一, giving 好 a count of 2. The 11 characters that can't be broken down are characters like 一 and 乙, which are already pretty simple.
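To make the first column concrete, here is a minimal sketch of how such a count could be computed. The toy table `TOY_DECOMPOSITIONS` and the function name `decomposition_levels` are made up for illustration and are not the Web site's actual data or code. It also assumes that the count follows the deepest chain of components, and it treats any character missing from the toy table as not decomposable.

```python
from functools import lru_cache

# Hypothetical toy data: each character maps to its direct components.
# Characters absent from this table are treated as not decomposable.
TOY_DECOMPOSITIONS = {
    "好": ("女", "子"),
    "子": ("了", "一"),
}


@lru_cache(maxsize=None)
def decomposition_levels(char: str) -> int:
    """Return how many times `char` can be broken down (0 if it cannot)."""
    parts = TOY_DECOMPOSITIONS.get(char)
    if not parts:
        return 0
    # Assumption: the count follows the deepest chain of components.
    return 1 + max(decomposition_levels(part) for part in parts)


print(decomposition_levels("好"))  # 2, matching the example in the text
print(decomposition_levels("一"))  # 0, cannot be broken down
```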
The second column is the number of times that a character participates in the formation of a more complex character. In my example above, 女 and 子 would each get a count for their participation in 好. Interestingly, while some characters are active joiners, about 78% (2219 of 2842) are stay-at-homes that don't participate in character formation at all, at least not within the sub-universe of characters I chose to work with, and not until someone needs to invent a new character.
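The second column can be sketched the same way. The toy table and the function name `participation_counts` are again hypothetical. The sketch assumes that only direct components are credited (了 gets no count for 好), and that a component appearing twice in one character is counted once.

```python
from collections import Counter

# Hypothetical toy data: each character maps to its direct components.
TOY_DECOMPOSITIONS = {
    "好": ("女", "子"),
    "妈": ("女", "马"),
    "子": ("了", "一"),
}


def participation_counts(decompositions):
    """Count, for each component, how many characters it helps to form."""
    counts = Counter()
    for parts in decompositions.values():
        # Assumption: a component repeated inside one character counts once.
        for part in set(parts):
            counts[part] += 1
    return counts


counts = participation_counts(TOY_DECOMPOSITIONS)
print(counts["女"])  # 2: it takes part in 好 and 妈
print(counts["好"])  # 0: a stay-at-home in this toy set
```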
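Count         Characters that can be broken    Characters used that many times
              down that many times             in forming other characters
0             11                               2219
1             533                              218
2             845                              94
3             951                              69
4             432                              60
5             66                               38
6             4                                23
7             0                                15
8             0                                13
9             0                                15
10 or more    0                                78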
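The figures in the table can be checked with a few lines of arithmetic. This is only a sketch: the variable names are made up, and the key 10 stands in for the "10 or more" row, which does not affect the mean of the first column because that column is 0 there.

```python
# Rows of the table: count -> (characters with that many decomposition
# levels, characters used that many times in forming other characters).
# The key 10 stands in for the "10 or more" row.
TABLE = {
    0: (11, 2219), 1: (533, 218), 2: (845, 94), 3: (951, 69),
    4: (432, 60), 5: (66, 38), 6: (4, 23), 7: (0, 15),
    8: (0, 13), 9: (0, 15), 10: (0, 78),
}

total_by_levels = sum(levels for levels, _ in TABLE.values())
total_by_usage = sum(usage for _, usage in TABLE.values())

# Share of characters that never take part in forming another character.
never_used_share = TABLE[0][1] / total_by_usage

# Average number of decomposition levels per character.
mean_levels = sum(k * levels for k, (levels, _) in TABLE.items()) / total_by_levels

print(total_by_levels, total_by_usage)  # 2842 2842
print(round(never_used_share, 2))       # 0.78
print(round(mean_levels, 2))            # 2.52
```

Both columns add up to the same 2842 characters, which is a useful sanity check on the table.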
While these statistics are only for a limited subset of all Chinese characters, I am confident you would see a similar pattern with any other reasonably sized set of Chinese characters. Another person might also decompose characters differently than I have. However, it is only the last level of decomposition, into the "simplest" elements, that is more of an art than a science. Most decompositions are from one commonly used character into a couple of other commonly used characters, like my example with 好.
There seem to be two forces operating. One is combining a relatively small number of fixed elements to make a large number of characters. The other is a limit on the acceptable complexity of a character. The typical character has gone through two or three levels of compounding, and is a leaf node in the formation process.
Well, I don't actually know how characters were formed. This is just my hypothesis from analyzing their appearance. Perhaps it is a useful observation.