Progress on a Web Page for exploring 汉字

mark

December 23, 2009 at 08:47 AM posted in General Discussion

A month or so back, I posted about making a Web page for decomposing Chinese characters. I have since entered the information on 1067 characters. This is basically the first section of the book listed below. It is by no means comprehensive, but it is enough to give users a good feel for how the page will work and to solicit feedback. In this post I will also discuss my intent in putting this page together, some difficulties I encountered, and my plans for it. The page is http://huamake.com/web2_0.htm .

My intentions for the Web page are that it could serve as a study aid, work as a character dictionary for people like myself who were never formally educated on how to use traditional character dictionaries, and that it could be fun to play with. I will say more about all of these goals later on in this post. Many materials already exist for aiding students of written Chinese, but the ones I know of all follow a tradition that evolved long before the Web existed. The basic method is to pick a set of radicals to use as classifiers for characters and associate each character with its appropriate classifying radical. For example, 饣 appears on the left hand side of several characters, and these characters could be regarded as belonging to a group for classification and search purposes. When dealing with pen and paper reference materials, this makes perfect sense. However, the right hand portion of these characters often has visual elements that are shared with other characters that are not in the same classification bucket. For example, 饭, 反 and 返 all share a radical. It would be interesting to explore these relationships as well, and the basic hyperlinking mechanism of the Web was designed to express exactly this kind of non-linear association.

Ok, so, the basic idea is to take each character, break it down into all its component visual elements (radicals) and link each component to all of the characters that make use of that visual element. Then, one can happily explore the relationships, jumping from link to link according to one's fancy. It turns out that it is not actually that easy. The main problems are user interface design, identifying which portions of a given character are significant, deciding whether they are the same as or different from similar looking bits and pieces of other characters, and the fact that there are a lot of characters, and therefore a lot of data entry to do. I have an approach to all of these, which is doubtless not perfect, but the result is now there for anyone who wants to take a look.
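For anyone curious how the "link each component to all of the characters that use it" idea might look as data, here is a minimal Python sketch. It is not the code behind the page; the `DECOMPOSITION` table and the `build_component_index` helper are made up for illustration, and the decompositions are just a tiny hand-picked sample.

```python
from collections import defaultdict

# Hypothetical sample data: each character mapped to the components it contains.
DECOMPOSITION = {
    "饭": ["饣", "反"],
    "返": ["辶", "反"],
    "饿": ["饣", "我"],
    "反": [],
}


def build_component_index(decomposition):
    """Map each component to the sorted list of characters that contain it."""
    index = defaultdict(set)
    for char, components in decomposition.items():
        for component in components:
            index[component].add(char)
    return {component: sorted(chars) for component, chars in index.items()}


COMPONENT_INDEX = build_component_index(DECOMPOSITION)
```

With an index like this, looking up a component such as 反 gives every character to hyperlink to, regardless of which radical a paper dictionary would file it under.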

If the page works, I think it could be used in the following ways.

I think it could be an aid to studying characters. For example, I know there are several characters that contain 旦 and are pronounced dan4, but keeping straight what to place on the left hand side to differentiate these characters is a bit hard for my poor memory. This page would give me a handy way to refresh my memory on this and other similar questions.

I think the page could be used as a character dictionary. When I find a character in print that I don't recognize, I could use any character that contains a similar visual element to narrow my search. (I often have trouble because I assume a character is classified under the wrong radical. For example, several bits and pieces could be used to classify 警. It sure would be nice if any path I took would get me there, but with a traditional character dictionary there is only one correct path.)

I think one might be able to play some interesting games with the page, like a variant of "6 Degrees of Separation": given two characters, find a path of visual elements that leads from one to the other. For example, to get from 我 to 同, I might go 我 > 戈 > 咸 > mouth with a line over it > 同. Maybe you can think of a shorter path, or we could have a race. That would be the game.
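The "find a path" game is essentially a shortest-path search over a graph whose nodes are characters and components. Below is a rough Python sketch using breadth-first search. The data is a simplified, hypothetical sample (it uses plain 口 where the example above uses "mouth with a line over it"), and `build_links` and `shortest_path` are illustrative names, not anything from the site.

```python
from collections import deque

# Hypothetical decomposition data, for illustration only.
DECOMPOSITION = {
    "我": ["戈"],
    "咸": ["戈", "口"],
    "同": ["冂", "口"],
    "回": ["囗", "口"],
}


def build_links(decomposition):
    """Build an undirected graph linking each character to its components."""
    links = {}
    for char, components in decomposition.items():
        for component in components:
            links.setdefault(char, set()).add(component)
            links.setdefault(component, set()).add(char)
    return links


def shortest_path(links, start, goal):
    """Breadth-first search; returns a shortest chain of linked nodes, or None."""
    if start == goal:
        return [start]
    previous = {start: None}  # doubles as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in sorted(links.get(node, ())):
            if neighbour in previous:
                continue
            previous[neighbour] = node
            if neighbour == goal:
                # Walk back through the predecessors to rebuild the path.
                path = [neighbour]
                while previous[path[-1]] is not None:
                    path.append(previous[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


LINKS = build_links(DECOMPOSITION)
PATH = shortest_path(LINKS, "我", "同")
```

Because breadth-first search explores all one-hop neighbours before any two-hop ones, the first path it finds is never longer than any other, which is what settles the "can you think of a shorter path" question.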

So far, my sources of character lists have been:

Reading and Writing Chinese Simplified Character Edition, Third Edition
William McNaughton -- basically the HSK A, B and C lists.

CC-CEDICT from MDBG (definitions and pronunciation information)

I tried following a suggestion from Andrew to nicely ask MDBG for their data on radicals, but they did not reply to my request. Perhaps they think I am a ridiculous person, like the ones who ask them for advice on tattoos, or perhaps they are simply too busy to respond. In any case, I didn't feel comfortable screen scraping information without permission. So, plan B was to make my data entry method as efficient as possible. Explaining the details would be too much of a digression.

As I mentioned, I have covered the first 1067 characters from "Reading and Writing Chinese". I will be adding the 1200+ from that book over time, then I will go looking for the HSK D list. If and when I complete entry of the D list, I will consider the effort more or less complete.

Once the data exists, it can be presented and used in different ways. I am only putting one of many possible faces on it.

mark

September 21, 2010 at 06:21 PM

I reached a milestone apropos of the base note for this post. I have now entered structural data for all of the characters used in the old and new HSK into my site. Take a look at http://huamake.com/huamakefaq.htm for some suggestions on how to use the data.

After all of that data entry, I am left with the distinct impression that the vast majority of Chinese characters are formed by buddying-up two previously existing characters. I think this process doesn't usually go for more than three or four levels deep, and there aren't that many atomic characters. So, most of the complexity is achieved through combinatorics. [ {atomic characters} X { atomic characters } is a lot of possible combinations, and if you repeat the process of combining existing characters a couple times, the result is a very large number of possible combinations. ] If I think of a good way to illustrate this thought visually or statistically, I will share it. I think this idea helps me to recognize and write characters.
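To put very rough numbers on the bracketed argument, here is a back-of-the-envelope Python sketch. The pool size of 200 atomic characters is a made-up figure, not a count from the site's data, and `combinations_after` is a hypothetical helper. It only computes an upper bound on what pairing could produce; real characters use a tiny fraction of these combinations.

```python
def combinations_after(atomic_count, levels):
    """Upper bound on distinct forms after repeatedly pairing the whole pool."""
    pool = atomic_count
    for _ in range(levels):
        # Keep every existing form and add every ordered pair of existing forms.
        pool = pool + pool * pool
    return pool


ONE_LEVEL = combinations_after(200, 1)   # 200 + 200 * 200
TWO_LEVELS = combinations_after(200, 2)  # pair up the results once more
```

Under that assumption, one round of pairing already allows 40,200 forms, and a second round allows over 1.6 billion, so three or four levels is far more room than any writing system needs.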

Admittedly, the quality of my structural data could stand some improvement. I suspect it is one of those 90:10 rule things; another 90% effort will yield the remaining 10% improvement.

However, one of the advantages of the Web over printed material that also tries to expose patterns within characters is that the Web is not limited by the number of pages that can be used. The most comprehensive books I have found get a bit north of 1000 characters and run into publishing limitations. My Web site now has structural information on 3000 characters, give or take, but it all still fits just fine in your browser.

One of the limitations of the Web compared with printed matter is that there is naturally less guidance on what structures to look for. You are more on your own for exploring that.

In the course of putting together my Web site, I discovered, often from other Cpod users, other sites that contain similar information, but I still persisted. Whether my perspective and presentation are useful to anyone other than myself is still somewhat of an open question, but I hope it is a positive addition to the resources available to Chinese learners.

baomingguang

February 12, 2010 at 12:02 AM

I just took HSK myself. What kinds of resources would you like to see more of?

包

mark

February 24, 2010 at 07:27 AM

Thanks for your suggestions 包. I implemented them.

baomingguang

February 23, 2010 at 07:49 AM

Also, for the vocabulary pages, you can make the top frame link back to your site if you add an "inturl" field at the end of the url (example):

&inturl=http%3a%2f%2fhuamake.com%2fweb2_0.htm%3ftheChar%3d%25u89C8

Note the URL is URLencoded to make it safe to pass through to the cgi script.

You could also add a custom title such as:

&title=Character%20Structures

which would make the top line of the frame say:

<< Back to Character Structures

If you use the "inturl" pair to include the url of the page the link is on as I showed above, clicking the link would take your site visitor back to the page on your site that the visitor just left.

You'll notice I did this on my site. Unfortunately since Dict.cn uses GB instead of Unicode it messes up the Chinese characters in the title. I suppose I should try to fix that somehow.
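For reference, the "inturl" and "title" fields described above could be assembled along these lines in Python. This is only a sketch: `add_back_link` is a made-up helper and the `example.com` dictionary URL is a placeholder, not a real endpoint. Only the field names and the encoded values come from the post.

```python
from urllib.parse import quote


def add_back_link(dictionary_url, return_url, title):
    """Append URL-encoded "inturl" and "title" fields to a dictionary link."""
    separator = "&" if "?" in dictionary_url else "?"
    return (
        f"{dictionary_url}{separator}"
        f"inturl={quote(return_url, safe='')}"
        f"&title={quote(title, safe='')}"
    )


# Placeholder dictionary URL; the return URL is the page the visitor came from.
LINK = add_back_link(
    "http://example.com/lookup.cgi?word=x",
    "http://huamake.com/web2_0.htm?theChar=%u89C8",
    "Character Structures",
)
```

Python's `quote` writes the hex digits in upper case (`%3A` rather than `%3a`); the two spellings are equivalent in a URL.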

baomingguang

February 23, 2010 at 07:28 AM

Mark,

I like the features you've added. Grouping words that contain the same character does seem like a good way to learn those words. I use that same method myself.

Feel free to link to my sites. Thanks for pointing out the problem of simplified characters not working. I thought I had that problem fixed. To fix the problem for now, you can use this style of link instead:

http://chinese-characters.org/cgi-bin/lookup.cgi?characterInput=字

where 字 is the character in unicode. This should work for both simplified and traditional characters now. Make sure the "I" in characterInput is capitalized.

Yes, Dict.cn uses GB which does make things a little tricky sometimes.

I'm glad my efforts have been useful for you.

mark

February 20, 2010 at 07:41 PM

Hi 包,

You inspired me to do a couple of things. When my page displays a character, it now lists the HSK vocabulary that that character occurs in. Each word is linked to one of the sites you used to generate usage examples. I also added an "etymology" link to "chinese-characters.org". The latter seems to work only when the code for the simplified character is also a code for a traditional character, though. I hope you don't mind the links to your sites.

One of the trickier bits was figuring out that some of the reference sites you use rely on GB2312 rather than Unicode, and how to deal with that.
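To illustrate the GB2312 versus Unicode wrinkle: the same character percent-encodes to different bytes depending on which encoding the target site expects. A small Python sketch follows; `encode_query_term` is a hypothetical helper, and which sites want which encoding is something you would have to check per site.

```python
from urllib.parse import quote


def encode_query_term(text, encoding):
    """Percent-encode text for a URL using the byte encoding the site expects."""
    return quote(text, safe="", encoding=encoding)


UTF8_FORM = encode_query_term("字", "utf-8")   # three bytes per character
GB_FORM = encode_query_term("字", "gb2312")    # two bytes per character
```

A link built with the wrong one of these two forms will usually land on the wrong entry or show garbled text, which matches the symptom with the Chinese titles mentioned earlier in the thread.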

baomingguang

February 15, 2010 at 07:12 PM

Mark,

For starters try learn-chinese-words.com/hsk

I just threw together a list of HSK vocabulary with links to some of my favorite online dictionaries with usage examples. Please let me know what you think.

The best source of sample HSK questions I know is at popupchinese.com/hsk/test (although lately I haven't been able to access popupchinese).

包

mark

February 12, 2010 at 09:06 AM

Well, for starters, a vocabulary list linked to usage examples would be nice.

More recordings and questions to use for practicing the listening section would be nice.

Sample questions that were linked to the appropriate section of a grammar reference, to explain the correct answer, would be nice.

Maybe, all these things exist but I didn't find them.

Meanwhile, I had trouble finding a list of the characters used, but I think I have solved that one.

mark

February 10, 2010 at 05:36 AM

Hi 包，

I almost missed your posts. Thanks for the references. I will definitely check them out. I am curious what suggestions you or others may have as to what kinds of materials for studying Chinese characters are not yet available on the Web. My current thought is to focus on preparation for the HSK, because I am planning to give it another go, and the on-line materials that I found are less helpful than I think they could be, but I am open to suggestion.

Mark

baomingguang

February 08, 2010 at 11:21 PM

While I'm at it, the YellowBridge dictionary, while containing a rather mediocre number of words, makes a really creative use of the CHISE data. Here's an example:

http://www.yellowbridge.com/chinese/character-etymology.php?searchChinese=1&zi=%E5%9C%8B#

baomingguang

February 08, 2010 at 11:03 PM

I don't think the posting script liked the last character I posted, because it cut off the rest of my post.

You can click some of the components to find other characters containing that component. Or you can click the "contained in" tab to see lots of characters containing 亥. The table is sorted by sound but also contains frequency information.

Of course, looking at the early forms of the character you can see it contained very different components from the modern version. So the "apparent" components are useful for categorization and memorization only, not for understanding the character's origins. But I'll leave that for later...

Here's where I got the information for the components. You can decompose about 100,000 characters using the data found here:

http://kanji.zinbun.kyoto-u.ac.jp/projects/chise/dist/base/chise-base-0.24.tar.gz

Unpack it with a GNU zip (gzip) and tar utility, and the files you need are in the "ids" directory. In fact, "IDS-UCS-Basic.txt" may be the only file you'll need. If you're a seasoned programmer, you'll like the scripts they've made to interact with the data.
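If it helps, here is a hedged Python sketch of reading such a file. It assumes each data line has three tab-separated fields (code point, character, description sequence), that comment lines start with ";", and that unencoded components may appear as "&...;" entity references. Those layout details are assumptions to verify against the actual file, and `parse_ids_line` and `load_ids` are illustrative names.

```python
import re

# Ideographic Description Characters (layout operators such as ⿰ and ⿱).
IDC_FIRST, IDC_LAST = "\u2ff0", "\u2ffb"
# A token is either an "&...;" entity reference or a single character.
TOKEN = re.compile(r"&[^;]+;|.", re.DOTALL)


def parse_ids_line(line):
    """Return (character, [components]), or None for comment or malformed lines."""
    line = line.rstrip("\n")
    if not line or line.startswith(";"):
        return None
    fields = line.split("\t")
    if len(fields) < 3:
        return None
    char, ids = fields[1], fields[2]
    components = [
        token
        for token in TOKEN.findall(ids)
        if not (len(token) == 1 and IDC_FIRST <= token <= IDC_LAST)
    ]
    # Assumed convention: an atomic character lists itself as its description.
    if components == [char]:
        components = []
    return char, components


def load_ids(path):
    """Read a whole IDS file into a {character: [components]} dictionary."""
    with open(path, encoding="utf-8") as handle:
        return dict(filter(None, map(parse_ids_line, handle)))


# Illustrative line written by hand, not copied from the file.
SAMPLE = parse_ids_line("U+597D\t好\t⿰女子\n")
```

The result has the same shape as a hand-entered decomposition table, so it could feed a component index directly.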

I hope this helps. I'm looking forward to seeing what new ideas you come up with.
包

baomingguang

February 08, 2010 at 10:38 PM

Mark,

I like your website, especially the ABCDLists page above. It's a fresh approach and something I'd like to see more of on the 'net.

Before you re-invent the wheel, I want to share what I've discovered while trying to do something similar.

I've made the web site chinese-characters.org which contains some functionality like yours. I'll use your 亥 above as an example. Go to the page for this character on my site, and you'll notice the following characters in the "Apparent" box: ⿳亠

mark

January 16, 2010 at 07:39 PM

I now have the characters that are used in the HSK up on a separate page, grouped by list (A - Basic, B - Elementary, C - Intermediate, D - Advanced). The white-backed ones are the ones I haven't gotten around to entering structural data for yet.

http://huamake.com/ABCDLists.htm

Anyway, I figure having the HSK lists in electronic form might be useful to someone. They weren't that easy to get. I ended up writing some code to do it.

There is a work in progress, at the bottom of this new page. Specifically, a listing of how many times each character is used as a radical. It appears to be a Zipf-like distribution; a few characters get a lot of use, most get very little to none. Somewhere around the knee of that curve might provide useful groupings for memorization purposes.
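The usage count is simple to reproduce from any decomposition table. Here is a minimal Python sketch with a small, hypothetical sample; `component_usage` is an illustrative name, and it counts a component once per character even when it appears twice (as 木 does in 林), which is one possible choice rather than necessarily the one the page makes.

```python
from collections import Counter

# Hypothetical sample data, for illustration only.
DECOMPOSITION = {
    "饭": ["饣", "反"],
    "返": ["辶", "反"],
    "板": ["木", "反"],
    "饿": ["饣", "我"],
    "林": ["木", "木"],
}


def component_usage(decomposition):
    """Count how many characters use each component, most used first."""
    counts = Counter()
    for components in decomposition.values():
        counts.update(set(components))  # one count per character, not per stroke group
    return counts.most_common()


USAGE = component_usage(DECOMPOSITION)
```

Sorting by count like this is what exposes the long tail: a handful of components at the top and many that are used once or not at all.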

PS If you are using an older version of IE, the page with the structural analysis of characters does something that sometimes crashes the browser, but I haven't debugged it yet. Firefox and IE 8 both seem to work correctly.

@Daniel, zhongwen.com has turned out to be a great resource when I get stuck on how to decompose a character, but it doesn't let me explore in quite the way I want to, or I'm just stubborn and still doing my own thing.

markfilan

December 27, 2009 at 05:38 AM

..................

chenjiapei

December 26, 2009 at 01:20 PM

How should I put it... I feel it's not very likely.

daniel70

December 26, 2009 at 06:36 AM

Hi Mark, You might try "drilling down" into each character. It is extensively hyper-linked. Also, you might check out the books by Heisig (Remembering the Hanzi) and Matthews (Learning Chinese Characters). They also do a character analysis that might be helpful to you. You can download a sample chapter from Heisig's book online.

mark

December 26, 2009 at 05:38 AM

@daniel70, I hadn't looked at www.zhongwen.com before. They have lots of interesting information and a nice layout. So, it is a good find.

However, their description of their character maps, "Since every character other than the basic pictographs and ideographs is composed of more than one component character, a choice must be made where to put the character's primary listing.", indicates they use the same kind of linear, one-character-in-one-bucket organization that is no longer necessary in a hyper-linked environment. So, although I am ashamed to say my wheel is uglier in many respects, as far as I know I am still not re-inventing something.

daniel70

December 26, 2009 at 12:21 AM

Hi Mark,

Have you looked at www.zhongwen.com?

mark

December 25, 2009 at 06:51 PM

As sort of an homage to Henning's character points, what do an earthly branch, a child, coughing, carving and an imperative adverb have in common? (The radical 亥.)

http://huamake.com/web2_0tst.htm?chardef=on&charpro=on&theChar=%25u4EA5
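A side note on the `theChar=%25u4EA5` part of that link: it looks like a `%uXXXX` escape (the form JavaScript's old `escape()` function produces) that has then been percent-encoded once more, so `%25` stands for a literal `%`. That reading is a guess from the URL's shape, not something confirmed about how the page works. Under that assumption, a Python sketch for converting in both directions (BMP characters only, with made-up helper names):

```python
import re
from urllib.parse import quote, unquote


def decode_char_param(value):
    """Turn a value like "%25u4EA5" back into the character it stands for."""
    once = unquote(value)  # "%25u4EA5" -> "%u4EA5"
    return re.sub(
        r"%u([0-9A-Fa-f]{4})",
        lambda match: chr(int(match.group(1), 16)),
        once,
    )


def encode_char_param(char):
    """Build the doubly escaped form for a single BMP character."""
    return quote(f"%u{ord(char):04X}", safe="")


DECODED = decode_char_param("%25u4EA5")
ENCODED = encode_char_param("亥")
```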