How many characters?

goulnik
March 23, 2008 at 10:08 AM posted in General Discussion

I just ran a character analysis script over a few sources, and I'm going to share the results in the next reply (to allow html formatting).

Before anyone starts shooting: I know the number of characters is a necessary condition for reading, but it is far from sufficient. I reckon you need to know at least 3 times more words than characters, but that's really just a guesstimate.

In any event, I often read that you need 2000 characters to start reading a newspaper. I always thought that was unrealistic, so here you go: an analysis performed on a number of lessons and news stories respectively:

 

 

goulnik
March 25, 2008 at 12:33 PM

Agreed about news vocab, or even narrower: essential vocab in news selected by goulniky over the past 6 months...

Anyway, I wouldn't do 3 or even 4 characters, if only for performance reasons. But you can run this over ChinesePod lessons too, and check for repeats...

henning
March 25, 2008 at 11:57 AM

This is basically an "essential news vocab" collection.

When you do that with 4 characters you should even get a list of "most frequent Chengyus, Suyus, and 2x2 combinations".

However, the longer the search string, the more topical and source-dependent the result.

goulnik
March 25, 2008 at 11:06 AM

Just expanded the parsing script to include character pairs. Ideally, if you wanted to run frequency analysis over character compounds you'd need a lexicon / dictionary database. But I figured that, as an approximation, I would run the analysis over pairs of characters, since most words are 双音词 (shuāngyīncí, disyllabic words) anyway. So the algorithm uses brute force: in a string of characters such as ABCDE, it will treat AB / BC / CD / DE equally. While half of those pairs are likely to be meaningless, sometimes they aren't and can confuse the reader, but it all gets sorted out in the end when looking at relative frequencies - all you have to do is ignore the bottom half or 2/3 of the results.

You can download the html file and run it on your computer, but it uses JavaScript sparse arrays over a wide range of numerical values (basically up to 65,000 x 65,000), so it gets *very* slow. Here are the results of the analysis over 320 of my 88news (some unpublished).
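For anyone who wants to try the brute-force pair count without the browser, here is a minimal Python sketch of the same idea. It is not the JavaScript script from this post: the function names are made up for illustration, it keeps counts in a dictionary keyed by the pair itself instead of a 65,000 x 65,000 sparse array, and skipping pairs that contain punctuation or Latin letters is an assumption, since the post does not say how those are handled.

```python
from collections import Counter

def is_cjk(ch):
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def count_pairs(text):
    """Count every overlapping pair of adjacent Chinese characters."""
    pairs = Counter()
    for a, b in zip(text, text[1:]):
        # Assumption: a pair only counts if both halves are Chinese
        # characters, so punctuation and whitespace break the chain.
        if is_cjk(a) and is_cjk(b):
            pairs[a + b] += 1
    return pairs

sample = "中国人民银行，中国银行"
pairs = count_pairs(sample)
# Most frequent pairs first; the rare tail is where the meaningless ones end up.
print(pairs.most_common(3))
```

In this tiny sample 中国 and 银行 each occur twice, while accidental pairs such as 民银 occur once, which is the "ignore the bottom of the list" effect described above.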

tvan
March 24, 2008 at 01:16 PM

I seem to recall reading an article on literacy statistics comparing China, Taiwan, and Hong Kong. The gist of the article was that China considered 2,000 characters the essential number for "basic literacy" while Hong Kong and Taiwan had substantially more. (2,500 and 3,000?) The article's intent was to point out inconsistencies in international literacy rankings, but I think it supports the point being made here.

Personally, I find certain genres more challenging than others. Newspapers still give me trouble, but (short) paperback fiction is OK with a dictionary. I have problems with children's books for some reason. They seem to wander more.

goulnik
March 24, 2008 at 10:11 AM

I don't have the kind of home answering 'device' Henning has, but I'd agree with his estimate of 6000. For me that would be the stretch target; I'd realistically shoot for 4000. He's right that you may stumble across that pivotal character that gets you in trouble, but there are tricks for speed reading / guessing, including character morphology.

RJBerki, I'm afraid I don't have probes to dig into the brains of Chinese people, educated or otherwise, but I'll try and think of a surrogate marker!

RJ
March 24, 2008 at 09:25 AM

Henning, I am so jealous of all you guys that have the answers right there with them at home. :-)

-RJ

henning
March 24, 2008 at 08:43 AM

I once "tested" my wife with a flashcard program that tests for 4000 characters, ordered by frequency (per level blocks of 780 characters). In the last level she once or twice flunked, meaning: she got the tone of a character wrong or said "It is only used in names" (even when there was some rarely used meaning given also).

Besides, she knows almost all the characters that I show her from my character collection as long as the frequency rank is below 6000. I think 6000 is indeed a good *long-term* number.

According to that Java testing app I can identify about 2500, and believe me, it is not enough.

Auntie68: Your statements have to be calibrated - I know your modesty. I wish I could read this variety of sources in Mandarin. I also envy you for being able to speak more languages than I can list.

I haven't tried any NC-17 sources in Mandarin yet; that might be an interesting idea, however.

But isn't the correct Enterprise model number NCC-1701?

;)

wildyaks
March 24, 2008 at 08:43 AM

Character learning is just such hard work. I tell people that Chinese is the easiest language I have ever learnt, but it requires an extraordinary amount of diligence to memorize characters and to retain them.

Jia you, everybody!

RJ
March 24, 2008 at 08:16 AM

Or you could study until you can read most articles comfortably and not worry about how many characters you know. I'm sure the law of diminishing returns applies here, so 2000 or 3000 may provide 90% of the reading ability that 6000 does. Keep reading and you will eventually know those 6000, if that is the number. If you want to read speeches by Hu Jintao then keep going. I would be more interested in an accurate number for how many the avg educated Chinese person would know (how to write and how to read). There's your baseline and your target, assuming you want to be "as good as". Yves, how can we measure this?

-RJ

auntie68
March 24, 2008 at 07:54 AM

Henning, 6,000 characters? Wow. The only person I've come across who actually admitted to knowing 6,000 Chinese characters is... CPOD's "furyougaijin". And he is fluent in Russian and Japanese too! Doesn't look hopeful for me.

Having said that, I like to study many languages, even though it means that my vocabulary for any one language may be extremely limited, only covering my specific interests. For example, my favourite "Architectural Digest" is the German edition. After 5 or 6 years of reading the German AD, I don't really need a dictionary to read the stories. But the only other vocab I have in German is: Kurt Weill lyrics, 1930s cabaret stuff, bombastic Carl Orff "Greek" stuff, Moomin/Mumin stories, Winnie The Pooh (thank you, Harry Rowohlt!), and some NC-17 stuff. And maybe food. Outside these very limited areas, I couldn't hold a conversation in German for even two minutes... Sincerely.

henning
March 24, 2008 at 07:42 AM

Unfortunately, no simple statistical analysis can uncover the relevance of a character. If the one rare character in your source is the pivot for understanding the text (as it is so often), it carries a different weight than a character that only appears in names.

As characters are primarily semantic units, they are mostly topical. In another thread billm pointed out that even a first year schoolbook can be frustrating because it includes sentences like 刨刨土, 捉捉虫 (both 刨 and 捉 have probably appeared neither in a CPod lesson nor in an 88News item yet).

So I guess 2000 characters are indeed enough. If you only read texts addressing a single, well-defined subject.

Otherwise, 6000 is a good target.

goulnik
March 24, 2008 at 05:38 AM

The script: you can save it and run it locally.

goulnik
March 24, 2008 at 05:12 AM

bazza, I can share this little script, no problem. It's JavaScript, and it will just require that you grab all your comments into a series of files, load them up in one folder and have them listed in a separate file. You'll need a crawler for that, and a utility such as Filelist (freeware). Works with Firefox, not tested in any other browser.

tvan
March 24, 2008 at 04:13 AM

I read about that 2,000 Chinese character limit and learned 2,000 Chinese characters. There's no way that's enough! I believe Taiwan considers 3,000 to be adequate, and that's the number I'm shooting for, along with better grammar and more words, etc. etc.

auntie68
March 24, 2008 at 02:33 AM

Thanks once again, goulniky! I think that the way Chinese people "know" Chinese characters is worth taking into account.

I get the feeling that most native speakers of Chinese use, in their daily lives, rather a lot of written characters which (i) they may be able to read, but which they may not know how to pronounce or write (so they can't -- and don't -- use those characters in their daily speech); or which (ii) they use in their daily speech, appropriately and with the correct pronunciation, without knowing how to write the character. If they saw the character in print, they might not be able to recognize it without contextual clues.

RJ
March 24, 2008 at 01:40 AM

Thanks goulniky for the data! I think Mark makes a good point re distribution. Certainly at some level it is fine to expect to use a dictionary or to guess some characters by context. Even when reading English I often depend on a dictionary. I think your data supports the cliche that if you know 2000 characters there is a good chance you will be able to read the average newspaper article with reasonable comprehension. Reasonable comprehension being what? 80%? It was after all meant to be a general statement, not a guarantee that any article would be understood at the 100% level by knowing 2000 characters. I think the data is encouraging. If I know 2500 and most corresponding words I should be in good shape as long as I pack a dictionary. As an individual I would probably shy away from esoteric articles in which I have no interest (and therefore probably don't know the vocab) and be drawn to articles in which I do have background, interest and therefore vocab.

-RJ

laosimake
March 24, 2008 at 01:27 AM

@goulniky

I clearly agree with your analysis and conclusions. So we learners need to accept that it will take a fair amount of effort (likely well beyond acquiring 2-3K characters) to read generally without the aid of a dictionary. As an aside, in casual conversation with a few native Chinese friends, it was guesstimated that a college-educated Chinese person probably knows something like 10-15K characters.

So back to the flashcards! :-)

bazza
March 23, 2008 at 08:59 PM

goulniky, would it be possible to analyse a single user's posts? To determine how many unique characters they've used in all their comments?

goulnik
March 23, 2008 at 08:29 PM

@laosimake: if you run the analysis on newbie + elementary + intermediate the result is 2237 characters over 550 lessons (over a total of 57,000+ as appear in the html files, which incidentally include supplementary vocab), where 440 of those characters only appear once.

So, not surprisingly, the overlap is significant as one can reasonably assume that you start with a core set of characters.
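A quick way to put a number on that overlap, using only the counts from the table in the first reply (Newbie 1380, Elementary 1295, Newbie+Elementary 1860). This is plain inclusion-exclusion written as a Python sketch; it is not part of the original script and the function name is made up for illustration.

```python
def shared_count(size_a, size_b, size_union):
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|."""
    return size_a + size_b - size_union

# Counts taken from the ChinesePod table in the first reply.
newbie, elementary, combined = 1380, 1295, 1860

shared = shared_count(newbie, elementary, combined)
print(shared)                         # characters that appear in both levels
print(round(shared / elementary, 2))  # share of the Elementary set already seen in Newbie
```

On those figures, 815 characters are common to both levels, so roughly 63% of the Elementary characters have already turned up at Newbie level.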

One can run all kinds of analyses, and I also did what henning suggested, but I don't want to overdo it. One reason I did it in the first place was to disprove what I often hear said, that 2000 characters is what you need to know in order to read a newspaper. Somewhat of a cliche, which I felt was a fallacy (aside from the importance of words as character compounds).

The numbers don't quite disprove this when run on ChinesePod lessons, even advanced (2000-3000), but clearly do with XinhuaNet. The reason simply has to do with the number of articles and thus the range of topics covered. ChinesePod does a good job at covering all sorts of subjects, in fact uses a broad range of characters considering the number and average size of lessons.

But 100-200 is still a very limited number if you consider the possible choice and depth of topics likely to be covered in a newspaper (running this on 1000s of zh.wikipedia articles could be interesting...)

So in the end, it's true that any single short article will only use a few 100 characters, problem is they'll likely be very focused, and the next article will use yet another set.

The other message is for anyone looking at HSK testing: you obviously have to start looking at multiple sources to digest lots of new characters (1,033 are in band A, 2,018 in band B, 2,202 in band C, and 3,569 in band D).

azerdocmom
March 23, 2008 at 07:35 PM

Wow, goulnik! That is incredible! You do so much for this community, it's humbling!

henning
March 23, 2008 at 06:32 PM

Thanks for sharing that great analysis, goulniky.

Another way of approaching this might be relative frequencies for the different sources:

How many different characters appear on average in a block of 1000 characters in a certain source (= number of distinct characters / total number of characters * 1000)?

I think it might be interesting to see the distance between News (real life) and Advanced for this indicator.

For Newbie or Elementary, 88News would certainly be an unfair comparison; it would be better mirrored against the script of a soap opera with colloquial dialogue.
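The indicator proposed above (distinct characters / total characters * 1000) can be sketched in a few lines of Python. The function names are made up for illustration, and counting only characters in the main CJK block is an assumption about what should count as a "character". The worked figure uses the Xinhuanet totals quoted elsewhere in this thread (3805 distinct characters out of 983,598).

```python
def distinct_per_thousand(distinct, total):
    """Distinct characters per 1000 characters of running text."""
    if total == 0:
        return 0.0
    return distinct / total * 1000

def distinct_per_thousand_in_text(text):
    """Same indicator, computed directly from a piece of text."""
    # Assumption: only characters in the main CJK block are counted.
    chars = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    return distinct_per_thousand(len(set(chars)), len(chars))

# Xinhuanet, 2320 stories: 3805 distinct characters, 983,598 in total.
print(round(distinct_per_thousand(3805, 983598), 2))   # about 3.87

# Toy text: 2 distinct characters out of 4.
print(distinct_per_thousand_in_text("中国中国"))
```

One caveat on the design: this ratio tends to fall as the corpus grows, because new characters arrive more and more slowly, so it is probably only fair to compare sources on samples of similar total length.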

laosimake
March 23, 2008 at 06:16 PM

Goulniky,

Thank you for providing these data as an opportunity to reflect on the implications for language learning. I have a question and two comments.

First the question. Are the CPod numbers overlapping from one category to the next? E.g., does the intermediate count (1397) overlap the newbie/elementary count (1890), or is it strictly over and above? If overlapping, then what (total) vocabulary level would be achieved at each stage? E.g. might getting to the upper intermediate level be sufficient for reading basic news?

As for the first comment, the graphs seem to suggest that there is (within the lexical universe you selected) a common core vocabulary of maybe 1000-2000 characters and that above that new vocabulary is rather specialized. I.e., if one were to read across different types of material, one would expect to encounter new words even if their vocabulary numbered 5000.

Second comment relates to the most frequent characters in this analysis. It is interesting to compare your frequencies with another source, in this case, Yong Ho's "CE Frequency Dictionary" (based on character counts from elementary and secondary textbooks published in China during the period 1978-80). Your most frequent characters appear in Ho's list as: 的-1, 国-80, 中-62, 人-8, 大-19, 一-2, 不-6, 年-56, 新-198, 会-54, 是-4, 图-325. This seems to confirm the simplistic conclusion that while frequency lists are dependent on the source, there in fact may be a real "core" that is predominant.

bazza
March 23, 2008 at 05:18 PM

Can you analyse how many characters I've used? ;)

goulnik
March 23, 2008 at 04:40 PM

mark, here's the character distribution over the XinhuaNet set, number of occurrences on the y-axis (most frequent = character #1 occurs 18758 times, next one occurs 10562 times etc.):
The 12 most frequent characters are: 的 (18758), 国 (10562), 中 (8954), 人 (8588), 大 (7528), 一 (7470), 不 (6366), 年 (5922), 新 (5766), 会 (5666), 是 (5404), 图 (5326). The problem with the distribution argument is that some low-frequency characters can still play a key role in a sentence.
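The cut-off question can be made concrete as: how many of the most frequent characters do you need before they account for a given share of all character occurrences? Below is a minimal Python sketch of that calculation. It is not the script used for the graph, the function name and the toy sample are invented for illustration, and it only counts characters in the main CJK block.

```python
from collections import Counter

def chars_needed(text, target=0.9):
    """Smallest number of top-frequency characters whose occurrences
    make up at least `target` (a fraction) of all Chinese characters."""
    counts = Counter(ch for ch in text if "\u4e00" <= ch <= "\u9fff")
    total = sum(counts.values())
    covered = 0
    for rank, (_, n) in enumerate(counts.most_common(), start=1):
        covered += n
        if covered >= target * total:
            return rank
    return 0  # no Chinese characters in the text

# Toy sample: 的 occurs 8 times, 国 and 中 once each.
sample = "的的的的的的的的国中"
print(chars_needed(sample, 0.5))   # the single most frequent character is enough
print(chars_needed(sample, 1.0))   # full coverage needs all three
```

This measures coverage of occurrences only, so it says nothing about the caveat above: a rare character outside the cut-off can still be the one the sentence hinges on.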

mark
March 23, 2008 at 04:01 PM

Oh, one other thing that would be interesting to know: have you tried graphing the usage distribution by character?

That might indicate a cut-off point for how many one needs to know to read most of most material. I bet a lot of the unusual ones are names.

mark
March 23, 2008 at 03:48 PM

goulniky, your graph tells me that there are 2000 characters that one would encounter in any small set of articles, but you'd need to know around 4000 to read without resorting to a dictionary.

Another way to look at it would be how many characters are representable in Unicode? That has to be a finite and well defined set, and would be the natural limit to the characters any on-line material could use. Or 《简化字总表》 would probably contain the limit, if you only want to consider simplified characters.

Yes, this is very interesting data. Thanks for all the work.

goulnik
March 23, 2008 at 03:25 PM

from that graph, it also seems to me 2500 is really the turning point...

wildyaks
March 23, 2008 at 03:04 PM

Interesting!

goulnik
March 23, 2008 at 10:16 AM

In the Xinhuanet graph above, the number of stories is on the x-axis (crawled today, 0-2500) and the cumulative number of unique characters is on the y-axis (0-4000). This shows there's no end to it: I thought it would flatten out, but it seems this would take a lot more articles. Note that over 2320 stories, I found 3805 characters of which 3397 appear more than once, for a total of 983,598 characters.
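The curve described here (cumulative unique characters against number of stories) is straightforward to reproduce on any set of texts. The following is a Python sketch under assumptions, not the original JavaScript: the function name and the three toy "stories" are invented, and only characters in the main CJK block are counted.

```python
def cumulative_unique(stories):
    """For each story in order, the number of distinct Chinese
    characters seen so far (the y-values of the curve)."""
    seen = set()
    curve = []
    for story in stories:
        seen.update(ch for ch in story if "\u4e00" <= ch <= "\u9fff")
        curve.append(len(seen))
    return curve

# Three toy "stories"; each one adds fewer new characters than its length.
stories = ["中国新闻", "中国人", "新年"]
curve = cumulative_unique(stories)
print(curve)   # [4, 5, 6]
```

Plotting `curve` against the story index gives the same kind of graph; whether and where it flattens depends on how many stories and how many topics go in.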

goulnik
March 23, 2008 at 10:11 AM

ChinesePod           lessons  characters
Newbie                   250        1380
Elementary               160        1295
Newbie+Elementary        410        1860
Intermediate             130        1397
Upper Intermediate        85        1909
Advanced                  98        2382

News                 stories  characters
88news published         205        2372
88news others            320        2698
BBC news                  70        2642
Xinhuanet                320        2694
Xinhuanet               2320        3805

[Graph: Xinhuanet, cumulative number of unique characters (y-axis) against number of stories (x-axis)]