How many characters?

goulnik
March 23, 2008, 10:08 AM posted in General Discussion

I just ran a character analysis script over a few sources, and I'm going to share the results in the next reply (to allow html formatting).

Before anyone starts shooting: I know that knowing enough characters is a necessary condition for reading, but it is far from sufficient. I reckon you need to know at least 3 times more words than characters, but that's really just a guesstimate.

In any event, I often read that you need 2000 characters to start reading a newspaper. I always thought that was unrealistic, so here you go: an analysis performed on a number of lessons and news stories, respectively:

 

 

goulnik
March 23, 2008, 10:11 AM

ChinesePod            lessons   unique characters
Newbie                    250                1380
Elementary                160                1295
Newbie+Elementary         410                1860
Intermediate              130                1397
Upper Intermediate         85                1909
Advanced                   98                2382

News                  stories   unique characters
88news published          205                2372
88news others             320                2698
BBC news                   70                2642
Xinhuanet                 320                2694
Xinhuanet                2320                3805

[Graph: Xinhuanet, cumulative unique characters by number of stories]

henning
March 24, 2008, 07:42 AM

Unfortunately, no simple statistical analysis can uncover the relevance of a character. If the one rare character in your source is the pivot for understanding the text (as it so often is), it carries a different weight than a character that only appears in names. As characters are primarily semantic units, they are mostly topical. In another thread billm pointed out that even a first-year schoolbook can be frustrating because it includes sentences like 刨刨土, 捉捉虫 (both 刨 and 捉 have probably not appeared in a CPod lesson or an 88News item yet). So I guess 2000 characters are indeed enough, if you only read texts addressing a single, well-defined subject. Otherwise, 6000 is a good target.

wildyaks
March 23, 2008, 03:04 PM

Interesting!

goulnik
March 23, 2008, 03:25 PM

From that graph, it also seems to me that 2500 is really the turning point...

mark
March 23, 2008, 03:48 PM

goulniky, your graph tells me that there are 2000 characters that one would encounter in any small set of articles, but you'd need to know around 4000 to read without resorting to a dictionary. Another way to look at it would be how many characters are representable in Unicode? That has to be a finite and well defined set, and would be the natural limit to the characters any on-line material could use. Or 《简化字总表》 would probably contain the limit, if you only want to consider simplified characters. Yes, this is very interesting data. Thanks for all the work.

mark
March 23, 2008, 04:01 PM

Oh, one other thing that would be interesting to know, have you tried graphing the usage distribution by character? That might indicate a cut-off point for how many one needs to know to read most of most material. I bet a lot of the unusual ones are names.
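
A minimal Python sketch of the cut-off idea raised in this post, under stated assumptions: `is_han` and `characters_needed` are illustrative names, not part of the script used for the analysis in this thread, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is treated as a character. It ranks characters by frequency and reports how many of the most frequent ones account for a given share of the running text.

```python
from collections import Counter


def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts;
    # extension blocks and punctuation are ignored.
    return "\u4e00" <= ch <= "\u9fff"


def characters_needed(text, coverage):
    """Return how many of the most frequent characters are needed
    to cover the given share (0.0 to 1.0) of the running text."""
    counts = Counter(ch for ch in text if is_han(ch))
    total = sum(counts.values())
    covered = 0
    for needed, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= coverage * total:
            return needed
    return 0  # the text contained no Chinese characters


if __name__ == "__main__":
    sample = "的的的的的的的的国人"
    for share in (0.5, 0.85, 1.0):
        print(share, characters_needed(sample, share))
```

Running this for shares such as 0.90, 0.95 and 0.99 over a real corpus would show where extra coverage starts to cost many more characters. Coverage of running text is not the same as comprehension, since a rare character can still be the key to a sentence.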

goulnik
March 23, 2008, 04:40 PM

mark, here's the character distribution over the XinhuaNet set, with the number of occurrences on the y-axis (most frequent = character #1 occurs 18758 times, the next one occurs 10562 times, etc.):
The 12 most frequent characters are: 的 (18758), 国 (10562), 中 (8954), 人 (8588), 大 (7528), 一 (7470), 不 (6366), 年 (5922), 新 (5766), 会 (5666), 是 (5404), 图 (5326). The problem with the distribution argument is that some low-frequency characters can still play a key role in a sentence.
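
For readers who want to produce a frequency list like this one from their own texts, here is a minimal Python sketch. It is an illustration under assumptions, not the script used in this thread: `is_han` and `character_frequencies` are hypothetical names, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is counted.

```python
from collections import Counter


def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts;
    # punctuation, Latin letters and digits are skipped.
    return "\u4e00" <= ch <= "\u9fff"


def character_frequencies(text):
    """Count how often each Chinese character occurs in the text."""
    return Counter(ch for ch in text if is_han(ch))


if __name__ == "__main__":
    sample = "中国的新年是中国人的大节日。"
    # most_common(n) lists the n most frequent characters with their counts.
    for char, count in character_frequencies(sample).most_common(3):
        print(char, count)
```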

bazza
March 23, 2008, 05:18 PM

Can you analyze how many characters I've used? ;)

laosimake
March 23, 2008, 06:16 PM

Goulniky, thank you for providing these data as an opportunity to reflect on the implications for language learning. I have a question and two comments.

First, the question. Are the CPod numbers overlapping from one category to the next? E.g., does the intermediate count (1397) overlap the newbie/elementary count (1890), or is it strictly over and above? If overlapping, then what (total) vocabulary level would be achieved at each stage? E.g., might getting to the upper intermediate level be sufficient for reading basic news?

As for the first comment, the graphs seem to suggest that there is (within the lexical universe you selected) a common core vocabulary of maybe 1000-2000 characters and that above that, new vocabulary is rather specialized. I.e., if one were to read across different types of material, one would expect to encounter new words even if one's vocabulary numbered 5000.

The second comment relates to the most frequent characters in this analysis. It is interesting to compare your frequencies with another source, in this case Yong Ho's "CE Frequency Dictionary" (based on character counts from elementary and secondary textbooks published in China during the period 1978-80). Your most frequent characters appear in Ho's list as: 的-1, 国-80, 中-62, 人-8, 大-19, 一-2, 不-6, 年-56, 新-198, 会-54, 是-4, 图-325. This seems to confirm the simplistic conclusion that while frequency lists are dependent on the source, there may in fact be a real "core" that is predominant.

henning
March 23, 2008, 06:32 PM

Thanks for sharing that great analysis, goulniky. Another way of approaching this might be relative frequencies for the different sources: how many different characters appear on average in a block of 1000 characters in a given source (= number of distinct characters / total number of characters * 1000)? I think it might be interesting to see the distance between News (real life) and Advanced for this indicator. For Newbie or Elementary, 88News would certainly be an unfair comparison; those levels would be better mirrored against the script of a soap opera with colloquial dialogue.
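
The indicator proposed in this post can be sketched in a few lines of Python. This is an illustration only: `is_han` and `distinct_per_thousand` are hypothetical names, and the sketch assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) counts as a character.

```python
def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def distinct_per_thousand(text):
    """Number of distinct characters / total number of characters * 1000,
    computed over the Chinese characters in the text."""
    chars = [ch for ch in text if is_han(ch)]
    if not chars:
        return 0.0  # avoid dividing by zero on text without Chinese characters
    return len(set(chars)) / len(chars) * 1000


if __name__ == "__main__":
    print(distinct_per_thousand("的的国人"))
```

With the Xinhuanet totals quoted elsewhere in this thread (3805 distinct characters out of 983,598), the ratio comes to roughly 3.9 per 1000. Because the ratio shrinks as a source grows, it is probably only comparable between sources of similar size.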

goulnik
March 23, 2008, 10:16 AM

In the Xinhuanet graph above, the number of stories is on the x-axis (crawled today, 0-2500) and the cumulative number of unique characters is on the y-axis (0-4000). This shows there's no end to it: I thought it would flatten out, but it seems this would take a lot more articles. Note that over 2320 stories, I found 3805 characters, of which 3397 appear more than once, for a total of 983,598 characters.
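
The curve described here (stories on the x-axis, cumulative unique characters on the y-axis) could be computed along the following lines. This is a hedged Python sketch, not the script behind the graph: `is_han` and `cumulative_unique` are hypothetical names, each story is assumed to be available as a plain string, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is counted.

```python
def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def cumulative_unique(stories):
    """For each story, in order, return the number of distinct characters
    seen so far. The resulting list is the y-value for each x position."""
    seen = set()
    curve = []
    for story in stories:
        seen.update(ch for ch in story if is_han(ch))
        curve.append(len(seen))
    return curve


if __name__ == "__main__":
    print(cumulative_unique(["中国", "中国人", "新年"]))
```

The list never decreases, and how quickly it flattens depends on the order and the topical spread of the stories, so shuffling the input is one way to check that a plateau is not an artifact of ordering.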

goulnik
March 23, 2008, 08:29 PM

@laosimake: if you run the analysis on newbie + elementary + intermediate, the result is 2237 characters over 550 lessons (out of a total of 57,000+ characters as they appear in the HTML files, which incidentally include supplementary vocab), where 440 of those characters appear only once. So, not surprisingly, the overlap is significant, as one can reasonably assume that you start with a core set of characters.

One can run all kinds of analyses, and I also did what henning suggested, but I don't want to overdo it. One reason I did this in the first place was to disprove what I often hear said, that 2000 characters is what you need to know in order to read a newspaper. It's somewhat of a cliche, which I felt was a fallacy (aside from the importance of words as character compounds). The numbers don't quite disprove this when run on ChinesePod lessons, even advanced (2000-3000), but they clearly do with XinhuaNet.

The reason simply has to do with the number of articles and thus the range of topics covered. ChinesePod does a good job of covering all sorts of subjects, and in fact uses a broad range of characters considering the number and average size of lessons. But 100-200 lessons is still a very limited number if you consider the possible choice and depth of topics likely to be covered in a newspaper (running this on 1000s of zh.wikipedia articles could be interesting...). So in the end, it's true that any single short article will only use a few hundred characters; the problem is that each article will likely be very focused, and the next one will use yet another set.

The other message is for anyone looking at HSK testing: you obviously have to start looking at multiple sources to digest lots of new characters and words (the HSK word list has 1,033 entries in band A, 2,018 in band B, 2,202 in band C, and 3,569 in band D).
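
The overlap question answered here (how many characters each level adds on top of the previous ones, and how many characters appear only once) can be sketched in Python as follows. The names `is_han`, `level_summary` and `singletons` are hypothetical, the sketch assumes the text of each level has already been gathered into one string, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) is counted.

```python
from collections import Counter


def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def level_summary(levels):
    """levels maps a level name to its text, in study order.
    Returns rows of (name, characters in level, new characters, running total)."""
    seen = set()
    rows = []
    for name, text in levels.items():
        own = {ch for ch in text if is_han(ch)}
        new = own - seen          # characters not met in any earlier level
        seen |= own
        rows.append((name, len(own), len(new), len(seen)))
    return rows


def singletons(texts):
    """Characters that occur exactly once across all the given texts."""
    counts = Counter(ch for text in texts for ch in text if is_han(ch))
    return sorted(ch for ch, n in counts.items() if n == 1)


if __name__ == "__main__":
    demo = {"Newbie": "你好你好吗", "Elementary": "你好中国"}
    for row in level_summary(demo):
        print(row)
    print(singletons(demo.values()))
```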

bazza
March 23, 2008, 08:59 PM

goulniky, would it be possible to analyze a single user's posts, to determine how many unique characters they've used in all their comments?

laosimake
March 24, 2008, 01:27 AM

@goulniky, I clearly agree with your analysis and conclusions. So we learners need to accept that it will take a fair amount of effort (likely well beyond acquiring 2-3K characters) to read generally without the aid of a dictionary. As an aside, in casual conversation with a few native Chinese friends, it was guesstimated that a college-educated Chinese person probably knows something like 10-15K characters. So back to the flashcards! :-)

RJ
March 24, 2008, 01:40 AM

Thanks goulniky for the data! I think Mark makes a good point re distribution. Certainly at some level it is fine to expect to use a dictionary or to guess some characters from context. Even when reading English I often depend on a dictionary. I think your data supports the cliche that if you know 2000 characters there is a good chance you will be able to read the average newspaper article with reasonable comprehension. Reasonable comprehension being what? 80%? It was after all meant to be a general statement, not a guarantee that any article would be understood at the 100% level by knowing 2000 characters. I think the data is encouraging. If I know 2500 and most corresponding words I should be in good shape as long as I pack a dictionary. As an individual I would probably shy away from esoteric articles in which I have no interest and therefore probably don't know the vocab, and be drawn instead to articles in which I do have background, interest and therefore vocab. -RJ

auntie68
March 24, 2008, 02:33 AM

Thanks once again, goulniky! I think that the way Chinese people "know" Chinese characters is worth taking into account. I get the feeling that most native speakers of Chinese use, in their daily lives, rather a lot of written characters which (i) they may be able to read, but which they may not know how to pronounce or write (so they can't -- and don't -- use those characters in their daily speech); or which (ii) they use in their daily speech, appropriately and with the correct pronunciation, without knowing how to write the character. If they saw the character in print, they might not be able to recognize it without contextual clues.

tvan
March 24, 2008, 04:13 AM

I read about that 2,000 Chinese character limit and learned 2,000 Chinese characters. There's no way that's enough! I believe Taiwan considers 3,000 to be adequate, and that's the number I'm shooting for, along with better grammar and more words, etc. etc.

goulnik
March 24, 2008, 05:12 AM

bazza, I can share this little script, no problem. It's JavaScript; it will just require that you grab all your comments into a series of files, put them in one folder and have them listed in a separate file. You'll need a crawler for that, and a utility such as Filelist (freeware). It works with Firefox; not tested in any other browser.
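
The workflow described in this post (comments saved as files in one folder, with their names listed in a separate file) can be sketched as follows. This is a Python stand-in written for illustration, not the JavaScript script offered in this post, which is not reproduced in the thread. It assumes UTF-8 files, one file name per line in the list file, names relative to the list file's folder, and only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) counted; `is_han` and `unique_characters` are hypothetical names.

```python
from pathlib import Path


def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts.
    return "\u4e00" <= ch <= "\u9fff"


def unique_characters(list_file):
    """Read file names (one per line) from list_file and return the set of
    distinct Chinese characters found across all listed files."""
    list_path = Path(list_file)
    seen = set()
    for name in list_path.read_text(encoding="utf-8").splitlines():
        name = name.strip()
        if not name:
            continue  # skip blank lines in the list
        # Assumption: listed names are relative to the list file's folder.
        text = (list_path.parent / name).read_text(encoding="utf-8")
        seen.update(ch for ch in text if is_han(ch))
    return seen
```

Calling `len(unique_characters("files.txt"))` on such a list file (the name `files.txt` is only an example) would give the number of distinct characters one user has written. If the saved files are full HTML pages, Chinese text from the page layout would be counted too, so the comment text should be extracted first.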

goulnik
March 24, 2008, 05:38 AM

the script, you can save it and run locally

azerdocmom
March 23, 2008, 07:35 PM

Wow, goulnik! That is incredible! You do so much for this community, it's humbling!