Listening to all the Intermediate and Elementary Dialogues

svik
May 23, 2011, 01:51 AM posted in General Discussion

As a diversion from my normal study I have spent the past couple of weeks listening to every dialogue from the Intermediate and Elementary lessons. While listening to and reading the dialogues, I recorded every character that appeared in a notebook, organized by pinyin. My observations:

1. There are about 2000 characters that appear in these lessons.  There would be a somewhat greater number of words, but I did not try to estimate that.

2. Over 10% of the characters (about 225) have more than one pronunciation.  That was more than I anticipated.

3. According to the pinyin chart there are about 400 sounds.  28 of these were not represented in the 2000 characters.

4. Another 62 were represented by a single character, meaning that 3 of the possible tones were not found.  Examples:  开 kāi, 日 rì,  能 néng, 让 ràng

5.  Others were vastly over-represented: ji had 39 characters, including 15 for jì, and shi had 36 characters, including 17 for shì.  I find that words made of 2 such characters can be harder to remember: 经济 jīng jì (economy)

6.  The Intermediate lessons starting in August 2006, especially with the series on 张亮 Zhāng Liàng and 陈丽 Chén Lì, seem to be of a higher technical quality, and the dialogues are generally more interesting.  This seemed to correspond with the appearance of Connie's voice (and others) in the dialogues.
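
For anyone who wants to try a similar tally without a paper notebook, here is a minimal Python sketch. It is not the method used above (that was done by hand). The `SAMPLE_READINGS` table and the function names are made up for illustration, and the handful of characters in it are a toy sample, not the notebook data.

```python
import unicodedata
from collections import defaultdict

# Hypothetical toy sample: character -> pinyin readings (NOT the notebook data).
SAMPLE_READINGS = {
    "开": ["kāi"],
    "日": ["rì"],
    "济": ["jì"],
    "记": ["jì"],
    "机": ["jī"],
    "是": ["shì"],
    "事": ["shì"],
    "十": ["shí"],
    "行": ["xíng", "háng"],
    "长": ["cháng", "zhǎng"],
}

# Combining marks for the four tones: macron, acute, caron, grave.
TONE_MARKS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def strip_tone(syllable: str) -> str:
    """Return the syllable without its tone mark, keeping ü intact."""
    decomposed = unicodedata.normalize("NFD", syllable)
    kept = "".join(ch for ch in decomposed if ch not in TONE_MARKS)
    return unicodedata.normalize("NFC", kept)


def tally(readings):
    """Group characters by toneless syllable and by toned syllable."""
    by_syllable = defaultdict(set)
    by_toned = defaultdict(set)
    for char, pinyins in readings.items():
        for p in pinyins:
            by_syllable[strip_tone(p)].add(char)
            by_toned[unicodedata.normalize("NFC", p)].add(char)
    return by_syllable, by_toned


def multi_reading(readings):
    """List the characters that have more than one pronunciation."""
    return [char for char, pinyins in readings.items() if len(pinyins) > 1]


by_syllable, by_toned = tally(SAMPLE_READINGS)
for syllable in sorted(by_syllable, key=lambda s: -len(by_syllable[s])):
    print(syllable, len(by_syllable[syllable]))
print("more than one reading:", multi_reading(SAMPLE_READINGS))
```

With a full character-to-pinyin table in place of the toy sample, the same grouping would give counts like those in observations 2 to 5.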

rizone
May 25, 2011, 01:22 AM

wow!

svik

(Or, as they like to say in China) 哇!

chris
May 25, 2011, 07:04 AM

Good effort!  I've also done all the ele lessons and all but 10 of the intermediates, but that's taken me four and a half years, not just a couple of weeks, and that's without making a detailed analysis of the characters.

svik

I should have made that clear: I had previously studied 150 Ele and 120 IM lessons, but here I just listened to the dialogues. I knew I would not have the time to go through my usual study routine for all of the earlier ones, but I was curious what was in there. I enjoyed hearing a lot of the dialogues, and could see how characters and words were used in different contexts.

babyeggplant
May 25, 2011, 08:18 AM

Nice work! Were you added to payroll?

svik

Not that I know of :)

bodawei
May 25, 2011, 01:36 PM

That would make you 个只会玩中文播客的呆子。。 :) That's a bit rude, I should have said ..中文播客迷。

Seriously, we need a shorter word for ChinesePod nerd.

One question: 'there are about 2000 characters that appear in these lessons. There would be a somewhat greater number of words,..'. I'm interested why you say that; 80% of words in Chinese are comprised of two characters, so if the number was 100% and there was no re-use of characters there would be 1,000 words for 2,000 characters. Of course there are many instances where characters combine in different ways to form words, but that would be a rather complex algorithm. You could of course be right; just wondering about your logic.

svik

Guilty as charged, except that I might be too old to be a nerd (>50).

From my earlier days of using a conventional dictionary, it seemed that most characters are words in their own right, as well as being used in 2- and 3-character words. So, I reasoned that if you have a large number of characters, an even larger number of words will arise. Mathematically, with 3 objects you get 3 pairs, but with 4 you get 6 and with 6 you get 15. Of course in a language not all characters will pair with others. Someone around might know the actual relationship in the Chinese language.

bodawei

Well we might be the blind leading the blind, but my memory of reading class was that the Chinese, okay, my teachers, had a pretty rigid view of what comprised a word (my analysis was marked as right or wrong), and there are not very many just comprised of one character. The majority of words are made up of two characters. (The dictionaries I use are not helpful on this point.) I guess strictly the lonesome character may be a word, and appear as such in a poem for example, but in normal discourse the Chinese resort to joining two characters. Often, I'll say usually, these two characters carry the same or similar meaning. You do get them as opposite meanings of course. Hence my query. Hopefully this will attract someone who knows more on this topic than we do.

svik

More blind commentary:

I haven't had a Chinese teacher for a long time, and never had your experience regarding "what is a word", so I haven't paid so much attention to that. I just read a few discussions online, and one interesting point was this: Newspapers are said to use about 2000 characters for 97% of their text. The problem for a beginner is that even if you "know" most of those characters, it is likely that you don't know all of the ways they form words with each other. That's why it's so tough for us. Using your figure that 80% of words are 2 characters, if 2000 characters form 1000 single-character words, then they would also form about 4000 2-character words (ignoring the threes).
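
One possible reading of that arithmetic, written out as a small Python sketch. It assumes (this is an assumption, not something stated above) that the 1000 single-character words are the remaining 20% of all words, and it ignores three-character words. The function name is hypothetical.

```python
def two_char_words(single_char_words: int, two_char_percent: int) -> int:
    """Estimate the two-character word count.

    Assumption: every word that is not two characters long is a
    single-character word (three-character words are ignored).
    """
    single_percent = 100 - two_char_percent
    # Total vocabulary implied by the single-character share.
    total_words = single_char_words * 100 // single_percent
    return total_words - single_char_words


print(two_char_words(1000, 80))
```

Under those assumptions 1000 single-character words imply 5000 words in total, of which 4000 are two-character words.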

bodawei

Just re-wind a bit - I said (on certain assumptions, such as all words are comprised of two characters) 2,000 characters would form about 1,000 two character words. You say above: '2,000 characters form 1,000 single character words'; I'm sure you don't mean that, but what did you mean? If all words in the set were comprised of one character, 2,000 characters would form 2,000 words, right?

Then, you say '2000 characters'..'would also form about 4000 2-character words' - how did you get there? Some characters do combine with several characters .. but what is your algorithm to get to that 4,000 word figure?

RJ

I think you are grossly underestimating the number of words possible here. Think of it this way: the first character can create 1999 pairs with the rest of the characters, plus the second can create another 1998 pairs with the others (you already used the combo with character 1), plus the third can create 1997, and so on. This ignores order, however. Actually, since order does matter in our choice of possible pairs formed by 2000 characters, this is a "permutation" rather than a "combination". The general formula for finding the number of permutations of size k taken from n objects is:

n_P_k = n! / (n - k)!

The number of possible two-character words using 2000 characters (if you consider that 1,2 may be a different word than 2,1) is represented by 2000_P_2, and the answer using the formula is 2000 x 1999 = 3,998,000.

This represents the number of 2 character pairs possible. Those from the list of possibles that are actually used as 2 character words is a different story. Someone would have to check them all to see which are indeed used as words.

If you claim that order doesn't matter (1,2 same as 2,1) then this is a combination problem.

The formula for finding the number of combinations of k objects you can choose from a set of n objects is:

n_C_k = n! / (k! (n - k)!)

The number of possible 2-character combinations from a list of 2000 characters is 1,999,000. The point is, the number of possible 2-character words that can be created from 2000 characters is huge. Saying that one can read a newspaper if they know 2000 characters is meaningless. You need to know the words. Oh, and don't forget there are 3-character words as well.
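
Both counts are easy to check in Python (3.8 or later), where `math.perm` and `math.comb` compute these formulas directly. A quick sketch; the wrapper function names are only for illustration:

```python
import math


def ordered_pairs(n: int) -> int:
    """Two-character sequences from n characters, order matters: n! / (n - 2)!"""
    return math.perm(n, 2)


def unordered_pairs(n: int) -> int:
    """Two-character pairs from n characters, order ignored: n! / (2! (n - 2)!)"""
    return math.comb(n, 2)


print(ordered_pairs(2000))
print(unordered_pairs(2000))

# The small cases mentioned earlier in the thread: 3, 4 and 6 objects.
print([unordered_pairs(n) for n in (3, 4, 6)])
```

These are upper bounds on candidate pairs only; which of them are real words still has to be checked against a dictionary.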

chris

Good post RJ. When I read the first post above the other day, I immediately thought of my old A-Level maths classes (A-levels are the exams taken in the UK at age 18 before university). Your post, in particular the words "combinations" and "permutations" sends painful shivers down my spine as I recollect cramming for those exams!

bababardwan

What's the largest number of characters in a single word then?

RJ

I think that is a good research project for you. Most are 2, few are 3, fewer are 4, occasionally we see 5, 6, 7? Baba will let us know what the record is. Check the English-Chinese dictionary; transliterations don't count.

jennyzhu
May 26, 2011, 04:34 PM

Svik,

你真牛!Thank you for sharing your insights, many of which we haven't even developed. BTW, we have indeed come a long way since the early days. I always cringe when listening to my own early podcasts.  

svik

Hi Jenny,

Thanks for your encouragement. I started listening sporadically in summer 2006, and you were already fun to listen to by then. It's funny how, for a successful enterprise such as ChinesePod, a sense of inevitability about it can develop among observers, whereas the people there from the beginning know how hard it was, and the many changes in direction along the way.

user271828

Wow svik, have you made an electronic (searchable, sortable - perhaps a spreadsheet) copy of this data? I would love to see it posted somewhere...

svik

Sorry! It only exists in my 80-page notebook. That would be a project for another day.

everett

I'm looking forward to that day. Thanks for the statistics. They're really useful for me since my first character learning goal is... coincidentally 2000 chars. So your post gave me the idea to spend some time reading newbie and ellie lessons and expansion sentences just to try to recognize characters. There ought to be a very good overlap with my other study materials, and you can use the "hover" function to get info on the characters with a minimum of effort. CP rocks, and svik rocks!