Learning to read and write Chinese as a second language is indeed a difficult task. College-level courses introduce students to about 500 characters in the first year and another 500 in the second year. However, it is generally accepted that one needs to know between 3000 and 4000 characters in order to read a standard Chinese newspaper. For those of us still learning, this is a daunting challenge. How do we get there from way down here? How do we make sure that we are learning the most common ones first? While we don't claim to know the answers to these questions, we have some ideas we'd like to share. Wouldn't it make sense to ensure that we are learning the most frequently used characters (and words) first?
There have been several attempts to identify the most frequently used Chinese characters. The Chinese Government itself published one such list in 1998. In another study, Shih-Kun Huang analyzed Internet postings from 1993 and 1994 amounting to almost 172 million characters. Additional analysis was performed by C.H. Tsai. Since the study used Big5-encoded data, the results are more relevant to those interested in traditional Chinese characters.
A more recent analysis has been performed by Jun Da using web publications, including online versions of works written before 1911 (which accounted for 25% of the data). The data amounted to over 258 million characters encoded in GB2312 and GB13000. Accordingly, the results have been summarized using simplified Chinese characters. One major improvement of this study is that it also identified the most frequently used bigrams, most of which should be two-character words.
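To make the idea of character and bigram counts concrete, here is a minimal Python sketch of how such tallies might be produced from raw text. It is not the method used in either study; the function names and the sample sentence are invented for illustration, and the check for Chinese characters covers only the basic CJK Unified Ideographs block.

```python
from collections import Counter


def is_cjk(ch):
    # Assumption: only the basic CJK Unified Ideographs block is counted.
    return "\u4e00" <= ch <= "\u9fff"


def count_characters_and_bigrams(text):
    """Tally single characters and adjacent character pairs (bigrams)."""
    chars = Counter()
    bigrams = Counter()
    prev = None
    for ch in text:
        if is_cjk(ch):
            chars[ch] += 1
            if prev is not None:
                bigrams[prev + ch] += 1
            prev = ch
        else:
            # Punctuation, spaces, and Latin text break a bigram.
            prev = None
    return chars, bigrams


# Hypothetical sample text, used only to exercise the function.
sample = "我们学习中文。我们喜欢中文。"
chars, bigrams = count_characters_and_bigrams(sample)
print(chars.most_common(3))
print(bigrams.most_common(3))
```

Note that a raw bigram count also picks up pairs that straddle word boundaries, which is why only most, not all, frequent bigrams turn out to be real two-character words.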
The statistical data also offers some hope. Although 3000-4000 characters is still a mind-boggling number of symbols to memorize, the following table indicates that at least partial understanding is achievable with a smaller set.
|Characters||Coverage in Huang's 1994 results||Coverage in Da's 2004 results|
|Top 250 characters||64.4%||57.1%|
|Top 500 characters||79.2%||72.1%|
|Top 1000 characters||91.1%||86.2%|
|Top 1500 characters||95.7%||92.4%|
|Top 2000 characters||97.9%||95.6%|
|Top 3000 characters||99.4%||98.3%|
The above table tells us that the top 1000 characters account for between 86% and 91% of all the characters occurring in the texts analyzed. Assuming, with great hope, that there is a good correlation between the top 1000 characters found in these studies and the 1000 characters that most second-year college students are supposed to master, we can conclude that there is light at the end of the tunnel. Finally, while we wouldn't advocate studying characters solely on the basis of their frequency, we believe that studying such lists is a reasonable supplement to conventional study programs. Such lists also provide a sense of just where one is on the path to full Chinese literacy.
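The percentages in the table are cumulative coverage figures: the share of all character occurrences accounted for by the N most frequent characters. A rough sketch of that arithmetic follows, assuming the counts are held in a Python `Counter`. The `coverage` function and the toy counts are made up for illustration and are not data from either study.

```python
from collections import Counter


def coverage(counts, top_n):
    """Fraction of all occurrences covered by the top_n most frequent items."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(count for _, count in counts.most_common(top_n))
    return covered / total


# Toy counts (invented): four characters, 100 occurrences in total.
toy_counts = Counter({"的": 50, "一": 30, "是": 15, "龘": 5})

for n in (1, 2, 4):
    print(f"Top {n}: {coverage(toy_counts, n):.1%}")
```

With these toy numbers the top two characters already cover 80% of the text, which mirrors, on a tiny scale, why a few hundred characters go such a long way in the real studies.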
The results from the above studies have been used to generate two different sets of flashcards.
Which one should you use? In the overall scope of things, it probably doesn't matter. In fact, the flashcards will let you switch to the alternate script. However, since a simplified character can map to more than one traditional character, there is a possibility of a conversion error. For this reason, we recommend using the first list for those primarily interested in the traditional script and the second for those interested in the simplified script.
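To see why conversion can go wrong, consider a few well-known cases in which one simplified character corresponds to several traditional ones. The sketch below is illustrative only: the `AMBIGUOUS` table is a tiny hand-picked sample, not a complete mapping, and `ambiguous_characters` is a hypothetical helper rather than part of the flashcards.

```python
# A few simplified characters with more than one traditional counterpart.
# This is a small hand-picked sample, not a complete conversion table.
AMBIGUOUS = {
    "发": ["發", "髮"],        # to emit / hair
    "后": ["後", "后"],        # after / queen
    "干": ["乾", "幹", "干"],  # dry / to do / shield
    "面": ["面", "麵"],        # face / noodles
}


def ambiguous_characters(text):
    """Return (index, character, candidates) for each ambiguous character."""
    return [
        (i, ch, AMBIGUOUS[ch])
        for i, ch in enumerate(text)
        if ch in AMBIGUOUS
    ]


# "头发" (hair) should convert to "頭髮", but a converter that looks at
# single characters in isolation could just as easily pick "發".
flagged = ambiguous_characters("头发")
print(flagged)
```

Choosing the right candidate generally requires looking at the whole word, which is one reason a list compiled in your target script is the safer choice.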