YellowBridge Chinese Dictionary and Language Tools

Learning the Most Commonly Used Chinese Characters

Learning to read and write Chinese as a second language is indeed a difficult task. College-level courses introduce students to about 500 characters in the first year and another 500 in the second year. However, it is generally accepted that one needs to know between 3000 and 4000 characters in order to read a standard Chinese newspaper. For those of us in the learning mode, this is a daunting challenge. How do we get there from way down here? While we don't claim to know the answer, we have some ideas we'd like to share. Wouldn't it make sense to ensure that we are learning the most frequently used characters (and words) first?

There have been several attempts to identify the most frequently used Chinese characters. The Chinese government itself publicized one such list in 1998. In another study, Shih-Kun Huang analyzed Internet postings from 1993 and 1994 amounting to almost 172 million characters. Additional analysis was performed by C.H. Tsai. Since the study used Big5-encoded data, the results are more relevant to those interested in traditional Chinese characters.

A more recent analysis has been performed by Jun Da using web publications, including online versions of works written before 1911 (which accounted for 25% of the data). The data amounted to over 258 million characters encoded in GB2312 and GB13000. Accordingly, the results have been summarized using simplified Chinese characters. One major improvement of this study is that it also identified the most frequently used bigrams, most of which should be two-character words.
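To make the idea of character and bigram counts concrete, here is a minimal sketch in Python. It is not the method used in either study. The function names, the sample sentence, and the choice to count only characters in the basic CJK Unified Ideographs block are our own assumptions for illustration.

```python
from collections import Counter

def is_cjk(ch: str) -> bool:
    """True for characters in the basic CJK Unified Ideographs block (an assumption)."""
    return "\u4e00" <= ch <= "\u9fff"

def count_chars_and_bigrams(text: str):
    """Count single characters and adjacent two-character sequences (bigrams)."""
    chars = Counter()
    bigrams = Counter()
    prev = None
    for ch in text:
        if is_cjk(ch):
            chars[ch] += 1
            if prev is not None:
                bigrams[prev + ch] += 1
            prev = ch
        else:
            prev = None  # punctuation or whitespace breaks a bigram
    return chars, bigrams

# Hypothetical sample text, not taken from either study.
sample = "我们学习中文，我们喜欢中文。"
chars, bigrams = count_chars_and_bigrams(sample)
print(chars.most_common(3))
print(bigrams.most_common(3))
```

Note that a bigram counted this way is only a pair of adjacent characters. As the text says, most frequent bigrams should be two-character words, but some will simply straddle a word boundary.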

The statistical data also offers some hope. Although 3000-4000 characters is still a mind-boggling number of symbols to memorize, the following table indicates that at least partial understanding is achievable with a smaller set.

Cumulative Character Frequency for the Top N Characters

Characters             Huang's 1994 results   Da's 2004 results
Top 250 characters     64.4%                  57.1%
Top 500 characters     79.2%                  72.1%
Top 1000 characters    91.1%                  86.2%
Top 1500 characters    95.7%                  92.4%
Top 2000 characters    97.9%                  95.6%
Top 3000 characters    99.4%                  98.3%

The above table tells us that the top 1000 characters account for between 86% and 91% of all character occurrences in the texts analyzed. Assuming, with great hope, that there is a good correlation between the top 1000 characters found in these studies and the 1000 characters that most second-year college students are supposed to master, we can conclude that there is a light at the end of the tunnel. Finally, while we wouldn't advocate studying characters solely based on their high frequency, we believe that studying such lists is a reasonable supplement to conventional study programs. Such lists also provide a sense of just where one is on the path to full Chinese literacy.
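For readers curious how a cumulative figure like those in the table is derived, the following Python sketch shows the arithmetic: sum the counts of the N most frequent characters and divide by the total number of character occurrences. The function name and the toy counts are invented for illustration and are not data from Huang's or Da's studies.

```python
from collections import Counter

def cumulative_coverage(counts: Counter, top_n: int) -> float:
    """Fraction of all character occurrences covered by the top_n most frequent characters."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    top = counts.most_common(top_n)
    return sum(count for _, count in top) / total

# Made-up counts for five characters, chosen so the total is 100.
toy_counts = Counter({"的": 50, "一": 25, "是": 15, "不": 7, "了": 3})

for n in (1, 2, 3):
    print(n, f"{cumulative_coverage(toy_counts, n):.1%}")
```

With these toy counts, the top character alone covers 50.0% of occurrences, the top two cover 75.0%, and the top three cover 90.0%. The same diminishing-returns shape appears in the real table, where each additional block of characters adds less coverage than the one before.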

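The results from the above studies have been used to generate two different sets of flashcards: the first based on Huang's results for traditional characters, and the second based on Da's results for simplified characters.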

Which one should you use? In the overall scope of things, it probably doesn't matter. In fact, the flashcards will let you switch to the alternate script. However, since a simplified character can map to more than one traditional character, there is a possibility of a conversion error. For this reason, we recommend the first list for those primarily interested in the traditional script and the second for those interested in the simplified script.

©2003-2014 J. Lau. All rights reserved.