Frequency Distribution with Chinese Words

I had recently been reading about Zipf’s Law and how it seems to hold for most natural languages. Given my background with the Chinese language, I wanted to see whether the distribution would behave the same way as it does in English.

Zipf’s Law states that in any corpus of natural language, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. In other words, a word's frequency should be roughly inversely proportional to its rank.
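To make that relationship concrete, here is a minimal sketch of the idealized form, f(r) = f(1) / r. The function name `zipf_expected` and the top-word count of 6000 are my own illustrative choices, not values taken from the corpus.

```python
def zipf_expected(top_frequency, max_rank):
    """Return idealized Zipf frequencies f(r) = f(1) / r for ranks 1..max_rank."""
    return [float(top_frequency) / rank for rank in range(1, max_rank + 1)]

# Hypothetical count for the most frequent word, chosen only for illustration.
expected = zipf_expected(6000, 5)
for rank, freq in enumerate(expected, start=1):
    print(rank, freq)
```

Real corpora only follow this curve approximately, so treat these numbers as a reference shape rather than a prediction.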

For simplicity’s sake, I decided to use the Sinica Treebank sample included in NLTK’s corpus collection. It is important to note that this is only a sample of the Sinica Treebank, so the results may not be as accurate as they would be with the entire corpus. The code samples in the rest of this article use Python and NLTK.

Words

You can easily import the Sinica Treebank with NLTK and get a list of all the words using the following code:

>>> from nltk.corpus import sinica_treebank
>>> import nltk
>>> sinica_treebank.words()
['\xe4\xb8\x80', '\xe5\x8f\x8b\xe6\x83\x85', ...]

Don’t be worried about the weird-looking strings (like '\xe4\xb8\x80'). These are just UTF-8 encoded Chinese characters. You can view the actual characters by using a print statement like this:

>>> print '\xe5\x8f\x8b\xe6\x83\x85'
友情
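If you want to work with the characters themselves rather than raw bytes, you can decode the string. This is a small sketch that assumes the corpus reader returns UTF-8 byte strings, as the output above suggests. The variable names are mine.

```python
# The same bytes shown above, written as an explicit byte string.
raw = b'\xe5\x8f\x8b\xe6\x83\x85'

# Decode the UTF-8 bytes into text.
word = raw.decode('utf-8')

print(len(raw))   # number of bytes
print(len(word))  # number of characters
print(word)
```

Each of these two characters takes three bytes in UTF-8, which is why the byte string is three times as long as the decoded word.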

We will need to extract all the words from the corpus and count how many times each word appears. We do this using the following code:

>>> from collections import Counter
>>> from numpy import array
>>> words_with_count = Counter(sinica_treebank.words())
>>> counts = array(list(words_with_count.values()))
>>> words = list(words_with_count.keys())

Now you have a list of all the distinct words in the corpus and an array with the number of times each word appears.
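If Counter is new to you, here is a tiny standalone sketch of what it does. The word list is made up for illustration and is not taken from the Sinica Treebank.

```python
from collections import Counter

# A made-up list of tokens standing in for sinica_treebank.words().
sample_tokens = ['的', '是', '的', '在', '的', '是']

sample_counter = Counter(sample_tokens)

# most_common(n) returns the n most frequent items with their counts.
top_two = sample_counter.most_common(2)
print(top_two)
```

Counter is a dict subclass, so keys() and values() come back in matching order, which is what lets us pair each word with its count later on.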

Plotting Frequency Distribution

Now let’s create a graph to show the frequency distribution of Chinese words in the Sinica Treebank. We will need to get both the rank and the frequency to create the Zipf plot.

>>> from pylab import *
>>> ranks = arange(1, len(counts)+1)
>>> indices = argsort(-counts)
>>> frequencies = counts[indices]

Here we are ordering the frequency counts in descending order and creating a rank that spans from 1 to the number of distinct words in the corpus.
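The argsort trick can be a little opaque, so here is a sketch of the same steps on a toy array. The counts and word labels are invented for illustration.

```python
import numpy as np

# Invented counts and labels; position i in both lists refers to the same word.
demo_counts = np.array([3, 10, 1, 7])
demo_words = ['c', 'a', 'd', 'b']

# argsort sorts ascending, so negating the counts gives descending order.
order = np.argsort(-demo_counts)
sorted_counts = demo_counts[order]
demo_ranks = np.arange(1, len(demo_counts) + 1)

# Pair each rank with the word and count that belong to it.
ranked = [(int(r), demo_words[i], int(c))
          for r, i, c in zip(demo_ranks, order, sorted_counts)]
print(ranked)
```

Keeping the index array around, rather than just sorting the counts, is what lets us look up the word that belongs to each rank.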

The easiest way to display our frequency distribution is a log-log graph, with the x-axis showing the rank and the y-axis showing the word frequencies. We can create the graph like this:

>>> loglog(ranks, frequencies, marker=".")
>>> title("Zipf's Law for Sinica Treebank words")
>>> xlabel("Frequency rank of word")
>>> ylabel("Absolute frequency of word")
>>> grid(True)
>>> # Label roughly 20 log-spaced ranks; len(counts) - 1 keeps the last index in bounds.
>>> for n in list(logspace(-0.5, log10(len(counts) - 1), 20).astype(int)):
...     dummy = text(ranks[n], frequencies[n], " " + words[indices[n]],
...                  verticalalignment="bottom",
...                  horizontalalignment="left")
...
>>> show()

This will create a graph that shows us the frequency distribution like this:

[Figure: Sinica Treebank frequency distribution]
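One rough way to check the graph numerically is to fit a straight line to the log-log data. Under an ideal Zipf distribution the slope would be close to -1. The sketch below is my own addition: the helper `zipf_slope` is hypothetical, and it is run here on synthetic data rather than on the treebank counts.

```python
import numpy as np

def zipf_slope(frequencies):
    """Fit a line to log10(frequency) vs. log10(rank) and return its slope."""
    # Sort descending so that position 0 holds the most frequent word.
    freqs = np.sort(np.asarray(frequencies, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, intercept = np.polyfit(np.log10(ranks), np.log10(freqs), 1)
    return slope

# Synthetic, perfectly Zipfian frequencies: f(r) = 1000 / r for r = 1..100.
ideal = 1000.0 / np.arange(1, 101)
ideal_slope = zipf_slope(ideal)
print(ideal_slope)
```

To try it on the real data you would pass in the counts array built earlier. A plain least-squares fit like this gives every rank equal weight, so the long tail of rare words can pull the estimate around. Treat it as a quick sanity check, not a rigorous test.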

Here is the full script to create this graph: