Chinese Text Segmentation

Posted on November 24, 2016 by Gwylim

Recently since I’ve been learning Mandarin, I’ve been working on automatically finding the most common words in a given text, in order to be able to study those words beforehand.

Unfortunately, chinese words are not separated by spaces, so determining what exactly the words in a given text are is non-trivial. I’ve written a text segmenter to handle this, which takes as input chinese text, and divides that text into words.

The most useful function of this program is to count how many times each word occurs, in order to find the most common words in a given text, which can be done by running zhtext words FILE. This also sorts the words by their relative frequency in the given text, so the words most particular to that text are first.

Algorithm details

If we have a dictionary of valid words, then there is a straightforward DP algorithm to find some way of splitting a string of characters into words. However, just choosing any segmentation into words is not good enough. The dictionary I’m using has entries for many characters by themselves, so for many sentences a valid segmentation would be to just treat each character as a separate word. This is obviously not going to be correct in most cases.

The first heuristic we can use is to look for a minimum-length split; that is, separate the string into the least possible number of words. This gives much better results; however, there are still some cases it gets wrong. For example, the string “不知道” can get segmented as “不知… 道”, whereas it should be “不… 知道” (it turns out that the dictionary has an entry for “不知” as a single word). Since these are the same length, our algorithm isn’t necessarily going to pick the right one.

I’ve found a fairly simple heuristic which extends the DP and seems to solve this in most common cases. The idea is to take advantage of frequency data. The strings “不” and “知道” will both be substantially more common than the string “不知道”, since they are common words. However, “不知” will almost always occur in the string “不知道”, so it’s frequency will be about the same. If we break ties by the product of the frequencies of the individual words, then “不… 知道” will be the preferred segmentation.

Of course, this could still go wrong if, for example, “道” was very common outside the string “知道”, or in cases where the best segmentation is not minimum-length. However, with the frequency heuristic, it seems to work well enough in most cases, at least to the point of being useable for determining the most common words.