
The Key Ingredient to Searching and Processing Chinese Text

Converting between simplified and traditional Chinese scripts

A Brit’s “I’m going to have a kip” is nearly incomprehensible to most Americans. No wonder people say that Great Britain and the U.S. are two countries divided by the same language. But Chinese-speaking places like China and Taiwan, or Hong Kong and Singapore, truly are divided by the same language, because Chinese is written two ways.

China and Singapore use the simplified Chinese script, which mainland China introduced in the 1950s. Simplified Chinese reduced the number of strokes per character, but it also forced some single characters to take the place of what had been two similar-sounding characters (often with completely unrelated meanings!). Taiwan and Hong Kong stuck with the traditional Chinese writing system. While the two scripts have the same origin, they are different enough that a person accustomed to traditional Chinese will have some difficulty reading simplified Chinese, and vice versa.

The simplified character 发 (pronounced fa) replaced two traditional characters, both also pronounced fa: 發 (meaning “emit”) and 髮 (meaning “hair”).

Character-by-character conversion could turn 出发 (“set off”) into 出髮 (“emitting hair”?) instead of 出發.

To spare users from typing every search twice (once in each script), and to do any comprehensive Chinese text processing, all the text must be converted to one script and analyzed, with the results then displayed in the user’s preferred script. It’s a specialized machine translation problem, and a major headache for any business whose core competency isn’t multilingualism.
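As a rough illustration of the normalize-then-search idea, the sketch below folds both the query and the documents into simplified characters before matching. The three-entry table `T2S` and the helpers `normalize` and `search` are invented for this example; they are not part of any product API, and a real system would rely on full dictionaries and word-level analysis.

```python
# Hypothetical toy table: traditional character -> simplified character.
# A real converter would cover thousands of characters.
T2S = str.maketrans({"發": "发", "髮": "发", "頭": "头"})


def normalize(text):
    """Fold text into one script (simplified) so both scripts compare equal."""
    return text.translate(T2S)


def search(query, documents):
    """Return the original documents whose normalized form contains the query."""
    key = normalize(query)
    return [doc for doc in documents if key in normalize(doc)]


docs = ["我們明天出發", "我们明天出发", "她的頭髮很長"]

# The same documents are found whichever script the user types in.
hits_simplified = search("出发", docs)
hits_traditional = search("出發", docs)
```

Folding traditional text into simplified is the easier direction, since it is largely many-to-one. Showing results back in traditional script is where the one-to-many ambiguity appears.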

Luckily, the Chinese Script Converter function of Babel Street Text Analytics tackles this problem for you! Equipped with dictionaries and linguistic smarts, it can handle the trickiest script conversion use cases. For example:

Translating word-by-word instead of character-by-character

Especially when converting from simplified to traditional Chinese, you’re faced with a one-to-many conversion problem. A simple character-to-character mapping might give you gibberish. With knowledge of words, Babel Street Text Analytics chooses the correct conversion from 出发 (set off) to 出發 (set off) and not 出髮 (emitting hair).

When a conversion to traditional Chinese goes wrong, a search for 出發 never finds the affected documents, because they were indexed under the non-word 出髮.

Traditional Chinese | Simplified Chinese | Pronunciation (meaning)
出發 | 出发 | chūfā (set off)
頭髮 | 头发 | tóufǎ (hair)
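The difference between the two strategies can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how Babel Street Text Analytics is implemented: `CHAR_MAP`, `WORD_MAP`, `convert_by_char`, and `convert_by_word` are hypothetical names, the dictionaries hold only the entries needed for the examples above, and greedy longest-match stands in for real word segmentation.

```python
# Hypothetical toy dictionaries (simplified -> traditional).
# 发 has two traditional candidates, which is the one-to-many problem.
CHAR_MAP = {"出": ["出"], "发": ["發", "髮"], "头": ["頭"]}
WORD_MAP = {"出发": "出發", "头发": "頭髮"}


def convert_by_char(text):
    """Naive conversion: always take the first candidate for each character."""
    return "".join(CHAR_MAP.get(ch, [ch])[0] for ch in text)


def convert_by_word(text, max_len=4):
    """Greedy longest-match on the word dictionary, falling back to characters."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest possible word first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            chunk = text[i:i + size]
            if chunk in WORD_MAP:
                out.append(WORD_MAP[chunk])
                i += size
                break
        else:
            # No known word starts here: convert a single character.
            out.append(CHAR_MAP.get(text[i], [text[i]])[0])
            i += 1
    return "".join(out)


naive_hair = convert_by_char("头发")     # picks 發, giving the non-word 頭發
word_hair = convert_by_word("头发")      # dictionary gives 頭髮
word_set_off = convert_by_word("出发")   # dictionary gives 出發
```

Whichever single candidate the character table prefers for 发, one of the two words comes out wrong. Only the word-level lookup gets both right.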

Translation cases where the word is entirely different

Cases like the British “kip” to the American “nap” are especially difficult to handle. As China and Taiwan developed independently, it’s not surprising each country came up with different words for new concepts, especially technological advances.

Traditional Chinese | Simplified Chinese | Meaning
電腦 (diànnǎo) | 计算机 (jìsuànjī) | computer
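One way to picture this case is a term-level lookup that runs before the character-level script change. The sketch below is only an assumption-laden illustration: `T2S_TERMS`, `T2S_CHARS`, and `to_simplified` are made-up names, and the tables contain just the single example from the table above.

```python
# Hypothetical regional vocabulary: traditional-script term -> mainland term.
T2S_TERMS = {"電腦": "计算机"}

# Hypothetical character table: traditional -> simplified.
T2S_CHARS = str.maketrans({"電": "电", "腦": "脑"})


def to_simplified(text, localize_terms=True):
    """Convert traditional text to simplified, optionally swapping regional terms."""
    if localize_terms:
        # Replace whole terms first, before their characters are rewritten.
        for trad, simp in T2S_TERMS.items():
            text = text.replace(trad, simp)
    # Then convert any remaining traditional characters one by one.
    return text.translate(T2S_CHARS)


script_only = to_simplified("電腦", localize_terms=False)  # characters changed only
localized = to_simplified("電腦")                           # whole term replaced
```

With `localize_terms=False` only the script changes, so the result keeps the same word in simplified characters. With the default, the word itself is swapped for the one in the table.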

Chinese script conversion is an indispensable feature for Chinese text processing or search problems.

Image: Huang Tingjian, “Fubo Shrine scroll” (detail), ink on paper, via Wikimedia Commons
