
Coventive creates new display code to internationalize Linux

Oct 10, 2000 — by LinuxDevices Staff — from the LinuxDevices Archive

Coventive Technologies offers a family of embedded Linux-based software solutions that support a broad range of international languages, including English, French, German, Italian, Spanish, Portuguese, Russian, simplified and traditional Chinese, Japanese, Korean, Thai, and Vietnamese. Coventive's multilingual capabilities derive from the company's patented Giga Character Set (GCS) technology, which Coventive describes as “a mathematical encryption algorithm that minimizes market localization requirements and preserves the full flavor, extent, and characteristics of each language.”

The following white paper, which describes Coventive's “Giga Character Set” (GCS) technology, has been reprinted with permission of Coventive Technologies.



GCS White Paper — Coventive Technologies

The Need for a Truly Multilingual Display Code

Today's widely used display codes lack the capacity to handle the pictographic languages spoken by 45 percent of the world's people. But what if a single kernel, a single operating system, could accommodate both alphabet languages and the tens of thousands of pictographs of Asian languages? What economies could be achieved? What barriers would fall? Coventive Technologies has built true multilingual capability into its Linux operating system. It promises to change the way that much of the world does business.

“Is it conceivable that the U.S. software industry would adopt a display code that could not computerize Shakespeare's Hamlet?”

In our age of global information technology, communication, and 24-hour markets, it is increasingly important to bridge barriers such as national languages. And in an age when technology can send a man to the moon, it is hard to imagine that high-quality multilingual computer software is not generally available. But true multilingual capability is a key differentiating feature of Coventive's Linux distribution. How does this work and what does it mean?

To process the written text of a human language, computers must convert it into an internal numerical representation known as a display code. Information and commands entered on the keyboard in any human language are thus converted to display code, processed, and/or stored. When results need to be sent to an output device (a printer or display monitor), the display code is converted back into a human-readable form.
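As a concrete illustration of this round trip (using Python and UTF-8, neither of which the paper assumes), text is turned into numeric codes for storage and processing, then decoded back for output:

```python
# Illustrative sketch, not Coventive's code: round-tripping text
# through an internal numeric representation, here UTF-8 bytes.
text = "Hello, 世界"

encoded = text.encode("utf-8")      # human-readable text -> numeric codes
print(list(encoded)[:5])            # first few byte values: [72, 101, 108, 108, 111]

decoded = encoded.decode("utf-8")   # numeric codes -> text for output
assert decoded == text              # nothing lost in the round trip
```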

The Evolution of Display Codes

One of the earliest and best-known display codes is the American Standard Code for Information Interchange, or ASCII, which is used for English. ASCII is a display coding scheme based on seven binary digits (bits) of 0 or 1, which allows for 128 different characters (two to the seventh power). This is more than sufficient for English, since the alphabet has 26 letters in both lower and upper case, plus ten numerical digits and about 35 commonly used punctuation marks. Now, if only everyone in the world used only English!
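The arithmetic behind the 128-character limit is easy to check directly (a Python sketch; the standard library's `ord` and `chr` expose ASCII's character-to-number mapping):

```python
# ASCII maps each character to a 7-bit value, so 2**7 = 128 codes exist.
assert 2 ** 7 == 128

# 'A' is code 65 and 'a' is code 97; chr() reverses the mapping.
print(ord("A"), ord("a"))
assert chr(65) == "A"

# Everything basic English text needs fits below 128:
assert all(ord(c) < 128 for c in "Hello, World! 123")
```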

However, even Europeans found the basic ASCII format inadequate for expressing their languages, given the use of accented letters and additional letters such as the eszett (ß) in German. This resulted in two different display code solutions: extended ASCII and Latin-1. Extended ASCII, for example, is based on eight bits rather than seven, thus allowing 256 different characters to be defined. Ultimately, none of the alphabet-based languages poses too significant a challenge when it comes to display coding, storage, or processing. The real difficulty arises in handling pictogram-based languages, most notably Chinese, Japanese, and Korean, which are spoken by 45% of all people worldwide.
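The extra bit is exactly what makes room for those European letters. A small sketch using Latin-1 (ISO 8859-1), which Python supports directly, shows that accented letters and ß each occupy a single byte in the upper half of the 8-bit range:

```python
# Latin-1 (ISO 8859-1) uses a full 8 bits, so 2**8 = 256 codes are
# available -- enough for accented Western European letters and German ß.
assert 2 ** 8 == 256

for ch in ("é", "ü", "ß"):
    code = ch.encode("latin-1")[0]
    print(ch, code)          # each fits in one byte, but lies above 127
    assert 127 < code < 256  # i.e., outside 7-bit ASCII
```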

Various schemes have been used or proposed for representing complex languages in a computer display code. Some schemes break written characters down into basic strokes of penmanship, so that each written character can be represented within the computer by a combination of strokes. However, completely representing certain characters requires a large number of basic strokes, which leads to a display code scheme that needs substantially more storage to represent all characters fully. Another limitation is that a stroke-based display code cannot easily or efficiently handle an alphabet-based language at the same time, and it is usually desirable for Asian-language software to have English as a default or secondary language.

This leads us to another scheme, which attempts to represent multiple languages in a single display code format by reserving blocks of character codes for different languages. Such a scheme does not make storing text any more efficient, given the repeated look-ups required through a database of character codes for multiple languages. However, it does provide a standard way to represent text in all languages. This is the basis for the widely used Unicode standard of “universal” display coding. Current editions of Microsoft Windows, Java, and virtually all Linux distributions support Unicode.

How the Unicode Standard Falls Short

Instead of the seven or eight bits used for ASCII and extended ASCII, Unicode assigns a 16-bit code to all characters of all languages that it supports. The 16-bit code allows for 65,536 different characters (2 to the 16th power), and each language is assigned a range of Unicode values for its letters, symbols, or characters. For example, Greek is assigned 0370 to 03FF (16 bits can be written as four hexadecimal digits), Arabic is assigned 0600 to 06FF, and Chinese, Japanese, and Korean are jointly assigned a block from 2EFF to 9FFF, which amounts to about 36,600 characters.
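These block assignments can be checked directly from a character's code point. The sketch below uses today's published Unicode block boundaries (the main CJK Unified Ideographs block now begins at 4E00, slightly different from the range quoted above):

```python
# A character's numeric code point tells you which language block it
# occupies; no table beyond the block boundaries is needed.
assert 2 ** 16 == 65536                  # size of the original 16-bit space

greek = range(0x0370, 0x0400)            # Greek and Coptic block
arabic = range(0x0600, 0x0700)           # Arabic block
cjk = range(0x4E00, 0xA000)              # CJK Unified Ideographs block

assert ord("λ") in greek                 # Greek small letter lambda
assert ord("ب") in arabic                # Arabic letter beh
assert ord("中") in cjk                  # a common Chinese character
```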

It seems generous that under Unicode three languages alone get about 55% of the entire character set. Despite this allocation, Unicode yields an incomplete rendering not only of those three Asian languages, but of others as well. First, Unicode cannot easily accommodate the new characters that continue to be formed, especially in the technical, scientific, and lifestyle arenas. In some languages, for example, new scientific terms are typically represented by newly formed characters that, although composites of existing characters, would each need to be assigned a new numerical value or display code. Given the limited range allocated to each language, one can imagine that past a certain point Unicode will run out of display codes.

Second, Unicode does not preserve the characteristics of a language. For example, the directionality of written language can vary. Another example: although the three Asian languages (Chinese, Japanese, and Korean) are based on individual characters, two or even three characters are often combined to form a new character. Unicode does not support this basic language feature either.

Finally, and perhaps most importantly, the number of display codes that Unicode provides is simply not large enough to permit truly universal display coding. The most glaring example is Chinese. The 36,600 codes assigned must cover two versions of Chinese (simplified for China and traditional for Hong Kong and Taiwan), Japanese, and Korean. While the average literate Chinese reader knows only about 5,000 characters, there are actually about 100,000 Chinese characters in use today.

Without a more powerful and inclusive display code, entire industry and government functions cannot be fully and correctly computerized, due to this language limitation alone. This includes the medical, pharmaceutical, biological, geological, chemical, horticultural, zoological, astronomical, and philosophical fields, among countless others. There are even surnames that Unicode does not cover, and most classical literature cannot be correctly computerized using Unicode. Is it conceivable that the U.S. software industry would adopt a display code that could not computerize Shakespeare's Hamlet? Or that the National Science Foundation or NASA could not be fully computerized because part of their vocabulary could not be represented?

Coventive's Unique Giga-Character-Set

Coventive's Linux OS is unique among popularly available Linux distributions because it is based not on Unicode, but on a compatible display code scheme called Giga-Character Set, or GCS, developed by its research and development team. GCS addresses every one of the Unicode shortcomings discussed above. In short, GCS supports over 100,000 characters, is expandable, can preserve written language characteristics, and is even more efficient than Unicode, meaning it processes most complex languages much faster.

GCS is fundamentally different from other display codes because it is not based on assigning binary codes to characters or letters. GCS is actually a mathematical encryption algorithm that the computer uses to translate between natural-language characters and letters on one side and computer-language bits on the other. A different algorithm is developed for each language, capturing that language's peculiarities, such as its basic symbols, spatial relationships, directionality, and supplemental symbols.

With this methodology, no language can ever run out of display codes. GCS, therefore, can be fully inclusive of a natural language and still have room to accommodate new characters. Coventive's XLinux 1.5 OS is equipped with GCS encoding for 12 languages that together cover over 75% of the world's population. (The languages are English, French, Spanish, German, Italian, Portuguese, Russian, simplified Chinese, traditional Chinese, Japanese, Korean, and Vietnamese. More are planned for future versions.) GCS coverage of each language is much more comprehensive than Unicode's, and is more efficient to use.

The efficiency of GCS rests on the simple truth that computers can calculate much faster than they can search. When natural languages are processed with a full set of available fonts, the traditional display code approach amounts to a behemoth of a look-up table covering each basic character in many different fonts. GCS instead “calculates,” or derives, the correct character and font. This is not only faster but also requires less memory. For example, GCS handles Korean font files 1,500 times faster than Unicode; Japanese fonts, 1.5 times faster; and Chinese fonts, 100 times faster. The same principle can be extended to phonetic or text-to-speech processing, with similar speed improvements and reduced memory requirements.
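GCS itself is proprietary and not specified in this paper, but the “calculate rather than search” idea can be illustrated with a real case from Unicode itself: Korean Hangul syllables are laid out algorithmically, so any syllable's code point can be computed from its component jamo by pure arithmetic, with no per-character table. A minimal sketch:

```python
# Illustration of "calculating" rather than "looking up" (this is the
# standard Unicode Hangul composition formula, used here as an analogy
# for GCS's approach; it is not GCS itself).
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

def hangul_syllable(lead: int, vowel: int, tail: int = 0) -> str:
    """Compose one Hangul syllable from jamo indices by arithmetic alone."""
    return chr(S_BASE + (lead * V_COUNT + vowel) * T_COUNT + tail)

# lead 18 = ㅎ, vowel 0 = ㅏ, tail 4 = ㄴ  ->  the syllable 한
print(hangul_syllable(18, 0, 4))
assert hangul_syllable(18, 0, 4) == "한"
```

Because the syllable is derived by a formula, rendering software can skip the dictionary search entirely, which is the kind of saving the paper attributes to GCS across whole languages.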

The Applications of GCS

GCS is most powerful when adopted across a full body of operating and application software. Though a simple solution like ASCII more than suffices in the largest software market in the world (the U.S.), even American software companies stand to benefit from using GCS. Time to market in a global context is greatly reduced when software development is based on GCS: there would be minimal need to localize software individually for each country market, and developers in any country could sell truly multilingual single versions to global markets far more easily.

Embedded software applications would similarly gain from more efficient language processing and reduced storage requirements. The benefits for font cartridges, dictionaries, and translators are obvious. But ultimately, nearly all information appliances could be manufactured for multiple country markets in a single version. Depending on the storage limitations of the product, the manufacturer can choose to use only a subset of the full GCS capability: for example, to accommodate only English, Chinese, and Japanese, or any other combination of languages. Inventory then becomes easier to manage when producing for multiple markets.

Finally, GCS is a major benefit to end users, particularly those who use a complex natural language. Many countries have not yet reached the level of automation and computerization found in the U.S., though that is the inevitable trend. GCS removes a critical stumbling block by providing a robust display code scheme that preserves the full flavor, extent, and characteristics of a language. Even in the developed world of alphabet-based languages, the increasingly global nature of business, communications, and political and social interaction makes true multilingual software an imperative.



This article was originally published on LinuxDevices.com and has been donated to the open source community by QuinStreet Inc. Please visit LinuxToday.com for up-to-date news and articles about Linux and open source.


