
SGML as the Lingua Franca in a Multilingual Environment

Mladen Damjanov, Deputy Director ISEA, 56 R. Glesener, L-1630 Luxembourg; Tel. +352 29 21 22, Fax: +352 29 21 20, E-mail: mdamjanov@isea.lu

The need for multilingualism is a direct consequence of market opportunities in world trade. Indeed, when information of any type is provided in a country, there is usually a legal requirement that it be in the language of that country.

This is especially true for consumer goods provided for the mass market, for which the documentation and associated information form a key component of the product itself. As is the case for any component, there has to be a clear approach both for the production and maintenance of the multilingual information. When the product itself is information, as in the case of the Internet, multilingualism becomes essential.

Historically, multilingualism has been considered an ancillary problem, and has never been solved in an effective way. Rather, it has been considered as a problem to be circumvented, one rarely, if ever, totally surmountable.

Basically, two sets of costs can be applied in achieving multilingualism in addition to the cost of the original version (in what could be described as the master language):

The approach taken to achieve the desired results depends on whether the information is intended for translation or for document management. With the Internet, there is an ever-increasing number of users from many different countries, and these users need to be able to navigate with their browser in their own language.

This must be considered from two points of view:

For the representation and actual exchange of the information, increasing use is made of the ISO 10646 character sets and Unicode, where the latter can be considered as a subset of the former for the more popular languages. The Unicode character coding system is designed to support the interchange, processing, and display of written texts in many diverse languages.

Unlike the ASCII character set, the Unicode standard is based upon a character set that includes over 39,000 combinations. In order to distinguish particular characters, it is no longer necessary to use escape sequences or control codes.

It should be noted, however, that the ASCII value of a given character will be the same in the Unicode standard.

Example:

ASCII / ISO 8859-1: A equals 0100 0001

Unicode: A equals 0000 0000 0100 0001

Nevertheless, alongside the fixed-width ISO 10646 encoding forms UCS-2 and UCS-4, other schemes such as UTF-8 represent characters as byte sequences of varying length, which introduces the problem of escape sequences in a different manner.
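The bit patterns in the example above, and the variable-length behaviour of UTF-8, can be checked with a short Python sketch. It is an illustration only and not part of the original paper. It assumes that, for characters in the 16-bit range, UCS-2 yields the same bytes as Python's "utf-16-be" codec; the helper name bits is hypothetical.

```python
def bits(data: bytes) -> str:
    """Render bytes as space-separated groups of four bits."""
    return " ".join(f"{byte:08b}"[i:i + 4] for byte in data for i in (0, 4))


# The letter A in an 8-bit set and in a 16-bit form.
latin1 = "A".encode("latin-1")
ucs2 = "A".encode("utf-16-be")  # assumption: stands in for UCS-2 here

# UTF-8 keeps ASCII characters as one byte but spends two or more
# bytes on other characters, so the byte length is no longer fixed.
utf8_ascii = "A".encode("utf-8")
utf8_accented = "é".encode("utf-8")

print("ISO 8859-1 A:", bits(latin1))
print("UCS-2      A:", bits(ucs2))
print("UTF-8      A:", bits(utf8_ascii))
print("UTF-8      é:", bits(utf8_accented))
```

The last line shows a two-byte sequence for a single character, which is the sense in which UTF-8 requires a reader to interpret leading bytes before it knows where a character ends.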

However, where the management of the information is concerned, Unicode (or ISO 10646) does not provide the solution. What is needed are operations for managing the information, with a focus on fine-grained processing of the structure of the information (documents) and on the exploitation of its semantics. With manual procedures, however well aligned, it is impossible to manage documents in several linguistic versions coherently. For example, manual methods cannot guarantee synoptism (common page numbering, etc.), especially when most documents are of high volume. Although some tools exist for managing documents in machine-readable form, they do not offer the flexibility to identify parts of the information, or to locate a part of a document, independently of the linguistic version considered. This is where the use of SGML becomes attractive.

SGML allows the structure of a document to be validated against its DTD. Through the principle of tagging and the use of SHORTREF, precise positioning within the document is possible, and even extraction of some of the information if desired, independently of the linguistic version considered. It is also possible to allow a user to consult part of a document in a given linguistic version and obtain the same part in another language by managing links between the different linguistic versions of the same document.
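As a minimal sketch of how such links between linguistic versions might be managed, the following Python fragment indexes each version by a shared identifier, so that a part consulted in one language can be retrieved in another. Everything here is an assumption made for illustration: well-formed XML and the standard xml.etree.ElementTree parser stand in for a full SGML system, and the element names, the id attribute, and the functions build_index and same_part are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragments of one document in two linguistic versions.
# Each structural part carries the same id in every version.
VERSIONS = {
    "en": (
        '<budget>'
        '<article id="a1"><title>Revenue</title></article>'
        '<article id="a2"><title>Expenditure</title></article>'
        '</budget>'
    ),
    "fr": (
        '<budget>'
        '<article id="a1"><title>Recettes</title></article>'
        '<article id="a2"><title>Dépenses</title></article>'
        '</budget>'
    ),
}


def build_index(versions):
    """Map language -> {id: element} for every element that has an id."""
    index = {}
    for lang, source in versions.items():
        root = ET.fromstring(source)
        index[lang] = {el.get("id"): el for el in root.iter() if el.get("id")}
    return index


def same_part(index, part_id, target_lang):
    """Return the part with the given id in the requested language."""
    return index[target_lang][part_id]


index = build_index(VERSIONS)
# A reader consulting article a2 in English asks for the French version.
print(same_part(index, "a2", "fr").findtext("title"))
```

The design choice in this sketch is that the link is carried by the markup itself rather than by page or line position, which is what makes the lookup independent of the linguistic version.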

When this problem is considered in the context of the Internet, the limitations of HTML are apparent. But when HTML (an application of SGML) is extended with the functionality provided by XML (a subset of SGML), its usefulness becomes apparent. XML can also be considered the next logical step towards full SGML.

Indeed, the sole purpose of HTML is to allow for easy representation of the information on-screen. The structure of the information is stripped of all contextual meaning, which renders it particularly poor. However rich the initial tagging of the information, its conversion to an HTML representation results in a loss of the capacity to exploit it correctly afterwards.

The need to distribute the information available on the web in various languages is an obvious requirement. The emergence of tools for automatic translation of information available on the Web (e.g. SYSTRAN) bears testimony to this fact.

Translation of information is a complex process. It is not sufficient merely to translate the words; to obtain a coherent result, they must be interpreted according to the context in which they are placed, so that they carry their full significance and weight in the translation. Anyone who has performed translation tests using these tools will be able to testify to the doubtful results.

The level of SGML tagging of a document can express the granularity required, depending on the needs of the user. The semantics need to be added in the names of the tags to increase the possibilities of exploitation of the information. So now there can be added benefits to managing the information provided in multilingual form.
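To illustrate what semantically named tags make possible, the sketch below compares the structure of two linguistic versions and extracts language-independent data by tag name. It is a hedged illustration under stated assumptions: well-formed XML parsed with the standard xml.etree.ElementTree module stands in for SGML, and the element names, the currency attribute, and the functions skeleton and amounts are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical budget item in two linguistic versions. The tag names
# describe what the content is, not how it should be displayed.
EN = (
    "<item><heading>Staff</heading>"
    "<amount currency='ECU'>1200</amount>"
    "<remarks>Covers salaries.</remarks></item>"
)
DE = (
    "<item><heading>Personal</heading>"
    "<amount currency='ECU'>1200</amount>"
    "<remarks>Deckt Gehälter.</remarks></item>"
)


def skeleton(root):
    """List the tag names in document order, ignoring the text."""
    return [el.tag for el in root.iter()]


def amounts(root):
    """Extract (currency, value) pairs from every amount element."""
    return [(el.get("currency"), int(el.text)) for el in root.iter("amount")]


en, de = ET.fromstring(EN), ET.fromstring(DE)

# The versions are structurally aligned if their skeletons match.
print("same structure:", skeleton(en) == skeleton(de))
# Figures can be cross-checked without reading either language.
print("same amounts:", amounts(en) == amounts(de))
```

A coarser tagging (for example, a single paragraph element per item) would still allow the structural comparison, but not the extraction of the amounts, which is the sense in which the level of tagging sets the granularity available to the user.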

At the application level, the implementation of the SGML standard with all its features, in a multilingual environment, has given rise to the development of software tools that permit the desired functionality/flexibility to:

Indeed, it is not merely the use of SGML, but also the exploitation of the standard, together with the precise specification of the grammar of documents, that has made it possible to achieve such functionality readily. Processing multilingual documents therefore no longer poses a problem; rather, it may be regarded as an extension to the original information, resulting in added value.

Tools are becoming available for processing multilingual documents, integrating some of the functionality made available through SGML (e.g. XCORPUS offers a software environment for the manipulation of a textual corpus represented in accordance with SGML). And more are coming onto the marketplace.

These comments and observations are not pure theory. The implementation of these principles, and of the SGML standard today, allows us to develop multilingual applications within production environments.

For example, they have been applied to the production of the budget of the European Communities comprising a volume of some 2,000 pages per linguistic version. It is produced and maintained in the eleven official languages of the European Community.

Through the exploitation of the SGML standard, together with appropriate application software, it is possible to remove the barriers created by the need for multilingualism in a cost-effective way. It could even be said that SGML is the vehicle for this: SGML is a vehicular language.


Copyright © 1998 International SGML Users' Group