
ISO 8879 Review: WG8 N1855

ISO/IEC JTC1/SC18/WG8

Document Processing and Related Communication

Document Description and Processing Languages

TITLE: Third Interim Report on the Project Editor's Review of ISO 8879
SOURCE: SGML RG
PROJECT: JTC1.18.15.1
PROJECT EDITOR: Dr. Charles F. Goldfarb
STATUS: Approved report
ACTION: For information
DATE: 24 May 1996
DISTRIBUTION: WG8 and Liaisons
REPLY TO: Dr. James D. Mason
(ISO/IEC JTC1/SC18/WG8 Convenor)
Oak Ridge National Laboratory
Information Management Services
Bldg. 2506, M.S. 6302, P.O. Box 2008
Oak Ridge, TN 37831-6302 U.S.A.
Telephone: +1 423 574-6973
Facsimile: +1 423 574-6983
Email: masonjd@ornl.gov
WWW: http://www.ornl.gov/sgml/wg8/wg8home.htm
FTP: ftp://ftp.ornl.gov/pub/sgml/wg8/

WG8 has directed the Project Editor of ISO 8879 (SGML) to conduct a systematic review of the standard to consider future development. As the review is in its final stages and participation is increasing, this report incorporates the substance of both previous reports (N1605 and N1701) and some other documents in order to reduce the number of documents that need be accessed by new participants.

I. Principles and policy

WG8 has agreed to a set of principles for any future development (JTC1/SC18/WG8 N1289, attached below). These principles ensure that all existing conforming SGML documents will continue to conform after any changes are made to the standard.

WG8 has also adopted a policy for the review (JTC1/SC18/WG8 N1350, attached below). The review process ensures that every clause, paragraph, note, and syntax production of ISO 8879 is reviewed.

At the Munich meeting in May 1996, the following additional policy provisions were adopted:

  1. The standard shall be designed so that it is practical for documents to be created with ordinary (i.e., not SGML-aware) text editors. This principle does not justify the addition of excessive markup minimization techniques.
  2. A major objective of SGML is that SGML documents shall be processable by any conforming SGML system. Although a conforming SGML system could store documents in a proprietary format that describes portions of the SGML information set, such a format alone is insufficient for conformance. It is therefore required that conforming SGML systems, if they read pre-existing SGML documents, shall be able to import and, if they write SGML documents, shall be able to export those SGML documents in a "plain text" storage format. A "plain text" storage format is provisionally defined as one that is capable of being read and written by commonly used text editors and file utilities that do not inherently interpret specialized notations such as SGML or proprietary formats of particular products. Some examples of editors whose native storage format is plain text are DOS EDIT and EDLIN, Windows Notepad (but not Write or Wordpad), UNIX EMACS and VI, and Macintosh TeachText and SimpleText.

II. Review activities

The review has been structured in two related activities:

Activity 1. Information described by SGML markup

The objective of this activity is to define explicitly the information described by the SGML syntax and to group it, as appropriate, into useful "information sets", such as the Element Structure Information Set (ESIS) (see annex A of ISO/IEC DIS 13673).

This activity is now complete. The "SGML Property Set" defines the full set of information described by SGML markup. This work was done in conjunction with DSSSL and HyTime experts and appears in both the DSSSL standard and the SGML Extended Facilities (formerly "General Facilities") annex of the HyTime standard. We are proposing that it be incorporated in the revision of ISO 8879.

Activity 2. Proposed changes to ISO 8879

The objective of this activity is to identify changes required to correct or enhance the text of ISO 8879, and to publish a revised edition of the standard that incorporates those changes and the changes made by Amendment 1 (1988).

The activity will be conducted in the following sequence:

  1. Clause-by-clause examination of the standard
    This step is currently in process. It will result in a list of requirements expected to be satisfied by the revision of SGML. Interim reports are being published as the work proceeds. These reports include requirements gathered from comments submitted by participants as well as from the systematic examination of the standard.
  2. Member Body approval of requirements to be satisfied
    The list of requirements generated by the first step will be reviewed for technical accuracy. A final list of "Requirements to be Satisfied by the Revision of ISO 8879", including the expected changes needed to satisfy the requirements, will be submitted for Member Body approval. Also submitted will be the reasons why the list fails to include any requirement that had previously been identified as a requirement expected to be satisfied.
  3. Preparation and balloting of text of changes
    Text will be prepared for the changes needed to satisfy the approved requirements and will be balloted in accordance with ISO directives.
  4. Publication of revised ISO 8879
    Upon approval of the text of the changes, the changes will be incorporated into the text of ISO 8879, together with Amendment 1, and the integrated text will be published.

The review has progressed sufficiently that we can state that changes will be recommended. In order to acquaint the SGML community with the types of change we are contemplating, we have prepared an interim list of requirements and associated changes. Please note that the list is by no means complete with respect either to the set of requirements or the possible changes associated with each requirement. Nor do we believe it to be statistically representative of the changes that we will eventually recommend. The list follows, in no particular order (except for grouping together items from the same earlier report).

Note that the first ten requirements below are from N1605 and N1701.

Requirement: To facilitate automatic translation of well-defined document character sets with human inspectability of the translation information

(Note: This information is not required for SGML parsing.)

  1. Add the ability to specify when the definition of a character set is represented using the format of the character set description parameter of the SGML declaration.
  2. Add the ability, within a character set description, to identify a base set character by its string name (not just by its character number).
  3. Add the ability to indicate when the string name for a base set character is taken from the ISO 10646 character registry, rather than being the name used in the base set definition itself. (For ISO sets, the two are normally the same.)
  4. Allow the base set of a formal character set definition to be itself a formal character set definition. This allows subsets of large character sets to be used in the definition of other character sets.

Requirement: To facilitate automatic generation of display character entity sets from definitional character entity sets.

For definitional character entity sets, add the ability to define the entity text as a character in a defined character set, by an ISO 10646 character registry name, or both.

Requirement: To facilitate more powerful management of sophisticated versioning requirements.

Allow Boolean combinations of INCLUDE and IGNORE keywords in marked section declarations.

Requirement: To enhance the usefulness of capacities.

  1. Make specification of any or all capacity limits in a document optional.
  2. Make an SGML system's support for any or all capacity limits optional.
  3. Add capacity limits for document instances, such as number of elements, number of data characters, etc.
  4. Allow optional specification of actual capacities (not just capacity limits).

Requirement: To provide more flexibility for attribute design.

  1. Allow a given name token to appear in the declared value of more than one attribute in the same attribute definition list.
  2. Allow multiple attribute definition list declarations for a single element type.

Requirement: To simplify the specification of SGML declarations.

Allow reference to an existing SGML declaration with local modifications.

Requirement: To allow more choices among optional SGML features.

Modularize the SHORTTAG feature so that attribute minimization can be used with or without allowing empty tags.

Requirement: To clarify difficult portions of the text of the standard.

Clarify the description of record boundary handling, explaining clearly the relationship between detecting data characters and ignoring characters, including whether ignored characters are first recognized as data characters.

Requirement: To support multi-byte character sets with greater convenience

Devise a less burdensome method for declaring long sequences of character numbers.

Requirement: To facilitate name-space modularization in DTDs and LPDs

[The report did not include possible changes that address this requirement.]

Requirement: To support the requirements identified by an earlier review.

The recommendations in WG8 N1035 are considered to be incorporated into this list.

Requirement: To support multi-byte character sets with greater convenience

  1. Provide a way to declare name characters that have no upper/lower case distinction without having to repeat the characters.
  2. Provide a uniform character range specification mechanism and use it uniformly (e.g., in the UC/LCNMSTRT and SHORTREF specifications) to add large chunks of characters concisely.
    [WG8 N1854 addresses this requirement further.]

Requirement: To support multi-lingual text

Provide a means for specifying more than one language in the public text language of a formal public identifier.

Requirement: To enable SGML documents to be interchanged in a plain text format that is accessible by commonly used text editors and file utilities that are not inherently capable of interpreting SGML.

Revise the conformance clause to require that conforming SGML systems be able to use a plain text format in addition to any other format that they use.

Requirement: To integrate document architectures, formal system identifiers, and property sets and groves into SGML.

Incorporate the SGML Extended Facilities of the revised ISO/IEC 10744 into ISO 8879.

Requirement: To correct editorial errors

  1. In clause 5.1, state that Figures 1-4 are based on the reference concrete syntax and that the character numbers used are those of the syntax-reference character set.
  2. Remove references to ISO 2022.

Requirement: To support simple external references in SGML

It shall be possible, in an IDREF or IDREFS attribute value, to specify the name of the entity that defines the name space in which a referenced ID occurs. The syntax is "#ENTITY entity-name ID" (or possibly a list of IDs after which a suitable reserved word, such as #THISDOC, could be used to identify additional local IDs). This facility shall not be extended any further.

The standard will point out that HyTime should be used for more sophisticated requirements such as indirect referencing. The standard will also warn the user of the risks of using direct referencing to external objects whose location could change. (Note: The experience of one large user has been that any object referenced more than twice should be referenced indirectly.)
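The "#ENTITY entity-name ID" form described above could be recognized roughly as in the following Python sketch. It is illustrative only: the names `IdRef` and `parse_idref`, the treatment of a bare ID as a local reference, and the case-insensitive match on the reserved word are assumptions, and the list-of-IDs and #THISDOC variants mentioned as possibilities are not handled.

```python
from typing import NamedTuple, Optional


class IdRef(NamedTuple):
    """One parsed reference (hypothetical structure, not from the report)."""
    entity: Optional[str]  # None means the ID is in the current document
    ref_id: str


def parse_idref(value: str) -> IdRef:
    """Parse a single IDREF value, with or without an '#ENTITY' prefix."""
    tokens = value.split()
    # A lone token that is not a reserved word is taken to be a local ID.
    if len(tokens) == 1 and not tokens[0].startswith("#"):
        return IdRef(entity=None, ref_id=tokens[0])
    # Assumption: the reserved word is matched without regard to case.
    if len(tokens) == 3 and tokens[0].upper() == "#ENTITY":
        return IdRef(entity=tokens[1], ref_id=tokens[2])
    raise ValueError(f"unrecognized IDREF value: {value!r}")
```

For example, `parse_idref("#ENTITY chap2 fig1")` yields the entity name `chap2` and the ID `fig1`, while `parse_idref("fig1")` yields an ID with no entity name.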

Requirement: To allow any SGML document to be a subdocument

SGML declarations will optionally be allowed in SUBDOC entities.

Requirement: To extend the character model of SGML to include entity and storage management.

The standard will be based on the character model that is summarized below. Terms used that are not defined here are defined in ISO 8879 or in the Formal System Identifier Definition Requirements (FSIDR) of the SGML Extended Facilities. Some terms defined in those standards are introduced here with informal definitions that are intended to be consistent with the formal ones.

NOTE: This text is subject to revision as the Review proceeds.
  1. An SGML document is made up of entities, which are sequences of characters. Character sequences are passed to the SGML parser by the entity manager. As characters are abstract semantic constructs, and as computers can only deal with representations of abstractions, what actually passes to the parser is a non-standardized internal representation of the characters, known as the "system representation of characters" (SysRC).
  2. The SGML declaration that governs a document defines a character set known as the "document character set" (DCS). A character set is the mapping of a character repertoire onto a "code set", which is an ordered set of consecutive bit combinations of equal size. A bit combination is an ordered collection of bits, interpretable as a binary number. The size of a bit combination in a code set (the "code set width") is the smallest power of 2 that is an integral multiple of 8 and for which 2, raised to that power, is at least 1 greater than the largest bit combination in the code set.
  3. The DCS maps both significant characters and purely data characters to bit combinations in the code set. The base-10 integer equivalent of the bit combination mapped to a character is the character's "character number", which can be used in numeric character references in the prolog and document instances. The DCS identifies the remaining bit combinations of the code set as being mapped to "non-SGML characters"; they have character numbers. For each bit combination in the code set, even those that are non-SGML, there is a corresponding SysRC.
  4. An entity is a virtual storage object. The entity manager actually maps an entity onto one or more real storage objects (or portions thereof). A storage object consists of "octet sequences" (or other formats) representing the stored characters, known as the "storage representation of characters" (StoRC). A StoRC is invalid if it does not represent an SGML character.
  5. In the course of accessing storage objects, the entity manager, in conjunction with the storage manager(s), could invoke processes such as conversion, decompression, or decryption whose inverses might have been applied when the StoRC was stored, in addition to mapping from the StoRC to the character semantics (and therefore to the SysRC). Such processes and mappings could either be intrinsic to the storage manager or specified as an attribute in the storage object specification in the FSI.
  6. The StoRC is mapped to the SysRC in one of two ways: either with or without the use of the document character set.
    • If the storage object is DCS-dependent, a "bit combination transformation format" (BCTF) is applied to the StoRC to produce bit combinations, which are in turn converted to characters (and therefore to the SysRC) using the document character set.
    • Without the DCS, the storage object is associated with a mapping from the StoRC to characters (and therefore to the SysRC). This mapping, which could imply transformation of an octet sequence, is known as a "character repertoire encoding" (encoding).
  7. DCS-dependent conversions are accomplished by the parser giving the entity manager the mapping from the bit combinations to the SysRC. The entity manager in turn passes it to the storage manager, which uses it to get from the StoRC to the SysRC.
  8. Whether or not the DCS is used for mapping the StoRC to the SysRC, it always exists. Therefore, there is always a mapping from the DCS to the SysRC. That mapping is used to interpret numeric character references in the prolog and document instances, and also to determine whether a StoRC is invalid because it does not represent an SGML character.
  9. The SGML declaration also defines a character set known as the "syntax-reference character set". Its purpose is to allow characters used in the defined concrete syntax to be referenced within the SGML declaration by character numbers that are not dependent on the DCS. It does not play a role in the mapping from the StoRC to the SysRC.
  10. A "char" in a grove is an abstract data type representing a character (defined above as an abstract semantic construct). The concrete representation of a grove must allow any char to be distinguished from all other chars, and the semantics of each char to be determined. The concrete representation may or may not rely on the DCS, UNICODE, and/or other character sets and repertoires.
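
The code set width defined in item 2 and the DCS-dependent path in item 6 can be illustrated with a short Python sketch. It is not part of the report: the function names, the use of a dictionary to stand in for the DCS, the use of a Python string to stand in for the SysRC, and the choice of a fixed-width big-endian BCTF are all assumptions made for illustration.

```python
def code_set_width(largest_bit_combination: int) -> int:
    """Return the smallest width in bits that is a power of 2, a multiple
    of 8, and large enough that 2**width exceeds the largest bit combination."""
    if largest_bit_combination < 0:
        raise ValueError("bit combinations are non-negative")
    width = 8  # smallest power of 2 that is also a multiple of 8
    while 2 ** width < largest_bit_combination + 1:
        width *= 2  # next candidate: 16, 32, 64, ...
    return width


def decode_dcs_dependent(storc: bytes, width: int, dcs: dict) -> str:
    """Apply a fixed-width big-endian BCTF (an assumed, hypothetical choice),
    then map each bit combination to a character through the DCS.

    `dcs` maps character numbers to characters; numbers missing from it
    are treated here as non-SGML characters.
    """
    step = width // 8
    if len(storc) % step:
        raise ValueError("StoRC length is not a multiple of the code set width")
    chars = []
    for i in range(0, len(storc), step):
        number = int.from_bytes(storc[i:i + step], "big")  # character number
        if number not in dcs:
            # No SGML character is represented, so the StoRC is invalid.
            raise ValueError(f"invalid StoRC: character number {number}")
        chars.append(dcs[number])
    return "".join(chars)
```

Under this reading, a code set whose largest bit combination is 255 has a width of 8, while one that reaches 256 needs a width of 16. A width of 24 is never produced, because 24 is not a power of 2.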

Copyright © 1997 International SGML Users' Group