Activity 1. Information described by SGML markup
The objective of this activity is to define explicitly the information
described by the SGML syntax and to group it, as appropriate, into
useful "information sets", such as the Element Structure Information
Set (ESIS) (see annex A of ISO/IEC DIS 13673).
This activity is now complete. The "SGML Property Set" defines the
full set of information described by SGML markup. This work was done
in conjunction with DSSSL and HyTime experts and appears in both the
DSSSL standard and the SGML Extended Facilities (formerly "General
Facilities") annex of the HyTime standard. We are proposing that it be
incorporated in the revision of ISO 8879.
Activity 2. Proposed changes to ISO 8879
The objective of this activity is to identify changes required to
correct or enhance the text of ISO 8879, and to publish a revised
edition of the standard that incorporates those changes and the
changes made by Amendment 1 (1988).
The activity will be conducted in the following sequence:
- Clause-by-clause examination of the standard
This step is currently in progress. It will result in a list of
requirements expected to be satisfied by the revision of SGML. Interim
reports are being published as the work proceeds. These reports
include requirements gathered from comments submitted by participants
as well as from the systematic examination of the standard.
- Member Body approval of requirements to be satisfied
The list of requirements generated by the first step will be
reviewed for technical accuracy. A final list of "Requirements to be
Satisfied by the Revision of ISO 8879", including the expected changes
needed to satisfy the requirements, will be submitted for Member Body
approval. Also submitted will be the reasons for omitting from the list
any requirement that had previously been identified as one expected to
be satisfied.
- Preparation and balloting of text of changes
Text will be prepared for the changes needed to satisfy the approved
requirements and will be balloted in accordance with ISO directives.
- Publication of revised ISO 8879
Upon approval of the text of the changes, the changes will be
incorporated into the text of ISO 8879, together with Amendment 1, and
the integrated text will be published.
The review has progressed sufficiently that we can state that changes
will be recommended. In order to acquaint the SGML community with the
types of change we are contemplating, we have prepared an interim list
of requirements and associated changes. Please note that the list is
by no means complete with respect either to the set of requirements or
the possible changes associated with each requirement. Nor do we
believe it to be statistically representative of the changes that we
will eventually recommend. The list follows, in no particular order (except for grouping together items from the same earlier report).
Note that the first ten requirements below are from N1605 and N1701.
Requirement: To facilitate automatic translation of well-defined
document character sets with human inspectability of the translation
information
(Note: This information is not required for SGML parsing.)
- Add the ability to specify when the definition of a character set
is represented using the format of the character set description
parameter of the SGML declaration.
- Add the ability, within a character set description, to identify a
base set character by its string name (not just by its character
number).
- Add the ability to indicate when the string name for a base set
character is taken from the ISO10646 character registry, rather than
being the name used in the base set definition itself. (For ISO sets,
the two are normally the same.)
- Allow the base set of a formal character set definition to be
itself a formal character set definition. This allows subsets of
large character sets to be used in the definition of other character
sets.
Requirement: To facilitate automatic generation of display character
entity sets from definitional character entity sets.
For definitional character entity sets, add the ability to define
the entity text as a character in a defined character set, by an
ISO10646 character registry name, or both.
Requirement: To facilitate more powerful management of sophisticated versioning requirements.
Allow Boolean combinations of INCLUDE and IGNORE keywords in marked
section declarations.
Requirement: To enhance the usefulness of capacities.
- Make specification of any or all capacity limits in a document optional.
- Make an SGML system's support for any or all capacity limits optional.
- Add capacity limits for document instances, such as number of elements, number of data characters, etc.
- Allow optional specification of actual capacities (not just
capacity limits).
Requirement: To provide more flexibility for attribute design.
- Allow a given name token to appear in the declared value of more than one attribute in the same attribute definition list.
- Allow multiple attribute definition list declarations for a single element type.
Requirement: To simplify the specification of SGML declarations.
Allow reference to an existing SGML declaration with local modifications.
Requirement: To allow more choices among optional SGML features.
Modularize the SHORTTAG feature so that attribute minimization can be used with or without allowing empty tags.
Requirement: To clarify difficult portions of the text of the standard.
Clarify the description of record boundary handling,
explaining clearly the relationship between detecting data characters
and ignoring characters, including whether ignored characters are
first recognized as data characters.
Requirement: To support multi-byte character sets with greater convenience
Devise a less burdensome method for declaring long
sequences of character numbers.
Requirement: To facilitate name-space modularization in DTDs and LPDs
[The report did not include possible changes that address this
requirement.]
Requirement: To support the requirements identified by an earlier
review.
The recommendations in WG8 N1035 are considered to be incorporated into this list.
Requirement: To support multi-byte character sets with greater
convenience
- Provide a way to declare name characters that have no
upper/lower case distinction without having to repeat the characters.
- Provide a uniform character range specification mechanism and use
it uniformly (e.g., in the UC/LCNMSTRT and SHORTREF specifications) to
add large chunks of characters concisely.
[WG8 N1854 addresses this requirement further.]
Requirement: To support multi-lingual text
Provide a means for specifying more than one language in
the public text language of a formal public identifier.
Requirement: To enable SGML documents to be interchanged in a plain
text format that is accessible by commonly used text editors and file
utilities that are not inherently capable of interpreting SGML.
Revise the conformance clause to require that conforming SGML
systems be able to use a plain text format in addition to any other
format that they use.
Requirement: To integrate document architectures, formal system
identifiers, and property sets and groves into SGML.
Incorporate the SGML Extended Facilities of the revised ISO/IEC
10744 into ISO 8879.
Requirement: To correct editorial errors
- In clause 5.1, state that Figures 1 - 4 are based on the reference
concrete syntax and that the character numbers used are those of the
syntax-reference character set.
- Remove references to ISO 2022.
Requirement: To support simple external references in SGML
It shall be possible, in an IDREF or IDREFS attribute value, to
specify the name of the entity that defines the name space in which a
referenced ID occurs. The syntax is "#ENTITY entity-name ID" (or
possibly a list of IDs after which a suitable reserved word, such as
#THISDOC, could be used to identify additional local IDs). This
facility shall not be extended any further.
The standard will point out that HyTime should be used for more sophisticated requirements such as indirect referencing. The standard will also warn the user of the risks of using direct referencing to external objects whose location could change. (Note: The experience of one large user has been that any object referenced more than twice should be referenced
indirectly.)
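The proposed attribute value syntax can be illustrated with a minimal sketch. It assumes that an IDREFS value is a whitespace-separated token list, that "#ENTITY entity-name" applies to the IDs that follow it until "#THISDOC" (or another "#ENTITY") appears, and that reserved words are matched without regard to case. The names IdRef and parse_idrefs are hypothetical; nothing here is defined by ISO 8879 or by the proposal itself.

```python
from typing import List, NamedTuple, Optional


class IdRef(NamedTuple):
    entity: Optional[str]  # None means the ID is local to this document
    id: str


def parse_idrefs(value: str) -> List[IdRef]:
    """Split an IDREF/IDREFS value into (entity, id) pairs.

    Assumed grammar (illustrative only):
        value := ( "#ENTITY" name id+ | "#THISDOC" id* | id )*
    """
    tokens = value.split()
    refs: List[IdRef] = []
    entity: Optional[str] = None
    needs_id = False  # True right after "#ENTITY name", until an ID is seen
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        if tok.startswith("#"):
            if needs_id:
                raise ValueError("#ENTITY name must be followed by an ID")
            word = tok.upper()
            if word == "#ENTITY":
                if i + 1 >= len(tokens) or tokens[i + 1].startswith("#"):
                    raise ValueError("#ENTITY must be followed by an entity name")
                entity = tokens[i + 1]
                needs_id = True
                i += 2
                continue
            if word == "#THISDOC":
                entity = None  # subsequent IDs are local again
                i += 1
                continue
            raise ValueError(f"unknown reserved word: {tok}")
        refs.append(IdRef(entity, tok))
        needs_id = False
        i += 1
    if needs_id:
        raise ValueError("#ENTITY name must be followed by an ID")
    return refs
```

For example, under these assumptions parse_idrefs("#ENTITY chap2 sec1 sec2 #THISDOC intro") would associate sec1 and sec2 with the entity chap2 and treat intro as a local ID. Name case folding is deliberately left out of the sketch.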
Requirement: To allow any SGML document to be a subdocument
SGML declarations will optionally be allowed in SUBDOC entities.
Requirement: To extend the character model of SGML to include entity
and storage management.
The standard will be based on the character model that is summarized below. Terms used that are not defined here are defined in ISO 8879 or in the Formal System Identifier Definition Requirements (FSIDR) of the SGML Extended Facilities. Some terms defined in those standards are introduced here with informal definitions that are intended to be consistent with the formal ones.
NOTE: This text is subject to revision as the Review proceeds.
- An SGML document is made up of entities, which are sequences of
characters. Character sequences are passed to the SGML parser by the
entity manager. As characters are abstract semantic constructs, and as
computers can only deal with representations of abstractions, what
actually passes to the parser is a non-standardized internal
representation of the characters, known as the "system representation
of characters" (SysRC).
- The SGML declaration that governs a document defines a character
set known as the "document character set" (DCS). A character set is
the mapping of a character repertoire onto a "code set", which is an
ordered set of consecutive bit combinations of equal size. A bit
combination is an ordered collection of bits, interpretable as a
binary number. The size of a bit combination in a code set (the "code
set width") is the smallest power of 2 that is an integral multiple of
8 and which is such that 2, raised to that power, is at least 1
greater than the largest bit combination in the code set.
- The DCS maps both significant characters and purely data characters
to bit combinations in the code set. The base-10 integer equivalent of
the bit combination mapped to a character is the character's
"character number", which can be used in numeric character references
in the prolog and document instances. The DCS identifies the remaining
bit combinations of the code set as being mapped to "non-SGML
characters"; they have character numbers. For each bit combination in
the code set, even those that are non-SGML, there is a corresponding
SysRC.
- An entity is a virtual storage object. The entity manager actually
maps an entity onto one or more real storage objects (or portions
thereof). A storage object consists of "octet sequences" (or other
formats) representing the stored characters, known as the "storage
representation of characters" (StoRC). A StoRC is invalid if it does
not represent an SGML character.
- In the course of accessing storage objects, the entity manager, in
conjunction with the storage manager(s), could invoke processes such
as conversion, decompression, or decryption whose inverses might have
been applied when the StoRC was stored, in addition to mapping from
the StoRC to the character semantics (and therefore to the SysRC).
Such processes and mappings could either be intrinsic to the storage
manager or specified as an attribute in the storage object
specification in the FSI.
- The StoRC is mapped to the SysRC in one of two ways: either with or
without the use of the document character set.
- If the storage object is DCS-dependent, a "bit combination
transformation format" (BCTF) is applied to the StoRC to produce bit
combinations, which are in turn converted to characters (and therefore
to the SysRC) using the document character set.
- Without the DCS, the storage object is associated with a mapping
from the StoRC to characters (and therefore to the SysRC). This
mapping, which could imply transformation of an octet sequence, is
known as a "character repertoire encoding" (encoding).
- DCS-dependent conversions are accomplished by the parser giving the
entity manager the mapping from the bit combinations to the SysRC. The
entity manager in turn passes that mapping to the storage manager,
which uses it to get from the StoRC to the SysRC.
- Whether or not the DCS is used for mapping the StoRC to the SysRC,
it always exists. Therefore, there is always a mapping from the DCS to
the SysRC. That mapping is used to interpret numeric character
references in the prolog and document instances, and also to determine
whether a StoRC is invalid because it does not represent an SGML
character.
- The SGML declaration also defines a character set known as the
"syntax-reference character set". Its purpose is to allow characters
used in the defined concrete syntax to be referenced within the SGML
declaration by character numbers that are not dependent on the DCS. It
does not play a role in the mapping from the StoRC to the SysRC.
- A "char" in a grove is an abstract data type representing a
character (defined above as an abstract semantic construct). The
concrete representation of a grove must allow any char to be
distinguished from all other chars, and for the semantics of each char
to be determined. The concrete representation may or may not rely on
the DCS, UNICODE, and/or other character sets and repertoires.
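As an informal illustration of the model summarized above, the following sketch computes a code set width and walks the DCS-dependent path from the StoRC to the SysRC. Several things are assumptions made only for the example: the BCTF is taken to be a fixed-width, big-endian octet grouping; the DCS is represented as a dictionary from character number to character, with non-SGML character numbers simply absent; a Python string stands in for the SysRC; and all names (code_set_width, fixed_width_bctf, decode_dcs_dependent, TOY_DCS) are hypothetical.

```python
from typing import Dict, List


def code_set_width(largest_bit_combination: int) -> int:
    """Smallest power of 2 that is a multiple of 8 (8, 16, 32, ...) such
    that 2**width is at least 1 greater than the largest bit combination."""
    width = 8
    while 2 ** width < largest_bit_combination + 1:
        width *= 2
    return width


def fixed_width_bctf(octets: bytes, width: int) -> List[int]:
    """Assumed BCTF: group octets into big-endian bit combinations of
    `width` bits and return them as integers (character numbers)."""
    step = width // 8
    if len(octets) % step:
        raise ValueError("octet sequence is not a whole number of bit combinations")
    return [int.from_bytes(octets[i:i + step], "big")
            for i in range(0, len(octets), step)]


def decode_dcs_dependent(octets: bytes, dcs: Dict[int, str], width: int) -> str:
    """StoRC -> bit combinations (via the BCTF) -> characters (via the DCS).
    A StoRC that does not represent an SGML character is rejected."""
    chars = []
    for number in fixed_width_bctf(octets, width):
        if number not in dcs:
            raise ValueError(f"character number {number} is a non-SGML character")
        chars.append(dcs[number])
    return "".join(chars)


# Toy DCS: TAB, LF, CR and the printable range 32..126 are SGML characters;
# every other bit combination up to 127 is treated as non-SGML.
TOY_DCS: Dict[int, str] = {n: chr(n) for n in (9, 10, 13)}
TOY_DCS.update({n: chr(n) for n in range(32, 127)})
TOY_WIDTH = code_set_width(127)
```

With this toy DCS, decode_dcs_dependent(b"<p>", TOY_DCS, TOY_WIDTH) yields the three characters of the tag, while an octet of value 0 is rejected as non-SGML. The same dictionary lookup would serve to interpret a numeric character reference, since the character number is the base-10 equivalent of the bit combination.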