Modelling Concepts in XML and SGML

How does one go about creating even a rough model? This depends upon what you’re modelling, but in many cases there’ll be existing documents or data that exemplify the kind of information you wish to model. Analysing the structure of these examples will help you to create and refine the model, and they also provide vital test material for checking the validity of your model once it is supposed to be complete. If you haven’t got existing examples, the model will need to be created from an analysis of whatever specifications exist for the information in question, including design goals, usage requirements, and related models and templates.

There are a variety of information modelling tools that can be used at this point in the process, and you’ll probably choose one to suit yourself and your need to share it with others. It could be a simple drawing tool, or one designed to create charts and diagrams of various types. It could be a general-purpose information modelling tool, such as one that uses UML. It could be an SGML or XML information modelling tool that is capable of creating a machine-readable form of your information model that can be used directly by other SGML or XML tools. The point is that what you will need to share with others is something that they can understand and which conveys clearly the important details of the model, and this is likely to be graphical.

Many powerful tools rely on graphical presentation for the same reason: a graphical interface can make a complicated decision much easier to follow. Network and security tools are a common example, such as those used to choose between residential and datacenter proxies.

The SGML and XML view of the information world is that the structure of information may be represented by a tree (or directed graph, if you prefer). The tree has a ‘root’ which contains the whole of a unit of information (a ‘document’ in SGML or XML parlance). The document contains a variety of subordinate information units which together make up the whole, and each of these subordinate units in turn may contain further sub-units.
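
To make the tree view concrete, here is a minimal sketch using Python’s standard xml.etree.ElementTree module. The sample document and its element names (report, section and so on) are invented for illustration and are not taken from any real model.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the element names are illustrative only.
SAMPLE = """<report>
  <title>Annual Review</title>
  <section>
    <heading>Summary</heading>
    <para>Sales rose.</para>
  </section>
</report>"""


def outline(element, depth=0):
    """Return one indented line per element, starting from the given root."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(SAMPLE)
print("\n".join(outline(root)))
```

Each level of indentation in the printed outline corresponds to one level of containment in the document.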

Modelling your information will therefore involve deciding first what precisely is the unit of information that you intend to model, then determining the subordinate units that together make up the whole, then doing the same for each subordinate unit, and so on until you are modelling the smallest components of information that simply contain unstructured data: the ‘leaves’ of your tree structure. This is sometimes referred to as a ‘top-down’ approach to modelling, and is a feature of practically all information modelling for SGML or XML applications.

An alternative approach that is sometimes effective is to identify the smallest, unstructured units of information, then work out how these are aggregated to create identifiable composite structures. This is sometimes referred to as a ‘bottom-up’ approach to modelling.
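
A bottom-up analysis of an existing example can be roughed out in a few lines. The sketch below assumes a hypothetical order document; it finds the elements that contain no sub-elements (the leaves) and groups them by the element that aggregates them.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical example document; names and values are invented.
SAMPLE = """<order>
  <customer><name>A. Smith</name><email>a@example.org</email></customer>
  <item><code>X1</code><quantity>2</quantity></item>
  <item><code>Y7</code><quantity>1</quantity></item>
</order>"""


def leaves_by_parent(root):
    """Map each parent element name to the sorted names of its leaf children."""
    groups = defaultdict(set)
    for parent in root.iter():
        for child in parent:
            if len(child) == 0:  # no sub-elements: a leaf holding only data
                groups[parent.tag].add(child.tag)
    return {parent: sorted(names) for parent, names in groups.items()}


groups = leaves_by_parent(ET.fromstring(SAMPLE))
for parent, names in groups.items():
    print(parent, "aggregates", names)
```

The composite structures that emerge here (customer and item) become the candidates for the next level up in the model.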

In traditional SGML applications it is generally the case that one doesn’t attempt to model the content between the markup, i.e. what is often referred to as unstructured text. However, in many XML applications – especially those that are concerned with strongly-typed data, such as codes, dates and quantities (in various units and to varying degrees of precision) – it is important that the modelling process should cover not only the structure to be represented by XML markup but also the allowable values of the content between the markup. More recent developments in the XML community have introduced ways of defining information models that also cover content; these are referred to as schema definition languages. The original language for defining information models in SGML and XML – the language of the Document Type Definition (DTD) – doesn’t have this capability.
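
The gap can be illustrated with a small sketch. A DTD can declare an element as containing character data, but it cannot say that the data must be a date or a number. The Python below stands in for the kind of content check a schema language would express declaratively; the order document, the element names and the two checking functions are all hypothetical.

```python
import xml.etree.ElementTree as ET
from datetime import date

# A DTD could only declare: <!ELEMENT deliveryDate (#PCDATA)>
# so the impossible date below would pass DTD validation.
SAMPLE = """<order>
  <deliveryDate>2024-02-30</deliveryDate>
  <quantity unit="kg">12.5</quantity>
</order>"""


def is_iso_date(text):
    """True if text is a real calendar date written as YYYY-MM-DD."""
    try:
        date.fromisoformat(text)
        return True
    except ValueError:
        return False


def is_positive_number(text):
    """True if text parses as a number greater than zero."""
    try:
        return float(text) > 0
    except ValueError:
        return False


root = ET.fromstring(SAMPLE)
problems = []
if not is_iso_date(root.findtext("deliveryDate", default="")):
    problems.append("deliveryDate")
if not is_positive_number(root.findtext("quantity", default="")):
    problems.append("quantity")
print(problems)
```

In a real application these rules would normally live in the schema itself rather than in application code, which is the point of the schema definition languages mentioned above.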

Creating information models for use in SGML or XML applications is not like designing a data model for a conventional database application. A typical information model for an SGML or XML application will be much less constraining about the structure of information. Whereas a data model for a classical database application will be very precise about the number and order of fields, and the length of the content of fields, a typical SGML or XML information model will allow greater variation in the number, sequence and content of the structural components (elements). The definition of a prototype for any given element (referred to as an ‘element type definition’) may define its content to be sub-elements that may occur once, twice or an arbitrary number of times, may not necessarily always occur in the same sequence, or may not occur at all in some instances. This flexibility is essential in order to describe the structure of most information objects of any complexity that are commonly found in the real world, especially large information objects such as human-readable documents.
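
As a rough illustration of this flexibility, a DTD content model such as (title, author+, para*, note?) says: exactly one title, one or more authors, any number of paragraphs, and an optional note, in that order. The sketch below mimics that one content model with a regular expression over the child element names. The element names are invented, and this is only an analogy for how occurrence indicators behave, not how a real parser validates.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical DTD declaration being imitated:
#   <!ELEMENT article (title, author+, para*, note?)>
# Each child name is followed by a space so that names cannot run together.
CONTENT_MODEL = re.compile(r"title (author )+(para )*(note )?")


def matches_model(xml_text):
    """Check the root element's children against the content model above."""
    root = ET.fromstring(xml_text)
    child_names = "".join(child.tag + " " for child in root)
    return CONTENT_MODEL.fullmatch(child_names) is not None


print(matches_model("<article><title/><author/><author/><para/></article>"))
print(matches_model("<article><title/><para/></article>"))  # no author
```

The sketch covers only sequence and occurrence; it does not attempt the case where sub-elements may appear in varying order.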

Once an information model has been agreed down to the last detail (and making use where possible of pre-existing models for commonly-used information objects such as tables and mathematical expressions), you will be ready to express this formally as an SGML DTD or an XML DTD or schema.

Expressing your model in SGML or XML

SGML and XML provide a lot of flexibility in the way that you can express any given information model. However, as in most things, there are better and not so good ways of going about it. Here are a few pointers to good practice:

Always express your information model as precisely as possible. For example, SGML and XML make it very simple to make parts of your model ‘optional’, but you should only do this if you really want to leave it to the end user to decide whether or not to use the optional structure. If you want to control whether or not the end user includes a structure, you should provide clear alternatives with guidance on which alternative to select in which circumstances.

Break down a large information model into modules, defining structures that are common to several different information models in modules separate from those in which you define structures specific to one model. For example, leave models you’ve re-used from external sources, such as structures for tables and mathematical formulae, in separate modules. Modularising your model in this way makes it easier to re-use common components – and it also makes the whole model easier to maintain in the likely event of needing to revise it in future.

Adopt consistent rules for naming and identifying the elements and attributes in your information model. SGML and XML place some constraints upon naming, but still provide plenty of scope for you to adopt a variety of conventions. The current fashion (in languages using the Latin alphabet, at least) is for long, phrase-like names in which the individual words are identified by using UpperCamelCase (all ‘words’ begin with a capital letter, otherwise lower-case) or lowerCamelCase (as before, but first ‘word’ is all lower-case).
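
A naming convention is easier to keep if it is checked mechanically. The sketch below classifies names as UpperCamelCase, lowerCamelCase or neither. The patterns are one possible reading of the two conventions (ASCII letters and digits only), and the function and the sample names are hypothetical.

```python
import re

# One or more 'words', each starting with a capital letter.
UPPER_CAMEL = re.compile(r"(?:[A-Z][a-z0-9]*)+")
# A lower-case first 'word', then zero or more capitalised 'words'.
LOWER_CAMEL = re.compile(r"[a-z][a-z0-9]*(?:[A-Z][a-z0-9]*)*")


def naming_style(name):
    """Classify an element or attribute name by its capitalisation."""
    if UPPER_CAMEL.fullmatch(name):
        return "UpperCamelCase"
    if LOWER_CAMEL.fullmatch(name):
        return "lowerCamelCase"
    return "other"


names = ["PurchaseOrder", "DeliveryAddress", "purchase_order"]
styles = {name: naming_style(name) for name in names}
consistent = len(set(styles.values())) == 1
print(styles, consistent)
```

Running a check like this over every name in a model shows at a glance where the convention has slipped.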

Adopt consistent rules for what information to model using attributes. If an information object contains only text characters, you have a choice as to whether to model it as an element containing text only or as an attribute of the parent element. Whichever you choose in your information model will determine how the information must be marked up each time you use the information model for instances of actual data. Your approach will probably be influenced by whether you are expressing your information model in DTD language or in an alternative XML-based schema language (such as XML Schema or RELAX NG). Generally, attributes work well for metadata, such as identifiers and data management information, but less well for the actual content of the information you are modelling.

Even by adopting quite strict conventions, you’ll still find that SGML and XML both provide enormous flexibility to model a wide range of information types in a standard, open and machine-readable language. This flexibility is the reason why SGML and now XML have been adopted for a large and ever-growing range of applications.
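
The attribute-versus-element choice described in the pointers above shows up directly in the markup and in any code that reads it. The sketch below marks up the same hypothetical invoice identifier both ways; the element names, the values and the invoice_id helper are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# The same identifier modelled as an attribute (metadata style)...
AS_ATTRIBUTE = '<invoice id="INV-001"><total>99.50</total></invoice>'
# ...and as a text-only child element.
AS_ELEMENT = '<invoice><id>INV-001</id><total>99.50</total></invoice>'


def invoice_id(xml_text):
    """Read the identifier, whichever of the two forms was used."""
    root = ET.fromstring(xml_text)
    # Prefer the attribute form; fall back to the child-element form.
    return root.get("id") or root.findtext("id")


print(invoice_id(AS_ATTRIBUTE))
print(invoice_id(AS_ELEMENT))
```

A helper that has to cope with both forms is a sign that the model was not consistent; settling on one rule in the model removes the need for the fallback.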

John Forrest
IT Blogger, author of Using a BBC iPlayer Proxy
