Here’s a great introduction to the history of SGML written and presented by Susan Schreibman.
My name is Susan Schreibman. I’m Professor of Digital Humanities at Maynooth University and I’m the co-developer of the Text Encoding and TEI module. I’m going to give you a very short history of SGML. So SGML, as you can see on the slide, was developed in 1986, HTML in 1991, and XML in 1998. What you might notice is that SGML was developed prior to the World Wide Web.
In fact HTML was developed from SGML and it is the language that enabled the World Wide Web. What do HTML, XML, and SGML have in common? Well, they’re all markup languages as opposed to programming or processing languages. SGML, XML, and other derivatives of SGML are metalanguages: they are languages that describe other languages. But they all use the same syntax, as you can see on the slide. They use tags or elements, in this case you can see line group and line as elements, and there is specialized software that interprets those tags for display, search, retrieval, and analysis. All these markup languages use attributes, and you can see one here on the line element, where the attribute gives the line number, three or two. These attributes further delineate specific features of the text.
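The slide is not reproduced in this transcript, so what follows is only a sketch of the kind of markup being described. It assumes TEI-style names (`lg` for line group, `l` for line, `n` for the line number) and placeholder verse text, and it uses Python's standard `xml.etree.ElementTree` to stand in for the "specialized software" that interprets the tags.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the element names (lg, l) and the attribute (n)
# are assumptions modelled on TEI, not copied from the slide.
SAMPLE = """
<lg type="stanza">
  <l n="1">First line of verse</l>
  <l n="2">Second line of verse</l>
  <l n="3">Third line of verse</l>
</lg>
""".strip()


def lines_by_number(xml_text):
    """Return a {line number: text} mapping for every <l> element."""
    root = ET.fromstring(xml_text)
    # The n attribute is what lets software address one specific line.
    return {int(line.get("n")): line.text for line in root.iter("l")}


if __name__ == "__main__":
    for number, text in lines_by_number(SAMPLE).items():
        print(number, text)
```

The elements carry the structure (a group containing lines), while the attribute adds a detail about each line that a program can then search or sort on.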
In the early 1980s the idea emerged that markup should focus on the structural elements of the document and leave the visual presentation of that structure to other software that would interpret those tags, and this is what led to the creation of SGML. The idea, which we still hold, is separating content from its display. By the time SGML became a standard in 1986 it had been adopted by industries that used a lot of text: the IRS, the Department of Defense, and in fact the text encoding community. SGML as a tag set is unlimited, potentially comprising millions of tags. What’s important about it is that it allows user communities to define and develop their own tag sets. Having said that, it was extremely difficult to work with syntactically, particularly over a networked environment because, as we talked about earlier, it was developed in a pre-internet environment. Yet SGML is very powerful in its descriptive capabilities.
SGML only described a syntax for including markup in documents. There was a separate syntax called a DTD (document type definition) that described what tags could be used, for example where in the hierarchy and how often. What DTDs allowed was for many documents to be encoded according to the same standard, and these many documents are known as document instances. We’ll be exploring DTDs and schemas, which serve the same purpose for XML documents, later in this course. SGML was made easy for humans but was difficult for machines to process. It was made easy because at the time, in the 1980s, the only way to get text into a computer was to type it in. So there was a lot of what’s called minimization, so that human beings had to type less, with fewer keystrokes. So for example, tags didn’t need to be closed, attribute values didn’t have to have quote marks, tags could be replaced with a delimiter string, and upper and lower case were read as the same by the computer.
And this made it very difficult for processing. SGML allowed for empty tags so that the empty tag inherited the value from the nearest previous start tag. What this meant was that as the program went along, when it found another open tag it realized that the content before must have been closed. So it took a lot of extra processing power.
HTML is an application of SGML designed originally to support the sharing of articles amongst researchers. HTML’s limited vocabulary was in fact its strength. But because HTML tags were developed by and large for display, as a presentational vocabulary, it didn’t facilitate descriptive searching. In the early days of the World Wide Web tag abuse became rife. Each browser rendered HTML tags differently, so if you picked an h1 tag, a heading one tag, on one browser it might be 20 points in bold, on another 18 points in italic. People began not using the tags semantically, h1 for the highest level header, but began using other tags and adding in display information to control how the web pages looked.
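Some of the minimization shortcuts just described (unclosed tags, unquoted attribute values, case-insensitive names) carried over into HTML, which makes them easy to observe. Python's standard library has no SGML parser, so this sketch uses `html.parser.HTMLParser` as a stand-in. The `TagRecorder` class, the `record_tags` helper, and the sample markup are illustrative names, not part of any standard.

```python
from html.parser import HTMLParser


class TagRecorder(HTMLParser):
    """Record every start and end tag the parser reports, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag, {}))


def record_tags(markup):
    """Parse markup leniently and return the list of tag events."""
    parser = TagRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


# Minimized markup: mixed case, an unquoted attribute value,
# and two list items that are never explicitly closed.
MINIMIZED = "<UL><LI class=first>one<li>two</UL>"

if __name__ == "__main__":
    for event in record_tags(MINIMIZED):
        print(event)
```

The parser accepts the input without complaint and folds the tag names to lower case, but it never reports an end tag for either list item. Any software consuming these events has to work out for itself where each item ends, which is the extra processing burden the lecture refers to.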
The success of HTML was exponential, and within a few years after the release of the first browsers in 1991, code became so buggy and browsers became so bloated attempting to render the most ill-encoded text that another solution was needed. eXtensible Markup Language (XML) was developed as a compromise. It’s a metalanguage like SGML, so it allows communities to create their own tag sets. But it also contains a number of rules that make it simpler for computers to process. This rule set is known as well-formedness, which we describe in this section.
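As a rough illustration of what well-formedness rules out, the sketch below feeds a few small fragments to Python's standard XML parser and reports which ones it accepts. The fragments and the `is_well_formed` helper are made up for this example, and the check covers well-formedness only, not validity against a DTD or schema.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the XML parser accepts the markup, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Each rejected case uses a shortcut that SGML minimization permitted.
CASES = {
    "closed and quoted": '<lg><l n="1">text</l></lg>',
    "unclosed tag": '<lg><l n="1">text</lg>',
    "unquoted attribute": "<lg><l n=1>text</l></lg>",
    "mismatched case": '<lg><L n="1">text</l></lg>',
}

if __name__ == "__main__":
    for label, markup in CASES.items():
        print(label, is_well_formed(markup))
```

Only the first fragment parses. Because an XML parser can reject anything else outright, it does not need the guesswork that made SGML and early HTML expensive to process.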