Dr. Steven Finch, Principal Researcher, Technology Labs, Thomson Technology Services Group, 1375 Piccard Drive, Rockville, MD 20850, USA; Tel: +1 (301) 548-4093; Email: sfinch@thomtech.com
The aim of annotating text with SGML is to add to the text an explicit description of its structure. Search engines, on the other hand, return texts to users according to whether (or to what extent) they contain combinations of words and phrases which satisfy the user's query. One would hope that the value we have added to a collection of texts by marking them up in SGML could be fully exploited by querying on that structural annotation.
Many search engines today claim to be 'SGML-aware'. Sometimes, this means little more than claiming that they can index SGML documents by ignoring the markup. More often it means that they can index the markup to a very limited extent (typically throwing away all information about the context of occurrence of elements and attributes). It never means that as much information about the structure of the document is indexed as is present in the original document. This article briefly describes some of the reasons why this is so, and what SGML-aware indexing is not.
Indexing large quantities of information for the purpose of retrieving it has always been an essential task when information is made available to people other than the author. Until recently, it had never been possible (or desirable) to manually index every word or concept appearing in every document in a vast collection. Computers have changed this fact; it is now relatively cheap to identify and index every word in every document in a collection. Such an index is called a full-text inverted index. Inverted indexes can be thought of as mappings from words to sets of locations of words in documents, in much the same way that a back-of-book index can be thought of as a mapping from concepts (expressed as index terms) to sets of page numbers. There are two major differences between traditional back-of-book indexes and computer-generated inverted word indexes: firstly, computer-generated indexes are comparable in size to the original text, rather than being a small fraction of the size. Secondly, computer-generated indexes are indexes of words, and not concepts. This is a very important distinction, since it is generally concepts that people wish to find, and not words. This observation implies that using computer-generated indexes requires a slightly different skill than using traditional back-of-book indexes, and also implies that SGML-encoded structural information might be very usefully incorporated into queries: after all, the presence of a word in a title might be a much better indicator of the relevance of a document than the presence of the same word buried somewhere in a footnote.
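The mapping from words to sets of locations described above can be sketched in a few lines. This is a minimal illustration, not a production indexer (real engines also normalise case and punctuation, handle stemming, and compress their postings lists); the document collection shown is hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # A crude tokeniser: lowercase and split on whitespace.
        for pos, word in enumerate(text.lower().split()):
            index[word].add((doc_id, pos))
    return index

docs = {1: "the cat sat on the mat", 2: "the dog sat"}
index = build_inverted_index(docs)
# 'sat' now maps to a posting in each document; 'cat' only to document 1.
```

Note that storing a position alongside each document id is what makes the index comparable in size to the original text: every word occurrence contributes an entry.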
Once such indexes are generated, the search engine is the interface to them by which users can retrieve the documents they want. Search engines come in two flavours: the traditional Boolean search engine allows users to enter logical queries on which terms must, might, or must not appear in the document (or in the document's 'metadata', a set of fields describing its type, author, source, and so on). A simple example of a Boolean query might be 'CAT and (SAT or MAT)', which specifies those documents containing the word 'CAT' and at least one of 'SAT' or 'MAT'. Most Boolean search engines also allow the specification of phrases and/or proximity operators, together with wild-card term endings, permitting queries such as 'SGML near (SEARCH ENGINE*)', which specifies the set of documents containing the phrase 'search engine(s)' appearing near the word 'SGML'.
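The article's example Boolean query maps directly onto set operations over the postings of each term. A minimal sketch, assuming a toy index of word-to-document-id sets (real engines evaluate such queries over compressed postings lists without materialising full sets):

```python
# Toy postings: word -> set of document ids (a simplified inverted index).
index = {
    "cat": {1, 3},
    "sat": {1, 2},
    "mat": {3, 4},
}

def docs_with(word):
    """Set of document ids whose postings contain `word`."""
    return index.get(word.lower(), set())

# 'CAT and (SAT or MAT)' becomes intersection and union:
result = docs_with("CAT") & (docs_with("SAT") | docs_with("MAT"))
```

With these postings, documents 1 and 3 satisfy the query: each contains 'CAT' together with at least one of 'SAT' or 'MAT'. Phrase and proximity operators would additionally consult word positions, which this sketch omits.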
The newer ranked retrieval search engines allow users to enter a set of terms which they would like to see in returned documents. The search engine then returns those documents in its database which contain some of those terms, ordered so that the documents containing the most of the user's terms (and hence likely to be the most relevant) appear at the top of the list, and those containing fewer appear towards the bottom. Most search engines on the World Wide Web (e.g. AltaVista, Infoseek) offer this type of search as a default (although they also offer traditional Boolean versions). Ranked retrieval search engines have the advantage that they are easier to use than traditional Boolean search engines, at the expense of offering users less control over what they see.
