Features of Knowledge Discovery Systems

What Kinds of Knowledge Can Be Discovered?

In principle, two types of knowledge discovery tasks can be distinguished: description and prediction. Through description, a system finds patterns to present to users in an understandable form. Examples of descriptive methods include association rule discovery and clustering. A clustering tool takes a collection of objects, e.g., documents, and creates a grouping: objects that belong to the same cluster are in some respect similar to each other and differ from the objects in the other clusters. An association rule discovery tool reveals co-occurrences, such as

A => B, confidence (0.7), support (12),

which tells us that if A occurs, B also occurs with a probability of 0.7, and that A and B occur together 12 times.
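To make the two figures concrete, the following minimal sketch computes them for a single rule. It assumes that each document is represented as a set of terms and that support is reported as a raw co-occurrence count, as in the rule above; the function name `rule_stats` and the toy data are illustrative assumptions, not part of any particular tool.

```python
def rule_stats(transactions, antecedent, consequent):
    """Return (confidence, support) for the rule antecedent => consequent.

    Support is the number of transactions containing both items;
    confidence is that number divided by the number containing the antecedent.
    """
    with_antecedent = sum(1 for t in transactions if antecedent in t)
    with_both = sum(
        1 for t in transactions if antecedent in t and consequent in t
    )
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    return confidence, with_both


# Hypothetical toy data: each set lists the items occurring in one document.
docs = [{"A", "B"}, {"A", "B", "C"}, {"A"}, {"B", "C"}, {"A", "B"}]

confidence, support = rule_stats(docs, "A", "B")
print(f"A => B, confidence ({confidence:.2f}), support ({support})")
# prints: A => B, confidence (0.75), support (3)
```

Here A occurs in four documents and is accompanied by B in three of them, which gives a confidence of 0.75 and a support of 3.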

With descriptive methods, the understandability of the pattern representations is crucial. The results may be visualised graphically. Moreover, a clustering tool may characterise each cluster with some concept. Similarly, the number of discovered association rules should not be so overwhelming that the pearls of information remain unnoticed.
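As a rough illustration of clustering and of characterising each cluster, the sketch below groups documents by term overlap and labels every cluster with its most frequent terms. The greedy single-pass strategy, the Jaccard similarity measure, the threshold of 0.3, and all names are assumptions chosen for brevity; real clustering tools use more robust methods.

```python
from collections import Counter


def jaccard(a, b):
    """Share of terms two documents have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: a document joins the first cluster
    whose first member is similar enough, otherwise it starts a new one."""
    clusters = []
    for doc in docs:
        for members in clusters:
            if jaccard(doc, members[0]) >= threshold:
                members.append(doc)
                break
        else:  # no existing cluster was similar enough
            clusters.append([doc])
    return clusters


def label(members, n=2):
    """Characterise a cluster by its n most frequent terms (ties sorted A-Z)."""
    counts = Counter(term for doc in members for term in doc)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]


# Hypothetical documents, each reduced to a set of terms.
docs = [
    {"sgml", "dtd", "markup"},
    {"sgml", "markup", "parser"},
    {"price", "seller", "buyer"},
    {"price", "buyer", "product"},
]

clusters = cluster(docs)
for members in clusters:
    print(label(members), len(members))
```

With this toy data the four documents fall into two clusters of two documents each, labelled by the terms their members share.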

Predictive systems find patterns in order to predict the future behaviour of some objects. For instance, we could have a categorisation tool which learns to file documents into predefined folders. Predictive systems need a training phase: a categorisation tool gets a sample set of documents, each labelled with its respective categories. By analysing these examples, the tool learns the patterns it needs to handle new, uncategorised documents. The resulting patterns of a predictive system do not necessarily have to be understandable, as long as the prediction seems to work. Understandability may be desirable, though, so that one can trust the prediction. Usually, the results are evaluated using a test set, i.e., a new set of documents is given to the tool, this time without the categories. As the categories of these documents are known to the evaluators, it is easy to compare the original categories with the categories assigned by the tool.
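The training and evaluation cycle can be sketched as follows. This is a deliberately naive categoriser, assuming one category per document and a simple term-count score; the functions `train` and `predict`, the category names, and the data are all hypothetical and stand in for whatever learning method a real tool would use.

```python
from collections import Counter, defaultdict


def train(labelled_docs):
    """Training phase: count how often each term occurs per category."""
    model = defaultdict(Counter)
    for terms, category in labelled_docs:
        model[category].update(terms)
    return model


def predict(model, terms):
    """Assign the category whose training terms best match the document.
    Categories are visited in sorted order so that ties resolve predictably."""
    return max(sorted(model), key=lambda c: sum(model[c][t] for t in terms))


# Hypothetical training set: (terms, category) pairs.
training_set = [
    (["sgml", "dtd", "markup"], "markup"),
    (["sgml", "parser", "markup"], "markup"),
    (["price", "seller", "buyer"], "sales"),
    (["price", "product", "buyer"], "sales"),
]

# Hypothetical test set: the categories are known to the evaluator
# but are not shown to the categoriser.
test_set = [
    (["dtd", "parser"], "markup"),
    (["seller", "product"], "sales"),
    (["markup", "sgml", "price"], "sales"),
]

model = train(training_set)
predictions = [predict(model, terms) for terms, _ in test_set]
correct = sum(p == truth for p, (_, truth) in zip(predictions, test_set))
print(f"{correct} of {len(test_set)} test documents categorised correctly")
# prints: 2 of 3 test documents categorised correctly
```

The third test document is misfiled because its terms overlap more with the "markup" training documents, which is exactly the kind of error a test set is meant to expose.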

As the concept of knowledge discovery is not fixed, there is no need to exclude related tasks and methods. For instance, many tasks are verification tasks: we already have a hypothesis and seek support for it in the data. Text retrieval can be seen this way: we state a query, and if we receive an answer, we know that documents fulfilling the conditions exist and we can access them. Furthermore, techniques known as information extraction aim to fill predefined templates by identifying and extracting knowledge from free text. For example, a template can have slots for a seller, a buyer, a product, and a price; the extraction tool analyses free text, identifies fragments that describe selling actions, and extracts as much of the knowledge needed to fill the template slots as possible. The filled templates can then be stored in a database.
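The template-filling idea can be illustrated with a small sketch. It assumes that selling actions are phrased as "X sold Y to Z for $N" and uses a single regular expression; the pattern, the function `fill_templates`, and the sample sentences are hypothetical, and a real extraction tool would rely on far richer linguistic analysis.

```python
import re

# Assumed surface pattern for a selling action; the price is optional,
# so a template is still produced when that slot cannot be filled.
SALE = re.compile(
    r"(?P<seller>[A-Z]\w+) sold (?P<product>[\w ]+?) to (?P<buyer>[A-Z]\w+)"
    r"(?: for (?P<price>\$\d+(?:\.\d+)?))?"
)


def fill_templates(text):
    """Return one filled template (a dict of slots) per selling action found.
    Slots that could not be extracted are left as None."""
    return [match.groupdict() for match in SALE.finditer(text)]


# Hypothetical free text.
text = (
    "Yesterday Acme sold a printing press to Bolt for $1200. "
    "Later Corex sold paper to Dana."
)

templates = fill_templates(text)
for template in templates:
    print(template)
```

The first sentence fills all four slots, while the second leaves the price slot empty, reflecting the goal of extracting as much of the template as the text allows. Each resulting dictionary could then be stored as one database row.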

Thus, knowledge discovery utilises methods from several traditional fields, such as statistics, database management, information retrieval, machine learning, and natural language processing. Knowledge discovery is a process that gives a framework for applying various methods. An ideal knowledge discovery system controls the whole life span from defining the discovery task to utilising the results. Systems that support multiple tasks for discovering knowledge from, e.g., SGML documents, and that also support the entire life span of the discovery process, may not exist yet. In contrast, single-task tools for extracting features (such as technical terms and names) from text, as well as clustering and categorisation tools, are available.


Contact Robin Cover with corrections and updates, or to submit contributions to the ISUG online document database.

Copyright © 1998 International SGML Users' Group