Automating Text Creation Using SGML | ISGMLUG


Automating Text Creation Using SGML

In 1995, the staff of the Humanities Text Initiative (HTI) began work on the American Verse Project, an electronic archive of American poetry. Although a few eighteenth-century works will soon be included, the vast majority of the works are from the nineteenth and early twentieth centuries. The collection is both browsable and searchable. Users who just wish to scan the list of available texts and read a poem can, and many do; a number of the works included are hard to find outside large academic libraries, or are in very poor condition and do not circulate, so their availability on the internet is a great boon to readers and researchers. In some cases, copyright restrictions have meant going through an intermediary server to reach a text, for example using an Irish proxy for works limited to the Irish Republic.

The ability to search the collection is useful for tasks as simple as locating a poem that begins with the line “Thou art not lovelier than lilacs” or as complex as comparing examples of flower imagery in early American poetry in general.
To select texts, a list of almost 400 American poets was compiled, and it was enlarged to include poets of special interest to American literary historians in the Department of English at the University of Michigan.

Working from this list, a survey of publications was made and an electronic bibliography of print and electronic editions was built. Several hundred titles from the Michigan collection were assessed to decide whether they fell within the scope of the project; texts were chosen and prioritized based on their scholarly interest as well as their physical properties (e.g., extent of deterioration and “scanability”).

The volumes chosen for inclusion in the American Verse Project are scanned without being disbound. At present the HTI uses the Xerox 620 scanner with its Xerox Scan Manager software for batch scanning; BSCAN and a Fujitsu 3096 scanner have also been used. TypeReader is the software package used primarily for optical character recognition (OCR); it has worked very well for recognizing the older typefaces in the nineteenth-century material and has an unobtrusive, user-friendly proofing interface. ScanWorx, a UNIX program available from XIS, has been used less often; because it can be trained to recognize non-standard characters, such as the long s, it is useful for the earliest volumes in the collection. Prime OCR, a program developed by Prime Recognition that uses up to five OCR engines to improve OCR accuracy, is being evaluated for possible use. The HTI pays a great deal of attention to accuracy in the digitization process, on the premise that access to reliable electronic texts is important.

After a volume is in electronic form, automated routines are run to supply a first level of SGML markup, identifying clear text structures (lines of poetry, page breaks, paragraphs) and potential scanning errors, such as missing pages. Careful manual markup happens in the next phase, using SoftQuad’s Author/Editor SGML editing program and the TEI’s “TEILite” DTD. The HTI’s encoding staff clear up ambiguous markup introduced by the automatic tagging process and add markup too sophisticated for automated routines.
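The first automated pass described above can be sketched roughly as follows. This is a minimal illustration in Python, not the HTI's actual routines, which the text does not describe. It assumes the OCR output arrives as (page number, text) pairs, that blank lines separate stanzas and paragraphs, and that a block made up only of short lines is verse; the function names and the 60-character threshold are hypothetical. The element names (`pb`, `lg`, `l`, `p`) are those of TEI Lite.

```python
import re
from xml.sax.saxutils import escape

# Assumed heuristic: a block whose lines are all this short is treated as verse.
MAX_VERSE_LINE = 60


def markup_block(block):
    """Tag one blank-line-delimited block as a stanza (<lg>) or a paragraph (<p>)."""
    lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
    if all(len(ln) <= MAX_VERSE_LINE for ln in lines):
        body = "\n".join(f"<l>{escape(ln)}</l>" for ln in lines)
        return f"<lg>\n{body}\n</lg>"
    return f"<p>{escape(' '.join(lines))}</p>"


def first_pass_markup(pages):
    """Build first-level markup from (page_number, ocr_text) pairs in scan order."""
    out = []
    previous = None
    for number, text in pages:
        # A gap in the page sequence may mean a page was skipped during scanning,
        # so leave a comment for the human encoder to check.
        if previous is not None and number != previous + 1:
            out.append(f"<!-- CHECK: page {previous} is followed by page {number} -->")
        out.append(f'<pb n="{number}">')  # SGML-style empty element, no closing tag
        for block in re.split(r"\n\s*\n", text):
            if block.strip():
                out.append(markup_block(block))
        previous = number
    return "\n".join(out)
```

Calling `first_pass_markup([(1, "First line\nSecond line"), (3, "Another line")])` would yield a page break and a stanza for each page, with a `CHECK` comment between them because page 2 is absent. A heuristic like this will misjudge some blocks (a short prose caption looks like verse), which is why ambiguous output is left for the manual markup phase.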

After encoding, a formatted, printed copy of the text is used for proofreading against the original volume and for review of the markup by a senior encoder. All illustrations found in the original volume are scanned and referenced in the encoded text; an image of the title page and verso is also included. Finally, complete bibliographic information, including the size of the electronic file and a local call number, is recorded in the header of the electronic text. A cataloger reviews the header, and a record for the electronic text is created for the online library catalog at Michigan.
