Markup Languages for Complex Documents

About MLCD

Virtually all electronic documents contain markup in some form or other. Most current standards for generic markup are based on SGML (Standard Generalized Markup Language). For example, the SGML-based markup language HTML (Hypertext Markup Language) is an essential part of the technology underlying the World Wide Web. XML (Extensible Markup Language), a subset of SGML, plays a crucial role as an exchange format in sectors ranging from industry, business and administration to education and academic research.

XML documents can be regarded as linearizations of trees using a notation similar to labelled bracketing, and the trees so represented can be regarded as parse trees or abstract data structures conforming to the grammar defined in the documents' DTD or schema. These close ties between linearization, data structure, and grammar have been essential to the success of XML.
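The correspondence between linearization and tree can be illustrated with a minimal Python sketch. The sample document and the helper name `to_brackets` are hypothetical and chosen only for illustration; they are not part of MLCD. The sketch parses a small XML string with the standard library and prints the resulting tree in labelled-bracket notation.

```python
import xml.etree.ElementTree as ET


def to_brackets(elem):
    """Render an element and its descendants in labelled-bracket notation."""
    parts = [elem.tag]
    # Text directly inside the element, before its first child.
    if elem.text and elem.text.strip():
        parts.append(elem.text.strip())
    for child in elem:
        parts.append(to_brackets(child))
        # Text that follows a child but still belongs to this element.
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return "[" + " ".join(parts) + "]"


# Hypothetical sample document: a sentence with two constituents.
doc = "<s><np>the cat</np><vp>sleeps</vp></s>"
bracketed = to_brackets(ET.fromstring(doc))
print(bracketed)  # [s [np the cat] [vp sleeps]]
```

Each start-tag and end-tag pair plays the role of one pair of labelled brackets, which is why a well-formed XML document always describes exactly one tree.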

Notwithstanding XML's strengths, crucial problems invite further research. Most notably, since XML is based on context-free or constituent structure grammars, there are difficulties representing overlapping, fragmented or disordered elements, and multiple co-existing alternative structures.

Such complex structures are ubiquitous in traditional as well as in "digitally born" documents. For example, pages, columns and lines tend to overlap with chapters, sections and sentences in almost every printed or hand-written document. Sentences tend to overlap with direct speech in prose, with verse lines in poetry, and with speeches and various other phenomena in drama. Any attempt to record the world's written cultural heritage without taking such structures into account runs the risk of misrepresenting the original sources. Complex structures are also frequent in databases, computer games, computer-based literature and linguistically annotated corpora.
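The overlap problem can be made concrete with a short Python sketch. The element names (`poem`, `line`, `s`) and the sample text are assumptions made for illustration only. A sentence nested inside a verse line parses without difficulty, whereas a sentence that runs across a line boundary cannot be written with ordinary start-tags and end-tags, and a standard XML parser rejects it.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the string parses as well-formed XML, else False."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The sentence lies entirely inside one verse line: a proper tree.
nested = "<poem><line><s>A short sentence.</s></line></poem>"

# The sentence starts in one verse line and ends in the next, so the
# <s> and <line> elements overlap instead of nesting.
overlapping = (
    "<poem><line><s>A sentence starts here</line>"
    "<line>and ends here.</s></line></poem>"
)

print(is_well_formed(nested))
print(is_well_formed(overlapping))
```

The second string is rejected because the first `</line>` end-tag arrives while `<s>` is still open, which is exactly the kind of structure a single tree cannot express.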

MLCD aims to provide a notation, a data structure and a constraint language which, as far as possible, are compatible with and retain the strengths of XML-based markup, yet solve the problems of representing and processing complex structures. The project also contributes to attempts to develop a formal semantics for markup systems. Solutions to the problems addressed by the project have to be practically feasible as well as theoretically sound. Therefore, the project develops experimental and prototype application software as proof of concept.

Work on MLCD started in 2001. The project is a collaboration between researchers currently at the universities of Bergen (Norway) and Montréal (Canada) and at Black Mesa Technologies (New Mexico, USA). The project is led by Claus Huitfeldt (University of Bergen), and its web resources are hosted by Black Mesa Technologies. It has received funding from the University of Bergen, Uni Digital, and the Meltzer Foundation.

MLCD is expected to have practical as well as theoretical significance. If the project is successful, future document technology may respond better to the needs of scholars and other document creators and their audiences than is the case today. The cross-disciplinary nature of the project may also lay the foundation for new approaches to document theory which have better chances of reconciling markup technology with adequate theories of document structure.

Last updated 09.11.2011 by Claus Huitfeldt
