Markup Languages for Complex Documents

Research

The principal goal of MLCD is to contribute to a theoretical foundation for markup, representation and processing of overlapping, fragmented or disordered elements, and multiple co-existing complete or partial alternative structures. For the purposes of our work, we call such structures complex structures, and we call documents containing such structures complex documents.

One major challenge to the project is to retain the tight integration of linearization, data structure and constraint language found in XML-based systems. Another important challenge is to provide solutions which are not only theoretically sound, but also practically feasible.

Thus, the more specific sub-goals of the project are to develop: 1) a suitable notation for markup of complex documents; 2) a data structure which can serve as the basis for a computational model for such documents; 3) a formal grammar which can provide the basis for a constraint language capable of expressing and exerting context-sensitive constraints and constraints on complex structures; 4) a formal semantics for the interpretation of complex documents; 5) prototype applications as proof of concept of the results of the work above.

Work on MLCD started in 2001, and the project has since obtained preliminary or close to final results on most of its research topics. Below, we briefly describe these results and provide the most immediately relevant references to the current state of research. (For a more comprehensive list, see Publications.)

The MLCD Overlap Corpus

The MLCD Overlap Corpus (MOC) is intended to make it easier to compare different methods of handling overlap, not just on theoretical or abstract grounds, but in terms of concrete examples from real and constructed texts. The essential idea of the corpus is to make available a single body of material, ranging from compact examples to full texts of novel or five-act-play length, tagged for the same information, using a variety of overlap notations. Currently a proof of concept is under development. See: The MLCD Overlap Corpus (Proof of Concept)

Data Structure: Goddag

One of the early achievements of MLCD was the specification of the Goddag (Generalized Ordered-Descendant Directed Acyclic Graph) structure. It was originally based on the realization that overlap (which was the first kind of complex structure we considered) can be represented simply as multiple parentage.

A Goddag is a directed acyclic graph in which each node is either a leaf node, labeled with a character string, or a nonterminal node, labeled with a generic identifier. Directed arcs connect nonterminal nodes with each other and with leaf nodes. No node dominates another node both directly and indirectly, but any node may be dominated by any number of other nodes.
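
The definition above can be illustrated with a minimal sketch in Python. It is not the project's implementation: the class names Leaf and Nonterminal, the helper functions, and the sample text are assumptions made for illustration only. The check covers just the rule that no node may dominate another node both directly and indirectly.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(eq=False)  # eq=False: nodes are compared and hashed by identity
class Leaf:
    """Leaf node, labeled with a character string."""
    text: str


@dataclass(eq=False)
class Nonterminal:
    """Nonterminal node, labeled with a generic identifier (gi)."""
    gi: str
    children: List[Union["Nonterminal", Leaf]] = field(default_factory=list)


def reachable(start):
    """Return every node reachable from `start` by one or more arcs."""
    seen = set()
    stack = list(getattr(start, "children", []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(getattr(node, "children", []))
    return seen


def dominates_directly_and_indirectly(roots):
    """True if some node reaches one of its own children by a longer path too."""
    nodes = set(roots)
    for root in roots:
        nodes |= reachable(root)
    for node in nodes:
        kids = getattr(node, "children", [])
        for child in kids:
            if any(child in reachable(other) for other in kids if other is not child):
                return True
    return False


# Two overlapping elements share the middle leaf, which therefore has two parents.
a, b, c = Leaf("Scorn not "), Leaf("the sonnet; "), Leaf("critic")
line = Nonterminal("line", [a, b])
phrase = Nonterminal("phrase", [b, c])
text = Nonterminal("text", [line, phrase])
```

Here the leaf b is dominated by both line and phrase, which is how the sketch expresses overlap as multiple parentage. Adding b as a direct child of text as well would break the rule, because text already dominates b indirectly.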

We distinguish a restricted and a generalized form of Goddag. Conventional XML trees satisfy the requirements of generalized as well as restricted Goddags. In addition, restricted Goddags lend themselves to representation of documents with concurrent hierarchies or arbitrarily overlapping elements, whereas generalized Goddags also allow for a convenient representation of documents with multiple roots, with alternate orderings, and discontinuous or fragmented elements.

The similarities between trees and Goddags allow similar methods of interpreting the meaning of markup: properties can be inherited from a parent, overridden by a descendant, and so on. There is some chance for conflict and confusion, since with multiple parents, it is possible that different parents have different and incompatible properties.
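
The kind of conflict described above can be sketched as follows. This is only an illustration under stated assumptions: the Element class, the props and parents fields, the "lang" property and the resolution rule (a node's own value overrides anything inherited) are hypothetical and are not taken from the MLCD specifications.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass(eq=False)
class Element:
    """Hypothetical node that stores links to its parents and its own properties."""
    gi: str
    props: Dict[str, str] = field(default_factory=dict)
    parents: List["Element"] = field(default_factory=list)


def inherited_values(node: Element, name: str) -> Set[str]:
    """Collect the candidate values of property `name` for `node`.

    A value set on the node itself overrides its ancestors. Otherwise the
    candidates are gathered from every parent, so more than one value in
    the result signals a conflict between parents.
    """
    if name in node.props:
        return {node.props[name]}
    values: Set[str] = set()
    for parent in node.parents:
        values |= inherited_values(parent, name)
    return values


# A word dominated by two overlapping elements with incompatible properties.
quote = Element("quote", {"lang": "de"})
line = Element("line", {"lang": "en"})
word = Element("word", parents=[quote, line])

conflict = inherited_values(word, "lang")   # two candidates: a conflict
word.props["lang"] = "la"
resolved = inherited_values(word, "lang")   # the node's own value wins
```

In a tree the first call could never return more than one value. With multiple parents it can, and an application has to decide how to handle that case.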

The Goddag proposal was originally published in Michael Sperberg-McQueen and Claus Huitfeldt: "Goddag: A Data Structure for Overlapping Hierarchies"; Lecture Notes in Computer Science, vol. 2003/2004, Springer, pp. 139-160.

Recent work has revealed a weakness in the current specification of Goddag, which leads to problems with the representation of discontinuous elements. A discussion of problems and suggested modifications can be found in Claus Huitfeldt and Michael Sperberg-McQueen: "Representation and processing of Goddag structures: implementation strategies and progress report."

Goddag has already received some attention in the markup research and development community, as can readily be seen by a Web search for the keywords "Goddag" and "markup".

Notation: TexMecs

It is always possible to construct Goddags from XML documents. In the ordinary case the result will be a tree, and trees are a special case of Goddags. It is also possible to construct Goddags from the various mechanisms customarily used to represent complex structures in XML. However, these mechanisms depend on application-specific processing and vocabularies, and tend to be cumbersome.

Thus, one may either try to establish standards for the representation of complex structures in XML, or provide an alternative notation which lends itself to a more straightforward representation of complex structures. We believe that these options are complementary, and that both should be pursued.

We have therefore defined TexMecs, an alternative notation to XML. TexMecs is partly based on MECS (Multi-Element Code System), a markup language developed by Claus Huitfeldt for the Wittgenstein Archives at the University of Bergen in the 1990s. (Hence its name: TexMecs stands for "Trivially extended MECS".)

The basic principles of its design are:

  • For documents that exhibit a straightforward hierarchical structure, TexMecs is isomorphic to XML.
  • Every TexMecs document is translatable into a Goddag structure without application-specific processing.
  • Every Goddag structure is representable as a TexMecs document.

A particular advantage of TexMecs is a simple and straightforward notation for what we have called complex structures. We also plan to design algorithms for translating widely recognized XML conventions for representation of complex structures into Goddags, and vice versa.

For details, see Michael Sperberg-McQueen and Claus Huitfeldt: "TexMecs: An experimental markup meta-language for complex documents"

TexMecs has already received some attention in the markup research and development community, as can readily be seen by a Web search for the keywords "TexMecs" and "markup".

Constraint Language: Duck-Rabbit

One of the most important remaining tasks for the MLCD project is the identification of a constraint mechanism which relates to Goddags as naturally as constituent structure grammars relate to trees, which constitute a subset of Goddags. Constraint languages for XML documents exist in the form of XML DTDs, XML Schema, RELAX NG and others. These methods invariably define context-free grammars allowing the representation of XML documents in the form of parse trees. However, since Goddag structures are directed acyclic graphs more general than trees, they cannot easily be identified with parse trees based on context-free grammars.

Several possible ways forward exist and remain to be explored. One approach starts from the observation that Goddags can be projected into sets of tangled trees. One way to achieve at least partial validation of complex documents, therefore, is to write grammars for each such tree and validate each projection against the appropriate grammar. Each such grammar will treat some start- and end-tags in the usual way as brackets surrounding structural units, but treat other start- and end-tags as if they were empty elements. This allows some measure of control over the interaction and overlapping of specific elements in different grammars; whether it provides enough control remains to be explored.
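
The projection step described above can be sketched in Python. The sketch makes several assumptions: the document is given as a flat list of hypothetical (kind, value) events rather than in any real notation, the function name project and the marker labels are invented for illustration, and the check only establishes that the chosen bracket tags nest properly. Validating each resulting tree against an actual grammar would be a further step that is not shown.

```python
def project(events, brackets):
    """Build the tree seen by one grammar from a flat event list.

    Tags whose generic identifier is in `brackets` are treated in the usual
    way as brackets around structural units. All other tags are kept as
    childless marker nodes, as if they were empty elements.
    """
    root = ("#root", [])
    stack = [root]
    for kind, value in events:
        if kind == "text":
            stack[-1][1].append(value)
        elif value not in brackets:
            # Tag belongs to another hierarchy: keep it as an empty element.
            stack[-1][1].append((kind + ":" + value, []))
        elif kind == "start":
            node = (value, [])
            stack[-1][1].append(node)
            stack.append(node)
        else:  # end-tag of a bracket element
            if len(stack) == 1 or stack[-1][0] != value:
                raise ValueError("end-tag for %r does not nest in this projection" % value)
            stack.pop()
    if len(stack) != 1:
        raise ValueError("unclosed element %r" % stack[-1][0])
    return root


# A sentence (s) that starts in one verse line and ends in the next.
events = [
    ("start", "line"), ("start", "s"), ("text", "Scorn not "), ("end", "line"),
    ("start", "line"), ("text", "the sonnet"), ("end", "s"), ("end", "line"),
]

line_view = project(events, {"line"})   # s tags become empty elements
s_view = project(events, {"s"})         # line tags become empty elements
```

Each single-hierarchy projection yields a well-formed tree, whereas project(events, {"line", "s"}) raises ValueError because the two elements overlap. Which tags a given grammar treats as brackets is the point at which such a scheme would control the interaction between hierarchies.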

For the most recent proposal, see Michael Sperberg-McQueen: "Rabbit/duck grammars: a validation method for overlapping structures."

Last updated 28.11.2006 by Claus Huitfeldt