Taming the Beast: JATS data, non-JATS data, and XML Namespaces

Piez W.

Publication Details

An introduction to basic concepts, gotchas, and rules of thumb for working with namespaces in JATS documents and processing systems, addressing questions including the following: What are namespaces in XML, why do I need them, and why am I so confused? What can I do about this? Can I avoid namespaces altogether? (Yes, sometimes, but mostly no, not in the real world.) If I can't avoid them, how do I work with them, and what practices do I follow so as to understand what's going on in my data, recognize and fix problems when they arise, and prevent them from ever arising? What are the rules of good namespace hygiene?

Introduction

The XML namespaces mechanism is one of the most controversial aspects of W3C-specified XML technologies. Or maybe not, as whether to dislike namespaces is not debated as much as what to dislike about them. Among the criticisms of namespaces are that they are opaque, counter-intuitive, and sometimes baffling, demanding expert knowledge from novices; that they don't provide functionality implied by their design, while they do affect document models in unexpected ways; and that seemingly innocuous namespace-related bugs can bring otherwise well-built systems to their knees.

Namespaces, however, deal with a real problem that requires careful and consistent handling throughout an XML-based system: how to work with more than one vocabulary at once. And since (when used as intended) they offer a robust solution, they are seemingly here to stay. As web standards such as XLink, XHTML, CALS/Oasis/Docbook tables, MathML, and SVG become increasingly important parts of the infrastructure (both with JATS and non-JATS data) – to say nothing of namespace-aware technologies such as XPath, XSLT and XQuery – it becomes increasingly necessary to understand how namespaces really work and how to manage them gracefully. We know the bad news: namespaces can be tricky. The good news is that they don't have to be out of control.

What you need to know about namespaces

A brief review of the design of XML namespaces, in the context of the problem they are trying to solve, provides important context for this discussion. If you are an XML user (frequent or even occasional) what follows will probably be familiar. In particular, if you are a developer of stylesheets or software handling XML, nothing here should be new (although this synopsis might be useful as a cross-check against your understanding). Even if you are a manager who never has to interact directly with XML code, a brief look at the way namespaces work will suggest what accounts for their notoriety within the XML family of technologies.

Why we need namespaces in XML

The following chart shows which of several common XML vocabularies contain elements with certain names.

This is only a sampling of the name collisions arising between these vocabularies, whether they use the same names to identify either the “same things” such as addresses, titles or captions (which despite their semantic commonality are unlikely to be exactly the same, as modeled), or different things altogether (for example set means something very different in each of the tag sets listed). And of course only five tag sets are presented here, while the list of vocabularies that an XML-based data processing system might want to handle is, in principle, open-ended. So naming collisions of this sort are potentially a problem in any system that has to deal with more than one type of XML at once, either because documents contain markup from more than one tag set (such as JATS documents containing MathML), or because the same body of code (such as a single set of stylesheets) has to discriminate between them. In modern systems based on the web architecture, this is increasingly the case.

An expensive solution to this problem would be to try and deal with it ad hoc. A more general solution is to rely on a mechanism that can systematically distinguish not only between the names of XML elements and attributes, but between the families of documents or markup – the namespaces – to which those names belong. In order that XML technologies could maintain a general level of interoperability in the face of this challenge, in 1999 (directly following the publication of the XML Recommendation in 1998), the World Wide Web Consortium published its Recommendation Namespaces in XML.

The solution, its problems, and their solution

Essentially, the namespaces mechanism in XML provides a way of designating, with a prefix, the family of names (the namespace) to which the name of a particular XML element or attribute belongs. Instead of tagging something simply as <set>, we are encouraged to tag it as <math:set> or <mml:set> or <svg:set> as the case may be. The prefix identifies the namespace. In namespace terminology, such a name is called a qualified name or QName. The name following the prefix is called the local part of the name (or local name), retaining its function of designating the type of structure being tagged as a generic identifier within the typing system or ontology provided by the tag set.

Of course, this solution comes with its own problem built in, namely how to allocate prefixes. Will W3C MathML be tied to the prefix mathml or math or mml? More vexingly, once mml is assigned to MathML, can no one else use that prefix? One requirement of namespaces, in order that they be viable, is that they should be global in scope (globally unique), and yet not a source (or no more than absolutely necessary) of stress and contention between rival parties.

Of course, the Internet already had a system, when namespaces were designed, for allocating and assigning names, i.e. the Domain Name system, with its apparatus for registering and sharing domain names. Moreover, the Internet had already standardized a naming system, the Universal Resource Identifier (URI), which was designed to accommodate the Domain Name system as well as (by way of URNs, a related specification) other managed names and identifiers such as ISSNs and DOIs (digital object identifiers).

By identifying namespaces with URIs, XML can refer back to the external context in which it (generally) operates, namely the Internet itself, to manage the problem of who “owns” a name. Instead of qualifying the local name set simply as mml:set or math:set, the qualification could refer to http://www.w3.org/1998/Math/MathML. Similarly, SVG could use http://www.w3.org/2000/svg, TEI could use http://www.tei-c.org/ns/1.0 (or something like it), and so forth.

In order to avoid having to use these long strings as identifiers within tags in XML documents, a binding mechanism is then specified, which allows namespace-aware processors to resolve that a qualified name with a prefix such as math:math or mml:math would be correctly related to its namespace, http://www.w3.org/1998/Math/MathML.

This binding mechanism is namespace declaration syntax, and takes the form in XML of attributes. So, for example, this document contains declarations (the attributes with names starting with the string xmlns, which now becomes a reserved syntax in XML) for MathML and SVG:

Fig. 1A document with namespaces

Element names prefixed mml are assigned to the MathML namespace while those prefixed svg are assigned to the SVG namespace. Elements whose names have no prefixes are not assigned to a namespace at all (since no default namespace is declared).

<article xmlns:mml="http://www.w3.org/1998/Math/MathML"
  xmlns:svg="http://www.w3.org/2000/svg">
  <title>My Summer Equation</title>
  <equation><alternatives>
    <mml:math>1 + 2 = 3</mml:math>
    <svg:svg><!-- SVG image goes here --></svg:svg>
  </alternatives></equation>
... </article>  
Any names of elements and attributes in this document provided with the prefixes mml or svg (with a delimiting colon) will then be referred to the appropriate namespace.

If a different prefix, or none, is more convenient elsewhere, declarations can be provided to enable that usage instead; in fact, namespace declarations can occur on any element in XML, adding to or overriding declarations appearing higher up; and once given on an XML element remain in scope on its descendants.

Finally, it may be that you wish to have a family of elements identified with a namespace, but without the tagging overhead of a prefix. In this case, a namespace can be declared by using a simple attribute xmlns: this is called the default namespace, and indicates a namespace to be assumed when there is no prefix. Keep in mind that like any namespace declaration, its scope will be the element where the declaration appears, with all its contents (attributes and element descendants) until and unless some other namespace declaration intervenes. Consequently, it is possible to have a document with multiple different default namespaces in different places.

Complications of namespaces

This system is not completely inelegant; however, experience has shown that it has a number of weaknesses, most of which amount to the same thing: when namespaces are involved, XML is now considerably less transparent and more complex to work with.

For one thing, many users assume (and not without reason) that when they see URI syntax, some kind of network functionality is implied. Why would the namespace declaration say http://www.w3.org/1998/Math/MathML if no operations were to be performed using that Internet address? It makes XML look more magical to everyone, but the implications for developers are especially pernicious. One might think that a schema for a tag set, or documentation, would be available at that address. But there is neither any guarantee of that, nor any need for it, for namespaces to work as designed. The string is nothing more than a mechanism for disambiguation between tag sets that might potentially be in collision at any time, even though in the event they are usually not.

More generally, however, is simply the way namespaces complicate both the syntax of XML (of most importance to users) and the document model (of most importance to developers). The bindings between namespaces and the prefixes that indicate them (or the lack of a prefix that indicates a namespace, in the case of a default namespace) can be changed by new declarations in the document at any time, as in this example:

Fig. 2A document with default namespace declarations

In this case, namespace declarations on elements within the document (the math and svg elements within alternatives) assign unprefixed names to their respective namespaces. Consequently, different namespaces serve as the default namespace within their own scope, while outside the scope, unprefixed names are still in no namespace. Element names by themselves no longer indicate what namespaces they are assigned to.

Formally, the names of elements in this document are the same as given in Fig. 1, because the namespaces are the same even while the prefixes indicating them are not.

<article
  <title>My Summer Equation</title>
  <equation><alternatives>
    <math xmlns="http://www.w3.org/1998/Math/MathML">1 + 2 = 3</math>
    <svg xmlns="http://www.w3.org/2000/svg"><!-- SVG image goes here --></svg>
  </alternatives></equation>
... </article>  

For practical purposes, this XML is identical to that given in Fig. 1, and will perform identically in namespace-aware processing (for example, giving the same results when run with the same stylesheets) – users and developers just have to become accustomed to the “loose” relation between an XML element's name, and its name as given, even or especially when it has a qualifying prefix; and how determining its actual identity (since the element's type depends on which family of elements it belongs in) means dereferencing it against another bit of data given elsewhere in the document, sometimes quite a distance away.

A disconcerting aspect of this is how “we aren't supposed to care” about prefixes even while we (must) care about namespaces. That is, while it evidently matters what an element or attribute is “named” by virtue of its local name and namespace (in combination, the qualified name), we aren't supposed to care what prefix is used, but only how it provides for expansion into the qualified name, which we actually can't see (very readily or at all). In other words, we do have to care – we have to care enough to learn what to care about and what not to. Partly in consequence of this, namespaces are the cause of some of the most common beginner errors in tagging XML and writing stylesheets or processing for it. Moreover, since these errors typically prevent names from being resolved properly, they have severe effects, such as preventing entire documents from being processed at all.*

To make matters somewhat worse, when it comes to managing all this, it turns out that tools can only give a certain amount of help. Since namespace processing is built deep into XML parsers (or more precisely, into those XML parsers that support namespaces, but not into those that don't), actually working with namespaces and their prefixes as namespaces and prefixes is actually quite difficult. (Again, we aren't supposed to care about prefixes.) Functions for doing this are available in the XPath 2.0 family of XML processing languages (XSLT 2.0 and XQuery), but until you've wrapped your head around namespaces, qualified names, the difference between elements in no namespace and elements in a default namespace (which look the same unless you are mindful of declarations or lack of them), and the extra complications of attribute names in namespaces, these facilities will be difficult to work with.

Finally, if you work with XML in both namespace-aware and non-namespace-aware systems, you have also to cultivate a double vision, since without namespace processing, a prefix is only part of a name, mml:math and math:math and math can in no way be the same, and namespace declarations are only attributes. Especially in documentary systems, this is quite common, since namespaces are not part of XML itself and (most significantly) they are not supported by DTD processors, which do not perform namespace resolution when they validate XML documents. (Schema languages developed after 1999 including RNG and XSD will validate documents using qualified names.)

A specification for “clean namespaces”

The good news is that many of these problems can be mitigated significantly by following just a couple of simple rules when working with XML. We can define a usage profile for namespaces in XML syntax that will make them more transparent and easier to deal with. Given such a specification, plus a little extra energy when necessary to keep our usage in line with it, we will simply prevent many namespace-related problems from occurring in the first place.

Rules for a “clean namespaces” XML document are as follows:

  • All namespace declarations occur at the top of the document. No elements other than the document (root) element present namespace declarations that bind new prefixes, unbind old ones, or change bindings in scope.
  • Namespace aliasing is forbidden: each namespace is given a single prefix (or none, for a single default namespace).
As long as these rules are followed, we know that throughout the document, the same prefix (or lack of one) will indicate the same namespace. This reduces syntactic clutter and makes the document more transparent to users.

Additionally, individual projects or repositories may wish to establish normative sets of bindings between namespaces and prefixes. For example, “MathML elements should always appear with the mml prefix, never math and never unprefixed (as MathML by default)”.

And finally, if there is ever a chance that elements from a tag set without a namespace at all will be needed (such as off-the-shelf JATS through version 3.0, which has no namespace assigned it), unprefixed names must not be assigned to a namespace by default. This is because names not in a namespace may not carry a prefix either, making it impossible to mix them into documents with default namespaces without either declarations or “undeclarations” (declarations that remove a namespace binding**) appearing with them.

The example given in Fig. 1 is conformant to these rules, while that in Fig. 2 is not (because it contains namespace declarations below the top level).

Enforcing “clean namespaces”

These constraints can be enforced in at least two ways:

XPath 2.0

A technology based on XDM (the XML Data Model) such as Schematron (using XPath 2.0) or XSLT 2.0 can check a document, comparing namespaces in scope on its different elements for consistency.

Two such Schematrons are provided as demonstrations. The first (namespace-checkup.sch) simply checks for consistency according to the simple rules of clean namespaces. The second (namespace-strict.sch) checks, additionally, that the namespaces used, and their prefixes, match with those expected.

DTD

Paradoxically, since namespace declarations are merely attributes to a parser that validates to a DTD, a DTD is a useful instrument for validating their usage.

One thing that a DTD will not do is validate whether a namespace declaration, if declared as a #FIXED or defaulted attribute on a given element, is actually given explicitly in the document. (If it can be provided by default it does not have to be given.) If you wish to validate that a namespace declaration is correct and given in a document, declare its attribute like this:

<!ATTLIST article xmlns:mml (http://www.w3.org/1998/Math/MathML) #REQUIRED >
This will not provide the expected namespace as a default value for the attribute; but it will require the attribute to be given, and allow only the expected value as the correct one.

Of course, the element named in the declaration needs to be the correct one for the top-level (document) element if it is to be allowed there. And to ensure clean namespaces, no namespace declaration attributes should be allowed on any other elements.

Normalizing document syntax to clean up namespaces

What if you have documents, they contain arbitrary namespace declarations (which, perhaps, make them invalid to your DTD), and you wish to clean them up?

XSLT 2.0 can do this job, and a demonstration stylesheet (namespace-cleanup.xsl) is provided with this paper. When possible, this stylesheet will promote all namespace declarations to the top, normalizing their prefixes (to a value given in the stylesheet if there is one), and reassign all prefixes on names in the document to their normalized forms. If this is not possible (in cases where the assignments clash or when an assignment to a default namespace cannot be accomplished cleanly because of the presence of elements in no namespace), a warning is issued and output simply copies the input. (A diagnostic mode is also provided.)

Additionally, stylesheets are given to perform related (but simpler) transformations:

  • namespace-bind.xsl – reassigns all elements in no namespace to a namespace given in the stylesheet, with or without a prefix as indicated.
  • namespace-unbind.xsl – does the opposite, removing all elements in a namespace designated (irrespective of prefix) into no namespace
The third possible utility of this family would be a stylesheet that casts all elements from one namespace into another. But this is only a variation on those given (and this requirement should be rare), so it is left as an exercise for the XSLT developer.

Strategic considerations: JATS and namespaces

To date (2011), JATS itself has not been assigned a namespace; nor does the draft 0.4 version of the NISO JATS (equivalent to the NLM 3.1 Journal Article document models) propose one. This is because for the most part, JATS projects and initiatives have not needed a namespace for JATS itself, since our applications – which are, after all, predominantly publishing operations characterized by staged and managed workflows migrating information from one discrete format into another – have not required significant arbitrary mixing between data tagged as JATS and using other XML vocabularies. The exceptions to this, such as XLink attributes, MathML and Oasis tables, even demonstrate why this is: they are all used within JATS, but not the other way around (MathML documents do not need to host JATS data, for example). In other words, within our application domains we have been able to regard JATS as “first among equals”.

Fig. 3. Typical JATS workflow (schematic view).

Fig. 3Typical JATS workflow (schematic view)

As long as document production systems move information through a discrete set of steps, the families of elements in use and hence their namespaces can and should be controlled, while JATS itself can be regarded as the “hosting language” without a namespace.

Over the medium term, this may be a reasonably viable scenario, and short of specific application requirements that indicate otherwise, a JATS-based system may have no reason to alter this status quo. Namespaces can continue to be used to identify adjunct tag sets, while JATS itself can remain without one. This will not and should not prevent individual projects that may wish to assign namespaces to their own profiles of JATS.

Fig. 4. JATS workflow of the future.

Fig. 4JATS workflow of the future

Over the longer term, however, there may be more need to move JATS data in and out of systems in which it is not the primary format or even a native format. As this happens, it will become more necessary for it to have a namespace, so it can “play well with others”, such as XHTML in web-based editors, or metadata formats in mixed-format XML databases.

Over the longer term, we may feel we need a more generalized approach in which all tag sets in play can be identified with a namespace; and there may come a time when projects wish to provide their local versions of JATS with a namespace or even for a namespace to be proposed for the standard.

At that point, an interesting question will come to the fore: what about local variants of JATS? One problem this paper has not considered is namespace proliferation: what happens when two different tag sets that are substantially the same use different namespaces in the same system (or a tag set appears to be two different tag sets because it appears in two different namespaces). In order to avoid the headaches that come with this (since changing a namespace has the effect of changing all the names in the namespace), projects that deploy subsets of JATS may want to use the same namespace as JATS as a whole, in order for both to indicate their semantic fidelity to JATS within the scope of their systems, and to continue to take advantage of tools that handle general JATS. Projects which extend JATS, however, may wish to define their own namespaces – or to define their extensions in private namespaces – in order to protect the integrity of the community standard as a whole.

The strategic question remaining is whether JATS projects will then find DTD technology as serviceable as they have to this point. It may be that the requirement to handle namespaces gracefully will bring JATS past the tipping point and induce projects to migrate their schemas to XSD or RelaxNG. On the other hand, it may be that even in the brave new world of “all namespaces, all the time” the need for “clean namespaces” and the usefulness of DTDs for enforcing them (in addition to all the other features of DTDs we rely on) will persist.

References

1.
Berners-Lee, T., R. Fielding and L. Masinter. Uniform Resource Identifier (URI): Generic Syntax. RFC 3986. Network Working Group, Internet Engineering Task Force (IETF). http://tools​.ietf.org/html/rfc3986.
2.
Bray, Tim, Dave Hollander, Andrew Layman, and Richard Tobin. Namespaces in XML 1.1 (Second Edition). W3C Recommendation 16 August 2006 . http://www​.w3.org/TR/xml-names11/
3.
Bray, Tim, Dave Hollander, Andrew Layman, Richard Tobin, and Henry S. Thompson. Namespaces in XML 1.0 (Third Edition). W3C Recommendation 8 December 2009. http://www​.w3.org/TR/xml-names/
4.
Kay, Michael. XSLT 2.0 and XPath 2.0 Programmer's Reference. New York: John Wiley and Sons, 2008.
5.
Thomson, Henry. What's a URI and why does it matter? http://www​.ltg.ed.ac.uk/~ht/WhatAreURIs/.
6.

Demonstration files for download

XSLT and Schematron files described in this paper are available for download as namespace-utilities.zip.

Footnotes

*

Note that the special set of problems related to dealing with namespaces in transformations and queries is not a topic of this paper. For better or worse (and this is something that managers of XML projects need to keep in mind), it is more or less necessary that at least one member of any XML development team be comfortable with namespaces and with avoiding and fixing namespace-related problems.

**

The capability to remove a binding to a namespace from unprefixed names by “undeclaring” the default namespace deals with the problem of mixing elements not in a namespace with elements in a default namespace. For example, consider this confusing (but sadly common) example:

<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1 xmlns="">My Summer Vacation</h1>
    ...
  </body>
</html>
Here, the html and body elements are in the XHTML namespace, http://www.w3.org/1999/xhtml by virtue of a default namespace declaration binding unprefixed names (on the xmlns attribute). But the h1 is not in a namespace – not because of any declaration as such, but because its own xmlns attribute has “undeclared” the default namespace for it and its descendants.