Listen up! Is it time for systems to start hearing what attributes are saying?

Seligy MC.


Virtually none of us produces attribute-free content, and all of us rely on at least a few attributes for internal processing and publishing to our own websites. Outside of our own publishing systems, sensitivity to attributes and their values is patchy. But this is at odds with a world in which content exchange and reuse are more important than ever in the shared scholarly publishing infrastructure, much of which is supported or could be supported by attributes. This paper raises the topic of attribute insensitivity among systems in the shared infrastructure, and why it is so important to explore this in light of the changing role of JATS attributes in scholarly publishing. The aim is to start a conversation about whether attribute insensitivity exists, and if so, why it does and how we might address the challenges around it to improve content reuse and exchange.

 

Currently, the JATS (Blue) DTD describes 135 attributes across 263 elements, suggesting an immense potential for information storage in journal article content through attribute values alone. Although many publishers probably do not use all or even most of the attributes available in JATS, virtually none of us produces attribute-free content, and all of us rely on at least a few attributes for internal processing and publishing to our own websites.

But outside the publisher’s own internal systems, external systems’ sensitivity to attributes and their values seems to be tenuous. That is, many attributes and their values may be ignored by systems that exist throughout the scholarly publishing infrastructure, including web host platforms, search engines, archives and repositories, indexers and aggregators, and other systems through which JATS article XML must pass.

It is possible that attribute insensitivity does not exist, and that the question of attribute insensitivity need not be raised at all; however, the role of attributes in JATS XML usage is changing. Whereas attributes have always been useful workhorses in the service of XML creation and rendering, they are now critical to the processes of discovery, exchange, reuse, retrieval, and other aspects of interoperability among the systems of the scholarly publishing infrastructure. Thus, even if the challenge of attribute insensitivity turns out to be insignificant upon investigation, the potential prize of optimised interoperability for all players in the scholarly publishing infrastructure is worth discussion.

In truth, there are currently no comprehensive data that quantify this phenomenon among machines: there is no list of attributes to which systems are generally insensitive, nor is there a catalog of systems that lack sensitivity. What does exist is some anecdotal evidence and repeated references to attribute insensitivity, particularly in recent working-group discussions of XML tagging practices that support content reuse and exchange among systems.

For example, in my own organisation, a publisher of 20+ STEM journals, we mark up <contrib> elements that are authors as <contrib contrib-type="author"> and those that are editors as <contrib contrib-type="editor">. Some years ago, we noticed that Google Scholar was including the editor name in the author line of a search return, as though the editor were an author of the article. The contrib-type attribute was being ignored, thus all <contrib> elements were treated as though they contained the same kind of content. We worked with Google Scholar to resolve the problem, but it took some time before it was completely resolved.
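The difference between an attribute-insensitive and an attribute-sensitive reading of the same markup can be sketched in a few lines of Python. This is only an illustration of the general behaviour described above, not a reconstruction of how Google Scholar or any other system actually processes XML; the sample names and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: one author and one editor, distinguished only by @contrib-type.
SAMPLE = """
<contrib-group>
  <contrib contrib-type="author">
    <name><surname>Tremblay</surname><given-names>A.</given-names></name>
  </contrib>
  <contrib contrib-type="editor">
    <name><surname>Okafor</surname><given-names>B.</given-names></name>
  </contrib>
</contrib-group>
"""


def surnames_ignoring_attributes(root):
    """Attribute-insensitive: every <contrib> is treated as the same kind of content."""
    return [c.findtext("name/surname") for c in root.iter("contrib")]


def surnames_by_type(root, contrib_type):
    """Attribute-sensitive: keep only <contrib> elements with the requested @contrib-type."""
    return [
        c.findtext("name/surname")
        for c in root.iter("contrib")
        if c.get("contrib-type") == contrib_type
    ]


root = ET.fromstring(SAMPLE)
print(surnames_ignoring_attributes(root))  # editor is listed alongside the author
print(surnames_by_type(root, "author"))    # only the author is listed
```

The two functions differ by a single attribute test, which is the whole of the gap between an author line that includes the editor and one that does not.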

As another example, in our JATS4R Authors and Affiliations subgroup, whose remit is recommendations on best practices for tagging authors and affiliations in JATS XML, we made a preliminary recommendation that each <aff> element must contain a single and complete affiliation, because sometimes publishers have affiliation strings such as the following:

“Departments of Biologyᵃ, and Chemistryᵇ, University of Whitehorse”

The subgroup determined that content reuse and exchange by machines is impeded by <aff> elements that contain concatenated partial affiliations such as those in the example above because (1) such strings are difficult to parse and to build citations from; and (2) institutional identifiers cannot be applied to each affiliation unambiguously in such strings.

However, the subgroup strongly felt that unless it provided publishers with a way to display the original string of concatenated partial affiliations, the recommendation would fail to be adopted. And after much deliberation and discussion, the subgroup determined that there was only one legitimate way to do this in JATS, and that was to use attribute values on <aff>, like so:

<aff specific-use="display" content-type="combined-aff">Departments of Biology<label><sup>a</sup></label> and Chemistry<label><sup>b</sup></label>, University of Whitehorse</aff>

Some members of the group, who have experience with content conversion, exchange, and interoperability among systems, then raised the concern that systems would ignore attributes on <aff> and would consider it to simply be an additional <aff> in the metadata, thereby causing even more problems. In the end, the subgroup submitted a proposal to the JATS Standing Committee to add a new element (<display-aff>) to JATS for this purpose.
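The concern can be made concrete with a small Python sketch. It assumes a hypothetical fragment in which two complete affiliations sit beside the display-only string from the example above; the `id` values, the wrapper element, and the function name are illustrative only and are not part of any recommendation.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: two complete affiliations plus the display-only string.
SAMPLE = """
<contrib-group>
  <aff id="aff1">Department of Biology, University of Whitehorse</aff>
  <aff id="aff2">Department of Chemistry, University of Whitehorse</aff>
  <aff specific-use="display" content-type="combined-aff">Departments of Biology<label><sup>a</sup></label> and Chemistry<label><sup>b</sup></label>, University of Whitehorse</aff>
</contrib-group>
"""


def is_display_only(aff):
    """True when the attribute values mark this <aff> as a display string."""
    return (
        aff.get("specific-use") == "display"
        and aff.get("content-type") == "combined-aff"
    )


group = ET.fromstring(SAMPLE)

# A system that ignores attributes sees three affiliations.
all_affs = group.findall("aff")

# A system that honours the attributes sees two, and can render the third as text.
real_affs = [aff for aff in all_affs if not is_display_only(aff)]

print(len(all_affs), len(real_affs))
```

A consumer that never evaluates `is_display_only` would pass the combined string downstream as a third institution, which is exactly the outcome the subgroup members feared.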

It is surely possible and may be necessary to continue to work on or around individual cases of attribute insensitivity as they come up, such as the one my organisation experienced with @contrib-type and Google Scholar. But as noted previously, the role of JATS attributes is changing, and there are new challenges on the horizon whose successful solutions depend on attribute recognition by systems.

Just 10 years ago, when I first started to learn about JATS (then NLM 3.0), we were still very much rooted in a print-centric world, in which presentation was particularly important, and style was perhaps more front-of-mind than substance. In those days, the ability to load full-text article XML to a website and display that content (as opposed to metadata and PDFs only) was a victory for dissemination. The role of JATS attributes then was mostly to serve in the mechanisms behind display and linking online. The scholarly publishing infrastructure was still very young at the time: Crossref was not yet 10 years old, PubMed Central was younger than that, Portico and other archives were just getting started, and Google Scholar was still a baby at 4 years old.

As our industry moved further into the digital age, the scholarly publishing universe began to expand beyond that of the publisher’s own website to include digital catalogs, archives and repositories, databases, DOI and other persistent-identifier-assigning authorities, discovery platforms, and more sophisticated search engines. The focus began to shift from XML dissemination and display to exchange with these outside systems for optimised discovery, exchange, reuse, retrieval, and storage.

Along with this expansion of the scholarly publishing infrastructure (and perhaps because of it), there has been a growing movement towards being open and transparent about the information that supports journal article authority and legitimacy. This information includes, but is not limited to, the provenance, source, authorship, ownership, funding, access, and supporting materials of an article. Whereas it used to be enough to indicate the authors of an article, it is now important to expose the specific contributions that each author has made. Conflict-of-interest statements are common. There is a demand to expose exactly which data supported the results and where a person can find that data if they want to follow up. Funding agencies want to see whether their money has been well spent, and ideally, they want confirmation to come in the form of funding information encoded in machine-readable form; in other words, in the XML.

The need to expose these proofs (or at least, assertions) of authority and legitimacy, along with the need to disambiguate the ever-growing pool of authors and institutions, has led to another major force that is affecting the role of attributes — the proliferation of metadata standards, particularly those that involve identifiers designed to persistently and unambiguously associate the “nouns” in scholarly publishing, such as an article to its author(s) or an author to his or her institution(s). In a world in which there are now multiple assigning bodies for a given type of identifier (e.g., Crossref or Figshare for DOIs; Ringgold or OrgID for institutional IDs), attributes like @pub-id-type and @institution-id-type, which are intended to identify the ID-assigning body, are important for conveying the authority of the ID to humans, and essential for enabling a system to exchange data (such as citation information) with the correct ID-assigning authority.
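As a rough sketch of why the type attribute matters for exchange, the Python below groups the identifiers in a citation by @pub-id-type so that each could be sent to the matching authority. The identifier values are made up, the function name is hypothetical, and the `doi` and `pmid` values are assumed here only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical citation: the identifier values are invented for illustration.
SAMPLE = """
<element-citation>
  <pub-id pub-id-type="doi">10.1234/example.5678</pub-id>
  <pub-id pub-id-type="pmid">00000000</pub-id>
  <pub-id>ABC-123</pub-id>
</element-citation>
"""


def group_ids_by_type(citation):
    """Return {pub-id-type: [values]}; untyped identifiers are filed under None."""
    grouped = {}
    for pub_id in citation.iter("pub-id"):
        value = (pub_id.text or "").strip()
        grouped.setdefault(pub_id.get("pub-id-type"), []).append(value)
    return grouped


ids = group_ids_by_type(ET.fromstring(SAMPLE))
print(ids)
```

Identifiers filed under `None` are the ones a receiving system cannot route with any confidence, because nothing in the markup says which authority assigned them.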

And all of this means that we have many new objects to deal with in article XML, such that machines can find them, understand what to do with them, and pass them along to other systems. Some of these objects are new, with no specific, dedicated elements in JATS. These are objects that either did not exist in an earlier, more print-oriented age (for example, article versions and online publishing events), or did exist but were not explicitly modelled because only recently have transparency/openness and reuse/exchange become so important (e.g., data availability statements, author contributions, clinical trials, and conflict-of-interest statements).

One way of handling these objects is to develop new elements. But this is a sure way to further bloat an already fulsome standard, and why do it when at least some of the new objects have the same structure, more or less, as elements that already exist in JATS?

Take a data availability statement (DAS), an object that is currently being modelled by the Data Availability Statement working group.* The purpose of a DAS is to ensure that readers of an article will be able to locate any data that were generated or analysed to produce the work described in the article. A DAS is not only a text statement, or at least, not always; it can comprise links to external repositories and reference lists in which certain Crossref attributes may be used to distinguish one type of data reference from another.

The DAS working group originally asked the JATS Standing Committee for a new element to contain this object, specifically because the working group felt that the content was unique. However, the application was rejected because it was felt that @sec-type on <sec> was a reasonable solution that works with existing JATS elements and attributes. As it stands now, the DAS is captured in the XML as a <sec> with @sec-type="data-availability-statement".

Aside from the solution’s dependence on systems being sensitive to this particular value of @sec-type, by using @sec-type for this purpose, we are not merely saying that the DAS <sec> is just a particular section of the article, so that we might, perhaps, render the reference list differently here than we do the main reference list. We are using @sec-type to say “This is not just any other <sec>. This is a <sec> where you will find information on all of the sources, citations, and other facts of data availability that support the conclusions of this work.”
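For a receiving system, recognising the DAS then comes down to a test on one attribute value. The Python below is a minimal sketch of that test; the sample back matter, its wording, and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical back matter: only the first <sec> is a data availability statement.
SAMPLE = """
<back>
  <sec sec-type="data-availability-statement">
    <title>Data availability</title>
    <p>Data are available from a public repository.</p>
  </sec>
  <sec>
    <title>Additional information</title>
    <p>Some other back matter.</p>
  </sec>
</back>
"""

DAS_VALUE = "data-availability-statement"


def find_das(root):
    """Return every <sec> whose @sec-type marks it as a data availability statement."""
    return [sec for sec in root.iter("sec") if sec.get("sec-type") == DAS_VALUE]


back = ET.fromstring(SAMPLE)
das_sections = find_das(back)
print([sec.findtext("title") for sec in das_sections])
```

Without the comparison against `DAS_VALUE`, the two sections are structurally indistinguishable, so a system that skips it has no way to find the statement at all.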

Another example of this use of an attribute can be found in JATS4R’s recently released recommendations on conflict-of-interest statements and clinical trials, in which, respectively, @fn-type on <fn>** and @content-type on <related-object>*** are used to identify the kind of objects these actually are.

In other words, we are calling upon an attribute to define substantial and substantive differences among instances of the same element and, in effect, to signal that the nature and meaning of the content marked up with particular attributes or attribute values is exceptional.

This particular and more modern function of an attribute is ideally how we in the JATS4R Authors and Affiliations subgroup would have liked to use @specific-use on <aff> to solve the problem of representing strings of concatenated partial affiliations. That is, we needed an attribute to be able to signal that a particular <aff> is not a real <aff> but instead another kind of content.

Essentially, attributes can be understood as element metadata. And just as JATS article metadata is arguably more comprehensive and important than ever to the process of discovery, exchange, and reuse, attributes are, not coincidentally, also mission-critical, because they underpin the metadata structures within the article XML. The urgency for ensuring that systems are sensitive to attributes (at least certain attributes with certain values) is clear. How to address this issue is less so.

As stated earlier, we do not have data that would inform on which systems ignore attributes or which attributes are being ignored, but even if we could gather such data, it probably would not be all that useful. It is more useful to assume that the phenomenon exists, determine the specific needs for attribute sensitivity given the functions that JATS attributes must perform today in the mechanics of interoperability, and proceed from there.

Of course, it is possible that attribute insensitivity is related to issues with processing; the W3C tutorial on XML elements vs. attributes alludes to potential issues with this and advises that “attributes are handy in HTML, but in XML, you should try to avoid them.”****

It is also possible that the problem of attribute insensitivity may be partly cultural: elements seem more significant than attributes, and there is an existential component to the question, “Does this object need its own element?”

But the more likely and obvious problem is that systems cannot “listen” to what attributes have to say when they are not all speaking the same language. The values and context of many JATS XML attributes in real usage are hugely variable. And some attributes are probably more prone than others to “creative” usage, because they occur on myriad JATS elements, and/or their intended use (per the JATS tag library) is more open than others to wider interpretation. @specific-use comes to mind here, as does @content-type. With such a wide range of possible values and contexts for attributes that may be encountered, it is not a stretch to imagine that systems deal with this variety of input by simply ignoring it.

Assuming that any actual challenges with processing can be gotten around, one reasonable way forward is to define and agree upon the requirements for some key attributes and values so that we can all start producing consistent inputs that systems can rely on and therefore become sensitive to. I say “key” here because it would be impossible to standardise the usage and values for every JATS attribute; as a colleague of mine pointed out, getting publishers to agree to this would be like herding cats, and probably much more difficult than that. What I am suggesting is that we judiciously standardise a few.
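What "consistent inputs" might look like in practice can be sketched as a check of attribute values against an agreed list. Everything here is an assumption made for illustration: no such vocabulary has been agreed, the `AGREED` table contains only the two @contrib-type values mentioned earlier in this paper, and the off-list values in the sample are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical agreed vocabulary: (element, attribute) -> allowed values.
AGREED = {
    ("contrib", "contrib-type"): {"author", "editor"},
}

# Hypothetical input with two "creative" values and one missing attribute.
SAMPLE = """
<contrib-group>
  <contrib contrib-type="author"/>
  <contrib contrib-type="Author"/>
  <contrib contrib-type="ed"/>
  <contrib/>
</contrib-group>
"""


def off_vocabulary(root, agreed=AGREED):
    """List (element, attribute, value) triples whose value is not in the agreed set."""
    findings = []
    for (tag, attr), allowed in agreed.items():
        for element in root.iter(tag):
            value = element.get(attr)
            if value is not None and value not in allowed:
                findings.append((tag, attr, value))
    return findings


findings = off_vocabulary(ET.fromstring(SAMPLE))
for finding in findings:
    print(finding)
```

The check is deliberately limited to the attributes listed in the table, which mirrors the suggestion to standardise a few key attributes rather than all of them.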

But which few? The attributes we choose to address should be those that are important to the mechanisms of exchange, or what can be thought of as the “moving parts” of an object, because within a given article object, not every part of that object is going to be important for this purpose.

What we would gain by standardising the use and values of at least the attributes important to exchange and reuse is huge: reduced effort working out solutions on an individual basis, easier decisions in content XML editing, efficiencies for content vendors and hosts, and of course solving some of the challenges of optimising content for interoperability among the systems that make up our shared infrastructure.

The key word here is “shared”, because neither the requirements nor the solutions for addressing attribute insensitivity can come from JATS itself; it is not the mandate of JATS to pre- or proscribe any particular tagging practice. Where these can and should come from are groups like JATS4R, STS4I, FORCE11, and Metadata2020, JATS4R being the group whose mandate is probably most directly tied to the mission of improving interoperability through recommended JATS XML tagging practices.

The direction of the work (that is, which attributes need to be standardised and sensitised for which article objects in which contexts and in which systems) needs to come from the JATS-using community. Wherever possible, it is important for the community to be aware of and participate in efforts that are being made by various working groups to optimise content for interoperability. We may not come to a unanimous agreement on these things, but we can surely achieve a level of consensus sufficient to solve a lot of problems. In any case, nothing will be improved if we don’t have the conversation.

Acknowledgements

I would like to thank Alexander (“Sasha”) Schwarzman (Optical Society of America) and Kelly McDougall (MIT Press) for reading this paper and providing valuable comments, and Sarah Currie (Canadian Science Publishing) for thoughtful copyediting.

Footnotes

* A joint effort between JATS4R and FORCE11; draft recommendations can be found at https://jats4r.org/data-availability-statements

**
***
****