NCBI Bookshelf. A service of the National Library of Medicine, National Institutes of Health.

Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2015.

The Public Knowledge Project XML Publishing Service and meTypeset: Don't call it "Yet Another Word-to-JATS Conversion Kit"

Authors


Affiliations

1 Simon Fraser University
2 Stanford University

The Public Knowledge Project's Open Journal Systems, which provides a robust workflow and interface for editing, publishing, and indexing scholarly journal content, has always been somewhat agnostic as to the format of the content itself. Most authors, unsurprisingly, use Microsoft Word for its familiarity and ubiquity during the actual writing process, and getting content from Word into something that can be easily consumed on the web has always been something of a mystery among publishers. Those who have the means will typically outsource or partially automate the markup of an article in XML or an XML-like format, and this XML can then be transformed into HTML on the fly for viewing (OJS does include some stylesheets for this purpose); others, including many smaller open source journals, convert directly to the printable PDF format as a path of least effort, leaving them with something that looks nice on a printed page but potentially not so nice, and not very flexible, otherwise. For the past two years, PKP has been working on a web service (which integrates into OJS' workflow via a provided plugin) to fully automate the conversion of Word/compatible documents into the National Library of Medicine's standard JATS XML format (the same format which underlies PubMed Central), using fuzzy parsing and machine learning heuristics, and to transform documents from there into matching human-readable HTML and PDF. This development, broadly speaking, has two parts: one, the core OpenOffice/Word 2007 "docx" XML to JATS XML conversion engine, called "meTypeset", developed jointly with Martin Eve of the Open Library of the Humanities; and two, the web service pipeline, which unites meTypeset and other open-source libraries (including LibreOffice, ParsCit, ExifTool, and others). This service provides citation parsing, XMP metadata, and other industry-standard features. Improvement of various parsing features and automated evaluation is ongoing.

Introduction

Having developed Open Journal Systems (OJS) as open source journal management and publishing software in 2001, the Public Knowledge Project (PKP) has built up a community of users among peer-reviewed open access journals that tend to run on very low budgets, charging neither readers nor authors for access. A 2009 study found that such journals were spending less than $200 USD per article on publishing expenses, with close to half of the 7,000 titles employing OJS coming from the Global South[1]. Our promise with OJS has always been to provide professional-level support for online peer-reviewed scholarly publishing, which has led us to ensure that OJS is able to handle such standard features as submission checklists, double-blind reviews, reviewer ratings, journal statistics, and proper Google Scholar indexing, among other things.

However, the increasing reliance among scholarly publishers on off-shore XML mark-up of articles, which enables ready deposit in PubMed Central for the life sciences and, more generally, well-formatted publishing outputs in multiple formats, has posed a particular challenge for our community. To date, OJS has had no special functionality for working with document markup; while it provides a very robust workflow for authors, reviewers, and editors to take the requisite turns in producing scholarly material, it is effectively blind to the materials’ form and content. Our approach has been to fashion a semi-automated XML mark-up process using a stack of open source software tools that will do most of the basic work in tagging the typical Word manuscript, with the finishing touches provided by hand. We see such a tool providing OJS users with an opportunity to further raise the quality of their publications in a variety of contexts, without adding unreasonably to their expense or labour.

In providing a description of this new publishing service, we begin with the key building block of our open source stack, which is meTypeset. MeTypeset was originally developed, and is still maintained, by Martin Eve of the Open Library of the Humanities; PKP provided additional development resources to meTypeset during the development of this stack. In short, meTypeset is a fork of OxGarage with a substantial amount of Python logic added to provide heuristic-based fuzzy parsing for going from an inconsistently formatted Word/compatible document to well-formed JATS. Like OxGarage, it uses TEI as an intermediary format, but because the parsing logic has been designed in a linear way to go from Word to TEI to JATS, it is not currently capable of outputting TEI at the end of the process; if this feature were desired, it would likely be easiest to do a direct XSL conversion from the final JATS XML back to TEI. A lengthier explanation of meTypeset’s operation is as follows:

meTypeset In Depth

The first thing meTypeset does is a "straight" conversion from .docx XML to TEI XML, using some pre-specified XSL mappings which we forked from http://wiki.tei-c.org/index.php/OxGarage. There's no logic here, and no attempt to coerce obviously-weird document structure; just mapping from Word's image tag to TEI's image tag, and so forth. We work in TEI from this point forward, before converting to NLM. After that, one of the first and most complex classes called by meTypeset to do the work of actually parsing and reformatting is the Size Classifier. In effect, this recurses through the document a few times, identifies where Word's internal headings have already been applied, and corroborates those with the use of bold, italics, underlining, and font face and size changes to work out the nesting of the various headings and sub-headings in the document. At the end of this stage, the expectation is that we'd have the Introduction, Methods, Results, and (key to this discussion) References sections as top-level headings (for example; none of these headings are semantically hardcoded) and any subsections beneath them.
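
To make that concrete, here is a minimal sketch (in Python, as meTypeset itself is written) of the kind of size-based classification just described; the data structures, function names, and thresholds are illustrative rather than meTypeset's actual API:

# Illustrative sketch of size-based heading classification (not meTypeset's
# actual SizeClassifier): rank the distinct font sizes seen on candidate
# heading paragraphs and map larger sizes to shallower nesting depths.

def classify_headings(paragraphs):
    """paragraphs: list of dicts like
    {'text': 'Methods', 'size': 16, 'bold': True, 'is_word_heading': False}."""
    # Use the smallest font size seen as the body-text baseline.
    body_size = min(p['size'] for p in paragraphs)
    candidates = [p for p in paragraphs
                  if p['is_word_heading'] or (p['bold'] and p['size'] > body_size)]

    # Larger font sizes map to shallower (lower-numbered) heading levels.
    sizes = sorted({p['size'] for p in candidates}, reverse=True)
    level_for_size = {size: level for level, size in enumerate(sizes, start=1)}

    return [(p['text'], level_for_size[p['size']]) for p in candidates]

doc = [
    {'text': 'Introduction', 'size': 16, 'bold': True, 'is_word_heading': False},
    {'text': 'Some body text ...', 'size': 11, 'bold': False, 'is_word_heading': False},
    {'text': 'Participants', 'size': 13, 'bold': True, 'is_word_heading': False},
    {'text': 'References', 'size': 16, 'bold': True, 'is_word_heading': False},
]
print(classify_headings(doc))
# [('Introduction', 1), ('Participants', 2), ('References', 1)]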

The next hook that is fired is the Bibliography Add-ins class, which is a bit simpler. What it does is scan for inline references throughout the text (i.e. not the reference section at the end of the document) that have had additional XML inserted into them by the Zotero or Mendeley Word plugins that an author might have been using while writing their document. This is important for two reasons: one, this XML isn't part of the .docx spec (since it's added by Word plugins), so if we don't go looking for it, it's just additional cruft that gets printed in there at the end (once the document is no longer in Word format); and two, it's extremely useful if it happens to be there, as it means that the user has effectively already done the work of tagging the references for us, and we can use this data in the following steps.

Next, there's the Bibliography Classifier class, which contains some miscellaneous functionality to help with finding the References section itself. As mentioned, the Size Classifier does the work of actually figuring out the document structure, which likely includes a "References" or "Works Cited" section near the end. The Bibliography Classifier will first check against a list of keyphrases in a number of different languages, stored within meTypeset, to see if any parsed document section has a synonymous heading (e.g. “Works Cited”). Note that this is currently the only place where we use any sort of linguistic cues, due to the difficulty of maintaining multilingual support with this approach. If it gets a match, it stores a flag in the document to indicate we have a likely reference section; this is revisited in a later step. The Bibliography Classifier also contains the beginnings of some hooks to support guided manual cleanup of parser errors. Right now, this manual cleanup uses interactive.py when meTypeset is called from the terminal, and it is disabled in our web service because that's not much of an interface, but we're planning on exposing these hooks for anyone wishing to build a WYSIWYG cleanup interface atop meTypeset.
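
A simple sketch of the keyphrase matching performed by the Bibliography Classifier might look like the following; the keyphrase list shown here is a small illustrative subset, not the list actually shipped with meTypeset:

# Illustrative sketch (not meTypeset's actual BibliographyClassifier) of
# flagging a likely reference section by matching heading text against a
# small multilingual keyphrase list.
REFERENCE_HEADINGS = {
    'references', 'works cited', 'bibliography',   # English
    'bibliographie', 'références',                  # French
    'literaturverzeichnis',                         # German
    'referencias', 'bibliografía',                  # Spanish
}

def flag_reference_sections(sections):
    """sections: (heading_text, section_node) pairs from the parsed TEI tree."""
    flagged = []
    for heading, node in sections:
        if heading.strip().rstrip(':').lower() in REFERENCE_HEADINGS:
            # meTypeset stores a flag in the document for a later pass;
            # here we simply collect the matching section nodes.
            flagged.append(node)
    return flagged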

After this comes the very important NLM Manipulate class, where the TEI work-in-progress document is turned into NLM; a significant amount of cleanup of the article body text formatting takes place here to ensure we're getting valid NLM, and some miscellaneous fuzzy parsing functions that aren't reference-specific are also fired here. However, roughly half of the functions in this class do handle references specifically. This is where we take into account the stored flag for any matched "References" keyphrases. We also check for a series of very short "paragraphs" (or quasi-paragraphs that might have been formatted as a list, or as a series of line breaks) that all share the same indentation structure, and/or contain a single year number, and/or follow the same numbering structure, working backward from the end of the document to avoid mis-tagging actual body-text lists. Finally, we tag and number them as references, with the eventual goal of getting something like:

<ref-list>
     <ref id="1">Some text</ref>
     <ref id="2">Some text</ref>
</ref-list>
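
The following sketch illustrates the working-backward heuristic described above; the regular expressions, word-count threshold, and function names are illustrative and do not reproduce the actual NLM Manipulate implementation:

# Illustrative sketch of the "work backward from the end" heuristic: collect
# a run of short trailing paragraphs that each look like a citation (contain
# a year and/or a leading number) and hand them on to be wrapped in <ref-list>.
import re

YEAR = re.compile(r'\b(19|20)\d{2}\b')
LEADING_NUMBER = re.compile(r'^\s*\[?\d+[\].)]?\s')

def looks_like_reference(text, max_words=60):
    # A citation-like paragraph is short and contains a year and/or starts
    # with a number such as "1." or "[1]".
    words = text.split()
    if not words or len(words) > max_words:
        return False
    return bool(YEAR.search(text) or LEADING_NUMBER.match(text))

def collect_trailing_references(paragraphs):
    """paragraphs: body paragraphs in document order. Returns the trailing
    run of citation-like paragraphs, restored to document order."""
    refs = []
    for text in reversed(paragraphs):
        if looks_like_reference(text):
            refs.append(text)
        else:
            # Stop at the first paragraph that no longer looks like a
            # citation, so body-text lists are not swept up by mistake.
            break
    return list(reversed(refs))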

There's then a bit more cleanup that takes place to avoid having a redundant section heading that says "References" due to the way our keyphrase functionality is called, as well as to eliminate any weird whitespace or numbering cruft left over from our tag insertion. There's also a check to ensure we haven't inadvertently merged multiple references together. After this, inline references should be the only thing missing; at this point they are tagged only if the user had been utilizing the Zotero or Mendeley Bibliography Add-ins described above. The Reference Linker module thus inserts <xref> tags throughout the document, using (in order of descending preference) those bibliography add-ins, followed by simple string matching of bracketed references (e.g. [1]), including some logic to catch bracketed references that have been input as [6-9] or [5,8], and, finally, some regular expressions to match strings from a bibliography entry (author names and year numbers) against parenthetical references in the body text. While it is currently unused for reasons of scale, we've also developed a Bibliography Database module that will store parsed reference data for later use. Currently it uses Python's shelve module, but it could be easily adapted to a MySQL database.
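
A rough sketch of the bracketed-citation handling in the Reference Linker might look like this; the regular expression and the <xref> output format are simplified for illustration:

# Illustrative sketch of expanding bracketed citations such as [1], [5,8],
# and [6-9] into individual <xref> elements pointing at numbered <ref> ids.
import re

BRACKETED = re.compile(r'\[(\d+(?:\s*[-,]\s*\d+)*)\]')

def expand_bracket(group):
    """Turn '1', '5,8', or '6-9' into a list of reference numbers."""
    numbers = []
    for part in group.split(','):
        part = part.strip()
        if '-' in part:
            start, end = (int(x) for x in part.split('-'))
            numbers.extend(range(start, end + 1))
        else:
            numbers.append(int(part))
    return numbers

def link_references(text):
    def replace(match):
        xrefs = ['<xref ref-type="bibr" rid="{0}">{0}</xref>'.format(n)
                 for n in expand_bracket(match.group(1))]
        return '[' + ', '.join(xrefs) + ']'
    return BRACKETED.sub(replace, text)

print(link_references('as shown previously [6-9] and elsewhere [5,8]'))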

After this, meTypeset is done with the document and it is handed off to the next component of the PKP XML Parsing Service stack.

The PKP XML Parsing Service

The PKP XML Parsing Service (XMLPS) wraps meTypeset and a handful of other related tools to provide a stable API and web service endpoint for straightforward conversion of Word/compatible documents to JATS. It is documented and hosted at https://github.com/pkp/xmlps and there is currently a public instance available at http://pkp-udev.lib.sfu.ca/.

In brief, and in order of operation, the tools currently wrapped by XMLPS are:

LibreOffice

LibreOffice is called – using the unoconv (https://github.com/dagwieers/unoconv) wrapper to avoid some command-line Java headaches – on all uploaded documents to convert them to the Microsoft Word 2007+ XML DOCX standard. This enables us to easily support RTF, ODT, and DOC input (among others) when the majority of parsing has been developed against a DOCX target. We even “convert” DOCX to DOCX using the same LibreOffice call in order to ensure that we are working from a well-formed document, as this was an issue in initial testing.
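
For illustration, this normalization step can be reduced to a single unoconv call; the paths here are placeholders, and the error handling in XMLPS is more involved:

# A minimal sketch of normalizing any supported input (DOC, ODT, RTF, or even
# DOCX itself) to DOCX via unoconv, which drives a headless LibreOffice
# instance; paths are illustrative.
import subprocess

def to_docx(input_path, output_path):
    # unoconv -f <format> -o <output> <input>
    subprocess.check_call(['unoconv', '-f', 'docx', '-o', output_path, input_path])
    return output_path

# to_docx('submission.rtf', '/tmp/submission.docx')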

LibreOffice is also called directly by meTypeset later in the stack if any of the images contained in the DOCX wrapper are found to be in the legacy WMF (“Windows Metafile”) or EMF format, in order to convert them to PNG; ImageMagick, which would be lighter-weight, ostensibly supports WMF and EMF files, but as of this writing, that support has been broken upstream since 2013.

meTypeset

See above. meTypeset is designed to work on an unadulterated DOCX, so it is called immediately after LibreOffice; most of the other steps in our stack function as post-processing relative to meTypeset.

ParsCit

ParsCit (http://aye.comp.nus.edu.sg/parsCit/) is one of several existing solutions for parsing citation strings from free text within an article body into structured XML and/or BibTeX (a long-established plain-text format still used as an interchange format by reference managers like Mendeley and Zotero). In our experience, it is also the most performant and the easiest to deploy locally (it is written in Perl and utilizes the CRF++ conditional random fields library for its machine learning functionality). ParsCit receives the <ref-list> block from the output of meTypeset, and outputs BibTeX-structured references.
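
A hedged sketch of this hand-off follows; it assumes ParsCit's citeExtract.pl entry point and its citation-extraction mode, and the exact invocation and output handling used by XMLPS may differ:

# Illustrative sketch of passing the extracted reference strings to ParsCit;
# the flags shown follow ParsCit's documented citeExtract.pl interface, but
# they are assumptions as far as XMLPS's actual integration is concerned.
import subprocess
import tempfile

def parse_citations(reference_strings, parscit='citeExtract.pl'):
    # ParsCit expects plain text; write the reference strings to a temporary
    # file, one citation per line.
    with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as handle:
        handle.write('\n'.join(reference_strings))
        source = handle.name
    # Returns ParsCit's structured reference output for downstream conversion.
    return subprocess.check_output([parscit, '-m', 'extract_citations', source])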

bibutils

A combination of the bibutils (http://bibutils.refbase.org/) package and some of our own XSL is then used to transform the BibTeX output by ParsCit to JATS-formatted XML references and paste them back into the JATS output by meTypeset. More of our own XSL is then used to transform the now-complete JATS XML into an HTML render which can be viewed in a browser or styled further; currently, we provide a single responsive layout with stylesheets as part of XMLPS.
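
As an illustration of this step, the sketch below assumes bibutils' bib2xml tool (which converts BibTeX to MODS XML) followed by an XSL transform; the stylesheet name is hypothetical, standing in for our own XSL:

# Illustrative sketch: bib2xml (from bibutils) converts BibTeX to MODS XML,
# after which a local XSL stylesheet (hypothetical mods-to-jats.xsl) produces
# JATS reference markup. Paths and names are illustrative.
import subprocess
from lxml import etree

def bibtex_to_jats_refs(bibtex_path, stylesheet='mods-to-jats.xsl'):
    mods = subprocess.check_output(['bib2xml', bibtex_path])
    transform = etree.XSLT(etree.parse(stylesheet))
    return transform(etree.fromstring(mods))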

Pandoc

Pandoc (http://johnmacfarlane.net/pandoc/), which is billed as a “swiss-army knife” for converting files from one markup format to another, and is almost certainly the best extant use of Haskell in 2015, is used crucially but sparingly in our stack. Pandoc recently incorporated functionality to handle Microsoft Word files as input for transformation to any of the other supported formats (including, primarily, Markdown); however, this is very much a “straight” conversion which does not account for Word formatting idiosyncrasies as well as meTypeset does. We are, however, able to use it to style the references according to a user's chosen citation format, using its built-in implementation of citeproc along with the Citation Style Language repository (http://citationstyles.org/). Inline citations (both numerical and author-year style) are linked to bibliography entries in the JATS, and replaced with the desired citation formatting in the HTML.
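
The reference-styling step can be sketched as a single pandoc invocation; this assumes a Markdown intermediate carrying pandoc-style citation keys (e.g. [@ref1]), and the flags shown are standard pandoc options rather than XMLPS's exact integration:

# Illustrative sketch of styling references with pandoc, citeproc, and a CSL
# style. Depending on the pandoc version, --filter pandoc-citeproc may also
# be required; file names are placeholders.
import subprocess

def style_references(markdown_path, bibliography, csl_style, output_html):
    subprocess.check_call([
        'pandoc', markdown_path,
        '--bibliography', bibliography,  # e.g. BibTeX produced earlier in the stack
        '--csl', csl_style,              # a style from the CSL repository
        '-t', 'html',
        '-o', output_html,
    ])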

Wkhtmltopdf

Wkhtmltopdf (short for “webkit HTML to PDF”; http://wkhtmltopdf.org/) is a simple one-call tool that allows us to go directly from the finished, styled HTML to a cleanly printed PDF (i.e. without images cut between pages) without needing to return to the XSL to provide a separate PDF transformation.
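
The corresponding call is essentially a one-liner; paths are illustrative:

# Minimal sketch of the wkhtmltopdf call used to print the styled HTML to PDF.
import subprocess

def html_to_pdf(html_path, pdf_path):
    # wkhtmltopdf renders the HTML in a headless WebKit instance and prints
    # the result to PDF.
    subprocess.check_call(['wkhtmltopdf', html_path, pdf_path])

# html_to_pdf('article.html', 'article.pdf')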

Exiftool

Exiftool (http://www.sno.phy.queensu.ca/~phil/exiftool/), which is designed around reading and writing metadata in various imaging formats, is used to embed article front matter as structured XMP metadata in the finished PDF. While XMP is still not widely surfaced in most reading contexts (short of a “document information” block in newer versions of Adobe Reader), it is a nice value-add that is increasingly being provided by large publishers, and it was very easy to implement.
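
A sketch of this step using exiftool's Dublin Core XMP tags follows; the tag names are standard exiftool tags, but the specific fields written by XMLPS are assumptions:

# Illustrative sketch of embedding article front matter as XMP metadata with
# exiftool; the fields shown are examples, not XMLPS's exact set.
import subprocess

def embed_xmp(pdf_path, title, authors, abstract):
    command = ['exiftool', '-overwrite_original',
               '-XMP-dc:Title={0}'.format(title),
               '-XMP-dc:Description={0}'.format(abstract)]
    for author in authors:
        # '+=' appends to the list-valued Creator tag.
        command.append('-XMP-dc:Creator+={0}'.format(author))
    command.append(pdf_path)
    subprocess.check_call(command)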

Evaluation and Use

We are still at the proof-of-concept stage with this XML mark-up service and have greatly appreciated the broader community’s interest and support in testing and evaluating it. To date, we have had a preliminary evaluation performed by York University Libraries, and we receive approximately 100 hits on our demonstration web instance each month. We have not yet conducted a rigorous quantitative evaluation of parsing quality, as our funding hours to date have not permitted it; our development and testing corpus currently comprises 50 heterogeneous documents sourced from various OJS sites in different disciplines. At this point, we welcome others to use, evaluate, and contribute to what we have developed and assembled thus far.

Future Work

As of January 2015, two crucial features are disabled: front matter parsing and PDF input. Each is disabled for a different reason. First, meTypeset’s front matter parsing was judged not to be of acceptable quality as of the completion of the last development round in August 2014.* Second, PDF input was previously achieved through the use of http://pdfx.cs.man.ac.uk/usage (closed-source) under a research agreement with the University of Manchester; it has since been removed from the PKP XML Parsing Service that wraps meTypeset. However, there are plans to re-add both of these features by incorporating the CERMINE Java library developed by the Centre for Open Science at the University of Warsaw (https://github.com/CeON/CERMINE). Just as meTypeset uses a heuristic-based approach to produce well-structured JATS from Word/compatible documents, CERMINE uses a machine learning approach to produce JATS from PDFs. During an earlier development round, we assisted the CERMINE developers in retraining their machine learning models (which were developed against generally well-formed documents from mathematics and the natural sciences) on our own corpora (which are variously poorly-formed documents that have been automatically converted to PDF from Word, and which span the humanities, social sciences, and medicine). The precision, recall, and F-scores from retraining CERMINE on our corpora can be found in Table 1. Designing a new module for the PKP XML Parsing Service that runs CERMINE in parallel with meTypeset (following a straight conversion to PDF for documents originating in Word/compatible formats), in order to inject front matter and automatically evaluate the parsing quality of certain document elements between the two approaches, is a high priority for a third funding round.

Table 1

              precision (%)              recall (%)                 F1 (%)
abstract      76.79 (+30.42, +29.54)     88.97 (+35.18, +33.38)     82.43 (+32.62, +31.35)
title         91.95 (+1.14, +6.74)       89.31 (+4.14, +5.86)       90.61 (+2.71, +6.29)
journal       88.72 (+9.64, +0.16)       72.30 (+26.67, +0.23)      79.67 (+21.80, +0.20)
authors       90.05 (+6.56, +5.58)       86.16 (+19.62, +2.71)      88.06 (+14.00, +4.11)
affiliation   78.70 (+9.88, +1.98)       74.57 (+26.10, +12.16)     76.58 (+19.70, +7.76)
year          98.41 (+7.03, +4.64)       92.53 (+34.02, +4.37)      95.38 (+24.04, +4.50)
volume        97.26 (+1.74, +0.26)       85.75 (+29.43, +0.23)      91.14 (+20.28, +0.24)
issue         96.30 (+0.31, +0.34)       62.87                      76.08 (+21.12, +0.11)
average       89.77                      81.56 (+24.94, +7.39)      84.99 (+19.53, +6.82)

Conclusion

We feel confident that we are on the right path in developing an open source stack that can semi-automate, with reasonable accuracy, the XML mark-up of the Word manuscripts typically submitted to peer-reviewed scholarly journals. This project demonstrates the ways in which open source software tools can be explored and combined to create new systems and services, which in this case promise to extend the contributions being made by open access to research and scholarship. While there is still much room for improvement in our results, improving the precision of the mark-up will remain an iterative process, one that will, with greater use, grow in accuracy and expand in the range of documents and academic disciplinary traditions that it can handle. The goal remains one of lowering the financial barriers to global participation in the standards of peer-reviewed scholarly publishing while raising the bar of journal quality and value for the larger world.

References

1.
Edgar BD, Willinsky J. A survey of the scholarly journals using Open Journal Systems. Scholarly and Research Communication.

Footnotes

*

Article metadata can be supplied directly to meTypeset via an API call; this is designed to override front matter parsing in cases where there is already known-good metadata for an article, but currently serves as a workaround for the disabled functionality.

Copyright Notice

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

Bookshelf ID: NBK279666