Collecting XML at article submission at eLife: two steps forward, one step back?

Melissa Harrison


Journal Article Tag Suite Conference (JATS-Con) Proceedings 2016 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2016.

When eLife was launched in 2012, almost all article metadata was collected from what the corresponding author entered into the submission form during the peer review phase. These data were converted to JATS XML at acceptance and, together with the text of the author's Word file, were fed to the content processor (typesetter/compositor) to generate full-text XML during production, resulting in the final output for publication of the version of record (XML, HTML and PDF). This paper will describe the benefits and limitations of this approach, and how and why eLife has reverted to a more traditional method of using the author's complete Word file to generate elements of the XML metadata within the production process.

eLife also publishes the accepted manuscript PDF for approximately 60% of accepted articles. This is done within days of acceptance and is followed up later by the full text final version. This process still relies on the metadata entered via the submission system to generate the HTML heading and metadata information online (from JATS XML).

Aspects of the peer-review process and submission system that affect the acquisition and conversion of article metadata for both accepted article PDFs and the final version of record will be discussed. I will also outline the challenges eLife encountered in the efforts to streamline the production process and improve the end-to-end author experience.

Introduction

eLife aimed to achieve an XML-first workflow for all metadata but found this more difficult than anticipated. When setting up the publishing platform, the submission system and the production vendor, we selected the most secure and flexible vendors to partner with us. Since then we have replaced some of those partners and built internal workflows to bring us closer to our goal of an efficient, fast and author-friendly workflow.

There are benefits and limitations to all workflow approaches, and this paper will describe our experiences to date, focusing mainly on the process from submission to publication. I will describe three variations below, with the benefits and pitfalls, as well as eLife-specific issues.

1. Metadata output as an XML source for production

When the submission system was set up, we planned to collect all metadata from the corresponding author via the submission screens. This would remove the need for the content processor (typesetter/compositor) to gather information from different parts of the manuscript, including key information that may be hidden within the text (for example, funding details). It would also mean that the metadata would arrive in production in a structured XML format: JATS XML.

The corresponding author was asked to input the following information in the submission screens:

title;
abstract;
author details – names, affiliations (subdivided into department, institution, city and country), any present addresses, whether corresponding author, which authors contributed equally, contribution statements and competing interests;
group author details, if applicable;
funding details (funder, grant reference and authors associated with the funding);
major datasets generated and/or used (authors, year, title, source, url/accession number, accessibility statement);
ethics statements (human, animal, clinical trial registration);
major subject area, research organism and author-defined keywords.

The authors were informed that the title page they provided within their article file would be removed before the accepted article was sent through to production, so the information on the submission screens had to be accurate.
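To make the conversion step concrete, here is a minimal sketch of how form fields like those listed above might be mapped to JATS-style front matter. It is an illustration only, not eLife's submission system export: the `meta` dictionary, its keys, the sample values and the `build_front` function are all hypothetical, only a few of the fields are covered, and the output has not been validated against the JATS DTD.

```python
import xml.etree.ElementTree as ET


def build_front(meta):
    """Build a minimal JATS-style <front> element from submission-form data.

    `meta` is a hypothetical dict; its keys are illustrative only.
    """
    front = ET.Element("front")
    article_meta = ET.SubElement(front, "article-meta")

    title_group = ET.SubElement(article_meta, "title-group")
    ET.SubElement(title_group, "article-title").text = meta["title"]

    contrib_group = ET.SubElement(article_meta, "contrib-group")
    # One <contrib> per author, each pointing at its own affiliation.
    for index, author in enumerate(meta["authors"], start=1):
        attrs = {"contrib-type": "author"}
        if author.get("corresponding"):
            attrs["corresp"] = "yes"
        contrib = ET.SubElement(contrib_group, "contrib", attrs)
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = author["surname"]
        ET.SubElement(name, "given-names").text = author["given_names"]
        ET.SubElement(contrib, "xref", {"ref-type": "aff", "rid": f"aff{index}"})

    # Affiliations follow the contributors, subdivided as on the form
    # (department, institution, city, country).
    for index, author in enumerate(meta["authors"], start=1):
        aff = ET.SubElement(contrib_group, "aff", {"id": f"aff{index}"})
        dept = ET.SubElement(aff, "institution", {"content-type": "dept"})
        dept.text = author["department"]
        ET.SubElement(aff, "institution").text = author["institution"]
        ET.SubElement(aff, "addr-line").text = author["city"]
        ET.SubElement(aff, "country").text = author["country"]

    abstract = ET.SubElement(article_meta, "abstract")
    ET.SubElement(abstract, "p").text = meta["abstract"]

    kwd_group = ET.SubElement(
        article_meta, "kwd-group", {"kwd-group-type": "author-keywords"}
    )
    for keyword in meta["keywords"]:
        ET.SubElement(kwd_group, "kwd").text = keyword

    return front


# Invented sample data, for illustration only.
sample = {
    "title": "An example title",
    "abstract": "An example abstract.",
    "keywords": ["example", "metadata"],
    "authors": [
        {"surname": "Lovelace", "given_names": "Ada", "corresponding": True,
         "department": "Department of Examples", "institution": "Example University",
         "city": "London", "country": "United Kingdom"},
        {"surname": "Hopper", "given_names": "Grace", "corresponding": False,
         "department": "Department of Samples", "institution": "Sample Institute",
         "city": "New York", "country": "United States"},
    ],
}

front = build_front(sample)
print(ET.tostring(front, encoding="unicode"))
```

Even this small sketch assumes one affiliation per author and a single flag for the corresponding author. Anything richer, such as grouped equal contributions or several addresses per author, needs more form fields and more mapping rules.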

Benefits

In theory, the content processor would always use whatever metadata came out of the submission system without making any content or tagging changes. The content processor would have only one source of metadata, in clean XML format. Any updates to information would be made at source and be carried through to publication, meaning the submission system database would be accurate and up to date. Also, if we updated the licensing information, for example, there would be one place to make the change, at source, rather than multiple changes throughout the workflow.

Pitfalls

Submission systems are not set up to collect the level of metadata required for production and publication. Some pieces of information that the author provided on the cover page of their Word file could not be output in structured XML, so these details were pushed through via “notes to content processor”, which required interpretation and introduced manual intervention (for example, different groupings of equal contributions, or more than two address affiliations per author).

Authors have a strong preference for working in Word and keeping things simple. Filling out lots of forms on a submission system is onerous, so we were making the submission process harder for them. When an author resubmits and updates their Word file, they don’t necessarily update the submission screens as well, so if metadata changes it has to be updated separately. Therefore, a QC stage was introduced before exporting content to production to ensure everything in the final author Word file was also on the submission screens or added as a note to content processor. This was costly considering the amount of metadata we were collecting.
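Part of a QC stage like this could, in principle, be assisted by a simple comparison of the two sources. The sketch below is a hedged illustration rather than a description of eLife's QC process: it assumes the metadata from the Word file has already been extracted into a dictionary (which is the hard part and is not shown), and the field names and the `find_mismatches` function are hypothetical.

```python
def normalise(value):
    """Collapse whitespace and ignore case so trivial differences are not flagged."""
    return " ".join(str(value).split()).casefold()


def find_mismatches(form_meta, word_meta):
    """Return the fields whose values differ between the two sources.

    Each result maps a field name to a (form value, Word value) pair;
    a value of None means the field is missing from that source.
    """
    mismatches = {}
    for field in sorted(set(form_meta) | set(word_meta)):
        form_value = form_meta.get(field)
        word_value = word_meta.get(field)
        if (
            form_value is None
            or word_value is None
            or normalise(form_value) != normalise(word_value)
        ):
            mismatches[field] = (form_value, word_value)
    return mismatches


# Invented sample data, for illustration only.
form_meta = {"title": "A  study of X", "funder": "Example Funder"}
word_meta = {
    "title": "A Study of X",
    "funder": "Example Funder",
    "present_address": "Example Institute, Example City",
}

for field, (on_form, in_word) in find_mismatches(form_meta, word_meta).items():
    print(f"{field}: form={on_form!r} word={in_word!r}")
```

A check of this kind only narrows down where a person needs to look; deciding which source is correct still requires judgement.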

Changing specifications at source is more expensive than asking the content processor to change their outputs. Finally, despite all best efforts, authors will still make changes to metadata at the proofing stage and so the submission system still becomes out of date.

Unique eLife issues

The submission system was live a number of months before the website was built, so we were specifying XML requirements for the submission system output blind to the final XML tagging requirements. Many changes were needed, which then fell to the content processor. Although the content processor received the information within structured XML tags, it still required programmatic conversion of some XML tagging, which we had neither anticipated nor desired.

2. All key metadata output as XML source for production but supplemented by author Word file

Late in 2015 we changed typesetting vendor. The new vendor would convert the author Word file to HTML/XML at the point of output and encouraged us to use what was in the author's Word file as well as the XML output to convert the content. As a result we stopped removing the author’s title page and started delivering all information to the vendor on acceptance.

Benefits

Less stringent QC was required on the submission system output and the “notes to content processor” field was no longer required. The conversion to HTML at the point of entering the production workflow means that everyone touching the content is doing it via the same interface and in a streamlined system. For final publication purposes the HTML conversion is a means to an end; it is highly structured and can be automatically converted to JATS XML at any point during the process.

Pitfalls

We were back to two streams of data going to production, which required interpretation by the content processor. The submission system was no longer the source of most information, and consequently was not kept up to date. A post-export QC process was now required. The problem and cost had moved downstream, although we are confident that in time this QC can be reduced to a minimal level. The effort required of authors when submitting to eLife remains high.

Unique eLife issues

Approximately 60% of authors chose to publish their accepted manuscript in this early form. We set up an internal automated workflow to do this: the files are exported from the submission system to an Amazon Web Services (AWS) bucket, and the results of SQL queries are written as CSV files to a second AWS bucket. Our internal workflow receives the new files and then collects the corresponding data from the CSV output to build the XML required to publish the paper. This process was built using the existing workflow, relying on the submission screen data. So rigorous QC of the submission screens is still required to ensure the publication data are correct for this workflow.
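The CSV-to-XML step of a workflow like this can be sketched as follows. This is a simplified illustration under stated assumptions, not eLife's actual code: the column names (`manuscript_id`, `position`, `surname`, `given_names`), the sample rows and the `contrib_group_from_csv` function are hypothetical, and retrieving the CSV from the bucket is left out so that the example stays self-contained.

```python
import csv
import io
import xml.etree.ElementTree as ET


def contrib_group_from_csv(csv_text, manuscript_id):
    """Build a <contrib-group> for one manuscript from CSV query output."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # The CSV may hold rows for many manuscripts; keep only the one requested.
    rows = [row for row in reader if row["manuscript_id"] == manuscript_id]
    # Author order matters for publication, so sort on the position column.
    rows.sort(key=lambda row: int(row["position"]))

    group = ET.Element("contrib-group")
    for row in rows:
        contrib = ET.SubElement(group, "contrib", {"contrib-type": "author"})
        name = ET.SubElement(contrib, "name")
        ET.SubElement(name, "surname").text = row["surname"]
        ET.SubElement(name, "given-names").text = row["given_names"]
    return group


# Invented sample data standing in for one CSV file from the bucket.
authors_csv = """manuscript_id,position,surname,given_names
00001,2,Hopper,Grace
00002,1,Turing,Alan
00001,1,Lovelace,Ada
"""

group = contrib_group_from_csv(authors_csv, "00001")
print(ET.tostring(group, encoding="unicode"))
```

Because the XML is assembled directly from what was typed into the submission screens, any error there is carried straight into the published metadata, which is why the QC described above remains necessary.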

3. Reduce key metadata output as XML source for production, rely more on author Word file

We are now in the process of reviewing the submission screen requirements and reducing them, to reduce the amount of input required from authors. However, as our author accepted manuscript publication workflow still relies on these data, we cannot remove all submission requirements and revert to the author Word file completely.

Benefits

Reduced author submission requirements and workload; reduced QC requirements.

Pitfalls

The submission screen data input cannot be removed completely and so there will still be a dual source of data for the content processor.

Unique eLife issues

This year eLife launched its own publication platform. Until then, we could not publish all the metadata collected during submission at the time of the author’s accepted manuscript. Now we can build further on our in-house workflow specifications and publish more, if not most, of the information authors provide via the submission screens at the point of first publication, for example, author equal contribution, multiple corresponding authors, major dataset information and ethics information. We will now select which information is of most value at this stage of publication, and keep the relevant submission screen questions. The remainder we will be able to remove, easing the author submission process.

General observations

There are some things many authors see as unnecessary or don’t understand from an XML point of view, so if a publisher wants clean and good XML output, some QC will always be required. For example, authors can list a number of different departments in the same field on the submission screen (or in a single string of text in their Word file), but in the XML each department should be listed as its own entity (e.g. <institution content-type="dept">). Authors also don’t always supply all the information required to find a source, for example a database accession number or URL.
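The two examples above can be sketched in code. This is a minimal illustration, assuming (hypothetically) that departments typed into one field are separated by semicolons and that a dataset entry is a dictionary with optional `accession` and `url` keys; the function names are invented for the example.

```python
import xml.etree.ElementTree as ET


def split_departments(raw):
    """Split a single free-text field into separate department names.

    Only semicolons are treated as separators here. Commas and the word
    "and" are ambiguous ("Department of Molecular and Cell Biology"),
    which is one reason manual QC is still needed.
    """
    return [part.strip() for part in raw.split(";") if part.strip()]


def build_aff(departments_field, institution):
    """Build an <aff> with one <institution content-type="dept"> per department."""
    aff = ET.Element("aff")
    for dept in split_departments(departments_field):
        ET.SubElement(aff, "institution", {"content-type": "dept"}).text = dept
    ET.SubElement(aff, "institution").text = institution
    return aff


def missing_dataset_locator(dataset):
    """Return True when a dataset entry has neither an accession number nor a URL."""
    return not (dataset.get("accession") or dataset.get("url"))


# Invented sample data, for illustration only.
aff = build_aff(
    "Department of Zoology; Department of Molecular and Cell Biology",
    "Example University",
)
print(ET.tostring(aff, encoding="unicode"))
print(missing_dataset_locator({"title": "Example dataset", "year": "2015"}))
```

A rule like this catches the easy cases only. An author who separates departments with commas, or who gives a source name without any identifier, still has to be picked up by a person.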

Production workflow changes and future plans

We now convert to HTML and have XML available from the point at which content enters the production workflow, and the typeset PDF is available at any point thereafter (it can be automatically regenerated at each point of the production process, incorporating any changes). We can view the content at any time in the process, and all actions are made in the single online HTML viewer system, whether it is a production, content processor, copy editor or author action. This means eLife is much more in control of content during production, and we have access to it at any point. We can drop additional assets into the article when they are ready (for example, eLife publishes the decision letter and the author’s response, which are delivered to production a few days after the main files and data, and the eLife Digest, which is delivered 5-10 days after acceptance). We automatically send a link to the “Proof” to the author when we know the paper is complete and ready for them. We receive the author’s reviewed content automatically, and we can automatically sign off for final delivery and view the content on the hosting platform within 10 minutes.

In theory, based on the new workflow we now have in place, we could publish final content within a week of acceptance, if not sooner, and with no intervention from the content processor, except their initial pre-editing and/or copy editing.

Using this new platform, we can speed up our production times, and we can add new features to our workflow. For example, we plan to allow authors to add ORCIDs to their paper during production, using ORCID’s system-to-system authentication so they are validated, via the author proofing tool. We also plan to add more metadata to the funding details; this is available from the FundRef database.
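Authentication through ORCID is what confirms that an iD belongs to the author. Separately from that, a cheap local sanity check on the format of an iD can be sketched as below. This is an illustration only, not part of the workflow described here: it applies the ISO 7064 MOD 11-2 check digit used by ORCID iDs, and the function names are hypothetical.

```python
def orcid_check_digit(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for digit in base_digits:
        total = (total + int(digit)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def looks_like_orcid(value):
    """Return True if the value has the shape and check digit of an ORCID iD.

    This does not show that the iD exists or belongs to a given author;
    only authentication with ORCID can establish that.
    """
    compact = value.replace("-", "")
    if len(compact) != 16:
        return False
    base, check = compact[:15], compact[15].upper()
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == check


# 0000-0002-1825-0097 is the iD ORCID uses in its own documentation examples.
print(looks_like_orcid("0000-0002-1825-0097"))
```

Such a check can reject obvious typing errors before any request is made to ORCID, but it is a complement to authenticated collection, not a replacement for it.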

Now that we are more in control, we can make more process improvements and reduce turnaround times.

Conclusion

Until authors convert to HTML at the point of submission and edit in a common system that is built for this type of content and can convert to XML at any point, we will not escape most of the issues discussed above. Once we get to that point, production and editorial workflow processes merge into one, and we will have the potential to publish the author’s final version of record article in full-text HTML/XML at the point of acceptance, or even submission (bearing in mind support for preprint repositories).

This article is distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use and redistribution provided that the original author and source are credited.

Bookshelf ID: NBK350147
