Journal Article Tag Suite Conference (JATS-Con) Proceedings 2012 [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2012.

JATS for both journals and books?: A case study of adopting JATS to build a single search for Ejournals and Ebooks

Wei Zhao and Jayanthy Chengan.

Ontario Scholars Portal (SP) Journals is an XML-based digital repository containing over 32,000,000 articles from more than 14,000 full-text journals from 25 publishers. Since 2006 it has successfully used the NLM Journal Archiving and Interchange Tag Set for its XML-based ejournals system, which runs on MarkLogic. Scholars Portal Books is a PDF-based platform containing 460,000 ebooks from 25 publishers, running on Ebrary’s ISIS system. While PDF is still the dominant format for ebooks, publishers are starting to move from PDF to XML. This article describes a pilot that transforms a publisher’s XML book chapter metadata into the NLM book DTD XML format and loads it into a MarkLogic database, so that users can get book chapter results from the Journals search interface. It discusses why the NLM book DTD was chosen, examines the data transformation and loading process, analyzes the benefits and challenges, and makes recommendations for improving the book DTD.

Background

Scholars Portal is a project of the Ontario Council of University Libraries (OCUL) that provides shared technology infrastructure and shared collections to OCUL universities. Scholars Portal services include ebooks, ejournals, statistical data, Geoportal, interlibrary loan, and citation management. The Ejournals platform was the first content service provided to OCUL universities. In 2006, the SP development team began planning a migration of the Scholars Portal Ejournals repository from ScienceServer to a new XML-based database using MarkLogic. The NLM Journal Archiving and Interchange Tag Set was chosen for the new system. In SP Journals, the publishers’ native data is transformed to the NLM Tag Set so that data elements are normalized to a single standard for archiving, display, and searching. Adopting this DTD has proved a success: the SP Ejournals repository has become the top research resource for OCUL universities, with an average of 555,000 downloads per month. SP has also seen JATS gain popularity among journal publishers. SP Ejournals ingests data from 25 vendors; 12 of them send XML files in the NLM tag suite, and one sends the newest version, JATS.

Scholars Portal Books is a PDF-based platform containing 460,000 ebooks from 25 collections of various publishers, running on Ebrary’s ISIS system. The ISIS system is designed to load and display PDF books. Its PDF-based reading interface offers multiple-page view options, including a grid view that helps users navigate within and among books. While PDF is still the dominant format, publishers are starting to move from PDF to XML books. Some publishers send us the full text in XML; others send the metadata in XML and the full text in PDF. No XML DTD or schema appears to have gained wide acceptance among book publishers, and each publisher uses its own home-grown DTD or schema. David et al. (2011) described a pilot that loaded XML books into ISIS. That pilot generated a PDF file from each book’s extracted text and fed the PDF, along with the metadata from the MARC file, into the ISIS platform so that the books could be searched on the Ebook platform. The publisher’s XML source data was loaded into MarkLogic to render the HTML view when a user read the book in the ISIS Ebook reader. The XML books could be searched and viewed from the Ebook platform, but with many major problems, because the platform is designed for PDF books.

Instead of loading the XML book as is into MarkLogic, the new pilot transforms the XML book source data into the NLM book DTD XML format in MarkLogic. When users run a search on the Ejournals platform, the query is also sent to the XML book chapter database. Users therefore get chapter-level book search results on the Ejournals platform and are then directed to the Ebook platform. The objective is to direct traffic from the Ejournals platform to the Ebook platform and thereby increase usage of the Ebook platform.

Data transformation

For the pilot we chose to load Springer chapter-level XML books into MarkLogic. First, a common XML format had to be chosen, because different publishers use different DTDs and schemas. In searching for a DTD, we found that no DTD or schema is dominant in the ebook publishing industry. Each DTD has pros and cons that are specific to a given application, and we were simply looking for one that would work for our specific needs. “The NCBI Book Tag Sets were written using the Publishing Tag Set as a base and adding book-specific elements” (NCBI, 2012), so the book DTD shares common elements with the NLM journal DTD. The NLM book DTD is a very good fit for scholarly content (Perera, 2011). The experience of other organizations such as ACS shows that adopting the NLM tag set for both books and journals minimizes staff learning time and the amount of XML translation needed (O'Brien and Fisher, 2010). Other book XML formats such as DocBook and EPUB were also considered, but our success with the NLM journal DTD made the NLM book DTD the best choice for us.

Crosswalk

The crosswalk of the metadata from Springer’s A++ V2.4 DTD to the NLM book DTD was a good match overall; however, we identified a few gaps in the NLM book DTD.

--A set of tags to describe the book series metadata

The logical hierarchy of a book may consist of book series, book, book part, and book chapter. While the NLM book DTD provides elements for metadata that is specific to a book as a whole (<book-meta>) or to a book part or chapter (<book-part-meta>), there is no dedicated tag set for describing a book series, such as the series title and ISSN. Moreover, the hierarchy of book parts and chapters is not obvious, because both levels use the same <book-part> element and are distinguished only by an attribute: <book-part part-type="part"> and <book-part part-type="chapter">.

--Subject or classification for the book

Library of Congress Subject Headings (LCSH) and Library of Congress Classification (LCC) are well-known controlled vocabularies for describing the subject categories of a book. While <subj-group> is available for <book-part>, there is no such tag for the book as a whole. We had to map /Publisher/Series/Book/BookInfo/BookSubjectGroup to /book/book-meta/kwd-group (kwd-group-type="subject"), even though a subject is not the same thing as a keyword.

--Chapter level DOI

Springer and other publishers have started to assign chapter-level DOIs for reference linking, to increase discoverability. Unfortunately, we found no tag for a chapter-level DOI in the NLM book DTD.
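The subject mapping in the crosswalk (BookSubjectGroup to kwd-group) can be sketched as follows. This is a minimal Python illustration, not the Java program used in the pilot. The two paths come from the crosswalk above; the `BookSubject` child element, the `<kwd>` children, and the sample subject values are assumptions made for the example, and the output is not validated against either DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed source record. Only the path
# /Publisher/Series/Book/BookInfo/BookSubjectGroup comes from the crosswalk;
# the BookSubject children and their values are assumed for illustration.
SOURCE = """
<Publisher>
  <Series>
    <Book>
      <BookInfo>
        <BookSubjectGroup>
          <BookSubject>Computer Science</BookSubject>
          <BookSubject>Information Storage and Retrieval</BookSubject>
        </BookSubjectGroup>
      </BookInfo>
    </Book>
  </Series>
</Publisher>
"""


def map_subjects(source_root):
    """Build /book/book-meta/kwd-group[@kwd-group-type='subject'] from the source subjects."""
    book = ET.Element("book")
    book_meta = ET.SubElement(book, "book-meta")
    group = source_root.find("./Series/Book/BookInfo/BookSubjectGroup")
    if group is not None:
        kwd_group = ET.SubElement(book_meta, "kwd-group", {"kwd-group-type": "subject"})
        for subject in group.findall("BookSubject"):
            # Each source subject becomes one <kwd> (assumed child element).
            kwd = ET.SubElement(kwd_group, "kwd")
            kwd.text = (subject.text or "").strip()
    return book


book = map_subjects(ET.fromstring(SOURCE))
print(ET.tostring(book, encoding="unicode"))
```

Marking the group with kwd-group-type="subject" keeps subjects distinguishable from ordinary keywords at query time, which partly compensates for the missing book-level subject tag.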

The loader

In the Springer source data there is one XML file for each book chapter’s metadata. The root element is <Publisher>. The source has a clear Publisher-Series-Book-Part-Chapter hierarchy, with metadata describing each level of the book. A Java program transforms the Springer source data into NLM book DTD XML based on the crosswalk. The converted Scholars Portal NLM book DTD XML has <book> as its root element and contains the metadata describing the book, book part, and book chapter. So far, 837,469 Springer book chapters have been loaded into MarkLogic.
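The overall shape of such a batch conversion (one file per chapter, a root-element check, one converted file out) might look like the sketch below. It is written in Python for brevity and is not the pilot's Java loader; the function names, directory layout, and placeholder transform are all hypothetical, and the step that would load the result into MarkLogic is omitted.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def convert_chapter(source_root):
    """Placeholder transform: checks the source root and returns an empty <book>.

    A real crosswalk would populate book, book part and chapter metadata here.
    """
    if source_root.tag != "Publisher":
        raise ValueError(f"unexpected root element: {source_root.tag}")
    return ET.Element("book")


def convert_directory(src_dir, out_dir):
    """Convert every *.xml file in src_dir; return (converted, failed) file names."""
    converted, failed = [], []
    out_dir.mkdir(parents=True, exist_ok=True)
    for path in sorted(src_dir.glob("*.xml")):
        try:
            book = convert_chapter(ET.parse(str(path)).getroot())
        except (ET.ParseError, ValueError):
            # Keep going: one bad chapter file should not stop the batch.
            failed.append(path.name)
            continue
        ET.ElementTree(book).write(
            str(out_dir / path.name), encoding="utf-8", xml_declaration=True
        )
        converted.append(path.name)
    return converted, failed


# Small demonstration with two made-up chapter files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "src"
    src.mkdir()
    (src / "good.xml").write_text("<Publisher><Series/></Publisher>", encoding="utf-8")
    (src / "bad.xml").write_text("<Other/>", encoding="utf-8")
    converted, failed = convert_directory(src, Path(tmp) / "out")
    print("converted:", converted, "failed:", failed)
```

Collecting failures instead of aborting is a design choice that suits a load of several hundred thousand chapter files, where a few malformed records are to be expected.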

Figure 1 shows an example of the source XML.

Fig. 1. Example of source data.

Figure 2 shows an example of the converted XML.

Fig. 2. Example of SP NLM book DTD data.

The display of the search results

The Scholars Portal Journals application is built in XQuery, a declarative programming language that can be used to express queries and transformations of XML data in MarkLogic. MarkLogic Server is used for indexing and searching. Our search application is best viewed with newer browsers, including Firefox 3 and Internet Explorer 7 and 8. Older versions will work, but with some loss of functionality.

The features include

  • Search and access full text articles from a range of academic publishers
  • Browse journals by subject
  • Subscribe to RSS feeds for journal updates
  • Search for figures and tables within articles
  • Link to cited and citing references
  • Print, email and download records
  • Save citations to RefWorks and other bibliographic management software

After the book chapters were loaded into the MarkLogic xmlbooks database, the journal search was extended to the books database as well. On the journal search result page, the results from the Ejournals database are displayed in the main area of the page. The same term is also searched for chapters in the xmlbooks database through an AJAX call, and those results are displayed on the right side of the page.
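The fan-out described above, one query sent to both databases with the results placed in different parts of the page, can be illustrated with a small Python sketch. The production system uses XQuery on MarkLogic plus a browser-side AJAX call; here both databases are hypothetical in-memory lists, the substring match stands in for the real search, and the record titles are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory stand-ins for the Ejournals and xmlbooks databases.
EJOURNALS = [
    "Indexing strategies for XML repositories",
    "Interlibrary loan in consortia",
]
XMLBOOKS = [
    "Chapter 3: XML and the Semantic Web",
    "Chapter 7: Citation Management Tools",
]


def search(records, term):
    """Case-insensitive substring match; a placeholder for the real search."""
    needle = term.lower()
    return [record for record in records if needle in record.lower()]


def search_page(term):
    """Run the journal and chapter searches side by side, as the page does."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        articles = pool.submit(search, EJOURNALS, term)
        chapters = pool.submit(search, XMLBOOKS, term)
        # Articles fill the main area; chapters fill the right-hand panel.
        return {"main": articles.result(), "sidebar": chapters.result()}


page = search_page("xml")
print(page)
```

Running the chapter search as a separate request means the journal results do not have to wait for the books database, which is presumably why the page uses an AJAX call for it.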

Figure 3 shows a sample of book chapters in the search result list.

Fig. 3. Book chapters in the search result list.

On the journal article details page, the keywords from the article are used to search for matching chapters, and the results are displayed in the "Related Chapters" tab. Figure 4 shows the book chapter results on the journal article details page.

Fig. 4. Book chapters in the journal article details page.
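One simple way to turn an article's keywords into a list of related chapters is to rank chapters by how many keywords they share with the article. The Python sketch below illustrates that idea only; the paper does not describe how the matching is actually done, so the overlap ranking, the record layout, and the sample chapters are all assumptions.

```python
# Hypothetical chapter records: (chapter title, set of lowercase keywords).
CHAPTERS = [
    ("XML Databases in Practice", {"xml", "databases", "indexing"}),
    ("Metadata for Digital Libraries", {"metadata", "digital libraries", "xml"}),
    ("Protein Folding Basics", {"proteins", "folding"}),
]


def related_chapters(article_keywords, chapters, limit=5):
    """Return up to `limit` chapter titles sharing at least one keyword with the article."""
    wanted = {keyword.strip().lower() for keyword in article_keywords}
    scored = []
    for title, keywords in chapters:
        overlap = len(wanted & keywords)
        if overlap:
            scored.append((overlap, title))
    # Highest overlap first; ties are broken alphabetically by title.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [title for _, title in scored[:limit]]


related = related_chapters(["XML", "Metadata"], CHAPTERS)
print(related)
```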

The PDF link for a book chapter directs the user to the Ebook platform for that book and allows the user to download the chapter PDF if they are entitled to it.

Figure 5 is an example of the link to the Ebook platform.

Fig. 5. Linking to Ebook platform.

Future directions

This pilot project is still in the testing phase. More analysis needs to be done after it goes live and we receive feedback from users. The NLM book DTD has proved to be a good metadata model and meets our metadata conversion needs for this project. We are still hesitant to use it as a full-text XML model because of the gaps identified and because it has not been updated for four years. We expect that the new Book Interchange Tag Suite will be developed to a high degree of complexity and gain popularity as the journal DTD has.

References

1. David Ravit H., Sahebi Shahin Ezzat, Kawula Bartek, and Jayasinghe Dileshni. Challenges and Potential of Local Loading of XML Ebooks. Proceedings of Balisage: The Markup Conference 2011. Balisage Series on Markup Technologies. 2011 Aug. Available from http://www.balisage.net/Proceedings/vol7/html/David01/BalisageVol7-David01.html.
2. Book and Collection Tag Library version 3.0 [Internet]: National Center for Biotechnology Information; [cited 2012 Sep 02]. Available from http://dtd.nlm.nih.gov/book/tag-library/3.0/index.html.
3. Perera Chandi. Book Publishing with JATS. Journal Article Tag Suite Conference (JATS-Con) Proceedings 2011. Bethesda, MD: National Center for Biotechnology Information (US); 2011. Available from http://www.ncbi.nlm.nih.gov/books/NBK62098/
4. O'Brien Dan, Fisher Jeff. Journals and Magazines and Books, Oh My! A Look at ACS' Use of NLM Tagsets. In Journal Article Tag Suite Conference (JATS-Con) Proceedings 2010. Bethesda, MD: National Center for Biotechnology Information (US); 2010. Available from http://www.ncbi.nlm.nih.gov/books/NBK47083/

Copyright 2012 by Wei Zhao, Jayanthy Chengan.

The copyright holder grants the U.S. National Library of Medicine permission to archive and post a copy of this paper on the Journal Article Tag Suite Conference proceedings website.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License

Bookshelf ID: NBK100352
