Ontario Scholars Portal (SP) Journals is an XML-based digital repository
containing over 32,000,000 articles from over 14,000 full-text journals from 25
publishers. Since 2006, SP has successfully used the NLM Journal Archiving and
Interchange Tag Set for its XML-based e-journals system, which runs on MarkLogic.
Scholars Portal Books is a PDF-based platform containing 460,000 ebooks from 25
publishers, running on Ebrary’s ISIS system. While PDF is still the dominant
format for ebooks, publishers are starting to move from PDF to XML. This article
describes a pilot that transforms a publisher’s XML book chapter metadata into
the NLM book DTD XML format and loads it into a MarkLogic database, so that users
can get book chapter results from the Journals search interface. The article
discusses why the NLM book DTD was chosen, examines the data transformation and
loading process, analyzes the benefits and challenges, and makes recommendations
for improving the book DTD.
Background
Scholars Portal is a project of the Ontario Council of University Libraries (OCUL)
that provides shared technology infrastructure and shared collections to OCUL
universities. Scholars Portal services include ebooks, ejournals, statistical data,
a Geoportal, interlibrary loan and citation management. The Ejournals platform was
the first content service provided to OCUL universities. In 2006, the SP development
team began planning the migration of the Scholars Portal Ejournals repository from
ScienceServer to a new XML-based database using MarkLogic. The NLM Journal Archiving
and Interchange Tag Set was chosen for this new system. In SP Journals, the
publishers’ native data is transformed to the NLM Tag Set in order to normalize data
elements to a single standard for archiving, display and searching. Adopting this
DTD has proved a great success: the SP Ejournals repository has become the top
research resource for OCUL universities, with an average of 555,000 downloads per
month. SP has also seen JATS gain popularity among journal publishers. SP Ejournals
ingests data from 25 vendors; 12 of them send XML files in the NLM tag suite and one
sends them in the newest version of JATS.
Scholars Portal Books is a PDF-based platform containing 460,000 ebooks from 25
collections of various publishers, running on Ebrary’s ISIS system. The ISIS system
is designed to load and display PDF books. The PDF-based reading interface offers
multiple-page view options, including a grid view that helps users navigate easily
within and among books. While PDF is still the dominant format, publishers are
starting to move from PDF to XML books. Some publishers send us the full text in
XML; others send the metadata in XML and the full text in PDF. No XML DTD/schema
appears to have gained wide acceptance among book publishers; the publishers all use
their own in-house DTD/schema.
Ravit David (2011) described a pilot of loading XML books into ISIS. The approach
was to generate a PDF file by extracting each book’s text and to feed the PDF, along
with the metadata from the MARC file, into the ISIS platform so that the books would
be searchable on the Ebook platform. The publisher’s XML book source data was loaded
into MarkLogic to render the HTML view when the user read the book in the ISIS Ebook
reader. The XML books can be searched and viewed on the Ebook platform, but with
many major problems, as the platform is designed for PDF books.
Instead of loading the XML book into MarkLogic as is, the new pilot transforms the
XML book source data into the NLM book DTD XML format before loading it into
MarkLogic. When users search from the Ejournals platform, the query is also sent to
the XML book chapter database, so users get chapter-level book search results on the
Ejournals platform and are then directed to the Ebook platform. The objective is to
direct traffic from the Ejournals platform to the Ebook platform and thereby
increase usage of the Ebook platform.
Data transformation
Our pilot chose to load Springer chapter-level XML books into MarkLogic. First, a
common XML format had to be chosen, as different publishers use different
DTDs/schemas. In searching for a DTD, we found that no DTD/schema is dominant in
the ebook publishing industry. Each DTD has pros and cons that are specific to a
given application; we were simply looking for one that would work for our specific
needs. “The NCBI Book Tag Sets were written using the Publishing Tag Set as a base
and adding book-specific elements” (NCBI, 2012), so the NLM book DTD shares common
elements with the NLM journal DTD. The NLM book DTD is a very good fit for scholarly
content (Perera, 2011). The experience of other organizations such as ACS shows that
adopting the NLM tag set for both books and journals minimizes staff learning time
and the amount of XML translation needed (O'Brien and Fisher, 2010). While other
book XML schemas such as DocBook and EPUB were also considered, our success with
the NLM journal DTD suggested that the NLM book DTD was the most suitable choice
for our needs.
Crosswalk
The crosswalk of the metadata from Springer’s A++ V2.4 DTD to the NLM book DTD was a
good match overall; however, we identified a few gaps in the NLM book DTD.
--A set of tags to describe the book series metadata
The logical hierarchy of a book may run from book series to book to book part
to book chapter. While the NLM book DTD provides elements for metadata that is
specific to a book as a whole (<book-meta>) or to a book part/chapter
(<book-part-meta>), there is no dedicated tag set to describe a book series,
for example the series title or ISSN. Moreover, the hierarchy of book parts and
chapters is not obvious, because both use the same element and are
distinguished only by an attribute: <book-part part-type="part"> and
<book-part part-type="chapter">.
--Subject or classification for the book
Library of Congress Subject Headings (LCSH) and Library of Congress
Classification (LCC) are well-known controlled vocabularies for describing the
subject categories of a book. While there is a <subj-group> for <book-part>,
there is no such tag for the book as a whole. We have to map
/Publisher/Series/Book/BookInfo/BookSubjectGroup to
/book/book-meta/kwd-group (kwd-group-type="subject"), although a subject is
not the same thing as a keyword.
--Chapter level DOI
Springer and other publishers have started to assign chapter-level DOIs for
reference linking, to increase discoverability. Unfortunately, there is no tag
for a chapter-level DOI in the NLM book DTD.
The loader
The Springer source data contains one XML file of metadata for each book chapter.
The root element is <Publisher>. There is a clear Publisher-Series-Book-Part-Chapter
hierarchy, with metadata describing each level of the book. A Java program
transforms the Springer source data into NLM book DTD XML based on the crosswalk.
The converted Scholars Portal NLM book DTD XML has <book> as its root element and
carries the metadata describing the book, book part and book chapter. A total of
837,469 Springer book chapters have been loaded into MarkLogic.
[Figure: Example of the source XML.]
[Figure: Example of the converted XML (SP NLM book DTD data).]
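The crosswalk step can be illustrated with a minimal sketch. It is written in Python for brevity and is not the Java loader used in the pilot. Only the source path /Publisher/Series/Book/BookInfo/BookSubjectGroup, the target /book/book-meta/kwd-group (kwd-group-type="subject"), and the <book-part part-type="chapter"> and <book-part-meta> elements come from the discussion above. The sample record, the function name crosswalk, and the element names BookTitle, BookSubject, ChapterInfo, ChapterTitle, book-title-group, book-title, kwd, title-group and title are assumptions made for illustration and should be checked against the actual DTDs.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified source record. Element names other than
# Publisher/Series/Book/BookInfo/BookSubjectGroup are assumed for illustration.
SOURCE = """
<Publisher>
  <Series>
    <Book>
      <BookInfo>
        <BookTitle>Sample Book</BookTitle>
        <BookSubjectGroup>
          <BookSubject>Chemistry</BookSubject>
          <BookSubject>Materials Science</BookSubject>
        </BookSubjectGroup>
      </BookInfo>
      <Part>
        <Chapter>
          <ChapterInfo>
            <ChapterTitle>Sample Chapter</ChapterTitle>
          </ChapterInfo>
        </Chapter>
      </Part>
    </Book>
  </Series>
</Publisher>
"""


def crosswalk(source_xml: str) -> ET.Element:
    """Map a simplified source record to a simplified NLM-book-style tree."""
    publisher = ET.fromstring(source_xml)
    src_book = publisher.find("./Series/Book")
    book_info = src_book.find("./BookInfo")

    book = ET.Element("book")
    book_meta = ET.SubElement(book, "book-meta")

    # Book title (target element names are assumed).
    title_group = ET.SubElement(book_meta, "book-title-group")
    ET.SubElement(title_group, "book-title").text = book_info.findtext(
        "BookTitle", default=""
    )

    # Book-level subjects go into a keyword group typed "subject",
    # the workaround described in the crosswalk section.
    kwd_group = ET.SubElement(book_meta, "kwd-group", {"kwd-group-type": "subject"})
    for subject in book_info.findall("./BookSubjectGroup/BookSubject"):
        ET.SubElement(kwd_group, "kwd").text = subject.text

    # Each source chapter becomes a book-part marked as a chapter.
    for chapter in src_book.iter("Chapter"):
        part = ET.SubElement(book, "book-part", {"part-type": "chapter"})
        part_meta = ET.SubElement(part, "book-part-meta")
        chapter_titles = ET.SubElement(part_meta, "title-group")
        ET.SubElement(chapter_titles, "title").text = chapter.findtext(
            "./ChapterInfo/ChapterTitle", default=""
        )
    return book


if __name__ == "__main__":
    print(ET.tostring(crosswalk(SOURCE), encoding="unicode"))
```

The sketch ignores series metadata and chapter-level DOIs, the two gaps noted in the crosswalk, because the target DTD offers no obvious place for them.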
The display of the search results
The Scholars Portal Journals application is built in XQuery, a declarative
programming language for expressing queries and transformations of XML data, which
here runs on MarkLogic. MarkLogic Server is used for indexing and searching. Our
search application is best viewed with newer browsers, including Firefox 3 and
Internet Explorer 7 and 8. Older versions will work, but with some loss of
functionality.
The features include:
--Search and access full text articles from a range of academic publishers
--Browse journals by subject
--Subscribe to RSS feeds for journal updates
--Search for figures and tables within articles
--Link to cited and citing references
--Print, email and download records
--Save citations to RefWorks and other bibliographic management software
After the book chapters were loaded into the MarkLogic xmlbooks database, the
journal search was extended to the books database as well. On the journal search
results page, the results from the Ejournals database are displayed in the main
part of the page. The search term is also run against the xmlbooks database for
chapters through an AJAX call, and the results are displayed on the right side
of the page.
[Figure: Book chapters in the search result list.]
On the journal article details page, the keywords from the article are used to
search for matching chapters, and the results are displayed in the "Related
Chapters" tab.
[Figure: Book chapters in the journal article details page.]
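The keyword-driven lookup behind the "Related Chapters" tab can be sketched roughly as follows. This is an in-memory Python illustration, not the XQuery used in the application. The Chapter class, the related_chapters function, the substring matching and the ranking by number of matched keywords are all assumptions; the article only states that the article's keywords are used to search for matching chapters.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class Chapter:
    """Hypothetical stand-in for a chapter record in the xmlbooks database."""
    title: str
    subjects: List[str] = field(default_factory=list)


def related_chapters(
    article_keywords: Iterable[str], chapters: Iterable[Chapter], limit: int = 5
) -> List[Chapter]:
    """Return chapters matching at least one article keyword, best match first."""
    wanted = {k.strip().lower() for k in article_keywords if k.strip()}
    scored = []
    for chapter in chapters:
        # Assumed matching rule: case-insensitive substring match against
        # the chapter title and its subject terms.
        haystack = " ".join([chapter.title, *chapter.subjects]).lower()
        hits = sum(1 for keyword in wanted if keyword in haystack)
        if hits:
            scored.append((hits, chapter))
    # Assumed ranking: more matched keywords first; ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chapter for _, chapter in scored[:limit]]


if __name__ == "__main__":
    sample = [
        Chapter("Polymer Synthesis", ["Chemistry"]),
        Chapter("Medieval Trade", ["History"]),
        Chapter("Catalysis in Polymer Chemistry", ["Chemistry"]),
    ]
    for match in related_chapters(["polymer", "catalysis"], sample):
        print(match.title)
```

A production system would delegate matching and relevance ranking to the search engine's own index rather than scanning records in application code.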
The PDF link for a book chapter directs the user to the Ebook platform page for
the book and allows the user to download the chapter PDF if they are entitled
to it.
[Figure: Linking to Ebook platform.]
Future directions
This pilot project is still in the testing phase. More analysis needs to be done
after it goes live and we get feedback from users. The NLM book DTD has proved to be
a good metadata model and meets our metadata conversion needs for this project. We
are still hesitant to use it as the full-text XML model because of the gaps
identified and the fact that it has not been updated for 4 years. We expect that the
new Book Interchange Tag Suite will be developed to handle a high degree of
complexity and will gain popularity, as the journal DTD has.
References
- 1.
- 2.
- 3. Perera Chandi. Book Publishing with JATS. In Journal Article Tag Suite
Conference (JATS-Con) Proceedings 2011. Bethesda, MD: National Center for
Biotechnology Information (US); 2011. Available from
http://www.ncbi.nlm.nih.gov/books/NBK62098/
- 4. O'Brien Dan, Fisher Jeff. Journals and Magazines and Books, Oh My! A Look at
ACS' Use of NLM Tagsets. In Journal Article Tag Suite Conference (JATS-Con)
Proceedings 2010. Bethesda, MD: National Center for Biotechnology Information
(US); 2010. Available from
http://www.ncbi.nlm.nih.gov/books/NBK47083/