Why JSON and JSON Lines

Here are some explanations and justifications for using JSON and JSON Lines formats for delivering metadata and data.

Why JSON and JSON Lines

Here are some explanations and justifications for using JSON and JSON Lines formats for delivering metadata and data.

Why does NCBI Datasets prefer the JSON Lines format?

JSON Lines (pronounced “jason-lines”) is a simple, robust, and easy to use format that has reasonably wide adoption, strong tooling support, excellent compliance, and significant expressivity. The format balances human readability and convenience with machine readability and operability. For many use cases, the JSON Lines format compares favorably with the alternatives, including multiple specialized bioinformatics formats and generic tabular formats For further details, read below and consult our reference documentation.

Why use the JSON format for various data reports?

JSON (JavaScript Object Notation; RFC8259) is a text format, both human and machine readable, for structured data that borrows from JavaScript syntax. NCBI favors the FAIR Principles, which aim to improve the findability, accessibility, interoperability, and reuse of digital assets. Essential elements to meet these objectives include:
  • Rich, machine-readable data and metadata: JSON is a standard that allows for a well-defined and complex structure, to facilitate encoding rich, machine-readable data and metadata. JSON supports data types and schemas that promote machine-readability. Formats such as CSV/TSV typically fail to support this degree of richness.
  • A formal, accessible, shared, and broadly applicable language for knowledge representation: JSON meets these objectives, with both a formal specification to define the file format’s grammar and very wide adoption. JSON is a lingua franca of API (Application Programming Interfaces) communications, especially in the modern RESTful web ecosystem. JSON shares the properties of being humans readable and machine-readable, with minimal need for parsing heuristics. In practice, formats such as CSV/TSV present inadequacies with respect to formality and compliance with specifications.
  • Domain-relevant community standards: There is a lack of agreement on CSV/TSV formats, and while tabular data and specialized formats are prevalent in bioinformatics, JSON is a generic format that is also a significant community standard in this field and a prevalent standard in the complementary areas of web services, API microservices, and data science. This interoperability benefits from the similarity of representation when data are of a similar type, using well-established and sustainable file formats.

Why use the JSON Lines format for various data reports?

The JSON Lines format consists of a sequence of concatenated JSON, one value per line. JSON Lines was formerly called NDJSON (Newline Delimited JSON). Historically, these were two independent variations on JSON, but the community and authors of the formats have agreed to reconconcile and consolidate them.

JSON Lines is a hybrid format, offering several advantages over conventional JSON as it is line-oriented:

  1. JSON Lines works well with common text processing tools, including the UNIX Core Utilities, such as grep, sed, head, and tail.
  2. JSON Lines facilitates stream processing since it is concatenable and any subset of lines may be processed independently of others. By comparison, JSON requires editing prior content whenever new content is appended. This same advantage was the motivation for JSON-Seq (JSON Text Sequences) as documented in the corresponding IANA specification, RFC7464). Despite having a recognized IANA Media Type assignment, JSON-Seq failed to gain adoption and was replaced by JSON Lines.

What are the main disadvantages of JSON, JSON Lines, and tabular formats such as CSV/TSV?

While JSON, JSON Lines and tabular formats provide many advantages, it is important to note that no format is ideal, and each presents some challenges:

  • JSON format formally encodes exactly one value, although that value may be null (an empty value) or an array of multiple values.
    • Arrays require an opening and closing tag and rigidly specified separators, and therefore, literally concatenating valid JSON does not produce a valid JSON result.
    • Many JSON processing libraries enforce the constraint of exactly one value and assume the value may be loaded entirely into memory, although there is very little technical reason for this implementation constraint. As a result, it may be difficult to process large data sets in JSON using popular tools.
  • JSON Lines lacks the same degree of adoption as JSON and drops some of the latter’s capabilities.
    • There is no current IANA Media Type registered for JSON Lines.
    • Tools for JSON may lack support for JSON Lines and fail to the format,.
    • There is no standard for associating a schema with JSON Lines, and the format does not support associating top-level metadata with its content. Instead, common metadata must be redundantly repeated for every line.
  • CSV/TSV tabular formats lack rigorous standardization and there is no widely adopted practice for recording format metadata, such as type information or schemas.
    • A multiplicity of variants exists, with no explicit format indicator; and the format requires heuristics or external configuration to parse data correctly. The main elements of syntax variability include the representation of newlines, field delimiters (comma, tab, or other), quoting and escaping of special characters, column name headers, comment lines, blank lines, and character sets (ASCII versus UTF-8 and other Unicode support).
    • There is no explicit distinction between numbers and strings, inadequate formality for representing numbers in a context of international conventions, and no specification for how to represent dates.
    • The absence of rigid syntax and data types leads to a high frequency of data corruption, such as problems encoding values with literal commas (,), quotation marks ("), or whitespace, and any values similar to a date expressed in any worldwide language ([1], [2], [3]).
    • CSV/TSV also lacks expressivity, as a single tabular file cannot easily represent complex structures and relationships, without embedding some other syntax (e.g. the use of tag-value lists in a column). The Relational Model may be used to represent richer data via multiple tabular files, but standards are lacking for the transfer of multiple files and their associated relationship schemas.
    • While attempts have been made to rectify these problems, such as Frictionless Data, the alternatives lack widespread adoption by industry and tooling.

Why not use other formats, such as JSON-LD and RDF/OWL?

The FAIR Principles guidance highlights other formats among domain-relevant community standards, such as JSON-LD (JSON Linked Data) and RDF/OWL (Resource Description Framework, Web Ontology Language). Note: JSON-LD should not to be confused with JSON Lines. These other formats are significant for the concept of the Semantic Web. At this time, the user community’s demand for these formats remains below the minimum threshold required for NCBI Datasets to support in order to maximize NCBI’s contribution to the scientific community.

Generated November 25, 2024