Bibliographic Metadata Formats
SWORD itself does not specify which metadata format should be used for the transfer. We needed to transfer citation information in both directions between publicationslist.org and the repository, so we needed a simple format for encoding the metadata.
We looked at a number of possible formats / encodings:
- Dublin Core : (as used for OAI-PMH) doesn't cover all the required fields
- SWAP, EPDCX : too complex for our application
- eprintsxml : field names specific to one repository
- MODS : appealing because of its use in BibUtils (a set of tools for interconverting between most bibliography formats); the downside is that it has a very complex XML schema
- bibtex : appealing because it is simple and covers all required citation styles, and import/export routines are available for most repositories; but it is an old format with poor support for UTF-8
- JSON : appealing as it is much simpler than XML and has great cross-language support; it still needs a set of field names though
- XML : more complex data encoding format than JSON, but a possibility
- RDF : even more complex than vanilla XML
- OpenURL : simple, but designed as a pointer / reference format rather than for storing citations
- PubmedXML : good for journal articles; doesn't cover other reference types as well
- Proprietary bibliography formats : endnote / copac / ISI / RIS ; less expressive than Bibtex
The format we chose is JSON using the field names / schema from bibtex. This combines the simplicity of json with the well established and tested bibtex schema, without any of the disadvantages associated with using latex. It also makes it easy to add extra fields without modifying any schema documents.
JSON-BIBTEX Bibliographic Metadata format
The selected bibliographic metadata format is based on bibtex coded using JSON [www.json.org]. This offers a rich, simple and well supported set of fields for representing citation metadata without the data loss associated with converting to Dublin Core. The fields are documented in the Wikipedia entry on BibTeX [en.wikipedia.org/wiki/BibTeX]. Most repositories already have routines to import / export BibTeX, which include the mappings from their internal field names to the bibtex standard. The fields are:
address:
Publisher's address (usually just the city, but can be the full address
for lesser-known publishers)
author:
The name(s) of the author(s) (in the case of more than one author,
separated by "and" in standard bibtex; the examples below use "; ")
booktitle:
The title of the book, if only part of it is being cited
chapter:
The chapter number
edition:
The edition of a book, long form (such as "first" or "second")
editor:
The name(s) of the editor(s)
eprint:
A specification of an electronic publication, often a preprint or a
technical report
howpublished:
How it was published, if the publishing method is nonstandard
institution:
The institution that was involved in the publishing, but not necessarily
the publisher
journal:
The journal or magazine the work was published in
month:
The month of publication (or, if unpublished, the month of creation)
note:
Miscellaneous extra information
number:
The "number" of a journal, magazine, or tech-report, if applicable.
(Most publications have a "volume", but no "number" field.)
organization:
The conference sponsor
pages:
Page numbers, separated either by commas or double-hyphens. For books,
the total number of pages.
publisher:
The publisher's name
school:
The school where the thesis was written
series:
The series of books the book was published in (e.g. "The Hardy Boys")
title:
The title of the work
url:
The WWW address
volume:
The volume of a journal or multi-volume book
year:
The year of publication (or, if unpublished, the year of creation)
In addition to the standard fields, the following optional fields are used to provide extra database IDs:
refid: --- the client's ID for this entry
urllink: --- a link to the HTML version
pdflink: --- a link to the PDF
keywords: --- a list of tags, comma separated
doi: --- the DOI number
pubmed: --- the pubmed ID {optional}
articletype: --- 'article', 'book', 'inproceedings' etc
To simplify parsing of the list of authors / editors, these are also supplied as structured JSON list fields, e.g.:
"author": "Bloggs, Joe M; Other, A.N.",
"editor": "A Z Other",
"author_list": [{"first":"Joe", "middle":"M", "last":"Bloggs"},
{"first":"A", "middle":"N", "last":"Other"}],
"editor_list": [{"first":"A", "middle":"Z", "last":"Other"}, {...} ]
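A client that only holds the flat author string could derive the structured list along the following lines. This is a minimal Python sketch, not part of the format itself: the function names split_name and name_list are hypothetical, and the parsing rules (names separated by ";", "Last, First Middle" or "First Middle Last" order, dots treated as separators between initials) are assumptions taken from the examples above.

```python
def split_name(name):
    """Split one personal name into first / middle / last parts.

    Assumes either 'Last, First Middle' or 'First Middle Last' order.
    """
    if "," in name:
        last, given = name.split(",", 1)
        # Treat dots as separators so "A.N." yields the initials "A" and "N".
        parts = given.replace(".", " ").split()
        last = last.strip()
    else:
        tokens = name.replace(".", " ").split()
        last = tokens[-1] if tokens else ""
        parts = tokens[:-1]
    return {
        "first": parts[0] if parts else "",
        "middle": " ".join(parts[1:]),
        "last": last,
    }


def name_list(field):
    """Build the structured list from a ';'-separated author or editor string."""
    return [split_name(n) for n in field.split(";") if n.strip()]


record = {"author": "Bloggs, Joe M; Other, A.N.", "editor": "A Z Other"}
record["author_list"] = name_list(record["author"])
record["editor_list"] = name_list(record["editor"])
print(record["author_list"])
print(record["editor_list"])
```

Real-world names (particles such as "van der", suffixes, corporate authors) need more care than this sketch gives them, which is one reason for supplying the structured lists alongside the flat strings.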
The field 'type' is used to denote the type of entry:
article
An article from a journal or magazine.
Required fields: author, title, journal, year
Optional fields: volume, number, pages, month, note
book
A book with an explicit publisher.
Required fields: author/editor, title, publisher, year
Optional fields: volume, series, address, edition, month, note, pages
booklet
A work that is printed and bound, but without a named publisher or
sponsoring institution.
Required fields: title
Optional fields: author, howpublished, address, month, year, note
conference
The same as inproceedings, included for compatibility with the Scribe
markup language.
Required fields: author, title, booktitle, year
Optional fields: editor, pages, organization, publisher, address, month, note
inbook
A part of a book, usually untitled. May be a chapter (or section or
whatever) and/or a range of pages.
Required fields: author/editor, title, chapter/pages, publisher, year
Optional fields: volume, series, address, edition, month, note
incollection
A part of a book having its own title.
Required fields: author, title, booktitle, year
Optional fields: editor, pages, organization, publisher, address, month, note
inproceedings
An article in a conference proceedings.
Required fields: author, title, booktitle, year
Optional fields: editor, pages, organization, publisher, address, month, note
manual
Technical documentation.
Required fields: title
Optional fields: author, organization, address, edition, month, year, note
mastersthesis
A Master's thesis.
Required fields: author, title, school, year
Optional fields: address, month, note
misc
For use when nothing else fits.
Required fields: none
Optional fields: author, title, howpublished, month, year, note
phdthesis
A Ph.D. thesis.
Required fields: author, title, school, year
Optional fields: address, month, note
proceedings
The proceedings of a conference.
Required fields: title, year
Optional fields: editor, publisher, organization, address, month, note
techreport
A report published by a school or other institution, usually numbered
within a series.
Required fields: author, title, institution, year
Optional fields: type, number, address, month, note
unpublished
A document having an author and title, but not formally published.
Required fields: author, title, note
Optional fields: month, year
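The required-field rules listed above lend themselves to a table-driven check. The Python sketch below is one possible way to do it and is not part of the format: the names REQUIRED and missing_fields are hypothetical, the entry type is read from the 'type' field, a specification such as "author/editor" is taken to mean that either field satisfies the rule, and an empty string is assumed to count as missing.

```python
# Required fields per entry type, copied from the list above.
# "a/b" means that either field a or field b satisfies the requirement.
REQUIRED = {
    "article": ["author", "title", "journal", "year"],
    "book": ["author/editor", "title", "publisher", "year"],
    "booklet": ["title"],
    "conference": ["author", "title", "booktitle", "year"],
    "inbook": ["author/editor", "title", "chapter/pages", "publisher", "year"],
    "incollection": ["author", "title", "booktitle", "year"],
    "inproceedings": ["author", "title", "booktitle", "year"],
    "manual": ["title"],
    "mastersthesis": ["author", "title", "school", "year"],
    "misc": [],
    "phdthesis": ["author", "title", "school", "year"],
    "proceedings": ["title", "year"],
    "techreport": ["author", "title", "institution", "year"],
    "unpublished": ["author", "title", "note"],
}


def missing_fields(record):
    """Return the required-field specs that the record does not satisfy."""
    entry_type = record.get("type", "")
    if entry_type not in REQUIRED:
        raise ValueError(f"unknown entry type: {entry_type!r}")
    missing = []
    for spec in REQUIRED[entry_type]:
        alternatives = spec.split("/")
        # Assumption: an empty or whitespace-only value counts as missing.
        if not any(str(record.get(name, "")).strip() for name in alternatives):
            missing.append(spec)
    return missing


print(missing_fields({"type": "article", "author": "Howell, F W",
                      "title": "Catalyzer", "year": "2006"}))  # -> ['journal']
```

Keeping the rules in a plain dictionary means a new entry type or an extra required field can be added without touching the checking logic.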
Sample JSON-Bibtex file
JSON (JavaScript Object Notation) is a simple language-independent notation for structured data. It is based on a subset of JavaScript and includes objects (name / value pairs), arrays, and simple data types (strings, numbers). The JSON-Bibtex structure is simply a set of name/value pairs. In the sample below, the fields from "title" to "note" are standard bibtex fields and those from "doi" onwards are extended fields:
{
"refid":"3",
"type":"article",
"title":"Catalyzer: a novel tool for integrating, managing and publishing heterogeneous bioscience data",
"year":"2006",
"author":"Howell, F W; Cannon, R C; Goddard, N H",
"journal":"Concurrency Computat.: Pract. Exper.",
"volume":"19",
"number":"",
"pages":"207-221",
"month":"",
"note":"",
"doi":"10.1002\/cpe.1044",
"pubmed":"",
"pdflink":"",
"urllink":"http:\/\/www3.interscience.wiley.com\/...",
"abstract":"The integrative ambitions of systems biology - (...)",
"keywords":"XML, Semantic Web, e-Science"
}
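To give a feel for how little code is needed to consume such a record, the Python sketch below parses an abridged copy of the sample with the standard json module and renders it as a bibtex entry. It is an illustration under stated assumptions rather than the converter used in the project: the names EXPORT_FIELDS and to_bibtex are hypothetical, empty fields are skipped, the refid is reused as the citation key, and the "; " author separator is rewritten to the " and " that bibtex expects.

```python
import json

# Abridged copy of the sample record; a raw string keeps the "\/" escapes intact.
SAMPLE = r'''
{
  "refid": "3",
  "type": "article",
  "title": "Catalyzer: a novel tool for integrating, managing and publishing heterogeneous bioscience data",
  "year": "2006",
  "author": "Howell, F W; Cannon, R C; Goddard, N H",
  "journal": "Concurrency Computat.: Pract. Exper.",
  "volume": "19",
  "number": "",
  "pages": "207-221",
  "doi": "10.1002\/cpe.1044"
}
'''

# Hypothetical choice of which fields to export, and in what order.
EXPORT_FIELDS = ["author", "title", "journal", "volume", "number",
                 "pages", "month", "year", "note", "doi"]


def to_bibtex(record):
    """Render one JSON-Bibtex record as a bibtex entry string."""
    lines = []
    for name in EXPORT_FIELDS:
        value = record.get(name, "")
        if not value:
            continue  # assumption: empty strings mean "not supplied"
        if name in ("author", "editor"):
            # Assumption: names arrive separated by ";" as in the sample.
            value = " and ".join(p.strip() for p in value.split(";") if p.strip())
        lines.append(f"  {name} = {{{value}}}")
    key = record.get("refid", "unknown")
    return "@%s{%s,\n%s\n}" % (record["type"], key, ",\n".join(lines))


record = json.loads(SAMPLE)   # the JSON parser turns "\/" back into "/"
print(to_bibtex(record))
```

The sketch does not escape characters that are special in LaTeX (such as "&" or "%"); a production exporter would need to handle those.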
Summary
We found that the 'JSON-Bibtex' format worked well for exchanging data between publicationslist.org and EPrints; most repositories already have facilities for importing / exporting bibtex, so little extra work is needed to support a JSON encoding. The few extra fields we needed that are not part of standard BibTeX were very easy to add to the JSON objects.