EM-Loader API [draft only: July 2008]
API for synchronising a publications list with a repository
This document describes an API for synchronising a researcher's personal publications list with their entries in a repository. It is being developed as part of the EM-Loader project for enhanced metadata loading / submission to repository systems, with an initial implementation connecting publicationslist.org to The Depot's version of ePrints.
JSON-BIBTEX Bibliographic Metadata format
The selected bibliographic metadata format is based on BibTeX, encoded as JSON [www.json.org]. This offers a rich, simple and well-supported set of fields for representing citation metadata, without the data loss associated with converting to Dublin Core. The fields are documented in the Wikipedia entry on BibTeX [en.wikipedia.org/wiki/BibTeX]. Most repositories already have routines to import from and export to BibTeX, which include the mappings from their internal field names to the BibTeX standard. The fields are:
address:
Publisher's address (usually just the city, but can be the full address
for lesser-known publishers)
author:
The name(s) of the author(s) (in the case of more than one author,
separated by and)
booktitle:
The title of the book, if only part of it is being cited
chapter:
The chapter number
edition:
The edition of a book, long form (such as "first" or "second")
editor:
The name(s) of the editor(s)
eprint:
A specification of an electronic publication, often a preprint or a
technical report
howpublished:
How it was published, if the publishing method is nonstandard
institution:
The institution that was involved in the publishing, but not necessarily
the publisher
journal:
The journal or magazine the work was published in
month:
The month of publication (or, if unpublished, the month of creation)
note:
Miscellaneous extra information
number:
The "number" of a journal, magazine, or tech-report, if applicable.
(Most publications have a "volume", but no "number" field.)
organization:
The conference sponsor
pages:
Page numbers, separated either by commas or double-hyphens. For books,
the total number of pages.
publisher:
The publisher's name
school:
The school where the thesis was written
series:
The series of books the book was published in (e.g. "The Hardy Boys")
title:
The title of the work
url:
The WWW address
volume:
The volume of a journal or multi-volume book
year:
The year of publication (or, if unpublished, the year of creation)
In addition to the standard fields, the following optional fields are used to provide extra database IDs:
refid: --- the client's ID for this entry
urllink: --- a link to the HTML version
pdflink: --- a link to the PDF
keywords: --- a list of tags, comma separated
doi: --- the DOI number
pubmed: --- the pubmed ID {optional}
articletype: --- 'article', 'book', 'inproceedings' etc
To simplify parsing of the list of authors / editors, these are also supplied as structured JSON list fields, e.g.:
"author": "Joe M Bloggs and A N Other",
"editor": "A Z Other",
"author_list": [{"first":"Joe", "middle":"M", "last":"Bloggs"},
{"first":"A", "middle":"N", "last":"Other"}],
"editor_list": [{"first":"A", "middle":"Z", "last":"Other"}, {...} ]
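A client that only holds the flat name string could derive the structured list along these lines. This is a minimal Python sketch, not part of the API: the helper name `split_names` is hypothetical, and it assumes names are written in "First Middle Last" order and joined by " and ", as in the example above. Names in "Last, First" form or with multi-word surnames would need more careful handling.

```python
def split_names(field):
    """Split a BibTeX-style name field ("A and B and C") into structured names."""
    people = []
    for name in field.split(" and "):
        parts = name.split()
        if not parts:
            continue  # skip empty fragments, e.g. from an empty field
        people.append({
            # A single-word name is treated as a last name only.
            "first": parts[0] if len(parts) > 1 else "",
            "middle": " ".join(parts[1:-1]),
            "last": parts[-1],
        })
    return people


record = {"author": "Joe M Bloggs and A N Other", "editor": "A Z Other"}
record["author_list"] = split_names(record["author"])
record["editor_list"] = split_names(record["editor"])
```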
The field 'type' is used to denote the type of entry:
article
An article from a journal or magazine.
Required fields: author, title, journal, year
Optional fields: volume, number, pages, month, note
book
A book with an explicit publisher.
Required fields: author/editor, title, publisher, year
Optional fields: volume, series, address, edition, month, note, pages
booklet
A work that is printed and bound, but without a named publisher or
sponsoring institution.
Required fields: title
Optional fields: author, howpublished, address, month, year, note
conference
The same as inproceedings, included for compatibility with the Scribe
markup language.
Required fields: author, title, booktitle, year
Optional fields: editor, pages, organization, publisher, address, month, note
inbook
A part of a book, usually untitled. May be a chapter (or section or
whatever) and/or a range of pages.
Required fields: author/editor, title, chapter/pages, publisher, year
Optional fields: volume, series, address, edition, month, note
incollection
A part of a book having its own title.
Required fields: author, title, booktitle, year
Optional fields: editor, pages, organization, publisher, address, month, note
inproceedings
An article in a conference proceedings.
Required fields: author, title, booktitle, year
Optional fields: editor, pages, organization, publisher, address, month, note
manual
Technical documentation.
Required fields: title
Optional fields: author, organization, address, edition, month, year, note
mastersthesis
A Master's thesis.
Required fields: author, title, school, year
Optional fields: address, month, note
misc
For use when nothing else fits.
Required fields: none
Optional fields: author, title, howpublished, month, year, note
phdthesis
A Ph.D. thesis.
Required fields: author, title, school, year
Optional fields: address, month, note
proceedings
The proceedings of a conference.
Required fields: title, year
Optional fields: editor, publisher, organization, address, month, note
techreport
A report published by a school or other institution, usually numbered
within a series.
Required fields: author, title, institution, year
Optional fields: type, number, address, month, note
unpublished
A document having an author and title, but not formally published.
Required fields: author, title, note
Optional fields: month, year
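The required-field lists above lend themselves to a simple check before deposit. The following Python sketch is illustrative only: the `REQUIRED` table covers just a few of the entry types, `missing_fields` is a hypothetical helper, and it assumes that an empty string counts as a missing value. A tuple stands for alternatives such as author/editor.

```python
# Required fields for a few entry types, copied from the lists above.
# A tuple means "at least one of these".
REQUIRED = {
    "article": ["author", "title", "journal", "year"],
    "book": [("author", "editor"), "title", "publisher", "year"],
    "inbook": [("author", "editor"), "title", ("chapter", "pages"),
               "publisher", "year"],
    "phdthesis": ["author", "title", "school", "year"],
    "misc": [],
}


def missing_fields(record):
    """Return the required fields that are absent or empty in a record."""
    missing = []
    for requirement in REQUIRED.get(record.get("type"), []):
        alternatives = requirement if isinstance(requirement, tuple) else (requirement,)
        if not any(record.get(name) for name in alternatives):
            missing.append("/".join(alternatives))
    return missing
```

A client might call `missing_fields(record)` on each entry and warn the user about incomplete entries before calling the repository.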
Sample JSON-Bibtex file
JSON (JavaScript Object Notation) is a simple, language-independent notation for structured data. It is based on a subset of JavaScript and includes objects (name / value pairs), arrays, and simple data types (strings, numbers). The JSON-Bibtex structure is simply a set of name / value pairs:
{
"refid":"3",
"type":"article",
// Standard bibtex fields:
"title":"Catalyzer: a novel tool for integrating, managing and publishing \
heterogeneous bioscience data",
"year":"2006",
"author":"F W Howell and R C Cannon and N H Goddard",
"journal":"Concurrency Computat.: Pract. Exper.",
"volume":"19",
"number":"",
"pages":"207-221",
"month":"",
// Extended bibtex fields:
"doi":"10.1002\/cpe.1044",
"pubmed":"",
"pdflink":"",
"urllink":"http:\/\/www3.interscience.wiley.com\/cgi-bin\/abstract\/112608290\/ABSTRACT?CRETRY=1&SRETRY=0",
"abstract":"The integrative ambitions of systems biology and neuroinformatics - (...)",
"note":"",
"keywords":"XML, Semantic Web, e-Science, biological databases, data publication, data sharing"
}
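Strict JSON parsers do not accept the // comments or the backslash line continuations used in the listing above; these appear to be there for readability only. A record produced by a standard JSON library avoids both. The Python sketch below uses a shortened copy of the sample record and shows that the `\/` escape seen in the listing decodes to a plain `/`.

```python
import json

# A shortened copy of the sample record; values are ordinary strings.
record = {
    "refid": "3",
    "type": "article",
    "title": "Catalyzer: a novel tool for integrating, managing and publishing "
             "heterogeneous bioscience data",
    "year": "2006",
    "doi": "10.1002/cpe.1044",
}

encoded = json.dumps(record)   # one line of strict JSON, no comments
decoded = json.loads(encoded)  # round-trips to an equal dictionary

# "\/" is a legal JSON escape for "/", so both spellings decode identically.
escaped_doi = json.loads('"10.1002\\/cpe.1044"')
```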
SWORD single item full text deposit
SWORD is an interface based on the Atom Publishing Protocol which provides scope for wrapping metadata in arbitrary XML formats. A deposit is an HTTP POST of a zip file containing the full text and an XML metadata file in METS format. We have chosen to embed JSON-Bibtex messages as the metadata payload of the METS file:
// file.zip: contents
json.xml
pdf1.pdf
// Sample 'json.xml' file:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<mets ID="sort-mets_mets" OBJID="sword-mets" LABEL="DSpace SWORD Item"
PROFILE="DSpace METS SIP Profile 1.0" xmlns="http://www.loc.gov/METS/"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
<metsHdr CREATEDATE="2007-09-01T00:00:00">
<agent ROLE="CUSTODIAN" TYPE="ORGANIZATION">
<name>Textensor</name>
</agent>
</metsHdr>
<dmdSec ID="sword-mets-dmd-1" GROUPID="sword-mets-dmd-1_group-1">
<mdWrap LABEL="JSON Metadata" MDTYPE="OTHER" OTHERMDTYPE="JSON-BIBTEX"
MIMETYPE="text/xml">
<jsonData>
{"refid":"4",
"type":"article",
"title":"A large scale model of the cerebellar cortex using PGENESIS",
"year":"2000",
"author":"Howell, F and Dyhrfjeld-Johnsen, J",
"journal":"Neurocomputing",
"volume":"32",
"number":"",
"pages":"1041-1046",
"month":"","doi":"","pubmed":"","pdflink":"","urllink":"",
"abstract":"",
"note":"",
"keywords":""}
</jsonData>
</mdWrap>
</dmdSec>
<fileSec>
<fileGrp ID="sword-mets-fgrp-1" USE="CONTENT">
<file GROUPID="sword-mets-fgid-0" ID="sword-mets-file-1"
MIMETYPE="application/pdf">
<FLocat LOCTYPE="URL" xlink:href="pdf1.pdf" />
</file>
</fileGrp>
</fileSec>
<structMap ID="sword-mets-struct-1" LABEL="structure"
TYPE="LOGICAL">
<div ID="sword-mets-div-1" DMDID="sword-mets-dmd-1" TYPE="SWORD Object">
<div ID="sword-mets-div-2" TYPE="File">
<fptr FILEID="sword-mets-file-1" />
</div>
</div>
</structMap>
</mets>
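Because the JSON sits inside an XML element, any `&`, `<` or `>` in a title or abstract has to be escaped before it is placed in jsonData. The Python sketch below shows that step and the packaging of the zip. The function names are hypothetical, the METS text is assumed to be built elsewhere, and the file names follow the listing above.

```python
import io
import json
import zipfile
from xml.sax.saxutils import escape


def json_payload(record):
    """Serialise a record and escape it for use as XML character data."""
    return escape(json.dumps(record))


def build_package(mets_xml, pdf_bytes):
    """Return the bytes of a zip file holding json.xml and pdf1.pdf."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("json.xml", mets_xml)
        archive.writestr("pdf1.pdf", pdf_bytes)
    return buffer.getvalue()
```

The resulting bytes would form the body of the SWORD POST. The repository is then expected to unescape the element text before decoding the JSON.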
The return value is an HTTP error code if something went wrong with authentication, or an Atom XML document like:
<?xml version="1.0" encoding="UTF-8"?>
<atom:entry xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/">
<atom:id>162</atom:id>
<atom:author>
<atom:name>kiz</atom:name>
<atom:email>ian@extremewashingup.org</atom:email>
</atom:author>
<atom:content type="text/xml" src="http://art-abstracts.edina.ac.uk/app/collections/162/ServletClient-0"/>
<atom:link href="http://art-abstracts.edina.ac.uk/app/collections/162"
rel="edit-media"/>
<atom:link href="http://art-abstracts.edina.ac.uk/app/collections/162.atom"
rel="edit"/>
<atom:summary type="text">The integrative ambitions
of systems biology an...</atom:summary>
<atom:title>Catalyzer: a novel tool for integrating,
managing and publishing heterogeneous bioscience
data</atom:title>
<atom:source>
<atom:generator uri="http://art-abstracts.edina.ac.uk">the
Depot</atom:generator>
</atom:source>
<atom:updated>2008-06-03T15:01:23Z</atom:updated>
<sword:treatment>Deposited items will remain in your
work area until you log into the depot and deposit
them.</sword:treatment>
<sword:formatNamespace>JSON</sword:formatNamespace>
</atom:entry>
The useful return values are <atom:id>, which holds the submission ID, and the <atom:updated> timestamp.
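A client could pull these two values out of the response with a standard XML parser. This is a minimal Python sketch; the function name `parse_deposit_response` is hypothetical, and it assumes the response is a single Atom entry like the one shown above.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree notation


def parse_deposit_response(xml_text):
    """Extract the entry ID and the updated timestamp from an Atom entry."""
    entry = ET.fromstring(xml_text)
    return {
        "id": entry.findtext(ATOM + "id"),
        "updated": entry.findtext(ATOM + "updated"),
    }
```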
Status and ID fields for each entry
The repository will create its own unique IDs for entries, and may also have a workflow in which entries are submitted, checked, then approved and given a final accession ID. The client will maintain its own IDs for entries too; the field names used for passing these different IDs are as follows:
repo_submissionid --- repository's internal item identifier
repo_accessionid --- could be same as submissionid, or only allocated when accepted
repo_accessionurl --- final URL of entry in repository (the metadata page)
repo_submissionurl --- optional: URL which takes you to the edit page for the submission
repo_modified --- optional: a timestamp (GMT) when this entry was last modified
in the repository e.g. "2008-07-25 15:48:49"
repo_status --- 'accepted', 'pending', 'rejected'
repo_statusmsg --- optional: reason for rejection
deposit-list
This operation deposits a list of metadata items in the repository. It updates any changed fields of items which are already in the repository, creates new items where necessary, and returns a status entry for each item.
// HTTP POST deposit-list
// POST parameters:
userid --- e.g. "joe@example.com" - ID of user doing deposit
list --- JSON array of metadata objects, e.g.:
[
{ // A new metadata entry to deposit:
"refid": "123", // --- client's ID for this item.
"type": "article",
"address": "",
"journal": "Tob Control",
"note": "this is a note",
"pages": "iii51-iii58",
"title": "Reductions in tobacco smoke pollution and increases in support\
for smoke-free public places following the implementation of comprehensive\
smoke-free workplace legislation in the Republic of Ireland: findings from\
the ITC Ireland\/UK Survey.",
"volume": "15 Suppl 3",
"year": "2006",
"month": "jun",
"abstract": "OBJECTIVE: To evaluate the psychosocial and ... "
},
{
// An update to perform on an item already in the repository:
"repo_submissionid": 109,
"refid": "124", // --- client's ID for this item.
"type": "article",
"address": "",
"journal": "Tob Control",
"note": "this is a note",
"pages": "iii51-iii58",
"title": "Reductions in tobacco smoke pollution and increases in support\
for smoke-free public places following the implementation of comprehensive\
smoke-free workplace legislation in the Republic of Ireland: findings from\
the ITC Ireland\/UK Survey.",
"volume": "15 Suppl 3",
"year": "2006",
"month": "jun",
"abstract": "OBJECTIVE: To evaluate the psychosocial and ... "
}
]
// The return value is an array of repository status messages
// so the client can update its pointers for the item to point
// at the repository copy.
[
{
"client_refid": "123", // Client's reference for this item
"repo_submissionid": 109,
"repo_modified": "2008-07-25 15:48:49",
"repo_accessionid": 26,
"repo_accessionurl": "show-list.php?userid=fred@textensor.com&accessionid=26",
"repo_submissionurl": "show-list.php?userid=fred@textensor.com&submissionid=109",
"repo_status": "accepted",
"repo_statusmsg": ""
}
// , { ... return values for other submissions ... }
]
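On the client side, the request body and the handling of the returned status array might look roughly like this. It is a Python sketch under stated assumptions: the function names are hypothetical, the POST is assumed to be form-encoded, and IDs are compared as strings in case a server returns the client's ID as a number.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_deposit_body(userid, records):
    """Form-encode the two POST parameters, userid and list."""
    return urlencode({"userid": userid, "list": json.dumps(records)}).encode("ascii")


def apply_statuses(local_records, statuses):
    """Copy the repo_* fields from each status entry onto the matching local record."""
    by_refid = {str(record["refid"]): record for record in local_records}
    for status in statuses:
        record = by_refid.get(str(status["client_refid"]))
        if record is None:
            continue  # status for an item this client does not know about
        for key, value in status.items():
            if key.startswith("repo_"):
                record[key] = value
    return local_records
```

After `apply_statuses`, each local record carries its `repo_submissionid`, so the next deposit-list call is treated as an update rather than a new submission.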
Mapping of JSON-Bibtex fields to repository internal fields
The repository is likely to use its own metadata field names rather than the Bibtex ones. A mapping is therefore needed on upload to convert the bibtex-style data into the repository's fields.
The conversion is equivalent to the existing import modules for bibtex
(e.g. for eprints: perl_lib/EPrints/Plugin/Import/BibTex.pm)
It is possible that the conversion will be lossy if the repository does
not support the same rich metadata as bibtex.
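As an illustration of such a mapping, and of where loss can occur, the Python sketch below renames fields through a lookup table and reports any non-empty field that has no counterpart. The repository field names in `FIELD_MAP` are invented for the example; a real mapping would come from the repository's own import module.

```python
# Hypothetical mapping from bibtex field names to repository field names.
FIELD_MAP = {
    "title": "title",
    "journal": "publication",
    "year": "date",
    "abstract": "abstract",
}


def to_repository_fields(record, field_map=FIELD_MAP):
    """Return (mapped fields, names of non-empty fields that could not be mapped)."""
    mapped, dropped = {}, []
    for name, value in record.items():
        if name in field_map:
            mapped[field_map[name]] = value
        elif value not in ("", None):
            dropped.append(name)  # this information would be lost on import
    return mapped, sorted(dropped)
```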
The synchronise operation
The synchronise algorithm which the deposit-list
method needs to implement is:
$userid = $_POST["userid"];
// Decode to associative arrays (second argument true) so fields can be read with [ ]
$clientlistobj = json_decode( $_POST["list"], true ); // get a list of metadata objects
$ret = array();
foreach ($clientlistobj as $clientmeta) {
    if (isset($clientmeta["repo_submissionid"])) {
        // it's an update of an existing entry
        $repoentry = updateExistingEntryFromClient( $clientmeta["repo_submissionid"], $clientmeta );
    }
    else {
        $repoentry = createNewSubmission( $clientmeta );
    }
    // Assumption: $repoentry is an associative array. The key names read from it
    // below are placeholders for whatever accessors the repository provides.
    $ret[] = array(
        "client_refid"       => $clientmeta["refid"],
        "repo_modified"      => $repoentry["modified"],      // date this metadata entry last modified at repo end
        "repo_submissionid"  => $repoentry["submissionid"],  // submission ID for repoentry
        "repo_submissionurl" => $repoentry["submissionurl"], // URL of edit screen for entry
        "repo_accessionid"   => $repoentry["accessionid"],   // accession ID in repository (can be '')
        "repo_accessionurl"  => $repoentry["accessionurl"],  // URL of final entry in repository (can be '')
        "repo_status"        => $repoentry["status"],        // "accepted", "pending" or "rejected"
        "repo_statusmsg"     => $repoentry["statusmsg"]      // "" or reason for rejection
    );
}
// Return the status of each submission / update
print json_encode($ret);
fetch-list
Fetches a list of metadata entries for a particular user.
// usage: GET fetch-list?userid=joe@example.com
// sample return value:
[
{
"repo_submissionid": 109,
"repo_modified": "2008-07-25 15:48:49",
"repo_accessionid": 26,
"repo_accessionurl": "show-list.php?userid=fred@textensor.com&accessionid=26",
"repo_submissionurl": "show-list.php?userid=fred@textensor.com&submissionid=109",
"repo_status": "accepted",
"repo_statusmsg": "",
// These are the fields mapped from internal repository field names to bibtex
"type": "article",
"address": "",
"journal": "Tob Control",
"note": "this is a note",
"pages": "iii51-iii58",
"title": "Reductions in tobacco smoke pollution and increases in support\
for smoke-free public places following the implementation of comprehensive\
smoke-free workplace legislation in the Republic of Ireland: findings from\
the ITC Ireland\/UK Survey.",
"volume": "15 Suppl 3",
"year": "2006",
"month": "jun",
"abstract": "OBJECTIVE: To evaluate the psychosocial and ... "
}
// , { ... more entries ... }
]
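A client might build the request URL and index the result by submission ID as follows. This is a Python sketch only: the base URL is a placeholder, the function names are hypothetical, and the response is assumed to be strict JSON without the comments shown in the listing.

```python
import json
from urllib.parse import urlencode


def fetch_list_url(base_url, userid):
    """Build the fetch-list URL; urlencode escapes the '@' in the user ID."""
    return base_url.rstrip("/") + "/fetch-list?" + urlencode({"userid": userid})


def index_by_submission(response_text):
    """Decode the JSON array and key each entry by its repo_submissionid."""
    return {entry["repo_submissionid"]: entry for entry in json.loads(response_text)}
```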
Updating client values from the repository
If the repository metadata values are edited, the client
application can show the two versions to the user and
let them decide whether to include the edits. The
value of the repo_modified timestamp is useful
as the client can store the last known modified time of the
repository copy, so it knows when something has changed.
One important change is the repo_status field
changing from 'pending' to 'accepted' for moderated repositories;
the client can then update its pointers to the repository
entry using the repo_accessionurl.
The fetch-list API call exports the repository metadata
to bibtex, and some metadata may be lost in translation:
if the client initially submits bibtex-style metadata,
the repository imports that to its own format and then exports it again, and each conversion
has the potential to lose some information. Because of this, it is
important to set the repo_modified timestamp, so that the client can tell a genuine
edit at the repository from a difference introduced by conversion.
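The timestamp comparison could be sketched in Python as follows. The function names and the shape of `last_seen` (a dictionary from submission ID to the last stored repo_modified string) are assumptions; the timestamp format follows the "2008-07-25 15:48:49" example and is treated as GMT.

```python
from datetime import datetime, timezone


def parse_repo_modified(value):
    """Parse a repo_modified string such as "2008-07-25 15:48:49" as GMT."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)


def changed_entries(last_seen, fetched):
    """Return fetched entries that are new or modified since the stored timestamp."""
    changed = []
    for entry in fetched:
        previous = last_seen.get(entry["repo_submissionid"])
        current = parse_repo_modified(entry["repo_modified"])
        if previous is None or current > parse_repo_modified(previous):
            changed.append(entry)
    return changed
```

Only the entries returned by `changed_entries` would need to be shown to the user for review.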
Authentication
The SWORD interface uses HTTP Basic authentication (BasicAuth); the Java client / command line library is useful for hiding the complexity of the SWORD interface. The fetch-list and deposit-list API calls also use BasicAuth, with a username / password for the API user. In addition, the userid GET / POST parameter identifies the repository user (equivalent to the X-On-Behalf-Of header of SWORD).
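For clients not using the Java library, the Basic authentication header can be built with the standard library. The Python sketch below constructs a request without sending it; the function names are hypothetical and the URL and credentials passed in are the caller's own.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the Authorization header for HTTP Basic authentication."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": "Basic " + token}


def build_request(url, api_user, api_password, body=None):
    """Build (but do not send) a request: POST when a body is given, GET otherwise."""
    return urllib.request.Request(
        url, data=body, headers=basic_auth_header(api_user, api_password)
    )
```

The request can then be passed to `urllib.request.urlopen`. Since Basic authentication sends credentials in an easily decoded form, an HTTPS endpoint is advisable where the repository offers one.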