Experiences using SWORD for batch deposit
SWORD is an interface based on the Atom Publishing Protocol that allows metadata to be wrapped in arbitrary XML formats. A deposit is an HTTP POST of a zip file containing the full text and an XML metadata file in METS format. We chose to embed JSON-Bibtex records as the metadata payload of the METS file, because a SWORD module for ePrints already supported METS.
// file.zip: contents
json.xml
pdf1.pdf
// Sample 'json.xml' file:
<?xml version="1.0" encoding="utf-8" standalone="no"?>
<mets ID="sort-mets_mets" OBJID="sword-mets" LABEL="DSpace SWORD Item"
PROFILE="DSpace METS SIP Profile 1.0" xmlns="http://www.loc.gov/METS/"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
<metsHdr CREATEDATE="2007-09-01T00:00:00">
<agent ROLE="CUSTODIAN" TYPE="ORGANIZATION">
<name>Textensor</name>
</agent>
</metsHdr>
<dmdSec ID="sword-mets-dmd-1" GROUPID="sword-mets-dmd-1_group-1">
<mdWrap LABEL="JSON Metadata" MDTYPE="OTHER" OTHERMDTYPE="JSON-BIBTEX"
MIMETYPE="text/xml">
<jsonData>
{"refid":"4",
"type":"article",
"title":"A large scale model of the cerebellar cortex using PGENESIS",
"year":"2000",
"author":"Howell, F; Dyhrfjeld-Johnsen, J",
"journal":"Neurocomputing",
"volume":"32",
"number":"",
"pages":"1041-1046",
"month":"","doi":"","pubmed":"","pdflink":"","urllink":"",
"abstract":"",
"note":"",
"keywords":""}
</jsonData>
</mdWrap>
</dmdSec>
<fileSec>
<fileGrp ID="sword-mets-fgrp-1" USE="CONTENT">
<file GROUPID="sword-mets-fgid-0" ID="sword-mets-file-1"
MIMETYPE="application/pdf">
<FLocat LOCTYPE="URL" xlink:href="pdf1.pdf" />
</file>
</fileGrp>
</fileSec>
<structMap ID="sword-mets-struct-1" LABEL="structure"
TYPE="LOGICAL">
<div ID="sword-mets-div-1" DMDID="sword-mets-dmd-1" TYPE="SWORD Object">
<div ID="sword-mets-div-2" TYPE="File">
<fptr FILEID="sword-mets-file-1" />
</div>
</div>
</structMap>
</mets>
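A minimal sketch, in Python, of how a package like the one above might be assembled and wrapped in a POST request. It is an illustration only, not the Java client used in this project. The function names (build_package, build_deposit_request), the deposit URL, and the use of HTTP Basic authentication are assumptions; any SWORD-specific request headers the server expects are left out because they are not listed here. The JSON is XML-escaped before embedding, as a precaution against characters such as '&' or '<' in titles or abstracts.

```python
import base64
import io
import json
import urllib.request
import zipfile
from xml.sax.saxutils import escape, quoteattr

# METS wrapper mirroring the sample 'json.xml'; {json} and {pdf} are filled in below.
METS_TEMPLATE = """<?xml version="1.0" encoding="utf-8" standalone="no"?>
<mets ID="sort-mets_mets" OBJID="sword-mets" LABEL="DSpace SWORD Item"
  PROFILE="DSpace METS SIP Profile 1.0" xmlns="http://www.loc.gov/METS/"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.loc.gov/METS/ http://www.loc.gov/standards/mets/mets.xsd">
  <metsHdr CREATEDATE="2007-09-01T00:00:00">
    <agent ROLE="CUSTODIAN" TYPE="ORGANIZATION"><name>Textensor</name></agent>
  </metsHdr>
  <dmdSec ID="sword-mets-dmd-1" GROUPID="sword-mets-dmd-1_group-1">
    <mdWrap LABEL="JSON Metadata" MDTYPE="OTHER" OTHERMDTYPE="JSON-BIBTEX"
      MIMETYPE="text/xml">
      <jsonData>{json}</jsonData>
    </mdWrap>
  </dmdSec>
  <fileSec>
    <fileGrp ID="sword-mets-fgrp-1" USE="CONTENT">
      <file GROUPID="sword-mets-fgid-0" ID="sword-mets-file-1"
        MIMETYPE="application/pdf">
        <FLocat LOCTYPE="URL" xlink:href={pdf} />
      </file>
    </fileGrp>
  </fileSec>
  <structMap ID="sword-mets-struct-1" LABEL="structure" TYPE="LOGICAL">
    <div ID="sword-mets-div-1" DMDID="sword-mets-dmd-1" TYPE="SWORD Object">
      <div ID="sword-mets-div-2" TYPE="File">
        <fptr FILEID="sword-mets-file-1" />
      </div>
    </div>
  </structMap>
</mets>
"""


def build_package(record: dict, pdf_name: str, pdf_bytes: bytes) -> bytes:
    """Return the bytes of a zip holding 'json.xml' (METS + JSON) and the PDF."""
    mets = METS_TEMPLATE.format(
        json=escape(json.dumps(record)),  # escape &, <, > inside the JSON text
        pdf=quoteattr(pdf_name),          # quoteattr adds the surrounding quotes
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("json.xml", mets)
        zf.writestr(pdf_name, pdf_bytes)
    return buf.getvalue()


def build_deposit_request(url: str, package: bytes, user: str, password: str):
    """Build (but do not send) a POST request carrying the zip package.

    HTTP Basic authentication is an assumption; the server may require
    additional SWORD-specific headers that are not set here.
    """
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/zip",
        "Authorization": "Basic " + token,
    }
    return urllib.request.Request(url, data=package, headers=headers, method="POST")
```

Passing the returned request to urllib.request.urlopen would send the deposit; urlopen raises urllib.error.HTTPError when the server answers with an HTTP error code.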
The return value is an HTTP error code if something went wrong with authentication, or an Atom entry document such as:
<?xml version="1.0" encoding="UTF-8"?>
<atom:entry xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sword="http://purl.org/net/sword/">
<atom:id>162</atom:id>
<atom:author>
<atom:name>user123</atom:name>
<atom:email>user123@example.com</atom:email>
</atom:author>
<atom:content type="text/xml"
src="http://repository.edina.ac.uk/app/collections/162/ServletClient-0"/>
<atom:link href="http://repository.edina.ac.uk/app/collections/162"
rel="edit-media"/>
<atom:link href="http://repository.edina.ac.uk/app/collections/162.atom"
rel="edit"/>
<atom:summary type="text">The integrative ambitions
of systems biology an...</atom:summary>
<atom:title>Catalyzer: a novel tool for integrating,
managing and publishing heterogeneous bioscience
data</atom:title>
<atom:source>
<atom:generator uri="http://repository.edina.ac.uk">the
Depot</atom:generator>
</atom:source>
<atom:updated>2008-06-03T15:01:23Z</atom:updated>
<sword:treatment>Deposited items will remain in your
work area until you log into the depot and deposit
them.</sword:treatment>
<sword:formatNamespace>JSON</sword:formatNamespace>
</atom:entry>
The useful return values are the <atom:id>, which is the submission ID, and the <atom:updated> timestamp. As the examples above show, the use of METS and Atom leads to a rather complex transfer format littered with XML namespace declarations; a lightweight JSON wrapper would remove the need for so much redundant information.
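Extracting those two values from the response might look like the following Python sketch. The function name parse_deposit_response is hypothetical, and the code assumes the timestamp always has the UTC form shown in the sample (ending in 'Z', no fractional seconds).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom namespace in ElementTree notation


def parse_deposit_response(body: bytes):
    """Return (submission_id, updated) from an Atom entry returned by a deposit."""
    entry = ET.fromstring(body)
    submission_id = entry.findtext(ATOM + "id")
    updated = entry.findtext(ATOM + "updated")
    if submission_id is None or updated is None:
        raise ValueError("response lacks atom:id or atom:updated")
    # Assumes the 'YYYY-MM-DDTHH:MM:SSZ' form shown in the sample response.
    stamp = datetime.strptime(updated.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return submission_id.strip(), stamp.replace(tzinfo=timezone.utc)
```

The namespace has to be spelled out in each lookup because every element in the response is namespace-qualified, which is part of the overhead noted above.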
Summary of experiences using SWORD
The main concept behind SWORD - a standard machine interface for sending items to repositories - is an extremely good idea, and an impressive number of repository developers have adopted it, especially considering that SWORD was developed as a small-scale project. In practice, we encountered a number of issues when trying to use it with the ePrints repository; we hope this experience will contribute to improvements in future versions of SWORD.
Issues using SWORD in practice
1. Packaging formats. We used METS as a wrapper to specify the PDF file because this was the method supported by the standard ePrints SWORD module. METS is overly complicated for the task, however, and it would be better if a simpler JSON format were supported.
2. Metadata formats. Again, SWORD doesn't specify a format, so each project will choose or invent its own. The JSON-Bibtex format we used was very convenient and worked well - it would be good if a format such as this were promoted as the 'standard' format to use with SWORD.
3. Use of the Atom Publishing Protocol. Atom has been extremely successful as a replacement for RSS, but for the task of posting data to a repository it seems like overkill: it necessitates a client library, rather than the built-in support most languages have for a vanilla HTTP POST. A simpler API based just on an HTTP POST of a zip file with JSON metadata would be easier for developers to use and would have a gentler learning curve, without having to wade through standards documents. Perhaps this could be considered for a future version of SWORD, which has the potential to become a Really Simple Web service Offering Repository Deposit.
4. Missing features: support for updates and versions of existing items. We implemented a way of doing this with ePrints because it was an essential feature for us, but it would be better if this were part of the standard.
5. Need for 'fetch' interfaces as well as 'put' ones: in addition to SWORD, we needed interfaces to fetch the status of deposited items over time, to do author searches of metadata, and to create accounts. The scope / branding of SWORD could be expanded to standardize such interfaces, which we found to be essential parts of the deposit process ('fire and forget' was not sufficient).
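To illustrate the simpler packaging suggested in items 1 and 3, here is a purely hypothetical Python sketch of a zip holding a plain JSON file next to the PDF. Neither the file name 'metadata.json' nor the manifest layout is part of SWORD or of any existing repository module; both are invented for this example.

```python
import io
import json
import zipfile


def build_simple_package(record: dict, pdf_name: str, pdf_bytes: bytes) -> bytes:
    """Return zip bytes with a JSON manifest in place of the METS wrapper.

    The manifest layout ('metadata' plus a 'files' list) is hypothetical.
    """
    manifest = {
        "metadata": record,  # the JSON-Bibtex record, unchanged
        "files": [{"name": pdf_name, "mimetype": "application/pdf"}],
    }
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("metadata.json", json.dumps(manifest, indent=2))
        zf.writestr(pdf_name, pdf_bytes)
    return buf.getvalue()
```

Compared with the METS wrapper, the manifest needs no namespace declarations or cross-referencing IDs, and the record needs no XML escaping.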
Issues with the SWORD module implementation for ePrints
Debugging remotely was impossible - on error the module returned only an HTTP error code, with no hint as to which line of metadata caused the problem.
Individual SWORD transfers were slow - even just transferring metadata took a number of seconds per item. We didn't have time to investigate whether this was caused by inefficiencies in the Java client or in the ePrints repository module.