API Interfaces used
The SWORD interface uses the Atom Publishing Protocol for one task: depositing single items plus metadata into a repository. However, it does not specify which metadata format (e.g. MODS, bibtex, JSON) or packaging format (e.g. .zip, METS) should be used. It is therefore not sufficient to know that a repository has a SWORD module: it must have a SWORD module which can interpret the particular metadata and packaging format you need to send.
We needed a number of machine-to-machine interfaces in addition to SWORD to implement the workflow connecting publicationslist.org to the DEPOT. The interfaces we required were:
1. initial single item deposit
- SWORD + {METS + JSON-bibtex} sent as .zip
2. send updates / new versions of metadata
- Needed to add a metadata field for updates
3. check status of deposited items
- Needed an extra call on the repository to fetch status of items deposited
4. search repository for existing items by author
- Needed to avoid sending duplicate items to repository
5. machine interfaces for setting up user repository accounts
- Needed to automate creation of user accounts and save them having to go through another registration process
SWORD was sufficient for item 1, and could be extended simply for item 2, but all the others needed new interfaces. It is likely that these will be requirements for most other systems which deposit too, as 'fire and forget' is not sufficient.
1. Single item deposit using SWORD
For single item deposit using SWORD, we used the sample Java SWORD_1.0 client written by Stuart Lewis while he was working at Aberystwyth University:
java -cp $CLASSPATH org.purl.sword.client.ClientFactory -cmd -t post -file "$ZIPFILE" -href "$SWORDURL" -formatNamespace JSON -filetype application/zip -u $PLUSER -p $PLPASS -onBehalfOf $REPOUSER
The item was packaged as a .zip file containing METS XML, JSON-bibtex metadata, and the PDF full text version (if available).
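As an illustration of this packaging step, a minimal Python sketch is given below. It is not the code used in the project: the member names inside the archive (mets.xml, metadata.json, fulltext.pdf) and the function name are assumptions, and a real deposit would have to use whatever names the repository's import module expects.

```python
import io
import json
import zipfile


def build_deposit_package(mets_xml, metadata, pdf_bytes=None):
    """Return the bytes of a .zip holding METS XML, JSON metadata and an optional PDF."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        # Member names are hypothetical; use the names the repository importer expects.
        package.writestr("mets.xml", mets_xml)
        package.writestr("metadata.json", json.dumps(metadata, indent=2))
        if pdf_bytes is not None:
            # Full text is optional: metadata-only deposits simply omit it.
            package.writestr("fulltext.pdf", pdf_bytes)
    return buffer.getvalue()


metadata = {
    "type": "article",
    "author": "Howell, F W; Stuart, I M",
    "title": "Tuning the Lucas Distributor",
}
package_bytes = build_deposit_package("<mets/>", metadata)
with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
    print(package.namelist())
```

Building the archive in memory avoids temporary files when the package is handed straight to an HTTP client.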
We created a privileged user on the repository for publicationslist.org; this user was permitted to submit papers on behalf of other users, so to use SWORD we just needed to know a researcher's user name on the repository system.
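For readers who prefer not to use the Java client, the same deposit can be sketched as a plain HTTP POST. The Python example below only builds the request and does not send it. The header names X-On-Behalf-Of and X-Format-Namespace are assumptions based on the SWORD 1.x profile and should be checked against the version your repository implements; the endpoint URL, user names and function name are hypothetical.

```python
import base64
import urllib.request


def build_sword_deposit(sword_url, package_bytes, pl_user, pl_pass, repo_user):
    """Build (but do not send) a POST of a .zip package on behalf of repo_user."""
    credentials = base64.b64encode(f"{pl_user}:{pl_pass}".encode("utf-8")).decode("ascii")
    headers = {
        "Content-Type": "application/zip",
        # The privileged publicationslist.org user authenticates with basic auth.
        "Authorization": "Basic " + credentials,
        # Assumed SWORD 1.x header names; verify against the server in use.
        "X-On-Behalf-Of": repo_user,
        "X-Format-Namespace": "JSON",
    }
    return urllib.request.Request(
        sword_url, data=package_bytes, headers=headers, method="POST"
    )


request = build_sword_deposit(
    "http://repository.example.edu/sword/deposit",  # hypothetical endpoint
    b"zip bytes go here",
    "pluser",
    "secret",
    "a.n.other",
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would perform the deposit; an HTTP error status is raised as urllib.error.HTTPError, whose body can be logged to help with the debugging problems described below.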
Issues we encountered:
- Debugging SWORD deposits remotely was impossible: if there was a minor problem with the metadata (e.g. a comma in the wrong place), the module just returned an HTTP error code, with no hint as to line numbers or detail of the problem. It either worked fine or not at all. In our case, we were forced to log in to the remote server and read the Apache error logs; it would be better if error messages were sent back over the wire.
- Because SWORD itself does not specify a metadata or packaging format, each user is likely to use a different format. We wanted a simple format to send bibtex-style metadata along with each paper, but none of the examples we looked at was suitable, so we developed a JSON encoding of bibtex-style metadata.
- The Atom Publishing Protocol seemed overly complicated for what we needed (essentially an HTTP POST of data with metadata), so we used the Java client library rather than native PHP. Atom has been a great success when used for simple news feeds as a replacement for RSS, but there are simpler alternatives for posting data.
- The submission seemed to be somewhat slow: several seconds per item. This must be caused by an inefficient implementation at the SWORD client or ePrints module end rather than by network delays. A batch mode (sending tens of items in a single request) would be useful too.
Despite these issues, and the somewhat experimental nature of the sample SWORD module for ePrints, we were able to get single item submission working via SWORD.
2. Sending updates / corrections to repository entries
The initial version of SWORD specifically excluded support for updates to existing entries, but this feature was essential for our use, as users would correct or update metadata entries over time, and attach full text some time after initially entering metadata. We implemented this feature by setting an eprintid field when sending an updated version of the metadata for an item. This was tied in to the ePrints versioning feature, which creates a new repository ID for the new version of the item and maintains copies of all versions. It would be better if updates were supported in SWORD and if versioning were handled more elegantly by ePrints than at present.
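A minimal Python sketch of marking a metadata record as an update is shown below. Only the eprintid field name comes from the description above; the function name and the choice to send the ID as a string are assumptions for illustration.

```python
import json


def as_update(metadata, eprintid):
    """Return a copy of the metadata marked as a new version of an existing item."""
    updated = dict(metadata)  # leave the caller's record untouched
    updated["eprintid"] = str(eprintid)  # repository ID of the entry being updated
    return updated


original = {"type": "article", "title": "Tuning the Lucas Distributor"}
print(json.dumps(as_update(original, 384), indent=2))
```

A first deposit would send the record without eprintid; later corrections would send the same record with the ID the repository assigned.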
3. Fetching status of deposited items
SWORD is only concerned with sending single items, but we needed to be able to check the status of items deposited by a user, i.e. do a metadata fetch from the repository. Many repositories have an approval process: users submit items, and repository staff check and correct the metadata some time later. The publications list needs to be updated with the correct links once a repository item has been accepted. For this we needed an extra interface: "fetch latest status of items deposited by user X".
We implemented this as a new module for ePrints which returned the status of deposited items as an Atom feed. Sample call using curl to do an HTTP GET with basic authentication (shown on multiple lines for clarity):
curl -u $user:$pass "http://sample.repository.edu/m2m/PLOrg?
output=PLOrg_Status // Output format required (Atom/JSON-bibtex)
&user=a.n.other" // Repository user name
The return value is an Atom feed like the one below:
<?xml version="1.0" encoding="utf-8" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>the Depot: Authors/Editors is "howell" </title>
<link href="http://art-abstracts.edina.ac.uk/"></link>
<updated>2009-03-19T13:47:57Z</updated>
<id>tag:repository.edina.ac.uk,2009:feed:PublicationsList.Org - status list</id>
<entry>
<id>tag:repository.edina.ac.uk:item:/391</id>
<category>accepted</category>
<link>repository.edina.ac.uk/391</link>
<updated>2009-03-03T15:01:52Z</updated>
<content>
{
"eprintid":"391",
"succeeds":"384",
"author":"Howell, F W; Stuart, I M",
"type":"article",
"abstract":"....",
"title":"Tuning the Lucas Distributor"
// ... for all fields see metadata page.
}
</content>
</entry>
<entry>
<!-- ... more entries -->
</entry>
</feed>
We chose an Atom feed as the format of the response, partly because repositories already include support for generating Atom output. For the structured metadata for each article, we chose a JSON encoding, sent in the content field of each Atom entry. This is simpler and more flexible than devising a new XML format for encoding bibliographic metadata.
We used the standard Atom fields id and updated, and used the category field to carry the status of the deposited item (accepted, rejected or pending).
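To show how a client might consume such a feed, here is a minimal Python sketch that reads the category and the JSON content of each entry. It assumes the content element holds strict JSON (without the explanatory // comment shown in the sample above), and the function name is hypothetical.

```python
import json
import xml.etree.ElementTree as ET

ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def parse_status_feed(feed_xml):
    """Return a list of (status, metadata dict) pairs, one per Atom entry."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall("atom:entry", ATOM):
        content = entry.findtext("atom:content", default="", namespaces=ATOM).strip()
        if not content:
            continue  # skip entries that carry no JSON payload
        status = entry.findtext("atom:category", default="", namespaces=ATOM).strip()
        results.append((status, json.loads(content)))
    return results


sample = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <id>tag:repository.edina.ac.uk:item:/391</id>
    <category>accepted</category>
    <content>{"eprintid": "391", "succeeds": "384", "type": "article"}</content>
  </entry>
</feed>"""

for status, item in parse_status_feed(sample):
    print(item["eprintid"], status)
```

The succeeds field links a new version back to the entry it replaces, so a client can update the stored repository link when an updated item is accepted.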
4. Search repository for existing items by author
Before submitting papers, it would be useful if a researcher could search the local repository (or even all repositories) to see which of their papers have already been stored by a co-author. This would ideally involve a search and select interface where the researcher can look for publications matching their name, and select the ones which are really theirs (rather than by another author with the same name). A similar interface for searching the PubMed database using the NCBI eUtils web service has proven popular with publicationslist.org users [publicationslist.org/pubmed.html].
The interface needs to return rich bibtex-style metadata if the resulting data is to be used to populate a publications list; Dublin Core (e.g. as returned by OAI-PMH) is not sufficient.
We developed an additional module for ePrints to perform this task, which does a simple search on author name and returns an Atom feed with rich JSON metadata for the items.
curl -u $user:$pass "http://repository.edina.ac.uk/cgi/m2m/authors?
name=einstein,a" // Author name supplied as surname,initials
This returns Atom / JSON format search results in the same format as searches by depositor ID (section 3 above). This machine interface would be sufficient for implementing a 'search/select/import' GUI similar to the one used for PubMed for importing data from a repository into publicationslist.org, but we didn't implement a GUI for this within the scope of this project. A production implementation of such a search API would need additional parameters to paginate return values for very large numbers of search results (only likely to be an issue for subject repositories spanning more than one university).
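Once the search results have been decoded into JSON records, avoiding duplicate deposits comes down to comparing them with the local publications list. The Python sketch below matches on a normalised title, which is an assumption made for illustration; a production system would probably also compare authors, year or identifiers. The function names are hypothetical.

```python
import re


def normalise_title(title):
    """Lower-case a title and collapse punctuation and whitespace for rough comparison."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def not_yet_deposited(local_items, repository_items):
    """Return the local items whose titles are absent from the repository results."""
    known = {normalise_title(item["title"]) for item in repository_items}
    return [item for item in local_items if normalise_title(item["title"]) not in known]


local_items = [
    {"title": "Tuning the Lucas Distributor"},
    {"title": "A Second Paper"},
]
repository_items = [{"title": "Tuning the Lucas distributor."}]

for item in not_yet_deposited(local_items, repository_items):
    print(item["title"])
```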
5. Creating repository accounts using web services
Users are already authenticated with PublicationsList.org, and typically will not already have an account on their institutional repository. We wanted to be able to automate account creation, so users could just press a 'Click to create repository account' link and not have to fill out registration forms. We created a new module create_acc to create user accounts on the repository.
curl -u $user:$pass "http://repository.edina.ac.uk/cgi/create_acc
?username=test1000 // suggested repository user name
&email=test1000@example.com // user's email
&password=0DMwbBTkPUz8I" // user's password for new account (crypt() version)
The return value is a simple XML document. On success:
<returns>
<return>
<result>OK</result>
<message>User created</message>
<code>0</code>
<email>test1001@example.com</email>
<username>test1000</username>
</return>
</returns>
If either user name or email is in use, it returns error messages:
<returns>
<return>
<result>err</result>
<message>email in use: test1000@example.com => test1000</message>
<email>test1000@example.com</email>
<username>test1000</username>
<code>3</code>
</return>
<return>
<result>err</result>
<message>username in use: fwh => fred@textensor.com</message>
<email>fred@textensor.com</email>
<username>fwh</username>
<code>2</code>
</return>
</returns>
A user interface can use these return values to prompt the user for a different user name (if it is taken), or to find out the existing repository user name for the user's email (if they already have a repository account).
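A client might turn these responses into next steps along the following lines. This is a minimal Python sketch: the code values 0, 2 and 3 are taken from the samples above, while the action names and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Codes as they appear in the sample responses; the action names are illustrative.
ACTIONS = {
    "0": "created",
    "2": "choose_other_username",
    "3": "use_existing_account",
}


def interpret_create_account(response_xml):
    """Map each <return> element to an (action, username, email) tuple."""
    outcomes = []
    for ret in ET.fromstring(response_xml).findall("return"):
        code = ret.findtext("code", default="").strip()
        outcomes.append((
            ACTIONS.get(code, "unknown_error"),
            ret.findtext("username", default=""),
            ret.findtext("email", default=""),
        ))
    return outcomes


sample = """<returns>
  <return>
    <result>err</result>
    <message>email in use: test1000@example.com => test1000</message>
    <email>test1000@example.com</email>
    <username>test1000</username>
    <code>3</code>
  </return>
</returns>"""

for action, username, email in interpret_create_account(sample):
    print(action, username, email)
```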
6. List of the interface modules used
A sample JSON configuration file is given below: these are the web services interfaces needed to configure PublicationsList.org to use a particular repository. One of the links is the SWORD module (the depositurl); the others are for the other machine interfaces needed to implement the workflow.
{
// The main website of the repository:
"url" : "http://repository.edina.ac.uk/",
// A short description:
"desc" : "The DEPOT",
// The URL for users to create accounts manually:
"registerurl" : "http://repository.edina.ac.uk/cgi/register",
// The SWORD endpoint for deposit:
"depositurl" : "http://repository.edina.ac.uk/app/deposit/archive",
// The interface for getting status of items deposited by a user
"statusurl" : "http://repository.edina.ac.uk/cgi/m2m/PLOrg?output=Atom_JSON&user=",
// The URL for creating new repository accounts
"createaccounturl" : "http://repository.edina.ac.uk/cgi/m2m/create_acc",
// The URL for searching repository by author name
"authorsearchurl" : "http://repository.edina.ac.uk/cgi/m2m/authors?output=Atom_JSON&name=",
// The username / password for admin user on repository:
"pluser" : "admin",
"plpass" : "xxxxx"
}
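Because the sample file contains // comment lines, a strict JSON parser will reject it as written. The Python sketch below shows one way a client might load such a file and build a status URL from it. Dropping whole-line comments before parsing is an assumption about how the file is read, and the function names are hypothetical.

```python
import json
from urllib.parse import quote


def load_config(text):
    """Parse a JSON config after dropping whole-line // comments."""
    kept = [line for line in text.splitlines() if not line.lstrip().startswith("//")]
    return json.loads("\n".join(kept))


def status_url(config, repo_user):
    """Append the URL-encoded repository user name to the status URL prefix."""
    return config["statusurl"] + quote(repo_user, safe="")


config_text = """{
  // The interface for getting status of items deposited by a user
  "statusurl" : "http://repository.edina.ac.uk/cgi/m2m/PLOrg?output=Atom_JSON&user="
}"""

config = load_config(config_text)
print(status_url(config, "a.n.other"))
```

Only lines that begin with // are removed, so the // inside each URL value is left untouched.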
7. Summary
In order to implement a smooth workflow for maintaining a publications list on publicationslist.org and synchronising it with deposits in a repository, we were able to use SWORD but also needed a small number of extra machine interfaces for fetching the status of deposited items, searching the repository by author and creating user accounts. We also used a simple JSON encoding for sending bibtex-style metadata across the wire, along with Atom feeds.