So, I have a servlet (based on initial work from Tod Olson at uchicago, expanded by me) to provide holdings information from Horizon in the DLF ils-di “dlfexpanded” format. The servlet code and some documentation is available.
That’s the short statement. It turns out that you can’t really just say that without providing some more specifics, caveats, exceptions, limitations etc. Also it’s worth adding some interesting observations.
As we’re moving ahead with Blacklight, we’re going to need to have some way to get item holdings information out of Horizon. By “item holdings information” I mean “copy” information: what items do we have, what are their call numbers, what are their statuses (checked in or out, among many others), what are their locations, etc. Everything you’d need to provide an actual OPAC display telling the users what they need to know about our holdings.
A sidenote on terminology: In Horizon there are ‘items’, and sometimes a bib just has ‘items’. But sometimes a bib has different sets of items in groups — this is usually used for serials, or occasionally for multi-volume series. Horizon confusingly calls this set of items a ‘copy’. The DLF ils-di report calls it a ‘holdingset’. I have no idea what your ILS calls it. It’s a two-level hierarchy: a bib can contain one or more copies/holdingsets, each of which contains items. OR a bib can contain one or more items directly, without the intervening copy/holdingset.
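To make that hierarchy concrete, here’s a rough sketch in Java of the two shapes a bib can take. The class and field names here are my own, for illustration only — they’re not Horizon’s internal names or anything from the DLF spec.

```java
import java.util.ArrayList;
import java.util.List;

class Item {
    String itemId;
    Item(String itemId) { this.itemId = itemId; }
}

// Horizon's "copy", the DLF report's "holdingset": a named group of items.
class HoldingSet {
    String copyId;
    List<Item> items = new ArrayList<>();
    HoldingSet(String copyId) { this.copyId = copyId; }
}

class Bib {
    String bibId;
    // Either of these may be populated: items hanging directly off the bib,
    // or items grouped under copies/holdingsets.
    List<Item> directItems = new ArrayList<>();
    List<HoldingSet> holdingSets = new ArrayList<>();
    Bib(String bibId) { this.bibId = bibId; }
}
```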
And, the way most people are doing this at present (for a variety of reasons) is checking in realtime at point of demand for this info, not trying to index it. So, okay, go with the conventional wisdom. So I need a realtime service to provide this info from Horizon.
But I figure, as long as I’m doing this, MUCH better to provide the info in some standard format, instead of a custom one. Then, theoretically, the consuming code on the Blacklight end can be written to that standard format, instead of being custom for Horizon. And my understanding is that the Blacklight team has indeed been thinking/wishing for some standard stuff on the Blacklight end to consume stuff in DLF ils-di format, and/or jangle (which also typically, at the moment, uses the DLF ‘dlfexpanded’ format to actually return data in).
So, okay, that makes sense.
But DLF ils-di format is not a complete spec
So it turns out once you decide to return data in the DLF ils-di “dlfexpanded” format, you’re actually not done deciding what your data is actually going to look like.
The dlfexpanded format is just kind of a coat tree to hang your actual metadata ‘coats’ on. dlfexpanded lets you give a list of itemIDs and say they belong to a bib; it lets you give a list of holdingsets and say which itemIDs belong to them. Good so far. But to actually describe anything else about those items and holdingsets (location, call number, item status, any user-displayable notes, etc), you’ve got to include additional metadata of your own choosing — dlfexpanded gives you some hooks that it allows you to hang basically whatever other namespaced (and hopefully specified and standardized) XML you want on.
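So a dlfexpanded response is shaped roughly like this. Caveat: the element names below are illustrative, not copied from the actual dlfexpanded schema; the point is just the shape of the coat tree, not the exact vocabulary.

```xml
<!-- Illustrative only: element names are approximate, namespaces omitted.
     The shape is what matters: the wrapper relates bibs, holdingsets, and
     items, and each node has a slot where you hang whatever namespaced
     metadata payload you've chosen. -->
<record>
  <bibliographic id="bib123"/>
  <holdingset id="copy1">
    <!-- your chosen metadata payload for the copy hangs here,
         e.g. a MARC-XML MFHD -->
    <item id="item42">
      <!-- and/or per-item payloads hang here -->
    </item>
  </holdingset>
</record>
```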
So figuring out what metadata to actually use to describe everything I wanted about my Items and Copies (aka ‘holdingsets’) took a bit of investigating and thinking.
Sure, I used the dlf:simpleavailability format that dlfexpanded gives you just to say whether something is “available” or not (and provide a custom user-displayable string conveying that).
Although I ended up only providing that at the item level. The ils-di report seems to assume the client could ask for ‘availability’ at the bib or holdingset level too. But I wasn’t even sure what the semantics of this should be, and figuring out how to code this without impacting performance (more on performance later) was tricky. So, okay, the client can look at the availability on all items and figure out how to sum them up at the bib or copy level itself, if needed (I’m not sure I’ll even need to, for my use cases).
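If a client does want a bib-level ‘availability’, one plausible rollup — entirely my own convention, nothing the spec mandates — is “the bib is available if any of its items is”:

```java
import java.util.List;

class AvailabilityRollup {
    // "Available" here means whatever the item-level simpleavailability
    // said for each item. This any-item rollup is one possible semantic;
    // a client could just as reasonably require ALL items available.
    static boolean bibAvailable(List<Boolean> itemAvailabilities) {
        return itemAvailabilities.stream().anyMatch(Boolean::booleanValue);
    }
}
```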
But I want to say a lot more about my Items and Copies than simpleavailability. I want to include enough data that my complete OPAC screen could be replicated by third party software.
So after hunting around for available ‘standard’ options, I settled on good old MFHD — expressed in MARC-XML. I considered the newfangled “ISO Holdings”, but only limited public documentation exists, and from looking at the schema that is available, it didn’t look like ISO Holdings would let me express anything that MFHD didn’t. Sure, MFHD is kind of a bear for the developer to work with, with all those opaque numeric codes, but oh well, I went with the known evil, MFHD.
Except I’m not really using MFHD as is typical. I use just enough of it to express what I want. I include kind of a dummy ‘leader’ just for the sake of appearances, since there’s nothing in the leader I actually need. In standard MFHD usage, you would rarely (never?) have an individual MFHD record just for an item, but the dlfexpanded “coat tree” gives me hooks to hang MFHDs for individual items, and that makes it a lot more convenient to express and retrieve things unambiguously, so why not. So anyway, it’s MFHD, but I’m not necessarily saying any existing MFHD-processing tools will be able to do much with it, since I’m using it so unusually (although not illegally in any way, as far as I can tell). Oh well, at least it’s a standard format.
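An item-level MFHD from me looks very roughly like this — dummy leader, dummy values, and the real output carries more fields, but it gives the flavor of a MARC-XML MFHD with an 852 for location and call number:

```xml
<!-- Sketch only: dummy leader and made-up values. 852 $b is the
     sublocation/collection, $h the classification part of the call
     number, $i the item part. -->
<record xmlns="http://www.loc.gov/MARC21/slim">
  <leader>00000nx  a2200000zn 4500</leader>
  <datafield tag="852" ind1="0" ind2=" ">
    <subfield code="b">Main Library Stacks</subfield>
    <subfield code="h">R15</subfield>
    <subfield code="i">.A48</subfield>
  </datafield>
</record>
```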
Interestingly, while MFHD theoretically lets you express serial run statements in a machine readable form… A) I don’t have that info in my ILS anyway, and B) that machine readability in the way mfhd has you express it is a lot more theoretical than practical. So I’m not doing that. If my ILS had the data, I’d probably express it in the more straightforward ONIX Serial Coverage Statement instead of MFHD. (Note to ONIX people — why oh why do you only provide the actual schema in a zip file online? You used to provide it individually. Very inconvenient.)
But wait, there’s more
But to completely express all the data I’d need to duplicate my OPAC display in external software, mfhd still didn’t quite do it for me. Mostly, I wanted more internal ILS codes. mfhd lets me express ‘location’ and ‘collection’ as user-presentable strings, but I want to reveal my internal non-mutable codes for these too. mfhd doesn’t let me express the concept of ‘item type’ that’s in my catalog at all!
So after looking around some more for something to do that, I gave up and just created my own very simple XML schema for expressing internal codes and such, which I’m calling the “ILS holdings schema”, in case you want to use it too.
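Just to give the flavor of what it does — note the element names and namespace URI below are illustrative placeholders, not necessarily what the schema actually uses — it exposes the raw internal codes alongside the human-readable strings MFHD already carries:

```xml
<!-- Illustrative placeholder names, not the schema's actual vocabulary. -->
<ils:item xmlns:ils="http://example.org/ils-holdings">
  <ils:locationCode>mnstk</ils:locationCode>
  <ils:collectionCode>gen</ils:collectionCode>
  <ils:itemTypeCode>book</ils:itemTypeCode>
</ils:item>
```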
And one more plug for DAIA
And as I alluded to in my last post, I’m using DAIA too — at this point solely to expose the URL that can be accessed to issue a ‘request’ for the item through HIP. This is a bit against the spirit of DAIA, since exactly what a ‘request’ will do is unclear [recall a checked-out item, or only add you to a hold list? Let you check it out, or only request it to be provided in the special collections reading room? Deliver it to a circ desk, or actually to your office (as we provide to some people)? Who knows!]
And worse, I’m not able to actually pre-check if ‘request’ really is available or not, for reasons discussed in the last post. Which is really against the spirit of DAIA.
But oh well, it was such a nice little schema for simply revealing a URL for a service, and my OPAC ‘request’ feature is a service… so I used it.
At some later point I hope to go back and make a real nice DAIA response, but it’ll be a buncha work, which isn’t required by the specs of the project I’m working on presently.
Oh, and I only provide DAIA at the item-level too, not at the Copy or Bib level. (I think some people’s Horizon setups actually do allow Requests at the Copy or Bib level, but not ours, so I couldn’t quite figure out how it should/would work and didn’t have time for it).
So I think the servlet is reasonably fast, but the trick when you’re developing an API that’s going to be used by other software is that “reasonable” gets a lot less forgiving. I mean, let’s say there’s a search result ‘hit list’ with 20 hits on it — my software might want to call this API 20 times for one web page! A 0.2 second response time might be pretty good for a user-facing web app, but not for an API that needs to be called 20 times to deliver one page to the user (serially, that’s 4 seconds of holdings lookups alone).
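One thing a client can at least do is issue those 20 calls concurrently rather than serially, so the page’s wall-clock cost is closer to one response time than twenty. A rough sketch — this is entirely hypothetical client-side code, not part of the servlet:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class ParallelHoldingsFetch {
    // Fetch holdings for every bib on a results page concurrently.
    // fetchOne stands in for whatever HTTP call the client actually makes.
    static List<String> fetchAll(List<String> bibIds,
                                 Function<String, String> fetchOne) {
        ExecutorService pool =
            Executors.newFixedThreadPool(Math.max(1, bibIds.size()));
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String id : bibIds) {
                futures.add(pool.submit(() -> fetchOne.apply(id)));
            }
            List<String> responses = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    responses.add(f.get()); // preserves request order
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return responses;
        } finally {
            pool.shutdown();
        }
    }
}
```

Of course, this just moves the load onto the servlet (20 simultaneous requests instead of 20 sequential ones), so it’s a mitigation, not a fix.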
So I might have some speed issues, that theoretically I can optimize to some extent. (Although I’m not looking forward to it. Java is not my specialty. If I had to do it over again, I’m not sure I would have done this in Java, although it made sense at the time for several reasons. And if I were going to do it in Java, I think I’d want to use a framework of some kind, not do it with the pretty low-level stuff that the JDBC and Servlet APIs alone give you. But that would result in its own trade-offs.)
But perhaps worse than the speed issues are some response size issues. I took a look at the response for a bib I knew would have a lot of items — JAMA, with dozens of holdingsets and hundreds or more items. The dlfexpanded response was 1.2 megs! That might be an issue for sending across the network, loading into memory, and parsing the XML on the client side.
It’s so large in part because there’s some redundancy in the multiple metadata formats we use to express everything. A basic schema-less ad hoc uchicago-created XML response for the same data is only 220k. Which is still pretty big.
So, I provided some extra query parameters (not specified in dlf ils-di of course) to allow the client to limit the data returned, if it doesn’t really need all of it. The client can choose which metadata payloads it wants for items or copies, instead of taking all of them. And the client can choose NOT to have items included in a response that includes copies, just to include the copy information, and let the client ask for the item info later if it needs it.
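Just to give the flavor of how a client might use such parameters — the parameter names below are hypothetical stand-ins, not necessarily what my servlet actually accepts:

```
# Hypothetical parameter names, for illustration only:
.../dlfexpanded?bibId=12345                    # everything
.../dlfexpanded?bibId=12345&includeItems=false # copies only, items elided
.../dlfexpanded?bibId=12345&payloads=mfhd      # only the MFHD payloads
```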
We will see how it goes.
Standard or not? Workable or not?
So, okay, I’m providing my info in the DLF ils-di ‘dlfexpanded’ format, but how standard is it? If someone says “Oh yeah, I have code that can consume dlfexpanded”, does that mean it will automatically work with my (or anyone else’s!) dlfexpanded info?
Doubtful. You’ve got your choice of metadata payloads to hang on that ‘coat tree’, and everyone can choose different things. Even once you’ve chosen, two people providing the same ones may be using them slightly differently (as evidenced by a few choices I had to make here and there with how to use mfhd).
On top of that, for performance-related reasons, or to fit ‘dlfexpanded’ into the actual use cases I have (which go beyond simple DLF “getAvailability”), my dlfexpanded responses sometimes don’t include everything. Just because there are no ‘items’ listed in the response doesn’t necessarily mean there are no items; they might have been suppressed based on the request parameters, for performance. And those request parameters are non-standard, but I think (at least for my use cases) the client is really going to need to use them to avoid a performance nightmare.
Or, if you asked my API for info on a certain item, you get a dlfexpanded response that only has that item in it, not all the other items belonging to the same bib, which may or may not be misleading or confusing to the consumer.
Meanwhile, I’ve only written the producer end of things so far, I haven’t even written the consumer. When I get around to writing the consumer, I’m probably going to run into even more tricks and problems requiring me to go back and revise, including but not limited to performance stuff.
So we’ll see. I don’t blame the DLF ils-di task force for this; they did a great job. But we make the map as we tread the path, there’s no way to map out everything without actually trying it in practice first, and trying it in a bunch of different use cases and scenarios to abstract out the commonalities. So, we’re figuring it out as we go, that’s the only way to do it, and the ils-di task force wisely recognized that and didn’t try to map everything out in advance.
Still, it means this stuff is trickier than it might originally seem. The specs, standards, and best practices are not “done”, not even close. We’ve got to figure out a bunch of stuff.