
DLF ils-di dlfexpanded service for Horizon September 10, 2009

Posted by jrochkind in General.
1 comment so far

So, I have a servlet (based on initial work from Tod Olson at uchicago, expanded by me) to provide holdings information from Horizon in the DLF ils-di “dlfexpanded” format. The servlet code and some documentation are available.

That’s the short statement. It turns out that you can’t really just say that without providing some more specifics, caveats, exceptions, limitations etc. Also it’s worth adding some interesting observations.

Motivation

As we’re moving ahead with Blacklight, we’re going to need some way to get item holdings information out of Horizon. By “item holdings information” I mean “copy” information: what items do we have, what are their call numbers, what are their statuses (checked in or out among many others), what are their locations, etc. Everything you’d need to provide an actual OPAC display telling the users what they need to know about our holdings.

A sidenote on terminology: In Horizon there are ‘items’, and sometimes a bib just has ‘items’. But sometimes a bib has different sets of items in groups; this is usually used for serials, or occasionally for multi-volume series. Horizon confusingly calls this set of items a ‘copy’. The DLF ils-di report calls it a ‘holdingset’. I have no idea what your ILS calls it. It’s a two-level hierarchy: a bib can contain one or more copies/holdingsets which each contain items. OR a bib can contain one or more items directly, without the intervening copy/holdingset.
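The two-level hierarchy described above can be sketched as a small data model. This is only an illustration, not the servlet's actual code: the class names, field names, and IDs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    item_id: str
    call_number: str = ""


@dataclass
class HoldingSet:
    """What Horizon calls a 'copy' and DLF ils-di calls a 'holdingset'."""
    holdingset_id: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Bib:
    bib_id: str
    # A bib may group its items into holdingsets...
    holdingsets: List[HoldingSet] = field(default_factory=list)
    # ...or carry items directly, with no intervening holdingset.
    items: List[Item] = field(default_factory=list)

    def all_items(self) -> List[Item]:
        """Every item on the bib: direct items first, then grouped ones."""
        grouped = [item for hs in self.holdingsets for item in hs.items]
        return self.items + grouped


# Hypothetical IDs, for illustration only.
serial = Bib("bib-1", holdingsets=[HoldingSet("copy-1", [Item("i1"), Item("i2")])])
monograph = Bib("bib-2", items=[Item("i3")])
```

A consumer that only cares about items can call `all_items()` and ignore whether a holdingset sat in between.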

And, the way most people are doing this at present (for a variety of reasons) is checking in realtime at point of demand for this info, not trying to index it. So, okay, go with the conventional wisdom. So I need a realtime service to provide this info from Horizon.

But I figure, as long as I’m doing this, MUCH better to provide the info in some standard format, instead of a custom one. Then, theoretically, the consuming code on the Blacklight end can be written to that standard format, instead of being custom for Horizon.  And my understanding is that the Blacklight team has indeed been thinking/wishing for some standard stuff on the Blacklight end to consume stuff in DLF ils-di format, and/or jangle (which also typically, at the moment, uses the DLF ‘dlfexpanded’ format to actually return data in).

So, okay, that makes sense.

But DLF ils-di format is not a complete spec

So it turns out once you decide to return data in the DLF ils-di “dlfexpanded” format, you’re actually not done deciding what your data is actually going to look like.

The dlfexpanded format is just kind of a coat tree to hang your actual metadata ‘coats’ on.  dlfexpanded lets you give a list of itemIDs and say they belong to a bib; it lets you give a list of holdingsets and say which itemIDs belong to them. Good so far. But to actually describe anything else about those items and holdingsets (location, call number, item status, any user-displayable notes, etc), you’ve got to include additional metadata of your own choosing — dlfexpanded gives you some hooks that it allows you to hang basically whatever other namespaced (and hopefully specified and standardized) XML you want on.

So figuring out what metadata to actually use to describe everything I wanted about my Items and Copies (aka ‘holdingsets’) took a bit of investigating and thinking.

simpleavailability

Sure, I used the dlf:simpleavailability format that dlfexpanded gives you just to say whether something is “available” or not (and provide a custom user-displayable string conveying that).

Although I ended up only providing that at the item level. The DLF ils-di report seems to assume the client could ask for ‘availability’ at the bib or holdingset level too. But I wasn’t even sure what the semantics of this should be, and figuring out the code to do this without impacting performance (more on performance later) was tricky. So, okay, the client can look at the availability on all items and figure out how to sum them up at the bib or copy level itself, if needed (I’m not sure I’ll even need to, for my use cases).
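If a client did want to sum item availability up to the bib or copy level, one possible roll-up looks like the sketch below. The precedence rule (any available item makes the whole group "available", and so on) is purely an assumption for illustration; nothing in the ils-di report prescribes it. The four status strings are the SimpleAvailability values as I understand the DLF schema, so check them against the schema before relying on them.

```python
from collections import Counter

# Assumed precedence: the "best" status present among the items wins.
PRECEDENCE = ["available", "possibly available", "not available"]


def summarize(item_statuses):
    """Collapse a list of item-level statuses into one group-level status."""
    counts = Counter(item_statuses)
    for status in PRECEDENCE:
        if counts[status] > 0:
            return status
    # No items, or only statuses we do not recognise.
    return "unknown"
```

A different client might reasonably prefer another rule, for example reporting counts ("3 of 12 available") instead of a single value.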

But I want to say a lot more about my Items and Copies than simpleavailability. I want to include enough data that my complete OPAC screen could be replicated by third party software.

mfhd

So after hunting around for available ‘standard’ options, I settled on good old MFHD, expressed in marc-xml. I considered the new fangled “ISO Holdings”, but limited public documentation is available, and from looking at the schema that is available, it didn’t look like ISO Holdings would let me express anything that MFHD didn’t. Sure, MFHD is kind of a bear for the developer to work with, with all those opaque numeric codes, but oh well, went with the known evil, MFHD.

Except I’m not really using mfhd as is typical. I use just enough of it to express what I want. I include kind of a dummy ‘leader’ just for the sake of appearances, since there’s nothing in the leader I actually need. In standard MFHD usage, you would rarely (never?) have an individual MFHD record just for an item, but the dlfexpanded “coat tree” gives me hooks to hang MFHDs for individual items, and that makes it a lot more convenient to express and retrieve things unambiguously, so why not. So anyway, it’s MFHD, but I’m not necessarily saying any existing MFHD-processing tools will be able to do much with it, I’m using it so unusually (although not illegally in any way as far as I can tell). Oh well, at least it’s a standard format.

Interestingly, while MFHD theoretically lets you express serial run statements in a machine readable form…  A) I don’t have that info in my ILS anyway, and B) that machine readability in the way mfhd has you express it is a lot more theoretical than practical.  So I’m not doing that.  If my ILS had the data, I’d probably express it in the more straightforward ONIX Serial Coverage Statement instead of MFHD.  (Note to ONIX people — why oh why do you only provide the actual schema in a zip file online? You used to provide it individually. Very inconvenient.)

But wait, there’s more

But to completely express all the data I’d need to duplicate my OPAC display in external software, mfhd still didn’t quite do it for me. Mostly, I wanted more internal ILS codes.  mfhd lets me express ‘location’ and ‘collection’ as user-presentable strings, but I want to reveal my internal non-mutable codes for these too. mfhd doesn’t let me express the concept of ‘item type’ that’s in my catalog at all!

So after looking around some more for something to do that, I gave up and just created my own very simple XML schema to do it, which I’m calling “ILS holdings schema” for expressing internal codes and such, in case you want to.

And one more plug for DAIA

And as I alluded to in my last post, I’m using DAIA too, at this point solely to expose the URL that can be accessed to issue a ‘request’ for the item through HIP. This is a bit against the spirit of DAIA, since exactly what a ‘request’ will do is unclear [recall a checked out item, or only add you to a hold list? Let you check it out, or only request it to be provided in the special collections reading room? Deliver it to a circ desk, or actually to your office (as we provide to some people)? Who knows!]

And worse, I’m not able to actually pre-check if ‘request’ really is available or not, for reasons discussed in the last post.  Which is really against the spirit of DAIA.

But oh well, it was such a nice little schema for simply revealing a URL for a service, and my OPAC ‘request’ feature is a service… so I used it.

At some later point I hope to go back and make a real nice DAIA response, but it’ll be a buncha work, which isn’t required by the specs of the project I’m working on presently.

Oh, and I only provide DAIA at the item-level too, not at the Copy or Bib level. (I think some people’s Horizon setups actually do allow Requests at the Copy or Bib level, but not ours, so I couldn’t quite figure out how it should/would work and didn’t have time for it).

Performance Issues

So I think the servlet is reasonably fast, but the trick when you’re developing an API that’s going to be used by other software is… “reasonable” gets a lot less forgiving. I mean, let’s say there’s a search result ‘hit list’ with 20 hits on it: my software might want to call this API 20 times for one web page! A 0.2 second response time might be pretty good for a user-facing web app, but not for an API that needs to be called 20 times to deliver one page to the user (made one after another, those calls add up to 4 seconds).

So I might have some speed issues, that theoretically I can optimize to some extent. (Although I’m not looking forward to it. Java is not my specialty. If I had to do it over again, not sure I would have done this in Java, although it made sense at the time for several reasons. And if I were going to do it in Java, I think I’d want to use a framework of some kind, not do it with the pretty low-level stuff that JDBC and Servlet APIs alone give you. But that would result in its own trade-offs.)

But perhaps worse than the speed issues are some response size issues. I took a look at the response for a bib I knew would have a lot of items: JAMA, with dozens of holdingsets and hundreds or more items. The dlfexpanded response was 1.2 megs! That might be an issue for sending across the network, loading into memory, and parsing the XML on the client side.

It’s so large in part because there’s some redundancy in the multiple metadata formats we use to express everything.  A basic schema-less ad hoc uchicago-created XML response for the same data is only 220k. Which is still pretty big.

So, I provided some extra query parameters (not specified in dlf ils-di of course) to allow the client to limit the data returned, if it doesn’t really need all of it. The client can choose which metadata payloads it wants for items or copies, instead of taking all of them. And the client can choose NOT to have items included in a response that includes copies, just to include the copy information, and let the client ask for the item info later if it needs it.
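On the client side, limiting the response might look something like the sketch below. The parameter names `includeItems` and `payloads`, the payload labels, and the host are all hypothetical stand-ins (the servlet's real parameter names are in its documentation, not here); only the `/items/bib/<id>` path shape mirrors the example URLs on this blog.

```python
from urllib.parse import urlencode


def holdings_url(base, bib_id, include_items=True, payloads=None):
    """Build a request URL that asks for only the data the client needs.

    `includeItems` and `payloads` are made-up parameter names used for
    illustration; substitute whatever the service actually accepts.
    """
    params = {}
    if not include_items:
        # Ask for copy/holdingset information only; fetch items later if needed.
        params["includeItems"] = "false"
    if payloads:
        # Restrict which metadata payloads get embedded in the response.
        params["payloads"] = ",".join(payloads)
    url = f"{base}/items/bib/{bib_id}"
    query = urlencode(params)
    return f"{url}?{query}" if query else url


# Full response, then a slimmed-down one for a hit list.
full = holdings_url("http://example.org", "12345")
slim = holdings_url("http://example.org", "12345",
                    include_items=False, payloads=["simpleavailability"])
```

For a big serial like the JAMA case, the slim request would fetch just the copies first, and item detail only when a user drills in.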

We will see how it goes.

Standard or not? Workable or not?

So, okay, I’m providing my info in the DLF ils-di ‘dlfexpanded’ format, but how standard is it? If someone says “Oh yeah, I have code that can consume dlfexpanded”, does that mean it will automatically work with my (or anyone else’s!) dlfexpanded info?

Doubtful.  You’ve got your choice of metadata payloads to hang on that ‘coat tree’, and everyone can choose different things. Even once you’ve chosen, two people providing the same ones may be using them slightly differently (as evidenced by a few choices I had to make here and there with how to use mfhd).

On top of that, for performance related reasons, or to fit ‘dlfexpanded’ into the actual use cases I have (which go beyond simple DLF “getAvailability”), my dlfexpanded responses sometimes don’t include everything — just because there are no ‘items’ listed in the response doesn’t necessarily mean there are no items, they might have been suppressed based on the request parameters for performance. And, those request parameters are non-standard, but I think (at least for my use cases), the client is really going to need to use them to avoid a performance nightmare.

Or, if you asked my API for info on a certain item, you get a dlfexpanded response that only has that item in it, not all the other items belonging to the same bib, which may or may not be misleading or confusing to the consumer.

Meanwhile, I’ve only written the producer end of things so far, I haven’t even written the consumer. When I get around to writing the consumer, I’m probably going to run into even more tricks and problems requiring me to go back and revise, including but not limited to performance stuff.

So we’ll see. I don’t blame the DLF ils-di task force for this; they did a great job. But we make the map as we tread the path, there’s no way to map out everything without actually trying it in practice first, and trying it in a bunch of different use cases and scenarios to abstract out the commonalities.  So, we’re figuring it out as we go, that’s the only way to do it, and the ils-di task force wisely recognized that and didn’t try to map everything out in advance.

Still, it means this stuff is trickier than it might originally seem. The specs, standards, and best practices are not “done”, not even close.  We’ve got to figure out a bunch of stuff.

DAIA and ILS complexity September 2, 2009

Posted by jrochkind in General.
8 comments

So DAIA is a nice little response format-slash-API specification from Jakob Voss.

It’s focused on a very specific goal: describing what services are available for a given item, possibly providing URLs to access that service for a given item, telling the user how long they’ll have to wait to get that service, etc.

Some more specific scenarios mapped to my library might make things more clear. For a given item and user, that user might be able to:

  • Look at the item in the library. Which they might be able to do immediately (upon finding it in the stacks), or there might be a 1 or 2 business day delay because it’s in some kind of closed stacks or offsite storage, and they’re going to have to request it.
    • OR, there might be a longer delay, because the item is currently checked out, and they’re going to have to wait until it comes back — or maybe they have ‘recall’ privileges, and there’s still a delay, but shorter!
  • Check the book out?  Again, maybe they can, or maybe they can’t at all. If they can, maybe they’re going to have to first ‘recall’ it (if they’re allowed to), with a longer delay.
  • Request the book for delivery to a circ desk? Related to recall/checkout, but in rare cases they might be able to request delivery to a circ desk, but only view it in library! And there are cases where they might be able to check it out, but NOT request delivery. Or where they can request delivery, but they won’t get it until the book comes back on its own, because they have no ‘recall’ privileges.

Now, the answers to these questions, once determined, are easily expressible in DAIA, no problem.
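To give a feel for what "expressible in DAIA" means, here is a rough sketch of answers like those above laid out in a DAIA-style structure. The field names (`document`, `item`, `available`, `unavailable`, `service`, `delay`, `expected`) follow my reading of the DAIA format and should be checked against the DAIA specification; the IDs, the two-day delay, and the date are invented for illustration.

```python
import json


def daia_item(item_id, available=None, unavailable=None):
    """One item with the services it can and cannot currently provide."""
    return {
        "id": item_id,
        "available": available or [],
        "unavailable": unavailable or [],
    }


doc = {
    "document": [{
        "id": "bib:example",  # hypothetical identifier
        "item": [
            # In closed stacks: viewable and loanable, after a two-day wait.
            daia_item("item:1", available=[
                {"service": "presentation", "delay": "P2D"},
                {"service": "loan", "delay": "P2D"},
            ]),
            # Checked out: no loan until it is expected back.
            daia_item("item:2", unavailable=[
                {"service": "loan", "expected": "2009-10-01"},
            ]),
        ],
    }]
}

print(json.dumps(doc, indent=2))
```

Writing this structure down is the easy part; filling in the values from the ILS is where the difficulty lies.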

The problem is, as the complicated foregoing discussion may have hinted, that determining the answers to these questions from our ILS is enormously complex. All the info is in the ILS somehow. In the end, either the ILS is going to allow a ‘request’ or a ‘loan’ or a ‘recall’, or it’s not.  And there’s info in the ILS to let us predict what’s going to happen, and estimate how long it’ll take until the user gets access (as DAIA allows us to express once we’ve figured it out).  It’s all there somehow — but trying to figure out how to actually predict it, oh boy, I get confused really quick. There are dozens of different tables I need to consult in the ILS, and figure out how they interact and which takes priority or overrides which other.  Privileges can be set on item statuses, locations, groups, etc. Borrower statuses, groups, types, etc. And they are not set, in my ILS Horizon, in only one place, but in dozens of different places with different semantics that all interact in ill-defined ways.

Phew.

It seems like something a user would expect, in this day and age, that when they look up a book the listing could actually TELL them if they can check the book out (and how long they’ll have to wait to get it, if there’s a recall involved, etc), if they can view it in the library, if they can request it for delivery, etc.  Our ILS is currently incapable of doing that — to the extent that it even always displays a ‘request’ button, and the user has to actually click on it to find out if they actually can make a request or not.  Which is generally the only way a user can find out what services are available, by trying them.  Which depending on the service may or may not be able to be done over the web (can you look at it in the library? Who knows unless you go there and try. Or call a librarian and hope they aren’t as confused as I am!).  You want to know how long you’re probably going to have to wait to get it?  Too bad.

At first I optimistically thought I could calculate all this stuff from the ILS, deliver it in DAIA, and then use it in new interfaces to actually tell the users what they’re going to want to know. DAIA is quite up to it.  But writing code to actually calculate these things — very non-trivial.  Not so happy with Horizon right now.

Anyone reading this know about the open source ILS’s?  Would this be easier in any of them?

back at work September 1, 2009

Posted by jrochkind in General.

I have returned from my leave of absence, and am back at work.

maps, territories, and discretion August 4, 2009

Posted by jrochkind in General.
2 comments

Lorcan Dempsey mentions in passing that decisions about which manifestations belong to the same work set are “discretionary at the edges”:

A note on ‘discretionary’. We cluster stuff based on aggregate cataloger choices. I like Tim Spalding’s characterization of the ‘cocktail party test’ in a blog entry about works and LibraryThing.

Regarding ‘discretionary’, I think this is exactly right. It’s important to note that the ‘work set’ is a subjective and contextual choice, not some objective piece of data waiting to be discovered. But that doesn’t mean it’s useless. It’s very important, because (in Western culture at least?) the concept of ‘work’ exists, and is of value to users. It’s socially constructed, it’s got grey edges, reasonable people may disagree in edge cases, but that doesn’t mean it doesn’t exist or isn’t useful! (For instance, when a patron comes in asking for a copy of Hamlet, try telling her “Sorry, there’s no such thing as ‘Hamlet’, please come back when you can tell me a particular edition published in a particular place at a particular year that you want.” Ha!)

The FRBR report says: “The concept of what constitutes a work and where the line of demarcation lies between one work and another may in fact be viewed differently from one culture to another.”  Quite right.

(This element of ‘discretion’ is present to some extent in ALL models of reality — and our bibliographic description is indeed a model. And always has been, even if it hasn’t always been formalized, even if it’s been based on traditional implicit shared understanding, not spelled out. The ‘map’ is never the ‘territory’, just a useful abstraction/approximation, with certain discretionary choices made as to be useful to a certain community/context).

So traditional cataloging tries to make these work distinctions (to the extent that they are implied in AACR2 choices like ‘uniform title’, and hopefully more explicit in RDA, but I can’t say for sure) by setting out precise instructions meant to result in choices that match that cultural determination of ‘work’.

LibraryThing tries to do it instead by just relying on members of that culture using their intuition, and averaging out everyone’s choices and relying on them to reach consensus through discussion.

Neither is more ‘correct’, and neither is more ‘FRBR’, just two different approaches to trying to create a collective decision about work sets that is useful to users.  Both are discretionary and subjective.

Sometimes this is hard for those in the library community to understand; we seem to attract people who want there to be a ‘right’ and ‘wrong’ answer, who want “but is it REALLY the same work?” to be based on some kind of objective reality with an absolute discoverable ‘correct’ answer.  Sorry, that’s just not reality, modelling reality is inherently full of discretionary choices. But we can, as we traditionally do in library cataloging, set out rules and guidelines to make it as likely as possible that different people at different places will make the same choices.

But we really ought to explain why those rules and guidelines are what they are, and what the goal is. In order to better allow catalogers to use their professional judgement to make the choices most likely to accomplish that goal.

AACR2 rules kind of sort of provide guidelines for making ‘work set’ decisions, but couch them in terms of simple orthographic decisions for ‘uniform title’, without ever even mentioning that this is really a decision about work sets. Gee, that definitely doesn’t help us understand what we’re doing, or try to do it to serve the users as well as possible. (It also doesn’t help that ‘uniform title’ is meant to express (at least) several other things in addition to ‘work set’; we really need to record the ‘is in work set X’ decision as its own discrete reconstructable data element.) Anyone know if RDA improves on any of this? I still haven’t had the fortitude to try and make it through the RDA draft.

Work-centric or manifestation-centric display?

Lorcan also points out that:

Interestingly, Goodreads and LibraryThing seem to default to a work-based view: the entry is at the work level…   Amazon seems to default to a particular ‘manifestation’ or ‘expression’… Google Books seems to do something similar…. Worldcat.org is more like Amazon and Google. At the moment, it aims to show the most highly held member of a work set in a result, and then link to other editions from that…

I’d be interested if Worldcat is considering trying to make a ‘work’ view the default ‘landing page’ from a search, a bit more like LibraryThing. I suspect this would actually be of more general use than the library legacy practice of always showing individual manifestations as search ‘landing pages’.

Lorcan says: “There are reasons for taking these various approaches and each service make decisions based on what it is trying to do, and the view it takes of its user interests.”  Certainly true as far as it goes — but I’ve never seen a written out clear analysis of what the reasons for the traditional library manifestation-centered display are, what they are trying to accomplish, what user interests we believe they are meeting.

I suspect that in fact this choice isn’t based on any actual clearly thought attempt to meet certain user interests — but instead just because we’ve always done it that way. Because in the card catalog world it was impractical to do otherwise. And in the online world, it takes a bit more work to do otherwise.  Not because doing it this way actually is necessarily optimal for meeting identified user interests.

Please note that I’m not saying that we should ‘catalog at the work level’, whatever that would mean. Our cataloging practices certainly still need to describe manifestations, and there need to be different records for different manifestations. (On the other hand, something like subject cataloging probably is best done once at the work-level, not duplicated effort for every manifestation.) But a work-centric display can still be provided — if there is sufficient data recorded to allow software to reconstruct cataloger decisions about work-set groupings!  Current practice makes this difficult.

(And note that’s why Amazon and Google don’t have workset-centric displays. They don’t have the data to do it! Even Google’s vaunted algorithmic prowess can’t, apparently, determine work set groupings reliably enough to make a work-centric display. At least not within the resources Google is willing to throw at the problem. LibraryThing can do it because of volunteer human labor!  Library cataloging theoretically relies on such human labor, and we certainly spend an enormous amount of person hours in such labor — but don’t actually capture the fruit of that labor in unambiguous enough form to make it easy for the software to take advantage of. Shame on us.)

on leave July 31, 2009

Posted by jrochkind in General.
3 comments

I will be on a leave of absence from work for the entire month of August, getting some much needed rejuvenation, hopefully coming back to work reenergized. heh.

I will have spotty internet access throughout the month of August. If you need to get in touch with me, better off leaving a comment here (which will end up notifying me at my personal email address, which I’ll check spottily but occasionally), than sending to my work address (which I probably won’t check at all). Feel free to leave a comment asking me to get in touch with you (just don’t get upset if you don’t hear back for a while!); the email address you enter in your blog comment will be visible to me.

See you all in September if not before then!

exposing holdings in dlf ils-di standard format web service July 31, 2009

Posted by jrochkind in General.
7 comments

So, as we move toward Blacklight implementation, I needed some way to expose item/holdings details from my Horizon ILS so they could be consumed for display (and/or indexing) in Blacklight.

I figured, as long as I’m doing this, I might as well do it in some kind of standard (rather than custom ad-hoc) format, so the consumer on the Blacklight end could be standard-ish. And it could be possibly re-used by others, or re-used by ourselves if we switch ILSs, we’d just need to write the provider end on the ILS end, and could keep the same consumer.

So looking around for standard formats, the DLF ILS-di format (xsd schema) seemed pretty suitable, designed for just this task.

So, thankfully, Horizon actually keeps all its info in a fairly well normalized rdbms that you can access directly, making this not too hard a task. On top of that, those fine folks at uchicago already had a little extension to Horizon to provide the item information in their own custom ad-hoc format, which they kindly shared with me. So I took that, and modified it to produce output in ils-di format.

Metadata formats

Now, the thing about the ils-di format.  It gives you a sort of skeleton to hang your info on. You can list items. You can list what ils-di calls ‘holdingsets’ (and Horizon confusingly calls ‘copies’, and I don’t know what your ILS calls them — a group of related items, like all the bound volumes of a particular bib; or the multiple volumes of a multi-volume set).  You can express which items are in which holdingsets.

This is all great, because there wasn’t a simple standard format to do that in before. But when you actually want to say something about the holdingsets or items, ils-di just gives you a slot to put some other (hopefully standard) metadata format in. (With one exception: ils-di gives you “SimpleAvailability” to describe a human-displayable label, and one of four coded SimpleAvailability statuses to describe item availability/status. This was wise of them, because there was no good way to provide status from a standard vocabulary without this.)

Now, I think ils-di is exactly right to do things this way. Break the problem into manageable chunks, solve one chunk with a solution meant to do one thing well, and make sure your solution can be ‘loosely coupled’ with other solutions meant to solve the other parts. Fine, good show.

But that still leaves me to figure out how to actually describe what I want to describe, using what XML schemas, standardized if possible. (And leaves the community to arrive at a standard set of these extra schemas at a later date, if we want to write software that really is ‘plug and play’ with each other. Oh well, that’s how it goes, better to try some things and define ‘best practices’ and standards off of what works well, than to try and ‘standardize’ before trying in the wild.)

All my stuff

So what are all the data elements I have that I want to describe somehow, in these extra metadata packages embedded in dlf-di?

Well, you can see them right here in uchicago’s custom ad hoc format, what their servlet did out of the box, with this example of a moderately complex serials record:

http://hip-dev.mse.jhu.edu/items/bib/418855.uchicago

So, okay, where to put it?  Well, bibIDs and itemIDs are already in the dlf-schema itself.  So what else do we have?  Marc Format for Holdings Data in MarcXML seems likely.  Maybe ISO Holdings?  Maybe NCIP?

I started with MFHD in marcxml, because NCIP confuses me (and everyone else), and ISO Holdings you need to pay a couple hundred bucks to look at the standard (although you can see the .xsd schema alone for free).

So in MFHD you can put a lot of stuff actually.  Although it’s somewhat confusing to look at, since it uses those obscure marc tag codes and such. But you can put in there:

  • user-displayable ‘location’ and ‘collection’ in tag 852
  • ‘holding’ (ie ‘holdingset’ ie ‘copy’) identifier in tag 001.
  • shelfmark (ie call number/copy information) also in 852.
  • A coded value of whether that call number is LCC, NLM, Dewey, Sudocs, a couple others, or ‘other’ or ‘unknown’. 852 indicator 1.
  • For ‘holdingsets’ user-presentable coverage statements (for main run, indexes, or supplements), in 866-888.
    • ( Note, if my ILS actually had machine-understandable coverage statements, which it does not, you theoretically maybe could put them in MFHD, but I’d much prefer ONIX Serial Coverage, which I think does it much more elegantly and clearly. But I don’t have that data available  anyway.)
  • I think you can provide an un-coded user-presentable item status/availability string somewhere, but SimpleAvailability takes care of that better so I didn’t worry about it.
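A stripped-down holdings record along the lines of the list above could be built like this. It is a sketch, not the servlet's output: the choice of 852 subfields (`b`, `c`, `h`) for location, collection, and call number is an assumption, the sample values are invented, and the leader is left out for brevity. The namespace is the standard MARCXML one, and first indicator `0` on 852 is the MARC code for Library of Congress classification.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)


def q(tag):
    """Qualify a tag name with the MARCXML namespace."""
    return f"{{{MARC_NS}}}{tag}"


def holdings_record(holding_id, location, collection, call_number, scheme="0"):
    """Build a minimal MFHD record: 001 identifier plus one 852 field."""
    rec = ET.Element(q("record"), {"type": "Holdings"})
    # 001: the holding (holdingset/copy) identifier.
    ET.SubElement(rec, q("controlfield"), {"tag": "001"}).text = holding_id
    # 852: location, collection and shelfmark; ind1 codes the call number scheme.
    field = ET.SubElement(
        rec, q("datafield"), {"tag": "852", "ind1": scheme, "ind2": " "}
    )
    # Subfield assignment below is an assumption made for this sketch.
    for code, value in (("b", location), ("c", collection), ("h", call_number)):
        ET.SubElement(field, q("subfield"), {"code": code}).text = value
    return rec


rec = holdings_record("copy-1", "Main Library", "Stacks", "QA76 .B3")
print(ET.tostring(rec, encoding="unicode"))
```

Even this small example shows why MFHD is awkward to read: nothing in `852`, `ind1="0"`, or `code="h"` says "call number, LC scheme" without a lookup table.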

Meanwhile, dlf:SimpleAvailability is handling my need for both a coded and user-displayable item status/availability string, great, one thing done well. (Although I needed to create a mapping from my 109 internal ‘item status’ codes to the four dlf:SimpleAvailability values!).

But that still left me with some things I wanted to include.  Well, MFHD gives me user-displayable labels for location and collection. But I really wanted to include my ILS’s internal codes for location and collection and item status. Why would I want purely local internal codes? Well, because applications I’m using to consume this can possibly be configured to make use of them even though they are purely local identifiers (especially if I’m writing the apps myself!).  I also wanted to include ‘item type’ as both an internal code and a user-displayable label, and strangely MFHD has no spot for even user-displayable label for that.  Also similarly wanted to expose my internal system “call number type” id, which is not always mappable to a standard type in MFHD like LCC or DDC or whatever.

I looked over what documentation I could find for NCIP, as well as the NCIP xml schema, didn’t seem to have the fields I needed either. I even looked at the ISO Holdings schema without any documentation (my skills at reading raw XML schemas have improved muchly through this project). Nope, not there.

So, what?

Ross Singer had an idea that you could do this purely with DublinCore (including refinements in ‘dcterms’) and RDF. That might be possible, but I just couldn’t figure out how to do it. But really, I don’t think there are sufficient elements in dc:terms to cover all of those data elements, although Ross found some clever ways to try and express a few of them (Ross trying to do a bit MORE than I really needed, since he didn’t want to depend on the dlf-di schema but I’m just trying to get some metadata I can embed in dlf-di for now, that’s my use case).

So I guess there’s theoretically some way to express your own refinements to dcterms?  But I got lost trying to figure that out.

So one way or another, I figured I was going to define my own vocabulary. I could do it as an RDF Vocabulary alone, but I got confused trying to think about that, and once you go to trying to express that in RDF-XML… got confused again. Or I could do it in a custom XML Schema. If I’m going to have to create my own vocabulary anyway, XML Schema just seemed simpler, both to produce and to consume. (And it would be easy for me or someone else to convert this to RDF at a later date, starting from a schema. RDF-XML even lets any defined XML namespace pretty much be RDF out of the box, just add a few RDF attributes here or there!)

So custom schema it was. I created (or am in the middle of creating) an awfully simple XML schema for these elements I needed, mostly internal ILS values. For each one, the schema says you can supply one or more (internal or external) identifiers using a child dc:identifier, a user-displayable label using a child dc:title, and if you like a longer-format user-displayable description. (I didn’t re-use dc:description for this because I really wanted a couple of extra attributes that there seemed to be no way to add to a dc:description).

Here it is, work in progress. (Not even sure if this validates yet).
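Separately from the schema itself, here is a rough sketch of the shape described above, in case it helps to see it concretely. It is not the actual schema: the `ils` namespace URI, the element names `itemType` and `description`, and the sample values are all invented for illustration. Only the Dublin Core namespace and the dc:identifier / dc:title children come from the description above.

```python
import xml.etree.ElementTree as ET

# DC is the standard Dublin Core elements namespace.
# LOCAL is a made-up placeholder, since the custom schema is still un-named.
DC = "http://purl.org/dc/elements/1.1/"
LOCAL = "http://example.org/ns/ils-local"  # hypothetical

ET.register_namespace("dc", DC)
ET.register_namespace("ils", LOCAL)


def local_value(name, identifiers, label, description=None):
    """Build one element with identifiers, a label, and an optional description."""
    elem = ET.Element(f"{{{LOCAL}}}{name}")
    for ident in identifiers:
        # One or more internal or external identifiers
        ET.SubElement(elem, f"{{{DC}}}identifier").text = ident
    # User-displayable label
    ET.SubElement(elem, f"{{{DC}}}title").text = label
    if description is not None:
        # Longer description, kept in the local namespace rather than dc:description
        ET.SubElement(elem, f"{{{LOCAL}}}description").text = description
    return elem


# Invented sample values: an internal item type code plus its display label.
item_type = local_value("itemType", ["bk"], "Book", "Circulating monograph")
xml_text = ET.tostring(item_type, encoding="unicode")
print(xml_text)
```

A consumer that only wants the display label reads dc:title; one that is configured with local codes reads dc:identifier instead.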

The (not so) final product

So here it is, the current version of a dlf ils-di document produced live from my (development box) Horizon, including in its metadata payload MFHD in marcxml, dlf:SimpleAvailability, and my custom as yet un-named schema.

See for example this same moderately complicated serials record:

http://hip-dev.mse.jhu.edu/items/bib/418855

Where to next?

Well, I’ve got to finish polishing it off, make sure all the XML validates against the schemas, make sure the new schema I created is really valid, etc.  Polish off a few more things.

Then, I’d like to put this code (derived from uchicago’s code, with their permission) on Google Code, so that other Horizon institutions can use it to provide dlf ils-di responses from their catalogs, woo.  (I tried to keep the code as generalizable as possible — for instance, the mapping from your local item status codes to the four dlf:SimpleAvailability values is configurable in a properties file).
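To illustrate the configurable status mapping idea, here is a minimal sketch. It is in Python rather than the servlet’s own code, and it is not how the servlet actually does it. The local status codes and the `code = value` file layout are invented, and the four availability strings are my recollection of the dlf:SimpleAvailability values, so check them against the ILS-DI schema before relying on them.

```python
# Assumed dlf:SimpleAvailability values (verify against the ILS-DI schema).
DLF_STATUSES = {"available", "not available", "possibly available", "unknown"}

# Hypothetical properties-style config: local status code = DLF value.
SAMPLE_CONFIG = """
# local item status codes (invented for illustration)
i = available
o = not available
trace = possibly available
"""


def load_status_map(text):
    """Parse 'code = value' lines, skipping blank lines and '#' comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        code, _, value = line.partition("=")
        value = value.strip()
        if value not in DLF_STATUSES:
            raise ValueError(f"unsupported availability value: {value!r}")
        mapping[code.strip()] = value
    return mapping


def availability_for(code, mapping):
    """Local codes missing from the config fall back to 'unknown'."""
    return mapping.get(code, "unknown")


status_map = load_status_map(SAMPLE_CONFIG)
print(availability_for("o", status_map))
print(availability_for("zz", status_map))
```

Keeping the mapping in a config file means another institution only has to edit that file to match its own status codes, not touch the code.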

I’ve also got my eyes on DAIA as another metadata schema to include in the dlf ils-di response eventually.  DAIA is focused on doing what SimpleAvailability does, but with more detail: What services are available, and what are the URL access points for each service? I need to figure out how to correctly extend DAIA to include services that aren’t in DAIA’s built-in four. (I specifically need the service ‘get a photocopy of a portion of this item’, and ‘place an ILS request/hold for pickup at circ desk’, two services we offer that DAIA doesn’t specify right now).

And Ross tells me what I’ve done so far has gotten me a lot of the way to a jangle implementation. Great, that was part of the goal, so apparently it succeeded. I’ll finish off the rest of jangle when I have a use case that demands it, which could be sooner or later! (And first I’ll need to understand jangle better!).

APIs and vendor lock-in July 23, 2009

Posted by jrochkind in General.
3 comments

Eric Lease Morgan asks on code4lib:

I heard someplace recently that APIs are the newest form of vendor
lock-in. What’s your take?

My reply (expanded a bit from my listserv post):

Standards-Based

When they are custom vendor-specific APIs and not standards-based APIs, they can definitely function that way. I’m still not sure if even a vendor-specific API is more or less lock-in than NOT having an API.  On the one hand, you will start to have software written against the vendor-specific API that won’t work without changes if you switch vendors.  But on the other hand, take SFX and Umlaut: Umlaut does so much more than SFX, and the SFX adapter piece is such a small part of it, that for us at least, having SFX with an API and Umlaut on top of it definitely makes it _easier_ for us to switch link resolvers without disrupting the services built on top.

Which we don’t do well at

But really, what you want is standards-based APIs, not vendor-specific APIs. That would give you the best of all worlds. There are a couple challenges that keep us from getting there though. One is that the library community, historically, is, well, pretty AWFUL at writing standards.  We come up with standards that don’t actually accomplish what they were intended to accomplish, are too complicated for anyone to implement right (on either producer or consumer side), and leave so much wiggle room that someone can claim they support the standard but not in a way that any other software will ever understand.  (NCIP anyone?)

Outside standards?

So there are a couple ways to try to get better at this. One is definitely looking outside the library world for standards to use. But unlike code4libbers, I don’t think (from my experience) that’s always possible or easy.  We have priority problems that, while they are not entirely foreign to the larger world, aren’t as high a priority for most of the non-library world, meaning they don’t yet have robust standards solutions. However, especially when standards are extensible (like XML ones often are), you can sometimes start with a general standard and extend it for the library space.

Standards based on, not preceding, practice

Secondly, instead of creating standards before anyone has actually tried solving the problem the standard is meant to solve (as we often seem to do), the BEST standards are created by generalizing/abstracting from existing best practices. A buncha people try it first, you see what works and what doesn’t, you see what the actual use cases and needs are, you take the best out of what’s been done, and you standardize it.   But doing it this way means you need to go through a period of vendor/product specific (eg) APIs before you can get to the standard.  The library world is still immature in developing good software infrastructures, we’re going to need to go through some more pain for a while, no way around it.

Vendor capabilities?

But another problem in all of this is that vendors may not have the interest OR the in-house expertise to actually provide standards-based APIs.  The APIs we often get now from vendors, frankly, are kind of kludgey, and do not fill me with confidence that the vendor actually has the proper staff or resources allocated to create good standards-based APIs, which definitely takes more time than creating a kludgey vendor-specific one-off.   Or maybe the vendor actually is uninterested in this because they want lock-in.  Or maybe it’s just the case that the quality of your APIs doesn’t affect your sales at all, so it doesn’t make (short term at least) business sense to do it well.  (Heck, the _presence_ of an API has only just begun to affect sales, and libraries aren’t yet good enough at judging API quality, so even a crappy API is probably ‘good enough’ for sales).

Open source, community work

One way out of this is definitely open source. We’ll work out the best practices and standards ourselves, and then we start insisting that vendors follow them.  The DLF-DI API is perhaps one example of an attempt at this, created from a generalization of the experience of library developers.   But the library developer community is also small, and generally fairly inexperienced. Creating APIs is done best by experienced developers who understand what’s going to make the API usable or not.

But, anyway, one step at a time. I firmly believe that even vendor-specific kludgey APIs are better than no APIs at all — we learn how to do better by trying.

Consuming applications

It’s also worth pointing out, as some subsequent commenters on that thread did, that the application consuming an API bears some responsibility here. As much as possible, you need to abstract out the API connector code, so you can easily switch the app to use multiple APIs, so long as they all have more or less the same data/capabilities (something which certainly isn’t guaranteed, admittedly). This too takes more time, but is do-able. Among the software I work on, Umlaut manages to do it pretty well, Xerxes does not. This is in part because the more focused and limited function of a link resolver, compared to a federated search engine, made it easier to do with Umlaut. And I guess half of the SFX API more or less is standards-based: OpenURL.

As a result, even though both SFX and Metalib have vendor-specific APIs, our use of the SFX API, in my opinion, lessens our vendor lock-in, while our use of the Metalib API increases it.

In this case, this was mostly due to factors outside our control. But it also can definitely depend on how well you’ve architected your client code, to abstract out the API connectors. Sometimes I feel like this is heresy in code4lib with its “just get it done” ethos, but good, well-architected code matters.
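To make “abstract out the API connectors” a bit more concrete, here is a minimal sketch of the idea in Python. It is not how Umlaut or Xerxes is written. Every class name, the two “vendor” response formats, and the fake fetch functions are hypothetical.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass
class FullTextLink:
    """Vendor-neutral result that the rest of the application works with."""
    provider: str
    url: str


class ResolverConnector(ABC):
    """Everything vendor-specific lives behind this interface."""

    @abstractmethod
    def full_text_links(self, issn: str) -> List[FullTextLink]:
        ...


class VendorAConnector(ResolverConnector):
    """Hypothetical vendor whose API returns dicts keyed 'target' and 'href'."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable standing in for the real HTTP call

    def full_text_links(self, issn):
        return [FullTextLink(r["target"], r["href"]) for r in self._fetch(issn)]


class VendorBConnector(ResolverConnector):
    """Hypothetical vendor whose API returns (name, url) tuples."""

    def __init__(self, fetch):
        self._fetch = fetch

    def full_text_links(self, issn):
        return [FullTextLink(name, url) for name, url in self._fetch(issn)]


def render_links(connector, issn):
    """Application code: depends only on the connector interface."""
    return [f"{link.provider}: {link.url}" for link in connector.full_text_links(issn)]


# Fake vendor responses so the sketch runs without a network.
def fake_a(issn):
    return [{"target": "JSTOR", "href": f"https://example.org/a/{issn}"}]


def fake_b(issn):
    return [("JSTOR", f"https://example.org/b/{issn}")]


print(render_links(VendorAConnector(fake_a), "1234-5678"))
print(render_links(VendorBConnector(fake_b), "1234-5678"))
```

Switching vendors then means writing one new connector class; `render_links` and everything else built on it stays as it is. That only works, of course, when the vendors expose roughly the same data.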

What librarians do July 1, 2009

Posted by jrochkind in General.
6 comments

So I just gave (or co-gave) a presentation here on Umlaut, as deployed as our Find It service.

One of the most exciting parts to me was that various (non-IT)  librarians in the room, un-prompted, started throwing out ideas of what it could do in the future. Quite good ideas. I had to resist the techie’s urge to respond to them with “Well, yeah, but see, that’s harder than it might seem to make work like that…”, and instead try to be encouraging and positive, because it was great to have such a conversation. We hardly ever have such conversations.

Why? I think because usually a non-technical librarian has absolutely no way to put such innovative thoughts into practice.  As Karen Schneider talked about in her 2007 Code4Lib Keynote, libraries have ended up outsourcing a significant part of their core business to vendors,  in a way that we pay for it, and we get it, and we pretty much take what we get.

My experience made me realize today that one of the (many) negative side effects of this is that librarians have lost the opportunity (and thus been implicitly  ‘trained’ not to even bother trying) of doing what librarians should be doing in this era when so many of our services are delivered over the web: Figuring out how to make these services meet our users’ needs better!

Contrary to popular belief, you can’t just let your users tell you what your services will be. Sure, of course you need to listen to your users. And if you listen and observe very carefully, you can figure out what your users’ needs are, some of which they may not even be able to articulate themselves, but others of which they most certainly can.  But you can’t count on your users to identify the best solutions to these needs. That’s what we’re for, that’s why we’re professionals!

And, to me at least, it’s one of the most interesting and rewarding parts of our jobs.

But the outsourcing of much of the library’s business to vendors has taken the opportunity to do that away from most of us (an IT geek like me, in a library that lets him get away with it, still has some). Most non-IT librarians have had it reinforced that they shouldn’t even bother. And while you have to be an IT type to implement new online services or features, you shouldn’t have to be one to be engaged in dreaming up and planning them.

One thing open source can do is return this power to us.   I’m pretty pleased that Umlaut (and my ability to explain it) is finally at the point where its future potential can be seen enough to encourage non-technical librarians to start suggesting “Hey, but what if it could do this and that too? Wouldn’t that be great?”

And, if I can somehow find the time amongst the way too many really great things that I’d like to do if I had time, maybe soon it will!

cataloging theory really is useful June 30, 2009

Posted by jrochkind in General.
6 comments

As much as I’m sometimes frustrated by our common inherited legacy cataloging practices, I actually do think the cataloging theory developed by Lubetzky, Svenonius, Cutter, and others is still useful — sometimes you just need to ‘translate’ it to the modern environment.

I’ve been thinking about how having persistent unique identifiers (bib IDs) for our records is really important — but not generally prioritized in some of our legacy cataloging practice. There are a bunch of ways to explain why this is important (and it’s kind of obvious to the CS-perspective-inclined).

But I realized another way goes back to some language used in my cataloging class.  A cataloging record is called a ‘surrogate’ for the physical item described. That’s exactly what it is, even more so in the digital age:  it allows the physical item to be ‘projected’ into the digital environment as a digital object which is a ‘surrogate’ for the physical object (or sets of objects, depending on the context you consider it in) it represents.

Perhaps this helps explain why a persistent bib ID is important using cataloging theory language.  As a surrogate for the physical object in the digital environment, we want to be able to link to the surrogate in different ways, from simply bookmarking it to building more complicated ‘semantic’ relationships based upon it.  All of that depends on having a persistent identifier (a persistent bib ID) for the surrogate.  Changing the bib ID of the surrogate in the digital environment in unpredictable ways would be analogous to periodically changing where the physical item is physically shelved in unpredictable ways!  The internal unique identifier for the surrogate is essentially its digital “location”.

[That's a bit of an oversimplification -- giving the digital surrogate a reliable digital 'location' requires some layering on top of the unique internal ID, to give it a unique persistent URI too. But the pre-requisite for that is a persistent unique internal ID.]

[And, incidentally, for the semantic web geeks reading, this gets at some of my dissatisfaction with this focus on 'real world objects' vs 'documents' or whatever they're currently calling the second class. I don't think it's at all a clear distinction, and can often get confusing right quick, and I think it's probably a mistake to rely on such a confusing distinction for crucial parts of your 'specs'.  A cataloging record is a 'web document', surely, but it's also a surrogate (not JUST a 'description') for a real world object.  Sure, we can split hairs and talk about how to handle that. But the fact that it gets so confusing and abstract and hair-splitting and subject to debate worries me and makes me suspicious of relying on such a distinction for describing how to 'do business' in the sem web.]

NYU goes live with Umlaut June 29, 2009

Posted by jrochkind in General.

NYU has gone live with Umlaut. I’m holding my breath hoping that nothing will go wrong with their installation that’s my fault. :)

Hi all,
We’ve deployed Umlaut to our production Primo environment at NYU.

Umlaut is available through the “GetIt” link on a search results page at
http://www.bobcat.nyu.edu and is hosted at http://getit.library.nyu.edu

Thanks,

Scot Dalton
Web Development
Division of Libraries
New York University

It’s interesting to me that they are using Umlaut to work around an exceptionally poor part of Primo’s user experience — the page (or really pages in a ‘tabbed’  frameset wrapper) that actually gets the user to accessing the document (physical location/availability or electronic availability etc).

Turns out Umlaut is exceptionally well suited to take over this role in Primo, because Primo already supports calling out to an OpenURL receiver, and because Umlaut is designed for this kind of ‘known item’ and/or ‘last mile’ service.  I think (un-humbly) that the mark of a well-thought-out piece of software is that it can serve well in situations that aren’t exactly what it was designed for.  A ‘known item service provider’ is something we needed all along without realizing it, and once you have one you can find ways to use it that I never thought of.  I expect that more Primo customers will become interested in Umlaut.
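For a sense of what “calling out to an OpenURL receiver” amounts to, here is a small sketch that builds an OpenURL 1.0 (Z39.88-2004) key/value link for a journal article. The resolver base URL and the citation values are placeholders, and this is not Primo’s or Umlaut’s actual code; only the key names come from the OpenURL journal format.

```python
from urllib.parse import urlencode


def build_openurl(base_url, issn, volume, spage, year):
    """Return an OpenURL 1.0 key/value link describing a journal article."""
    params = [
        ("url_ver", "Z39.88-2004"),                        # OpenURL 1.0 marker
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),   # journal metadata format
        ("rft.genre", "article"),
        ("rft.issn", issn),
        ("rft.volume", volume),
        ("rft.spage", spage),
        ("rft.date", year),
    ]
    return f"{base_url}?{urlencode(params)}"


# Placeholder resolver address and invented citation values.
link = build_openurl("https://resolver.example.edu/resolve",
                     "1234-5678", "12", "45", "2009")
print(link)
```

Any discovery layer that can emit a link like this can hand the user off to whatever receiver sits at the base URL, which is why swapping in a different ‘last mile’ service is comparatively cheap.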

And, my understanding is that Summon will also rely on sending out an OpenURL for actual local ‘last mile’ access, so I predict that Summon customers will similarly be interested in Umlaut.

I hope anyway!  Thanks very much to Scot from NYU for spearheading the Umlaut deployment there;  I have been very impressed by how quickly Scot was able to get things up and running, with little help from me, including writing some new features and plug-ins to talk to Aleph. Although I’d like to think that the quality of Umlaut’s code and documentation gets some credit here, Scot has been a pleasure to work with, and I hope he will continue working on Umlaut.

Somewhat oddly from my point of view, NYU has deployed Umlaut only in the context of their Primo OPAC/discovery layer.  Traditional link resolver use still goes right to SFX.  Personally, I think that our users in most of our libraries already have too many different interfaces to deal with, and I place a priority on consolidating and integrating them. Umlaut’s goal is to serve this role by providing a ‘known item last mile’ interface in as many contexts as possible.  But I understand that politically it can be difficult to make big changes at once, and my understanding is that NYU does eventually plan to target Umlaut for traditional link resolver use too.