Principle of avoiding “false promises” in interfaces September 24, 2009
Posted by jrochkind in General.5 comments
So lately I keep thinking about this idea I think of as a “false promise” in a user interface. Not sure if other people already recognize this and refer to the concept by some other label, let me know if you know they do.
But the idea is that your software shouldn’t suggest by its input that it can do something that it really can’t do at all. This becomes especially tricky when we’re dealing with our library data and systems that in fact can’t do a lot of things. Some examples will help.
SFX ‘citation linker’ input screen
SFX by default has a screen that lets you input an article citation, and then SFX will try to find links or other information for it. (I don’t want to put a link to mine here because I don’t want to attract the robots.)
Now, to begin with, this is both an annoying process for the user, and an error-prone process for SFX. But I want to draw your attention to two particular fields on that screen: “Author” and “Article Title”.
The default input screen asks you to input an “Author” and an “Article Title”. However, in (estimating) 95%-99% of cases, SFX can’t actually do anything with the author or title you’ve input at all. It doesn’t help SFX find a match; it doesn’t affect SFX’s functionality at all.
So our interface implies that the user ought to enter author and title — a painful and annoying process for the user. The “false promise” here, in my opinion, is that this will do anything at all. Now, granted, in a tiny minority of cases it will, which is why SFX puts the field there. But that means we’re making a “false promise” in the vast majority of cases, in my opinion. We’re “leading the user on.”
MARC relator codes
This might be a better example. So MARC fields for listing controlled authors or other contributors (100 and 700) theoretically allow the data to say particularly what relationship the contributor has to the work at hand. (Author? Editor? Illustrator? Performer on a musical composition? Composer? Wrote a preface?).
Most OPAC interfaces don’t do much with this. But if you start thinking of what you might want to do, an initial naive approach might be to allow the user to limit a search by these relator codes. Don’t just give me any record that has Noam Chomsky in any 100 or 700 (that’s what our traditional interfaces do, but for prolific people it might give me too much). I really only want books where Noam Chomsky wrote a preface.
So, okay, maybe you go ahead and provide this limit in your search interface. The problem is that the vast majority of our data doesn’t have these relator codes. So if you just do a search for Noam Chomsky with relator code for ‘wrote a preface’, you’re going to miss most of the books that Noam Chomsky really did write a preface for.
You might miss it because Noam Chomsky is in a 700 field with no relator code. Or you might miss it because we don’t often record people who wrote prefaces at all.
In either case though, I think the interface was making a ‘false promise’: it suggested you could search limiting by role of the contributor, but our data doesn’t really support that at all. The results are going to be misleading if the user assumes the interface really can do what it suggests it can.
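To make the relator code problem concrete, here is a minimal sketch in Python. The record structure, the sample titles, and the relator value "preface" are all invented for illustration; they stand in for real MARC data and real relator codes. It just shows how a role-limited search silently drops records where the role was never recorded.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Contributor:
    name: str
    relator: Optional[str]  # None when the record carries no relator code


@dataclass
class Record:
    title: str
    contributors: List[Contributor]


def search_by_role(records: List[Record], name: str, relator: str) -> List[str]:
    """Titles where `name` appears with exactly this relator code."""
    return [
        r.title
        for r in records
        if any(c.name == name and c.relator == relator for c in r.contributors)
    ]


def search_any_role(records: List[Record], name: str) -> List[str]:
    """Titles where `name` appears in any contributor field, coded or not."""
    return [r.title for r in records if any(c.name == name for c in r.contributors)]


# Invented sample data: assume Chomsky really wrote a preface for all three books.
records = [
    Record("Book A", [Contributor("Chomsky, Noam", "preface")]),  # coded
    Record("Book B", [Contributor("Chomsky, Noam", None)]),       # no relator code
    Record("Book C", []),                                         # preface writer not recorded
]

limited = search_by_role(records, "Chomsky, Noam", "preface")
unlimited = search_any_role(records, "Chomsky, Noam")
print(limited)    # only the record that happens to carry the code
print(unlimited)  # more records, but still not the one that omits him entirely
```

The role-limited search looks precise, but it only finds the one record that happens to be coded, and even the unlimited search can't find the record that never listed him.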
So?
So what do you think? Can you think of any other examples of ‘false promises’ that our interfaces make?
Identifying the ‘false promises’ is easier than fixing them. Usually they are there because of limitations in our software or data that are not easy or cheap to resolve. If you really get rid of all of the false promises, you have to get rid of much of your functionality! Or pepper it with disclaimers and limitations that most users won’t read anyway, and just make us look kind of incompetent if they do. (“WARNING: You can TRY to search on relator code, but your results will only include a tiny percentage of things that really matched your search.”)
A reasonable display for series data in MARC? September 24, 2009
Posted by jrochkind in General.16 comments
So I know plenty of catalogers read my blog (or used to). Appreciate any feedback or advice you have on this.
Basically, I’m trying to figure out how to actually do a useful user-friendly display of ‘series’ information from MARC records.
My assumptions
So we have 440, 490, and 8xx. There’s a distinction between “transcribed” series, and controlled (aka “traced” or “access point”). I know that the controlled data is meant to be used for collocation. I am assuming that the “transcribed” data is better for user display though. Is this right? (I’ll refer to these two concepts as “displayable” and “controlled”).
So if we’ve got a 440, then that is both displayable and controlled.
But current practice going forward is not to use 440, but instead to use a 490 for displayable, and a 8xx for controlled.
So what should the interface do?
So thinking about an individual record display. I can’t just list all 440, 490, and 8xx fields under “Series”, because in the case of 490/8xx, that’ll lead to me displaying the same series twice. Once in transcribed form, and once in controlled form. This is confusing and doesn’t make sense.
So what I’m thinking is that for a 490/8xx pair, I actually display the 490 on the screen, since it’s the value meant for user-display. But it’s clickable, and when you click on it, the search that will be executed is actually on the corresponding 8xx, because that’s the field meant for collocation.
This is assuming there is a corresponding 8xx. If there’s not, it’s somewhat simpler. We display the 490, and either it’s not click-searchable at all, or if it is, it searches an uncontrolled series index of all 490s, it doesn’t actually try to collocate on a controlled field, cause we don’t have one.
Does this make sense? Am I missing something?
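Here is a rough sketch of that display rule in Python, just to pin down what I mean. The function name, the index names ("series_controlled", "series_transcribed"), and the idea of passing each field as a plain string are all assumptions for illustration, not anything from a real OPAC. It only pairs a 490 with an 8xx when there is exactly one of each; anything else falls back to the uncontrolled index.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeriesEntry:
    label: str         # text shown to the user
    search_index: str  # hypothetical index name searched on click
    search_value: str  # value sent to that index


def series_display(f440: List[str], f490: List[str], f8xx: List[str]) -> List[SeriesEntry]:
    """Build the 'Series' lines for one record display."""
    # A 440 is both displayable and controlled, so it searches on itself.
    entries = [SeriesEntry(s, "series_controlled", s) for s in f440]

    if len(f490) == 1 and len(f8xx) == 1:
        # Show the transcribed 490, but collocate on the controlled 8xx.
        entries.append(SeriesEntry(f490[0], "series_controlled", f8xx[0]))
    else:
        # No 8xx, or no safe way to pair: search the uncontrolled 490 index.
        entries.extend(SeriesEntry(s, "series_transcribed", s) for s in f490)
    return entries


paired = series_display([], ["Lecture notes ; v. 12"], ["Lecture notes (Example Press)"])
unpaired = series_display([], ["Lecture notes ; v. 12"], [])
print(paired[0])
print(unpaired[0])
```

Note that the 8xx is never displayed on its own here, which is what avoids showing the same series twice.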
But the problem
But there’s still a problem here. A record can theoretically belong to multiple series. Meaning it could have multiple 490s. Each of which may or may not have a controlled 8xx corresponding to it.
As far as I can tell, there’s no way to tell which 8xx goes with which 490. Especially since a 490 may or may not have a corresponding 8xx.
This might not affect very many records (only those that have multiple series), but it still annoys me to have a known ‘bug’, a known case where things won’t work right at all. I’m not really sure what the heck my code should do if there are multiple 490s. Am I missing something?
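One thing I could do before deciding is measure how many records are actually affected. A minimal sketch, assuming I have already extracted per-record counts of 490 and 8xx fields as (n_490, n_8xx) tuples; the sample counts below are invented, and the definition of "ambiguous" is my own reading of the problem.

```python
from typing import Iterable, Tuple


def pairing_is_ambiguous(n_490: int, n_8xx: int) -> bool:
    """True when a 490 needs pairing with an 8xx but more than one match is possible."""
    needs_pairing = n_490 >= 1 and n_8xx >= 1
    more_than_one_choice = n_490 > 1 or n_8xx > 1
    return needs_pairing and more_than_one_choice


def count_ambiguous(field_counts: Iterable[Tuple[int, int]]) -> int:
    """Count records whose 490/8xx pairing cannot be determined from counts alone."""
    return sum(1 for n_490, n_8xx in field_counts if pairing_is_ambiguous(n_490, n_8xx))


# Invented sample: (number of 490s, number of 8xx fields) per record.
sample = [(1, 1), (1, 0), (2, 1), (2, 2), (0, 1)]
print(count_ambiguous(sample), "of", len(sample), "records are ambiguous")
```

Even equal counts (two 490s, two 8xx fields) are treated as ambiguous, since nothing guarantees they correspond in order.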
By the way
This is one good example of how it’s sometimes difficult or even impossible to get meaningful information out of our AACR2/MARC, despite some people’s belief to the contrary that it’s always simple and straightforward.
So… what the heck should be done with this 440/490/8xx stew?
Amazon Windowshop: Serendipitous Browsing Online September 18, 2009
Posted by jrochkind in General.
Fiacre O’Duinn alerts us to a kind of interesting interface Amazon provides, which I hadn’t been aware of before: Amazon Windowshop.
Fiacre asks if this is what the library catalog should look like.
I wouldn’t want the WHOLE library catalog to look ONLY like that — but I think it could be VERY useful and interesting to provide a “serendipitous browsing” interface to the catalog (on top of a more traditional type-in-search-get-result list interface) that is along the lines of Amazon windowshop.
Try to replicate the experience of browsing the shelves, but online you get the benefit that you can arrange books in more than one dimension (as amazon windowshop does in two), re-arrange them in different orders (for instance LCC OR DDC OR something else entirely, don’t have to pick just one), and additionally be able to allow unified browsing of a corpus that may be in several different physical locations (including off-site storage) or may be currently checked out but maybe you want to include them in the ‘browse’ anyway.
I’ve been thinking for a while about how to provide such an online serendipitous browse experience, like a physical shelf browse but taking advantage of the unique affordances offered by the online environment. And I definitely thought (cover) images were a necessary component — I had been thinking of iTunes coverflow as a model. Amazon Windowshop provides another VERY interesting model to try and steal the best parts of — whenever I or anyone else can find the time to try and work on it! Too many cool projects, not enough time. (And replicating Amazon windowshop would take some fancy coding).
Sophisticated item services from Umlaut in Xerxes federated search interface September 15, 2009
Posted by jrochkind in General.
So, if you try to architect your applications solidly and flexibly, and build in features for integration, and it all works out okay, one of the benefits you get is it’s pretty easy to combine them.
I’ve added a feature to the Xerxes federated search tool that brings sophisticated item-level information and services, already being compiled by our Umlaut installation, into Xerxes record-detail pages.
I think this is pretty neat from a sort of ‘single business’ perspective of providing consistent services regardless of what tool the user happens to be using.
So now, when you look at an item detail page in Xerxes, you can, right on that page, see:
- call numbers and availability
- Full text links from SFX, right on the page
- Links to “similar items” content from Web of Knowledge and Scopus.
- links to pre-filled ILL forms, as appropriate.
- For monographic content, full text, preview, and ‘search inside’ functionality from Amazon, Google, and others.
- Other stuff — whatever happens to be configured in Umlaut, when new stuff is added to Umlaut, it’ll automatically show up in Xerxes too. (Well, new services of the existing types; if a whole new type/section is added to Umlaut, will take a couple lines of code in Xerxes to add it).
This is live in production here now, but you can’t really see it without a local login. So here’s some screenshots of Xerxes item detail pages, content from Umlaut circled in red.
It’s worth noting that this content is inserted on the page by javascript after page load. It can take 1-3 seconds or so to come in (depending on how fast Umlaut can do its thing), which you can’t see in the screenshots. While waiting, you get a spinner and status message. If a user doesn’t have javascript enabled, this feature won’t affect their page view at all.
DLF ils-di dlfexpanded service for Horizon September 10, 2009
Posted by jrochkind in General.1 comment so far
So, I have a servlet (based on initial work from Tod Olson at uchicago, expanded by me) to provide holdings information from Horizon in the DLF ils-di “dlfexpanded” format. The servlet code and some documentation is available.
That’s the short statement. It turns out that you can’t really just say that without providing some more specifics, caveats, exceptions, limitations etc. Also it’s worth adding some interesting observations.
Motivation
As we’re moving ahead with Blacklight, we’re going to need to have some way to get item holdings information out of Horizon. By “item holdings information” I mean “copy” information: what items do we have, what are their call numbers, what are their statuses (checked in or out among many others), what are their locations, etc. etc. Everything you’d need to provide an actual OPAC display telling the users what they need to know about our holdings.
A sidenote on terminology: In Horizon there are ‘items’, and sometimes a bib just has ‘items’. But sometimes a bib has different sets of items in groups; this is usually used for serials, or occasionally for multi-volume series. Horizon confusingly calls this set of items a ‘copy’. The DLF ils-di report calls it a ‘holdingset’. I have no idea what your ILS calls it. It’s a two-level hierarchy: a bib can contain one or more copies/holdingsets which each contain items, OR a bib can contain one or more items directly, without the intervening copy/holdingset.
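If it helps, that two-level hierarchy can be sketched as a small data structure. This is only an illustration in Python; the class and field names are mine, not Horizon’s or the DLF report’s.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    item_id: str


@dataclass
class HoldingSet:
    """What Horizon calls a 'copy' and DLF ils-di calls a 'holdingset'."""
    holdingset_id: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Bib:
    bib_id: str
    holdingsets: List[HoldingSet] = field(default_factory=list)
    items: List[Item] = field(default_factory=list)  # items attached directly to the bib

    def all_items(self) -> List[Item]:
        """Every item under this bib, whether direct or inside a holdingset."""
        grouped = [item for hs in self.holdingsets for item in hs.items]
        return self.items + grouped


# A serial with one holdingset, and a monograph with items directly on the bib.
serial = Bib("bib-1", holdingsets=[HoldingSet("copy-1", [Item("i1"), Item("i2")])])
monograph = Bib("bib-2", items=[Item("i3")])
print(len(serial.all_items()), len(monograph.all_items()))
```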
And, the way most people are doing this at present (for a variety of reasons) is checking in realtime at point of demand for this info, not trying to index it. So, okay, go with the conventional wisdom. So I need a realtime service to provide this info from Horizon.
But I figure, as long as I’m doing this, MUCH better to provide the info in some standard format, instead of a custom one. Then, theoretically, the consuming code on the Blacklight end can be written to that standard format, instead of being custom for Horizon. And my understanding is that the Blacklight team has indeed been thinking/wishing for some standard stuff on the Blacklight end to consume stuff in DLF ils-di format, and/or jangle (which also typically, at the moment, uses the DLF ‘dlfexpanded’ format to actually return data in).
So, okay, that makes sense.
But DLF ils-di format is not a complete spec
So it turns out once you decide to return data in the DLF ils-di “dlfexpanded” format, you’re actually not done deciding what your data is actually going to look like.
The dlfexpanded format is just kind of a coat tree to hang your actual metadata ‘coats’ on. dlfexpanded lets you give a list of itemIDs and say they belong to a bib; it lets you give a list of holdingsets and say which itemIDs belong to them. Good so far. But to actually describe anything else about those items and holdingsets (location, call number, item status, any user-displayable notes, etc), you’ve got to include additional metadata of your own choosing — dlfexpanded gives you some hooks that it allows you to hang basically whatever other namespaced (and hopefully specified and standardized) XML you want on.
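A tiny sketch of the “coat tree” idea, using Python’s standard ElementTree. To be clear, the namespaces and element names here are placeholders I made up for illustration; they are NOT the real dlfexpanded schema, which you should take from the DLF ils-di documentation. The point is only the shape: a thin wrapper says which items belong to which bib, and each item carries whatever namespaced payload you choose.

```python
import xml.etree.ElementTree as ET

# Placeholder namespaces, invented for this sketch (not the real DLF ones).
WRAPPER_NS = "http://example.org/coat-tree-sketch"
PAYLOAD_NS = "http://example.org/my-payload"


def q(ns: str, tag: str) -> str:
    """Build an ElementTree-style qualified name: {namespace}tag."""
    return f"{{{ns}}}{tag}"


def location_payload(code: str, label: str) -> ET.Element:
    """A 'coat': some namespaced metadata of our own choosing."""
    el = ET.Element(q(PAYLOAD_NS, "location"), {"code": code})
    el.text = label
    return el


def build_record(bib_id: str, item_payloads: dict) -> ET.Element:
    """The 'coat tree': a wrapper listing items for a bib, each with a payload hung on it."""
    record = ET.Element(q(WRAPPER_NS, "record"))
    ET.SubElement(record, q(WRAPPER_NS, "bib"), {"id": bib_id})
    items = ET.SubElement(record, q(WRAPPER_NS, "items"))
    for item_id, payload in item_payloads.items():
        item = ET.SubElement(items, q(WRAPPER_NS, "item"), {"id": item_id})
        item.append(payload)
    return record


record = build_record("bib-1", {"item-1": location_payload("main", "Main Library")})
print(ET.tostring(record, encoding="unicode"))
```

A consumer that only understands the wrapper can still list the items; it needs to know the payload format too before it can say anything about them.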
So figuring out what metadata to actually use to describe everything I wanted about my Items and Copies (aka ‘holdingsets’) took a bit of investigating and thinking.
simpleavailability
Sure, I used the dlf:simpleavailability format that dlfexpanded gives you just to say whether something is “available” or not (and provide a custom user-displayable string conveying that).
Although I ended up only providing that at the item level. The DLF ils-di report seems to assume the client could ask for ‘availability’ at the bib or holdingset level too. But I wasn’t even sure what the semantics of this should be, and figuring out the code to do this without impacting performance (more on performance later) was tricky. So, okay, the client can look at the availability on all items and figure out how to sum them up at the bib or copy level itself, if needed (I’m not sure I’ll even need to, for my use cases).
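For what it’s worth, here is one way a client might do that summing up. This is a sketch with made-up semantics (a bib counts as “available” if any of its items is); as I said, I’m not sure what the right semantics are, and the plain true/false flag per item is a simplification of whatever the real response carries.

```python
from typing import Dict


def summarize_availability(item_available: Dict[str, bool]) -> dict:
    """Roll item-level availability flags up to one bib- or copy-level summary."""
    total = len(item_available)
    available = sum(1 for is_available in item_available.values() if is_available)
    return {
        "any_available": available > 0,  # assumed rule: one available item is enough
        "available_count": available,
        "total_count": total,
    }


# Invented item IDs and flags.
summary = summarize_availability({"item-1": False, "item-2": True, "item-3": False})
print(summary)
```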
But I want to say a lot more about my Items and Copies than simpleavailability. I want to include enough data that my complete OPAC screen could be replicated by third party software.
mfhd
So after hunting around for available ‘standard’ options, I settled on good old MFHD, expressed in marc-xml. I considered the new fangled “ISO Holdings”, but limited public documentation is available, and from looking at the schema that is available, it didn’t look like ISO Holdings would let me express anything that MFHD didn’t. Sure, MFHD is kind of a bear for the developer to work with, with all those opaque numeric codes, but oh well, went with the known evil, MFHD.
Except I’m not really using mfhd as is typical. I use just enough of it to express what I want. I include kind of a dummy ‘leader’ just for the sake of appearances, since there’s nothing in the leader I actually need. In standard MFHD usage, you would rarely (never?) have an individual MFHD record just for an item, but the dlfexpanded “coat tree” gives me hooks to hang MFHDs for individual items, and that makes it a lot more convenient to express and retrieve things unambiguously, so why not. So anyway, it’s MFHD, but I’m not necessarily saying any existing MFHD-processing tools will be able to do much with it, I’m using it so unusually (although not illegally in any way as far as I can tell). Oh well, at least it’s a standard format.
Interestingly, while MFHD theoretically lets you express serial run statements in a machine readable form… A) I don’t have that info in my ILS anyway, and B) that machine readability in the way mfhd has you express it is a lot more theoretical than practical. So I’m not doing that. If my ILS had the data, I’d probably express it in the more straightforward ONIX Serial Coverage Statement instead of MFHD. (Note to ONIX people — why oh why do you only provide the actual schema in a zip file online? You used to provide it individually. Very inconvenient.)
But wait, there’s more
But to completely express all the data I’d need to duplicate my OPAC display in external software, mfhd still didn’t quite do it for me. Mostly, I wanted more internal ILS codes. mfhd lets me express ‘location’ and ‘collection’ as user-presentable strings, but I want to reveal my internal non-mutable codes for these too. mfhd doesn’t let me express the concept of ‘item type’ that’s in my catalog at all!
So after looking around some more for something to do that, I gave up and just created my own very simple XML schema to do it, which I’m calling “ILS holdings schema” for expressing internal codes and such, in case you want to.
And one more plug for DAIA
And as I alluded to in my last post, I’m using DAIA too, at this point solely to expose the URL that can be accessed to issue a ‘request’ for the item through HIP. This is a bit against the spirit of DAIA, since exactly what a ‘request’ will do is unclear. [Recall a checked out item, or only add you to a hold list? Let you check it out, or only request it to be provided in the special collections reading room? Deliver it to a circ desk, or actually to your office (as we provide to some people)? Who knows!]
And worse, I’m not able to actually pre-check if ‘request’ really is available or not, for reasons discussed in the last post. Which is really against the spirit of DAIA.
But oh well, it was such a nice little schema for simply revealing a URL for a service, and my OPAC ‘request’ feature is a service… so I used it.
At some later point I hope to go back and make a real nice DAIA response, but it’ll be a buncha work, which isn’t required by the specs of the project I’m working on presently.
Oh, and I only provide DAIA at the item-level too, not at the Copy or Bib level. (I think some people’s Horizon setups actually do allow Requests at the Copy or Bib level, but not ours, so I couldn’t quite figure out how it should/would work and didn’t have time for it).
Performance Issues
So I think the servlet is reasonably fast, but the trick when you’re developing an API that’s going to be used by other software is… “reasonable” gets a lot less forgiving. I mean, let’s say there’s a search result ‘hit list’ with 20 hits on it. My software might want to call this API 20 times for one web page! A 0.2 second response time might be pretty good for a user-facing web app, but not for an API that needs to be called 20 times to deliver one page to the user.
So I might have some speed issues, that theoretically I can optimize to some extent. (Although I’m not looking forward to it. Java is not my specialty. If I had to do it over again, not sure I would have done this in Java, although it made sense at the time for several reasons. And if I were going to do it in Java, I think I’d want to use a framework of some kind, not do it with the pretty low-level stuff that JDBC and Servlet APIs alone give you. But that would result in its own trade-offs.)
But perhaps worse than the speed issues are some response size issues. I took a look at the response for a bib I knew would have a lot of items: JAMA, with dozens of holdingsets and hundreds or more items. The dlfexpanded response was 1.2 megs! That might be an issue for sending across the network, loading into memory, and parsing the XML on the client side.
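The arithmetic is simple but worth spelling out. A quick sketch, assuming the 20 calls are made one after another (no parallelism), and using a 1-second page budget that is purely my own illustrative number:

```python
def total_api_time(calls: int, seconds_per_call: float) -> float:
    """Time spent waiting on the API if the calls run sequentially."""
    return calls * seconds_per_call


def per_call_budget(page_budget_seconds: float, calls: int) -> float:
    """How fast each call must be to fit all sequential calls in the page budget."""
    return page_budget_seconds / calls


# 20 hits on a results page, 0.2 seconds per API call: about 4 seconds of waiting.
print(total_api_time(20, 0.2))

# To fit 20 sequential calls into an assumed 1-second budget, each call
# would have to come back in about 0.05 seconds.
print(per_call_budget(1.0, 20))
```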
It’s so large in part because there’s some redundancy in the multiple metadata formats we use to express everything. A basic schema-less ad hoc uchicago-created XML response for the same data is only 220k. Which is still pretty big.
So, I provided some extra query parameters (not specified in dlf ils-di of course) to allow the client to limit the data returned, if it doesn’t really need all of it. The client can choose which metadata payloads it wants for items or copies, instead of taking all of them. And the client can choose NOT to have items included in a response that includes copies, just to include the copy information, and let the client ask for the item info later if it needs it.
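From the client side, using those limits might look something like this. The base URL and the parameter names ("id", "item_payloads", "include_items") are hypothetical stand-ins I made up for this sketch; the servlet’s documentation has the real ones.

```python
from typing import List, Optional
from urllib.parse import urlencode


def holdings_url(
    base_url: str,
    bib_id: str,
    item_payloads: Optional[List[str]] = None,
    include_items: bool = True,
) -> str:
    """Build a request URL that asks only for the data the client needs."""
    params = {"id": bib_id}  # hypothetical parameter name
    if item_payloads is not None:
        # Ask for only some metadata payloads instead of all of them.
        params["item_payloads"] = ",".join(item_payloads)
    if not include_items:
        # Copy-level info only; the client can ask for items later.
        params["include_items"] = "false"
    return f"{base_url}?{urlencode(params)}"


url = holdings_url(
    "http://example.org/holdings",
    "12345",
    item_payloads=["simpleavailability"],
    include_items=False,
)
print(url)
```

For something like JAMA, a first request without items followed by item requests per holdingset only when the user expands one would keep each response small.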
We will see how it goes.
Standard or not? Workable or not?
So, okay, I’m providing my info in the DLF ils-di ‘dlfexpanded’ format, but how standard is it? If someone says “Oh yeah, I have code that can consume dlfexpanded”, does that mean it will automatically work with my (or anyone else’s!) dlfexpanded info?
Doubtful. You’ve got your choice of metadata payloads to hang on that ‘coat tree’, and everyone can choose different things. Even once you’ve chosen, two people providing the same ones may be using them slightly differently (as evidenced by a few choices I had to make here and there with how to use mfhd).
On top of that, for performance related reasons, or to fit ‘dlfexpanded’ into the actual use cases I have (which go beyond simple DLF “getAvailability”), my dlfexpanded responses sometimes don’t include everything — just because there are no ‘items’ listed in the response doesn’t necessarily mean there are no items, they might have been suppressed based on the request parameters for performance. And, those request parameters are non-standard, but I think (at least for my use cases), the client is really going to need to use them to avoid a performance nightmare.
Or, if you asked my API for info on a certain item, you get a dlfexpanded response that only has that item in it, not all the other items belonging to the same bib, which may or may not be misleading or confusing to the consumer.
Meanwhile, I’ve only written the producer end of things so far, I haven’t even written the consumer. When I get around to writing the consumer, I’m probably going to run into even more tricks and problems requiring me to go back and revise, including but not limited to performance stuff.
So we’ll see. I don’t blame the DLF ils-di task force for this; they did a great job. But we make the map as we tread the path, there’s no way to map out everything without actually trying it in practice first, and trying it in a bunch of different use cases and scenarios to abstract out the commonalities. So, we’re figuring it out as we go, that’s the only way to do it, and the ils-di task force wisely recognized that and didn’t try to map everything out in advance.
Still, it means this stuff is trickier than it might originally seem. The specs, standards, and best practices are not “done”, not even close. We’ve got to figure out a bunch of stuff.
DAIA and ILS complexity September 2, 2009
Posted by jrochkind in General.8 comments
So DAIA is a nice little response format-slash-API specification from Jakob Voss.
It’s focused on a very specific goal: describing what services are available for a given item, possibly providing URLs to access that service for a given item, telling the user how long they’ll have to wait to get that service, etc.
Some more specific scenarios mapped to my library might make things more clear. For a given item and user, that user might be able to:
- Look at the item in the library. Which they might be able to do immediately (upon finding it in the stacks), or there might be a 1 or 2 business day delay because it’s in some kind of closed stacks or offsite storage, and they’re going to have to request it.
- OR, there might be a longer delay, because the item is currently checked out, and they’re going to have to wait until it comes back — or maybe they have ‘recall’ privileges, and there’s still a delay, but shorter!
- Check the book out? Again, maybe they can, or maybe they can’t at all. If they can, maybe they’re going to have to first ‘recall’ it (if they’re allowed to), with a longer delay.
- Request the book for delivery to a circ desk? Related to recall/checkout, but in rare cases they might be able to request delivery to a circ desk, but only view it in library! And there are cases where they might be able to check it out, but NOT request delivery. Or where they can request delivery, but they won’t get it until the book comes back on its own, they have no ‘recall’ privileges.
Now, the answers to these questions, once determined, are easily expressible in DAIA, no problem.
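To give a flavor of what “expressible” means, here is a DAIA-style structure built as a plain Python dict. The field names (“available”, “unavailable”, “service”, “delay”, “expected”) and service names follow my recollection of DAIA and should be checked against Jakob’s spec before relying on them; the IDs, dates, and delays are invented.

```python
import json


def service(name: str, **details) -> dict:
    """One service entry, e.g. 'presentation' (view in library) or 'loan'."""
    entry = {"service": name}
    entry.update(details)
    return entry


# Scenario: the item sits in offsite storage (viewable after about 2 days),
# and cannot be loaned until an expected date. All values are made up.
item = {
    "id": "item-1",
    "available": [service("presentation", delay="P2D")],
    "unavailable": [service("loan", expected="2009-10-01")],
}

doc = {"document": [{"id": "bib-1", "item": [item]}]}
print(json.dumps(doc, indent=2))
```

Writing this down once you know the answers is the easy part.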
The problem is, as the complicated foregoing discussion may have hinted, that determining the answers to these questions from our ILS is enormously complex. All the info is in the ILS somehow. In the end, either the ILS is going to allow a ‘request’ or a ‘loan’ or a ‘recall’, or it’s not. And there’s info in the ILS to let us predict what’s going to happen, and estimate how long it’ll take until the user gets access (as DAIA allows us to express once we’ve figured it out). It’s all there somehow — but trying to figure out how to actually predict it, oh boy, I get confused really quick. There are dozens of different tables I need to consult in the ILS, and figure out how they interact and which takes priority or overrides which other. Privileges can be set on item statuses, locations, groups, etc. Borrower statuses, groups, types, etc. And they are not set, in my ILS Horizon, in only one place, but in dozens of different places with different semantics that all interact in ill-defined ways.
Phew.
It seems like something a user would expect, in this day and age, that when they look up a book the listing could actually TELL them if they can check the book out (and how long they’ll have to wait to get it, if there’s a recall involved, etc), if they can view it in the library, if they can request it for delivery, etc. Our ILS is currently incapable of doing that — to the extent that it even always displays a ‘request’ button, and the user has to actually click on it to find out if they actually can make a request or not. Which is generally the only way a user can find out what services are available, by trying them. Which depending on the service may or may not be able to be done over the web (can you look at it in the library? Who knows unless you go there and try. Or call a librarian and hope they aren’t as confused as I am!). You want to know how long you’re probably going to have to wait to get it? Too bad.
At first I optimistically thought I could calculate all this stuff from the ILS, deliver it in DAIA, and then use it in new interfaces to actually tell the users what they’re going to want to know. DAIA is quite up to it. But writing code to actually calculate these things — very non-trivial. Not so happy with Horizon right now.
Anyone reading this know about the open source ILS’s? Would this be easier in any of them?
back at work September 1, 2009
Posted by jrochkind in General.
I have returned from my leave of absence, and am back at work.
maps, territories, and discretion August 4, 2009
Posted by jrochkind in General.2 comments
Lorcan Dempsey mentions in passing that decisions about which manifestations belong to the same work set are “discretionary at the edges”:
A note on ‘discretionary’. We cluster stuff based on aggregate cataloger choices. I like Tim Spalding’s characterization of the ‘cocktail party test’ in a blog entry about works and LibraryThing.
Regarding ‘discretionary’, I think this is exactly right. It’s important to note that the ‘work set’ is a subjective and contextual choice, not some objective piece of data waiting to be discovered. But that doesn’t mean it’s useless. It’s very important because (in Western culture at least?) the concept of ‘work’ exists, and is of value to users; it’s socially constructed, it’s got grey edges, reasonable people may disagree in edge cases, but that doesn’t mean it doesn’t exist or isn’t useful! (For instance, when a patron comes in asking for a copy of Hamlet, try telling her “Sorry, there’s no such thing as ‘Hamlet’, please come back when you can tell me a particular edition published in a particular place at a particular year that you want.” Ha!)
The FRBR report says: “The concept of what constitutes a work and where the line of demarcation lies between one work and another may in fact be viewed differently from one culture to another.” Quite right.
(This element of ‘discretion’ is present to some extent in ALL models of reality — and our bibliographic description is indeed a model. And always has been, even if it hasn’t always been formalized, even if it’s been based on traditional implicit shared understanding, not spelled out. The ‘map’ is never the ‘territory’, just a useful abstraction/approximation, with certain discretionary choices made as to be useful to a certain community/context).
So traditional cataloging tries to make these work distinctions (to the extent that they are implied in AACR2 choices like ‘uniform title’, and hopefully more explicit in RDA, but I can’t say for sure) by setting out precise instructions meant to result in choices that match that cultural determination of ‘work’.
LibraryThing tries to do it instead by just relying on members of that culture using their intuition, and averaging out everyone’s choices and relying on them to reach consensus through discussion.
Neither is more ‘correct’, and neither is more ‘FRBR’, just two different approaches to trying to create a collective decision about work sets that is useful to users. Both are discretionary and subjective.
Sometimes this is hard for those in the library community to understand; we seem to attract people who want there to be a ‘right’ and ‘wrong’ answer, who want “but is it REALLY the same work?” to be based on some kind of objective reality with an absolute discoverable ‘correct’ answer. Sorry, that’s just not reality, modelling reality is inherently full of discretionary choices. But we can, as we traditionally do in library cataloging, set out rules and guidelines to make it as likely as possible that different people at different places will make the same choices.
But we really ought to explain why those rules and guidelines are what they are, and what the goal is. In order to better allow catalogers to use their professional judgement to make the choices most likely to accomplish that goal.
AACR2 has rules that kind of sort of provide guidelines for making ‘work set’ decisions, but couch them in terms of simple orthographic decisions for ‘uniform title’, without ever even mentioning that this is really a decision about work sets. Gee, that definitely doesn’t help us understand what we’re doing, or try to do it to serve the users as well as possible. (It also doesn’t help that ‘uniform title’ is meant to express (at least) several other things in addition to ‘work set’; we really need to record the ‘is in work set X’ decision as its own discrete reconstructable data element.) Anyone know if RDA improves on any of this? I still haven’t had the fortitude to try and make it through the RDA draft.
Work-centric or manifestation-centric display?
Lorcan also points out that:
Interestingly, Goodreads and LibraryThing seem to default to a work-based view: the entry is at the work level… Amazon seems to default to a particular ‘manifestation’ or ‘expression’… Google Books seems to do something similar…. Worldcat.org is more like Amazon and Google. At the moment, it aims to show the most highly held member of a work set in a result, and then link to other editions from that…
I’d be interested if Worldcat is considering trying to make a ‘work’ view the default ‘landing page’ from a search, a bit more like LibraryThing. I suspect this would actually be of more general use than the library legacy practice of always showing individual manifestations as search ‘landing pages’.
Lorcan says: “There are reasons for taking these various approaches and each service make decisions based on what it is trying to do, and the view it takes of its user interests.” Certainly true as far as it goes — but I’ve never seen a written out clear analysis of what the reasons for the traditional library manifestation-centered display are, what they are trying to accomplish, what user interests we believe they are meeting.
I suspect that in fact this choice isn’t based on any clearly thought-out attempt to meet certain user interests, but is instead made just because we’ve always done it that way: in the card catalog world it was impractical to do otherwise, and in the online world it takes a bit more work to do otherwise. That is not the same as this approach actually being optimal for meeting identified user interests.
Please note that I’m not saying that we should ‘catalog at the work level’, whatever that would mean. Our cataloging practices certainly still need to describe manifestations, and there need to be different records for different manifestations. (On the other hand, something like subject cataloging probably is best done once at the work-level, not duplicated effort for every manifestation.) But a work-centric display can still be provided — if there is sufficient data recorded to allow software to reconstruct cataloger decisions about work-set groupings! Current practice makes this difficult.
(And note that’s why Amazon and Google don’t have workset-centric displays. They don’t have the data to do it! Even Google’s vaunted algorithmic prowess can’t, apparently, determine work set groupings reliably enough to make a work-centric display. At least not within the resources Google is willing to throw at the problem. LibraryThing can do it because of volunteer human labor! Library cataloging theoretically relies on such human labor, and we certainly spend an enormous amount of person hours in such labor — but don’t actually capture the fruit of that labor in unambiguous enough form to make it easy for the software to take advantage of. Shame on us.)
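To make the point concrete, here is a minimal sketch of what display software could do if every manifestation record carried an explicit, unambiguous work-set identifier. The field names (`work_set_id`, `holdings_count`) and the sample records are invented for illustration; nothing like them is reliably present in current records, which is exactly the complaint.

```python
from collections import defaultdict

# Invented sample data: three manifestation records, two of which
# belong to the same (hypothetical) work set "w1".
records = [
    {"id": "m1", "title": "Moby Dick (1851 ed.)", "work_set_id": "w1", "holdings_count": 40},
    {"id": "m2", "title": "Moby Dick (1992 ed.)", "work_set_id": "w1", "holdings_count": 310},
    {"id": "m3", "title": "Typee (1846 ed.)", "work_set_id": "w2", "holdings_count": 25},
]

def group_by_work(manifestations):
    """Group manifestation records into work sets keyed by work_set_id."""
    works = defaultdict(list)
    for rec in manifestations:
        works[rec["work_set_id"]].append(rec)
    # Within each work set, list the most widely held manifestation first.
    for members in works.values():
        members.sort(key=lambda r: r["holdings_count"], reverse=True)
    return dict(works)

works = group_by_work(records)
for work_id, members in works.items():
    print(work_id, [m["id"] for m in members])
```

With the grouping decision recorded as data, a work-centric landing page is a simple lookup. Without it, software has to guess the grouping from titles and headings.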
on leave July 31, 2009
Posted by jrochkind in General.3 comments
I will be on a leave of absence from work for the entire month of August, getting some much needed rejuvenation, hopefully coming back to work re-energized. heh.
I will have spotty internet access throughout the month of August. If you need to get in touch with me, you’re better off leaving a comment here (which will end up notifying me at my personal email address, which I’ll check spottily but occasionally) than sending to my work address (which I probably won’t check at all). Feel free to leave a comment asking me to get in touch with you (just don’t get upset if you don’t hear back for a while!); the email address you enter in your blog comment will be visible to me.
See you all in September if not before then!
exposing holdings in dlf ils-di standard format web service July 31, 2009
Posted by jrochkind in General.7 comments
So, as we move toward Blacklight implementation, I needed some way to expose item/holdings details from my Horizon ILS so they could be consumed for display (and/or indexing) in Blacklight.
I figured, as long as I’m doing this, I might as well do it in some kind of standard (rather than custom ad-hoc) format, so the consumer on the Blacklight end could be standard-ish. It could then possibly be re-used by others, or by ourselves if we switch ILSs: we’d just need to write a new provider on the ILS end, and could keep the same consumer.
So looking around for standard formats, the DLF ILS-di format (xsd schema) seemed pretty suitable, designed for just this task.
So, thankfully, Horizon actually keeps all its info in a fairly well normalized rdbms that you can access directly, making this not too hard a task. On top of that, those fine folks at uchicago already had a little extension to Horizon to provide the item information in their own custom ad-hoc format, which they kindly shared with me. So I took that, and modified it to produce output in ils-di format.
Metadata formats
Now, the thing about the ils-di format. It gives you a sort of skeleton to hang your info on. You can list items. You can list what ils-di calls ‘holdingsets’ (and Horizon confusingly calls ‘copies’, and I don’t know what your ILS calls them — a group of related items, like all the bound volumes of a particular bib; or the multiple volumes of a multi-volume set). You can express which items are in which holdingsets.
This is all great, because there wasn’t a simple standard format to do that in before. But when you actually want to say something about the holdingsets or items, ils-di just gives you a slot to put some other (hopefully standard) metadata format in. (With one exception: ils-di gives you “SimpleAvailability”, which describes item availability/status with a human-displayable label and one of four coded status values. This was wise of them, because there was no good way to provide status from a standard vocabulary without this.)
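As a rough illustration of that skeleton, the sketch below builds one record containing a holdingset and its items. The namespace URI and the element names (`record`, `bibliographic`, `holdings`, `holdingset`, `holdingsrec`, `items`, `item`) are assumptions based on the ILS-DI 1.1 schema and should be checked against the xsd; the `id` attributes and the holdingset and item identifiers are made up.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the DLF ILS-DI schema; verify against the xsd.
DLF_NS = "http://diglib.org/ilsdi/1.1"
ET.register_namespace("dlf", DLF_NS)

def dlf(tag):
    """Return a namespace-qualified tag in ElementTree's {uri}tag form."""
    return f"{{{DLF_NS}}}{tag}"

def build_record(bib_id, holdingsets):
    """Build a skeleton record: one bib, its holdingsets, and their items.

    holdingsets maps a holdingset id to the list of item ids it contains.
    """
    record = ET.Element(dlf("record"))
    ET.SubElement(record, dlf("bibliographic"), {"id": bib_id})
    holdings = ET.SubElement(record, dlf("holdings"))
    for holdingset_id, item_ids in holdingsets.items():
        holdingset = ET.SubElement(holdings, dlf("holdingset"))
        # The slot where an embedded metadata package would be placed.
        ET.SubElement(holdingset, dlf("holdingsrec"), {"id": holdingset_id})
        items = ET.SubElement(holdingset, dlf("items"))
        for item_id in item_ids:
            ET.SubElement(items, dlf("item"), {"id": item_id})
    return record

record = build_record("418855", {"copy-1": ["item-1", "item-2"]})
print(ET.tostring(record, encoding="unicode"))
```

The skeleton only says which items belong to which holdingset. Everything descriptive has to come from whatever metadata format is dropped into the slots.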
Now, I think ils-di is exactly right to do things this way. Break the problem into manageable chunks, solve one chunk with a solution meant to do one thing well, and make sure your solution can be ‘loosely coupled’ with other solutions meant to solve the other parts. Fine, good show.
But that still leaves me to figure out how to actually describe what I want to describe, using what XML schemas, standardized if possible. (And leaves the community to arrive at a standard set of these extra schemas at a later date, if we want to write software that really is ‘plug and play’ with each other. Oh well, that’s how it goes: better to try some things and define ‘best practices’ and standards off of what works well than to try and ‘standardize’ before trying in the wild.)
All my stuff
So what are all the data elements I have that I want to describe somehow, in these extra metadata packages embedded in ils-di?
Well, you can see them right here in uchicago’s custom ad hoc format, what their servlet did out of the box, with this example of a moderately complex serials record:
http://hip-dev.mse.jhu.edu/items/bib/418855.uchicago
So, okay, where to put it? Well, bibIDs and itemIDs are already in the ils-di schema itself. So what else do we have? MARC Format for Holdings Data (MFHD) in MarcXML seems likely. Maybe ISO Holdings? Maybe NCIP?
I started with MFHD in marcxml, because NCIP confuses me (and everyone else), and with ISO Holdings you need to pay a couple hundred bucks to look at the standard (although you can see the .xsd schema alone for free).
So in MFHD you can put a lot of stuff actually. Although it’s somewhat confusing to look at, since it uses those obscure marc tag codes and such. But you can put in there:
- user-displayable ‘location’ and ‘collection’ in tag 852
- ‘holding’ (ie ‘holdingset’ ie ‘copy’) identifier in tag 001.
- shelfmark (ie call number/copy information) also in 852.
- A coded value of whether that call number is LCC, NLM, Dewey, Sudocs, a couple others, or ‘other’ or ‘unknown’. 852 indicator 1.
- For ‘holdingsets’ user-presentable coverage statements (for main run, indexes, or supplements), in 866-868.
- (Note, if my ILS actually had machine-understandable coverage statements, which it does not, you theoretically maybe could put them in MFHD, but I’d much prefer ONIX Serial Coverage, which I think does it much more elegantly and clearly. But I don’t have that data available anyway.)
- I think you can provide an un-coded user-presentable item status/availability string somewhere, but SimpleAvailability takes care of that better so I didn’t worry about it.
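The list above can be sketched as a small MARCXML holdings record. This is an illustration under assumptions, not the code the servlet actually uses: the choice of 852 subfields ($b for location, $c for collection, $h for call number) is one plausible reading of MFHD, the function and parameter names are hypothetical, and the sample values are invented. The 852 first-indicator values for shelving scheme follow the MARC 21 Holdings format.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("marc", MARC_NS)

# 852 first indicator: shelving scheme (blank means no information provided).
SHELVING_SCHEME_IND1 = {"lcc": "0", "dewey": "1", "nlm": "2", "sudocs": "3", "other": "8"}

def marc(tag):
    """Return a namespace-qualified MARCXML tag."""
    return f"{{{MARC_NS}}}{tag}"

def add_datafield(record, tag, ind1, ind2, subfields):
    """Append a datafield with the given (code, value) subfield pairs."""
    field = ET.SubElement(record, marc("datafield"), {"tag": tag, "ind1": ind1, "ind2": ind2})
    for code, value in subfields:
        ET.SubElement(field, marc("subfield"), {"code": code}).text = value
    return field

def build_mfhd(holding_id, location, collection, call_number, scheme, coverage):
    """Build a minimal holdings record with 001, 852, and 866 fields."""
    record = ET.Element(marc("record"))
    ET.SubElement(record, marc("controlfield"), {"tag": "001"}).text = holding_id
    ind1 = SHELVING_SCHEME_IND1.get(scheme, " ")  # unknown scheme -> blank
    add_datafield(record, "852", ind1, " ",
                  [("b", location), ("c", collection), ("h", call_number)])
    # 866: textual holdings for the main run; second indicator 0 = non-standard notation.
    add_datafield(record, "866", " ", "0", [("a", coverage)])
    return record

mfhd = build_mfhd("copy-1", "Main Library", "Periodicals", "QA76 .J68", "lcc", "v.1 (1990)-v.12 (2001)")
print(ET.tostring(mfhd, encoding="unicode"))
```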
Meanwhile, dlf:SimpleAvailability is handling my need for both a coded and user-displayable item status/availability string, great, one thing done well. (Although I needed to create a mapping from my 109 internal ‘item status’ codes to the four dlf:SimpleAvailability values!).
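A mapping like that might look something like the sketch below. The four status strings are assumed from the ILS-DI schema and should be verified against it; the local status codes are invented stand-ins, not real Horizon codes. Anything unmapped falls back to ‘unknown’ rather than guessing.

```python
from enum import Enum

class SimpleAvailability(Enum):
    """The four coded values, as assumed from the ILS-DI schema."""
    UNKNOWN = "unknown"
    AVAILABLE = "available"
    NOT_AVAILABLE = "not available"
    POSSIBLY_AVAILABLE = "possibly available"

# Invented local item status codes, for illustration only.
LOCAL_STATUS_MAP = {
    "i": SimpleAvailability.AVAILABLE,            # checked in
    "o": SimpleAvailability.NOT_AVAILABLE,        # checked out
    "l": SimpleAvailability.NOT_AVAILABLE,        # lost
    "tr": SimpleAvailability.POSSIBLY_AVAILABLE,  # in transit
}

def to_simple_availability(local_status):
    """Collapse a local item status code into one of the four coded values."""
    return LOCAL_STATUS_MAP.get(local_status, SimpleAvailability.UNKNOWN)

for code in ("i", "tr", "zz"):
    print(code, "->", to_simple_availability(code).value)
```

Keeping the table as plain data, separate from the lookup function, is what makes it easy to move into a per-site configuration file.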
But that still left me with some things I wanted to include. Well, MFHD gives me user-displayable labels for location and collection. But I really wanted to include my ILS’s internal codes for location and collection and item status. Why would I want purely local internal codes? Well, because applications I’m using to consume this can possibly be configured to make use of them even though they are purely local identifiers (especially if I’m writing the apps myself!). I also wanted to include ‘item type’ as both an internal code and a user-displayable label, and strangely MFHD has no spot for even a user-displayable label for that. Similarly, I wanted to expose my internal system “call number type” id, which is not always mappable to a standard type in MFHD like LCC or DDC or whatever.
I looked over what documentation I could find for NCIP, as well as the NCIP xml schema, but it didn’t seem to have the fields I needed either. I even looked at the ISO Holdings schema without any documentation (my skills at reading raw XML schemas have improved muchly through this project). Nope, not there.
So, what?
Ross Singer had an idea that you could do this purely with Dublin Core (including refinements in ‘dcterms’) and RDF. That might be possible, but I just couldn’t figure out how to do it. And really, I don’t think there are sufficient elements in dcterms to cover all of those data elements, although Ross found some clever ways to try and express a few of them. (Ross was trying to do a bit MORE than I really needed, since he didn’t want to depend on the ils-di schema, but I’m just trying to get some metadata I can embed in ils-di for now; that’s my use case.)
So I guess there’s theoretically some way to express your own refinements to dcterms? But I got lost trying to figure that out.
So one way or another, I figured I was going to define my own vocabulary. I could do it as an RDF Vocabulary alone, but I got confused trying to think about that, and once you go to trying to express that in RDF-XML… got confused again. Or I could do it in a custom XML Schema. If I’m going to have to create my own vocabulary anyway, XML Schema just seemed simpler, both to produce and to consume. (And it would be easy for me or someone else to convert this to RDF at a later date, starting from a schema. RDF-XML even lets any defined XML namespace pretty much be RDF out of the box, just add a few RDF attributes here or there!)
So custom schema it was. I created (or am in the middle of creating) an awfully simple XML schema for these elements I needed, mostly internal ILS values, and for each one the schema says you can supply one or more (internal or external) identifiers using a child dc:identifier, a user-displayable label using a child dc:title, and if you like a longer-format user-displayable description. (Didn’t re-use dc:description for this because I really wanted a couple extra attributes there seemed to be no way to add to a dc:description).
Here it is, work in progress. (Not even sure if this validates yet).
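For a sense of the shape being described (one or more dc:identifier children plus a dc:title label), here is a hedged sketch. The local namespace URI and the element name `location` are placeholders, not the names in the actual schema, and the sample codes are invented. Only the Dublin Core elements namespace is a real, fixed value. The longer description element is left out because its extra attributes aren’t spelled out here.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
LOCAL_NS = "http://example.org/ils-local/"  # placeholder, not the real schema namespace
ET.register_namespace("dc", DC_NS)
ET.register_namespace("local", LOCAL_NS)

def local_element(name, identifiers, label):
    """Build one local-vocabulary element with identifiers and a display label."""
    element = ET.Element(f"{{{LOCAL_NS}}}{name}")
    for identifier in identifiers:
        ET.SubElement(element, f"{{{DC_NS}}}identifier").text = identifier
    ET.SubElement(element, f"{{{DC_NS}}}title").text = label
    return element

# Invented example: an internal location code with its display label.
location = local_element("location", ["emain"], "Main Library")
print(ET.tostring(location, encoding="unicode"))
```

Re-using dc:identifier and dc:title for the parts that Dublin Core already covers keeps the custom vocabulary down to just the wrapper elements.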
The (not so) final product
So here it is, the current version of a dlf ils-di document produced live from my (development box) Horizon, including in its metadata payload MFHD in marcxml, dlf:SimpleAvailability, and my custom as-yet-unnamed schema.
See for example this same moderately complicated serials record:
http://hip-dev.mse.jhu.edu/items/bib/418855
Where to next?
Well, I’ve got to finish polishing it off, make sure all the XML validates against the schemas, make sure the new schema I created is really valid, etc. Polish off a few more things.
Then, I’d like to put this code (derived from uchicago’s code, with their permission) on Google Code, so that other Horizon institutions can use it to provide dlf ils-di responses from their catalogs, woo. (I tried to keep the code as generalizable as possible — for instance, the mapping from your local item status codes to the four dlf:SimpleAvailability values is configurable in a properties file).
I’ve also got my eyes on DAIA as another metadata schema to include in the dlf ils-di response eventually. DAIA is focused on doing what SimpleAvailability does, but with more detail: What services are available, and what are the URL access points for each service? I need to figure out how to correctly extend DAIA to include services that aren’t in DAIA’s built-in four. (I specifically need the services ‘get a photocopy of a portion of this item’ and ‘place an ILS request/hold for pickup at circ desk’, two services we offer that DAIA doesn’t specify right now.)
And Ross tells me what I’ve done so far has gotten me a lot of the way to a jangle implementation. Great, that was part of the goal, so apparently it succeeded. I’ll finish off the rest of jangle when I have a use case that demands it, which could be sooner or later! (And first I’ll need to understand jangle better!)

