What librarians do July 1, 2009

Posted by jrochkind in General.
5 comments

So I just gave (or co-gave) a presentation here on Umlaut as deployed here as our Find It service.

One of the most exciting parts to me was that various (non-IT) librarians in the room, unprompted, started throwing out ideas of what it could do in the future. Quite good ideas. I had to resist the techie’s urge to respond to them with “Well, yeah, but see, that’s harder than it might seem to make work like that…”, and instead try to be encouraging and positive, because it was great to have such a conversation. We hardly ever have such conversations.

Why? I think because usually a non-technical librarian has absolutely no way to put such innovative thoughts into practice. As Karen Schneider talked about in her 2007 Code4Lib Keynote, libraries have ended up outsourcing a significant part of their core business to vendors, in a way that we pay for it, and we get it, and we pretty much take what we get.

My experience made me realize today that one of the (many) negative side effects of this is that librarians have lost the opportunity (and thus been implicitly ‘trained’ not to even bother trying) of doing what librarians should be doing in this era when so many of our services are delivered over the web: Figuring out how to make these services meet our users’ needs better!

Contrary to popular belief, you can’t just let your users tell you what your services will be. Sure, of course you need to listen to your users. And if you listen and observe very carefully, you can figure out what your users’ needs are, some of which they may not even be able to articulate themselves, but others of which they most certainly can. But you can’t count on your users to identify the best solutions to these needs. That’s what we’re for, that’s why we’re professionals!

And, to me at least, it’s one of the most interesting and rewarding parts of our jobs.

But the outsourcing of much of the library’s business to vendors has taken the opportunity to do that away from most of us, though an IT geek like me in a library that lets him get away with it still has some. Most non-IT librarians have had it reinforced that they shouldn’t even bother. And while you have to be an IT type to implement new online services or features, you shouldn’t have to be one to be engaged in dreaming up and planning them.

One thing open source can do is return this power to us. I’m pretty pleased that Umlaut (and my ability to explain it) is finally at the point where its future potential can be seen enough to encourage non-technical librarians to start suggesting “Hey, but what if it could do this and that too? Wouldn’t that be great?”

And, if I can somehow find the time amongst the way too many really great things that I’d like to do if I had time, maybe soon it will!

cataloging theory really is useful June 30, 2009

Posted by jrochkind in General.
5 comments

As much as I’m sometimes frustrated by our common inherited legacy cataloging practices, I actually do think the cataloging theory developed by Lubetzky, Svenonius, Cutter, and others is still useful — sometimes you just need to ‘translate’ it to the modern environment.

I’ve been thinking about how having persistent unique identifiers (bib IDs) for our records is really important — but not generally prioritized in some of our legacy cataloging practice. There are a bunch of ways to explain why this is important (and it’s kind of obvious to the CS-perspective-inclined).

But I realized another way goes back to some language used in my cataloging class. A cataloging record is called a ‘surrogate’ for the physical item described. That’s exactly what it is, even more so in the digital age: it allows the physical item to be ‘projected’ into the digital environment as a digital object which is a ‘surrogate’ for the physical object (or sets of objects, depending on the context you consider it in) it represents.

Perhaps this helps explain why a persistent bib ID is important using cataloging theory language. Because the record is a surrogate for the physical object in the digital environment, we want to be able to link to the surrogate in different ways, from simply bookmarking it to building more complicated ‘semantic’ relationships based upon it. All of that depends on having a persistent identifier (a persistent bib ID) for the surrogate. Changing the bib ID of the surrogate in the digital environment in unpredictable ways would be analogous to periodically changing where the physical item is physically shelved in unpredictable ways! The internal unique identifier for the surrogate is essentially its digital “location”.

[That's a bit of an oversimplification -- giving the digital surrogate a reliable digital 'location' requires some layering on top of the unique internal ID, to give it a unique persistent URI too. But the pre-requisite for that is a persistent unique internal ID.]

[And, incidentally, for the semantic web geeks reading, this gets at some of my dissatisfaction with this focus on 'real world objects' vs 'documents' or whatever they're currently calling the second class. I don't think it's at all a clear distinction, and can often get confusing right quick, and I think it's probably a mistake to rely on such a confusing distinction for crucial parts of your 'specs'.  A cataloging record is a 'web document', surely, but it's also a surrogate (not JUST a 'description') for a real world object.  Sure, we can split hairs and talk about how to handle that. But the fact that it gets so confusing and abstract and hair-splitting and subject to debate worries me and makes me suspicious of relying on such a distinction for describing how to 'do business' in the sem web.]

NYU goes live with Umlaut June 29, 2009

Posted by jrochkind in General.

NYU has gone live with Umlaut. I’m holding my breath hoping that nothing will go wrong with their installation that’s my fault. :)

Hi all,
We’ve deployed Umlaut to our production Primo environment at NYU.

Umlaut is available through the “GetIt” link on a search results page at
http://www.bobcat.nyu.edu and is hosted at http://getit.library.nyu.edu

Thanks,

Scot Dalton
Web Development
Division of Libraries
New York University

It’s interesting to me that they are using Umlaut to work around an exceptionally poor part of Primo’s user experience — the page (or really pages in a ‘tabbed’  frameset wrapper) that actually gets the user to accessing the document (physical location/availability or electronic availability etc).

Turns out Umlaut is exceptionally well suited to replace this role in Primo, because Primo already relies on and supports calling out to an OpenURL receiver, and because Umlaut is designed for this kind of ‘known item’ and/or ‘last mile’ service. I think (un-humbly) that the mark of a well-thought-out piece of software is when it can serve well in situations that aren’t exactly what it was designed for. A ‘known item service provider’ is something we needed all along but didn’t realize it, and once you have one you can find ways to use it I never thought of. I expect that more Primo customers will become interested in Umlaut.

And, my understanding is that Summon will also rely on sending out an OpenURL for actual local ‘last mile’ access, so I predict that Summon customers will similarly be interested in Umlaut.

I hope anyway!  Thanks very much to Scot from NYU for spearheading the Umlaut deployment there;  I have been very impressed by how quickly Scot was able to get things up and running, with little help from me, including writing some new features and plug-ins to talk to Aleph. Although I’d like to think that the quality of Umlaut’s code and documentation gets some credit here, Scot has been a pleasure to work with, and I hope he will continue working on Umlaut.

Somewhat oddly from my point of view, NYU has deployed Umlaut only in the context of their Primo OPAC/discovery layer.  Traditional link resolver use still goes right to SFX.  Personally, I think that our users in most of our libraries already have too many different interfaces to deal with, and I place a priority on consolidating and integrating them. Umlaut’s goal is to serve this role by providing a ‘known item last mile’ interface in as many contexts as possible.  But I understand that politically it can be difficult to make big changes at once, and my understanding is that NYU does eventually plan to target Umlaut for traditional link resolver use too.

MARC 856, I don’t like you June 18, 2009

Posted by jrochkind in General.
10 comments

If there’s one MARC field that needs an over-haul, it’s the 856. Roy has already talked about how it’s pretty much impossible to tell what the url in an 856 represents with relation to the item cataloged.

But here’s a question for you, let’s say you have an 856 URL to full text for a serial. And you know what date ranges it covers. What sub-field would you put that in? $3 or $z? I see it in both. $3 seems somewhat more appropriate, but $3 also often contains various other kinds of information, such as the name of the provider, or the chapters of a monograph that are covered.  Or even a human-readable description of whether the link represents full text at all, or just tables of contents etc — most 856 fields don’t have this at all, but when they do it’s usually in the $3.

So basically, if you want software to be able to tell the user what range of dates is covered in full text, from the MARC, forget about it.  And that’s not even talking about trying to get a machine processable range of dates, so software could actually calculate whether a particular date/issue is included.  Even finding a simple display string representing dates of coverage is pretty much infeasible.

For how significant links to electronic versions (or supplementary information) have become, I’m kind of amazed that the 856 spec and practice haven’t really been changed since the early days of the WWW. Probably because most of our ILS’s wouldn’t be able to handle it anyway. A vicious circle, as we generally run into with MARC.

What’s needed

Just for a start,

  • A way to encode in a machine-recognizable way whether the link represents full text, or just excerpts, or something else entirely (like a review). Right now full text and excerpts are coded the same, when they’re coded at all, as Roy’s paper discusses.
  • A field for extent of coverage that is actually used only for extent of coverage. The $3 is theoretically this, but people throw all sorts of stuff into it. And possibly different fields for date range extent of coverage, vs. other extent of coverage (like for a monograph, chapter 3 only, or volume 1 only, etc).  [Yes, it now occurs to me that the way this interacts with the first bullet point needs to be thought through].
  • A field for displayable provider/platform.

And that’s just the low-hanging fruit. Then we start wanting the provider/platform to not just be human readable, but controlled in some way (URIs?), so we can collocate by provider/platform.

And then we start thinking about how we really want the dates of coverage for a serial to be machine actionable so software can compute if a particular date/issue is included. Which gets us thinking about MARC Format for Holdings, which is its own entirely gigantic mess. I don’t think this kind of ‘holdings’ data is usually used at all with records with an 856 representing electronic full text. But even if it were, typical MFHD use makes it pretty infeasible for software to actually answer this question.

When authority is wrong June 16, 2009

Posted by jrochkind in General.
9 comments

“J Sakai” is a name describing at least two people. “J Sakai” with the “J”, so far as I know, standing for nothing in particular (likely a pseudonym) is the name of a revolutionary leftist theorist.

“Sakai, J”, apparently short for “Jun’ichi”, is also the name of a (or at least one!) physicist.

They all seem to share the same LC authority record; that is, titles authored by both have been assigned to the same authority record.

http://worldcat.org/identities/lccn-n85-343552

http://errol.oclc.org/laf/n85-343552.html

I am as reasonably confident as one can get that the author of Settlers: The Mythology of the White Proletariat (which I recommend, by the way) is not the author of Phase conjugate optics (which I probably couldn’t understand enough of to have an opinion on). If the Identities page is to be believed, under “Associated Subjects”, we have an author whose publications touch upon “Anti-imperialist movements” and “Working class whites”, as well as “Columns, concrete -- testing” and “Bridges -- foundations and piers -- mathematical models.” Such a renaissance man may exist, but being somewhat familiar with the political Sakai’s work, I’m pretty sure it’s not he.

[I also have to wonder if it's but one or in fact two Japanese physicists responsible for works on structural engineering theory; as well as on optics and solar flares. But I'm not familiar with those authors/works, and at least that's all physics. And then we've got some books on geology and fossils too; a third scientist?]

But at some point some cataloger (or machine process?) appears to have seen the authority record for Physicist J Sakai, and decided it was suited for revolutionary communist theorist J Sakai.

So I’ve noticed this, let’s say I’m a naive non-librarian user, or heck, let’s not and just say I’m me. What would I do about this? The naive user might think since they’ve discovered it on worldcat.org, they should report it to OCLC. But what’s OCLC going to do about it? We know that OCLC is really only displaying the aggregated collective recorded data of an international community of catalogers. So I can… what, find a cataloger who has NACO authority to change these records (and create a new authority record for the communist Sakai), AND has the time/interest to do so, just out of general community spirit, even though it doesn’t really matter for their institution which is paying them and expects them to catalog X records per hour that do matter to them?

Kind of makes one reconsider again the idea of ‘wiki-like’ editing of our collective cataloging corpus. Would we be worse or better off if an interested user could, upon noting this issue in Worldcat Identities, immediately and easily, right on that web page, make a correction? Even make a correction resulting in a new authority record for the ‘missing’ Sakai? Even if the interested person isn’t NACO-certified, or isn’t even a cataloger, or isn’t even a librarian? As opposed to requiring not only a librarian, but a select elite of certified librarians, to open up a completely alternate interface, find the individual records in question, and submit changes?

The idea that limiting editing access to an esoteric elite results in perfect information is simply a myth.

The physicist Sakai, Jun’ichi 1936-, if he happened across the Identities page, would probably be none too pleased to find the cover images presented as ‘representative works’ of his oeuvre being politically charged works by a completely different guy, Confronting Fascism: Discussion Documents for a Militant Movement, and Looking at the White Working Class Historically (actually by David Gilbert, but with a foreword by the leftist Sakai).

Swap ISBNdb for BookFinder June 15, 2009

Posted by jrochkind in General.
4 comments

I’ve been using ISBNdb.com to provide a link from library pages for a known item book to a vendor-independent way to buy a book online. I don’t like letting Amazon or any other particular vendor get a monopoly. ISBNdb.com provides a list of a variety of vendors, with prices.

ISBNdb was the first thing I discovered that will let me do this. And ISBNdb has a really nice API, although I don’t really use it for much other than linking to the page. But I like supporting them (they get the Amazon affiliate ID etc), because the guy seems like a good guy, and the API is nice and I might want to use it some day.

The problem is that the ISBNdb.com interface is kind of messy. And it only updates prices ‘on demand’, so if you ask for prices for an item that hasn’t been looked at for a while, they might not be up to date (but will trigger ISBNdb’s crawlers to go fetch new prices, which will be there in a few minutes — which is too late for ‘just in time’ for the user!).

So then I discover BookFinder. Similar service as far as providing prices from multiple vendors. Much cleaner interface. The ads are more subtle. It does have a direct linking syntax, but doesn’t actually have an API. BookFinder actually conducts the web crawl really just-in-time, before showing you the answers. Which I think is a better interface, but leaves no good way to pre-check to make sure BookFinder has a page for the relevant ISBN. But experience with these ISBN-based services tends to be that if the ISBN exists, they WILL have some data.

So what do you think, should I swap ISBNdb.com for BookFinder in Umlaut?

An ISBNdb example.

A BookFinder example.

update. Hmm, one downside of the BookFinder interface is if I just feed it an ISBN, it doesn’t actually give me feedback as to what book I’m searching for, to make sure it was the right one.

OCLC COinS generator more powerful than expected June 12, 2009

Posted by jrochkind in General.
6 comments

The COinS generator at http://generator.ocoins.info/ is useful for, well, generating COinS.

We like to include COinS in Code4Lib Journal article citations to materials not available on the web.  You can see more about COinS and how/why we use them if you like.

So I figured the generator was pretty mechanical: it just took the fields you entered, and inserted them into the proper data fields. There’s no text to indicate it does anything more powerful than this, and I’ve learned not to expect anything fancy from library-domain software, right?

Except it looks like it actually takes the details you supply and looks up applicable ISSN, ISBN, and even DOI, and then inserts them in the OpenURL context object for you, even when you didn’t enter them!  Wow! That’s neat!  And leads to a much more reliable OpenURL, indeed.
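For contrast, here is a minimal sketch of the purely mechanical behaviour I had expected: take the fields as entered and pack them into a COinS span, with no lookups at all. The function name `make_coins` and the choice of the journal metadata format are assumptions made for illustration; this is not the generator's actual code, and it does none of the ISSN/DOI enrichment described above.

```python
from html import escape
from urllib.parse import urlencode


def make_coins(fields):
    """Build a COinS <span> from a dict of OpenURL 'rft' fields.

    Hypothetical helper: it only encodes what it is given and never
    looks anything up.
    """
    pairs = [
        ("ctx_ver", "Z39.88-2004"),
        # Assumption: describe the referent with the journal KEV format.
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    pairs += [("rft." + key, value) for key, value in fields.items()]
    # URL-encode the key/value pairs, then HTML-escape the result
    # (so "&" becomes "&amp;") because it lives inside an attribute.
    title = escape(urlencode(pairs), quote=True)
    return '<span class="Z3988" title="%s"></span>' % title


span = make_coins({
    "genre": "proceeding",
    "aulast": "Göker",
    "atitle": "Analysing Web Search Logs to Determine Session Boundaries "
              "for User-Oriented Learning",
    "date": "2000",
    "spage": "319",
    "epage": "322",
})
print(span)
```

Anything not typed in (an ISSN, a DOI) simply would not appear in the output of a generator like this, which is why the enriched result was a surprise.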

Take this citation, which looked really mysterious to me, the citation actually looks like it’s missing some info (is that a conference proceedings? Shouldn’t it have location then?).  But okay, I entered it into the coins generator.

Göker, A., & He, D. (2000). “Analysing Web Search Logs to Determine Session Boundaries for User-Oriented Learning.” Proceedings of Adaptive Hypermedia and Adaptive Web-Based Systems, 319-322.

Which somehow, I have no idea how, figured out to add ISSN 0302-9743 to it, and DOI 10.1007/3-540-44595-1_38.

Now that ISSN doesn’t actually belong to the proceedings identified, it belongs to Lecture notes in computer science.  Where the paper was apparently (re-)published? And, with that ISSN and DOI, my own institutional link resolver somehow actually manages to find some licensed full text for me:

http://findit.library.jhu.edu/go/95317

You can’t get past the wall, but clicking on the Springer link really DOES get me to the full text for that article.

Not exactly sure what’s going on, but it’s kinda neat. Curious if anyone from OCLC involved in that thing reads this, and can give the technical details of what it’s actually doing.

Incidentally, that findit.library.jhu.edu page should have a COinS embedded in it itself, so if you have a COinS browser extension installed, you can go from there to your own institutional link resolver, to see if you can get the full text from there. If you happen to be interested in this article, or just want to compare and contrast.

(Also, if you try to generate a COinS for a book, and enter the ISBN, it adds a weird rft_id value with a URL in it pointing to the Amazon page. I don’t think I like that. If I want to advertise for Amazon, I’ll choose to, thank you very much! Also, if I do want to record an ASIN, I much prefer to record an Amazon relationship with a urn:asin: value than with a not necessarily persistent HTML URI to an Amazon page.)

could our catalog look like this? June 10, 2009

Posted by jrochkind in General.
8 comments

From Casey Bisson, someone else’s mock-up of a suggestion for what an Amazon results page should look like, to actually give the user the context on the result page to know which items are of interest:

[Oops, my blog layout isn't big enough to see the interesting right column while displaying the image at full size. Click to see the image at full size].

The entire essay by Bret Victor is well-worth a read, on interface layout design for actually helping users accomplish their tasks with appropriate context.

UPCs, EANs, and ISBNs: the verdict May 28, 2009

Posted by jrochkind in General.
1 comment so far

So it gets confusing understanding how these things relate. But I think I’ve figured it out, and it’s actually quite simple. By EAN, I mean EAN-13 specifically, which is usually what people mean when they say EAN.

A UPC is just an EAN beginning with 0 (zero), with the initial 0 left off. All EANs beginning with 0 are US/Canadian. At some point someone in the US thought, hey, we don’t need to actually interoperate with the rest of the world, we only care about us, so let’s just leave that 0 off, since all of ours are 0 anyway. Since n+0=n (you can call that the ‘additive identity property of zero’ if you want to be fancy), the leading zero adds nothing to the checksum, so you don’t need to recompute the check digit or anything: just take a UPC, prepend a 0 to it, and you’ve got a valid EAN. Likewise, if you have an EAN beginning with 0, just drop it and you’ve got a UPC.
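A small sketch of why that works, assuming the standard EAN-13 checksum (digits weighted 1, 3, 1, 3, … from the left). The function names and the sample UPC are illustrative choices, not taken from any particular library.

```python
def ean13_check_digit(first12):
    """Return the EAN-13 check digit for the first 12 digits."""
    if len(first12) != 12 or not first12.isdigit():
        raise ValueError("expected exactly 12 digits")
    # Weights alternate 1, 3, 1, 3, ... starting at the leftmost digit.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return (10 - total % 10) % 10


def is_valid_ean13(ean):
    """True if ean is 13 digits and its last digit matches the checksum."""
    return (len(ean) == 13 and ean.isdigit()
            and ean13_check_digit(ean[:12]) == int(ean[12]))


def upc_to_ean13(upc):
    """Prepend a zero to a 12-digit UPC; the check digit is left untouched."""
    if len(upc) != 12 or not upc.isdigit():
        raise ValueError("expected a 12-digit UPC")
    return "0" + upc


def ean13_to_upc(ean):
    """Drop the leading zero from an EAN-13 that starts with 0."""
    if len(ean) != 13 or not ean.startswith("0"):
        raise ValueError("only EAN-13s beginning with 0 have a UPC form")
    return ean[1:]


sample_upc = "036000291452"  # sample value chosen for illustration
ean = upc_to_ean13(sample_upc)
print(ean, is_valid_ean13(ean))
```

The prepended zero lands in a weight-1 position and shifts every other digit onto the weight it already had in the UPC scheme, so the weighted sum, and therefore the check digit, is unchanged.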

An ISBN-13 is an EAN.  That’s right, any EAN beginning with 978 or 979 is an ISBN-13, and any ISBN-13 already is an EAN.  Those first few digits in an EAN are a region/country code, but since there are so many books published, books got their own ‘country’ in the EAN Universe, the so-called “Bookland.” (I want to live there!).

So an ISBN-10, to be converted to an EAN, just needs to be converted to an ISBN-13 (including recalculating the check digit). The move from ISBN-10 to ISBN-13, in addition to adding extra space for ISBNs, also makes them fully legal EANs with no conversion necessary. (That’s why they used 978 and 979 instead of say “1” and “2”: to make it finally a fully harmonized and valid EAN while they were expanding the address space, since 978 and 979 are the EAN ‘bookland’ prefixes.)
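The conversion can be sketched as follows, assuming the usual recipe: drop the old ISBN-10 check character, prefix 978, and compute a fresh EAN-13 check digit. The function name is made up for this example, and the sketch does not verify the incoming ISBN-10's own check digit.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to its ISBN-13 (EAN-13) form."""
    chars = isbn10.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10 or not chars[:9].isdigit():
        raise ValueError("expected a 10-character ISBN-10")
    # Keep the nine payload digits; the old check character (which may
    # be 'X') is discarded because the EAN checksum is computed differently.
    body = "978" + chars[:9]
    # EAN-13 checksum: weights 1, 3, 1, 3, ... from the left.
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)


print(isbn10_to_isbn13("0-596-52167-7"))
```

Going the other way only works for ISBN-13s in the 978 range, since a 979 number has no ISBN-10 equivalent.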

Phew. Took me a few days of puzzling to figure that out, so maybe it’ll help someone else. Someone correct me if I’m wrong, but I believe I ain’t.

Oh, and on top of all that, it looks like an EAN-13 is now officially called a GTIN-13. Phew. So many names for essentially the same thing.

While we’re on the topic

OCLC/LC standards actually TELL you to put binding information (eg “paperback”) and other “qualification” (eg “vol 1”) jammed right into 020$a with the ISBN, making it unnecessarily difficult for machine actors to pull out the actual ISBN, which they might need, from a bib record. You might think it would be a lot better to put that info in a different sub-field, and in fact there is a subfield $b for binding information, but it’s “obsolete. do not use.”
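To show the kind of guesswork this forces on software, here is a rough sketch that pulls the ISBN off the front of an 020$a value and ignores whatever qualifier follows. The function name and the sample subfield values are invented for illustration, and the pattern is a heuristic, not a rule taken from any standard.

```python
import re

# Leading run of digits, hyphens and a possible trailing X check character.
_LEADING_ISBN = re.compile(r"^\s*([0-9Xx-]+)")


def extract_isbn(subfield_a):
    """Return the bare ISBN at the start of an 020$a value, or None."""
    match = _LEADING_ISBN.match(subfield_a)
    if not match:
        return None
    candidate = match.group(1).replace("-", "").upper()
    # Accept only the two legal shapes: ISBN-10 (may end in X) or ISBN-13.
    if re.fullmatch(r"\d{9}[\dX]|\d{13}", candidate):
        return candidate
    return None


# Made-up examples of the sort of strings found in 020$a.
for raw in ["0596521677 (pbk.)",
            "9780596521677 (v. 1 : alk. paper)",
            "080442957X :",
            "no isbn here"]:
    print(repr(raw), "->", extract_isbn(raw))
```

With the qualifier in its own subfield, none of this parsing would be needed.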

What the heck is wrong with us?

Question for the crowd

Is there a canonical way (or any good way) to express an EAN as a URI?

update 5:36pm. Oh boy. I think. Maybe. There’s also something called EPC, which it turns out is a superset of EAN-13.  I think you can losslessly convert an EAN-13 to an EPC, and I think that an EPC may have a canonical URI form.  EPC is intended for use with RFID, and can express more than an EAN-13 (aka a GTIN-13), but I think you can have a valid EPC which is no more or less than a representation of an EAN-13.  But I can’t quite figure out the details.

Confused enough yet?

update again two minutes later. aha! http://tools.ietf.org/html/draft-mealling-epc-urn-02

Still confused. It looks like you can have a URN for a GTIN (aka EAN-13?) like for example this:

urn:epc:id:sgtin:900100.0003456.1234567

But you do need those periods. And only the first two groups of numbers correspond to the EAN-13. In that case, the EAN-13 would, I think, be 9001000003456. That next number is some extra info (I think identifying the particular item in FRBR-speak, the actual individual physical thing). What if you don’t have the extra info, can you still make a GTIN URN? I could be wrong about a GTIN being the same thing as an EAN-13. Very confused.

And lastly…

Using the good folks at O’Reilly as an example, it looks like urn:epc:id:gtin (instead of sgtin) can be used to represent a standard GTIN-13.

See:

http://opmi.labs.oreilly.com/product/9780596521677

<dc:identifier>urn:epc:id:gtin:063692.0015396</dc:identifier>

Except what I don’t understand is why they supply an ISBN-13 which appears to be different. Based on my understanding that GTIN-13 == EAN-13 == ISBN-13, the ISBN-13 should be the same… and yet it is not…

<dc:identifier>urn:isbn:9780596153960</dc:identifier>

So I still remain confused. But looks like the folks at O’Reilly might be good to talk to.

UPC/EAN lookup May 20, 2009

Posted by jrochkind in General.
9 comments

So if you have an ISBN, there are a variety of places you can look up what it represents, as discussed in the last post.

Books (at least for the past 40 some years) have ISBNs.  But what about media (from the past few decades) other than books, like CD’s and DVD’s?  They’ve got a barcode, but it’s not an ISBN. It’s a UPC (which is now actually called an EAN or a UCC-13).  The same system used on any barcoded retail items you find at the store. Although, sadly, unlike ISBNs, EAN’s from CD’s, DVD’s etc aren’t typically recorded in our bibliographic records (why not??).

I don’t know of any great free/cheap way to look up an EAN and figure out what it’s for. I don’t even know if there’s a database of such somewhere that if you have enough money you can access, or if blocks of EANs are just divvied out to vendors, and there’s no central authority tracking how they use them.

But someone just brought my attention to upcdatabase.com, which is a community volunteer updated database of some UPCs.   I was told it even has an API, although I haven’t been able to find it. But it looks like you can download an entire db dump if you want it.

Testing a few random CD’s from my collection, its coverage isn’t very good though. Which isn’t that shocking, since it’s volunteer compiled and there are millions (billions?) of EANs out there.

Similar: www.checkupc.com. Seems to have better coverage, but doesn’t share their data. (Wonder if they’re getting it from a licensed source instead of volunteer submissions).

Hmm, if anyone would have a db of many many EANs, I’d think it would be Amazon, but I don’t think their API reveals that functionality. Actually, it looks like maybe it does. Darn, but if I start using that it’s really going to be irreplaceable if Amazon takes their API away from me.

And wait a second, it looks like here’s a free resource from the identifier assigning authority itself, although no API: http://gepir.gs1.org/V31/xx/gtin.aspx?Lang=en-US. Wonder how mad they’d get (or what counter-measures they’d take) if I screen scraped. But strangely, its coverage doesn’t seem to be as good as checkupc.com’s.

Okay, it turns out I don’t know much about what’s out there. Anyone know more? Is there a good reliable free/cheap source of EAN/UPC lookup?