A Review of WorldCat Search API(s)

You know, like a book review, except it’s an API review!  Yeah, it gets awfully long. Hopefully it’ll be useful to someone.

So OCLC gets some cred just for providing an API at all; indeed, 5(?) years ago when it first came out, this was kind of an exciting thing in the library world, especially from OCLC.

But years later, we have started to expect more than just bare existence (hopefully! at least we deserve to!). And I don’t think the WorldCat Search APIs have actually changed much since they were released. I had the opportunity to play with them for the first time recently, so here’s my review, with some suggestions for improvement.

My particular experimental investigative use case was not replacing my local catalog with WorldCat (Local) via the API; it was instead just providing an integrated WorldCat search alongside the local catalog, mainly for purposes of feeding into ILL. Since this is an enhancement over not having the service at all, rather than a replacement of an existing full-featured service (as replacing the catalog would be), and since existing methods of doing ILL or searching WorldCat remain intact, we can get by with a much more basic feature set than we could if we were actually replacing our catalog.

The tl;dr version of my personal evaluation is: The WorldCat Search API is sufficient for some basic use cases; you aren’t going to be able to duplicate worldcat.org with it (if for no other reason than there is absolutely no api access to facet display); it could really use some expanded normalized metadata in the non-MARC response formats, most especially ‘content type’/format sorts of values (which are already being created by worldcat.org, just not exposed in the api response).

If OCLC is trying to position WorldCat Local as a ‘discovery service’ competitor with Summon, Primo, and EBSCO Discovery Service, they (and their potential customers) should know that all the competitors have APIs that will let you create your own fully-featured interface instead of using the default provided one. Even the competitors with the worst APIs (I’ve been reviewing them all lately, and some of them are quite bad) still have barely enough functionality to make this possible (with difficulty). WorldCat does not.

OpenSearch vs. SRU

There are two main variants of the WorldCat Search API: an “OpenSearch” one and an “SRU” one.  There is a free-to-everyone variant of the OpenSearch one too; the SRU one can only be used by libraries with an OCLC relationship of some kind. (I think you have to be either an OCLC Cataloging Member, or subscribe to WorldCat on FirstSearch, but not necessarily both? There might be other ways to get access too).

OpenSearch

In the OpenSearch one, you can’t do any kind of field-specific search at all; you can only search the general keyword-anywhere index. So as soon as I figured that out, I stopped looking into it and moved to the SRU one.

The OpenSearch one lets you get your results in either “RSS” or “Atom”. I didn’t actually look into these response types much once I realized the OpenSearch variant wouldn’t do, so I’m not sure if they are actually consistent with each other.

There is a “Worldcat Basic API” — which appears to be basically the exact same thing as the OpenSearch variant of the “WorldCat Search API” (you confused yet?),  although this isn’t stated clearly in documentation.   I think (from a discussion on the wc-dev-net listserv) the Worldcat Basic API is in fact exactly the same as the WorldCat Search API OpenSearch variant, except the Basic API has rate limits.

SRU

The SRU variant ostensibly uses the “SRU” protocol. But you can pretty much ignore what “SRU” is — I do!  It’s not like there’s any software that will work against “SRU” generically.  You can just look at the documented parameters the “SRU” variant takes for pagination and sorting etc., whether they happen to be as specified by SRU, it hardly matters.

What does matter is what SRU means for the format of your query: It means CQL.  So you can’t ignore what CQL is if you want to use the SRU variant. You’re going to have to figure out the syntax and semantics of CQL in order to send queries to the SRU variant of the API.  Realize that standard google-style queries like << globalization "drug war" columbia >> aren’t legal CQL; you’re going to have to formulate them as CQL, for instance: << srw.kw cql.all "globalization columbia" AND srw.kw cql.adj "drug war" >>.

You have to look in OCLC docs for the field names and operators actually supported by Search API — and the docs aren’t always quite right (for instance, prefix for all indexes seems to be ‘srw’, not ‘sru’ as in the docs). This “indexes” overview page in the docs might help too (I wish it had more examples though).

The benefit of CQL is that it’s semantically very well specified: you can send complicated boolean algebraic multi-fielded-combination queries to it, and it’ll give you exactly what you asked for.  And if you happen to have already worked with CQL, you’re already familiar with it (I luckily had).  The downside of CQL is that it’s confusing; it’s not something you’re going to make your users write legal queries in, so you’re going to have to translate user-entered queries to CQL before sending ’em to the API.
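To give a feel for that translation step, here is a minimal sketch in Python. It only handles bare terms and double-quoted phrases, and it mirrors the hand-written example above: bare terms go into one `cql.all` clause, each quoted phrase into its own `cql.adj` clause. The function name `to_cql` is made up for illustration, and using `srw.kw` as the default index is an assumption based on that same example; check the OCLC index docs before relying on it.

```python
import re


def to_cql(user_query: str, index: str = "srw.kw") -> str:
    """Turn a google-style query (bare terms plus "quoted phrases") into CQL.

    Hypothetical helper: the index name and the cql.all / cql.adj relations
    follow the hand-written example in this review, nothing more.
    """
    # Pull out the quoted phrases first.
    phrases = re.findall(r'"([^"]+)"', user_query)
    # Everything outside balanced quotes is treated as bare terms;
    # any stray unbalanced quote character is simply dropped.
    remainder = re.sub(r'"[^"]*"', " ", user_query)
    terms = remainder.replace('"', " ").split()

    clauses = []
    if terms:
        joined = " ".join(terms)
        clauses.append(f'{index} cql.all "{joined}"')
    for phrase in phrases:
        normalized = " ".join(phrase.split())
        if normalized:
            clauses.append(f'{index} cql.adj "{normalized}"')
    return " AND ".join(clauses)


print(to_cql('globalization "drug war" columbia'))
```

This does not escape backslashes or handle boolean operators typed by the user, so treat it as a starting point rather than a real query parser.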

The SRU variant has its own two choices of results response type: MarcXML or what they call “Dublin Core”.  We’ll talk more about those in just a bit.

Suggestion for OCLC:  Make request type and response type independent axes

It’s confusing that you have OpenSearch request type with responses in Atom and RSS, and then SRU request type with a completely disjoint set of response types: MarcXML or so-called DC.  This made it a lot more confusing for me to figure out which variant I wanted to use.

I’d suggest you make these symmetric and parallel.  You can choose to send your query in an OpenSearch request format, and choose to get back Atom, RSS, MarcXML, or DC. Or you can choose to do your request in SRU/CQL format, and choose to get back Atom, RSS, MarcXML, or DC.  Two separate independent axes you get to choose from: request type and response type.

Although this suggestion comes first in my review, it’s not necessarily the highest priority or most urgent one.  I include it partially as an illustration of what good APIs do: they exhibit parallelism between different parts, and consistency, and give you individual building blocks that you can put together how you like.  When they do this, they are a lot easier to use, quicker to get started with, more comprehensible to developers, and more flexible as to developer use cases.

Suggestion for OCLC: Support a worldcat.org-style query request type

The two different types of queries we can send OCLC are kind of too-simple and too-complicated, with no happy middle ground.

The OpenSearch variant lets you (I think! I didn’t investigate fully) send an end-user-friendly type of query, just a list of terms, which (I think!) respects phrase-quotes.  Sometimes that’s all you need — if the Atom/RSS response formats are sufficient.

CQL is great if you want full power to send complex multi-fielded boolean searches (and I hope you wanted DC or MarcXML response types, since those are your options now).  It’s great to have this option, but CQL can be difficult to work with, and almost certainly is going to require you to parse and re-formulate your end-user entered queries in CQL.

What’s needed — and missing — is a happy medium.  A query language that lets you do fielded searches, and perhaps at least basic boolean operators (if not nested boolean expressions with parens) — but with syntax that is more contemporary google-like, easier for developers to work with, and plausibly something that end-users could actually write queries in.

You know, like worldcat.org already has! worldcat.org actually already supports queries like << ti:"manufacturing consent" au:chomsky >>.  The API should accept these too.  With results in RSS, Atom, MarcXML, or DC. With pagination and sorting etc.  Heck, just take the SRU variant, but give a way to supply a query in “worldcat.org” format, not just CQL.

And make this actually worldcat.org format, not just something that simulates it, using the same underlying code that interprets actual worldcat.org queries.  So as worldcat.org evolves, the API will evolve with it. If this kind of user-friendly sophisticated querying syntax is useful for worldcat.org — why wouldn’t it be for apps built on the API too?

This still isn’t necessarily the highest priority suggestion I have (I can deal with CQL and may even prefer to for my use cases), but it’s one that just makes sense, and another illustration suggesting that OCLC is definitely not ‘dogfooding’ the API: it is not built on the same logic used by the OCLC applications they actually care about making meet users’ needs.
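Until the API accepts that kind of query itself, the rewriting has to happen on the client side. A rough sketch of what that might look like in Python follows. The names `fielded_to_cql` and `PREFIX_TO_INDEX` are hypothetical, and the `srw.ti` and `srw.au` index names are my assumption by analogy with `srw.kw`; verify them against the OCLC index documentation. It also will not behave identically to worldcat.org’s own parser, which is exactly the problem with simulating it.

```python
import re

# Hypothetical mapping from worldcat.org-style prefixes to CQL index names.
# Only srw.kw appears in this review; srw.ti and srw.au are assumptions.
PREFIX_TO_INDEX = {"ti": "srw.ti", "au": "srw.au", "kw": "srw.kw"}

# An optional "prefix:" followed by either a quoted phrase or a single word.
TOKEN = re.compile(r'(?:(\w+):)?(?:"([^"]*)"|(\S+))')


def fielded_to_cql(query: str) -> str:
    """Rewrite a query like ti:"some title" au:name as an AND of CQL clauses."""
    clauses = []
    for prefix, phrase, word in TOKEN.findall(query):
        # Unknown or missing prefixes fall back to the keyword index
        # (the prefix text itself is discarded).
        index = PREFIX_TO_INDEX.get(prefix, "srw.kw")
        word = word.replace('"', "")
        if phrase:
            clauses.append(f'{index} cql.adj "{phrase}"')
        elif word:
            clauses.append(f'{index} cql.all "{word}"')
    return " AND ".join(clauses)


print(fielded_to_cql('ti:"manufacturing consent" au:chomsky'))
```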

No Facets?

None of the API variants give you facet results — how many ‘postings’ you had for each value within a category, such as the ‘format’ category.

This alone would stop you from using the API as a replacement for, say, the native default WorldCat Local web application.

In this, WorldCat Local is behind all its competitors in the ‘discovery service’ market. Every other one gives you an API that, while sometimes clunky and hard to work with, has facets and is generally suitable to replicate and customize the default native interface. Not WorldCat.

For my use case, facets would be nice but aren’t necessarily essential. But if I were looking to replace my catalog with WorldCat Local, and counting on API access to build a custom interface, it would be a show stopper.

Note that you can (I believe) actually limit on any of the values that would be faceted upon, but only by using the SRU variant (not the OpenSearch variant) and adding the limit into your CQL query as a boolean clause.  You can say “restrict my result set to only Video”.  You just can’t get the facet report listing all the “format type” values in your result set and the postings for each.

I’d suggest OCLC add faceting support to the API.  And when you add it in, add it to both the OpenSearch and SRU variants. And in addition to a response element for facet information, provide a way to request selected facets that’s not just adding a boolean clause to your CQL query, but separate HTTP query params appropriate for the use case, consistent across both OpenSearch and SRU variants. There’s no ‘standard’ way to do these things in OpenSearch or SRU or any other standard I know about, which means you’ve got the responsibility to just make up a good one. Fortunately all the XML response types already in use accommodate adding extensions with custom namespaces to the response, and nothing stops you from adding custom recognized query parameters to any request endpoint.

Response types not sufficiently expressive

Now, here’s a problem that I personally would prioritize as more urgent to do something about.

With SRU (which you need to have a sufficiently expressive query type if you want fielded search or booleans!), you get your choice of MarcXML and what they call “Dublin Core”.

MarcXML is exactly what you think.   Dublin Core is an XML response which has a certain skeleton for echoing back query context and providing a records list (is this skeleton from, say, OAI-PMH, or is it just something for worldcat? I am not sure and honestly don’t care too much). The individual records have elements using XML-namespaces to try and kind of semantic-web style give you data elements about the record. Most (but not all) of the data elements come from Dublin Core.

I honestly don’t care what vocabulary they come from, what I care about is that they are very basic.

  • Okay, there’s title, authors (split into dc:creator(s?) and dc:contributors), publisher, all useful and necessary for the most basic use cases.
  • There’s a dc:language with three-letter ISO language codes — awesome!  There’s dc:subject’s which seem to be any 6xx’s, complete pre-coordinated string, without any way to tell which vocabulary a given one comes from — but okay, good enough, for many basic use cases.
  • But now we get into some problems. There’s a whole bunch of dc:description fields, which seem to be the values of any Marc 5xx’s, and perhaps some other fields. But there’s no way to tell what a given dc:description may be, semantically  — there might be a summary (marc 520) or table of contents (Marc 505) in there somewhere, likely to be of interest to the end-user. But these are mixed in with random pieces of information I have no idea what Marc field they come from (complete value: “UO-78 414–UO-78 415”, eh?), and strings where I know exactly where they come from, and know that usually the users won’t care (“Includes index”).
  • Similarly, and even worse, for dc:identifier:  which is a big grab bag of all sorts of identifiers.  A dc:identifier element might be an ISBN; or might be an ISSN. Or maybe a DOI if the original Marc record had one? Other kinds of identifiers I’ve never heard of and don’t know what they are? Check.  URL’s from Marc 856’s, whether or not they can actually serve as an identifier for the item described? In there.   Oh, and the ISBN’s are straight-from-MARC transcribed, which means they might include a suffix like “(pbk)”, not actually part of the ISBN.
    • What I want, a very common use case, is to know the record’s ISBN, if any. Instead, I’m reduced to checking every dc:identifier, looking for ones that have 10 or 13 adjacent-ish valid ISBN chars (don’t forget a trailing “X” is valid), and figuring that’s probably an ISBN. Or going the extra mile and using the check digit to see if it validates as an ISBN. Really? Just to get the ISBN out?
  • Fortunately, OCLCnumber and LCCN are not just thrown into dc:identifier, but placed in non-Dublin Core oclc:recordIdentifier elements. (It’s a bit odd telling an LCCN from an OCLCnumber, but possible). This also shows that this “DC” format doesn’t need to include just elements from the DC vocabulary, and already doesn’t exclusively, which is just fine.  Standard vocabularies are nice, but if the data you need isn’t there at all, it’s no consolation to know it wasn’t there because it’s not part of DC.
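For the record, the ISBN-hunting hack described above comes out roughly like the following Python sketch. It takes the text content of one dc:identifier, keeps the leading run of digits, hyphens, spaces and “X”, and accepts it only if the ISBN-10 or ISBN-13 check digit validates. The function names are mine, and this is my own workaround, not anything OCLC documents.

```python
import re
from typing import Optional


def _valid_isbn10(s: str) -> bool:
    if not re.fullmatch(r"\d{9}[\dX]", s):
        return False
    # Weights run 10 down to 1; a trailing X counts as 10.
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0


def _valid_isbn13(s: str) -> bool:
    if not re.fullmatch(r"\d{13}", s):
        return False
    # Weights alternate 1, 3, 1, 3, ...
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(s))
    return total % 10 == 0


def extract_isbn(identifier: str) -> Optional[str]:
    """Return a normalized ISBN if this dc:identifier value looks like one."""
    # Keep only the leading ISBN-ish run, so a suffix like "(pbk.)" is dropped.
    m = re.match(r"\s*([0-9Xx\- ]+)", identifier)
    if not m:
        return None
    candidate = re.sub(r"[\- ]", "", m.group(1)).upper()
    if _valid_isbn10(candidate) or _valid_isbn13(candidate):
        return candidate
    return None


print(extract_isbn("0306406152 (pbk.)"))
print(extract_isbn("http://example.org/some/record"))
```

A check digit is only a weak signal: some other 10-digit identifier will pass by chance now and then, which is one more reason the response should just say what kind of identifier each one is.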

Okay, so you could say “Well, if you want all that level of detail, then use MarcXML, not the DC representation.”  Okay, I could do this here. And then I could look in the 020 for ISBN (oh wait, should I look in some weird 7xx too maybe?), and the 520 for the summary, etc.

But geez, working with Marc is such a pain. You have to know all the complicated rules and history to write complicated logic to look in one place mostly, unless X then look in Y, etc.  It gets to be a mess. It’s nice to have MarcXML available (and I’d demand it if my use case was replacing the local catalog with worldcat api), but I really would rather not have to use it for basic use cases.  Right now, as soon as I need something that’s not in the DC representation, I need to give up on the DC representation and go to MarcXML, and write my own logic to get things (title, authors, publisher) out of MARC that were in the DC representation.

So one enhancement  might be allowing me to request BOTH DC and MarcXML in the same response.  The same XML skeleton the DC framework gives me, but, if I ask for it,  there’s an “oclc:marc” element in each record that has the MarcXML in it. The beauty of XML, we can mix and match and/or embed like this no problem!

And I think that would actually be a good enhancement. But I think the “DC” format (really a more or less generic XML format) really ought to be enhanced too. Put some attribute on all those dc:descriptions telling me more info; heck, just put the original MARC field number on there for lack of any better vocabulary. With <dc:description origField="520">, now I know it’s a summary.  Put similar attributes on those dc:identifiers to disambiguate them, and strip out the “(pbk.)” garbage from the ISBNs: you already have to be doing that to index them in worldcat, so do it for display in the response too.

But okay, you know what really demonstrates the complete infeasibility of  the “okay, if you want more detail, just ask for MarcXML” approach?   Format or “document type” information, which is not provided in any useful way in “DC” response, and is entirely infeasible to extract in any useful way from MarcXML.

Reasonable Format/Document Type info completely unavailable

An item’s “type” or “format” matters a lot to end-users. Whether an item is a “book” or “video” is pretty important for a user in figuring out if it meets their needs or was what they were looking for.

Getting this kind of type/format information out of Marc is very complicated. There are historical and epistemological/ontological reasons for this — type/genre/format is a slippery, context-dependent, and subjective concept. Nonetheless, anyone trying to build a contemporary discovery system on MARC needs to deal with it.

OCLC has dealt with it: worldcat.org has its own locally constructed taxonomy of formats/types. You get a hierarchical “format” facet on the left with such items as “Book”, “eBook”, “Dissertation”, “Music”, “CD”. And every hit in the results is labelled with some string like “Music CD” which seems to have some relation to the facet taxonomy.

So OCLC’s got logic to turn MARC into a reasonable format/type taxonomy that they believe is of use to users. But the WorldCat Search API doesn’t really expose it.

The Search API does let you apply limits as to type/format. With two different indexes: “Material type limit” and “Primary document type limit”.  And the taxonomies are relatively useful. And the complete list of valid values in each of these indexes is actually provided in docs, hooray (although you have to hunt for it, and I haven’t experimented to make sure the docs are actually right).  What relation do these two indexes have to the worldcat.org ‘format’ taxonomy?  It’s unclear. I think ideally it should be the same taxonomy. Why maintain two?  Let the API benefit from any improvements to the format taxonomy in worldcat.org, make them the same.

But even the taxonomies that are exposed by the API as limits are not exposed in the response, to use as labels next to each item.  It’s important for any UI that’s going to be using limits to be able to say which documents belong to which categories, so the end-user can use that to inform their limit choices.  It’s also just plain important to be able to tell the user if a hit is a video or a book, for just about any search use case of the worldcat corpus.

What does the Search API give you instead?  Two different under-documented fields with only barely useful values:

  • There is a dc:type element, which is documented as “includes a string identifying the type of material, which (depending on the MARC record) can also include form and genre”, but which actually seems to be using the DCMI Type Vocabulary (dc:type elements aren’t required to do this, although it’s a good guess when it’s undocumented what they’re doing).  This is nice, but the DC Terms “type” vocabulary isn’t nearly granular enough for worldcat use cases:  a Book, a Serial, and a Dissertation are all just “Text”.  Worldcat chooses to mark videos as “Image” rather than the more specific “MovingImage”, so you can’t tell the difference between a video/film and an actual static flat image (of which there are a few in worldcat). AND for some reason while dc:type is usually present in the response, sometimes it’s missing.
  • There’s a completely undocumented dc:format element, which seems to have some more-or-less straight-from-the-marc unholy assemblage of  marc 300 and perhaps some other fields. At its best it can be something like “1 sound disc : digital ; 4 3/4 in”. And the ever popular “ix, 241 leaves : ill., maps ; 28 cm.”. Oh, but don’t forget if the cataloging record was from another country, you might get gems like “+ 1 Faltblatt”.   Not super useful for displaying to the user, or even for trying to heuristically determine if the thing is a video or not. Oh, and dc:format may or may not actually be present in any given record too (some few have neither a dc:type nor a dc:format).

But a Serial record vs a Book record makes a big difference to my users. And it’s nice to know when something is a Dissertation. So I end up doing some really awful hacks: If it’s got a dc:subject that ends in “–Periodicals” it’s probably a serial (although this misses Serials that don’t have LCSH, like those from Germany, among others). If it looks like a video or audio, show the user the “dc:format”, even though it’s ugly like  “1 sound disc : digital ; 4 3/4 in”, there’s no better way to tell the user what format the thing is. Etc. It’s a kind of ridiculous mess.
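To make the ridiculousness concrete, those hacks look something like this Python sketch. It assumes you have already pulled dc:type, the dc:subject strings, and dc:format out of the response yourself; the function name and the “looks like video or audio” type list are illustrative guesses of mine, not anything the API documents.

```python
import re
from typing import List, Optional

# A subject heading whose last subdivision is "Periodicals", whether the
# separator arrives as "--" or as an en/em dash.
_PERIODICALS = re.compile(r"(--|\u2013|\u2014)\s*Periodicals\.?\s*$")

# dc:type values treated as "looks like video or audio". WorldCat was seen
# marking videos as "Image"; "MovingImage" and "Sound" are other DCMI Type
# terms, included here as an assumption.
_AV_TYPES = {"Image", "MovingImage", "Sound"}


def guess_format_label(
    dc_type: Optional[str], dc_subjects: List[str], dc_format: Optional[str]
) -> str:
    """Best-effort display label for a hit, built from the DC response fields."""
    if any(_PERIODICALS.search(s) for s in dc_subjects):
        # Misses serials without LCSH-style headings.
        return "Serial (probably)"
    if dc_type in _AV_TYPES and dc_format:
        # Ugly, but the only hint about the physical format.
        return dc_format
    return dc_type or dc_format or "Unknown"


print(guess_format_label("Text", ["Economics--Periodicals"], None))
print(guess_format_label("Sound", [], "1 sound disc : digital ; 4 3/4 in"))
```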

Note that even if I wanted to look at the MarcXML, going from Marc to a useful ‘format’ vocabulary is very non-trivial.  Going  from Marc to the same vocabulary the WorldCat Search API uses for its limits is entirely infeasible; even if they documented exactly what logic they used, why would I want to re-implement it? They’re already calculating it, the API should share it with me.

This, IMO, is the highest-priority, most urgent enhancement the WorldCat Search API needs: expose the internal format/type vocabularies in the “DC” response.  Ideally the WorldCat Search API limits should exactly match the worldcat.org ones too. Either way, whatever vocabulary WorldCat Search API uses for limits, just expose the value(s) that a given item has in the results response. If there are two fields, “Material type” and “Primary document type”, okay expose both!  If the fields can be multi-valued, no problem! Just expose what you’ve got so I can use it to present a useful and consistent interface!

Various differences with worldcat.org

This API is not an API to worldcat.org functionality, as you might expect, and want. It’s a different application, set up differently, and you can’t always tell how. For instance:

The keyword index for the WorldCat Search API is not the same index used in the WorldCat.org service. It matches the index used in the cataloging and FirstSearch WorldCat versions of the database. In general the main thing not included in this index that is included in the WorldCat.org index are standard numbers other then the ISBN. So, for example, the OCLC number and ISSN are not included in this index.

http://oclc.org/developer/documentation/worldcat-search-api/tips-specific-indexes

Okay, but, um, why?   Why would it be best for the ‘keyword anywhere’ index to work one way in Worldcat.org, but another in the API?  I doubt it is; rather I suspect worldcat.org actually gets attention to evolve its functionality to best meet user needs, while the API is implemented with a different codebase that nobody pays much attention to.  It would be a lot preferable if this API just in general used the same logic and functionality as worldcat.org, implemented using the same underlying logic, so it would stay in sync as worldcat.org evolved. Clearer and easier to understand, likely to be more useful to API use case users: if it’s best for worldcat.org, odds are it’s best for the API too.

It should also be noted that not every record in worldcat.org is available from this search api. It’s not entirely clear what records are going to be suppressed from the API. Generally any “traditional cataloging type” record is in there, but other records OCLC gets directly from publishers or aggregators and puts in worldcat.org — may or may not be available through the search API. As a general rule of thumb, article-level records will not be available through the search API, but there are other categories of missing content that can be surprising (I found that worldcat.org seemed to have a lot more dissertations than the search api corpus).

Again, these differences would likely be a show-stopper if one wished to use WorldCat Local as a replacement for one’s local catalog, and wanted to use the API to build a customized interface to WorldCat Local.

Speed Issues

It needs to be pointed out that the API is not exactly a speed demon. It usually returns search results in 2-3 seconds, which is barely within the realm of acceptable. But in my testing, it was not unheard of for me to see 5, 7, or even 10 second response times. If that happens more than a tiny percentage of the time (and anecdotally it seems to me that it does), that’s completely unacceptable for using live in a user-facing application. You can’t make the user wait 7 seconds to see a page.  Is worldcat.org sometimes that slow too, or have they made sure worldcat.org is fast but let the API languish?  The API’s got to be faster, with the 90th percentile response time still well under 5 seconds; the median response time should ideally be under 1 second. (It can be done, I know because at least one competing discovery service does it, more on that at some future date.)
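If you want to hold an API to targets like those, the check itself is small. Here is a sketch in Python that takes a list of response times you have measured yourself (in seconds) and compares the median and 90th percentile against the thresholds I just named. The function name and defaults are mine, and it needs at least two samples.

```python
import statistics
from typing import Sequence


def meets_latency_targets(
    seconds: Sequence[float],
    median_target: float = 1.0,
    p90_target: float = 5.0,
) -> bool:
    """True if the median and 90th percentile are both under their targets."""
    median = statistics.median(seconds)
    # quantiles(n=10) returns the nine decile cut points; the last is p90.
    p90 = statistics.quantiles(seconds, n=10)[-1]
    return median < median_target and p90 < p90_target


# Mostly 2-3 second responses with a few slow outliers: fails on the median alone.
samples = [2.1, 2.4, 2.8, 3.0, 2.2, 2.6, 7.0, 2.3, 2.9, 10.0]
print(meets_latency_targets(samples))
```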

Documentation

It’s great that OCLC at least has documentation, and it’s available on the public web. (That I put “great” in italics and mean it shows how low our standards/expectations have gotten in our industry, alas). Even though it’s on the public web, you probably aren’t going to find it by googling for “worldcat Search API”; you wind up with three or four different variations on marketing brochures, none of which link to the docs.

Once you do find it, it’s… okay.  It basically covers what you need to know, although it’s got some errors, inconsistencies, plenty of missing info…. and the format it’s presented in on the web can be confusing.   The CSS style choices make things hard to read. Not enough internal hyperlinks cross-referencing parts of the docs.  To get from one doc page to another, you’ve got to go look for the table of contents in the right sidebar, scrolling a buncha pages down to find the “Search API” section. You often click on a link in the contents, only to find it gives you nothing but a one paragraph summary and you’ve got to go back to that hard to find sidebar list and pick again.

But hey, at least it’s there, and not protected behind a login!

Standards work for you, you don’t work for Standards

All the WorldCat Search APIs are very standards-based; you can tell the people in charge really love standards. OpenSearch, CQL, Dublin Core.

And that’s great. The problem is when you get locked into thinking all there is to it is implementing the standard. The “dc:identifier” thing is a great example. Sure, ISBN, ISSN, DOI, and (usually but not always) a URI for a record are all examples of identifiers.  So, sure, it’s completely legal and standards compliant to just throw them all in their own dc:identifier.

The problem is that there are very few if any actual use cases of an API where a giant bag of identifiers, where you have no idea what sort of identifier any given string is, is actually useful. You’ve coded to the standard, great, but you haven’t made something that actually meets actual developer use cases.  (I’m reminded, although it isn’t quite the same thing, of a certain kind of ‘requirements failure’: “Hey, we did exactly what the requirements said, so we count this project finished as a complete success! Whether it actually meets user needs is irrelevant.”)

The likelihood of coding-to-standard not meeting actual use cases is even higher with what are effectively ‘dead’ standards.  OpenSearch has nobody maintaining it as a standard, as far as we could tell a couple years ago when we looked (to talk about making it not violate the html5 spec). From when OpenSearch 1.1 was released (7 years ago!) to now, it’s been frozen, and will probably remain that way for some time.

CQL and SRU, well, they didn’t catch on quite as much as was anticipated. They were invented in a pre-Google world, and it shows.  I’m not sure there’s any client software outside of OCLC that “just works” against any old “SRU” server.

These are not vibrant evolving standards that have been tested against many use cases and expanded to meet them. They are frozen-in-time fossils.

Still, yes, it’s great that you start with OpenSearch, or CQL. But you’ve got to remember to consider your actual use cases, and extend beyond the standards (which fortunately all these standards provide for) to meet them, or provide non-standard alternatives when needed (like a way to enter a query in “worldcat.org legal” end-user-friendly form, not just CQL).


3 Responses to A Review of WorldCat Search API(s)

  1. Dorothea says:

    So, uh, any chance that we’ll see the results of your full API evaluation?

    Because that would be an EPIC thing to teach with.

  2. jrochkind says:

    Evaluation of other discovery service APIs? Yep, you definitely will eventually. I have to write it, and I have to decide how I’m going to publish it, whether as its own thing here or elsewhere, or together with another evaluation that’ll also be published probably elsewhere. Yeah, this is vague. But, yeah, you’ll see it.

  3. Alice Sneary says:

    Thanks for this thorough review of the current WorldCat Search API,
    Jonathan. You’ve touched on a number of issues that we as OCLC staff have
    also wanted to improve in terms of usability and general function for this
    particular API. The good news is, there is a team of people evaluating the
    current WorldCat Search API, and they’re looking to provide something better
    and more useful for the community in the next 12 months or so.

    So your feedback is incredibly timely and helpful for that team, to know what
    at least one member of the library developer community would find MORE useful
    with an API related to WorldCat and discovery views. Are there additional
    ideas that other people want to share? We always love hearing feedback from
    developers actually using the service–when it works and when it doesn’t–but
    now is a particularly great time to send your wish list items. Comment here or send them to devnet@oclc.org.
