custom code4lib google search engine: another late night idea June 2, 2008

Posted by jrochkind in General.

Interesting small project; maybe I’ll find time, but if someone beats me to it I’d be happy. Make a piece of software that checks the delicious feed for the tag code4lib and automatically adds everything it finds to a custom Google search engine.

Bingo, a search engine of content code4libbers find useful. People sometimes tag things they think are of interest to other code4libbers with the “code4lib” tag, so they’ll show up in the planet automatically. What if you could add something of interest to code4libbers to a search engine the same way?

The possibilities of this delicious -> custom Google search engine technique seem awfully interesting, and a ‘code4lib’ search engine would be a good demonstration that should be very useful to code4libbers.
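A rough sketch of the plumbing, in Python. It assumes the tag feed arrives as plain RSS 2.0 text (fetching it is left out), and the `Annotations`/`Annotation`/`Label` layout it writes is only my assumption about what a custom search engine would accept as a list of sites; check the real upload format before relying on it. The function names and the `code4lib` label are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit


def links_from_rss(rss_text):
    """Return the unique <link> URLs of every <item> in an RSS 2.0 feed, in feed order."""
    root = ET.fromstring(rss_text)
    seen, links = set(), []
    for item in root.iter("item"):
        url = (item.findtext("link") or "").strip()
        if url and url not in seen:
            seen.add(url)
            links.append(url)
    return links


def to_annotations(urls, label="code4lib"):
    """Build an annotations-style XML document listing each URL under one label.

    The element and attribute names here are an assumed format, not a
    confirmed one.
    """
    root = ET.Element("Annotations")
    for url in urls:
        parts = urlsplit(url)
        # Keep host + path, dropping the scheme and any query string.
        about = parts.netloc + parts.path
        annotation = ET.SubElement(root, "Annotation", about=about)
        ET.SubElement(annotation, "Label", name=label)
    return ET.tostring(root, encoding="unicode")
```

Run on a schedule, something like this would only need to remember which links it has already submitted.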

Too many ideas May 27, 2008

Posted by jrochkind in General.

Up too late, and for some reason I can’t stop thinking of ideas for improving my interfaces. Ideas I don’t have time to implement, since I first have to get my interfaces barely decent before I can add this stuff! So, blogging.

LibX improvement ideas (some of which might already be in LibX without me knowing it; I haven’t had time to explore LibX as much as I’d like):

* SFX has an interface to return a very quick response indicating ‘yes’ or ‘no’ to full text present for ISSN and date. How about LibX finds any OpenURLs on a page (COinS or an actual full OpenURL to your link resolver, recognized because LibX already knows your resolver base url), asks SFX for coverage on each one, and indicates to the user whether full text is present or not. SFX doesn’t do ISBNs, but I could build my own ISBN index (as part of Umlaut).

* EZProxy. For every single page looked at, LibX should check the EZProxy API to see if it’s proxyable. If it is, give the user a highlighted message: “Off campus and no access? Click here to proxy please.”
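The “find any OpenURLs on a page” half of the SFX idea is simple enough to sketch. COinS embeds an OpenURL ContextObject in the `title` attribute of a `<span class="Z3988">`, so a minimal Python version (LibX itself is a browser extension, so this is only an illustration of the logic) could look like the following. The class and function names are hypothetical, and the actual SFX coverage lookup is deliberately left out since I’m not going to guess at its request format here.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qs


class CoinsParser(HTMLParser):
    """Collect the ContextObject of every COinS span found in an HTML page."""

    def __init__(self):
        super().__init__()
        self.context_objects = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "span" and "Z3988" in classes and attributes.get("title"):
            # The title holds key/encoded-value pairs, like a URL query string.
            self.context_objects.append(parse_qs(attributes["title"]))


def find_coins(html):
    """Return one dict of key -> list of values per COinS span."""
    parser = CoinsParser()
    parser.feed(html)
    return parser.context_objects


def issn_and_date(context_object):
    """Pull out the two fields a yes/no coverage check would need."""
    issn = context_object.get("rft.issn", [None])[0]
    date = context_object.get("rft.date", [None])[0]
    return issn, date
```

Each `(issn, date)` pair would then be what gets sent to the resolver for the quick yes/no answer.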

OPAC:

On every search (or appropriate searches), in a sidebar or top bar, offer the first few hits from my federated search product; and the first few hits from OCLC worldcat.  Similar to how google offers Google Books and Scholar content sometimes on an ordinary google search. APIs for both can let me show the first few hits right on the page. “Click here to see more results like this”. The heading/note for the Worldcat search needs to somehow make it clear that Worldcat can be used to find more hits in a topical type search, OR to find a known item not found in our catalog to make an ILL request. (If only I could figure out how to make worldcat.org reliably show my link resolver link on hits! Maybe I need to get LibX to do that too).

[Sadly, I don't think I can show federated search results to an unauthenticated user. But I think I can tell if a user is already signed into our Single Sign On using Shibboleth, and show the results to them if they are. And offer them a login button to see results if they are not. If they're on campus, I can show them results straightaway.]

On 0-hit searches give even more prominence to Worldcat for finding the known item or topics you wanted.

On a 0-hit search that was for ISBN (often forwarded by LibX etc., occasionally typed in manually), offer a link to make an ILL request for that item.  Pre-fill the ILL request with details, using Worldcat API or Bowker ISBN API to look up complete metadata for the ISBN given. Maybe better show this metadata to the user with the ILL button, before they’ve clicked on it, to confirm this is what they wanted.  This may be applicable to other types of searches that are 0-hit, where there is reason to think it was a known item search, and enough clues in the search terms to try and identify the item.

Naturally, a 0 hit search on ISBN (or other known item for which an isbn or other identifier can be found) should use xISBN/thingISBN to offer alternate versions that may be in our catalog too.
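All of the 0-hit ISBN ideas above start with deciding that the query really was an ISBN. A minimal Python sketch of that first step, using the standard ISBN-10 and ISBN-13 check digit rules (the function name is made up, and the metadata lookup and ILL form are not shown):

```python
import re


def normalize_isbn(query):
    """Return the query as a bare ISBN-10 or ISBN-13 if its check digit is valid, else None."""
    candidate = re.sub(r"[\s-]", "", query).upper()
    if re.fullmatch(r"\d{9}[\dX]", candidate):
        # ISBN-10: weights 10 down to 1, with X standing for 10; sum must be divisible by 11.
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch))
            for i, ch in enumerate(candidate)
        )
        return candidate if total % 11 == 0 else None
    if re.fullmatch(r"\d{13}", candidate):
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; sum must be divisible by 10.
        total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(candidate))
        return candidate if total % 10 == 0 else None
    return None
```

A 0-hit search where this returns something other than `None` is a good candidate for the ILL offer and the alternate-versions lookup.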

Man, we could make this stuff work so much better for users, using pretty simple stuff, no serious R&D needed, just some time spent doing it.

Academic freedom and privacy May 26, 2008

Posted by jrochkind in General.

An important reminder that this stuff really does matter, really seriously. Do you know precisely what personally identifiable information your library’s systems keep on patron information consumption? I don’t. This story (while the information was not, so far as I know, from library systems) reminds me again that this really does matter, and patron privacy should be non-negotiable, not even for enhanced recommendation features etc. Every library should have regular privacy audits of its systems to be aware of exactly what personally identifiable logs are being kept by what system, and to make sure that they are being kept intentionally by choice, not accidentally, and only as far as is necessary to conduct business. Figuring this out for our chaotic ecology of often proprietary closed-source software is not trivial, but it matters. It should go without saying that every library should have patron confidentiality policies that, to the extent allowable by law, prohibit library staff from reporting on patron activities to law enforcement.

http://www.boingboing.net/2008/05/26/uk-set-to-deport-mas.html

“UK set to deport Master’s student whose Master’s degree research led him to look up Al Qaeda info - ratted out by Nottingham University

Posted by Cory Doctorow, May 26, 2008 7:19 PM | permalink

Academics at the UK’s Nottingham University were arrested as terrorists for downloading Al Qaeda documents from a US government server in the course of research into a Master’s degree covering terrorist tactics. The two UK-born profs were released, but the student faces deportation to Algeria under the Terrorism Act, where he believes he will be tortured. The university, which encouraged its staffers to rat out people they thought were involved in researching terrorism, refuses to acknowledge that anything is wrong with any of this.”

The original article:
http://education.guardian.co.uk/higher/news/story/0,,2282045,00.html

What I wish I could do with Internet Archive May 23, 2008

Posted by jrochkind in General.

To explain my current frustration with Internet Archive a bit more, here’s what I want to do.

What we want to do

I have library interfaces that display detail pages for known items: my catalog, and my link resolver. When my known item is a book, I’d like to do a search on the IA corpus, see if a publicly available digitized full text is available, and tell my users about it. (Check for a publicly available audio book version too while we’re at it, why not.) Maybe search on identifiers if available, but I know most of the time I’ll be searching on title/author keyword, and from experimenting against IndexData’s OCA index, that works Good Enough™, so good. This would be a really great feature. Users shouldn’t have to know to go to the IA search to find IA content, and shouldn’t have to do a second search; I want to put it right in my own searches.

Brewster suggests an access method

So how do I do this? At code4lib after Brewster Kahle’s keynote, I asked him this, and he sent me a nifty url to a search of IA that would let me limit to collections (like just full text ones, perhaps), and actually returned some XML with a very basic set of metadata on the items (title, collection it came from, format, a url to get associated assets including more metadata). Okay, I said, I wish that was actually documented somewhere, but that’s good enough that’ll work, so good!

The Case of the Disappearing Interface

So, several months later, I (or actually Jason Ronallo, interning for me) finally get around to writing the functionality that’s going to use this access, and the access has been removed! First of all, if I had already spent time writing to this interface (at Brewster’s recommendation!) only to find it disappear, I’d be really mad, and that’s an indication of a lack of proper communication between IA and developers like me. So I guess it’s fortunate that I hadn’t, and that it disappeared before I started. But still, how do I do what I need now?

It ain’t in the OL

Perhaps IA thinks that the OpenLibrary APIs are sufficient, and that’s why they got rid of their other search. Indications are that the IA is “getting” to some extent with OpenLibrary that they actually need to understand and provide for machine-access needs, if they want developers like us to interface with it. But the OpenLibrary is not good enough for me, because the OpenLibrary database doesn’t in fact include all Internet Archive hosted free digital content! It is only intended to include Internet Archive content that has MARC, but Alexis Rossi of the OpenLibrary project estimates that there are upwards of 100,000 (~25% of entire corpus) digital texts hosted by the Internet Archive which do not have MARC, and thus aren’t included in the OpenLibrary db and there are no plans to include them. Well, gee, I want to search over these too! Not to mention audio book versions–I don’t know if any of these are included in the OpenLibrary database. Why would I want to leave them out of my search to put them in front of my users just because they don’t have MARC?

So I’m (or Jason is) basically reduced to a screen scrape of the Internet Archive HTML search results: an HTML page which doesn’t even have quite sufficient metadata to do everything I could do with the previous XML hit list, even once I do successfully screen scrape. And I don’t think this ‘advanced’ search lets me limit to a set of _multiple_ collections like the last one did, so I’m going to have to do several searches to search several collections (putting more load on the IA servers while we’re at it). But that’s what I’m going to do, because I see little other choice. And I hate screen scraping. It is a testament to how valuable I believe the IA content is to my users that I’m even willing to go that route; I wouldn’t do it for just anyone.

A little attention and support could go a long way

So this is an example of why I get the feeling that the Internet Archive is uninterested in developers like me and our attempts to expose Internet Archive content to our users–even as they complain, mystified, about why more developers aren’t hooking into IA content! It’s no mystery to me. I don’t think anyone at the IA even realizes what a problem they caused by eliminating this XML search, because I don’t think anyone is even paying attention to what people like us are trying to do–while complaining that we’re not doing more.

And, yeah, sure, I could harvest the whole darn IA with OAI-PMH, find just the pieces I want in it (full text and possibly audio book collections), and index them locally. Maybe eventually I’ll do that. IA does give me permission. But if IA actually gave me a machine searchable interface to their own indexes so I don’t have to figure out how to do that, it’d be a lot faster, easier, and more locally maintainable for me. I’d really rather not run my own local index of IA content. Experience with the IndexData OpenContent OCA index shows that such things really do require attention and effort to keep them up and working (that OCA index keeps breaking, going down, and being heavily out of date), attention and effort that are in short supply where I am.
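For what it’s worth, the harvesting loop itself is the easy part. Here is a minimal Python sketch of paging through `ListRecords` with resumption tokens, which is standard OAI-PMH 2.0. The base URL and set name are whatever the repository actually exposes (I’m not asserting IA’s here), `fetch` is a stand-in for whatever HTTP client you use, and the function names are mine.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, resumption_token=None):
    """Build a ListRecords request URL for the first page or a follow-up page."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces all the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, next resumption token or None) for one response."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return identifiers, token


def harvest(fetch, base_url, **kwargs):
    """Yield every record identifier, following resumption tokens until they run out.

    `fetch` is any callable that takes a URL and returns the response body as text.
    """
    token = None
    while True:
        url = list_records_url(base_url, resumption_token=token, **kwargs)
        identifiers, token = parse_page(fetch(url))
        yield from identifiers
        if not token:
            break
```

The hard part is everything around it: retries, incremental updates, picking out the collections I care about, and keeping the local index healthy afterwards.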

If IA wants to encourage people to write hooks into their system, a little bit of support on their end (which would require caring about what we are trying to do in the first place, instead of just telling us what they think we should be trying to do) would go a long way–you get enough infrastructure there for us, we can and want to put all sorts of cool things on top of it.

But you’ve got to get enough actually supported infrastructure there to make it easy enough to do that we can find time to do it, because we’re busy and underfunded, like everyone. You don’t need to write our applications for us; we really will write open source apps on top of the services you provide, but you’ve got to give us those services. Creating our own index on top of your OAI-PMH feed (which, incidentally, is also, according to colleagues who have tried to use it, a bear to deal with that often doesn’t behave as expected) is apparently just a bit too much for us out here in library developer world to realistically accomplish successfully with our resources.

Update 26 May: The XML search is back! Thanks Internet Archive!  Thanks Jason for finding it somehow! I haven’t been checking my email or the lists over the long weekend, so I’m not sure how Jason discovered it was back, or if it’s been advertised, or if it’s a response to this complaint, or what. But thanks!

GBS/OCLC May 23, 2008

Posted by jrochkind in General.

Something retrospectively obvious just occurred to me about the Google Book Search/OCLC partnership. If it means what I think it means, does that mean I don’t need to worry about the GBS API anymore (and its restriction to client-side javascript only access, bah!), and can just search the Worldcat API instead?

I guess it probably doesn’t mean that _every_ GBS book will be in Worldcat though, so not quite. But it might be one option. Wonder how long until most/all full-text-digitized GBS books are in Worldcat?

Wonder if there’s any way to limit a Worldcat API search to only GBS books, or to only records which represent publicly accessible full-text digitizations?

I think it’s kind of sad that Worldcat is apparently going to include GBS digitized books before it includes Internet Archive hosted books (which there are still no public plans for). I blame both parties. I’ve got nothing against GBS books being in Worldcat. I am suspicious of OCLC being more interested in wheeling-and-dealing to get exclusive access to things from Google in return for access to things I don’t think it’s OCLC’s place to control access to in the first place, instead of prioritizing increased access to IA’s already public content, which could use better access. I also remain frustrated with IA’s lack of attention to machine accessibility of its content and metadata, or interest in meeting the needs of those who would like to provide hooks into it. Although for all I know IA did approach OCLC for a partnership and was rebuffed. I am frustrated with both organizations’ lack of transparency. Google’s lack of transparency I am not frustrated with, only because it’s what I expect from a for-profit megacorp like Google.

The Internet Archive has in fact been trying to get OCLC to give their members ‘permission’ (that I don’t believe is OCLC’s to give or withhold in the first place, but anyway) to share all their records with IA. Which OCLC has markedly withheld. Now, I don’t think OCLC has given members blanket ‘permission’ to share any held records with Google either, just records for titles scanned for GBS. (Unless OCLC is really going to share their entire corpus? THAT would make me angry: not that they’re sharing it, but that they are privileging Google over competitors and alternatives in this sharing.) I believe the IA is now generally getting MARC records from libraries for digitized books that go into IA too. So that’s fine. But still. The IA, unlike Google, doesn’t have anything to offer in return: because the IA gives anyone permission to use most of what it has, it can’t barter permission or access that would otherwise be withheld. So, good for IA. (I’m still frustrated with IA’s lack of attention to meeting the needs of those who want to write software that hooks into their corpus. But at least it’s not lack of permission that’s the barrier.)

Open Access Geographic Data May 21, 2008

Posted by jrochkind in General.

So I just stumbled across a great set of projects I didn’t even know about.

The most visible is http://www.openstreetmap.org/ , a very nice free mapping system à la Google Maps or Yahoo Maps etc., but with all the data user-contributed and open access (it doesn’t have driving directions though). It’s out of the UK, but had pretty complete street maps for Baltimore MD at least; I didn’t check more.

But that’s not what got me excited. What got me excited is http://www.geonames.org/, an open access gazetteer/geocoder/etc. Check it out: enter a place name, get latitude and longitude. Also in their db are hierarchical relationships between place names (i.e., toponyms) (locations in cities in states in countries), alternate language names for a place, and a bunch more.

And they’ve got a great set of APIs, and also their entire dataset is freely downloadable if you need to do something the API is incapable of. Enjoy the open access!

This is a great additional option alongside commercial companies’ geocoding APIs, like Yahoo’s and Google’s.

It’s got a pretty good free text query that determines ranked matching places–that is, place name geocoding. As you can guess, I immediately tried it out with a few LCSH geographic headings. It works pretty well, although not perfectly.

Remove punctuation from “Detroit (Mich.)”, turning it into “Detroit Mich”, and the first hit is exactly what you want. “Illinois and Michigan Canal”? It’s in there, man! And is matched by that string. Ah, but “Illinois and Michigan Canal (Ill.)”, another heading for the same place in my catalog, does not hit, not even if you remove the punctuation, not even if you change our weird “Ill.” abbreviation to the postal code IL. So maybe a good heuristic is to remove the parenthetical qualifier and try again if at first you don’t succeed.
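That heuristic is small enough to sketch in Python. The `search` argument is a placeholder for whatever actually queries the gazetteer and returns a ranked list of hits (I’m not reproducing the geonames API here), and both function names are just for illustration.

```python
import re


def query_candidates(heading):
    """Return progressively looser free-text queries for one LCSH geographic heading."""
    def strip_punctuation(text):
        # Replace anything that is not a letter, digit, or whitespace, then collapse spaces.
        return " ".join(re.sub(r"[^\w\s]", " ", text).split())

    candidates = [strip_punctuation(heading)]
    # Second attempt: drop a trailing parenthetical qualifier such as "(Ill.)".
    without_qualifier = re.sub(r"\s*\([^)]*\)\s*$", "", heading)
    if without_qualifier != heading:
        candidates.append(strip_punctuation(without_qualifier))
    return candidates


def first_match(heading, search):
    """Try each candidate query in turn and return (query, top hit), or None if all miss."""
    for query in query_candidates(heading):
        hits = search(query)
        if hits:
            return query, hits[0]
    return None
```

So “Detroit (Mich.)” is tried as “Detroit Mich” and then “Detroit”, while “Illinois and Michigan Canal (Ill.)” falls back to the bare canal name, which is the string that did match.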

So anyhow, I can imagine all sorts of cool interfaces we can do with geocoded LCSH (and possibly other place vocabularies, if we have them). In the “place” facet of your fancy facetted search, have a button to show all the places on a map. But wait, there’s more! Let the user draw a bounding box on the map to limit the search to just the places within it. We could do that! And lots more.
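The bounding box limit is the least exotic part once headings carry coordinates. A minimal Python sketch, assuming each place has already been geocoded to a latitude and longitude; the `Place` and `BoundingBox` types are invented for the example, and a real system would more likely push this filter into the search index than do it in application code.

```python
from typing import NamedTuple


class Place(NamedTuple):
    name: str
    lat: float
    lng: float


class BoundingBox(NamedTuple):
    south: float
    west: float
    north: float
    east: float

    def contains(self, place):
        """True if the place falls inside the box, edges included."""
        in_lat = self.south <= place.lat <= self.north
        if self.west <= self.east:
            in_lng = self.west <= place.lng <= self.east
        else:
            # The box crosses the 180th meridian, so longitude wraps around.
            in_lng = place.lng >= self.west or place.lng <= self.east
        return in_lat and in_lng


def places_in_box(places, box):
    """Keep only the places inside the box the user drew."""
    return [place for place in places if box.contains(place)]
```

The matching headings would then be turned back into a facet limit on the search.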

I wonder if Ed Summers would feel like using the geonames.org db to add latitude and longitude to all of the LCSH geographic terms in his SKOS vocabulary? It’s do-able. And would be SUPER cool.

One frustrating thing for me in general is that I can think of all sorts of cool experimental features, but we don’t even have the basics in place yet. I can’t add that to my “cool new facetted discovery tool” until I _have_ a cool new facetted discovery tool. We’ve got to catch up to “barely decent what is expected in 2008” before we can add actual forward-looking things. This was a frustration with my Umlaut link resolver too: I did spend a lot of time getting it up and running as a really good link resolver platform, but it doesn’t have very many really forward-looking features yet. But it’s a great platform for them, and it should be easy to add them now that the infrastructure is there. I just don’t have the time; I’m busy getting the next set of services (metasearch) up to “barely decent what is expected in 2008”. Although for Umlaut, I’m hoping Jason Ronallo, who is doing an internship for his school developing Umlaut this summer, can crank out some impressive stuff.

In other news May 16, 2008

Posted by jrochkind in General.

I have up to now rigorously avoided posting any content to this blog that was personal, political, or not professionally relevant. But sometimes, I guess I can’t resist it anymore. I will, however, refrain from adding any commentary to the link to this article, which I cannot resist publicizing:

Some Detainees Are Drugged For Deportation
Immigrants Sedated Without Medical Reason
by Amy Goldstein and Dana Priest | Washington Post Staff Writers
Page A1; May 14, 2008

The U.S. government has injected hundreds of foreigners it has deported with dangerous psychotropic drugs against their will to keep them sedated during the trip back to their home country, according to medical records, internal documents and interviews with people who have been drugged.

http://www.washingtonpost.com/wp-srv/nation/specials/immigration/cwc_d4p1.html

Identity woes: Google docs enterprise accounts? February 19, 2008

Posted by jrochkind in General.

So rsinger creates some slides in Google Docs. He tries to send me an invitation to view those slides.

I get this login, which I’ve never seen before:

[Screenshot: gdocs-jh.jpg]

“Welcome to Johns Hopkins Universtiy documents and spreadsheets program, powered by Google.”

There’s some kind of Google Docs enterprise account? (Which spells “Universtiy” wrong?) (I don’t know if it’s recognizing me by IP, or by the fact that the email address I have associated with my Google accounts, and which rsinger tried to invite, is @jhu.edu.) But here’s the problem: it won’t recognize any of my existing Google accounts. None of them. So I note that it insists my username is “@jhu.edu”. I try my JHU Single Sign-On. No dice either. (And now I feel like an idiot, because I just got ‘phished’ by Google; our SSO credentials should never be entered anywhere except the enterprise sign-on form.) It offers me the ability to create a new account, which I’m scared to do, because having multiple Google accounts is _exactly_ what has led to this kind of headache for me in the past. (Google tools are continually torturing me with identity issues like this; this is not new.)

Ah, but look, it offers a “forgot your username or password?” link (which sometimes showed up for me and sometimes didn’t). Maybe that will at least tell me what the heck account it’s looking for, and maybe even how to recover my credentials:

[Screenshot: gdocs-jh2.jpg]

No such luck. Sometimes I hate computers. Anyone know what the heck is going on, and how I can look at rsinger’s slides? Should I complain to someone at my central IT about this? (What the heck is up with the misspelling of university, man!) (At first I was worried they were taking this from my Google Scholar link resolver registration without telling me, since that kind of misspelling is the kind of thing I’d do! But I checked my Google Scholar institution registration: no misspelling. So it’s not from my stuff.)

This also shows the dangers of relying so much on a product which offers you no tech support whatsoever (not even an open source community which knows how the product works). There’s basically nothing I can do here. How frustrating!

For the record February 15, 2008

Posted by jrochkind in General.

“Library 2.0 Gang” is the name that Talis has given to their regular discussion podcast, or maybe just a name that Richard used for that particular episode? I was invited to be on the show once, but I am not in fact part of any ‘library 2.0 gang’, and certainly have not proclaimed myself to be! I don’t think there is any ‘library 2.0 gang’ outside of that one discussion. I think the phrase ‘library 2.0’ is kind of a silly phrase, and am a bit embarrassed to have it written that I’ve proclaimed myself to be a part of such a ‘gang’. Oh well. Let’s set the record straight though and say that if there was any proclaiming to be part of a ‘library 2.0 gang’, it was certainly not ‘self-’proclaiming.

OAISter -> points to plenty of non open access stuff January 25, 2008

Posted by jrochkind in General, Link Resolvers, Practice, open access.

So I had been operating under the incorrect assumption that OAISter only aggregated feeds which claimed to be of open access materials.

After embarrassingly sending them a letter (and cc’ing code4lib) asking for clarification I noticed their collection development policy page. (Embarrassing because I should have checked first).

http://www.oaister.org/restricted.html

  • We harvest and retain all records that point to digital resources.
  • This includes freely-available and restricted-access digital resources.
