Tagging and motivation in library catalogs? May 10, 2008

Posted by jrochkind in Practice, Theory, business, cataloging.
2 comments

Eh, this comment was long enough that I might as well post it here too, revised and expanded a bit. (I’ve been flagging on the blogging lately.) Karen Schneider thinks about “tagging in a workflow context”:

Tagging in library catalogs hasn’t worked yet for a number of reasons…

Karen goes on to discuss much of the ‘when’ of tagging, but I still think the ‘why’ of tagging is more relevant. Why would a user spend their valuable time adding tags to books in your library catalog?

I think the vast majority of successful tagging happens when users tag to aid their OWN workflow, generally to keep track of things. You tag on delicious to keep track of your bookmarks. You tag on LibraryThing to organize your collections. The most successful tagging isn’t done to help _other_ people find things, but to keep track of things yourself. At least not at first, not the tagging that builds the successful tag ecology. In most cases of a successful tagging community where people do tag to help others find things, I’d suggest it is because it somehow benefits them personally to help people find things. Such as, maybe, tagging your blog posts on wordpress.com because you want others to find your blog posts: still a personal benefit.

A successful tag ecology is generally built on tagging actions that serve very personal interests, interests which do not need the successful tagging ecology on top of them and which are served even if you are the only one who is tagging. The successful tagging ecology which builds out of it, and which goes on to provide collective benefit that was not the original intent of the taggers, is an epiphenomenon.

Amazon might be a notable exception to this hypothesis, perhaps because it was already such a universally used service before it added tagging (unlike our library catalogs). I would be interested to understand what motivates users to tag in Amazon. Anyone know of anyone who’s looked into this? It’s also possible that if Amazon’s tags are less useful, it is in fact because of this lack of personal benefit from tagging.

So what personal benefit can a user get from tagging in a library catalog? If we provided better ‘saved records’ features, perhaps: keeping track of books you’ve checked out, books you might want to check out, etc. But I’m not sure if our users actually USE our catalogs enough to find this useful, no matter how good a ‘saved records’ feature we provide. In an academic setting, items from the catalog no longer necessarily make up a majority of a user’s research space.

To me, that suggests a question: can we capture tags from somewhere else? My users export items to RefWorks. Does RefWorks allow tagging yet? If it did, is there a way to export (really re-import) these tags BACK to the catalog when a user tags something? But even if so, it would be better if RefWorks somehow magically aggregated tags from _different_ catalogs for the same work. But that relies on identifier issues we haven’t solved yet. If our catalogs provided persistent URLs (which they usually don’t, which is a tragedy), users COULD tag in delicious if they wanted to. Is there a way to scan delicious for any bookmarks whose URL points into your catalog, and import their tags back in?
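The re-import idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the bookmarks are assumed to have been fetched already by some means and are represented as plain hashes, the catalog is assumed to have persistent record URLs under a made-up prefix, and no real delicious API is called. `CATALOG_PREFIX` and `tags_by_record` are hypothetical names.

```ruby
# Hypothetical persistent-URL prefix for catalog records (an assumption).
CATALOG_PREFIX = "http://catalog.example.edu/record/"

# bookmarks: array of hashes like { url: "...", tags: ["a", "b"] },
# however they were obtained. Returns { record_id => { tag => count } }.
def tags_by_record(bookmarks, prefix = CATALOG_PREFIX)
  result = Hash.new { |hash, key| hash[key] = Hash.new(0) }
  bookmarks.each do |bookmark|
    url = bookmark[:url].to_s
    next unless url.start_with?(prefix)   # skip bookmarks outside the catalog
    # Whatever follows the prefix, up to any "?", "#" or "/", is the record id.
    record_id = url[prefix.length..-1].split(/[?#\/]/).first
    next if record_id.nil? || record_id.empty?
    bookmark[:tags].each { |tag| result[record_id][tag.downcase] += 1 }
  end
  result
end
```

Counting tags per record, rather than just listing them, keeps the door open for showing only tags that more than one person has applied.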

In addition to organizing one’s research and books/items of interest, are there other reasons it would serve a patron’s interest to tag, other things they could get out of it? A professor might tag books of interest for their students, perhaps (not that most professors are looking for more technological things to spend time on to help students, but some are). And librarians themselves might tag things with non-controlled-vocabulary topic areas they know would be of use to a particular class or program or department, using terms familiar to those classes or programs or departments. Can anyone think of any other reasons tagging could be of benefit to a user? (Not whether a successful tagging ecology would be of collective benefit, but benefits an individual user can get from assigning tags in a library catalog.)

Worldcat covers a much larger share of my academic users’ research universe than my own catalog. And worldcat has solved the “aggregating different copies of this work from different libraries” problem to some extent. Which is why it would make so much sense for worldcat to offer a tagging service, one which can be easily incorporated into your own local catalog for both assigning and displaying tags (if not for searching), à la LibraryThing. It is astounding to me that OCLC hasn’t provided this yet. It seems to be very ‘low hanging fruit’ (a tagging interface on worldcat.org with a good API is not rocket science) that is worth a try.

More on open access discoverability May 8, 2008

Posted by jrochkind in Practice, open access, programming.
4 comments

This is worth pulling out into a post of its own. Thanks to Dorothea Salo for the comments on the post where I broached this issue sort of in passing. Good to know that I’m indeed not alone in worrying about this stuff.

But there are actually a few different (but related) issues Dorothea has identified here, some of which aren’t a problem for my projects at all, others of which are. Let’s analyze them out:

1. Some faculty are unwilling to publish open access.

This might be a problem, but despite this problem there’s plenty of free-web publicly accessible scholarly content available. (I use this phrase because the specific licensing might be unclear, but an unauthenticated user can get it on the web.) I’m thinking specifically about so-called preprint/postprint publicly accessible versions of articles that also appear in not-open-access journals. There’s lots of it. This is in fact what motivates my desires in the first place.

2. Some repository software doesn’t allow control of access to the level desired by repository managers.

This might be a problem too, but despite it, most supposed “open access” repositories do contain material that the repository does not in fact make available to the general unauthenticated public! So the software might not be flexible enough, but it is often restricting access to contents in it anyway, and including metadata for those restricted items in the general OAI-PMH feed, without any predictable machine-readable way to tell that it is in fact restricted content.

So it’s in fact the ability of many repositories now to restrict content that brings me to my issue:

3. I have no way to identify the universe of actually publicly accessible ‘open access’ scholarly content.

Even if I created an aggregate index of OAI-PMH feeds from all “open access” repositories, it would include content which is not viewable by an unauthenticated user! What I want to do in my software is this: I have a known-item citation, and I want to tell the user if there’s a publicly viewable copy of it online. I have no way to find/identify such a copy though! I have no way to weed out the stuff that isn’t really publicly accessible. I don’t want to send the user to something they can’t access, and some repositories listed in DOAR actually have the majority of their items (in the OAI-PMH feed) not available to the unauthenticated off-campus user!
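Absent any machine-readable flag in the feed, about the only fallback is to probe each item from an off-campus, unauthenticated machine and guess from the HTTP response. A minimal sketch of such a guess follows; the heuristics and the names `LOGIN_HINTS` and `classify_access` are assumptions made for illustration, not anything a repository or OAI-PMH provides. Note that a 200 response can still be a login page, so `:open` here really means "not obviously restricted".

```ruby
# Path fragments that often indicate a login wall. Purely a guess; real
# repositories vary, so this list is an assumption to be tuned locally.
LOGIN_HINTS = %w[login signin shibboleth ezproxy].freeze

# status:   integer HTTP status from an unauthenticated, off-campus request.
# location: value of the Location header on a redirect, if any.
# Returns :open, :restricted, or :unknown.
def classify_access(status, location = nil)
  case status
  when 200 then :open
  when 401, 403 then :restricted
  when 300..399
    target = location.to_s.downcase
    # A redirect to something that looks like a login page counts as restricted.
    LOGIN_HINTS.any? { |hint| target.include?(hint) } ? :restricted : :unknown
  else :unknown
  end
end
```

Probing every item this way is slow and fragile, which is exactly why a predictable flag in the metadata feed would be so much better.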

So 1 and 2 might be issues in general, but aren’t what’s providing the roadblock for me. 3 is. There are a couple of other issues worth noting: one that is an inconvenience (but not a roadblock) for my project, and one that is not.

4. Difficulty of identifying articles in repositories matching a citation.

When I experimentally tried doing a search against OAIster (before I realized that OAIster didn’t even limit itself to so-called open access repositories, and before I realized that even open access repositories weren’t), I had to do a search based just on title and author keywords. It would be better if I could search based on an identifier (DOI or PMID) when present, or based on structured publication data for the actual publication of the pre/postprint: ISSN, volume, issue, page number. But these things aren’t available in the OAI-PMH feed, and in fact probably aren’t even in most repositories’ metadata. Most repository metadata doesn’t try to connect a pre- or postprint to the actual published version in any way.

This is annoying, but I found that author/title keyword search worked well enough to be useful even without this, so it wasn’t a roadblock.
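The matching order described here (identifier when both sides have one, otherwise title and author keywords) might look something like the sketch below. The field names `:doi`, `:title` and `:author`, the 0.8 threshold, and the method names are all assumptions for illustration, not what any repository or OAIster actually exposes.

```ruby
# Lowercase, strip punctuation, and split into a list of distinct keywords.
# (ASCII-only for brevity; accented characters would need more care.)
def keywords(text)
  text.to_s.downcase.gsub(/[^a-z0-9\s]/, " ").split.uniq
end

# citation and record are hashes with optional :doi, :title, :author keys
# (hypothetical field names). Returns true when they plausibly match.
def matches_citation?(citation, record, threshold = 0.8)
  # An identifier on both sides settles the question outright.
  if citation[:doi] && record[:doi]
    return citation[:doi].downcase == record[:doi].downcase
  end

  wanted = keywords(citation[:title])
  return false if wanted.empty?
  overlap = (wanted & keywords(record[:title])).length.to_f / wanted.length

  # Require at least one shared author keyword when the citation names one.
  authors = keywords(citation[:author])
  author_ok = authors.empty? || (authors & keywords(record[:author])).any?

  overlap >= threshold && author_ok
end
```

The threshold trades false positives against misses; with preprints whose titles changed before publication, a lower value may be needed.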

5. Might be publicly accessible, but is it open access?

This gets at what the SPARC/DOAJ initiative is trying to solve. Okay, I’m a reader, I can look at this article online on the free-web, but what am I allowed to do with it? Am I allowed to reproduce it? This matters to readers and is a real issue, but doesn’t in fact matter to my project. All I care about is whether I can show them the full text on the public web. Once I can do that, I can worry about helping them understand the license and their access rights, but first I need to help them discover the article in the first place!

“Freedom Summer of Code” May 8, 2008

Posted by jrochkind in open source, programming.

Freedom Summer of Code is a summer-of-code-style distributed collaboration for technology projects benefiting radical/progressive movements. Exciting idea.

http://www.fsdaily.com/Community/Announcing_Freedom_Summer_of_Code

(en) Riseup Labs is excited to announce the Freedom Summer of Code! We aim to advance critical movement technology projects and tools that benefit a wide-variety of radical social justice organizations and movements.

[...]

Modeled after the successful Google Summer of Code, the Freedom Summer of Code adds a radical social justice twist. We will be working with select tech activist organizations to generate interesting ideas as well as help people develop several projects over the next three months.

The [Freedom Summer of Code -> https://we.riseup.net/fsoc] aims to advance critical movement technology projects and tools that benefit a wide-variety of radical social justice organizations and movements; inspire developers to become more interested in directly participating in social-justice tech organizations; contributes back, for the benefit of all, to the free software world which sustains us while simultaneously honoring individual’s labor; increases the social ownership and democratic control over information, ideas, technology, and the means of communication; empowers organizations and individuals to use technology in struggles for liberation. We are developing software that is geared specifically to the needs of network organizing and democratic collaboration, providing new services that greatly enhance your security and privacy.

Consider this a call-out!
—————————-

To get started, think about how you would like to participate. Regardless of your technical skills, we need your help and have numerous ways to plug into FSoC:

* We want your proposals, dream big! Submit any and all politically important technology project proposals for and by the radical tech community! Individuals, or organizations, can submit ideas for what they would like to see done during FSoC. We will collect these proposals and put them online for potential programmers to check out.

* Interested organizations should sign up: we want your organization to join FSoC, to not just submit project ideas, but also be an organizational contact person who can act as a facilitator if your project/organization is chosen.

* Do you want to participate? Come apply to the program, submit an individual project proposal, and when the time comes, you can pick projects that you are interested in working on.

* We also need facilitators, you are encouraged to apply to help individual participants through the process.

To learn about what kinds of things we are looking for, how to submit a proposal, to sign up as participating organization/facilitator or to apply as a participant in the first FSoC, come visit the [Freedom Summer of Code site -> https://we.riseup.net/fsoc].

[...]

Google feature changes; open access discoverability May 7, 2008

Posted by jrochkind in Practice, open access, programming.
3 comments

So, I’ve found out about a couple new things from Google I hadn’t known about. (Google is such a prominent player in our space, we need to keep up with what’s going on there so we know how to exploit it to maximum effect. I need to remember to go explore google’s interfaces and documentation more regularly to see changes).  1.  Google search API now allows server-side access. 2. Google search allows limit on usage license.  And both these things got me started about open access discoverability again.

1. Google API allows server-side access!

Thanks to Kent Fitch for alerting us on the code4lib listserv.

http://code.google.com/apis/ajaxsearch/documentation/#fonje

“For Flash developers, and those developers that have a need to access the AJAX Search API from other Non-Javascript environments, the API exposes a simple RESTful interface….

“An area to pay special attention to relates to correctly identifying yourself in your requests. Applications MUST always include a valid and accurate http referer header in their requests. In addition, we ask, but do not require, that each request contains a valid API Key.”

This is huge. I’ve complained before about how it was difficult to incorporate Google features into my own service-oriented software in a maintainable way when only javascript AJAX functions were allowed.
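For a server-side caller, the two requirements in the quoted documentation (a Referer header always, an API key optionally) amount to something like the sketch below. The endpoint URL and the `v`, `q` and `key` parameter names are written from memory of that documentation and should be treated as assumptions to check against it; `build_search_request` is a hypothetical helper, and the request is only built here, not sent.

```ruby
require "net/http"
require "uri"

# Endpoint and parameter names are assumptions; verify against Google's docs.
SEARCH_ENDPOINT = "http://ajax.googleapis.com/ajax/services/search/web"

# Builds (but does not send) a GET request that identifies the calling
# application through the Referer header, plus an optional API key.
def build_search_request(query, referer, api_key = nil)
  params = { "v" => "1.0", "q" => query }
  params["key"] = api_key if api_key
  uri = URI(SEARCH_ENDPOINT)
  uri.query = URI.encode_www_form(params)
  request = Net::HTTP::Get.new(uri)
  request["Referer"] = referer
  request
end
```

Sending it would then be a matter of `Net::HTTP.start(request.uri.host, request.uri.port) { |http| http.request(request) }` and parsing the JSON body.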

Now if only they’d do the same thing for the Google Books Discoverability api. That’s where I really need it; it’s still not clear to me how I might usefully incorporate automated general google search (including google scholar) into my library applications dealing with scholarly materials, because of the high chance that what Google returns will be for-pay and not available to my users: I don’t want to show them that.

So it was with interest I noticed a new feature:

2. Google search supports usage rights limit

Take a look at the Google advanced search page. Click on “Date, usage rights, numeric range, and more”. Look, there’s a “usage rights” limit which filters by CC licenses. When did that show up?  Of course, it can only include things in the filter that advertise a CC license in a way that Google’s bots can recognize. (Not sure how this is done, Google doesn’t say; I think I recall there’s a standard CC-endorsed way to do this?).
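On the question of how a license gets advertised: the convention Creative Commons itself promotes is a link carrying `rel="license"` that points at the license URL. Whether that is what Google’s filter actually keys on is not something Google documents, so the sketch below only shows how a crawler could detect the marker. The method names are made up, and a regular expression stands in for a proper HTML parser.

```ruby
# Returns the href values of <a> or <link> tags marked rel="license".
# A regex is only good enough for a sketch; a real crawler would use an
# HTML parser.
def license_links(html)
  html.scan(/<(?:a|link)\b[^>]*>/i).map do |tag|
    next unless tag =~ /\brel\s*=\s*["'][^"']*\blicense\b[^"']*["']/i
    tag[/\bhref\s*=\s*["']([^"']+)["']/i, 1]
  end.compact
end

# True when at least one advertised license is a Creative Commons license.
def cc_licensed?(html)
  license_links(html).any? { |href| href.include?("creativecommons.org/licenses/") }
end
```

A repository page that shows a license only as an image or in free text would fail this check, which may be part of why so little scholarly content turns up under the filter.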

Unfortunately, some initial test searches revealed that this is a tiny piece of the actual open access pie.  Many scholarly materials that ARE available online open access are not in fact in Google’s indexes. Probably because they don’t advertise it properly in a machine-readable way? Still, this is a great step by Google, and indicates that Google recognizes users are increasingly having trouble with getting too much restricted content in their google search results.

But my frustration remains with the scholarly open access community. If the problem is that open access repositories aren’t advertising CC licenses properly, why aren’t these software packages (many of them open source) being fixed? Why isn’t there a general, concerted, funded effort from the open access repository community to solve the general problem? And the general problem is that there’s no good place to search aggregated open access content and ONLY open access content, for use in software that wants to answer the question “Is there an open access version of the article with this title and author available?” There is no good way to do it. And this lack of discoverability is a huge problem for the utility of the existing open access repository domain. I don’t understand why there isn’t more concerted effort to solve it.

Although, in fairness, I did recently become aware of a European initiative, apparently actually funded, to address at least part of this issue. Registering in machine-readable format whether content is open access is the first step to building aggregated indexes. (It’s a dirty secret of the ‘open access repository’ domain that much of the content in so-called “open access repositories” is not in fact open access at all; it’s behind IP and password based restrictions. A cursory sample of items in repositories listed in OpenDOAR, whose collection policies say that a reason for EXCLUSION from OpenDOAR is “Site requires login to access any material (gated access) - even if freely offered”, will reveal that that collection policy is quite often honored in the breach.) Although I guess DOAJ has less of a problem with that, and that SPARC/DOAJ initiative is just about DOAJ, so it’s not clear to me that the SPARC project will really address my problem. I guess the SPARC project is about people not being sure if they can re-use material in DOAJ journals. My problem is being able to do a meta-search limited to publicly available open access content in the first place, and I don’t care if it’s licensed for re-use, I just want to find only stuff that is actually viewable online for free!

Hmph.   What can we do to get the open repositories communities to take note of this problem and address resources toward it?

rails debugging April 17, 2008

Posted by jrochkind in Practice, programming.

I know other rails devs read this blog. I LOVE ruby-prof. It rocks. You have to use the ‘graph’ profile to really get its power; in default mode it doesn’t do much. I haven’t even tried it yet with KCachegrind visualization, haven’t had the energy to go over THAT learning curve. Like everything else in the Rails world, there’s a bit of a learning curve to figure it out. For me, it was mainly in finding the right documentation, which was that excellent blog post referenced above. After that, it flowed smoothly.

It’s really helping me figure out where the bottlenecks are in the Umlaut resolve action.

The query_trace plugin is pretty great too.

And, in that vein, I still don’t understand how some of my fellow coders get along without ruby-debug. But if I were better at conscientiously writing the unit tests I should be writing, maybe I wouldn’t be using ruby-debug so much.

Rails gotcha — assigning relationships April 15, 2008

Posted by jrochkind in Practice, Rails, programming.

You’ve got Employees and Departments. Each Employee has one Department, each Department has many Employees. Very many.  Let’s say thousands, or even tens of thousands.

So you want to create a new Employee and assign it to a Department.

dept = Department.find_existing_dept_somehow()  # existing one fetched from db

employee = Employee.new  # newly created, not yet saved

Now you have two choices:

1: dept.employees << employee
2: employee.department = dept

Either way you end with: employee.save!

Those might look equivalent, but I think the first ends up being a huge performance problem. I believe that is because the first call will end up requiring a fetch of all the department’s employees (thousands or more) before adding the new employee to it, and possibly doing an implicit save of one or more objects too, while the second never forces the potentially expensive fetching of all the department’s employees. But I’m just guessing here. All I know is that when I changed the #1 style to the #2 style, I erased one mysterious performance hit in my app.
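To make the suspected difference concrete, here is a plain-Ruby toy model. It is not ActiveRecord and says nothing certain about what Rails does internally; `FakeDb`, `Department`, `Employee` and `rows_loaded_when` are stand-ins invented for the illustration. The fake database simply counts how many employee rows get loaded under each assignment style, assuming the collection really is materialized before the append.

```ruby
# Toy stand-ins, NOT ActiveRecord: FakeDb counts how many employee rows
# are loaded so the two assignment styles can be compared.
class FakeDb
  attr_reader :rows_loaded

  def initialize(employee_count)
    @employee_count = employee_count
    @rows_loaded = 0
  end

  def load_employees_for(_dept_id)
    @rows_loaded += @employee_count
    Array.new(@employee_count) { |i| { id: i } }
  end
end

Employee = Struct.new(:name, :department_id)

class Department
  attr_reader :id

  def initialize(id, db)
    @id = id
    @db = db
  end

  # Touching the collection materializes every existing row (the guess).
  def employees
    @employees ||= @db.load_employees_for(id)
  end
end

# Returns how many rows were loaded while attaching one new employee.
def rows_loaded_when(style, employee_count = 10_000)
  db = FakeDb.new(employee_count)
  dept = Department.new(1, db)
  employee = Employee.new("new hire", nil)
  case style
  when :append_to_collection then dept.employees << employee   # style 1
  when :set_parent then employee.department_id = dept.id       # style 2
  end
  db.rows_loaded
end
```

In this model, style 1 pays for every existing row and style 2 pays for none; checking whether a real app behaves the same way is a job for the query log.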

Can licensing make an API useless? April 4, 2008

Posted by jrochkind in business, open source, programming.
2 comments

As I discussed in a previous essay, it’s the collaborative power of the internet that makes the open source phenomenon possible. The ability to collaborate cross-institution and develop a ‘community support’ model is what can make this round of library-developed software much more successful than the ‘home grown’ library software of the 1980s.

So how does this apply to APIs? Well, library customers are finally demanding APIs, and some vendors are starting to deliver. But the point of an API is that a third party will write client code against it. If that client code is only used by one institution, as I discussed, it’s inherently a more risky endeavor than if you have client code that’s part of a cross-institutional collaborative project. For all but the smallest projects involving API client code, I think it is unlikely to be a wise managed risk to write in-house software that is only going to be used or seen by your local institution.

The problem is if a vendor’s licenses, contracts, or other legal agreements require you to do this, by preventing you from sharing the code you write against the API with other customers.

On the Code4Lib listserv, Yitzchak Schaffer writes

here’s the lowdown from SerSol:

“The terms of the NDA do not allow for client signatories to share of any information related to the proprietary nature of our API’s with other clients. However, if you would like to share them with us we can make them available to other API clients upon request. I think down the road we may be able to come up with creative ways to do this - perhaps an API user’s group, but for now we cannot allow sharing of this kind of information outside of your institution.”

To me, this seriously limits the value of their APIs, so much so that I am tempted to call them useless for all but the simplest tasks. It’s probably unwise for an institution to undertake any significant development effort under these terms. That’s assuming that an individual institution even has the capacity to do this. The power of internet collaboration is that it increases our collective capacity by making that capacity collective. Both that increased capacity and managed risk through a shared support scenario require active collaboration between different institutions. Even SerSol’s offer to perhaps make a finished product available to other clients “upon request” is not sufficient; active and ongoing collaboration between partners is required.

If I’m involved in any software evaluation processes where we evaluate SerSol’s products, I am going to be sure to voice this opinion and its justification. If any existing SerSol API customers are equally disturbed by this, I’d encourage you to voice that concern to SerSol. Perhaps they will see the error of their ways if customers (and especially potential, not-yet-signed customers) complain.

Ross Singer notes that this is especially ironic when SerSol claims their APIs are “standards based” (http://www.serialssolutions.com/ss_360_link_features.html). What proprietary information could they possibly be trying to protect (from their own customers!)?

Open source, support status, and risk management March 28, 2008

Posted by jrochkind in business, open source.
1 comment so far

Deciding whether to go with a particular open source product is an exercise in risk management. To be sure, let’s be clear: deciding whether to go with a particular proprietary product is also an exercise in risk management. (And really, most organizational management decisions probably are, but what do I know, I’ve never been a manager and don’t have an MBA.)

Evaluating the risk level of an open source product is kind of new terrain for some in the library world. It is comforting to remember that there are some aspects of evaluation that really aren’t much different for open source software than for any other software — for instance, looking at whether the product has the features you need, and how well it works.

There are other aspects that need to be approached differently for open source. In this essay, I’m going to look at just one of them, that is cause for particular concern among some people — open source support models, how you get support for an open source product, what you are risking in terms of support with an open source product. All open source products/projects are not equal here. In trying to explain to others how to approach risk management related to support options in a particular open source product, I’ve found it useful to talk about three situations or statuses an open source project may have with regard to support. (more…)

Think you can use Amazon api for library service book covers? March 19, 2008

Posted by jrochkind in Practice, business, catalogs, programming.
4 comments

Think again.

http://listserv.nd.edu/cgi-bin/wa?A2=ind0803&L=ngc4lib&T=0&O=D&X=77132057060E3A8667&P=6033

Jesse Haro of the Phoenix Public Library writes:

Following the release of the Customer Service Agreement from Amazon this past December, we requested clarification from Amazon regarding the use of AWS for library catalogs and received the following response:

“Thank you for contacting Amazon Web Services. Unfortunately your application does not comply with section 5.1.3 of the AWS Customer Agreement. We do not allow Amazon Associates Web Service to be used for library catalogs. Driving traffic back to Amazon must be the primary purpose for all applications using Amazon Associates Web Service.”

There are actually a bunch of reasons library software might be interested in AWS. But the hot topic is cover images. If libraries could get cover images for free from AWS, why pay for the expensive (and more technically cumbersome!) Bowker Syndetics service to do the same? One wonders what went on behind the scenes to make Amazon change their license terms in 2007 to result in the above. I am very curious as to where Amazon gets their cover images and under what, if any, licensing terms. I am curious as to where Bowker Syndetics gets their cover images and on what licensing terms–I am curious as to whether Bowker has an exclusive license/contract with publishers to sell cover images to libraries (or to anyone else other than libraries? I’m curious what contracts Bowker has with whom). All of this I will probably never know unless I go work for one of these companies.

I am also curious about the copyright status of cover images and cover image thumbnails in general. Who owns copyright on covers? The publisher, I guess? Is using a thumbnail of a cover image in a library catalog (or online store) possibly fair use that would not need copyright holder permission? What do copyright holders think about this? This we may all learn more about soon. There is buzz afoot about other cover image services various entities are trying to create with an open access model, without any license agreements with publishers whatsoever.

Google Book Search API March 13, 2008

Posted by jrochkind in Practice, programming.
3 comments

So Google has announced a much-awaited api for pre-checking availability of full text in Google Books. Here is one post with more detail than other announcements I’ve found.

I note that the API is described as a javascript api, and examples are provided where the request to the API is made on the client-side with javascript.

However, there’s no technical reason why you couldn’t do this server-side as well. It’s just an HTTP GET request with certain query parameters which returns JSON. I can certainly parse JSON server-side.
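Parsing such a response server-side could be sketched as below. The shape of the response is an assumption here: a JSON object keyed by bibliographic key, possibly wrapped in a JavaScript callback, with a `preview` field whose values include `full`, `partial` and `noview`. Those field names and values, and the helper names `parse_jsonp` and `viewable?`, are assumptions to verify against Google’s documentation.

```ruby
require "json"

# Strips a JSONP-style wrapper such as `callback({...});` and parses the
# JSON object inside. Plain JSON passes through unchanged.
def parse_jsonp(body)
  text = body.strip
  if (match = text.match(/\A[\w.$]+\s*\((.*)\)\s*;?\z/m))
    text = match[1]
  end
  JSON.parse(text)
end

# book_info is one entry of the parsed response. The "preview" field and
# its values are assumptions about the response shape.
def viewable?(book_info)
  %w[full partial].include?(book_info["preview"])
end
```

With the body of an HTTP response in hand, `parse_jsonp(body).values.any? { |info| viewable?(info) }` would answer the pre-check question for a batch of identifiers.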

It makes a big difference to me whether I can do this server-side or not. Why? One example is because my software wants to query multiple sources of digital text (including our own licensed e-text from our catalog), and do something different depending on whether there is any available text or none. In some contexts, the user may even get an entirely different page depending on the answer to that. It’s difficult or impossible to implement that kind of logic only on the client-side (plus it would only work for those with javascript).
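The kind of server-side branching meant here might look like this sketch. Each source is modeled as a callable that takes a citation and returns a URL or nil; the source names, the `choose_page` helper and the page symbols are hypothetical, not Umlaut’s actual design.

```ruby
# sources: hash of { "Source name" => callable }, where each callable takes
# a citation and returns a full-text URL string, or nil if it has none.
# Returns a description of which page to render.
def choose_page(citation, sources)
  hits = sources.map { |name, lookup| [name, lookup.call(citation)] }
                .reject { |_name, url| url.nil? }
  # No full text anywhere: send the user to an entirely different page.
  hits.empty? ? { page: :request_form } : { page: :full_text, links: hits.to_h }
end
```

The decision needs the answers from every source before any HTML is produced, which is why a pre-check that only works in the browser is so awkward to fit in.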

So there’s no technical reason I can’t do it server-side. But Google may certainly stop me with policy. They could rate-limit requests to the API from any given IP (and it sounds like they DO: “Because developers often issue an atypical quantity of requests, you may accidentially tip the security precautions found in Google Book Search.”). Google certainly has its own business reasons to want to aggregate as much individual data as possible, and not let my application be an intermediate proxy. (Google is in fact in the business of collecting usage data, not of providing search. Think about how they make their money.) So hmm, time will tell.

Interestingly, a couple of the examples on the announcement are Google Books pre-check integrated into SFX! It sounds as if this was done by Ex Libris, not by the individual customer. And when I attempt to reverse-engineer the HTML to see what’s going on, it looks to me like SFX is indeed making a server-side pre-check, not doing it in javascript on the client side. Which would be encouraging. Unless Ex Libris somehow has special permission. Hmm.

Eagerly awaiting more information about this. Not quite sure how to get it.

updates (14 Mar):

1.

Got a reply from Google:

You can do something similar to this on the client side, just add some logic
in the JavaScript on whether or not to show the div with the books dependent
on the viewability information.

Unfortunately, we don’t support server side querying of the API, because
viewability is based on local rights limitations (different countries
consider different books to be public domain), and we think it hurts the
user experience to provide incorrect viewability information.

Doh! This doesn’t really answer my concern I’m afraid, I really can’t do what I need to do client side, at least not without extreme difficulty or loss of functionality. But I guess that’s how it’ll be!

2.

Had the idea of asking the ksu SFX example for an XML response, to see what that tells us. Of course everything in the XML response is necessarily generated server side. The Google Books section looks like this:

<target>
<target_name>LOCAL_GOOGLE_BOOK_SEARCH</target_name>
<target_public_name>Google Book Search</target_public_name>
<target_service_id>22170000000000005</target_service_id>
<service_type>getCitedReference</service_type>
<parser>GoogleBookSearch::Isbn</parser>
<parse_param/>
<proxy>no</proxy>
<crossref>no</crossref>
<note/>
<authentication/>
<char_set/>
<displayer>GoogleBookSearch::Reference</displayer>

<target_url>
http://books.google.com/books?id=BpPwy8t3OtIC&ie=ISO-8859-1&source=gbs_ViewAPI
</target_url>
</target>

Hmm. It’s hard to say. The XML response does not include what’s in the HTML response telling you for sure what kind of access is available. But it does include a google books URL that sure looks like it required talking to Google on the server-side to generate: it doesn’t have an ISBN in it, it has a Google unique ID. How would SFX know that Google unique ID without talking to Google on the server-side? Which Google tells me they don’t allow. Hmm. Curiouser.
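Pulling the `target_url` from the XML above apart makes the point visible. This small sketch uses only Ruby’s standard `uri` library, and the URL is copied from the response as shown.

```ruby
require "uri"

# The target_url exactly as it appears in the SFX XML response above.
target_url = "http://books.google.com/books?id=BpPwy8t3OtIC&ie=ISO-8859-1&source=gbs_ViewAPI"

# Split the query string into a hash of parameter names and values.
params = URI.decode_www_form(URI(target_url).query).to_h

puts params["id"]       # the Google-assigned volume id
puts params["source"]   # what the link says about where it came from
puts params.keys.inspect
```

The parameters are `id`, `ie` and `source`, with no ISBN among them, and the `source` value reads like a reference to the viewability API.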

Another update: If I turn off javascript and look at the ksu SFX page, I still get the Google Books link. It would definitely appear to be server side. Is Ex Libris SFX allowed to do something that I am not?