last.fm has a really nice api.
I’d really like to write a plug-in for last.fm for Umlaut.
But first I’ve got to figure out how Umlaut is going to figure out (and track in it’s internal schema) whether a given OpenURL citation actually represents some kind of musical album, or at least a sound resource that has a higher chance than just any random citation of being a musical album. Otherwise there are going to be too many false positives, since keyword search on artist/album is going to be the only way to search last.fm. A bit tricky since the commonly used OpenURL referent formats don’t really provide for this.
Such a plug-in could provide free cover images, an artist biography, and a link to the really useful last.fm artist page. The API even reveals if the album is going to have streamable samples from last.fm, so you can pre-notify the user of that before clicking on the link. Would be cool.
Anyone know any algorithms for string similarity?
Something occurs to me that would be useful for all sorts of Umlaut plug-ins that rely on keyword searches of external databases like this. Are there any good standard algorithms for computing some measure of ‘similarity’ between two strings? It’s quite likely that album name, or book title, or author/artist name in the incoming citation won’t match _exactly_ with a record found in an external database, but will still in fact represent the same real world thing.
An algorithm for computing some measure of similarity between two strings would be helpful in improving Umlaut’s matching heuristics in these cases. If one is a strict subset of the other, or close to a strict subset, or the same except for some puncutation (or a strict subset except for some punctuation), etc., odds are better they’re the same. But rather than try to write a bunch of rule-based heuristics like this, I’m thinking a computational similarity approach already worked out by someone else is the way to go.
Why a link resolver really isn’t federated search
I think this is a good example of why the ‘link resolver’ domain really is qualitatively different than the ‘federated search’ domain.
Sure, much like a broadcast search application, Umlaut goes out and searches several different external databases using APIs, or screen scraping, etc.
But a link resolver is focused like a laser on a very particular application, receiving a known item citation and trying to find versions, services, and descriptions of that known item in external databases. That special focus leads to code that you’d never put in a general purpose federated search application; and also allows one to leave out all sorts of code that would be basic requirements in a federated search application.
Similar, but not quite the same thing.
6 thoughts on “last.fm, and umlaut’s net”
Thanks Ryan. I don’t think any of those will quite work in this situation, but following wikipedia trails, I arrive at n-grams, and remember someone telling me that they had used n-grams to do workset grouping, a similar sort of problem, really. Now I just need to remember who it was and get them to explain it to me.
See also http://en.wikipedia.org/wiki/Levenshtein_distance (used to calculate “fuzzy” strings in translation tools, for one example).
The problem is there’s multiple layers to this…
Simple editing mistakes – I was running a fevor, eidting can be a pain
editing distance/Levenshtein can do pretty well with these
Then there’s the mis-remembering of things or substituting similar words in a title…
ie “Our Gods Wear Capes” instead of “Our Gods Wear Spandex”
I’d love to see a paper that analyzes some of the possible issues and suggests possible matching techniques. Heck, I may have read such a paper at some point. My memory fails me. Let us know if you find one ;).
Reading about n-gram similarity, I think it has some promise, but so far most of what I find is a bit too mathematical for me to understand.
Man, 10 years ago I could have read a C.S./math paper like this:
But now it’s all greek to me.
Let me know if it means more to you. Otherwise I think maybe I remembered who was telling me about n-gram similarity, and I’m going to hunt them down and ask them. I think they were using n-gram algorithms for deciding if two records were of the same bibliographic work, which is a pretty similar application.
The differences I’m running into are things like one title has a colon followed by a subtitle, the other omits the sub-title entirely. Or one is missing a word entirely. Or author names where one has a middle name, and one doens’t, or one is firstname-lastname and the other opposite. That n-gram similarity article makes me think one of those algorithms might work here, but I’m not sure–it kind of sounds like Longest Common Substring magically combined with edit distance, which sounds good. That’s kind of what I need, like, if you make just a few edits, you get a really long common subtring, right? But who knows.
I’m starting with a set of strings that are more likely than not to be what I want anyway, as they are the rest of a relevancy ranked search from an external database. So that might help.
Man, in another life I could have been an advanced algorithm creating kind of guy, but I’m so not.
Is it overly simplistic to suggest some vector space models like cosine or Pearson’s R or Jaccard? Order won’t matter, nor would things be weighted by field… There are also probabilistic models – I’m thinking of the chapters I missed in Manning, C. D., Raghavan, P., & Schutze, H. (2008) – http://www-csli.stanford.edu/~hinrich/information-retrieval-book.html