I am not the first one to suggest this (I think Ross Singer and Ed Summers have promoted it in the past), but this really cool wikipedia-miner tool mentioned by Arash Joorabchi on the code4lib list suddenly made it seem totally feasible to me today.
I think wikipedia-miner, by applying statistical, ‘best guess’ text-mining techniques, provides more relationships than dbpedia alone does. I do know that wikipedia-miner’s XML interface is more comprehensible to me, and easier for me to use, than dbpedia’s (sorry, linked data folks).
What if, instead of maintaining our own subject and name authority files, we simply used wikipedia as an authority file? Wikipedia is a list of concepts (including some personal names), and wikipedia-miner (or possibly dbpedia) also extracts lead-in synonyms and relationships for those concepts. What if, when cataloging a book, you just looked up the relevant wikipedia articles for its subjects and controlled authors, and linked to those? The relationships and lead-in synonyms could be used in a browse list or other interface in ways similar to how we use our existing library authority data.
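To make the idea a bit more concrete, here is a minimal sketch in Python of what that data might look like, assuming each wikipedia article is treated as one authority record. Everything in it is hypothetical: the Concept class, the browse_list function, and the sample data are made up for illustration, and nothing here calls wikipedia-miner or dbpedia. In practice the synonyms and relationships would come from one of those tools.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One wikipedia article treated as an authority record (hypothetical shape)."""
    title: str                                          # article title, used as the preferred heading
    synonyms: list[str] = field(default_factory=list)   # lead-in terms pointing at this article
    related: list[str] = field(default_factory=list)    # titles of related articles


def browse_list(concepts):
    """Return alphabetized browse entries, with 'see' references for each synonym."""
    entries = []
    for concept in concepts:
        # The preferred heading files under its own title.
        entries.append((concept.title.lower(), concept.title))
        # Each lead-in synonym files under itself and points to the heading.
        for synonym in concept.synonyms:
            entries.append((synonym.lower(), f"{synonym}: see {concept.title}"))
    return [text for _, text in sorted(entries)]


# Made-up sample data standing in for what a mining tool might extract.
concepts = [
    Concept("Cooking", synonyms=["Cookery", "Culinary arts"], related=["Food"]),
    Concept("Food", synonyms=["Foodstuffs"]),
]

# A bibliographic record links to concepts by article title instead of a local heading.
record = {"title": "An Example Cookbook", "subjects": ["Cooking"]}

listing = browse_list(concepts)
for line in listing:
    print(line)
```

The point of the sketch is only that the catalog record stores a link to the article, while the lead-in terms and relationships live with the concept, much as they do with our existing authority records.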
This would potentially be more feasible today for subjects than for names. Not every name we care about is going to be in wikipedia, although some are. (And we couldn’t even necessarily add the missing ones to wikipedia ourselves; not every person we care about is ‘notable’ enough for wikipedia.) But I suspect the subject coverage of wikipedia articles, used as a controlled vocabulary, is already sufficient to provide subject access.
Would this result in subject access as good as what our current library controlled vocabulary efforts provide? Or even better? (Our current efforts are surely not perfect.) Would it result in cost savings? It would potentially make our data more interoperable with other data sources. What other upsides or downsides do you see?
It’s pretty awesome that wikipedia exists as a giant encyclopedia whose text is all open access, making such data mining experiments and re-uses possible.
One risk of tying our authority control horse to something like wikipedia is that we’re counting on a) wikipedia continuing to exist, b) wikipedia continuing to make its data available for use by a tool like wikipedia-miner, and c) wikipedia-miner continuing to exist and be maintained (although the latter could be done with library resources if needed). I’m not sure if this is more or less of a risk than relying on our existing authority control infrastructure to continue to exist and be maintained, in an environment of dwindling library resources.