using wikipedia as an authority file?

jrochkind General May 17, 2011

I am not the first one to suggest this (I think Ross Singer and Ed Summers have promoted it in the past), but this really cool wikipedia-miner tool mentioned by Arash Joorabchi on the code4lib list made it suddenly seem totally feasible to me today.

I think wikipedia-miner, by applying statistical analysis text-mining ‘best guess’ type techniques, provides more relationships than dbpedia alone does. I know that wikipedia-miner’s XML interface is more comprehensible and easily usable by me than dbpedia’s (sorry linked data folks).

What if instead of maintaining our own subject and name authority files, we simply used wikipedia as an authority file? Wikipedia is a list of concepts (including some personal names), and wikipedia-miner (or possibly dbpedia) also extracts lead-in synonyms and relationships for those concepts. What if when cataloging a book, you just looked up the relevant wikipedia article for its subjects and controlled authors, and linked that? The relationships and lead-in synonyms could be used in a browse list or other interface in similar ways to how we can use our existing library authority data.

This would potentially be more feasible today for subjects than names. Not every name we care about is going to be in wikipedia, although some are. (And we couldn’t even necessarily add the missing ones to wikipedia ourselves; not every person we care about is ‘notable’ enough for wikipedia.) But I suspect the subject coverage of wikipedia articles as a controlled vocabulary is already sufficient to provide subject access.

Would this result in as good subject access as our current library controlled vocabulary efforts? Or even better? (Our current efforts are surely not perfect). Would it result in cost savings? It would potentially make our data more interoperable with other data sources. What other upsides or downsides do you see?

It’s pretty awesome that wikipedia exists as a giant encyclopedia whose text is all open access, making such data mining experiments and re-uses possible.

One risk of tying our authority control horse to something like wikipedia is that we’re counting on a) wikipedia continuing to exist, b) wikipedia continuing to make its data available for use by a tool like wikipedia-miner, and c) wikipedia-miner continuing to exist and be maintained (although the latter could be done with library resources if needed). I’m not sure if this is more or less of a risk than relying on our existing authority control infrastructure continuing to exist and be maintained, in an environment of dwindling library resources.

41 thoughts on “using wikipedia as an authority file?”

Ryan Shaw says:

May 17, 2011 at 3:41 pm

It seems clear that, considered as a standalone authority file, Wikipedia is much richer and more useful than the Library of Congress authority files. But of course only the latter have been used in the creation of millions of catalog records. Thus there is a need to link the two systems. Given the vast differences in structure, governance, and process between the Library of Congress and Wikipedia, it would be undesirable to merge the two systems. A single organization should not and could not control all the information found in these two systems. But they should be made interoperable and potentially linkable into a single distributed system. Why separate and isolate authority records and records in reference resources such as gazetteers or encyclopedias? Though they may differ in granularity of description and degree of interpretation, a well-designed, flexible network of knowledge organization systems could and should connect all of them. As usual the Germans are way ahead on this.
Librarian says:

May 17, 2011 at 4:04 pm

Great post!
jrochkind says:

May 17, 2011 at 7:03 pm

Ryan, I agree that we’d want to deal with legacy records by mapping (surely by automated statistical means, if we do it at all) our existing subject vocabularies to wikipedia articles under that scenario.

But let’s say, as a thought experiment, we did that, and then abandoned LCSH to use just wikipedia as an authority file. What would be the downside of this? I’m not seeing the obvious logic of your statement “Given the vast differences in structure, governance, and process between the Library of Congress and Wikipedia, it would be undesirable to merge the two systems.” Why not? Although to be clear, it’s not exactly merging them that I’m proposing (just hypothetically, as a thought experiment, to stake a position and investigate the implications); it’s actually abandoning LCSH and just using Wikipedia instead. Yeah, the governance and process are different. So? What if we did that?
Steve Casburn says:

May 17, 2011 at 10:04 pm

Jon: I have been thinking exactly the same thing (“why can’t Wikipedia be a catalog front end?”), and want to explore the idea and push it forward.

The non-relationship between Wikipedia and librarians has been one of those frustrating situations where if librarians in general were to focus less on denouncing what they see as wrong with the new ideas and more on determining how to shape and improve those ideas, then we could be much better public servants, much better professionals, and much more relevant to where the world of information will be. Now that I have 11 years in the profession and am solidly established, I want to spend the final 25-30 years of my career pushing for exactly that kind of change in focus.

Thanks for posting this article!
jrochkind says:

May 18, 2011 at 1:09 am

I’m honestly not going so far as to think Wikipedia alone is a ‘catalog front end’, although that’s perhaps a related idea. Wikipedia as the ‘authority file’ for subjects (if not names) is a somewhat different or more focused idea, still compatible with a multitude of different ‘catalog front ends’.
Steve Casburn says:

May 18, 2011 at 2:19 am

Jonathan: I might have been one step beyond where you were! (And perhaps not a step you’d follow me in taking.)

Let’s say that a library integrates its name and subject authority records with Wikipedia.

Let’s also note that many, many more people start their research in Wikipedia than they do in a library catalog.

Wouldn’t it make sense for the library community to work with Wikipedia to provide links from the Wikipedia entries used as name and subject authorities that would allow people to find library resources related to the topic? And wouldn’t that make Wikipedia a kind of catalog front-end (though not with all of the features we associate with an OPAC)?

OCLC (at least, I think it’s OCLC) is already doing something that could grow into this vision. Pull up the Wikipedia entry on Abraham Lincoln. Near the very bottom, just above the Political Offices infobox, is an “Authority Control” infobox. Clicking on the LCCN (the actual number, not “LCCN”) takes you to the WorldCat Identities page for Abraham Lincoln.

It would just take a tweak by OCLC to their Identities pages, and we would have a pathway from a Wikipedia name or subject entry to the local library’s cataloged resources on that person or thing. And another tweak by OCLC (having libraries register their in-branch IP addresses) would allow a library to pass its in-branch users straight from that Identities page to the library’s resources with one click.

What do you think?
Arash Joorabchi says:

May 18, 2011 at 7:24 am

Hi Jonathan,

I have also been contemplating the idea of mapping Wikipedia to LCSH and vice versa in order to enrich both for use in my text analysis experiments. This poster (http://portal.acm.org/citation.cfm?id=1555488) presented at JCDL-09 by Yoji Kiyota et al. discusses the possibility of such mapping.
Bruce says:

May 18, 2011 at 1:12 pm

Wikipedia is a list of concepts (including some personal names) …

Is this really the case? Seems to me wikipedia is a list of somewhat more grounded people (not names), things, places, events, as well as some higher-level concepts.
jrochkind says:

May 18, 2011 at 3:42 pm

Bruce: Well, yeah, the semantics/ontology gets tricky, what’s a “concept” exactly? I mean to suggest that Wikipedia is a list of much the same things as LCSH is a list of (and including some but not all of the things that the NAF is a list of). You could say it’s a list of ‘topics’, I suppose. Or you could say it’s a list of “terms”, which isn’t saying much. But I am suggesting it is a list of terms that can be used to describe aboutness, which covers much the same ground as LCSH (although of course with different term/vocabulary choices).

In discussing controlled vocabularies, especially in the “thesaurus” community, often the word “term” is used to describe each, well, element. I find that word insufficient in its implication of “wording”. Because a given element in a controlled vocabulary can have multiple wordings/labels/terms that are used for it (for instance, one in English, another in Spanish), but what’s fundamental to the controlled vocabulary is its choice of how to map the real world domain into a set of elements, where each element may have attributes such as relationships (possibly labelled) to other elements, various human-displayable labels, etc. I tend to call these elements “concepts”, avoiding the philosophical issues of what their relationship to the (supposed) “objective world” is.

But anyway, regardless of the vocabulary we choose, the suggestion (which may or may not be true) is that Wikipedia is a list of ‘things’ whose domain is pretty similar to LCSH’s list of “things”.
Karen Coyle says:

May 19, 2011 at 1:26 am

Isn’t this what we mean by linked data? Essentially any entry point on the web can be a link to your data if we share our identifiers and make some inferences about things being the same. It’s not that Wikipedia becomes our authority file but that we would eliminate the separation between Wikipedia and library data. We could continue to have our own authority files if we wish, but our data could also be fully interactive with Wikipedia. And if we do it as linked data, this will be true not only with Wikipedia but with any other dataset with which we share identifiers.

Although you wouldn’t want to serve users from dbpedia, dbpedia is the link switching station for a number of different data sets, including Wikipedia. So it seems to me that there’s more bang for our buck if we would integrate library name identifiers with dbpedia, and then there could be interrelations with any systems querying dbpedia for related identities.
James Weinheimer says:

May 19, 2011 at 10:10 am

I love the idea of actually working with Wikipedia instead of always preaching against it–as I did for many years. If you can’t beat them, join them, and I think Wikipedia would become a great partner for libraries.

That said, I think Karen was correct in that Wikipedia does not have to replace our files, which is something I believe would be detrimental to the library cataloging effort, but to work *with* our files so that all could benefit. Library authority files exist to show, essentially, cataloging precedent: a cataloger decided earlier that a title will be used this way, or a name will be used that way, plus there is a little bit of extra information in there to make the files easier to use.

Wikipedia does something quite different, but it could still be a great tool to incorporate into library systems.
jrochkind says:

May 19, 2011 at 12:08 pm

Jim, what do you think about it replacing just our _subject_ authorities?

There are a variety of related ideas here, different ones can be done without doing the other ones.

But I’m also trying to think about what we can _stop_ doing: there’s no way we can just keep adding things we’re doing without ceasing to do _something_. This applies both to metadata control/cataloging, and to writing features into the software we use. So the point of the original blog post here is that thought experiment.
Karen Coyle says:

May 19, 2011 at 2:05 pm

Jonathan, if your goal is to save cataloger time, then I would think that you would want to look to the publishers first, at least in terms of author names. Few authors actually end up with wikipedia entries, and probably not before a library gets that person’s first book. Publishers, on the other hand, have by definition a relationship with the authors and publish metadata with the author name. For contemporary and especially first-time authors, publishers are the first to create metadata.

Then there are the various databases being created for academic authors (some of which also end up in library catalogs):
http://people.bibkn.org/
http://vivo.cornell.edu/
http://names.mimas.ac.uk/

I think the trick is getting everyone to assign identifiers so we don’t have to worry about string ambiguity.

As for subjects, anything that would get us out of LCSH would be a time saver.

All that said, since few libraries do original cataloging, where do you see the real savings coming from?
Ryan Shaw says:

May 19, 2011 at 2:08 pm

But let’s say, as a thought experiment, we … abandoned LCSH to use just wikipedia as an authority file. What would be the downside of this?

One downside you already mentioned: there may be subjects that don’t meet notability standards in Wikipedia. Unlikely, perhaps, but I wouldn’t just assume that every LCSH subject has a Wikipedia counterpart.

Another potential downside is that it’s not always clear what Wikipedia redirects indicate. Most of the time they can be interpreted as “these are alternative names for the same thing”. But sometimes they mean “this is a sub-topic of this other thing.” For example, an article on Mrs. X’s husband is deleted for lack of notability, and the URL for Mr. X is redirected to Mrs. X’s article, where there is a brief mention of him. (This happens with general non-person subjects a lot too, where it is decided that something is actually a sub-topic of something else and not worthy of its own article.) In the context of Wikipedia, where you can read the article and understand why you’ve been redirected, this is not a problem. But if you’re extracting relations from Wikipedia and using them out of context it could be a problem.
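A minimal Python sketch of the distinction described above. It assumes that each extracted redirect has somehow been labelled (by a person or a heuristic) as either an alternative name or a sub-topic. Wikipedia's redirects do not reliably carry such a label themselves, and the class names, function name, and sample data here are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class RedirectKind(Enum):
    ALTERNATIVE_NAME = "alternative name"  # "these are names for the same thing"
    SUBTOPIC = "sub-topic"                 # "this is covered inside another article"


@dataclass(frozen=True)
class Redirect:
    source: str         # title that redirects
    target: str         # article the reader lands on
    kind: RedirectKind  # assumed label; not supplied by Wikipedia itself


def lead_in_terms(redirects, target):
    """Return only the redirects that are safe to treat as synonyms of target."""
    return sorted(
        r.source
        for r in redirects
        if r.target == target and r.kind is RedirectKind.ALTERNATIVE_NAME
    )


# Hypothetical data mirroring the Mr. X / Mrs. X example.
REDIRECTS = [
    Redirect("Madame X", "Mrs. X", RedirectKind.ALTERNATIVE_NAME),
    Redirect("Mr. X", "Mrs. X", RedirectKind.SUBTOPIC),
]

synonyms = lead_in_terms(REDIRECTS, "Mrs. X")
```

Without the `kind` label, "Mr. X" would end up in a browse list as just another name for Mrs. X, which is the out-of-context problem in question.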

More generally, I think it’s problematic to piggy-back on another community’s work. It’s not fair to the Wikipedians to make them responsible for the continued upkeep and maintenance of your subject catalogs, when that’s not what they’ve volunteered to do. Wikipedians should do what’s right for Wikipedia without having to worry about how it affects your OPAC. If the library community decides it doesn’t want to maintain subject authorities anymore, then it should work within Wikipedia to create links from articles to related works, and not simply pluck out relations and plunk them down in an alien context.
jrochkind says:

May 19, 2011 at 3:30 pm

Well, under my suggestion (which again is just a sort of devil’s advocate thought experiment), _whoever_ it is that’s now doing LCSH assignments would be doing wikipedia term assignments instead (rather than in addition to). Few libraries do much original cataloging, but SOMEBODY’s assigning those! But yeah, the point of this would not be to save time, it would be to make things better while not costing much ADDITIONAL time. (It would probably save a huge amount of time in LCSH vocabulary maintenance itself; whether that time saved (inside and outside LC) would go to other useful library metadata control projects or just be laid off, I don’t know.)

Definitely less realistic for names than subjects. We do want to control names of all authors or as many authors as possible, not just ‘notable’ ones.

For subjects…. if it’s not notable enough for wikipedia, I wonder if anyone would miss it if it were gone. It’d be a different set of subject terms, certainly, different in many ways. I suspect it would be good enough though, and probably just as good.

Yeah, for author names, some other effort is a better match than wikipedia. In addition to the ones Karen mentioned, whatever happened to ORCID?
jrochkind says:

May 19, 2011 at 3:35 pm

Most of the time they can be interpreted as “these are alternative names for the same thing”. But sometimes they mean “this is a sub-topic of this other thing.” For example, an article on Mrs. X’s husband is deleted for lack of notability,

Interestingly, though, the same is true of Library Authority 4xx’s, I think! For the same reasons. X lacks an LCSH heading of its own for lack of sufficient representation in the literature (not so different from ‘notability’), so X is used as a 4xx (“redirected”) to Y. And I don’t think there’s any way in our own records to distinguish a 4xx that’s been used like that from the more common case (in both wikipedia and our authority file) of a 4xx that’s been used as an actual “alternative name” instead.

I certainly would not expect wikipedia to do things specifically in order to meet our use cases. (They’re not going to, whether we expect them to or not). But if what they are doing for their own purposes can be re-purposed for our purposes as well as-is, there is absolutely nothing unfair about “piggy-backing” on it. This is what open data is all about, that kind of “piggy-backing”. Wikipedia releases their content under the license they release it under with the motivation of letting the data be useful for novel purposes; this is what dbpedia is all about, in fact. (Just like, IMO, there would be absolutely nothing unfair about some _other_ community using our own LCSH or NAF for their own purposes, if they decided it met their purposes.)
Christopher Warner says:

May 19, 2011 at 7:30 pm

The problem with using Wikipedia as an authority file is that Wikipedia is simply not the single authority on any one given item. Not only that but this essentially grants Wikipedia more authority over data relevance than any one entity should ever have. I’m not saying it’s not a place to find valid, relevant or authoritative data on a subject. I’m saying it’s not the ONLY place to find such data.

The web-of-data or linked data should be premised on specific data for a segment of knowledge that is based on a system of trust. A trust-metric evaluated and seeded by major authorities of any one topic.

So for “X” topic there would be a number of authorities the community itself considered to be valid authorities on the subject. They would then be used as seeds for that particular topic and the community itself could validate/invalidate each other on whatever classification appropriate. This way, not only does no one person control the validity of a topic/item/uri but it distributes a proper graph that is non-static. Relevance is based on other metrics.

The general idea is that the web itself is the data and that it can be mined regardless of where that data is. Using Wikipedia as an authority only promotes more siloed data.
Ryan Shaw says:

May 19, 2011 at 9:28 pm

… this essentially grants Wikipedia more authority over data relevance than any one entity should ever have.

This is the point I was trying to make with my comment about piggy-backing being unfair. I don’t think it’s unfair to use their data—as you say, that’s the whole point. But it is unfair to make Wikipedia responsible for being the One True Authority File, given that they don’t (currently) see that as their mission. Which is why I’d rather take a mapping-identifiers approach.
Ryan Shaw says:

May 19, 2011 at 9:34 pm

Note that with a mapping approach, you could still achieve the same savings in vocabulary maintenance that you advocate. Map the existing subject headings and state that, henceforth, all new subjects will be selected from Wikipedia topics, and LC will simply mint identifiers for them. Relations will mirror Wikipedia’s. Then maintaining LCSH becomes a purely mechanical process (thus saving labor), but there is still an escape route in case Wikipedia jumps the shark or other attractive sources of authority emerge.
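The mapping approach above can be sketched roughly as follows, in Python. This is only an illustration under stated assumptions: the `MappedVocabulary` class, the `subj-` identifier prefix, and the sample topics are all invented here, and real identifiers minted by LC would look different.

```python
import itertools


class MappedVocabulary:
    """Locally minted subject identifiers, each mapped to an external topic."""

    def __init__(self, prefix="subj-"):
        self._prefix = prefix               # hypothetical identifier prefix
        self._counter = itertools.count(1)  # source of sequential numbers
        self._id_for_topic = {}             # (source, topic) -> local id
        self._topic_for_id = {}             # local id -> (source, topic)

    def mint(self, topic, source="wikipedia"):
        """Return the local identifier for a topic, minting one if needed."""
        key = (source, topic)
        if key not in self._id_for_topic:
            local_id = f"{self._prefix}{next(self._counter):06d}"
            self._id_for_topic[key] = local_id
            self._topic_for_id[local_id] = key
        return self._id_for_topic[key]

    def remap(self, local_id, topic, source):
        """Point an existing local identifier at a different source (the escape route)."""
        old_key = self._topic_for_id[local_id]
        del self._id_for_topic[old_key]
        self._id_for_topic[(source, topic)] = local_id
        self._topic_for_id[local_id] = (source, topic)

    def resolve(self, local_id):
        """Return the (source, topic) pair a local identifier currently maps to."""
        return self._topic_for_id[local_id]


vocab = MappedVocabulary()
war_id = vocab.mint("World_War_I")
music_id = vocab.mint("Sheet_music")
```

Catalog records would store only the local identifier, so calling `remap` later changes where a subject points without touching any bibliographic record.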
jrochkind says:

May 20, 2011 at 1:26 am

Yep, that makes sense, Ryan.
jrochkind says:

May 20, 2011 at 2:06 am

Christopher: We’re talking about the same meaning of the word ‘authority’, right? I’m not talking about the content of the articles, I’m just talking about the _list_ of articles. Do you think LC has more authority over data relevance than any one entity should have by controlling the list of LCSH? “Wikipedia”, being just the net result of lots of different people collaborating to create (for the purposes we’re talking about here) a list of articles, is even less “one entity” than LC is via LCSH.

But yes, I understand the idea of linked data. I doubt it’s going to be feasible any time _soon_ to replace our controlled vocabularies with just “you know, all the stuff on the web that’s linked”. It’s not entirely clear to me how that would actually work in practice to provide actual subject access: assigning subject terms to bibliographic items so they can be grouped by subject and found by subject, and so different items on the same subject are assigned to the same subject terms even if they don’t use the same terms in their ‘original’ metadata. That’s what we have controlled subject vocabularies for. It’s not entirely clear to me how you’d use “just a bunch of ever changing stuff on the web, linked” as a vocabulary for that purpose. There might be a way, but it’s something that’s going to be figured out experimentally with some false starts and technical challenges. But we could replace one controlled vocabulary (LCSH) with another (list of Wikipedia articles) if we wanted to, anytime.
Christopher Warner says:

May 20, 2011 at 3:11 am

Do you think LC has more authority over data relevance than any one entity should have by controlling the list of LCSH? “Wikipedia”, being just the net result of lots of different people collaborating to create (for the purposes we’re talking about here) a list of articles, is even less “one entity” than LC is via LCSH.

Not at all, however when I speak of authority I am also speaking about trust and distribution. The fact is that Wikipedia owns the resources which provide the data. It is not distributed sans a dump that one can replicate from. It’s not distributed in any form beyond that and unless you use dbpedia one can’t access it.

But yes, I understand the idea of linked data. I doubt it’s going to be feasible any time _soon_ to replace our controlled vocabularies with just “you know, all the stuff on the web that’s linked”. It’s not entirely clear to me how that would actually work in practice to provide actual subject access: assigning subject terms to bibliographic items so they can be grouped by subject and found by subject, and so different items on the same subject are assigned to the same subject terms even if they don’t use the same terms in their ‘original’ metadata.

For the sake of fairness, please don’t “quote” things I haven’t said, unless that is your own off the cuff commentary. As far as controlled vocabularies go, what you described is exactly what graph databases manage. So for instance I can assign a subject term to a bibliographic item that can be grouped by subject and found by subject, with subsequent different items/objects that can be assigned to the same subject even if they don’t use the same terms in the original metadata. In fact, I can do this now using the Plone stack, which uses an object database, so you could even apply annotated metadata directly to an object if one wanted. In fact I could do this with Lucene and MySQL. That aside, a proper graph database like AllegroGraph will allow you to attach a trust factor, weights, etc. to each triple. So the overall graph/index/db, whatever you want to call it, will provide the correct node based on what you are requesting.

That’s what we have controlled subject vocabularies for. It’s not entirely clear to me how you’d use “just a bunch of ever changing stuff on the web, linked” as a vocabulary for that purpose.

I realize.

There might be a way, but it’s something that’s going to be figured out experimentally with some false starts and technical challenges. But we could replace one controlled vocabulary (LCSH) with another (list of Wikipedia articles) if we wanted to, anytime.

There is a way and I can assure you if I had all of the pieces figured out I’d already have it in a box for your purchase. I’m not trying to stop you, you’re free to do whatever you like I’m just debating your premise.

What if instead of maintaining our own subject and name authority files, we simply used wikipedia as an authority file? Wikipedia is a list of concepts (including some personal names), and wikipedia-miner (or possibly dbpedia) also extracts lead-in synonyms and relationships for those concepts.

What I’m stating flatly is that this is a mistake. I could probably think of many other reasons but the above is a good start. If anything, exposing your own controlled vocabularies in a standard distributed, no point of failure fashion across the contextual web of data space or “topic” space would be more useful. Even if it meant just mirroring the data in different namespaces. You haven’t even begun to scratch the surface of use cases. I mean, this is a difficult problem, using Wikipedia as your master? For controlled vocabularies? Seriously?
jrochkind says:

May 20, 2011 at 9:04 am

I’m confused, from my perspective it’s _easier_ to get a copy of wikipedia than it is to get a copy of LCSH.

On the linked data stuff, we will see. While I think that RDF etc is a fine data format to keep things in, because it orders your data well to make such experimentation possible, I think that some of the (to me) more complicated aspirational visions people have for it are still largely unproven and I don’t think we should hitch ALL our horses to assuming it will be workable.
Karen Coyle says:

May 20, 2011 at 10:11 am

“I’m confused, from my perspective it’s _easier_ to get a copy of wikipedia than it is to get a copy of LCSH.”

http://id.loc.gov/authorities/

has a link captioned: “Download entire concept scheme.”

That said, LCSH has so many problems (pre-coordination, free-floating subfields, etc.) that it’s an unlikely candidate for linking since no other community uses subject terms like it. FAST has a better chance of being able to have equivalent concepts created by other communities. It would be relevant to take a look at the “Vocabulary Mapping Project” (http://cdlr.strath.ac.uk/VMF/documents.htm) which is also downloadable.
jrochkind says:

May 20, 2011 at 10:16 am

What’s available from id.loc.gov is not the entirety of LCSH; some fields from the MARC authorities are not present in id.loc.gov’s version of the data, or are coded so as to lose some meaning. It might be sufficient anyway, in which case, okay, it’s as easy to get a copy of (the id.loc.gov version of) LCSH as it is to get a copy of wikipedia, but certainly no easier.
Karen Coyle says:

May 20, 2011 at 10:55 am

Sorry, VMF is just data elements (I mis-remembered), but then I remembered Open Cyc, a general subject vocabulary:

http://www.cyc.com/cycdoc/vocab/vocab-toc.html

and the DBPedia ontology itself:

http://mappings.dbpedia.org/server/ontology/classes

It’s worth drilling down in the dbpedia classes because they get into an interesting level of detail:

http://mappings.dbpedia.org/server/ontology/classes/Election

I also wonder, now that we’re going on and on about this :-) if it wouldn’t be more useful to consider LC Classification rather than LCSH as a linking point from library data to other subject schemes. Assuming that we will continue to give book class numbers for shelf location, maybe there’s a way to get more bang out of that buck by riffing off of the class number instead of the subject heading.
jrochkind says:

May 20, 2011 at 11:07 am

if it wouldn’t be more useful to consider LC Classification rather than LCSH as a linking point from library data to other subject schemes.

Well, see, this is because we’re talking about at least two different things here. My original post was not meant to be a consideration of “linking points” at all, I wasn’t looking for “linking points”.

Rather, it was specifically meant to be a consideration of how we provide subject access to our collections, and what would happen if we stopped maintaining our own subject vocabulary, and instead used wikipedia itself as a subject vocabulary. Which remains an interesting question to me in and of itself.

I left unspoken what the value of having a controlled vocabulary for access was, but maybe that’s been confusing. The value of a controlled subject vocabulary, I think, is to allow people to find items on a given topic represented in the subject vocabulary, brought together with each other, even if those items do not use the same natural vocabulary internally (“WWI” vs “the great war” or what have you being a classic example), or use the same words for different things (polysemy), etc. There are other values when the controlled vocabulary elements have relationships between them and such.
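To make the “brought together” idea concrete, here is a small Python sketch. It assumes a tiny, made-up authority table keyed by Wikipedia-style article titles, each with a few lead-in terms; the table, the item list, and the function names are hypothetical and not taken from wikipedia-miner or any library system.

```python
from collections import defaultdict

# Hypothetical authority data: concept identifier -> preferred label and lead-in terms.
AUTHORITY = {
    "World_War_I": ["World War I", "WWI", "the great war", "First World War"],
    "Mercury_(element)": ["Mercury (element)", "quicksilver"],
}


def normalize(term):
    """Lower-case a term and collapse whitespace so lookups ignore trivial differences."""
    return " ".join(term.lower().split())


def build_lead_in_index(authority):
    """Map every normalized label to the set of concepts it leads to."""
    index = defaultdict(set)
    for concept_id, labels in authority.items():
        for label in labels:
            index[normalize(label)].add(concept_id)
    return index


def group_by_concept(items, index):
    """Group (title, natural-language term) pairs under their controlled concept."""
    grouped = defaultdict(list)
    for title, term in items:
        for concept_id in sorted(index.get(normalize(term), ())):
            grouped[concept_id].append(title)
    return dict(grouped)


index = build_lead_in_index(AUTHORITY)
items = [
    ("Book A", "WWI"),
    ("Book B", "The Great War"),
    ("Book C", "quicksilver"),
]
grouped = group_by_concept(items, index)
```

Books A and B use different words internally but land under the same concept, which is the whole job of the controlled vocabulary in this scenario.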

Now, if such a controlled subject vocabulary is no longer useful, that’s a different discussion. I was assuming for the sake of argument it is useful (we do put lots of resources into it in LCSH), but considering an alternate approach to getting one. (I think it is clearly useful if it can be pulled off, but one could argue that it’s simply not feasible to pull it off usefully at a price we can afford. But then, that was part of the idea of this hypothetical proposal too, imagining ways to ‘piggy back’ on data that already exist).

And discussing general ways to “link” our data to all manner of other data for many varied purposes is another discussion too (or possibly several other discussions). I think it’s useful to (at least sometimes) focus on specific functions and things that can actually be done today in incremental steps, rather than only on big picture abstract generalized conceptual stuff. So what was interesting to me was the realization that we could use wikipedia this way _right now_ if we wanted to.

But clearly many of my readers are interested in having a different discussion than I am, heh, which is fine.
Monica says:

May 20, 2011 at 12:16 pm

I’m not capable of getting into the depth of all this, but regarding name disambiguation, I just want to point out that ORCID is very much alive and working. http://www.orcid.org/

They just had a participant meeting on May 18 in Boston.
http://www.orcid.org/civicrm/event/info?reset=1&id=1

Perhaps somebody who attended can give an update? That looks like a serious effort, involving publishers, libraries, scholarly societies, being conducted with a lot of openness.
Christopher Warner says:

May 20, 2011 at 1:10 pm

I’d certainly be interested in seeing some of this working group’s discussion. If they plan to post a linked data set that solves disambiguation for topic sets that would be a highly, highly useful thing.
Andromeda says:

May 21, 2011 at 8:55 pm

So hey, you can do that! And people have! I wrote a paper about mining Wikipedia to generate automated subject headings (it was in Information Technology and Libraries in March; you can get a preprint on my web site: http://www.andromedayelton.com/wp/resume/). Some of the sources I consulted to write my paper were all about people’s use of Wikipedia for authority information. They’re mostly in the CS literature, not the library literature, but hey, now they’re in my bibliography, and hopefully of use to you.
Saskia says:

May 22, 2011 at 4:39 am

In an article for the recent Code4Lib journal (http://journal.code4lib.org/articles/4916), Renée McBride reports that the 19th-Century American Sheet Music Collection uses Wikipedia as the controlled vocabulary for topics. However she says that because faceting and relevance ranking in discovery platforms are currently based on LCSH, Wikipedia authority control will have to comply with LCSH when the collection is integrated into such a discovery tool.
Andromeda says:

May 22, 2011 at 9:39 am

@Saskia: I don’t understand the claim. My husband is an Endeca engineer and assures me that LCSH is not baked into Endeca (which will happily auto-generate facets from a variety of data; indeed most Endeca customers are not libraries and have data entirely agnostic of LCSH). It may be that the local installation described in the article is configured around LCSH in a way that would be challenging to rewrite (maybe it would be easier for them to bring their ContentDM data into an LCSH schema than to reconfigure their Endeca installation), but it’s certainly not a fundamental limitation of the faceting and relevance technology.
Kam Solusar says:

May 28, 2011 at 11:32 pm

As Bryan mentioned in the first comment, the German Wikipedia has been adding links to various authority files to their articles for a couple of years now. Biographical articles can be tagged with the corresponding PND, LCCN and VIAF IDs. Other entities, organizations or events can be tagged with the corresponding GKD ( https://secure.wikimedia.org/wikipedia/en/wiki/Gemeinsame_K%C3%B6rperschaftsdatei ) and other topics are linked to the Schlagwortnormdatei ( https://secure.wikimedia.org/wikipedia/en/wiki/Schlagwortnormdatei – the German equivalent to the LCSH)

There’s also a cooperation with the German National Library (DNB), which in turn adds links to the corresponding Wikipedia articles to their online database ( http://d-nb.info/gnd/118529579/about/html for example). Wikipedians also submit all errors and omissions they discover in that database to the DNB each month.

It’s also implemented on the English Wikipedia, but is only used in little more than 1,000 biographical articles so far, while the German Wikipedia already has included such metadata in more than 160,000 articles.

There’s also a little tool that is linked from each biographical article ( http://toolserver.org/~apper/pd/person/Albert_Einstein for example ) that extracts information from biographical articles and combines it with links to several other websites. It also creates an authority record from that information, see http://toolserver.org/~apper/pd/PeEnDe.php?id=1278360
jrochkind says:

May 28, 2011 at 11:37 pm

Wow, Kam, that sounds pretty amazing.

Are there any interesting current uses of that relational data, linking wikipedia to authorities? Or any ways it makes authority work more efficient, things that don’t have to be done because wikipedia is doing it?
kcoylenet says:

May 29, 2011 at 10:10 am

Kam, Apper’s tools are great! I looked on his talk page but my German is minimal — is there a way to point these tools at a different Wiki? (Obviously I’m thinking en., but presumably if you can do it for one….)
Kam Solusar says:

May 31, 2011 at 6:16 pm

Are there any interesting current uses of that relational data, linking wikipedia to authorities? Or any ways it makes authority work more efficient, things that don’t have to be done because wikipedia is doing it?
Not that I know of, at the moment. But I don’t have any connections to libraries, authorities or similar organizations, so unless something is posted on Wikipedia or on one of the Wikimedia mailing lists, I’m probably the last one to find out about such things :)

Kam, Apper’s tools are great! I looked on his talk page but my German is minimal — is there a way to point these tools at a different Wiki? (Obviously I’m thinking en., but presumably if you can do it for one….)
Seems like the tools are unfortunately only usable with the German Wikipedia at the moment. From what I understand, the tools use information extracted from the Persondata template ( https://secure.wikimedia.org/wikipedia/en/wiki/Wikipedia:Persondata ) and the categories of biographical articles and store this information in a separate database. From a technical point of view, it would certainly be possible to do the same for the English Wikipedia and other language versions where the Persondata template is used. I’ll ask APPER if there are any plans.
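
To make that extraction step concrete, here is a rough Python sketch of the idea as described above: pull the Persondata fields and the categories out of an article's wikitext and store them in a separate database. This is not APPER's actual code. The sample wikitext, the function names, and the table layout are all made up for illustration, and the parsing is deliberately naive (it assumes no nested templates or pipe characters inside the Persondata block).

```python
import re
import sqlite3

# Illustrative wikitext only; field names follow the English Persondata template.
SAMPLE_WIKITEXT = """
'''Albert Einstein''' was a theoretical physicist.

{{Persondata
| NAME              = Einstein, Albert
| ALTERNATIVE NAMES =
| SHORT DESCRIPTION = Physicist
| DATE OF BIRTH     = 14 March 1879
| PLACE OF BIRTH    = Ulm, Germany
| DATE OF DEATH     = 18 April 1955
| PLACE OF DEATH    = Princeton, New Jersey
}}
[[Category:1879 births]]
[[Category:Theoretical physicists]]
"""


def parse_persondata(wikitext):
    """Return the Persondata fields as a dict, or {} if no template is found.

    Naive by design: assumes no nested templates or '|' inside field values.
    """
    match = re.search(r"\{\{\s*Persondata(.*?)\}\}", wikitext,
                      re.DOTALL | re.IGNORECASE)
    if not match:
        return {}
    fields = {}
    # Skip the text before the first '|', then split each part at the first '='.
    for part in match.group(1).split("|")[1:]:
        key, sep, value = part.partition("=")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def parse_categories(wikitext):
    """Return the category names linked from the article, in order."""
    return re.findall(r"\[\[Category:([^\]|]+)", wikitext)


def store_person(conn, title, fields, categories):
    """Store one article's metadata in a hypothetical two-table schema."""
    conn.execute("CREATE TABLE IF NOT EXISTS person "
                 "(title TEXT PRIMARY KEY, name TEXT, description TEXT, "
                 "born TEXT, died TEXT)")
    conn.execute("CREATE TABLE IF NOT EXISTS person_category "
                 "(title TEXT, category TEXT)")
    conn.execute("INSERT OR REPLACE INTO person VALUES (?, ?, ?, ?, ?)",
                 (title,
                  fields.get("NAME"),
                  fields.get("SHORT DESCRIPTION"),
                  fields.get("DATE OF BIRTH"),
                  fields.get("DATE OF DEATH")))
    conn.executemany("INSERT INTO person_category VALUES (?, ?)",
                     [(title, category) for category in categories])


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_person(conn, "Albert Einstein",
                 parse_persondata(SAMPLE_WIKITEXT),
                 parse_categories(SAMPLE_WIKITEXT))
    for row in conn.execute("SELECT name, born, died FROM person"):
        print(row)
```

A real tool would presumably work from a full Wikipedia dump and use a proper wikitext parser rather than regular expressions, which is probably part of why porting it to another language version is more work than it sounds.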
Kam Solusar says:

May 31, 2011 at 6:35 pm

Just found an old posting on APPER’s talk page from last December ( https://secure.wikimedia.org/wikipedia/de/wiki/Benutzer_Diskussion:APPER/Archiv/2010#Wikipedia-Personensuche ). Seems like it would be quite a lot of work for one developer to make it work with other Wikipedia language versions as well.

I hope that more people from the English Wikipedia become interested in creating such tools in the future. There are already lots of tools out there that use geographical data from Wikipedia, but there seems to be not as much interest in creating tools that extract and use biographical metadata.
Pingback: The Muninn Project » Blog Archive » Authority, Statistics and RDF
jrochkind says:

May 31, 2011 at 11:25 pm

Ah, Kam, you’re coming from the wikipedia side? There were some folks on the code4lib listserv (library sector software developers) who were interested in adding some links to ‘biographical’ US library data to wikipedia pages, possibly via a bot, but weren’t sure how to navigate the wikipedia policies and politics to get it approved and done correctly. Would you be interested in giving them some advice or support, if they’re interested and need some orientation?
Kam Solusar says:

June 2, 2011 at 1:12 pm

Yep, I’ve been contributing there for a few years now. But I mainly work on the German Wikipedia, so I’m unfortunately not too familiar with all the finer aspects of the English Wikipedia (rules and policies can be quite a bit different across the communities of the various language versions).

However there’s the GLAM-Wiki project on the Wikimedia Outreach Wiki ( https://secure.wikimedia.org/wikipedia/outreach/wiki/GLAM ) for projects and cooperations between Wikimedia wikis and the GLAM (Galleries, Libraries, Archives and Museums) community. The Wikimedia Foundation is always interested in cooperations, so I’m sure if the code4lib folks contact the GLAM-Wiki team, the team can provide advice and help.

It would certainly be interesting to see a cooperation between the Wikimedia Foundation and organizations like VIAF/OCLC. At the moment, only a few biographical articles are tagged with the corresponding VIAF ID, but if such a cooperation resulted in the addition of such data to hundreds of thousands of articles, I’m sure this would be a huge incentive for programmers to come up with interesting new tools that use such data and combine it with other available databases.
Pingback: A library mashup: Wikipedia and the catalog « all things cataloged

Published by jrochkind

41 thoughts on “using wikipedia as an authority file?”
