Re-usable linked big data for real

Some awesome news from VIAF — the license terms of VIAF data reuse will be made more clear (and rather liberal), and more complete access to bulk data will be made available.

This is great. Name authority data is something which has value only when it’s used, and the more use it gets the more value it has to everyone, and the more people/institutions will find it worth their while to maintain it too.  These are great steps for VIAF.

To some extent I think we’re still in the ‘learning as we go’ stage with big reusable linked data on the web. Despite what some people may tell you, it’s not entirely clear how to do it ‘right’, or how to do it in a way that will actually be easy for people to re-use.

I predict a couple of pain points here, around licensing terms and data format.

Licensing terms for data

In likely real-world use cases, VIAF data will be merged into other data sets, in complex ways.

I suggest VIAF needs to be more clear about the ‘attribution’ requirement in CC-BY —  about where/how the attribution needs to be given.

On every individual page (or API response?) shown to a user (or to software) that might possibly include a data element that came from VIAF? A given app may or may not find it easy to track whether a given data element DID come from VIAF; if such tracking is actually required to meet the license terms, that’s a significant, non-trivial barrier.

Or just a single attribution in the application’s ‘about’ page or documentation saying “some data from VIAF”? That’s a lot more realistic.

Or something in between?

Something workable can probably be arrived at; still, these issues are why I continue to suggest that anything but a CC0/ODC-PDDL no-rights-reserved type license for data is a bigger barrier to re-use than you might think. An attribution requirement on data is a bigger barrier than an attribution license on narrative content, precisely because common data use cases are quite likely to involve remixing, mashing up, and merging at a much smaller, more granular level, in a much more dynamic/automated way, and through more generations of indirect reuse than adaptation/remixing of narrative or human-language content. Tracking the ‘provenance’ of individual pieces of data as they’re remixed and reused, in order to satisfy attribution requirements, can become a significant, non-trivial problem.
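
To make the provenance problem concrete, here’s a minimal Python sketch of what per-element source tracking might look like once records are merged. The field names and record shapes are entirely hypothetical; the URI is just the example one from the announcement.

```python
# Illustrative only: a merged record where each field carries a (value, source)
# pair, so a later consumer can tell which elements would need VIAF attribution.
# Field names and values are hypothetical.

viaf_fields = {
    "preferred_name": ("Some Preferred Name", "http://viaf.org/viaf/49224511"),
}
local_fields = {
    "preferred_name": ("A Local Form of the Name", "local-catalog"),
    "birth_year": (1900, "local-catalog"),
}

def merge(*records):
    """Merge records while preserving per-field provenance.

    Every generation of merging has to keep carrying these pairs along, or
    downstream consumers lose the ability to say what came from VIAF."""
    merged = {}
    for record in records:
        for field, (value, source) in record.items():
            merged.setdefault(field, []).append((value, source))
    return merged

combined = merge(local_fields, viaf_fields)

# Does anything in this record come from VIAF? An app would need an answer to
# decide whether a given page or API response must carry a VIAF attribution.
needs_viaf_attribution = any(
    source.startswith("http://viaf.org/")
    for values in combined.values()
    for _value, source in values
)
print(needs_viaf_attribution)  # True
```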

(An entirely separate complication with CC licenses and data is that a CC license is only enforceable if the person offering the license actually owned copyright on the thing licensed in the first place. Whether data is copyrightable at all is complex and varies between legal jurisdictions. If it turns out VIAF data isn’t copyrightable in some jurisdictions, then the CC-BY license on it may not be enforceable in those jurisdictions. That isn’t necessarily a problem for the consumer, but it is an added level of confusion and complexity. See Creative Commons and Data from the Australian National Data Service, and my own post on implications of CC-BY on data.)

Data formats: Still an RDF heretic

Also, I may be (may be? okay, I am) revealing myself as a crotchety old anti-RDF heretic here, but trying to comprehend the abstract and indirect graph of data here (VoID documents, what?), and starting to think through how to write the software to use it, is reminding me unpleasantly of dealing with SOAP.

To be fair, I haven’t looked at the actual VIAF data yet (is it even available yet in the new forms described?); I’m just basing this on trying to understand the documentation and examples, and on previous experience with RDF. So maybe it’ll end up much simpler than I fear.

I understand why VIAF wants to vend the data in RDF, for its maximal abstraction, and (at least in my opinion) as a sort of proof of concept of RDF for real data.

It’s a (worthwhile) “proof of concept” attempt, in my opinion, because I think RDF in general is still to a large extent a grand experiment, an aspirational project. RDF can be very difficult to work with, and I think that’s part of the reason we haven’t seen more uptake of RDF or a more robust RDF data ecology.  Whenever you try to make data (or just about anything else) universally abstract and maximally flexible, you wind up with something more complicated (and usually more resource-intensive and more verbose) to use than custom-fit data formats.  That doesn’t mean not to try the abstract route; I do think it’s a worthy experiment. But I think complicated multi-vocabulary RDF may end up being a barrier too.
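
For a feel of what that indirection looks like in code, here’s a rough sketch using Python’s rdflib. The Turtle snippet and the choice of vocabularies (skos, foaf) are my own guesses at the kind of multi-vocabulary description involved, not the actual VIAF output.

```python
from rdflib import Graph, Namespace, URIRef

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

# Made-up Turtle standing in for the kind of multi-vocabulary RDF a VIAF URI
# might return; the vocabularies and property choices here are guesses, not
# anything verified against actual VIAF output.
turtle = """
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://viaf.org/viaf/49224511>
    a skos:Concept ;
    skos:prefLabel "Some Preferred Name" ;
    foaf:name "Some Preferred Name" .
"""

g = Graph()
g.parse(data=turtle, format="turtle")

# Even this trivial "what's the name?" lookup is phrased in terms of subjects,
# full predicate URIs, and graph traversal, not record["name"].
subject = URIRef("http://viaf.org/viaf/49224511")
print(g.value(subject, SKOS.prefLabel))
```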

I’m reminded of the SGML->XML->JSON progression. SGML is a lot more flexible than XML, but XML won out by being simpler (and necessarily less flexible/abstract as a result). Then namespaces were added to XML to add a level of abstraction for machine re-use that is great in theory (and actually a precursor to RDF in some ways). But most developers can tell you namespaces end up being very difficult to work with, which is part of why they haven’t caught on much, and part of why JSON (which lacks the vocabulary-isolating features of XML namespaces) is becoming more popular than XML despite its lack of the features XML (especially XML with namespaces) has, or perhaps because of it: fewer features means simpler and easier to use.
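
The namespace pain is easy to demonstrate with a small Python example (the document and element names here are invented):

```python
import xml.etree.ElementTree as ET

# Invented example document mixing two vocabularies via namespaces.
doc = """
<record xmlns:dc="http://purl.org/dc/elements/1.1/"
        xmlns:local="http://example.org/local/">
  <dc:title>Some Title</dc:title>
  <local:shelfmark>QA76.9</local:shelfmark>
</record>
"""

root = ET.fromstring(doc)

# With namespaces in play, every lookup needs the full namespace URI in Clark
# notation, or a prefix map passed on every call; the prefixes used in the
# document itself are irrelevant to the consumer.
title = root.find("{http://purl.org/dc/elements/1.1/}title")
shelfmark = root.find("local:shelfmark", {"local": "http://example.org/local/"})
print(title.text, shelfmark.text)
```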

So, sure, provide data in as good RDF as you can to participate in the collective aspirational research project of RDF; I’m not giving up on it yet. But if VIAF were also able to provide the data in a simpler, more concise, less abstract/indirect format fit to the VIAF data specifically (likely JSON), I suspect they’d get more consumers using it.
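
To be concrete about what I mean, here’s an entirely made-up JSON shape fit to authority data, with the kind of consumer code it would allow. This is not anything VIAF actually offers, just an illustration of how little ceremony would be involved.

```python
import json

# A hypothetical, VIAF-specific JSON shape (not anything VIAF actually offers),
# just to show how little a consumer would need beyond a JSON parser.
response = """
{
  "viaf_uri": "http://viaf.org/viaf/49224511",
  "preferred_name": "Some Preferred Name",
  "alternate_names": ["Another Form of the Name"],
  "sources": ["LC", "DNB"]
}
"""

record = json.loads(response)
print(record["preferred_name"])  # plain key access, no graph traversal
print(record["viaf_uri"])        # the canonical URI, handy for attribution too
```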

So if VIAF data ends up not being used by many consumers as is, maybe it’s because there’s a lack of interest (which someone will suggest marketing can fix, rightly or wrongly), or maybe it’s because those interested lack the money/resources to do anything about it. Or maybe it’s because there are still barriers to use that can be reduced: by providing easier-to-comply-with license terms, and/or by providing data in simpler, easier-to-work-with formats than RDF.

4 thoughts on “Re-usable linked big data for real”

  1. You may have missed the following right there in the announcement: “Attribution is to VIAF (Virtual International Authority File), and we specify in the VoID document that the use of the canonical VIAF URI (e.g. http://viaf.org/viaf/49224511) qualifies as attribution if more traditional ways of acknowledgement are difficult.”

  2. Thanks Bryan. I don’t think it says _where_ the attribution should appear. On every HTML page or within every API response that contains (or may contain) VIAF data? Or just once in an ‘about’ page in an HTML app, or in documentation for an API?

    If you meant the latter, that’s great! If it’s obvious to everyone else you meant the latter and I’m just being dense, sorry! In case there’s anyone else as dense as me, it might be wise to be more clear. If you’re already being clear in some part of the announcement/documentation that I’m somehow not seeing, sorry!

    But it kind of sounds like you mean the former: since the example is linking to a _particular_ VIAF record, it sounds like you are required to track what data elements came from what _particular_ VIAF record so you can link back to it. That would be not so great. Oops, maybe you’re saying it already ought to be clear that you _do_ need to track data provenance so you can link back to the particular VIAF record the data came from? (And I’m not personally sure what ‘more traditional ways of acknowledgement’ even are! Or exactly what ‘use of’ a URI means; use of it where? Does it need to be user-visible in an HTML app? Is just any old place in an API enough?)

    I think maybe it’s not just me being dense, the language you quote is suggestive but not entirely determinative. Although I’ve certainly been dense before.
