‘open’ data, with attribution license vs no-restriction license.

There is an ongoing debate about whether ‘open data’ should be released under an ‘attribution required’ style license, or a ‘no restrictions’ style license. For instance, lately OCLC has been suggesting its data is or should be licensed with an ODC-BY attribution-required license, while European libraries and cultural heritage institutions have been leaning toward and encouraging CC0-style no-restrictions licenses or releases instead. Here’s an article summing up both positions (read the whole thing to the end to get the big picture; thanks to Graham Seaman on the NGC4Lib listserv for the pointer).

Part of what makes this confusing is that in some jurisdictions it’s unclear how to legally enforce restrictions on data reuse (or whether you can at all) in the first place. This position paper from Creative Commons outlines some of the issues (although I’ve seen CC staff since disclaim this position paper and say it does not represent current CC positions or recommendations). But let’s ignore the legal issues for now, and just hypothetically assume you can legally require whatever restrictions you want on ‘open data’: what is “best”? (This is a big, not actually valid assumption, but just for this discussion, let’s go with it.)

Of course, that suggests the question “best for whom?” Content owners understandably often want to require attribution (assuming they are willing to release any semblance of ‘open data’ in the first place): to get credit, to drive customers back to them, or for various other reasons. But such a requirement creates real barriers and expenses for potential third-party users of this data (which we’ll address a bit later).

So there’s a tension here. No big surprise: there are usually tensions between different stakeholder interests in such things, and practice (and society in general!) is about negotiating them. We’re used to it.

But it would be a mistake to confuse an entity wanting a “BY” style license for its own interests with, well, differing decisions about what makes a ‘good application’. There may be differing opinions on that too, but such decisions are always context-dependent, and (putatively) legally enforceable licenses enforcing universal rules are not where they play out.

In the NGC4Lib discussion, Karen Coyle suggests:

At the same time, most people want to have some idea of where data comes from — some way of gauging the “authoritativeness” of the data. This also requires carrying forward something that looks like attribution, and the W3C is calling it “provenance.” It may be the case that the desire to get credit and the need to know your sources will be solved with a single addition to the linked data technology.

To my reading, this is intended as an implied defense of OCLC’s belief in requiring an attribution-style license.

But if ‘most people will want it’ anyway, you wouldn’t need to enforce it with a license, right? Those that wanted it would do it, and those that didn’t wouldn’t, and that wouldn’t bother you if your concern was giving most people what they want. It would bother you if your concern was enforcing attribution as a (putative) content owner for your own interests — which is not illegitimate or unreasonable, but is a different concern.

I think there are two different things here, ‘good design’ and licensing. We don’t generally try to enforce good design or good UI or good functionality with licensing. Licensing is to protect the rights of the owner, and serve their interests, not to try to somehow enforce that all users practice good design.

After all, even if ‘most people most of the time want X’, a license (assuming it’s legally enforceable) requires it all the time.  Few design principles are universal. Trying to put them in a license is a mistake.

That is, if that was really one’s goal. Instead, I often see owners intentionally confusing these things as a kind of misdirection: they are insisting on certain licensing terms for business reasons (which may not be unreasonable), but when questioned on this, they try to misdirect and say “Well, it’s just good design anyway, right?” Maybe, maybe not, but that’s not the role of licenses.

Why is it a barrier?

In fact, I think there are use cases involving wild remixing, combination, and derivation of data from many different sources, where the requirement to always keep track of provenance will incur a significant cost. That cost will sometimes change the cost/benefit calculation of using data from a source that requires attribution, perhaps to the point that it’s no longer feasible.

Design decisions in software are never just about “is this helpful”; they are always about costs and benefits. Trying to enforce a design decision in a license takes what should be a contextually-specific design decision and tries to make it a universal licensing requirement. This is not about good design.

Let’s look at a specific example from a recent OCLC agreement/clarification with Europeana. 

In response to these concerns, OCLC requested and Europeana agreed to ask subsequent users of the metadata to give attribution to both OCLC and to the contributing institution as the source, and to make them aware of the OCLC cooperative’s community norms around data. This attribution and awareness are consistent with the expectations that OCLC member institutions have of one another with respect to data use. It is also consistent with Europeana’s Usage Guidelines for Metadata, particularly the principle of “giving credit where credit is due.”

Okay, let’s say OCLC data goes into Europeana. It ‘requires’ attribution. (It’s unclear if this is a legal requirement or just a suggestion; let’s assume it’s a legal requirement, as those trying to release data under “BY” are trying to make it.) Okay. It’s a bit tricky (meaning possibly expensive) to begin to keep track of which records or data elements came from OCLC and which didn’t. But it gets worse.

Now let’s say a title in one of the records that came from OCLC and wound up in Europeana had a typo or error in it. Someone corrects that typo or error. Does that data element still require an OCLC attribution?  Plus possibly an attribution to whoever corrected it, if they’re also demanding a “BY” style license on their contribution?

Let’s say a record (or set of semantic assertions, or identified entity) from OCLC is merged with another record representing the same thing from Europeana. Some algorithm tries to take the ‘best’ parts of both records. But let’s say both the Europeana and the OCLC record actually had the exact same “title of book” value (because it was correctly transcribed from the actual book). When there were two separate records, you could use that title from the Europeana record without attribution. But now that they’re merged, since the merging algorithm combined both records, you can’t use the exact same title string value without attributing OCLC? (This starts to approach some of the legal/practical issues around licensing data we said we’d avoid; let’s not go there any further.)
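To make the merge problem concrete, here is a minimal sketch, assuming each field value carries a set of source names. The `Value` class, the `merge_records` function, the "prefer the longer string" tie-break, and the sample records are all made up for illustration; they are not anything OCLC or Europeana actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    text: str
    sources: frozenset  # names of the sources that supplied this exact value


def merge_records(a: dict, b: dict) -> dict:
    """Merge two field -> Value dicts into one record."""
    merged = {}
    for key in a.keys() | b.keys():
        va, vb = a.get(key), b.get(key)
        if va is None or vb is None:
            # Only one record has this field: carry it over unchanged.
            merged[key] = va if va is not None else vb
        elif va.text == vb.text:
            # Identical values: the merged value now points at both sources.
            merged[key] = Value(va.text, va.sources | vb.sources)
        else:
            # Hypothetical "best value" rule: prefer the longer string.
            merged[key] = va if len(va.text) >= len(vb.text) else vb
    return merged


# Made-up sample records describing the same book.
europeana = {"title": Value("Moby-Dick", frozenset({"Europeana"}))}
oclc = {
    "title": Value("Moby-Dick", frozenset({"OCLC"})),
    "author": Value("Melville, Herman", frozenset({"OCLC"})),
}

merged = merge_records(europeana, oclc)
print(sorted(merged["title"].sources))
```

After the merge, the identical title is tagged with both Europeana and OCLC. The code can record that much, but it cannot tell you whether the OCLC credit is actually owed on a string that Europeana also supplied on its own; that is the licensing question, and the bookkeeping exists only to keep it answerable.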

Now, in fact Europeana releases its “own” data as CC0, no restrictions. (Why do I put “own” in scare quotes? Well, its “own” data now includes data that originally came from OCLC, and comes along with an attribution ‘requirement’.) But let’s say Europeana followed OCLC’s lead, and itself claimed a “-BY” style attribution-required license.

Now someone downstream, let’s call ’em Remixer X, is using data from Europeana. All the data needs to be attributed to Europeana, but some also needs to be attributed to OCLC. Now that data gets edited in Remixer X’s system, further confusing the attribution requirements. And maybe Remixer X doesn’t just get data from Europeana/OCLC; it gets data from a couple of other sources, and let’s say they all follow the “with attribution” approach. And then someone else consumes the aggregated data from Remixer X, which necessarily comes with an attribution-style license, ’cause that’s the only way Remixer X can share the data that originally came from Europeana/OCLC with an attribution-style license. But then Remixer X adds their own attribution-style license on top too. Rinse, repeat: a big old confusing, hard-to-track mess of attribution.
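The pile-up can be sketched in a few lines, assuming (as a deliberate simplification) that attribution is tracked per dataset rather than per record or per field. The function name and the parties "Source A", "Source B" and "Remixer Y" are hypothetical.

```python
def required_attributions(owner, upstream):
    """Credits owed when reusing a dataset: its owner plus everything upstream owed."""
    credits = {owner}
    for source_credits in upstream:
        credits |= source_credits
    return credits


oclc = required_attributions("OCLC", [])
europeana = required_attributions("Europeana", [oclc])
source_a = required_attributions("Source A", [])  # hypothetical extra source
source_b = required_attributions("Source B", [])  # hypothetical extra source

remixer_x = required_attributions("Remixer X", [europeana, source_a, source_b])
remixer_y = required_attributions("Remixer Y", [remixer_x])  # next generation

print(sorted(remixer_x))
print(len(remixer_y))
```

The credit list only ever grows with each generation. And this is the easy version: a real system would have to do this bookkeeping per record or per data element, and keep it correct through every edit and merge.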

Tracking provenance is hard

It may also be necessary or desired for certain applications. There may be solutions for some of these provenance tracking challenges, and there may be more solutions appearing in the future, perhaps out of ‘semantic web’ technologies, making it easier.

But if we have a data universe full of remixing, re-using, re-combining, and deriving across multiple generations (the universe that I think many open data/semantic web enthusiasts imagine, with justifiable enthusiasm), then it is indisputably a challenge. A challenge doesn’t mean it’s impossible, but it does mean that solving that problem in a given application or service raises the costs of that application or service compared to not doing so.

Software design decisions are always about cost/benefit. If users or applications really want or benefit from data provenance, then those developing applications and services will have to weigh that against everything else, as with all software development. In some cases they’ll provide it. In other cases, they’ll decide it’s infeasible and the product can’t be finished at all unless provenance tracking is abandoned. In other cases they’ll decide other desired features win out (we can never throw everything we imagine or desire into any software; choices always have to be made). In other cases provenance tracking might not be in the initial release but added later. Etc. Different decisions will be made in different contexts, depending on the application, the continually evolving understanding of the ways data provenance may matter, the evolving capabilities of tools for doing it, etc.

That is, if those developing applications consuming open data had the choice. If, on the other hand, it’s legally required, then your choices and options are much smaller: Do we develop the application with a provenance tracking feature? Do we develop it using other data sources, and not the sources of data with the ‘attribution’ requirements, or with minimal remixing and derivation to keep the problem more tractable? Or do we abandon it and not develop it at all, because we simply don’t have the resources/capabilities to do the provenance tracking the license requires?

Attribution licenses for data are a very real barrier to the envisioned world of multi-generational re-use, combination, and re-mixing of data. Barriers can be climbed — some can make it over them at some times, and others can’t.  Content owners or controllers may have interests leading them to try and enforce attribution-style licenses (assuming they legally can even do that, which is in fact a big ‘if’, and I don’t want you to forget that although we are hypothetically ignoring it in this discussion).  So, that’s the world we live in, we’re used to barriers to technological innovation due to different stakeholder interests, that’s how it goes, that’s the sea we swim in, it’s nothing new.  But don’t think such restrictions are “for your own good”, they are real barriers. 

Community Norms or Guidelines Documents though, are Just Fine

There’s another approach. Instead of trying to require attribution through licensing, you can license/release with no restrictions, but publish a community norms document suggesting it would be great if people attributed/tracked provenance.  Such a document is not meant to be legally enforceable or a requirement, it’s just a friendly suggestion.

This approach is for instance recommended in that (possibly not the present opinion of CC) CC position paper on science data we previously referenced.

And that’s just fine. Good actors will do their best to follow those norms, but feel no guilt in bending them or abandoning them when it’s not feasible for their application, and nobody will fault them for it, or try to take them to court. Bad actors, well, bad actors will be bad actors, and life goes on.

OCLC itself seems to be of multiple minds about whether it’s trying to legally enforce ‘attribution’ in a license for its data, or simply recommending it as a norm. (And whether they expect all members to universally follow the ‘norms’ or be shunned, or whether they understand that different contexts call for different approaches.)

The OCLC/Europeana announcement we previously linked talks only of norms, and doesn’t even mention the fact that OCLC has sometimes suggested there are legal licensing restrictions requiring attribution. And yet in other places, OCLC is recommending others use an ostensibly legally enforceable attribution-requiring license. And with some products, OCLC itself allows use only under just such an ostensibly legally enforceable license. (I am confused about whether OCLC currently claims WorldCat data itself is protected by a -BY style license or not, but since when haven’t WorldCat re-use policies/licenses been confusing?)

If OCLC is of two minds and a bit contradictory on that, that means its organizational mind isn’t totally made up. I hope they will move toward recognizing the real barriers raised by promoting and issuing attribution-requiring licenses (that may or may not be legally enforceable), and toward using and recommending no-restrictions licenses with community norms documents instead, as is becoming standard among European government and cultural heritage organizations. Perhaps experiences like the Europeana incident will help move them there. The Europeana announcement doesn’t mention a legally enforceable -BY license, because it wouldn’t make any sense if it did; the course of action they describe only makes sense with non-enforceable community norms.
