implications of CC-BY on data

So in a comment on another post, Mr. Gunn mentioned that Mendeley distributes its citation database under a CC-BY license. That’s pretty awesome of them, to allow for re-use of their data, instead of trying to monopolize and monetize it. Seriously, kudos. This post is not about Mendeley or what they should do. But it does prompt me to write about CC-BY and data in general, and certain issues that come with the combination.

There are two main topics of interest/concern here. The first, which I’ve written about before, is the legal validity of CC licenses for data. I’ll briefly mention it again because I think it’s important to know, but then we’ll put it aside and get to the second: what are the practicalities of honoring a CC-BY license for data?

Validity of CC for data

So in order to license anything under a CC license, BY or anything else, you need to first own the copyright for it. That’s the premise of ALL the CC licenses, which is pretty clear if you read the legal versions.  I, the licensor, have copyright to this thing. Which means you can’t do any of the things protected by copyright (mainly various kinds of reproduction and performance) without my permission.  So I hereby give you permission (a license) to do these certain things under these certain terms.

If you don’t have the copyright to what you’re trying to license, you can’t offer a license to it. In the US, in the general case, data isn’t really protected by copyright. In some limited ways it can be; the particular collection/arrangement can be, but not generally the individual pieces of data, although some specific kinds of ‘data’ might really be copyrightable (‘data’ is kind of a vague term). But even if the stuff at hand is copyrightable, the person offering the license needs to own the copyright, or otherwise be authorized to offer a license for it, for a CC license to actually be binding or enforceable or mean anything at all. In many cases where I see people offering a CC-BY license for data, that data wasn’t really “created” by them at all. If it is copyrightable, it’s not at all clear to me that the person trying to bind you by a CC re-use license is the one who would own that copyright, or has been authorized by whoever does own it to re-license it to you under a CC license.

So that’s potentially a pretty big problem: you can only give someone a CC license if the licensed work in question is copyrightable, and if you own the copyright (or have been authorized by the copyright holder to re-license it). Otherwise that purported CC license is meaningless and the user isn’t bound by it at all: they neither have any meaningful permission to use the thing (if permission is required), nor are they bound by any restrictions.

But let’s put that aside for now, just for the sake of discussion, and assume that, okay, here’s this collection of data, or database, or pieces of data, and I have a CC-BY license to use it, and it really is valid, I have permission to use the stuff, but only under the terms of the CC-BY license, let’s assume that.

But what does it actually mean

Even assuming that, just for the sake of argument, it’s still not clear to me what CC-BY in particular means, how I should behave to follow that license, as a user.  Let’s focus on CC-BY because it seems to be a very popular license (in general, and for data in particular), and it only gives us one thing to think about, the BY attribution requirement.

What does this attribution requirement really mean in the case of data?

The Situation

In at least my typical data uses, I’m taking a bunch of data, and I’m putting it in a database of my own, combined with data from a bunch of other sources. To keep things simple, let’s say all these other sources are either completely public domain, are licensed to me also under a CC-BY license, or are directly ‘owned’ (if they can be owned) by me already. (It’s complicated enough even in this situation where everything is CC-BY, as we’ll see; it gets even MORE complicated if that’s not the case.)

So I’m mixing this stuff all together. Not only are maybe some records from one source and some from another, but even within a ‘record’, some elements are from one source and some from another. I’m using a variety of possible methods, algorithmic, crowd-sourced, and expert-edited, to make my database as good as possible. You know, maybe I have geographic data from a bunch of sources, and I combine it all together, and I have algorithms (maybe based on usage, and constantly evolving) to take the ‘best’ piece of info when different sources conflict, and I let my users improve it themselves when they find errors, etc.

I don’t really know anymore which piece of data came from which source (and even if it came from a particular source originally, it may have been extensively modified by my users or metadata experts). If I had to somehow make my software create an audit trail so this could be reliably and clearly tracked, that’s a huge additional expense for my software, but I probably don’t need to do that, right? Not if I instead assume that the CC-BY license (potentially multiple CC-BY licenses from a dozen or more different sources, remember) applies to ALL of it. (Which it may not. It’s sort of license-restriction-expansion here: I have to act like it applies to all of it, when some of it may be from public domain sources.)
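To give a feel for what that audit trail would involve, here’s a minimal Python sketch, assuming a record is just a dictionary of named elements. Everything in it (`FieldValue`, `merge_records`, `sources_to_credit`) is a name I made up for illustration, not anybody’s real API. Each element carries the set of sources that ever touched it, and a merge errs on the side of crediting everyone, even a source whose value lost the conflict.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FieldValue:
    """One data element plus every source that ever contributed to it."""
    value: str
    sources: set[str] = field(default_factory=set)


def merge_records(primary: dict[str, FieldValue],
                  secondary: dict[str, FieldValue]) -> dict[str, FieldValue]:
    """Combine two records element by element.

    Assumption: on a conflict the primary value wins, but both sources
    stay on the credit list (the cautious reading of the license).
    """
    merged: dict[str, FieldValue] = {}
    for key in primary.keys() | secondary.keys():
        a = primary.get(key)
        b = secondary.get(key)
        winner = a if a is not None else b
        contributors = set()
        if a is not None:
            contributors |= a.sources
        if b is not None:
            contributors |= b.sources
        merged[key] = FieldValue(winner.value, contributors)
    return merged


def sources_to_credit(record: dict[str, FieldValue]) -> set[str]:
    """Every source that would need attribution for this one record."""
    return set().union(*(fv.sources for fv in record.values()))
```

Even this toy version has to touch every element on every merge, and it says nothing about user edits or values rewritten by an algorithm, which is where the real expense would be.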

So, but, what are my legal attribution obligations?

But, okay, let’s go with that, still… what does the CC-BY license actually require me to do? Well, it requires me to give attribution.

When you’re talking about a single piece of narrative text by someone else you got under CC-BY, that’s pretty clear, put some attribution at the bottom or top or something, and if you want to let your own readers know of their rights, link to the CC-BY license and let them know the original author will let THEM republish it too.

But with my data… I’ve got a web-app fronting it, with thousands or even millions of different pages. Does every page need attribution in the footer? Or just attribution on my site’s ‘about’ page? What if I give people access to the actual data itself, via, say, SQL, or SPARQL, or an Atom feed, or a custom API? Do I have to find a way to somehow embed a comment in every response with the attribution? (Don’t forget, there could be a dozen different sources that require attribution, and they’ve all got to be here.) Is there even a way to do that with SQL results? Or is it good enough to just put the attribution on my site’s documentation page explaining how to use the API? Or included as a comment (or in the actual ‘title’, ’cause nobody’s gonna see a comment?) with every auto-discovery-advertised Atom feed?
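For a custom JSON API, one way this could look is an envelope that carries the credits along with every response. This is only a sketch under my own assumptions: the `ATTRIBUTIONS` list, the source names in it, and the functions `with_attribution` and `credit_line` are all hypothetical, and nothing here tells you whether it would actually satisfy the license.

```python
import json

# Hypothetical credits; a real list would come from wherever you track your sources.
ATTRIBUTIONS = [
    {"name": "Example Source A", "license": "CC-BY"},
    {"name": "Example Source B", "license": "CC-BY"},
]


def with_attribution(payload, attributions=ATTRIBUTIONS):
    """Wrap an API result so the credits travel with the data itself."""
    return json.dumps({"data": payload, "attribution": attributions}, sort_keys=True)


def credit_line(attributions=ATTRIBUTIONS):
    """One line of text suitable for a page footer or a feed-level note."""
    credits = ", ".join(f"{a['name']} ({a['license']})" for a in attributions)
    return f"Contains data from {credits}."
```

Notice that every client now has to know to look inside `data` for the actual results, and with a dozen sources the credits can easily outweigh a small response. And it does nothing for the SQL case.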

What did the licensor expect, anyway?

One question is what the original licensor would expect or want me to do. And I’m actually curious about this, with so many people releasing data under such licenses, it would make a good study/survey to find out what data-releasers are expecting here.

But the point of a CC license, when it’s applied to content like it’s intended, is that it’s a legally binding license, according to the legal terms in the license. The original licensor can’t change their mind later about what I’m allowed to do, which would be a mess. Nor do I actually need to seek out the original licensor to ask them: they’ve already made it known exactly what rights I have in the license, and all I’ve got to do is look at the license. That’s the goal anyway, and it works pretty well for “content”, but as we’ve seen it starts to fall apart with data. What the heck am I required to do with it, or allowed to do with it? It’s not really that clear.

Further down the stream

That gets even more so when we talk about additional down-stream use. The really exciting thing about open data is that it can continue to be re-mixed and re-used by more generations, mashing it together with other data to create more stuff. Data, unlike narrative human language, is inherently composed of a bunch of individual pieces just begging to be mixed and matched. Sure, sometimes you do that with narrative text too, and CC-BY allows me to “re-mix”. But if you’re not William Burroughs, you’re probably not taking individual sentences from a bunch of different texts and shuffling them all around. That’s exactly what you want to do with data, though, and then someone else wants to take that new database you created and do it AGAIN, mixing with other sources, and so on down the road.

Can they take my aggregated database (including several sources of data, some CC-BY from a variety of licensors), and take out individual pieces, and put it in their own new database mixed together with a bunch MORE sources?


But it’s not up to ME, right? It’s, I guess, up to the original ‘owners’ of that data (if they really own it all, see “Validity of CC for data” above). All I can really say is “A bunch of stuff in this database came from sources X, Y and Z, all of which claim you can’t use it without their permission [I don’t know if that’s so!], but say you have their permission as long as you credit them [I don’t know exactly what kinds of credit, better ask them].”

But, okay, maybe it just means that the further downstream use needs to credit all the CC-BY licensors on their About page too. That’s not too bad, although it’s going to be an increasingly LONG list of entities on that About page. So hopefully just the About page is acceptable, and not a footer on every page and record. (Hey, maybe they need to credit ME too, if they take it from my aggregated database. I don’t know if I really have any ownership or ability to control re-use of that data (which mostly came from other places), but I can act like I do and tell them it’s CC-BY and they need to credit me, just like the originators did.)
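Here’s a tiny Python sketch of how that About-page list grows, assuming each re-user simply passes along every credit they inherited and adds their own name. The function `credits_for` is hypothetical, and whether an aggregator is really entitled to be on the list is exactly the open question above.

```python
def credits_for(own_name, upstream_credit_lists):
    """Credits a dataset hands downstream: everyone its inputs credited, plus itself.

    Assumption: credits are never dropped, only de-duplicated.
    """
    combined = []
    for credits in upstream_credit_lists:
        for name in credits:
            if name not in combined:
                combined.append(name)
    if own_name not in combined:
        combined.append(own_name)
    return combined
```

Feed the result of one call in as an upstream list for the next and the list only ever gets longer, one generation of re-mixing after another.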

An enterprise that isn’t particularly legally adventurous may look at this whole situation and say, oh geez, “probably”, that’s not good enough for me, I have no idea if or what I’m allowed to do with this stuff, forget it, I’m not basing my project on it when I’m not sure if I’m allowed to do what I want to do and under what terms. Which would be a shame if the original licensors didn’t intend that outcome, if they were actually trying to let other people use their data and know they were allowed to use the data.



Seriously, just CC0/public-domain/PDDL/no-rights-reserved your data. You probably aren’t really sure you own it and have the right to impose restrictions on re-use in the first place. In fact, probably nobody’s sure, and it depends on the legal jurisdiction and the exact nature of your data and the use.

Plus, when you try, it makes it much more confusing and difficult for other people to know what their rights and obligations are in re-using it. Data’s a lot trickier like that than narrative text, because its use will so often involve treating individual data elements atomically and mixing and matching them with data from other sources, and then enhancing them yourself, etc. It’s really hard to know if you’re allowed to do what you’re doing, and to further redistribute what you’re doing (under what terms?), because of this.

If you really want the data to be re-useable, just CC0/public-domain/no-rights-claimed with it.

If you want people to credit you on their About page, just ask them, don’t try to make it a legally binding license that might or might not actually be legally binding. If they’re not assholes, and it’s feasible, they will. And if you’re just asking them, you can tell them exactly what sort of attribution you’d like them to give you, in ordinary plain language, instead of hoping they’ll guess right by reading a legal CC-BY license that doesn’t really apply to data anyway.

2 Responses to implications of CC-BY on data

  1. Adrian Pohl says:

    Hello Jonathan,

    basically I agree with you that public domain waivers are the best approach for data. But I accept that some data providers want to use an attribution license and would reply to your two objections:
    1. To circumvent the problems with CC-BY for data (I agree with you that content licenses aren’t appropriate for data) you can use the ODC-BY license, which is an attribution license for data and data sets.
    2. Regarding attribution practice it probably has to evolve for data but there are already practices in other areas we could learn from.

    The problems you describe are similar to the problem of attributing Wikipedia. It’s licensed under CC-BY-SA, but normally you don’t enumerate all contributors to an article if you republish an article or parts of it. You attribute Wikipedia, and everybody has the possibility to look at the history and find out all contributor names. This is a pragmatic approach that obviously nobody objects to, although Wikipedia contributors could.
    And as far as I know (I am not a developer) it’s similar with code: you don’t know for every single line of code who the creator is, you only know who contributed to a file (but not what and how much), and so you attribute at file level.

    I think similar practices could emerge for data. Since, as you rightly say, individual data elements aren’t copyrighted, one just has to acknowledge the use of bigger data portions. You could then add attribution to a description of your dataset, crediting the creators of the data sets you reused.

    FYI: There is a Working Group on Open Bibliographic Data within the Open Knowledge Foundation. Questions like this are addressed there, as well as approaches for aggregating metadata and making it open. In fact, this post by you has generated some discussion on the working group’s mailing list. Also, we recently published the Principles on Open Bibliographic Data.


  2. Pingback: Re-usable linked big data for real | Bibliographic Wilderness

