awesome github feature: no excuse for not having good docs anymore

I am a documentation fanatic. The way some people get all excited about how code isn’t finished until there are tests (which I agree with generally): I think code isn’t finished until there are docs, and until you’ve updated the docs to match your changes.

It’s just not usable by other developers (including future you) unless there are docs: both inline docs, including the sort that can be autogenerated into API docs with javadoc or rdoc, and often some narrative overview or architectural docs too.

Now, github is a great place to host a project. And the way github renders markdown as html makes it pretty great for viewing docs too.

I like the way github’s project landing page is a source tree and a README. That’s exactly right. I don’t want the source tree hidden, because when I’m investigating someone else’s project, or when I’m trying to figure something out about a project I’m already using, I want to be able to navigate and view the source in a web browser very quickly.

YES, the source is important documentation in itself, but it is seldom sufficient documentation on its own. Usually you at least need a README.

And then github gives you the README right below. Markdown README rendered into HTML, so easy on the eyes and on the brain. Great, perfect!

And then in some projects, the README keeps growing… and growing… and growing. And you might think, gee, why not move some of this to some other files? If a source code file shouldn’t be 1000 lines long, a README probably shouldn’t be either.

The problem was that github did not support relative hyperlinks in README and other rendered markdown/markup.  So you couldn’t link out from the README to another file in your source, in a context-specific way. Sure, you could link out to your wiki — but wiki isn’t versioned along with source code, and you want to link to the version of the docs that go along with the exact snapshot of the source repo you’re looking at. Sure, you could link out using absolute URL hyperlinks — but same problem, you’re always linking to master, when you want to link to the file in the same snapshot the reader is looking at the README in.

You might think relative links would ‘just work’. They kind of do with hyperlinks in rdoc/markdown output you generate on your local machine. But because of the way github’s infrastructure works, the relative path between a README and the rest of the source tree changes depending on whether you’re looking at the project landing page, a branch tip, a tag, or a SHA hash snapshot. So they didn’t.
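To see why, here is a minimal sketch of standard URL resolution using Python's `urllib.parse.urljoin`. The account and project names (`someuser/someproject`), the tag name, and the exact URL shapes are assumptions made up for illustration; the point is only that the same relative link resolves differently depending on the URL of the page the README is rendered on.

```python
from urllib.parse import urljoin

# Hypothetical link inside a README, pointing at a doc file in the same repo.
RELATIVE_LINK = "docs/whatever.md"

# Hypothetical page URLs on which the same README might be rendered.
# "someuser/someproject" and the "v1.0" tag are invented for this sketch.
VIEWS = {
    "landing page": "https://github.com/someuser/someproject",
    "branch tip": "https://github.com/someuser/someproject/blob/master/README.md",
    "tag": "https://github.com/someuser/someproject/blob/v1.0/README.md",
}


def resolve_all(link, bases):
    """Resolve one relative link against each base URL, as a browser would."""
    return {name: urljoin(base, link) for name, base in bases.items()}


resolved = resolve_all(RELATIVE_LINK, VIEWS)
for name, url in resolved.items():
    print(f"{name}: {url}")
```

On the landing-page URL the last path segment (the project name) is replaced by the link, so the result points outside the repository entirely, while on the branch and tag URLs it lands next to the README in that same snapshot.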

But, finally, github fixed things so relative links work. Hooray. 

So, yeah, you can split out your docs into multiple files in `./docs/whatever.md`.  And have those links work whether you’re viewing in github, or in generated rdoc on rubydocs.org, or generated rdoc on your local machine, or anywhere.

The other huge benefit this has is that you can link from a README to a source file now too. When discussing a class, you can link to the source of the class. When providing an overview of a feature that’s covered in more depth in inline rdoc/javadoc docs, you can link to it. That helps you avoid duplication of documentation between source files and doc files (DRY is just as important in docs as in source, for the same reasons), and it integrates your documentation files with your source control.

This is going to make it so much more efficient to maintain good docs for projects hosted on github, and to maintain docs that are properly snapshotted with the source snapshot they refer to, instead of wiki docs where you never can tell what version of the source they refer to (last release? master?).

Hooray. ndushay says “I am not your mother. Write your test code.” I say, I ain’t your daddy either,  write your documentation.

 

 

(My next github doc fantasy? The syntax highlighter for ruby could realize when there was embedded markdown/rdoc in top-of-class and top-of-method comments, and render them formatted.)

Posted in General | 1 Comment

Kinda like Google Books for LP’s

Just heard about this, fascinating. Amoeba Records, a for-profit indie record store (and a great one) with several locations in California, is digitizing old out-of-print LP’s… on an “opt out” basis à la Google Books.

If they haven’t made contact with the copyright holder, they put profits in escrow for the artist (or, erm, rightsholder(s)?  Music rights are complicated and way beyond my understanding).

http://www.variety.com/article/VR1118065137/

Some Vinyl Vaults artists have proven so elusive that even diligent detective work could not track them down. Henderson points to an unknown ’70s country artist known only as C.J., whose album “My Lady’s Eyes” is for sale on the site.

“We couldn’t find C.J.; we couldn’t find a label that put the record out,” Henderson says. “But it’s a compelling piece, (so) we said, ‘This should be up.’ “

Weinstein adds that if a sale is made, the money goes into an escrow account. “If (someone says), ‘That’s mine,’ well, OK, we can either take it down or we’ll sell it, and you’ve got this nice (digital) master. We’ll sell it, we’ll promote it; let’s sign a contract.”

I am generally pleased they are doing this. Old LP’s need digitizing. I am happy if it helps Amoeba stay in business. It sounds like their approach is fair…

…which doesn’t mean it’s legal. The opt-out nature they describe above is similar to the kind of thing that got Google Books and HathiTrust sued… except it’s actually less legally defensible, because they aren’t just using “snippets”; they really are selling complete digitized works. (HathiTrust won, but may not have if they had been a for-profit endeavor, let alone selling complete copies! The Google Books lawsuit is still unresolved.)

I wonder how deep their legal risk analysis went, and what that analysis was, or if they are just kind of doing it naively, or what. Either way, it doesn’t necessarily mean they’ll get sued… but the ‘performing rights organizations’ for music like BMI employ armies of people whose job it is to ‘protect music rights’ by extracting licensing fees. On the other hand, I guess this isn’t a ‘performance rights’ issue; like I said, music rights are complicated and mostly confuse me.

But I hope it works out for them, and all the artists involved.

It also saddens me that this kind of truly essential work of preserving cultural heritage…. would probably never be done by a library, for a variety of reasons including the possible legal risk, but also just plain lack of interest, lack of resources, lack of institutional motivation.


Metadata vocab re-use question

I have a technical question about metadata vocab reuse, and the best way to do something I’m doing.

I’m working on an API for returning a list of scholarly articles.

I am trying to do as much as I can with already existing technical metadata devices.

In general, I am going to do this as an Atom XML response, with some ‘third party’ XML namespaces in use too for full expression of what I want to express. Using already existing vocabularies, identified by URI.

In general, this is working fine — especially using the PRISM vocabulary for some scholarly citation-specific metadata elements. Also some things that were already part of Atom, and may be a bit of DC here or there.

I am generally happy with this approach, and plan to stick to it.

But there are a few places where I am not sure what to do. In general, there’s a common pattern where I need to express a certain ‘element’ using multiple vocabularies simultaneously (and/or no vocabulary at all, free text).

For instance, let’s take the (semantically vague, yes) concept of type/genre. I have a schema.org type URI that expresses the ‘type’. I can also express the ‘type’ using the dcterms ‘type’ vocabulary. I could theoretically have a couple more format/type vocabularies I’d like to expose, but let’s stop there as an example. And on top of this, I also have a free text ‘type’ string (which may or may not be derivable from the controlled vocabs), which I’d like to make available to API consumer too.

Any individual item may have some, all, or none of these data associated with it.

Now, the dcterms ‘type’ element is capable of holding any or all of these. http://dublincore.org/documents/dcmi-terms/#terms-type

“Recommended best practice is to use a controlled vocabulary such as the DCMI Type Vocabulary [DCMITYPE]. To describe the file format, physical medium, or dimensions of the resource, use the Format element.”

See, recommended is to use a controlled vocab such as DCMI Type Vocab, but this makes it clear you can also use the ‘type’ element for another controlled vocab, or no controlled vocab at all.

So it’s legal to simply do something like this:

<!-- schema.org: -->
<dcterms:type>http://schema.org/ScholarlyArticle</dcterms:type>
<!-- dcterms type vocab: -->
<dcterms:type>http://purl.org/dc/dcmitype/Text</dcterms:type>
<!-- free text not from a controlled vocab: -->
<dcterms:type>Scholarly Book Review</dcterms:type>

And I’ve been the consumer of APIs which do something like that: just throw a grab bag of different things into repeated dcterms:type elements, including URIs representing values from different vocabs, and free text. They figure, hey, it’s legal to use dcterms:type that way according to the docs for the dcterms vocab.

And as a consumer of services that do that… I do not want to do it. It is too difficult to work with as a consumer, when you don’t know what the contents of a dcterms:type element might be, from any vocab, or none at all. It kind of ruins the utility of the controlled vocabs in the first place, or requires unreasonably complex logic on the client side.
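As a rough illustration of the client-side burden, here is a sketch of what a consumer has to do with the grab-bag response above. The bare `<entry>` wrapper and the table of known URI prefixes are assumptions for this example, not part of any real response format; the guessing by prefix is exactly the fragile part.

```python
import xml.etree.ElementTree as ET

DCTERMS_NS = "http://purl.org/dc/terms/"

# Hypothetical response fragment; the un-namespaced <entry> wrapper is
# made up so that the snippet parses on its own.
SAMPLE = """<entry xmlns:dcterms="http://purl.org/dc/terms/">
  <dcterms:type>http://schema.org/ScholarlyArticle</dcterms:type>
  <dcterms:type>http://purl.org/dc/dcmitype/Text</dcterms:type>
  <dcterms:type>Scholarly Book Review</dcterms:type>
</entry>"""

# The client has to hard-code guesses about which URI prefix means which vocab.
KNOWN_PREFIXES = {
    "http://schema.org/": "schema.org",
    "http://purl.org/dc/dcmitype/": "dcmitype",
}


def classify_types(xml_text):
    """Sort repeated dcterms:type values into buckets by sniffing their text."""
    root = ET.fromstring(xml_text)
    buckets = {}
    for el in root.findall(f"{{{DCTERMS_NS}}}type"):
        value = (el.text or "").strip()
        for prefix, vocab in KNOWN_PREFIXES.items():
            if value.startswith(prefix):
                buckets.setdefault(vocab, []).append(value)
                break
        else:
            # Anything unrecognized is assumed to be free text, which is
            # wrong as soon as the server adds a vocabulary we don't know.
            buckets.setdefault("free-text", []).append(value)
    return buckets


buckets = classify_types(SAMPLE)
print(buckets)
```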

So. Another idea that occurs is just to add some custom attributes to the dcterms:type element.

<dcterms:type vocab="schema.org">http://schema.org/ScholarlyArticle</dcterms:type> 
<dcterms:type vocab="dcterms">http://purl.org/dc/dcmitype/Text</dcterms:type>
<dcterms:type>Scholarly Book Review</dcterms:type>

Now at least the client can a lot more easily write logic for “Is there a dcterms value? If so, what is it?”
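A minimal sketch of that client logic, assuming the un-namespaced `vocab` attribute shown above and, again, a made-up `<entry>` wrapper. An element with no `vocab` attribute is treated as free text.

```python
import xml.etree.ElementTree as ET

DCTERMS_TYPE = "{http://purl.org/dc/terms/}type"

# Hypothetical response fragment using the plain 'vocab' attribute.
SAMPLE = """<entry xmlns:dcterms="http://purl.org/dc/terms/">
  <dcterms:type vocab="schema.org">http://schema.org/ScholarlyArticle</dcterms:type>
  <dcterms:type vocab="dcterms">http://purl.org/dc/dcmitype/Text</dcterms:type>
  <dcterms:type>Scholarly Book Review</dcterms:type>
</entry>"""


def types_by_vocab(xml_text):
    """Group dcterms:type values by their 'vocab' attribute (None = free text)."""
    root = ET.fromstring(xml_text)
    found = {}
    for el in root.iter(DCTERMS_TYPE):
        vocab = el.get("vocab")  # None when the attribute is absent
        found.setdefault(vocab, []).append((el.text or "").strip())
    return found


types = types_by_vocab(SAMPLE)

# "Is there a dcterms value? If so, what is it?"
print(types.get("dcterms"))
# Free-text values, if any
print(types.get(None))
```

No prefix sniffing is needed here: the client asks for the vocabulary it understands and ignores the rest.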

But I can’t really tell if this is legal or not — attributes are handled kind of inconsistently by various XML validators. Maybe I’d need to namespace the attribute with a custom namespace too:

... xmlns:mine="http://example.org/vocab" ...

<dcterms:type mine:vocab="schema.org">http://schema.org/ScholarlyArticle</dcterms:type>

But namespaces on attributes are handled very inconsistently and buggily by various standard XML parsing libraries I’ve used, so I don’t really want to do that, it’s going to make things too hard on the client to use namespaced attributes.
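To give one concrete flavor of the awkwardness, here is how a namespaced attribute looks to a client using Python's standard `xml.etree.ElementTree`. This is only one library, shown as an example; the `<entry>` wrapper is again invented, and `http://example.org/vocab` is the placeholder namespace from above.

```python
import xml.etree.ElementTree as ET

MINE_NS = "http://example.org/vocab"  # placeholder namespace from the post

# Hypothetical response fragment with a namespaced attribute.
SAMPLE = """<entry xmlns:dcterms="http://purl.org/dc/terms/"
       xmlns:mine="http://example.org/vocab">
  <dcterms:type mine:vocab="schema.org">http://schema.org/ScholarlyArticle</dcterms:type>
</entry>"""

root = ET.fromstring(SAMPLE)
type_el = root.find("{http://purl.org/dc/terms/}type")

# The lookups a client author is likely to try first both come back empty:
plain = type_el.get("vocab")
prefixed = type_el.get("mine:vocab")

# ElementTree stores the attribute under its fully qualified "{uri}name" key.
qualified = type_el.get(f"{{{MINE_NS}}}vocab")

print(plain, prefixed, qualified)
```

Only the fully qualified lookup finds the value, so every client has to know and spell out the namespace URI just to read one attribute.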

But I kind of like the elegance of that ‘add attributes to dcterms:type’ approach. I suppose you could even use full URIs instead of random terms to identify the vocab, to be tidier still:

<dcterms:type vocab="http://schema.org">http://schema.org/ScholarlyArticle</dcterms:type> 
<dcterms:type vocab="http://dublincore.org/documents/dcmi-terms">http://purl.org/dc/dcmitype/Text</dcterms:type>

Another option, especially if that isn’t legal, is to give up dcterms entirely and use only my own custom namespace/vocab for ‘type’ elements:

<mine:schema-type>http://schema.org/ScholarlyArticle</mine:schema-type>
<mine:dcterms-type>http://purl.org/dc/dcmitype/Text</mine:dcterms-type>
<mine:uncontrolled-type>Scholarly Book Review</mine:uncontrolled-type>

Which is kind of ‘inelegant’, but would probably work fine too. Realistically, any consumer of my response is going to be custom written for my response, it can be written to deal with mine:schema-type just as well as dcterms:type with attribute vocab=something. Standardizing here isn’t really necessary at all for primary use cases, although there are a variety of ancillary hypothetical benefits.
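For what it’s worth, the server side of the custom-element option is simple too. Here is a sketch, assuming a hypothetical item represented as a dict with made-up keys (`schema_type`, `dcterms_type`, `uncontrolled_type`) and the placeholder namespace `http://example.org/vocab`; only the values an item actually has are emitted.

```python
import xml.etree.ElementTree as ET

MINE_NS = "http://example.org/vocab"  # placeholder namespace, not a real vocab
ET.register_namespace("mine", MINE_NS)

# Hypothetical mapping from internal field names to custom element names.
FIELD_TO_ELEMENT = {
    "schema_type": "schema-type",
    "dcterms_type": "dcterms-type",
    "uncontrolled_type": "uncontrolled-type",
}


def build_entry(item):
    """Emit one custom element per type value the item actually has."""
    entry = ET.Element("entry")  # made-up wrapper element
    for field, local_name in FIELD_TO_ELEMENT.items():
        value = item.get(field)
        if value:
            child = ET.SubElement(entry, f"{{{MINE_NS}}}{local_name}")
            child.text = value
    return entry


# An item with a schema.org type and a free-text type, but no dcterms type.
entry = build_entry({
    "schema_type": "http://schema.org/ScholarlyArticle",
    "uncontrolled_type": "Scholarly Book Review",
})
print(ET.tostring(entry, encoding="unicode"))
```

A client can then look up `mine:schema-type` directly and treat a missing element as “no value from that vocabulary”.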

Or maybe there’s some other solution entirely I’m not thinking of.

So, any feedback? What solution makes sense, balancing standards, clarity, parsimony, ease of development, ease of client development, etc.?

The ‘type’ example is a good example, but this comes up in some other places too. Another example is for ‘language’, I may have either or both of an ISO code (two letter or three letter variety, such as “en” or “eng”), and an English-language free text description of the language “English”, and want to provide either one or both, unambiguously and easy to consume for the client.


Librarians: When do we stand up for freedom of inquiry?

Libraries often hold and circulate controversial materials, even  materials that one could argue are despicable or dangerous.

Here are some examples from worldcat, a database of combined holdings of thousands of US and international libraries.

That list probably wasn’t actually particularly surprising or shocking to any readers. We know and expect libraries to make controversial material available, and even material advocating positions some or all of us believe are horrid or even dangerous.

Why?  To begin with, because even people who vehemently disagree with a view, might have research needs that require them to consult material expressing that view.   Pretending something doesn’t exist doesn’t make it go away, and we often need to research and understand things we might wish didn’t exist.

But, more fundamentally, because we believe… well, here is a quote from an American Library Association resolution:

WHEREAS, the freedom of thought is the most basic of all freedoms and is inextricably linked to freedom of inquiry….

…WHEREAS, ALA reiterates its opposition to any proposal or actions by government that suppresses the free and open exchange of knowledge and information or that intimidates individuals exercising free inquiry;

Libraries are constantly under attack by their funders and by the public for the controversial material they distribute, and we even take pride in it: The ALA uses “banned books week” as a marketing campaign, advertising the fact that we circulate books people have tried to suppress.

The ALA has an official “Freedom to Read” statement:

The freedom to read is essential to our democracy. It is continuously under attack. Private groups and public authorities in various parts of the country are working to remove or limit access to reading materials, to censor content in schools, to label “controversial” views, to distribute lists of “objectionable” books or authors, and to purge libraries. These actions apparently rise from a view that our national tradition of free expression is no longer valid; that censorship and suppression are needed to counter threats to safety or national security, as well as to avoid the subversion of politics and the corruption of morals. We, as individuals devoted to reading and as librarians and publishers responsible for disseminating ideas, wish to assert the public interest in the preservation of the freedom to read….

We protect the privacy of our patrons because we think “freedom of inquiry can be preserved only in a society in which privacy rights are rigorously protected”; and try to resist “these pressures toward conformity [which] present the danger of limiting the range and variety of inquiry and expression on which our democracy and our culture depend.”  Honored members of our profession have even gone to jail rather than be compelled to testify in court in a way that they thought threatened freedom of thought.

There are all sorts of threats to freedom of inquiry that librarians know about, but most of us would never think that, in the U.S., in the 21st century, there is any threat of actually being imprisoned for distributing controversial or subversive literature.  That doesn’t happen in America, not anymore, we think. 

Tarek Mehanna: Imprisoned for distributing subversive material

I thought so too, which is why I was so shocked to learn of the case of Tarek Mehanna.

Let’s get this out of the way: What Mehanna was convicted of was “providing material support to terrorists” under the PATRIOT act.

But what did he actually do?  As the ACLU of Massachusetts wrote in their brief supporting Mehanna’s case (which the court refused to accept into the record):

In each count of the indictment, the government has alleged acts that are protected under the First Amendment. Those acts include: the Defendant watched “jihadi videos” with friends; lent compact discs to people in the Boston area to “create like-minded youth”; discussed with friends his views of suicide bombings, the killing of civilians, and dying on the battlefield for Allah; translated texts that were freely available on the internet; looked for information online about the nineteen 9/11 hijackers; and inquired into how to transfer files from one computer to another and to keep translated files anonymous.

In the FBI’s press release after Mehanna’s sentencing, they express pride in helping to investigate a man who presented such a danger:

…The co-conspirators attempted to radicalize others and inspire each other by, among other things, watching and distributing jihadi videos….

…Mehanna continued his efforts to provide material support by, among other things, translating and posting on the Internet al Qaeda recruitment videos and other documents….

These dangerous activities the FBI highlights in their press release (because they were at the center of Mehanna’s conviction): Translating and distributing controversial, subversive, despicable, even dangerous material? That’s a very library-like activity, isn’t it? And apparently it’s one that the FBI will investigate you for, even brag about investigating, an activity which can even put you in prison.

Okay, I know some think: maybe there are some first amendment issues here, but this is still a dangerous man, a member of Al Qaeda, right?

But Mehanna is not a member of, or working at the direction of, Al Qaeda or any other terrorist organization. Yes, Mehanna has some very radical political viewpoints. It may surprise you, however, that he has a history of arguing on the internet against the idea that Muslims are religiously allowed to attack civilians.

If you’re curious about Mehanna’s character, beliefs, or personality, the best thing to do is read his statement at his sentencing hearing. Really, go read it, it’s worth reading.

“Tarek translated a variety of Islamic texts out of a scholarly desire to make the texts available to English speakers, to expose them to other viewpoints”, according to his support website.   The reason or ideological motivation someone has for distributing controversial literature ought not to matter, of course:  The first amendment is supposed to protect even those people whose beliefs we find abhorrent. But it’s still worth pointing out that Mehanna’s beliefs aren’t even what you’d probably think they are, from the propaganda against him.  Read his personal statement, see what he has to say for himself.

People still wonder, okay, why would the prosecutor prosecute this guy, if he really wasn’t cooperating with Al Qaeda, if he actually believed attacks on civilians were immoral?  I don’t know.  But he was prosecuted by a US Attorney’s office well-known for its aggressive and vindictive prosecutions, using its prosecutorial discretion against people who have crossed it: the office of Carmen Ortiz, the same office that prosecuted Aaron Swartz.  Mehanna was approached and asked to become an FBI informant, and he refused; maybe that made them mad; that’s what Mehanna thinks.

He is in prison for 17 years.  It is a mark of having gotten used to living in the most incarcerating nation on the planet that a 17-year sentence may not seem all that long. But think about how old you’ll be in 17 years, then think about being away from your family, your career, your personal projects for 17 years, locked up in prison. It’s an awfully long time. To be locked up for distributing controversial literature.

Some are still suspicious: could this really be what’s happened, isn’t this America? Read up on it yourself. Read Andrew March’s Op-Ed in the New York Times: “one of the most important free speech cases we have seen since Brandenburg v. Ohio in 1969.”  Read the ACLU of Massachusetts’ legal briefs, or ACLU Staff Attorney Alex Abdo’s guest blog piece on boston.com. Read Adam Serwer’s article in Mother Jones, Glenn Greenwald on Salon.com, and an article in the Boston area’s MetroWest Daily News.   Read that even one of the jurors in Mehanna’s case is not comfortable with the outcome.

As a librarian, I support Tarek Mehanna

The librarians’ professional code of ethics says:

In a political system grounded in an informed citizenry, we are members of a profession explicitly committed to intellectual freedom and the freedom of access to information. We have a special obligation to ensure the free flow of information and ideas to present and future generations.

A special obligation.

Tarek Mehanna should not be in prison, and we should do what we can to get him out.

But even more importantly, we have a professional responsibility to society as a whole, to defend freedom of expression and inquiry, and to speak up when we see it threatened. I see our society heading to very scary places.  Increasingly ubiquitous government surveillance without a warrant;  unpredictably harsh punishments for relatively benign crimes at “prosecutor’s discretion”;  even things that sound like dystopian science fiction that wouldn’t happen here, like secret laws authorizing assassination of American citizens.

“We have a special obligation to ensure the free flow of information and ideas to present and future generations.”

Libraries and librarians have a reputation for standing up for freedom of expression and inquiry. A reputation that many of us are justifiably proud of.  And America needs us to do just that right now.

We only deserve that reputation if we are willing to stand up and speak out for freedom of inquiry even when it’s bitterly controversial, when it’s still not too late and things are really on the line, even when it’s not obvious to everyone that we’re right, even at some personal or professional risk  – those are the times when it actually matters, right?

Tarek Mehanna was prosecuted for acts that ought to be protected as freedom of expression and inquiry, including translating and distributing controversial material, an activity at the core of libraries’ missions and activities. 

If you, like me, feel called by your professional ethics to support Tarek Mehanna, check out www.freetarek.com. Write a letter to Tarek, tell him you’re a librarian (anyone in prison likes getting letters from strangers, for real).  I’m sure he could use a donation to his legal campaign (the appellate case is in process), although the paypal suggested on freetarek.com is “currently unable to accept funds”.

But maybe more importantly, talk to people about Tarek Mehanna and the threat to freedom of inquiry.  Talk to other librarians, and everyone else. Tell people that, as a librarian, you’re deeply concerned about this case.   “Like” the Free Tarek facebook page. If anyone has any interest or ideas in some organized activity as librarians to support Tarek Mehanna, please get in touch.


Is China that scary?

The nytimes published another article about Aaron Swartz: How M.I.T. Ensnared a Hacker, Bucking a Freewheeling Culture

This article is generally good, putting Swartz’s behavior in the right context, especially the “hacker” context of MIT.

But it included a couple of, to my reading, very odd references to “China”:

According to the timeline, the tech team detected brief activity from China on the netbook — something that occurs all the time but still represents potential trouble.

Well, yeah, I mean, almost anything unusual “represents potential trouble”, emphasis on the ‘potential’ — that sentence means almost nothing.

(Although, amusingly, they also admit that there’s actually nothing unusual or unexpected about network activity from China, which actually happens “all the time” they say. But anyhow. Yes, something you don’t understand could “potentially” represent unspecified “trouble” of some sort; that is certainly vague enough to be undeniably so.)

But by throwing in an essentially non sequitur reference to China, they’re, of course, counting on it being read as “Well, of course, if there’s activity from China, then we need to treat it as a terribly serious security threat. I mean, China, people!”

Michael Sussmann, a Washington lawyer and a former federal prosecutor of computer crime, said that M.I.T. was the victim and that, without more information, it had to assume any hackers were “the Chinese, even though it’s a 16-year-old with acne.”

Wait, what? What if it’s a 16-year-old with acne who speaks Chinese and lives in China? What the heck are they talking about, why is “assume the hackers are Chinese” so scary?

In fact, none of us would be surprised if there were efforts to scrape scholarly articles from JSTOR originating in China.

Our anecdotal observations are that indeed efforts to pirate scholarly articles often originate in China, Russia, and other parts of the developing world. The most obvious and sensible explanation for this would be that these are places with a lot of scholars and researchers who lack the money to pay for the scholarly content they need for their scholarship and research. I’m not sure why this is any more scary or dangerous than someone pirating scholarly content in Des Moines though, it’s hardly a national security issue that someone might read academic research without paying for it in China.

I don’t think I had realized the level of “Chinese hacker hysteria” we’ve achieved, where apparently all you need to do is drop in a non sequitur “But it might have been from China”, and you’ve successfully messaged “And therefore a national security issue, because the Chinese are scary and out to get us.”  The more things change, the more they stay the same.


NISO discovery system final(?) report

There’s a NISO Open Discovery Initiative, “formed to develop a recommended practice related to the index-based discovery services for libraries. The ODI aims to investigate and improve the ecosystem surrounding these discovery services, with a goal of broader participation of content providers and increased transparency to libraries.”

They did a survey of libraries using ‘discovery’ products (by which I think they mean aggregated index products, which include both scholarly citation and local holdings metadata).

They have a “final report”…. or at least a report in a PDF whose URL includes “report final”, but then the email announcing the report also says “please note that the full set of recommendations from the ODI Working Group is still underway and will be available in draft form for comment a few months from now” …so I guess it’s a preliminary report, not a final report?

Here’s a direct URL for the actual PDF, instead of the URL included in the announcement email which still leaves you 2-3 confusing clicks away from actually reading the report:

http://www.niso.org/apps/group_public/download.php/9977/NISO%20ODI%20Survey%20Report%20Final.pdf

I haven’t read it yet; if anyone else has, please share your thoughts.

I will admit that my interest in fitting the report into my reading schedule waned when they disclosed that it wasn’t “final” at all, but that in a few months we’d get a “draft form”, and then an actual final one sometime after that? NISO committee, your target audience is busy and has limited and divided attention; if you want maximum impact, I’d suggest minimizing the number of email blasts you send, and announcing the single actual report that you most want people to read.


My article search API comparison published in Code4Lib Journal

My article comparing possible article search services with APIs is now published in the latest Code4Lib Journal.

A Comparison of Article Search APIs via Blinded Experiment and Developer Review

http://journal.code4lib.org/articles/7738

This study looks at perceived user preference between products that can provide a scholarly article search service via an application programming interface (API). The study set up a blinded review and asked users at Johns Hopkins to select the service that provided the most useful results. Few statistically significant preferences were detected, and some interpretation is provided of what the results might tell us. The specific products evaluated for this study are: Serials Solutions Summon, Ex Libris Primo, EBSCO EDS, EBSCOHost ‘traditional’ API, and Elsevier Scopus. Re-usable open source tools for implementing article search were created to support the study and future development, and a developer review of the APIs is included based on the developer’s experience in this implementation.


EdX course starting soon: Computational thinking for non-programming librarians?

I’ve written before about the idea of ‘computational thinking’ and its relevance to librarians: while I don’t think all librarians need to be programmers or software engineers, I think pretty much all librarians do need to have some basic grasp of the nature of computational problem-solving.

Thanks to colleague Christie Peterson for drawing my attention to this free course on MIT’s EdX.

Introduction to Computer Science and Programming

6.00x is an introduction to using computation to solve real problems. The course is aimed at students with little or no prior programming experience who have a desire (or at least a need) to understand computational approaches to problem solving. Some of the people taking the course will use it as a stepping stone to more advanced computer science courses, but for many, it will be their first and last computer science course.

Since the course will be the only formal computer science course many of the students take, we have chosen to focus on breadth rather than depth. The goal is to provide students with a brief introduction to many topics so they will have an idea of what is possible when they need to think about how to use computation to accomplish some goal later in their career. That said, it is not a “computation appreciation” course. It is a challenging and rigorous course in which the students spend a lot of time and effort learning to bend the computer to their will.

Sounds exactly right, don’t it? Except that maybe you’d prefer a little couple-week module, rather than the “challenging and rigorous course” it promises.

I’d encourage non-programming librarians interested in expanding their understanding of “computational approaches to problem-solving” to consider taking it. That understanding is something I personally would consider a core competency of just about any professional library worker in the contemporary era (and certainly of anyone who manages metadata).


Library values and the growing scholarly digital divide: In memoriam Aaron Swartz

Why did you decide to become a librarian or work in libraries? For me, like many of us, working in a library wasn’t just an arbitrary job to pay the bills, we have a special affinity for the mission and values of libraries. A mission and values which focus on connecting people to the research and information they need to make informed decisions and actions, through a democratic and egalitarian approach that serves all in need, rather than focusing on maximizing profit that can be extracted from our customers.

In fact, libraries are just about the only ‘information institutions’ whose business interests are centered on aiding our users, not in commodifying our users as demographic data, ‘eyeballs’, or paying customers.

Even the university library has historically been a center of knowledge distribution to the public at large, not just the university community. Especially (but not only) at public universities, which saw dissemination of knowledge to the citizenry as part of their mission.

Consider the US resident of 30 years ago who wanted access to a scholarly article. She could walk into a university library (at almost all public and many private universities), take a bound journal off the shelf, browse, locate, and read articles of interest, and even (in the post-photocopy world) photocopy an article for personal research uses. This usage pattern was in fact the same one that those affiliated with that university would engage in, the university library served as the hub of scholarly knowledge for the affiliated and non-affiliated alike — at least in the US and the first world.

The digital revolution has changed this. The access to scholarly knowledge we provide is through licensed electronic copies, mostly available only to our affiliates. Those affiliated with paying institutional customers can now access scholarly articles from the comfort of their own homes — but those not affiliated are basically out of luck. Even if the university library has a printed copy of the article in which a non-affiliate might be interested (increasingly unlikely); even if the library provides public entry and public workstations at which the non-affiliate can view licensed electronic articles (if they can wait on line to get on one of the limited public workstations) — their second-class status and access level would be apparent to them: One method of convenient access for affiliates, another method of very inconvenient and frequently impossible access for everyone else.

This change isn’t mainly due to decisions made by libraries, and in fact certainly the economic and cultural changes of the digital revolution are making things very difficult for libraries. In one salient example, libraries largely can’t purchase ebooks for lending even to their own affiliates.

But the fact is, the digital revolution, which would seem to provide the technology to make access to the world’s information ever more widespread, more efficiently and affordably than ever — is instead widening the access gap between the information haves and have-nots. University libraries, which used to serve as the runway for public access to scholarly output — now, in many people’s minds, serve as symbols of their host universities as gated communities with high walls keeping out the information have-nots. (And don’t get me wrong, even our own affiliated patrons aren’t exactly happy with ease of access issues either).

What are libraries and librarians doing about this? What are we saying about it? What could or should we be doing and saying?

* . * . * . * . *

Aaron Swartz was a ‘child prodigy’ of the development of the internet as we know it. Among other things, he helped write the RSS 1.0 standard while still a teenager.

And Swartz shared the presumed values of libraries — the widespread and democratic dissemination of human knowledge, the minimization of inequality in information access.

He was involved in the development of Creative Commons, and in the Internet Archive’s Open Library project. (He wrote: “Our goal is to build the world’s greatest library, then put it up on the Internet free for all to use and edit.”)

And in 2009, he targeted PACER, a government-run system which charged significant fees for access to court records. Swartz bulk downloaded these public documents (which are not under copyright), and contributed them to a non-fee-charging public repository, Carl Malamud’s public.resource.org. While PACER still has a fee-based structure for downloads, fees are waived for a basic level of use, and there’s a Firefox browser extension that lets users forward their free downloads to an Internet Archive collection.

The FBI investigated Swartz for his PACER bulk downloads, but never pressed charges (perhaps because no laws were broken). Swartz’s actions helped bring attention to the restrictions on public access to public documents, and helped contribute a bit to widening access, by both the publicity it brought to the issue, and by his direct action to download and redistribute the documents.

More recently, Swartz was apparently offended by the growing gap in access to scholarly output, and perhaps thinking by analogy to his work with PACER, he set up a system to bulk download as much of the well-regarded non-profit JSTOR aggregator’s content as he could get. Using methods that sound like something out of a techno-thriller, he allegedly set up a rogue server in the basement of an MIT building, which, by virtue of being in MIT’s IP address range, had access to JSTOR and proceeded to scrape as many documents as he could.

Of course, the legal and practical situation of this JStor endeavor was quite different than PACER.

He was caught. He was arrested in January 2011.

JSTOR said they did not want to press charges [2]; however, MIT has made no such statement (we can’t know for sure what either of these organizations was saying to the prosecutor behind the scenes).

The government pressed charges anyway. Multiple felony charges adding up to the possibility of 50+ years in prison and $4 million in fines.

For copying documents that any academic institution (including Harvard, where Swartz held a fellowship! [3]) already licenses ‘all you can eat’ access to from JSTOR.

I heard about this situation and was outraged, but then forgot about it. I figured eventually there’d be a big campaign in defense of Swartz, and I’d donate some money and sign some petitions and reblog the calls for support when it happened. Such an organized campaign never materialized; I’m not sure why. The well-known lawyer Lawrence Lessig, a friend of Swartz’s, suggests that Swartz was “unable to appeal openly to us for the financial help he needed to fund his defense, at least without risking the ire of a district court judge.”

On January 11, 2013, Swartz killed himself.

Swartz’s family writes:

Aaron’s death is not simply a personal tragedy. It is the product of a criminal justice system rife with intimidation and prosecutorial overreach. Decisions made by officials in the Massachusetts U.S. Attorney’s office and at MIT contributed to his death. The US Attorney’s office pursued an exceptionally harsh array of charges, carrying potentially over 30 years in prison, to punish an alleged crime that had no victims. Meanwhile, unlike JSTOR, MIT refused to stand up for Aaron and its own community’s most cherished principles.

Attorney Lessig writes, in his blog post entitled Prosecutor as Bully:

…the question this government needs to answer is why it was so necessary that Aaron Swartz be labeled a “felon.” For in the 18 months of negotiations, that was what he was not willing to accept, and so that was the reason he was facing a million dollar trial in April… And so as wrong and misguided and fucking sad as this is, I get how the prospect of this fight, defenseless, made it make sense to this brilliant but troubled boy to end it.

* . * . * . * . *

Swartz’s alleged actions may or may not have violated criminal law[1]; the ethics of Swartz’s actions are very debatable (legal or not,  intentional direct action in a principled violation of the law can still be ethical; but there are certainly arguments that in this case his actions were not); but  in any event his actions, certainly in retrospect, seem not to be at all strategic or wise.

But what his actions were not is the kind of “Ocean’s 11” larceny and/or terroristic attack that the government tried to paint them as. And there’s very little room for dispute there.

Lessig writes:

From the beginning, the government worked as hard as it could to characterize what Aaron did in the most extreme and absurd way. The “property” Aaron had “stolen,” we were told, was worth “millions of dollars” — with the hint, and then the suggestion, that his aim must have been to profit from his crime. But anyone who says that there is money to be made in a stash of ACADEMIC ARTICLES is either an idiot or a liar. It was clear what this was not, yet our government continued to push as if it had caught the 9/11 terrorists red-handed.

Librarians and libraries know how the market for scholarly publications works, and know well that suggesting that Swartz “stole property” worth “millions” is ridiculous.

  • There’s little market for selling such a document collection without legal authorization — the institutional customers willing to pay the kinds of prices one pays for a JStor license (largely set by publishers not JStor, it is true) aren’t going to stop paying JStor (and other aggregators and publishers) in favor of illicitly pirated content.
  • And the third world markets which can’t afford to license content from JSTOR and other aggregators? The dirty secret libraries know is that this content is already successfully (but quietly) being pirated on a regular basis. Every university library, and most hosting content platforms, can find evidence of such unauthorized acquisition, by drib and drab as well as in bulk, if they care to look for it. (See, for example, Heather Tones White, “Electronic Resources Security: A look at Unauthorized Users” in the Code4Lib Journal.) We know that people regularly take advantage of our networks to ‘pirate’ scholarly articles for overseas markets, and the nature of our infrastructures and capabilities gives us little means of preventing this.
  • And we know that JSTOR licensees (nearly every university) typically have unlimited and non-metered access to licensed JSTOR collections, which puts the portrayal of bulk downloading as a massive million-dollar ‘theft’ in the farcical context it deserves, even if this kind of bulk downloading was prohibited by JSTOR’s terms of service.

Librarians and libraries have professional knowledge that portraying Swartz’s activity as a million-dollar-plus profit-motivated larceny, and prosecuting it as such, is ridiculous. And librarians and libraries know that the inequity in access to scholarly content that offended Swartz is a real problem. However misguided his approach to addressing the issue, Swartz was on our side. Or at least, we should have been on Swartz’s side, writing the prosecutor and court with our professional expertise that this was not the sort of crime it was being portrayed as.

Libraries and librarians should have stepped up to defend Swartz publicly. But largely they didn’t. Both Lessig and Swartz’s family hold MIT accountable for refraining, unlike JSTOR, from publicly stating that it did not want charges filed against Swartz.

Were the librarians of the MIT library arguing in defense of Swartz behind the scenes? I have no way of knowing.

But I know that few other libraries or librarians were standing with Swartz, and we all should have been, and we largely did not, and it’s a shame. [3.5]

* . * . * . * . *

Lessig writes:

For remember, we live in a world where the architects of the financial crisis regularly dine at the White House — and where even those brought to “justice” never even have to admit any wrongdoing, let alone be labeled “felons”.

We live in a society and with a system of laws that prioritizes, above all other concerns or values, protecting the ability of private businesses to make a dollar off the public. In this case, for publishers to profit off of ideas and words they claim are their property, in a world where there’s virtually no commons left and everything is someone’s property. A system that prioritizes private profit above any public value in equitable access to information, and prioritizes an apparent desire to send some kind of perverse lesson about the value of privatized profit above justice, above proportionality, and above individuals’ lives.

The priorities and values of those in power are broken, and not just when it comes to intellectual property — but the area of intellectual property is where libraries operate.  And it’s in this area in which libraries — both public and academic — have a history of speaking out for and acting to create equitably distributed access to research and information.

I am honestly not sure libraries are going to exist anymore in a couple decades. The information ocean in which libraries swim has been changing drastically, I think we had a limited amount of time to learn how to swim in this new ocean, and I think our time may have run out without us rising to the (very real and difficult) challenge — we have not succeeded in making our place in the digital environment, and it may already be too late for us to catch up before we are reduced and eliminated by our hosting and funding organizations.

It’s not that libraries aren’t needed anymore, they still are. In fact, we may be needed more than ever;  in a society and economy where information is more important than ever, libraries — public and academic both — are, I will say again, the only institution specializing in information whose interests, business plans, values, and missions are in expanding access to information without a profit motive of our own, who can act institutionally with interests fully aligned with those of our users and the public at large — if we remember our historical values and accept our responsibility to do so.

Libraries are the institutions with the ability and responsibility to sound the alarm that the digital divide in access to scholarly output is growing, not shrinking.

But are we doing so? We can indeed find university librarians and libraries talking about how increasing scholarly publishing prices are imperiling our ability to provide access to our own users, who — as scholars — do after all generate the ‘content’ in the first place. But how often do we talk about those left outside our walled gardens entirely, how access to human knowledge has been sequestered behind paywalls, financially inaccessible to the “law-abiding” public at large, especially in the developing world?

As libraries, do we have a unique role, responsibility, and power here? If so, what would it look like for libraries to take a stand?

There aren’t obvious answers: What power do we have, when we’re economic hostages to the publishers too, and are struggling to be perceived as relevant just to our direct constituencies? And, anyhow, what chance is there that our administrators will even share these values or find them a priority, in an increasingly privatized, de-funded, neo-liberal environment where libraries are increasingly expected to ‘entrepreneurially’ turn a dime off their users somehow too?[4]

I don’t know, but I know it starts with speaking up.  We ought to be willing to  take at least a fraction of the risks — organizational, professional, and personal —  that Swartz did in acting for equitable access to scholarly output (if hopefully in more strategic and successful ways) — and if we’re not, we ought to at least be speaking out in defense of those who are, like Swartz was.[5]

It might not save libraries, but it’ll help keep libraries worth saving.

Can’t we at least go out fighting?


[1] Lawrence Lessig, who it should be said was both a friend of Swartz’s and a knowledgeable attorney, wrote:

Even if the facts the government alleges are true, I am not sure they constitute a crime. There is considerable uncertainty in this area of the law. Many wonder about the quick conversion of terms-of-service into criminal prosecution. But that’s a question the courts will ultimately have to resolve.

http://mediafreedom.org/2011/07/larry-lessig-responds-says-swartzs-alleged-actions-crossed-ethical-line/

[2] While JSTOR did not want charges pressed, they still published a statement (taken off the net sometime between two days ago when I first found it and today; thanks Internet Archive for preserving a copy) misleadingly portraying Swartz’s actions as a kind of theft, which they simply were not. For example, they confusingly claimed that they considered the situation resolved because they had “secured from Mr. Swartz the content that was taken”. That is a nonsensical claim when talking about digital documents, which can have unlimited copies made of them at no expense, but one which helps build a narrative treating the actions as if they were a theft of physical property, which is unavailable to its rightful owner until returned.

[3] Why did Swartz use MIT’s network instead of Harvard’s, where Swartz held a fellowship? Perhaps because MIT’s network lacks the basic security protections almost any other university network would have against the kind of approach he used. See http://unhandled.com/2013/01/12/the-truth-about-aaron-swartzs-crime/ . Perhaps because of MIT’s ‘hacker’ culture, in which such clever security intrusions were often considered entertaining pranks.

[3.5] There are certainly some exceptions to the lack of attention to Swartz’s case, such as Nancy Sims’s excellent piece in College and Research Libraries News. And it should be said that JSTOR, perhaps ironically, is making a bit more effort at increasing access to scholarly content than most of their peers in scholarly publishing and aggregation, especially sadly and ironically including an initiative, obviously already in development, announced just this week. (Was it in development before Swartz’s 2011 bulk download, or did he help shame them into it? I do not know.) Whether or not JSTOR, a non-profit aggregator which itself must license its content from the publishers, is doing enough to increase accessibility of scholarly output (I think few if any of the institutions involved in academic publishing and dissemination are), they are hardly an example of the worst, the most greedy, or the most culpable. (Anyone in the industry could make suggestions as to who would be at the top of that list, and we’d probably come up with much the same list.)

[4] Where libraries are unique, as I’ve said several times in this essay, is in our business model of acting on behalf of our users, instead of trying to make money off of them. If our funders try to turn us into just another business with a profit motive, we’ll be just like everyone else but not as good at it as them, and will surely sign the papers on our own dissolution.

[5] From a Journal of Higher Education article:

“What Aaron Swartz did was a clear violation of the rules and protocols of the library and the community,” says Christopher Capozzola, an associate professor of history and acting associate dean of the school of humanities, arts, and social sciences. “But the penalties in this case, and the sources of those penalties, are really remarkable. These penalties really go against MIT’s culture of breaking down barriers.”

And John H. Summers, a “historian and former Harvard lecturer”, quoted in the same article:

“What Aaron’s case begs us to remember is that universities are supposed to be public, not-for-profit institutions,” Mr. Summers says. “They owe a standing moral debt to the public.”

I suggest that in issues of access to research materials, university libraries’ and librarians’ professional role and responsibility is to act as the conscience of the university, reminding their hosting institutions of that moral debt and responsibility, and of our institutional academic cultures of breaking down barriers.


Really, folks, update your Rails apps

The recently announced Rails vulnerability is a bad one: it probably gives an exploiter the ability to execute arbitrary Ruby code on your server, which basically means executing arbitrary commands on the shell. Which often can be escalated to all sorts of other stuff.

And this looks to be the sort of vulnerability for which attackers can easily write port-scanners to just scour the net for exploitable apps: it exists in most any Rails app, exploitable in a generic way. You’ve got to patch your apps. Here’s a scary comment on Hacker News trying to drive that home too.

If you are on Rails 3.0, 3.1, or 3.2, fixing this can be as easy as running bundle update rails, checking in the updated Gemfile.lock (and the Gemfile, if you pin the Rails version there), and redeploying. If you are on Rails 2.3 pre-bundler, updating Rails to the latest patch release in production can be more of a pain, but there’s still a 2.3 version released to update to.
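If you have a pile of apps and just want to know which ones still need the update, something like the following might help. It's only a rough sketch: the helper name is made up for illustration, and the patched version numbers are my assumption of what the January 2013 security releases were, so check them against the actual announcement before relying on this. It uses Gem::Version (which ships with RubyGems) to compare version strings properly.

```ruby
require "rubygems" # provides Gem::Version for comparing version strings

# First patched release in each series that got a fix.
# ASSUMPTION: these numbers are what I believe the January 2013 security
# releases were; verify them against the official announcement.
PATCHED = {
  "3.2" => "3.2.11",
  "3.1" => "3.1.10",
  "3.0" => "3.0.19",
  "2.3" => "2.3.15"
}.freeze

# Hypothetical helper (not part of Rails or Bundler).
# Returns :patched, :needs_update, or :no_patched_release for a version string.
# Only the series listed in PATCHED are known to this sketch.
def rails_patch_status(version)
  series = version.split(".").first(2).join(".") # "3.2.8" -> "3.2"
  fixed = PATCHED[series]
  return :no_patched_release unless fixed

  Gem::Version.new(version) >= Gem::Version.new(fixed) ? :patched : :needs_update
end
```

For example, rails_patch_status("3.2.8") comes back as :needs_update, while a 2.2.x version comes back as :no_patched_release, which is the "you're going to have to figure out what to do" case.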

If you are on Rails 2.2 or earlier, you are still vulnerable but you don’t have a release to update to, so you’re going to have to figure out what to do. It seems like you can probably be fine by following the workaround instructions to simply disable XML/YAML params parsing (it’s highly unlikely your app uses these features; if it does, then you’re REALLY out of luck). But the announcement doesn’t at present give tested instructions for how to do this in various Rails versions, so it might take some investigation.

You’ve really got to do this to every app, even weird out of the way ones that seem unimportant and don’t do anything that matters — the vulnerability could mean that app is providing a hole onto one of your servers inside your firewall.

If you don’t even know if you have such apps running, or have inherited apps from predecessors without documentation or current staff competence in maintaining them, and you don’t know where to find the source or how to redeploy or anything about Rails and you have no idea where to begin…. oh boy. I mention this because one of my nightmares is that I’m some day going to leave such a pile-of-doo for the future at my organization. I do my best not to, but in the end it’s an organizational issue, not a personal one: our organizations have got to start acting like competent IT organizations which don’t bite off more than they can chew, and which fund IT operations adequately to reliably and securely support what’s being asked of them. And university libraries just plain don’t. And some day, they’re all going to reap the consequences. And then decide “See, this is why we shouldn’t do any IT in-house, but just buy black boxes from vendors, the end.” Which will only hasten the irrelevance and disappearance of libraries.

Okay, my rant went a bit off track there (erm, off the rails?).

Anyhow, while this Rails security vulnerability is bad, I don’t think it should be taken to mean that Rails is inherently less secure than anything else, or that Rails made a mistake larger than other similar products have made and will make. Except that Rails is popular, and so a target for attackers.
