Colombian student faces jail time for sharing scholarly thesis

Colombia strengthened its copyright laws in 2006, basically at U.S. demand as part of a free trade agreement. 

As a result, according to the Nature News Blog, Diego Gómez Hoyos, a Colombian student, faces jail time for posting someone else’s thesis on Scribd. 

In the U.S., of course, ‘grey’ sharing of copyrighted scholarly work without permission is fairly routine. We call it ‘grey’ only because everyone does it, and so far publishers in the U.S. have shown little inclination to stop it when it’s done among scholars on a one-by-one basis — not because it’s legal in the U.S. If you Google (or Google Scholar) search recent scholarly publications, you can quite frequently find ‘grey’ publicly accessible copies on the public internet, including on Scribd.  

What is done routinely by scholars in the U.S. and ignored gets you a trial and possible jail time in Colombia — because of laws passed to satisfy the U.S. in ‘free trade’ agreements.  This case may start going around the facebooks as “copyright out of control”, and it is that; but it’s also about how neo-colonialism is alive and well, how what’s good for the metropole isn’t good for the periphery, and how ‘free trade’ agreements are never about equality.

Student may be jailed for posting scientist’s thesis on web
Posted on behalf of Michele Catanzaro


A Colombian biology student is facing up to 8 years in jail and a fine for sharing a thesis by another scientist on a social network.


Diego Gómez Hoyos posted the 2006 work, about amphibian taxonomy, on Scribd in 2011. An undergraduate at the time, he had hoped that it would help fellow students with their fieldwork. But two years later, in 2013, he was notified that the author of the thesis was suing him for violating copyright laws. His case has now been taken up by the Karisma Foundation, a human rights organization in Bogotá, which has launched a campaign called “Sharing is not a crime”.




Gómez says that he deleted the thesis from the social network as soon as he was notified of the legal proceedings. But the case against him is rolling on, with the most recent hearing taking place in Bogotá in May. He faces between 4 and 8 years in jail if found guilty. The next hearing will be in September.


The student, who is currently studying for a master’s degree in conservation of protected areas at the National University of Costa Rica in Heredia, refuses to reveal who is suing him. He says he does not want to “put pressure on this person”. “My lawyer has tried unsuccessfully to establish contacts with the complainant: I am open to negotiate and get to an agreement to move this issue out of the criminal trial,” he told Nature.


The case has left Gómez feeling disappointed. “I thought people did biology for passion, not for making money,” he says. “Now other scientists are much more circumspect [about sharing publications].”


Posted in General | Leave a comment

Google Scholar Alerts notifies me of a citation to me

So I still vainly subscribe to Google Scholar Alerts results on my name, although the service doesn’t work too well today. 

Today (after returning from summer vacation), I found an alert in my inbox to Googlization of Libraries, edited by William Miller and Rita Pellen. 

Except, oddly, the Google Books version wasn’t searchable, so I couldn’t find where my name was mentioned. (But clearly Google has, or had, the text at some point, to generate the alert for me!)  

But the Amazon copy was searchable. Amazon doesn’t let you copy and paste from books, but I’ll retype. 

Of course, some aspects of this comparison do not fit. For example, it is unlikely that the existence of Google Scholar is going to “dumb down” research (It might, however, make possible the distribution of less reputable research, unfinished manuscripts, etc. Scholars like Jonathan Rochkind have explored this concept. [32]).

From Standing on the Shoulders of Libraries by Charlie Potter in, Googlization of Libraries, edited by William Miller and Rita Pellen, Routledge 2009. Page 18. 

I don’t actually recall exploring that concept.  Let’s see what the cite is…  doh, the page of citations for that chapter isn’t included in the Amazon preview. Let’s see Google… afraid not, Google wouldn’t show me the page either. 

I wonder how many scholars are doing research like this, from the freely available previews on Google/Amazon, and giving up when they run up against the wall.  

Maybe I’ll ILL the book; Amazon search says I’m cited a few more times in other chapters, although it won’t show them to me. 

Posted in General | Leave a comment

ActiveRecord Concurrency in Rails4: Avoid leaked connections!

My past long posts about multi-threaded concurrency in Rails ActiveRecord are some of the most visited posts on this blog, so I guess I’ll add another one here; if you’re a “tl;dr” type, you should probably bail now, but past long posts have proven useful to people over the long-term, so here it is.

I’m in the middle of updating my app, which uses multi-threaded concurrency in unusual ways, to Rails4.   The good news is that the significant bugs I ran into in Rails 3.1 and friends, reported in the earlier post, have been fixed.

However, the ActiveRecord concurrency model has always made it too easy to accidentally leak orphaned connections, and in Rails4 there’s no good way to recover these leaked connections. Later in this post, I’ll give you a monkey patch to ActiveRecord that will make it much harder to accidentally leak connections.

Background: The ActiveRecord Concurrency Model

Is pretty much described in the header docs for ConnectionPool, and the fundamental architecture and contract hasn’t changed since Rails 2.2.

Rails keeps a ConnectionPool of individual connections (usually network connections) to the database. Each connection can only be used by one thread at a time, and needs to be checked out and then checked back in when done.

You can check out a connection explicitly using `checkout` and `checkin` methods. Or, better yet use the `with_connection` method to wrap database use.  So far so good.
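The contract can be sketched with a toy pool. To be clear, this is illustrative Ruby, not ActiveRecord’s real code: the point is just that `with_connection` is a checkout plus a guaranteed checkin.

```ruby
# Toy stand-in for ActiveRecord's ConnectionPool, to illustrate the
# checkout/checkin contract. Not AR's real implementation.
class ToyPool
  def initialize(size)
    @mutex     = Mutex.new
    @available = Array.new(size) { Object.new } # stand-in "connections"
  end

  def available_count
    @mutex.synchronize { @available.size }
  end

  # Explicit checkout: the caller now owns the connection.
  def checkout
    @mutex.synchronize { @available.pop or raise "no connections available" }
  end

  # Explicit checkin: return the connection to the pool.
  def checkin(conn)
    @mutex.synchronize { @available.push(conn) }
  end

  # with_connection: checkout, yield, and ALWAYS checkin -- even if the
  # block raises. This is why it's safer than manual checkout/checkin.
  def with_connection
    conn = checkout
    yield conn
  ensure
    checkin(conn) if conn
  end
end

pool = ToyPool.new(2)
pool.with_connection { |conn| "use conn here" }
pool.available_count # => 2, the connection came back after the block
```

The `ensure` is the whole trick: even an exception inside the block can’t leak the connection.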

But ActiveRecord also supports an automatic/implicit checkout. If a thread performs an ActiveRecord operation, and that thread doesn’t already have a connection checked out to it (ActiveRecord keeps track of whether a thread has a checked out connection in Thread.current), then a connection will be silently, automatically, implicitly checked out to it. It still needs to be checked back in.
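The implicit checkout can be modeled in a few lines. Again, a toy model rather than AR’s actual code: a connection is lazily checked out on first use and cached per-thread in a thread-local.

```ruby
# Toy model of ActiveRecord's implicit checkout (illustrative only).
class ImplicitPool
  def initialize(size)
    @mutex     = Mutex.new
    @available = Array.new(size) { Object.new } # stand-in "connections"
  end

  # First call from a thread silently checks a connection out to it;
  # later calls return the cached one. Nothing ever checks it back in --
  # that is still the caller's job.
  def connection
    Thread.current[:toy_conn] ||= @mutex.synchronize { @available.pop }
  end

  def available_count
    @mutex.synchronize { @available.size }
  end
end

pool = ImplicitPool.new(2)
conn = pool.connection       # implicit checkout happens here
pool.connection.equal?(conn) # => true, same thread gets the cached one

other = nil
Thread.new { other = pool.connection }.join
other.equal?(conn)           # => false, each thread gets its own
pool.available_count         # => 0, and neither was ever checked back in
```

Note that the second thread above has already died holding its connection: that is exactly the leak discussed below.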

And you can call `ActiveRecord::Base.clear_active_connections!`, and all connections checked out to the calling thread will be checked back in. (Why might there be more than one connection checked out to the calling thread? Mostly only if you have more than one database in use, with some models in one database and others in others.)

And that’s what ordinary Rails use does, which is why you haven’t had to worry about connection checkouts before.  A Rails action method begins with no connections checked out to it; if and only if the action actually tries to do some ActiveRecord stuff, does a connection get lazily checked out to the thread.

And after the request has been processed and the response delivered, Rails itself will call `ActiveRecord::Base.clear_active_connections!` inside the thread that handled the request, checking back in any connections that were checked out.

The danger of leaked connections

So, if you are doing “normal” Rails things, you don’t need to worry about connection checkout/checkin. (modulo any bugs in AR).

But if you create your own threads to use ActiveRecord (inside or outside a Rails app, doesn’t matter), you absolutely do.  If you proceed blithely to use AR like you’re used to in Rails, but have created threads yourself — then connections will be automatically checked out to you when needed… and never checked back in.

The best thing to do in your own threads is to wrap all AR use in a `with_connection`. But if some code somewhere accidentally does an AR operation outside of a `with_connection`, a connection will get checked out and never checked back in.

And if the thread then dies, the connection is orphaned or leaked, and in fact there is no way in Rails4 to recover it.  If you leak one connection like this, that’s one less connection available in the ConnectionPool.  If you leak all the connections in the ConnectionPool, then there are no more connections available, and the next time anyone tries to use ActiveRecord, it’ll wait as long as the checkout_timeout (default 5 seconds; you can set it to something else in your database.yml) trying to get a connection, then give up and throw a ConnectionTimeout. No more database access for you.

In Rails 3.x, there was a method `clear_stale_cached_connections!`, that would  go through the list of all checked out connections, cross-reference it against the list of all active threads, and if there were any checked out connections that were associated with a Thread that didn’t exist anymore, they’d be reclaimed.   You could call this method from time to time yourself to try and clean up after yourself.

And in fact, if you tried to check out a connection, and no connections were available — Rails 3.2 would call clear_stale_cached_connections! itself to see if there were any leaked connections that could be reclaimed, before raising a ConnectionTimeout. So if you were leaking connections all over the place, you still might not notice, the ConnectionPool would clean em up for you.
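The leak, and that Rails-3.x-style reclamation, can be demonstrated with a toy pool that records which thread owns each checked-out connection (illustrative only; AR’s real bookkeeping differs):

```ruby
# Toy demonstration of leaked connections and Rails-3.x-style cleanup.
class OwnerTrackingPool
  def initialize(size)
    @mutex       = Mutex.new
    @available   = Array.new(size) { Object.new }
    @checked_out = {} # connection => owning thread
  end

  def checkout
    @mutex.synchronize do
      conn = @available.pop or raise "pool exhausted"
      @checked_out[conn] = Thread.current
      conn
    end
  end

  def checkin(conn)
    @mutex.synchronize do
      @checked_out.delete(conn)
      @available.push(conn)
    end
  end

  def available_count
    @mutex.synchronize { @available.size }
  end

  # Rails-3.x-style cleanup: any connection whose owning thread is no
  # longer alive is considered leaked, and gets reclaimed.
  def clear_stale_cached_connections!
    @mutex.synchronize do
      dead = @checked_out.select { |_conn, thread| !thread.alive? }
      dead.each_key do |conn|
        @checked_out.delete(conn)
        @available.push(conn)
      end
    end
  end
end

pool = OwnerTrackingPool.new(2)

# A thread checks out a connection and dies without checking it in: a leak.
Thread.new { pool.checkout }.join

pool.available_count                  # => 1 (one connection leaked)
pool.clear_stale_cached_connections!
pool.available_count                  # => 2 (reclaimed)
```

Cross-referencing every checked-out connection against every live thread is why this was expensive, which is presumably part of why Rails4 dropped it.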

But this was a pretty expensive operation, and in Rails4, not only does the ConnectionPool not do this for you, but the method isn’t even available to you to call manually.  As far as I can tell, there is no way using public ActiveRecord API to clean up a leaked connection; once it’s leaked it’s gone.

So this makes it pretty important to avoid leaking connections.

(Note: There is still a method `clear_stale_cached_connections` in Rails4, but it’s been redefined in a way that doesn’t do the same thing at all, and does not do anything useful for leaked connection cleanup.  The fact that it uses the same method name is, I think, based on a misunderstanding by the Rails devs of what it’s doing. See Fear the Reaper below.)

Monkey-patch AR to avoid leaked connections

I understand where Rails is coming from with the ‘implicit checkout’ thing.  For standard Rails use, they want to avoid checking out a connection for a request action if the action isn’t going to use AR at all. But they don’t want the developer to have to explicitly check out a connection; they want it to happen automatically. (In no version of Rails, going back to when AR didn’t do concurrency right at all in Rails 1.0 and Rails 2.0–2.1, has the developer had to manually check out a connection in a standard Rails action method.)

So, okay, it lazily checks out a connection only when code tries to do an ActiveRecord operation, and then Rails checks it back in for you when the request processing is done.

The problem is, for any more general-purpose usage where you are managing your own threads, this is just a mess waiting to happen. It’s way too easy for code to ‘accidentally’ check out a connection that never gets checked back in and is leaked, with no API available anymore to even recover the leaked connections. It’s way too error prone.

That API contract of “implicitly check out a connection when needed without you realizing it, but you’re still responsible for checking it back in” is actually kind of insane. If we’re doing our own threads and using ActiveRecord in them, we really want to disable that entirely, so code is forced to do an explicit `with_connection` (or `checkout`, but `with_connection` is a really good idea).

So, here, in a gist, is a couple-dozen-line monkey patch to ActiveRecord that lets you, on a thread-by-thread basis, disable the “implicit checkout”.  Apply this monkey patch (just throw it in a config/initializer, that works), and if you’re ever manually creating a thread that might (even accidentally) use ActiveRecord, the first thing you should do is:

    Thread.new do
      ActiveRecord::Base.forbid_implicit_checkout_for_thread!

      # stuff
    end

Once you’ve called `forbid_implicit_checkout_for_thread!` in a thread, that thread will be forbidden from doing an ‘implicit’ checkout.

If any code in that thread tries to do an ActiveRecord operation outside a `with_connection` without a checked out connection, instead of implicitly checking out a connection, you’ll get an ActiveRecord::ImplicitConnectionForbiddenError raised — immediately, fail fast, at the point the code wrongly ended up trying an implicit checkout.

This way you can enforce your code to only use `with_connection` like it should.

Note: This code is not battle-tested yet, but it seems to be working for me with `with_connection`. I have not tried it with explicitly checking out a connection with ‘checkout’, because I don’t entirely understand how that works.
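To make the behavior concrete, here’s a toy model of the guard logic. This is not the actual gist (which patches ActiveRecord’s internals); it just shows the idea: flag the thread, and make the implicit-checkout path fail fast while the explicit path still works.

```ruby
# Toy version of the guard the monkey patch adds (illustrative only).
class ImplicitConnectionForbiddenError < StandardError; end

class GuardedPool
  def initialize(size)
    @mutex     = Mutex.new
    @available = Array.new(size) { Object.new } # stand-in "connections"
  end

  # Mark the calling thread as forbidden from implicit checkout.
  def self.forbid_implicit_checkout_for_thread!
    Thread.current[:forbid_implicit_checkout] = true
  end

  # Implicit checkout path: raises immediately if this thread opted out
  # and has no connection already checked out.
  def connection
    if Thread.current[:forbid_implicit_checkout] && !Thread.current[:guarded_conn]
      raise ImplicitConnectionForbiddenError,
            "implicit connection checkout forbidden in this thread"
    end
    Thread.current[:guarded_conn] ||= @mutex.synchronize { @available.pop }
  end

  # Explicit path is still allowed, and checks back in.
  def with_connection
    conn = @mutex.synchronize { @available.pop }
    yield conn
  ensure
    @mutex.synchronize { @available.push(conn) } if conn
  end
end

pool = GuardedPool.new(1)
err  = nil
Thread.new do
  GuardedPool.forbid_implicit_checkout_for_thread!
  pool.with_connection { |conn| "explicit use is fine" }
  begin
    pool.connection # implicit use: raises immediately, fail fast
  rescue ImplicitConnectionForbiddenError => e
    err = e
  end
end.join
err.class # => ImplicitConnectionForbiddenError
```

The fail-fast raise at the exact point of the accidental implicit checkout is what makes the bug findable.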

DO fear the Reaper

In Rails4, the ConnectionPool has an under-documented thing called the “Reaper”, which might appear to be related to reclaiming leaked connections.  In fact, what public documentation there is says: “the Reaper, which attempts to find and close dead connections, which can occur if a programmer forgets to close a connection at the end of a thread or a thread dies unexpectedly. (Default nil, which means don’t run the Reaper).”

The problem is, as far as I can tell by reading the code, it simply does not do this.

What does the reaper do?  As far as I can tell trying to follow the code, it mostly looks for connections which have actually dropped their network connection to the database.

A leaked connection hasn’t necessarily dropped its network connection. That really depends on the database and its settings — most databases will drop unused connections after a certain idle timeout, often hours long by default.  A leaked connection probably hasn’t yet had its network connection closed, and a properly checked out, not-leaked connection can have its network connection closed (say, if there’s been a network interruption or error, or a very short idle timeout on the database).

The Reaper actually, if I’m reading the code right, has nothing to do with leaked connections at all. It’s targeting a completely different problem (dropped network connections, not checked-out-but-never-checked-in leaked connections). Dropped network connections are a legit problem you want handled gracefully; I have no idea how well the Reaper handles it (the Reaper is off by default, I don’t know how much use it’s gotten, and I haven’t put it through its paces myself). But it’s got nothing to do with leaked connections.

Someone thought it did, they wrote documentation suggesting that, and they redefined `clear_stale_cached_connections!` to use it. But I think they were mistaken. (Did not succeed at convincing @tenderlove of this when I tried a couple years ago when the code was just in unreleased master; but I also didn’t have a PR to offer, and I’m not sure what the PR should be; if anyone else wants to try, feel free!)

So, yeah, Rails4 has redefined the existing `clear_stale_cached_connections!` method to do something entirely different than it did in Rails3, triggered in entirely different circumstances. Yeah, kind of confusing.

Oh, maybe fear ruby 1.9.3 too

When I was working on upgrading the app, I was occasionally getting a mysterious deadlock exception:

ThreadError: deadlock; recursive locking:

In retrospect, I think I had some bugs in my code and wouldn’t have run into that if my code had been behaving well. However, the fact that my errors resulted in that exception rather than a more meaningful one may possibly have been due to a bug in ruby 1.9.3 that’s fixed in ruby 2.0. 

If you’re doing concurrency stuff, it seems wise to use ruby 2.0 or 2.1.

Can you use an already loaded AR model without a connection?

Let’s say you’ve already fetched an AR model in. Can a thread then use it, read-only, without ever trying to `save`, without needing a connection checkout?

Well, sort of. You might think, oh yeah, what if I follow a not yet loaded association, that’ll require a trip to the db, and thus a checked out connection, right? Yep, right.

Okay, what if you pre-load all the associations, then are you good? In Rails 3.2, I did this, and it seemed to be good.

But in Rails4, it seems that even though an association has been pre-loaded, the first time you access it, some under-the-hood things need an ActiveRecord Connection object. I don’t think it’ll end up taking a trip to the db (it has been pre-loaded after all), but it needs the connection object. Only the first time you access it. Which means it’ll check one out implicitly if you’re not careful. (Debugging this is actually what led me to the forbid_implicit_checkout stuff again).

I didn’t bother trying to report that as a bug, because AR doesn’t really make any guarantees that you can do anything at all with an AR model without a checked out connection; it doesn’t really consider the question one way or the other.

Safest thing to do is simply don’t touch an ActiveRecord model without a checked out connection. You never know what AR is going to do under the hood, and it may change from version to version.

Concurrency Patterns to Avoid in ActiveRecord?

Rails has officially supported multi-threaded request handling for years, but in Rails4 that support is turned on by default — although there still won’t actually be multi-threaded request handling going on unless you have an app server that does that (Puma, Passenger Enterprise, maybe something else).

So I’m not sure how many people are using multi-threaded request dispatch to find edge case bugs; still, it’s fairly high profile these days, and I think it’s probably fairly reliable.

If you are actually creating your own ActiveRecord-using threads manually though (whether in a Rails app or not; say in a background task system), from prior conversations @tenderlove’s preferred use case seemed to be creating a fixed number of threads in a thread pool, making sure the ConnectionPool has enough connections for all the threads, and letting each thread permanently check out and keep a connection.

I think you’re probably fairly safe doing that too; it’s the way background task pools are often set up.

That’s not what my app does.  I wouldn’t necessarily design my app the same way today if I were starting from scratch (the app was originally written for Rails 1.0, which gives you a sense of how old some of its design choices are; although the concurrency-related stuff really only dates from the relatively recent Rails 2.1 (!)).

My app creates a variable number of threads, each of which is doing something different (using a plugin system). The things it’s doing generally involve HTTP interactions with remote APIs, which is why I wanted to do them in concurrent threads (a huge wall-time speedup even with the GIL, yep). The threads do need to occasionally do ActiveRecord operations to look at input or store their output. (I tried to avoid concurrency headaches by making all inter-thread communication go through the database; this is not a low-latency-requirement situation. I’m not sure how much headache I’ve actually avoided, though!)

So I’ve got an indeterminate number of threads coming into and going out of existence, each of which needs only occasional ActiveRecord access. Theoretically, AR’s concurrency contract can handle this fine, just wrap all the AR access in a `with_connection`.  But this is definitely not the sort of concurrency use case AR is designed for and happy about. I’ve definitely spent a lot of time dealing with AR bugs (hopefully no longer!), and just parts of AR’s concurrency design that are less than optimal for my (theoretically supported) use case.

I’ve made it work. And it probably works better in Rails4 than any time previously (although I haven’t load tested my app yet under real conditions, upgrade still in progress). But, at this point,  I’d recommend avoiding using ActiveRecord concurrency this way.

What to do?

What would I do if I had it to do over again? Well, I don’t think I’d change my basic concurrency setup — lots of short-lived threads still makes a lot of sense to me for a workload like I’ve got, of highly diverse jobs that all do a lot of HTTP I/O.

At first, I was thinking “I wouldn’t use ActiveRecord, I’d use something else with a better concurrency story for me.”  DataMapper and Sequel have entirely different concurrency architectures; while they use similar connection pools, they try to spare you from having to know about it (at the cost of lots of expensive under-the-hood synchronization).

Except if I had actually acted on that when I thought about it a couple years ago, when DataMapper was the new hotness, I probably would have switched to or used DataMapper, and now I’d be stuck with a large unmaintained dependency. And be really regretting it. (And yeah, at one point I was this close to switching to Mongo instead of an rdbms, also happy I never got around to doing it).

I don’t think there is or is likely to be a ruby ORM as powerful, maintained, and likely to continue to be maintained throughout the life of your project, as ActiveRecord. (although I do hear good things about Sequel).  I think ActiveRecord is the safe bet — at least if your app is actually a Rails app.

So what would I do differently? I’d try to have my worker threads not use AR at all. Instead of passing an AR model in as input, I’d fetch the AR model in some safer main thread, convert it to a plain business object without any AR, and pass that into my worker threads.  Instead of having my worker threads write their output directly with AR, I’d have a dedicated thread pool of ‘writers’ (each of which held onto an AR connection for its entire lifetime), and have the indeterminate number of worker threads pass their output through a threadsafe queue to that dedicated threadpool of writers.
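A rough sketch of that worker/writer split, in plain Ruby. The “database write” here is just a queue append, and all the names are made up; in the real design each writer thread would check out one AR connection at startup and keep it for life.

```ruby
# Sketch of the worker/writer split: many short-lived workers, a small
# fixed pool of writers, a threadsafe queue between them.
WRITER_COUNT = 2
queue   = Queue.new # threadsafe handoff from workers to writers
results = Queue.new # stand-in for the database

writers = WRITER_COUNT.times.map do
  Thread.new do
    # In the real app: check out an AR connection here, hold it for life.
    loop do
      output = queue.pop
      break if output == :shutdown
      results << output # stand-in for an ActiveRecord save
    end
  end
end

# An indeterminate number of short-lived worker threads, none of which
# touch the database directly -- they just push onto the queue.
workers = 5.times.map do |i|
  Thread.new { queue << "output-#{i}" }
end
workers.each(&:join)

WRITER_COUNT.times { queue << :shutdown } # one sentinel per writer
writers.each(&:join)

results.size # => 5
```

The `:shutdown` sentinel per writer is a simple way to drain the queue and stop cleanly.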

That would have seemed like huge over-engineering to me at some point in the past, but at the moment it sounds like just the right amount of engineering, if it lets me avoid using ActiveRecord in concurrency patterns that, while officially supported, it isn’t very happy about.

Posted in General | Leave a comment

SAGE retracts 60 papers in “peer review citation ring”

A good reminder that a critical approach to scholarly literature doesn’t end with “Beall’s list”, and maybe doesn’t even begin there. I still think academic libraries/librarians should consider it part of their mission to teach students (and faculty) about current issues in the trustworthiness of scholarly literature, and to approach ‘peer review’ critically.

London, UK (08  July 2014) – SAGE announces the retraction of 60 articles implicated in a peer review and citation ring at the Journal of Vibration and Control (JVC). The full extent of the peer review ring has been uncovered following a 14 month SAGE-led investigation, and centres on the strongly suspected misconduct of Peter Chen, formerly of National Pingtung University of Education, Taiwan (NPUE) and possibly other authors at this institution.

In 2013 the then Editor-in-Chief of JVC, Professor Ali H. Nayfeh, and SAGE became aware of a potential peer review ring involving assumed and fabricated identities used to manipulate the online submission system SAGE Track powered by ScholarOne Manuscripts™. Immediate action was taken to prevent JVC from being exploited further, and a complex investigation throughout 2013 and 2014 was undertaken with the full cooperation of Professor Nayfeh and subsequently NPUE.

In total 60 articles have been retracted from JVC after evidence led to at least one author or reviewer being implicated in the peer review ring. Now that the investigation is complete, and the authors have been notified of the findings, we are in a position to make this statement.

There is more summary coverage elsewhere, which notes this isn’t the first time fake identities have been fraudulently used in peer review.

Posted in General | Leave a comment

Botnet-like attack on EZProxy server

So once last week, and then once again this week, I got reports that our EZProxy server was timing out.

When it happened this week, I managed to investigate while the problem was still occurring, and noticed that the EZProxy process on the server was taking ~100% of available CPU. As the EZProxy process normally doesn’t get above 10 or 20% CPU in `top`, even during our peak times, something was up.

Looking at the EZProxy logs, I noticed a very high volume of requests logged (via the “%r” LogFormat placeholder) like:

    "GET[CACHEBUSTER]&referrer=[REFERRER_URL]&pubclick=[INSERT_CLICK_TAG] HTTP/1.0"

The URL seems to be related to serving ads, and these requests were coming from hundreds of different IPs.  So my first guess was that this is some kind of botnet trying to register clicks on web ads for profit. (And that guess remains my assumption.)

I was still confused about exactly what that logged request meant — two URLs jammed together like that, what kind of request are these clients actually making?

Eventually OCLC EZProxy support was able to clarify that this is what’s logged when a client tries to make a standard HTTP Proxy request to EZProxy, as if EZProxy were a standard HTTP Proxy server. That is, something like:

curl --proxy

Now, EZProxy isn’t a standard HTTP Proxy server, so it does nothing with this kind of request.  My guess is that some human or automated process noticed a DNS hostname involving the word ‘proxy’ and figured it was worth a try to sic a bot army on it. But it’s not accomplishing what it wanted to accomplish; this ain’t an open HTTP proxy, or even a standard HTTP proxy at all.

But the sheer volume of them was causing problems. Apparently EZProxy has to run enough logic just to determine that it can do nothing with such a request that a high volume of them drives EZProxy to 100% CPU utilization, even though it ultimately does nothing with them.

It’s not such a large volume of traffic that it overwhelms the OS network stack or anything; if I block all the IP addresses involved with `RejectIP` in the EZProxy config, then everything’s fine again and CPU utilization is back under 10%.  It’s just EZProxy the app that is having trouble dealing with all these requests.

So first, I filed a feature/bug request with OCLC/EZProxy, asking for EZProxy to be fixed/improved here: if something makes a standard HTTP Proxy request against it, it should ignore it in a less CPU-intensive way, so it can withstand a higher volume of such requests.

Secondly, our local central university IT network security thinks they may have the tools to block these requests at the network perimeter, before they even reach our server.  Anything that looks like a standard HTTP Proxy request can be blocked there, as there is no legitimate reason for such requests and nothing useful EZProxy can do with them.

If all this fails, I may need to write a cronjob script which regularly scans the EZProxy logs for lines that look like standard HTTP Proxy requests, notes the IPs, and then automatically adds them to an EZProxy config file with `RejectIP` (restarting EZProxy for it to take effect).  This is a pain, would have some delay before banning abusive clients (you don’t want to go restarting EZProxy every 60 seconds or anything), and could end up banning legitimate users. (Users infected by malware? They’d stay banned even after they got rid of the malware. Users who accidentally configured EZProxy as an HTTP Proxy in their web browser, having gotten confused? Again, they’d stay banned even after they fixed it.)
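A first cut of that scanning script might look something like this. The log format, IP addresses, and regex here are all my assumptions for illustration; a real script would read the live EZProxy log and handle the restart.

```ruby
# Sketch: scan EZProxy-style log lines for standard-HTTP-Proxy requests
# and emit RejectIP config lines. Sample log is made up for illustration.
sample_log = <<~LOG
  10.0.0.5 - - [01/Jul/2014] "GET /login?url=http://example.com HTTP/1.1" 200
  203.0.113.9 - - [01/Jul/2014] "GET http://ads.example.net/serve?x=1 HTTP/1.0" 400
  198.51.100.7 - - [01/Jul/2014] "GET http://ads.example.net/serve?x=2 HTTP/1.0" 400
LOG

# A standard HTTP Proxy request logs an absolute URL in the request line
# ("GET http://..."), where a normal request logs a path ("GET /...").
proxy_style = /"(?:GET|POST)\s+https?:\/\//

bad_ips = sample_log.each_line
                    .select { |line| line =~ proxy_style }
                    .map    { |line| line[/\A(\S+)/, 1] } # first field: client IP
                    .uniq

reject_lines = bad_ips.map { |ip| "RejectIP #{ip}" }
# reject_lines => ["RejectIP 203.0.113.9", "RejectIP 198.51.100.7"]
```

Note the first sample line is *not* matched, even though an http:// URL appears in its query string, because the regex only matches an absolute URL directly after the method.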

I guess another alternative would be putting EZProxy behind an apache or nginx reverse proxy, so we could write rules in the front-end web server to filter out these requests before they make it to EZProxy.

Or I could have a log-scanning cronjob that actually blocks bad IPs with OS iptables (perhaps using the ‘fail2ban’ script), rather than with EZProxy `RejectIP` config (thus avoiding the need for an EZProxy restart when adding blocked IPs).

But the best solution would be EZProxy fixing itself to not take excessive CPU when under a high volume of HTTP Proxy requests, but simply ignore them in a less CPU-intensive way. I have no idea how likely it is for OCLC to fix EZProxy like this.



Posted in General | Leave a comment

Ascii-ization transliteration built into ruby I18n gem in Rails

A while ago, I was looking for a way in ruby to turn text with diacritics (é) and ligatures (Æ) and other such things into straight ascii (e ; AE).

I found there were various gems that said they could do such things, but they all had problems. In part, because the ‘right’ way to do this is really unclear in the general case: there are all sorts of edge cases, locale-dependent choices, and the giant universe of unicode to deal with.  In fact, it looks like at one point the Unicode/CLDR suite included such an algorithm, but it appears to have been abandoned and unsupported, with no notes as to why; I suspect the problem proved intractable.  (Some unicode libraries currently support it anyway; part of Solr actually does in one place; communication about these things seems to travel slowly.)

For what I was working on before, I realized that “transliterating to ascii” wasn’t the right solution after all — instead, what I wanted was the Unicode Collation Algorithm, which you can use to produce a collation string, such that for instance “é” will transform to the same collation string as “e”, and “Æ” to the same collation string as “AE” — but that collation string isn’t meant to be user-displayable, and won’t necessarily actually be “e” or “AE”.  It can still be used for sorting or comparing in a “down-sampled to ascii” invariant way.  And, like most of the Unicode suite, it’s pretty well-thought-through and robust to many edge cases.

For that particular case of sorting or comparing in a “down-sampled to ascii invariant way”, you want to create a Unicode collation sort key, for :en locale, with “maximum level” set to 1. And it works swimmingly.  In ruby, you can do that with the awesome twitter_cldr gem — I contributed a patch to support maximum_level, which I think has made it into the latest version.

Anyway, after that lengthy preface explaining why you probably don’t really want to “transliterate to ascii” exactly, and it’s doomed to be imperfect and incomplete…

…I recently noticed that the ruby i18n gem, as used in Rails, actually has a transliterate-to-ascii feature built in, with some support for localization of transliteration rules that I don’t entirely understand.  But anyhow, if I ever wanted this function in the future — knowing it’s going to be imperfect and incomplete — I’d use the one from I18n, rather than go hunting for it in some probably less maintained gem.

I guess you might want to do this for creating ‘slugs’ in URL paths, because non-ascii in URLs ends up being such a mess…  it would probably work well enough for an app that really is mostly English, but if you’re dealing heavily in non-ascii and especially non-roman text, it’s going to get more complicated than this fast. Anyway.

    I18n.transliterate("Ærøskøbing")
    # => "AEroskobing"

    # When it can't handle a character, you get ? marks.
    I18n.transliterate("日本語")
    # => "???"
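For the slug case, the shape of it might be something like the following. The tiny mapping table here is just a toy for illustration; a real app would call `I18n.transliterate` instead, which ships with a much fuller (though still imperfect) rule set.

```ruby
# Toy sketch of ascii slug generation. ASCII_MAP is deliberately tiny;
# a real app would use I18n.transliterate in its place.
ASCII_MAP = {
  "é" => "e", "è" => "e", "ø" => "o", "Æ" => "AE", "æ" => "ae"
}.freeze

def slugify(text)
  ascii = text.gsub(/./) { |ch| ASCII_MAP.fetch(ch, ch) }
  ascii.downcase
       .gsub(/[^a-z0-9]+/, "-") # collapse everything else to hyphens
       .gsub(/\A-+|-+\z/, "")   # trim leading/trailing hyphens
end

slugify("Ærøskøbing café") # => "aeroskobing-cafe"
```

Anything the map doesn’t know about just falls through and then gets collapsed into a hyphen, which is roughly the failure mode you’d get (via “?” marks) with the real transliterator too.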

Still haven’t figured out: how to get the ruby irb/pry/debugger console on my OSX workstation to let me input UTF8, which would make playing with stuff like this and figuring things out a lot easier!  Last time I tried to figure it out, I got lost in many layers of yak shaving involving homebrew, readline libraries, rebuilding ruby from source… and eventually gave up.  I am curious whether every ruby developer on OSX has this problem, or if I’ve somehow wound up unique.

Posted in General | Leave a comment

Dangers of internet culture to humans, and alternatives

I am increasingly not liking what the use of the internet does to our society, and to us. I actually find that my techie friends may be more likely to share these concerns than my friends at large. Some find that ironic, but it isn’t at all: we have more exposure to it, we live more of our lives on the internet, and we have done so for longer.

Here’s a really good presentation on one aspect of this: the universal surveillance state of affairs brought on by always-on-the-internet culture. (That’s actually just one of the areas of my concern, although a big one.)

All of us are early adopters of another idea— that everyone should always be online. Those of us in this room have benefitted enormously from this idea. We’re at this conference because we’ve built our careers around it.

But enough time has passed that we’re starting to see the shape of the online world to come. It doesn’t look appealing at all. At times it looks downright scary.

And the author goes on to describe the dangers of the universal surveillance state of affairs, and how the current political economy (my words) of the internet exacerbates the problem, with centralization of internet services and business models built on a new kind of advertising that depends on universal surveillance.

The presenter admits he doesn’t have the solution, but proposes three areas of solution exploration: regulation, de-centralization, and de-americanization.

I think we technologists in libraries are in an interesting spot. On the one hand, we necessarily have a role of bringing more technology to libraries. And in that role, we have faced resistance from some ‘traditionalists’ worried about what these technological solutions do to the culture of libraries, and to culture at large. I have never been entirely unsympathetic to these worries, but I have definitely become even more sympathetic as things have progressed.

Nevertheless, libraries have no choice but to meet the needs and desires of our users, and the needs and desires of our users are emphatically in the direction of using technology to make things more convenient to them. No matter how many scholars bemoan the move from print to digital, scholars as a mass are simply not using print as much and using digital more and more and demanding more convenience of digital. If we don’t make their lives better with technology, we won’t survive as institutions.

But I also think libraries are potentially well-placed to play a role in addressing the harms the internet does to culture that the author of that presentation is talking about.

While he doesn’t identify it as a theme in his solutions, the presenter (I wish they signed their work, so I had a name to cite!) identifies advertising, as the economic foundation of the internet, as fundamentally rotten. Libraries can play a role as non-advertising-focused civil society institutions providing internet services and infrastructure to citizens. I’ve been interested in this since I got involved in technology and libraries; I’m not sure how much I’ve been seeing it happen, though. Do you have encouraging places where you see this happening? Do you have ideas for how it could happen (and how the funding and organizational infrastructure can support those ideas)?

Libraries, as well as university IT and other non-business-oriented IT infrastructure providers, can also take the lead in minimizing the collection and storage of personally identifiable information. Are we? There is (or at least was, in PATRIOT-act-resisting days) a lot of talk about libraries’ responsibility to avoid keeping incriminating (legally or otherwise) information on our users. But we’re treading water barely managing to provide the IT services we need to provide; how many of us have actually spent time auditing and minimizing personally identifiable information in our systems? How often do we have this as a design goal in designing new systems? What would it take to change this?

What other ways might libraries find to play a role in changing the cultural role of the internet and minimizing the universal surveillance state?

One of the worst aspects of surveillance is how it limits our ability to be creative with technology. It’s like a tax we all have to pay on innovation. We can’t have cool things, because they’re too potentially invasive.

Imagine if we didn’t have to worry about privacy, if we had strong guarantees that our inventions wouldn’t immediately be used against us. Robin gave us a glimpse into that world, and it’s a glimpse into what made computers so irresistible in the first place.

I have no idea how to fix it. I’m hoping you’ll tell me how to fix it. But we should do something to fix it. We can try a hundred different things. You people are designers; treat it as a design problem! How do we change this industry to make it wonderful again? How do we build an Internet we’re not ashamed of?

Posted in General | 3 Comments

security and reliability of research data storage

So, according to the Chronicle of Higher Education, a vendor that sold cloud storage of research data to researchers had a crash where they lost a bunch of data. (Thanks to Dr. Sarah Roberts for the pointer.)

This is a disaster for many of the researchers.

I think the Chronicle tells exactly the wrong story with their emphasis and headline though: “Hazards of the Cloud: Data-Storage Service’s Crash Sets Back Researchers.”

Hazards of the cloud, you think? The alternative to storing your research data in ‘the cloud’ is…. researchers keeping it themselves on local file storage?

Your typical person, even your typical scientist, just keeping their own files on their own hard drives… I do not think they are capable of doing this with a higher level of reliability than a competent IT organization or business specializing in this. Of course, some organizations are more competent than others; it sounds like SocioCultural Research Consultants, with their Dedoose product, are especially incompetent, if they don’t have any recoverable backups of their customers’ data.

But even though trusting a third party with your data is scary because they might be incompetent… leaving individual researchers to fend for themselves in storing research data is a recipe for disaster. Storing data reliably is something it takes skilled experts to do right; researchers in other fields are not qualified to do this on their own. (It turns out evaluating a third-party vendor’s competence is also tricky!)

And it’s not just reliability. It’s security. 2013-2014 are like the years of the IT industry collectively realizing that security is really hard to do right.  And when we’re talking research data, “security” means confidentiality and privacy of research subjects. Depending on the nature and risk of the research, a security breach can mean embarrassment or much much worse for your research participants.

I’m an IT professional, which just means I know enough to know I wouldn’t even trust myself to keep high-risk research data secure. I’d want storage and security specialists involved. Individual researchers? Entrusting this task to overworked grad students? Forget it.

This is not a hazard of the cloud. This is a hazard of digital research data. It doesn’t go away if everyone avoids “the cloud.” I absolutely think with confidence that research data stored on local hard drives on research team members’ desks or laptops — possibly multiple copies on multiple team members’ laptops — is, by and large, going to be less secure than research data stored by a competent professional third-party entity specializing in this task.

“The cloud” — if that means a remote server managed by someone else (and that’s pretty much all ‘the cloud’ means in this context) — is part of the solution, not the nature of the problem.  When that ‘someone else’ is a competent expert entity.

Ideally, I think, universities should be providing this service for their affiliated researchers, rather than leaving them to fend for themselves, whether in local storage or in individual agreements with vendors. In fact, it would make a lot of sense for university libraries in particular to be providing this service. University libraries have started thinking about how to play a role in preserving research data for archival and historical purposes. The best way to be positioned to do this is to play a role in storing the data in the first place, a service that researchers have an immediate need for and a direct interest in.

I’m not sure I trust universities or university libraries to be able to provide secure and reliable data storage either, though. Universities have a tendency to underspend and under-provision IT projects, compared to what’s really necessary for a high-quality, reliable product. It would probably make sense for universities to pool their resources in consortiums to create data storage services architected and staffed by competent professionals (compensated enough to get highly skilled professionals). So we’d be back to ‘the cloud’ after all, if perhaps a university-owned ‘cloud service’. But it’s not the cloud that’s a hazard; the hazard is that storing data reliably and securely is a non-trivial task that takes professional specialists to get right.

Posted in General | 1 Comment

ILLiad target added to Umlaut

Umlaut is the open source “SFX front-end”, “link resolver aggregator”, or “known-item service provider.”

I recently added a service plugin to generate links to ILLiad’s OpenURL receiving point, so you can place an ILL request for the current citation. Currently it’s just in Umlaut’s git repo, it’s not yet in an official release.

Now, typically Umlaut wraps a lot of under-the-hood SFX functionality. Currently, I’m letting SFX generate the ILLiad links, and Umlaut gets them and places them on its own page, along with other SFX-generated links. And that still works.

Why prefer to have Umlaut generate the links directly instead? Well, there are a number of tweaks that many customers apply to customize SFX’s built-in ILLiad parser to begin with. For instance, to make sure citations that appear to be dissertations have “genre=dissertation” set on the URL pointed at ILLiad. (Not technically part of the OpenURL spec, but it’s what ILLiad wants.)
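To make the shape of such a link concrete, here’s a hypothetical sketch using nothing but the Ruby standard library. The base URL and the parameter set are illustrative only; this is not Umlaut’s actual adapter code, and ILLiad installations vary in which OpenURL parameters they expect:

```ruby
require "uri"

# Hypothetical: build an ILLiad OpenURL-style request link for a
# citation hash. Base URL and parameter names are illustrative,
# not taken from Umlaut's actual service plugin.
ILLIAD_BASE = "https://ill.example.edu/illiad/illiad.dll"

def illiad_link(citation)
  params = {
    "genre"  => citation[:genre],   # e.g. "dissertation" for theses
    "title"  => citation[:title],
    "aulast" => citation[:aulast],
    "date"   => citation[:date],
  }.reject { |_k, v| v.nil? }      # omit fields the citation lacks
  "#{ILLIAD_BASE}?#{URI.encode_www_form(params)}"
end

illiad_link(genre: "dissertation", title: "Some Thesis", date: "2006")
# => "https://ill.example.edu/illiad/illiad.dll?genre=dissertation&title=Some+Thesis&date=2006"
```

The point of generating this yourself, rather than passing through SFX, is that logic like “if the citation looks like a dissertation, force genre=dissertation” lives in code you control.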

I too was maintaining a locally customized SFX ILLiad target parser, which had to be periodically reviewed and updated for new versions of SFX. But then I ran into a problem there was no good way to work around in SFX: SFX was taking an incoming citation that lacked an ISBN, trying to find a match for it in the SFX knowledge base of ebooks, and then ‘enhancing’ the citation with details from the closest match. The result was that an OpenURL from OCLC for a DVD wound up going to SFX (via Umlaut), but then the ILLiad link SFX generated ended up being for a different work, a book with a similar name from the SFX kb.

This was no good. And there was no good way to have Umlaut work around it.

We really pay for SFX for its knowledge base, which keeps track of licensed serials entitlements and how to appropriately link out to the content hosting platforms it covers. The actual logic and functionality of SFX we could probably replicate with open source software (like Umlaut); it’s not rocket science, but that constantly updated knowledge base is where the value is. Now, SFX gives you all sorts of other functionality along for the ride, like linking out to ILLiad. The benefit of using it is that the vendor we’re paying anyway is maintaining it for us. But for the ILLiad target parser, we were having to maintain a local customized version anyway! And when such a function isn’t working out the way we want it to, Umlaut makes it easy to replace that chunk of functionality with our own code.

Umlaut’s own linking-to-ILLiad service can be used in contexts where you’re using Umlaut without SFX too. I don’t know of any implementers doing so, but Umlaut is being written to be agnostic as to underlying link resolver kb.

I haven’t put this in production yet, it might be a while until I have time to test it fully and switch it in during a non-disruptive time of year. But it’s available now, and I figured I’d announce it.

This pretty simple Umlaut adapter was also a good place to try modelling good automated tests of an Umlaut adapter. Umlaut’s architecture was largely solidified before any of its developers had much experience with automated testing, and some aspects of it would be done differently today, especially with regard to ease of testing. But there ended up being a fairly decent way to stick this Umlaut service adapter in a test harness. Here’s a test class.

Posted in General | Leave a comment

Adding constraints to projects for success

The company Github is known as a place where engineers have an unusual amount of freedom to self-organize as far as what projects they work on.

Here’s a very interesting blog post from a Github engineer, Brandon Keepers, on “Lessons learned from a cancelled project”: he has six lessons, which are really six principles or pieces of advice for structuring a project and a project team.

In some ways the lessons learned are particular to an environment with so much freedom — however, reading through I was struck by how many apply to the typical academic library environment too.

Since an academic library isn’t a for-profit business that measures success by its bottom line, we too can suffer from a lack of defined measures of success or failure. (“Define success and failure”.)

I think this isn’t just about being able to “evaluate if what you’re doing is working”; it’s also about knowing when to stop, as either a success or a failure. I think many of us find ourselves in projects that can seem to go on forever, in pursuit of perfection, when there are other things we ought to be attending to, instead of having one project monopolize our own time, or our team’s or organization’s.

This is related to Keepers’ second principle as well, “Create meaningful artificial constraints”. While we can’t say “money is not a factor” in an academic library, or that we are given ultimate freedom to do whatever we want, I think we often do find ourselves with too many options (and too many stakeholders trying to evaluate all the options), or with the expectation that “we can meet all the requirements at once,” which Keepers suggests is “paralyzing.” (Sound familiar?) In a typical for-profit startup, freedom is constrained by a focus on the “minimum viable product”, the quickest way to sustainable revenue. When you have too much freedom, whether because of Github’s culture of self-organizing or an academic library’s, let’s say, lack of focus, you need to define some artificial constraints in order to make progress.

Keepers highlights the “milestone” as an artificial constraint: a fixed-date ‘deadline’, but one which “should never include scope”. You say “first beta user 2 weeks from today”, but you don’t say exactly what features are to be included. “If you work 60 hours per week trying to meet a deadline, then you have missed the point. A constraint should never be used to get someone to work harder. It is a tool to enable you to work smarter.”

Which brings us to “Curate a collective vision”, and “People matter more than product.” On that second one, Keepers writes

“For the first 9 months, I cared more about the outcome of the product than the people on the team. I gave feedback on ideas, designs and code with the assumption that the most important thing about that interaction was creating a superior product. I was wrong… If you care about people, the product will take care of itself. Pour all your energy into making sure your teammates are enjoying what they are doing. Happier people create better products.”

That’s a lesson it took me a while to learn and I still have a lot of trouble remembering.

I don’t know if academic libraries just end up similar to the Github environment, or if Keepers has come up with lessons that really apply to just about any organization (or any large non-startup organization?). Either way, I found a lot of meat in his relatively short blog post, and I encourage reading it and reflecting on how those lessons may apply to your workplace, and how you might account for them in your own organization.

Posted in General | Leave a comment