Concurrency in Rails 5.0

My previous posts on concurrency in ActiveRecord have been some of the most popular on this blog (which I’d like to think means concurrency is getting more popular in Rails-land), so I’m going to share what I know about some new concurrency architecture in Rails5 — which is no longer limited to ActiveRecord.

(update: hours before I started writing this, unawares, matthewd submitted a Rails PR for a Rails Guide with some really good stuff; I’ve only skimmed it so far, but you might wanna go there either before, after, or in lieu of this).

I don’t fully understand the new stuff, but since it’s relatively undocumented at present, and has some definite gotchas, as well as definite potentially powerful improvements — sharing what I got seems helpful. This will be one of my usual “lots of words” posts, get ready!

The new architecture primarily involves ActiveSupport::Reloader (a global one of which is in Rails.application.reloader) and ActiveSupport::Executor (a global one of which is in Rails.application.executor). Also ActiveSupport::Dependencies::Interlock (a global one of which is at ActiveSupport::Dependencies.interlock).

Why you need to know this

This matters if you create any threads in a Rails app yourself, beyond the per-request threads a multi-threaded app server like Puma will create for you. Rails takes care of multi-threaded request dispatch (with the right app server), but if you’re doing any kind of what I’ll call “manual concurrency” yourself — Thread.new, any invocations of anything in concurrent-ruby (recommended), or probably any celluloid (not sure), etc. — you’ve got to pay attention and use the new architecture, both to do what Rails wants and to avoid deadlocks if dev-mode-style class-reloading is happening.

If you’re getting apparent deadlocks in a Rails5 app that does multi-threaded concurrency, it’s probably about this.

If you are willing to turn off dev-mode class-reloading and auto-loading altogether, you can probably ignore this.

What I mean by “dev-mode class-reloading”

Rails 5 by default generates your environments/development.rb with config.cache_classes = false and config.eager_load = false. Classes are auto-loaded only on demand (eager_load == false), and are also sometimes unloaded to be reloaded on next access (cache_classes == false). (The details of when/how/which/if they are unloaded are outside the scope of this blog post, but have also changed in Rails 5.)

You can turn off all auto-loading with config.cache_classes = true and config.eager_load = true — the Rails 5 default in production.  All classes are loaded/require’d en masse on boot, and are never unloaded.  This is what I mean by ‘turn off dev-mode class-reloading and auto-loading altogether’.

The default Rails 5 generated environments/test.rb has config.cache_classes = true and config.eager_load = false: classes are only loaded on demand with auto-loading (eager_load == false), but are never unloaded.
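
Side by side, the three generated defaults look like this (assuming the standard file locations):

# config/environments/development.rb (Rails 5 default)
config.cache_classes = false
config.eager_load = false

# config/environments/production.rb (Rails 5 default)
config.cache_classes = true
config.eager_load = true

# config/environments/test.rb (Rails 5 default)
config.cache_classes = true
config.eager_load = false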

I am not sure if there’s any rational purpose for having config.cache_classes = false with config.eager_load = true; probably not.

I think there was a poorly documented config.autoload in previous Rails versions, with confusing interactions with the above two configs; I don’t think it exists (or at least does anything) in Rails 5.

Good News

Prior to Rails 5, Rails dev-mode class-reloading and auto-loading were entirely un-thread-safe. If you were using any kind of manual concurrency, you pretty much had to turn off dev-mode class-reloading and auto-loading. Which was too bad, cause they’re convenient and make dev more efficient. If you didn’t, it might sometimes work, but in development (or possibly test) you’d often see those pesky exceptions involving something like “constant is missing”, “class has been redefined”, or “is not missing constant” — I’m afraid I can’t find the exact errors, but perhaps some of these seem familiar.

Rails 5, for the first time, has an architecture which theoretically lets you do manual concurrency in the presence of class reloading/autoloading, thread-safely. Hooray! This is something I had previously thought was pretty much infeasible, but it’s been (theoretically) pulled off. It for instance theoretically makes it possible for Sidekiq to do dev-mode-style class-reloading — although I’m not sure if the latest Sidekiq release actually still has this feature, or if they had to back it out.

The architecture is based on some clever concurrency patterns, so it theoretically doesn’t impact performance or concurrency measurably in production — or even, for the most part, significantly in development.

While the new architecture most immediately affects class-reloading, the new API is, for the most part, not written in terms of reloading; it’s a higher-level API for signaling what you are doing with concurrency: “I’m doing some concurrency here,” in various ways.  This is great, and should be good for the future of Just Works concurrency in Rails beyond class reloading too.  If you are using the new architecture correctly, it theoretically makes ActiveRecord Just Work as well, with less risk of leaked connections, without having to pay lots of attention to it. Great!

I think matthewd is behind much of the new architecture, so thanks matthewd for trying to help move Rails toward a more concurrency-friendly future.

Less Good News

While the failure mode for concurrency used improperly with class-reloading in Rails 4 (which was pretty much any concurrency with class-reloading, in Rails 4) was occasional hard-to-reproduce mysterious exceptions — the failure mode for concurrency used improperly with class-reloading in Rails 5 can be a reproduces-every-time deadlock. Your app just hangs, and it’s pretty tricky to debug why, especially if you aren’t even considering “class-reloading and new Rails 5 concurrency architecture” — which, why would you?

And all the new stuff is, at this point, completely undocumented.  (Update: some docs in rails/rails #27494; I hadn’t seen that before I wrote this.)  So it’s hard to know how to use it right. (I would quite like to encourage an engineering culture where significant changes without docs are considered just as problematic to merge/release as significant changes without tests… but we’re not there yet.) (The Autoloading and Reloading Constants Guide, to which this is very relevant, has not been updated for this ActiveSupport::Reloader stuff, and I think is probably no longer entirely accurate. That would be a good place for some overview docs…)

The new code is a bit tricky and abstract, a bit hard to follow. Some anonymous modules at some points made it hard for me to use my usual, already grimace-inducing methods of code-archeology reverse-engineering, where I normally count on inspecting class names of objects to figure out what they are and where they’re implemented.

The new architecture may still be buggy.  Which would not be surprising for the kind of code it is: pretty sophisticated, concurrency-related, every rails request will touch it somehow, trying to make auto-loading/class-reloading thread-safe when even ordinary ruby require is not (I think this is still true?).  See for instance all the mentions of the “Rails Reloader” in the Sidekiq changelog, going back and forth trying to make it work right — not sure if they ended up giving up for now.

The problem with “maybe buggy” combined with a lack of any docs whatsoever: when you run into a problem, it’s very difficult to tell whether it’s because of a bug in the Rails code, or because you are not using the new architecture the way it’s intended (a bug in your code). Knowing the way it’s intended to work and be used is a bit of a guessing game, or a code archeology project.

We really need docs explaining exactly what it’s meant to do how, on an overall architectural level and a method-by-method level. And I know matthewd knows docs are needed. But there are few people qualified to write those docs (maybe only matthewd), cause in order to write docs you’ve got to know the stuff that’s hard to figure out without any docs. And meanwhile, if you’re using Rails5 and concurrency, you’ve got to deal with this stuff now.

So: The New Architecture

I’m sorry this is so scattered and unconfident, I don’t entirely understand it, but sharing what I got to try to save you time getting to where I am, and help us all collaboratively build some understanding (and eventually docs?!) here. Beware, there may be mistakes.

The basic idea is that if you are running any code in a manually created thread, that might use Rails stuff (or do any autoloading of constants), you need to wrap your “unit of work” in either Rails.application.reloader.wrap { work } or Rails.application.executor.wrap { work }.  This signals “I am doing Rails-y things, including maybe auto-loading”, and lets the framework enforce thread-safety for those Rails-y things when you are manually creating some concurrency — mainly making auto-loading thread-safe again.
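
In its simplest form, a sketch:

Thread.new do
  Rails.application.executor.wrap do
    # Any Rails-y work, possibly triggering autoloads, goes here.
  end
end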

When do you pick reloader vs executor? Not entirely sure, but if you are completely outside the Rails request-response cycle (not in a Rails action method, but instead something like a background job), manually creating your own threaded concurrency, you should probably use Rails.application.reloader.  That will allow code in the block to properly pick up new source under dev-mode class-reloading. It’s what Sidekiq did to add proper dev-mode reloading (not sure what current master Sidekiq is doing, if anything).
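
For a rough illustration, here’s the shape of that pattern in a hand-rolled background worker loop, a minimal sketch where job_queue and the job objects are hypothetical stand-ins:

loop do
  job = job_queue.pop

  # Each unit of work gets its own wrap: code inside can safely
  # autoload, and picks up freshly reloaded classes in dev mode.
  Rails.application.reloader.wrap do
    job.perform
  end
end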

On the other hand, if you are in a Rails action method (which is probably already wrapped in a Rails.application.reloader.wrap), I believe you can’t use a (now nested) Rails.application.reloader.wrap without deadlocking things up. So there you use Rails.application.executor.wrap.

What about in a rake task, or a rails runner-executed script?  Not sure. Rails.application.executor.wrap is probably the safer one — it just won’t get dev-mode class-reloading happening reliably within it (it won’t necessarily immediately, or even ever, pick up changes), which is probably fine.

But to be clear: even if you don’t care about picking up dev-mode class-reloading immediately — unless you turn off dev-mode class-reloading and auto-loading for your entire app — you still need to wrap with a reloader/executor to avoid deadlock, if anything inside the block might possibly trigger an auto-load. And how could you be sure it won’t?

Let’s move to some example code, which demonstrates not just the executor.wrap, but some necessary use of ActiveSupport::Dependencies.interlock.permit_concurrent_loads too.

An actual use case I have — I have to make a handful of network requests in a Rails action method, I can’t really push it off to a bg job, or at any rate I need the results before I return a response. But since I’m making several of them, I really want to do them in parallel. Here’s how I might do it in Rails4:
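
Something like this, a sketch using concurrent-ruby, where urls and make_request stand in for the real details:

require 'concurrent'

# Kick off all the requests concurrently, each in its own thread...
futures = urls.map do |url|
  Concurrent::Future.execute { make_request(url) }
end

# ...then block until each has returned, collecting the results.
results = futures.map(&:value)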

In Rails4, that would work… mostly. With dev-mode class-reloading/autoloading on, you’d get occasional weird exceptions. Or of course you can turn dev-mode class-reloading off.

In Rails 5, you can still turn dev-mode class-reloading/autoloading off and it will still work. But if you have autoload/class-reload on, instead of an occasional weird exception, you’ll get a nearly(?) universal deadlock. Here’s what you need to do instead:
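
A sketch of the same action adapted for Rails 5 (again, urls and make_request are hypothetical stand-ins):

require 'concurrent'

futures = urls.map do |url|
  Concurrent::Future.execute do
    # Signal "I'm doing Rails-y things, maybe autoloading, in this thread"
    Rails.application.executor.wrap do
      make_request(url)
    end
  end
end

# The request thread is holding the autoload interlock. While we block
# waiting on our worker threads, we have to give them permission to
# autoload, or the executor.wrap above can deadlock against us.
results = nil
ActiveSupport::Dependencies.interlock.permit_concurrent_loads do
  results = futures.map(&:value)
end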

And it should actually work reliably, without intermittent mysterious “class unloaded” type errors like in Rails4.

ActiveRecord?

Previously, one big challenge with using ActiveRecord under concurrency was avoiding leaked connections.

I think that if your concurrent work is wrapped in Rails.application.reloader.wrap do or Rails.application.executor.wrap do, this is no longer a problem — they’ll take care of returning any pending checked-out AR db connections to the pool at the end of the block.

So you theoretically don’t need to be so careful about wrapping every single concurrent use of AR in an ActiveRecord::Base.connection_pool.with_connection to avoid leaked connections.

But I think you still can, and it won’t hurt — and it should sometimes lead to shorter, finer-grained checkouts of db connections from the pool, which matters if you potentially have more threads than your AR connection pool size. I am still wrapping in ActiveRecord::Base.connection_pool.with_connection, out of superstition if nothing else.
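
Put together, something like this sketch (SomeModel is a hypothetical AR model):

Thread.new do
  Rails.application.executor.wrap do
    # Theoretically redundant with the executor, but a finer-grained
    # checkout: the connection goes back to the pool as soon as this
    # block ends, rather than at the end of the executor block.
    ActiveRecord::Base.connection_pool.with_connection do
      SomeModel.where(active: true).to_a
    end
  end
end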

Under Test with Capybara?

One of the things that makes Capybara feature tests so challenging is that they inherently involve concurrency — there’s a Rails app running in a different thread than your tests themselves.

I think this new architecture could theoretically pave the way to making this all a lot more intentional and reliable, but I’m not entirely sure; not sure if it helps at all already just by existing, or would instead require Capybara to make use of the relevant API hooks (which nobody’s probably gonna write until there are more people who understand what’s going on).

Note though that Rails 4 generated a comment in config/environments/test.rb that says “If you are using a tool that preloads Rails for running tests [which I think means Capybara feature testing], you may have to set [config.eager_load] to true.”  I’m not really sure how true this was in even past versions of Rails (whether it was necessary or sufficient). This comment is no longer generated in Rails 5, and eager_load is still generated as false… so maybe something improved?

Frankly, that’s a lot of inferences, and I have still been leaving eager_load = true under test in my Capybara-feature-test-using apps, because the last thing I need is more fighting with a Capybara suite that is the closest to reliable I’ve gotten it.

Debugging?

The biggest headache is that a bug in the use of the reloader/executor architecture manifests as a deadlock — and I’m not talking the kind that gives you a ruby ‘deadlock’ exception, but the kind where your app just hangs forever doing nothing. This is painful to debug.

These deadlocks in my experience are sometimes not entirely reproducible, you might get one in one run and not another, but they tend to manifest fairly frequently when a problem exists, and are sometimes entirely reproducible.

First step: experimentally turn off dev-mode class-reloading and auto-loading altogether (config.eager_load = true, config.cache_classes = true), and see if your deadlock goes away. If it does, it probably has something to do with not properly using the new Reloader architecture. In desperation, you could just give up on dev-mode class-reloading, but that’d be sad.
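
That is, temporarily, in config/environments/development.rb:

# Diagnostic: temporarily turn off class-reloading and auto-loading.
config.cache_classes = true
config.eager_load = true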

Rails 5.0.1 introduces a DebugLocks feature intended to help you debug these deadlocks:

Added new ActionDispatch::DebugLocks middleware that can be used to diagnose deadlocks in the autoload interlock. To use it, insert it near the top of the middleware stack, using config/application.rb:

config.middleware.insert_before Rack::Sendfile, ActionDispatch::DebugLocks

After adding, visiting /rails/locks will show a summary of all threads currently known to the interlock.

PR, or at least initial PR, at rails/rails #25344.

I haven’t tried this yet, I’m not sure how useful it will be, I’m frankly not too enthused by this as an approach.

References

  • Rails.application.executor and Rails.application.reloader are initialized here, I think.
  • Not sure of the design intent of: Executor being an empty subclass of ExecutionWrapper; Rails.application.executor being an anonymous sub-class of Executor (which doesn’t seem to add any behavior either? Rails.application.reloader does the same thing, fwiw); or whether further configuration of the Executor is done in other parts of the code.
  • Sidekiq PR #2457 Enable code reloading in development mode with Rails 5 using the Rails.application.reloader; I believe the code may have been written by matthewd. This is a good intro example of using the architecture as intended (since matthewd wrote/signed off on it), but beware churn in Sidekiq code around this stuff dealing with issues and problems after this commit as well — not sure if Sidekiq later backed out of this whole feature?  But the Sidekiq source is probably a good one to track.
  • A dialog in Rails Github Issue #686 between me and matthewd, where he kindly leads me through some of the figuring out how to do things right with the new arch. See also several other issues linked from there, and links into Rails source code from matthewd.

Conclusion

If I got anything wrong, or you have any more information you think useful, please feel free to comment here — and/or write a blog post of your own. Collaboratively, maybe we can identify if not fix any outstanding bugs, write docs, maybe even improve the API a bit.

While the new architecture holds the promise to make concurrent programming in Rails a lot more reliable — making dev-mode class-reloading at least theoretically possible to do thread-safely, when it wasn’t at all possible before — in the short term, I’m afraid it’s making concurrent programming in Rails a bit harder for me.  But I bet docs will go a long way there.


A class_eval monkey-patching pattern with prepend

Yes, it’s best to avoid “monkey-patching” — changing an already loaded ruby class by reopening the class to add or replace methods.

But sometimes you’ve got no choice, because a dependency just doesn’t give you the API you need to do what you need, or has a bug that hasn’t been fixed in a release you can use yet.

And in some cases I really do think it actually makes sense to make your customization to a dependency in the most forward-compatible way, surgically targeted so you avoid replacing or copy-pasting code you _don’t_ want to customize, making it most likely your code will keep working with future releases of the dependency.

Module#prepend, added in Ruby 2.0, makes it easier to do this kind of surgical intervention, because you can monkey-patch a new method replacing an original implementation, and still call super to call default/original implementation of that very same method. Something you couldn’t do before to methods that were implemented directly in the original class-at-hand (rather than a module/superclass it includes/extends).
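
A quick standalone illustration of what prepend-plus-super buys you:

class Greeter
  def hello
    "hello"
  end
end

module LoudGreeting
  def hello
    # super reaches the original Greeter#hello, even though we're
    # overriding a method defined directly on the class itself.
    super.upcase + "!"
  end
end

Greeter.prepend(LoudGreeting)
Greeter.new.hello # => "HELLO!"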

But a Module you are going to prepend can’t include “class macros” — class methods like ActiveRecord’s `validates`, for instance.  For a module that’s going to be included in the more normal way, ActiveSupport::Concern in Rails can let ‘class macros’ live sensibly in the module — but AS::Concern has no support for prepend, so it’s not gonna help here.  (Maybe a PR to Rails? If I had some indication that Rails maintainers might be interested in such a PR, I might try to see if I could make something reasonable, but I hate working on tricky stuff only to have maintainers reject it as something they’re not interested in.)

You might be able to hack something up yourself with Module#prepended, similar to an implementation one could imagine being part of AS::Concern. But I don’t — I just Use The Ruby. Here’s how I do my class_eval monkey-patches with prepend, trying to keep everything as non-magical as possible, and without diminishing readability too much from when we just used class_eval without Module#prepend:


# Spell out the entire class name, so that if it's not defined yet
# we'll get a raise -- we don't want to accidentally define it
# fresh here when we're expecting to be monkey-patching.
Some::Dependency::Foo.class_eval do
  # 'class macros' go here
  validates :whatever, presence: true

  # We want the instance methods inline here for legibility,
  # looking kind of like an ordinary class. But we want
  # to use prepend. And giving it a name rather than an
  # anonymous module can help with stack traces and other debugging.
  # This is one way to do all that:
  prepend(FooExtension = Module.new do
    def existing_method
      if custom_guard_logic
        return false
      end

      super
    end
  end)
end

Last part: I put all these extensions in a directory I create, ./app/extensions

Because of what I’ll show you next, you can call these files whatever you want, so I put them in the same directory structure and with the same name as the original file being patched, but with _extension on the end. So the above would be at ./app/extensions/some/dependency/foo_extension.rb.

And then I put this to_prepare in my ./config/application.rb, to make sure all these monkey-patch extensions get loaded under dev-mode class-reloading, properly affecting the thing they are patching even if that thing is dev-mode class-reloaded too:

    config.to_prepare do
      # Load any monkey-patching extensions in to_prepare for
      # Rails dev-mode class-reloading.
      Dir.glob(File.join(File.dirname(__FILE__), "../app/extensions/**/*_extension.rb")) do |c|
        Rails.configuration.cache_classes ? require(c) : load(c)
      end
    end
So there you go. This seems to be working for me; I arrived at this pattern in fits and starts, copying techniques from other projects and figuring out what worked best for me.


Segmenting “Catalog” and “Articles” in EDS API

About 4 years ago, I posted a long position paper arguing that a “bento-style” search  was appropriate for the institution I then worked at. (I’ve taken to calling it a “search dashboard” approach too since then.)   The position paper stated that this recommendation was being made in the face of actually existing technical/product constraints at the time; as well as with the (very limited) research/evidence we had into relevant user behavior and preferences. (And also because for that institution at the time, a bento-style search could be implemented without any additional 6-figure software licenses, which some of the alternatives entailed).

I never would have expected that 4 years later the technical constraint environment would be largely unchanged, and we would not have (so far as I’m aware) any significant additional user research (If anyone knows about any write-ups, please point us to them). But here we are. And “bento style” search has kind of taken over the landscape.

Putting reasons for that and evaluations of whether it’s currently the best decision aside, for a client project I have been implementing a “bento style” search dashboard with several of the components targeting the EDS API.  (The implementation is of course using the bento_search gem; expect a new release in the near future with many enhancements to the EDS adapter.)

The client wanted to separate “Catalog” and “Article” results in separate “bento” boxes — clicking the “see all results” link should take the user to the EDS standard interface, still viewing results limited to “Catalog” and “Articles”. It was not immediately clear how to best accomplish that in EDS.  The distinction could be based on actual source of the indexed records (indexed from local ILS, vs EDS central index), or on format (‘article’ vs ‘monograph and things like that’, regardless of indexing source).  I was open to either solution in exploring possibilities.

I sent a query to the Code4Lib listserv for people doing this with EDS and discovered: this is indeed a very popular thing to do with EDS, and people are doing it in a whole variety of different kinds of hacky ways.  My conclusion is that the best way might be creating a custom EDS “limiter” corresponding to a “(ZT articles)” query, but I’m not sure if anyone is actually doing that, and I haven’t tried it yet myself.

Possibilities identified in people’s off-list responses to me:

  • Some people actually just use full un-limited EDS results for “Articles”, even though it’s labelled “Articles”! Obviously not a great solution.

  • Some people set up different EDS ‘profiles’, one which just includes the Catalog source/database, and one which includes all source/databases except ‘Catalog’.  This works, but I think doesn’t give the user a great UI for switching back and forth once they are in the EDS standard interface, or choosing to search over everything once they are there — although my client ultimately decided this was good enough, or possibly even preferred to keep ‘catalog’ and ‘articles’ entirely separate in the UI.
  • One person was actually automatically adding “AND (ZT article)” to the end of the user-entered queries. Which actually gives great results. Interestingly, it even returns some results marked “Book” format type in EDS — because they are book chapters, which actually seems just right. On the API end, this is just fine to invisibly add an “AND (ZT article)” to the end of the query. But once we direct to ‘view all results’, redirecting to a query that has “AND (ZT article)” at the end looks sloppy, and doesn’t give the user a good UI for choosing to switch between articles, catalog, and everything, once they are there in the EDS standard interface.

  • Some people are using the EDS format “source type” facets, limiting to certain specified ‘article-like’ values.  That doesn’t seem as good as the “(ZT article)” hack, because it won’t include things like book chapters that are rightly included in “(ZT article)”.  But it may be good enough, or the best available option.  But, while I believe I can do that limit fine from the API, I haven’t figured out any way to ‘deep link’ into EDS results with a pre-selected query that has pre-selected “source type” facet limits.  Not sure if there are any parameters I can add on to the `?direct=true&bquery=your-query-here`  “deep link” URL to pre-select source type facets.

Through some communication with an internal EDS developer contact, I learned it ought to be possible to create a custom “limiter” in EDS corresponding to the AND (ZT articles) hack. I’m not sure if anyone is actually doing this, but it sounds good to me for making an “Articles Only” limiter which can be used in both standard EDS and via the API. The instructions I was given were:

Good news, we can do your best option here.  We’ve got a feature
called “Custom Limiters” that should do the trick.

http://search.ebscohost.com/login.aspx?direct=true&scope=site&site=eds-live&authtype=guest&custid=ericfrier&groupid=main&profile=eds_fgcu%20&bquery=nanotechnology+AND+PZ+Article

Take a look at how this search “pre-selects” the custom limiter and
removes the syntax from the search query.

In order to accomplish this, the library needs to add a custom
limiter for the specific search syntax you’d like to use.  In
this case, this needs to be pasted in the bottom branding of their
EDS profile:

<script type="text/javascript" src="http://widgets.ebscohost.com/prod/simplekey/customlimiters/limiter.php?modifier=AND%20PZ%20Article&label=Articles%20Only&id=artonly"></script>


This script catches any use of “AND PZ Article” and instead simulates
a limiter based on that search syntax.

I haven’t actually tried this myself yet, but it sounds like it should probably work (modulo the typo “PZ Article” for “ZT Article”, which I think is the right one to use on EDS).  Hard to be sure of anything until you try it out extensively with EDS API, but sounds good to me.


Getting full-text links from EDS API

The way full-text links are revealed (or at least, um, “signalled”) in the EBSCO EDS API  is… umm… both various and kind of baroque.

I paste below a personal communication from an EBSCO developer containing an overview. I post this as a public service, because this information is very unclear and/or difficult to figure out from actual EBSCO documentation, and would also be very difficult to figure out from observation-based reverse-engineering, because there are so many cases.

Needless to say, it’s also pretty inconvenient to develop clients for, but so it goes.

There are a few places to look for full-text linking in our API.  Here’s an overview:

a.       PDF FULL TEXT: If the record has a {Record}.FullText.Links element, and the Link elements inside have a Type element that equals “pdflink”, then that means there is a PDF available for this record on the EBSCOhost platform.  The link to the PDF document does not appear in the SEARCH response, but the presence of a pdflink-type Link should be enough to prompt the display of a PDF Icon.  To get to the PDF document, the moment the user clicks to request the full text link, first call the RETRIEVE method with the relevant identifiers (accession number and database ID from the Header of the record), and you will find a time-bombed link directly to the PDF document in the same place (FullText.Links[i].Url) in the detailed record.  You should not display this link on the screen because if a user walks away and comes back 15 minutes later, the link will no longer work.  Always request the RETRIEVE method when a user wants the PDF document.

b.       HTML FULL TEXT: If the record has a {Record}.FullText.Text.Availability element, and that is equal to “1”, then that means the actual content of the article (the full text of it) will be returned in the RETRIEVE method for that item.  You can display this content to the user any way you see fit.  There is embedded HTML in the text for images, internal links, etc.

c.       EBOOK FULL TEXT: If the record has a {Record}.FullText.Links element, and the Link elements inside have a Type element that equals “ebook-epub” or “ebook-pdf”, then that means there is an eBook available for this record on the EBSCOhost platform.  The link to the ebook does not appear in the SEARCH response, but the presence of an ebook-type Link should be enough to prompt the display of an eBook Icon.  To get to the ebook document, the moment the user clicks to request the full text link, first call the RETRIEVE method with the relevant identifiers (accession number and database ID from the Header of the record), and you will find a time-bombed link directly to the ebook in the same place (FullText.Links[i].Url) in the detailed record.  You should not display this link on the screen because if a user walks away and comes back 15 minutes later, the link will no longer work.  Always request the RETRIEVE method when a user wants the ebook.

d.       856 FIELD FROM CATALOG RECORDS: I don’t think you are dealing with Catalog records, right?  If not, then ignore this part.  Look in {Record}.Items.  For each Item, check the Group element.  If it equals “URL”, then the Data element will contain a link we found in the 856 Field from their Catalog, along with the link label in the Label element.

e.       EBSCO SMARTLINKS+: These apply if the library subscribes to a journal via EBSCO Journal Service.  They are direct links to the publisher platform, similar to the custom link.  If the record has a {Record}.FullText.Links element, and the Link elements inside have a Type element that equals “other”, then that means there is a direct-to-publisher-platform link available for this record.  The link to the PDF document does not appear in the SEARCH response, but the presence of an other-type Link should be enough to prompt the display of a Full Text Icon.  To get to the link, the moment the user clicks to request the full text link, first call the RETRIEVE method with the relevant identifiers (accession number and database ID from the Header of the record), and you will find a time-bombed link directly to the document in the same place (FullText.Links[i].Url) in the detailed record.  You should not display this link on the screen because if a user walks away and comes back 15 minutes later, the link will no longer work.  Always request the RETRIEVE method when a user wants the link.

f.        FULL TEXT CUSTOMLINKS: Look in the {Record}.FullText.CustomLinks element.  For each element there, you’ll find a URL in the Url element, a label in the Text element, and an icon if provided in the Icon element.

g.       Finally, we have NON-FULLTEXT CUSTOMLINKS that point to services like ILLiad, the catalog, or other places that will not end up at full text.  You’ll find these at {Record}.CustomLinks.  For each element there, you’ll find a URL in the Url element, a label in the Text element, and an icon if provided in the Icon element.
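
To make case ‘a’ concrete, here’s a minimal sketch of the PDF flow as I understand it, assuming a record already parsed from the JSON SEARCH response into a Ruby hash:

# Does this SEARCH-response record advertise an EBSCOhost PDF?
def pdf_available?(record)
  links = record.dig("FullText", "Links") || []
  links.any? { |link| link["Type"] == "pdflink" }
end

# Don't store or display the PDF URL itself: it's time-bombed. At click
# time, call the RETRIEVE method with the record's accession number and
# database ID, and read FullText.Links[i].Url from the fresh detailed record.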

One further variation (there are probably more I haven’t discovered yet): for case ‘d’ above, the ... object sometimes has a bare URL in it as described above, but other times has escaped text which, when unescaped, becomes the source for a weird XML node which has the info you need in it. It’s not clear if this varies due to configuration, due to which ILS your EDS is connecting to, or something else. In my case, I inexplicably see it sometimes change from request to request.

Maybe beware of microservices

In a comment on my long, um,  diatribe last year about linked data, Eric Hellman suggested “I fear the library world has no way to create a shared technology roadmap that can steer it away from dead ends that at one time were the new shiny,” and I responded “I think there’s something to what you suggest at the end, the slow-moving speed of the library community with regard to technology may mean we’re stuck responding with what seemed to be exciting future trends…. 10+ years ago, regardless of how they’ve worked out since. Perhaps if that slow speed were taken into account, it would mean we should stick to well established mature technologies, not “new shiny” things which we lack the agility to respond to appropriately.”

I was reminded of this recently when running across a blog post about “Microservices”, which I also think were very hyped 5-10 years ago, but lately are approached with a lot more caution in the general software engineering industry, as a result of hard-earned lessons from practice.

Sean Kelly, in Microservices? Please, Don’t, does write about some of the potential advantages of microservices, but as you’d expect from the title, mainly focuses on pitfalls engineers have learned through working with microservice architectures. He warns:

When Should You Use Microservices?

“When you’re ready as an engineering organization.”

I’d like to close by going over when it could be the right time to pivot to this approach (or, if you’re starting out, how to know if this is the right way to start).

The single most important step on the path to a solid, workable approach to microservices is simply understanding the domain you’re working in. If you can’t understand it, or if you’re still trying to figure it out, microservices could do more harm than good. However, if you have a deep understanding, then you know where the boundaries are and what the dependencies are, so a microservices approach could be the right move.

Another important thing to have a handle on is your workflows – specifically, how they might relate to the idea of a distributed transaction. If you know the paths each category of request will make through your system and you understand where, how, and why each of those paths might fail, then you could start to build out a distributed model of handling your requests.

Alongside understanding your workflows is monitoring your workflows. Monitoring is a subject greater than just “Microservice VS Monolith,” but it should be something at the core of your engineering efforts. You may need a lot of data at your fingertips about various parts of your systems to understand why one of them is underperforming, or even throwing errors. If you have a solid approach for monitoring the various pieces of your system, you can begin to understand your systems behaviors as you increase its footprint horizontally.

Finally, when you can actually demonstrate value to your engineering organization and the business, then moving to microservices will help you grow, scale, and make money. Although it’s fun to build things and try new ideas out, at the end of the day the most important thing for many companies is their bottom line. If you have to delay putting out a new feature that will make the company revenue because a blog post told you monoliths were “doing it wrong,” you’re going to need to justify that to the business. Sometimes these tradeoffs are worth it. Sometimes they aren’t. Knowing how to pick your battles and spend time on the right technical debt will earn you a lot of credit in the long run.

Now, I think many library and library industry development teams actually are pretty okay at understanding the domain and workflows. With the important caveat that ours tend to end up so complex (needlessly or not), that they can be very difficult to understand, and often change — which is a pretty big caveat, for Kelly’s warning.

But monitoring?  In library/library industry projects?  Years (maybe literally a decade) behind the software industry at large.  Which I think is actually just a pointer to a general lack of engineering capabilities (whether skill or resource based) in libraries (especially) and the library industry (including vendors, to some extent).

Microservices are a complicated architecture. They are something to do not only when there’s a clear benefit you’re going to get from them, but when you have an engineering organization that has the engineering experience, skill, resources, and coordination to pull off sophisticated software engineering feats.

How many library engineering organizations do you think meet that?  How many library engineering organizations can even be called ‘engineering organizations’?

Beware, when people are telling you microservices are the new thing or “the answer”. In the industry at large, people and organizations have been burned by biting off more than they can chew in a microservice-based architecture, even starting with more sophisticated engineering organizations than most libraries or many library sector vendors have.


“Internet Archive Successfully Fends Off Secret FBI Order”

https://theintercept.com/2016/12/01/internet-archive-fends-off-secret-fbi-order-in-latest-victory-against-nsls/

A DECADE AGO, the FBI sent Brewster Kahle, founder of the Internet Archive, a now-infamous type of subpoena known as a National Security Letter, demanding the name, address and activity record of a registered Internet Archive user. The letter came with an everlasting gag order, barring Kahle from discussing the order with anyone but his attorney — not even his wife could know.

But Kahle did eventually talk about it, calling the order “horrendous,” after challenging its constitutionality in a joint legal effort with the Electronic Frontier Foundation and the American Civil Liberties Union. As a result of their fight, the FBI folded, rescinding the NSL and unsealing associated court records rather than risk a ruling that their surveillance orders were illegal. “This is an unqualified success that will help other recipients understand that you can push back on these,” Kahle told reporters once the gag order was lifted.

The bureau continued to issue tens of thousands of NSLs in subsequent years, but few recipients followed in Kahle’s footsteps. Those who did achieved limited but important transparency gains; as a result of one challenge, a California District Court ruled in 2013 that the everlasting gag orders accompanying NSLs are unconstitutional, and last year Congress passed a law forcing the FBI to commit to periodically reviewing such orders and rescinding them when a gag is no longer necessary to a case.

Now, Kahle and the archive are notching another victory, one that underlines the progress their original fight helped set in motion. The archive, a nonprofit online library, has disclosed that it received another NSL in August, its first since the one it received and fought in 2007. Once again it pushed back, but this time events unfolded differently: The archive was able to challenge the NSL and gag order directly in a letter to the FBI, rather than through a secretive lawsuit. In November, the bureau again backed down and, without a protracted battle, has now allowed the archive to publish the NSL in redacted form.…


“Harvesting Government History, One Web Page at a Time”

http://www.nytimes.com/2016/12/01/nyregion/harvesting-government-history-one-web-page-at-a-time.html

With the arrival of any new president, vast troves of information on government websites are at risk of vanishing within days. The fragility of digital federal records, reports and research is astounding.

No law protects much of it, no automated machine records it for history, and the National Archives and Records Administration announced in 2008 that it would not take on the job.

“Large portions of dot-gov have no mandate to be taken care of,” said Mark Phillips, a library dean at the University of North Texas, referring to government websites. “Nobody is really responsible for doing this.”

Enter the End of Term Presidential Harvest 2016 — a volunteer, collaborative effort by a small group of university, government and nonprofit libraries to find and save valuable pages now on federal websites. The project began before the 2008 elections, when George W. Bush was serving his second term, and returned in 2012.

It recorded, for example, the home page of the United States Central Command on Sept. 16, 2008, and the State Department’s official blog on February 13, 2013. The pages are archived on servers operated by the project, and are available to anyone.

The ritual has taken on greater urgency this year, Mr. Phillips said, out of concern that certain pages may be more vulnerable than usual because they contain scientific data for which Mr. Trump and some of his allies have expressed hostility or contempt.


Three articles on information ethics and power

Today I happened to come across three very good articles which to me all seemed to form a theme: Ethical and political considerations of information and information technology.

First, Weaponized data: How the obsession with data has been hurting marginalized communities

Consider contexts and who is driving the data: The problem of people not from communities affected by communities making decisions for those who are is very prevalent in our field, and the work around data is no exception. Who created the data? Was the right mix of people involved? Who interpreted the data? The rallying cry among marginalized communities is “Stop talking about us without us,” and this applies to data collection and interpretation.

I think there are deeper things to be said about ‘weaponized data’ too, which have been rattling around in my brain for a while; this essay is a useful contribution to the mix.

For more on measurement and data as a form of power and social control, and not an ‘objective’ or ‘neutral’ thing at all, see James C. Scott’s Seeing Like a State, and the works of Michel Foucault.

Second, from Business Insider, Programmers are having a huge discussion about the unethical and illegal things they’ve been asked to do by Julie Bort.

I’m not sure I buy the conclusion that “what developers really need is an organization that governs and regulates their profession like other industries have” — professional licensure for developers, you can’t pay someone to write a program unless they are licensed? I don’t think that’s going to work, and it’s kind of the opposite of democratization of making software that I think is actually important.

But requiring pretty much any IT program anywhere to include 3 credits of ethics would be a good start, and is something academic credentialing organizations can easily do.

“We rule the world,” he said. “We don’t know it yet. Other people believe they rule the world but they write down the rules and they hand them to us. And then we write the rules that go into the machines that execute everything that happens.”

I don’t think that means we “rule the world”. It means we’re tools.  But increasingly important and powerful ones. Be careful whose rule you are complicit with.

Thirdly and lastly but not leastly, a presentation by Tara Robertson, Not all information wants to be free. (Thanks for the link Sean Hannan via facebook).

I can’t really find a pull quote to summarize this one, but it’s a really incredible lecture you should go and read. Several case studies in how ‘freeing information’ can cause harm, to privacy, safety, cultural autonomy, and dignity.

This is not a topic I’ve spent a lot of time thinking about, and Robertson provides a very good entry to it.

The original phrase “information wants to be free” was not of course meant to say that people wanted information to be free. Quite the opposite, it was that many people, especially people in positions of power did not want information to be free — but it is very difficult to keep information under wraps, it tends toward being free anyway.

But yes, especially people in positions of power — the hacker assumption was that the digital-era acceleration of information’s tendency toward unrestricted distribution would be a net gain for freedom and popular power.  Sort of the “wikileaks thesis”, eh?  I think the past 20 years have definitely dashed the hacker-hippy techno-utopianism of Stewart Brand and Mondo 2000 in a dystopian world of state panopticon, corporate data mining (see the first essay on data as a form of power, eh?), information-overload distraction and information-bubble ignorance.

Information may want to be free, but the powerful aren’t the only ones that are harmed when it becomes so.

Still, while it perhaps makes sense for a librarian’s conference closing lecture, I can’t fully get behind Robertson’s conclusion:

I’d like to ask you to listen to the voices of the people in communities whose materials are in the collections that we care for. I’d also like to invite you to speak up where and when you can. As a profession we need to travel the last mile to build relationships with communities and listen to what they think is appropriate access, and then build systems that respect that.

Yes, and no. “Community’s” ideas of “appropriate access” can be stifling and repressive too,  as the geeks and queers and weirdos who grew up to be hackers and librarians know well too.   Just because “freeing” information can do and has done  real harm to the vulnerable, it doesn’t mean the more familiar story of censorship as a form of political control by the powerful isn’t also often true.

In the end, all three of these essays I encountered today, capped off by Robertson’s powerful one, remind us that information is power, and, like all power, its formation and expression and use is never neutral; it has real consequences, for good and ill, intended and unintended. Those who work with information need to think seriously about their ethical responsibilities with regard to the power they wield.


Rubyland: A new ruby news and blog feed aggregator

So I thought there should be a site aggregating ruby rss/atom feeds. As far as I’m aware, there hasn’t been a really maintained one for a couple years now.

So in my spare time on my own, I made one, that worked the way I wanted. http://www.rubyland.news.

The source is open at github.

I’ve got a few more features planned still.

It’s running on a free heroku dyno with a free postgres. This works out — the CPU needs of an RSS aggregator are not very high. But it does limit things in some ways, such as no SSL/https.  If any organization is interested in sponsoring Rubyland with a modest contribution to pay for hosting costs and make more things possible, get in touch.

Most people seem to approach feed aggregators with a tool that produces static HTML. I decided to make a dynamic site to make certain things possible/easier, and use the tools I knew. But since the content is of course mostly static, there’s a lot of caching going on. Rails fragment caching over the entire page, as well as etags delivered to browsers.
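
For the curious, a rough sketch of the kind of thing I mean (the real model and controller names differ):

class EntriesController < ApplicationController
  def index
    @entries = Entry.order(published_at: :desc).limit(100)

    # HTTP caching: set ETag/Last-Modified from the collection, so
    # browsers get a cheap 304 when nothing has changed.
    fresh_when(@entries)
  end
end

# And in the view, fragment caching over essentially the whole page:
#   <% cache @entries do %> ... <% end %>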

Some other interesting features of the code include: flexbox for responsive display with zero media queries, which was fun (although I think I’ll have to add a media query for a UI element I’m going to add soon); the reddit API for live comment counts on /r/ruby; and feedjira providing a great assist in dealing with feed idiosyncrasies.

But beyond the code (which was fun to write), I’m hoping the Rubyland aggregator can be a valuable resource for rubyists and help (re-)strengthen the ruby online community, which is in a bit of a weird state these days.


flexbox is so nice there’s really no need for ‘grid’ frameworks anymore

That’s really all I got to say. I guess I should start tweeting or something.
