Preservation in a war zone

On the cover of today’s NYTimes (print Washington edition)

Race in Iraq and Syria to Record and Shield Art Falling to ISIS
By ANNE BARNARD MARCH 8, 2015

BAGHDAD — In those areas of Iraq and Syria controlled by the Islamic State, residents are furtively recording on their cellphones damage done to antiquities by the extremist group. In northern Syria, museum curators have covered precious mosaics with sealant and sandbags….

…There was also the United States invasion in 2003, when American troops stood by as looters ransacked the Baghdad museum, a scenario that, Mr. Shirshab suggested, is being repeated today….

…The Babylon preservation plan also includes new documentation of the site, including brick-by-brick scale drawings of the ruins. In the event the site is destroyed, Mr. Allen said, the drawings can be used to rebuild it….

…The American invasion alerted archaeologists to what needed protecting. After damage and looting at many sites, documentation and preservation accelerated. One result was that the Mosul Museum, attacked by the Islamic State, had been digitally cataloged…

…He oversees an informal team of Syrians he has nicknamed the Monuments Men, many of them his former students. They document damage and looting by the Islamic State, pushing for crackdowns on the black market. Recently, the United Nations banned all trade in Syrian artifacts….

…Now, Iraqi colleagues teach conservators and concerned residents simple techniques to use in areas controlled by the Islamic State, such as turning on a cellphone’s GPS function when photographing objects, to help trace damage or theft, or to add sites to the “no-strike” list for warplanes….


Factors to prioritize (IT?) projects in an academic library

  • Most important: Impact vs. Cost
    • Impact is how many (what portion) of your patrons will be affected, and how profound the benefit may be to their research, teaching, and learning.
    • Cost may include hardware or software costs, but for most projects we do the primary cost is staff time.
    • You are looking for the projects with the greatest impact at the lowest cost.
    • If you want to try to quantify, it may be useful to simply estimate three quantities:
      • Portion of userbase impacted (1-10 for 10% to 100% of userbase impacted)
      • Profundity of impact (estimate on a simple scale, say 1 to 3 with 3 being the highest)
      • “Cost” in terms of time. Estimate with only rough granularity knowing estimates are not accurate. 2 weeks, 2 months, 6 months, 1 year. Maybe assign those on a scale from 1-4.
      • You could then simply compute (portion * profundity) / cost, and look for the largest values. Or you could plot on a graph with (benefit = portion * profundity) on the x-axis, and cost on the y-axis. You are looking for projects near the lower right of the graph — high benefit, low cost. (See the sketch just after this list.)
  • Demographics impacted. Will the impact be evenly distributed, or will it be greater for certain demographics? Discipline/school/department? Researcher vs grad student vs undergrad?
    • Are there particular demographics which should be prioritized, because they are currently under-served or because focusing on them aligns with strategic priorities?
  • Types of services or materials addressed.  Print items vs digital items? Books vs journal articles? Other categories?  Again, are there service areas that have been neglected and need to be brought to par? Or service areas that are strategic priorities, and others that will be intentionally neglected?
  • Strategic plans. Are there existing library or university strategic plans? Will some projects address specific identified strategic focuses? These can also be used to determine the prioritized demographics or service areas above.
    • Ideally all of this is informed by strategic vision, where the library organization wants to be in X years, and what steps will get you there. And ideally that vision is already captured in a strategic plan. Few libraries may have this luxury of a clear strategic vision, however.
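To make the scoring idea above concrete, here is a minimal sketch in ruby; the project names and numbers are entirely made up, and the weighting is just the simple formula described in the list, nothing more authoritative:

# Hypothetical projects scored on the rough scales described above:
# portion 1-10, profundity 1-3, cost 1-4 (bigger = more staff time)
projects = [
  { name: "Discovery layer upgrade",    portion: 8, profundity: 2, cost: 3 },
  { name: "Special collections viewer", portion: 2, profundity: 3, cost: 2 },
  { name: "Fix broken proxy links",     portion: 9, profundity: 1, cost: 1 },
]

# score = (portion * profundity) / cost; higher is better
scored = projects.map do |p|
  p.merge(score: (p[:portion] * p[:profundity]).to_f / p[:cost])
end

scored.sort_by { |p| -p[:score] }.each do |p|
  puts format("%-28s %.1f", p[:name], p[:score])
end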

ethical code for software engineering professionals?

Medical professionals have professional ethical codes. For instance, the psychologists who (it is argued) helped devise improved torture methods for the U.S. government are accused of violating the ethical code of their profession.

Do software engineers and others who write software have professional ethical duties?

Might one of them be to do one’s best to create secure software (rather than intentionally releasing software with vulnerabilities so that people in the know can exploit them), and to responsibly disclose any security vulnerabilities found in third-party software (rather than keeping them close so they can be used for exploits)?

If so, are the software developers at the NSA (and, more likely, government contractors working for the NSA) guilty of unethical behavior?

Of course, the APA policy didn’t keep the psychologists from doing what they did, and there is some suggestion that the APA even intentionally made sure to leave enough loopholes, which it may now regret. And there have been similar controversies within anthropology. There’s no magic bullet for ethical behavior in simply writing rules, but I still think a code is a useful point for inquiry: at least it acknowledges that there is such a thing as professional ethics for the profession, and provides official recognition that these discussions are part of the profession.

Are there ethical duties of software engineers and others who create software? As software becomes more and more socially powerful, is it important to society that this be recognized? Are these discussions happening? What professional bodies might they take place in? (IEEE? ACM?) The ACM has a code of ethics, but it’s pretty vague; it seems easy to justify just about any profit-making activity under it.

Are these discussions happening? Will the extensive Department of Defense funding of Computer Science (theoretical and applied) in the U.S. make it hard to have them? (When I googled, the discussion that came up of how DoD funding affects computer science research was from 1989 — there may be self-interested reasons people aren’t that interested in talking about this.)


Be careful of regexes in a unicode world

Check out the following, which I wrote some time ago:

    # remove non-alphanumeric, excluding apostrophe; replace with space
    title.gsub!(/[^\w\s\']/, ' ') 

See any problem with that? What are \w and \s again? The ruby docs helpfully explain:

/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - A whitespace character: /[ \t\r\n\f]/

See the problem yet?

"el revolución".gsub(/[^\w\s\']/, ' ')
# => "el revoluci n"

Oops. ó is not in the class [a-zA-Z0-9_]. \w doesn’t actually mean “a word character” at all, unless your input is only ascii. The docs probably really should warn you about this, describing the class as “an ascii word character”, and warning you to use other metacharacters if you aren’t just dealing with ascii.

Fortunately, ruby also provides some unicode-aware regex character classes, but they’re a lot harder to remember and longer to type. Here it is done right; let’s use the unicode-aware space class instead of `\s` too:

"el: revolución".gsub(/[^[[:alnum:]][[:space:]]\']/, ' ')
#=> "el  revolución"

Yep, that’s what we wanted. There are several other unicode-aware character classes, apparently defined by POSIX. The docs also say there are a couple of non-POSIX ones, including:

/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation

I wasn’t able to make that work; it didn’t seem to be recognized in my ruby. I’m not sure why, and didn’t bother finding out. What works is good enough for me.

But in a non-ascii world, it turns out, you almost never actually want to use those traditional regex character-class metacharacters that many of us have been using for decades. \w and \s, no way. \d is less risky, since you probably really do mean 0-9 and not digits from some other script, but that had better be what you mean.
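For instance, a quick check of the difference in irb (assuming a UTF-8 source encoding; the Arabic-Indic digit is just an arbitrary example):

# "٣" is the Arabic-Indic digit three
"٣" =~ /\d/           # => nil, \d is strictly [0-9]
"٣" =~ /[[:digit:]]/  # => 0, the POSIX-style class is unicode-aware
"3" =~ /\d/           # => 0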


Ruby threads, gotcha with local vars and shared state

I end up doing a fair amount of work with multi-threading in ruby. (There is some multi-threaded concurrency in Umlaut, bento_search, and traject).  Contrary to some belief, multi-threaded concurrency can be useful even in MRI ruby (which can’t do true parallelism due to the GIL), for tasks that spend a lot of time waiting on I/O, which is the purpose in Umlaut and bento_search (in both cases waiting on external HTTP apis). Traject uses multi-threaded concurrency for true parallelism in jruby (or soon rbx) for high performance.

There’s a gotcha with ruby threads that I haven’t seen covered much. What do you think this code will output from the ‘puts’?

value = 'original'

t = Thread.new do
  sleep 1
  puts value
end

value = 'changed'

t.join

It outputs “changed”. The local var `value` is shared between both threads; changes made in the primary thread affect the value of `value` in the created thread too. This is an issue not unique to threads, but a result of how closures work in ruby — the local variables used in a closure don’t capture a fixed value at the time of closure creation, they are references to the original local variables. (I’m not entirely sure if this is traditional for closures, whether some other languages do it differently, or what the correct CS terminology for all this is.) It confuses people in other contexts too, but it can especially lead to problems with threads.

Consider a loop which in each iteration prepares some work to be done, then dispatches to a thread to actually do the work.  We’ll do a very simple fake version of that, watch:

threads = []
i = 0
10.times do
  # pretend to prepare a 'work order', which ends up in local
  # var i
  i += 1
  # now do some stuff with 'i' in the thread
  threads << Thread.new do
    sleep 1 # pretend this is a time consuming computation
    # now we do something else with our work order...
    puts i
  end
end

threads.each {|t| t.join}

Do you think you’ll get “1”, “2”, … “10” printed out? You won’t. You’ll get 10 10’s. (With newlines in random places because of interleaving of ‘puts’, but that’s not what we’re talking about here.) You thought you dispatched 10 threads, each with a different value for ‘i’, but the threads are actually all sharing the same ‘i’; when it changes, it changes for all of them.

Oops.

Ruby stdlib Thread.new has a mechanism to deal with this, although like much in ruby stdlib (and much about multi-threaded concurrency in ruby), it’s under-documented. But you can pass args to Thread.new, which will be passed to the block too, and allow you to avoid this local var linkage:

require 'thread'

value = 'original'

t = Thread.new(value) do |t_value|
  sleep 1
  puts t_value
end

value = 'changed'

t.join

Now that prints out “original”. That’s the point of passing one or more args to Thread.new.
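Applied to the earlier loop example, the same technique would look something like this (just a sketch; passing `i` as an argument gives each thread its own `t_i`, decoupled from the shared local var):

threads = []
i = 0
10.times do
  i += 1
  # pass i to Thread.new, so each thread gets its own copy as t_i
  threads << Thread.new(i) do |t_i|
    sleep 1
    puts t_i
  end
end

threads.each {|t| t.join}

That prints each of 1 through 10 (in some unpredictable order, with interleaved newlines, as before).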

You might think you could get away with this instead:

require 'thread'

value = 'original'

t = Thread.new do
  # nope, not a safe way to capture the value, there's
  # still a race condition
  t_value = value
  sleep 1
  puts t_value
end

value = 'changed'

t.join

While that will seem to work for this particular example, there’s still a race condition: the value could change before the first line of the thread block is executed. Part of dealing with concurrency is giving up any expectations of what gets executed when, until you wait on a `join`.

So, yeah, the arguments to Thread.new, a pattern which other libraries involving threading sometimes propagate. For example, with a concurrent-ruby ThreadPoolExecutor:

work = 'original'
pool = Concurrent::FixedThreadPool.new(5)
pool.post(work) do |t_work|
  sleep 1
  puts t_work # is safe
end

work = 'new'

pool.shutdown
pool.wait_for_termination

And it can even be a problem with Futures from concurrent-ruby. Futures seem so simple and idiot-proof, right? Oops.

value = 100

future = Concurrent::Future.execute do
  sleep 1
  # DANGER will robinson!
  value + 1
end

value = 200

puts future.value # you get 201, not 101!

I’m honestly not even sure how you get around this problem with Concurrent::Future; unlike Concurrent::ThreadPoolExecutor, it does not seem to copy stdlib Thread.new’s ability to pass block arguments. There might be something I’m missing (or a way to use Futures that avoids this problem?), or maybe the authors of concurrent-ruby haven’t considered it yet either? I’ve asked them the question. (PS: The concurrent-ruby package is super awesome; it’s still building to 1.0 but usable now, and I’m hoping its existence will do great things for practical use of multi-threaded concurrency in the ruby community.)
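One approach that seems like it should avoid the race, given the closure semantics described above, is to copy the value into a fresh local before the Future is created, and then never reassign that local. Note this is just my sketch of a workaround, not an officially blessed concurrent-ruby idiom:

require 'concurrent'

value = 100

# capture into a new local *before* the Future exists; as long as
# nothing ever reassigns captured_value, the block sees a stable value
captured_value = value

future = Concurrent::Future.execute do
  sleep 1
  captured_value + 1
end

value = 200

puts future.value # => 101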

This is, for me, one of the biggest, most dangerous, most confusing gotchas with ruby concurrency. It can easily lead to hard-to-notice, hard-to-reproduce, and hard-to-debug race condition bugs.


Control of information is power

And the map is not the territory.

From the Guardian, Cracks in the digital map: what the ‘geoweb’ gets wrong about real streets

“There’s no such thing as a true map,” says Mark Graham, a senior research fellow at Oxford Internet Institute. “Every single map is a misrepresentation of the world, every single map is partial, every single map is selective. And every single map tells a particular story from a particular perspective.”

Because online maps are in constant flux, though, it’s hard to plumb the bias in the cartography. Graham has found that the language of a Google search shapes the results, producing different interpretations of Bangkok and Tel Aviv for different residents. “The biggest problem is that we don’t know,” he says. “Everything we’re getting is filtered through Google’s black box, and it’s having a huge impact not just on what we know, but where we go, and how we move through a city.”

As an example of the mapmaker’s authority, Matt Zook, a collaborator of Graham’s who teaches at the University of Kentucky, demonstrated what happens when you perform a Google search for abortion: you’re led not just to abortion clinics and services but to organisations that campaign against it. “There’s a huge power within Google Maps to just make some things visible and some things less visible,” he notes.

From Gizmodo, Why People Keep Trying To Erase The Hollywood Sign From Google Maps

But the sign is both tempting and elusive. That’s why you’ll find so many tourists taking photos on dead-end streets at the base of the Hollywood Hills. For many years, the urban design of the neighbourhood actually served as the sign’s best protection: Due to the confusingly named, corkscrewing streets, it’s actually not that easy to tell someone how to get to the Hollywood Sign.

That all changed about five years ago, thanks to our suddenly sentient devices. Phones and GPS were now able to aid the tourists immensely in their quests to access the sign, sending them confidently through the neighbourhoods, all the way up to the access gate, where they’d park and wander along the narrow residential streets. This, the neighbours complained, created gridlock, but even worse, it represented a fire hazard in the dry hills — fire trucks would not be able to squeeze by the parked cars in case of an emergency.

Even though Google Maps clearly marks the actual location of the sign, something funny happens when you request driving directions from any place in the city. The directions lead you to Griffith Observatory, a beautiful 1920s building located one mountain east from the sign, then — in something I’ve never seen before, anywhere on Google Maps — a dashed grey line arcs from Griffith Observatory, over Mt. Lee, to the sign’s site. Walking directions show the same thing.

Even though you can very clearly walk to the sign via the extensive trail network in Griffith Park, the map won’t allow you to try.

When I tried to get walking directions to the sign from the small park I suggest parking at in my article, Google Maps does an even crazier thing. It tells you to walk an hour and a half out of the way, all the way to Griffith Observatory, and look at the sign from there.

No matter how you try to get directions — Google Maps, Apple Maps, Bing — they all tell you the same thing. Go to Griffith Observatory. Gaze in the direction of the dashed grey line. Do not proceed to the sign.

Don’t get me wrong, the view of the sign from Griffith Observatory is quite nice. And that sure does make it easier to explain to tourists. But how could the private interests of a handful of Angelenos have persuaded mapping services to make it the primary route?

(h/t Nate Larson)


Fraud in scholarly publishing

Should librarianship be a field that studies academic publishing as an endeavor, and works to educate scholars and students to take a critical perspective? Some librarians are expected/required to publish for career promotion; are investigations in this area something anyone does?

From Scientific American, For Sale: “Your Name Here” in a Prestigious Science Journal:

Klaus Kayser has been publishing electronic journals for so long he can remember mailing them to subscribers on floppy disks. His 19 years of experience have made him keenly aware of the problem of scientific fraud. In his view, he takes extraordinary measures to protect the journal he currently edits, Diagnostic Pathology. For instance, to prevent authors from trying to pass off microscope images from the Internet as their own, he requires them to send along the original glass slides.

Despite his vigilance, however, signs of possible research misconduct have crept into some articles published in Diagnostic Pathology. Six of the 14 articles in the May 2014 issue, for instance, contain suspicious repetitions of phrases and other irregularities. When Scientific American informed Kayser, he was apparently unaware of the problem. “Nobody told this to me,” he says. “I’m very grateful to you.”

[…]

The dubious papers aren’t easy to spot. Taken individually each research article seems legitimate. But in an investigation by Scientific American that analyzed the language used in more than 100 scientific articles we found evidence of some worrisome patterns—signs of what appears to be an attempt to game the peer-review system on an industrial scale.

[…]

A quick Internet search uncovers outfits that offer to arrange, for a fee, authorship of papers to be published in peer-reviewed outlets. They seem to cater to researchers looking for a quick and dirty way of getting a publication in a prestigious international scientific journal.

This particular form of the for-pay mad-libs-style research paper appears to be prominent  mainly among researchers in China. How can we talk about this without accidentally stooping to or encouraging anti-Chinese racism or xenophobia?   There are other forms of research fraud and quality issues which are prominent in the U.S. and English-speaking research world too.  If you follow this theme of scholarly quality issues, as I’ve been trying to do casually, you start to suspect the entire scholarly publishing system, really.

We know, for instance, that ghost-written scholarly pharmaceutical articles are not uncommon in the U.S. too. Perhaps in the U.S. scholarly fraud is more likely to come for ‘free’ from interested commercial entities than from researchers paying ‘paper salesmen’ for poor quality papers. To me, a paper written by a pharmaceutical company employee but published under the name of an ‘independent’ researcher is arguably a worse ethical violation, even if everyone involved can think “Well, the science is good anyway.” It also wouldn’t shock me if systems very similar to China’s paper-for-sale industry exist in the U.S. on a much smaller scale, but are more adept at avoiding reuse of nonsense boilerplate, making them harder to detect. Presumably the Chinese industry will get better at avoiding detection too, or perhaps already is at a higher end of the market.

In both cases, the context is extreme career pressure to ‘publish or perish’, into a system that lacks the ability to actually ascertain research quality sufficiently, but which the scholarly community believes has that ability.

Problems with research quality, don’t end here, they go on and on, and are starting to get more attention.

  • An article from the LA Times from Oct 2013, “Science has lost its way, at a big cost to humanity: Researchers are rewarded for splashy findings, not for double-checking accuracy. So many scientists looking for cures to diseases have been building on ideas that aren’t even true.” (And the HN thread on it.)
  • From the Economist, also from last year, “Trouble at the lab: Scientists like to think of science as self-correcting. To an alarming degree, it is not.”
  • From Nature, August 2013 (was 2013 the year of discovering scientific publishing ain’t what we thought?), “US behavioural research studies skew positive: Scientists speculate ‘US effect’ is a result of publish-or-perish mentality.”

There are also individual research papers investigating particular issues, especially statistical methodology problems, in scientific publishing.  I’m not sure if there are any scholarly papers or monographs which take a big picture overview of the crisis in scientific publishing quality/reliability — anyone know of any?

To change the system, we need to understand the system — and start by lowering confidence in the capabilities of existing ‘gatekeeping’. And the ‘we’ is the entire cross-disciplinary community of scholars and researchers. We need an academic discipline and community devoted to a critical examination of scholarly research and publishing as a social and scientific phenomenon, using social science and history/philosophy of science research methods; a research community (of research on research) which is also devoted to educating all scholars, scientists, and students into a critical perspective. Librarians seem well situated to engage in this project in some ways, although in others it may be unrealistic to expect it of them.


Notes on oddities of Solr WordDelimiterFilter

An edited version of a post I sent to the Blacklight listserv…

I have a WordDelimiterFilter configured in my analysis for the ‘text’ type. I thought I originally inherited it from the suggested Blacklight configuration, although it doesn’t appear to be there at the moment, if I’m looking at the right place:

https://github.com/projectblacklight/blacklight-jetty/blob/master/solr/blacklight-core/conf/schema.xml#L183

I’m not sure if this repo represents Stanford’s current code for its Blacklight-based catalog, but it has a WordDelimiterFilter much like mine:

https://github.com/solrmarc/stanford-solr-marc/blob/master/stanford-sw/solr/conf/schema.xml#L349

Note that it’s got `splitOnCaseChange=”1″`, for both index and query time (no separate index/query analysis). Mine has the same. Although Stanford applies the ICUFoldingFilter (case-insensitivity) _before_ the WDF, which probably means splitOnCaseChange isn’t actually doing anything there; by the time the filter gets the tokens, there are no more case changes. In mine, I do the ICUFoldingFilter _after_ the WDF, so the WDF can still do its thing.

I’ve noticed something unexpected and probably undesirable with my setup:

Specifically, if the query includes a mixed-case term like “DuBois”, I expected this would match source term “dubois” OR source term “du bois”.

But it turns out it _only_ matches source term “du bois”. This was unexpected for one user who noticed it, and who knew that our search was generally ‘case insensitive’ — a search for “dubois” would match source term “dubois”, but a search for “duBois” would not, violating their expectations. And I agree this is probably bad.

I thought the WDF could do what I wanted. But after spending a bunch of time with the docs, playing around with different configurations, and trying to get advice on the solr-user listserv — frankly, I’m still really confused about exactly what the WDF will do in various configurations; it’s a complicated thing.

But I think the WDF is not capable of doing quite what I expected.

I think what I need to do is split into separate index- and query-time analysis, which can be identical in all ways except that the query-time analysis has splitOnCaseChange=0 — it still remains on in the index-time analysis.
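Roughly what that might look like in schema.xml, as a hedged sketch: the tokenizer and the other filter parameters here are illustrative placeholders rather than my actual configuration, and the point is just the splitOnCaseChange difference between the index and query analyzers (with ICUFoldingFilter after the WDF, as in my setup):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="1"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1" splitOnCaseChange="0"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>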

The result of this seems to be that a query-time “DuBois” will only match the single source word “dubois” (in any letter case; it’ll still match source material “DuBois” too) — if it’s only going to match one of the choices, I think this is the right one.

Source material “DuBois” will still be indexed such that both queries “dubois” (or “DuBois”) and “du bois” will match it — at index time, case changes are expanded to two words in the index, as an alternate, alongside the single word. But you can’t quite do the same thing at query time, to allow a query with a case change like “DuBois” to match both variants in the source.

I think this is probably the right thing to do — although in general, the WordDelimiterFilter is scaring me enough that if I had to do it over, I either wouldn’t use it at all, or would use it only with very specific configuration designed to support specific tested cases. As it is, I’m not quite sure what all it’s doing, and am scared to change it a lot. It’s odd to me that the example suggested analysis configuration given in the Solr wiki for the WordDelimiterFilter would seem to be subject to the same problem.

I am curious if anyone has dealt with this, and has any feedback. Especially Stanford, since I know they have a great test suite on their Solr configuration — although if that github represents current Stanford conf, the splitOnCaseChange=1 is probably having no effect at index OR query time, since there’s a case-normalization filter BEFORE it.


debugging apache Passenger without enterprise

I kind of love Passenger for my Rails deployments. It Just Works, it does exactly what it should do, no muss, no fuss.  I use Passenger with apache.

I very occasionally have a problem that I am not able to reproduce in my dev environment, and that only seems to reproduce in production under apache Passenger. Note well: in every case so far, the problem actually had nothing to do with Passenger or apache; there were other differences in environment that were causing it.

But still, being able to drop into a debugger in the Rails app actually running under apache Passenger would have helped me find it more quickly.

Support for dropping into the debugger, remotely, when running under Apache is included only in Passenger Enterprise. I recommend considering purchasing Enterprise to support the Passenger team; the price is reasonable… for one server or two. But I admit I have not yet purchased it, mainly because the number of dev/staging/production servers I would want it on, to have it everywhere, starts to make the cost substantial for my environment.

But it looks like there’s a third-party open source gem meant to provide the same support! See https://github.com/davejamesmiller/ruby-debug-passenger . It’s two years old, in fact, but I’m just noticing it today myself.

I haven’t tried it yet, but making this post as a note to myself and others who might want to give it a try.

The really exciting thing only in Passenger Enterprise, to me, is the way it can deploy with a hybrid multi-process plus multi-threaded-request-dispatch setup. This is absolutely the best way to deploy under MRI; I have no doubts at all, it just is (and I’m surprised it’s not getting more attention). This lower-level feature is unlikely to come from a third-party open source gem, and I’m not sure I’d trust it if it did. The open source Puma, an alternative to Passenger, also offers this deploy model. I haven’t tried it in Puma myself beyond some toy testing like the benchmark mentioned above. But I know I absolutely trust Passenger to get it right with no fuss. If you need to maximize performance (or avoid end-user latency spikes in the presence of some longer-running requests), and deploy under MRI, you should definitely consider Passenger Enterprise just for this multi-process/multi-thread combo feature.
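For reference, my understanding is that the Enterprise hybrid setup is turned on with a couple of Apache directives along these lines; the directive names and numbers here are from memory of the Passenger docs, so treat this as an assumption and check the current documentation rather than copying it blindly:

# Enterprise-only: dispatch requests on multiple threads within each process
PassengerConcurrencyModel thread
PassengerThreadCount 4

# available in open source Passenger too: maximum number of processes
PassengerMaxPoolSize 6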


“More library mashups”, with Umlaut chapter

I received my author’s copy of More Library Mashups, edited by Nicole Engard.  I notice the publisher’s site is still listing it as “pre-order”, but I think it’s probably available for purchase (in print or e).

Publisher’s site (with maybe cheaper “pre-order” price?)

Amazon

It’s got a chapter in it by me about Umlaut.

I’m hoping it attracts some more attention and exposure for Umlaut, and maybe gets some more people trying it out.

Consider asking your employing library to purchase a copy of the book for the collection! It looks like it’s got a lot of interesting stuff in it, including a chapter by my colleague Sean Hannan on building a library website by aggregating content services.
