DOAJ API in bento_search 1.5

bento_search is a gem I wrote that lets you search third-party search engine APIs with a standardized, simple, natural Ruby API. It’s focused on ‘scholarly’ sources and use cases.

The just-released version 1.5 includes a search engine adapter for the Directory of Open Access Journals (DOAJ) article search API.

While there certainly might be circumstances where you want to provide end-users with interactive DOAJ searches embedded in your application, the use cases I mainly have in mind are different, and involve back-end known-item lookup in DOAJ.

It’s not a coincidence that bento_search introduced multi-field querying in this same 1.5 release.

The SFX link resolver is particularly bad at getting users to direct article-level links for open access articles. (Are products from competitors like SerSol or OCLC any better here?) At best, you are usually directed to a journal-level URL for the journal the article appears in.

But what if the link resolver knew it was probably an open access journal based on ISSN (or, at the Umlaut level, based on SFX returning a DOAJ_DIRECTORY_OPEN_ACCESS_JOURNALS_FREE target as valid)? You could take the citation details, look them up in DOAJ to see if you get a match, and if so take the URL returned by DOAJ and return it to the user, knowing it’s going to be open access and not paywalled.

searcher = BentoSearch::DoajArticlesEngine.new
results = searcher.search(:query => {
  :issn       => "0102-3772",
  :volume     => "17",
  :issue      => "3",
  :start_page => "207"
})
if results.count > 0
  url = results.first.link
  # => "http://www.scielo.br/scielo.php?script=sci_arttext&pid=S0102-37722001000300002"
  # Hey, maybe we got a DOI too.
  doi = results.first.doi
  # => "10.1590/S0102-37722001000300002"
end

Or if an application isn’t sure whether an article citation is available open access or not, it could check DOAJ to see if the article is listed there.

Perhaps such a feature will be added to Umlaut at some point.

As more and more is published open access, DOAJ might also be useful as a general large aggregator for metadata enhancement or DOI reverse lookup, for citations in its database.

Another known-item-lookup use of DOAJ might be to fetch an abstract for an article in its database.

Neat!

DOAJ API tips

For anyone interested in using the DOAJ Article Search API (some of whom might arrive here from Google), I found the DOAJ API to be pretty easy to work with and straightforward, but I did encounter a couple of tricky parts that are worth sharing.

URI Escaping in a Path component

The DOAJ Search API has the query in the path component of the URL, not a query param: /api/v1/search/articles/{search_query}

In the path component of a URI, spaces are not escaped as “+”: “+” just means “+”, and will indeed be interpreted that way by the DOAJ servers. (Thanks, DOAJ API designers, for echoing back the query in the response, which made my bug there a bit more discoverable!) Spaces are escaped as “%20”. (Really, escaping spaces as “+” even in a query param is an odd legacy practice of unclear standards compliance, but most systems accept it in the query params after the ? in a URL.)

At first I just reached for my trusty Ruby stdlib method `CGI.escape`, but that escapes spaces as `+`, resulting in faulty input to the API. Then I figured maybe I should be using Ruby’s `URI.escape`: that does turn spaces into “%20”, but leaves some things like “/” alone entirely. True, “/” is legal in a URI, but as a path component separator! If I actually wanted it inside the last path component as part of the query, it should be escaped as “%2F”. (I don’t know if that would ever be a useful thing to include in a query to this API, but I strive for completeness.)

So I settled for Ruby `CGI.escape(input).gsub("+", "%20")`, which is ugly, but okay.
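As a minimal sketch of that workaround, the helpers below wrap the `CGI.escape` plus `gsub` approach and build a full search URL from it. The method names `escape_path_component` and `doaj_articles_url` are made up for illustration and are not part of bento_search; the base URL is the one quoted in this post.

```ruby
require 'cgi'

# Hypothetical helper (not bento_search's API): escape a string for use
# inside a single URI path component.
def escape_path_component(input)
  # CGI.escape turns spaces into "+" and a literal "+" into "%2B", so any
  # "+" left in its output came from a space and can safely become "%20".
  CGI.escape(input).gsub("+", "%20")
end

# Hypothetical helper: build an article search URL from a raw query string.
def doaj_articles_url(query)
  "https://doaj.org/api/v1/search/articles/" + escape_path_component(query)
end

puts escape_path_component("open access")  # open%20access
puts escape_path_component("a+b/c")        # a%2Bb%2Fc
puts doaj_articles_url("+bibjson.title:(+something) +bibjson.author.name:(+smith)")
```

The last line prints the long escaped URL used as an example in this post. Note that `URI.escape` was removed in Ruby 3.0, so on current Rubies that option is off the table anyway.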

Really, for designing APIs like this, I’d suggest always leaving a query like this in a URI query param where it belongs (“http://example.org/search?q=query”). It might initially seem nice to have URLs for search results like “https://doaj.org/api/v1/search/articles/baltimore”, but when you start having multi-word input, or worse, complex expressions (see next section), it gets less nice quickly: “https://doaj.org/api/v1/search/articles/%2Bbibjson.title%3A%28%2Bsomething%29%20%2Bbibjson.author.name%3A%28%2Bsmith%29”.

Escaping is confusing enough already; stick to the convention. There’s a reason the query component of the URI (after the question mark) is called the query component of the URI!

ElasticSearch as used by DOAJ API defaults to OR operator

At first I was confused by the results I was getting from the API, which seemed very low precision, including results I couldn’t see any reason for.

The DOAJ Search API docs helpfully tell us that it’s backed by ElasticSearch, and the query string can be most any ElasticSearch query string. 

I realized that for multi-word queries, it was sending them to ElasticSearch with the default `default_operator` of “OR”, meaning all terms were ‘optional’, and apparently with a very low (1?) `minimum_should_match`.

That means results included documents matching just any one of the search terms, which didn’t generally produce intuitive or useful results for this corpus and use case. Note that DOAJ’s own end-user-facing search uses an “AND” `default_operator`, producing much more precise results.

Well, okay, I can send it any ElasticSearch query, so I’ve just got to prepend a “+” operator to all terms, to make them mandatory. Which gets a bit trickier when you want to support phrases too, as I do; you need to do a bit of tokenization of your own. But doable.

Instead of sending query, which the user may have entered, as:  apple orange "strawberry banana"

Send query: +apple +orange +"strawberry banana"

Or for a fielded search:  bibjson.title:(+apple +orange +"strawberry banana")

Or for a multi-fielded search where everything is still supposed to be mandatory/AND-ed together, the somewhat imposing:  +bibjson.title:(+apple +orange +"strawberry banana") +bibjson.author.name:(+jonathan +rochkind)

Convoluted, but it works out.
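Here is a rough sketch of the kind of tokenizing and “+”-prefixing described above. It is not the actual bento_search code; the method names `mandatory_terms` and `fielded_query` are invented for this example. It assumes phrases are delimited by straight double quotes, and it makes no attempt to handle unbalanced quotes or to escape other characters that are special in the query string syntax.

```ruby
# Hypothetical helper: make every word or quoted phrase mandatory.
def mandatory_terms(user_query)
  # Each token is either a double-quoted phrase or a run of non-space characters.
  tokens = user_query.scan(/"[^"]*"|\S+/)
  tokens.map { |token| "+#{token}" }.join(" ")
end

# Hypothetical helper: build a fielded query from a hash of field => user input.
# With more than one field, each field clause is itself made mandatory.
def fielded_query(fields)
  clauses = fields.map { |field, value| "#{field}:(#{mandatory_terms(value)})" }
  return clauses.first if clauses.length == 1
  clauses.map { |clause| "+#{clause}" }.join(" ")
end

puts mandatory_terms('apple orange "strawberry banana"')
puts fielded_query("bibjson.title" => 'apple orange "strawberry banana"')
puts fielded_query(
  "bibjson.title"       => 'apple orange "strawberry banana"',
  "bibjson.author.name" => "jonathan rochkind"
)
```

The three `puts` calls print the three example queries shown above, in the same order.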

I really like that they allow the API client to send a complete ElasticSearch query; it let me do what I wanted even if it wasn’t what they had anticipated. I’d encourage this pattern for other query APIs. But if you are allowing the client to send in an ElasticSearch (or Solr) query, it would be much more convenient if you also let the client choose the `default_operator` (Solr `q.op`) and `minimum_should_match` (Solr `mm`).

So, yeah, bento_search

The beauty of bento_search is that one developer figures out these confusing idiosyncrasies once (and most of the bento_search targets have such things) and encodes them in the bento_search logic, and you, the bento_search client, can be blissfully ignorant of them. You just call methods on a BentoSearch::DoajArticlesEngine the same as any other bento_search engine (e.g. `engine.search('apple orange "strawberry banana"')`), and it takes care of the under-the-hood API-specific idiosyncrasies, workarounds, or weirdness.

Notes on ElasticSearch

I haven’t looked much at ElasticSearch before, although I’m pretty familiar with its cousin Solr.

I started looking at the ElasticSearch docs since the DOAJ API told me I could send it any valid ElasticSearch query. I found it familiar from my Solr work; they are both based on Lucene, after all.

I started checking out documentation beyond what I needed (or could make use of) for the DOAJ use too, out of curiosity. I was quite impressed with ElasticSearch’s feature set, and its straightforward and consistent API.

One thing to note is ElasticSearch’s really neat query DSL that lets you specify queries as a JSON representation of a query abstract syntax tree — rather than just try to specify what you mean in a textual string query.  For machine-generated queries, this is a great feature, and can make it easier to specify complicated queries than in a textual string query — or make certain things possible that are not even possible at all in the textual string query language.
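To give a flavor of the query DSL, here is a sketch of how the earlier “everything mandatory” multi-field search might look as a JSON query body, built as a Ruby hash. This is only an illustration for an ElasticSearch instance you talk to directly: the DOAJ API as described here takes a query string in the URL path, and I haven’t checked this body against DOAJ’s actual index. The field names are the ones from the examples above.

```ruby
require 'json'

# A sketch of an ElasticSearch query DSL body. Every clause under "must"
# is required, so no "+" prefixes or string tokenizing are needed.
query_body = {
  "query" => {
    "bool" => {
      "must" => [
        { "match"        => { "bibjson.title" => "apple" } },
        { "match"        => { "bibjson.title" => "orange" } },
        { "match_phrase" => { "bibjson.title" => "strawberry banana" } },
        # "operator" => "and" requires all words of this match to be present.
        { "match" => {
            "bibjson.author.name" => {
              "query"    => "jonathan rochkind",
              "operator" => "and"
            }
        } }
      ]
    }
  }
}

puts JSON.pretty_generate(query_body)
```

Because the query is a data structure and not a string, a phrase is just another clause, and there is no escaping or quoting to get wrong when the terms come from user input.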

I recall Erik Hatcher telling me several years ago (possibly before ElasticSearch even existed) that a similar feature was being contemplated for Solr (but taking XML input instead of JSON, naturally). I’m sure the hypothetical Solr feature would be more powerful than the one in ElasticSearch, but years later it still hasn’t landed in Solr so far as I know, while there it is in ElasticSearch…

I’m going to try to keep my eye on ElasticSearch.
