Another round of citation features in a sufia app

I reported before on our implementation of an RIS export feature in our sufia 7.4 app.

Since then, we’ve actually nearly completely changed our implementation. Why? Well, it started with us moving on to our next goal: on-page human-readable citation. This was something our user analysis had determined portions of our audience/users wanted.

Turns out that what seemed “good enough” metadata for an RIS export (meeting or exceeding user expectations; users were used to citation exports not being that great, and having to hand-edit them themselves) seemed not at all good enough when actually placed on the page as a human-readable citation (in Chicago format).

We ended up first converting our internal metadata to citeproc-json format/schema, then using that intermediate metadata as the source for our RIS export, as well as for conversion to a human-readable citation with citeproc-ruby. The conversion/production happens at display-time, from data in our Solr index, which required us to add some data to the Solr index that wasn’t previously there.

On metadata and citations

Turns out getting the right machine-interpretable metadata for a really correct citation is pretty tricky.

It occurs to me that if citations are a serious use case, you should probably consider them when designing your metadata schema in the first place, to make sure you have everything you need in machine-readable/interpretable format. (As unrealistic as this suggestion sounds for many actual projects in our sector.) Otherwise you can find you simply don’t have what you need for a reasonable citation.

We ended up adding a few metadata fields, including a “source” field for items in our digital collection that are excerpts from works (which are not in our collection), and need the container work identified in the citation.

In other cases, an excerpt is an independent work in our repo, but also has a ‘child’ relationship to a parent that is its container for purposes of citation. But in yet other cases, there’s a work with a ‘parent’ work that is for organizational/arrangement purposes only, and is not a container for purposes of citation, and our metadata leaves the software no way to know which is which. (In this case we just treat them all like containers for purposes of citation, and tolerate the occasional not-really-correct-ness, as the “incorrect” citations still unambiguously identify the thing cited.)

We also implemented a bunch of heuristics to convert various “just string” fields to parsed metadata. For instance our author (or publisher) names, while from FAST and other library vocabularies, are just in our system as plain single strings. The system doesn’t even record the original authority identifier. (I think this is typical for a sufia/hyrax app: they use the qa gem to load terms, but even if the gem supplies identifiers from the original vocabulary, they aren’t recorded.)

So, the name `Stayner, Heinrich, -1548` needs to be displayed in some parts of the citation (first author for instance) as Stayner, Heinrich, but in other parts (second author or publisher) as Heinrich Stayner, and in no case should the dates be included in the citation, so we gotta try parsing it. Which is harder than you’d think with all the stuff that can go into an AACR2-style name heading (question marks, the word “approximately”, sometimes the word “active”, and other idiosyncrasies). And then a corporate name, like an imaginary design firm Jones, Smith, Garcia, is never actually Garcia Jones, Smith or something like that.
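A minimal sketch of this kind of heuristic, not the actual code from the app. The `NameHeading` module and the `corporate:` flag are hypothetical names made up for illustration, and the sketch assumes you already know somehow whether a heading is a corporate name. The output hash uses the `family`/`given`/`literal` name keys from csl-json.

```ruby
# Sketch only: NameHeading and the corporate: flag are hypothetical.
module NameHeading
  # Turn a single-string name heading into a csl-json style name hash.
  def self.to_csl(heading, corporate: false)
    # Corporate names must never be inverted, so pass them through whole.
    return { "literal" => heading } if corporate

    parts = heading.split(",").map(&:strip).reject(&:empty?)

    # Drop trailing date-like segments, i.e. anything containing a digit:
    # "-1548", "1480?-1548", "approximately 1480-1548", "active 1520".
    parts.pop while parts.any? && parts.last.match?(/\d/)

    # Only the simple "Family, Given" shape is treated as parsed;
    # anything else falls back to an unparsed literal.
    return { "literal" => parts.join(", ") } unless parts.length == 2

    { "family" => parts[0], "given" => parts[1] }
  end
end

NameHeading.to_csl("Stayner, Heinrich, -1548")
# => {"family"=>"Stayner", "given"=>"Heinrich"}
NameHeading.to_csl("Jones, Smith, Garcia", corporate: true)
# => {"literal"=>"Jones, Smith, Garcia"}
```

With the name split into parts, the citation formatter can decide for itself whether to print it inverted or in direct order. Real headings have many more shapes than this handles.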

Then there’s turning our dates from a custom schema into something that fits what a citation expects.

Our heuristics get good enough. In fact, I think our automatically-generated human-readable citations end up as good as or better than anything else I’ve seen automatically generated on the web, including from major publishers. But they are definitely far from perfect, and have lots of errors in many edge cases. Hopefully they are all errors that don’t change or confuse the identity of the thing cited, which of course is the point.

CSL, CSL-json, and citeproc-ruby

CSL, the Citation Style Language, is a system for automatically generating human-readable citations according to XML stylesheets for various citation formats/styles.

While I believe CSL originally came out of zotero, some code has been extracted (and is open source like zotero itself), and the standard itself now exists as an independent standard. Whether via that code or via other implementations of the schema/standard, open source and not, it has been adopted by other software packages too (like Mendeley, which is not open source).

One part of CSL is a json format (defined with a json schema) to represent an individual “work to be cited”.  This also originally came from Zotero, and doesn’t seem to totally have a universal name yet, or a ton of documentation.  The schema in the repo is called “csl-data.json,” but I’ve also seen this format referred to as just “csl-json”, as well as “citeproc-json” (with or without the hyphens).  It also has even more adoption beyond zotero — it is one of the standard formats that CrossRef (and other DOI resolvers?) can return.  The common IANA/MIME “Content-Type” is `application/vnd.citationstyles.csl+json`, but historically another (incorrect?) form has sometimes been used, `application/citeproc+json`. Some of the names/content type(s) might confuse you into thinking this is a JSON representation of a CSL style (describing a citation format/style like “Chicago” or “MLA”), but it’s not, it’s a format of metadata about a particular “work to be cited”.  I kind of like to call it “csl-data-json” (after the schema URL) to avoid confusion.
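To make “metadata about a particular work to be cited” concrete, here is a small sketch that builds one csl-data-json item as a ruby hash and serializes it. The key names (`type`, `author`, `family`/`given`, `issued` with `date-parts`) come from the csl-data.json schema; every value, and the `example_csl_item` method name, is made up for illustration.

```ruby
require "json"

# Hypothetical record; every value below is invented for illustration.
def example_csl_item
  {
    "id" => "example-1",
    "type" => "book",
    "title" => "An Example Title",
    # Names are split into machine-readable parts.
    "author" => [{ "family" => "Stayner", "given" => "Heinrich" }],
    "publisher" => "Example Press",
    "publisher-place" => "Example City",
    # Dates are nested arrays of [year, month, day], month and day optional.
    "issued" => { "date-parts" => [[1531]] }
  }
end

# A csl-data-json document is a JSON array of such items.
puts JSON.pretty_generate([example_csl_item])
```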

Even apart from JSON serialization, this is a useful schema in that it separates out the fields one will actually need to generate a citation (including machine-readable individual sub-elements for parts of a name or date). Its best available documentation, in addition to the JSON schema itself, seems to be this document written for the original Javascript implementation and not entirely applicable to generic implementations.

There is, amazingly, a ruby CSL processor in the citeproc-ruby gem.  Not only can it take input in csl-json and format it as an individual citation in a desired style, but, as a standard CSL processor, it can also format a complete bibliography and footnotes in the context of a complete document (where some citation styles call for appropriate ibid use in the context of multiple citations, etc).  I was only interested in formatting an individual citation though.

Initially, I wasn’t completely sure the citeproc-ruby gem would work out for me, for performance or other reasons. But I still decided to split processing into two steps: translating our internal metadata into a csl-json compatible format, and then formatting a human readable citation. This two step process just makes sense for manageable code, trying to avoid an unholy mess of nested if-elsifs all jumbled together. And gives you clear separation if you need to generate in multiple human-readable styles, or change your mind about what style(s) to generate. The csl-json schema is great for an intermediate format even if you are going to format as human-readable by non-CSL means, as it’s been road-tested and proven as having the right elements you need to generate a citation.

However, I did end up using citeproc-ruby in the end. @inukshuk, its author, was amazingly helpful and generous in answering my questions on the GH issues. Initially it looked like there were some extreme performance problems, but using an alternate citeproc-ruby API to avoid re-loading/parsing XML style documents from disk every time (with one PR by me to make this work for locale XML docs too) avoided those.

Citeproc-ruby can’t yet handle formatting of date ranges in a citation (inukshuk has started on the first steps to an implementation in response to my filed issue). So when I have a date range in a work-to-be-cited, I just format it myself in my own ruby code, and include it in the csl-data-json as a date “literal”.
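A sketch of that workaround, under the assumption that dates have already been reduced to plain years. The `csl_issued` method name is hypothetical. The csl-data.json schema can express a range as two `date-parts` entries, but since the processor couldn’t render that, the range is pre-formatted into a `literal` string instead.

```ruby
# Sketch only: csl_issued is a hypothetical helper, and real dates
# carry more than a bare year.
def csl_issued(start_year, end_year = nil)
  if end_year.nil? || end_year == start_year
    # A single date can stay fully machine-readable.
    { "date-parts" => [[start_year]] }
  else
    # A range is formatted by hand (with an en dash) and passed through
    # as a literal, which the processor prints as-is.
    { "literal" => "#{start_year}\u2013#{end_year}" }
  end
end

csl_issued(1531)        # => {"date-parts"=>[[1531]]}
csl_issued(1531, 1548)  # => {"literal"=>"1531–1548"}
```

The cost is that a literal is opaque: the style can no longer reformat it, so whatever you emit has to already look right for the style you are rendering.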

CSL is amazing, and using a CSL processor handles all sorts of weird idiosyncratic edge cases for you. (One example: if a title already includes double-quotes, but is to be double-quoted in the citation, it changes the internal double quotes to single quotes for you. There are so many of these that you’re not going to think of them all yourself in a custom cobbled-together unholy mess of if-elsif statements.)

Also, while I didn’t do it, you could hypothetically customize some of the existing styles in CSL XML if you need to for local context needs. I believe citeproc-ruby even gives you a way to override parts of an existing style in ruby code.

The particular and peculiar challenges of sufia/hyrax/samvera

There are two main, er, idiosyncrasies of the sufia/hyrax/samvera architecture that provided additional challenges. One: the difficulty of efficiently determining the parent work of a work-in-hand, and (in sufia but not hyrax) the collection(s) that contain a work. Two: the split architecture between Solr index data (used at display-time) and fedora data (used at index time), and the need to write code very differently to get data from each of these sources/times.

Initially, I was worried about citeproc-ruby performance, so I started out having our sufia app generate the human-readable citation at index time, and store it as text/html in the Solr index, so at display time it would just have to be retrieved and inserted on the page. Really, even if it only takes 10ms to format a citation, wouldn’t it be better to not add 10ms to the page delivery time? (Granted, 10ms may be nothing to many slow sufia/hyrax apps).

However, to generate citations in our context, we need access to both the container collection (for archival arrangement/location when an archival item), and the parent work, for “container” for citation purposes. These are very slow to get out of fedora. (Changed/improved for fetching parent collections but not parent works in hyrax; we’re still sufia). Like, with our data and infrastructure, it was taking multiple seconds to get the answer from fedora to “what are the parent work(s) for this item-in-hand” (even trying to use the fedora API feature that seemed suited for this, whose name I now forget). While one can accommodate more slowness at index-time than display-time, several-seconds-per-item was outside our tolerance, when re-indexing our ~20K item collection can already take many hours on an empty solr index.

So you want to get that info from the Solr index instead of fedora, but trying to access the Solr index in the indexing operation leads you to all sorts of problems when generating an initial index: is there already enough in the index to answer the question you need answered in order to index the item-in-hand? We want our indexing operation to always be usable starting from an empty index, for fault recovery purposes among others. And even ignoring this issue, I found that the sufia ‘actor stack’ actually led to the right info not being in the Solr index at the right time for a particular item-in-hand-to-index when changing the parent or collection membership for item(s).

Stopping myself as I got into trying to debug the actor stack yet again, I decided to switch to a pure display-time approach. Just generate the citation on-demand, from the solr index. At this point I already had a map-metadata-to-csl-json implementation based on doing it at index-time with info from fedora. When I wrote that, I had forgotten to leave my options open to switch to display-time, so I had to rewrite the thing to retrieve the slightly different info in slightly different ways from the Solr index at display time using a sufia “show presenter”.

We also had to add some things to our Solr index so they could be used at display time. We had been including in our solr index only the dates-of-work as strings we wanted to display to users on our pages, but the citation metadata transformer needed all our original structured metadata so it could determine how best to convert them (differently) to dates for inclusion in a citation. (I stored our original data objects serialized to json, and then had the presenter “re-hydrate” them to our original ruby model objects without touching fedora).
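The serialize-then-re-hydrate round trip might look roughly like this. It is a sketch, not the app’s code: `DateOfWork` and its attributes are a hypothetical stand-in for the real model objects, and the method names are made up. The point is only that the structured data travels through the Solr document as a JSON string, so nothing needs to be fetched from fedora at display time.

```ruby
require "json"

# Hypothetical stand-in for a structured date model object.
DateOfWork = Struct.new(:start, :finish, :start_qualifier, keyword_init: true)

# Index time: serialize the structured objects into one stored string field.
def dates_to_solr_value(dates)
  JSON.generate(dates.map(&:to_h))
end

# Display time: the presenter rebuilds the objects from the stored string.
def dates_from_solr_value(json)
  JSON.parse(json, symbolize_names: true).map { |attrs| DateOfWork.new(**attrs) }
end

stored = dates_to_solr_value([DateOfWork.new(start: "1531", finish: "1548", start_qualifier: "circa")])
dates_from_solr_value(stored).first.start_qualifier  # => "circa"
```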

Premature Abstraction

In our original implementation, I tried to provide a sort of generic “serialize to RIS”  base class, thinking it would make our code more readable, and potentially be of general use.

However, even originally it didn’t end up working quite as well as I’d hoped (needed custom logic more often than using the “built in” automatic mappings in the base class), and in fact this new implementation abandons it entirely. Instead, it first maps to CSL-json schema/format, and then the RIS serializer mostly just extracts the needed fields from there. (We wanted to take advantage of our improved citation data for on-screen human-readable to improve the RIS export too, of course).

No harm, no foul in our local codebase. You learn more about your requirements and you learn more about how particular architectural solutions work out, and you change your mind about implementation decisions and change them. This is a normal thing.

But if I had jumped to, say, add my “RIS Serializer base” abstraction to some shared codebase (say the hyrax gem, or even some kind of samvera-citations gem), it probably would have ended up not as generally useful as I thought at the time (it’s not even a good match for our needs/use case, it turns out!). And it’s much harder to change your mind about an abstraction in a shared codebase that many people may be relying upon, and that can’t be changed without backwards incompatibility problems. (In a local codebase these aren’t nearly as problematic: you just change all your code in your repo and commit it and you’re done, no need to worry about versioning or coordinating the work of various developers using the shared code).

It’s good to remember to be even more cautious with abstractions in shared code in general.  Ideally, abstractions in shared code (ie, a gem) should be based on a good understanding of the domain from some experience, and have been proven in one (or better more) individual app(s) over some amount of time, before being enshrined into a shared codebase. The first abstraction that seems to be working well for you in a particular codebase may not stand the test of time and diverse requirements/use cases, and “the wrong abstraction can be worse than no abstraction at all”—and the wrong abstraction can be very expensive and painful to undo in a gem/shared codebase.

Our implementation

You can see the Pull Request here.  (It’s possible there were some subsequent bug fixes postdating the PR).

We have a class called CitableAttributes, which takes a display-time ‘work show presenter’ (which as above has been customized to have access to some original component models), and formats it into data compatible with csl-data-json (retrievable via individual public accessors), as well as an actual JSON document that is csl-data-json.

Our RISSerializer uses a CitableAttributes object to extract individual metadata fields, and put them in the right place in an RIS document. It also needs its own logic for some things that aren’t quite the same in RIS and csl-data-json (different ‘type’ vocabulary, no ability to describe date ranges machine-readably). We wanted to take advantage of all the logic we had for transforming the metadata to something applicable to citations, to improve the RIS exports too.
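To give a feel for how thin the RIS step becomes once the csl-data-json mapping exists, here is a rough sketch. It is not our RISSerializer: the `RisSketch` module, its type table, and the choice of RIS tags are assumptions for illustration, and it reads a plain csl-json style hash rather than a CitableAttributes object.

```ruby
# Sketch only: RisSketch and its mappings are hypothetical.
module RisSketch
  # RIS has its own type vocabulary, so csl types need translating.
  TYPE_MAP = {
    "book" => "BOOK",
    "chapter" => "CHAP",
    "article-journal" => "JOUR",
    "manuscript" => "MANSCPT",
    "graphic" => "ART"
  }.freeze

  def self.serialize(csl)
    fields = [["TY", TYPE_MAP.fetch(csl["type"], "GEN")]]
    Array(csl["author"]).each { |name| fields << ["AU", name_string(name)] }
    fields << ["TI", csl["title"]]
    fields << ["PB", csl["publisher"]]
    fields << ["CY", csl["publisher-place"]]
    fields << ["PY", year(csl["issued"])]
    fields << ["ER", ""] # end-of-record marker
    # Skip absent fields; each RIS line is "TAG", two spaces, "- ", value.
    fields.reject { |_tag, value| value.nil? }
          .map { |tag, value| "#{tag}  - #{value}\r\n" }
          .join
  end

  def self.name_string(name)
    name["literal"] || [name["family"], name["given"]].compact.join(", ")
  end

  # RIS can't express a range, so a literal date falls back to its
  # first four-digit year (a simplification).
  def self.year(issued)
    return nil if issued.nil?
    return issued["date-parts"].first.first.to_s if issued["date-parts"]

    issued["literal"].to_s[/\d{4}/]
  end
end

puts RisSketch.serialize(
  "type" => "book",
  "title" => "An Example Title",
  "author" => [{ "family" => "Stayner", "given" => "Heinrich" }],
  "issued" => { "literal" => "1531\u20131548" }
)
```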

Oh, one more interesting thing. We decided that for photographs of “realia” (largely from our Museum’s collection), it was more appropriate and useful to cite them as photographs (taken by us, dated the date of the photo), rather than try to cite the “realia” itself, which most citation styles aren’t really set up to do, and which some here thought was inappropriate for these objects as seen on our website anyhow. So we have some custom logic to determine when an item in our collection is such, and cite appropriately using some clever OO polymorphism. This logic now carries over to the RIS export, hooray.
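The general shape of that polymorphism could be sketched like this. All of it is hypothetical: the class names, the `:realia_photo` flag, and the hash keys are invented for illustration, and the real logic for detecting such an item is more involved. The idea is that a subclass overrides only the pieces that differ (type, author, date), and everything downstream consumes the same interface.

```ruby
# Sketch only: every class, key, and value here is hypothetical.
class WorkCitation
  # Pick the citation strategy for a work.
  def self.for(work)
    work[:realia_photo] ? PhotographCitation.new(work) : WorkCitation.new(work)
  end

  def initialize(work)
    @work = work
  end

  def type
    "book"
  end

  def authors
    @work[:creators]
  end

  def year
    @work[:year]
  end

  def to_csl
    {
      "type" => type,
      "title" => @work[:title],
      "author" => authors,
      "issued" => { "date-parts" => [[year]] }
    }
  end
end

# Cite the photograph rather than the object it depicts.
class PhotographCitation < WorkCitation
  def type
    "graphic"
  end

  # The institution that took the photo is credited as its author.
  def authors
    [{ "literal" => @work[:institution] }]
  end

  # Use the date of the photo, not the date of the object.
  def year
    @work[:photo_year]
  end
end
```

Because the RIS export reads from the same object, it picks up the photograph treatment without any extra branching.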

And a simple Rails helper just uses a CitableAttributes to get a csl-data-json, and then feeds it to citeproc-ruby objects to convert to the human-readable Chicago-style citation we want on screen.

There are definitely still a variety of idiosyncratic edge cases it gets not quite right, from weird punctuation to semantics. But I believe it’s still actually one of the best on-screen automatically-generated human-readable citation implementations around!

Some live diverse examples:
