journaltoc, structured data

I had previously known about TicTocs, a nice aggregator of journal RSS latest-article feeds. The aggregated feeds themselves are public access, but the articles linked to may not be. That put a damper on my thoughts of using it to power a service for my users: I don’t want to provide a service unless I can send users only to licensed copies (or public-access ones, but there’s no way to be sure from the RSS feed whether an article will be public access or not). To make matters worse, sometimes the publisher-provided copy linked in the RSS won’t be licensed by our institution, but our institution will have access to the article from an aggregator or other alternate platform. So not only would I be sending users to something they couldn’t access, but in those cases there is a link where they could get the article, and I wouldn’t be sending them to it!

So I was reading this blog entry from Dave Pattern where he mentions a “JournalTocs”. At first I thought this was the old TicTocs, but with a new name. But it would seem to be a different service. TicTocs is, I think, hosted by JISC, while “JournalTOCs is an initiative of the ICBL at Heriot-Watt University and is being managed by Santy Chumbe.”

JournalTocs often seems to get actual structured citation information out of the RSS feeds, including year, volume, issue, start page, end page, and DOI. Different feeds have different structured data available. Sometimes there’s a DOI, sometimes there isn’t (obviously sometimes the article may not have a DOI; surely other times it does, but JournalTocs doesn’t succeed in sniffing it from the RSS feed). Sometimes there’s vol/issue/page, sometimes only a subset of those, sometimes nothing.

In at least some cases, JournalTocs would seem to be taking structured information from the original publisher feed, which included structured citation information using DC- or PRISM-namespaced elements. (I am not familiar with ‘prism’ or where it came from.) I am not sure whether in other cases JournalTocs is ‘sniffing’ the data in other ways.
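For a concrete sense of what that looks like, here’s a sketch in Ruby (using the stdlib REXML parser) of pulling those namespaced elements out of a single feed `<item>`. The element names are from the PRISM basic namespace and Dublin Core; which of them a given feed actually includes varies, and dc:identifier only sometimes carries a DOI:

```ruby
require 'rexml/document'

PRISM_NS = 'http://prismstandard.org/namespaces/basic/2.0/'
DC_NS    = 'http://purl.org/dc/elements/1.1/'

# Pull structured citation fields out of a single RSS <item>,
# if the publisher included prism:/dc:-namespaced elements.
# Missing elements just come back as nil.
def citation_from_item(item_xml)
  doc = REXML::Document.new(item_xml)
  ns  = { 'prism' => PRISM_NS, 'dc' => DC_NS }
  get = lambda do |xpath|
    el = REXML::XPath.first(doc, xpath, ns)
    el && el.text
  end
  {
    volume:     get.call('//prism:volume'),
    issue:      get.call('//prism:number'),
    start_page: get.call('//prism:startingPage'),
    end_page:   get.call('//prism:endingPage'),
    doi:        get.call('//dc:identifier'),
    date:       get.call('//prism:publicationDate') || get.call('//dc:date')
  }
end
```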

JournalTocs has some basic APIs (returning RSS feeds), including the ability to get an RSS feed by ISSN from JournalTocs itself, instead of the original publishers.  I like this, to the extent that JournalTocs may be sniffing non-structured data and then structuring it, or otherwise normalizing the publishers feeds. Here’s the example on their documentation page.
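A minimal client for that ISSN lookup might look like the following sketch. The base URL here is my guess at the endpoint pattern; the real URL and any required parameters should come from their documentation:

```ruby
require 'net/http'
require 'uri'

# Hypothetical base URL; check the JournalTocs docs for the real endpoint.
JOURNALTOCS_BASE = 'http://www.journaltocs.hw.ac.uk/api/journals/'

# Build the feed URI for a given ISSN.
def journaltocs_feed_uri(issn)
  URI.join(JOURNALTOCS_BASE, issn)
end

# Fetch the raw RSS for that ISSN, or nil on a non-2xx response.
def fetch_journaltocs_feed(issn)
  response = Net::HTTP.get_response(journaltocs_feed_uri(issn))
  response.is_a?(Net::HTTPSuccess) ? response.body : nil
end
```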

Now, what I’m not sure of is how many of the feeds from JournalTocs are going to have the structured data minimally necessary to create a good OpenURL link to my link resolver: either a DOI, or year/volume/issue/start-page. Because that’s really my goal here, to be able to use this service in my own services, sending users to my own link resolver for a locally licensed copy or, barring that, an ILL form.
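Assembling that OpenURL is mechanical once the data is there. Here’s a sketch, with a hypothetical resolver base URL, that prefers a DOI and falls back to the year/volume/issue/start-page KEV fields:

```ruby
require 'uri'

# Hypothetical local link resolver base URL.
RESOLVER_BASE = 'http://resolver.example.edu/openurl'

# Build an OpenURL 1.0 (Z39.88-2004) KEV query from whatever citation
# data the feed yielded: prefer a DOI as rft_id, else fall back to
# year/volume/issue/start-page. nil fields are simply omitted.
def openurl_for(citation)
  params = {
    'url_ver'     => 'Z39.88-2004',
    'rft_val_fmt' => 'info:ofi/fmt:kev:mtx:journal'
  }
  if citation[:doi]
    params['rft_id'] = "info:doi/#{citation[:doi]}"
  else
    params['rft.date']   = citation[:year]
    params['rft.volume'] = citation[:volume]
    params['rft.issue']  = citation[:issue]
    params['rft.spage']  = citation[:start_page]
  end
  params['rft.issn'] = citation[:issn] if citation[:issn]
  "#{RESOLVER_BASE}?#{URI.encode_www_form(params.compact)}"
end
```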

If I could do that, then I could do some cool stuff.  Put a list of recent articles on my catalog detail pages, or Find It link resolver pages. (In fact, I’d probably do the former by making some kind of service in Find It and vending it to the catalog).  Or give the users a way to have RSS/Atom feeds whose links took them through our institutional link resolver; or email notification of new articles from a journal, etc.

Some day I’ll have time to work on that; it seems a pretty good project. When/if I do, I’d email the JournalTocs folks to find out more about what they’re doing, and how often I can expect to find sufficient structured data to create an OpenURL. I’m also curious what level of institutional support this project has, how reliably sustainable it might be, or whether it might disappear soon. If anyone has any more info to share, please do.

One oddity of the JournalTocs recent article API feed (or at least the one in their example?) is that it returns a feed sorted alphabetically. I’d really want a feed sorted by publication date, and ideally by page number within the same publication date. But, if publication date and/or vol/issue/page are in the structured data, my own software could always sort the feed from JournalTocs itself before doing anything else with it.
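That client-side re-sort is simple enough once the items are parsed into hashes with date and start-page fields (the field names here are my own): newest publication date first, then ascending start page within a date, with undated items last:

```ruby
require 'date'

# Re-sort feed items by publication date (newest first), then by start
# page within the same date, instead of the alphabetical order the feed
# arrives in. Items with no parseable date sort to the end.
def sort_items(items)
  items.sort_by do |item|
    date = item[:date] ? Date.parse(item[:date]) : Date.new(0)
    [-date.jd, item[:start_page].to_i]   # negate Julian day for newest-first
  end
end
```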

possible useful client code to write

  • A ruby gem that deals with JournalTocs with some ‘value added’. Give it an ISSN and it’ll look up the JournalTocs feed; provide facilities to translate RSS to Atom if you want; optionally add in an OpenURL context object (not sure of the best way to embed a context object in Atom or RSS; as an <html:span> element using COinS, maybe?); or optionally add in a complete HTTP OpenURL to a local link resolver (embedded in Atom as a <link rel="z3988"> or something).
  • A Rails engine-type plugin that puts some controller and view wrappers around that gem, so you can easily create a web app (returning RSS, Atom, HTML, etc) for those functions, or include such functionality in your own app.
  • Have Umlaut use that plugin, and then write an Umlaut source adapter to add recent article information to the Umlaut responses (HTML and APIs), so it can be consumed by my catalog etc.
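For the context-object embedding the first bullet wonders about, the COinS option is at least straightforward: an empty span with class Z3988 whose title attribute carries the KEV-encoded context object. A sketch:

```ruby
require 'uri'
require 'cgi'

# Embed an OpenURL context object in an HTML fragment (which could ride
# along inside an Atom <content> payload) as a COinS span: an empty
# <span class="Z3988"> whose title attribute is the KEV-encoded
# context object, HTML-escaped for attribute safety.
def coins_span(kev_params)
  kev = URI.encode_www_form(kev_params)
  %(<span class="Z3988" title="#{CGI.escapeHTML(kev)}"></span>)
end
```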

15 thoughts on “journaltoc, structured data”

  1. Following your thoughts I developed a proof-of-concept web service that extracts metadata from the JournalTOCs feeds and rewrites the contents to provide OpenURL support, as well as adding library proxy prefixes to links.

    It works so well that we just went ahead and put it into production in our Primo catalogue, though it’s still a bit immature. To facilitate this it offers a JSONP interface, so you can query for the existence of a feed before presenting the user with a “recent articles” tab.

    An example feed:

    And for display:

    The service is written in Perl and will be made public when it’s more mature and fairly documented…

  2. You wrote “I don’t want to provide a service unless I can send users to only licensed copies (or public access, but there’s no way to be sure if it will be or not from the RSS feed). ”

    So, have you seen the WattJournals service which uses the JournalTOCs API to deliver ONLY results to Heriot-Watt subscribed, or OA, journals?


  3. I did see that, Roddy, after I wrote this blog post. Yep, something like that is what I’d want to do. Except I think I’d _show_ ToC’s even for journals we don’t have electronic access to (if we have physical access); I’d just not provide links. And instead of providing an entirely separate search interface, I’d integrate it right into my catalog: look up a journal in the catalog, get the ToC on the catalog page. We don’t need yet another search interface; I’m trying to make “the catalog” do everything you want to do with the materials contained in it.

    Definitely looks feasible; the WattJournals service is kind of proof of that. It’s definitely on my list of projects, though I don’t know when I could get to it. When I do, I should definitely ask the Heriot-Watt folks for tips. Thanks!

  4. I agree with all of that except that, sometimes, it’s good to be able to say “Search this, and you’re guaranteed immediate full text of anything you find, 24/7” to the students.

    JournalTOCs has been integrated into some library catalogues.

  5. Ah, you’re right, I see that the WattJournals thing is searching metadata from ‘recent issues’, not just journal names. That’s definitely doing something that my idea for a service integrated into the catalog (just title-by-title lookup) wouldn’t. Somewhat different use cases; both can be valuable. Technical and business-related restrictions put serious limitations/challenges on providing the ideal “give the patron whatever they’d want” interfaces we really want to, so it’s all about the trade-offs, trying to create the least confusing and most powerful environment we can for our patrons. What WattJournals is doing is definitely reasonable, although I’m still leaning in a different direction.

  6. Actually, their search of not only ‘current’ issues but ‘past issues’ gives me another seriously interesting idea.

    If someone starts archiving all of these current issue RSS feeds now, then we can have a pretty damn good “meta-search interface”, that can be made pretty cheaply/efficiently, but only going back as far as the ‘now’ we started archiving and only including articles from journals that expose their ToCs reasonably, and with PRISM-esque metadata.

    Looks like that’s what WattJournals in fact is doing to provide ‘past issues’ searches. But they’re probably only archiving articles from journals they have access to.

    Would be super interesting if someone started archiving ALL of em, so the data was captured for the future when someone else wants to turn it into a meta-search. Could end up being a lot better than broadcast search, and a lot cheaper than paying for aggregated indexes like Summon. Internet Archive? Anyone? Want to archive these?

    I wonder if there’s someone I could talk to at the IA to try and get them interested. Just archiving all the RSS/Atom feeds from JournalTOCs, that’s it. Just so the data is captured for someone that wants to do something with it in the future.

  7. Nice Kaspar, sorry I just noticed your comment was in the ‘pending’ queue. Great to hear it works so well, I want to find time to do it in Umlaut.

    What do you choose to do with the articles without sufficient metadata for an OpenURL? I assume there are some?

  8. Well, trees don’t grow to the sky and the OpenURL isn’t always very complete for some publishers. Therefore I provide the original link as well as a proxied one to have at least a couple of ways for the user to try to get to the contents.

    I can of course always provide the ISSN, and if no date is available from the feed I add sfx.ignore_date_threshold=1 to the OpenURL, “SFX style”. So linking to the appropriate vendor is often the best I can do. It’s not always deep links, but at least some indication of availability.

    I use a hierarchical model to extract the data: get the “best” information first, and move down to get anything at all if nothing better is available. E.g. for date: look in prism:publicationDate, then dc:date, then dc:source, and finally in the description, matching anything that looks like a year: ([1-2][0-9]{3}). A bit crude, but I think it works OK in all the cases I tested.
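    That fallback chain could be sketched in Ruby like so (the commenter’s service is in Perl; this assumes the feed fields have already been extracted into a hash keyed by element name):

    ```ruby
    # Hierarchical date extraction: try the richest source first, then
    # progressively cruder ones, ending with a bare regex for anything
    # that looks like a year in the item description.
    def best_date(fields)
      fields['prism:publicationDate'] ||
        fields['dc:date'] ||
        fields['dc:source'] ||
        (fields['description'] =~ /([1-2][0-9]{3})/ && $1)
    end
    ```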


  9. “a lot cheaper than paying for aggregated indexes like Summon.” You are quite right, Jrochkind. That is one of the reasons why we have been archiving all the TOC RSS feeds since we started to collect them in 2008. What we are doing now is using some algorithms to complement and enrich the original metadata found in the RSS feeds.

    Kaspar, I thought you may want to know that the Suffolk University Library (USA) is trialling the JournalTOCs feature that inserts the Library’s proxy prefix into the articles’ URLs. In this case, OpenURL is not required, as the Library uses its own WAM proxy provider.
    (PS: I didn’t forget your PJSON suggestion)

  10. The reason OpenURL may be required even if you have a proxy is because the library may pay for full text of the article from an aggregator, at a different URL than the publisher URL in the JournalTOCs feed. Simply proxying the link to the publisher URL still won’t get the patron to the full text the library is already paying for. An OpenURL link resolver will.
