Proposal: A method of exposing alternate representations of entities embedded in HTML, using HTML5 microdata conventions.
[This is just a draft for discussion. Note that the key URL begins with http://example.org and is only a placeholder; it should not be used in a real system.]
The main use case here is for a browser plugin like Zotero or Mendeley to discover scholarly citations listed on an HTML page, and get structured scholarly citations to import into a citation management application.
It likely also could prove useful in other cases of user-agents wanting to identify scholarly citations (with structured metadata) mentioned on the web, for instance in a spider/crawler.
It may also be useful in more general cases not involving scholarly publications/citations, for providing links to alternate format representations of arbitrary entities mentioned on an HTML web page.
It can be considered an alternative to, or an updated version of, UnAPI, and it is very similar to UnAPI in functionality. However, UnAPI no longer validates under HTML5, and its “microformat”-style use of ‘title’ attributes and oddly used ‘abbr’ tags has accessibility and consistency problems on the web. The method here instead uses contemporary HTML5 microdata, and is somewhat easier to implement than UnAPI because it requires no extra API URL endpoints.
In the actually existing environment for our prime use case, many of the actual and potential consumers of citation data can already import and understand structured citation data in various non-web formats (principally RIS), and many of the actual and potential information-producing websites already provide structured citation data in non-web formats (again principally RIS). The simplest solution to our prime use case, then, is just to provide a way for a user-agent to automatically discover these alternate format representations, using a method that is as easy and simple to implement as possible. (That is exactly the goal UnAPI had.) A simple one-property HTML5 microdata solution accomplishes this.
The alternate-format property should be used only on HTML5 <a> or <link> elements. An <a> element should be used in cases where a user-viewable link to the alternate format representation is on the page; the <link> element is available in cases where a user-viewable link to the alternate format representation will not be on the page.
A <link> or <a> element with the alternate-format microdata property must also include a ‘type’ attribute with a valid IANA content type. While officially the ‘type’ attribute is only advisory and does not override the actual content headers returned with the URL, user agents interested in alternate representations should only be expected to follow URLs with ‘type’s they recognize.
The microdata ‘itemscope’ attribute should be attached to a DOM element on the page that includes the human readable description of the entity which has an alternate format advertised. This can be used by a user-agent to, for example, highlight the area of the page describing the element the user-agent is offering to import.
1. Simple <a> example
<div itemscope>
  <p>Alice in Wonderland, by Lewis Carroll</p>
  <a itemprop="http://example.org/alternate-format"
     type="application/x-Research-Info-Systems"
     href="http://example.org/items/12355.ris">
    Export to EndNote
  </a>
</div>
2. Simple ‘link’ example, this time using an HTML5 ‘section’ element.
Note that schema.org suggests using a <link> tag for URLs that are not meant to be user-visible, and a <link> tag with a microdata ‘itemprop’ attribute validates in HTML5.
<section itemscope>
  <p>Alice in Wonderland</p>
  <p>by Lewis Carroll</p>
  <link itemprop="http://example.org/alternate-format"
        type="application/x-Research-Info-Systems"
        href="http://example.org/items/12355.ris" />
</section>
3. Can co-exist with other microdata vocabularies
For instance, it can be used alongside the schema.org Book vocabulary. Note that it may or may not be useful to advertise the alternate-format URL as a schema.org “url” property simultaneously; this document makes no suggestion either way. This example is meant to show that it is possible to do so if you judge it useful and semantically correct. Note that there are two values in the ‘itemprop’ for the alternate representation URL.
<div itemscope itemtype="http://schema.org/Book">
  <img itemprop="image" src="catcher-in-the-rye-book-cover.jpg" alt="" />
  <span itemprop="name">The Catcher in the Rye</span>
  <link itemprop="bookFormat" href="http://schema.org/Paperback">Mass Market Paperback
  by <a itemprop="author" href="/author/jd_salinger.html">J.D. Salinger</a>
  <a itemprop="url http://example.org/alternate-format"
     type="application/x-Research-Info-Systems"
     href="http://example.org/items/12355.ris">
    Export to EndNote
  </a>
  Product details:
  <span itemprop="numPages">224</span> pages
  Publisher: <span itemprop="publisher">Little, Brown, and Company</span> -
  <meta itemprop="datePublished" content="1991-05-01">May 1, 1991
</div>
[google rich snippet extraction of example 3, demonstrating that our extra alternate-format property name does not interfere with schema.org url or other properties].
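On the consuming side, discovery for the three examples above could look roughly like the following. This is only a minimal sketch using Python's standard html.parser module; the class and function names (AlternateFormatFinder, find_alternate_formats) are made up for illustration, and the property URI is the placeholder from this draft. It treats ‘itemprop’ as a space-separated list, so the two-value ‘itemprop’ in example 3 is matched as well.

```python
from html.parser import HTMLParser

# Placeholder property URI from this draft; not for use in a real system.
ALTERNATE_FORMAT = "http://example.org/alternate-format"


class AlternateFormatFinder(HTMLParser):
    """Collect (href, type) pairs from <a>/<link> elements carrying the property."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag not in ("a", "link"):
            return
        attr = dict(attrs)
        # itemprop is a space-separated token list, so split before comparing.
        props = (attr.get("itemprop") or "").split()
        if ALTERNATE_FORMAT not in props:
            return
        # The proposal requires both a URL and a 'type'; skip incomplete links.
        if attr.get("href") and attr.get("type"):
            self.found.append((attr["href"], attr["type"]))


def find_alternate_formats(html):
    """Return every advertised alternate format in the given HTML text."""
    finder = AlternateFormatFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Calling find_alternate_formats() on the markup of any of the examples returns a list with one (href, type) pair for the .ris URL. A real user-agent would then decide, based on the type, whether to offer an import.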
A word on IANA (mime) types
The suggested practice in this document depends on the ‘type’ attribute of <a> or <link> to specify an Internet Media Type (i.e. “MIME type” or “IANA content type”), in order to fully describe the alternate format representation being provided.
Unfortunately, the most commonly used formats for our prime use case have no officially registered types. We have identified what seem to be the most common “x-” extension types in use for the formats we care about, and suggest that both information producers and user-agent consumers of this suggestion use these values until such time as registered values exist:
application/x-Research-Info-Systems : RIS
application/x-endnote-refer : EndNote Import Format. Oddly, no online documentation for this format from EndNote itself can be found.
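A user-agent that only follows ‘type’s it recognizes might check the attribute along these lines. This is a hedged sketch: the table and function names (RECOGNIZED_TYPES, recognized_format) are invented here, and the choice to ignore letter case and any parameters after a semicolon is an assumption based on how media types are normally compared, not something this proposal specifies.

```python
# The two unregistered "x-" types suggested above, keyed in lower case.
RECOGNIZED_TYPES = {
    "application/x-research-info-systems": "RIS",
    "application/x-endnote-refer": "EndNote Import Format",
}


def recognized_format(type_attr):
    """Return the format name for a 'type' attribute value, or None if unknown."""
    if not type_attr:
        return None
    # Drop any parameters (e.g. "; charset=utf-8") and compare case-insensitively.
    essence = type_attr.split(";", 1)[0].strip().lower()
    return RECOGNIZED_TYPES.get(essence)
```

With this, the mixed-case spelling used in the examples (application/x-Research-Info-Systems) maps to RIS, while an unrelated type such as text/html yields None and the link is skipped.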
The single item page
The method described above works well when you have multiple items on a page and wish to provide alternate formats or export for all of them. If you have only a single item on a page, the method described above still works. However, another option for a single-item page is simply to use an HTML <link> in the <head> section with a ‘rel’ attribute. We encourage relevant user-agents to recognize this method too:
<head> <link rel="alternate" type="application/x-Research-Info-Systems" href="http://example.org/item/12345.ris" /> [...] </head>
It’s not actually legal to use <link> with microdata in the <head>, nor is it legal to use <link> with ‘rel’ in the <body> section; a later section will explain why we think purely relying on ‘rel’ is insufficient for a multi-item page.
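For the single-item case, a user-agent could pick up the <head> link with something like the sketch below. It again uses only Python's html.parser; the names HeadAlternateFinder and find_head_alternates are hypothetical. It deliberately ignores any <link rel="alternate"> found outside <head>.

```python
from html.parser import HTMLParser


class HeadAlternateFinder(HTMLParser):
    """Collect (href, type) from <link rel="alternate"> elements inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "body":
            # <body> implies the head is over even if </head> was omitted.
            self.in_head = False
        elif tag == "link" and self.in_head:
            attr = dict(attrs)
            # rel is a space-separated token list; compare tokens in lower case.
            rels = (attr.get("rel") or "").lower().split()
            if "alternate" in rels and attr.get("href") and attr.get("type"):
                self.found.append((attr["href"], attr["type"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_head_alternates(html):
    """Return (href, type) pairs for rel="alternate" links in the page head."""
    finder = HeadAlternateFinder()
    finder.feed(html)
    finder.close()
    return finder.found
```

Because rel="alternate" is also used for feeds and translations, the returned pairs still need to be filtered by ‘type’ before anything is offered for import.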
The ‘rel’ attribute
The older HTML ‘rel’ attribute on an <a> or <link> tag serves much the same purpose as the newer HTML5 microdata, although in a more specialized domain. ‘rel’ is limited to <a> or <link> tags, but that’s all the method discussed here uses anyway.
For a visible link, an <a> with a ‘rel’ and ‘type’ could serve much the same purpose as this proposal. However, for a link which is not meant to be user-visible, using <link> with ‘rel’ would be restricted to the <head> section, which is less convenient for a multi-entity (i.e. “results”) page. (HTML5 allows <link> in <body>, but only with the microdata ‘itemprop’ attribute, not with the ‘rel’ attribute.)
Putting such an alternate link in the header instead seems inaccurate if it applies to only a portion of the page (rel=alternate in the header means it’s an alternate representation of the general contents of the page). You could instead include a <link> in <head> that returns an alternate format aggregating all entities on the page, but this has several drawbacks: it is only suitable for formats that can be aggregated; it is harder to implement on the provider end if the provider already has individual alternate-format URLs available but not arbitrary aggregate ones (a likely scenario); and it gives no way to tell the consumer which citation applies to which entity described human-readably on the page.
The microdata approach also allows the area of the page with a human-readable description of the relevant entity to be identified with ‘itemscope’. It seems potentially useful for an extracting user-agent to be able to highlight this for the user when offering a citation import/export.
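To illustrate how ‘itemscope’ ties each link to its human-readable description, here is a rough sketch that groups alternate-format links with the text of their nearest enclosing ‘itemscope’ element. The names (ItemScopeCollector, collect_items) are hypothetical and the property URI is the placeholder from this draft. This is not a full microdata parser: it ignores ‘itemref’ and assigns text only to the innermost item.

```python
from html.parser import HTMLParser

ALTERNATE_FORMAT = "http://example.org/alternate-format"  # placeholder URI from this draft
# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ItemScopeCollector(HTMLParser):
    """Group alternate-format links with the text of their enclosing itemscope."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, item-or-None) for each open non-void element
        self.items = []  # one dict per itemscope element, in document order

    def _current_item(self):
        for _tag, item in reversed(self.stack):
            if item is not None:
                return item
        return None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        item = self._current_item()
        props = (attr.get("itemprop") or "").split()
        if (tag in ("a", "link") and item is not None
                and ALTERNATE_FORMAT in props
                and attr.get("href") and attr.get("type")):
            item["links"].append((attr["href"], attr["type"]))
        if tag in VOID_TAGS:
            return
        new_item = None
        if "itemscope" in attr:
            new_item = {"text": [], "links": []}
            self.items.append(new_item)
        self.stack.append((tag, new_item))

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating unclosed children.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        item = self._current_item()
        if item is not None and data.strip():
            # Collapse runs of whitespace so the text is easy to display.
            item["text"].append(" ".join(data.split()))


def collect_items(html):
    """Return one {"text": [...], "links": [...]} dict per itemscope element."""
    collector = ItemScopeCollector()
    collector.feed(html)
    collector.close()
    return collector.items
```

On a results page with several ‘itemscope’ blocks, each returned dict pairs one entity's visible text with its own export links, which is the association a single aggregated link in <head> cannot express.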
In general, HTML5, in its choices of where ‘rel’ is and is not valid, seems to be clearly expressing a preference for the newer, more flexible and general-purpose microdata attributes over ‘rel’. It is no more difficult on the data provider’s end to implement the legal HTML5 microdata approach described here, and it allows the content provider to include alternate-format links in the same area of the DOM as the human description, multiple times on a page for a multi-record page, whether or not the link is meant to be human-visible, in a consistent way, using a unique URI that precisely specifies the intention.
The schema.org vocabulary
Rather than invent our own microdata URI property name (http://example.org/alternate-format), we considered re-using the “url” property from the schema.org vocabularies. It is defined somewhat generically as “url of the item”. One example given specifies a ‘canonical’ identifier pointing to a Wikipedia page (which is still something about the item, but is clearly an entirely different page, not an alternate format). Another example given uses the url property to point to the site’s own item detail page in a multiple-item listing: “For pages like this with a collection of items, you should mark up each item separately (in this case as a series of Persons) and add the url property to the link to the corresponding page for each item.”
It is possible using schema.org “url” property would work out fine; but it also seems possible that cases would arise where it is too ambiguous, and a user-agent mis-interprets a “url” value as being an appropriate alternate format representation for import/export, when in fact it is not.
Additionally, the information provider may or may not want to use the schema.org vocabularies; the provider may prefer to use a different microdata vocabulary. All the examples on schema.org use the microdata ‘itemtype’ property to set a schema.org vocabulary on an item. The ‘itemtype’ property can specify only one vocabulary for the item. You can mix and match vocabularies in an item, but only by using properties identified with URIs, as in this proposal, rather than by using short simple strings as in all the schema.org examples. It is not clear whether full URI variants exist for the schema.org vocabularies; if they do, no mention is made of this possibility on the official schema.org site, and it seems likely to be confusing to people.
If we needed a complex vocabulary, then defining our own instead of re-using an existing one would seem more problematic. But since we only need a single property, by specifying our own URI property name we maximize re-usability with arbitrary additional microdata types of the provider’s choice; we minimize the possibility of ambiguity or conflict with existing use of existing types; and we also make it easier on the user-agent consumer, which only has to look for a <link> or <a> with an itemprop including our full URI, and not also for variants using ‘itemtype’ with a short name.
In short, by taking the approach of a new URI-named property, we think we keep things as simple and easy to understand and implement as possible, making uptake more likely.
Another option would be actually embedding the structured citation elements in the HTML itself, using RDFa or microdata.
The trick here is that, if the provider or consumer already understand a non-web format like RIS (as is true in our use case), then going the RDFa or microdata route for full semantics ends up being significantly more complicated to implement on both provider and consumer sides.
The provider first has to choose RDFa or microdata (or try to combine them). Each of these is only a framework, though, and subsequently requires choosing a particular vocabulary. A provider needs to predict what vocabularies a targeted consumer will understand (there are at least several choices vying for adoption, including the schema.org CreativeWork and bibo), and if different consumers understand different ones, possibly figure out a way to provide multiple vocabularies.
If the provider already has an RIS export available, then it is significant extra work to figure out how to transform metadata into an additional vocabulary like bibo or schema.org CreativeWork; in some cases it may not even be possible if insufficient metadata granularity exists in the providing system. And then the consumer also needs to have additional logic to understand one or more vocabularies and serialization formats (microdata or RDFa), potentially a non-trivial task.
Embedding semantic information directly in HTML using either RDFa or microdata is potentially quite useful. We encourage the continual development of these techniques, and there’s no reason a user-agent couldn’t recognize them as well as the method proposed here.
But for our actually existing environment, especially where many relevant systems already understand RIS, we think there is utility in providing a standard means for an information provider to advertise an already existing non-web format to a user-agent. We think it offers the lowest barrier and the least costly implementation for achieving our use case goals, and the best likelihood of quick adoption. It should also be easy enough to implement, especially for systems already dealing in RIS, that it will not divert resources from more complex initiatives to further develop embedded semantics.
User-agent consumer adoption
Currently, citation-scraping user-agents like Zotero or Mendeley commonly contain a bundle of site-specific adapters, for each site they want to scrape from, custom tuned to that site’s unique HTML DOM structure, and often triggered by URL.
This probably isn’t going away anytime soon, and it works out okay in practice. But it has some definite downsides. Each site-specific adapter takes manual intervention to create and to maintain. These adapters may be fragile, and break when the site changes its DOM structure, again requiring manual intervention to fix, and tracking to discover when a fix is needed. If the adapters are URL-triggered, then getting a new website (or newly located website) recognized by the user-agent requires the operator of the new website to communicate with the developers of the user-agent, and be added to the list.
It would be advantageous to supplement these site-specific adapters with a generic method a site can use, and know it will be recognized by common user-agents. This proposal is an attempt to provide such a method. While ideally such a method would eventually supplant site-specific adapters, those who use a general-purpose method can benefit immediately, even while other sites continue to be scraped by site-specific adapters. We’ve attempted to describe a method which is so simple and cheap to implement that the cost-benefit of implementation looks really good to both provider and consumer sides, making adoption more plausible.
We encourage relevant user-agents like Zotero or Mendeley to have their software recognize this method, and clearly advertise that fact, perhaps even identifying it as the preferable or recommended method if an information provider would like to ensure compatibility with the user-agent in a reliable and efficient manner.