structural marc problems you may encounter February 2, 2010
Posted by jrochkind in General. 4 comments
This is not your typical ‘why MARC must die’ post. It’s instead about very low level structural problems in a Marc21 binary file that my ILS outputs. It’s not about the semantics of MARC at all, it’s about the structural features of the Marc21 format.
I never had to know much about low-level Marc21 format details before, and wish I still didn’t, but I had to because my ILS (Horizon) is outputting certain bibs as MARC that the Marc4J Java library used by SolrMarc refused to read, claiming they were structurally invalid in various ways. (Never would have figured this stuff out without the invaluable help of sesuncedu, robcaSSon, and others in #code4Lib).
But this may help someone else figuring out why Marc4J can’t read their MARC.
1. Invalid leader bytes
In the leader of a Marc21 record, byte 10 is always ascii ‘2’, byte 11 is always ‘2’ as well, and bytes 20-23 are always ‘4500’. At least they’re supposed to be. Theoretically these bytes allow a record to specify details about the nature of its binary format, but these details are fixed in Marc21 and in all other Marc variants we know of. It’s a flexibility that was rarely or never taken advantage of in any Marc format.
However, Horizon actually stores most of its leader bytes in a db column. And if the leader bytes are something other than these invariants in that db column, Horizon’s marc export will include those leader bytes, even if they are invalid, even if they do _not_ accurately describe the Marc record they are attached to (which wouldn’t be a valid Marc21 record if it was true).
Since these values are invariant in Marc21 and most (all) other Marc formats, most Marc parsers ignore them.
However, Marc4J doesn’t, it actually treats them as gospel. So if those bytes were wrong, Marc4J will try reading the record improperly. And if those bytes weren’t ascii decimal digits at all, Marc4J will claim it can’t read the leader.
So I just had to fix those in our production ILS. And figure out where they’re coming from, and try to stop them from coming in again. Really, I blame our ILS here for even allowing such completely wrong bytes to be in its internal db.
(A perhaps better solution from the other end is fixing the Marc4J PermissiveReader to not pay attention to those bad bytes, assuming the invariant values. sesuncedu has prepared a patch doing some of that for Marc4J, hopefully it’ll get in there.)
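To see whether a given leader would trip up a strict parser, a check like the following can be run before handing the file to Marc4J. This is just a sketch in Python, not anything Marc4J itself does; the function name `leader_problems` is made up for illustration, and it also checks that the length and base-address positions hold ascii digits, since that was the other failure described above.

```python
LEADER_LENGTH = 24

# Positions whose values are fixed in Marc21, keyed by starting byte.
INVARIANTS = {
    10: b"2",      # indicator count
    11: b"2",      # subfield code count
    20: b"4500",   # entry map, bytes 20-23
}

# Positions that must hold ascii decimal digits: (name, start, end-exclusive).
NUMERIC = (
    ("record length", 0, 5),
    ("base address of data", 12, 17),
)


def leader_problems(leader):
    """Return a list of strings describing what is wrong with a leader."""
    if len(leader) != LEADER_LENGTH:
        return ["leader is %d bytes, expected %d" % (len(leader), LEADER_LENGTH)]
    problems = []
    for start, expected in INVARIANTS.items():
        actual = leader[start:start + len(expected)]
        if actual != expected:
            problems.append("byte %d: found %r, expected %r" % (start, actual, expected))
    for name, start, end in NUMERIC:
        if not leader[start:end].isdigit():
            problems.append("%s is not ascii digits: %r" % (name, leader[start:end]))
    return problems
```

An empty list means the leader at least has the invariant bytes a strict reader expects; it says nothing about whether the length values are actually right for the record.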
2. Bib Records Too Long For Marc
Because of the nature of MARC21’s ‘directory’ structure, there is a maximum length that a MARC record can be. If it’s above this length, the MARC directory doesn’t have enough bytes in it to describe where the fields beyond this length are in the record, and the MARC record is unreadable. (Incidentally, it’s very odd that MARC includes internal byte offsets recorded as ascii decimal chars, rather than ordinary binary data. If it used more typical simple binary encoding of integers for byte offsets, the maximum length of a MARC record would be quite a bit larger. But it doesn’t. Oh well.)
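To make the limit concrete: the record length in the leader is five ascii digits, and each 12-byte directory entry holds a 3-character tag, a 4-digit field length, and a 5-digit starting offset. Here is a rough Python sketch of reading a directory; `parse_directory` is a made-up name and this is not how Marc4J does it, just an illustration of the layout.

```python
FIELD_TERMINATOR = b"\x1e"
LEADER_LENGTH = 24
ENTRY_LENGTH = 12

# The largest values the fixed-width ascii decimal slots can hold.
MAX_RECORD_LENGTH = 99999   # five digits, leader bytes 0-4
MAX_FIELD_LENGTH = 9999     # four digits in each directory entry


def parse_directory(record):
    """Return a list of (tag, field_length, start_offset) for one record."""
    # The directory runs from the end of the leader to the first field terminator.
    end = record.index(FIELD_TERMINATOR, LEADER_LENGTH)
    directory = record[LEADER_LENGTH:end]
    if len(directory) % ENTRY_LENGTH != 0:
        raise ValueError("directory is %d bytes, not a multiple of 12" % len(directory))
    entries = []
    for i in range(0, len(directory), ENTRY_LENGTH):
        entry = directory[i:i + ENTRY_LENGTH].decode("ascii")
        entries.append((entry[0:3], int(entry[3:7]), int(entry[7:12])))
    return entries
```

A directory that isn't a multiple of 12 bytes, which is one of the things Horizon produces for over-long records, raises right away instead of being misread.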
So what does Horizon marc export do if it has a record which has too much data, which will go over the maximum record length in MARC? It outputs it anyway. But the marc record it outputs is seriously messed up. It’s got a MARC directory which may be entirely illegal (not a multiple of 12), it’s got a wrong leader bytes 0-4 ‘length’, possibly other problems. Depending on the individual record and exactly how Horizon ended up outputting it, Marc4J might just skip it as a bad record and go on. That’s the best that could be expected.
However, more often, Marc4J gets entirely confused because of the bad leader bytes 0-4 length, and doesn’t understand where the subsequent record in the marc file actually begins. So every other record after this too long one in the marc file is a loss to Marc4J/SolrMarc indexing. Either every subsequent record can’t be indexed at all, or even worse, every subsequent record is indexed by Marc4J/SolrMarc, but completely wrong, because Marc4j/SolrMarc got the wrong data.
I need to work out a patch for Marc4J PermissiveReader so when encountering such a record, Marc4J can at least recover by properly finding the beginning of the NEXT record, using the Marc Record Separator character.
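The recovery idea is simple enough to sketch outside of Marc4J. The Python below is not the patch itself (that would be Java, inside PermissiveReader); it just shows the technique of ignoring the declared length and scanning for the Record Terminator (hex 1D) to find where the next record starts. The name `split_records` is hypothetical.

```python
RECORD_TERMINATOR = b"\x1d"


def split_records(data):
    """Yield (raw_record, length_ok) pairs, splitting on the record terminator.

    length_ok is True only when leader bytes 0-4 are ascii digits that
    match the actual number of bytes up to and including the terminator.
    """
    start = 0
    while start < len(data):
        end = data.find(RECORD_TERMINATOR, start)
        if end == -1:
            # Trailing bytes with no terminator: hand them back as a bad record.
            yield data[start:], False
            return
        raw = data[start:end + 1]
        declared = raw[:5]
        yield raw, declared.isdigit() and int(declared) == len(raw)
        start = end + 1
```

Records flagged False can be logged and skipped while everything after them still gets read. The obvious weakness: if a stray Record Terminator has ended up inside a field value (see problem 4), this splits in the wrong place too.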
3. Blank/null tags
This one might be Horizon-specific. Horizon allows the operator to accidentally add a tag to a record that has a null tag value. Not 100, 245, or something else, but just null. This accident could have been made manually, or could have been made by some sort of automated import script when we batch loaded records into Horizon. When the Horizon marc exporter encounters such a record, it does output marc21 for it, but completely invalid and wrong marc21.
I blame Horizon for this. It ought not to allow null tag values to even exist in the db, and if they do exist, it ought to ignore them on export, not create an invalid marc record.
This is another problem that often results in Marc4J getting completely confused about where one record ends and the next starts, making the entire rest of the Marc file after such a record un-readable. Probably because of a bad leader bytes 0-4 length value, so perhaps if I can work out a patch for the problem above, it will at least result in Marc4J successfully skipping such a record and going on to the rest of the file.
4. Illegal chars in Marc values?
This one I haven’t completely gotten to the bottom of yet, because I made the mistake of fixing the couple examples I found in the Horizon Staff Client, where it didn’t really show me exactly what was going on.
But I think some Marc control characters (Field Terminator or Record Terminator) wound up in some of my record values in the db. (No doubt as the result of an import gone wrong at some point in the past). The Horizon marc exporter simply included them unescaped in its marc output. Resulting in special marc control characters in illegal places, or places where they don’t mean what they mean, in the marc file. This also messed up Marc4J something awful.
Again, I kind of blame Horizon here, for allowing bad data in its internal store, and then for writing bad data out in marc export when such bad data is in its internal store.
NEW! 4 Feb 2010:
Marc control character in internal data value.
I’ll describe this one in Horizon-specific terminology, cause it’s clearer.
The horizon “bib” table holds an individual marc field in the ‘text’ column. Every ‘text’ column ENDS in the Marc Field Terminator character (decimal 30, hex 1E, sometimes displayed as “^^”).
However, some of our values have that Marc Field Terminator character _not_ as the last character, but internally. This creates problems in marc export, where the marc created by marcout is invalid, unparseable marc (as it includes the marc Field Terminator control character in an illegal position).
This problem is not visible in Horizon Staff Client, the control character is not shown. But it’s hiding there in the database anyway. If you open an individual record in Horizon Staff Client and then simply re-save it, it SEEMS to fix the problem in at least some cases (not sure about all), but probably makes more sense to fix it in bulk through an automated process anyway.
As a technical note: I used this SQL against hzdev db to find the number of bibs which contained char(30) as some char OTHER than the last in dbo.bib. It takes quite a while to run. This would have to be re-done for dbo.bib_longtext.longtext, another table where data destined for marc export can hide. You could base an automated fix off of this SQL technique.
select count(distinct bib#) from dbo.bib where (charindex(char(30), text) != char_length(text))
Correction: that SQL will also find values that do not end in the FT at all. Horizon ordinarily does end every value with the FT, so a missing one is sort of an error, but it doesn’t cause any problems. Here’s one to find only the values with an internal FT, not including ones with no FT whatsoever:
select count(distinct bib#) from dbo.bib where (charindex(char(30), text) not in(0, char_length(text)))
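The same test is easy to express outside the database, for instance if you've already dumped the ‘text’ values to a file. A minimal Python sketch (the function name `has_internal_ft` is made up); what to actually do with the offending values is a separate decision this doesn't make.

```python
FIELD_TERMINATOR = "\x1e"   # decimal 30, the same character as char(30) in the SQL


def has_internal_ft(text):
    """True if a Field Terminator appears anywhere before the last character.

    Same condition as the corrected SQL: values with no FT at all, or with
    an FT only in the final position, are not flagged.
    """
    return FIELD_TERMINATOR in text[:-1]
```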
A note on MARC control character terminology
One confusing thing in dealing with this stuff that took me a while to figure out is how MARC uses its own special weird names for certain control characters.
MARC has a “Field Terminator” (which is sometimes called ‘field separator’ in marc docs instead of ‘terminator’) and a “Record Terminator” (also sometimes called ‘record separator’ in docs instead of ‘terminator’).
But the ascii values used for these special MARC control codes already had names in ascii, and they are confusingly similar but not the same names! This certainly leads to confusion.
Marc “Field Terminator” == Hex 1E == Decimal 30 == Ascii “Record Separator” == “control-^” or “^^”, which is how stock vim will display it.
Marc “Record Terminator” == Hex 1D == Decimal 29 == Ascii “Group Separator” == “control-]” or “^]” which is how stock vim will display it.
(Correction 3 Feb, this next is also part of the marc standard):
Marc “Subfield Delimiter” == Hex 1F == Decimal 31 == Ascii “Unit Separator” == “control-_” or “^_” which is how it will show up in vim.
Also update 3 Feb 2010. I made this little sign and now keep it on my wall next to my desk, so I can refer to it ha.
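For anyone who'd rather have the sign in code, here are the three characters as a Python table, plus the rule behind the vim display: caret notation is just ‘^’ followed by the character 0x40 above the control code. The names `MARC_CONTROL_CHARS`, `caret_notation`, and `make_visible` are invented for this sketch.

```python
# code: (MARC name, ASCII name)
MARC_CONTROL_CHARS = {
    0x1E: ("Field Terminator", "Record Separator"),
    0x1D: ("Record Terminator", "Group Separator"),
    0x1F: ("Subfield Delimiter", "Unit Separator"),
}


def caret_notation(code):
    """Caret notation for an ASCII control code, e.g. 0x1E becomes '^^'."""
    return "^" + chr(code ^ 0x40)


def make_visible(data):
    """Render bytes as text with the three MARC control characters shown
    in caret notation, so they can be spotted in a terminal or log."""
    out = []
    for byte in data:
        if byte in MARC_CONTROL_CHARS:
            out.append(caret_notation(byte))
        else:
            out.append(chr(byte))
    return "".join(out)
```

Calling `make_visible` on a raw field makes a stray terminator in the middle of a value easy to see, which the Horizon Staff Client won't show you.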
solr multi-core gotcha February 1, 2010
Posted by jrochkind in General.
This might only be a gotcha if you’re a lazy guy like me.
But in ordinary Solr, if you want to completely clear out your indexes, you can just delete the ‘data’ directory, no problem. It sounds weird, but several solr guru types told me I could do it, and it was certainly convenient to be able to do when my data (still in development) got all messed up and I just wanted to start over, and it did indeed work.
But when you’ve set up ‘multi core’ in Solr (which has nothing to do with CPUs, it’s Solr’s term for having multiple entirely separate solr indexes in one running solr process)… don’t try to just go and delete the ‘data’ directory in one of the cores. It messes everything up horribly and you have to repair/rebuild your cores.
So I guess in a solr multi-core world, if you want to delete all your solr data, you have to do it the normal way with a ‘delete’ operation.
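For reference, the ‘delete’ operation is a delete-by-query for `*:*` posted to the core's update handler, followed by a commit. Here is a sketch in Python that builds that request; the host, port, and core name (`core0`) are assumptions you'd swap for your own setup, and `build_delete_all_request` is a made-up helper name.

```python
import urllib.request


def build_delete_all_request(solr_base_url, core):
    """Build (but do not send) a request that deletes every document in one core.

    solr_base_url is something like "http://localhost:8983/solr" (assumed);
    core is the core's name as configured in your multi-core setup.
    """
    url = "%s/%s/update?commit=true" % (solr_base_url.rstrip("/"), core)
    body = b"<delete><query>*:*</query></delete>"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` sends it. Obviously only point this at a core you really do want emptied.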
Using Mongrel with Rails 2.2+ and prefix January 29, 2010
Posted by jrochkind in General.
First, don’t. Mongrel seems to be a dead product, I think everyone’s moved over to mod_rails/passenger. Because mod_rails/passenger (what’s that product actually called these days?) is nice. Just use that, mongrel hasn’t been updated in years now and doesn’t look like it will be.
But in case you have to. And you run into that thing where you want to use mongrel’s “--prefix” argument, but it doesn’t work with Rails 2.2+, because mongrel tries to call API that doesn’t exist in Rails anymore. What do you do?
Some people patch mongrel. Some people give up on using “--prefix” in mongrel, and just set the prefix in their Rails environment.rb (a much less flexible arrangement). But I actually found a better solution, which worked out fine for me back when I was using it, before I gave up on mongrel and switched the application involved on Rails 2.2+ over to passenger.
Someone just emailed me to ask for it, and it took me a while to track down, it didn’t google well, so I’ll stick it here so it maybe googles better, and I can find it again if I need it.
###################
# Fix for mongrel which still doesn't know about Rails 2.2's changes, grr.
# We provide a backwards compatible wrapper around the new
# ActionController::Base.relative_url_root,
# so it can still be called off of the actually non-existing
# AbstractRequest class.
module ActionController
  class AbstractRequest < ActionController::Request
    # mongrel's --prefix handling calls this setter; pass it through to Base
    def self.relative_url_root=(path)
      ActionController::Base.relative_url_root=(path)
    end

    # matching reader, also delegated to Base
    def self.relative_url_root
      ActionController::Base.relative_url_root
    end
  end
end
###################
Pretty nifty, huh? I think I came up with that myself, but can’t remember, maybe I got it from somewhere else, but I’m surprised that it wasn’t already google-able, it’s such a nifty solution which seemed to work fine as far as I can remember. Maybe cause everybody has already moved over to passenger, and nobody cares anymore.
Purposes/Functions of Controlled Vocabulary January 20, 2010
Posted by jrochkind in General. 4 comments
Quite a while ago on December 14th, Jim Weinheimer wrote something on NGC4Lib that got me thinking about a paper I wrote back in school. Jim wrote:
I often see this sort of finding, but it always leaves me a bit confused. When they discuss “controlled vocabulary” and “clustering” I wonder what they mean. There are two primary purposes and functions of subjects and you can see them especially clearly in LC subject headings: there is the collation function, which brings together metadata records for resources with similar subjects, and then there is the “labelling” function, which provides an authorized term that describes the items that have been collated. The purpose of the labelling function is to make the items collated together findable by humans.
I actually wrote a paper in school on the purposes/functions of controlled vocabulary understood broadly (which in the paper for some reason I call ‘systems of knowledge organization’ rather than ‘controlled vocabulary’), in which I, via the literature and my own analysis, identify quite a bit more than two purposes or functions of controlled vocabulary.
I’m not sure if Jim is saying that ‘clustering’ doesn’t mean anything, is really something subsumed by one of his Big Two, or is just an illegitimate purpose you should never use a controlled vocabulary for? (Or just never use a subject vocabulary for? Or never use LCSH for?).
Regardless, I think there are clearly more than two, although figuring out exactly the right ‘taxonomy’ for purposes of controlled vocabulary isn’t necessarily straightforward. But I think there are more than two things controlled vocabulary can be useful for, more than two that it is useful for, and indeed more than two that the historical literature (including from Cutter and Dewey) suggested our legacy systems may have been designed to be useful for!
In my paper I identified 10, although they have some overlap. In retrospect, with more experience working with controlled vocabularies — and just as important, systems that try to make them serve various purposes for the user — I think I could collapse some of them.
Here’s the school paper itself, if you really want to read the whole thing:
Towards a Conceptual Model of Knowledge Organization Systems: 1. The Functions of Knowledge Organization Systems
by Jonathan Rochkind
12 December 2005
Here’s my summary of the functions I identified, with some retroactive re-thinking.
Functions of Controlled Vocabularies
1. Class Retrieval
Just plain identify a particular class or term, and then look up all documents with that term attached. What most librarians probably think of first, and perhaps the same as what Jim calls the ‘collation’ function. (All of these functions have been referred to by different names by different people; my paper identifies some of them in the literature, including writings by Dewey, Cutter, and the CRG).
2. Browsing
It’s unclear exactly what “browsing” means, but lots of people talk about it. It is some sort of exploratory or investigatory, probably iterative, interaction with a corpus, to be contrasted with the more specifically directed aims of Class Retrieval.
If you were just to page through the LCSH Red Books looking for interesting topics, that might be a form of “browsing”. If you were to browse a physical shelf ordered by DDC or LCC, that’s another kind of browsing. Note that both of these activities rely on particular features of the controlled vocabulary to make browsing possible: For LCSH, human readable terms applied to classes that, when filed alphabetically, put similar subjects near each other. For Dewey, numeric class numbers that, when filed in order, do similar.
You don’t need to have anything like that purely for ‘class retrieval’, which is why it’s worth mentioning browsing as a separate function. But just as importantly, there are probably other ways to support browsing (exploratory investigation of a corpus), especially in the digital environment, that may not rely on display headings or displayable notation that can be filed in order! Can you think of any?
3. Relationship Navigation.
Some controlled vocabularies let you navigate relationships between classes and terms. Others do not. Perhaps this could be thought of as a subset of Browsing, not sure, but it’s a particular type of Browsing if it is that.
While our previous examples of a shelf browse of a Dewey shelf, or a page through the Red Books, may expose you to some hierarchy (because both systems intend to put sub-hierarchical classes “after” their “parent” in filing order), in fact, especially in the computer environment, it’s possible to do much more explicit relationship navigation, and do it across more than one hierarchy or type of relationship.
Identify a class or term, maybe look at the documents attached to it, then realize that you really want documents on either a more general or more specific topic. Relationship navigation lets you find those documents. (Although purely post-coordinated combinations of terms may also allow you to do that, depending on how the system has been designed and applied).
4. Identification
The Identification function is served by listing assigned class or term information on a record so that the user knows more about the nature of the document indicated.
Consider traditional LCSH “subject tracings”. The fact that the book is about Subject A may not be revealed by its author or title, but there you have a subject tracing to tell you that.
Not sure if this is what Jim means by “labelling” or he means something different.
5. Locating
Basically, just using a controlled vocabulary assignment as a ‘locator’ so you know where to find the document. That is, traditional shelf location done by classification such as Dewey or LCC. (Although as some have recently noted in NGC4Lib, that’s not originally what Dewey invented DDC for!).
When students come up to the reference desk having written down only a Dewey or LCC shelfmark, they are also assuming it can be used for the ‘locating’ function, to refer to and find a unique item. Although in our library it doesn’t work so well for that, we often wish they had written down the author and title instead!
This isn’t a particularly interesting function, really, but it’s one we use, that some but not all controlled vocabularies are suitable for. (Basically, those that can be used to assign a unique notational string to an item, those we usually call ‘classifications’).
6. Ordering.
Some controlled vocabularies allow you to place documents in some meaningful order (that you couldn’t do without the controlled vocabularies), others don’t.
Traditionally, this is used for shelf location, DDC and LCC. It could also possibly be used in an online display.
This isn’t really that interesting.
7. Surveying
To support the user in getting a general overview of the corpus. While similar to ‘browsing’, walking up and down the shelves isn’t going to be very good for getting a general overview, unless you have a very small corpus! But looking at Dewey classes arranged in hierarchical fashion, with each class having its human readable display label assigned to it, and each class also having the number of documents posted to it listed: ah, now there’s a survey of the landscape!
Again, some controlled vocabularies can be used better for this function than others. Having a hierarchy is probably helpful, unless the vocabulary only has a small number of terms. Trying to use LCSH for a survey of a particularly large corpus would be tricky, since the hierarchy is so odd, but perhaps it could be done. NG faceted navigation systems often try to use either Dewey or LCC for this though.
8. Dealing with a Large Result Set
What a lame name for a purpose/function, eh? By this, I really just mean the same thing as ‘surveying’, but applied to a result set, not to the entire corpus. Give you a summarized overview of what’s there when you type in “politics” and get 400,000 results. Very similar to Surveying, really.
9. Keyword Match Enhancement
Even if you have full text, the term you are searching on may not appear in the fulltext of an item that is very much about that concept! The item may be in another language, or may not have written words at all. Or, to use another example recently raised on NGC4Lib, World War I wasn’t called World War I until there was a II, but a 1930 book can still be about World War I.
If you take the LCSH subject headings, and make sure they are in your textual indexing system too, you can increase the recall of the search.
If you first make users explicitly identify a class, and then click on it, then that’s #1, Class Retrieval. But if they just enter a search in a Google-style search box, and magically get their 1930 document on WWI because it had a subject assigned to it, and that subject included the heading or lead-in term “WWI”, then that’s Keyword Match Enhancement.
The more lead-in terms a vocabulary has (that you can also index for keyword match enhancement), the better it probably is for this. The better the headings or lead-in terms match the end user’s query vocabulary, the better too. Which might be useful for other functions but is obviously crucial for this one.
10. Negotiation
Interacting with a controlled vocabulary in various ways can help the user come to a better understanding of what they’re actually looking for in the first place, by showing them what people call different things, and by showing them how different concepts can be related to one another. The right kind of interaction with the right controlled vocabulary can sometimes do what a good reference librarian does (probably not as well, but open 24 hours).
This might sound odd, but I included this function because it was in fact mentioned in the literature, from Cutter to Vickery to Svenonius.
So what?
So there you go. I’m not sure those 10 are the right 10, they can probably be blended up a bit to get a better taxonomy. But I’m pretty sure there are more than just 2!
Part of what I talk about in the larger paper is that the online environment has made us try to get more functions out of controlled vocabulary, functions that certain vocabularies may not have been designed for — or that they may not have been used for during the previous 100 years, where the use of a particular vocabulary may in fact not be the same as what it was designed for! Different features of a controlled vocabulary (both in design and in application by an indexer/cataloger) can facilitate different functions.
For 100 years we didn’t need to think much about that, settling into “alphabetic subject vocabulary” used basically only for class retrieval with a card catalog, and “classification” used basically only for browsing via an order that also served as a locator. Interestingly, that’s not what “classification” was necessarily designed for at all! Back in the day, there was more confusion and less consensus about what these things were for, which my paper goes into a bit.
With the computer, we can try to do more than we could when the only tools we had for interacting with controlled vocabularies were card catalogs, printed catalogs, and actual shelves. (Cause interacting with shelves arranged by LCC or DDC is interacting with a controlled vocabulary!). This is bringing us back to some of the chaos of figuring out “what are these things for anyway”, and “how do we design them to serve these functions well?”
Google Scholar does not allow meta-search January 13, 2010
Posted by jrochkind in General. 4 comments
Neither does Google ‘ordinary’ web search for that matter.
Metalib, the Ex Libris federated (broadcast) search product has for a long time supported a Google Scholar target, which I believe was accomplished by screen-scraping G.Scholar (cause I can’t figure out any other way it could have worked).
As long as, I think, two years ago, some Metalib customers had problems with this G.Scholar target, and when they contacted Google, Google basically told them “Well, that’s because we don’t allow federated search of Google Scholar. You are violating our terms of service. Our automated rate controls probably noticed all the traffic from your IP and cut you off as a bot, which is what we wanted them to do.”
So from that moment on, I expected that G.Scholar would eventually become no longer supported in Metalib. I’m surprised it took this long!
Finally today, Ex Libris sent out an email that says, among other things:
In brief, Google rejected the use of these services via MetaSearch, since their policy is to reject automated queries. We understand that Google’s technology may be identifying MetaSearch as an automated query, which is identified (and blocked) by IP addresses. We contacted Google to ask about one of these services (Google Scholar) and were informed that Google does not permit use of these services via MetaSearch.
We always respect the decisions and policies of the search engines that act as MetaLib resources. We will therefore de-activate resources that are not supported when we become aware of it, and ensure that all new resources are supported before we add them.
Ex Libris customers are welcome to ask Google to change its policy. If Google decides to support MetaSearch engines for a particular service at any time, we will be happy to re-activate the relevant resource.
We will also endeavor to reach a positive outcome with Google on this issue.
I wouldn’t hold my breath for a “positive outcome”, myself. I think Google is pretty clear amongst themselves that it’s not in their business interests to allow someone else to present Google (Scholar) results ‘intermingled’ with other people’s results, which is exactly what meta-search does. In most cases, they’ve decided that it’s not even in their business interests to allow external software to search their resources, even if it presents them in a unitary and non-commingled-with-others way. Note that they have very specifically not provided any API for Google Scholar, although many people have asked for one.
One exception is the Google Books service. Now, at first Google provided a ‘javascript’ Google Books API which would allow you to sort of kind of embed Google results in your application, but only if the actual requests to Google were coming from the individual browser via AJAX. If you tried using the javascript API server-side, their rate limiting software would soon notice and cut you off. (They have a similar javascript API for ordinary Google web search results too, I think.)
However, later they provided a “Data API” for Google Books, which you are explicitly allowed to call from server-side applications. This is actually awesome, it lets us do a lot more (I’m using it in Umlaut), and I’m so pleased they did this, especially because they have NOT done this for any other google search. Note though that even the Google Books Data API terms of service (to my reading) prevent you from inter-mingling a Google Books result list with other services’ results, so Metalib _still_ couldn’t do its thing on GBS.
Google Scholar is a great resource, and I wish we could include it in our meta search tools, but Google has decided it’s not in their business interests to allow this. It’s a good reminder that, yes, Google does have business interests, and, yes, they act in them, even when it results in things we don’t like. It’s refreshing that for once, if local staff asks “why can’t google scholar be in our metasearch product”, I don’t have to blame the metasearch vendor, I can say “Because Google will not allow it”, and in the process help teach that Google is not some utopian charity after all.
ISSN search field in Solr December 24, 2009
Posted by jrochkind in General.
It’s fairly simple in the end, but took me a while to figure out as a Solr newbie. So I’ll document it here, in case someone else wants to do the same thing, or someone else finds it useful as a simple getting started solr example.
So I wanted a Solr indexed field for ISSNs. Mainly, the important thing here is that a query gets a match whether it uses a hyphen (1234-5678) or not (12345678), and whether the original data used a hyphen or not. That’s pretty much the only interesting part of an ISSN field in Solr.
Additionally, I wanted to make it possible for multiple ISSNs to be in one “value” in Solr, to make it easy to index records where it might be stored that way. For instance, send an 022$a and the new 022$l in one “value”, or even ‘bad’ data that has multiple ISSNs in a single 022$a. Sure, you could have the indexer split em up first, but I figure offload as much of this work to solr as possible, so it’ll be there for any indexer that uses it.
So here’s my annotated Solr field type definition for issn.
<fieldType name="issn" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <!-- tokenize just splitting on whitespace. So if multiple ISSNs
         are present separated by whitespace, we'll catch em all
         in their own tokens. But note that means you can't have
         an ISSN like "1234 5678", that'll end up being considered
         two ISSNs. We're not using the StandardTokenizer,
         because we want to keep "1234-5678" as one token, not
         split it into two! Whitespace tokenizer is sufficient. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- ISSNs can have an X as a last checksum 'digit'. While I
         think that's always supposed to be uppercased, I don't
         trust my data or my users entering queries to make it so,
         so change all lowercase x to X to make sure queries always
         match regardless of case of the X -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="(x)" replacement="X" replace="all"
    />
    <!-- ISSNs are composed just of numbers and X. So strip out
         anything that isn't that. This will get rid of hyphens,
         so allow hits whether or not there's a hyphen match between
         original data and query. It will also turn any tokens
         that don't have those into empty strings. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="([^0-9X])" replacement="" replace="all"
    />
    <!-- get rid of empty string tokens. At first I didn't think that
         mattered, but if you don't, then you get weird behavior
         if someone enters a query that doesn't look like an ISSN.
         It gets analyzed into empty string tokens, which then
         match empty string tokens in the index, which gives you
         unexpected hits. I don't really care about the max
         chars, but the length filter requires one, so I
         just use a high number. -->
    <filter class="solr.LengthFilterFactory" min="1" max="100"/>
  </analyzer>
</fieldType>
So that’s the fieldType; now we simply declare a field:
<field name="issn" type="issn" indexed="true" stored="false" multiValued="true"/>
It’s multi-valued because a record can and often does have more than one ISSN. (Although our field type above will in some cases have multiple ISSNs as different tokens in one value, it’s convenient to let the indexer send multiple values to Solr too, and it’s really preferable when that happens, since it’s cleaner.) It’s not stored, because I just care about this as an index lookup; I display the actual ISSN by parsing MARC at display time.
And for the record, here’s the SolrMarc definition that fills up this issn field. I made the choice to assign any ISSNs recorded for a series the record belongs to as ISSNs for the record; I think this will lead to expected behavior? (I did not include the 776x in the ISSN index, although our legacy OPAC seems to have done so. That seems weird to me and doesn’t seem to make sense. Anyone know if I’m missing something?)
issn = 022al:490x:440x:800x:810x:811x:830x
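For anyone not used to reading these specs, here is a small Python sketch of how I read that line: each colon-separated piece is a three-character MARC tag followed by the subfield codes to pull from it. The record model (a list of tag and subfield pairs), the function names, and the sample ISSNs are all invented for illustration. Joining several subfields of one field into a single space-separated value is an assumption about the indexer's behavior, not something this post confirms.

```python
def parse_spec(spec):
    """Split a spec such as '022al:490x' into (tag, subfield codes) pairs."""
    return [(part[:3], part[3:]) for part in spec.split(":")]

def extract(record, spec):
    """Return one value per matching field, joining wanted subfields with a space."""
    values = []
    for tag, codes in parse_spec(spec):
        for field_tag, subfields in record:
            if field_tag != tag:
                continue
            wanted = [text for code, text in subfields if code in codes]
            if wanted:
                values.append(" ".join(wanted))
    return values

# Hypothetical record: (tag, [(subfield code, text), ...])
record = [
    ("022", [("a", "1234-5678"), ("l", "1234-5679")]),
    ("245", [("a", "An example journal")]),
    ("830", [("a", "Example series"), ("x", "8765-4321")]),
]
print(extract(record, "022al:490x:440x:800x:810x:811x:830x"))
# ['1234-5678 1234-5679', '8765-4321']
```

A value like '1234-5678 1234-5679' is exactly the multiple-ISSNs-in-one-value case that whitespace tokenizing in the field type is meant to handle.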
Combining information from catalog and link resolver kb December 14, 2009
Posted by jrochkind in General.4 comments
Our link resolver, branded Find It, powered by Umlaut backed up by SFX and other sources, aims to comprehensively answer the question “what library services can be provided for this known citation.” One of the more important services is electronic full text access.
Because of the legacy data environment we’ve inherited, in order to advertise online full text access for a given citation (usually an article, but possibly a journal or even a monograph), the software needs to combine information from MARC in our traditional ILS (via the 856 field) with data from the link resolver (SFX) knowledge base. Unlike our MARC data, the SFX kb, when it works, can bring the user to an article-level link for their article citation, and knows whether we have access to the particular article cited.
This gets tricky to do without providing misleading or confusing duplicative or missing information for the user.
Here is a description of the heuristic algorithm Umlaut uses to decide whether to include information from each source, and how. This explanation demonstrates how this is necessarily an imperfect and approximate process, due to our inherited legacy data environment, but I think what we’re doing now works reasonably well. Slightly edited from internal documentation I prepared for my coworkers.
Why combine?
Why not just use links from SFX? (link resolver)
- Because there is information about full text links in Horizon that is not included in the SFX database.
Why not just use links from Horizon? (traditional MARC ILS)
- Because SFX’s information is more powerful for supporting the user’s work. The majority of Find It use is for journal articles:
- For a journal article, the SFX link should take the user directly to the article requested; the Horizon link will always take the user only to the journal title page and require additional navigation from the user.
- The SFX kb includes machine-processable range-of-coverage information, so an SFX link will only be shown if the SFX kb thinks the article requested is within the range-of-coverage. Horizon links, when they have ranges of coverage at all, do not have machine-processable ranges of coverage, so Find It is unable to determine if the Horizon link is supposed to include the specific article cited or not.
- There are some links in SFX that are not in Horizon, although this gap is growing smaller.
- Horizon links historically were less reliable than SFX links (more likely to be broken), and much less likely to have even human-readable range-of-coverage information, although this gap is narrowing too.
Why not just combine simply?
The first approach might be simply displaying all the links Find It can find from Horizon together with all the links Find It can find from the SFX kb.
However, this would lead to some unfortunate consequences for the user, mostly related to the fact that some platform links exist in both Horizon and the SFX kb, but in different forms with different capabilities.
Double-listing
In many cases, under this simple approach, a display would include the same link twice, once from Horizon and once from SFX. The links would be labeled slightly differently in the display, as there’s different label information from each source.
While this alone would be unfortunately confusing to the user, what’s worse is that these links aren’t in fact exact duplicates: the Horizon link is never going to take the user right to the article requested. So including the duplicate but less functional Horizon link is really a dis-service to the user.
Example:
(Temporary live link: http://findit.library.jhu.edu/findit-dev/go/16097)
Listing Horizon link despite lack of availability
Even worse, in other cases, the user may be asking for an article from a journal we have some coverage of, but not coverage including the specific article requested. The SFX kb is smart enough not to display the link in this case; but the Horizon db is not.
So in these cases, even though we have information (in SFX) that we in fact do not have availability for the requested article, and no SFX-derived link will be provided–a non-functioning Horizon-derived link would be provided anyway.
This leads to user frustration; a main goal of Find It is to, wherever possible, not give a user a link that won’t in fact work.
Example
SFX knows that the Cambridge UP range of coverage is only to 1999.

Temporary live link: http://findit.library.jhu.edu/findit-dev/go/16108
But on an article-level request for a 1995 article, the Horizon Cambridge link is still shown:

Temporary live link: http://findit.library.jhu.edu/findit-dev/go/16118
General concept of a solution
The general idea of a solution is to then only show Horizon links when they do not represent platforms that SFX already knows about. Since the SFX kb is more fully functional than the Horizon db in terms of links, we try to consider the Horizon db ’supplemental’, only using it when it’s got information that is not also in SFX. That would prevent double-listing, and prevent listing of Horizon links when SFX knows that we don’t actually have access (to the specific requested article) after all.
However, it’s trickier than it sounds to implement this in software, because it’s not always entirely clear from a Horizon record whether a given link is something the SFX kb knows about or not.
All the Find It software has to go on is the URL in the Horizon record. From this URL, Find It needs to decide if it’s something the SFX kb knows about or not.
Find It tries to do that by keeping a list of URL hostnames that it believes SFX knows about, and ignoring Horizon URLs that match a hostname on that list. But this is subject to a couple of problems:
- It’s tricky to get an accurate list of ‘hostnames SFX knows about’, some things may be left off.
- Even if a Horizon URL matches a hostname SFX knows about, the particular Horizon link may represent a different platform on that hostname that SFX does not know about. The best example of this is with ebooks. For instance, SFX in general knows about “ebscohost.com”. However, we have e-book links from ebscohost.com, and SFX does not know about them.
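As a rough illustration of that hostname check, here is a Python sketch. The function name, the one-entry list, and the sample URLs are invented for the example. Matching subdomains by suffix is also an assumption, since nothing here specifies exactly how Find It compares hostnames.

```python
from urllib.parse import urlparse

# Hypothetical sample of 'hostnames SFX knows about'
SFX_HOSTNAMES = {"ebscohost.com"}

def sfx_probably_knows(url, known=SFX_HOSTNAMES):
    """True if the URL's host is a listed hostname or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == name or host.endswith("." + name) for name in known)

print(sfx_probably_knows("http://search.ebscohost.com/some/journal"))  # True
print(sfx_probably_knows("http://www.example.org/some/journal"))       # False
# The second problem above: an e-book link on the same host also matches,
# even though SFX does not actually know about it.
print(sfx_probably_knows("http://search.ebscohost.com/some/ebook"))    # True
```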
So Find It has developed an approximate ‘heuristic’ algorithm to try and do the best it can to display Horizon links only when they are likely to actually be supplemental to what’s in SFX, and not when they are likely to duplicate what’s in SFX. This is an imperfect compromise to try and give the best display we can to the user; it will invariably make some mistakes in the directions of both inclusion and exclusion. But we’ve tweaked the algorithm from experience to try and get an optimal compromise.
The Find It Horizon inclusion algorithm
This is the compromise attempt at optimal rules that Find It implements as of now (Nov 2009). As we tweak the algorithm in response to problem cases, it may continue to evolve.
- Find It first assembles a list of bibs from Horizon that ‘match’ the Find It request. Then it looks for (marc 856) links within these matches.
- This is a somewhat imperfect process to begin with. Depending on what information was supplied to Find It with the request, and what information is in the Horizon records, Find It may miss some records (‘false negatives’), or get some incorrect ‘matches’ (‘false positives’) here.
- However, if the Find It request originates from a catalog detail page, false negatives are unlikely, because Horizon bibID is used to make sure the originating record is considered a ‘match’.
- If a matching Horizon marc record is not for a journal, then Find It displays it no matter what. This is because the SFX kb generally does not include e-books, so we were incorrectly excluding too many Horizon e-book links otherwise.
- If the request is: A) for a _title-level_ citation (not an _article-level_ citation, but for a journal as a whole) and B) there are no links from SFX provided, then Horizon links are shown no matter what.
- In this case, the Horizon link is highly likely to represent something SFX does not know about. Since it’s a title-level citation, there are no range-of-coverage issues. So if nothing is coming through from SFX, the Horizon links are highly unlikely to ‘duplicate’ anything from SFX, they are likely to be unique information we want to show the user.
- Otherwise, the request is for a specific article, and the Horizon record is for a journal. In these cases, Find It will display the link only if its hostname does not match the list of hostnames Find It believes SFX knows about.
- Find It maintains a list of hostnames that it believes the SFX kb knows about. It maintains this list by an automatic extraction from the SFX kb. However, this list is supplemented by hand with URLs that could not be automatically extracted from the SFX kb. (For instance, automatic SFX extraction tells Find It that “ebscohost.com” is something SFX knows about. However, we have manually supplemented that with “epnet.net”, knowing this is really the same thing.)
- Technical info: This manual list is specified in the Umlaut file /config/umlaut_config/initializers/umlaut/resolve_logic.rb, the variable “additional_sfx_controlled_urls”
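The rules above can be sketched as one small Python function. This is an illustration of the logic as described, not Umlaut's actual code; the function and parameter names are invented. One assumption is labeled in the comments: a title-level request where SFX did return links is not spelled out in the rules, so the sketch lets it fall through to the hostname check.

```python
def show_horizon_link(bib_is_journal, request_is_title_level,
                      sfx_link_count, host_known_to_sfx):
    """Decide whether a Horizon 856 link should be displayed."""
    # Non-journal records (mostly e-books) are always shown,
    # because the SFX kb generally does not include them.
    if not bib_is_journal:
        return True
    # Title-level request with nothing from SFX: the Horizon link is
    # very unlikely to duplicate anything, so show it.
    if request_is_title_level and sfx_link_count == 0:
        return True
    # Everything else (assumed to include title-level requests where SFX
    # did return links): show only if the hostname is not one SFX knows about.
    return not host_known_to_sfx

# Article request, journal record, link on a host SFX knows about: hidden.
print(show_horizon_link(True, False, 1, True))   # False
# Same request, but the link is on a host SFX does not know about: shown.
print(show_horizon_link(True, False, 1, False))  # True
```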
debugging in a complex system of inter-related components December 5, 2009
Posted by jrochkind in General.2 comments
…which are under various parties control.
Eric Hellman recounts the process of tracking down an odd hard to reproduce bug in a link resolver process, that he describes as a “funny joke”, and indeed the particular nature of the bug is kind of funny in a hard to describe geeky kind of way.
But this got me thinking again about something I’ve been thinking about for a while, the added difficulty of tracking down bugs in the complex inter-related ‘chains’ of software that more and more are responsible for providing our library services — chains of software from various sources, hosted in various places, open source and licensed and free-with-no-contractual-support whatsoever.
What may not be hilarious is that if Eric hadn’t known the top tech guy at the company, he might have sent it to their support, who would, 99 times out of 100 (Eric describes a number of other 99-out-of-100 odds that led to this bug, which affected probably around a million interactions, not having been found or solved until now), have found the problem too mysterious and closed the issue as non-reproducible. Meaning it could have taken until around 100,000,000 problem cases for it to be solved, instead of a million. (Actually, I think that math is wrong, and it would take even longer for the problem to be solved, but I forget my probability arithmetic, and all these numbers are just made up anyway. Suffice it to say it would have made it even less likely for the problem ever to actually be discovered by those with the power to solve it. Added 14 Dec 2009.)
As a local systems librarian type, it can be VERY frustrating trouble shooting problems that involve so many different pieces in a chain like this. With a link resolver, that I’m responsible for, we’ve got:
- (usually) the source of the openurl link.
- The link resolver itself. (In my case that consists of 2a) the Umlaut open-source front-end, and 2b) the proprietary licensed “knowledge base” product behind Umlaut.)
- Possibly CrossRef and/or Pubmed (which I often forget to consider; this example shows why not to forget that.)
- The target destination, the platform hosting the content you are directing to.
In some cases there can be another couple pieces too. #1, the source of the link, can be our own federated search product. Which will then subdivide into:
- the source database being searched by the federated search product.
- The Xerxes open source federated search front-end we use.
- The proprietary licensed federated search engine backing up Xerxes.
So we’ve got up to 8 separate components involved that could be the source of the problem. Some of these components are entirely accessible to my debugging (the open source self-hosted ones), some are partially accessible to my debugging (the proprietary licensed components that are still self-hosted), others are only barely accessible to my debugging (the free or licensed components in ‘the cloud’, where I can use Live HTTP Headers to examine HTTP transactions, and that’s about it).
So it can seriously take me many hours to get to the bottom of a reported problem — or as far to the bottom as I can get, sometimes you just end up against the walls of a licensed/cloud “black box”, sometimes in a way that lets me be pretty sure _which_ “black box” component is at fault (maybe or maybe not with any detailed insight on exactly _how_), but in other times not being able to decisively narrow it down because of lack of ability to peer inside the black box. And that many hours is only because I’ve become pretty darn familiar with these systems; if it were the me of three years ago instead of today, that many hours would possibly be many days instead.
If the component at fault is an open source component, then figuring out the problem was the hard part, and the subsequent fix, 9 out of 10 times, is easy.
But if the component at fault is one of the licensed products (locally hosted or in the cloud), then those many hours of debugging were the EASY part. The hard part is getting the vendors of the problematic component to pay attention: figure out the proper support channel (easy when it’s the link resolver or federated search engine; harder when it’s one of the many dozens of licensed platforms we have, any one of which we only very occasionally have need of contacting support on); convince them there really IS a problem (even though it’s hard to reproduce), and it really IS their software’s fault (many times they’ll try to blame it on another component in the chain). Being successful at this can take many more (even less fun) hours, no exaggeration. (And it often involves being pushed up a support chain one or two times, at which point you may have to start from scratch and re-present the evidence you had already presented at a lower level of support, to re-convince a new person of what’s up.)
And once you’ve succeeded there, what happens is you’re told “This problem has been forwarded to our developers.” And you get to hope that they’ll get around to fixing it some time before you retire, and maybe even courteously let you know when they do.
Hmm, this joke isn’t actually so funny to me anymore.
And this gives one an understanding of why 99 times out of 100 the Local Expert or Responsible Party finds an excuse to dismiss the problem instead of getting to the bottom of it. ’Cause even once you’ve gotten to the bottom of it (and I actually agree with Eric that this kind of debugging is kind of fun), the ACTUAL hard work has barely begun (and I don’t find interminable exchanges with support people where you try to convince them there really is a problem, and it really is likely in their product, to be much fun at all). And all this is for a bug where you’re actually not sure exactly how many users it’s affecting, and which is just one of probably many such “heisenbugs” your software probably exhibits. Geez, is it really worth it? Except the whole problem is that these many hard to reproduce bugs add up, meaning any given user has (maybe? All of these 1 in 100 things are just guesses, not based on any actual evidence) a good chance of running into a few of them in any given, say (again making this up) month of use. Phew.
Now, the good news is that lately I’ve noticed some of our vendors getting better at acknowledging, understanding, and even fixing bugs when I report them. I don’t know if it’s entirely fair to mention them by name, because on the one hand the fact that I have developed such good relationships with their support/developers means that their products somewhat regularly have problems I find. But, hey, nothing is bug free, and the positive is that they are unusually easy to work with and generally (eventually) fix the problem. So I’ll mention both Jeff Lang at Thomson Reuters and Mason Golden at Gale as being absolutely a pleasure to work with. And Chuck Koscher at CrossRef is also pretty great. This list isn’t meant to be exclusive, but I feel like I complain so much, it’s worth pointing out by name some people at vendors who have been really responsive, easy to communicate with, and actually take responsibility for trying to get a fix in on their end and letting us know when it’s been made. (There may very well be such people at nearly every vendor we license something from, but finding and developing a relationship with the right person at each of the dozens of vendors we deal with is a chore of its own; they aren’t necessarily who you first get connected to when you file a support ticket.)
improved google books search-within-book interface? December 2, 2009
Posted by jrochkind in General.1 comment so far
Some time since the last time I looked at it, Google Books improved their search results interface, for books which are not viewable in full.
Or at least they did for some books, at least as viewable by me. I know sometimes Google rolls out a test feature to just some people; and I’ve learned that sometimes it’s something I thought was new, but was just applied in very specific circumstances.
But anyway, what I used to see from GBS search results was simply a textual results list, showing hits for query in context just in ‘ascii’ excerpts — and just as one little box on book metadata page.
What I now see, at least in one example, is actual scanned pages, with band markers on the vertical scrollbar indicating at what points the matches were found; clicking on them shows highlighted matches on actual scanned images. Very nice! Although I wonder if it will confuse some users.
Now, Umlaut has been providing its own search box, directing to results on an actual GBS page upon submission. I did this just by reverse engineering the URLs that GBS used for searches, and combining with the book URLs returned by the GBS Data API.
Quite nicely, although they changed their interface, and I think the nature of the URLs returned by the Data API has also changed since I wrote the code, the simple procedure the code uses to create a direct link to search-within-a-book results still works quite well with no changes, showing you the new interface, upon following a direct link like this one.
If you pay attention you’ll see there’s an anchor (aka “fragment identifier”) on the URL Umlaut directs you to, that no longer is necessary or useful, but in the old version of the page targeted the result list section.
Although actually you might not see that; it looks like there’s some weird javascript going on that changes the fragment identifier upon page load, so the URL you see in your browser bar may not be exactly the one I sent you to, which is:
Am I missing something, or is this disturbing? November 30, 2009
Posted by jrochkind in General.2 comments
(oops, how did this end up a ‘page’? I wrote this a couple weeks ago but posted it wrong).
Google publishes Stanford dissertations online
“Stanford doctoral students will now be able to post their dissertations on Google as the university replaces the traditional bound volumes of acid-free paper with e-files of scholarly work.” “Until now, Stanford has used ProQuest” “The problem was solved by allowing the graduate students to embargo their work for up to five years, to give them time to get it published. They also will be allowed to decide whether to release either 20 or 100 percent of their dissertation to Google.” via Siva on twitter
Tell me if I have this right.
Traditional: Every Stanford dissertation is available in print at the Stanford library. You can go there and view it for free. You quite likely can get it (for free or nominal cost) in full or in excerpted photocopy via ILL. You can do these things right after publication, there’s no embargo (right? Or are traditional print dissertations sometimes embargoed?). If you want it electronically, then you (or your institution) have to pay ProQuest, and there may be an embargo. But the paper copy exists and can probably be accessed by you one way or another right away.
New: There is no print copy. You can get it electronically for free from Google, only if the author’s optional embargo has expired, and only 100% if the author allows it, 20% otherwise. For some dissertations (how many this ends up being is significant, but hard to predict), there might be NO access to the full dissertation EVER. 20% after five years is all you get, and the Stanford library doesn’t even have a copy of the whole thing.
Am I missing something, or is this disturbing?