problems importing diacritics into RefWorks

So I am in an ongoing war with RefWorks (the software, and a little bit the people) to get non-ASCII chars (in this case, simply the Latin alphabet with diacritics) imported successfully into RefWorks.

I am using the RefWorks import filters, and using the RefWorks “marc” filter, which doesn’t actually take marc, but takes a weird unique-to-refworks marc-in-plain-text format.  But which seems to work — until you get to diacritics.  I have this set up to export from my catalog to RefWorks.

UPDATE: Solution/answer at bottom of post.

I am curious:

  1. Has anyone else had a problem with diacritics in export to RW?
  2. Has anyone found a solution, or discovered anything more about the problem than RefWorks Support is able to tell me (which is pretty much nothing)?

Our story so far

So for at least a year my users have been complaining about this. And sometimes I could reproduce their problem and sometimes I couldn’t. And when I could, sometimes I’d report it to RefWorks.

And RefWorks would tell me “Your data is not in UTF-8, we only support UTF-8”.

I was suspicious of this — I thought my data was in UTF-8. But you know how confusing debugging char encodings is, I didn’t really have time for it, so I let it be.

The chase is on

But my users were getting more and more restless about this; it is a serious problem for them. And recently they were able to provide me with two clear, reproducible test cases: one in which diacritics are messed up by RefWorks upon import (they become detached and free-floating, instead of being above the chars they should be above); and another in which RefWorks refused to do the import at all, producing an error message instead.

Example one: RefWorks imports diacritics improperly

My marc-in-plaintext file which I believe to be in UTF-8:

https://catalog.library.jhu.edu/mods/?format=marc&bib=2663421

The RefWorks import URL referencing this URL:

http://www.refworks.com/express/ExpressImport.asp?vendor=Johns%20Hopkins%20University&filter=MARC%20Format&encoding=60051&url=https://catalog.library.jhu.edu/mods/?format=marc%26bib=2663421

Example two: RefWorks produces error

My marc-in-plaintext file which I believe to be in UTF-8:

https://catalog.library.jhu.edu/mods/?format=marc&bib=1144347

The RefWorks import URL referencing this URL:

http://www.refworks.com/express/ExpressImport.asp?vendor=Johns%20Hopkins%20University&filter=MARC%20Format&encoding=60051&url=https://catalog.library.jhu.edu/mods/?format=marc%26bib=1144347
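Both import URLs above follow the same pattern: the record URL is passed as the value of the url parameter, with its own "&" escaped as "%26" so it isn't read as a separator in the outer RefWorks URL. Here is a minimal Python sketch of that construction. Python and the function name refworks_import_url are my choices for illustration; the parameter names and values are just copied from the URLs above, not taken from any RefWorks documentation.

```python
from urllib.parse import quote

REFWORKS_IMPORT = "http://www.refworks.com/express/ExpressImport.asp"


def refworks_import_url(record_url, vendor="Johns Hopkins University",
                        import_filter="MARC Format", encoding="60051"):
    """Build an import URL shaped like the two examples above (hypothetical helper)."""
    # Leave ":/?=" readable, but escape "&" so the record URL's own
    # query string stays inside the single "url" parameter.
    escaped = quote(record_url, safe=":/?=")
    return (f"{REFWORKS_IMPORT}?vendor={quote(vendor)}"
            f"&filter={quote(import_filter)}"
            f"&encoding={encoding}&url={escaped}")


print(refworks_import_url(
    "https://catalog.library.jhu.edu/mods/?format=marc&bib=2663421"))
```

Run as-is, this prints the same URL given for example one.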

Investigation

So I spent some time (with the invaluable help of dbs, gmcharlt and others on the #code4lib IRC) investigating these source files byte by byte, to make sure they were valid UTF-8 representing what they should represent.

And to me they seem to be: as far as I can tell, these are perfectly valid UTF-8 files representing what they should represent.

Now, they do use unicode combining diacritics — they are in ‘decomposed’ form. More on that later. But that’s perfectly legal UTF-8.
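To make "decomposed" concrete, here is a small Python sketch using the standard unicodedata module. The sample character is just an illustration, not taken from my records. It shows that the same visible letter can be one precomposed code point or a base letter plus a combining mark, that both encode to valid but different UTF-8 bytes, and that normalization converts between them.

```python
import unicodedata


def hex_bytes(text):
    """UTF-8 bytes of text as space-separated lowercase hex."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


composed = "\u00e9"     # e-acute as a single precomposed code point
decomposed = "e\u0301"  # plain "e" followed by COMBINING ACUTE ACCENT

print(hex_bytes(composed))     # c3 a9
print(hex_bytes(decomposed))   # 65 cc 81
print(composed == decomposed)  # False: different code point sequences

# Normalization maps one form onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```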

If anyone reading this can look at these source files and find any problem with them, let me know.

Those files are also returned by my server with proper HTTP headers indicating they are UTF-8 (this  is right, right?):

Content-Type: text/plain; charset=UTF-8
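If you want to check what charset a header like that declares without eyeballing it, Python's standard email.message module can parse it. This is only a sketch of the parsing step (declared_charset is a made-up helper name); it says nothing about whether the body bytes actually match the declared charset.

```python
from email.message import Message


def declared_charset(content_type):
    """Return the charset parameter of a Content-Type value, lowercased, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


print(declared_charset("text/plain; charset=UTF-8"))  # utf-8
print(declared_charset("text/plain"))                 # None
```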

Additionally, I checked in our demo blacklight instance, which has completely different logic implemented by different code for going from our Marc8-encoded Marc to UTF-8 encoded refworks marc-as-plain-text.  And RefWorks has the identical problem with the export from our demo blacklight.

Sally forth

So I prepared another email to RefWorks support, this time insisting that my files were UTF-8. My email included hexadecimal representations of bytes, and the unicode code points those bytes mapped to in UTF-8. My email was fairly concise, but I wanted to provide them with technical details they couldn’t simply dismiss with “your data is not UTF-8”.
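For anyone who needs to put together the same kind of evidence, here is a rough Python sketch of a byte-to-code-point listing. The describe function and the sample bytes are hypothetical (not my actual records). The decode is strict, so it raises an error rather than guessing if the input is not valid UTF-8.

```python
import unicodedata


def describe(data):
    """Map UTF-8 bytes to (hex bytes, code point, Unicode name) rows."""
    text = data.decode("utf-8")  # strict: raises UnicodeDecodeError on bad input
    rows = []
    for ch in text:
        hex_part = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
        name = unicodedata.name(ch, "<unnamed>")
        rows.append((hex_part, f"U+{ord(ch):04X}", name))
    return rows


# Illustrative sample: "Godel" with a combining diaeresis after the "o".
for row in describe(b"Go\xcc\x88del"):
    print(*row, sep="  ")
```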

And at first it worked: RefWorks support “escalated” my issue, and eventually gave me an answer. But it became clear they hadn’t tested or tried that answer out themselves at all; they were just pulling answers out of a hat.

They told me that instead of using the RefWorks “Marc Format” import filter, I should use the RefWorks “Marc Format (UTF-8)” import filter. Which at first made a certain amount of sense, except for the fact that the RefWorks import URL already included “&encoding=60051”, which is documented to mean UTF-8 in the first place. And the fact that for a year they’d been telling me the “Marc Format” filter already (and only) supported UTF-8.

But still, of course, I tried it.  It did not help. The record that produced a RefWorks error message still produced an error message. The record with messed up diacritics still had messed up diacritics, but now also had the wrong information in the RefWorks fields. (I suspect the “Marc Format (UTF-8)”  filter assumes some European Marc format, rather than Marc21 — something I asked them before trying it, but they insisted it used the same marc-field-to-refworks-field mapping as “Marc Format”, which turned out not to be true.)

So I reported back to RefWorks that this didn’t work.

Guess what their response was? They went back to telling me my data was not UTF-8. Now they are suggesting my data is really in ISO 8859-1, and I need to convert it to UTF-8 if I want it to work with their software.

I’d be happy to convert it to UTF-8, except as far as I can tell, it already is! It is not ISO 8859-1. If there is a problem with its UTF-8 encoding that I have not figured out (which is quite possible, and if you see one please let me know), they need to actually tell me what it is, not just keep insisting my data is not in UTF-8 and needs to be. I have spent quite a bit of time trying to confirm that my data really is UTF-8, and believe I have done so. As far as I can tell, they have spent little time doing anything but suggesting solutions they didn’t even try themselves first and just pulled out of a hat, and repeating their default “your data is not UTF-8” claim.
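One reason I'm fairly confident about this: text with accented Latin letters encoded as ISO 8859-1 almost never happens to be valid UTF-8, so a strict UTF-8 decode is a decent test. Here's a minimal Python sketch of that check. The sample strings are made up, and pure-ASCII data passes either way, so the test only tells you something when non-ASCII bytes are present.

```python
def looks_like_utf8(data):
    """True if data decodes as strict UTF-8 (ASCII-only input also passes)."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


utf8_bytes = "Jose\u0301".encode("utf-8")         # decomposed, UTF-8
latin1_bytes = "Jos\u00e9".encode("iso-8859-1")   # precomposed, ISO 8859-1

print(looks_like_utf8(utf8_bytes))    # True
print(looks_like_utf8(latin1_bytes))  # False: a lone 0xE9 is not valid UTF-8
```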

My suspicion

Now, here’s my suspicion. I believe my data is valid UTF-8.  But it uses combining diacritics, it’s in “decomposed” form.  I have a hunch that the RefWorks software can’t handle this, it requires composed normalized form. The particular way the diacritics are messed up kind of suggests this (the diacritics on import become free-standing punctuation AFTER the char they are supposed to be over).
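A quick way to check whether a given record would even be affected by this hunch is to look for combining marks and see whether the text is already in composed (NFC) form. A small Python sketch, with hypothetical helper names and made-up sample strings:

```python
import unicodedata


def combining_marks(text):
    """List (index, code point) for every combining character in text."""
    return [(i, f"U+{ord(ch):04X}")
            for i, ch in enumerate(text)
            if unicodedata.combining(ch)]


def is_nfc(text):
    """True if NFC normalization would leave text unchanged."""
    return unicodedata.normalize("NFC", text) == text


print(combining_marks("Jose\u0301"))  # [(4, 'U+0301')]
print(is_nfc("Jose\u0301"))           # False: decomposed
print(is_nfc("Jos\u00e9"))            # True: already composed
```

Note the mark sits at the index right after its base letter, which matches the way the broken imports show the diacritic after the char it belongs over.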

Even if this hunch is true, I have no idea if that would also fix the problem with the record RefWorks simply produces an error message for.

The thing is, for local weird reasons, it’s harder to change my software to do this than it ought to be. (It’s open source software in Java written by someone at another institution that I inherited when I started my job here; I don’t believe I have a copy of the source, just the .jar.  The source is probably floating around out there, cause others have used it, but doesn’t appear to be on the public web currently).

So I really don’t want to embark on that non-trivial task until I get confirmation from RefWorks of what their specifications are, so I can meet them. That’s all I ask. But it’s pretty clear RefWorks does not know the specifications/requirements of their software. Okay, so, I believe, they now have to figure them out. It’s what we pay them for, right?

My frustration

Character encoding issues are really complicated to deal with.  Char encoding debugging is definitely the most challenging, frustrating, brain-twisting sort of debugging I ever have to do.  But that’s how it goes, it still has to be done sometimes.

So I don’t blame RefWorks for finding them confusing too. My frustration is that RefWorks doesn’t seem to agree it’s their responsibility to figure them out. If they’re so confusing, then they need to give us customers clear specifications/requirements, so we can work on meeting them — instead of leaving each customer individually to “reinvent the wheel” of trying to reverse engineer RefWorks to figure out their specifications without having access to the source.   This is what we pay RefWorks for, providing support, right?

Of course, I guess they think they’ve done this, and their specs are “UTF-8”. The problem is, I have data I’ve spent significant time analyzing to be sure it’s UTF-8, and I am as sure as I can be, and they just keep insisting it’s not. The ball is in their court.

Next steps

So in addition to waiting to see what RW says next (I am not optimistic), I  might try individually translating those two files to UTF-8 normalized composed form, and seeing if it fixes the issues with one or both of them. And if it does, I guess I have to attack the non-trivial task of recompiling my software to do this normalization.  But it would be frustrating because I still won’t know if my software meets their software’s requirements, because they can’t tell me what those are, there might be other problems waiting to arise too.
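Doing that translation on a single file is only a few lines in most languages. Here is a sketch in Python (not the Java my export software is written in; the function names are invented for illustration), assuming the file really is UTF-8 plain text to begin with:

```python
import unicodedata
from pathlib import Path


def to_nfc_utf8(raw):
    """Decode UTF-8 bytes strictly, compose to NFC, and re-encode as UTF-8."""
    text = raw.decode("utf-8")
    return unicodedata.normalize("NFC", text).encode("utf-8")


def convert_file(src, dst):
    """Write an NFC-normalized copy of the UTF-8 file src to dst."""
    Path(dst).write_bytes(to_nfc_utf8(Path(src).read_bytes()))
```

Calling convert_file with an input and output path writes the composed copy; swapping "NFC" for "NFKC" gives the compatibility-composed form instead.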

Solution!!

Updated noon EST.  I just sent this email to RefWorks support:

####

Okay, I think I’ve actually figured this out.

My data was indeed legal and valid UTF-8. However, there are a variety of forms UTF-8 may come in. (See http://unicode.org/reports/tr15/).

It looks like RefWorks can only handle UTF-8 in “KC” normalized form.   When I manually translated my two test files to UTF-8 KC normalized form, RefWorks handles them properly:

http://www.refworks.com/express/ExpressImport.asp?vendor=Johns%20Hopkins%20University&filter=MARC%20Format&encoding=60051&url=http://testjr.mse.jhu.edu/refworks-error-kc.txt

http://www.refworks.com/express/ExpressImport.asp?vendor=Johns%20Hopkins%20University&filter=MARC%20Format&encoding=60051&url=http://testjr.mse.jhu.edu/refworks-improper-kc.txt

Those were translated manually; fixing my software to do this automatically for all RefWorks exports will be more work. But at least now I know what I need to do.

I strongly encourage you to actually document this RefWorks requirement, and let other people know about it when they report oddities in RefWorks UTF-8 imports.   I have spent quite a few hours confusingly figuring this out since I first reported the issue over a year ago — would be nice to save others the time and just tell them.

####

(After reading up more on unicode normalization, I suspect “C” normalization might make RefWorks happy too, and be less invasive/lossy than KC normalization. I’ll try to test that soon too.)

Final(?) update: I have confirmed that just “C” normalization keeps RefWorks happy, at least for my two test records, so there is no need for possibly lossy “KC” normalization. Of course, it may be that for my particular test records at present, KC and C are identical. But I’ve spent enough time on this for now. “C” seems a better bet, unless we have evidence or specs from RefWorks (ha!) to the contrary.
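To illustrate why "KC" can be lossy where "C" is not, here is a small Python comparison. The sample characters are my own picks for illustration, not from the test records. For a plain letter plus combining diacritic the two forms give the same result, which would explain identical behavior on my test files, but compatibility characters such as ligatures and superscripts get folded away by KC only.

```python
import unicodedata


def code_points(text):
    """Code points of text as a list of U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in text]


samples = {
    "decomposed e-acute": "e\u0301",
    "fi ligature": "\ufb01",
    "superscript two": "\u00b2",
}

for label, s in samples.items():
    nfc = unicodedata.normalize("NFC", s)
    nfkc = unicodedata.normalize("NFKC", s)
    print(f"{label}: NFC={code_points(nfc)} NFKC={code_points(nfkc)}")
```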

It would make a lot more sense if RefWorks would accept any UTF-8, but do “C” or “KC” normalization itself on the receiving end if it needs it, but I do not expect sense.


3 Responses to problems importing diacritics into RefWorks

  1. I am impressed by your diligence & technical acumen….encouraged to see I am not the only one with a Refworks access issue – when we use Refworks from off-campus (going thru EZProxy) and ONLY with Proquest databases, we get

    “Import aborted, HTTP request failed – The certificate authority is invalid or incorrect”
    it is maddening but likely not nearly as complicated as diacritics so I can take inspiration from your success and see if I can’t come up with a similar outcome!

  2. Brian: I’ve got some ideas about what your problem could be — you probably need to get a trusted ssl certificate for your EZProxy server. It’s a clue that it only happens when you’re going through EZproxy. But if Refworks support couldn’t figure that out with you, it’s kind of infuriating. It could also be a more complicated bug in Refworks.

    I’m not really an EZProxy expert, I’m not responsible for EZProxy admin here, but I think there’s an EZProxy user’s group, and I bet some of the people there (or possibly EZProxy support) could give you some ideas too.

    EZProxy makes everything SO much harder to debug, it ends up causing some weird situations.

  3. Pingback: More refworks diacritics « Bibliographic Wilderness
