unicode and LCSH

jrochkind General September 6, 2011

So there’s an LCSH heading “C# (Computer program language)”

In some of my MARC records, it is encoded just like that, with an ASCII number sign, byte 0x23.

In others of my MARC records, it is instead encoded as “U+266F (music sharp sign)” ♯

Whether the difference between ♯ and # is visible in this blog post may depend on your font.

But it makes a difference to computers, and it makes the two sets of records using the two variants collocate separately in my interface, as they would in most interfaces (unless they apply some normalization).
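To make the difference concrete, here is a minimal Python sketch, assuming the headings have already been decoded from the MARC record into Unicode strings. The variable names are mine, made up for illustration. It prints the code point, Unicode name, and UTF-8 bytes of each character, then checks whether the two headings compare equal, both raw and after NFKC normalization.

```python
import unicodedata

number_sign = "#"        # U+0023, what most people type
music_sharp = "\u266f"   # U+266F, the musical sharp sign

# Show code point, Unicode name, and UTF-8 bytes for each character.
for ch in (number_sign, music_sharp):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), ch.encode("utf-8").hex())

heading_a = "C" + number_sign + " (Computer program language)"
heading_b = "C" + music_sharp + " (Computer program language)"

# The two headings are different strings, so an index treats them as distinct.
print(heading_a == heading_b)

# Standard Unicode normalization (here NFKC) does not fold one into the other.
print(unicodedata.normalize("NFKC", heading_b) == heading_a)
```

Both comparisons come out false, so collocating the two forms would take an explicit, local mapping rather than a standard normalization form.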

Now… I wonder how the music sharp sign ♯ even gets into those MARC records, since it’s harder to enter. But could it in fact be the “correct” one? How would one figure out which of these variants is the appropriate LCSH-authorized one? Looking at authority records? Anyone want to answer the question? (I have no idea which one is “official” for the C# language, but certainly people ordinarily type it with an ordinary ascii number sign, as it’s more difficult to produce the musical sharp symbol on most keyboards).

Looks like the majority of records in my catalog using the musical sharp symbol ♯ come from Safari vendor-supplied records; I’m guessing the few that don’t had subject headings copy-and-pasted from Safari records. Should I try reporting this as a problem to Safari, you think? First I’d like to have some way of being sure that the musical sharp sign really is “wrong” — what do you think?

Incidentally, Wikipedia rather confusingly says:

Due to technical limitations of display (standard fonts, browsers, etc.) and the fact that the sharp symbol (U+266F ♯ music sharp sign (HTML: &#9839;)) is not present on the standard keyboard, the number sign (U+0023 # number sign (HTML: &#35;)) was chosen to represent the sharp symbol in the written name of the programming language.^[9] This convention is reflected in the ECMA-334 C# Language Specification.^[7] However, when it is practical to do so (for example, in advertising or in box art^[10]), Microsoft uses the intended musical symbol.

Looks like both methods are kind of “official” for the language — but the point of controlled vocabulary is to pick one, right? Those controlling vocabularies like LCSH have to remember to be clear about standardizing character choices like this in addition to word choices — we don’t live in an ASCII world anymore. So a more general question is what mechanisms LCSH has to be clear about its character choices in cases like this. Authority file?


8 thoughts on “unicode and LCSH”

Mark says:

September 6, 2011 at 5:24 pm

I think the authority record does answer the question, although it is still a mess. (Excerpt follows)
150 __ |a C# (Computer program language)
450 __ |a C-Sharp (Computer program language)
450 __ |w nne |a C♯ (Computer program language)
670 __ |a ISO/IEC 23270, 2nd. ed.: |b p. 11 (the name C# is pronounced “C Sharp”; the name C# is written as the Latin capital letter C (U+0043) followed by the number sign # (U+0023))

So the LC authorized form is with a pound sign (see the 670 note) and it has references from both C-Sharp and C with a sharp symbol.

Simon Spero says:

September 6, 2011 at 5:35 pm

It’s C# (U+0023). The authority record in fact has a UF for the C♯ form.

Authority files to the rescue!
See the MarcXML record at: http://id.loc.gov/authorities/subjects/sh2001001705.marcxml.xml

Simon Spero says:

September 6, 2011 at 5:39 pm

As for the character choice: the character to be used is the one given. The only encoding mismatches that may (will) occur are between composed and non-composed variants. Since Sharp and # are different characters, this is not an issue here, but it can be with é et al.

Simon

jrochkind says:

September 6, 2011 at 6:01 pm

Thanks Mark and Simon.

My indexing routines already take care of composed/de-composed unicode variants, normalizing at both query and index time.
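As a rough illustration of that kind of normalization (not my actual indexing code; `index_key` is a hypothetical helper name), this Python sketch applies NFC to both forms of "café" and to the two C# variants. The same function would be applied to indexed text and to incoming queries.

```python
import unicodedata

def index_key(text: str) -> str:
    """Hypothetical helper: normalize to NFC so composed and decomposed forms match."""
    return unicodedata.normalize("NFC", text)

composed = "caf\u00e9"       # é as a single code point
decomposed = "cafe\u0301"    # e followed by a combining acute accent

# Raw strings differ even though they display identically.
print(composed == decomposed)

# After NFC normalization the keys match.
print(index_key(composed) == index_key(decomposed))

# The sharp sign and the number sign are unrelated characters, so they still differ.
print(index_key("C\u266f") == index_key("C#"))
```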

Sounds like I should try telling Safari they’re using the wrong form in their subject headings.

Although transcribed titles also use both forms, and there I guess it would be considered appropriate to transcribe whatever orthography you think was used on the title page? Hmm, that’s kind of a mess, as catalogers certainly aren’t going to do consistent things there; it’s not necessarily obvious which orthography was intended on a title page. Hmm.

I could normalize all actual sharp-sign to number-sign — except I’m not sure if this will mess up musical key signature searches, where they maybe want to actually search on the musical symbol and only the musical symbol. Or they might instead also get advantage from normalization. This stuff gets messy. (Musical uniform titles seem to use the actual musical unicode symbols consistently, presumably that is the ‘correct’ thing to do there.)
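One way to hedge that bet would be to fold the sharp sign only in fields where it is wanted. The sketch below is hypothetical: the field-type labels and the `fold_sharp` helper are made up for illustration, and which fields belong in the fold set is exactly the open question.

```python
# Folding table: map the music sharp sign (U+266F) to the ASCII number sign (U+0023).
SHARP_TO_NUMBER = str.maketrans({"\u266f": "#"})

# Assumption: the indexer labels each field with a type; these names are invented.
FOLD_FIELDS = {"subject", "title"}

def fold_sharp(text: str, field_type: str) -> str:
    """Fold U+266F to U+0023 only for field types listed in FOLD_FIELDS."""
    if field_type in FOLD_FIELDS:
        return text.translate(SHARP_TO_NUMBER)
    return text

# Subject headings get folded, so both variants collocate.
print(fold_sharp("C\u266f (Computer program language)", "subject"))

# An illustrative musical uniform title keeps the real sharp sign.
print(fold_sharp("Sonata in C\u266f minor", "uniform_title"))
```

The same folding would have to be applied to queries against the folded fields, otherwise a search typed with ♯ would miss the folded index entries.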

Alan Cockerill (@cockerilla) says:

September 6, 2011 at 6:23 pm

From a purely academic perspective – I wasn’t aware computing students used libraries ;-)

jrochkind says:

September 6, 2011 at 6:44 pm

Alan, all I can say is that we got specific requests from real patrons (I think they were professors rather than students, but not sure) saying “C# and C++” are unfindable.

In our case, we have a fairly significant collection of software engineering related ebooks (from Safari and a couple other places), so perhaps there is demand for those, but I haven’t tried to get to the bottom of exactly what they were looking for or why.

Jonathan Rochkind says:

September 8, 2011 at 9:20 am

Interestingly, local cataloger Ray reports: “It seems that C [Sharp] was a formerly legitimate heading that now should be C#”, and has merged ’em in our local file.

I’d like to get better at searching and understanding authority records myself. What systems do you guys use for doing so? I am not sure id.loc.gov would let me discover old deprecated headings like that? Would I need access to a paywalled database somewhere?

Mark says:

September 9, 2011 at 1:11 pm

Since I am no longer affiliated with any big institutions I only have access to the authorities via http://authorities.loc.gov/ (which is, I believe, still run from their Voyager catalog). You can view a record, once you’ve drilled down to it, in either a MARC display or a “labelled” display.

That (the MARC display) in conjunction with MARC 21 Format for Authority Data pages http://www.loc.gov/marc/authority/ can explain what the various MARC fields, tags, and subfields are stating. By tracing out the |w nne in the 450 in the snippet I pasted in above we see that:

“Heading in the 4XX field is a form of the heading in the 1XX field that was formerly established in the relevant national authority file under a situation other than that specified by code a. For example, code e is used when the heading in the 4XX field is a previously-authorized heading from the national authority file now superseded by a later form of heading in the 1XX field. It is also used when the tracing is a pre-AACR2 form of a name, name-title, or uniform title heading that had been established earlier in the national authority file but was not the established heading at the time of the changeover to the AACR2 rules.”

So yes, a previous authorized heading.

I have no idea how you’d automate any of this learning/checking for yourself but when you want/need to know the answer to something involving an authority this is a way to (possibly) get the answer. As you might imagine, MARC Authority Format is no less complex than MARC Bib Format. You’ll often have to trace out leader codes and assorted subfield codes. Even then, you are sometimes left in a haze. Then again, I’m pretty sure you know that when it comes to MARC.
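For what it is worth, the specific check traced out above can be roughed out in a few lines of Python. This is only a sketch under loud assumptions: it parses the plain-text MARC display pasted earlier in this thread (not a real binary or MARCXML record), `parse_line` is a made-up helper, and it would break on subfield values that contain a `|`. It treats a 450 as an earlier established heading when the third character of its |w is `e`, as in the `nne` example.

```python
# Lines copied from the authority record excerpt quoted earlier in the thread.
EXCERPT = """\
150 __ |a C# (Computer program language)
450 __ |a C-Sharp (Computer program language)
450 __ |w nne |a C\u266f (Computer program language)
"""

def parse_line(line):
    """Hypothetical helper: split 'TAG IND |x value |y value' into (tag, {code: value})."""
    tag, _indicators, rest = line.split(" ", 2)
    subfields = {}
    for chunk in rest.split("|")[1:]:
        # First character is the subfield code; the remainder is its value.
        subfields[chunk[0]] = chunk[1:].strip()
    return tag, subfields

authorized = None
earlier_forms = []
for line in EXCERPT.splitlines():
    tag, sub = parse_line(line)
    if tag == "150":
        authorized = sub["a"]
    elif tag == "450":
        control = sub.get("w", "")
        # Third character of |w coded "e": a previously authorized form of the heading.
        if len(control) > 2 and control[2] == "e":
            earlier_forms.append(sub["a"])

print(authorized)
print(earlier_forms)
```

For real records a proper MARC parsing library would be the safer route; this only shows that the |w check itself is mechanical once the fields are in hand.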
