www.loc.gov gone, back, but missing from Google

Or, don’t misuse your HTTP status codes

So, the Library of Congress website was down for the government shutdown, with all URLs resulting in an outage message. (As I write this, the outage message appears to be still available, although loc.gov pages aren’t redirecting there anymore. Thanks jeff@#code4lib for pointing out the URL).

This actually got a lot of attention — nearly every article I saw about what would be closing for the government shutdown mentioned LC, often highlighting it. It continues to strike me how much affection people have for libraries. (I’m worried it’s a kind of nostalgic affection these days, but anyway, that’s a different topic).

This mostly affected me in that I use the LC MARC documentation fairly regularly, usually looking it up by typing, e.g., “marc 650” or “marc leader” into Google. Other library and museum workers rely on other information on loc.gov web sites.

(Pretty quickly after the shutdown, these hits stopped showing up in Google. Google evidently realized the pages weren’t available, rather than giving you a hit that would take you to the outage message instead of what you wanted.)

Fortunately, Dan Scott generously set up some mirrors, so I still had access to the MARC docs at his coffeecode.net.

I had to manually navigate there and browse at first, because Dan’s coffeecode.net wasn’t showing up in google either, but after a day or two, coffeecode.net was in google results for “marc 245” or what have you.

Then it came back… but not in Google?

Apparently shortly after that, around Oct 3 or 4, loc.gov web sites mostly came back, as Dan Scott noticed quickly. (I’m kind of curious about the politics and/or bureaucratic processes that led to this change of practice in the middle of the shutdown, but anyways.)

I never noticed until yesterday. Why? Because http://www.loc.gov pages still aren’t showing up in google, over a week after they came back online. I kept doing my searches, LC kept not showing up, I assumed it was still down.

Other *.loc.gov stuff is in Google, like LC blogs and the Chronicling America project. But it looks like http://www.loc.gov is not: site:www.loc.gov returns 0 hits for me. And my MARC searches also return no LC pages on Google: no hits from loc.gov for “marc 650”, which used to always be at the top of the page.

Bing has still got www.loc.gov, and still gives me LC pages at the top of results for my MARC searches. 

But it’s gone from Google. I don’t normally use Bing, but I guess now I’ve got to in order to find MARC documentation, and other standards documentation hosted on loc.gov?

Or, of course, I could go directly to http://www.loc.gov, and try to browse or search from there. I haven’t had much success doing that: the search box front and center on http://www.loc.gov, much like most of our library catalog searches, searches documents considered part of the LC collection or what have you, but doesn’t seem to search the actual web pages on http://www.loc.gov.

This highlights how much Google (or your own search engine of choice) is the interface to the web. If you can’t find something on Google, even when doing a known item search for exactly what you want, for many intents and purposes it might as well be down. It’s pretty much the only way we navigate to things on the web, and almost the only usable way available to do so. Individual web sites’ browse navigation and built-in searches tend to be awful, whether because they’ve atrophied since everyone just uses Google anyway and there’s no need for web site maintainers to focus on them, or just because it’s a hard problem; probably a combination of both.

Why is it missing from Google?

It’s something of a mystery. It would be an amazing coincidence if it weren’t related to the government shutdown and temporary website outage in some way: the pages disappeared from Google during the website shutdown, and never came back.

But other *.loc.gov sites also disappeared and are now back. http://www.loc.gov has been available again for 10 days, but is still not in Google. In general, Google doesn’t ban you from their search results forever just for having an outage of a few days.

When all the government sites went down, some people on the #code4lib IRC channel noted that some of them were maybe using incorrect HTTP status codes. But today when I asked, nobody could remember or was sure exactly what http://www.loc.gov was doing during the outage, and we couldn’t think of any captured record of what sorts of HTTP responses it was returning in order to send people to the outage message for every request.

If one were intentionally trying to remove all traces of one’s website from Google permanently (through some manner other than a robots.txt), if one tried to come up with the best way to mess up one’s website in Google, one might have every single URL on the website return an “HTTP 301 Moved Permanently” redirect to a single page. You’d essentially be telling Google, “Oh yeah, our entire website has been permanently replaced by this one web page” [the outage message], and it would not be surprising if Google then removed (rather than temporarily disabled) your entire website from its results.

Is that what http://www.loc.gov was perhaps accidentally doing? Not sure.

There are various other things the website could have been doing during the outage that might have confused Google, or accidentally given Google bad instructions.

Use HTTP responsibly: What’s the right way to do it?

So it’s a mystery why http://www.loc.gov is missing from Google, and I hope it comes back soon. Maybe it’s actually just a bug of some kind on Google’s end… although that frankly seems less likely to me than the loc.gov admins having done something ill-advised. But who knows.

But regardless, in general, if your website is having a temporary outage, what’s the right way to do it? There are a bunch of different ways to do it that will produce essentially the same result for human viewers, but the underlying semantics of HTTP, which might be invisible to a human using a browser, matter to software consumers, like Google.

You might use an HTTP 503:

503 Service Unavailable

The server is currently unable to handle the request due to a temporary overloading or maintenance of the server. The implication is that this is a temporary condition which will be alleviated after some delay. If known, the length of the delay MAY be indicated in a Retry-After header. If no Retry-After is given, the client SHOULD handle the response as it would for a 500 response.

That sounds about like a planned outage. (And no, the length of outage is not known for the government shutdown).
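To make that concrete, here is a rough sketch of a stand-in server that answers every URL with a 503 and an outage message, using only Python's standard library. This is my own illustration, not anything loc.gov ran: the names OutageHandler, make_outage_server, OUTAGE_BODY and RETRY_AFTER_SECONDS, the message text, and the default port are all made up for the example. Retry-After is left unset by default, since the length of this kind of outage isn't known.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical outage message; a real site would serve its own page.
OUTAGE_BODY = b"<html><body><h1>Temporarily unavailable</h1></body></html>"

# Hypothetical setting: an integer number of seconds if the length of the
# outage is known, or None to omit the Retry-After header entirely.
RETRY_AFTER_SECONDS = None


class OutageHandler(BaseHTTPRequestHandler):
    """Answer every path with 503, so the original URLs stay 'owned' by the site."""

    def _send_outage(self, include_body):
        self.send_response(503, "Service Unavailable")
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(OUTAGE_BODY)))
        if RETRY_AFTER_SECONDS is not None:
            self.send_header("Retry-After", str(RETRY_AFTER_SECONDS))
        self.end_headers()
        if include_body:
            self.wfile.write(OUTAGE_BODY)

    def do_GET(self):
        self._send_outage(include_body=True)

    def do_HEAD(self):
        # Same status and headers as GET, but no body.
        self._send_outage(include_body=False)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet; a real server would log requests


def make_outage_server(host="127.0.0.1", port=8000):
    """Build (but do not start) the stand-in server."""
    return HTTPServer((host, port), OutageHandler)
```

Calling make_outage_server().serve_forever() would run it. The point is only that the human sees the outage message at whatever URL they asked for, while software sees a 503 rather than a redirect or a 200.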

Or maybe you’d use a “307 Temporary Redirect.” Or maybe a “303 See Other” — neither of those implies that a software agent should permanently replace the first URL with the second.

Or maybe the server simply stops responding at all; sometimes that’s the best your maintenance plan can accommodate anyway.

Or maybe, you do some research and see what different search engines recommend: I wasn’t able to easily find official advice from Google, but here’s a googler on a google blog suggesting 503 where possible for a temporary outage.

Different software agents may do different things — http://www.loc.gov is back on bing for whatever reason.

But, during any planned outage, or when designing your procedures for unplanned outages, it pays to think through exactly how it will appear to software agents, and make sure you aren’t giving incorrect information. It may not be obvious what your system is going to do; you may have to actually investigate it experimentally, as well as think through the implications.
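One way to do that kind of experiment is to ask for a URL the way a crawler would, without following redirects, and look at the raw status code and Location header. Here is a minimal sketch using Python's standard http.client; the function name probe and the User-Agent string are my own inventions for illustration, and it assumes a plain http or https URL with no username or password in it.

```python
import http.client
from urllib.parse import urlsplit


def probe(url, timeout=10):
    """Send one HEAD request and return (status, Location header).

    Redirects are deliberately NOT followed, so a 301/302/307 shows up
    as itself instead of as whatever page it eventually leads to.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path, headers={"User-Agent": "outage-check-sketch"})
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Running probe() against a handful of deep URLs (not just the home page) while an outage configuration is switched on in a test environment should show whether you are sending a 503, a temporary redirect, or something you didn't intend. Some servers answer HEAD differently from GET, so it may be worth checking both.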

Some things definitely not to do, if you want to avoid essentially directing search engines to remove you? Do not return a “301 Moved Permanently” redirect to an outage page for a temporary outage. Do not return a 200 at the original URL with completely different content (like an outage message returned as a 200 response from every single URL, with no redirects). A 404 “Not Found” is probably not a great idea either, although not as bad as a 301 or a 200 with different content.
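The advice in this post can be boiled down to a small lookup table, which might be handy as a sanity check in a deployment script. This is just a sketch of my own opinions as stated above, not an official rule from Google or anyone else; the function name assess_outage_status and the wording of the verdicts are made up for the example, and the 200 entry only applies when the outage message is served in place of the real content.

```python
# Verdicts for a status code served at the ORIGINAL URL during a
# temporary outage. These encode the opinions in this post only.
OUTAGE_ADVICE = {
    503: "preferred: says the condition is temporary",
    307: "acceptable: temporary redirect, original URL is not replaced",
    303: "acceptable: see other, original URL is not replaced",
    404: "not great: claims the page does not exist",
    301: "avoid: claims the page has permanently moved to the outage page",
    200: "avoid: claims the outage message IS the page's content",
}


def assess_outage_status(status):
    """Return this post's verdict for a status code, or a fallback string."""
    return OUTAGE_ADVICE.get(status, "not covered here: think it through yourself")


if __name__ == "__main__":
    for code in (503, 307, 303, 404, 301, 200):
        print(code, "->", assess_outage_status(code))
```

Anything not in the table falls through to the fallback, which is on purpose: the table only covers the cases discussed here.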

These things matter — keeping your website functioning properly includes keeping it functioning properly for software agents. At least if you care about things like appearing in Google results, and everyone does. Use HTTP responsibly!


3 Responses to www.loc.gov gone, back, but missing from Google

  1. Louise says:

    Jonathan, thanks for the pointer! This topic also turned up on inetbib (German librarian mailing list) a few days ago and we might have found the answer:
    LoC seems to have used http status code 200 for the shutdown page and 302 (“The requested resource resides temporarily under a different URI” – namely the shutdown page) for everything else.

    http://comments.gmane.org/gmane.culture.libraries.inetbib/26040

  2. jrochkind says:

    Thanks for verifying what http://www.loc.gov was doing during the outage, http-wise!

    It’s not obvious that using a 302 “found” redirect to an outage message with a 200 would cause a problem — I would have thought that should probably be okay, although it’s not as ideal as a 503. (And if I was going to do a 3xx redirect, 307 Temporary Redirect seems preferable)

    But maybe that did end up being something google didn’t like? It’s still something of a mystery what’s caused http://www.loc.gov to drop off google.

    It’s also curious that other *.loc.gov are back in google; I wonder if they ended up doing something different during the outage, rather than universal 302 redirect to a 200 outage message.

  3. jrochkind says:

    http://www.loc.gov seems to be back in google results now, hooray.
