XML namespaces in nokogiri

[update 3 Nov 2010. On further experimentation, I realize I didn’t get all the details below exactly right, but I didn’t learn enough to know what ‘right’ was; I just learned that I’m still confused about how some of these methods are intended to work. I’m out of time to figure out more and have other things I need to work on, so I’ll just leave you with this caveat: consider this documentation of the current state of knowledge of one programmer’s work-in-progress attempt to understand nokogiri xml namespace programmatic manipulation/interrogation. Phew, what a mess!]

 

Ruby XML gem Nokogiri is quite reasonable at allowing you to do xpath searches against XML with namespaces.

I like that it generally requires you to supply the namespace URIs, not just use the prefixes that happen to be in the document; some libraries do the latter, which is not really an appropriate use of XML namespaces. (Although in some cases you can do the latter in nokogiri too, I think it’s a poor idea.)

Where things get more confusing is in nokogiri’s methods for interrogating and manipulating namespaces on parsed Nokogiri::XML::Node objects.  The various methods you need are named kind of inconsistently, making it hard to figure out what methods you want, and some of the methods behave idiosyncratically. Also, the documentation is awfully sparse.

I’ve suggested a patch to nokogiri with better comments for rdoc generation in most of the namespace-dealing methods.  Unless/until it gets accepted by nokogiri developers, you may find helpful this diff with my suggested new comments for rdoc generation.

One ‘bug’

There is one thing I consider a bug in the methods #namespace_scopes and #namespaces.  I’ve created a ticket for it.

Basically, these methods are (I believe) designed to return all namespaces currently ‘operative’ or ‘in scope’ for the node, meaning namespaces declared on the node itself or its ancestors. The problem is that these methods do not include default (non-prefixed) namespaces (eg: xmlns="http://example.org") on ancestor nodes (although they do include a default namespace set on the receiving/self node).  So beware, that can trip you up or be inconvenient sometimes.

Except actually, on further investigation, this is even less predictable. Sometimes ancestral default namespaces are included, other times they aren’t. I’m not entirely sure what determines which, I am confused and out of time to continue investigating this.  Maybe someone else can figure it out from my ticket.

Methods you might be interested in

To add a namespace definition to a node, #add_namespace_definition(prefix, uri)  (aka #add_namespace , alias for the same thing) is what you want.  If you pass nil as the prefix, then you’ll be adding a default namespace to the node. (xmlns=, rather than xmlns:prefix=).

Note well that while namespaces you add this way do not show up in the #attributes for that node, they will be included as corresponding xmlns:prefix or xmlns attributes when serializing the node.

Note also that if you dynamically add an “xmlns” attribute to a node using the ordinary nokogiri attribute manipulation API, it will NOT be automatically added as a namespace utilized for queries etc.

To interrogate what namespaces are defined in the document, you may be interested in:

  • #namespace_definitions will return any namespace definitions on the recipient node itself (not including any that are operative only because they are defined on ancestor nodes).  It returns these as an array of Nokogiri::XML::Namespace objects. The interesting methods on a Namespace object are basically #prefix and #href.  If a Namespace included in the list has a nil #prefix, that means it is a default namespace.
  • #namespace will return the default namespace declaration for a node (ie, one that would be serialized as xmlns="URI"), but it will return it as a Namespace object (which necessarily has a nil prefix), or nil if there is none. It will only return non-nil for a default namespace defined on the recipient node; it will not return the actual operative namespace due to a parent having a default namespace. That is, it will return the same thing as looking through the #namespace_definitions list for a Namespace object with a nil prefix.
  • #namespace_scopes will also return an array of Namespace objects, but includes any namespaces defined on ancestor nodes too, so it should be all the namespaces currently in the operative scope for that node.  However, as noted before, there is what I consider a bug, where it won’t actually include a default namespace on an ancestor node, even if that default namespace is in effect on the recipient node.
  • #namespaces will return the same thing as #namespace_scopes, but as a hash, where the keys are XML attributes (xmlns or xmlns:prefix) suitable for expressing the namespace declaration, and the values are namespace URIs.

There are a bunch of other methods in there too that may confuse you; some of them do the same thing as these in different ways, others work in subtly different ways, others (like any that take a Namespace object as an argument) are very unlikely to be suitable for public API use anyway, they’re really internal nokogiri methods. (Since you can’t actually create a Namespace object yourself). But these methods I’ve mentioned are, as far as I can tell, sufficient to do anything nokogiri will actually do for you with namespaces.

An example problem and solution

Here’s the problem I actually started out with, which led to me figuring out how this stuff works in nokogiri.

I have an XML document with lots of namespaces, declared at various different parts and levels of the node tree.

I want to take a certain node and its children, a sub-tree of the overall document. I want to serialize that sub-tree to XML, without its ancestors, but I want to include any “operative” or “scoped” namespaces from that node’s parents in the serialization, so that all the nodes in the sub-tree, when serialized, are still in exactly the same namespaces.

This ends up being tricky for a couple of reasons. One is the aforementioned “bug” in #namespace_scopes/#namespaces, which requires a bit of manual code to make sure you catch an operative default namespace from an ancestor.

Another is you’re faced with the question —

A) Should I add the namespaces with #add_namespace_definition? In that case they won’t be listed in #attributes, but will be serialized out anyway, and will immediately affect querying on this sub-tree in the original document (which shouldn’t actually change anything though, since they should just be a re-definition of namespaces that were already operative).  If I choose this option, the #namespace_scopes method gives me data in a more convenient form.

B) Or should I add the namespaces simply using the nokogiri attribute manipulation API to add xmlns attributes? In that case they will appear in the node’s #attributes immediately, and will still be serialized out, but won’t immediately affect querying on the node in the original document (which, as stated, probably doesn’t matter anyway).

I choose ‘B’, in part because I’m not entirely sure what #namespace_scopes does if the same prefix is used more than once in the ancestor list, with a different namespace. If those multiple definitions (with the same prefix) each show up in the #namespace_scopes array, you’ll have to use your own custom code to make sure you use the “lowest” one.  In all the empirical investigation I did of nokogiri namespace methods, I realize now I didn’t look into that. Oh well, let’s just choose B instead.

doc # some nokogiri doc
node # some node in that doc

operative_namespaces = node.namespaces
# But that may not include an operative default namespace
# on parent nodes, because of hard to understand possible bug,
# better look for it and add it in.
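if (!operative_namespaces.has_key?("xmlns"))
   # Nearest ancestor element that declares a default (nil prefix)
   # namespace on itself. There may be none, so guard against nil.
   default_ns = node.ancestors.
     select {|a| a.is_a?(Nokogiri::XML::Element) }.
     map {|a| a.namespace_definitions.find {|ns| ns.prefix.nil? } }.
     compact.first
   operative_namespaces["xmlns"] = default_ns.href if default_ns
end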

# Okay, our hash is already in XML attribute form, just add
# it in to our node.

operative_namespaces.each_pair do |attr, value|
   node[attr] = value
end   

node.to_xml # will have all operative namespaces included as xmlns attributes.

Room for improvement

I think there are all kinds of things that could be done to make the nokogiri namespace interrogation/manipulation api more consistent and predictable. But I’ve run out of time for this side project of figuring out namespaces in nokogiri. Maybe you have ideas and want to submit patches or tickets to nokogiri?

One obvious example:   You can set the default namespace with a simple string URI with #default_namespace=.  But you can only _get_ the default namespace as a Namespace object with a guaranteed nil prefix, with #namespace.  Shouldn’t there also be a #default_namespace method that returns nil, or the default namespace on the receiving node as a string URI?

There are many more ideas, possibly including seriously renaming a bunch of those methods. For instance, the method #namespace returns the default namespace (only if declared on the node itself, not a parent), so why isn’t it called #default_namespace instead?  But then we start to get confused between the methods which take or return Namespace objects and the methods which take or return their string pair equivalents instead, which already aren’t named consistently enough for you to know which is which.
