Xatapult's XML Blog

08/10/2009

Interpreting XML – Don’t do it yourself!

Filed under: General — xatapult @ 09:07

In an XML document you can express the same semantics in syntactically very different ways. Which is a difficult way of saying: watch out, XML documents may look different but can nonetheless mean the same! If you use the right tools to parse and interpret the XML, that is not a problem. However, beware of developers who do not…

This article explores the areas that cause most of the confusion in interpreting XML.

True stories

This really happened: One of the companies I worked with had to interpret SOAP XML messages coming from another company. Unfortunately they decided to use regular expressions (no, this is not a joke) to interpret the XML!

Of course, in the beginning all went well. The XML messages came in, the regular expressions parsed them and miraculously produced the right results. Until… the company that sent the messages decided to upgrade their SOAP framework. After this the same messages were sent, but the regular expressions were no longer doing their job.

I can’t remember exactly what the new framework did, but it was a completely innocent thing with regard to the meaning of the messages. Something like an extra line break in between elements, a different namespace prefix, a different way of declaring namespaces or a different order of the attributes. Nothing a true XML parser would have worried about, but the regular expressions stopped working.

Another example was an application that suddenly refused to handle a valid XML document because a namespace prefix was introduced (instead of using a default namespace). In this case an XML parser was used, but the handling of namespaces was definitely not correct.
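A minimal Python sketch of how this kind of breakage can happen. The two messages, the namespace urn:example:orders and the pattern are all made up for illustration; they are not the messages from the story. Both messages mean the same, but only the parser-based lookup survives the change of prefix, quoting and line breaks.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical messages with the same meaning but different syntax.
BEFORE = '<m:order xmlns:m="urn:example:orders"><m:id>42</m:id></m:order>'
AFTER = (
    "<ns1:order xmlns:ns1='urn:example:orders'>\n"
    "  <ns1:id>42</ns1:id>\n"
    "</ns1:order>"
)

# A brittle pattern that depends on the exact prefix used in BEFORE.
ID_PATTERN = re.compile(r"<m:id>(\d+)</m:id>")


def id_by_regex(message):
    """Return the id found by the regular expression, or None."""
    match = ID_PATTERN.search(message)
    return match.group(1) if match else None


def id_by_parser(message):
    """Return the id by looking up the element by its namespace name."""
    root = ET.fromstring(message)
    return root.findtext("{urn:example:orders}id")


print(id_by_regex(BEFORE), id_by_regex(AFTER))
print(id_by_parser(BEFORE), id_by_parser(AFTER))
```

The regular expression finds the id only in the first message, while the parser returns it for both.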

Namespaces

Very often problems interpreting the XML are due to the handling of namespaces. Have a look at the examples below:

<data xmlns="basenamespace" xmlns:e="embeddednamespace">
   <e:content>bla bla bla</e:content>
</data>

<data xmlns="basenamespace" >
   <content xmlns="embeddednamespace">bla bla bla</content>
</data>

<b:data xmlns:b="basenamespace" xmlns:whatever="embeddednamespace">
   <whatever:content>bla bla bla</whatever:content>
</b:data>

As you might have guessed or seen, these examples all mean the same, despite the different syntax. The following example however means something completely different, although at first glance it looks the same:

<data xmlns="basenamespace" xmlns:e="embeddednamespace">
   <content>bla bla bla</content>
</data>

Missed it? In the first three examples the inner <content> element is in the namespace embeddednamespace. In the last example it is in the basenamespace namespace. The last example declares the embeddednamespace namespace but does not use it, something which might be confusing but is perfectly valid.

  • Use an XML parser that can handle namespaces and all the variations in their declarations. Most can, but some older or very simple ones cannot (examples are the basic XML parsers in PHP and Perl).
  • Always take the namespace of an element or attribute into account when interpreting the XML.
  • Never ever use the namespace prefix (or the absence of it) to distinguish between namespaces. Always use the full namespace name.
  • Do not get confused by the declaration of superfluous namespaces.
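As a rough illustration, here is a small Python sketch that feeds the four namespace examples to the standard library's xml.etree.ElementTree, which reports every element as {namespace}localname. The helper name qualified_names is my own choice; whitespace between elements is left out of the strings for brevity.

```python
import xml.etree.ElementTree as ET

# The four namespace examples from this section.
DOCS = [
    '<data xmlns="basenamespace" xmlns:e="embeddednamespace">'
    '<e:content>bla bla bla</e:content></data>',
    '<data xmlns="basenamespace">'
    '<content xmlns="embeddednamespace">bla bla bla</content></data>',
    '<b:data xmlns:b="basenamespace" xmlns:whatever="embeddednamespace">'
    '<whatever:content>bla bla bla</whatever:content></b:data>',
    '<data xmlns="basenamespace" xmlns:e="embeddednamespace">'
    '<content>bla bla bla</content></data>',
]


def qualified_names(xml_text):
    """List the namespace-qualified names of all elements, in document order."""
    root = ET.fromstring(xml_text)
    return [element.tag for element in root.iter()]


for doc in DOCS:
    print(qualified_names(doc))
```

The first three documents give ['{basenamespace}data', '{embeddednamespace}content']; the fourth gives ['{basenamespace}data', '{basenamespace}content']. The prefixes never show up in the result, which is exactly why code should compare namespace names and not prefixes.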

Text

Text is just text, you might think. However, it is not. Have a look at the following examples, which all mean exactly the same:

<p>What are you looking at?</p>
<p><![CDATA[What are you looking at?]]></p>
<p>What are you looking at<![CDATA[?]]></p>
<p>What are you looking at&#x3F;</p>
<p>What are you looking at&#x3f;</p>
<p>&#x57;&#x68;&#x61;&#x74;&#x20;&#x61;&#x72;&#x65;&#x20;&#x79;&#x6F;&#x75;&#x20;&#x6C;&#x6F;&#x6F;&#x6B;&#x69;&#x6E;&#x67;&#x20;&#x61;&#x74;&#x3F;</p>

So reading text is not straightforward. It might contain character references or CDATA sections, and these must be handled well.
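A short Python sketch, assuming the standard library's xml.etree.ElementTree as the parser, that reads a few of these variants written as well-formed XML. The parser resolves CDATA sections and character references before handing over the text, so the code never needs to know which form was used.

```python
import xml.etree.ElementTree as ET

# Some of the text variants, as well-formed XML.
VARIANTS = [
    "<p>What are you looking at?</p>",
    "<p><![CDATA[What are you looking at?]]></p>",
    "<p>What are you looking at<![CDATA[?]]></p>",
    "<p>What are you looking at&#x3F;</p>",
    "<p>What are you looking at&#x3f;</p>",
]

# Collect the distinct text values the parser reports.
texts = {ET.fromstring(variant).text for variant in VARIANTS}
print(texts)
```

The set contains a single string, "What are you looking at?", because all five variants carry the same text.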

Whitespace

Handling whitespace is a very tricky area. Whether or not it is important depends on the nature of the XML. In data oriented XML, whitespace in between elements is usually not significant. Probably (but we cannot be absolutely sure just looking at the XML) the following two examples mean the same:

<codes><code>code1</code><code>code2</code></codes>

<codes>
   <code>code1</code>
   <code>code2</code>
</codes>

But what about these more text oriented examples?

<p><b>bold</b> <i>italic</i> <u>underline</u></p>

<p>
   <b>bold</b>
   <i>italic</i>
   <u>underline</u>
</p>

Again, it depends on how the XML is used.

  • When processing data oriented XML, whitespace in between elements is usually not significant and can be discarded. This also allows pretty printers to make the XML more readable without changing its meaning.
  • In more text oriented XML, things get more difficult. Be aware that whitespace in between elements might be significant.
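The sketch below, in Python with xml.etree.ElementTree, shows one way to act on this. The helper strip_ignorable_whitespace and the data_oriented flag are my own hypothetical names; the parser cannot know which kind of XML it is reading, so the caller has to decide.

```python
import xml.etree.ElementTree as ET

COMPACT = "<codes><code>code1</code><code>code2</code></codes>"
PRETTY = """<codes>
   <code>code1</code>
   <code>code2</code>
</codes>"""
INLINE = "<p><b>bold</b> <i>italic</i> <u>underline</u></p>"


def strip_ignorable_whitespace(element):
    """Remove whitespace-only text that sits between elements, in place."""
    for node in element.iter():
        # Text before the first child of an element that has children.
        if len(node) and node.text is not None and not node.text.strip():
            node.text = None
        # Text between this element's end tag and the next tag.
        if node.tail is not None and not node.tail.strip():
            node.tail = None
    return element


def serialize(xml_text, data_oriented):
    """Parse, optionally drop whitespace between elements, and serialize."""
    root = ET.fromstring(xml_text)
    if data_oriented:
        strip_ignorable_whitespace(root)
    return ET.tostring(root, encoding="unicode")


def visible_text(xml_text, data_oriented):
    """Return all character data of the document as one string."""
    root = ET.fromstring(xml_text)
    if data_oriented:
        strip_ignorable_whitespace(root)
    return "".join(root.itertext())


print(serialize(COMPACT, True) == serialize(PRETTY, True))
print(visible_text(INLINE, False))
print(visible_text(INLINE, True))
```

Treated as data, the two <codes> documents serialize identically. Treating the <p> example the same way turns "bold italic underline" into "bolditalicunderline", which shows why whitespace should not be discarded blindly in text oriented XML.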

Other pitfalls

  • The order of attributes is not significant. <hello type="greeting" meaning="friendly"/> is the same as <hello meaning="friendly" type="greeting"/>.
  • XML is case sensitive. <Hello/> is not the same as <hello/> or <HELLO/>.
  • Attribute values can be enclosed in single or double quotes. <hello type="greeting"/> is the same as <hello type='greeting'/>.
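These three points can be checked quickly with a parser. A small Python sketch, again assuming xml.etree.ElementTree; the variable names are mine:

```python
import xml.etree.ElementTree as ET

# Same element, different attribute order and different quote characters.
first = ET.fromstring('<hello type="greeting" meaning="friendly"/>')
second = ET.fromstring("<hello meaning='friendly' type='greeting'/>")

# Attributes arrive as a dictionary, so order and quoting play no role.
same_element = first.tag == second.tag and first.attrib == second.attrib

# Element names are compared exactly, including case.
case_matters = ET.fromstring("<Hello/>").tag != ET.fromstring("<hello/>").tag

print(same_element, case_matters)
```

Both values are True: the two <hello> elements are equal after parsing, while <Hello/> and <hello/> are different elements.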

And probably numerous others. If you have anything to add I would like to hear from you.

Conclusion

Unless you are in the business of writing XML parsers (and not many are), don’t try to do it yourself. Most development environments have excellent software on board to do it for you. After the parsing you can access the XML using SAX or DOM or some custom interface.
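To give an idea of what that access looks like, here is a hedged Python sketch that reads the same element once through DOM (xml.dom.minidom) and once through SAX (xml.sax), both from the standard library. The sample document is the first namespace example from this article; the class name ContentCollector is a name I made up.

```python
import xml.dom.minidom
import xml.sax
from xml.sax.handler import feature_namespaces

XML = (
    '<data xmlns="basenamespace" xmlns:e="embeddednamespace">'
    "<e:content>bla bla bla</e:content></data>"
)
TARGET = ("embeddednamespace", "content")  # (namespace name, local name)

# DOM: the whole document is loaded, then queried by namespace name.
dom = xml.dom.minidom.parseString(XML)
dom_text = dom.getElementsByTagNameNS(*TARGET)[0].firstChild.data


class ContentCollector(xml.sax.ContentHandler):
    """SAX handler that collects the text inside the target element."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def startElementNS(self, name, qname, attrs):
        if name == TARGET:
            self.inside = True

    def endElementNS(self, name, qname):
        if name == TARGET:
            self.inside = False

    def characters(self, content):
        if self.inside:
            self.chunks.append(content)


# SAX: the document is streamed as events; namespace processing is switched on.
parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)
handler = ContentCollector()
parser.setContentHandler(handler)
parser.feed(XML)
parser.close()
sax_text = "".join(handler.chunks)

print(dom_text, sax_text)
```

Both routes return "bla bla bla", and in both the element is identified by its namespace name plus local name, never by its prefix.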

XML looks simple but has an awful lot of nitty-gritty details you need to take care of when interpreting it. And never rely on simple string parsing…

 
