Wednesday, September 21, 2011

StAX parser to handle HTML elements in XML data

The StAX API provides methods for iterative, event-based parsing of XML documents. XML documents are treated as a series of events. Moreover, the StAX API is bidirectional, which enables both reading and writing of XML documents.

The top-level interface that provides methods to parse XML events is XMLEventReader. To get a reference to an XMLEventReader, you use the XMLInputFactory class, which defines an abstract implementation of a factory for getting streams. Here is how you do it:

// This can be any InputStream
InputStream is;
XMLInputFactory factory = XMLInputFactory.newInstance();
// Now use the factory to get the EventReader from the InputStream
XMLEventReader eventReader = factory.createXMLEventReader(is);

Now you can parse the whole XML document by using the methods provided by the reader, e.g.:

// Peek at the next event without consuming it (peek() returns null at end of stream)
XMLEvent event = eventReader.peek();
if (event != null && event.isCharacters()) {
    // It is a Characters event, so consume it and read its data
    event = eventReader.nextEvent();
    String data = ((Characters) event).getData();
}
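
To put the two snippets above together, here is a minimal, self-contained sketch of a full event loop. The class name EventLoopSketch, the helper describeEvents and the sample document are made up for illustration; the checked XMLStreamException is wrapped in an unchecked one only to keep the helper easy to call.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class EventLoopSketch {

    // Walks every event in the document and records a short description of
    // each start tag, end tag, and non-blank run of character data.
    static List<String> describeEvents(String xml) {
        List<String> seen = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()) {
                    seen.add("start:" + event.asStartElement().getName().getLocalPart());
                } else if (event.isCharacters() && !event.asCharacters().isWhiteSpace()) {
                    seen.add("text:" + event.asCharacters().getData());
                } else if (event.isEndElement()) {
                    seen.add("end:" + event.asEndElement().getName().getLocalPart());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return seen;
    }

    public static void main(String[] args) {
        // Hypothetical sample document without any entity references
        String xml = "<company><name>Acme</name></company>";
        describeEvents(xml).forEach(System.out::println);
    }
}
```

Start-document and end-document events are simply skipped here, since the loop only records elements and text.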

This all works brilliantly for any XML document until you encounter HTML-style character entities somewhere in the XML data, such as &amp;. That’s where it all goes wrong: the code above appears to lose the data after the entity.

This is because when the StAX event parser encounters &amp; (and other similar entities), it correctly converts it to “&”, but it treats the data as three separate events: all the characters before the entity, the character the entity stands for, and all the characters after the entity.

As a result, the code above returns only the first event as the data of that element. For example, if you have “Procter &amp; Gamble” in your element, the call to the getData() method will only return “Procter ” as data.
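
The split is easy to observe with a small sketch that collects every Characters event separately. The names SplitTextSketch and textChunks and the sample document are hypothetical. How many chunks you get depends on the StAX implementation in use; the parser bundled with the JDK typically reports three for this input. Whatever the count, joining the chunks gives back the full text.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class SplitTextSketch {

    // Returns the data of every Characters event in the document,
    // one list entry per event.
    static List<String> textChunks(String xml) {
        List<String> chunks = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Keep coalescing explicitly off (the documented default) for this demonstration.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.FALSE);
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isCharacters()) {
                    chunks.add(event.asCharacters().getData());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Hypothetical sample document
        List<String> chunks = textChunks("<name>Procter &amp; Gamble</name>");
        System.out.println(chunks.size() + " chunk(s): " + chunks);
        // Joining the chunks by hand restores the complete element text
        System.out.println(String.join("", chunks));
    }
}
```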

To get the whole data, you need to set the property “javax.xml.stream.isCoalescing” to true on the factory. The default value of this property is false. Setting it to true requires the processor to coalesce adjacent character data.

All you need to do is to add another line of code before you get the event reader from the factory as follows:

// This can be any InputStream
InputStream is;
XMLInputFactory factory = XMLInputFactory.newInstance();
// Set this property so that text containing entities like &amp; is delivered as one Characters event
factory.setProperty(XMLInputFactory.IS_COALESCING, true);
// Now use the factory to get the EventReader from the InputStream
XMLEventReader eventReader = factory.createXMLEventReader(is);

The resulting event reader will now return text containing entities such as &amp; as a single Characters event.
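
As a quick check, the following sketch reads the first Characters event with the property switched on and off. CoalescingSketch, firstText and the sample document are illustrative names, not part of the StAX API. With coalescing enabled, the first event should carry the complete text; without it, the first event may hold only the leading part.

```java
import java.io.StringReader;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

class CoalescingSketch {

    // Returns the data of the first Characters event in the document,
    // or null if the document contains no character data.
    static String firstText(String xml, boolean coalescing) {
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // The boolean is autoboxed to the Boolean value that setProperty expects.
            factory.setProperty(XMLInputFactory.IS_COALESCING, coalescing);
            XMLEventReader reader = factory.createXMLEventReader(new StringReader(xml));
            try {
                while (reader.hasNext()) {
                    XMLEvent event = reader.nextEvent();
                    if (event.isCharacters()) {
                        return event.asCharacters().getData();
                    }
                }
                return null;
            } finally {
                reader.close();
            }
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample document
        String xml = "<name>Procter &amp; Gamble</name>";
        System.out.println("coalescing on:  [" + firstText(xml, true) + "]");
        System.out.println("coalescing off: [" + firstText(xml, false) + "]");
    }
}
```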

This is the tutorial on how to use the StAX API.