Monday, February 6, 2012

java.net.URL, content handlers and HTML/XML parsing

Below is an example of using the java.net.ContentHandler class when retrieving resources from a URL.

The objective is to get traffic stats from a network device. The device may present its status in two ways: as XML data or as a human-readable HTML page. This example uses both sources to get the information.

My network device collects data from different interfaces. An interface may be described as follows:

public class InterfaceStatus {
    private String name;
    private long txPackets;
    private long txBytes;
    private long rxPackets;
    private long rxBytes;
    
    // getters and setters...
}

To override the default behaviour of the URL.getContent() method, a custom content handler factory must be created, i.e. a class that implements the ContentHandlerFactory interface. There's only one method to implement in this interface: public ContentHandler createContentHandler(String mimetype).

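// explicit imports: java.net.* and org.xml.sax.* both declare ContentHandler
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.ContentHandler;
import java.net.ContentHandlerFactory;
import java.net.HttpURLConnection;
import java.net.URLConnection;

import javax.swing.text.html.parser.ParserDelegator;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;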

public class UrlContentHandlerFactory implements ContentHandlerFactory {

    @Override
    public ContentHandler createContentHandler(String mimetype) {
        if("application/xml".equals(mimetype)) {
            return new XmlContentHandler();
        } else if("text/html".equals(mimetype)) {
            return new HtmlContentHandler();
        }
        // default content handler will be selected by JVM
        return null;
    }
    // ... inner *ContentHandler classes below...
}

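For HTTP resources, the mimetype value normally comes from the Content-Type header of the response, with parameters such as charset stripped off. When the type is not available, the JVM may fall back to guessing it, e.g. from the file extension, using the [JAVA_HOME]\lib\content-types.properties file. Now, it's time for the concrete implementation of the ContentHandler abstract class.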
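
To check which MIME type the JVM associates with a given resource name, something along these lines can help. It is only a minimal sketch: the class name MimeTypeGuess is made up for illustration, and it covers just the file-name based guess, not what a server actually sends in its headers.

```java
import java.net.URLConnection;

class MimeTypeGuess {
    // Returns the MIME type the JVM associates with the file name's extension,
    // or null when the extension is unknown to it.
    static String guess(String fileName) {
        return URLConnection.guessContentTypeFromName(fileName);
    }

    public static void main(String[] args) {
        // The file names mirror the two resources used in this post.
        System.out.println("netstat.html -> " + guess("netstat.html"));
        System.out.println("traffic.xml  -> " + guess("traffic.xml"));
    }
}
```

If the printed type differs from the strings compared in createContentHandler(), the factory returns null for that resource and the default handler is used instead.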

HtmlContentHandler
public class UrlContentHandlerFactory implements ContentHandlerFactory {
    // ...
    protected class HtmlContentHandler extends ContentHandler {

        @Override
        public Object getContent(URLConnection urlc) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) urlc;
            if(conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
                // using HTML Editor Kit API
                HtmlTrafficExtractor hte = new HtmlTrafficExtractor();
                new ParserDelegator().parse(new InputStreamReader(conn.getInputStream()), hte, true);

                return hte.getExtractedList();
            }
            return null;
        }
    } // HtmlContentHandler
}

To parse HTML pages, I've used HTMLEditorKit from the javax.swing.text.html package and ParserDelegator from javax.swing.text.html.parser. Why not use an XML parser? Here's the answer:

<html>
    <!-- head -->
    <body bgcolor=#00cccc>                <!-- no quotation marks around attribute value -->
        <img src="logo.gif" alt="Logo">   <!-- no closing "img" tag -->
    </body>
</html>

Although HTML pages consist of tags, tag attributes, text, etc., just as XML documents do, they don't have to conform to the XML specification, as the above snippet illustrates. An XML parser would reject such a page with a parse exception.
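
The difference is easy to try out. The sketch below feeds a condensed version of the snippet to both parsers; the class and method names (LenientVsStrict, parsesAsXml, tagsSeenByHtmlParser) are my own for this illustration, and the XML side uses SAXParserFactory from the JDK rather than the XMLReaderFactory call shown elsewhere in this post.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class LenientVsStrict {
    // Unquoted attribute value and an unclosed <img>, as in the snippet above.
    static final String PAGE =
        "<html><body bgcolor=#00cccc><img src=\"logo.gif\" alt=\"Logo\"></body></html>";

    // Returns true if the XML parser accepts the text, false if it reports a parse error.
    static boolean parsesAsXml(String text) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(text)), new DefaultHandler());
            return true;
        } catch (SAXException notWellFormed) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Collects the names of the tags reported by the Swing HTML parser.
    static List<String> tagsSeenByHtmlParser(String text) {
        final List<String> tags = new ArrayList<String>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tags.add(t.toString());
            }

            @Override
            public void handleSimpleTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                tags.add(t.toString()); // tags without content, such as <img>
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(text), callback, true);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return tags;
    }

    public static void main(String[] args) {
        System.out.println("XML parser accepts the page: " + parsesAsXml(PAGE));
        System.out.println("HTML parser saw tags: " + tagsSeenByHtmlParser(PAGE));
    }
}
```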

XmlContentHandler
public class UrlContentHandlerFactory implements ContentHandlerFactory {
    // ...
    protected class XmlContentHandler extends ContentHandler {

        @Override
        public Object getContent(URLConnection urlc) throws IOException {
            HttpURLConnection conn = (HttpURLConnection) urlc;
            if(conn.getResponseCode() == HttpURLConnection.HTTP_OK) {
                // this is where the SAX2 API kicks in
                try {
                    XMLReader xmlReader = XMLReaderFactory.createXMLReader();
                    XmlTrafficExtractor te = new XmlTrafficExtractor();
                    xmlReader.setContentHandler(te);
                    xmlReader.setErrorHandler(te);
                    xmlReader.parse(new InputSource(conn.getInputStream()));

                    return te.getExtractedList();
                } catch(SAXException saxe) {
                    System.err.println("Parsing failed due to the following error: " + saxe.getMessage());
                }
            } // if
            return null;
        }
    } // XmlContentHandler
}

The content handler for XML documents is very similar. In contrast to the previous code, it uses a SAX2 parser, which is part of the Java runtime.

Both APIs use callback objects to parse documents. With the HTML Editor Kit, the object must extend the static nested class HTMLEditorKit.ParserCallback; with SAX2, it extends org.xml.sax.helpers.DefaultHandler, which covers both the content handler and the error handler roles used above.

HtmlTrafficExtractor
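import java.util.ArrayList;
import java.util.List;

import javax.swing.text.html.HTMLEditorKit;

public class HtmlTrafficExtractor extends HTMLEditorKit.ParserCallback {
    // filled by the callback methods while the page is parsed
    private final List<InterfaceStatus> statusList = new ArrayList<InterfaceStatus>();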

    // overriding essential callback methods here

    public List<InterfaceStatus> getExtractedList() {
        return statusList;
    }
}
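
What the overridden callbacks look like depends entirely on the page layout. As a rough sketch, suppose each interface is one table row whose five td cells hold name, txPackets, txBytes, rxPackets and rxBytes. That layout, and the HtmlRowExtractor name, are assumptions made for this example, not the real format of my device. To keep the sketch short, rows are collected as plain String arrays; turning each row into an InterfaceStatus is a matter of calling its setters.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Sketch only: assumes one <tr> per interface with exactly five <td> cells.
class HtmlRowExtractor extends HTMLEditorKit.ParserCallback {
    private final List<String[]> rows = new ArrayList<String[]>();
    private List<String> cells;   // cells of the current row, null outside <tr>
    private boolean inCell;       // true while inside a <td>

    @Override
    public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
        if (t == HTML.Tag.TR) {
            cells = new ArrayList<String>();
        } else if (t == HTML.Tag.TD) {
            inCell = true;
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        if (inCell && cells != null) {
            cells.add(new String(data).trim());
        }
    }

    @Override
    public void handleEndTag(HTML.Tag t, int pos) {
        if (t == HTML.Tag.TD) {
            inCell = false;
        } else if (t == HTML.Tag.TR && cells != null) {
            if (cells.size() == 5) {   // skips header rows and incomplete rows
                rows.add(cells.toArray(new String[0]));
            }
            cells = null;
        }
    }

    // Parses the given HTML text and returns the data rows found in it.
    static List<String[]> extract(String html) {
        HtmlRowExtractor callback = new HtmlRowExtractor();
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return callback.rows;
    }

    public static void main(String[] args) {
        String page = "<html><body><table>"
            + "<tr><td>eth0</td><td>10</td><td>1000</td><td>20</td><td>2000</td></tr>"
            + "</table></body></html>";
        for (String[] row : extract(page)) {
            System.out.println(row[0] + ": tx bytes " + row[2] + ", rx bytes " + row[4]);
        }
    }
}
```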

XmlTrafficExtractor
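import java.util.ArrayList;
import java.util.List;

import org.xml.sax.helpers.DefaultHandler;

public class XmlTrafficExtractor extends DefaultHandler {
    // filled by the handler methods while the document is parsed
    private final List<InterfaceStatus> statusList = new ArrayList<InterfaceStatus>();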
    
    // overriding essential handler's methods here
    
    public List<InterfaceStatus> getExtractedList() {
        return statusList;
    }
}
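
For the SAX side, the handler method that matters most is usually startElement(). The sketch below assumes the device reports each interface as an element of the form interface name="..." txBytes="..." rxBytes="...". This is a hypothetical format, and the XmlAttributeExtractor name is invented for the example. It builds short text summaries rather than InterfaceStatus objects, and it uses SAXParserFactory instead of XMLReaderFactory.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: assumes <interface name=".." txBytes=".." rxBytes=".."/> elements.
class XmlAttributeExtractor extends DefaultHandler {
    private final List<String> summaries = new ArrayList<String>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("interface".equals(qName)) {
            // Long.parseLong throws NumberFormatException if an attribute is missing.
            long tx = Long.parseLong(atts.getValue("txBytes"));
            long rx = Long.parseLong(atts.getValue("rxBytes"));
            summaries.add(atts.getValue("name") + " tx=" + tx + " rx=" + rx);
        }
    }

    // Parses the given XML text and returns one summary line per interface element.
    static List<String> extract(String xml) {
        XmlAttributeExtractor handler = new XmlAttributeExtractor();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXException | ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
        return handler.summaries;
    }

    public static void main(String[] args) {
        String doc = "<interfaces>"
            + "<interface name=\"eth0\" txBytes=\"1000\" rxBytes=\"2000\"/>"
            + "</interfaces>";
        for (String line : extract(doc)) {
            System.out.println(line);
        }
    }
}
```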

Both callback classes provide a method that returns the list of interfaces found.

And here's how to use the code:

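import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.util.List;

public class TransferStatus {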
    private static final String URL_ADDRESS_XML = "http://router/stats/traffic.xml";
    private static final String URL_ADDRESS_HTM = "http://router/stats/netstat.html";

    public static final void main(String[] args) {
        URLConnection.setContentHandlerFactory(new UrlContentHandlerFactory());

        try {
//          Object content = new URL(URL_ADDRESS_XML).getContent();
            Object content = new URL(URL_ADDRESS_HTM).getContent();
            if(content instanceof List<?>) { // instanceof is already false for null
                @SuppressWarnings("unchecked")
                List<InterfaceStatus> statusList = (List<InterfaceStatus>) content;
                for(InterfaceStatus status : statusList) {
                    // doing things with the data
                }
            }
        } catch(IOException ioe) {
            ioe.printStackTrace();
        }
    }
}

First, the static URLConnection.setContentHandlerFactory() method is called to set the content handler factory. From now on, calls to URL.getContent() will ask the factory for a proper content handler (if none is found, i.e. the factory returns null, the JVM will try to load a default handler).
The next step is to check whether the returned content is of the expected type and then to process the data.
To switch between the two sources, the only thing that needs to change in the above code is the address passed as an argument to the URL() constructor.
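
One thing to keep in mind: the factory can be set only once per JVM. A second call to URLConnection.setContentHandlerFactory() fails with an Error, so in a larger application it may be worth guarding the registration. A minimal sketch follows; FactoryInstaller and installOnce are names I made up for it, and the no-op factory exists only to make the example runnable.

```java
import java.net.ContentHandler;
import java.net.ContentHandlerFactory;
import java.net.URLConnection;

class FactoryInstaller {
    // Tries to register the factory; returns false if one was already registered.
    static boolean installOnce(ContentHandlerFactory factory) {
        try {
            URLConnection.setContentHandlerFactory(factory);
            return true;
        } catch (Error alreadyDefined) {
            // The JDK signals a repeated registration with a plain java.lang.Error.
            // Catching Error is broad; it is done here only to keep the sketch short.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder factory that always defers to the JVM's default handlers.
        ContentHandlerFactory noOp = new ContentHandlerFactory() {
            @Override
            public ContentHandler createContentHandler(String mimetype) {
                return null;
            }
        };
        System.out.println("first call:  " + installOnce(noOp));
        System.out.println("second call: " + installOnce(noOp));
    }
}
```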