Sunday, 6 March 2011

Parsing an XML Document with XPath

J2SE 5.0 provides the javax.xml.xpath package for selecting parts of an XML document with the XML Path Language (XPath), as an alternative to navigating the document with DOM or SAX parsing alone. The JDOM org.jdom.xpath.XPath class also has methods to select XML document node(s) with an XPath expression, which consists of a location path that identifies a node or a list of nodes in the XML document.

Selecting nodes with an XPath expression is often more direct than using the DOM getter methods, because with an XPath expression an Element node can be selected without the caller iterating over a node list. Node lists retrieved with the getter methods have to be iterated over to retrieve the values of element nodes.
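To make the comparison concrete, here is a minimal sketch that looks up one employee's name both ways. The class name GetterVersusXPath, its helper methods and the inlined two-employee document are assumptions made up for this illustration; the sketch shows the difference in the amount of code, not a measured performance difference.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class GetterVersusXPath {

    // Hypothetical document, inlined so that the sketch is self-contained.
    static final String XML =
        "<employees>"
        + "<employee id=\"001\"><name>Johny</name></employee>"
        + "<employee id=\"002\"><name>Williams</name></employee>"
        + "</employees>";

    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException ex) {
            throw new IllegalStateException("Could not parse the sample XML", ex);
        }
    }

    // DOM getters: fetch every <employee>, then loop until the id matches.
    static String nameWithGetters(Document doc, String id) {
        NodeList employees = doc.getElementsByTagName("employee");
        for (int i = 0; i < employees.getLength(); i++) {
            Element employee = (Element) employees.item(i);
            if (id.equals(employee.getAttribute("id"))) {
                return employee.getElementsByTagName("name").item(0).getTextContent();
            }
        }
        return "";
    }

    // XPath: the predicate [@id='002'] does the filtering inside the expression.
    static String nameWithXPath(Document doc) {
        try {
            return XPathFactory.newInstance().newXPath()
                .evaluate("/employees/employee[@id='002']/name", doc);
        } catch (XPathExpressionException ex) {
            throw new IllegalStateException("Could not evaluate the expression", ex);
        }
    }

    public static void main(String[] args) {
        Document doc = parse(XML);
        System.out.println(nameWithGetters(doc, "002")); // Williams
        System.out.println(nameWithXPath(doc));          // Williams
    }
}
```

Both methods return the same value; the XPath version simply moves the loop and the comparison into the expression.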

XPath - A Query Language for XML

Let us see how XPath can be used to query the various pieces of data in an XML Document. Consider the following simple XML file,

<employees>    
<employee id = "001">
<name>Johny</name>
</employee>
<employee>
<name>Williams</name>
</employee>
</employees>


The above XML file represents a collection of Employee instances, each represented by an <employee> tag. The set of <employee> elements shares a common root tag, <employees>. In this article the terms tag, element and node are used loosely and interchangeably, although strictly speaking an element is just one kind of node. An XML Document is a properly organised, well-formed collection of such nodes, and can contain a mixture of the commonly used node kinds such as Element, Attribute and Text.

For example, in the above employees.xml, <employees>, <employee> and <name> are examples of 'Elements'. 'Attributes' represent properties of an element; in our example XML Document, the only attribute is the 'id' attribute of the <employee> element. A 'Text' in an XML Document represents any textual content. For example, 'Johny' and 'Williams' are 'Text' nodes.

XPath uses simple expressions to query or select a portion of information from a XML Document. For instance, if we want to get the name of the first employee, then we can frame an expression like this,


/employees/employee[1]/name

The above expression can be interpreted like this: starting from the root of the XML Document (represented by the leading '/'), descend to the <employees> element, then to the first <employee> element, selected by employee[1], and finally retrieve the value of its <name> element. As seen, the XML Document is traversed hierarchically to retrieve the information. Multiple elements having the same name can be accessed using an array-like notation; note that, unlike Java arrays, XPath positions start at 1, not 0. To select an attribute, the '@' sign is prefixed to the attribute name. For example, to query the 'id' value of the second employee, the following XPath expression does just that (in the sample document above the second <employee> has no 'id' attribute, so the expression selects nothing there, whereas employee[1]/@id would select 001),


/employees/employee[2]/@id
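The expressions above can be tried out with a few lines of Java. The following is a minimal sketch, not part of the sample application later in this article: the class name EmployeeXPathDemo and its evaluate() helper are made up for this illustration, and the employees document is inlined as a string so that no file is needed.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class EmployeeXPathDemo {

    // The employees document from this section, inlined as a string.
    static final String XML =
        "<employees>"
        + "<employee id=\"001\"><name>Johny</name></employee>"
        + "<employee><name>Williams</name></employee>"
        + "</employees>";

    // Hypothetical helper: parses XML and returns the string value of the expression.
    static String evaluate(String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return XPathFactory.newInstance().newXPath().evaluate(expression, doc);
        } catch (ParserConfigurationException | SAXException | IOException
                 | XPathExpressionException ex) {
            throw new IllegalStateException("Could not evaluate " + expression, ex);
        }
    }

    public static void main(String[] args) {
        // Positions are 1-based: employee[1] is the first <employee>.
        System.out.println(evaluate("/employees/employee[1]/name")); // Johny
        System.out.println(evaluate("/employees/employee[2]/name")); // Williams
        System.out.println(evaluate("/employees/employee[1]/@id"));  // 001
        // The second <employee> has no id, so the string result is empty.
        System.out.println(evaluate("/employees/employee[2]/@id").isEmpty()); // true
    }
}
```

When an expression selects nothing, evaluating it as a string yields an empty string rather than null or an exception, which is why the last line prints true.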


Java and XPath


An easy-to-use Java XPath API is available for accessing XML data. It is part of the standard JDK distribution, in the javax.xml.xpath package. All we have to do is use the XPathFactory class and the XPath and XPathExpression interfaces.

The XPathFactory class follows the standard Factory Pattern to create XPath objects. An XPath object provides the environment in which expressions are compiled; a compiled expression is represented by an XPathExpression. The compiled XPathExpression can then be evaluated to get the desired results. The following code snippet shows the steps,

// Get an instance of the XPathFactory.
XPathFactory xPathFactory = XPathFactory.newInstance();
// Create an XPath object from the factory.
XPath xPath = xPathFactory.newXPath();
// Compile the expression to get an XPathExpression object.
String expression = "SomeXPathExpression";
XPathExpression xPathExpression = xPath.compile(expression);
// Evaluate the expression against the XML Document to get the result
// (this single-argument overload returns the result as a String).
Object result = xPathExpression.evaluate(xmlDocument);
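One reason for the separate compile step is that a compiled XPathExpression can be evaluated more than once. The sketch below compiles one expression and evaluates it against several documents. The class name CompiledExpressionReuse, the rootNames() helper and the expression name(/*), which returns the name of the root element, are choices made for this illustration only.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class CompiledExpressionReuse {

    // Hypothetical helper: returns the root element name of each XML string.
    static List<String> rootNames(String... xmlDocuments) {
        try {
            // Compile once ...
            XPathExpression rootName =
                XPathFactory.newInstance().newXPath().compile("name(/*)");
            List<String> names = new ArrayList<>();
            for (String xml : xmlDocuments) {
                Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
                // ... and evaluate the same compiled expression per document.
                names.add(rootName.evaluate(doc));
            }
            return names;
        } catch (ParserConfigurationException | SAXException | IOException
                 | XPathExpressionException ex) {
            throw new IllegalStateException("Could not evaluate the expression", ex);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootNames(
            "<employees><employee/></employees>",
            "<projects><project/></projects>")); // [employees, projects]
    }
}
```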



Sample Application


The following section provides a sample application that demonstrates the usage of XPath in Java applications. The sample application selects the value of an element, the value of an attribute, and an element-set (an element that contains multiple child elements) by compiling and evaluating different expressions.

1) projects.xml

Here is an XML file called 'projects.xml' which contains structured information about various projects. Each <project> element has an attribute called 'id' and nested elements <name>, <start-date> and <end-date>. The structure of the XML file is given below.

<?xml version="1.0" encoding="UTF-8"?>
<projects>

<project id = "BP001">
<name>Banking Project</name>
<start-date>Jan 10 1999</start-date>
<end-date>Jan 10 2003</end-date>
</project>
<project id = "TP001">
<name>Telecommunication Project</name>
<start-date>March 20 1999</start-date>
<end-date>July 30 2004</end-date>
</project>
<project id = "PP001">
<name>Portal Project</name>
<start-date>Dec 10 1998</start-date>
<end-date>March 10 2006</end-date>
</project>

</projects>

2) XPathReader.java

Now, let us write a simple Java application that reads the various pieces of information from the XML Document. Following is the Java source that parses the XML Document and evaluates XPath expressions against it.


package com.javabeat.tips.xpath;

import java.io.IOException;
import javax.xml.namespace.QName;
import javax.xml.parsers.*;
import javax.xml.xpath.*;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

public class XPathReader {

    private String xmlFile;
    private Document xmlDocument;
    private XPath xPath;

    public XPathReader(String xmlFile) {
        this.xmlFile = xmlFile;
        initObjects();
    }

    // Parses the XML file into a DOM Document and creates the XPath object.
    private void initObjects() {
        try {
            xmlDocument = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlFile);
            xPath = XPathFactory.newInstance().newXPath();
        } catch (IOException ex) {
            ex.printStackTrace();
        } catch (SAXException ex) {
            ex.printStackTrace();
        } catch (ParserConfigurationException ex) {
            ex.printStackTrace();
        }
    }

    // Compiles the expression and evaluates it against the document,
    // returning the result as the requested XPath data-type.
    public Object read(String expression, QName returnType) {
        try {
            XPathExpression xPathExpression = xPath.compile(expression);
            return xPathExpression.evaluate(xmlDocument, returnType);
        } catch (XPathExpressionException ex) {
            ex.printStackTrace();
            return null;
        }
    }
}


The constructor of this class is passed the XML file from which the information has to be read. It immediately calls the method initObjects(), which initializes the XML Document and the XPath objects. A Document representation of the XML file is created by calling the DocumentBuilder.parse() method. Then, a new XPath object is created by calling the XPathFactory.newXPath() method.

Client applications can then call the XPathReader.read() method, passing the expression to be evaluated and the return type of the expression. The return type is given as a QName which, in XML terms, stands for Qualified Name. The standard XPath data-types are String, Number, Boolean, Node and NodeSet, which are represented as constants in XPathConstants, namely XPathConstants.STRING, XPathConstants.NUMBER, XPathConstants.BOOLEAN, XPathConstants.NODE and XPathConstants.NODESET. Hence, the return type requested when evaluating an expression should be one of the above-mentioned data-types. Within the read() method, the expression is compiled using the XPath.compile() method, which returns an XPathExpression, and the compiled expression is evaluated using the XPathExpression.evaluate() method.
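The Java type of the returned Object depends on the requested constant: XPathConstants.NUMBER gives a Double, XPathConstants.BOOLEAN a Boolean, and XPathConstants.NODESET a NodeList. The following is a minimal sketch of that mapping. The class ReturnTypeDemo, its helper methods and the trimmed-down projects document are assumptions made for this illustration; its read() method only mirrors the shape of XPathReader.read() and is not the class above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class ReturnTypeDemo {

    // Trimmed-down, hypothetical version of projects.xml (names and ids only).
    static final String XML =
        "<projects>"
        + "<project id=\"BP001\"><name>Banking Project</name></project>"
        + "<project id=\"TP001\"><name>Telecommunication Project</name></project>"
        + "<project id=\"PP001\"><name>Portal Project</name></project>"
        + "</projects>";

    // Evaluates the expression and returns the result as the requested type.
    static Object read(String expression, QName returnType) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(XML)));
            return XPathFactory.newInstance().newXPath()
                .evaluate(expression, doc, returnType);
        } catch (ParserConfigurationException | SAXException | IOException
                 | XPathExpressionException ex) {
            throw new IllegalStateException("Could not evaluate " + expression, ex);
        }
    }

    // NUMBER results come back as java.lang.Double.
    static double projectCount() {
        return (Double) read("count(/projects/project)", XPathConstants.NUMBER);
    }

    // BOOLEAN results come back as java.lang.Boolean.
    static boolean hasTelecomProject() {
        return (Boolean) read("boolean(/projects/project[@id='TP001'])",
                              XPathConstants.BOOLEAN);
    }

    // NODESET results come back as org.w3c.dom.NodeList.
    static List<String> projectNames() {
        NodeList nodes = (NodeList) read("/projects/project/name",
                                         XPathConstants.NODESET);
        List<String> names = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            names.add(nodes.item(i).getTextContent());
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(projectCount());      // 3.0
        System.out.println(hasTelecomProject()); // true
        System.out.println(projectNames());
    }
}
```

Casting the result to the wrong Java type fails only at run time with a ClassCastException, so the requested constant and the cast need to be kept in step.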

3) XPathReaderTest.java

package com.javabeat.tips.xpath;

import javax.xml.xpath.XPathConstants;
import org.w3c.dom.*;

public class XPathReaderTest {

    public XPathReaderTest() {
    }

    public static void main(String[] args) {

        XPathReader reader = new XPathReader(
            "src\\com\\javabeat\\tips\\xpath\\projects.xml");

        // To get an XML attribute.
        String expression = "/projects/project[1]/@id";
        System.out.println(reader.read(expression,
            XPathConstants.STRING) + "\n");

        // To get a child element's value.
        expression = "/projects/project[2]/name";
        System.out.println(reader.read(expression,
            XPathConstants.STRING) + "\n");

        // To get an entire node.
        expression = "/projects/project[3]";
        NodeList thirdProject = (NodeList) reader.read(expression,
            XPathConstants.NODESET);
        traverse(thirdProject);
    }

    // Recursively prints the name and text content of every element node.
    private static void traverse(NodeList rootNode) {
        for (int index = 0; index < rootNode.getLength(); index++) {
            Node aNode = rootNode.item(index);
            if (aNode.getNodeType() == Node.ELEMENT_NODE) {
                NodeList childNodes = aNode.getChildNodes();
                if (childNodes.getLength() > 0) {
                    System.out.println("Node Name-->" +
                        aNode.getNodeName() +
                        " , Node Value-->" +
                        aNode.getTextContent());
                }
                traverse(aNode.getChildNodes());
            }
        }
    }
}


This test application creates an instance of the XPathReader class and then calls the XPathReader.read() method with different expressions and return types. As we can see, the third expression retrieves an entire node-set by passing the return type as XPathConstants.NODESET. Since a node-set contains a collection of nodes, which in turn can contain other nodes, a recursive traversal is made over the node-set to get the name and the value of each element node by calling the Node.getNodeName() and Node.getTextContent() methods. The following would be the expected output for the above sample client application.

Output for the above program


BP001
Telecommunication Project
Node Name-->project , Node Value-->
Portal Project
Dec 10 1998
March 10 2006

Node Name-->name , Node Value-->Portal Project
Node Name-->start-date , Node Value-->Dec 10 1998
Node Name-->end-date , Node Value-->March 10 2006


