Introduction to VTD-XML
VTD-XML is a new, open-source, non-validating, non-extractive eXtensible
Markup Language (XML) processing Application Programming Interface (API)
written in Java. VTD-XML is a strong alternative to the Simple API for XML
(SAX) and the Document Object Model (DOM), as it does not force you to
trade processing performance for usability.
_______________________________________________
The Java-based, non-validating VTD-XML parser is faster than DOM and easier
to use than SAX. Unlike other XML processing technologies, VTD-XML is
designed to be random-access capable without incurring excessive resource
overhead. An important optimization in VTD-XML is non-extractive
tokenization. Internally, VTD-XML keeps the XML message in memory intact
and undecoded, and represents each token exclusively by its starting offset
and length. Tokenization in VTD-XML is based on the Virtual Token
Descriptor (VTD) binary encoding specification. A VTD record is a 64-bit
integer that encodes the length, starting offset, type, and nesting depth
of a token in the XML document.
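To make the idea of a 64-bit record concrete, here is a minimal sketch that packs the four fields into a Java long and reads them back. The bit widths and positions (30 bits of offset, 20 of length, 8 of depth, 4 of type), as well as the class and method names, are assumptions chosen for illustration and should not be read as the official VTD layout.

```java
// Illustrative only: the field widths and positions below are assumptions
// made for this sketch, not the official VTD bit layout.
final class VtdRecordSketch {
    private static final int OFFSET_BITS = 30;
    private static final int LENGTH_BITS = 20;
    private static final int DEPTH_BITS = 8;
    private static final int TYPE_BITS = 4;

    private static final int LENGTH_SHIFT = OFFSET_BITS;
    private static final int DEPTH_SHIFT = LENGTH_SHIFT + LENGTH_BITS;
    private static final int TYPE_SHIFT = DEPTH_SHIFT + DEPTH_BITS;

    // Packs the four fields into one long; rejects values that do not fit.
    static long pack(int type, int depth, int length, int offset) {
        check("type", type, TYPE_BITS);
        check("depth", depth, DEPTH_BITS);
        check("length", length, LENGTH_BITS);
        check("offset", offset, OFFSET_BITS);
        return ((long) type << TYPE_SHIFT)
                | ((long) depth << DEPTH_SHIFT)
                | ((long) length << LENGTH_SHIFT)
                | (long) offset;
    }

    static int offset(long rec) { return (int) (rec & mask(OFFSET_BITS)); }
    static int length(long rec) { return (int) ((rec >>> LENGTH_SHIFT) & mask(LENGTH_BITS)); }
    static int depth(long rec)  { return (int) ((rec >>> DEPTH_SHIFT) & mask(DEPTH_BITS)); }
    static int type(long rec)   { return (int) ((rec >>> TYPE_SHIFT) & mask(TYPE_BITS)); }

    private static long mask(int bits) { return (1L << bits) - 1; }

    private static void check(String name, int value, int bits) {
        if (value < 0 || value > mask(bits)) {
            throw new IllegalArgumentException(name + " out of range: " + value);
        }
    }
}
```

Because every record is a plain long, a parsed document in this model is just an array of longs plus the original bytes, with no per-token objects.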
Memory buffers
can be allocated in bulk to store the VTD records, as the records are
constant in length. This avoids the creation of a multitude of string/node
objects usually associated with other XML processing technologies. As
a result, both memory usage and object creation cost are greatly reduced
by using VTD-XML, which leads to significantly higher processing performance.
For example, on a 1.5 GHz Athlon machine, VTD-XML delivers random access
at a performance level of 25 to 35 MB/sec, outperforming most SAX parsers
with null content handlers. An in-memory VTD-XML document typically consumes
only 1.3 to 1.5 times the size of the XML document.
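The bulk-allocation point can be sketched with a small append-only buffer of fixed-width records. The class below is hypothetical and is not taken from VTD-XML; it only shows that storing records in a long array means one allocation serves many records, where an object-per-token design would allocate once per token.

```java
import java.util.Arrays;

// Hypothetical append-only buffer for fixed-width 64-bit records.
final class RecordBufferSketch {
    private long[] records;
    private int size;

    RecordBufferSketch(int initialCapacity) {
        records = new long[Math.max(1, initialCapacity)];
    }

    void append(long record) {
        if (size == records.length) {
            // Grow in bulk: one array allocation covers many future records.
            records = Arrays.copyOf(records, records.length * 2);
        }
        records[size++] = record;
    }

    // Constant-time access by index, possible because records are fixed width.
    long get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return records[index];
    }

    int size() {
        return size;
    }
}
```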
VTD-XML provides several benefits for software developers. Suppose you
need to choose a processing model for a project involving XML. DOM is slow
and consumes too much memory, particularly for large documents. SAX is
difficult to use, especially for XML documents with complex structures.
VTD-XML is therefore the best option, because it does not force you to
trade processing performance for usability, and its random-access
capability does not incur excessive resource overhead. Even though SAX is
fast due to its forward-only nature, it does not suit every situation.
In some situations, you must buffer large amounts of data to extract what
you need, while in others, you may have to repeat SAX parsing on the same
document multiple times. Whichever you do, SAX programming usually results
in ugly and unmaintainable code, while the performance benefit over DOM is
not always significant. VTD-XML lets you achieve ease of use and high
performance at the same time, and its performance benefit over DOM is
substantial.
VTD-XML can be used for an XML project only if two criteria are met. The
first criterion is that your documents do not rely on entity declarations
in document type definitions (DTDs), which the current version of VTD-XML
does not support. VTD-XML recognizes only the five built-in entities:
&amp;, &apos;, &lt;, &gt;, and &quot;. VTD-XML works well when Simple
Object Access Protocol (SOAP), Resource Description Framework (RDF),
Financial Information Exchange Markup Language (FIXML), or Really Simple
Syndication (RSS) is used in the XML project. The second criterion is that
sufficient RAM is available: to provide true random access to the entire
document, the document must be held in memory, and VTD-XML's internal
parsed representation is slightly larger than the XML itself. When both
criteria are met, VTD-XML is the most efficient XML processing API.
The Java API of VTD-XML consists of three essential components: VTDGen
(VTD generator), which encapsulates the parsing routine that produces the
internal parsed representation of XML; VTDNav (VTD navigator), a
cursor-based API that allows DOM-like random access to the hierarchical
structure of XML; and AutoPilot, the class that allows document-order
element traversal.
To process an XML document with VTD-XML, whether it is read from disk or
fetched over HTTP, perform the following steps. First, find out the length
of the XML document, allocate a buffer large enough to hold it, and read
the entire document into memory. Next, create an instance of VTDGen and
assign the byte array to it using setDoc(). Finally, call parse(boolean ns)
to generate the parsed XML representation. When ns is set to true,
subsequent document navigation is namespace aware. If parsing succeeds, you
can retrieve an instance of VTDNav by calling getNav().
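The first step can be sketched in plain Java as follows. The class and method names here are hypothetical. The remaining steps need the VTD-XML library on the classpath, so they appear only as comments, using the method names given in the text above.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

final class LoadDocumentSketch {
    // Step 1: size the buffer from the file length, then read the whole file.
    static byte[] readFully(File file) throws IOException {
        long length = file.length();
        if (length > Integer.MAX_VALUE - 8) {
            throw new IOException("File too large for a single byte array");
        }
        byte[] doc = new byte[(int) length];
        try (InputStream in = new FileInputStream(file)) {
            int read = 0;
            // read() may return fewer bytes than requested, so loop until full.
            while (read < doc.length) {
                int n = in.read(doc, read, doc.length - read);
                if (n < 0) {
                    throw new IOException("Unexpected end of file after " + read + " bytes");
                }
                read += n;
            }
        }
        return doc;
    }

    // Steps 2 and 3, as described in the text (require the VTD-XML library):
    //   VTDGen vg = new VTDGen();
    //   vg.setDoc(doc);           // hand over the byte array
    //   vg.parse(true);           // true = namespace-aware navigation
    //   VTDNav vn = vg.getNav();  // available once parsing succeeds
}
```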
At the start of navigation, the cursor of the VTDNav instance points at the
root element of the XML document. You can use one of the overloaded
versions of the toElement() method to move the cursor manually to different
positions in the hierarchy. The toElement(int direction) overload takes an
integer that indicates the direction in which the cursor moves. The six
possible values of this integer are defined as class variables of VTDNav:
ROOT, PARENT, FIRST_CHILD, LAST_CHILD, NEXT_SIBLING, and PREV_SIBLING, with
the respective short forms R, P, FC, LC, NS, and PS. The toElement() method
returns a boolean value indicating the status of the operation. It returns
true when the cursor is moved successfully. When asked to move to a
non-existent location, for example the first child of a childless element,
the cursor does not move and toElement() returns false.
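The contract described above (a boolean result, and a cursor that stays put on failure) can be illustrated with a toy single-cursor navigator over a small in-memory tree. This is a hypothetical model written for this article, not VTD-XML code. Only the six direction names are borrowed from the text, and their numeric values are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a single-cursor navigator; not VTD-XML's implementation.
final class CursorSketch {
    static final int ROOT = 0, PARENT = 1, FIRST_CHILD = 2,
            LAST_CHILD = 3, NEXT_SIBLING = 4, PREV_SIBLING = 5;

    static final class Node {
        final String name;
        final Node parent;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node parent) {
            this.name = name;
            this.parent = parent;
            if (parent != null) parent.children.add(this);
        }
    }

    private final Node root;
    private Node current;

    CursorSketch(Node root) {
        this.root = root;
        this.current = root; // navigation starts at the root element
    }

    String currentName() {
        return current.name;
    }

    // Moves the cursor and returns true, or leaves it untouched and returns false.
    boolean toElement(int direction) {
        Node target;
        List<Node> kids = current.children;
        switch (direction) {
            case ROOT:         target = root; break;
            case PARENT:       target = current.parent; break;
            case FIRST_CHILD:  target = kids.isEmpty() ? null : kids.get(0); break;
            case LAST_CHILD:   target = kids.isEmpty() ? null : kids.get(kids.size() - 1); break;
            case NEXT_SIBLING: target = sibling(+1); break;
            case PREV_SIBLING: target = sibling(-1); break;
            default: throw new IllegalArgumentException("unknown direction " + direction);
        }
        if (target == null) return false;
        current = target;
        return true;
    }

    private Node sibling(int step) {
        if (current.parent == null) return null;
        List<Node> siblings = current.parent.children;
        int i = siblings.indexOf(current) + step;
        return (i >= 0 && i < siblings.size()) ? siblings.get(i) : null;
    }
}
```

Checking the boolean result before reading from the cursor is what keeps navigation code of this style safe on documents with optional elements.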
The method getAttrVal(String attrName) retrieves the attribute value of the
element at the cursor position. Similarly, the getText() method retrieves
the text content of the cursor element. In addition, you can use the
toElementNS() and getAttrValNS() methods to navigate the document hierarchy
in a namespace-aware fashion, if namespace awareness was turned on during
parsing. AutoPilot is the other mode of navigation. An instance of
AutoPilot can automatically move the cursor through the node hierarchy in
document order. To use AutoPilot, first call the constructor, which accepts
an instance of VTDNav as the input. Then call the selectElement() or
selectElementNS() method to specify which descendant elements to select.
Once this is done, each call to the iterate() method moves the cursor to
the next matching element.
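Document-order iteration with a name filter can be modelled in a few lines if the elements are kept in a flat, document-ordered list together with their nesting depths. The class below is a hypothetical toy, not the library's AutoPilot. As a simplification it visits only descendants of the element where iteration was anchored, and it treats "*" as a wildcard by assumption.

```java
// Toy model of filtered document-order iteration; not VTD-XML code.
final class DocumentOrderSketch {
    private final String[] names; // element names in document order
    private final int[] depths;   // nesting depth of each element
    private final int start;      // index of the anchor element
    private int cursor;           // index of the current element
    private String selected;      // name to match, or "*" for any element

    DocumentOrderSketch(String[] names, int[] depths, int start) {
        this.names = names;
        this.depths = depths;
        this.start = start;
        this.cursor = start;
    }

    void selectElement(String name) {
        selected = name;
        cursor = start; // a new selection restarts from the anchor
    }

    // Advances to the next matching descendant of the anchor element.
    // Returns false, leaving the cursor in place, when none remains.
    boolean iterate() {
        if (selected == null) {
            throw new IllegalStateException("call selectElement first");
        }
        // Descendants are the following entries that are nested deeper than the anchor.
        for (int i = cursor + 1; i < names.length && depths[i] > depths[start]; i++) {
            if (selected.equals("*") || selected.equals(names[i])) {
                cursor = i;
                return true;
            }
        }
        return false;
    }

    int currentIndex() {
        return cursor;
    }
}
```

A typical caller would invoke selectElement() once and then loop with while (iterate()) to visit each match in turn.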
Now let us look at some properties of VTD-XML that set it apart from
similar XML APIs, such as DOM and XMLCursor. The hierarchy of VTD-XML
consists exclusively of element nodes. This is very different from DOM,
which treats every node, whether it is an attribute node or a text node,
as a part of the hierarchy. In VTD-XML, every instance of VTDNav has only
one cursor. The cursor can be moved back and forth in the hierarchy, but
you cannot duplicate it. However, you can temporarily save the location of
the cursor on a global stack. VTDNav has two stack access methods: push(),
which saves the cursor state, and pop(), which restores it. For example,
suppose you are somewhere in the element hierarchy and want to visit a
different area of the document, then continue from where you left off.
To accomplish this, first push() the location onto the stack. After moving
the cursor to a different part of the document, you can quickly jump back
to the saved location by popping it off the stack.
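The save-and-restore pattern can be sketched with a toy cursor whose position is a plain integer. The class is hypothetical and stands in for VTDNav only to show the push, move, pop sequence described above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model: the cursor is just an index, and push()/pop() save and restore it.
final class CursorStackSketch {
    private int cursor;
    private final Deque<Integer> saved = new ArrayDeque<>();

    void moveTo(int position) {
        cursor = position;
    }

    int position() {
        return cursor;
    }

    // Saves the current cursor position on the stack.
    void push() {
        saved.push(cursor);
    }

    // Restores the most recently saved position; returns false if none was saved.
    boolean pop() {
        if (saved.isEmpty()) return false;
        cursor = saved.pop();
        return true;
    }
}
```

Because the stack is last-in, first-out, nested excursions unwind in the reverse order of the saves.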
One aspect of VTD-XML that distinguishes it from other XML processing APIs
is its non-extractive tokenization based on the Virtual Token Descriptor.
Non-extractive parsing is what makes the processing and memory efficiency
of VTD-XML possible. VTD-XML manifests this non-extractiveness in the
following ways. Most of the member methods of VTDNav, such as getAttrVal(),
getCurrentIndex(), and getText(), return an integer. This integer is the
index of the VTD record that describes the token requested by the caller.
After parsing, VTD-XML produces a linear buffer filled with VTD records.
Because all VTD records have the same length, you can access any record in
the buffer if you know its index value. The records are not objects, so
they cannot be addressed using pointers. When a VTDNav method does not
evaluate to any meaningful value, it returns -1, which is more or less
equivalent to a NULL pointer in DOM.
VTD-XML implements its own set of comparison functions that operate
directly on VTD records, as the parsing process does not create any string
objects. For example, the matchElement() method of VTDNav tests whether the
element name, which is effectively the VTD record at the cursor, matches a
given string. Similarly, the matchTokenString(), matchRawTokenString(), and
matchNormalizedTokenString() methods of VTDNav perform a direct comparison
between a string and a VTD record. This is advantageous because you need
not pull tokens out into string objects, which are expensive to create,
especially in large numbers. Bypassing excessive object creation is the
main reason why VTD-XML significantly outperforms DOM and SAX. VTD-XML also
implements its own set of string-to-numeric data conversion functions that
operate directly on VTD records. VTDNav has four such member methods:
parseInt(), parseLong(), parseFloat(), and parseDouble(). These methods
take a VTD record index value and convert the token directly into a
numeric data type.
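The sketch below shows what comparing and converting a token in place can look like when the token is known only by its offset and length in the original bytes. It is a hypothetical illustration rather than VTD-XML's implementation. It assumes ASCII-compatible content with no entity references, and it takes an offset and length directly where the library methods take a record index.

```java
// Toy model of in-place comparison and conversion; no String is created
// for the token itself. Assumes ASCII-compatible bytes and no entities.
final class InPlaceTokenSketch {
    // True when the token's bytes spell exactly the ASCII string s.
    static boolean matches(byte[] doc, int offset, int length, String s) {
        if (s.length() != length) return false;
        for (int i = 0; i < length; i++) {
            char c = s.charAt(i);
            if (c > 0x7F || doc[offset + i] != (byte) c) return false;
        }
        return true;
    }

    // Parses an optionally signed decimal integer straight from the bytes.
    static int parseInt(byte[] doc, int offset, int length) {
        if (length <= 0) throw new NumberFormatException("empty token");
        int i = offset;
        int end = offset + length;
        boolean negative = false;
        if (doc[i] == '-' || doc[i] == '+') {
            negative = doc[i] == '-';
            i++;
        }
        if (i == end) throw new NumberFormatException("sign without digits");
        long value = 0;
        for (; i < end; i++) {
            int digit = doc[i] - '0';
            if (digit < 0 || digit > 9) {
                throw new NumberFormatException("not a digit at offset " + i);
            }
            value = value * 10 + digit;
            // Allow one past MAX_VALUE so that Integer.MIN_VALUE can be parsed.
            if (value > (long) Integer.MAX_VALUE + 1) {
                throw new NumberFormatException("out of int range");
            }
        }
        if (negative) value = -value;
        if (value > Integer.MAX_VALUE) {
            throw new NumberFormatException("out of int range");
        }
        return (int) value;
    }
}
```

In this model the only allocation is the caller's own comparison string, which is the saving the paragraph above attributes to operating directly on records.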