An Introduction to XML Processing with Lark and Larval

Saturday, March 8, 2008

Abstract
Lark is a non-validating XML processor implemented in the Java language; it attempts to achieve good trade-offs among compactness, completeness, and performance. Larval is a validating XML processor built on the same code base as Lark. This report gives an overview of the motivations for, facilities offered by, and usage of, the Lark processor.

This document and the Lark software are copyright © 1997 by Tim Bray; all rights reserved. However, Lark is available on the Internet for general public use.

This note applies to the final beta of version 1.0 of Lark, and release 0.8 of Larval, in use in January 1998. In this note, the name Lark refers to both Lark and Larval, unless otherwise noted.

Why Lark?
1 Motivations

Lark’s creation was driven by the following motivations:

Personal Gratification
Writing language processors is fun, particularly when you have the chance to fix the language.
Desire to Learn Java
It’s about time, and Java seems like the appropriate language for the job.
Test Compliance with Design Goals
In particular, XML Design Principle #4, which states that “It shall be easy to write programs which process XML documents.”
Explore the API Design Space
There is a chance, while XML is young, to make some real progress in the design of processor APIs. The design of Lark makes very few assumptions about the user interface; thus Lark should be useful as an experimental testbed in this area.
$$$
Perhaps Lark will turn out to be useful. I have not the slightest desire to start another software company (been there, done that, got the T-shirts), but it would be nice to figure out a way to get paid for the time I’ve put in writing it.

2 Conclusions
Yes, writing Lark was fun. In particular, none of the innocent-looking things in XML turned out, in practice, to be too horribly difficult. And Java is indeed a Happy Hunting Ground for programmers.

On the design-goal-compliance front, the good news is that if you wanted a program to process XML that you knew was well-formed, you could probably bash it out in Perl (don’t know about Java) in a day or so. On the other hand, if you want to build a general-purpose tool that does all of XML and provides helpful error messages and a useful API, the nominal week is not nearly enough. The development of Lark has consumed about a month so far, stretched over a year’s elapsed time.

I do think that Lark will be useful for exploring API designs. Of course, none of this will happen unless there are people out there who want to use an XML processor for something or other. Among other things, Lark currently has no user interface at all; while I don’t mind editing the Driver.java file and recompiling to run tests, presumably a UI would be a good thing to have.

As for the financial aspects, I’m kind of gloomy. I think most XML processors are going to be purpose-built for the needs of particular applications, and will thus hide inside them. Which is good; XML’s simplicity makes this approach cost-effective. Failing that, processors will be full-dress validating parsers with incremental parsing for authoring support. So I’m not sure that there’s all that much need for a standalone processor; but I’d love to be wrong.

Just in case, for the moment I’m going to be giving away all the .class files, and some of the Java source code, but not the source code for the three classes with the hard bits. In any case, they’re sure to be buggy at this stage and I wouldn’t want to be letting them out of my hands without a bit more polishing. If you can see a way to get a little revenue out of this project, give me a call.

Lark Feature Set Overview

1 Compactness
(The figures below refer to Lark 0.97; they will be updated for 1.0 when it comes out of final beta status.)

Since an XML processor is often going to run on the client and presumably need to be delivered over the network, it must be compact. At the moment, the total byte count is around 45K, which is not too bad.

There is some more scope for compression, when some useful facilities appear in the Java class libraries that ought to be there; e.g. usable symbol table and better Unicode support.

2 Performance
At the moment, Lark, running under the Win95 J++ “Jview” application viewer on an underconfigured P100 notebook, runs at about 200K/second. I am fairly happy with this performance, and doubt whether a full-featured processor implemented in Java can really be made to run much faster.

3 Completeness
Lark is a processor only; it does not attempt to validate. It does read the DTD, with parameter entity processing; it processes attribute list declarations (to find default values) and entity declarations. Lark is relatively full-featured; it implements (I think) everything in the XML spec and reports violations of well-formedness.

Lark’s error-handling is draconian. After encountering the first well-formedness error, no further internal data structures are built or returned to the application. However, Lark does continue processing the document looking for more syntax errors (and in fact performing some fairly aggressive heuristics on the tag stack in order to figure out what’s going on), and calling the doSyntaxError application callback (see below) to report further such errors.

Larval is a full validating XML processor; it reports violations of validity constraints, but does not apply draconian error handling to them.

4 API
Lark presents as a set of Java classes. Those named Element, Attribute, and Entity are obvious in their function. One lesson of this activity is that it may be possible for such classes to be shared independent of the parser architecture; it would be very handy if all XML-processing Java apps used the same Element class, at least as a basis for subclassing.

The Text and Segment classes do Lark’s character data management; details are below.

From an application’s point of view, the Lark and Handler classes are central. Handler has methods such as doPI, doSTag, doEntityReference, doEtag; the application passes a Handler instance to Lark’s readXML method, and Lark calls the routines as it recognizes significant objects in the XML document. The base class provides do-nothing methods; the intent is that an application would subclass Handler to provide the appropriate processing semantics.
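The callback pattern described above can be sketched in a few lines. This is only an illustration, not Lark's actual code: SketchHandler, TagRecorder, and ToyProcessor are hypothetical stand-ins, the parameter lists of doSTag and doEtag are assumptions, and the toy driver recognizes nothing beyond plain start- and end-tags.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the Handler base class: every callback does nothing.
// The method names come from the text; the parameter lists are assumptions.
class SketchHandler {
    void doSTag(String name) {}
    void doEtag(String name) {}
}

// An application subclasses the handler and overrides only the callbacks it needs.
class TagRecorder extends SketchHandler {
    final List<String> events = new ArrayList<>();

    @Override
    void doSTag(String name) { events.add("start:" + name); }

    @Override
    void doEtag(String name) { events.add("end:" + name); }
}

// Toy driver standing in for the processor. It only recognizes <name> and
// </name>, and calls the handler as each tag is seen.
class ToyProcessor {
    static void readXML(String doc, SketchHandler handler) {
        int i = 0;
        while (i < doc.length()) {
            int lt = doc.indexOf('<', i);
            if (lt < 0) break;
            int gt = doc.indexOf('>', lt);
            if (gt < 0) break;
            String tag = doc.substring(lt + 1, gt);
            if (tag.startsWith("/")) {
                handler.doEtag(tag.substring(1));
            } else {
                handler.doSTag(tag);
            }
            i = gt + 1;
        }
    }
}

class HandlerSketch {
    public static void main(String[] args) {
        TagRecorder recorder = new TagRecorder();
        ToyProcessor.readXML("<doc><p>hello</p></doc>", recorder);
        System.out.println(recorder.events);
        // prints [start:doc, start:p, end:p, end:doc]
    }
}
```

Because the base class supplies do-nothing methods, a subclass that cares only about start-tags can ignore every other callback.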

Along with presenting this event stream to the application, Lark can optionally build a parse tree, and if so doing, can optionally save copies of the XML document’s character data, all in parallel with providing the event stream. Lark provides methods to toggle these behaviors. These methods may be used in the Handler callbacks while Lark is running, to build the parse tree or save the text for only a subset of the document.

In building the parse tree, Lark is guaranteed to update only elements which are still open; i.e. for which the end-tag has not been seen. All other sections of the tree, and the entire tree once Lark has hit end-of-file, may be manipulated freely by the application.

An instance of Lark may be initialized with an optional list of element GI’s which are to be considered as those of empty elements, whether or not the XML “/>” syntax is used. A typical set might begin: “HR”, “BR”, “IMG”, ….
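To see why such a list matters, here is a minimal sketch of a tag stack that treats declared-empty names as elements needing no end-tag. The class and method names are hypothetical and say nothing about how Lark itself stores or consults the list.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

class EmptyElementSketch {
    // Returns true when every element that is not declared empty is closed
    // in the right order. Tags are given as "NAME" (start) or "/NAME" (end).
    static boolean balanced(String[] tags, Set<String> emptyNames) {
        Deque<String> open = new ArrayDeque<>();
        for (String tag : tags) {
            if (tag.startsWith("/")) {
                if (open.isEmpty() || !open.pop().equals(tag.substring(1))) {
                    return false;
                }
            } else if (!emptyNames.contains(tag)) {
                // Declared-empty elements are never pushed: no end-tag is expected.
                open.push(tag);
            }
        }
        return open.isEmpty();
    }

    public static void main(String[] args) {
        Set<String> htmlEmpties = new HashSet<>(Arrays.asList("HR", "BR", "IMG"));
        String[] tags = {"P", "BR", "IMG", "/P"};
        System.out.println(balanced(tags, htmlEmpties));                    // prints true
        System.out.println(balanced(tags, Collections.<String>emptySet())); // prints false
    }
}
```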

Another initializer allows a set of entities to be predefined.

Lark also provides the application access to the “entity tree”; there is a method that toggles whether Lark attempts to retrieve and parse external entities.

5 Error Handling
Lark makes a serious effort to be robust, by providing useful error messages and continuing to do so after the first error. The error handling is good enough that I now use Lark as my primary tool to debug broken XML files.

6 Text Segment Management
Probably due to my background in the indexing and search business, Lark pays more attention than is perhaps strictly necessary to the location of objects within the XML file. Lark informs the application of the containing entity and begin/end offsets of each element. A chunk of character data in an element, which may look contiguous to an application, may in fact map to several different byte ranges in different entities, due to the effect of entity references. Also, the use of encodings such as UTF-8 may cause the number of bytes of the underlying document to differ from the number of bytes that make up the characters in a chunk of text. Lark represents Text objects as a vector of Segment objects, each of which gives information about the source offset and length, and the number of characters in the segment. If text-saving has not been turned on, the segments contain no character data, but still contain the offset information.
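A rough sketch of the bookkeeping this implies follows. SketchSegment and SketchText are hypothetical names with assumed fields, not Lark's Text and Segment classes. The sketch assumes UTF-8 source text and counts characters as Java char values.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical segment record: where a run of text came from and how big it is.
// No character data is stored, as when text-saving is turned off.
class SketchSegment {
    final String entity;     // name of the containing entity (assumed field)
    final long sourceOffset; // byte offset of the run within that entity
    final int sourceBytes;   // length of the run in the source, in bytes
    final int charCount;     // number of characters in the run

    SketchSegment(String entity, long sourceOffset, int sourceBytes, int charCount) {
        this.entity = entity;
        this.sourceOffset = sourceOffset;
        this.sourceBytes = sourceBytes;
        this.charCount = charCount;
    }
}

// Text that looks contiguous to the application but maps to several source ranges.
class SketchText {
    final List<SketchSegment> segments = new ArrayList<>();

    // Assumes the source is UTF-8; charCount is in Java chars (UTF-16 code units).
    void append(String entity, long sourceOffset, String chars) {
        int sourceBytes = chars.getBytes(StandardCharsets.UTF_8).length;
        segments.add(new SketchSegment(entity, sourceOffset, sourceBytes, chars.length()));
    }

    int totalChars() {
        int n = 0;
        for (SketchSegment s : segments) n += s.charCount;
        return n;
    }

    int totalSourceBytes() {
        int n = 0;
        for (SketchSegment s : segments) n += s.sourceBytes;
        return n;
    }
}

class SegmentSketch {
    public static void main(String[] args) {
        SketchText text = new SketchText();
        text.append("document", 120, "caf\u00e9 "); // 5 characters, 6 bytes in UTF-8
        text.append("drink", 0, "au lait");         // text pulled in by an entity reference
        System.out.println(text.segments.size() + " segments, "
                + text.totalChars() + " chars, "
                + text.totalSourceBytes() + " source bytes");
        // prints 2 segments, 12 chars, 13 source bytes
    }
}
```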

7 Entity Handling
Lark does full XML entity handling. Java provides facilities which make this trivially easy. The application can turn the inclusion of external text entities on and off. Lark makes no use of PUBLIC identifiers, aside from passing them to the callback in the Handler upon recognizing the declaration.

8 Concurrency
Lark is thread-safe. Multiple Larks can run in parallel threads, with other threads doing useful processing of under-construction document trees.

source: http://www.textuality.com/Lark/



 