Sunday, March 14, 2010

Testing Java HTML parsers

A few weeks ago I had to write a data export that needed valid XHTML output, so I tested the speed of a bunch of Java HTML parser/cleaner libraries.
Jens suggested that I publish the results here, and I thought that was a really great idea.
First, I'd like to present each library and hopefully give some useful information about it. "Maven" indicates whether the library can be found in the Maven repositories.

Jericho HTML-Parser

License: Eclipse Public License/LGPL
Maven: Yes
Has many features, such as recognizing PHP tags, and is easy to use.

JTidy

License: MIT License
Maven: Yes, but only the "old" builds
JTidy is tiny and pretty fast, and can output well-formed XHTML.
Has bad internal exception handling (lots of empty catch blocks!).

HTMLCleaner

License: BSD License
Maven: No
DOM-based, supports XPath (really cool). Has a good set of configuration options.

NekoHTML

License: Apache Software License
Maven: Yes
Good, fast, seems to be famous

TagSoup

License: Apache 2.0
Maven: Yes
Parses HTML and provides a SAX handler. Entry class is "Parser" to which a custom SAX handler can be given.
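
To illustrate the custom SAX handler idea, here is a minimal sketch of a handler that collects the href of every <a> element. The class AnchorCollector and its methods are made up for this post, they are not part of TagSoup. To keep the sketch runnable without extra jars it is driven by the JDK's own XML reader, which only accepts well-formed input.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration, not part of any parser library.
class AnchorCollector extends DefaultHandler {
    private final List<String> hrefs = new ArrayList<>();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // localName is empty when the reader is not namespace-aware, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("a".equalsIgnoreCase(name)) {
            String href = atts.getValue("href");
            hrefs.add(href == null ? "" : href); // anchors without href are recorded as ""
        }
    }

    // Runs the handler with any SAX XMLReader and returns the collected hrefs.
    static List<String> collect(XMLReader reader, String markup) throws IOException, SAXException {
        AnchorCollector handler = new AnchorCollector();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.hrefs;
    }

    // Convenience wrapper using the JDK's XML reader (well-formed input only).
    static List<String> collectWithJdkReader(String markup) {
        try {
            XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            return collect(reader, markup);
        } catch (IOException | SAXException | ParserConfigurationException e) {
            // Report the failure with its cause instead of hiding it.
            throw new IllegalStateException("could not parse markup", e);
        }
    }
}
```

With TagSoup on the classpath, passing a new org.ccil.cowan.tagsoup.Parser() to collect(...) should work the same way, since that class implements XMLReader. I have not timed this exact handler.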

HTML Parser

License: GPL

HotSAX

License: LGPL
HotSAX looked pretty fast, but according to the homepage it is still in the pre-alpha stage, so it was not useful for my task.

Java Swing HTML parser

Comes with Sun Java.
XHTML is a stricter form of HTML 4.01, but this parser only supports HTML 3.2, so it was out of the question for my purposes. I just wanted to mention it here.
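
For completeness, this is roughly how collecting all <a> tags might look with the Swing parser. It is a minimal sketch using only classes that ship with the JDK. The class name SwingAnchorFinder is made up for this post, and this is not the code from my tests.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper for illustration.
class SwingAnchorFinder {
    // Returns the href of every <a> start tag, or "" where href is missing.
    static List<String> findHrefs(String html) {
        final List<String> hrefs = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    hrefs.add(href == null ? "" : href.toString());
                }
            }
        };
        try {
            // The last argument tells the parser to ignore charset declarations in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            // Rethrown unchecked so the failure stays visible to the caller.
            throw new UncheckedIOException(e);
        }
        return hrefs;
    }
}
```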

Cobra: Java HTML Renderer and Parser

License: LGPL 2.1
A major plus of this one is that it can parse JavaScript and CSS, too. The browser is a good start (my admiration for that project!), although it fails all Acid tests. Nevertheless, that says nothing about the parser's quality.
One con is that this library is really slow.

Mozilla Java HTML-Parser

License: Mozilla Public License 1.1 (MPL 1.1)
The setup is not really suitable for a multi-developer environment, so it dropped out of the test selection.

Test results

The task was to load a predefined, really erroneous HTML document and select all <a> tags.
I wrote a JUnit test for each parser/cleaner and took the measurement ten times, skipping the first run because of compilation time.
Rank  Name          Time/ms  Deviation/ms
1     HTMLCleaner   95       ±18
2     HotSAX        124      ±19
3     JTidy         158.3    ±17
4     Jericho HTML  150      ±59
5     NekoHtml      380      ±44
6     TagSoup       439      ±50.5
7     Cobra         675      ±100


JTidy is listed before Jericho HTML, despite its slightly higher time, because it had the better deviation. I first used HTMLCleaner because its lead in time was really big. The problem was that it couldn't handle some of the real input data. HotSAX was pre-alpha (although its results are very good), so JTidy was my next choice, as I needed reliability. I have not had a single problem with it; it works really well.
One last point: the results of the Cobra parser are very bad...
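
The measurement procedure (ten runs, first one discarded) can be sketched as follows. This is not the original test code. ParserTiming and Stats are names made up for this post, and I am assuming here that "deviation" means the sample standard deviation of the remaining runs.

```java
// Hypothetical timing helper for illustration.
class ParserTiming {
    // Mean and sample standard deviation, both in milliseconds.
    static final class Stats {
        final double meanMs;
        final double deviationMs;

        Stats(double meanMs, double deviationMs) {
            this.meanMs = meanMs;
            this.deviationMs = deviationMs;
        }
    }

    // Summarizes the timings, ignoring index 0 (the warm-up run).
    static Stats summarize(double[] timesMs) {
        int n = timesMs.length - 1;
        if (n < 2) {
            throw new IllegalArgumentException("need at least three measurements");
        }
        double sum = 0;
        for (int i = 1; i < timesMs.length; i++) {
            sum += timesMs[i];
        }
        double mean = sum / n;
        double squares = 0;
        for (int i = 1; i < timesMs.length; i++) {
            squares += (timesMs[i] - mean) * (timesMs[i] - mean);
        }
        // Assumption: sample standard deviation (divide by n - 1).
        return new Stats(mean, Math.sqrt(squares / (n - 1)));
    }

    // Runs the task the given number of times and summarizes the wall-clock timings.
    static Stats measure(Runnable task, int runs) {
        double[] times = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            times[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return summarize(times);
    }
}
```

A call like ParserTiming.measure(() -> parseWithSomeLibrary(), 10) would then give one row of the table, where parseWithSomeLibrary() stands for whatever parser call is under test.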

Jens is working on a website where I can provide the testing source code; I will put a link here when it is online.
If anyone is interested in more detailed statistics, just contact me and I'll post them here.

As someone has recently begun working on JTidy again, I'll try the SVN version soon and tell you the results in another post, promise! I hope they have improved the exception handling.
Have a look at this piece of code:
public Node parse(InputStream in, OutputStream out)
{
    Node document = null;

    try
    {
        document = parse(in, null, out);
    }
    catch (FileNotFoundException fnfe) {}
    catch (IOException e) {}

    return document;
}

That's gruesome, isn't it?
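
For contrast, here is a minimal sketch of a variant that reports the failure instead of swallowing it. This is not JTidy code: LoudParser and Worker are made-up stand-ins (Worker plays the role of the inner parse call), and wrapping in UncheckedIOException is just one option. Declaring "throws IOException" on the method would be another.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical stand-in for illustration, not part of JTidy.
class LoudParser {
    // Plays the role of the inner parse call that does the real work.
    interface Worker {
        String parse(InputStream in) throws IOException;
    }

    // Same shape as the method above, but a failure reaches the caller.
    static String parse(InputStream in, Worker worker) {
        try {
            return worker.parse(in);
        } catch (IOException e) {
            // Keep the original exception as the cause so the stack trace survives.
            throw new UncheckedIOException("parsing failed", e);
        }
    }
}
```

This way the caller never gets a silent null back and can tell a broken stream from an empty document.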

3 comments:

Anonymous said...

Wimp .... real men use java.RegEx for that :P !

OK, honestly, really well done ;)

Unknown said...

Hey! I am using the Jericho parser for parsing HTML and manipulating some data in it, but it is taking a long time to do the job. Can you suggest better code to improve its performance, or any other HTML parser for the job? I need more detailed information on this. Thanks

abrar said...

You should also add jsoup parser to your tests...I'm curious to know where it fits. Also does anyone know of a parser other than cobra that is able to parse html generated by JavaScript?