Best general-purpose Java HTML parser

March 20, 2008

Real-world HTML, and even XHTML, is rarely well-formed or valid; most of it is quite messy. You therefore cannot use the "javax.xml.parsers" package to parse it, because the strict XML parser throws exceptions on malformed markup. So I looked for a good general-purpose HTML parser that is still under active development, rather than one that was dumped into a source code repository and forgotten years ago. The result of my search is "NekoHTML" (org.cyberneko.*), an HTML parser written in Java that is well suited to extracting tag content from HTML/XHTML documents, e.g. the title of an HTML/XHTML document.
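
To illustrate the problem with the strict XML parser, here is a minimal sketch that uses only the JDK. The class name StrictParserDemo, the helper parsesAsXml and the two sample strings are made up for this illustration; the only difference between the samples is an unclosed <br> tag, which is common in real pages.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of any library.
class StrictParserDemo {

    /** Returns true if the JDK XML parser accepts the markup, false if it rejects it. */
    static boolean parsesAsXml(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors but does not print them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        try {
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXParseException e) {
            // The markup is not well-formed XML.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Well-formed sample: every tag is closed.
        String clean = "<html><head><title>Hello</title></head><body><p>Hi</p></body></html>";
        // Typical real-world sample: <br> is never closed.
        String dirty = "<html><head><title>Hello</title></head><body><p>Hi<br></body></html>";

        System.out.println("clean parses: " + parsesAsXml(clean));
        System.out.println("dirty parses: " + parsesAsXml(dirty));
    }
}
```

A lenient parser such as NekoHTML is meant to accept the second kind of input and still produce a usable document tree.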

Comments

Anonymous, 24.4.10
Check this URL for HTML parser http://regxjava.blogspot.com/2010/04/partial-html-parser.html