Best general-purpose Java HTML parser

Real-life HTML, and even XHTML, is rarely well-formed or valid; it is usually quite dirty. You therefore cannot use the "javax.xml.parsers" package to parse real-life HTML, because it throws exceptions on malformed markup. So I looked for a good general-purpose HTML parser that is still under active development, rather than one that was dumped into a source code repository and forgotten years ago. My investigation turned up "NekoHTML" (org.cyberneko.*), an HTML parser written in Java that is well suited to extracting tag content from HTML/XHTML documents, e.g. the title of an HTML/XHTML document.
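To illustrate why the strict XML parser is a poor fit, here is a minimal sketch that uses only the JDK. The class name StrictXmlDemo, the method isWellFormedXml and both sample strings are made up for this example; it simply feeds a tidy snippet and a typical "dirty" snippet (an unclosed `<br>` and unclosed `<p>` tags) to "javax.xml.parsers" and reports whether parsing succeeded.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of any library.
public class StrictXmlDemo {

    // Returns true if the markup parses as well-formed XML, false if the
    // JDK parser rejects it with a SAXException.
    public static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Malformed markup ends up here (usually a SAXParseException).
            return false;
        } catch (IOException | ParserConfigurationException e) {
            // Not expected for an in-memory string with default settings.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample inputs invented for this sketch; no DOCTYPE is used so the
        // parser does not try to fetch a DTD.
        String clean = "<html><head><title>Test</title></head>"
                + "<body><p>First<br/></p></body></html>";
        String dirty = "<html><head><title>Test</title></head>"
                + "<body><p>First<br><p>Second</body></html>";

        System.out.println("clean: " + isWellFormedXml(clean));
        System.out.println("dirty: " + isWellFormedXml(dirty));
    }
}
```

The second input is rejected because the `<br>` is never closed, which is exactly the kind of markup browsers accept without complaint. The JDK's default error handler may also print a "[Fatal Error]" line to stderr for it. A lenient parser such as NekoHTML is meant to repair this kind of input instead of failing.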

Comments

  1. Anonymous, 24.4.10

    Check this URL for an HTML parser: http://regxjava.blogspot.com/2010/04/partial-html-parser.html
