2008-03-20

Best general-purpose Java HTML parser

Real-life HTML, and even XHTML, is far from being well-formed and valid; it is usually quite dirty. You therefore cannot use the "javax.xml.parsers" package to parse real-life HTML, because you would get many exceptions. So I looked for a good general-purpose HTML parser that is still under active development, rather than one that was dumped into a source code repository and forgotten years ago. As a result of my investigation I found "NekoHTML" (org.cyberneko.*), an HTML parser written in Java that is well suited to extracting tag content from HTML/XHTML documents, e.g. the title of an HTML/XHTML document.
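
To illustrate why the strict XML parser is a poor fit, here is a minimal sketch that feeds two small snippets to "javax.xml.parsers". The class name StrictXmlDemo, the helper parsesAsXml and both sample strings are my own assumptions for illustration; only the standard JDK classes are real API. It does not use NekoHTML itself, it only shows the failure mode that motivates a tolerant parser.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: checks whether markup survives a strict XML parse.
class StrictXmlDemo {

    // Returns true if the JDK's XML parser accepts the markup, false if it rejects it.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Thrown for markup that is not well-formed XML.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException("Unexpected parser setup or I/O failure", e);
        }
    }

    public static void main(String[] args) {
        // Well-formed: every tag is closed.
        String clean = "<html><head><title>Test</title></head>"
                + "<body><p>ok</p></body></html>";
        // Typical real-life HTML: <br> and <p> are never closed.
        String dirty = "<html><head><title>Test</title></head>"
                + "<body><p>Unclosed paragraph<br></body></html>";

        System.out.println("clean parses: " + parsesAsXml(clean));
        System.out.println("dirty parses: " + parsesAsXml(dirty));
    }
}
```

A single unclosed "br" tag is enough to make the strict parser give up on the whole document, which is why a parser that repairs such markup while reading is needed for pages found in the wild.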