Best general-purpose Java HTML parser
Real-life HTML, and even XHTML, is rarely well-formed or valid; it is usually quite dirty. You therefore cannot use the "javax.xml.parsers" package to parse real-life HTML, because it throws exceptions on malformed markup. So I looked for a good general-purpose HTML parser that is still under active development, rather than one that was dumped into a source code repository and forgotten years ago. My investigation led me to "NekoHTML" (org.cyberneko.*), an HTML parser written in Java that is well suited to extracting tag content from HTML/XHTML documents, for example the title of an HTML/XHTML document.
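To make the problem with "javax.xml.parsers" concrete, here is a minimal sketch that uses only the JDK. The class name StrictParserDemo, the helper parsesAsXml, and the two sample strings are my own illustrations, not part of any library. The sketch feeds a well-formed snippet and a typical dirty snippet (an unclosed <br>) to the strict XML parser and reports whether each one is accepted.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: shows how the strict JDK XML parser reacts to dirty HTML.
class StrictParserDemo {

    // Returns true if the markup parses as well-formed XML, false if the parser rejects it.
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors but does not print them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Malformed markup ends up here (typically as a SAXParseException).
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not expected for an in-memory string; treated as a setup failure.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Sample inputs are assumptions chosen for illustration.
        String clean = "<html><head><title>Test</title></head><body><p>one<br/>two</p></body></html>";
        String dirty = "<html><head><title>Test</title></head><body><p>one<br>two</body></html>";

        System.out.println("clean accepted: " + parsesAsXml(clean));
        System.out.println("dirty accepted: " + parsesAsXml(dirty));
    }
}
```

The dirty snippet is rejected because <br> and <p> are never closed, which is perfectly ordinary in real-life HTML. A tolerant parser such as NekoHTML is designed to repair this kind of markup instead of giving up on it.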
See this URL for an HTML parser: http://regxjava.blogspot.com/2010/04/partial-html-parser.html