jsoup is a Java library for working with real-world HTML. It can parse HTML from a URL, file, or string, and find and extract data using DOM traversal or CSS selectors. HTML elements, attributes, and text can be manipulated, and user-submitted content can be cleaned against a safe whitelist. jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating to invalid tag soup, and will create a sensible parse tree in each case.
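As a rough illustration of the workflow described above (parse, select, manipulate, clean), here is a minimal sketch. It assumes a recent jsoup release on the classpath, where the cleaning policy class is named `Safelist`; older releases expose the same idea as `Whitelist`. The class name `JsoupSketch`, its helper methods, and the sample HTML are hypothetical and exist only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

// Hypothetical helper class for illustration; not part of jsoup itself.
class JsoupSketch {

    // Parse a (possibly malformed) HTML string and return the text of the
    // first link that carries an href attribute, or "" if there is none.
    static String firstLinkText(String html) {
        Document doc = Jsoup.parse(html);
        Element link = doc.select("a[href]").first();
        return link == null ? "" : link.text();
    }

    // Parse the HTML, add rel="nofollow" to every link, and return the
    // serialized body markup.
    static String addNoFollow(String html) {
        Document doc = Jsoup.parse(html);
        for (Element link : doc.select("a[href]")) {
            link.attr("rel", "nofollow");
        }
        return doc.body().html();
    }

    // Clean untrusted markup against jsoup's basic safelist
    // (assumes a jsoup version that provides Safelist).
    static String clean(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        // Tag soup: neither the <a> nor the <p> element is closed.
        String soup = "<p>Unclosed <a href='/docs'>docs link";
        System.out.println(firstLinkText(soup));
        System.out.println(addNoFollow(soup));
        System.out.println(clean("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

Parsing from a file or a URL follows the same pattern through other `Jsoup` entry points; the string form is used here so the sketch runs without network or file access.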
| Tags | Java, HTML Parser, HTML cleaner, Extract, whitelist, Cross Platform |
|---|---|
| Licenses | MIT/X |
| Operating Systems | Java, Cross Platform |
| Implementation | Java, Java 5, HTML |
| Translations | English |
Recent releases


Release Notes: This release introduces selectors for CSS structural pseudo-classes, full support for international supplementary characters, and a raft of improvements and bugfixes.


Release Notes: This release parses HTML 2.3x faster. The author has profiled the parse execution of thousands of documents, optimized every hotspot to streamline the parser, and significantly reduced node memory consumption. This release also trims the retained heap memory when retrieving data from parsed documents, reduces garbage collection when selecting elements, and removes lock contention to allow jsoup to run concurrently on as many threads as are available.


Release Notes: This release adds a number of improvements and bugfixes, including renewed support for the Google App Engine and parsing fixes.


Release Notes: This release adds many improvements, including a relaxed XML parser, a lighter memory footprint, and a range of bugfixes.


Release Notes: This release includes a new HTML5-compliant parser and fixes for Java 1.5 and Android 2.2 compatibility.