HTML Tidy bindings for Python (PyTidyLib)

A Python wrapper for HTML Tidy, which allows you to convert most invalid (X)HTML markup into valid markup. For example, it will correct unescaped ampersands, unclosed tags, missing elements, and missing attributes. HTML Tidy is highly configurable: it can output HTML or XHTML, and it can perform other functions such as converting named entities to numeric entities (named entities work only with an HTML or XHTML doctype; numeric entities also work in generic XML data).
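The difference between named and numeric entities can be seen without HTML Tidy itself. The following is a minimal sketch using only Python's standard-library XML parser; the helper name parses_as_xml is hypothetical and is not part of PyTidyLib. It shows that a generic XML parser, which has no HTML doctype to consult, rejects the named entity but accepts the equivalent numeric one.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(markup):
    """Return True if markup parses as generic XML (no DTD available)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# The same text, once with a named entity and once with a numeric one.
named = "<p>f&otilde;o</p>"
numeric = "<p>f&#245;o</p>"

print(parses_as_xml(named))    # False: &otilde; is undefined without a doctype
print(parses_as_xml(numeric))  # True: numeric references need no doctype
```

This is why the numeric-entities option is useful when tidied output will later be embedded in, or parsed as, plain XML.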

The importance of web standards and validating HTML has been covered in books such as Zeldman's Designing with Web Standards, Cederholm's Web Standards Solutions, Murphy & Persson's HTML and CSS Web Standards Solutions: A Web Standardistas' Approach (all recent editions), and others. HTML Tidy is not a replacement for web-standards knowledge. However, you can be sure that what you run through it will very probably validate, and it's great for cleaning up input from third-party sources as well as in some cases your own code.

PyTidyLib is released under the MIT license.

Recent changes

0.2.1: Supports custom unicode subclasses, not just unicode objects. Better unit-test coverage. Thanks: Greg Phillips.

0.2.0: Now supports 32- and 64-bit Windows, with the proper DLL. (Also supported are Linux & BSD platforms including OS X, and 64-bit versions of same.) Major documentation update and minor cleanups. Thanks: Kevin A.

Installation using pip or setuptools

You will need to download HTML Tidy source or binaries for your platform either from the HTML Tidy web site or, for 32- or 64-bit Windows, from int64.org. Then:

pip install pytidylib

Or:

easy_install pytidylib

Usage example

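from tidylib import tidy_document

# Tidy markup containing a named entity and an unclosed <p> tag;
# 'numeric-entities' asks HTML Tidy to emit numeric entities instead.
document, errors = tidy_document('''<p>f&otilde;o <img src="bar.jpg">''',
    options={'numeric-entities':1})
print(document)  # the corrected markup
print(errors)    # warnings and errors reported by HTML Tidy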
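Since HTML Tidy can output either HTML or XHTML, options can be collected in one place and reused. The following is a rough sketch, not a documented PyTidyLib pattern: the helper names xhtml_options and tidy_to_xhtml are hypothetical, and the option names 'output-xhtml' and 'numeric-entities' are assumed to be accepted by your installed HTML Tidy version.

```python
def xhtml_options(numeric_entities=True):
    """Build an options dict asking HTML Tidy for XHTML output.

    Hypothetical helper; the keys are HTML Tidy configuration option names.
    """
    return {
        "output-xhtml": 1,
        "numeric-entities": 1 if numeric_entities else 0,
    }


def tidy_to_xhtml(markup):
    """Tidy markup into XHTML and return (document, errors).

    The import is deferred so this module loads even where PyTidyLib
    or the underlying HTML Tidy library is not installed.
    """
    from tidylib import tidy_document
    return tidy_document(markup, options=xhtml_options())


print(xhtml_options())
```

With PyTidyLib and HTML Tidy installed, calling tidy_to_xhtml('<p>f&otilde;o') should return the tidied document and the error report, just as tidy_document does in the example above.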

Feedback

Please direct all suggestions or responses to Jason Stitt at js@jasonstitt.com.
