domstripper - A lxml.html test project
http://pypi.python.org/pypi/domstripper/19th of November 2008
I'm just playing with the impressive lxml.html package. It makes it possible to easily work with HTML trees and manipulate them.
I had this crazy idea of a "DOM stripper" that removes all but specified elements from an HTML file. For example you want to keep the contents of the <head> tag intact but you just want to keep the <div id="content">...</div> tag thus omitting <div id="banner">...</div> and <div id="nav">...</div>. domstripper now does that. This can be used for example as a naive proxy that tranforms a bloated HTML page into a more stripped down smaller version suitable for say mobile web browsers. It's more a proof of concept that anything else.
To test you just need a virtual python environment and the right system libs to needed to install lxml. This worked for me:
$ cd /tmp
$ virtualenv --no-site-packages testenv
$ cd testenv
$ source bin/activate
$ easy_install domstripper
Now you can use it like this:
>>> help(domstripper)
...
>>> domstripper('bloat.html', ['#content', 'h1.header'])
<!DOCTYPE...
...
Best to just play with it and see if makes sense. I'm not saying this is an amazing package but it goes to show what can be done with lxml.html and the extremely user friendly CSS selectors.
Tweet