February 27, 2010

At PyCon, I saw a lightning talk about scrape, a lightweight Python library for parsing webpages and interacting with them programmatically. For example, finding page elements:

>>> from scrape import *
>>> s.go('')
<Region 0:17780>
>>> d = s.doc
>>> t = d.first('title')
>>> t
<Region 247:258 title>
>>> t.tagname
>>> t.text
u'Ka-Ping Yee'

The presentation I saw focused on the use case of testing your website. This is definitely a pain point for me personally: I currently either grep the HTML with regexes or parse the whole thing using ElementTree and query it with XPath. But there are still a couple of problems: 1. JS usually isn’t testable this way; 2. you often have to construct your HTML with an eye towards testability. For example, to test pagination, you might need to add a class or id marking the pagination section and identifying the links inside it as pagination links.
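To make the pagination case concrete, here is a minimal sketch of the ElementTree-plus-XPath approach. The page fragment, the `pagination` class name, and the `pagination_links` helper are all made up for illustration, and it assumes the markup is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment. The "pagination" class is there only so
# that a test has something stable to search for.
PAGE = """
<html>
  <body>
    <div class="pagination">
      <a href="/posts?page=1">1</a>
      <a href="/posts?page=2">2</a>
      <a href="/posts?page=3">3</a>
    </div>
  </body>
</html>
"""


def pagination_links(html):
    """Return the href of each link directly inside div.pagination."""
    root = ET.fromstring(html)
    # ElementTree supports only a limited XPath subset; an exact
    # attribute match like this one is part of it.
    anchors = root.findall(".//div[@class='pagination']/a")
    return [a.get("href") for a in anchors]


links = pagination_links(PAGE)
assert links == ["/posts?page=1", "/posts?page=2", "/posts?page=3"]
```

Without the class on the wrapping div, the test would have to guess which links are pagination links from their position or URL pattern, which is exactly the kind of brittleness that pushes you to shape the HTML around the tests.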
