Webscraping 101

For the last 6 months my job has pretty much consisted of nothing but web scraping all sorts of things (JavaScript, strange ASP.NET viewstates, you name it, I got it :D), and this is what I came up with:

  • If you can't avoid driving a real browser, Watir + Internet Explorer works... you might want to disable displaying pictures in IE. You don't need them for scraping and they tend to break stuff :-/
  • Mechanize is a nice tool for putting together HTTP GET and POST requests. The documentation is kinda horrible compared with other Ruby libraries though (e.g. some methods from the docs do not exist (any more? yet?) in the stable version of the gem)
  • A nice way to get data from AJAX-enabled websites is to use WebScarab and simply find out which GET requests the JavaScript on the pages makes. This way you'll end up with a minimum of parsing and a maximum of data; most of the stuff is REST'ish
  • Use Nokogiri (kinda built into Mechanize) or Hpricot to parse HTML. Use XPath!
  • Regular Expressions are Nerd Superpowers. Rubular is a wonderful site!
  • I tried using Beautiful Soup once and found it to be a HORRIBLY unpleasant experience compared to Hpricot (I am more of a Ruby guy than a Python dude though)
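
To make the AJAX tip a bit more concrete, here is a minimal sketch of replaying a GET request that you spotted in the proxy, using only Ruby's standard library. The endpoint URL, the "page" parameter and the "items" / "name" keys in the JSON are all made up for illustration; the real ones depend entirely on the site you are looking at.

```ruby
require "net/http"
require "json"
require "uri"

# Hypothetical endpoint: swap in the GET request you saw in the proxy.
ENDPOINT = "http://example.com/api/items"

# Build the request URL from a base endpoint and a hash of query parameters.
def build_url(endpoint, params)
  uri = URI(endpoint)
  uri.query = URI.encode_www_form(params)
  uri
end

# Pull the names out of the JSON body. The "items" and "name" keys are
# assumptions about the response shape, not something every site uses.
def extract_names(body)
  JSON.parse(body).fetch("items", []).map { |item| item["name"] }
end

# Fetch one page of results and return the extracted names.
def fetch_names(page)
  response = Net::HTTP.get_response(build_url(ENDPOINT, { "page" => page }))
  raise "HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)
  extract_names(response.body)
end
```

Calling fetch_names(1) would then give you the data directly, with no HTML parsing at all. Keeping the parsing in its own method means you can try it against a response body saved from the proxy before hitting the live site.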

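As a small taste of the regex superpowers, here is a sketch that pulls prices out of a chunk of markup with String#scan. The sample HTML and the price format are invented for the example; a pattern like this is exactly the kind of thing worth trying out on Rubular first.

```ruby
# Made-up sample markup, just for illustration.
SAMPLE_HTML = <<~PAGE
  <li class="product">Widget <span>$19.99</span></li>
  <li class="product">Gadget <span>$5.00</span></li>
PAGE

# A dollar sign, some digits, a dot and exactly two more digits.
PRICE = /\$(\d+)\.(\d{2})/

# Return every price found in the text, converted to cents.
def extract_prices(text)
  text.scan(PRICE).map { |dollars, cents| dollars.to_i * 100 + cents.to_i }
end

p extract_prices(SAMPLE_HTML)  # => [1999, 500]
```

This works well for small, well-defined tokens like prices or IDs; for walking the actual structure of a page, a parser plus XPath is still the better fit.
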
http://wtr.rubyforge.org/
http://www.rubular.com/
http://www.reddit.com/r/programming/comments/974iy/an_almost_perfect_realworld_hack/c0bnuxm