Daily Archives: 2008-10-20

Using Ruby for command line web lookups

Common Problem

You frequently have to look up customer information on the company website.  Firing up a web browser takes time and invites you to start dawdling away on facebook and such.  If only anyone in the company had bothered to write a decent webservice or command line utility to look up customer information.

Find the right url

The first step is to dig into the company website and find out what happens when you click search.  You are looking for an element in there of type “form”.  A form is what gets submitted when you click search.  It will submit information to a page, and that page is the “action” attribute of the form element.  Then you need to find the inputs to that form.  Look for elements of type “input”.  These guys are the information you are sending to the action page.  Once you have the “action” and the “input” names, you can come up with a URL that represents this lookup.  It’s dead easy and it always follows the same pattern.

If you have a form like this:


<form action="/admin/clientsearch.asp" method="post">
<table border="0" align="center">
<tbody>
<tr>
<th class="QueryHeader">Search Options</th>
</tr>
<tr>
<td align="center">
<table border="0">
<tbody>
<tr>
<td><strong>Search For:</strong>

<input name="SEARCHPARAM" size="15" type="text" /></td>
</tr>
</tbody></table>
</td>
</tr>
<tr>
<td></td>
</tr>
<tr>
<td align="center"><input type="submit" value="Search" /><input type="reset" value="Reset" /></td>
</tr>
</tbody></table>
</form>

You can see the “action” is”/admin/clientsearch.asp” and that the input is named “SEARCHPARAM”.  From this we know that the URL is going to be “http://www.example.com/admin/clientsearch.asp?SEARCHPARAM=”.  That’s how simple it is.

Automate the lookup using Ruby

If this is a task you have to do often, try using Ruby to automate it.  Ruby has a utility for doing repetitive tasks called Rake or Sake and a utility for parsing web pages called Hpricot.  Install them like so:


gem install rake hpricot sake

Now we write up a file called “Rakefile.rb” and put in a rake task

desc "sets up the following tasks"
task :setup do
require 'open-uri'
require 'hpricot'
end

desc "lookup a client"
task :clients, :client_name do |t, args|
doc = Hpricot(open("http://www.example.com/admin/clientsearch.asp?SEARCHPARAM=#{args.client_name}", :http_basic_authentication => ['username', 'password']))
puts doc.search("//center[2]/table")[0].to_plain_text
end
task :clients => :setup #put in here bc named args seem to conflict with dependencies.

Most of what’s going on there is happening on line 9. We are opening a url, then passing it to our parser. If you don’t have a username and password for this website, you can remove the whole “http_basic_authentication” argument to open.

Where is my data

In line 10 you’ll see a little handy XPath going on to narrow down the document to what we care about. If you aren’t so hot with XPath, there is an easy way to find it out. In Firefox, install an extension called Firebug. Do a search on your webpage, then activate firebug by clicking the bug icon in your statusbar. Click inspect in Firebug and then click where your data is. Firebug will display a bunch of elements on the top. Move your mouse along them and you’ll find one element that highlights your data in blue. Right click on this and “Copy XPath”. That’s what you will put in “doc.search()”.

Using it

From the commandline type rake clients[myclient] and ruby will do the lookup and return the information you care about.  That will only work if you are in the same directory as your rakefile.rb.  We can install these tasks into sake by typing sake -i rakefile.rb. This makes these tasks system wide, so you can call sake clients[myclient].

A couple of caveats

  1. You may have to do a little tweaking to get open-uri to play nice with expired https certificates. Shouldn’t be a problem for most folks.
  2. The world of screen-scraping as it is called, doesn’t end there. If you need more advanced techniques for screen scraping a page, behold the power of the internet.