Trying out scrapi

I took the liberty of poking at the equestrian club's home page.

require 'rubygems'
require 'pp'
require 'scrapi'
require 'open-uri'
#$KCODE = 'u'   # commented out because scrape emits a warning
require 'kconv'

# horse pages
horses = Scraper.define do
    array :links
    process "td.hpb-lb-tb1-cell3 a", :links => "@href"
    result :links
end
horses_html = "http://circle.cc.hokudai.ac.jp/horse/homepage/horses.html"
html = URI.parse(horses_html)
horse_pages = horses.scrape(html, :parser_option => {:char_encoding => 'sjis'})
# turn each relative link into an absolute URL (dot escaped so it only matches a literal ".")
horse_pages.collect!{|l| horses_html.sub(/\/horses\.html$/, "/#{l}")}
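
As an aside, the relative links could probably also be resolved with URI.join from the standard library instead of a regex substitution. This is only a minimal sketch: the base URL is the one above, but the hrefs in sample_links are made-up stand-ins, not real links from the page.

```ruby
require 'uri'

# Base page taken from the post; the relative hrefs are invented samples.
base = "http://circle.cc.hokudai.ac.jp/horse/homepage/horses.html"
sample_links = ["horse01.html", "../images/photo.jpg"]

# URI.join resolves each href against the base URL using the usual
# relative-reference rules, so "../" segments are handled as well.
resolved = sample_links.map { |href| URI.join(base, href).to_s }
puts resolved
```

The advantage over the substitution is that hrefs pointing into other directories still come out right.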

Now that the links have been extracted from the list page, the next step is to pull information out of the pages they point to.

# horse page scrape
scraper = Scraper.define do
    array :images
    process "td.hpb-lb-tb1-cell4 img", :name => "@alt"
    process "div>table td:nth-child(2)>table table>tbody>tr:nth-child(2)>td>table img", :images => "@src"
    result :name, :images
end
scraper.parser_options :show_warnings => true, :char_encoding => 'shiftjis'
html = URI.parse horse_pages[0]
horse = scraper.scrape(html)

info = {}
info[:name] = horse.name.toutf8
info[:images] = horse.images.collect{|s| horses_html.sub(/\/horses\.html$/, "/#{s}")}
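
For what it's worth, on Ruby 1.9 or later the toutf8 conversion could probably be replaced by String#encode, which is a different technique from kconv. A minimal sketch, assuming the scraped string is correctly tagged as Shift_JIS; the sample name here is built from a literal rather than scraped.

```ruby
# Stand-in for a Shift_JIS string as it might come back from the scraper
# (built here from a UTF-8 literal; the real value would come from the page).
sjis_name = "馬術部".encode("Shift_JIS")

# String#encode transcodes from the string's current encoding to UTF-8.
utf8_name = sjis_name.encode("UTF-8")

puts utf8_name.encoding
puts utf8_name
```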

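scrapi lets you specify elements with CSS selectors, so fine-grained targeting is possible. Even in a table-based layout, if the target elements have a class or id, they can probably be selected easily.
Even so, scraping unstructured HTML is often hard. In this equestrian club's HTML, the information I would want as a table structure is written flat, separated by spaces and line breaks (BR), so the retrieved text has to be parsed afterwards.
Moreover, the flat text also contains a table (that is, <div>1:aaa<br>2:ccc<table>...</table></div>), so something like :hoge => :text picks up the table's contents as well.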

Incidentally, in that case something like the following gets the job done, more or less.

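scraper = Scraper.define do
    # text of the whole div, including the nested table
    process "div", :all_text => :text
    # text of the nested table only
    process "div>table", :table_text => :text
    result :all_text, :table_text
end
s = scraper.scrape(html)
# remove the table's text from the div's text; a String pattern is matched literally
div_text = s.all_text.sub(s.table_text, '')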
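
The subtraction, and the text parsing that has to follow it, can be tried on plain strings without scrapi. This is only a sketch: the two sample strings are invented, and the "key:value per line" layout is my assumption based on the 1:aaa / 2:ccc example above, not the real page format.

```ruby
# Invented stand-ins for what the scraper would return.
all_text   = "1:aaa\n2:ccc\nrow1 (x) row2"
table_text = "row1 (x) row2"

# A String pattern is matched literally, so characters such as "("
# in the table text do not need to be escaped.
div_text = all_text.sub(table_text, '').strip

# Split the remaining flat text into key/value pairs on the first colon.
fields = div_text.each_line.map { |line| line.strip.split(':', 2) }.to_h
p fields
```

Passing the table text as a String rather than through Regexp.new avoids surprises when it contains regex metacharacters.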

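On top of that, it turned out that the HTML structure differs from one horse's detail page to the next.
In that case the extraction has to be written separately for each page. Depending on how the selectors are written it may be possible to handle this flexibly, but it looks difficult for this HTML. HTML that was hand-crafted with time and care takes time and care to scrape, too.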
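
One way to keep the per-page differences manageable might be a small table that maps each page to its own selector. This is only a rough sketch: the file names and all selectors except the default are hypothetical, and I have not tried it against the real pages.

```ruby
# Default selector, borrowed from the scraper above.
DEFAULT_SELECTOR = "td.hpb-lb-tb1-cell4 img"

# Hypothetical per-page overrides; file names and selectors are invented.
PAGE_SELECTORS = {
  "horse01.html" => "div>table img",
  "horse02.html" => "td.photo img"
}

# Pick the selector for a page URL, falling back to the default
# when the page has no entry of its own.
def selector_for(url)
  PAGE_SELECTORS.fetch(File.basename(url), DEFAULT_SELECTOR)
end

puts selector_for("http://example.com/horse01.html")
puts selector_for("http://example.com/other.html")
```

The chosen selector would then be passed to process when defining the scraper for that page.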