Aranea::Common

NAME

Aranea::Common - library of common procedures


SYNOPSIS


  my $content = safe_get("http://www.ai.mit.edu/index.html");
  my $safe = demoronize($content);
  my $tagged = pos_tag('This is a test');


DESCRIPTION

This module contains commonly used procedures shared by all Aranea components:

safe_get($url)
This procedure serves as a replacement for get in LWP::Simple, and alleviates the problem of bombarding external sites with non-stop HTTP requests (which often gets us banned from those sites).

The procedure enforces a mandatory delay between retrievals of live Web pages (cached pages can be retrieved instantly). The default delay is 30 seconds, but it can be adjusted using safe_delay.

When fetching live pages, the procedure randomly selects a UserAgent string for its GET request, emulating a variety of browsers: Netscape, IE, and Opera (in different versions). Furthermore, Web pages are cached in /tmp/web_cache. The cache can be cleared by simply deleting the files there.
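
The behavior described above (cache first, then a delayed live fetch with a random UserAgent) might look roughly like the following. This is only a sketch, not the module's actual code: the names throttled_get, $CACHE_DIR, $MIN_DELAY and @USER_AGENTS are hypothetical, the UserAgent strings are examples, and the network call is passed in as a code reference so the sketch runs without network access.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Path qw(make_path);
use File::Spec;

# Hypothetical settings; the real module's internals are not shown here.
our $CACHE_DIR   = File::Spec->catdir(File::Spec->tmpdir, 'web_cache_sketch');
our $MIN_DELAY   = 30;    # seconds between live fetches
our @USER_AGENTS = (
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/4.78 [en] (X11; U; Linux i686)',
    'Opera/7.11 (Windows NT 5.1; U) [en]',
);
my $last_fetch = 0;       # time of the most recent live fetch

# $fetch is a code ref taking ($url, $agent) and returning the page or undef.
sub throttled_get {
    my ($url, $fetch) = @_;
    make_path($CACHE_DIR);
    my $file = File::Spec->catfile($CACHE_DIR, md5_hex($url));

    # Cached pages are returned immediately, with no delay.
    if (-e $file) {
        open my $in, '<', $file or die "cannot read $file: $!";
        local $/;                     # slurp the whole file
        my $data = <$in>;
        close $in;
        return $data;
    }

    # Live fetch: wait until $MIN_DELAY seconds have passed since the last one.
    my $wait = $last_fetch + $MIN_DELAY - time;
    sleep $wait if $wait > 0;

    my $agent   = $USER_AGENTS[ int rand @USER_AGENTS ];
    my $content = $fetch->($url, $agent);
    $last_fetch = time;
    return unless defined $content;

    open my $out, '>', $file or die "cannot write $file: $!";
    print {$out} $content;
    close $out;
    return $content;
}

# A real fetcher could wrap LWP::UserAgent, for example:
#   sub { my ($url, $agent) = @_;
#         my $res = LWP::UserAgent->new(agent => $agent)->get($url);
#         return $res->is_success ? $res->decoded_content : undef; }
```

Keying the cache file on a digest of the URL is one simple choice; it avoids having to escape slashes and query strings in file names.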

demoronize($str)
Fixes moronic Microsoft-generated encodings. Microsoft-generated Web pages use an extended version of ASCII that includes things like special open/close quotes, e with an accent mark, and other unusual characters. True to Microsoft, this encoding is not compatible with Unicode/UTF-8, so it breaks the XML parser. This procedure maps the moronic Microsoft characters into normal ASCII characters. The original code for the demoroniser can be found at http://www.fourmilab.ch/webtools/demoroniser/.
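
As an illustration of the kind of mapping involved, here is a minimal sketch. It assumes the input is a byte string in the Windows-1252 code page; the table is deliberately partial, and both the table and the name demoronize_sketch are hypothetical rather than taken from the module or from the original demoroniser.

```perl
use strict;
use warnings;

# Partial, illustrative table: Windows-1252 bytes and ASCII stand-ins.
my %CP1252_TO_ASCII = (
    "\x85" => '...',    # horizontal ellipsis
    "\x91" => "'",      # left single quote
    "\x92" => "'",      # right single quote
    "\x93" => '"',      # left double quote
    "\x94" => '"',      # right double quote
    "\x96" => '-',      # en dash
    "\x97" => '--',     # em dash
    "\xE9" => 'e',      # e with acute accent
);

sub demoronize_sketch {
    my ($str) = @_;
    # Replace each high byte that has a table entry; leave the rest untouched.
    $str =~ s/([\x80-\xFF])/exists $CP1252_TO_ASCII{$1} ? $CP1252_TO_ASCII{$1} : $1/ge;
    return $str;
}

print demoronize_sketch("\x93caf\xE9\x94 \x96 it\x92s\x85"), "\n";
# prints: "cafe" - it's...
```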

pos_tag($str)
Part-of-speech tags the argument (using LPost). To amortize startup costs, all tagging should be performed through this procedure (instead of using LPost directly).
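
The amortization idea can be sketched as follows: the expensive setup runs once, on the first call, and later calls reuse the result. LPost's interface is not described here, so load_tagger_stub is a hypothetical stand-in that simply labels every word NN; only the load-once pattern is the point.

```perl
use strict;
use warnings;

my $tagger;           # created on first use, then reused
our $startups = 0;    # counts how often the expensive setup runs

# Stand-in for loading a real tagger; it tags every word as "NN".
sub load_tagger_stub {
    $startups++;
    return sub { join ' ', map { "$_/NN" } split ' ', $_[0] };
}

sub pos_tag_sketch {
    my ($str) = @_;
    $tagger ||= load_tagger_stub();    # pay the startup cost only once
    return $tagger->($str);
}

print pos_tag_sketch('This is a test'), "\n";
print pos_tag_sketch('Another sentence'), "\n";
print "startups: $startups\n";
```

Calling the underlying tagger directly from several places would repeat the setup each time, which is the cost the shared procedure is meant to avoid.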