Aranea::Common - library of common procedures
This module contains commonly used procedures shared by all Aranea components:
safe_get($url)
Fetches a page much like get in LWP::Simple, and alleviates the problem
of bombarding external sites with non-stop HTTP requests (which often
gets us banned from those sites).
The procedure enforces a mandatory delay between retrieval of live Web
pages (cached pages can be retrieved instantly). The default is 30
seconds, but it can be adjusted using safe_delay.
When fetching live pages, the procedure randomly selects a UserAgent
string for its GET request, emulating a variety of browsers: Netscape,
IE, Opera (and different versions thereof). Furthermore, Web pages are
cached in /tmp/web_cache. The cache can be cleared by simply deleting
the files there.
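The behavior described above (answer from the cache first, otherwise wait out the delay and fetch with a randomly chosen UserAgent string) can be sketched roughly as follows. This is not the Aranea::Common source: make_safe_get, its fetch, cache_dir and delay options, the MD5-based cache file names, and the UserAgent strings are all assumptions made for illustration. The actual HTTP request is passed in as a code reference so that the sketch runs without network access.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Path qw(make_path);
use File::Spec;

# Illustrative UserAgent strings only; the module's real list is not shown here.
my @AGENTS = (
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Mozilla/4.76 [en] (X11; U; Linux 2.4.2 i686)',
    'Opera/7.54 (Windows NT 5.1; U) [en]',
);

# Hypothetical constructor: returns a throttled, caching fetch closure.
sub make_safe_get {
    my (%opt) = @_;
    my $fetch = $opt{fetch};        # coderef: ($url, $agent) -> content or undef
    my $dir   = $opt{cache_dir};    # e.g. /tmp/web_cache
    my $delay = defined $opt{delay} ? $opt{delay} : 30;
    my $last  = 0;                  # time of the most recent live fetch

    make_path($dir) unless -d $dir;

    return sub {
        my ($url) = @_;
        my $file = File::Spec->catfile($dir, md5_hex($url));

        # Cached pages are returned immediately, with no delay.
        if (-e $file) {
            open my $in, '<:raw', $file or die "cannot read $file: $!";
            local $/;               # slurp the whole file
            my $content = <$in>;
            close $in;
            return $content;
        }

        # Live fetch: wait until $delay seconds have passed since the last one.
        my $wait = $last + $delay - time;
        sleep $wait if $wait > 0;

        my $agent   = $AGENTS[ rand @AGENTS ];
        my $content = $fetch->($url, $agent);
        $last = time;
        return unless defined $content;

        # Store the page so that later requests skip the network.
        open my $out, '>:raw', $file or die "cannot write $file: $!";
        print {$out} $content;
        close $out;
        return $content;
    };
}
```

In real use the fetch code reference would wrap an HTTP client such as LWP::UserAgent. Because the cache is just a directory of files, deleting those files forces the next request for each page to go live again.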
demoronize($str)
pos_tag($str)
Part-of-speech tags $str (via LPost). To amortize
startup costs, all tagging should be performed through this procedure
(instead of using LPost directly).
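One common way to amortize a startup cost like this is to build the expensive object on first use and reuse it afterwards. The sketch below shows that pattern only. It does not use or assume any LPost API: make_pos_tag and the $build_tagger code reference are hypothetical names, and the toy tagger in the usage lines is a stand-in.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps an expensive-to-start tagger behind one closure.
sub make_pos_tag {
    my ($build_tagger) = @_;    # coderef that performs the costly startup
    my $tagger;                 # created on first call, then reused

    return sub {
        my ($str) = @_;
        $tagger ||= $build_tagger->();   # startup cost is paid only once
        return $tagger->($str);
    };
}

# Usage with a toy stand-in tagger that labels every token "X".
my $pos_tag = make_pos_tag(sub {
    return sub { join ' ', map { "$_/X" } split ' ', $_[0] };
});
print $pos_tag->('time flies'), "\n";
```

Callers that go around such a wrapper and start their own tagger pay the startup cost again each time, which is why routing all tagging through a single procedure helps.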