Important note about Web caching!

Aranea is a Web-based question answering system and, as such, it depends on existing Web search engines (Google and Teoma) for text snippets. This is potentially a problem because most search engines don't appreciate parasitic programs that access their services programmatically...

The solution is to cache Web pages and only download fresh pages as needed. This has the additional advantage of making experiments reproducible. Aranea caches Web results in /tmp/web-cache; fetches are logged to /tmp/web_proxy.log.
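
To make the scheme concrete, here is a minimal sketch of how such a cache might work. This is an illustration only, not Aranea's actual implementation: it keys each page by an MD5 digest of its URI, stores pages as flat files in the cache directory, and uses LWP::Simple for the raw fetch.


use Digest::MD5 qw(md5_hex);
use LWP::Simple qw(get);

my $CACHE_DIR = "/tmp/web-cache";

sub cached_get {
    my ($uri) = @_;
    my $file = "$CACHE_DIR/" . md5_hex($uri);
    if (-e $file) {                          # cache hit: return the stored page
        open(my $fh, '<', $file) or die "can't read $file: $!";
        local $/;                            # slurp the whole file at once
        my $content = <$fh>;
        close($fh);
        return $content;
    }
    my $content = get($uri);                 # cache miss: fetch from the Web
    if (defined $content) {                  # save it for next time
        open(my $fh, '>', $file) or die "can't write $file: $!";
        print $fh $content;
        close($fh);
    }
    return $content;
}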

The preferred way to fetch Web pages is:


use Aranea::Common qw(:default);
...

my $content = safe_get($uri);    # cached copy if available, fresh fetch otherwise

safe_get returns the cached page if one is available; otherwise, it retrieves a fresh copy from the Web. By default, the system waits 30 seconds between page fetches. The delay can be adjusted with the following call:


Aranea::Common::set_page_delay(1);    # in seconds

Do this, however, at your own risk. In my experience, a page delay of one second will get you banned by Google. Teoma appears to be more lenient.

Why don't you use the Google API?

For experimentation, there is a module, SearchGoogleAPI, that uses the Google API. It is not the preferred method for two reasons: 1.) Google imposes a daily limit on the number of queries allowed, and 2.) the API returns at most 10 hits per call, which means a lot of time is wasted waiting on network connections (as the sketch below illustrates).
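
To see why the second limitation hurts, consider the paging loop it forces. The sketch below is hypothetical: api_search() is an invented stand-in for whatever query method the API wrapper actually exposes, not the real SearchGoogleAPI interface.


# Hypothetical sketch -- api_search() is an invented stand-in, not the
# real SearchGoogleAPI interface.  Each call returns at most 10 hits.
my @hits;
for (my $start = 0; $start < 100; $start += 10) {
    push @hits, api_search($query, $start);   # one network round trip per call
}
# Collecting 100 hits costs 10 round trips; scraping a single results
# page can return the same hits in one.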