Aranea is a Web-based question answering system and, as such, it depends on existing Web search engines (Google and Teoma) for text snippets. This is potentially a problem because most search engines don't appreciate parasitic programs that access their services programmatically...
The solution is to cache Web pages and only download fresh pages as needed. This has the additional advantage of making experiments reproducible. Aranea caches Web results in /tmp/web-cache; the log file is /tmp/web_proxy.log.
The preferred way to fetch Web pages is through the caching layer, sketched below.
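A minimal sketch of the caching pattern in Python (the fetch_url name and its internals here are hypothetical; Aranea's actual entry point may differ):

    import hashlib, os, time, urllib.request

    CACHE_DIR = '/tmp/web-cache'      # where Aranea caches Web results
    LOG_FILE = '/tmp/web_proxy.log'   # fetch log
    PAGE_DELAY = 30                   # default wait between live fetches

    def fetch_url(url):
        # Hypothetical helper: serve from the cache when possible;
        # otherwise download, cache, log, and pause before the next fetch.
        os.makedirs(CACHE_DIR, exist_ok=True)
        path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest())
        if os.path.exists(path):
            with open(path, 'rb') as f:
                return f.read()       # cache hit: no network, no delay
        data = urllib.request.urlopen(url).read()
        with open(path, 'wb') as f:
            f.write(data)
        with open(LOG_FILE, 'a') as log:
            log.write(url + '\n')
        time.sleep(PAGE_DELAY)        # be polite to the search engine
        return data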
Aranea returns the cached page if available; otherwise, it retrieves a fresh copy from the Web. By default, the system waits 30 seconds between page fetches; this delay can be adjusted.
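In the sketch above the delay is just a module-level constant, so a hypothetical setter (standing in for Aranea's actual command) would look like:

    def set_page_delay(seconds):
        # Hypothetical stand-in for Aranea's delay-adjustment command.
        global PAGE_DELAY
        PAGE_DELAY = seconds

    set_page_delay(10)   # shorter waits between live fetches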
Do this, however, at your own risk. In my experience, a page delay of one second will get you banned by Google. Teoma appears to be more lenient.
Why don't you use the Google API?
For experimentation, there is a module SearchGoogleAPI that uses the Google API. This is not preferred for two reasons: 1.) Google puts a daily limit on the number of hits allowed, and 2.) the Google API only returns 10 hits per request, which means a lot of time is wasted waiting on network connections.
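To make the second point concrete: assuming a hypothetical google_api_search(query, start, num) wrapper around the API, collecting 100 hits costs ten sequential round trips:

    def fetch_top_hits(query, total=100):
        # google_api_search is a hypothetical stand-in for the call in the
        # SearchGoogleAPI module; each request returns at most 10 results,
        # so `total` hits require total/10 separate network calls.
        hits = []
        for start in range(0, total, 10):
            hits.extend(google_api_search(query, start=start, num=10))
        return hits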