Aranea: General Architecture

Conceptually, these are the steps involved in answering questions using the Web:

  1. Figure out the correct query to issue to a search engine.
  2. Get the search results.
  3. Determine the correct answer from those pages.
  4. Present the answer.

Aranea is composed of independent modules whose input and output are organized around the Aranea XML object, which is parsed and supported in Perl by Aranea::XML::Aranea. At the top level, the XML object contains five components:

  1. <query> contains the question being answered, along with other meta-information.
  2. <request_list> contains "requests" derived from the natural language query. Each request translates into one or more search engine queries with additional postprocessing.
  3. <page_list> stores all the pages retrieved from the Web: page summary, URL, etc.
  4. <pipeline> keeps track of the modules (timestamped) that have modified the Aranea XML object; in effect, it serves as an accounting record.
  5. <candidate_list> stores all candidate answers to the query, including their scores and the URLs of the documents that support each answer.
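
As a rough illustration of the top-level layout, the sketch below checks which of the five components are present in an Aranea XML object. It uses Python's standard xml.etree.ElementTree rather than the Perl module named above; the function name present_components and the sample string are assumptions made for this example and are not part of Aranea.

```python
import xml.etree.ElementTree as ET

# The five top-level components named in the list above.
TOP_LEVEL = ["query", "request_list", "page_list", "pipeline", "candidate_list"]

def present_components(xml_text):
    """Return the top-level components found in an Aranea XML object.

    Hypothetical helper for illustration only; not part of Aranea.
    """
    root = ET.fromstring(xml_text)
    found = {child.tag for child in root}
    # Keep the documented order rather than the order in the file.
    return [name for name in TOP_LEVEL if name in found]

# Assumed sample: an object that so far holds only a query and an empty request list.
sample = """<aranea>
  <query><text>where is Belize located?</text><id>stdin</id></query>
  <request_list/>
</aranea>"""

print(present_components(sample))
```

A freshly created object would typically contain only some of these components; the others are presumably added as modules run.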

Once the user query has been converted into requests, the Aranea XML object looks something like this:


<aranea>
  <query>
    <text>where is Belize located?</text>
    <id>stdin</id>
  </query>
  <request_list>
    <entry>
      <query>Belize is located in ?x</query>
      <score>10</score>
      <constraints>
        <max_length>100</max_length>
        <pages_to_search>10</pages_to_search>
        <max_words>5</max_words>
      </constraints>
    </entry>
    <entry>
      <query>where is Belize located?</query>
      <score>1</score>
      <constraints>
        <use_backoff>true</use_backoff>
        <pages_to_search>10</pages_to_search>
      </constraints>
    </entry>
  </request_list>
</aranea>
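
To make the structure of the request list concrete, here is a minimal Python sketch that reads the requests from an object like the one above. The helper name read_requests is hypothetical, and ordering requests by descending score is an assumption about how scores are meant to be used, not something this document states.

```python
import xml.etree.ElementTree as ET

# The sample object shown above, copied verbatim.
ARANEA_XML = """<aranea>
  <query>
    <text>where is Belize located?</text>
    <id>stdin</id>
  </query>
  <request_list>
    <entry>
      <query>Belize is located in ?x</query>
      <score>10</score>
      <constraints>
        <max_length>100</max_length>
        <pages_to_search>10</pages_to_search>
        <max_words>5</max_words>
      </constraints>
    </entry>
    <entry>
      <query>where is Belize located?</query>
      <score>1</score>
      <constraints>
        <use_backoff>true</use_backoff>
        <pages_to_search>10</pages_to_search>
      </constraints>
    </entry>
  </request_list>
</aranea>"""

def read_requests(xml_text):
    """Return the requests as dicts, highest score first (assumed priority order)."""
    root = ET.fromstring(xml_text)
    requests = []
    for entry in root.findall("./request_list/entry"):
        node = entry.find("constraints")
        # Constraint values are kept as strings, exactly as they appear in the XML.
        constraints = {c.tag: c.text for c in node} if node is not None else {}
        requests.append({
            "query": entry.findtext("query"),
            "score": int(entry.findtext("score")),
            "constraints": constraints,
        })
    return sorted(requests, key=lambda r: r["score"], reverse=True)

for request in read_requests(ARANEA_XML):
    print(request["score"], request["query"], request["constraints"])
```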

The requests are then executed and the pages are retrieved. (Note that each entry is tagged with the request that the page came from.)


<aranea>
  ...
  <page_list>
    <entry>
      <id>0</id>
      <page_category> ... </page_category>
      <page_summary> ... </page_summary>
      <page_url> ... </page_url>
      <page_title> ... </page_title>
      <page_cache_url> ... </page_cache_url>
    </entry>
    ...
    <entry>
      <id>1</id>
      <page_category> ... </page_category>
      <page_summary> ... </page_summary>
      <page_url> ... </page_url>
      <page_title> ... </page_title>
      <page_cache_url> ... </page_cache_url>
    </entry>
  </page_list>
  ...
</aranea>
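
The shape of a page entry can be sketched in Python as follows. The helper add_page is hypothetical, as are the example.org values; numbering ids sequentially from 0 is an assumption based on the sample above. The tag that links an entry to its originating request is omitted, since its element name is not shown in this document.

```python
import xml.etree.ElementTree as ET

# Child elements of a page entry, in the order shown in the sample above.
PAGE_FIELDS = ["page_category", "page_summary", "page_url",
               "page_title", "page_cache_url"]

def add_page(root, **fields):
    """Append an <entry> to <page_list>, creating the list if it is missing.

    Hypothetical helper for illustration only; not part of Aranea.
    """
    page_list = root.find("page_list")
    if page_list is None:
        page_list = ET.SubElement(root, "page_list")
    entry = ET.SubElement(page_list, "entry")
    # Assumed convention: ids count up from 0 in insertion order.
    ET.SubElement(entry, "id").text = str(len(page_list) - 1)
    for name in PAGE_FIELDS:
        ET.SubElement(entry, name).text = fields.get(name, "")
    return entry

root = ET.fromstring("<aranea/>")
first = add_page(root, page_url="http://example.org/", page_title="Example page")
second = add_page(root)
print(ET.tostring(root, encoding="unicode"))
```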

After additional processing by various modules, <candidate_list> is populated with candidate answers:


<aranea>
  ...
  <candidate_list>
    ...
    <entry>
      <support>
        <doc>http://commonwealth.ednet.ns.ca/americas/Belize/belize.htm</doc>
        <doc>http://realestate.escapeartist.com/Properties/Belize/</doc>
        ...
      </support>
      <score>60</score>
      <candidate>Central America</candidate>
    </entry>
  </candidate_list>
</aranea>
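
A consumer of the final object might select an answer as in the Python sketch below. The helper best_candidate is hypothetical, and treating the highest score as the best answer is an assumption. The sample includes only the entry and the two supporting documents that are shown above; the elided ones are left out.

```python
import xml.etree.ElementTree as ET

# Only the candidate entry and supporting documents visible in the sample above.
CANDIDATES_XML = """<aranea>
  <candidate_list>
    <entry>
      <support>
        <doc>http://commonwealth.ednet.ns.ca/americas/Belize/belize.htm</doc>
        <doc>http://realestate.escapeartist.com/Properties/Belize/</doc>
      </support>
      <score>60</score>
      <candidate>Central America</candidate>
    </entry>
  </candidate_list>
</aranea>"""

def best_candidate(xml_text):
    """Return the highest-scoring candidate as a dict, or None if there are none."""
    root = ET.fromstring(xml_text)
    best = None
    for entry in root.findall("./candidate_list/entry"):
        score = float(entry.findtext("score"))
        # Assumption: a higher score means a better answer.
        if best is None or score > best["score"]:
            best = {
                "candidate": entry.findtext("candidate"),
                "score": score,
                "support": [doc.text for doc in entry.findall("./support/doc")],
            }
    return best

answer = best_candidate(CANDIDATES_XML)
print(answer["candidate"], answer["score"], len(answer["support"]))
```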

The Aranea modules tutorial provides a HOWTO guide for writing modules.


$Header: /fs/clip-qa/.cvsroot/Aranea.support/docs/general-architecture.html,v 1.3 2005/06/10 20:44:17 jimmylin Exp $