TREC 2007 ciQA Task Guidelines

1 Overview

The goal of the complex, interactive question answering (ciQA) task at TREC 2007 is to push the state of the art in two directions:

  • A move away from "factoid" questions towards more complex information needs that exist within a richer user context.
  • A move away from the one-shot interaction model implicit in the previous ciQA evaluation towards one based at least in part on multiple interactions with users.

The ciQA task is entirely independent of the main task (focused on question series); teams may participate in one or both. The interactive aspect of the ciQA task will be optional.

1.1 Complex "Relationship" Questions

The concept of a "relationship" is defined as the ability of one entity to influence another, including both the means to influence and the motivation for doing so. Eight "spheres of influence" were noted in a previous pilot study funded by AQUAINT: financial, movement of goods, family ties, communication pathways, organizational ties, co-location, common interests, and temporal. Evidence for either the existence or the absence of ties is relevant; the particular relationships of interest depend on the context.

A relationship question in the ciQA task, which we will refer to as a topic (to reduce confusion), is composed of two parts. Consider an example:

Template: What evidence is there for transport of [drugs] from [Bonaire] to [the United States]?
Narrative: The analyst would like to know of efforts made to discourage narco traffickers from using Bonaire as a transit point for drugs to the United States. Specifically, the analyst would like to know of any efforts by local authorities as well as the international community.

The question template is a stylized information need that has a fixed structure (the template itself) and free slots whose instantiation varies across different topics (items in square brackets). The narrative is free-form natural language text that elaborates on the information need, providing, for example, user context, more fine-grained statement of interest, focus on particular topical aspects, etc.

The ciQA task will employ the following templates (same as those in TREC 2006):

  1. What evidence is there for transport of [goods] from [entity] to [entity]?
  2. What [relationship] exist between [entity] and [entity]?
    where [relationship] is an element of {"financial relationships", "organizational ties", "familial ties", "common interests"}
  3. What influence/effect do(es) [entity] have on/in [entity]?
  4. What is the position of [entity] with respect to [issue]?
  5. Is there evidence to support the involvement of [entity] in [event/entity]?

1.2 Interactive Question Answering

The purpose of the interactive aspect of ciQA is to provide a framework for participants to investigate interaction in the QA context. Participants have the opportunity to deploy a full-fledged Web-based QA system. Each assessor will spend five minutes interacting with the system per topic. There are no restrictions on the nature of the interaction or the system, except that it must be accessible from a Web browser. Therefore, anything ranging from mixed-initiative dialogues to graphical interfaces is allowed. In fact, we encourage an exploration of novel interaction paradigms.

Note that this setup is a departure from interaction forms in ciQA 2006.

2 Task Details

Here is the general setup of the task:

  1. Participants submit initial runs and URL files. URL files provide pointers to the Web-based QA system for each topic.
  2. NIST assessors interact with the Web-based QA systems.
  3. Participants submit final runs based on the results of the interactions.
  4. NIST evaluates both initial and final runs.

The TREC 2007 ciQA task will only accept fully automatic runs. No human intervention is allowed from the participant's end.

(Update, 7/10/2007): Manual runs will also be accepted, but must be marked as such in the run submission interface.

The interactive part of ciQA is optional: groups that do not wish to participate in the interactive aspect should simply not submit any URL files. Note, however, that such groups must still submit their runs by the initial-run deadline.

Each group is allowed to submit at most 2 initial runs, 2 URL files, and 2 final runs. All submitted runs will be judged. The submission form will ask you to establish a correspondence between the runs and the URL files.

2.1 Document Collection

The ciQA task will use the newswire portion of the document collection used by the main QA task. The ciQA task will not be using the blog data.

For participants that do not wish to perform their own document retrieval, NIST will provide the top 100 documents as retrieved by the PRISE system, using the question template verbatim as the query.

2.2 Topic Format

Each topic will consist of a question template and a free-form narrative. The templates will draw from the following set:

  1. What evidence is there for transport of [goods] from [entity] to [entity]?
  2. What [relationship] exist between [entity] and [entity]?
    where [relationship] is an element of {"financial relationships", "organizational ties", "familial ties", "common interests"}
  3. What influence/effect do(es) [entity] have on/in [entity]?
  4. What is the position of [entity] with respect to [issue]?
  5. Is there evidence to support the involvement of [entity] in [event/entity]?

All ciQA topics will be encoded in an XML file, with the following format:

<ciqa>
  <topic num="1">
    <template id="1">What evidence is there for transport of
      [drugs] from [Bonaire] to [the United States]?</template>
    <narrative>The analyst would like to know of efforts made to
      discourage narco traffickers from using Bonaire as a transit point
      for drugs to the United States. Specifically, the analyst would
      like to know of any efforts by local authorities as well as the
      international community.</narrative>
  </topic>
  ...
</ciqa>

There will be 30 topics total, but they will not be distributed evenly across the 5 templates. In addition, there will be one throw-away topic for training purposes, with the "topic number" ciQA2007throwaway---see below.
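As a sketch of how a system might ingest the topic file above, the following Python snippet parses the XML format using only the standard library. The helper name and the slot-extraction convention (bracketed fillers) are illustrative, not part of the official distribution:

```python
import re
import xml.etree.ElementTree as ET

def parse_topics(xml_text):
    """Parse a ciQA topic file into a list of dicts.

    Each dict holds the topic number, template id, the template text
    (with line breaks collapsed), the bracketed slot fillers, and the
    narrative.
    """
    topics = []
    root = ET.fromstring(xml_text)
    for topic in root.findall("topic"):
        template = topic.find("template")
        narrative = topic.find("narrative")
        text = " ".join(template.text.split())  # collapse whitespace
        topics.append({
            "num": topic.get("num"),
            "template_id": template.get("id"),
            "template": text,
            "slots": re.findall(r"\[([^\]]+)\]", text),
            "narrative": " ".join(narrative.text.split()),
        })
    return topics
```

A system could use the extracted slot fillers directly as query terms, or hand the full narrative to a retrieval component.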

2.3 Response Format

For each topic, the answer submission file should contain one or more lines of the form

topic-number run-tag doc-id rank answer-string

run-tag is a string used as a unique identifier for your run; it must be no more than 12 characters and may not contain embedded white space. answer-string is the piece of evidence derived (extracted, concluded, etc.) from the given document doc-id; it may contain embedded white space but may not contain embedded newlines. rank is the rank order of the answer string, starting from 1: systems should rank their answer strings in order of relevance, with the most relevant answer string first.

The response for all the topics should be contained in a single file. Please include a response for all topics except the throw-away topic, even if the response is just a place-holder like:

5 RUNTAG NYT20000101.0001 1 don't know

The maximum total length of answer strings is 7000 non-whitespace characters per topic per run; excessive length is penalized in the scoring.
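A simple pre-submission sanity check could verify the five-field line format, the run-tag constraints, and the per-topic length cap. The sketch below is a hypothetical helper based on the format described above, not an official NIST checker:

```python
def check_response_file(lines, max_chars=7000):
    """Check ciQA answer lines: five whitespace-delimited fields,
    a run tag of at most 12 characters, a positive integer rank,
    and at most max_chars non-whitespace answer characters per topic.
    Returns a list of human-readable error messages (empty if clean).
    """
    errors = []
    per_topic = {}  # topic-number -> non-whitespace answer length
    for i, line in enumerate(lines, 1):
        parts = line.rstrip("\n").split(None, 4)
        if len(parts) < 5:
            errors.append(f"line {i}: expected 5 fields, got {len(parts)}")
            continue
        topic, run_tag, doc_id, rank, answer = parts
        if len(run_tag) > 12:
            errors.append(f"line {i}: run tag longer than 12 characters")
        if not rank.isdigit() or int(rank) < 1:
            errors.append(f"line {i}: rank must be a positive integer")
        per_topic[topic] = per_topic.get(topic, 0) + len("".join(answer.split()))
    for topic, length in per_topic.items():
        if length > max_chars:
            errors.append(
                f"topic {topic}: {length} non-whitespace characters "
                f"exceeds the {max_chars} limit")
    return errors
```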

2.4 URL Files

The URL files provide pointers to the participant's Web-based QA system. For each topic, the participant will provide a URL---which the NIST assessor will navigate to. There, the assessor will spend five minutes interacting with the QA system. The URL files should contain lines of the following format (one per topic):

topic-number run-tag URL

The URL can be anything that one could type into a browser---it can encode CGI queries, point to active content, etc. There are no restrictions on the participant's QA system, provided that it is accessible on the Web. Note that the submitted URL only serves as an entry point. The participant's system can redirect the assessor to other pages as necessary. The assessor will spend five minutes interacting with each system. Each participant can submit at most two URL files, which means that two separate interactive systems can be evaluated.

NIST will prepare a throw-away topic, noted via the "topic number" ciQA2007throwaway. The purpose of this topic is to familiarize assessors with systems before they complete an actual topic. The assessor will spend five minutes on this training topic. Note that NIST places no restrictions on the nature of this particular interaction---for instance, participants are free to deploy a tutorial of their systems that is completely unrelated to the actual topic. The throw-away topic will not be judged in the evaluation, and the assessors will be made aware of this. Participants will supply a URL for this "topic number" just as with any other topics.

The NIST assessors will be using the following setup:

  • Redhat Enterprise Linux 4 workstation
  • 20-inch LCD monitor with 1600x1200 resolution, true color (millions of colors)
  • Firefox Web browser, v2.0.0.x, where x is the current revision number as of August 10, 2007

(Update, 7/10/2007): Flash, Acroread, and RealPlayer will all be enabled in Firefox, and they will be the most current version available for Linux as of August 10.

IMPORTANT NOTE: The five minutes of interaction per topic include time spent loading/rendering the page, as well as any delay caused by network traffic. It is the participant's responsibility to ensure that the QA system is Web-accessible during the period of time the assessors are scheduled to interact with submitted systems (8/15-8/17, see timeline below). If, for any reason, the assessor cannot access the participant's QA system, he or she will simply skip that interaction (and not return to that topic).

Finally, participants are requested to supply NIST with representative screenshots of their systems for archival purposes and to help the organizers report on the task. This should take the form of URLs to locations of the screenshots (at least one, but no more than ten). These URLs should be encoded in the URL file in the following way:

screenshot run-tag URL

URL(s) to the screenshots should come at the end of the URL file. Thus, a complete URL file might look like the following:

ciQA2007throwaway MRUN http://foo.edu/ciqa/qa.cgi?topic=ciQA2007throwaway
56 MRUN http://foo.edu/ciqa/qa.cgi?topic=56
57 MRUN http://foo.edu/ciqa/qa.cgi?topic=57
...
85 MRUN http://foo.edu/ciqa/qa.cgi?topic=85
screenshot MRUN http://foo.edu/ciqa/screenshot1.jpg
screenshot MRUN http://foo.edu/ciqa/screenshot2.jpg
screenshot MRUN http://foo.edu/ciqa/screenshot3.jpg
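Producing a URL file like the one above is mechanical; the sketch below assumes a hypothetical base URL and emits the throwaway line first, one line per topic, and the screenshot pointers at the end, per the format above:

```python
def make_url_file(run_tag, base_url, topic_numbers, screenshot_urls):
    """Build the lines of a ciQA URL file: the throwaway topic first,
    one line per real topic, then screenshot pointers at the end."""
    lines = [f"ciQA2007throwaway {run_tag} {base_url}?topic=ciQA2007throwaway"]
    for num in topic_numbers:
        lines.append(f"{num} {run_tag} {base_url}?topic={num}")
    for url in screenshot_urls:
        lines.append(f"screenshot {run_tag} {url}")
    return lines
```

Encoding the topic number in the URL's query string (as in the example above) lets a single CGI entry point serve every topic.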

NIST will record time spent on each system/topic combination, and will return this information to participants.

2.5 Final Runs

After the NIST assessors interact with the system, participants will generate a final run, submitted to NIST before the deadline (same response format). It is each participant's responsibility to extract and record whatever input is necessary from the assessors (e.g., via logging capabilities). NIST assessors will not provide the participants any additional feedback.

IMPORTANT NOTE: It is the responsibility of each participant to collect any data that they want from the interaction. NIST will not keep any record of the interaction other than the time spent on each interaction. Assessors will be allowed at most 5 minutes per topic per system. After 5 minutes, the NIST system will simply close the page (i.e., participants cannot assume that the assessor will click on a "submit" button at the end of the interaction). It is the responsibility of participants to ensure that any form the assessor fills out gets submitted when the page is unloaded. Note that it is possible to trigger a form submission dynamically in JavaScript by calling form.submit() from the page's unload event handler (e.g., the onunload attribute of the <body> element).

2.6 Evaluation Methodology

System responses will be evaluated using the "nugget pyramid" extension of the nugget-based methodology used in previous TRECs. See (Lin and Demner-Fushman, HLT 2006) for more details. Additional analyses will include recall-by-length plots, as described in (Lin, HLT 2007).
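For intuition, the nugget methodology combines weighted nugget recall with a length-based precision approximation. The sketch below follows the standard published formulation (a 100-character allowance per matched nugget and a recall-emphasizing beta of 3); treat it as illustrative of the scoring shape, not as the official scorer:

```python
def nugget_f_score(matched_weights, total_weight, answer_length, beta=3.0):
    """Pyramid nugget F-measure (illustrative sketch).

    matched_weights: pyramid weights of nuggets the run was judged
        to contain.
    total_weight: sum of weights of all nuggets for the topic.
    answer_length: non-whitespace character count of the run's answers.
    Recall is weighted nugget recall; precision is the length-based
    approximation with a 100-character allowance per matched nugget;
    F uses beta=3, which weights recall more heavily than precision.
    """
    recall = sum(matched_weights) / total_weight if total_weight else 0.0
    allowance = 100.0 * len(matched_weights)
    if answer_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (answer_length - allowance) / answer_length
    if precision + recall == 0:
        return 0.0
    return ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)
```

The 7000-character cap in Section 2.3 interacts directly with the precision term: answer text beyond the allowance drags the score down even when recall is high.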

3 Timeline

  • now: Corpus available (contact NIST)
  • August 6, 2007: Test topics released
  • August 13, 2007 (11:59pm EDT): Initial runs and URL files due
  • August 15-17, 2007: NIST assessors scheduled to interact with systems
  • August 26, 2007 (11:59pm EDT): Final runs due

For reference, SIGIR 2007 is July 23-27.

4 Revision History

  • 5/10/2007: initial draft guidelines posted.
  • 7/10/2007: manual runs allowed, specified enabled Firefox plug-ins.