TREC 2006 ciQA Task Guidelines

1 Overview

The goal of the complex, interactive question answering (ciQA) task within the QA track at TREC 2006 is to push the state of the art in question answering in two directions:

  • A move away from "factoid" questions towards more complex information needs that exist within a richer user context.
  • A move away from the one-shot interaction model implicit in previous systems towards one based at least in part on interactions with users.

In terms of setup, the ciQA task is a blend of the TREC 2005 relationship questions task and the TREC 2005 HARD track, which focused on single-iteration clarification dialogues. The ciQA task is entirely independent of the main task (with question series); teams may participate in one or both. In addition, the interactive aspect of the ciQA task will be optional. Finally, both automatic and manual runs will be allowed.

1.1 Complex "Relationship" Questions

Primarily for purposes of providing continuity and some amount of training data, ciQA will use so-called "relationship" questions, similar to those in the TREC 2005 relationship task.

The concept of a "relationship" is defined as the ability of one entity to influence another, including both the means to influence and the motivation for doing so. Eight "spheres of influence" were noted in a previous pilot study funded by AQUAINT: financial, movement of goods, family ties, communication pathways, organizational ties, co-location, common interests, and temporal. Evidence for either the existence or the absence of ties is relevant; the particular relationships of interest depend on the context.

A relationship question in the ciQA task, which we will refer to as a topic (to reduce confusion), is composed of two parts. Consider an example:

Template: What evidence is there for transport of [drugs] from [Bonaire] to [the United States]?
Narrative: The analyst would like to know of efforts made to discourage narco traffickers from using Bonaire as a transit point for drugs to the United States. Specifically, the analyst would like to know of any efforts by local authorities as well as the international community.

The question template is a stylized information need that has a fixed structure (the template itself) and free slots whose instantiation varies across different topics. The narrative is free-form natural language text that elaborates on the information need, providing, for example, user context, a more fine-grained statement of interest, or a focus on particular topical aspects.

The ciQA task will employ the following templates:

  1. What evidence is there for transport of [goods] from [entity] to [entity]?
  2. What [relationship] exist between [entity] and [entity]?
    where [relationship] is an element of {"financial relationships", "organizational ties", "familial ties", "common interests"}
  3. What influence/effect do(es) [entity] have on/in [entity]?
  4. What is the position of [entity] with respect to [issue]?
  5. Is there evidence to support the involvement of [entity] in [event/entity]?

1.2 Interactive Question Answering

The purpose of the interactive aspect of ciQA is to provide a framework for participants to investigate interaction in the QA context and an opportunity for other researchers to become involved in QA. In this task, we consider an interactive system to be one that gives users control over all or a portion of the displayed content. Under this definition, the smallest possible interaction unit consists of the user responding to the system and the system using that response to perform some action that produces such content. The interactive aspect of ciQA is concerned with the smallest interaction unit and is modeled in part after the HARD track's clarification form task.

The HARD track's clarification forms allowed participants to elicit information from assessors through a single interaction. This interaction consisted of assessors completing forms (i.e., Web pages) that had been created by track participants. The results of these interactions were then returned to the participants---comparison of output before and after the clarification quantified the effects of the interaction.

Although many participants took advantage of this opportunity to investigate traditional relevance feedback techniques, this was not a goal of the track nor a condition for participation; there were, in fact, some participants who used the clarification form in different ways. In the ciQA track, we encourage novel and innovative ways of using forms that go beyond traditional relevance feedback. We have changed the name of the form from "clarification" to "interaction" to reflect that all types of single unit interaction techniques are appropriate.

The rationale for studying the smallest interaction unit rests on the idea that a good QA system should return relevant information with a minimum amount of interaction. Furthermore, given the potential complexities that are likely to arise with coordinating cross-site, multi-unit interaction evaluation, we believe that using the smallest interaction unit is a reasonable place to start an interactive QA task. The TREC interactive track demonstrated that coordinating multi-site interactive IR system evaluation is a challenge and that results are difficult (if not impossible) to compare. Therefore, using a more conservative approach seems most appropriate and most likely to yield useful and usable results.

2 Task Details

Here is the general setup of the task:

  1. Participants submit initial runs and interaction forms.
  2. NIST assessors interact with forms.
  3. NIST returns results of the interaction---CGI bindings.
  4. Participants submit final runs based on the results of the interactions.
  5. NIST evaluates both initial and final runs.

Both automatic and manual submissions are allowed. If there is human intervention in any part of the process (except assessor interaction through the interaction forms), the entire run must be designated "manual".

Groups that do not wish to participate in the interactive aspect of ciQA should simply not submit any interaction forms. Note that in this case you must still submit your runs when others are submitting their initial runs and interaction forms. No final runs will be accepted from groups that did not submit interaction forms. The rationale for this is fairness: the relevant variable under consideration here is interaction vs. no interaction, not interaction vs. more system development time.

2.1 Document Collection

The ciQA task will use the same document collection as the main QA task---the AQUAINT corpus.

2.2 Topic Format

Each topic will consist of a question template and a free form narrative. The templates will draw from the following set:

  1. What evidence is there for transport of [goods] from [entity] to [entity]?
  2. What [relationship] exist between [entity] and [entity]?
    where [relationship] is an element of {"financial relationships", "organizational ties", "familial ties", "common interests"}
  3. What influence/effect do(es) [entity] have on/in [entity]?
  4. What is the position of [entity] with respect to [issue]?
  5. Is there evidence to support the involvement of [entity] in [event/entity]?

All ciQA topics will be encoded in an XML file with the following format:

<ciqa>
  <topic num="1">
    <template id="1">What evidence is there for transport of
      [drugs] from [Bonaire] to [the United States]?</template>
    <narrative>The analyst would like to know of efforts made to
      discourage narco traffickers from using Bonaire as a transit point
      for drugs to the United States. Specifically, the analyst would
      like to know of any efforts by local authorities as well as the
      international community.</narrative>
  </topic>
  ...
</ciqa>

There will be 30 topics total, 6 per template.
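
For reference, a topic file in this format can be read with only a few lines of code. The following is a minimal sketch using Python's standard library; the filename and function name are placeholders, not part of the task definition.

# Minimal sketch of reading the topic file; "ciqa2006_topics.xml" is a
# placeholder for whatever file NIST actually distributes.
import xml.etree.ElementTree as ET

def load_topics(path="ciqa2006_topics.xml"):
    """Return a list of (topic number, template id, template text, narrative)."""
    topics = []
    for topic in ET.parse(path).getroot().findall("topic"):
        template = topic.find("template")
        narrative = topic.find("narrative")
        topics.append((
            topic.get("num"),
            template.get("id"),
            " ".join(template.text.split()),    # collapse line breaks and indentation
            " ".join(narrative.text.split()),
        ))
    return topics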

2.3 Response Format

For each topic, the submission file should contain one or more lines of the form

topic-number run-tag doc-id rank answer-string

run-tag is a string used as a unique identifier for your run. Please limit it to no more than 12 characters; it may not contain embedded white space. answer-string is the piece of evidence derived (extracted, concluded, etc.) from the given document doc-id. It may contain embedded white space but may not contain embedded newlines. rank is the rank order of the answer string, starting from 1. This means that systems should rank their answer strings in order of relevance, with the most relevant answer string first. This is an important change from the 2005 relationship task, where the answer strings formed an unordered set.

The response for all the topics should be contained in a single file. Please include a response for all topics, even if the response is just a place-holder response like:

5 RUNTAG NYT20000101.0001 1 don't know

The maximum total length of answer strings is 7000 non-whitespace characters per topic per run; excessive length is penalized in the scoring.
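
Because the line format and the 7000-character limit are easy to get wrong, you may wish to check your submission file before submitting it. The sketch below is illustrative only and is not the official check script; it follows the field order given above and counts only non-whitespace characters per topic.

# Illustrative check of the response format and per-topic length limit;
# field and function names are for illustration only.
from collections import defaultdict

MAX_NON_WS_CHARS = 7000  # per topic, per run

def check_submission(path):
    chars_per_topic = defaultdict(int)
    with open(path) as f:
        for line_no, line in enumerate(f, 1):
            # topic-number run-tag doc-id rank answer-string (answer may contain spaces)
            fields = line.rstrip("\n").split(None, 4)
            if len(fields) != 5:
                raise ValueError(f"line {line_no}: expected 5 fields, got {len(fields)}")
            topic, run_tag, doc_id, rank, answer = fields
            if len(run_tag) > 12:
                raise ValueError(f"line {line_no}: run tag longer than 12 characters")
            if not rank.isdigit() or int(rank) < 1:
                raise ValueError(f"line {line_no}: rank must be a positive integer")
            chars_per_topic[topic] += len("".join(answer.split()))  # non-whitespace only
    for topic, count in chars_per_topic.items():
        if count > MAX_NON_WS_CHARS:
            print(f"warning: topic {topic} exceeds {MAX_NON_WS_CHARS} non-whitespace characters ({count})")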

Both automatic and manual runs are allowed, although you must describe what manual processing was done (if any). Each group may submit at most 2 runs prior to interaction, and 2 additional runs after interaction; all submitted runs will be judged.

2.4 Interaction Forms

Participants will submit initial runs and interaction forms simultaneously. Interaction forms are HTML pages that solicit user input (via CGI). There are no restrictions on content, but there are technical restrictions (see below). Each site will be limited to two interaction forms per question. Assessors will have three minutes to complete each interaction form. Both automatic and manual runs using results of interaction forms are allowed.

2.4.1 Format of and restrictions on forms

Interaction forms will be completed by assessors at NIST on a machine with the following configuration:

  • Redhat Enterprise Linux workstation
  • 20-inch LCD monitor with 1600x1200 resolution, true color (millions of colors)
  • Firefox Web browser, v1.5.0.2
  • Disconnected from all networks of any sort

The following restrictions apply to the format of the interaction forms:

  • Forms will be running on a computer that is disconnected from all networks, so you must provide all necessary information as part of the form. If the form requires multiple files, they must all be within the same directory structure. You cannot assume that interaction forms for all topics will be on the same computer.
  • It is not possible to invoke any cgi-bin scripts.
  • It is not possible to write to disk.
  • Javascript is allowed, but Java is not.

Please note: It is your responsibility to pilot your forms. If they do not work at NIST, then NIST will simply remove them from the rotation. NIST will not be able to troubleshoot any forms.
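
One simple way to pilot your forms is to scan them for references that cannot work on a disconnected machine. The sketch below is a rough heuristic of this kind, not an official validation tool; the file extensions and regular expression it uses are only illustrative.

# Rough pre-flight scan of a form directory for references that will break on
# NIST's disconnected machine: anything fetched over the network, or an
# absolute path other than the required submission script.
import os
import re
import sys

SUSPECT = re.compile(
    r"""(?:src|href|action)\s*=\s*["']?(https?://|/(?!cgi-bin/interaction_submit\.pl))""",
    re.IGNORECASE)

def scan(form_dir):
    problems = []
    for root, _, files in os.walk(form_dir):
        for name in files:
            if not name.lower().endswith((".html", ".htm", ".js", ".css")):
                continue
            path = os.path.join(root, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                for line_no, line in enumerate(f, 1):
                    if SUSPECT.search(line):
                        problems.append(f"{path}:{line_no}: network or absolute reference")
    return problems

if __name__ == "__main__":
    for problem in scan(sys.argv[1]):
        print(problem)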

Your interaction form must include the following items:

  • <form action="/cgi-bin/interaction_submit.pl" method="post">

    This specifies the script to which the form's output will be submitted.

  • <input type="hidden" name="site" value="formid">

    Here, "formid" is a short id for your form; it should be unique to your site and form in the way that runid is unique, e.g., "NIST1" or "NIST2".

  • <input type="hidden" name="topicid" value="000">

    Indicates the topic number. It should be a 3-digit code, zero-padded as needed: 001 rather than 01 or 1.

  • <input type="submit" name="send" value="submit">

    This is the submit button that should appear somewhere on your page.

In addition, you are strongly encouraged to make the topic number (e.g., "001") and the title of the topic visible somewhere on the page. The purpose of including this is to provide a sanity check that assessors are, indeed, responding to the correct questions.

For each submission, put all of your interaction forms in a single directory (folder) with the name indicated (e.g., NIST1). Each interaction form inside that directory should itself be a directory whose name combines the submission name and the topic number (e.g., NIST1_043 for topic 43 of the NIST1 submission). Note that the topic number must be 0-filled to three digits.

Inside that directory, the main interaction form should be called index.html. It may access any files from within your directory hierarchy, using relative pathnames. For example, "logo.gif" would refer to the file NIST1/NIST1_043/logo.gif within the directory structure, and "../logo.gif" would refer to "NIST1/logo.gif". Do not refer to any files outside of your directory structure. Do not refer to files with absolute path names in the URL since (1) absolute names will not be known and (2) there is no access to files on the network.
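
To make the naming conventions concrete, the sketch below writes an index.html skeleton containing the required elements into the required directory layout. The form id, the topics passed in, and the skeleton content beyond the required elements are illustrative; substitute your own.

# Sketch of laying out interaction forms in the required directory structure
# and stamping in the required hidden fields. "NIST1" and the topic numbers
# are examples only.
import os

SKELETON = """<html><body>
<form action="/cgi-bin/interaction_submit.pl" method="post">
<input type="hidden" name="site" value="{formid}">
<input type="hidden" name="topicid" value="{topic}">
<p>Topic {topic}: {template}</p>
<!-- your form content goes here -->
<input type="submit" name="send" value="submit">
</form>
</body></html>
"""

def write_forms(formid, topics, out_root="."):
    """topics: iterable of (topic number, template text) pairs."""
    for num, template in topics:
        topic = f"{int(num):03d}"  # zero-pad to three digits
        form_dir = os.path.join(out_root, formid, f"{formid}_{topic}")
        os.makedirs(form_dir, exist_ok=True)
        with open(os.path.join(form_dir, "index.html"), "w") as f:
            f.write(SKELETON.format(formid=formid, topic=topic, template=template))

# e.g., write_forms("NIST1", [("43", "What evidence is there for transport of ...?")])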

2.4.2 Processing and Use of Forms

Assessors will spend no more than three minutes per form, no matter how complex the form is. The three minutes includes the time needed to load the form (from local disk, since there is no network access), initialize it, and do any rendering, so unusually complex or large forms will potentially be penalized. At the end of the time limit, if the assessor has not pressed the "submit" button, the form will be timed out and forcibly submitted (anything entered up to that point should be saved). If a form somehow actively prevents the submission from happening at the end of the time limit, the form will be rejected, no further forms from that submission will be processed, and you will receive no interaction responses from that submission. Note that this implies you should not have entry-validation code that prevents the submit button from being pressed. A validation phase that asks the assessor to re-edit or "submit anyway" is acceptable, since it does not force the assessor to spend more than three minutes.

NIST will record time spent on each form and return this information to participants. The presentation of interaction forms will be randomized for each topic.
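
The exact file format in which the interaction results will be returned is not specified in these guidelines. If the CGI bindings come back as standard URL-encoded form data, they could be decoded as in the following hypothetical sketch; the file layout and function name are assumptions, not part of the task.

# Hypothetical decoding of one returned form, assuming the CGI bindings are
# standard URL-encoded name=value data.
from urllib.parse import parse_qs

def read_bindings(path):
    with open(path) as f:
        bindings = parse_qs(f.read().strip(), keep_blank_values=True)
    # bindings might then look like {"site": ["NIST1"], "topicid": ["043"], ...}
    return bindings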

2.5 Evaluation Methodology

Two metrics will be employed in assessing responses to relationship topics. The first and primary metric will be the F-score (beta=3) that was used in the 2005 relationship task, as well as in the TREC main task for "other" questions in 2004, 2005, and 2006. The "nugget pyramid" extension will be used, in which multiple assessors provide vital/okay judgments. See:

Jimmy Lin and Dina Demner-Fushman. Will Pyramids Built of Nuggets Topple Over? In Proceedings of HLT/NAACL 2006.

The second metric, new for the ciQA task, is called mean average nugget recall (MANuR). This metric is the analog of mean average precision (MAP) in document retrieval and attempts to explicitly capture the tradeoff between precision and recall, both graphically and as a single-point metric. The basic idea is to quantify weighted nugget recall as a function of answer length (in non-whitespace characters). Through the nugget pyramid building process, each nugget is assigned a weight between zero and one. Weighted nugget recall is the sum of the weights of all nuggets retrieved divided by the sum of the weights of all nuggets in the assessor's answer key.

Implementing MANuR requires two important changes to the previous evaluation protocol:

  • Answer strings must be rank ordered, with best first.
  • Assessors must mark the first instance of a nugget in the response set of answer strings.

Here is the scoring methodology (all character counts exclude whitespace); an illustrative sketch of the computation follows the list:

  1. For each topic, NIST will record the cumulative character length and weighted nugget recall after each answer string has been assessed.
  2. Each data point will be interpolated to the nearest 100-character increment longer than the current length. For example, a recall of 0.25 at 168 characters is interpolated to (200, 0.25). Plotting these points gives nugget recall as a function of answer length for a particular topic.
  3. The MANuR graph (recall as a function of answer length) for a particular run will be the average across all questions at each length increment (multiples of 100 characters).
  4. A single-point metric will be derived from the above graph by taking the mean across all length increments, 100 characters to 4000 characters.
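
For illustration only, the following sketch implements steps 2-4 for a single run. It is not the official scoring script: the input representation and the decision to carry the last observed recall forward through 4000 characters are assumptions made for the sketch.

# Sketch of the MANuR computation for one run, following the steps above.
import math

STEP, MAX_LEN = 100, 4000  # character increments and the 4000-character cutoff

def topic_curve(answers, total_weight):
    """Interpolated weighted recall at each multiple of 100 characters for one topic.

    answers: ranked list of (non-whitespace length, weight of newly matched nuggets).
    total_weight: sum of the weights of all nuggets in the answer key.
    """
    curve, cum_len, cum_weight = {}, 0, 0.0
    for length, new_weight in answers:
        cum_len += length
        cum_weight += new_weight
        point = math.ceil(cum_len / STEP) * STEP  # e.g., 168 characters -> 200
        curve[point] = cum_weight / total_weight
    # Carry each recall value forward to later increments (an assumption of this
    # sketch for lengths beyond the last answer string).
    recall, filled = 0.0, []
    for x in range(STEP, MAX_LEN + STEP, STEP):
        recall = curve.get(x, recall)
        filled.append(recall)
    return filled

def manur(per_topic_curves):
    """Average over topics at each increment, then average over increments."""
    by_increment = [sum(vals) / len(vals) for vals in zip(*per_topic_curves)]
    return sum(by_increment) / len(by_increment)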

3 Timeline

  • now: Corpus available
  • June 30, 2006: Test topics released
  • July 31, 2006 (11:59pm EST): Initial runs and interaction forms due (for interactive participants)
  • July 31, 2006 (11:59pm EST): Final runs due (for non-interactive participants)
  • August 16, 2006: Interaction results returned to participants
  • September 5, 2006: Final runs due

For reference, SIGIR 2006 is August 6-11.

4 Acknowledgments

Much of the language of Section 2.4 was borrowed from the 2005 TREC HARD Guidelines.

5 Revision History

  • 05/10/2006: initial draft guidelines posted.
  • 06/21/2006: added number of topics for the ciQA 2006 testset.
  • 07/21/2006: clarified that the limit of 7000 characters is in terms of non-whitespace characters.
  • 07/28/2006: modified output format to be consistent with the check script.