Master’s Thesis Presentation • Data Systems • Evaluating LLM Robustness Under Adversarial and Conflicting Evidence in Health Question Answering and Claim Verification

Monday, July 13, 2026 9:30 am - 10:30 am EDT (GMT -04:00)

Please note: This master’s thesis presentation will take place online.

Shakiba Amirshahi, Master’s candidate
David R. Cheriton School of Computer Science

Supervisors: Professors Charles Clarke, Amira Ghenai

Large language models (LLMs) are increasingly used in applications that rely on externally retrieved evidence, including health question answering, scientific claim verification, and retrieval-augmented generation (RAG). A fundamental question underlies these systems: do LLMs genuinely reason over the evidence they receive, or do they primarily follow the stance expressed in the provided documents? This thesis investigates this question through two complementary empirical studies that examine model behavior under harmful, adversarial, and conflicting evidence conditions across health question answering and claim verification tasks.

Study 1 evaluates RAG robustness in the health domain using expert-annotated collections from the TREC 2020 and 2021 Health Misinformation Tracks. Across six LLMs, eight document types, and three query framing conditions, results show that retrieved evidence strongly shapes model behavior regardless of its reliability. Helpful documents drive ground-truth alignment to near-ceiling levels, whereas adversarial documents generated from scratch can reduce alignment to near zero. Even a single helpful document within an otherwise adversarial retrieval pool substantially improves robustness, highlighting retrieval composition as a key factor in RAG performance. Models also demonstrate greater robustness on COVID-19 queries than on general health questions, suggesting that resistance to misleading evidence may vary across domains.

Study 2 extends the analysis to explicit claim verification, evaluating five LLMs across two domains: Check-COVID, a scientific verification benchmark, and Emergent, a journalistic rumor dataset. Under both single- and paired-document settings, models frequently reverse their verification decisions when evidence stance is flipped, struggle to maintain stable judgments under conflicting evidence, and exhibit sensitivity to document order. These vulnerabilities persist across both scientific and journalistic domains, suggesting that this stance-following behavior is not domain-specific but a broader limitation of current verification systems. Across both studies, adversarial documents generated from scratch are consistently more damaging than naturally occurring harmful content.

Taken together, the findings show that strong benchmark performance does not necessarily indicate robust evidence reasoning. Helpful evidence can mask differences between models, whereas adversarial evidence exposes substantial variation in robustness. These results highlight the need for evaluation protocols that explicitly test model behavior under misleading and conflicting evidence, and motivate future evidence-grounded systems that assess evidence credibility rather than simply reproducing its stance.


Attend this master’s thesis presentation virtually on Zoom.