Abstract
Diffusion-based purification (DBP) has become a cornerstone defense against adversarial examples (AEs), regarded as robust because it uses diffusion models (DMs) to project AEs back onto the natural data manifold. We refute this core claim, theoretically proving that gradient-based attacks effectively target the DM rather than the classifier, causing DBP’s outputs to align with adversarial distributions. This prompts a reassessment of DBP’s robustness, attributing it to two critical factors: inaccurate gradients and improper evaluation protocols that test only a single random purification of the AE. We show that once stochasticity and resubmission risk are accounted for, DBP collapses. To support this, we introduce DiffBreak, the first reliable toolkit for differentiation through DBP, which eliminates the gradient mismatches that further inflated prior robustness estimates. We also analyze the current defense scheme used for DBP, in which classification relies on a single purification, and pinpoint its inherent invalidity. We provide a statistically grounded majority-vote (MV) alternative that aggregates predictions across multiple purified copies, showing a partial but meaningful robustness gain. We then propose a novel adaptation of an optimization method against deepfake watermarking, crafting systemic perturbations that defeat DBP even under MV, challenging DBP’s viability.
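For intuition, here is a minimal sketch of the majority-vote evaluation idea described above: an input is purified several times independently, each copy is classified, and the majority label is returned. The `purify` and `classifier` callables are hypothetical placeholders (a stochastic DBP pass and a downstream classifier), not the paper's DiffBreak API.

```python
import torch

def majority_vote_predict(x, purify, classifier, n_copies=8):
    """Aggregate predictions over multiple independent stochastic
    purifications of the same input and return the majority label.

    Assumptions: `purify(x)` runs one randomized DBP pass (forward
    noising + reverse diffusion) and `classifier(x)` returns logits.
    """
    votes = []
    with torch.no_grad():
        for _ in range(n_copies):
            x_pur = purify(x)                  # independent random purification
            logits = classifier(x_pur)         # classify the purified copy
            votes.append(logits.argmax(dim=-1))
    votes = torch.stack(votes, dim=0)          # shape: (n_copies, batch)
    # Majority label per example across the purified copies
    return torch.mode(votes, dim=0).values
```

This sketch only illustrates the aggregation step; it does not capture the paper's statistical grounding of the vote or the systemic perturbations that defeat it.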
Type
Publication
Advances in Neural Information Processing Systems (NeurIPS)