Master’s Thesis Presentation • Artificial Intelligence • Understanding and Enforcing Precise Control in Generative Models via Graph-Based Attention

Wednesday, May 14, 2025 1:30 pm - 2:30 pm EDT (GMT -04:00)

Please note: This master’s thesis presentation will take place online.

Achint Soni, Master’s candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Sirisha Rambhatla

Generative models have advanced significantly in recent years, enabling new capabilities for data generation, manipulation, and editing. However, their practical applicability depends heavily on their ability to disentangle the underlying factors of variation, which is what allows precise and controllable modifications. This thesis explores disentanglement from two complementary perspectives: latent-space disentanglement in Variational Autoencoders (VAEs) and spatial disentanglement in diffusion-based text-guided image editing.

In the first part of the thesis, we investigate the mechanisms behind disentanglement in VAEs. By proposing a local non-linear approximation of the VAE decoder, we provide a rigorous theoretical analysis that reveals orthogonality of the decoder’s Jacobian as a fundamental condition for disentanglement. To support this finding, we introduce a quantitative measure termed the Orthogonality Deviation Score (OD-Score) and empirically demonstrate across multiple benchmark datasets (dSprites, 3D Faces, 3D Shapes, and MPI3D) that increased orthogonality directly corresponds to improved disentanglement as measured by established metrics such as Mutual Information Gap (MIG) and MIG-Sup.
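
To make the idea of Jacobian orthogonality concrete, here is a minimal sketch in Python. It is not the thesis's OD-Score, whose definition is not given in this abstract. It is a hypothetical proxy that estimates a decoder's Jacobian by finite differences and measures how far its columns are from being mutually orthogonal. The function names, the toy linear decoders, and the scoring formula are all assumptions made for illustration.

```python
import numpy as np

def numerical_jacobian(decoder, z, eps=1e-5):
    """Estimate the decoder Jacobian at z using central differences."""
    z = np.asarray(z, dtype=float)
    base = decoder(z)
    jac = np.zeros((base.size, z.size))
    for i in range(z.size):
        step = np.zeros_like(z)
        step[i] = eps
        jac[:, i] = (decoder(z + step) - decoder(z - step)) / (2 * eps)
    return jac

def orthogonality_deviation(jac):
    """Hypothetical proxy (not the thesis's OD-Score): Frobenius norm of the
    off-diagonal part of the Gram matrix of the unit-normalised columns.
    A value of 0 means the columns are mutually orthogonal."""
    norms = np.linalg.norm(jac, axis=0)
    unit = jac / np.maximum(norms, 1e-12)
    gram = unit.T @ unit
    off_diag = gram - np.diag(np.diag(gram))
    return float(np.linalg.norm(off_diag, "fro"))

# Toy decoders (assumed for illustration): each latent moves the output
# along an orthogonal direction in the first, but not in the second.
W_orth = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
W_ent = np.array([[1.0, 1.0], [0.0, 1.0], [0.0, 0.0]])

z0 = np.array([0.3, -0.2])
orth_score = orthogonality_deviation(numerical_jacobian(lambda z: W_orth @ z, z0))
ent_score = orthogonality_deviation(numerical_jacobian(lambda z: W_ent @ z, z0))
print(orth_score, ent_score)
```

In this toy setting the first decoder scores close to zero and the second scores noticeably higher, which matches the intuition that latent directions with orthogonal effects on the output are easier to separate.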

In the second part, we address the challenge of spatial disentanglement in text-guided image editing using diffusion models. Traditional diffusion-based methods rely primarily on cross-attention maps derived from textual prompts to determine regions for editing, which often results in unintended alterations and compromised spatial coherence. To overcome this, we introduce LOCATEdit, a novel approach that refines attention maps using a graph-based regularization framework. LOCATEdit constructs a Cross and Self-Attention (CASA) graph, leveraging patch relationships derived from self-attention to promote spatial consistency and to constrain edits precisely within designated areas. Extensive evaluations on the PIE-Bench dataset show that LOCATEdit achieves superior performance in localized editing tasks, substantially outperforming existing baselines in both semantic alignment and background preservation.
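
The general idea of refining an attention map with a graph built from self-attention can be sketched as follows. This is not LOCATEdit's actual formulation, which the abstract does not specify. It is standard graph-Laplacian smoothing, used here as an assumed stand-in: patches are nodes, symmetrised self-attention weights are edges, and the refined map solves (I + λL)x = m for a cross-attention map m. The function names, the parameter `lam`, and the four-patch example are hypothetical.

```python
import numpy as np

def graph_laplacian(self_attn):
    """Build an unnormalised graph Laplacian from a patch-to-patch
    self-attention matrix (symmetrised, self-loops removed)."""
    s = np.asarray(self_attn, dtype=float)
    w = 0.5 * (s + s.T)
    np.fill_diagonal(w, 0.0)
    return np.diag(w.sum(axis=1)) - w

def refine_attention(cross_attn, self_attn, lam=1.0):
    """Assumed stand-in for graph-based refinement: minimise
    ||x - m||^2 + lam * x^T L x, whose solution is (I + lam * L)^-1 m."""
    m = np.asarray(cross_attn, dtype=float).ravel()
    lap = graph_laplacian(self_attn)
    return np.linalg.solve(np.eye(m.size) + lam * lap, m)

# Four patches: 0 and 1 attend strongly to each other, as do 2 and 3.
self_attn = np.array([
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.1, 0.0],
    [0.1, 0.1, 0.0, 0.8],
    [0.0, 0.0, 0.8, 0.0],
])
# Noisy cross-attention map: patch 0 is highlighted, its neighbour 1 less so.
noisy = np.array([0.9, 0.3, 0.1, 0.0])
refined = refine_attention(noisy, self_attn, lam=2.0)
print(refined)
```

Under these assumptions, strongly connected patches are pulled toward similar attention values while the total attention mass is preserved, so the highlighted region becomes more spatially consistent without spreading uniformly across the image.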

Together, these contributions offer a unified understanding of disentanglement in generative modeling, bridging theoretical insights from latent-space analysis with practical advancements in spatially coherent, text-guided image editing. Ultimately, this thesis provides a principled foundation for developing interpretable, reliable, and highly controllable generative systems.


Attend this master’s thesis presentation virtually on MS Teams.