11:30am - 12:30pm

Tuesday 20 May 2025

Leveraging Multimodality for Controllable Image Generation

PhD Viva Open Presentation - Gemma Canet Tarrés

Hybrid event - All Welcome!

Free

21BA02 - Arthur C Clarke, 2nd floor
University of Surrey
Guildford
Surrey
GU2 7XH


Abstract:
Recent advancements in deep learning have transformed the field of image generation, enabling the creation of highly realistic and visually compelling images. However, despite their impressive capabilities, state-of-the-art models often lack the fine-grained control needed to tailor outputs precisely. This challenge is particularly evident when user input is ambiguous or when multiple constraints must be satisfied simultaneously. Addressing these limitations, this thesis explores novel methods to constrain and guide the image generation process by leveraging multimodal inputs such as sketches, style, text, and exemplars.

The thesis begins with CoGS, a framework designed for style-conditioned, sketch-driven synthesis. By decoupling structure and appearance, CoGS empowers users to define coarse layouts via sketches and class labels and guide aesthetics using exemplar style images. A transformer-based encoder converts these inputs into a discrete codebook representation, which can be mapped into a metric space for fine-grained adjustments. This unification of search and synthesis allows iterative refinement, enabling users to explore diverse appearance possibilities and produce results that closely match their vision.
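For illustration only, a minimal PyTorch sketch of the underlying idea, not the CoGS architecture itself: separate encoders produce structure and appearance embeddings, and a nearest-neighbour codebook quantises their combination so the same discrete space can serve both synthesis and metric search. All module names, dimensions, and the toy encoders are assumptions.

```python
# Illustrative sketch (not the CoGS implementation): decoupled sketch/style
# encoding with a discrete codebook. All names and shapes are hypothetical.
import torch
import torch.nn as nn

class SketchEncoder(nn.Module):
    """Encodes a 1-channel sketch into a structure embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return self.net(x)

class StyleEncoder(nn.Module):
    """Encodes an RGB exemplar into an appearance embedding."""
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, dim))
    def forward(self, x):
        return self.net(x)

class Codebook(nn.Module):
    """Maps a continuous embedding to its nearest discrete code, so one
    space can support both synthesis and metric-space search/refinement."""
    def __init__(self, num_codes=512, dim=256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, dim)
    def forward(self, z):
        d = torch.cdist(z, self.codes.weight)   # distances to all codes
        idx = d.argmin(dim=1)                   # discrete token ids
        return self.codes(idx), idx

sketch_enc, style_enc, codebook = SketchEncoder(), StyleEncoder(), Codebook()
sketch = torch.randn(2, 1, 64, 64)   # coarse layout sketch
style = torch.randn(2, 3, 64, 64)    # exemplar style image
z = torch.cat([sketch_enc(sketch), style_enc(style)], dim=1)
z_q, tokens = codebook(z)            # quantised code would feed the generator
print(tokens.shape, z_q.shape)
```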

Building on this idea, PARASOL advances control by enabling disentangled, parametric control of the visual style. This multimodal synthesis model conditions a latent diffusion framework on both content and fine-grained style embeddings, ensuring independent yet complementary control of each modality. Using a novel training strategy based on auxiliary search-driven triplets, PARASOL introduces precise style manipulation while preserving content integrity. Beyond creative applications like generation or stylization, this capability enhances generative search workflows, allowing users to adapt text-based search results by interpolating content and style descriptors, opening new avenues for personalization and refinement in image synthesis.
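As a hedged sketch of this kind of conditioning (not the PARASOL implementation), the toy denoiser below receives separate content and style embeddings alongside the noisy latent and timestep, and a standard triplet margin loss stands in for the auxiliary search-driven triplet supervision. Shapes, module names, and the loss wiring are illustrative assumptions.

```python
# Illustrative sketch: a denoiser conditioned jointly on content and style
# embeddings, with a triplet loss of the kind that (anchor, positive,
# negative) style triplets could supervise. Hypothetical names throughout.
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Predicts noise from a latent plus separate content/style conditions."""
    def __init__(self, latent_dim=64, cond_dim=128):
        super().__init__()
        self.content_proj = nn.Linear(cond_dim, latent_dim)
        self.style_proj = nn.Linear(cond_dim, latent_dim)
        self.net = nn.Sequential(
            nn.Linear(3 * latent_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim))
    def forward(self, z_t, t, content, style):
        h = torch.cat([z_t, self.content_proj(content),
                       self.style_proj(style), t], dim=1)
        return self.net(h)

denoiser = ConditionedDenoiser()
z_t = torch.randn(4, 64)                     # noisy latent
t = torch.rand(4, 1)                         # diffusion timestep
content, style = torch.randn(4, 128), torch.randn(4, 128)
eps_pred = denoiser(z_t, t, content, style)  # noise prediction

# Style metric supervised with search-mined triplets (anchor, pos, neg)
triplet = nn.TripletMarginLoss(margin=0.2)
anchor, pos, neg = torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128)
loss = triplet(anchor, pos, neg)
print(eps_pred.shape, loss.item())
```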

Expanding to conditioning on exemplars, the third contribution addresses the novel challenge of 'unconstrained generative object compositing'. This task involves seamlessly integrating objects into background images without requiring explicit positional guidance. By training a diffusion-based model on paired synthetic data, the approach autonomously handles tasks such as object placement, scaling, lighting harmonization, and generating realistic effects like shadows and reflections. Notably, the model explores diverse, natural placements when no positional input is provided, enabling flexibility and accelerating workflows. This solution surpasses existing methods in realism and user satisfaction, setting a new standard for generative compositing.
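A minimal sketch of how such unconstrained conditioning could be wired, assuming a simple convolutional denoiser rather than the thesis architecture: the model receives the background, the noisy composite, an exemplar object embedding, and an optional position mask; when no mask is supplied, an all-zero mask channel leaves placement to the model. All names and shapes are illustrative.

```python
# Illustrative sketch: compositing denoiser with an optional position mask.
import torch
import torch.nn as nn

class CompositingDenoiser(nn.Module):
    def __init__(self, obj_dim=128):
        super().__init__()
        # noisy composite RGB (3) + background RGB (3) + mask (1) = 7 channels
        self.backbone = nn.Sequential(
            nn.Conv2d(7, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1))
        self.obj_proj = nn.Linear(obj_dim, 64)
    def forward(self, noisy, background, obj_embed, mask=None):
        if mask is None:                    # unconstrained: no position given
            mask = torch.zeros_like(noisy[:, :1])
        x = torch.cat([noisy, background, mask], dim=1)
        h = self.backbone[0:2](x)           # first conv + activation
        h = h + self.obj_proj(obj_embed)[:, :, None, None]  # inject object
        return self.backbone[2:](h)         # predict noise / residual

model = CompositingDenoiser()
noisy = torch.randn(1, 3, 64, 64)
background = torch.randn(1, 3, 64, 64)
obj_embed = torch.randn(1, 128)             # exemplar object embedding
eps_hint = model(noisy, background, obj_embed, mask=torch.ones(1, 1, 64, 64))
eps_free = model(noisy, background, obj_embed)   # model chooses placement
print(eps_hint.shape, eps_free.shape)
```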

Finally, the thesis culminates in a model for simultaneous multi-object compositing, combining text, layout, and exemplar-based inputs. Designed to handle complex scenarios, the model captures interactions between objects, ranging from spatial arrangements to dynamic actions like 'playing guitar' or 'hugging', while autonomously generating supporting elements such as props. By jointly training for compositing and subject-driven generation, the model achieves seamless integration of textual and visual cues, producing coherent and compelling multi-object scenes.
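To illustrate one plausible way of fusing the three input modalities (an assumption, not the thesis model), the snippet below concatenates text, layout, and exemplar object tokens into a single conditioning sequence that the generator's latent tokens attend to via cross-attention, which is one way interactions between multiple objects and cues could be captured.

```python
# Illustrative sketch: fusing text, layout, and exemplar tokens with
# cross-attention. Dimensions and token roles are assumptions.
import torch
import torch.nn as nn

dim = 128
attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)

text_tokens = torch.randn(1, 8, dim)     # e.g. "a person playing guitar"
layout_tokens = torch.randn(1, 2, dim)   # one box embedding per object
object_tokens = torch.randn(1, 2, dim)   # exemplar embeddings (person, guitar)

# Conditioning sequence seen by the generator's cross-attention layers
cond = torch.cat([text_tokens, layout_tokens, object_tokens], dim=1)

latent_tokens = torch.randn(1, 64, dim)  # image latent as a token grid
fused, _ = attn(query=latent_tokens, key=cond, value=cond)
print(fused.shape)                       # (1, 64, 128): latents attend to all cues
```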

Together, these contributions form a cohesive framework for controllable image generation, addressing challenges in structural, stylistic, and compositional control. By leveraging diverse input modalities, these methods narrow the generation space, producing outputs more closely aligned with user intent and unlocking greater precision and new creative possibilities.