Two papers from UC Berkeley: Exploring LLM-enhanced diffusion models for text-to-image generation

Today I read two papers with similar content, both focusing on how to use Large Language Models (LLMs) to control Diffusion Models.

Both papers are from UC Berkeley: "LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models" and "Self-correcting LLM-controlled Diffusion Models", released in May and November of last year, respectively.

The URLs for the related projects are as follows:

  • https://llm-grounded-diffusion.github.io

  • https://self-correcting-llm-diffusion.github.io



Let's first look at the first paper: "LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models".

The overall pipeline: Text Prompt -> Large Language Model (LLM) -> Intermediate Representation (e.g., image layout) -> Stable Diffusion -> Image


Diffusion models still struggle with complex prompts, especially those involving object counts and spatial reasoning. This paper proposes LMD, a method for enhancing the prompt understanding of text-to-image diffusion models.

The LMD method leverages a pre-trained large language model (LLM) to achieve grounded generation through a novel two-stage process:

  • First, the LLM generates a scene layout from the given prompt, consisting of captioned bounding boxes that describe the desired image.
  • Second, a novel controller guides an off-the-shelf diffusion model to generate the image from that layout.

Both stages use existing pre-trained models without requiring additional model parameter optimization.
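
To make the two-stage process concrete, here is a minimal sketch of the idea in Python. It is not the authors' code: `call_llm` is a hypothetical wrapper around any chat LLM, and diffusers' GLIGEN pipeline stands in for the paper's training-free layout-to-image controller.

```python
# Sketch of the LMD two-stage idea (not the authors' code).
# Stage 1: an LLM turns the text prompt into captioned bounding boxes.
# Stage 2: a layout-conditioned diffusion model renders that layout.
# diffusers' GLIGEN pipeline stands in for the paper's training-free
# controller, and `call_llm` is a hypothetical chat-LLM wrapper.
import json

import torch
from diffusers import StableDiffusionGLIGENPipeline

LAYOUT_INSTRUCTION = (
    "You are a layout planner. Given an image prompt, return JSON with "
    '"background" (a short caption) and "objects", a list of '
    "[phrase, [xmin, ymin, xmax, ymax]] entries with coordinates in [0, 1]."
)


def call_llm(system_prompt: str, user_prompt: str) -> str:
    """Hypothetical wrapper around any chat LLM; returns the reply text."""
    raise NotImplementedError


def generate(prompt: str):
    # Stage 1: text prompt -> scene layout (captioned bounding boxes).
    layout = json.loads(call_llm(LAYOUT_INSTRUCTION, prompt))
    phrases = [obj[0] for obj in layout["objects"]]
    boxes = [obj[1] for obj in layout["objects"]]

    # Stage 2: layout -> image with an off-the-shelf grounded diffusion model.
    pipe = StableDiffusionGLIGENPipeline.from_pretrained(
        "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
    ).to("cuda")
    return pipe(
        prompt=layout["background"],
        gligen_phrases=phrases,
        gligen_boxes=boxes,
        gligen_scheduled_sampling_beta=1.0,
        num_inference_steps=50,
    ).images[0]
```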


LMD naturally enables:

a. Performing multi-round scene specification based on instructions (see the sketch after this list);

b. Generating images from prompts in languages not supported by the base diffusion model.
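
A rough sketch of how multi-round scene specification might look, assuming a hypothetical `call_llm_chat` helper: because the layout lives in the LLM conversation, each follow-up instruction revises the layout before the image is re-rendered.

```python
# Rough sketch of multi-round scene specification: the dialogue history
# carries the current layout, so each new instruction revises it before the
# scene is re-rendered. `call_llm_chat` is a hypothetical chat-LLM helper.
import json


def call_llm_chat(messages: list[dict]) -> str:
    """Hypothetical wrapper around any chat LLM; returns the reply text."""
    raise NotImplementedError


messages = [
    {"role": "system", "content": "Return the scene layout as JSON: a "
     "background caption plus [phrase, box] pairs with boxes in [0, 1]."},
    {"role": "user", "content": "A gray cat and an orange dog on the grass."},
]
layout = json.loads(call_llm_chat(messages))  # initial layout

# Round 2: the user refines the scene and the LLM edits the same layout.
messages += [
    {"role": "assistant", "content": json.dumps(layout)},
    {"role": "user", "content": "Move the dog to the left of the cat and "
     "add a red ball between them."},
]
layout = json.loads(call_llm_chat(messages))  # revised layout, rendered as above
```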


Comparison

The LMD method significantly outperforms the base diffusion model and several strong baselines in accurately generating images described by prompts requiring various capabilities, doubling the generation accuracy on average across four tasks.



Now let's look at the second paper, "Self-correcting LLM-controlled Diffusion Models".

Existing diffusion-based text-to-image generators (e.g., DALL-E 3) often struggle to generate images that are precisely aligned with complex input prompts, especially those involving object counts and spatial relationships.

The SLD framework enables these diffusion models to automatically and iteratively correct inaccuracies by applying a series of latent space operations (addition, deletion, repositioning, etc.), thereby achieving better text-to-image alignment.

Features of the SLD framework include:

  1. Self-correction: enhances the generative model with an integrated LLM and object detector to achieve precise text-to-image alignment.
  2. Unified generation and editing: performs well on both image generation and fine-grained image editing.
  3. Universal compatibility: applicable to any image generator, such as DALL-E 3, without requiring additional training or data.

SLD enhances text-to-image alignment through an iterative self-correction process. It starts with object detection driven by LLMs, followed by LLM-controlled analysis and correction.
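
A minimal sketch of that loop, with hypothetical helpers (`detect_objects`, `llm_analyze`, `apply_latent_operations`) standing in for the paper's detector, LLM analysis, and latent-editing components:

```python
# Minimal sketch of SLD's check-and-correct loop (not the authors' code).
# `detect_objects`, `llm_analyze`, and `apply_latent_operations` are
# hypothetical helpers standing in for the detector, the LLM analysis step,
# and the latent-space editing step described in the paper.


def detect_objects(image, prompt):
    """Hypothetical: open-vocabulary detection of the prompt's key objects."""
    raise NotImplementedError


def llm_analyze(prompt, detections):
    """Hypothetical: LLM compares detections with the prompt, returns edits."""
    raise NotImplementedError


def apply_latent_operations(image, operations):
    """Hypothetical: add / delete / reposition objects via latent operations."""
    raise NotImplementedError


def self_correct(image, prompt, max_rounds: int = 3):
    for _ in range(max_rounds):
        # 1. LLM-driven object detection (e.g. with an open-vocabulary
        #    detector queried on the object phrases parsed from the prompt).
        detections = detect_objects(image, prompt)

        # 2. LLM-controlled analysis: compare detections with the prompt and
        #    propose edits such as addition, deletion, or repositioning.
        operations = llm_analyze(prompt, detections)
        if not operations:  # the image already matches the prompt
            return image

        # 3. Correction: apply the edits as latent-space operations and
        #    regenerate the image from the edited latents.
        image = apply_latent_operations(image, operations)
    return image
```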


The latent operations of SLD can be summarized into two key concepts:

  1. The latent representations of removed regions are re-initialized as Gaussian noise, while the latents of newly added or modified objects are composited onto the canvas.
  2. Latent composition is restricted to the initial steps; the remaining steps run the standard, "unfrozen" diffusion process, which improves visual quality and avoids an artificial copy-paste look (see the sketch below).
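
A rough illustration of these two ideas, assuming a generic latent diffusion model; `denoise_step` is a hypothetical stand-in for one denoising step, and the paper's actual composition procedure differs in detail:

```python
# Rough illustration of the two latent-operation ideas (an assumption-laden
# sketch, not the paper's implementation). `denoise_step` stands in for one
# reverse-diffusion step of a generic latent diffusion model, and the object
# latents are assumed to be prepared at the matching noise level.
import torch


def denoise_step(latent: torch.Tensor, t: int) -> torch.Tensor:
    """Hypothetical: one denoising step of the base diffusion model."""
    raise NotImplementedError


def corrected_generation(base_latent, removal_mask, object_latent, object_mask,
                         num_steps: int = 50, frozen_steps: int = 20):
    # Idea 1: removed regions restart from Gaussian noise, while the latents
    # of newly added or modified objects are composited onto the canvas.
    latent = base_latent.clone()
    latent[removal_mask] = torch.randn_like(latent)[removal_mask]

    for t in range(num_steps):
        if t < frozen_steps:
            # Idea 2: composition is limited to the initial steps ...
            latent = torch.where(object_mask, object_latent, latent)
        # ... after which the latents are "unfrozen" and the standard
        # diffusion steps run unconstrained, blending the edit into the scene.
        latent = denoise_step(latent, t)
    return latent
```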


By leveraging the strong localization capabilities of the OWL-ViT v2 open-vocabulary detector, SLD can accurately identify every seagull in the image and selectively remove them to satisfy the user's prompt, as illustrated in the paper's figure.
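
For reference, here is a minimal sketch of this kind of open-vocabulary query using the OWLv2 checkpoints available in Hugging Face transformers; the checkpoint name, image file, and score threshold are illustrative assumptions rather than the paper's exact setup.

```python
# Minimal open-vocabulary detection sketch with OWLv2 via Hugging Face
# transformers. The checkpoint, image file, and score threshold are
# illustrative assumptions, not the paper's exact configuration.
import torch
from PIL import Image
from transformers import Owlv2ForObjectDetection, Owlv2Processor

checkpoint = "google/owlv2-base-patch16-ensemble"
processor = Owlv2Processor.from_pretrained(checkpoint)
model = Owlv2ForObjectDetection.from_pretrained(checkpoint)

image = Image.open("beach.jpg")        # hypothetical input image
queries = [["a photo of a seagull"]]   # one list of text queries per image

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw predictions to (score, box) pairs in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.3, target_sizes=target_sizes
)[0]

for score, box in zip(results["scores"], results["boxes"]):
    print(f"seagull  score={score:.2f}  box={[round(v) for v in box.tolist()]}")
```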


SLD improves the text-to-image alignment accuracy of various diffusion-based generators, including SDXL, LMD+, and DALL-E 3. As shown in the following figure, the red box in the first row indicates that SLD precisely places the blue bicycle relative to the bench and the palm tree while keeping the number of palm trees and seagulls correct. The second row further demonstrates SLD's robustness in complex, cluttered scenes, where it resolves object collisions through training-free latent operations.


SLD can handle a variety of image-editing tasks guided by natural, human-like instructions. Its capabilities range from adjusting the number of objects to changing object attributes, positions, and sizes, as shown in the following figure.

Comparison

In these comparisons, SLD performs significantly better at executing such object-level edits.