GLIGEN UI - Precisely specify object locations

Today I tried out the recently released GLIGEN UI.

GLIGEN is a method for precisely specifying the location of objects in text-to-image models. Last week, an intuitive graphical user interface (GUI) built on ComfyUI was open-sourced, which greatly simplifies working with GLIGEN.

I ran it directly with Pinokio, and the results were extremely satisfying!

Let me explain the principle behind the GLIGEN model:

GLIGEN was a research project jointly conducted last year by the University of Wisconsin-Madison, Columbia University, and Microsoft. The findings were published in a paper titled "GLIGEN: Open-Set Grounded Text-to-Image Generation."

GLIGEN (Grounded-Language-to-Image Generation) is an approach that extends existing pre-trained text-to-image models with grounding inputs. To preserve the extensive conceptual knowledge of the pre-trained model, all of its weights are frozen, and the grounding information is injected through new trainable layers via a gated mechanism. GLIGEN achieves open-world grounded text-to-image generation: a caption together with bounding boxes guides image generation. Its zero-shot performance on the COCO and LVIS datasets surpasses existing supervised layout-to-image baselines, with significant improvements in handling novel spatial configurations and concepts.

Model Design: Efficient Training and Flexible Inference

GLIGEN is built on an existing pre-trained diffusion model whose original weights are frozen to retain its large body of pre-trained knowledge. A new trainable gated self-attention layer is added to each transformer block to absorb the new grounding inputs. Each grounding token carries two types of information: the semantics of the grounded entity (encoded text or image) and its spatial position (encoded bounding box or keypoints).
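Here is a minimal PyTorch sketch of what such a gated self-attention layer could look like. It is illustrative only; the class and argument names are my assumptions, not the official implementation, but it follows the idea described in the paper: a learnable gate is initialized to zero, so the layer starts out as an identity and the frozen model's behavior is untouched at the beginning of training.

```python
import torch
import torch.nn as nn

class GatedSelfAttention(nn.Module):
    """Sketch of a GLIGEN-style gated self-attention layer (assumed names)."""

    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Learnable gate, initialized to 0 so the layer contributes nothing
        # at first and the frozen pre-trained behavior is preserved.
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor, grounding_tokens: torch.Tensor) -> torch.Tensor:
        # x: visual tokens (B, N_v, dim); grounding_tokens: (B, N_g, dim),
        # each built from an entity embedding plus an encoded position.
        h = self.norm(torch.cat([x, grounding_tokens], dim=1))
        attn_out, _ = self.attn(h, h, h)
        # Keep only the visual-token positions, scale by tanh(gamma),
        # and add the result residually to the frozen stream.
        return x + torch.tanh(self.gamma) * attn_out[:, : x.shape[1]]
```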

I. Modulated Training

Compared to other ways of adapting pre-trained diffusion models (such as full-model fine-tuning), GLIGEN trains only the newly added gated layers, continuing pre-training on large-scale grounding data (image-text-box triplets). This approach is far more cost-effective. Like LEGO bricks, differently trained layers can be plugged in to enable various new capabilities, as sketched below.
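As a rough sketch of this setup (assuming PyTorch, and an assumed naming convention for the new layers; `unet` and `"gated_attn"` are hypothetical), training freezes everything in the pre-trained model and hands only the new gated layers to the optimizer:

```python
import torch

def select_trainable_params(unet: torch.nn.Module):
    """Freeze pre-trained weights; collect only the new gated-layer params."""
    new_params = []
    for name, param in unet.named_parameters():
        if "gated_attn" in name:  # assumed naming for the new layers
            param.requires_grad = True
            new_params.append(param)
        else:
            param.requires_grad = False  # frozen pre-trained weight
    return new_params

# Usage sketch: optimize only the new layers.
# optimizer = torch.optim.AdamW(select_trainable_params(unet), lr=5e-5)
```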

II. Scheduled Sampling

As a benefit of GLIGEN's modulated training, scheduled sampling is supported during the diffusion process at inference: at each step the model can dynamically choose between using the grounding tokens (by keeping the new layers active) or falling back to the original diffusion model with its strong priors (by disabling the new layers), thus trading off generation quality against grounding fidelity.
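A hedged sketch of how such a schedule might look in a sampling loop follows; `set_gated_layers` and `denoise_step` are hypothetical helpers, and the cutoff fraction `tau` is a free parameter:

```python
def sample(model, x_T, num_steps: int = 50, tau: float = 0.3):
    """Grounded denoising for the first tau fraction of steps, then plain."""
    x = x_T
    for i in range(num_steps):
        grounded = i < tau * num_steps            # early steps fix the layout
        model.set_gated_layers(enabled=grounded)  # hypothetical toggle
        x = model.denoise_step(x, step=i)         # hypothetical step function
    return x
```

Grounding early and dropping it later fits how diffusion sampling behaves: the coarse layout of an image is largely decided in the early, high-noise steps, while later steps mostly refine visual detail.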