Google's Magic Insert: drag a subject into a target image for style-aware, realistic insertion

Yesterday I shared a Google research project, and today I'll share another one. This work currently has only a paper; no code has been open-sourced yet. It is titled "Magic Insert: Style-Aware Drag-and-Drop".

Through Magic Insert, the main subject of an image can be dragged and dropped into another target image with a completely different style, achieving a style-aware and realistic insertion effect.

Effect

The examples demonstrate the effectiveness and diversity of style-aware insertion, covering subjects and target backgrounds in a wide range of artistic styles, from realistic scenes to cartoons and paintings.

LLM-guided pose adjustment

The example shows LLM-guided pose editing in Magic Insert: the LLM suggests plausible poses and environmental interactions for a region of the image, and Magic Insert generates a stylized subject in the corresponding pose and inserts it into the image.

Bootstrap domain adaptation results

Using a pre-trained subject insertion module without bootstrap domain adaptation leads to suboptimal results, with failure modes such as missing shadows and reflections, or added distortions and artifacts.

Style-aware personalization with attribute modification

Style-aware personalization allows modifying key attributes of the subject, such as those shown in the figure, while consistently applying the target style during generation. This enables redesigning characters or adding accessories, greatly increasing flexibility for creative uses. Notably, this capability disappears when using ControlNet.

Editability / Fidelity Trade-off

Generations of the "Space Navy" subject at different numbers of fine-tuning iterations (shown in the figure), using a "green ships" style and an added "sitting on the ground" text prompt, illustrate the editability/fidelity trade-off: the longer the style-aware personalized model is fine-tuned on the subject, the higher the subject fidelity, but the less flexible the editing of poses and other semantic attributes becomes. Style editability may be affected as well.

Method

To generate a subject that both respects the target image's style and retains the subject's essence and identity, Magic Insert takes the following steps:

  1. Personalize the diffusion model in weight and embedding space by training LoRA increments on top of a pre-trained diffusion model, while simultaneously training two textual token embeddings with the diffusion denoising loss.
  2. Use the personalized diffusion model to generate a style-aware subject by embedding the target image's style and injecting style adapter features into selected upsampling layers of the model during the denoising process.
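
The LoRA part of step 1 can be illustrated with a minimal numpy sketch: the pretrained weight stays frozen, and only a low-rank increment is trained on top of it. All names and shapes here are hypothetical, not the paper's code.

```python
import numpy as np

def lora_forward(x, W, A, B, scale=1.0):
    """Frozen pretrained weight W plus a low-rank LoRA increment B @ A.

    W: (d_out, d_in) frozen pretrained weight
    A: (r, d_in), B: (d_out, r) trainable low-rank factors, r << min(d_in, d_out)
    """
    return x @ (W + scale * (B @ A)).T

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 4, 2
W = rng.normal(size=(d_out, d_in))       # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                 # zero-init: increment starts at zero

x = rng.normal(size=(1, d_in))
# With B zero-initialised, the LoRA branch contributes nothing at first,
# so the personalized model starts out identical to the pretrained one:
assert np.allclose(lora_forward(x, W, A, B), x @ W.T)
```

During fine-tuning only `A` and `B` receive gradients, so the number of trained parameters is `r * (d_in + d_out)` per layer instead of `d_in * d_out`.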

To insert the style-aware personalized subject into the target image, the following steps are performed:

  1. Paste the segmented subject onto the target image.
  2. Run the subject insertion model on this image; it creates contextual cues and realistically embeds the subject into the image, including shadows and reflections.
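
Step 1, the naive paste, is ordinary alpha compositing of the segmented subject onto the target. A minimal sketch in numpy (function name and argument layout are my own, not the paper's):

```python
import numpy as np

def paste_subject(target, subject, mask, top, left):
    """Composite a segmented subject crop onto a target image.

    target:  (H, W, 3) float image in [0, 1]
    subject: (h, w, 3) float crop of the segmented subject
    mask:    (h, w) alpha matte from segmentation, 1 = subject, 0 = background
    """
    out = target.copy()
    h, w = mask.shape
    region = out[top:top + h, left:left + w]
    alpha = mask[..., None]                       # broadcast over channels
    out[top:top + h, left:left + w] = alpha * subject + (1 - alpha) * region
    return out

# Demo: white 2x2 subject pasted at (1, 1) on a black 4x4 target.
target = np.zeros((4, 4, 3))
subject = np.ones((2, 2, 3))
mask = np.ones((2, 2))
out = paste_subject(target, subject, mask, top=1, left=1)
```

The result of this paste looks "cut out"; it is the insertion model in step 2 that adds the shadows, reflections, and relighting.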

The subject insertion model is improved with bootstrap domain adaptation, which adapts the model's effective domain using a subset of the model's own outputs. The specific steps are as follows:

  1. Use the subject removal/insertion model to remove subjects and their shadows from a dataset in the target domain.
  2. Filter out defective outputs and retrain the subject removal/insertion model on the filtered image set.
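
The filtering step above is a simple self-training loop: run the model, score its outputs, and keep only the good ones for retraining. A toy sketch with stand-in numbers instead of images (all function names are hypothetical):

```python
def bootstrap_filter(inputs, remove_fn, quality_fn, threshold=0.5):
    """One round of bootstrap filtering: run the model's removal step
    on each input and keep only outputs whose quality score passes
    the threshold. The kept set is what the model is retrained on."""
    outputs = [remove_fn(x) for x in inputs]
    return [out for out in outputs if quality_fn(out) >= threshold]

# Toy demo: "images" are numbers; "removal" halves them;
# the quality check keeps only positive results.
kept = bootstrap_filter(
    [4, -2, 6],
    remove_fn=lambda x: x / 2,
    quality_fn=lambda y: 1.0 if y > 0 else 0.0,
)
# kept == [2.0, 3.0]
```

In the paper the filtering is what prevents the model's own failure cases from being fed back into training, which is the point made in the next paragraph.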

We observe that the initial distribution (blue) shifts after training (purple): images that were initially processed incorrectly (red samples) are handled correctly afterwards (green samples). During bootstrap domain adaptation, training uses only the initially correct samples (green).


Comparison

Style-aware personalization is compared with the top baseline methods, StyleAlign + ControlNet and InstantStyle + ControlNet. The baselines can generate decent outputs but still lag behind Magic Insert's style-aware personalization in overall quality. In particular, the outputs of InstantStyle + ControlNet often look slightly blurry and fail to capture the contrast of the subject's features well.