
TurboEdit performs text-based image editing using a 3-4 step diffusion model.

Today I came across an interesting project, TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models. My mom no longer has to worry about me not being able to afford Photoshop.

The code has not been open-sourced yet, but I've already tried it out via the demo.

[Demo examples: several before/after pairs, each showing the image from the original description alongside the result of the modified prompt, followed by a gallery of further effects.]

Technical details

TurboEdit builds on a popular text-based editing framework: the "edit-friendly" DDPM noise-inversion method. The authors analyzed its behavior under fast sampling and attributed its failure to two issues: visual artifacts and insufficient editing strength. They traced the artifacts to a mismatch between the statistics of the inverted noise and the noise schedule the model expects, and proposed a shifted noise schedule to correct the deviation. To increase editing strength, they introduced a pseudo-guidance method that strengthens the edit without introducing new artifacts.

Solving the visual artifact problem

The research team observed that, with the "edit-friendly" method, the statistics of the inverted noise maps deviate significantly from their expected values at every step. In multi-step diffusion models these statistics converge later in the diffusion process, giving the model a chance to clean up any artifacts introduced along the way. In SDXL-Turbo, however, those later steps are skipped entirely, so the artifacts remain. The team found that the mismatched statistics roughly exhibit a time shift: the noise statistics approximately match the values expected 200 steps earlier. They therefore closed this domain gap by passing a timestep offset by 200 steps to both the scheduler and the model, which resolved the artifact issue.

As shown in the figure, when using SDXL-Turbo, the edit-friendly inversion produces noise statistics (red) that deviate from the expected values (green). The research team's simple time-offset realigns these statistics (blue and purple), significantly reducing artifact generation.
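Since the code is not yet released, here is a minimal sketch of the timestep-shift idea in diffusers-style Python. The function name, the fixed offset constant, and the call structure are assumptions for illustration, not TurboEdit's actual implementation:

```python
import torch

SHIFT = 200  # observed offset: inverted-noise statistics match ~200 steps earlier

def shifted_denoise_step(unet, scheduler, z_t, t, prompt_embeds):
    """One denoising step in which both the model and the scheduler are
    queried at a timestep shifted by SHIFT, realigning the statistics of
    the edit-friendly inverted noise with the schedule SDXL-Turbo expects.
    (Hypothetical helper; the official code is not yet public.)"""
    t_shifted = torch.clamp(torch.as_tensor(t) - SHIFT, min=0)
    noise_pred = unet(z_t, t_shifted, encoder_hidden_states=prompt_embeds).sample
    # The scheduler must see the same shifted timestep as the model.
    return scheduler.step(noise_pred, t_shifted, z_t).prev_sample
```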

Pseudo-guidance method

The research team analyzed the "edit-friendly" equation and proved that it can be decomposed into two parts: one controlling the strength of the prompt, and one transferring the original image onto the new trajectory. They proposed applying CFG-like rescaling only to the prompt term, and the results show that this indeed enhances editing strength without introducing new artifacts. For more details, please refer to the paper.

As shown in the figure, when sweeping the cross-prompt weight (wₚ, columns) and the cross-trajectory weight (wₜ, rows), the study found that scaling only the cross-prompt term improves the edit without generating artifacts.
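To make the decomposition concrete, here is an illustrative sketch of the pseudo-guidance idea in Python. The names eps_src and eps_tgt stand for the noise predictions under the source and target prompts; the exact formulation in the paper may differ:

```python
def pseudo_guided_noise(eps_src, eps_tgt, w_p):
    """Rescale only the cross-prompt difference (the w_p term from the
    figure above), leaving the cross-trajectory part of the edit-friendly
    step untouched. Illustrative placeholder, not the paper's code."""
    # w_p = 1 recovers the unscaled edit; w_p > 1 strengthens it, CFG-style.
    return eps_src + w_p * (eps_tgt - eps_src)
```

Raising w_p above 1 pushes the result toward the target prompt much like classifier-free guidance does, but because the cross-trajectory term is left alone, the sample stays on the inverted trajectory and no new artifacts appear.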

Equivalence of Edit-Friendly and Delta Denoising Score

While studying the "edit-friendly" DDPM process, the team noticed that its form closely resembles the correction used in Delta Denoising Score (DDS). Surprisingly, they proved that, with an appropriate choice of learning rate and timestep sampling, the two methods are functionally equivalent and produce identical results. The finding also extends to the recent Posterior Distillation Sampling (PDS) method, particularly in its application to image editing.
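For intuition, the Delta Denoising Score update is commonly written as a difference of two noise predictions; a sketch in standard DDS notation (not quoted from this paper):

$$
z \leftarrow z - \eta\,\bigl(\epsilon_\phi(z_t, y_{\mathrm{tgt}}, t) - \epsilon_\phi(\hat{z}_t, y_{\mathrm{src}}, t)\bigr),
$$

where $z$ is the edited latent, $\hat{z}_t$ is the noised source image, and $\eta$ is the learning rate. The equivalence result says that an edit-friendly DDPM step performs the same kind of update for a suitable choice of $\eta$ and timestep sampling.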

Comparison

Comparison with existing multi-step methods

We compared our four-step editing results with existing state-of-the-art multi-step editing methods. Our method matches or exceeds their image quality while holding a significant speed advantage: it is six times faster than the fastest baseline and as much as 630 times faster than the highest-scoring method.

Comparison with existing few-step methods

We further compared our method with other few-step editing methods. The results show that our method better matches the semantic intent of the edit while preserving the content of the original image, and it avoids the visual artifacts present in the "edit-friendly" baseline.