OMR-Diffusion: Optimizing Multi-Round Enhanced Training in Diffusion Models for Improved Intent Understanding

Xiamen University, University of Electronic Science and Technology of China, University of Minnesota - Twin Cities, Tsinghua University

Abstract

Generative AI has significantly advanced text-driven image generation, but it still struggles to produce outputs that consistently align with evolving user preferences and intents, particularly in multi-turn dialogue scenarios. In this research, we present a Visual Co-Adaptation (VCA) framework that incorporates human-in-the-loop feedback, using a well-trained reward model designed to closely align with human preferences. Leveraging a diverse multi-turn dialogue dataset, the framework applies multiple reward functions, such as diversity, consistency, and preference feedback, while fine-tuning the diffusion model through LoRA, effectively optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompts and image pairs that closely match user intent. Experiments show that our model achieves 508 wins in human evaluation, outperforming DALL-E 3 (463 wins) and other baselines. It also requires only 3.4 dialogue rounds on average (vs. 13.7 for DALL-E 3) and excels on metrics such as LPIPS (0.15) and BLIP (0.59). Extensive experiments demonstrate the effectiveness of the proposed method over state-of-the-art baselines, with significant improvements in image consistency and alignment with user intent.
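To make the LoRA component of this pipeline concrete, the minimal PyTorch sketch below wraps a frozen linear layer with trainable low-rank adapters, mirroring the idea of adapting the diffusion model without touching its pretrained weights. The rank, scaling, and layer sizes here are illustrative assumptions, not settings from the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update.

    The effective weight is W + (alpha / r) * B @ A, where only A and B
    are trained. Rank r and scaling alpha are illustrative choices.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad = False
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the trainable low-rank residual path.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling

# Toy usage: only the LoRA matrices receive gradients.
layer = LoRALinear(nn.Linear(320, 320))
layer(torch.randn(4, 320)).sum().backward()
assert layer.base.weight.grad is None and layer.lora_a.grad is not None
```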


The workflow shows how human preferences guide text-to-image diffusion: a DPO-trained reward model evaluates image-prompt alignment, and PPO updates the LoRA parameters while the underlying diffusion model remains frozen.
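The two training signals in this workflow follow standard formulations. The sketch below gives a minimal PyTorch version of a DPO pairwise preference loss (for the reward model) and a PPO clipped surrogate (for updating the LoRA parameters); the hyperparameters beta and eps, and the exact form of both losses, are assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_win: torch.Tensor, logp_lose: torch.Tensor,
             ref_logp_win: torch.Tensor, ref_logp_lose: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: rank the human-preferred sample above the
    rejected one, relative to a frozen reference model. beta is assumed."""
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return -F.logsigmoid(margin).mean()

def ppo_clip_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                  advantage: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Standard PPO clipped surrogate; in this setting the gradients would
    flow only into the LoRA parameters, leaving the base model frozen."""
    ratio = torch.exp(logp_new - logp_old)
    return -torch.min(ratio * advantage,
                      torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage).mean()
```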


Overview of our multi-round dialogue generation process. (a) shows how prompts and feedback refine images over rounds. (b) compares multi-round user correction with single-round self-correction. (c) illustrates the diffusion process with LoRA layers and text embeddings. The total reward R_total balances diversity, consistency, and mutual information across rounds.
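The caption above names a total reward R_total that balances diversity, consistency, and mutual information. A weighted sum is one natural reading of that balance; the sketch below encodes it, with placeholder weights that are not values reported by the authors.

```python
def total_reward(r_div: float, r_cons: float, r_mi: float,
                 w_div: float = 1.0, w_cons: float = 1.0, w_mi: float = 1.0) -> float:
    """Weighted combination of the per-round reward terms.

    r_div  : diversity reward across rounds
    r_cons : cross-round consistency reward
    r_mi   : mutual-information (prompt-image alignment) reward
    The weights are illustrative placeholders, not paper values.
    """
    return w_div * r_div + w_cons * r_cons + w_mi * r_mi

# Example: a round with high consistency but modest diversity.
print(total_reward(r_div=0.3, r_cons=0.8, r_mi=0.6))  # 1.7
```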


Distribution of selected datasets and visual styles.


Comparison of Preference and CLIP Score across different models. The top half of the figure illustrates user preferences for various models, including SD v1.4, P2P, Muse, DALL-E 3, Imagen, and CogView2. The bottom half shows the corresponding CLIP scores for each model. Box plots represent the distribution of scores, with red dots indicating the mean values.


Win, tie, and lose rates of our model compared to Random, CLIP Score, Aesthetic, and BLIP Score across different image selection scenarios (9, 25, and 64 images).


This heatmap shows the win rates of various generative models across dialogue interactions.


The figure compares the performance of SD v2.1, Imagen, CogView2, DALL·E 3, P2P (Prompt-to-Prompt), and our model in modifying images based on user instructions.