Generative AI has significantly advanced text-driven image generation, but it still struggles to produce outputs that consistently align with evolving user preferences and intents, particularly in multi-turn dialogue scenarios. In this work, we present a Visual Co-Adaptation (VCA) framework that incorporates human-in-the-loop feedback, using a well-trained reward model designed to align closely with human preferences. Leveraging a diverse multi-turn dialogue dataset, the framework applies multiple reward functions (covering diversity, consistency, and preference feedback) while fine-tuning the diffusion model through LoRA, thereby optimizing image generation based on user input. We also construct multi-round dialogue datasets of prompt-image pairs that closely match user intent. Experiments show that our model achieves 508 wins in human evaluation, outperforming DALL-E 3 (463 wins) and other baselines. It also reaches satisfactory images in an average of 3.4 dialogue rounds (vs. 13.7 for DALL-E 3) and excels on metrics such as LPIPS (0.15) and BLIP (0.59). Extensive experiments demonstrate the effectiveness of the proposed method over state-of-the-art baselines, with significant improvements in image consistency and alignment with user intent.
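To make the LoRA fine-tuning concrete, below is a minimal PyTorch sketch of a low-rank adapter wrapped around a frozen linear projection, so that only the adapter weights receive gradients during preference optimization. The class name and the rank/alpha values are illustrative placeholders, not the paper's configuration.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained projection plus a trainable low-rank update:
    y = W x + (alpha / rank) * B A x, with B zero-initialized."""
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay fixed
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# e.g., wrapping one cross-attention projection of a diffusion UNet
proj = LoRALinear(nn.Linear(320, 320), rank=4)
```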
The workflow demonstrates how human preferences guide text-to-image diffusion, with a DPO-trained reward model evaluating image-prompt alignment and PPO updating LoRA parameters while keeping the diffusion model fixed.
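A hedged sketch of the PPO side of this workflow, assuming a DDPO-style clipped surrogate over per-step denoising log-probabilities; the tensors below are toy stand-ins, and in the actual loop only the LoRA parameters would be handed to the optimizer while the base diffusion weights stay frozen.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO surrogate: raise the likelihood of denoising trajectories
    whose images scored well under the DPO-trained reward model."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Toy stand-ins for per-sample log-probabilities and reward advantages.
logp_new = torch.randn(8, requires_grad=True)
logp_old = logp_new.detach() + 0.05 * torch.randn(8)
advantages = torch.randn(8)
ppo_clip_loss(logp_new, logp_old, advantages).backward()
```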
Overview of our multi-round dialogue generation process. (a) shows how prompts and feedback refine images over rounds. (b) compares multi-round user correction with single-round self-correction. (c) illustrates the diffusion process with LoRA layers and text embeddings. The total reward $R_{\text{total}}$ balances diversity, consistency, and mutual information across rounds.
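One plausible reading of $R_{\text{total}}$ is a weighted sum of the three per-round reward terms the caption names; the sketch below assumes that form, with placeholder coefficients rather than values reported by the paper.

```python
def total_reward(r_diversity, r_consistency, r_mutual_info,
                 weights=(0.4, 0.4, 0.2)):
    """Hypothetical weighted combination of the per-round reward terms."""
    w_d, w_c, w_m = weights
    return w_d * r_diversity + w_c * r_consistency + w_m * r_mutual_info

print(total_reward(0.7, 0.9, 0.5))  # 0.74 with the placeholder weights
```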
Distribution of selected datasets and visual styles.
Comparison of preference and CLIP scores across different models. The top half of the figure illustrates user preferences for various models, including SD v1.4, P2P, Muse, DALL-E 3, Imagen, and CogView2. The bottom half shows the corresponding CLIP scores for each model. Box plots represent the distribution of scores, with red dots indicating the mean values.
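For reference, a CLIP score such as the one plotted here is typically the cosine similarity between CLIP image and text embeddings; a minimal sketch using Hugging Face transformers with the openai/clip-vit-base-patch32 checkpoint (the paper may use a different CLIP variant or score scaling):

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.new("RGB", (224, 224))  # placeholder; use a generated image
prompt = "a watercolor painting of a fox"

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
score = torch.nn.functional.cosine_similarity(img_emb, txt_emb).item()
print(f"CLIP score: {score:.3f}")
```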
Win, tie, and lose rates of our model compared to Random, CLIP Score, Aesthetic, and BLIP Score across different image selection scenarios (9, 25, and 64 images).
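Each selection baseline in this comparison reduces to picking the top candidate among n generated images under some scorer; a minimal sketch, where the scorer argument stands in for CLIP, aesthetic, or BLIP scoring and the Random baseline samples uniformly instead:

```python
import random

def select_best(candidates, prompt, scorer=None):
    """Pick one image out of n candidates (n = 9, 25, or 64 in the figure).
    With no scorer this is the Random baseline; otherwise the candidate
    with the highest image-prompt score wins."""
    if scorer is None:
        return random.choice(candidates)
    return max(candidates, key=lambda img: scorer(img, prompt))
```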
Heatmap showing the win rates of various generative models across dialogue interactions.
The figure compares the performance of SD v2.1, Imagen, CogView2, DALL-E 3, P2P (Prompt-to-Prompt), and our model in modifying images based on user instructions.