Fuzzy Intent Strategy: Enhancing Intent Understanding for Ambiguous Prompts through Human-Machine Co-Adaptation

Datasets | Code
University of Minnesota - Twin Cities, Tsinghua University

Abstract

Modern image generation systems can produce realistic, high-quality images. Nevertheless, user prompts often contain ambiguities that make it difficult for these systems to interpret users' true intentions accurately. Consequently, many users must modify their prompts several times before the generated images meet their expectations. While some approaches refine prompts to better match user needs, such models still struggle to grasp the requirements of users who lack specialized expertise. We propose Visual Co-Adaptation (VCA), a framework that leverages a pre-trained language model fine-tuned via reinforcement learning to iteratively refine user prompts, aligning generated images with user preferences. At its core, the Incremental Context-Enhanced Dialogue Block uses multi-turn dialogues to disambiguate prompts through clarifying questions and user feedback. The Semantic Exploration and Disambiguation Module (SESD) integrates Retrieval-Augmented Generation (RAG) and CLIP-based scoring to resolve ambiguities in complex prompts. To ensure pixel-level precision and global consistency, the Pixel Precision and Consistency Optimization Module (PPCO) employs Proximal Policy Optimization (PPO) and attention mechanisms to fine-tune image details while maintaining visual harmony. A human-in-the-loop feedback mechanism further improves performance by integrating user feedback into the training loops of diffusion models. Extensive experiments show that VCA significantly improves user satisfaction, image-text alignment, and aesthetic quality. Compared to state-of-the-art systems such as DALL-E 3, Stable Diffusion, and Imagen, our model reduces the average number of dialogue rounds to 4.3, achieves a CLIP score of 0.92, and raises user satisfaction to 4.73/5. Furthermore, we collect a multi-round dialogue dataset containing prompt-image pairs and user-intent annotations, and design experiments on it to demonstrate the framework's efficacy.
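At a high level, the co-adaptation loop alternates prompt refinement, image generation, CLIP scoring, and user feedback until the user is satisfied or a round limit is hit. The sketch below is a minimal, hypothetical rendering of that loop: the callables refine, generate, score, and ask_user stand in for the LLM, the diffusion model, the CLIP scorer, and the dialogue interface, and are not part of any released API; the round limit and threshold values are illustrative.

```python
from typing import Any, Callable

def co_adaptation_loop(
    prompt: str,
    refine: Callable[[str, str], str],    # LLM: (prompt, feedback) -> refined prompt
    generate: Callable[[str], Any],       # diffusion model: prompt -> image
    score: Callable[[Any, str], float],   # CLIP: (image, prompt) -> alignment score
    ask_user: Callable[[Any], str],       # dialogue step: image -> feedback ("" = satisfied)
    max_rounds: int = 6,
    clip_threshold: float = 0.9,
) -> Any:
    """Iteratively refine the prompt until the user accepts an image or limits are reached."""
    feedback = ""
    image = None
    for _ in range(max_rounds):
        prompt = refine(prompt, feedback)
        image = generate(prompt)
        # Only show the user images that already clear the automatic alignment check.
        if score(image, prompt) >= clip_threshold:
            feedback = ask_user(image)
            if not feedback:  # empty feedback means the user accepted the result
                break
    return image
```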


Demonstration of disambiguation in text-to-image generation. Each row showcases examples of ambiguous prompts and their corresponding visual interpretations. For instance, the prompt "jam" results in both "fruit jam" and "traffic jam," while "bat" is interpreted as both a flying mammal and a baseball bat. Similarly, "spring" leads to "spring water" and "spring flower," and "mouse" generates both a "computer mouse" and an actual rodent. This highlights the importance of context and specificity in resolving ambiguities, illustrating the model's capability to handle polysemy effectively.
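One lightweight way to surface this kind of polysemy is to enumerate candidate senses of an ambiguous word and ask the user to pick one before any image is generated. The snippet below is an illustrative sketch only: candidate_senses is a toy lookup table standing in for the LLM- and retrieval-based sense enumeration used by the system, and clarifying_question is a hypothetical helper, not part of our code release.

```python
# Toy sense inventory; in the full system these candidates come from the LLM
# plus retrieval (RAG), not from a hard-coded table.
candidate_senses = {
    "jam":    ["fruit jam spread on toast", "a traffic jam on a highway"],
    "bat":    ["a bat (flying mammal)", "a baseball bat"],
    "spring": ["a natural spring of water", "spring flowers in bloom"],
    "mouse":  ["a computer mouse", "a small rodent"],
}

def clarifying_question(prompt: str) -> str | None:
    """Return a clarifying question if the prompt contains an ambiguous word, else None."""
    for word, senses in candidate_senses.items():
        if word in prompt.lower().split():
            options = " / ".join(f"({i + 1}) {s}" for i, s in enumerate(senses))
            return f'For "{word}", did you mean {options}?'
    return None

print(clarifying_question("a photo of a bat"))
# -> For "bat", did you mean (1) a bat (flying mammal) / (2) a baseball bat?
```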


Overview of the overall framework: (a) the LLM refines the user's prompt through Word Swap, Adding a Phrase, or Attention Re-weighting, iterating with PPO until the user is satisfied; (b) the effects of these operations on the image are demonstrated; (c) PPO's inner loop computes rewards from CLIP feedback until the reward threshold or the iteration limit is reached.
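A natural choice of reward for this inner loop is the CLIP image-text similarity between the generated image and the current prompt. The snippet below sketches such a reward using the Hugging Face transformers CLIP implementation; the checkpoint name and the clip_reward helper are illustrative assumptions, not the paper's released code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint used here purely for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between image and prompt embeddings, used as the PPO reward."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())
```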


The flowchart of our framework, comprising a high-level Interpreter Agent and a low-level Controller Agent for text-to-image generation.
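The two-agent split can be mirrored in code as a high-level interpreter that turns free-form user feedback into a structured edit, and a low-level controller that executes that edit against the current prompt. The class names, the pipe-separated output format, and the llm callable below are assumptions made for illustration; word swap and phrase addition are shown directly, while attention re-weighting is sketched separately after the round-by-round figure further down.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditCommand:
    operation: str   # "word_swap" | "add_phrase" | "reweight"
    target: str      # word or phrase the edit applies to
    value: str       # replacement text, phrase to add, or weight

class InterpreterAgent:
    """High-level agent: turns free-form feedback into a structured edit intent."""
    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm  # any text-completion callable; an assumption, not the paper's API

    def interpret(self, prompt: str, feedback: str) -> EditCommand:
        instruction = (
            "Given the current prompt and the user's feedback, output exactly one edit "
            "as 'operation|target|value'.\n"
            f"Prompt: {prompt}\nFeedback: {feedback}"
        )
        # Assumes the LLM follows the requested format; a sketch, not production parsing.
        op, target, value = self.llm(instruction).split("|", 2)
        return EditCommand(op.strip(), target.strip(), value.strip())

class ControllerAgent:
    """Low-level agent: applies the structured edit to produce the next prompt."""
    def apply(self, prompt: str, cmd: EditCommand) -> str:
        if cmd.operation == "word_swap":
            return prompt.replace(cmd.target, cmd.value)
        if cmd.operation == "add_phrase":
            return f"{prompt}, {cmd.value}"
        return prompt  # "reweight" is handled at the text-encoder level (see below)
```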


Resolving Ambiguous Prompts: Addressing Insufficient Context and Semantic Uncertainty through Clarifying Questions and Visual Setups.


The figure illustrates the iterative refinement of a prompt over six rounds. In each round, modifications are made using different strategies: adding phrases, attention re-weighting, and word swap. The impact of each step on the generated image is shown, along with the type of modification used.
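Of the three operations, word swap and phrase addition are plain string edits; attention re-weighting instead changes how strongly a token influences generation. A common simplified proxy is to scale that token's text-encoder embedding before it is passed to the diffusion model, as sketched below. The tensor shape, the 1.5 weight, and the reweight_tokens helper are illustrative assumptions; the actual method operates on cross-attention maps rather than raw embeddings.

```python
import torch

def reweight_tokens(token_embeds: torch.Tensor,
                    tokens: list[str],
                    weights: dict[str, float]) -> torch.Tensor:
    """Scale selected token embeddings as a simple stand-in for attention re-weighting."""
    scaled = token_embeds.clone()
    for i, tok in enumerate(tokens):
        scaled[i] = scaled[i] * weights.get(tok, 1.0)
    return scaled

# Emphasise "croutons" after the user asked for more of them
# (echoing the soup-and-croutons example below).
tokens = "a bowl of tomato soup with croutons".split()
embeds = torch.randn(len(tokens), 768)   # stand-in for real text-encoder output
boosted = reweight_tokens(embeds, tokens, {"croutons": 1.5})
```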


The comparison showcases our model's capacity for precise control and efficient alignment with user intentions, such as refining soup images by adding specific elements like croutons. By the third round, our model fully captures the user's intent, requiring minimal further action.


The chart summarizes user feedback on the model, showing mixed responses: positive ratings for image coherence and intent capture, but concerns about response time.


The figure shows the various operations applied during different rounds.