Fuzzy Intent Strategy: Enhancing Intent Understanding for Ambiguous Prompts through Human-Machine Co-Adaptation

Datasets | Code
University of Minnesota - Twin Cities, Tsinghua University

Abstract

Modern image generation systems can produce realistic, high-quality images. Nevertheless, user prompts often contain ambiguities that make it difficult for these systems to interpret users' true intentions accurately. Consequently, many users must modify their prompts several times before the generated images meet their expectations. While some approaches refine prompts to better match user needs, such models still struggle to grasp the requirements of users who lack specialized expertise. We propose Visual Co-Adaptation (VCA), a framework that leverages a pre-trained language model fine-tuned via reinforcement learning to iteratively refine user prompts, aligning generated images with user preferences. At its core, the Incremental Context-Enhanced Dialogue Block uses multi-turn dialogues to disambiguate prompts through clarifying questions and user feedback. The Semantic Exploration and Disambiguation Module (SESD) integrates Retrieval-Augmented Generation (RAG) and CLIP-based scoring to resolve ambiguities in complex prompts. To ensure pixel-level precision and global consistency, the Pixel Precision and Consistency Optimization Module (PPCO) employs Proximal Policy Optimization (PPO) and attention mechanisms to fine-tune image details while maintaining visual harmony. A human-in-the-loop feedback mechanism further improves performance by integrating user feedback into the training loops of diffusion models. Extensive experiments show that VCA significantly improves user satisfaction, image-text alignment, and aesthetic quality. Compared to state-of-the-art systems such as DALL-E 3, Stable Diffusion, and Imagen, our model reduces the average number of dialogue rounds to 4.3, achieves a CLIP score of 0.92, and raises user satisfaction to 4.73/5. Furthermore, we collect a multi-round dialogue dataset containing prompt-image pairs and user-intent annotations, and design experiments on it to demonstrate the framework's efficacy.
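At a high level, the co-adaptation loop alternates prompt refinement, image generation, CLIP scoring, and user feedback until the user is satisfied or a round limit is hit. The sketch below is a minimal, hypothetical rendering of that loop: the callables refine, generate, score, and ask_user stand in for the LLM, the diffusion model, the CLIP scorer, and the dialogue interface, and are not part of any released API; the round limit and threshold values are illustrative.

```python
from typing import Any, Callable

def co_adaptation_loop(
    prompt: str,
    refine: Callable[[str, str], str],    # LLM: (prompt, feedback) -> refined prompt
    generate: Callable[[str], Any],       # diffusion model: prompt -> image
    score: Callable[[Any, str], float],   # CLIP: (image, prompt) -> alignment score
    ask_user: Callable[[Any], str],       # dialogue step: image -> feedback ("" = satisfied)
    max_rounds: int = 6,
    clip_threshold: float = 0.9,
) -> Any:
    """Iteratively refine the prompt until the user accepts an image or limits are reached."""
    feedback = ""
    image = None
    for _ in range(max_rounds):
        prompt = refine(prompt, feedback)
        image = generate(prompt)
        # Only show the user images that already clear the automatic alignment check.
        if score(image, prompt) >= clip_threshold:
            feedback = ask_user(image)
            if not feedback:  # empty feedback means the user accepted the result
                break
    return image
```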


Demonstration of disambiguation in text-to-image generation. Each row showcases examples of ambiguous prompts and their corresponding visual interpretations. For instance, the prompt "jam" results in both "fruit jam" and "traffic jam," while "bat" is interpreted as both a flying mammal and a baseball bat. Similarly, "spring" leads to "spring water" and "spring flower," and "mouse" generates both a "computer mouse" and an actual rodent. This highlights the importance of context and specificity in resolving ambiguities, illustrating the model's capability to handle polysemy effectively.
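One lightweight way to surface this kind of polysemy is to enumerate candidate senses of an ambiguous word and ask the user to pick one before any image is generated. The snippet below is an illustrative sketch only: candidate_senses is a toy lookup table standing in for the LLM- and retrieval-based sense enumeration used by the system, and clarifying_question is a hypothetical helper, not part of our code release.

```python
# Toy sense inventory; in the full system these candidates come from the LLM
# plus retrieval (RAG), not from a hard-coded table.
candidate_senses = {
    "jam":    ["fruit jam spread on toast", "a traffic jam on a highway"],
    "bat":    ["a bat (flying mammal)", "a baseball bat"],
    "spring": ["a natural spring of water", "spring flowers in bloom"],
    "mouse":  ["a computer mouse", "a small rodent"],
}

def clarifying_question(prompt: str) -> str | None:
    """Return a clarifying question if the prompt contains an ambiguous word, else None."""
    for word, senses in candidate_senses.items():
        if word in prompt.lower().split():
            options = " / ".join(f"({i + 1}) {s}" for i, s in enumerate(senses))
            return f'For "{word}", did you mean {options}?'
    return None

print(clarifying_question("a photo of a bat"))
# -> For "bat", did you mean (1) a bat (flying mammal) / (2) a baseball bat?
```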


Overview of the overall framework: (a) the LLM refines the user's prompt through Word Swap, Adding a Phrase, or Attention Re-weighting, iterating with PPO until the user is satisfied; (b) the effects of these operations on the image are demonstrated; (c) PPO's inner loop computes rewards from CLIP feedback until the reward threshold or the iteration limit is reached.
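A natural choice of reward for this inner loop is the CLIP image-text similarity between the generated image and the current prompt. The snippet below sketches such a reward using the Hugging Face transformers CLIP implementation; the checkpoint name and the clip_reward helper are illustrative assumptions, not the paper's released code.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint used here purely for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_reward(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between image and prompt embeddings, used as the PPO reward."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())
```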


The flowchart of our framework, comprising a high-level Interpreter Agent and a low-level Controller Agent for text-to-image generation.
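The two-agent split can be mirrored in code as a high-level interpreter that turns free-form user feedback into a structured edit, and a low-level controller that executes that edit against the current prompt. The class names, the pipe-separated output format, and the llm callable below are assumptions made for illustration; word swap and phrase addition are shown directly, while attention re-weighting is sketched separately after the round-by-round figure further down.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EditCommand:
    operation: str   # "word_swap" | "add_phrase" | "reweight"
    target: str      # word or phrase the edit applies to
    value: str       # replacement text, phrase to add, or weight

class InterpreterAgent:
    """High-level agent: turns free-form feedback into a structured edit intent."""
    def __init__(self, llm: Callable[[str], str]):
        self.llm = llm  # any text-completion callable; an assumption, not the paper's API

    def interpret(self, prompt: str, feedback: str) -> EditCommand:
        instruction = (
            "Given the current prompt and the user's feedback, output exactly one edit "
            "as 'operation|target|value'.\n"
            f"Prompt: {prompt}\nFeedback: {feedback}"
        )
        # Assumes the LLM follows the requested format; a sketch, not production parsing.
        op, target, value = self.llm(instruction).split("|", 2)
        return EditCommand(op.strip(), target.strip(), value.strip())

class ControllerAgent:
    """Low-level agent: applies the structured edit to produce the next prompt."""
    def apply(self, prompt: str, cmd: EditCommand) -> str:
        if cmd.operation == "word_swap":
            return prompt.replace(cmd.target, cmd.value)
        if cmd.operation == "add_phrase":
            return f"{prompt}, {cmd.value}"
        return prompt  # "reweight" is handled at the text-encoder level (see below)
```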


Resolving Ambiguous Prompts: Addressing Insufficient Context and Semantic Uncertainty through Clarifying Questions and Visual Setups.


The figure illustrates the iterative refinement of a prompt over six rounds. In each round, modifications are made using different strategies: adding phrases, attention re-weighting, and word swap. The impact of each step on the generated image is shown, along with the type of modification used.
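Of the three operations, word swap and phrase addition are plain string edits; attention re-weighting instead changes how strongly a token influences generation. A common simplified proxy is to scale that token's text-encoder embedding before it is passed to the diffusion model, as sketched below. The tensor shape, the 1.5 weight, and the reweight_tokens helper are illustrative assumptions; the actual method operates on cross-attention maps rather than raw embeddings.

```python
import torch

def reweight_tokens(token_embeds: torch.Tensor,
                    tokens: list[str],
                    weights: dict[str, float]) -> torch.Tensor:
    """Scale selected token embeddings as a simple stand-in for attention re-weighting."""
    scaled = token_embeds.clone()
    for i, tok in enumerate(tokens):
        scaled[i] = scaled[i] * weights.get(tok, 1.0)
    return scaled

# Emphasise "croutons" after the user asked for more of them
# (echoing the soup-and-croutons example below).
tokens = "a bowl of tomato soup with croutons".split()
embeds = torch.randn(len(tokens), 768)   # stand-in for real text-encoder output
boosted = reweight_tokens(embeds, tokens, {"croutons": 1.5})
```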


The comparison showcases our model's capacity for precise control and efficient alignment with user intentions, such as refining soup images by adding specific elements like croutons. By the third round, our model fully captures the user's intent, requiring minimal further action.


The chart summarizes user feedback on the model, showing mixed responses: positive ratings for image coherence and intent capture, but concerns about response time.


The figure shows the various operations applied during different rounds.