TL;DR: Text prompt -> LLM -> Intermediate representation (such as an image layout) -> Stable Diffusion -> Image.
Recent advances in text-to-image generation with diffusion models have yielded remarkable results synthesizing highly realistic and diverse images. However, despite their impressive capabilities, diffusion models such as Stable Diffusion often have difficulty following prompts accurately when spatial or common sense reasoning is required.
The following figure lists four scenarios in which Stable Diffusion fails to generate images that accurately match the given prompts, namely negation, numeracy, attribute binding, and spatial relationships. In contrast, our method, LLM-grounded Diffusion (LMD), delivers much better prompt understanding in text-to-image generation in these scenarios.
Figure 1: LLM-grounded Diffusion enhances the prompt understanding ability of text-to-image diffusion models.
A possible solution to this problem is, of course, to gather a large multimodal dataset with complex captions and train a large diffusion model with a large language encoder. This approach comes with significant costs: it is time-consuming and expensive to train both large language models (LLMs) and diffusion models.
Our solution
To efficiently solve this problem with minimal cost (i.e., no training costs), we instead equip diffusion models with enhanced spatial and common sense reasoning by using off-the-shelf frozen LLMs in a novel two-stage generation process.
First, we adapt an LLM to be a text-guided layout generator through in-context learning. When provided with an image prompt, the LLM outputs a scene layout in the form of bounding boxes along with corresponding individual descriptions. Second, we steer a diffusion model with a novel controller to generate images conditioned on the layout. Both stages use frozen pretrained models without any optimization of LLM or diffusion model parameters. We invite readers to read the paper on arXiv for more details.
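To make the first stage concrete, below is a minimal Python sketch of what in-context layout generation can look like. The exact prompt template and output format used by LMD are given in the paper and released code; the `call_llm` helper, the worked example, and the parsing logic here are simplified illustrations under those assumptions, not the official implementation.

```python
import ast

# Hypothetical helper: send a text prompt to a frozen LLM (e.g., GPT-3.5/GPT-4)
# and return its text completion. Any completion or chat API can be plugged in.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("Plug in your preferred LLM API here.")

# In-context learning: a task instruction plus one worked example teaches the
# frozen LLM to emit a scene layout (bounding boxes + per-box descriptions +
# a background prompt) for a new caption. The format below is illustrative.
LAYOUT_TEMPLATE = """\
You are an assistant that converts an image caption into a scene layout.
Reply with a line starting with "Objects:" containing a Python list of
(object description, [x, y, width, height]) tuples, then a line starting
with "Background:". The canvas is 512x512.

Caption: A realistic photo of a gray cat and an orange dog on the grass
Objects: [('a gray cat', [67, 243, 120, 126]), ('an orange dog', [265, 193, 190, 210])]
Background: A realistic photo of a grassy field

Caption: {caption}
"""

def generate_layout(caption: str):
    """Stage 1: query the frozen LLM for a layout; no weights are updated."""
    response = call_llm(LAYOUT_TEMPLATE.format(caption=caption))
    objects_line = next(l for l in response.splitlines() if l.startswith("Objects:"))
    background_line = next(l for l in response.splitlines() if l.startswith("Background:"))
    boxes = ast.literal_eval(objects_line[len("Objects:"):].strip())
    background = background_line[len("Background:"):].strip()
    return boxes, background

# Stage 2 (not shown): the boxes and background prompt are passed to a
# layout-guided Stable Diffusion controller that generates the final image.
```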
Figure 2: LMD is a text-to-image generative model with a novel two-stage generation process: a text-to-layout generator with an LLM and in-context learning, and a novel layout-guided Stable Diffusion. Both stages are training-free.
Additional LMD Capabilities
In addition, LMD naturally enables dialog-based multi-round scene specification, allowing additional clarifications and subsequent edits for each prompt. Furthermore, LMD is able to handle prompts in a language that is not well supported by the underlying diffusion model.
Figure 3: By incorporating an LLM for prompt understanding, our method can perform dialog-based scene specification and generation from prompts in a language (Chinese in the example above) that the underlying diffusion model does not support.
Given an LLM that supports multi-round dialog (e.g., GPT-3.5 or GPT-4), LMD allows the user to provide additional information or clarification by querying the LLM after the first layout generation in the dialog, and to generate images with the updated layout from the subsequent LLM response. For example, a user can request adding an object to the scene or changing the location or descriptions of existing objects (left half of Figure 3).
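A rough sketch of how such a follow-up request can be wired on top of a chat-style LLM is shown below. The message format and the `call_chat_llm` helper are hypothetical; the key idea is simply that the dialog history containing the first layout is kept and re-queried.

```python
# Hypothetical chat helper: takes a list of {"role", "content"} messages and
# returns the assistant's reply as text (any multi-round chat LLM works).
def call_chat_llm(messages: list[dict]) -> str:
    raise NotImplementedError("Plug in a chat-capable LLM (e.g., GPT-3.5/GPT-4).")

def refine_layout(history: list[dict], user_request: str):
    """Multi-round scene specification: append the user's clarification to the
    dialog that already contains the first layout, then ask for an updated one."""
    history = history + [{"role": "user", "content": user_request}]
    reply = call_chat_llm(history)            # e.g., "Objects: [...]\nBackground: ..."
    history = history + [{"role": "assistant", "content": reply}]
    return reply, history                     # reply is parsed as in generate_layout()

# Example follow-up after an initial layout has been generated:
# reply, history = refine_layout(history, "Add a red bird flying above the dog.")
```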
Furthermore, by giving an example of a non-English prompt with an English layout and background description during in-context learning, LMD accepts non-English prompt inputs and generates layouts with box descriptions and the background in English for the subsequent layout-to-image generation. As shown in the right half of Figure 3, this allows generation from prompts in a language that the underlying diffusion models do not support.
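Continuing the illustrative stage-one sketch above, this behavior can be obtained by appending one more in-context example whose caption is non-English while its layout stays in English. The Chinese caption and the template manipulation below are illustrative assumptions, not the wording of the official prompt.

```python
# Illustrative extra in-context example: a non-English (here Chinese) caption
# paired with an English layout teaches the LLM to keep box descriptions and
# the background prompt in English, so the downstream diffusion model can use them.
NON_ENGLISH_EXAMPLE = """\
Caption: 一张草地上有一只灰猫和一只橙色狗的真实照片
Objects: [('a gray cat', [67, 243, 120, 126]), ('an orange dog', [265, 193, 190, 210])]
Background: A realistic photo of a grassy field
"""

# Insert the example before the final "Caption: {caption}" query so that
# generate_layout() now also accepts non-English captions unchanged.
MULTILINGUAL_TEMPLATE = LAYOUT_TEMPLATE.replace(
    "Caption: {caption}", NON_ENGLISH_EXAMPLE + "\nCaption: {caption}"
)
```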
Visualizations
We validate the superiority of our design by comparing it with the base diffusion model (SD 2.1) that LMD uses under the hood. We invite readers to our paper for further evaluation and comparisons.
Figure 4: LMD outperforms the base diffusion model in accurately generating images according to prompts that require both language and spatial reasoning. LMD also enables counterfactual text-to-image generation that the base diffusion model is not able to produce (the last row).
For more details about LLM-grounded Diffusion (LMD), visit our website and read the paper on arXiv.
BibTeX
If LLM-grounded Diffusion inspires your work, please cite it with:
@article{lian2023llmgrounded,
title={LLM-grounded Diffusion: Enhancing Prompt Understanding of Text-to-Image Diffusion Models with Large Language Models},
author={Lian, Long and Li, Boyi and Yala, Adam and Darrell, Trevor},
journal={arXiv preprint arXiv:2305.13655},
year={2023}
}