UniD3: Unified Discrete Diffusion for Simultaneous Vision-Language Generation

Minghui Hu
Nanyang Technological University
Chuanxia Zheng
University of Oxford
Heliang Zheng
JD Explore Academy
Tat-Jen Cham
Nanyang Technological University
Chaoyue Wang
JD Explore Academy
Zuopeng Yang
Shanghai Jiao Tong University
Dacheng Tao
JD Explore Academy
Ponnuthurai N. Suganthan
Qatar University


The recently developed discrete diffusion models perform extraordinarily well on the text-to-image task, showing significant promise for handling multimodal signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both "modality translation" and "multi-modality generation" tasks with a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions on various generation tasks.

Once trained, UniD3 not only inherits the ability to manipulate the provided text or image, but can also unify text and image generation, e.g., unconditional vision-language pair generation, cross-modal manipulation, text-guided image completion, and image-conditioned captioning.

Overall Pipeline

The pipeline of UniD3. With an offline model (red background part), the given inputs are represented as discrete token sequences in their separate domains. The fusion embedding concatenates the tokens from the different modalities and embeds them into the same space. The unified diffusion (blue background) constructs the joint distribution of all modalities based on the fused embedding with a fixed unified Markov transition matrix.

(1) An offline model that generates a compact yet expressive discrete representation for images and texts via a discrete VAE (dVAE) and Byte-Pair Encoding (BPE), respectively (the pink part in the figure).
(2) A novel unified discrete diffusion model that estimates the joint distribution of these latent visual and language codes (the cyan part in the figure).
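To illustrate how the two token streams can share one index space before diffusion, here is a minimal sketch of the fusion-embedding step. The codebook/vocabulary sizes, embedding width, and the random embedding table are all placeholder assumptions; in the actual pipeline the tokens come from the dVAE and BPE models above and the table is learned.

```python
import numpy as np

# Hypothetical sizes -- the real codebook/vocabulary sizes are not given here.
IMG_VOCAB = 1024   # dVAE codebook size (assumption)
TXT_VOCAB = 16384  # BPE vocabulary size (assumption)
D_MODEL = 64       # embedding width (assumption)

rng = np.random.default_rng(0)
# Single embedding table over the unified (image + text) vocabulary.
embed_table = rng.standard_normal((IMG_VOCAB + TXT_VOCAB, D_MODEL))

def fuse_and_embed(img_tokens, txt_tokens):
    """Concatenate image and text tokens into one mixed-modal sequence
    over a shared index space, then embed them with a single table."""
    # Offset text ids so the two vocabularies do not collide.
    unified = np.concatenate([img_tokens, txt_tokens + IMG_VOCAB])
    return unified, embed_table[unified]

img = rng.integers(0, IMG_VOCAB, size=256)  # e.g. a 16x16 grid of dVAE codes
txt = rng.integers(0, TXT_VOCAB, size=32)   # a BPE-encoded caption
seq, emb = fuse_and_embed(img, txt)
print(seq.shape, emb.shape)  # (288,) (288, 64)
```

The offset trick is what lets a single diffusion model operate on both modalities at once: every token, visual or textual, is just an index into one unified vocabulary.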

Mutual Attention

The input to the neural network covers all modalities, and a plain self-attention mechanism can scarcely highlight the inter-modal linkages. To address this, we propose a mutual attention module that captures the inter-modal linkages as well as the cross-modal connections. Each mutual attention block consists of one self-attention layer, two parallel mutual attention operations, and one feed-forward module. Each block receives a sequence of mixed-modal tokens as input, which first traverses the self-attention layer to capture the inherent connections within each modality.

Illustration of mutual attention blocks. A unified transformer is composed of several blocks stacked on top of one another.
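The block structure above can be sketched as follows. This is a single-head, projection-free simplification assuming one particular residual placement; the learned query/key/value projections, multi-head splitting, and layer norms of the actual transformer are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Single-head scaled dot-product attention (no projections)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def mutual_attention_block(x, n_img):
    """One block: self-attention over the mixed sequence, then two
    parallel cross-modal (mutual) attentions, then a stand-in for the
    feed-forward module."""
    h = x + attend(x, x, x)             # self-attention over all tokens
    img, txt = h[:n_img], h[n_img:]
    img2 = img + attend(img, txt, txt)  # image queries attend to text
    txt2 = txt + attend(txt, img, img)  # text queries attend to image
    h = np.concatenate([img2, txt2])
    return h + np.tanh(h)               # placeholder feed-forward step

rng = np.random.default_rng(0)
x = rng.standard_normal((12, 8))        # 8 image tokens + 4 text tokens
out = mutual_attention_block(x, n_img=8)
print(out.shape)  # (12, 8)
```

The two mutual attentions run in parallel on the same hidden states, so each modality explicitly queries the other rather than relying on self-attention alone to surface cross-modal links.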

Transition Matrix

The choice of transition matrix determines the nature of a discrete diffusion model, and also offers more options for how tokens evolve. This raises the question of whether transition matrices can be designed to capture the global connections between the various modalities.

The Markov transition matrix of the discrete diffusion model should satisfy the following requirements:
1. each column in Q_t should sum to one to conserve probability mass;
2. each column of the cumulative product Q'_t = Q_t Q_{t-1} ... Q_1 should converge to either a known stationary distribution or a learnt prior when t becomes large.

On the basis of these criteria, we construct a unified transition matrix Q_t capable of encapsulating discrete representations among various modalities.
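A toy construction in this spirit, assuming a shared [MASK] absorbing state and per-modality uniform blocks: a token stays put with probability alpha, hops uniformly within its own modality's vocabulary with total probability beta, or jumps to [MASK] with probability gamma. This is illustrative, not the paper's exact parameterisation, but both requirements above can be checked numerically.

```python
import numpy as np

def unified_Q(sizes, alpha, gamma):
    """Column-stochastic transition matrix over the concatenated
    vocabularies in `sizes`, plus a shared [MASK] state (last index).
    Illustrative construction, not the paper's exact Q_t."""
    beta = 1.0 - alpha - gamma
    K = sum(sizes) + 1                 # +1 for the shared [MASK] token
    Q = np.zeros((K, K))
    start = 0
    for s in sizes:                    # uniform hops stay inside one modality
        blk = slice(start, start + s)
        Q[blk, blk] = beta / s
        start += s
    Q[np.arange(K - 1), np.arange(K - 1)] += alpha  # stay probability
    Q[-1, :-1] = gamma                 # any token may jump to [MASK]
    Q[-1, -1] = 1.0                    # [MASK] is absorbing
    return Q

Q = unified_Q(sizes=[4, 3], alpha=0.9, gamma=0.05)
# Requirement 1: every column sums to one.
print(Q.sum(axis=0))
# Requirement 2: for a constant Q the cumulative product is a matrix
# power; mass from any starting token converges onto the [MASK] state.
Qbar = np.linalg.matrix_power(Q, 500)
print(Qbar[-1, 0])                     # close to 1
```

Confining the uniform part to per-modality blocks keeps image tokens from diffusing into text ids (and vice versa), while the shared absorbing state gives the whole mixed sequence a common stationary distribution.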


More Results

Generated vision-language Pairs from CUB-200 and MSCOCO.

Both the image and the caption are generated simultaneously. The generated images and text are of reasonable quality, and the descriptions correlate well with the visuals.


Image Captions on CUB


Cross Modal Inpainting

In this experiment, we obscure a portion of the image and modify part of the text description as well. Conditioned on the unmasked image regions, our model can complete the caption derived from the visible portion of the image. Moreover, based on the amended description, the model can restore the appearance of the masked region.



@article{hu2022unified,
  title = {Unified Discrete Diffusion for Simultaneous Vision-Language Generation},
  author = {Hu, Minghui and Zheng, Chuanxia and Zheng, Heliang and Cham, Tat-Jen and Wang, Chaoyue and Yang, Zuopeng and Tao, Dacheng and Suganthan, Ponnuthurai N},
  journal = {arXiv},
  year = {2022},
}


The website template was borrowed from DreamFusion.