Recently developed discrete diffusion models perform extraordinarily well on the text-to-image task, showing significant promise for handling multi-modality signals. In this work, we harness these traits and present a unified multimodal generation model that can conduct both "modality translation" and "multi-modality generation" tasks with a single model, performing text-based, image-based, and even simultaneous vision-language generation. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Moreover, we design a mutual attention module with a fused embedding layer and a unified objective function to emphasise the inter-modal linkages, which are vital for multi-modality generation. Extensive experiments indicate that our proposed method performs comparably to state-of-the-art solutions on various generation tasks.
The pipeline of UniD3. With an offline model (red background), the given inputs are represented as discrete token sequences in separate domains. The fusion embedding concatenates the tokens from the different modalities and embeds them into the same space. The unified diffusion (blue background) constructs the joint distribution of all modalities from the fused embedding with a fixed unified Markov transition matrix.
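A minimal sketch of this fusion step, assuming hypothetical vocabulary sizes, a single shared embedding table, and learned positional embeddings; none of these details are taken from the released implementation:

```python
import torch
import torch.nn as nn

class FusedEmbedding(nn.Module):
    """Embed text and image token indices into one shared space.

    Vocabulary sizes, the shared-table layout, and the learned positional
    embedding are illustrative assumptions, not the paper's released code.
    """
    def __init__(self, text_vocab, image_vocab, dim, max_len):
        super().__init__()
        self.text_vocab = text_vocab
        # One table over both vocabularies plus a trailing [MASK] token.
        self.token_emb = nn.Embedding(text_vocab + image_vocab + 1, dim)
        self.pos_emb = nn.Embedding(max_len, dim)

    def forward(self, text_tokens, image_tokens):
        # text_tokens:  (B, Lt) indices from an offline text tokenizer
        # image_tokens: (B, Li) indices from an offline VQ image encoder
        tokens = torch.cat([text_tokens, image_tokens + self.text_vocab], dim=1)
        pos = torch.arange(tokens.size(1), device=tokens.device)
        return self.token_emb(tokens) + self.pos_emb(pos)  # (B, Lt+Li, dim)
```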
The input to the neural network covers all modalities, and a plain self-attention mechanism can scarcely highlight the inter-modal linkages. To address this, we propose a mutual attention module that captures the inter-modal linkages as well as the cross-modal connections. Each mutual attention block consists of one self-attention layer, two parallel mutual attention operations, and one feed-forward module. The block receives a sequence of mixed-modal tokens as input, which first traverses the self-attention layer to capture the inherent connections within each modality.
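A minimal PyTorch sketch of one such block; the split point of the mixed sequence, the normalisation placement, and the way the two cross-attended halves are recombined are our assumptions for illustration, not the paper's exact design:

```python
import torch
import torch.nn as nn

class MutualAttentionBlock(nn.Module):
    def __init__(self, dim, heads, text_len):
        super().__init__()
        self.text_len = text_len  # assumed split point of the mixed-modal sequence
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)  # text queries image
        self.img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)  # image queries text
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        # x: (B, Lt + Li, dim) mixed-modal token embeddings
        h = self.n1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]   # intra-modal connections
        h = self.n2(x)
        t, v = h[:, :self.text_len], h[:, self.text_len:]
        t2 = self.txt2img(t, v, v, need_weights=False)[0]        # text attends to image tokens
        v2 = self.img2txt(v, t, t, need_weights=False)[0]        # image attends to text tokens
        x = x + torch.cat([t2, v2], dim=1)                       # two parallel mutual attentions
        return x + self.ff(self.n3(x))                           # feed-forward module
```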
The choice of transition matrix determines the nature of the discrete diffusion model and provides considerable freedom in how tokens evolve. This raises the question of whether it is feasible to design transition matrices that capture the global connections between various modalities.
The Markov transition matrix of the discrete diffusion model should satisfy the following requirements:
1. each column in Q_t should sum to one to conserve probability mass;
2. each column of the cumulative product Q̄_t = Q_t Q_{t-1} ⋯ Q_1 should converge to either a known stationary distribution or a learnt prior as t becomes large.
On the basis of these criteria, we construct a unified transition matrix Q_t capable of encapsulating the discrete representations of various modalities.
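Below is a hedged numerical sketch of one way a matrix can satisfy the two criteria, using a mask-and-uniform parameterisation in the style of VQ-Diffusion. The block structure, the alpha/beta/gamma names, and the fixed-schedule simplification are assumptions for illustration, not the paper's exact construction:

```python
import numpy as np

def unified_Q(alpha, beta, gamma, sizes):
    """Column-stochastic Q_t over concatenated vocabularies plus one [MASK] state.

    alpha: prob. of keeping the current token, beta: prob. of resampling it
    uniformly *within its own modality's vocabulary*, gamma: prob. of jumping
    to the absorbing [MASK] state. sizes: per-modality vocabulary sizes.
    """
    K = sum(sizes)
    Q = np.zeros((K + 1, K + 1))
    start = 0
    for k in sizes:
        blk = slice(start, start + k)
        Q[blk, blk] = beta / k                 # uniform noise stays inside the modality block
        start += k
    Q[np.arange(K), np.arange(K)] += alpha     # keep the current token
    Q[K, :K] = gamma                           # transition to [MASK]
    Q[K, K] = 1.0                              # [MASK] is absorbing
    return Q

Q = unified_Q(alpha=0.9, beta=0.05, gamma=0.05, sizes=[4, 6])
assert np.allclose(Q.sum(axis=0), 1.0)         # criterion 1: every column sums to one

# Criterion 2: the cumulative product converges to the [MASK] stationary distribution.
# (A fixed Q is used here for simplicity; in practice alpha_t, beta_t, gamma_t vary with t.)
Qbar = np.linalg.matrix_power(Q, 200)
print(np.round(Qbar[:, 0], 3))                 # mass concentrates on the last ([MASK]) entry
```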
Both the image and the caption are generated simultaneously. The quality of the generated images and text is reasonable, and the descriptions correlate well with the visuals.
In this experiment, we mask a portion of the image and modify the text description as well. Conditioned on the unmasked image regions, our model can complete the caption derived from the visible part of the image. Moreover, based on the amended description, the model can fill in the masked region accordingly.
@article{hu2022unified,
title = {Unified Discrete Diffusion for Simultaneous Vision-Language Generation},
author = {Hu, Minghui and Zheng, Chuanxia and Zheng, Heliang and Cham, Tat-Jen and Wang, Chaoyue and Yang, Zuopeng and Tao, Dacheng and Suganthan, Ponnuthurai N},
journal = {arXiv},
year = {2022},
}
The website template was borrowed from DreamFusion.