Scale-wise autoregressive transformer for text-to-image generation


In this work, we adapt VAR, a recently proposed architecture for class-conditional image generation, to text-to-image synthesis. To support this transition, we first introduce several architectural modifications aimed at improving training stability and final performance. One such modification replaces the causal attention mask in VAR with a block-wise non-causal one, enabling faster sampling and reducing memory consumption. The resulting model, the Scale-Wise Autoregressive Transformer (SWAT), generates high-quality 512×512 images, achieving visual fidelity competitive with strong diffusion-based baselines while being up to 7× faster at inference. Our work demonstrates that non-diffusion architectures can offer a compelling alternative for text-to-image generation, balancing quality and efficiency.
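To make the block-wise non-causal mask concrete, below is a minimal sketch of how such a mask can be constructed: tokens within one scale attend to each other freely, while every scale also attends to all coarser (earlier) scales. The function name and the example scale sizes are illustrative assumptions, not the authors' implementation.

```python
import torch

def blockwise_noncausal_mask(scale_sizes):
    """Block-wise non-causal attention mask for scale-wise autoregression.

    Tokens within the same scale attend to each other non-causally,
    and every scale also attends to all earlier (coarser) scales.
    `scale_sizes` lists the number of tokens per scale.
    Returns a boolean (T, T) mask where True means attention is allowed.
    """
    total = sum(scale_sizes)
    mask = torch.zeros(total, total, dtype=torch.bool)
    start = 0
    for n in scale_sizes:
        end = start + n
        # Each token in the block [start:end) may attend to every token
        # up to and including its own scale block.
        mask[start:end, :end] = True
        start = end
    return mask

# Example: four scales of 1, 4, 16 and 64 tokens (hypothetical sizes).
mask = blockwise_noncausal_mask([1, 4, 16, 64])
print(mask.shape)  # torch.Size([85, 85])
```

Under this mask, all tokens of a scale can be predicted in a single forward pass conditioned on the coarser scales, which is what permits faster sampling than token-by-token causal decoding.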


Text-to-image, autoregression, scale-wise, transformer

Short address: https://sciup.org/142247114

IDR: 142247114   |   UDC: 004.932