Scale-wise autoregressive transformer for text-to-image generation
Author: Voronov A.D.
Journal: Proceedings of Moscow Institute of Physics and Technology (Trudy MIPT) @trudy-mipt
Section: Computer Science and Control
In issue: 4 (68), vol. 17, 2025.
Free access
In this work we adapt VAR, a recently proposed architecture for class-conditional image generation, to text-to-image synthesis. To support this transition, we first introduce several architectural modifications aimed at improving training stability and final performance. One such modification replaces VAR's attention mask with a block-wise non-causal one, enabling faster sampling and reducing memory consumption. The resulting model, a Scale-Wise Autoregressive Transformer (SWAT), is capable of generating high-quality 512×512 images, achieving competitive visual fidelity compared to strong diffusion-based baselines while being up to 7× faster at inference. Our work demonstrates that non-diffusion architectures can offer a compelling alternative for text-to-image generation, balancing quality and efficiency.
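The block-wise non-causal mask mentioned above can be illustrated with a minimal sketch. This is an assumption-laden toy example, not the paper's implementation: it supposes VAR-style scale-wise generation in which every token of a given scale may attend to all tokens of its own scale (non-causal within the block) and to all tokens of earlier scales, but never to later scales; the scale sizes and function name are illustrative.

```python
import numpy as np

# Hypothetical token counts per scale, e.g. 1x1, 2x2, 4x4 token maps
# in a VAR-style coarse-to-fine sequence (illustrative values).
scale_sizes = [1, 4, 16]

def block_noncausal_mask(scale_sizes):
    """Build a boolean attention mask where True = attention allowed.

    Tokens within one scale attend to each other bidirectionally
    (non-causal inside the block) and to every earlier scale,
    but not to any later scale.
    """
    total = sum(scale_sizes)
    mask = np.zeros((total, total), dtype=bool)
    start = 0
    for size in scale_sizes:
        end = start + size
        # Rows of this scale may attend to all columns up to the
        # end of their own block (own scale + all previous scales).
        mask[start:end, :end] = True
        start = end
    return mask

mask = block_noncausal_mask(scale_sizes)
```

Compared to a strictly token-causal mask, this block structure lets all tokens of a scale be sampled in parallel, which is consistent with the faster-sampling claim in the abstract.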
Text-to-image, autoregression, scale-wise, transformer
Короткий адрес: https://sciup.org/142247114
IDR: 142247114 | UDC: 004.932