Separation of text generation tasks into stages for sequential execution on extremely limited resources
Authors: Khudaiberideva G.B., Kozhukhov D.A., Pimenkova A.A.
Journal: Теория и практика современной науки (Theory and Practice of Modern Science) @modern-j
Section: Main section
Article in issue: 8 (122), 2025.
Free access
The problem of running large language models (LLMs) on devices with extremely limited RAM is considered. A method is proposed that architecturally rethinks the text generation process by decomposing the computation of each next token into atomic stages (attention computation, FFN-layer operations, normalization) that are executed strictly sequentially. Each stage has exclusive use of the available computing resources, minimizing peak memory consumption at the cost of increased processing time. The theoretical aspects of the resulting reduction in memory requirements and the potential limitations of the method are analyzed.
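To make the idea concrete, below is a minimal Python sketch (not the authors' implementation) of such staged decoding for a single decoder layer: each atomic stage loads only its own weights from disk, runs, and releases them before the next stage starts, so peak RAM holds roughly one stage's weights plus the KV cache at a time. All names here (load_stage_weights, the attention.npz/ffn.npz files, weight keys such as wq and w1) are illustrative assumptions.

import numpy as np

def load_stage_weights(path):
    """Hypothetical loader: reads one stage's weight matrices from an .npz file."""
    with np.load(path) as data:
        return {name: data[name] for name in data.files}

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + eps)

def decode_one_layer(hidden, kv_cache, layer_dir):
    """Run one decoder layer for a single token, stage by stage.

    hidden: (1, d) residual-stream vector for the current token.
    kv_cache: dict with "k" and "v" arrays of shape (t, d), initialized as (0, d).
    """
    # Stage 1: attention. Only the attention weights are resident in memory.
    w = load_stage_weights(f"{layer_dir}/attention.npz")
    q = hidden @ w["wq"]
    k = hidden @ w["wk"]
    v = hidden @ w["wv"]
    kv_cache["k"] = np.vstack([kv_cache["k"], k])
    kv_cache["v"] = np.vstack([kv_cache["v"], v])
    scores = softmax(q @ kv_cache["k"].T / np.sqrt(q.shape[-1]))
    attn_out = (scores @ kv_cache["v"]) @ w["wo"]
    del w  # release attention weights before the next stage

    # Stage 2: normalization of the residual stream (cheap, done in place).
    hidden = layer_norm(hidden + attn_out)

    # Stage 3: FFN. Its weights are loaded only now, after attention is freed.
    w = load_stage_weights(f"{layer_dir}/ffn.npz")
    ffn_out = np.maximum(hidden @ w["w1"], 0.0) @ w["w2"]
    del w

    return layer_norm(hidden + ffn_out), kv_cache

A full decode step would apply this layer routine to every layer and then to the output head in the same one-stage-at-a-time manner; the price, as the abstract notes, is longer processing time, here in the form of repeated weight loading for every generated token.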
Large language models, limited resources, memory optimization, text generation, sequential computation, decomposition of operations
Short address: https://sciup.org/140312534
IDR: 140312534 | UDC: 004.89