Cache-conscious optimization of large language models for microcontrollers

Authors: Khudaiberideva G.B., Kozhukhov D.A., Pimenkova A.A.

Journal: Теория и практика современной науки @modern-j

Section: Main section

Article in issue: 8 (122), 2025.

Free access

The spread of large language models (LLMs) to Internet of Things (IoT) devices is constrained by the limited resources of microcontrollers (MCUs), in particular the small capacity and high latency of non-volatile (Flash) memory and RAM. Traditional approaches focus on reducing model size. This work proposes an approach that shifts the focus to optimizing data access patterns, the main source of delays in systems with slow memory. Algorithms for reordering model weights and strategies for managing the order of computation (including the order in which layers are processed and the grouping of operations) are investigated in order to maximize the use of the fast but extremely limited on-chip L1/L2 CPU caches and to minimize accesses to slow external memory. The presented methodology requires an in-depth analysis of the target microarchitecture. Experimental results demonstrate a significant reduction in the number of cache misses and in LLM inference time on typical MCUs. The key contribution is demonstrating the effectiveness of hardware-aware reorganization of data and computation for accelerating LLMs on resource-constrained platforms.
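The abstract mentions weight reordering and the grouping of operations without giving details. The following minimal C sketch, which is not taken from the paper and uses hypothetical names and sizes (ROWS, COLS, TILE, reorder_weights, matvec_blocked), illustrates one common form of this idea: weights are pre-reordered into contiguous tiles so that the matrix-vector product of an LLM layer reads slow (e.g. Flash-resident) memory strictly sequentially and reuses each tile from the L1 cache before fetching the next one.

```c
/*
 * Illustrative sketch only (not the authors' code): cache-blocked
 * matrix-vector product with weights pre-reordered into contiguous tiles.
 * TILE is a hypothetical tuning parameter chosen from the target MCU's
 * cache-line and L1 sizes.
 */
#include <stddef.h>
#include <stdio.h>

#define ROWS 64
#define COLS 64
#define TILE 16 /* hypothetical tile edge, tuned to the cache */

/* Reorder row-major weights into TILE x TILE blocks laid out contiguously. */
static void reorder_weights(const float *w, float *w_blocked)
{
    size_t idx = 0;
    for (size_t bi = 0; bi < ROWS; bi += TILE)
        for (size_t bj = 0; bj < COLS; bj += TILE)
            for (size_t i = bi; i < bi + TILE; ++i)
                for (size_t j = bj; j < bj + TILE; ++j)
                    w_blocked[idx++] = w[i * COLS + j];
}

/* y = W x, reading the blocked weights strictly sequentially. */
static void matvec_blocked(const float *w_blocked, const float *x, float *y)
{
    for (size_t i = 0; i < ROWS; ++i)
        y[i] = 0.0f;

    const float *p = w_blocked;
    for (size_t bi = 0; bi < ROWS; bi += TILE)
        for (size_t bj = 0; bj < COLS; bj += TILE)
            for (size_t i = bi; i < bi + TILE; ++i)
                for (size_t j = bj; j < bj + TILE; ++j)
                    y[i] += *p++ * x[j]; /* same block order as the reorder pass */
}

int main(void)
{
    static float w[ROWS * COLS], w_blocked[ROWS * COLS];
    static float x[COLS], y[ROWS];

    /* Fill with deterministic dummy data for a self-contained demo. */
    for (size_t i = 0; i < ROWS * COLS; ++i)
        w[i] = (float)(i % 7) * 0.1f;
    for (size_t j = 0; j < COLS; ++j)
        x[j] = 1.0f;

    reorder_weights(w, w_blocked);
    matvec_blocked(w_blocked, x, y);

    printf("y[0] = %f\n", y[0]);
    return 0;
}
```

The reordering pass runs once offline (or at model load time); at inference the blocked layout turns scattered row accesses into one long sequential stream, which is the access pattern slow external memory and small caches handle best.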


LLM, MCU, TinyML

Short address: https://sciup.org/140312536

IDR: 140312536   |   UDC: 004.89