New generation of GPGPU and related hardware: computing systems microarchitecture and performance from servers to supercomputers

Кузьминский М.Б.; Kuzminsky M.B.

doi:10.25209/2079-3316-2024-15-2-139-473

Journal: Program Systems: Theory and Applications

Section: Software and hardware for distributed and supercomputer systems

Published in issue: No. 2 (61), Vol. 15, 2024.

Open access

Keywords: GPGPU, V100, A100, H100, Grace, GH200 Grace Hopper, MI100, MI200, Ponte Vecchio, Data Center GPU Max, BR100, CUDA, HIP, DPC++, Fortran, performance, HPC, AI, deep learning

Short URL: https://sciup.org/143183240

IDR: 143183240 | UDC: 004.272+004.382.2+004.8+004.43 | DOI: 10.25209/2079-3316-2024-15-2-139-473

This review surveys the current state of GPGPUs, focusing on their use for traditional HPC tasks and, to a lesser extent, AI. Nvidia V100 and A100 are treated as the baseline GPGPUs. Nvidia H100, AMD MI100 and MI200, Intel Ponte Vecchio (Data Center GPU Max), and BR100 from Biren Technology are considered as new-generation GPGPUs. The microarchitecture and hardware characteristics of these GPGPUs that matter for HPC and AI tasks are analyzed and compared, together with the most important additional hardware for building GPGPU-based computing systems: interconnects and CPUs specialized for working with the new generation of GPGPUs (a specialization that may hold only for the initial period of their use). Brief information is given on the servers that use them, including multi-GPU servers, and on the new supercomputers built with these GPGPUs, on which the data on achieved performance were obtained. The SDKs of GPGPU manufacturers and software from other companies, including mathematical libraries, are briefly reviewed. Examples demonstrate features of widely used programming models that are important for achieving maximum performance but make program code non-portable to other GPGPU models. Particular attention is paid to the use of tensor cores and their analogues in modern GPGPUs from various vendors, including calculations with reduced precision (relative to FP64, the standard format for HPC) and mixed precision, which are relevant because of the sharp increase in achieved performance when they run on GPGPU tensor cores. Data on the real-world performance achieved in benchmarks and applications for HPC and AI are analyzed. The use of modern batched linear algebra libraries on GPGPUs, including for HPC applications, is also briefly discussed.
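The remark on reduced and mixed precision can be illustrated with a minimal sketch. It is not taken from the article, and it emulates the number formats in software rather than running on a GPGPU, so it says nothing about performance. It only shows why matrix units commonly pair FP16 inputs with a wider (for example FP32) accumulator: a dot product is summed once with an FP16 accumulator and once with an FP32 accumulator, and both are compared with a double-precision reference. The helper names `to_fp16`, `to_fp32` and `dot`, the vector length, and the input value 0.1 are all assumptions chosen for illustration.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 (FP16) value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 binary32 (FP32) value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def dot(a, b, round_acc):
    """Dot product with FP16-rounded inputs.

    round_acc is applied to the accumulator after every addition,
    emulating an accumulator of the corresponding precision.
    """
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(x) * to_fp16(y)  # product of two FP16 values, exact in double
        acc = round_acc(acc + prod)
    return acc


n = 4096  # illustrative vector length (assumption)
a = [0.1] * n
b = [0.1] * n

# Reference: same FP16-rounded inputs, accumulated in double precision.
reference = sum(to_fp16(x) * to_fp16(y) for x, y in zip(a, b))

pure_fp16 = dot(a, b, to_fp16)  # reduced precision: FP16 accumulator
mixed = dot(a, b, to_fp32)      # mixed precision: FP16 inputs, FP32 accumulator

err_fp16 = abs(pure_fp16 - reference) / reference
err_mixed = abs(mixed - reference) / reference

print(f"relative error, FP16 accumulator: {err_fp16:.3e}")
print(f"relative error, FP32 accumulator: {err_mixed:.3e}")
```

In this emulation the FP16 accumulator stops growing once each new product is smaller than half the spacing between adjacent FP16 values, while the FP32 accumulator stays close to the reference.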

Список литературы Новое поколение GPGPU и сопутствующего оборудования: микроархитектура и производительность вычислительных систем от серверов до суперкомпьютеров

Top500 the list, The Green500 ranking, 61st edition.– 2023. hUtRtpLs://top500.org/lists/green500/2023/06/
Tschudi W., Xu T., Sartor D., Stein J. High-performance data centers. A research roadmap, LBNL-53483.– 2004.– 53 pp. hUtRtpLs://escholarship.org/uc/item/0w64r459
Maltenberger T., Ilic I., Tolovski I, Rab T. Evaluating multi-GPU sorting with modern interconnects // SIGMOD’22: Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA, June 12–17, 2022), New York: ACM.– 2022.– ISBN 978-1-4503-9249-5.– Pp. 1795–1809. https://doi.org/10.1145/3514221.3517842
Top500 the list, The Top500 ranking, 61st edition.– 2023. hUtRtpLs://www.top500.org/lists/top500/2023/06/highs/
Кузьминский М. Б. Современные серверные ARM-процессоры для суперЭ-ВM: A64FX и другие. Начальные данные тестов производительности // Программные системы: теория и приложения.– 2022.– Т. 13.– №1(52).– С. 63–129. https://doi.org/10.25209hU/t2Rt0pL7s9:/-3/3p1s6ta-2.p0s2i2r-a1s3.r-u1/-6r3e-a1d2/9psta2022_1_63-129.pdf
Gao J., Zheng F., Qi F, Ding Y, Li H., Lu H., He W., Wei H., Jin L., Liu X., Gong D., Wang F., Zheng Y., Sun H., Zhou Z., Liu Y., You H. Sunway supercomputer architecture towards exscale computing: analysis and practice // Science China Information Sciences.– 2021.– Vol. 64.– No. 4.– id. 141101.– 21 pp. https://doi.org/10.1007/hUs1tRt1p4L:3/2/-s0c2is0.-s3c1ic0h4i-n7a.com/en/2021/141101.pdf
Selig J. The cerebras software development kit: A technical overview.– Cerebras systems Inc..– 2022.– 8 pp. hUtRtpLs://f.hubspotusercontent30.net/hubfs/8968533/Cerebras%20SDK%20Technical%20Overview%20White%20Paper.pdf
Andromeda, a 13.5 Million Core AI Supercomputer, a section on the Cerebras company site.– 2024. UhtRtpLs://www.cerebras.net/andromeda/
Top500 the list, List Statistics of TOP500, 61st edition.– 2023. hUtRtpLs://www.top500.org/UhltiRtsptLss/:/to/pw5w0w0/.t2o0p2530/00.6o/rhg/igshtas/tistics/list/
Morgan T. P. Chip roadmaps unfold, crisscrossing and interconnecting, at AMD, The Next Platform.– Stackhouse Publishing.– 2022. hUtRtpLs://www.nextplatform.com/2022/06/14/chip-roadmaps-unfold-crisscrossing-and-interconnecting-at-amd/
Shah A. Intel Reiterates Plans to Merge CPU, GPU High-performance Chip Roadmaps, Tabor network.– HPCwire.– 2022. hUtRtpLs://www.hpcwire.com/2022/05/31/intel-reiterates-plans-to-merge-cpu-gpu-high-performance-chip-roadmaps/
Morgan T. P. The Increasingly Graphic Nature Of Intel Datacenter Compute, The Next Platform.– Stackhouse Publishing.– 2022. hUtRtpLs://www.nextplatform.com/2022/06/08/the-increasingly-graphic-nature-of-intel-datacenter-compute/
Evans J. Nvidia Grace // 2022 IEEE Hot Chips 34 Symposium (HCS) (Cupertino, CA, USA, 21–23 August 2022).– IEEE.– 2022.– Pp. 1–20. https://doi.org/10.1109/HCS55958.2022.9895599
Elster A. C., Haugdahl T. A. Nvidia Hopper GPU and Grace CPU Highlights // Computing in Science and Engineering.– 2022.– Vol. 24.– No. 2.– Pp. 95–100. https://doi.org/10.1109/MCSE.2022.3163817
Evans J. Inside Grace, Featured Playlists, GPU Technology Conference (GTC), Nvidia On-Demand.– Nvidia.– 2022. hUtRtpLs://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41129/
CUDA C++ Programming Guide, Release 12.4.– Nvidia.– 2024.– 544 pp. hUtRtpLs://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
Ampere Tuning Guide, Release 12.4.– Nvidia.– 2024.– 22 pp. hUtRtpLs://docs.nvidia.com/cuda/pdf/Ampere_Tuning_Guide.pdf
Zhang Z., Jiao S., Li J., Wu W., Wan L., Qin X., Hu W., Yang J. KSSOLVGPU: An efficient GPU-enabled MATLAB toolbox for solving the Kohn-Sham equations within density functional theory in plane-wave basis set // Chinese Journal of Chemical Physics.– 2021.– Vol. 34.– No. 5.– Pp. 552–564. https://doi.org/10.1063/1674-0068/cjcp2108139
Giannozzi P., Baseggio O., Bonfà P., Brunato D., Car R., Carnimeo I., Cavazzoni C., de Gironcoli S., Delugas P., Ruffino . F., Ferretti A., Marzari N., Timrov I., Urru A., Baroni S. Quantum ESPRESSO toward the exascale // The Journal of chemical physics.– 2020.– Vol. 152.– No. 15.– id. 154105. https://doi.org/10.1063/5.0005082
Хэ Личжун Темпы локализации графических процессоров ускоряются, и новые команды продолжают появляться, Краткий отчет об отрасли.– Пекин: Capital Securities.– 2022 (Китайский).– 15 с. hUtRtpLs://pdf.dfcfw.com/pdf/H3_AP202208021576791297_1.pdf
Bispo J., Barbosa J., Silva P., Morales C., Myllykoski M., Ojeda-May P., Bialczak M., Uchroński M., Włodarczyk A., Wauligmann P., Krishnasamy E., Varrette S., Lührs S. Best Practice Guide: Modern Accelerators/ ed. Shoukourian H..– PRACE.– 2021.– 111 pp. hUtRtpLs://www.researchgate.net/publication/353446204_Best_Practice_Guide_Modern_Accelerators
Finkelstein J., Smith J. S., Mniszewski S. M., Barros K., Negre C. F. A., Rubensson E. H., Niklasson A. M. N. Quantum-based molecular dynamics simulations using tensor cores // Journal of Chemical Theory and Computation.– 2021.– Vol. 17.– No. 10.– Pp. 6180–6192. https://doi.org/10.1021/acs.jctc.1c00726
Posey S., Luitjens J., Hennigh O., Oberlin S. GPU-based HPC and AI developments for CFD (Maui, Hawaii, USA, July 11-15, 2022).– 2022.– id. ICCFD11-3803.– 5 pp. hUtRtpLs://www.iccfd.org/iccfd11/assets/pdf/papers/ICCFD11_Paper-3803.pdf
Schade R., Kenter T., Elgabarty H., Lass M., Schütt O., Lazzaro A., Pabst H., Mohr S., Hutter J., Kühne T. D., Plessl C. Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms // Parallel Computing.– July 2022.– Vol. 111.– id. 102920.– 11 pp. https://doi.org/10.1016/j.parco.2022.102920
Terzo O., Martinovič J (eds.) HPC, Big Data, and AI Convergence Towards Exascale: Challenge and Vision, 1st ed..– CRC Press.– 2022.– ISBN 9781003176664.– 322 pp. https://doi.org/10.1201/9781003176664
Nowicki M., Górski Ł., Bała P. PCJ Java library as a solution to integrate HPC, Big Data and Artificial Intelligence workloads // Journal of Big Data.– 2021.– Vol. 8.– No. 1.– Pp. 1–21.– id. 62. https://doi.org/10.1186/s40537-021-00454-6
Yin F., Shi F. A comparative survey of Big Data computing and HPC: from a parallel programming model to a cluster architecture // International Journal of Parallel Programming.– 2022.– Vol. 50.– No. 1.– Pp. 27–64. https://doi.org/10.1007/s10766-021-00717-y
Yin J., Wang F., Shankar M. Strategies for integrating deep learning surrogate models with HPC simulation applications // 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (Lyon, France, 2022).– IEEE.– 2022.– ISBN 978-1-6654-9747-3.– Pp. 01–10. https://doi.org/10.1109/IPDPSW55747.2022.00222
Sukumar S. R., Balma J. A., Rickett C.D., Maschhoff K. J., Landman J., Yates C. R., Chittiboyina A. G., Peterson Y. K., Vose A., Byler K., Baudry J., Khan I. A. The convergence of HPC, AI and Big Data in rapid-response to the COVID-19 pandemic // Driving Scientific and Engineering Discoveries Through the Integration of Exeriment, Big Data, and Modeling and Simulation: 21st Smoky Mountains Computational Sciences and Engineering, SMC 2021, Virtual Event, October 18-20, 2021, Revised Selected Papers, Communications in Computer and Information Science.– vol. 1512.– 2022.– ISBN 978-3-030-96497-9.– Pp. 157-172. https://doi.org/10.1007/978-3-030-96498-6_9
Ejarque J., Badia R. M., Albertin L., f, Aloisio G., Baglione E., Becerra Y., Boschert S., Berlin J. R., D’Anca A., Elia D., Exrtier F., Fiore S., Flich J., Folch A., Gibbons S. J., Koldunov N., Lordan F., Lorito S., Løvholt F., Macías J., Volpe M. Enabling dynamic and intelligent workflows for HPC, data analytics, and AI convergence // Future Generation Computer Systems.– September 2022.– Vol. 134.– Pp. 414–429. https://doi.org/10.1016/j.future.2022.04.014
Ihde N., Marten P., Eleliemy A., Poerwawinata G., Silva P., Tolovski I., Ciorba F. M., Rabl T. A survey of Big Data, High Performance Computing, and Machine Learning benchmarks, Technology Conference on Performance Evaluation and Benchmarking, Lecture Notes in Computer Science.– vol. 13169, Cham: Springer.– 2021.– ISBN 978-3-030-94436-0.– Pp. 98–118. https://doi.org/10.1007/978-3-030-94437-7_7
High-Performance Deep Learning Project (HiDL), NOWLAB: Network Based Computing Lab.– Ohio state university. UhtRtpL://hidl.cse.ohio-state.edu
High-Performance Big Data Project (HiBD), NOWLAB: Network Based Computing Lab.– Ohio state university. UhtRtpL://hidl.cse.ohio-state.edu
Jeon W., Ko G., Lee J., Lee H., Ha D., Ro W. W. Deep learning with GPUs // Advances in Computers.– 2021.– Vol. 122.– Pp. 167–215. https://doi.org/10.1016/bs.adcom.2020.11.003
Hong M., Xu L. Biren BR100 GPGPU: Accelerating Datacenter Scale AI Computing, 2022 IEEE Hot Chips 34 Symposium (HCS) (Cupertino, CA, USA).– 2022.– Pp. 1–22. https://doi.org/10.1109/HCS55958.2022.9895604
Shilov A. Russian Company Taps China’s Zhaoxin x86 CPU to Replace AMD, Intel CPUs, Tom’s Hardware.– New York: Future US.– 2022. hUtRtpLs://www.tomshardware.com/news/russian-company-taps-chinas-zhaoxin-x86-cpu-to-replace-amd-intel-cpus
Shang H., Li F., Zhang Y., Zhang L., Fu Y., Gao Y., Wu Y., Duan X., Lin R., Liu X., Liu Y., Chen D. Exreme-scale ab initio quantum Raman spectra simulations on the leadership HPC system in China // SC’21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New York: ACM.– November 2021.– ISBN 978-1-4503-8442-1.– id. 6.– 13 pp. https://doi.org/10.1145/3458817.3487402
Schneider D. The Exascale Era is Upon Us: The Frontier supercomputer may be the first to reach 1,000,000,000,000,000,000 operations per second // IEEE Spectrum.– January 2022.– Vol. 59.– No. 1.– Pp. 34–35. https://doi.org/10.1109/MSPEC.2022.9676353
Dongarra J., Geist A. Report on the Oak Ridge National Laboratory’s Frontier System, Technical Report ICL-UT-22-05.– Oak Ridge National Laboratory.– 2022.– Accessed 15.10.2023. hUtRtpLs://icl.utk.edu/files/publications/2022/icl-utk-1570-2022.pdf
Frontier Spec Sheet, Oak Ridge National Laboratory.– UT-Battelle.– 2019.– 4 pp. hUtRtpLs://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet_v4.pdf
GPU nodes—LUMI-G, Hardware documentation.– LUMI (Large Unified Modern Infrastructure) consortium. UhtRtpLs://docs.lumi-supercomputer.eu/hardware/lumig/
Markomanolis G. S., Alpay A., Young J., Klemm M., Malaya N., Esposito A., Heikonen J., Bastrakov S., Debus A., Kluge T., Steiniger K., Stephan J., Widera R., Bussmann M. Evaluating GPU programming models for the LUMI supercomputer // Supercomputing Frontiers, Lecture Notes in Computer Science (Asian Conference on Supercomputing Frontiers).– vol. 13214, Cham: Springer.– 2022.– ISBN 978-3-031-10419-0.– Pp. 79–101. https://doi.org/10.1007/978-3-031-10419-0_6
Aurora, Argonne Leadership Computing Facility.– Argonne National Laboratory. hUtRtpLs://www.alcf.anl.gov/aurora
Peckham O. LRZ announces new phase of SuperMUC-NG Supercomputer with Intels Ponte Vecchio GPU, Tabor network.– HPCwire.– 2021. hUtRtpLs://www.hpcwire.com/2021/05/05/lrz-announces-new-phase-of-supermuc-ng-supercomputer-with-intels-ponte-vecchio-gpu/
Kwack J. H., Tramm J., Bertoni C., Ghadar Y., Homerding B., Rangel E., Knight C., Parker S. Evaluation of performance portability of applications and mini-apps across AMD, Intel and Nvidia GPUs // 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC) (14 November 2021, St. Louis, MO, USA).– IEEE.– 2021.– ISBN 978-1-6654-2439-4.– Pp. 45–56. https://doi.org/10.1109/P3HPC54578.2021.00008
HPE Cray Supercomputing EX.– Hewlett Packard Enterprise Development LP.– 2024. hUtRtpLs://www.hpe.com/psnow/doc/a00094635enw
Bertoni C., Parker S. Aurora overvew, ALCF SDL Workshop (October 6, 2022).– 2022.– 20 pp. hUtRtpLs://www.alcf.anl.gov/sites/default/files/2022-10/aurora_overview_bertoni_10_6_2022.pdf
Morgan T.P. The NVSwitch Fabric That Is The Hub Of The DGX H100 SuperPOD, The Next Platform.– Stackhouse Publishing.– 2022.
Ishii A., Wells R. The Nvlink-Network switch: Nvidia’s switch chip for high communication-bandwidth superpods // 2022 IEEE Hot Chips 34 Symposium (HCS) (21–23 Aug., 2022, Cupertino, CA, USA).– IEEE.– 2022.– ISBN 978-1-6654-6028-6.– Pp. 1–23. https://doi.org/10.1109/HCS55958.2022.9895480
Eassa A., Ishii A., Wells R. Upgrading Multi-GPU Interconnectivity with the Third-Generation Nvidia NVSwitch, Technical blog, Nvidia developer.– 2022.
BR100 series general purpose GPU chip.– Shanghai: Biren Technology.– 2023. hUtRtpLs://www.birentech.com/BR10X.html
Andersch M., Palmer G., Krashinsky R., Stam N., Mehta V., Brito G., Ramaswamy S. Nvidia Hopper Architecture In-Depth, Technical blog, Nvidia developer.– 2022. hUtRtpLs://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
Alcorn P. From Opteron to Milan: Crusher Supercomputer Comes Online With New AMD CPUs and MI250X GPUs, Tom’s Hardware.– New York: Future US.– 2022. hUtRtpLs://www.tomshardware.com/news/from-opteron-to-milan-crusher-supercomputer-comes-online-with-amd-cpus-and-gpus
Intel Xeon CPU Max series product overview.– Intel.– 2023.– Accessed 15.10.2023. hUtRtpLs://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html
Accelerator Processor Stream.– European Processor Initiative.– 2022. hUtRtpLs://www.european-processor-initiative.eu/accelerator/
EPI EPAC1.0 RISC-V test chip samples delivered, News.– European Processor Initiative.– 2021. hUtRtpLs://www.european-processor-initiative.eu/epi-epac1-0-risc-v-test-chip-samples-delivered/
Kovač M., Notton P., Hofman D., Knezović J. How Europe is preparing its core solution for exascale machines and a global, sovereign, advanced computing platform // Mathematical and Computational Applications.– 2020.– Vol. 25.– No. 3.– Pp. 46. https://doi.org/10.3390/mca25030046
HIP Programming Guide, Version 5.0.– 2023.– Accessed 15.10.2023. hUtRtpLs://docs.amd.com/bundle/HIP-Programming-Guide-v5.0/page/Programming_with_HIP.html
OpenMP Application Programming Interface, Version 5.2.– OpenMP Architecture Review Board.– 2021.– 669 pp. hUtRtpLs://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf
Khronos OpenCL Registry, Formatted specifications and other related documentation.– Khronos Group. UhtRtpLs://registry.khronos.org/OpenCL/
SYCL 2020 Specification, rev. 6.– Khronos Group.– 2022.– 585 pp. hUtRtpLs://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf
DPC++ Part 1: An Introduction to the New Programming Model.– Intel (Accessed 15.10.2023). hUtRtpLs://www.intel.com/content/www/us/en/developer/videos/dpc-part-1-introduction-to-new-programming-model.html
Bavarsad N. N., Makrani H. M., Sayadi H., Landis L., Rafatirad S., Homayoun H. HosNa: A DPC++ benchmark suite for heterogeneous architectures // 2021 IEEE 39th International Conference on Computer Design (ICCD) (24–27 October 2021, Storrs, CT, USA).– IEEE.– 2021.– ISBN 978-1-6654-3219-1.– Pp. 509–516. https://doi.org/10.1109/ICCD53106.2021.00084
Trott C., Berger-Vergiat L., Poliakoff D., Rajamanickam S., Lebrun-Grandie D., Madsen J., Al Awar N., Gligoric M., Shipman G., Womeldorff G. The Kokkos EcoSystem: comprehensive performance portability for high performance computing // Computing in Science & Engineering.– 2021.– Vol. 23.– No. 5.– Pp. 10–18. https://doi.org/10.1109/MCSE.2021.3098509
Trott C. R., Lebrun-Grandié D., Arndt D., Ciesko J., Dang V., Ellingwood N., Gayatri R., Harvey E., Hollman D. S., Ibanez D., Liber N., Madsen J., Miles J., Poliakoff D., Powell A., Rajamanickam S., Simberg M., Sunderland D., Turcksin B., Wilke J. Kokkos 3: Programming model extensions for the exascale era // IEEE Transactions on Parallel and Distributed Systems.– 2021.– Vol. 33.– No. 4.– Pp. 805–817. https://doi.org/10.1109/TPDS.2021.3097283
Moore S. The state of the LAMMPS KOKKOS package, SAND2021-9785C.– Albuquerque, NM: Sandia National Lab.– 2021.– Accessed 15.10.2023. UhtRtpLs://www.osti.gov/servlets/purl/1888676
Ghadar Y., Applencourt T., Homerding B., Harms K., Hammond J. SYCL Programming Model for Aurora, 2020 ECP Annual Meeting.– 2020.
Van Oostrum R., Chalmers N., McDougall D., Bauman P., Curtis N., Malaya N., Wolfe N. AMD GPU Hardware Basics, Frontier Application Readiness Kick-Off Workshop.– 2019.– 55 pp. hUtRtpLs://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf
Intel oneAPI GPU Optimization Guide Release 2022.3.– Intel (Accessed 15.10.2023). hUtRtpLs://www.intel.com/content/dam/develop/external/us/en/documents/oneapi-gpu-optimization-guide.pdf
Khudia D., Huang J., Basu P., Deng S., Liu H., Park J., Smelyanskiy M. Fbgemm: Enabling high-performance low-precision deep learning inference.– 2021.– 5 pp. arXivarXiv 2101.05615
Carrasco R., Vega R., Navarro C. A. Analyzing GPU tensor core potential for fast reductions // 2018 37th International Conference of the Chilean Computer Science Society (SCCC) (05–09 November 2018, Santiago, Chile).– IEEE.– 2018.– ISBN 9781538692349.– Pp. 1–6. https://doi.org/10.1109/SCCC.2018.8705253
Gupta G. Using Tensor Cores for Mixed-Precision Scientific Computing, Technical blog, Nvidia developer.– 2019. hUtRtpLs://developer.nvidia.com/blog/tensor-cores-mixed-precision-scientific-computing/
Nvidia A100 Tensor Core GPU Architecture, V1.0.– Nvidia.– 2020.– 82 pp. hUtRtpLs://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper
754-2019—IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019, Revision of IEEE 754-2008.– 2019.– 84 pp. https://doi.org/10.1109/IEEESTD.2019.8766229
Kalamkar D., Mudigere D., Mellempudi N., Das D., Banerjee K., Avancha S., Vooturi D. T., Jammalamadaka N., Huang J., Yuen H., Yang J., Park J., Heinecke A., Georganas E., Srinivasan S., Kundu A., Smelyanskiy M., Kaul B., Dubey P. A study of BFLOAT16 for deep learning training.– 2019.– 10 pp. arXivarXiv 1905.12322
Stosic D., Micikevicius P. Accelerating AI Training with Nvidia TF32 Tensor Cores, Technical blog, Nvidia developer.– 2021. hUtRtpLs://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/
Micikevicius P., Stosic D., Burgess N., Cornea M., Dubey P., Grisenthwaite R., Ha S., Heinecke A., Judd P., Kamalu J., Mellempudi N., Oberman S., Shoeybi M., Siu M., Wu H. Fp8 formats for deep learning.– 2022.– 9 pp. arXivarXiv 2209.05433
Nvidia H100 Tensor Core GPU Architecture, Includes final GPU / memory clocks and final TFLOPS performance specs, V1.04.– Nvidia.– 2023.– 71 pp.
Sun W., Li A., Geng T., Stuijk S., Corporaal H. Dissecting tensor cores via microbenchmarks: latency, throughput and numerical behaviors // IEEE Transactions on Parallel and Distributed Systems.– 2022.– Vol. 34.– No. 1.– Pp. 246–261. https://doi.org/10.1109/TPDS.2022.3217824
Lehmann M., Krause M. J., Amati G., Sega M., Harting J., Gekle S. Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats // Physical Review E.– 2022.– Vol. 106.– No. 1.– id. 015308. https://doi.org/10.1103/PhysRevE.106.015308
Domke J., Matsumura K., Wahib M., Zhang H., Yashima K., Tsuchikawa T., Tsuji Y., Podobas A., Matsuoka S. Double-precision FPUs in high-performance computing: an embarrassment of riches? 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (20–24 May 2019, Rio de Janeiro, Brazil).– IEEE.– 2019.– ISBN 978-1-7281-1246-6.– Pp. 78–88. https://doi.org/10.1109/IPDPS.2019.00019
Schade R., Kenter T., Elgabarty H., Lass M., Schütt O., Lazzaro A., Pabst H., Mohr S., Hutter J., Kühne T. D., Plessl C. Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms // Parallel Computing.– 2022.– Vol. 111.– id. 102920.– 11 pp. https://doi.org/10.1016/j.parco.2022.102920
Schade R., Kenter T., Elgabarty H., Lass M., Kühne T.D., Plessl C. Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics.– 2022.– 6 pp. arXivarXiv 2205.12182
Yu V. W., Govoni M. GPU acceleration of large-scale full-frequency GW calculations.– 2022.– 54 pp. arXivarXiv 2203.05623
Eriksen J. J. Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives // Molecular Physics.– 2017.– Vol. 115.– No. 17–18.– Pp. 2086–2101. https://doi.org/10.1080/00268976.2016.1271155
Ruda D., Turek S., Ribbrock D., Zajac P. Very fast FEM Poisson solvers on lower precision accelerator hardware, ECCOMAS Congress 2022 (5–9 June 2022, Oslo, Norway).– 2022.– 24 pp. hUtRtpLs://www.mathematik.tu-dortmund.de/lsiii/cms/papers/RudaTurekRibbrockZajac2022b.pdf
Ootomo H., Yokota R. Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance // The International Journal of High Performance Computing Applications.– 2022.– Vol. 36.– No. 4.– Pp. 475–491. https://doi.org/10.1177/10943420221090256
Jain A., Sharma N. Accelerated AI inference at CNN-based machine vision in ASICs: A design approach // ECS Transactions.– 2022.– Vol. 107.– No. 1.– Pp. 5165. https://doi.org/10.1149/10701.5165ecst
Gallet B., Gowanlock M. Computing double precision Euclidean distances using GPU tensor cores.– 2022.– 10 pp. arXivarXiv 2209.11287
Domke J., Vatai E., Drozd A., Chen P. T, Oyama Y., Zhang L., Salaria S., Mukunoki D., Podobas A., Wahib M. T, Matsuoka S. Matrix engines for high performance computing: A paragon of performance or grasping at straws?2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (17–21 May 2021, Portland, OR, USA).– IEEE.– 2021.– ISBN 978-1-6654-4066-0.– Pp. 1056–1065. https://doi.org/10.1109/IPDPS49936.2021.00114
Tan H., Yan R., Yang L., Huang L., Xiao L., Yang Q. Efficient multiple-precision and mixed-precision floating-point fused multiply-accumulate unit for HPC and AI applications // Algorithms and Architectures for Parallel Processing, 22nd International Conference ICA3PP 2022 (Copenhagen, Denmark, October 10–12, 2022), Lecture Notes in Computer Science.– vol. 13777, Cham: Springer Nature Switzerland.– 2023.– ISBN 978-3-031-22676-2.– Pp. 642–659. https://doi.org/10.1007/978-3-031-22677-9_34
Эксклюзивное интервью с руководителями Biren Technology: деконструкция первого 7-нм графического процессора компании, Обзор от компании MooreElite.com (Hefei).– 2022 (Китайский). hUtRtpLs://caifuhao.eastmoney.com/news/20220812093829803631950
Nvidia A100 Tensor Core GPU Datasheet, V1.0.– Nvidia.– 2020.– 3 pp. hUtRtpLs://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/a100-80gb-datasheet-update-
Choquette J., Lee E., Krashinsky R., Balan V., Khailany B. 3.2 The A100 Datacenter GPU and Ampere Architecture // 2021 IEEE International Solid-State Circuits Conference (ISSCC) (13–22 February 2021, San Francisco, CA, USA).– IEEE.– 2021.– ISBN 9781728195506.– Pp. 48–50. https://doi.org/10.1109/ISSCC42613.2021.9365803
Nvidia A100 tensor core GPU architecture, V1.0.– Nvidia.– 2020.– 82 pp. hUtRtpLs://resources.nvidia.com/en-us-genomics-ep/ampere-architecture-white-paper
Hassanpour M., Riera M., González A. A survey of near-data processing architectures for neural networks // Machine Learning and Knowledge Extraction.– 2022.– Vol. 4.– No. 1.– Pp. 66–102. https://doi.org/10.3390/make4010004
Gómez-Luna J., Guo Y., Brocard S., Legriel J., Cimadomo R., Oliveira G. F., Singh G., Mutlu O. An experimental evaluation of machine learning training on a real processing-in-memory system.– 2022.– 21 pp. arXivarXiv 2207.07886
Niu D., Li S., Wang Y., Han W., Zhang Z., Guan Y., Guan T., Sun F., Xue F., Duan L., Fang Y., Zheng H., Jiang X., Wang S., Zuo F., Wang Y., Yu B., Ren Q., Xie Y. 184QPS/W 64Mb/mm23D logic-to-DRAM hybrid bonding with process-near-memory engine for recommendation system // IEEE International Solid-State Circuits Conference (ISSCC) (20–26 February 2022, San Francisco, CA, USA).– IEEE.– 2022.– Pp. 1–3. https://doi.org/10.1109/ISSCC42614.2022.9731694
BiLi 106M, Product details.– Shanghai: Biren Technology.– 2020–2023. hUtRtpLs://www.birentech.com/product_details/1005557637772464128.html
BiLi 106B, 106C.– Shanghai: Biren Technology.– 2020–2023. hUtRtpLs://www.birentech.com/product_details/1005557844745474048.html
Blankenship R., Wagh M. Introducing the CXL 3.1 Specification.– Compute express link consortium.– 2022.– 27 pp. hUtRtpLs://computeexpresslink.org/wp-content/uploads/2024/03/CXL_3.1-Webinar-Presentation_Feb_2024.pdf
Coughlin T. Digital storage and memory // Computer.– 2022.– Vol. 55.– No. 1.– Pp. 20–29. https://doi.org/10.1109/MC.2021.3125165
Nvidia A100 Tensor Core GPU Datasheet.– Nvidia.– 2021.– 3 pp. hUtRtpLs://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf
Ampere Tuning Guide, Release 12.4.– Nvidia.– 2024.– 22 pp. hUtRtpLs://docs.nvidia.com/cuda/pdf/Ampere_Tuning_Guide.pdf
Server/OAI, Wiki page.– Open computers project. hUtRtpLs://www.opencompute.org/wiki/Server/OAI
Nvidia DGX A100, Datasheet.– Nvidia.– 2023.– 2 pp. hUtRtpLs://images.nvidia.com/aem-dam/Solutions/Data-Center/nvidia-dgx-
Morgan T.P. China launches the inevitable indigenous GPU, The Next Platform.– Stackhouse Publishing.– 2022. hUtRtpLs://www.nextplatform.com/2022/08/25/china-launches-the-inevitable-indigenous-gpu/
BIRENSUPA software development platform, Product details.– Shanghai: Biren Technology.– 2023. hUtRtpLs://www.birentech.com/product_details/1005588957219246080.html
MLPerf inference: datacenter benchmark suite results.– MLCommons. UhtRtpLs://mlcommons.org/en/inference-datacenter-21/
Reddi V. J., Cheng C., Kanter D., Mattson P., Schmuelling G., Carole-Wu J., Anderson B., Breughe M., Charlebois M., Chou W., Chukka R., Coleman C., Davis S., Deng P., Diamos G., Duke J., Fick D., Gardner J. S., Hubara I., Idgunji S., Jablin T. B., Jiao J., John T. S., Kanwar P., Lee D., Liao J., Lokhmotov A., Massa F., Meng P., Micikevicius P., Osborne C., Pekhimenko G., Rajan A. T. R., Sequeira D., Sirasao A., Sun F., Tang H., Thomson M., Wei F., Wu E., Xu L., Yamada K., Yu B., Yuan G., Zhong A., Zhang P., Zhou Y. Mlperf inference benchmark // 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) (30 May 2020–03 June 2020, Valencia, Spain).– IEEE.– 2020.– ISBN 978-1-7281-4661-4.– Pp. 446–459. https://doi.org/10.1109/ISCA45697.2020.00045
Saad M. H., Hashima S., Sayed W., El-Shazly E. H., Madian A. H., Fouda M. M. Early diagnosis of COVID-19 images using optimal CNN hyperparameters // Diagnostics.– 2023.– Vol. 13.– No. 1.– id. 76. https://doi.org/10.3390/diagnostics13010076
Devlin J., Ming-Chang W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding // Human Language Technology: Conference of the North American Chapter of the Association of Computational Linguistics.– V. 1, NAACL-HLT 2019 (June 2–June 7, 2019, Minneapolis, Minnesota, USA).– ACL.– 2019.– ISBN 978-1-950737-13-0.– Pp. 4171–4186. hUtRtpLs://aclanthology.org/N19-1423.pdf
Nvidia TensorRT, an SDK for high-performance deep learning inference, Web site, Nvidia developer.– Nvidia. hUtRtpLs://developer.nvidia.com/tensorrt
Blythe D. The Xe GPU architecture // 2020 IEEE Hot Chips 32 Symposium (HCS) (16–18 August 2020, Palo Alto, CA, USA).– IEEE.– 2020.– ISBN 978-1-7281-7129-6.– Pp. 1–27. https://doi.org/10.1109/HCS49909.2020.9220591
Blythe D. XeHPC Ponte Vecchio // 2021 IEEE Hot Chips 33 Symposium (HCS) (22–24 August 2021, Palo Alto, CA, USA).– IEEE.– 2021.– ISBN 978-1-6654-1397-8.– Pp. 1–34. https://doi.org/10.1109/HCS52781.2021.9567038
Intel data center GPU Max series product brief.– Intel (Accessed 15.10.2023). https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2023-01/data-center-gpu-max-series-product-brief.pdf
Intel data center GPU flex series product brief.– Intel (Accessed 15.10.2023). https://www.intel.com/content/dam/www/central-libraries/us/en/documents/2022-08/ats-m-product-brief-final.pdf
Dhote D., Virmani C., Krishna K. G., Raghav S. The science of ray tracing // International Journal of Computer Applications.– 2020.– Vol. 176.– No. 42.– Pp. 15–20. https://doi.org/10.5120/ijca2020920443
Intel data center GPU Max series.– Intel (Accessed 15.10.2023). https://ark.intel.com/content/www/us/en/ark/products/series/232874/intel-data-center-gpu-max-series.html
Jiang H. Intel’s Ponte Vecchio GPU: architecture, systems and software // 2022 IEEE Hot Chips 34 Symposium (HCS) (21–23 August 2022, Cupertino, CA, USA).– IEEE.– 2022.– ISBN 978-1-6654-6028-6.– Pp. 1–29. https://doi.org/10.1109/HCS55958.2022.9895631
Sidorova M., Gorbushin L., Koneva N. Analytical review of electronic devices of modern supercomputing systems, Proceedings of the International Russian Automation Conference, RusAutoCon2021 (September 5-11, 2021, Sochi, Russia), Lecture Notes in Electrical Engineering.– vol. 857, Cham: Springer.– 2022.– ISBN 978-3-030-94201-4.– Pp. 25–33. https://doi.org/10.1007/978-3-030-94202-1_3
Tian W., Li B., Li Z., Cui H., Shi J., Wang Y., Zhao J. Using chiplet encapsulation technology to achieve processing-in-memory functions // Micromachines.– 2022.– Vol. 13.– No. 10.– Pp. 1790. https://doi.org/10.3390/mi13101790
Moore S. K. 3 paths to 3D processors // IEEE Spectrum.– 2022.– Vol. 59.– No. 6.– Pp. 24–29. https://doi.org/10.1109/MSPEC.2022.9792148
Zhang S., Li Z., Zhou H., Li R., Wang S., Paik K.-W., He P. Recent prospectives and challenges of 3D heterogeneous integration // e-Prime-Advances in Electrical Engineering, Electronics and Energy.– 2022.– id. 100052. https://doi.org/10.1016/j.prime.2022.100052
Hadidi R., Asgari B., Mudassar B. A., Mukhopadhyay S., Yalamanchili S., Kim H. Demystifying the characteristics of 3D-stacked memories: A case study for Hybrid Memory Cube // 2017 IEEE international symposium on Workload characterization (IISWC) (01–03 October 2017, Seattle, WA, USA).– IEEE.– 2017.– Pp. 66–75. https://doi.org/10.1109/IISWC.2017.8167757
Ma X., Wang Y., Wang Y., Cai X., Han Y. Survey on chiplets: interface, interconnect and integration methodology // CCF Transactions on High Performance Computing.– 2022.– No. 4.– Pp. 43–52. https://doi.org/10.1007/s42514-022-00093-0
Universal chiplet interconnect express specifications.– Universal Chiplet Interconnect Express.– 2023. https://www.uciexpress.org/specification
Gomes W., Koker A., Stover P., Ingerly D., Siers S., Venkataraman S., Pelto C., Shah T., Rao A., O’Mahony F., Karl E., Cheney L., Rajwani I., Jain H., Cortez R., Chandrasekhar A., Kanthi B., Koduri R. Ponte Vecchio: A multi-tile 3D stacked processor for exascale computing // 2022 IEEE International Solid-State Circuits Conference (ISSCC) (20–26 February 2022, San Francisco, CA, USA).– IEEE.– 2022.– ISBN 978-1-6654-2800-2.– Pp. 42–44. https://doi.org/10.1109/ISSCC42614.2022.9731673
Gomes W., Koker A., Stover P., Ingerly D., Siers S., Venkataraman S., Pelto C., Shah T., Rao A., O’Mahony F., Karl E., Cheney L., Rajwani I., Jain H., Cortez R., Chandrasekhar A., Kanthi B., Koduri R. Ponte Vecchio: A multi-tile 3D stacked processor for exascale computing, HPC user forum, Accelerated Computing Systems and Graphics Group.– 2021. https://www.hpcuserforum.com/wp-content/uploads/2021/05/Gomes_Intel_Ponte-Vecchio_Mar2022-HPC-UF.pdf
Intel data center GPU Max series technical overview.– Intel.– 2023 (Accessed 15.10.2023). https://www.intel.com/content/www/us/en/developer/articles/technical/intel-data-center-gpu-max-series-overview.html
Moore S. K. Behind Intel’s HPC chip that will pierce the exascale barrier, Blog, IEEE Spectrum.– IEEE.– 2022.
Ingerly D. B., Amin S., Aryasomayajula L., Balankutty A., Borst D., Chandra A., Cheemalapati K., Cook C. S., Criss R., Enamul K., Gomes W., Jones D., Kolluru K. C., Kandas A., Kim G.-S., Ma H., Pantuso D., Petersburg C. F., Phen-givoni M., Pillai A. M., Sairam A., Shekhar P., Sinha P., Stover P., Telang A., Zell Z. Foveros: 3D integration and the use of face-to-face chip stacking for logic devices // 2019 IEEE International Electron Devices Meeting (IEDM) (07–11 December 2019, San Francisco, CA, USA).– IEEE.– 2019.– ISBN 978-1-7281-4033-9.– Pp. 19.6.1–19.6.4. https://doi.org/10.1109/IEDM19573.2019.8993637
Mahajan R., Sankman R., Patel N., Kim D.-W., Aygun K., Qian Z., Mekonnen Y., Salama I., Sharan S., Iyengar D., Mallik D. Embedded multi-die interconnect bridge (EMIB)–a high density, high bandwidth packaging interconnect // 2016 IEEE 66th Electronic Components and Technology Conference (ECTC) (31 May 2016–03 June 2016, Las Vegas, NV, USA).– IEEE.– 2016.– Pp. 557–565. https://doi.org/10.1109/ECTC.2016.201
Irani S. Hang SK Intel Ponte Vecchio compute accelerator OAM product and system, 2021 OCP Global Summit.– 2021. https://www.opencompute.org/events/past-events/2021-ocp-global-summit
Tekin A., Durak A. T., Piechurski C., Kaliszan D., Sungur F. A., Robertsén F., Gschwandtner P. State-of-the-art and trends for computing and interconnect network solutions for HPC and AI, Partnership for Advanced Computing in Europe.– PRACE.– 2021.– 38 pp. https://prace-ri.eu/wp-content/uploads/State-of-the-Art-and-Trends-for-Computing-and-Interconnect-Network-Solutions-for-HPC-and-AI-1.pdf
Sun W., Li A., Geng T., Stuijk S., Corporaal H. Dissecting tensor cores via microbenchmarks: latency, throughput and numerical behaviors // IEEE Transactions on Parallel and Distributed Systems.– 2023.– Vol. 34.– No. 1.– Pp. 246–261. https://doi.org/10.1109/TPDS.2022.3217824
Intel Products formerly Alchemist.– Intel (Accessed 15.10.2023). https://ark.intel.com/content/www/us/en/ark/products/codename/226095/products-formerly-alchemist.html
Watts D. Lenovo ThinkSystem and ThinkAgile GPU Summary, Product Guide.– Lenovo press.– 2024.– 71 pp. https://lenovopress.lenovo.com/lp1602.pdf
Liu Zh. Intel Axes Data Center GPU Max 1350, Preps New Max 1450 for ’Different Markets’, Tom’s Hardware.– New York: Future US.– 2023. https://www.tomshardware.com/news/intel-axes-data-center-gpu-max-1350-preps-max-1450-for-different-markets
Vuduc R., Chandramowlishwaran A., Choi J., Guney M.(E.), Shringarpure A. On the limits of GPU acceleration // Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar’10 (June 14–15, 2010, Berkeley, CA, USA), Berkeley: USENIX Association.– 2010.– id. 13.– 6 pp. https://www.usenix.org/legacy/events/hotpar10/tech/full_papers/Vuduc.pdf
Hanindhito B., Gourounas D., Fathi A., Trenev D., Gerstlauer A., John L. K. GAPS: GPU-acceleration of PDE solvers for wave simulation // ICS ’22: Proceedings of the 36th ACM International Conference on Supercomputing (June 28–30, 2022, Virtual Event), New York: ACM.– 2022.– ISBN 978-1-4503-9281-5.– id. 30.– 13 pp. https://doi.org/10.1145/3524059.3532373
Chalmers N., Mishra A., McDougall D., Warburton T. HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark // The International Journal of High Performance Computing Applications.– 2023.– Vol. 37.– No. 5.– Pp. 560-577. https://doi.org/10.1177/10943420231178552
