A New Generation of GPGPUs and Related Hardware: Microarchitecture and Performance of Computing Systems from Servers to Supercomputers


This review surveys the current state of GPGPUs with a focus on their use for traditional HPC workloads (and, to a lesser extent, AI). Nvidia V100 and A100 are treated as the baseline GPGPUs. The new-generation GPGPUs considered are Nvidia H100, AMD MI100 and MI200, Intel Ponte Vecchio (Data Center GPU Max), and the BR100 from Biren Technology. The microarchitecture and hardware characteristics of these GPGPUs that matter for HPC and AI are analyzed and compared, along with the most important supporting hardware for building GPGPU-based computing systems: CPUs specialized for working with new-generation GPGPUs, and interconnects. Brief information is given on servers that use them, including multi-GPU servers, and on new supercomputers built on these GPGPUs, from which data on achieved GPGPU performance were obtained. The SDKs of the GPGPU vendors and software from other companies, including math libraries, are briefly reviewed. Examples are given of features of widely used programming models that are important for reaching maximum performance but at the same time make code non-portable to other GPGPU models. Particular attention is paid to the use of tensor cores and their analogues in current GPGPUs from different vendors. This includes computations in reduced precision (relative to FP64, the standard format for HPC) and in mixed precision, which have become relevant because of the sharp performance gains they deliver on GPGPU tensor cores. Data on the real performance achieved in HPC and AI benchmarks and applications are analyzed. The use of modern batched linear algebra libraries on GPGPUs, including in HPC applications, is also briefly discussed.


GPGPU, V100, A100, H100, Grace, GH200 Grace Hopper, MI100, MI200, Ponte Vecchio, Data Center GPU Max, BR100, CUDA, HIP, DPC++, Fortran, performance, HPC, AI, deep learning

Short URL: https://sciup.org/143183240

IDR: 143183240   |   DOI: 10.25209/2079-3316-2024-15-2-139-473

Список литературы Новое поколение GPGPU и сопутствующего оборудования: микроархитектура и производительность вычислительных систем от серверов до суперкомпьютеров

  • Top500 the list, The Green500 ranking, 61st edition.– 2023. hUtRtpLs://top500.org/lists/green500/2023/06/
  • Tschudi W., Xu T., Sartor D., Stein J. High-performance data centers. A research roadmap, LBNL-53483.– 2004.– 53 pp. hUtRtpLs://escholarship.org/uc/item/0w64r459
  • Maltenberger T., Ilic I., Tolovski I, Rab T. Evaluating multi-GPU sorting with modern interconnects // SIGMOD’22: Proceedings of the 2022 International Conference on Management of Data (Philadelphia, PA, USA, June 12–17, 2022), New York: ACM.– 2022.– ISBN 978-1-4503-9249-5.– Pp. 1795–1809. https://doi.org/10.1145/3514221.3517842
  • Top500 the list, The Top500 ranking, 61st edition.– 2023. hUtRtpLs://www.top500.org/lists/top500/2023/06/highs/
  • Кузьминский М. Б. Современные серверные ARM-процессоры для суперЭ-ВM: A64FX и другие. Начальные данные тестов производительности // Программные системы: теория и приложения.– 2022.– Т. 13.– №1(52).– С. 63–129. https://doi.org/10.25209hU/t2Rt0pL7s9:/-3/3p1s6ta-2.p0s2i2r-a1s3.r-u1/-6r3e-a1d2/9psta2022_1_63-129.pdf
  • Gao J., Zheng F., Qi F, Ding Y, Li H., Lu H., He W., Wei H., Jin L., Liu X., Gong D., Wang F., Zheng Y., Sun H., Zhou Z., Liu Y., You H. Sunway supercomputer architecture towards exscale computing: analysis and practice // Science China Information Sciences.– 2021.– Vol. 64.– No. 4.– id. 141101.– 21 pp. https://doi.org/10.1007/hUs1tRt1p4L:3/2/-s0c2is0.-s3c1ic0h4i-n7a.com/en/2021/141101.pdf
  • Selig J. The cerebras software development kit: A technical overview.– Cerebras systems Inc..– 2022.– 8 pp. hUtRtpLs://f.hubspotusercontent30.net/hubfs/8968533/Cerebras%20SDK%20Technical%20Overview%20White%20Paper.pdf
  • Andromeda, a 13.5 Million Core AI Supercomputer, a section on the Cerebras company site.– 2024. UhtRtpLs://www.cerebras.net/andromeda/
  • Top500 the list, List Statistics of TOP500, 61st edition.– 2023. hUtRtpLs://www.top500.org/UhltiRtsptLss/:/to/pw5w0w0/.t2o0p2530/00.6o/rhg/igshtas/tistics/list/
  • Morgan T. P. Chip roadmaps unfold, crisscrossing and interconnecting, at AMD, The Next Platform.– Stackhouse Publishing.– 2022. hUtRtpLs://www.nextplatform.com/2022/06/14/chip-roadmaps-unfold-crisscrossing-and-interconnecting-at-amd/
  • Shah A. Intel Reiterates Plans to Merge CPU, GPU High-performance Chip Roadmaps, Tabor network.– HPCwire.– 2022. hUtRtpLs://www.hpcwire.com/2022/05/31/intel-reiterates-plans-to-merge-cpu-gpu-high-performance-chip-roadmaps/
  • Morgan T. P. The Increasingly Graphic Nature Of Intel Datacenter Compute, The Next Platform.– Stackhouse Publishing.– 2022. hUtRtpLs://www.nextplatform.com/2022/06/08/the-increasingly-graphic-nature-of-intel-datacenter-compute/
  • Evans J. Nvidia Grace // 2022 IEEE Hot Chips 34 Symposium (HCS) (Cupertino, CA, USA, 21–23 August 2022).– IEEE.– 2022.– Pp. 1–20. https://doi.org/10.1109/HCS55958.2022.9895599
  • Elster A. C., Haugdahl T. A. Nvidia Hopper GPU and Grace CPU Highlights // Computing in Science and Engineering.– 2022.– Vol. 24.– No. 2.– Pp. 95–100. https://doi.org/10.1109/MCSE.2022.3163817
  • Evans J. Inside Grace, Featured Playlists, GPU Technology Conference (GTC), Nvidia On-Demand.– Nvidia.– 2022. hUtRtpLs://www.nvidia.com/en-us/on-demand/session/gtcfall22-a41129/
  • CUDA C++ Programming Guide, Release 12.4.– Nvidia.– 2024.– 544 pp. hUtRtpLs://docs.nvidia.com/cuda/pdf/CUDA_C_Programming_Guide.pdf
  • Ampere Tuning Guide, Release 12.4.– Nvidia.– 2024.– 22 pp. hUtRtpLs://docs.nvidia.com/cuda/pdf/Ampere_Tuning_Guide.pdf
  • Zhang Z., Jiao S., Li J., Wu W., Wan L., Qin X., Hu W., Yang J. KSSOLVGPU: An efficient GPU-enabled MATLAB toolbox for solving the Kohn-Sham equations within density functional theory in plane-wave basis set // Chinese Journal of Chemical Physics.– 2021.– Vol. 34.– No. 5.– Pp. 552–564. https://doi.org/10.1063/1674-0068/cjcp2108139
  • Giannozzi P., Baseggio O., Bonfà P., Brunato D., Car R., Carnimeo I., Cavazzoni C., de Gironcoli S., Delugas P., Ruffino . F., Ferretti A., Marzari N., Timrov I., Urru A., Baroni S. Quantum ESPRESSO toward the exascale // The Journal of chemical physics.– 2020.– Vol. 152.– No. 15.– id. 154105. https://doi.org/10.1063/5.0005082
  • Хэ Личжун Темпы локализации графических процессоров ускоряются, и новые команды продолжают появляться, Краткий отчет об отрасли.– Пекин: Capital Securities.– 2022 (Китайский).– 15 с. hUtRtpLs://pdf.dfcfw.com/pdf/H3_AP202208021576791297_1.pdf
  • Bispo J., Barbosa J., Silva P., Morales C., Myllykoski M., Ojeda-May P., Bialczak M., Uchroński M., Włodarczyk A., Wauligmann P., Krishnasamy E., Varrette S., Lührs S. Best Practice Guide: Modern Accelerators/ ed. Shoukourian H..– PRACE.– 2021.– 111 pp. hUtRtpLs://www.researchgate.net/publication/353446204_Best_Practice_Guide_Modern_Accelerators
  • Finkelstein J., Smith J. S., Mniszewski S. M., Barros K., Negre C. F. A., Rubensson E. H., Niklasson A. M. N. Quantum-based molecular dynamics simulations using tensor cores // Journal of Chemical Theory and Computation.– 2021.– Vol. 17.– No. 10.– Pp. 6180–6192. https://doi.org/10.1021/acs.jctc.1c00726
  • Posey S., Luitjens J., Hennigh O., Oberlin S. GPU-based HPC and AI developments for CFD (Maui, Hawaii, USA, July 11-15, 2022).– 2022.– id. ICCFD11-3803.– 5 pp. hUtRtpLs://www.iccfd.org/iccfd11/assets/pdf/papers/ICCFD11_Paper-3803.pdf
  • Schade R., Kenter T., Elgabarty H., Lass M., Schütt O., Lazzaro A., Pabst H., Mohr S., Hutter J., Kühne T. D., Plessl C. Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms // Parallel Computing.– July 2022.– Vol. 111.– id. 102920.– 11 pp. https://doi.org/10.1016/j.parco.2022.102920
  • Terzo O., Martinovič J (eds.) HPC, Big Data, and AI Convergence Towards Exascale: Challenge and Vision, 1st ed..– CRC Press.– 2022.– ISBN 9781003176664.– 322 pp. https://doi.org/10.1201/9781003176664
  • Nowicki M., Górski Ł., Bała P. PCJ Java library as a solution to integrate HPC, Big Data and Artificial Intelligence workloads // Journal of Big Data.– 2021.– Vol. 8.– No. 1.– Pp. 1–21.– id. 62. https://doi.org/10.1186/s40537-021-00454-6
  • Yin F., Shi F. A comparative survey of Big Data computing and HPC: from a parallel programming model to a cluster architecture // International Journal of Parallel Programming.– 2022.– Vol. 50.– No. 1.– Pp. 27–64. https://doi.org/10.1007/s10766-021-00717-y
  • Yin J., Wang F., Shankar M. Strategies for integrating deep learning surrogate models with HPC simulation applications // 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (Lyon, France, 2022).– IEEE.– 2022.– ISBN 978-1-6654-9747-3.– Pp. 01–10. https://doi.org/10.1109/IPDPSW55747.2022.00222
  • Sukumar S. R., Balma J. A., Rickett C.D., Maschhoff K. J., Landman J., Yates C. R., Chittiboyina A. G., Peterson Y. K., Vose A., Byler K., Baudry J., Khan I. A. The convergence of HPC, AI and Big Data in rapid-response to the COVID-19 pandemic // Driving Scientific and Engineering Discoveries Through the Integration of Exeriment, Big Data, and Modeling and Simulation: 21st Smoky Mountains Computational Sciences and Engineering, SMC 2021, Virtual Event, October 18-20, 2021, Revised Selected Papers, Communications in Computer and Information Science.– vol. 1512.– 2022.– ISBN 978-3-030-96497-9.– Pp. 157-172. https://doi.org/10.1007/978-3-030-96498-6_9
  • Ejarque J., Badia R. M., Albertin L., f, Aloisio G., Baglione E., Becerra Y., Boschert S., Berlin J. R., D’Anca A., Elia D., Exrtier F., Fiore S., Flich J., Folch A., Gibbons S. J., Koldunov N., Lordan F., Lorito S., Løvholt F., Macías J., Volpe M. Enabling dynamic and intelligent workflows for HPC, data analytics, and AI convergence // Future Generation Computer Systems.– September 2022.– Vol. 134.– Pp. 414–429. https://doi.org/10.1016/j.future.2022.04.014
  • Ihde N., Marten P., Eleliemy A., Poerwawinata G., Silva P., Tolovski I., Ciorba F. M., Rabl T. A survey of Big Data, High Performance Computing, and Machine Learning benchmarks, Technology Conference on Performance Evaluation and Benchmarking, Lecture Notes in Computer Science.– vol. 13169, Cham: Springer.– 2021.– ISBN 978-3-030-94436-0.– Pp. 98–118. https://doi.org/10.1007/978-3-030-94437-7_7
  • High-Performance Deep Learning Project (HiDL), NOWLAB: Network Based Computing Lab.– Ohio state university. UhtRtpL://hidl.cse.ohio-state.edu
  • High-Performance Big Data Project (HiBD), NOWLAB: Network Based Computing Lab.– Ohio state university. UhtRtpL://hidl.cse.ohio-state.edu
  • Jeon W., Ko G., Lee J., Lee H., Ha D., Ro W. W. Deep learning with GPUs // Advances in Computers.– 2021.– Vol. 122.– Pp. 167–215. https://doi.org/10.1016/bs.adcom.2020.11.003
  • Hong M., Xu L. Biren BR100 GPGPU: Accelerating Datacenter Scale AI Computing, 2022 IEEE Hot Chips 34 Symposium (HCS) (Cupertino, CA, USA).– 2022.– Pp. 1–22. https://doi.org/10.1109/HCS55958.2022.9895604
  • Shilov A. Russian Company Taps China’s Zhaoxin x86 CPU to Replace AMD, Intel CPUs, Tom’s Hardware.– New York: Future US.– 2022. hUtRtpLs://www.tomshardware.com/news/russian-company-taps-chinas-zhaoxin-x86-cpu-to-replace-amd-intel-cpus
  • Shang H., Li F., Zhang Y., Zhang L., Fu Y., Gao Y., Wu Y., Duan X., Lin R., Liu X., Liu Y., Chen D. Exreme-scale ab initio quantum Raman spectra simulations on the leadership HPC system in China // SC’21: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, New York: ACM.– November 2021.– ISBN 978-1-4503-8442-1.– id. 6.– 13 pp. https://doi.org/10.1145/3458817.3487402
  • Schneider D. The Exascale Era is Upon Us: The Frontier supercomputer may be the first to reach 1,000,000,000,000,000,000 operations per second // IEEE Spectrum.– January 2022.– Vol. 59.– No. 1.– Pp. 34–35. https://doi.org/10.1109/MSPEC.2022.9676353
  • Dongarra J., Geist A. Report on the Oak Ridge National Laboratory’s Frontier System, Technical Report ICL-UT-22-05.– Oak Ridge National Laboratory.– 2022.– Accessed 15.10.2023. hUtRtpLs://icl.utk.edu/files/publications/2022/icl-utk-1570-2022.pdf
  • Frontier Spec Sheet, Oak Ridge National Laboratory.– UT-Battelle.– 2019.– 4 pp. hUtRtpLs://www.olcf.ornl.gov/wp-content/uploads/2019/05/frontier_specsheet_v4.pdf
  • GPU nodes—LUMI-G, Hardware documentation.– LUMI (Large Unified Modern Infrastructure) consortium. UhtRtpLs://docs.lumi-supercomputer.eu/hardware/lumig/
  • Markomanolis G. S., Alpay A., Young J., Klemm M., Malaya N., Esposito A., Heikonen J., Bastrakov S., Debus A., Kluge T., Steiniger K., Stephan J., Widera R., Bussmann M. Evaluating GPU programming models for the LUMI supercomputer // Supercomputing Frontiers, Lecture Notes in Computer Science (Asian Conference on Supercomputing Frontiers).– vol. 13214, Cham: Springer.– 2022.– ISBN 978-3-031-10419-0.– Pp. 79–101. https://doi.org/10.1007/978-3-031-10419-0_6
  • Aurora, Argonne Leadership Computing Facility.– Argonne National Laboratory. hUtRtpLs://www.alcf.anl.gov/aurora
  • Peckham O. LRZ announces new phase of SuperMUC-NG Supercomputer with Intels Ponte Vecchio GPU, Tabor network.– HPCwire.– 2021. hUtRtpLs://www.hpcwire.com/2021/05/05/lrz-announces-new-phase-of-supermuc-ng-supercomputer-with-intels-ponte-vecchio-gpu/
  • Kwack J. H., Tramm J., Bertoni C., Ghadar Y., Homerding B., Rangel E., Knight C., Parker S. Evaluation of performance portability of applications and mini-apps across AMD, Intel and Nvidia GPUs // 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC) (14 November 2021, St. Louis, MO, USA).– IEEE.– 2021.– ISBN 978-1-6654-2439-4.– Pp. 45–56. https://doi.org/10.1109/P3HPC54578.2021.00008
  • HPE Cray Supercomputing EX.– Hewlett Packard Enterprise Development LP.– 2024. hUtRtpLs://www.hpe.com/psnow/doc/a00094635enw
  • Bertoni C., Parker S. Aurora overvew, ALCF SDL Workshop (October 6, 2022).– 2022.– 20 pp. hUtRtpLs://www.alcf.anl.gov/sites/default/files/2022-10/aurora_overview_bertoni_10_6_2022.pdf
  • Morgan T.P. The NVSwitch Fabric That Is The Hub Of The DGX H100 SuperPOD, The Next Platform.– Stackhouse Publishing.– 2022.
  • Ishii A., Wells R. The Nvlink-Network switch: Nvidia’s switch chip for high communication-bandwidth superpods // 2022 IEEE Hot Chips 34 Symposium (HCS) (21–23 Aug., 2022, Cupertino, CA, USA).– IEEE.– 2022.– ISBN 978-1-6654-6028-6.– Pp. 1–23. https://doi.org/10.1109/HCS55958.2022.9895480
  • Eassa A., Ishii A., Wells R. Upgrading Multi-GPU Interconnectivity with the Third-Generation Nvidia NVSwitch, Technical blog, Nvidia developer.– 2022.
  • BR100 series general purpose GPU chip.– Shanghai: Biren Technology.– 2023. hUtRtpLs://www.birentech.com/BR10X.html
  • Andersch M., Palmer G., Krashinsky R., Stam N., Mehta V., Brito G., Ramaswamy S. Nvidia Hopper Architecture In-Depth, Technical blog, Nvidia developer.– 2022. hUtRtpLs://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/
  • Alcorn P. From Opteron to Milan: Crusher Supercomputer Comes Online With New AMD CPUs and MI250X GPUs, Tom’s Hardware.– New York: Future US.– 2022. hUtRtpLs://www.tomshardware.com/news/from-opteron-to-milan-crusher-supercomputer-comes-online-with-amd-cpus-and-gpus
  • Intel Xeon CPU Max series product overview.– Intel.– 2023.– Accessed 15.10.2023. hUtRtpLs://www.intel.com/content/www/us/en/products/details/processors/xeon/max-series.html
  • Accelerator Processor Stream.– European Processor Initiative.– 2022. hUtRtpLs://www.european-processor-initiative.eu/accelerator/
  • EPI EPAC1.0 RISC-V test chip samples delivered, News.– European Processor Initiative.– 2021. hUtRtpLs://www.european-processor-initiative.eu/epi-epac1-0-risc-v-test-chip-samples-delivered/
  • Kovač M., Notton P., Hofman D., Knezović J. How Europe is preparing its core solution for exascale machines and a global, sovereign, advanced computing platform // Mathematical and Computational Applications.– 2020.– Vol. 25.– No. 3.– Pp. 46. https://doi.org/10.3390/mca25030046
  • HIP Programming Guide, Version 5.0.– 2023.– Accessed 15.10.2023. hUtRtpLs://docs.amd.com/bundle/HIP-Programming-Guide-v5.0/page/Programming_with_HIP.html
  • OpenMP Application Programming Interface, Version 5.2.– OpenMP Architecture Review Board.– 2021.– 669 pp. hUtRtpLs://www.openmp.org/wp-content/uploads/OpenMP-API-Specification-5-2.pdf
  • Khronos OpenCL Registry, Formatted specifications and other related documentation.– Khronos Group. UhtRtpLs://registry.khronos.org/OpenCL/
  • SYCL 2020 Specification, rev. 6.– Khronos Group.– 2022.– 585 pp. hUtRtpLs://registry.khronos.org/SYCL/specs/sycl-2020/pdf/sycl-2020.pdf
  • DPC++ Part 1: An Introduction to the New Programming Model.– Intel (Accessed 15.10.2023). hUtRtpLs://www.intel.com/content/www/us/en/developer/videos/dpc-part-1-introduction-to-new-programming-model.html
  • Bavarsad N. N., Makrani H. M., Sayadi H., Landis L., Rafatirad S., Homayoun H. HosNa: A DPC++ benchmark suite for heterogeneous architectures // 2021 IEEE 39th International Conference on Computer Design (ICCD) (24–27 October 2021, Storrs, CT, USA).– IEEE.– 2021.– ISBN 978-1-6654-3219-1.– Pp. 509–516. https://doi.org/10.1109/ICCD53106.2021.00084
  • Trott C., Berger-Vergiat L., Poliakoff D., Rajamanickam S., Lebrun-Grandie D., Madsen J., Al Awar N., Gligoric M., Shipman G., Womeldorff G. The Kokkos EcoSystem: comprehensive performance portability for high performance computing // Computing in Science & Engineering.– 2021.– Vol. 23.– No. 5.– Pp. 10–18. https://doi.org/10.1109/MCSE.2021.3098509
  • Trott C. R., Lebrun-Grandié D., Arndt D., Ciesko J., Dang V., Ellingwood N., Gayatri R., Harvey E., Hollman D. S., Ibanez D., Liber N., Madsen J., Miles J., Poliakoff D., Powell A., Rajamanickam S., Simberg M., Sunderland D., Turcksin B., Wilke J. Kokkos 3: Programming model extensions for the exascale era // IEEE Transactions on Parallel and Distributed Systems.– 2021.– Vol. 33.– No. 4.– Pp. 805–817. https://doi.org/10.1109/TPDS.2021.3097283
  • Moore S. The state of the LAMMPS KOKKOS package, SAND2021-9785C.– Albuquerque, NM: Sandia National Lab.– 2021.– Accessed 15.10.2023. UhtRtpLs://www.osti.gov/servlets/purl/1888676
  • Ghadar Y., Applencourt T., Homerding B., Harms K., Hammond J. SYCL Programming Model for Aurora, 2020 ECP Annual Meeting.– 2020.
  • Van Oostrum R., Chalmers N., McDougall D., Bauman P., Curtis N., Malaya N., Wolfe N. AMD GPU Hardware Basics, Frontier Application Readiness Kick-Off Workshop.– 2019.– 55 pp. hUtRtpLs://www.olcf.ornl.gov/wp-content/uploads/2019/10/ORNL_Application_Readiness_Workshop-AMD_GPU_Basics.pdf
  • Intel oneAPI GPU Optimization Guide Release 2022.3.– Intel (Accessed 15.10.2023). hUtRtpLs://www.intel.com/content/dam/develop/external/us/en/documents/oneapi-gpu-optimization-guide.pdf
  • Khudia D., Huang J., Basu P., Deng S., Liu H., Park J., Smelyanskiy M. Fbgemm: Enabling high-performance low-precision deep learning inference.– 2021.– 5 pp. arXivarXiv 2101.05615
  • Carrasco R., Vega R., Navarro C. A. Analyzing GPU tensor core potential for fast reductions // 2018 37th International Conference of the Chilean Computer Science Society (SCCC) (05–09 November 2018, Santiago, Chile).– IEEE.– 2018.– ISBN 9781538692349.– Pp. 1–6. https://doi.org/10.1109/SCCC.2018.8705253
  • Gupta G. Using Tensor Cores for Mixed-Precision Scientific Computing, Technical blog, Nvidia developer.– 2019. hUtRtpLs://developer.nvidia.com/blog/tensor-cores-mixed-precision-scientific-computing/
  • Nvidia A100 Tensor Core GPU Architecture, V1.0.– Nvidia.– 2020.– 82 pp. hUtRtpLs://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper
  • 754-2019—IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019, Revision of IEEE 754-2008.– 2019.– 84 pp. https://doi.org/10.1109/IEEESTD.2019.8766229
  • Kalamkar D., Mudigere D., Mellempudi N., Das D., Banerjee K., Avancha S., Vooturi D. T., Jammalamadaka N., Huang J., Yuen H., Yang J., Park J., Heinecke A., Georganas E., Srinivasan S., Kundu A., Smelyanskiy M., Kaul B., Dubey P. A study of BFLOAT16 for deep learning training.– 2019.– 10 pp. arXivarXiv 1905.12322
  • Stosic D., Micikevicius P. Accelerating AI Training with Nvidia TF32 Tensor Cores, Technical blog, Nvidia developer.– 2021. hUtRtpLs://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/
  • Micikevicius P., Stosic D., Burgess N., Cornea M., Dubey P., Grisenthwaite R., Ha S., Heinecke A., Judd P., Kamalu J., Mellempudi N., Oberman S., Shoeybi M., Siu M., Wu H. Fp8 formats for deep learning.– 2022.– 9 pp. arXivarXiv 2209.05433
  • Nvidia H100 Tensor Core GPU Architecture, Includes final GPU / memory clocks and final TFLOPS performance specs, V1.04.– Nvidia.– 2023.– 71 pp.
  • Sun W., Li A., Geng T., Stuijk S., Corporaal H. Dissecting tensor cores via microbenchmarks: latency, throughput and numerical behaviors // IEEE Transactions on Parallel and Distributed Systems.– 2022.– Vol. 34.– No. 1.– Pp. 246–261. https://doi.org/10.1109/TPDS.2022.3217824
  • Lehmann M., Krause M. J., Amati G., Sega M., Harting J., Gekle S. Accuracy and performance of the lattice Boltzmann method with 64-bit, 32-bit, and customized 16-bit number formats // Physical Review E.– 2022.– Vol. 106.– No. 1.– id. 015308. https://doi.org/10.1103/PhysRevE.106.015308
  • Domke J., Matsumura K., Wahib M., Zhang H., Yashima K., Tsuchikawa T., Tsuji Y., Podobas A., Matsuoka S. Double-precision FPUs in high-performance computing: an embarrassment of riches? 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (20–24 May 2019, Rio de Janeiro, Brazil).– IEEE.– 2019.– ISBN 978-1-7281-1246-6.– Pp. 78–88. https://doi.org/10.1109/IPDPS.2019.00019
  • Schade R., Kenter T., Elgabarty H., Lass M., Schütt O., Lazzaro A., Pabst H., Mohr S., Hutter J., Kühne T. D., Plessl C. Towards electronic structure-based ab-initio molecular dynamics simulations with hundreds of millions of atoms // Parallel Computing.– 2022.– Vol. 111.– id. 102920.– 11 pp. https://doi.org/10.1016/j.parco.2022.102920
  • Schade R., Kenter T., Elgabarty H., Lass M., Kühne T.D., Plessl C. Breaking the exascale barrier for the electronic structure problem in ab-initio molecular dynamics.– 2022.– 6 pp. arXivarXiv 2205.12182
  • Yu V. W., Govoni M. GPU acceleration of large-scale full-frequency GW calculations.– 2022.– 54 pp. arXivarXiv 2203.05623
  • Eriksen J. J. Efficient and portable acceleration of quantum chemical many-body methods in mixed floating point precision using OpenACC compiler directives // Molecular Physics.– 2017.– Vol. 115.– No. 17–18.– Pp. 2086–2101. https://doi.org/10.1080/00268976.2016.1271155
  • Ruda D., Turek S., Ribbrock D., Zajac P. Very fast FEM Poisson solvers on lower precision accelerator hardware, ECCOMAS Congress 2022 (5–9 June 2022, Oslo, Norway).– 2022.– 24 pp. hUtRtpLs://www.mathematik.tu-dortmund.de/lsiii/cms/papers/RudaTurekRibbrockZajac2022b.pdf
  • Ootomo H., Yokota R. Recovering single precision accuracy from Tensor Cores while surpassing the FP32 theoretical peak performance // The International Journal of High Performance Computing Applications.– 2022.– Vol. 36.– No. 4.– Pp. 475–491. https://doi.org/10.1177/10943420221090256
  • Jain A., Sharma N. Accelerated AI inference at CNN-based machine vision in ASICs: A design approach // ECS Transactions.– 2022.– Vol. 107.– No. 1.– Pp. 5165. https://doi.org/10.1149/10701.5165ecst
  • Gallet B., Gowanlock M. Computing double precision Euclidean distances using GPU tensor cores.– 2022.– 10 pp. arXivarXiv 2209.11287
  • Domke J., Vatai E., Drozd A., Chen P. T, Oyama Y., Zhang L., Salaria S., Mukunoki D., Podobas A., Wahib M. T, Matsuoka S. Matrix engines for high performance computing: A paragon of performance or grasping at straws?2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (17–21 May 2021, Portland, OR, USA).– IEEE.– 2021.– ISBN 978-1-6654-4066-0.– Pp. 1056–1065. https://doi.org/10.1109/IPDPS49936.2021.00114
  • Tan H., Yan R., Yang L., Huang L., Xiao L., Yang Q. Efficient multiple-precision and mixed-precision floating-point fused multiply-accumulate unit for HPC and AI applications // Algorithms and Architectures for Parallel Processing, 22nd International Conference ICA3PP 2022 (Copenhagen, Denmark, October 10–12, 2022), Lecture Notes in Computer Science.– vol. 13777, Cham: Springer Nature Switzerland.– 2023.– ISBN 978-3-031-22676-2.– Pp. 642–659. https://doi.org/10.1007/978-3-031-22677-9_34
  • Эксклюзивное интервью с руководителями Biren Technology: деконструкция первого 7-нм графического процессора компании, Обзор от компании MooreElite.com (Hefei).– 2022 (Китайский). hUtRtpLs://caifuhao.eastmoney.com/news/20220812093829803631950
  • Nvidia A100 Tensor Core GPU Datasheet, V1.0.– Nvidia.– 2020.– 3 pp. hUtRtpLs://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/a100-80gb-datasheet-update-
  • Choquette J., Lee E., Krashinsky R., Balan V., Khailany B. 3.2 The A100 Datacenter GPU and Ampere Architecture // 2021 IEEE International Solid-State Circuits Conference (ISSCC) (13–22 February 2021, San Francisco, CA, USA).– IEEE.– 2021.– ISBN 9781728195506.– Pp. 48–50. https://doi.org/10.1109/ISSCC42613.2021.9365803
  • Nvidia A100 tensor core GPU architecture, V1.0.– Nvidia.– 2020.– 82 pp. hUtRtpLs://resources.nvidia.com/en-us-genomics-ep/ampere-architecture-white-paper
  • Hassanpour M., Riera M., González A. A survey of near-data processing architectures for neural networks // Machine Learning and Knowledge Extraction.– 2022.– Vol. 4.– No. 1.– Pp. 66–102. https://doi.org/10.3390/make4010004
  • Gómez-Luna J., Guo Y., Brocard S., Legriel J., Cimadomo R., Oliveira G. F., Singh G., Mutlu O. An experimental evaluation of machine learning training on a real processing-in-memory system.– 2022.– 21 pp. arXivarXiv 2207.07886
  • Niu D., Li S., Wang Y., Han W., Zhang Z., Guan Y., Guan T., Sun F., Xue F., Duan L., Fang Y., Zheng H., Jiang X., Wang S., Zuo F., Wang Y., Yu B., Ren Q., Xie Y. 184QPS/W 64Mb/mm23D logic-to-DRAM hybrid bonding with process-near-memory engine for recommendation system // IEEE International Solid-State Circuits Conference (ISSCC) (20–26 February 2022, San Francisco, CA, USA).– IEEE.– 2022.– Pp. 1–3. https://doi.org/10.1109/ISSCC42614.2022.9731694
  • BiLi 106M, Product details.– Shanghai: Biren Technology.– 2020–2023. hUtRtpLs://www.birentech.com/product_details/1005557637772464128.html
  • BiLi 106B, 106C.– Shanghai: Biren Technology.– 2020–2023. hUtRtpLs://www.birentech.com/product_details/1005557844745474048.html
  • Blankenship R., Wagh M. Introducing the CXL 3.1 Specification.– Compute express link consortium.– 2022.– 27 pp. hUtRtpLs://computeexpresslink.org/wp-content/uploads/2024/03/CXL_3.1-Webinar-Presentation_Feb_2024.pdf
  • Coughlin T. Digital storage and memory // Computer.– 2022.– Vol. 55.– No. 1.– Pp. 20–29. https://doi.org/10.1109/MC.2021.3125165
  • Nvidia A100 Tensor Core GPU Datasheet.– Nvidia.– 2021.– 3 pp. hUtRtpLs://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf
  • Ampere Tuning Guide, Release 12.4.– Nvidia.– 2024.– 22 pp. hUtRtpLs://docs.nvidia.com/cuda/pdf/Ampere_Tuning_Guide.pdf
  • Server/OAI, Wiki page.– Open computers project. hUtRtpLs://www.opencompute.org/wiki/Server/OAI
  • Nvidia DGX A100, Datasheet.– Nvidia.– 2023.– 2 pp. hUtRtpLs://images.nvidia.com/aem-dam/Solutions/Data-Center/nvidia-dgx-
  • Morgan T.P. China launches the inevitable indigenous GPU, The Next Platform.– Stackhouse Publishing.– 2022. hUtRtpLs://www.nextplatform.com/2022/08/25/china-launches-the-inevitable-indigenous-gpu/
  • BIRENSUPA software development platform, Product details.– Shanghai: Biren Technology.– 2023. hUtRtpLs://www.birentech.com/product_details/1005588957219246080.html
  • MLPerf inference: datacenter benchmark suite results.– MLCommons. UhtRtpLs://mlcommons.org/en/inference-datacenter-21/
  • Reddi V. J., Cheng C., Kanter D., Mattson P., Schmuelling G., Carole-Wu J., Anderson B., Breughe M., Charlebois M., Chou W., Chukka R., Coleman C., Davis S., Deng P., Diamos G., Duke J., Fick D., Gardner J. S., Hubara I., Idgunji S., Jablin T. B., Jiao J., John T. S., Kanwar P., Lee D., Liao J., Lokhmotov A., Massa F., Meng P., Micikevicius P., Osborne C., Pekhimenko G., Rajan A. T. R., Sequeira D., Sirasao A., Sun F., Tang H., Thomson M., Wei F., Wu E., Xu L., Yamada K., Yu B., Yuan G., Zhong A., Zhang P., Zhou Y. Mlperf inference benchmark // 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA) (30 May 2020–03 June 2020, Valencia, Spain).– IEEE.– 2020.– ISBN 978-1-7281-4661-4.– Pp. 446–459. https://doi.org/10.1109/ISCA45697.2020.00045
  • Saad M. H., Hashima S., Sayed W., El-Shazly E. H., Madian A. H., Fouda M. M. Early diagnosis of COVID-19 images using optimal CNN hyperparameters // Diagnostics.– 2023.– Vol. 13.– No. 1.– id. 76. https://doi.org/10.3390/diagnostics13010076
  • Devlin J., Ming-Chang W., Lee K., Toutanova K. BERT: Pre-training of deep bidirectional transformers for language understanding // Human Language Technology: Conference of the North American Chapter of the Association of Computational Linguistics.– V. 1, NAACL-HLT 2019 (June 2–June 7, 2019, Minneapolis, Minnesota, USA).– ACL.– 2019.– ISBN 978-1-950737-13-0.– Pp. 4171–4186. hUtRtpLs://aclanthology.org/N19-1423.pdf
  • Nvidia TensorRT, an SDK for high-performance deep learning inference, Web site, Nvidia developer.– Nvidia. hUtRtpLs://developer.nvidia.com/tensorrt
  • Blythe D. The Xe GPU architecture // 2020 IEEE Hot Chips 32 Symposium (HCS) (16–18 August 2020, Palo Alto, CA, USA).– IEEE.– 2020.– ISBN 978-1-7281-7129-6.– Pp. 1–27. https://doi.org/10.1109/HCS49909.2020.9220591
  • Blythe D. XeHPC Ponte Vecchio // 2021 IEEE Hot Chips 33 Symposium (HCS) (22–24 August 2021, Palo Alto, CA, USA).– IEEE.– 2021.– ISBN 978-1-6654-1397-8.– Pp. 1–34. https://doi.org/10.1109/HCS52781.2021.9567038
  • Intel data center GPU Max series product brief.– Intel (Accessed 15.10.2023). hUtRtpLs://www.intel.com/content/dam/www/central-libraries/us/en/documents/2023-01/data-center-gpu-max-series-product-brief.pdf
  • Intel data center GPU flex series product brief.– Intel (Accessed 15.10.2023). hUtRtpLs://www.intel.com/content/dam/www/central-libraries/us/en/documents/2022-08/ats-m-product-brief-final.pdf
  • Dhote D., Virmani C., Krishna K. G., Raghav S. The science of ray tracing // International Journal of Computer Applications.– 2020.– Vol. 176.– No. 42.– Pp. 15–20. https://doi.org/10.5120/ijca2020920443
  • Intel data center GPU Max series.– Intel (Accessed 15.10.2023). hUtRtpLs://ark.intel.com/content/www/us/en/ark/products/series/232874/intel-data-center-gpu-max-series.html
  • Jiang H. Intel’s Ponte Vecchio GPU: architecture, systems and software // 2022 IEEE Hot Chips 34 Symposium (HCS) (21–23 August 2022, Cupertino, CA, USA).– IEEE.– 2022.– ISBN 978-1-6654-6028-6.– Pp. 1–29. https://doi.org/10.1109/HCS55958.2022.9895631
  • Sidorova M., Gorbushin L., Koneva N. Analytical review of electronic devices of modern supercomputing systems, Proceedings of the International Russian Automation Conference, RusAutoCon2021 (September 5-11, 2021, Sochi, Russia), Lecture Notes in Electrical Engineering.– vol. 857, Cham: Springer.– 2022.– ISBN 978-3-030-94201-4.– Pp. 25–33. https://doi.org/10.1007/978-3-030-94202-1_3
  • Tian W., Li B., Li Z., Cui H., Shi J., Wang Y., Zhao J. Using chiplet encapsulation technology to achieve processing-in-memory functions // Micromachines.– 2022.– Vol. 13.– No. 10.– Pp. 1790. https://doi.org/10.3390/mi13101790
  • Moore S. K. 3 paths to 3D processors // IEEE Spectrum.– 2022.– Vol. 59.– No. 6.– Pp. 24–29. https://doi.org/10.1109/MSPEC.2022.9792148
  • Zhang S., Li Z., Zhou H., Li R., Wang S., Kyung-Paik W., He P. Recent prospectives and challenges of 3D heterogeneous integration // e-Prime-Advances in Electrical Engineering, Electronics and Energy.– 2022.– id. 100052. https://doi.org/10.1016/j.prime.2022.100052
  • Hadidi R., Asgari B., Mudassar B. A., Mukhopadhyay S., Yalamanchili S., Kim H. Demystifying the characteristics of 3D-stacked memories: A case study for Hybrid Memory Cube // 2017 IEEE international symposium on Workload characterization (IISWC) (01–03 October 2017, Seattle, WA, USA).– IEEE.– 2017.– Pp. 66–75. https://doi.org/10.1109/IISWC.2017.8167757
  • Ma X., Wang Y., Wang Y., Cai X., Han Y. Survey on chiplets: interface, interconnect and integration methodology // CCF Transactions on High Performance Computing.– 2022.– No. 4.– Pp. 43–52. https://doi.org/10.1007/s42514-022-00093-0
  • Universal chiplet interconnect express specifications.– Universal Chiplet Interconnect Express.– 2023. https://www.uciexpress.org/specification
  • Gomes W., Koker A., Stover P., Ingerly D., Siers S., Venkataraman S., Pelto C., Shah T., Rao A., O’Mahony F., Karl E., Cheney L., Rajwani I., Jain H., Cortez R., Chandrasekhar A., Kanthi B., Koduri R. Ponte Vecchio: A multi-tile 3D stacked processor for exascale computing // 2022 IEEE International Solid-State Circuits Conference (ISSCC) (20–26 February 2022, San Francisco, CA, USA).– IEEE.– 2022.– ISBN 978-1-6654-2800-2.– Pp. 42–44. https://doi.org/10.1109/ISSCC42614.2022.9731673
  • Gomes W., Koker A., Stover P., Ingerly D., Siers S., Venkataraman S., Pelto C., Shah T., Rao A., O’Mahony F., Karl E., Cheney L., Rajwani I., Jain H., Cortez R., Chandrasekhar A., Kanthi B., Koduri R. Ponte Vecchio: A multi-tile 3D stacked processor for exascale computing, HPC user forum, Accelerated Computing Systems and Graphics Group.– 2021. https://www.hpcuserforum.com/wp-content/uploads/2021/05/Gomes_Intel_Ponte-Vecchio_Mar2022-HPC-UF.pdf
  • Intel data center GPU Max series technical overview.– Intel.– 2023 (Accessed 15.10.2023). https://www.intel.com/content/www/us/en/developer/articles/technical/intel-data-center-gpu-max-series-overview.html
  • Moore S. K. Behind Intel’s HPC chip that will pierce the exascale barrier, Blog, IEEE Spectrum.– IEEE.– 2022.
  • Ingerly D. B., Amin S., Aryasomayajula L., Balankutty A., Borst D., Chandra A., Cheemalapati K., Cook C. S., Criss R., Enamul K., Gomes W., Jones D., Kolluru K. C., Kandas A., G.-Kim S., Ma H., Pantuso D., Petersburg C. F., Phen-givoni M., Pillai A. M., Sairam A., Shekhar P., Sinha P., Stover P., Telang A., Zell Z. Foveros: 3D integration and the use of face-to-face chip stacking for logic devices // 2019 IEEE International Electron Devices Meeting (IEDM) (07–11 December 2019, San Francisco, CA, USA).– IEEE.– 2019.– ISBN 978-1-7281-4033-9.– Pp. 19.6.1-19.6.4. https://doi.org/10.1109/IEDM19573.2019.8993637
  • Mahajan R., Sankman R., Patel N., Dae-Kim W., Aygun K., Qian Z., Mekonnen Y., Salama I., Sharan S., Iyengar D., Mallik D. Embedded multi-die interconnect bridge (EMIB)–a high density, high bandwidth packaging interconnect // 2016 IEEE 66th Electronic Components and Technology Conference (ECTC) (31 May 2016–03 June 2016, Las Vegas, NV, USA).– IEEE.– 2016.– Pp. 557–565. https://doi.org/10.1109/ECTC.2016.201
  • Irani S., Hang S. K. Intel Ponte Vecchio compute accelerator OAM product and system, 2021 OCP Global Summit.– 2021. https://www.opencompute.org/events/past-events/2021-ocp-global-summit
  • Tekin A., A.Durak T., Piechurski C., Kaliszan D., Sungur F. A., Robertsén F., Gschwandtn P. State-of-the-art and trends for computing and interconnect network solutions for HPC and AI, Partnership for Advanced Computing in Europe.– PRACE.– 2021.– 38 pp. https://prace-ri.eu/wp-content/uploads/State-of-the-Art-and-Trends-for-Computing-and-Interconnect-Network-Solutions-for-HPC-and-AI-1.pdf
  • Sun W., Li A., Geng T., Stuijk S., Corporaal H. Dissecting tensor cores via microbenchmarks: latency, throughput and numerical behaviors // IEEE Transactions on Parallel and Distributed Systems.– 2023.– Vol. 34.– No. 1.– Pp. 246–261. https://doi.org/10.1109/TPDS.2022.3217824
  • Intel Products formerly Alchemist.– Intel (Accessed 15.10.2023). https://ark.intel.com/content/www/us/en/ark/products/codename/226095/products-formerly-alchemist.html
  • Watts D. Lenovo ThinkSystem and ThinkAgile GPU Summary, Product Guide.– Lenovo press.– 2024.– 71 pp. https://lenovopress.lenovo.com/lp1602.pdf
  • Liu Zh. Intel Axes Data Center GPU Max 1350, Preps New Max 1450 for ’Different Markets’, Tom’s Hardware.– New York: Future US.– 2023. https://www.tomshardware.com/news/intel-axes-data-center-gpu-max-1350-preps-max-1450-for-different-markets
  • Vuduc R., Chandramowlishwaran A., Choi J., Guney M.(E.), Shringarpure A. On the limits of GPU acceleration // Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar’10 (June 14–15, 2010, Berkeley, CA, USA), Berkeley: USENIX Association.– 2010.– id. 13.– 6 pp. https://www.usenix.org/legacy/events/hotpar10/tech/full_papers/Vuduc.pdf
  • Hanindhito B., Gourounas D., Fathi A., Trenev D., Gerstlauer A., John L. K. GAPS: GPU-acceleration of PDE solvers for wave simulation // ICS ’22: Proceedings of the 36th ACM International Conference on Supercomputing (June 28–30, 2022, Virtual Event), New York: ACM.– 2022.– ISBN 978-1-4503-9281-5.– id. 30.– 13 pp. https://doi.org/10.1145/3524059.3532373
  • Chalmers N., Mishra A., McDougall D., Warburton T. HipBone: A performance-portable GPU-accelerated C++ version of the NekBone benchmark // The International Journal of High Performance Computing Applications.– 2023.– Vol. 37.– No. 5.– Pp. 560-577. https://doi.org/10.1177/10943420231178552
Review article