Modern server ARM processors for supercomputers: A64FX and others. Initial benchmark performance data
Author: Mikhail Borisovich Kuzminsky
Journal: Program Systems: Theory and Applications @programmnye-sistemy
Section: Software and hardware for distributed and supercomputer systems
Article in issue: No. 1 (52), vol. 13, 2022.
Free access
This paper gives a comparative analysis of the performance of server ARM processors that are used in supercomputers or are aimed, in particular, at high-performance computing (HPC). Fujitsu A64FX, Marvell ThunderX2, and Huawei Kunpeng 920 were selected for this initial performance analysis. The HPC performance review focuses primarily on benchmarks and applications for the A64FX, which supports longer vectors than other ARM processors and has a higher peak performance. A64FX performance is compared with the corresponding data for Intel Xeon Skylake and Cascade Lake, for AMD EPYC with Zen 2 and Zen 3 (Rome and Milan), and for the Nvidia V100 and A100 GPUs. A brief set of potential strengths and weaknesses of the A64FX microarchitecture is formulated. Performance data obtained with different compilers for the A64FX are compared. Criteria are formulated for when the A64FX usually has a performance advantage over x86-64 and when it loses to x86-64. The analysis confirms that the use of the A64FX in supercomputers may continue to grow. The dominance of x86-64 in HPC may decline, in part through wider adoption of server ARM processors. However, the analysis of the A64FX and of the new AArch64 processors expected in the near future shows that the A64FX will not necessarily lead this process.
Arm, AArch64, A64FX, x86-64, high-performance computing, supercomputer, performance benchmarks
Short URL: https://sciup.org/143178557
IDR: 143178557
References for: Modern server ARM processors for supercomputers: A64FX and others. Initial benchmark performance data
- A. Tekin, Tuncer Durak A., Piechurski C., Kaliszan D., Aylin Sungur F., Robertsén F., Gschwandtner P. State-of-the-art and trends for computing and interconnect network solutions for HPC and AI, Technical report: PRACE.– 2021.– 38 pp. https://prace-ri.eu/wp-content/uploads/State-of-the-Art-and-Trends-for-Computing-and-Interconnect-Network-Solutions-for-HPC-and-AI-1.pdf
- Liabotis I. EINFRA-4-2014: Pan-European high performance computing infrastructure and services, PRACE-4IP-EINFRA-653838: PRACE.– 2017.– 71 pp. https://prace-ri.eu/wp-content/uploads/4IP-D5.2-1.pdf
- Edwards C. Moore’s Law: what comes next? // Communications of the ACM.– 2021.– Vol. 64.– No. 2.– pp. 12–14. https://doi.org/10.1145/3440992
- Domke J., Vatai E., Drozd A., Chen P., Oyama Y., Zhang L., Salaria S., Mukunoki D., Podobas A., Wahib M., Matsuoka S. Matrix engines for high performance computing: A paragon of performance or grasping at straws?, 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (17–21 May 2021, Portland, OR, USA).– 2021.– pp. 1056–1065. https://doi.org/10.1109/IPDPS49936.2021.00114
- Arima E., Kodama Y., Odajima T., Tsuji M., Sato M. Power/Performance/Area Evaluations for Next-Generation HPC Processors using the A64FX Chip, 2021 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS) (14–16 April 2021, Tokyo, Japan).– 2021.– pp. 1–6. https://doi.org/10.1109/COOLCHIPS52128.2021.9410320
- M. S. Gordon, Barca G., S. S. Leang, Poole D., A. P. Rendell, J. L. Galvez Vallejo, Westheimer B. Novel computer architectures and quantum chemistry // The Journal of Physical Chemistry A.– 2020.– Vol. 124.– No. 23.– pp. 4557–4582. https://doi.org/10.1021/acs.jpca.0c02249
- Calore E., Gabbana A., S. F. Schifano, Tripiccione R. ThunderX2 performance and energy-efficiency for HPC workloads // Computation.– 2020.– Vol. 8.– No. 1.– pp. 20. https://doi.org/10.3390/computation8010020
- Tiwari A., Keipert K., Jundt A., Peraza J., S. S. Leang, Laurenzano M., M. S. Gordon, Carrington L. Performance and energy efficiency analysis of 64-bit ARM using GAMESS // Proceedings of the 2nd International Workshop on Hardware-Software Co-Design for High Performance Computing, Co-HPC ’15 (15 November 2015, Austin, Texas, USA), New York:ACM.– 2015.– ISBN 978-1-4503-3992-6.– 10 pp. https://doi.org/10.1145/2834899.2834905
- Keipert K., Mitra G., Sunriyal V., S. S. Leang, Sosonkina M., A. P. Rendell, M. S. Gordon Energy-efficient computational chemistry: Comparison of x86 and ARM systems // Journal of Chemical Theory and Computation.– 2015.– Vol. 11.– No. 11.– pp. 5055–5061. https://doi.org/10.1021/acs.jctc.5b00713
- Saastad O. W., Kapanova K., Markov S., Morales C., Shamakina A., Johnson N., Krishnasamy E., Varrette S. Best Practice Guide Modern Processors: PRACE.– 2020.– 109 pp. https://prace-ri.eu/wp-content/uploads/Best-Practice-Guide-Modern-Processors-Accelerators.pdf
- Antonov A. S., Afanasyev I. V., Voevodin Vl. V. High-performance computing platforms: current status and development trends // Numerical Methods and Programming.– 2021.– Vol. 22.– No. 2.– pp. 135–177 (in Russian). https://doi.org/10.26089/NumMet.v22r210
- Xia J., Cheng C., Zhou X., Hu Y., Chun P. Kunpeng 920: The first 7nm chiplet-based 64-Core ARM SoC for cloud services // IEEE Micro.– 2021.– Vol. 41.– No. 5.– pp. 67–75. https://doi.org/10.1109/MM.2021.3085578
- Dongarra J. Report on the Fujitsu Fugaku system, Tech Report No ICL-UT-20-06: University of Tennessee, Innovative Computing Laboratory.– 2020.– 18 pp. https://www.icl.utk.edu/files/publications/2020/icl-utk-1379-2020.pdf
- Zhang W., Jiang Z., Chen Z., Xiao N., Ou Y. NUMA-Aware DGEMM based on 64-bit ARMv8 multicore processors architecture // Electronics.– 2021.– Vol. 10.– No. 16.– pp. 1984. https://doi.org/10.3390/electronics10161984
- Frolov V., Galaktionov V., Sanzharov V. RISC-V: the standard that changed the microprocessor world // Open Systems. DBMS.– 2020.– No. 2.– pp. 30–34 (in Russian). https://www.osp.ru/os/2020/02/13055471
- Kuzminsky M. Power10: the revival of RISC // Open Systems. DBMS.– 2021.– No. 3.– pp. 10–12 (in Russian). https://www.osp.ru/os/2021/03/13055989
- Jiang L., Yang C., W. Ma Enabling highly efficient batched matrix multiplications on SW26010 many-core processor // ACM Transactions on Architecture and Code Optimization.– 2020.– Vol. 17.– No. 1.– pp. 1–23, 2. https://doi.org/10.1145/3378176
- Kuzminsky M. The Chinese processor and supercomputer path // Open Systems. DBMS.– 2017.– No. 1.– pp. 8–11 (in Russian). https://www.osp.ru/os/2017/01/13051592
- Kuzminsky M. ARM for HPC: has the time come? // Open Systems. DBMS.– 2020.– No. 2.– pp. 12–15 (in Russian). https://www.osp.ru/os/2020/02/13055475
- Ouro P., Lopez-Novoa U., Guest M. F. On the performance of highly-scalable Computational Fluid Dynamics code on AMD, ARM and Intel processor-based HPC systems // Computer Physics Communications.– 2021.– Vol. 269, 108105. https://doi.org/10.1016/j.cpc.2021.108105
- McIntosh-Smith S., Price J., Deakin T., Poenaru A. A performance analysis of the first generation of HPC-optimized Arm processors // Concurrency and Computation: Practice and Experience.– 2019.– Vol. 31.– No. 16, e5110. https://doi.org/10.1002/cpe.5110
- Stephens N., Biles S., Boettcher M., Eapen J., Eyole M., Gabrielli G., Horsnell M., Magklis G., Martinez A., Premillieu N., Reid A., Rico A., Walker P. The ARM scalable vector extension // IEEE Micro.– 2017.– Vol. 37.– No. 2.– pp. 26–39. https://doi.org/10.1109/MM.2017.35
- Soria-Pardos V., Armejach A., Suárez D., Moretó M. On the use of manycore Marvell ThunderX2 processor for HPC workloads // The Journal of Supercomputing.– 2021.– Vol. 77.– No. 4.– pp. 3315–3338. https://doi.org/10.1007/s11227-020-03397-6
- Sugumar R. ThunderX3 next-generation arm-based server, 2020 IEEE Hot Chips 32 Symposium (HCS) (16–18 Aug., 2020, Palo Alto, CA, USA).– 2020.– pp. 1–19. https://doi.org/10.1109/HCS49909.2020.9220418
- Sugumar R., Shah M., Ramirez R. Marvell ThunderX3: next-generation arm-based server processor // IEEE Micro.– 2021.– Vol. 41.– No. 2.– pp. 15–21. https://doi.org/10.1109/MM.2021.3055451
- Gao W., Fang J., Huang C., Xu C., Wang Z. Optimizing barrier synchronization on ARMv8 many-core architectures, 2021 IEEE International Conference on Cluster Computing (CLUSTER) (7–10 Sept. 2021, Portland, OR, USA).– pp. 542-552. https://doi.org/10.1109/Cluster48925.2021.00044
- All SPEC CPU2017 results published by SPEC. https://www.spec.org/cpu2017/results/cpu2017.html
- McCalpin J. D. HPL and DGEMM performance variability on the Xeon Platinum 8160 processor, SC18: International Conference for High Performance Computing, Networking, Storage and Analysis (11–16 Nov. 2018, Dallas, TX, USA).– pp. 225–237. https://doi.org/10.1109/SC.2018.00021
- Kruzhilov I. S., Kuzminsky M. B., Chernetsov A. M., Shamaeva O. Yu. Basic linear algebra libraries for high-performance computations // Vestnik MEI.– 2018.– No. 6.– pp. 87–95 (in Russian). https://doi.org/10.24160/1993-6982-2018-6-87-95
- Álvarez-Farré X., Gorobets A., Trias F., Oliva A. NUMA-aware strategies for the heterogeneous execution of SPMV on modern supercomputers, 14th WCCM-ECCOMAS Congress 2020 (19–24 July 2020, Paris, France).– 10 pp. https://www.scipedia.com/public/Alvarez-Farre_et_al_2021a
- Mahmoud M., Hoffmann M., Reza H. Developing a new storage format and a warp-based SpMV kernel for configuration interaction sparse matrices on the GPU // Computation.– 2018.– Vol. 6.– No. 3.– pp. 45. https://doi.org/10.3390/computation6030045
- Zhang Y., Yang W., Li K., Tang D., Li K. Performance analysis and optimization for SpMV based on aligned storage formats on an ARM processor // Journal of Parallel and Distributed Computing.– 2021.– Vol. 158.– pp. 126–137. https://doi.org/10.1016/j.jpdc.2021.08.002
- Afanasyev I., Lichmanov D. Evaluating the performance of Kunpeng 920 processors on modern HPC applications // PaCT 2021: Parallel Computing Technologies, International Conference on Parallel Computing Technologies, Lecture Notes in Computer Science.– vol. 12942, Cham:Springer.– 2021.– ISBN 978-3-030-86359-3.– pp. 301–321. https://doi.org/10.1007/978-3-030-86359-3_23
- Sato M., Ishikawa Y., Tomita H., Kodama Y., Odajima T., Tsuji M., Yashiro H., Aoki M., Shida N., Miyoshi I., Hirai K., Furuya A., Asato A., Morita K., Shimizu T. Co-Design for A64FX manycore processor and “Fugaku” // SC ’20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (9–19 November, 2020, Atlanta, Georgia, USA).– 2020.– ISBN 978-1-7281-9998-6.– pp. 1–15. https://dl.acm.org/doi/abs/10.5555/3433701.3433763
- Okazaki R., Tabata T., Sakashita S., Kitamura K., Takagi N., Sakata H., Ishibashi T., Nakamura T., Ajima Y. // Fujitsu Technical Review 2020-03.– 2020.– 9 pp. https://www.fujitsu.com/global/documents/about/resources/publications/technicalreview/2020-03/article03.pdf
- Perks O. AEM and SVE, International Workshop on High-Performance Computing and Programming on Quantum Chemistry and Physics 2020 (HPCPQCP2020).– 2020.– 68 pp. https://www.stonybrook.edu/commcms/ookami/support/_docs/Arm_SVE_reduced.pdf
- Odajima T., Kodama Y., Sato M. Performance and power consumption analysis of ARM scalable vector extension // The Journal of Supercomputing.– 2021.– Vol. 77.– No. 6.– pp. 5757–5778. https://doi.org/10.1007/s11227-020-03495-5
- Sato M., Odajima T., Kodama Y. Performance evaluation of the supercomputer “Fugaku” and A64FX manycore processor, ScalA workshop 2020 (12th Nov, 2020).– 27 pp. https://www.csm.ornl.gov/srt/conferences/Scala/2020/keynote_1.pdf
- Burford A., Calder A. C., Carlson D., Chapman B., Coşkun F., Curtis T., Feldman C., Harrison R. J., Kang Y., Michalowicz B., Raut E., Siegmann E., Wood D. G., Deleon R. L., Jones M., Simakov N. A., White J. P., Oryspayev D. Ookami: deployment and initial experiences.– 2021. arXiv:2106.08987
- Huang H. et al. Shuhai: a tool for benchmarking High-Bandwidth memory on FPGAs // IEEE Transactions on Computers.– Vol. 71.– No. 5.– pp. 1133–1144.– 12 pp. https://doi.org/10.1109/TC.2021.3075765 https://wangzeke.github.io/doc/Shuhai_TC21.pdf
- Langarita Benitez R. Evaluation of genome alignment workflows on HPC processors, Master thesis: Universitat Politècnica de Catalunya.– 2021.– 69 pp. https://upcommons.upc.edu/bitstream/handle/2117/343340/157590.pdf?sequence=1&isAllowed=y
- Alappat C., Laukemann J., Gruber T., Hager G., Wellein G., Meyer N., Wettig T. Performance modeling of streaming kernels and sparse matrix-vector multiplication on A64FX, 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) (12 Nov. 2020, Atlanta, GA, USA).– pp. 1–7. https://doi.org/10.1109/PMBS51919.2020.00006
- Alappat C., Meyer N., Laukemann J., Gruber T., Hager G., Wellein G., Wettig T. ECM modeling and performance tuning of SpMV and Lattice QCD on A64FX.– 2021. arXiv:2103.03013
- Li L., Pandey S., Flynn T., Liu H., Wheeler N., Hoisie A. SimNet: computer architecture simulation using machine learning.– 2021. arXiv:2105.05821
- Schreier J. Optimization of small matrix multiplication kernels on Arm, Bachelor’s Thesis in Informatics: Technische Universität München.– 2021.– 47 pp. https://mediatum.ub.tum.de/doc/1601278/zbkvhwtaatdlnckwxjkk40h5q.pdf
- Koo D., Lee J., Liu J., Byun E.-K., Kwak J.-H., Lockwood G. K., Hwang S., Antypas K., Wu K., Eom H. An empirical study of I/O separation for burst buffers in HPC systems // Journal of Parallel and Distributed Computing.– 2021.– Vol. 148.– pp. 96–108. https://doi.org/10.1016/j.jpdc.2020.10.007
- Anzt H., Tsai Y. M., Abdelfattah A., Cojean T., Dongarra J. Evaluating the performance of NVIDIA’s A100 Ampere GPU for sparse and batched computations, 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) (12 Nov. 2020, Atlanta, GA, USA).– pp. 26–38. https://doi.org/10.1109/PMBS51919.2020.00009
- Zhang L., Okamoto T., Ishii S., Hirai K., Sumimoto S., Gerofi B., Takagi M., Ishikawa Y. OS enhancement in supercomputer Fugaku, Fujitsu Technical Review.– 2020.– 7 pp. https://www.fujitsu.com/global/documents/about/resources/publications/technicalreview/2020-03/article06.pdf
- Domke J., Matsumura K., Wahib M., Zhang H., Yashima K., Tsuchikawa T., Tsuji Y., Podobas A., Matsuoka S. Double-precision FPUs in High-Performance Computing: an embarrassment of riches?, 2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (20–24 May 2019, Rio de Janeiro, Brazil).– pp. 78–88. https://doi.org/10.1109/IPDPS.2019.00019
- Domke J. A64FX–Your Compiler You Must Decide!.– 2021. arXiv:2107.07157
- Ishikawa K. I., Kanamori I., Matsufuru H., Miyoshi I., Mukai Y., Nakamura Y., Nitadori K., Tsuji M. 102 PFLOPS lattice QCD quark solver on Fugaku.– 2021. arXiv:2109.10687
- Michalowicz B., Raut E., Kang Y., Curtis T., Chapman B., Oryspayev D. Comparing OpenMP implementations with applications across A64FX platforms.– 2021. arXiv:2107.10346
- Meyer N., Georg P., Solbrig S., Wettig T. Grid on QPACE 4, The 38th International Symposium on Lattice Field Theory (26–30 Jul., 2021).– 1 pp. https://indico.cern.ch/event/1006302/contributions/4378496/attachments/2279633/3873128/poster.pdf
- Meyer N. Grid Lattice QCD framework on A64FX, R-CCS seminar (online) (June 30, 2021, University of Regensburg, Regensburg, Germany).– 2021.– 29 pp. https://www.r-ccs.riken.jp/labs/ftrt/slides/20210630_meyer.pdf
- Arm Ltd SVE compilers and libraries, Arm SVE Hackathon on Ookami, February 2021.– 2019.– 41 pp. https://www.stonybrook.edu/commcms/ookami/support/_docs/4%20-%20SVE%20Compilers%20and%20Libraries.pdf
- T. Kolev, Fischer P., Austin A.P., Barker A. T., Beams N., Brown J., Camier J.-S., Chalmers N., Dobrev V., Dudouit Y., Ghaffari L., Kerkemeier S., Lan Y.-H., Merzari E., Min M., Pazner W., Rathnayake T., Shephard M. S., Siboni M. H., Smith C. W., Thompson J. L., Tomov S., Warburton T. High-order algorithmic developments and optimizations for large-scale GPU-accelerated simulations, ECP Milestone Report WBS 2.2.6.06, Milestone CEED-MS36: US Department of Energy.– 2021.– 51 pp.
- Harrison R. J. Performance engineering on A64FX with SVE intrinsics (Early experience on Ookami).– 2021.– 37 pp. https://www.stonybrook.edu/commcms/ookami/support/_docs/RJHACMCF21.pdf
- Michalowicz B., Raut E., Kang Y., Curtis T., Chapman B., Oryspayev D. Comparing the behavior of OpenMP Implementations with various Applications on two different Fujitsu A64FX platforms.– 2021. arXiv:2106.09787
- Bari M. A. S., Chapman B., Curtis A., Harrison R. J., Siegmann E., N. A. Simakov, M. D. Jones A64FX performance: experience on Ookami, 2021 IEEE International Conference on Cluster Computing (CLUSTER) (7–10 Sept. 2021, Portland, OR, USA).– pp. 711-718. https://doi.org/10.1109/Cluster48925.2021.00106
- Meng J., Atle A., Calandra H., Araya-Polo M. Minimod: A finite difference solver for seismic modeling.– 2020. arXiv:2007.06048
- Bailey D., Harris T., Saphir W., van der Wijngaart R., Woo A., Yarrow M. The NAS parallel benchmarks 2.0, Report NAS-95-020: NASA Ames Research Center.– 1995.– 24 pp. https://www.nas.nasa.gov/assets/pdf/techreports/1995/nas-95-020.pdf
- Murai H. Overview of software environment on Fugaku, 6th Meeting for Application Code Tuning on A64FX Computer Systems (June 30, 2021).– 17 pp. https://www.hpci-office.jp/invite2/documents2/meeting_A64FX_210630/A64FX.pdf
- Hammond S., Curry M., Davis K., Dang V.-Q., Guba O., Hoekstra R., Laros J., Pedretti K., Poliakoff D., Rajamanickam S., Trott C., Vergiat-Berger L., Younge A. Fugaku and A64FX Update.– 2021.– 15 pp. https://cfwebprod.sandia.gov/cfdocs/CompResearch/docs/snl-fugaku-update-20210420-final_2.pdf
- Xu R.-Q. G., Okubo T., Todo S., Imada M. Optimized implementation for calculation and fast-update of Pfaffians installed to the open-source fermionic variational solver mVMC.– 2021. arXiv:2105.13098
- Van Zee F. G., van de Geijn R. A. BLIS: a framework for rapidly instantiating BLAS functionality // ACM Transactions on Mathematical Software.– 2015.– Vol. 41.– No. 3.– pp. 1–33, 14. https://doi.org/10.1145/2764454
- Imamura T. Development of EigenExa from K to Fugaku, and beyond Fugaku, The 4th Meeting for Application Code Tuning on A64FX Computer Systems (March 17, 2021).– 2021.– 24 pp. https://www.hpci-office.jp/invite2/documents2/meeting_A64FX_210317/Fugaku-tuning-seminar-imamura_20210317.pdf
- Shibata N., Petrogalli F. SLEEF: A portable vectorized library of C standard mathematical functions // IEEE Transactions on Parallel and Distributed Systems.– 2019.– Vol. 31.– No. 6.– pp. 1316–1327. https://doi.org/10.1109/TPDS.2019.2960333
- Feldman C., Michalowicz B., Calder A. Lessons Learned. An In-Depth Look at Running FLASH on Ookami.– 30 pp. https://www.stonybrook.edu/commcms/ookami/support/_docs/ACM_Slides_FLASH_2021.pdf
- Ruhela A., Xu S., Manian K. V., Subramoni H., Panda D. K. Analyzing and understanding the impact of interconnect performance on HPC, Big Data, and deep learning applications: a case study with InfiniBand EDR and HDR, 2020 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (18–22 May 2020, New Orleans, LA, USA).– pp. 869–878. https://doi.org/10.1109/IPDPSW50202.2020.00147
- Trott C., Lebrun-Grandié D., Arndt D., Ciesko J., Dang V., Ellingwood N., Gayatri R., Harvey E., Hollman D. S., Ibanez D., Liber N., Madsen J., Miles J., Poliakoff D., Powell A., Rajamanickam S., Simberg M., Sunderland D., Turcksin B., Wilke J. Kokkos 3: Programming model extensions for the exascale era // IEEE Transactions on Parallel and Distributed Systems.– 2022.– Vol. 33.– No. 4.– pp. 805–817. https://doi.org/10.1109/TPDS.2021.3097283
- Kudo S., Nitadori K., Ina T., Imamura T. Implementation and numerical techniques for one EFlop/s HPL-AI benchmark on Fugaku, 2020 IEEE/ACM 11th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) (13 Nov. 2020, Atlanta, GA, USA).– pp. 69–76. https://doi.org/10.1109/ScalA51936.2020.00014
- Kudo S., Nitadori K., Ina T., Imamura T. Prompt report on Exa-scale HPL-AI benchmark, 2020 IEEE International Conference on Cluster Computing (CLUSTER) (14–17 Sept. 2020, Kobe, Japan).– pp. 418–419. https://doi.org/10.1109/CLUSTER49012.2020.00058
- Eisymont L., Frolov A., Semenov A. Graph500: an adequate ranking // Open Systems. DBMS.– 2011.– No. 7.– pp. 14–17 (in Russian). https://www.osp.ru/os/2011/01/13006961
- Nakao M., Ueno K., Fujisawa K., Kodama Y., Sato M. Performance evaluation of supercomputer Fugaku using breadth-first search benchmark in Graph500, 2020 IEEE International Conference on Cluster Computing (CLUSTER) (14–17 Sept. 2020, Kobe, Japan).– pp. 408–409. https://doi.org/10.1109/CLUSTER49012.2020.00053
- Nakamura Y. Software development and performance of Fugaku and ARM architectures, The 38th International Symposium on Lattice Field Theory (26–30 July 2021).– 2021.– 9 pp. https://indico.cern.ch/event/1006302/contributions/4366845/attachments/2289133/3892993/lattice2021.pdf
- Ishikawa K. I., Kanamori I., Matsufuru H., Miyoshi I., Mukai Y., Nakamura Y., Nitadori K., Tsuji M. 102 PFLOPS Lattice QCD quark solver on Fugaku.– 2021. arXiv:2109.10687
- Yashiro H., Koji T., Yuta K., Shuhei K., Takemasa M., Toshiyuki I., Kazuo M., Masuo N., Chihiro K., Masaki S., Hirofumi T. The NICAM 3.5 km-1024 ensemble simulation: Performance optimization and scalability of NICAM-LETKF on supercomputer Fugaku, vEGU21, the 23rd EGU General Assembly (online 19–30 April, 2021), EGU21-4771. https://doi.org/10.5194/egusphere-egu21-4771 https://ui.adsabs.harvard.edu/abs/2021EGUGA..23.4771Y/abstract
- Dongarra J., Hammarling S., Higham N. J., Relton S. D., Valero-Lara P., Zounon M. The design and performance of batched BLAS on modern high-performance computing systems // Procedia Computer Science.– 2017.– Vol. 108.– pp. 495–504. https://doi.org/10.1016/j.procs.2017.05.138
- Shimizu T. Supercomputer Fugaku: Co-designed with application developers/researchers, 2020 IEEE Asian Solid-State Circuits Conference (A-SSCC) (9–11 Nov. 2020, Hiroshima, Japan).– pp. 1–4. https://doi.org/10.1109/A-SSCC48613.2020.9336127
- Deakin T., Price J., Martineau M., McIntosh-Smith S. GPU-STREAM v2.0: Benchmarking the achievable memory bandwidth of many-core processors across diverse parallel programming models // ISC High Performance 2016: High Performance Computing, International Conference on High Performance Computing, Lecture Notes in Computer Science.– vol. 9945, Cham:Springer.– 2016.– ISBN 978-3-319-46079-6.– pp. 489–507. https://doi.org/10.1007/978-3-319-46079-6_34
- McVoy L. W., Staelin C. lmbench: portable tools for performance analysis, ATEC ’96: Proceedings of the 1996 annual conference on USENIX Annual Technical Conference (22–26 January, 1996, San Diego, CA, USA).– 17 pp. https://www.usenix.org/legacy/publications/library/proceedings/sd96/full_papers/mcvoy.pdf
- Gupta N., Ashiwal R., Brank B., Peddoju S. K., Pleiter D. Performance evaluation of parallex execution model on arm-based platforms, 2020 IEEE International Conference on Cluster Computing (CLUSTER) (14–17 Sept. 2020, Kobe, Japan).– pp. 567–575. https://doi.org/10.1109/CLUSTER49012.2020.00080
- Odajima T., Kodama Y., Tsuji M., Matsuda M., Maruyama Y., Sato M. Preliminary performance evaluation of the Fujitsu A64FX using HPC applications, 2020 IEEE International Conference on Cluster Computing (CLUSTER) (14–17 Sept. 2020, Kobe, Japan).– pp. 523–530. https://doi.org/10.1109/CLUSTER49012.2020.00075
- Ltaief H., Cranney J., Gratadour D., Hong Y., Gatineau L., Keyes D. E. Meeting the real-time challenges of ground-based telescopes using low-rank matrix computations.– 2021. https://repository.kaust.edu.sa/handle/10754/669813
- Kuzminsky M. Vector processors versus accelerators // Open Systems. DBMS.– 2018.– No. 1.– pp. 10–10 (in Russian). https://www.osp.ru/os/2018/01/13053934
- Brank B., Nassyr S., Pouyan F., Pleiter D. Porting applications to Arm-based processors, 2020 IEEE International Conference on Cluster Computing (CLUSTER) (14–17 Sept. 2020, Kobe, Japan).– pp. 559–566. https://doi.org/10.1109/CLUSTER49012.2020.00079
- Alappat C., Meyer N., Laukemann J., Gruber T., Hager G., Wellein G., Wettig T. Execution-Cache-Memory modeling and performance tuning of sparse matrix-vector multiplication and Lattice quantum chromodynamics on A64FX // Concurrency and Computation: Practice and Experience, e6512 (to appear). https://doi.org/10.1002/cpe.6512
- Alappat C., Laukemann J., Gruber T., Hager G., Wellein G., Meyer N., Wettig T. Performance modeling of streaming kernels and sparse matrix-vector multiplication on A64FX, 2020 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) (12 Nov., 2020, Atlanta, GA, USA).– pp. 1–7. https://doi.org/10.1109/PMBS51919.2020.00006
- Jackson A., Weiland M., Brown N., Turner A., Parsons M. Investigating applications on the A64FX, 2020 IEEE International Conference on Cluster Computing (CLUSTER) (14–17 Sept. 2020, Kobe, Japan).– pp. 549–558. https://doi.org/10.1109/CLUSTER49012.2020.00078
- McMahon F. H. Livermore Fortran kernels: A computer test of numerical performance range, Technical Report UCRL-53745: Lawrence Livermore National Laboratory.– 1986.– 204 pp.
- Luszczek P. R., Bailey D. H., Dongarra J. J., Kepner J., Lucas R. F., Rabenseifner R., Takahashi D. The HPC Challenge (HPCC) benchmark suite, SC ’06: International Conference for High Performance Computing, Networking, Storage and Analysis (11–17 Nov., 2006, Tampa, Florida, USA).– 2006.– pp. 213. https://doi.org/10.1145/1188455.1188677
- Jin H., Frumkin M., Yan J. The OpenMP implementation of NAS parallel benchmarks and its performance, NAS Technical Report NAS-99-011.– 1999.– 26 pp. https://www.nas.nasa.gov/assets/pdf/techreports/1999/nas-99-011.pdf
- Li L., Pandey S., Flynn T., Liu H., Wheeler N., Hoisie A. SimNet: accurate and high-performance computer architecture simulation using machine learning.– 2021. arXiv:2105.05821
- Kodama Y., Kondo M., Sato M. Evaluation of SPEC CPU and SPEC OMP on the A64FX, 2021 IEEE International Conference on Cluster Computing (CLUSTER) (7–10 Sept. 2021, Portland, OR, USA).– pp. 553–561. https://doi.org/10.1109/Cluster48925.2021.00088
- Nakao M., Ueno K., Fujisawa K., Kodama Y., Sato M. Performance of the supercomputer Fugaku for breadth-first search in Graph500 benchmark // ISC High Performance 2021: High Performance Computing, International Conference on High performance Computing, Lecture Notes in Computer Science.– vol. 12728, Cham:Springer.– 2021.– ISBN 978-3-030-78713-4.– pp. 372–390. https://doi.org/10.1007/978-3-030-78713-4_20
- Nakao M. Performance tuning of Graph500 benchmark on supercomputer Fugaku, The first Meeting for Application Code Tuning on A64FX Computer Systems (9 December, 2020).– 23 pp. https://www.hpci-office.jp/invite2/documents2/meeting_A64FX_201209/Graph500.pdf
- Bramas B. A fast vectorized sorting implementation based on the ARM scalable vector extension (SVE).– 2021. arXiv:2105.07782
- Furuya A., Asami A. Fujitsu regarding OSS porting for Tomitake/FX1000-RIST co-creation (in Japanese).– 12 pp. https://www.hpci-office.jp/invite2/documents2/ws_cae_210312_furuya.pdf
- Zempo Y., Akino N., Ishida M., Tomiyama E., Yamamoto H. Real-time and real-space program tuned in K-computer // Journal of Physics: Conference Series.– 2015.– Vol. 640.– No. 1, 012066. https://doi.org/10.1088/1742-6596/640/1/012066
- Nakajima T., Katouda M., Kamiya M., Nakatsuka Y. NTChem: A high-performance software package for quantum molecular simulation // International Journal of Quantum Chemistry.– 2015.– Vol. 115.– No. 5.– pp. 349–359. https://doi.org/10.1002/qua.24860
- Sato M., Murai H., Nakao M., Tsugane K., Odajima T., Lee J. XcalableMP 2.0 and future directions // XcalableMP PGAS Programming Language, ed. M. Sato, Singapore:Springer.– 2021.– ISBN 978-981-15-7683-6.– pp. 245–262. https://doi.org/10.1007/978-981-15-7683-6_10
- Poenaru A., Lin W.C., McIntosh-Smith S. A performance analysis of modern parallel programming models using a compute-bound application // ISC High Performance 2021: High Performance Computing, International Conference on High Performance Computing, Lecture Notes in Computer Science.– vol. 12728, Cham:Springer.– 2021.– ISBN 978-3-030-78713-4.– pp. 332–350. https://doi.org/10.1007/978-3-030-78713-4_18
- Watanabe K. Performance tuning on LAMMPS for A64FX system, The 5th Meeting for Application Code Tuning on A64FX Computer Systems (27 April, 2021).– 2021.– 27 pp. hUtRtpLs://www.hpci-office.jp/invite2/documents2/meeting_A64FX_210427/lmp_tune_for_a64fx_27Apr2021_final.pdf
- Huber J., Wei W., Georgakoudis G., Doerfert J., Hernandez O. A case study of LLVM-based analysis for optimizing SIMD code generation // IWOMP 2021: OpenMP: Enabling Massive Node-Level Parallelism, International Workshop on OpenMP, Lecture Notes in Computer Science.– vol. 12870, Cham:Springer.– 2021.– ISBN 978-3-030-85262-7.– pp. 142–155. https://doi.org/10.1007/978-3-030-85262-7_10
- Kanamori I., Ishikawa K. I., Matsufuru H. Object-oriented implementation of algebraic multi-grid solver for lattice QCD on SIMD architectures and GPU clusters // ICCSA 2021: Computational Science and Its Applications, International Conference on Computational Science and Its Applications, Lecture Notes in Computer Science.– vol. 12953, Cham:Springer.– 2021.– ISBN 978-3-030-86976-2.– pp. 218–233. https://doi.org/10.1007/978-3-030-86976-2_15
- Nakamura Y., Mukai Y., Ishikawa K.-I., Kanamori I. Lattice quantum chromodynamics simulation library for Fugaku and computers with wide SIMD. https://github.com/RIKEN-LQCD/qws
- Cielo S., Porth O., Iapichino L., Karmakar A., Olivares H., Xia C. Optimizing the hybrid parallelization of BHAC.– 2021. arXiv:2108.12240
- Bird R., Tan N., Luedtke S. V., Harrell S.-L., Taufer M., Albright B. VPIC 2.0: next generation particle-in-cell simulations.– 2021. arXiv:2102.13133
- Klöwer M., Hatfield S., Croci M., Düben P. D., Palmer T. N. Fluid simulations accelerated with 16 bit: Approaching 4x speedup on A64FX by squeezing ShallowWaters.jl into Float16: ESSOAr.– 2021.– 26 pp. https://doi.org/10.1002/essoar.10507472.2 https://www.essoar.org/pdfjs/10.1002/essoar.10507472.2
- Ferenbaugh C. R. PENNANT: an unstructured mesh mini-app for advanced architecture research // Concurrency and Computation: Practice and Experience.– 2015.– Vol. 27.– No. 17.– pp. 4555–4572. https://doi.org/10.1002/cpe.3422
- Nukariya A., Akao K., Takahashi J., Fukumoto N., Kawakami K., Kuroda A., Minami K., Sato K., Matsuoka S. HPC and AI initiatives for supercomputer Fugaku and future prospects, Fujitsu Technical Review.– 6 pp. https://www.fujitsu.com/global/documents/about/resources/publications/technicalreview/2020-03/article09.pdf
- Rajpal S., Lakhyani N., Singh A.-K., Kohli R., Kumar N. Using handpicked features in conjunction with ResNet-50 for improved detection of COVID-19 from chest X-ray images // Chaos, Solitons & Fractals.– 2021.– Vol. 145, 110749. https://doi.org/10.1016/j.chaos.2021.110749
- Honda T. Development of a deep neural network library for A64FX, R-CCS 4th Meeting for Application Code Tuning on A64FX Computer Systems (17 March 2021).– 25 pp. https://www.hpci-office.jp/invite2/documents2/meeting_A64FX_210317