School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
LU Yutong (born 1969), female; research interests: high-performance computing and supercomputers; E-mail: luyutong@mail.sysu.edu.cn
Print publication date: 2024-11-25
Online publication date: 2024-10-22
Received: 2024-10-01
Accepted: 2024-10-09
LU Yutong, CHEN Zhiguang. The convergent computing based on supercomputer[J]. Acta Scientiarum Naturalium Universitatis Sunyatseni, 2024, 63(06): 150-160. DOI: 10.13471/j.cnki.acta.snus.ZR20240293.
Complex science and engineering computing applications require the convergence of three computing modes: numerical simulation, big data processing, and artificial intelligence. These three modes exhibit different workload characteristics and differ significantly in execution, scheduling, and data access, so traditional supercomputers cannot support all three efficiently at the same time. We restructured the supercomputer's system software, including the parallel file system, the parallel communication system, and the resource management and job scheduling system, and further designed a supercomputer-based big data processing framework and an artificial intelligence inference framework, allowing the big data and AI computing modes to be integrated into high-performance computing applications and forming a converged multi-mode computing support system for the supercomputing environment. Applications show that the proposed system supports the coupling of the three computing modes, delivers significant performance gains, and provides a complete runtime environment for complex science and engineering computing applications.
Keywords: converged computing; supercomputer; science and engineering computing
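The workflow implied by the abstract, in which simulation output feeds big data processing and AI inference on the same machine, can be made concrete with a small sketch. The sketch below is a hypothetical illustration, not the interface of the system described in this paper: it assumes a SLURM-like batch scheduler, and the script names and the submit_job helper are invented for the example.

    # Hypothetical sketch of chaining the three computing modes on a shared
    # scheduler. submit_job and the stage scripts are illustrative assumptions,
    # not the actual API of the system described in the paper.
    import subprocess
    from typing import Optional

    def submit_job(script: str, depends_on: Optional[str] = None) -> str:
        """Submit a batch script via a SLURM-like scheduler and return its job ID."""
        cmd = ["sbatch", "--parsable"]  # --parsable makes sbatch print only the job ID
        if depends_on:
            # afterok: start this job only once the dependency finishes successfully
            cmd.append(f"--dependency=afterok:{depends_on}")
        cmd.append(script)
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        return result.stdout.strip()

    # Stage 1: MPI-based numerical simulation writes raw output to the parallel file system.
    sim_id = submit_job("run_simulation.sh")

    # Stage 2: big data processing aggregates the simulation output into features.
    etl_id = submit_job("run_data_processing.sh", depends_on=sim_id)

    # Stage 3: AI inference consumes the processed features; its predictions could
    # steer the next simulation round, closing the multi-mode loop.
    submit_job("run_inference.sh", depends_on=etl_id)

Chaining heterogeneous jobs through afterok dependencies is one common way to couple the three modes on a conventional scheduler; the system described in the abstract integrates them more deeply, at the file-system, communication, and scheduling layers.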