ETP4HPC

HPC Vega — Slovenian peta-scale supercomputer powering scientific discovery


The European Technology Platform for High-Performance Computing (ETP4HPC) organized a conference today on the Vega system.

The session on EuroHPC supercomputers, featuring the HPC Vega system, was hosted by IZUM, the Institute of Information Science, in Maribor, Slovenia. Aleš Zemljak and Žiga Zebec from IZUM presented on Vega.

Slovenian peta-scale supercomputer
Aleš Zemljak gave an overview of HPC Vega: “HPC Vega — Slovenian Peta-scale Supercomputer”. He touched on the system’s design, architecture, and installation, focusing on the most user-relevant basic concepts of HPC and their relation to HPC Vega.

HPC Vega.

HPC Vega is the Slovenian peta-scale supercomputer and the most powerful system in the country. It is the first operational EuroHPC JU system, in production since April 2021. It delivers 6.9 PFLOPS on an Atos Sequana XH2000 with 1,020 compute nodes and 100 Gb/s InfiniBand. Storage comprises an 18 PB large-capacity Ceph system and a 1 PB high-performance Lustre system. It consumes less than 1 MW of power, with a PUE below 1.15.

App domains
HPC application domains include earth sciences, such as seismology, earthquake simulations and predictions, climate change, weather forecasting, earth temperatures, ocean streams, forest fires, volcano analysis, etc. They also include high-energy physics and space exploration, such as particle physics, the Large Hadron Collider, the ATLAS experiment (collider), astronomy, the Large Synoptic Survey Telescope, the Gaia satellite, supernovas, new stars, planets, the sun, the moon, etc.

Other domains are medicine, health, chemistry, and molecular simulation, including diseases, drugs, vaccines, DNA sequencing, bioinformatics, molecular chemistry, etc.; mechanical engineering and computational fluid dynamics; and machine learning, deep learning, AI, etc., such as autonomous driving, walking simulations, speech and face recognition, robotics, language analytics, etc.

HPC Vega has 10 design goals: general-purpose HPC for user communities, compute-intensive CPU/GPU partitions, high-performance data analytics (HPDA) for extreme data processing, AI/ML, compute-node WAN connectivity, a hyper-converged network, remote access for job submission, good scalability for massively parallel jobs, fast throughput for large numbers of small jobs, and high sequential and random storage access performance.
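
As an illustration of the “remote access for job submission” and “fast throughput for large numbers of small jobs” goals, below is a minimal sketch of packing many small tasks into one batch job array. It assumes a Slurm-based batch system (as commonly used on EuroHPC machines) and a hypothetical worker script process.py; partition names, limits, and paths on the real system would differ.

```python
# Sketch: submitting many small tasks as one Slurm job array via Python.
# Assumes a Slurm batch system; the worker script "process.py" is hypothetical.
import subprocess
import textwrap

batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=small-jobs
    #SBATCH --array=0-999%100     # 1000 tasks, at most 100 running at once
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1
    #SBATCH --time=00:10:00
    srun python process.py --task-id "$SLURM_ARRAY_TASK_ID"
""")

with open("array_job.sh", "w") as f:
    f.write(batch_script)

# sbatch prints the job ID on success; check=True raises if submission fails.
result = subprocess.run(["sbatch", "array_job.sh"],
                        capture_output=True, text=True, check=True)
print(result.stdout.strip())
```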

Funded EU projects include interTwin (exploitation of the HPC Vega environment, two FTEs at IZUM and JSI), the starting EPICURE project, and SMASH (MSCA co-funded), which is on-boarding its first postdocs. EUMaster4HPC is preparing a summer internship offer.

Non-funded supporting projects and activities include EuroCC SLING, the MaX3 CoE, etc. Others are: the European Digital Infrastructure Consortium (EDIC), for which national resources are reserved; high-level application support for Leonardo; CASTIEL2; the Container Forum; the MultiXscale CoE; and EVEREST (Experiments for Validation and Enhancement of higher REsolution Simulation Tools).

The future lies in data centers and “Project NOO”, part of the Recovery and Resilience Plan (NOO). The goal is to provide archive facilities for research data, space for hosting equipment of public research institutions and universities, and space for future HPC systems. The project is due to be completed in June 2026, with EUR 15.2 million available for two data centers and long-term research data storage equipment.

We envision two identical facilities for the two data centers. One will be located at the Dravske elektrarne site on Mariborski otok; acquisition of the land has been completed. The other will be at the JSI (nuclear research) reactor site at Podgorica, near Ljubljana. The ground floor will be used for HPC, and the first floor for the research data archive and for Arnes’s and hosted equipment. Slovenia is going to need a new supercomputer by the end of 2026; EuroHPC JU co-funding is expected (this system is not part of Project NOO).

Powering scientific discovery
Dr. Žiga Zebec presented: “HPC Vega: Powering Scientific Discovery”, focusing on the science conducted on HPC Vega, or “use cases”.

Slovenian research institutions using HPC Vega include: the National Institute of Chemistry (Kemijski inštitut), laboratory for molecular modeling; the University of Ljubljana, cognition modeling lab and the FMF physics department; the University of Maribor, laboratory of physical chemistry; and the Jožef Stefan Institute, with theoretical physics, experimental particle physics, reactor physics, the Centre for Astrophysics and Cosmology, etc.

Major domestic projects include: Development of Slovene in a Digital Environment, whose goal is to meet the need for computational tools and services in language technologies for Slovene; development of meteorological and oceanographic test models; smart hospital development based on AI, with the goal of developing AI-based hospitals; and robotic textile and fabric inspection and manipulation, which aims to advance the state of the art in perception, inspection, and robotic manipulation of textiles and fabrics, and to bridge the technological gap in this industry.

There is also the Slovenian Genome project, a systematic study of the genomic variability of Slovenians, which can enable faster and more reliable diagnostics of rare genetic diseases.

There are scientific projects running on the Slovenian share of HPC Vega. These include a deep-learning ensemble for sea level and storm tide forecasting, all-atom simulations of cellular senescence (the process of deterioration with age), first-principles catalyst screening, dynamics of the opioid receptor, visual realism assessment of deepfakes, etc. Scientific projects are also running on the EuroHPC share of HPC Vega, such as understanding skin permeability with molecular dynamics simulations.

Vega is involved in several international projects. These include SMASH, interTwin, EUMaster4HPC, Epicure, etc.

LUMI: the enabler of world-class scientific breakthroughs!


The EuroHPC Joint Undertaking has installed three leadership-class supercomputers. This session discussed one of these systems, LUMI, located in Kajaani, Finland. LUMI is currently the fastest supercomputer in Europe, and one of the most powerful and advanced computing systems in the world.

Dr. Pekka Manninen, Director of Science and Technology, Advanced Computing Facility, CSC, Finnish IT Center for Science, presented the technical architecture of the LUMI infrastructure and its status, together with plans and ambitions for the near future.

LUMI is run by a unique consortium of 10 countries with strong national HPC centers. LUMI resources are allocated in proportion to investments. The EuroHPC JU share (50 percent) is allocated through a peer-review process and is available to European researchers.

The LUMI data center is based in Kajaani, Finland. The Kajaani data center is a direct part of the Nordic backbone, with 4x100 Gbit/s connectivity to GÉANT in place. It can easily be scaled up to the multi-terabit level. Waste heat reuse in district heating saves thousands of tons of CO2 every year, along with considerable financial savings.

LUMI, an HPE Cray EX system, is one of the fastest supercomputers in the world. It is ranked third worldwide. LUMI is a tier-0 GPU-accelerated supercomputer that enables the convergence of HPC, AI, and high-performance data analytics, and it provides an enhanced user experience. High-level interfaces are available on LUMI, such as Jupyter notebooks and RStudio, with a backend on LUMI compute nodes. It has a rich stack of preinstalled software and provides datasets as a service.

LUMI is designed as a Swiss army knife, targeting a wide spectrum of use cases and user communities: climate research, data science, plasma physics, life sciences, materials science, humanities and social sciences, and a fast track for urgent computing. In performance it currently sits behind the Frontier and Fugaku supercomputers.

Scientific showcases include large language models and generative AI. The objective is the democratization of generative AI. Several ongoing LUMI projects are training LLMs for various European languages, including Finnish, Swedish, Norwegian, Estonian, English, etc. Work is under way to provide an API for an instruction-tuned LLM and an open-source foundational LLM.

Another showcase is the climate adaptation digital twin. The objective is to build a digital twin of the Earth’s climate to understand the impacts of climate change and to help make data-driven political decisions for their mitigation. Simulations run over two Earth system models, ICON and IFS-FESOM/NEMO, at an unprecedented resolution of 5-10 km. Performance figures for the German ICON model on LUMI-G are also available.

Another showcase was solar astrophysics, aiming to understand subsurface solar physics. Next came turbulent dynamics in superfluid Fermi systems; fermionic superfluids are important for understanding neutron stars. LUMI’s capabilities are already in use for societally important science initiatives. We are part of a new golden era in European HPC. LUMI has been in full customer use since December 2022.

Integration of disruptive technologies into HPC systems needed


ETP4HPC organized a conference on emerging technologies for HPC in Europe. The speakers were Prof. Dr. Estela Suarez, Research Group Leader, Jülich Supercomputing Centre, and Prof. Dr. Kristel Michielsen, Group Leader of the Quantum Information Processing research group, Jülich Supercomputing Centre, Germany.

They first looked at HPC system architecture aspects and managing heterogeneity. How are we managing heterogeneity in HPC? Causes for this trend include hardware, where we need larger, more energy-efficient systems, and applications, where we combine different codes into complex workflows.

Heterogeneity is everywhere! This includes processing, memory, network, and paradigms — be it classical, neuromorphic, or quantum computing. Traditional barriers are dissolving. Computing is no longer only in in-node processors (CPU, GPU); we also compute in the network with DPUs and IPUs, and we have processing in memory and network-attached memory.

System architecture can enable the integration of new technologies. Resource management offers dynamic orchestration and malleability, and is key to achieving effective resource utilization. The software stack must master the hardware and foster performance portability. We have standardized programming models, smart compilers, runtime and workflow systems, and debugging and performance tools.

Applications face challenges, but heterogeneity also opens new opportunities. There can be novel workflows and novel mathematical formulations. New application features combine HPC and AI. Heterogeneity must be tackled by a coordinated effort at all levels.

We can build a modular supercomputing architecture (MSA). MSA has been evolving since 2011, with the DEEP projects. We are now moving to the exascale Jupiter system in 2024.

We are looking at heterogeneity at the chip level. We can integrate different dies (chiplets) in the same package at lower cost, facilitate diversity with chiplets chosen based on customer needs, and have short connections with lower latency, better bandwidth, and lower power consumption.

Integration of disruptive technologies needed
System complexity grows with heterogeneity. It is hard to predict the performance of a given application. Understanding dependencies is needed for hardware/software co-design. Energy efficiency is as important as performance.

Qualitative and quantitative evaluation of HPC systems is needed. This includes large (exa-)scale system modelling, end-to-end performance models of applications based on system metrics, and the use of accelerators and techniques like ML/AI. Research is needed in heterogeneous hardware and application modelling.
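
As a toy illustration of an end-to-end performance model built from system metrics, the sketch below evaluates a roofline-style bound: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity. All figures are hypothetical placeholders, not measurements of any particular system.

```python
# Sketch: a roofline-style performance bound from two system metrics.
# Numbers are hypothetical placeholders, not measurements of a real system.

def roofline_gflops(peak_gflops: float, mem_bw_gbs: float, intensity_flop_per_byte: float) -> float:
    """Attainable GFLOP/s = min(peak compute, memory bandwidth * arithmetic intensity)."""
    return min(peak_gflops, mem_bw_gbs * intensity_flop_per_byte)

peak = 3000.0      # GFLOP/s, hypothetical node peak
bandwidth = 200.0  # GB/s, hypothetical memory bandwidth

for intensity in (0.25, 1.0, 4.0, 16.0, 64.0):
    bound = roofline_gflops(peak, bandwidth, intensity)
    regime = "memory-bound" if bound < peak else "compute-bound"
    print(f"AI = {intensity:5.2f} flop/byte -> {bound:7.1f} GFLOP/s ({regime})")
```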

There is the integration of disruptive technologies. The end of Moore’s Law calls for more disruptive solutions, such as ASIC-based solutions, neuromorphic computing, and quantum computing. A sudden, full exchange of technologies is not feasible. Not every application is suitable for these approaches, and even those that are would need to be rewritten.

Integration with ‘traditional’ HPC technologies is also needed. Applications can run on an HPC system and offload suitable functions to disruptive devices. Adaptation to innovative devices is gradual and step-by-step. Integration requires both hardware and software solutions.

Neuromorphic computing presents scenarios such as intelligent edge co-processors for distributed cross-platform edge-cloud-HPC workflows. AI inference and training happen at the edge, and data movement is minimized.

Another scenario is data center co-processors/accelerators for AI/ML training and inference at scale. We can have inference for HPC-AI hybrid workloads, and training of AI algorithms (spiking neural networks, back-propagation).
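
For readers unfamiliar with spiking neural networks, the following toy sketch simulates a single leaky integrate-and-fire neuron, the basic unit of such models. The parameters are arbitrary illustrative values, not tied to any neuromorphic hardware.

```python
# Sketch: a single leaky integrate-and-fire (LIF) neuron, the basic unit of
# spiking neural networks. Parameters are illustrative, not hardware-specific.

def simulate_lif(input_current, dt=1e-3, tau=20e-3, v_rest=0.0,
                 v_thresh=1.0, v_reset=0.0):
    """Return spike times (in seconds) for a step-wise input current trace."""
    v = v_rest
    spikes = []
    for step, i_in in enumerate(input_current):
        # Membrane potential leaks toward rest and integrates the input.
        v += dt / tau * (v_rest - v) + dt * i_in
        if v >= v_thresh:          # threshold crossing emits a spike
            spikes.append(step * dt)
            v = v_reset            # reset after spiking
    return spikes

# 200 ms of constant input; stronger input would yield a higher firing rate.
print(simulate_lif([60.0] * 200))
```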

In quantum computing, we can integrate quantum computers and simulators (QCS) into HPC systems at the system level, with loose and tight integration models; at the programming level, with a full hardware-software stack; and at the application level, with optimization, quantum chemistry, and quantum ML.

There can be application-centric benchmarking, with tests of the algorithm, the software stack, and the technology. We can also emulate QCS with HPC systems, which enables ideal and realistic QCS, and the designing, analyzing, and benchmarking of QCS and quantum algorithms.
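
To make “emulation of QCS with HPC systems” concrete, here is a minimal single-process sketch of state-vector emulation of a two-qubit circuit using NumPy. Production emulators distribute the state vector across many nodes; this only shows the core idea.

```python
# Sketch: classical state-vector emulation of a 2-qubit circuit (Bell state).
# Real HPC emulators distribute the state vector across nodes; this shows the
# single-process idea only.
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard gate
I = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

state = np.zeros(4); state[0] = 1.0            # start in |00>
state = np.kron(H, I) @ state                  # Hadamard on qubit 0
state = CNOT @ state                           # entangle: (|00> + |11>)/sqrt(2)

probabilities = np.abs(state) ** 2
for basis, p in zip(["00", "01", "10", "11"], probabilities):
    print(f"|{basis}>: {p:.3f}")
```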

Heterogeneity in HPC
There are several challenges. It is hard to efficiently share resources and maximize utilization, and hard to identify sources of performance loss and optimization opportunities. We also have to maintain performance portability. It is still difficult to understand and predict performance.

We also have several opportunities: better energy efficiency; the ability to select the ideal hardware for each part of an application; a wider range of providers and technologies, moving away from monopolistic scenarios; and the development and integration of disruptive technologies with potentially better performance and energy efficiency. We can also see the real impact of co-design.

SRA 5 – HPC system software/application co-design @ETP4HPC


The European Technology Platform for High-Performance Computing (ETP4HPC) organized a seminar on SRA 5 – HPC system software/application co-design. Strategic Research Agenda (SRA) 5 is the fifth edition of the SRA, published in October 2022.

SRA 5 has several working groups. These are: system architecture, system hardware components, system software and management, programming environment, I/O and storage, mathematics and algorithms, application co-design, and the centre-to-edge framework. Two groups are new in SRA 5 — quantum for HPC, and non-conventional HPC architectures.

The speakers were Dr. Manolis Marazakis, Principal Staff Research Scientist, FORTH Research Center, Greece, and Prof. Dr. Erwin Laure, Director, Max Planck Computing and Data Facility (MPCDF) and Prof., Technical University Munich, Germany.

HPC system software in SRA 5
Dr. Manolis Marazakis, FORTH, presented on HPC system software in SRA 5. The computing continuum is still a great expectation. Everything changes, and everything stays the same (as in SRA 4).

SRA 5 supports application diversity and an expanding computing scope, in particular meeting the requirements of the convergence of simulation (HPC), Big Data (HPDA), and AI processing as parts of the same IT continuum.

It masters the complexity of optimized or specialized hardware and software combinations, with environments and tools supporting dynamic and flexible execution models. It offers smart tools to assist in the development, deployment, optimization, and control of the efficiency of application workflows over heterogeneous hardware architectures.

SRA 5 also supports mixed generations of hardware and their interoperability. This need will become a challenge for runtimes and application portability. Requirements due to carbon footprint must be reinforced, including investigating the carbon footprint of data (research leading to refinement of data organization).

The system software landscape at exascale was presented for 2023-2026. Drivers are power efficiency, scalability, and heterogeneity support. This requires combining software components and hardware diversity in the same execution environment. It also requires new CI/CD capabilities and strong composability process control. New, optimized libraries could facilitate optimal usage of hardware, and there are tools to facilitate understanding of application behavior.

Next comes the convergence of simulation (HPC), Big Data (HPDA), and AI in the same IT continuum. Convergence happens through workflows for advanced applications, or through flexible computing-resource sharing and execution deployment.

Carbon footprint
We must consider the carbon footprint of computing infrastructure and applications, and sustainability. Hardware production has a large carbon footprint, as does the redesign of applications for new hardware. This must be counter-balanced by greater efficiency or larger problem resolution.
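
A back-of-the-envelope sketch of this trade-off, using entirely hypothetical numbers: replacing hardware pays off in carbon terms only once the operational savings outgrow the embodied emissions of the new hardware.

```python
# Sketch: embodied vs. operational carbon break-even. All numbers are
# hypothetical placeholders used only to illustrate the trade-off.

embodied_new_hw_tco2 = 300.0        # carbon cost of producing the new system
old_power_mw = 1.5                  # average draw of the old system
new_power_mw = 1.0                  # average draw of the more efficient system
grid_intensity_tco2_per_mwh = 0.25  # carbon intensity of the electricity mix

hours_per_year = 24 * 365
saved_per_year = (old_power_mw - new_power_mw) * hours_per_year * grid_intensity_tco2_per_mwh
break_even_years = embodied_new_hw_tco2 / saved_per_year

print(f"Operational savings: {saved_per_year:.0f} tCO2e/year")
print(f"Embodied carbon recovered after ~{break_even_years:.1f} years")
```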

There are security and trust concerns as well: security, authentication, protection of data, and application replay to validate results. Finally, open-source concerns include maintainability and compatibility of versions and ABIs.

Convergence of HPC, HPDC, and AI
Now, let’s look at the convergence of simulation (HPC), Big Data (HPDA), and AI in the same IT continuum. Compute-intensive workloads currently run on HPC clusters, with close-to-the-metal performance and efficient use of high-end/dedicated hardware. Users are granted exclusive, albeit time-limited, access to resources.

Data-intensive (Big Data, AI) workloads currently run mostly on cloud systems. This means instant and elastic availability of resources, fault tolerance, and a multi-tenant environment, with sufficient flexibility to select between a ‘self-service’ operating mode or relying on a ready-made software stack.

There is the ability to reconfigure data center resources, building new virtual infrastructures out of existing building blocks, which leads to elastic reconfiguration and efficient scheduling. New technologies, such as AI methods, AI-optimized hardware, commodity device observations, and IoT data streaming, offer new potential for scientific methodologies and workflows.

Understanding and modelling data and workflows in the underlying multi-owner, multi-tenant IT infrastructure is required. The data and computing continuum is a disruption for present application development, putting a major emphasis on data-aware execution flow and security across the full application workflow.

Challenges with resource management
There are challenges with resource management. First, there is a lack of a global architecture of the complete environment; scheduling decisions should be made in a distributed manner.

Next, the wide variety of compute, storage, and communication systems in a computing continuum complicates efficient and secure integration and management, as well as interoperability for seamless integration of HPC and cloud technologies. Edge devices and intermediate nodes (fog computing) can be limited in processing and memory. It is vital to enable local computing on the edge, minimizing the data movement between layers and the energy consumption as far as possible while keeping adequate QoS.

Finally, workflows should be deployed and/or migrated between different layers of the continuum. Scheduling and orchestration systems need to map workflows adequately onto complex infrastructure, minimizing the energy used and optimizing performance.
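
A minimal sketch of this mapping problem: given per-layer capacities and relative energy costs (all hypothetical), place each workflow step on the cheapest layer that still has enough capacity. Real orchestrators solve a far richer version of this.

```python
# Sketch: greedy energy-aware placement of workflow steps across the continuum.
# Layers, capacities, and energy costs are hypothetical illustrations.

layers = [  # (name, free CPU cores, relative energy cost per core-hour)
    ("edge",  8,    1.0),
    ("cloud", 256,  1.5),
    ("hpc",   4096, 2.0),
]

workflow = [("ingest", 2), ("preprocess", 16), ("simulate", 1024), ("report", 1)]

placement = {}
for step, cores_needed in workflow:
    # Cheapest layer (by energy cost) with enough free capacity wins.
    candidates = [l for l in layers if l[1] >= cores_needed]
    name, free, cost = min(candidates, key=lambda l: l[2])
    placement[step] = name
    # Update remaining capacity on the chosen layer.
    layers = [(n, f - cores_needed if n == name else f, c) for n, f, c in layers]

print(placement)
# {'ingest': 'edge', 'preprocess': 'cloud', 'simulate': 'hpc', 'report': 'edge'}
```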

Focus on efficiency
We also need the efficiency of the combination of IT infrastructure and application execution environments. There should be “adequate/appropriate” computing and application-specific hardware (for higher ops/watt). We need reprogramming capabilities or dedicated software optimization. The challenge is not only from the programmer’s point of view, but also from the system manager’s: how to combine different accelerators into a unified programming model supported by a dynamic and elastic resource management infrastructure?

We also need to look at in situ/in transit processing. We can allow data visualization, curation, structuring, or analysis to happen as data is generated by simulations. Big Data management approaches bring the computation to where the data is located. There is a pressing need to improve data streaming support at the network and OS levels.
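
A minimal sketch of the in situ idea: statistics are updated as each simulation sample is produced, so raw data never has to be staged to storage first. The “simulation” here is just a stand-in random generator.

```python
# Sketch: in situ analysis with Welford's online mean/variance.
# The "simulation" is a stand-in generator; in practice the consumer would be
# coupled to the running solver instead of reading files after the fact.
import random

def fake_simulation(steps):
    for _ in range(steps):
        yield random.gauss(300.0, 5.0)   # e.g. a temperature field average

count, mean, m2 = 0, 0.0, 0.0
for sample in fake_simulation(100_000):
    count += 1
    delta = sample - mean
    mean += delta / count
    m2 += delta * (sample - mean)        # Welford update, numerically stable

variance = m2 / (count - 1)
print(f"n={count}, mean={mean:.2f}, variance={variance:.2f}")
```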

Finally, a data-aware system policy! Heterogeneity of resources and large amounts of data have a negative impact on the locality of data and processing. The energy cost, combined with the carbon footprint, of data movement is a great challenge that needs to be considered in the co-design of applications and in workflow orchestration.

Decarbonization needs
We also need to look at decarbonization, an emerging topic. There is emphasis on the sustainability of hardware and software for HPC. We must rethink lifecycle requirements and the carbon RoI of solutions, and we need tools to estimate carbon footprints.

The slowdown in the growth of CPU clock speed, and of corresponding core performance, is likely to incentivize extending the useful life of HPC system deployments. Operating different generations of hardware may bring interoperability challenges, which need to be balanced against the potential savings from avoiding, or at least deferring, the effort of rewriting application codes for performance tuning. We also need efficient and timely metrics collection and low-level resource monitoring APIs.
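
One concrete example of a low-level monitoring interface is the Linux powercap (RAPL) sysfs counter; the sketch below samples package energy over one second. The path and its availability are platform-dependent (Intel CPUs, recent kernels, often root-only), so treat this as an assumption rather than a universal API.

```python
# Sketch: sampling CPU package energy via the Linux powercap (RAPL) interface.
# Path and availability are platform-dependent and may require elevated
# privileges; this is an assumption, not a universal API.
import time

RAPL_ENERGY = "/sys/class/powercap/intel-rapl:0/energy_uj"  # package 0, microjoules

def read_energy_uj():
    with open(RAPL_ENERGY) as f:
        return int(f.read())

start = read_energy_uj()
time.sleep(1.0)
end = read_energy_uj()

# The counter wraps around eventually; this sketch ignores that corner case.
watts = (end - start) / 1e6 / 1.0
print(f"Package 0 average power over 1 s: {watts:.1f} W")
```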

Finally, there is an intersection with the ‘sustainability’ research cluster. Building modular systems can extend the exploitation lifetime of some system parts, reducing e-waste and allowing partial upgrades. For example, accelerator-based architectures gain performance while keeping the central parts of the system. Increasing system lifetime also requires increasing resilience and fail-over against system failures. Hardware heterogeneity is a major issue for applications, requiring portability efforts and an evolution of the software stack.

Tools needed
Tools are needed for dynamic and flexible execution models. New software stack integration and compliance capabilities support application portability over heterogeneous infrastructures. The shift toward hybrid HPC means cloud-based options are augmenting on-premises HPC capabilities; traditional on-premises HPC systems can be extended with flexible private cloud (off-premises) infrastructure. A data-aware governance policy, with emphasis on data sovereignty and lifecycle, and controls for operational costs, is also needed.

Next, we have matching hardware resource capabilities with application-oriented environments. Significant tool evolution is required to address the variability of resources. We need embedded AI and analytics methods to assist in development and deployment.

Challenges lie in the coordination of application workflows, resource management, and the data management cycle, using a combination of HPC and cloud resources. There is a meta-OS/meta-orchestration approach. Infrastructure owners and operators are expected to pay attention to alignment with development tools and standards.

We are also improving scheduling algorithms to optimize computing and I/O operations in a coordinated manner. There are hierarchical scheduling approaches. Modularization of the runtime environment using standardized interfaces (e.g., PMIx) for information exchange between components is ongoing.

We also need decision-making support. There are flexible usage modes of HPC, scientific modelling, analytics, and AI/ML-based applications, in combination with data assets. We also need trustworthy processing of sensitive datasets. Assistance from the OS with essential security mechanisms enables isolation and verifiably trusted execution. Concerns arise from the need to process sensitive data sets (personally identifiable information and data under IP restrictions), necessitating protection against loss of integrity and confidentiality.

Security risks for supercomputers
There can be exposure of supercomputers to security risks. The computing continuum and the sharing of supercomputing resources by new types of applications are critical points for security, with more entry points to verified computing resources. The flexibility of application container execution could be a way to execute verified applications.

Software stacks that run on computing resources are no longer strictly built and maintained by the supercomputer administrators. They must become more evolvable and composable. There is no process today to verify, via a ‘software factory’ process, that a CVE is indeed mitigated. We also need protection of technological and data sovereignty.

QPUs in HPC
There is the integration of quantum processing units (QPUs) into the existing HPC landscape. Only a resource management and scheduling infrastructure shared with the existing von Neumann-based HPC enables usage scenarios going beyond pure analog or digital quantum processing. Such hybrid quantum-HPC codes demand low-latency data exchange between the QPU and the traditional part of the HPC system.

The modular supercomputing architecture (MSA) provides a natural way of realizing this kind of tight integration by considering the QPU as an additional module of the system. However, with QPUs still being scarce, novel resource management approaches are needed to enable efficient utilization by multiple users. System software needs to be prepared for even tighter integration models. For example, QPUs might be directly integrated into nodes and, eventually, onto the chip.
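
A schematic of the loose-integration usage pattern, with an entirely mocked QPU interface (no real quantum SDK or site API is assumed): a classical optimizer runs on the HPC side and offloads circuit evaluations to whatever QPU or emulator has been allocated.

```python
# Sketch: a loosely integrated hybrid loop. The QPU is mocked; a real setup
# would dispatch circuits to hardware or an emulator via the site's scheduler.
import math
import random

def mocked_qpu_expectation(theta: float) -> float:
    """Stand-in for a circuit evaluation on a QPU: a noisy cos(theta) landscape."""
    return math.cos(theta) + random.gauss(0.0, 0.001)   # small "shot noise"

# Classical outer loop (the HPC side) minimizing the measured expectation value.
theta, learning_rate = 0.1, 0.2
for _ in range(60):
    grad = (mocked_qpu_expectation(theta + 0.01) -
            mocked_qpu_expectation(theta - 0.01)) / 0.02   # finite differences
    theta -= learning_rate * grad
print(f"theta ≈ {theta:.2f} (minimum of cos is at pi ≈ {math.pi:.2f})")
```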

Note on open-source software
Lastly, there is a note on open-source software. System software and management have benefited substantially from the availability and long-term evolution of relevant open-source components: job submission and control, including resource management and scheduling, such as Slurm, Torque, and Flux; and node and system-level health monitoring, such as Nagios, Ganglia, Prometheus, etc.

There is also configuration automation and change management, such as Puppet, Ansible, etc.; low-level system initialization and management, such as OpenBIOS, OpenBMC, etc.; operating systems, such as the Linux kernel and distributions; filesystems, such as Lustre, BeeGFS, etc.; and interconnects, such as OpenFabrics.

There is limited funding for system software and management. The sustainment burden rests with a small set of beneficiaries. Fixing bugs and vulnerabilities relies on surveillance done by the open-source community, which does not guarantee a high level of security relative to proprietary software.

Concerns must be carefully considered throughout the lifecycle, including the planning, development, integration, and sustainment phases. These include licensing, with standardized, well-reviewed licenses used as models of best practice, and open-source project governance frameworks for structured collaboration.

We need support models, including available upgrade paths, in some cases oriented toward reducing vulnerabilities. Well-defined and well-documented APIs in cohesive software stacks are also needed to simplify the interoperability and exchangeability of software components.

There are research clusters and research domains in SRA 5. The clusters are sustainability and usability, HPC in the digital continuum, HPC for prompt decision-making, federated cloud/HPC and data infrastructures, and exploitation of heterogeneous HPC technology.

SRA 5 era of application co-design
Dr. Erwin Laure, MPCDF, presented on the SRA 5 era of application co-design. He said breakthroughs are to be achieved via exascale. Examples include climate, weather, and earth sciences; engineering and manufacturing; global challenges; life sciences and medicine; energy; chemistry and materials sciences; etc.

European CoEs on HPC.

Industry and SMEs have a very pragmatic view. Industry typically has use cases that address capacity, not capability. Routine simulations, e.g., in the automotive industry, use hundreds and up to thousands of cores. Larger jobs are possible but rarely reach the peta-scale level. Some sectors regularly have larger simulations to solve, but typically they do not go far up the peta-scale. Design-of-experiments (DoE), stochastic models, and parameter studies bring the need to perform a larger number of related simulations (hundreds or thousands) in a coordinated manner.

SME simulations can be smaller, but not necessarily; SMEs normally have ‘standard’-sized problems. SME-sized service providers and ISVs can have needs like those of larger companies. AI training runs can bring temporary needs for very high performance.

Exascale computing is here!
Exascale computing is now here! An example is FRONTIER @ OLCF (US), built by HPE/Cray. It will soon be in Europe: LEONARDO @ CINECA, Italy, with Atos, and LUMI @ CSC, Finland, with HPE/Cray. There will be one more pre-exascale system at BSC in Europe in 2023. ‘Jupiter’ @ Jülich will be the first European exascale system (worth about €500 million), by 2024.

The EuroHPC JU has already procured seven supercomputers: two pre-exascale and five petascale. The total contract cost is about €360 million. We need bigger and faster computers for improving numerical accuracy, analyzing bigger datasets, and AI. The EU H2020 Center of Excellence for Novel Materials Discovery (NOMAD) is already working on numerical accuracy and scaling.

There are challenges at exascale. These are the level of parallelism, hardware heterogeneity, programming and performance portability, novel numerical and methodological approaches, composability, and the mismatch of technology and development cycles.

Key challenges ahead for SRA 5 include education and career paths, low-latency/high-bandwidth data access, large workflows and ensembles, memory bandwidth and communication latency, storage and I/O, long-term maintenance and portability of codes, dwarfs, and alternative approaches (for energy efficiency) such as quantum, neuromorphic, RNA, data-flow, etc.

There are European Centers of Excellence (CoE) for HPC apps. They are developing exascale-ready applications, and supporting supercomputing applications and communities for science and innovation. Examples of CoEs include Esiwace, Plasma PEPSC, CEEC, Nomad, Hidalgo2, Excellerat, etc.

Exascale computing is here, enabling routine petaflop simulations on GPU-accelerated HPC clusters. (Commercial) ML applications appear to be a major hardware driver. Multi-physics, multi-scale problems require even more compute power and clever algorithms. Sustainable (scientific) application development is currently a huge challenge.