Lenovo HPC and AI Software Stack Instructions

June 3, 2024
Lenovo


Product Guide

The Lenovo HPC & AI Software Stack combines open-source software with proprietary best-of-breed supercomputing software to provide the most consumable open-source HPC software stack, embraced by all Lenovo HPC customers.

It provides a fully tested and supported, complete but customizable HPC software stack that enables administrators and users to use their Lenovo supercomputers optimally and in an environmentally sustainable way.

The software stack is built on the most widely adopted and maintained HPC community software for orchestration and management. It integrates third-party components, especially for programming environments and performance optimization, to complement and enhance those capabilities, bringing software and services together under one umbrella to add value for our customers.

The software stack offers key software and support components for orchestration and management, programming environments, and services and support, as shown in the following figure.

Figure 1. Software Stack

Did you know?

Lenovo HPC & AI Software Stack is a modular software stack tailored to our customers’ needs. Thoroughly tested, supported, and periodically updated, it combines the latest open-source HPC software releases to give organizations an agile and scalable IT infrastructure.

Benefits

The Lenovo HPC & AI Software Stack provides the following benefits to customers.

Overcoming the Complexity of HPC Software

An HPC system software stack consists of dozens of components that administrators must integrate and validate before an organization’s HPC applications can run on top of the stack. Ensuring stable, reliable versions of all stack components is an enormous task because of the numerous interdependencies, and it is very time consuming because individual components follow constant release and update cycles.

The Lenovo HPC & AI Software Stack is fully tested, supported, and periodically updated to combine the latest open-source HPC software releases, giving organizations an agile and scalable IT infrastructure.

Benefits of the Open-source Model

Going forward, in IDC’s opinion, the development model exemplified by Linux is more workable. In this model, stack development is driven primarily by the open-source community and vendors offer supported distributions with additional capabilities for customers that require and are willing to pay for them. As the Linux initiative demonstrates, a community-based model like this has major advantages for enabling software to keep pace with requirements for HPC computing and storage hardware systems.

This model delivers new capabilities to users faster and makes HPC systems more productive, higher-returning investments.

A fair number of foundational open source HPC software components already exist (e.g., Open MPI, Rocky Linux, Slurm, OpenStack, and others). Many HPC community members are already taking advantage of these.

Customers will benefit from the HPC community, as the community works to integrate a multitude of components that are commonly used in HPC systems and are freely available for open source distribution.

The key open-source components of the software stack are:

  • Confluent Management
    Confluent is Lenovo-developed open-source software designed to discover, provision, and manage HPC clusters and the nodes that comprise them. Confluent provides powerful tooling to deploy and update software and firmware to multiple nodes simultaneously, with simple and readable modern software syntax.

  • Slurm Orchestration
    Slurm is integrated as an open-source, flexible, and modern choice for managing complex workloads, enabling faster processing and optimal utilization of the large-scale, specialized high-performance and AI resources that Lenovo systems provide for each workload. Lenovo provides support in partnership with SchedMD.

  • LiCO Webportal
    Lenovo Intelligent Computing Orchestration (LiCO) is a Lenovo-developed, consolidated Graphical User Interface (GUI) for monitoring, managing, and using cluster resources. The web portal provides workflows for both AI and HPC, and supports multiple AI frameworks, including TensorFlow, Caffe, Neon, and MXNet, allowing you to leverage a single cluster for diverse workload requirements.

  • Energy Aware Runtime
    EAR is a powerful European open-source energy management suite that supports everything from monitoring through power capping to live optimization during application runtime. Lenovo collaborates with Barcelona Supercomputing Centre (BSC) and EAS4DC on its continuous development and support, and offers three versions with differing capabilities.

Software components

Components are covered in the following sections:

  • Orchestration and management
  • Programming environment

Orchestration and management

The following orchestration software is available with Lenovo HPC & AI Software Stack:

  • Confluent (Best Recipe interoperability)
    Confluent is Lenovo-developed open source software designed to discover, provision, and manage HPC clusters and the nodes that comprise them. Our Confluent management system and LiCO Web portal provide an interface designed to abstract the users from the complexity of HPC cluster orchestration and AI workloads management, making open-source HPC software consumable for every customer. Confluent provides powerful tooling to deploy and update software and firmware to multiple nodes simultaneously, with simple and readable modern software syntax. Additionally, Confluent’s performance scales seamlessly from small workstation clusters to thousand-plus node supercomputers. For more information, see the Confluent documentation.

  • Lenovo Intelligent Computing Orchestration (Best Recipe interoperability)
    Lenovo Intelligent Computing Orchestration (LiCO) is a Lenovo-developed software solution that simplifies the management and use of distributed clusters for High Performance Computing (HPC) and Artificial Intelligence (AI) environments. LiCO provides a consolidated Graphical User Interface (GUI) for monitoring and usage of cluster resources, allowing you to easily run both HPC and AI workloads across a choice of Lenovo infrastructure, including both CPU and GPU solutions to suit varying application requirements. The LiCO Web portal provides workflows for both AI and HPC, and supports multiple AI frameworks, including TensorFlow, Caffe, Neon, and MXNet, allowing you to leverage a single cluster for diverse workload requirements. For more information, see the LiCO product guide.

  • Slurm
    Slurm is a modern, open-source scheduler designed specifically to satisfy the demanding needs of high-performance computing (HPC), high-throughput computing (HTC), and AI. Slurm is developed and maintained by SchedMD® and integrated within LiCO. Slurm maximizes workload throughput, scale, and reliability, and delivers results in the fastest possible time while optimizing resource utilization and meeting organizational priorities. Slurm automates job scheduling to help administrators and users manage the complexities of on-premises, hybrid, or cloud workspaces. The Slurm workload manager executes faster and is more reliable, increasing productivity while decreasing costs. Slurm’s modern, plug-in-based architecture runs on a RESTful API and supports both large and small HPC, HTC, and AI environments. Allow your teams to focus on their work while Slurm manages their workloads.

  • NVIDIA Unified Fabric Manager (UFM) (ISV supported)
    NVIDIA Unified Fabric Manager (UFM) is InfiniBand networking management software that combines enhanced, real-time network telemetry with fabric visibility and control to support scale-out InfiniBand data centers. For more information, see the NVIDIA UFM product page.
    The two UFM offerings available from Lenovo are as follows:

    • UFM Telemetry for Real-Time Monitoring
      The UFM Telemetry platform provides network validation tools to monitor network performance and conditions. It captures and streams rich real-time network telemetry information, application workload usage, and system configuration to an on-premises or cloud-based database for further analysis.

    • UFM Enterprise for Fabric Visibility and Control
      The UFM Enterprise platform combines the benefits of UFM Telemetry with enhanced network monitoring and management. It performs automated network discovery and provisioning, traffic monitoring, and congestion discovery. It also enables job schedule provisioning and integrates with industry-leading job schedulers and cloud and cluster managers, including Slurm and Platform Load Sharing Facility (LSF).
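
As a rough illustration of the job scheduling that Slurm automates, the following Python sketch renders a Slurm batch script from a small job description. It is not taken from Lenovo, LiCO, or SchedMD material: the `JobSpec` class, the `render_batch_script` function, and the job name, sizes, and command are all hypothetical. Only the `#SBATCH` directives and the `srun` launcher are standard Slurm.

```python
from dataclasses import dataclass


@dataclass
class JobSpec:
    """Hypothetical description of one batch job (all fields are illustrative)."""
    name: str
    nodes: int
    tasks_per_node: int
    gpus_per_node: int
    time_limit: str  # Slurm time format, for example "01:30:00"
    command: str


def render_batch_script(job: JobSpec) -> str:
    """Return the text of a Slurm batch script for the given job."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job.name}",
        f"#SBATCH --nodes={job.nodes}",
        f"#SBATCH --ntasks-per-node={job.tasks_per_node}",
        f"#SBATCH --time={job.time_limit}",
    ]
    # Request GPUs only when the job actually needs them.
    if job.gpus_per_node > 0:
        lines.append(f"#SBATCH --gpus-per-node={job.gpus_per_node}")
    lines.append("")
    # srun launches the command on the resources Slurm allocated to the job.
    lines.append(f"srun {job.command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed example values; adjust to the cluster and workload at hand.
    demo = JobSpec(
        name="train-demo",
        nodes=2,
        tasks_per_node=4,
        gpus_per_node=4,
        time_limit="01:30:00",
        command="python train.py",
    )
    print(render_batch_script(demo), end="")
```

A script produced this way would typically be saved to a file and submitted with `sbatch`. Partition names, accounts, and GPU types are site-specific and are deliberately left out of the sketch.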

The following table lists all Orchestration software available with Lenovo HPC & AI Software Stack.

Table 1. Orchestration and management

Part number| Feature code| Description

Lenovo Intelligent Computing Orchestration (LiCO) HPC AI version
7S090004WW| B1YC| Lenovo HPC AI LiCO Software 90 Day Evaluation License
7S09002BWW| S93A| Lenovo HPC AI LiCO Webportal w/1 yr S&S
7S09002CWW| S93B| Lenovo HPC AI LiCO Webportal w/3 yr S&S
7S09002DWW| S93C| Lenovo HPC AI LiCO Webportal w/5 yr S&S
Lenovo Intelligent Computing Orchestration (LiCO) Kubernetes version
7S090006WW| S21M| Lenovo K8S AI LiCO Software Evaluation License (90 days)
7S090007WW| S21N| Lenovo K8S AI LiCO Software 4GPU w/1Yr S&S
7S090008WW| S21P| Lenovo K8S AI LiCO Software 4GPU w/3Yr S&S
7S090009WW| S21Q| Lenovo K8S AI LiCO Software 4GPU w/5Yr S&S
7S09000AWW| S21R| Lenovo K8S AI LiCO Software 16GPU upgrade w/1Yr S&S
7S09000BWW| S21S| Lenovo K8S AI LiCO Software 16GPU upgrade w/3Yr S&S
7S09000CWW| S21T| Lenovo K8S AI LiCO Software 16GPU upgrade w/5Yr S&S
7S09000DWW| S21U| Lenovo K8S AI LiCO Software 64GPU upgrade w/1Yr S&S
7S09000EWW| S21V| Lenovo K8S AI LiCO Software 64GPU upgrade w/3Yr S&S
7S09000FWW| S21W| Lenovo K8S AI LiCO Software 64GPU upgrade w/5Yr S&S
UFM Telemetry
7S09000XWW| S921| NVIDIA UFM Telemetry 1-year License and 24/7 Support for Lenovo clusters
7S09000YWW| S922| NVIDIA UFM Telemetry 3-year License and 24/7 Support for Lenovo clusters
7S09000ZWW| S923| NVIDIA UFM Telemetry 5-year License and 24/7 Support for Lenovo clusters
UFM Enterprise
7S090011WW| S91Y| NVIDIA UFM Enterprise 1-year License and 24/7 Support for Lenovo clusters
7S090012WW| S91Z| NVIDIA UFM Enterprise 3-year License and 24/7 Support for Lenovo clusters
7S090013WW| S920| NVIDIA UFM Enterprise 5-year License and 24/7 Support for Lenovo clusters

Programming environment

The following programming software is available with Lenovo HPC & AI Software Stack:

  • NVIDIA CUDA
    NVIDIA CUDA is a parallel computing platform and programming model for general computing on graphics processing units (GPUs). With CUDA, developers are able to dramatically speed up computing applications by harnessing the power of GPUs. When using CUDA, developers program in popular languages such as C, C++, Fortran, Python, and MATLAB and express parallelism through extensions in the form of a few basic keywords. For more information, see the NVIDIA CUDA Zone.

  • NVIDIA HPC Software Development Kit
    The NVIDIA HPC SDK C, C++, and Fortran compilers support GPU acceleration of HPC modeling and simulation applications with standard C++ and Fortran, OpenACC directives, and CUDA. GPU-accelerated math libraries maximize performance on common HPC algorithms, and optimized communications libraries enable standards-based multi-GPU and scalable systems programming. Performance profiling and debugging tools simplify porting and optimization of HPC applications, and containerization tools enable easy deployment on-premises or in the cloud. For more information, see the NVIDIA HPC SDK.

The following table lists the relevant ordering part numbers.

Table 2. NVIDIA CUDA and NVIDIA HPC SDK part numbers

Part number| Description

NVIDIA CUDA
7S09001EWW| CUDA Support and Maintenance (up to 200 GPUs), 1 Year
7S09001FWW| CUDA Support and Maintenance (up to 500 GPUs), 1 Year
NVIDIA HPC SDK
7S090014WW| NVIDIA HPC Compiler Support Services, 1 Year
7S090015WW| NVIDIA HPC Compiler Support Services, 3 Years
7S090016WW| NVIDIA HPC Compiler Support Services, EDU, 1 Year
7S090017WW| NVIDIA HPC Compiler Support Services, EDU, 3 Years
7S09001CWW| NVIDIA HPC Compiler Support Services – Additional Contact, 1 Year
7S09001DWW| NVIDIA HPC Compiler Support Services – Additional Contact, EDU, 1 Year
7S09001AWW| NVIDIA HPC Compiler Premier Support Services, 1 Year
7S09001BWW| NVIDIA HPC Compiler Premier Support Services, EDU, 1 Year
7S090018WW| NVIDIA HPC Compiler Premier Support Services – Additional Contact, 1 Year
7S090019WW| NVIDIA HPC Compiler Premier Support Services – Additional Contact, EDU, 1 Year

Support components

The following software support is available with Lenovo HPC & AI Software Stack:

  • SchedMD Slurm Support for Lenovo HPC Systems
    Slurm is part of the Lenovo HPC & AI Software Stack, integrated as an open-source, flexible, and modern choice for managing complex workloads, enabling faster processing and optimal utilization of the large-scale, specialized high-performance and AI resources that Lenovo systems provide for each workload.
    SchedMD Slurm Support service capabilities for Lenovo HPC systems include:

    • Level 3 Support: High-performance systems must run at high utilization and performance to meet end users’ and management’s return-on-investment expectations. Customers covered by a support contract can reach out to SchedMD expert engineers to promptly resolve complex workload management issues and quickly receive answers to complex configuration questions, instead of taking weeks or even months to try to resolve them in-house.
    • Remote Consulting: Valuable assistance and implementation expertise that speeds custom configuration tuning to increase throughput and utilization efficiency on complex and large-scale systems. Customers can review cluster requirements, operating environment, and organizational goals directly with a Slurm engineer to optimize the configuration and meet organizational needs.
    • Tailored Slurm Training: Tailored Slurm expert training that empowers users to harness Slurm capabilities to speed projects and increase technology adoption. A customer scoping call before the onsite instruction ensures coverage of specific use cases that address the organization’s needs. In-depth and comprehensive technical training is delivered in a hands-on lab workshop format to help users apply Slurm best practices to their site-specific use cases and configuration.
  • EAS Service and Support for EAR
    The Energy Aware Runtime is open source under the BSD-3 license and EPL-1.0. For professional use cases in production environments, installation and support services are available. Commercial support and implementation services for EAR can be purchased from Lenovo under the HPC & AI Software Stack CTO and are delivered through Energy Aware Solutions (EAS). There are three distributions of EAR: Detective Pro, Optimizer, and Optimizer Pro. Detective Pro provides the basic monitoring and accounting capabilities, Optimizer adds energy optimization, and Optimizer Pro adds power capping features.

The following table lists the relevant ordering part numbers (some of the part numbers are not yet released at the time of writing this product guide).

Table 3. SchedMD Slurm Support and EAR Support part numbers

Part number| Description

SchedMD Slurm Support for Lenovo HPC Systems
7S09001MWW| SchedMD Slurm Onsite or Remote 3-day Training*
7S09001NWW| SchedMD Slurm Consulting w/Sr.Engineer 2REMOTE Sessions
7S09001PWW| SchedMD L3 Slurm support up to 100 Sockets/GPUs 3Y
7S09001QWW| SchedMD L3 Slurm support up to 100 Sockets/GPUs 5Y
7S09001RWW| SchedMD L3 Slurm support up to 100 Sockets/GPUs additional 1Y
7S09001SWW| SchedMD L3 Slurm support 101-1000 Sockets/GPUs 3Y
7S09001TWW| SchedMD L3 Slurm support 101-1000 Sockets/GPUs 5Y
7S09001UWW| SchedMD L3 Slurm support 101-1000 Sockets/GPUs additional 1Y
7S09001VWW| SchedMD L3 Slurm support 1001-5000+ Sockets/GPUs 3Y
7S09001WWW| SchedMD L3 Slurm support 1001-5000+ Sockets/GPUs 5Y
7S09001XWW| SchedMD L3 Slurm support 1001-5000+ Sockets/GPUs additional 1Y
7S09001YWW| SchedMD L3 Slurm support up to 100 Sockets/GPUs 3Y EDU&GOV
7S09001ZWW| SchedMD L3 Slurm support up to 100 Sockets/GPUs 5Y EDU&GOV
7S090022WW| SchedMD L3 Slurm support up to 100 Sockets/GPUs additional 1Y EDU&GOV
7S090023WW| SchedMD L3 Slurm support 101-1000 Sockets/GPUs 3Y EDU&GOV
7S090024WW| SchedMD L3 Slurm support 101-1000 Sockets/GPUs 5Y EDU&GOV
7S090026WW| SchedMD L3 Slurm support 101-1000 Sockets/GPUs additional 1Y EDU&GOV
7S090027WW| SchedMD L3 Slurm support 1001-5000+ Sockets/GPUs 3Y EDU&GOV
7S090028WW| SchedMD L3 Slurm support 1001-5000+ Sockets/GPUs 5Y EDU&GOV
7S09002AWW| SchedMD L3 Slurm support 1001-5000+ Sockets/GPUs additional 1Y EDU&GOV
EAS Service and Support for EAR**
7S09001KWW| EAR Energy Detective Pro Worldwide Remote Installation and Training for AMD or Intel CPUs
7S09001LWW| EAR Energy Detective Pro 1-year Worldwide Remote support for AMD or Intel CPUs (flat fee)
7S09001JWW| EAR Energy Optimizer Pro 1-year Support Entitlement for Energy Monitoring, Optimization and Power Capping per system power rating
7S09001GWW| EAR Energy Optimizer Pro Worldwide Remote Installation and Training for AMD or Intel CPUs
7S09001HWW| EAR Energy Optimizer Pro Worldwide Remote Installation and Training for AMD or Intel CPUs + NVIDIA GPUs

*SchedMD Slurm Onsite or Remote 3-day Training: in-depth and comprehensive site-specific technical training. Can only be added to a support purchase.
** SchedMD Slurm Consulting w/Sr.Engineer 2REMOTE Sessions (Up to 8 hrs): review initial Slurm setup, in-depth technical chats around specific Slurm topics & review site config for optimization & best practices. Required with support purchase, cannot be purchased separately.

Note: SchedMD Slurm Consulting w/Sr.Engineer 2REMOTE Sessions option must be selected and locked in for every SchedMD support selection.

SchedMD Slurm Onsite or Remote 3-day Training option must be selected and locked in for every SchedMD Commercial support selection. Optional for EDU & Government support selections.
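
The selection rules in the notes above can be sketched as a small check. This is only one reading of those rules, not Lenovo's configurator logic: the function name `check_slurm_order` and the grouping of part numbers into commercial and EDU&GOV sets are assumptions of this sketch, while the part numbers themselves are copied from Table 3.

```python
# Part numbers copied from Table 3; the grouping into sets is this sketch's own.
CONSULTING = "7S09001NWW"  # SchedMD Slurm Consulting w/Sr.Engineer 2REMOTE Sessions
TRAINING = "7S09001MWW"    # SchedMD Slurm Onsite or Remote 3-day Training
COMMERCIAL_SUPPORT = {
    "7S09001PWW", "7S09001QWW", "7S09001RWW",
    "7S09001SWW", "7S09001TWW", "7S09001UWW",
    "7S09001VWW", "7S09001WWW", "7S09001XWW",
}
EDU_GOV_SUPPORT = {
    "7S09001YWW", "7S09001ZWW", "7S090022WW",
    "7S090023WW", "7S090024WW", "7S090026WW",
    "7S090027WW", "7S090028WW", "7S09002AWW",
}


def check_slurm_order(parts):
    """Return a list of rule violations for a set of selected part numbers."""
    parts = set(parts)
    has_commercial = bool(parts & COMMERCIAL_SUPPORT)
    has_support = has_commercial or bool(parts & EDU_GOV_SUPPORT)
    problems = []
    # Consulting must accompany every SchedMD support selection.
    if has_support and CONSULTING not in parts:
        problems.append("consulting sessions are required with every support selection")
    # Training is mandatory for commercial support, optional for EDU&GOV.
    if has_commercial and TRAINING not in parts:
        problems.append("training is required with commercial support")
    # Neither add-on can be ordered without a support selection.
    if not has_support and parts & {CONSULTING, TRAINING}:
        problems.append("consulting and training require a support selection")
    return problems
```

For example, `check_slurm_order({"7S09001YWW", CONSULTING})` returns an empty list, because training is optional for EDU&GOV support, while the same call with a commercial support part number reports the missing training option.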

Related product families

Product families related to this document are the following:

  • Artificial Intelligence
  • High Performance Computing

Notices

Lenovo may not offer the products, services, or features discussed in this document in all countries. Consult your local Lenovo representative for information on the products and services currently available in your area. Any reference to a Lenovo product, program, or service is not intended to state or imply that only that Lenovo product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any Lenovo intellectual property right may be used instead. However, it is the user’s responsibility to evaluate and verify the operation of any other product, program, or service. Lenovo may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to:

Lenovo (United States), Inc.
8001 Development Drive
Morrisville, NC 27560
U.S.A.
Attention: Lenovo Director of Licensing

LENOVO PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some jurisdictions do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you.

This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. Lenovo may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice.

The products described in this document are not intended for use in implantation or other life support applications where malfunction may result in injury or death to persons. The information contained in this document does not affect or change Lenovo product specifications or warranties. Nothing in this document shall operate as an express or implied license or indemnity under the intellectual property rights of Lenovo or third parties. All information contained in this document was obtained in specific environments and is presented as an illustration. The result obtained in other operating environments may vary. Lenovo may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you.

Any references in this publication to non-Lenovo Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this Lenovo product, and use of those Web sites is at your own risk. Any performance data contained herein was determined in a controlled environment. Therefore, the result obtained in other operating environments may vary significantly. Some measurements may have been made on development level systems and there is no guarantee that these measurements will be the same on generally available systems. Furthermore, some measurements may have been estimated through extrapolation. Actual results may vary. Users of this document should verify the applicable data for their specific environment.
© Copyright Lenovo 2022. All rights reserved.

This document, LP1651, was created or updated on November 10, 2022.

This document is available online at https://lenovopress.lenovo.com/LP1651.

Trademarks

Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo in the United States, other countries, or both. A current list of Lenovo trademarks is available on the Web at https://www.lenovo.com/us/en/legal/copytrade/.

The following terms are trademarks of Lenovo in the United States, other countries, or both: Lenovo®

The following terms are trademarks of other companies:
Intel® is a trademark of Intel Corporation or its subsidiaries.
Linux® is the trademark of Linus Torvalds in the U.S. and other countries.
Other company, product, or service names may be trademarks or service marks of others.
