Graphics Processing Units (GPUs) on Lenovo® ThinkSystem™ servers are typically
used to offload tasks from the server CPU, such as AI, VDI, and rendering
tasks. Customers who use a Linux virtual environment on their ThinkSystem
server may want to assign the GPU to a virtual machine (VM) so that the GPU
appears as if it were physically attached to the guest OS running in the VM.
This functionality is called GPU passthrough. This paper provides guidance on
enabling GPU passthrough to a VM running on a Kernel-based Virtual Machine
(KVM) host. The paper is intended for Linux administrators who want to pass a
GPU installed in a ThinkSystem server through to a VM. At Lenovo Press, we bring
together experts to produce technical publications around topics of importance
to you, providing information and best practices for using Lenovo products and
solutions to solve IT challenges.
See a list of our most recent publications at the Lenovo Press website:
http://lenovopress.com.
Do you have the latest version?
We update our papers from time to time, so check whether you have the latest
version of this document by clicking the Check for Updates button on the front
page of the PDF. Pressing this button will take you to a web page that will
tell you if you are reading the latest version of the document and give you a
link to the latest if needed. While you’re there, you can also sign up to get
notified via email whenever we make an update.
Introduction
Many virtual machine administrators want to make a GPU installed in a server
available to a single virtual machine. The method known as PCI device passthrough
allows the GPU PCIe device to be removed from the host and instead assigned to
a single guest VM for exclusive access.
The paper describes the steps needed to implement GPU passthrough:
“Enabling IOMMU in UEFI”
“Enabling IOMMU host kernel support”
“Unbinding the GPU device from host physical machine driver”
“Getting the GPU IOMMU configuration”
“Attaching a GPU device with virsh”
“Installing and enabling the NVIDIA driver in the guest OS”
Enabling IOMMU in UEFI
I/O Memory Management Unit (IOMMU) is the common name for Intel VT-d and AMD-
Vi technologies. PCI device passthrough is only available on hardware
platforms supporting either Intel VT-d or AMD-Vi. The Intel VT-d and AMD-Vi
specifications provide hardware support for directly assigning a physical
device to a VM. The first step is to enable IOMMU in the ThinkSystem UEFI. The
steps required for Intel and AMD processor-based ThinkSystem servers are
listed in the following subsections.
IOMMU settings on the Intel system
VT-d stands for Intel Virtualization Technology for Directed I/O and should
not be confused with VT-x, Intel Virtualization Technology. VT-x allows one
hardware platform to function as multiple “virtual” platforms, whereas VT-d
improves the security and reliability of the system and the performance of
I/O devices in virtualized environments.
The steps to activate the Intel IOMMU on a server with an Intel processor are
as follows:
Boot the server and when prompted, press F1 to enter System Setup.
From the UEFI menu, select System Settings → Devices and I/O ports, select Intel VT for Directed I/O (VT-d) and press Enter to enable the Intel IOMMU as shown in Figure 1.
Save and exit the BIOS setup menu, and then enter the Linux OS.
Boot up the OS and ensure the IOMMU is enabled with the following command:
# dmesg|grep DMAR
DMAR: IOMMU enabled
If you see DMAR: IOMMU enabled, it means that VT-d has been enabled by reporting the I/O device assignment to VMM through the DMAR (DMA Remapping) ACPI table.
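As an additional optional check that is not shown in the original figures, you can confirm that the firmware exposes the DMA Remapping (DMAR) ACPI table, which UEFI publishes when VT-d is enabled:
# ls /sys/firmware/acpi/tables/ | grep DMAR
DMAR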
IOMMU settings on the AMD system
AMD IOMMU (AMD-Vi) support is required to use PCI device assignment in the
Linux OS, and it must be enabled in UEFI.
The steps to activate the AMD IOMMU on a server with an AMD processor are as
follows:
Boot the server and when prompted, press F1 to enter System Setup.
From the UEFI menu, select System Settings → Devices and I/O ports, highlight IOMMU and press Enter to enable the AMD IOMMU as shown in Figure 2.
Save and exit the BIOS setup menu, and then enter the Linux OS.
Boot up the OS and ensure the IOMMU is enabled by entering the following command:
# dmesg|grep AMD-Vi
AMD-Vi: Interrupt remapping enabled
If you see AMD-Vi: Interrupt remapping enabled, it means the system has enabled AMD IOMMU.
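Similarly, an optional extra check that is not in the original figures is to confirm that the firmware exposes the I/O Virtualization Reporting Structure (IVRS) ACPI table, which describes the AMD IOMMU:
# ls /sys/firmware/acpi/tables/ | grep IVRS
IVRS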
Enabling IOMMU host kernel support
Currently, up to two GPUs may be attached to the virtual machine, not
including the standard emulated VGA interfaces. The emulated VGA is used for
pre-boot and installation only; the NVIDIA GPU takes over once the NVIDIA
graphics drivers are loaded.
To assign a GPU to a guest virtual machine, you must enable the IOMMU on the
host machine, as described in the following procedure:
Edit the host kernel boot command line. For an Intel VT-d system, IOMMU is activated by adding the following parameters to the kernel command line:
intel_iommu=on
iommu=pt
For an AMD-Vi system, the parameters needed are
amd_iommu=on
iommu=pt
To enable this option, edit or add the GRUB_CMDLINE_LINUX line in the
/etc/default/grub configuration file, as shown in Figure 3 (Intel example).
# cat /etc/default/grub
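The contents of Figure 3 are not reproduced in this text; as an illustration only, on a RHEL 8 host the edited file might look similar to the following (all lines other than the added intel_iommu=on and iommu=pt parameters are typical distribution defaults and will vary; for an AMD system, use amd_iommu=on instead):
GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet intel_iommu=on iommu=pt"
GRUB_ENABLE_BLSCFG=true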
Regenerate the grub2 config file
For the changes to the kernel command line to be applied, regenerate the boot
loader configuration using the following command:
# grub2-mkconfig
You can verify the changes are effective with the following command:
# grubby --info=0
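Note that grub2-mkconfig writes to standard output unless an output file is specified with -o. On RHEL-family hosts the configuration is typically regenerated as follows; the paths below are common defaults and should be adjusted to your boot mode and distribution:
For a host booting in legacy BIOS mode:
# grub2-mkconfig -o /boot/grub2/grub.cfg
For a host booting in UEFI mode:
# grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg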
Reboot the host OS
For the changes to take effect in the kernel, reboot the host machine and then
run the following command:
# dmesg|grep iommu
Look for lines similar to the following in the output:
Adding to iommu group 0
iommu: Default domain type: Passthrough (set via kernel command line)
Unbinding the GPU device from host physical machine driver
For GPU passthrough, it is recommended to unbind the GPU device from the host
driver, as these drivers often do not support dynamic unbinding of the device.
When using the Virtual Machine Manager interface to attach a GPU device, these
steps also need to be performed if the GPU driver does not support dynamic
unbinding.
Steps to unbind the GPU device from the host driver are as follows:
Identify the GPU PCI bus address
To identify the GPU PCI bus address and IDs of the device, run the command as
listed in Figure 4. In our lab configuration, our server has the NVIDIA Tesla
V100 GPU installed.
# lspci -Dnn|grep -i NVIDIA
0000:5b:00.0 3D controller [0302]: NVIDIA Corporation GV100GL [Tesla V100 PCIe
16GB] [10de:1db4] (rev a1)
The command reveals that the PCI bus address of this device is 0000:5b:00.0
and the PCI ID for the device is 10de:1db4. The PCI bus address and device ID
will be used in the following steps.
Prevent the native host machine driver from using the GPU device
To prevent the native host machine driver from using the GPU device, you can
use the PCI ID with the pci-stub driver. To do this, append the following
option to the GRUB_CMDLINE_LINUX line in the /etc/default/grub configuration
file:
pci-stub.ids=10de:1db4
where 10de:1db4 is the PCI ID of our GPU, as shown in Figure 5. To add
additional PCI IDs for pci-stub, separate them with commas.
# cat /etc/default/grub
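Figure 5 is likewise not reproduced here; as an illustration, after this change the GRUB_CMDLINE_LINUX line would look similar to the following (the IOMMU parameters were added in the previous section, and the remaining parameters are assumed distribution defaults):
GRUB_CMDLINE_LINUX="crashkernel=auto rhgb quiet intel_iommu=on iommu=pt pci-stub.ids=10de:1db4"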
Regenerate the grub2 config file
For the changes to the kernel command line to be applied, regenerate the boot
loader configuration using the following command:
# grub2-mkconfig
You can verify the changes are effective with the following command:
# grubby --info=0
Reboot the host OS for the changes to take effect, using the following command:
# init 6
After the OS boots, run the command in Figure 6 to check whether the GPU device
is using the vfio-pci driver instead of the standard inbox (nouveau) driver.
# lspci -vvvnnn -s 0000:5b:00.0|grep -i "kernel driver in use"
Kernel driver in use: vfio-pci
Getting the GPU IOMMU configuration
Before attaching the GPU device, its IOMMU configuration needs to be reviewed
(and adjusted if necessary) for the GPU to work properly on the guest. The
steps are as follows.
List all PCI devices in the host machine
Use the following command to list all devices of a particular type that are
attached to the host machine. The output of the command is shown in Figure 7.
Review the output for the string that maps to the GPU device you wish to
enable for passthrough.
# virsh nodedev-list --cap pci
pci_0000_00_00_0
pci_0000_00_04_0
pci_0000_00_04_1
pci_0000_00_04_2
pci_0000_00_04_3
pci_0000_00_04_4
pci_0000_00_04_5
pci_0000_00_04_6
pci_0000_00_04_7
pci_0000_00_05_0
pci_0000_5b_00_0
pci_0000_ad_02_0
pci_0000_ad_05_0
pci_0000_ad_05_2
pci_0000_ad_05_4
This example shows partial output. The string that maps to the GPU at PCI bus
address 0000:5b:00.0 is pci_0000_5b_00_0 (bolded in Figure 7). Note that the ‘:’
and ‘.’ characters are replaced with underscores in the libvirt-compatible
identifier. Record the device identifier that maps to the GPU device you want
to pass through to the VM; it is required in the next steps.
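As an optional shortcut that is not part of the original procedure, you can filter the list by the bus number of your GPU rather than scanning the full output; for the example address 0000:5b:00.0 the filter would be:
# virsh nodedev-list --cap pci | grep 5b_00
pci_0000_5b_00_0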
Display the XML information of the GPU
To display the settings of the GPU in XML form, use its libvirt-compatible
PCI device identifier. In this example, the GPU PCI device identifier is
pci_0000_5b_00_0. Use this identifier with the virsh nodedev-dumpxml command
to display the XML configuration of the GPU, as shown in Figure 8.
# virsh nodedev-dumpxml pci_0000_5b_00_0
<device>
  <name>pci_0000_5b_00_0</name>
  <path>/sys/devices/pci0000:5a/0000:5a:00.0/0000:5b:00.0</path>
  <parent>pci_0000_5a_00_0</parent>
  <driver>
    <name>vfio-pci</name>
  </driver>
  <capability type='pci'>
    <domain>0</domain>
    <bus>91</bus>
    <slot>0</slot>
    <function>0</function>
    <product id='0x1db4'>GV100GL [Tesla V100 PCIe 16GB]</product>
    <vendor id='0x10de'>NVIDIA Corporation</vendor>
    <iommuGroup number='...'>
      <address domain='0x0000' bus='0x5b' slot='0x00' function='0x0'/>
    </iommuGroup>
  </capability>
</device>
Note the <iommuGroup> element in the XML configuration (bolded in Figure 8).
The iommuGroup indicates a set of devices that are considered isolated from
other devices due to IOMMU capabilities and PCI bus topologies. All of the
endpoint devices within the IOMMU group (meaning devices that are not PCIe
root ports, bridges, or switch ports) need to be unbound from the native host
drivers in order to be assigned to a guest OS. In the example above, the group
is composed of the GPU device (0000:5b:00.0); some GPU cards also have a
companion audio device, such as 0000:5b:00.1.
Adjust IOMMU settings (optional)
Note each IOMMU group may contain one or more devices. When multiple devices
are present, all endpoints within the IOMMU group must be claimed for any
device within the group to be assigned to a guest. This can be accomplished
either by also assigning the extra endpoints to the guest or by detaching them
from the host driver using the virsh nodedev-detach command.
Devices within an IOMMU group can be determined using the iommuGroup section
of the virsh nodedev-dumpxml output. Each member of the group is provided in a
separate address field. This information may also be found in sysfs using the
command listed in Figure 9.
# ls /sys/bus/pci/devices/0000\:5b\:00.0/iommu_group/devices/
0000:5b:00.0
0000:5b:00.1
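If you want an overview of every IOMMU group on the host, not just the GPU's group, the following small shell loop (an optional helper, not part of the original procedure) prints each group together with its member devices:
for g in /sys/kernel/iommu_groups/*; do
    echo "IOMMU group ${g##*/}:"
    for d in "$g"/devices/*; do
        # Show the PCI address, class, and vendor:device IDs of each member
        lspci -nns "${d##*/}"
    done
done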
If a GPU card has a companion audio device (0000:5b:00.1) and you want to
assign only 0000:5b:00.0 to the guest, the unused endpoint device
(0000:5b:00.1) should be detached from the host before starting the guest. The
following two steps need to be performed:
Detect the PCI ID for the device and append it to the pci-stub.ids option in the /etc/default/grub file, as described in “Unbinding the GPU device from host physical machine driver” on page 6.
Use the virsh nodedev-detach command with a libvirt-compatible address as a parameter, for example, # virsh nodedev-detach pci_0000_5b_00_1.
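For example, detaching the companion audio device, and reattaching it to the host later once the guest has been shut down, would typically look like the following (the confirmation messages are an assumption based on standard virsh behavior):
# virsh nodedev-detach pci_0000_5b_00_1
Device pci_0000_5b_00_1 detached
# virsh nodedev-reattach pci_0000_5b_00_1
Device pci_0000_5b_00_1 re-attached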
Attaching a GPU device with virsh
The GPU can be attached to the guest using either of the following methods:
Using the Virtual Machine Manager interface. Note that if device assignment fails, there may be other endpoints in the same IOMMU group that are still attached to the host. There is no way to retrieve group information using virt-manager, but virsh commands can be used to analyze the bounds of the IOMMU group.
Creating XML configuration for the GPU and attaching it with the virsh attach-device command
The steps for the latter method, using virsh attach-device, are as
follows:
From the output of the virsh nodedev-dumpxml command in “Display the XML information of the GPU”, obtain the device values required for the configuration file. In our example, the device has the following values:
domain = 0x0000
bus = 0x5b
slot = 0x00
function = 0x0
The XML configuration uses these values.
Create an XML file for the GPU device. In this example, a file named GPU.xml is created; its content is shown in Figure 10 and sketched below.
# cat GPU.xml
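The content of Figure 10 is not reproduced in this text. Based on the device values above, a libvirt hostdev definition for this GPU would take the following standard form (a sketch, not a copy of the original figure; managed='yes' lets libvirt bind the device to vfio-pci automatically):
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x5b' slot='0x00' function='0x0'/>
  </source>
</hostdev>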
Run the virsh attach-device command, specifying the domain name you wish to
assign the GPU to and the name of the XML file you created above. In the
example in Figure 11, the domain name is rhel8.2 and the XML file name is
GPU.xml:
# virsh attach-device rhel8.2 GPU.xml
The domain must be running before issuing the virsh attach-device command. Use
the following commands to check the domain status or to start or shut down the
domain:
virsh list
virsh start
virsh shutdown
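For example, with the rhel8.2 domain used in this paper, checking and starting the guest might look like the following (the exact output format depends on your libvirt version):
# virsh list --all
 Id   Name      State
--------------------------
 -    rhel8.2   shut off
# virsh start rhel8.2
Domain rhel8.2 started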
The PCI device should now be successfully assigned to the virtual machine, and
accessible to the guest operating system.
Log in to the guest OS and run the command in Figure 12 to check the GPU device in the guest OS. The GPU’s PCI bus address on the guest will be different from the one on the host. In this example, the bus address is 07:00.0.
# lspci |grep -i nvidia
07:00.0 3D controller: NVIDIA Corporation GV100GL [Tesla V100 PCIe 16GB] (rev
a1)
Note that running virsh attach-device without additional options assigns the
GPU device to the VM only temporarily; after a reboot, the GPU is no longer
attached. Appending the --persistent parameter attaches the device to the
guest OS persistently. For example:
# virsh attach-device rhel8.2 GPU.xml --persistent
Alternatively, to persistently attach a GPU device to a guest OS, follow these
steps:
Run the virsh edit command to edit the domain XML configuration file, specifying the domain name you wish to assign the GPU to:
# virsh edit rhel8.2
Add the appropriate device XML entry in the <devices> section to assign the PCI device to the guest manually, as sketched below.
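As an illustration, assuming the same PCI address as in the earlier example, the domain XML after the edit would contain a hostdev entry like the following inside the <devices> element:
<devices>
  ...
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x5b' slot='0x00' function='0x0'/>
    </source>
  </hostdev>
</devices>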
Installing and enabling the NVIDIA driver in the guest OS
This section describes how to enable an NVIDIA GPU from the Linux console. For
GPU cards from other manufacturers, the steps may be slightly different. When
using an assigned NVIDIA GPU in the guest OS, only the NVIDIA drivers are
supported. Other drivers may not work well.
To install the NVIDIA driver on a RHEL 7.x or RHEL 8.x guest OS, perform the
following steps:
Download the appropriate NVIDIA driver for your graphics controller from the NVIDIA website, http://www.nvidia.com.
Ensure that this driver is saved on the local disk of the target system. Installing from an external device, such as a flash drive, can cause known issues, such as installation failure.
Run the commands listed below to install the NVIDIA driver. The driver cannot be installed if the X server is running on the system, so ensure that the system is started in text mode (run level 3).
init 3
sh nvidia_filename.run
Edit GRUB 2 to blacklist the nouveau (inbox) driver. Edit /etc/default/grub and add the following parameters to the GRUB_CMDLINE_LINUX line: rd.driver.blacklist=nouveau nouveau.modeset=0. These kernel parameters blacklist the nouveau driver module so that it is not loaded from the initramfs at boot in the guest OS.
Rebuild the grub.cfg file by running the grub2-mkconfig command; see the example command sequence after these steps.
# grub2-mkconfig
Edit the /etc/modprobe.d/blacklist.conf file and add the following line to it, so that the blacklist requirement is added into initramfs at rebuild:
blacklist nouveau
Back up the current initramfs and build a new one; a typical command sequence is shown after these steps.
Restart the system.
The system should no longer load the nouveau module at boot.
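The grub.cfg rebuild uses the same grub2-mkconfig invocation shown for the host earlier. For the initramfs backup and rebuild step, a typical RHEL command sequence is shown below; this is an assumed sketch, not reproduced from the original figures:
# cp /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
# dracut -f /boot/initramfs-$(uname -r).img $(uname -r)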
Before the above steps, the nouveau driver is in use, as shown by the command
below:
# lspci -vvvnnn -s 07:00.0|grep -i "kernel driver in use"
Kernel driver in use: nouveau
After the above steps, you can confirm that the nvidia driver is in use with
the following command:
# lspci -vvvnnn -s 07:00.0|grep -i "kernel driver in use"
Kernel driver in use: nvidia
With this, the GPU is now available for exclusive use in the guest OS.
Xiaochun Li is a Linux Engineer in the Lenovo Infrastructure Solutions Group
based in Beijing, China. He specializes in development related to Linux kernel
storage and memory management, as well as kernel DRM. Before joining Lenovo,
he was an operating system engineer for INSPUR. With eight years of industry
experience, he now focuses on Linux kernel RAS, storage, security and
virtualization.
Thanks to the following people for their contributions to this project:
Yangyang Liang, Lenovo Test Engineer for Linux Enablement
Adrian Huang, Lenovo OS Engineer
Huaisheng Ye, Lenovo OS Engineer
Gary Cudak, Lenovo OS Architect
Paul Artman, Storage and I/O Architect
JieJie Cheng, Technical Writer
David Watts, Lenovo Press
Notices
Lenovo may not offer the products, services, or features discussed in this
document in all countries. Consult your local Lenovo representative for
information on the products and services currently available in your area. Any
reference to a Lenovo product, program, or service is not intended to state or
imply that only the Lenovo product, program, or service may be used. Any
functionally equivalent product, program, or service that does not infringe
any Lenovo intellectual property right may be used instead. However, it is the
user’s responsibility to evaluate and verify the operation of any other
product, program, or service.
Lenovo may have patents or pending patent applications covering the subject
matter described in this document. The furnishing of this document does not
give you any license to these patents. You can send license inquiries, in
writing, to:
Lenovo (United States), Inc.
1009 Think Place – Building One Morrisville, NC 27560
U.S.A. Attention: Lenovo Director of Licensing
LENOVO PROVIDES THIS PUBLICATION “AS IS” WITHOUT WARRANTY OF ANY KIND, EITHER
EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some
jurisdictions do not allow disclaimers of express or implied warranties in
certain transactions, therefore, this statement may not apply to you.
This information could include technical inaccuracies or typographical errors.
Changes are periodically made to the information herein; these changes will be
incorporated in new editions of the publication. Lenovo may make improvements
and/or changes in the product(s) and/or the program(s) described in this
publication at any time without notice.
The products described in this document are not intended for use in
implantation or other life support applications where malfunction may result
in injury or death to persons. The information contained in this document does
not affect or change Lenovo product specifications or warranties. Nothing in
this document shall operate as an express or implied license or indemnity
under the intellectual property rights of Lenovo or third parties. All
information contained in this document was obtained in specific environments
and is presented as an illustration. The result obtained in other operating
environments may vary.
Lenovo may use or distribute any of the information you supply in any way it
believes appropriate without incurring any obligation to you. Any references
in this publication to non-Lenovo Web sites are provided for convenience only
and do not in any manner serve as an endorsement of those Web sites. The
materials at those Web sites are not part of the materials for this Lenovo
product, and use of those Web sites is at your own risk.
Any performance data contained herein was determined in a controlled
environment. Therefore, the result obtained in other operating environments
may vary significantly. Some measurements may have been made on development-
level systems and there is no guarantee that these measurements will be the
same on generally available systems. Furthermore, some measurements may have
been estimated through extrapolation. Actual results may vary. Users of this
document should verify the applicable data for their specific environment.
This document was created or updated on May 25, 2021.
Send us your comments via the Rate & Provide Feedback form found at
http://lenovopress.com/lp1234
Trademarks
Lenovo and the Lenovo logo are trademarks or registered trademarks of Lenovo
in the United States, other countries, or both. These and other Lenovo
trademarked terms are marked on their first occurrence in this information
with the appropriate symbol (® or ™), indicating US registered or common law
trademarks owned by Lenovo at the time this information was published. Such
trademarks may also be registered or common law trademarks in other countries.
A current list of Lenovo trademarks is available from https://www.lenovo.com/us/en/legal/copytrade/.
The following terms are trademarks of Lenovo in the United States, other
countries, or both:
Lenovo®
Lenovo(logo)®
ThinkSystem™
The following terms are trademarks of other companies:
Intel, and the Intel logo are trademarks or registered trademarks of Intel
Corporation or its subsidiaries in the United States and other countries.
Linux is a trademark of Linus Torvalds in the United States, other countries,
or both. Other company, product, or service names may be trademarks or service
marks of others.