VMware vSphere VMDirectPath I/O: Requirements for Platforms and Devices


About VMDirectPath I/O

VMDirectPath I/O enables direct assignment of hardware PCI Functions to virtual machines. This gives the virtual machine access to the PCI Functions with minimal intervention from the ESXi host, potentially improving performance. It is suitable for performance-critical workloads such as graphics acceleration for virtual desktops (for example, VMware View vDGA) and high data-rate networking of the kind found in enterprise-class telecommunications equipment. It works particularly well with PCI devices supporting SR-IOV technology, as each virtual function in the device can be assigned to a separate virtual machine.

While VMDirectPath I/O can improve performance of a virtual machine, enabling it makes several important features of vSphere unavailable to the virtual machine, such as Suspend and Resume, Snapshots, Fault Tolerance, and vMotion.

IOMMU

The platform must have an IOMMU that performs DMA remapping for any PCI Function assigned for VMDirectPath I/O. DMA transactions sent by the passthrough PCI Function carry guest OS physical addresses, which the IOMMU must translate into host physical addresses.
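
As a rough illustration of the translation step described above, the following Python sketch models an IOMMU as a per-requester page table. The class name `ToyIommu`, the 4KB page size, and the example addresses are assumptions made for illustration only; a real IOMMU uses multi-level tables programmed by the host.

```python
PAGE_SIZE = 4096  # assumed translation granularity for this sketch


class ToyIommu:
    """Toy DMA remapping table keyed by PCI requester ID (hypothetical model)."""

    def __init__(self):
        # requester id -> {guest page number: host page number}
        self._tables = {}

    def map_page(self, requester_id, guest_pa, host_pa):
        """Record that one guest physical page maps to one host physical page."""
        if guest_pa % PAGE_SIZE or host_pa % PAGE_SIZE:
            raise ValueError("mappings must be page aligned")
        table = self._tables.setdefault(requester_id, {})
        table[guest_pa // PAGE_SIZE] = host_pa // PAGE_SIZE

    def translate(self, requester_id, guest_pa):
        """Translate a guest physical address carried by a DMA transaction."""
        table = self._tables.get(requester_id, {})
        page, offset = divmod(guest_pa, PAGE_SIZE)
        if page not in table:
            raise LookupError(f"no mapping for guest address {guest_pa:#x}")
        return table[page] * PAGE_SIZE + offset
```

A DMA to guest address 0x1234 from a function whose guest page 0x1000 is mapped to host page 0x7f000 would be redirected to host address 0x7f234; a DMA to an unmapped guest page is rejected.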

PCI Express Access Control Services (ACS)

DMA requests emitted by a passthrough PCI Function carry guest OS physical addresses rather than host OS physical addresses, so they can’t route correctly in the host’s PCI bus hierarchy in the absence of ACS. With ACS, these requests are unconditionally routed to the platform’s IOMMU for address translation.

 

ACS support is required in the PCIe root ports and the PCIe switch downstream ports which are located above the passthrough PCI Function in the PCI bus hierarchy. ACS is also required if the passthrough PCI Function is part of a multi-function device and supports peer-to-peer transfers.

Device Requirements and Recommendations

PCI Function BARs

PCI Function BARs must be aligned on 4KB boundaries (that is, each BAR’s base address must be a multiple of 4KB) in the host machine’s physical address space. If the PCI Function’s BAR size is greater than or equal to 4KB, this requirement is implicitly satisfied.

If the PCI Function’s BAR size is less than 4KB, platform firmware must align the start of the BAR as stated above, and must ensure no other system resources are mapped within the 4KB page at which the BAR is mapped.
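
The two BAR placement conditions above can be checked with a small sketch like the following. The function names and the representation of other resources as half-open `(start, end)` address ranges are assumptions for illustration, not part of any VMware tool.

```python
PAGE_SIZE = 4 * 1024  # 4KB, as required above


def bar_is_page_aligned(bar_base):
    """True if the BAR starts on a 4KB boundary."""
    return bar_base % PAGE_SIZE == 0


def bar_pages_are_exclusive(bar_base, bar_size, other_ranges):
    """True if no other resource falls in the 4KB page(s) the BAR occupies.

    other_ranges is an iterable of half-open (start, end) host address ranges.
    """
    first_page = bar_base - bar_base % PAGE_SIZE
    # Round the end of the BAR up to the next page boundary.
    last_page_end = -(-(bar_base + bar_size) // PAGE_SIZE) * PAGE_SIZE
    return all(end <= first_page or start >= last_page_end
               for start, end in other_ranges)
```

For a 256-byte BAR at 0xF0000000, a second resource at 0xF0000800 would violate the requirement because it shares the same 4KB page, while one at 0xF0001000 would not.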

 

PCI Function BAR size requirements by product version:

ESXi 5.1 and 5.5

  • The maximum supported size for a single PCI BAR is 1GB.
  • The combined size of all the BARs in the PCI Function must not exceed 3.75GB.
  • BAR space consumed by other PCI devices in the virtual machine reduces this further, because the combined size of all PCI BARs in the virtual machine must be 3.75GB or less. For virtual machines on ESXi 5.1 and 5.5 that use legacy BIOS firmware, which maps BARs below the 4GB address boundary, these limits arise from BAR alignment requirements and memory reservations made by the BIOS.
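
A minimal sketch of the ESXi 5.1/5.5 size limits listed above, assuming the BAR sizes of everything in the virtual machine are already known in bytes. The function name and the message format are hypothetical.

```python
GB = 1024 ** 3
MAX_SINGLE_BAR = 1 * GB          # largest supported single BAR
MAX_TOTAL_BAR = 15 * GB // 4     # 3.75GB combined limit


def check_legacy_bar_limits(bar_sizes):
    """Return a list of human-readable problems; an empty list means OK.

    bar_sizes: sizes in bytes of all PCI BARs that count toward the limit.
    """
    problems = []
    for index, size in enumerate(bar_sizes):
        if size > MAX_SINGLE_BAR:
            problems.append(f"BAR {index} is {size / GB:.2f}GB, above the 1GB limit")
    total = sum(bar_sizes)
    if total > MAX_TOTAL_BAR:
        problems.append(f"combined BAR size is {total / GB:.2f}GB, above 3.75GB")
    return problems
```

Four 1GB BARs each pass the per-BAR check but fail the combined check, since together they need 4GB.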

ESXi 6.0

  • If the virtual machine uses Legacy BIOS mode, the limits for ESXi 6.0 are identical to those in ESXi 5.5.
  • To use more than 3.75GB of total BAR allocation within a virtual machine, set the virtual machine firmware to UEFI by adding this line to the virtual machine's vmx file:

    firmware="efi"

  • To enable 64-bit Memory Mapped I/O (MMIO) add this line to the virtual machine vmx file:

     

    pciPassthru.use64bitMMIO="TRUE"

  • Set the ESXi host's BIOS to allow PCI mapping above 4GB and below 16TB.
  • In UEFI BIOS mode, a virtual machine's total BAR allocation is limited to 32GB.

 

ESXi 6.5

  • If the virtual machine uses Legacy BIOS mode, the limits for ESXi 6.5 are identical to those in ESXi 6.0.
  • If the virtual machine uses UEFI BIOS mode, the ESXi 6.0 limitations apply with the exception that the 32GB limit has been removed and the new limit on total BAR allocation for a virtual machine is based on the hardware limits. This limit is generally 1TB or more.

    Ensure the virtual machine is using UEFI firmware by checking the .vmx file for this entry:

    firmware="efi"

  • To enable 64-bit Memory Mapped I/O (MMIO), add this line to the virtual machine vmx file:

    pciPassthru.use64bitMMIO="TRUE"

  • To use more than 32GB, specify the size of the MMIO region as a power-of-two number of GB in the virtual machine's vmx file:

    For example:

    pciPassthru.64bitMMIOSizeGB = "128"
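
Because the MMIO size must be a power-of-two number of GB, a helper along these lines can pick the value. This is only a sketch: the function names are hypothetical, and it simply rounds the total BAR size up to the next power of two. Whether extra headroom is needed beyond the sum of the BARs is not stated here and should be verified for the device in question.

```python
GB = 1024 ** 3


def mmio_size_gb(total_bar_bytes):
    """Smallest power-of-two number of GB that covers total_bar_bytes."""
    needed_gb = max(1, -(-total_bar_bytes // GB))  # ceiling division
    size = 1
    while size < needed_gb:
        size *= 2
    return size


def uefi_passthrough_vmx_lines(total_bar_bytes):
    """Build the vmx lines named above for a UEFI virtual machine on ESXi 6.5."""
    lines = ['firmware="efi"', 'pciPassthru.use64bitMMIO="TRUE"']
    size = mmio_size_gb(total_bar_bytes)
    if size > 32:
        # The explicit size is only described for allocations above 32GB.
        lines.append(f'pciPassthru.64bitMMIOSizeGB = "{size}"')
    return lines
```

For example, devices needing 96GB of BAR space in total would round up to a 128GB MMIO region.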

     

PCI Function resets

ESXi resets PCI Functions assigned for VMDirectPath I/O in order to provide isolation between the ESXi host and the virtual machine. This is done to ensure guest operating systems see a device with clean state during power-up or reboot.

ESXi supports the use of several reset types for PCI Functions:

  • Function Level Reset (FLR)
  • Secondary Bus Reset
  • Link Disable/Enable
  • Device power state transition (D0 > D3hot > D0; non-standard reset method)

The selected method is configurable via the passthru.map file in the ESXi host located at /etc/vmware/passthru.map.

The FLR and Device power state transition reset types have function-level granularity, meaning that the reset can be applied on a single PCI Function without affecting other PCI Functions in the device or other devices in the same bus.

Conversely, Secondary Bus Reset and Link Disable/Enable have bus-level granularity, meaning that the reset affects all PCI Functions on the same bus (for example, all PCI Functions in a multi-function PCIe device).
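
To make the relationship between reset method and granularity concrete, here is a hedged sketch that parses passthru.map-style entries. It assumes each non-comment line has four whitespace-separated fields (vendor ID, device ID, reset method, shareable flag) and that the reset keywords are `flr`, `d3d0`, `link`, `bridge`, and `default`. Both the layout and the keyword-to-reset-type mapping should be checked against the comments in the file on your own host.

```python
# Assumed mapping from passthru.map keywords to the reset types listed above.
RESET_GRANULARITY = {
    "flr": "function",    # Function Level Reset
    "d3d0": "function",   # D0 > D3hot > D0 power state transition
    "link": "bus",        # Link Disable/Enable
    "bridge": "bus",      # Secondary Bus Reset
    "default": None,      # choice left to ESXi
}


def parse_passthru_map(text):
    """Parse passthru.map-style text into a list of dictionaries."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {raw!r}")
        vendor, device, reset, shareable = (f.lower() for f in fields)
        if reset not in RESET_GRANULARITY:
            raise ValueError(f"unknown reset method {reset!r}")
        entries.append({
            "vendor": vendor,
            "device": device,
            "reset": reset,
            "shareable": shareable,
            "granularity": RESET_GRANULARITY[reset],
        })
    return entries
```

An entry that selects a bus-level method is a signal that every PCI Function on that bus will be reset together.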

 

Requirements and Recommendations:

  1. In order to guarantee isolation between the ESXi host and the virtual machine, the PCI Function must meet the following criteria after a PCI reset. These criteria are from the PCIe 3.0 specification, section 6.6.2:

    • The Function must not give the appearance of an initialized adapter within the host or its external interfaces (For example, the Function must be quiesced and not issue any transactions or interrupts).
    • The Function must not retain software readable state that potentially includes secret information associated with any preceding use of the function.
    • Normal configuration should cause the Function to be usable by its drivers.
  2. VMware strongly recommends that PCI Functions support Function-Level-Reset (FLR).
  3. For reset methods with bus-level granularity, VMDirectPath I/O is only supported if all PCI Functions on the same bus are collectively assigned to the same virtual machine.

    • ESXi 5.5 and 6.0 detect reset dependencies automatically and notify the user about them when a given PCI Function is assigned for VMDirectPath I/O to a virtual machine.
    • ESXi 5.1 does not detect reset dependencies automatically. Such dependencies must be explicitly configured by the user through the passthru.map file, by setting the shareable attribute of the PCI Function to false.
  4. A PCI Function directly under a PCI Host Bridge must support FLR or D3hot reset in order for it to be eligible for VMDirectPath I/O.
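
Requirement 3 can be sketched as a check over a planned assignment. The data layout is an assumption for illustration: each PCI Function is identified by a `(bus, device, function)` tuple and mapped to the name of the virtual machine it is assigned to, or `None` if it stays with the host.

```python
from collections import defaultdict


def bus_reset_conflicts(assignments):
    """Return bus numbers whose PCI Functions are not all owned by one VM.

    assignments: dict mapping (bus, device, function) -> VM name or None.
    Intended for functions that rely on a bus-level reset method.
    """
    owners_by_bus = defaultdict(set)
    for (bus, _device, _function), owner in assignments.items():
        owners_by_bus[bus].add(owner)
    # More than one distinct owner on a bus means a bus-level reset triggered
    # for one VM would also reset a function used by another VM or the host.
    return sorted(bus for bus, owners in owners_by_bus.items() if len(owners) > 1)
```

A dual-function device on bus 3 with both functions assigned to the same virtual machine produces no conflict; splitting the functions across two virtual machines, or leaving one with the host, flags bus 3.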

Multi-Function PCI Devices

It is strongly recommended that a multi-function PCI device not have functional-dependencies across its PCI Functions.

If a functional-dependency exists between PCI Functions of a multi-function device, all dependent PCI Functions in the same device must be collectively assigned for VMDirectPath I/O to the same virtual machine. Such dependencies must be explicitly configured by the user via the passthru.map file, by setting the shareable attribute of the dependent PCI Functions to false.

 

Peer-to-peer DMA Transactions

ESXi does not currently support peer-to-peer DMA transactions to/from a PCI passthrough device in a VM. That is, ESXi expects that all DMA transactions emitted by a passthrough PCI device access the VM's memory (e.g., RAM) and never access memory-mapped BARs of another PCI device in the VM (regardless of whether the other PCI device is virtual or another passthrough device). Similarly, DMA transactions emitted by a virtual PCI device in the VM must never access the memory-mapped BARs of a PCI passthrough device assigned to the VM. Failure to meet this requirement could result in termination of the VM when the peer-to-peer transaction occurs.
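
The rule above amounts to classifying each DMA target address. The sketch below assumes the VM's RAM and the memory-mapped BARs of its PCI devices are available as half-open `(start, end)` guest address ranges; the function name and return labels are hypothetical.

```python
def classify_dma_target(addr, length, ram_ranges, bar_ranges):
    """Classify a DMA access of `length` bytes starting at guest address `addr`.

    Returns "peer-to-peer" if it touches any device BAR, "ram" if it lies
    entirely within VM RAM, and "unmapped" otherwise.
    """
    end = addr + length
    if any(addr < bar_end and end > bar_start for bar_start, bar_end in bar_ranges):
        return "peer-to-peer"  # not supported; may terminate the VM
    if any(ram_start <= addr and end <= ram_end for ram_start, ram_end in ram_ranges):
        return "ram"
    return "unmapped"
```

Only accesses classified as "ram" match what ESXi expects from a passthrough device.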

 

PCI Functions behind legacy PCI Bridges

VMware strongly recommends that PCI Functions assigned for VMDirectPath I/O be placed behind PCI Express root ports or switch downstream ports.

VMware discourages VMDirectPath I/O assignment of PCI Functions behind conventional PCI bridges or PCIe-to-PCI/PCI-X bridges. PCI Functions behind PCIe to PCI/PCI-X bridges or PCI conventional bridges must be collectively assigned for VMDirectPath I/O to the same virtual machine.

These bridges take ownership of PCI transactions sent by PCI Functions behind them by placing the bridge’s PCI requester ID on the transactions. This forces the ESXi host to program IOMMU translations using the PCI Bridge’s requester ID, implying that all PCI Functions behind the bridges must be placed in the same IOMMU domain and therefore be collectively assigned to the same virtual machine.

PCI root complex integrated endpoints

PCI passthrough of root-complex integrated endpoints (i.e., PCI Functions directly under a PCI host bridge) is supported for regular (non SR-IOV) PCI Functions. It is not supported for SR-IOV Virtual Functions at this time.

 

PCI SR-IOV Devices

PCI passthrough of SR-IOV virtual functions is supported, as long as the PCI platform and device requirements stated in this document are met.

PCI passthrough of SR-IOV physical functions with the purpose of allowing the guest OS to enable VFs is not supported at this time. Specifically, ESX does not currently support virtualizing the PF's SR-IOV Capability to the guest OS.

PCI passthrough of SR-IOV physical functions for other purposes is allowed, as long as the PCI platform and device requirements stated in this document are met.

PCI Functions that DMA to host reserved memory

VMware strongly recommends that PCI Functions that are assigned for VMDirectPath I/O do not generate DMA reads or writes to ESXi host memory marked as reserved in the platform’s memory map (for example, BIOS E820 or UEFI GetMemoryMap()).

PCI Functions that generate DMA reads or writes to host reserved memory force the ESXi host to create an IOMMU identity-map for the reserved memory ranges (since such DMAs carry host physical addresses rather than guest physical addresses). This creates addressing constraints for the ESXi host and the virtual machine as described below.

If a PCI Function DMAs to the ESXi host reserved memory, the PCI Function and the reserved ranges it DMAs to must appear in the ACPI Reserved Memory Region Reporting structure (RMRR).

ESX’s support for VMDirectPath I/O of PCI Functions that DMA to host reserved memory varies across product versions:

ESX 5.1:

VMDirectPath I/O of PCI Functions that DMA to host reserved memory is not supported, unless a reset of the PCI Function causes it to stop DMAing to such memory.

 

ESX 5.5 (versions prior to ESX 5.5U3):

VMDirectPath I/O of PCI Functions that DMA to host reserved memory is supported, subject to the following:

  1. The RMRR regions associated with the PCI Function must be placed in one of the following ranges, in order of preference:

    1. Outside of the VM’s physical memory range (i.e., above the highest address configured for the VM’s RAM).
    2. Below the 4GB address boundary, in the VM’s PCI memory hole.
      The VM’s .vmx file must be configured to map the VM’s PCI memory hole such that it overlaps with the RMRR region (by setting configuration option “pciHole.start” and “pciHole.size”).

      The PCI memory hole must be in the address range 256MB -> 4GB.

    Explanation: the RMRR ranges must never overlap with the VM’s physical memory range (except as described above), as otherwise ESX won’t be able to program IOMMU address translations correctly.

  2. ESX resets PCI Functions assigned for VMDirectPath I/O. If the platform depends on the PCI Function DMAing to the RMRR regions for correct operation, it must ensure that the PCI Function continues to do this after the reset by using a platform-specific mechanism. If this is not possible, then the PCI Function should not be assigned for VMDirectPath I/O.

 

ESX 5.5 (starting with 5.5U3), ESX 6.0:

Starting with ESX 5.5 U3, ESX will automatically adjust the VM’s physical memory map to prevent overlaps between host RMRR regions and the VM’s RAM. However, the automatic adjustment only works if the following conditions are met.

  1. The RMRR regions associated with the PCI Function must be placed in one of the following ranges:

    1. Outside of the VM’s physical memory range (i.e., above the highest address configured for the VM’s RAM).
    2. Below the 4GB address boundary, in the range 256MB -> 4GB.
    3. In the VM’s BIOS reserved memory (i.e., 640K – 1MB).
  2. ESX resets PCI Functions assigned for VMDirectPath I/O. If the platform depends on the PCI Function DMAing to the RMRR regions for correct operation, it must ensure that the PCI Function continues to do this after the reset by using a platform-specific mechanism. If this is not possible, then the PCI Function should not be assigned for VMDirectPath I/O.
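
The placement conditions for ESX 5.5 U3 and 6.0 can be expressed as a simple range check. This sketch makes some simplifying assumptions: RMRR regions are half-open `[start, end)` host address ranges, `vm_ram_top` is the first address above the VM's configured RAM, and the function name and return labels are hypothetical.

```python
KB = 1024
MB = 1024 ** 2
GB = 1024 ** 3


def rmrr_placement(start, end, vm_ram_top):
    """Return the name of the allowed range holding [start, end), or None."""
    if start >= vm_ram_top:
        return "above VM RAM"
    if start >= 256 * MB and end <= 4 * GB:
        return "256MB-4GB window"
    if start >= 640 * KB and end <= 1 * MB:
        return "BIOS reserved (640K-1MB)"
    return None  # automatic adjustment is not expected to work
```

For a VM with 8GB of RAM, an RMRR region at 0xBF000000 falls in the 256MB-4GB window, while one at 16MB matches none of the allowed ranges.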

PCI Interrupts

VMDirectPath I/O supports PCI Functions that use MSI-X, MSI, or legacy (i.e., INTx) PCI interrupts. VMware recommends that PCI Functions used for VMDirectPath I/O support MSI or MSI-X interrupts.

Explanation: PCI Functions using INTx may share the interrupt line with other PCI Functions in the system. This can delay the servicing of the interrupt for each PCI Function sharing the interrupt. The delay is exacerbated if one or more of the PCI Functions sharing the INTx is assigned for VMDirectPath I/O.

PCI Errors

ESX does not support virtualization of PCI errors for VMDirectPath I/O (i.e., PCI Errors reported by a PCI Function assigned for VMDirectPath I/O are always handled by ESX and never presented to the guest OS).

Due to platform constraints on current generation servers, PCI uncorrectable errors typically cause an NMI to be delivered to the processors, from which ESX can’t currently recover. As a result, VMware requires that PCI Functions that may be assigned for VMDirectPath I/O be designed to avoid emitting PCI uncorrectable errors under normal operating conditions.

PCI Functions with Expansion ROM

ESX supports VMDirectPath I/O of PCI Functions with Expansion ROM (e.g., the Expansion ROM code may contain device-specific initialization and/or code that allows a VM to boot from the passthrough PCI Function).

The Expansion ROM is executed by the VM’s firmware whenever the VM is powered-on or restarted. A hot-reset of the device should place the device in a state where the Expansion ROM can be re-executed.

Execution of Expansion ROM code by the VM’s firmware is disabled by default but can be enabled as follows:

  1. Setting the ESX boot option ‘pcipSaveOPROM=TRUE’.
  2. Configuring the VM’s .vmx file with parameter “%s.opromEnabled=TRUE” (e.g., pciPassthru0.opromEnabled=TRUE).

    Future Implementation Note: Though currently not supported, future versions of ESX may support re-assigning a VMDirectPath I/O PCI Function back to the ESX VMkernel at run-time (i.e., without having to power-down the ESX host). For PCI Functions with an Expansion ROM, this operation won’t be supported since the PCI Function is reset during this re-assignment and the Expansion ROM won’t be executed by ESX (i.e., such code is only executed by the host machine’s BIOS/UEFI firmware during boot time).

PCI Power Management

ESX does not support virtualization of PCI Power Management Events (PME) for VMDirectPath I/O (i.e., PME messages generated by VMDirectPath I/O PCI Functions are currently ignored by ESX and do not cause a notification to the guest OS).

However, the VM’s BIOS does by default grant control of PCIe Native Power Management Events to guest OSes that request it via the ACPI _OSC method. This may cause the guest OS to enable PMEs in the passthrough PCI Function, which could lead to malfunction for PCI Functions that rely on the generation of PMEs for correct operation (e.g., a PCI NIC whose driver places it in a low power state during periods of low traffic and relies on the NIC generating a PME to return it to a normal power state).

This situation can be prevented by configuring the VM’s BIOS not to grant control of PCIe Native Power Management Events to the guest OS. This is done via the .vmx configuration option ‘acpi.osc.pcie’, which overrides the _OSC Control Field return value. Refer to the ACPI specification for the format of the _OSC Control Field return value.
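
As an illustration of what masking the PME bit means, the sketch below uses the PCIe _OSC Control Field bit positions as commonly documented (bit 2 for Native Power Management Events). These bit assignments are stated from memory and must be confirmed against the specification, and the exact value syntax expected by ‘acpi.osc.pcie’ is not given here, so the code only computes the mask.

```python
# Assumed PCIe _OSC Control Field bits; verify against the specification.
OSC_PCIE_NATIVE_HOTPLUG = 1 << 0
OSC_SHPC_NATIVE_HOTPLUG = 1 << 1
OSC_PCIE_NATIVE_PME = 1 << 2
OSC_PCIE_AER = 1 << 3
OSC_PCIE_CAPABILITY_STRUCTURE = 1 << 4


def deny_native_pme(control_field):
    """Clear the Native PME bit so that control is not granted to the guest OS."""
    return control_field & ~OSC_PCIE_NATIVE_PME


def grants_native_pme(control_field):
    """True if the control field grants Native PME control."""
    return bool(control_field & OSC_PCIE_NATIVE_PME)
```

Starting from a control field that grants all five features (0x1F), clearing the PME bit leaves 0x1B.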

General Testing Recommendations

To increase the likelihood of both success and overall system stability with the VMDirectPath I/O implementation, VMware strongly encourages OEM partners to perform rigorous testing on the specific platforms on which they intend to use VMDirectPath I/O, to ensure the requirements stated in this document are met.

This testing should account for the common power-on, restart, and functional test use-cases within the VM, but it must also consider more corner-case testing to attempt to fully validate the platform. Such corner cases primarily include forced shutdown (or crashing) of the VM, or forced shutdown of the ESX host itself while the VM is running.
