Safe device assignment with VFIO
https://lwn.net/Articles/474088/
As a general rule, most developers feel that device drivers belong in the kernel.
Kernel-space drivers are (hopefully) widely reviewed, implement standard device
interfaces, perform better, and are more secure than the user-space variety.
There are exceptions, though. Some high-performance applications want to talk
to devices directly. Virtualized guests can also be thought of as a sort of
user-space process; it is often desirable to allow guests to work with hardware
directly rather than funneling their I/O through the host. So the kernel really
should support this mode of access for the times when it is needed.
The kernel's UIO interface has been available for the implementation of
user-space drivers for some years. UIO has some shortcomings, though, including
a lack of support for direct memory access (DMA) operations. DMA under
user-space control is challenging to support for a number of reasons, not the
least of which is security. A DMA-capable device is normally capable of writing
any page in memory; as a result, empowering a user-space process to set up DMA
operations is equivalent to giving it full root access. Sometimes a user-space
driver can be trusted with that access, but that is often not the case,
especially when virtualization is involved.
More recent CPUs have added support for safe (or safer) access to devices from
virtualized guests. Devices can be restricted, via an I/O memory management unit
(IOMMU), so that only specific regions of memory are accessible to them.
Technologies like KVM support a "device assignment" mechanism that uses the
hardware capabilities to hand a device to a guest, but device assignment is not
without its shortcomings. Among other things, device assignment alone cannot
guarantee the isolation of a specific device, and it involves a fair amount of
complexity in the kernel.
Alex Williamson's VFIO patch set is an attempt to come up with a better solution
that allows the development of safe, high-performance user-space drivers. It
provides interfaces allowing those drivers to work with DMA and interrupts while
keeping overall control over how devices access the system's resources.
One problem with KVM's device assignment is that it assumes that all devices are
fully independent of each other. In particular, groups of devices may be
connected through the same IOMMU; that means that any device can access any
memory regions made available to any other devices in the same group. That, in
turn, implies that the group of devices must be assigned as a unit; if any of
those devices are assigned separately, the isolation of the group as a whole can
be broken.
So the first thing a VFIO driver writer will encounter is the group mechanism.
The VFIO code creates the groups to match the hardware topology. It then ensures
that every device in a group is controlled by a VFIO driver; if any device is
unavailable, then the group as a whole cannot be used. Most devices on a typical
system are unlikely to be bound to VFIO drivers at boot, so the system administrator
must explicitly unbind them and tell VFIO to claim them. This is probably a good
thing; exposing groups of devices to user space is best not done by default.
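The all-or-nothing rule for groups can be illustrated with a toy model. This is only a sketch of the logic described above, not VFIO code: struct model_device and model_group_viable() are hypothetical names invented for the illustration, and treating an empty group as unusable is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one device in an IOMMU group; not a kernel type. */
struct model_device {
    const char *name;
    bool bound_to_vfio;     /* true once a VFIO driver has claimed it */
};

/*
 * A group can be used only if every device in it is under VFIO control.
 * A single unclaimed device makes the whole group unusable.
 * Assumption of this model: an empty group is reported as not usable.
 */
static bool model_group_viable(const struct model_device *devs, size_t n)
{
    assert(devs != NULL || n == 0);

    if (n == 0)
        return false;

    for (size_t i = 0; i < n; i++) {
        if (!devs[i].bound_to_vfio)
            return false;
    }
    return true;
}
```

With two devices in a group, the function returns false while either is still bound to its ordinary driver, and true only once both have been claimed.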
For each group, a virtual device is created under /dev/vfio; prior to working with any individual device, a driver must open the group, claiming ownership of it. The access permissions on the group file control access to the underlying devices. Once the group has been opened, the driver should do an ioctl(VFIO_GROUP_GET_INFO) call to determine whether the group is "viable" (meaning all of the relevant devices are assigned to it) and available for use. If the group is not viable, the driver will not be able to proceed.
To work with specific devices, the driver will "open" them with the VFIO_GROUP_GET_DEVICE_FD ioctl() call, which returns a file descriptor for access to the device. The VFIO_DEVICE_GET_REGION_INFO command can be used to learn about the device's memory-mapped I/O regions, which can then be accessed via an mmap() call. VFIO_DEVICE_GET_IRQ_INFO returns information about the device's interrupt assignment(s); the driver can use the eventfd() mechanism to receive notification of interrupts via a file descriptor. For most hardware, access to MMIO and interrupts is enough to communicate with the device.
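The eventfd() side of interrupt delivery can be tried without any VFIO hardware. The sketch below shows only the standard Linux eventfd(2) calls; the step that hands the descriptor to VFIO is left out because the article does not give it. The helper names are hypothetical, and a write() from the same process stands in for the signal that would normally arrive when the device interrupts.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Create the eventfd that a driver would associate with one interrupt. */
static int irq_eventfd_create(void)
{
    return eventfd(0, EFD_CLOEXEC);     /* counter starts at zero */
}

/* Stand-in for the kernel side: signal one interrupt on the eventfd. */
static int irq_eventfd_simulate(int efd)
{
    uint64_t one = 1;

    return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Block until at least one interrupt has been signalled, then return the
 * number signalled since the previous read. The read resets the counter.
 * Returns 0 on error.
 */
static uint64_t irq_eventfd_wait(int efd)
{
    uint64_t count = 0;

    assert(efd >= 0);
    if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
        return 0;
    return count;
}
```

Because the descriptor is an ordinary file descriptor, a driver could also wait on it with poll() or epoll alongside its other event sources rather than blocking in read().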
That still leaves the DMA problem, though. To that end, the VFIO_GROUP_GET_IOMMU_FD command returns a file descriptor representing the IOMMU. DMA mappings can be set up by filling in a vfio_dma_map structure:
    struct vfio_dma_map {
        __u32   argsz;
        __u32   flags;
        __u64   vaddr;      /* Process virtual address */
        __u64   iova;       /* IO virtual address */
        __u64   size;       /* Size of mapping (bytes) */
    };
This structure is used to request a mapping of the user-space memory found at vaddr (of size bytes) into the device's I/O memory range starting at iova; the VFIO_IOMMU_MAP_DMA command actually gets the work done. For most user-space drivers, that should be about all that is needed, modulo a few details.
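Filling in the structure might look roughly like the following. This is a sketch under several assumptions: the structure is redeclared locally with <stdint.h> types so the example builds outside the kernel tree; argsz is assumed to carry the structure's own size; flags is left at zero because the article does not list the flag values; and the 4 KiB alignment check reflects a typical IOMMU page size rather than anything stated in the patch. The helper dma_map_prepare() is hypothetical, and the VFIO_IOMMU_MAP_DMA ioctl() call itself is omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local copy of the structure from the article, with <stdint.h> types. */
struct vfio_dma_map {
    uint32_t argsz;
    uint32_t flags;
    uint64_t vaddr;     /* Process virtual address */
    uint64_t iova;      /* IO virtual address */
    uint64_t size;      /* Size of mapping (bytes) */
};

#define MODEL_PAGE_SIZE 4096u   /* assumption: 4 KiB mapping granularity */

/*
 * Describe a mapping of `size` bytes at `buf` into the device's I/O
 * address space at `iova`. Returns 0 on success, or -1 if the size is
 * zero or any of the three values is not page-aligned.
 */
static int dma_map_prepare(struct vfio_dma_map *map, void *buf,
                           uint64_t iova, uint64_t size)
{
    uint64_t vaddr = (uint64_t)(uintptr_t)buf;

    assert(map != NULL);
    if (size == 0 || ((vaddr | iova | size) & (MODEL_PAGE_SIZE - 1)) != 0)
        return -1;

    memset(map, 0, sizeof(*map));
    map->argsz = sizeof(*map);  /* assumption: argsz is the struct size */
    map->vaddr = vaddr;
    map->iova = iova;
    map->size = size;
    return 0;
}
```

A driver would then pass the prepared structure to the IOMMU file descriptor with the VFIO_IOMMU_MAP_DMA command, after which the device can reach the buffer at the chosen iova.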
Not all VFIO drivers will be in user space, though. Inside the kernel, VFIO looks like a special bus type to which devices can be bound. A VFIO driver needs to provide a set of operations to the core:
    struct vfio_device_ops {
        bool    (*match)(struct device *dev, const char *buf);
        int     (*claim)(struct device *dev);
        int     (*open)(void *device_data);
        void    (*release)(void *device_data);
        ssize_t (*read)(void *device_data, char __user *buf,
                        size_t count, loff_t *ppos);
        ssize_t (*write)(void *device_data, const char __user *buf,
                         size_t count, loff_t *size);
        long    (*ioctl)(void *device_data, unsigned int cmd,
                         unsigned long arg);
        int     (*mmap)(void *device_data, struct vm_area_struct *vma);
    };
Most of these operations are analogous to those found in struct file_operations or the bus-specific device structures. A device registered in this way can be opened and used like any other device with one difference: the interlock with group ownership is always enforced. If a device has been opened individually, the group is not "viable" and cannot be used by a user-space driver. If, instead, the group has been opened, the individual devices are busy and cannot be opened.
VFIO is not the only patch set aimed at this problem; David Gibson's device isolation infrastructure is also intended to enable safe assignment of devices. The scope of this patch set is smaller, though, focusing mostly on the grouping aspect; there is no mechanism for controlling the IOMMU or working with individual devices. There is a certain amount of disagreement between the two on how grouping should be managed, which suggests, in turn, that a certain amount of discussion will have to take place before either can be merged.