Wednesday, January 20, 2016

Safe device assignment with VFIO (with Chinese translation comment)

Safe device assignment with VFIO    By Jonathan Corbet
以VFIO作安全的設備分配
https://lwn.net/Articles/474088/

As a general rule, most developers feel that device drivers belong in the kernel.
Kernel-space drivers are (hopefully) widely reviewed, implement standard device
interfaces, perform better, and are more secure than the user-space variety.           
There are exceptions, though. Some high-performance applications want to talk           
to devices directly. Virtualized guests can also be thought of as a sort of            
user-space process; it is often desirable to allow guests to work with hardware         
directly rather than funneling their I/O through the host. So the kernel really
should support this mode of access for the times when it is needed.

以一般的標準來說, 大部分的開發人員覺得設備驅動程式屬於kernel.
Kernel-space 的驅動程式有被廣泛的審查(但願), 且實作了標準的設備介面, 執行起來
更好, 而且比起user-space類型(驅動程式)更加安全. 雖然有些例外. 一些高性能的應用
程式希望直接和設備溝通. 虛擬化的guest也可以被認為是一種user-space程序; 通常希望
允許虛擬機直接和硬體一起工作而不是讓他們的I/O穿隧通過host. 所以 kernel 真的應該在需要他(應用程式)的時候支援這樣的訪問模式.

The kernel's UIO interface has been available for the implementation of
user-space drivers for some years. UIO has some shortcomings, though, including
a lack of support for direct memory access (DMA) operations. DMA under
user-space control is challenging to support for a number of reasons, not the
least of which is security. A DMA-capable device is normally capable of writing
any page in memory; as a result, empowering a user-space process to set up DMA
operations is equivalent to giving it full root access. Sometimes a user-space
driver can be trusted with that access, but that is often not the case,
especially when virtualization is involved.

多年來 Kernel 的 UIO 介面已經可以用於實現 user-space 驅動程式. 雖然UIO有一些缺點,
這包含缺乏 DMA 操作的支援. 不僅僅是安全問題, 有數種原因造成在用戶空間下支援控制
DMA是個挑戰. 一個具備 DMA 能力的設備通常可以對記憶體中的任何頁進行寫入; 其結果是,
授權一個使用者空間下的進程去操作設置 DMA, 相當於給它完整的root存取(權限). 有時候
一個使用者空間的驅動程式可以被信任去做這樣的存取, 但情況往往並非如此, 尤其當虛擬
化(技術)參與時.

More recent CPUs have added support for safe (or safer) access to devices from
virtualized guests. Devices can be restricted, via an I/O memory management unit
(IOMMU) so that only specific regions of memory are accessible to them.
Technologies like KVM support a "device assignment" mechanism that uses the
hardware capabilities to hand a device to a guest, but device assignment is not
without its shortcomings. Among other things, device assignment alone cannot
guarantee the isolation of a specific device, and it involves a fair amount of
complexity in the kernel.

最近的CPU支援了從虛擬機內安全存取設備. 設備可以被限制住, 但透過I/O記憶體管理單元
(IOMMU)可以存取設備的特定區域的記憶體. 類似KVM這樣的技術支援了"設備分配"的機制,
它利用硬體功能將設備交給虛擬機, 但是設備分配並不是沒有缺點. 除此之外, 單單設備
分配(機制)並不能確保一個特定設備的隔離性, 而且它涉及相當程度的複雜性在kernel中.

Alex Williamson's VFIO patch set is an attempt to come up with a better solution
that allows the development of safe, high-performance user-space drivers. It
provides interfaces allowing those drivers to work with DMA and interrupts while
keeping overall control over how devices access the system's resources.

Alex Williamson 的 VFIO 補丁集嘗試提出一個更好的方案, 它讓使用者空間下的驅動程式
在開發上更安全且高效. 在全面控制設備如何去存取系統資源的狀況下, 它提供了介面允許
(user-space)驅動程式與DMA和中斷一起工作.

One problem with KVM's device assignment is that it assumes that all devices are
fully independent of each other. In particular, groups of devices may be
connected through the same IOMMU; that means that any device can access any
memory regions made available to any other devices in the same group. That, in
turn, implies that the group of devices must be assigned as a unit; if any of
those devices are assigned separately, the isolation of the group as a whole can
be broken.

KVM的設備分配有個問題是, 它假設所有的設備都是完全獨立於其他設備. 特別是同一個群組
的設備可能會透過相同的IOMMU連接; 這意味任何設備可以存取任何提供給其他同群組設備的記憶體區域. 反過來說, 也意味著該群組的所有設備都必須被分派成同一個單元. 如果任何
一個設備被單獨分配, 則整體群組的隔離被破壞.

So the first thing a VFIO driver writer will encounter is the group mechanism.
The VFIO code creates the groups to match the hardware topology. It then ensures
that every device in a group is controlled by a VFIO driver; if any device is
unavailable, then the group as a whole cannot be used. Most devices on a typical
system are unlikely to be bound to VFIO drivers at boot, so the system administrator
must explicitly unbind them and tell VFIO to claim them. This is probably a good
thing; exposing groups of devices to user space is best not done by default.

所以, VFIO驅動程式開發者會遭遇到的第一件事就是群組(group)機制. VFIO程式碼建立和硬體拓撲一致的多個群組. 然後它確保了所有在同一個群組下的設備都被同一個VFIO驅動所控制; 如果任何一個設備無法被使用, 則整個群組會被視為一體皆不可使用. 大多數傳統系統上的設備不太可能在開機的時候就綁定到 VFIO 驅動程式, 所以系統管理員必須確切的將它們解除綁定, 並且告訴 VFIO 去認領這些設備. 這可能是件好事, 最好不要預設就以群組為單位將設備曝露給使用者空間.
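
The group rule above can be modelled in a few lines. This is only a sketch of the policy, not kernel code: the `Device` class, the PCI addresses, and the driver name `vfio-pci` are assumptions made for illustration and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str    # hypothetical device address
    driver: str  # name of the bound kernel driver, "" if unbound


def group_is_viable(devices, vfio_drivers=("vfio-pci",)):
    """A group is usable only if every device in it is bound to a VFIO driver."""
    return bool(devices) and all(d.driver in vfio_drivers for d in devices)


# Two functions of one card end up in the same group.
group = [Device("0000:01:00.0", "vfio-pci"),
         Device("0000:01:00.1", "snd_hda_intel")]
print(group_is_viable(group))   # False: one device still has its host driver

group[1].driver = "vfio-pci"    # administrator rebinds the second device
print(group_is_viable(group))   # True
```

A single device left on its host driver makes the whole group unusable, which is why the administrator has to rebind every member explicitly.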

For each group, a virtual device is created under /dev/vfio; prior to working with any individual device, a driver must open the group, claiming ownership of it. The access permissions on the group file control access to the underlying devices. Once the group has been opened, the driver should do an ioctl(VFIO_GROUP_GET_INFO) call to determine whether the group is "viable" (meaning all of the relevant devices are assigned to it) and available for use. If the group is not viable, the driver will not be able to proceed.

針對每個群組, 會有一個虛擬設備建立在 /dev/vfio; 在和任何個別的設備合作之前, 必須有一個驅動程式打開群組, 並聲明其所有權. 群組檔案的訪問權限就控制了對底層設備的存取. 一旦該群組被打開了驅動程式必須調用 ioctl(VFIO_GROUP_GET_INFO) 去判定該群組是否為"活的(viable)"(這意味著所有相關設備都被分配給它) 並可供使用. 如果群組不是活著, 驅動程式將無法繼續進行下去.

To work with specific devices, the driver will "open" them with the VFIO_GROUP_GET_DEVICE_FD ioctl() call, which returns a file descriptor for access to the device. The VFIO_DEVICE_GET_REGION_INFO command can be used to learn about the device's memory-mapped I/O regions, which can then be accessed via an mmap() call. VFIO_DEVICE_GET_IRQ_INFO returns information about the device's interrupt assignment(s); the driver can use the eventfd() mechanism to receive notification of interrupts via a file descriptor. For most hardware, access to MMIO and interrupts is enough to communicate with the device.

為了和特定設備一起工作, 驅動程式會用 VFIO_GROUP_GET_DEVICE_FD ioctl() call "打開"它們, 它會回傳一個檔案描述符(file descriptor)以用於存取該設備. VFIO_DEVICE_GET_REGION_INFO 命令可以被用來了解設備的記憶體映射I/O區域, 然後可以透過一個 mmap() 存取這些區域. VFIO_DEVICE_GET_IRQ_INFO 回傳了關於設備的中斷分配資訊; 驅動程式可以使用 eventfd() 機制來透過檔案描述符接收中斷通知. 對於多數的硬體, 存取 MMIO 和中斷就足以和設備溝通.

That still leaves the DMA problem, though. To that end, the VFIO_GROUP_GET_IOMMU_FD command returns a file descriptor representing the IOMMU. DMA mappings can be set up by filling in a vfio_dma_map structure:
    struct vfio_dma_map {
        __u32 argsz;
        __u32 flags;
        __u64 vaddr;  /* Process virtual address */
        __u64 iova;   /* IO virtual address */
        __u64 size;   /* Size of mapping (bytes) */
    };


This structure is used to request a mapping of the user-space memory found at vaddr (of size bytes) into the device's I/O memory range starting at iova; the VFIO_IOMMU_MAP_DMA command actually gets the work done. For most user-space drivers, that should be about all that is needed, modulo a few details.
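
To make the layout concrete, the sketch below packs the five fields in the order the structure lists them. It only builds the request buffer; the actual ioctl() call is left out because the request number for VFIO_IOMMU_MAP_DMA is not given here. The 4 KiB page size, the page-alignment check, and the example addresses are assumptions.

```python
import struct

# Field order follows the vfio_dma_map structure quoted above:
# __u32 argsz, __u32 flags, __u64 vaddr, __u64 iova, __u64 size
VFIO_DMA_MAP_FMT = "=IIQQQ"
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def build_dma_map(vaddr, iova, size, flags=0):
    """Return the packed request describing one user-memory to IOVA mapping."""
    for name, value in (("vaddr", vaddr), ("iova", iova), ("size", size)):
        # Assumption: the IOMMU maps whole pages, so reject unaligned values.
        if value % PAGE_SIZE:
            raise ValueError(f"{name} must be page aligned")
    argsz = struct.calcsize(VFIO_DMA_MAP_FMT)
    return struct.pack(VFIO_DMA_MAP_FMT, argsz, flags, vaddr, iova, size)


# Hypothetical addresses: map two pages of process memory at IOVA 0x100000.
request = build_dma_map(vaddr=0x7F0000000000, iova=0x100000, size=2 * PAGE_SIZE)
print(len(request))  # 32
```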

Not all VFIO drivers will be in user space, though. Inside the kernel, VFIO looks like a special bus type to which devices can be bound. A VFIO driver needs to provide a set of operations to the core:
    struct vfio_device_ops {
        bool (*match)(struct device *dev, const char *buf);
        int (*claim)(struct device *dev);
        int (*open)(void *device_data);
        void (*release)(void *device_data);
        ssize_t (*read)(void *device_data, char __user *buf,
                        size_t count, loff_t *ppos);
        ssize_t (*write)(void *device_data, const char __user *buf,
                         size_t count, loff_t *size);
        long (*ioctl)(void *device_data, unsigned int cmd,
                      unsigned long arg);
        int (*mmap)(void *device_data, struct vm_area_struct *vma);
    };

Most of these operations are analogous to those found in struct file_operations or the bus-specific device structures. A device registered in this way can be opened and used like any other device with one difference: the interlock with group ownership is always enforced. If a device has been opened individually, the group is not "viable" and cannot be used by a user-space driver. If, instead, the group has been opened, the individual devices are busy and cannot be opened.

VFIO is not the only patch set aimed at this problem; David Gibson's device isolation infrastructure is also intended to enable safe assignment of devices. The scope of this patch set is smaller, though, focusing mostly on the grouping aspect; there is no mechanism for controlling the IOMMU or working with individual devices. There is a certain amount of disagreement between the two on how grouping should be managed which suggests, in turn, that a certain amount of discussion will have to take place before either can be merged.

Thursday, January 14, 2016

Check the signature of kernel module (old format before 3f1e1bea3 commit in v4.3-rc1)

I ran into a situation where I needed to check the signature blob attached to the end of a .ko file and confirm that it was signed by the appropriate key. Here is a procedure for checking the signature manually with the public key.

Success Case

acer-wmi.ko (from kernel-default-3.12.28-4.6.x86_64.rpm):

STEP 1. Find out the "Subject Key Identifier"

linux-aiip:/lib/modules/3.12.28-4-default # modinfo acer-wmi
filename:       /lib/modules/3.12.28-4-default/kernel/drivers/platform/x86/acer-wmi.ko
[...snip]
signer:         SUSE Linux Enterprise Secure Boot Signkey
sig_key:        3F:B0:77:B6:CE:BC:6F:F2:52:2E:1C:14:8C:57:C7:77:C7:88:E3:E7
sig_hashalgo:   sha256

The sig_key matches the "Subject Key Identifier" of the SUSE Linux Enterprise Secure Boot
Signkey, so that key's public half should be able to decrypt the signature attached to acer-wmi.ko.


STEP 2. Extract the signature from acer-wmi.ko

The signature is attached after the key name and the Subject Key Identifier.
It is a 256-byte block, which corresponds to a 2048-bit RSA key (the SHA256
digest itself is only 32 bytes). In this case, the Subject Key Identifier is
3F B0 77 B6...E3 E7, so look for its position in acer-wmi.ko:

0000eb10  00 00 00 00 00 00 00 00  53 55 53 45 20 4c 69 6e  |........SUSE Lin|
0000eb20  75 78 20 45 6e 74 65 72  70 72 69 73 65 20 53 65  |ux Enterprise Se|
0000eb30  63 75 72 65 20 42 6f 6f  74 20 53 69 67 6e 6b 65  |cure Boot Signke|
0000eb40  79 3f b0 77 b6 ce bc 6f  f2 52 2e 1c 14 8c 57 c7  |y?.w...o.R....W.|  <=== 3F B0 77 B6...
0000eb50  77 c7 88 e3 e7 01 00 9b  eb 31 d2 cd f7 3a 65 92  |w........1...:e.|  <=== ...E3 E7, then 01 00, then signature /* row offset 0xeb50 = 60240; signature starts at 60240 + 7 = 60247 */
0000eb60  30 ee 2e d4 97 d2 7b 15  a0 e0 08 1f db 2d a7 9e  |0.....{......-..|
0000eb70  7f 0a 5f 25 ed 04 6e 95  2d 98 85 cc 98 5a 4f 08  |.._%..n.-....ZO.|

The signature is always preceded by a two-byte "01 00" head (read as a
big-endian number it is 0x0100 = 256, which matches the signature length). We
don't need the head, so the signature starts at 60240 + 7 = 60247. Use dd to
extract the signature to another file:

dd skip=60247 count=256 bs=1 if=./acer-wmi.ko of=./acer-wmi.ko.sig

> hexdump -C acer-wmi.ko.sig
00000000  9b eb 31 d2 cd f7 3a 65  92 30 ee 2e d4 97 d2 7b  |..1...:e.0.....{|
00000010  15 a0 e0 08 1f db 2d a7  9e 7f 0a 5f 25 ed 04 6e  |......-...._%..n|
00000020  95 2d 98 85 cc 98 5a 4f  08 b5 c6 b5 1b 2f 87 52  |.-....ZO...../.R|
00000030  00 28 bb f0 6b bd f1 60  8e 58 be 18 0e 30 e7 dd  |.(..k..`.X...0..|
00000040  87 91 ce 1f 84 71 f8 83  f3 ba f7 07 68 c4 35 6d  |.....q......h.5m|
00000050  c4 3d 87 e7 ff 4c b2 20  ae b9 65 52 0f 56 38 38  |.=...L. ..eR.V88|
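
The offset arithmetic above can also be done by searching for the signer name followed by the key identifier. This is a sketch for the old signature layout shown in the hexdump only; the function name `find_signature` is hypothetical, and the blob used below is synthetic padding arranged so that the signer name lands at 0xeb18 as in acer-wmi.ko.

```python
def find_signature(blob, signer, key_id, sig_len=256):
    """Locate the signature that follows signer name + key id + "01 00" head."""
    marker = signer + key_id
    pos = blob.rfind(marker)  # the signature data sits near the end of the file
    if pos < 0:
        raise ValueError("signer name and key identifier not found")
    head = pos + len(marker)
    if blob[head:head + 2] != b"\x01\x00":
        raise ValueError("expected the 01 00 head before the signature")
    start = head + 2  # skip the two-byte head
    return start, blob[start:start + sig_len]


signer = b"SUSE Linux Enterprise Secure Boot Signkey"
key_id = bytes.fromhex("3FB077B6CEBC6FF2522E1C148C57C777C788E3E7")

# Synthetic stand-in for the module: zero padding up to 0xeb18, then the
# same sequence of fields that the hexdump shows.
blob = bytes(0xEB18) + signer + key_id + b"\x01\x00" + bytes(range(256)) + b"tail"

offset, signature = find_signature(blob, signer, key_id)
print(offset)          # 60247
print(len(signature))  # 256
```

With a real module, `blob` would be the content of the .ko file read in binary mode, and `offset` is the value to pass to dd as skip=.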

STEP 3. Verify the signature with the public key from the SLES certificate

First, extract the public key from the certificate file:
openssl x509 -in SLES-UEFI-SIGN-Certificate.der -inform DER -pubkey -noout > SLES-UEFI-SIGN-Certificate.pub

Then decrypt the signature with the public key. If the signature was not
produced by the matching private key, this step fails:

openssl rsautl -verify -inkey SLES-UEFI-SIGN-Certificate.pub -pubin -in acer-wmi.ko.sig > acer-wmi.ko.sha

The acer-wmi.ko.sha file is the decrypted signature. It consists of an ASN.1
header followed by the SHA256 hash:

> hexdump -C acer-wmi.ko.sha
00000000  30 31 30 0d 06 09 60 86  48 01 65 03 04 02 01 05  |010...`.H.e.....| <=== ASN.1 header
00000010  00 04 20 ae 4a 31 b2 46  2b 1d e6 01 26 aa 38 2e  |.. .J1.F+...&.8.| <=== after 00 04 20: hash, should match the result of modhash
00000020  9d 3d ab 08 78 1a c2 85  b3 2f 87 96 3e 7f 15 7a  |.=..x..../..>..z|
00000030  31 b7 8c                                          |1..|
00000033

The attached Perl script, modhash, was written by Gary Lin to calculate the
hash of a signed .ko file. Compare the result from modhash with the decrypted
hash value above to confirm that they match:
> perl modhash -v acer-wmi.ko
Hash algorithm: sha256
acer-wmi.ko: ae4a31b2462b1de60126aa382e9d3dab08781ac285b32f87963e7f157a31b78c
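
The comparison can be scripted as well. The sketch below strips the 19-byte ASN.1 header seen in the hexdump (the DigestInfo prefix for SHA-256) and compares the remaining 32 bytes with the modhash output. The function name is hypothetical; the byte values are copied from the dumps above.

```python
# 19-byte ASN.1 header from the hexdump of acer-wmi.ko.sha
SHA256_PREFIX = bytes.fromhex("3031300d060960864801650304020105000420")


def digest_from_decrypted(data):
    """Return the hex SHA-256 digest carried after the ASN.1 header."""
    if not data.startswith(SHA256_PREFIX):
        raise ValueError("unexpected ASN.1 header, not a SHA-256 DigestInfo")
    digest = data[len(SHA256_PREFIX):]
    if len(digest) != 32:
        raise ValueError("SHA-256 digest must be 32 bytes")
    return digest.hex()


modhash_output = "ae4a31b2462b1de60126aa382e9d3dab08781ac285b32f87963e7f157a31b78c"

# Content of acer-wmi.ko.sha as shown above: header followed by the hash.
decrypted = SHA256_PREFIX + bytes.fromhex(modhash_output)

print(len(decrypted))                                    # 51 (0x33)
print(digest_from_decrypted(decrypted) == modhash_output)  # True
```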

Following the same procedure, here is a failed case.

Failed Case

sample.ko:

It also claims to be signed by the SUSE Linux Enterprise Secure Boot Signkey:
linux-aiip:~ # modinfo sample
filename:       /lib/modules/3.12.28-4-default/updates/sample.ko
description:    xxxxxxxxxxxxxx
[...snip]
signer:         SUSE Linux Enterprise Secure Boot Signkey
sig_key:        3F:B0:77:B6:CE:BC:6F:F2:52:2E:1C:14:8C:57:C7:77:C7:88:E3:E7
sig_hashalgo:   sha256

The Subject Key Identifier is found in the .ko file and looks OK:
00061790  70 61 67 65 73 5f 63 75  72 72 65 6e 74 00 53 55  |pages_current.SU|
000617a0  53 45 20 4c 69 6e 75 78  20 45 6e 74 65 72 70 72  |SE Linux Enterpr|
000617b0  69 73 65 20 53 65 63 75  72 65 20 42 6f 6f 74 20  |ise Secure Boot |
000617c0  53 69 67 6e 6b 65 79 3f  b0 77 b6 ce bc 6f f2 52  |Signkey?.w...o.R|  <=== 3F B0 77...
000617d0  2e 1c 14 8c 57 c7 77 c7  88 e3 e7 01 00 c6 12 1d  |....W.w.........|  <=== ...88 E3 E7, then 01 00, then signature /* row offset 0x617d0 = 399312; signature starts at 399312 + 13 = 399325 */
000617e0  ba 45 3a b3 b1 99 fb 55  1b fc d3 90 6a ea 92 64  |.E:....U....j..d|

Then extract the signature:
dd skip=399325 count=256 bs=1 if=./sample.ko of=./sample.ko.sig
> hexdump -C sample.ko.sig
00000000  c6 12 1d ba 45 3a b3 b1  99 fb 55 1b fc d3 90 6a  |....E:....U....j|
00000010  ea 92 64 8a 04 04 f9 22  a7 74 35 98 05 d7 e6 85  |..d....".t5.....|
00000020  8c 5f 32 e6 6c 71 f7 ba  1c 0a 0f 8a 95 f3 ec c7  |._2.lq..........|
00000030  88 b2 11 71 27 28 ca b8  b8 55 ae df 56 38 c6 b4  |...q'(...U..V8..|

Unfortunately, the signature cannot be decrypted successfully with the public key:
openssl rsautl -verify -inkey SLES-UEFI-SIGN-Certificate.pub -pubin -in sample.ko.sig > sample.ko.sha
RSA operation error
139755852273296:error:0407006A:rsa routines:RSA_padding_check_PKCS1_type_1:block type is not 01:rsa_pk1.c:100:
139755852273296:error:04067072:rsa routines:RSA_EAY_PUBLIC_DECRYPT:padding check failed:rsa_eay.c:721:

Decrypting the blob with the public key failed, so it looks like the signature was not made with the appropriate key.

Tuesday, January 5, 2016

RAPL (Running Average Power Limit) driver (with Chinese translation comment)

RAPL (Running Average Power Limit) driver Jacob Pan <jacob.jun.pan@linux.intel.com>
RAPL (運行時期平均供電限制) 驅動程式
https://lwn.net/Articles/545745/

RAPL(Running Average Power Limit) interface provides platform software
with the ability to monitor, control, and get notifications on SOC
power consumptions. Since its first appearance on Sandy Bridge, more
features have been added to extend its usage. In RAPL, platforms are
divided into domains for fine grained control. These domains include
package, DRAM controller, CPU core (Power Plane 0), graphics uncore
(power plane 1), etc.

在 RAPL, 硬體平台被區分成為幾個domain來進行細粒度調整. 這些domain包含
package, DRAM控制器, CPU核心 (電力平面0), 非核心顯卡 (電力平面1), 等等.

The purpose of this driver is to expose RAPL for userspace
consumption. Overall, RAPL fits in the generic thermal layer in
that platform level power capping and monitoring are mainly used for
thermal management and thermal layer provides the abstracted interface
needed to have portable applications.

整體來說, RAPL 適用於通用散熱層, 用在平台層級的電力封頂, 以及監控. 主要
為溫度管控以及散設層提供抽象介面以滿足便攜式應用的需求.

Specifically, userspace is presented with per domain cooling device
with sysfs links to its kobject. Although RAPL domain provides many
parameters for fine tuning, long term power limit is exposed as the
single knob via cooling device state. Whereas the rest of the
parameters are still accessible via the linked kobject. This simplifies
the interface for both simple and advanced use cases.

具體而言, 在 userspace 會透過 sysfs 展現出來, 其背後聯結到每個domain的
冷卻裝置的 kobject. 雖然 RAPL domain 提供了很多可微調的參數, 但長時間的
功率限制乃是透過冷卻裝置狀態而揭露出來成為單獨的 knob. 然而其他的參數
仍然可以透過鏈接 kobject 來取用. 這樣同時簡化了簡單和進階使用場合中的
使用介面.

DETAILS 細節
=======
1. sysfs layout         sysfs 層

As an x86 platform driver, RAPL driver binds with supported CPU ids
during probing phase. Once domains are discovered, kobjects are created
for each domain which are also linked with cooling devices after its
registration with the generic thermal layer.

作為一支 x86 平台驅動, RAPL驅動程式在試探時期綁定了被支援的CPU id. 一旦
domains 發現, kobjects 會為了每個domain而被創建出來, 它們也會在註冊到通用
散熱層之後聯結到散熱設備

e.g. package RAPL domain registered as cooling device #15, link "device"
back to its kobject.

範例: RAPL domain 包裹被註冊成為散熱設備15號, device 聯結回到它的kobject.

/sys/class/thermal/cooling_device15/
├── cur_state
├── device -> ../../../platform/intel_rapl/rapl_domains/package
├── max_state
├── power
├── subsystem -> ../../../../class/thermal
├── type
└── uevent

In driver's private sysfs area, domains kobjects are grouped under a
kset which exposes global data.

在驅動程式的私有 sysfs 區域, domains 的 kobjects 被歸納在一個 kset 下,
揭露全域資料.

/sys/devices/platform/intel_rapl/
├── driver -> ../../../bus/platform/drivers/intel_rapl
├── power
├── rapl_domains
│   ├── package
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device15
│   ├── power_plane_0
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device16
│   └── power_plane_1
│       └── thermal_cooling -> ../../../../virtual/thermal/cooling_device18
└── subsystem -> ../../../bus/platform

2. per domain parameters        每個domain的參數

These are the fine tuning parameters only used by advanced
power/thermal management applications. Refer to Intel SDM ch14 for
details.

有一些可以微調的參數只能應用在功耗/溫度管理上. 細節請參考 Intel SDM
第14章.

root@chromoly:/sys/class/thermal/cooling_device15/device# grep . *
domain_name:package
energy:924228
lock:0
max_power:0
max_window:0
min_power:0
pl1_clamp:1
pl1_enable:1
pl2_clamp:0
pl2_enable:1
power:2276
power_limit1:12000
power_limit2:31250
thermal_spec_power:17000
throttle_time:
time_window1:28000
time_window2:0
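
For scripting, the `grep . *` listing above is easy to turn into a dictionary. This is a small sketch, assuming every line has the `name:value` shape shown; an empty value such as `throttle_time:` is kept as None, and the function name is hypothetical.

```python
def parse_domain_params(text):
    """Parse "name:value" lines into a dict; numeric values become ints."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip anything that is not a name:value pair
        value = value.strip()
        params[name] = int(value) if value.isdigit() else (value or None)
    return params


listing = """domain_name:package
pl1_enable:1
power_limit1:12000
thermal_spec_power:17000
throttle_time:
"""

params = parse_domain_params(listing)
print(params["domain_name"])    # package
print(params["power_limit1"])   # 12000
print(params["throttle_time"])  # None
```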

3. event notifications          事件通知

RAPL driver uses eventfd to provide userspace notifications on selected
events. A file node called "event_control" is created for each RAPL
domain. User can write control file descriptor, eventfd descriptor, and
threshold to event_control file. Then, user application can use
poll/select or blocking read to get notifications from the driver.
Multiple events are allowed for each domain but only a single threshold
is accepted.

RAPL驅動程式利用 eventfd 為 userspace 所選定的事件來提供通知.
event_control 這一個檔案結點為每個 RAPL domain 被建立出來. 使用者可以
寫入控制檔案描述符, eventfd 描述符, 以及閥值到 event_control 檔案. 接著,
使用方的應用程式可以使用 輪詢/選擇 或者 阻斷讀取 來取得驅動程式的通知.
多事件在每一個 domain 都是可以被允許的, 但是只允許單一閥值.
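
A user-space listener along these lines could look roughly like the sketch below. The exact text format expected by event_control is not spelled out in this description, so the space-separated "efd cfd threshold" string is an assumption (the field order follows the event_fd_listener description in the usage examples), and both function names are hypothetical. `os.eventfd` needs Linux and Python 3.10 or later.

```python
import os


def event_control_request(efd, cfd, threshold):
    """Build the line written to event_control (assumed format: efd cfd threshold)."""
    return f"{efd} {cfd} {threshold}\n".encode()


def wait_for_event(domain_dir, control_name, threshold):
    """Register a threshold on one control file and block until the driver signals."""
    efd = os.eventfd(0)
    cfd = os.open(os.path.join(domain_dir, control_name), os.O_RDONLY)
    try:
        ctl_path = os.path.join(domain_dir, "event_control")
        with open(ctl_path, "wb", buffering=0) as ctl:
            ctl.write(event_control_request(efd, cfd, threshold))
        return os.eventfd_read(efd)  # blocking read, returns the event count
    finally:
        os.close(cfd)
        os.close(efd)


print(event_control_request(5, 6, 5000))  # b'5 6 5000\n'
```

`wait_for_event` is not called here because it needs the driver's sysfs files; on a system with the driver loaded, `domain_dir` would be a domain directory such as the `device` link under a cooling device.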

4. Usage Examples (assume the topology in the sysfs layout above)
使用範例 (假設在 sysfs 層之上的拓撲)

- set power limit to package domain (whole SOC package) to 6w
- 設定封裝 domain 耗電限制在6w
root@chromoly:~# echo 6000 > /sys/class/thermal/cooling_device15/cur_state

- set power limit to pp1 domain (graphics) to 4w
- 設定 pp1 domain 耗電限制在4w
root@chromoly:~# echo 4000 > /sys/class/thermal/cooling_device18/cur_state

- check the current power usage in mWatts of pp1 domain
- 確認 pp1 domain 目前的耗電使用, 單位mWatts
root@chromoly:~# cat  /sys/class/thermal/cooling_device18/cur_state
61

- set event notification when power consumption of graphics unit crosses
  5w.
- 設定事件通知, 當繪圖單元的耗電高過5w時通知
root@chromoly:~# event_fd_listener /sys/class/thermal/cooling_device18/device/power 5000
(event_fd_listener opens control file power and creates an eventfd,
then write efd, cfd, threshold to event_control file of the given
domain)
(event_fd_listener 打開 power 控制檔案而且創建一個 eventfd, 然後寫入 efd,
cfd, 閥值 到domain的 event_control 檔案)

Caveats:        注意事項:

1. Package power limit events are supported by legacy thermal reporting
mechanism, which uses local APIC thermal vector to generate interrupts
when targeted P-states are not honored by the HW/FW. This is tied to
machine check reporting. Until RAPL is used, this notification is a rare
exception. When RAPL power limit is set artificially low, this
notification could result in unwanted interrupts for each power limit
excursion. Therefore, RAPL driver attempts to turn off the power limit
notification interrupt when user sets a power limit.

1. 傳統的溫度回報機制支援了包裹的耗電限制事件, 當目標 P-states 沒有被
   硬體/軔體 所履行時, 它會使用 local APIC 溫度向量來產生中斷. 它是綁定
   到機器檢查報告中. 直到 RAPL 使用之前, 這樣的通知都還是一種罕見的例外.
   當 RAPL 耗電限制被人為壓低時, 這樣的通知可能會導致在每次功率極限飄移
   時產生不必要的中斷. 因此, 當使用者設定功耗限制, 則RAPL 驅動會嘗試關閉
   功耗限制通知中斷.

2. By Intel Software Developer's Manual, RAPL interface can report
max/min power for certain domains. But in reality HW often reports 0
for max/min power. RAPL driver tackles this problem by using thermal
specification power or current power limit1 when max power information
is not available. The result is that the max_state of a RAPL cooling
device can be based on thermal spec power or power limit 1.

2. 根據 Intel 軟體開發人員手冊 (SDM), RAPL 介面能彙報特定 domain 的
   最大/最小 耗電. 但在現實中, 硬體常常回報 最大/最小耗電 為零. 當最大
   耗電資訊不適用時, RAPL 驅動程式會利用 溫度規範功率或 當前電源限制 來
   逮住這個問題. 其結果是 RAPL 散熱設備的 max_state 可以基於溫度規範耗電
   或耗電限制1.
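
The fallback in caveat 2 amounts to picking the first non-zero value. In the sketch below the order (reported max power, then thermal spec power, then power limit 1) is an assumption, since the text only says that one of the latter two is used when max power is unavailable; the function name is hypothetical and the numbers come from the parameter listing in section 2.

```python
def effective_max_power(max_power, thermal_spec_power, power_limit1):
    """Pick the value that max_state would be based on (first non-zero wins)."""
    for candidate in (max_power, thermal_spec_power, power_limit1):
        if candidate:
            return candidate
    return 0  # nothing usable was reported


# Package domain from the listing: max_power:0, thermal_spec_power:17000,
# power_limit1:12000
print(effective_max_power(0, 17000, 12000))  # 17000
print(effective_max_power(0, 0, 12000))      # 12000
```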

3. RAPL is backed by FW, so in case of FW failure or plain lack of
support, setting RAPL power limit could result in silent failure. I
don't have a good solution for that.

3. 因為 RAPL 需要軔體的支援. 一旦軔體故障或者平面缺乏支援, 設定 RAPL
   耗電限制可能導至無提示的故障. 我並沒有好的解決方案.

4. Data polling starts only when the following items are set
        - power limit
        - events

4. 資料輪詢只有當下列項目設定時才回開始
        - 耗電限制
        - 事件

Power Efficient Idle Injection - Jacob Pan - LinuxCon Japan 2015 (with Chinese translation comment)

original slides: Power Efficient Idle Injection - Jacob Pan - LinuxCon Japan 2015

Power Efficient Idle Injection
        Jacob Pan
        Intel Open Source Technology Center
        LinuxCon Japan 2015

Why Injecting Idle?                                             /* 為何要注射 idle */
        * Primary: Thermal/Power limiting                       /* 主要: 溫度/能源 限制 */
        * Secondary:                                            /* 次要: */
                * Performance management                        /* 性能管理 */
                * Pay per use                                   /* 按使用付費 */
                * Idle power efficiency                         /* 閒置功耗效率 */

LFM (low frequency mode)

Idle Injection in Linux                                         /* Linux 上的閒置注入 */
        * Intel PowerClamp driver                               /* Intel PowerClamp 驅動程式 */
        * Scheduler throttling, RT or CFS bandwidth control     /* scheduler 節流 */

Intel Power Clamp V1
(current design in mainline kernel)
The idea: play idle!

Limitations of Intel PowerClamp V1                              /* Intel PowerClamp V1 的限制 */

        * CPU appears busy while playing idle                   /* 演繹 idle 時 CPU 仍呈現忙碌 */
        * Scheduler ticks not stopped in NOHZ idle              /* scheduler 的滴答在 NOHZ 空閒時仍沒有停止 */
                Removal of tick_nohz_idle_enter/exit() API
                RCU grace period
        * Relies on timely jiffies updates                      /* 依賴 jiffies 的及時更新 */

Limitations of Intel PowerClamp V1
        Relies on secondary timing source
        * timely jiffy updates
        * periodic timers

Scheduler Based Throttling                                      /* 以 scheduler 為基礎的節流 */
Normal tasks under completely fair scheduling (CFS) class
        * Bandwidth control via CPU control group/container     /* 頻寬控制是透過 CPU 控制組/容器 */
        * Runqueue throttling by enqueue/dequeue tasks          /* 對運行中的 queue 節流是通過把 tasks 排入或移出queue */
        
Time chart of CFS Bandwidth Control                                             /* CFS 頻寬控制 */
        * Pros: No fake idle task, Finer per cgroup controls                    /* 優點: 沒有假性閒置任務, 細到逐一cgroup的控制 */
        * Cons: No synchronization, loss of package C-state opportunities       /* 缺點: 對於可能有機會發生的 C-state 封包丟失沒有做同步 */

Power Clamp V2(work in progress)                                        /* Power Clamp V2 版本 */
        * Runqueue throttling of CFS class                              /* 針對 CFS class 作 runqueue 節流控制 */
        * Synchronization around rounded Ktime instead of jiffies       /* 圍繞著周圍的 Ktime 作同步, 而不是 jiffies */

Experiment Data                                                                 /* 實驗數據 */
        Goals:
                Comparing Power Efficiency                                      /* 比較能源效率 */
                Scalability                                                     /* 可擴展性 */  
                CPU HW design trend: old vs. new                                /* CPU 的硬體設計趨勢 */
        Configurations:                                                         /* 配置 */ 
                CPUs: Ivy Bridge/Haswell/Broadwell clients, Haswell EX server
                Workload:fspin by Len Brown. CPU bound, floating
                Test case: Inject idle from 0 to 50% at 5% increment            /* 注入閒置, 以 5% 的增量從 0 到 50% */

Power and Performance Control V1 vs. V2
        V1: RT kthread play idle 
        V2: CFS runqueue throttling

Comparing Deep vs. Shallow Package C-States
(powerclamp v2)

Conclusions
* Idle injection can effectively reduce power beyond energy efficient frequency         /* 閒置注射可以在節能型頻率以外再有效的降低耗能 */
* With deeper package C-states, can achieve near linear performance and power           /* 配合更深的 C-states 封包, 可達成線性效能以及降低功耗 */
  reduction
* Scheduler runqueue throttling results in cleaner and more efficient solution          /* 排程器 runqueue 節流可導至更清楚與有效的解決方法 */
* Align activities results in significant power savings                                 /* 對齊活動導至顯著降低功耗 */

Future plan
* Better handling of interrupts                         /* 更妥善的處理中斷 */
* Integration with scheduler                            /* 與排程器整合 */
* Synchronize with devices with latency tolerance       /* 依據延遲容忍進行設備間同步 */
* Work with hardware duty cycling                       /* 與硬體負載循環合作 */

powerclamp_timer_fn

Wednesday, December 23, 2015

Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME (with Chinese translation comment)


Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME
From "H. Peter Anvin" <>
Date Fri, 20 Dec 2013 08:57:03 -0800

But we prefer the TAD for that.  The case where the EFI runtime is the only source of that info is problematic as they are known to not work at runtime.  We could collect it at boot and then never change it, although you end up in definitional issues between EFI and the hw RTC.

/* 但是那樣的狀況下我們比較喜歡 TAD. 在某種案例下 EFI runtime 是唯一的資訊來源但是它卻有問題, 因為我們已知它們在 runtime 無法運作. 我們可以在開機時收集它然後永遠不修改, 但是你最終會遭遇 EFI 和 hw RTC 間的定義上問題.  */

Matthew Garrett <matthew.garrett@nebula.com> wrote:
>On Thu, 2013-12-19 at 20:22 -0800, H. Peter Anvin wrote:
>> On 12/19/2013 08:05 PM, joeyli wrote:
>> > Can we use EFI time services on x86_64 after Borislav's patches
>accepted
>> > to mainline?
/* 當 Borislav 的 patches 被上游允許之後, 在x86_64是否我們可以使用 EFI time services? */
>> > 
>> 
>> No.
/* 不行. */
>
>We will want to use them to (at minimum) obtain the clock timezone.
>Using them for general RTC access is less attractive.

/* 我們想要(最低限度)使用他們以獲得時鐘時區. 用他們作為一般的RTC存取(功能)並沒有吸引力. */
-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


From Matthew Garrett <>
Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME
Date Fri, 20 Dec 2013 16:58:33 +0000

On Fri, 2013-12-20 at 08:57 -0800, H. Peter Anvin wrote:
> But we prefer the TAD for that.  The case where the EFI runtime is the only source of that info is problematic as they are known to not work at runtime.  We could collect it at boot and then never change it, although you end up in definitional issues between EFI and the hw RTC.

Most shipping UEFI hardware has no TAD.
/* 大部份出貨的 UEFI 硬體沒有 TAD */

-- 
Matthew Garrett <matthew.garrett@nebula.com>


Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME
From "H. Peter Anvin" <>
Date Fri, 20 Dec 2013 12:29:51 -0800

Yes, but the TZ isn't all that critical, either.  It certainly doesn't matter at all for a pure Linux system.

/* 對, 但是 TZ 並不是那麼的關鍵. 在一個純 Linux 系統上它並不重要 */

Matthew Garrett <matthew.garrett@nebula.com> wrote:
>On Fri, 2013-12-20 at 08:57 -0800, H. Peter Anvin wrote:
>> But we prefer the TAD for that.  The case where the EFI runtime is
>the only source of that info is problematic as they are known to not
>work at runtime.  We could collect it at boot and then never change it,
>although you end up in definitional issues between EFI and the hw RTC.
>
>Most shipping UEFI hardware has no TAD.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


From Matthew Garrett <>
Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME
Date Fri, 20 Dec 2013 20:32:12 +0000

On Fri, 2013-12-20 at 12:29 -0800, H. Peter Anvin wrote:
> Yes, but the TZ isn't all that critical, either.  It certainly doesn't matter at all for a pure Linux system.

No, but it does matter for a great number of deployed Linux systems.
Dealing with the timezone over DST changes has been a perpetual problem,
and if we can make that work then life will be significantly better.

/* 不, 但它對於一個已大量佈署的 Linux 系統很重要. 處理在時區上的DST變化是個永遠的課題, 如果我們可以讓它運作, 那麼生活會更加美好 */
-- 
Matthew Garrett <matthew.garrett@nebula.com>

Date Fri, 20 Dec 2013 13:14:25 -0800
From "H. Peter Anvin" <>
Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME

On 12/20/2013 12:32 PM, Matthew Garrett wrote:
> On Fri, 2013-12-20 at 12:29 -0800, H. Peter Anvin wrote:
>> Yes, but the TZ isn't all that critical, either.  It certainly doesn't matter at all for a pure Linux system.
> 
> No, but it does matter for a great number of deployed Linux systems.
> Dealing with the timezone over DST changes has been a perpetual problem,
> and if we can make that work then life will be significantly better.
> 

And as I pointed out, it can matter a lot for VMs, since the provider
doesn't want to provision the VMs differently for different types of guests.

/* 就像我指出的, 它對虛擬機更重要, 因為供應商不希望對於不同型態的 guest 提供有差別的虛擬機 */

 -hpa


Date Fri, 20 Dec 2013 13:12:52 -0800
From "H. Peter Anvin" <>
Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME

On 12/20/2013 07:16 AM, Matthew Garrett wrote:
> On Thu, 2013-12-19 at 20:22 -0800, H. Peter Anvin wrote:
>> On 12/19/2013 08:05 PM, joeyli wrote:
>>> Can we use EFI time services on x86_64 after Borislav's patches accepted
>>> to mainline?
>>>
>>
>> No.
> 
> We will want to use them to (at minimum) obtain the clock timezone.
> Using them for general RTC access is less attractive.
> 

One option is to use the EFI runtime call to get and save the clock
timezone before we call ExitBootServices() in the EFI stub.  This
doesn't obviate the need for proper handling of the TAD, though,
especially since it is likely that future hardware will not have a RTC
in the current form (it is a way more complex device than is needed,
which wouldn't normally be a problem, but the fact that it has to
operate in the Vbat well makes it a major one.)

/* 有個選項是我們可以在 EFI stub 內, 調用 ExitBootServices() 之前, 使用 EFI runtime call 取得和儲存時鐘時區. 雖然這樣仍無法避免需妥善處理 TAD, 特別是因為未來的硬體很可能沒有現在這種形式的 RTC (它是一種比需求更複雜的設備, 這通常不會是個問題, 但事實上, 它必須具備在Vbat(電池)下運作良好的重大特性) */

 -hpa

Tuesday, December 22, 2015

Check the EFI time services usage status of Windows 10 on qemu

During Hackweek 13, I checked whether the newest Windows version, Windows 10, uses EFI time services to store timezone information in the UEFI firmware.

My original goal was to check how Windows 10 uses the ACPI Time and Alarm device, but I could not find a checked build of Windows 10, so I checked its use of the EFI time services instead.

Install Windows 10 with OVMF as a Qemu guest

First I needed to install a Windows 10 guest in qemu. I downloaded the Windows 10 evaluation edition from TechNet Evaluation Center:

10586.0.151029-1700.TH2_RELEASE_CLIENTENTERPRISEEVAL_OEMRET_X64FRE_EN-US.ISO

My host environment is openSUSE 13.2:

qemu-x86-2.1.3-7.2.x86_64
qemu-2.1.3-7.2.x86_64
qemu-kvm-2.1.3-7.2.x86_64
qemu-seabios-1.7.5-2.9.noarch
qemu-tools-2.1.3-7.2.x86_64
qemu-sgabios-8-2.9.noarch
qemu-ksm-2.1.0-2.9.x86_64
qemu-ovmf-x86_64-0.1+svn19110-11.1.noarch
qemu-vgabios-1.7.5-2.9.noarch

virt-install-1.3.0-329.4.noarch
libvirt-daemon-1.2.9-23.1.x86_64
libvirt-daemon-driver-qemu-1.2.9-23.1.x86_64
virt-manager-common-1.3.0-329.4.noarch
virt-manager-1.3.0-329.4.noarch
libvirt-daemon-qemu-1.2.9-23.1.x86_64
libvirt-client-1.2.9-23.1.x86_64

At the beginning I ran virt-install to install Windows 10 on qemu. Unfortunately it did not succeed, because the ovmf on openSUSE does not include Microsoft's key for the Windows platform. So I needed to disable secure boot before installing Windows 10. To keep the secure boot BIOS option, I set the nvram parameter in /etc/libvirt/qemu.conf before running virt-manager to install Windows 10:

vi /etc/libvirt/qemu.conf
...
nvram = [ "/usr/share/qemu/ovmf-x86_64-ms-code.bin:/usr/share/qemu/ovmf-x86_64-ms-vars.bin" ]

Then, while running the QEMU/KVM guest, I disabled secure boot in the UEFI options. The change should be saved to the nvram file:

# virsh dumpxml win8.1
...
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.1'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x86_64-ms-code.bin</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/win8.1_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>

Please ignore the wrong guest name win8.1 :p

Another problem when installing Windows 10 was that it always got stuck on the Windows start-up screen and never ran the installation process. After upgrading ovmf to qemu-ovmf-x86_64-0.1+svn19110-11.1.noarch, this issue was gone.

When using virt-manager to install Windows 10, you need to choose UEFI as the BIOS for the installation:

Choose "Customize configuration before install":

Choose Firmware:

Remember to disable secure boot first, before running the Windows 10 installation.

Windows 10 installation and boot succeeded on qemu:


Enable serial console log of OVMF

Thanks to Gary Lin for his help. He told me that the way to enable the serial console log of ovmf is to add the following qemu parameters:
    -global isa-debugcon.iobase=0x402 -debugcon file:debug.log

Adding the above parameters prints ovmf's debug log to the debug.log file. I am using libvirt, so I need to add those parameters to the domain XML. So, use virsh to edit the XML:

# virsh edit win8.1

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
[...snip]
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='isa-debugcon.iobase=0x402'/>
    <qemu:arg value='-debugcon'/>
    <qemu:arg value='file:/tmp/debug.log'/>
    <qemu:arg value='-s'/>
  </qemu:commandline>
</domain>

As shown above, add the qemu XML namespace to the domain element, then add a "qemu:commandline" element at the end of the domain. Please note that you need access rights to the folder that holds the debug.log file.
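To double-check an edited domain definition, the XML can be parsed and the passthrough arguments listed. The following is only a minimal Python sketch, not part of libvirt: the helper names qemu_args and has_debugcon are hypothetical, and EXAMPLE is a cut-down copy of the XML shown above. If the qemu namespace declaration is missing from the domain element, the parser rejects the unbound "qemu:" prefix.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the domain element shown in the post.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"


def qemu_args(domain_xml):
    """Return the values of all <qemu:arg> elements in a domain XML string."""
    root = ET.fromstring(domain_xml)
    path = f"{{{QEMU_NS}}}commandline/{{{QEMU_NS}}}arg"
    return [arg.get("value") for arg in root.findall(path)]


def has_debugcon(args):
    """True if args hold the iobase setting and a -debugcon with a target."""
    return ("isa-debugcon.iobase=0x402" in args
            and "-debugcon" in args
            and args.index("-debugcon") + 1 < len(args))


# Cut-down example document (hypothetical, for illustration only).
EXAMPLE = """
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='isa-debugcon.iobase=0x402'/>
    <qemu:arg value='-debugcon'/>
    <qemu:arg value='file:/tmp/debug.log'/>
  </qemu:commandline>
</domain>
"""

args = qemu_args(EXAMPLE)
print(args)
print(has_debugcon(args))
```

On a real system the input string could come from the output of "virsh dumpxml" for the guest.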

You can tail /tmp/debug.log after starting the Windows 10 guest:

# tail -f /tmp/debug.log
SecCoreStartupWithStack(0xFFFCC000, 0x818000)
Register PPI Notify: DCD0BE23-9586-40F4-B643-06522CED4EDE
Install PPI: 8C8CE578-8A3D-4F1C-9935-896185C32DD3
Install PPI: 5473C07A-3DCB-4DCA-BD6F-1E9689E7349A
The 0th FV start address is 0x00000820000, size is 0x000E0000, handle is 0x820000
Register PPI Notify: 49EDB1C1-BF21-4761-BB12-EB0031AABB39
Register PPI Notify: EA7CA24B-DED5-4DAD-A389-BF827E8F9B38
Install PPI: B9E0ABFE-5979-4914-977F-6DEE78C278A6
[...snip]

Add debug log to getTime()/setTime() in OVMF 

Here is my patch to EDK2 that adds debug logging in OVMF to detect how the OS uses these services:
    https://github.com/joeyli/edk2/commit/bf1dc51b2f2b9b9a765fdbaadc740b4806265fbb

EDK2 has several different code paths that implement the get time and set time functions. I added debug logging to the different code paths and, with Gary Lin's help, found the runtime services functions that OVMF actually uses. The key is to find the code that assigns the function pointers in the "SystemTable->RuntimeServices" table:

PcAtChipsetPkg/KbcResetDxe/ResetEntry.c:  SystemTable->RuntimeServices->ResetSystem = KbcResetSystem;

MdePkg/Library/UefiRuntimeServicesTableLib/UefiRuntimeServicesTableLib.c:  gRT = SystemTable->RuntimeServices;
PcAtChipsetPkg/PcatRealTimeClockRuntimeDxe/PcRtcEntry.c:
                                                          gRT->GetTime       = PcRtcEfiGetTime;
                                                          gRT->SetTime       = PcRtcEfiSetTime;
                                                          gRT->GetWakeupTime = PcRtcEfiGetWakeupTime;
                                                          gRT->SetWakeupTime = PcRtcEfiSetWakeupTime;

Also make sure those packages are used by the OvmfPkg package.
e.g.
ovmf-0.1+svn19289/OvmfPkg> grep -r "UefiRuntimeServicesTableLib" *
[...snip]
OvmfPkgX64.dsc:  UefiRuntimeServicesTableLib|MdePkg/Library/UefiRuntimeServicesTableLib/UefiRuntimeServicesTableLib.inf

The OVMF debug log of booting Windows 10

Here is the ovmf debug log from booting Windows 10:
    https://github.com/joeyli/hackweek/blob/master/windows10-uefi-time-services/windows10-qemu-ovmf-debug.log

From the first line to line 2908 is the log of the ovmf boot stage. I went to the UEFI menu and waited for all of the log to be written to the file, then selected the Windows 10 boot item in the UEFI boot manager. I added a "[[Windows 10 boot START]]" tag to windows10-qemu-ovmf-debug.log to mark where Windows 10 booting starts.

In the log file, I saw many "PcRtcEfiGetTime" entries but no "PcRtcEfiSetTime" entries during the Windows 10 boot process. After booting to the Windows 10 desktop, I tried to set the time through "Settings -> Time & language -> Date & time". But I did not see any ovmf log from Windows 10 calling runtime services. Actually, I did not see any runtime services log after Windows 10 finished booting, not even "VariableServiceGetVariable" or "VariableServiceSetVariable".
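This kind of check can be repeated mechanically by counting the function names on each side of the boot marker. Below is a minimal Python sketch under some assumptions: the function names appear verbatim in the log lines, the helper name count_calls is hypothetical, and the sample lines are made up for illustration and are not taken from the real log.

```python
from collections import Counter

# Marker that was added by hand to the log file, as described in the post.
BOOT_MARKER = "[[Windows 10 boot START]]"

# Names assumed to appear verbatim in the debug log lines.
TRACKED = (
    "PcRtcEfiGetTime",
    "PcRtcEfiSetTime",
    "VariableServiceGetVariable",
    "VariableServiceSetVariable",
)


def count_calls(lines):
    """Count tracked names before ("firmware") and after ("windows") the marker."""
    counts = {"firmware": Counter(), "windows": Counter()}
    phase = "firmware"
    for line in lines:
        if BOOT_MARKER in line:
            phase = "windows"
            continue
        for name in TRACKED:
            if name in line:
                counts[phase][name] += 1
    return counts


# Made-up sample lines; the real log format may differ.
sample = [
    "PcRtcEfiGetTime: called",
    "[[Windows 10 boot START]]",
    "PcRtcEfiGetTime: called",
    "PcRtcEfiGetTime: called",
]

result = count_calls(sample)
for phase in ("firmware", "windows"):
    for name in TRACKED:
        print(phase, name, result[phase][name])
```

To run it against the real file, pass an open file object instead of the sample list, for example count_calls(open("windows10-qemu-ovmf-debug.log", errors="replace")).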

Windows 10 is not aware of a timezone change made in the UEFI shell

Based on the ovmf debug log, it looks like Windows 10 does not use EFI time services to set the date/time and timezone in the EFI firmware. Then I booted to the EFI shell to change the timezone. The original timezone field is Local (2047):

    Shell> timezone
    Local
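The value 2047 is what the UEFI specification calls EFI_UNSPECIFIED_TIMEZONE; any other valid value of the TimeZone field in EFI_TIME is an offset in minutes in the range -1440 to 1440. A minimal Python sketch of that classification follows. The function name describe_timezone is hypothetical, and the sketch deliberately does not assert how the sign of the stored value maps to the "GMT+hh:mm" text that the shell prints.

```python
# 2047 (0x07FF) is EFI_UNSPECIFIED_TIMEZONE in the UEFI specification.
EFI_UNSPECIFIED_TIMEZONE = 0x07FF


def describe_timezone(value):
    """Classify a raw EFI_TIME.TimeZone value (a signed 16-bit integer)."""
    if value == EFI_UNSPECIFIED_TIMEZONE:
        return "unspecified (time is interpreted as local time)"
    if -1440 <= value <= 1440:
        hours, minutes = divmod(abs(value), 60)
        return f"{value} minutes from UTC ({hours}h{minutes:02d}m, sign as stored)"
    return "invalid"


for raw in (2047, 300, -480, 5000):
    print(raw, "->", describe_timezone(raw))
```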

I set it to GMT+05:00:

    Shell> timezone -s 5:00
    Shell> timezone
    GMT+05:00

Then I rebooted to Windows 10. The timezone field in "Settings -> Time & language -> Date & time" did not change to match the timezone set in the EFI shell. So Windows 10 is not aware of the timezone field from EFI time services, even though it calls the getTime() time service. It looks like it just ignores the field and keeps the timezone by itself.

Summary

The timezone and daylight fields in EFI time services are useful in a dual-boot environment for keeping the date/time status in sync between different OSes. Windows is the most popular OS, but it looks like it does not use the timezone field provided by EFI time services. As a result, there is not enough benefit in supporting this function. At least it does not help in a dual-boot environment.

The good news is that Windows 10 accesses the getTime() runtime service during the boot process, which means at least this function will be tested in every OEM/ODM QA process.