Wednesday, January 20, 2016

Safe device assignment with VFIO (with Chinese translation comment)

Safe device assignment with VFIO    By Jonathan Corbet
以VFIO作安全的設備分配
https://lwn.net/Articles/474088/

As a general rule, most developers feel that device drivers belong in the kernel.
Kernel-space drivers are (hopefully) widely reviewed, implement standard device
interfaces, perform better, and are more secure than the user-space variety.           
There are exceptions, though. Some high-performance applications want to talk           
to devices directly. Virtualized guests can also be thought of as a sort of            
user-space process; it is often desirable to allow guests to work with hardware         
directly rather than funneling their I/O through the host. So the kernel really
should support this mode of access for the times when it is needed.

以一般的標準來說, 大部分的開發人員覺得設備驅動程式屬於kernel.
Kernel-space 的驅動程式有被廣泛的審查(但願), 且實作了標準的設備介面, 執行起來
更好, 而且比起user-space類型(驅動程式)更加安全. 雖然有些例外. 一些高性能的應用
程式希望直接和設備溝通. 虛擬化的guest也可以被認為是一種user-space程序; 通常希望
允許虛擬機直接和硬體一起工作而不是讓他們的I/O穿隧通過host. 所以 kernel 真的應該在需要他(應用程式)的時候支援這樣的訪問模式.

The kernel's UIO interface has been available for the implementation of
user-space drivers for some years. UIO has some shortcomings, though, including
a lack of support for direct memory access (DMA) operations. DMA under
user-space control is challenging to support for a number of reasons, not the
least of which is security. A DMA-capable device is normally capable of writing
any page in memory; as a result, empowering a user-space process to set up DMA
operations is equivalent to giving it full root access. Sometimes a user-space
driver can be trusted with that access, but that is often not the case,
especially when virtualization is involved.

多年來 Kernel 的 UIO 介面已經可以用於實現 user-space 驅動程式. 雖然UIO有一些缺點,
這包含缺乏 DMA 操作的支援. 不僅僅是安全問題, 有數種原因造成在用戶空間下支援控制
DMA是個挑戰. 一個具備 DMA 能力的設備通常可以對記憶體中的任何頁進行寫入; 其結果是,
授權一個使用者空間下的進程去操作設置 DMA, 相當於給它完整的root存取(權限). 有時候
一個使用者空間的驅動程式可以被信任去做這樣的存取, 但情況往往並非如此, 尤其當虛擬
化(技術)參與時.

More recent CPUs have added support for safe (or safer) access to devices from
virtualized guests. Devices can be restricted, via an I/O memory management unit
(IOMMU) so that only specific regions of memory are accessible to them.
Technologies like KVM support a "device assignment" mechanism that uses the
hardware capabilities to hand a device to a guest, but device assignment is not
without its shortcomings. Among other things, device assignment alone cannot
guarantee the isolation of a specific device, and it involves a fair amount of
complexity in the kernel.

最近的CPU支援了從虛擬機內安全存取設備. 設備可以被限制住, 但透過I/O記憶體管理單元
(IOMMU)可以存取設備的特定區域的記憶體. 類似KVM這樣的技術支援了"設備分配"的機制,
它利用硬體功能將設備交給虛擬機, 但是設備分配並不是沒有缺點. 除此之外, 單單設備
分配(機制)並不能確保一個特定設備的隔離性, 而且它涉及相當程度的複雜性在kernel中.

Alex Williamson's VFIO patch set is an attempt to come up with a better solution
that allows the development of safe, high-performance user-space drivers. It
provides interfaces allowing those drivers to work with DMA and interrupts while
keeping overall control over how devices access the system's resources.

Alex Williamson 的 VFIO 補丁集嘗試提出一個更好的方案, 它讓使用者空間下的驅動程式
在開發上更安全且高效. 在全面控制設備如何去存取系統資源的狀況下, 它提供了介面允許
(user-space)驅動程式與DMA和中斷一起工作.

One problem with KVM's device assignment is that it assumes that all devices are
fully independent of each other. In particular, groups of devices may be
connected through the same IOMMU; that means that any device can access any
memory regions made available to any other devices in the same group. That, in
turn, implies that the group of devices must be assigned as a unit; if any of
those devices are assigned separately, the isolation of the group as a whole can
be broken.

KVM的設備分配有個問題是, 它假設所有的設備都是完全獨立於其他設備. 特別是同一個群組
的設備可能會透過相同的IOMMU連接; 這意味任何設備可以存取任何提供給其他同群組設備的記憶體區域. 反過來說, 也意味著該群組的所有設備都必須被分派成同一個單元. 如果任何
一個設備被單獨分配, 則整體群組的隔離被破壞.

So the first thing a VFIO driver writer will encounter is the group mechanism.
The VFIO code creates the groups to match the hardware topology. It then ensures
that every device in a group is controlled by a VFIO driver; if any device is
unavailable, then the group as a whole cannot be used. Most devices on a typical
system are unlikely to be bound to VFIO drivers at boot, so the system administrator
must explicitly unbind them and tell VFIO to claim them. This is probably a good
thing; exposing groups of devices to user space is best not done by default.

所以, VFIO驅動程式開發者會遭遇到的第一件事就是群組(group)機制. VFIO程式碼建立和硬體拓撲一致的多個群組. 然後它確保了所有在同一個群組下的設備都被同一個VFIO驅動所控制; 如果任何一個設備無法被使用, 則整個群組會被視為一體皆不可使用. 大多數傳統系統上的設備不太可能在開機的時候就綁定到 VFIO 驅動程式, 所以系統管理員必須確切的將它們解除綁定, 並且告訴 VFIO 去認領這些設備. 這可能是件好事, 最好不要預設就以群組為單位將設備曝露給使用者空間.
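
The unbind-and-claim step mentioned above can be sketched with plain sysfs writes. This is only a sketch under assumptions: it uses the sysfs layout of the vfio-pci driver as it was later merged into mainline (the patch set discussed in the article may have differed), the helper names `vfio_claim` and `group_devices` are made up here, and the PCI address and vendor/device IDs in the comments are placeholders. `SYSFS` can be pointed at a scratch directory for a dry run.

```shell
#!/bin/sh
# Sketch: hand one PCI device to vfio-pci through sysfs (mainline layout assumed).
# SYSFS defaults to /sys; set it to a scratch directory to try this harmlessly.

# List every device that shares an IOMMU group with the given device.
# All of them must be handed over before the group becomes usable.
group_devices() {
    ls "${SYSFS:-/sys}/bus/pci/devices/$1/iommu_group/devices"
}

# Unbind a device from its current driver and ask vfio-pci to claim it.
#   $1: PCI address, e.g. 0000:06:0d.0 (placeholder)
#   $2: "vendor device" ID pair, e.g. "1102 0002" (placeholder)
vfio_claim() {
    bdf=$1
    ids=$2
    sysfs=${SYSFS:-/sys}
    dev="$sysfs/bus/pci/devices/$bdf"

    # Detach the device from whatever driver currently owns it, if any.
    if [ -e "$dev/driver/unbind" ]; then
        printf '%s\n' "$bdf" > "$dev/driver/unbind" || return 1
    fi
    # Tell vfio-pci to bind devices with this vendor/device ID.
    printf '%s\n' "$ids" > "$sysfs/bus/pci/drivers/vfio-pci/new_id"
}
```

On a real system this needs root, and `vfio_claim` has to be repeated for each address printed by `group_devices`.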

For each group, a virtual device is created under /dev/vfio; prior to working with any individual device, a driver must open the group, claiming ownership of it. The access permissions on the group file control access to the underlying devices. Once the group has been opened, the driver should do an ioctl(VFIO_GROUP_GET_INFO) call to determine whether the group is "viable" (meaning all of the relevant devices are assigned to it) and available for use. If the group is not viable, the driver will not be able to proceed.

針對每個群組, 會有一個虛擬設備建立在 /dev/vfio; 在和任何個別的設備合作之前, 必須有一個驅動程式打開群組, 並聲明其所有權. 群組檔案的訪問權限就控制了對底層設備的存取. 一旦該群組被打開了驅動程式必須調用 ioctl(VFIO_GROUP_GET_INFO) 去判定該群組是否為"活的(viable)"(這意味著所有相關設備都被分配給它) 並可供使用. 如果群組不是活著, 驅動程式將無法繼續進行下去.

To work with specific devices, the driver will "open" them with the VFIO_GROUP_GET_DEVICE_FD ioctl() call, which returns a file descriptor for access to the device. The VFIO_DEVICE_GET_REGION_INFO command can be used to learn about the device's memory-mapped I/O regions, which can then be accessed via an mmap() call. VFIO_DEVICE_GET_IRQ_INFO returns information about the device's interrupt assignment(s); the driver can use the eventfd() mechanism to receive notification of interrupts via a file descriptor. For most hardware, access to MMIO and interrupts is enough to communicate with the device.

為了和特定設備一起工作, 驅動程式會用 VFIO_GROUP_GET_DEVICE_FD ioctl() call "打開"它們, 它會回傳一個檔案描述符(file descriptor)以用於存取該設備. VFIO_DEVICE_GET_REGION_INFO 命令可以被用來了解設備的記憶體映射I/O區域, 然後可以透過一個 mmap() 存取這些區域. VFIO_DEVICE_GET_IRQ_INFO 回傳了關於設備的中斷分配資訊; 驅動程式可以使用 eventfd() 機制來透過檔案描述符接收中斷通知. 對於多數的硬體, 存取 MMIO 和中斷就足以和設備溝通.

That still leaves the DMA problem, though. To that end, the VFIO_GROUP_GET_IOMMU_FD command returns a file descriptor representing the IOMMU. DMA mappings can be set up by filling in a vfio_dma_map structure:
    struct vfio_dma_map {
        __u32 argsz;
        __u32 flags;
        __u64 vaddr;    /* Process virtual address */
        __u64 iova;     /* IO virtual address */
        __u64 size;     /* Size of mapping (bytes) */
    };


This structure is used to request a mapping of the user-space memory found at vaddr (of size bytes) into the device's I/O memory range starting at iova; the VFIO_IOMMU_MAP_DMA command actually gets the work done. For most user-space drivers, that should be about all that is needed, modulo a few details.

Not all VFIO drivers will be in user space, though. Inside the kernel, VFIO looks like a special bus type to which devices can be bound. A VFIO driver needs to provide a set of operations to the core:
    struct vfio_device_ops {
        bool    (*match)(struct device *dev, const char *buf);
        int     (*claim)(struct device *dev);
        int     (*open)(void *device_data);
        void    (*release)(void *device_data);
        ssize_t (*read)(void *device_data, char __user *buf,
                        size_t count, loff_t *ppos);
        ssize_t (*write)(void *device_data, const char __user *buf,
                         size_t count, loff_t *size);
        long    (*ioctl)(void *device_data, unsigned int cmd,
                         unsigned long arg);
        int     (*mmap)(void *device_data, struct vm_area_struct *vma);
    };

Most of these operations are analogous to those found in struct file_operations or the bus-specific device structures. A device registered in this way can be opened and used like any other device with one difference: the interlock with group ownership is always enforced. If a device has been opened individually, the group is not "viable" and cannot be used by a user-space driver. If, instead, the group has been opened, the individual devices are busy and cannot be opened.

VFIO is not the only patch set aimed at this problem; David Gibson's device isolation infrastructure is also intended to enable safe assignment of devices. The scope of this patch set is smaller, though, focusing mostly on the grouping aspect; there is no mechanism for controlling the IOMMU or working with individual devices. There is a certain amount of disagreement between the two on how grouping should be managed which suggests, in turn, that a certain amount of discussion will have to take place before either can be merged.

Thursday, January 14, 2016

Check the signature of kernel module (old format before 3f1e1bea3 commit in v4.3-rc1)

I ran into a situation where I needed to check the signature blob attached to the end of a .ko file, which should be signed by the appropriate key. Here is a procedure for manually checking the signature with the public key.

Success Case

acer-wmi.ko (from kernel-default-3.12.28-4.6.x86_64.rpm):

STEP 1. Find out the "Subject Key Identifier"

linux-aiip:/lib/modules/3.12.28-4-default # modinfo acer-wmi
filename:       /lib/modules/3.12.28-4-default/kernel/drivers/platform/x86/acer-wmi.ko
[...snip]
signer:         SUSE Linux Enterprise Secure Boot Signkey
sig_key:        3F:B0:77:B6:CE:BC:6F:F2:52:2E:1C:14:8C:57:C7:77:C7:88:E3:E7
sig_hashalgo:   sha256

The sig_key matches the "Subject Key Identifier" of the SUSE Linux Enterprise Secure Boot
Signkey, so that key's public half should be able to decrypt the signature attached to acer-wmi.ko.
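
Comparing the identifiers by eye is error-prone because modinfo prints them as colon-separated uppercase hex while hexdump shows lowercase bytes separated by spaces. A small normalizer makes the comparison mechanical. This is just a sketch; `norm_keyid` is a made-up helper name, not an existing tool.

```shell
#!/bin/sh
# Normalize a key identifier: drop ':' and whitespace, lowercase the hex digits.
# norm_keyid is a hypothetical helper, written only for this comparison.
norm_keyid() {
    printf '%s' "$1" | tr -d ': \n\t' | tr 'A-F' 'a-f'
    echo
}

# Example comparison, using the module from this post:
#   mod_id=$(modinfo -F sig_key acer-wmi)
#   dump_id="3f b0 77 b6 ce bc 6f f2 52 2e 1c 14 8c 57 c7 77 c7 88 e3 e7"
#   [ "$(norm_keyid "$mod_id")" = "$(norm_keyid "$dump_id")" ] && echo match
```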


STEP 2. Extract the signature from acer-wmi.ko

The signature is attached after the key name and the Subject Key Identifier.
It is a 256-byte block, which is the size of an RSA signature made with a
2048-bit key (the SHA-256 digest inside it is only 32 bytes). In this case the
Subject Key Identifier is 3F B0 77 B6...E3 E7, so look for its position in acer-wmi.ko:

0000eb10  00 00 00 00 00 00 00 00  53 55 53 45 20 4c 69 6e  |........SUSE Lin|
0000eb20  75 78 20 45 6e 74 65 72  70 72 69 73 65 20 53 65  |ux Enterprise Se|
0000eb30  63 75 72 65 20 42 6f 6f  74 20 53 69 67 6e 6b 65  |cure Boot Signke|
0000eb40  79 3f b0 77 b6 ce bc 6f  f2 52 2e 1c 14 8c 57 c7  |y?.w...o.R....W.|  <=== 3F B0 77 B6...
0000eb50  77 c7 88 e3 e7 01 00 9b  eb 31 d2 cd f7 3a 65 92  |w........1...:e.|  <=== ...E3 E7 01 00 signature /* 60240 *//* 60240 + 7 = 60247 */
0000eb60  30 ee 2e d4 97 d2 7b 15  a0 e0 08 1f db 2d a7 9e  |0.....{......-..|
0000eb70  7f 0a 5f 25 ed 04 6e 95  2d 98 85 cc 98 5a 4f 08  |.._%..n.-....ZO.|

In this format the signature is preceded by a two-byte "01 00" header (read as
0x0100 = 256, which matches the signature length). We don't need the header, so
the signature starts at 60240 + 7 = 60247. Use dd to extract the signature to
another file:

dd skip=60247 count=256 bs=1 if=./acer-wmi.ko of=./acer-wmi.ko.sig

> hexdump -C acer-wmi.ko.sig
00000000  9b eb 31 d2 cd f7 3a 65  92 30 ee 2e d4 97 d2 7b  |..1...:e.0.....{|
00000010  15 a0 e0 08 1f db 2d a7  9e 7f 0a 5f 25 ed 04 6e  |......-...._%..n|
00000020  95 2d 98 85 cc 98 5a 4f  08 b5 c6 b5 1b 2f 87 52  |.-....ZO...../.R|
00000030  00 28 bb f0 6b bd f1 60  8e 58 be 18 0e 30 e7 dd  |.(..k..`.X...0..|
00000040  87 91 ce 1f 84 71 f8 83  f3 ba f7 07 68 c4 35 6d  |.....q......h.5m|
00000050  c4 3d 87 e7 ff 4c b2 20  ae b9 65 52 0f 56 38 38  |.=...L. ..eR.V88|
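
The offset arithmetic for dd is easy to get wrong by hand, so here is a sketch that does it in the shell. Both helper names are made up for this post, and reading the two bytes before the signature as a big-endian length is an assumption based on the "01 00" seen in the dump.

```shell
#!/bin/sh
# Decimal byte offset for dd: hexdump row offset plus the byte index in that row.
#   sig_skip 0xeb50 7   -> offset of the first signature byte in acer-wmi.ko
sig_skip() {
    echo $(( $1 + $2 ))
}

# Read the two bytes just before the signature as a big-endian number.
# Treating them as a length field is an assumption, not something modinfo reports.
#   $1: file, $2: decimal offset of the first signature byte
sig_len() {
    file=$1
    skip=$2
    # od prints the two bytes as unsigned decimals; split them into $1 and $2.
    set -- $(od -An -tu1 -j "$((skip - 2))" -N 2 "$file")
    echo $(( $1 * 256 + $2 ))
}
```

With these, the extraction becomes `dd skip="$(sig_skip 0xeb50 7)" count=256 bs=1 if=./acer-wmi.ko of=./acer-wmi.ko.sig`, and `sig_len` gives a quick sanity check on the count.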

STEP 3. Verify signature by the public key in SLES certificate

First, extract public key from certificate file:
openssl x509 -in SLES-UEFI-SIGN-Certificate.der -inform DER -pubkey -noout > SLES-UEFI-SIGN-Certificate.pub

Then decrypt the signature with the public key. If the signature was not produced
with the matching private key, this step will fail:

openssl rsautl -verify -inkey SLES-UEFI-SIGN-Certificate.pub -pubin -in acer-wmi.ko.sig > acer-wmi.ko.sha

The acer-wmi.ko.sha file is the decrypted signature. It consists of an ASN.1
header followed by the SHA256 hash:

> hexdump -C acer-wmi.ko.sha
00000000  30 31 30 0d 06 09 60 86  48 01 65 03 04 02 01 05  |010...`.H.e.....| <=== ASN.1 format head
00000010  00 04 20 ae 4a 31 b2 46  2b 1d e6 01 26 aa 38 2e  |.. .J1.F+...&.8.| <=== should match with the result of modhash
00000020  9d 3d ab 08 78 1a c2 85  b3 2f 87 96 3e 7f 15 7a  |.=..x..../..>..z|
00000030  31 b7 8c                                          |1..|
00000033

The attached perl script, modhash, was developed by Gary Lin to calculate the
hash of a signed ko file. Compare the result from modhash with the decrypted
hash value above to confirm that they match:
> perl modhash -v acer-wmi.ko
Hash algorithm: sha256
acer-wmi.ko: ae4a31b2462b1de60126aa382e9d3dab08781ac285b32f87963e7f157a31b78c

Following the same procedure, here is a failed case.

Failed Case

sample.ko:

It also claims to be signed by the SUSE Linux Enterprise Secure Boot Signkey:
linux-aiip:~ # modinfo sample
filename:       /lib/modules/3.12.28-4-default/updates/sample.ko
description:    xxxxxxxxxxxxxx
[...snip]
signer:         SUSE Linux Enterprise Secure Boot Signkey
sig_key:        3F:B0:77:B6:CE:BC:6F:F2:52:2E:1C:14:8C:57:C7:77:C7:88:E3:E7
sig_hashalgo:   sha256

The Subject Key Identifier is found in the .ko file, which looks OK:
00061790  70 61 67 65 73 5f 63 75  72 72 65 6e 74 00 53 55  |pages_current.SU|
000617a0  53 45 20 4c 69 6e 75 78  20 45 6e 74 65 72 70 72  |SE Linux Enterpr|
000617b0  69 73 65 20 53 65 63 75  72 65 20 42 6f 6f 74 20  |ise Secure Boot |
000617c0  53 69 67 6e 6b 65 79 3f  b0 77 b6 ce bc 6f f2 52  |Signkey?.w...o.R|  <=== 3F B0 77...
000617d0  2e 1c 14 8c 57 c7 77 c7  88 e3 e7 01 00 c6 12 1d  |....W.w.........|  <=== ...88 E3 E7 01 00 signature /* 399312 */ /* 399325 */
000617e0  ba 45 3a b3 b1 99 fb 55  1b fc d3 90 6a ea 92 64  |.E:....U....j..d|

Then extract signature:
dd skip=399325 count=256 bs=1 if=./sample.ko of=./sample.ko.sig
> hexdump -C sample.ko.sig
00000000  c6 12 1d ba 45 3a b3 b1  99 fb 55 1b fc d3 90 6a  |....E:....U....j|
00000010  ea 92 64 8a 04 04 f9 22  a7 74 35 98 05 d7 e6 85  |..d....".t5.....|
00000020  8c 5f 32 e6 6c 71 f7 ba  1c 0a 0f 8a 95 f3 ec c7  |._2.lq..........|
00000030  88 b2 11 71 27 28 ca b8  b8 55 ae df 56 38 c6 b4  |...q'(...U..V8..|

Unfortunately, the signature cannot be decrypted successfully with the public key:
openssl rsautl -verify -inkey SLES-UEFI-SIGN-Certificate.pub -pubin -in sample.ko.sig > sample.ko.sha
RSA operation error
139755852273296:error:0407006A:rsa routines:RSA_padding_check_PKCS1_type_1:block type is not 01:rsa_pk1.c:100:
139755852273296:error:04067072:rsa routines:RSA_EAY_PUBLIC_DECRYPT:padding check failed:rsa_eay.c:721:

The blob could not be decrypted with the public key, so it looks like the signature was not produced with the appropriate key.
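
When checking several modules, the pass/fail decision can be taken from the exit status of openssl instead of reading its error output. A minimal sketch, assuming the same `openssl rsautl` command used above is available (newer OpenSSL releases mark it as deprecated); `recover_digestinfo` is a made-up name.

```shell
#!/bin/sh
# Try to decrypt a signature blob with a public key.
#   $1: public key (PEM), $2: signature blob, $3: output file
# Returns openssl's exit status: zero when the padding check passes,
# non-zero for a case like sample.ko above.
recover_digestinfo() {
    pub=$1
    sig=$2
    out=$3
    openssl rsautl -verify -pubin -inkey "$pub" -in "$sig" -out "$out" 2>/dev/null
}

# Example:
#   if recover_digestinfo SLES-UEFI-SIGN-Certificate.pub sample.ko.sig sample.ko.sha
#   then echo "decrypted"; else echo "not signed by this key"; fi
```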

Tuesday, January 5, 2016

RAPL (Running Average Power Limit) driver (with Chinese translation comment)

RAPL (Running Average Power Limit) driver Jacob Pan <jacob.jun.pan@linux.intel.com>
RAPL (運行時期平均供電限制) 驅動程式
https://lwn.net/Articles/545745/

The RAPL (Running Average Power Limit) interface provides platform software
with the ability to monitor, control, and get notifications on SOC
power consumption. Since its first appearance on Sandy Bridge, more
features have been added to extend its usage. In RAPL, platforms are
divided into domains for fine-grained control. These domains include
package, DRAM controller, CPU core (power plane 0), graphics uncore
(power plane 1), etc.

在 RAPL, 硬體平台被區分成為幾個domain來進行細粒度調整. 這些domain包含
package, DRAM控制器, CPU核心 (電力平面0), 非核心顯卡 (電力平面1), 等等.

The purpose of this driver is to expose RAPL for userspace
consumption. Overall, RAPL fits in the generic thermal layer in
that platform level power capping and monitoring are mainly used for
thermal management and thermal layer provides the abstracted interface
needed to have portable applications.

整體來說, RAPL 適用於通用散熱層, 用在平台層級的電力封頂, 以及監控. 主要
為溫度管控以及散設層提供抽象介面以滿足便攜式應用的需求.

Specifically, userspace is presented with a per-domain cooling device
with sysfs links to its kobject. Although a RAPL domain provides many
parameters for fine tuning, the long-term power limit is exposed as the
single knob via the cooling device state, whereas the rest of the
parameters are still accessible via the linked kobject. This simplifies
the interface for both simple and advanced use cases.

具體而言, 在 userspace 會透過 sysfs 展現出來, 其背後聯結到每個domain的
冷卻裝置的 kobject. 雖然 RAPL domain 提供了很多可微調的參數, 但長時間的
功率限制乃是透過冷卻裝置狀態而揭露出來成為單獨的 knob. 然而其他的參數
仍然可以透過鏈接 kobject 來取用. 這樣同時簡化了簡單和進階使用場合中的
使用介面.

DETAILS 細節
=======
1. sysfs layout         sysfs 層

As an x86 platform driver, the RAPL driver binds with supported CPU ids
during the probing phase. Once domains are discovered, kobjects are created
for each domain, and these are also linked with cooling devices after
registration with the generic thermal layer.

作為一支 x86 平台驅動, RAPL驅動程式在試探時期綁定了被支援的CPU id. 一旦
domains 發現, kobjects 會為了每個domain而被創建出來, 它們也會在註冊到通用
散熱層之後聯結到散熱設備

e.g. the package RAPL domain is registered as cooling device #15, with a
"device" link back to its kobject.

範例: RAPL domain 包裹被註冊成為散熱設備15號, device 聯結回到它的kobject.

/sys/class/thermal/cooling_device15/
├── cur_state
├── device -> ../../../platform/intel_rapl/rapl_domains/package
├── max_state
├── power
├── subsystem -> ../../../../class/thermal
├── type
└── uevent

In driver's private sysfs area, domains kobjects are grouped under a
kset which exposes global data.

在驅動程式的私有 sysfs 區域, domains 的 kobjects 被歸納在一個 kset 下,
揭露全域資料.

/sys/devices/platform/intel_rapl/
├── driver -> ../../../bus/platform/drivers/intel_rapl
├── power
├── rapl_domains
│   ├── package
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device15
│   ├── power_plane_0
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device16
│   └── power_plane_1
│       └── thermal_cooling -> ../../../../virtual/thermal/cooling_device18
└── subsystem -> ../../../bus/platform
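
Given the layout above, the mapping from domain to cooling device can be read off the thermal_cooling links. The following is a sketch only; `list_rapl_domains` is a made-up helper, and the default path is the one from the tree shown here, which exists only with this patch applied.

```shell
#!/bin/sh
# Print "<domain> <cooling device>" for each RAPL domain under the given root.
#   $1: driver directory (defaults to the path from the tree above)
list_rapl_domains() {
    root=${1:-/sys/devices/platform/intel_rapl}
    for d in "$root"/rapl_domains/*/; do
        # Skip domains without a thermal_cooling link (or an unmatched glob).
        [ -L "${d}thermal_cooling" ] || continue
        echo "$(basename "$d") $(basename "$(readlink "${d}thermal_cooling")")"
    done
}
```

For the tree above this would list package, power_plane_0 and power_plane_1 next to cooling_device15, cooling_device16 and cooling_device18.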

2. per domain parameters        每個domain的參數

These are the fine tuning parameters only used by advanced
power/thermal management applications. Refer to Intel SDM ch14 for
details.

有一些可以微調的參數只能應用在功耗/溫度管理上. 細節請參考 Intel SDM
第14章.

root@chromoly:/sys/class/thermal/cooling_device15/device# grep . *
domain_name:package
energy:924228
lock:0
max_power:0
max_window:0
min_power:0
pl1_clamp:1
pl1_enable:1
pl2_clamp:0
pl2_enable:1
power:2276
power_limit1:12000
power_limit2:31250
thermal_spec_power:17000
throttle_time:
time_window1:28000
time_window2:0

3. event notifications          事件通知

RAPL driver uses eventfd to provide userspace notifications on selected
events. A file node called "event_control" is created for each RAPL
domain. User can write control file descriptor, eventfd descriptor, and
threshold to event_control file. Then, user application can use
poll/select or blocking read to get notifications from the driver.
Multiple events are allowed for each domain but only a single threshold
is accepted.

RAPL驅動程式利用 eventfd 為 userspace 所選定的事件來提供通知.
event_control 這一個檔案結點為每個 RAPL domain 被建立出來. 使用者可以
寫入控制檔案描述符, eventfd 描述符, 以及閥值到 event_control 檔案. 接著,
使用方的應用程式可以使用 輪詢/選擇 或者 阻斷讀取 來取得驅動程式的通知.
多事件在每一個 domain 都是可以被允許的, 但是只允許單一閥值.

4. Usage Examples (assume the topology in the sysfs layout above)
使用範例 (假設在 sysfs 層之上的拓撲)

- set power limit to package domain (whole SOC package) to 6w
- 設定封裝 domain 耗電限制在6w
root@chromoly:~# echo 6000 > /sys/class/thermal/cooling_device15/cur_state

- set power limit to pp1 domain (graphics) to 4w
- 設定 pp1 domain 耗電限制在4w
root@chromoly:~# echo 4000 > /sys/class/thermal/cooling_device18/cur_state

- check the current power usage in mWatts of pp1 domain
- 確認 pp1 domain 目前的耗電使用, 單位mWatts
root@chromoly:~# cat  /sys/class/thermal/cooling_device18/cur_state
61

- set event notification when power consumption of graphics unit crosses
  5w.
- 設定事件通知, 當繪圖單元的耗電高過5w時通知
root@chromoly:~# event_fd_listener /sys/class/thermal/cooling_device18/device/power 5000
(event_fd_listener opens control file power and creates an eventfd,
then write efd, cfd, threshold to event_control file of the given
domain)
(event_fd_listener 打開 power 控制檔案而且創建一個 eventfd, 然後寫入 efd,
cfd, 閥值 到domain的 event_control 檔案)
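
The echo/cat examples above can be wrapped so that limits are given in watts. This is a sketch under the assumption, taken from the examples, that cur_state is written in milliwatts; the helper names and the `THERMAL` override are made up here.

```shell
#!/bin/sh
# Root of the thermal class; override it to try the helpers on a scratch tree.
THERMAL=${THERMAL:-/sys/class/thermal}

# Set the long-term power limit of a RAPL cooling device.
#   $1: cooling device index (15 for package in the layout above), $2: watts
set_limit_watts() {
    idx=$1
    watts=$2
    echo $(( watts * 1000 )) > "$THERMAL/cooling_device$idx/cur_state"
}

# Read cur_state back. Per the example above, the driver reports the
# current power usage in mWatts here, not the limit that was written.
read_power_mw() {
    cat "$THERMAL/cooling_device$1/cur_state"
}
```

`set_limit_watts 15 6` then corresponds to the first example, and `read_power_mw 18` to the third.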

Caveats:        注意事項:

1. Package power limit events are supported by legacy thermal reporting
mechanism, which uses local APIC thermal vector to generate interrupts
when targeted P-states are not honored by the HW/FW. This is tied to
machine check reporting. Until RAPL is used, this notification is a rare
exception. When RAPL power limit is set artificially low, this
notification could result in unwanted interrupts for each power limit
excursion. Therefore, RAPL driver attempts to turn off the power limit
notification interrupt when user sets a power limit.

1. 傳統的溫度回報機制支援了包裹的耗電限制事件, 當目標 P-states 沒有被
   硬體/軔體 所履行時, 它會使用 local APIC 溫度向量來產生中斷. 它是綁定
   到機器檢查報告中. 直到 RAPL 使用之前, 這樣的通知都還是一種罕見的例外.
   當 RAPL 耗電限制被人為壓低時, 這樣的通知可能會導致在每次功率極限飄移
   時產生不必要的中斷. 因此, 當使用者設定功耗限制, 則RAPL 驅動會嘗試關閉
   功耗限制通知中斷.

2. According to the Intel Software Developer's Manual, the RAPL interface can report
max/min power for certain domains. But in reality HW often reports 0
for max/min power. RAPL driver tackles this problem by using thermal
specification power or current power limit1 when max power information
is not available. The result is that the max_state of a RAPL cooling
device can be based on thermal spec power or power limit 1.

2. 根據 Intel 軟體開發人員手冊 (SDM), RAPL 介面能彙報特定 domain 的
   最大/最小 耗電. 但在現實中, 硬體常常回報 最大/最小耗電 為零. 當最大
   耗電資訊不適用時, RAPL 驅動程式會利用 溫度規範功率或 當前電源限制 來
   逮住這個問題. 其結果是 RAPL 散熱設備的 max_state 可以基於溫度規範耗電
   或耗電限制1.

3. RAPL is backed by FW, so in case of FW failure or plain lack of
support, setting a RAPL power limit could result in silent failure. I
don't have a good solution for that.

3. 因為 RAPL 需要軔體的支援. 一旦軔體故障或者平面缺乏支援, 設定 RAPL
   耗電限制可能導至無提示的故障. 我並沒有好的解決方案.

4. Data polling starts only when the following items are set
        - power limit
        - events

4. 資料輪詢只有當下列項目設定時才回開始
        - 耗電限制
        - 事件

Power Efficient Idle Injection - Jacob Pan - LinuxCon Japan 2015 (with Chinese translation comment)

original slides: Power Efficient Idle Injection - Jacob Pan - LinuxCon Japan 2015

Power Efficient Idle Injection
        Jacob Pan
        Intel Open Source Technology Center
        LinuxCon Japan 2015

Why Injecting Idle?                                             /* 為何要注射 idle */
        * Primary: Thermal/Power limiting                       /* 主要: 溫度/能源 限制 */
        * Secondary:                                            /* 次要: */
                * Performance management                        /* 性能管理 */
                * Pay per use                                   /* 按使用付費 */
                * Idle power efficiency                         /* 閒置功耗效率 */

LFM (low frequency mode)

Idle Injection in Linux                                         /* Linux 上的閒置注入 */
        * Intel PowerClamp driver                               /* Intel PowerClamp 驅動程式 */
        * Scheduler throttling, RT or CFS bandwidth control     /* scheduler 節流 */

Intel Power Clamp V1
(current design in mainline kernel)
The idea: play idle!

Limitations of Intel PowerClamp V1                              /* Intel PowerClamp V1 的限制 */

        * CPU appears busy while playing idle                   /* 演繹 idle 時 CPU 仍呈現忙碌 */
        * Scheduler ticks not stopped in NOHZ idle              /* scheduler 的滴答在 NOHZ 空閒時仍沒有停止 */
                Removal of tick_nohz_idle_enter/exit() API
                RCU grace period
        * Relies on timely jiffies updates                      /* 依賴 jiffies 的及時更新 */

Limitations of Intel PowerClamp V1
        Relies on secondary timing source
        * timely jiffy updates
        * periodic timers

Scheduler Based Throttling                                      /* 以 scheduler 為基礎的節流 */
Normal tasks under completely fair scheduling (CFS) class
        * Bandwidth control via CPU control group/container     /* 頻寬控制是透過 CPU 控制組/容器 */
        * Runqueue throttling by enqueue/dequeue tasks          /* 對運行中的 queue 節流是通過把 tasks 排入或移出queue */
        
Time chart of CFS Bandwidth Control                                             /* CFS 頻寬控制 */
        * Pros: No fake idle task, Finer per cgroup controls                    /* 優點: 沒有假性閒置任務, 細到逐一cgroup的控制 */
        * Cons: No synchronization, loss of package C-state opportunities       /* 缺點: 對於可能有機會發生的 C-state 封包丟失沒有做同步 */

Power Clamp V2(work in progress)                                        /* Power Clamp V2 版本 */
        * Runqueue throttling of CFS class                              /* 針對 CFS class 作 runqueue 節流控制 */
        * Synchronization around rounded Ktime instead of jiffies       /* 圍繞著周圍的 Ktime 作同步, 而不是 jiffies */

Experiment Data                                                                 /* 實驗數據 */
        Goals:
                Comparing Power Efficiency                                      /* 比較能源效率 */
                Scalability                                                     /* 可擴展性 */  
                CPU HW design trend: old vs. new                                /* CPU 的硬體設計趨勢 */
        Configurations:                                                         /* 配置 */ 
                CPUs: Ivy Bridge/Haswell/Broadwell clients, Haswell EX server
                Workload:fspin by Len Brown. CPU bound, floating
                Test case: Inject idle from 0 to 50% at 5% increment            /* 注入閒置, 以 5% 的增量從 0 到 50% */
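
The slides do not show how the sweep was driven. A rough sketch of such a test loop follows, assuming the mainline intel_powerclamp driver, which registers a thermal cooling device of type "intel_powerclamp" whose cur_state is the idle percentage. The helper names and the hold time are made up, and the V2 work may expose a different interface.

```shell
#!/bin/sh
# Percentages used in the test case: 0 to 50 in steps of 5, one per line.
idle_steps() {
    p=0
    while [ "$p" -le 50 ]; do
        echo "$p"
        p=$((p + 5))
    done
}

# Find the cooling device whose type is intel_powerclamp (assumed interface).
#   $1: thermal class root, defaults to /sys/class/thermal
find_powerclamp() {
    root=${1:-/sys/class/thermal}
    for t in "$root"/cooling_device*/type; do
        [ -r "$t" ] || continue
        if [ "$(cat "$t")" = intel_powerclamp ]; then
            dirname "$t"
            return 0
        fi
    done
    return 1
}

# Step through the idle percentages, holding each for $2 seconds (default 10),
# then switch idle injection off again.
sweep_idle() {
    dev=$1
    hold=${2:-10}
    for p in $(idle_steps); do
        echo "$p" > "$dev/cur_state" || return 1
        sleep "$hold"
    done
    echo 0 > "$dev/cur_state"
}
```

A run would look like `sweep_idle "$(find_powerclamp)" 30` as root, with the workload and the power measurement running alongside.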

Power and Performance Control V1 vs. V2
        V1: RT kthread play idle 
        V2: CFS runqueue throttling

Comparing Deep vs. Shallow Package C-States
(powerclamp v2)

Conclusions
* Idle injection can effectively reduce power beyond energy efficient frequency         /* 閒置注射可以在節能型頻率以外再有效的降低耗能 */
* With deeper package C-states, can achieve near linear performance and power           /* 配合更深的 C-states 封包, 可達成線性效能以及降低功耗 */
  reduction
* Scheduler runqueue throttling results in cleaner and more efficient solution          /* 排程器 runqueue 節流可導至更清楚與有效的解決方法 */
* Align activities results in significant power savings                                 /* 對齊活動導至顯著降低功耗 */

Future plan
* Better handling of interrupts                         /* 更妥善的處理中斷 */
* Integration with scheduler                            /* 與排程器整合 */
* Synchronize with devices with latency tolerance       /* 依據延遲容忍進行設備間同步 */
* Work with hardware duty cycling                       /* 與硬體負載循環合作 */

powerclamp_timer_fn