Tuesday, January 5, 2016

RAPL (Running Average Power Limit) driver (with Chinese translation comment)

RAPL (Running Average Power Limit) driver
Jacob Pan <jacob.jun.pan@linux.intel.com>
https://lwn.net/Articles/545745/

The RAPL (Running Average Power Limit) interface gives platform software
the ability to monitor, control, and get notifications on SoC power
consumption. Since its first appearance on Sandy Bridge, more features
have been added to extend its usage. In RAPL, platforms are divided into
domains for fine-grained control. These domains include package, DRAM
controller, CPU core (power plane 0), graphics uncore (power plane 1),
etc.

The purpose of this driver is to expose RAPL for userspace
consumption. Overall, RAPL fits in the generic thermal layer because
platform-level power capping and monitoring are mainly used for
thermal management, and the thermal layer provides the abstracted
interface needed to write portable applications.

Specifically, userspace is presented with a per-domain cooling device
that has a sysfs link to its kobject. Although a RAPL domain provides
many parameters for fine tuning, the long-term power limit is exposed as
the single knob via the cooling device state, while the rest of the
parameters remain accessible via the linked kobject. This simplifies
the interface for both simple and advanced use cases.

DETAILS
=======
1. sysfs layout

As an x86 platform driver, the RAPL driver binds with supported CPU IDs
during the probing phase. Once domains are discovered, a kobject is
created for each domain, and each kobject is linked with its cooling
device after that device is registered with the generic thermal layer.

e.g. the package RAPL domain is registered as cooling device #15, with
the link "device" pointing back to its kobject.

/sys/class/thermal/cooling_device15/
├── cur_state
├── device -> ../../../platform/intel_rapl/rapl_domains/package
├── max_state
├── power
├── subsystem -> ../../../../class/thermal
├── type
└── uevent

In the driver's private sysfs area, domain kobjects are grouped under a
kset which exposes global data.

/sys/devices/platform/intel_rapl/
├── driver -> ../../../bus/platform/drivers/intel_rapl
├── power
├── rapl_domains
│   ├── package
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device15
│   ├── power_plane_0
│   │   └── thermal_cooling -> ../../../../virtual/thermal/cooling_device16
│   └── power_plane_1
│       └── thermal_cooling -> ../../../../virtual/thermal/cooling_device18
└── subsystem -> ../../../bus/platform
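
The layout above can be walked from userspace. The following is only a
minimal sketch, assuming the directory and link names shown in the tree
(rapl_domains, thermal_cooling); the function name list_rapl_domains is
made up for illustration and is not part of the driver.

```python
import os

def list_rapl_domains(rapl_root="/sys/devices/platform/intel_rapl"):
    """Map each RAPL domain name to the cooling device its link points at."""
    domains_dir = os.path.join(rapl_root, "rapl_domains")
    result = {}
    for name in sorted(os.listdir(domains_dir)):
        # Assumed link name, taken from the tree shown above.
        link = os.path.join(domains_dir, name, "thermal_cooling")
        if os.path.islink(link):
            # Keep only the last path component, e.g. "cooling_device15".
            result[name] = os.path.basename(os.readlink(link))
    return result
```

On the topology above, this would be expected to map "package" to
"cooling_device15", and so on for the other domains.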

2. per domain parameters

These are the fine-tuning parameters, used only by advanced
power/thermal management applications. Refer to Intel SDM ch14 for
details.

root@chromoly:/sys/class/thermal/cooling_device15/device# grep . *
domain_name:package
energy:924228
lock:0
max_power:0
max_window:0
min_power:0
pl1_clamp:1
pl1_enable:1
pl2_clamp:0
pl2_enable:1
power:2276
power_limit1:12000
power_limit2:31250
thermal_spec_power:17000
throttle_time:
time_window1:28000
time_window2:0
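
A listing in the "name:value" form above is easy to load into a program.
The sketch below is illustrative only: parse_domain_params is a
hypothetical helper, and it makes no assumption about units, which the
listing does not state.

```python
def parse_domain_params(text):
    """Parse name:value lines into a dict; numeric values become ints."""
    params = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip lines that are not name:value pairs
        value = value.strip()
        if value == "":
            params[name] = None   # e.g. "throttle_time:" in the listing
        elif value.lstrip("-").isdigit():
            params[name] = int(value)
        else:
            params[name] = value  # e.g. "domain_name:package"
    return params

# A few lines copied from the listing above.
sample = """domain_name:package
power_limit1:12000
thermal_spec_power:17000
throttle_time:
"""
params = parse_domain_params(sample)
```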

3. event notifications

The RAPL driver uses eventfd to provide userspace notifications on
selected events. A file node called "event_control" is created for each
RAPL domain. The user can write a control file descriptor, an eventfd
descriptor, and a threshold to the event_control file. The user
application can then use poll/select or a blocking read to get
notifications from the driver. Multiple events are allowed for each
domain, but only a single threshold is accepted.
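
The registration step described above might look roughly like this from
a userspace program. This is a sketch under assumptions: the text does
not specify the exact format of the event_control string, so the
space-separated "<efd> <cfd> <threshold>" order used here is a guess, and
register_threshold is a hypothetical helper name.

```python
import os

def register_threshold(domain_dir, control_name, threshold, efd):
    """Write "<efd> <cfd> <threshold>" to the domain's event_control file.

    Returns the control fd, left open for the caller to close later
    (keeping it open is an assumption, not something the text states).
    """
    cfd = os.open(os.path.join(domain_dir, control_name), os.O_RDONLY)
    request = "%d %d %d" % (efd, cfd, threshold)  # assumed field order
    try:
        with open(os.path.join(domain_dir, "event_control"), "w") as f:
            f.write(request)
    except OSError:
        os.close(cfd)
        raise
    return cfd
```

On Linux with Python 3.10 or later, efd could be created with
os.eventfd(0), after which os.read(efd, 8) blocks until the eventfd is
signalled.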

4. Usage Examples (assume the topology in the sysfs layout above)

- set power limit to package domain (whole SOC package) to 6w
root@chromoly:~# echo 6000 > /sys/class/thermal/cooling_device15/cur_state

- set power limit to pp1 domain (graphics) to 4w
root@chromoly:~# echo 4000 > /sys/class/thermal/cooling_device18/cur_state

- check the current power usage in mWatts of pp1 domain
root@chromoly:~# cat /sys/class/thermal/cooling_device18/cur_state
61

- set event notification when power consumption of graphics unit crosses
  5w.
root@chromoly:~# event_fd_listener /sys/class/thermal/cooling_device18/device/power 5000
(event_fd_listener opens the control file "power" and creates an
eventfd, then writes efd, cfd, and threshold to the event_control file
of the given domain)
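
The cur_state writes in the first two examples can also be done from a
small script. The sketch below assumes, based on those examples, that
cur_state takes milliwatts (6000 for 6w) and that max_state uses the
same unit; set_power_limit is a hypothetical helper, not part of the
driver.

```python
import os

def set_power_limit(cooling_dev_dir, watts):
    """Write a limit in mW to cur_state, refusing values above max_state."""
    milliwatts = int(round(watts * 1000))  # assumed unit: milliwatts
    with open(os.path.join(cooling_dev_dir, "max_state")) as f:
        max_state = int(f.read().strip())
    if not 0 <= milliwatts <= max_state:
        raise ValueError("%d mW is outside 0..%d" % (milliwatts, max_state))
    with open(os.path.join(cooling_dev_dir, "cur_state"), "w") as f:
        f.write(str(milliwatts))
    return milliwatts
```

For the package domain above, the call would be
set_power_limit("/sys/class/thermal/cooling_device15", 6), run as root.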

Caveats:

1. Package power limit events are supported by the legacy thermal
reporting mechanism, which uses the local APIC thermal vector to
generate interrupts when targeted P-states are not honored by the
HW/FW. This is tied to machine check reporting. Until RAPL is used,
this notification is a rare exception. When a RAPL power limit is set
artificially low, this notification could result in unwanted interrupts
for each power limit excursion. Therefore, the RAPL driver attempts to
turn off the power limit notification interrupt when the user sets a
power limit.

2. According to the Intel Software Developer's Manual, the RAPL
interface can report max/min power for certain domains. In reality,
however, HW often reports 0 for max/min power. The RAPL driver works
around this by using thermal specification power or the current power
limit 1 when max power information is not available. As a result, the
max_state of a RAPL cooling device can be based on thermal spec power
or power limit 1.
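
The fallback in caveat 2 can be pictured as picking the first non-zero
value. This is only a sketch of the idea: the text does not say whether
thermal spec power or power limit 1 is preferred, so the order below is
an assumption, and effective_max_power is a hypothetical name.

```python
def effective_max_power(max_power, thermal_spec_power, power_limit1):
    """Return the first non-zero candidate, in the assumed preference order."""
    for candidate in (max_power, thermal_spec_power, power_limit1):
        if candidate:
            return candidate
    return 0  # nothing usable was reported
```

With the package listing shown earlier (max_power 0, thermal_spec_power
17000, power_limit1 12000), this selects 17000.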

3. Since RAPL is backed by FW, setting a RAPL power limit could fail
silently in case of FW failure or plain lack of support. I don't have a
good solution for that.

4. Data polling starts only when one of the following items is set:
        - power limit
        - events
