Wednesday, December 23, 2015

Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME (with Chinese translation comment)


Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME
From "H. Peter Anvin" <>
Date Fri, 20 Dec 2013 08:57:03 -0800

But we prefer the TAD for that.  The case where the EFI runtime is the only source of that info is problematic as they are known to not work at runtime.  We could collect it at boot and then never change it, although you end up in definitional issues between EFI and the hw RTC.

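/* But we prefer the TAD for that. The case where the EFI runtime is the only source of that information is problematic, because the runtime services are known not to work at runtime. We could collect it at boot and never change it afterwards, but you then run into definitional issues between EFI and the hw RTC. */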

Matthew Garrett <matthew.garrett@nebula.com> wrote:
>On Thu, 2013-12-19 at 20:22 -0800, H. Peter Anvin wrote:
>> On 12/19/2013 08:05 PM, joeyli wrote:
>> > Can we use EFI time services on x86_64 after Borislav's patches
>accepted
>> > to mainline?
/* Once Borislav's patches are accepted upstream, can we use EFI time services on x86_64? */
>> > 
>> 
>> No.
/* No. */
>
>We will want to use them to (at minimum) obtain the clock timezone.
>Using them for general RTC access is less attractive.

/* We will want to use them (at minimum) to obtain the clock timezone. Using them for general RTC access is less attractive. */
-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


From Matthew Garrett <>
Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME
Date Fri, 20 Dec 2013 16:58:33 +0000

On Fri, 2013-12-20 at 08:57 -0800, H. Peter Anvin wrote:
> But we prefer the TAD for that.  The case where the EFI runtime is the only source of that info is problematic as they are known to not work at runtime.  We could collect it at boot and then never change it, although you end up in definitional issues between EFI and the hw RTC.

Most shipping UEFI hardware has no TAD.
/* Most shipping UEFI hardware has no TAD */

-- 
Matthew Garrett <matthew.garrett@nebula.com>


Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME
From "H. Peter Anvin" <>
Date Fri, 20 Dec 2013 12:29:51 -0800

Yes, but the TZ isn't all that critical, either.  It certainly doesn't matter at all for a pure Linux system.

/* Yes, but the TZ is not all that critical either. It does not matter at all on a pure Linux system */

Matthew Garrett <matthew.garrett@nebula.com> wrote:
>On Fri, 2013-12-20 at 08:57 -0800, H. Peter Anvin wrote:
>> But we prefer the TAD for that.  The case where the EFI runtime is
>the only source of that info is problematic as they are known to not
>work at runtime.  We could collect it at boot and then never change it,
>although you end up in definitional issues between EFI and the hw RTC.
>
>Most shipping UEFI hardware has no TAD.

-- 
Sent from my mobile phone.  Please pardon brevity and lack of formatting.


From Matthew Garrett <>
Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME
Date Fri, 20 Dec 2013 20:32:12 +0000

On Fri, 2013-12-20 at 12:29 -0800, H. Peter Anvin wrote:
> Yes, but the TZ isn't all that critical, either.  It certainly doesn't matter at all for a pure Linux system.

No, but it does matter for a great number of deployed Linux systems.
Dealing with the timezone over DST changes has been a perpetual problem,
and if we can make that work then life will be significantly better.

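/* No, but it matters for a great number of deployed Linux systems. Handling the timezone across DST changes has been a perpetual problem; if we can make that work, life will be significantly better */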
-- 
Matthew Garrett <matthew.garrett@nebula.com>

Date Fri, 20 Dec 2013 13:14:25 -0800
From "H. Peter Anvin" <>
Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME

On 12/20/2013 12:32 PM, Matthew Garrett wrote:
> On Fri, 2013-12-20 at 12:29 -0800, H. Peter Anvin wrote:
>> Yes, but the TZ isn't all that critical, either.  It certainly doesn't matter at all for a pure Linux system.
> 
> No, but it does matter for a great number of deployed Linux systems.
> Dealing with the timezone over DST changes has been a perpetual problem,
> and if we can make that work then life will be significantly better.
> 

And as I pointed out, it can matter a lot for VMs, since the provider
doesn't want to provision the VMs differently for different types of guests.

/* As I pointed out, it can matter a lot for VMs, because the provider does not want to provision the VMs differently for different types of guests */

 -hpa


Date Fri, 20 Dec 2013 13:12:52 -0800
From "H. Peter Anvin" <>
Subject Re: [RFC PATCH 00/14] Support timezone of ACPI TAD and EFI TIME

On 12/20/2013 07:16 AM, Matthew Garrett wrote:
> On Thu, 2013-12-19 at 20:22 -0800, H. Peter Anvin wrote:
>> On 12/19/2013 08:05 PM, joeyli wrote:
>>> Can we use EFI time services on x86_64 after Borislav's patches accepted
>>> to mainline?
>>>
>>
>> No.
> 
> We will want to use them to (at minimum) obtain the clock timezone.
> Using them for general RTC access is less attractive.
> 

One option is to use the EFI runtime call to get and save the clock
timezone before we call ExitBootServices() in the EFI stub.  This
doesn't obviate the need for proper handling of the TAD, though,
especially since it is likely that future hardware will not have a RTC
in the current form (it is a way more complex device than is needed,
which wouldn't normally be a problem, but the fact that it has to
operate in the Vbat well makes it a major one.)

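/* One option is to use the EFI runtime call in the EFI stub, before calling ExitBootServices(), to get and save the clock timezone. This still does not remove the need to handle the TAD properly, especially because future hardware will probably not have an RTC in its current form (it is a far more complex device than is needed, which would not normally be a problem, but the fact that it has to operate in the Vbat (battery) power well makes it a major one) */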

 -hpa

Tuesday, December 22, 2015

Check the EFI time services usage status of Windows 10 on qemu

During Hackweek 13, I checked whether the newest Windows version, Windows 10, uses EFI time services to store timezone information in the UEFI firmware.

My original goal was to check how Windows 10 uses the ACPI Time and Alarm device, but I could not find a checked build of Windows 10, so I checked the usage of EFI time services instead.

Install Windows 10 with OVMF as a Qemu guest

First I needed to install a Windows 10 guest in qemu. I downloaded the Windows 10 evaluation edition from the TechNet Evaluation Center:

10586.0.151029-1700.TH2_RELEASE_CLIENTENTERPRISEEVAL_OEMRET_X64FRE_EN-US.ISO

My host environment is openSUSE 13.2:

qemu-x86-2.1.3-7.2.x86_64
qemu-2.1.3-7.2.x86_64
qemu-kvm-2.1.3-7.2.x86_64
qemu-seabios-1.7.5-2.9.noarch
qemu-tools-2.1.3-7.2.x86_64
qemu-sgabios-8-2.9.noarch
qemu-ksm-2.1.0-2.9.x86_64
qemu-ovmf-x86_64-0.1+svn19110-11.1.noarch
qemu-vgabios-1.7.5-2.9.noarch

virt-install-1.3.0-329.4.noarch
libvirt-daemon-1.2.9-23.1.x86_64
libvirt-daemon-driver-qemu-1.2.9-23.1.x86_64
virt-manager-common-1.3.0-329.4.noarch
virt-manager-1.3.0-329.4.noarch
libvirt-daemon-qemu-1.2.9-23.1.x86_64
libvirt-client-1.2.9-23.1.x86_64

At first I ran virt-install to install Windows 10 on qemu. Unfortunately it did not succeed, because the ovmf build on openSUSE does not include Microsoft's key for the Windows platform, so I needed to disable secure boot before installing Windows 10. To keep the secure boot option in the firmware setup, I set the nvram parameter in /etc/libvirt/qemu.conf before running virt-manager to install Windows 10:

vi /etc/libvirt/qemu.conf
...
nvram = [ "/usr/share/qemu/ovmf-x86_64-ms-code.bin:/usr/share/qemu/ovmf-x86_64-ms-vars.bin" ]

Then, with the QEMU/KVM guest running, I disabled secure boot in the UEFI setup. The change is saved to the nvram file:

# virsh dumpxml win8.1
...
  <os>
    <type arch='x86_64' machine='pc-i440fx-2.1'>hvm</type>
    <loader readonly='yes' type='pflash'>/usr/share/qemu/ovmf-x86_64-ms-code.bin</loader>
    <nvram>/var/lib/libvirt/qemu/nvram/win8.1_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>

Please ignore the wrong guest name win8.1 :p

Another problem I hit while installing Windows 10 was that it always got stuck on the Windows startup screen and never started the installation process. After I upgraded ovmf to qemu-ovmf-x86_64-0.1+svn19110-11.1.noarch, the issue was gone.

When using virt-manager to install Windows 10, you need to choose UEFI as the firmware for the installation:

Choose "Customize configuration before install":

Choose Firmware:

Remember to disable secure boot before running the Windows 10 installation for the first time.

Windows 10 installed and booted successfully on qemu:


Enable serial console log of OVMF

Thanks to Gary Lin for his help. He told me that the way to enable the serial console log of ovmf is to add the following qemu parameters:
    -global isa-debugcon.iobase=0x402 -debugcon file:debug.log

These parameters make ovmf print its debug log to the debug.log file. I am using libvirt, so I need to add them to the domain XML, using virsh to edit it:

# virsh edit win8.1

<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
[...snip]
  <qemu:commandline>
    <qemu:arg value='-global'/>
    <qemu:arg value='isa-debugcon.iobase=0x402'/>
    <qemu:arg value='-debugcon'/>
    <qemu:arg value='file:/tmp/debug.log'/>
    <qemu:arg value='-s'/>
  </qemu:commandline>
</domain>

As shown above, add the qemu XML namespace to the domain element, then add the "qemu:commandline" element at the end of the domain. Note that you need write access to the folder that holds the debug.log file.

You can tail /tmp/debug.log after starting the Windows 10 guest:

# tail -f /tmp/debug.log
SecCoreStartupWithStack(0xFFFCC000, 0x818000)
Register PPI Notify: DCD0BE23-9586-40F4-B643-06522CED4EDE
Install PPI: 8C8CE578-8A3D-4F1C-9935-896185C32DD3
Install PPI: 5473C07A-3DCB-4DCA-BD6F-1E9689E7349A
The 0th FV start address is 0x00000820000, size is 0x000E0000, handle is 0x820000
Register PPI Notify: 49EDB1C1-BF21-4761-BB12-EB0031AABB39
Register PPI Notify: EA7CA24B-DED5-4DAD-A389-BF827E8F9B38
Install PPI: B9E0ABFE-5979-4914-977F-6DEE78C278A6
[...snip]

Add debug log to getTime()/setTime() in OVMF 

Here is my patch to EDK2 that adds debug logging to OVMF, to detect which services the OS uses:
    https://github.com/joeyli/edk2/commit/bf1dc51b2f2b9b9a765fdbaadc740b4806265fbb

EDK2 has several different code paths that implement the get time and set time functions. I added debug logging to the different code paths and, with Gary Lin's help, found the runtime services functions that OVMF actually uses. The key is to find the code that assigns the function pointers in the "SystemTable->RuntimeServices" table:

PcAtChipsetPkg/KbcResetDxe/ResetEntry.c:  SystemTable->RuntimeServices->ResetSystem = KbcResetSystem;

MdePkg/Library/UefiRuntimeServicesTableLib/UefiRuntimeServicesTableLib.c:  gRT = SystemTable->RuntimeServices;
PcAtChipsetPkg/PcatRealTimeClockRuntimeDxe/PcRtcEntry.c:
                                                          gRT->GetTime       = PcRtcEfiGetTime;
                                                          gRT->SetTime       = PcRtcEfiSetTime;
                                                          gRT->GetWakeupTime = PcRtcEfiGetWakeupTime;
                                                          gRT->SetWakeupTime = PcRtcEfiSetWakeupTime;

You should also make sure that those packages are used by the OvmfPkg package, e.g.:
ovmf-0.1+svn19289/OvmfPkg> grep -r "UefiRuntimeServicesTableLib" *
[...snip]
OvmfPkgX64.dsc:  UefiRuntimeServicesTableLib|MdePkg/Library/UefiRuntimeServicesTableLib/UefiRuntimeServicesTableLib.inf

The OVMF debug log of booting Windows 10

Here is the ovmf debug log from booting Windows 10:
    https://github.com/joeyli/hackweek/blob/master/windows10-uefi-time-services/windows10-qemu-ovmf-debug.log

Lines 1 through 2908 are the log of the ovmf boot stage. I entered the UEFI menu and waited until all of the log had been written to the file, then selected the Windows 10 boot entry in the UEFI boot manager. I added a "[[Windows 10 boot START]]" tag to windows10-qemu-ovmf-debug.log to mark where the Windows 10 boot starts.

In the log file I saw many "PcRtcEfiGetTime" entries but no "PcRtcEfiSetTime" entries during the Windows 10 boot process. After booting to the Windows 10 desktop, I tried to set the time through "Settings -> Time & language -> Date & time", but I did not see any ovmf log of Windows 10 calling runtime services. In fact, I did not see any runtime services log after Windows 10 finished booting, not even "VariableServiceGetVariable" or "VariableServiceSetVariable".
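
The counting I did by eye can also be scripted. Here is a minimal Python sketch, assuming only that each log line contains the service name as plain text. The function name count_runtime_calls is made up for illustration; the marker and the service names are the ones mentioned above.

```python
# Count runtime-service log entries that appear after the boot marker.
# Assumption: a log line "mentions" a service if the name occurs as a substring.

BOOT_MARKER = "[[Windows 10 boot START]]"
SERVICES = (
    "PcRtcEfiGetTime",
    "PcRtcEfiSetTime",
    "VariableServiceGetVariable",
    "VariableServiceSetVariable",
)


def count_runtime_calls(lines, marker=BOOT_MARKER, services=SERVICES):
    """Return {service: count} for lines that come after the marker line."""
    counts = {name: 0 for name in services}
    seen_marker = False
    for line in lines:
        if not seen_marker:
            # Everything up to and including the marker is firmware boot log.
            seen_marker = marker in line
            continue
        for name in services:
            if name in line:
                counts[name] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_calls.py windows10-qemu-ovmf-debug.log
    with open(sys.argv[1], errors="replace") as log:
        for name, count in count_runtime_calls(log).items():
            print(f"{name}: {count}")
```

A service with a count of zero after the marker was never logged while Windows 10 was booting or running.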

Windows 10 is not aware of timezone changes made in the UEFI shell

Based on the ovmf debug log, it looks like Windows 10 does not use EFI time services to set the date/time and timezone in the EFI firmware. I then booted to the EFI shell to change the timezone. The original timezone field was Local (2047):

    Shell> timezone
    Local

I set it to GMT+05:00:

    Shell> timezone -s 5:00
    Shell> timezone
    GMT+05:00

After rebooting to Windows 10, I did not see the timezone field in "Settings -> Time & language -> Date & time" change to match the timezone set in the EFI shell. So Windows 10 is not aware of the timezone field from EFI time services, even though it calls the getTime() service. It appears to ignore the field and keep track of the timezone itself.
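
For reference, the value 2047 shown above is the "unspecified timezone" value of the timezone field. The following Python sketch shows how I understand the field is meant to be decoded. It assumes the rule from the UEFI specification that the field is an offset in minutes with Localtime = UTC - TimeZone; the function name describe_efi_timezone is hypothetical, and the output format is my own choice, not what the EFI shell prints.

```python
# Decode the TimeZone field of EFI_TIME (a signed 16-bit value in minutes).
# Assumption (my reading of the UEFI spec): Localtime = UTC - TimeZone.

EFI_UNSPECIFIED_TIMEZONE = 0x07FF  # 2047: the time is plain local time


def describe_efi_timezone(tz_minutes):
    """Return 'Local' or a 'UTC+hh:mm' style string for a TimeZone value."""
    if tz_minutes == EFI_UNSPECIFIED_TIMEZONE:
        return "Local"
    if not -1440 <= tz_minutes <= 1440:
        raise ValueError(f"TimeZone out of range: {tz_minutes}")
    # Local time is UTC minus the field, so the UTC offset has the opposite sign.
    offset = -tz_minutes
    sign = "+" if offset >= 0 else "-"
    hours, minutes = divmod(abs(offset), 60)
    return f"UTC{sign}{hours:02d}:{minutes:02d}"


if __name__ == "__main__":
    for value in (2047, 0, -300, 480):
        print(value, "->", describe_efi_timezone(value))
```

An OS that honored the field would have to treat 2047 as "no timezone information" and fall back to its own setting, which is effectively what Windows 10 seems to do for every value.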

Summary

The timezone and daylight fields in EFI time services are useful in a dual-boot environment, to keep the date/time state in sync between different OSes. Windows is the most popular OS, but it appears not to use the timezone field provided by EFI time services. As a result, there is not enough benefit in supporting this function; at least it does not help in a dual-boot environment.

The good news is that Windows 10 calls the getTime() runtime service during boot, which means that at least this function will be tested in every OEM/ODM QA process.

Tuesday, November 17, 2015

Re: [GIT PULL] x86/mm changes for v4.4 (with Chinese translation comment)

Prior read: Re: [PATCH v2] x86/mm: warn on W+x mappings

Date: Fri, 6 Nov 2015 11:39:43 +0000
From: Matt Fleming <matt@codeblueprint.co.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Jones <davej@codemonkey.org.uk>, Ingo Molnar <mingo@kernel.org>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>,
        Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Stephen Smalley <sds@tycho.nsa.gov>,
        linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.24 (2015-08-30)

We have separate page tables today, for a few reasons, but mainly it's
/* For a few reasons, we currently have separate page tables */
so that we can have an identity mapping of memory present in the
/* The main reason is so that we can have an identity mapping (1:1) */
region usually used by user processes - broken firmware still uses
/* of the region normally used by user processes; broken firmware still uses those identity mappings */
those identity mappings even after the kernel tells it they're
/* even after the kernel has told it that they are invalid */
invalid.

Note that when I say "separate" I'm talking about trampoline_pgd[]
which is also used by the x86 suspend/resume code.
/* Note that by "separate" I mean trampoline_pgd[], which is also used by the x86 suspend/resume code */

However, turns out that the issue with the current scheme is the fact
/* It turns out that the problem with the current scheme is that trampoline_pgd[] actually shares a couple of PGD entries with swapper_pg_dir */
that trampoline_pgd[] actually shares a couple of PGD entries with
swapper_pg_dir as can be seen in setup_real_mode(),


        trampoline_pgd = (u64 *)__va(real_mode_header->trampoline_pgd);
        trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd;
        trampoline_pgd[511] = init_level4_pgt[511].pgd;

So when we map the EFI regions in efi_map_regions() we're inserting
/* So when we map the EFI regions, they are inserted into swapper_pg_dir as well */
them into swapper_pg_dir also, which is why you're seeing the
warnings.
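
As a side note to Matt's explanation, the sharing can be modelled in a few lines of Python. This is only a toy model under stated assumptions: each PGD is a list of 512 slots, a lower-level table is a plain dict, __PAGE_OFFSET is assumed to be 0xffff880000000000 (typical for kernels of that time), and the EFI virtual address in the demo is hypothetical. None of the helper names are kernel functions except pgd_index, which mirrors the kernel macro for 4-level paging.

```python
# Toy model of two top-level page tables that share PGD entries by reference.
# Assumptions: 4-level x86_64 paging, PAGE_OFFSET value, dicts as lower tables.

PGDIR_SHIFT = 39
PTRS_PER_PGD = 512
PAGE_OFFSET = 0xFFFF880000000000  # assumed direct-mapping base


def pgd_index(vaddr):
    """Index of the PGD slot covering vaddr (bits 47..39)."""
    return (vaddr >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1)


def map_page(pgd, vaddr, perms):
    """Insert a mapping, creating the lower-level table on demand."""
    idx = pgd_index(vaddr)
    if pgd[idx] is None:
        pgd[idx] = {}
    pgd[idx][vaddr] = perms


def lookup(pgd, vaddr):
    """Return the permissions mapped at vaddr, or None if unmapped."""
    table = pgd[pgd_index(vaddr)]
    return None if table is None else table.get(vaddr)


def build_tables():
    """Mimic the three quoted lines from setup_real_mode()."""
    kernel_pgd = [None] * PTRS_PER_PGD
    kernel_pgd[pgd_index(PAGE_OFFSET)] = {}
    kernel_pgd[511] = {}
    trampoline_pgd = [None] * PTRS_PER_PGD
    trampoline_pgd[0] = kernel_pgd[pgd_index(PAGE_OFFSET)]  # shared slot
    trampoline_pgd[511] = kernel_pgd[511]                   # shared slot
    return kernel_pgd, trampoline_pgd


if __name__ == "__main__":
    kernel_pgd, trampoline_pgd = build_tables()
    efi_va = 0xFFFFFFEF00000000  # hypothetical EFI runtime mapping address
    map_page(trampoline_pgd, efi_va, "RWX")
    print("seen from kernel table:", lookup(kernel_pgd, efi_va))
```

Because slot 511 is the same object in both tables, a mapping inserted through the trampoline table is visible from the kernel table too, which is why a W+X check on the kernel page tables would report it.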

If I remember correctly the rationale for using trampoline_pgd[] was
/* trampoline_pgd[] was used because it already did what we wanted (it provides the identity mapping) */
that it already did what we wanted (provided the identity mapping) and
would save us the overhead of maintaining more page tables for no good
/* and it saves us the overhead of maintaining more page tables */
reason. Obviously this entire thread is a good reason.

I suggest we stop using trampoline_pgd[] (since it has a good reason 
/* I suggest we stop using trampoline_pgd[] (it has a good reason to share the kernel mapping PGD entries) */
for sharing the kernel mapping PGD entries) and create our own so that
/* and create our own (PGD) so that we can isolate EFI completely */
we can isolate EFI completely.

For the immediate problem of the warnings spewing forth on all UEFI
machines, at the very least the config options needs to be disabled by
/* at the very least, the config option must be disabled by default */
default, if not the patch reverted.



Date: Sat, 7 Nov 2015 08:05:54 +0100
From: Ingo Molnar <mingo@kernel.org>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin"
        <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Stephen Smalley <sds@tycho.nsa.gov>,
        linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.23 (2014-03-12)


* Matt Fleming <matt@codeblueprint.co.uk> wrote:

> On Thu, 05 Nov, at 01:33:10PM, Linus Torvalds wrote:
[...]
> I suggest we stop using trampoline_pgd[] (since it has a good reason
> for sharing the kernel mapping PGD entries) and create our own so that
> we can isolate EFI completely.

Ok. Could you please make this fix a priority for upcoming EFI changes?

> For the immediate problem of the warnings spewing forth on all UEFI
> machines, at the very least the config options needs to be disabled by
> default, if not the patch reverted.

We'll certainly flip around the default, but reverting would be shooting
/* We will certainly flip the default */
the messenger: the EFI code is endangering everyone else today, and for
/* the EFI code is endangering everyone else, and apparently for no good reason */
no good reason as it appears... so the warning very much served its
/* so this warning (CONFIG_DEBUG_WX) very much served its purpose by pointing out a valid problem */
purpose in pointing out a valid problem.

Thanks,

        Ingo



Date: Fri, 6 Nov 2015 12:39:12 +0000
From: Matt Fleming <matt@codeblueprint.co.uk>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner
        <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko
        <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org, Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.24 (2015-08-30)

On Fri, 06 Nov, at 07:55:50AM, Ingo Molnar wrote:
>
>  3) We should fix the EFI permission problem without relying on the firmware: it
/* We must fix the EFI permission problem without relying on the firmware */
>     appears we could just mark everything R-X optimistically, and if a write fault
/* We could optimistically mark everything R-X */
>     happens (it's pretty rare in fact, only triggers when we write to an EFI
/* and when a write fault happens (which is rare; it only triggers when we write to an EFI variable), we can mark the faulting page RW- on the fly */
>     variable and so), we can mark the faulting page RW- on the fly, because it
>     appears that writable EFI sections, while not enumerated very well in 'old'
/* because it appears that writable EFI sections, although not enumerated well by old firmware, are still supposed to be page granular */
>     firmware, are still supposed to be page granular. (Even 'new' firmware I 
/* (even with new firmware, I would not automatically trust the enumeration to be right...) */
>     wouldn't automatically trust to get the enumeration right...)

Sorry, this isn't true. I misled you with one of my earlier posts on
/* Sorry, this is wrong; I misled you */
this topic. Let me try and clear things up...

Writing to EFI regions has to do with every invocation of the EFI
/* Writes to EFI regions happen on every invocation of the EFI runtime services, not only when reading/writing/deleting EFI variables */
runtime services - it's not limited to when you read/write/delete EFI
variables. In fact, EFI variables really have nothing to do with this
/* In fact, EFI variables really have nothing to do with this discussion */
discussion, they're a completely opaque concept to the OS, we have no
/* to the OS they are a completely opaque concept */
idea how the firmware implements them. Everything is done via the EFI
boot/runtime services.

The firmware itself will attempt to write to EFI regions when we
/* When we invoke the EFI services, the firmware itself will try to write to EFI regions, because that is where the PE/COFF .data and .bss sections live, along with the heap */
invoke the EFI services because that's where the PE/COFF ".data" and
".bss" sections live along with the heap. There's even some relocation
/* There are even some relocation fixups that happen at SetVirtualAddressMap() time, so it writes to .text too */
fixups that occur as SetVirtualAddressMap() time so it'll write to
".text" too.

Now, the above PE/COFF sections are usually (always?) contained within
/* The PE/COFF sections above are usually (always?) contained within EFI regions of type EfiRuntimeServicesCode */
EFI regions of type EfiRuntimeServicesCode. We know this is true
/* We know this is true because the firmware developers told us so */
because the firmware folks have told us so, and because stopping that
/* and because stopping that is the motivation behind the new EFI_PROPERTIES_TABLE feature */
is the motivation behind the new EFI_PROPERTIES_TABLE feature in UEFI
V2.5.

The data sections within the region are also *not* guaranteed to be
/* The data sections within the region are also not guaranteed to be page granular */
page granular because work was required in Tianocore for emitting
/* because work was needed in Tianocore to emit sections with 4k alignment as part of the EFI_PROPERTIES_TABLE support */
sections with 4k alignment as part of the EFI_PROPERTIES_TABLE
support.

Ultimately, what this means is that if you were to attempt to
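/* Ultimately this means that if you tried to dynamically fix up the regions that need write permission, you would have to modify the mappings for most of the EFI regions anyway */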
dynamically fixup those regions that required write permission, you'd
have to modify the mappings for the majority of the EFI regions
anyway. And if you're blindly allowing write permission as a fixup,
/* and if you blindly grant write permission as a fixup, there is not much security to gain */
there's not much security to be had.

>     If that 'supposed to be' turns out to be 'not true' (not unheard of in
/* If that "supposed to be" turns out to be "not true" (not unheard of in firmware land) */
>     firmware land), then plan B would be to mark pages that generate write faults
/* then plan B would be to mark the pages that generate write faults RWX as well, so that functionality is not broken */
>     RWX as well, to not break functionality. (This 'mark it RWX' is not something
/* This "mark it RWX" step is not something exploits could easily reach, and we could also emit a warning [after the EFI call has finished] if it ever triggers */
>     that exploits would have easy access to, and we could also generate a warning
>     [after the EFI call has finished] if it ever triggers.)
>
>     Admittedly this approach might not be without its own complications, but it
/* Admittedly, this approach might not be without its own complications, */
>     looks reasonably simple (I don't think we need per EFI call page tables,
/* but it looks reasonably simple (I do not think we need per-EFI-call page tables, etc.) */
>     etc.), and does not assume much about the firmware being able to enumerate its
/* and it does not assume much about the firmware's ability to enumerate its permissions correctly */
>     permissions properly. Were we to merge EFI support today I'd have insisted on
>     trying such an approach from day 1 on.

We already have separate EFI page tables, though with the caveat that
/* We already have separate EFI page tables */ /* but with the caveat that */
we share some of swapper_pg_dir's PGD entries. The best solution would
/* we share some of swapper_pg_dir's PGD entries. */
be to stop sharing entries and isolate the EFI mappings from every
/* The best solution is to stop sharing entries and isolate the EFI mappings from every other page table structure */
other page table structure, so that they're only used during the EFI
/* so that they (the EFI mapping page tables) are only used during EFI service calls */
service calls.



Date: Sat, 7 Nov 2015 08:09:22 +0100
From: Ingo Molnar <mingo@kernel.org>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner
        <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko
        <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org, Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.23 (2014-03-12)


* Matt Fleming <matt@codeblueprint.co.uk> wrote:

> On Fri, 06 Nov, at 07:55:50AM, Ingo Molnar wrote:
> >
[...]
>
> Ultimately, what this means is that if you were to attempt to
> dynamically fixup those regions that required write permission, you'd
> have to modify the mappings for the majority of the EFI regions
> anyway. And if you're blindly allowing write permission as a fixup,
> there's not much security to be had.

I think you misunderstood my suggestion: the 'fixup' would be changing it from R-X
/* The "fixup" means changing R-X to RW-, i.e. it adds write permission but removes execute permission */
to RW-, i.e. it would add 'write' permission but remove 'execute' permission.

Note that there would be no 'RWX' permission at any given moment - which is the
/* Note that RWX permission, the dangerous combination, would then never exist at any moment */
dangerous combination.

> >     If that 'supposed to be' turns out to be 'not true' (not unheard of in
> >     firmware land), then plan B would be to mark pages that generate write faults
> >     RWX as well, to not break functionality. (This 'mark it RWX' is not something
> >     that exploits would have easy access to, and we could also generate a warning
> >     [after the EFI call has finished] if it ever triggers.)
> >
> >     Admittedly this approach might not be without its own complications, but it
> >     looks reasonably simple (I don't think we need per EFI call page tables,
> >     etc.), and does not assume much about the firmware being able to enumerate its
> >     permissions properly. Were we to merge EFI support today I'd have insisted on
> >     trying such an approach from day 1 on.
>
> We already have separate EFI page tables, though with the caveat that
> we share some of swapper_pg_dir's PGD entries. The best solution would
> be to stop sharing entries and isolate the EFI mappings from every
> other page table structure, so that they're only used during the EFI
> service calls.

Absolutely. Can you try to fix this for v4.3?

Thanks,

        Ingo



Date: Sat, 7 Nov 2015 08:39:35 +0100
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: Ingo Molnar <mingo@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List
        <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski
        <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Matt Fleming <matt@codeblueprint.co.uk> wrote:
>
[...]
>
> I think you misunderstood my suggestion: the 'fixup' would be changing it from R-X
> to RW-, i.e. it would add 'write' permission but remove 'execute' permission.
>
> Note that there would be no 'RWX' permission at any given moment - which is the
> dangerous combination.
>

The problem with that is that /any/ page in the UEFI runtime region
/* The problem is that any page in the UEFI runtime region may intersect with both the .text and the .data of any of the PE/COFF images that make up the runtime firmware */
may intersect with both .text and .data of any of the PE/COFF images
that make up the runtime firmware (since the PE/COFF sections are not
/* because PE/COFF sections are not necessarily page aligned */
necessarily page aligned). Such pages require RWX permissions. The
/* Such pages need RWX permissions */
UEFI memory map does not provide the information to identify those
/* The UEFI memory map does not provide the information needed to identify those pages in advance */
pages a priori (the entire region containing several PE/COFF images
/* an entire region containing several PE/COFF images could be covered by a single entry */
could be covered by a single entry) so it is hard to guess which pages
/* so it is hard to guess which pages should be allowed RWX permissions */
should be allowed these RWX permissions.
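
To illustrate Ard's point, here is a small Python sketch that computes which permissions each 4 KB page needs when sections are not page aligned. The section addresses and the helper name page_permissions are invented for the example; only the idea (a page that holds both .text and .data needs W and X at once) comes from the mail above.

```python
# Which permissions does each page need, given sections that may share pages?
# Assumption: 4 KB pages; sections are (start, size, perms) with made-up values.

PAGE_SIZE = 4096


def page_permissions(sections):
    """Return {page_number: 'RWX'-style string}, unioned over all sections."""
    needed = {}
    for start, size, perms in sections:
        first = start // PAGE_SIZE
        last = (start + size - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            needed.setdefault(page, set()).update(perms)
    return {
        page: "".join(flag if flag in flags else "-" for flag in "RWX")
        for page, flags in sorted(needed.items())
    }


if __name__ == "__main__":
    image = [
        (0x1000, 0x1800, "RX"),  # .text: ends in the middle of page 2
        (0x2800, 0x1000, "RW"),  # .data: starts in the middle of page 2
    ]
    for page, perms in page_permissions(image).items():
        print(f"page {page}: {perms}")
```

With 4 KB aligned sections every page would come out as either R-X or RW-; the page that both sections touch is the one that has to be RWX, and the memory map does not say which page that is.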



Date: Sat, 7 Nov 2015 22:58:52 -0800
From: Kees Cook <keescook@chromium.org>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On Fri, Nov 6, 2015 at 11:39 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Matt Fleming <matt@codeblueprint.co.uk> wrote:
>>
>>> On Fri, 06 Nov, at 07:55:50AM, Ingo Molnar wrote:
>>> >
[...]
>
> The problem with that is that /any/ page in the UEFI runtime region
> may intersect with both .text and .data of any of the PE/COFF images
> that make up the runtime firmware (since the PE/COFF sections are not
> necessarily page aligned). Such pages require RWX permissions. The
> UEFI memory map does not provide the information to identify those
> pages a priori (the entire region containing several PE/COFF images
> could be covered by a single entry) so it is hard to guess which pages
> should be allowed these RWX permissions.

I'm sad that UEFI was designed without even the most basic of memory            
/* I am sad that UEFI was designed without even the most basic memory protection in mind */
protections in mind. UEFI _itself_ should be setting up protective              
/* UEFI itself should be setting up protective page mappings */
page mappings. :(

For a boot firmware, it seems to me that safe page table layout would           
/* For boot firmware, it seems to me that a safe page table layout would be a top-priority bug */
be a top priority bug. The "reporting issues" page for TianoCore
doesn't actually seem to link to the "Project Tracker":
https://github.com/tianocore/tianocore.github.io/wiki/Reporting-Issues

Does anyone know how to get this correctly reported so future UEFI
releases don't suffer from this?

-Kees



Date: Sun, 8 Nov 2015 08:55:24 +0100
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On 8 November 2015 at 07:58, Kees Cook <keescook@chromium.org> wrote:
> On Fri, Nov 6, 2015 at 11:39 PM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>> * Matt Fleming <matt@codeblueprint.co.uk> wrote:
>>>
[...]
>
> I'm sad that UEFI was designed without even the most basic of memory
> protections in mind. UEFI _itself_ should be setting up protective
> page mappings. :(
>

Well, the 4 KB alignment of sections was considered prohibitive at the
/* At the time, 4 KB section alignment was considered prohibitive from a code size point of view. But that was a long time ago */
time from code size pov. But this was a long time ago, obviously.

> For a boot firmware, it seems to me that safe page table layout would
> be a top priority bug. The "reporting issues" page for TianoCore
> doesn't actually seem to link to the "Project Tracker":
> https://github.com/tianocore/tianocore.github.io/wiki/Reporting-Issues
>
> Does anyone know how to get this correctly reported so future UEFI
> releases don't suffer from this?
>

Ugh. Don't get me started on that topic. I have been working with the           
/* Do not get me started on that topic. */
UEFI forum since July to get a fundamentally broken implementation of           
/* I have been working with the UEFI Forum since July to get a fundamentally broken implementation of memory protections fixed */
memory protections fixed. UEFI v2.5 defines a memory protection scheme          
/* UEFI v2.5 defines a memory protection scheme based on splitting PE/COFF images into separate memory regions */
that is based on splitting PE/COFF images into separate memory regions
so that R-X and RW- permissions can be applied. Unfortunately, that             
/* 所以R-X 和 RW- 權限可以應用上去 */
broke every OS in existence (including Windows 8), since the OS is             
/* 不幸的是, 這破壞了每個既存的 OS (包含 Windows 8) */
allowed to reorder memory regions when it lays out the virtual                  
/* 由於 OS 在規劃 EFI 區域的虛擬映射時, 被允許對於記憶體區域重新排序 */
remapping of the UEFI regions, resulting in PE/COFF .data and .text             
/* 這造成 PE/COFF 中 .data 和 .text 可能出現順序亂掉 */
potentially appearing out of order.

The good news is that we fixed it for the upcoming release (v2.6). I            
/* 好消息是我們在即將發行的v2.6修正了, 我不能透露任何細節 :-( */
can't disclose any specifics, though :-(



Date: Mon, 9 Nov 2015 13:08:01 -0800
From: Kees Cook <keescook@chromium.org>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On Sat, Nov 7, 2015 at 11:55 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 8 November 2015 at 07:58, Kees Cook <keescook@chromium.org> wrote:
>> On Fri, Nov 6, 2015 at 11:39 PM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
>>> On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>>>>
[...]
>
> Well, the 4 KB alignment of sections was considered prohibitive at the
> time from code size pov. But this was a long time ago, obviously.

Heh, yeah, I'd expect max 4K padding to get code/data correctly
aligned on a 2MB binary to not be an issue. :)

[...]
>
> Ugh. Don't get me started on that topic. I have been working with the
> UEFI forum since July to get a fundamentally broken implementation of
> memory protections fixed. UEFI v2.5 defines a memory protection scheme
> that is based on splitting PE/COFF images into separate memory regions
> so that R-X and RW- permissions can be applied. Unfortunately, that
> broke every OS in existence (including Windows 8), since the OS is
> allowed to reorder memory regions when it lays out the virtual
> remapping of the UEFI regions, resulting in PE/COFF .data and .text
> potentially appearing out of order.
>
> The good news is that we fixed it for the upcoming release (v2.6). I
> can't disclose any specifics, though :-(

As long as there's motion to getting it fixed, that makes me happy! :)
Does 2.6 get rid of the (AIUI) 2MB limit too?

-Kees



Date: Tue, 10 Nov 2015 08:08:30 +0100
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On 9 November 2015 at 22:08, Kees Cook <keescook@chromium.org> wrote:
> On Sat, Nov 7, 2015 at 11:55 PM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
[...]
>
> Heh, yeah, I'd expect max 4K padding to get code/data correctly
> aligned on a 2MB binary to not be an issue. :)
>

This is not about section sizes on ARM. The PE/COFF format does not
use segments, like ELF, so the payload (the sections) needs to be
completely disjoint from the header. This means, when using 4 KB
alignment, that every PE/COFF image wastes ~4 KB in the header and 4
KB on average in the section padding (assuming a .text/.data/.reloc
layout, as is common with PE/COFF)

Considering that a typical UEFI firmware image consists of numerous
(around 50 on average, I think) PE/COFF images, and some of them
execute from NOR flash, the Tianocore tooling (which is the reference
implementation) has always been geared towards keeping the alignment
as small as possible, typically 32 bytes unless data objects need
more. Since the UEFI runtime services are typically implemented by
several of these PE/COFF images, and since the memory they occupy may
be described by a single UEFI memory map entry, there is simply no
easy way to decide which pages need R-X, RW- or RWX. Even looking for
PE/COFF headers in the memory region is not guaranteed to work, since
the PE/COFF header is part of the file format, not the memory format
(i.e., since the header is disjoint from the payload, a PE/COFF loader
is not required to copy the header to memory)

>
> As long as there's motion to getting it fixed, that makes me happy! :)
> Does 2.6 get rid of the (AIUI) 2MB limit too?
>

No, there is no such limit in UEFI. If there is a limit like that, it
is an implementation detail of the UEFI support in the OS.

For arm64 (and the upcoming ARM support), the UEFI runtime services
regions are remapped into a virtual userland range that is only active
during the time runtime services are being invoked. (x86 does
something similar, but it shares the page tables with the
suspend/resume code afaiu) These mappings could be page granularity
(since they don't require splitting PUDs or PMDs in the linear
region), with the side note that arm64 mandates 64 KB alignment (to
interoperate with 64 KB pages OSes). This requirement has been added
to the UEFI spec, i.e., a v2.5 compliant arm64 firmware should not
expose UEFI runtime regions that are not 64 KB aligned.



Date: Tue, 10 Nov 2015 12:11:18 -0800
From: Kees Cook <keescook@chromium.org>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On Mon, Nov 9, 2015 at 11:08 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 9 November 2015 at 22:08, Kees Cook <keescook@chromium.org> wrote:
>> On Sat, Nov 7, 2015 at 11:55 PM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
[...]
>
> This is not about section sizes on ARM. The PE/COFF format does not
> use segments, like ELF, so the payload (the sections) needs to be
> completely disjoint from the header. This means, when using 4 KB
> alignment, that every PE/COFF image wastes ~4 KB in the header and 4
> KB on average in the section padding (assuming a .text/.data/.reloc
> layout, as is common with PE/COFF)
>
> Considering that a typical UEFI firmware image consists of numerous
> (around 50 on average, I think) PE/COFF images, and some of them

Oooh, that's no fun. So the linker can't produce merged .text and
.data sections?

[...]
>
> No, there is no such limit in UEFI. If there is a limit like that, it
> is an implementation detail of the UEFI support in the OS.
>
> For arm64 (and the upcoming ARM support), the UEFI runtime services
> regions are remapped into a virtual userland range that is only active
> during the time runtime services are being invoked. (x86 does
> something similar, but it shares the page tables with the
> suspend/resume code afaiu) These mappings could be page granularity
> (since they don't require splitting PUDs or PMDs in the linear
> region), with the side note that arm64 mandates 64 KB alignment (to
> interoperate with 64 KB pages OSes). This requirement has been added
> to the UEFI spec, i.e., a v2.5 compliant arm64 firmware should not
> expose UEFI runtime regions that are not 64 KB aligned.

Cool, thanks for the details!

-Kees



Date: Fri, 6 Nov 2015 13:09:48 +0000
From: Matt Fleming <matt@codeblueprint.co.uk>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List
        <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski
        <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.24 (2015-08-30)

On Thu, 05 Nov, at 11:05:35PM, Andy Lutomirski wrote:
>
> Admittedly, we might need to use a certain amount of care to avoid
> interesting conflicts with the vmap mechanism.  We might need to vmap
> all of the EFI stuff, and possibly even all the top-level entries that
> contain EFI stuff (i.e. exactly one of them unless EFI ends up *huge*)
> as a blank not-present region to avoid overlaps, but that's not a big
> deal.

There shouldn't be any room for conflicting with vmap() because the VA
region where we map EFI regions is still carved out especially for us.

Right Boris?



Date: Fri, 6 Nov 2015 14:24:47 +0100
From: Borislav Petkov <bp@alien8.de>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andy Lutomirski <luto@amacapital.net>, Ingo Molnar <mingo@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel
        Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys
        Vlasenko <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.23 (2014-03-12)

On Fri, Nov 06, 2015 at 01:09:48PM +0000, Matt Fleming wrote:
> On Thu, 05 Nov, at 11:05:35PM, Andy Lutomirski wrote:
> >
> > Admittedly, we might need to use a certain amount of care to avoid
> > interesting conflicts with the vmap mechanism.  We might need to vmap
> > all of the EFI stuff, and possibly even all the top-level entries that
> > contain EFI stuff (i.e. exactly one of them unless EFI ends up *huge*)
> > as a blank not-present region to avoid overlaps, but that's not a big
> > deal.
>
> There shouldn't be any room for conflicting with vmap() because the VA
> region where we map EFI regions is still carved out especially for us.
>
> Right Boris?

Yap:

ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)

vs

ffffffef00000000 - ffffffff00000000 EFI region in trampoline_pgd

the new pagetable will make that issue moot too.

--
Regards/Gruss,
    Boris.

Monday, October 12, 2015

How to taint a hibernation snapshot image in a qemu swap disk file

To test hibernation signature verification, I need to taint the hibernation snapshot image in the swap partition. Doing this against a qemu disk file is more convenient than against a real swap partition.

First, I created a raw qemu disk file to serve as the /dev/sdb swap partition of the guest OS:

  qemu-img create ./swap.image 1G

Make sure the file format is raw rather than qcow2; otherwise the swap header will not be in the first page of the file. I didn't investigate the details of the qcow2 format, but there the magic string, SWAPSPACE2, shows up at offset 0x50ff0 (about 324 KB into the file) rather than in the first page:

00050ff0  53 50 41 43 45 32 53 57  41 50 53 50 41 43 45 32  |SPACE2SWAPSPACE2|

In a raw format file, the magic string is in the first page:

00000fe0  01 00 00 00 00 00 00 00  03 00 00 00 53 57 41 50  |............SWAP|
00000ff0  53 50 41 43 45 32 53 57  41 50 53 50 41 43 45 32  |SPACE2SWAPSPACE2|

My whole investigation is based on the raw format disk file.
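
The raw-versus-qcow2 check can also be scripted from the host. The following is a minimal Python sketch, assuming 4 KB pages as in the hexdumps; `swap_magic` and `looks_like_raw_swap` are names made up for illustration and are not part of any kernel or qemu tool.

```python
PAGE_SIZE = 4096                              # assumption: guest uses 4 KB pages
SWAP_MAGICS = (b"SWAP-SPACE", b"SWAPSPACE2")  # the two magics named in swap.h


def swap_magic(first_page: bytes) -> bytes:
    """Return the 10 bytes stored at the very end of the first page."""
    if len(first_page) < PAGE_SIZE:
        raise ValueError("need at least one full page")
    return first_page[PAGE_SIZE - 10:PAGE_SIZE]


def looks_like_raw_swap(path: str) -> bool:
    """True if the file begins with a swap header page, as a raw image does after mkswap."""
    with open(path, "rb") as f:
        page = f.read(PAGE_SIZE)
    return len(page) == PAGE_SIZE and swap_magic(page) in SWAP_MAGICS
```

Calling `looks_like_raw_swap("./swap.image")` on the host should return True for the raw file after mkswap has run in the guest. It returns False while a hibernation image is stored, because the signature is S1SUSPEND at that point.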

After creating the disk file, add it to the guest OS as the swap partition. My guest OS is openSUSE:

a. Add the new disk file as hdb of the guest OS on the qemu command line:
  -hdb /home/joeyli/qemu-vm/swap.image

b. Launch the guest OS and format the sdb disk as swap:
  mkswap /dev/sdb

c. Add it to /etc/fstab:
  /dev/sdb swap swap defaults 0 0

d. Add a kernel parameter to make sdb the resume partition:
  resume=/dev/sdb

e. Add the following kernel parameters to the grub2 configuration for testing hibernation:
  earlyprintk=ttyS0,115200 console=tty0 console=ttyS0,115200 debug no_console_suspend=1 loglevel=9 nomodeset earlyprintk=efi hibernate=nocompress

hibernate=nocompress avoids the crc32 check; this setup is only for testing the hibernation verification mechanism. Since we will taint the snapshot image in swap, the CRC check might otherwise fail before hibernation verification runs.

Once /dev/sdb shows up in the guest OS, try hibernation. Before hibernating, you should find the magic string of the swap partition:

00000fe0  01 00 00 00 00 00 00 00  03 00 00 00 53 57 41 50  |............SWAP|
00000ff0  53 50 41 43 45 32 53 57  41 50 53 50 41 43 45 32  |SPACE2SWAPSPACE2|

In the kernel source, the swap header is declared as the swap_header union in swap.h:

include/linux/swap.h
/*
 * Magic header for a swap area. The first part of the union is
 * what the swap magic looks like for the old (limited to 128MB)
 * swap area format, the second part of the union adds - in the
 * old reserved area - some extra information. Note that the first
 * kilobyte is reserved for boot loader or disk label stuff...
 *
 * Having the magic at the end of the PAGE_SIZE makes detecting swap
 * areas somewhat tricky on machines that support multiple page sizes.
 * For 2.5 we'll probably want to move the magic to just beyond the
 * bootbits...
 */
union swap_header {
        struct {
                char reserved[PAGE_SIZE - 10];
                char magic[10];                 /* SWAP-SPACE or SWAPSPACE2 */
        } magic;
        struct {
                char            bootbits[1024]; /* Space for disklabel etc. */
                __u32           version;
                __u32           last_page;
                __u32           nr_badpages;
                unsigned char   sws_uuid[16];
                unsigned char   sws_volume[16];
                __u32           padding[117];
                __u32           badpages[1];
        } info;
};

So the magic string is at the end of the first page of the partition, while the other metadata is written from the top of the first page down.
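
To see the top-down metadata in the disk file, the `info` part of the union can be decoded from the host. This is only a sketch, assuming a little-endian x86 guest; `parse_swap_info` is a hypothetical helper name, and the field offsets simply follow the declaration above (1024 bytes of bootbits, then three 32-bit values, then the UUID and the volume label).

```python
import struct
import uuid

INFO_OFFSET = 1024            # size of bootbits[] in swap_header.info
INFO_FORMAT = "<III16s16s"    # version, last_page, nr_badpages, sws_uuid, sws_volume


def parse_swap_info(first_page: bytes) -> dict:
    """Decode the leading fields of swap_header.info from the first page."""
    version, last_page, nr_badpages, raw_uuid, volume = struct.unpack_from(
        INFO_FORMAT, first_page, INFO_OFFSET)
    return {
        "version": version,
        "last_page": last_page,
        "nr_badpages": nr_badpages,
        "uuid": str(uuid.UUID(bytes=raw_uuid)),
        "label": volume.rstrip(b"\0").decode("ascii", "replace"),
    }
```

Reading the first 4096 bytes of ./swap.image and passing them to `parse_swap_info` gives the values that mkswap wrote, independent of whether a hibernation image is currently stored.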

After hibernation is launched, the magic string changes to S1SUSPEND:

00000fe0  01 00 00 00 00 00 00 00  05 00 00 00 53 57 41 50  |............SWAP|
00000ff0  53 50 41 43 45 32 53 31  53 55 53 50 45 4e 44 00  |SPACE2S1SUSPEND.|

After resuming from hibernation, whether it succeeds or fails, the magic string is changed back to SWAPSPACE2.

The hibernation verification mechanism calculates a signature over the data pages in the snapshot image, so testing it means tainting the content of the data pages.

Finding the position of the data pages is the key step in tainting them. The snapshot image header is swsusp_info, which sits in the first page of the snapshot. The hibernation mechanism uses the reserved space of the first page of the swap partition from the bottom up:

kernel/power/swap.c
struct swsusp_header {          /* 1 page, 4096 bytes */
        char    reserved[PAGE_SIZE - 20 - sizeof(sector_t) - sizeof(int) -
                         sizeof(u32)];  /* 4096 - 20 - 8 - 4 - 4 = 4060 */
        u32     crc32;          /* 4 bytes */
        sector_t image;         /* sector_t of snapshot image, unsigned long, 8 bytes */
        unsigned int flags;     /* unsigned int, 4 bytes */
        char    orig_sig[10];   /* 10 bytes */
        char    sig[10];        /* 10 bytes, SWAPSPACE2 or S1SUSPEND */
} __attribute__((packed));

The swsusp_header is actually mapped onto the first page of swap as well, in the same space as swap_header.
The main field of swsusp_header for locating the snapshot image is "image". It is a sector_t number counted in units the same size as a page on swap. For example, if image = 1, the offset of the snapshot image in swap is:
    offset = (swap header + image sector) * PAGE_SIZE = (1 + 1) * 4096
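
The tail of the first page can be decoded on the host to read "image" directly from the disk file. Below is a minimal Python sketch, assuming a little-endian x86_64 guest where sector_t is 8 bytes and pages are 4 KB. The helper names are hypothetical, and `image_offset` just applies the formula from this post; it has not been checked against other kernel versions.

```python
import struct

PAGE_SIZE = 4096                           # assumption: 4 KB pages
TAIL_FORMAT = "<IQI10s10s"                 # crc32, image, flags, orig_sig, sig
TAIL_SIZE = struct.calcsize(TAIL_FORMAT)   # 4 + 8 + 4 + 10 + 10 = 36 bytes


def parse_swsusp_header(first_page: bytes) -> dict:
    """Decode the packed swsusp_header fields at the end of the first swap page."""
    crc32, image, flags, orig_sig, sig = struct.unpack_from(
        TAIL_FORMAT, first_page, PAGE_SIZE - TAIL_SIZE)
    return {"crc32": crc32, "image": image, "flags": flags,
            "orig_sig": orig_sig, "sig": sig}


def image_offset(image: int) -> int:
    """Byte offset of the snapshot header, using the formula from this post."""
    return (1 + image) * PAGE_SIZE
```

With the hexdump taken after hibernation (image = 1, flags = 5), `parse_swsusp_header` reports sig as S1SUSPEND followed by a NUL byte, and `image_offset(1)` gives 8192.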

Following this offset, we can find the snapshot image header, swsusp_info:

kernel/power/power.h
struct swsusp_info {
        struct new_utsname      uts;            /* on x86, struct restore_data_record, 24 bytes */
                                                /* 0x0123456789ABCDEFUL */
        u32                     version_code;
        unsigned long           num_physpages;
        int                     cpus;
        unsigned long           image_pages;    /* number of data pages */
        unsigned long           pages;          /* total number of pages in the snapshot */
        unsigned long           size;
} __aligned(PAGE_SIZE);

I want to taint the content of the data pages to emulate someone modifying the image data. Two fields in the swsusp_info structure matter here: pages and image_pages. The offset of the beginning of the data pages area is:
    data pages offset = (swap header + image sector) * PAGE_SIZE + (pages - image_pages) * PAGE_SIZE
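
As a worked example of this formula, here is a small Python sketch. The function name and the sample numbers are hypothetical; real values of pages and image_pages have to be read from the swsusp_info header of an actual snapshot.

```python
PAGE_SIZE = 4096  # assumption: 4 KB pages


def data_pages_offset(image: int, pages: int, image_pages: int) -> int:
    """Byte offset of the first data page, following the formula in this post."""
    if image_pages > pages:
        raise ValueError("image_pages cannot exceed pages")
    header_offset = (1 + image) * PAGE_SIZE          # swap header + image sector
    meta_pages = pages - image_pages                 # pages that are not data pages
    return header_offset + meta_pages * PAGE_SIZE


# Hypothetical snapshot: image = 1, 1000 pages in total, 990 of them data pages.
example = data_pages_offset(image=1, pages=1000, image_pages=990)
```

For these made-up values the result is (1 + 1) * 4096 + (1000 - 990) * 4096 = 49152, so any byte at or after that offset, up to the end of the data pages area, is a candidate for tainting.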

Then use printf and dd to modify the content of the data pages. For example, this command modifies one byte at offset 5242848:
    printf '\x32' | dd conv=notrunc of=./swap.image bs=1 seek=5242848
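
The same in-place overwrite can be done from a script, which is handy when several offsets need tainting. This is a minimal Python sketch that mirrors the printf/dd pipeline; `taint_byte` is a hypothetical name, and the file must be opened without truncation, just as conv=notrunc does for dd.

```python
def taint_byte(path: str, offset: int, value: int) -> int:
    """Overwrite one byte in place and return the byte that was there before."""
    with open(path, "r+b") as f:      # r+b updates the file without truncating it
        f.seek(offset)
        old = f.read(1)
        if len(old) != 1:
            raise ValueError("offset is beyond the end of the file")
        f.seek(offset)
        f.write(bytes([value]))
    return old[0]
```

`taint_byte("./swap.image", 5242848, 0x32)` is equivalent to the command above. Keeping the returned old value makes it easy to restore the byte afterwards.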

I wrote a simple C program that checks the swap header and the swsusp header in a raw qemu swap file and taints 3 bytes at 3 different positions for testing.

Monday, May 25, 2015

Attended GNOME.Asia 2015 in Indonesia

I am really happy to have had the chance to attend the GNOME.Asia 2015 summit in Indonesia. It was my first time at a GNOME conference and also my first visit to Indonesia. I really appreciate the help of the GNOME Taiwan community and the openSUSE Indonesia community friends in Depok, and I am very thankful to GNOME.Asia and openSUSE for sponsoring my air ticket and hotel. It was a wonderful experience for me.

The GNOME.Asia 2015 venue is at the University of Indonesia, Depok. It's a beautiful university:

The GNOME workshop on day 0:

David King's session: "Writing your first GNOME application"

Speakers from more than 10 different countries:

Team discussion in workshop:


This special building is the library of the University of Indonesia:


There is a beautiful lake in the university. The GNOME.Asia venues are all around this lake:

The front door of Balairung, the main space for the opening/closing sessions and lightning talks:

Setting up the openSUSE ID booth. It was simple on day 0, but it changed a lot on day 1 and day 2 thanks to the help of the openSUSE Indonesia community friends:

The openSUSE materials and posters:


Banner: openSUSE is a silver sponsor of GNOME.Asia 2015:

The front door of venue:

There are 4 rooms for sessions:

My session is on the last day, about hibernate signature checking:

There are many banners at the front door:

Showing the different Linux distros:


openSUSE Indonesia friends set up a computer and a Raspberry Pi to demo openSUSE. They also prepared an interesting questionnaire for earning an openSUSE T-shirt:

openSUSE Indonesia friends: Edwin, Yan, Adnan, Andi. Without them, it would have been impossible to set up such a wonderful openSUSE booth. They also run an amazing project to promote OSS in schools:

Many people visited the openSUSE booth:

The big hall for the opening/closing sessions and lightning talks:

Sessions started in the afternoon of day 1...
I joined the session "Open Source Software in Shoes Industry" by Iwan S. Tahari. He is from the shoe manufacturer "Fans" and presented the status of OSS use in the company. "Fans" is also the sponsor of the GNOME.Asia shoes:

"Fans" uses OSS on desktops and servers for designing and making shoes:


The second session I joined was Chen Shing Yuan's presentation "Promote FOSS Education to Remote Areas in Taiwan and China":

Aletheia University in Taiwan has a project to promote OSS, English and culture in schools in Xinjiang, China. The area is remote and the children in these schools have few resources:

The third session I joined was Anton Siswo Raharjo Ansori's "An Overview, Maximize The Ability of The GNU/Linux Operating System Using "In Memory Computation" for Academic, Business and Government":

The lightning talks of day 1:

Talking with GNOME contributor:

The local GNOME user raised questions to experts:


I joined Edwin's session in the afternoon of day 2:

The subject of Edwin's session is "Using Linux for Basic Education, Is it Feasible?":

Their project is promoting OSS to schools:


It's amazing that Edwin's project has already promoted and set up OSS in 500 schools. Cool!

It's really not easy to promote OSS to teachers and students. The openSUSE Indonesia community showed us their hard work:

Then I joined Matthew Waters's session: "gtkgst: Video in your Widgets!"

Then Bin Li's interesting session "GNOME-Shell on Nexus 7". I think the only problem is the need to find the official Nexus 7 kernel released by Google:

Then I joined Franklin Weng & Eric Sun's "The evolution of ezgo development & Why Do We Insist Promoting FOSS in Taiwan's Campus?":

ezgo is also an excellent project launched in Taiwan for education:

Sorry, I didn't take a picture of my own presentation, but my slides are here:

The lightning talks in the afternoon of day 2:

End of the conference:

The GNOME.Asia people did a great job hosting GNOME.Asia 2015 (Max, Haris...):

openSUSE sponsored GNOME.Asia 2015. Thanks also to GNOME.Asia for sponsoring openSUSE.Asia last year:

We all got the GNOME.Asia shoes from "Fans":

Thanks to everybody for joining; it was a wonderful conference:

The final picture is a note for anyone who wants to travel to Indonesia. The power sockets are the same type as in Europe, so feel free to bring a European power adapter:

I am really happy to have joined the GNOME.Asia 2015 summit. I not only met GNOME community friends from many different countries, but also met openSUSE community friends in Indonesia. I hope openSUSE.Asia gets a chance to come to Indonesia, so that I can visit this great country again.

More pictures are here: https://www.flickr.com/photos/104068895@N07/sets/72157653462172002