Tuesday, November 17, 2015

Re: [GIT PULL] x86/mm changes for v4.4 (annotated with translation comments)

Prior read: Re: [PATCH v2] x86/mm: warn on W+x mappings

Date: Fri, 6 Nov 2015 11:39:43 +0000
From: Matt Fleming <matt@codeblueprint.co.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Jones <davej@codemonkey.org.uk>, Ingo Molnar <mingo@kernel.org>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>,
        Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Stephen Smalley <sds@tycho.nsa.gov>,
        linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.24 (2015-08-30)

We have separate page tables today, for a few reasons, but mainly it's
/* For a few reasons, we currently have separate page tables */
so that we can have an identity mapping of memory present in the
/* The main reason is so that we can have an identity (1:1) mapping */
region usually used by user processes - broken firmware still uses
/* of memory in the region normally used by user processes; broken firmware still uses those identity mappings */
those identity mappings even after the kernel tells it they're
/* even after the kernel has told it they are invalid */
invalid.

Note that when I say "separate" I'm talking about trampoline_pgd[]
which is also used by the x86 suspend/resume code.
/* Note: by "separate" I mean trampoline_pgd[], which is also used by the x86 suspend/resume code */

However, turns out that the issue with the current scheme is the fact
/* It turns out the problem with the current scheme is that trampoline_pgd[] shares a couple of PGD entries with swapper_pg_dir */
that trampoline_pgd[] actually shares a couple of PGD entries with
swapper_pg_dir as can be seen in setup_real_mode(),


        trampoline_pgd = (u64 *)__va(real_mode_header->trampoline_pgd);
        trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd;
        trampoline_pgd[511] = init_level4_pgt[511].pgd;

So when we map the EFI regions in efi_map_regions() we're inserting
/* So when we map the EFI regions, they are inserted into swapper_pg_dir as well */
them into swapper_pg_dir also, which is why you're seeing the
warnings.

If I remember correctly the rationale for using trampoline_pgd[] was
/* trampoline_pgd[] was used because it already did what we wanted (it provided the identity mapping) */
that it already did what we wanted (provided the identity mapping) and
would save us the overhead of maintaining more page tables for no good
/* and it saved us the overhead of maintaining more page tables */
reason. Obviously this entire thread is a good reason.

I suggest we stop using trampoline_pgd[] (since it has a good reason 
/* I suggest we stop using trampoline_pgd[] (it has a good reason for sharing the kernel-mapping PGD entries) */
for sharing the kernel mapping PGD entries) and create our own so that
/* and create our own PGD so that we can isolate EFI completely */
we can isolate EFI completely.

For the immediate problem of the warnings spewing forth on all UEFI
machines, at the very least the config options needs to be disabled by
/* At the very least, the config option must be disabled by default */
default, if not the patch reverted.



Date: Sat, 7 Nov 2015 08:05:54 +0100
From: Ingo Molnar <mingo@kernel.org>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin"
        <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Stephen Smalley <sds@tycho.nsa.gov>,
        linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.23 (2014-03-12)


* Matt Fleming <matt@codeblueprint.co.uk> wrote:

> On Thu, 05 Nov, at 01:33:10PM, Linus Torvalds wrote:
[...]
> I suggest we stop using trampoline_pgd[] (since it has a good reason
> for sharing the kernel mapping PGD entries) and create our own so that
> we can isolate EFI completely.

Ok. Could you please make this fix a priority for upcoming EFI changes?

> For the immediate problem of the warnings spewing forth on all UEFI
> machines, at the very least the config options needs to be disabled by
> default, if not the patch reverted.

We'll certainly flip around the default, but reverting would be shooting
/* We will certainly flip the default */
the messenger: the EFI code is endangering everyone else today, and for
/* The EFI code is endangering everyone else today, apparently for no good reason */
no good reason as it appears... so the warning very much served its
/* so the warning (CONFIG_DEBUG_WX) very much served its purpose by pointing out a valid problem */
purpose in pointing out a valid problem.

Thanks,

        Ingo



Date: Fri, 6 Nov 2015 12:39:12 +0000
From: Matt Fleming <matt@codeblueprint.co.uk>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner
        <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko
        <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org, Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.24 (2015-08-30)

On Fri, 06 Nov, at 07:55:50AM, Ingo Molnar wrote:
>
>  3) We should fix the EFI permission problem without relying on the firmware: it
/* We should fix the EFI permission problem without relying on the firmware */
>     appears we could just mark everything R-X optimistically, and if a write fault
/* We could optimistically mark everything R-X */
>     happens (it's pretty rare in fact, only triggers when we write to an EFI
/* and when a write fault happens (rare; only when we write to an EFI variable and the like), mark the faulting page RW- on the fly */
>     variable and so), we can mark the faulting page RW- on the fly, because it
>     appears that writable EFI sections, while not enumerated very well in 'old'
/* because writable EFI sections, although poorly enumerated by 'old' firmware, are apparently still supposed to be page granular */
>     firmware, are still supposed to be page granular. (Even 'new' firmware I 
/* (Even with 'new' firmware, I would not automatically trust it to get the enumeration right...) */
>     wouldn't automatically trust to get the enumeration right...)

Sorry, this isn't true. I misled you with one of my earlier posts on
/* Sorry, this isn't true; I misled you */
this topic. Let me try and clear things up...

Writing to EFI regions has to do with every invocation of the EFI
/* Writes to EFI regions happen on every invocation of the EFI runtime services, not only when reading/writing/deleting EFI variables */
runtime services - it's not limited to when you read/write/delete EFI
variables. In fact, EFI variables really have nothing to do with this
/* In fact, EFI variables really have nothing to do with this discussion */
discussion, they're a completely opaque concept to the OS, we have no
/* They are a completely opaque concept to the OS */
idea how the firmware implements them. Everything is done via the EFI
boot/runtime services.

The firmware itself will attempt to write to EFI regions when we
/* When we invoke the EFI services, the firmware itself writes to EFI regions, because that is where the PE/COFF .data and .bss sections live, along with the heap */
invoke the EFI services because that's where the PE/COFF ".data" and
".bss" sections live along with the heap. There's even some relocation
/* There are even relocation fixups at SetVirtualAddressMap() time, so it writes to .text too */
fixups that occur as SetVirtualAddressMap() time so it'll write to
".text" too.

Now, the above PE/COFF sections are usually (always?) contained within
/* The PE/COFF sections above are usually (always?) contained in EFI regions of type EfiRuntimeServicesCode */
EFI regions of type EfiRuntimeServicesCode. We know this is true
/* We know this because the firmware developers have told us so */
because the firmware folks have told us so, and because stopping that
/* and because putting a stop to that is the motivation behind the new EFI_PROPERTIES_TABLE feature */
is the motivation behind the new EFI_PROPERTIES_TABLE feature in UEFI
V2.5.

The data sections within the region are also *not* guaranteed to be
/* The data sections within the region are also not guaranteed to be page granular */
page granular because work was required in Tianocore for emitting
/* because work was needed in Tianocore to emit 4k-aligned sections as part of the EFI_PROPERTIES_TABLE support */
sections with 4k alignment as part of the EFI_PROPERTIES_TABLE
support.

Ultimately, what this means is that if you were to attempt to
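/* Ultimately this means that if you tried to dynamically fix up the regions needing write permission, you would have to modify the mappings for most of the EFI regions anyway */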
dynamically fixup those regions that required write permission, you'd
have to modify the mappings for the majority of the EFI regions
anyway. And if you're blindly allowing write permission as a fixup,
/* And if you blindly allow write permission as a fixup, there is not much security to be gained */
there's not much security to be had.

>     If that 'supposed to be' turns out to be 'not true' (not unheard of in
/* If that 'supposed to be' turns out to be 'not true' (not unheard of in firmware land) */
>     firmware land), then plan B would be to mark pages that generate write faults
/* then plan B would be to mark pages that generate write faults RWX, so as not to break functionality */
>     RWX as well, to not break functionality. (This 'mark it RWX' is not something
/* This 'mark it RWX' path is not something exploits could easily reach, and we could also generate a warning [after the EFI call has finished] if it ever triggers */
>     that exploits would have easy access to, and we could also generate a warning
>     [after the EFI call has finished] if it ever triggers.)
>
>     Admittedly this approach might not be without its own complications, but it
/* Admittedly, this approach may have complications of its own, */
>     looks reasonably simple (I don't think we need per EFI call page tables,
/* but it looks reasonably simple (I don't think we need per-EFI-call page tables, etc.) */
>     etc.), and does not assume much about the firmware being able to enumerate its
/* and it does not assume much about the firmware enumerating its permissions properly */
>     permissions properly. Were we to merge EFI support today I'd have insisted on
>     trying such an approach from day 1 on.

We already have separate EFI page tables, though with the caveat that
/* We already have separate EFI page tables, with the caveat that */
we share some of swapper_pg_dir's PGD entries. The best solution would
/* we share some of swapper_pg_dir's PGD entries. */
be to stop sharing entries and isolate the EFI mappings from every
/* The best solution is to stop sharing entries and isolate the EFI mappings from every other page table structure */
other page table structure, so that they're only used during the EFI
/* so that they (the EFI page tables) are only used during EFI service calls */
service calls.



Date: Sat, 7 Nov 2015 08:09:22 +0100
From: Ingo Molnar <mingo@kernel.org>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner
        <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko
        <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org, Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.23 (2014-03-12)


* Matt Fleming <matt@codeblueprint.co.uk> wrote:

> On Fri, 06 Nov, at 07:55:50AM, Ingo Molnar wrote:
> >
[...]
>
> Ultimately, what this means is that if you were to attempt to
> dynamically fixup those regions that required write permission, you'd
> have to modify the mappings for the majority of the EFI regions
> anyway. And if you're blindly allowing write permission as a fixup,
> there's not much security to be had.

I think you misunderstood my suggestion: the 'fixup' would be changing it from R-X
/* The 'fixup' would change R-X to RW-, i.e. add write permission but remove execute permission */
to RW-, i.e. it would add 'write' permission but remove 'execute' permission.

Note that there would be no 'RWX' permission at any given moment - which is the
/* Note that RWX, the dangerous combination, would never be in effect at any moment */
dangerous combination.

> >     If that 'supposed to be' turns out to be 'not true' (not unheard of in
> >     firmware land), then plan B would be to mark pages that generate write faults
> >     RWX as well, to not break functionality. (This 'mark it RWX' is not something
> >     that exploits would have easy access to, and we could also generate a warning
> >     [after the EFI call has finished] if it ever triggers.)
> >
> >     Admittedly this approach might not be without its own complications, but it
> >     looks reasonably simple (I don't think we need per EFI call page tables,
> >     etc.), and does not assume much about the firmware being able to enumerate its
> >     permissions properly. Were we to merge EFI support today I'd have insisted on
> >     trying such an approach from day 1 on.
>
> We already have separate EFI page tables, though with the caveat that
> we share some of swapper_pg_dir's PGD entries. The best solution would
> be to stop sharing entries and isolate the EFI mappings from every
> other page table structure, so that they're only used during the EFI
> service calls.

Absolutely. Can you try to fix this for v4.3?

Thanks,

        Ingo



Date: Sat, 7 Nov 2015 08:39:35 +0100
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: Ingo Molnar <mingo@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List
        <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski
        <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Matt Fleming <matt@codeblueprint.co.uk> wrote:
>
[...]
>
> I think you misunderstood my suggestion: the 'fixup' would be changing it from R-X
> to RW-, i.e. it would add 'write' permission but remove 'execute' permission.
>
> Note that there would be no 'RWX' permission at any given moment - which is the
> dangerous combination.
>

The problem with that is that /any/ page in the UEFI runtime region
/* The problem is that any page in the UEFI runtime region may intersect both .text and .data of any of the PE/COFF images that make up the runtime firmware */
may intersect with both .text and .data of any of the PE/COFF images
that make up the runtime firmware (since the PE/COFF sections are not
/* since PE/COFF sections are not necessarily page aligned */
necessarily page aligned). Such pages require RWX permissions. The
/* Such pages require RWX permissions */
UEFI memory map does not provide the information to identify those
/* The UEFI memory map does not provide the information to identify those pages in advance */
pages a priori (the entire region containing several PE/COFF images
/* (an entire region containing several PE/COFF images could be covered by a single entry) */
could be covered by a single entry) so it is hard to guess which pages
/* so it is hard to guess which pages should be allowed RWX permissions */
should be allowed these RWX permissions.



Date: Sat, 7 Nov 2015 22:58:52 -0800
From: Kees Cook <keescook@chromium.org>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On Fri, Nov 6, 2015 at 11:39 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Matt Fleming <matt@codeblueprint.co.uk> wrote:
>>
>>> On Fri, 06 Nov, at 07:55:50AM, Ingo Molnar wrote:
>>> >
[...]
>
> The problem with that is that /any/ page in the UEFI runtime region
> may intersect with both .text and .data of any of the PE/COFF images
> that make up the runtime firmware (since the PE/COFF sections are not
> necessarily page aligned). Such pages require RWX permissions. The
> UEFI memory map does not provide the information to identify those
> pages a priori (the entire region containing several PE/COFF images
> could be covered by a single entry) so it is hard to guess which pages
> should be allowed these RWX permissions.

I'm sad that UEFI was designed without even the most basic of memory            
/* I'm sad that UEFI was designed without even the most basic memory protections in mind */
protections in mind. UEFI _itself_ should be setting up protective              
/* UEFI itself should be setting up protective page mappings */
page mappings. :(

For a boot firmware, it seems to me that safe page table layout would           
/* For boot firmware, it seems to me that a safe page table layout would be a top-priority bug */
be a top priority bug. The "reporting issues" page for TianoCore
doesn't actually seem to link to the "Project Tracker":
https://github.com/tianocore/tianocore.github.io/wiki/Reporting-Issues

Does anyone know how to get this correctly reported so future UEFI
releases don't suffer from this?

-Kees



Date: Sun, 8 Nov 2015 08:55:24 +0100
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On 8 November 2015 at 07:58, Kees Cook <keescook@chromium.org> wrote:
> On Fri, Nov 6, 2015 at 11:39 PM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>> * Matt Fleming <matt@codeblueprint.co.uk> wrote:
>>>
[...]
>
> I'm sad that UEFI was designed without even the most basic of memory
> protections in mind. UEFI _itself_ should be setting up protective
> page mappings. :(
>

Well, the 4 KB alignment of sections was considered prohibitive at the
/* 4 KB section alignment was considered prohibitive at the time from a code-size point of view, but that was a long time ago */
time from code size pov. But this was a long time ago, obviously.

> For a boot firmware, it seems to me that safe page table layout would
> be a top priority bug. The "reporting issues" page for TianoCore
> doesn't actually seem to link to the "Project Tracker":
> https://github.com/tianocore/tianocore.github.io/wiki/Reporting-Issues
>
> Does anyone know how to get this correctly reported so future UEFI
> releases don't suffer from this?
>

Ugh. Don't get me started on that topic. I have been working with the           
/* Don't get me started on that topic. */
UEFI forum since July to get a fundamentally broken implementation of           
/* I have been working with the UEFI Forum since July to get a fundamentally broken implementation of memory protections fixed */
memory protections fixed. UEFI v2.5 defines a memory protection scheme          
/* UEFI v2.5 defines a memory protection scheme based on splitting PE/COFF images into separate memory regions */
that is based on splitting PE/COFF images into separate memory regions
so that R-X and RW- permissions can be applied. Unfortunately, that             
/* so that R-X and RW- permissions can be applied */
broke every OS in existence (including Windows 8), since the OS is             
/* Unfortunately, that broke every OS in existence (including Windows 8) */
allowed to reorder memory regions when it lays out the virtual                  
/* since the OS is allowed to reorder memory regions when it lays out the virtual remapping of the UEFI regions */
remapping of the UEFI regions, resulting in PE/COFF .data and .text             
/* which can leave PE/COFF .data and .text out of order */
potentially appearing out of order.

The good news is that we fixed it for the upcoming release (v2.6). I            
/* The good news is that we fixed it for the upcoming release (v2.6); I can't disclose any specifics, though :-( */
can't disclose any specifics, though :-(



Date: Mon, 9 Nov 2015 13:08:01 -0800
From: Kees Cook <keescook@chromium.org>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On Sat, Nov 7, 2015 at 11:55 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 8 November 2015 at 07:58, Kees Cook <keescook@chromium.org> wrote:
>> On Fri, Nov 6, 2015 at 11:39 PM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
>>> On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>>>>
[...]
>
> Well, the 4 KB alignment of sections was considered prohibitive at the
> time from code size pov. But this was a long time ago, obviously.

Heh, yeah, I'd expect max 4K padding to get code/data correctly
/* I'd expect at most 4K of padding, to get code/data correctly aligned in a 2MB binary, not to be an issue */
aligned on a 2MB binary to not be an issue. :)

[...]
>
> Ugh. Don't get me started on that topic. I have been working with the
> UEFI forum since July to get a fundamentally broken implementation of
> memory protections fixed. UEFI v2.5 defines a memory protection scheme
> that is based on splitting PE/COFF images into separate memory regions
> so that R-X and RW- permissions can be applied. Unfortunately, that
> broke every OS in existence (including Windows 8), since the OS is
> allowed to reorder memory regions when it lays out the virtual
> remapping of the UEFI regions, resulting in PE/COFF .data and .text
> potentially appearing out of order.
>
> The good news is that we fixed it for the upcoming release (v2.6). I
> can't disclose any specifics, though :-(

As long as there's motion to getting it fixed, that makes me happy! :)
/* As long as there is movement toward getting it fixed, that makes me happy! */
Does 2.6 get rid of the (AIUI) 2MB limit too?                           
/* Does 2.6 also get rid of the (as I understand it) 2MB limit? */

-Kees



Date: Tue, 10 Nov 2015 08:08:30 +0100
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On 9 November 2015 at 22:08, Kees Cook <keescook@chromium.org> wrote:
> On Sat, Nov 7, 2015 at 11:55 PM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
[...]
>
> Heh, yeah, I'd expect max 4K padding to get code/data correctly
> aligned on a 2MB binary to not be an issue. :)
>

This is not about section sizes on ARM. The PE/COFF format does not             
/* This is not about section sizes on ARM */
use segments, like ELF, so the payload (the sections) needs to be               
/* PE/COFF, unlike ELF, does not use segments, */
completely disjoint from the header. This means, when using 4 KB                
/* so the payload (the sections) must be completely disjoint from the header */
alignment, that every PE/COFF image wastes ~4 KB in the header and 4            
/* With 4 KB alignment, every PE/COFF image wastes about 4 KB in the header and 4 KB on average in section padding */
KB on average in the section padding (assuming a .text/.data/.reloc             
/* (assuming a .text/.data/.reloc layout, as is common with PE/COFF) */
layout, as is common with PE/COFF)

Considering that a typical UEFI firmware image consists of numerous             
/* Considering that a typical UEFI firmware image consists of numerous (about 50 on average, I think) PE/COFF images */
(around 50 on average, I think) PE/COFF images, and some of them                
/* and that some of them execute from NOR flash, the Tianocore tooling (the reference implementation) */
execute from NOR flash, the Tianocore tooling (which is the reference           
/* has always been geared toward keeping the alignment as small as possible */
implementation) has always been geared towards keeping the alignment
as small as possible, typically 32 bytes unless data objects need               
/* typically 32 bytes, unless data objects need more */
more. Since the UEFI runtime services are typically implemented by              
/* Since the UEFI runtime services are typically implemented by several of these PE/COFF images */
several of these PE/COFF images, and since the memory they occupy may           
/* and since the memory they occupy may be described by a single UEFI memory map entry */
be described by a single UEFI memory map entry, there is simply no             
/* there is simply no easy way to decide which pages need R-X, RW- or RWX */
easy way to decide which pages need R-X, RW- or RWX. Even looking for           
/* Even looking for PE/COFF headers in the memory region is not guaranteed to work, since */
PE/COFF headers in the memory region is not guaranteed to work, since
the PE/COFF header is part of the file format, not the memory format            
/* the PE/COFF header is part of the file format, not the memory format */
(i.e., since the header is disjoint from the payload, a PE/COFF loader          
/* (i.e., since the header is disjoint from the payload, a PE/COFF loader is not required to copy the header to memory) */
is not required to copy the header to memory)

>
> As long as there's motion to getting it fixed, that makes me happy! :)
> Does 2.6 get rid of the (AIUI) 2MB limit too?
>

No, there is no such limit in UEFI. If there is a limit like that, it          
/* No, there is no such limit in UEFI. If there is a limit like that, */
is an implementation detail of the UEFI support in the OS.                      
/* it is an implementation detail of the UEFI support in the OS */

For arm64 (and the upcoming ARM support), the UEFI runtime services             
/* For arm64 (and the upcoming ARM support), the UEFI runtime services regions */
regions are remapped into a virtual userland range that is only active          
/* are remapped into a virtual userland range */
during the time runtime services are being invoked. (x86 does                  
/* that is only active while runtime services are being invoked */
something similar, but it shares the page tables with the                       
/* x86 does something similar but, as far as I understand, shares the page tables with the suspend/resume code */
suspend/resume code afaiu) These mappings could be page granularity             
/* These mappings could be page granular (since they don't require splitting PUDs or PMDs in the linear region) */
(since they don't require splitting PUDs or PMDs in the linear
region), with the side note that arm64 mandates 64 KB alignment (to             
/* with the side note that arm64 mandates 64 KB alignment (to interoperate with 64 KB page OSes) */
interoperate with 64 KB pages OSes). This requirement has been added            
/* This requirement has been added to the UEFI spec, i.e., */
to the UEFI spec, i.e., a v2.5 compliant arm64 firmware should not              
/* a v2.5-compliant arm64 firmware should not expose UEFI runtime regions that are not 64 KB aligned */
expose UEFI runtime regions that are not 64 KB aligned.



Date: Tue, 10 Nov 2015 12:11:18 -0800
From: Kees Cook <keescook@chromium.org>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
        Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
        Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4

On Mon, Nov 9, 2015 at 11:08 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 9 November 2015 at 22:08, Kees Cook <keescook@chromium.org> wrote:
>> On Sat, Nov 7, 2015 at 11:55 PM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
[...]
>
> This is not about section sizes on ARM. The PE/COFF format does not
> use segments, like ELF, so the payload (the sections) needs to be
> completely disjoint from the header. This means, when using 4 KB
> alignment, that every PE/COFF image wastes ~4 KB in the header and 4
> KB on average in the section padding (assuming a .text/.data/.reloc
> layout, as is common with PE/COFF)
>
> Considering that a typical UEFI firmware image consists of numerous
> (around 50 on average, I think) PE/COFF images, and some of them

Oooh, that's no fun. So the linker can't produce merged .text and               
/* Oh, that's no fun. So the linker can't produce merged .text and .data sections? */
.data sections?

[...]
>
> No, there is no such limit in UEFI. If there is a limit like that, it
> is an implementation detail of the UEFI support in the OS.
>
> For arm64 (and the upcoming ARM support), the UEFI runtime services
> regions are remapped into a virtual userland range that is only active
> during the time runtime services are being invoked. (x86 does
> something similar, but it shares the page tables with the
> suspend/resume code afaiu) These mappings could be page granularity
> (since they don't require splitting PUDs or PMDs in the linear
> region), with the side note that arm64 mandates 64 KB alignment (to
> interoperate with 64 KB pages OSes). This requirement has been added
> to the UEFI spec, i.e., a v2.5 compliant arm64 firmware should not
> expose UEFI runtime regions that are not 64 KB aligned.

Cool, thanks for the details!

-Kees



Date: Fri, 6 Nov 2015 13:09:48 +0000
From: Matt Fleming <matt@codeblueprint.co.uk>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List
        <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski
        <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.24 (2015-08-30)

On Thu, 05 Nov, at 11:05:35PM, Andy Lutomirski wrote:
>
> Admittedly, we might need to use a certain amount of care to avoid        
/* Admittedly, we might need some care to avoid interesting conflicts with the vmap mechanism */
> interesting conflicts with the vmap mechanism.  We might need to vmap      
/* We might need to vmap all of the EFI stuff, */
> all of the EFI stuff, and possibly even all the top-level entries that      
/* and possibly even all the top-level entries that contain EFI stuff */
> contain EFI stuff (i.e. exactly one of them unless EFI ends up *huge*)      
/* (i.e. exactly one of them, unless EFI ends up huge) */
> as a blank not-present region to avoid overlaps, but that's not a big      
/* as a blank not-present region to avoid overlaps, but that's not a big deal */
> deal.

There shouldn't be any room for conflicting with vmap() because the VA        
/* There shouldn't be any room for conflict with vmap(), */
region where we map EFI regions is still carved out especially for us.        
/* because the VA region where we map EFI regions is still carved out especially for us */

Right Boris?



Date: Fri, 6 Nov 2015 14:24:47 +0100
From: Borislav Petkov <bp@alien8.de>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andy Lutomirski <luto@amacapital.net>, Ingo Molnar <mingo@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel
        Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys
        Vlasenko <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.23 (2014-03-12)

On Fri, Nov 06, 2015 at 01:09:48PM +0000, Matt Fleming wrote:
> On Thu, 05 Nov, at 11:05:35PM, Andy Lutomirski wrote:
> >
> > Admittedly, we might need to use a certain amount of care to avoid
> > interesting conflicts with the vmap mechanism.  We might need to vmap
> > all of the EFI stuff, and possibly even all the top-level entries that
> > contain EFI stuff (i.e. exactly one of them unless EFI ends up *huge*)
> > as a blank not-present region to avoid overlaps, but that's not a big
> > deal.
>
> There shouldn't be any room for conflicting with vmap() because the VA
> region where we map EFI regions is still carved out especially for us.
>
> Right Boris?

Yap:    /* Yes */

ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)

vs

ffffffef00000000 - ffffffff00000000 EFI region in trampoline_pgd

the new pagetable will make that issue moot too.                               
/* The new page table will make that issue moot as well */

--
Regards/Gruss,
    Boris.