Prior read:
Re: [PATCH v2] x86/mm: warn on W+x mappings
Date: Fri, 6 Nov 2015 11:39:43 +0000
From: Matt Fleming <matt@codeblueprint.co.uk>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dave Jones <davej@codemonkey.org.uk>, Ingo Molnar <mingo@kernel.org>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>,
Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Stephen Smalley <sds@tycho.nsa.gov>,
linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.24 (2015-08-30)
We have separate page tables today, for a few reasons, but mainly it's
/* We currently have separate page tables, for a few reasons. */
so that we can have an identity mapping of memory present in the
/* The main reason is to keep an identity (1:1) mapping of memory */
region usually used by user processes - broken firmware still uses
/* in the region normally used by user processes; broken firmware still uses that identity mapping */
those identity mappings even after the kernel tells it they're
/* even after the kernel has told it the mappings are invalid. */
invalid.
Note that when I say "separate" I'm talking about trampoline_pgd[]
which is also used by the x86 suspend/resume code.
/* Note: "separate" here refers to trampoline_pgd[], which is also used by the x86 suspend/resume code. */
However, turns out that the issue with the current scheme is the fact
/* The problem with the current scheme turns out to be that trampoline_pgd[] shares a couple of PGD entries with swapper_pg_dir. */
that trampoline_pgd[] actually shares a couple of PGD entries with
swapper_pg_dir as can be seen in setup_real_mode(),
trampoline_pgd = (u64 *)__va(real_mode_header->trampoline_pgd);
trampoline_pgd[0] = init_level4_pgt[pgd_index(__PAGE_OFFSET)].pgd;
trampoline_pgd[511] = init_level4_pgt[511].pgd;
So when we map the EFI regions in efi_map_regions() we're inserting
/* So mapping the EFI regions also inserts them into swapper_pg_dir. */
them into swapper_pg_dir also, which is why you're seeing the
warnings.
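The sharing described above can be modelled with a toy two-level table. This is only a sketch for study purposes: `toy_pgd`, `toy_lower`, `toy_share_slot()`, `toy_map()` and `toy_lookup()` are hypothetical names, not kernel APIs, and real x86-64 page tables have more levels and hardware-defined entry formats. The point it illustrates is that copying a top-level entry copies a pointer, so anything later installed under that entry is visible through both top-level tables.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PTRS_PER_TABLE 512

/* Toy two-level table: each top-level slot points at a lower-level table. */
struct toy_lower { uint64_t entry[TOY_PTRS_PER_TABLE]; };
struct toy_pgd { struct toy_lower *entry[TOY_PTRS_PER_TABLE]; };

/* Copy one top-level slot, mirroring the .pgd copies in setup_real_mode(). */
static void toy_share_slot(struct toy_pgd *dst, const struct toy_pgd *src,
                           unsigned int slot)
{
    assert(slot < TOY_PTRS_PER_TABLE);
    dst->entry[slot] = src->entry[slot];
}

/* Install a value through one table; -1 if the slot has no lower table. */
static int toy_map(struct toy_pgd *pgd, unsigned int slot, unsigned int idx,
                   uint64_t val)
{
    if (slot >= TOY_PTRS_PER_TABLE || idx >= TOY_PTRS_PER_TABLE ||
        !pgd->entry[slot])
        return -1;
    pgd->entry[slot]->entry[idx] = val;
    return 0;
}

/* Read a value back through a (possibly different) top-level table. */
static uint64_t toy_lookup(const struct toy_pgd *pgd, unsigned int slot,
                           unsigned int idx)
{
    if (slot >= TOY_PTRS_PER_TABLE || idx >= TOY_PTRS_PER_TABLE ||
        !pgd->entry[slot])
        return 0;
    return pgd->entry[slot]->entry[idx];
}
```

If a "swapper" table owns slot 511 and a "trampoline" table shares it via `toy_share_slot()`, a `toy_map()` call through the trampoline table can be read back with `toy_lookup()` on the swapper table. That is the toy analogue of EFI mappings showing up in swapper_pg_dir.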
If I remember correctly the rationale for using trampoline_pgd[] was
/* trampoline_pgd[] was chosen because it already did what we wanted (it provided the identity mapping) */
that it already did what we wanted (provided the identity mapping) and
would save us the overhead of maintaining more page tables for no good
/* and it saved the overhead of maintaining more page tables. */
reason. Obviously this entire thread is a good reason.
I suggest we stop using trampoline_pgd[] (since it has a good reason
/* Suggestion: stop using trampoline_pgd[] (which has a good reason to share the kernel-mapping PGD entries) */
for sharing the kernel mapping PGD entries) and create our own so that
/* and create our own PGD so that EFI can be isolated completely. */
we can isolate EFI completely.
For the immediate problem of the warnings spewing forth on all UEFI
machines, at the very least the config options needs to be disabled by
/* At the very least, the config option must be disabled by default. */
default, if not the patch reverted.
Date: Sat, 7 Nov 2015 08:05:54 +0100
From: Ingo Molnar <mingo@kernel.org>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin"
<hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Stephen Smalley <sds@tycho.nsa.gov>,
linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.23 (2014-03-12)
* Matt Fleming <matt@codeblueprint.co.uk> wrote:
> On Thu, 05 Nov, at 01:33:10PM, Linus Torvalds wrote:
[...]
> I suggest we stop using trampoline_pgd[] (since it has a good reason
> for sharing the kernel mapping PGD entries) and create our own so that
> we can isolate EFI completely.
Ok. Could you please make this fix a priority for upcoming EFI changes?
> For the immediate problem of the warnings spewing forth on all UEFI
> machines, at the very least the config options needs to be disabled by
> default, if not the patch reverted.
We'll certainly flip around the default, but reverting would be shooting
/* We will certainly flip the default. */
the messenger: the EFI code is endangering everyone else today, and for
/* The EFI code is endangering everyone else today, apparently for no good reason. */
no good reason as it appears... so the warning very much served its
/* So the warning (CONFIG_DEBUG_WX) served its purpose well: it pointed out a valid problem. */
purpose in pointing out a valid problem.
Thanks,
Ingo
Date: Fri, 6 Nov 2015 12:39:12 +0000
From: Matt Fleming <matt@codeblueprint.co.uk>
To: Ingo Molnar <mingo@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner
<tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko
<dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org, Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.24 (2015-08-30)
On Fri, 06 Nov, at 07:55:50AM, Ingo Molnar wrote:
>
> 3) We should fix the EFI permission problem without relying on the firmware: it
/* We should fix the EFI permission problem without relying on the firmware. */
> appears we could just mark everything R-X optimistically, and if a write fault
/* We could optimistically mark everything R-X. */
> happens (it's pretty rare in fact, only triggers when we write to an EFI
/* If a write fault happens (rare; it only triggers when we write to an EFI variable and the like), we can mark the faulting page RW- on the fly, */
> variable and so), we can mark the faulting page RW- on the fly, because it
> appears that writable EFI sections, while not enumerated very well in 'old'
/* because it appears that writable EFI sections, though poorly enumerated by 'old' firmware, are still supposed to be page granular. */
> firmware, are still supposed to be page granular. (Even 'new' firmware I
/* (Even with 'new' firmware, I would not automatically trust the enumeration to be right...) */
> wouldn't automatically trust to get the enumeration right...)
Sorry, this isn't true. I misled you with one of my earlier posts on
/* Sorry, this is not true; I misled you. */
this topic. Let me try and clear things up...
Writing to EFI regions has to do with every invocation of the EFI
/* EFI regions are written on every invocation of the EFI runtime services, not only when reading, writing, or deleting EFI variables. */
runtime services - it's not limited to when you read/write/delete EFI
variables. In fact, EFI variables really have nothing to do with this
/* In fact, EFI variables have nothing to do with this discussion. */
discussion, they're a completely opaque concept to the OS, we have no
/* They are a completely opaque concept to the OS. */
idea how the firmware implements them. Everything is done via the EFI
boot/runtime services.
The firmware itself will attempt to write to EFI regions when we
/* The firmware itself writes to EFI regions when we invoke the EFI services, because the PE/COFF .data and .bss sections live there, along with the heap. */
invoke the EFI services because that's where the PE/COFF ".data" and
".bss" sections live along with the heap. There's even some relocation
/* Some relocation fixups even happen at SetVirtualAddressMap() time, so .text gets written too. */
fixups that occur as SetVirtualAddressMap() time so it'll write to
".text" too.
Now, the above PE/COFF sections are usually (always?) contained within
/* These PE/COFF sections are usually (always?) contained in EFI regions of type EfiRuntimeServicesCode. */
EFI regions of type EfiRuntimeServicesCode. We know this is true
/* We know this because the firmware developers told us so, */
because the firmware folks have told us so, and because stopping that
/* and because stopping that is the motivation behind the new EFI_PROPERTIES_TABLE feature. */
is the motivation behind the new EFI_PROPERTIES_TABLE feature in UEFI
V2.5.
The data sections within the region are also *not* guaranteed to be
/* The data sections within the region are also not guaranteed to be page granular, */
page granular because work was required in Tianocore for emitting
/* because Tianocore needed extra work to emit 4k-aligned sections as part of its EFI_PROPERTIES_TABLE support. */
sections with 4k alignment as part of the EFI_PROPERTIES_TABLE
support.
Ultimately, what this means is that if you were to attempt to
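/* Ultimately: if you tried to dynamically fix up the regions that need write permission, you would end up modifying the mappings for the majority of the EFI regions anyway. */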
dynamically fixup those regions that required write permission, you'd
have to modify the mappings for the majority of the EFI regions
anyway. And if you're blindly allowing write permission as a fixup,
/* And blindly granting write permission as a fixup buys little security. */
there's not much security to be had.
> If that 'supposed to be' turns out to be 'not true' (not unheard of in
/* If that 'supposed to be' turns out to be 'not true' (not unheard of in firmware land), */
> firmware land), then plan B would be to mark pages that generate write faults
/* plan B would be to mark the pages that generate write faults RWX as well, so functionality does not break. */
> RWX as well, to not break functionality. (This 'mark it RWX' is not something
/* This 'mark it RWX' path is not something exploits could easily reach, and we could also emit a warning [after the EFI call has finished] if it ever triggers. */
> that exploits would have easy access to, and we could also generate a warning
> [after the EFI call has finished] if it ever triggers.)
>
> Admittedly this approach might not be without its own complications, but it
/* Admittedly, this approach may have complications of its own, */
> looks reasonably simple (I don't think we need per EFI call page tables,
/* but it looks reasonably simple (I don't think we need per-EFI-call page tables, etc.), */
> etc.), and does not assume much about the firmware being able to enumerate its
/* and it does not assume much about the firmware enumerating its permissions properly. */
> permissions properly. Were we to merge EFI support today I'd have insisted on
> trying such an approach from day 1 on.
We already have separate EFI page tables, though with the caveat that
/* We already have separate EFI page tables, with the caveat that */
we share some of swapper_pg_dir's PGD entries. The best solution would
/* we share some of swapper_pg_dir's PGD entries. */
be to stop sharing entries and isolate the EFI mappings from every
/* The best solution is to stop sharing entries and isolate the EFI mappings from every other page table structure, */
other page table structure, so that they're only used during the EFI
/* so that they (the EFI mapping page tables) are used only during EFI service calls. */
service calls.
Date: Sat, 7 Nov 2015 08:09:22 +0100
From: Ingo Molnar <mingo@kernel.org>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner
<tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys Vlasenko
<dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org, Ard Biesheuvel <ard.biesheuvel@linaro.org>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.23 (2014-03-12)
* Matt Fleming <matt@codeblueprint.co.uk> wrote:
> On Fri, 06 Nov, at 07:55:50AM, Ingo Molnar wrote:
> >
[...]
>
> Ultimately, what this means is that if you were to attempt to
> dynamically fixup those regions that required write permission, you'd
> have to modify the mappings for the majority of the EFI regions
> anyway. And if you're blindly allowing write permission as a fixup,
> there's not much security to be had.
I think you misunderstood my suggestion: the 'fixup' would be changing it from R-X
/* The 'fixup' means changing R-X to RW-; that is, it adds write permission but removes execute permission. */
to RW-, i.e. it would add 'write' permission but remove 'execute' permission.
Note that there would be no 'RWX' permission at any given moment - which is the
/* Note that RWX, the dangerous combination, would never exist at any moment. */
dangerous combination.
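The permission transition Ingo describes can be written down as a tiny state change. This is a minimal sketch, not kernel code: the `PERM_*` bits and both function names are made up for illustration, and it says nothing about whether real firmware tolerates losing execute permission on a page it writes to.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical permission bits, for illustration only. */
enum { PERM_R = 1u, PERM_W = 2u, PERM_X = 4u };

/* True when a mapping is both writable and executable (violates W^X). */
static bool perm_is_wx(unsigned int perm)
{
    return (perm & PERM_W) && (perm & PERM_X);
}

/*
 * Write-fault fixup as described in the mail: add W, drop X.
 * R-X becomes RW-, so RWX never exists at any moment.
 */
static unsigned int fixup_on_write_fault(unsigned int perm)
{
    unsigned int fixed = (perm | PERM_W) & ~(unsigned int)PERM_X;

    assert(!perm_is_wx(fixed));
    return fixed;
}
```

Calling `fixup_on_write_fault(PERM_R | PERM_X)` yields `PERM_R | PERM_W`. The "plan B" from the earlier mail would instead keep `PERM_X` set, which is exactly the state `perm_is_wx()` flags.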
> > If that 'supposed to be' turns out to be 'not true' (not unheard of in
> > firmware land), then plan B would be to mark pages that generate write faults
> > RWX as well, to not break functionality. (This 'mark it RWX' is not something
> > that exploits would have easy access to, and we could also generate a warning
> > [after the EFI call has finished] if it ever triggers.)
> >
> > Admittedly this approach might not be without its own complications, but it
> > looks reasonably simple (I don't think we need per EFI call page tables,
> > etc.), and does not assume much about the firmware being able to enumerate its
> > permissions properly. Were we to merge EFI support today I'd have insisted on
> > trying such an approach from day 1 on.
>
> We already have separate EFI page tables, though with the caveat that
> we share some of swapper_pg_dir's PGD entries. The best solution would
> be to stop sharing entries and isolate the EFI mappings from every
> other page table structure, so that they're only used during the EFI
> service calls.
Absolutely. Can you try to fix this for v4.3?
Thanks,
Ingo
Date: Sat, 7 Nov 2015 08:39:35 +0100
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: Ingo Molnar <mingo@kernel.org>
Cc: Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List
<linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski
<luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Matt Fleming <matt@codeblueprint.co.uk> wrote:
>
[...]
>
> I think you misunderstood my suggestion: the 'fixup' would be changing it from R-X
> to RW-, i.e. it would add 'write' permission but remove 'execute' permission.
>
> Note that there would be no 'RWX' permission at any given moment - which is the
> dangerous combination.
>
The problem with that is that /any/ page in the UEFI runtime region
/* The problem: any page in the UEFI runtime region may intersect both .text and .data of any of the PE/COFF images that make up the runtime firmware, */
may intersect with both .text and .data of any of the PE/COFF images
that make up the runtime firmware (since the PE/COFF sections are not
/* since PE/COFF sections are not necessarily page aligned. */
necessarily page aligned). Such pages require RWX permissions. The
/* Such pages require RWX permissions. */
UEFI memory map does not provide the information to identify those
/* The UEFI memory map does not provide the information needed to identify those pages in advance */
pages a priori (the entire region containing several PE/COFF images
/* (an entire region containing several PE/COFF images could be covered by a single entry), */
could be covered by a single entry) so it is hard to guess which pages
/* so it is hard to guess which pages should be allowed RWX. */
should be allowed these RWX permissions.
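Ard's point about pages that straddle .text and .data can be checked with a small calculation. The sketch below assumes the section layout is known, which is exactly the information the OS does not get from the UEFI memory map; `toy_section`, `page_required_perm()` and the `NEED_*` bits are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SIZE 4096u

/* Hypothetical permission bits, for illustration only. */
enum { NEED_R = 1u, NEED_W = 2u, NEED_X = 4u };

/* One loaded section: byte offset in the region, size, needed permissions. */
struct toy_section {
    uint64_t start;
    uint64_t size;
    unsigned int perm;
};

/* Union of the permissions of every section that overlaps the given page. */
static unsigned int page_required_perm(const struct toy_section *secs,
                                       unsigned int n, uint64_t page_index)
{
    uint64_t lo = page_index * TOY_PAGE_SIZE;
    uint64_t hi = lo + TOY_PAGE_SIZE;
    unsigned int perm = 0;

    assert(secs || n == 0);
    for (unsigned int i = 0; i < n; i++) {
        uint64_t s = secs[i].start;
        uint64_t e = s + secs[i].size;

        if (secs[i].size && s < hi && e > lo)
            perm |= secs[i].perm;
    }
    return perm;
}
```

With a .text section covering bytes 0x0 to 0x121f and a .data section starting at the 32-byte-aligned offset 0x1220, page 0 needs only R-X, but page 1 holds the tail of .text and the start of .data, so it needs R, W and X together.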
Date: Sat, 7 Nov 2015 22:58:52 -0800
From: Kees Cook <keescook@chromium.org>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
On Fri, Nov 6, 2015 at 11:39 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Matt Fleming <matt@codeblueprint.co.uk> wrote:
>>
>>> On Fri, 06 Nov, at 07:55:50AM, Ingo Molnar wrote:
>>> >
[...]
>
> The problem with that is that /any/ page in the UEFI runtime region
> may intersect with both .text and .data of any of the PE/COFF images
> that make up the runtime firmware (since the PE/COFF sections are not
> necessarily page aligned). Such pages require RWX permissions. The
> UEFI memory map does not provide the information to identify those
> pages a priori (the entire region containing several PE/COFF images
> could be covered by a single entry) so it is hard to guess which pages
> should be allowed these RWX permissions.
I'm sad that UEFI was designed without even the most basic of memory
/* It is sad that UEFI was designed without even the most basic memory protections in mind. */
protections in mind. UEFI _itself_ should be setting up protective
/* UEFI itself should be setting up protective page mappings. */
page mappings. :(
For a boot firmware, it seems to me that safe page table layout would
/* For boot firmware, a safe page table layout seems like it should be a top-priority bug. */
be a top priority bug. The "reporting issues" page for TianoCore
doesn't actually seem to link to the "Project Tracker":
https://github.com/tianocore/tianocore.github.io/wiki/Reporting-Issues
Does anyone know how to get this correctly reported so future UEFI
releases don't suffer from this?
-Kees
Date: Sun, 8 Nov 2015 08:55:24 +0100
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
On 8 November 2015 at 07:58, Kees Cook <keescook@chromium.org> wrote:
> On Fri, Nov 6, 2015 at 11:39 PM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
>> On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>>>
>>> * Matt Fleming <matt@codeblueprint.co.uk> wrote:
>>>
[...]
>
> I'm sad that UEFI was designed without even the most basic of memory
> protections in mind. UEFI _itself_ should be setting up protective
> page mappings. :(
>
Well, the 4 KB alignment of sections was considered prohibitive at the
/* At the time, 4 KB section alignment was considered prohibitive from a code-size point of view. That was a long time ago, though. */
time from code size pov. But this was a long time ago, obviously.
> For a boot firmware, it seems to me that safe page table layout would
> be a top priority bug. The "reporting issues" page for TianoCore
> doesn't actually seem to link to the "Project Tracker":
> https://github.com/tianocore/tianocore.github.io/wiki/Reporting-Issues
>
> Does anyone know how to get this correctly reported so future UEFI
> releases don't suffer from this?
>
Ugh. Don't get me started on that topic. I have been working with the
/* Don't get me started on that topic. */
UEFI forum since July to get a fundamentally broken implementation of
/* I have been working with the UEFI Forum since July to get a fundamentally broken implementation of memory protections fixed. */
memory protections fixed. UEFI v2.5 defines a memory protection scheme
/* UEFI v2.5 defines a memory protection scheme based on splitting PE/COFF images into separate memory regions, */
that is based on splitting PE/COFF images into separate memory regions
so that R-X and RW- permissions can be applied. Unfortunately, that
/* so that R-X and RW- permissions can be applied. */
broke every OS in existence (including Windows 8), since the OS is
/* Unfortunately, that broke every OS in existence (including Windows 8), */
allowed to reorder memory regions when it lays out the virtual
/* because the OS is allowed to reorder memory regions when it lays out the virtual remapping of the UEFI regions, */
remapping of the UEFI regions, resulting in PE/COFF .data and .text
/* so PE/COFF .data and .text may end up out of order. */
potentially appearing out of order.
The good news is that we fixed it for the upcoming release (v2.6). I
/* The good news: this is fixed for the upcoming v2.6 release. I can't disclose specifics, though :-( */
can't disclose any specifics, though :-(
Date: Mon, 9 Nov 2015 13:08:01 -0800
From: Kees Cook <keescook@chromium.org>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
On Sat, Nov 7, 2015 at 11:55 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 8 November 2015 at 07:58, Kees Cook <keescook@chromium.org> wrote:
>> On Fri, Nov 6, 2015 at 11:39 PM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
>>> On 7 November 2015 at 08:09, Ingo Molnar <mingo@kernel.org> wrote:
>>>>
[...]
>
> Well, the 4 KB alignment of sections was considered prohibitive at the
> time from code size pov. But this was a long time ago, obviously.
Heh, yeah, I'd expect max 4K padding to get code/data correctly
/* I'd expect at most 4K of padding, to get code/data correctly aligned, not to be an issue on a 2MB binary. */
aligned on a 2MB binary to not be an issue. :)
[...]
>
> Ugh. Don't get me started on that topic. I have been working with the
> UEFI forum since July to get a fundamentally broken implementation of
> memory protections fixed. UEFI v2.5 defines a memory protection scheme
> that is based on splitting PE/COFF images into separate memory regions
> so that R-X and RW- permissions can be applied. Unfortunately, that
> broke every OS in existence (including Windows 8), since the OS is
> allowed to reorder memory regions when it lays out the virtual
> remapping of the UEFI regions, resulting in PE/COFF .data and .text
> potentially appearing out of order.
>
> The good news is that we fixed it for the upcoming release (v2.6). I
> can't disclose any specifics, though :-(
As long as there's motion to getting it fixed, that makes me happy! :)
/* As long as there is movement toward getting it fixed, I'm happy! */
Does 2.6 get rid of the (AIUI) 2MB limit too?
/* Does 2.6 also get rid of the (as I understand it) 2MB limit? */
-Kees
Date: Tue, 10 Nov 2015 08:08:30 +0100
From: Ard Biesheuvel <ard.biesheuvel@linaro.org>
To: Kees Cook <keescook@chromium.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
On 9 November 2015 at 22:08, Kees Cook <keescook@chromium.org> wrote:
> On Sat, Nov 7, 2015 at 11:55 PM, Ard Biesheuvel
> <ard.biesheuvel@linaro.org> wrote:
[...]
>
> Heh, yeah, I'd expect max 4K padding to get code/data correctly
> aligned on a 2MB binary to not be an issue. :)
>
This is not about section sizes on ARM. The PE/COFF format does not
/* This is not about section sizes on ARM. */
use segments, like ELF, so the payload (the sections) needs to be
/* Unlike ELF, the PE/COFF format does not use segments, */
completely disjoint from the header. This means, when using 4 KB
/* so the payload (the sections) must be completely disjoint from the header. */
alignment, that every PE/COFF image wastes ~4 KB in the header and 4
/* With 4 KB alignment, every PE/COFF image wastes about 4 KB in the header and 4 KB on average in section padding */
KB on average in the section padding (assuming a .text/.data/.reloc
/* (assuming a .text/.data/.reloc layout, which is common for PE/COFF). */
layout, as is common with PE/COFF)
Considering that a typical UEFI firmware image consists of numerous
/* A typical UEFI firmware image consists of numerous (around 50 on average, I think) PE/COFF images, */
(around 50 on average, I think) PE/COFF images, and some of them
/* and some of them execute from NOR flash, so the Tianocore tooling (the reference implementation) */
execute from NOR flash, the Tianocore tooling (which is the reference
/* has always been geared toward keeping the alignment as small as possible, */
implementation) has always been geared towards keeping the alignment
as small as possible, typically 32 bytes unless data objects need
/* typically 32 bytes, unless data objects need more. */
more. Since the UEFI runtime services are typically implemented by
/* Since the UEFI runtime services are typically implemented by several of these PE/COFF images, */
several of these PE/COFF images, and since the memory they occupy may
/* and since the memory they occupy may be described by a single UEFI memory map entry, */
be described by a single UEFI memory map entry, there is simply no
/* there is simply no easy way to decide which pages need R-X, RW-, or RWX. */
easy way to decide which pages need R-X, RW- or RWX. Even looking for
/* Even looking for PE/COFF headers in the memory region is not guaranteed to work, because */
PE/COFF headers in the memory region is not guaranteed to work, since
the PE/COFF header is part of the file format, not the memory format
/* the PE/COFF header is part of the file format, not the memory format */
(i.e., since the header is disjoint from the payload, a PE/COFF loader
/* (i.e., since the header is disjoint from the payload, a PE/COFF loader is not required to copy the header to memory). */
is not required to copy the header to memory)
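The size argument in this mail reduces to simple arithmetic. The sketch below plugs in Ard's own rough figures (about 4 KB lost in the header, 4 KB of section padding on average, around 50 images); `est_waste_bytes()` and `align_up()` are hypothetical helpers, and the result is an estimate, not a measurement of any real firmware.

```c
#include <assert.h>
#include <stdint.h>

/* Rough bytes lost across all images: header slot plus average padding each. */
static uint64_t est_waste_bytes(uint64_t images, uint64_t header_waste,
                                uint64_t avg_padding)
{
    return images * (header_waste + avg_padding);
}

/* Round size up to the next multiple of align (align must be a power of two). */
static uint64_t align_up(uint64_t size, uint64_t align)
{
    assert(align && (align & (align - 1)) == 0);
    return (size + align - 1) & ~(align - 1);
}
```

With those figures, `est_waste_bytes(50, 4096, 4096)` comes to 409600 bytes, about 400 KiB of padding. `align_up()` shows where the padding comes from: a section ending at 0x1220 stays at 0x1220 with 32-byte alignment but is pushed out to 0x2000 with 4 KB alignment.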
>
> As long as there's motion to getting it fixed, that makes me happy! :)
> Does 2.6 get rid of the (AIUI) 2MB limit too?
>
No, there is no such limit in UEFI. If there is a limit like that, it
/* No, UEFI has no such limit. If a limit like that exists, */
is an implementation detail of the UEFI support in the OS.
/* it is an implementation detail of the OS's UEFI support. */
For arm64 (and the upcoming ARM support), the UEFI runtime services
/* On arm64 (and in the upcoming ARM support), the UEFI runtime services regions */
regions are remapped into a virtual userland range that is only active
/* are remapped into a virtual userland range */
during the time runtime services are being invoked. (x86 does
/* that is active only while runtime services are being invoked. */
something similar, but it shares the page tables with the
/* (x86 does something similar but, as far as I understand, shares the page tables with the suspend/resume code.) */
suspend/resume code afaiu) These mappings could be page granularity
/* These mappings could be page granular (since they don't require splitting PUDs or PMDs in the linear region), */
(since they don't require splitting PUDs or PMDs in the linear
region), with the side note that arm64 mandates 64 KB alignment (to
/* with the side note that arm64 mandates 64 KB alignment (to interoperate with OSes using 64 KB pages). */
interoperate with 64 KB pages OSes). This requirement has been added
/* This requirement has been added to the UEFI spec; that is, */
to the UEFI spec, i.e., a v2.5 compliant arm64 firmware should not
/* v2.5-compliant arm64 firmware should not expose UEFI runtime regions that are not 64 KB aligned. */
expose UEFI runtime regions that are not 64 KB aligned.
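The 64 KB requirement for arm64 runtime regions can be expressed as a simple check. This is a sketch under assumptions: `toy_efi_region` is a hypothetical stand-in for a memory map entry (UEFI counts region sizes in 4 KiB pages), and checking both the start address and the size is this sketch's reading of "64 KB aligned"; the mail itself does not spell that out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SZ_64K        0x10000ull
#define TOY_EFI_PAGE_SIZE 4096ull /* UEFI memory map sizes are in 4 KiB pages */

/* Hypothetical, simplified view of one runtime memory map entry. */
struct toy_efi_region {
    uint64_t phys_addr;
    uint64_t num_pages;
};

/* True when both the start and the length are multiples of 64 KiB. */
static bool region_is_64k_aligned(const struct toy_efi_region *r)
{
    assert(r);
    uint64_t size = r->num_pages * TOY_EFI_PAGE_SIZE;

    return (r->phys_addr % TOY_SZ_64K == 0) && (size % TOY_SZ_64K == 0);
}
```

A region at 0x10000 spanning 16 pages passes, while one starting at 0x11000, or one spanning only 15 pages, would not be mappable with 64 KB pages without also covering neighbouring memory.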
Date: Tue, 10 Nov 2015 12:11:18 -0800
From: Kees Cook <keescook@chromium.org>
To: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>, Matt Fleming <matt@codeblueprint.co.uk>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux
Kernel Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>,
Andy Lutomirski <luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, "linux-efi@vger.kernel.org" <linux-efi@vger.kernel.org>, Matthew Garrett <mjg59@coreos.com>
Subject: Re: [GIT PULL] x86/mm changes for v4.4
On Mon, Nov 9, 2015 at 11:08 PM, Ard Biesheuvel
<ard.biesheuvel@linaro.org> wrote:
> On 9 November 2015 at 22:08, Kees Cook <keescook@chromium.org> wrote:
>> On Sat, Nov 7, 2015 at 11:55 PM, Ard Biesheuvel
>> <ard.biesheuvel@linaro.org> wrote:
[...]
>
> This is not about section sizes on ARM. The PE/COFF format does not
> use segments, like ELF, so the payload (the sections) needs to be
> completely disjoint from the header. This means, when using 4 KB
> alignment, that every PE/COFF image wastes ~4 KB in the header and 4
> KB on average in the section padding (assuming a .text/.data/.reloc
> layout, as is common with PE/COFF)
>
> Considering that a typical UEFI firmware image consists of numerous
> (around 50 on average, I think) PE/COFF images, and some of them
Oooh, that's no fun. So the linker can't produce merged .text and
/* Oh, that's no fun. So the linker can't produce merged .text and .data sections? */
.data sections?
[...]
>
> No, there is no such limit in UEFI. If there is a limit like that, it
> is an implementation detail of the UEFI support in the OS.
>
> For arm64 (and the upcoming ARM support), the UEFI runtime services
> regions are remapped into a virtual userland range that is only active
> during the time runtime services are being invoked. (x86 does
> something similar, but it shares the page tables with the
> suspend/resume code afaiu) These mappings could be page granularity
> (since they don't require splitting PUDs or PMDs in the linear
> region), with the side note that arm64 mandates 64 KB alignment (to
> interoperate with 64 KB pages OSes). This requirement has been added
> to the UEFI spec, i.e., a v2.5 compliant arm64 firmware should not
> expose UEFI runtime regions that are not 64 KB aligned.
Cool, thanks for the details!
-Kees
Date: Fri, 6 Nov 2015 13:09:48 +0000
From: Matt Fleming <matt@codeblueprint.co.uk>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Ingo Molnar <mingo@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel Mailing List
<linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Borislav Petkov <bp@alien8.de>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski
<luto@kernel.org>, Denys Vlasenko <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.24 (2015-08-30)
On Thu, 05 Nov, at 11:05:35PM, Andy Lutomirski wrote:
>
> Admittedly, we might need to use a certain amount of care to avoid
/* Admittedly, we may need some care to avoid interesting conflicts with the vmap mechanism. */
> interesting conflicts with the vmap mechanism. We might need to vmap
/* We might need to vmap all of the EFI stuff, */
> all of the EFI stuff, and possibly even all the top-level entries that
/* and possibly even all the top-level entries that contain EFI stuff */
> contain EFI stuff (i.e. exactly one of them unless EFI ends up *huge*)
/* (i.e. exactly one of them, unless EFI ends up huge) */
> as a blank not-present region to avoid overlaps, but that's not a big
/* as a blank not-present region to avoid overlaps, but that's not a big deal. */
> deal.
There shouldn't be any room for conflicting with vmap() because the VA
/* There should be no room for conflict with vmap(), */
region where we map EFI regions is still carved out especially for us.
/* because the VA region where we map EFI regions is still carved out especially for us. */
Right Boris?
Date: Fri, 6 Nov 2015 14:24:47 +0100
From: Borislav Petkov <bp@alien8.de>
To: Matt Fleming <matt@codeblueprint.co.uk>
Cc: Andy Lutomirski <luto@amacapital.net>, Ingo Molnar <mingo@kernel.org>, Linus Torvalds <torvalds@linux-foundation.org>, Stephen Smalley <sds@tycho.nsa.gov>, Dave Jones <davej@codemonkey.org.uk>, Linux Kernel
Mailing List <linux-kernel@vger.kernel.org>, Thomas Gleixner <tglx@linutronix.de>, "H. Peter Anvin" <hpa@zytor.com>, Andrew Morton <akpm@linux-foundation.org>, Andy Lutomirski <luto@kernel.org>, Denys
Vlasenko <dvlasenk@redhat.com>, Kees Cook <keescook@chromium.org>, linux-efi@vger.kernel.org
Subject: Re: [GIT PULL] x86/mm changes for v4.4
User-Agent: Mutt/1.5.23 (2014-03-12)
On Fri, Nov 06, 2015 at 01:09:48PM +0000, Matt Fleming wrote:
> On Thu, 05 Nov, at 11:05:35PM, Andy Lutomirski wrote:
> >
> > Admittedly, we might need to use a certain amount of care to avoid
> > interesting conflicts with the vmap mechanism. We might need to vmap
> > all of the EFI stuff, and possibly even all the top-level entries that
> > contain EFI stuff (i.e. exactly one of them unless EFI ends up *huge*)
> > as a blank not-present region to avoid overlaps, but that's not a big
> > deal.
>
> There shouldn't be any room for conflicting with vmap() because the VA
> region where we map EFI regions is still carved out especially for us.
>
> Right Boris?
Yap: /* Yes */
ffffea0000000000 - ffffeaffffffffff (=40 bits) virtual memory map (1TB)
vs
ffffffef00000000 - ffffffff00000000 EFI region in trampoline_pgd
the new pagetable will make that issue moot too.
/* The new page table will make that issue moot as well. */
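The two ranges Boris quotes can be compared mechanically. The sketch below treats both bounds as inclusive exactly as they are printed in the mail, and computes the top-level table index for 4-level x86-64 paging (bits 47..39 of the virtual address); `va_range`, `ranges_overlap()` and `toy_pgd_index()` are illustrative names, not the kernel's own helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A virtual address range; both bounds are treated as inclusive here. */
struct va_range {
    uint64_t start;
    uint64_t end;
};

/* True when the two ranges share at least one address. */
static bool ranges_overlap(struct va_range a, struct va_range b)
{
    assert(a.start <= a.end && b.start <= b.end);
    return a.start <= b.end && b.start <= a.end;
}

/* Top-level index under 4-level x86-64 paging: bits 47..39 of the address. */
static unsigned int toy_pgd_index(uint64_t va)
{
    return (unsigned int)((va >> 39) & 0x1ff);
}
```

For the quoted ranges, `ranges_overlap()` reports no overlap, and the EFI range starting at ffffffef00000000 falls in top-level slot 511, the same slot that setup_real_mode() copies into trampoline_pgd[511]. The virtual memory map range starts in a different slot (468).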
--
Regards/Gruss,
Boris.