Exploiting Bugs in Early Boot Code with UEFI Capsules

January 4, 2025 exploit EFI secure boot coreboot

Recently, support for in-memory UEFI-capsule updates  was introduced to
the firmware framework coreboot [1]. The original implementation wasn't
accounting for potential integer overflows, which could be exploited by
an adversary with control over memory contents before a reboot. Because
early boot firmware often doesn't implement modern countermeasures, the
exploit I'm going to describe is rather simple. This highlights the
importance of avoiding untrusted input in early boot stages altogether.

Note:  For readers who are generally familiar with exploits, this might
be a dull read.  The point of this post is primarily to show how easily
bugs can be exploited in unprotected, early boot code.

[1]: https://review.coreboot.org/c/coreboot/+/83422
     "drivers/efi/uefi_capsules.c: coalesce and store UEFI capsules"

Environment and Threat Model

A UEFI capsule is a data container passed to UEFI firmware. Usually, a capsule contains a firmware update, but it can also contain code to be run by the firmware. Sometimes, a capsule contains an update program paired with the firmware update itself.

Capsules can be passed to UEFI firmware via memory in the form of scatter-gather lists. These are placed in memory before a warm reboot, and persist until the firmware can pick them up. As the lists can be anywhere, not necessarily in a reserved area, care needs to be taken to gather the data early. The capsules have to be moved to a safe place that isn’t overwritten during the boot process. Hence, this processing usually happens in the first firmware phase that runs with the DRAM controller enabled, before user interaction is possible. In coreboot, this is the ramstage. In UEFI terms, this happens during the Pre-EFI Initialization (PEI). On x86, these early phases often still run in protected (32-bit) mode, sometimes even without paging.

One very important thing to note is that the UEFI capsule’s integrity and authenticity are supposed to be secured with digital signatures. This, however, does not cover the scatter-gather lists, and most implementations seem to first gather all the data, and only then verify all the capsules' signatures at once. Any code that runs between processing these lists and verifying the signatures can be influenced by these untrusted data structures, and bugs in such code can potentially be exploited.

We’ll consider a standard PC with UEFI secure boot. The latter is supposed to prevent booting into an untrusted or tampered operating system (OS) by verifying digital signatures of the OS. Assuming an attacker is able to compromise the running OS, either remotely or with physical access to an unlocked system, they could forge malicious scatter-gather lists to exploit bugs in the firmware. If an attacker could gain code execution in the early boot process this way, they could potentially

- circumvent secure boot and boot into any OS they want,
- manipulate the OS during boot or runtime, or
- install a rootkit that persists in the firmware (if no additional verification of the firmware runs on every boot).

In-memory Capsule Structures

The UEFI specification defines two structures that are parsed during the gather process. There can be multiple capsules, and each capsule starts with a header:

typedef struct {
  EFI_GUID    CapsuleGuid;
  UINT32      HeaderSize;
  UINT32      Flags;
  UINT32      CapsuleImageSize;
} EFI_CAPSULE_HEADER;

A capsule can be scattered across multiple ranges of memory pages. And each contiguous range of memory pages is described by a very simple structure:

typedef struct {
  UINT64                    Length;
  union {
    EFI_PHYSICAL_ADDRESS    DataBlock;
    EFI_PHYSICAL_ADDRESS    ContinuationPointer;
  } Union;
} EFI_CAPSULE_BLOCK_DESCRIPTOR;

These block descriptors form an array, which can itself be scattered: A descriptor with zero length but non-zero address points to a continuation of the array. The very last descriptor terminates the array with all fields set to zero. The following picture illustrates the idea: On the right-hand side we have unordered chunks of two capsules (1) and (2). The purpose of the block descriptors on the left is to put the chunks in order. The first page of the descriptor array references the header of capsule (1) plus four more chunks. It ends with a continuation pointer to further descriptors that reference capsule (2). With all this in memory, all that is left is to pass a pointer to the first descriptor page to the firmware.

     pot. scattered                            scattered capsules
    descriptor array                       +------------------------+
    +--------------+                ,----->| cont.2 (1)             |
    |______________|-----.         /       |------------------------|
    |______________|------\--.    /    ,-->| cont.1 (2)             |
    |______________|-------\--\--'    /    |                        |
    |______________|--------\--\--.  /     |------------------------|
    |______________|-------. \  `--\/----->| cont.1 (1)             |
  ,-|______________|        \ \    /\      |------------------------|
 /  |       +--------------+ \ \  /  `---->| cont.3 (1)             |
/   |    ,->|______________|--\-\/---.     |------------------------|
\   +-- /---|______________|---\'\    `--->| header (2)             |
 \     /    |______________|--. \ \        |------------------------|
  `---'     |              |   \ \ \       |                        |
            |              |    \ \ \      |------------------------|
            |              |     \ \ `---->| header (1)             |
            |              |      \ \      |                        |
            |              |       \ \     |------------------------|
            +--------------+        \ `--->| cont.4 (1)             |
                                     \     |------------------------|
                                      `--->| cont.2 (2)             |
                                           |                        |

It seems the idea here is to give the program that places the capsule in memory maximum flexibility. Even if this program has no chance to allocate consecutive memory pages, it can still place capsules of any 32-bit size. And also very interesting: The total number of capsules is only limited by the available amount of memory.
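
To make the three kinds of descriptors concrete, here is a minimal sketch of a list walker. It is not coreboot's code: the helper names to_phys() and count_data_blocks() are made up for illustration, the fixed-width types stand in for the UEFI typedefs, and host pointers play the role of physical addresses. The step limit is my addition, because a forged list could point a continuation back at itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t EFI_PHYSICAL_ADDRESS;

typedef struct {
	uint64_t Length;
	union {
		EFI_PHYSICAL_ADDRESS DataBlock;
		EFI_PHYSICAL_ADDRESS ContinuationPointer;
	} Union;
} EFI_CAPSULE_BLOCK_DESCRIPTOR;

/* Hypothetical helper: use a host pointer as a "physical address". */
static EFI_PHYSICAL_ADDRESS to_phys(const void *p)
{
	return (EFI_PHYSICAL_ADDRESS)(uintptr_t)p;
}

/*
 * Walk a descriptor list and count the data blocks it references.
 * Returns -1 if no terminator shows up within max_steps descriptors,
 * which also bounds lists whose continuation pointers form a loop.
 */
static long count_data_blocks(const EFI_CAPSULE_BLOCK_DESCRIPTOR *desc,
			      size_t max_steps)
{
	long blocks = 0;

	assert(desc != NULL);
	for (size_t step = 0; step < max_steps; step++) {
		if (desc->Length == 0 && desc->Union.ContinuationPointer == 0)
			return blocks;	/* all-zero terminator */
		if (desc->Length == 0) {
			/* zero length, non-zero address: the array continues elsewhere */
			desc = (const EFI_CAPSULE_BLOCK_DESCRIPTOR *)
				(uintptr_t)desc->Union.ContinuationPointer;
			continue;
		}
		blocks++;	/* ordinary entry: Length bytes at DataBlock */
		desc++;
	}
	return -1;
}
```

Nothing in such a walk knows or cares whether two descriptors reference the same memory, which becomes relevant further down.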

Gathering Mechanism in coreboot

The coreboot gathering mechanism looks for up to 32 pointers to block-descriptor lists in NVRAM. In a first round, it goes through each list and performs some basic sanity checks: Are the pointers aligned? Do they point to valid memory? Does the first block, and each block that follows the end of a capsule, start with a capsule header? The total amount of capsule data is also counted, to later ensure that we have that much spare memory. Finally, if there are multiple, distinct pointers, the lists will be joined by turning the terminating block descriptor of the first list into a continuation pointer to the second list (similar to the two descriptor lists in the picture above).

A second round goes through the single, concatenated list and reserves all the memory used by the block descriptors and the capsule chunks. Then, additional space for CBMEM (a structure that grows during the boot process) is estimated and reserved.

At this point, coreboot has a memory map that should only show free space where it is safe to place the gathered capsule data for later processing. A suitable spot is selected there, based on the total amount of capsule data as calculated earlier. This will usually be directly below the space reserved for CBMEM.

Finally, in a third round over the descriptor lists, the actual capsule data is gathered and stored in the selected memory range.

A 64GiB Crapsuhl Structure

What could possibly go wrong? As described above, (a) calculating the amount of space required for the gathered capsule data, (b) selecting a suitable memory range, and (c) filling that memory are all distinct steps. This can only turn out well if all steps have the same picture of the data being processed. So what we are looking for is any deviation in the results of each step. (c) copies as much data as is given by the scatter-gather lists, (b) allocates as much memory as counted by (a). But what if (a) counted wrong?

In the code we can find the loop that accumulates the space needed for all capsules:

uint64_t data_size = 0;
while (!is_final_block(&block)) {
        ...
        data_size += ALIGN_UP(capsule_hdr->CapsuleImageSize, CAPSULE_ALIGNMENT);
        ...

Under normal circumstances this uint64_t seems like it would never overflow. We don’t have enough memory for this amount of capsule data, right? If we consider a maliciously forged scatter-gather list, however, it becomes possible. Also, as discovered later, the ALIGN_UP() can overflow if the 32-bit capsule size given in the header is close to 4GiB. The original exploit targeted the 64-bit overflow, though.
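
Both overflows are easy to reproduce in isolation. The following is a sketch under stated assumptions, not the coreboot code or its fix: CAPSULE_ALIGNMENT is set to 8 purely for illustration, and align_up_u32(), align_up_wide() and add_capsule_size() are hypothetical helpers. The first function rounds up in 32-bit arithmetic and wraps for sizes close to 4GiB; the other two widen to 64 bits and refuse to add when the running total would wrap.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed value for illustration only. */
#define CAPSULE_ALIGNMENT 8u

/* Round x up to a power-of-two alignment in 32-bit arithmetic (can wrap). */
static uint32_t align_up_u32(uint32_t x, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (uint32_t)(x + (align - 1)) & ~(align - 1);
}

/* Same rounding, but widened to 64 bits first, so it cannot wrap. */
static uint64_t align_up_wide(uint32_t x, uint32_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return ((uint64_t)x + (align - 1)) & ~((uint64_t)align - 1);
}

/* Add one capsule to the running total; fail instead of wrapping. */
static bool add_capsule_size(uint64_t *total, uint32_t image_size)
{
	uint64_t aligned = align_up_wide(image_size, CAPSULE_ALIGNMENT);

	if (aligned > UINT64_MAX - *total)
		return false;	/* the sum would exceed 64 bits */
	*total += aligned;
	return true;
}
```

With these assumptions, a header claiming 0xFFFFFFFF bytes is counted as zero bytes by the 32-bit variant, and as exactly 4GiB by the widened one.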

Given the flexibility of the scatter-gather lists, this requires far less than 2^64 bytes of memory space. The maximum size of each capsule is limited by the 32-bit CapsuleImageSize in its header, so that’s 4GiB - 1. We don’t need 4GiB of memory, however, to account for such a huge capsule. All we need is a capsule header that says it’s that big. Now, to get to 64 bits, do we need 4 billion capsule headers? Actually, no, because nothing stops us from listing the same capsule (header) multiple times. This naive approach still needs about 4 billion entries in the scatter-gather list. With 16B per entry, this sums up to a little more than 64GiB. While we don’t find this much memory in every PC, it’s also not too uncommon anymore and provided enough motivation to write a PoC exploit.

         ~64GiB
    descriptor array                           fake ~4GiB capsule
    +--------------+                       +------------------------+
    |______________|---,-,-,-,-,-,-,-,-,-->| header                 |
    |______________|--' / / / / / / / /    |------------------------|
    |______________|---' / / / / / / /     |                        |
    |______________|----' / / / / / /      |                        |
    |______________|-----' / / / / /       |                        |
    |______________|------' / / / /        |                        |
    |______________|-------' / / /         |                        |
    |______________|--------' / /          |                        |
    |______________|---------' /           |                        |
    |______________|----------'            |                        |
    |     ...      |                       |                        |

Naive approach: 4 billion block descriptors all pointing to a fake 4GiB capsule.
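
The numbers behind the naive approach can be checked with a few lines of C. This is only a back-of-the-envelope sketch: the helper names are made up, and the per-capsule size of 0xFFFFFFF8 used in the example is an assumption (the largest 32-bit value that survives rounding up to a hypothetical alignment of 8).

```c
#include <assert.h>
#include <stdint.h>

#define DESCRIPTOR_SIZE 16u	/* two 64-bit fields per block descriptor */

/* Smallest number of repetitions n such that n * per_capsule >= 2^64. */
static uint64_t repeats_to_wrap(uint64_t per_capsule)
{
	assert(per_capsule > 1);
	return UINT64_MAX / per_capsule + 1;
}

/* Value a wrapping 64-bit counter holds after n additions. */
static uint64_t wrapped_total(uint64_t per_capsule, uint64_t n)
{
	return per_capsule * n;	/* unsigned arithmetic wraps modulo 2^64 */
}

/* Memory needed for n descriptors (terminator not included). */
static uint64_t descriptor_bytes(uint64_t n)
{
	assert(n <= UINT64_MAX / DESCRIPTOR_SIZE);
	return n * DESCRIPTOR_SIZE;
}
```

For the assumed size, repeats_to_wrap() yields 2^32 + 9 descriptors, which occupy 64GiB plus 144 bytes, and the wrapped counter ends up at 2^32 - 72.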

What Can We Achieve with This?

Once the calculation of the required space has overflowed, too little space will be allocated in the next step, and the final copy step will run beyond the end of the allocated space. Sticking to the coreboot example, the most likely location chosen for the allocation will be below CBMEM, which also contains the currently running ramstage code. Copying too much means we can overwrite the running program.

On x86, the data is copied one 2MiB page at a time by a rep string instruction in memcpy(). Tests in QEMU have shown that the rep exits once the code is overwritten, but that the last four bytes around the instruction pointer (IP) are still written. Without further analysis or knowledge of the running program, the following simple code pattern, repeated over a 2MiB page, already gives us a 75% chance to continue execution with our own code:

    nop
    nop
    jmp -6

Each nop takes one byte, the jmp instruction takes two. If the IP falls on either of the nops or on the first byte of the jmp, we will jump back 6 bytes as expected into the preceding, already copied 4-byte block. This jumping back continues until the beginning of the capsule, where we can place whatever program we want to execute. The following example was used to confirm code execution. It prints a little message over a UART at I/O port 0x3f8:

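1:
    mov $0x3f8, %dx     # UART data port
    mov  $0x43, %al     # 'C'
    out    %al, %dx
    mov  $0x52, %al     # 'R'
    out    %al, %dx
    mov  $0x41, %al     # 'A'
    out    %al, %dx
    mov  $0x50, %al     # 'P'
    out    %al, %dx
    mov  $0x53, %al     # 'S'
    out    %al, %dx
    mov  $0x55, %al     # 'U'
    out    %al, %dx
    mov  $0x48, %al     # 'H'
    out    %al, %dx
    mov  $0x4c, %al     # 'L'
    out    %al, %dx
    mov  $0x21, %al     # '!'
    out    %al, %dx
    out    %al, %dx     # %al is unchanged, so these two
    out    %al, %dx     # print two more '!'
2:
    hlt                 # stop once the message is out
    jmp 2b              # halt again if something wakes the CPU
    jmp 1b              # entry point: the jump-back chain lands here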

The last jmp should eventually be reached by the chain of jumping back described above. So this small program needs to be aligned such that it ends on a 4-byte boundary.
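
The jump-back chain and the 75% figure can be reproduced with a toy model. The sketch below makes two assumptions that go beyond the text: "jmp -6" is taken to be a short jump with an 8-bit displacement of -6, counted from the end of the instruction (bytes EB FA), and the "CPU" only understands nop and that short jump. fill_sled() and run_sled() are made-up names, and the model ignores which bytes have actually been copied at any point in time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OP_NOP      0x90	/* x86 one-byte nop */
#define OP_JMP_REL8 0xEB	/* x86 short jump, followed by a signed 8-bit offset */

/* Fill len bytes (a multiple of 4) with: nop; nop; jmp rel8 with offset -6. */
static void fill_sled(uint8_t *sled, size_t len)
{
	static const uint8_t pattern[4] = { OP_NOP, OP_NOP, OP_JMP_REL8, 0xFA };

	assert(len % 4 == 0);
	for (size_t i = 0; i < len; i += 4)
		memcpy(sled + i, pattern, sizeof(pattern));
}

/*
 * Toy model of execution inside mem[sled_start..sled_end), starting at ip.
 * Returns the offset at which execution leaves the sled at its lower end,
 * or -1 if it meets any other opcode or runs off the upper end.
 */
static long run_sled(const uint8_t *mem, long sled_start, long sled_end, long ip)
{
	long budget = 2 * (sled_end - sled_start);	/* guards against endless loops */

	while (ip >= sled_start && budget-- > 0) {
		if (ip >= sled_end)
			return -1;
		if (mem[ip] == OP_NOP)
			ip += 1;
		else if (mem[ip] == OP_JMP_REL8 && ip + 1 < sled_end)
			ip += 2 + (int8_t)mem[ip + 1];	/* relative to the next instruction */
		else
			return -1;
	}
	return ip < sled_start ? ip : -1;
}
```

In this model, three of the four possible starting offsets within a block end up two bytes below the start of the sled, which is where the final two-byte jmp of a program ending on a 4-byte boundary would sit. The fourth offset is the second byte of the jmp, which the model treats as a miss.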

To wrap it up, having this code in a first capsule (so it gets copied first) and the 64GiB fake capsule as a second (for the integer overflow) is enough to gain execution in an environment where we shouldn’t. This most likely gives us full system access, including write access to the BIOS flash.

A Lesson from the Past

Why was it that simple? Besides the uncaught integer overflow, there’s a bit of luck with the placement of the program and the allocated space for the capsules. Even with a different placement, though, there is usually something belonging to the running program that can be overwritten.

Of course, there are mechanisms to write-protect program code. There’s a wrinkle, though, with early boot stages: Execution has to continue. It’s not uncommon that exceptions are ignored to make a best effort to boot the system. Assuming attempts to overwrite the program would be caught but execution continues, something else would eventually be overwritten. If it’s the stack, even if not executable, return-oriented programming is an option.

As long as we have to expect any bugs, exploitable or not, a successful boot (availability) and integrity are somewhat conflicting goals. It is, however, possible to avoid this conflict by design: Like coreboot did originally, leave program complexity to the payload or even the OS. Once firmware has reached a stage where user interaction is possible, the user can be notified of an issue, and dealing with processes that can crash becomes more feasible.

We probably won’t find the usual countermeasures that we know from hardening our OSes in early boot stages. The best option seems to be to avoid the problem by design: leave the processing of untrusted input to later firmware stages, and keep data structures simple. Like coreboot used to.