Writing a Windows Loader (Part 2)

Continuation of Writing a Windows Loader (Part 1)

Relocations

So now that we've allocated space and copied over the image, what is the next step? We mentioned relocations briefly earlier in the chapter, but what does it really mean to adjust references? As it turns out, Portable Executable files, unlike ELF files (if you are familiar with Linux's executable file format), are not built to be position independent. They cannot simply be loaded at an arbitrary address; instead, they expect to be loaded at one particular address every time (the "Preferred Address", which is stored in the ImageBase field of the IMAGE_OPTIONAL_HEADER).
Several factors (not the least of which is Address Space Layout Randomization, or ASLR) make this problematic, so in order to support mapping at other locations, a table of "relocations" may be added to the binary. It lists the places in the image that hold an absolute address, so that each one can be corrected by the delta between the Preferred Address and the Actual Address. There is a bit of nuance to how the relocations actually work, as we will see shortly.
The IMAGE_BASE_RELOCATION structure, which is the foundation of our journey to fixing up the binary, is a bit odd looking, to say the least:

typedef struct _IMAGE_BASE_RELOCATION {
    DWORD   VirtualAddress;
    DWORD   SizeOfBlock;
//  WORD    TypeOffset[1];
} IMAGE_BASE_RELOCATION;

The relocation section, which we can reach by utilizing the IMAGE_DIRECTORY_ENTRY_BASERELOC entry of the DataDirectory table (in the IMAGE_OPTIONAL_HEADER), ends up simply being a table of these values. We can start our "relocations" method by finding our way there:

static LoaderStatus relocate_image(unsigned char* image_base,
                                   uint32_t size,
                                   const IMAGE_NT_HEADERS* nt)
{
    LoaderStatus status = LoaderSuccess;
    // Get the entry for relocations
    const IMAGE_DATA_DIRECTORY* data_dir =
                     &nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC];
    // Use the DATA_DIRECTORY rva to get the actual section location
    const IMAGE_BASE_RELOCATION* relocs = (const IMAGE_BASE_RELOCATION*)(
            image_base + data_dir->VirtualAddress);
    // This will be the total size of our relocation block.
    uint32_t  total_size = data_dir->Size;
    // How much we have to modify the relocation by
    uintptr_t delta = ((uintptr_t)image_base) - nt->OptionalHeader.ImageBase;


    // TODO: Implement

    return status;
}
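
One detail worth making concrete is that delta is an unsigned value: if the image lands below its preferred address, the subtraction wraps around, and adding the wrapped value back still yields the right result. The following is a minimal sketch of that arithmetic using 32-bit values. The helper names compute_delta32 and rebase32 are hypothetical and are not part of the loader above.

```c
#include <stdint.h>

/* Hypothetical helper: the delta between where the image was actually
   mapped and where it asked to be mapped. Unsigned subtraction wraps
   modulo 2^32 when actual_base is lower than preferred_base. */
static uint32_t compute_delta32(uint32_t actual_base, uint32_t preferred_base)
{
    return actual_base - preferred_base;
}

/* Hypothetical helper: correct an absolute address that the linker
   computed under the assumption that the image sits at preferred_base. */
static uint32_t rebase32(uint32_t address, uint32_t delta)
{
    return address + delta;
}
```

For example, an image that prefers 0x10000000 but is mapped at 0x00400000 has a wrapped delta of 0xF0400000, and rebase32(0x10001234, 0xF0400000) comes out to 0x00401234, which is the same offset (0x1234) from the new base.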

At this point, we are at the start of our "relocations" section in the binary, which turns out to simply be a big table of relocation blocks. Each one is topped by a header (the IMAGE_BASE_RELOCATION structure from earlier), followed by a big table of 16 bit entries containing the offset (from the virtual address in the header), and some information about how to apply the relocation.
If we want to walk through the table, we might start with a loop, something like this:

static LoaderStatus relocate_image(unsigned char* image_base,
                                   uint32_t size,
                                   const IMAGE_NT_HEADERS* nt)
{
    LoaderStatus status = LoaderSuccess;
    const IMAGE_DATA_DIRECTORY* data_dir =
                     &nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC];
    const IMAGE_BASE_RELOCATION* relocs = (const IMAGE_BASE_RELOCATION*)(
            image_base + data_dir->VirtualAddress);
    uint32_t  total_size = data_dir->Size;
    uint32_t  current_size = 0;
    uintptr_t delta = ((uintptr_t)image_base) - nt->OptionalHeader.ImageBase;

    // The size of the section (from data_dir->Size, above) is the best way
    // for us to know we've reached the end of the section
    for(; current_size < total_size;
        relocs = (const IMAGE_BASE_RELOCATION*)(((unsigned char*)relocs) +
        relocs->SizeOfBlock)) {
        uint32_t current_block = relocs->SizeOfBlock;
        // We need to add the current block size to the amount
        // we've advanced forward, so we can make sure we haven't
        // walked past the end of the section!
        current_size += relocs->SizeOfBlock;

        // We need to subtract the size of the header from
        // the number of entries in the table.
        current_block -= sizeof(IMAGE_BASE_RELOCATION);

        // Each entry is 16 bits; so we need to divide the size
        // (which is in bytes) by this in order to figure out how
        // many entries are in the table.
        current_block /= sizeof(uint16_t);

        // TODO: Finish

    }


    return status;
}

Walking through the comments in the code above, we first need to get the size of the section where the relocations are stored. Since it is essentially a big table of entries, and we have to compute the address of the next entry based on the size of the current one (which may vary), we need to keep track of how far we've gone, to ensure that we stay within the confines of the containing section. Next, in order to figure out how many entries are in the block we are currently processing, we need to do two things:

  1. Subtract the size of the header from the block, since the SizeOfBlock field includes both the header and the entries in the block.
  2. Divide the remaining size by the size of each entry - as indicated by the commented-out array at the bottom of the structure definition we saw before, this should be something like sizeof(WORD).

To illustrate this, we can see a full block in the highlighted section below:

Figure 1 - Relocation Block Hex Dump

The first four bytes (0x00, 0x20, 0x07, 0x00) make up the VirtualAddress field, which is 0x72000 when converted from little endian. The second four bytes (0x18, 0x00, 0x00, 0x00) make up the SizeOfBlock field, which comes out to 0x18 (or 24, in decimal) bytes. The remaining 16 highlighted bytes represent a block of 8 relocations. We can validate that the math we used earlier is correct: if we subtract the header size from 24 (from the SizeOfBlock field), we get 24-8 = 16 bytes. Dividing this by sizeof(WORD) yields a table of 8 entries, which we can see below:

Figure 2 - Relocation Block 010 Template
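
The header arithmetic above can be checked with a small, portable sketch. It does not use the Windows headers; read_le32 and reloc_entry_count are hypothetical helper names introduced only for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper: read a little-endian 32-bit value from raw bytes. */
static uint32_t read_le32(const unsigned char* p)
{
    return (uint32_t)p[0] |
           ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) |
           ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: number of 16-bit entries in a block, given its
   SizeOfBlock. Returns 0 if the block cannot even hold the 8-byte header
   (two DWORDs: VirtualAddress and SizeOfBlock). */
static uint32_t reloc_entry_count(uint32_t size_of_block)
{
    const uint32_t header_size = 8;

    if (size_of_block < header_size) {
        return 0;
    }
    return (size_of_block - header_size) / (uint32_t)sizeof(uint16_t);
}
```

Feeding it the eight header bytes from the hex dump (00 20 07 00 18 00 00 00) gives a VirtualAddress of 0x72000, a SizeOfBlock of 0x18, and an entry count of 8.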

So now that we have this information, how do we actually set about parsing each relocation in the set? As it turns out, each entry in the table is formatted so that the low 12 bits contain the offset from the virtual address we need to adjust, and the high 4 bits contain information about what type of adjustment we need to make. With a little bit of shifting and bitmasking, we can easily extract the information we are interested in as follows:

uint16_t reloc_type = entry >> 12;
uint16_t reloc_offset = entry & 0xfff;
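
As a worked example of the split, the sketch below wraps the two expressions in a helper and returns both halves. The RelocEntry type, the decode_reloc_entry name, and the sample value 0xA008 are made up for illustration; the type numbers in the comments are the ones winnt.h assigns to IMAGE_REL_BASED_HIGHLOW (3) and IMAGE_REL_BASED_DIR64 (10).

```c
#include <stdint.h>

/* Hypothetical container for the two halves of a relocation entry. */
typedef struct {
    uint16_t type;   /* high 4 bits: how to apply the relocation   */
    uint16_t offset; /* low 12 bits: offset from the block's RVA   */
} RelocEntry;

/* Hypothetical helper: split a raw 16-bit entry into type and offset. */
static RelocEntry decode_reloc_entry(uint16_t entry)
{
    RelocEntry out;

    out.type = (uint16_t)(entry >> 12);
    out.offset = (uint16_t)(entry & 0x0fff);
    return out;
}
```

With this, 0xA008 decodes to type 10 (IMAGE_REL_BASED_DIR64) at offset 0x008, and 0x3123 decodes to type 3 (IMAGE_REL_BASED_HIGHLOW) at offset 0x123.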

At this point, we need to use the type to figure out how big of an adjustment to make. What does this actually mean? Essentially, the relocation type tells us how many bytes at the target location to update, and which part of the delta to add to them. We will not cover a totally exhaustive list of them (as many apply to architectures that are beyond the scope of this book, such as MIPS and Itanium). Our non-exhaustive list is as follows (if the descriptions don't totally make sense at this point, that is perfectly ok, an illustrative example of these offset types in action will follow!):

  1. IMAGE_REL_BASED_HIGH - We need to make a 2-byte update at the given offset, adding the "high" 16 bits of the delta.
  2. IMAGE_REL_BASED_LOW - We need to make a 2-byte update at the given offset, adding the "low" 16 bits of the delta.
  3. IMAGE_REL_BASED_HIGHLOW - We need to apply a 4-byte update at the given offset, adding all 32 bits of the delta.
  4. IMAGE_REL_BASED_DIR64 - We need to make an 8-byte (64 bit) update at the given offset, adding the full delta.

In practice, we might take this information, and process the relocations as follows:

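// The 12-bit offset is relative to the VirtualAddress in the block header,
// which is itself an RVA from the start of the image
unsigned char* relocate_location = image_base + relocs->VirtualAddress + reloc_offset;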

switch(reloc_type) {
case IMAGE_REL_BASED_LOW:
    *((uint16_t*)relocate_location) += LOWORD(delta);
    break;
case IMAGE_REL_BASED_HIGH:
    *((uint16_t*)relocate_location) += HIWORD(delta);
    break;
case IMAGE_REL_BASED_HIGHLOW:
    *((uint32_t*)relocate_location) += (int32_t)delta;
    break;
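case IMAGE_REL_BASED_DIR64:
    *((uint64_t*)relocate_location) += delta;
    break;
default:
    // Other types (such as the IMAGE_REL_BASED_ABSOLUTE padding
    // entries) are left untouched here
    break;
}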
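
To see the four types in action without needing a real image, the following sketch applies them to a plain byte buffer. It is an illustration under stated assumptions, not the loader's actual code: the REL_* names are local stand-ins that mirror the numeric values of the IMAGE_REL_BASED_* constants in winnt.h, apply_reloc is a hypothetical helper, and memcpy is used in place of pointer casts so that unaligned targets are handled safely.

```c
#include <stdint.h>
#include <string.h>

/* Local stand-ins mirroring the IMAGE_REL_BASED_* values from winnt.h,
   so the sketch builds without the Windows headers. */
enum {
    REL_ABSOLUTE = 0,  /* padding entry, nothing to patch */
    REL_HIGH     = 1,
    REL_LOW      = 2,
    REL_HIGHLOW  = 3,
    REL_DIR64    = 10
};

/* Hypothetical helper: patch the bytes at target according to type.
   Returns 0 on success, -1 for a type this sketch does not handle. */
static int apply_reloc(unsigned char* target, unsigned type, uint64_t delta)
{
    uint16_t v16;
    uint32_t v32;
    uint64_t v64;

    switch (type) {
    case REL_ABSOLUTE:
        return 0;
    case REL_HIGH:    /* 2 bytes, add bits 16..31 of the delta */
        memcpy(&v16, target, sizeof(v16));
        v16 = (uint16_t)(v16 + (uint16_t)(delta >> 16));
        memcpy(target, &v16, sizeof(v16));
        return 0;
    case REL_LOW:     /* 2 bytes, add bits 0..15 of the delta */
        memcpy(&v16, target, sizeof(v16));
        v16 = (uint16_t)(v16 + (uint16_t)delta);
        memcpy(target, &v16, sizeof(v16));
        return 0;
    case REL_HIGHLOW: /* 4 bytes, add the low 32 bits of the delta */
        memcpy(&v32, target, sizeof(v32));
        v32 += (uint32_t)delta;
        memcpy(target, &v32, sizeof(v32));
        return 0;
    case REL_DIR64:   /* 8 bytes, add the full delta */
        memcpy(&v64, target, sizeof(v64));
        v64 += delta;
        memcpy(target, &v64, sizeof(v64));
        return 0;
    default:
        return -1;
    }
}
```

For instance, with a delta of 0x00120000, a 4-byte slot holding 0x00401000 becomes 0x00521000 under REL_HIGHLOW, while a 2-byte slot holding 0x0040 becomes 0x0052 under REL_HIGH. Returning an error for unknown types is a design choice of this sketch; a loader could equally decide to ignore them.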

Combining this with our previous relocation method might give us something that looks a bit like the following:

static LoaderStatus relocate_image(unsigned char* image_base,
                                   uint32_t size,
                                   const IMAGE_NT_HEADERS* nt)
{
    LoaderStatus status = LoaderSuccess;
    const IMAGE_DATA_DIRECTORY* data_dir =
                     &nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC];
    const IMAGE_BASE_RELOCATION* relocs = (const IMAGE_BASE_RELOCATION*)(
            image_base + data_dir->VirtualAddress);
    uint32_t  total_size = data_dir->Size;
    uint32_t  current_size = 0;
    uintptr_t delta = ((uintptr_t)image_base) - nt->OptionalHeader.ImageBase;

    for(; current_size < total_size;
        relocs = (const IMAGE_BASE_RELOCATION*)(((unsigned char*)relocs) +
        relocs->SizeOfBlock)) {
        uint32_t current_block = relocs->SizeOfBlock;
        uint32_t i = 0;
        // We will skip the header, and start at
        // our table of relocation entries
        const uint16_t* entry = (const uint16_t*)(
            ((const unsigned char*)relocs) + sizeof(*relocs)
        );

        current_size += relocs->SizeOfBlock;

        current_block -= sizeof(IMAGE_BASE_RELOCATION);
        current_block /= sizeof(uint16_t);

        for(i = 0; i < current_block; i++) {
            uint16_t reloc_type = entry[i] >> 12;
            uint16_t reloc_offset = entry[i] & 0xfff;
            // The offset is relative to this block's VirtualAddress
            unsigned char* relocate_location = image_base +
                relocs->VirtualAddress + reloc_offset;


            switch(reloc_type) {
            case IMAGE_REL_BASED_LOW:
                *((uint16_t*)relocate_location) += LOWORD(delta);
                break;
            case IMAGE_REL_BASED_HIGH:
                *((uint16_t*)relocate_location) += HIWORD(delta);
                break;
            case IMAGE_REL_BASED_HIGHLOW:
                *((uint32_t*)relocate_location) += (int32_t)delta;
                break;
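            case IMAGE_REL_BASED_DIR64:
                *((uint64_t*)relocate_location) += delta;
                break;
            default:
                // Other types (such as the IMAGE_REL_BASED_ABSOLUTE
                // padding entries) are left untouched here
                break;
            }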
        }

    }


    return status;
}

This should provide us with enough to process relocations for most binaries of the architectures we are focused on (sans additional error checking to validate that the relocations provided are valid, of course).
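
As one possible shape for that missing error checking, the sketch below walks a relocation directory held in a plain byte buffer and rejects blocks whose SizeOfBlock is smaller than the header or runs past the end of the directory (a zero size would otherwise loop forever). It is a portable illustration only: count_reloc_entries and reloc_read_le32 are hypothetical names, and the function merely counts entries rather than applying them.

```c
#include <stdint.h>

#define RELOC_HEADER_SIZE 8u /* VirtualAddress + SizeOfBlock */

/* Hypothetical helper: read a little-endian 32-bit value from raw bytes. */
static uint32_t reloc_read_le32(const unsigned char* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical helper: returns the total number of 16-bit entries in the
   directory, or -1 if any block header is malformed. Trailing bytes too
   short to hold another header are ignored. */
static long count_reloc_entries(const unsigned char* dir, uint32_t dir_size)
{
    uint32_t consumed = 0;
    long total = 0;

    while (dir_size - consumed >= RELOC_HEADER_SIZE) {
        /* SizeOfBlock is the second DWORD of the block header */
        uint32_t block_size = reloc_read_le32(dir + consumed + 4);

        /* Too small to hold a header, or larger than what is left */
        if (block_size < RELOC_HEADER_SIZE ||
            block_size > dir_size - consumed) {
            return -1;
        }

        total += (long)((block_size - RELOC_HEADER_SIZE) / 2u);
        consumed += block_size;
    }

    return total;
}
```

The same two comparisons could be placed at the top of the loop body in relocate_image, returning an error status instead of -1.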

Imports

Overview

Once all of this is accomplished, the next step is to handle imports. Imports are functions that are exported by an external library that the binary we are loading depends on in order to function. In order to resolve these, we must parse the import descriptors of the binary, locate the required DLLs (if possible), and populate the Import Address Table (IAT) with the addresses to reach the required function. We can utilize the dumpbin tool, which ships with Microsoft Visual Studio (and is usable via the Developer Command Prompt) to view imports as follows:

Figure 3 - KernelBase.dll Imports

Locating Imports

As with the relocations, getting access to the import descriptors starts with the DataDirectory of the OptionalHeader:

IMAGE_IMPORT_DESCRIPTOR* imp = (IMAGE_IMPORT_DESCRIPTOR*)(
    image_base +
    nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress
);

uint32_t imp_size =
    nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].Size;

and we start with a pointer to an IMAGE_IMPORT_DESCRIPTOR, which looks something like this:

typedef struct _IMAGE_IMPORT_DESCRIPTOR {
    union {
        DWORD   Characteristics;
        DWORD   OriginalFirstThunk;
    } DUMMYUNIONNAME;
    DWORD   TimeDateStamp;
    DWORD   ForwarderChain;
    DWORD   Name;
    DWORD   FirstThunk;
} IMAGE_IMPORT_DESCRIPTOR;

Out of this structure, we care most (for now) about the union at the top, which will be NULL in the terminating entry at the end of the imports list, and the bottom two fields: the Name and FirstThunk. While the other two fields have some uses under certain circumstances (the timestamp for bound imports, and the ForwarderChain in the event of forwards), they are a bit beyond the scope of the chapter.
As it turns out, each of these import descriptors represents one library that the PE we are loading depends upon. Within each, we have the RVAs for two parallel tables: the "Import Name Table" (INT), referenced by the poorly-named OriginalFirstThunk member of the union at the top, and the previously mentioned IAT, referenced by the FirstThunk field. The Name field holds the RVA of a string naming the library we need to load in order to satisfy the dependencies referenced in the INT.
Thus, our first step in beginning to parse the descriptors may somewhat resemble the following:

static LoaderStatus handle_imports(unsigned char* image_base,
                                   uint32_t size,
                                   PIMAGE_NT_HEADERS nt)
{
    LoaderStatus status = LoaderSuccess;
    PIMAGE_IMPORT_DESCRIPTOR imp =  (PIMAGE_IMPORT_DESCRIPTOR)(
        image_base +
        nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress
    );
    uint32_t imp_size =
        nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].Size;

    // We will continue processing until we hit the terminating NULL
    for(; imp->Characteristics; imp++) {
        const char* library_name = (const char*)(image_base + imp->Name);
        HMODULE     hm = NULL;

        // We will try to load the library, if it fails, we will
        // exit, as we can't finish loading the library.
        if(NULL == (hm = LoadLibraryA(library_name))) {
            return LoaderNotFound;
        }


        /* TODO: Handle function imports! */
    }


    return status;
}

While this is a bit light on error checking (we should probably be checking to ensure that the binary is not malformed, so we will not accidentally walk past the end of the section), it gives us what we need to access the import data for each required library in sequence.
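
One way that missing check could look is sketched below. It is only an illustration: ImportDescriptor is a portable stand-in for IMAGE_IMPORT_DESCRIPTOR (five 32-bit fields, with the union flattened to OriginalFirstThunk), count_import_descriptors is a hypothetical helper, and the sketch assumes that the directory's Size field covers the terminating entry, which may not hold for every binary.

```c
#include <stdint.h>

/* Portable stand-in for IMAGE_IMPORT_DESCRIPTOR (20 bytes). */
typedef struct {
    uint32_t OriginalFirstThunk; /* RVA of the INT (union with Characteristics) */
    uint32_t TimeDateStamp;
    uint32_t ForwarderChain;
    uint32_t Name;               /* RVA of the library name string */
    uint32_t FirstThunk;         /* RVA of the IAT */
} ImportDescriptor;

/* Hypothetical helper: returns the number of descriptors that precede the
   terminating entry, or -1 if no terminator is found within dir_size bytes.
   Assumption: dir_size includes the terminating entry. */
static long count_import_descriptors(const ImportDescriptor* imp,
                                     uint32_t dir_size)
{
    uint32_t max_entries = dir_size / (uint32_t)sizeof(ImportDescriptor);
    uint32_t i;

    for (i = 0; i < max_entries; i++) {
        if (imp[i].OriginalFirstThunk == 0 &&
            imp[i].Name == 0 &&
            imp[i].FirstThunk == 0) {
            return (long)i;
        }
    }

    /* Ran out of directory before seeing a terminator: treat as malformed */
    return -1;
}
```

A loader could call something like this before the main loop and bail out on -1, rather than trusting the terminator to be present.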
Now, in order to process the functions, we will start with the IMAGE_THUNK_DATA structure, below:

typedef struct _IMAGE_THUNK_DATA {
    union {
        ULONG_PTR ForwarderString;
        ULONG_PTR Function;
        ULONG_PTR Ordinal;
        ULONG_PTR AddressOfData;
    } u1;
} IMAGE_THUNK_DATA;

Note that above, we have generalized this structure a bit - there are two separate versions defined in winnt.h: the IMAGE_THUNK_DATA32 version, which has 32 bit unsigned values for each field in the union, and IMAGE_THUNK_DATA64, which is 64 bits.
In any case, at this point, this structure is used for both arrays: the INT, and the IAT. The INT, just like the import descriptor list, is NULL-terminated, and will tell us how to handle the corresponding IAT entry (since we will need to process the elements at each offset of the table together). It may indicate, for example, that the function we need to find in our newly-loaded library is not exported by name, but rather by ordinal - meaning that we simply have a single numeric value to reference it by.
Initially, each entry for the IAT will contain an RVA to an IMAGE_IMPORT_BY_NAME structure (if it is being imported by name), and will need to be updated to point to the actual function it references. This sounds a bit complicated, but we can illustrate it as follows:

static LoaderStatus handle_imports(unsigned char* image_base,
                                   uint32_t size,
                                   PIMAGE_NT_HEADERS nt)
{
    LoaderStatus status = LoaderSuccess;
    PIMAGE_IMPORT_DESCRIPTOR imp =  (PIMAGE_IMPORT_DESCRIPTOR)(
        image_base +
        nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress
    );
    uint32_t imp_size =
        nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].Size;

    for(; imp->Characteristics; imp++) {
        const char* library_name = (const char*)(image_base + imp->Name);
        HMODULE     hm = NULL;
        char*       func_name = NULL; // If we need one!
        // This will be the INT
        PIMAGE_THUNK_DATA orig_first_thunk = NULL;
        // This will be the IAT
        PIMAGE_THUNK_DATA first_thunk = NULL;
        uint32_t          i = 0;

        if(NULL == (hm = LoadLibraryA(library_name))) {
            return LoaderNotFound;
        }


        // We will now find the start of our INT and IAT
        orig_first_thunk = (PIMAGE_THUNK_DATA)(image_base + imp->OriginalFirstThunk);
        first_thunk = (PIMAGE_THUNK_DATA)(image_base + imp->FirstThunk);
        for(; orig_first_thunk[i].u1.AddressOfData; i++) {
          // TODO: Update first_thunk[i].u1.Function
          // to point to the export
        }

    }

    return status;
}

In order to populate first_thunk->u1.Function, we now must find the export from the library we found with LoadLibraryA that corresponds with the import we need in order to make our new PE function. We can use the canonical Windows-provided function, GetProcAddress, to locate this method, which takes in the library we loaded (hm, above), and either the name (as a string) or ordinal of the method we wish to attempt to locate. We will describe both momentarily, but first, we should take a step back and explain a little bit about how exports actually work.

Exports: A Brief Introduction

We will deal with two primary aspects of exports: names and ordinals. First, it is important to note that while every export has an ordinal, not every export has a name. Additionally, unless specified otherwise, export names are often subject to name mangling based on factors such as the calling convention and whether or not a C++ compiler is being used (in which case, the mangled name will encode things such as namespace and class membership as well). Occasionally, you will also see exports that are forwarded to another DLL. This means that the function we are looking for does not actually reside in the current library; the export simply names a function in a different library. In that case, we will need to load the referenced DLL, and attempt to locate the function named as the target within its exports.
This is not something we need to be extremely concerned with for the purposes of building a loader (as we are mostly relying on GetProcAddress to handle this for us), but it is still worth at least being aware of.
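
For the curious, a forwarded export is recorded as a string of the form "MODULE.Function" (for example, "NTDLL.RtlAllocateHeap"). The sketch below shows how such a string could be split into its two halves; split_forwarder is a hypothetical helper, it splits at the last '.' on the assumption that the function part contains no dots, and it makes no attempt to handle the "MODULE.#ordinal" form specially.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split "MODULE.Function" into a module name (copied
   into module, NUL-terminated) and a pointer to the function part.
   Returns 0 on success, -1 if the string is malformed or module is too
   small to hold the module name. */
static int split_forwarder(const char* forwarder,
                           char* module, size_t module_cap,
                           const char** function)
{
    const char* dot = strrchr(forwarder, '.');
    size_t module_len;

    /* Need a dot with something on both sides of it */
    if (dot == NULL || dot == forwarder || dot[1] == '\0') {
        return -1;
    }

    module_len = (size_t)(dot - forwarder);
    if (module_len + 1 > module_cap) {
        return -1;
    }

    memcpy(module, forwarder, module_len);
    module[module_len] = '\0';
    *function = dot + 1;
    return 0;
}
```

Given "NTDLL.RtlAllocateHeap", this yields the module "NTDLL" and the function "RtlAllocateHeap", which is the pair a loader would hand to LoadLibraryA and GetProcAddress if it were resolving forwards by hand.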
As with imports, we can view exports with the dumpbin utility:

Figure 4 - KernelBase exports

A Note About GetProcAddress

Since we will be talking a bit about ordinals, an important note should first be made about using GetProcAddress. The method has the following signature:

FARPROC GetProcAddress(
  HMODULE hModule,
  LPCSTR  lpProcName
);

It seems pretty clear that if we get back an HMODULE from LoadLibrary, that probably goes in as the first argument, and as the second is a const char*, it seems fairly obvious how we would provide a name. So how, then, do we find an export by ordinal using GetProcAddress?
As it turns out, ordinal values are only allowed to be two bytes in width, and thus (as strange as it sounds), if we simply cast our ordinal value to a const char* and pass it in (in order to keep the compiler happy), GetProcAddress will treat the provided value as an ordinal rather than a name. We can validate this by looking at the GetProcAddress implementation:

Figure 5 - GetProcAddress disassembly

Looking at the two lines in the block at the top with comments (lines with a ; in them), we can see that the very first thing GetProcAddress does after copying the second parameter into the EDI register is to check whether it is too big to be an ordinal value, and if it is, then it goes down the left branch, and treats it as a string. Otherwise, it goes down the right branch, and treats the provided value as an ordinal.
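
The check described above can be mimicked in portable C. This is a sketch of the documented rule (an ordinal must sit in the low-order word, with everything above it zero), not a reproduction of the actual GetProcAddress code; looks_like_ordinal and ordinal_as_proc_name are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical helper: a "name" whose pointer value fits in 16 bits is
   treated as an ordinal rather than as the address of a string. */
static int looks_like_ordinal(const char* proc_name)
{
    return ((uintptr_t)proc_name >> 16) == 0;
}

/* Hypothetical helper: package an ordinal so it can be passed where a
   const char* is expected. Going through uintptr_t avoids compiler
   warnings about casting a small integer to a pointer. */
static const char* ordinal_as_proc_name(uint16_t ordinal)
{
    return (const char*)(uintptr_t)ordinal;
}
```

This works in practice because the lowest 64 KB of the address space is not used for ordinary allocations on Windows, so a genuine string pointer will not have a value that small.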

Getting Exports from Imports

So now that we've covered a bit about them, we must first determine if the INT entry (the OriginalFirstThunk) that we are looking at is a name, or an ordinal. In order to do that, we need to check to see if the IMAGE_ORDINAL_FLAG is set:

if(orig_first_thunk->u1.Ordinal & IMAGE_ORDINAL_FLAG) {
   // Import is by ordinal
} else {
   // Import is by name
}

If the import is by ordinal, then our work is already nearly done; we can extract the ordinal value from the INT entry using the IMAGE_ORDINAL macro, which, as seen below, simply keeps the low 16 bits and masks off everything above them (to honor the 2-byte size restriction):

#define IMAGE_ORDINAL(Ordinal) (Ordinal & 0xffff)

and then provide the returned value to GetProcAddress:

uint16_t ord = (uint16_t)IMAGE_ORDINAL(orig_first_thunk->u1.Ordinal);
// Casting through uintptr_t avoids an integer-to-pointer size warning
first_thunk->u1.Function = (ULONG_PTR)GetProcAddress(hm,
                                                    (const char*)(uintptr_t)ord);
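
To make the flag test and the ordinal mask concrete, here is a small portable sketch operating on raw thunk values. The constant and function names are local to this example; the flag values match those winnt.h uses for IMAGE_ORDINAL_FLAG64 and IMAGE_ORDINAL_FLAG32 (the top bit of a 64-bit or 32-bit thunk, respectively).

```c
#include <stdint.h>

/* Local stand-ins for IMAGE_ORDINAL_FLAG64 / IMAGE_ORDINAL_FLAG32. */
#define THUNK_ORDINAL_FLAG64 0x8000000000000000ull
#define THUNK_ORDINAL_FLAG32 0x80000000u

/* Hypothetical helper: is this 64-bit INT entry an import by ordinal? */
static int thunk64_is_ordinal(uint64_t thunk)
{
    return (thunk & THUNK_ORDINAL_FLAG64) != 0;
}

/* Hypothetical helper: is this 32-bit INT entry an import by ordinal? */
static int thunk32_is_ordinal(uint32_t thunk)
{
    return (thunk & THUNK_ORDINAL_FLAG32) != 0;
}

/* Hypothetical helper: same masking as the IMAGE_ORDINAL macro, keeping
   only the low 16 bits of the thunk value. */
static uint16_t thunk_ordinal(uint64_t thunk)
{
    return (uint16_t)(thunk & 0xffff);
}
```

A 64-bit thunk of 0x8000000000000010 therefore has the flag set and carries ordinal 16, while 0x0000000000012345 has the flag clear and would be read as the RVA of an IMAGE_IMPORT_BY_NAME structure instead.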

If, on the other hand, we need to locate the function by name (which will most often be the case), then, as mentioned before, the IAT entry will initially provide an RVA to an IMAGE_IMPORT_BY_NAME structure, which we will use to get the appropriate string value to pass to GetProcAddress. The IMAGE_IMPORT_BY_NAME structure looks something like the following:

typedef struct _IMAGE_IMPORT_BY_NAME {
    WORD    Hint;
    BYTE    Name[1];
} IMAGE_IMPORT_BY_NAME, *PIMAGE_IMPORT_BY_NAME;

and thus, we can do the following:

PIMAGE_IMPORT_BY_NAME imp_by_name = NULL;

imp_by_name = (PIMAGE_IMPORT_BY_NAME)(image_base + first_thunk->u1.AddressOfData);
first_thunk->u1.Function = (ULONG_PTR)GetProcAddress(hm,
                                                    (const char*)imp_by_name->Name);

Once we're done with that, we should check to ensure that we got a valid value back in first_thunk->u1.Function (as finding the requested function may have failed), and we can update our loop from earlier as follows:

for(; orig_first_thunk[i].u1.AddressOfData; i++) {
    PIMAGE_IMPORT_BY_NAME by_name = NULL;
    const char*           search_value = NULL;

    if(orig_first_thunk[i].u1.Ordinal & IMAGE_ORDINAL_FLAG) {
        search_value = (const char*)IMAGE_ORDINAL(orig_first_thunk[i].u1.Ordinal);
    } else {
        // Index the IAT with i so it stays in step with the INT
        by_name = (PIMAGE_IMPORT_BY_NAME)(image_base + first_thunk[i].u1.AddressOfData);
        search_value = (const char*)by_name->Name;
    }

    if(0 == (first_thunk[i].u1.Function = (ULONG_PTR)GetProcAddress(hm, search_value))) {
        FreeLibrary(hm);
        return LoaderNotFound;
    }
}

And with that, we have now resolved our imports, or failed gracefully (if they could not be resolved).
