Writing a Windows Loader (Part 1)

A loader is code that prepares other code for execution. Loaders take data corresponding to a program or library and prepare it to be read, modified, and/or executed. This preparation typically involves parsing a file that contains the code to be run, metadata about that code, and other relevant information, such as the external services it might need from other parts of the operating system. The loader also resolves external dependencies (other code that the module being prepared relies upon), sets memory protections appropriately, and, if the code is not position independent, updates references.

Nearly all modern consumer-facing operating systems contain loaders. Loading occurs during the initialization of a process (when the primary application image is loaded), and also may occur in an ad-hoc fashion throughout program execution, as dynamic libraries (to include .dlls, .dylibs, and .sos, for example) are loaded and unloaded.

In the context of things like threat emulation, there is a strong desire to model trends present within the modern malware ecosystem - including the ability to operate in memory only. This presents a bit of a challenge: in Windows, for example, the operating system's built-in loader only accepts binary files on disk. What we desire is a reflective loader that will perform some of the same kinds of preparations that the native operating system's loader would handle, but without the requirement that the loaded binary reside on disk (note that some operating systems, such as macOS, have facilities for executing directly in-memory natively).

How these tools actually run reflectively can vary quite a bit, but may include hijacking an existing process through process hollowing or process injection, or may simply include an attacker-provided benign hosting process. Regardless of which vector we choose, the common requirement is the ability to load our binaries without touching the disk.

To begin our exploration of loaders, let's take a look at the built-in Windows
loader, the LoadLibraryW function.

LoadLibraryW

The LoadLibraryW function, available in the Windows.h header and implemented
in the Kernel32 library, has the function signature as indicated below:

// The function signature of LoadLibraryW.
HMODULE LoadLibraryW(
  LPCWSTR lpLibFileName
);

To load a library, you provide the name of the module to be loaded - lpLibFileName. This can be a full file path, in which case Windows will look exclusively in that location.
Alternatively, we can provide just a module name, and the system will search for it in a variety of locations.

If the module cannot be found, LoadLibraryW returns NULL; otherwise, it returns an opaque handle to the loaded module (which also ends up being a pointer to the beginning of the module in memory).

When Windows loads the module, it maps it into the process that called LoadLibraryW, performs some preparations, and invokes the module's entry point. The entry point is not actually the user-provided function you would normally consider (e.g., main, DllMain, WinMain, etc.), but rather some code included by the compiler (and C runtime) that is responsible for various setup tasks, such as global variable allocation and initialization and calling constructors for C++ global objects. This code eventually executes the user-defined entry point (as listed above).

As mentioned above, notice this method requires that you provide a file path for loading. Effectively, this means that in order to use the built-in Windows loading facilities, we will need to first copy our binary to disk, then provide a path as a parameter to LoadLibraryW in order to get our module into memory and begin execution. This, by definition, is not reflective loading, as we must have a file on-disk in order to run.

If we wanted to build a reflective loader, we might start with a function that has a signature more like the function reflective_loadlibrary shown below.


typedef enum LoaderStatus_ {
  LoaderSuccess,
  LoaderInvalid,
  LoaderBadFormat,
  LoaderNoMem,
  LoaderNotFound,
  LoaderBadArch,
  LoaderFailed,
} LoaderStatus;


typedef struct loader_ctx_ {
  void*  p;
  size_t size;
} loader_ctx;


LoaderStatus reflective_loadlibrary(loader_ctx* ctx,
                                    const void* buffer,
                                    uint32_t size); 

The reflective_loadlibrary function accepts three arguments. First, it accepts a pointer to a loader_ctx, a structure we will use to manage contextual information that the loader will need as we implement it. Next, it accepts a buffer and its corresponding size, which represent the module that we wish to load. The function returns a LoaderStatus, which is a type we've created to report success or failure as loader development progresses.

By implementing our own reflective_loadlibrary, which handles a subset of the operations that LoadLibraryW would normally accomplish, we can keep our module entirely in memory.

But what is the first step to implementing our own loader? Answering this question requires some knowledge of the Portable Executable or PE format.

The Portable Executable Format

The Portable Executable (PE) format is a file format for executables, DLLs, drivers, and some more exotic kinds of files on 32- and 64-bit versions of Windows. It contains all the information Windows needs to load and execute the specified module. Since we're re-implementing some of the Windows loader's functionality, we'll also need to parse PE files during the course of loading and executing.

As there are many excellent references on PE files (including a somewhat self-serving one here), we will not describe them exhaustively in this post; instead we will cover some basic terminology, and talk a little bit about the parts of the file format that we really care about to simply get our module up and running.

But before we can really talk about PE-specific structures, we first need to talk about some special addressing terms that are fundamental to understanding how the actual structures work: relative and virtual addresses.

Virtual Addressing

A physical address is a location in main memory. A virtual address (VA), in contrast, is an abstracted memory address that the operating system makes available to a process. The operating system's paging mechanisms map each virtual address to the physical address it represents, so a virtual address is not particularly useful outside the context of the process it belongs to.

As with all user-mode processes (and many kernel ones, for that matter), when parsing PE files, we almost never deal in physical addresses. Confusingly, there are one or two references to "Physical" addresses within the PE file itself, but in most cases, those actually refer to file offsets, rather than "real" physical memory.

As with most executable file formats, PE files make extensive use of relative virtual addresses, or RVAs, which consist of an offset relative to the beginning of the PE file in memory. This "beginning", the start of the PE file in memory, is called the base address. In fact, most of the addresses in a PE will be RVAs - again meaning that they will be an offset relative to the base address.

As an example, suppose we have a DLL in memory beginning at virtual address 0x10000 (its base address). Inside of this DLL, we have a buffer beginning at 0x11100. The RVA of this buffer is its address minus the base address, or 0x11100 - 0x10000 = 0x1100.

This conversion works both ways, so given the RVA 0x1100, we can compute the buffer's VA by adding the PE's base address: 0x1100 + 0x10000 = 0x11100.
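The arithmetic above can be captured in two small helpers. This is only a minimal sketch: the names rva_to_va and va_to_rva are illustrative and are not part of any Windows API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers (not Windows APIs): convert between an RVA */
/* and a VA, given the base address of the image in memory.        */
static uintptr_t rva_to_va(uintptr_t base, uint32_t rva)
{
    return base + rva;
}

static uint32_t va_to_rva(uintptr_t base, uintptr_t va)
{
    /* A VA below the base cannot belong to this image. */
    assert(va >= base);
    return (uint32_t)(va - base);
}
```

With the values from the example, va_to_rva(0x10000, 0x11100) yields 0x1100, and rva_to_va(0x10000, 0x1100) yields 0x11100.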

Note: RVAs don't necessarily correspond to file offsets; with RVAs, we are dealing with addresses that are relative to the PE base address in memory. This distinction is important when parsing some regions of the PE, where the amount of space a section takes on-disk may differ from the amount of space it takes in memory.

Now that you've got a firm footing in how PEs address memory, let's dive into the first structure in the PE file, the DOS header.

DOS Header

The DOS header is a legacy structure that begins all PE files. It's so called because it actually represents a small MS-DOS executable. This was needed in the early days of Windows so that attempting to run a PE file in MS-DOS would print a friendly error message to the user indicating that they'd need to use Windows to execute the file.

The struct below represents the DOS header's layout.


typedef struct _IMAGE_DOS_HEADER
{
     WORD e_magic;
     WORD e_cblp;
     WORD e_cp;
     WORD e_crlc;
     WORD e_cparhdr;
     WORD e_minalloc;
     WORD e_maxalloc;
     WORD e_ss;
     WORD e_sp;
     WORD e_csum;
     WORD e_ip;
     WORD e_cs;
     WORD e_lfarlc;
     WORD e_ovno;
     WORD e_res[4];
     WORD e_oemid;
     WORD e_oeminfo;
     WORD e_res2[10];
     LONG e_lfanew;
} IMAGE_DOS_HEADER, *PIMAGE_DOS_HEADER;

The DOS header is documented extensively in Microsoft's "PE File," so we will not describe all of the fields here. For our purposes, it will be sufficient to examine two fields: e_magic and e_lfanew. At the very beginning of this header (and thus, the beginning of the PE file), we see the e_magic field, which contains the letters 'MZ' (or, 0x4d 0x5a). The e_lfanew field is the RVA of the PE file's NT headers, which are a series of critical structures that will help us load our module.

Note: The "NT" in NT headers and elsewhere is a vestigial initialism from Microsoft's Windows NT, where it stands for new technology. As Windows NT was released in 1993, its newness has since worn off.

We can verify that PEs contain the magic 'MZ' bytes by looking at the hex dump of the DOS header. You can choose any .exe or .dll and examine the first two bytes of the result. For example, Figure 1 contains a hex dump of C:/Windows/System32/ntdll.dll, with the MZ bytes highlighted, as well as the RVA of the NT headers.

Figure 1 - e_magic and e_lfanew fields outlined
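To make the two highlighted fields concrete, the sketch below pulls e_magic and e_lfanew out of a raw byte buffer. It does not use the Windows headers: the offsets 0x00 and 0x3C follow from the IMAGE_DOS_HEADER layout shown earlier, and the helper names (read_le16, read_le32, read_dos_fields) are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field offsets within IMAGE_DOS_HEADER, which is 0x40 bytes long. */
#define DOS_E_MAGIC_OFFSET   0x00u
#define DOS_E_LFANEW_OFFSET  0x3Cu
#define DOS_HEADER_SIZE      0x40u

/* Hypothetical helpers: decode little-endian values byte by byte, */
/* so the result does not depend on the host's byte order.         */
static uint16_t read_le16(const unsigned char* p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const unsigned char* p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 and fills both outputs if buf can hold a DOS header, else 0. */
static int read_dos_fields(const unsigned char* buf, size_t size,
                           uint16_t* e_magic, uint32_t* e_lfanew)
{
    assert(e_magic != NULL && e_lfanew != NULL);

    if(NULL == buf || size < DOS_HEADER_SIZE)
        return 0;

    *e_magic  = read_le16(buf + DOS_E_MAGIC_OFFSET);
    *e_lfanew = read_le32(buf + DOS_E_LFANEW_OFFSET);
    return 1;
}
```

Because the bytes 'M' (0x4d) and 'Z' (0x5a) are stored in that order, the 16-bit little-endian value of e_magic is 0x5A4D.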

For robustness, we'll want to verify that the buffer that our reflective loader receives is actually a PE file. A simple validation technique is to check that (a) the e_magic value is correct, and (b) that the RVA of the NT headers resides within the bounds of our buffer. The code snippet below illustrates how we might perform these checks.

// A method for validating the DOS headers.
static LoaderStatus validate_dos_hdr(const unsigned char* buf, uint32_t size)
{
    const IMAGE_DOS_HEADER* dos = (const IMAGE_DOS_HEADER*)buf;

    /* Check to make sure our buffer is not NULL, and        */
    /* that we have at least enough space to get our offsets */
    if(NULL == buf || size < sizeof(IMAGE_DOS_HEADER))
        return LoaderBadFormat;

    /* Does it start with MZ? */
    if(buf[0] != 'M' || buf[1] != 'Z')
        return LoaderBadFormat;

    /* e_lfanew is a signed LONG, so reject negative offsets. */
    if(dos->e_lfanew < 0)
        return LoaderBadFormat;

    /* Do the IMAGE_NT_HEADERS fit entirely inside our buffer? */
    if(size < sizeof(IMAGE_NT_HEADERS) ||
       (uint32_t)dos->e_lfanew > size - sizeof(IMAGE_NT_HEADERS))
        return LoaderBadFormat;

    /* We can proceed to IMAGE_NT_HEADERS! */
    return LoaderSuccess;
}

We can then begin our reflective_loadlibrary implementation as follows:

LoaderStatus reflective_loadlibrary(loader_ctx* ctx,
                                    const void* buffer,
                                    uint32_t size)
{
    LoaderStatus            status = LoaderSuccess;
    const unsigned char*    cbuffer = (const unsigned char*)buffer;
    const IMAGE_NT_HEADERS* nt = NULL;

    if(NULL == ctx)
        return LoaderInvalid;

    if(LoaderSuccess != (status = validate_dos_hdr(cbuffer, size))) {
        return status;
    }

    nt = (const IMAGE_NT_HEADERS*)(cbuffer +
            ((const IMAGE_DOS_HEADER*)buffer)->e_lfanew);

    // TODO: Finish

    return status;
}

NT Header

After executing the code in the previous section (before the TODO), we will now have a pointer to the IMAGE_NT_HEADERS. This contains another magic value, 'PE' (or 0x50, 0x45, 0x00, 0x00). To show how the RVA works in practice, let's look at the previous hex dump, expanded to show the start of the next header we are interested in:

Figure 2 - DOS Header and NT headers both highlighted

As you can see from the previous highlighted section, e_lfanew indicated that we should find the start of the NT header 0xf0 bytes in (stored as 0xf0, 0x00, 0x00, 0x00 in little-endian byte order), and sure enough, at offset 0xf0 in the hex dump above, we see the NT magic values.

That signature we see at the very bottom of the hex dump is immediately followed by two more structures that will be vitally important for finishing off our loader, as indicated below:

typedef struct _IMAGE_NT_HEADERS {
  DWORD                 Signature;
  IMAGE_FILE_HEADER     FileHeader;
  IMAGE_OPTIONAL_HEADER OptionalHeader;
} IMAGE_NT_HEADERS, *PIMAGE_NT_HEADERS;
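A loader would normally confirm the Signature field before trusting the rest of the NT headers. The following is a rough, self-contained sketch of that check on a raw buffer; has_pe_signature is a hypothetical helper, and the e_lfanew argument is assumed to have been read from the DOS header already. With the Windows headers available, the same test is usually written as a comparison of nt->Signature against IMAGE_NT_SIGNATURE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns 1 if the four bytes at offset e_lfanew are 'P', 'E', 0, 0. */
static int has_pe_signature(const unsigned char* buf, size_t size,
                            uint32_t e_lfanew)
{
    static const unsigned char sig[4] = { 'P', 'E', 0, 0 };

    assert(buf != NULL);

    /* All four signature bytes must lie inside the buffer. */
    if(size < sizeof(sig) || e_lfanew > size - sizeof(sig))
        return 0;

    return 0 == memcmp(buf + e_lfanew, sig, sizeof(sig));
}
```

The subtraction in the bounds check is deliberate: comparing e_lfanew against size - sizeof(sig) avoids the integer overflow that e_lfanew + sizeof(sig) could produce for a hostile offset.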

The file header (below) comes first, and has several useful fields we will need in order to validate and finish loading our binary. Its definition is as follows:


typedef struct _IMAGE_FILE_HEADER {
  WORD  Machine;
  WORD  NumberOfSections;
  DWORD TimeDateStamp;
  DWORD PointerToSymbolTable;
  DWORD NumberOfSymbols;
  WORD  SizeOfOptionalHeader;
  WORD  Characteristics;
} IMAGE_FILE_HEADER, *PIMAGE_FILE_HEADER;

The Machine field indicates the architecture the binary is supposed to run on. While Windows can run on many platforms, for the purposes of this book, we will consider either IMAGE_FILE_MACHINE_I386 (0x014c) for x86 systems, or IMAGE_FILE_MACHINE_AMD64 (0x8664) for x64. Values for other architectures can be found within the relevant documentation (MSDN and Windows headers).
The second field that is of use to us here is the Characteristics element, which we can use to determine what type of PE file we are observing. Finally, the NumberOfSections entry is very useful to us, and we will talk more about it later in the chapter.

We can now add a small bit of validation for the NT headers, to make sure the architecture is correct:


#ifdef _M_X64
#define CURRENT_ARCH  IMAGE_FILE_MACHINE_AMD64
#else
#define CURRENT_ARCH  IMAGE_FILE_MACHINE_I386
#endif

// ...

/*
*  We will check to see if the PE will work
*  on the current architecture.
*/
if(CURRENT_ARCH != nt->FileHeader.Machine)
    return LoaderBadArch;

Additionally, we can also check to see if the current PE is a DLL or not in the following fashion:

if(IMAGE_FILE_DLL & nt->FileHeader.Characteristics) {
    // The file is a DLL
} else {
    // Not a DLL!
}
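The two file header checks can also be written as small pure functions, which makes them easy to exercise without a real PE file. The sketch below mirrors the relevant constants from the Windows headers under shortened, made-up names (MACHINE_*, FILE_*), and the pe_kind enum and both function names are hypothetical. It additionally tests IMAGE_FILE_EXECUTABLE_IMAGE (0x0002), a flag the text above does not use, which marks the file as a runnable image.

```c
#include <stdint.h>

/* Values mirrored from the Windows headers so the sketch is self-contained. */
#define MACHINE_I386     0x014cu  /* IMAGE_FILE_MACHINE_I386     */
#define MACHINE_AMD64    0x8664u  /* IMAGE_FILE_MACHINE_AMD64    */
#define FILE_EXECUTABLE  0x0002u  /* IMAGE_FILE_EXECUTABLE_IMAGE */
#define FILE_DLL         0x2000u  /* IMAGE_FILE_DLL              */

typedef enum pe_kind_ { PeKindNotImage, PeKindExe, PeKindDll } pe_kind;

/* Pointer size in bytes for the two architectures considered here, */
/* or 0 for anything else.                                          */
static unsigned machine_pointer_size(uint16_t machine)
{
    switch(machine) {
    case MACHINE_I386:  return 4;
    case MACHINE_AMD64: return 8;
    default:            return 0;
    }
}

/* Classifies a file from its FileHeader.Characteristics value. */
static pe_kind classify_image(uint16_t characteristics)
{
    if(0 == (characteristics & FILE_EXECUTABLE))
        return PeKindNotImage;

    return (characteristics & FILE_DLL) ? PeKindDll : PeKindExe;
}
```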

Optional Header

In spite of its name, the Optional Header (IMAGE_OPTIONAL_HEADER) is anything but. We will spend a good chunk of our time in this structure (which, as you can see below, is very large), and its constituent sub-structures.

typedef struct _IMAGE_OPTIONAL_HEADER {
  WORD                 Magic;
  BYTE                 MajorLinkerVersion;
  BYTE                 MinorLinkerVersion;
  DWORD                SizeOfCode;
  DWORD                SizeOfInitializedData;
  DWORD                SizeOfUninitializedData;
  DWORD                AddressOfEntryPoint;
  DWORD                BaseOfCode;
  DWORD                BaseOfData;
  DWORD                ImageBase;
  DWORD                SectionAlignment;
  DWORD                FileAlignment;
  WORD                 MajorOperatingSystemVersion;
  WORD                 MinorOperatingSystemVersion;
  WORD                 MajorImageVersion;
  WORD                 MinorImageVersion;
  WORD                 MajorSubsystemVersion;
  WORD                 MinorSubsystemVersion;
  DWORD                Win32VersionValue;
  DWORD                SizeOfImage;
  DWORD                SizeOfHeaders;
  DWORD                CheckSum;
  WORD                 Subsystem;
  WORD                 DllCharacteristics;
  DWORD                SizeOfStackReserve;
  DWORD                SizeOfStackCommit;
  DWORD                SizeOfHeapReserve;
  DWORD                SizeOfHeapCommit;
  DWORD                LoaderFlags;
  DWORD                NumberOfRvaAndSizes;
  IMAGE_DATA_DIRECTORY DataDirectory[IMAGE_NUMBEROF_DIRECTORY_ENTRIES];
} IMAGE_OPTIONAL_HEADER, *PIMAGE_OPTIONAL_HEADER;

Of most immediate interest are the SizeOfImage, SizeOfHeaders, and AddressOfEntryPoint fields, along with the DataDirectory table, which we will look at next.
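One detail worth keeping in mind: the layout listed above is the 32-bit form of the structure. The 64-bit form (IMAGE_OPTIONAL_HEADER64) has no BaseOfData field and widens ImageBase and the stack and heap size fields to 64 bits. The Magic field tells the two apart, and the sketch below reads it from a raw buffer. The function name and the OPT_* macro names are illustrative; the magic values themselves correspond to IMAGE_NT_OPTIONAL_HDR32_MAGIC and IMAGE_NT_OPTIONAL_HDR64_MAGIC.

```c
#include <stddef.h>
#include <stdint.h>

#define OPT_MAGIC_PE32      0x010bu  /* IMAGE_NT_OPTIONAL_HDR32_MAGIC */
#define OPT_MAGIC_PE32PLUS  0x020bu  /* IMAGE_NT_OPTIONAL_HDR64_MAGIC */

/* The optional header starts after the 4-byte signature and the */
/* 20-byte IMAGE_FILE_HEADER; its Magic field is 2 bytes long.   */
#define OPT_HDR_OFFSET  24u
#define OPT_MAGIC_END   (OPT_HDR_OFFSET + 2u)

/* Returns 32 or 64 depending on the optional header's Magic field, */
/* or 0 if the field is out of bounds or holds another value.       */
static int optional_header_bits(const unsigned char* buf, size_t size,
                                uint32_t e_lfanew)
{
    const unsigned char* p;
    uint16_t magic;

    if(NULL == buf || size < OPT_MAGIC_END || e_lfanew > size - OPT_MAGIC_END)
        return 0;

    p = buf + e_lfanew + OPT_HDR_OFFSET;
    magic = (uint16_t)(p[0] | (p[1] << 8));  /* little-endian read */

    if(OPT_MAGIC_PE32 == magic)
        return 32;
    if(OPT_MAGIC_PE32PLUS == magic)
        return 64;
    return 0;
}
```

A loader built for one architecture could use a check like this alongside the Machine comparison, since both should agree before any optional header field is read.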

Data Directories

This substructure of the Optional Header serves as a table for locating important structures within the binary, such as the import, relocation, and TLS tables. Each entry in the table is laid out as follows:

typedef struct _IMAGE_DATA_DIRECTORY {
  DWORD VirtualAddress;
  DWORD Size;
} IMAGE_DATA_DIRECTORY, *PIMAGE_DATA_DIRECTORY;

The first field is a bit poorly named, as it is actually an RVA; the second field is the size of the data found at that RVA. We will visit quite a few of these table entries as we progress through the loader development sections that follow.
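As a sketch of how a loader might consult this table safely, the helper below looks up one entry and rejects it if the index is beyond NumberOfRvaAndSizes, if the entry is empty, or if its range falls outside SizeOfImage. The data_dir type stands in for IMAGE_DATA_DIRECTORY, and find_data_dir and the DIR_ENTRY_* names are hypothetical; the index values match the IMAGE_DIRECTORY_ENTRY_* constants in the Windows headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for IMAGE_DATA_DIRECTORY. */
typedef struct data_dir_ {
    uint32_t VirtualAddress;  /* Actually an RVA */
    uint32_t Size;
} data_dir;

#define DIR_MAX_ENTRIES      16u  /* IMAGE_NUMBEROF_DIRECTORY_ENTRIES    */
#define DIR_ENTRY_EXPORT     0u   /* IMAGE_DIRECTORY_ENTRY_EXPORT        */
#define DIR_ENTRY_IMPORT     1u   /* IMAGE_DIRECTORY_ENTRY_IMPORT        */
#define DIR_ENTRY_BASERELOC  5u   /* IMAGE_DIRECTORY_ENTRY_BASERELOC     */
#define DIR_ENTRY_TLS        9u   /* IMAGE_DIRECTORY_ENTRY_TLS           */

/* Returns the requested entry, or NULL if it is absent or out of range. */
/* count is the header's NumberOfRvaAndSizes value.                      */
static const data_dir* find_data_dir(const data_dir* table, uint32_t count,
                                     uint32_t index, uint32_t size_of_image)
{
    const data_dir* d;

    assert(table != NULL);

    /* The array itself never holds more than 16 entries. */
    if(count > DIR_MAX_ENTRIES)
        count = DIR_MAX_ENTRIES;
    if(index >= count)
        return NULL;

    d = &table[index];
    if(0 == d->VirtualAddress || 0 == d->Size)
        return NULL;  /* This image does not use the directory. */

    /* The [RVA, RVA + Size) range must stay inside the mapped image. */
    if(d->VirtualAddress > size_of_image ||
       d->Size > size_of_image - d->VirtualAddress)
        return NULL;

    return d;
}
```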

Sections

Following the headers we've discussed so far lie the section headers, followed by the actual sections of the binary. These sections contain the code, data, and metadata required to load and run the application. While we will not visit all of the section types that the PE specification supports, we will visit at least the major ones required to gain execution. So first, let's look at the section headers:

#define IMAGE_SIZEOF_SHORT_NAME   8U

// ...

typedef struct _IMAGE_SECTION_HEADER {
  BYTE  Name[IMAGE_SIZEOF_SHORT_NAME];
  union {
    DWORD PhysicalAddress;
    DWORD VirtualSize;
  } Misc;
  DWORD VirtualAddress;
  DWORD SizeOfRawData;
  DWORD PointerToRawData;
  DWORD PointerToRelocations;
  DWORD PointerToLinenumbers;
  WORD  NumberOfRelocations;
  WORD  NumberOfLinenumbers;
  DWORD Characteristics;
} IMAGE_SECTION_HEADER, *PIMAGE_SECTION_HEADER;

We can view the layout in the following hex dump:

Figure 3 - Section Header Information

The highlighted portion at the top shows where the IMAGE_NT_HEADERS end (with the array of data directories), and are immediately followed by the first IMAGE_SECTION_HEADER, which we can see is the .text section (the section executable code is generally stored in) on the right. These section headers are important for loading, as they allow us to map the actual file offset of each section to its Virtual Address, by adding the RVA provided above (in the VirtualAddress field of the structure) to the Base Address of our new allocation. We will cover this operation in more depth later in the chapter.
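The relationship between RVAs and file offsets can be illustrated with a short lookup routine. This is a sketch under stated assumptions: sec_hdr is a cut-down stand-in for IMAGE_SECTION_HEADER holding only the three fields needed, rva_to_file_offset is a hypothetical name, and RVAs that fall inside the headers rather than inside a section are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down stand-in for IMAGE_SECTION_HEADER. */
typedef struct sec_hdr_ {
    uint32_t VirtualAddress;    /* RVA of the section in memory       */
    uint32_t SizeOfRawData;     /* Bytes the section occupies on disk */
    uint32_t PointerToRawData;  /* File offset of the section's data  */
} sec_hdr;

/* Translates an RVA into a file offset by finding the section whose  */
/* on-disk data covers it. Returns 1 on success, or 0 if no section's */
/* raw data contains the RVA (for example, zero-filled memory that    */
/* exists only after loading).                                        */
static int rva_to_file_offset(const sec_hdr* secs, uint16_t count,
                              uint32_t rva, uint32_t* offset)
{
    uint16_t i;

    assert(secs != NULL && offset != NULL);

    for(i = 0; i < count; i++) {
        uint32_t start = secs[i].VirtualAddress;

        if(rva >= start && rva - start < secs[i].SizeOfRawData) {
            *offset = secs[i].PointerToRawData + (rva - start);
            return 1;
        }
    }
    return 0;
}
```

For a section with VirtualAddress 0x1000 and PointerToRawData 0x400, the RVA 0x1100 translates to file offset 0x500: the same 0x100 bytes into the section, measured from a different starting point.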

General Steps to Loading

Overview

At this point, we are now ready to begin writing a loader. We will start with our initial function stub from the beginning of the chapter:

LoaderStatus reflective_loadlibrary(loader_ctx* ctx,
                                    const void* buffer,
                                    uint32_t size)
{
    LoaderStatus            status = LoaderSuccess;
    const unsigned char*    cbuffer = (const unsigned char*)buffer;
    const IMAGE_NT_HEADERS* nt = NULL;

    if(NULL == ctx)
        return LoaderInvalid;

    if(LoaderSuccess != (status = validate_dos_hdr(cbuffer, size))) {
        return status;
    }

    nt = (const IMAGE_NT_HEADERS*)(cbuffer +
            ((const IMAGE_DOS_HEADER*)buffer)->e_lfanew);

    // TODO: Finish

    return status;
}

At this point, we will look to perform at least the following steps:

  1. Allocate Space - We will need to allocate memory that can match both the size and protection requirements of the sections of our binary.
  2. Copy Data - We need to copy the data from our binary to the allocated space, adjusting file offsets to Virtual Addresses as we go.
  3. Handle Relocations - We must adjust references after copying our data.
  4. Resolve Imports - Generally, binaries need to be linked dynamically to other libraries in order to interact with the operating system at runtime.
  5. TLS Callbacks - We will find and invoke the TLS Callbacks - an array of functions that are invoked prior to the application's entry point.
  6. Finally, we will locate and invoke the binary's entry point. We will discuss some differences here between DLLs and Executables.

Many additional steps could go here, and while we will discuss them later in the chapter, implementation will be left as an exercise for the reader.
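One possible way to picture the overall flow is as an ordered list of stages in which the first failure aborts the load. The sketch below is only an illustration of that idea and is not how the listings in this chapter are organized (they use a single function with a cleanup label). The load_state type, the run_stages function, and the placeholder stages are all hypothetical, and LoaderStatus is reduced to two values to keep the example short.

```c
#include <stddef.h>

/* Reduced stand-in for the LoaderStatus enum defined earlier. */
typedef enum LoaderStatus_ { LoaderSuccess, LoaderFailed } LoaderStatus;

/* Hypothetical state shared by every stage of a single load. */
typedef struct load_state_ {
    int stages_run;
} load_state;

typedef LoaderStatus (*load_stage)(load_state* st);

/* Runs each stage in order and returns the first failure, if any. */
static LoaderStatus run_stages(const load_stage* stages, size_t count,
                               load_state* st)
{
    size_t i;

    for(i = 0; i < count; i++) {
        LoaderStatus status = stages[i](st);
        if(LoaderSuccess != status)
            return status;
    }
    return LoaderSuccess;
}

/* Placeholder stages for illustration only. A real loader would have */
/* one function per step: allocate, copy, relocate, resolve imports,  */
/* run TLS callbacks, and call the entry point.                       */
static LoaderStatus stage_ok(load_state* st)
{
    st->stages_run++;
    return LoaderSuccess;
}

static LoaderStatus stage_fail(load_state* st)
{
    st->stages_run++;
    return LoaderFailed;
}
```

The ordering matters because each step depends on the one before it: relocations and imports can only be processed once the image data has been copied to its final location.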

Allocate Space

While many allocator APIs exist in Windows, some special considerations must be applied to requesting memory for our loader. In particular, we have some special alignment and protection requirements (in general) that we will need to adhere to in order to ensure that our program will perform properly once loaded.
To keep things simple, we will start with VirtualAlloc, but keep in mind that other options exist, and may have different implications in terms of how our memory footprint looks forensically. So how much space should we get? In order to obtain that information, we should refer back to the IMAGE_OPTIONAL_HEADER structure from the last section, in particular, the SizeOfImage field. Thus, we can update our loader method as follows:

LoaderStatus reflective_loadlibrary(loader_ctx* ctx,
                                    const void* buffer,
                                    uint32_t size)
{
    LoaderStatus            status = LoaderSuccess;
    const unsigned char*    cbuffer = (const unsigned char*)buffer;
    const IMAGE_NT_HEADERS* nt = NULL;
    // Our new allocation!
    unsigned char*          new_buffer = NULL;
    uint32_t                nb_size = 0;

    if(NULL == ctx)
        return LoaderInvalid;

    if(LoaderSuccess != (status = validate_dos_hdr(cbuffer, size))) {
        return status;
    }

    nt = (const IMAGE_NT_HEADERS*)(cbuffer +
            ((const IMAGE_DOS_HEADER*)buffer)->e_lfanew);

    /* We will get the size we need to allocate */
    if(0 == (nb_size = nt->OptionalHeader.SizeOfImage)) {
        return LoaderBadFormat;
    }

    // Allocate space for our executable
    new_buffer = (unsigned char*)VirtualAlloc(NULL,
                                              nb_size,
                                              MEM_RESERVE | MEM_COMMIT,
                                              PAGE_EXECUTE_READWRITE);
    // Check that our allocation succeeded
    if(NULL == new_buffer)
        return LoaderNoMem;

    // TODO: Finish

cleanup:
    // If we did not succeed at loading, we need to clean up!
    if(LoaderSuccess != status)
        VirtualFree(new_buffer, 0, MEM_RELEASE);

    return status;
}

There is quite a bit to unpack in these new additions. First, we obtain our image size from the OptionalHeader, and use that to allocate the appropriate number of read/write/execute (RWX) pages. As we will see later on, it may not always be advisable to allocate RWX memory, but to keep things simple for now, we will allocate in this fashion, which will allow us to ignore page permission issues (at least for the moment). Finally, we've added a cleanup label at the bottom, so we can ensure our allocation will get cleaned up properly if a subsequent step fails, and avoid leaking memory. Our tools may be long running, and many defenders look for abnormalities in memory and system resource utilization, so avoiding leaks is very important.
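As a preview of what avoiding RWX can look like, one common approach (an assumption here, not necessarily the one this series takes later) is to derive a protection value for each section from the Characteristics field of its section header and then apply it with a call such as VirtualProtect. The sketch below covers only the mapping step. The SCN_* and PG_* macros mirror the values of the IMAGE_SCN_MEM_* and PAGE_* constants from the Windows headers under made-up names, and section_protection is a hypothetical helper.

```c
#include <stdint.h>

/* Section permission flags (values of IMAGE_SCN_MEM_*). */
#define SCN_MEM_EXECUTE  0x20000000u
#define SCN_MEM_READ     0x40000000u
#define SCN_MEM_WRITE    0x80000000u

/* Page protection constants (values of PAGE_*). */
#define PG_NOACCESS           0x01u
#define PG_READONLY           0x02u
#define PG_READWRITE          0x04u
#define PG_EXECUTE            0x10u
#define PG_EXECUTE_READ       0x20u
#define PG_EXECUTE_READWRITE  0x40u

/* Maps a section's Characteristics to the page protection it asks for. */
static uint32_t section_protection(uint32_t characteristics)
{
    int x = (characteristics & SCN_MEM_EXECUTE) != 0;
    int r = (characteristics & SCN_MEM_READ) != 0;
    int w = (characteristics & SCN_MEM_WRITE) != 0;

    if(x) {
        if(w)
            return PG_EXECUTE_READWRITE;  /* Writable implies readable */
        return r ? PG_EXECUTE_READ : PG_EXECUTE;
    }
    if(w)
        return PG_READWRITE;
    return r ? PG_READONLY : PG_NOACCESS;
}
```

A typical .text section (0x60000020) maps to PG_EXECUTE_READ, while a typical .data section (0xC0000040) maps to PG_READWRITE, so neither needs to be writable and executable at the same time.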

Copy Data

Now that we have successfully allocated space to load the PE, we must copy in the data. This is not quite as straightforward as it initially sounds, however, as the Virtual Addresses of the sections we need to copy over don't generally line up exactly with their offsets in the source buffer (or original file), and thus, we will need to utilize the section headers to determine exactly how to copy them over.
Before we do that, however, we will copy over the headers, which is a relatively straightforward operation, as the headers start at the beginning of the source buffer, and the size is available through the IMAGE_OPTIONAL_HEADER SizeOfHeaders field:

static LoaderStatus copy_headers(unsigned char* dest,
                                 uint32_t dest_size,
                                 const unsigned char* src,
                                 uint32_t src_size,
                                 const IMAGE_NT_HEADERS* nt)
{
    uint32_t hdr_size = 0;

    if(NULL == dest || NULL == src || NULL == nt)
        return LoaderInvalid;

    // Get the size of headers
    if(0 == (hdr_size = nt->OptionalHeader.SizeOfHeaders)) {
        return LoaderBadFormat;
    }

    // Validate that the size will not overflow either buffer
    if(hdr_size > src_size || hdr_size > dest_size)
        return LoaderBadFormat;

    // Copy!
    memcpy(dest, src, hdr_size);

    return LoaderSuccess;
}

Now we need to actually write some code to locate the section headers. If you recall from the hex dump of the NT headers and the section headers, the section headers immediately follow the NT headers (and thus, the IMAGE_OPTIONAL_HEADER) in the file, and so we can perform some simple pointer math to get to them:

const IMAGE_SECTION_HEADER* sec_start = (const IMAGE_SECTION_HEADER*)(
    (const unsigned char*)(&nt->OptionalHeader) +
    nt->FileHeader.SizeOfOptionalHeader);

Fortunately, Microsoft provides a macro for doing just this, IMAGE_FIRST_SECTION:

#define IMAGE_FIRST_SECTION( ntheader ) ((PIMAGE_SECTION_HEADER)  \
    ((ULONG_PTR)(ntheader) +                                      \
     FIELD_OFFSET( IMAGE_NT_HEADERS, OptionalHeader ) +           \
     ((ntheader))->FileHeader.SizeOfOptionalHeader   \
    ))

and thus, we can instead simply write our code as follows:

const IMAGE_SECTION_HEADER* sec_start = IMAGE_FIRST_SECTION(nt);

and be done with it.
Now that we have a pointer to the beginning of the section headers, we will revisit the section header definition from earlier, but shortened a bit to reflect the fields we care about right now:

typedef struct _IMAGE_SECTION_HEADER {
  // ...
  DWORD VirtualAddress;
  DWORD SizeOfRawData;
  DWORD PointerToRawData;
  // ...
} IMAGE_SECTION_HEADER, *PIMAGE_SECTION_HEADER;

Now to unpack what we need to do a bit: The PointerToRawData field in the structure above is simply an offset into the source buffer. It normally indicates where in the file the actual section data resides. This is different from an RVA in that sections will often be different sizes in memory (and thus, have different Virtual Addresses) than they are on-disk. This may be due to things such as alignment differences, or in the case of some sections, there may be space that will need to be allocated at load time for data that does not necessarily need to be allocated on disk. As you may expect, the SizeOfRawData field indicates how many bytes of the section are present in the source buffer. Finally, the VirtualAddress field will give us the appropriate RVA to base the section data at in our allocated buffer.
To put this in practice, we can create a copy_sections method, as below (error checking has been omitted here for clarity):

static LoaderStatus copy_sections(unsigned char* dest,
                                  uint32_t dest_size,
                                  const unsigned char* src,
                                  uint32_t src_size,
                                  const IMAGE_NT_HEADERS* nt)
{
    const IMAGE_SECTION_HEADER* sec = NULL;
    uint16_t                    sec_count = 0;
    uint16_t                    i = 0;


    sec = IMAGE_FIRST_SECTION(nt);
    sec_count = nt->FileHeader.NumberOfSections;
    for(; i < sec_count; i++) {
       // We want to find the section's virtual address.
       // This will be the RVA (from the section header)
       // added to the base address of our allocation
       // (the return value from VirtualAlloc).
       unsigned char* sec_va = dest + sec[i].VirtualAddress;
       // Now we need to find where the section data starts.
       // This will be the buffer containing the PE file contents,
       // adjusted with the offset from the section header.
       const unsigned char* src_va = src + sec[i].PointerToRawData;
       uint32_t size = sec[i].SizeOfRawData; // The size.

       // Now we copy - we probably should perform some
       // error checking here in practice - e.g., to make sure
       // that we won't copy out of bounds.
       memcpy(sec_va, src_va, size);
    }


    return LoaderSuccess;

}
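The bounds checks that were omitted above might look something like the following. This is a portable sketch rather than a drop-in replacement: sec_hdr is a cut-down stand-in for IMAGE_SECTION_HEADER, the function takes the section array and count directly instead of an IMAGE_NT_HEADERS pointer, it returns 1 or 0 rather than a LoaderStatus, and the names range_fits and copy_sections_checked are hypothetical. It also assumes the section header array itself has already been confirmed to lie inside the source buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Cut-down stand-in for IMAGE_SECTION_HEADER. */
typedef struct sec_hdr_ {
    uint32_t VirtualAddress;
    uint32_t SizeOfRawData;
    uint32_t PointerToRawData;
} sec_hdr;

/* Returns 1 if [off, off + len) fits inside a buffer of total bytes. */
/* Written with a subtraction so that off + len cannot overflow.      */
static int range_fits(uint32_t off, uint32_t len, uint32_t total)
{
    return off <= total && len <= total - off;
}

/* Copies every section's raw data to its RVA in dest. Returns 1 on */
/* success, or 0 (copying nothing) if any section is out of bounds. */
static int copy_sections_checked(unsigned char* dest, uint32_t dest_size,
                                 const unsigned char* src, uint32_t src_size,
                                 const sec_hdr* secs, uint16_t count)
{
    uint16_t i;

    assert(dest != NULL && src != NULL && secs != NULL);

    /* Validate everything first, so a malformed header cannot leave */
    /* a partially copied image behind.                              */
    for(i = 0; i < count; i++) {
        if(!range_fits(secs[i].PointerToRawData, secs[i].SizeOfRawData, src_size) ||
           !range_fits(secs[i].VirtualAddress, secs[i].SizeOfRawData, dest_size))
            return 0;
    }

    for(i = 0; i < count; i++) {
        memcpy(dest + secs[i].VirtualAddress,
               src + secs[i].PointerToRawData,
               secs[i].SizeOfRawData);
    }
    return 1;
}
```

Both ranges matter: the source check stops reads past the end of the file contents, and the destination check stops writes past the end of the allocation, which is sized from SizeOfImage.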

At this point, we can now combine these:

static LoaderStatus copy_image(unsigned char* dest,
                               uint32_t dest_size,
                               const unsigned char* src,
                               uint32_t src_size,
                               const IMAGE_NT_HEADERS* nt)
{
    LoaderStatus status = LoaderSuccess;

    status = copy_headers(dest, dest_size, src, src_size, nt);
    if(LoaderSuccess != status)
        return status;

    return copy_sections(dest, dest_size, src, src_size, nt);
}

And putting it all together, we can combine this with our existing loader method, to get something like this:

LoaderStatus reflective_loadlibrary(loader_ctx* ctx,
                                    const void* buffer,
                                    uint32_t size)
{
    LoaderStatus            status = LoaderSuccess;
    const unsigned char*    cbuffer = (const unsigned char*)buffer;
    const IMAGE_NT_HEADERS* nt = NULL;
    // Our new allocation!
    unsigned char*          new_buffer = NULL;
    uint32_t                nb_size = 0;

    if(NULL == ctx)
        return LoaderInvalid;

    if(LoaderSuccess != (status = validate_dos_hdr(cbuffer, size))) {
        return status;
    }

    nt = (const IMAGE_NT_HEADERS*)(cbuffer +
            ((const IMAGE_DOS_HEADER*)buffer)->e_lfanew);

    /* We will get the size we need to allocate */
    if(0 == (nb_size = nt->OptionalHeader.SizeOfImage)) {
        return LoaderBadFormat;
    }

    // Allocate space for our executable
    new_buffer = (unsigned char*)VirtualAlloc(NULL,
                                              nb_size,
                                              MEM_RESERVE | MEM_COMMIT,
                                              PAGE_EXECUTE_READWRITE);
    // Check that our allocation succeeded
    if(NULL == new_buffer)
        return LoaderNoMem;

    //  We will copy our image data, and bail out if it fails.
    status = copy_image(new_buffer, nb_size, cbuffer, size, nt);
    if(LoaderSuccess != status)
        goto cleanup;

    // TODO: Finish

cleanup:
    // If we did not succeed at loading, we need to clean up!
    if(LoaderSuccess != status)
        VirtualFree(new_buffer, 0, MEM_RELEASE);

    return status;
}

Article continued here.