#include <stdint.h>
extern void hole_fn(void) __attribute__((preserve_none));
extern int hole_for_int;
__attribute__((preserve_none))
void swap_and_multiply(int a, int b) {
const int hole_value = (int)((uintptr_t)&hole_for_int);
int c = a * hole_value;
a = b * hole_value;
b = c;
typedef void(*outfn_type)(int, int) __attribute__((preserve_none));
outfn_type stencil_output = (outfn_type)&hole_fn;
stencil_output(a, b);
}
How It Works
In the tutorial, you followed a set of given rules to produce stencils and relocation holes, with no explanation of why the rules are the way they are. We’re now going to dive into that reasoning, and look at exactly how each piece of guidance on clang flags and relocation hole macros came to be.
Stencil Creation
All of the idioms around creating stencils are about abusing features of clang as much as possible, so that functions compile down to only the specific sequences of instructions we want. There are a number of tricks involved:
First, we rely on the calling convention to be able to force values into known registers. Our goal is to be able to form programs by concatenating stencils, and so we must be able to match the outputs of one stencil to the inputs of another. By making the stencil inputs be the function arguments, and ending each function with a (tail)call to another function, we can rely on the calling convention to place input and output values into consistent registers. This ending call can be easily identified and trimmed off from the stencil. As a minor optimization, we rely specifically on the GHC / preserve_none calling convention, which tries to pass as many arguments in registers as possible. This maximizes our ability to keep values in registers, and minimizes the chance that the compiler will try to generate a stack frame, as we won’t be pushing arguments to a stack.
Second, we rely on compiler optimizations to elide the stack frame prologue/epilogue and to turn the ending call into a tailcall. Setting up and tearing down a stack frame is a notable overhead on the small stencil functions, and means each setup must have a paired teardown. Ending the stencil with a tailcall is what allows us to trivially elide the jump instruction and fall through into the next concatenated stencil, as well as helping to ensure that any stack operations have been undone before the jump.
Third, we extensively abuse dynamic relocations to allow stencils to declare holes for values to be filled in at JIT compile time. When compiling the stencil, the C compiler will tell us how and where to patch constants or addresses into the code that it generated. If we wish to be able to patch in an integer constant, we can declare an extern int some_constant, and then cast the address of that variable to an int. By paying attention to the name of the extern symbol being referenced, we can more intelligently disambiguate its intended use and treat certain references specially. The machine code model has a significant impact on the relocations generated, and we’ll discuss that more later.
To show how these all fit together, let us consider a stencil which swaps its two arguments between its input and output, and multiplies them by a patchable constant:
We compile this with clang -mcmodel=medium -O3 -c swap_and_multiply.c, and examine the generated code with objdump -d -Mintel,x86-64 --disassemble --reloc swap_and_multiply.o:
0000000000000000 <swap_and_multiply>:
0: 44 89 e0 mov eax,r12d
3: 41 bc 00 00 00 00 mov r12d,0x0
5: R_X86_64_32 hole_for_int
9: 41 0f af c4 imul eax,r12d
d: 45 0f af e5 imul r12d,r13d
11: 41 89 c5 mov r13d,eax
14: e9 00 00 00 00 jmp 19 <swap_and_multiply+0x19>
15: R_X86_64_PLT32 hole_fn-0x4
And thus we have achieved our exact goals in stencil creation. The function body is only our targeted set of instructions. There’s no stack frame setup or teardown. The relocation information tells us exactly how and where to patch in our integer constant at JIT compile time. And the use of a unique symbol hole_fn means the tail call jump is easy to identify and strip off from the generated code, as we end up with a unique pointer to it.
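As a rough illustration of what the JIT side does with this output, here is a minimal sketch in C. It hard-codes the bytes and relocation offsets from the listing above; the helper name emit_swap_and_multiply and the output-buffer interface are hypothetical, not part of the tutorial's code. It copies the stencil body without the trailing jmp and fills the R_X86_64_32 hole with a constant.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes of swap_and_multiply, taken from the objdump listing above. */
static const uint8_t swap_and_multiply_code[] = {
    0x44, 0x89, 0xe0,                   /* mov  eax, r12d            */
    0x41, 0xbc, 0x00, 0x00, 0x00, 0x00, /* mov  r12d, <hole_for_int> */
    0x41, 0x0f, 0xaf, 0xc4,             /* imul eax, r12d            */
    0x45, 0x0f, 0xaf, 0xe5,             /* imul r12d, r13d           */
    0x41, 0x89, 0xc5,                   /* mov  r13d, eax            */
    0xe9, 0x00, 0x00, 0x00, 0x00,       /* jmp  <hole_fn>            */
};

enum {
    HOLE_FOR_INT_OFFSET = 0x5,  /* R_X86_64_32    hole_for_int */
    HOLE_FN_OFFSET      = 0x15, /* R_X86_64_PLT32 hole_fn-0x4  */
    JMP_REL32_SIZE      = 5     /* e9 opcode + 4-byte offset   */
};

/* Hypothetical helper: copy the stencil body (without its trailing jmp)
 * into out and patch the integer hole. Returns the number of bytes
 * written, or 0 if out is too small. */
size_t emit_swap_and_multiply(uint8_t *out, size_t cap, uint32_t constant) {
    size_t total = sizeof swap_and_multiply_code;

    /* The tail call is only trimmable if it is the last instruction. */
    assert(HOLE_FN_OFFSET + 4 == total);
    assert(swap_and_multiply_code[total - JMP_REL32_SIZE] == 0xe9);

    size_t body = total - JMP_REL32_SIZE;
    if (cap < body) {
        return 0;
    }
    memcpy(out, swap_and_multiply_code, body);

    /* R_X86_64_32: store the value as 4 little-endian bytes. */
    for (int i = 0; i < 4; i++) {
        out[HOLE_FOR_INT_OFFSET + i] = (uint8_t)(constant >> (8 * i));
    }
    return body;
}
```

Calling the helper twice on adjacent regions of the same buffer yields two copies of the stencil back to back, each with its own constant, which is the concatenation the trimmed jmp is meant to enable.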
Now, let’s unwind each of the techniques involved here to illustrate their individual impact on the generated code.
Calling Convention
The standard x86_64 calling convention places the first six arguments into registers (in order: rdi, rsi, rdx, rcx, r8, r9), and then the rest go on the stack. There’s a very nice overview of the standard (cdecl) calling convention for x86_64 in The 64 bit x86 C Calling Convention. However, the guidance for copy-and-patch stencils is to instead opt in to the preserve_none calling convention. Clang/LLVM only supports preserve_none on x86_64 and AArch64, and GCC doesn’t support it at all (but support is being worked on).
We can look at the difference between cdecl and preserve_none by building a small stencil which just swaps the order of its inputs:
cdecl calling convention | preserve_none calling convention |
---|---|
|
|
|
|
Which is… not really all that different. preserve_none becomes useful, though, as the number of arguments goes up. As mentioned above, x86_64 provides six registers for arguments, so we can better illustrate the difference by extending swap_ints to 8 parameters:
#include <stdint.h>
extern void hole_fn(void)
__attribute__((CALLING_CONVENTION));
__attribute__((CALLING_CONVENTION))
void swap_ints(int a, int b, int c, int d, int e, int f, int g, int h) {
typedef void(*outfn_type)(int, int, int, int,
int, int, int, int)
__attribute__((CALLING_CONVENTION));
outfn_type stencil_output = (outfn_type)&hole_fn;
stencil_output(h, g, f, e, d, c, b, a);
}
// clang -DCALLING_CONVENTION=cdecl -O3 -c
// clang -DCALLING_CONVENTION=preserve_none -O3 -c
cdecl calling convention | preserve_none calling convention |
---|---|
|
|
So it’s helpful for when it matters. It moves us from only being able to define stencils with 6 inputs and outputs to stencils with 12 inputs and outputs, after which preserve_none also runs out of registers and has to start setting up a stack frame. However, there are multiple categories of registers. Floating point values and SSE operations use xmm registers, AVX uses ymm registers, and AVX-512 uses zmm registers. The calling convention also controls how these operate:
 | floating point | SIMD |
---|---|---|
 | | |
cdecl | | |
preserve_none | | |
Any number of floating point or SIMD registers causes a stack frame to get emitted on cdecl, and thus if you’re trying to use them in stencils, you’ll have to use preserve_none. You’ll then be limited to 8 function arguments / registers before it will start passing arguments on the stack.
For SIMD specifically, note that one can use __attribute__((target("arch"))) to generate code for different SIMD feature sets, and detect at runtime which one to select as the code for the stencil:
__attribute__((preserve_none,target("avx")))
void fused_multiply_add_avx(__m512 a, __m512 b, __m512 c) {
DECLARE_STENCIL_OUTPUT(__m512);
return stencil_output(a * b + c);
}
__attribute__((preserve_none,target("no-avx")))
void fused_multiply_add_sse2(__m512 a, __m512 b, __m512 c) {
DECLARE_STENCIL_OUTPUT(__m512);
return stencil_output(a * b + c);
}
0000000000000100 <fused_multiply_add_avx>:
100: 62 f2 75 48 a8 c2 vfmadd213ps zmm0,zmm1,zmm2
106: e9 00 00 00 00 jmp 10b <fused_multiply_add_avx+0xb>
107: R_X86_64_PLT32 cnp_stencil_output-0x4
10b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
0000000000000110 <fused_multiply_add_sse2>:
110: 0f 59 c4 mulps xmm0,xmm4
113: 0f 58 44 24 08 addps xmm0,XMMWORD PTR [rsp+0x8]
118: 0f 59 cd mulps xmm1,xmm5
11b: 0f 58 4c 24 18 addps xmm1,XMMWORD PTR [rsp+0x18]
120: 0f 59 d6 mulps xmm2,xmm6
123: 0f 58 54 24 28 addps xmm2,XMMWORD PTR [rsp+0x28]
128: 0f 59 df mulps xmm3,xmm7
12b: 0f 58 5c 24 38 addps xmm3,XMMWORD PTR [rsp+0x38]
130: e9 00 00 00 00 jmp 135 <fused_multiply_add_sse2+0x25>
131: R_X86_64_PLT32 cnp_stencil_output-0x4
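The runtime selection step itself is not shown above, so here is one way it might look. This is a sketch under assumptions: the stencil_variant descriptor and select_fma_stencil are hypothetical names, and the feature string "avx" simply mirrors the target attribute in the source above. __builtin_cpu_init and __builtin_cpu_supports are GCC/clang builtins for x86; on other targets the sketch falls back to the baseline variant.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one compiled variant of a stencil. */
struct stencil_variant {
    const unsigned char *code; /* machine code bytes of the stencil */
    size_t size;               /* number of bytes in code           */
};

/* Returns nonzero if the CPU we are running on reports AVX support. */
static int host_has_avx(void) {
#if defined(__x86_64__) && (defined(__clang__) || defined(__GNUC__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") != 0;
#else
    return 0; /* unknown target: assume only the baseline variant works */
#endif
}

/* Pick which variant of the stencil the JIT should copy from. */
const struct stencil_variant *
select_fma_stencil(const struct stencil_variant *avx,
                   const struct stencil_variant *sse2) {
    assert(avx != NULL && sse2 != NULL);
    return host_has_avx() ? avx : sse2;
}
```

The check only needs to run once at JIT startup; after that the chosen variant can be treated like any other stencil.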
Tail Call
As was mentioned, we rely primarily on clang’s optimizer to convert the stencil_output call into a tailcall. Optimization is also what elides the stack frame prologue and epilogue when they aren’t needed. Going back to our swap_and_multiply example:
#include <stdint.h>
extern void hole_fn(void) __attribute__((preserve_none));
extern int hole_for_int;
__attribute__((preserve_none))
void swap_and_multiply(int a, int b) {
const int hole_value = (int)((uintptr_t)&hole_for_int);
int c = a * hole_value;
a = b * hole_value;
b = c;
typedef void(*outfn_type)(int, int) __attribute__((preserve_none));
outfn_type stencil_output = (outfn_type)&hole_fn;
stencil_output(a, b);
}
We can look at the resulting code without optimizations (-O0) and with optimizations (-O3):
clang -O0 | clang -O3 |
---|---|
|
|
So, clang is obviously doing great work for us. The stack frame setup and teardown that appear in the unoptimized version have been elided in the optimized version, and the call has been replaced with a tailcall jmp.
I’m not aware of a more specific way to request that clang omit the stack frame when it’s not necessary. -fomit-frame-pointer -momit-leaf-frame-pointer causes clang to drop the push rbp/pop rbp, but the sub rsp,0x20 and add rsp,0x20 remain, as the unoptimized code relies on the stack for local variables. Maybe running only mem2reg would then suffice, but the whole point here is to get all of LLVM’s optimizations for "free" within a stencil anyway.
Clang does support the musttail attribute to force tailcall generation. However, it requires that the signature of the called function match that of the calling function, which doesn’t fit our needs for stencil creation, where a stencil’s outputs often differ from its inputs.
extern void hole_fn(void) __attribute__((preserve_none));
__attribute__((preserve_none))
void add_two_ints(int a, int b) {
typedef void(*outfn_type)(int) __attribute__((preserve_none));
outfn_type stencil_output = (outfn_type)&hole_fn;
// Force the tailcall, via an attribute on the return statement.
__attribute__((musttail)) return stencil_output(a + b);
}
$ clang -O3 -c example.c
example.c:12:29: error: cannot perform a tail call to function 'stencil_output' because its signature is incompatible with the calling function
   12 |   __attribute__((musttail)) return stencil_output(a + b);
      |                             ^
example.c:11:3: note: target function has different number of parameters (expected 2 but has 1)
   11 |   outfn_type stencil_output = (outfn_type)&hole_fn;
      |   ^
example.c:12:18: note: tail call required by 'musttail' attribute here
   12 |   __attribute__((musttail)) return stencil_output(a + b);
      |                  ^
So, unless that changes in the future, we have to rely on -O3 magically doing the right thing.
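For completeness, musttail is accepted when the two signatures do match, so one conceivable workaround (not something this article relies on) is to pad the output signature with unused parameters until it has the same shape as the input. The sketch below assumes exactly that; next_stencil, the CNP_MUSTTAIL macro, and the int return type are illustrative choices, preserve_none is left out to keep the example portable, and the padding argument would occupy a register that a real stencil might want for something else.

```c
#include <assert.h>

/* Use musttail where the compiler has it; otherwise fall back to
 * relying on the optimizer, as the article does. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define CNP_MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef CNP_MUSTTAIL
#  define CNP_MUSTTAIL
#endif

static int last_sum; /* records what the "next stencil" received */

/* Stand-in for the next stencil. The second parameter is padding so
 * that this signature matches add_two_ints exactly. */
int next_stencil(int sum, int unused) {
    (void)unused;
    last_sum = sum;
    return 0;
}

/* Same parameter list and return type as next_stencil, so the
 * musttail requirement on matching signatures is satisfied. */
int add_two_ints(int a, int b) {
    CNP_MUSTTAIL return next_stencil(a + b, 0);
}
```

Whether the extra dummy arguments are an acceptable price depends on how many registers the stencils need; the article's approach of trusting -O3 avoids the question entirely.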
Relocations
Thus far, we’ve examined the "copy" part of copy-and-patch. It is now time to focus on the "patch" part instead.
A relocation is a bit of information that clang leaves for the dynamic linker when referencing an external symbol, so that when the program is run and the executable and its various libraries are loaded into random addresses in memory, the dynamic linker can patch the executable with the correct addresses of all of the symbols it needs. In copy-and-patch, we abuse this by referencing an external symbol every time that we want a hole to be inserted into the stencil, and then looking at the relocation information generated after compilation to know what offsets to patch within the generated code to fill the hole at JIT compile time.
We lean heavily on the medium machine code model, which sets the expectation that code can be referenced within +-2GB (32-bit values), and that large data needs to be referenced by full 64-bit values. Others have covered the topics of machine code models and relocations before, so please see Understanding the x64 code models or Relocation Overflow and Code Models for background on this topic. The official AMD64 ABI documentation is atypically clear and useful as well. Small views both code and data as 32-bit values, large views both as 64-bit values, and so using medium means we’re able to generate holes of either 32 bits or 64 bits depending on whether we reference code or data.
I’ve summarized everything to be aware of within the realm of making holes into one program:
#include <stdint.h>
extern uint8_t cnp_small_data_array[8];
extern uint8_t cnp_large_data_array[1000000];
extern void cnp_function_near(uint32_t, uint64_t);
extern uint8_t cnp_function_far[1000000];
void stencil_example(void) {
uint32_t small = (uint32_t)((uintptr_t)&cnp_small_data_array);
uint64_t large = (uint64_t)((uintptr_t)&cnp_large_data_array);
typedef void(*fn_ptr_t)(uint32_t, uint64_t);
fn_ptr_t near_ptr = &cnp_function_near;
near_ptr(small, large);
uint64_t largefn = (uint64_t)((uintptr_t)&cnp_function_far);
asm volatile("" : "+r" (largefn) : : "memory");
fn_ptr_t far_ptr = (fn_ptr_t)largefn;
far_ptr(small, largefn);
}
The key part, which I cannot emphasize enough, is that we completely and utterly ignore the actual data referred to by the symbol. We always take the address of the symbol, and cast it to what we need. Hence, the use of some macros above to make this friendlier.
We compile this with clang -O3 -mcmodel=medium -c example.c, and view the generated code and relocations with objdump -d -Mintel,x86-64 --disassemble --reloc example.o as usual. Note that -mcmodel=medium has to be passed explicitly, as the default code model on x86_64 is small:
0000000000000000 <stencil_example>:
0: 50 push rax
1: 48 be 00 00 00 00 00 movabs rsi,0x0
8: 00 00 00
3: R_X86_64_64 cnp_large_data_array
b: bf 00 00 00 00 mov edi,0x0
c: R_X86_64_32 cnp_small_data_array
10: e8 00 00 00 00 call 15 <stencil_example+0x15>
11: R_X86_64_PLT32 cnp_function_near-0x4
15: 48 be 00 00 00 00 00 movabs rsi,0x0
1c: 00 00 00
17: R_X86_64_64 cnp_function_far
1f: bf 00 00 00 00 mov edi,0x0
20: R_X86_64_32 cnp_small_data_array
24: 58 pop rax
25: ff e6 jmp rsi
When referring to a small piece of data, we’ll get a 32-bit hole. You can see this with the relocation for cnp_small_data_array being an R_X86_64_32. Referring to a large piece of data instead gets us a 64-bit hole. cnp_large_data_array was assigned R_X86_64_64, and clearly there are more 00 bytes to fill in. -mlarge-data-threshold=threshold controls how large an array must be before it is considered "large data" and gets the 64-bit addressing treatment, but it’s safe to just declare a needlessly large extern array, as the array won’t exist anyway.
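To make the two hole sizes concrete, here is a small sketch of how a JIT might fill them. The function names patch_abs32 and patch_abs64 are hypothetical. The one behavior worth calling out is that R_X86_64_32 is a zero-extended 32-bit field, so a value that does not fit in 32 bits cannot be stored there and the sketch refuses it rather than truncating.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* R_X86_64_64: store the full 64-bit value, little-endian. */
void patch_abs64(uint8_t *code, size_t offset, uint64_t value) {
    assert(code != NULL);
    for (int i = 0; i < 8; i++) {
        code[offset + i] = (uint8_t)(value >> (8 * i));
    }
}

/* R_X86_64_32: store a 32-bit value that the CPU will zero-extend.
 * Returns false, leaving the code untouched, if value does not fit. */
bool patch_abs32(uint8_t *code, size_t offset, uint64_t value) {
    assert(code != NULL);
    if (value > UINT32_MAX) {
        return false;
    }
    for (int i = 0; i < 4; i++) {
        code[offset + i] = (uint8_t)(value >> (8 * i));
    }
    return true;
}
```

With the listing above, the movabs hole for cnp_large_data_array would be patched with patch_abs64 at offset 0x3, and the mov edi hole for cnp_small_data_array with patch_abs32 at offset 0xc.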
When calling a function, the function is expected to be within +-2GB according to the code model, so the invocation of cnp_function_near becomes a 32-bit hole of R_X86_64_PLT32. When patching references between stencils, it will be important to track the exact offsets of the source jmp/call and the destination, as the offset is relative. If you wish to call back into a function that’s a part of the JIT compiler runtime, that function likely won’t be within +-2GB. We need to be able to emit a call/jmp to the full 64-bit address. It turns out that this is incredibly difficult to do:
void stencil_example(void) {
typedef void(*fn_ptr_t)(uint64_t);
fn_ptr_t direct_assign = (fn_ptr_t)((uintptr_t)&cnp_function_far);
direct_assign(0);
uint64_t far_as_int = (uint64_t)((uintptr_t)&cnp_function_far);
fn_ptr_t indirect_assign = (fn_ptr_t)far_as_int;
indirect_assign(far_as_int);
uint64_t far_forgettable = (uint64_t)((uintptr_t)&cnp_function_far);
// Abuse an empty asm volatile to make clang unable to understand
// where the value came from.
asm volatile("" : "+r" (far_forgettable) : : "memory");
fn_ptr_t forgotten = (fn_ptr_t)far_forgettable;
forgotten(far_forgettable);
}
0000000000000000 <stencil_example>:
0: 53 push rbx
1: 31 ff xor edi,edi
3: e8 00 00 00 00 call 8 <stencil_example+0x8>
4: R_X86_64_PLT32 cnp_function_far-0x4
8: 48 bb 00 00 00 00 00 movabs rbx,0x0
f: 00 00 00
a: R_X86_64_64 cnp_function_far
12: 48 89 df mov rdi,rbx
15: e8 00 00 00 00 call 1a <stencil_example+0x1a>
16: R_X86_64_PLT32 cnp_function_far-0x4
1a: 48 89 df mov rdi,rbx
1d: 5b pop rbx
1e: ff e7 jmp rdi
Here we see two 32-bit relocations (R_X86_64_PLT32) and one 64-bit relocation (R_X86_64_64). The 32-bit relocations appear because clang sees that we turned an external symbol into a function pointer. Code must be within +-2GB according to the code model, so 32 bits is fine. Clang is also smart enough to track this through an assignment to a variable: although it loads the full 64-bit address into a register as the argument, it still emits a 32-bit relocation for the actual call, because it knows that the address came from a symbol. The only way I found to make clang "forget" the source of the function pointer value was to run it through an empty asm volatile, so that clang can no longer make any assumptions about it, and then it is finally willing to just jump to the 64-bit value in the register.
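The relative 32-bit holes need a little more care when patching than the absolute ones, because the stored value depends on where the code ends up. A minimal sketch, assuming the usual x86_64 rule that a PC-relative field holds target + addend - place (with the -0x4 from the listings as the addend); the name patch_pc32 and the explicit code_base parameter are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Fill a 32-bit PC-relative hole such as R_X86_64_PLT32.
 *   code      - buffer holding the copied stencil
 *   offset    - offset of the 4-byte hole within code
 *   code_base - address at which code[0] will execute
 *   target    - address the jmp/call should reach
 *   addend    - relocation addend (-4 in the listings above)
 * Returns false, leaving the code untouched, if the target is out of
 * the +-2GB range and a 64-bit indirect jump is needed instead. */
bool patch_pc32(uint8_t *code, size_t offset, uint64_t code_base,
                uint64_t target, int64_t addend) {
    assert(code != NULL);
    uint64_t place = code_base + offset;
    int64_t value = (int64_t)(target - place) + addend;
    if (value < INT32_MIN || value > INT32_MAX) {
        return false;
    }
    uint32_t bits = (uint32_t)value; /* two's complement encoding */
    for (int i = 0; i < 4; i++) {
        code[offset + i] = (uint8_t)(bits >> (8 * i));
    }
    return true;
}
```

When the target is the byte immediately after the jmp, the stored offset comes out as zero, which matches the unpatched jmp 19 in the swap_and_multiply listing and is why that jump can simply be dropped when stencils are concatenated.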