How It Works

Stencil Creation
Calling Convention
Tail Call
Relocations

From the tutorial, you’ve followed a set of given rules to produce stencils and relocation holes, with no explanation given as to why the rules are the way they are. We’re now going to dive into that reasoning, and look at exactly how each part of the guidance on clang flags and relocation hole macros came to be.

Stencil Creation

All of the idioms around creating stencils are about abusing features of clang as much as possible to be able to generate functions as only the specific sequences of instructions we want. There’s number of tricks involved:

First, we rely on the calling convention to be able to force values into known registers. Our goal is to be able to form programs by concatenating stencils, and so we must be able to match the outputs of one stencil to the inputs of another. By making the stencil inputs be the function arguments, and ending each function with a (tail)call to another function, we can rely on the calling convention to place input and output values into consistent registers. This ending call can be easily identified and trimmed off from the stencil. As a minor optimization we rely on specifically the GHC / preserve_none calling convention, which tries to pass as many arguments in registers as possible. This maximizes our ability to keep values in registers, and minimizes the chance that the compiler will try to generate a stack frame as we won’t be pushing arguments to a stack.

Second, we rely on compiler optimizations to elide the stack frame prologue/epilogue and to turn the ending call into a tailcall. Setting up and tearing down a stack frame is a notable overhead on the small stencil functions, and means each setup must have a paired teardown. Ending the stencil with a tailcall is what allows us to trivially elide the jump instruction and fall through into the next concatenated stencil, as well as helping to ensure that any stack operations have been undone before the jump.

Third, we extensively abuse dynamic relocations to allow stencils to declare holes for values to be filled in at JIT compile time, and when compiling the stencil the C compiler will tell us how/where to patch in constants or addresses into the code that it generated. If we wish to be able to patch in an integer constant, we can declare an extern int some_constant, and then cast the address of that variable to an int. By paying attention to the name of the extern symbol being referenced, we can more intelligently disambiguate its intended use and treat certain references specially. The machine code model has a significant impact on the relocations generated, and we’ll discuss that more later.

To show how these all fit together, let us consider a stencil which swaps its two arguments between its input and output, and multiplies them by a patchable constant:

#include <stdint.h>
extern void hole_fn(void) __attribute__((preserve_none));
extern int hole_for_int;

__attribute__((preserve_none))
void swap_and_multiply(int a, int b) {
  const int hole_value = (int)((uintptr_t)&hole_for_int);
  int c = a * hole_value;
  a = b * hole_value;
  b = c;

  typedef void(*outfn_type)(int, int) __attribute__((preserve_none));
  outfn_type stencil_output = (outfn_type)&hole_fn;
  stencil_output(a, b);
}

We compile this with clang -mcmodel=medium -O3 -c swap_and_multiply.c, and examine the generated code with objdump -d -Mintel,x86-64 --disassemble --reloc swap_and_multiply.o:

0000000000000000 <swap_and_multiply>:
   0:	44 89 e0             	mov    eax,r12d
   3:	41 bc 00 00 00 00    	mov    r12d,0x0
			5: R_X86_64_32	hole_for_int
   9:	41 0f af c4          	imul   eax,r12d
   d:	45 0f af e5          	imul   r12d,r13d
  11:	41 89 c5             	mov    r13d,eax
  14:	e9 00 00 00 00       	jmp    19 <swap_and_multiply+0x19>
			15: R_X86_64_PLT32	hole_fn-0x4

And thus we have achieved our exact goals in stencil creation. The function body is only our targeted set of instructions. There’s no stack frame setup or teardown. The relocation information tells us exactly how and where to patch in our integer constant at JIT compile time. And the use of a unique symbol hole_fn means the tail call jump is easy to identify and strip off from the generated code, as we end up with a unique pointer to it.

Now, let’s unwind each of the techniques involved here to illustrate their individual impact on the generated code.

Calling Convention

The standard x86_64 calling convention places the first six arguments into registers (in order: rdi, rsi, rdx, rcx, r8, r9), and then the rest go on the stack. There’s a very nice overview of the standard (cdecl) calling convention for x86_64 in The 64 bit x86 C Calling Convention. However, the guidance for copy-and-patch stencils is to instead opt in to the preserve_none calling convention. Clang/LLVM only supports preserve_none on x86_64 and AArch64, and GCC doesn’t support it at all (but support is being worked on).

We can look at the difference between cdecl and preserve_none by building a small stencil which just swaps the order of its inputs:

cdecl calling convention preserve_none calling convention

cdecl calling convention	preserve_none calling convention
`#include <stdint.h> extern void hole_fn(void) __attribute__((cdecl)); __attribute__((cdecl)) void swap_ints(int a, int b) { typedef void(*outfn_type)(int, int) __attribute__((cdecl)); outfn_type stencil_output = (outfn_type)&hole_fn; stencil_output(b, a); }`	`#include <stdint.h> extern void hole_fn(void) __attribute__((preserve_none)); __attribute__((preserve_none)) void swap_ints(int a, int b) { typedef void(*outfn_type)(int, int) __attribute__((preserve_none)); outfn_type stencil_output = (outfn_type)&hole_fn; stencil_output(b, a); }`
`; <swap_ints>: mov eax,edi mov edi,esi mov esi,eax jmp b <swap_ints+0xb> ;; R_X86_64_PLT32 hole_fn-0x4`	`; <swap_and_multiply>: mov eax,r12d mov r12d,r13d mov r13d,eax jmp e <swap_and_multiply+0xe> ;; R_X86_64_PLT32 hole_fn-0x4`

#include <stdint.h>
extern void hole_fn(void)
  __attribute__((cdecl));

__attribute__((cdecl))
void swap_ints(int a, int b) {
  typedef void(*outfn_type)(int, int)
    __attribute__((cdecl));
  outfn_type stencil_output =
    (outfn_type)&hole_fn;
  stencil_output(b, a);
}

#include <stdint.h>
extern void hole_fn(void)
  __attribute__((preserve_none));

__attribute__((preserve_none))
void swap_ints(int a, int b) {
  typedef void(*outfn_type)(int, int)
    __attribute__((preserve_none));
  outfn_type stencil_output =
    (outfn_type)&hole_fn;
  stencil_output(b, a);
}

; <swap_ints>:
mov    eax,edi
mov    edi,esi
mov    esi,eax
jmp    b <swap_ints+0xb>
;; R_X86_64_PLT32	hole_fn-0x4

; <swap_and_multiply>:
mov    eax,r12d
mov    r12d,r13d
mov    r13d,eax
jmp    e <swap_and_multiply+0xe>
;; R_X86_64_PLT32	hole_fn-0x4

Which is… not really all that different. preserve_none is useful though as the number of arguments go up. As mentioned above, x86_64 provides six registers for arguments, so we can better illustrate the difference by extending swap_ints to 8 parameters:

#include <stdint.h>
extern void hole_fn(void)
  __attribute__((CALLING_CONVENTION));

__attribute__((CALLING_CONVENTION))
void swap_ints(int a, int b, int c, int d, int e, int f, int g, int h) {
  typedef void(*outfn_type)(int, int, int, int,
                            int, int, int, int)
  __attribute__((CALLING_CONVENTION));
  outfn_type stencil_output = (outfn_type)&hole_fn;
  stencil_output(h, g, f, e, d, c, b, a);
}

// clang -DCALLING_CONVENTION=cdecl -O3 -c
// clang -DCALLING_CONVENTION=preserve_none -O3 -c

cdecl calling convention preserve_none calling convention

cdecl calling convention	preserve_none calling convention
`; <swap_ints>: push rbx mov eax,ecx mov r10d,edx mov r11d,esi mov ebx,edi mov edi,DWORD PTR [rsp+0x18] mov esi,DWORD PTR [rsp+0x10] mov edx,r9d mov ecx,r8d mov r8d,eax mov r9d,r10d push rbx push r11 call 27 <swap_ints+0x27> ;; R_X86_64_PLT32 hole_fn-0x4 add rsp,0x10 pop rbx ret`	`; <swap_ints>: mov eax,r15d mov ebx,r14d mov r8d,r13d mov r9d,r12d mov r12d,ecx mov r13d,edx mov r14d,esi mov r15d,edi mov edi,eax mov esi,ebx mov edx,r8d mov ecx,r9d jmp 27 <swap_ints+0x27> ;; R_X86_64_PLT32 hole_fn-0x4`

; <swap_ints>:
push   rbx
mov    eax,ecx
mov    r10d,edx
mov    r11d,esi
mov    ebx,edi
mov    edi,DWORD PTR [rsp+0x18]
mov    esi,DWORD PTR [rsp+0x10]
mov    edx,r9d
mov    ecx,r8d
mov    r8d,eax
mov    r9d,r10d
push   rbx
push   r11
call   27 <swap_ints+0x27>
;; R_X86_64_PLT32	hole_fn-0x4
add    rsp,0x10
pop    rbx
ret

; <swap_ints>:
mov    eax,r15d
mov    ebx,r14d
mov    r8d,r13d
mov    r9d,r12d
mov    r12d,ecx
mov    r13d,edx
mov    r14d,esi
mov    r15d,edi
mov    edi,eax
mov    esi,ebx
mov    edx,r8d
mov    ecx,r9d
jmp    27 <swap_ints+0x27>
;; R_X86_64_PLT32	hole_fn-0x4

So it’s helpful for when it matters. It moves us from being able to only define stencils with 6 inputs and outputs to stencils that have 12 inputs and outputs, after which preserve_none also runs out of registers and has to start setting up a stack frame. However, there’s multiple categories of registers. Floating point values and SSE operations use xmm registers, AVX uses ymm registers, and AVX-512 uses zmm registers. The calling convention also controls how these operate:

floating point SIMD

	floating point	SIMD
	`STENCIL_FUNCTION void float_passthrough(float a) { DECLARE_STENCIL_OUTPUT(float); return stencil_output(a); }`	`#include <immintrin.h> STENCIL_FUNCTION void simd_passthrough(__m512 a) { DECLARE_STENCIL_OUTPUT(__m512); return stencil_output(a); }`
cdecl	`; <float_passthrough>: push r15 push r14 push r13 push r12 push rbx call 10e <float_passthrough+0xe> ;; R_X86_64_PLT32 pop rbx pop r12 pop r13 pop r14 pop r15 ret`	`; <simd_passthrough>: push r15 push r14 push r13 push r12 push rbx call 12e <simd_passthrough+0xe> ;; R_X86_64_PLT32 pop rbx pop r12 pop r13 pop r14 pop r15 vzeroupper ret`
preserve none	`; <float_passthrough>: jmp 105 <float_passthrough+0x5> ;; R_X86_64_PLT32`	`; <simd_passthrough>: jmp 115 <simd_passthrough+0x5> ;; R_X86_64_PLT32`

STENCIL_FUNCTION
void float_passthrough(float a) {
  DECLARE_STENCIL_OUTPUT(float);
  return stencil_output(a);
}

#include <immintrin.h>
STENCIL_FUNCTION
void simd_passthrough(__m512 a) {
  DECLARE_STENCIL_OUTPUT(__m512);
  return stencil_output(a);
}

cdecl

; <float_passthrough>:
push   r15
push   r14
push   r13
push   r12
push   rbx
call   10e <float_passthrough+0xe>
;; R_X86_64_PLT32
pop    rbx
pop    r12
pop    r13
pop    r14
pop    r15
ret

; <simd_passthrough>:
push   r15
push   r14
push   r13
push   r12
push   rbx
call   12e <simd_passthrough+0xe>
;; R_X86_64_PLT32
pop    rbx
pop    r12
pop    r13
pop    r14
pop    r15
vzeroupper
ret

preserve none

; <float_passthrough>:
jmp    105 <float_passthrough+0x5>
;; R_X86_64_PLT32

; <simd_passthrough>:
jmp    115 <simd_passthrough+0x5>
;; R_X86_64_PLT32

Any number of floating point or simd registers cause a stack frame to get emitted on cdecl, and thus if you’re trying to use them in stencils, you’ll have to use preserve_none. You’ll then be limited to 8 function arguments / registers before it will start passing arguments on the stack.

For SIMD specifically, note that one can use attributetarget("arch") to be able to generate code for different SIMD feature sets, and detect which one to select as the code for the stencil at runtime:

__attribute__((preserve_none,target("avx")))
void fused_multiply_add_avx(__m512 a, __m512 b, __m512 c) {
  DECLARE_STENCIL_OUTPUT(__m512);
  return stencil_output(a * b + c);
}

__attribute__((preserve_none,target("no-avx")))
void fused_multiply_add_sse2(__m512 a, __m512 b, __m512 c) {
  DECLARE_STENCIL_OUTPUT(__m512);
  return stencil_output(a * b + c);
}

0000000000000100 <fused_multiply_add_avx>:
 100:	62 f2 75 48 a8 c2    	vfmadd213ps zmm0,zmm1,zmm2
 106:	e9 00 00 00 00       	jmp    10b <fused_multiply_add_avx+0xb>
			107: R_X86_64_PLT32	cnp_stencil_output-0x4
 10b:	0f 1f 44 00 00       	nop    DWORD PTR [rax+rax*1+0x0]

0000000000000110 <fused_multiply_add_sse2>:
 110:	0f 59 c4             	mulps  xmm0,xmm4
 113:	0f 58 44 24 08       	addps  xmm0,XMMWORD PTR [rsp+0x8]
 118:	0f 59 cd             	mulps  xmm1,xmm5
 11b:	0f 58 4c 24 18       	addps  xmm1,XMMWORD PTR [rsp+0x18]
 120:	0f 59 d6             	mulps  xmm2,xmm6
 123:	0f 58 54 24 28       	addps  xmm2,XMMWORD PTR [rsp+0x28]
 128:	0f 59 df             	mulps  xmm3,xmm7
 12b:	0f 58 5c 24 38       	addps  xmm3,XMMWORD PTR [rsp+0x38]
 130:	e9 00 00 00 00       	jmp    135 <fused_multiply_add_sse2+0x25>
			131: R_X86_64_PLT32	cnp_stencil_output-0x4

Tail Call

As was mentioned, we rely on clang’s optimization primary for converting the stencil_output call to a tailcall. It also happens to be necessary for eliding the stack frame prologue and epilogue when it’s not necessary. Going back to our swap_and_multiply example:

#include <stdint.h>
extern void hole_fn(void) __attribute__((preserve_none));
extern int hole_for_int;

__attribute__((preserve_none))
void swap_and_multiply(int a, int b) {
  const int hole_value = (int)((uintptr_t)&hole_for_int);
  int c = a * hole_value;
  a = b * hole_value;
  b = c;

  typedef void(*outfn_type)(int, int) __attribute__((preserve_none));
  outfn_type stencil_output = (outfn_type)&hole_fn;
  stencil_output(a, b);
}

We can look at the resulting code without optimizations (-O0) and with optimizations (-O3):

clang -O0 clang -O3

clang -O0	clang -O3
; <swap_and_multiply>: push rbp (1) mov rbp,rsp sub rsp,0x20 mov DWORD PTR [rbp-0x4],r12d mov DWORD PTR [rbp-0x8],r13d mov eax,0x0 ;; R_X86_64_32 hole_for_int mov DWORD PTR [rbp-0xc],eax mov eax,DWORD PTR [rbp-0x4] mov ecx,DWORD PTR [rbp-0xc] imul eax,ecx mov DWORD PTR [rbp-0x10],eax mov eax,DWORD PTR [rbp-0x8] mov ecx,DWORD PTR [rbp-0xc] imul eax,ecx mov DWORD PTR [rbp-0x4],eax mov eax,DWORD PTR [rbp-0x10] mov DWORD PTR [rbp-0x8],eax mov QWORD PTR [rbp-0x18],0x0 ;; R_X86_64_32S hole_fn mov rax,QWORD PTR [rbp-0x18] mov r12d,DWORD PTR [rbp-0x4] mov r13d,DWORD PTR [rbp-0x8] call rax (3) add rsp,0x20 pop rbp (2) ret	`; <swap_and_multiply>: mov eax,r12d mov r12d,0x0 ;; R_X86_64_32 hole_for_int imul eax,r12d imul r12d,r13d mov r13d,eax jmp 19 <swap_and_multiply+0x19> (3) ;; R_X86_64_PLT32 hole_fn-0x4`

; <swap_and_multiply>:
push   rbp (1)
mov    rbp,rsp
sub    rsp,0x20
mov    DWORD PTR [rbp-0x4],r12d
mov    DWORD PTR [rbp-0x8],r13d
mov    eax,0x0
;; R_X86_64_32	hole_for_int
mov    DWORD PTR [rbp-0xc],eax
mov    eax,DWORD PTR [rbp-0x4]
mov    ecx,DWORD PTR [rbp-0xc]
imul   eax,ecx
mov    DWORD PTR [rbp-0x10],eax
mov    eax,DWORD PTR [rbp-0x8]
mov    ecx,DWORD PTR [rbp-0xc]
imul   eax,ecx
mov    DWORD PTR [rbp-0x4],eax
mov    eax,DWORD PTR [rbp-0x10]
mov    DWORD PTR [rbp-0x8],eax
mov    QWORD PTR [rbp-0x18],0x0
;; R_X86_64_32S	hole_fn
mov    rax,QWORD PTR [rbp-0x18]
mov    r12d,DWORD PTR [rbp-0x4]
mov    r13d,DWORD PTR [rbp-0x8]
call   rax (3)
add    rsp,0x20
pop    rbp (2)
ret

; <swap_and_multiply>:
mov    eax,r12d
mov    r12d,0x0
;; R_X86_64_32	hole_for_int
imul   eax,r12d
imul   r12d,r13d
mov    r13d,eax
jmp    19 <swap_and_multiply+0x19> (3)
;; R_X86_64_PLT32	hole_fn-0x4

So, clang is obviously doing great work for us. and are the stack frame setup and teardown in the unoptimized version, and they’ve been elided in the optimized version. The call at has been replaced with a tailcall jmp at .

I’m not aware of a more specific way to request clang to emit the stack frame when it’s not necessary. -fomit-frame-pointer -momit-leaf-frame-pointer causes clang to drop the push rbp/pop rbp, but the sub rsp,0x20 and add rsp,0x20 remain as the unoptimized code relies on the stack for local variables. Maybe running only mem2reg would then suffice, but the whole point here is to get all of LLVM’s optimizations for "free" within a stencil anyway.

Clang does support the musttail attribute to force tailcall generation. However, it requires that the input and output types match, which doesn’t fit our needs for stencil creation.

extern void hole_fn(void) __attribute__((preserve_none));

__attribute__((preserve_none))
void add_two_ints(int a, int b) {
  typedef void(*outfn_type)(int) __attribute__((preserve_none));
  outfn_type stencil_output = (outfn_type)&hole_fn;
  // Force the tailcall, via an attribute on the return statement.
  __attribute__((musttail)) return stencil_output(a + b);
}

$ clang -O3 -c example.c
example.c:12:29: error: cannot perform a tail call to function 'stencil_output'
because its signature is incompatible with the calling function
   12 |   __attribute__((musttail)) return stencil_output(a + b);
      |                             ^
example.c:11:3: note: target function has different number of parameters
(expected 2 but has 1)
   11 |   outfn_type stencil_output = (outfn_type)&hole_fn;
      |   ^
example.c:12:18: note: tail call required by 'musttail' attribute here
   12 |   __attribute__((musttail)) return stencil_output(a + b);
      |                  ^

So, unless that changes in the future, we have to rely on -O3 magically doing the right thing.

Relocations

This far, we’ve examined the "copy" part of copy-and-patch. It is now time to focus on the "patch" part instead.

A relocation is a bit of information that clang leaves for the dynamic linker when referencing an external symbol, so that when the program is run and the executable and its various libraries are loaded into random addresses in memory, the dynamic linker can patch the executable with the correct addresses of all of the symbols it needs. In copy-and-patch, we abuse this by referencing an external symbol every time that we want a hole to be inserted into the stencil, and then looking at the relocation information generated after compilation to know what offsets to patch within the generated code to fill the hole at JIT compile time.

We lean heavily on the medium machine code model, which sets the expectation that code can be referenced within +-2GB (32-bit values), and large data needs to be referenced by full 64-bit values. Others have covered the topics of machine code models and relocations before, so please see Understanding the x64 code models or Relocation Overflow and Code Models for background on this topic. The official AMD64 ABI documentation is atypically clear and useful as well. Small views both code and data as 32-bit values, large views both as 64-bit values, and so using medium means we’re able to generate holes of either 32-bit or 64-bit depending on if we reference code or data.

I’ve summarized everything to be aware of within the realm of making holes into one program:

#include <stdint.h>

extern uint8_t cnp_small_data_array[8];
extern uint8_t cnp_large_data_array[1000000];
extern void cnp_function_near(uint32_t, uint64_t);
extern uint8_t cnp_function_far[1000000];

void stencil_example(void) {
  uint32_t small = (uint32_t)((uintptr_t)&cnp_small_data_array);
  uint64_t large = (uint64_t)((uintptr_t)&cnp_large_data_array);
  typedef void(*fn_ptr_t)(uint32_t, uint64_t);
  fn_ptr_t near_ptr = &cnp_function_near;
  near_ptr(small, large);

  uint64_t largefn = (uint64_t)((uintptr_t)&cnp_function_far);
  asm volatile("" : "+r" (largefn) : : "memory");
  fn_ptr_t far_ptr = (fn_ptr_t)largefn;
  far_ptr(small, largefn);
}

The key part, which I cannot emphasize enough, is that we completely and utterly ignore the actual data referred to by the symbol. We always take the address of the symbol, and cast it to what we need. Hence, the use of some macros above to make this friendlier.

We compile this with clang -O3 -mcmodel=medium -c example.c, though -mcmodel=medium is the default anyway, and view the generated code and relocations with objdump -d -Mintel,x86-64 --disassemble --reloc example.o as usual:

0000000000000000 <stencil_example>:
   0:	50                   	push   rax
   1:	48 be 00 00 00 00 00 	movabs rsi,0x0
   8:	00 00 00
			3: R_X86_64_64	cnp_large_data_array
   b:	bf 00 00 00 00       	mov    edi,0x0
			c: R_X86_64_32	cnp_small_data_array
  10:	e8 00 00 00 00       	call   15 <stencil_example+0x15>
			11: R_X86_64_PLT32	cnp_function_near-0x4
  15:	48 be 00 00 00 00 00 	movabs rsi,0x0
  1c:	00 00 00
			17: R_X86_64_64	cnp_function_far
  1f:	bf 00 00 00 00       	mov    edi,0x0
			20: R_X86_64_32	cnp_small_data_array
  24:	58                   	pop    rax
  25:	ff e6                	jmp    rsi

When referring to a small piece of data, we’ll get a 32-bit hole. You can see this with the relocation for cnp_small_data_array being a R_X86_64_32. Referring to a large piece of data instead gets us a 64-bit hole. cnp_large_data_array was assigned R_X86_64_64, and clearly there are more 00 bytes to fill in.-mlarge-data-threshold=threshold controls the exact line between how large an array must be for it to be considered "large data" and get 64-bit addressing treatment, but it’s safe to just declare a needlessly large extern array as the array won’t exist anyway.

When calling a function, the function is expected to be within +-2GB according to the code model, so the invocation of cnp_function_near becomes a 32-bit hole of R_X86_64_PLT32. When patching references between stencils, it will be important to track the exact offsets of the source jmp/call and the destination, as the offset is relative. If you wish to call back into a function that’s a part of the JIT compiler runtime, that function won’t likely be within +-2GB. We need to be able to emit a call/jmp to the full 64-bit address. It turns out that this is incredibly difficult to do:

void stencil_example(void) {
  typedef void(*fn_ptr_t)(uint64_t);
  fn_ptr_t direct_assign = (fn_ptr_t)((uintptr_t)&cnp_function_far);
  direct_assign(0);

  uint64_t far_as_int = (uint64_t)((uintptr_t)&cnp_function_far);
  fn_ptr_t indirect_assign = (fn_ptr_t)far_as_int;
  indirect_assign(far_as_int);

  uint64_t far_forgettable = (uint64_t)((uintptr_t)&cnp_function_far);
  // Abuse an empty asm volatile to make clang unable to understand
  // where the value came from.
  asm volatile("" : "+r" (far_forgettable) : : "memory");
  fn_ptr_t forgotten = (fn_ptr_t)far_forgettable;
  forgotten(far_forgettable);
}

0000000000000000 <stencil_example>:
   0:	53                   	push   rbx
   1:	31 ff                	xor    edi,edi
   3:	e8 00 00 00 00       	call   8 <stencil_example+0x8>
			4: R_X86_64_PLT32	cnp_function_far-0x4
   8:	48 bb 00 00 00 00 00 	movabs rbx,0x0
   f:	00 00 00
			a: R_X86_64_64	cnp_function_far
  12:	48 89 df             	mov    rdi,rbx
  15:	e8 00 00 00 00       	call   1a <stencil_example+0x1a>
			16: R_X86_64_PLT32	cnp_function_far-0x4
  1a:	48 89 df             	mov    rdi,rbx
  1d:	5b                   	pop    rbx
  1e:	ff e7                	jmp    rdi

Of which we see that there’s two 32-bit relocations (R_X86_64_PLT32) and one 64-bit one (R_X86_64_64). There’s 32-bit relocations because clang sees that we turned an external symbol into a function pointer. Code must be within +-2GB according to the code model, so 32 bits is fine. Clang is also then smart enough to track this through an assignment to a variable, and although it loads the full 64-bit address into a register as the argument, it then emits a 32-bit relocation for the actual call, because it still knows that the address came from a symbol definition. The only way I found to make clang "forget" the source of function pointer value was to run it through an empty asm volatile so that clang thinks no assumptions are valid anymore, and then it finally is willing to just jump to the 64-bit value in the register.

↤ A Tutorial ↑ Copy-and-Patch ↑