All Along the Calltower • Scott's Ramblings

In my last post, we explored the native stack and how it manages function calls and local variables in compiled languages. I elided a bunch of other detail with some hand-wavey “here be dragons!” talk. In this post we’re going to dive into one of those details - now that we’ve got the stack and a way to jump into and return from functions, how do we pass their arguments and return values?

Functions can take all sorts of things as arguments - big, and small - as well as essentially an arbitrary number of them. So - where do we put them?

Let’s dive in!

Calling Conventions?

The very short answer is we follow a calling convention - a description of how we can use CPU registers and the stack to encode function arguments. When we share compiled code between libraries and languages we need them all to be callable in the same way so that the code plays together nicely.

We’re going to be looking at AAPCS64 (the ARM 64-bit calling convention), if only because I have a Mac handy. If we ignore Windows ¹, the other popular ABI these days is System V AMD64. In practice the two are very similar, and for this post I want to give you a feel for how they work in general, so the differences aren’t so important!

Calling Conventions!

No arguments

Let’s start at the absolute simplest case: a function that takes no arguments at all.

To the side is the disassembly so we can dive into the calling convention. Don’t be horrified: it’s straightforward once you know what you’re looking for. If you’re playing along at home, you can use objdump -d to produce this from a binary yourself. Note that this isn’t the whole disassembly, just the bits that are involved in the call.

// ...
int z = no_args();

; Call no_args()
; bl — "branch with link" — jumps to the target function
; and saves the address of the next instruction (the return point)
; into the link register (x30 / lr).
bl      0x1000004b0 <_no_args>

; no_args runs

; Store the return value from w0 into the stack slot reserved
; for the local variable 'z' in main's frame.
str     w0, [sp, #0x94]

Now, let’s have a look at no_args itself - how’s it returning that value?

int no_args() {
    return 0;
}

; Place the literal constant 0 into w0.
; w0 is the 32-bit view of x0 — the designated return register.
; The constant 0 here is an *immediate*, meaning it's encoded directly
; in the instruction rather than loaded from memory.
mov     w0, #0x0

; Return from the function.
; ret jumps to the address stored in x30,
; which was set by the caller's `bl` instruction.
ret

There’s a bit to take in here, but it establishes the baseline for our calling convention:

A register is used to store where to return control to when a function completes (x30)
We physically call the function by jumping execution to the function’s address in the application’s memory
A register is used to store the return value from a function (x0)
On AArch64, x0-x30 are the 64-bit general-purpose registers, and w0-w30 are just their lower 32-bit halves; we often see both mixed into the same assembly
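
The relationship between the x and w names can be modelled in a few lines of C. This is only a sketch: XReg, read_w and write_w are names I've made up for illustration, and the zeroing of the upper half on a 32-bit write is AArch64 behaviour that the listing above doesn't show directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of one general-purpose register (hypothetical type). */
typedef struct {
    uint64_t bits; /* the full 64-bit "x" view */
} XReg;

/* Reading the "w" view yields the low 32 bits of the register. */
uint32_t read_w(const XReg *reg) {
    assert(reg != NULL);
    return (uint32_t)(reg->bits & 0xFFFFFFFFu);
}

/* Writing the "w" view sets the low 32 bits; on AArch64 the upper
 * 32 bits of the register are cleared by such a write. */
void write_w(XReg *reg, uint32_t value) {
    assert(reg != NULL);
    reg->bits = (uint64_t)value;
}
```

So when no_args does mov w0, #0x0, the caller can safely read the result from w0 without caring what was in the top half of x0 beforehand.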

That’s it - we can see how we move control between functions and how we return a value. So - what about passing arguments?

Simple arguments

Let’s start with simple arguments:

To the side are the relevant bits of the disassembly; note that as we go along, I’m omitting bits that we’ve covered above and focussing on what is new:

// ...
int mixed = simple_func(1, 2, 3.0, 4.0);

; Store arguments into registers per the AAPCS64 calling convention.
; Integer arguments go into x0, x1, x2, up to x7.
; Floating-point arguments go into d0, d1, d2, up to d7.

; Store first integer (1) argument into x0 register.
mov     x0, #0x1

; Store second integer (2) argument into x1 register.
mov     x1, #0x2

; Store first floating-point (3.0) argument into d0 register.
fmov    d0, #3.00000000

; Store second floating-point (4.0) argument into d1 register.
fmov    d1, #4.00000000

; Call simple_func(a, b, x, y)
bl      0x1000004b0 <_simple_func>

; simple_func runs

; Store the 32-bit return value from w0 into the stack slot
; reserved for the local variable 'mixed'.
str     w0, [sp, #0x94]

Now, let’s see what simple_func does:

int simple_func(long a, long b, double x, double y) {
    return (int)(a + b + (long)(x + y));
}

; Copy arguments from registers to the stack.
; We've compiled with no optimisations (-O0), so the compiler chooses to
; "materialise" every C variable on the stack: each one gets its own
; stack slot (and therefore an address) rather than living only in
; a register. With optimisations on, most of this would disappear.
;
; This is helpful for debuggers! We can see the value of any argument
; when we break in the function body, even if the register transporting
; it into the call has been reused.

; Make 32 bytes (0x20) of space on the stack to hold the arguments
sub     sp, sp, #0x20

; Then store them to the stack
str     x0, [sp, #0x18]
str     x1, [sp, #0x10]
str     d0, [sp, #0x8]
str     d1, [sp]

; Read arguments back from the stack into registers,
; then add them together.
ldr     x8, [sp, #0x18]
ldr     x9, [sp, #0x10]
add     x8, x8, x9
ldr     d0, [sp, #0x8]
ldr     d1, [sp]
fadd    d0, d0, d1
fcvtzs  x9, d0
add     x8, x8, x9

; Move the result into x0, ready to return
mov     x0, x8

; Restore stack pointer (epilogue)
add     sp, sp, #0x20
ret

A bit more to take in here, but straightforward enough:

Small function arguments go into registers
Different registers are used for integer and floating point values
We have 8 registers of each type
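
Those three rules are enough to write a toy allocator that hands out registers in argument order. It's a simplified sketch, not the full AAPCS64 algorithm: ArgKind, ArgLocation and assign_arg_registers are hypothetical names, and it only knows about plain integer and floating-point arguments.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { ARG_INT, ARG_FP } ArgKind; /* hypothetical classification */

enum { REGS_PER_BANK = 8 }; /* x0-x7 and d0-d7 */

typedef struct {
    char bank; /* 'x' for integer, 'd' for floating point */
    int index; /* 0-7, or -1 if that bank has no register left */
} ArgLocation;

/* Assign each argument the next free register of its own bank.
 * The two banks are counted independently of each other. */
void assign_arg_registers(const ArgKind *kinds, size_t count, ArgLocation *out) {
    assert(kinds != NULL && out != NULL);
    int next_int = 0; /* next free x register */
    int next_fp = 0;  /* next free d register */

    for (size_t i = 0; i < count; i++) {
        int is_fp = (kinds[i] == ARG_FP);
        int *next = is_fp ? &next_fp : &next_int;
        out[i].bank = is_fp ? 'd' : 'x';
        out[i].index = (*next < REGS_PER_BANK) ? (*next)++ : -1;
    }
}
```

Feeding it the kinds for simple_func (integer, integer, floating point, floating point) gives x0, x1, d0, d1, which matches the disassembly above.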

An obvious question is - what happens when we run out of registers?

Lots of arguments

Let’s work it out by looking at how we can call func_ints, a function that takes 9 integer arguments:

// Conspicuously has 9 arguments!
int result = func_ints(1, 2, 3, 4, 5, 6, 7, 8, 9);

; Store the first eight arguments into w0-w7 (the 32-bit views of x0-x7)
mov     w0, #0x1
mov     w1, #0x2
mov     w2, #0x3
mov     w3, #0x4
mov     w4, #0x5
mov     w5, #0x6
mov     w6, #0x7
mov     w7, #0x8

; Load the ninth argument (9) into a temporary register.
; We've run out of argument registers (w0-w7 are all used),
; so we need to pass this argument on the stack instead.
; w8 is just a scratch register - not part of the argument
; passing convention - we're using it temporarily to hold
; the value ...
mov     w8, #0x9

; before ultimately storing it onto the stack!

; The AAPCS64 calling convention requires
; any additional integer arguments to be passed on
; the stack, starting at the current SP.
;
; This conceptually belongs to the frame of the caller,
; with the callee reaching just past its own frame
; (towards higher addresses) to read it!
str     w8, [sp]

; Call func_ints
bl      0x1000004b0 <_func_ints>

; func_ints runs

Now, let’s see what func_ints does:

int func_ints(int a, int b, int c, int d, int e, int f, int g, int h, int i) {
    return a + b + c + d + e + f + g + h + i;
}

; Allocate stack space
sub     sp, sp, #0x30

; Read the ninth argument from the stack (just past our frame)
; Note: it's at [sp, #0x30] because the caller stored it at its
; own [sp], and we have since moved SP down by 0x30
ldr     w8, [sp, #0x30]

; Store all arguments to the stack (materialisation for -O0)
str     w0, [sp, #0x2c]
str     w1, [sp, #0x28]
str     w2, [sp, #0x24]
str     w3, [sp, #0x20]
str     w4, [sp, #0x1c]
str     w5, [sp, #0x18]
str     w6, [sp, #0x14]
str     w7, [sp, #0x10]
; Store w8 (which we loaded from the stack-passed 9th arg above)
; Note: It looks like w8 was passed as a register, but it wasn't!
; We read it from the caller's stack frame into w8 for convenience.
str     w8, [sp, #0xc]

; Load arguments back from stack and add them together
ldr     w8, [sp, #0x2c]
ldr     w9, [sp, #0x28]
add     w8, w8, w9
ldr     w9, [sp, #0x24]
add     w8, w8, w9
ldr     w9, [sp, #0x20]
add     w8, w8, w9
ldr     w9, [sp, #0x1c]
add     w8, w8, w9
ldr     w9, [sp, #0x18]
add     w8, w8, w9
ldr     w9, [sp, #0x14]
add     w8, w8, w9
ldr     w9, [sp, #0x10]
add     w8, w8, w9
ldr     w9, [sp, #0xc]
add     w0, w8, w9

; Restore stack pointer (epilogue)
add     sp, sp, #0x30
ret

So we can see that once we exhaust the registers allocated for passing arguments (w0-w7), we store additional arguments on the stack, in the frame of the caller. This is often described as spilling to the stack.
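
To make the offsets concrete, here's a toy simulation of that handshake. It's a sketch under assumptions: ToyStack and both functions are hypothetical, a byte array stands in for stack memory, and the starting value of sp is arbitrary. The only numbers taken from the disassembly are the 0x30 frame size and the [sp] / [sp, #0x30] offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { TOY_STACK_BYTES = 256, CALLEE_FRAME_BYTES = 0x30 };

/* A byte array standing in for stack memory, and an index standing
 * in for sp. Lower indices play the role of lower addresses. */
typedef struct {
    uint8_t memory[TOY_STACK_BYTES];
    size_t sp;
} ToyStack;

/* Callee side: mirrors `sub sp, sp, #0x30` then `ldr w8, [sp, #0x30]`. */
int32_t toy_callee_read_ninth(ToyStack *stack) {
    assert(stack != NULL && stack->sp >= CALLEE_FRAME_BYTES);
    int32_t ninth;
    stack->sp -= CALLEE_FRAME_BYTES; /* allocate our own frame */
    memcpy(&ninth, &stack->memory[stack->sp + CALLEE_FRAME_BYTES], sizeof ninth);
    stack->sp += CALLEE_FRAME_BYTES; /* epilogue: give the frame back */
    return ninth;
}

/* Caller side: mirrors `str w8, [sp]` followed by the call. */
int32_t toy_caller_pass_ninth(int32_t ninth) {
    ToyStack stack = { .sp = 128 }; /* arbitrary starting sp */
    memcpy(&stack.memory[stack.sp], &ninth, sizeof ninth);
    return toy_callee_read_ninth(&stack);
}
```

The callee's sp + 0x30 is exactly the caller's sp, which is why the value round-trips: the callee reads a slot that belongs to the caller's frame.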

It should be clear now why we need a calling convention: there’s no “universally obvious” way of describing how a function should be called. We’re balancing performance (registers are much faster than memory) against the use of limited resources.

A calling convention also defines who must preserve register values across a call — for instance, on AArch64 per the AAPCS64 convention:

Caller-saved registers may be freely overwritten by the callee; the caller must save them if it cares.
Callee-saved registers must be restored by the callee before it returns; the caller can rely on them keeping their values across the call.
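
As a rough map of which register falls on which side, here's a small lookup based on my reading of the AAPCS64 register table. RegRole, reg_role and callee_must_preserve are hypothetical names, and the grouping is deliberately coarse (it ignores the floating-point registers entirely).

```c
#include <assert.h>
#include <stdbool.h>

typedef enum {
    ROLE_ARGUMENT_OR_RESULT, /* x0-x7 */
    ROLE_INDIRECT_RESULT,    /* x8: otherwise usable as a temporary */
    ROLE_TEMPORARY,          /* x9-x17: may be overwritten by a call */
    ROLE_PLATFORM,           /* x18: reserved by some platforms */
    ROLE_CALLEE_SAVED,       /* x19-x28 */
    ROLE_FRAME_POINTER,      /* x29 */
    ROLE_LINK_REGISTER       /* x30 */
} RegRole;

/* Classify general-purpose register xN, for N in 0..30. */
RegRole reg_role(int reg) {
    assert(reg >= 0 && reg <= 30);
    if (reg <= 7)  return ROLE_ARGUMENT_OR_RESULT;
    if (reg == 8)  return ROLE_INDIRECT_RESULT;
    if (reg <= 17) return ROLE_TEMPORARY;
    if (reg == 18) return ROLE_PLATFORM;
    if (reg <= 28) return ROLE_CALLEE_SAVED;
    if (reg == 29) return ROLE_FRAME_POINTER;
    return ROLE_LINK_REGISTER;
}

/* True if a function must hand xN back with the value it had on entry. */
bool callee_must_preserve(int reg) {
    RegRole role = reg_role(reg);
    return role == ROLE_CALLEE_SAVED || role == ROLE_FRAME_POINTER;
}
```

This is why the unoptimised code above is happy to scribble over x8 and x9 without saving them first: both sit in the range a callee may overwrite.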

Small structs

But what about when we pass a struct? Let’s look at func_small, which takes and returns a small structure by value.

typedef struct Small {
    int a;
    double b;
} Small;

// ...
Small s2 = func_small(s, 5);

; Load the first argument (the struct s) into x0/x1.
; Structs up to 16 bytes in size are passed entirely in registers.
; The fields are simply laid out across consecutive registers.
; Our Small struct has an int (4 bytes) and a double (8 bytes) = 12 bytes
; total, or 16 bytes with alignment padding, so it fits in two registers.
ldr     x0, [sp, #0x80]
ldr     x1, [sp, #0x88]

; Move the integer argument (5) into w2, the next available argument
; register. Note that 'w2' is the lower 32 bits of the 'x2' register -
; we're only storing an int and not a double, so we don't need more!
mov     w2, #0x5

; Call func_small
bl      0x1000004b0 <_func_small>

; func_small runs

; The return value (another Small) is 16 bytes, so it also comes back
; in registers. x0 and x1 now hold the new struct fields, which the
; compiler stores to the stack.
str     x0, [sp, #0x70]
str     x1, [sp, #0x78]

Now, let’s see what func_small does:

Small func_small(Small s, int x) {
    s.a += x;
    s.b *= 2.0;
    return s;
}

; Allocate stack space
sub     sp, sp, #0x30

; Store the incoming struct fields from registers to stack
str     x0, [sp, #0x10]
str     x1, [sp, #0x18]
str     w2, [sp, #0xc]

; Load s.a (int) and add x to it
ldr     w9, [sp, #0xc]
ldr     w8, [sp, #0x10]
add     w8, w8, w9
str     w8, [sp, #0x10]

; Load s.b (double) and multiply by 2.0
ldr     d0, [sp, #0x18]
fmov    d1, #2.0
fmul    d0, d0, d1
str     d0, [sp, #0x18]

; Copy the modified struct from the stack - [sp, #0x10]
; to another location in the stack - [sp, #0x20], via
; the q0 register.
;
; This copy is redundant - we could load directly into x0/x1
; for return - but we've forced optimisation off, so we see
; things like this!
ldr     q0, [sp, #0x10]
str     q0, [sp, #0x20]

; Load the struct from the copy location into x0/x1 to return
ldr     x0, [sp, #0x20]
ldr     x1, [sp, #0x28]

; Restore stack pointer (epilogue)
add     sp, sp, #0x30
ret

We see:

Small structs are passed in registers, split across as many as they need
Small structs are returned in registers in the same fashion
“small” means 16 bytes or less
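
We can mimic that splitting in C by copying the struct's bytes into two 64-bit words and back. This is a sketch that assumes a typical 64-bit target where double is 8-byte aligned (the static assertions check that); RegPair, small_to_regs and small_from_regs are names I've invented to stand in for x0/x1 and the load/store pairs in the disassembly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct Small {
    int a;
    double b;
} Small;

/* Layout assumptions; compilation fails if they do not hold. */
static_assert(sizeof(Small) == 16, "Small should occupy 16 bytes");
static_assert(offsetof(Small, b) == 8, "b should start at byte 8");

/* Hypothetical stand-in for the x0/x1 register pair. */
typedef struct {
    uint64_t x0; /* bytes 0-7: a plus 4 bytes of padding */
    uint64_t x1; /* bytes 8-15: the bit pattern of b */
} RegPair;

/* Caller side: the two `ldr` instructions, 8 bytes each. */
RegPair small_to_regs(Small s) {
    RegPair regs;
    memcpy(&regs.x0, (const unsigned char *)&s, 8);
    memcpy(&regs.x1, (const unsigned char *)&s + 8, 8);
    return regs;
}

/* Callee side: the two `str` instructions rebuild the struct. */
Small small_from_regs(RegPair regs) {
    Small s;
    memcpy((unsigned char *)&s, &regs.x0, 8);
    memcpy((unsigned char *)&s + 8, &regs.x1, 8);
    return s;
}
```

Note that the double travels in an integer register here as raw bits; it only lands in d0 once func_small loads it back from its stack slot.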

Sooo … what happens if we try to pass a big struct?

Big structs

Let’s try:

typedef struct Large {
    char data[64];
    int len;
} Large;

// ...
Large l2 = func_large(l, 42);

; Load the address of input struct 'l' into x0.
; Large structs (>16 bytes) are passed by reference — the caller
; provides a pointer rather than copying into registers.
add     x0, sp, #0x9c          ; x0 = address of 'l' on the stack
str     x0, [sp, #0x50]        ; spill to a temp slot (-O0 artifact)

; Reload the pointer (redundant, but we've disabled optimisations)
ldr     x0, [sp, #0x50]

; Load second argument (42) into w1
mov     w1, #0x2a

; For large struct returns (>16 bytes), the caller allocates
; space on the stack and passes its address in x8.
sub     x8, x29, #0xa0         ; x8 = address of 'l2' (in caller's stack frame)

; Call func_large(l, 42)
bl      0x100000620 <_func_large>

; func_large writes the result into the address in x8

There’s a great deal of shuffling things back and forth between registers and stack in the unoptimised assembly of func_large, so I’ve omitted it in the interests of brevity.

The rules for big structs are already clear:

If a struct argument exceeds 16 bytes, the caller passes a pointer to the data, which lives on the stack in the caller’s frame, instead of the data itself
For a struct result that exceeds 16 bytes, the caller allocates space for it and passes its address in x8
The callee writes the result there directly, then returns normally
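
One way to picture this is to write out by hand roughly what the compiler does for us. In the sketch below both function names are hypothetical, and the body (adding x to len) is made up, since the real func_large isn't shown. The result parameter plays the role of x8 and arg the pointer passed in x0.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Large {
    char data[64];
    int len;
} Large;

/* What the C source says: take and return a Large by value. */
Large func_large_by_value(Large l, int x) {
    l.len += x;
    return l;
}

/* Roughly what the calling convention turns that into. */
void func_large_lowered(Large *result, const Large *arg, int x) {
    assert(result != NULL && arg != NULL);
    Large local = *arg; /* work on a copy; the caller's struct is untouched */
    local.len += x;
    *result = local;    /* write straight into the caller's buffer */
}
```

Both versions produce the same struct; the second just makes the two hidden pointers visible.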

Footnotes

I find ignoring Windows to be generally a good practice. Windows 11: what even is that?