In case the OP, or anyone reading this thread later, wants to see the same function (raising an integer to an integer power) written the way I suggest learning asm (64 bit, mixing C and asm), here is that code.
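You can compile the C and asm together with
gcc test.c foo.s
then run the result with
./a.out
If your gcc builds position-independent executables by default (most current distributions do), the link step rejects the absolute address of message. In that case compile with
gcc -no-pie test.c foo.s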
The C file test.c just calls the asm function:
Code:
void foo(unsigned int a, unsigned int b);
int main(int argc, char **argv)
{
    foo(3, 30);
    return 0;
}
The asm function in foo.s computes a to the power b as a 64 bit unsigned long, where a and b are 32 bit unsigned values. Then it calls printf to display the result.
Code:
        .section .rodata
message:
        .string "%u ** %u = %lu\n"
        .text
        .globl foo
# on entry
#   edi = base (32 bit unsigned)
#   esi = exponent (32 bit unsigned)
foo:
        pushq   %rbp
        movq    %rsp, %rbp
        movl    %edi, %edi      # Clear high half of rdi (not guaranteed by the ABI)
        movl    $1, %ecx        # Same behavior as movq $1,%rcx
        movl    %esi, %edx      # Save exponent for printf
        testl   %esi, %esi
        jz      2f              # Skip the loop if exponent is zero
1:      imulq   %rdi, %rcx      # rcx *= rdi
        decl    %esi
        jne     1b
2:                              # rcx already has fourth parameter for printf
                                # rdx already has third parameter for printf
        movl    %edi, %esi      # Second parameter for printf
        movq    $message, %rdi  # First parameter for printf
        xorl    %eax, %eax      # rax = 0: number of SSE registers in parameter list
        call    printf
        leave
        ret
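For readers more comfortable in C, the loop in foo.s does roughly what the sketch below does. The name ipow and the choice to return the value instead of printing it are assumptions made for illustration. They are not part of the code above.

```c
#include <stdint.h>

/* Hypothetical C counterpart of the loop in foo.s: multiply a
   64 bit accumulator by the base once per exponent step. */
uint64_t ipow(uint32_t base, uint32_t exponent)
{
    uint64_t result = 1;        /* movl $1, %ecx */

    while (exponent != 0) {     /* testl/jz before the loop, decl/jne inside it */
        result *= base;         /* imulq %rdi, %rcx */
        exponent--;
    }
    return result;              /* silently wraps modulo 2^64 on overflow */
}
```

Calling ipow(3, 30) gives 205891132094649, the value the asm version prints for the call in test.c. Like the asm, the sketch makes no attempt to detect overflow.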
A few details that a beginner would need explained:
1) No parameters or locals are on the stack, so the conventional pushq %rbp and movq %rsp, %rbp at the beginning, balanced by leave at the end, serve little purpose. But we can't simply omit them with no other changes, because the ABI requires the stack to be 16 byte aligned before each call. The stack was 16 byte aligned when main called foo(), but the return address pushed by that call leaves it 8 bytes off. So we need to push an odd number of 8 byte objects before calling printf. The push of rbp at the beginning of each function usually serves as that odd item, so any later pushes or stack allocations should add up to an even number of 8 byte items.
But a good asm programmer might notice the opportunity for a "function tail merge" (more commonly called a tail call). When the last thing you do before returning is call a function, if the stack and register usage are compatible you can jump to the function instead of calling it and then returning. The jump pushes nothing, so printf sees the stack exactly as if main had called it directly, and the alignment problem disappears. In 32 bit x86, function tail merge is rarely possible unless the signatures of the two functions are nearly identical. But the 64 bit ABI passes these parameters in registers, so the function tail merge is easy here even though only two parameters were passed to foo() while four are passed to printf(). So the simpler version of foo.s is
Code:
        .section .rodata
message:
        .string "%u ** %u = %lu\n"
        .text
        .globl foo
# on entry
#   edi = base (32 bit unsigned)
#   esi = exponent (32 bit unsigned)
foo:
        movl    %edi, %edi      # Clear high half of rdi (not guaranteed by the ABI)
        movl    $1, %ecx        # Same behavior as movq $1,%rcx
        movl    %esi, %edx      # Save exponent for printf
        testl   %esi, %esi
        jz      2f              # Skip the loop if exponent is zero
1:      imulq   %rdi, %rcx      # rcx *= rdi
        decl    %esi
        jne     1b
2:                              # rcx already has fourth parameter for printf
                                # rdx already has third parameter for printf
        movl    %edi, %esi      # Second parameter for printf
        movl    $message, %edi  # First parameter for printf
        xorl    %eax, %eax      # rax = 0: number of SSE registers in parameter list
        jmp     printf
2) In 64 bit mode, any time the destination of an instruction is one of the 32 bit general registers, the CPU clears the high half of the corresponding 64 bit register. So in several places where I wanted a 32 bit unsigned value in a 64 bit register, I just moved or computed the 32 bit value into the 32 bit register (which is a shorter and sometimes faster instruction than moving or computing the 64 bit value) and relied on the upper half being cleared.
3) A good asm programmer plans ahead for register use. The ABI specifies the first six integer or pointer parameters go in registers rdi, rsi, rdx, rcx, r8, and r9. I needed to select a place to save the exponent (because the loop destroys the original exponent) and select a place to compute the result. Since those would be the third and fourth parameters in the call, I selected rdx and rcx.
4) The instruction movl $message, %edi treats the address message as a 32 bit unsigned quantity. But in x86_64, addresses are 64 bit. So this instruction only works right if the program is linked so that all its pre-initialized read-only data is in the first 4GB of the address space. That holds for ordinary user mode programs in Linux linked as fixed-address (non-PIE) executables. A position-independent executable cannot use an absolute 32 bit address, so there you would use leaq message(%rip), %rdi instead.
5) When calling any function, such as printf, that takes a variable parameter list, the ABI requires that the caller put in the AL register the number of parameters which have been passed in SSE registers (an upper bound is enough). My function is not passing (nor even using) any SSE registers. So I need to put 0 in al before calling printf. The instruction xorl %eax, %eax clears all of rax and is generally the fastest way to clear part or all of rax. (al is the lowest byte of rax.)
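6) You should read the documentation for as (the GNU assembler) on local labels to understand jump targets such as the 2f and 1b I used in this example. 2f means the nearest 2: label forward from the jump, and 1b means the nearest 1: label backward from it. In this example they connect to the 2: and 1: labels.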
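The stack alignment argument in point 1 and the 32 bit address argument in point 4 both come down to simple arithmetic. The sketch below spells it out in C. The function names stack_misalignment and fits_in_32_bits are made up for illustration, and the sample addresses in the usage note are only typical values, not ones taken from this program.

```c
#include <stdint.h>

/* Bytes by which rsp misses 16 byte alignment inside a function
   after `pushes` 8 byte pushes. At entry the return address has
   already moved rsp 8 bytes away from an aligned value, so zero
   pushes leaves it 8 bytes off. */
unsigned stack_misalignment(unsigned pushes)
{
    return (8u * (pushes + 1u)) % 16u;
}

/* Nonzero if an address can be loaded with a 32 bit move such as
   movl $message, %edi, that is, if it lies below 4GB. */
int fits_in_32_bits(uint64_t address)
{
    return address <= UINT32_MAX;
}
```

stack_misalignment(0) is 8 and stack_misalignment(1) is 0, which is why the single pushq %rbp is enough before call printf. fits_in_32_bits(0x400000) is true, while an address such as 0x555555554000 fails the check.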