AArch64 & x86_64 Architectures

Introduction

To start, a central processing unit (CPU) is the electronic circuitry (silicon) within a computer whose primary purpose is executing instructions: arithmetic calculations, logic, and input/output operations. CPU architecture refers to the main parts of that structure, such as the arithmetic logic unit (ALU), the control unit (CU), the memory unit (MU), and the registers. The diagram below shows a simple Von Neumann CPU structure:


Note:
  • The memory unit is responsible for holding data and instructions
  • The Arithmetic/Logic Unit is responsible for arithmetic and logic operations on data
  • Registers are quickly accessible storage available to the CPU. Some registers may have specific hardware functions and different read/write access
  • The control unit is responsible for the data flow within the CPU
  • Input and Output flow via the CPU bus
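
The interaction of these parts can be sketched with a toy machine. This is only an illustration in Python: the opcodes LOAD, ADD, STORE and HALT and the single accumulator register are hypothetical and do not belong to x86_64 or AArch64. What it shows is the Von Neumann idea that one memory holds both instructions and data, while the control logic fetches and executes one instruction at a time.

```python
# Toy von Neumann machine: one memory holds both the program and its data.
# The opcodes (LOAD, ADD, STORE, HALT) are hypothetical, not a real ISA.

def run(memory):
    """Fetch, decode and execute until HALT; returns the final memory."""
    pc = 0    # program counter register: address of the next instruction
    acc = 0   # accumulator register: holds the working value
    while True:
        opcode, operand = memory[pc]    # fetch (control unit)
        pc += 1
        if opcode == "LOAD":            # memory -> register
            acc = memory[operand]
        elif opcode == "ADD":           # arithmetic in the ALU
            acc = acc + memory[operand]
        elif opcode == "STORE":         # register -> memory
            memory[operand] = acc
        elif opcode == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {opcode!r}")

program = [
    ("LOAD", 4),    # address 0: acc = memory[4]
    ("ADD", 5),     # address 1: acc += memory[5]
    ("STORE", 6),   # address 2: memory[6] = acc
    ("HALT", 0),    # address 3: stop
    2,              # address 4: data
    3,              # address 5: data
    0,              # address 6: the result is stored here
]

result = run(list(program))
print(result[6])  # prints 5
```

Real CPUs encode instructions as bytes rather than tuples, but the fetch, decode, execute cycle is the same in spirit.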

x86_64

x86 is an Intel/AMD architecture that debuted with the 16-bit Intel 8086 processor. It gained desktop and server dominance as the 32-bit 386/486/x86 architecture and was then extended by AMD to the 64-bit x86_64 architecture. Intel and AMD actively compete with x86_64 CPUs, and the architecture continues as the preeminent server architecture and the most popular desktop architecture. Source. If you look at your current PC's specification, you will likely find that it is built on the x86_64 architecture. Most gaming and heavy-computing hardware uses Intel or AMD CPUs, and most desktop and server software is written for x86, so the architecture is likely to remain in wide use.

General Purpose Registers

Note: x86_64 uses variable-length instructions
The 64-bit versions of the 'original' x86 registers are named:
  • rax - register a extended
  • rbx - register b extended
  • rcx - register c extended
  • rdx - register d extended
  • rbp - register base pointer (base of the current stack frame)
  • rsp - register stack pointer (current location in stack, growing downwards)
  • rsi - register source index (source for data copies)
  • rdi - register destination index (destination for data copies)
The registers added for 64-bit mode are named:
  • r8 through r15 (i.e. r8, r9, r10, ...)
Registers may be accessed as:
  • 64-bit registers using the 'r' prefix: rax, r15
  • 32-bit registers using the 'e' prefix (original registers: e_x) or 'd' suffix (added registers: r__d): eax, r15d
  • 16-bit registers using no prefix (original registers: _x) or a 'w' suffix (added registers: r__w): ax, r15w
  • 8-bit registers using 'h' ("high byte" of 16 bits) suffix (original registers - bits 8-15: _h): ah, bh
  • 8-bit registers using 'l' ("low byte" of 16 bits) suffix (original registers - bits 0-7: _l) or 'b' suffix (added registers: r__b): al, bl, r15b
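
The narrower names are views onto the same 64-bit register, not separate storage. A minimal Python sketch of that aliasing follows; the helper names views, write_eax and write_al are hypothetical and exist only for this illustration. It also shows one detail that is easy to miss: writing a 32-bit register clears the upper 32 bits of the 64-bit register, while writing an 8-bit register leaves the other bits alone.

```python
MASK64 = (1 << 64) - 1

def views(rax):
    """Return the narrower views of one 64-bit register value."""
    rax &= MASK64
    return {
        "rax": rax,                # all 64 bits
        "eax": rax & 0xFFFFFFFF,   # bits 0-31
        "ax": rax & 0xFFFF,        # bits 0-15
        "ah": (rax >> 8) & 0xFF,   # bits 8-15
        "al": rax & 0xFF,          # bits 0-7
    }

def write_eax(rax, value):
    """A 32-bit write zero-extends: the upper 32 bits of rax become 0."""
    return value & 0xFFFFFFFF

def write_al(rax, value):
    """An 8-bit write changes bits 0-7 only; bits 8-63 are preserved."""
    return (rax & MASK64 & ~0xFF) | (value & 0xFF)

v = views(0x1122334455667788)
print({name: hex(value) for name, value in v.items()})
print(hex(write_eax(0x1122334455667788, 1)))
print(hex(write_al(0x1122334455667788, 0xFF)))
```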
Usage during syscall/function call:
  • The first six integer or pointer arguments are in rdi, rsi, rdx, rcx, r8, r9; the remaining arguments are on the stack. For Linux syscalls, r10 is used in place of rcx.
  • For syscalls, the syscall number is in rax. For calls to variadic procedures such as printf, rax should be set to 0 when no floating-point arguments are passed.
  • The return value is in rax.
  • The called routine is expected to preserve rsp, rbp, rbx, r12, r13, r14, and r15 but may trample any other registers.
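
As a rough illustration of the argument rules, the following Python sketch maps a list of integer arguments to registers and the stack. The helper place_args is hypothetical, the full-width names r8 and r9 are used, and the syscall list assumes the Linux convention, in which r10 takes the place of rcx. Floating-point arguments, which travel in separate registers, are ignored here.

```python
# Integer/pointer argument registers for function calls, in order.
FUNCTION_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]
# Assumption: Linux syscalls, where r10 is used instead of rcx.
SYSCALL_ARG_REGS = ["rdi", "rsi", "rdx", "r10", "r8", "r9"]

def place_args(args, syscall=False):
    """Hypothetical helper: decide where each integer argument goes."""
    regs = SYSCALL_ARG_REGS if syscall else FUNCTION_ARG_REGS
    if syscall and len(args) > len(regs):
        raise ValueError("a syscall takes at most six arguments")
    in_registers = dict(zip(regs, args))   # first six arguments
    on_stack = list(args[len(regs):])      # seventh argument onwards
    return in_registers, on_stack

in_regs, on_stack = place_args([10, 20, 30, 40, 50, 60, 70, 80])
print(in_regs)
print(on_stack)
```

With eight arguments, the first six land in registers and the last two are left for the stack.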

Floating-Point and SIMD Registers

x86_64 also defines a set of large registers for floating-point and single-instruction/multiple-data (SIMD) operations. For details, refer to the Intel or AMD documentation.

Common instructions

The following list demonstrates common instructions using NASM syntax (destination operand first; a semicolon starts a comment):
add r11,r10    ; add r10 and r11, put result in r11
add r10,5      ; add 5 to r10, put result in r10
call label     ; call a subroutine / function / procedure
cmp r11,r10    ; compare register r11 with register r10. The comparison sets flags in the processor status register which affect conditional jumps.
cmp r11,99     ; compare register r11 with the number 99. The comparison sets flags in the processor status register which affect conditional jumps.
div r10        ; unsigned divide of the 128-bit value rdx:rax by r10; places quotient into rax and remainder into rdx (rdx must be zero before this instruction when the dividend is held in rax alone)
inc r10        ; increment r10
jmp label      ; jump to label
je  label      ; jump to label if equal
jne label      ; jump to label if not equal
jl  label      ; jump to label if less (signed comparison)
jg  label      ; jump to label if greater (signed comparison)
mov r11,r10    ; move data from r10 to r11
mov r10,99     ; put the immediate value 99 into r10
mov [r11],r10  ; move data from r10 to the address pointed to by r11
mov r11,[r10]  ; move data from the address pointed to by r10 to r11
mul r10        ; unsigned multiply of rax by r10; places the low 64 bits of the result in rax and the high 64 bits (overflow) in rdx
push r10       ; push r10 onto the stack
pop r10        ; pop r10 off the stack
ret            ; return from subroutine (counterpart to call)
syscall        ; invoke a syscall (in 32-bit mode, use "int 0x80" instead)
Source and Documentation: Intel, AMD
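
The mul and div instructions are the least obvious entries in the list because they use rax and rdx implicitly. The Python model below is a sketch of the unsigned 64-bit forms only; the function names and the use of Python exceptions in place of the CPU's divide-error fault are choices made for this illustration.

```python
MASK64 = (1 << 64) - 1

def mul(rax, src):
    """Model of unsigned 'mul src': returns (new rax, new rdx)."""
    product = (rax & MASK64) * (src & MASK64)
    return product & MASK64, product >> 64   # low half, high half

def div(rdx, rax, src):
    """Model of unsigned 'div src': returns (quotient in rax, remainder in rdx)."""
    if src == 0:
        raise ZeroDivisionError("the CPU raises a divide error here")
    dividend = (rdx << 64) | rax             # the 128-bit value rdx:rax
    quotient, remainder = divmod(dividend, src)
    if quotient > MASK64:
        raise OverflowError("quotient does not fit in rax (divide error)")
    return quotient, remainder

print(mul(2**63, 4))    # the product 2**65 overflows into rdx
print(div(0, 17, 5))    # rdx cleared first: 17 / 5
```

The overflow check is the reason rdx is normally cleared before dividing a value held only in rax: leftover bits in rdx become part of the dividend.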

AArch64

ARM is an architecture that started with the Acorn computer company, became the dominant mobile and embedded architecture in its 32-bit incarnations, and was extended to 64-bit in version 8 (ARMv8) with the AArch64 mode. 64-bit ARM processors are dominant in smartphone applications and are starting to compete in server and high-performance computing systems. Source. The architecture is known for its efficiency (low power consumption), and in the modern world of servers and power-hungry computing it is becoming more and more popular. In some benchmarks, its performance is not inferior to that of x86_64.

General Purpose Registers

Note: AArch64 uses fixed-length instructions
The AArch64 registers are named:
  • r0 through r30 - to refer generally to the registers
  • x0 through x30 - for 64-bit-wide access (same registers)
  • w0 through w30 - for 32-bit-wide access (same registers - writing a w register clears the upper 32 bits; sign-extending loads into an x register instead set the upper bits to the value of the most significant bit of the loaded value).
Register '31' is one of two registers depending on the instruction context:
  • For instructions dealing with the stack, it is the stack pointer, named sp (wsp for 32-bit access)
  • For all other instructions, it is a "zero" register, which returns 0 when read and discards data when written - named rzr (xzr, wzr)
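
The behaviour of the 32-bit views and of the zero register can be modelled in a few lines of Python. The RegisterFile class is a hypothetical teaching aid, not how hardware is built, and it treats register 31 only in its zero-register role (the stack pointer role is left out).

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

class RegisterFile:
    """Hypothetical model of r0-r30 plus the zero register (index 31)."""

    def __init__(self):
        self.regs = [0] * 31              # storage for r0 through r30

    def read_x(self, n):
        """64-bit read; register 31 always reads as 0."""
        return 0 if n == 31 else self.regs[n]

    def write_x(self, n, value):
        """64-bit write; writes to register 31 are discarded."""
        if n != 31:
            self.regs[n] = value & MASK64

    def read_w(self, n):
        """32-bit read: the low half of the same register."""
        return self.read_x(n) & MASK32

    def write_w(self, n, value):
        """32-bit write: the upper 32 bits of the register are cleared."""
        self.write_x(n, value & MASK32)

rf = RegisterFile()
rf.write_x(0, 0xFFFFFFFFFFFFFFFF)
print(hex(rf.read_w(0)))   # low 32 bits of x0
rf.write_w(0, 5)
print(hex(rf.read_x(0)))   # upper half was cleared by the w0 write
rf.write_x(31, 123)
print(rf.read_x(31))       # the zero register still reads 0
```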
Usage during syscall/function call:
  • r0-r7 are used for arguments and return values; additional arguments are on the stack
  • For syscalls, the syscall number is in r8
  • r9-r15 are for temporary values (may get trampled)
  • r16-r18 are used for intra-procedure-call and platform values (avoid)
  • The called routine is expected to preserve r19-r28. These registers are generally safe to use in your program.
  • r29 and r30 are used as the frame register and link register (avoid)
See the ARM Procedure Call Reference for details.
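
The register roles listed above can be captured in a small lookup, which is handy when deciding what a routine has to save. This Python sketch is illustrative only: the functions role and must_save are hypothetical names, and the role strings simply restate the list.

```python
def role(n):
    """Return the conventional role of general-purpose register r<n>."""
    if not 0 <= n <= 30:
        raise ValueError("general-purpose registers are r0 through r30")
    if n <= 7:
        return "argument / return value"
    if n == 8:
        return "syscall number"
    if n <= 15:
        return "temporary"
    if n <= 18:
        return "intra-procedure-call / platform"
    if n <= 28:
        return "callee-saved"
    return "frame register" if n == 29 else "link register"

def must_save(used):
    """Registers from `used` that a routine must save and restore itself."""
    return sorted(n for n in used if role(n) == "callee-saved")

print(role(9))
print(must_save({3, 20, 9, 19}))
```

A routine that uses r3, r9, r19 and r20 only needs to save and restore r19 and r20; the others may be trampled freely.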

Floating-Point and SIMD Registers

AArch64 also defines a set of large registers for floating-point and single-instruction/multiple-data (SIMD) operations. For details, refer to the ARM documentation.

Common instructions

The following list demonstrates common instructions using GNU Assembler (gas/as) syntax. The names r0, r1, ... refer to the registers generally; in real code, write x0 or w0 to select the access width:
add r0,r1,r2      // load r0 with r1+r2
add r0,r1,99      // load r0 with r1+99
adr r0,label      // load r0 with the address label (this actually calculates an address from the PC plus an offset)
adrp r0,label     // load r0 with the address of the 4K page containing label (this calculates an address from the PC plus an offset, and is often followed by an ADD instruction so that the register points exactly to the label)
bl label          // branch (with link) to label - this is a procedure / subroutine / function call
b label           // branch to label - this is a goto
br register       // branch to the address in register
b.eq label        // branch to label if equal
b.ne label        // branch to label if not equal
b.lt label        // branch to label if less
b.gt label        // branch to label if greater
cmp r0,r1         // compare register r0 with register r1. The comparison sets flags in the processor status register which affect conditional branches.
cmp r0,99         // compare register r0 with the number 99. The comparison sets flags in the processor status register which affect conditional branches.
ldr r0,[r1,0]     // load register r0 from the address (r1 + 0). The offset is given in bytes and is encoded as a multiple of the access size: 8 bytes for 64-bit loads, 4 bytes for 32-bit loads
ldr w0,[r1,0]     // like above but reads 32 bits only - note the use of w0 instead of r0 for the destination register name
ldrb w0,[r1,0]    // like above but reads 1 byte (8 bits) only - note the use of w0 for the destination register name
ldur r0,[r1,0]    // load register r0 from the address (r1 + 0), where the byte offset is not scaled by the access size - the mnemonic means "load unscaled register"
mov r0,r1         // move data from r1 to r0
mov r0,99         // load r0 with 99 (only certain immediate values are possible)
ret               // return from subroutine (counterpart to bl)
str r0,[r1,0]     // store register r0 to the address (r1 + 0). The offset is given in bytes and is encoded as a multiple of the access size: 8 bytes for 64-bit stores
strb w0,[r1,0]    // like str but writes one byte only - note the use of w0 for the source register name
stur r0,[r1,0]    // store register r0 to the address (r1 + 0), where the byte offset is not scaled by the access size - the mnemonic means "store unscaled register"
svc 0             // perform a syscall
msub r0,r1,r2,r3  // load r0 with r3-(r1*r2) (useful for calculating remainders)
madd r0,r1,r2,r3  // load r0 with r3+(r1*r2)
mul r0,r1,r2      // load r0 with r1*r2 (actually an alias of madd with the zero register as the addend - see ARM ARM)
str r0,[sp,-16]!  // push r0 onto the stack (AArch64 has no push instruction; this decrements sp by 16, keeping it 16-byte aligned, then stores r0)
ldr r0,[sp],16    // pop r0 off the stack (AArch64 has no pop instruction; this loads r0, then increments sp by 16)
udiv r0,r1,r2     // unsigned - divide r1 by r2, places quotient into r0 - remainder is not calculated (use msub)
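
Two idioms from this list are worth a worked example: getting a remainder from udiv plus msub, and building a full address from adrp plus add. The Python functions below are simplified models written for this post. They compute what the instructions produce but ignore encoding limits, and the address 0x412345 is an arbitrary made-up value.

```python
MASK64 = (1 << 64) - 1

def udiv(n, d):
    """Model of 'udiv': unsigned quotient; dividing by zero yields 0."""
    return 0 if d == 0 else (n // d) & MASK64

def msub(a, b, c):
    """Model of 'msub rd,ra,rb,rc': rd = rc - (ra * rb), truncated to 64 bits."""
    return (c - a * b) & MASK64

def remainder(n, d):
    """Remainder via the udiv + msub pair: n - (n / d) * d."""
    quotient = udiv(n, d)
    return msub(quotient, d, n)

def adrp(label_address):
    """Model of the value 'adrp' leaves in the register: the 4K page of label."""
    return label_address & ~0xFFF

def low12(label_address):
    """The offset within the page that the following 'add' supplies."""
    return label_address & 0xFFF

print(remainder(17, 5))
label = 0x412345                       # hypothetical address of a label
print(hex(adrp(label)), hex(low12(label)))
print(hex(adrp(label) + low12(label)))
```

Adding the page address and the low 12 bits gives back the exact address of the label, which is why adrp is so often followed by an add.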

Conclusion

In this post, I have discussed the two most common architecture families currently used in PCs and mobile devices: x86_64 and AArch64. They differ in many ways, such as variable-length versus fixed-length instructions and their register conventions, and they are used in different kinds of devices, but both describe the design of the central processing unit and its components. CPU architecture is a fundamental part of modern computing, and it is essential to understand these two architectures and to be able to tell them apart.

Author: Iurii Kondrakov 
GitHub: github.com
