Assembly and Machine Code

Introduction

A programming language is a formal notation for writing the instructions a computer uses to carry out a task or algorithm. Today, most developers use high-level languages like C, C++, Java, Fortran, Python, etc., because their syntax is easier to read and write. However, a CPU cannot execute high-level source code directly; that is why the text of a program is either translated into a binary format (Machine Code) by a compiler or executed by an interpreter that is itself machine code.

Machine Code is what a CPU can actually decode and execute. It consists of binary-encoded instructions defined by the Instruction Set Architecture (ISA) and is therefore specific to a particular architecture.
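To make the idea of architecture-specific encodings concrete, here is a small Python sketch that decodes three x86-64 instructions by hand. The byte sequence is hand-written for illustration (it is not taken from any output in this post), and the hypothetical `decode` helper understands only two encodings: `B8+r imm32` (`mov r32, imm32`) and `0F 05` (`syscall`).

```python
import struct

# Register numbers used by the "B8+r" form of mov r32, imm32 on x86.
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode(code: bytes) -> list[str]:
    """Decode a tiny, illustrative subset of x86-64 machine code."""
    out = []
    i = 0
    while i < len(code):
        op = code[i]
        if 0xB8 <= op <= 0xBF:
            # Opcode B8+r is followed by a 32-bit little-endian immediate.
            (imm,) = struct.unpack_from("<I", code, i + 1)
            out.append(f"mov {REG32[op - 0xB8]}, {imm}")
            i += 5
        elif code[i:i + 2] == b"\x0f\x05":
            out.append("syscall")
            i += 2
        else:
            raise ValueError(f"byte {op:#04x} is outside this sketch's subset")
    return out

# Hand-written example bytes (an assumption made for illustration).
sample = bytes.fromhex("bf01000000" "b83c000000" "0f05")
for line in decode(sample):
    print(line)
```

The same ten bytes would mean something different, or nothing valid at all, to a CPU with another ISA such as ARM.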

On the other hand, Assembly language stands between high-level languages and machine code; it is a symbolic representation of machine language. It is used to program on the "bare metal" hardware. But it lacks many features a high-level language can provide, like compile-time and run-time error checking, type checking, benchmarking and diagnostics, and much more. You need a good knowledge of the hardware architecture, registers and memory management to write correct and safe code in Assembly.

Format

Consider the following "Hello World" Assembly code written for x86_64 Architecture using NASM syntax:

section .text           ; directive: place the following code in the ".text" section
global _start           ; directive: export the _start symbol to the linker

; Symbols
stdout    equ  1  
sys_write equ  1
sys_exit  equ 60

_start:
    mov rdx, len        ; load the message length into rdx
    mov rsi, hello      ; load the message address into rsi
    mov rdi, stdout     ; load the stdout file descriptor into rdi
    mov rax, sys_write  ; load the sys_write syscall number into rax
    syscall             ; invoke the kernel

    mov rdi, 0          ; exit code
    mov rax, sys_exit   ; load the sys_exit syscall number into rax
    syscall             ; invoke the kernel

section .rodata         ; directive: place the following data in the read-only ".rodata" section
hello  db   'Hello, World!',0x0A ; message followed by a newline
len    equ  $ - hello            ; message length

Let me explain:

section .text is a directive that specifies that the following instructions/data should be placed in the ".text" section of the output ELF file.
section .rodata is a similar directive that specifies that the following instructions/data should be placed in the .rodata read-only section of the output ELF file. Alternatively, the data could be placed in the .data section, which is for read-write data (data that is not write-protected in memory).
global _start is a directive that makes the following symbol visible to the linker. Without it, _start would remain local to the object file. In this case, the linker needs to know the value of the special symbol _start to know where execution is to begin in the program (which is not always at the start of the .text section).
_start is a label that is equivalent to the memory location of the first instruction in the program.
hello is a label that is equivalent to the memory location of the first byte of the string "Hello, World!\n".
equ is a directive that sets a symbol (len) equal to the value of an expression (in this example, "$ - hello", meaning the current memory location minus the value of the label "hello").
instructions inside _start are indented by convention so that they stand apart from labels; NASM itself does not require the indentation. Each line consists of an instruction and 0 or more operands.
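The "$ - hello" expression can be sketched in Python by modelling the assembler's location counter. This is only an illustration: the counter here starts at 0 relative to the section (an assumption for simplicity), whereas the final addresses are assigned by the linker.

```python
# Model the assembler's location counter for the .rodata section.
# Offsets are relative to the start of the section (an assumption
# made for simplicity; the linker assigns the final addresses).
location = 0

hello = location                         # label: offset of the first byte
data = b"Hello, World!" + bytes([0x0A])  # db 'Hello, World!',0x0A
location += len(data)                    # db advances the location counter

length = location - hello                # len equ $ - hello
print(length)                            # 14
```

Because the result is a plain number fixed when the file is assembled, len behaves like a constant, not like a variable stored in memory.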

Note that symbols are not variables - they are constants calculated at assembly or link time. However, a symbol may hold the address of a variable. Note also that the syntax will vary from assembler to assembler and from architecture to architecture. Source and comparison between ARM and x86
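For comparison, the two system calls made by the Assembly listing have a rough high-level counterpart in Python's os module. This is a sketch, not a translation: the function names write_hello and main are hypothetical, and Python performs a great deal of extra work around the underlying calls.

```python
import os

STDOUT = 1  # same descriptor number as the stdout symbol in the listing

def write_hello(fd: int = STDOUT) -> int:
    """Write the message to fd and return the number of bytes written."""
    msg = b"Hello, World!\n"
    # Roughly: rax = sys_write, rdi = fd, rsi = address of msg, rdx = len(msg)
    return os.write(fd, msg)

def main() -> None:
    write_hello()
    # Roughly: rax = sys_exit, rdi = 0
    os._exit(0)
```

Calling main() prints the message and terminates the process immediately with exit code 0, much like the final syscall in the listing.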

Let's assemble and link the code above:

$ nasm -o hello.o -f elf64 hello.s
$ ld -o hello hello.o
$ ./hello
Hello, World!

Now we can take a look at the bytes of the resulting executable, which contains the Machine code generated by the assembler:

$ hexdump -C hello
... snipped ...
00002000  48 65 6c 6c 6f 2c 20 57  6f 72 6c 64 21 0a 00 00  |Hello, World!...|
... snipped ...
000020f0  00 10 40 00 00 00 00 00  00 00 00 00 00 00 00 00  |..@.............|
00002100  2d 00 00 00 10 00 02 00  0e 20 40 00 00 00 00 00  |-........ @.....|
00002110  00 00 00 00 00 00 00 00  39 00 00 00 10 00 02 00  |........9.......|
00002120  0e 20 40 00 00 00 00 00  00 00 00 00 00 00 00 00  |. @.............|
00002130  40 00 00 00 10 00 02 00  10 20 40 00 00 00 00 00  |@........ @.....|
00002140  00 00 00 00 00 00 00 00  00 68 65 6c 6c 6f 2e 73  |.........hello.s|
00002150  00 73 74 64 6f 75 74 00  73 79 73 5f 77 72 69 74  |.stdout.sys_writ|
00002160  65 00 73 79 73 5f 65 78  69 74 00 68 65 6c 6c 6f  |e.sys_exit.hello|
00002170  00 6c 65 6e 00 5f 5f 62  73 73 5f 73 74 61 72 74  |.len.__bss_start|
00002180  00 5f 65 64 61 74 61 00  5f 65 6e 64 00 00 2e 73  |._edata._end...s|
00002190  79 6d 74 61 62 00 2e 73  74 72 74 61 62 00 2e 73  |ymtab..strtab..s|
000021a0  68 73 74 72 74 61 62 00  2e 74 65 78 74 00 2e 64  |hstrtab..text..d|
000021b0  61 74 61 00 00 00 00 00  00 00 00 00 00 00 00 00  |ata.............|
000021c0  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
... snipped ...

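The dump above shows the raw bytes of the linked ELF executable, a format which is hard to read. The visible rows hold the message bytes, symbol entries and the symbol and section name strings; the encoded instructions of the .text section are in the part that was snipped. Depending on where it sits in the file, a byte can be part of an instruction, an operand or data. Take a moment to read this page and watch Ben Eater's educational YouTube channel.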
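Two pieces of the hexdump can be decoded with a few lines of Python. The sketch below copies its bytes from the rows at offsets 0x2000 and 0x2100 of the dump. Reading the second group as an Elf64_Sym symbol table entry is an assumption based on the standard ELF-64 layout, not something the dump states.

```python
import struct

# Row at file offset 0x2000, copied from the hexdump above.
row = bytes.fromhex("48656c6c6f2c20576f726c64210a0000")
print(row[:14])  # the 14 message bytes, ending in a newline

# 24 bytes starting at file offset 0x2100, copied from the hexdump above.
entry = bytes.fromhex(
    "2d000000" "10" "00" "0200"  # st_name, st_info, st_other, st_shndx
    "0e20400000000000"           # st_value
    "0000000000000000"           # st_size
)

# Elf64_Sym layout, little-endian: u32, u8, u8, u16, u64, u64 (assumed).
st_name, st_info, st_other, st_shndx, st_value, st_size = struct.unpack(
    "<IBBHQQ", entry
)
binding = st_info >> 4  # high nibble of st_info; 1 means a global symbol
print(hex(st_name), binding, st_shndx, hex(st_value))
```

If the string table starts at file offset 0x2148 (where the dump shows a zero byte followed by hello.s), the name offset 0x2d lands on __bss_start, so this entry appears to describe that linker-provided symbol at address 0x40200e.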

Conclusion

Any CPU operates on Machine code, which an Assembler generates from Assembly language or a compiler generates from a high-level language. Assembly language consists of architecture-specific instructions, symbols and directives. Modern CPUs are extremely complex, their documentation runs to thousands of pages, and it is hard to comprehend what is going on at the hardware level. That is why most developers tend to use high-level languages. But I will continue exploring Assembly and Architectures to get more insights into this fascinating topic.

Author: Iurii Kondrakov

Email: deezzir@gmail.com

GitHub: github.com

SPO600 Blog
