Assembly and Machine Code
Introduction
A programming language is a set of instructions used to tell a
computer how to carry out a task or algorithm. Today, most
developers use high-level languages such as C, C++, Java,
Fortran, or Python because their syntax is easier to read and
write. However, a CPU cannot execute high-level source code
directly; that is why a text program written by a developer is
translated into binary machine code by a compiler, or executed on
its behalf by an interpreter.
Machine Code
is the only form a CPU can execute directly. It consists of
binary-encoded instructions defined by the Instruction Set
Architecture (ISA) and is therefore specific to a particular
architecture.
On the other hand,
Assembly language
sits between high-level languages and machine code; it is a symbolic
representation of machine language. It is used to program on the
"bare metal" hardware. But it lacks many features a high-level
language can provide, like compile-time and run-time error
checking, type checking, benchmarking and diagnostics, and much
more. You need good knowledge of the hardware architecture,
registers, and memory management to write correct and safe
Assembly code.
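To make the link between a symbolic instruction and its bytes concrete, here is a minimal sketch in Python that encodes one x86 instruction form, `mov r32, imm32`, by hand. The helper name `encode_mov_r32_imm32` is hypothetical, and the sketch covers only this single form (opcode `0xB8` plus the register number, followed by a little-endian 32-bit immediate); a real assembler handles far more.

```python
import struct

# Register numbers used by the x86 "mov r32, imm32" encoding (opcode 0xB8 + register).
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}


def encode_mov_r32_imm32(reg: str, value: int) -> bytes:
    """Encode `mov reg, value` for a 32-bit register and a 32-bit immediate."""
    opcode = 0xB8 + REG32[reg]
    # The immediate operand follows the opcode in little-endian byte order.
    return bytes([opcode]) + struct.pack("<I", value)


# The x86-64 `syscall` instruction has a fixed two-byte encoding.
SYSCALL = bytes([0x0F, 0x05])

print(encode_mov_r32_imm32("eax", 60).hex(" "))  # b8 3c 00 00 00
print(SYSCALL.hex(" "))                          # 0f 05
```

The same five bytes would mean something entirely different, or nothing at all, on another architecture, which is what "specific to a particular architecture" amounts to in practice.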
Format
Consider the following "Hello World" Assembly code written for Linux on the x86_64 architecture using NASM syntax:
section .text           ; place the following code in the ".text" section
global _start           ; make the _start symbol visible to the linker

; Symbols
stdout    equ 1
sys_write equ 1
sys_exit  equ 60

_start:
    mov rdx, len        ; load the message length into rdx
    mov rsi, hello      ; load the message address into rsi
    mov rdi, stdout     ; load the stdout file descriptor into rdi
    mov rax, sys_write  ; load the sys_write number into rax
    syscall             ; invoke the kernel
    mov rdi, 0          ; exit code
    mov rax, sys_exit   ; load the sys_exit number into rax
    syscall             ; invoke the kernel

section .rodata         ; switch to the read-only ".rodata" section
hello db 'Hello, World!',0x0A ; message
len   equ $ - hello     ; message length
Let me explain:
- section .text is a directive that specifies that the following instructions/data should be placed in the ".text" section of the output ELF file.
- section .rodata is a similar directive that specifies that the following instructions/data should be placed in the .rodata section of the output ELF file, which holds read-only data (data that is write-protected in memory). Alternatively, the message could be placed in the .data section, which is for read-write data.
- global _start is a directive that makes the following symbol visible to the linker. By default, symbols are local to the object file and are not exported. In this case, the linker needs to know the value of the special symbol _start to know where execution is to begin in the program (which is not always at the start of the .text section).
- _start is a label that is equivalent to the memory location of the first instruction in the program.
- hello is a label that is equivalent to the memory location of the first byte of the string "Hello, World!\n".
- equ is a directive that sets a symbol (len) equal to the value of an expression (in this example, "$ - hello", meaning the current assembly position minus the value of the label "hello").
- instructions inside _start are indented by convention, to make them easy to tell apart from labels; NASM itself does not require the indentation. Each line consists of an instruction and 0 or more operands.
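The way `len equ $ - hello` arrives at the message length can be sketched with a toy model of an assembler's location counter. This is an illustration in Python, not how NASM is implemented; the class `DataSection` and its methods are hypothetical names, and the offsets are section-relative (the linker assigns the final addresses).

```python
class DataSection:
    """Toy model of an assembler's location counter for one section."""

    def __init__(self):
        self.contents = bytearray()
        self.symbols = {}

    @property
    def here(self):
        # Plays the role of NASM's `$`: the offset of the next byte to emit.
        return len(self.contents)

    def label(self, name):
        # A label records the current position under a name.
        self.symbols[name] = self.here

    def db(self, *items):
        # Emit bytes: strings are stored as ASCII, integers as single bytes.
        for item in items:
            if isinstance(item, str):
                self.contents += item.encode("ascii")
            else:
                self.contents.append(item)

    def equ(self, name, value):
        # A constant symbol: computed once, never stored in the section.
        self.symbols[name] = value


rodata = DataSection()
rodata.label("hello")
rodata.db("Hello, World!", 0x0A)
rodata.equ("len", rodata.here - rodata.symbols["hello"])

print(rodata.symbols)  # {'hello': 0, 'len': 14}
```

Because `len` is computed from positions rather than typed in by hand, it stays correct if the message text changes.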
Note that symbols are not variables - they are constants calculated at assembly or link time. However, a symbol may hold the address of a variable. Note also that the syntax will vary from assembler to assembler and from architecture to architecture.
Source and comparison between ARM and x86
Let's assemble and link the code above:
$ nasm -o hello.o -f elf64 hello.s
$ ld -o hello hello.o
$ ./hello
Hello, World!
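For comparison, the same run-time behaviour can be sketched in a high-level language. The Python below is only an analogy: `os.write` is a thin wrapper around the same `write` system call that the assembly invokes with `rax = 1`, and the function name `say_hello` is a hypothetical helper for this post.

```python
import os

STDOUT = 1                      # same file descriptor the assembly loads into rdi
MESSAGE = b"Hello, World!\n"    # same 14 bytes stored at the hello label


def say_hello(fd: int = STDOUT) -> int:
    """Write the message to fd, as sys_write does, and return the byte count."""
    return os.write(fd, MESSAGE)


if __name__ == "__main__":
    say_hello()
    # The assembly then calls sys_exit with status 0; Python exits with
    # status 0 by itself when the script ends normally.
```

The three register loads before the first `syscall` correspond to the arguments here: the descriptor, the buffer, and (implicitly, via the length of the bytes object) the byte count.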
Now we can take a look at the binary file produced by the assembler and linker:
$ hexdump -C hello
... snipped ...
00002000 48 65 6c 6c 6f 2c 20 57 6f 72 6c 64 21 0a 00 00 |Hello, World!...|
... snipped ...
000020f0 00 10 40 00 00 00 00 00 00 00 00 00 00 00 00 00 |..@.............|
00002100 2d 00 00 00 10 00 02 00 0e 20 40 00 00 00 00 00 |-........ @.....|
00002110 00 00 00 00 00 00 00 00 39 00 00 00 10 00 02 00 |........9.......|
00002120 0e 20 40 00 00 00 00 00 00 00 00 00 00 00 00 00 |. @.............|
00002130 40 00 00 00 10 00 02 00 10 20 40 00 00 00 00 00 |@........ @.....|
00002140 00 00 00 00 00 00 00 00 00 68 65 6c 6c 6f 2e 73 |.........hello.s|
00002150 00 73 74 64 6f 75 74 00 73 79 73 5f 77 72 69 74 |.stdout.sys_writ|
00002160 65 00 73 79 73 5f 65 78 69 74 00 68 65 6c 6c 6f |e.sys_exit.hello|
00002170 00 6c 65 6e 00 5f 5f 62 73 73 5f 73 74 61 72 74 |.len.__bss_start|
00002180 00 5f 65 64 61 74 61 00 5f 65 6e 64 00 00 2e 73 |._edata._end...s|
00002190 79 6d 74 61 62 00 2e 73 74 72 74 61 62 00 2e 73 |ymtab..strtab..s|
000021a0 68 73 74 72 74 61 62 00 2e 74 65 78 74 00 2e 64 |hstrtab..text..d|
000021b0 61 74 61 00 00 00 00 00 00 00 00 00 00 00 00 00 |ata.............|
000021c0 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
... snipped ...
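The dump above shows the executable in its raw binary form, which is hard to read. This excerpt covers the message string and the symbol and section name tables; the encoded instructions are stored in the file in the same way, as plain bytes. A byte can be part of an instruction, an operand, or data. Take a moment to read this page and watch Ben Eater's educational YouTube channel.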
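A hex dump becomes easier to navigate once the file header is decoded. The following is a rough sketch, assuming a little-endian ELF64 file such as the one built above; the function name `read_elf64_header` is hypothetical, and the sample header is synthetic, with made-up values, so the example runs without the `hello` binary.

```python
import struct

# Layout of the 64-byte ELF64 file header, little-endian:
# e_ident[16], e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
# e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize, e_shnum, e_shstrndx
ELF64_HEADER = struct.Struct("<16sHHIQQQIHHHHHH")


def read_elf64_header(data: bytes) -> dict:
    """Decode a few fields of a little-endian ELF64 header."""
    fields = ELF64_HEADER.unpack_from(data)
    ident = fields[0]
    if ident[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    if ident[4] != 2 or ident[5] != 1:
        raise ValueError("this sketch handles only little-endian ELF64")
    return {
        "machine": fields[2],        # 62 means x86-64
        "entry": fields[4],          # address where execution starts
        "section_count": fields[12],
    }


# Synthetic header for demonstration only; the values are made up.
sample = ELF64_HEADER.pack(
    b"\x7fELF\x02\x01\x01" + bytes(9),  # magic, 64-bit, little-endian, version 1
    2, 62, 1,                           # executable, x86-64, version 1
    0x401000, 64, 0,                    # entry point, program/section header offsets
    0, 64, 56, 0, 64, 0, 0,
)
info = read_elf64_header(sample)
print(hex(info["entry"]), info["machine"])  # 0x401000 62
```

To try it on a real file, read the first 64 bytes in binary mode (for example `open("hello", "rb").read(64)`) and pass them to the function.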
Conclusion
Any CPU operates on machine code, which an assembler generates from
Assembly language. Assembly language consists of
architecture-specific instructions, symbols and directives. Modern
CPUs are unfathomably complex, their documentation runs to thousands
of pages, and it is hard to comprehend what is going on at the
hardware level. That is why most developers tend to use high-level
languages. Still, I will continue exploring Assembly and
architectures to get more insights into this fascinating topic.
Author: Iurii Kondrakov
Email: deezzir@gmail.com
GitHub: github.com