Inline Assembly & Assembly Intrinsics
Introduction
Generally speaking, inline assembly provides the ability to embed
assembly language directly into the source code of a higher-level
language. Developers write inline assembly in languages that allow
low-level code within a program. Typically, programmers use the C
language for optimization tuning and therefore write assembly within C
source code. Compiler intrinsics, on the other hand, let developers use
instructions that the compiler would not generate on its own. Examples
include the MMX and AVX functions for x86 and the NEON and SVE/SVE2
functions for ARM.
Benefits
Embedding inline assembly allows the software to be optimized for
different architectures. It gives access to processor-specific
instructions, calling conventions, macros, syscalls, and directives,
which some commonly implemented algorithms can benefit from. In
addition, a developer may choose to use compiler intrinsics, which work
almost the same way but are more readable and flexible. Intrinsics are
architecture-specific functions that a compiler translates to assembly
code during compilation.
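To make the idea of an intrinsic concrete, here is a minimal sketch that adds eight 16-bit values with a single vector instruction. The helper name add8_s16 is hypothetical, and the code assumes GCC or Clang on x86-64 (SSE2) or AArch64 (NEON), with a plain C loop as a fallback elsewhere:

```c
#include <stdint.h>
#if defined(__x86_64__)
#include <emmintrin.h>   /* SSE2 intrinsics, always available on x86-64 */
#elif defined(__aarch64__)
#include <arm_neon.h>    /* NEON intrinsics */
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for i in 0..7. */
void add8_s16(const int16_t *a, const int16_t *b, int16_t *dst)
{
#if defined(__x86_64__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);   /* unaligned 128-bit load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)dst, _mm_add_epi16(va, vb));
#elif defined(__aarch64__)
    vst1q_s16(dst, vaddq_s16(vld1q_s16(a), vld1q_s16(b)));
#else
    /* Portable fallback for targets this sketch does not cover. */
    for (int i = 0; i < 8; i++)
        dst[i] = (int16_t)(a[i] + b[i]);
#endif
}
```

Each intrinsic call looks like an ordinary C function, so the compiler can still allocate registers and schedule the code around it.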
Drawbacks
Inline assembly may, in some cases, pose a problem because it
complicates code analysis by the compiler. Compiler optimizations may
not kick in, leaving the code as is and possibly decreasing overall
performance. In addition, inline assembly may contain unsafe code that
crashes during execution. Finally, it may cause porting and maintenance
issues in the future, as it is tied to a particular architecture.
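One common way to limit the porting cost is to guard each inline assembly variant with preprocessor checks and keep a plain C fallback. The sketch below is only an illustration: swap_bytes32 is a hypothetical helper, and it assumes GCC or Clang, whose operand syntax is covered in the sections that follow:

```c
#include <stdint.h>

/* Hypothetical helper: reverses the byte order of a 32-bit value. */
uint32_t swap_bytes32(uint32_t x)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __asm__("bswap %0" : "+r"(x));        /* x86-64 byte swap */
#elif defined(__GNUC__) && defined(__aarch64__)
    __asm__("rev %w0, %w0" : "+r"(x));    /* AArch64 byte reverse */
#else
    /* Portable C fallback for every other target or compiler. */
    x = (x >> 24) | ((x >> 8) & 0x0000FF00u) |
        ((x << 8) & 0x00FF0000u) | (x << 24);
#endif
    return x;
}
```

Every new architecture still needs its own branch, which is exactly the maintenance burden described above.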
Inline assembly Syntax
To demonstrate inline assembly, we will use the GCC compiler. We
can choose from these two forms to embed low-level code in a C source
file:
asm(...);
__asm__ (...); // With double underscores
We can also include the volatile keyword to instruct the compiler
not to delete the code, or hoist it out of a loop, during optimization:
asm volatile (...);
__asm__ __volatile__ (...); // With double underscores
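A typical case for volatile is an instruction whose result changes between calls, such as reading a hardware counter. The following is a minimal sketch: read_counter is a hypothetical helper, "a" and "d" are x86-specific GCC constraints for the eax and edx registers, and the AArch64 branch assumes the virtual counter is readable from user space, as it normally is on Linux:

```c
#include <stdint.h>

/* Hypothetical helper: reads a free-running hardware counter. */
uint64_t read_counter(void)
{
#if defined(__x86_64__)
    uint32_t lo, hi;
    /* rdtsc places the time-stamp counter in edx:eax. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#elif defined(__aarch64__)
    uint64_t t;
    /* Read the virtual count register. */
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(t));
    return t;
#else
    return 0;   /* this sketch reads no counter on other targets */
#endif
}
```

Without volatile, the compiler could treat two calls as producing the same value and merge them, because the asm has no inputs that differ.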
Inside the parentheses, there are up to four sections separated by
colons:
asm("assembly"
: output operands /* optional */
: input operands /* optional */
: list of clobbered registers /* optional */
);
Programmers may write the assembly code as one or more quoted strings
separated by whitespace; the compiler concatenates adjacent strings.
Each instruction must be separated from the next one, either with a
semicolon (;) or with an explicit newline (\n), optionally followed by
a tab (\t). For instance:
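// No operand sections here, so register names take a single %
asm("mov $0, %rdi\n\t" // exit status 0
"mov $60, %rax\n\t" // 60 is the exit syscall number on x86-64 Linux
"syscall"
); // equivalent to exit(0)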
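The two separators can be mixed freely. The sketch below uses a hypothetical helper add_three that issues three add instructions; it assumes GCC or Clang on x86-64 or AArch64 and falls back to plain C elsewhere. It uses the operand sections described next, so register names in the x86-64 branch are supplied by the compiler through %0:

```c
/* Hypothetical helper: adds 3 to x with three separate instructions. */
int add_three(int x)
{
#if defined(__x86_64__)
    __asm__("add $1, %0; add $1, %0\n\t"   /* ';' and "\n\t" both end an instruction */
            "add $1, %0"                   /* adjacent strings are concatenated */
            : "+r"(x)
            :
            : "cc");                       /* add modifies the flags */
#elif defined(__aarch64__)
    __asm__("add %w0, %w0, #1\n\t"
            "add %w0, %w0, #1\n\t"
            "add %w0, %w0, #1"
            : "+r"(x));
#else
    x += 3;                                /* portable fallback */
#endif
    return x;
}
```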
Input and Output Operands
These operands are optional. Each one is written as a constraint string, with an optional modifier, followed by a C expression in parentheses. Operands are useful when we need to get a value from a C variable, process it inside the asm statement, and perhaps save the result to another variable.
Constraints:
- r - any general-purpose register is permitted
- 0-9 - the operand must use the same register as the operand with the matching number
- i - an immediate integer value is permitted
- F - an immediate floating-point value is permitted
Modifiers:
- = - output-only operand - previous contents are discarded and replaced with the output value
- + - input and output operand - this register is used both to pass input data to the asm code and to receive a value from it
- & - earlyclobber operand - the asm code may write this operand before it has finished reading its inputs, so the compiler must not place it in a register that also holds an input
- % - declares that this operand and the following operand are commutable (interchangeable) for optimization. Only one commutable pair may be specified.
int x=69, y;
__asm__ ("mov %1,%0"
: "=r"(y) // output: the register that receives y's new value is %0 in the template
: "r"(x) // input: the register holding x's value is %1 in the template
:
);
Alternatively, we can give symbolic names to the registers holding the variables:
int x=69, y;
__asm__ ("mov %[in],%[out]"
: [out]"=r"(y) // write-only output, referred to as %[out] (y is never read, so "=r" rather than "+r")
: [in]"r"(x) // input register, referred to as %[in]
:
);
Note that constraints and modifiers may be architecture or compiler specific. The GCC documentation lists additional ones.
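The earlyclobber modifier and the matching-number constraint are easier to follow in a small runnable sketch. The helper names add_ints and increment are hypothetical, and the code assumes GCC or Clang on x86-64 or AArch64, with a plain C fallback elsewhere:

```c
/* Hypothetical helper: returns a + b.
   The output is written by the first instruction, before b is read,
   so "=&r" keeps it out of the registers that hold the inputs. */
int add_ints(int a, int b)
{
    int sum;
#if defined(__x86_64__)
    __asm__("mov %1, %0\n\t"
            "add %2, %0"
            : "=&r"(sum)
            : "r"(a), "r"(b)
            : "cc");
#elif defined(__aarch64__)
    __asm__("mov %w0, %w1\n\t"
            "add %w0, %w0, %w2"
            : "=&r"(sum)
            : "r"(a), "r"(b));
#else
    sum = a + b;
#endif
    return sum;
}

/* Hypothetical helper: returns x + 1.
   The "0" constraint places the input in the same register as operand 0. */
int increment(int x)
{
    int out;
#if defined(__x86_64__)
    __asm__("add $1, %0" : "=r"(out) : "0"(x) : "cc");
#elif defined(__aarch64__)
    __asm__("add %w0, %w0, #1" : "=r"(out) : "0"(x));
#else
    out = x + 1;
#endif
    return out;
}
```

Without the & in add_ints, the compiler would be free to give sum and b the same register, and the first mov would destroy b before the add reads it.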
Clobbers
The clobber section tells the compiler that the provided code may
overwrite the listed registers, memory regions or condition
flags.
asm("assembly"
: "=r"(out)
: "r"(in)
: "rax", "rsi"
);
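As a concrete case, the sketch below uses one fixed scratch register inside the asm code and lists it as clobbered, so the compiler neither keeps a live value in it nor assigns it to an operand. The helper add_square is hypothetical, and the code assumes GCC or Clang on x86-64 or AArch64:

```c
/* Hypothetical helper: returns x + y*y using a fixed scratch register. */
int add_square(int x, int y)
{
#if defined(__x86_64__)
    __asm__("mov %1, %%ecx\n\t"        /* scratch = y */
            "imul %%ecx, %%ecx\n\t"    /* scratch = y * y */
            "add %%ecx, %0"            /* x += scratch */
            : "+r"(x)
            : "r"(y)
            : "rcx", "cc");            /* rcx and the flags are overwritten */
#elif defined(__aarch64__)
    __asm__("mul w9, %w1, %w1\n\t"     /* scratch = y * y */
            "add %w0, %w0, w9"         /* x += scratch */
            : "+r"(x)
            : "r"(y)
            : "x9");                   /* x9 (w9) is overwritten */
#else
    x += y * y;                        /* portable fallback */
#endif
    return x;
}
```

Letting the compiler pick the scratch register through an extra output operand is often preferable, since it leaves more freedom to the register allocator.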
Additionally, we can include a "memory" string if the code reads or
writes memory that is not listed among its operands. It tells the
compiler that memory may change during inline assembly execution, so the
compiler writes values it holds in registers back to memory before the
asm code runs and reloads them afterwards. It is good practice to put
"memory" in volatile asm code that touches memory.
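The following sketch shows a store performed inside the asm code. Only the pointer is passed as an operand, so without "memory" the compiler would not know that the pointed-to object changes. The helper store_int is hypothetical, and the code assumes GCC or Clang on x86-64 or AArch64:

```c
/* Hypothetical helper: writes v to *p from inside the asm code. */
void store_int(int *p, int v)
{
#if defined(__x86_64__)
    __asm__ __volatile__("movl %1, (%0)"
                         :
                         : "r"(p), "r"(v)
                         : "memory");   /* *p changes behind the compiler's back */
#elif defined(__aarch64__)
    __asm__ __volatile__("str %w1, [%0]"
                         :
                         : "r"(p), "r"(v)
                         : "memory");
#else
    *p = v;                             /* portable fallback */
#endif
}
```

A caller that keeps the target variable cached in a register will reload it after the call because of the clobber.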
Examples
Following the Algorithm Selection Post, the vol4 and vol5 algorithms use inline assembly and intrinsics, respectively. Both snippets produce the same assembly code:
Inline
while (in_cursor < limit) {
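__asm__(
"ldr q0, [%[in_cursor]], #16 \n\t" // load 8 samples, advance in_cursor by 16 bytes
"sqrdmulh v0.8h, v0.8h, v1.8h \n\t" // scale by the volume factor, assumed to be loaded into v1 before the loop
"str q0, [%[out_cursor]],#16 \n\t" // store 8 samples, advance out_cursor by 16 bytes
: [in_cursor] "+r"(in_cursor), [out_cursor] "+r"(out_cursor)
: // no separate inputs needed: "+r" already passes both pointers in
: "v0", "memory" // v0 is overwritten by the load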
);
}
Intrinsics
while (in_cursor < limit) {
vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));
in_cursor += 8;
out_cursor += 8;
}
Disassembly
400734: 3cc10440 ldr q0, [x2], #16
400738: 6e61b400 sqrdmulh v0.8h, v0.8h, v1.8h
40073c: 3c810420 str q0, [x1], #16
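For readers unfamiliar with sqrdmulh, the sketch below is a scalar C model of what the loop computes for each sample, based on the Arm definition of the instruction (saturating rounding doubling multiply, returning the high half). It is not the code from the post: sqrdmulh_s16 and scale_samples are hypothetical names, vol_int is assumed to be a Q15 fixed-point volume factor, and the NEON path is compiled only on AArch64:

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scalar model of SQRDMULH on one 16-bit lane:
   saturate((2*a*b + 0x8000) >> 16).
   Assumes >> on a negative value is an arithmetic shift,
   which GCC and Clang provide. */
int16_t sqrdmulh_s16(int16_t a, int16_t b)
{
    int64_t r = (2 * (int64_t)a * b + 0x8000) >> 16;
    if (r > INT16_MAX) r = INT16_MAX;   /* only reached for a == b == -32768 */
    if (r < INT16_MIN) r = INT16_MIN;
    return (int16_t)r;
}

/* Scales n samples by vol_int (assumed Q15 fixed point). */
void scale_samples(const int16_t *in, int16_t *out, size_t n, int16_t vol_int)
{
    size_t i = 0;
#if defined(__aarch64__)
    /* Eight samples per iteration, same intrinsics as the snippet above. */
    for (; i + 8 <= n; i += 8)
        vst1q_s16(out + i,
                  vqrdmulhq_s16(vld1q_s16(in + i), vdupq_n_s16(vol_int)));
#endif
    /* Remaining samples, or all of them on other targets. */
    for (; i < n; i++)
        out[i] = sqrdmulh_s16(in[i], vol_int);
}
```

With vol_int set to 16384, which represents 0.5 in Q15, each sample is halved and rounded.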
You can find the code above
here.
Conclusion
In my opinion, inline assembly is a powerful tool for optimization
and portability. It allows software developers to introduce
architecture-targeted low-level code that executes only on specific
machines. As a result, the most performance-sensitive parts of a given
algorithm can be tuned efficiently, avoiding optimizations that will
not work on another architecture, but only if the code is carefully
implemented.
Resources
Author: Iurii Kondrakov
Email: deezzir@gmail.com
GitHub: github.com