Inline Assembly & Assembly Intrinsics

Introduction

Generally speaking, the inline assembly provides the ability to embed assembly language directly into the source code. Usually, developers write inline assembly code with languages that allow low-level code within a program. Typically, programmers use the C language for optimization tuning, therefore writing assembly within C source code. On the other hand, compiler intrinsics allow to development of a code that a compiler will not be able to generate. It includes AVX and MMX functions for x86 and NEON and SVE/SVE2 functions for ARM.

Benefits

Embedding inline assembly languages allows the software to be more optimized for different architectures. It introduces access to processor-specific instructions, calling conventions, macros, syscalls, and directives, which some commonly implemented algorithms can profit off. In addition, a developer may choose to use the compiler Intrinsics, which work almost the same but have more readable and flexible implementation. Intrinsics are architecture-specific functions that a compiler translates to the assembly code during compilation.

Drawbacks

Inline assembly, in some cases, may impose a problem as it complicates the process of code analysis by a compiler, meaning that compiler-specific optimization may not kick in, leaving the coda as is and possibly decreasing the performance overall. In addition, an inline assembly may carry an unsafe code that might crash during execution. Finally, it may pose future porting and maintenance issues as it is tied to a particular architecture.

Inline assembly Syntax

To demonstrate inline assembly, we will use the GCC compiler. We can choose from these two forms to embed low-level code in a C source file:

asm(...);
__asm__ (...);                  // With double underscores

We can also include the volatile keyword to instruct the compiler not to move the code as a result of optimization or other manipulations:

asm volatile (...);
__asm__ __volatile__ (...);     // With double underscores

Inside the parenthesis, there are up to four sections separated by colons:

asm("assembly"
        : output operands                /* optional */
        : input operands                 /* optional */
        : list of clobbered registers    /* optional */
);

Programmers may write the assembly code as one or more strings enclosed in quotes with whitespace as a separator. Each instruction should be separated as well, and you can also choose to use a semicolon(;), explicit newline(\n) with or without tab(\t). For instance:

asm("mov $0, %%rdi\n\t"
    "mov $60, %%rax\n\t" 
    "syscall"
); // return 0;

Input and Output Operands

These operands are optional, but we can specify input and output variables inside a parenthesis with a constraint string and an optional modifier. It is useful when we need to get a value from a C variable, process it inside asm closure and perhaps save it to another variable.

Contraints:

r - any general-purpose register is permitted
0-9 - the same register used in the matching number operand should be used
i - an immediate integer value is permitted
F - an immediate floating-point value is permitted

Modifiers:

= - output-only register - previous contents are discarded and replaced with output value
+ - input and output register - this register will be used to both pass input data to the asm code, and to receive a value from the asm code
& - earlyclobber register - this value may be overwritten before input is processed, therefore it must not be used for input
% - declares that this operand and the following operand are commutable (interchangeable) for optimization. Only one commutable pair may be specified.

int x=69, y;
__asm__ ("mov %1,%0"
   : "=r"(y)    // output register with y address is called %0 in template
   : "r"(x)     // input register with x value is called %1 in template
   :
);

or we can also specify the name for the registers holding the variables:

int x=69, y;
__asm__ ("mov %[in],%[out]"
   : [out]"+r"(y)  // read/write register may be called %[out]
   : [in]"r"(x)    // register may be called %[in]
   :
);

Note that contraints and modifiers might be arhitecture or library specific. There are additional for gcc here.

Clobbers

The clobber section instructs the compiler that the provided code may overwrite listed registers, memory regions or condition flags.

asm("assembly" 
    : "=r"(in) 
    : "r"(out)
    : "rax", "rsi"
);

Additionally, we can include a "memory" string if the code may alter memory regions. It will tell the compiler that the memory may change during inline assembly execution, so the compiler reloads any altered registers from memory after the asm execution. It is good practice to put "memory" in volatile asm code.

Examples

Following the Algorithm Selection Post, the vol4 and vol5 algorithms use inline assembly and intrinsic, respectively. Both snippets produce the same assembly code:

Inline

while (in_cursor < limit) {
  __asm__(
    "ldr q0, [%[in_cursor]], #16 	\n\t"
    "sqrdmulh v0.8h, v0.8h, v1.8h	\n\t"
    "str q0, [%[out_cursor]],#16	\n\t"

    : [in_cursor] "+r"(in_cursor), [out_cursor] "+r"(out_cursor)
    : "r"(in_cursor), "r"(out_cursor)
    : "memory"
  );
}

Intrinsics

while (in_cursor < limit) {
  vst1q_s16(out_cursor, vqrdmulhq_s16(vld1q_s16(in_cursor), vdupq_n_s16(vol_int)));
  in_cursor  += 8;
  out_cursor += 8;
}

Disassembly

400734:  3cc10440   ldr     q0, [x2], #16
400738:  6e61b400   sqrdmulh        v0.8h, v0.8h, v1.8h
40073c:  3c810420   str     q0, [x1], #16

You can find the code above here.

Conclusion

In my opinion, an inline assembly is a powerful tool for optimization and portability. It allows software developers to introduce architecture targeted low-level code to execute on specific machines. Meaning that the most performance-sensitive parts of the given algorithm will be tuned efficiently, avoiding optimizations that will not work on another architecture. But only if carefully implemented.

Resources

Author: Iurii Kondrakov

Email: deezzir@gmail.com

GitHub: github.com

SPO600 Blog

Search This Blog

SPO600 Project Analysis