Algorithm Selection on AArch64 using SVE2

Introduction

This post is the second part of the Algorithm Selection series. The goal is to write a new version of the volume scaling code from the previous post, this time using the Scalable Vector Extension 2 (SVE2) SIMD instructions for AArch64. After that, we will analyze the assembly code generated by the compiler.

Code using Assembly intrinsics

First of all, we need to include the <arm_sve.h> header, which declares the SVE and SVE2 intrinsics:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#ifdef __aarch64__
#include <arm_sve.h>
#endif

#include "vol.h"

// The algorithm below uses ARM SVE2 assembly instructions accessed through GCC compiler intrinsics (specific to AArch64)
int main(void) {
#ifndef __aarch64__
	printf("Wrong architecture - written for aarch64 only.\n");
#else
	int16_t* in	 = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
	int16_t* out     = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
	int16_t  vol_int = (int16_t)  (VOLUME/100.0 * 32767.0);
	int      ttl     = 0;

	vol_createsample(in, SAMPLES);

	int32_t ix  = 0;
	svbool_t pd = svwhilelt_b16(ix, SAMPLES);
	do {
		svint16_t in_vec = svld1_s16(pd, &in[ix]);
		svst1_s16(pd, &out[ix], svqrdmulh_n_s16(in_vec, vol_int));

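		// Note: svcntd() counts 64-bit lanes, a quarter of the 16-bit lanes
		// handled per iteration, so iterations overlap (the result is still
		// correct). svcnth() would advance by one full vector of 16-bit lanes.
		ix += svcntd();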
		pd = svwhilelt_b16(ix, SAMPLES);
	} while(svptest_any(svptrue_b16(), pd));

	for (size_t i = 0; i < SAMPLES; i++) {
		ttl = (ttl + out[i]) % 1000;
	}

	printf("Result: %d\n", ttl);  // Check sum to prevent compiler optimizations
#endif
	return 0;
}

Testing

To compile code containing SVE2 instructions for ARMv8 hardware, we need to pass the -march=armv8-a+sve2 option to GCC:
gcc -g -O1  -march=armv8-a+sve2  vol6.c vol_createsample.o -o vol6
To run the compiled program on hardware that does not implement SVE2, we run it under the QEMU user-mode emulator, which emulates the SVE2 instructions in software:
qemu-aarch64 ./vol6
The code successfully runs and produces the same result as the vol4 and vol5 algorithms.

Disassembly Analysis

4006e4:     937f7c41     sbfiz   x1, x2, #1, #32
4006e8:     8b010283     add     x3, x20, x1
4006ec:     a4a0a060     ld1h    {z0.h}, p0/z, [x3]
4006f0:     04617400     sqrdmulh        z0.h, z0.h, z1.h
4006f4:     8b010261     add     x1, x19, x1
4006f8:     e4a0e020     st1h    {z0.h}, p0, [x1]
4006fc:     04f0e3e2     incd    x2
400700:     25600441     whilelt p1.h, w2, w0
400704:     25814420     mov     p0.b, p1.b
400708:     2550c820     ptest   p2, p1.b
40070c:     54fffec1     b.ne    4006e4 <main+0x4c>  // b.any
The disassembly above is the body of the do-while loop where the samples are scaled. In this block, we can see SVE instructions operating on scalable vector registers (z0, z1) and predicate registers (p0, p1, p2). Because none of these instructions encodes a fixed vector width, the same code is portable and can run on ARM machines with different SVE vector lengths.

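We can see the ld1h and st1h instructions, which load data from the in array and store the scaled data to the out array, respectively.
Also, we have sqrdmulh (signed saturating rounding doubling multiply high), which multiplies each pair of 16-bit lanes as z0.h * z1.h * 2, rounds the product, and stores the saturated upper 16 bits in z0.
Next, sbfiz turns ix into a byte offset (ix * 2), and the add instructions use that offset to get the addresses of in[ix] and out[ix].
Next, incd increases ix by the number of 64-bit doublewords in a vector, as requested by svcntd(). Each iteration processes four times as many 16-bit lanes, so consecutive iterations overlap; svcnth() would advance ix by exactly the number of lanes processed.
Finally, whilelt and ptest ensure that the vectors do not touch data beyond the end of the arrays. whilelt builds the predicate for the elements that remain, ptest sets the condition flags from it, and b.ne (b.any) branches back to the top of the loop while any lane is still active.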

Conclusion

While working on this blog post, I learned how to use SVE2 through compiler intrinsics, which are more flexible and readable than inline assembly. The challenge was to find the right intrinsics in the official documentation and put them all together. But it was exciting to read about this powerful technology.
Author: Iurii Kondrakov 
GitHub: github.com

P.S. This blog post was created for SPO600 Lab 6.
