Algorithm Selection on AArch64 using SVE2
Introduction
This post is the second part of Algorithm Selection. Our focus is to
create a new version of the volume scaling code from the previous post but
using Scalable Vector Extension 2 for ARM SIMD. After that, we will
analyze the assembly code generated by the compiler.
Code using Assembly intrinsics
First of all, we need to include the <arm_sve.h> header, which
declares the SVE and SVE2 intrinsics:
#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#ifdef __aarch64__
#include <arm_sve.h>
#endif
#include "vol.h"
// The algorithm below uses ARM SVE2 assembly instructions accessed through GCC compiler intrinsics (specific to AArch64)
int main(void) {
#ifndef __aarch64__
printf("Wrong architecture - written for aarch64 only.\n");
#else
int16_t* in = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
int16_t* out = (int16_t*) calloc(SAMPLES, sizeof(int16_t));
int16_t vol_int = (int16_t) (VOLUME/100.0 * 32767.0);
int ttl = 0;
vol_createsample(in, SAMPLES);
int32_t ix = 0;
svbool_t pd = svwhilelt_b16(ix, SAMPLES);
do {
svint16_t in_vec = svld1_s16(pd, &in[ix]);
svst1_s16(pd, &out[ix], svqrdmulh_n_s16(in_vec, vol_int));
ix += svcntd(); // Steps by the number of 64-bit lanes; svcnth() would step by the full number of 16-bit lanes
pd = svwhilelt_b16(ix, SAMPLES);
} while(svptest_any(svptrue_b16(), pd));
for (size_t i = 0; i < SAMPLES; i++) {
ttl = (ttl + out[i]) % 1000;
}
printf("Result: %d\n", ttl); // Check sum to prevent compiler optimizations
#endif
return 0;
}
Testing
To compile code that uses SVE2, we need to pass the
-march=armv8-a+sve2 option to GCC:
gcc -g -O1 -march=armv8-a+sve2 vol6.c vol_createsample.o -o vol6
To run the compiled program, we use QEMU in user mode, which emulates the SVE2 instructions in software on hardware that does not implement them:
qemu-aarch64 ./vol6
The code successfully runs and produces the same result as the vol4 and vol5 algorithms.
Disassembly Analysis
4006e4: 937f7c41 sbfiz x1, x2, #1, #32
4006e8: 8b010283 add x3, x20, x1
4006ec: a4a0a060 ld1h {z0.h}, p0/z, [x3]
4006f0: 04617400 sqrdmulh z0.h, z0.h, z1.h
4006f4: 8b010261 add x1, x19, x1
4006f8: e4a0e020 st1h {z0.h}, p0, [x1]
4006fc: 04f0e3e2 incd x2
400700: 25600441 whilelt p1.h, w2, w0
400704: 25814420 mov p0.b, p1.b
400708: 2550c820 ptest p2, p1.b
40070c: 54fffec1 b.ne 4006e4 <main+0x4c> // b.any
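The disassembly above is the do-while loop in which the samples are scaled. In this block, we can see SVE instructions operating on scalable vector registers (z0, z1) and predicate registers (p0 to p2). None of these instructions encodes a fixed vector length, so the same binary can run on ARM machines with different SIMD register sizes.
We can see the ld1h and st1h instructions, which load 16-bit samples from the in array and store the scaled samples to the out array, respectively. Both are governed by the predicate p0, so only active lanes are read or written.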
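We also have sqrdmulh, which multiplies each 16-bit lane of z0 by the matching lane of z1 (which holds vol_int in every lane), doubles the product, rounds it, and stores the saturated high 16 bits of the result back in z0. Because vol_int is the volume factor scaled by 32767, this amounts to multiplying each sample by the volume factor.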
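Next, sbfiz sign-extends the 32-bit index ix and shifts it left by one to get a byte offset (each sample is two bytes), and the add instructions add that offset to the array base addresses to get the addresses of in[ix] and out[ix].
Then incd increases ix by the number of 64-bit elements in a vector, matching the svcntd() call in the source. Since the samples are 16-bit, that is only a quarter of the lanes handled by each ld1h/st1h, so consecutive iterations overlap; the result is still correct, but svcnth() (inch) would advance by the full number of 16-bit lanes.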
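Finally, whilelt builds the predicate for the next iteration: a lane is active only while ix plus its lane number is still less than SAMPLES, so the last vector never loads or stores past the end of the arrays. The mov copies that predicate into p0 for the next ld1h and st1h, and ptest sets the condition flags from it so that b.ne (b.any) branches back while at least one lane is still active.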
Conclusion
While working on this blog post, I learned how to use SVE2
through compiler intrinsics, which are more flexible and readable than
inline assembly. The challenge was to find the right intrinsics
in the official documentation and put them all together. But it was
exciting to read about this powerful technology.
Author: Iurii Kondrakov
Email: deezzir@gmail.com
GitHub: github.com
P.S. This blog post was created for SPO600 Lab 6.