Scalable Vector Extension 2 for ArmV9


Introduction

Intel and AMD x86 chips have dominated the computing industry for decades, but the architecture is facing its biggest competition ever with the rise of ARM. Not only is ARM hardware used in most handheld and embedded devices, but it is also moving into the server space and High-Performance Computing, and it even powers the new Apple Macs.
Recently, ARM released its next-generation architecture, ARMv9, which will set the course for the industry. The architecture has many compelling features, including the standardization of the Scalable Vector Extension 2 (SVE2), a Single Instruction Multiple Data (SIMD) instruction set that supports variable vector lengths. Note that SVE2 is a superset of the original SVE used in the Fugaku supercomputer. SVE was somewhat limited in scope and aimed mainly at HPC workloads, missing many of the more versatile instructions covered by NEON; SVE2 adds them.

Software developers have used SIMD instruction sets widely for years. Intel has several implementations, such as MMX, SSE2, and the AVX family, and ARM introduced the NEON SIMD set with the release of ARMv7.
But these implementations lack flexibility and portability because they target specific SIMD vector lengths. Code compiled for one platform may not work on another.
In contrast, SVE2 succeeds NEON and SVE and covers more functional domains of data-level parallelism. SVE2 inherits the concept, vector registers, and operation principles of SVE. Silicon manufacturers can choose a suitable vector length for their hardware, anywhere between 128 bits and 2048 bits in 128-bit increments. The design enables developers to write and build software once and run the same executables on different ARMv9 machines. Developers do not need to worry about the vector length a given chip implements, and they do not need to rebuild the software for it. That is what portability means here.
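The "build once, run on any vector length" idea can be sketched with the ACLE intrinsics from arm_sve.h. This is a minimal sketch, not the author's code: the function name add_arrays is hypothetical, and the scalar fallback is there only so the file also compiles for targets without SVE. The sketch assumes that a compiler targeting SVE2 also defines __ARM_FEATURE_SVE, since SVE2 builds on SVE.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* Hypothetical helper: dst[i] = a[i] + b[i] for n floats. */
void add_arrays(float *dst, const float *a, const float *b, int64_t n)
{
#if defined(__ARM_FEATURE_SVE)
    /* svcntw() returns the number of 32-bit lanes per vector at run time,
       so the loop step is never hard-coded into the binary. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        /* Lane k is active while i + k < n; this also handles the tail. */
        svbool_t pg = svwhilelt_b32(i, n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
#else
    /* Scalar fallback for targets without SVE. */
    for (int64_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
#endif
}
```

The loop never mentions a concrete vector width, which is why the same executable can run on 128-bit and 2048-bit implementations alike.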

SVE2 Architecture and Registers

SVE2 shares the existing features and tools that come with NEON. The architecture also adds the following registers:
  • 32 scalable vector registers: Z0-Z31
  • 16 scalable predicate registers: P0-P15
    • Plus the First Fault Register (FFR)
  • Scalable vector system control register: ZCR_ELx

Vectors

The figure on the left demonstrates the scalable vectors.
Each vector can hold 64-, 32-, 16-, or 8-bit elements, supporting integer and floating-point values. For instance, a vector with a 256-bit length can carry eight integers, each 32 bits long.
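The arithmetic behind that example is a plain division of the vector length by the element size. A minimal sketch follows; the helper name lanes_per_vector is hypothetical, and the assertions simply encode the ranges stated in this article.

```c
#include <assert.h>

/* Hypothetical helper: number of elements that fit in one scalable vector. */
unsigned lanes_per_vector(unsigned vector_bits, unsigned element_bits)
{
    /* Vector lengths are multiples of 128 bits, from 128 up to 2048. */
    assert(vector_bits >= 128 && vector_bits <= 2048 && vector_bits % 128 == 0);
    /* Elements are 8, 16, 32, or 64 bits wide. */
    assert(element_bits == 8 || element_bits == 16 ||
           element_bits == 32 || element_bits == 64);
    return vector_bits / element_bits;
}
```

For the case in the text, lanes_per_vector(256, 32) gives 8.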

Predicates

The figure on the left demonstrates the scalable predicates. Each predicate is 1/8 of the Zx vector size, which gives one predicate bit per vector byte.
P0-P7 registers govern load, store, and arithmetic instructions.
P8-P15 registers are for loop management.
The FFR register is for speculative memory access.

ZCR_ELx register

The figure on the left demonstrates the scalable vector system control registers.
The ZCR_ELx.LEN field sets the vector length for the current and lower exception levels. Some bits are reserved for future use.
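As a rough illustration of how the LEN field maps to a vector length, the sketch below assumes that LEN occupies bits [3:0] of the register and requests (LEN + 1) * 128 bits. Hardware may still pick a shorter length that it actually implements, which the sketch does not model. The function name zcr_requested_vl_bits is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper: vector length in bits requested by a ZCR_ELx value.
   Assumption: LEN is bits [3:0] and requests (LEN + 1) * 128 bits. */
unsigned zcr_requested_vl_bits(uint64_t zcr)
{
    unsigned len = (unsigned)(zcr & 0xFu);  /* keep only the LEN field */
    return (len + 1u) * 128u;
}
```

Under these assumptions, LEN = 0 requests 128 bits and LEN = 15 requests 2048 bits, matching the range given earlier.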

Instruction examples

LD1H (scalar plus immediate) - Contiguous load unsigned halfwords to vector:
LD1H { <Zt>.H }, <Pg>/Z, [<Xn|SP>{, #<imm>, MUL VL}]
The .S and .D forms of the instruction zero-extend each halfword into a 32-bit or 64-bit element.
Where:
  • <Zt> - the name of the scalable vector register to be loaded with data
  • <Pg> - the name of the governing scalable predicate register, P0-P7
  • <Xn|SP> - the 64-bit name of the general-purpose base register or stack pointer
  • <imm> - optional signed vector offset, in multiples of the vector length (MUL VL)
Example: 
ld1h {z0.h}, p0/z, [x3]
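What the example load does can be modelled in scalar C. This is only a behavioural sketch under simplifying assumptions: the predicate is represented as one bool per halfword lane, the immediate offset is zero, and the function name model_ld1h_h is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of: ld1h {zt.h}, pg/z, [base]
   zt and base hold `lanes` halfwords; pg holds one flag per lane. */
void model_ld1h_h(uint16_t *zt, const bool *pg, const uint16_t *base, size_t lanes)
{
    for (size_t k = 0; k < lanes; k++) {
        /* Active lanes load from memory; the /Z suffix zeroes inactive
           lanes, and inactive lanes do not read memory at all. */
        zt[k] = pg[k] ? base[k] : 0;
    }
}
```

On real hardware the lane count is the vector length divided by 16 bits, so the model would be called with lanes = 8 for a 128-bit implementation.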
WHILELT - While incrementing signed scalar less than scalar:
WHILELT <Pd>.<T>, <R><n>, <R><m>
Where:
  • <Pd> - the name of the destination scalable predicate register
  • <m> - the number of the general-purpose register that holds the total number of elements
  • <n> - the number of the general-purpose register that holds the number of processed elements
  • <T> - the size of the data being processed
    • D (double word) - 64-bit values
    • S (single word) - 32-bit values
    • H (half word) - 16-bit values
    • B (byte) - 8-bit values
  • <R> - register width specifier
    • x - 64-bit wide access register
    • w - 32-bit wide access register
Example: 
whilelt p1.h, w2, w0
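The predicate that WHILELT produces can be sketched in scalar C as well. The model below is an illustration, not the architectural pseudocode: it stores one bool per lane, ignores the condition flags the real instruction also sets, and assumes n + lane does not overflow. The function name model_whilelt is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of: WHILELT <Pd>.<T>, <R><n>, <R><m>
   Lane k becomes active while n + k < m. Returns the active lane count. */
size_t model_whilelt(bool *pd, size_t lanes, int64_t n, int64_t m)
{
    size_t active = 0;
    for (size_t k = 0; k < lanes; k++) {
        pd[k] = (n + (int64_t)k) < m;   /* signed comparison, as in WHILELT */
        if (pd[k]) {
            active++;
        }
    }
    return active;
}
```

With 8 halfword lanes, 5 processed elements, and 8 elements in total, only the first 3 lanes are active, so the last loop iteration touches exactly the remaining elements.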

Applications

SVE2 can accelerate common algorithms used in:
  • Computer vision (Machine Learning)
  • Multimedia (Video and Audio, Streaming)
  • Long-Term Evolution (LTE) baseband processing
  • Genomics (Gene Altering)
  • In-memory database
  • Web serving
  • General-purpose software

Building SVE2 Code

As of today (March 2022), there is no actual ARMv9 hardware on which we can build and test SVE2 code. However, there is a workaround: we can tell the compiler which architecture to target. The GCC and ARMClang compilers do this through the -march= (machine architecture) option. The option is necessary regardless of whether you use auto-vectorization, inline assembler, or intrinsics. The architecture specification for the currently available hardware is armv8-a+sve2. It compiles the code for ARMv8 machines with SVE2 enabled:
gcc -march=armv8-a+sve2 ...
armclang -march=armv8-a+sve2 ...
To invoke auto-vectorization in GCC version 11, use the -O3 optimization level or the appropriate feature option (-ftree-vectorize). ARMClang can also vectorize the code if -O3 is enabled:
gcc -O3 -march=armv8-a+sve2 ...
gcc -O2 -march=armv8-a+sve2 -ftree-vectorize ...
armclang -O3 -march=armv8-a+sve2 ...
To use SVE2 intrinsics in a C program, include the header file arm_sve.h, which ships with the compiler:
#include <arm_sve.h>
To detect SVE2 capability in the compilation target, use the following macro:
#if __ARM_FEATURE_SVE2
...
#endif
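One way to combine the header and the macro is to guard every use of the intrinsics, so the same source file still builds for targets without SVE2. The following is a minimal sketch under that assumption; the function name sve2_vector_bits is hypothetical, and returning 0 for the non-SVE2 build is just a convention chosen for this example.

```c
#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>
#endif

/* Hypothetical helper: vector length in bits of the machine running the
   code, or 0 when the file was compiled without SVE2 support. */
unsigned sve2_vector_bits(void)
{
#if defined(__ARM_FEATURE_SVE2)
    /* svcntb() returns the number of bytes in one scalable vector. */
    return (unsigned)(svcntb() * 8u);
#else
    return 0u;
#endif
}
```

Built with -march=armv8-a+sve2, the function reports the vector length chosen by the hardware or emulator; built without it, the function returns 0 and no SVE2 instruction is emitted.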

Run SVE2 Code

To run SVE2 code on ARMv8 hardware, use the QEMU user-mode emulator. It emulates the program in software, so SVE2 instructions that the ARMv8 host does not implement still execute:

qemu-aarch64 ./executable

Conclusion

The SVE2 instruction set is a powerful tool that offers portability and flexibility: build once, run on hardware with different vector length implementations. SVE2 dramatically widens the application area for the new ARMv9 chips, so we will probably see a great variety of hardware shortly. Last but not least, you can check the SVE2 example code here.

Resources

Author: Iurii Kondrakov 
GitHub: github.com
