Scalable Vector Extension 2 for ARMv9
Introduction
Intel and AMD x86 chips have dominated the computing industry for
decades, but the architecture now faces its biggest competition ever with
the rise of ARM. Not only is ARM hardware used in most handheld and
embedded devices, but it is also moving into the server space and
High-Performance Computing, and it even powers the new Apple Macs.
Recently, ARM released its next-generation architecture, ARMv9,
which will set the course for the industry. The architecture has many
compelling features, including the standardization of the Scalable Vector
Extension 2 (SVE2), a Single Instruction Multiple Data (SIMD)
instruction set that supports variable vector lengths. Note that SVE2
is a superset of the original SVE used in the Fugaku supercomputer. SVE was
somewhat limited in scope and aimed more at HPC workloads, missing many of
the more versatile instructions covered by NEON.
Software developers have used SIMD instruction sets widely for years.
Intel alone has several implementations, such as MMX, SSE2, and the AVX
family, and ARM introduced the NEON SIMD set with the release of ARMv7.
But these implementations lack flexibility and portability because
they target specific SIMD vector lengths. Code compiled for one
platform may not work on another.
In contrast, SVE2 succeeds NEON and SVE, allowing for more function
domains in data-level parallelism. SVE2 inherits the concept, vector
registers, and operation principles of SVE. Silicon manufacturers can
choose a vector length for their hardware anywhere between 128 bits and
2048 bits, in 128-bit increments. The design enables developers to write
and build software once and run the same executables on different ARMv9
machines. Developers therefore do not need to worry about the vector length
a particular chip implements, and the software does not have to be rebuilt
for each one. That is what portability means here.
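The "build once, run on any vector length" idea can be sketched in plain C without any SVE hardware. The helpers below (`is_valid_vl` and `sum_vla`, both hypothetical names introduced for illustration) model a vector-length-agnostic loop: the vector length is a run-time value, and the loop produces the same result whatever length is chosen. This is a scalar model under those assumptions, not real SVE2 code.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if vl_bits is a vector length the text allows:
   a multiple of 128 between 128 and 2048 bits. */
int is_valid_vl(unsigned vl_bits) {
    return vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0;
}

/* Scalar model of a vector-length-agnostic sum of 32-bit integers.
   Each outer iteration stands for one vector operation on `lanes`
   elements; the bounds check on the last, partial chunk plays the
   role of a predicate. Returns 0 on success and -1 if vl_bits is
   not a valid vector length. */
int sum_vla(const int32_t *a, size_t n, unsigned vl_bits, int64_t *out) {
    if (!is_valid_vl(vl_bits)) {
        return -1;
    }
    size_t lanes = vl_bits / 32;   /* 32-bit elements per vector */
    int64_t total = 0;
    for (size_t i = 0; i < n; i += lanes) {
        for (size_t lane = 0; lane < lanes; lane++) {
            if (i + lane < n) {    /* inactive lanes are skipped */
                total += a[i + lane];
            }
        }
    }
    *out = total;
    return 0;
}
```

Calling `sum_vla` on the same array with 128, 512, or 2048 bits gives the same total; only the number of outer iterations changes.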
SVE2 Architecture and Registers
SVE2 shares the existing features and tools that come with NEON. The
architecture also adds the following registers:
- 32 scalable vector registers: Z0-Z31
- 16 scalable predicate registers: P0-P15
- One First Fault predicate register (FFR)
- Scalable vector system control registers: ZCR_ELx
Vectors
The figure on the left demonstrates the scalable vectors.
Each vector can hold 64, 32, 16, and 8-bit elements, supporting integer
and floating-point values. For instance, a vector with a 256-bit length
can carry eight integers, each 32-bit long.
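The arithmetic behind that example is a single division. The function below is a small sketch (`lanes_per_vector` is a hypothetical name, not part of any SVE API) that computes how many elements of a given width fit in a vector of a given length.

```c
/* Number of elements of elem_bits width that fit in a vector of
   vl_bits. Returns 0 for element sizes other than 8, 16, 32, or 64. */
unsigned lanes_per_vector(unsigned vl_bits, unsigned elem_bits) {
    switch (elem_bits) {
    case 8:
    case 16:
    case 32:
    case 64:
        return vl_bits / elem_bits;
    default:
        return 0;
    }
}
```

For the case in the text, `lanes_per_vector(256, 32)` evaluates to 8.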
Predicates
The figure on the left demonstrates the scalable predicates. Each
predicate register is 1/8 the size of a Zx vector register.
The P0-P7 registers govern load, store, and arithmetic instructions.
The P8-P15 registers are for loop management.
Finally, the FFR register is for speculative memory access.
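The 1/8 ratio works out to one predicate bit for every byte of the vector. The sketch below states that relationship in C; both function names are hypothetical and exist only to illustrate the sizing rule.

```c
/* Size of one predicate register in bits for a given vector length:
   one predicate bit per byte of the vector (1/8 of the vector size). */
unsigned predicate_bits(unsigned vl_bits) {
    return vl_bits / 8;
}

/* Number of predicate bits that correspond to a single element of
   elem_bits width (one per byte of the element). */
unsigned predicate_bits_per_element(unsigned elem_bits) {
    return elem_bits / 8;
}
```

With a 256-bit vector, each predicate register holds 32 bits, and every 32-bit element corresponds to 4 of them.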
The figure on the left demonstrates the scalable vector system control registers (ZCR_ELx). The ZCR_ELx.LEN field sets the vector length for the current and lower exception levels. Some bits are reserved for future use.
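As a rough illustration of how the LEN field maps to a vector length, the sketch below assumes that LEN occupies the low four bits of the register and that the requested length is (LEN + 1) × 128 bits. Both points are assumptions not stated in this article, and `zcr_len_to_vl_bits` is a hypothetical helper name. Under that encoding, the smallest value gives 128 bits and the largest gives 2048 bits, which matches the range mentioned earlier.

```c
#include <stdint.h>

/* Decodes the vector length requested by a ZCR_ELx value.
   Assumption: LEN is in bits [3:0] and the length is (LEN + 1) * 128
   bits. All other bits are treated as reserved and ignored. */
unsigned zcr_len_to_vl_bits(uint64_t zcr) {
    unsigned len = (unsigned)(zcr & 0xF);
    return (len + 1) * 128;
}
```

The hardware may still cap the effective length at whatever the implementation supports, so this value is best read as an upper bound.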
Instructions example
LD1H (scalar plus immediate) - Contiguous load unsigned halfwords to vector:
LD1H { <Zt>.S }, <Pg>/Z, [<Xn|SP>{, #<imm>, MUL VL}]
Where:
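- <Zt> - the name of the scalable vector register to be loaded with data (the .S form shown loads into 32-bit elements; the example below uses the .H form)
- <Pg> - the name of the governing scalable predicate register, P0-P7
- <Xn|SP> - the 64-bit name of the general-purpose base register or stack pointer
- <imm> - the optional signed vector offset, in multiples of the vector length (MUL VL), defaulting to 0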
Example:
ld1h {z0.h}, p0/z, [x3]
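To make the effect of the predicate and the /z (zeroing) qualifier concrete, here is a scalar C model of the example instruction. `ld1h_model` is a hypothetical name, and the `lanes` parameter stands in for the number of 16-bit elements in the hardware vector. It is only a sketch of the behavior as described above, not an emulator.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of: ld1h {z.h}, pg/z, [base]
   Active lanes load consecutive 16-bit values from memory.
   Inactive lanes are set to zero (the /z qualifier), and their
   memory locations are never read. */
void ld1h_model(uint16_t *z, const bool *pg, const uint16_t *base,
                size_t lanes) {
    for (size_t i = 0; i < lanes; i++) {
        z[i] = pg[i] ? base[i] : 0;
    }
}
```

Because inactive lanes never touch memory, a predicate that covers only the first three lanes lets the load run safely on a three-element buffer even when the vector has eight lanes.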
WHILELT - While incrementing signed scalar less than scalar
WHILELT <Pd>.<T>, <R><n>, <R><m>
Where:
- <Pd> - the name of the destination scalable predicate register
- <m> - the number of the general-purpose register that holds the total number of elements
- <n> - the number of the general-purpose register that holds the number of elements already processed
- <T> - the size of the data elements being processed
  - D (doubleword) - 64-bit values
  - S (single word) - 32-bit values
  - H (halfword) - 16-bit values
  - B (byte) - 8-bit values
- <R> - the register width specifier
  - X - 64-bit general-purpose register
  - W - 32-bit general-purpose register
Example:
whilelt p1.h, w2, w0
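The predicate that WHILELT produces can be modeled in scalar C: lane i becomes active while the processed count plus i is still less than the total. `whilelt_model` below is a hypothetical helper, and `lanes` again stands in for the number of elements per hardware vector; the code is a sketch of the semantics described above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of: whilelt pd.<T>, n, m
   Sets pd[i] to true while n + i < m (signed comparison) and to
   false for every remaining lane. Returns the number of active lanes.
   The comparison uses the remaining count (m - n) so that n + i
   cannot overflow. */
size_t whilelt_model(bool *pd, size_t lanes, int64_t n, int64_t m) {
    size_t active = 0;
    for (size_t i = 0; i < lanes; i++) {
        pd[i] = (n < m) && ((uint64_t)i < (uint64_t)m - (uint64_t)n);
        if (pd[i]) {
            active++;
        }
    }
    return active;
}
```

In a loop over 20 elements with 8 lanes, the calls for n = 0 and n = 8 activate all 8 lanes, and the call for n = 16 activates only the first 4, which is how the final partial iteration is handled without a separate scalar tail loop.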
Applications
SVE2 will be able to accelerate the common algorithms used in:
- Computer vision (Machine Learning)
- Multimedia (Video and Audio, Streaming)
- Long-Term Evolution (LTE) baseband processing
- Genomics (Gene Altering)
- In-memory database
- Web serving
- General-purpose software
Building SVE2 Code
As of today (March 2022), there is no actual ARMv9 hardware on which
we can build and test SVE2 code. However, there is a workaround:
we can tell the compiler which architecture to target. The GCC and ARMClang compilers do this through the -march= option (machine architecture).
This option is necessary regardless of whether you're using
auto-vectorization, inline assembler, or intrinsics. The architecture
specification for the currently available hardware is
armv8-a+sve2, which compiles the code for ARMv8 machines with the SVE2 extension enabled:
gcc -march=armv8-a+sve2 ...
armclang -march=armv8-a+sve2 ...
To invoke auto-vectorization in GCC version 11,
use the -O3 optimization level, or a lower level together with the
appropriate feature option (-ftree-vectorize). ARMClang also vectorizes the code when -O3 is enabled:
gcc -O3 -march=armv8-a+sve2 ...
gcc -O2 -march=armv8-a+sve2 -ftree-vectorize ...
armclang -O3 -march=armv8-a+sve2 ...
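A loop of the following shape is a typical candidate for auto-vectorization with the options above. The function name `add_arrays` is hypothetical; the `restrict` qualifiers are a design choice that tells the compiler the arrays do not overlap, which usually makes vectorization easier. Whether a given compiler version actually emits SVE2 instructions for it is not guaranteed, so treat this as a sketch to experiment with.

```c
#include <stddef.h>
#include <stdint.h>

/* Element-wise addition of two arrays. The loop has no dependencies
   between iterations, and restrict promises that dst, a, and b do
   not alias, so the compiler is free to vectorize it. */
void add_arrays(int32_t *restrict dst, const int32_t *restrict a,
                const int32_t *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

With GCC, adding -fopt-info-vec to the command line reports which loops were vectorized, which is a quick way to check the result without reading the assembly.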
To use SVE2 intrinsics in a C program, include the header file
arm_sve.h, which ships with the compiler:
#include <arm_sve.h>
To detect SVE2 capability in the compilation target, use the
following macro:
#if __ARM_FEATURE_SVE2
...
#endif
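Putting the header and the feature macro together, a function can use the intrinsics when the target supports them and fall back to plain C otherwise. The sketch below is one way to do that; `add_arrays_sve` is a hypothetical name, and the intrinsics it uses (`svcntw`, `svwhilelt_b32`, `svld1_s32`, `svadd_s32_x`, `svst1_s32`) come from the base SVE set that SVE2 includes. Only the fallback path has been exercised here, so the intrinsic path should be treated as untested.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>
#endif

/* Element-wise addition, dst[i] = a[i] + b[i]. */
void add_arrays_sve(int32_t *dst, const int32_t *a, const int32_t *b,
                    size_t n) {
#if defined(__ARM_FEATURE_SVE2)
    /* svcntw() is the number of 32-bit lanes in one hardware vector,
       so the loop adapts to whatever vector length the CPU has. */
    for (size_t i = 0; i < n; i += svcntw()) {
        /* Lanes are active while i + lane < n (WHILELT). */
        svbool_t pg = svwhilelt_b32((int64_t)i, (int64_t)n);
        svint32_t va = svld1_s32(pg, a + i);
        svint32_t vb = svld1_s32(pg, b + i);
        svst1_s32(pg, dst + i, svadd_s32_x(pg, va, vb));
    }
#else
    /* Portable fallback for targets without SVE2. */
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
#endif
}
```

Guarding the include as well as the function body keeps the file compilable on targets where arm_sve.h is not available.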
Run SVE2 Code
To run SVE2 code on ARMv8 hardware, use QEMU in user mode. It
emulates the SVE2 instructions in software, so the executable can run on a
machine whose CPU does not implement them:
qemu-aarch64 ./executable
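A program can also check at run time whether the CPU it is running on (real or emulated) reports SVE2. The sketch below assumes a Linux AArch64 environment, where `getauxval(AT_HWCAP2)` exposes the kernel's capability bits. The fallback value for `HWCAP2_SVE2` is an assumption taken from the Linux arm64 hwcap definitions, and `cpu_has_sve2` is a hypothetical helper name. On any other platform the function simply returns 0.

```c
#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>
#ifndef HWCAP2_SVE2
/* Assumed value from the Linux arm64 hwcap header; used only when
   the C library headers do not define it. */
#define HWCAP2_SVE2 (1UL << 1)
#endif
#endif

/* Returns 1 if the running CPU reports SVE2 support, 0 otherwise
   (including on platforms where the check is not available). */
int cpu_has_sve2(void) {
#if defined(__linux__) && defined(__aarch64__)
    return (getauxval(AT_HWCAP2) & HWCAP2_SVE2) != 0;
#else
    return 0;
#endif
}
```

Whether the check succeeds under QEMU depends on the emulated CPU model, so a result of 0 there may only mean that a model without SVE2 was selected.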
Conclusion
The SVE2 instruction set is a powerful tool that boasts portability
and flexibility: build once, run on hardware with different vector length
implementations. SVE2 dramatically widens the application area for
the new ARMv9 chips, so we will probably see a great variety of hardware
shortly. Last but not least, you can check the SVE2 example code
here.
Resources
Author: Iurii Kondrakov
Email: deezzir@gmail.com
GitHub: github.com