SPO600 Project Implementation

Introduction

This blog post covers the second stage of the SPO600 project. The main objective is to optimize an open-source library for ARMv9 hardware using the Scalable Vector Extension 2 (SVE2).

The second part consists of steps to implement the planned changes and verify that the library builds and works correctly. In addition, the changes should not cause any regression on the other platforms.

Setbacks

In the previous stage, I searched for possible candidates and found that the TensorFlow library was the best fit for this project. I researched and worked with the library to find the best strategy for implementing SVE2 optimizations. However, after almost a week of work, I decided that this package is too complex for this project. I tried the compiler's auto-vectorization options, which did not produce the desired results. Given the size of the library, an implementation that meets the project standards would take a lot of work, and unfortunately I do not have enough time or experience to produce even a mediocre one.

Moving on

After acknowledging the failure, it was time to return to the first stage and find another library that might fit the scope. I spent another day searching and finally came across the simdjson library.

The library aims to take JSON parsing to another level. The author decided to make one of the most common operations on the Internet faster and more reliable by taking advantage of CPU-specific SIMD instructions. The algorithm reduces branch mispredictions and data dependencies, which lets a modern CPU core execute more instructions in parallel.

The library relies on advanced SIMD instructions for different platforms, accessed through intrinsic functions. Its key feature is the ability to compile all of the CPU-specific processing kernels into one binary and choose the most appropriate one at runtime. This means that the compiled binaries contain code paths for several instruction sets, covering various CPU families and microarchitectures, and the right one is picked once the platform has been detected. This approach gives the best possible performance while keeping the library user-friendly and portable.
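The idea of runtime kernel selection can be sketched as follows. This is only a minimal illustration under my own assumptions: the `Kernel` struct, `pick_kernel` and the other names are hypothetical and are not simdjson's real classes or API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical kernel descriptor (illustrative only, not simdjson's types).
struct Kernel {
    std::string name;                  // e.g. "arm64", "haswell", "fallback"
    bool (*supported)();               // runtime check for the required CPU features
    std::size_t (*count_quotes)(const char* data, std::size_t len);
};

// Portable scalar kernel that works on every CPU.
inline std::size_t count_quotes_scalar(const char* data, std::size_t len) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < len; ++i) {
        n += (data[i] == '"');
    }
    return n;
}

inline bool always_supported() { return true; }
inline bool never_supported() { return false; }

// Walk the list in priority order (fastest first) and return the first
// kernel whose CPU check passes; nullptr if none of them is usable.
inline const Kernel* pick_kernel(const std::vector<Kernel>& kernels) {
    assert(!kernels.empty());
    for (const Kernel& k : kernels) {
        if (k.supported()) {
            return &k;
        }
    }
    return nullptr;
}
```

In this sketch every kernel is compiled into the same binary, and the list would normally end with a fallback entry whose check always succeeds, so `pick_kernel` always has something to return.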

Currently supported SIMD sets:

AltiVec
ARM Neon
AVX2
BMI1 & BMI2 (bit manipulation instruction sets)
SSE4.2
PCLMULQDQ (carry-less multiplication extension for Intel/AMD)

A search of the repository confirmed that there is no SVE2 implementation yet, and no open issues or pull requests about it. Considering the versatile approach the community has taken, it seems likely that SVE2 support will be added once the hardware becomes available.

Codebase Analysis

I have already cloned the repository and read the documentation, so let's look at the library structure and some code.

From the top level, the structure consists of:

include - a folder with the classes, function declarations and inline definition:

It contains subdirectories with architecture-specific inline implementations, which include headers like <arm_neon.h> and <immintrin.h>.
It contains a fallback (default) implementation for when there is no SIMD set to apply.
It contains the generic helpers and other classes used library-wide.

src - a folder with non-inlined functionality:

It contains architecture-specific function implementations.
It contains generic implementations of the simdjson parser.
It contains simdjson.cpp, the main source file, which pulls in all the implementations to select from at runtime.

The top level also contains other setup files, environments, scripts and dependencies.
CMakeLists.txt - the main build script, which also sets up unit testing.

simdjson.cpp - main source:

arm64/intrinsics.h - main ARM implementation header

SIMD Platform Detection
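As a rough sketch of how runtime platform detection can work, the function below builds a bitmask of the instruction sets the current CPU reports. The `ISA_*` constants and the function name are my own hypothetical names, not the library's code. It assumes GCC or Clang on x86 (for `__builtin_cpu_supports`) and Linux on AArch64 (for `getauxval`), and it simply treats Neon as always present on AArch64.

```cpp
#include <cstdint>

#if defined(__aarch64__) && defined(__linux__)
#include <sys/auxv.h>   // getauxval, AT_HWCAP2
#include <asm/hwcap.h>  // HWCAP2_SVE2 on recent kernel headers
#endif

// Hypothetical bit flags for this sketch.
constexpr std::uint32_t ISA_DEFAULT = 0x0;
constexpr std::uint32_t ISA_NEON    = 0x1;
constexpr std::uint32_t ISA_AVX2    = 0x2;
constexpr std::uint32_t ISA_SSE42   = 0x4;
constexpr std::uint32_t ISA_SVE2    = 0x8;

inline std::uint32_t detect_supported_architectures() {
    std::uint32_t isa = ISA_DEFAULT;
#if defined(__aarch64__)
    // Assumption of this sketch: every AArch64 target has Neon.
    isa |= ISA_NEON;
#if defined(__linux__) && defined(HWCAP2_SVE2)
    // The Linux kernel reports SVE2 through the second hardware-capability word.
    if (getauxval(AT_HWCAP2) & HWCAP2_SVE2) {
        isa |= ISA_SVE2;
    }
#endif
#elif (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    // GCC and Clang builtin that queries CPUID at runtime.
    if (__builtin_cpu_supports("avx2")) {
        isa |= ISA_AVX2;
    }
    if (__builtin_cpu_supports("sse4.2")) {
        isa |= ISA_SSE42;
    }
#endif
    return isa;
}
```

On any other compiler or platform the sketch returns `ISA_DEFAULT`, which would correspond to using the fallback implementation.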

ARM Neon intrinsic example
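To give a feel for what Neon intrinsic code looks like, here is a small sketch of my own (not taken from the library) that checks whether any of 16 bytes equals a given character. The function name `any_eq16` is hypothetical, and the scalar branch is there only so that the example also builds on non-ARM machines.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Returns true if any of the 16 bytes starting at `p` equals `c`.
inline bool any_eq16(const std::uint8_t* p, std::uint8_t c) {
    assert(p != nullptr);
#if defined(__aarch64__) && defined(__ARM_NEON)
    uint8x16_t block = vld1q_u8(p);                  // load 16 bytes
    uint8x16_t eq = vceqq_u8(block, vdupq_n_u8(c));  // 0xFF in every matching lane
    return vmaxvq_u8(eq) != 0;                       // horizontal max across all lanes
#else
    for (std::size_t i = 0; i < 16; ++i) {
        if (p[i] == c) {
            return true;
        }
    }
    return false;
#endif
}
```

The Neon branch compares all 16 bytes with one instruction instead of looping, which is the kind of branch-free processing the library is built around.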

ARM Neon Bit manipulation
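Bit manipulation helpers usually operate on 64-bit masks in which each set bit marks an interesting byte of the input. The following is a sketch under my own assumptions, written in the spirit of such helpers rather than copied from the library. The Neon branch counts set bits with `vcnt_u8` and `vaddv_u8`, and the portable branch is a plain loop.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Number of set bits in a 64-bit mask.
inline int count_ones(std::uint64_t x) {
#if defined(__aarch64__) && defined(__ARM_NEON)
    // vcnt_u8 counts the set bits of each byte; vaddv_u8 sums the eight counts.
    return vaddv_u8(vcnt_u8(vcreate_u8(x)));
#else
    int n = 0;
    while (x != 0) {
        x &= x - 1;  // drop the lowest set bit
        ++n;
    }
    return n;
#endif
}

// Clears the lowest set bit, e.g. 0b1100 becomes 0b1000.
inline std::uint64_t clear_lowest_bit(std::uint64_t x) {
    return x & (x - 1);
}

// Index of the lowest set bit; the input must be non-zero.
inline int trailing_zeroes(std::uint64_t x) {
    assert(x != 0);
    int n = 0;
    while ((x & 1) == 0) {
        x >>= 1;
        ++n;
    }
    return n;
}
```

Calling `trailing_zeroes` and then `clear_lowest_bit` in a loop walks over the positions of all set bits in a mask, one at a time.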

Implementation

After examining the code structure and flow, it was time to choose a strategy for adding SVE2 to the library. Some of the headers are responsible for setting macros that determine which instruction sets the runtime dispatcher can choose from. I see two possible strategies:

auto-vectorization
intrinsics

As mentioned above, the code relies heavily on intrinsic functions rather than inline assembly, as this keeps the code flexible and lets the dispatcher work correctly.

I will try both strategies, starting with auto-vectorization. I went through the CMake files and added options to build specifically for ARMv8-A with SVE2 enabled.

The second step is to add conditional statements that detect SVE2 and to try rewriting one of the Neon functions with SVE2 intrinsics.
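A compile-time detection conditional could look roughly like the sketch below. `__ARM_FEATURE_SVE2` is the ACLE feature macro that compilers define when SVE2 is enabled (for example with `-march=armv8-a+sve2`). The `SKETCH_*` macros and `compiled_kernel_name` are hypothetical names used only for illustration.

```cpp
// Set SKETCH_HAS_SVE2 when the compiler targets SVE2 and pull in the
// SVE intrinsics header only in that case.
#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>
#define SKETCH_HAS_SVE2 1
#else
#define SKETCH_HAS_SVE2 0
#endif

// Set SKETCH_IS_ARM64 on any 64-bit ARM target (GCC/Clang or MSVC).
#if defined(__aarch64__) || defined(_M_ARM64)
#define SKETCH_IS_ARM64 1
#else
#define SKETCH_IS_ARM64 0
#endif

// Name of the best kernel this translation unit was compiled for.
inline const char* compiled_kernel_name() {
#if SKETCH_HAS_SVE2
    return "arm64+sve2";
#elif SKETCH_IS_ARM64
    return "arm64";
#else
    return "fallback";
#endif
}
```

With guards like these, the Neon code stays the default on ARM64 and the SVE2 path is only compiled when the build flags ask for it.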

Verification

First of all, I tried auto-vectorization by building with CMake and the -march=armv8-a+sve2 option. After the build completed, I ran the tests and benchmarks to ensure the library works. All the tests passed on the ARM machine under the software emulator, which confirms that the binaries built with SVE2 enabled work correctly. Checking the disassembly for SVE2 instructions would show how much of the code the compiler actually vectorized.

However, this is not the best solution, so I am continuing to work in the arm64 folder to make at least one function use actual SVE2 intrinsic functions. I hope to have enough time to produce a working version with the implemented code so that I can test it on different platforms, because compiling with the option mentioned above ties the binaries to one specific architecture and SIMD instruction set.

Boilerplate for the SVE2 intrinsic.
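As an illustration of what such boilerplate can look like, here is a minimal sketch of a vector-length-agnostic loop that counts how many bytes equal a given character. It is my own example, not code from the library, and `count_byte` is a hypothetical name. The intrinsics used are baseline SVE ones, which are also available when SVE2 is enabled, and the scalar branch keeps the example buildable on other platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>
#endif

// Counts the bytes in data[0..len) that are equal to `c`.
inline std::size_t count_byte(const std::uint8_t* data, std::size_t len, std::uint8_t c) {
    assert(data != nullptr || len == 0);
    std::size_t count = 0;
#if defined(__ARM_FEATURE_SVE2)
    // svcntb() is the number of 8-bit lanes per vector, known only at runtime.
    for (std::size_t i = 0; i < len; i += svcntb()) {
        // Predicate with one active lane for every index still below len,
        // so the final partial vector needs no separate tail loop.
        svbool_t pg = svwhilelt_b8(static_cast<std::uint64_t>(i),
                                   static_cast<std::uint64_t>(len));
        svuint8_t block = svld1_u8(pg, data + i);  // predicated load
        svbool_t eq = svcmpeq_n_u8(pg, block, c);  // lanes equal to c
        count += svcntp_b8(pg, eq);                // number of matching lanes
    }
#else
    for (std::size_t i = 0; i < len; ++i) {
        count += (data[i] == c);
    }
#endif
    return count;
}
```

Unlike the fixed 128-bit Neon registers, the same SVE loop runs unchanged on hardware with any vector length, because the predicate takes care of the loop bounds.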

Conclusion

The library is a powerful tool for JSON parsing, and I have not found a comparable alternative in the community so far. Furthermore, the library uses a runtime dispatcher to select the appropriate SIMD set, which is the best solution for portability and adds no overhead inside the individual kernels.

I will continue working on implementing the SVE2 intrinsics, and we will see the outcomes in the next blog post for the last stage of the SPO600 Project.

Unfortunately, I am not planning to create a pull request this time because ARMv9 hardware is only starting to appear, and it might be too early. Still, playing around with this technology and open-source projects has taught me plenty of new topics, coding patterns and practices.

Author: Iurii Kondrakov

Email: deezzir@gmail.com

GitHub: github.com
