SPO600 Project Analysis

Introduction

This blog post is the third and final stage of the SPO600 project. The main objective is to optimize an open-source library for ARMv9 hardware using the Scalable Vector Extension 2 (SVE2).
The final part involves a detailed analysis of the attempted solution, whether through auto-vectorization, inline assembly, or intrinsic functions. It includes disassembly analysis, the performance impact, and conclusions.

And yet another failure / Analysis

In the previous post, I switched to another library, simdjson, to try implementing SVE2 optimizations, because the initial library was too complex for me to handle alone. The library is a perfect fit for SVE2: it already includes many SIMD instruction sets (including Neon), it uses a runtime dispatcher, and it claims to parse gigabytes of JSON per second.
Another interesting detail to mention: the library consists of many folders and files, but during the build process, CMake runs Python scripts to generate two amalgamated files, simdjson.h and simdjson.cpp, which contain all of the headers and sources from the entire library. This means the end user can include and compile using only these two files.
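To show how simple the amalgamated build is for the end user, here is a sketch of a minimal consumer based on the project's quickstart (twitter.json is an assumed sample file; the API shown is simdjson's On Demand interface):

#include "simdjson.h"
#include <iostream>

// Compile with just: g++ -O3 main.cpp simdjson.cpp
int main() {
  simdjson::ondemand::parser parser;
  // load() pads the input so the SIMD kernels can safely read past the end
  simdjson::padded_string json = simdjson::padded_string::load("twitter.json");
  simdjson::ondemand::document doc = parser.iterate(json);
  std::cout << uint64_t(doc["search_metadata"]["count"]) << " results" << std::endl;
}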

But I failed again, and here is why:

Auto-vectorization

As a first step, I tried using the -march=armv8-a+sve2 option to let the compiler decide where to apply auto-vectorization, and the build succeeded:


But upon further investigation of the disassembled code, I found no trace of instructions using vector predicates or SIMD vectors. I was disappointed, but I was eager to know why, so I ran the compilation again with the -fopt-info-vec-missed option to see the compiler's analysis:

Output Snippet:


From the screenshot provided above, we can see that the compiler missed vectorization for several reasons:
  • Control flow in the loop
  • The compiler could not determine the number of iterations in the loop 
  • Memory is clobbered by dynamic allocation/deallocation
  • Vectorization is not profitable
Most of the driver functions in the library use while loops that check for the presence of an error while parsing the JSON document. In most cases, this means the compiler cannot calculate the number of iterations in the loop, so applying vectorization to such a segment might even be detrimental: the setup overhead for SVE2 would decrease performance on small data sets.
inline void document_stream::next() noexcept {
	// We always exit at once, once in an error condition.
	if (error) { return; }
	// Load the next document from the batch
	doc_index = batch_start + parser->implementation->structural_indexes[parser->implementation->next_structural_index];
	error = parser->implementation->stage2_next(parser->doc);
	// If that was the last document in the batch, load another batch (if available)
	while (error == EMPTY) {
		batch_start = next_batch_start();
		if (batch_start >= len) { break; }
#ifdef SIMDJSON_THREADS_ENABLED
		if(use_thread) {
			load_from_stage1_thread();
		} else {
			error = run_stage1(*parser, batch_start);
		}
#else
		error = run_stage1(*parser, batch_start);
#endif
		if (error) { continue; } // If the error was EMPTY, we may want to load another batch.
		// Run stage 2 on the first document in the batch
		doc_index = batch_start + parser->implementation->structural_indexes[parser->implementation->next_structural_index];
		error = parser->implementation->stage2_next(parser->doc);
	}
}
The code snippet above presents one of the main functions for parsing and processing a JSON stream. The algorithm consists of two stages:
  • Stage 1: Identifies structural elements, strings, etc., and validates the input.
  • Stage 2: Constructs the working tree for navigation and parses strings and numbers.
Both stages appear in the code snippet, and the compiler missed vectorization because of the indeterminate loop trip count.
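To illustrate what the compiler is complaining about, here is a contrived sketch (not simdjson code) of the same pattern: the exit condition depends on the data, so the trip count cannot be computed, and the early break adds control flow inside the loop:

#include <cstddef>

// Scan until a quote character: how many iterations run depends entirely
// on the bytes in buf, which the compiler cannot know at compile time.
size_t scan_until_quote(const char *buf, size_t len) {
  size_t i = 0;
  while (i < len) {
    if (buf[i] == '"') { break; }  // data-dependent exit
    ++i;
  }
  return i;
}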

But I have another theory: the library uses a runtime dispatcher to determine the SIMD instruction set during execution. The -march=armv8-a+sve2 option means building for the ARMv8 architecture with SVE2 capabilities, but it also includes Arm Neon capabilities, as Neon is the default SIMD set for this family. Maybe the optimizations already coming from the Neon code in the library lead the compiler to decide that SVE2 vectorization is not profitable. So, I tried to disable the Neon kernel so that the implementation falls back to the default (no SIMD capabilities):
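Roughly, the change amounts to flipping simdjson's implementation-selection macros, which guard each kernel (a sketch; the macros must also be visible when simdjson.cpp itself is compiled, e.g., via -D flags):

#define SIMDJSON_IMPLEMENTATION_ARM64 0     // compile out the Neon kernel
#define SIMDJSON_IMPLEMENTATION_FALLBACK 1  // keep the generic scalar kernel
#include "simdjson.h"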



After another five minutes of compilation, I looked at the disassembly to confirm my theory: there were no Neon vector instructions, so the dispatcher did not select the AArch64 kernel at runtime. However, there were no SVE2 instructions either. Thus, the indeterminate loops and memory alignment are the cause of the failed auto-vectorization. A possible solution would be to use memory-alignment macros to ensure proper alignment. However, the time spent searching for the appropriate functions to decorate would be wasted because of the dispatcher pattern and the "build once, run everywhere" philosophy.

Therefore, in conclusion: no feasible optimization with this strategy.

Intrinsic Functions

The second option was to try rewriting one of the Neon-optimized functions with SVE2 intrinsics. I opened the ARM C Language Extensions (ACLE) for SVE documentation and started looking through the "include/simdjson/arm64" and "src/arm64" folders for possible candidates.

But first of all, we need to include the SVE header inside the "include/simdjson/arm64/intrinsics.h" file, where the Neon header is already included:
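The addition is tiny; a sketch, guarded so that non-SVE toolchains still build (arm_sve.h is the ACLE header that declares the SVE/SVE2 intrinsics):

#include <arm_neon.h>  // existing Neon header
#if defined(__ARM_FEATURE_SVE2)
#include <arm_sve.h>   // SVE/SVE2 intrinsics and types such as svuint8_t
#endif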


The arm64 "src/" folder consists of parser functions built from inlined procedures containing the Neon intrinsics defined in the arm64 "include/" folder:


The "include/" folder contains all the optimizations we might rewrite. So, I have scanned the contents to find the best fit. I selected the "simd.h" file that defines a couple of vital structs for parsing Stage 1: 


However, at this point, the problems started to arise. The file contained many interesting intrinsic functions and bit manipulations based on the Neon SIMD set. For example, vreinterpretq_u8_u64 reinterprets a vector from uint64x2_t to uint8x16_t. There were many adding, loading, storing, and comparing operations, but no loops that I could, in theory, optimize.
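As a standalone illustration of that intrinsic (not the library's code), a Neon reinterpret is a zero-cost relabelling of the same 128-bit register:

#include <arm_neon.h>

uint8x16_t as_bytes(uint64x2_t v) {
  return vreinterpretq_u8_u64(v);  // same bits, now viewed as 16 x u8
}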
After another hour of library analysis, I concluded that the library relies on SIMD operations on these structs, such as finding a max/min value, determining whether vectors are equal, and bit manipulations, which are the essential parts of the algorithm. Still, that implementation is perfect for the small optimizations and tweaks that make the overall library faster. Nevertheless, it is not the kind of code that can be auto-vectorized or that would benefit from SVE2 predicates.

But I did not give up. Returning to the second strategy, I might rewrite one of the structs' member functions with SVE2 analogs. So, I found this curious function, which looks up a table of 16 values and returns the values at the provided indices:


And here is the implementation of the apply_lookup_16_to function:


It uses the vqtbl1q_u8 intrinsic to look up the table vector passed as the first argument and return another vector containing the values at the indices given by the second argument. And I found the SVE analog intrinsic in the documentation:
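In isolation, the two operations look like this (a sketch, not the simdjson code; svtbl is available from base SVE, which SVE2 includes):

#include <arm_neon.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Neon: select bytes of `table` by index; out-of-range indices yield zero.
uint8x16_t lookup16_neon(uint8x16_t table, uint8x16_t indices) {
  return vqtbl1q_u8(table, indices);
}

#if defined(__ARM_FEATURE_SVE)
// SVE counterpart: same semantics, but over a scalable-width register.
svuint8_t lookup_sve(svuint8_t table, svuint8_t indices) {
  return svtbl_u8(table, indices);
}
#endif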


I tried rewriting the function and encountered perhaps the most fundamental problem, one that prevents the successful embedding of SVE2 intrinsics into the current setup. If we return to the screenshot showing the base8 struct (the base struct the others inherit from), there is a member called value of type uint8x16_t, i.e., a vector of 16 unsigned 8-bit integers.
And as we know, Neon uses fixed 128-bit vectors for SIMD operations, which is exactly the case here.

This means that the implementation is locked to 128-bit vectors in this setup. To implement SVE2, a complete rework of the file and its structs would be needed to introduce SVE vector types such as svuint8_t, because I did not find a feasible way to cast a Neon vector to an SVE2 vector.

And if there is a way to do it (one might exist), it would mean an unnecessary sequence of casts that would impact performance, considering the number of times the function executes per second. The same applies to the vast majority of member functions in the arm64 directory.
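The only portable bridge I can think of is a round trip through memory, which is exactly the kind of overhead a hot function cannot afford (a hypothetical sketch; to_sve is an illustrative name, not simdjson API):

#include <arm_neon.h>
#include <arm_sve.h>
#include <cstdint>

svuint8_t to_sve(uint8x16_t v) {
  uint8_t buf[16];
  vst1q_u8(buf, v);                              // spill the Neon register
  return svld1_u8(svwhilelt_b8_u32(0, 16), buf); // reload the low 16 lanes
}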

Therefore, in conclusion: there is no feasible optimization with this strategy either.

What can be done?

As discussed before, auto-vectorization is not an option. Additionally, inline assembly is not an alternative, because the authors purposefully avoid it for the sake of the flexibility and stability that hand-written assembly might harm.
So the only option on the table is to introduce SVE2 intrinsic functions together with a revision of the project's structure. This includes adding SVE2 detection to the runtime dispatcher and adding another directory under the "include/" and "src/" folders containing SVE2 implementations of the structs mentioned above.
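For the dispatcher side, a minimal sketch of SVE2 runtime detection on Linux/AArch64 might look like this (cpu_has_sve2 is an illustrative name; the real patch would hook into simdjson's implementation-selection logic):

#include <sys/auxv.h>

#ifndef HWCAP2_SVE2
#define HWCAP2_SVE2 (1 << 1)  // bit from <asm/hwcap.h> on recent kernels
#endif

bool cpu_has_sve2() {
  // AT_HWCAP2 advertises the CPU features the kernel supports
  return (getauxval(AT_HWCAP2) & HWCAP2_SVE2) != 0;
}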

Potential Benefits

Considering the potential performance improvements, I think that upgrading from fixed 128-bit vectors to scalable vector lengths could increase the overall effectiveness of the library on ARM chips. Currently, the ARM benchmarks show impressive performance in all cases, including edge samples where some x86_64 implementations can choke.
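The property that makes this worthwhile is that SVE code is width-agnostic; the same binary would automatically use wider vectors on future hardware. A tiny sketch to demonstrate:

#include <arm_sve.h>
#include <cstdio>

int main() {
  // svcntb() returns the vector length in bytes at run time:
  // 16 on a 128-bit implementation, up to 256 on a 2048-bit one.
  printf("SVE vector length: %llu bytes\n", (unsigned long long)svcntb());
}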
So, when actual SVE2-capable server hardware arrives, we might see a spike in performance and power efficiency in the web-serving and API industry.


Conclusion/Reflection

During the project, I enjoyed working through the library structure to understand what its parts are for, how the algorithm functions, and what it depends on. That is essential for getting into large projects, because you mostly start with existing legacy code instead of building everything from scratch, so searching a codebase is a vital skill to learn. Additionally, I learned how the runtime dispatcher pattern works, how CMake performs a build, and how this rapid JSON-parsing algorithm works. This experience and knowledge will help me in my future jobs.

In my opinion, failures are as relevant and meaningful as successes if you can understand where you failed and what can be done in the future to prevent it. Failure is an essential part of everyone's life because it extends your experience and proficiency.

Unfortunately, this is the last blog post for the SPO600 course and my final semester at Seneca. Three years of studies were both exciting and comprehensive, and I would like to thank Professor Chris Tyler and his team for an engaging course that I found to be one of my favourites. Nothing can describe how much knowledge I have gained in CPU architectures, optimization, and portability (as the course name implies 😋). I will continue to read articles in this field and monitor the industry to stay up to date.

Author: Iurii Kondrakov 
GitHub: github.com
