Algorithm Selection on x86_64 & AArch64 Platforms


Introduction

  • Digital sound is typically represented, uncompressed, as signed 16-bit integer samples. There are two streams of samples, one each for the left and right stereo channels, at typical sample rates of 44.1 or 48 thousand samples per second (kHz) per channel, for a total of 88.2 or 96 thousand samples per second. Since there are 16 bits (2 bytes) per sample, the data rate is 88.2 * 1000 * 2 = 176,400 bytes/second (~172 KiB/sec) or 96 * 1000 * 2 = 192,000 bytes/second (~187.5 KiB/sec).
  • To change the volume of sound, each sample can be scaled (multiplied) by a volume factor, in the range of 0.00 (silence) to 1.00 (full volume).
  • On a mobile device, the amount of processing required to scale sound will affect battery life.

Multiple Approaches

Six programs are provided, each with a different approach to the problem, named vol0.c through vol5.c. A header file, vol.h, defines how much data (number of samples) will be processed by each program, as well as the volume level to be used for scaling (50%).

These are the six programs:
  1. vol0.c is the basic or naive algorithm. This approach multiplies each sound sample by the volume scaling factor, casting from signed 16-bit integer to floating point and back again. Casting between an integer and a floating point can be an expensive operation.
  2. vol1.c does the math using fixed-point calculations. This avoids the overhead of casting between an integer and a floating point and back again.
  3. vol2.c pre-calculates all 65536 different results and then looks up the answer for each input value.
  4. vol3.c is a dummy program - it doesn't scale the volume at all. It can be used to determine some of the overhead of the rest of the processing (everything besides scaling the volume) done by the other programs.
  5. vol4.c uses Single Instruction, Multiple Data (SIMD) instructions accessed through inline assembly (assembly language code inserted into a C program). This program is specific to the AArch64 architecture and will not build for x86_64.
  6. vol5.c uses SIMD instructions accessed through Compiler Intrinsics. This program is specific to AArch64.

Benchmarking

To start, the clock() function from the C library's time.h header seems like the best choice for benchmarking in this case:
#include <time.h>  // for clock_t, clock(), CLOCKS_PER_SEC
int main(void) {
	double time_spent = 0.0;
	clock_t begin = clock();
	/* Do some funny stuff */
	clock_t end = clock();
	time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
	return 0;
}
Working Directory and Files (download link):
.
└── examples
    └── volume
        ├── Makefile
        ├── vol0.c
        ├── vol1.c
        ├── vol2.c
        ├── vol3.c
        ├── vol4.c
        ├── vol5.c
        ├── vol_createsample.c
        └── vol.h
My Configuration in vol.h header:
/* This is the number of samples to be processed */
#define SAMPLES 1000000000
/* This is the volume scaling factor to be used */
#define VOLUME 50.0 // Percent of original volume
/* Function prototype to fill an array sample of 
 * length sample_count with random int16_t numbers
 * to simulate an audio buffer */
void vol_createsample(int16_t* sample, int32_t sample_count);
Before starting, I would like to note that each algorithm may produce a slightly different checksum because of its nature. I assume this is not critical and that the difference is insignificant.

AArch64

The tests will be conducted on two platforms:
  • Apple M1 Max (My machine)
  • ARM Cortex-A72 (israel.cdot.systems Server)
I assume that the best algorithm may be vol4 or vol5. These algorithms are specifically designed to take advantage of the ARM architecture's SIMD implementation.

Timing results and maximum resident memory usage:

As can be seen from the table above, the Apple M1 is astonishing. It is no secret that it is a state-of-the-art SoC at the moment, but it still performed impressively. And I was right about vol4 and vol5: on average, they are the better solutions for the ARM architecture. Both the Apple M1 and the ARM Cortex-A72 showed the best execution times with these two algorithms. Considering the results, I cannot pick a clear winner between the two. Additionally, raising the compiler optimization level did not, on average, produce better results. I might assume, though, that vol4 could perform better at a larger sample size.

x86_64

The tests will be conducted on two platforms:
  • Intel(R) Core(TM) i7-3820 CPU @ 3.60GHz (My machine)
  • Intel(R) Core(TM) i7-950   CPU @ 3.07GHz (portugal.cdot.systems Server)
I assume that the best algorithm may be vol1. It uses fixed-point calculation, unlike vol0, and it does not build a precalculated array beforehand, in contrast to vol2, which might take extra time.

Timing results and maximum resident memory usage:

As can be seen from the table above, the vol1 algorithm showed the best execution time on average. Additionally, we can compare the performance between the two platforms: my machine has a slightly faster CPU and RAM, so it is no surprise that it was faster on average. We can also compare the effect of compiler optimizations: in some cases, adjusting the level improves the execution time, but mostly the impact is negligible for this particular number of samples.

Questions

  1. What does this next block do? Why?
    ifeq ($(shell echo | gcc -E -dM - | grep -c aarch64),1)
            BINARIES:=vol0 vol1 vol2 vol3 vol4 vol5
    else
            BINARIES:=vol0 vol1 vol2 vol3
    endif
    The block above checks gcc's predefined macros for the string "aarch64". On an AArch64 system it builds all six binaries; otherwise it excludes the AArch64-specific vol4 and vol5, so that the build does not fail on x86_64.
  2. This part sums the samples. Why is this needed?
    for (x = 0; x < SAMPLES; x++) {
    	ttl=(ttl+out[x])%1000;
    }
    
    The code above produces a verification value. The sum lets us determine whether a given algorithm processes the data correctly. Additionally, it shows how accurate a particular scaling approach is.
  3. Print the sum of the samples. Why is this needed?
    printf("Result: %d\n", ttl);
    
    The code above ensures that the compiler does not remove the summation code as dead code. Because the program prints the result, the compiler has to actually compute it and cannot optimize the loop away.
  4. What is the purpose of the cast to uint16_t in the next line?
    precalc[(uint16_t) x] = (int16_t) ((float) x * VOLUME / 100.0);
    The cast converts the (possibly negative) signed value x into an unsigned 16-bit value in the range 0..65535, so that it can be used as a valid index into the precalc array; without it, negative values of x would produce invalid (negative) array indices.
  5. What's the point of the vol3 dummy program? How does it help with benchmarking?
    The dummy program can be used to measure the execution time of data generation, data summation, and other overhead; subtracting its time from the other programs' times isolates the performance of the scaling algorithm itself.
  6. Should we use 32767 or 32768 in the next line? why?
    vol_int = (int16_t)(VOLUME/100.0 * 32767.0);
    According to the C standard, int16_t has a range of -32,768..32,767, and we are scaling within that range. Since we do not want to invert the signal or overflow, we take the maximum positive value, 32767; at 100% volume, 32768 would be one past INT16_MAX.
  7. What is the purpose of these next two lines?
    in_cursor = in;
    out_cursor = out;
    limit = in + SAMPLES;
    The first two lines copy the start addresses of the "in" and "out" arrays into the "in_cursor" and "out_cursor" pointers. The last line sets limit to the address just past the end of the input data, which the while loop uses as its termination condition.
  8. What does it mean to "duplicate" values in the next line?
    __asm__ ("dup v1.8h,%w0"::"r"(vol_int));
    The code above broadcasts ("duplicates") the value of vol_int into all eight 16-bit halfword lanes of the vector register v1 (also known as q1). The inline assembler places the value of vol_int into a general-purpose register (%w0) as an input operand.
  9.  What do these next three lines do?
    : [in_cursor]"+r"(in_cursor), [out_cursor]"+r"(out_cursor) 
    : "r"(in_cursor),"r"(out_cursor)
    : "memory"
    );
    The first line specifies the outputs using the read/write constraint "+". It also names the output operands [in_cursor] and [out_cursor]. The second line lists the inputs and lets the compiler choose registers for them. Finally, the "memory" clobber tells the compiler that the assembly code reads and writes memory. This is crucial because we store values into the array, and the compiler must not keep those memory locations cached in registers across the asm statement.
  10. Are the results usable? Are they correct?
    printf("Result: %d\n", ttl);
    I would trust this result because the sqrdmulh instruction saturates any value that overflows, giving more accurate scaling.
  11. Why is the increment below 8 instead of 16 or some other value?
    in_cursor += 8;
    out_cursor += 8;
    Because the "+=" operator applied to a pointer advances the address by the given number of elements, not bytes. Eight int16_t samples are 16 bytes, which matches the 16-byte (8-sample) offset used in the inline assembly.
  12. Why is this line not needed in the inline assembler version of this program?
    in_cursor += 8;
    out_cursor += 8;
    Because the inline assembly version post-increments the cursors by 16 bytes using the post-index addressing mode of the "ldr" and "str" instructions:
    "str q0, [%[out_cursor]],#16 \n\t"
    "ldr q0, [%[in_cursor]], #16 \n\t"
    

Conclusion

While working on this blog, I have explored some simple digital signal scaling algorithms and learned about inline assembly and compiler intrinsics. In addition, I have tested and benchmarked the algorithms on different platforms and architectures to compare and analyze their performance.

Author: Iurii Kondrakov 
GitHub: github.com

P.S. this blog post was created for the SPO600 Lab 5
