SPO600 Project Selection

Introduction

This blog post covers the first stage of the SPO600 project. The main objective is to optimize an open-source library for ARMv9 hardware using the Scalable Vector Extension 2 (SVE2).
The first stage consists of finding and selecting a suitable package, ideally one that processes significant amounts of data and already has SIMD implementations:
  1. Identify some candidate open-source packages for optimization.
  2. Find the SIMD implementations in these packages.
  3. Select a part you want to enhance.
  4. Create a strategy for changes.
  5. Note how the community accepts contributions, and engage with the community to discuss your proposed work.

Open-source Package

As I am taking a course in Data Science, I have used several packages to work with data and build learning models, including Pandas, NumPy, Scikit-Learn, and TensorFlow. These open-source libraries fit the criteria well: they process large amounts of data and probably use some SIMD implementation. However, the library should ideally be written in C/C++, which allows straightforward use of GCC compiler intrinsics and embedded assembly.

Upon examining the repositories of each, I found that Pandas uses the Cython extension, which allows writing C code with Python-like syntax and integrating with existing low-level libraries. However, I did not find any mentions of SIMD or architecture-specific optimizations; they may be located in the C dependencies. The same was true of Scikit-Learn.
On the other hand, I found mentions of the <arm_neon.h> SIMD header in both the NumPy and TensorFlow packages. Both packages reference NEON intrinsics and make use of other SIMD CPU capabilities for optimization.
Function using NEON from TensorFlow:
inline static float GetSum(const float32x4_t& values) {
  static float32_t summed_values[4];
  vst1q_f32(summed_values, values);
  return summed_values[0]
       + summed_values[1]
       + summed_values[2]
       + summed_values[3];
}
Additionally, the TensorFlow repository contains some inline assembly but nothing related to SVE2.
Inline assembly from NumPy (a PowerPC VSX instruction, not ARM):
#else // gcc
  npyv_s32 lo_odd, hi_odd;
  __asm__ ("xvcvdpsxws %x0,%x1" : "=wa" (lo_odd) : "wa" (a));
  __asm__ ("xvcvdpsxws %x0,%x1" : "=wa" (hi_odd) : "wa" (b));
As for NumPy, I found files that include the NEON header, but they are just test programs located in the checks folder. There is also some inline assembly and x86 intrinsics, but nothing related to SVE2.

Selection

Both packages are great candidates, and I will probably create an issue for both projects. However, I am leaning towards TensorFlow because it already has a robust NEON implementation, which will help me determine the SVE2 analogs of the NEON instructions used in calculations and data processing. Moreover, TensorFlow is a machine learning library, and porting it to ARMv9 embedded devices is a natural direction for the project.
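
As a minimal sketch of what such an analog could look like, the horizontal sum performed by GetSum above can be written with a single reduction intrinsic in both NEON (AArch64) and SVE. The name sum4_f32 is hypothetical and is not part of TensorFlow. The SVE path is only compiled when the compiler defines __ARM_FEATURE_SVE (assuming an -march option that enables SVE or SVE2), and a scalar fallback keeps the code portable:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#elif defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sums exactly four floats, like GetSum in the post. */
static float sum4_f32(const float *values) {
    assert(values != NULL);
#if defined(__ARM_FEATURE_SVE)
    /* Predicate with only the first four 32-bit lanes active, so the
       result does not depend on the hardware vector length. */
    svbool_t pg = svwhilelt_b32_u64(0, 4);
    svfloat32_t v = svld1_f32(pg, values);
    return svaddv_f32(pg, v);   /* add the active lanes into one scalar */
#elif defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t v = vld1q_f32(values);
    return vaddvq_f32(v);       /* AArch64 add-across-vector */
#else
    /* Scalar fallback for builds without NEON or SVE. */
    return values[0] + values[1] + values[2] + values[3];
#endif
}
```

Unlike the NEON version, the SVE version needs a predicate because the vector register may hold more than four floats, depending on the hardware.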

Strategy

I plan to fork the code and explore the NEON SIMD parts to see what I can update. I will also explore the build process by trying to compile the test code into an executable. After that, I will research SVE2 and look for counterparts to the existing NEON code, so that I can start implementing the functions one by one using compiler intrinsics - an approach that is compatible with the existing SIMD code.
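
The intrinsics-based approach could be sketched as follows: an element-wise addition with an SVE path placed next to a NEON path and a scalar fallback, selected at compile time. The function add_f32 is a hypothetical example, not code from TensorFlow, and the feature macros are the ACLE ones that the compiler is assumed to define when the corresponding extensions are enabled:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical example: dst[i] = a[i] + b[i] for i in [0, n). */
static void add_f32(float *dst, const float *a, const float *b, size_t n) {
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
    size_t i = 0;
#if defined(__ARM_FEATURE_SVE)
    /* Vector-length-agnostic loop: svcntw() is the number of 32-bit lanes,
       and the predicate masks off lanes past the end of the arrays. */
    for (; i < n; i += svcntw()) {
        svbool_t pg = svwhilelt_b32_u64((uint64_t)i, (uint64_t)n);
        svfloat32_t va = svld1_f32(pg, a + i);
        svfloat32_t vb = svld1_f32(pg, b + i);
        svst1_f32(pg, dst + i, svadd_f32_x(pg, va, vb));
    }
#elif defined(__ARM_NEON)
    /* Fixed-width NEON loop: four floats per iteration. */
    for (; i + 4 <= n; i += 4) {
        vst1q_f32(dst + i, vaddq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
    }
#endif
    /* Scalar loop: handles the NEON tail, or everything without SIMD. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

Keeping each path behind its own feature macro means the existing NEON code stays untouched, and the SVE path is only built when the target supports it.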

Project Contribution Rules

I have read the TensorFlow documentation and contribution rules. I will research the contribution process thoroughly and create an issue on GitHub for both packages. Conveniently, both projects are hosted on GitHub, so I can simply open Pull Requests for the code to be tested.

Author: Iurii Kondrakov 
GitHub: github.com
