Compiler Vectorization in Assembly

SPO600 Week Six Deliverable

Introduction

For this exercise, we were given the following instructions, and cautioned that only those with patience would complete this lab with their sanity intact:

  1. Write a short program that creates two 1000-element integer arrays and fills them with random numbers, then sums those two arrays to a third array, and finally sums the third array to a long int and prints the result.
  2. Compile this program on an aarch64 machine in such a way that the code is auto-vectorized.
  3. Annotate the emitted code (i.e., obtain a disassembly via objdump -d and add comments to the instructions in <main> explaining what the code does).
  4. Review the vector instructions for AArch64. Find a way to scale an array of sound samples (see Lab 5) by a factor between 0.000 and 1.000 using SIMD.

Step 1

Below is the simple C code which achieves the desired functionality. It’s very easy to read, with no complexity beyond a standard incrementing accumulation and the ever so popular addition operator. Also included is a random number generator driven by stdlib’s rand() function. Originally, the calculations relating to the c array lived in a separate for loop, with the result accumulation happening in that loop as well. I moved that work into the loop used by arrays a and b, so the program makes a single pass over the data instead of two.

C Code
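A minimal sketch of the program described above looks like this (the variable names, and the % 1000 used to keep the sums comfortably inside an int, are assumptions on my part rather than the exact original listing):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define SIZE 1000

    int main(void) {
            int a[SIZE], b[SIZE], c[SIZE];
            long int total = 0;

            /* seed the random number generator */
            srand(time(NULL));

            /* fill a and b with random numbers, sum each pair into c,
               and accumulate the grand total in the same pass */
            for (int i = 0; i < SIZE; i++) {
                    a[i] = rand() % 1000;
                    b[i] = rand() % 1000;
                    c[i] = a[i] + b[i];
                    total += c[i];
            }

            printf("Total: %ld\n", total);
            return 0;
    }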

Step 2

To compile the application so that the compiler applies its more advanced optimization techniques, I used the -O3 flag, which enables auto-vectorization where possible by default. Had I not wanted the full -O3 package, I could instead have added the -ftree-vectorize flag at a lower optimization level to request the same vectorization.
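For reference, the invocation looks roughly like this (the source and output names, lab06.c and lab06, are my assumption):

    # -O3 enables the vectorizer along with many other optimizations
    gcc -O3 -o lab06 lab06.c

    # alternatively, request just the vectorizer on top of a lower level
    gcc -O2 -ftree-vectorize -o lab06 lab06.c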

What is Auto-Vectorization?

The great wonder that is Wikipedia has the following explanation, which I have shamelessly posted below to supplement the answer to this question:

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once.
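Put another way, for the kind of loop used in Step 1, the compiler rewrites the one-element-at-a-time version into a form that handles several elements per iteration. A hand-written illustration of the idea (not the compiler’s actual output):

    #include <stddef.h>

    /* scalar form: one addition per iteration */
    void add_scalar(const int *a, const int *b, int *c, size_t n) {
            for (size_t i = 0; i < n; i++)
                    c[i] = a[i] + b[i];
    }

    /* roughly what the vectorizer produces: four additions per iteration,
       performed as a single 128-bit vector operation on AArch64
       (n is assumed to be a multiple of 4 here for simplicity) */
    void add_by_fours(const int *a, const int *b, int *c, size_t n) {
            for (size_t i = 0; i < n; i += 4) {
                    c[i]     = a[i]     + b[i];
                    c[i + 1] = a[i + 1] + b[i + 1];
                    c[i + 2] = a[i + 2] + b[i + 2];
                    c[i + 3] = a[i + 3] + b[i + 3];
            }
    }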

Step 3

Below is my analysis of the lab06 binary, with my comments on the right side. I obtained the disassembly with the objdump -d command and, for editing purposes, redirected its output into an empty .asm file. I will not deny that my analysis has plenty of plot holes: assumptions that are incorrect, misread assembly, or wrongly parsed arguments. Regardless of the vectorization, machine language is the closest this web developer has ever gotten to the CPU and hardware itself. Would I say I enjoy reading assembly code? No. Do I see how it is an invaluable source of optimization insight that rivals even the best C code? Yes. But I’d be a fool to say it is my cup of tea. Without further discussion of my failings in software optimization, analysis, and the beast that is .ASM, here is my analysis.
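Concretely, the disassembly was produced with a command along these lines (the binary name lab06 is assumed):

    objdump -d lab06 > lab06.asm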

Assembly Code

Thoughts

Based on my analysis, it seems that a pivotal operation is storing values into the registers in pairs, using STP for that purpose, which then allows the loop to work through eight elements of the array at a time. How the compiler decides what to vectorize is still beyond me, but that’s what the lessons are for, right? Regardless, I can now understand basic assembly, which puts me further along, knowledge-wise, than I was in the previous weeks.

Step 4

Without actually modifying the previous lab’s code to take advantage of the compiler’s auto-vectorization, or adding inline assembly for further optimization, here are some thoughts collected from reviewing my peers’ ideas alongside my own (a sketch that combines them follows the list):

  1. Utilize DUP to duplicate the volume factor into every lane of a vector register. Wikipedia describes the scalar processing that this replaces as follows:

    A scalar processor processes only one datum at a time, with typical data items being integers or floating point numbers. A scalar processor is classified as a SISD processor (Single Instruction, Single Data) in Flynn’s taxonomy.

  2. Load the sample data into a vector register using LD1, the instruction which loads multiple one-element structures into one or more vector registers.
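To make those two ideas concrete, here is a rough sketch using the NEON intrinsics that correspond to them: vdupq_n_f32 maps onto DUP and vld1q_s16 onto LD1. The function name, the int16_t sample format, and the choice to scale in floating point are my assumptions rather than anything taken from Lab 5:

    #include <arm_neon.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Scale 16-bit sound samples in place by a factor in [0.0, 1.0]. */
    void scale_samples(int16_t *samples, size_t n, float factor) {
            /* DUP: duplicate the volume factor into every lane */
            float32x4_t vfactor = vdupq_n_f32(factor);

            size_t i = 0;
            for (; i + 8 <= n; i += 8) {
                    /* LD1: load eight samples into a vector register */
                    int16x8_t s = vld1q_s16(&samples[i]);

                    /* widen to 32 bits, convert to float, and scale */
                    float32x4_t lo = vcvtq_f32_s32(vmovl_s16(vget_low_s16(s)));
                    float32x4_t hi = vcvtq_f32_s32(vmovl_s16(vget_high_s16(s)));
                    lo = vmulq_f32(lo, vfactor);
                    hi = vmulq_f32(hi, vfactor);

                    /* convert back, narrow to 16 bits with saturation, store */
                    int16x8_t out = vcombine_s16(vqmovn_s32(vcvtq_s32_f32(lo)),
                                                 vqmovn_s32(vcvtq_s32_f32(hi)));
                    vst1q_s16(&samples[i], out);
            }

            /* plain scalar code for any leftover samples */
            for (; i < n; i++)
                    samples[i] = (int16_t)(samples[i] * factor);
    }

Scaling in floating point keeps the sketch simple; a fixed-point variant built around a saturating multiply intrinsic such as vqdmulhq_s16 would avoid the conversions, at the cost of expressing the volume factor as a Q15 value.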
