Writing Inline Assembly in C

SPO600 Deliverable Week Seven

For this exercise, the task was described as follows: “Write a version of the Volume Scaling solution from the Algorithm Selection Lab for AArch64 that uses the SQDMULH or SQRDMULH instructions via inline assembler”. Though this sounds rather complex to the average programmer, I can assure you that it’s easier to delegate or assign such a task than it is to actually implement it if you do not live in an Assembly-centric world. Luckily, this was a group lab, so I have to credit the thought process, the logic officers, the true driving force behind the completion of said lab: Timothy Moy and Matthew Bell. Together, we were able to write inline assembly which met the requirements on an AArch64 system.

The Assembly Process

Multiple implementations were brought about by the group, some struggling to compile and others segfaulting as soon as the chance arose. One finally showed promise, and all attention shifted to perfecting it; the final version can be seen below. We replaced the custom algorithm from the previous exercise with the inline assembly code, and recorded an improved performance metric compared to the naive C function.
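As a rough illustration of the technique (this is not the group's submitted code), SQDMULH multiplies two signed 16-bit lanes, doubles the product, and keeps the saturated high half, which amounts to multiplying a sample by a Q15 fixed-point volume factor. The sketch below assumes GCC or Clang; the names `scale_samples`, `sqdmulh_model` and `vol_q15` are hypothetical, and the plain C model is included only so the same file builds and behaves identically on machines that are not AArch64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: plain C model of SQDMULH on one 16-bit lane,
 * i.e. saturate((2 * a * b) >> 16). */
static int16_t sqdmulh_model(int16_t a, int16_t b)
{
    if (a == INT16_MIN && b == INT16_MIN)
        return INT16_MAX;   /* the only input pair that saturates */
    /* (2*a*b) >> 16 equals (a*b) >> 15. Assumes an arithmetic right
     * shift for negative values, which GCC and Clang provide. */
    return (int16_t)(((int32_t)a * (int32_t)b) >> 15);
}

/* Hypothetical function: scale n samples by vol_q15 (volume * 32768). */
void scale_samples(const int16_t *in, int16_t *out, size_t n, int16_t vol_q15)
{
    assert(in != NULL && out != NULL);
    size_t i = 0;
#if defined(__aarch64__)
    int16x8_t vol = vdupq_n_s16(vol_q15);   /* volume factor in all 8 lanes */
    for (; i + 8 <= n; i += 8) {
        int16x8_t s = vld1q_s16(in + i);    /* load 8 samples */
        int16x8_t r;
        /* "w" asks the compiler for SIMD registers; it picks which ones. */
        __asm__("sqdmulh %0.8h, %1.8h, %2.8h" : "=w"(r) : "w"(s), "w"(vol));
        vst1q_s16(out + i, r);              /* store 8 scaled samples */
    }
#endif
    /* Leftover samples, or every sample when not built for AArch64. */
    for (; i < n; i++)
        out[i] = sqdmulh_model(in[i], vol_q15);
}
```

Letting the compiler allocate the vector registers through operand constraints avoids hard-coding names such as v0 and v1, which is one common source of the crashes that hand-written inline assembly tends to produce.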

Looking back now at the code, I can see where we neglected performance optimizations such as hoisting common calculations out of the loop, which may improve the performance of the custom algorithm and also reduce the number of multiplication operations. Furthermore, the source code was littered with commented-out implementations which I have removed from the above, proving that we as a class, and I as a developer, still have no basic understanding of Assembly.

We also noted during the closing of this exercise that the custom sum did not work properly. Still, that was not the focus of the lab, so we pressed on. Curious, I made a few changes to optimize the items mentioned above to see if there was a performance increase. The new result is below, which effectively shaved 1.13 seconds off the original custom algorithm’s runtime. The biggest change, which I’ve included below, is simply modifying line 89 to compare p against a variable created on line 88 instead of recalculating output + sizeof(int16_t) * SIZE on every iteration.
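The shape of that change can be sketched as follows. This is a minimal illustration rather than the lab code: the function name, its parameters, and the Q15 scaling inside the loop are assumptions made for the example. It also assumes the output buffer is accessed through an int16_t pointer, in which case pointer arithmetic already counts in elements and the end pointer is simply output + n.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function: scale n samples by a Q15 factor, with the
 * loop bound computed once before the loop starts. */
void scale_with_hoisted_bound(const int16_t *input, int16_t *output,
                              size_t n, int16_t vol_q15)
{
    assert(input != NULL && output != NULL);
    int16_t *p = output;
    int16_t *const end = output + n;   /* computed once, not per iteration */
    for (; p < end; p++, input++) {
        /* Assumes an arithmetic right shift for negative values. */
        *p = (int16_t)(((int32_t)*input * vol_q15) >> 15);
    }
}
```

An optimizing compiler may perform this hoisting on its own at higher optimization levels, so the gain from doing it by hand depends on the flags used for the build.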

Finding Assembly in Ardour

For the second part of this lab, we had to observe why inline assembly was used in one of the listed open source projects, and the musician in me was too curious to pass up the opportunity to look into Ardour’s source code. Ardour is the definitive Linux project aimed at recording, mixing and even light video editing. It is the Pro Tools of the open source world, the FOSS audio producer’s dream. I have not kept up to date with its recent developments, having played with version 2.* on my makeshift Ubuntu Studio workstation years ago.

Using GitHub’s ‘search in repository’ feature, a quick search for ‘asm’ led to 40 results, which, along with the code base itself, can be seen with the following link. For this analysis, I will focus on the first two unique results, which span two files: the first is found in ‘~/msvc_extra_headers/ardourext/float_cast.h.input’ and the latter in ‘libs/ardour/ardour/cycles.h’.

Float_Cast.h.input Analysis

Opening the file displays this description first, which helps to understand the purpose of said file and answers a few questions such as the operating system, CPU architecture targets and configurations:

The file itself seems to contain functions which all call the same asm code and return differently cast variables. The assembly code is below this paragraph; it may differ elsewhere in the file, which is beyond the scope of my analysis and the code in my current window.

FLD Instruction

The fld instruction loads a 32 bit, 64 bit, or 80 bit floating point value onto the stack. This instruction converts 32 and 64 bit operand to an 80 bit extended precision value before pushing the value onto the floating point stack. (University of Illinois)

FISTP Instruction

The fist and fistp instructions convert the 80 bit extended precision variable on the top of stack to a 16, 32, or 64 bit integer and store the result away into the memory variable specified by the single operand. These instructions convert the value on tos to an integer according to the rounding setting in the FPU control register (bits 10 and 11). As for the fild instruction, the fist and fistp instructions will not let you specify one of the 80×86’s general purpose 16 or 32 bit registers as the destination operand.

The fist instruction converts the value on the top of stack to an integer and then stores the result; it does not otherwise affect the floating point register stack. The fistp instruction pops the value off the floating point register stack after storing the converted value. (University of Illinois)

What This All Means

Due to the lack of support for the *lrint* and *rint* functions on WIN32, they had to be implemented here for proper operation of the program. Once handed a floating point value, in the case of the entire function outlined below, the asm code handles converting (or casting, in the case of native C code perhaps) the float to an integer, rounding according to the FPU's current rounding mode; the converted value is stored in the specified memory variable rather than in a general purpose register.
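To make the fld and fistp pairing concrete, here is a minimal sketch written for GCC or Clang rather than for the MSVC syntax that Ardour's header targets, so it is an analogue and not the project's code. The function name `round_to_long` is hypothetical. On x86 the "t" constraint asks the compiler to place the value on top of the x87 stack, which is the job fld does in the original. On other architectures the sketch falls back to a magic-number rounding trick, which is an assumption of this example and relies on the default round-to-nearest mode.

```c
#include <assert.h>

/* Hypothetical helper: round a float to the nearest integer using the
 * FPU's current rounding mode (round-half-to-even by default). */
static long round_to_long(float x)
{
    assert(x > -1.0e9f && x < 1.0e9f);   /* result must fit in 32 bits */
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    int result;
    /* fistpl converts the top of the x87 stack to a 32-bit integer,
     * stores it to memory and pops the stack; the "st" clobber tells
     * the compiler about the pop. */
    __asm__ __volatile__("fistpl %0" : "=m"(result) : "t"(x) : "st");
    return result;
#else
    /* Assumed fallback: adding 1.5 * 2^52 pushes the fraction bits out
     * of a double, so the hardware rounds to an integer for us. */
    volatile double shifted = (double)x + 6755399441055744.0;
    return (long)(shifted - 6755399441055744.0);
#endif
}
```

The difference from an ordinary cast is the rounding: `(long)2.7f` truncates to 2, while `round_to_long(2.7f)` gives 3 under the default rounding mode, and a tie such as 2.5f goes to the even neighbour, 2.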

Cycles.h Analysis

Opening this file gave another explanation of its purpose at the top, a standard among many of the files here and one that I hope to use in my own future projects:

The file itself seems to be an interface between the cycle counter and the CPU architecture, attempting to support, where it can, the different architectures with the same scheduling platform.

__ASM__ __VOLATILE__ Analysis

The typical use of extended asm statements is to manipulate input values to produce output values. However, your asm statements may also produce side effects. If so, you may need to use the volatile qualifier to disable certain optimizations.

GCC’s optimizers sometimes discard asm statements if they determine there is no need for the output variables. Also, the optimizers may move code out of loops if they believe that the code will always return the same result (i.e. none of its input values change between calls). Using the volatile qualifier disables these optimizations. asm statements that have no output operands, including asm goto statements, are implicitly volatile. (GCC GNU Documentation)

What This Means

The use of the volatile qualifier disables said optimizations, which might otherwise deem the asm code useless to the program and discard it, or move it out of a loop because it appears to produce the same result on every iteration. Disabling such optimizations allows the developer to have deeper control over, and integration of, their variables in the scope of the function and program. This explanation is questionable, mind you, for the volatile documentation spans pages and pages of examples which contradict or support my own explanation.
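A cycle counter is a case where this matters: the asm statement has outputs but no inputs, so without volatile the compiler would be free to assume that two reads return the same value and reuse the first one. The sketch below is not taken from Ardour's cycles.h. The function names are hypothetical, and the choice of rdtsc on x86, the cntvct_el0 register on AArch64 and clock() elsewhere are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: read a free-running counter. */
static uint64_t read_counter(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    uint32_t lo, hi;
    /* volatile: every call must really execute rdtsc again. */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
#elif defined(__GNUC__) && defined(__aarch64__)
    uint64_t ticks;
    /* Virtual counter register, normally readable from user space on Linux. */
    __asm__ __volatile__("mrs %0, cntvct_el0" : "=r"(ticks));
    return ticks;
#else
    return (uint64_t)clock();   /* portable fallback, much coarser */
#endif
}

/* Hypothetical function: counter ticks spent in a small busy loop. */
uint64_t ticks_for_busy_loop(unsigned iterations)
{
    assert(iterations > 0);
    volatile unsigned sink = 0;   /* keeps the loop from being removed */
    uint64_t start = read_counter();
    for (unsigned i = 0; i < iterations; i++)
        sink += i;
    uint64_t end = read_counter();
    return end - start;
}
```

If the volatile qualifier were dropped from the asm statements, an optimizer could legally treat the second read as a copy of the first, and the measured difference would always be zero.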

Final Thoughts on Ardour’s ASM Code

From what I gather, the code in both files serves the same purpose: allowing support for a greater array of systems, be it 32-bit Windows or AArch64. The CPU scheduler seems to play a pivotal role in how Ardour handles the various recording modes and cycles which play into real-time analysis of the output. The files themselves seem to be an afterthought, someone’s dedication to updated compatibility for an already stable system. That may simply be the sampling bias of looking into only the select few files that I did for this analysis.
