Dealing with the ARM AArch64 SIMD documentation

The ARMv8 instruction set documentation has always been garbage, particularly the SIMD sections. In comparison, the Intel documentation provides very nice diagrams of how the instructions operate. No such thing with ARM: only short, cryptic descriptions with no illustrations.

In the context of multimedia development, it is common practice to hand-write some SIMD assembly, but every single time it takes me hours to figure out which of the trn*, uzp*, zip* or ext instructions I need to permute my data.

In the past, I used GDB as a workaround, but it's time-consuming, and on top of that my crosstool-ng profile is broken so I can't even build it. I also have no ARMv8 board up and running that I could use; I only have QEMU available.

Given this state of affairs, I decided to actually execute the instructions and look at their output to figure out how they operate.

General logic

The content of the 2 input registers of these instructions will be ASCII characters, but we will use a 4S arrangement in the ASM, which means the instructions will deal with a 4x32-bit packing. This is an arbitrary choice: I just thought that 4 "boxes" are enough to illustrate the behavior of the instructions. I could of course have picked 8B, 16B or any other arrangement.
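To make the notion of arrangement more concrete, here is a small C sketch that views the same 128 bits either as 4S or as 16B. It is not part of the test program: the `vreg128` union and the `lane_count`, `make_abcd` and `show_views` helpers are names made up for this illustration, and the byte order printed by `show_views` assumes a little-endian target (the usual case for AArch64 Linux).

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical model of one 128-bit SIMD register:
 * the same 16 bytes, seen through different arrangements. */
typedef union {
    uint8_t  b[16]; /* 16B: 16 lanes of 8 bits  */
    uint16_t h[8];  /* 8H:   8 lanes of 16 bits */
    uint32_t s[4];  /* 4S:   4 lanes of 32 bits */
    uint64_t d[2];  /* 2D:   2 lanes of 64 bits */
} vreg128;

/* Number of lanes in a full 128-bit register for a given lane size in bytes. */
size_t lane_count(size_t lane_bytes)
{
    return 16 / lane_bytes;
}

/* Fill the register as 4S lanes holding 'A'..'D', like the first input below. */
vreg128 make_abcd(void)
{
    vreg128 r = { .s = {'A', 'B', 'C', 'D'} };
    return r;
}

/* Print the 4S view, then the 16B view of the very same bytes. */
void show_views(const vreg128 *r)
{
    printf("4S :");
    for (size_t i = 0; i < lane_count(sizeof r->s[0]); i++)
        printf(" %08x", (unsigned)r->s[i]);
    printf("\n16B:");
    for (size_t i = 0; i < lane_count(sizeof r->b[0]); i++)
        printf(" %02x", (unsigned)r->b[i]);
    printf("\n");
}
```

On a little-endian machine, each 4S lane shows up in the 16B view as the character byte followed by three zero bytes, so one 4S "box" is always 4 bytes wide.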

In C, the input data and destination look like this:

const uint32_t __attribute__((aligned(16))) v0[4] = {'A', 'B', 'C', 'D'};
const uint32_t __attribute__((aligned(16))) v1[4] = {'E', 'F', 'G', 'H'};
uint32_t __attribute__((aligned(16))) v[4];

v0 and v1 are the sources and v is the destination.

The function wrappers will have a prototype such as:

void zip1(uint32_t *v, const uint32_t *v0, const uint32_t *v1);

In the ASM, the x0, x1 and x2 registers hold the v, v0 and v1 pointers respectively (the first three arguments under the AArch64 calling convention). Testing such an instruction looks like this:

ld1     {v0.4S}, [x1]
ld1     {v1.4S}, [x2]
zip1    v2.4S, v0.4S, v1.4S
st1     {v2.4S}, [x0]
ret
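For readers more comfortable with C, roughly the same load/permute/store sequence can be sketched with the `<arm_neon.h>` intrinsics. This is only an illustration and not what the rest of this write-up uses: the function name `zip1_c` is made up here, the compiler is free to pick other registers than the hand-written version, and the scalar branch exists only so the sketch also builds on a non-ARM host.

```c
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Same contract as the zip1 wrapper: v receives zip1(v0, v1).
 * zip1_c is a hypothetical name chosen to avoid clashing with the ASM symbol. */
void zip1_c(uint32_t *v, const uint32_t *v0, const uint32_t *v1)
{
#if defined(__aarch64__)
    uint32x4_t a = vld1q_u32(v0);    /* ld1  {v0.4S}, [x1]       */
    uint32x4_t b = vld1q_u32(v1);    /* ld1  {v1.4S}, [x2]       */
    uint32x4_t r = vzip1q_u32(a, b); /* zip1 v2.4S, v0.4S, v1.4S */
    vst1q_u32(v, r);                 /* st1  {v2.4S}, [x0]       */
#else
    /* Scalar fallback for non-ARM hosts:
     * interleave the low halves of both inputs. */
    v[0] = v0[0];
    v[1] = v1[0];
    v[2] = v0[1];
    v[3] = v1[1];
#endif
}
```

The point of keeping the ASM version in this write-up is that it shows exactly which instruction runs, which is what we want when the goal is to understand the instruction itself.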

After calling the ASM code, we can just print the resulting buffer with:

printf("%c%c%c%c\n", (char)v[0], (char)v[1], (char)v[2], (char)v[3]);

Complete implementation

showsimd.c calling the ASM functions:

#include <stdint.h>
#include <stdio.h>

#define VFMT "%c%c%c%c"
#define VARG(v) (char)v[0], (char)v[1], (char)v[2], (char)v[3]

int main()
{
    const uint32_t __attribute__((aligned(16))) v0[4] = {'A', 'B', 'C', 'D'};
    const uint32_t __attribute__((aligned(16))) v1[4] = {'E', 'F', 'G', 'H'};
    uint32_t __attribute__((aligned(16))) v[4];

    /* Declare the ASM wrapper, call it, then print "name(v0,v1)=v". */
#define TEST_INSTR(instr) do {                                                  \
    void instr(uint32_t *v, const uint32_t *v0, const uint32_t *v1);            \
    instr(v, v0, v1);                                                           \
    printf(#instr "("VFMT","VFMT")="VFMT"\n", VARG(v0), VARG(v1), VARG(v));     \
} while (0)

    TEST_INSTR(trn1);
    TEST_INSTR(trn2);
    TEST_INSTR(uzp1);
    TEST_INSTR(uzp2);
    TEST_INSTR(zip1);
    TEST_INSTR(zip2);
    TEST_INSTR(ext0);
    TEST_INSTR(ext4);
    TEST_INSTR(ext8);
    TEST_INSTR(ext12);
    return 0;
}

asm.S wrapping the instructions:

#ifdef __ELF__
.section .note.GNU-stack, "", %progbits
#endif

.text

.macro func name
    .align 2
    .global \name
#ifdef __ELF__
    .type \name, %function
#endif
\name:
.endm

.macro endfunc
.endm

.macro insfunc ins
func \ins
    ld1 {v0.4S}, [x1]
    ld1 {v1.4S}, [x2]
    \ins v2.4S, v0.4S, v1.4S
    st1 {v2.4S}, [x0]
    ret
endfunc
.endm

.macro extfunc n
func ext\n
    ld1 {v0.4S}, [x1]
    ld1 {v1.4S}, [x2]
    ext v2.16B, v0.16B, v1.16B, #\n
    st1 {v2.4S}, [x0]
    ret
endfunc
.endm

insfunc trn1
insfunc trn2
insfunc uzp1
insfunc uzp2
insfunc zip1
insfunc zip2

extfunc 0
extfunc 4
extfunc 8
extfunc 12

and a Makefile:

NAME = showsimd
CFLAGS += -Wall -O2
OBJS = $(NAME).o asm.o
$(NAME): $(OBJS)
clean:
	$(RM) $(OBJS) $(NAME)
.PHONY: clean

Usage

% make CC=aarch64-unknown-linux-gnueabi-cc
aarch64-unknown-linux-gnueabi-cc -Wall -O2   -c -o showsimd.o showsimd.c
aarch64-unknown-linux-gnueabi-cc    -c -o asm.o asm.S
aarch64-unknown-linux-gnueabi-cc   showsimd.o asm.o   -o showsimd

% qemu-aarch64 -L $HOME/x-tools/aarch64-unknown-linux-gnueabi/aarch64-unknown-linux-gnueabi/sysroot ./showsimd
trn1(ABCD,EFGH)=AECG
trn2(ABCD,EFGH)=BFDH
uzp1(ABCD,EFGH)=ACEG
uzp2(ABCD,EFGH)=BDFH
zip1(ABCD,EFGH)=AEBF
zip2(ABCD,EFGH)=CGDH
ext0(ABCD,EFGH)=ABCD
ext4(ABCD,EFGH)=BCDE
ext8(ABCD,EFGH)=CDEF
ext12(ABCD,EFGH)=DEFG
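The results above can be summed up as a scalar C model. This is a sketch written to match the observed output for the 4S arrangement, not something taken from the ARM documentation: the `ref_*` names and the `part` parameter (0 for the `1` variant, 1 for the `2` variant) are made up for this illustration, and the destination is assumed not to overlap the sources.

```c
#include <stdint.h>
#include <string.h>

#define LANES 4 /* 4S arrangement: four 32-bit lanes */

/* TRN1/TRN2: take the even (part = 0) or odd (part = 1) lanes of a and b
 * and place each a/b pair side by side. */
void ref_trn(uint32_t *d, const uint32_t *a, const uint32_t *b, int part)
{
    for (int i = 0; i < LANES / 2; i++) {
        d[2 * i]     = a[2 * i + part];
        d[2 * i + 1] = b[2 * i + part];
    }
}

/* UZP1/UZP2: keep the even (part = 0) or odd (part = 1) lanes of a,
 * followed by the same lanes of b. */
void ref_uzp(uint32_t *d, const uint32_t *a, const uint32_t *b, int part)
{
    for (int i = 0; i < LANES / 2; i++) {
        d[i]             = a[2 * i + part];
        d[LANES / 2 + i] = b[2 * i + part];
    }
}

/* ZIP1/ZIP2: interleave the low (part = 0) or high (part = 1) halves
 * of a and b. */
void ref_zip(uint32_t *d, const uint32_t *a, const uint32_t *b, int part)
{
    const int base = part * (LANES / 2);
    for (int i = 0; i < LANES / 2; i++) {
        d[2 * i]     = a[base + i];
        d[2 * i + 1] = b[base + i];
    }
}

/* EXT with the 16B arrangement: extract 16 bytes from the 32-byte string
 * made of a followed by b, starting at byte n (0 to 15). */
void ref_ext(uint32_t *d, const uint32_t *a, const uint32_t *b, unsigned n)
{
    uint8_t cat[32];
    memcpy(cat, a, 16);
    memcpy(cat + 16, b, 16);
    memcpy(d, cat + n, 16);
}
```

For example, `ref_zip(v, v0, v1, 1)` should fill `v` with the same values as the `zip2` wrapper. Since `ext` counts in bytes, #4, #8 and #12 correspond to a shift of one, two and three 4S lanes, which is why only multiples of 4 are tested here.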

So this is still not as fancy as an Intel diagram, but at least now I know what these instructions do.
