The ARMv8 instruction set documentation has always been garbage, particularly the SIMD sections. In comparison, the Intel documentation provides very nice diagrams of how the instructions operate. No such thing with ARM, only short, cryptic wording with no illustrations.
In the context of multimedia development, it is common practice to hand-write
some SIMD assembly, but every single time it takes me hours to figure out which
of the trn*, uzp*, zip* or ext instructions I need to permute my data.
In the past, I used GDB as a workaround, but it's time-consuming, and on
top of that my crosstool-ng profile is broken so I can't even build it. Also,
I have no ARMv8 board up and running; I only have QEMU available.
Given this current state, I decided to actually execute the instructions to figure out how they operate.
General logic
The content of the 2 input registers of these instructions will be ASCII
characters, but we will use a 4S arrangement in the ASM, which means the
instructions will deal with a 4x32-bit packing. This is an arbitrary choice: I
just thought that 4 "boxes" were enough to illustrate the behaviour of the
instructions. I could of course have picked 8B, 16B or any other arrangement.
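To make the arrangement notation a bit more concrete, here is a minimal sketch of the arithmetic behind it: the number is the lane count and the letter is the lane width (B = 8, H = 16, S = 32, D = 64 bits), for a total of 64 or 128 bits. The function name `parse_arrangement` is made up for this illustration, and it only checks the size arithmetic; a given instruction may accept only a subset of these arrangements.

```c
#include <stdlib.h>

/*
 * Hypothetical helper: split an arrangement such as "4S" into a lane count
 * and a lane width in bits. Returns 0 on success, -1 if the specifier does
 * not describe a 64-bit or 128-bit vector.
 */
int parse_arrangement(const char *spec, int *lanes, int *lane_bits)
{
    char *end;
    long n = strtol(spec, &end, 10);
    int bits;

    /* Need digits followed by exactly one letter */
    if (end == spec || end[0] == '\0' || end[1] != '\0')
        return -1;

    switch (end[0]) {
    case 'B': bits = 8;  break;
    case 'H': bits = 16; break;
    case 'S': bits = 32; break;
    case 'D': bits = 64; break;
    default:  return -1;
    }

    /* At most 16 lanes (16B); the total must be 64 or 128 bits */
    if (n <= 0 || n > 16 || (n * bits != 64 && n * bits != 128))
        return -1;

    *lanes = (int)n;
    *lane_bits = bits;
    return 0;
}
```

With this, "4S" gives 4 lanes of 32 bits and "16B" gives 16 lanes of 8 bits, both filling a 128-bit register, while "8B" only covers 64 bits.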
In C, the input data and destination look like this:
const uint32_t __attribute__((aligned(16))) v0[4] = {'A', 'B', 'C', 'D'};
const uint32_t __attribute__((aligned(16))) v1[4] = {'E', 'F', 'G', 'H'};
uint32_t __attribute__((aligned(16))) v[4];
v0 and v1 are the sources and v the destination.
The function wrappers will have a prototype such as:
void zip1(uint32_t *v, const uint32_t *v0, const uint32_t *v1);
In the ASM, the x0, x1 and x2 registers will be mapped to v, v0 and v1.
Testing such an instruction will look like this:
ld1 {v0.4S}, [x1]
ld1 {v1.4S}, [x2]
zip1 v2.4S, v0.4S, v1.4S
st1 {v2.4S}, [x0]
ret
After calling the ASM code, we can just print the resulting buffer with:
printf("%c%c%c%c\n", (char)v[0], (char)v[1], (char)v[2], (char)v[3]);
Complete implementation
showsimd.c calling the ASM functions:
#include <stdint.h>
#include <stdio.h>
#define VFMT "%c%c%c%c"
#define VARG(v) (char)v[0], (char)v[1], (char)v[2], (char)v[3]
int main(void)
{
    const uint32_t __attribute__((aligned(16))) v0[4] = {'A', 'B', 'C', 'D'};
    const uint32_t __attribute__((aligned(16))) v1[4] = {'E', 'F', 'G', 'H'};
    uint32_t __attribute__((aligned(16))) v[4];

#define TEST_INSTR(instr) do {                                               \
    void instr(uint32_t *v, const uint32_t *v0, const uint32_t *v1);         \
    instr(v, v0, v1);                                                        \
    printf(#instr "("VFMT","VFMT")="VFMT"\n", VARG(v0), VARG(v1), VARG(v));  \
} while (0)

    TEST_INSTR(trn1);
    TEST_INSTR(trn2);
    TEST_INSTR(uzp1);
    TEST_INSTR(uzp2);
    TEST_INSTR(zip1);
    TEST_INSTR(zip2);
    TEST_INSTR(ext0);
    TEST_INSTR(ext4);
    TEST_INSTR(ext8);
    TEST_INSTR(ext12);
    return 0;
}
asm.S wrapping the instructions:
#ifdef __ELF__
.section .note.GNU-stack, "", %progbits
#endif
.text
.macro func name
    .align 2
    .global \name
#ifdef __ELF__
    .type \name, %function
#endif
\name:
.endm

.macro endfunc
.endm

.macro insfunc ins
func \ins
    ld1 {v0.4S}, [x1]
    ld1 {v1.4S}, [x2]
    \ins v2.4S, v0.4S, v1.4S
    st1 {v2.4S}, [x0]
    ret
endfunc
.endm

.macro extfunc n
func ext\n
    ld1 {v0.4S}, [x1]
    ld1 {v1.4S}, [x2]
    ext v2.16B, v0.16B, v1.16B, #\n
    st1 {v2.4S}, [x0]
    ret
endfunc
.endm
insfunc trn1
insfunc trn2
insfunc uzp1
insfunc uzp2
insfunc zip1
insfunc zip2
extfunc 0
extfunc 4
extfunc 8
extfunc 12
and a Makefile:
NAME = showsimd
CFLAGS += -Wall -O2
OBJS = $(NAME).o asm.o
$(NAME): $(OBJS)
clean:
	$(RM) $(OBJS) $(NAME)
.PHONY: clean
Usage
% make CC=aarch64-unknown-linux-gnueabi-cc
aarch64-unknown-linux-gnueabi-cc -Wall -O2 -c -o showsimd.o showsimd.c
aarch64-unknown-linux-gnueabi-cc -c -o asm.o asm.S
aarch64-unknown-linux-gnueabi-cc showsimd.o asm.o -o showsimd
% qemu-aarch64 -L $HOME/x-tools/aarch64-unknown-linux-gnueabi/aarch64-unknown-linux-gnueabi/sysroot ./showsimd
trn1(ABCD,EFGH)=AECG
trn2(ABCD,EFGH)=BFDH
uzp1(ABCD,EFGH)=ACEG
uzp2(ABCD,EFGH)=BDFH
zip1(ABCD,EFGH)=AEBF
zip2(ABCD,EFGH)=CGDH
ext0(ABCD,EFGH)=ABCD
ext4(ABCD,EFGH)=BCDE
ext8(ABCD,EFGH)=CDEF
ext12(ABCD,EFGH)=DEFG
So this is still not as fancy as an Intel diagram, but at least now I know what these instructions do.
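For reference, the results above can be written down as a portable C model of the permutations, which may be handy to experiment without a cross toolchain or QEMU. This is only a sketch derived from the output of the test program for the 4S case (and byte offsets for ext); the `model_*` and `lanes_to_str` names are made up for the illustration, and the destination is assumed not to alias the sources.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers: C models of the 4S permutes, d must not alias a or b */

/* part = 0 models trn1 (even lanes), part = 1 models trn2 (odd lanes) */
void model_trn(uint32_t d[4], const uint32_t a[4], const uint32_t b[4], int part)
{
    for (int i = 0; i < 2; i++) {
        d[2 * i]     = a[2 * i + part];
        d[2 * i + 1] = b[2 * i + part];
    }
}

/* part = 0 models uzp1, part = 1 models uzp2: every other lane of a:b */
void model_uzp(uint32_t d[4], const uint32_t a[4], const uint32_t b[4], int part)
{
    uint32_t cat[8];

    memcpy(cat, a, 4 * sizeof(*a));
    memcpy(cat + 4, b, 4 * sizeof(*b));
    for (int i = 0; i < 4; i++)
        d[i] = cat[2 * i + part];
}

/* part = 0 models zip1 (low halves), part = 1 models zip2 (high halves) */
void model_zip(uint32_t d[4], const uint32_t a[4], const uint32_t b[4], int part)
{
    for (int i = 0; i < 2; i++) {
        d[2 * i]     = a[2 * part + i];
        d[2 * i + 1] = b[2 * part + i];
    }
}

/* Models ext on 16B: 16 bytes of a:b starting at byte offset n (0..15) */
void model_ext(uint32_t d[4], const uint32_t a[4], const uint32_t b[4], unsigned n)
{
    uint8_t cat[32];

    memcpy(cat, a, 16);
    memcpy(cat + 16, b, 16);
    memcpy(d, cat + (n & 15), 16);
}

/* Writes the four lanes as characters, like the VARG macro in showsimd.c */
void lanes_to_str(char out[5], const uint32_t v[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = (char)v[i];
    out[4] = '\0';
}
```

Calling `model_zip(d, v0, v1, 0)` with the same ABCD and EFGH inputs should give AEBF, matching the zip1 line above. Since ext takes a byte offset, `model_ext` with 4, 8 and 12 moves the window by one, two and three 32-bit lanes, which is why the test program uses those values.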