TD on SIMD (C + Assembly)

Homework (TD SIMD)

'No Unroll' Version

The C code of the function:

double dotprod(double *a, double *b, unsigned long long n)
{
	double d = 0.0;
	unsigned long long i = 0;

	for (; i < n; i++)
		d += a[i] * b[i];

	return d;
}

The equivalent assembly compiled with 'gcc':

Assembly code compiled using the option '-O1':

dotprod:
.LFB0:
	.cfi_startproc
	endbr64
	testq	%rdx, %rdx		# Test n
	je	.L4			# If n == 0, goto .L4
	movl	$0, %eax		# Copy 0 into the 32-bit register (clears all of %rax, so i = 0)
	pxor	%xmm1, %xmm1		# Initializes the double 'd' to 0
.L3:					# FOR LOOP
    # movsd: SSE2 Instruction
    # movsd: SSE2 Instruction
	movsd	(%rdi,%rax,8), %xmm0	# Load 8 bytes from address %rdi + %rax*8
					# %xmm0 = a[i]          \w { i = %rax }
	mulsd	(%rsi,%rax,8), %xmm0	# Multiply by the 8 bytes at %rsi + %rax*8
					# %xmm0 = b[i] * %xmm0  \w { i = %rax }
	addsd	%xmm0, %xmm1		# Adds %xmm0 to %xmm1
					# %xmm1 = %xmm1 + %xmm0
					# d = d + a[i] * b[i]   \w { i = %rax }
	addq	$1, %rax		# Adds 1 to %rax (i = i + 1)
	cmpq	%rax, %rdx		# Compare if i equals n
	jne	.L3			# Jump if not equal to .L3 (FOR LOOP)
.L1:
	movapd	%xmm1, %xmm0		# Copy the 128-bit accumulator %xmm1 to %xmm0 (the return register)
	ret				# Return (exit)
.L4:
	pxor	%xmm1, %xmm1		# Initializes the double 'd' to 0
	jmp	.L1			# Jumps to .L1
	.cfi_endproc

Assembly code compiled using the option '-O2':

dotprod:
.LFB0:
	.cfi_startproc
	endbr64
	testq	%rdx, %rdx
	je	.L4
	xorl	%eax, %eax		# Zero %eax (and therefore all of %rax) with a XOR
					# Meaning: i = 0
	pxor	%xmm1, %xmm1		# Initialize %xmm1 (the double d) to 0
	.p2align 4,,10
	.p2align 3
.L3:
	movsd	(%rdi,%rax,8), %xmm0	# %xmm0 = a[i]
	mulsd	(%rsi,%rax,8), %xmm0	# %xmm0 = %xmm0 * b[i]
	addq	$1, %rax		# i += 1
	addsd	%xmm0, %xmm1		# d = d + %xmm0
	cmpq	%rax, %rdx		# Compare if i equals to n
	jne	.L3			# If not, jump to .L3
	movapd	%xmm1, %xmm0		# Copies the 128-bit %xmm1 to %xmm0 (our return value)
	ret				# Return (exit)
	.p2align 4,,10
	.p2align 3
.L4:
	pxor	%xmm1, %xmm1		# Initializes the double 'd' to 0
	movapd	%xmm1, %xmm0		# Moves (copies) %xmm1 to %xmm0 (our return value)
	ret				# Return (exit)
	.cfi_endproc

Assembly code compiled using the option '-O3':

dotprod:
.LFB0:
	.cfi_startproc
	endbr64
	testq	%rdx, %rdx
	je	.L7			# If %rdx (n) is 0 then jump to .L7
	cmpq	$1, %rdx		# Comparing if n == 1
	je	.L8			# If n == 1 then jump to .L8
	movq	%rdx, %rcx		# Copy n into %rcx (used as the loop bound below)
	xorl	%eax, %eax		# Initialize i to 0
	pxor	%xmm0, %xmm0		# Initialize the double d to 0
	shrq	%rcx			# %rcx = n / 2 (number of vector iterations)
	salq	$4, %rcx		# Shift %rcx LEFT by 4: %rcx = (n/2) * 16, the byte offset where the vector loop stops
	.p2align 4,,10
	.p2align 3
.L4:
	movupd	(%rdi,%rax), %xmm1		# %xmm1 = { a[i], a[i+1] } (unaligned 128-bit load, %rax is a byte offset)
	movupd	(%rsi,%rax), %xmm3		# %xmm3 = { b[i], b[i+1] }
	addq	$16, %rax			# advance the offset by 16 bytes (2 doubles)
	mulpd	%xmm3, %xmm1			# %xmm1 = { a[i]*b[i], a[i+1]*b[i+1] }
	movapd	%xmm1, %xmm2			# %xmm2 = %xmm1
	unpckhpd	%xmm1, %xmm1		# copy the high double of %xmm1 into its low lane
	addsd	%xmm0, %xmm2			# %xmm2 = d + a[i]*b[i]
	movapd	%xmm1, %xmm0			# %xmm0 = a[i+1]*b[i+1]
	addsd	%xmm2, %xmm0			# d = (d + a[i]*b[i]) + a[i+1]*b[i+1]
	cmpq	%rax, %rcx			# compare the byte offset with the loop bound in %rcx
	jne	.L4				# if we didn't reach the end of the vector loop, jump back to .L4
	movq	%rdx, %rax			# i = n
	andq	$-2, %rax			# i = n & -2: round n down to even (index of the leftover element)
	andl	$1, %edx			# test the low bit of n: sets ZF if n is even (%edx is not unused, it feeds the je below)
	je	.L11				# n even: no leftover element, jump to .L11 and return
.L3:
	movsd	(%rsi,%rax,8), %xmm1		# %xmm1 = b[i]
	mulsd	(%rdi,%rax,8), %xmm1		# %xmm1 *= a[i]
	addsd	%xmm1, %xmm0			# d += %xmm1
	ret					# Return (exit)
	.p2align 4,,10
	.p2align 3
.L11:
	ret					# Return (exit)
	.p2align 4,,10
	.p2align 3
.L7:
	pxor	%xmm0, %xmm0			# Initialize the variable double d to 0 by performing XOR operation.
	ret					# Return (exit)
.L8:
	xorl	%eax, %eax			# Initialize the variable unsigned long long i to 0
	pxor	%xmm0, %xmm0			# Initialize the variable double d to 0
	jmp	.L3				# Jump to .L3 (the scalar tail: computes a[0] * b[0])
	.cfi_endproc

Assembly code compiled using the option '-Ofast':

# ...
.L4:
	movupd	(%rdi,%rax), %xmm1
	movupd	(%rsi,%rax), %xmm2
	addq	$16, %rax
	mulpd	%xmm2, %xmm1			# vector multiplication
	addpd	%xmm1, %xmm0			# vector addition
	cmpq	%rax, %rcx
	jne	.L4
	movq	%rdx, %rax
	movapd	%xmm0, %xmm1
	unpckhpd	%xmm0, %xmm1
	andq	$-2, %rax
	andl	$1, %edx
	addpd	%xmm0, %xmm1			# vector addition
	je	.L1
.L3:						# Scalar tail: runs when n is odd (handles the last element)
	movsd	(%rsi,%rax,8), %xmm0		# %xmm0 = b[i]
	mulsd	(%rdi,%rax,8), %xmm0		# %xmm0 *= a[i]
	addsd	%xmm0, %xmm1			# %xmm1 += %xmm0
.L1:
	movapd	%xmm1, %xmm0			# move %xmm1 to %xmm0 (return value)
	ret					# return (exit)
# ...

Observations:

In this assembly code, the unsigned long long i appears as a 32-bit register (%eax) but is manipulated through the full 64-bit register (%rax): on x86-64, writing to %eax automatically zero-extends into %rax.

Side note: that is why %eax alone is enough to wipe and initialize the whole register %rax.

The assembly generated with -O1 and with -O2 is more or less the same. The main difference is that the copy of the SSE register %xmm1 into the return register %xmm0 has been duplicated into the .L3 and .L4 paths in the -O2 code, instead of being shared through a common .L1 block.

One way to optimize the initialization of a register to 0: instead of copying the immediate 0 into the register, perform a bitwise exclusive OR (XOR) of the register with itself.

    # Instead of this
    movl $0, %eax               # %eax = 0 (general-purpose register)

    # We could use this
    xorl %eax, %eax             # %eax = 0, shorter encoding, recognized as a zeroing idiom

    # The SSE equivalent for an xmm register
    pxor %xmm1, %xmm1           # %xmm1 = 0

Having said that, the assembly code generated with -O3 is way different from both -O1 and -O2 codes.

There isn't much difference between -O3 and -Ofast apart from the vector addition: -Ofast accumulates with the packed addpd inside the loop, while -O3 keeps scalar addsd additions. The reason is that -Ofast enables -ffast-math, which lets the compiler reorder (reassociate) floating-point additions; without it, the additions must be performed in the exact order written in the C source.

I didn't include the assembly code of the kamikaze version because it's a bit long (115 lines). So I am going to comment directly on the differences between it and the other versions (-O1, -O2, -O3, -Ofast).

In the kamikaze version, AVX (Advanced Vector eXtensions) has been put to use with the vectors ymm0 to ymm15.

The equivalent assembly compiled with 'clang':

Observations for the output of -O1 and -O2:

The assembly code compiled with the option -O1 has many more lines than the gcc version, but it still uses only scalar double precision instructions (movsd, mulsd, addsd).

Observations for the other options (-O3, -Ofast and 'kamikaze'):

In -O3, -Ofast and kamikaze, we have evolved from scalar double precision operations to packed double precision SIMD instructions, which make use of vector operations. But, just like gcc, -O3 only vectorizes the multiplication and keeps scalar additions. In -Ofast, on the other hand, both the addition and the multiplication are vectorized.

For the kamikaze option, we get the same behaviour as gcc's kamikaze version:

  • Using AVX vector registers (ymm0 - ymm15), so the code is essentially 100% vectorized, switching to scalar instructions only where vectorization is not possible.

Summary for the 'No Unroll' code (gcc vs clang)

|          | SIMD Instructions (gcc)                 | SIMD Instructions (clang)               |
|----------|-----------------------------------------|-----------------------------------------|
| -O1      | Scalar                                  | Scalar                                  |
| -O2      | Scalar                                  | Scalar                                  |
| -O3      | Vector multiplication + scalar addition | Vector multiplication + scalar addition |
| -Ofast   | Vector add and mul                      | Vector add and mul                      |
| Kamikaze | Vector                                  | Vector                                  |

Code that has been unrolled twice

The C code of the function:

double dotprod(double *a, double *b, unsigned long long n)
{
	double d = 0.0;
	double d_2 = 0.0;
	unsigned long long i = 0;
	if (n&1) 
	{
		d += a[i] * b[i];
		i++;
	}
	for (; i < n; i+=2) {
		d += a[i] * b[i];
		d_2 += a[i+1] * b[i+1];
	}

	return d + d_2;
}

Comments about the assembly code generated with 'gcc':

  • -O1 and -O2:

    PS: %xmm0 is referred to as ret in the comments below

    .L4:
    	movsd	(%rdi,%rax,8), %xmm1		# %xmm1 = a[i]
    	mulsd	(%rsi,%rax,8), %xmm1		# %xmm1 *= b[i]
    	addsd	%xmm1, %xmm0			# ret += %xmm1
    	movsd	8(%rdi,%rax,8), %xmm1		# %xmm1 = a[i+1] *offset by 8 bytes -> the next element*
    	mulsd	8(%rsi,%rax,8), %xmm1		# %xmm1 *= b[i+1] *same offset for the %rsi register*
    	addsd	%xmm1, %xmm2			# %xmm2 += %xmm1
    	addq	$2, %rax			# i += 2
    	cmpq	%rax, %rdx			# compare n to i
    	ja	.L4				# if n>i, loop back
    .L3:
    	addsd	%xmm2, %xmm0			# ret += %xmm2
    	ret					# return ret (exit)
    
    .L4:
    	movsd	(%rdi,%rax,8), %xmm1
    	mulsd	(%rsi,%rax,8), %xmm1
    	addsd	%xmm1, %xmm2
    	movsd	8(%rdi,%rax,8), %xmm1
    	mulsd	8(%rsi,%rax,8), %xmm1
    	addq	$2, %rax
    	addsd	%xmm1, %xmm0
    .L8:
    	cmpq	%rax, %rdx			# compare n to i
    	ja	.L4				# if n>i, goto back to .L4 LOOP
    	addsd	%xmm2, %xmm0			# ret += %xmm2
    	ret					# return ret
    

    -O1 and -O2 are the same. Both use scalar double precision operations for the additions, multiplications and movements.

  • -O3 and -Ofast: In -O3, there is a use of packed double (vector) operations, reading 16 bytes (2 elements) at a time, still with SSE vectors (xmm). -Ofast is a more compact version of -O3.

  • Kamikaze: Use of scalar and vector registers.

Comments about the assembly code generated with 'clang':

  • -O1: Using only scalar double precision SIMD x86 instructions (movsd, mulsd, addsd)
  • -O2: Using packed double precision SIMD x86 instructions such as: movupd, mulpd and addpd
  • -O3: Same assembly as -O2
  • -Ofast: Same assembly as -O3 with slight change
  • Kamikaze: The code is significantly smaller than gcc's version; vmovupd (vector move) and vfmadd231pd (vector fused multiply-add) instructions are used.

Summary for the 'Unroll Twice' code (gcc vs clang)

|          | SIMD Instructions (gcc)                 | SIMD Instructions (clang) |
|----------|-----------------------------------------|---------------------------|
| -O1      | Scalar                                  | Scalar                    |
| -O2      | Scalar                                  | Scalar                    |
| -O3      | Vector multiplication + scalar addition | Vector                    |
| -Ofast   | Vector                                  | Vector                    |
| Kamikaze | Vector                                  | Vector                    |

Code that has been unrolled 4 times

The C code of the function:

double dotprod(double *a, double *b, unsigned long long n)
{
	double d = 0.0;
	double d_2 = 0.0;
	double d_3 = 0.0;
	double d_4 = 0.0;
	unsigned long long i = 0;
	/* peel off the n % 4 leftover elements first, so that the main loop,
	   which reads 4 elements per iteration, never runs past the end of
	   the arrays (a plain `if (n & 1)` would only handle n odd) */
	for (; i < (n & 3); i++)
		d += a[i] * b[i];
	for (; i < n; i+=4) {
		d += a[i] * b[i];
		d_2 += a[i+1] * b[i+1];
		d_3 += a[i+2] * b[i+2];
		d_4 += a[i+3] * b[i+3];
	}
	return d + d_2 + d_3 + d_4;
}

Comments about the assembly code generated with 'gcc':

  • -O1: The use of only scalar double precision SIMD x86 instructions.
  • -O2: The use of only scalar double precision SIMD x86 instructions.
  • -O3: Mixed use of both scalar and packed double precision SIMD x86 instructions.
  • -Ofast: The use of only scalar double precision SIMD x86 instructions.
  • Kamikaze: The use of only scalar double precision SIMD x86 instructions.

Comments about the assembly code generated with 'clang':

  • -O1: The use of only scalar double precision SIMD x86 instructions.
  • -O2: The use of only packed double precision SIMD x86 instructions.
  • -O3: The use of only packed double precision SIMD x86 instructions.
  • -Ofast: The use of only packed double precision SIMD x86 instructions.
  • Kamikaze: The use of only packed double precision SIMD x86 instructions.

Summary for the 'Unroll x4' code (gcc vs clang)

|          | SIMD Instructions (gcc) | SIMD Instructions (clang) |
|----------|-------------------------|---------------------------|
| -O1      | Scalar                  | Scalar                    |
| -O2      | Scalar                  | Vector                    |
| -O3      | Vector + scalar         | Vector                    |
| -Ofast   | Scalar                  | Vector                    |
| Kamikaze | Scalar                  | Vector                    |