The goal of this assignment is to study several means of vectorizing a simple program.
In this exercise, we will employ auto-vectorization of the compiler to increase the performance of a small program that computes a multiply-add operation on three vectors.
- Write a sequential program that computes
a[i] += b[i] * c[i]
for threefloat
vectorsa
,b
, andc
. Enclose this computation in a loop such that it is repeated1e6
times. I.e. your code should look like this:
for(int run = 0; run < repetitions; ++run) {
for(int i = 0; i < size; ++i) {
a[i] += b[i] * c[i];
}
}
-
Initialize the vectors with constant values and add some output that allows you to verify correct computation of the program for subsequent optimization steps (try to avoid overflows). Compile this program on LCC2 with
gcc 10.3.0
using-O1
and measure the execution time of the computation only for several problem sizes (=vector lengths). This sequential program is our baseline reference. -
Try to use auto-vectorization in the compiler to vectorize the computation. Investigate which flag(s) to use for that, and try to limit your compiler change to only enable vectorization but not any other optimizations. Measure the execution time for the same problem sizes as before and compute the respective speedups. What can you observe in terms of performance? Is the result still correct? Does varying the problem size affect the performance gain or loss?
-
Use
perf
to further investigate the origin of any performance difference you might observe.perf stat -e rYYXX
(mind the byte order!) allows you to measure custom events supported by the CPU, where - on Intel systems - XX is the so-called event code and YY is the so-called unit mask (also called umask). A few event codes and unit masks for measuring vectorized instructions are given below. Check which ones are suitable for your investigation:Name Event Code Unit Mask SIMD_INST_RETIRED.PACKED_SINGLE C7 01 SIMD_INST_RETIRED.SCALAR_SINGLE C7 02 SIMD_INST_RETIRED.PACKED_DOUBLE C7 04 SIMD_INST_RETIRED.SCALAR_DOUBLE C7 08 SIMD_INST_RETIRED.VECTOR C7 10 SIMD_INST_RETIRED.ANY C7 1F In case you are interested: For further events relevant to the CPU type on LCC2 and short event descriptions, see Table 19-23 on pages 415-438 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2. It lists all events that can be counted with
perf
on our hardware. -
Enter the wall clock time of the sequential program and the auto-vectorized version with size 2048 and 1e6 repetitions to the spreadsheet linked on Discord.
Instead of relying on the compiler to vectorize the code for us, we will do so manually in this exericse, using OpenMP.
- Vectorize your sequential program using OpenMP. Do not use any OpenMP thread-based parallelism.
- Compile your OpenMP-vectorized code with
-O1
but without any compiler flags for auto-vectorization and compare its performance for the problem size 2048 to both the sequential version and the compiler-vectorized version. What can you observe? Is the result still correct? - Verify any findings using
perf
as described in Exercise 1. - Repeat the sequential and OpenMP executions when changing the data type from
float
todouble
. What can you observe? - How does the solution for this Exercise compare to Exercise 1? Are there any advantages or disadvantages?
- Enter the wall clock time of the OpenMP-vectorized version with size 2048, 1e6 repetitions and data type
float
to the spreadsheet linked on Discord.
Instead of relying on OpenMP for vectorization, we will do so using compiler-specific intrinsics in this exercise.
- Vectorize your code for
float
types using the gcc-specific intrinsic functions_mm_load_ps
,_mm_add_ps
,_mm_mul_ps
, and_mm_store_ps
. Do not forget to include the respective headerxmmintrin.h
. - Compile your manually vectorized code with
-O1
but without any compiler flags for auto-vectorization and compare its performance for your set of problem sizes to your previous code versions. What can you observe? Is the result still correct? - Verify any findings using
perf
as described in Exercise 1. - How does the solution for this Exercise compare to Exercise 2 and Exercise 1? Are there any advantages or disadvantages?
- Enter the wall clock time of the compiler-specific intrinsics version with size 2048 and 1e6 repetitions to the spreadsheet linked on Discord.
All the material required by the tasks above (e.g., code, figures, text, etc...) must be part of the solution that is handed in. Your experiments should be reproducible and comparable to your measurements using the solution materials that you hand in.
Every member of your group must be able to explain the given problem, your solution, and possible findings. You may also need to answer detailed questions about any of these aspects.