Name	Name	Last commit message	Last commit date
parent directory ..
README.md	README.md

Assignment 07 for May 11th, 2021

The goal of this assignment is to study several means of vectorizing a simple program.

Exercise 1 (1 Point)

Description

In this exercise, we will employ auto-vectorization of the compiler to increase the performance of a small program that computes a multiply-add operation on three vectors.

Tasks

Write a sequential program that computes a[i] += b[i] * c[i] for three float vectors a, b, and c. Enclose this computation in a loop such that it is repeated 1e6 times. I.e. your code should look like this:

for(int run = 0; run < repetitions; ++run) {
    for(int i = 0; i < size; ++i) {
        a[i] += b[i] * c[i];
    }
}

Initialize the vectors with constant values and add some output that allows you to verify correct computation of the program for subsequent optimization steps (try to avoid overflows). Compile this program on LCC2 with gcc 10.3.0 using -O1 and measure the execution time of the computation only for several problem sizes (=vector lengths). This sequential program is our baseline reference.
Try to use auto-vectorization in the compiler to vectorize the computation. Investigate which flag(s) to use for that, and try to limit your compiler change to only enable vectorization but not any other optimizations. Measure the execution time for the same problem sizes as before and compute the respective speedups. What can you observe in terms of performance? Is the result still correct? Does varying the problem size affect the performance gain or loss?

Use perf to further investigate the origin of any performance difference you might observe. perf stat -e rYYXX (mind the byte order!) allows you to measure custom events supported by the CPU, where - on Intel systems - XX is the so-called event code and YY is the so-called unit mask (also called umask). A few event codes and unit masks for measuring vectorized instructions are given below. Check which ones are suitable for your investigation:

Name	Event Code	Unit Mask
SIMD_INST_RETIRED.PACKED_SINGLE	C7	01
SIMD_INST_RETIRED.SCALAR_SINGLE	C7	02
SIMD_INST_RETIRED.PACKED_DOUBLE	C7	04
SIMD_INST_RETIRED.SCALAR_DOUBLE	C7	08
SIMD_INST_RETIRED.VECTOR	C7	10
SIMD_INST_RETIRED.ANY	C7	1F

_{In case you are interested: For further events relevant to the CPU type on LCC2 and short event descriptions, see Table 19-23 on pages 415-438 of the Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide, Part 2. It lists all events that can be counted with perf on our hardware.}

Enter the wall clock time of the sequential program and the auto-vectorized version with size 2048 and 1e6 repetitions to the spreadsheet linked on Discord.

Exercise 2 (1 Point)

Description

Instead of relying on the compiler to vectorize the code for us, we will do so manually in this exericse, using OpenMP.

Tasks

Vectorize your sequential program using OpenMP. Do not use any OpenMP thread-based parallelism.
Compile your OpenMP-vectorized code with -O1 but without any compiler flags for auto-vectorization and compare its performance for the problem size 2048 to both the sequential version and the compiler-vectorized version. What can you observe? Is the result still correct?
Verify any findings using perf as described in Exercise 1.
Repeat the sequential and OpenMP executions when changing the data type from float to double. What can you observe?
How does the solution for this Exercise compare to Exercise 1? Are there any advantages or disadvantages?
Enter the wall clock time of the OpenMP-vectorized version with size 2048, 1e6 repetitions and data type float to the spreadsheet linked on Discord.

Exercise 3 (1 Point)

Description

Instead of relying on OpenMP for vectorization, we will do so using compiler-specific intrinsics in this exercise.

Tasks

Vectorize your code for float types using the gcc-specific intrinsic functions _mm_load_ps, _mm_add_ps, _mm_mul_ps, and _mm_store_ps. Do not forget to include the respective header xmmintrin.h.
Compile your manually vectorized code with -O1 but without any compiler flags for auto-vectorization and compare its performance for your set of problem sizes to your previous code versions. What can you observe? Is the result still correct?
Verify any findings using perf as described in Exercise 1.
How does the solution for this Exercise compare to Exercise 2 and Exercise 1? Are there any advantages or disadvantages?
Enter the wall clock time of the compiler-specific intrinsics version with size 2048 and 1e6 repetitions to the spreadsheet linked on Discord.

General Notes

All the material required by the tasks above (e.g., code, figures, text, etc...) must be part of the solution that is handed in. Your experiments should be reproducible and comparable to your measurements using the solution materials that you hand in.

Every member of your group must be able to explain the given problem, your solution, and possible findings. You may also need to answer detailed questions about any of these aspects.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

07

07

README.md

Assignment 07 for May 11th, 2021

Exercise 1 (1 Point)

Description

Tasks

Exercise 2 (1 Point)

Description

Tasks

Exercise 3 (1 Point)

Description

Tasks

General Notes

Files

07

Directory actions

More options

Directory actions

More options

Latest commit

History

07

Folders and files

parent directory

README.md

Assignment 07 for May 11th, 2021

Exercise 1 (1 Point)

Description

Tasks

Exercise 2 (1 Point)

Description

Tasks

Exercise 3 (1 Point)

Description

Tasks

General Notes