CS 475/575 -- Spring Quarter 2017
Project #5
Vectorized Array Multiplication and Reduction using SSE
60 Points
Due: May 23
This page was last updated: May 7, 2015
Introduction
There are many problems in scientific and engineering computing where you want to multiply arrays of numbers (matrix manipulation, Fourier transformation, convolution, etc.).
This project is in two parts. The first part is to test array multiplication, SIMD and non-SIMD. The second part is to test array multiplication and reduction, SIMD and non-SIMD.
Use the gcc or g++ compilers for both parts. Because simd.p5.cpp uses assembly language, the supplied code is not portable: I know for sure it works on flip, using gcc/g++ 4.8.5. You are welcome to try it other places, but there are no guarantees. It doesn't work on rabbit. Do not use "-O3".
Requirements
1. Use the supplied SIMD SSE code to run an array multiplication timing experiment. Run the same experiment a second time using your own C/C++ array multiplication code.
2. Use the supplied SIMD SSE code to run an array multiplication + reduction timing experiment. Run the same experiment a second time using your own C/C++ array multiplication + reduction code.
3. Use different array sizes from 1K to 32M. The choice of in-between values is up to you, but pick something that will make for a good graph.
4. Feel free to run each array-size test for multiple trials. Record the peak performance value, but also compare the peak against the average to be sure you are getting consistent answers. If the peak and average are not within, say, 20% of each other, run that test again.
5. Create a table and a graph showing SSE/Non-SSE speed-up as a function of array size. Note: this is not a multithreading assignment, so you don't need to worry about a NUMT. Speedup in this case will be S = Psse/Pnon-sse = Tnon-sse/Tsse (P = Performance, T = Elapsed Time). Plot both curves on the same set of axes.
6. The Y-axis performance units in this case will be "Speed-Up", i.e., dimensionless.
7. Be sure that the graphs are plotted so that "up" means "faster".
8. Your commentary write-up (turned in as a PDF file) should tell:
1. What machine you ran this on.
2. Show the table and graph.
3. What patterns are you seeing in the speedups?
4. Are they consistent across a variety of array sizes?
5. Why or why not, do you think?
6. Knowing that SSE SIMD is 4-floats-at-a-time, why could you get a speed-up of < 4.0 or > 4.0 in the array multiplication?
7. Knowing that SSE SIMD is 4-floats-at-a-time, why could you get a speed-up of < 4.0 or > 4.0 in the array multiplication + reduction?
SSE SIMD code:
• You are certainly welcome to write your own if you want, but we have already written Linux SSE code to help you with this.
The two files that you want are simd.p5.h and simd.p5.cpp.
A Makefile might look like this:
simd.p5.o:	simd.p5.h  simd.p5.cpp
	g++ -c simd.p5.cpp -o simd.p5.o

arraymult:	arraymult.cpp  simd.p5.o
	g++ -o arraymult arraymult.cpp simd.p5.o -lm -fopenmp
• Note that you are linking in the OpenMP library because we are using it for timing.
• You can run the tests one-at-a-time, or you can script them by making the array size a #define that you set from outside the program.
Warning!
Do not use any optimization flags when compiling the simd.p5.cpp code. It jumbles up the use of the registers.
Do not use the icc or icpc compilers when compiling the simd.p5.cpp code. It jumbles up the use of the registers.
Grading:
Feature                                         Points
Array Multiply numbers and curve                    20
Array Multiply + Reduction numbers and curve        20
Commentary                                          20
Potential Total                                     60