Programming Assignment 4 - CUDA
Due Date: Tues. 11/15
Using CUDA at Ohio Supercomputer Center
The OSC clusters are equipped with Tesla K40 GPUs. Some relevant stats for the K40:
❚♦t❛❧ ❛♠♦✉♥t ♦❢ ❣❧♦❜❛❧ ♠❡♠♦r2✿ ✶✷✷✽✽ ▼❇2t❡s ✭✶✷✽✽✹✼✵✺✷✽✵ ❜2t❡s✮
✭✶✺✮ ▼✉❧t✐♣r♦❝❡ss♦rs✱ ✭✶✾✷✮ ❈❯❉❆ ❈♦r❡s✴▼P✿ ✷✽✽✵ ❈❯❉❆ ❈♦r❡s
●P❯ ❈❧♦❝❦ r❛t❡✿ ✽✼✻ ▼❍3 ✭✵✳✽✽ ●❍3✮
▼❡♠♦r2 ❈❧♦❝❦ r❛t❡✿ ✸✵✵✹ ▼❤3
▼❡♠♦r2 ❇✉s ❲✐❞t❤✿ ✸✽✹✲❜✐t
▲✷ ❈❛❝❤❡ ❙✐3❡✿ ✶✺✼✷✽✻✹ ❜2t❡s
❚♦t❛❧ ❛♠♦✉♥t ♦❢ ❝♦♥st❛♥t ♠❡♠♦r2✿ ✻✺✺✸✻ ❜2t❡s
❚♦t❛❧ ❛♠♦✉♥t ♦❢ s❤❛r❡❞ ♠❡♠♦r2 ♣❡r ❜❧♦❝❦✿ ✹✾✶✺✷ ❜2t❡s
❚♦t❛❧ ♥✉♠❜❡r ♦❢ r❡❣✐st❡rs ❛✈❛✐❧❛❜❧❡ ♣❡r ❜❧♦❝❦✿ ✻✺✺✸✻
❲❛r♣ s✐3❡✿ ✸✷
▼❛1✐♠✉♠ ♥✉♠❜❡r ♦❢ t❤r❡❛❞s ♣❡r ♠✉❧t✐♣r♦❝❡ss♦r✿ ✷✵✹✽
▼❛1✐♠✉♠ ♥✉♠❜❡r ♦❢ t❤r❡❛❞s ♣❡r ❜❧♦❝❦✿ ✶✵✷✹
▼❛1 ❞✐♠❡♥s✐♦♥ s✐3❡ ♦❢ ❛ t❤r❡❛❞ ❜❧♦❝❦ ✭1✱2✱3✮✿ ✭✶✵✷✹✱ ✶✵✷✹✱ ✻✹✮
▼❛1 ❞✐♠❡♥s✐♦♥ s✐3❡ ♦❢ ❛ ❣r✐❞ s✐3❡ ✭1✱2✱3✮✿ ✭✷✶✹✼✹✽✸✻✹✼✱ ✻✺✺✸✺✱ ✻✺✺✸✺✮
To use CUDA on the OSC cluster, you must allocate a node with an attached GPU. To interactively allocate such a node,
use:
$ qsub -I -l walltime=0:59:00 -l nodes=1:gpus=1
(qsub -EYE -ell walltime ... -ell nodes ...).
To ensure the best resource availability for everyone, please only log on to a GPU host node when you are ready compile
and run, then please exit when you are not actively testing.
To compile and test your programs you will need to load the CUDA environment:
$ module load cuda
and then use the Nvidia compilers. For example:
$ nvcc -O -o lab4p1 jones_jeffrey_lab4p1.cu
The "-O" flag sets the compiler to the default level (3), which the "-o lab4p1" flag, specifies the name for the the executable
file, "lab4p1," which you can then execute by name:
$ lab4p1
Note that compilation (use the nvidia compiler, nvcc) can be performed on the login nodes, and does not require a node
with a GPU. You will need to load the cuda module on the login node if you wish to do this. You will not be able to test
your programs successfully on the login nodes, as they have no GPUs.
Nvidia CUDA drivers available free on-line
If your laptop/desktop has an Nvidia graphics card, you can download the CUDA drivers directly from Nvidia for your
own local development and testing. Please see ❤tt♣s✿✴✴❞❡✈❡❧♦♣❡r✳♥✈✐❞✐❛✳❝♦♠✴❝✉❞❛✲❞♦✇♥❧♦❛❞s. Nvidia's "CUDA
Zone" also provides a wide array of tools and documentation: ❤tt♣s✿✴✴❞❡✈❡❧♦♣❡r✳♥✈✐❞✐❛✳❝♦♠✴❝✉❞❛✲3♦♥❡.
The Ohio State University
CSE 5441 Autumn 2016
Part 1
Create both serial and CUDA parallel programs based on the following code segment, which multiplies the transpose of
a matrix with itself:
❞♦✉❜❧❡ ❆❬✹✵✾✻❪❬✹✵✾✻❪✱ ❈❬✹✵✾✻❪❬✹✵✾✻❪❀
✴✴ ✐♥s❡rt ❝♦❞❡ t♦ ✐♥✐t✐❛❧✐3❡ ♠❛tr✐1 ❡❧❡♠❡♥ts t♦ r❛♥❞♦♠ ✈❛❧✉❡s ❜❡t✇❡❡♥ ✶✳✵ ❛♥❞ ✷✳✵
❢♦r ✭✐ ❂ ✵❀ ✐ ❁ ✹✵✾✻❀ ✐✰✰✮
❢♦r ✭❥ ❂ ✵❀ ❥ ❁ ✹✵✾✻❀ ❥✰✰✮
❢♦r ✭❦ ❂ ✵❀ ❦ ❁ ✹✵✾✻❀ ❦✰✰✮
❈❬ ✐ ❪❬ ❥ ❪ ✰❂ ❆❬ ❦ ❪❬ ✐ ❪ ✯ ❆❬ ❦ ❪❬ ❥ ❪❀
Use the code as above for your Serial version on OSC using 1 node with 12 processors (full node). Use whatever
techniques you feel appropriate to design a Parallel version.
a) Report your results in estimated GFlops.
b) Measure both serial and parallel performance.
c) Report the CUDA compute structure (Grid, Block and Thread) you used and explain your results.
Part 2
Implement both serial and CUDA program to perform Sobel operator for edge detection, on a given set of images.
Background
The Sobel operator performs a 2-D spatial gradient measurement on images. The Sobel edge detector uses a pair of
3 x 3 stencils, or "convolution masks," one estimating gradient in the x-direction and the other estimating gradient in
y-direction. The Sobel detector is incredibly sensitive to noise in pictures, it effectively highlights them as edges.
Sobel Operator Description
An image gradient is a change in intensity (or color) of an image. An edge in an image occurs when the gradient is
greatest and the Sobel operator makes use of this fact to find the edges in an image. The Sobel operator calculates the
approximate image gradient of each pixel by "convolution" of the image with a pair of 3x3 filters. These filters estimate
the gradients in the horizontal (x) and vertical (y) directions. The magnitude of the gradient is simply the sum of these 2
gradients.
Gx:
-1 0 +1
-2 0 +2
-1 0 +1
Gy:
+1 +2 +1
0 0 0
-1 -2 -1
The magnitude of gradient is calculated using
G =
p
Gx2 + Gy2
The Ohio State University
CSE 5441 Autumn 2016
Determining threshold for Pixel Classification
To perform Sobel edge detection the gradient magnitude is computed for each pixel (excluding the pixels in boundary),
with the pixel being classified as white or black based on comparing the gradient to a threshold. Below are the steps to
follow
• Take the input image (N x M) pixels
• Iterate over each pixel (excluding the first/last row and first/last column pixels)
• Determine the Magnitude G of new pixel, from Gx and Gy where:
– Gx is the sum of the products of the Gx stencil multiplied by the corresponding pixel values that align with the
stencil when the stencil is centered on the current pixel (that is, the center element of the stencil corresponds
to the current pixel).
– Gy is computed similarly by employing the Gy stencil.
– For each pixel, G(pixel) = q
G2
x + G2
y
• Use a threshold for classifying the pixel as black or white, if the magnitude is greater than threshold assign
white(255), else black(0)
• The Resultant image (N x M) containing new magnitude will be a black & white image with explicit edges
• Note: Assume that the boundary pixels are simply copied to new image without any modifications
Our convergence criterion for this experiment is to achieve a image having greater than 75% of percentage pixels
being black. We iterate over different threshold values starting at 0 and incrementing by 1%, until this convergence is
achieved. Below is the logic flow
t❤r❡s❤♦❧❞ ❂ ✵
✇❤✐❧❡✭❜❧❛❝❦❴❝❡❧❧❴❝♦✉♥t ❁ ✵✳✼✺ ✯ ♥✉♠❴♦❢❴♣✐1❡❧s✮
4
✐t❡r❛t❡ ♦✈❡r ❛❧❧ ♣✐1❡❧s
4
● ❂ s♦❜❡❧❴♦✉t♣✉t✭♣✐1❡❧✮
✐❢✭● ❃ t❤r❡s❤♦❧❞✮
s♦❜❡❧❴✐♠❛❣❡❬♣✐1❡❧❪ ❂ ✇❤✐t❡ ✭✷✺✺✮
❡❧s❡
s♦❜❡❧❴✐♠❛❣❡❬♣✐1❡❧❪ ❂ ❜❧❛❝❦ ✭✵✮
❜❧❛❝❦❴❝❡❧❧❴❝♦✉♥t✰✰
6
✐♥❝r❡♠❡♥t❴t❤r❡s❤♦❧❞
6
Input Image: Sobel Output Image:
The Ohio State University
CSE 5441 Autumn 2016
Reading/Writing Images
Your program will work on 24-bit bmp style image format. To help you with reading and writing images, a bmp reader
support library will be provided to you. You will be provided with bmp_reader.o and read_bmp.h file with the following
support API calls:
• void* read_bmp_file(FILE *bmp_file):
Will take in a FILE* pointing to the input image file and will return a buffer pointer (void *). The buffer returned will
contain the pixel values arranged in linear fashion running column first. i.e. if your image is N x M pixels. The buffer
will have M pixels of first row followed by M pixels of 2nd row and so on. Note: You have to open the image file
in 'rb' mode into the FILE* before calling this function. Ensure to free the buffer*, returned by the function before
exiting the program.
• void write_bmp_file(FILE *out_file, uint8_t *bmp_data):
Will take in FILE* pointing to output image file and buffer pointer containing the pixel values arranged in linear
fashion. Note: The output file should be opened in 'wb' mode into FILE* before calling write_bmp_file. Ensure to
free the buffer* before exiting the program
• Both these functions are in class bmp_image, along with 3 other useful information, which you can use to build
your program.
– image_width : Will contain width of image or no of pixel columns, after read API is issued.
– image_height: Will hold the height or no of pixel rows in an image, after read API is issued.
– num_pixel : Will be equal to image width x height.
Serial Code for Sobel operator with Classification convergence
Use the below code as a reference for your serial version of code
✴✴❘❡❛❞ t❤❡ ❜✐♥❛r2 ❜♠♣ ❢✐❧❡ ✐♥t♦ ❜✉❢❢❡r
❜♠♣❴❞❛t❛ ❂ ✭✉✐♥t✽❴t ✯✮✐♠❣✶✳r❡❛❞❴❜♠♣❴❢✐❧❡✭❢✐❧❡❴♥❛♠❡✮❀
✴✴❆❧❧♦❝❛t❡ ♥❡✇ ♦✉t♣✉t ❜✉❢❢❡r ♦❢ s❛♠❡ s✐3❡
♥❡✇❴❜♠♣❴✐♠❣ ❂ ✭✉✐♥t✽❴t ✯✮♠❛❧❧♦❝✭✐♠❣✶✳♥✉♠❴♣✐1❡❧✮❀
✴✴●❡t ✐♠❛❣❡ ❛ttr✐❜✉t❡s
✇❞ ❂ ✐♠❣✶✳✐♠❛❣❡❴✇✐❞t❤✭✮❀ ❤t ❂ ✐♠❣✶✳✐♠❛❣❡❴❤❡✐❣❤t✭✮❀
✴✴❈♦♥✈❡r❣❡♥❝❡ ❧♦♦♣
t❤r❡s❤♦❧❞ ❂ ✵❀
✇❤✐❧❡✭❜❧❛❝❦❴❝❡❧❧❴❝♦✉♥t ❁ ✭✼✺✯✇❞✯❤t✴✶✵✵✮✮
4
❜❧❛❝❦❴❝❡❧❧❴❝♦✉♥t ❂ ✵❀
t❤r❡s❤♦❧❞ ✰❂ ✶❀
❢♦r✭✐❂✶❀ ✐ ❁ ✭❤t✲✶✮❀ ✐✰✰✮
4
❢♦r✭❥❂✶❀ ❥ ❁ ✭✇❞✲✶✮❀ ❥✰✰✮
4
●1 ❂ ❜♠♣❴❞❛t❛❬ ✭✐✲✶✮✯✇❞ ✰ ✭❥✰✶✮ ❪ ✲ ❜♠♣❴❞❛t❛❬ ✭✐✲✶✮✯✇❞ ✰ ✭❥✲✶✮ ❪ ❭
✰ ✷✯❜♠♣❴❞❛t❛❬ ✭✐✮✯✇❞ ✰ ✭❥✰✶✮ ❪ ✲ ✷✯❜♠♣❴❞❛t❛❬ ✭✐✮✯✇❞ ✰ ✭❥✲✶✮ ❪ ❭
The Ohio State University
CSE 5441 Autumn 2016
✰ ❜♠♣❴❞❛t❛❬ ✭✐✰✶✮✯✇❞ ✰ ✭❥✰✶✮ ❪ ✲ ❜♠♣❴❞❛t❛❬ ✭✐✰✶✮✯✇❞ ✰ ✭❥✲✶✮ ❪❀
●2 ❂ ❜♠♣❴❞❛t❛❬ ✭✐✲✶✮✯✇❞ ✰ ✭❥✲✶✮ ❪ ✰ ✷✯❜♠♣❴❞❛t❛❬ ✭✐✲✶✮✯✇❞ ✰ ✭❥✮ ❪ ❭
✰ ❜♠♣❴❞❛t❛❬ ✭✐✲✶✮✯✇❞ ✰ ✭❥✰✶✮ ❪ ✲ ❜♠♣❴❞❛t❛❬ ✭✐✰✶✮✯✇❞ ✰ ✭❥✲✶✮ ❪ ❭
✲ ✷✯❜♠♣❴❞❛t❛❬ ✭✐✰✶✮✯✇❞ ✰ ✭❥✮ ❪ ✲ ❜♠♣❴❞❛t❛❬ ✭✐✰✶✮✯✇❞ ✰ ✭❥✰✶✮ ❪❀
♠❛❣ ❂ sqrt✭●1 ✯ ●1 ✰ ●2 ✯ ●2✮❀
✐❢✭♠❛❣ ❃ t❤r❡s❤♦❧❞✮
4
♥❡✇❴❜♠♣❴✐♠❣❬ ✐✯✇❞ ✰ ❥❪ ❂ ✷✺✺❀
6❡❧s❡4
♥❡✇❴❜♠♣❴✐♠❣❬ ✐✯✇❞ ✰ ❥❪ ❂ ✵❀
❜❧❛❝❦❴❝❡❧❧❴❝♦✉♥t✰✰❀
6
6
6
6
✴✴❲r✐t❡ ❜❛❝❦ t❤❡ ♥❡✇ ❜♠♣ ✐♠❛❣❡ ✐♥t♦ ♦✉t♣✉t ❢✐❧❡
✇r✐t❡❴❜♠♣❴❢✐❧❡✭♦✉t❴❢✐❧❡✱ ♥❡✇❴❜♠♣❴✐♠❣✮❀
Instrumentation
• Use the instructions and example code above to develop both serial and CUDA parallel version.
Note: In CUDA version the entire convergence need not be parallel, try your best to optimize the sobel operator
and do a serial loop for threshold convergence.
• Include read_bmp.h into your program file and link the object file to compile in your make file.
• The necessary sample images and program file will be uploaded to the /class/cse5441 directory.
• Your program should take the following command line parameters:
./a.out
• Your program should output
a) Time taken for serial execution
b) Time taken for cuda execution
c) Threshold obtained in serial vs cuda version
Results can adhere to format below but not restricted:
✳✴✐♥❞r❡s❤❴s✐r❛❴❧❛❜✹♣✷✳♦✉t ✐♠❛❣❡❴✶✳❜♠♣ s❡r✐❛❧❴✐♠❛❣❡✳❜♠♣ ❝✉❞❛❴✐♠❛❣❡✳❜♠♣
✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯
■♠❛❣❡ ■♥❢♦✿✿
❍❡✐❣❤t❂✸✻✺✽ ❲✐❞t❤❂✷✾✻✷
❚✐♠❡ t❛❦❡♥ ❢♦r s❡r✐❛❧ s♦❜❡❧ ♦♣❡r❛t✐♦♥✿ ✶✷✳✵✹✺✼ s❡❝
❚❤r❡s❤♦❧❞ ❞✉r✐♥❣ ❝♦♥✈❡r❣❡♥❝❡✿ ✻✾
❚✐♠❡ t❛❦❡♥ ❢♦r ❈❯❉❆ s♦❜❡❧ ♦♣❡r❛t✐♦♥✿ ✶✳✵✹✺✼ s❡❝
❚❤r❡s❤♦❧❞ ❞✉r✐♥❣ ❝♦♥✈❡r❣❡♥❝❡✿ ✻✾
✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯✯
The Ohio State University
CSE 5441 Autumn 2016
Reporting
Run your program against all the sample images provided using instructions specified above. Be sure to provide the
following
• Provide a timing and threshold convergence summary for all images
• Explain your cuda organization (grid, block, thread) distribution
• Did you see any performance improvement in using GPU?
Support your answer with numbers from your observation.
Submitting Results
Generally, follow the submission guidelines for the previous labs, with the following specifics:
• Create submission directory name "cse5441_lab4." and place your files in it.
• Ensure you provide read,write and executable permission to the folder, else it will not be graded.
• name your program files __lab4p1.cu and __lab4p2.cu.
• provide a single make file that will name your executables lab4p1 and lab4p2S