CS 398: Introduction to Graphics Programming
Assignment: General purpose computing on GPUs
In the following tasks, you may assume that the number of elements in an array is 22K where K is an integer number less than or equal to 12 (e.g., 322 for K = 5). The input data for Task 1 to Task 4 should be in the size of 42 for the marking purpose, but your program should support different sizes of data (e.g., 22, 642, 5122, ...) which is needed for Task 5 in any case. You may assume that the values in the input data are non-negative integers below 223 stored in 32-bit floating point numbers (i.e., integers are exactly represented).
Base Code
Once you downloaded the base code from Piazza and started a local web server, click here to open c_gpgpu.html.
Make sure to start a local web server if the link does not work.
The base code already implements all the tasks on CPUs.
Your task is to extend it to implement GPU versions of those.
Task 1: Map
Implement an element-wise computation using a parallel algorithm. If your input is a[0], a[1], a[2]... and b[0], b[1], b[2]..., your output should be 1 - exp(-a[0]), 1 - exp(-a[1]), 1 - exp(-a[2])...
Task 2: Reduce
Implement a reduce operator through multiple applications of a fragment shader using a parallel algorithm as explained in the lectures. Your task is to compute the max of an array of data. If your input is a[0], a[1], a[2]... and say a[13] is the max among all, your output should be a[13].
Task 3: Scan
Implement a scan operator through multiple applications of a fragment shader using a parallel algorithm as explained in the lectures. Your task is to compute a prefix-sum for a given array of data. If your input is a[0], a[1], a[2]..., your output should be a[0], a[0]+a[1], a[0]+a[1]+a[2]...
Task 4: Sort
Implement a parallel sorting algorithm (odd-even sort or bitonic sort) as explained in the lecture. Optionally, you can also choose to implement radix sort or odd-even merge sort, but explain your understanding of the algorithm in your readme if you choose to do so.
Task 5: Demo
Combine multiple operators from Task 1 to Task 4 to perform a complex computation on GPUs.
Do not fall back to CPU computation between different operators on GPUs (i.e., perform the entire computation on GPUs), and readback the result from a GPU only at the end of your computation.
Explain what you are doing and your results in your README and verify that your GPU result matches with the CPU result (which means you also need to implement the CPU version of your computation for the verification purpose).
Feel free to extend the code as needed.
Note that you may or may not be able to see any performance improvement on GPUs due to several reasons as we discussed in the lectures.
While not being able to improve the performance does not affect your mark, incorrectly redundant computation (e.g., doing O(N) passes when you are supposed to do only O(logN) passes) might lead to some deduction.
Submission
The submission process for this assignment is basically the same as the one for Assignment: Warm-up.
See Assignment: Warm-up for instructions about how to prepare your submission.
If you are doing something extra, be sure to document them in your README.
If your changes are so radical that your modified program is incompatible with the original specification for each task, you must include a "compatibility mode" that makes the interface behave like the requirements here, or consider creating an entirely separate code.