CS 398: Introduction to Graphics Programming

Assignment: General purpose computing on GPUs

In this assignment, you will explore the capabilities of Graphics Processing Units (GPUs) to perform some common non-graphics tasks. Completing this assignment will help you gain basic understanding of how to harness the power of GPUs for general computation tasks beyond computer graphics.

In the following tasks, you may assume that the number of elements in an array is 2^2K where K is an integer number less than or equal to 12 (e.g., 32² for K = 5). The input data for Task 1 to Task 4 should be in the size of 4² for the marking purpose, but your program should support different sizes of data (e.g., 2², 64², 512², ...) which is needed for Task 5 in any case. You may assume that the values in the input data are non-negative integers below 2²³ stored in 32-bit floating point numbers (i.e., integers are exactly represented).

Base Code

Once you downloaded the base code from Piazza and started a local web server, click here to open c_gpgpu.html. Make sure to start a local web server if the link does not work. The base code already implements all the tasks on CPUs. Your task is to extend it to implement GPU versions of those.

Task 1: Map

Implement an element-wise computation using a parallel algorithm. If your input is a[0], a[1], a[2]... and b[0], b[1], b[2]..., your output should be 1 - exp(-a[0]), 1 - exp(-a[1]), 1 - exp(-a[2])...

Task 2: Reduce

Implement a reduce operator through multiple applications of a fragment shader using a parallel algorithm as explained in the lectures. Your task is to compute the max of an array of data. If your input is a[0], a[1], a[2]... and say a[13] is the max among all, your output should be a[13].

Task 3: Scan

Implement a scan operator through multiple applications of a fragment shader using a parallel algorithm as explained in the lectures. Your task is to compute a prefix-sum for a given array of data. If your input is a[0], a[1], a[2]..., your output should be a[0], a[0]+a[1], a[0]+a[1]+a[2]...

Task 4: Sort

Implement a parallel sorting algorithm (odd-even sort or bitonic sort) as explained in the lecture. Optionally, you can also choose to implement radix sort or odd-even merge sort, but explain your understanding of the algorithm in your readme if you choose to do so.

Task 5: Demo

Combine multiple operators from Task 1 to Task 4 to perform a complex computation on GPUs. Do not fall back to CPU computation between different operators on GPUs (i.e., perform the entire computation on GPUs), and readback the result from a GPU only at the end of your computation. Explain what you are doing and your results in your README and verify that your GPU result matches with the CPU result (which means you also need to implement the CPU version of your computation for the verification purpose). Feel free to extend the code as needed. Note that you may or may not be able to see any performance improvement on GPUs due to several reasons as we discussed in the lectures. While not being able to improve the performance does not affect your mark, incorrectly redundant computation (e.g., doing O(N) passes when you are supposed to do only O(logN) passes) might lead to some deduction.

Submission

The submission process for this assignment is basically the same as the one for Assignment: Warm-up. See Assignment: Warm-up for instructions about how to prepare your submission. If you are doing something extra, be sure to document them in your README. If your changes are so radical that your modified program is incompatible with the original specification for each task, you must include a "compatibility mode" that makes the interface behave like the requirements here, or consider creating an entirely separate code.