3.10 3-D FFT

Next: 3.11 ILINK Up: 3 Results Previous: 3.9 Barnes-Hut

3.10 3-D FFT

3-D FFT, from the NAS [2] benchmark suite, numerically solves a partial differential equation using three dimensional forward and inverse FFT's. Assume the input array A is , organized in row-major order. The 3-D FFT first performs a -point 1-D FFT on each of the complex vectors. Then it performs a -point 1-D FFT on each of the vectors. Next, the resulting array is transposed into an complex array B and an -point 1-D FFT is applied to each of the complex vectors.

We distribute the computation on the array elements along the first dimension of A, so that for any i, all elements of the complex matrix , are assigned to a single processor. No communication is needed in the first two phases, because each of the -point FFTs or the -point FFTs is computed by a single processor. The processors communicate with each other at the transpose, because each processor accesses a different set of elements afterwards.

In the TreadMarks version, a barrier is called before the transpose. In the PVM version, messages are sent explicitly. To send these messages, we must figure out where each part of the A array goes to, and where each part of the B array needs to come from. These index calculations on a 3-dimensional array are much more error-prone than simply swapping the indices, as in TreadMarks, making the PVM version harder to write.

The results are obtained by running on a 64 x 64 x 64 array of double precision complex numbers for 6 iterations, excluding the time for distributing the initial values at the beginning of program. This matrix size is 1/32 of that specified in the class A problem in the NAS benchmarks. We scaled down the problem because of the limited swap space on the machines available to us. The speedup curves are shown in Figure 11. The sequential execution time is 41 seconds. A speedup of 4.41 is obtained by TreadMarks at 8 processors, which is 81%of the speedup of 5.47 obtained by PVM. Because of release consistency, TreadMarks sends almost the same amount of data as PVM, except for the 6-processor execution. However, because of the page-based invalidate protocol, many more messages are sent in TreadMarks than in PVM. At 8 processors, about 4 megabytes of data are communicated in each transpose. With a page size of 4096 bytes, each transpose therefore requires about 1000 diff requests and responses.

An anomaly occurs at 6 processors, which we attribute to false sharing. Because the matrix size is not a multiple of 6, a page modified by one processor is read by two other processors. Although the two processors read disjoint parts of the page, the same diff is sent to both of them. As a result, in TreadMarks, 14%more messages and 17%more data are sent at 6 processors than at 8 processors.

Next: 3.11 ILINK Up: 3 Results Previous: 3.9 Barnes-Hut

3.10 3-D FFT

Rice Systems Group