Red-Black Successive Over-Relaxation (SOR) is a method of solving partial differential equations. In the parallel version, the program divides the red and the black array into roughly equal size bands of rows, assigning each band to a different processor. Communication occurs across the boundary rows between bands. In the TreadMarks version, the arrays are allocated in shared memory, and processors synchronize using barriers. With PVM, each processor explicitly sends the boundary rows to its neighbors.
We ran red-black SOR on a 1024 x 3072 matrix of floating point numbers for 51 iterations. With this problem size each shared red or black row occupies one and a half pages. The first iteration is excluded from measurement to eliminate differences due to the fact that data is initialized in a distributed manner in the PVM version, while in TreadMarks it is done at the master process.
In the first test (SOR-Zero), the edge elements are initialized to 1, and all the other elements to 0. In the second test (SOR-Nonzero), all elements of the matrix are initialized to nonzero values, such that they all change values in each iteration.
Results from SOR-Zero are shown in Figure 2. The sequential program runs for 79 seconds. At 8 processors, the TreadMarks version and the PVM version achieve speedups of 4.58 and 4.71, respectively. The TreadMarks speedup is 97%that of PVM. Due to load imbalance, neither PVM nor TreadMarks achieves good speedup. Load imbalance occurs because floating-point computations involving zeros take longer than those involving non-zeros, causing the processors working on the middle parts of the array to take longer between iterations. Results from SOR-Nonzero are shown in Figure 3. Because the initial values are nonzero, the single processor time drops from 79 seconds to 68 seconds. At 8 processors, the speedup obtained by TreadMarks is 6.80, which is 91%of the PVM speedup of 7.44. Compared to the first test, the improved speedup is due to better load balance.
TreadMarks and PVM performance are relatively close, because of the low communication rate in SOR, and the use of lazy release consistency in TreadMarks. Although each processor repeatedly writes to the boundary pages between two barriers, diffs of the boundary pages are sent only once after each barrier, in response to diff requests from neighbors. The number of messages is 5 times higher in TreadMarks than in PVM. For n processors, PVM sends 2 x (n-1) messages at the end of each iteration. TreadMarks sends 2 x (n-1) messages to implement the barrier and 8 x (n-1) messages to page in the diffs for the boundary rows (Each boundary row requires two diffs, one for each page). This behavior exemplifies two of the performance drawbacks of TreadMarks relative to PVM: separation of synchronization and data transfer and multiple diff requests due to the invalidate protocol. As a result of diffing in TreadMarks, much less data is sent in SOR-Zero by TreadMarks than by PVM because most of the pages remain zero.