Next: 3.6 TSP Up: 3 Results Previous: 3.4 Red-Black SOR

3.5 Integer Sort

Integer Sort (IS) [2] from the NAS benchmarks requires ranking an unsorted sequence of keys using bucket sort. The parallel version of IS divides up the keys among the processors. First, each processor counts its keys and writes the result in a private array of buckets. Then, the values in the private buckets are summed up. Finally, all processors read the sum and rank their keys.
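The counting-and-ranking step that both parallel versions build on can be sketched as follows. This is a minimal sequential sketch, not the NAS code; the key range and the function name rank_keys are illustrative.

```c
#define NBUCKETS 4   /* key range 0..NBUCKETS-1; illustrative only */

/* Rank keys by counting (bucket) sort: each key's rank is the number
 * of keys smaller than it, with ties broken by position. */
void rank_keys(const int *key, int *rank, int n)
{
    int count[NBUCKETS] = {0};
    int start[NBUCKETS];

    for (int i = 0; i < n; i++)
        count[key[i]]++;               /* count keys per bucket */

    start[0] = 0;                      /* prefix sum: first rank per bucket */
    for (int b = 1; b < NBUCKETS; b++)
        start[b] = start[b - 1] + count[b - 1];

    for (int i = 0; i < n; i++)
        rank[i] = start[key[i]]++;     /* assign rank and advance */
}
```

In the parallel versions, only the counting loop runs on private data; the prefix sum over the summed buckets is what requires the shared array (TreadMarks) or the message chain (PVM).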

In the TreadMarks version, there is a shared array of buckets, and each processor also has a private array of buckets. After counting its keys, a processor locks the shared array of buckets, adds the values in its private array to the shared array, releases the lock, and waits at a barrier until all other processors have finished their updates. Each processor then reads the final result in the shared array of buckets and ranks its keys. In the PVM version, each processor has a bucket array in private memory. After counting their own keys, the processors form a chain, in which processor 0 sends its local array of buckets to processor 1. Processor 1 adds the values in its local array of buckets to the values in the array of buckets it receives and forwards the result to the next processor, etc. The last processor in the chain calculates the final result and broadcasts it.
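The PVM chain reduction can be sketched in plain C, with the running sum standing in for the message each processor forwards. This is a sketch of the communication pattern only; the real program would use pvm_send/pvm_recv for each hop, and the array sizes here are illustrative.

```c
#include <string.h>

#define NPROC    4
#define NBUCKETS 8

/* Chain reduction: processor 0 starts with its private buckets, each
 * subsequent processor adds its own counts to the message it receives
 * and forwards it; the last processor holds the global sums, which it
 * would then broadcast. */
void chain_reduce(int priv[NPROC][NBUCKETS], int result[NBUCKETS])
{
    int msg[NBUCKETS];

    memcpy(msg, priv[0], sizeof msg);        /* processor 0 sends its buckets */
    for (int p = 1; p < NPROC; p++)          /* each processor in the chain   */
        for (int b = 0; b < NBUCKETS; b++)
            msg[b] += priv[p][b];            /* adds its counts and forwards  */
    memcpy(result, msg, sizeof msg);         /* last processor broadcasts     */
}
```

The chain costs n-1 point-to-point messages plus a broadcast, which is why the PVM message count stays small compared to the lock-and-diff traffic described below.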

We tested IS with two sets of input data. In the first test (IS-Small), we sorted keys ranging from 0 to for 10 iterations. In the second test (IS-Large), the keys range from 0 to . We did not try the keys and the key range as suggested for the NAS benchmarks, because the extremely low computation/communication ratio is not suitable for workstation clusters.

The speedups are shown in Figures 4 and 5. The sequential execution time for IS-Small is 38 seconds. The 8 processor speedups for PVM and TreadMarks are 7.60 and 6.56, respectively. For IS-Large, the sequential program runs for 42 seconds, and PVM and TreadMarks achieve speedups of 4.67 and 2.30, respectively.

In IS-Small, the TreadMarks version sends 4 times more data and 5 times more messages than the PVM version. The extra messages are due to separate synchronization and diff requests. Of the 782 messages sent in TreadMarks (compared to 140 in PVM), 500 are synchronization messages, and about 150 are diff requests. In IS-Large, TreadMarks sends about 70 times more messages than PVM. The shared bucket array in IS-Large contains integers, spread over 32 pages. Therefore, each time the shared bucket array is accessed, TreadMarks sends 32 diff requests and responses, while PVM handles the transmission of the shared array with a single message exchange.

The extra data in TreadMarks comes from a phenomenon we call diff accumulation. Each time a processor acquires the lock to modify the shared array of buckets, the previous values in the array are completely overwritten. In the current TreadMarks implementation, however, all the preceding diffs are sent when the lock is acquired, even though (for IS) they completely overlap one another. The same phenomenon occurs after the barrier, when every processor reads the final values in the shared bucket array. At this point, each processor gets all the diffs made by the processors that modified the shared bucket array after it during this iteration. Assuming the bucket array size is b and the number of processors is n, the amount of data sent in each iteration is 2 x (n-1) x b in PVM, versus n x (n-1) x b in TreadMarks. Although all the diffs can be obtained from a single processor, diff accumulation also results in extra messages when the sum of the diff sizes exceeds the maximum size of a UDP message. Since the TreadMarks MTU is 48 kilobytes, extra messages due to diff accumulation are not a serious problem here.
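As a rough check of these formulas, the two data-volume expressions can be written out directly; with n = 8 the ratio is exactly 4, matching the factor-of-4 data difference measured for IS-Small. The function names below are illustrative.

```c
/* Per-iteration data volume for the bucket-sum phase, as derived in
 * the text: n = number of processors, b = bucket-array size in bytes. */
long pvm_bytes(long n, long b)        { return 2 * (n - 1) * b; }  /* chain + broadcast */
long treadmarks_bytes(long n, long b) { return n * (n - 1) * b; }  /* diff accumulation */
```

For n = 8 this gives 14b bytes under PVM and 56b bytes under TreadMarks, i.e. the TreadMarks traffic grows quadratically in the processor count while the PVM traffic grows linearly.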


