Parallel computing on networks of workstations has gained increasing attention in recent years. Because workstation clusters are built from ``off-the-shelf'' components, they are cheaper than supercomputers. Furthermore, high-speed general-purpose networks and increasingly powerful workstation processors are narrowing the performance gap between workstation clusters and supercomputers.
Processors in workstation clusters do not share physical memory, so all interprocessor communication must be performed by sending messages over the network. Currently, the prevailing programming model for parallel computing on networks of workstations is message passing, using libraries such as PVM [9], TCGMSG [11] and Express [18]. A message passing standard, MPI [17], has also been developed. With the message passing paradigm, the distributed nature of the memory system is fully exposed to the application programmer. The programmer must keep track of where the data is, and decide when to communicate with other processors, whom to communicate with, and what to communicate. This bookkeeping makes message passing programs hard to write, especially for applications with complex data structures.
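To make the bookkeeping concrete, the following sketch illustrates the explicit message-passing style in Python's `multiprocessing` module (an illustration of the paradigm only, not PVM's actual C API): the programmer partitions the data, decides what to send and to whom, and must know when a reply is due.

```python
# Paradigm illustration only (not PVM): explicit message passing.
# The programmer decides what to send, to whom, and when to receive.
from multiprocessing import Process, Pipe

def worker(conn):
    chunk = conn.recv()      # explicit receive: wait for our share of the data
    conn.send(sum(chunk))    # explicit send: return the partial result
    conn.close()

if __name__ == "__main__":
    data = list(range(8))
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send(data[4:])    # programmer partitions the data and ships it out
    partial = parent.recv()  # ...and must remember that a reply is coming
    total = sum(data[:4]) + partial
    p.join()
    print(total)             # 28
```

Even in this trivial reduction, the communication structure (who sends what to whom, and in what order) is woven through the application code; for irregular data structures this burden grows quickly.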
Software distributed shared memory (DSM) systems (e.g., [16][14][5][3]) provide a shared memory abstraction on top of the native message passing facilities. An application can be written as if it were executing on a shared memory multiprocessor, accessing shared data with ordinary read and write operations. The chore of message passing is left to the underlying DSM system. While this model is easier to program, DSM systems generate more communication and therefore tend to be less efficient than message passing systems. Under the message passing paradigm, communication is handled entirely by the programmer, who has complete knowledge of the data usage pattern. In contrast, the DSM system has little knowledge of the application program, and therefore must be conservative in determining what to communicate. Since sending messages between workstations is expensive, this extra communication can cause serious performance degradation.
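The shared-memory style that a DSM system emulates can be contrasted with the message-passing sketch above using the same Python `multiprocessing` module (again an illustration of the paradigm, not the TreadMarks API): the application touches shared data with ordinary reads and writes, and no communication appears in the application code.

```python
# Paradigm illustration only (not TreadMarks): the shared-memory style.
# Processes read and write shared data directly; propagating updates is
# the job of the underlying system, not the application programmer.
from multiprocessing import Process, Array

def worker(shared):
    for i in range(4, 8):
        shared[i] *= 2       # ordinary writes -- no explicit messages

if __name__ == "__main__":
    shared = Array('i', range(8))   # array visible to both processes
    p = Process(target=worker, args=(shared,))
    p.start()
    p.join()
    print(list(shared))      # [0, 1, 2, 3, 8, 10, 12, 14]
```

Here the operating system provides genuinely shared memory on one machine; on a workstation cluster, a DSM system must synthesize the same illusion with messages, and without the programmer's knowledge of the access pattern it must communicate conservatively.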
Much work has been done in the past decade to improve the performance of DSM systems. In this paper, we compare a state-of-the-art DSM system, TreadMarks, with the most commonly used message passing system, PVM. Our goals are to assess the differences in programmability and performance between DSM and message passing systems and to precisely determine the remaining causes of the lower performance of DSM systems.
We ported nine parallel programs to both TreadMarks and PVM: Water and Barnes-Hut from the SPLASH benchmark suite [20]; 3-D FFT, Integer Sort (IS), and Embarrassingly Parallel (EP) from the NAS benchmarks [2]; ILINK, a widely used genetic linkage analysis program [8]; and Successive Over-Relaxation (SOR), Traveling Salesman Problem (TSP), and Quicksort (QSORT). Two different input sets were used for Water (Water-288 and Water-1728), IS (IS-Small and IS-Large), and SOR (SOR-Zero and SOR-NonZero). We ran these programs on eight HP735 workstations connected by a 100 Mbits per second FDDI network.
In terms of programmability, since most of our test programs are simple, it was not difficult to port them to PVM. However, for two of the programs, namely 3-D FFT and ILINK, the message passing versions were significantly harder to develop than the DSM versions.
For Water-1728, EP, ILINK, SOR-Zero, and SOR-NonZero, the performance of TreadMarks is within 10% of PVM. For IS-Small, Water-288, Barnes-Hut, 3-D FFT, TSP, and QSORT, the differences are on the order of 10% to 30%. Finally, for IS-Large, PVM performs two times better than TreadMarks.
The performance differences are explained by the fact that TreadMarks sends more messages and more data than PVM. This extra communication is caused by 1) the separation of synchronization and data transfer, 2) extra messages to request updates for data in the invalidate protocol used in TreadMarks, 3) false sharing, and 4) diff accumulation for migratory data in TreadMarks. We are currently trying to address these deficiencies through the integration of compiler support in TreadMarks.
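The diffs mentioned in point 4 can be illustrated with a minimal sketch (hypothetical code, not TreadMarks' implementation): a modified page is compared word by word against a ``twin'' copy saved at the first write, and only the changed words are shipped. When data migrates from processor to processor, each hop contributes another such diff, which is why diffs accumulate.

```python
# Minimal sketch (not TreadMarks' actual code) of the twin/diff idea:
# only the words of a page that changed since the twin was taken travel
# over the network.

def make_diff(twin, page):
    """Record (index, new_value) for every word changed since the twin."""
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]

def apply_diff(page, diff):
    """Bring a remote copy of the page up to date from a diff."""
    for i, value in diff:
        page[i] = value

twin = [0] * 8               # snapshot taken at the first write to the page
page = twin[:]               # the locally modified copy
page[2], page[5] = 7, 9      # two word-sized writes

diff = make_diff(twin, page)
print(diff)                  # [(2, 7), (5, 9)] -- only 2 of 8 words travel

remote = [0] * 8
apply_diff(remote, diff)
print(remote == page)        # True
```

Shipping diffs rather than whole pages saves bandwidth, but a processor that needs an up-to-date copy of migratory data may have to obtain and apply a chain of accumulated diffs, generating the extra traffic noted above.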
The rest of this paper is organized as follows. In Section 2 we introduce the user interfaces and implementations of PVM and TreadMarks. Section 3 presents the application programs and their results. Section 4 concludes the paper.