
3.8 Water

Water, from the SPLASH [20] benchmark suite, is a molecular dynamics simulation. Its main data structure is a one-dimensional array of records, in which each record represents a molecule. Each record contains the molecule's center of mass and, for each of its atoms, the computed forces, the displacements, and their first six derivatives. During each time step, both intra- and inter-molecular potentials are computed. To avoid computing all pairwise interactions among molecules, a spherical cutoff range is applied.

The parallel algorithm statically divides the array of molecules into equal contiguous chunks and assigns each chunk to a processor. The bulk of the interprocessor communication happens during the force computation phase. Each processor computes and updates the intermolecular force between each of its molecules and each of the n/2 molecules that follow it in the array, in wrap-around fashion, where n is the number of molecules.
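The partitioning and wrap-around pairing described above can be sketched as follows. This is a minimal illustration, not the SPLASH code: the function names are hypothetical, the number of molecules is assumed to be a multiple of the number of processors, and the spherical cutoff is ignored.

```python
def chunk_owner(mol, n_mol, n_proc):
    """Processor that owns molecule `mol` under equal contiguous chunks."""
    return mol // (n_mol // n_proc)


def partners(mol, n_mol):
    """The n_mol/2 molecules that follow `mol`, wrapping around the array."""
    return [(mol + k) % n_mol for k in range(1, n_mol // 2 + 1)]


def owners_touched(proc, n_mol, n_proc):
    """Processors whose molecules `proc` updates during force computation."""
    chunk = n_mol // n_proc
    touched = {proc}  # a processor always updates its own molecules
    for mol in range(proc * chunk, (proc + 1) * chunk):
        for other in partners(mol, n_mol):
            touched.add(chunk_owner(other, n_mol, n_proc))
    return sorted(touched)


# 288 molecules on 8 processors: processor 0 owns molecules 0..35
print(owners_touched(0, 288, 8))
```

Under these assumptions, each processor writes forces for its own chunk and for the chunks of the next half of the processors, which is why most of the communication falls in this phase.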

In the TreadMarks version, the Water program from the original SPLASH suite was tuned for better performance. Only the center of mass, the displacements, and the forces on the molecules are allocated in shared memory; the other variables in the molecule record are allocated in private memory. A lock is associated with each processor. In addition, each processor maintains a private copy of the forces. During the force computation phase, changes to the forces are accumulated locally in order to reduce communication. The shared forces are updated after all processors have finished this phase. If a processor i has updated its private copy of the forces of molecules belonging to processor j, it acquires lock j and adds all its contributions to the forces of molecules owned by processor j. In the PVM version, processors exchange displacements before the force computation. No communication occurs until all the pairwise intermolecular forces have been computed, at which time processors communicate their locally accumulated modifications to the forces.
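The TreadMarks update scheme described above (private accumulation, then a lock-protected update per owner) can be sketched with ordinary threads. This is only an analogy under stated assumptions: Python threads and `threading.Lock` stand in for TreadMarks processes and locks, a plain list stands in for shared memory, and the function name and the `contributions` format are hypothetical.

```python
import threading


def accumulate_forces(n_proc, chunk, contributions):
    """contributions[i] is a list of (molecule, increment) pairs computed by processor i."""
    n_mol = n_proc * chunk
    shared_forces = [0.0] * n_mol                      # stands in for shared memory
    locks = [threading.Lock() for _ in range(n_proc)]  # one lock per processor
    phase_done = threading.Barrier(n_proc)

    def worker(i):
        # Phase 1: accumulate into a private copy; no communication yet.
        private = [0.0] * n_mol
        touched = set()
        for mol, df in contributions[i]:
            private[mol] += df
            touched.add(mol // chunk)                  # owner of this molecule
        # Shared forces are updated only after every processor finishes phase 1.
        phase_done.wait()
        # Phase 2: for each owner j that was touched, take lock j and add.
        for j in sorted(touched):
            with locks[j]:
                for m in range(j * chunk, (j + 1) * chunk):
                    shared_forces[m] += private[m]

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_proc)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared_forces


# Two processors, two molecules each; both contribute to molecules 0 and 2.
print(accumulate_forces(2, 2, [[(0, 1.0), (2, 0.5)], [(2, 1.0), (0, 0.25)]]))
```

Accumulating privately means each processor takes a given lock at most once per time step, rather than once per pairwise interaction.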

We used two data set sizes, 288 molecules and 1728 molecules, and ran for 5 time steps. The results are shown in Figures 8 and 9. The sequential execution time for the 288-molecule simulation is 12 seconds. The 8-processor speedups for TreadMarks and PVM are 5.04 and 7.23, respectively. With 1728 molecules, the sequential program runs for 435 seconds. TreadMarks and PVM achieve speedups of 7.25 and 7.44 at 8 processors, respectively.
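As a quick check on these figures, the relative gap between the two systems and the implied 8-processor run times can be derived from the reported speedups and sequential times. The sketch below uses only the numbers quoted above; the helper names are hypothetical, and the run times are derived values, not measurements.

```python
# Reported values: sequential time in seconds and 8-processor speedups.
results = {
    288: {"seq_seconds": 12.0, "TreadMarks": 5.04, "PVM": 7.23},
    1728: {"seq_seconds": 435.0, "TreadMarks": 7.25, "PVM": 7.44},
}


def relative_gap(tmk_speedup, pvm_speedup):
    """Fraction by which the TreadMarks speedup falls short of the PVM speedup."""
    return 1.0 - tmk_speedup / pvm_speedup


def parallel_seconds(seq_seconds, speedup):
    """Run time implied by a speedup, assuming speedup = sequential / parallel."""
    return seq_seconds / speedup


for molecules, r in results.items():
    gap = relative_gap(r["TreadMarks"], r["PVM"])
    tmk = parallel_seconds(r["seq_seconds"], r["TreadMarks"])
    pvm = parallel_seconds(r["seq_seconds"], r["PVM"])
    print(f"{molecules} molecules: gap {gap:.1%}, TreadMarks {tmk:.2f} s, PVM {pvm:.2f} s")
```

The gap comes to roughly 30% for 288 molecules and under 3% for 1728 molecules.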

Three factors account for most of the 30% performance gap at 288 molecules: a low computation/communication ratio, the separation of synchronization from data communication, and false sharing. In PVM, two user-level messages are sent for each pair of processors that interact with each other: one to read the displacements and one to write the forces. In TreadMarks, extra messages are sent for synchronization and for diff requests to read the displacements or to write the shared forces. After the barrier that terminates the phase in which the shared forces are updated, a processor may fault again when reading the final force values of its own molecules, if it was not the last processor to update those values or if there is false sharing. False sharing causes a processor to bring in updates for molecules that it does not access, and may result in communication with more than one processor if molecules on the same page are updated by two different processors. At 8 processors, false sharing occurs on 7 of the 11.8 pages of the molecule array. Consequently, TreadMarks sends 5028 messages compared to 620 messages in PVM.

False sharing also causes the TreadMarks version to send unnecessary data. Another cause of the additional data sent in TreadMarks is diff accumulation. Assuming there are n processors, where n is even, the molecules belonging to a processor are modified by n/2+1 processors, each under the protection of a lock. On average, each processor gets n/2 diffs. Because not every processor modifies every molecule in this case, the diffs do not overlap completely. With false sharing and diff accumulation combined, at 8 processors, TreadMarks sends 2.3 times more data than PVM.

An increased computation/communication ratio and reduced false sharing cause TreadMarks to perform significantly better at 1728 molecules. TreadMarks sends only 1.3 times more data than PVM, compared to 2.3 times more with 288 molecules.
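The false-sharing count quoted above for 288 molecules can be reproduced with a simple layout model. The sketch below assumes that the shared molecule data is contiguous, starts on a page boundary, and is spread evenly over the 11.8 pages, and it treats any page that holds data from more than one processor's chunk as falsely shared. These are modeling assumptions, not details taken from the TreadMarks implementation, and the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil, floor


def falsely_shared_pages(n_proc, total_pages):
    """Count pages that hold molecules belonging to more than one processor."""
    chunk_pages = Fraction(total_pages) / n_proc  # pages spanned by one chunk
    users = {}
    for p in range(n_proc):
        first = floor(p * chunk_pages)            # first page touched by chunk p
        last = ceil((p + 1) * chunk_pages) - 1    # last page touched by chunk p
        for page in range(first, last + 1):
            users.setdefault(page, set()).add(p)
    return sum(1 for procs in users.values() if len(procs) > 1)


# 8 processors, 11.8 pages of molecule data (exact fraction avoids rounding error)
print(falsely_shared_pages(8, Fraction(118, 10)))  # prints 7
```

In this model every one of the 7 internal chunk boundaries falls inside a different page, which agrees with the 7 falsely shared pages reported for 8 processors. When chunks are page-aligned, for example 8 chunks over 16 pages, the same function returns 0.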


