
3.9 Barnes-Hut

Barnes-Hut, from the SPLASH [20] benchmark suite, is an N-body simulation using the hierarchical Barnes-Hut method. It uses a tree-structured hierarchical representation of physical space. Each leaf of the tree represents a body, and each internal node represents a "cell", a collection of bodies in close physical proximity. The major data structures are two arrays, one representing the bodies and the other representing the cells. The sequential algorithm loops over the bodies and, for each body, traverses the tree to compute the forces acting on it.
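To make the traversal concrete, the following is a minimal C sketch of the sequential force phase. The Node layout, the THETA opening parameter, and the softening constant are our assumptions for illustration, not the SPLASH source.

    #include <math.h>
    #include <stddef.h>

    #define THETA 0.5                 /* opening-angle parameter (assumed) */

    typedef struct node {
        double mass, pos[3], size;    /* mass, center of mass, cell width */
        int    is_leaf;               /* leaf = body, internal node = cell */
        struct node *child[8];        /* octree children (cells only) */
    } Node;

    /* Barnes-Hut opening criterion: a cell may be treated as a single
     * point mass if size/distance < THETA. */
    static int well_separated(const Node *b, const Node *n) {
        double d = 0.0;
        for (int i = 0; i < 3; i++) {
            double dx = n->pos[i] - b->pos[i];
            d += dx * dx;
        }
        return n->size * n->size < THETA * THETA * d;
    }

    /* Accumulate the acceleration that subtree n exerts on body b. */
    static void compute_force(const Node *b, const Node *n, double acc[3]) {
        if (n == NULL || n == b) return;
        if (n->is_leaf || well_separated(b, n)) {
            double d2 = 1e-9;         /* softening to avoid divide by zero */
            double dx[3];
            for (int i = 0; i < 3; i++) {
                dx[i] = n->pos[i] - b->pos[i];
                d2 += dx[i] * dx[i];
            }
            double f = n->mass / (d2 * sqrt(d2));   /* G = 1 for brevity */
            for (int i = 0; i < 3; i++) acc[i] += f * dx[i];
        } else {
            for (int i = 0; i < 8; i++)             /* open the cell */
                compute_force(b, n->child[i], acc);
        }
    }

    /* Sequential algorithm: loop over the bodies; for each, traverse the tree. */
    void all_forces(Node *bodies[], int n, Node *root, double acc[][3]) {
        for (int i = 0; i < n; i++) {
            acc[i][0] = acc[i][1] = acc[i][2] = 0.0;
            compute_force(bodies[i], root, acc[i]);
        }
    }

The opening criterion is what makes the method efficient: sufficiently distant cells are treated as single point masses rather than being expanded body by body.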

In the parallel code, there are four major phases in each time step.

  1. MakeTree: Construct the Barnes-Hut tree.
  2. Get_my_bodies: Partition the bodies among the processors.
  3. Force Computation: Compute the forces on the processor's own bodies.
  4. Update: Update the positions and velocities of the processor's bodies.
Phase 1 is executed sequentially, because parallelizing it slows down the execution. In phase 2, dynamic load balancing is achieved with the cost-zone method, in which each processor walks down the Barnes-Hut tree and collects a set of logically consecutive leaves, as sketched below. Most of the computation time is spent in phase 3.
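The cost-zone walk can be sketched as follows, assuming each tree node carries a work estimate for its whole subtree (e.g., from the previous time step); the field names and zone bounds here are illustrative assumptions, not the SPLASH code.

    #include <stddef.h>

    typedef struct cz_node {
        double cost;                  /* work estimate for this subtree */
        int    is_leaf;
        struct cz_node *child[8];
    } CzNode;

    /* Collect the logically consecutive leaves whose cumulative cost
     * lies in [lo, hi) -- this processor's zone of the total cost.
     * Returns the cumulative cost seen after processing subtree n.   */
    static double collect_zone(CzNode *n, double seen, double lo, double hi,
                               CzNode **mine, int *count) {
        if (n == NULL) return seen;
        if (seen + n->cost <= lo || seen >= hi)
            return seen + n->cost;    /* whole subtree outside my zone */
        if (n->is_leaf) {
            mine[(*count)++] = n;     /* leaf falls in my zone: take it */
            return seen + n->cost;
        }
        for (int i = 0; i < 8; i++)   /* in-order walk keeps leaves consecutive */
            seen = collect_zone(n->child[i], seen, lo, hi, mine, count);
        return seen;
    }

Processor p would call collect_zone(root, 0.0, p*total/P, (p+1)*total/P, my_leaves, &count), so that each of the P processors receives a contiguous zone of roughly total/P work.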

In the TreadMarks version, the array of bodies is shared, and the cells are private. In MakeTree, each processor reads all the shared body data and builds the internal nodes of the tree in its private memory. There are barriers after the MakeTree, force computation, and update phases; no synchronization is necessary during the force computation phase itself. The barrier at the end of the force computation phase ensures that all processors have finished reading the positions of all other processors' bodies. In the PVM version, every processor broadcasts its bodies at the end of each iteration, so that each processor holds all the bodies and can create a complete tree in phase 1. No other communication is required.
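The contrast can be sketched with the published interfaces of the two systems. The phase helpers, the header names, the equal-sized body slices, and the message tag below are illustrative assumptions; only Tmk_barrier and the pvm_* calls are the systems' documented primitives.

    #include <pvm3.h>
    /* #include "Tmk.h"  -- TreadMarks header (name assumed) */

    void Tmk_barrier(unsigned id);
    void make_tree(void);            /* hypothetical phase helpers */
    void get_my_bodies(void);
    void compute_forces(void);
    void update_my_bodies(void);

    /* TreadMarks: bodies live in shared memory; barriers separate phases. */
    void tmk_timestep(void) {
        make_tree();                 /* phase 1: read shared bodies */
        Tmk_barrier(0);
        get_my_bodies();             /* phase 2: cost-zone partition */
        compute_forces();            /* phase 3: no synchronization needed */
        Tmk_barrier(1);              /* all done reading others' positions */
        update_my_bodies();          /* phase 4 */
        Tmk_barrier(2);
    }

    /* PVM: at the end of each step, every processor broadcasts its slice
     * of the body array and gathers everyone else's, so the next MakeTree
     * can build a complete private tree. */
    void pvm_exchange(double *bodies, int per_proc, const int *other_tids,
                      int nprocs, int myrank, int tag) {
        pvm_initsend(PvmDataDefault);
        pvm_pkint(&myrank, 1, 1);
        pvm_pkdouble(bodies + myrank * per_proc, per_proc, 1);
        pvm_mcast((int *)other_tids, nprocs - 1, tag); /* to the other P-1 */

        for (int i = 0; i < nprocs - 1; i++) {   /* gather their slices */
            int from;
            pvm_recv(-1, tag);
            pvm_upkint(&from, 1, 1);
            pvm_upkdouble(bodies + from * per_proc, per_proc, 1);
        }
    }

The simultaneous all-to-all broadcast in pvm_exchange is what saturates the network at larger processor counts, as discussed below.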

We ran Barnes-Hut with 4096 bodies for 6 time steps, timing only the last 5 to exclude cold-start effects. Figure 10 shows the speedups. The sequential program runs for 17 seconds. At 8 processors, PVM and TreadMarks achieve speedups of 3.04 and 2.70, respectively. The low computation-to-communication ratio and the need for fine-grained communication [19] contribute to the poor speedups on both TreadMarks and PVM. In PVM, the network is saturated at 8 processors, because every processor tries to broadcast at the same time. Diff requests and false sharing are the major reasons for TreadMarks' lower performance: at 8 processors, TreadMarks sends 44% more data and about 222 times more messages than PVM. Although the bodies owned by a processor are adjacent in the Barnes-Hut tree, they are not adjacent in memory. Because of false sharing, each page fault in MakeTree causes the processor to send diff requests to several processors; for the same reason, during the force computation, a processor may fault on accesses to its own bodies and bring in unwanted data.




