TreadMarks [14] is a software DSM system built at Rice University. It is an efficient user-level DSM system that runs on commonly available Unix systems. We use TreadMarks version 0.9.4 in our experiments.
TreadMarks provides primitives similar to those used in hardware
shared memory machines.
Application processes synchronize via two
primitives: barriers and mutex locks.
The routine Tmk_barrier(i)
stalls the calling process
until all processes in the system have arrived at the same barrier.
Barrier indices i
are integers in a certain range.
Locks are used to control access to critical sections.
The routine Tmk_lock_acquire(i)
acquires a lock for the
calling processor, and the routine Tmk_lock_release(i)
releases it.
No processor can acquire a lock if another processor is
holding it.
The integer i
is a lock index assigned by the programmer.
Shared memory must be allocated dynamically by calling Tmk_malloc
or Tmk_sbrk
.
They have the same syntax as conventional memory allocation calls.
With TreadMarks, it is imperative to use explicit synchronization,
as data is moved from processor to processor only in response to
synchronization calls (see Section 2.2.2).
TreadMarks uses a lazy invalidate [14] version of release consistency (RC) [10] and a multiple-writer protocol [5] to reduce the amount of communication involved in implementing the shared memory abstraction. The virtual memory hardware is used to detect accesses to shared memory.
RC is a relaxed memory consistency model.
In RC, ordinary shared memory accesses are distinguished from
synchronization accesses, with the latter category divided
into acquire and release accesses.
RC requires ordinary shared memory updates by a processor p
to become visible to another processor q only when a subsequent
release by p becomes visible to q via some chain of
synchronization events.
In practice, this model allows a processor to buffer multiple writes to
shared data in its local memory until a synchronization point is reached.
In TreadMarks, Tmk_lock_acquire(i)
is modeled as an acquire,
and Tmk_lock_release(i)
is modeled as a release.
Tmk_barrier(i)
is modeled as a release followed by an acquire,
where each processor performs a release at barrier arrival, and an
acquire at barrier departure.
With the multiple-writer protocol, two or more processors can simultaneously modify their own copy of a shared page. Their modifications are merged at the next synchronization operation in accordance with the definition of RC, thereby reducing the effect of false sharing. The merge is accomplished through the use of diffs. A diff is a runlength encoding of the modifications made to a page, generated by comparing the page to a copy saved prior to the modifications.
TreadMarks implements a lazy invalidate version of RC [13]. A lazy implementation delays the propagation of consistency information until the time of an acquire. Furthermore, the releaser notifies the acquirer of which pages have been modified, causing the acquirer to invalidate its local copies of these pages. A processor incurs a page fault on the first access to an invalidated page, and gets diffs for that page from previous releasers.
To implement lazy RC, the execution of each processor is divided into intervals. A new interval begins every time a processor synchronizes. Intervals on different processors are partially ordered: (i) intervals on a single processor are totally ordered by program order, (ii) an interval on processor p precedes an interval on processor q if the interval of q begins with the acquire corresponding to the release that concluded the interval of p, and (iii) an interval precedes another interval by transitive closure. This partial order is known as hb1 [1]. Vector timestamps are used to represent the partial order.
When a processor executes an acquire, it sends its current timestamp in the acquire message. The previous releaser then piggybacks on its response the set of write notices that have timestamps greater than the timestamp in the acquire message. These write notices describe the shared memory modifications that precede the acquire according to the partial order. The acquiring processor then invalidates the pages for which there are incoming write notices.
On an access fault, a page is brought up-to-date by fetching all the missing diffs and applying them to the page in increasing timestamp order. All write notices without corresponding diffs are examined. It is usually unnecessary to send diff requests to all the processors who have modified the page, because if a processor has modified a page during an interval, then it must have all the diffs of all intervals that precede it, including those from other processors. TreadMarks then sends diff requests to the subset of processors for which their most recent interval is not preceded by the most recent interval of another processor.
Each lock has a statically assigned manager. The manager records which processor has most recently requested the lock. All lock acquire requests are directed to the manager, and, if necessary, forwarded to the processor that last requested the lock. A lock release does not cause any communication. Barriers have a centralized manager. The number of messages sent in a barrier is 2 x (n-1), where n is the number of processors.