Frangipani: A Scalable Distributed File System

Thekkath, Mann, Lee (1997)

What Kind of Paper is This

Describes a system
Demonstrates performance
Demonstrates scalability

Goals

Consistent view of the file set.
Hands-off recovery (recovers from machine, network and disk failures).
Easy administration:
- Add servers
- Add users
- Full, consistent backup online

Key Design Points

Cluster file system
Implemented on top of Petal
- Provides abstraction of a virtual disk.
- Distributed storage system.
- Virtual disk is read/written in blocks.
- Sparse 2⁶⁴ byte address space.
- Storage allocated on demand.
- Optional replication.
- Efficient snapshots.
Shared, distributed lock manager to coordinate activity.
File server module runs in host kernel (via vfs/vnode).
Per-server redo logs (maintained on Petal).
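
The Petal properties listed above (sparse address space, block reads and writes, storage allocated on demand) can be illustrated with a toy in-memory model. This is a sketch only: the class `SparseVirtualDisk` and its methods are hypothetical names, not Petal's interface. The 64 KB commit granularity comes from the Disk Layout notes, and returning zeros for never-written ranges is an assumption.

```python
CHUNK = 64 * 1024          # physical space is committed in 64 KB chunks
ADDRESS_SPACE = 2 ** 64    # sparse virtual address space, in bytes


class SparseVirtualDisk:
    """Toy model of a sparse virtual disk; not Petal's real interface."""

    def __init__(self):
        self.chunks = {}   # chunk index -> bytearray, created on first write

    def write(self, offset, data):
        if offset < 0 or offset + len(data) > ADDRESS_SPACE:
            raise ValueError("write outside the virtual address space")
        pos = 0
        while pos < len(data):
            index, within = divmod(offset + pos, CHUNK)
            n = min(CHUNK - within, len(data) - pos)
            # Backing storage appears only when a chunk is first written.
            chunk = self.chunks.setdefault(index, bytearray(CHUNK))
            chunk[within:within + n] = data[pos:pos + n]
            pos += n

    def read(self, offset, length):
        out = bytearray()
        pos = 0
        while pos < length:
            index, within = divmod(offset + pos, CHUNK)
            n = min(CHUNK - within, length - pos)
            chunk = self.chunks.get(index)
            # Assumption: ranges that were never written read back as zeros.
            out += chunk[within:within + n] if chunk is not None else bytes(n)
            pos += n
        return bytes(out)

    def committed_bytes(self):
        """Bytes of backing storage actually allocated so far."""
        return len(self.chunks) * CHUNK
```

Writing a few bytes at an offset of 5 TB commits a single 64 KB chunk, which is what lets a file system spread its structures across a huge address space without paying for the gaps.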

Security

Servers all trust one another (i.e., must run on trusted kernels).
Components authenticate themselves to one another.
Can export Frangipani via NFS (DCE/DFS, SMB) to insecure clients.

Implementation

Disk Layout
- Take advantage of Petal's sparse address space by carving it up into chunks for different purposes (allocation in 64 KB chunks).
- 0-1 TB: Configuration information.
- 1-2 TB: Logs (4 GB per server, i.e. 1 TB split 256 ways; maximum of 256 servers).
- 2-5 TB: Allocation bitmaps.
- 5-6 TB: Inodes (512 bytes each; stores symlinks).
- 6-134 TB: Small blocks (4 KB; max 16/file).
- 134 TB-2⁶⁴ bytes: Large blocks (1 TB each; max 1/file; max file size 1 TB + 64 KB).
- Little rationale offered for this layout.
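
The layout above is just address arithmetic, which a short sketch makes concrete. The region sizes are read off the ranges in these notes, treated as half-open intervals, and the large-block region is taken to begin where the small-block region ends. The function names, and the assumption that inodes are packed contiguously from the start of their region, are illustrative and not taken from the paper.

```python
TB = 2 ** 40
KB = 2 ** 10

# (name, size in bytes), in address order; sizes follow the ranges in the notes.
REGION_SIZES = [
    ("config", 1 * TB),
    ("logs", 1 * TB),
    ("bitmaps", 3 * TB),
    ("inodes", 1 * TB),
    ("small_blocks", 128 * TB),
]

INODE_SIZE = 512
SMALL_BLOCK = 4 * KB
SMALL_BLOCKS_PER_FILE = 16
LARGE_BLOCK = 1 * TB


def region_starts():
    """Return {name: start offset}, accumulating region sizes from 0."""
    starts, offset = {}, 0
    for name, size in REGION_SIZES:
        starts[name] = offset
        offset += size
    starts["large_blocks"] = offset   # everything after the small blocks
    return starts


def region_of(addr):
    """Name of the region containing byte address addr."""
    if not 0 <= addr < 2 ** 64:
        raise ValueError("address outside the 2**64-byte space")
    found = None
    for name, start in region_starts().items():   # ascending start order
        if addr >= start:
            found = name
    return found


def inode_address(inode_number):
    """Assumption: inodes are packed back to back from the region start."""
    return region_starts()["inodes"] + inode_number * INODE_SIZE


# 16 small blocks of 4 KB plus one 1 TB large block.
MAX_FILE_SIZE = SMALL_BLOCKS_PER_FILE * SMALL_BLOCK + LARGE_BLOCK
```

For example, `region_of(inode_address(2))` returns `"inodes"`, and `MAX_FILE_SIZE` works out to 1 TB + 64 KB, matching the limit quoted in the list.
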
Logging and Recovery
- Write-ahead logging.
- Log only meta-data updates.
- Synchronous logging is optional; system always guarantees ordering.
- Max log size: 128 KB.
- Log maintained as a circular buffer of two 64 KB chunks (this is inconsistent with allocating 1 TB for up to 256 logs).
- Reclamation is similar to Cedar: 25% at a time.
- Failure is detected in two ways.
  1. Client does not hear back from a server.
  2. Lock manager asks for a lock back and doesn't get a reply.
- A recovery daemon is given ownership of the crashed logs and runs recovery.
- Making recovery work.
  - Blocks are written to Petal before write locks are moved from one server to another.
  - Use log sequence numbers to make sure you never redo an operation that is already done (this guarantees that the proper write locks are held during recovery; just like database recovery).
  - Meta-data blocks must have room for LSN; blocks used for meta-data are never used for user data.
  - Locking guarantees that only one recovery daemon tries to recover a particular log.
  - Assume no partial sector failures.
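
The LSN rule in the recovery notes above can be sketched as a redo loop that skips any update already reflected on disk. This is a minimal illustration and not the paper's code: `LogRecord`, `replay`, and the dictionary standing in for Petal are hypothetical, and LSNs are assumed to start at 1 so that 0 can mean "never written".

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    lsn: int          # log sequence number of this update (assumed >= 1)
    block: int        # metadata block the update applies to
    new_value: str    # stand-in for the logged change


def replay(log, disk):
    """Redo logged updates against `disk`, a {block: (lsn, value)} dict.

    An update is applied only when the block's stored LSN is older than
    the record's LSN, so work that already reached disk is never redone.
    Returns the LSNs that were actually applied.
    """
    applied = []
    for record in sorted(log, key=lambda r: r.lsn):
        stored_lsn, _ = disk.get(record.block, (0, None))
        if stored_lsn < record.lsn:
            disk[record.block] = (record.lsn, record.new_value)
            applied.append(record.lsn)
    return applied
```

Because each block carries the LSN of its last update, running `replay` a second time over the same log applies nothing, which is why the notes require metadata blocks to have room for an LSN.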
Synchronization and Cache Coherence
- Multiple reader/single writer locks.
- No waiting; downgrade or release upon conflict.
- Invalidate cached entries upon release of read lock.
- Write dirty data to disk before releasing write lock.
- Lock granularity is a sector (do not allocate more than one data structure to a sector).
- Logs are single lockable entities.
- Bitmaps are partitioned into lockable entities.
- Each file/directory is a lockable entity.
- Global lock ordering avoids deadlocks.
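
The coherence rules above (flush dirty data before giving up a write lock, invalidate cached entries when a read lock is released, downgrade instead of waiting) can be sketched from one server's point of view. Everything here is hypothetical: `CachedFile` is an invented name, a shared dictionary stands in for Petal, and the lock service is assumed to have granted whatever mode is passed to `acquire`.

```python
class CachedFile:
    """One server's cached copy of a file, guarded by a lock mode."""

    def __init__(self, petal, name):
        self.petal = petal      # shared dict standing in for Petal
        self.name = name
        self.mode = None        # None, "read", or "write"
        self.data = None        # cached contents, valid only under a lock
        self.dirty = False

    def acquire(self, mode):
        # Assumes the lock service has already granted `mode`.
        if self.mode is None:
            self.data = self.petal.get(self.name, b"")   # fill the cache
        self.mode = mode

    def write(self, data):
        if self.mode != "write":
            raise PermissionError("write lock required")
        self.data, self.dirty = data, True

    def read(self):
        if self.mode is None:
            raise PermissionError("read or write lock required")
        return self.data

    def _flush(self):
        if self.dirty:
            self.petal[self.name] = self.data
            self.dirty = False

    def downgrade(self):
        """Write -> read: flush dirty data, keep the cached copy."""
        self._flush()
        self.mode = "read"

    def release(self):
        """Give up the lock: flush if dirty, then invalidate the cache."""
        self._flush()
        self.data, self.mode = None, None
```

A writer that is asked to downgrade pushes its dirty data to the shared store first, so a second server acquiring a read lock afterwards sees the new contents; a server that releases its lock drops its cached copy and must refetch on the next acquire.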
Locking
- Client failure handled via leases.
- Network partitions can make data inaccessible; currently handled rather ungracefully (require unmount/remount).
- Current (3rd generation) lock manager is fully distributed.
  - Module linked into each Frangipani server.
  - Lock ids are 64-bit integers.
  - Lock table per file system.
  - Communication is via asynchronous messages.
  - Global lock information maintained via Lamport's Paxos algorithm.
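
The lease mechanism mentioned at the top of this list can be sketched as a table of expiration times: a holder that stops renewing is eventually treated as failed. This is an illustration under stated assumptions, not the lock manager's actual interface. `LeaseTable` and its methods are hypothetical, the lease duration is a parameter because these notes do not state one, and the clock is injected so the behavior can be checked without waiting.

```python
class LeaseTable:
    """Tracks lease expiry per lock holder (hypothetical sketch)."""

    def __init__(self, duration, clock):
        self.duration = duration     # lease length in seconds (a parameter)
        self.clock = clock           # callable returning the current time
        self.expiry = {}             # holder id -> expiration time

    def grant_or_renew(self, holder):
        """Start a lease, or push an existing lease's expiry forward."""
        self.expiry[holder] = self.clock() + self.duration

    def is_valid(self, holder):
        """True while the holder's lease has not yet expired."""
        return self.expiry.get(holder, float("-inf")) > self.clock()

    def expired_holders(self):
        """Holders whose leases have lapsed; their locks can be recovered."""
        now = self.clock()
        return sorted(h for h, t in self.expiry.items() if t <= now)
```

A holder must check `is_valid` before acting on its locks, since the manager may already consider an expired holder dead and hand its log to a recovery daemon.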

Performance

Testbed
- 7 Petal servers: 333 MHz DEC Alpha 500 5/333.
- 9 Digital RZ29 disks per server (4.3 GB each; 9 ms average seek; 6 MB/s transfer).
- 24-port 155 Mbit/s ATM switch.
- 8 MB Prestoserve board per server.
- File Server: 225 MHz DEC Alpha 3000/700 w/192 MB memory.
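
A little arithmetic on the testbed figures above shows where the bottleneck is likely to be. This is a back-of-the-envelope sketch: it assumes one ATM link per machine, and it ignores protocol overhead, seeks, and replication, so the numbers are upper bounds rather than measurements.

```python
PETAL_SERVERS = 7
DISKS_PER_SERVER = 9
DISK_CAPACITY_GB = 4.3
DISK_TRANSFER_MB_S = 6
ATM_LINK_MBIT_S = 155

# Raw disk capacity across the whole Petal cluster.
raw_capacity_gb = PETAL_SERVERS * DISKS_PER_SERVER * DISK_CAPACITY_GB

# Peak sequential bandwidth of one server's disks versus its network link.
disk_bandwidth_per_server = DISKS_PER_SERVER * DISK_TRANSFER_MB_S   # MB/s
link_bandwidth = ATM_LINK_MBIT_S / 8                                # MB/s

print(f"raw capacity: {raw_capacity_gb:.1f} GB")
print(f"disks per server: {disk_bandwidth_per_server} MB/s")
print(f"ATM link: {link_bandwidth} MB/s")
```

Under these assumptions a single link (about 19 MB/s) carries well under what nine disks can stream (54 MB/s), which is consistent with the scaling results being limited by the ATM links rather than the disks.
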
Comparison to AdvFS (a journaling file system)
- AdvFS uses an 8-disk storage system connected via 10 MB/s fast SCSI.
- Claim AdvFS is 4% slower on Petal than on this system.
- Use Modified Andrew Benchmark (ugh).
- Handles deferred work by unmounting/remounting file system between tests.
- Next looks at Connectathon benchmark.
- Interesting Point: Frangipani is slower on creates, setattr, and readdir, but the authors claim that these operations don't matter. Recall that one of LFS's big claims was its superior create performance.
- Notice diplomatic handling of apparent bug in AdvFS (request to invalidate a file doesn't).
Large file read
- Basically hardware limited (throughput is 96% of hardware limit).
- Claims that although Frangipani doesn't do intelligent disk layout, everything works OK. What's wrong with this logic? They do not indicate that the disk is aged; I'm guessing that everything is done on a fresh file system, so they get good disk layout by accident.
- Performance is due to logging and striping.
- No read-ahead for small files (but most files are small)!
Scaling
- Modified Andrew Scaling is quite good (seems reasonable; no data sharing).
- Read/write tests with no lock contention scale well (until the ATM links are saturated).
- With lock contention, read-ahead can cause serious degradation. Suggest giving user ability to turn off read-ahead in presence of read/write sharing.

Potential Problems

Petal and Frangipani can do redundant logging.
Frangipani does no optimization placing data on disks.
Locks are held on entire files/directories.