Frangipani: A Scalable Distributed File System
Thekkath, Mann, Lee (1997)
What Kind of Paper is This
- Describes a system
- Demonstrates performance
- Demonstrates scalability
Goals
- Consistent view of the file set.
- Hands-off recovery (recovers from machine, network and disk failures).
- Easy administration:
- Add servers
- Add users
- Full, consistent backup online
Key Design Points
- Cluster file system
- Implemented on top of Petal
- Provides abstraction of a virtual disk.
- Distributed storage system.
- Virtual disk is read/written in blocks.
- Sparse 2^64 byte address space.
- Storage allocated on demand.
- Optional replication.
- Efficient snapshots.
- Shared, distributed lock manager to coordinate activity.
- File server module runs in host kernel (via vfs/vnode).
- Per server redo logs (maintained on Petal).
Security
- Servers all trust one another (i.e., must run on trusted kernels).
- Components authenticate themselves to one another.
- Can export Frangipani via NFS (or DCE/DFS, SMB) to insecure clients.
Implementation
- Disk Layout
- Take advantage of Petal's sparse address space by carving it up into
chunks for different purposes (allocation in 64 KB chunks).
- 0-1 TB: Configuration information.
- 1-2 TB: Logs (4 GB per server, i.e. 1 TB split 256 ways; maximum of 256 servers).
- 2-5 TB: Allocation bitmaps.
- 5-6 TB: Inodes (512 bytes each; stores symlinks).
- 6-134 TB: Small blocks (4 KB; max 16/file).
- 134 TB-2^64 bytes: Large blocks (1 TB; max 1/file; max file size 1
TB+64KB).
- Little rationale offered for this layout.
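The arithmetic behind this layout can be checked with a short sketch. This is only an illustration built from the numbers in these notes: the names (REGIONS, region_of, and so on) are made up here, and the large-block region is assumed to start where the small-block region ends (134 TB) and to run to the end of the 2^64 byte address space. Note that an even split of the 1 TB log region among 256 servers works out to 4 GB per log.

```python
KB, GB, TB = 2**10, 2**30, 2**40

# Region boundaries (byte offsets into the Petal virtual disk), taken from the
# list above. Assumption: large blocks start where small blocks end.
REGIONS = [
    ("config",       0 * TB,   1 * TB),
    ("logs",         1 * TB,   2 * TB),
    ("bitmaps",      2 * TB,   5 * TB),
    ("inodes",       5 * TB,   6 * TB),
    ("small_blocks", 6 * TB,   134 * TB),
    ("large_blocks", 134 * TB, 2**64),
]

INODE_SIZE = 512            # bytes per inode
SMALL_BLOCK = 4 * KB        # bytes per small block
SMALL_PER_FILE = 16         # small blocks per file
LARGE_BLOCK = 1 * TB        # address space reserved per large block
MAX_SERVERS = 256


def region_of(addr):
    """Return the name of the region that contains byte offset addr."""
    for name, start, end in REGIONS:
        if start <= addr < end:
            return name
    raise ValueError("address outside the 2^64 byte virtual disk")


def region_size(name):
    """Return the size in bytes of the named region."""
    for rname, start, end in REGIONS:
        if rname == name:
            return end - start
    raise KeyError(name)


max_inodes = region_size("inodes") // INODE_SIZE            # 2^31 files
small_block_count = region_size("small_blocks") // SMALL_BLOCK  # 2^35 blocks
log_share = region_size("logs") // MAX_SERVERS              # 4 GB per server
max_file_size = SMALL_PER_FILE * SMALL_BLOCK + LARGE_BLOCK  # 64 KB + 1 TB
```

The sketch makes the trade-off visible: with 2^31 inodes but only 2^35 small blocks and a large-block region of just under 2^24 TB, the fixed carve-up caps both the number of files and the number of files larger than 64 KB.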
- Logging and Recovery
- Write-ahead logging.
- Log only meta-data updates.
- Synchronous logging is optional; system always guarantees ordering.
- Max log size: 128 KB.
- Log maintained as a circular buffer of two 64 KB chunks (this is
inconsistent with allocating 1 TB for up to 256 logs).
- Reclamation is similar to Cedar: 25% at a time.
- Failure is detected in two ways.
- Client does not hear back from a server.
- Lock manager asks for a lock back and doesn't get a reply.
- A recovery daemon is given ownership of the crashed logs and runs
recovery.
- Making recovery work.
- Blocks are written to Petal before write locks are moved from one
server to another.
- Use log sequence numbers to make sure you never redo an operation that
is already done (this guarantees that the proper write locks are held
during recovery; just like database recovery).
- Meta-data blocks must have room for LSN; blocks used for meta-data are
never used for user data.
- Locking guarantees that only one recovery daemon tries to recover a
particular log.
- Assume no partial sector failures.
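The "never redo an operation that is already done" rule can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes each meta-data block stores a number (called version here) and each log record carries the number the block will have after the update; Block, LogRecord and replay are hypothetical names, and a dict stands in for Petal.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A meta-data block with room for a sequence/version number."""
    version: int = 0
    data: bytes = b""


@dataclass
class LogRecord:
    """One logged meta-data update: which block, its new number and contents."""
    addr: int
    new_version: int
    new_data: bytes


def replay(log, disk):
    """Redo logged updates in order, skipping any the block already reflects.

    Returns the number of updates actually applied.
    """
    applied = 0
    for rec in log:
        blk = disk.setdefault(rec.addr, Block())
        # Apply only if the on-disk block is older than the logged update.
        if blk.version < rec.new_version:
            disk[rec.addr] = Block(rec.new_version, rec.new_data)
            applied += 1
    return applied
```

Because the comparison is against the number stored in the block itself, running recovery twice, or running it after another server has already written a newer version of the block, leaves the newer contents alone. This is also why blocks that hold meta-data cannot be reused for user data, which has no slot for the number.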
- Synchronization and Cache Coherence
- Multiple reader/single writer locks.
- No waiting; downgrade or release upon conflict.
- Invalidate cached entries upon release of read lock.
- Write dirty data to disk before releasing write lock.
- Lock granularity is a sector (do not allocate more than one data
structure to a sector).
- Logs are single lockable entities.
- Bitmaps are partitioned into lockable entities.
- Each file/directory is a lockable entity.
- Global lock ordering avoids deadlocks.
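The cache-coherence rules above (flush dirty data before giving up a write lock; invalidate the cache when giving up a read lock) can be sketched from one server's point of view. This is a hedged illustration only: CachedFile and its methods are invented names, a dict stands in for Petal, and the lock manager itself (who may hold which lock, and when) is not modelled.

```python
class CachedFile:
    """One lockable entity (a file or directory) as cached by one server."""

    def __init__(self, petal, key):
        self.petal = petal    # dict standing in for the shared Petal disk
        self.key = key
        self.mode = None      # None, "read", or "write"
        self.cache = None     # locally cached contents
        self.dirty = False    # True if cache holds unwritten changes

    def acquire(self, mode):
        """Record that the lock was granted and fill the cache if empty."""
        self.mode = mode
        if self.cache is None:
            self.cache = self.petal.get(self.key)

    def read(self):
        assert self.mode in ("read", "write"), "no lock held"
        return self.cache

    def write(self, value):
        assert self.mode == "write", "write lock required"
        self.cache = value
        self.dirty = True

    def _flush(self):
        # Dirty data must reach Petal before the write lock is given up.
        if self.dirty:
            self.petal[self.key] = self.cache
            self.dirty = False

    def downgrade(self):
        """Write lock -> read lock: flush, but keep the (now clean) cache."""
        assert self.mode == "write"
        self._flush()
        self.mode = "read"

    def release(self):
        """Give up the lock entirely: flush if dirty, then invalidate."""
        self._flush()
        self.cache = None
        self.mode = None
```

A usage pattern would be: server A acquires the write lock and writes; when server B wants to read, A downgrades (flushing to Petal) and B's acquire then sees the new contents.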
- Locking
- Client failure handled via leases.
- Network partitions can make data inaccessible; currently handled rather
ungracefully (require unmount/remount).
- Current (3rd generation) lock manager is fully distributed.
- Module linked into each Frangipani server.
- Lock ids are 64-bit integers.
- Lock table per file system.
- Communication is via asynchronous messages.
- Global lock information maintained via Lamport's Paxos algorithm.
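Lease-based handling of client failure can be sketched as a table of expiry times. This is an assumption-laden illustration: LeaseTable, its methods, and the duration value used in the usage note are hypothetical, and time is passed in explicitly so the behaviour is easy to follow.

```python
class LeaseTable:
    """Tracks when each lock holder's lease runs out."""

    def __init__(self, duration):
        self.duration = duration   # lease length, in seconds (assumed value)
        self.expiry = {}           # holder id -> time at which lease expires

    def renew(self, holder, now):
        """Grant or extend a lease starting at time now."""
        self.expiry[holder] = now + self.duration

    def is_live(self, holder, now):
        """True while the holder's lease has not yet expired."""
        return self.expiry.get(holder, float("-inf")) > now

    def expired(self, now):
        """Holders whose leases have run out; their locks can be reclaimed."""
        return sorted(h for h, t in self.expiry.items() if t <= now)
```

For example, with `LeaseTable(30)`, a holder that renewed at time 0 and then went silent shows up in `expired(35)`, at which point its log can be handed to a recovery daemon and its locks released.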
Performance
- Testbed
- Seven 333 MHz DEC Alpha 500 5/333 Petal servers.
- 9 Digital RZ29 disks per server (4.3 GB each; 9 ms average seek; 6 MB/s
transfer)
- 24-port 155 Mbit/s ATM.
- 8 MB presto-serve board per server.
- File Server: 225 MHz DEC Alpha 3000/700 w/192 MB memory.
- Comparison to Advfs (a journaling file system)
- Advfs uses an 8-disk storage system connected via 10 MB/s fast SCSI.
- Claim Advfs is 4% slower on Petal than on this system.
- Use Modified Andrew Benchmark (ugh).
- Handles deferred work by unmounting/remounting file system between
tests.
- Next looks at Connectathon benchmark.
- Interesting Point: Frangipani is slower on creates, setattr,
and readdir, but the authors claim that these operations don't matter.
Recall that one of LFS's big claims was its superior create performance.
- Notice diplomatic handling of apparent bug in Advfs (request to
invalidate a file doesn't).
- Large file read
- Basically hardware limited (throughput is 96% of hardware limit).
- Claims that although Frangipani doesn't do intelligent disk layout,
everything works OK. What's wrong with this logic? (They do not indicate
that the disk is aged; I'm guessing that everything is done on a virgin
file system, so they get good disk layout by accident.)
- Performance is due to logging and striping.
- No read-ahead for small files (but most files are small)!
- Scaling
- Modified Andrew Scaling is quite good (seems reasonable; no data
sharing).
- Read/write tests with no lock contention scale well (until the ATM links
are saturated).
- With lock contention, read-ahead can cause serious degradation. Suggest
giving user ability to turn off read-ahead in presence of read/write
sharing.
Potential Problems
- Petal and Frangipani can do redundant logging.
- Frangipani does no optimization placing data on disks.
- Locks are held on entire files/directories.