    Anand, A., Muthukrishnan, C., Kappes, S., Akella, A. & Nath, S. Cheap and Large CAMs for High Performance Data-Intensive Networked Systems 2010 NSDI, pp. 433-448  inproceedings URL 
    Abstract: We show how to build cheap and large CAMs, or CLAMs, using a combination of DRAM and flash memory. These are targeted at emerging data-intensive networked systems that require massive hash tables running into a hundred GB or more, with items being inserted, updated and looked up at a rapid rate. For such systems, using DRAM to maintain hash tables is quite expensive, while on-disk approaches are too slow. In contrast, CLAMs cost nearly the same as using existing on-disk approaches but offer orders of magnitude better performance. Our design leverages an efficient flash-oriented data-structure called BufferHash that significantly lowers the amortized cost of random hash insertions and updates on flash. BufferHash also supports flexible CLAM eviction policies. We prototype CLAMs using SSDs from two different vendors. We find that they can offer average insert and lookup latencies of 0.006ms and 0.06ms (for a 40 % lookup success rate), respectively. We show that using our CLAM prototype significantly improves the speed and effectiveness of WAN optimizers.
    Review: The paper offers a RAM- and flash-based solution for large, fast hash tables in distributed systems, at roughly the same cost as hard-disk based solutions.

    The proposed system consists of a small in-memory buffer layer and incarnation tables on flash memory, augmented by Bloom filters that are also maintained in memory.

    Performance evaluation confirms that the system is capable of a few orders of magnitude more operations per second per dollar than the existing alternatives studied.
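
    The core idea can be sketched in a few lines of Python (a toy stand-in, not the authors' implementation; the Bloom filter, buffer size and eviction policy are simplified placeholders):

    import hashlib

    class TinyBloom:
        """Very small Bloom filter (3 hash positions) kept in memory per incarnation."""
        def __init__(self, nbits=1 << 16):
            self.nbits, self.bits = nbits, bytearray(nbits // 8)
        def _positions(self, key):
            d = hashlib.sha256(key.encode()).digest()
            return [int.from_bytes(d[i:i + 4], "big") % self.nbits for i in (0, 4, 8)]
        def add(self, key):
            for p in self._positions(key):
                self.bits[p // 8] |= 1 << (p % 8)
        def __contains__(self, key):
            return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

    class ClamSketch:
        """Small in-memory buffer plus immutable 'incarnations' standing in for flash."""
        def __init__(self, buffer_limit=4):
            self.buffer, self.buffer_limit = {}, buffer_limit
            self.incarnations = []                      # newest first: (bloom, table)
        def put(self, key, value):
            self.buffer[key] = value
            if len(self.buffer) >= self.buffer_limit:   # flush = one sequential flash write
                bloom = TinyBloom()
                for k in self.buffer:
                    bloom.add(k)
                self.incarnations.insert(0, (bloom, self.buffer))
                self.buffer = {}
        def get(self, key):
            if key in self.buffer:
                return self.buffer[key]
            for bloom, table in self.incarnations:      # Bloom filters skip most flash reads
                if key in bloom and key in table:
                    return table[key]
            return None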

    BibTeX:
    @inproceedings{DBLP:conf/nsdi/AnandMKAN10,
      author = {Ashok Anand and Chitra Muthukrishnan and Steven Kappes and Aditya Akella and Suman Nath},
      title = {Cheap and Large CAMs for High Performance Data-Intensive Networked Systems},
      booktitle = {NSDI},
      publisher = {USENIX Association},
      year = {2010},
      pages = {433-448},
      url = {http://www.usenix.org/events/nsdi10/tech/full_papers/anand.pdf}
    }
    
    Andersen, D.G., Franklin, J., Kaminsky, M., Phanishayee, A., Tan, L. & Vasudevan, V. FAWN: a fast array of wimpy nodes 2011 Commun. ACM, Vol. 54(7), pp. 101-109  article URL 
    Abstract: This paper introduces the FAWN—Fast Array of Wimpy Nodes—cluster architecture for providing fast, scalable, and power-efficient key-value storage. A FAWN links together a large number of tiny nodes built using embedded processors and small amounts (2–16GB) of flash memory into an ensemble capable of handling 700 queries per second per node, while consuming fewer than 6 watts of power per node. We have designed and implemented a clustered key-value storage system, FAWN-DHT, that runs atop these nodes. Nodes in FAWN-DHT use a specialized log-like back-end hash-based database to ensure that the system can absorb the large write workload imposed by frequent node arrivals and departures. FAWN uses a two-level cache hierarchy to ensure that imbalanced workloads cannot create hot-spots on one or a few wimpy nodes that impair the system’s ability to service queries at its guaranteed rate. Our evaluation of a small-scale FAWN cluster and several candidate FAWN node systems suggest that FAWN can be a practical approach to building large-scale storage for seek-intensive workloads. Our further analysis indicates that a FAWN cluster is cost-competitive with other approaches (e.g., DRAM, multitudes of magnetic disks, solid-state disk) to providing high query rates, while consuming 3-10x less power.
    Review: This paper looks at the power consumption aspect of distributed computing and presents a key-value storage system optimized for transactions per second per watt. Several problems are identified and discussed, such as the disproportionate increase in processor power consumption as frequency increases. Furthermore, powerful processors exhibit inefficient power savings when some of their cores are shut down or their frequency is lowered. The conclusion is thus the need to use slow, in-order commodity processors, which provide an order of magnitude more operations per joule than their more powerful counterparts. On the storage front, hard drives are power-inefficient and RAM even more so. The alternative is the relatively new flash-based SSD technology, which consumes the least power and, as a bonus, is significantly faster than hard disks.

    The key-value store built is a fairly common log-based solution, with some mostly insignificant optimizations for using SSDs as storage. The authors sort of cheat by implementing a hash-based toy solution that only supports very basic operations, and then unfairly compare it with a full-fledged key-value store (BerkeleyDB). The latter provides operations such as iterators and compare-and-swap. Iterators require an ordered structure (e.g. a B-tree or skip list), which significantly slows down insertions and deletions and tremendously increases program complexity. Atomic compare-and-swap operations may be used as building blocks for supporting transactions, but they need to block other operations on the same key, which may slow down the system further.
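
    The log-structured back-end design can be sketched roughly as follows (a toy Python stand-in under my own assumptions, not the authors' FAWN code): values are appended to a log file, and a small in-memory index maps each key to the offset of its latest record, so writes are sequential and a read costs one random access.

    import os

    class LogStoreSketch:
        """Append-only value log (standing in for flash) plus an in-memory offset index."""
        def __init__(self, path):
            self.log = open(path, "a+b")
            self.index = {}                       # key -> offset of the latest record

        def put(self, key: bytes, value: bytes):
            self.log.seek(0, os.SEEK_END)
            offset = self.log.tell()
            self.log.write(len(key).to_bytes(4, "big") + key +
                           len(value).to_bytes(4, "big") + value)
            self.log.flush()
            self.index[key] = offset              # sequential write, O(1) index update

        def get(self, key: bytes):
            offset = self.index.get(key)
            if offset is None:
                return None
            self.log.seek(offset)
            klen = int.from_bytes(self.log.read(4), "big")
            self.log.read(klen)                   # skip the stored key
            vlen = int.from_bytes(self.log.read(4), "big")
            return self.log.read(vlen)            # a single random read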

    BibTeX:
    @article{DBLP:journals/cacm/AndersenFKPTV11,
      author = {David G. Andersen and Jason Franklin and Michael Kaminsky and Amar Phanishayee and Lawrence Tan and Vijay Vasudevan},
      title = {FAWN: a fast array of wimpy nodes},
      journal = {Commun. ACM},
      year = {2011},
      volume = {54},
      number = {7},
      pages = {101-109},
      url = {http://dl.acm.org/citation.cfm?doid=1965724.1965746}
    }
    
    Baker, J., Bond, C., Corbett, J., Furman, J.J., Khorlin, A., Larson, J., Leon, J.-M., Li, Y., Lloyd, A. & Yushprakh, V. Megastore: Providing Scalable, Highly Available Storage for Interactive Services 2011 CIDR, pp. 223-234  inproceedings URL 
    Abstract: Megastore is a storage system developed to meet the requirements of today’s interactive online services. Megastore blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS in a novel way, and provides both strong consistency guarantees and high availability. We provide fully serializable ACID semantics within fine-grained partitions of data. This partitioning allows us to synchronously replicate each write across a wide area network with reasonable latency and support seamless failover between datacenters. This paper describes Megastore’s semantics and replication algorithm. It also describes our experience supporting a wide range of Google production services built with Megastore.
    Review: This paper describes a middle-ground solution between RDBMS and NoSQL technologies. It is developed by Google and intended to be used on top of the Bigtable system, providing both high availability and strong consistency guarantees.

    The key to the system is partitioning, and the authors argue that most useful applications can find a natural way to partition their data. Several very different in-house applications are given as examples, together with their partitioning schemes. Full ACID guarantees are provided within a partition while still maintaining high availability. Cross-partition operations must sacrifice either consistency or availability, but the decision is left to the application, which can choose on a transaction-by-transaction basis: full ACID guarantees may be achieved across partitions at the cost of availability by using two-phase commit, or higher availability may be chosen by using asynchronous message delivery queues.

    Much of the complexity is dealt with at the application level; a Megastore library is provided to ease application programming, but its exact contents are not described. Applications need to provide a schema for their data and define partitions (entity groups). Here is an example:

    CREATE SCHEMA PhotoApp;

    CREATE TABLE User {
      required int64 user_id;
      required string name;
    } PRIMARY KEY(user_id), ENTITY GROUP ROOT;

    CREATE TABLE Photo {
      required int64 user_id;
      required int32 photo_id;
      required int64 time;
      required string full_url;
      optional string thumbnail_url;
      repeated string tag;
    } PRIMARY KEY(user_id, photo_id),
      IN TABLE User,
      ENTITY GROUP KEY(user_id) REFERENCES User;

    CREATE LOCAL INDEX PhotosByTime
      ON Photo(user_id, time);

    CREATE GLOBAL INDEX PhotosByTag
      ON Photo(tag) STORING (thumbnail_url);

    The Paxos protocol is used throughout the solution to provide consensus on the ordering of operations. In fact there is even a special replica type (a witness) that stores only logs, not data, and can still take part in Paxos voting without large storage needs. The architecture is very flexible: the opposite kind of replica is also available – read-only replicas store full data and help distribute the read load, but do not participate in the global ordering process and thus do not slow it down (writes are impossible on such replicas).
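
    To make the per-transaction choice concrete, here is a small client-side sketch against the PhotoApp schema above (the store/transaction API names are hypothetical, invented for illustration – the paper does not publish the library's interface): work confined to one entity group gets full ACID semantics, while a cross-group update must pick two-phase commit or an asynchronous queue.

    # Hypothetical client API, not Megastore's actual library.
    def add_photo(store, user_id, photo_id, full_url):
        # Single entity group: the User row and its Photo rows share the group
        # key user_id, so this commit is serializable and synchronously replicated.
        with store.transaction(entity_group=("User", user_id)) as txn:
            txn.insert("Photo", user_id=user_id, photo_id=photo_id, full_url=full_url)

    def move_photo(store, src_user, dst_user, photo_id, use_2pc=False):
        groups = [("User", src_user), ("User", dst_user)]        # two entity groups
        if use_2pc:
            # Consistent but less available: both groups commit or neither does.
            with store.two_phase_commit(groups) as txn:
                txn.delete("Photo", user_id=src_user, photo_id=photo_id)
                txn.insert("Photo", user_id=dst_user, photo_id=photo_id)
        else:
            # More available: commit locally, let a queued message update the other group.
            with store.transaction(entity_group=groups[0]) as txn:
                txn.delete("Photo", user_id=src_user, photo_id=photo_id)
                txn.enqueue(groups[1], "copy_photo", photo_id=photo_id)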

    BibTeX:
    @inproceedings{DBLP:conf/cidr/BakerBCFKLLLLY11,
      author = {Jason Baker and Chris Bond and James Corbett and J. J. Furman and Andrey Khorlin and James Larson and Jean-Michel Leon and Yawei Li and Alexander Lloyd and Vadim Yushprakh},
      title = {Megastore: Providing Scalable, Highly Available Storage for Interactive Services},
      booktitle = {CIDR},
      publisher = {www.crdrdb.org},
      year = {2011},
      pages = {223-234},
      url = {http://www.cidrdb.org/cidr2011/Papers/CIDR11_Paper32.pdf}
    }
    
    Belazzougui, D., Botelho, F.C. & Dietzfelbinger, M. Hash, Displace, and Compress 2009 ESA, Vol. 5757, pp. 682-693  inproceedings URL 
    Abstract: A hash function h, i.e., a function from the set U of all keys to the range [m] = {0,...,m − 1}, is called a perfect hash function (PHF) for a subset S ⊆ U of size n ≤ m if h is 1-1 on S. The important performance parameters of a PHF are representation size, evaluation time and construction time. In this paper, we present an algorithm that permits to obtain PHFs with expected representation size very close to optimal while retaining O(n) expected construction time and O(1) evaluation time in the worst case. For example in the case m = 1.23n we obtain a PHF that uses space 1.4 bits per key, and for m = 1.01n we obtain space 1.98 bits per key, which was not achievable with previously known methods. Our algorithm is inspired by several known algorithms; the main new feature is that we combine a modification of Pagh’s “hash-and-displace” approach with data compression on a sequence of hash function indices. Our algorithm can also be used for k-perfect hashing, where at most k keys may be mapped to the same value.
    Review: “In this paper, we present an algorithm that permits to obtain PHFs with representation size very close to optimal while retaining O(n) construction time and O(1) evaluation time…. Our algorithm can also be used for k-perfect hashing, where at most k keys may be mapped to the same value”

    Another interesting, simple and fast hash-function for actual use is the SuperFastHash: http://www.azillionmonkeys.com/qed/hash.html
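
    The "hash and displace" part can be illustrated with a toy Python construction (no compression step, arbitrary hash functions and parameters of my own choosing, so it has none of the paper's space guarantees): keys are split into buckets by a first-level hash; buckets are processed largest-first, and for each bucket we search for a seed (the displacement) whose second-level hash maps every key of the bucket into distinct, still-free slots.

    import hashlib

    def _h(seed: int, key: str, mod: int) -> int:
        data = f"{seed}:{key}".encode()
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % mod

    def build_phf(keys, load=1.23):
        n = len(keys)
        m = max(1, int(load * n))               # table size, m >= n
        r = max(1, n // 4)                      # number of first-level buckets
        buckets = [[] for _ in range(r)]
        for k in keys:
            buckets[_h(0, k, r)].append(k)

        seeds = [0] * r                         # per-bucket displacement value
        free = [True] * m
        for b in sorted(range(r), key=lambda i: -len(buckets[i])):
            if not buckets[b]:
                continue
            seed = 1
            while True:                         # search for a collision-free seed
                slots = [_h(seed, k, m) for k in buckets[b]]
                if len(set(slots)) == len(slots) and all(free[s] for s in slots):
                    break
                seed += 1
            for s in slots:
                free[s] = False
            seeds[b] = seed
        return seeds, r, m

    def phf_lookup(seeds, r, m, key):
        """Evaluation is O(1): two hash calls and one array access."""
        return _h(seeds[_h(0, key, r)], key, m)

    In the paper, the array of per-bucket seeds is then compressed, which is where the 1.4–2 bits per key figures come from.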

    BibTeX:
    @inproceedings{DBLP:conf/esa/BelazzouguiBD09,
      author = {Djamal Belazzougui and Fabiano C. Botelho and Martin Dietzfelbinger},
      title = {Hash, Displace, and Compress},
      booktitle = {ESA},
      publisher = {Springer},
      year = {2009},
      volume = {5757},
      pages = {682-693},
      url = {http://www.springerlink.com/content/d41867w7r1236506/}
    }
    
    Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T., Fikes, A. & Gruber, R. Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!) 2006 OSDI, pp. 205-218  inproceedings URL 
    Abstract: Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this article, we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
    Review: This paper presents a NoSQL data storage solution for data with a loose schema or no schema at all. Bigtable, as the name suggests, is optimized to handle massive amounts of data, orders of magnitude more than any current relational database is able to accommodate. The data is stored in rows, each of which may have any number of columns grouped into column families that need to be established before being populated. Each row can have a different number of columns, and each cell can hold multiple versions of its value, distinguishable by a timestamp. The service is hosted on top of the GFS distributed file system and is optimized to work well with it. The Chubby locking service is also used by Bigtable to store the root of a B+-tree-like structure that indexes all the data. The expected workload is an append-mostly write pattern and a sequential read pattern.

    This storage service is optimized for a specific workload, such as caching the entire Internet. Other usage patterns, such as random writes, will perform poorly. Even random reads are less optimized than sequential ones, as the block size used in the system is 64 MB (because of the underlying GFS infrastructure). Data may only be accessed at a row level, and although atomicity is guaranteed on a per-row basis, multi-row transactions are impossible without a client-orchestrated algorithm that includes separate locking. Deletes are also not applied instantly; it may take a very long time before deleted data actually disappears. All the files written to disk are immutable, meaning that they are simply garbage collected once they contain no more useful data, an approach that also wastes a lot of disk space. I would however like to point out that these drawbacks are all a result of trade-offs made to support the expected workload; they are not the result of poor design decisions.
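
    The data model itself is easy to sketch (a toy, single-machine Python stand-in with none of the SSTable/GFS machinery): a sparse map from (row key, "family:qualifier") to a list of timestamped versions, with atomicity only within a single row.

    import threading, time

    class ToyBigtableModel:
        """(row, 'family:qualifier') -> [(timestamp, value), ...], newest readable."""
        def __init__(self, families):
            self.families = set(families)        # column families are declared up front
            self.rows = {}                        # row key -> {column -> [(ts, value)]}
            self.row_locks = {}                   # atomicity is per row only

        def put(self, row, column, value, ts=None):
            assert column.split(":", 1)[0] in self.families, "unknown column family"
            lock = self.row_locks.setdefault(row, threading.Lock())
            with lock:                            # single-row mutations are atomic
                cells = self.rows.setdefault(row, {}).setdefault(column, [])
                cells.append((ts if ts is not None else time.time(), value))
                cells.sort(key=lambda c: c[0])    # keep versions ordered by timestamp

        def get(self, row, column, ts=None):
            cells = self.rows.get(row, {}).get(column, [])
            if ts is None:
                return cells[-1][1] if cells else None
            for t, v in reversed(cells):          # newest version not newer than ts
                if t <= ts:
                    return v
            return None

    # Example: a web-crawl row keyed by reversed URL, as in the paper.
    t = ToyBigtableModel(families={"contents", "anchor"})
    t.put("com.example.www", "contents:", "<html>...</html>")
    t.put("com.example.www", "anchor:cnnsi.com", "Example")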

    BibTeX:
    @inproceedings{DBLP:conf/osdi/ChangDGHWBCFG06,
      author = {Fay Chang and Jeffrey Dean and Sanjay Ghemawat and Wilson C. Hsieh and Deborah A. Wallach and Michael Burrows and Tushar Chandra and Andrew Fikes and Robert Gruber},
      title = {Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!)},
      booktitle = {OSDI},
      publisher = {USENIX Association},
      year = {2006},
      pages = {205-218},
      url = {http://dl.acm.org/citation.cfm?doid=1365815.1365816}
    }
    
    Cooper, B.F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen, H.-A., Puz, N., Weaver, D. & Yerneni, R. PNUTS: Yahoo!’s hosted data serving platform 2008   techreport URL 
    Abstract: We describe PNUTS, a massively parallel and geographically distributed database system for Yahoo!’s web applications. PNUTS provides data storage organized as hashed or ordered tables, low latency for large numbers of concurrent requests including updates and queries, and novel per-record consistency guarantees. It is a hosted, centrally managed, and geographically distributed service, and utilizes automated load-balancing and failover to reduce operational complexity. The first version of the system is currently serving in production. We describe the motivation for PNUTS and the design and implementation of its table storage and replication layers, and then present experimental results.
    Review: The objective of this paper is to present a parallel database implementation optimized for a large geographic distribution of its components. The system works by replicating all data at record level: each record is replicated in each geographic region. The system maintains access-behavior metadata for each record in order to dynamically establish its master region (the region that most write requests to the record originate from). All write operations are forwarded to the master in order to ensure per-record timeline consistency: each write updates a version number of the record. This allows different levels of consistency for reads. There are three types of reads an application may perform: “read-any”, “read-critical(version)”, and “read-latest”. In that order, the latency of these reads increases when the data in the local region is stale, as another region may need to get involved. In case of master failure, another replica may take over, and table-level settings specify whether writes should be allowed at this point, with the risk of violating timeline consistency.
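
    A toy sketch of the three read levels against a (possibly stale) local replica, under my own simplifications – the master here is just another dictionary, and forwarding, failover and the message broker are ignored:

    class PnutsReplicaSketch:
        """Per-record timeline consistency: every record carries a version number,
        and the local replica may lag behind the record's master region."""
        def __init__(self, local, master):
            self.local, self.master = local, master   # dicts: key -> (version, value)

        def read_any(self, key):
            # Cheapest: return whatever the local replica has, possibly stale.
            return self.local.get(key)

        def read_critical(self, key, required_version):
            # Return a version at least as new as one the caller has already seen.
            rec = self.local.get(key)
            if rec is not None and rec[0] >= required_version:
                return rec
            rec = self.master.get(key)                # may cross regions: higher latency
            if rec is not None:
                self.local[key] = rec
            return rec

        def read_latest(self, key):
            # Most expensive: always consult the record's master copy.
            rec = self.master.get(key)
            if rec is not None:
                self.local[key] = rec
            return rec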

    Writes need to be propagated to all regions, and this is done with a guaranteed-delivery publish/subscribe service called the Yahoo! Message Broker (YMB), which guarantees delivery even if parts of it fail. Records are grouped into so-called ‘tablets’, which are the units servers work with. The paper calls these servers “storage units” and basically allows any hardware storage device to be used. Tablet controllers are responsible for assigning tablets to storage units and performing load control by moving tablets around.

    The service is live and used in several Yahoo! products.

    BibTeX:
    @techreport{Cooper08pnuts,
      author = {Brian F. Cooper and Raghu Ramakrishnan and Utkarsh Srivastava and Adam Silberstein and Philip Bohannon and Hans-Arno Jacobsen and Nick Puz and Daniel Weaver and Ramana Yerneni},
      title = {PNUTS: Yahoo!’s hosted data serving platform},
      year = {2008},
      url = {http://research.yahoo.com/files/pnuts.pdf}
    }
    
    Dean, J. Evolution and future directions of large-scale storage and computation systems at Google 2010 SoCC, pp. 1  inproceedings URL 
    Review: Talks about a split into very large data centers and very small devices. The data centers have a typical network layout with less connectivity at higher levels. The network is the bottleneck, as expected.

    GFS/Colossus running on top of the hardware along with MapReduce/Bigtable etc. All of them have small improvements.

    Spanner – storage and computation system that runs across many datacenters.

    Single master pattern is prevalent and can scale very well.

    Canary requests (test out a request on one machine to see if it crashes it).

    Backup requests to minimize latency (see the sketch after these notes).

    Store ranges rather than hash values.
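
    A minimal sketch of the backup-request idea in Python (my own illustration of the technique, not Google's implementation; the 10 ms hedge delay is an arbitrary placeholder):

    import concurrent.futures as futures

    _pool = futures.ThreadPoolExecutor(max_workers=8)

    def fetch_with_backup(replicas, key, backup_after=0.010):
        """Ask the first replica; if no answer arrives within `backup_after`
        seconds, also ask a second replica and use whichever reply comes first.
        `replicas` is a list of callables key -> value."""
        first = _pool.submit(replicas[0], key)
        done, _ = futures.wait({first}, timeout=backup_after)
        if done:
            return first.result()
        backup = _pool.submit(replicas[1], key)        # primary is slow: hedge the request
        done, _ = futures.wait({first, backup},
                               return_when=futures.FIRST_COMPLETED)
        return next(iter(done)).result()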

    BibTeX:
    @inproceedings{DBLP:conf/cloud/Dean10,
      author = {Jeffrey Dean},
      title = {Evolution and future directions of large-scale storage and computation systems at Google},
      booktitle = {SoCC},
      publisher = {ACM},
      year = {2010},
      pages = {1},
      url = {http://dl.acm.org/citation.cfm?doid=1807128.1807130}
    }
    
    DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Sivasubramanian, S., Vosshall, P. & Vogels, W. Dynamo: amazon's highly available key-value store 2007 SOSP, pp. 205-220  inproceedings URL 
    Abstract: Reliability at massive scale is one of the biggest challenges we face at Amazon.com, one of the largest e-commerce operations in the world; even the slightest outage has significant financial consequences and impacts customer trust. The Amazon.com platform, which provides services for many web sites worldwide, is implemented on top of an infrastructure of tens of thousands of servers and network components located in many datacenters around the world. At this scale, small and large components fail continuously and the way persistent state is managed in the face of these failures drives the reliability and scalability of the software systems. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on ” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
    Review: This paper describes Amazon’s solution to data storage for many of their front-facing services. Dynamo is a distributed, eventually consistent key-value store. Consistency is thus sacrificed, along with isolation (which doesn’t exist in the system at all), in the name of availability. Availability seems to be the main requirement in Amazon’s production environment, as they want to ensure customer satisfaction at the 99.9th percentile – no easy task from an engineering perspective, as Amazon is arguably the largest online retailer.

    The system supports very few operations – essentially only get and put, with some extra versioning constraints. In the face of failures, the system state may branch, and in some cases merging these branches back is pushed to the client applications, which thus become more complex. Probably the most annoying thing about this paper is the writing style – the authors ramble about loosely related topics for most of the paper, and it is often ambiguous whether they are talking about their own system or about general things such as P2P networks, SLAs, or distributed databases.
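
    The get/put interface with versioning can be sketched as follows (a toy, single-node Python stand-in of my own; vector clocks play the role of the paper's opaque version context, and replication, hinted handoff, etc. are ignored): a get may return several divergent siblings, and the application reconciles them and writes the merged result back under a context derived from all of them.

    def vc_merge(a, b):
        """Element-wise max of two vector clocks (dicts: node id -> counter)."""
        return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

    def vc_descends(a, b):
        """True if clock `a` causally dominates (or equals) clock `b`."""
        return all(a.get(n, 0) >= c for n, c in b.items())

    class DynamoSketch:
        def __init__(self, node_id):
            self.node_id, self.store = node_id, {}   # key -> [(clock, value), ...]

        def get(self, key):
            # All current siblings; divergent ones must be reconciled by the caller.
            return self.store.get(key, [])

        def put(self, key, context_clock, value):
            clock = dict(context_clock)
            clock[self.node_id] = clock.get(self.node_id, 0) + 1
            # Keep only siblings not superseded by this write.
            siblings = [(c, v) for c, v in self.store.get(key, [])
                        if not vc_descends(clock, c)]
            self.store[key] = siblings + [(clock, value)]

    # Shopping-cart-style reconciliation: union the siblings, merge their clocks.
    node = DynamoSketch("A")
    node.put("cart", {}, {"book"})
    merged_clock, merged_value = {}, set()
    for c, v in node.get("cart"):
        merged_clock, merged_value = vc_merge(merged_clock, c), merged_value | v
    node.put("cart", merged_clock, merged_value | {"dvd"})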

    BibTeX:
    @inproceedings{DBLP:conf/sosp/DeCandiaHJKLPSVV07,
      author = {Giuseppe DeCandia and Deniz Hastorun and Madan Jampani and Gunavardhan Kakulapati and Avinash Lakshman and Alex Pilchin and Swaminathan Sivasubramanian and Peter Vosshall and Werner Vogels},
      title = {Dynamo: amazon's highly available key-value store},
      booktitle = {SOSP},
      publisher = {ACM},
      year = {2007},
      pages = {205-220},
      url = {http://dl.acm.org/citation.cfm?doid=1294261.1294281}
    }
    
    Geambasu, R., Levy, A.A., Kohno, T., Krishnamurthy, A. & Levy, H.M. Comet: An active distributed key-value store 2010 OSDI, pp. 323-336  inproceedings URL 
    Abstract: Distributed key-value storage systems are widely used in corporations and across the Internet. Our research seeks to greatly expand the application space for key-value storage systems through application-specific customization. We designed and implemented Comet, an extensible, distributed key-value store. Each Comet node stores a collection of active storage objects (ASOs) that consist of a key, a value, and a set of handlers. Comet handlers run as a result of timers or storage operations, such as get or put, allowing an ASO to take dynamic, application-specific actions to customize its behavior. Handlers are written in a simple sandboxed extension language, providing properties of safety and isolation. We implemented a Comet prototype for the Vuze DHT, deployed Comet nodes on Vuze from PlanetLab, and built and evaluated over a dozen Comet applications. Our experience demonstrates that simple, safe, and restricted extensibility can significantly increase the power and range of applications that can run on distributed active storage systems. This approach facilitates the sharing of a single storage system by applications with diverse needs, allowing them to reap the consolidation benefits inherent in today’s massive clouds.
    Review: The paper proposes a distributed key-value storage system that allows application-specific customization in the form of so-called “active storage objects” (ASOs): handlers that are executed on application-defined events such as timers and get/put operations.

    The authors briefly study the security implications of letting multiple applications running side by side execute custom code. The treatment is hardly thorough, however; the implemented solution consists of severely restricting the capabilities of the language the scripts must be written in, imposing communication caps between nodes, and restricting each object's access to data stored under its own key.

    All in all, the system proves very versatile despite the heavy restrictions, and I find the idea of allowing active tweaks to the storage structure on an application-by-application basis very interesting. Such an approach extends the usability of a system far beyond its intended purpose without modifying an established, well-documented, and widely deployed framework.
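
    The handler mechanism can be sketched like this (Python standing in for Comet's sandboxed Lua; the class and handler names are my own, not the paper's API):

    class ActiveStorageObject:
        """A key, a value, and a set of handlers that run on storage events."""
        def __init__(self, key, value, handlers=None):
            self.key, self.value = key, value
            self.handlers = handlers or {}     # e.g. {"on_get": fn, "on_put": fn, "on_timer": fn}

        def fire(self, event, *args):
            fn = self.handlers.get(event)
            return fn(self, *args) if fn else None

    class CometNodeSketch:
        def __init__(self):
            self.objects = {}

        def put(self, key, value, handlers=None):
            aso = self.objects.get(key)
            if aso is None:
                self.objects[key] = ActiveStorageObject(key, value, handlers)
                return
            new = aso.fire("on_put", value)    # the handler may transform the write
            aso.value = value if new is None else new

        def get(self, key):
            aso = self.objects.get(key)
            if aso is None:
                return None
            aso.fire("on_get")                 # e.g. count accesses or refresh a TTL
            return aso.value

    # Example: an object that keeps only its three most recent values.
    node = CometNodeSketch()
    node.put("recent", [], handlers={"on_put": lambda aso, v: (aso.value + [v])[-3:]})
    for x in "abcd":
        node.put("recent", x)
    print(node.get("recent"))                  # ['b', 'c', 'd']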

    BibTeX:
    @inproceedings{DBLP:conf/osdi/GeambasuLKKL10,
      author = {Roxana Geambasu and Amit A. Levy and Tadayoshi Kohno and Arvind Krishnamurthy and Henry M. Levy},
      title = {Comet: An active distributed key-value store},
      booktitle = {OSDI},
      publisher = {USENIX Association},
      year = {2010},
      pages = {323-336},
      url = {http://www.usenix.org/events/osdi10/tech/full_papers/Geambasu.pdf}
    }
    
    Glendenning, L., Beschastnikh, I., Krishnamurthy, A. & Anderson, T.E. Scalable consistency in Scatter 2011 SOSP, pp. 15-28  inproceedings URL 
    Abstract: Distributed storage systems often trade off strong semantics for improved scalability. This paper describes the design, implementation, and evaluation of Scatter, a scalable and consistent distributed key-value storage system. Scatter adopts the highly decentralized and self-organizing structure of scalable peer-to-peer systems, while preserving linearizable consistency even under adverse circumstances. Our prototype implementation demonstrates that even with very short node lifetimes, it is possible to build a scalable and consistent system with practical performance.
    Review: This paper presents a distributed key-value storage system built on top of a DHT. The novelty of the approach lies in building the DHT on top of groups of nodes rather than individual nodes. This design decision increases replication and ensures that very dynamic node movement (nodes joining and leaving the system) has minimal impact on the system as a whole.

    “Scatter implements a simple DHT model in which a circular key-space is partitioned among groups. Each group maintains up-to-date knowledge of the two neighboring groups that immediately precede and follow it in the key-space. These consistent lookup links form a global ring topology, on top of which Scatter layers a best-effort routing policy based on cached hints. If this soft routing state is stale or incomplete, then Scatter relies on the underlying consistent ring topology as ground truth.”
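
    A toy Python rendering of that structure (my own simplification; the real system runs Paxos within each group and uses cached routing hints rather than a full group list): the ring is owned by groups, each covering one arc of the key-space, so a node joining or failing only changes one group's membership and leaves the routing topology untouched.

    class Group:
        """A set of replicas jointly responsible for the arc [low, high) of the ring."""
        def __init__(self, low, high, members):
            self.low, self.high, self.members = low, high, set(members)

    class ScatterRingSketch:
        def __init__(self, groups):
            self.groups = sorted(groups, key=lambda g: g.low)   # contiguous arcs

        def lookup(self, key_hash):
            """Find the group whose arc covers key_hash."""
            for g in self.groups:
                if g.low <= key_hash < g.high:
                    return g
            return self.groups[0]                               # wrap-around arc

        def node_failed(self, node):
            """A failure shrinks one group's membership; the key-space partition
            (and hence lookup routing) does not change at all."""
            for g in self.groups:
                g.members.discard(node)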

    BibTeX:
    @inproceedings{DBLP:conf/sosp/GlendenningBKA11,
      author = {Lisa Glendenning and Ivan Beschastnikh and Arvind Krishnamurthy and Thomas E. Anderson},
      title = {Scalable consistency in Scatter},
      booktitle = {SOSP},
      publisher = {ACM},
      year = {2011},
      pages = {15-28},
      url = {http://dl.acm.org/citation.cfm?doid=2043556.2043559}
    }
    
    Krüger, J., Kim, C., Grund, M., Satish, N., Schwalb, D., Chhugani, J., Plattner, H., Dubey, P. & Zeier, A. Fast Updates on Read-Optimized Databases Using Multi-Core CPUs 2011 PVLDB, Vol. 5(1), pp. 61-72  article URL 
    Abstract: Read-optimized columnar databases use differential updates to handle writes by maintaining a separate write-optimized delta partition which is periodically merged with the read-optimized and compressed main partition. This merge process introduces significant overheads and unacceptable downtimes in update intensive systems, aspiring to combine transactional and analytical workloads into one system. In the first part of the paper, we report data analyses of 12 SAP Business Suite customer systems. In the second half, we present an optimized merge process reducing the merge overhead of current systems by a factor of 30. Our linear-time merge algorithm exploits the underlying high compute and bandwidth resources of modern multi-core CPUs with architecture-aware optimizations and efficient parallelization. This enables compressed in-memory column stores to handle the transactional update rate required by enterprise applications, while keeping properties of read-optimized databases for analytic-style queries.
    Review: This incredibly verbose paper simply describes a common storage structure that is read-optimized and where writes and updates are expensive (because in this case the data is compressed). A second, write-optimized structure maintains the delta of changes against the main read-optimized store. The key to keeping the system efficient is to keep this write-optimized delta small by periodically merging the changes back into the main store. This is done by freezing the delta and creating a new one for incoming writes, while the frozen delta is merged into a copy of the read structure. At the end of the process the two old structures are discarded to reclaim memory and the newly created ones take their place. All rather basic.
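
    In sketch form (plain Python dictionaries standing in for the compressed main partition and the delta; the paper's actual contribution – the parallel, architecture-aware merge – is not reproduced here):

    class ColumnStoreSketch:
        def __init__(self):
            self.main = {}        # read-optimized (stands in for the compressed partition)
            self.delta = {}       # write-optimized differential partition
            self.frozen = {}      # delta currently being merged, still visible to reads

        def write(self, key, value):
            self.delta[key] = value                    # never touches the main store

        def read(self, key):
            for part in (self.delta, self.frozen, self.main):
                if key in part:                        # newer partitions shadow older ones
                    return part[key]
            return None

        def merge(self):
            self.frozen, self.delta = self.delta, {}   # new writes go to a fresh delta
            new_main = dict(self.main)                 # merge into a copy of main...
            new_main.update(self.frozen)
            self.main, self.frozen = new_main, {}      # ...then swap it in, drop the old copies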

    Additional small optimizations to the algorithm are presented, but don’t make a large impact.

    BibTeX:
    @article{DBLP:journals/pvldb/KruegerKGSSCPDZ11,
      author = {Jens Krüger and Changkyu Kim and Martin Grund and Nadathur Satish and David Schwalb and Jatin Chhugani and Hasso Plattner and Pradeep Dubey and Alexander Zeier},
      title = {Fast Updates on Read-Optimized Databases Using Multi-Core CPUs},
      journal = {PVLDB},
      year = {2011},
      volume = {5},
      number = {1},
      pages = {61-72},
      url = {http://www.vldb.org/pvldb/vol5/p061_jenskrueger_vldb2012.pdf}
    }
    
    Lakshman, A. & Malik, P. Cassandra: a decentralized structured storage system 2010 Operating Systems Review, Vol. 44(2), pp. 35-40  article URL 
    Abstract: Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. Cassandra aims to run on top of an infrastructure of hundreds of nodes (possibly spread across different data centers). At this scale, small and large components fail continuously. The way Cassandra manages the persistent state in the face of these failures drives the reliability and scalability of the software systems relying on this service. While in many ways Cassandra resembles a database and shares many design and implementation strategies therewith, Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. Cassandra system was designed to run on cheap commodity hardware and handle high write throughput while not sacrificing read efficiency.
    Review: This paper presents another NoSQL storage system. Cassandra is similar to Bigtable in the way it stores data within rows, each of which has columns grouped into super-columns, which are in turn grouped into column families. Unlike Bigtable, Cassandra also supports a delete operation, putting this system a lot closer to a relational database. Schemas are however still loosely defined and operations are still row-atomic (no general transactions). The data in Cassandra is replicated across multiple servers and is partitioned using an order-preserving deterministic hashing function. Several replication strategies are provided, some of which are rack- or even data-center-aware in the sense that they will survive a rack or data-center failure. Reads and writes are sent to all replicas containing the particular data, but several consistency levels are defined that the user may choose from. These range from read-one or write-one all the way to read-all and write-all, and determine how many replicas need to acknowledge a write before it is considered committed. Another popular strategy is write-quorum, which requires a majority of replicas to finish the write before it is committed. In all of the strategies, however, an attempt is made to write the data to ALL replicas; it is just the number of responses sufficient to trigger a commit that changes. In the case of reads, the number of replicas involved determines what version of the data is returned. Since some replicas may contain stale data, the more replicas are queried, the larger the chance of getting a newer version (as the latest version among the queried replicas is returned). Read-all would thus ensure always reading the latest data. It is obvious, however, that the number of replicas used to read or write determines a trade-off between speed and consistency.
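
    The acknowledgement-counting logic reads roughly like this in toy form (synchronous Python, replicas as plain dictionaries, my own simplification with no failure handling or read repair):

    def required_acks(level, n_replicas):
        """How many replica responses the coordinator waits for before answering."""
        return {"ONE": 1, "QUORUM": n_replicas // 2 + 1, "ALL": n_replicas}[level]

    def write(replicas, key, value, ts, level="QUORUM"):
        """The write is sent to every replica; `level` only decides how many
        acknowledgements are needed before it counts as committed."""
        acks = 0
        for r in replicas:
            r[key] = (ts, value)        # in reality some replicas may lag or fail
            acks += 1
        return acks >= required_acks(level, len(replicas))

    def read(replicas, key, level="ONE"):
        """Consult `required_acks` replicas and return the newest version seen:
        the more replicas consulted, the better the odds of the latest write."""
        answers = [r[key] for r in replicas[:required_acks(level, len(replicas))]
                   if key in r]
        return max(answers, key=lambda a: a[0])[1] if answers else None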

    Again, this is a simple software product employing known techniques that simply chooses tradeoffs that optimize for the common use case. In this case the storage system is used within Facebook and has been designed for their particular workloads. I have pointed out some of the downsides of the tradeoffs in the previous paragraph; however this doesn’t seem to be a weak point in the design.

    BibTeX:
    @article{DBLP:journals/sigops/LakshmanM10,
      author = {Avinash Lakshman and Prashant Malik},
      title = {Cassandra: a decentralized structured storage system},
      journal = {Operating Systems Review},
      year = {2010},
      volume = {44},
      number = {2},
      pages = {35-40},
      url = {http://dl.acm.org/citation.cfm?doid=1773912.1773922}
    }
    
    Lamport, L. The Part-Time Parliament 1998 ACM Trans. Comput. Syst., Vol. 16(2), pp. 133-169  article URL 
    Abstract: Recent archaeological discoveries on the island of Paxos reveal that the parliament functioned despite the peripatetic propensity of its part-time legislators. The legislators maintained consistent copies of the parliamentary record, despite their frequent forays from the chamber and the forgetfulness of their messengers. The Paxon parliament's protocol provides a new way of implementing the state machine approach to the design of distributed systems.
    Review: This paper describes the Paxos protocol, used to achieve consensus and a global ordering of decisions voted on by a group of entities that can at any time leave and later return to the voting process. The algorithm is proven mathematically correct (its safety never depends on timing), and it makes forward progress whenever a majority is available and a single leader can act for long enough.

    The background story and mathematical proofs are given in the form of a fable about an ancient civilization that developed the algorithm for use in its parliament. The parallel to distributed systems is drawn later in the paper, and one interesting aspect is that the parliament is described as having a much higher rate of legislators coming and going. The distributed-system equivalent would only have a participant (a machine) leave upon failure, which is reasonable to assume happens less often, so the protocol is shown to be sturdy under much worse circumstances than it will typically face.
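
    A toy single-decree round in Python (synchronous function calls stand in for messages, and leader election, retries with higher ballot numbers, and the multi-decree log are all left out):

    class Acceptor:
        """One legislator: remembers the highest ballot it promised and the last
        proposal it accepted, even across absences from the chamber."""
        def __init__(self):
            self.promised = -1
            self.accepted = None              # (ballot, value) or None

        def prepare(self, ballot):
            if ballot > self.promised:
                self.promised = ballot
                return True, self.accepted
            return False, None

        def accept(self, ballot, value):
            if ballot >= self.promised:
                self.promised = ballot
                self.accepted = (ballot, value)
                return True
            return False

    def propose(acceptors, ballot, my_value):
        """Phase 1 (prepare) then phase 2 (accept); returns the value chosen by a
        majority, or None if this ballot failed and must be retried higher."""
        majority = len(acceptors) // 2 + 1

        responses = [a.prepare(ballot) for a in acceptors]
        granted = [acc for ok, acc in responses if ok]
        if len(granted) < majority:
            return None
        # If any acceptor already accepted a value, we are bound to propose it.
        prior = [acc for acc in granted if acc is not None]
        value = max(prior, key=lambda bv: bv[0])[1] if prior else my_value

        votes = sum(a.accept(ballot, value) for a in acceptors)
        return value if votes >= majority else None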

    BibTeX:
    @article{DBLP:journals/tocs/Lamport98,
      author = {Leslie Lamport},
      title = {The Part-Time Parliament},
      journal = {ACM Trans. Comput. Syst.},
      year = {1998},
      volume = {16},
      number = {2},
      pages = {133-169},
      url = {http://dl.acm.org/citation.cfm?doid=279227.279229}
    }
    
    Lim, H., Fan, B., Andersen, D.G. & Kaminsky, M. SILT: a memory-efficient, high-performance key-value store 2011 SOSP, pp. 1-13  inproceedings URL 
    Abstract: SILT (Small Index Large Table) is a memory-efficient, high-performance key-value store system based on flash storage that scales to serve billions of key-value items on a single node. It requires only 0.7 bytes of DRAM per entry and retrieves key/value pairs using on average 1.01 flash reads each. SILT combines new algorithmic and systems techniques to balance the use of memory, storage, and computation. Our contributions include: (1) the design of three basic key-value stores each with a different emphasis on memory-efficiency and write-friendliness; (2) synthesis of the basic key-value stores to build a SILT key-value store system; and (3) an analytical model for tuning system parameters carefully to meet the needs of different workloads. SILT requires one to two orders of magnitude less memory to provide comparable throughput to current high-performance key-value systems on a commodity desktop system with flash storage.
    Review: This paper presents a key-value storage system that layers three store types with different memory and write trade-offs: a write-friendly LogStore, an intermediate HashStore, and a highly compact, read-only SortedStore (the paper illustrates the arrangement with a figure, not reproduced here).

    Key-value pairs enter the structure through the LogStore and make their way to the SortedStore as the earlier data structures fill up. There is an impressive number of parameters that can be tuned with this approach, such as the sizes of the three data structures and how often data moves between them. Unfortunately, the large number of tuning knobs is the greatest weakness of this solution – finding the proper balance for a given workload is nearly impossible, and there are no good general-purpose values available either.
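
    The flow of data through the three stores can be sketched as follows (plain Python containers under my own assumptions; the whole point of SILT – the memory-efficient per-stage indexes – is deliberately omitted):

    import bisect

    class SiltSketch:
        def __init__(self, log_limit=4, hash_limit=3):
            self.log = {}            # write-friendly, holds the newest data
            self.hash_stores = []    # immutable intermediate stores, newest first
            self.sorted_store = []   # sorted (key, value) list, bulk of the data
            self.log_limit, self.hash_limit = log_limit, hash_limit

        def put(self, key, value):
            self.log[key] = value
            if len(self.log) >= self.log_limit:          # convert LogStore -> HashStore
                self.hash_stores.insert(0, self.log)
                self.log = {}
            if len(self.hash_stores) >= self.hash_limit: # merge everything into SortedStore
                merged = dict(self.sorted_store)
                for hs in reversed(self.hash_stores):    # newer stores win on conflicts
                    merged.update(hs)
                self.sorted_store = sorted(merged.items())
                self.hash_stores = []

        def get(self, key):
            if key in self.log:                          # check the newest data first
                return self.log[key]
            for hs in self.hash_stores:
                if key in hs:
                    return hs[key]
            i = bisect.bisect_left(self.sorted_store, (key,))
            if i < len(self.sorted_store) and self.sorted_store[i][0] == key:
                return self.sorted_store[i][1]
            return None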

    The system may be used as a building block for a larger scale distributed version of the algorithm by assigning master copies to deal with the “In-memory” aspect of the solution, while multiple replicas can handle the “On-flash” part.

    BibTeX:
    @inproceedings{DBLP:conf/sosp/LimFAK11,
      author = {Hyeontaek Lim and Bin Fan and David G. Andersen and Michael Kaminsky},
      title = {SILT: a memory-efficient, high-performance key-value store},
      booktitle = {SOSP},
      publisher = {ACM},
      year = {2011},
      pages = {1-13},
      url = {http://dl.acm.org/citation.cfm?doid=2043556.2043558}
    }
    
    Mammarella, M., Hovsepian, S. & Kohler, E. Modular data storage with Anvil 2009 SOSP, pp. 147-160  inproceedings URL 
    Abstract: Databases have achieved orders-of-magnitude performance improvements by changing the layout of stored data – for instance, by arranging data in columns or compressing it before storage. These improvements have been implemented in monolithic new engines, however, making it difficult to experiment with feature combinations or extensions. We present Anvil, a modular and extensible toolkit for building database back ends. Anvil’s storage modules, called dTables, have much finer granularity than prior work. For example, some dTables specialize in writing data, while others provide optimized read-only formats. This specialization makes both kinds of dTable simple to write and understand. Unifying dTables implement more comprehensive functionality by layering over other dTables – for instance, building a read/write store from read-only tables and a writable journal, or building a generalpurpose store from optimized special-purpose stores. The dTable design leads to a flexible system powerful enough to implement many database storage layouts. Our prototype implementation of Anvil performs up to 5.5 times faster than an existing B-tree-based database back end on conventional workloads, and can easily be customized for further gains on specific data and workloads.
    Review: This paper provides a modular way of building databases by composing several standardized pieces. There are a lot of different storage structures out there, each with its own strengths and weaknesses. For instance, log-based storage is optimized for writes and really slow on reads; hash tables are fast for both reads and writes, but partial matching or enumeration would be incredibly slow on such a structure; lists and trees provide a good balance of properties but don't excel at any of them. Modern database systems and key-value stores therefore often combine these structures to optimize for certain workloads.

    The proposed solution lets database designers use these simple building blocks along with other carefully designed aggregator blocks to quickly define a customized key-value storage solution that is optimized for a certain workload. For example one can have an aggregator on top of a log and a hash table in order to batch hash table modifications for increased speed and longer storage (SSD) life. Thus all inserts are done in the log and all lookups are done on the log first and on the hash-table second in case an entry isn’t found. In order to preserve fast read speeds the size of the log is kept small – whenever a certain threshold defined in the aggregator piece is reached the log gets compacted into the hashtable. It’s important however to remember the benefit of this system isn’t in the resulting key-value storage system, but in the ease with which a designer can build this system in a few minutes by specifying a few parameters to three available building blocks.
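
    That log-plus-hash-table composition looks roughly like this (a Python sketch under my own naming; Anvil's real dTables are C++ modules with a richer, iterator-based interface):

    class HashDTable:
        """Read-optimized store; rewriting it is expensive, so it is updated in bulk."""
        def __init__(self):
            self.data = {}
        def get(self, key):
            return self.data.get(key)

    class LogDTable:
        """Write-optimized journal; cheap appends, deliberately kept small."""
        def __init__(self):
            self.entries = {}
        def put(self, key, value):
            self.entries[key] = value
        def get(self, key):
            return self.entries.get(key)

    class OverlayDTable:
        """The aggregator: reads check the journal first, then the hash table;
        once the journal passes a threshold it is compacted into the hash table."""
        def __init__(self, threshold=1024):
            self.log, self.hash, self.threshold = LogDTable(), HashDTable(), threshold

        def put(self, key, value):
            self.log.put(key, value)
            if len(self.log.entries) >= self.threshold:
                self.hash.data.update(self.log.entries)   # bulk rewrite of the hash table
                self.log = LogDTable()

        def get(self, key):
            value = self.log.get(key)
            return value if value is not None else self.hash.get(key)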

    It is also worth noting that customized building blocks are fully allowed and there is no bound on their complexity, but the strength of the system, again, lies in quickly prototyping and testing different data storage options. Interestingly, a system like Google's Bigtable could be built quite easily within Anvil.

    BibTeX:
    @inproceedings{DBLP:conf/sosp/MammarellaHK09,
      author = {Mike Mammarella and Shant Hovsepian and Eddie Kohler},
      title = {Modular data storage with Anvil},
      booktitle = {SOSP},
      publisher = {ACM},
      year = {2009},
      pages = {147-160},
      url = {http://dl.acm.org/citation.cfm?doid=1629575.1629590}
    }
    
    Power, R. & Li, J. Piccolo: Building Fast, Distributed Programs with Partitioned Tables 2010 OSDI, pp. 293-306  inproceedings URL 
    Abstract: Piccolo is a new data-centric programming model for writing parallel in-memory applications in data centers. Unlike existing data-flow models, Piccolo allows computation running on different machines to share distributed, mutable state via a key-value table interface. Piccolo enables efficient application implementations. In particular, applications can specify locality policies to exploit the locality of shared state access and Piccolo’s run-time automatically resolves write-write conflicts using userdefined accumulation functions. Using Piccolo, we have implemented applications for several problem domains, including the PageRank algorithm, k-means clustering and a distributed crawler. Experiments using 100 Amazon EC2 instances and a 12 machine cluster show Piccolo to be faster than existing data flow models for many problems, while providing similar fault-tolerance guarantees and a convenient programming interface.
    Review: Piccolo is a system that allows distributed programs to share state via the use of a key-value storage service. This allows for a more natural approach to program writing without having to think about message passing and all its intricacies.

    As the authors point out, the overall application structure for Piccolo is very similar to that of the CUDA platform (NVIDIA GPU computing). Both technologies use control functions that launch kernels to execute the same piece of code over multiple processors/computers. Within a kernel, different execution paths may occur based on the kernel instance's position (the thread number). Kernel synchronization in both platforms is only possible at global barriers, where the initial control function waits for the completion of all previously launched kernels. There is no pairwise synchronization among concurrent kernel instances; however, Piccolo allows them to communicate via the proposed key-value tables.
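
    A miniature Python version of that structure (my own sketch; Piccolo itself partitions tables across machines, applies locality policies, and runs the accumulator in the runtime rather than behind a lock):

    import threading
    from concurrent.futures import ThreadPoolExecutor

    class AccumulatedTable:
        """Shared mutable state; a user-defined accumulator folds together
        concurrent updates to the same key (resolving write-write conflicts)."""
        def __init__(self, accumulate):
            self.data, self.accumulate = {}, accumulate
            self.lock = threading.Lock()          # stands in for the runtime's ordering

        def update(self, key, value):
            with self.lock:
                self.data[key] = self.accumulate(self.data.get(key), value)

        def get(self, key, default=None):
            return self.data.get(key, default)

    def kernel(table, shard, docs, n_shards):
        """Each kernel instance processes its slice of the input and writes into
        the shared table; conflicts are resolved by the accumulator."""
        for doc in docs[shard::n_shards]:
            for word in doc.split():
                table.update(word, 1)

    # Control function: launch the kernel instances, then wait at a global barrier.
    counts = AccumulatedTable(lambda old, new: (old or 0) + new)
    docs = ["the quick brown fox", "the lazy dog", "the fox jumps"]
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(lambda s: kernel(counts, s, docs, 4), range(4)))
    print(counts.get("the"))                      # 3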

    BibTeX:
    @inproceedings{DBLP:conf/osdi/PowerL10,
      author = {Russell Power and Jinyang Li},
      title = {Piccolo: Building Fast, Distributed Programs with Partitioned Tables},
      booktitle = {OSDI},
      publisher = {USENIX Association},
      year = {2010},
      pages = {293-306},
      url = {http://www.usenix.org/events/osdi10/tech/full_papers/Power.pdf}
    }
    
    Venkataraman, S., Tolia, N., Ranganathan, P. & Campbell, R.H. Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory 2011 FAST, pp. 61-75  inproceedings URL 
    Abstract: The predicted shift to non-volatile, byte-addressable memory (e.g., Phase Change Memory and Memristor), the growth of “big data”, and the subsequent emergence of frameworks such as memcached and NoSQL systems require us to rethink the design of data stores. To derive the maximum performance from these new memory technologies, this paper proposes the use of singlelevel data stores. For these systems, where no distinction is made between a volatile and a persistent copy of data, we present Consistent and Durable Data Structures (CDDSs) that, on current hardware, allows programmers to safely exploit the low-latency and non-volatile aspects of new memory technologies. CDDSs use versioning to allow atomic updates without requiring logging. The same versioning scheme also enables rollback for failure recovery. When compared to a memory-backed Berkeley DB B-Tree, our prototype-based results show that a CDDS B-Tree can increase put and get throughput by 74% and 138%. When compared to Cassandra, a two-level data store, Tembo, a CDDS B-Tree enabled distributed Key-Value system, increases throughput by up to 250%–286%.
    Review: This paper advocates the use of new hardware memory types that allow large byte-addressable non-volatile memory structures to be built. The technology for these types of memories is still young and hardly deployed in production yet, but the authors look at changing existing data structures to take advantage of the new properties future memories will likely have.

    The solution proposed isn’t however impressive by any standards – a simple versioning scheme is used to make transactions atomic. Thus the state of the memory is always consistent. In case of corruption upon writing, the memory can simply return to the most recent consistent state.

    BibTeX:
    @inproceedings{DBLP:conf/fast/VenkataramanTRC11,
      author = {Shivaram Venkataraman and Niraj Tolia and Parthasarathy Ranganathan and Roy H. Campbell},
      title = {Consistent and Durable Data Structures for Non-Volatile Byte-Addressable Memory},
      booktitle = {FAST},
      publisher = {USENIX},
      year = {2011},
      pages = {61-75},
      url = {http://www.usenix.org/events/fast11/tech/full_papers/Venkataraman.pdf}
    }
    

    Created by JabRef on 21/02/2012.