Abstract
What is Cassandra?
Distributed storage system for managing large amounts of structured data spread across many commodity servers
no single point of failure
Cassandra does not support a full relational data model
It provides a client with a simple data model that supports dynamic control over data layout and format
Intro
Facebook is the largest social network, serving hundreds of millions of users at peak times using thousands of servers. The reality at that scale is that servers fail often, and the platform must meet strict reliability and scalability requirements to keep growing; hence they developed Cassandra.
Cassandra was designed to fulfil storage needs of the Inbox Search problem
At Facebook this meant the system was required to handle a very high write throughput, billions of writes per day, and also scale with the number of users.
Since users are served from data centers that are geographically distributed, being able to replicate data across data centers was key to keep search latencies down.
Inbox Search was launched in June 2008 for around 100 million users; today it serves over 250 million users, and Cassandra has kept up so far.
Data Model
A table in Cassandra is like a distributed spreadsheet in which each row is indexed by a key; the value associated with a key is a highly structured object
The row key is a string with no size restrictions, although typically 16 to 36 bytes long
Every operation on a row is atomic per replica, which means that for each individual copy of the data, any operation (like a read or write) is guaranteed to be applied entirely or not at all.
Columns are grouped into column-families. Cassandra exposes 2 kinds of column families, Simple and Super.
Super column families can be visualized as a column family within a column family
Apps can specify the sort order of columns within a column family
Any column within a column family is accessed using the convention
column_family : column
and for a super column family it is column_family : super_column : column
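A rough way to picture this is as a nested map (a toy Python sketch, not Cassandra's actual storage layout; keys and names are made up):

```python
# Toy illustration of the data model as nested dicts (not the real storage layout).

# Simple column family: row key -> column -> value
simple_cf = {
    "user:42": {
        "name": "alice",              # addressed as column_family : column
        "location": "SF",
    }
}

# Super column family: row key -> super column -> column -> value
super_cf = {
    "user:42": {
        "dinner": {                   # super column
            "msg:1001": b"",          # addressed as column_family : super_column : column
            "msg:1007": b"",
        }
    }
}

print(list(super_cf["user:42"]["dinner"].keys()))
```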
Typically applications use a dedicated Cassandra cluster and manage them as part of their service
The system supports the notion of multiple tables, but all deployments have only one table in their schema (it's like having a tool that can organize data in many ways, while everyone uses it in the simplest way possible).
API
The API has 3 simple methods
insert(table, key, rowMutation)
get(table, key, columnName)
delete(table, key, columnName)
⚠️ columnName can refer to a specific column within a column family, a column family, a super column family, or a column within a super column.
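To make the shape of the API concrete, here is a tiny in-memory stand-in exposing the same three calls (purely illustrative; this is not Cassandra's real client or wire protocol):

```python
# In-memory stand-in for the three-call API, only to show its shape.
# Names and behavior are assumptions, not Cassandra's actual client.

class InMemoryStore:
    def __init__(self):
        self.tables = {}  # table -> key -> columnName -> value

    def insert(self, table, key, row_mutation):
        row = self.tables.setdefault(table, {}).setdefault(key, {})
        row.update(row_mutation)          # apply the mutation to the row

    def get(self, table, key, column_name):
        return self.tables.get(table, {}).get(key, {}).get(column_name)

    def delete(self, table, key, column_name):
        self.tables.get(table, {}).get(key, {}).pop(column_name, None)

store = InMemoryStore()
store.insert("Inbox", "user:42", {"TermSearch:dinner:msg_1001": b""})
print(store.get("Inbox", "user:42", "TermSearch:dinner:msg_1001"))
store.delete("Inbox", "user:42", "TermSearch:dinner:msg_1001")
```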
Architecture
Characteristics expected from the system
load balancing
membership and failure detection
failure recovery
replica synchronization
overload handling
state transfer
concurrency and job scheduling
request marshalling : Think of it like packing a gift. You take the gift (the request), wrap it up nicely (organize and convert the data), and then send it to the recipient (the other system). The recipient then unwraps the gift (unmarshalling) to understand what you sent.
request routing
System monitoring and alarming
configuration management
Not all of these categories are in scope here; the focus is on the core distributed systems techniques: partitioning, replication, membership and failure handling, and scaling
Typically a read/write request for a key hits any node in the Cassandra cluster; that node then determines the replicas for this particular key.
For writes : the system routes the request to the replicas and waits for a quorum of replicas to acknowledge that the write is completed
For reads : based on the consistency guarantees that the client requires the system either
routes the requests to the closest replica OR
routes the request to all replicas and waits for the quorum of all the responses
Partitioning
For the system to be able to scale incrementally we need the ability to dynamically partition data over the set of nodes.
Cassandra uses consistent hashing to partition data across nodes but uses an order preserving hash function to do so
⚠️ Normally, hash functions scramble data to distribute it evenly. However, an order-preserving hash function keeps the order of the data intact while distributing it. This means that if data items are ordered, the hash function will keep that order when it places the data items in the cluster.
The principal advantage of consistent hashing is that departure or arrival of a node only affects its immediate neighbours and other nodes remain unaffected
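A minimal sketch of the idea, using the key's bytes directly as its ring position so ordering is preserved (a simplification; Cassandra's real partitioner and token space differ):

```python
import bisect

# Minimal consistent-hashing sketch with an order-preserving "hash":
# the key's bytes are used directly as its position, so key order is preserved.

def order_preserving_hash(key: str) -> bytes:
    return key.encode("utf-8")          # identity-style mapping keeps ordering

class Ring:
    def __init__(self, node_tokens):
        # node_tokens: list of (token, node_name); tokens partition the key space
        self.tokens = sorted(node_tokens)

    def coordinator_for(self, key: str) -> str:
        pos = order_preserving_hash(key)
        # first node whose token is >= the key's position, wrapping around the ring
        idx = bisect.bisect_left([t for t, _ in self.tokens], pos)
        return self.tokens[idx % len(self.tokens)][1]

ring = Ring([(b"g", "node-A"), (b"p", "node-B"), (b"z", "node-C")])
print(ring.coordinator_for("alice"))    # -> node-A
print(ring.coordinator_for("harry"))    # -> node-B
```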
Challenges with Basic Consistent Hashing
Non-uniform data and load distribution:
When nodes (computers) are placed randomly on the hash ring, it can lead to some nodes having much more data than others, causing an uneven distribution.
Oblivious to node heterogeneity:
Not all nodes have the same performance capabilities. Some nodes might be faster or have more storage than others, but basic consistent hashing doesn’t consider these differences.
Solutions to These Challenges
Assigning nodes to multiple positions:
Some systems (like Dynamo) address the problem by assigning each node to multiple spots on the ring. This helps distribute data more evenly.
Load-based adjustments:
Another approach is to monitor the load on each node and adjust their positions on the ring accordingly. Nodes that are lightly loaded can move to help nodes that are heavily loaded.
Cassandra chooses the second approach. It monitors the load on each node and moves lightly loaded nodes to help balance the load on heavily loaded nodes. This approach is simpler to design and implement and ensures more predictable load balancing.
Replication
Cassandra uses replication for high availability and durability
Each data item is replicated at N hosts
N is the replication factor configured per-instance
⚠️ The replication factor can be set individually for each instance of the database, allowing flexibility based on the needs of different datasets or workloads.
Each key k is assigned to a coordinator node. The coordinator is in charge of the replication of the data items that fall within its range
In addition to locally storing each key within its range, the coordinator replicates these keys at the N-1 nodes in the ring
Cassandra provides various replication policies like
Rack Unaware : replicas are chosen by picking N-1 successors of the coordinator on the ring
Rack Aware
Datacenter Aware
Replicas are chosen based on the replication policies chosen by the application
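Continuing the ring sketch from the Partitioning section, a Rack Unaware preference list (the coordinator plus its N-1 distinct successors) might be computed like this (simplified):

```python
# Sketch of the "Rack Unaware" policy: the coordinator plus its N-1 successors
# on the ring form the preference list for a key. Reuses the Ring sketch above.

def preference_list(ring, key, replication_factor):
    pos = order_preserving_hash(key)
    tokens = [t for t, _ in ring.tokens]
    idx = bisect.bisect_left(tokens, pos)
    nodes = []
    # walk clockwise from the coordinator, collecting N distinct nodes
    for i in range(len(ring.tokens)):
        node = ring.tokens[(idx + i) % len(ring.tokens)][1]
        if node not in nodes:
            nodes.append(node)
        if len(nodes) == replication_factor:
            break
    return nodes

print(preference_list(ring, "harry", replication_factor=2))  # -> ['node-B', 'node-C']
```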
The Cassandra system elects a leader amongst its nodes using a system called Zookeeper
All nodes, on joining the cluster, contact the leader, who tells them what ranges they are replicas for.
The leader ensures that no node is responsible for more than N−1 ranges, where N is a configuration parameter.
The metadata about the ranges a node is responsible for is cached locally at each node and stored in a fault-tolerant manner inside Zookeeper - this way a node that crashes and comes back up knows what ranges it was responsible for
Borrowing from Dynamo's terminology, the nodes responsible for a given data range are called the "preference list" for that range.
Due to replication in multiple datacenters around the world Cassandra is able to handle entire datacenter failures without any outage
Membership
In Cassandra, cluster membership is based on Scuttlebutt
⚠️ Scuttlebutt is a gossip protocol. It ensures state synchronization by having nodes periodically exchange only the differences (deltas) between their states. Each piece of information is versioned, allowing nodes to determine and share only the newest or missing data. This process enables all nodes to eventually have a consistent view of the data, making Scuttlebutt ideal for data replication and achieving eventual consistency in a network.
Scuttlebutt has very efficient CPU utilisation and makes very efficient use of the gossip channel (the communication path used to spread information across all nodes in a network efficiently and reliably)
In the Cassandra system, gossip is used not only for membership but also to communicate other system-related control state
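A toy sketch of the delta-exchange idea behind Scuttlebutt-style gossip (heavily simplified compared to the real protocol):

```python
# Toy sketch of Scuttlebutt-style anti-entropy gossip: each node keeps versioned
# entries per peer and only ships the entries the other side has not seen yet.

class GossipNode:
    def __init__(self, name):
        self.name = name
        # state: peer -> (version, payload); version grows monotonically per peer
        self.state = {name: (1, {"status": "up"})}

    def digest(self):
        # summary of "highest version I know per peer"
        return {peer: version for peer, (version, _) in self.state.items()}

    def deltas_for(self, remote_digest):
        # send only entries that are newer than what the remote already has
        return {
            peer: entry
            for peer, entry in self.state.items()
            if entry[0] > remote_digest.get(peer, 0)
        }

    def apply(self, deltas):
        for peer, (version, payload) in deltas.items():
            if version > self.state.get(peer, (0, None))[0]:
                self.state[peer] = (version, payload)

a, b = GossipNode("A"), GossipNode("B")
b.apply(a.deltas_for(b.digest()))        # one gossip round: B learns about A
a.apply(b.deltas_for(a.digest()))        # and A learns about B
print(sorted(a.state), sorted(b.state))  # both now know A and B
```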
Failure Detection
Cassandra uses a modified version of Φ Accrual Failure Detector
The basic idea of accrual failure detection is that the failure detection module doesn't emit a boolean up/down verdict but instead returns a value which represents a suspicion level for each of the monitored nodes
This value is defined as Φ
The Φ Accrual Failure Detector improves failure detection by providing a probabilistic measure of whether a node has failed, using a threshold Φ value to manage the likelihood of mistakes. It tracks the time intervals between gossip messages to make informed decisions. By using an exponential distribution to model these intervals, it adapts well to the unpredictable nature of gossip communication, offering both accuracy and speed in detecting failures in a distributed system
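Assuming the exponential model described above, a simplified Φ calculation could look like this (not the exact Cassandra implementation):

```python
import math

# Simplified Φ (phi) accrual sketch: model the gaps between gossip heartbeats
# as exponentially distributed with the observed mean, and report
# Φ = -log10(P(a heartbeat arrives even later than the time elapsed so far)).

class PhiAccrualDetector:
    def __init__(self):
        self.intervals = []            # observed gaps between heartbeats (seconds)
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # P(next heartbeat later than `elapsed`) under an exponential model
        p_later = math.exp(-elapsed / mean)
        return -math.log10(p_later)

d = PhiAccrualDetector()
for t in [0, 1, 2, 3]:               # heartbeats arriving roughly every second
    d.heartbeat(t)
print(round(d.phi(3.5), 2))           # small Φ: the node looks alive
print(round(d.phi(13.0), 2))          # large Φ: growing suspicion the node is down
```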
Bootstrapping
When a node starts for the first time it chooses a random token for its position in the ring
For fault tolerance this mapping is kept locally on disk and also in Zookeeper
The token information is then gossiped around the cluster, and that is how we get to know about all the nodes and their respective position on the ring
This enables the request to be routed to the correct node in the cluster.
Nodes joining the cluster read a configuration file with a list of initial contact points called "seeds."
Seeds can also be obtained from a configuration service like Zookeeper.
Now in the case of failures
Node outages in Facebook's environment are often temporary but can be prolonged.
Outages can be due to disk failures, bad CPUs, etc.
A node outage is usually not permanent, so it should not trigger re-balancing or repair of replicas.
Manual errors can lead to accidental startup of new Cassandra nodes.
Each message includes the cluster name to prevent nodes from joining the wrong cluster due to configuration errors.
Admin control
Nodes are added or removed from the cluster using explicit commands.
Administrators use a command-line tool or browser to connect to a Cassandra node and issue join or leave commands.
Scaling the cluster
On adding nodes, the new node is assigned a token to help balance the load by taking over part of the data from a heavily loaded node.
The new node splits a range that another node was previously managing.
The bootstrap algorithm is initiated by an operator using a command line utility or the Cassandra web dashboard.
The existing node streams data to the new node using efficient kernel-to-kernel copy techniques.
Current operational experience shows a transfer rate of 40 MB/sec from a single node.
Local Persistence
Cassandra relies on the local file system for storing data and data is stored in a format optimized for efficient retrieval.
Every write operation first logs the data to a commit log for durability and recoverability.
The commit log is stored on a dedicated disk to maximize throughput since all writes are sequential.
After logging to the commit log, the data is updated in an in-memory structure.
When the in-memory structure reaches a certain size, it is dumped to disk.
Data dumps are stored on commodity disks, with each write being sequential and generating an index for efficient row key lookup.
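Put together, a simplified version of the write path might look like this (the flush threshold and file format below are assumptions for illustration):

```python
import json

# Simplified write-path sketch: commit log first (durability), then the
# in-memory structure, then a sequential dump to disk once a threshold is hit.

class SimpleWritePath:
    def __init__(self, commit_log_path, flush_threshold=4):
        self.commit_log = open(commit_log_path, "a")
        self.memtable = {}                 # in-memory structure: key -> columns
        self.flush_threshold = flush_threshold
        self.flushed_files = []

    def write(self, key, columns):
        # 1) append to the commit log for durability and recoverability
        self.commit_log.write(json.dumps({"key": key, "columns": columns}) + "\n")
        self.commit_log.flush()
        # 2) apply the update to the in-memory structure
        self.memtable.setdefault(key, {}).update(columns)
        # 3) dump to disk when the in-memory structure grows large enough
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        path = f"data-{len(self.flushed_files)}.json"
        with open(path, "w") as f:         # sequential write of the whole memtable
            json.dump(self.memtable, f, sort_keys=True)
        self.flushed_files.append(path)
        self.memtable = {}
```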
Reads first check the in-memory data structure.
If the data isn't found in memory, the system checks the disk files, starting from the newest.
Each data file has a bloom filter to summarize its keys, helping to quickly determine if a key exists in a file and avoid unnecessary lookups.
Data dumps to disk are sequential, and each file includes an index for efficient row key lookups.
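Put together, the read path plus a toy bloom filter might look like this (simplified; not a production bloom filter):

```python
import hashlib

# Toy bloom filter + read-path sketch: memory first, then data files in
# reverse time order, consulting each file's bloom filter to skip files
# that definitely do not contain the key.

class BloomFilter:
    def __init__(self, size=1024, hashes=3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= (1 << pos)

    def might_contain(self, key):
        return all(self.bits & (1 << pos) for pos in self._positions(key))

def read(key, memtable, data_files):
    # data_files: newest first, each as (bloom_filter, {key: columns})
    if key in memtable:                      # 1) in-memory structure
        return memtable[key]
    for bloom, contents in data_files:       # 2) disk files, newest first
        if bloom.might_contain(key):         #    skip files via bloom filter
            if key in contents:
                return contents[key]
    return None
```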
Multiple data files are merged over time in a background process similar to Bigtable's compaction.
Each key in a column family can have many columns, requiring special indexing.
Column indices are maintained to allow direct jumps to the correct disk chunk for retrieval.
Indices are generated at every 256KB chunk boundary during serialization, a configuration found effective in production.
Implementation Details
The Cassandra process contains 3 main abstractions built using Java
Partitioning module
Cluster membership and failure detection module
Storage engine module
Each of the above modules relies on an event-driven substrate (a layer) where the message processing pipeline and the task processing pipeline are split into multiple stages.
The cluster membership and failure detection module is built on top of a network layer that uses non-blocking I/O
All system control messages rely on UDP based messaging while the application related messages for replication and request routing rely on TCP
UDP is used for system control messages because it is fast, efficient, and the messages are less critical if occasionally lost.
TCP is used for application-related messages because it guarantees delivery, order, and data integrity, which are crucial for tasks like data replication and request routing.
The request routing modules are implemented using a state machine. When a request hits any node, the state machine moves through the following states
Identify the nodes that own the data for the key
Route the request to the nodes and wait for the responses to arrive
If the responses do not arrive within a configured timeout, fail the request and return an error to the client
Figure out the latest response based on timestamp
Schedule a repair of the data at any replica that did not have the latest data
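A condensed sketch of that state machine for reads (the replica read and repair callables are placeholders; quorum and timeout handling are simplified):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from concurrent.futures import TimeoutError as FuturesTimeout

# Condensed sketch of the read routing state machine: fan the request out to
# the replicas, wait for a quorum within a timeout, resolve the latest value
# by timestamp, and schedule read repair for stale replicas.

def route_read(key, replicas, read_from_replica, repair_replica,
               quorum, timeout_s=1.0):
    responses = {}                                        # replica -> (timestamp, value)
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = {pool.submit(read_from_replica, r, key): r for r in replicas}
    try:
        for fut in as_completed(futures, timeout=timeout_s):
            responses[futures[fut]] = fut.result()
            if len(responses) >= quorum:                  # enough replicas answered
                break
    except FuturesTimeout:
        pass                                              # slow replicas missed the deadline
    finally:
        pool.shutdown(wait=False)

    if len(responses) < quorum:
        raise TimeoutError("quorum not reached in time; fail the request")

    # figure out the latest response based on timestamp
    latest_ts, latest_value = max(responses.values(), key=lambda tv: tv[0])

    # schedule a repair at any replica that did not have the latest data
    for replica, (ts, _) in responses.items():
        if ts < latest_ts:
            repair_replica(replica, key, latest_ts, latest_value)
    return latest_value
```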
The system can be configured to perform either synchronous or asynchronous writes. For applications that require high write throughput, Cassandra relies on asynchronous replication; in such workloads the writes that hit the system far exceed the reads.
In the synchronous case we wait for a quorum of responses before we return a result to the client
Cassandra uses a rolling commit log system. A new commit log is created (or "rolled out") when the current one exceeds a configurable size. A size of 128MB for each commit log has been found effective in production environments. Each commit log includes a header, which is a bit vector. The size of this bit vector is fixed and is designed to be larger than the number of column families the system will handle.
For each column family, Cassandra maintains an in-memory data structure and a corresponding data file on disk. When the in-memory data structure for a column family is dumped to disk, its corresponding bit in the commit log's bit vector is set. This indicates that the column family's data has been successfully persisted to disk. The bit vectors are kept per commit log and are also maintained in memory for quick access.
When a commit log is rolled, its bit vector, along with the bit vectors of all previous commit logs, is checked. If all data is confirmed to be successfully persisted to disk, the older commit logs are deleted. Writes to the commit log can be performed in either normal mode or fast sync mode. In fast sync mode, writes are buffered, which increases speed but risks data loss if the machine crashes. Buffered writes also apply when dumping the in-memory data structure to disk.
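As a rough sketch of this bookkeeping (structures and sizes are illustrative only):

```python
# Rough sketch of the rolling commit log bookkeeping. Each segment tracks which
# column families still have unflushed writes in it (the role of the header bit
# vector); once every column family it touched has been flushed, the segment
# can be deleted.

ROLL_SIZE = 128 * 1024 * 1024            # roll a new log segment after ~128MB

class CommitLogSegment:
    def __init__(self):
        self.dirty = set()               # column families with unflushed writes here
        self.bytes_written = 0

    def append(self, column_family, record_size):
        self.dirty.add(column_family)
        self.bytes_written += record_size

    def mark_flushed(self, column_family):
        # corresponds to setting the column family's bit in the header bit vector
        self.dirty.discard(column_family)

    def fully_flushed(self):
        return not self.dirty

    def should_roll(self):
        return self.bytes_written >= ROLL_SIZE

def purge_old_segments(segments):
    # segments whose data is fully persisted can be deleted
    return [seg for seg in segments if not seg.fully_flushed()]
```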
Traditional databases struggle with high write throughput due to their design. Cassandra converts all disk writes into sequential writes, which maximizes disk throughput. Since the files dumped to disk are never mutated, no locks are needed during read operations. As a result, Cassandra's server instance operates with practically no locks for read/write operations. This lockless design avoids the concurrency issues commonly encountered in B-Tree based database implementations.
The Cassandra system indexes all data based on the primary key. Each data file is broken down into blocks, where a block contains at most 128 keys and is demarcated by a block index.
The block index
Captures the relative offset of a key within the block and size of its data
is maintained in memory for faster access.
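A simplified sketch of building such a block index (in memory only; formats are assumed):

```python
# Simplified block-index sketch: a data file is split into blocks of at most
# 128 keys, and for each key the index records its relative offset within the
# block and the size of its data, so a read can seek directly to it.

KEYS_PER_BLOCK = 128

def build_block_index(rows):
    # rows: list of (key, serialized_bytes), in the order they are written out
    blocks, offset_in_block, index = [], 0, {}
    for i, (key, data) in enumerate(rows):
        if i % KEYS_PER_BLOCK == 0:        # start a new block
            blocks.append(len(blocks))
            offset_in_block = 0
        index[key] = (blocks[-1], offset_in_block, len(data))  # (block, offset, size)
        offset_in_block += len(data)
    return index

index = build_block_index([("a", b"xxx"), ("b", b"yyyy"), ("c", b"zz")])
print(index["b"])   # -> (0, 3, 4): block 0, offset 3, 4 bytes
```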
On a read, if the data is found in the in-memory data structure it is returned; otherwise we perform I/O against all the data files on disk in reverse time order (newest first)
Because the number of data files is bound to increase over time, Cassandra performs a compaction process (merging multiple files). Periodically a major compaction process is run to compact all related data files into one big file.
Practical Examples
Facebook Inbox Search
The 2 kinds of search features enabled for inbox were
term search
interactions - given a person's name, return all messages the user has ever sent to or received from that person
The data schema consists of 2 column families
For term search the user id is the key and the words that make up the message become the super columns. The individual message identifiers of the messages containing that word become the columns within the super column
For interactions the user id is the key and the recipients' ids are the super columns. For each of these super columns the individual message ids are the columns
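A rough illustration of the two schemas as nested maps (all ids below are invented):

```python
# Rough illustration of the Inbox Search schemas as nested maps.
# User ids, words, and message ids are invented for illustration.

# Term search: user id -> word (super column) -> message id (column)
term_search = {
    "user:42": {
        "dinner": {"msg:1001": b"", "msg:1007": b""},
        "friday": {"msg:1007": b""},
    }
}

# Interactions: user id -> other user's id (super column) -> message id (column)
interactions = {
    "user:42": {
        "user:99": {"msg:1001": b"", "msg:1003": b""},
    }
}

# "All messages exchanged with user:99" is then a single super-column lookup
print(list(interactions["user:42"]["user:99"].keys()))
```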
Cassandra uses smart caching to speed up searches. When a user clicks the search bar, a background message is sent to the Cassandra cluster to load that user’s index into memory. This makes the actual search faster since the data is already in memory.
Key takeaways from the paper
The paper covered a lot of new concepts; overall, some interesting ones to revisit are
Drawbacks of basic consistent hashing
Column and super column data model
Order preserving hash functions
Reasons for using UDP in system control messages and TCP for application-related messages
Phi accrual failure detection
Resources
Paper : https://www.cs.cornell.edu/projects/ladis2009/papers/lakshman-ladis2009.pdf
How does order preserving hash function work in cassandra : https://www.quora.com/How-does-Cassandras-order-preserving-hash-work
Phi Accrual failure detection : https://medium.com/@arpitbhayani/phi-φ-accrual-failure-detection-79c21ce53a7a