Hey folks, long time! The last couple of months were quite hectic (in terms of both work and travel), but now it’s time to get back! The past year was 🔥: we covered a whopping 11 whitepapers, check them out in the archive. It’s time we review some of the common concepts and algorithms we’ve covered, along with a few fundamentals we’ve picked up along the way. Let’s start!
1. The CAP theorem
Imagine you have a group chat with friends spread across different cities. The CAP theorem says you can only guarantee 2 out of 3 things when the internet connection gets wonky:
The 3 Properties:
Consistency (C): Everyone sees the same message at the same time
Like: When you send "Let's meet at 7pm", everyone instantly sees the exact same message
Availability (A): The chat always works, even if some friends are offline
Like: You can always send and receive messages, even if some people's phones are dead
Partition Tolerance (P): The system works even when connections break
Like: The chat still functions even if the internet is spotty in some cities
The Trade-off:
You can only pick 2! In real distributed systems, network problems WILL happen, so you must choose:
CP Systems (Consistency + Partition Tolerance):
"Better safe than sorry"
When network breaks → system stops working to avoid showing wrong data
Example: Banking systems (better to be down than show wrong balance)
AP Systems (Availability + Partition Tolerance):
"Keep going, we'll fix it later"
When network breaks → system keeps working but might show old data
Example: Social media feeds (better to see old posts than no posts)
Cassandra (Covered in July 2024 article):
Choice: AP system (Availability + Partition Tolerance)
Real-world: Like a distributed photo album where you can always add photos, but it might take time for everyone to see all the latest ones
Trade-off: Stays online during network issues but might show slightly outdated data
DynamoDB (Covered in June 2024 article):
Choice: AP system by default (but offers strongly consistent reads when you ask for them; see the sketch below)
Real-world: Like an online shopping cart that always works, but your "recently viewed" items might be a bit behind
Trade-off: Prioritizes being always available over having perfectly synchronized data
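To make that concrete, here's a minimal sketch with boto3 (the table and key names are invented for illustration; the only real knob being shown is the ConsistentRead flag on get_item):

import boto3

# Hypothetical table and key, just to show the consistency knob
table = boto3.resource("dynamodb").Table("shopping_carts")

# Default read: eventually consistent (the AP-ish option), may lag the latest write
cart = table.get_item(Key={"user_id": "u123"})

# Opt-in: strongly consistent read, reflects the latest committed write
cart_now = table.get_item(Key={"user_id": "u123"}, ConsistentRead=True)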
Simple Analogy:
Think of CAP theorem like organizing a surprise party:
Consistency: Everyone knows the exact same plan
Availability: You can always reach someone to help
Partition Tolerance: The party planning works even if some people can't talk to each other
You can't have all three! If phones are down (partition), you either:
Stop planning until everyone can talk (Consistency + Partition)
Keep planning but risk different people having different plans (Availability + Partition)
2. Consistent Hashing
Imagine you're organizing a massive library with millions of books, but instead of one giant building, you have multiple smaller libraries across the city. Consistent hashing is like having a smart system that decides which book goes to which library - and it's really good at handling when libraries open, close, or get renovated!
The Problem it Solves:
Traditional Hashing Problem:
Think of a pizza delivery system with 3 drivers:
Customer 1 → Driver 1
Customer 2 → Driver 2
Customer 3 → Driver 3
Customer 4 → Driver 1 (back to start)
What happens when Driver 2 calls in sick?
Suddenly almost every customer needs to be reassigned to a different driver
Chaos! Everyone's delivery gets messed up
Consistent Hashing Solution:
Instead of a straight line, imagine drivers arranged in a circle (like a clock). When Driver 2 leaves:
Only customers near Driver 2 need reassignment
Everyone else keeps their original driver
Minimal disruption!
Cassandra (July 2024 article):
Real-world scenario: You're running a global messaging app like WhatsApp
Without Consistent Hashing:
1 billion users split across 1000 servers
Server crashes → ALL users need to find new servers
Result: App goes down for hours while reshuffling
With Consistent Hashing (Cassandra's approach):
Users arranged in a virtual "ring"
Each server handles a slice of the ring
Server crashes → only users in that slice move to neighboring servers
Result: 99.9% of users unaffected, app stays online
The Magic Ring Concept:
Picture a giant clock face:
12 o'clock: Server A handles users with names A-D
3 o'clock: Server B handles users with names E-H
6 o'clock: Server C handles users with names I-M
9 o'clock: Server D handles users with names N-Z
When Server B breaks, its slice gets redistributed (with one position per server it would all land on a single neighbor; with Cassandra-style virtual nodes it spreads out, roughly like this):
Server A takes E-F users
Server C takes G-H users
Only 25% of data moves, not 100%!
Key Benefits:
🎯 Minimal Data Movement:
Traditional: Adding/removing 1 server = reshuffling ALL data
Consistent Hashing: Adding/removing 1 server = moving only neighboring data
Real impact: adding a server only moves the users in the slice it takes over (for example, ~100,000 out of 1 million)
🔄 Smooth Scaling:
Like adding lanes to a highway - traffic flows better without stopping everything
New servers automatically take load from busiest neighbors
No "big bang" migrations that break everything
⚡ Fault Tolerance:
Server dies? Its neighbors automatically pick up the slack
Like having backup pizza drivers who know exactly which routes to cover
System keeps running while you fix the broken server
Simple Step-by-Step Example:
Let's say you're building Instagram for pets 🐕🐱:
Step 1: The Ring Setup
Hash Ring: 0 → 100 → 200 → 300 → 0 (circular)
Server A: handles 0-99
Server B: handles 100-199
Server C: handles 200-299
Step 2: Where Does Data Go?
Photo of "Fluffy the Cat" → Hash = 150 → Goes to Server B
Photo of "Rex the Dog" → Hash = 75 → Goes to Server A
Photo of "Goldie the Fish" → Hash = 250 → Goes to Server C
Step 3: Server B Crashes! 💥
New Ring: 0 → 200 → 300 → 0
Server A: now handles 0-199 (takes B's load)
Server C: still handles 200-299
Result: Only Fluffy's photos need to move (from B to A). Rex and Goldie's photos stay put!
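If you'd like to see the ring in code, here's a minimal Python sketch of the same idea (the 300-slot hash space and md5-based placement are illustrative choices; real Cassandra adds virtual nodes and replication on top):

import bisect
import hashlib

class HashRing:
    def __init__(self, servers):
        # Place each server at a position on the ring by hashing its name
        self.ring = sorted((self._hash(s), s) for s in servers)

    def _hash(self, key):
        # Map any string onto the 0-299 ring used in the example above
        return int(hashlib.md5(key.encode()).hexdigest(), 16) % 300

    def get_server(self, key):
        # Walk clockwise to the first server at or after the key's position
        positions = [pos for pos, _ in self.ring]
        idx = bisect.bisect_left(positions, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

    def remove_server(self, server):
        # Only keys that mapped to this server move to its clockwise neighbor
        self.ring = [(pos, s) for pos, s in self.ring if s != server]

ring = HashRing(["Server A", "Server B", "Server C"])
owner = ring.get_server("Fluffy the Cat")   # whichever server owns Fluffy's slice
ring.remove_server(owner)                   # that server "crashes"
print(ring.get_server("Fluffy the Cat"))    # Fluffy moves to the next server on the ring

The exact ranges won't match the worked example above (positions come from hashing the server names), but the behavior is the same: removing a server only remaps the keys it owned.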
Why This Matters:
Building Scalable Systems:
Cassandra uses this to handle billions of records across thousands of servers
DynamoDB uses similar concepts for automatic scaling
Your app can grow from 100 users to 100 million without major rewrites
Cost Efficiency:
No expensive "migration weekends" where everything breaks
Add servers gradually as you grow, not all at once
Reduce downtime = happier users = more revenue
Practical Implementation:
When building your own distributed system:
Start small: Use consistent hashing even with 2-3 servers
Plan for growth: Your system automatically handles scaling
Sleep better: Servers can fail without taking down your entire app
TLDR:
Consistent hashing is like having a smart GPS for your data:
It knows the best route to store information
When roads (servers) close, it quickly finds alternate routes
It minimizes traffic jams (data movement) during changes
Everyone gets to their destination (data stays accessible)
Bottom line: It's the difference between a system that breaks when it grows vs. one that gracefully scales with your success
3. Exponential Backoff
Imagine you're trying to call your best friend but their phone is busy. What do you do?
Bad approach: Call every second until they pick up
Ring... busy
Ring... busy
Ring... busy
Result: You're annoying, their phone crashes, nobody's happy! 😤
Smart approach (Exponential Backoff): Wait longer each time
Try 1: Wait 1 second → Ring... busy
Try 2: Wait 2 seconds → Ring... busy
Try 3: Wait 4 seconds → Ring... busy
Try 4: Wait 8 seconds → Ring... "Hello!" ✅
The magic: Each wait time doubles, giving the system time to recover!
The Problem it Solves:
Real-World Scenario: Black Friday Shopping 🛒
You're trying to buy the latest iPhone on Black Friday:
Without Exponential Backoff:
1 million people hit "Buy Now" at exactly 12:00 AM
Server gets overwhelmed → crashes
Everyone keeps clicking "Buy Now" frantically
Server tries to restart → gets hit with 1 million requests again
Infinite crash loop! Nobody gets their iPhone 💥
With Exponential Backoff:
Server crashes initially (expected)
Your app waits 1 second, tries again
Still busy? Wait 2 seconds
Still busy? Wait 4 seconds
By the time you retry, the initial rush has calmed down
Success! You get your iPhone 🎉
Cassandra (From July 2024 article):
Scenario: Writing user data during a viral TikTok moment
User posts viral video → 1 million likes in 1 minute
Cassandra nodes get overwhelmed trying to update like counts
Exponential Backoff in Action:
Try 1: Write fails → Wait 100ms
Try 2: Write fails → Wait 200ms
Try 3: Write fails → Wait 400ms
Try 4: Write succeeds! → Like count updated ✅
Result: System recovers gracefully instead of crashing
DynamoDB (From June 2024 article):
Scenario: E-commerce flash sale (like Amazon Prime Day)
10,000 people try to buy the same discounted laptop
DynamoDB gets hit with massive concurrent writes
Without Backoff: Database locks up, nobody can buy anything
With Backoff: Requests spread out over time, everyone gets a fair chance
Simple Pseudocode:
Basic Exponential Backoff:
import time

def exponential_backoff_retry(operation, max_retries=5):
    base_delay = 1  # Start with 1 second
    for attempt in range(max_retries):
        try:
            return operation()  # Try the operation; return on success
        except Exception as error:
            if attempt == max_retries - 1:
                raise error  # Final attempt failed, give up
            # Calculate wait time: 1, 2, 4, 8, 16 seconds...
            wait_time = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed. Waiting {wait_time} seconds...")
            time.sleep(wait_time)
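You'd wrap whatever flaky call you're retrying, e.g. result = exponential_backoff_retry(lambda: write_like_count(post_id)), where write_like_count is just a stand-in for your own operation.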
The Magic Formula:
Wait Time = Base Delay × (2 ^ Attempt Number)
Attempt 1: 1 × (2^0) = 1 second
Attempt 2: 1 × (2^1) = 2 seconds
Attempt 3: 1 × (2^2) = 4 seconds
Attempt 4: 1 × (2^3) = 8 seconds
Attempt 5: 1 × (2^4) = 16 seconds
Why It Works:
Gives systems time to recover from overload
Reduces thundering herd effect (everyone hitting at once)
Balances persistence with politeness
Automatically adapts to system load
Real-World Benefits:
For Users:
Apps don't freeze during high traffic
Better success rates for important operations
Smoother experience during peak times
For Systems:
Prevents cascade failures
Allows graceful degradation
Reduces infrastructure costs
The "Aha!" Moment:
Exponential backoff is like being a polite person in a crowded elevator:
First try: "Excuse me, can I get through?"
Second try: Wait a bit, then "Excuse me again?"
Third try: Wait longer, then try once more
Eventually: Either you get through, or you take the stairs
It's about being persistent without being pushy!
4. Distributed Locks
Imagine you’re at a group potluck dinner 🍲. Everyone’s bringing food, but only one person can use the microwave at a time. A distributed lock is how we all politely take turns — even when we're in different rooms or cities — to avoid reheating chaos 🔥.
The Problem It Solves:
Race Conditions!
In a distributed system, multiple servers might try to update the same resource at the same time:
Two servers try to update your wallet balance
Three users book the same last movie seat
Multiple workers try to process the same order
Without a lock, you get:
Double bookings
Duplicate charges
World-ending bugs (okay, maybe not world-ending... but you get it)
How Distributed Locks Help:
Think of it like putting your name on a whiteboard outside the microwave room:
If your name is there → You're using it
If it's not → Wait your turn
In tech:
One machine “acquires the lock” on a resource
Others wait (or fail gracefully) until the lock is released
Real-World Analogy:
Imagine you're in a remote team using a shared Google Doc:
Only one person edits at a time to avoid overwriting each other
Others wait or work in “view-only” mode
When done → you release the lock and someone else can edit
DynamoDB + Distributed Locks (June 2024 article):
Scenario: Flash sale checkout
Problem: Two people trying to buy the last iPhone
Solution: Use DynamoDB’s conditional writes (sort of like a lock!)
First one gets it ✅
Second one sees "Out of Stock" ❌
No overselling. No angry customers. No chaos.
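A rough sketch of what that conditional write could look like with boto3 (the table, key, and attribute names here are invented; the point is just that ConditionExpression makes the check-and-write atomic):

import boto3
from botocore.exceptions import ClientError

table = boto3.resource("dynamodb").Table("inventory")  # hypothetical table

try:
    # Decrement stock only if at least one unit is left; DynamoDB rejects the
    # write atomically otherwise, so two buyers can't both grab the last phone
    table.update_item(
        Key={"sku": "iphone-15"},
        UpdateExpression="SET stock = stock - :one",
        ConditionExpression="stock >= :one",
        ExpressionAttributeValues={":one": 1},
    )
    print("Order confirmed ✅")
except ClientError as err:
    if err.response["Error"]["Code"] == "ConditionalCheckFailedException":
        print("Out of Stock ❌")
    else:
        raise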
Popular Implementations:
🔐 Redis-based Lock (Redlock):
Like using a sticky note on the fridge across 5 kitchens
You write the note on 3 out of 5 fridges → You’re the winner
Super fast, used in high-speed systems
🧠 Zookeeper-based Lock:
Think of it as a queue of people outside the microwave
Everyone knows their turn
Good for strict ordering and leader election
🦺 etcd-based Lock:
Used heavily in Kubernetes
Works like a lease: “You get the lock for 30 seconds, then renew it”
Key Concepts:
Acquire: “I want to use the resource”
Hold: “I’m using it right now”
Release: “I’m done, someone else can go”
Timeouts: What if someone forgets to release it? Set an expiry (like a microwave timer ⏲️)
Simple Pseudocode:
# Assumes acquire_lock / release_lock / heat_food / wait_and_retry exist
if acquire_lock("microwave"):
    try:
        heat_food()
    finally:
        release_lock("microwave")  # always release, even if heating fails
else:
    wait_and_retry()
Optional: Add Exponential Backoff 🌀
Because no one likes 5 friends banging on the microwave at once 🙃
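For the curious, here's one way the acquire_lock / release_lock helpers from the pseudocode could look with a single Redis instance and redis-py. It's a sketch with assumed key names and a 30-second TTL; real Redlock spans multiple Redis nodes and uses per-holder tokens:

import redis

r = redis.Redis()  # assumes a local Redis instance

def acquire_lock(name, ttl_seconds=30):
    # SET key value NX EX ttl: succeeds only if nobody else holds the key.
    # The TTL is the "microwave timer": a crashed holder can't block forever.
    return bool(r.set(f"lock:{name}", "held", nx=True, ex=ttl_seconds))

def release_lock(name):
    # Note: a production lock stores a unique token per holder and deletes
    # atomically (e.g. via a Lua script) so you never release someone else's lock
    r.delete(f"lock:{name}")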
Why This Matters:
Prevent Chaos:
Avoid data corruption
Prevent overselling
Keep systems sane
Safety in Concurrency:
Multiple workers, one shared resource → smooth coordination
No more “oops we both processed the same thing twice”
Foundation for Critical Workflows:
Payment systems
Order processing
Workflow orchestration
Leader elections
The “Aha!” Moment:
Distributed locks are like a microwave etiquette system in a shared apartment:
No one fights
Food gets heated fairly
Life stays peaceful 🧘
Bottom line: In a world where everyone wants access right now, distributed locks say: “One at a time, please.” 🙏
5. Paxos Algorithm (Distributed Consensus)
Alright, imagine a bunch of friends trying to decide where to eat dinner 🍕🍔🍜 — but you’re all in different cities, your phones have spotty signal, and nobody trusts group chats anymore. 😵💫
Welcome to the world of distributed consensus — and Paxos is the OG algorithm that helps everyone agree on one decision, even if:
Messages get delayed
Some people don’t respond
The network goes wild
What Is Distributed Consensus?
When you're building a distributed system (like a cluster of databases or microservices), they all need to agree on things like:
Which value to store
Who the leader is
What order to process transactions in
But machines crash, networks get flaky, and yet… you still need them to act like a single source of truth.
That’s what consensus algorithms like Paxos solve.
Real-World Analogy: Group Dinner Planning
You and 4 friends are voting on where to eat:
You text: “Pizza?”
Friend 1 says yes 🍕
Friend 2 says no 😤
Friend 3 didn’t see the message
Friend 4 is offline
You want a system that:
Still works even if someone’s offline
Avoids multiple decisions at once
Can retry if people disagree
Paxos is like a very careful friend who makes sure you all settle on one decision — even in chaos.
The Core Roles in Paxos:
1. Proposer – the one who makes a suggestion
“Hey everyone, let’s go get sushi!” 🍣
2. Acceptor – the voters who say yes or no
“Okay, I’m cool with sushi.”
“Hmm… I already agreed to burgers earlier.”
3. Learner – who finds out what was decided
“So we’re getting sushi right? Just confirming.”
Paxos in Action (Oversimplified):
Propose:
Someone suggests a plan (e.g., “Let’s eat tacos 🌮”).
Prepare Phase:
The proposer sends a “Will you consider this idea?” message with a proposal number.
Promise Phase:
Acceptors reply “I promise not to accept anything older than this.”
Accept Phase:
Proposer says, “Cool, here’s the actual plan: tacos!”
Learn Phase:
Everyone learns the result and commits to it.
Result? Even if half the group didn’t respond, as long as a majority agrees — decision is made!
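To make the phases concrete, here's a toy single-decree sketch in Python. It's deliberately naive: acceptors are plain objects, "messages" are method calls, and there's no networking, failure handling, or learner broadcast, so treat it as an illustration of the flow above rather than real Paxos:

# Toy single-decree Paxos: one proposer, in-process acceptors, no real network

class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest proposal number we've promised
        self.accepted_n = -1       # highest proposal number we've accepted
        self.accepted_value = None

    def prepare(self, n):
        # Promise phase: promise to ignore anything older than proposal n
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        # Accept phase: accept unless we've promised a newer proposal
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(acceptors, n, value):
    majority = len(acceptors) // 2 + 1

    # Prepare / Promise: gather promises from a majority
    replies = [a.prepare(n) for a in acceptors]
    promises = [(an, av) for ok, an, av in replies if ok]
    if len(promises) < majority:
        return None  # no majority; retry later with a higher proposal number

    # If any acceptor already accepted a value, we must re-propose that one
    already_accepted = [(an, av) for an, av in promises if av is not None]
    if already_accepted:
        value = max(already_accepted)[1]  # value with the highest proposal number

    # Accept phase: the value is chosen once a majority accepts it
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, n=1, value="tacos 🌮"))  # -> 'tacos 🌮'

In real Paxos, each proposer uses unique, increasing proposal numbers on every retry; that, plus the majority rule, is what guarantees two competing proposers can never get conflicting values chosen.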
Why Paxos Is Hard (and Powerful):
It’s guaranteed to be correct, even in failures
It can be slow if not optimized (which is why simpler alternatives like Raft exist)
But it’s foundational — like the seatbelt of distributed systems
Where Paxos Shows Up IRL:
Google’s Chubby lock service
Apache ZooKeeper (inspired by Paxos, though it uses its own protocol, Zab)
Google Spanner (uses Paxos to agree on data writes); CockroachDB solves the same problem with Raft
Any system that needs to elect a leader or agree on one true value
Why Not Just Pick a Leader and Move On?
Because in distributed systems:
Leaders might crash
Messages might be delayed
Two leaders might get elected at the same time (💥 split-brain)
Paxos ensures:
You either agree on the same value, or you don’t agree at all — but you’ll never get two conflicting answers.
And that’s it! While there were a LOT more things that we covered, these seemed to be some of the most important (or at least the ones that I felt needed one more look). Comment below or start a chat if you feel like discussing something else, and we’ll try to figure it out together!