Class notes 2 for Cloud Computing Concepts by Indranil Gupta @ Coursera
This chapter mainly introduces the following two building blocks for distributed systems (in the cloud):
- Gossip or Epidemic Protocols
- Failure detection and Membership Protocols
What is multicast?
- Sending information to other nodes in the group;
- As opposed to broadcast, multicast is more restricted: it sends information only to a particular group of nodes, instead of to everyone.
What are the requirements for multicast protocol?
- Fault tolerance
One of the simplest ways of doing multicast is a centralized approach.
Problems with centralized approach:
- Fault tolerance: if the sender fails, everyone else fails to receive the multicast;
- Overhead is high: the sender does O(N) work, giving O(N) time complexity.
A better approach, Tree-Based Multicast Protocols
Tree-Based Multicast Protocols
- Build a spanning tree among the processes of the multicast group
- Use the spanning tree to disseminate multicasts (latency reduced to O(log(N)))
- Use either acknowledgements (ACKs) or negative acknowledgements (NAK) to repair multicasts not received
- SRM (Scalable Reliable Multicast)
- Use NAKs
- But it adds random delays and uses exponential backoff to avoid NAK storms
- RMTP (Reliable Multicast Transport Protocol)
- Use ACKs
- But ACKs are only sent to designated receivers, which then re-transmit missing multicasts
- These protocols still cause an O(N) ACK/NAK overhead
A gossip protocol is a style of computer-to-computer communication protocol inspired by the form of gossip seen in social networks.
Periodically, a sender selects b random nodes from the group and sends them copies of the gossip message. b is a gossip parameter, and it is typically a very small number, e.g., 2.
- “Push” Gossip:
- Once you have a multicast message, you start gossiping about it.
- Multiple messages? Gossip a random subset of them, or recently-received ones, or higher priority ones.
- “Pull” Gossip:
- Periodically poll a few randomly selected processes for new multicast messages that you haven’t received
- Get those messages
- Hybrid variant: Push-Pull
- As the name suggests
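The push variant above can be sketched in a few lines. The simulation below is my own illustrative sketch (function name and parameters are assumptions, not course code), showing how push gossip spreads a single multicast through the group:

```python
import random

def simulate_push_gossip(n_nodes, b=2, seed=0):
    """Simulate push gossip: every infected node forwards the message to
    b targets chosen uniformly at random, once per round.  Returns the
    number of rounds until all n_nodes have the message."""
    rng = random.Random(seed)
    nodes = list(range(n_nodes))
    infected = {0}          # node 0 starts with the multicast message
    rounds = 0
    while len(infected) < n_nodes:
        newly = set()
        for node in infected:
            others = [p for p in nodes if p != node]
            newly.update(rng.sample(others, min(b, len(others))))
        infected |= newly
        rounds += 1
    return rounds
```

Running this with n_nodes = 1000 and b = 2 typically finishes in on the order of log(n) rounds, which is the "spreads quickly" claim below.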
Claims of the simple Push protocol
- Is lightweight in large groups
- Spreads a multicast quickly
- Is highly fault-tolerant
From old mathematical branch of Epidemiology
- Population of (n + 1) individuals mixing homogeneously
- Contact rate between any individual pair is β
- At any time, each individual is either uninfected (numbering x) or infected (numbering y)
- Then, x0 = n, y0 = 1, and at all times x + y = n + 1
- An infected-uninfected contact turns the uninfected individual infected, and it stays infected.
The product xy counts all possible infected-uninfected pairs, so it drives the rate of new infections.
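Written out, the bullets above are the standard deterministic epidemic model: with y(t) the number of infected individuals, the logistic equation and its closed form are

```latex
\frac{dy}{dt} = \beta x y = \beta\,(n + 1 - y)\,y, \qquad y(0) = 1
\quad\Longrightarrow\quad
y(t) = \frac{n + 1}{1 + n\,e^{-\beta (n+1) t}}
```

Setting β = b/n (fanout b in a group of n + 1 nodes) and t = c·log(n) gives y ≈ (n + 1) − 1/n^(cb − 2): after O(log n) rounds, all but a vanishing fraction of nodes have the message, despite the tiny per-round load.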
- Log(N) is not constant in theory
- But pragmatically, it is a very slow growing number
- Base 2
- Log(1000) ~ 10
- Log(1M) ~ 20
- Log(1B) ~ 30
- Log(number of all IPv4 addresses) = 32
- Packet loss
- 50% packet loss: analyze with b replaced with b/2
- To achieve same reliability as 0% packet loss, take twice as many rounds
- Node failure
- 50% of nodes fail: analyze with n replaced with n/2 and b replaced with b/2
- Same as above
With failures, it is possible, but improbable that the epidemic will die out quickly.
Pull Gossip: Analysis
- Clearinghouse and Bayou projects: email and database transactions
- refDBMS system [PODC ’87]
- Bimodal Multicast [ACM TOCS ’99]
- Sensor networks [Li Li et al, Infocom ’02, and PBBF, ICDCS ’05]
- AWS EC2 and S3 Cloud (rumored). [‘00s]
- Cassandra key-value store (and others) use gossip for maintaining membership lists
- Usenet NNTP (Network News Transport Protocol) [’79]
NNTP Inter-Server Protocol
- Each client uploads and downloads news posts from a news server
- Server retains news posts for a while, transmits them lazily, deletes them after a while.
- Multicast is an important problem
- Tree-based multicast protocols
- When concerned about scale and fault-tolerance, gossip is an attractive solution
- Also known as epidemics
- Fast, reliable, fault-tolerant, scalable, topology-aware.
Failures are the norm
- Say, the rate of failure of one machine (OS/Disk/motherboard/network, etc.) is once every 10 years (120 months) on average.
- When you have 120 servers in the DC, the mean time to the next machine failure is 1 month.
- When you have 12,000 servers in the DC, the mean time to the next failure is only about 7.2 hours!
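A quick back-of-the-envelope check of these numbers (the helper function and the 720-hour month are my own assumptions):

```python
def next_failure_hours(mtbf_months, n_servers, hours_per_month=720):
    """Expected time until *some* machine fails, assuming n_servers
    independent machines that each fail once every mtbf_months on
    average: the failure rates add up, so the mean time divides by N."""
    return mtbf_months * hours_per_month / n_servers

print(next_failure_hours(120, 120))     # 720.0 hours, i.e., 1 month
print(next_failure_hours(120, 12_000))  # 7.2 hours
```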
Group Membership Service (Membership Protocol)
- In each application process pi, there is a membership list that maintains most, or all, of the non-faulty processes.
- The membership list is an almost-complete list (weakly consistent).
- Two sub-protocols:
- Failure detector: a mechanism that detects failures.
- Dissemination: disseminates information about these failures after detection to other processes in the system.
The flow of failure detection and dissemination
Distributed Failure Detectors: Desirable Properties
- Completeness: each failure is detected
- Accuracy: there is no mistaken detection
- Speed: Time to first detection of a failure
- Equal Load on each member
- Network Message Load
Completeness and accuracy are impossible to guarantee together in lossy networks [Chandra and Toueg]; if it were possible, we could solve consensus!
In the real world, completeness is always guaranteed, and accuracy is only partially/probabilistically guaranteed (almost 100%, but not exactly). When you have a failure, you definitely want to be able to detect it, recover from it, and make your data consistent again; you don’t want to miss any failures. However, if you mistakenly detect a failure, it’s fine to ask that poor victim process to leave the system and rejoin, perhaps under a different identifier. The overhead of doing that is smaller than the overhead of the reverse mistake.
Here are three failure detection protocols.
Centralized heartbeating: every process sends heartbeats to one central process.
- When pj fails, there is no guarantee about who detects its failure;
- The central process is overloaded: a hotspot issue.
Ring heartbeating: each process sends heartbeats to its neighbor(s) along a virtual ring.
- Avoids a hotspot, better than the centralized approach;
- Might still leave some failures undetected when there are multiple failures, e.g., both of pi’s neighbors fail;
- Repairing the ring is yet another overhead.
All-to-all heartbeating: each process pi sends heartbeats to all the other processes in the system.
- Equal load per member, well distributed across all (the load looks high, but it is actually not so bad).
- The protocol is complete: if pi fails, then as long as there is at least one other non-faulty process in the group, that process will detect pi as having failed.
Gossip-style Failure Detection
- If the heartbeat has not increased for more than T(fail) seconds, the member is considered failed.
- And after T(cleanup) seconds, it will delete the member from the list.
- Why two different timeouts?
- If you delete the member right away, the deleted entry might come back and never go away.
- For instance, suppose process 2 marks an entry as failed and deletes it. Process 1 may still hold the same entry as non-failed and gossip it to process 2; process 2 would then treat it as a new entry and add it back with a new local time. With the ping-pong behavior of gossip, this entry may never leave the system.
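One way to see why both timeouts are needed is a small sketch. This is a hypothetical illustration (class and field names are mine, not from the course): "failed" entries are kept around for T(cleanup), so stale gossip cannot resurrect them.

```python
import time

T_FAIL = 30.0     # seconds without a heartbeat increase before declaring failure
T_CLEANUP = 30.0  # additional seconds before deleting the entry entirely

class GossipMembership:
    """Gossip-style heartbeating table (illustrative sketch): each entry
    is [heartbeat counter, local time of last counter increase]."""
    def __init__(self):
        self.table = {}   # member id -> [heartbeat, last_update_local_time]
        self.failed = {}  # member id -> local time when marked failed

    def merge(self, remote_table, now=None):
        now = time.time() if now is None else now
        for member, (hb, _) in remote_table.items():
            if member in self.failed:
                continue  # ignore gossip about members already marked failed
            entry = self.table.get(member)
            if entry is None or hb > entry[0]:
                self.table[member] = [hb, now]  # fresher heartbeat: reset timer

    def sweep(self, now=None):
        now = time.time() if now is None else now
        for member, (hb, last) in list(self.table.items()):
            if now - last > T_FAIL:
                self.failed[member] = now       # declare failed; stop gossiping it
                del self.table[member]
        for member, when in list(self.failed.items()):
            if now - when > T_CLEANUP:
                del self.failed[member]         # only now forget it entirely
```

Because `merge` skips members in `failed`, a stale copy arriving from another process within T(cleanup) is ignored instead of being re-added as a "new" entry.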
Optimality is analyzed in terms of the main properties:
- Completeness: Guarantee always
- Accuracy: Probability PM(T)
- Speed: T time units
- Time to first detection of a failure
- Equal Load on each member
- Network Message Load
Conclusion: the gossip-based heartbeating protocol has a higher load than all-to-all heartbeating; this is because gossip buys slightly better accuracy by using more messages.
Worst Case Load
Optimal Case Load
- PM(T) is exponential in -K. Also depends on pml (and pf)
- See paper
L/L* < 28; E[L]/L* stays bounded for loss rates of up to 15%
- Prob. of being pinged in T’ = 1 - (1 - 1/N)^(N - 1) ≈ 1 - e^(-1) for large N
- E[T] = T’ e / (e - 1)
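These numbers are easy to check numerically (the function below is an illustrative sketch, not SWIM code):

```python
import math

def prob_pinged(n):
    """Probability that a given member is picked as ping target by at
    least one of the other n - 1 members in one protocol period, when
    each member pings one uniformly random target."""
    return 1 - (1 - 1 / n) ** (n - 1)

for n in (10, 1_000, 1_000_000):
    print(n, prob_pinged(n))
print(1 - 1 / math.e)  # the large-N limit from the notes, ~0.632
```

Since a member is pinged in a given period with probability ≈ 1 − e^(−1), the expected detection time works out to E[T] = T’ · e/(e − 1) ≈ 1.58 · T’: a small constant multiple of the protocol period, independent of group size.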
- Completeness: Any alive member detects failure
- Eventually, every other non-faulty process that has this failed process in its list will pick it as a ping target.
- By using a trick, within worst case O(N) protocol periods
- Key: select each membership element once as a ping target in a traversal
- Round-robin pinging
- Random permutation of list after each traversal
- Each failure is detected in worst case 2N-1 (local) protocol periods
- Preserves FD properties
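The round-robin-with-random-permutation trick can be sketched as follows (a hypothetical illustration; class and method names are mine):

```python
import random

class RoundRobinPinger:
    """Round-robin ping-target selection with a fresh random permutation
    after each traversal (illustrative sketch).  Every member is pinged
    exactly once per traversal, which bounds worst-case detection time."""
    def __init__(self, members, seed=None):
        self.members = list(members)
        self.rng = random.Random(seed)
        self.rng.shuffle(self.members)
        self.pos = 0

    def next_target(self):
        if self.pos == len(self.members):
            self.rng.shuffle(self.members)  # new random permutation
            self.pos = 0
        target = self.members[self.pos]
        self.pos += 1
        return target
```

A failed member is at worst at the end of the current traversal and the end of the next one, which is the source of the 2N − 1 worst-case bound, while the shuffling keeps target selection unpredictable and the load random on average.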
Dissemination options:
- Multicast (Hardware / IP): requires multiple simultaneous multicasts
- Point-to-point (TCP / UDP)
- Zero extra messages: piggyback on Failure Detector messages
- This is infection-style dissemination
SWIM Failure Detector Protocol
- Epidemic style dissemination
- After λlog(N) protocol periods, N^(-(2λ - 2)) processes would not have heard about an update
- Maintain a buffer of recently joined/evicted processes
- Piggyback from this buffer
- Prefer recent updates
- Buffer elements are garbage collected after a while
- After λlog(N) protocol periods, this defines weak consistency
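A minimal sketch of such a piggyback buffer (hypothetical; the actual SWIM policy differs in details, e.g., how the send budget relates to λ·log(N)):

```python
class PiggybackBuffer:
    """Buffer of recent membership updates to piggyback on ping/ack
    messages (illustrative sketch).  Each update is piggybacked at most
    max_sends times, and updates sent fewer times (the more recent ones)
    are preferred."""
    def __init__(self, max_sends):
        self.max_sends = max_sends
        self.updates = {}   # update -> number of times already piggybacked

    def add(self, update):
        self.updates[update] = 0

    def select(self, k):
        # prefer updates gossiped the fewest times so far
        chosen = sorted(self.updates, key=self.updates.get)[:k]
        for u in chosen:
            self.updates[u] += 1
            if self.updates[u] >= self.max_sends:
                del self.updates[u]   # garbage-collect after enough sends
        return chosen
```

Capping the number of sends per update is what garbage-collects the buffer, while preferring the least-sent updates gets fresh join/failure news out first.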
False detections, due to:
- Perturbed processes
- Packet losses, e.g., from congestion
Indirect pinging may not solve the problem
- e.g., correlated message losses near pinged host
Key: suspect a process before declaring it as failed in the group
- Distinguish multiple suspicions of a process
- Per-process incarnation number
- The inc# for pi can be incremented only by pi itself
- Typically, pi does this when it receives a (Suspect, pi) message.
- Somewhat similar to DSDV (a routing protocol that is used in ad hoc networks)
- Higher inc# notifications over-ride lower inc#’s
- Within an inc#: (Suspect, inc #) overrides (Alive, inc #)
- (Failed, inc #) overrides everything else
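The override rules above can be captured in a single comparison function (an illustrative sketch; the constant names are mine):

```python
# Message precedence for the suspicion mechanism: higher incarnation
# numbers win; within the same incarnation, Suspect beats Alive;
# Failed beats everything.
ALIVE, SUSPECT, FAILED = 0, 1, 2

def overrides(new, old):
    """Does message new = (state, inc#) override message old?"""
    new_state, new_inc = new
    old_state, old_inc = old
    if new_state == FAILED:
        return True            # (Failed, inc #) overrides everything else
    if old_state == FAILED:
        return False
    if new_inc != old_inc:
        return new_inc > old_inc   # higher inc# notifications win
    return new_state == SUSPECT and old_state == ALIVE
```

Note that a suspected pi can refute the suspicion only by incrementing its own inc# and gossiping an Alive message with the higher number.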
- Failures are the norm, not the exception, in datacenters
- Every distributed system uses a failure detector
- Many distributed systems use a membership service
- Ring failure detection underlies
- IBM SP2 and many other similar clusters/machines
- Gossip-style failure detection underlies
- Amazon EC2/S3 (rumored!)
- Hurricane Georges, 17 days in Sept 1998
- “RAMS modeled the mesoscale convective complex that dropped so much rain, in good agreement with recorded data”
- Used 5 km spacing instead of the usual 10 km
- Ran on 256+ processors
- Computation-intensive computing (or HPC = High Performance Computing)
Can one run such a program without access to a supercomputer?
For example, at certain times of the day, the workstations/clusters at Madison, MIT, and NCSA are free and available for running your application.
So how do we schedule the jobs?
Condor (Now HTCondor)
- High-throughput computing system from U. Wisconsin Madison
- Belongs to the class of cycle-scavenging systems
- Runs on many workstations
- When a workstation is free, it asks the site’s central server (or Globus) for tasks
- If the user hits a keystroke or mouse click, stop the task
- Either kill the task or ask the server to reschedule it
- Can also run on dedicated machines
- Globus Alliance involves universities, national US research labs, and some companies
- Standardized several things, especially software tools
- Separately but related: Open Grid Forum
- Globus Alliance has developed the Globus Toolkit
Important in Grids because they are federated, i.e., no single entity controls the entire infrastructure
Single sign-on: a collective job set should require once-only user authentication
Mapping to local security mechanisms: some sites use Kerberos, others use Unix security
Delegation: credentials to access resources inherited by subcomputations, e.g., job 0 to job1
Community authorization: e.g., third-party authentication
These are also important in clouds, but less so because clouds are typically run under a central control
In clouds the focus is on failures, scale, on-demand nature
- Grid computing focuses on computation-intensive computing (HPC)
- Though often federated, Grid architecture and key concepts have a lot in common with those of clouds
- Are Grids/HPC converging towards clouds?
- E.g., Compare OpenStack and Globus