
Class notes 2 for Cloud Computing Concepts by Indranil Gupta @ Coursera

(Updated 03/19/2017)

This chapter mainly introduces the following two building blocks for distributed systems in the cloud, plus a short overview of grid computing:

• Gossip or Epidemic Protocols
• Failure detection and Membership Protocols

## Gossip

### Multicast Problem

What is multicast?

• Sending information to other nodes in the group;
• As opposed to broadcast, multicast is more restricted: it sends information only to a particular group of nodes, instead of to everyone.

What are the requirements for multicast protocol?

• Fault tolerance
• Scalability

One of the simplest ways of doing multicast is a centralized approach, in which the sender transmits the message to each group member one by one.

Problems with the centralized approach:

• Fault tolerance: if the sender fails partway through, the remaining nodes never receive the message;
• High overhead: the sender takes O(N) time to reach all N nodes.

A better approach: Tree-Based Multicast Protocols

• Build a spanning tree among the processes of the multicast group
• Use the spanning tree to disseminate multicasts (latency reduced to O(log(N)))
• Use either acknowledgements (ACKs) or negative acknowledgements (NAK) to repair multicasts not received
• SRM (Scalable Reliable Multicast)
• Use NAKs
• But adds random delays, and uses exponential back off to avoid NAK storms
• RMTP (Reliable Multicast Transport Protocol)
• Use ACKs
• But ACKs are only sent to designated receivers, which then re-transmit missing multicasts
• These protocols still cause an O(N) ACK/NAK overhead
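
The NAK-storm avoidance mentioned above can be sketched as a randomized suppression timer: before multicasting a NAK, a receiver waits a random delay inside an exponentially growing window, and stays silent if it overhears another receiver's NAK for the same packet. A minimal sketch in Python (the class, constants, and method names are illustrative assumptions, not SRM's actual interface):

```python
import random

class NakSuppressor:
    """Sketch of SRM-style NAK scheduling: random delays plus exponential
    backoff, so that one receiver's multicast NAK suppresses the others'.
    Constants and names are illustrative assumptions, not SRM's."""

    def __init__(self, base_delay=0.1):
        self.base_delay = base_delay   # seconds (assumed value)
        self.attempt = 0               # retransmission requests sent so far
        self.suppressed = False

    def next_nak_delay(self):
        # Wait a random time inside an exponentially growing window before
        # sending a NAK; the spread in time lets one NAK suppress the rest.
        window = self.base_delay * (2 ** self.attempt)
        self.attempt += 1
        return random.uniform(window, 2 * window)

    def on_overheard_nak(self):
        # Another receiver already requested this packet: don't NAK it too.
        self.suppressed = True
```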

### The Gossip Protocol

A gossip protocol is a style of computer-to-computer communication protocol inspired by the form of gossip seen in social networks.

Periodically, a sender selects b random nodes from the group and sends them copies of the gossip. b is a gossip parameter, and is typically a very small number, e.g., 2.

• “Push” Gossip:
• Once you have a multicast message, you start gossiping about it.
• Multiple messages? Gossip a random subset of them, or recently-received ones, or higher priority ones.
• “Pull” Gossip:
• Periodically poll a few randomly selected processes for new multicast messages that you haven’t received
• Get those messages
• Hybrid variant: Push-Pull
• As the name suggests, both are combined (a sketch of a push round follows this list)
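
A minimal sketch of one push-gossip round, assuming a transport primitive `send(target, payload)` such as a UDP send (the function and parameter names are illustrative, not from any particular implementation):

```python
import random

GOSSIP_B = 2  # gossip fan-out parameter b; typically very small

def push_gossip_round(members, self_addr, messages, send):
    """One periodic round of push gossip.

    members:  known member addresses; self_addr: our own address
    messages: multicast messages we are still gossiping about
    send:     transport primitive, e.g. a UDP send (assumed)
    """
    peers = [m for m in members if m != self_addr]
    if not peers or not messages:
        return
    # Pick b random targets and push them copies of the gossip.
    for target in random.sample(peers, min(GOSSIP_B, len(peers))):
        # With many messages, gossip a random subset (or recently received
        # / higher-priority ones), as described in the list above.
        payload = random.sample(messages, min(len(messages), 4))
        send(target, payload)
```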

### Gossip Analysis

Claims of the simple Push protocol

• Is lightweight in large groups
• Is highly fault-tolerant

The analysis comes from the old mathematical branch of epidemiology:

• Population of (n + 1) individuals mixing homogeneously
• Contact rate between any individual pair is β
• At any time, each individual is either uninfected (numbering x) or infected (numbering y)
• Then, x0 = n, y0 = 1, and at all times x + y = n + 1
• An infected-uninfected contact turns the latter infected, and it stays infected.

The product xy counts all possible infected-uninfected pairs, which is what drives the infection rate (see the model reconstructed below).
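
A reconstruction of the model's equations (a sketch following Bailey's classic treatment, as used in the lecture):

```latex
% Infection dynamics: an infected-uninfected contact infects the latter,
% at pairwise contact rate beta.
\frac{dx}{dt} = -\beta x y,
\qquad x_0 = n,\; y_0 = 1,\; x + y = n + 1

% Closed-form solution:
x(t) = \frac{n(n+1)}{n + e^{\beta(n+1)t}},
\qquad
y(t) = \frac{n+1}{1 + n\,e^{-\beta(n+1)t}}

% With gossip fan-out b, beta = b/n; substituting t = c*log(n):
y \;\approx\; (n+1) - \frac{1}{n^{cb-2}}
```

So after c·log(n) rounds all but about 1/n^(cb−2) of the group is infected, while each node sends only O(b·log(n)) messages; this is why the latency discussion below is phrased in terms of log(N).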

• The resulting O(log(N)) dissemination time is not constant in theory
• But pragmatically, it is a very slowly growing number
• Base 2:
• log(1000) ≈ 10
• log(1M) ≈ 20
• log(1B) ≈ 30
• log(all IPv4 addresses) = 32

Fault-Tolerance

• Packet loss
• 50% packet loss: analyze with b replaced with b/2
• To achieve the same reliability as with 0% packet loss, take twice as many rounds
• Node failure
• 50% of nodes fail: analyze with n replaced with n/2 and b replaced with b/2
• Same as above

With failures, it is possible, but improbable that the epidemic will die out quickly.

Pull Gossip: Analysis
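
A sketch of the key point, reconstructed from the lecture: let p_i be the fraction of processes that have not yet received the gossip after round i. A process remains uninfected through round i+1 only if it was already uninfected and all k of the processes it polled were uninfected too, so

```latex
p_{i+1} \;=\; p_i \cdot p_i^{\,k} \;=\; p_i^{\,k+1}
```

In the second half of the epidemic this decays super-exponentially, which makes pull (and push-pull) faster than pure push once a majority of processes are infected.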

### Gossip Implementations

Some Implementations:

• Clearinghouse and Bayou projects: email and database transactions
• refDBMS system [PODC ’87]
• Bimodal Multicast [ACM TOCS ’99]
• Sensor networks [Li Li et al, Infocom ’02, and PBBF, ICDCS ’05]
• AWS EC2 and S3 Cloud (rumored). [‘00s]
• Cassandra key-value store (and others) use gossip for maintaining membership lists
• Usenet NNTP (Network News Transport Protocol) [’79]

NNTP Inter-Server Protocol

• Server retains news posts for a while, transmits them lazily, deletes them after a while.

Summary

• Multicast is an important problem
• Tree-based multicast protocols
• When concerned about scale and fault-tolerance, gossip is an attractive solution
• Also known as epidemics
• Fast, reliable, fault-tolerant, scalable, topology-aware.

## Membership

### What is Group Membership List?

Failures are the norm

• Say the rate of failure of one machine (OS/disk/motherboard/network, etc.) is once every 10 years (120 months) on average.
• With 120 servers in the DC, the mean time to failure (MTTF) of the next machine is 1 month (120 months ÷ 120).
• With 12,000 servers in the DC, the MTTF drops to about 7.2 hours (120 months ÷ 12,000 ≈ 0.01 month ≈ 7.2 hours)!

Group Membership Service (Membership Protocol)

• In each application process pi, a membership list maintains a list of most, or all, of the non-faulty processes.
• The membership list is an almost-complete list (weakly consistent).
• Two sub-protocols:
• Failure detector: a mechanism that detects failures;
• Dissemination: spreads information about detected failures to the other processes in the system.

The flow of failure detection and dissemination

### Failure Detectors

Distributed Failure Detectors: Desirable Properties

• Completeness: each failure is detected
• Accuracy: there is no mistaken detection
• Speed: Time to first detection of a failure
• Scale:
• Equal Load on each member

Completeness and accuracy are impossible to achieve together in lossy networks [Chandra and Toueg]; if both could be guaranteed, consensus would be solvable, which is known to be impossible in asynchronous systems.

In the real world, completeness is always guaranteed, while accuracy is only partially/probabilistically guaranteed (close to 100%, but not exact). When a failure occurs, you definitely want to detect it, recover from it, and make your data consistent again: you cannot afford to miss failures. If you mistakenly detect a failure, however, it is acceptable to ask the poor victim process to leave the system and rejoin, perhaps under a different identifier. The overhead of doing that is lower than the cost of the reverse trade-off.

Here are three failure detection protocols.

#### Centralized Heartbeating

Every process sends heartbeats periodically to a single central process pj.

Cons:

• When the central process pj fails, there is no guarantee about who detects its failure (single point of failure); pj is also a hot spot.

#### Ring Heartbeating

Processes are organized in a virtual ring, and each process sends heartbeats to (and monitors) its ring neighbors.

Pros:

• Avoids a hot spot; better than the centralized approach.

Cons:

• Failures may still go undetected when there are multiple simultaneous failures, e.g., when both of pi’s neighbors fail;
• Repairing the ring after churn is additional overhead.

#### All-to-All Heartbeating

Each process pi sends out heartbeat to all the other processes in the system.
Pros:

• Equal load per member, well distributed across all (the load looks high, but is actually not so bad).
• The protocol is complete: if pi fails, then as long as there is at least one other non-faulty process in the group, it will detect pi as having failed.

### Gossip-Style Membership

Gossip-style Failure Detection

• Each member maintains a table of (member, heartbeat counter) entries, stamped with the local time of the last counter increase; it periodically increments its own counter and gossips its table to a few random members, which merge it by keeping the higher heartbeat per entry.
• If a member’s heartbeat has not increased for more than T(fail) seconds, that member is considered failed.
• After a further T(cleanup) seconds, the entry is deleted from the list.
• Why two different timeouts?
• If you deleted the entry as soon as it was marked failed, it might never go away from the system.
• For instance, suppose process 2 marks an entry as failed and deletes it immediately. Process 1 may still hold that entry as not-yet-failed and gossip it to process 2; process 2 would then treat it as a brand-new entry and re-add it with a fresh local time. With this ping-pong behavior of gossip, the entry might never leave the system (a sketch follows this list).
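
A minimal sketch of such a membership table with the two timeouts, assuming each process gossips a map of member → heartbeat counter (all names and timeout values are illustrative assumptions):

```python
import time

T_FAIL = 10.0     # seconds without a heartbeat increase => mark failed (assumed)
T_CLEANUP = 10.0  # extra seconds before deleting the entry (assumed)

class GossipMembership:
    """Sketch of a gossip-style membership list with the two timeouts."""

    def __init__(self):
        # member -> (heartbeat counter, local time of last increase, failed?)
        self.table = {}

    def merge(self, gossiped):
        """Merge a received table {member: heartbeat}, keeping higher counters."""
        now = time.time()
        for member, heartbeat in gossiped.items():
            entry = self.table.get(member)
            if entry is not None and entry[2]:
                # Already marked failed: ignore gossip about this member, so a
                # stale copy cannot resurrect it during the cleanup window.
                continue
            if entry is None or heartbeat > entry[0]:
                self.table[member] = (heartbeat, now, False)

    def sweep(self):
        """Mark silent members failed; delete them T_CLEANUP seconds later."""
        now = time.time()
        for member, (hb, last, failed) in list(self.table.items()):
            if not failed and now - last > T_FAIL:
                self.table[member] = (hb, last, True)
            elif failed and now - last > T_FAIL + T_CLEANUP:
                del self.table[member]
```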

### Which is the best failure detector?

An optimal failure detector is characterized in terms of the main properties:

• Completeness: guaranteed always
• Accuracy: probability of a mistaken detection, PM(T)
• Speed: T time units
• Time to first detection of a failure
• Scale:
• Equal Load on each member

Conclusion: the gossip-based heartbeating protocol has a higher load than all-to-all heartbeating because gossip buys slightly better accuracy by using more messages.

### Another Probabilistic Failure Detector

#### SWIM Failure Detector

• Protocol period: pi pings a randomly chosen target pj; if no ACK arrives, pi asks K other members to ping pj indirectly, and declares pj failed only if neither a direct nor an indirect ACK arrives by the end of the period
• PM(T) is exponential in -K; it also depends on pml, the message-loss probability (and pf, the process-failure probability)
• See the SWIM paper for the full analysis

Load: L/L* < 28, and E[L]/L* stays low, for up to 15% message loss rates

Detection Time

• Probability of a failed process being pinged within one protocol period T’: 1 - (1 - 1/N)^(N - 1) ≈ 1 - e^(-1)
• E[T] = T’ · e / (e - 1) (see the short derivation after this list)
• Completeness: some alive member detects every failure
• Eventually, every other non-faulty process that still has the failed process in its list will pick it as a ping target
• By using a trick (round-robin target selection, below), within worst case O(N) protocol periods
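
A short reconstruction of the E[T] bullet above: the number of protocol periods until the failed process is first pinged is approximately geometric with success probability 1 − e⁻¹, hence

```latex
\Pr[\text{pinged within one period}]
  = 1 - \left(1 - \frac{1}{N}\right)^{N-1} \;\approx\; 1 - e^{-1},
\qquad
E[T] = \frac{T'}{1 - e^{-1}} = T' \cdot \frac{e}{e - 1}
```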

Time-Bounded Completeness

• Key: select each membership element once as a ping target in a traversal
• Round-robin pinging
• Random permutation of list after each traversal
• Each failure is detected in worst case 2N-1 (local) protocol periods
• Preserves FD properties
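
A sketch of the round-robin target selection described above, with a fresh random permutation per traversal (names are illustrative; joiners are inserted at random positions, as in the SWIM paper):

```python
import random

class PingTargetSelector:
    """Sketch of SWIM's time-bounded target selection: round-robin traversal
    of the membership list, reshuffled after each full traversal."""

    def __init__(self, members):
        self.order = list(members)
        random.shuffle(self.order)
        self.index = 0

    def next_target(self):
        if not self.order:
            return None
        if self.index >= len(self.order):
            random.shuffle(self.order)  # new permutation for the next traversal
            self.index = 0
        target = self.order[self.index]
        self.index += 1
        return target

    def add_member(self, member):
        # New joiners go to a random position in the current order; every
        # member is then pinged within worst-case 2N-1 protocol periods.
        self.order.insert(random.randint(0, len(self.order)), member)
```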

### Dissemination and suspicion

#### Dissemination Options

• Multicast (hardware / IP)
• Unreliable
• Multiple simultaneous multicasts
• Point-to-point (TCP / UDP)
• Expensive
• Zero extra messages: piggyback on failure-detector messages
• Infection-style dissemination

(Figure: SWIM failure detector protocol)

#### Infection-Style Dissemination

• Epidemic-style dissemination
• After λ·log(N) protocol periods, N^(-(2λ - 2)) processes would not have heard about an update
• Maintain a buffer of recently joined/evicted processes
• Piggyback updates from this buffer onto failure-detector messages
• Buffer elements are garbage collected after a while
• After λ·log(N) protocol periods; this is what provides the weak consistency (a sketch follows this list)
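
A sketch of such a piggyback buffer: recent membership updates ride on ping/ack messages and are garbage collected after about λ·log(N) piggybacks (the value of λ, the update representation, and all names are illustrative assumptions):

```python
import math

LAMBDA = 3  # dissemination parameter λ (assumed value)

class PiggybackBuffer:
    """Sketch of infection-style dissemination: buffer recent membership
    updates and piggyback them on failure-detector (ping/ack) messages."""

    def __init__(self, n_members):
        # Garbage-collect an update after ~ λ·log(N) piggybacks.
        self.max_piggybacks = int(LAMBDA * math.log(max(n_members, 2)))
        self.counts = {}  # update, e.g. ("failed", pid) -> piggyback count

    def add_update(self, update):
        self.counts[update] = 0

    def pick_for_message(self, limit=3):
        """Attach up to `limit` of the least-gossiped updates to a message."""
        chosen = sorted(self.counts, key=self.counts.get)[:limit]
        for u in chosen:
            self.counts[u] += 1
            if self.counts[u] >= self.max_piggybacks:
                del self.counts[u]  # old news: stop piggybacking it
        return chosen
```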

#### Suspicion Mechanism

• False detections occur due to:
• Perturbed (e.g., temporarily slow or overloaded) processes
• Packet losses, e.g., from congestion
• Indirect pinging may not solve the problem
• e.g., correlated message losses near the pinged host
• Key: suspect a process before declaring it failed to the group

• Distinguish multiple suspicions of a process
• Per-process incarnation number
• The inc# for pi can be incremented only by pi itself
• Typically pi does this when it receives a (Suspect, pi) message about itself
• Somewhat similar to DSDV (a routing protocol used in ad hoc networks)
• Higher inc# notifications override lower inc#’s
• Within an inc#: (Suspect, inc#) overrides (Alive, inc#)
• (Failed, inc#) overrides everything else (see the sketch below)
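
The override rules can be sketched as a precedence test between messages about the same process (the message representation is an assumption for illustration):

```python
# Sketch of the suspicion override rules. A message about process p is a
# (state, incarnation) pair with state in {"alive", "suspect", "failed"}.

def overrides(new, old):
    """Does message `new` override the currently held message `old` about p?"""
    new_state, new_inc = new
    old_state, old_inc = old
    if new_state == "failed":
        return True                   # (Failed, inc#) overrides everything
    if old_state == "failed":
        return False
    if new_inc != old_inc:
        return new_inc > old_inc      # higher incarnation numbers win
    # Within the same incarnation: Suspect overrides Alive.
    return new_state == "suspect" and old_state == "alive"
```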

#### Wrap Up

• Failures are the norm, not the exception, in datacenters
• Every distributed system uses a failure detector
• Many distributed systems use a membership service
• Ring failure detection underlies
• IBM SP2 and many other similar clusters/machines
• Gossip-style failure detection underlies
• Amazon EC2/S3 (rumored!)

## Grids

### Grid Applications

#### Example: Rapid Atmospheric Modeling System, ColoState U

• Hurricane Georges, 17 days in Sept 1998
• “RAMS modeled the mesoscale convective complex that dropped so much rain, in good agreement with recorded data”
• Used 5 km spacing instead of the usual 10 km
• Ran on 256+ processors
• Computation-intensive computing (or HPC = High Performance Computing)
• Can one run such a program without access to a supercomputer?

#### Distributed Computing Resources

For example, at certain times of the day, the workstations/clusters at Madison, MIT, and NCSA are free and available for running your application.

#### An Application Coded by a Physicist/Biologist/Meteorologist

So how do we schedule the jobs?

### Grid Infrastructure

#### Intra-Site Protocol

Condor (now HTCondor)

• High-throughput computing system from U. Wisconsin Madison
• Belongs to a class of Cycle-scavenging systems

Such systems

• Run on many workstations
• When a workstation is free, ask the site’s central server (or Globus) for tasks
• If the user hits a keystroke or mouse click, stop the task (returning the machine to its owner)
• Can also run on dedicated machines

#### Inter-Site Protocol

Globus

• Globus Alliance involves universities, national US research labs, and some companies
• Standardized several things, especially software tools
• Separate but related: the Open Grid Forum
• Globus Alliance has developed the Globus Toolkit

#### Security Issues

• Important in Grids because they are federated, i.e., no single entity controls the entire infrastructure

• Single sign-on: a collective job set should require once-only user authentication

• Mapping to local security mechanisms: some sites use Kerberos, others use Unix mechanisms

• Delegation: credentials to access resources are inherited by subcomputations, e.g., from job 0 to job 1

• Community authorization: e.g., third-party authentication

• These are also important in clouds, but less so because clouds are typically run under a central control

• In clouds the focus is on failures, scale, on-demand nature

#### Summary

• Grid computing focuses on computation-intensive computing (HPC)
• Though often federated, their architecture and key concepts have a lot in common with those of clouds
• Are Grids/HPC converging towards clouds?
• E.g., Compare OpenStack and Globus