Be insensitive, and keep a lifelong growth mindset.

0%

Gossip, Membership, and Grids

Class notes 2 for Cloud Computing Concepts by Indranil Gupta @ Coursera

(更新于 03/19/2017)

This chapter mainly introduces the following two building blocks for distributed system (in the cloud)

  • Gossip or Epidemic Protocols
  • Failure detection and Membership Protocols

Gossip

Multicast Problem

What is multicast?

  • Sending information to other nodes in the group;
  • Opposed to broadcast, multicast is more restricted and only sends info to a particular group of nodes, instead of everyone.

What are the requirements for multicast protocol?

  • Fault tolerance
  • Scalability

Fault-Tolerance and Scalability

One of the simplest ways of doing multicast is a centralized approach.
Centralized Multicast Protocol

Problems with centralized approach:

  • Fault tolerance: if senders fails, all rest fails;
  • Overhead is high, O(N) time complexity.

A better approach, Tree-Based Multicast Protocols
Tree-Based Multicast Protocol

Tree-Based Multicast Protocols

  • Build a spanning tree among the processes of the multicast group
  • Using spanning tree to disseminate multicasts(Latency reduced to O(log(N)))
  • Use either acknowledgements (ACKs) or negative acknowledgements (NAK) to repair multicasts not received
  • SRM (Scalable Reliable Multicast)
    • Use NAKs
    • But adds random delays, and uses exponential back off to avoid NAK storms
  • RMTP (Reliable Multicast Transport Protocol)
    • Use ACKs
    • But ACKs only sent to designated receivers, which then re-transimit missing multicasts
  • These protocols still cause an O(N) ACK/NAK overhead

The Gossip Protocol

A gossip protocol is a style of computer-to-computer communication protocol inspired by the form of gossip seen in social networks.

Periodically, a sender will select b random nodes from the group, and sends them copies of the gossip. B is a gossip parameter, and typically is a very small number, e.g., 2.

Gossip Protocol

  • “Push” Gossip:
    • Once you have a multicast message, you start gossiping about it.
    • Multiple messages? Gossip a random subset of them, or recently-received ones, or higher priority ones.
  • “Pull” Gossip:
    • Periodically poll a few randomly selected processes for new multicast messages that you haven’t received
    • Get those messages
  • Hybrid variant: Push-Pull
    • As the name suggests

Gossip Analysis

Claims of the simple Push protocol

  • Is lightweight in large groups
  • Spreads a multicast quickly
  • Is highly fault-tolerant

From old mathematical branch of Epidemiology

  • Population of (n + 1) individuals mixing homogeneously
  • Contact rate between any individual pair is β
  • At any time, each individual is either uninfected (numbering x) or infected (numbering y)
  • Then, x0 = n, y0 = 1, and at all times x + y = n + 1
  • Infected-uninfected contact turns latter infected, and it stays infected.

Analysis

xy represents all possible infected-uninfected pairs.

Analysis 2

  • Log(N) is not constant in theory
  • But pragmatically, it is a very slow growing number
  • Base 2
    • Log(1000) ~ 10
    • Log(1M) ~ 20
    • Log(1B) ~ 30
    • Log(all IPv4 address) = 32

Analysis 3

Fault-Tolerance

  • Packet loss
    • 50% packet loss: analyze with b replaced with b/2
    • To achieve same reliability as 0% packet loss, take twice as many rounds
  • Node failure
    • 50% of nodes fail: analyze with n replaced with n/2 and b replaced with b/2
    • Same as above

With failures, it is possible, but improbable that the epidemic will die out quickly.

Pull Gossip: Analysis
Pull Gossip

Pull Gossip 2

Gossip Implementations

Some Implementations:

  • Clearinghouse and Bayou projects: email and database transactions
  • refDBMS system [PODC ’87]
  • Binodal Multicast [ACM TOCS ’99]
  • Sensor networks [Li Li et al, Infocom ’02, and PBBF, ICDCS ’05]
  • AWS EC2 and S3 Cloud (rumored). [‘00s]
  • Cassandra key-value store (and others) use gossip for maintaining membership lists
  • Usenet NNTP (Network News Transport Protocol) [’79]

NNTP Inter-Server Protocol

  • Each client uploads and downloads news posts from a news server
  • Server retains news posts for a while, transmits them lazily, deletes them after a while.
    NNTP Inter-Server Protocol

Summary

  • Multicast is an important problem
  • Tree-based multicast protocols
  • When concerned about scale and fault-tolerance, gossip is an attractive solution
  • Also known as epidemics
  • Fast, reliable, fault-tolerant, scalable, topology-aware.

Membership

What is Group Membership List?

Failures are the norm

  • Say, the rate of failure of one machine (OS/Disk/motherboard/network, etc.) is once every 10 years (120 months) on average.
  • When you have 120 servers in the DC, the mean time to failure (MTTF) of the next machine is 1 month.
  • When you have 12,000 servers in the DC, the MTTF is about once every 7.2 hours!

Group Membership Service (Membership Protocol)
Group Membership Service

  • In each application process pi, there is a membership list that maintains a list of most, or all of non-faulty processes.
  • The membership list is a Almost-Complete list(weakly consistent)
  • Two sum-protocols
    • Failure detector: a mechanism that detects failures.
    • Dissemination: disseminates information about these failures after detection to other processes in the system.

The flow of failure detection and dissemination
Failure Detection and Dissemination

Failure Detectors

Distributed Failure Detectors: Desirable Properties

  • Completeness: each failure is detected
  • Accuracy: there is no mistaken detection
  • Speed: Time to first detection of a failure
  • Scale:
    • Equal Load on each member
    • Network Message Load

Completeness and Accuracy are impossible together in lossy networks [Chandra and Toueg], if possible, then can solve consensus!

In real world, Completeness is always guaranteed, and Accurary is partial/probabilistic guaranteed(almost 100%, but not exactly). Because when you have a failure, you definitely want to be able to detect it, recover from it, and make your data consistent again. You don’t want to miss any failures. However, if you mistakenly detect a failure, it’s fine for you to ask that poor and victim process to leave the system and rejoin again, perhaps at a different identifier. The overhead of doing that is less than if you do the reverse.

Here are three failure detection protocols.

Centralized Heartbeating

Centralized Heartbeating

Cons:

  • When pj fails, there is no guarantee about who detects its failure
  • Overloaded, hotspot issue.

Ring Heartbeating

Ring Heartbeating

Pros:

  • Avoid a hot spot, better than the centralized approach;
    Cons:
  • Might still have some failures undetected when you have multiple failures, e.g., both pi’s neighbors fail
  • Repairing the ring is also another overhead.

All-to-All Heartbeating

All-to-All Heartbeating
Each process pi sends out heartbeat to all the other processes in the system.
Pros:

  • Equal load per member, well distributed across all. (The load looks high, but tactually be not so bad)
  • The protocol is complete, if pi fails, as long as there is at least one other non-faulty process in the group it will detect pi as having failed.

Gossip-Style Membership

Gossip-style Heartbeating 1

Gossip-style Heartbeating 2

Gossip-style Failure Detection

  • If the heartbeat has not increased for more than T(fail) seconds, the member is considered failed.
  • And after T(cleanup) seconds, it will delete the member from the list.
  • Why two different timeouts?
    • If you delete the member right away where an entry might never go away.
    • For instance, if process 2 has an entry just failed, and deletes it. Then process 1 may have the same entry not failed yet, and sends it to process 2, the process 2 may consider it as a new entry and add it back with the new local time. With the ping pong behavior of gossip, this entry may never go away from the system.

Which is the best failure detector?

Optimal is derived from the main properties:

  • Completeness: Guarantee always
  • Accuracy: Probability PM(T)
  • Speed: T time units
    • Time to first detection of a failure
  • Scale:
    • Equal Load on each member
    • Network Message Load

All-to-all Heartbeating Load Analysis

All—to-All Heartbeating Load Analysis

Gossip Heartbeating Load Analysis

Gossip Heartibeating Load Analysis

Conclusion: Gossip-based heartbeating protocol has a higher load than the all-to-all heartbeating, it’s because gossip is trying to get a slightly better accuracy by using more messages.

What’s the best/optimal we can do?

Worst Case Load
Worst Case Load

Optimal Case Load
Optimal Case Load

Another Probabilistic Failure Detector

SWIM Failure Detector Protocol

SWIM Failure Detector Protocol

SWIM Versus Heartbeating

SWIM vs Heartbeating

SWIM Failure Detector

SWIM Failure Detector

Accuracy, Load

  • PM(T) is exponential in -K. Also depends on pml (and pf)
  • See paper

L/L* < 28, E[L]/L* for up to 15% loss rates

Detection Time

  • Prob. of being pinged in T’ = 1 - (1 - 1/N)^(N - 1) = 1 - e^(-1)
  • E[T] = T’ e / (e - 1)
  • Completeness: Any alive member detects failure
    • Eventually, every other non-faulty process that has this failed process in this list, will pick it as a pinged target.
    • By using a trick, within worst case O(N) protocol periods

Time-Bounded Completeness

  • Key: select each membership element once as a ping target in a traversal
    • Round-robin pinging
    • Random permutation of list after each traversal
  • Each failure is detected in worst case 2N-1 (local) protocol periods
  • Preserves FD properties

Dissemination and suspicion

Dissemination

Dissemination Options

  • Multicast (Hardware / IP)

    • Unreliable
    • Multiple simultaneous multicasts
  • Point-to-point (TCP / UDP)

    • Expensive
  • Zero extra messages: Piggyback on Failure Detector messages

    • Infection-style Dissemination

SWIM Failure Detector Protocol

Infection-Style Dissemination

  • Epidemic style dissemination
    • After λlog(N) protocol periods, N^(-(2λ - 2)) processes would not have heard about an update
  • Maintain a buffer of recently joined/evicted processes
    • Piggyback from this buffer
    • Prefer recent updates
  • Buffer elements are garbage collected after a while
    • After λlog(N) protocol periods, this defines weak consistency

Suspicion Mechanism

  • False detections, due to:

    • Perturbed processes
    • Packet losses, e.g., from congestion
  • Indirect pinging may not solve the problem

    • e.g., correlated message losses near pinged host
  • Key: suspect a process before declaring it as failed in the group

Suspicion Mechasnism

  • Distinguish multiple suspicions of a process
    • Pre-process incarnation number
    • Inc # for pi can be incremented only by pi
      • Typically Pi does this when it receives a suspect Pi message.
    • Somewhat similar to DSDV (a routing protocol that is used in ad hoc networks)
  • Higher inc# notifications over-ride lower inc#’s
  • Within an inc#: (Suspect inc #) > (Alive, int #)
  • (Failed, inc #) overrides everything else

Wrap Up

  • Failures the norm, not the exception in datacenter
  • Every distributed system uses a failure detector
  • Many distributed systems use a member ship service
  • Ring failure detection underlies
    • IBM SP2 and many other similar clusters/machines
  • Gossip-style failure detection underlies
    • Amazon EC2/S2 (rumored!)

Grids

Grid Applications

Example: Rapid Atmospheric Modeling System, ColoState U

  • Hurricane Georges, 17 days in Sept 1998
    • “RAMS modeled the mesoscale convective complex that dropped so much rain, in good agreement with recorded data”
    • Used 5 km spacing instead of the usual 10 km
    • Ran on 256+ processors
  • Computation-intensive computing (or HPC = High Performance Computing)
  • Can one run such a program without access to a supercomputer?

Distributed Computing Resources

Distributed Computing Resources

For example, at certain times of the day, the work stations/clusters at Madison, MIT and NCSA are free and available for running your application.

An Application Coded by a Physicist/Biologist/Meteorologist

An Application Coded by a Physicist/Biologist/Meteorologist 1

An Application Coded by a Physicist/Biologist/Meteorologist 2

So how do we schedule the jobs?

Grid Infrastructure

Scheduling Problem: 2-level Scheduling Infrastructure

2-level Scheduling Infrastructure

Intra-Site Protocol

Intra-Site Protocol

Condor (Now HTCondor)

  • High-throughput computing system from U. Wisconsin Madison
  • Belongs to a class of Cycle-scavenging systems

Such systems

  • Run on a lot of workstations
  • When workstation is free, ask site’s central server (or Globus) for tasks
  • If user hits a keystroke or mouse click, stop task
    • Either kill task or ask server to reschedule task
  • Can also run on dedicated machines

Inter-Site Protocol

Inter-Site Protocol

Globus

  • Globus Alliance involves universities, national US research labs, and some companies
  • Standardized several things, especially software tools
  • Separately but related: Open Grid Forum
  • Globus Alliance has developed the Globus Toolkit

Security Issues

  • Important in Grids because they are federated, i.e., no single entity controls the entire infrastructure

  • Singe sign-on: collective job set should require once-only user authentication

  • Mapping to local security mechanisms: some sites use Kerberos, others using Unix

  • Delegation: credentials to access resources inherited by subcomputations, e.g., job 0 to job1

  • Community authorization: e.g., third-party authentication

  • These are also important in clouds, but less so because clouds are typically run under a central control

  • In clouds the focus is on failures, scale, on-demand nature

Summary

  • Grid computing focuses on computation-intensive computing (HPC)
  • Though often federated, architecture and key concepts have a lot in common with that of clouds
  • Are Grids/HPC converging towards clouds?
    • E.g., Compare OpenStack and Globus