Be insensitive, and keep a lifelong growth mindset.

0%

Orientation, Introduction to Clouds, MapReduce

Class notes 1 for Cloud Computing Concepts by Indranil Gupta @ Coursera

(初稿完成于 02/25/2017)

Pre-Requirements

Basic Data Structures

  • Queue: First-in First-out

  • Stack: First-in Last-out

    Process: A Program in Action

    Computer Architecture (Simplified)

    Simplified Computer Architecture

  • A program you write (C++/Java etc.) gets compiled to low-level machine instructions

    • Stored in file system on disk
  • CPU loads instructions in batches into memory(and cache, and registers)

  • As it executes each instruction, CPU loads data for instruction into memory (and cache, and registers)

    • And does any necessary stores into memory
  • Memory can also be flushed to disk

    Big O() Notation

  • One of the most basic ways of analyzing algorithms

  • Describes upper bound on behavior of algorithm as some variable is scaled(increased) to infinity

  • Analyzes run-time (or another performance metric)

  • Worst-case performance

    Basic Probability

    DNS

  • Domain Name Service

  • Collection of servers, throughout the world

  • Input to DNS: a domain name, e.g., coursera.org

  • Output from DNS: IP address of a web server(that hosts that content)

  • IP address may refer to either

    • Web server actually hosting that content, or
    • An indirect server, e.g., a CDN(content distribution network)server, e.g., from Akamai

      Graphs

Introduction to Clouds, MapReduce

Introduction to Clouds

Why Cloud?

Two Categories of Clouds

  • Public Clouds: provides service to any paying customer
    • Amazon S3(Simple Storage Service): store arbitrary datasets, pay per GB-month stored
    • Amazon EC2(Elastic Compute Cloud): upload and run arbitrary OS image, pay per CUP hour used
    • Google App Engine/Compute Engine: develop applications within their App Engine framework, upload data that will be imported into their format, and run
  • Private Clouds: accessible only to company employees

Customers save time and money by using clouds.

What is a Cloud?

Cloud = Lots of storage + compute cycles nearby

  • A single-site cloud (aka “Datacenter”) consists of
    • Compute nodes (grouped into racks)
    • Switches, connecting the racks
    • A network topology, e.g., hierarchical
    • Storage (backend) nodes connected to the network
    • Front-end for submitting jobs and receiving client requests
    • Software Services
  • A geographically distributed cloud consists of
    • Multiple such sites
    • Each site perhaps with a different structure and services

Introduction to Clouds: History

Trends: Technology

  • Doubling periods - storage: 12 months, bandwidth: 9 months(Moore’s law)
  • Then and Now
    • Bandwidth
      • 1985: mostly 56Kbps links nationwide
      • 2012: Tbps links widespread
  • Disk capacity
    • Today’s PCs have TBs, far more than a 1990 super computer
  • Biologists
    • 1990: were running small single-molecule simulations
    • 2012: CERN’s Large Hadron Collider producing many PB/year

In 1965, MIT’s Fernando Corbato and the other designers of the Multics operating system envisioned a computer facility operating “like a power company or water company.”

Plug your thin client into the computing utility and play your favorite Intensive Compute & Communicate Application

  • Have today’s clouds brought us closer to this realty? :)

Introduction to Clouds: What’s New in Today’s Clouds

A Cloud History of Time(1)

A Cloud History of Time(2)

Four Features New in Today’s Clouds

  1. Massive Scale
  • Facebook servers[GigaOM, 2012]:
    • 30K in 2009 -> 60K in 2010 -> 180K in 2012
  • Microsoft[NYTimes, 2008]:
    • 150K machines
    • Growth rate of 10K per month
    • 80K total running Bing
  • Yahoo![2009]:
    • 100K
    • Split into clusters of 4000
  • AWS EC2[Randy Bias, 2009]
    • 40,000 machines
    • 8 cores/machine
  • eBay[2012]: 50K machines
  • HP[2012]: 380K in 180 DCs
  • Google: A lot
  1. On-demand access: Pay-as-you-go, no upfront commitment
  • Anyone can access it
  1. Data-intensive Nature: What was MBs has now become TBs, PBs, and XBs.
  • Daily logs, forensics, Web data, etc.
  • Humans have data numbness: Wikipedia (large) compress is only about 10GB!
  1. New Cloud Programming Paradigms: MapReduce/Hadoop, NoSQL/Cassandra/MongoDB and many others
  • High in accessibility and ease of programmability
  • Lots of open-source

Introduction to Clouds: New Aspects of Clouds

On-Demand Access: *AAS Classification

  • On-demand: renting a cab vs (previously) renting a car, or buying one. E.g.:
    • AWS Elastic Compute Cloud(EC2): a few cents to a few $ per CPU hour
    • AWS Simple Storage Servie (S3): a few cents to a few & per GB-month
  • HaaS: Hardware as a Service
    • You get access to barebones hardware machines, do whatever you want with them, Ex: your own cluster
    • Not always a good idea because of security risks
  • IaaS: Infrastructure as a Service
    • You get access to flexible computing and storage infrastructure. Virtualization is one way of achieving this (what’s another way, e.g., using Linux). Often said to subsume HaaS.
    • Ex: Amazon Web Service (AWS: EC2 and S3), Eucalyptus, Rightscale, Microsoft Azure.
  • PaaS: Platform as a Service
    • You get access to flexible computing and storage infrastructure, coupled with a software platform(often tightly)
    • Ex: Google’s AppEngine (Python, Java, Go)
  • SaaS: Software as a Service
    • You get access to software services, when you need them. Often said to subsume SOA (Service Oriented Architectures)
    • Ex: Google docs, MS Office on demand

Data-Intensive Computing

  • Computation-Intensive Computing
    • Example areas: MPI-based, High-performance computing, Grids
    • Typically run on supercomputers(e.g., NCSA Blue Waters)
  • Data-Intensive
    • Typically store data at datacenters
    • Use compute nodes nearby
    • Compute nodes run computation services
  • In data-intensive computing, the focus shifts from computation to the data: CPU utilization no longer the most important resource metric, instead I/O is (disk and/or network)

New Cloud Programming Paradigms

  • Easy to write and run highly parallel programs in new programming paradigms:
    • Google: MapReduce and Sawzall
    • Amazon: Elastic MapReduce service(pay-as-you-go)
    • Google(MapReduce)
      • Indexing: a chain of 24MapReduce jobs
      • ~200K jobs processing 50PB/month (in 2006)
  • Yahoo!(Hadoop + Pig)
    • WebMap: chain of 100 MapReduce jobs
    • 280TB of data, 2500 nodes, 73 hours
  • Facebook(Hadoop + Hive)
    • ~300TB total, adding 2TB/day(in 2008)
    • 3K jobs processing 55TB/day
  • Similar numbers from other companies, e.g., Yieldex, harmony.com, etc.
  • NoSQL: MySQL is an industry standard, but Cassandra is 2400 times faster!

Introduction to Clouds: Economics of Clouds

  • Two Categories of Clouds

    • Private clouds are accessible only to company employees
    • Public clouds provide service to any paying customer
  • Singe Site Cloud: To Outsource or Own?

    • Medium-sized organization: wishes to run a service for M months
      • Service requires 128 servers(1024 cores) and 524 TB
      • Same as UIUC CCT cloud site
  • Outsource (e.g., via AWS): monthly cost

    • S3 costs: $0.12 per GB month. EC2 costs: $0.10 per CPU hour (costs from 2009)
    • Storage = $0.12 x 524 x 1000 = $62 K
    • Total = Storage + CPUs = $62K + $0.10 x 1024 x 24 x 30 = $136 K
  • Own: monthly cost

    • Storage ~ $349 K / M
    • Total ~ $1555K / M + 7.5K(includes 1 sysadmin / 100 nodes)
      • Using 0.45:0.4:0.15 split for hardware:power:network and 3 year life time of hardware
  • Breakdown analysis: more preferable to own if:

    • $349 K / M < $62 K (storage)
    • $1555K / M + 7.5 K < $136K (overall)
  • Break even points

    • M > 5.55 months (storage)
    • M > 12 months (overall)
  • As a result

    • Startups use clouds a lot
    • Cloud providers benefit monetarily most from storage
  • Summary

    • Clouds build on many previous generations of distributed systems
    • Especially the timesharing and data processing industry of the 1960-70s.
    • Need to identify unique aspects of a problem to classify it as a new cloud computing problem
      • Scale, On-demand access, data-intensive, new programming
    • Otherwise, the solutions to your problem may already exist!

Clouds are Distributed Systems

A cloud is a Distributed System

  • A cloud consists of
    • Hundreds to thousands of machines in a datacenter (server side)
    • Thousands to millions of machines accessing these services (client side)
  • Servers communicate amongst one another -> Distributed System
  • Clients communicate with servers -> Also a distributed system!
  • Clients also communicate with each other
    • In peer-to-peer systems like BitTorrent
    • Also a distributed system!

Four Features of Clouds = All Distributed Systems Features!

  1. Massive Scale: many servers
  2. On-demand nature: access (multiple) servers anywhere
  3. Data-Intensive Nature: lots of data => need a cluster (multiple machines) to store
  4. New Cloud Programming Paradigms: Hadoop/Mapreduce, NoSQL all need clusters

Cloud = A Fancy Word for a Distributed System

  • A “cloud” is the latest nickname for a distributed system
  • Previous nicknames for “distributed system” have included
    • Peer-to-peer systems
    • Grids
    • Clusters
    • Timeshared computers (Data Processing Industry)
  • Nicknames come and go, but the core concepts underlying distributed systems stay the same
    • And they are used decide after decade
      • E.g., Lamppost Timestamps were invented in the 1970s, and they are used in almost all distributed/cloud systems today
    • This course is about these distributed systems concepts
  • A few years from now, there may be a new nickname for distributed systems
    • The core concepts will remain the same, and they will continue to be used in real systems

What is a Distributed System?

A distributed system is a collection of entities, each of which is autonomous, programmable, asynchronous and failure-prone, and which communicate through an unreliable communication medium.

Distributed System = Many Processes Sending and Receiving Messages through Unreliable Communication Network

A range of interesting problems for distributed system designers

  • P2P systems [Gnutella, Kazaa, BitTorrent]
  • Cloud Infrastructures [AWS, Azure, Google Cloud]
  • Cloud Storage [Key-value stores, NoSQL, Cassandra]
  • Cloud Programming [MapReduce, Storm, Pregel]
  • Coordination [Paxos, Leader Election, Snapshots]
  • Managing Many Clients and Servers Concurrently [Concurrency Control, Replication Control]
  • (and many more that you’ll see in this course!)

Challenges in solving these problems

  • Failure: no longer the exception, but rather a norm
  • Scalability: 1000s of machines, Terabytes of data
  • Asynchrony: clock skew and clock drift
  • Concurrency: 1000s of machines interacting with each other accessing the same data

MapReduce

MapReduce Paradigm

  • Terms are borrowed from Functional Language(e.g., Lisp)
  • Sample Application: Wordcout
  • Map:
    • Parallelly Process a large number of individual records to generate intermediate key/value pairs.

Map Task

Two Map Tasks

Many Map Tasks

  • Reduce:
    • Processes and merges all intermediate values associated per key
    • Each key assigned to one Reduce
    • Parallelly Processes and merges all intermediate values by partitioning keys
    • Popular: Hash partitioning, i.e., key is assigned to reduce #= hash(key)% number of reduce servers

Reduce Task

Two Reduce Tasks

MapReduce Examples

Distributed Grep

  • Input: large set of files
  • Output: lines that match pattern
  • Map: Emits a line if it matches the supplied pattern
  • Reduce: Copies the intermediate data to output

Reverse Web-Link Graph

  • Input: Web graph: tuples(a, b) where (page a -> pageb)
  • Output: For each page list of pages that link to it
  • Map: process web log and for each input <source, target>, it outputs <target, source>
  • Reduce: emits <target, list>

Count of URL access frequency

  • Input: Log of accessed URLs, e.g., from proxy server
  • Output: For each URL, % of total accesses for that URL
  • Map: Process web log and outputs <URL, 1>
  • Multiple Reducers : Emits [URL, URL_count]
    (So far, like Wordcount. But still need %)
  • Chain another MapReduce job after above one
  • Map: Process <URL, URL_count> and outputs <1, (<URL, URL_count>)>
  • 1 Reducer: Sums up URL_count’s to calculate overall_count.
    • Emits multiple <URL, URL_count/overall_count>

Sort
Map task’s output is sorted (e.g., quicksort)
Reduce task’s input is sorted (e.g., mergesort)

  • Input: Series of (key, value) pairs
  • Output: Sorted s
  • Map: <key, value> -> <value, _> (identity)
  • Reducer: <key, value> -> <key, value> (identity)
  • Partitioning function - partition keys across reducers based on ranges
    • Take data distribution into account to balance reducer tasks

MapReduce Scheduling

Externally: For user

  • Write a Map program (short), write a Reduce program(short)
  • Submit job, wait for result
  • Need to know nothing about parallel/distributed programming!

Internally: For the Paradigm and Scheduler

  • Parallelize Map
  • Transfer data from Map to Reduce
  • Parallelize Reduce
  • Implement Storage for Map input, Map output, Reduce input, and Reduce output

(Ensure that no Reduce starts before all Maps are finished. That is, ensure the barrier between the Map phase and Reduce phase)

For the cloud:

  • Parallelize Map: easy! each map task is independent of the other!
    • All Map output records with same key assigned to same Reduce
  • Transfer data from Map to Reduce
    • All Map output records with same key assigned to same Reduce task
    • Use partitioning function, e.g., hash(key) % number of reducers
  • Parallelize Reduce: easy! each reduce task is independent of the other!
  • Implement Storage for Map input, Map output, Reduce input, and Reduce output.
    • Map input: from distributed file system
    • Map output: to local disk (at Map node); uses local file system
    • Reduce input: from (multiple) remote disks; uses local file systems
    • Reduce output: to distributed file system
  • local file system = Linux FS, etc.
  • distributed file system = GFS (Google File System), HDFS (Hadoop Distributed File System)

The YARN Scheduler

  • Used in Hadoop 2.x +
  • YARN = Yet Another Resource Negotiator
  • Treats each server as a collection of containers
    • Container = some CPU + some memory
  • Has 3 main components
    • Global Resource Manager (RM)
      • Scheduling
    • Pre-server Node Manager(NM)
      • Daemon and server-specific functions
    • Per-application (job) Application Master (AM)
      • Container negotiation with RM and NMs
      • Detecting task failures of that job

How a Job Gets a Container

MapReduce Fault-Tolerance

Fault Tolerance

  • Server Failure
    • NM heartbeats to RM
      • If server fails, RM lets all affected Ams know, and Ams take action.
    • NM keeps track of each task running at its server
      • If task fails while in-progress, mark the task as idle and restart it
    • AM heartbeats to RM
      • On failure, RM restarts AM, which then syncs up with its running tasks
  • RM Failure
    • Use old checkpoints and bring up secondary RM
  • Heartbeats also used to piggyback container requests

Stragglers (slow nodes)

  • The slowest machine slows the entire job down (why?)
  • Due to Bad Disk, Network Bandwidth, CPU, or Memory
  • Keep track of “progress” of each task (%done)
  • Perform backup (replicated) execution of straggler task: task considered done when first replica completed. Called Speculative Execution.

Locality

  • Since cloud has hierarchical topology (e.g., racks)
  • GFS/HDFS stores 3 replicas of each of chins (e.g., 64 MB in size)
    • Maybe on different racks, e.g., 2 on a rack, 1 on a different rack
    • Mapreduce attempts to schedule a map task on
      • a machine that contains a replica of corresponding input data, or failing that.
      • on the same rack as a machine containing the input, or failing that.
      • Anywhere

Summary

  • Mapreduce uses parallelization + aggregation to schedule applications across clusters
  • Need to deal with failure
  • Plenty of ongoing research work in scheduling and fault-tolerance for Mapreduce and Hadoop