Introduction to Cloud Computing
http://net.pku.edu.cn/~course/cs402/2009/
彭波
[email protected]
北京大学信息科学技术学院
6/30/2009
大纲



云计算(Cloud Computing)是?
大规模数据处理是?
我们这门课的目标和内容是?
2
云计算(Cloud Computing)
What is Cloud Computing?
1.
2.
3.
First write down your own opinion about “cloud
computing” , whatever you thought about in
your mind.
Question: What ? Who? Why? How? Pros and
cons?
The most important question is: What is the
relation with me?
4
Cloud Computing is…




No software
access everywhere by Internet
power -- Large-scale data processing
Appeal for startups




Cost efficiency
实在是太方便了
Software as platform
SaaS
PaaS
Utility Computing
Cons


Security
Data lock-in
5
Software as a Service (SaaS)

a model of software deployment whereby a
provider licenses an application to customers for
use as a service on demand.
6
Platform as a Service (PaaS)

对于开发Web Application和Services,PaaS提供了一
整套基于Internet的,从开发,测试,部署,运营到维护
的全方位的集成环境。特别它从一开始就具备了Multitenant architecture,用户不需要考虑多用户并发的问
题,而由platform来解决,包括并发管理,扩展性,失效
恢复,安全。
7
Utility Computing

“pay-as-you-go” 好比让用户把电源插头插在墙上,你得
到的电压和Microsoft得到的一样,只是你用得少,pay
less;utility computing的目标就是让计算资源也具有这
样的服务能力,用户可以使用500强公司所拥有的计算资
源,只是use less pay less。这是cloud computing的一
个重要方面
8
Cloud Computing is…
9
Key Characteristics
illusion of infinite
computing resources
available on demand;

elimination of an up-front
commitment by Cloud
users; 创业启动花费

ability to pay for use of
computing resources on a
very large datacenters
short-term basis as
needed。小时间片的
large-scale software infrastructure
billing,报告指出utility
computing在这一点上的实
operational expertise
践是失败的

10
Why now?


very large-scale datacenter的实践,
因为新的技术趋势和Business模式

pay-as-you-go computing
11
Key Players



Amazon Web Services
Google App Engine
Microsoft Windows Azure
12
Key Applications




Mobile Interactive applications, Tim O’Reilly相信未来是
属于能够实时对用户提供信息的服务。Mobile必定是关
键。而后台在datacenter中运行是很自然的模式,特别是
那些mashup融合类型的服务。
Parallel batch processing。大规模数据处理使用Cloud
Computing技术很自然,MapReduce,Hadoop在这里起
到重要作用。这里,数据移入/移出cloud是很大的开销,
Amazon开始尝试host large public datasets for free。
The rise of analytics。数据库应用中transaction based应
用还在增长,而analytics的应用增长迅速。数据挖掘,用
户行为分析等应用的巨大推动。
Extension of compute-intensive desktop application。计
算密集型的任务,说matlab, mathematica都有了cloud
computing的扩展,woo~
13
Cloud Computing = Silver Bullet?

Google文档在3月7日发生
了大批用户文件外泄事
件。美国隐私保护组织就
此提请政府对Google采取
措施,使其加强云计算产
品的安全性。

14
Problem of Data Lock-in
Challenges
15
Some other Voices
The interesting thing about Cloud Computing is that we’ve redefined
Cloud Computing to include everything that we already do. . . . I
don’t understand what we would do differently in the light of Cloud
Computing other than change the wording of some of our ads.
Larry Ellison, quoted in the Wall Street Journal, September 26, 2008
It’s stupidity. It’s worse than stupidity: it’s a marketing hype
campaign. Somebody is saying this is inevitable — and
whenever you hear somebody saying that, it’s very likely to be
a set of businesses campaigning to make it true.
Richard Stallman, quoted in The Guardian, September 29,
2008
16
What’s matter with ME?!

What you want to do with 1000pcs, or even
100,000 pcs?
17
Cloud is coming…
Google alone has 450,000
systems running across 20
datacenters, and Microsoft's
Windows Live team is doubling
the number of servers it uses
every 14 months, which is faster
than Moore's Law
“Data Center is a Computer”
Parallelism everywhere
Massive Scalable Reliable
Resource Management
Data Management
Programming Model & Tools
18
大规模数据处理
20
Happening everywhere!
microarray chips
Molecular biology
(cancer)
fiber optics
microprocessors
Network traffic (spam)
300M/day
Simulations
(Millennium)
particle colliders
Particle events (LHC)
1B
21
1M/sec
22
Maximilien Brice, © CERN
23
Maximilien Brice, © CERN
24
Maximilien Brice, © CERN
25
Maximilien Brice, © CERN
How much data?





Internet archive has 2 PB of data + 20 TB/month
Google processes 20 PB a day (2008)
“all words ever spoken by human beings” ~ 5 EB
CERN’s LHC will generate 10-15 PB a year
Sanger anticipates 6 PB of data in 2009
640K ought to be
enough for
anybody.
26
NERSC User George Smoot wins
2006 Nobel Prize in Physics
Smoot and Mather 1992
COBE Experiment showed
anisotropy of CMB
Cosmic Microwave
Background Radiation
(CMB): an image of the
universe at 400,000 years
27
The Current CMB Map
source J. Borrill, LBNL
• Unique imprint of primordial physics through the tiny anisotropies in
temperature and polarization.
• Extracting these Kelvin fluctuations from inherently noisy data is a
serious computational challenge.
28
Evolution Of CMB Data Sets: Cost >
O(Np^3Limiting
)
Experiment
Nt
Np
Nb
Data
COBE (1989)
2x109
6x103
3x101
Time
Satellite,
BOOMERanG
(1998)
3x108
5x105
3x101
Pixel
Balloon, 1st HPC/NERSC
(4yr) WMAP (2001)
7x1010
4x107
1x103
?
Satellite, Analysis-bound
Planck (2007)
5x1011
6x108
6x103
Time/ Pixel
Satellite,
Major HPC/DA effort
POLARBEAR (2007)
8x1012
6x106
1x103
Time
Ground,
NGmultiplexing
CMBPol (~2020)
1014
109
104
Time/ Pixel
Satellite,
Early
planning/design
data compression
29
Notes
Workstation
Example: Wikipedia Anthropology
Kittur, Suh, Pendleton (UCLA, PARC), “He Says,
She Says: Conflict and Coordination in Wikipedia”
CHI, 2007
Increasing fraction of edits are for
work indirectly related to articles

Experiment




Download entire revision
history of Wikipedia
4.7 M pages, 58 M
revisions, 800 GB
Analyze editing patterns &
trends
Computation

30
Hadoop on 20-machine
cluster
Example: Scene Completion
Hays, Efros (CMU), “Scene Completion Using
Millions of Photographs” SIGGRAPH, 2007

Image Database Grouped by
Semantic Content





30 different Flickr.com groups
2.3 M images total (396 GB).

Select Candidate Images Most
Suitable for Filling Hole



Classify images with gist scene
detector [Torralba]
Color similarity
Local context matching
Computation


Extension

31
Index images offline
50 min. scene matching, 20
min. local matching, 4 min.
compositing
Reduces to 5 minutes total by
using 5 machines
Flickr.com has over 500 million
images …
Example: Web Page Analysis
Fetterly, Manasse, Najork, Wiener (Microsoft, HP),
“A Large-Scale Study of the Evolution of Web
Pages,” Software-Practice & Experience, 2004

Experiment


Use web crawler to gather
151M HTML pages weekly
11 times
 Generated 1.2 TB log
information
Analyze page statistics and
change frequencies

Systems Challenge
“Moreover, we experienced a
catastrophic disk failure
during the third crawl,
causing us to lose a quarter
of the logs of that crawl.”
32
GATGCTTACTATGCGGGCCCC
CGGTCTAATGCTTACTATGC
GCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
TAATGCTTACTATGC
AATGCTTAGCTATGCGGGC
AATGCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
?
CGGTCTAGATGCTTACTATGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAATGCTTAGCTATGC
ATGCTTACTATGCGGGCCCCTT
Reads
Subject
genome
Sequencer
33
DNA Sequencing

Genome of an organism encodes genetic
information in long sequence of 4 DNA
nucleotides: ATCG



Current DNA sequencing machines can generate
1-2 Gbp of sequence per day, in millions of short
reads (25-300bp)



ATCTGATAAGTCCCAGGACTTCAGT
GCAAGGCAAACCCGAGCCCAGTTT
TCCAGTTCTAGAGTTTCACATGATC
Bacteria: ~5 million bp
Humans: ~3 billion bp
Shorter reads, but much higher throughput
Per-base error rate estimated at 1-2% (Simpson,
et al, 2009)
Recent studies of entire human genomes have
used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et
al., 2008) billion 36bp reads

~144 GB of compressed sequence data
GGAGTTAGTAAAAGTCCACATTGAG
34
Subject reads
CTATGCGGGC
AT CTATGCGG
TCTAGATGCT
GCTTAT CTAT
AT CTATGCGG
AT CTATGCGG
AT CTATGCGG
TTA T CTATGC
CTATGCGGGC
GCTTAT CTAT
CTAGATGCTT
Alignment
CGGTCTAGATGCTTAGCTATGCGGGCCCCTT
Reference sequence
35
Subject reads
ATGCGGGCCC
CTAGATGCTT CTATGCGGGC
TCTAGATGCT ATCTATGCGG
CGGTCTAG
ATCTATGCGG
CTT
CGGTCT
TTATCTATGC
CCTT
CGGTC
GCTTATCTAT
GCCCCTT
GCTTATCTAT
CGG
GGCCCCTT
CGGTCTAGATGCTTATCTATGCGGGCCCCTT
Reference sequence
36
Example: Bioinformatics
Michael Schatz. CloudBurst: Highly
Sensitive Read Mapping with
MapReduce. Bioinformatics, 2009, in
press.

Evaluate running time on local 24 core cluster

Running time increases linearly with the number of
reads
37
Example: Data Mining
Haoyuan Li, Yi Wang, Dong Zhang,
Ming Zhang, Edward Y. Chang: Pfp:
parallel fp-growth for query
recommendation. RecSys 2008: 107114

38
del.icio.us crawl->a
bipartite graph
covering 802739
Webpages and
1021107 tags.
大规模数据处理+云计算
An Example
数据处理任务


词频统计:统计一个文档集中每个词出现的次数
Try on these collection:


2006年初,我们在国内搜集了870 Million 不同网页,共
约2 TB.
商业搜索引擎Google, Yahoo等,收集网页数量在100+
Billion pages
怎样处理海量数据?
40
Divide and Conquer
“Work”
Partition
w1
w2
w3
“worker”
“worker”
“worker”
r1
r2
r3
Combine
“Result”
41
What’s Mapreduce

Parallel/Distributed Computing Programming
Model
Input split
shuffle
42
output
Typical problem solved by MapReduce


读入数据: key/value 对的记录格式数据
Map: 从每个记录里extract something


Shuffle: 混排交换数据


把相同key的中间结果汇集到相同节点上
Reduce: aggregate, summarize, filter, etc.


map (in_key, in_value) -> list(out_key, intermediate_value)
 处理input key/value pair
 输出中间结果key/value pairs
reduce (out_key, list(intermediate_value)) -> list(out_value)
 归并某一个key的所有values,进行计算
 输出合并的计算结果 (usually just one)
输出结果
43
Word Frequencies in Web pages


输入:one document per record
用户实现map function,输入为



key = document URL
value = document contents
map输出 (potentially many) key/value pairs.

对document中每一个出现的词,输出一个记录<word, “1”>
44
Example continued:


MapReduce运行系统(库)把所有相同key的记录收集到一
起 (shuffle/sort)
用户实现reduce function对一个key对应的values计算


求和sum
Reduce输出<key, sum>

45
MapReduce Runtime System
46
History of Hadoop













2004 - Initial versions of what is now Hadoop Distributed File System and
Map-Reduce implemented by Doug Cutting & Mike Cafarella
December 2005 - Nutch ported to the new framework. Hadoop runs reliably
on 20 nodes.
January 2006 - Doug Cutting joins Yahoo!
February 2006 - Apache Hadoop project official started to support the
standalone development of Map-Reduce and HDFS.
March 2006 - Formation of the Yahoo! Hadoop team
May 2006 - Yahoo sets up a Hadoop research cluster - 300 nodes
April 2006 - Sort benchmark run on 188 nodes in 47.9 hours
May 2006 - Sort benchmark run on 500 nodes in 42 hours (better hardware
than April benchmark)
October 2006 - Research cluster reaches 600 Nodes
December 2006 - Sort times 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500
nodes in 5.2 hrs, 900 nodes in 7.8
January 2006 - Research cluster reaches 900 node
April 2007 - Research clusters - 2 clusters of 1000 nodes
Sep 2008 - Scaling Hadoop to 4000 nodes at Yahoo!
47
From Theory to Practice
1. Scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
Hadoop Cluster
You
5. Move data out of HDFS
6. Scp data from cluster
48
课程目标和内容
课程目标





掌握MapReduce编程模型与运行环境的使用。
掌握算法在MapReduce模型下并行化的基本方法。
了解MapReduce运行分布式环境的实现技术。
了解云计算中大规模数据处理和算法并行化技术的
发展现状和关键问题。
了解并培养并行化思考问题的习惯。
50
课程内容
LEC# TOPICS
1
课程介绍-云计算
MapReduce环境
2
MapReduce原理
Inverted Index问题
3
4
并行与分布式系统基础
ABSTRACT
围绕大规模数据处理为背景介绍云计算技术
以MapReduce为平台展开讲授和实践,是课程的
中心。
从函数式语言谈MapReduce的基本原理
分析Inverted Index问题及其MapReduce实现
PageRank问题
介绍大规模并行分布式系统的设计
分析PageRank问题及其MapReduce实现
MapReduce系统设计与
实现
分析MapReduce的系统设计和考虑
分析Clustering问题及其MapReduce实现
Clustering问题
频繁集挖掘问题
介绍MapReduce之上的应用和发展
分析频繁集挖掘问题及其MapReduce实现
6
项目讨论
课程项目讨论
7
特邀报告
邀请学术界或业界研究技术人员报告
8
项目报告
学生课程项目报告
51
5
MapReduce高层应用
Grading Policy
Lab
Lab
Lab
Lab
1
2
3
4
- Introduction to Hadoop, Eclipse
– A Simple Inverted Index
- PageRank over Wikipedia Corpus
– Clustering the Netflix Movie Data



Hw1 - Read - Intro Distributed system;
Intro MapReduce Programming.
Hw2 - Read MapReduce[1]
Hw3 – Read GFS[2]
Hw4 – Read Pig Latin[3]
52
30%
Assignments
20%
Readings
50% Course
project
课程的要求

熟练一种Programming Language

Lots of java programming practices
53
Teachers and Resources

课程网站



陈日闪助教
http://net.pku.edu.cn/~cour
se/cs402/2009/
http://groups.google.com/g
roup/cs402pku
Hadoop 主页


闫宏飞老师
讨论组



http://hadoop.apache.org/c
ore/
Resources

http://net.pku.edu.cn/~cour
se/cs402/2008/resource.ht
ml
54
Homework

登记


组成小组



3-4人,为课程project准备
跨专业方向很好
Lab1


http://net.pku.edu.cn/~course/cs402/2009/
Lab 1 - Introduction to Hadoop, Eclipse
HW Reading1

Intro Distributed system; Intro Parallel Programming.
 http://code.google.com/edu/parallel/dsd-tutorial.html
 http://code.google.com/edu/parallel/mapreduce-tutorial.html
55
Summary

CloudComputing brings





How to make it real?



56
Possible of using unlimited
resources on-demand, and by
anytime and anywhere
Possible of construct and
deploy applications
automatically scale to tens of
thousands computers
Possible of construct and run
programs dealing with
prodigious volume of data
…
Distributed File System
Distributed Computing
Framework
…………………………………
Q&A
参考文献


[1] J. Dean and S. Ghemawat, "MapReduce:
Simplified Data Processing on Large Clusters," in
Osdi, 2004, pp. 137-150.
[2] G. Sanjay, G. Howard, and L. Shun-Tak, "The
Google file system," in Proceedings of the
nineteenth ACM symposium on Operating
systems principles. Bolton Landing, NY, USA: ACM

Press, 2003.
[3] O. Christopher, R. Benjamin, S. Utkarsh, K.
Ravi, and T. Andrew, "Pig latin: a not-so-foreign
language for data processing," in Proceedings of
the 2008 ACM SIGMOD international conference
on Management of data. Vancouver, Canada:
ACM, 2008.
58
Google App Engine

App Engine handles HTTP(S) requests, nothing else



App configuration is dead simple


Think RPC: request in, processing, response out
Works well for the web and AJAX; also for other services
No performance tuning needed
Everything is built to scale


“infinite” number of apps, requests/sec, storage capacity
APIs are simple, stupid
59
App Engine Architecture
req/resp
stateless APIs
urlfech
mail
R/O FS
Python
VM
process
stdlib
app
images
stateful
APIs
datastore
memcache
60
60
Microsoft Windows Azure
61
Amazon Web Services






Amazon’s infrastructure (auto scaling, load
balancing)
Elastic Compute Cloud (EC2) – scalable virtual
private server instances
Simple Storage Service (S3)
Simple Queue Service (SQS) – messaging
SimpleDB - database
Flexible Payments Service, Mechanical Turk,
CloudFront, etc.
62
Amazon Web Services





Very flexible, lower-level offering (closer to
hardware) = more possibilities, higher performing
Runs platform you provide (machine images)
Supports all major web languages
Industry-standard services (move off AWS easily)
Require much more work, longer time-to-market


Deployment scripts, configuring images, etc.
Various libraries and GUI plug-ins make AWS do
help
63
Price of Amazon EC2
64
Descargar

幻灯片 1