6.S984 Datacenter Computing

Prerequisites

6.191 Computation Structures

Course Description

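Warehouse-scale datacenters host a wide range of online services, including cloud computing, social networks, web search, video streaming, and software as a service. In this course we study the hardware, systems software, and distributed-systems techniques behind modern datacenters. We also examine cross-cutting issues such as total cost of ownership, service-level objectives, availability, and reliability. The course combines lectures with paper readings. Students read up to two papers per topic and submit a brief summary.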

When reading a paper, consider the following questions:

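Problem: What problem is the paper trying to solve? How realistic is the problem?
Key idea: What is the main idea behind the solution?
Novelty: How does it differ from prior work? Is it a new problem, a new solution, or a new setting for an existing problem?
Critique: What would you change about the solution? What do you think of the way the authors present or evaluate it?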

Reference Book

"The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition", Luiz André Barroso, Jimmy Clidaras, Urs Hölzle. Morgan & Claypool Publishers

Topics

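Introduction
Datacenter hardware
Power management
Hardware architectures
Energy & power
Datacenter storage
Reliability
Datacenter networking
Application frameworks
Serverless computing
Microservices
Performance analysis
Tail latency
Security and privacy
Monitoring
Performance debugging
Low-latency service management
Datacenter management
Machine learning for systems (skip)
Cluster management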

Lec 1 Introduction

The datacenter as a computer (BCH chapters 1, 2)

lec1.md

Lec 2 Datacenter Hardware

Datacenter hardware (slides)

lec2.md

Lec 3 Power Management

Readings

Barroso & Hoelzle: chapters 4 and 5
(optional) Hennessy & Patterson, Computer Architecture: A Quantitative Approach, Ch. 1.5 and 6.6

lec3.md

Lec 4 Hardware Architectures

A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Architecting to Achieve a Billion RPS Throughput on a Single Key-Value Store Server Platform

lec4.md

Lec 5 Energy & Power

Heracles: improving resource efficiency at scale

Towards Energy Proportionality for Large-Scale Latency-Critical Workloads

Lec 6 Datacenter Storage

Pocket: Elastic Ephemeral Storage for Serverless Analytics

The Google File System

lec6.md

Lec 7 Reliability

Reliability (slides)

(Lecture notes, BCH chapter 7)

Lec 8 Datacenter Networking

Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network, SIGCOMM ’15

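This paper describes the Clos topologies used in Google's datacenter network, how they evolved over a decade, and the centralized control that manages them.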
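To give a feel for how a Clos design scales out from small switches, the sketch below sizes a three-tier k-ary fat tree, one well-known textbook instance of a folded Clos network. It is an illustrative assumption, not the Jupiter fabric from the paper; the function name `fat_tree_sizes` and the returned keys are hypothetical.

```python
def fat_tree_sizes(k):
    """Size a 3-tier k-ary fat tree built only from identical k-port switches.

    Textbook model (assumption): k pods, each with k/2 edge and k/2
    aggregation switches, (k/2)^2 core switches, and k/2 hosts per edge switch.
    """
    if k < 2 or k % 2 != 0:
        raise ValueError("k must be an even port count >= 2")
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,
        "aggregation_switches": k * half,
        "core_switches": half * half,
        "hosts": k * half * half,  # equals k^3 / 4
    }


small = fat_tree_sizes(4)    # the classic 16-host example
large = fat_tree_sizes(48)   # commodity 48-port switches
```

With 48-port switches this model reaches 27,648 hosts, which shows why building fabrics from many small commodity switches is attractive at datacenter scale.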

Azure Accelerated Networking: SmartNICs in the Public Cloud

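The problem is how to deliver efficient, low-latency networking in the public cloud. Cloud services must support heavy data transfer in a multi-tenant environment, but traditional network architectures and software stacks hit bottlenecks in both performance and latency and cannot keep up with rapidly growing demand.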

lec8.md

Lec 9 Application Frameworks

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (nsdi12-final138.pdf, usenix.org)

Commonly known as RDDs; this paper laid the theoretical foundation for Spark.

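Cluster computing frameworks such as MapReduce provide an abstraction that lets users run parallel computations with a set of high-level operators without worrying about task scheduling or fault tolerance, but they lack an abstraction for distributed memory. The only way to reuse data between computation stages (for example, between two MapReduce jobs) is to write it to an external stable storage system such as a distributed file system. These frameworks offer no more general abstraction for data reuse.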
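The data-reuse gap described above is what RDDs address: intermediate results stay in memory and can be rebuilt from their lineage if lost. The following is a minimal single-process sketch of that idea. `MiniRDD` and all of its methods are hypothetical names for illustration, not Spark's API, and the sketch ignores partitioning and distribution entirely.

```python
class MiniRDD:
    """Toy stand-in for an RDD: lazy, lineage-tracked, optionally cached."""

    def __init__(self, compute, parent=None, name="source"):
        self._compute = compute   # function that produces the records
        self.parent = parent      # lineage pointer to the parent dataset
        self.name = name
        self._persist = False
        self._cached = None       # in-memory copy once materialized

    @classmethod
    def from_list(cls, data):
        data = list(data)
        return cls(lambda: list(data))

    def map(self, fn):
        return MiniRDD(lambda: [fn(x) for x in self.collect()], self, "map")

    def filter(self, pred):
        return MiniRDD(lambda: [x for x in self.collect() if pred(x)], self, "filter")

    def persist(self):
        self._persist = True
        return self

    def collect(self):
        if self._cached is not None:
            return list(self._cached)   # reuse from memory
        result = self._compute()        # otherwise recompute via lineage
        if self._persist:
            self._cached = result
        return list(result)

    def drop_cache(self):
        """Simulate losing the in-memory copy, e.g. after a node failure."""
        self._cached = None

    def lineage(self):
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node.parent
        return list(reversed(chain))


compute_log = []  # records every time the expensive map function runs

def expensive_square(x):
    compute_log.append(x)
    return x * x

squares = MiniRDD.from_list(range(5)).map(expensive_square).persist()
evens = squares.filter(lambda v: v % 2 == 0)

first = evens.collect()           # runs the map once and caches its output
second = sum(squares.collect())   # served from memory, no recomputation
```

Two different computations reuse `squares` without writing it to external storage, and after `drop_cache()` the next `collect()` simply replays the recorded lineage instead of restoring a replica.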

X-Stream: edge-centric graph processing using streaming partitions, SOSP'13

X-Stream is a system for shared-memory machines that can process both out-of-core graphs (stored on disk) and in-memory graphs.
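The "edge-centric" part of the title means that each iteration streams sequentially over the whole unordered edge list rather than looking up the edges of each active vertex. A minimal sketch of that scatter-gather loop, here computing BFS levels, is shown below. It assumes everything fits in one in-memory list and omits X-Stream's streaming partitions; `edge_centric_bfs` is a hypothetical name.

```python
def edge_centric_bfs(num_vertices, edges, root):
    """BFS levels computed by repeatedly streaming the entire edge list."""
    INF = float("inf")
    level = [INF] * num_vertices
    level[root] = 0
    active = {root}
    while active:
        # Scatter: stream every edge; an edge whose source changed in the
        # previous iteration emits an update for its destination.
        updates = [(dst, level[src] + 1) for src, dst in edges if src in active]
        # Gather: stream the updates and apply them to destination state.
        active = set()
        for dst, new_level in updates:
            if new_level < level[dst]:
                level[dst] = new_level
                active.add(dst)
    return level


# The edge list needs no sorting or index; vertex 4 is unreachable.
levels = edge_centric_bfs(5, [(2, 3), (0, 1), (1, 2), (0, 2)], root=0)
```

The trade-off is that every iteration touches all edges, including those with inactive sources, in exchange for purely sequential access to the edge list.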

lec9.md

Lec 10 Serverless Computing

Occupy the Cloud: Distributed Computing for the 99%

ExCamera -- Encoding, Fast and Slow: Low-Latency Video Processing Using Thousands of Tiny Threads

lec10.md

Lec 11 Microservices

Readings
Introduction to microservices, 2015, blog
An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems, ASPLOS '19

lec11.md