Summer Reading: Distributed Computing

Courtesy of the ai.google research publication database from which I sourced these articles.

Foreword

Prompted by some of my current tasking at work, I went on a journey through Google’s history of research publications.  My focus was, predictably, on Distributed Computing.  From infrastructure to algorithms to AI to version control, Google’s engineers have left a lot of public-facing breadcrumbs on how to be a better engineer.  I filtered on the topics of Data Management, Distributed Systems and Parallel Computing, Networking, Software Engineering, and Software Systems from the years 1998-2019 and came up with this summer reading list.  These papers focus on globally scalable systems that push the boundaries of what’s possible, the kind of systems that quite literally make the Internet run today. This list also includes papers on a lot of the technologies that I use on a daily basis (most of which have external Cloud versions), so it’s been interesting learning what kind of hamsters they have running under the hood.

Come join me on my quest to become a better back-end developer!  If you don’t feel like reading all of these papers, I totally get it.  It’s probably at least worth reading the abstracts/introductions for the research papers.  Some of them are slide decks or videos, so they’ll be easier to consume, but the hope is that everyone can find at least something interesting here.

Oh, and strap in. It’s going to be a long one.

1998

So this is where it all began – the paper that presents “Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext.”  This paper is sort of outdated, being ~20 years old, but a surprising amount of it is as relevant as ever at Google today. It introduces the concept of massive parallel input pipelines, parsing and storing that information in a readable format, and then being able to run queries over that space in a matter of milliseconds.  The diagrams in this paper show a rudimentary microservice network with data ingesters, storage systems, indices, and other fun stuff. If nothing else, it’s fascinating to see how far we’ve come since then.
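If you want the core idea boiled way down, here’s a toy sketch of the parse → index → query loop the paper describes. This is nothing like Google’s actual data structures, just the smallest possible inverted index:

```python
# Toy inverted index: a minimal sketch of the parse -> index -> query loop.
# Not Google's real data structures; just the core idea at tiny scale.
from collections import defaultdict

docs = {
    1: "distributed systems at google scale",
    2: "the anatomy of a large scale search engine",
    3: "search quality and link analysis",
}

# "Parsing and storing that information in a readable format":
# map each term to the set of documents that contain it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return the doc ids containing every term in the query (AND semantics)."""
    results = None
    for term in query.split():
        postings = index.get(term, set())
        results = postings if results is None else results & postings
    return sorted(results or [])

print(search("search scale"))  # -> [2]
```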

The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin, Lawrence Page

2003

Not a ton happened between 1998 and 2003, or at least there aren’t many public papers to show for it.  There’s a lot about enterprise security, Unix administration, and information retrieval until 2003, when there was a burst of publications on new infrastructure.  This was the point where the search engine thing was starting to take off and Google needed some solid new cluster tech. These documents are old enough that “hundreds of GB” was a lot of data, so they were basically written in the stone age.  I’m not sure exactly how much, if any, of the tech described is still in production, but we definitely still have the same core distributed file system patterns in our infrastructure today. The important part of the GFS paper is the relaxed consistency model that allows the system to serve data at the scale it does (even if it’s not instantly correct).  Study it well because this will definitely be on the test later.
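To make that parenthetical concrete, here’s a toy illustration of relaxed consistency in general. This is not GFS’s actual chunk/replica protocol, just the “a read might be served by a replica that hasn’t caught up yet” idea:

```python
# Toy illustration of relaxed consistency (not GFS's actual protocol):
# writes land on a primary replica immediately, a secondary applies them
# asynchronously, and a read may be served by the replica that lags behind.
import random

primary = {}            # primary replica's view of the data
secondary = {}          # a secondary replica that lags behind
replication_queue = []  # mutations waiting to be applied to the secondary

def write(key, value):
    primary[key] = value
    replication_queue.append((key, value))  # applied later, not now

def replicate_one():
    """Simulate background replication catching up by one mutation."""
    if replication_queue:
        key, value = replication_queue.pop(0)
        secondary[key] = value

def read(key):
    """A client read may hit either replica; the secondary may be stale."""
    replica = random.choice([primary, secondary])
    return replica.get(key)

write("chunk-42", "v1")
print(read("chunk-42"))  # sometimes "v1", sometimes None until replication runs
replicate_one()
print(read("chunk-42"))  # now "v1" from either replica
```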

Web Search for a Planet: Google’s Cluster Architecture
Luiz André Barroso, Jeffrey Dean, Urs Hölzle

The Google File System
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung

2004-2005

Moving onwards through history, we start getting into the scaling phase.  Google has established itself as the go-to search engine and they start building massively scalable infrastructure to deal with all this data they’re consuming.  The term “MapReduce” originally just referred to the Google product described below, but has since evolved into a generic programming model for big data processing operations.  Of course the concept of processing massive data sets in parallel is cool, but reading into how the work is actually parallelized is fascinating.  The authors get into how the master partitions individual tasks out to worker nodes, handling failure gracefully all the way through the map and reduce operations.  There’s a nice optimization pattern discussed in section 3.6 about double-scheduling straggler jobs that have been running past a certain time limit. It uses more resources, but since all of the operations are atomic, the rescheduled job sometimes completes faster than the original (which decreases overall turnaround time).  The second link is a somewhat sarcastic slide deck that devolves into a weird flex at the end, but ok. It’s interesting to see how the Internet was developing at the time, especially in the field of trying to sell people viagra and porn.
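For anyone who hasn’t seen it, here’s the canonical word-count example as a minimal single-process sketch of the programming model. There’s no cluster, no fault tolerance, and none of the straggler handling mentioned above, just the shape of the user-supplied map and reduce functions:

```python
# Minimal, single-process sketch of the MapReduce programming model using the
# classic word-count example. A real run would partition inputs across worker
# machines, shuffle over the network, and re-run or back up failed/slow tasks.
from collections import defaultdict

def map_fn(doc):
    """Map: emit (word, 1) for every word in a document."""
    for word in doc.split():
        yield word, 1

def reduce_fn(word, counts):
    """Reduce: sum the counts emitted for a single word."""
    return word, sum(counts)

def map_reduce(docs):
    # Shuffle phase: group intermediate values by key before reducing.
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]
print(map_reduce(docs))  # {'the': 3, 'quick': 2, 'dog': 2, ...}
```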

MapReduce: Simplified Data Processing on Large Clusters
Jeffrey Dean, Sanjay Ghemawat

(Here’s a slide deck that presents basically the same concepts as the paper in case you prefer that format.)

Challenges in Running a Commercial Search Engine
Amit Singhal

2006-2007

One of the fundamental concepts of distributed systems is consensus (getting a group of your computers to agree on something).  Without this, you won’t have consistency, and people can do things like withdraw the entire balance of a bank account from multiple places across the globe, multiplying their money while they’re at it.  As it turns out, the basic building blocks of massive global clusters are efficient consensus and precise time implementations. Getting these things right allows your engineers to do amazing things with relatively low effort, relying on nothing more than the tools you give them.  This is where we start to get into some of the main cloud computing products we all know and love today (or at least, that the web services we use on a daily basis know and love today). Bigtable is discussed here, and some other integral parts of the cloud computing ecosystem are covered below.
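As a toy illustration of the consensus point (this is not Chubby’s API or full Paxos, just the majority-quorum invariant underneath them): an operation only commits if a strict majority of replicas accept it, and since any two majorities overlap, two conflicting withdrawals can’t both commit.

```python
# Toy majority-quorum sketch (not Chubby's API, not full Paxos): each replica
# accepts at most one proposal for a given slot, and a proposal only commits
# if a strict majority of replicas accepted it. Two conflicting operations can
# never both commit, because any two majorities overlap.

NUM_REPLICAS = 5
accepted = [None] * NUM_REPLICAS  # what each replica has accepted for this slot

def propose(operation):
    votes = 0
    for i in range(NUM_REPLICAS):
        if accepted[i] is None:          # replica is free to accept
            accepted[i] = operation
            votes += 1
        elif accepted[i] == operation:   # already accepted this same proposal
            votes += 1
    return votes > NUM_REPLICAS // 2     # committed only with a majority

# Two clients race to drain the same account from different continents.
print(propose("withdraw $100 from account 42 (client A)"))  # True: commits
print(propose("withdraw $100 from account 42 (client B)"))  # False: no majority left
```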

The Chubby Lock Service for Loosely-Coupled Distributed Systems
Mike Burrows

Bigtable: A Distributed Storage System for Structured Data
Jeffrey Dean, Sanjay Ghemawat, and more

2010-2013

The first resource here goes over why trusting hardware will always lead you astray.  Especially considering recent events, software is not a perfect solution either, but ¯\_(ツ)_/¯.  It generally fails less and you won’t run up as big of a bill while you’re at it.  By creating giant clusters of commodity machines and running all of your distributed services together on those shared machines, you can get a lot of work done without schlepping your data halfway around the globe for processing and then waiting for the result.  Having established that, the Dremel and Spanner papers describe large-scale data storage and processing systems, both of which are now Cloud products (Dremel became BigQuery, Spanner became Cloud Spanner). Finally, we have a paper on managing tail latency (the 95th percentile and up) and how a little bit of extra planning goes a long way toward a much better user-facing experience.
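One of the tricks from The Tail at Scale is the hedged request: send the call to one replica, and if it hasn’t answered within roughly the 95th-percentile latency, fire a duplicate at another replica and take whichever answer comes back first. A rough sketch, where the replica call and the latency numbers are made up for illustration:

```python
# Rough sketch of a hedged request, one of the tail-tolerant techniques from
# "The Tail at Scale". The replica call and the p95 threshold are made up here.
import concurrent.futures
import random
import time

P95_LATENCY = 0.05  # seconds; in practice derived from observed latencies

def call_replica(replica_id):
    # Stand-in for an RPC: usually fast, occasionally a straggler.
    time.sleep(random.choice([0.01, 0.01, 0.01, 0.5]))
    return f"response from replica {replica_id}"

def hedged_request():
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=2)
    first = pool.submit(call_replica, 1)
    try:
        # Happy path: the first replica answers within the p95 budget.
        return first.result(timeout=P95_LATENCY)
    except concurrent.futures.TimeoutError:
        # Slow path: hedge with a second replica, take whichever answers first.
        second = pool.submit(call_replica, 2)
        done, _ = concurrent.futures.wait(
            [first, second], return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        # Don't block on the straggler here; it finishes in the background.
        pool.shutdown(wait=False)

print(hedged_request())
```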

Building Large-Scale Internet Services – 2010
Jeff Dean

Dremel: Interactive Analysis of Web-Scale Datasets – 2010
Sergey Melnik, Andrey Gubarev, and more

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure – 2010
Benjamin Sigelman, Luiz André Barroso, and more

Small Cache, Big Effect – 2011
Bin Fan, Hyeontaek Lim, and more (all non-Google, no ai.google link)

Spanner: Google’s Globally Distributed Database – 2012
James C. Corbett, Jeffrey Dean, and more

The Tail at Scale – 2013
Jeff Dean, Luiz André Barroso

Follow-ups: 
An interesting read on why you cannot have both total consistency and 100% availability when network partitions are possible (read: whenever you have a global system)
A slide deck on F1, the database that powers Google Ads (built on Spanner)
The whitepaper for F1 (in case you wanted more detail)

2014-2016

The next set of papers assumes the presence of the core infrastructure detailed in the last few sections and that user-facing apps are now using that infrastructure.  Sure, every team could just buy their own servers and stick them all in big warehouses across the globe, but there’s a huge amount of optimization that can be done to drive down cost and increase reliability.  These frameworks allow you, as a company, to buy a shitload of computing power, plug it all in, and then allocate hardware resources and processing time centrally. This way your prod jobs stay up and your giant batch data processing jobs can run whenever there’s spare capacity.  This ends up saving a huge amount of money and time and avoids a lot of developer toil. If your devs have to touch the hardware themselves, you’ve probably taken a wrong turn somewhere along the line. The Borg paper also goes into “Lessons Learned” at the end and how those lessons were applied to make Kubernetes more developer-friendly.
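As a crude sketch of the prod-before-batch idea (this is not Borg’s actual scheduler or config format), think of it as priority-ordered bin packing: prod jobs claim capacity first and batch work backfills whatever is left:

```python
# Crude sketch of priority-based scheduling in the spirit of Borg (not Borg's
# actual algorithm). Prod jobs claim cluster capacity first; batch work only
# backfills the leftover capacity and waits when the cluster is full.

CLUSTER_CPUS = 100

jobs = [
    {"name": "frontend",       "cpus": 40, "priority": "prod"},
    {"name": "ads-serving",    "cpus": 30, "priority": "prod"},
    {"name": "logs-mapreduce", "cpus": 50, "priority": "batch"},
    {"name": "index-rebuild",  "cpus": 25, "priority": "batch"},
]

def schedule(jobs, capacity):
    running, pending = [], []
    # Prod jobs are placed first; batch jobs only backfill leftover capacity.
    for job in sorted(jobs, key=lambda j: j["priority"] != "prod"):
        if job["cpus"] <= capacity:
            capacity -= job["cpus"]
            running.append(job["name"])
        else:
            pending.append(job["name"])  # waits for capacity to free up
    return running, pending

running, pending = schedule(jobs, CLUSTER_CPUS)
print("running:", running)  # prod jobs plus whatever batch work fits
print("pending:", pending)
```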

Large Scale Cluster Management at Google with Borg
Abhishek Verma, Luis Pedrosa, and more

Borg, Omega, and Kubernetes
Brendan Burns, Brian Grant, and more

Follow-ups:
The Rise of Cloud Computing – a powerpoint talking about cluster tech over the years.

2016-2019 (Present)

Now that there are a ton of tools available to create scalable, massive systems without having to do all of the above research, things have begun to focus more on making these systems do the right thing, consistently, even in the presence of failure.  It turns out throwing more raw computing power at something isn’t always going to solve your problem – sometimes you just need to engineer things to be a little smarter. The notion of an SRE (Site Reliability Engineer) has been around for a while, but it seems like it’s becoming even more widespread as DevOps has taken the software engineering world by storm.  The following are some videos, articles, and even a book about how to make your systems robust. (The monorepo article is a bonus for anyone who’s been stuck in git their entire life.  It covers how and when monorepos can be used advantageously, but also why they’re not always the best solution.)

SRE your GRPC
Gráinne Sheerin, Gabe Krabbe

The Calculus of Service Availability
Ben Treynor, Benjamin Lutch, and more

Why Google Stores Billions of Lines of Code in a Single Repository
Rachel Potvin, Josh Levenberg

Follow-ups:
Site Reliability Engineering – An online version of the book explaining the concept of an SRE.  It’s a full book so only go into this as needed, but it’s a great resource to have. (This is ‘The SRE Book’ mentioned in the second article of this section)

That’s all, folks!  (For now)

Well there we have it.  I’m amazed if you’ve stuck it out and made it this far because it certainly took me a while to get through all of this.  If you have any articles that you think should be on a post like this, tweet them to me (@ahandley00) and I’ll get them up here somewhere.  I definitely want to get some more conference recordings on here (see ‘SRE your GRPC’) because of how easy they are to consume, yet how helpful they can be.  

Edits:
2019/7/13 – Added The Tail at Scale (2013)
2020/11/19 – Added Small Cache, Big Effect (2011)
2022/03/15 – Added Dapper whitepaper (2010)
