Large Scale Data Storage and Processing on Google's Distributed Systems

Invited Lecture: Large Scale Data Storage and Processing on Google's Distributed Systems

17/04/2019 09:00, Aula Magna, Dipartimento di Informatica, Via Celoria 18, Milano, Italy

(This is an open lecture, and it is also part of the master's course "Distributed and Pervasive Systems" taught by Prof. Claudio Bettini.)

Lecturer: Dario Freni, Google London

Abstract:

Organizing the world's information and making it universally accessible and useful requires technologies that are able to handle petabytes of data quickly and reliably. This talk focuses on three crucial aspects of Google's infrastructure: storage, processing and reliability. We will present popular technologies within Google, giving an overview of their principles and main use cases. We will cover distributed storage solutions including GFS [1] (distributed file system), Bigtable [2] (distributed multi-dimensional sorted map), Spanner [3] and F1 [4] (globally distributed databases). Processing solutions that will be covered include MapReduce [5], Flume [6] (distributed processing of batch data), and MillWheel [7] (distributed processing of streaming data). These technologies are the building blocks of the publicly available platform named Cloud Dataflow [8], which will also be covered during this talk.

Bibliography:

All papers are available from http://research.google.com/pubs/DistributedSystemsandParallelComputing.html

[1] Sanjay Ghemawat et al.: The Google file system. SOSP 2003: 29-43

[2] Fay Chang et al.: Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26(2) (2008)

[3] James C. Corbett et al.: Spanner: Google's Globally Distributed Database. ACM Trans. Comput. Syst. 31(3): 8 (2013)

[4] Jeff Shute et al.: F1: A Distributed SQL Database That Scales. PVLDB 6(11): 1068-1079 (2013)

[5] Jeffrey Dean, Sanjay Ghemawat: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113 (2008)

[6] Craig Chambers et al.: FlumeJava: easy, efficient data-parallel pipelines. PLDI 2010: 363-375

[7] Tyler Akidau et al.: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB 6(11): 1033-1044 (2013)

[8] Tyler Akidau et al.: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. PVLDB 8(12): 1792-1803 (2015)

Bio:

Dario Freni works at Google on the Android Framework, focusing on making the Android OS easier to update. Previously, he led one of the teams working on Play Console, focusing on providing clear analytics metrics and app health tools to app developers. Prior to that, he was a tech lead of one of the Ads Site Reliability Engineering teams, specializing in fast and reliable large-scale data processing pipelines.

Prior to joining Google in 2011, Dario completed his Ph.D. in computer science at Università degli Studi di Milano (Italy).