Invited Lecture: Large Scale Data Storage and Processing on Google's Distributed Systems



17/04/2019 09:00, Aula Magna, Dipartimento di Informatica, Via Celoria 18, Milano, Italy

(This is an open lecture, but it is also part of the master's course in "Distributed and Pervasive Systems" given by Prof. Claudio Bettini)


Lecturer: Dario Freni, Google London




Organizing the world's information and making it universally accessible and useful requires technologies that are able to handle petabytes of data quickly and reliably. This talk focuses on three crucial aspects of Google's infrastructure: storage, processing and reliability. We will present popular technologies within Google, giving an overview of their principles and main use cases. We will cover distributed storage solutions including GFS [1] (distributed file system), Bigtable [2] (distributed multi-dimensional sorted map), Spanner [3] and F1 [4] (globally distributed databases). Processing solutions that will be covered include MapReduce [5], Flume [6] (distributed processing of batch data), and MillWheel [7] (distributed processing of streaming data). These technologies are the building blocks of the publicly available platform named Cloud Dataflow [8], which will also be covered during this talk.


All papers are available from


[1] Sanjay Ghemawat et al.: The Google file system. SOSP 2003: 29-43

[2] Fay Chang et al.: Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26(2) (2008)

[3] James C. Corbett et al.: Spanner: Google's Globally Distributed Database. ACM Trans. Comput. Syst. 31(3): 8 (2013)

[4] Jeff Shute et al.: F1: A Distributed SQL Database That Scales. PVLDB 6(11): 1068-1079 (2013)

[5] Jeffrey Dean, Sanjay Ghemawat: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113 (2008)

[6] Craig Chambers et al.: FlumeJava: easy, efficient data-parallel pipelines. PLDI 2010: 363-375

[7] Tyler Akidau et al.: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB 6(11): 1033-1044 (2013)

[8] Tyler Akidau et al.: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. PVLDB 8(12): 1792-1803 (2015)




Dario Freni works at Google on the Android Framework, focusing on making the Android OS easier to update. Previously, he led one of the Play Console teams, which focused on providing clear analytics metrics and app health tools to app developers. Prior to that, he was a tech lead of one of the Ads Site Reliability Engineering teams, specializing in fast and reliable large-scale data processing pipelines.


Prior to joining Google in 2011, Dario completed his Ph.D. in computer science at Università degli Studi di Milano (Italy).

The paper "Automatic Detection of Urban Features from Wheelchair Users' Movements" by Gabriele Civitarese, Sergio Mascetti, Alberto Butifar and Claudio Bettini has been accepted at the 17th IEEE PerCom conference, which will take place in Kyoto from the 11th to the 15th of March.

The paper "EPIC: a Methodology for Evaluating Privacy Violation Risk in Cybersecurity Systems" by Sergio Mascetti, Nadia Metoui, Andrea Lanzi, and Claudio Bettini has been accepted for publication in Transactions on Data Privacy. This work is part of the Israeli-Italian Scientific and Technological Cooperation program, supported by the Ministry of Science, Technology and Space, Israel under grant 3-12288, and by the Ministry of Foreign Affairs and International Cooperation, Italy.
