Large Scale Data Storage and Processing on Google's Distributed Systems




03/05/2017 08:45, Aula Beta, Via Comelico 39 (Mi)


Lecturer: Dario Freni, Google London




Organizing the world's information and making it universally accessible and useful requires technologies that are able to handle petabytes of data quickly and reliably. This talk focuses on three crucial aspects of Google's infrastructure: storage, processing and reliability. We will present popular technologies within Google, giving an overview of their principles and main use cases. We will cover distributed storage solutions including GFS [1] (distributed file system), Bigtable [2] (distributed multi-dimensional sorted map), Spanner [3] and F1 [4] (globally distributed databases). Processing solutions that will be covered include MapReduce [5], Flume [6] (distributed processing of batch data), and MillWheel [7] (distributed processing of streaming data). These technologies are the building blocks of the publicly available platform named Cloud Dataflow [8], which will also be covered during this talk.


All papers are available from


[1] Sanjay Ghemawat et al.: The Google file system. SOSP 2003: 29-43

[2] Fay Chang et al.: Bigtable: A Distributed Storage System for Structured Data. ACM Trans. Comput. Syst. 26(2) (2008)

[3] James C. Corbett et al.: Spanner: Google's Globally Distributed Database. ACM Trans. Comput. Syst. 31(3): 8 (2013)

[4] Jeff Shute et al.: F1: A Distributed SQL Database That Scales. PVLDB 6(11): 1068-1079 (2013)

[5] Jeffrey Dean, Sanjay Ghemawat: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1): 107-113 (2008)

[6] Craig Chambers et al.: FlumeJava: easy, efficient data-parallel pipelines. PLDI 2010: 363-375

[7] Tyler Akidau et al.: MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB 6(11): 1033-1044 (2013)

[8] Tyler Akidau et al.: The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. PVLDB 8(12): 1792-1803 (2015)




Dario Freni leads one of the teams that work on Play Console, where he focuses on providing clear analytics metrics and app health tools to app developers. Previously, Dario was a tech lead on one of the Ads Site Reliability Engineering teams, specializing in fast and reliable large-scale data processing pipelines.

Prior to joining Google in 2011, Dario completed his Ph.D. in computer science at Università degli Studi di Milano (Italy).