- Milan Vojnovic, Department of Statistics.
*Office hours*: By appointment, COL 5.05

- Christine Yuen, Department of Statistics.
*Office hours*: Thursday 11:00 - 12:00, COL 5.03

- Lectures on Mondays 10:00–12:00 in TW2.2.04
- Classes on Thursdays 12:30–14:00 in TW2.4.01

No lectures or classes will take place during School Reading Week 6.

This course will cover the principles of distributed systems for storing and processing big data, including the storage systems, databases and data models in common use by on-premise data analytics platforms and cloud computing services. It will cover the principles of computing over large datasets in distributed computing systems built from multi-core processors and cluster computing systems. Students will learn how to perform canonical distributed computing tasks in the batch, streaming and graph processing computation models, and how to run scalable machine learning algorithms for regression, classification, clustering and collaborative filtering tasks.

The course uses a project-based learning approach in which students gain hands-on experience with writing and running computer programmes through computer workshop exercises and project assignments. This will equip students with key skills and knowledge about modern computation platforms for processing big data. In particular, students will get hands-on experience with Apache Spark, a widely used general-purpose engine for processing big data across many industries, and with connecting Spark programmes to various databases and other systems. Students will work on weekly exercises and project assignments using revision-control and group collaboration tools such as GitHub. Each student will develop code for solving one or more computation tasks on an input dataset, and will use GitHub for accessing and submitting course materials and assignments.

On the theory side, we will introduce the principles of distributed databases, their design objectives, querying paradigms using the MapReduce style of computation, general numerical computations using dataflow graphs, and querying using SQL application programming interfaces. We will consider graph processing algorithms for querying graph properties and iterative computations on input graph data. We will also introduce the principles of stream processing, including how to perform computations and execute queries over a sliding window of input data stream elements. We will study the principles of scalable machine learning algorithms based on parallel implementations of gradient descent style algorithms for minimizing a loss function, used for training regression and classification models. We will also consider distributed MapReduce-based computations for training clustering models such as k-means and collaborative filtering models based on matrix factorization. We will consider numerical computations using dataflow graphs, with a focus on the use case of learning a deep neural network for image classification and other classification tasks. Students will be encouraged to work with computations and data relevant to their own interests.

On the practical side, we will cover a variety of tools that are part of a modern data scientist’s toolkit, including distributed computing using Apache Spark, MapReduce-style processing of big datasets, application programming interfaces for querying structured and unstructured datasets, stream data processing, and deploying large-scale machine learning models. You will learn how to write programmes that define Spark jobs using the Python API and how to deploy a Spark job in a production environment. You will learn how to connect Spark data structures to a variety of external data sources, including key-value databases, relational databases, and publish-subscribe messaging systems.

For the final project, we will ask you to develop and run a distributed computation for a given dataset, which you will be expected to implement in a PySpark Jupyter notebook.

This course is an introduction to the fundamental concepts of distributed computing for big data and assumes no prior knowledge of these concepts.

The course will involve 20 hours of lectures and 15 hours of computer workshops in the LT.

Some basic prior programming experience is expected. Prior experience with Python programming is desirable; for example, as acquired through the compulsory courses of the MSc in Data Science programme.

We will use several tools, notably Apache Spark, a general engine for computing over large distributed datasets; PySpark, the Python API for Spark; SQL APIs for querying datasets; and the TensorFlow library for dataflow programmes. Lectures and assignments will be posted on GitHub. Students are expected to use GitHub to submit problem sets and the final exam.

Where appropriate, we will use Jupyter notebooks for lab assignments, demonstrations, and the course notes themselves.

Project assignment (80%) and continuous assessment in weeks 4 and 7 (10% each). Students will be expected to complete 10 problem sets in the LT.

In the first week, we will introduce the basic concepts and system architectures for big data processing. We will introduce the basic computing paradigms of batch, streaming, imperative, declarative, graph and machine learning data processing. We will discuss the main architectures of data storage systems based on key-value stores and other data models, and the main design goals of such systems, such as consistency and optimization for fast and reliable reads or writes. We will then introduce the basic concepts of multi-node computing, such as cluster computing systems consisting of multiple machines, multi-core processors, distributed file systems, partitioning of large data files into chunks or extents, distributed computing using master and worker nodes, and resource allocation through job scheduling using resource management systems such as YARN and Mesos.
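
As a toy illustration of the master/worker pattern described above (plain Python with worker processes, not a cluster framework), a driver can partition a dataset into chunks, hand each chunk to a worker, and combine the partial results:

```python
# Illustrative sketch: a "driver" partitions a dataset into chunks and
# farms them out to worker processes, mimicking the master/worker
# pattern of cluster computing systems.
from multiprocessing import Pool

def partition(data, num_chunks):
    """Split data into roughly equal chunks, one per worker."""
    size = (len(data) + num_chunks - 1) // num_chunks
    return [data[i:i + size] for i in range(0, len(data), size)]

def worker_sum(chunk):
    """Each worker computes a partial result on its own chunk."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1, 101))
    chunks = partition(data, 4)
    with Pool(4) as pool:
        partials = pool.map(worker_sum, chunks)  # run on worker processes
    print(sum(partials))  # driver combines partial results -> 5050
```

In a real cluster the chunks would live in a distributed file system and the workers would be separate machines, but the division of labour is the same.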

*Readings*:

- Zaharia, M. et al, Apache Spark: A Unified Engine for Big Data Processing, Communications of the ACM, Vol 59, No 11, November 2016
- Drabas, T. and Lee, D., *Learning PySpark*, Chapter 1: Understanding Spark, Packt, 2017

*Further Resources*:

- Manually Installing Spark
- Mastering Apache Spark 2, Task Scheduler
- DAG scheduler http://bit.ly/29WTiKB
- Matt Turck’s Big Data Landscape

*Lab*: **Hands-on system administration tools**

- Use of basic Linux/macOS/Windows command-line utilities
- Getting to know your cluster system, processors and machines
- Use of cloud computing services: AWS example
- Use of docker images

In this week we will introduce different data models, datasets, databases and data storage paradigms used in distributed computing for big data. We will discuss key-value databases such as Cassandra and more complex relational database models such as Hive. We will discuss different data formats for storing data, including CSV, TSV, JSON, XML, Parquet, Hive tables, RDF, and Azure blobs. We will introduce the basic data structures used in Spark, including the Resilient Distributed Dataset (RDD) and the DataFrame. We will discuss the design objectives of various large-scale distributed storage systems, such as consistency and fast reads or writes.
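
As a small illustration of the format differences mentioned above, the following sketch (standard-library Python only; the field names are made up for the example) parses the same record from CSV and from JSON:

```python
# Illustrative sketch: the same record in two of the text formats listed
# above. Field names (user_id, country, score) are invented for the example.
import csv
import io
import json

csv_text = "user_id,country,score\n42,UK,3.5\n"
json_text = '{"user_id": 42, "country": "UK", "score": 3.5}'

# CSV: flat rows; every value arrives as a string and needs explicit typing
csv_row = next(csv.DictReader(io.StringIO(csv_text)))
csv_record = {"user_id": int(csv_row["user_id"]),
              "country": csv_row["country"],
              "score": float(csv_row["score"])}

# JSON: types (and nesting, if any) are preserved by the format itself
json_record = json.loads(json_text)

print(csv_record == json_record)  # True
```

Binary columnar formats such as Parquet go further, storing the schema alongside the data and laying values out column by column for fast analytical scans.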

*Readings*:

- Hamilton, J., One Size Doesn’t Fit All, Blog, 2012
- Drabas, T. and Lee, D., *Learning PySpark*, Chapter 2: Resilient Distributed Datasets and Chapter 3: DataFrames, Packt, 2017
- Apache Cassandra Documentation
- Apache Hive Tutorial

*Further Resources*:

- Vogels, W., Amazon’s Dynamo, Blog, 2007
- Fitzpatrick, B., Distributed Caching with Memcached, Linux Journal, 2004
- Nishtala, R. et al, Scaling Memcache at Facebook, NSDI 2013
- Zhou, J. et al, SCOPE: parallel databases meet MapReduce, VLDB journal, 2012
- Melnik S. et al, Dremel: Interactive Analysis of Web-Scale Datasets, VLDB 2010
- Google, An Inside Look at Google BigQuery, White Paper, 2012
- Amazon Web Services Documentation, White Papers
- Google Cloud Documentation
- Microsoft Azure Storage Documentation, SOSP 2011 paper & slides, Cosmos DB

*Lab*: **Import data to RDD and DataFrame**

- Create an RDD from a data collection and a file
- Transform an RDD to a dataframe and vice-versa
- Import data from JSON and XML files
- Import data from a key-value database
- Import data from a PostgreSQL database

In this week we will study how to query large unstructured datasets. We will introduce the parallel computing paradigm of MapReduce. We will discuss how to manage and query datasets using Spark RDDs. We will learn how to create an RDD, use transformations such as map, flatMap, filter, distinct, sample, leftOuterJoin and repartition, as well as actions such as take, collect, reduce, count, saveAsTextFile, and foreach. We will introduce the concept of lambda expressions and how to use regular expressions.
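
The MapReduce pattern described above can be sketched in plain Python (no Spark), mirroring the flatMap / map / reduce-by-key pipeline on the classic word-count task:

```python
# Illustrative sketch: word count in the MapReduce style, mirroring the
# flatMap -> map -> reduce-by-key pattern used with Spark RDDs.
from collections import defaultdict
from functools import reduce

lines = ["to be or not to be", "to see or not to see"]

# flatMap: each line -> a sequence of words
words = [w for line in lines for w in line.split()]
# map: each word -> a (key, value) pair
pairs = [(w, 1) for w in words]
# shuffle: group values by key (done by the framework in real MapReduce)
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)
# reduce by key: sum the values for each key, here via a lambda expression
counts = {key: reduce(lambda a, b: a + b, values)
          for key, values in groups.items()}

print(counts["to"], counts["be"])  # 4 2
```

In Spark the same pipeline is `rdd.flatMap(str.split).map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)`, with the shuffle handled by the engine across partitions.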

*Readings*:

- Dean, J. and Ghemawat, S., MapReduce: Simplified Data Processing on Large Clusters, Communications of the ACM, Vol 51, No 1, January 2008, OSDI 2004 version
- Drabas, T. and Lee, D., *Learning PySpark*, Chapter 2: Resilient Distributed Datasets, Packt, 2017
- Karau, H. et al, *Learning Spark: Lightning-Fast Data Analysis*, O’Reilly 2015
- Karau, H. and Warren, R., *High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark*, O’Reilly 2017
- Laskowski, J., *Mastering Apache Spark 2*
- Spark Programming Guide http://spark.apache.org/docs/latest/programming-guide.html#rdd-operations

*Further Resources*:

- Zaharia, M. et al, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012
- High Performance Spark, Chapter 5, Effective transformations, section Narrow vs. Wide Transformations, https://smile.amazon.com/High-Performance-Spark-Practices-Optimizing/dp/1491943203
- Regex patterns https://www.packtpub.com/application-development/master-python-regular-expressions

*Lab*: **Using RDDs and MapReduce tasks**

- Use of lambda expressions
- Use of transformations such as map, filter, flatMap, distinct, sample, leftOuterJoin, and repartition
- Use of actions such as take, collect, reduce, count, saveAsTextFile and foreach

In this week we will consider how to query datasets that have a schema. We will introduce the concept of a DataFrame and learn how to query data using the DataFrame query API and how to execute SQL queries. We will discuss the computational complexity of different standard queries and query optimization techniques. We will consider how to compute fast approximate query answers using sampling techniques such as reservoir sampling, and data summarizations or sketches, such as the HyperLogLog sketch for approximating the number of distinct elements in a multiset and the count-min sketch for frequency estimation.
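
Reservoir sampling, one of the approximation techniques mentioned above, is simple enough to sketch in plain Python (illustrative only, not the Spark implementation):

```python
# Illustrative sketch: reservoir sampling keeps a uniform random sample
# of k items from a stream whose total length is not known in advance.
import random

def reservoir_sample(stream, k, rng=None):
    rng = rng or random.Random(0)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randint(0, i)       # uniform position in [0, i]
            if j < k:
                reservoir[j] = item     # keep item with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(10_000), k=10)
print(len(sample))  # 10
```

The key property is that after processing n items, every item is in the reservoir with the same probability k/n, using only O(k) memory.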

*Readings*:

- Drabas, T. and Lee, D., *Learning PySpark*, Chapter 3: DataFrames, Packt, 2017
- Spark SQL, DataFrames, and Datasets Guide

*Further Resources*:

- Armbrust et al, Spark SQL: Relational Data Processing in Spark, ACM SIGMOD 2015
- Zhou, J. et al, SCOPE: parallel databases meet MapReduce, VLDB journal, 2012
- Melnik S. et al, Dremel: Interactive Analysis of Web-Scale Datasets, VLDB 2010
- Cormode, G., Data Sketching, Communications of the ACM, Vol 60, No 9, September 2017

*Lab*: **SQL queries on table data**

- Creating a DataFrame, querying with the DataFrame API, querying with SQL
- Computing approximate query answers
- GitHub and StackExchange data analysis using Google BigQuery

In this week we will consider principles and systems for scalable processing of large-scale graph data. These include queries such as evaluating node centralities (e.g. degree centrality), graph traversal or motif queries for finding structural patterns in graph data (e.g. identifying friends-of-friends of a person who were born in London), and iterative algorithms on input graph data (e.g. computing PageRank node centralities). We will discuss different data models for representing graph data, such as RDF, as well as query languages used in graph databases, including SPARQL, Gremlin, Cypher and openCypher. We will introduce the bulk synchronous parallel computation model that underlies the design of modern computation platforms for iterative computing on input graph data.
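
As an illustration of iterative computation on graph data, here is a plain-Python sketch of PageRank on a tiny adjacency-list graph (illustrative only, not the GraphX/GraphFrames implementation):

```python
# Illustrative sketch: PageRank as an iterative graph computation.
# Each iteration, every node sends its rank, split evenly, along its
# out-edges; ranks are then recombined with a damping factor d.
def pagerank(out_links, iterations=50, d=0.85):
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        contrib = {v: 0.0 for v in nodes}
        for v, targets in out_links.items():
            for t in targets:
                contrib[t] += rank[v] / len(targets)
        rank = {v: (1 - d) / n + d * contrib[v] for v in nodes}
    return rank

# a -> b, a -> c, b -> c, c -> a
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # c collects the most rank
```

In the bulk synchronous parallel model, each iteration of this loop corresponds to one superstep: nodes exchange messages (the contributions), then a global barrier synchronises before the next superstep.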

*Readings*:

- Drabas, T. and Lee, D., *Learning PySpark*, Chapter 7: GraphFrames, Packt, 2017
- Spark GraphX programming guide
- Gonzalez, J. E. et al, GraphX: Graph Processing in a Distributed Dataflow Framework, OSDI 2014

*Further Resources*:

- Malewicz, G. et al, Pregel: A System for Large-Scale Graph Processing, ACM SIGMOD 2010; open source cousin: Apache Giraph (http://giraph.apache.org/)
- Low, Y. et al, Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB 2012
- Valiant, L. G., A Bridging Model for Parallel Computation, Communications of the ACM, Vol 33, No 8, August 1990

*Lab*: **Analysis of StackExchange user-posts-topics relations**

- Importing StackExchange relations into graphframes
- Computing degree and PageRank node centralities
- Answering graph motif queries
- Breadth-first search and connected components

In this week we will consider the basic concepts of data stream processing systems. We will explain various global aggregators, and cumulative and sliding-window stream data processing tasks. We will introduce the concept of publish-subscribe systems, using Apache Kafka as an example. We will discuss the importance of fault tolerance in stream processing systems and discuss fault tolerance models such as exactly-once execution. In this context, we will discuss the guarantees provided by ZooKeeper, an open source server that enables highly reliable distributed coordination.
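
A sliding-window computation of the kind described above can be sketched in plain Python (illustrative only, not Spark Streaming); here the window tracks the currently most frequent topic as elements arrive and old ones expire:

```python
# Illustrative sketch: a count-based sliding window over a stream of
# topic mentions, maintaining the current heavy hitter incrementally.
from collections import Counter, deque

def heavy_hitter_stream(stream, window_size):
    window, counts = deque(), Counter()
    for item in stream:
        window.append(item)
        counts[item] += 1
        if len(window) > window_size:       # expire the oldest element
            old = window.popleft()
            counts[old] -= 1
            if counts[old] == 0:
                del counts[old]
        yield counts.most_common(1)[0][0]   # current heavy hitter

stream = ["spark", "kafka", "spark", "flink", "kafka", "kafka", "kafka"]
print(list(heavy_hitter_stream(stream, window_size=3))[-1])  # kafka
```

Real stream processors use time-based rather than count-based windows and must handle out-of-order arrivals, but the incremental add/expire structure is the same.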

*Readings*:

*Further Resources*:

- Zaharia, M. et al, Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013
- Apache Kafka documentation

*Lab*: **Twitter feed processing**

- Import Twitter feed into Kafka as a topic
- Integrate Spark with Kafka
- Track heavy-hitter topics
- Visualize topic trends

In this week we will introduce the basic concepts of distributed machine learning algorithms for regression and classification tasks. We will discuss batch optimisation methods for model parameter estimation using gradient descent and its variations, such as BFGS and L-BFGS. We will also cover online optimisation methods such as stochastic gradient descent (SGD), parallel SGD and mini-batch SGD. We will also discuss model and data parallelisation approaches.
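
A mini-batch SGD loop of the kind discussed above can be sketched in plain Python for one-parameter linear regression (illustrative only; in a distributed setting each worker would compute the gradient of its own mini-batch and the gradients would then be aggregated):

```python
# Illustrative sketch: mini-batch SGD for linear regression y = w * x,
# minimising the mean squared loss over each mini-batch.
import random

def minibatch_sgd(data, lr=0.01, batch_size=4, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # gradient of the mean squared loss over the mini-batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return w

# synthetic data with true slope 3
data = [(0.1 * i, 0.3 * i) for i in range(-10, 11)]
w = minibatch_sgd(data)
print(round(w, 2))  # converges to about 3.0
```

With batch_size equal to the dataset size this reduces to batch gradient descent; with batch_size 1 it is plain SGD. Parallel SGD distributes the mini-batches across workers and averages the resulting gradients or models.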

*Readings*:

- Bottou, L. and Le Cun, Y., Large Scale Online Learning, NIPS 2003
- Drabas, T. and Lee, D., *Learning PySpark*, Chapter 5: Introducing MLlib, Packt, 2017
- Spark programming guide, MLlib: RDD-based API

*Further Resources*:

- Li, M., Scaling Distributed Machine Learning with the Parameter Server, OSDI 2014
- Numerical optimization: understanding L-BFGS, blog, December 2, 2014
- Machine Learning Glossary

*Lab*: **Churn prediction using the MLlib package**

- Import the Orange Telecoms Churn dataset
- Train a logistic regression model, compute predictions
- Train a decision tree model, compute predictions
- Evaluate and compare the two models

In this week we will continue with distributed machine learning algorithms for clustering and collaborative filtering tasks. We will discuss a MapReduce algorithm for the k-means clustering problem, as well as an iterative algorithm for a collaborative filtering problem. We will consider the Spark API approaches provided by the MLlib and ML packages. For the latter, we will introduce the concept of a pipeline consisting of a dataflow passing through transformer and estimator operators.
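
The MapReduce formulation of k-means mentioned above can be sketched in plain Python: the map phase assigns each point to its nearest centroid, and the reduce phase averages each group to produce the new centroids (one-dimensional points here, to keep the sketch short):

```python
# Illustrative sketch: one k-means iteration in the MapReduce style.
from collections import defaultdict

def kmeans_step(points, centroids):
    # map: point -> (index of nearest centroid, point)
    assignments = defaultdict(list)
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: (p - centroids[i]) ** 2)
        assignments[nearest].append(p)
    # reduce: each centroid index -> mean of its assigned points
    return [sum(ps) / len(ps) for i, ps in sorted(assignments.items())]

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
centroids = [0.0, 5.0]
for _ in range(5):
    centroids = kmeans_step(points, centroids)
print([round(c, 6) for c in centroids])  # [1.0, 9.5]
```

In Spark, the map phase runs per partition and the reduce phase sums (count, total) pairs per centroid, so only k small aggregates cross the network each iteration.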

*Readings*:

- Murphy, K. P., *Machine Learning: A Probabilistic Perspective*, k-means, Section 11.4.2.5, The MIT Press, 2012
- Drabas, T. and Lee, D., *Learning PySpark*, Chapter 5: Introducing MLlib and Chapter 6: Introducing the ML Package, Packt, 2017
- Spark programming guide, MLlib: RDD-based API

*Further Resources*:

- Moore, A., K-means and Hierarchical Clustering Tutorial
- Wang, Q., Spark machine learning pipeline by example, August 31, 2016
- Zadeh, R. et al, Matrix Computations and Optimizations in Spark, KDD 2016

*Lab*: **Clustering and movies recommendation**

- k-means clustering using a MapReduce algorithm
- Movie recommendation using the MovieLens dataset, training a collaborative filtering model with the Alternating Least Squares (ALS) algorithm

We will introduce the basic concepts of performing numerical computations using dataflow graphs. In such settings, graph nodes represent mathematical operations, while graph edges represent the multidimensional data arrays that flow between them. We will explain the architecture of TensorFlow, an open source library for numerical computations using dataflow graphs. We will go over the use case of learning a deep neural network, taking the basic architecture of a feedforward deep neural network.
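
The dataflow-graph idea can be sketched in a few lines of plain Python (an illustrative toy, not TensorFlow's architecture): nodes are operations, edges name their inputs, and a node is evaluated only once all of its inputs are available:

```python
# Illustrative sketch: a minimal dataflow graph. `graph` maps each node
# name to (operation, list of input node names); evaluation is demand-
# driven and memoised, so shared subexpressions are computed once.
def evaluate(graph, node, cache=None):
    cache = {} if cache is None else cache
    if node in cache:
        return cache[node]
    op, inputs = graph[node]
    args = [evaluate(graph, v, cache) for v in inputs]
    cache[node] = op(*args)
    return cache[node]

# graph for (x + y) * (x + y): the shared node "s" is evaluated once
graph = {
    "x": (lambda: 2.0, []),
    "y": (lambda: 3.0, []),
    "s": (lambda a, b: a + b, ["x", "y"]),
    "out": (lambda a, b: a * b, ["s", "s"]),
}
print(evaluate(graph, "out"))  # 25.0
```

Systems such as TensorFlow add tensors on the edges, automatic differentiation through the graph, and placement of subgraphs on devices (CPUs, GPUs) and machines, but the graph-of-operations structure is the same.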

*Readings*:

- TensorFlow API docs
- Abadi et al, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, OSDI 2016
- Drabas, T. and Lee, D., *Learning PySpark*, Chapter 8: TensorFrames, Packt, 2017
- TensorFrames GitHub page

*Further Resources*:

- TensorFlow GitHub page
- Abadi et al, A computational model for TensorFlow: an introduction, MAPL 2017
- Dean, J., Large-Scale Deep Learning with Tensorflow, ScaledML 2016
- Yu, D. et al, An Introduction to Computational Networks and the Computational Network Toolkit, Microsoft Research Technical Report, 2014

*Lab*: **Deep neural network learning**

- Import a dataset of labeled images
- Specify a feedforward deep neural network model
- Train the model
- Evaluate the classification accuracy of the trained model

In the last week, we will discuss how to deploy large-scale computations in a production cluster system. This will cover setting up a cluster system, running jobs over a varying number of machines in the cluster, and tracking their progress. We will consider simple MapReduce jobs as well as machine learning algorithms for prediction tasks on large-scale data.

*Readings*:

- Drabas, T. and Lee, D., *Learning PySpark*, Chapter 11: Packaging Spark Applications, Packt, 2017

*Lab*: **Click prediction using 1TB Criteo dataset**

- Load the Criteo dataset into an RDD
- Deploy a MapReduce job to compute the click-through rate in a cluster system with a varying number of machines
- Train a machine learning algorithm for click prediction using the Criteo dataset