lse-st446.github.io

ST446 Distributed Computing for Big Data

Lent Term 2021

Instructors

Milan Vojnovic, email, Department of Statistics. Office hours: By appointment, COL 5.05

Teaching Assistants

Marcos Barreto
Jenjira Jaimunk
Kaifang Zhou

Course Information

Lectures on Wednesday 16:00–18:00 in NAB 1.15
Classes on Friday 12:30–14:00 in 32L.LG.18

No lectures or classes will take place during School Reading Week 6.

Week	Topic
1	Introduction
2	Distributed file systems and key-value stores
3	Distributed computation models
4	Structured queries over large datasets
5	Graph data processing
6	Reading Week
7	Stream data processing
8	Distributed optimization methods for machine learning
9	Machine learning optimization for matrix factorization and topic modeling
10	Scaling up distributed machine learning
11	Distributed dataflow graph computations

Course Description

This course covers main principles of architecture and application programming interfaces of distributed systems for storing and processing big data. The topics covered include distributed file systems, storage systems, and data models that are in common use in modern on-premise data analytics platforms and cloud computing services. The course covers principles of computing over large datasets in distributed computing systems involving multi-core processors and cluster computing systems. Students will learn how to perform canonical distributed computing tasks in batch, interactive and stream processing settings and how to run scalable machine learning algorithms for regression, classification, clustering and collaborative filtering tasks.

This course uses a project-based learning approach where students gain hands-on experience in using computing tools, services and writing software code through computer workshop exercises and project assignments. This equips students with key skills and knowledge about modern computing platforms for processing big data. In particular, students gain hands-on experience in working with Apache Hadoop, an open-source software framework for distributed storage and processing of large datasets using the MapReduce programming model and other services such as Apache Hive. They also gain hands-on experience in working with Apache Spark, a general engine for processing big data used across different industries, and connecting Spark with various data sources and other systems. The students learn how to run big data analytics tasks locally on their laptops as well as on distributed clusters of machines in the cloud. The students work on weekly exercises and project assignments by using GitHub, a popular revision-control and group collaboration tool. Each student develops code for solving one or more computation tasks and uses GitHub for accessing and submitting course materials and assignments.

On the theory side, we introduce the main principles of distributed systems for big data analytics, their design objectives, querying paradigms by using MapReduce and other computation models, general numerical computations using dataflow graphs, and querying data by using SQL-like application programming interfaces. We consider graph processing algorithms for querying graph properties and iterative computations on input graph data. We introduce the main principles of stream data processing, how to perform computations and execute queries over a sliding-window of input data stream elements. We study the principles of scalable machine learning algorithms, based on parallel implementation of gradient-descent based algorithms for minimizing a loss function, used for training classification and regression prediction models. We also consider distributed MapReduce computations for training clustering models such as k-means and collaborative filtering models based on matrix factorization. We consider numerical computations using dataflow graphs, with a focus on learning deep neural networks for image classification and other classification tasks. Students are encouraged to work on computing tasks and datasets that are of their own interest.

On the practical side, we cover a variety of tools that are part of a modern data scientist’s toolkit, including distributed computing using Apache Hadoop and MapReduce style processing of big datasets, Apache Spark, application programming interfaces for querying structured and unstructured datasets such as Apache Hive, Spark SQL, and Google BigQuery, stream data processing, and deploying large-scale machine learning models. Students learn how to write programs to define Spark jobs using the Python API and how to deploy a Spark job in a production environment. Students learn how to connect Spark data structures with a variety of external data sources, including key-value data stores, relational databases, and publish-subscribe messaging systems.

For the course project, students are asked to conduct a big data analytics task using the principles and technologies learned in class and possibly using principles and technologies not covered in the course in a great length. The project report is typically in the form a Jupyter notebook containing a working solution.

Organization

This course is an introduction to the fundamental concepts of distributed computing for big data for students and assumes no prior knowledge of these concepts.

The course involves 20 hours of lectures and 15 hours of computer workshops in the LT.

Prerequisites

Basic programming knowledge is expected. Prior experience with Python programming is desirable; for example, acquired through a course of the MSc in Data Science programme.

Software

We use a wide range of tools, including Juypter notebooks, Apache Hadoop, Google Bigtable, Apache Hive, Apache Spark / PySpark (Python API for Spark), SQL APIs for querying datasets, Tensorflow library for dataflow programs, Docker, and various cloud computing services, e.g. provided by the Google Cloud Platform. Lectures and assignments are posted on Github. Students are required to use GitHub to submit problem set solutions and final project report.

Where appropriate, we use Jupyter notebooks for lab assignments, demonstrations, and the course notes themselves.

Assessment

Project assignment (80%) and continuous assessment in weeks 4 and 7 (10% each). Students are expected to produce 10 problem sets in the LT.

Schedule

Week 1. Introduction

In the first week, we provide an overview of basic concepts starting with definition of big data, followed by an overview of structured, semi-structured and unstructured data types, including relational database tables, delimiter-separated file formats such as csv and tsv files, JSON and XML file formats. We then consider some main properties of traditional relational database management systems, their support of transactions and ACID properties. This leads us to consider the need for the design of scalable systems, the concepts of horizontal and vertical scaling, and various computer system bottlenecks. We then go on to consider modern big data analytics systems and how they have evolved over time. We introduce various computation paradigms such as batch processing, interactive processing, stream processing, and lambda architecture. We discuss main developments such as MapReduce computation model and noSQL databases. The rest of the lecture discusses various computation tasks that led to the development of modern big data analytics systems, which are studied throughout the course.

Readings:

Vogels, W., A Decade of Dynamo, Blog, All Things Distributed, 02 October 2017
Hamilton, J., One Size Doesn’t Fit All, Blog, Perspectives, 2012

Lab: Getting started

Command line interface and commands
Cluster and bucket creation in a cloud
Submitting a simple “Hello World” job to a cluster
Running a Jupyter notebook on a cluster

Week 2. Distributed file systems and key-value stores

In this week, we first consider the main design principles of distributed file systems, explaining the original Google File System and its refinements, as well as other distributed file systems such as Hadoop Distributed File System (HDFS). We then consider the main design principles of distributed key-value stores such as Amazon Dynamo and columnar data storage systems such as BigTable and Apache Cassandra.

Readings:

White, T., Hadoop: The Definitive Guide, O’Reilly, 4th Edition, 2015
Carpenter, J. and Hewitt, E., Cassandra: The Definitive Guide, 2nd Edition, O’Reilly, 2016
Ghemawat, S., Gobioff, H. and Leung S.-T., The Google file system, SOSP 2003
Shvachko, K. et al, The Hadoop Distributed File System, IEEE MSST 2010; see also html
DeCandia, G. et al, Dynamo: Amazon’s Highly Available Key-value Store, SOSP 2007
Chang, F. et al, Bigtable: A distributed storage system for structured data, OSDI 2006

Further Readings:

GFS: Evolution on Fast-Forward, ACM Queue, Vol 7, No 7, August, 2009
Apache Hadoop docs: HDFS Architecture
Vogels, W., Amazon’s Dynamo, Blog, 2007
Nishtala, R. et al, Scaling Memcache at Facebook, NSDI 2013
Fitzpatrick, P., Distributed Caching with Memcached, 2004
Lakshman, A. and Malik, K., A Decentralized Structured Storage System, LADIS 2009
Calder, B. et al, Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency, SOSP 2011
Huang, C. et al, Erasure Coding in Windows Azure Storage, USENIX 2012

Lab: System installation and API practice

Go through Hadoop installation
Basic file manipulation commands working with HDFS
Reading and writing data from BigTable

Week 3. Distributed computation models

In this lecture we explain the basic principles of distributed computation models that are in common use in the context of big data analytics systems. We start with explaining MapReduce computation model that is in widespread use for distributed processing of large datasets. We then move on to consider Pregel computation model, developed for iterative computations such as computing PageRank vector for a large-scale input graph. Finally, we consider the concept of a Resilient Distributed Dataset (RDD), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. This involves to consider the types of operations performed on RDDs and their execution on a distributed cluster of machines.

Readings:

Karau, H., Konwinski, A., Wendell, P. and Zaharia, M., Learning Spark: Lightining-fast Data Analysis, O’Reilly, 2015
Karau, H. and Warren, R., High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark, O’Reilly, 2017
Drabas, T. and Lee D., Learning PySpark, Packt, 2016
Dean, J. and Ghemawat, S., Mapreduce: Simiplified Data Processing on Large Clusters, Communications of the ACM, Vol 51, No 1, January 2008; original OSDI 2004 paper
Zaharia, M. et al, Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012

Further Readings:

Apache Hadoop documentation
Apache Hadoop documentation: MapReduce Tutorial
Malewicz, G. et al, Pregel: A System for Large-Scale Graph Processing, SIGMOD 2010
Spark programming guide
Chambers, B. and Zaharia, M., Spark: The Definitive Guide, databricks, 2017
Spark documentation: PySpark package

Lab: MapReduce and resilient distributed datasets

Run a MapReduce job on Hadoop for word count on dblp data
Hands-on experience with running operations on resilient distributed datasets in PySpark, such as map, flatMap, filter, distinct, sample, leftOuter and repartition, and actions such as take, collect, reduce, count, saveAsTextFile and foreach
Run the word count example using resilient distributed datasets in PySpark

Week 4. Structured queries over large datasets

In this week we consider systems for big data analytics using structured query languages. We start with considering the main architectural principles of Apache Hive, a data warehouse solution running on top of Apache Hadoop. We consider data types and query language used by Hive. We then consider the main design principles of Dremel (BigQuery) and Spark SQL for querying data using structured query languages.

Readings:

White, T., Hadoop: The Definitive Guide, O’Reilly, 4th Edition, 2015
Karau, H., Konwinski, A., Wendell, P. and Zaharia, M., Learning Spark: Lightining-fast Data Analysis, Chapter 9 Spark SQL, O’Reilly, 2015
Karau, H. and Warren, R., High Performance Spark: Best Practices for Scaling & Optimizing Apache Spark, O’Reilly, 2017
Drabas, T. and Lee D., Learning PySpark, Chapter 3 DataFrames, Packt, 2016
Rutherglen, J., Wampler, D., Capriolo, E., Programming Hive, 2nd Edition, O’Reilly, 2017
Prokopp, C., The Free Hive Book

Further Readings:

Apache Hive Tutorial
Apache Hive Language Manual
Thusoo, A. et al, Hive-A Petabyte Scale Data Warhouse Using Hadoop, ICDE 2010
Spark SQL programming guide: Spark SQL, DataFrames and Datasets Guide, Apache Spark 2.2.0, 2017
Armbrust, M., et al, Spark SQL: Relational Data Processing in Spark, ACM SIGMOD 2015
Armbrust, M., Dataframes: Simple and Fast Analysis of Structured Data, Webinar, 2017
Big Query SQL Reference: Standard and Legacy

Lab: Hive and Spark SQL queries

Run Hive queries, basic standard SQL and Hive specific queries such as TRANSFORM, and MAP and REDUCE, queries
Loading data from sources, including JSON, XML, weblogs using regular expressions
Run queries in Spark SQL using dataframe API and Spark Session SQL API
Data management using BigQuery via web interface and connector with Python

Note: assignment for grading to be given in this week

Week 5. Graph data processing

In this week we consider principles and systems for scalable processing of large-scale graph data. This includes queries such as evaluating node centralities (e.g. degree centrality), graph traversal or motif queries for finding structures in graph data (e.g. identifying friends-of-friends of a person who were born in London), and iterative algorithms on graph input data (e.g. computing PageRank node centralities). We discuss different data models for representation of graph data such as RDF, as well as query languages, including SPARQL, Gremlin, Cypher and openCyper that used in graph databases. We introduce the Bulk Synchronous Parallel computation model that underlies the design of modern computation platforms for iterative computing on input graph data.

Readings:

Drabas, T. and Karau, H., Learning PySpark, Chapter 7: GraphFrames, Packt, 2017
Spark GraphX: programming guide
Gonzalez, J. E. et al, GraphX: Graph Processing in Distributed Dataflow Framework, OSDI 2014

Further Resources:

Malewicz, G. et al, Pregel: A System for Large-Scale Graph Processing, ACM SIGMOD 2010; open source cousin: Apache Giraph
Low, Y. et al, Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB 2012
Valiant, L. G., A Bridging Model for Parallel Computation, Communications of the ACM, Vol 3, No 8, August 1990.

Lab: Analysis of StackExchange user-posts-topics relations

Importing StackExchange relations into graphframes
Computing degree and PageRank node centralities
Answering graph motif queries
Breadth-first search and connected components

Week 6. Reading week

Week 7. Stream data processing

In this week we consider basic concepts of data stream processing systems. We explain various global aggregators, cumulative, and sliding-window stream data processing tasks. We introduce the concept of publish-subscribe systems, taking Apache Kafka as an example. We discuss the importance of fault tolerance in stream processing systems and discuss fault tolerance models such as execute exactly-once. In this context, we discuss the guarantees provided by Zookeper, an open source server which enables highly reliable distributed coordination.

Readings:

Spark streaming programming guide
Spark structured streaming programming guide

Further Resources:

Zaharia, M. et al, Discretized Streams: Fault-Tolerant Streaming Computation at Scale, SOSP 2013
Apache Kafka documentation

Lab: Twitter feed processing

Import Twitter feed into Kafka as a topic
Integrate Spark with Kafka
Track heavy-hitter topics
Visualize topic trends

Week 8. Distributed optimization methods for machine learning

In this week we introduce the basic concepts of scalable distributed machine learning algorithms for regression and classification tasks. We discuss batch optimization methods for model parameter estimation using iterative methods such as gradient descent, batch gradient descent, stochastic gradient descent, and quasi-Newton methods such as BFGS and L-BFGS.

Readings:

Bottou, L. and Le Cun, Y., Large Scale Online Learning, NIPS 2003
Drabas, T. and Lee, D. Learning PySpark, Chapter 5: Intoducing MLib, Packt, 2017
Spark programming guide Machine Learning Library (MLlib) Guide

Further Resources:

Li, M., Scaling Distributed Machine Learning with the Parameter Server, OSDI 2014
Numerical optimization: understanding L-BFGS, blog, December 2, 2014
Machine Learning Glossary

Lab: Logistic regression using Spark MLlib

Training a logistic regression model using batch gradient descent in Spark MLlib
Comparison of stochastic gradient descent and L-BFGS methods

Note: assignment for grading to be given in this week

Week 9. Machine learning optimization for matrix factorization and topic modeling

In this week we continue considering the topic of distributed machine learning algorithms with a focus on matrix factorization methods used for collaborative filtering (recommender systems) and topic modeling in text data.

We will discuss matrix completion problem and the algorithm known as alternating least squares and its distributed implementation. We will then continue with discussing distributed algorithms for singular value decomposition and its application for latent semantic analysis of text data. Finally, we will discuss a Bayesian model, known as Latent Dirichlet Allocation, for topic modeling for text data sources.

Readings:

Drabas, T. and Lee, D. Learning PySpark, Chapter 5: Intoducing MLib and Chapter 6: Introducting the ML Package, Packt, 2017
Spark programming guide Machine Learning Library (MLlib) Guide

Further Resources:

Zadeh, R. et al, Matrix Computations and Optimizations in Spark, KDD 2016
Koren, Y., Bell, R. and Volinsky, C., Matrix Factorization Techniques for Recommender Systems, Computer, Vol 42, No 8, 2009
Deerwester, S., Dumais, S. T., and Harshman, R., Indexing by Latent Semantic Analysis, Journal of the American Society for Information Science, Vol 41, No 6, 391-407, 1990
Blei, D. M., Ng, A. Y., and Jordan, M., I., Latent Dirichlet Allocation, JMLR 2003
Hoffman, M., Bach, F. R., and Blei, D. M., Online Learning for Latent Dirichlet Allocation, NIPS 2010

Lab: Collaborative filtering and topic modelling using Spark MLlib

Movielens movie recommendation problem using Alternating Least Squares
Latent semantic indexing using singular value decomposition
Topic modelling using Latent Dirichlet Allocation

Week 10. Scaling up distributed machine learning

In this lecture, we will introduce basic modeling used for training large-scale machine learning models, with the main application for training large-scale deep neural networks used for image recognition, speech processing and other tasks. We will cover basic distributed computing models, including data parallel and model parallel computation models. We will discuss scalability properties of different approaches, and explain their efficiency with respect to computation and communication complexity.

Lab: Introduction to deep learning using TensorFlow

TensorFlow and Deep Learning, without a PhD Google codelab
Using Cloud ML Engine and Cloud Datalab

Week 11. Distributed dataflow graph computations

In our last lecture, we continue discussing the topic of scaling up distributed machine learning algorithms but from a systems perspective, focusing on data structures, application programming interface and how to the theoretical principles are put in practice. We will discuss the concept of a dataflow graph where nodes of the graph represent operations (e.g. a matrix multiplication, a non-linear activation function) and edges represent flow of data between operations. We will primarily focus on one system that is based on the concept of dataflow graph computations, namely, TensorFlow, used for learning and inference for deep neural networks. We will explain the key system architectural concepts that underlie the design of TensorFlow as well as the application programming interface.

Readings:

Abadi, M. et al, TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems, OSDI 2016
DistBelief: Dean, J., Large Scale Distributed Deep Networks, NIPS 2012
TensorFlow API docs

Further Resources:

Tensorflow gitHub page
Abadi et al, A computational model for TensorFlow: an introduction, MAPL 2017
Dean, J., Large-Scale Deep Learning with Tensorflow, ScaledML 2016

Lab: Distributed TensorFlow

Google Cloud Platform: Running Distributed TensorFlow on Compute Engine
cloud-ml-mnist-example