ST446 Distributed Computing for Big Data

Lent Term 2018


Teaching Assistant

Course Information

No lectures or classes will take place during School Reading Week 6.

Week Topic
1 Introduction
2 Distributed file systems and key-value stores
3 Distributed computation models
4 Structured queries over large datasets
5 Graph data processing
Guest lecturer: Mark Needham, Software Engineer, Neo4j
6 Reading Week
7 Stream data processing
Guest lecturer: Eno Thereska, Principal Engineer, Amazon
8 Scalable machine learning I
Guest lecturer: Ulrich Paquet, Research Scientist, Google DeepMind
9 Scalable machine learning II
Guest lecturer: Ryota Tomioka, Researcher, Microsoft Research
10 Cloud computing services
Guest lecturer: Marc Cohen, Software Engineer, Google
11 Distributed dataflow graph computations

Course Description

This course covers the main principles and application programming interfaces of distributed systems for storing and processing big data. This includes the principles of distributed file systems, storage systems, and data models that are in common use in modern on-premise data analytics platforms and cloud computing services. The course covers the principles of computing over large datasets in distributed computing systems involving multi-core processors and cluster computing systems. Students will learn how to perform canonical distributed computing tasks in batch, interactive and stream processing settings and how to run scalable machine learning algorithms for regression, classification, clustering and collaborative filtering tasks.

This course uses a project-based learning approach where students gain hands-on experience in using computing tools, services and writing software code through computer workshop exercises and project assignments. This equips students with key skills and knowledge about modern computation platforms for processing big data. In particular, students gain hands-on experience in working with Apache Hadoop, an open-source software framework for distributed storage and processing of dataset of big data using the MapReduce programming model and other services such as Apache Hive. They also gain hands-on experience in working with Apache Spark, the fastest-growing general engine for processing big data, used across different industries, and connecting Spark with various data sources and other systems. The students learn how to run big data analytics tasks locally on their laptops as well as on distributed clusters of machines in the cloud. The students work on weekly exercises and project assignments by using GitHub, a popular revision-control and group collaboration tool. Each student develops code for solving one or more computation tasks and uses GitHub for accessing and submitting course materials and assignments.

On the theory side, we introduce the main principles of distributed systems for big data analytics, their design objectives, querying paradigms by using MapReduce and other computation models, general numerical computations using dataflow graphs, and querying data by using SQL-like application programming interfaces. We consider graph processing algorithms for querying graph properties and iterative computations on input graph data. We introduce the principles of stream processing, how to perform computations and execute queries over a sliding-window of input data stream elements. We study the principles of scalable machine learning algorithms, based on parallel implementations of gradient-descent style algorithms for minimizing a loss function, used for training regression and classification models. We also consider distributed MapReduce computations for training clustering models such as k-means and collaborative filtering models based on matrix factorization. We consider numerical computations using dataflow graphs, with a focus on learning deep neural networks for image classification and other classification tasks. Students are encouraged to work on computations and data relevant to their own interests.

On the practical side, we cover a variety of tools that are part of a modern data scientist’s toolkit, including distributed computing using Apache Hadoop and MapReduce style processing of big data sets, Apache Spark, application programming interfaces for querying structured and unstructured datasets such as Apache Hive, Spark SQL, and Google BigQuery, stream data processing, and deploying large-scale machine learning models. Students learn how to write programmes to define Spark jobs using the Python API and how to deploy a Spark job in a production environment. Students learn how to connect Spark data structures with a variety of external data sources, including key-value data stores, relational databases, and publish-subscribe messaging systems.

For the final project, students are asked to conduct a big data analytics task using the principles and technologies learned in class as well as to learn other related technologies not covered in course in a great length (e.g. working with Apache Cassandra or Microsoft Congitive Toolkit). The project report is typically in the form a Jupyter notebook and a working solution.


This course is an introduction to the fundamental concepts of distributed computing for big data for students and assumes no prior knowledge of these concepts.

The course involves 20 hours of lectures and 15 hours of computer workshops in the LT.


Some basic prior programming experience is expected. Prior experience with Python programming is desirable; for example, acquired through the compulsory courses of the MSc in Data Science program.


We use a wide range of tools, including Juypter notebooks, Apache Hadoop, Google Bigtable, Apache Hive, Apache Spark / PySpark (Python API for Spark), SQL APIs for querying datasets, Tensorflow library for dataflow programs, Docker, and various cloud computing services, e.g. provided by the Google Cloud Platform. Lectures and assignments are posted on Github. Students are expected to use Github also to submit problem sets and final exam.

Where appropriate, we use Jupyter notebooks for lab assignments, demonstrations, and the course notes themselves.


Project assignment (80%) and continuous assessment in weeks 4 and 7 (10% each). Students are expected to produce 10 problem sets in the LT.


Week 1. Introduction

In the first week, we provide an overview of basic concepts starting with a definition of big data, followed by an overview of structured, semi-structured and unstructured data types, including relational database tables, delimiter-separated file formats such as csv and tsv files, JSON and XML file formats. We then consider main properties of traditional relational database management systems, their support of transactions and ACID properties. This leads us to consider the need for the design of scalable systems, the concepts of horizontal and vertical scaling, and various computer system bottlenecks. We then go on to consider modern big data analytics systems and how they have evolved over time. We introduce various computation paradigms such as batch processing, interactive processing, stream processing, and lambda architecture. We discuss main developments such as MapReduce computation model and noSQL databases. The rest of the lecture discusses various computation tasks that led to the development of modern big data analytics systems, which are studied throughout the course.


Lab: Getting started

Week 2. Distributed file systems and key-value stores

In this week, we first consider the main design principles of distributed file systems, explaining the original Google File System and its refinements, as well as other distributed file systems such as Hadoop Distributed File System (HDFS). We then consider the main design principles of distributed key-value stores such as Amazon Dynamo and columnar data storage systems such as BigTable and Apache Cassandra.


Further Readings:

Lab: System installation and API practice

Week 3. Distributed computation models

In this lecture we explain the basic principles of distributed computation models that are in common use in the context of big data analytics systems. We start with explaining MapReduce computation model that is in widespread use for distributed processing of large datasets. We then move on to consider Pregel computation model, developed for iterative computations such as computing PageRank vector for a large-scale input graph. Finally, we consider the concept of a Resilient Distributed Dataset (RDD), a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner. This involves to consider the types of operations performed on RDDs and their execution on a distributed cluster of machines.


Further Readings:

Lab: MapReduce and resilient distributed datasets

Week 4. Structured queries over large datasets

In this week we consider systems for big data analytics using structured query languages. We start with considering the main architectural principles of Apache Hive, a data warehouse solution running on top of Apache Hadoop. We consider data types and query language used by Hive. We then consider the main design principles of Dremel (BigQuery) and Spark SQL for querying data using structured query languages.


Further Readings:

Lab: Hive and Spark SQL queries

Note: assignment for grading to be given in this week

Week 5. Graph data processing

In this week we consider principles and systems for scalable processing of large-scale graph data. This includes queries such as evaluating node centralities (e.g. degree centrality), graph traversal or motif queries for finding structures in graph data (e.g. identifying friends-of-friends of a person who were born in London), and iterative algorithms on graph input data (e.g. computing PageRank node centralities). We discuss different data models for representation of graph data such as RDF, as well as query languages, including SPARQL, Gremlin, Cypher and openCyper that used in graph databases. We introduce the Bulk Synchronous Parallel computation model that underlies the design of modern computation platforms for iterative computing on input graph data.


Further Resources:

Lab: Analysis of StackExchange user-posts-topics relations

Week 6. Reading week

Week 7. Stream data processing

In this week we consider basic concepts of data stream processing systems. We explain various global aggregators, cumulative, and sliding-window stream data processing tasks. We introduce the concept of publish-subscribe systems, taking Apache Kafka as an example. We discuss the importance of fault tolerance in stream processing systems and discuss fault tolerance models such as execute exactly-once. In this context, we discuss the guarantees provided by Zookeper, an open source server which enables highly reliable distributed coordination.


Further Resources:

Lab: Twitter feed processing

Week 8. Scalable machine learning I

In this week we introduce the basic concepts of scalable distributed machine learning algorithms for regression and classification tasks. We discuss batch optimization methods for model parameter estimation using iterative methods such as gradient descent, batch gradient descent, stochastic gradient descent, and quasi-Newton methods such as BFGS and L-BFGS.


Further Resources:

Lab: Logistic regression using Spark MLlib

Note: assignment for grading to be given in this week

Week 9. Scalable machine learning II

In this week we continue by considering distributed machine learning algorithms for learning deep neural networks, and consider distributed algorithms for other machine learning problems such as collaborative filtering for recommendation systems and topic model for text data analysis.

for clustering and collaborative filtering tasks. We discuss a MapReduce algorithm for k-means clustering problem, as well as an iterative algorithm for collaborative filtering tasks. We consider Spark API approaches provided by MLib and ML packages. In the latter context, we introduce the concept of a pipeline that consists of a dataflow passing through transformer and estimator operators.


Further Resources:

Lab: Collaborative filtering and topic modelling using Spark MLlib

Week 10. Cloud computing services

Guest lecture: “Data Science in the Cloud,” Marc Cohen, Software Engineer, Google.

The lecture will provide an overview of cloud services provided by Google including Compute Engine, Cloud ML Engine, Vision API, Translation API, Video intelligence API, and managed TensorFlow at scale.

Lab: Introduction to deep learning using TensorFlow

Week 11. Distributed dataflow graph computations

In our last lecture, we introduce basic concepts of distributed computing using dataflow graphs. In such settings, the user defines a dataflow graph where nodes of the graph represent operations (e.g. a matrix multiplication, a non-linear function) and edges represent flow of data between operations. We will primarily focus on one system that uses dataflow graph computations, namely, TensorFlow, used for learning and inference for deep neural networks. We will explain the key system architectural concepts that underlie the design of TensorFlow as well as the application programming interface.


Further Resources:

Lab: Distributed TensorFlow