
Plan:

  • what is it?
  • historical context
  • course structure
  • main topics
  • MapReduce

Distributed Systems

  • multiple networked cooperating computers

As a client, you don't worry about the infrastructure; you just use the service.

Why?

  • connect physically separate machines

Connecting different devices to a server lets us share resources, like files and printers, between separate machines.

  • increase capacity through parallelism

  • tolerate faults

If one machine fails, another takes over -> high availability.

  • achieve security through isolation

Historical context

  • started around the same time local networks appeared (1980s)

AFS (Andrew File System) - a distributed file system

DNS + email

  • rise of the datacenters, big websites (1990s)

Web search and shopping - huge amounts of data to manage, a lot of users.

  • cloud computing (2000s)

users moved their data to datacenters instead of running applications on local machines

  • Current state: still active

Challenges

  • many concurrent components
  • must deal with partial failure

Split brain - when the network is partitioned and the two parts can't communicate with each other, each part thinks it is the only one running.

  • it is tricky to realize the performance benefits of distributed systems

Focus

infrastructure:

  • storage (fs, db)
  • computation (mapreduce, frameworks)
  • communication (rpc, pub/sub)

Main topics

  • Fault-tolerance

availability (replication), recoverability (logging, transactions, durable storage)

  • Consistency

in replication: do the replicas have the same value?

  • Performance (throughput, latency)

Achieving all of these is hard, and there are tradeoffs between them.

  • Implementation (how to manage concurrency, how to communicate)

MapReduce

Context:

  • Multi-hour computations over terabytes of data
  • Web indexing, crawling

Goal: make it easy for non-experts to write distributed computations

Approach:

  • map + reduce (stateless functions => sequential code)
  • MR framework deals with all distribution (fault-tolerance, communication, etc.)

Not general purpose: the computation must fit into the MapReduce model.
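To make this concrete, here is a minimal word-count sketch in Go, loosely modeled on the classic example from the MapReduce paper. The names KeyValue, Map, and Reduce are illustrative, not a real framework's API; the point is that both functions are plain, stateless, sequential code with no distribution logic in them.

```go
package wordcount

import (
	"strconv"
	"strings"
	"unicode"
)

// KeyValue is the intermediate pair type (illustrative name, not a real API).
type KeyValue struct {
	Key   string
	Value string
}

// Map is stateless sequential code: given one input split (a file's
// contents), it emits one KeyValue per word occurrence.
func Map(filename string, contents string) []KeyValue {
	words := strings.FieldsFunc(contents, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	var kvs []KeyValue
	for _, w := range words {
		kvs = append(kvs, KeyValue{Key: w, Value: "1"})
	}
	return kvs
}

// Reduce sees one key together with every value emitted for that key
// across all map tasks, and returns the aggregate (here, the total count).
func Reduce(key string, values []string) string {
	return strconv.Itoa(len(values))
}
```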

Abstract view

Run many map functions in parallel over the input splits; each emits key-value pairs.

The framework then groups the intermediate pairs by key and runs the reduce function once per distinct key, producing the final output.

The expensive operation is the shuffle: moving the intermediate data from map tasks to reduce tasks.
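Continuing the word-count sketch above (reusing its KeyValue, Map, and Reduce), a sequential driver makes this dataflow explicit; runJob is a hypothetical helper, and a real MR framework runs the same three phases across many machines.

```go
// runJob shows the abstract dataflow sequentially: map every split,
// group intermediate pairs by key (the "shuffle"), then reduce per key.
func runJob(inputs map[string]string) map[string]string {
	// map phase: one Map call per input split; all calls are independent,
	// which is exactly what makes them parallelizable
	var intermediate []KeyValue
	for name, contents := range inputs {
		intermediate = append(intermediate, Map(name, contents)...)
	}

	// shuffle: group all values by key; in a distributed run this is the
	// expensive step, since intermediate data moves between machines
	groups := map[string][]string{}
	for _, kv := range intermediate {
		groups[kv.Key] = append(groups[kv.Key], kv.Value)
	}

	// reduce phase: one Reduce call per distinct key
	out := map[string]string{}
	for key, values := range groups {
		out[key] = Reduce(key, values)
	}
	return out
}
```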

Fault-tolerance

If a worker fails, the coordinator restarts its task on another worker.

A map task can run twice (failures, network problems), and this is fine: map is deterministic, so the result will be the same.

A reduce task can also run twice, for the same reason.

One machine basically runs one task at a time.

If the coordinator fails, the whole job must be rerun.
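To make the re-execution story concrete, here is a hedged sketch of the coordinator-side bookkeeping, assuming a simple timeout-based design; the type and field names are hypothetical, not the real framework's.

```go
package wordcount

import (
	"sync"
	"time"
)

type taskState int

const (
	idle taskState = iota // not yet assigned, or reclaimed after a failure
	inProgress
	done
)

// coordinator sketches the bookkeeping side of fault-tolerance: it never
// restarts a failed worker, it just hands the same task to someone else.
type coordinator struct {
	mu      sync.Mutex
	tasks   map[int]taskState
	started map[int]time.Time
}

// reclaimStragglers marks overdue in-progress tasks idle again so another
// worker can pick them up. Re-execution is safe because Map and Reduce
// are deterministic: running a task twice yields the same output.
func (c *coordinator) reclaimStragglers(timeout time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for id, st := range c.tasks {
		if st == inProgress && time.Since(c.started[id]) > timeout {
			c.tasks[id] = idle
		}
	}
}
```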