Final Exam study guide

The three-hour study guide for the final exam

Paul Krzyzanowski

December 16, 2011

Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don't take the time window in the title literally.

Introduction

Why are distributed systems more interesting now than they may have been one or two dozen years ago? A number of advances in computing technology had a profound effect on distributed systems. Network connectivity increased by a factor of a thousand since the 1980s. Connectivity within a local area network (LAN) moved from shared to switched networking, allowing the network to scale without increasing congestion. On the wide area, Internet access has become available to the population at large, not just to researchers on Department of Defense projects. Processor performance, system memory, and disk capacity has also increased by more than a thousandfold over the past couple of decades.

Even though these improvements make it easier to store, process, and move data between systems, what are the motivations for distributing computing? There are several. Performance does not scale linearly with an increase in price with a single computer; with a collection of computers this scaling may be possible. Secondly, distributing systems makes sense in certain environments: databases may reside in different locations than the user, for example. Thirdly, we may need to interact with other people or remote data that are geographically dispersed. Metcalfe's "Law" states that the value of a telecommunications network is proportional to the square of the number of connected users of the system. This isn't a real law, of course. The "square" part comes from the fact that the number of edges in a fully-connected graph is proportional to the square of the number of vertices. Socially, a vertex is a person and an edge is the communication path from one person to another. Simply put, there's a lot of value in being able to communicate with a lot of people. Without it, services such as skype, google, twitter, eBay, facebook, and countless others would not be nearly as useful.

Taxonomy

One way of classifying system architectures is via Flynn's taxonomy, proposed by Michael J. Flynn in 1972. He categorized machines based on the number of concurrent instruction streams and the number of data streams. SISD (single instruction stream, single data stream) refers to conventional single processor systems. SIMD (single instruction stream, multiple data streams) refers to single processor machines where each instruction may process a collection of data. Vector and array processors fall into this category. This includes graphics processors, cell processors, Intel's Streaming SIMD Extensions (SSE4) in their x86 family of processors, Intel's Advanced Vector Extensions, (AVX), and the PowerPC's AltiVec instructions. Finally, MIMD (multiple instruction streams, multiple data streams) refer to any machines with multiple processors, where each processor operates on its own stream of data.

MIMD can be further categorized by identifying whether the system has a shared memory space or not. Systems with shared memory are known as multiprocessor systems. Examples are conventional PCs with multiple processors on a single system bus (e.g., a dual-processor Intel Xeon system) or multi-core systems.

An architecture where multiple identical processors communicate with a single shared memory is called a multiprocessor. Systems without shared memory are collections of separate machines, each with its own memory. They have to rely on a network to communicate and are sometimes referred to as networked computers or multicomputers.

Multiprocessor systems are characterized by three features: (1) the processors all share the same memory, (2) they all share the same clock, and (3) there is an all-or-nothing property to system failure. What this last item means is that if the system is dead, none of the processors are working. With a multicomputer system, it's certainly possible to have one computer functioning while another is not.

The most common architecture for a multiprocessor system is symmetric multiprocessing, or SMP. In this kind of system, all processors are connected to the same shared memory and run the same operating system. No one processor has better access to memory or system peripherals than any other.

Cache coherence

On a bus-based multiprocessor system, all the processors, system memory, and peripherals are connected to a shared system bus. Only one device can access the bus at any time. Since processors need to access memory continuously, there is constant contention for the system bus. Cache memory alleviates this, as it is a small amount of memory that is local to each processor with the expectation is that most programs exhibit a certain amount of locality, allowing most memory references to be satisfied from the cache.

The problem with using cached data is that if a processor modifies a memory location, it will only modify the local copy in the cache. Other processors will read the previous value. This is unacceptable but can be remedied by having write operations pass through to the system bus so that main memory can be updated. This is known as write-through. There is a performance penalty for doing this because we need to use the bus for memory write operations, but read operations usually far outnumber write operations. The second problem is that, even if a processor updates the data in main memory, another processor may still access its own cached data, which is now obsolete. This can be remedied by having each processor's cache controller monitor write operations on the system bus and detect whether others are modifying any cached addresses. This operation is known as snooping. Together, write-through and snooping comprise a snoopy cache. Virtually all bus-based multiprocessor systems employ a snoopy cache to ensure cache memory coherence.

Improving scalability

Because of its shared system bus, bus-based multiprocessor systems face increasing bus congestion as the number of CPUs in the system increases. Beyond eight or so processors, the effects of this congestion can become quite severe, rendering the architecture impractical for highly parallel systems.

From a performance point of view, a crossbar switch is an ideal solution. It allows the switching of any processor to any chunk of memory without interfering with the traffic from any other processor accessing any other chunk of memory. With a sufficiently large crossbar switch, memory can be divided into a lot of chunks and delays will only occur when two processors need to access the same region of memory at the same time.

The problem with a crossbar switch is that it is quite expensive, particularly in the sizes needed to support a large number of processors and a fine-grained partitioning of memory. An attempt to alleviate the cost of a crossbar switch is to decrease the aggregate number of switching elements (crosspoint switches) by increasing the number of switching stages. This is known as an omega network. Unfortunately, this slows down all memory accesses, as each request to memory has to pass through a number of crosspoint switches.

A compromise solution is a non-uniform memory architecture, or NUMA. This architecture associates part of the global system memory with each processor or a group of processors (node), typically on the same board. Every processor (or group/node) is able to access a portion of the system memory at high speed through a local bus. The remaining memory is accessible through a shared (but slower) backplane. The idea is that a program may be loaded in such a way that most memory references are from local memory (and fast). The trick is for the operating system to place programs in memory and assign them to the correct processor. (The other tricky part, in hardware, is to design an architecture to keep caches coherent.)

NUMA is supported in many of today's operating systems and processors. AMD's HyperTransport technology (HTT) in 64-bit Opteron CPUs, each CPU has its own bank of DDR memory and communicates with inter-processor memory over the HTT link. Intel announced support for NUMA in 2007 with their Nehalem and Tukwila processors. These machines can be connected to shared memory via a Quick Path Interconnect (QPI). The Intel Xeon supports an Integrated Memory Controller that provides a fast channel to local memory.

In operating systems, a traditional scheduler, such as the one in the 2.4 Linux kernel, used a single run queue. A multi-queue scheduler with a separate run queue per processor was developed in the 2.5 kernel to support the NUMA model (2002). Microsoft added NUMA support in their Windows Server 2003 and added further optimizations in Windows Server 2008. The system attempts to improve memory access performance by scheduling threads on processors that are in the same node as the It also tries to allocate memory within the same node if possible.

Multicomputers

When we switch focus away from multiprocessors to multicomputers, the fundamental difference is that the processing elements no longer have access to the same memory system. Since machines need to communicate, an alternate communication mechanism is needed. This mechanism is the network interconnect. As with multiprocessors, the network interconnect may be bus-based or switched. A bus-based interconnect means that all systems are connected to the same communications bus and can see all the traffic on the network. The original design of the ethernet was bus-based. Today, it is generally not seen on local area networks. A switched interconnect allows any pair of machines to communicate without affecting the bandwidth of other systems. However, switches generally simulate the features of a bus-based network to allow things like network broadcasts and multicasts to work.

Software

One goal in some distributed systems software is providing a single-system image. This is software that makes a collection of independent computers appear as a single system to the users.

Service models

The traditional computing model is a centralized one, where all computing takes place on a single system. A client-server architecture divides computing activity between servers (which provide some service) and clients (which access the service). This two-tiered model can extend to multiple-tiers (a service requests yet another service and so on). Another model is a processor pool model, where an arsenal of CPU servers can be accessed for computing needs. A generalization of this is grid computing, which provides users with a collection of processing power as well as storage resources, often supporting heterogeneous environments. Thin clients are those where the client software is minimal; often managing the user interface and nothing more (e.g., a system running only a web browser or an X-server). These are in contradistinction to thick clients, which are generally machines running entire applications with occasional reliance on services from servers (e.g., full PCs).

Networking

Networks fall into two broad categories: circuit-switched versus packet switched. A circuit-switched network is one where a dedicated path is established between two endpoints. It provides guaranteed bandwidth and constant latency. The telephone network is an example of circuit switching, providing a maximum delay of 150 milliseconds and digitizing voice to a constant 64 kbps data rate. A circuit can be achieved via a dedicated physical circuit or by taking turns accessing a shared connection where each participant is granted a short, fixed time slot. This scheme for network access is known as TDMA, Time Division Multiple Access. Packet-switched networking uses a shared interconnect and variable size (hence, variable time) packets. Network access is accomplished via statistical multiplexing. Data is segmented into packets and each packet must be identified and addressed. These networks generally cannot provide guaranteed bandwidth or constant latency. Ethernet is an example of a packet-switched network.

Packet switching is the dominant means of data communication today. The packets in a packet-switched network are often called datagrams, and are characterized by unreliable delivery with no guarantees on arrival time or arrival order. This form of communication is also known as connectionless service: there is no concept of a connection, as one has with a circuit; each packet must be addressed to its destination.

A virtual circuit is a layer of software that tries to give a datagram some of the characteristics of a circuit-switched network. The software will typically send packet sequence numbers along with the data, buffer them in memory so they can be presented to the application in order, acknowledge received packets, and request a retransmit of missing or corrupt packets. The software will also keep track of source and destination addresses. We now have the illusion (hence the term virtual) of having a virtual circuit with its pre-set connection and reliable in-order message delivery. What we do not get is constant latency or guaranteed bandwidth. Virtual circuit service is also called connection-oriented service.

Data goes over a network in one of two ways baseband or broadband. Only one node may transmit data at a time on a baseband network but, for that time, it has the full bandwidth of the network. A broadband network, on the other hand, has its available bandwidth divided into multiple channels, or frequency bands. Cable TV is an example of a broadband network. However, data services offered by cable providers are confined to two of those channels (one for downstream traffic and another for upstream), making IP access effectively baseband within broadband. Don't confuse these terms with the marketing community's use of broadband to refer to any relatively high-speed home networking.

Data networking is generally implemented as a stack of several protocols – each responsible for a specific aspect of networking. The OSI reference model defines seven layers of network protocols. Some of the more interesting ones are: the network, transport, and presentation layers.

1. Physical
Hardware, voltage levels, frequencies, etc.
2. Data link
Send and get packets from the physical network. Ethernet packet transmission is an example of this layer.
3. Network
Relays and routes data to its destination. This is where networking gets interesting because we are no longer confined to a single physical network but can route traffic between networks. IP is an example of this layer.
4. Transport
Provides a software endpoint for networking. Now we can talk application-to-application instead of machine-to-machine. TCP/IP and UDP/IP are examples of this layer.
5. Session
Manages multiple logical connections over a single communication link. Examples are SSL (Secure Sockets Layer) tunnels and HTTP 1.1.
6. Presentation
Converts data between machine representations. Examples are XML, XDR (for ONC remote procedure calls), NDR (for Microsoft COM+ remote procedure calls), and ASN.1 (used for encoding cryptographic keys and digital certificates).
7. Application
This is a catch-all layer that includes every application-specific communication protocol. For example, SMTP (sending email), IMAP (receiving email), FTP (file transfer), HTTP (getting web pages).

Ethernet is the most widely used data network for local area networking. Ethernet provides packet-based, unreliable, connectionless communications. It occupies layers one and two of the OSI model. An ethernet controller uses carrier sense multiple access with collision detection (CSMA/CD) to access the network. This is analogous to making a phone call on a party line system. The network interface card monitors the network. Only when it detects no traffic does it send out its packet. While doing so, it still monitors the network to detect a collision — the case where two cards decided to send a packet out simultaneously. If a collision took place, the transmission is reattempted again at a later time. Progressively longer back-off times are used when collisions are encountered to ensure that overall performance degrades gracefully.

IP Networking

The Internet Protocol (IP) handles the interconnection of multiple local and wide-area networks. It is a logical network whose data is transported by physical networks (such as Ethernet, for example). The IP layer provides unreliable datagram delivery of packets between nodes (e.g., computers). Each machine on an IP network must have an IP address. The addressing scheme for IP divided addresses into two segments: a network part of the address, which is used in determining where to route the packet, and a host part, which is used in identifying the specific host within that local area network.

Instead of using a single network-host partition, IP was designed to use three distinct partitions, or classes of networks: A, B, and C. This allowed for a small number of huge networks and a large number of networks with a small number of machines. However, the allocation of machines to networks was still inefficient. An organization that needed addresses for 300 machines would be allocated a class B network, and over 65,000 addresses would go unused (a class C network, accommodating only 254 machines, would have been too small). Classless Inter-Domain Routing (CIDR) was created to alleviate this inefficiency. Networks could be allocated to organizations on any power of two (arbitrary network-host partitioning). This made routing tables a bit more complex; they now need to have an extra datum: the number of leading bits that constitute the network part of the address.

Since IP is a logical network, any machine that needs to send out IP packets must do so via the physical network. Typically, this is Ethernet (which uses a 48-bit Ethernet address, completely unrelated to a 32-bit IP address). To send an IP packet out, the system needs to identify the physical destination address (MAC, or Media Access Control address) that corresponds to the desired IP destination. The Address Resolution Protocol, or ARP, accomplishes this. It works by broadcasting a request containing an IP address (do you know the corresponding MAC address for this IP address) and then waiting for a response from the machine with the corresponding IP address. To avoid doing this for every outgoing packet, it maintains a cache of most recently used addresses.

There are two transport-layer protocols on top of IP: TCP and UDP. TCP (Transmission Control Protocol) provides virtual circuit (connection-oriented) service. This layer of software ensures that packets arrive in order to the application and lost or corrupt packets are retransmitted. The transport layer keeps track of the destination so that the application can have the illusion of a connected data stream. UDP (User Datagram Protocol) provides datagram (connectionless) service. While UDP drops packets with corrupt data, it does not ensure in-order delivery or reliable delivery. Port numbers in both TCP and UDP are used to allow the operating system to direct the data to the appropriate application (or, more precisely, to the socket that is associated with the communication stream).

Sockets are an interface to the network provided to applications by the operating system. They are created with the socket system call and assigned an address and port number with the bind system call. For connection-oriented protocols, a socket on the server can be set to listen for connections with the listen system call. The accept call blocks until a connection is received, at which point the server receives a socket dedicated to that connection. A client can establish a connection with the connect system call. After this, sending and receiving data is compatible with file operations: the same read/write system calls can be used. When communication is complete, the socket can be closed with the shutdown or close system calls.

Protocol encapsulation

We saw that if we want to send an IP packet out on an ethernet network (IP is a logical network, so there is no physical IP network), we needed to send out an ethernet packet. The entire IP packet becomes the payload (data) of an ethernet packet. Similarly, TCP and UDP have their own headers, distinct from IP headers (they need a port number, for example). A TCP or UDP packet is likewise treated as data by the IP layer. This wrapping process is known as protocol enveloping or protocol encapsulation.

Quality of Service

IP was designed as a system that would provide best-effort packet delivery but with no guarantees on the path a packet will take, whether it gets dropped, or what order it arrives in. Hence, there is no concept of delivering any specific grade of service. As IP networks began to be used for carrying continuous media, such as voice and data, several approaches were taken to attempt to provide better controls for scheduling the delivery of IP packets.

Quality of service (QoS) on a network is characterized by four factors:

  1. Bandwidth (or bit rate): the average number of bits per second over the network
  2. Delay (or latency): the average time for data to get from one endpoint to another
  3. Jitter: the variation in the delay
  4. Errors (or packet loss): the percentage of packets that do not reach their destination or reach it with errors

Quality of service control falls into one of two categories: hard QoS and soft QoS. Hard QoS refers to a system that is designed to provide a guaranteed level of service while soft QoS refers to a system that uses a best-effort approach to try to deliver the desired level of service.

Two broad approaches to providing data channels with managed quality of service on a network are admission control and traffic control. Admission control is designed to deliver hard QoS. With admission control, we ask that applications first request a particular quality of service from the network. The network (e.g. all the routers in the path from the source to the destination) will reserve resources for switching and routing the packets to conform to that grade of service and grant the request to the application. If it cannot commit to the needed resources, the request will be denied. Traffic control offers soft QoS. Soft QoS on a network refers to prioritization without any reservation of resources from routers or endpoints or any a priori negotiation for a level of service. With traffic control, applications may send data onto the network freely but the network elements (routers) will classify, queue, schedule, and sometimes drop packets based on various criteria.

Flow detection: Many routers attempt to detect a flow of a stream of packets from one address/port to another address/port and then dropping or delaying packets to control the flow rate. Routers can also be programmed to drop TCP packets over UDP or vice versa or drop packets when traffic exceeds an allotted bandwidth. Routers can also manage several queues based on flows, connected networks, or source or destination addresses and ports. Traffic shaping is when a router queues packets in certain flows during peak usage for later retransmission when there is available bandwidth. With traffic policing, traffic that exceeds the allocated bandwidth for a particular flow is discarded.

Inefficient use of packets: Having a system send lots of small packets instead of a few larger ones is clearly inefficient. For example, the overhead of TCP, IP, and ethernet headers is approximately 58 bytes. Sending one byte of date requires 59 bytes — a 5,800% overhead! Nagle's algorithm adds any new transmitted TCP/IP data to a buffer rather than sending it immediately as long as there are any unacknowledged packets outstanding. Incidentally, Nagle's algorithm can be disabled on a socket with the TCP_NODELAY option to the setsockopt system call

Differentiated services (DiffServ): Flow control mechanisms are outside the purview of programmers or even the computers at either end: it is up to the ISP (Internet Service Provider) and its router configuration to define the policies. Differentiated services provide a way for programmers to provide advisory information inside an IP header on how a packet should be processed by routers. A packet can be assigned a priority as well as high/low levels for reliability, throughput, and delay. DiffServ is an example of traffic classification rather than flow detection. It is entirely up to the routers to decide how to process this information or whether to process it at all. Differentiated services are an example of soft QoS: there is no guarantee on the actual quality of service that will be delivered. A common use for DiffServ is to try to provide a better quality of service for voice over IP (VoIP) phone calls by tagging those packets for Expedited Forwarding (EF) service, which is defined to have the characteristics of low delay, low loss, and low jitter. Whether DiffServ values are used at all and how they are interpreted is strictly up to the ISP.

Hard QoS approach: The Reservation protocol, RSVP, has been developed to allow a flow of packets to be routed with rate and/or delay guarantees. The problem with guaranteeing this is that all routers in the path from the source to the destination must be configured to support RSVP: each intermediate router must commit to reserving the needed amount of routing resources to guarantee the desired level of service. This is an example of admission control while differentiated services is an example of traffic control. If one ISP in the path does not support this then all bets are off.

ATM

Bandwidth and latency is, of course, an issue for real-time and streaming media applications. We can categorize the timing demands of traffic into three categories: asynchronous data has no timing requirements for message delivery; synchronous data must be delivered at strict deadlines; and isochronous data must meet specific bandwidth needs but may be delivered sooner than needed (streaming media with a receive buffer is an example).

ATM, or Asynchronous Transfer Mode, networking was created to bridge the synchronous needs of telephony (low bandwidth but precisely scheduled), asynchronous data networking (often high bandwidth), and isochronous streaming media applications. ATM is a packet based network that negotiates a circuit (route) when a connection is first established. Every router commits to having the requisite resources to provide for the service needs of the connection. The connection is created to provide ABR (available bit rate), CBR (constant bit rate), or VBR (variable bit rate) service.

ATM routes small (53-byte) cells in contrast to the variable-size packets of IP. This allows a router to express its service in cells per second, simplifies switching, makes scheduling predictable, and avoids the issue of large packets delaying small ones.

Naming

Names are used for identifying a variety of things. We often use a name server (offering a naming service) to perform a name to address mapping. An address is nothing more than the lower representation of a name. For example, you can't just look at a string like 192.168.1.5 and say it is an address. To an IP driver it is treated as a name and the address is the underlying ethernet address. Binding is the association of a name to an address. Resolution is the process of looking up that name-to-address binding.

Static binding refers to a hard-wired association between a name and an address (e.g., hard-coded in a program). Early binding refers to a resolution that is performed ahead of time and cached for future use. Late binding refers to performing the resolution just at the time a name-to-address binding is needed.

The Domain Name Server (DNS) is an example of a distributed name server for resolving domain names into IP addresses. Each server is responsible for answering questions about machines within its zone. A name server will do one of several things: (1) answer a request if it already knows the answer, (2) contact another name server(s) to search for the answer, (3) return an error fit he domain name does not exist, or (4) return a referral: the address of another name server that may know more of the answer. For example, searching for mail.pk.org (with no cached information) begins by querying one of several replicated root name servers. These keep track for name servers responsible for top-level domains. This query will give you a referral to a name server that is responsible the .org domain; querying that name server will give you a name server responsible for pk.org. Finally, the name server responsible for pk.org will provide the IP address or an authoritative "this host does not exist" response. Along the way, referrals may be cached so that a name server need not go through the entire process again.

An iterative name resolution process is one where you go through the name hierarchy to find lower-level name servers that will eventually take you to the final component of the name. This is the way DNS servers operate when they return a referral to another name server if they are not responsible for an address. A recursive name resolution process is one where the name server itself will invoke other name servers as needed to get the answer to the query. DNS supports both forms.

Clock synchronization

No two clocks tick in perfect synchrony with each other. The difference between two clocks at any given instant is the clock skew. The rate at which the clocks are drifting is the clock drift. A linear compensating function adjusts the rate at which time is measured on a computer (e.g., number of ticks that make up a second).

Cristian's algorithm sets the time on a client to the time returned by the server plus an offset that is one half of the transit time between the request and response messages: Tclient = Tserver + ½(Treceived - Tsent). It also allows one to compute the maximum error of the new time stamp. The error is ± ½[(round-trip time) - (best-case round-trip time)]. Errors are additive. If you incur an error of ±50 msec and the server's clock source has an error of ±80 msec, your clock's error is now ±130 msec.

The Berkeley algorithm does not assume the presence of a server with an accurate time (i.e., one that keeps track of UTC time). Instead, one system is chosen to act as a coordinator. It requests the time from all systems in the group (including itself) and computes a fault-tolerant average (an arithmetic average, dismissing systems whose time values differ by more than a certain amount). It then sends each machine an offset by which to adjust its clock.

The Network Time Protocol, NTP, was created to allow a large set of machines to synchronize their clocks. A set of machines acts as time servers – this collection of machines is the synchronization subnet. The subnet is hierarchical, with the time server's stratum defined as the number of hops it is from a machine that synchronizes from a direct time source. Machines that are directly connected to a time source (e.g., a GPS receiver) are at stratum 0. Machines that synchronize from a system at stratum 0 are at stratum one, and so on. The Simple Network Time Protocol, SNTP, is a restricted form of NTP that does not support peer-to-peer synchronization of time servers. The peer-to-peer mode maintains state on drift and synchronization time and is intended for time servers to synchronize with each other. SNTP is essentially the same as Cristian's algorithm. The formula for SNTP is time_offset = ½ (T2 - T1 + T3 - T4) where T1 is the time the message left the client, T2 is the time it arrived at the server, T3 is the time the response left the server, and T4 is the time that it arrived at the client. If we let TS = ½(T2+T3) then we arrive at Cristian's formula.

Logical clocks

Lamport clocks allow one to assign sequence numbers ("timestamps") to messages and other events so that all cooperating processes can agree on the order of related events. There is no assumption of a central time source and no concept of total ordering (when events took place). The central concept with logical clocks is the happened-before relation: a→b represents that event a occurred before event b. This order is imposed upon consecutive events at a process and also upon a message being sent before it is received at another process. Beyond that, we can use the transitive property of the relationship to determine causality: if a→b and b→c then a→c. If there is no causal relationship between two events (e.g., they occur on different processes that do not exchange messages or have not yet exchanged messages), the events are concurrent.

Lamport's algorithm states that every event is timestamped (assigned a sequence number) and each message carries a timestamp of the sender's clock (sequence number). A message comprises two events: (1) at the sender, we have the event of sending the message and (2) at the receiver, we have the event of receiving the message. The clock is a process-wide counter (e.g., a global variable) and is always incremented before each event. When a message arrives, if the receiver's clock is less than or equal to the message's timestamp, the clock is set to the message timestamp + 1. This ensures that the timestamp marking the event of a received message will always be greater than the timestamp of that sent message.

One problem with Lamport timestamps is that multiple events on different processes may all be tagged with the same timestamp. We can force each timestamp to be unique by suffixing it with a globally unique process number. While these new timestamps will not relate to real time ordering, they will be unique numbers that can be used for consistent comparisons of timestamps (e.g., if we need to make a decision on who gets to access a resource based on comparing two timestamps).

A second deficiency with Lamport timestamps is that, by looking at timestamps, one cannot determine whether there is a causal relationship between two events. For example, just because event a has a timestamp of 5 and event b has a timestamp of 6, it does not imply that event a happened before event b. A way to create timestamps that allow us to discern causal relationships is to use a vector clock. A vector clock is no longer a single value but rather a vector of numbers, each element corresponding to a process. Before affixing a vector timestamp to an event, a process increments the element of its local vector that corresponds to its position in the (for example, process 0 increments element 0 of its vector; process 1 increments element 1 of its vector). When a process receives a message, it sets the vector of the event to one that contains the higher of two values when doing and element-by-element comparison of the original event's vector and the vector received in the message. This becomes the new per-process vector from which future events on that process will be timestamped. For example, in an environment of four processors, P1 will "own" the second element of the vector. If one event on P1 is (2, 4, 0, 1) then the next event will be (2, 5, 0, 1). If the event after that is the receipt of a message with a timestamp of (3, 2, 9, 8) then the timestamp will be created by setting an element-by-element maximum of (2, 6, 0, 1) and (3, 2, 9, 8), which is (3, 6, 9, 8). We can illustrate this with the following pseudocode where e is the system's vector counter and r is the received vector timestamp:

/* receive message with vector timestamp r */ e[myproc]++; /* increment our process' index */ for (i=0; i < num_procs; i++) { if (e[i] < r[i]) e[i] = r[i]; }

Two events are concurrent if one vector timestamp is neither greater than or equal nor less than or equal to the other element when doing an element-by-element comparison. For example, events (2, 4, 6, 8) and (3, 4, 7, 9) are not concurrent (i.e., they are causally related) because every element of the first vector is less than or equal to the corresponding element of the second vector. The vectors (2, 4, 6, 8) and (1, 5, 4, 9) represent concurrent events. Neither vector is less than or equal to the other. For instance, 2 ≥ 1 (first element of the first vector is greater than the first element of the second vector) but 4 ≤ 5 (second element of the first vector is less than the second element of the second vector).

Group communication

Point-to-point communication is known as unicast. This is what we generally use to communicate a single client and server. There are other modes of communication. Broadcast is the sending of data to every node on the network. Anycast is point-to-point communication (as in unicast) but the receiver is the nearest one of receivers with specific capabilities (for example, IPv6 uses this to allow a host to update the routing table of the nearest host). Finally, there's group communication, known as multicast. This is point-to-multipoint communication. A message gets sent to everyone in the group. One implementation of multicast is known as netcast. This simulates hardware multicast by invoking multiple unicasts, one to each recipient. There are two considerations in group communication (multicast): reliability and message ordering.

Reliability

An atomic multicast requires that a message must reach all group members (if one node cannot get the message, no others can process it). This multicast must survive machines going down – both the sender and/or receivers. Because of this, it requires the most overhead in implementation, often employing persistent logs.

A reliable multicast is a best-effort multicast. The sender sends a message and waits for an acknowledgement. If it doesn't receive the acknowledgement in time, it will retransmit the message. Eventually, after a longer interval of time, it will give up and assume the receiver is dead.

An unreliable multicast doesn't wait for acknowledgements and will generally use whatever underlying multicast mechanism is provided.

Message ordering

The issue in message ordering is that multiple nodes can be sending messages to the entire group at the same time. Will each node receive all the messages in the exact same order? Will each node receive all the messages in the order they were sent?

Global time ordering requires that all messages arrive in the exact order they were sent: if node A sent a message 5 nanoseconds before node B did then all nodes should receive the message from node A first. This is impossible to implement (clocks can't be that perfectly synchronized and the chance of a tie always arises; also, networks will not be able to guarantee this kind of ordering if routing is involved). A more practical approach is total ordering. This requires that all messages arrive in the same order at all nodes. This can be easily achieved by providing a mechanism for obtaining a totally sequenced message ID: for example, from a sequence number server.

Causal ordering is the ordering you would get by attaching Lamport time stamps to messages. The ordering of unrelated (concurrent) messages is insignificant, but causally related messages will be in the proper order. Sync ordering guarantees that all messages from a specific sender will arrive in FIFO (first-in, first-out) order.

Sync ordering requires a special type of message — a sync message. When this is issued, any messages that were already sent have to be processed (and acknowledged to the senders). The sync assures that any messages sent before the sync will be processed before any messages after the sync (globally – for all nodes).

Finally, an unordered multicast doesn't impose any message ordering among messages from other nodes. It may choose to impose FIFO (first in, first out) ordering on messages from a particular source (e.g. as TCP does).

IP multicast

An IP multicast address (also known as a class D address) contains a 28-bit multicast address. A host may join this address and receive messages addressed to that multicast ID. Within a LAN, an IP class D address is mapped onto an Ethernet multicast address by copying the least-significant 23 bits of the address onto an Ethernet multicast address. Within a LAN, an ethernet chip is programmed to to perform an exact match on a small set of addresses or to accept addresses that hash to particular values. The ethernet driver will need to remove any unneeded addresses that pass through. The ethernet chip can also be set to multicast promiscuous mode, where it will accept all multicast ethernet packets.

The Internet Group Management Protocol (IGMP) manages membership beyond a LAN. A node broadcasts a multicast join message to join a specific group. A multicast-aware router will pick this message up and send the join message to other connected networks (this way a spanning tree is built, ultimately forwarding the request to the source). Periodically, each router sends a query message for each multicast group. If any node (or router) is still interested in the group, it must re-send a join message. If no join messages are received, then the router will stop responding to join messages and the LAN will no longer receive packets for that group. With IGMP v2, a leave message was added to avoid having the time out to realize that nobody is interested in a group.

Remote Procedure Calls

One problem with the interface offered by sockets was that it encouraged a send-receive model of interaction. However, most programs use a functional (procedure call) model. Remote procedure calls are a programming language construct (something provided by the compiler, as opposed to an operating system construct such as sockets). They provide the illusion of calling a procedure on a remote machine. During this time, execution of the local thread stops until the results are returned. The programmer is alleviated from packaging data, sending and receiving messages, and parsing results.

The illusion of a remote procedure call is accomplished by generating stub functions. On the client side, the stub is a function with the same interface as the desired remote procedure. Its job is to take the parameters, marshal them into a network message, send them to the server, await a reply, and then unmarshal the results and return them to the caller. On the server side, the stub (sometimes known as a skeleton) is responsible for being the main program that registers the service and awaits incoming requests for running the remote procedure. It unmarshals the data in the request, calls the user's procedure, and marshals the results into a network message that is sent back to the recipient.

Sun (ONC) RPC

Sun's RPC was one of the first RPC systems to achieve widespread use. It is still in use on virtually all Unix-derived systems (System V, *BSD, Linux, OS X). It uses a precompiler called rpcgen that takes input from an interface definition language (IDL) file. This is a file that defines the interfaces to the remote procedures. Fron this, rpcgen creates client stub functions and a server stub program. These can be compiled and linked with the client and server functions, respectively.

Every interface is assigned a unique 32-bit number, known as a program number. When the server starts up, it binds a socket to any available port and registers that port number and its program number with a name server, known as the portmapper, running on the same machine. A client, before invoking any remote procedure calls, contacts the portmapper on the desired server to find the port to which it needs to send its requests.

DCE RPC

The Distributed Computing Environment, defined by the Open Group created its own flavor of RPC, very similar to Sun's. They also had the programmer specify an interface in an IDL, which they called the Interface Definition Notation (IDN).

To avoid the problem of picking a unique 32-bit identifier for the interface, DCE RPC provides the programmer with a program called uuidgen. This generates a unique universal ID (UUID) – a 128-bit number that is a function of the current time and ethernet address.

The DCE also introduced the concept of a cell, which is an administrative grouping of machines. Each cell has a cell directory server that maintains information about the services available within the cell. Each machine in the cell knows how to contact the cell directory server. When a server program starts up under DCE RPC, it registers its port and the interface's UUID with a local name server (the DCE host dæmon, dced, which is similar to the portmapper). It also registers the UUID, host mapping with the cell directory server. This allows a degree of location transparency for services: a client does not need to know what machine a service lives on a priori.

A standard way of encapsulating data is crucial since encodings may differ between different machines. Sun defined a standard format called XDR (eXternal Data Representation). Every participating system must be sure to encode data into this format. DCE defined a format called NDR (Network Data Representation). Instead of being a single set of definitions, this format defines a set of data representations the can be used. The hope is that the sender can find a format that will require minimal or no data coversion (hence, greater efficiency). If the client uses the same system architecture, it too will not need to convert data. This is known as a multi-canonical approach to data conversion.

As object oriented languages gained popularity in the late 1980s and 1990s, RPC systems like Sun's and DCE's proved incapable of handling some object oriented constructs, such as instances of objects or polymorphism (different functions sharing the same name, with the function distinguished by the incoming parameters). A new generation of RPC systems dealt with these issues.

Microsoft COM+/DCOM & ORPC (MS-RPC)

Microsoft already had a scheme in place for dynamically loading components into a process. This was known as COM, the Component Object Model and provided a well-defined mechanism for a process to identify and access interfaces within the component. The same model was extended to invoke remotely-located components to become the Distributed Component Object Model (DCOM), later fully merged with COM into something called COM+. Because the components can no longer be loaded into the local process space (as they're on a remote system), they have to be loaded by some process. This process is known as a surrogate process. It runs on the server, accepting remote requests for loading components and invoking operations on them.

DCOM is implemented through remote procedure calls. Microsoft slightly enhanced DCE RPC to create what they termed Object RPC (ORPC). This is essentially DCE RPC with the addition of support for an interface pointer identifier (IPID). The IPID provides the ability to identify a specific instance of a remote object. Interfaces are defined via the Microsoft Interface Definition Language (MIDL) and compiled into client and server side stubs. The client-side stub becomes the local COM object that is loaded when the object is activated.

Since objects can be instantiated and deleted remotely, the surrogate process needs to ensure that there isn't a build-up of objects that are no longer needed by any client. Microsoft accomplishes this via remote reference counting. This is an explicit action where the client can send requests to increment or decrement a reference count on the server. When their reference count drops to zero, the surrogate process deletes the object. To guard against programming errors or processes that terminated abnormally, a secondary mechanism exists, called pinging. The client must periodically send the server a ping set – a list of all the remote objects that are currently active. If the server does not receive this information within a certain period, it elides the objects.

CORBA

The Common Object Request Broker Architecture (CORBA) was created to provide a software platform for distributing objects that is architecture, language, and operating system independent. The core concept is the ORB – the Object Request Broker. This is the collection of stub functions and libraries that support the remote invocation of procedures and the management of objects. CORBA provides an Interface Definition Language (IDL) that is compiled with a pre-compiler to create client and server stubs.

CORBA provides a rich set of capabilities for the management of objects. These include the ability to start the server if it is not running, discover interfaces, persist objects to persistent media,

One of the biggest problems with CORBA (aside from its complexity) was the fact that, while the programming interfaces were defined, the underlying protocol was not. This meant that clients and servers would often not be able to communicate unless they used an implementation of CORBA from the same vendor. This was rectified in 1996 with the introduction of the Internet Inter-ORB Protocol (IIOP). By this time, however, much of the momentum of using CORBA over the Internet was gone.

Java RMI

When Java was created, it was designed to be a language for deploying downloadable applets. In 1995, Sun extended Java to support Remote Method Invocation (RMI). This allows a programmer to invoke methods on objects that are resident on other JVMs.

Since RMI is designed for Java, there is no need for OS, language, or architecture interoperability. This allows RMI to have a simple and clean design. Classes that interact with RMI must simply play by a couple of rules. All parameters to remote methods must implement the serializable class. This ensures that the data can be serialized into a byte stream for transport over the network. All remote interfaces (methods that may be invoked from a remote Java virtual machine) must extend the remote class.

RMI provides a naming service called rmiregistry to allow clients to locate remote object references. These objects are given symbolic names and looked up via a URI naming scheme (e.g., rmi://remus.rutgers.edu:2311/testinterface).

Java's distributed garbage collection is somewhat simpler than Microsoft's COM+. There are two operations that a client can send to the server: dirty and clean. When the first reference to a remote object is made, the client JVM sends a dirty message to the server for that object. As long as local references exist for the object, the client will periodically send dirty messages to the server to renew the lease on the object. When the local JVM's garbage collector detects that there are no more references to the object, it sends a clean call for the object to the server.

XML RPC

As people started looking beyond the LAN for hosting services via RPC, firewalls stood in the way. Most server-side RPC systems had a habit of asking the operating system to pick any available port. This required opening up a whole range of ports on a firewall. Moreover, people often could not get firewall rules modified in some environments. A workaround to this is to send RPC messages via HTTP ports (e.g., 80 and 443), and have the messages formatted as XML messages to get around any content-inspecting firewalls.

XML-RPC was created in 1998 as a simple protocol that marshals all requests and responses into XML messages. There are a lot of libraries to support this system but no IDL compiler or stub function generator thus far.

SOAP

XML RPC took an evolutionary fork and evolved (with the support of companies such as Microsoft and IBM) into something known as SOAP, the Simple Object Access Protocol. XML RPC is a subset of SOAP. In addition to remote procedure calls, SOAP added support for general purpose messaging (send, receive, asynchronous notification) of messages. SOAP invocations are XML messages sent via an HTTP protocol. SOAP services can be described via an XML document formatted to conform to an interface specified by a corresponding Web Services Description Language (WSDL) document – another XML document.

.NET Remoting

A problem with Microsoft’s DCOM was that it was a somewhat low-level implementation. Clients had to provide reference counting of their objects explicitly and the convenience of using DCOM depended very much on the language (easy in VB, difficult in C++). Moreover, the use of operating-system-selected random ports, and a binary data format made it difficult to use over the Internet where firewalls may block certain ports or requests need to in an XML format and be sent over HTTP.

With .NET, Microsoft provided (among other things) a runtime environment, a set of libraries, and a web server that provides inbuilt support for web services. For supporting function-based access to web services, .NET provides a remoting capability that allows a process to interact with objects that reside on other processes and other machines.

.NET Remoting was created to be a successor to DCOM that would work more easily over the Internet (no random ports, for example, as well as the ability to use XML over HTTP). As with other RPC systems, client and server stub functions (proxies) are created. Microsoft’s Visual Studio application development environment hides the mechanics of this and integrates remote procedure calls directly into the language.

These stubs rely on the .NET runtime system to marshal parameters into one of several types, which include SOAP or binary formats. The .NET runtime system then sends the message over a particular channel that that was set up for transport. This channel may be HTTP using TCP/IP to send SOAP messages, TCP/IP with no wrapping protocol to send binary messages on a local-area network, or named pipes to send messages between processes on the same machine. Microsoft’s web server, IIS, can be configured to directs certain URLs that contain the SOAP (encapsulated in HTTP) request to specific objects running under the .NET framework, which then sends a response back to the web server, which is then returned to the caller.

.NET supports two forms of object activation: (1) server activated objects are those where the client has no control in managing the lifespan of the object, and (2) client activated objects are those where the lifetime is controlled by the client and the distributed garbage collection must be handled.

Two forms of server activated objects are supported:

  1. Single call objects instantiate a new instance of the object for a call and immediately clean it up upon return. There is no ability to keep state.
  2. Singleton objects share the same instance of the object among all callers. This is much like traditional function calls in non-object-oriented systems. Everyone shares the same state.

Client activated objects are created when the client requests a new object. This is similar to the way objects are handled in DCOM. .NET manages object lifetime of client activated objects via a Leasing Distributed Garbage Collector (LDGC). The lease manager on the server manages object lifetime by checking the object's lease timer. When an object is created, it is given an initial lease time (five minutes by default). Each time a method is called on an object, the lease timer is renewed. The object will be cleaned up unless the client either references its objects or makes explicit calls to renew the lease time. This is a very similar concept to Java RMI with the exception that a client never sends a message stating that there are no more object references (Java sends a clean message to the server). There is no more reference counting as under DCOM.

Scalability Terminology

Systems need to be able to scale as demand increases. A farm is a collection of servers, apps, and data at a site. A farm is administered as a single unit (think of the single system image that we covered in the introduction). A geoplex is a set of geographically replicated farms. This is done for fault tolerance and/or load balancing. With multiple farms, there are two ways of running them:

  1. active-active: all farms are running and handling traffic (requests from clients on the network).
  2. active-passive: one farm is active and others are in standby (passive), ready to take over when the active farm dies.

Service cloning is the replicating of the service onto another node (machine). This is crucial for load balancing so that requests can be routed to individual clones. It also provides fault tolerance since individual machines (clones) may fail. RACS stands for reliable array of cloned services. [Warning: this acronym is overloaded and used for other things, such as "redundant array of cloud storage". Hence, the acronym is usually not used very much.] Since services need to access storage, there are two basic ways of dealing with storage:

  1. Shared-nothing RACS: storage is duplicated per note. Cloning does not improve storage capacity and synchronization becomes an issue.
  2. Shared-disk RACS: all clones share common storage.

A partition is the duplication of hardware and software. This is just like shared-nothing RACS, but data is divided (partitioned) among nodes instead of replicated. The key functional difference is that, with RACS, a request can be handled by any server. With a partitioned system, only specific nodes can handle a specific request. Ideally, partitioning should be transparent to the application. Requests should get routed to the partition that has the relevant data. Partitioning on its own does not improve availability since data is not replicated. Even if data is fault tolerant, the node hardware that works on the data managed by that partition may fail. The way to deal with fault tolerance with partitioning is to use a group of two or more nodes that provide access to the storage in a partition: we clone a partition. This set of cloned nodes is called a pack. Just like with RACS, the storage can be shared among pack members or not:

  1. Shared-disk pack: all pack members can access the storage in a partition
  2. Shared nothing pack: each pack member has exclusive access its own storage. The shared nothing pack can be configured in two ways:
    1. Active-active: each node has responsibility for one partition but can access another if the other node dies
    2. Active-passive: one node is active, the other node(s) is in standby, waiting to take over when an active node dies.

Mutual exclusion

Mutual exclusion is responsible for making sure that only one process or thread accesses a resource at a time. In non-distributed systems, it is accomplished through mechanisms such as semaphores and monitors and, at a low level, through instructions such as test-and-set locks. None of these mechanisms work across networks of computers.

Distributed algorithms fall into three categories. A centralized approach uses a central coordinator that is responsible for granting access permissions. A token-based approach allows a process to access a resource if it is holding a "token"; that is, it received an unsolicited message where ownership of the message allows the process to access the resource. Sending the message (token) means forfeiting access. A contention-based algorithm is one where all processes coordinate together on deciding who gets access to the resource.

The centralized algorithm is a server that accepts REQUEST messages for a resource (e.g., critical section). If nobody is using the resource, it responds with a GRANT message. If somebody is using the resource, it doesn't respond. When a client is done with a resource, it sends the server a RELEASE message. The server then sends a GRANT message to the next client in the queue (if there is one).

The token ring algorithm creates a logical communication ring among the processes in the group. A message, called a token, is created for each resource. This token is passed from process to process along the ring. If a process receives a token and does not need to access that resource, it simply passes the token to the next process. Otherwise, it will hold on to the token until it is done with the resource.

The Ricart & Agrawala algorithm was an early demonstration that a truly distributed algorithm is possible. A process that wants to enter a critical section sends a request to all other processes in the group and waits for all of them to respond. If another process is currently using the critical section, that process delays its response until it is done. If process A and process B sent out requests concurrently (i.e., two systems want the same resource concurrently), each system compares the timestamp of that request with that of its own request. If process A received a request from process B that is older (a lower timestamp) than the one it sent, then process A will give process B priority to access the resource by sending a response to process B . Otherwise, if process A has the earlier timestamp, it will queue the request from B and continue to wait for all acknowledgements, sending a response to B (and any other processes who wanted the resource) only when it is done using the resource. The key point is that processes A and B will make the same comparison and exactly one will hold back on sending the response.

Lamport's algorithm, like the Ricart & Agrawala algorithm, is also based on reliable multicast. This algorithm has a process send a timestamped request to all processes in the group, including itself. Every message is acknowledged immediately. Each process places a received message in a priority queue that is sorted by the Lamport timestamps in the message. A process decides whether it can access the resource by checking whether its own request is the earliest (first) one in the queue of all requests that it has received. If so, it accesses the resource and, when done, sends a release to all members (including itself). The receipt of a release message causes each process to remove the process from the ordered queue. If a process now finds itself as the earliest process in the queue, it knows that it is now its turn to access the resource.

Election algorithms

Election algorithms are designed to allow a collection of processes to agree upon a single coordinator. All candidate processes are considered to be equally capable of acting as a coordinator. The task is to select one of these processes — and make sure that every process is aware of the selection. In designing these algorithms, we assume that every process has a unique ID (for example, a process ID combined with the machine's address).

The bully algorithm selects the largest process ID as the coordinator. If a process detects a dead coordinator, it sends an ELECTION message to all processes with a higher ID number and waits for any replies. If it gets none within a certain time, it announces itself as a coordinator. When a process receives an ELECTION message it immediately sends a response back to the requestor (so the requestor won't become the coordinator) and holds an election to see if there are any higher-numbered processes that will respond.

The ring election algorithm requires a logical communication ring among the processes in the group. When a process detects a non-responding coordinator, it creates an ELECTION message containing its own process ID and sends it to the next process on the ring. If the process is dead, it sends the message to the following process, skipping dead processes until it finds one that can receive the message. When a process receives an ELECTION message, it adds its own process ID to the body and sends that message out to its neighboring process. When the election message circulates back to the original sender, the sender looks at the list of active processes in the message body and chooses a coordinator from among them. Since multiple elections may have been started concurrently, the algorithm for choosing the leader must yield the same result among the different list. Hence, selecting the highest or the lowest process ID will work. Selecting the first or last process in the list will not.

The Chang and Roberts ring algorithm optimizes the ring algorithm in two ways. First, there is no need to keep the entire list of live processes in the election messages; we can perform the election as we go along. If the goal is to vote for the highest-numbered live process ID and a process receives an election message, it does one of two things: (1) pass it untouched if its process ID is smaller than the one in the message, or (2) replace the process ID In the message with its own if the process ID is greater than the one in the message. When a process receives a message that contains its own process ID, it knows that the message has come full circle and it is the winner. It will then notify each process of the outcome. The second optimization is to avoid having multiple election messages circulate if multiple processes have initiated elections concurrently. To do this, a process keeps track if it has participated in an election (i.e., has initiated or forwarded an election message). If a process receives a message with a smaller process ID than its own and it is a participant in an election, it simply discards the message.

One problem with election algorithms is when a network gets segmented (also known as partitioned: one set of processes is separated from another. In this case, each segment may elect its own coordinator and problems can arise. This is known as a split brain situation. To combat this, a redundant network is needed (or some alternate communication mechanism to enquire about the state of processes).

Consensus

The goal of a consensus algorithm is to allow a group of processes to agree on a common value. The value must be a value that was proposed by at least one process. Achieving consensus is easy if processes don't fail or, as with the two-phase commit protocol, we are willing to wait for processes to recover if they die. The challenge with a consensus algorithm is to achieve consensus in the presence of faults.

Agreement in faulty systems

There are three classes of faults: transient, intermittent, and permanent. Faults are either fail-silent (where the component does not produce results) or Byzantine (where the component produces faulty results).

The two-army problem demonstrates that reliable communication can never be achieved with faulty communication lines. The Byzantine generals problem illustrates the difficulty of achieving reliable communication in the presence of faulty processors exhibiting Byzantine faults.

Paxos

Paxos is a popular fault-tolerant distributed consensus algorithm. It allows agreement on a proposed value as long as a majority of processes running the Paxos algorithm is alive. Paxos can be used to agree on a globally consistent (total) order to be assigned to client messages.

Paxos comprises three groups of agents: proposers, acceptors, and learners. Proposers put forth proposed values. Acceptors drive the algorithm's goal to reach agreement on a single value. Learners are informed of the consensus outcome by acceptors and can take action on the result (e.g., write data to storage).

See the lecture notes for a description of the algorithm.

When processes rely on locks, there is an implicit assumption that fault-tolerance is lost. The process holding a lock may forget to release a lock or may die. A way to avoid this problem is to use leases instead of locks. A lease is a lock with an expiration time. The downside with a leasing approach is that the resource is unavailable to others until the lease expires. Now we have a trade-off: have long leases with a possibly long wait after a failure or have short leases that need to be renewed frequently.

An advantage of Paxos is that it can provide a fault-tolerant approach for achieving consensus granting leases to resources. However, the overhead of the algorithm isn't attractive for all situations. A compromise is to use hierarchical leases, or sub-leases. A consensus algorithm is used to elect a coordinator. This coordinator is granted a lease on a set of resources (or all resources) in the system. Now the coordinator hands out leases to processes that need those resources. When the coordinator's main lease expires, a consensus algorithm has to be run again to grant a new lease and possibly elect a new coordinator but it does not have to be run for every client's lease request; that is simply handled by the coordinator.

Distributed transactions

A transaction is a set of operations that operates, and often modifies, data. A key facet of a transaction is that it has to be atomic — all results have to be made permanent (commit) and appear as an indivisible action. If a transaction cannot complete, it must abort, reverting the state of the system to that before it ran. If several transactions run concurrently, the end result must be the same as if they ran in some (any) serial order. The specific order in which transactions execute is known as a schedule.

A transaction-based model guarantees a set of properties known as ACID:

Atomic
The transaction happens as a single indivisible action. Everything succeeds or else the entire transaction is rolled back. Others do not see intermediate results.
Consistent
A transaction cannot leave the database in an inconsistent state. If the system has invariants, they must hold after the transaction. For example, the total amount of money in all accounts must be the same before and after a “transfer funds” transaction.
Isolated (Serializable)
Transactions cannot interfere with each other. If transactions run at the same time, the final result must be the same as if they executed in some serial order.
Durable
Once a transaction commits,the results are made permanent. No failures after a commit will cause the results to revert.

A write-ahead log (or transaction log) is used for rollback: reverting to the previous state when aborting a transaction. It is also crucial for maintaining state in a stable place in case the system should die; it allows the system to recover from where it was in the transaction.

Two-phase commit protocol

The two-phase commit protocol uses atomic multicasts to reach a consensus among the group on whether to commit or abort. It uses a coordinator to send a request ("can you commit?") to every member of the group (reliably, retransmitting as often as needed until all replies are received). Phase 1 is complete when every member of the group (called cohorts responds. If the coordinator gets even a single abort response, it will have to tell all group members to abort the entire transaction. Otherwise, it can tell everybody to commit it. In phase 2, the coordinator sends the commit or abort order and waits for a response from everyone.

The write-ahead log is crucial to attain atomic multicasts (not for rollback, that's used for aborts!). For example, if a machine sent the coordinator a commit response for phase 1 and then died, it must be able to reboot and reconstruct the transaction state from the log; it cannot change its mind.

Three-phase commit protocol

The two-phase commit protocol is a blocking protocol. If either the coordinator or any cohort stops running, the entire protocol is stalled until the process is restarted. The three-phase commit protocol is similar to the two-phase commit protocol but allows all entities to time out.

In phase 1, the coordinator sends a request ("can you commit?") to every member of the group (reliably, retransmitting as often as needed until all replies are received). If any of the cohorts respond with a no or if any fail to respond within a defined time, then send an abort to every cohort.

In phase 2, the coordinator sends a pre-commit message to all cohorts and gets acknowledgements from everyone. When this message is received, a member knows that the unanimous decision was to commit. If a cohort fails to receive this message in time, then it aborts.

In phase 3, the coordinator sends a commit message to all cohorts telling them to commit. If a cohort fails to receive this message in time, it commits.

Paxos commit

The two-phase commit protocol requires all processes in the group to be available to complete the transaction. The three-phase commit introduces timeouts but cannot be proved to work correctly in all situations. If the two-phase or three-phase protocols are enhanced for fault tolerance to elect an alternate coordinator, one runs into problems where different cohorts may have received messages from different coordinators. The Paxos commit algorithm uses the Paxos distributed consensus protocol to design a system that provides a fault-tolerant infrastructure reach agreement on commit or abort decisions. Multiple machines run the Paxos algorithm. Each resource manager (a process that participates in a transaction) contacts its own instance of the Paxos algorithm on these machines to ensure that Paxos agrees on that resource manager's commit or abort decision. Paxos forwards that decision to multiple learner processes. The learners decide on the overall commit or abort based on whether all resource managers sent a commit message (each learner gets the same data and hence reaches the same decision; the multiple learners are there for fault tolerance). They then notify each of the resource managers with the commit/abort decision.

Brewer's CAP Theorem

Eric Brewer's CAP Theorem states that if you want consistency, availability, and partition tolerance, you have to settle for two out of three of these. Hence, if we want to build a highly available system that can withstand network partitions (occasional breaks in the ability of nodes to communicate), we have to give up on consistency and break the guarantees of ACID. An alternative to the requirements of ACID is BASE. BASE stands for Basic Availability, Soft-state, Eventual consistency. Instead of requiring consistency after every transaction, it is enough for a database to eventually be in a consistent state. The downside is that some processes may access stale data which has not yet been brought into a consistent state.

Distributed Deadlocks

A deadlock occurs when there is a circular dependency on processes holding and requesting resources. The four conditions that must hold are:

  1. mutual exclusion: A resource can be held by at most one process.
  2. hold and wait: Processes that already hold resources can wait for another resource.
  3. non-preemption: A resource, once granted, cannot be taken away.
  4. circular wait: Two or more processes are waiting for resources held by one of the other processes.

Three approaches can be used for managing deadlocks in distributed systems.

Detection

A centralized deadlock detection approach uses a central coordinator to manage a resource graph of processes and the resources they are using. Each time a process wants to grab or releases a resource, it sends a message to this coordinator (waiting-for or releasing). The coordinator builds a graph of all the processes and the resources they are holding or waiting for. If a cycle is detected, then the coordinator knows a deadlock exists. In some cases, if release and waiting-for messages are received out of order, they can lead the coordinator to believe that there is a deadlock cycle when none really exists. In reality the release message should have been processed first and would cause the deadlock to not happen. This condition is known as false deadlock.

The Chaudy, Misra, and Haas distributed deadlock detection algorithm has a process send a probe message to a process that is holding a resource prior to waiting for the resource. The receiving process forwards the probe to every process that contains resources it's waiting for. If the original process receives its own probe message then it knows that a dependency cycle, and hence deadlock, will exist if it waits for the resource it wants.

Prevention

Deadlock prevention approaches require processes to access resources in restricted ways to ensure that a deadlock cannot occur. The approach to implementing this is to make decisions based on the timestamps of each transaction competing for resources. The wait-die algorithm states that if a younger process is using the resource, then the older process (that wants the resource) waits. If an older process is holding a resource, then the younger process that wants the resource kills itself (that's ok; transactions are designed to be restartable). This forces the resource utilization graph to be directed from older to younger processes, making cycles impossible. The wound-wait algorithm ensures that the graph flows from young to old and cycles are again impossible. an old process will preempt (kill) the younger process that holds a resource. If a younger process wants a resource that an older one is using, then it waits until the old process is done.

Avoidance

This requires predicting the precise resources that will be needed, the times they will be needed, and which processes will need them and manage resource allocation or transaction scheduling to ensure that this will not happen. This is difficult (usually impossible) to predict and hence is not a practical approach.

Concurrency control

The goal of concurrency control is to allow multiple transactions to run concurrently but to ensure that data access is scheduled such that the net effect is the same as the transactions all ran in some serialized order. That is, we cannot have the net result be one where a transaction read an interim state of data from another transaction.

Exclusive locks, via a lock manager, are an easy way to accomplish this. However, to ensure serializability, it is important that a transaction does not acquire any new locks after it has released a lock. This is known as two-phase locking. The first phase is the growing phase, in which locks are acquired. The second phase is the shrinking phase, in which locks are released. The problem with two-phase locking is that, if a transaction that has released some locks aborts, there is a chance that other transactions have used the data held by the released lock. In that case, those transactions (and all transactions that depend on them) have to abort as well. This condition is known as cascading aborts. Strict two-phase locking avoids this problem by requiring all locks to be held until the end. The shrinking phase, in effect, is an atomic operation that occurs at the very end of the transaction.

Two-phase locking forces a lock to be held while the transaction is using the resource (or for the duration of the transaction). This restricts the maximum amount of concurrency that may be achieved in the system. Optimistic schemes assume that transactions are more likely to complete than not. As such, it is better to put more effort on rectifying errors from having used data from aborted transactions than to lock access to the data. One moderately optimistic scheme is two-version locking. In this case, one transaction writes tentative versions while other transactions read existing committed versions of the data. This allows increased concurrency but transactions that write data risk waiting or rejection when they attempt to commit and make their data permanent. A fully optimistic system uses no locks and checks for conflicts at commit time.

Timestamp ordering allows concurrency by keeping track of two timestamps per object: the timestamp of the last committed that read the object and the timestamp of the last committed transaction that wrote the object. If a transaction wants to write an object, it compares its own timestamp with the object’s write timestamp. If the object’s timestamp is older than the ordering is good and the transaction can proceed. Otherwise the transaction aborts and is restarted.

Distributed Lookup Services

The purpose of a distributed lookup service is to find the machine that has data that corresponds to a key that you have. For example, the key can be the name of the song and the machine is one that is hosting the MP3 file.

The most straightforward approach is to use a central server to store all the keys and their mapping to machines. You ask the server to look up a key and it gives you the machine containing the content. This is the classic Napster implementation.

Another approach is to "flood" the network with queries. What happens here is that your machine knows of a few peer machines. Those machines, in turn, know of other peer machines. When a machine needs to look up a key, it forwards the request to all of its peer nodes. If one of the peers has the content, then it responds. Otherwise, it forwards the request to its peers (and they forward the request to their peers ...). This approach was used by the Gnutella file sharing system.

Finally, the distributed hash table (DHT) approach is a set of solutions that is based on hashing the search key to a number that is then used to find the node responsible for items that hash to that number (or a range of numbers). A key difference between the DHT approach and the centralized or flooding approaches is that a specific node is responsible for holding information relating to the key: the one that is responsible for managing hash(key).

Distributed file systems

To provide the same system call interface for supporting different local file systems as well as remote files, operating systems generally rely on a layer of abstraction that allows different file system-specific interfaces to coexist underneath the common system calls. On most Unix-derived systems (e.g., Linux, bsd, OS X, SunOS), this is known as the vfs layer (Virtual File System).

There are a couple of models for implementing distributed file systems: the download/upload model or the remote procedure call model. In a stateful file system, the server maintains varying amounts of state about client access to files (e.g., whether a file is open, whether a file has been downloaded, cached blocks, modes of access). In a stateless file system, the server maintains no state about a client's access to files. The design of a file system will influence the access semantics to files. Sequential semantics are what we commonly expect to see in file systems, where reads return the results of previous writes. Session semantics occur when an application owns the file for the entire access session, writing the contents of the file only upon close, thereby making the updates visible to others after the file is closed, and overwriting any modifications made by others prior to that.

NFS

NFS was designed as a stateless, RPC-based model implementing commands such as read bytes, write bytes, link files, create a directory, and remove a file. Since the server does not maintain any state, there is no need for remote open or close procedures: these only establish state on the client. NFS works well in faulty environments: there's no state to restore if a client or server crashes. To improve performance, a client reads data a block (8 KB by default) at a time and performs read-ahead (fetching future blocks before they are needed). It suffers from ambiguous semantics because the server (or other clients) has no idea what blocks the client has cached and the client does not know whether its cached blocks are still valid. The system checks modification times if there are file operations to the server and otherwise invalidates the blocks after a few seconds. File locking could not be supported because of NFS's stateless design but was added through a separate lock manager that maintained the state of locks.

AFS

AFS was designed as an improvement over NFS to support file sharing on a massive scale. NFS suffered because clients would never cache data for a long time (not knowing if it would become obsolete) and had to frequently contact the server. AFS introduced the use of a partition on a client's disk to cache large amounts of data for a long time: whole file caching and long-term caching. It supports a file download-upload model. The entire file is downloaded on first access (whole file download) and uploaded back to the server after a close only if it was modified. Because of this behavior, AFS provides session semantics: the last one to close a modified file wins and other changes (earlier closes) are lost.

During file access, the client need never bother the server: it already has the file. When a client first downloads a file, the server makes a callback promise: it maintains a list of each client that has downloaded a copy of a certain file. Whenever it gets an update for that file, the server goes through the list and sends a callback to each client that may have a cached copy so that it can be invalidated on the client. The next time the client opens that file, it will download it from the server. Files under AFS are shared in units called volumes. A volume is just a directory (with its subdirectories and files) on a file server that is assigned a unique ID among the cell of machines (remember cells from DCE RPC?). If an administrator decides to move the volume to another server, the old server can issue a referral to the new server. This allows the client to remain unaware of resource movement.

CODA

Coda was built on top of AFS and focused on two things: supporting replicated servers and disconnected operation. To support replicated storage, AFS's concept of a volume was expanded into that of a Volume Storage Group (VSG). Before a client accesses a file, it first looks up the replicated volume ID of the file to get the list of servers containing replicated volumes and the respective local volume IDs. While it can read files from any available server, it first checks the versions from all of them to ensure that one or more servers don't have out-of-date files. If the client discovers that a server has an old version of a file, it initiates a resolution process by sending a message to that server. When a client closes a file, if there were any modifications, the changes are written out to all available replicated volumes.

If no servers are available for access, the client goes into disconnected operation mode. In this mode, no attempt is made to contact the server and any file updates are logged instead in a client modification log (CML). Upon connection, the client plays back the log to send updated files to the servers and receive invalidations. If conflicts arise (e.g., the file may have been modified on the server while the client was disconnected) user intervention may be required.

Because there's a chance that users may need to access files that are note yet in the local cache, Coda supports hoarding, which is a term for user-directed caching. It provides a user interface that allows a user to look over what is already in the cache and bring additional files into the cache if needed.

DFS (AFS version 3)

AFS evolved over the years. The most significant evolutionary version is version 3, which was adopted as the recommended distributed file system in the Distributed Computing Environment (DCE) where it is named DFS (Distributed File System). The primary design goal of this system was to avoid the unpredictable lost data problems of session semantics if multiple clients are modifying the same file. The concept of tokens was introduced. A token is permission given by the server to the client to perform certain operations on a file and cache a file's data. The system has four classes of tokens: open, data, status, and lock tokens. An open token must be obtained to have permission to open a file. A read data token must be obtained for a byte range of a file to have permission to access that part of the file. Similarly, a write data token is needed to write the file. Status tokens tell the client that it may be able to cache file attributes. These tokens give the server control over who is doing what to a file. Tokens are granted and revoked by the server. For example, if one client needs to write to a file then any outstanding read and write data tokens that were issued to any clients for that byte range get revoked: those clients are now denied access until they get new tokens.

SMB/CIFS

Microsoft's Server Message Block protocol was designed as a connection-oriented, stateful file system with a priority on file consistency and support of locking rather than client caching and performance. While it does not use remote procedure calls, its access principle is the same: requests (message blocks) are functional messages, providing file access commands such as open, create, rename, read, write, and close.

With the advent of Windows NT 4.0 and an increasing need to provide improved performance via caching, Microsoft introduced the concept of opportunistic locks (oplocks) into the operating system. This is a modified form of DFS's tokens. An oplock tells the client how it may cache data. At any time, a client's oplock may be revoked or changed by the server. A level 1 oplock tells the client that it has exclusive access to the file (nobody else is reading or writing it), so it can cache lock information, file attributes, and perform read-aheads and write-behinds. A level 2 oplock is granted if one or more clients are reading the file and one process is writing it. For example, a process that had a level 1 oplock will have it revoked and replaced with a level 2 oplock if another process opens the file for reading. In this case, read operations and file attributes may be cached but everything else is sent to the server. If two or more processes open a file for writing, then the level 2 oplock will be revoked and the client will have to perform all operations directly against the server.

WebDAV

WebDAV is a file access protocol built as an extension to the standard HTTP commands. Additional HTTP commands were added to copy and move resources, manage directory structures, and lock/unlock resources.

Many popular applications use WebDAV (Apple's iCal and iDisk, Microsoft Exchange and IIS) and it is supported as a network file system type by popular operating systems (Linux, OS X, Windows).

GmailFS

GmailFS is an example of writing a custom file system to take advantage of the free storage provided by Gmail (over 7 GB). Each message represents a file. Subject headers identify the file system name so the interface can find all relevant messages that constitute the "file system". They also contain file names and other metadata (symbolic link information, user ID, size, etc.). The actual file data resides in attachments to the message.

Cluster-based file systems

Google File System (GFS)

The Google File System is designed to support large data-intensive applications (the kind of algorithms Google uses for search and other services). The system is designed for an environment where there are many thousands of storage servers, some of which are expected to be down at any given point in time. The data managed comprises of billions of objects and many petabytes of data. It is not designed to be a general-purpose file system and is optimized for large files, reads, and atomic appends.

Each GFS cluster contains one master file server. This is a faster and more reliable machine that manages file system metadata, such as names and attributes of files and directories. None of the actual file data resides on this system. Instead, it contains a mapping of the file contents to the chunks that hold the data. Chunks are fixed-size, 64 MB blocks and are stored in chunkservers. The chunkservers are less reliable than the master and are replicated so that each chunk is stored on typically three three separate chunkservers.

Clients contact the master to look up files. The master gives them a chunkserver name and chunk handles for the files requested. To write to a file, the master grants a chunk lease to one of the replicated chunks. The chunkserver holding this is the primary replica chunkserver for that chunk. Writing is a two-stage process. First, the client writes to that replica. In turn, the replica forwards the data to another replica chunk server, and so on until all replicas receive the data. Once all replicas acknowledge receipt of the data, the second stage, writing, takes place. The client sends a write request to the primary, identifying the data that was sent. The primary assigns a sequence to the write operations that take place on the data and sends write requests to all the replicas so that they will write data in the same serial-number order. Note that the data flows from client to the primary replica to secondary replica to another secondary replica, etc. The control flow (write messages) goes from the primary directly to all of the secondaries.

Hadoop Distributed File System

Apache's Hadoop Distributed File System (HDFS) is incredibly similar to GFS and is designed with essentially the same goals. The key distinction is that it does not support modifying a file once it is created. This avoids the need to manage leases or locks for the file.

HDFS also uses a separate server that is responsible for keeping track of the filesystem namespace and the location of file data. Instead of a master, they call this the NameNode. Actual data is broken into 64 MB blocks and is stored on DataNodes (GFS's chunkservers). Each block (GFS chunk) is replicated on multiple DataNodes (the default is three, as with GFS).

Content delivery networks

A content delivery network (CDN) is a set of servers that are usually placed at various points throughout a wide area network to cache content and distribute it to users. By serving content that is replicated on a collection of servers, traffic from the main (master) server is reduced. Because some of the servers are likely to be closer to the requesting users, network latency is reduced. Because there are multiple servers, traffic is distributed among them. Because all of the servers are unlikely to be down at the same time, availability is increased. Hence, a CDN can provide highly increased scalability and availability for content.

We will focus on one CDN: Akamai. The company evolved from MIT research that was focused on "inventing a better way to deliver Internet content." A key issue was the flash crowd problem: what if your web site becomes really popular all of a sudden? Chances are, your servers and/or ISP will be saturated and a vast number of people will not be able to access your content. This is known as the slashdot effect.

There are several traditional approaches to making a site more scalable and available:

Local clustering
Multiple machines within a datacenter can be load-balanced. However, they all fail if the datacenter loses power or internet connectivity.
Multihoming
Machines can be connected with links to multiple networks served by multiple ISPs to guard against ISP failure. However, protocols that drive dynamic routing of IP packets (BGP) are often not quick enough to find new routes, resulting in service disruption.
Mirroring at multiple sites
The data can be served at multiple sites, with each machine's content synchronized with the others. However, synchronization can be difficult.

All these solutions require additional capital costs. You're building the capability to handle excess capacity and improved availability even if the traffic never comes and the faults never happen.

Akamai currently (2011) runs on 95,000 servers in 1,900 networks across 71 countries. It serves clients from nearby, available servers that are likely to have content. It does this by a mapping process, where it uses a set of dynamic DNS servers and encodes information about the content type and customer into the domain name for the content. All references to the content within a web page are replaced with these modified host names.

Akamai generates its own approximate map of overall IP network topology based on BGP information and traceroute from various points on the network. Content servers report their load to a monitoring application. The monitoring app publishes load reports to a local Akamai DNS server, which then determines which IP addresses to return when resolving names. When an Akamai DNS server gets a request to resolve a host name, it chooses the IP address to return based on:

  • service requested (e.g., QuickTime, HTML, Windows Media)
  • content requested (is server likely to have it? - based on hash)
  • server health
  • server load
  • user location
  • network status
  • load balancing

Akamai can also perform load shedding on specific content servers; if servers get too loaded, the DNS server will not respond with those addresses.

MapReduce

MapReduce is a framework for parallel computing. Programmers get a simple API and do not have to deal with issues of parallelization, remote execution, data distribution, load balancing, or fault tolerance. The framework makes it easy for one to use thousands of processors to process huge amounts of data (e.g., terabytes and petabytes). It is designed for problems that lend themselves to a map-reduce technique, which we will discuss. From a user's perspective, there are two basic operations in MapReduce: Map and Reduce.

The Map function reads a stream of data and parses it into intermediate (key, value) pairs. When that is complete, the Reduce function is called once for each unique key that was generated by Map and is given the key and a list of all values that were generated for that key as parameters. The keys are presented in sorted order.

The MapReduce client library creates a master process and many worker processes on a large set of machines. The master serves as a coordinator and is responsible for assigning roles to workers, keeping track of progress, and detecting failed workers. The set of input is broken up into chunks called shards. The master assigns map tasks to idle workers and gives each of them a unique shard to work on. The map worker parses the data and generates intermediate (key, value) results. All map workers work in parallel and there is no need for any one to communicate with another. The intermediate (key, value) data is partitioned based on the key according to a partitioning function that determines which of R reduce workers will work on that key and its associated data. The default function is simply a hash(key) mod R that distributes keys uniformly among R reduce workers but a user can replace it with a custom function. When all map workers informed the master that they are complete, the master contacts reduce workers. Each reduce worker contacts all the map worker nodes to get the (key, value) data that was targeted for them. The list of data is then sorted by key and the user reduce function gets called once for each unique key. The function is passed the key and the list of all values associated with that key. The reduce function writes its output to an output file, which the client can read once the MapReduce operation is complete.

A master periodically pings each worker for liveness. If no response is received within a time limit, then the master reschedules that worker's task onto another worker.

BigTable

BigTable is a distributed storage system developed at Google that is structured as a large table: one that may be petabytes in size and distributed among tens of thousands of machines. It is designed and used for storing items such as billions of URLs, with many versions per page; over 100 TB of satellite image data; hundreds of millions of users; and performing thousands of queries a second.

To the user, an instance of BigTable appears as a large spreadsheet: a table that is indexed by a row key, column name, and timestamp. If an application does not specify a timestamp, it will retrieve the latest version. Alternatively, it can specify a timestamp and get the latest version that is earlier than or equal to that timestamp. All operations on rows are atomic.

Columns in the table are organized into column families. Each column family has a unique name and contains within it a list of named columns. A BigTable instance will typically have a small number of column families (perhaps a few hundred at most) but each column family may have a huge number (perhaps millions) of columns within it. Timestamped versioning is configured on a per-column-family basis.

A BigTable table starts off as a single table. As rows are added, the table is kept sorted by row key. As the table grows, it may split along row boundaries into sub-tables, called tablets. Tablet servers are responsible for serving data for tablets.

BigTable comprises a client library (linked with the user's code), a master server that coordinates activity, and many tablet servers. Each tablet server is responsible for managing multiple tablets. Depending on the size of the tablets, it may manage from tens to thousands of tablets. Tablet servers can be added or removed dynamically. Tablet data is stored within GFS (Google File System) and tablet servers run on the same machines as GFS chunkservers. The master assigns tablets to tablet servers and balances tablet server load. It strives to run the tablet server on the same GFS chunkserver that holds data for that tablet. The master is also responsible for garbage collection of files in GFS and managing schema changes (table and column family creation).

Chubby

Chubby is a highly available and persistent distributed lock service that manages leases for resources and stores configuration information. The service runs as five active replicas, one of which is elected as the master to serve requests. A majority must be running for the service to work. Paxos is used to keep the replicas consistent. Chubby provides a namespace of files & directories. Each file or directory can be used as a lock.

In BigTable, Chubby is used to: ensure there is only one active master, store the bootstrap location of BigTable data, discover tablet servers, store BigTable schema information, and store access control lists.

Replication

BigTable can be configured for replication to multiple BigTable clusters in different data centers to ensure availability. Data propagation is asynchronous and results in an eventually consistent model.

Cryptography

Cryptography deals with encrypting plaintext using a cipher, also known as an encryption algorithm, to create ciphertext, which is unintelligible to anyone unless they can decrypt the message.

A restricted cipher is one where the workings of the cipher must be kept secret. There is no reliance on a key and the secrecy of the cipher is crucial to the value of the algorithm. This has obvious flaws (people in the know leaking the secret or coming up with a poor algorithm that can easily be reverse engineered). For any serious encryption, we use non-secret algorithms that rely on secret keys.

A symmetric encryption algorithm uses the same secret key for encryption and decryption. A public key algorithm uses a pair of keys: data encrypted with the first key can be decrypted only with the second key and vice versa. One of these keys is kept private (known only to the creator) and is known as the private key. The corresponding key is generally made visible to others and is known as the public key. Anything encrypted with the private key can only be decrypted with the public key. This is the basis for digital signatures. Anything that is encrypted with a public key can be encrypted only with the corresponding private key. This is the basis for authentication and covert communication.

A one-way function is one that can be computed relatively easily in one direction but there is no known way of computing the inverse function. One-way functions are crucial in a number of cryptographic algorithms, including digital signatures, Diffie-Hellman key exchange, and RSA public key cryptography. For Diffie-Hellman and RSA keys, they ensure that someone cannot generate the corresponding private key when presented with a public key. A particularly useful form of a one-way function is the hash function. This is a one-way function whose output is always a fixed number of bits for any input. For good cryptographic hash functions, it is highly unlikely that two messages will ever hash to the same value, it is extremely difficult to construct text that hashes to a specific value, and it is extremely difficult to modify the plaintext without changing its hash. The hash function is the basis for message authentication codes and digital signatures. Note that when we talk about cryptography and mention phrases such as "extremely difficult", we mean "impossible for all practical purposes," not that "you can do it if you spend an entire day working on the problem."

DES and 3DES

The Data Encryption Standard, DES, was standardized in 1976 and is a block cipher that encrypts 64-bit chunks of data at a time. It uses a 56-bit key and employs 16 iterations of substitutions followed by permutations.

The only serious weakness of DES is its key 56-bit key. By the 1990s it was possible to build machines that could iterate through all of the 256 permutations of keys within a few hours. Networked efforts could also test hundreds of billions of keys per second.

To prevent against such a brute force attack, DES would need a longer key. Each additional bit in a key doubles the time that it takes to test all keys because each each additional bit doubles the total number of combinations (e.g., a 4-bit key has 16 possible values; a 5-bit key has 32 possible values). Triple-DES (3DES) solves this problem by using the standard 56-bit DES algorithm three times:

C = EK3(DK2(EK1(P)))

If K1 = K3, then we have a "classic" Triple-DES mode that uses a 2*56 (112-bit) key. If K1 = K2 = K3, then the middle decryption undoes the first encryption and we have a standard 56-bit DES algorithm. Finally, if all three keys are different, then we have a 168-bit (3*56) key.

Although AES (Advanced Encryption Standard) is the most popular symmetric algorithm these days (due to being a U.S. government standard), 3DES still enjoys widespread deployment.

Secure communication

To communicate securely using a symmetric cipher, both parties need to have a shared secret key. Alice will encode a message to Bob using the key and Bob will use the same key to decode the message. If Alice wants to communicate with Charles, she and Charles will also need a secret key. The fact that every pair of entities will need a secret key leads to a phenomenon known as key explosion. Overall, in a system with n users, there will be O(n2) keys.

The biggest problem with symmetric cryptography is dealing with key distribution: how can Alice and Bob establish a key so they can communicate securely? The Diffie-Hellman exponential key exchange algorithm allows one to do this. Each party will generate a private key and a public key (these are not encryption keys; they're just numbers – Diffie-Hellman does not implement public key cryptography – it is unfortunate that the term was used to describe these numbers). It uses the one-way function abmod c in a way that allows Alice to compute a common key using her private key and Bob's public key. Bob can compute the same common key by using his private key and Alice's public key.

Using public key cryptography, such as RSA, if Alice encrypts a message with Bob's public key, Bob will be the only one who can decrypt it since doing so will require Bob's private key. Likewise, Bob can encrypt messages with Alice's public key, knowing that only Alice will be able to decrypt them with her private key.

Session keys

A session key is a random key that is generated for encrypting data for one communication session. It is useful because if the key is ever compromised, no lasting information is obtained: future communication sessions will use different keys. A hybrid cryptosystem uses public key cryptography to send a session key securely. The originator generates a random session key and encrypts it with the recipient's public key. The recipient decrypts the message with the corresponding private key to extract the session key. After that, symmetric cryptography is used for communication, with messages encrypted with the session key. This has the advantages of higher performance (public key cryptography is much, much slower than symmetric cryptography), ease of communicating with multiple parties (just encrypt the session key with the public keys of each of the recipients), and allows the bulk of data to be encrypted with session keys instead of the hardly-ever-changing public keys.

Digital signatures

With public key cryptography, a digital signature is simply the act of encrypting a hash of a message with the creator's private key. Anyone who has the message signer's public key can decrypt the hash and thus validate it against the message. Other parties cannot recreate the signature. Even though they can generate the same hash for the message, they do not have the signer's private key to encrypt that hash.

Authentication

The three A's of security are:

  1. Authentication: the process of binding an identity to the user. Note the distinction between authentication and identification. Identification is simply the process of asking you to identify yourself (for example, ask for a login name). Authentication is the process of proving that the identification is correct.
  2. Authorization: given an identity, making a decision on what access the user is permitted. Authentication is responsible for access control.
  3. Accounting: logging system activity so that any breaches can be identified (intrusion detection) or a post facto analysis can be performed.

A fourth item, not in the "standard list," is auditing: inspecting the software and system configuration for security flaws.

The three factors of authentication are: something you have (such as a key or a card), something you know (such as a password or PIN), and something you are (biometrics). Combining these into a multi-factor authentication scheme can increase security against the chance that any one of the factors is compromised.

Password Authentication Protocol (PAP)

The classic authentication method is the use of reusable passwords. This is known as the password authentication protocol, or PAP. The system asks you to identify yourself (login name) and then enter a password. If the password matches that which is associated with the login name on the system then you're authenticated.

One problem with the protocol is that if someone gets hold of the password file on the system, then they have all the passwords. The common way to thwart this is to store hashes of passwords instead of the passwords themselves. This is taking advantage of one-way functions. To authenticate a user, check if hash(password) = stored_hashed_password. If someone got hold of the password file, they're still stuck since they won't be able to reconstruct the original password from the hash. They'll have to resort to an exhaustive search or a dictionary attack and search for a password that hashes to the value in the file.

The other problem with reusable passwords is that if a network is insecure, an eavesdropper may sniff the password from the network. A potential intruder may also simply observe the user typing a password. To thwart this, we can turn to one-time passwords. If someone sees you type a password (or gets it from the network stream), it won't matter because that password will be useless for future logins.

S/Key Authentication

S/Key authentication allows the use of one-time passwords by generating a list via one-way functions. The list is created such that password n is generated as f(password[n-1]), where f is a one-way function. The list of passwords is used backwards. Given a password password[p], it is impossible for an observer to compute the next valid password because a one-way function f makes it improbably difficult to compute f-1(password[p]) to get the next valid password, password[p-1].

CHAP Authentication

The Challenge-Handshake Authentication Protocol (CHAP) is an authentication protocol that allows a server to authenticate a user without sending a password over the network. Both the client and server share a secret (such as a password). A server creates a random bunch of bits (called a nonce) and sends it to the client (user) that wants to authenticate. This is the challenge. The client identifies itself and sends a response that is the hash of the shared secret combined with the challenge. The server has the same data and can generate its own hash of the same challenge and secret. If the hash matches the one from the client, the server is convinced that the client knows the shared secret and is therefore legitimate. An intruder that sees this hash cannot extract the original data. An intruder that sees the challenge cannot create a suitable hashed response without knowing the secret. Note that this technique requires passwords to be accessible at the server and the security rests on the password file remaining secure.

SecurID®

RSA's SecureID is a two-factor authentication system that generates one-time passwords for response to a user login prompt. It relies on a user password (Personal ID Number, PIN) and a token device (an authenticator card or fob). The token generates a new number every 30 seconds. The number is a function of a seed that is unique for each card and the time of day. To authenticate to a server, you send a combination of your PIN and the number from the number from the token in lieu of a password. A legitimate remote system will have your PIN as well as the token seed and will be able to compute the same value to validate your password. An intruder would not know your PIN or the token’s seed and will never see it on the network.

Public key authentication

A nonce is a random bunch of bits that is generated on the fly and usually used to present to the other party as a challenge for them to prove that they are capable of encrypting something with a specific key that they possess. The use of a nonce is central to public key authentication. If I send you a nonce and you encrypt it with your private key and give me the results, I can decrypt that message using your public key. If the decryption matches the original nonce, this will convince me that only you could have encrypted the message.

Wide-mouth frog

The wide-mouth frog protocol shows how a trusted third party can be used with symmetric cryptography to perform authentication and key exchange to allow two parties to communicate securely. A trusted third party is a trusted entity that has everybody's keys. Suppose Bob wants to talk to Alice securely. He does not have shared secret key with her. Instead, he generates a random session key creates a request to talk with Alice, and encrypts the message containing both of the items with his own secret key. The only entities who have Bob's key and can decode this message are Bob and the trusted third party (Trent). Bob sends the message to Trent. Trent approves Bob's request to communicate with Alice and generates a message for Alice that is encrypted with Alice's secret key. This message contains the session key from Bob as well as Bob's identity information. Alice gets and decrypts this message and now has the session key and knows she will be talking with Bob. Trent is now out of the picture. The protocol can be enhanced by adding a timestamp into the encrypted message. This will guard against replay attacks — an intruder retransmitting an encrypted message from the past.

Kerberos authentication

Kerberos is a trusted third party authentication and key exchange protocol using symmetric cryptography. When you want to access a service, you first need to ask Kerberos. If access is granted, you get two messages. One is encrypted with your secret key and contains the session key for your communication with the service. The other message is encrypted with the service's secret key. You cannot read or decode this message. It is known as a ticket or sealed envelope. It contains the same session key that you received but is encrypted for the service. When the service decrypts it, it knows that the message must have been generated by an entity that had its secret key: Kerberos. Now that it has the session key, the service can communicate with you securely by encrypting all traffic with that key.

Digital certificates

While public keys simplify authentication (just decrypt this with my public key and you know that I was the only one who could have encrypted it), identity binding of the public key must be preserved for you to know that you really have my public key instead of someone else's. X.509 digital certificates provide a way to do this. A certificate is a data structure that contains user information and the user’s public key. This data structure also contains a signature of the certification authority. The signature is created by taking a hash of the rest of the data in the structure and encrypting it with the private key of the certification authority. The certification authority (CA) is responsible for setting policies of how they validate the identity of the person who presents the public key for encapsulation in a certificate.

Transport Layer Security (Secure Sockets Layer)

Secure Sockets Layer (SSL, also known as TLS — Transport Layer Security) is a layer of software designed to provide authentication and secure communication over the abstraction of a sockets interface. It makes it easy to add a secure transport onto insecure TCP socket based protocols (e.g., HTTP and FTP). SSL uses a hybrid cryptosystem and relies on public keys for authentication. If both the sender and receiver have X.509 digital certificates, SSL can validate them and use nonce-based public key authentication to validate that each party has the corresponding private key. In some cases, it may validate the server only. If the server does not have a certificate, SSL will then use a public key simply to allow a symmetric session key to be passed securely from client to server. The client generates a session key and encrypts it with the server's public key. This ensures that only the server will be able to decode the message and get the session key. After that, communication takes place using a symmetric algorithm and the client-generated session key.

Biometric authentication

Biometric authentication differs dramatically from other techniques in that it relies on statistical pattern recognition and will never yield a 100% true or false answer. Comparisons are based on thresholds. The ROC (receiver operator characteristic) plot defines the trade-off between false accepts and false rejects for a particular biometric system. A false accept is the condition when the system mistakenly validates the wrong user. A false reject is the condition when the system mistakenly rejects the right user. The authentication process comprises four steps: (1) sensing the user’s biometric feature; (2) extracting the features of interest, selecting features that are distinctive and repeatable; (3) matching the pattern against stored signals, searching for small distances between the matches; and (4) deciding if the match is close enough.

CAPTCHA

CAPTCHA (Completely Automated Public Turning test to tell Computers and Humans Apart) is not a technique to authenticate users but rather a technique to identify whether a system is interacting with a human being or with automated software. The idea behind it is that humans can recognize highly distorted characters far better than character recognition software can.

Distributed Authentication via OpenID

OpenID was created to solve the problem of alleviating a user from managing multiple identities (logins and passwords) at many different web sites. It provides a decentralized mechanism for single sign-on. OpenID is not an authentication protocol. Rather, it's a technique for a service to redirect authentication to a user's chosen OpenID authentication server and have that server be responsible for authentication.

A user logs in with a login name that contains a URI (Universal Resource Identifier) from which the name of the user's Identity Provider can be fetched. OpenID Identity Providers are responsible for authenticating users (typically with a login and password) and storing various user attributes.

When a user enters a login URI, the web site, known as a relying party issues a redirect to the user's browser, redirecting the user to the user's OpenID Identity Provider login page. The user authenticates with the Identity Provider, which then sends a redirect back to the originating web site. This redirect contains digitally-signed user credentials. The web site validates these either using a cached secret key that it has previously established with the Identity Provider or else it connects to the Identity Provider via HTTPS and requests authentication information directly and compares that with what it received via the browser redirect (this guards against a malicious attacker). If the site is convinced that the credentials are genuine then the user is considered to be logged in.

Through the Identity Provider, the user can control how many logins should be allowed (e.g. one, one per day, etc.) before the user has to intervene again. The user can also control what user identification attributes get sent to each site requesting user credentials.

Service Authorization via OAuth

In some ways, OAuth seems similar to OpenID in that they both rely on HTTP redirects and are both decentralized (multiple OpenID and OAuth providers can coexist). However, their purpose is fundamentally different. OpenID is provides single sign-on; allowing a user to use the same login and a third party (OpenID) authentication service for many web sites. OAuth, on the other hand, allows users to control what data one service can access from another service. For example, if you want a photo printing service to access photos from your flickr account, you don't want to provide it with your flickr username and password for unrestricted access. Instead, OAuth allows you to specify what access you allow a site to have (and for how long). Token credentials are used in place of the resource owner’s username and password for gaining access to the service. These token credentials include a token identifier and a secret.

There are three entities involved in OAuth:

  1. user
  2. consumer (client): this is the service that the user is accessing. For example, the moo.com photo printing servic
  3. provider (server): this is the service that the consumer needs to access. For example, flicker.com to get the photos

We'll use the moo.com photo printing and the flickr.com photo serving service as an example.

  1. Alice wants to order some prints and logs into moo.com. Moo knows about flickr and allows her to select select flickr as the source of her photos. When Moo built its service, its developers they obtained OAuth client credentials (client ID and secret) from flickr.
  2. When Alice selects flickr, moo.com contacts flickr.com for a request token, which is a set of temporary credentials. Upon receiving this, Alice is redirected to the flicker OAuth page with a directive to redirect her back to Moo when she's done.
  3. At the flicker OAuth page, she authenticates (using login/password or via OpenID, which may cause another level of redirection) and is presented with a description of what moo.com is requesting to do (e.g., access to download photos for the next 10 minutes). She can approve or reject the request.
  4. If she approves the request, the temporary credentials (request token) are tagged as "approved". Alice is redirected back to moo.com. The redirect contains an identifier for the approved credentials.
  5. Moo now contacts moo.com and exchanges the request token for an access token. Request tokens are used to obtain a user's approval. Access tokens are used to access resources on the provider (server). Moo now sends requests containing the access token to flickr to download the requested photos.

Firewalls

A firewall protects the junction between an untrusted network (e.g., external Internet) and a trusted network (e.g., internal network). Two approaches to firewalling are packet filtering and proxies. A packet filter, or screening router, determines not only the route of a packet but whether the packet should be dropped based on contents in the IP header, TCP/UDP header, and the interface on which the packet arrived. With stateless inspection, a packet is examined on its own. Stateful inspection allows the router to keep track of TCP connections and understand the relationship between packets. For example, a port that needs to be enabled for the FTP data channel once an FTP connection has been established or that return packets should be allowed in response to outbound requests. Deep packet inspection (DPI) allows a firewall to examine application data as well and make decisions based on its contents. Deep packet inspection can validate the protocol of an application as well as check for malicious content such as viruses or other security attacks.

Proxy services (also known as application proxies) live on bastion hosts. These are stripped down machines to give potential intruders as few tools as possible (few user accounts, no compilers, minimal commands). An application proxy is software that presents the same protocol to the outside network as the application for which it is a proxy (e.g., a mail server proxy will listen on port 25 and understand SMTP, the simple mail transfer protocol). The primary job of the proxy is to validate the application protocol. Valid requests are regenerated to the real application that is running on another server and is not accessible from the outside network. The bastion hosts on which proxies live are dual-homed hosts. This means the computer has two network cards. This is important since packets will not flow directly between the outside network to the inside network. The proxy is the only one who can communicate with the internal network. Unlike DPI, a proxy may modify the data stream, such as stripping headers or modifying machine names.

A typical firewalled environment is a screened subnet architecture, containing a separate subnet between the internal and externally-facing networks called the DMZ (demilitarized zone). The DMZ contains all the bastion hosts that may be offering services to the external network (usually the Internet). Machines on the internal network are not directly accessible from the Internet. Screening routers on both sides of the DMZ ensure that no packet from the outside network is permitted into the inside network.

All machines within an organization will be either in the DMZ or in the internal network. The exterior router will allow IP packets to only the machines/ports in the DMZ that are offering valid services. It would also reject any packets that are masqueraded to appear to come from the internal network. The interior router would allow only packets to come from designated machines in the DMZ that may need to access services in the internal network. Any packets not targeting the appropriate services in the internal network will be rejected. Both routers will generally allow traffic to flow from the internal network to the Internet, although an organization may block certain services (ports) or force users to use a proxy (for web access, for example).

VPNs

Virtual private networks (VPNs) allow disconnected local area networks to communicate securely over the public Internet, saving money by using a shared public network (Internet) instead of leased lines. This is achieved by tunneling, the encapsulation of an IP packet within another packet. In this case, a packet that is destined for a remote subnet (which will often have local source and destination IP addresses that are not routable over the public Internet) will be treated as payload and placed in IP packets that are routed over the public Internet. The source and destination addresses of this packet are the VPN endpoints at both sides, usually the routers. When the VPN endpoint (router) receives this encapsulated packet, it extracts the data, which is an IP packet, and routes it on the local network. This tunneling behavior gives us the virtual network part of the VPN.

To achieve security, an administrator setting up a VPN will usually be concerned that the data contents are not readable and the data has not been tampered with. To ensure this, the encapsulated packets can be encrypted and signed. Signing a packet ensures that the data has not been modified in transit. Encrypting ensures that intruders would not be able to make sense of the packets. VPNs usually provide several options for key management: shared private keys (AES or 3DES), Diffie-Hellman key exchange, or RSA public keys.

Fault Tolerance

There are three classes of faults: transient, intermittent, and permanent. Faults are either fail-silent (where the component does not produce results) or Byzantine (where the component produces faulty results). A popular approach to fault tolerance is redundancy, which may be information redundancy (e.g., error correcting codes), time redundancy (e.g., retransmission), or physical redundancy (redundant components: standby systems or triple-modular redundancy for active voting). A technique to deal with components that may exhibit Byzantine faults is Triple Modular Redundancy (TMR). This requires a three-way replication of the component with election logic to weed out a component that is producing faulty results (the two good components will overrule the faulty one). With software, we can use a consensus algorithm to keep replicated processes on multiple machines in sync as the process moves through different states of execution. Each state transition will generate a consesus-based update to all replicas.

The two-army problem demonstrates that reliable communication can never be achieved with faulty communication lines. The Byzantine generals problem illustrates the difficulty of achieving reliable communication in the presence of faulty processors exhibiting Byzantine faults.

Clusters

Clustering is the aggregation of multiple independent computers to provide increased reliability and/or performance. There are four distinct approaches to clustering: high-performance computing, batch processing, load balancing, and high availability.

Clusters will usually run software to identify cluster membership: which nodes are members of the cluster. Quorum keeps track of the number of nodes that have to be alive for the cluster to function. Typically, a majority is required. This provides a simple way of avoiding split-brain due to network partitioning where one group of machines cannot talk to another and two instances of the cluster are created. A cluster's quorum also identifies which machines (nodes) in the cluster perform which roles (e.g., standby, active, running specific applications).

High-Performance Computing (HPC)

High-performance clusters are generally custom efforts but there are a number of components that are common across many implementations. They are designed for traditional supercomputing applications that focus on a large amount of computation on large data sets. These applications are designed to be partitioned into multiple communicating processes. The Message Passing Interface (MPI) is a popular programming interface for sending and receiving messages that handles point-to-point and group communication and provides support for barrier-based synchronization. It is often used together with the Parallel Virtual Machine (PVM), a layer of software that provides an interface for creating tasks, managing global task IDs, and managing groups of tasks on arbitrary collections of processors. Beowulf and Rocks Cluster are examples of a popular HPC clusters based on Linux and other libraries. Microsoft offers an HPC Server 2008. There are many other HPC systems as well. The common thread among them all is that they provide a front-end server for scheduling jobs and monitoring processes and offer an MPI library for programming.

Batch Processing: Single-Queue Work Distribution

Batch processing is often used for applications such as renderfarms for computer animation, where a central coordinator (dispatcher) sends job requests to a collection of machines. When a system completes a job (e.g., “render frame #4,178”), the dispatcher will send it the next job (e.g., “render frame #12,724”). The dispatcher will have the ability to list jobs, delete jobs, dispatch jobs, and get notified when a job is complete. The worker nodes have no need to communicate with each other.

Load Balancing

Web-services load balancing is a somewhat trivial but very highly used technique for distributing the load of many network requests among a collection of machines. The simplest form is to have all requests go to a single machine that then returns an HTTP REDIRECT error. This is part of the HTTP protocol and will lead the client to re-issue the request to the machine specified by the REDIRECT error.

A more sophisticated software-only approach was implemented by IBM’s Interactive Network Dispatcher. As with the REDIRECT approach, all requests go to one machine. This machine forwards the entire stream of IP packets to one of several load-balanced back-end systems. All of these machines share the same IP address and the coordinator has special kernel extensions that allow it to forward TCP and UDP packets to the different machines by changing the MAC address of the packets.

Finally, and the most popular approach, one can turn to hardware and use a load-balancing router to map incoming requests to multiple back-end machines.

High Availability

High-availability clusters strive to provide a high level of system uptime by taking into account the fact that machines may fail. In such a case, applications running on those machines will be moved to other machines that are still up.

Low-level software to support high-availability clustering includes facilities for IP address takeover (allowing a machine to listen on multiple IP addresses) and to access shared disks. Mid-layer software includes distributed elections to pick a coordinator, propagation of status information, and figuring out which systems and applications are alive. Higher-layer software includes the ability to restart applications, let a user assign applications to machines, and let a user see what's going on in the system as a whole.

The key to detecting machine failure is the heartbeat network: exchanging messages between machines to ensure that they are alive and capable of responding. Since a main network may go down, one or more secondary networks (such as direct cable connections between two machines) are often used as dedicated heartbeat networks to distinguish failed machines from failed networks.

Failed applications may be restarted on a standby machine (active/passive configuration) or may be run on a machine that’s already running other services (active/active configuration). A cold failover is an application restart – the application is started from the beginning. A warm failover is one where the application is checkpointed periodically and is restarted from the last checkpoint. In a hot failover, a copy of the application is always running in lockstep synchrony with the main application on another machine. This is difficult and not always practical, especially where communications are involved. Cascading failover refers to the ability of an application to fail over even after it already has failed over in the past (a double restart). Multi-directional failover refers to the ability to restart applications from a failed system on multiple available systems instead of a specific targeted backup machine.

Clustered systems may share access to the same disk (e.g., via a a fibre channel switch or ethernet switch). This is a shared-disk configuration. Disk access in this form is raw, block-level access. All machines run a local file system. Having two machines access the same disk is not a good idea, of course, since that can put the filesystem in an incoherent state. Therefore, systems that do this must ensure that they have mutually exclusive access to key filesystem data structures such as inodes and free block maps as well as be responsible for flushing their caches when they don’t have access. They do this via a distributed lock manager (DLM). File systems designed to support this mode of operation are known as cluster file systems. A simpler solution is a shared-nothing architecture where there is no shared storage and a system forwards any requests for disk I/O to the system that owns the disk. In the case of failure, however, the access to such storage may be lost or may have to be gained by using a shared link that was simply not used when both systems were up. Alternatively, a shared-nothing architecture may use a network file system (such as NFS, SMB, AFP, etc.) to provide file-level rather than block-level file access.

Some clusters may be tightly coupled via system area network (SAN). This is a switched network to allow low-latency I/O between machines without incurring any processor intervention. It also avoids the overhead of TCP/IP by providing a reliable communication channel. Remote DMA (RDMA) allows data to be copied directly to the memory of another processor. SANs are often used for HPC clusters, with SAN/RDMA communication incorporated into the Message Passing Interface (MPI) library. Examples are Infiniband, Myrinet, and 10 Gbps ethernet.

Storage in a cluster is often separated from the actual machines via a storage area network (SAN). This is a network that is dedicated to disk I/O traffic between machines and the disks, which reside in separate storage arrays. The communications between machines and disks is typically SCSI over fibre channel or the iSCSI over ethernet. Because of this dissociation between machines and disks, the switch between the machines and the storage can be configured to allow specific hosts to read and write from/to specific disks.

To make storage highly available, RAID (redundant array of independent disks) is often employed. RAID 1 is mirroring. Anything that is written to one disk gets written to a secondary disk. If one fails then you still have the other. RAID 5 stripes the data across several disks and also adds in error correcting codes (e.g., parity in the case of allowing one to correct for a single failure) so that it data could be reconstructed from the available segments if one would die. RAID 0 is a technique used to improve performance by striping data across multiple disks. It does not improve fault tolerence.

Case Study: The Google Cluster Architecture

The Google Cluster architecture is built atop over 10,000 unreliable commodity PCs running fault-tolerant software. The goal of the system is energy efficiency and the best price for the realized performance rather than maximizing processor performance.

The Google search service contains several clusters that are distributed worldwide. Each of these clusters has several thousand machines. A user's query is directed to a specific cluster via the DNS lookup of google.com. The DNS load-balancing system takes the user's round-trip time and system capacity into account to provide the address of a suitable cluster. Once the request gets to the cluster, a hardware load balancer forwards the request to one of the web servers in the cluster.

The fundamental approach to exploiting distribution is to transform a query on a very large database into a much of queries on smaller databases, followed by a merge of the results.

The web server performs a query against several index servers. Since the index is many terabytes in size, it is divided into pieces (shards), each holding a subset of the documents from the full index. Each of these index servers is replicated, with a load balancer assigning query requests. If any of the replica servers goes down then performance is degraded proportionally. The index server returns a list of document IDs to the web server that made the query.

The next task for the web server is to take the result from the index queries and consult document servers that get the title, URL, and the in-context snippet from the document. As with index lookup, the documents are also randomly distributed among multiple document servers and each server has multiple replicas that are load balanced.

Most index and document lookup operations are read-only and updates to the data are infrequent. Replicas can be taken off-line during an update, so consistency problems are not an issue.

Since individual shards (pieces of the index or pieces of the document database) don't need to communicate with each other, performance can scale almost linearly with an increase in the number of machines. Because the performance scales so well, the emphasis on optimizing cost favors using commodity components instead of, say, the very fastest processors, whose cost is disproportionately higher as well as using inexpensive and somewhat slower disks.

Amazon Dynamo

Dynamo is a highly available key-value store created internally within Amazon. Many services within Amazon, such as best seller lists, shopping carts, user preferences, session management, etc., neeed only primary key access to data. A multi-table relational database system would be overkill and would limit scalability and availability. With Dynamo, an application can configure its instance of Dynamo for desired availability (# replicas) and consistency.

Dynamo supports two operations: get(key) and put(key, data, context). The data is an arbitrary bunch of bytes (typically less than 1 MB) and is always identified by a unique key. The (key, data) values are distributed over a large set of servers and replicated for avaailability. The context is a value that is returned with a get operation and then sent back with a put operation. It is meaningless to the user program and is used like a cookie. Internally, it contains versioning information. The system has been described as a zero-hop distributed hash table (DHT).

The system is decentralized: there is no coordinator overseeing operations. Every node has equivalent responsibilities and the system can grow dynamcially by adding new nodes.

Partitioning: scalability

Since massive scalability is a key design aspect of Dynamo, data storage is distributed among many nodes. Dynamo relies on consistent hashing to identify which node will be responsible for a specific key. With normal hashing, a change in the number of slots in the table (e.g., number of nodes since we are hashing to find a node) will result in having to remap all keys. With consistent hashing, this is not the case and only K/n keys need to be remapped, where K is the number of keys and n is the number of slots (nodes).

Nodes are arranged in a logical ring. Each node is assigned a number from the possible set of hash outputs. It is responsible for all keys that hash to any number between its number and its predecessor's number. When a key is hashed, a node will traverse a logical ring of node values to find the first node with a position greater than that hashed result. This tells it which node is responsible for handling that key. Since nodes are not added that frequently, every node can keep a cached copy of this ring (the size is just N hash values, where N is the number of nodes). Moreover, to avoid the extra hop resulting because a client sent a request to the wrong Dynamo node, a client may also get a copy of this ring.

Adding or removing nodes affects only the node's immediate neighbors in the ring. If a node is added, the neighbors on either side will give up some range of values to the new node. In reality, a node is assigned to manage multiple points in the ring (multiple ranges of hash values). These nodes that we just discussed are really virtual nodes. A real machine, or physical node, will be responsible for managing multiple virtual nodes. The advantage of doing this is that we can attain balanced load distribution. If a machine becomes unavailable, the load is evenly distributed among the available nodes. If a new machine is added, each virtual node within it will take a set of values from other virtual nodes, effectively taking on a small amount of load from a set of other machines instead of from just two neighboring machines. Moreover, if a more powerful machine is added, it can be assigned a larger number of virtual nodes so that it bears a greater percentage of the query workload.

Replication: availability

Dynamo is designed as an always writable data store. Data can be replicated on N hosts, where the value N is configurable per instance of Dynamo. The node that is responsible for the hashed key is assigned the role of coordinator for that key and is in charge of replication for that key. The coordinator copies (key,value) data onto N-1 successive nodes clockwise in the ring. Copying is asynchronous (in the background) and Dynamo provides an eventually consistent model. This replication technique is called optimistic replication, which means that replicas are not guaranteed to be identical at all times. If a client cannot contact the coordinator, it sends the request to a node holding a replica, which will then periodically try to contact the coordinator.

Because data is replicated and the system does not provide ACID guarantees, it is possible that replicas may end up with different versions. Dynamo favors application-based conflict resolution since the application is aware of the meaning of the data associated with the key and can act upon it intelligently. For example, the application may merge two versions of a shopping cart. If the application does not want to bother with this, the system falls back on a last write wins strategy. Dynamo uses vector clocks to identify versions of stored data and to distinguish between versions that have been modified concurrently from versions that are causally related (a later one is just an update of an earlier one). For example, one can identify whether there are two versions of a shopping cart that have been updated independently or whether one version is just a newer modification of an older shopping card. A vector clock value is a set of (node, counter) pairs. This set of values constitutes the version of the data. It is returned as a "context" value with a get operation and updated and stored when that context is passed as a parameter to a put operation.

Because Dynamo is completely decentralized and there is no coordinator (unlike BigTable), each node serves three functions:

  1. Managing get and put requests. A node may act as a coordinator responsible for a particular key or may forward the request to the node that has that responsibility.
  2. Keeping track of membership and detecting failures. A node will detect if other nodes are out of service and keep track of the list of nodes in the system and their associated hash ranges.
  3. Local persistant storage. Each node is responsible for being either the primary or replica store for keys that hash to a certain range of values. These (key,value) pairs are stored within that node using a variety of mechanisms depending on application needes. These include the Berkeley Database Transactional Data Store, MySQL (for large objects), or an in-memory buffer with persistent backing store (for highest performance).

Distributed shared memory

Distributed shared memory (DSM) is a mechanism to allow processes on different machines to have the illusion of sharing memory (i.e., being able read data from memory that another process wrote to memory). Transparency is generally achieved through the system's memory management unit (MMU). When a page fault occurs, the page fault handler invokes the DSM algorithm, which can fetch the page from another server, map it into the process' address space and restart the instruction. A directory is the server, or set of servers that allows a node to find out where the needed page lives (and possibly where cached copies reside).

A well-defined memory consistency models is key to DSM is the algorithms must enforce the model. Sequential consistency is what we generally expect from a memory system. A read always returns the result of the last write for a given instruction stream. For multiple streams (concurrent processes), any interleaving is acceptable as long as accesses within each stream are in sequential order. To achieve sequential consistency, each write operation must invalidate or update all cached copies before the write completes. Since writes have to be visible in the same order by all processes, a read cannot take place until the write is acknowledged.

Presenting sequential consistency is an expensive proposition, since considerable network activity has to take place for each write operation. To enhance performance, weaker consistency models can be employed.

A weak (or relaxed) consistency model requires that memory be consistent only at specific synchronization events. For example, a sync variable is an operation that will force all local operations to be propagated out and remote operations on memory to be brought in. This way, memory is made consistent explicitly on a sync rather than for each operation. Release consistency allows us to break up the operations of sending out changes and receiving updates into two parts: an acquire phase and a release phase. An acquire indicates that the processor is about to perform operations on the memory and needs to get any updates from other processors. A release indicates that the processor is done with its operations on the memory. Any of the processor's writes now have to be made visible to a processor doing a subsequent acquire. There are two variants of release consistency: eager and lazy. The eager protocol ensures that all copies of pages on other nodes are updated with the contents of the newly-modified pages when a processor performs a release operation. In fact, page invalidations may be sent during program execution and the release operation itself is a blocking one that ensures that all invalidations have been acknowledged. The lazy protocol does not bother to update or invalidate existing copies of the pages upon a release (on the assumption that those nodes with cached copies of the page might never even access the page again). Instead, page invalidations are propagated at acquire time.

Finally, entry consistency allows the acquire and release operations to take place on regions of memory (such as individual variables or data structures) rather than all of shared memory. To achieve entry consistency, one must employ smart compilers or control the consistency explicitly since the system's MMU cannot protect these boundaries.