Exam 3 study guide
The one-hour study guide for exam 3
December 2, 2011
Disclaimer: This study guide attempts to touch upon the most important topics that may be covered on the exam but does not claim to necessarily cover everything that one needs to know for the exam. Finally, don't take the one hour time window in the title literally.
MapReduce
MapReduce is a framework for parallel computing. Programmers get a simple API and do not have to deal with issues of parallelization, remote execution, data distribution, load balancing, or fault tolerance. The framework makes it easy for one to use thousands of processors to process huge amounts of data (e.g., terabytes and petabytes). It is designed for problems that lend themselves to a map-reduce technique, which we will discuss. From a user's perspective, there are two basic operations in MapReduce: Map and Reduce.
The Map function reads a stream of data and parses it into intermediate (key, value) pairs. When that is complete, the Reduce function is called once for each unique key that was generated by Map and is given the key and a list of all values that were generated for that key as parameters. The keys are presented in sorted order.
The MapReduce client library creates a master process and many worker processes on a large set of machines. The master serves as a coordinator and is responsible for assigning roles to workers, keeping track of progress, and detecting failed workers. The set of input is broken up into chunks called shards. The master assigns map tasks to idle workers and gives each of them a unique shard to work on. The map worker parses the data and generates intermediate (key, value) results. All map workers work in parallel and there is no need for any one to communicate with another. The intermediate (key, value) data is partitioned based on the key according to a partitioning function that determines which of R reduce workers will work on that key and its associated data. The default function is simply a hash(key) mod R that distributes keys uniformly among R reduce workers but a user can replace it with a custom function. When all map workers informed the master that they are complete, the master contacts reduce workers. Each reduce worker contacts all the map worker nodes to get the (key, value) data that was targeted for them. The list of data is then sorted by key and the user reduce function gets called once for each unique key. The function is passed the key and the list of all values associated with that key. The reduce function writes its output to an output file, which the client can read once the MapReduce operation is complete.
A master periodically pings each worker for liveness. If no response is received within a time limit, then the master reschedules that worker's task onto another worker.
BigTable
BigTable is a distributed storage system developed at Google that is structured as a large table: one that may be petabytes in size and distributed among tens of thousands of machines. It is designed and used for storing items such as billions of URLs, with many versions per page; over 100 TB of satellite image data; hundreds of millions of users; and performing thousands of queries a second.
To the user, an instance of BigTable appears as a large spreadsheet: a table that is indexed by a row key, column name, and timestamp. If an application does not specify a timestamp, it will retrieve the latest version. Alternatively, it can specify a timestamp and get the latest version that is earlier than or equal to that timestamp. All operations on rows are atomic.
Columns in the table are organized into column families. Each column family has a unique name and contains within it a list of named columns. A BigTable instance will typically have a small number of column families (perhaps a few hundred at most) but each column family may have a huge number (perhaps millions) of columns within it. Timestamped versioning is configured on a per-column-family basis.
A BigTable table starts off as a single table. As rows are added, the table is kept sorted by row key. As the table grows, it may split along row boundaries into sub-tables, called tablets. Tablet servers are responsible for serving data for tablets.
BigTable comprises a client library (linked with the user's code), a master server that coordinates activity, and many tablet servers. Each tablet server is responsible for managing multiple tablets. Depending on the size of the tablets, it may manage from tens to thousands of tablets. Tablet servers can be added or removed dynamically. Tablet data is stored within GFS (Google File System) and tablet servers run on the same machines as GFS chunkservers. The master assigns tablets to tablet servers and balances tablet server load. It strives to run the tablet server on the same GFS chunkserver that holds data for that tablet. The master is also responsible for garbage collection of files in GFS and managing schema changes (table and column family creation).
Chubby
Chubby is a highly available and persistent distributed lock service that manages leases for resources and stores configuration information. The service runs as five active replicas, one of which is elected as the master to serve requests. A majority must be running for the service to work. Paxos is used to keep the replicas consistent. Chubby provides a namespace of files & directories. Each file or directory can be used as a lock.
In BigTable, Chubby is used to: ensure there is only one active master, store the bootstrap location of BigTable data, discover tablet servers, store BigTable schema information, and store access control lists.
Replication
BigTable can be configured for replication to multiple BigTable clusters in different data centers to ensure availability. Data propagation is asynchronous and results in an eventually consistent model.
Cryptography
Cryptography deals with encrypting plaintext using a cipher, also known as an encryption algorithm, to create ciphertext, which is unintelligible to anyone unless they can decrypt the message.
A restricted cipher is one where the workings of the cipher must be kept secret. There is no reliance on a key and the secrecy of the cipher is crucial to the value of the algorithm. This has obvious flaws (people in the know leaking the secret or coming up with a poor algorithm that can easily be reverse engineered). For any serious encryption, we use non-secret algorithms that rely on secret keys.
A symmetric encryption algorithm uses the same secret key for encryption and decryption. A public key algorithm uses a pair of keys: data encrypted with the first key can be decrypted only with the second key and vice versa. One of these keys is kept private (known only to the creator) and is known as the private key. The corresponding key is generally made visible to others and is known as the public key. Anything encrypted with the private key can only be decrypted with the public key. This is the basis for digital signatures. Anything that is encrypted with a public key can be encrypted only with the corresponding private key. This is the basis for authentication and covert communication.
A one-way function is one that can be computed relatively easily in one direction but there is no known way of computing the inverse function. One-way functions are crucial in a number of cryptographic algorithms, including digital signatures, Diffie-Hellman key exchange, and RSA public key cryptography. For Diffie-Hellman and RSA keys, they ensure that someone cannot generate the corresponding private key when presented with a public key. A particularly useful form of a one-way function is the hash function. This is a one-way function whose output is always a fixed number of bits for any input. For good cryptographic hash functions, it is highly unlikely that two messages will ever hash to the same value, it is extremely difficult to construct text that hashes to a specific value, and it is extremely difficult to modify the plaintext without changing its hash. The hash function is the basis for message authentication codes and digital signatures. Note that when we talk about cryptography and mention phrases such as "extremely difficult", we mean "impossible for all practical purposes," not that "you can do it if you spend an entire day working on the problem."
DES and 3DES
The Data Encryption Standard, DES, was standardized in 1976 and is a block cipher that encrypts 64-bit chunks of data at a time. It uses a 56-bit key and employs 16 iterations of substitutions followed by permutations.
The only serious weakness of DES is its key 56-bit key. By the 1990s it was possible to build machines that could iterate through all of the 256 permutations of keys within a few hours. Networked efforts could also test hundreds of billions of keys per second.
To prevent against such a brute force attack, DES would need a longer key. Each additional bit in a key doubles the time that it takes to test all keys because each each additional bit doubles the total number of combinations (e.g., a 4-bit key has 16 possible values; a 5-bit key has 32 possible values). Triple-DES (3DES) solves this problem by using the standard 56-bit DES algorithm three times:
C = EK3(DK2(EK1(P)))
If K1 = K3, then we have a "classic" Triple-DES mode that uses a 2*56 (112-bit) key. If K1 = K2 = K3, then the middle decryption undoes the first encryption and we have a standard 56-bit DES algorithm. Finally, if all three keys are different, then we have a 168-bit (3*56) key.
Although AES (Advanced Encryption Standard) is the most popular symmetric algorithm these days (due to being a U.S. government standard), 3DES still enjoys widespread deployment.
Secure communication
To communicate securely using a symmetric cipher, both parties need to have a shared secret key. Alice will encode a message to Bob using the key and Bob will use the same key to decode the message. If Alice wants to communicate with Charles, she and Charles will also need a secret key. The fact that every pair of entities will need a secret key leads to a phenomenon known as key explosion. Overall, in a system with n users, there will be O(n2) keys.
The biggest problem with symmetric cryptography is dealing with key distribution: how can Alice and Bob establish a key so they can communicate securely? The Diffie-Hellman exponential key exchange algorithm allows one to do this. Each party will generate a private key and a public key (these are not encryption keys; they're just numbers – Diffie-Hellman does not implement public key cryptography – it is unfortunate that the term was used to describe these numbers). It uses the one-way function abmod c in a way that allows Alice to compute a common key using her private key and Bob's public key. Bob can compute the same common key by using his private key and Alice's public key.
Using public key cryptography, such as RSA, if Alice encrypts a message with Bob's public key, Bob will be the only one who can decrypt it since doing so will require Bob's private key. Likewise, Bob can encrypt messages with Alice's public key, knowing that only Alice will be able to decrypt them with her private key.
Session keys
A session key is a random key that is generated for encrypting data for one communication session. It is useful because if the key is ever compromised, no lasting information is obtained: future communication sessions will use different keys. A hybrid cryptosystem uses public key cryptography to send a session key securely. The originator generates a random session key and encrypts it with the recipient's public key. The recipient decrypts the message with the corresponding private key to extract the session key. After that, symmetric cryptography is used for communication, with messages encrypted with the session key. This has the advantages of higher performance (public key cryptography is much, much slower than symmetric cryptography), ease of communicating with multiple parties (just encrypt the session key with the public keys of each of the recipients), and allows the bulk of data to be encrypted with session keys instead of the hardly-ever-changing public keys.
Digital signatures
With public key cryptography, a digital signature is simply the act of encrypting a hash of a message with the creator's private key. Anyone who has the message signer's public key can decrypt the hash and thus validate it against the message. Other parties cannot recreate the signature. Even though they can generate the same hash for the message, they do not have the signer's private key to encrypt that hash.
Authentication
The three A's of security are:
- Authentication: the process of binding an identity to the user. Note the distinction between authentication and identification. Identification is simply the process of asking you to identify yourself (for example, ask for a login name). Authentication is the process of proving that the identification is correct.
- Authorization: given an identity, making a decision on what access the user is permitted. Authentication is responsible for access control.
- Accounting: logging system activity so that any breaches can be identified (intrusion detection) or a post facto analysis can be performed.
A fourth item, not in the "standard list," is auditing: inspecting the software and system configuration for security flaws.
The three factors of authentication are: something you have (such as a key or a card), something you know (such as a password or PIN), and something you are (biometrics). Combining these into a multi-factor authentication scheme can increase security against the chance that any one of the factors is compromised.
Password Authentication Protocol (PAP)
The classic authentication method is the use of reusable passwords. This is known as the password authentication protocol, or PAP. The system asks you to identify yourself (login name) and then enter a password. If the password matches that which is associated with the login name on the system then you're authenticated.
One problem with the protocol is that if someone gets hold of the password file on the system, then they have all the passwords. The common way to thwart this is to store hashes of passwords instead of the passwords themselves. This is taking advantage of one-way functions. To authenticate a user, check if hash(password) = stored_hashed_password. If someone got hold of the password file, they're still stuck since they won't be able to reconstruct the original password from the hash. They'll have to resort to an exhaustive search or a dictionary attack and search for a password that hashes to the value in the file.
The other problem with reusable passwords is that if a network is insecure, an eavesdropper may sniff the password from the network. A potential intruder may also simply observe the user typing a password. To thwart this, we can turn to one-time passwords. If someone sees you type a password (or gets it from the network stream), it won't matter because that password will be useless for future logins.
S/Key Authentication
S/Key authentication allows the use of one-time passwords by generating a list via one-way functions. The list is created such that password n is generated as f(password[n-1]), where f is a one-way function. The list of passwords is used backwards. Given a password password[p], it is impossible for an observer to compute the next valid password because a one-way function f makes it improbably difficult to compute f-1(password[p]) to get the next valid password, password[p-1].
CHAP Authentication
The Challenge-Handshake Authentication Protocol (CHAP) is an authentication protocol that allows a server to authenticate a user without sending a password over the network. Both the client and server share a secret (such as a password). A server creates a random bunch of bits (called a nonce) and sends it to the client (user) that wants to authenticate. This is the challenge. The client identifies itself and sends a response that is the hash of the shared secret combined with the challenge. The server has the same data and can generate its own hash of the same challenge and secret. If the hash matches the one from the client, the server is convinced that the client knows the shared secret and is therefore legitimate. An intruder that sees this hash cannot extract the original data. An intruder that sees the challenge cannot create a suitable hashed response without knowing the secret. Note that this technique requires passwords to be accessible at the server and the security rests on the password file remaining secure.
SecurID®
RSA's SecureID is a two-factor authentication system that generates one-time passwords for response to a user login prompt. It relies on a user password (Personal ID Number, PIN) and a token device (an authenticator card or fob). The token generates a new number every 30 seconds. The number is a function of a seed that is unique for each card and the time of day. To authenticate to a server, you send a combination of your PIN and the number from the number from the token in lieu of a password. A legitimate remote system will have your PIN as well as the token seed and will be able to compute the same value to validate your password. An intruder would not know your PIN or the token’s seed and will never see it on the network.
Public key authentication
A nonce is a random bunch of bits that is generated on the fly and usually used to present to the other party as a challenge for them to prove that they are capable of encrypting something with a specific key that they possess. The use of a nonce is central to public key authentication. If I send you a nonce and you encrypt it with your private key and give me the results, I can decrypt that message using your public key. If the decryption matches the original nonce, this will convince me that only you could have encrypted the message.
Wide-mouth frog
The wide-mouth frog protocol shows how a trusted third party can be used with symmetric cryptography to perform authentication and key exchange to allow two parties to communicate securely. A trusted third party is a trusted entity that has everybody's keys. Suppose Bob wants to talk to Alice securely. He does not have shared secret key with her. Instead, he generates a random session key creates a request to talk with Alice, and encrypts the message containing both of the items with his own secret key. The only entities who have Bob's key and can decode this message are Bob and the trusted third party (Trent). Bob sends the message to Trent. Trent approves Bob's request to communicate with Alice and generates a message for Alice that is encrypted with Alice's secret key. This message contains the session key from Bob as well as Bob's identity information. Alice gets and decrypts this message and now has the session key and knows she will be talking with Bob. Trent is now out of the picture. The protocol can be enhanced by adding a timestamp into the encrypted message. This will guard against replay attacks — an intruder retransmitting an encrypted message from the past.
Kerberos authentication
Kerberos is a trusted third party authentication and key exchange protocol using symmetric cryptography. When you want to access a service, you first need to ask Kerberos. If access is granted, you get two messages. One is encrypted with your secret key and contains the session key for your communication with the service. The other message is encrypted with the service's secret key. You cannot read or decode this message. It is known as a ticket or sealed envelope. It contains the same session key that you received but is encrypted for the service. When the service decrypts it, it knows that the message must have been generated by an entity that had its secret key: Kerberos. Now that it has the session key, the service can communicate with you securely by encrypting all traffic with that key.
Digital certificates
While public keys simplify authentication (just decrypt this with my public key and you know that I was the only one who could have encrypted it), identity binding of the public key must be preserved for you to know that you really have my public key instead of someone else's. X.509 digital certificates provide a way to do this. A certificate is a data structure that contains user information and the user’s public key. This data structure also contains a signature of the certification authority. The signature is created by taking a hash of the rest of the data in the structure and encrypting it with the private key of the certification authority. The certification authority (CA) is responsible for setting policies of how they validate the identity of the person who presents the public key for encapsulation in a certificate.
Transport Layer Security (Secure Sockets Layer)
Secure Sockets Layer (SSL, also known as TLS — Transport Layer Security) is a layer of software designed to provide authentication and secure communication over the abstraction of a sockets interface. It makes it easy to add a secure transport onto insecure TCP socket based protocols (e.g., HTTP and FTP). SSL uses a hybrid cryptosystem and relies on public keys for authentication. If both the sender and receiver have X.509 digital certificates, SSL can validate them and use nonce-based public key authentication to validate that each party has the corresponding private key. In some cases, it may validate the server only. If the server does not have a certificate, SSL will then use a public key simply to allow a symmetric session key to be passed securely from client to server. The client generates a session key and encrypts it with the server's public key. This ensures that only the server will be able to decode the message and get the session key. After that, communication takes place using a symmetric algorithm and the client-generated session key.
Biometric authentication
Biometric authentication differs dramatically from other techniques in that it relies on statistical pattern recognition and will never yield a 100% true or false answer. Comparisons are based on thresholds. The ROC (receiver operator characteristic) plot defines the trade-off between false accepts and false rejects for a particular biometric system. A false accept is the condition when the system mistakenly validates the wrong user. A false reject is the condition when the system mistakenly rejects the right user. The authentication process comprises four steps: (1) sensing the user’s biometric feature; (2) extracting the features of interest, selecting features that are distinctive and repeatable; (3) matching the pattern against stored signals, searching for small distances between the matches; and (4) deciding if the match is close enough.
CAPTCHA
CAPTCHA (Completely Automated Public Turning test to tell Computers and Humans Apart) is not a technique to authenticate users but rather a technique to identify whether a system is interacting with a human being or with automated software. The idea behind it is that humans can recognize highly distorted characters far better than character recognition software can.
Distributed Authentication via OpenID
OpenID was created to solve the problem of alleviating a user from managing multiple identities (logins and passwords) at many different web sites. It provides a decentralized mechanism for single sign-on. OpenID is not an authentication protocol. Rather, it's a technique for a service to redirect authentication to a user's chosen OpenID authentication server and have that server be responsible for authentication.
A user logs in with a login name that contains a URI (Universal Resource Identifier) from which the name of the user's Identity Provider can be fetched. OpenID Identity Providers are responsible for authenticating users (typically with a login and password) and storing various user attributes.
When a user enters a login URI, the web site, known as a relying party issues a redirect to the user's browser, redirecting the user to the user's OpenID Identity Provider login page. The user authenticates with the Identity Provider, which then sends a redirect back to the originating web site. This redirect contains digitally-signed user credentials. The web site validates these either using a cached secret key that it has previously established with the Identity Provider or else it connects to the Identity Provider via HTTPS and requests authentication information directly and compares that with what it received via the browser redirect (this guards against a malicious attacker). If the site is convinced that the credentials are genuine then the user is considered to be logged in.
Through the Identity Provider, the user can control how many logins should be allowed (e.g. one, one per day, etc.) before the user has to intervene again. The user can also control what user identification attributes get sent to each site requesting user credentials.
Service Authorization via OAuth
In some ways, OAuth seems similar to OpenID in that they both rely on HTTP redirects and are both decentralized (multiple OpenID and OAuth providers can coexist). However, their purpose is fundamentally different. OpenID is provides single sign-on; allowing a user to use the same login and a third party (OpenID) authentication service for many web sites. OAuth, on the other hand, allows users to control what data one service can access from another service. For example, if you want a photo printing service to access photos from your flickr account, you don't want to provide it with your flickr username and password for unrestricted access. Instead, OAuth allows you to specify what access you allow a site to have (and for how long). Token credentials are used in place of the resource owner’s username and password for gaining access to the service. These token credentials include a token identifier and a secret.
There are three entities involved in OAuth:
- user
- consumer (client): this is the service that the user is accessing. For example, the moo.com photo printing servic
- provider (server): this is the service that the consumer needs to access. For example, flicker.com to get the photos
We'll use the moo.com photo printing and the flickr.com photo serving service as an example.
- Alice wants to order some prints and logs into moo.com. Moo knows about flickr and allows her to select select flickr as the source of her photos. When Moo built its service, its developers they obtained OAuth client credentials (client ID and secret) from flickr.
- When Alice selects flickr, moo.com contacts flickr.com for a request token, which is a set of temporary credentials. Upon receiving this, Alice is redirected to the flicker OAuth page with a directive to redirect her back to Moo when she's done.
- At the flicker OAuth page, she authenticates (using login/password or via OpenID, which may cause another level of redirection) and is presented with a description of what moo.com is requesting to do (e.g., access to download photos for the next 10 minutes). She can approve or reject the request.
- If she approves the request, the temporary credentials (request token) are tagged as "approved". Alice is redirected back to moo.com. The redirect contains an identifier for the approved credentials.
- Moo now contacts moo.com and exchanges the request token for an access token. Request tokens are used to obtain a user's approval. Access tokens are used to access resources on the provider (server). Moo now sends requests containing the access token to flickr to download the requested photos.
Firewalls
A firewall protects the junction between an untrusted network (e.g., external Internet) and a trusted network (e.g., internal network). Two approaches to firewalling are packet filtering and proxies. A packet filter, or screening router, determines not only the route of a packet but whether the packet should be dropped based on contents in the IP header, TCP/UDP header, and the interface on which the packet arrived. With stateless inspection, a packet is examined on its own. Stateful inspection allows the router to keep track of TCP connections and understand the relationship between packets. For example, a port that needs to be enabled for the FTP data channel once an FTP connection has been established or that return packets should be allowed in response to outbound requests. Deep packet inspection (DPI) allows a firewall to examine application data as well and make decisions based on its contents. Deep packet inspection can validate the protocol of an application as well as check for malicious content such as viruses or other security attacks.
Proxy services (also known as application proxies) live on bastion hosts. These are stripped down machines to give potential intruders as few tools as possible (few user accounts, no compilers, minimal commands). An application proxy is software that presents the same protocol to the outside network as the application for which it is a proxy (e.g., a mail server proxy will listen on port 25 and understand SMTP, the simple mail transfer protocol). The primary job of the proxy is to validate the application protocol. Valid requests are regenerated to the real application that is running on another server and is not accessible from the outside network. The bastion hosts on which proxies live are dual-homed hosts. This means the computer has two network cards. This is important since packets will not flow directly between the outside network to the inside network. The proxy is the only one who can communicate with the internal network. Unlike DPI, a proxy may modify the data stream, such as stripping headers or modifying machine names.
A typical firewalled environment is a screened subnet architecture, containing a separate subnet between the internal and externally-facing networks called the DMZ (demilitarized zone). The DMZ contains all the bastion hosts that may be offering services to the external network (usually the Internet). Machines on the internal network are not directly accessible from the Internet. Screening routers on both sides of the DMZ ensure that no packet from the outside network is permitted into the inside network.
All machines within an organization will be either in the DMZ or in the internal network. The exterior router will allow IP packets to only the machines/ports in the DMZ that are offering valid services. It would also reject any packets that are masqueraded to appear to come from the internal network. The interior router would allow only packets to come from designated machines in the DMZ that may need to access services in the internal network. Any packets not targeting the appropriate services in the internal network will be rejected. Both routers will generally allow traffic to flow from the internal network to the Internet, although an organization may block certain services (ports) or force users to use a proxy (for web access, for example).
VPNs
Virtual private networks (VPNs) allow disconnected local area networks to communicate securely over the public Internet, saving money by using a shared public network (Internet) instead of leased lines. This is achieved by tunneling, the encapsulation of an IP packet within another packet. In this case, a packet that is destined for a remote subnet (which will often have local source and destination IP addresses that are not routable over the public Internet) will be treated as payload and placed in IP packets that are routed over the public Internet. The source and destination addresses of this packet are the VPN endpoints at both sides, usually the routers. When the VPN endpoint (router) receives this encapsulated packet, it extracts the data, which is an IP packet, and routes it on the local network. This tunneling behavior gives us the virtual network part of the VPN.
To achieve security, an administrator setting up a VPN will usually be concerned that the data contents are not readable and the data has not been tampered with. To ensure this, the encapsulated packets can be encrypted and signed. Signing a packet ensures that the data has not been modified in transit. Encrypting ensures that intruders would not be able to make sense of the packets. VPNs usually provide several options for key management: shared private keys (AES or 3DES), Diffie-Hellman key exchange, or RSA public keys.
Fault Tolerance
There are three classes of faults: transient, intermittent, and permanent. Faults are either fail-silent (where the component does not produce results) or Byzantine (where the component produces faulty results). A popular approach to fault tolerance is redundancy, which may be information redundancy (e.g., error correcting codes), time redundancy (e.g., retransmission), or physical redundancy (redundant components: standby systems or triple-modular redundancy for active voting). A technique to deal with components that may exhibit Byzantine faults is Triple Modular Redundancy (TMR). This requires a three-way replication of the component with election logic to weed out a component that is producing faulty results (the two good components will overrule the faulty one). With software, we can use a consensus algorithm to keep replicated processes on multiple machines in sync as the process moves through different states of execution. Each state transition will generate a consesus-based update to all replicas.
The two-army problem demonstrates that reliable communication can never be achieved with faulty communication lines. The Byzantine generals problem illustrates the difficulty of achieving reliable communication in the presence of faulty processors exhibiting Byzantine faults.
Clusters
Clustering is the aggregation of multiple independent computers to provide increased reliability and/or performance. There are four distinct approaches to clustering: high-performance computing, batch processing, load balancing, and high availability.
Clusters will usually run software to identify cluster membership: which nodes are members of the cluster. Quorum keeps track of the number of nodes that have to be alive for the cluster to function. Typically, a majority is required. This provides a simple way of avoiding split-brain due to network partitioning where one group of machines cannot talk to another and two instances of the cluster are created. A cluster's quorum also identifies which machines (nodes) in the cluster perform which roles (e.g., standby, active, running specific applications).
High-Performance Computing (HPC)
High-performance clusters are generally custom efforts but there are a number of components that are common across many implementations. They are designed for traditional supercomputing applications that focus on a large amount of computation on large data sets. These applications are designed to be partitioned into multiple communicating processes. The Message Passing Interface (MPI) is a popular programming interface for sending and receiving messages that handles point-to-point and group communication and provides support for barrier-based synchronization. It is often used together with the Parallel Virtual Machine (PVM), a layer of software that provides an interface for creating tasks, managing global task IDs, and managing groups of tasks on arbitrary collections of processors. Beowulf and Rocks Cluster are examples of a popular HPC clusters based on Linux and other libraries. Microsoft offers an HPC Server 2008. There are many other HPC systems as well. The common thread among them all is that they provide a front-end server for scheduling jobs and monitoring processes and offer an MPI library for programming.
Batch Processing: Single-Queue Work Distribution
Batch processing is often used for applications such as renderfarms for computer animation, where a central coordinator (dispatcher) sends job requests to a collection of machines. When a system completes a job (e.g., “render frame #4,178”), the dispatcher will send it the next job (e.g., “render frame #12,724”). The dispatcher will have the ability to list jobs, delete jobs, dispatch jobs, and get notified when a job is complete. The worker nodes have no need to communicate with each other.
Load Balancing
Web-services load balancing is a somewhat trivial but very highly used technique for distributing the load of many network requests among a collection of machines. The simplest form is to have all requests go to a single machine that then returns an HTTP REDIRECT error. This is part of the HTTP protocol and will lead the client to re-issue the request to the machine specified by the REDIRECT error.
A more sophisticated software-only approach was implemented by IBM’s Interactive Network Dispatcher. As with the REDIRECT approach, all requests go to one machine. This machine forwards the entire stream of IP packets to one of several load-balanced back-end systems. All of these machines share the same IP address and the coordinator has special kernel extensions that allow it to forward TCP and UDP packets to the different machines by changing the MAC address of the packets.
Finally, and the most popular approach, one can turn to hardware and use a load-balancing router to map incoming requests to multiple back-end machines.
High Availability
High-availability clusters strive to provide a high level of system uptime by taking into account the fact that machines may fail. In such a case, applications running on those machines will be moved to other machines that are still up.
Low-level software to support high-availability clustering includes facilities for IP address takeover (allowing a machine to listen on multiple IP addresses) and to access shared disks. Mid-layer software includes distributed elections to pick a coordinator, propagation of status information, and figuring out which systems and applications are alive. Higher-layer software includes the ability to restart applications, let a user assign applications to machines, and let a user see what's going on in the system as a whole.
The key to detecting machine failure is the heartbeat network: exchanging messages between machines to ensure that they are alive and capable of responding. Since a main network may go down, one or more secondary networks (such as direct cable connections between two machines) are often used as dedicated heartbeat networks to distinguish failed machines from failed networks.
Failed applications may be restarted on a standby machine (active/passive configuration) or may be run on a machine that’s already running other services (active/active configuration). A cold failover is an application restart – the application is started from the beginning. A warm failover is one where the application is checkpointed periodically and is restarted from the last checkpoint. In a hot failover, a copy of the application is always running in lockstep synchrony with the main application on another machine. This is difficult and not always practical, especially where communications are involved. Cascading failover refers to the ability of an application to fail over even after it already has failed over in the past (a double restart). Multi-directional failover refers to the ability to restart applications from a failed system on multiple available systems instead of a specific targeted backup machine.
Clustered systems may share access to the same disk (e.g., via a a fibre channel switch or ethernet switch). This is a shared-disk configuration. Disk access in this form is raw, block-level access. All machines run a local file system. Having two machines access the same disk is not a good idea, of course, since that can put the filesystem in an incoherent state. Therefore, systems that do this must ensure that they have mutually exclusive access to key filesystem data structures such as inodes and free block maps as well as be responsible for flushing their caches when they don’t have access. They do this via a distributed lock manager (DLM). File systems designed to support this mode of operation are known as cluster file systems. A simpler solution is a shared-nothing architecture where there is no shared storage and a system forwards any requests for disk I/O to the system that owns the disk. In the case of failure, however, the access to such storage may be lost or may have to be gained by using a shared link that was simply not used when both systems were up. Alternatively, a shared-nothing architecture may use a network file system (such as NFS, SMB, AFP, etc.) to provide file-level rather than block-level file access.
Some clusters may be tightly coupled via system area network (SAN). This is a switched network to allow low-latency I/O between machines without incurring any processor intervention. It also avoids the overhead of TCP/IP by providing a reliable communication channel. Remote DMA (RDMA) allows data to be copied directly to the memory of another processor. SANs are often used for HPC clusters, with SAN/RDMA communication incorporated into the Message Passing Interface (MPI) library. Examples are Infiniband, Myrinet, and 10 Gbps ethernet.
Storage in a cluster is often separated from the actual machines via a storage area network (SAN). This is a network that is dedicated to disk I/O traffic between machines and the disks, which reside in separate storage arrays. The communications between machines and disks is typically SCSI over fibre channel or the iSCSI over ethernet. Because of this dissociation between machines and disks, the switch between the machines and the storage can be configured to allow specific hosts to read and write from/to specific disks.
To make storage highly available, RAID (redundant array of independent disks) is often employed. RAID 1 is mirroring. Anything that is written to one disk gets written to a secondary disk. If one fails then you still have the other. RAID 5 stripes the data across several disks and also adds in error correcting codes (e.g., parity in the case of allowing one to correct for a single failure) so that it data could be reconstructed from the available segments if one would die. RAID 0 is a technique used to improve performance by striping data across multiple disks. It does not improve fault tolerence.
Case Study: The Google Cluster Architecture
The Google Cluster architecture is built atop over 10,000 unreliable commodity PCs running fault-tolerant software. The goal of the system is energy efficiency and the best price for the realized performance rather than maximizing processor performance.
The Google search service contains several clusters that are distributed worldwide. Each of these clusters has several thousand machines. A user's query is directed to a specific cluster via the DNS lookup of google.com. The DNS load-balancing system takes the user's round-trip time and system capacity into account to provide the address of a suitable cluster. Once the request gets to the cluster, a hardware load balancer forwards the request to one of the web servers in the cluster.
The fundamental approach to exploiting distribution is to transform a query on a very large database into a much of queries on smaller databases, followed by a merge of the results.
The web server performs a query against several index servers. Since the index is many terabytes in size, it is divided into pieces (shards), each holding a subset of the documents from the full index. Each of these index servers is replicated, with a load balancer assigning query requests. If any of the replica servers goes down then performance is degraded proportionally. The index server returns a list of document IDs to the web server that made the query.
The next task for the web server is to take the result from the index queries and consult document servers that get the title, URL, and the in-context snippet from the document. As with index lookup, the documents are also randomly distributed among multiple document servers and each server has multiple replicas that are load balanced.
Most index and document lookup operations are read-only and updates to the data are infrequent. Replicas can be taken off-line during an update, so consistency problems are not an issue.
Since individual shards (pieces of the index or pieces of the document database) don't need to communicate with each other, performance can scale almost linearly with an increase in the number of machines. Because the performance scales so well, the emphasis on optimizing cost favors using commodity components instead of, say, the very fastest processors, whose cost is disproportionately higher as well as using inexpensive and somewhat slower disks.
Amazon Dynamo
Dynamo is a highly available key-value store created internally within Amazon. Many services within Amazon, such as best seller lists, shopping carts, user preferences, session management, etc., neeed only primary key access to data. A multi-table relational database system would be overkill and would limit scalability and availability. With Dynamo, an application can configure its instance of Dynamo for desired availability (# replicas) and consistency.
Dynamo supports two operations: get(key) and put(key, data, context). The data is an arbitrary bunch of bytes (typically less than 1 MB) and is always identified by a unique key. The (key, data) values are distributed over a large set of servers and replicated for avaailability. The context is a value that is returned with a get operation and then sent back with a put operation. It is meaningless to the user program and is used like a cookie. Internally, it contains versioning information. The system has been described as a zero-hop distributed hash table (DHT).
The system is decentralized: there is no coordinator overseeing operations. Every node has equivalent responsibilities and the system can grow dynamcially by adding new nodes.
Partitioning: scalability
Since massive scalability is a key design aspect of Dynamo, data storage is distributed among many nodes. Dynamo relies on consistent hashing to identify which node will be responsible for a specific key. With normal hashing, a change in the number of slots in the table (e.g., number of nodes since we are hashing to find a node) will result in having to remap all keys. With consistent hashing, this is not the case and only K/n keys need to be remapped, where K is the number of keys and n is the number of slots (nodes).
Nodes are arranged in a logical ring. Each node is assigned a number from the possible set of hash outputs. It is responsible for all keys that hash to any number between its number and its predecessor's number. When a key is hashed, a node will traverse a logical ring of node values to find the first node with a position greater than that hashed result. This tells it which node is responsible for handling that key. Since nodes are not added that frequently, every node can keep a cached copy of this ring (the size is just N hash values, where N is the number of nodes). Moreover, to avoid the extra hop resulting because a client sent a request to the wrong Dynamo node, a client may also get a copy of this ring.
Adding or removing nodes affects only the node's immediate neighbors in the ring. If a node is added, the neighbors on either side will give up some range of values to the new node. In reality, a node is assigned to manage multiple points in the ring (multiple ranges of hash values). These nodes that we just discussed are really virtual nodes. A real machine, or physical node, will be responsible for managing multiple virtual nodes. The advantage of doing this is that we can attain balanced load distribution. If a machine becomes unavailable, the load is evenly distributed among the available nodes. If a new machine is added, each virtual node within it will take a set of values from other virtual nodes, effectively taking on a small amount of load from a set of other machines instead of from just two neighboring machines. Moreover, if a more powerful machine is added, it can be assigned a larger number of virtual nodes so that it bears a greater percentage of the query workload.
Replication: availability
Dynamo is designed as an always writable data store. Data can be replicated on N hosts, where the value N is configurable per instance of Dynamo. The node that is responsible for the hashed key is assigned the role of coordinator for that key and is in charge of replication for that key. The coordinator copies (key,value) data onto N-1 successive nodes clockwise in the ring. Copying is asynchronous (in the background) and Dynamo provides an eventually consistent model. This replication technique is called optimistic replication, which means that replicas are not guaranteed to be identical at all times. If a client cannot contact the coordinator, it sends the request to a node holding a replica, which will then periodically try to contact the coordinator.
Because data is replicated and the system does not provide ACID guarantees, it is possible that replicas may end up with different versions. Dynamo favors application-based conflict resolution since the application is aware of the meaning of the data associated with the key and can act upon it intelligently. For example, the application may merge two versions of a shopping cart. If the application does not want to bother with this, the system falls back on a last write wins strategy. Dynamo uses vector clocks to identify versions of stored data and to distinguish between versions that have been modified concurrently from versions that are causally related (a later one is just an update of an earlier one). For example, one can identify whether there are two versions of a shopping cart that have been updated independently or whether one version is just a newer modification of an older shopping card. A vector clock value is a set of (node, counter) pairs. This set of values constitutes the version of the data. It is returned as a "context" value with a get operation and updated and stored when that context is passed as a parameter to a put operation.
Because Dynamo is completely decentralized and there is no coordinator (unlike BigTable), each node serves three functions:
- Managing get and put requests. A node may act as a coordinator responsible for a particular key or may forward the request to the node that has that responsibility.
- Keeping track of membership and detecting failures. A node will detect if other nodes are out of service and keep track of the list of nodes in the system and their associated hash ranges.
- Local persistant storage. Each node is responsible for being either the primary or replica store for keys that hash to a certain range of values. These (key,value) pairs are stored within that node using a variety of mechanisms depending on application needes. These include the Berkeley Database Transactional Data Store, MySQL (for large objects), or an in-memory buffer with persistent backing store (for highest performance).