A peer-to-peer model is an application architecture that removes the need for dedicated servers and enables each host to participate in providing a service. Because every system can both access and provide the service, the systems are called peers. In this discussion, we will focus on one application domain: peer-to-peer file distribution.
The first peer-to-peer system to gain widespread use was Napster, which was used primarily for sharing music. Napster used a central server to track which users had which music on their systems. The actual content was hosted on users' systems. When that central server was shut down after a lawsuit from the recording industry, the service effectively disappeared even though the individual content-hosting systems were still running.
After Napster, Gnutella set out to create an architecture that offers distributed file sharing with no central point of control. Unlike Napster, Gnutella cannot be shut down by disabling one machine, since there is no central server.
Gnutella’s approach to finding content is based on query flooding. When a peer joins the system, it needs to contact at least one other Gnutella node and ask it for a list of nodes it knows about (its “friends”). This list of peers becomes its list of connected nodes. This builds an overlay network. An overlay network is a logical network that is formed by peer connections. Each peer knows of a limited set of other peers. These become its neighbors; they do not need to be physical neighbors. A peer is capable of communicating with any other peer; the only thing that stops it is not knowing that the other peer exists.
To search for content, a peer sends a query message to its connected nodes. Each node that receives a query will respond if it has the content. Otherwise, it forwards the query to its connected nodes. This is the process of flooding. Once the content is found, the requesting peer downloads it via HTTP from the peer that hosts it.
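The flooding search described above can be sketched as a breadth-first walk over the overlay. This is only an illustrative model, not Gnutella's wire protocol: the function name, the dictionary-based overlay, the duplicate suppression via a seen set, and the max_hops limit are all assumptions added so that the example terminates.

```python
from collections import deque

def flood_query(overlay, start, has_content, max_hops=4):
    """Flood a query over an overlay network, breadth first.

    overlay:     dict mapping each peer to the list of peers it knows about
    has_content: set of peers that hold the requested file
    max_hops:    hypothetical hop limit so the flood cannot run forever
    Returns (peers that answered, number of query messages sent).
    """
    seen = {start}                  # peers that have already handled this query
    frontier = deque([(start, 0)])
    hits, messages = [], 0
    while frontier:
        peer, hops = frontier.popleft()
        if peer != start and peer in has_content:
            hits.append(peer)       # this peer responds instead of forwarding
            continue
        if hops == max_hops:
            continue                # hop limit reached: drop the query
        for neighbor in overlay[peer]:
            messages += 1           # one query message per overlay link used
            if neighbor not in seen:
                seen.add(neighbor)  # a duplicate copy is counted but ignored
                frontier.append((neighbor, hops + 1))
    return hits, messages

# Hypothetical five-peer overlay; only E hosts the file.
overlay = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}
hits, messages = flood_query(overlay, "A", has_content={"E"})
```

Here hits is ["E"], and messages counts every copy of the query that crossed a link, including duplicates that the receiver discarded. Even in this tiny overlay the message count is well above the number of peers, which hints at why flooding becomes expensive as the overlay grows.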
A facet of Gnutella’s original design was anonymity. Replies were sent back along the same path that the queries took. A peer receiving a query would not know if it came from the requestor or from a peer just forwarding the request.
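As a rough illustration of reverse-path replies, suppose each peer remembers only the neighbor that handed it the query. The came_from table and the route_reply function below are hypothetical names for this sketch and do not reflect Gnutella's actual message format.

```python
def route_reply(came_from, responder):
    """Walk a reply back along the path the query took.

    came_from maps each peer to the neighbor that handed it the query.
    The peer with no entry is the one that originated the query.
    """
    path = [responder]
    while path[-1] in came_from:
        path.append(came_from[path[-1]])  # hand the reply to the previous hop
    return path

# Hypothetical state left behind by a query that travelled A -> B -> D -> E.
came_from = {"B": "A", "D": "B", "E": "D"}
reply_path = route_reply(came_from, responder="E")
```

Each peer consults only its own entry, so B knows that the query arrived from A but cannot tell whether A originated it or was merely forwarding it.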
While Gnutella is decentralized, its flooding-based search is inefficient compared to maintaining a single database. Search may require contacting a large number of systems and going through multiple hops. Well-known nodes (e.g., those that may be configured in default installations) may become overly congested.
Kazaa was created a year after Gnutella with the core premise that not all nodes have equivalent capabilities as far as network connectivity and uptime are concerned. Its designers introduced the concept of supernodes. These nodes have high uptime, fast connectivity, faster processors, and potentially more storage than regular nodes. They also know other supernodes. Incidentally, Gnutella later enhanced its system to support the same concept by adding ultrapeers.
A client (peer) needs to know of one supernode to join the system. It sends that supernode a list of all the files that it is hosting. Only supernodes are involved in the search process. Search is a flood over the overlay network as in Gnutella. Once a query reaches a supernode that has the requested content in its list, it sends a reply directly to the peer that initiated the query. As with Gnutella, the querying peer will then download the content directly from the peer that hosts the content.
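The supernode arrangement can be sketched as a small index that regular peers register with and that only supernodes search. The Supernode class, its method names, and the recursive traversal below are assumptions made for illustration; Kazaa's real protocol was proprietary and is not reproduced here. For brevity the sketch gathers hosts from every reachable supernode rather than modelling individual reply messages.

```python
class Supernode:
    """Toy supernode: indexes the files of the peers registered with it."""

    def __init__(self, name):
        self.name = name
        self.index = {}       # file name -> set of peers hosting that file
        self.neighbors = []   # other supernodes this supernode knows about

    def register(self, peer, files):
        """A joining peer uploads the list of files it is hosting."""
        for filename in files:
            self.index.setdefault(filename, set()).add(peer)

    def search(self, filename, seen=None):
        """Search this index, then flood the query to other supernodes."""
        seen = set() if seen is None else seen
        seen.add(self.name)   # never visit the same supernode twice
        hosts = set(self.index.get(filename, ()))
        for supernode in self.neighbors:
            if supernode.name not in seen:
                hosts |= supernode.search(filename, seen)
        return hosts

# Two hypothetical supernodes that know about each other.
s1, s2 = Supernode("S1"), Supernode("S2")
s1.neighbors.append(s2)
s2.neighbors.append(s1)
s1.register("peer-a", ["song.mp3"])
s2.register("peer-b", ["song.mp3", "talk.ogg"])
hosts = s1.search("song.mp3")
```

Regular peers never take part in the flood: they only register their file lists and later download directly from one of the peers returned in hosts.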
The design of BitTorrent was motivated by the flash crowd problem. How do you design a file sharing service that will scale as a huge number of users want to download a specific file? Systems such as Napster, Gnutella, and Kazaa all serve their content from the peer that hosts it. If a large number of users try to download a popular file, all of them will have to share the bandwidth that is available to the peer hosting that content.
The idea behind BitTorrent is to turn each peer that is downloading content into a server of that content. BitTorrent only focuses on the download problem and does not handle the mechanism for locating the content.
To offer content, the content owner creates a .torrent file. This file contains metadata, or information, about the file, such as the name, creation time, and size of the file. It also contains a list of hashes of blocks of the content. The content is logically divided into fixed-size blocks, and the list of hashes in the .torrent file allows a downloading peer to validate that each downloaded block has been received correctly. Finally, the .torrent file contains a list of trackers.
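A minimal sketch of the block-hash idea follows, assuming SHA-1 digests (which the original BitTorrent protocol uses) and an arbitrary block size. The function names are hypothetical, and the real .torrent encoding is not shown.

```python
import hashlib

def piece_hashes(data, piece_size):
    """Split data into fixed-size blocks and return one SHA-1 digest per block."""
    return [hashlib.sha1(data[i:i + piece_size]).digest()
            for i in range(0, len(data), piece_size)]

def verify_piece(index, piece, hashes):
    """Check a downloaded block against the digest published for that index."""
    return hashlib.sha1(piece).digest() == hashes[index]

# Hypothetical 1600-byte file split into 512-byte blocks (the last one is short).
content = b"example payload " * 100
hashes = piece_hashes(content, piece_size=512)

good = verify_piece(1, content[512:1024], hashes)   # intact block
bad = verify_piece(1, b"x" * 512, hashes)           # corrupted block
```

Because every block is checked independently, a peer can accept blocks from many untrusted sources and re-request only the ones that fail verification.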
The tracker is a server running a process that manages downloads for a set of .torrent files. When a downloading peer opens a .torrent file, it contacts a tracker that is specified in that file. The tracker is responsible for keeping track of which peers have the content. There could be many trackers, each responsible for different torrents.
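The tracker's bookkeeping role can be illustrated with an in-memory stand-in. Real trackers are network services; the Tracker class, the announce method, the torrent identifiers, and the peer addresses below are all hypothetical and serve only to show the mapping from a torrent to the peers taking part in it.

```python
class Tracker:
    """Toy tracker: remembers which peers participate in each torrent."""

    def __init__(self):
        self.swarms = {}   # torrent id -> set of peer addresses

    def announce(self, torrent_id, peer):
        """Register a peer and return the other peers already known."""
        swarm = self.swarms.setdefault(torrent_id, set())
        others = sorted(swarm - {peer})
        swarm.add(peer)
        return others

tracker = Tracker()
first = tracker.announce("torrent-1", "10.0.0.1:6881")    # nobody else yet
second = tracker.announce("torrent-1", "10.0.0.2:6881")   # learns about the first peer
other = tracker.announce("torrent-2", "10.0.0.3:6881")    # separate torrent, separate list
```

The tracker never touches the content itself; it only introduces peers to one another.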
A seeder is a peer that has the entire file available for download by other peers. Seeders register themselves with trackers so that trackers can direct downloading peers to them. An initial seeder is the peer that offers the first complete copy of the file.
A leecher is a peer that is downloading files. To start the download, the leecher must have a .torrent file, which identifies the tracker for the contents. The leecher contacts the tracker, which keeps track of the seed nodes for that file as well as other leechers, some of whom may have already downloaded some blocks of the file. A leecher contacts seeders and other leechers to download random blocks of the file. As it gets these blocks, it can make them available to other leechers. This allows download bandwidth to scale: every downloader increases overall download capacity. Once a file is fully downloaded, the leecher has the option of turning itself into a seeder and continuing to serve the file.
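To make the scaling argument concrete, here is a toy round-based simulation. It rests on simplifying assumptions that are not part of BitTorrent itself: every peer can upload at most one block per round, blocks and sources are picked uniformly at random, and the share flag (a hypothetical parameter) switches block exchange between leechers on or off.

```python
import random

def simulate(num_blocks, num_leechers, share=True, seed=1):
    """Return how many rounds it takes until every leecher has every block."""
    rng = random.Random(seed)
    everything = set(range(num_blocks))
    have = {"seeder": set(everything)}
    have.update({f"leecher{i}": set() for i in range(num_leechers)})
    rounds = 0
    while any(blocks != everything for blocks in have.values()):
        rounds += 1
        snapshot = {p: set(b) for p, b in have.items()}  # state at round start
        busy = set()               # peers that have already uploaded this round
        downloaders = [p for p in have if snapshot[p] != everything]
        rng.shuffle(downloaders)
        for peer in downloaders:
            # Candidate sources: idle peers holding a block this peer lacks.
            sources = [s for s in snapshot
                       if s != peer and s not in busy
                       and (share or s == "seeder")
                       and snapshot[s] - snapshot[peer]]
            if not sources:
                continue           # no upload capacity left for this peer
            source = rng.choice(sources)
            block = rng.choice(sorted(snapshot[source] - snapshot[peer]))
            have[peer].add(block)
            busy.add(source)
    return rounds

without_sharing = simulate(num_blocks=8, num_leechers=4, share=False)
with_sharing = simulate(num_blocks=8, num_leechers=4, share=True)
```

With sharing disabled, the seeder alone must upload every block to every leecher, one block per round. With sharing enabled, leechers upload blocks they already hold, so the swarm finishes in fewer rounds under this model.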