Sockets

Paul Krzyzanowski

November 26, 2010

Introduction

We know that data can be sent and received between machines via IP. We also know that TCP and UDP were developed as connection-oriented and connectionless transport-layer protocols over IP. How do we associate these data streams with our applications? Previously, we mentioned that a server must be able to get a transport address for a service and associate that address with the service. The client must be able to figure out this address and access the service through it. The most popular implementation is the concept of sockets. They were developed by the University of California, Berkeley for providing inter-process communication for the 4.2 BSD variant of the UNIX operating system. Since then, the interface has become pervasive among most operating systems, including the various versions of Linux, OS X, Windows, and a wide variety of embedded systems.

Sockets are an attempt at creating a generalized IPC model with the following set of goals:

  • Communication between processes should not depend on whether they are on the same machine
  • Efficiency: this should be an efficient layer on top of network communication facilities
  • Compatibility: processes that just read from a standard input file and write to a standard output file should continue to work in distributed environments
  • Must support different protocols and naming conventions (different "communication domains" or "address families")

The socket is an abstract object from which messages are sent and received. It is created in a communications domain roughly similar to a file being created in a file system. Sockets exist only as long as they are referenced. A socket allows an application to request a particular style of communication (virtual circuit, datagram, message-based, in-order delivery, ...). Unrelated processes should be able to locate communication endpoints, so sockets need to be named. The name is something that is meaningful within the specific communications domain.

Programming with sockets

There are several steps involved in creating a socket, locating the remote endpoint, and communicating over the socket. This section cannot cover the entire topic fully. Several of the references listed provide more complete information. On-line manual pages will provide you with the latest information on acceptable parameters and functions. The interface described here is the system call interface provided by the Berkeley family of operating systems (including OS X) and is generally extremely similar amongst all Unix systems (and many other operating systems).

1. Create a socket

A socket is created with the socket system call:

int s = socket(int domain, int type, int protocol)

All the parameters as well as the return value are integers.

  • domain, or address family: communication domain in which the socket should be created. Some address families are AF_INET (IP family), AF_UNIX (local channel, similar to pipes), and AF_NS (Xerox Network Systems protocols).
  • type: type of service. This is selected according to the properties required by the application: SOCK_STREAM (virtual circuit service), SOCK_DGRAM (datagram service), SOCK_RAW (direct IP service). Check with your address family to see whether a particular service is available.
  • protocol: indicates a specific protocol to use in supporting the socket's operation. This is useful in cases where some families may have more than one protocol to support a given type of service. A value of 0 selects the default protocol for the given domain and type.
  • The return value is a file descriptor (a small integer). The analogy of creating a socket is that of requesting a telephone line from the phone company.

    Creating a socket is conceptually similar to performing an open operation on a file with the important distinction that open creates a new reference to a possibly existing object whereas a socket creates a new instance of an object.
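
As a minimal sketch of this step, the function below creates one stream socket and one datagram socket in the AF_INET domain. The function name create_sockets is hypothetical, chosen only for this illustration; the protocol argument is 0, which asks for the default protocol for the domain and type.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a stream (virtual circuit) socket and a datagram socket in the
 * IP address family. Returns 0 on success, -1 if either call fails.
 * create_sockets is a hypothetical helper name used for illustration. */
int create_sockets(int *tcp_fd, int *udp_fd)
{
    assert(tcp_fd != NULL && udp_fd != NULL);

    /* protocol 0: use the default protocol for this domain and type */
    *tcp_fd = socket(AF_INET, SOCK_STREAM, 0);
    if (*tcp_fd < 0)
        return -1;

    *udp_fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (*udp_fd < 0) {
        close(*tcp_fd);     /* a socket descriptor is released like any file descriptor */
        return -1;
    }
    return 0;
}
```

Each successful call returns a distinct small integer, and each descriptor refers to a newly created socket rather than to an existing object.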

    2. Name a socket

    When we mention naming a socket, we are talking about assigning a transport address to the socket. This operation is called binding an address. The analogy is that of assigning a phone number to the line that you requested from the phone company in step 1 or that of assigning an address to a mailbox.

    You can explicitly assign an address or allow the system to assign one. The address is defined in a socket address structure. Applications find addresses of well-known services by looking up their names in a database (e.g., the file /etc/services). The system call for binding is:

    int error = bind(int s, const struct sockaddr *addr, socklen_t addrlen)

    where s is the socket descriptor obtained in step 1, addr is the address structure (struct sockaddr *) and addrlen is an integer containing the address length. One may wonder why we don't name the socket when we create it. The reason is that in some domains it may be useful to have a socket without a name. Not forcing a name on a socket will make the operation more efficient in those cases and remove confusion. Also, some communication domains may require additional information before binding (such as selecting a grade of service).

    As users, we might want names to be user-friendly, such as Bob's print server. We don't get that here. Sockets is a low-level interface that is designed to operate comfortably with the layers of abstraction provided by the networking stack. If a network allows user-friendly textual names then bind would let us use them. For TCP and UDP, however, bind refers to assigning an IP address and port number.
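
A short sketch of binding in the AF_INET domain follows. It assumes an IPv4 socket and binds it to the loopback address; passing a port of 0 lets the system assign one, and getsockname reports which port was chosen. The name bind_loopback is hypothetical.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Bind socket s to the loopback address. A port of 0 asks the system
 * to pick a free port. Returns the bound port in host byte order,
 * or -1 on failure. bind_loopback is a hypothetical helper name. */
int bind_loopback(int s, unsigned short port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);

    assert(s >= 0);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* 127.0.0.1 */
    addr.sin_port = htons(port);                    /* network byte order */

    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;

    /* ask the system which address and port were actually assigned */
    if (getsockname(s, (struct sockaddr *)&addr, &len) < 0)
        return -1;
    return ntohs(addr.sin_port);
}
```

The cast to struct sockaddr * and the explicit length are what allow the same bind call to serve address families whose address structures differ.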

    3a. Accept connections (server-side operation)

    For connection-based communication, the server has to first state its willingness to accept connections. This is done with the listen system call:

    int error = listen(int s, int backlog)

    The backlog is an integer specifying the upper bound on the number of pending connections that should be queued for acceptance. After a listen, the socket s is set to manage the queue of connection requests; it will not be used for data exchange.

    Connections can now be accepted with the accept system call, which extracts the first connection request on the queue of pending connections. It creates a new socket with the same properties as the listening socket and allocates a new file descriptor for it. By default, socket operations are synchronous, or blocking, and accept will block until a connection is present on the queue. The syntax of accept is:

    struct sockaddr_storage clientaddr;   /* large enough for any address family */
    socklen_t clientaddrlen = sizeof(clientaddr);
    int snew = accept(s, (struct sockaddr *)&clientaddr, &clientaddrlen);

    The clientaddr structure allows a server to obtain the client address. accept returns a file descriptor that is associated with a new socket. The address length field initially contains the size of the address structure and, on return, contains the actual size of the address. Communication takes place on this new socket. The original socket is used only for managing a queue of connection requests (you can, and often will, still listen for other requests on the original socket).

    None of this is needed for connectionless sockets. For those, the recvmsg and recvfrom system calls were created; they return the address and port of the sender along with each incoming message.
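
The server-side sequence for a connection-oriented socket can be sketched as follows. This is an illustration under stated assumptions, not a complete server: it uses IPv4 on the loopback address, a system-assigned port, and a backlog of 5, and the names make_listener and accept_client are hypothetical.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket that listens on a system-assigned loopback port.
 * Returns the listening descriptor and stores the port in *port,
 * or returns -1 on failure. (Hypothetical helper.) */
int make_listener(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int s = socket(AF_INET, SOCK_STREAM, 0);

    assert(port != NULL);
    if (s < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                       /* let the system choose */

    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(s, 5) < 0 ||                  /* queue up to 5 pending requests */
        getsockname(s, (struct sockaddr *)&addr, &len) < 0) {
        close(s);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return s;
}

/* Block until a connection request arrives, then return a new socket
 * for that connection and record the client's address. (Hypothetical helper.) */
int accept_client(int listener, struct sockaddr_in *client)
{
    socklen_t clientaddrlen = sizeof(*client);

    assert(client != NULL);
    return accept(listener, (struct sockaddr *)client, &clientaddrlen);
}
```

The descriptor returned by accept_client differs from the listening descriptor; a server would typically exchange data on the new one and loop back to accept on the original.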

    3b. Connect (client-side operation)

    For connection-based communication, the client initiates a connection with the connect system call:

    int error = connect(int s, const struct sockaddr *serveraddr, socklen_t serveraddrlen)

    where s is the socket (type int) and serveraddr is a pointer to a structure containing the address of the server (struct sockaddr *). Since the structure may vary with different transports, connect also requires a parameter containing the size of this structure (serveraddrlen).

    This call can also be used for connectionless service. In this case, no connection is established, but the operating system maintains an association between the socket and the remote address and sends datagrams to that address, so that you don't have to specify the address each time you send or receive a message.
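
The sketch below shows the same connect call used for either service type. It assumes IPv4 and a peer on the loopback address; the name connect_loopback is hypothetical. With SOCK_STREAM the call establishes a connection; with SOCK_DGRAM it only records the remote address.

```c
#include <assert.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a socket of the given type (SOCK_STREAM or SOCK_DGRAM) and
 * connect it to the given port on the loopback address.
 * Returns the socket descriptor, or -1 on failure. (Hypothetical helper.) */
int connect_loopback(int type, unsigned short port)
{
    struct sockaddr_in serveraddr;
    int s;

    assert(type == SOCK_STREAM || type == SOCK_DGRAM);
    s = socket(AF_INET, type, 0);
    if (s < 0)
        return -1;

    memset(&serveraddr, 0, sizeof(serveraddr));
    serveraddr.sin_family = AF_INET;
    serveraddr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    serveraddr.sin_port = htons(port);

    /* the size argument tells connect which address structure this is */
    if (connect(s, (struct sockaddr *)&serveraddr, sizeof(serveraddr)) < 0) {
        close(s);
        return -1;
    }
    return s;
}
```

After a successful call on a datagram socket, a plain write on the descriptor sends a datagram to the recorded address without naming it again.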

    4. Exchange data

    Data can now be exchanged with the regular file system read and write system calls using the socket descriptors. This is the most significant part about the desire for compatibility with file descriptors. After a connection has been established, the code can be completely unaware of networking and simply treat the socket as a file input/output stream, no different than a user's terminal or a disk-based file.

    Additional system calls were added to support datagram service and additional networking features. The send/recv calls are similar to read/write but support an extra flags parameter that lets one peek at incoming data or send out-of-band data. The sendto/recvfrom system calls are similar to send/recv but allow callers to specify or receive the address of the peer with whom they are communicating (most useful for connectionless sockets). Finally, sendmsg/recvmsg support a full IPC interface and allow access rights to be sent and received. Could this have been designed more cleanly and simply? Most likely. The point to remember is that read/write and send/recv do not carry an address, so they are the natural choice for connection-oriented communication (or for a datagram socket on which connect has been called), whereas sendto/recvfrom or sendmsg/recvmsg are needed for connectionless communication, where each message must name its destination or report its sender. Also note that a stream socket does not preserve message boundaries: the other side may have to perform multiple reads to get the results of a single write (because the data may be split across packets), or vice versa (a client may perform two writes and the server may read the data via a single read).
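
Because a single read on a stream socket may return less (or more) than one write's worth of data, code that expects a fixed number of bytes usually loops. The following is a minimal sketch of such a loop; read_fully and demo_split_writes are hypothetical names, and the demo uses a local AF_UNIX socket pair only so that it needs no network.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Read until n bytes have arrived, the peer closes, or an error occurs.
 * Returns the number of bytes read (less than n only at end of file),
 * or -1 on error. (Hypothetical helper.) */
ssize_t read_fully(int fd, void *buf, size_t n)
{
    size_t got = 0;

    assert(buf != NULL);
    while (got < n) {
        ssize_t r = read(fd, (char *)buf + got, n - got);
        if (r == 0)
            break;                  /* end of file: peer closed its side */
        if (r < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: retry */
            return -1;
        }
        got += (size_t)r;
    }
    return (ssize_t)got;
}

/* Demo: two writes on one end are collected by one read_fully call on
 * the other end. Returns 0 if all five bytes arrive intact. */
int demo_split_writes(void)
{
    int sv[2];
    char buf[5];
    int ok;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    ok = write(sv[0], "hel", 3) == 3 &&
         write(sv[0], "lo", 2) == 2 &&
         read_fully(sv[1], buf, sizeof(buf)) == 5 &&
         memcmp(buf, "hello", 5) == 0;
    close(sv[0]);
    close(sv[1]);
    return ok ? 0 : -1;
}
```

The same loop works unchanged on a file, a pipe, or a connected socket, which is the compatibility with file descriptors described above.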

    5. Close the connection

    The shutdown system call may be used to stop all further read and write operations on a socket:

    int shutdown(int socket, int how)

    A close can be used to terminate all communications on a socket as well but shutdown offers more options with the how parameter, which can be set to:

    • 0 (SHUT_RD): you can send but not receive data on this socket
    • 1 (SHUT_WR): you can receive but not send more data on this socket
    • 2 (SHUT_RDWR): you can neither send nor receive more data on this socket
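
The effect of SHUT_WR can be sketched with a local socket pair: one side announces that it is done sending, the peer sees end of file after the queued data, and data can still flow in the other direction. The names finish_sending and half_close_demo are hypothetical, and the AF_UNIX socket pair is used only to keep the example self-contained.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Tell the peer we are done sending while keeping our receive side open.
 * (Hypothetical helper.) */
int finish_sending(int s)
{
    assert(s >= 0);
    return shutdown(s, SHUT_WR);
}

/* Demo of a half-close. Returns 0 if the observed behavior matches the
 * description: queued data arrives, then end of file, and the reverse
 * direction still works. */
int half_close_demo(void)
{
    int sv[2];
    char buf[8];
    ssize_t n1, n2, n3 = -1;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;
    if (write(sv[0], "bye", 3) != 3 || finish_sending(sv[0]) < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    n1 = read(sv[1], buf, sizeof(buf));   /* the data queued before the shutdown */
    n2 = read(sv[1], buf, sizeof(buf));   /* end of file: peer stopped sending */
    if (write(sv[1], "ok", 2) == 2)       /* the other direction is still open */
        n3 = read(sv[0], buf, sizeof(buf));
    close(sv[0]);
    close(sv[1]);
    return (n1 == 3 && n2 == 0 && n3 == 2) ? 0 : -1;
}
```

A close on sv[0] instead of the shutdown would also have signaled end of file to the peer, but the reply could then no longer be received.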

    Synchronous or asynchronous

    System calls for network communication, and for file system access in general, may operate in one of two modes: synchronous or asynchronous. In the synchronous (blocking) mode, socket routines return only when the operation is complete. For example, accept returns only when a connection arrives. In the asynchronous (non-blocking) mode, socket routines return immediately: system calls become non-blocking calls (e.g., read does not block). You can change the mode with the fcntl system call. For example,

    fcntl(s, F_SETFL, FNDELAY);

    sets the socket s to operate in asynchronous mode.
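
A slightly more defensive variant, sketched below, reads the current flags first so that other flags are preserved, and uses the POSIX name O_NONBLOCK in place of the older BSD name FNDELAY. The helper names set_nonblocking and read_would_block are hypothetical. The second helper uses recv with MSG_PEEK so that probing the socket does not consume any data.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Switch descriptor s to non-blocking mode, keeping its other flags.
 * Returns 0 on success, -1 on failure. (Hypothetical helper.) */
int set_nonblocking(int s)
{
    int flags;

    assert(s >= 0);
    flags = fcntl(s, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(s, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* On a non-blocking socket, report whether a read would have blocked:
 * 1 if no data is available yet, 0 if data or end of file is pending,
 * -1 on any other error. (Hypothetical helper.) */
int read_would_block(int s)
{
    char c;
    ssize_t r = recv(s, &c, 1, MSG_PEEK);   /* look without consuming */

    if (r >= 0)
        return 0;
    return (errno == EAGAIN || errno == EWOULDBLOCK) ? 1 : -1;
}
```

In non-blocking mode, a call that cannot complete immediately fails with EAGAIN (or EWOULDBLOCK) instead of putting the caller to sleep, so the caller has to retry later or wait for readiness by other means.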

    Sockets internals

    Figure 1. Logical upward flow of data from a device to a socket.

    Sockets are how an operating system exposes its networking subsystem to applications. Figure 1 shows a logical flow of data through this subsystem (the BSD implementation is used as a guide here). The logical layers include the following:

    Network Interface Layer
    The network interface layer (link layer) is responsible for interfacing with network devices. It is responsible for performing packet encapsulation (wrapping packets within an ethernet packet, for instance) or decapsulation (stripping off an ethernet header). This layer corresponds to the link layer of the OSI reference model. A key difference between network devices and other devices is that they do not appear in the file system (e.g., under /dev). The device itself cannot be addressed via read/write operations. The I/O interface to network devices is packet-based, not the arbitrary byte stream of character devices or fixed-size blocks of block devices.
    Network Layer
    The network layer is responsible for the delivery of data between network devices and higher levels of the networking stack. It needs to be aware of data routing and be able to select the appropriate outbound interface. It corresponds to the network layer of the OSI reference model.
    Transport Layer
    The transport layer maintains an association between a socket and transport layer addressing. For instance, it needs to identify the socket that corresponds to a particular <address, port> tuple for incoming data and generate TCP and UDP headers with appropriate addresses and port numbers for outbound packets.

    Data flows are asynchronous. Incoming packets are received by the network device and passed onto per-protocol queues. The operating system schedules a kernel thread to process operations in these network queues. Processing a queue item may place it into another protocol's queue until the transport-layer interface is known. At that time, the data is sent to a receive queue for the associated socket.

    Figure 2. Network stack

    Within the operating system (Linux is the example here, but others are similar), the implementation of the networking subsystem comprises five layers (figure 2).

    System call interface

    System calls provide the interface between application programs and the operating system. There are two ways to access the networking interface via the system call interface.

    One method is via the several socket-specific system calls (socket, bind, shutdown, etc.). These calls are actually implemented as a single system call that takes a parameter identifying the requested command (sys_socketcall, defined in socket.c). The code in sys_socketcall directs the request to the appropriate function in the kernel.

    The other method is via a file descriptor operation: a system call that accepts file descriptors as parameters (e.g., read, write, close, etc.). Since sockets were designed to be compatible with file descriptors, they reside in the file descriptor table. Sockets are not implemented as a file system and do not live within the Virtual File System (VFS). The distinction between a file and socket descriptor takes place just above the VFS layer. However, there is a direct parallel to the VFS structure in that a socket's f_ops field points to a set of functions that implement the operations that can be performed on the socket. A socket acts as a queuing point for data that is being transmitted and received and has both send and receive queues associated with it. The queues have high watermarks to avoid resource exhaustion: only so much data can be queued before operations will block.

    Generic network interface: sockets layer

    All network communication takes place via a socket. The socket structure (defined in include/net/sock.h in the Linux kernel source) keeps all the state of a socket, including the protocol and the operations that can be performed upon it. Similar to VFS for file systems, this layer provides common functions to support a variety of lower-level protocols (such as TCP, UDP, IP, raw ethernet, and other networks). Each networking protocol has a structure called proto associated with it. This structure defines the socket operations that can be performed from the sockets layer to the transport layer. These include basic operations such as create a socket, establish a connection with a socket, close a socket, etc.
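
The idea of a per-protocol operations structure can be illustrated with a small user-space model. This is a toy sketch, not kernel code: the types toy_proto and toy_sock and every function here are hypothetical and only mimic the role that the proto structure plays, namely letting protocol-independent code call into a specific protocol through function pointers.

```c
#include <assert.h>
#include <stddef.h>

struct toy_sock;

/* Hypothetical operations table, loosely analogous in role to the
 * kernel's per-protocol structure. */
struct toy_proto {
    const char *name;
    int (*connect)(struct toy_sock *sk);
    int (*close)(struct toy_sock *sk);
};

/* Hypothetical socket state: which protocol serves it, and a flag. */
struct toy_sock {
    const struct toy_proto *prot;
    int connected;
};

/* A stream protocol would perform a handshake here; we just set a flag. */
static int stream_connect(struct toy_sock *sk) { sk->connected = 1; return 0; }

/* A datagram protocol only records the peer; no connection exists. */
static int dgram_connect(struct toy_sock *sk)  { sk->connected = 0; return 0; }

static int generic_close(struct toy_sock *sk)  { sk->connected = 0; return 0; }

const struct toy_proto toy_stream = { "stream", stream_connect, generic_close };
const struct toy_proto toy_dgram  = { "dgram",  dgram_connect,  generic_close };

/* Protocol-independent entry points: they dispatch through the table
 * without knowing which protocol is underneath. */
int toy_connect(struct toy_sock *sk)
{
    assert(sk != NULL && sk->prot != NULL);
    return sk->prot->connect(sk);
}

int toy_close(struct toy_sock *sk)
{
    assert(sk != NULL && sk->prot != NULL);
    return sk->prot->close(sk);
}
```

Adding a protocol in this model means supplying another table; the generic entry points do not change.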

    Network protocols

    Network protocols comprise implementations of all the specific protocols available to the system (e.g., TCP, UDP). Each protocol (or family of protocols) is a module. Just like device drivers and file systems, the module may be a part of the bootable kernel or loaded dynamically. Also like other modules, each module is initialized and registered with the system at start-up. For example, the inet_init function for the built-in IP family of protocols calls the proto_register function to register them with the kernel. The proto_register function adds the protocol to the active protocol list and optionally allocates caches and buffers (e.g., TCP needs buffers to store connection state). Additional protocols can be added by calling the kernel function inet_register_protosw.

    The socket buffer: sk_buff

    The core component for managing the flow of a packet between the application and the device is a structure called sk_buff, or the socket buffer (defined in include/linux/skbuff.h in the Linux kernel). This is a kernel data structure that contains the data packet, state, and control data encompassing multiple layers of the protocol stack. It contains fields that point to specific layers in the networking stack. For example, the transport_header contains transport-layer (layer 4; typically TCP or UDP) information; network_header contains network layer (layer 3; typically IP) information; and mac_header contains link layer (layer 2; typically ethernet) information. Packet data is never copied between the layers of the protocol stack; that would be too inefficient. Instead, a pointer to the socket buffer is moved among the various queues of the layers of the stack.

    The socket buffer is created when network data arrives: either from a network device driver or from a user socket-based operation (write, sendto, sendmsg system calls). Each packet that is sent or received is associated with an sk_buff structure. The packet data is kept track of by the sk_buff and is identified by the pointer elements data and tail (start and end of data, respectively). The total allocated packet buffer is pointed to by the head and end elements. The reason for the two sets of pointers is to avoid reallocating and copying data to handle encapsulation. For instance, when we receive a TCP/IP packet from the ethernet interface, it is enveloped by an ethernet MAC header. Within it, we have the IP packet. Within that, we have the TCP segment. Within that, we have the data. Each layer of processing can adjust the data and tail elements to point to the reduced or increased packet that would be of interest to the next layer.
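
The pointer arithmetic can be modeled in user space. The sketch below is a toy, not the kernel's sk_buff: the structure toy_skb and its functions are hypothetical, the buffer size is arbitrary, and the "headers" are short strings. The function names loosely echo the kernel's reserve, put, push, and pull operations on socket buffers.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy user-space model of the four pointers (hypothetical type). */
struct toy_skb {
    unsigned char buf[256];
    unsigned char *head;   /* start of the allocated buffer */
    unsigned char *data;   /* start of the current packet data */
    unsigned char *tail;   /* end of the current packet data */
    unsigned char *end;    /* end of the allocated buffer */
};

/* Start with an empty packet, leaving headroom for headers to be added. */
void toy_init(struct toy_skb *skb, size_t headroom)
{
    assert(headroom <= sizeof(skb->buf));
    skb->head = skb->buf;
    skb->end = skb->buf + sizeof(skb->buf);
    skb->data = skb->tail = skb->head + headroom;
}

/* Append payload: advance tail. */
void toy_put(struct toy_skb *skb, const void *src, size_t len)
{
    assert((size_t)(skb->end - skb->tail) >= len);
    memcpy(skb->tail, src, len);
    skb->tail += len;
}

/* Encapsulate: move data back into the headroom and write a header there. */
void toy_push(struct toy_skb *skb, const void *hdr, size_t len)
{
    assert((size_t)(skb->data - skb->head) >= len);
    skb->data -= len;
    memcpy(skb->data, hdr, len);
}

/* Decapsulate: step data past a header; nothing is copied. */
void toy_pull(struct toy_skb *skb, size_t len)
{
    assert((size_t)(skb->tail - skb->data) >= len);
    skb->data += len;
}

/* Length of the packet as seen by the current layer. */
size_t toy_len(const struct toy_skb *skb)
{
    return (size_t)(skb->tail - skb->data);
}
```

On the sending path, each layer calls toy_push to prepend its header in place; on the receiving path, each layer calls toy_pull so that the next layer sees only its own portion. The payload bytes never move.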

    Sk_buffs are organized as a doubly linked list, so it is easy to move an element from one list to another list. Each sk_buff also identifies the ultimate network device in a structure net_device. The rx_dev element points to the network device that received the packet. The dev element identifies the network device on which the buffer operates. This is often the same as rx_dev but, if a routing decision has been made to a different interface, this contains that outbound interface.

    Abstract device interface

    The abstract device interface is an abstract layer that provides higher-level software with a uniform interface to network devices. It also contains a common set of functions for low-level device drivers to use to interact with the higher-level protocol stack. This layer is defined by a net_device structure. The actual network device driver is implemented underneath this layer.

    Initialization

    The abstract interface contains registration and unregistration functions for the network device (register_netdevice, unregister_netdevice) and an initialization function. The caller creates and populates a net_device structure and passes it for registration. During registration, the kernel calls the structure's init function to perform device-level initialization.

    Sending

    The sending capabilities of the abstract interface handle the sending of data in the sk_buff to the physical device. The layer's dev_queue_xmit function enqueues an sk_buff for transmission to the underlying driver. The device to which the data will be sent is defined in the sk_buff. The device structure contains a method called hard_start_xmit, which is the device driver's function for transmitting the data in the sk_buff to the network.

    Receiving

    When a network device receives a packet, it raises an interrupt. This interrupt is handled by the device driver for that network device. The driver allocates a socket buffer (sk_buff) as well as memory for the packet data (to which the socket buffer will point), which includes the headers and data.

    If the contents are an IP packet, the sk_buff is passed to the network layer with a call to netif_rx, which causes the sk_buff to be placed on a queue for processing by the network layer (IP layer). It will be dequeued when the kernel thread calls netif_rx_schedule.

    Device drivers

    The lowest layer of the networking stack is the set of network device drivers that interact with the physical network. Examples are drivers for ethernet, 802.11b/g/n wireless networks, and SLIP (serial line IP). Upon initialization, the driver allocates a net_device structure and initializes it with its device-specific functions:

    • dev->hard_start_xmit defines how the upper layer should enqueue an sk_buff for transmission to the network. Typically, the packet is moved to a hardware queue and then transmitted.
    • net_rx defines the function used to receive a packet from the hardware interface.

    The device driver module calls the kernel register_netdevice function to make the device available to the networking stack. Unlike block and character devices, network devices do not present themselves as named devices within the file system.

    References