Transmission Control Protocol

Transmission Control Protocol (TCP) provides a connection-based, reliable, byte-stream service to programs. Microsoft networking relies upon the TCP transport for logging on, file and print sharing, replication of information between domain controllers, transfer of browse lists, and other common functions. TCP can only be used for one-to-one communications. TCP uses a checksum on both the headers and data of each segment to reduce the chance of network corruption going undetected.

Size Calculation of the TCP Receive Window

The TCP receive window size is the amount of receive data (in bytes) that can be buffered at one time on a connection. The sending host can send only that amount of data before waiting for an acknowledgment (ACK) and window update from the receiving host.

The TCP/IP stack is designed to tune itself in most environments. Instead of using a hard-coded default receive window size, TCP adjusts the window to even increments of the maximum segment size (MSS) negotiated during connection setup.

Matching the receive window to even increments of the MSS increases the percentage of full-sized TCP segments used during bulk data transmission. The following defaults are used for receive window size: TCPWindowSize = 8K rounded up to the nearest MSS increment for the connection; if that is not at least 4 times the MSS, then it's adjusted to 4 times the MSS, with a maximum size of 64K.
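The default calculation above can be sketched as follows. This is a minimal illustration, not the stack's actual code: the function name is hypothetical, and the 64K ceiling is assumed to mean 65,535 bytes, the largest value a 16-bit field can hold.

```python
import math

MAX_WINDOW = 65535  # assumption: "64K" capped at the 16-bit field maximum
BASE_WINDOW = 8192  # the 8K starting point named in the text


def default_receive_window(mss):
    """Return the default receive window for a connection with the given MSS."""
    # Round 8K up to a whole number of full-sized segments.
    window = math.ceil(BASE_WINDOW / mss) * mss
    # The window must hold at least four full-sized segments.
    window = max(window, 4 * mss)
    # Never exceed what the TCP header can advertise.
    return min(window, MAX_WINDOW)


print(default_receive_window(1460))  # Ethernet MSS -> 8760
```

With an MSS of 1460 bytes the result is six full segments, which is why rounding to MSS increments keeps bulk transfers on full-sized segments.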

Note

The maximum window size is 64K because the window field in the TCP header is 16 bits in length. RFC 1323 describes a TCP window scale option that can be used to obtain larger receive windows; however, Windows NT TCP/IP does not yet implement that option.

For Ethernet, the window will normally be set to 8760 bytes (8192 rounded up to six 1460-byte segments); for 16/4 Token Ring or FDDI, it will be around 16K. These are default values and it's not generally advisable to alter them; however, you can either change the registry parameter TcpWindowSize to globally change the setting for the computer, or use the setsockopt() Windows Sockets call to change the setting on a per-socket basis.
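A per-socket change might look like the following sketch, written with Python's socket module rather than the C Windows Sockets API. It assumes SO_RCVBUF is the relevant option (the text does not name it), and the helper name is hypothetical. The operating system may round or otherwise adjust the requested size.

```python
import socket


def open_socket_with_receive_buffer(size_bytes):
    """Create a TCP socket and request a specific receive buffer size."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before connecting so the window offered at setup reflects it.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, size_bytes)
    return sock


sock = open_socket_with_receive_buffer(17520)  # twelve 1460-byte segments
# Read the value back; it may differ from the request on some systems.
print(sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))
sock.close()
```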

Delayed Acknowledgments

Per RFC 1122, TCP uses delayed acknowledgments to reduce the number of packets sent on the media. The Microsoft stack takes a common approach to implementing delayed acknowledgments. The following conditions cause an acknowledgment to be sent as data is received by TCP on a given connection:

In summary, normally an ACK is sent for every other TCP segment received on a connection, unless the delayed ACK timer (200ms) expires. There is no configuration parameter to disable delayed ACKs.
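The summary rule can be illustrated with a small simulation. This is a simplified sketch under stated assumptions: the timer is modeled as starting when an unacknowledged segment arrives, times are in whole milliseconds, and the function name is hypothetical. A real stack has further conditions that are not modeled here.

```python
DELAYED_ACK_TIMEOUT_MS = 200  # delayed ACK timer from the text


def ack_times(arrivals_ms):
    """Return the times (ms) at which ACKs are sent for the given segment arrivals."""
    acks = []
    pending_since = None  # arrival time of a segment still waiting for its ACK
    for t in arrivals_ms:
        # If the timer ran out before this segment arrived, the ACK already went out.
        if pending_since is not None and t - pending_since >= DELAYED_ACK_TIMEOUT_MS:
            acks.append(pending_since + DELAYED_ACK_TIMEOUT_MS)
            pending_since = None
        if pending_since is None:
            pending_since = t  # first segment: hold the ACK
        else:
            acks.append(t)  # second segment: ACK immediately
            pending_since = None
    if pending_since is not None:
        acks.append(pending_since + DELAYED_ACK_TIMEOUT_MS)
    return acks


print(ack_times([0, 10, 20, 500]))  # [10, 220, 700]
```

The first two segments share one ACK; the third and fourth each wait out the 200 ms timer because no second segment arrives in time.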

PMTU Discovery

RFC 1191 describes PMTU discovery. When a connection is established, the two hosts involved exchange their TCP MSS values. The smaller of the two MSS values is used for the connection. The MSS for a computer is usually the MTU at the link layer minus 40 bytes for the IP and TCP headers.

When TCP segments are destined to a non-local network, the "don't fragment" bit is set in the IP header. Any router or media along the path may have an MTU that differs from that of the two hosts.

If a medium is encountered with an MTU that is too small for the IP datagram being routed, the router will attempt to fragment the datagram accordingly. Upon attempting to do so, it will find that the "don't fragment" bit in the IP header is set. At this point, the router should inform the sending host with an ICMP destination unreachable message that the datagram can't be forwarded further without fragmentation. Most routers will also specify the MTU that is allowed for the next hop by putting the value for it in the low-order 16 bits of the ICMP header field that is labeled "unused" in the ICMP specification. See RFC 1191, section 4, for the format of this message.

Upon receiving this ICMP error message, TCP adjusts its MSS for the connection to the specified MTU minus the TCP and IP header size, so that any further packets sent on the connection will be no larger than the maximum size that can traverse the path without fragmentation. The minimum MTU permitted by RFCs is 68 bytes, and this limit is enforced by Windows NT TCP.
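The MSS arithmetic described in this section can be sketched as follows. The function names are hypothetical, and the 40-byte figure assumes IP and TCP headers without options, as in the text.

```python
IP_TCP_HEADERS = 40  # 20-byte IP header + 20-byte TCP header, no options
MIN_MTU = 68         # smallest MTU the text says is enforced


def mss_for_mtu(mtu):
    """MSS that fits in one datagram of the given MTU."""
    return mtu - IP_TCP_HEADERS


def negotiated_mss(local_mtu, remote_mtu):
    """The smaller of the two hosts' MSS values is used for the connection."""
    return min(mss_for_mtu(local_mtu), mss_for_mtu(remote_mtu))


def mss_after_frag_needed(current_mss, next_hop_mtu):
    """Lower the MSS after an ICMP error that reports the next-hop MTU."""
    mtu = max(next_hop_mtu, MIN_MTU)  # ignore reported values below the minimum
    return min(current_mss, mss_for_mtu(mtu))


print(negotiated_mss(4352, 1500))         # FDDI host to Ethernet host -> 1460
print(mss_after_frag_needed(1460, 1479))  # router reports MTU 1479 -> 1439
```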

Some non-compliant routers may silently drop IP datagrams that cannot be fragmented, or may not correctly report their next-hop MTU. If this occurs, it may be necessary to make a configuration change to the PMTU detection algorithm. There are two registry changes that can be made to the TCP/IP stack to find and correct errors caused by these problematic routers:

The PMTU between two computers can be discovered by manually using ping with the -f (do not fragment) switch as follows:


ping -f -n <number of pings> -l <size> <destination ip address>

In the preceding example, the size parameter can be varied until the MTU is found. Note that the size parameter used by ping is the size of the data buffer to send, not including headers. The ICMP header consumes 8 bytes, and the IP header would normally be 20 bytes. In the following case (Ethernet), the link layer MTU is the maximum-sized ping buffer plus 28, or 1500 bytes:


C:\temp>ping -f -n 1 -l 1472 172.16.48.03
Pinging 172.16.48.03 with 1472 bytes of data:
Reply from 172.16.48.03: bytes=1472 time<10ms TTL=30

C:\temp>ping -f -n 1 -l 1473 172.16.48.03
Pinging 172.16.48.03 with 1473 bytes of data:
Packet needs to be fragmented but DF set

In the preceding example, the router returned an ICMP error message which ping interpreted for us. If the router had been a "black hole" router, the ping would simply not be answered once its size exceeded the MTU that the router could handle. Ping can be used in this manner to detect such a router.
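Varying the size by hand amounts to a search for the largest payload that is still answered. The sketch below automates that search with a binary search. The probe argument is a hypothetical callable standing in for one ping -f attempt; it returns True when a ping of that payload size is answered. No real ping is issued here.

```python
ICMP_HEADER = 8   # bytes
IP_HEADER = 20    # bytes, assuming no IP options
MAX_PAYLOAD = 65507  # 65535 minus the 28 bytes of headers


def find_path_mtu(probe, low=0, high=MAX_PAYLOAD):
    """Return the path MTU, or None if even the smallest probe fails."""
    if not probe(low):
        return None
    # Invariant: probe(low) succeeds; everything above high fails.
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    # The ping size excludes headers, so add them back to get the MTU.
    return low + ICMP_HEADER + IP_HEADER


# Simulated Ethernet path: payloads up to 1472 bytes get through.
print(find_path_mtu(lambda size: size <= 1472))  # 1500
```

Because a "black hole" router produces silence rather than an error, the probe should treat a timeout the same as an explicit failure.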

A sample ICMP destination unreachable error message is as follows:


+ FRAME: Base frame properties
+ FDDI: Length = 77
+ LLC: UI DSAP=0xAA SSAP=0xAA C
+ SNAP: ETYPE = 0x0800
+ IP: ID = 0x0; Proto = ICMP; Len: 56
ICMP: Destination Unreachable, Destination: 172.16.112.125
ICMP: Packet Type = Destination Unreachable
ICMP: Unreachable Code = Fragmentation Needed, DF Flag Set
ICMP: CheckSum = 0x8ABF
ICMP: Data: Number of data bytes remaining = 28 (0x001C)
00000: 50 00 60 8C 14 C7 0E 00 00 0C 1A EB C0 AA AA 03
00010: 00 00 00 08 00 45 00 00 38 00 00 00 00 FF 01 D3
00020: 36 C7 C7 2C 01 C7 C7 2C FE 03 04 8A BF 00 00 05
00030: C7 45 00 05 F8 55 24 40 00 1F 01 1B D7 C7 C7 2C
00040: FE C7 C7 28 7D 08 00 00 75 01 00 63 00

Network Monitor did not parse the MTU suggestion in this frame, but it appears in the hex portion of the trace as the bytes 05 C7, which follow the checksum (8A BF) and two zero bytes. This error was generated by using ping -f -l 2000 on an FDDI-based host to send a large datagram through a router to an Ethernet host. When the router tried to place the large frame onto the Ethernet segment, it found that fragmentation was not allowed, and so it returned the error message indicating that the largest datagram that could be forwarded is 0x5c7, or 1479 bytes.
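The next-hop MTU can be extracted from the first eight bytes of the ICMP message, following the RFC 1191 layout: type, code, checksum, 16 unused bits, then the 16-bit MTU. The sketch below applies that layout to the ICMP header bytes from the trace above; the function name is hypothetical.

```python
import struct

ICMP_DEST_UNREACHABLE = 3
ICMP_CODE_FRAG_NEEDED = 4


def parse_frag_needed(icmp_header):
    """Return the next-hop MTU from a 'fragmentation needed' message, else None."""
    # Network byte order: type (1), code (1), checksum (2), unused (2), MTU (2).
    icmp_type, code, _checksum, _unused, next_hop_mtu = struct.unpack(
        "!BBHHH", icmp_header[:8]
    )
    if icmp_type != ICMP_DEST_UNREACHABLE or code != ICMP_CODE_FRAG_NEEDED:
        return None
    return next_hop_mtu


# ICMP header bytes taken from the trace: 03 04 8A BF 00 00 05 C7
header = bytes.fromhex("03048ABF000005C7")
print(parse_frag_needed(header))  # 1479
```

A returned value of 0 would indicate a router that does not report its next-hop MTU.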

Dead Gateway Detection

Microsoft TCP/IP provides dead gateway detection. Dead gateway detection allows TCP to detect failure of the default gateway and to make an adjustment to the IP routing table to use another default gateway.

Dead gateways are detected by using TCP retries. The Microsoft TCP/IP stack uses the triggered reselection method as described in RFC 816.

TCP will attempt to send a packet to the default gateway configured on a computer until it receives an acknowledgment or until one-half of the TcpMaxDataRetransmissions registry parameter is reached. If no response is received from the default gateway and multiple gateways are configured on the computer, TCP requests that IP switch to the next default gateway in the list.
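The switching rule can be sketched as a small state machine. This is an illustration only: the class and method names are hypothetical, and two details are assumptions not stated in the text, namely that "one-half" is rounded down and that the list wraps around to the first gateway after the last.

```python
class GatewaySelector:
    """Tracks retransmission failures and advances through the gateway list."""

    def __init__(self, gateways, max_data_retransmissions=5):
        self.gateways = list(gateways)
        self.current = 0
        self.failures = 0
        # Assumption: half of TcpMaxDataRetransmissions, rounded down, at least 1.
        self.threshold = max(1, max_data_retransmissions // 2)

    @property
    def gateway(self):
        return self.gateways[self.current]

    def on_ack(self):
        """An acknowledgment arrived, so the current gateway is working."""
        self.failures = 0

    def on_retransmit_timeout(self):
        """Record one unanswered retransmission; switch gateways if needed."""
        self.failures += 1
        if self.failures >= self.threshold and len(self.gateways) > 1:
            # Assumption: wrap to the first gateway after the last one.
            self.current = (self.current + 1) % len(self.gateways)
            self.failures = 0
        return self.gateway


selector = GatewaySelector(["172.16.0.1", "172.16.0.2"])
print(selector.on_retransmit_timeout())  # 172.16.0.1
print(selector.on_retransmit_timeout())  # 172.16.0.2
```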

Note

If the computer running Windows NT Server or Windows NT Workstation is a DHCP client, the default gateway is automatically configured on the computer.

To add additional default gateways or to configure gateways for non-DHCP configured computers

1. Click Start, point to Settings, and click Control Panel.

2. Double-click Network, and then click the Protocol tab.

3. Under Network Protocols, click TCP/IP, and then click Properties.

4. If necessary, click the IP Address tab, and then click Advanced.

5. You can add additional gateways under Gateway in the Advanced IP Addressing dialog box.

IP utilities such as ping do not trigger the dead gateway detection process. They use the current default gateway. If TCP detects a dead gateway and selects a new one, the IP utilities will then function using the new gateway. By default, dead gateway detection is set to "on" when you configure a computer running under Windows NT with the IP address of more than one gateway.

Retransmission Behavior

TCP starts a retransmission timer when each outbound segment is handed down to IP. If no acknowledgment has been received for the data in a given segment before the timer expires, then the segment is retransmitted, up to the value of the TcpMaxDataRetransmissions registry parameter. The default value for this parameter is 5.

The retransmission timer is initialized to three seconds when a TCP connection is established; however, it is adjusted "on the fly" to match the characteristics of the connection using smoothed round trip time (SRTT) calculations as described in RFC 793. The timer for a given segment is doubled after each retransmission of that segment. Using this algorithm, TCP tunes itself to the "normal" delay of a connection. TCP connections over high-delay links will take much longer to time out than those over low-delay links.

Note

Adding 1 to the registry parameter TcpMaxDataRetransmissions approximately doubles the total retransmission time-out period for all connections.

The following trace clip shows the retransmission algorithm for two hosts connected over Ethernet on the same subnet. An FTP file transfer was in progress when the receiving host was disconnected from the network. Because the SRTT for this connection was very small, the first retransmission was sent after about one-half second. The timer was then doubled for each of the retransmissions that followed. After the fifth retransmission, the timer is once again doubled, and if no acknowledgment is received before it expires, the transfer is aborted.

Delta   Source Ip      Dest Ip         Pro   Flags    Description
0.000   172.16.90.32   172.16.80.138   TCP   .A....   len: 1460, seq: 8043781, ack: 8153124, win: 8760
0.521   172.16.90.32   172.16.80.138   TCP   .A....   len: 1460, seq: 8043781, ack: 8153124, win: 8760
1.001   172.16.90.32   172.16.80.138   TCP   .A....   len: 1460, seq: 8043781, ack: 8153124, win: 8760
2.003   172.16.90.32   172.16.80.138   TCP   .A....   len: 1460, seq: 8043781, ack: 8153124, win: 8760
4.007   172.16.90.32   172.16.80.138   TCP   .A....   len: 1460, seq: 8043781, ack: 8153124, win: 8760
8.130   172.16.90.32   172.16.80.138   TCP   .A....   len: 1460, seq: 8043781, ack: 8153124, win: 8760
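The doubling schedule seen in the trace can be reproduced with a few lines of arithmetic. This sketch assumes an idealized initial timer of exactly 0.5 seconds (the trace shows about 0.521) and uses hypothetical function names.

```python
def retransmission_timeouts(initial_rto, max_retransmissions=5):
    """Timer values in seconds: one for the original send, one per retransmission."""
    return [initial_rto * 2 ** i for i in range(max_retransmissions + 1)]


def time_until_abort(initial_rto, max_retransmissions=5):
    """Total time from the first send until the connection is abandoned."""
    return sum(retransmission_timeouts(initial_rto, max_retransmissions))


print(retransmission_timeouts(0.5))  # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
print(time_until_abort(0.5))         # 31.5
print(time_until_abort(0.5, 6))      # 63.5
```

The last two results show why adding 1 to TcpMaxDataRetransmissions roughly doubles the total time-out period: each extra retransmission adds a timer about as long as all the previous ones combined.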


TCP Keepalive Messages

A TCP keepalive packet is simply an ACK with the sequence number set to one less than the current sequence number for the connection. A computer receiving one of these ACKs should respond with an ACK for the current sequence number. Keepalives can be used to verify that the computer at the remote end of a connection is still available. TCP keepalives can be sent once every KeepAliveTime (defaults to 7,200,000 milliseconds or two hours), if no other data or higher level keepalives have been carried over the TCP connection. If there is no response to a keepalive, it is repeated once every KeepAliveInterval seconds. KeepAliveInterval defaults to one second. NetBT connections, such as those used by many Microsoft networking components, send NetBIOS keepalives more frequently, and so normally no TCP keepalives will be sent on a NetBIOS connection. TCP keepalives are disabled by default, but Windows Sockets programs may enable them using setsockopt().
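Enabling keepalives on a socket might look like the following sketch, written with Python's socket module rather than the C Windows Sockets API. It assumes SO_KEEPALIVE is the option in question (the text does not name it), and the helper name is hypothetical. The timing values still come from the system-wide KeepAliveTime and KeepAliveInterval settings.

```python
import socket


def open_socket_with_keepalive():
    """Create a TCP socket with keepalive probing turned on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Keepalives are off by default; a nonzero value turns them on.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    return sock


sock = open_socket_with_keepalive()
enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
print("keepalive enabled:", enabled)
sock.close()
```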

Slow Start Algorithm and Congestion Avoidance

When a connection is initially established, TCP starts sending at a slow rate to assess the bandwidth of the connection and to avoid overflowing the receiving host or any other devices or links in the path. The send window is initially set to two TCP segments.

If those segments are acknowledged, the send window is incremented, and so on, until the amount of data being sent per burst reaches the size of the receive window on the remote host. At that point, the slow start algorithm is no longer in use and flow control is governed by the receive window on the remote host.

However, at any time during transmission, congestion could still occur on a connection. If this happens (evidenced by the need to retransmit), a congestion avoidance algorithm is used to reduce the send window size temporarily, and then to slowly increment the send window back towards the receive window size.
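The growth of the send window during slow start can be sketched as follows. The text does not give the increment, so this example assumes the classic behavior of one additional segment per acknowledgment, which doubles the window every round trip. It models slow start only, not the congestion avoidance phase, and the function name is hypothetical.

```python
def slow_start_windows(mss, receive_window, initial_segments=2):
    """Send window (bytes) for each round trip until the receive window is reached."""
    windows = []
    window = initial_segments * mss
    while window < receive_window:
        windows.append(window)
        # Assumption: every ACKed segment adds one segment, doubling per round trip.
        window *= 2
    # From here on, the remote receive window governs flow control.
    windows.append(receive_window)
    return windows


print(slow_start_windows(1460, 8760))  # [2920, 5840, 8760]
```

On an Ethernet connection with the default 8760-byte receive window, the sender reaches the full window by the third round trip under these assumptions.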

Note

Slow start and congestion avoidance are discussed in RFC 1122.

Silly Window Syndrome

Silly Window Syndrome (SWS) is described in RFC 1122 as follows:

In brief, SWS is caused by the receiver advancing the right window edge whenever it has any new buffer space available to receive data and by the sender using any incremental window, no matter how small, to send more data [TCP:5]. The result can be a stable pattern of sending tiny data segments, even though both sender and receiver have a large total buffer space for the connection.

TCP/IP for Windows NT implements SWS avoidance per RFC 1122 by not sending more data until there is a sufficient window size advertised by the receiving end to send a full segment. It also implements SWS avoidance on the receive end of a connection by not opening the receive window in increments of less than a TCP segment.
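Both halves of SWS avoidance reduce to a comparison against the segment size. The sketch below is a simplification with hypothetical function names; RFC 1122 describes additional conditions and thresholds that are not modeled here.

```python
def sender_may_send(advertised_window, bytes_in_flight, mss):
    """Sender side: wait until the usable window can take a full segment."""
    usable = advertised_window - bytes_in_flight
    return usable >= mss


def window_to_advertise(current_advertised, free_buffer, mss):
    """Receiver side: only open the window by at least one full segment."""
    if free_buffer - current_advertised >= mss:
        return free_buffer
    return current_advertised  # keep the old value rather than open a sliver


print(sender_may_send(8760, 8000, 1460))    # False: only 760 bytes usable
print(window_to_advertise(0, 500, 1460))    # 0: 500 bytes is too small to offer
print(window_to_advertise(0, 1460, 1460))   # 1460
```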

Nagle Algorithm

TCP/IP for Windows NT Server and Windows NT Workstation implements the Nagle algorithm described in RFC 896. The purpose of this algorithm is to reduce the number of "tiny" segments sent, especially on high-delay (remote) links. The Nagle algorithm allows only one small segment to be outstanding at a time without acknowledgment. If more small segments are generated while awaiting the ACK for the first one, then these segments are coalesced into one larger segment. Any full-sized segment is always transmitted immediately, assuming there is a sufficient receive window available. The Nagle algorithm is effective in reducing the number of packets sent by interactive programs, such as Telnet, especially over slow links.
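The coalescing rule can be illustrated with a small simulation of single-character writes. This is a sketch under simplifying assumptions: every write is one byte, the acknowledgment for each segment arrives exactly one round-trip time after it is sent, times are in whole milliseconds, and the function name is hypothetical.

```python
def nagle_segments(key_times_ms, rtt_ms):
    """Return (send_time_ms, char_count) for each segment sent under the Nagle rule."""
    segments = []
    ack_due = None  # when the outstanding segment's ACK arrives; None if idle
    buffered = 0    # characters held back while a small segment is outstanding
    for t in key_times_ms:
        # Process any ACK that arrived before this keystroke.
        while ack_due is not None and ack_due <= t:
            if buffered:
                # The ACK releases everything saved up as one segment.
                segments.append((ack_due, buffered))
                ack_due += rtt_ms
                buffered = 0
            else:
                ack_due = None
        if ack_due is None:
            segments.append((t, 1))  # nothing outstanding: send at once
            ack_due = t + rtt_ms
        else:
            buffered += 1  # one small segment is already outstanding
    if buffered:
        segments.append((ack_due, buffered))
    return segments


# Ten keystrokes 30 ms apart over a link with a 100 ms round trip.
print(nagle_segments(list(range(0, 300, 30)), 100))
# [(0, 1), (100, 3), (200, 3), (300, 3)]
```

Ten characters travel in four segments instead of ten, and the saving grows as the round-trip time increases relative to the typing rate.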

The following trace, captured by using Microsoft Network Monitor, shows the Nagle algorithm at work. The trace was captured while using PPP to dial up an Internet provider at 9600 bps. A Telnet (character-mode) session was established, and then the "y" key was held down on the Windows NT Workstation. At all times, only one segment was outstanding, and further "y" characters were held by the stack until an acknowledgment was received for the previous segment. In this example, three to four "y" characters were saved up each time and sent together in one segment. The Nagle algorithm resulted in a large savings in the number of packets sent: the count was reduced by a factor of about three.

Source IP       Dest IP         Prot     Description
172.16.16.243   172.16.144.0    TELNET   To Server From Port = 1901
172.16.144.0    172.16.16.243   TELNET   To Client With Port = 1901
172.16.16.243   172.16.144.0    TELNET   To Server From Port = 1901
172.16.144.0    172.16.16.243   TELNET   To Client With Port = 1901
172.16.16.243   172.16.144.0    TELNET   To Server From Port = 1901
172.16.144.0    172.16.16.243   TELNET   To Client With Port = 1901


Each segment contained several of the "y" characters. Following is the first segment shown more fully parsed, and the data portion is pointed out in the hex at the bottom.

Time    Source IP     Dest IP        Prot     Description
0.644   172.16.48.1   172.16.112.0   TELNET   To Server From Port = 1901

+ FRAME: Base frame properties
+ ETHERNET: ETYPE = 0x0800 : Protocol = IP: DOD Internet Protocol
+ IP: ID = 0xEA83; Proto = TCP; Len: 43
+ TCP: .AP..., len: 3, seq:1032660278, ack: 353339017, win: 7766, src: 1901 dst: 23 (TELNET)
TELNET: To Server From Port = 1901
TELNET: Telnet Data

D2 41 53 48 00 00 52 41 53 48 00 00 08 00 45 00  .ASH..RASH....E.
00 2B EA 83 40 00 20 06 F5 85 CC B6 42 53 C7 B5  .+..@. .....BS..
A4 04 07 6D 00 17 3D 8D 25 36 15 0F 86 89 50 18  ...m..=.%6....P.
1E 56 1E 56 00 00 79 79 79                       .V.V..yyy
                                                       ^^^ data

Windows Sockets programs can disable the Nagle algorithm for their connection(s) by setting the TCP_NODELAY socket option. However, this practice should be avoided unless absolutely necessary because it increases network usage. Some network programs may not perform well if their design does not take into account the effects of transmitting large numbers of small packets and the Nagle algorithm.
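Setting the option might look like the following sketch, written with Python's socket module rather than the C Windows Sockets API. The helper name is hypothetical; TCP_NODELAY itself is the option named above.

```python
import socket


def open_socket_without_nagle():
    """Create a TCP socket with the Nagle algorithm disabled."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # A nonzero value sends small segments immediately instead of coalescing them.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


sock = open_socket_without_nagle()
disabled = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
print("Nagle disabled:", disabled)
sock.close()
```

A program that sets this option takes on the job of batching its own writes; otherwise each small write can become its own packet.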

Throughput Considerations

TCP is designed to provide optimum performance over varying link conditions. Actual throughput for a link is dependent on a number of variables, but the most important factors are:

TCP throughput calculation is discussed in detail in Chapters 20 through 24 of TCP/IP Illustrated, by W. Richard Stevens. The following are some key considerations:

To summarize, Windows NT TCP/IP will adapt to most network conditions and dynamically provide the best throughput and reliability possible on a per-connection basis. Attempts at manual tuning are often counter-productive unless a qualified network engineer performs careful study of data flow.