[Transport Layer] TCP — three handshakes, four waves.

Date: 2024-04-09

Timeout Retransmission Mechanism


  • After host A sends data to host B, the data may fail to reach B because of network congestion or other reasons;
  • If host A does not receive an acknowledgement from B within a specific time interval, it retransmits the data.

How does the sender determine that a packet has been lost?
In fact, the sender cannot know for certain whether the packet was actually lost. Instead, a policy is set: once the timeout expires, the packet is treated as lost.

However, host A may also fail to receive an acknowledgement from B because the ACK itself was lost. In that case host B will receive a lot of duplicate data (which is also a form of unreliability), so the TCP protocol must be able to recognize duplicate packets and discard them. The previously mentioned sequence numbers make this de-duplication easy.

Thinking point: data that has been sent out is not removed immediately, as one might expect; it must be kept around for a period of time, maintained in the send window or send buffer, to support retransmission.

How is the timeout interval determined?

  • Ideally, a minimum time is found that guarantees that “the acknowledgement answer will definitely be returned within this time”.
  • But the length of this time varies with the network environment.
  • If the timeout is set too long, it will affect the overall retransmission efficiency;
  • If the timeout is set too short, it is possible that duplicate packets will be sent frequently;

TCP dynamically calculates this maximum timeout in order to ensure relatively high-performance communication regardless of the environment.

  • In Linux (and also BSD Unix and Windows), timeouts are controlled in units of 500ms, so the retransmission timeout is always an integer multiple of 500ms.
  • If there is still no answer after one retransmission, wait 2*500ms before retransmitting again.
  • If there is still no answer, wait 4*500ms before retransmitting, and so on, increasing exponentially.
  • After a certain number of retransmissions have accumulated, TCP concludes that the network or the peer host is abnormal and forcibly closes the connection.
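The doubling schedule above can be sketched as follows. This is a toy model: real kernels derive the retransmission timeout from the measured RTT rather than a fixed 500ms grid, so both constants here are illustrative.

```python
BASE_MS = 500        # base unit from the text above (illustrative)
MAX_RETRIES = 5      # hypothetical give-up threshold

def retransmit_waits(max_retries=MAX_RETRIES):
    """Wait before each retransmission: 500ms, 2*500ms, 4*500ms, ..."""
    return [BASE_MS * (2 ** i) for i in range(max_retries)]
```

With five retries the waits are 500, 1000, 2000, 4000, and 8000 ms; after the last one, TCP would give up and close the connection.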

Connection Management Mechanism

Under normal circumstances, TCP goes through three handshakes to establish a connection and four waves to close it


Three Handshakes

Why three handshakes?

  • One handshake: with only one handshake, the establishment of the connection cannot be confirmed, neither party's ability to send and receive can be verified, and no reliable connection can be established; it also makes SYN flooding trivial
  • Two handshakes: with only two handshakes, one party's ability to send and receive can be confirmed, but the other party's receiving ability cannot be. The two sides may then send data based on inconsistent state, leading to desynchronization and data loss; it is also vulnerable to SYN flooding

Why not four handshakes?

  • Reliability: TCP’s three handshakes are sufficient to ensure a reliable connection. With three handshakes, the client and the server can ensure that each other’s ability to send and receive is normal and in a state of synchronization. In the third handshake, the server has acknowledged the client’s connection request and is ready to receive data.
  • Efficiency and Latency: Increasing the number of handshakes introduces additional latency and overhead. Each handshake requires a back and forth exchange of data and acknowledgements, increasing the time it takes for the connection to be established. Since the design goal of TCP is to provide efficient data transfer, three handshakes are considered a reasonable compromise that ensures reliability while minimizing the number of handshakes.
  • Avoiding redundant or invalid connections: expired connection requests and invalid connection establishment can be prevented by three handshakes. If the number of handshakes increases, it may lead to more invalid connections, wasting network and server resources

The three-way handshake is the minimum cost of verifying that the full-duplex communication channel is clear, and it effectively prevents a single machine from attacking the server. (A server coming under attack is not, in itself, something the TCP handshake is supposed to solve; but if the handshake had an obvious loophole, that would be a handshake problem.)
The three handshakes do not have to succeed; the biggest concern is losing the last ACK, but TCP has a matching remedy for that.
Connections must be managed by the OS: first described (as a data structure), then organized. Maintaining a connection has a cost in both time and space.
(The three-way handshake can also be viewed as four, because the server could send the SYN and ACK of its answer separately; but since they are different flag bits, they can be sent together in one segment.)

SYN Flood is a network attack designed to overload the resources of a target server by sending a large number of spoofed TCP connection requests (SYN packets), preventing it from properly processing legitimate connection requests.
A SYN flood exploits the three-way handshake of the TCP protocol. The attacker sends a large number of SYN packets with forged source IP addresses to the target server; on receiving each SYN, the server allocates resources for the connection request, including memory and connection-table entries. But because the source IP address is spoofed, the server never receives the subsequent ACK it is waiting for to complete connection establishment, and the resources are wasted.

Why connect at all? Because of reliability.
The TCP connection itself does not directly guarantee reliability. When TCP establishes a connection, how does it know which messages were lost? What state it is in (communicating or disconnected)? Which messages need retransmission, and how long until the next retransmission? How many messages the peer has received and sent? All of these properties are maintained in the connection structure. The three-way handshake is the mechanism by which both sides reach consensus on that structure, and it is the connection structure that implements strategies such as timeout retransmission and flow control. So the connection structure is the data structure underlying data reliability, and the three-way handshake is the basis for creating that structure; in that sense, the three handshakes guarantee reliability indirectly.

Four Waves


  • The party that actively initiates the disconnection ends up in the TIME_WAIT state
  • The party that is passively disconnected enters the CLOSE_WAIT state after the first two waves are complete (this is independent of whether that party is the server or the client, since TCP is a peer-to-peer protocol)
  • Once all four waves are complete, the actively disconnecting party still maintains the TIME_WAIT state for a period of time

If our server has a large number of connections in the CLOSE_WAIT state, the likely causes are:

  1. The server has a bug and never closes the file descriptor
  2. The server is under pressure and may be busy pushing messages to clients, leaving it no time to close

Why does the actively disconnecting party maintain TIME_WAIT for a period of time?
It is generally maintained for 2*MSL, in order to:

  1. Ensure that the last ACK is received by the other side as much as possible
  2. When both parties disconnect, there are still stranded messages in the network, in order to ensure that the stranded messages are dissipated (reason given in the textbook)

Sometimes the server can be restarted immediately; sometimes it cannot, because of a bind error

When the server actively disconnects, it has to maintain the TIME_WAIT state; while that state is maintained, the port and the connection still exist, so the port cannot be bound again

Solution for bind failures caused by TIME_WAIT state

  • Not being allowed to listen again until the server's TCP connection has completely disappeared may be unreasonable for a server that must handle a very large number of client connections (each connection may be short-lived, but enormous numbers of clients may be making requests every second).
  • If connections are closed by the server side (for example, inactive clients that the server proactively cleans up), a large number of TIME_WAIT connections is generated.
  • With a large number of requests, this means a large number of TIME_WAIT connections, each occupying a communication five-tuple (source IP, source port, destination IP, destination port, protocol). The server's IP, port, and protocol are fixed, so if a new incoming client connection's IP and port duplicate a five-tuple still occupied by a TIME_WAIT connection, a problem occurs.

Solution: Use setsockopt() to set the socket descriptor option SO_REUSEADDR to 1, which allows the creation of multiple socket descriptors with the same port number but different IP addresses.
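A minimal sketch of that fix in Python (the option name SO_REUSEADDR is real; the helper function and its defaults are hypothetical):

```python
import socket

def make_listener(host="127.0.0.1", port=0, backlog=8):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set SO_REUSEADDR *before* bind, so a restarted server can rebind a
    # port whose previous connection is still in TIME_WAIT.
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((host, port))      # port=0 lets the OS pick a free port
    s.listen(backlog)
    return s
```

The C equivalent is exactly the setsockopt() call mentioned above, issued between socket() and bind().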

Sliding Window

Under the basic acknowledgement strategy, an ACK is given for each data segment sent, and the next segment is sent only after the ACK is received. This has a major drawback: poor performance, especially when the round-trip time is long


Since this one-send-one-receive approach has low performance, we can greatly improve performance by sending multiple pieces of data at once (actually overlapping the wait times of multiple segments)


How does the sender know the other side's receiving capacity the first time around? The three-way handshake happens long before communication and exchanges window sizes
If we send data and do not receive an answer, we must keep the sent data temporarily, to support timeout retransmission

  • The window size is the maximum amount of data that can be sent without waiting for an acknowledgement. The window size in the figure above is 4000 bytes (four segments).
  • Send the first four segments without waiting for any ACK, just send them;
  • After the first ACK is received, the sliding window moves forward and the fifth segment is sent; and so on;
  • To maintain this sliding window, the operating system kernel needs a send buffer that records which data is still unacknowledged; only acknowledged data can be deleted from the buffer;
  • The larger the window, the higher the network throughput
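The bullet points above can be modelled with a toy sender-side window (a sketch only; the byte counts and method names are invented for illustration):

```python
class SendWindow:
    """Toy model: data stays buffered until a cumulative ACK covers it,
    and at most `win` bytes may be in flight (sent but unacknowledged)."""
    def __init__(self, win):
        self.win = win    # window size in bytes
        self.base = 0     # lowest unacknowledged byte
        self.nxt = 0      # next byte to send

    def can_send(self, n):
        return self.nxt + n - self.base <= self.win

    def send(self, n):
        assert self.can_send(n)
        self.nxt += n

    def ack(self, ackno):
        # Cumulative ACK: everything before `ackno` is confirmed,
        # so the window slides forward and buffer space is freed.
        self.base = max(self.base, ackno)
```

With win=4000 you can fire off four 1000-byte segments back to back; the fifth must wait until the first ACK slides the window forward.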


How is the initial size of the window set? How does it change afterwards?
For now: the size of the sliding window is tied to the other side's receiving capacity.
win_start = 0; win_end = win_start + tcp_win. However the window slides afterwards, it must ensure that the other side can receive normally.
Sliding window size = the receiving capacity that the other party reports to me [for now]

Definition of the acknowledgement number: an ACK of X+1 indicates that all data before X+1 has been received in full. This also supports the sliding rule of our sliding window

If segment 1000 has been lost, we certainly cannot return ACK 2000, because returning 2000 would mean that 1000 had already been acknowledged.

When a TCP client sends segments with sequence numbers 1000 and 2000, and the segment with sequence number 1000 is lost, the server never receives it and still expects it. The server therefore returns an ACK with acknowledgement number 1000. This tells the client: "I have successfully received all data up to sequence number 999 and now expect the data starting at 1000."

Simply put, the server returns an ACK of 1000.
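The receiver-side rule can be written down directly (a sketch; bookkeeping is simplified to fixed-size segments identified by their start offset):

```python
def ack_to_send(received_starts, seg_len):
    """Cumulative ACK: return the first in-order byte not yet received.
    received_starts: set of start offsets of seg_len-byte segments."""
    expected = 0
    while expected in received_starts:
        expected += seg_len
    return expected
```

If bytes 0-999 and 2000-2999 arrived but 1000-1999 was lost, the function returns 1000, exactly the ACK described above.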

Does the window keep sliding forward forever? What if it runs out of room?
In fact, the kernel organizes the send buffer as a ring structure; the linear picture above is just for ease of understanding at first

Congestion Control

It’s like a school final exam. If you’re the only one in the class who fails, you’ll think it’s your own fault, but if only one person in the class passes, you might think it’s a teaching accident or something else.

This situation of sending 10,000 messages and losing 9,999 is generally considered to be a problem with the network, since we have a reliability policy with the Server side

So do we still do timeout retransmissions in the face of this kind of network problem? No. Our retransmission and window mechanisms are end-to-end; if the network is already in trouble, continuing to send would pour yet more messages into it and only make the congestion worse. You might think your own messages are few and harmless, but if every TCP connection behaves that way, the aggregate can cause serious problems.

TCP's reliability takes into account not only the hosts on both sides, but also the network along the way; this is the background of congestion control

Although TCP has the sliding window as its big weapon, allowing it to send large amounts of data efficiently and reliably, sending large amounts of data right at the start can still cause problems.

Because there are many computers on the network, the current network state may already be quite congested; rashly sending a large amount of data without knowing that state is very likely to make things snowball.

TCP introduces a slow start mechanism: first send a small amount of data to probe the path and gauge the current network congestion, and then decide how fast to transmit

Here a concept is introduced: the congestion window

  • At the start of sending, define the congestion window size to be 1
  • Each time an ACK is received, increase the congestion window by 1;
  • Each time a packet is sent, compare the congestion window with the window size fed back by the receiving host, and take the smaller value as the actual send window;
  • Me (sliding window), network (congestion window), peer (advertised window: its own capacity). Me → Network → Peer. Sliding window size = min(congestion window, advertised window)

A congestion window like the one above grows at an exponential rate. "Slow start" only means that it is slow at first; it grows very fast.

  • To keep it from growing quite so fast, the congestion window cannot simply keep doubling.
  • So a value called the slow start threshold is introduced
  • When the congestion window exceeds this threshold, it no longer grows exponentially, but linearly


  • When TCP starts up, the slow start threshold is equal to the window maximum;
  • On each timeout retransmission, the slow-start threshold becomes half of its original value while the congestion window is set back to 1;

For a small number of packet losses, we simply trigger a timeout to retransmit; for a large number of packet losses, we consider the network congested;
When TCP communication begins, network throughput rises gradually; as the network becomes congested, throughput drops immediately;
Congestion control is ultimately a compromise: the TCP protocol wants to deliver data to the other side as fast as possible, while avoiding putting too much strain on the network
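The slow-start and congestion-avoidance rules above can be condensed into one per-round update (units are segments per RTT; real TCP works in bytes and adds refinements such as fast recovery, so this is only a sketch):

```python
def next_cwnd(cwnd, ssthresh, loss):
    """One round of congestion-window evolution; returns (cwnd, ssthresh)."""
    if loss:                                  # timeout: threshold halves,
        return 1, max(cwnd // 2, 2)           # window drops back to 1
    if cwnd < ssthresh:                       # slow start: exponential growth
        return min(cwnd * 2, ssthresh), ssthresh
    return cwnd + 1, ssthresh                 # congestion avoidance: linear
```

Starting from cwnd=1, ssthresh=16: the window doubles up to 16, then grows by 1 per round, until a loss halves the threshold and resets the window to 1.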

Delayed Acknowledgement

If the host receiving the data returns an ACK immediately, the window it returns may be small at that moment.

  • Suppose the receiving end has a 1M buffer and receives 500K of data at once; if it answers immediately, the returned window is 500K;
  • But in reality the processing side may be very fast, consuming the 500K from the buffer within 10ms;
  • In this case the receiving end is far from its limit, and the window could safely be enlarged;
  • If the receiver waits a little before answering, say 200ms, then the window size it returns is 1M;

It is important to remember that the larger the window, the greater the network throughput and the more efficient the transmission. The goal is to maximize transmission efficiency while ensuring that the network is not congested;

So can all packets have delayed answers? Definitely not.

  • Quantity limit: answer every N packets
  • Time limit: answer once after the maximum delay time is exceeded

The specific number and timeout time vary according to the operating system; generally N is taken as 2 and the timeout time is taken as 200ms;
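The two triggers can be combined into a single predicate (N=2 and 200ms are the typical values quoted above, not universal constants):

```python
def should_ack(pending_segments, elapsed_ms, n=2, max_delay_ms=200):
    """Delayed ACK: answer every n-th segment, or once the delay cap hits."""
    return pending_segments >= n or elapsed_ms >= max_delay_ms
```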


Piggybacked Acknowledgement

Building on delayed acknowledgement, we notice that in many cases the client and server also "send and receive" at the application layer: the client says "How are you" to the server, and the server sends "Fine, thank you" back to the client.
The ACK can then hitch a ride, travelling back to the client together with the server's "Fine, thank you" response


Byte-Stream Oriented

When a TCP socket is created, a send buffer and a receive buffer are also created in the kernel.

  • When write is called, the data is first written into the send buffer
  • If the amount of data is large, it is split into multiple TCP segments and sent out
  • If the amount of data is small, it waits in the buffer until the buffer is nearly full, or until some other suitable moment, and is then sent
  • When data is received, it likewise arrives in the kernel's receive buffer from the NIC driver; the application can then call read to take data out of the receive buffer
  • Also, since a TCP connection has both a send buffer and a receive buffer, a single connection can both read and write data. This property is called full duplex

Because of the buffers, reads and writes in a TCP program do not need to be matched one-to-one. For example:

  • To write 100 bytes of data, you can call write once with 100 bytes, or call write 100 times, one byte at a time
  • To read 100 bytes of data, you need not care how it was written: you can read 100 bytes in one call, or read one byte at a time, 100 times over
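This boundary-free behaviour is easy to demonstrate; the sketch below uses a Unix socketpair as a stand-in for a real TCP connection:

```python
import socket

def stream_demo():
    a, b = socket.socketpair()      # a pair of connected stream sockets
    for _ in range(100):
        a.sendall(b"x")             # 100 separate one-byte writes
    a.close()
    data = bytearray()
    while True:
        chunk = b.recv(4096)        # the reader sees one continuous stream
        if not chunk:               # empty read: the peer closed
            break
        data += chunk
    b.close()
    return bytes(data)
```

The reader gets back exactly 100 bytes, but nothing tells it that they were written in 100 separate calls.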

The Sticky Packet Problem

First, be clear that the "packet" in the sticky packet problem refers to application-layer packets. For example, if you read slightly less than one complete message, that also corrupts the reading of the message with the next sequence number; this is the sticky packet problem.

  • The TCP protocol header has no "message length" field as UDP does, but it does have a sequence number field
  • From the transport layer's perspective, TCP segments arrive one at a time and are ordered in the buffer by sequence number
  • From the application layer's perspective, all you see is a continuous string of bytes
  • The application, looking at that sequence of bytes, has no way of knowing where one complete application-layer packet starts and ends

So how do you avoid the sticky packet problem?
It boils down to one phrase: define the boundary between packets

  • For fixed-length packets, just read a fixed size each time. For example, if a Request structure has a fixed size, then read sizeof(Request) bytes at a time from the start of the buffer
  • For variable-length packets, you can agree on a field in the packet header that records the total length of the packet, so you know where the packet ends
  • For variable-length packets, you can also place explicit separators between packets (the application-layer protocol is decided by the programmer; just make sure the separator does not conflict with the body)
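The length-header approach in the second bullet can be sketched like this (the 4-byte big-endian length prefix is an assumed convention for illustration, not a standard):

```python
import struct

def encode(payload: bytes) -> bytes:
    # Prefix each application packet with its length.
    return struct.pack("!I", len(payload)) + payload

def decode(buffer: bytes):
    """Split a byte stream into complete packets.
    Returns (packets, leftover); leftover holds any incomplete tail."""
    packets, off = [], 0
    while off + 4 <= len(buffer):
        (length,) = struct.unpack_from("!I", buffer, off)
        if off + 4 + length > len(buffer):
            break                    # incomplete packet: wait for more bytes
        packets.append(buffer[off + 4 : off + 4 + length])
        off += 4 + length
    return packets, buffer[off:]
```

Whatever chunks read() hands the application, decode only ever emits whole packets and keeps the partial tail for the next read, which is precisely how the sticky packet problem is avoided.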

Think: Is there a “sticky packet problem” for the UDP protocol?

  • For UDP, even if the data has not yet been delivered to the upper layer, the UDP message length is still present; and UDP delivers data to the application layer one message at a time, so there is a very clear data boundary
  • From the application layer's perspective, with UDP you either receive a complete UDP message or you don't; you never get "half" of one

TCP Anomalies

Process termination: terminating a process releases its file descriptors, so a FIN can still be sent; this is no different from a normal close (the OS automatically performs the four waves, just as with close)

Machine reboot: same as process termination

Machine power-off / network cable unplugged: the peer still thinks the connection exists. Once the peer performs a write, it discovers the connection is gone and resets. Even without a write operation, TCP has a built-in keep-alive timer that periodically asks whether the other side is still there and releases the connection if it is not. Some application-layer protocols also have their own detection mechanisms: HTTP long connections periodically check the peer's status, and QQ, after disconnecting, periodically tries to reconnect. (The OS has no time to react: the client recognizes the network change, but the server does not, and the client has no way to tell the server precisely because the cable was unplugged first. The server thinks the connection is fine while the client considers it down; the server may adopt a strategy of asking, after a while, whether the client is still there.)

TCP Summary

Why is TCP so complex? Because of the need to ensure reliability while maximizing performance.
Reliability:

  • checksum
  • sequence numbers (in-order arrival)
  • acknowledgement (ACK)
  • timeout retransmission
  • connection management
  • flow control
  • congestion control

Improved performance:

  • sliding window
  • fast retransmission
  • delayed acknowledgement
  • piggybacked acknowledgement

Other:

  • Timers (timeout retransmission timer, keep alive timer, TIME_WAIT timer, etc.)

Application-Layer Protocols Based on TCP

  • HTTP
  • HTTPS
  • SSH
  • Telnet
  • FTP
  • SMTP

And, of course, your own custom application layer protocols when you write your own TCP programs;

Understanding the second argument to listen

The client's state is normal, but the server side shows SYN_RECV rather than ESTABLISHED
This is because the Linux kernel's TCP stack manages connections with two queues:

  1. The half-connection queue (holds requests in the SYN_SENT and SYN_RECV states)
  2. The full-connection queue (accept queue; holds connections already in the ESTABLISHED state that the application layer has not yet retrieved with accept)

The length of the full-connection queue is affected by the second argument to listen.
When the full-connection queue is full, further incoming connections cannot enter the ESTABLISHED state
The length of this queue, as the experiment above shows, is the second argument to listen plus 1

This queue is essentially a short buffer maintained for the server: whenever the upper layers free up, they can serve new connections directly from the queue.
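For reference, here is where that second argument appears (a minimal sketch; the backlog value 2 is arbitrary, and the observed accept-queue capacity of backlog + 1 is Linux-specific behaviour):

```python
import socket

def make_server(backlog=2):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))   # port 0: let the OS pick any free port
    s.listen(backlog)          # the second argument of listen(2):
    return s                   # it bounds the full-connection queue
```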


If there are any errors or unclear areas, please feel free to point them out in private messages or comments!