Last week, during performance testing, we found a problem.
Our server runs a redis instance listening on port 1234 of 0.0.0.0. Another process on the same machine frequently opens short-lived connections to it, which causes two problems:
1. a large number of connections in the TIME_WAIT state;
2. CPU usage of the connecting process close to 100%.
These two effects seriously degrade the performance of our gateway. Before analyzing the specific causes, let me first make a recommendation: for local connections, prefer UNIX domain sockets over TCP!
The reason needs no empirical data; theoretical analysis is enough, provided you understand IP-layer processing and soft-interrupt scheduling in the Linux kernel protocol stack well enough. And all of that is quite simple.
First, let's look at problem 1. There is not much to say about TIME_WAIT itself: whichever end actively closes the connection may eventually enter the TIME_WAIT state. Whether it lingers on Linux depends on a few factors. First, do both ends have tcp_timestamps enabled? If so, does the server have tcp_tw_recycle enabled? If both are on, TIME_WAIT sockets disappear quickly; in other words, for recycle to take effect, timestamps must be enabled. Without timestamps, TIME_WAIT sockets pile up in large numbers.
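As a quick sanity check, both sysctls can be read from /proc. This is a minimal sketch; note that tcp_tw_recycle was removed from the kernel in Linux 4.12, so that file may simply be absent on newer systems, and the helper below falls back to a default in that case.

```python
def read_sysctl(name, default=None):
    """Read an integer sysctl from /proc/sys, or return `default` if absent."""
    path = "/proc/sys/" + name.replace(".", "/")
    try:
        with open(path) as f:
            return int(f.read().split()[0])
    except (OSError, ValueError):
        return default

timestamps = read_sysctl("net.ipv4.tcp_timestamps", default=1)
recycle = read_sysctl("net.ipv4.tcp_tw_recycle", default=0)  # removed in 4.12+

# recycle only takes effect when timestamps is also enabled
fast_timewait_reaping = bool(recycle) and bool(timestamps)
print("timestamps:", timestamps, "recycle:", recycle,
      "fast TW reaping:", fast_timewait_reaping)
```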
In the Linux kernel protocol stack, all flows connecting to the local machine are eventually routed to loopback. If you do not bind a source IP address, both the source and destination IP are 127.0.0.1! With the service port fixed, the server can therefore accept at most 65535-1 connections; we subtract 1 because the server has already bound the service port, so the client cannot bind it again. This is reasonable: given the uniqueness of the 4-tuple, a service can accept only 65535 (or 65534) connections from any one particular IP address. The problem is that if demand is huge, this obviously falls short. Keep in mind that what a server cares about is the total maximum number of concurrent connections, and more than 60,000 simultaneous connections from a single remote machine is unlikely, so for TCP this is reasonable in most cases: a 16-bit port number is just right, because the protocol header cannot be too large, or the payload ratio shrinks. That is a requirement of network transmission. When the local machine connects to itself, however, no network transmission is involved, and you may well want to satisfy arbitrarily large demand, but TCP is simply not suited to this scenario.
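The limit above is a direct consequence of the 16-bit port field; a back-of-the-envelope calculation makes it concrete (the service port 1234 is the one from the example, and k is an arbitrary illustrative number of source addresses):

```python
PORTS = 2 ** 16 - 1        # usable port numbers 1..65535

# With one fixed (source IP, destination IP, destination port), only the
# client's source port varies, and the client cannot reuse the port the
# server has already bound (1234 in our example), hence the -1.
max_conns_single_ip = PORTS - 1
print(max_conns_single_ip)           # 65534

# Each distinct source address in 127.0.0.0/8 contributes its own
# 4-tuple space, so k source addresses multiply the capacity by k.
k = 16
print(max_conns_single_ip * k)       # 1048544
```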
Now consider local connections. There is no network-transmission delay, and throughput is limited only by local resource utilization, so a concurrency of 100,000 or more is reasonable. TCP, however, cannot deliver it, because it has only a 16-bit port number: with the destination port fixed, there can be at most 65534 simultaneous connections. How to solve this? We know that all of 127.0.0.0/8 belongs to loopback, so we can use different source IP addresses. There are two options: either the client binds its source IP to 127.x.y.z, or we SNAT to 127.x.y.z; either way we can then accept a massive number of connections. But this is still not the final answer. Why insist on TCP at all? TCP was designed for network transmission: its flow control deals with mismatched hosts, and its congestion control deals with a fickle network. On the local machine none of this is an issue, so for machine-to-itself connections it is best to use a native local socket, such as a UNIX domain socket.
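A UNIX domain socket round-trip looks almost identical to TCP code but involves no port numbers, no 4-tuple, and no TIME_WAIT. A minimal echo sketch (the socket path is an arbitrary choice for illustration):

```python
import os
import socket
import tempfile
import threading

# hypothetical filesystem path standing in for a real service endpoint
sock_path = os.path.join(tempfile.mkdtemp(), "echo.sock")

server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
server.bind(sock_path)   # addressed by path, not by IP:port
server.listen(1)

def serve_once():
    conn, _ = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))  # echo one message back

t = threading.Thread(target=serve_once)
t.start()

client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
client.connect(sock_path)
client.sendall(b"ping")
reply = client.recv(1024)
client.close()
t.join()
server.close()
os.unlink(sock_path)

print(reply)  # b'ping'
```

Because the address is a filesystem path rather than a 16-bit port, the concurrency ceiling discussed above simply does not apply.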
Now look at problem 2. A packet on a TCP connection to the local machine finally reaches loopback's xmit function, which simply raises a soft interrupt on the current CPU, to be handled after the next interrupt returns. This most likely happens in the context of the current sending process; in other words, the sending process performs a send in its own context, and the soft interrupt then borrows that context to run the receive path. On top of that comes the LOCK overhead: the constant insertion and deletion of large numbers of TW sockets requires frequently locking the hash table, and this cost is all charged to the sending process as well, which is also unfair.
Note that in the Linux kernel, softirqs execute in two kinds of context: one is whatever context is running right after a hardware interrupt, the other is a per-CPU kernel thread (ksoftirqd). The latter is charged to the si percentage shown by top; the former is charged to whatever process happened to be interrupted.
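The softirq activity itself can be observed directly: per-CPU counters live in /proc/softirqs, where NET_RX is the receive softirq that loopback traffic raises. A minimal reader, sketched to return an empty dict on non-Linux systems:

```python
def read_softirqs(path="/proc/softirqs"):
    """Parse /proc/softirqs into {softirq_name: [per-CPU counts]}."""
    counts = {}
    try:
        with open(path) as f:
            lines = f.read().splitlines()
    except OSError:
        return counts  # not on Linux
    for line in lines[1:]:              # first line is the CPU header
        name, _, rest = line.partition(":")
        counts[name.strip()] = [int(x) for x in rest.split()]
    return counts

soft = read_softirqs()
# Each entry is one counter per CPU; watch NET_RX grow under loopback load.
print(soft.get("NET_RX", "unavailable"))
```

Sampling this twice around a burst of local short connections, and comparing with the CPU accounting in top, shows how much of the receive work is billed to the sending process rather than to ksoftirqd.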