In high-concurrency and asynchronous scenarios, thread pools are everywhere. A thread pool essentially trades space for time: because creating and destroying threads costs resources and time, pooling delays thread destruction in thread-heavy scenarios, greatly improving the reusability of each thread and thus overall performance.

Today I ran into a classic production problem involving thread pools. It touches on deadlock, use of the jstack command, and the suitable scenarios for the different JDK thread pools. The overall troubleshooting approach is also worth borrowing, so I am recording and sharing it here.

01  Business background description

The incident occurred in the core fee deduction service of our advertising system. Let me first briefly explain the overall business flow to make the rest easier to follow.

The green box marks the fee deduction service's position in the deduction flow of ad recall. Put simply: when a user clicks an ad, the consumer side (C-end) initiates a real-time deduction request (CPC, charge-per-click). The deduction service carries the core business logic of this action, including executing anti-cheating strategies, creating the deduction record, and burying the click log.

02 Problem phenomenon and business impact

Around 11 PM on the night of December 2, we received an online alert: the thread pool task queue of the fee deduction service far exceeded its configured threshold, and the queue size kept growing over time. The detailed alert was as follows:

Correspondingly, our advertising metrics (number of clicks and revenue) showed a very obvious drop, and the service alert notifications arrived almost simultaneously. The curve for the number of clicks looked like this:

The fault occurred during the peak-traffic period and lasted nearly 30 minutes before things returned to normal.

03 Problem investigation and accident resolution process

The following details the investigation and analysis of the whole incident.

Step 1: After receiving the thread pool task queue alert, we first looked at real-time data for every dimension of the deduction service, including service calls, timeouts, error logs, and JVM monitoring. No abnormality was found.

Step 2: We then investigated the storage resources the deduction service depends on (MySQL, Redis, MQ) and its external services, and found a large number of slow database queries during the incident window.

The slow queries came from a big-data extraction task that had just been launched during the incident: it was pulling data in large concurrent batches from the deduction service's MySQL database into a Hive table. Because the deduction flow also writes to MySQL, we guessed that all read and write performance on MySQL was affected; as expected, we further found that insert operations were also taking much longer than normal.

Step 3: We guessed that the slow database queries were degrading the deduction flow and causing the task-queue backlog, so we decided to temporarily stop the big-data extraction task. Strangely, though, after the extraction task was stopped, database insert performance returned to normal, but the blocking queue kept growing and the alert did not clear.

Step 4: Since advertising revenue was still falling substantially and further code analysis would take too long, we decided to restart the service immediately and see whether that helped. To preserve the accident scene, we kept one server un-restarted and simply took that machine off the service management platform, so it would receive no new deduction requests.

Sure enough, the blunt instrument of restarting the service worked: all business metrics returned to normal and the alerts stopped. The online fault was thus resolved, having lasted about 30 minutes.

04  Analysis process of the root cause of the problem

Now let's walk through the root-cause analysis of the incident in detail.

Step 1: Back at work the next day, we assumed that on the server where we had preserved the scene, the backlog of tasks in the queue should have been processed by the thread pool by then. So we tried re-mounting that server to verify the guess. The result was the complete opposite of what we expected: the backlog was still there, and as new requests came in, the system alert immediately reappeared, so we took the server off again at once.

Step 2: Thousands of tasks were backlogged in the thread pool, and after a whole night none of them had been processed. We guessed there must be a deadlock, so we planned to dump a thread snapshot with the jstack command for detailed analysis.
# Find the process number (pid) of the fee deduction service, then dump a
# thread snapshot via the pid and export it to a file
$ jstack pid > /tmp/stack.txt

In the jstack log file, I found it immediately: all threads of the business thread pool used for deduction were in the waiting state, all stuck on the same line of code (the red box in the screenshot). That line calls CountDownLatch's await() method, i.e. it waits for the counter to reach 0 before the shared latch releases the waiting thread.
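As a quick refresher on that primitive (a minimal standalone sketch, not the production code): await() blocks the calling thread until countDown() has been invoked enough times to bring the counter to 0.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchDemo {
    public static void main(String[] args) throws InterruptedException {
        CountDownLatch latch = new CountDownLatch(2);

        // Two worker threads each decrement the counter once.
        for (int i = 0; i < 2; i++) {
            new Thread(latch::countDown).start();
        }

        // Blocks until the counter reaches 0; returns true if that
        // happened within the timeout.
        boolean released = latch.await(5, TimeUnit.SECONDS);
        System.out.println("released=" + released);
    }
}
```

If countDown() is never called often enough, an untimed await() blocks forever, which is exactly the state the jstack dump showed.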

Step 3: With that anomaly found, we were close to the root cause. We went back to the code to continue the investigation and first looked at how the thread pool is used: it is created with newFixedThreadPool, with the core thread count set to 25.
The JDK documentation describes newFixedThreadPool as follows:

Creates a thread pool that reuses a fixed number of threads operating off a shared unbounded queue. If a new task is submitted while all threads are active, it waits in the queue until a thread becomes available.

For newFixedThreadPool, two points are essential:

1. Maximum threads = core threads. When all core threads are busy processing tasks, newly submitted tasks wait in the task queue;

2. It uses an unbounded queue: the task queue has no size limit, so if tasks block or slow down, the queue obviously grows larger and larger.
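Both properties are easy to observe directly (a small sketch, assuming nothing beyond the JDK): submit more slow tasks than there are threads and watch the queue grow.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class FixedPoolDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        // Submit 10 tasks that each block for a while; only 2 run at
        // once, the rest wait in the unbounded LinkedBlockingQueue.
        for (int i = 0; i < 10; i++) {
            pool.execute(() -> {
                try { Thread.sleep(2000); } catch (InterruptedException ignored) {}
            });
        }

        Thread.sleep(200); // give the first 2 tasks time to start
        ThreadPoolExecutor tpe = (ThreadPoolExecutor) pool;
        System.out.println("queued=" + tpe.getQueue().size());
        pool.shutdownNow();
    }
}
```

With 2 worker threads busy, the remaining 8 tasks sit in the queue, and nothing in the API caps how many more could pile up behind them.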

The further conclusion, therefore: all core threads were deadlocked while new tasks kept flowing into the unbounded queue, so the task queue grew without limit.

Step 4: What caused the deadlock? We went back to the line of code flagged in the jstack log for further analysis. Here is my simplified sample code:
/**
 * Perform the fee deduction task.
 */
public Result<Integer> executeDeduct(ChargeInputDTO chargeInput) {
  ChargeTask chargeTask = new ChargeTask(chargeInput);
  bizThreadPool.execute(() -> chargeTaskBll.execute(chargeTask));
  return Result.success();
}

/**
 * Concrete business logic of the fee deduction task.
 */
public class ChargeTaskBll implements Runnable {
  public void execute(ChargeTask chargeTask) {
    // Step 1: parameter verification
    verifyInputParam(chargeTask);
    // Step 2: execute the anti-cheating subtasks
    executeUserSpam(SpamHelper.userConfigs);
    // Step 3: execute the deduction
    handlePay(chargeTask);
    // Other steps: click log burying point, etc.
    ...
  }
}

/**
 * Execute the anti-cheating subtasks.
 */
public void executeUserSpam(List<SpamUserConfigDO> configs) {
  if (CollectionUtils.isEmpty(configs)) {
    return;
  }
  try {
    CountDownLatch latch = new CountDownLatch(configs.size());
    for (SpamUserConfigDO config : configs) {
      UserSpamTask task = new UserSpamTask(config, latch);
      bizThreadPool.execute(task);
    }
    latch.await();
  } catch (Exception ex) {
    logger.error("", ex);
  }
}

From the code above, can you see how the deadlock happens? The root cause: a single deduction is a parent task that contains multiple subtasks (the anti-cheating strategies executed in parallel), and the parent and child tasks share the same business thread pool. When every thread in the pool is executing a parent task, and every parent task still has unfinished child tasks, deadlock results. Let's look at the deadlock directly with a diagram:

Suppose the core thread count is 2 and deduction parent tasks 1 and 2 are currently executing. Anti-cheating subtasks 1 and 3 have both finished, while subtasks 2 and 4 are stuck in the task queue waiting to be scheduled. Because subtasks 2 and 4 never finish, parent tasks 1 and 2 can never complete, so we have a deadlock: the core threads are never released, the task queue keeps growing, and eventually the program would OOM and crash.
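The failure mode can be reproduced in a few lines (a self-contained sketch, not the production classes): a single-threaded pool runs a parent task that submits a child task to the same pool and then waits for it. Using a timeout on await() lets us observe the stall without hanging the demo.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolDeadlockDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        CountDownLatch childDone = new CountDownLatch(1);

        pool.execute(() -> {
            // The child task goes into the queue, but the only worker
            // thread is busy running this parent task, so it never starts.
            pool.execute(childDone::countDown);
            try {
                // The real code used an untimed await() here: deadlock.
                boolean finished = childDone.await(1, TimeUnit.SECONDS);
                System.out.println("child finished=" + finished);
            } catch (InterruptedException ignored) {}
        });

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The await() times out because the child can never be scheduled while the parent occupies the only thread; with an untimed await(), the parent would block forever, exactly as in the jstack dump.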

With the cause of the deadlock clear, one question remained: the code above had been running in production for a long time, so why did the problem only surface now? And was it directly related to the slow database queries?

We haven't fully confirmed it, but we can infer: the code above always carried the possibility of deadlock, and under high concurrency or slow task processing the probability rises sharply. The slow database queries were most likely the fuse of this incident.

05  Solution

Once the root cause was found, the simplest solution was: add a new business thread pool to isolate parent and child tasks. The existing pool handles only the deduction (parent) tasks; the new pool handles the anti-cheating (child) tasks. This completely avoids the deadlock.
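A minimal sketch of the fix (pool names and sizes here are illustrative, not the service's real configuration): parent tasks and child tasks go to separate pools, so a parent waiting on its children can never starve them of a worker thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class SeparatePoolsDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService chargePool = Executors.newFixedThreadPool(1);
        ExecutorService spamPool = Executors.newFixedThreadPool(1);
        CountDownLatch childDone = new CountDownLatch(1);

        chargePool.execute(() -> {
            // The child runs on its own pool, so the busy parent
            // thread cannot block it.
            spamPool.execute(childDone::countDown);
            try {
                boolean finished = childDone.await(5, TimeUnit.SECONDS);
                System.out.println("child finished=" + finished);
            } catch (InterruptedException ignored) {}
        });

        chargePool.shutdown();
        chargePool.awaitTermination(10, TimeUnit.SECONDS);
        spamPool.shutdown();
    }
}
```

The same single-thread-per-pool setup that deadlocked before now completes, because the parent's wait and the child's execution happen on different pools.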

06 Problem summary

Reviewing the resolution of the incident and the technical design of the deduction flow, a few points deserve further optimization:

1. A thread pool with a fixed number of threads and an unbounded queue carries OOM risk. The Alibaba Java development manual states this explicitly, and its wording is that creating thread pools via Executors is "not allowed"; create them via ThreadPoolExecutor instead. This forces whoever writes the code to be explicit about the pool's execution rules and core parameter settings, avoiding the risk of resource exhaustion.
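A sketch of creating the pool explicitly, in the spirit of that recommendation (all parameter values here are illustrative, not the service's real settings): a bounded queue plus an explicit rejection policy makes saturation visible instead of letting the queue grow without limit.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExplicitPoolDemo {
    public static void main(String[] args) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4,                              // core pool size
                8,                              // maximum pool size
                60, TimeUnit.SECONDS,           // keep-alive for non-core threads
                new ArrayBlockingQueue<>(100),  // bounded queue: no silent growth
                new ThreadPoolExecutor.CallerRunsPolicy()); // back-pressure when full

        pool.execute(() -> System.out.println("task ran"));
        pool.shutdown();
    }
}
```

Every parameter is now a conscious choice: how many threads, how deep the queue may get, and what happens when both are exhausted.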

2. The ad deduction scenario is an asynchronous flow, and implementing it asynchronously via either a thread pool or MQ are both viable options. Also, from a business standpoint, losing a very small number of click requests without deduction is acceptable, but dropping large numbers of requests with no handling or compensation scheme is not. After switching to a bounded queue, the rejection policy could send tasks to MQ for retry processing.
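One way that could be wired up (a sketch; MqSender is a hypothetical stand-in for whatever producer client the service actually uses): a custom RejectedExecutionHandler forwards rejected tasks to a queue for later retry instead of silently dropping them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionHandler;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class MqRejectDemo {
    /** Hypothetical MQ producer; a real one would publish to a broker. */
    interface MqSender { void send(Runnable task); }

    public static void main(String[] args) {
        List<Runnable> mq = new ArrayList<>();
        MqSender sender = mq::add; // stand-in: "publish" = append to a list

        RejectedExecutionHandler toMq = (task, pool) -> sender.send(task);

        // 1 thread, queue capacity 1: the third task must be rejected.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0, TimeUnit.SECONDS, new ArrayBlockingQueue<>(1), toMq);

        for (int i = 0; i < 3; i++) {
            pool.execute(() -> {
                try { Thread.sleep(500); } catch (InterruptedException ignored) {}
            });
        }
        System.out.println("sent to MQ=" + mq.size());
        pool.shutdown();
    }
}
```

The first task occupies the single worker, the second fills the queue, and the third is handed to the rejection handler, which forwards it for retry rather than throwing RejectedExecutionException.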
