Background

In the Internet industry, targeted user-operation campaigns are often run to increase user stickiness to a product. This article uses a red envelope campaign to illustrate the general workflow of such a campaign. First, because earlier red envelope campaigns activated different users to different degrees, users are divided into low-activity, medium-activity, and high-activity groups. Second, to improve next-day retention, a scheme is designed in which users book a red envelope and claim it the following day. The content shown on the red envelope, such as a "cash box" or an "amount range", is decided by running an AB experiment on a selected sample population, and the better design is chosen based on the significance of the experiment and the next-day retention results. Finally, the campaign is rolled out online to the overall population, followed by data summary and analysis.

Pain point

AB experiments face a limitation when the metric under analysis is next-day retention. For an experiment on day T+0, only metrics that are directly observable or computable from the campaign, such as exposure, click-through rate, and utilization rate, are available within a short time. The T+1 retention rate of the campaign, however, only becomes available on T+2. The whole experiment cycle is therefore long, which prevents operations from responding quickly to the campaign.

Simulation scheme

To address the long response cycle described above, we want data simulation to help operations predict and adjust campaigns in time.
The overall scheme pulls hourly data from the log data of related historical campaigns, classifies it by crowd label, and trains a model with a supervised learning algorithm. Relevant indicators are then designed, predictions are made from the short-term hourly data of a new campaign, and the differences in effect are compared.

Implementing the simulation requires accurate predictions from limited data. Two questions need to be addressed:

* How to organize the data
* How to design the prediction model

Data organization

The related historical campaigns should be a class of campaigns that share common characteristics, such as growth campaigns or promotion campaigns. The crowd labels should be representative and general, and the selected users should have taken part in the related historical campaigns.

Hourly data is accumulated from the same starting point every day. For example, the 9:00 statistics cover the results from 0:00 to 9:00 and the 10:00 statistics cover the results from 0:00 to 10:00, so the 10:00 data includes the 9:00 data.
The advantage of this accumulation is that it reduces the random error in the statistics at early time points, when few users have participated.
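
The accumulation described above can be sketched in a few lines of Python. This is a minimal illustration rather than the actual pipeline: the function names and the per-hour counts are made up for the example.

```python
from itertools import accumulate


def cumulative_by_hour(hourly_counts):
    """Turn per-hour counts into running totals since 0:00."""
    return list(accumulate(hourly_counts))


def cumulative_rate(hourly_numerators, hourly_denominators):
    """Rate computed on running totals, so early hours borrow no future data
    but later hours include everything seen so far."""
    numerators = cumulative_by_hour(hourly_numerators)
    denominators = cumulative_by_hour(hourly_denominators)
    return [n / d if d else None for n, d in zip(numerators, denominators)]


# Hypothetical per-hour exposure and click counts (not taken from the article).
exposures = [3, 5, 12, 40]
clicks = [1, 0, 3, 10]

cum_exposures = cumulative_by_hour(exposures)
rates = cumulative_rate(clicks, exposures)
```

Because each value is a running total, a rate at a later hour rests on a larger denominator than the same hour counted in isolation, which is what dampens the noise of thinly populated early hours.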

The data is organized into the following hierarchy.
For a given campaign, participants are classified by specific labels, and the population under each label is grouped by label value. The population under each label value is then split into AB experiment buckets, and finally the hourly data of the population in each bucket is computed.
Returning to the first example, for the booking red envelope campaign the crowd can be classified by labels such as "user activity", "is a sensitive group", and "number of buyer transactions in the last 30 days".
Under the "user activity" label, the crowd can be further grouped into "low activity", "medium activity", and "high activity". Each group is split into a control bucket and an experiment bucket, and the hourly data of each bucket's population is computed.
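
One possible way to hold this hierarchy in memory is a dictionary keyed by (campaign, label, label value, bucket) that maps to hourly values. The sketch below assumes that layout; the row format, names, and numbers are hypothetical and not from the original system.

```python
from collections import defaultdict


def organize(rows):
    """Group rows as (campaign, label, label value, bucket) -> {hour: value}."""
    hierarchy = defaultdict(dict)
    for campaign, label, value, bucket, hour, exposed in rows:
        hierarchy[(campaign, label, value, bucket)][hour] = exposed
    return dict(hierarchy)


# Hypothetical cumulative exposure counts per hour (made-up numbers).
rows = [
    ("booking_red_envelope", "user_activity", "low", "control", 9, 120),
    ("booking_red_envelope", "user_activity", "low", "experiment", 9, 118),
    ("booking_red_envelope", "user_activity", "low", "control", 10, 150),
    ("booking_red_envelope", "user_activity", "high", "control", 9, 300),
]

organized = organize(rows)
low_control = organized[("booking_red_envelope", "user_activity", "low", "control")]
```

Keeping the bucket as the innermost key makes it straightforward to compare the control and experiment series for the same label value hour by hour.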

Model design

The model design depends on the type of the indicator. In the simulation model, indicators fall into three categories: observed indicators, real-time indicators, and delayed indicators.
Observed indicators can be read directly from event-tracking logs, for example the number of users exposed to the campaign, the number of users who used a red envelope, and the number of users who uninstalled the app.
Real-time indicators can be computed from observed indicators. For example, the red envelope utilization rate can be computed from the number of users who used a red envelope and the number who received one.
Delayed indicators cannot be obtained directly from observed or real-time indicators. For example, the campaign retention rate only becomes available on T+2.
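
As a small illustration of how a real-time indicator is derived from observed ones, the sketch below computes the utilization rate from two counts. The function name, the dictionary keys, and the counts are assumptions made for the example.

```python
def utilization_rate(users_used, users_received):
    """Real-time indicator derived from two observed indicators."""
    if users_received == 0:
        return None  # undefined until someone has received a red envelope
    return users_used / users_received


# Hypothetical observed indicators read from event-tracking logs.
observed = {"exposed": 1000, "received": 120, "used": 30}
rate = utilization_rate(observed["used"], observed["received"])
```

A delayed indicator such as next-day retention has no counterpart here, because its inputs do not exist yet on day T+0; that is the gap the prediction model is meant to fill.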

1) The prediction model for a real-time indicator is:

$$\hat{X}_{T+0}(t) = f_1\big(X_{T+0}(0), X_{T+0}(1), \ldots, X_{T+0}(t-1)\big)$$

2) The prediction model for a delayed indicator is:

$$\hat{Y}_{T+n}(t) = f_2\big(X_{1,T+0}(t), X_{2,T+0}(t), \ldots, X_{N,T+0}(t)\big)$$

In the real-time indicator model, the predicted value of real-time indicator X at time t on day T+0 is the output of model f1 applied to the values of X from time 0 to time t-1.
In the delayed indicator model, the predicted value of delayed indicator Y at time t on day T+n is the output of model f2 applied to the observed or real-time indicators X1, X2, ..., XN at time t on day T+0.

Model prediction

This section focuses on the delayed indicator prediction model f2. The relationship between delayed indicator Y at time t on day T+n and the observed or real-time indicators X1, X2, ..., XN at time t on day T+0 is highly nonlinear.
The main models used are CART, GBDT, and NN.
1) CART: classification and regression tree.
CART can be used for classification or regression. It is a binary tree: at each node a feature is split in two according to whether a condition is satisfied. For regression, the minimum squared error is used as the splitting criterion, and the feature space is eventually divided into N disjoint regions. At prediction time, a sample descends the tree step by step according to its feature values until it falls into one of the N regions, and the mean of the training samples in that region is used as the prediction.
In the figure below, delayed indicator Y is assumed to depend on three features X1, X2, X3. The regression tree is built with the minimum squared error criterion, which yields three split values a1, a2, a3 and divides the space into four regions Y1, Y2, Y3, Y4.
A new sample falls step by step into one of the four regions according to its X1, X2, X3 values, and the mean of that region is used as the prediction.
The CART algorithm is simple and its results are reliable. However, even with pruning, the model can still overfit.
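
The least-squares splitting step can be sketched for a single feature as follows. This is a minimal illustration of one split, not a full CART implementation with recursion and pruning; the helper names and the sample data are hypothetical.

```python
def squared_error(values):
    """Sum of squared deviations from the mean of the values."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) with the smallest total
    squared error over all midpoints between distinct feature values."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no boundary between equal feature values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        error = squared_error(left) + squared_error(right)
        if best is None or error < best[0]:
            best = (error, threshold, sum(left) / len(left), sum(right) / len(right))
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def predict(split, x):
    """Predict with the mean of the region the sample falls into."""
    threshold, left_mean, right_mean = split
    return left_mean if x <= threshold else right_mean


# Hypothetical feature values and retention rates (made-up numbers).
xs = [1, 2, 3, 7, 8, 9]
ys = [0.2, 0.25, 0.15, 0.6, 0.65, 0.55]
split = best_split(xs, ys)
```

A full regression tree applies the same search recursively to each side of the split and across all features, which is how the regions Y1 to Y4 in the description arise.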

2) GBDT: gradient boosting decision tree. GBDT combines CART trees through boosting, an ensemble learning method, to improve regression accuracy.
Each round's CART tree is trained on the residual of the previous round, where the residual is the negative gradient of the loss with respect to the previous round's prediction.
In the figure below, delayed indicator Y again depends on three features X1, X2, X3. CART1 uses the minimum squared error criterion to divide the space into four regions Y1, Y2, Y3, Y4. The negative-gradient residual of each training sample is then computed and passed to CART2, which uses the same criterion to divide the space into four regions Y5, Y6, Y7, Y8.
The iteration continues in this way, with each round's CART tree fitted to the residuals of the previous round.
As a simple example, suppose a sample has a retention rate of 0.8. The first CART predicts 0.6, leaving a residual of 0.2. The second CART fits this residual with 0.15, leaving 0.05, and the third CART fits that with 0.03. This continues until the maximum number of iterations is reached. After three rounds the cumulative prediction is 0.6 + 0.15 + 0.03 = 0.78.
For a new sample, its X1, X2, X3 values are run through every CART tree in turn, and the outputs are summed to give the final prediction.
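
The residual-fitting loop can be sketched with depth-1 trees (stumps) on a single feature. This is a simplified illustration under squared-error loss, where the negative gradient equals the plain residual. It is not the configuration used in the article, and all names and data are hypothetical.

```python
def fit_stump(xs, targets):
    """Fit a depth-1 regression tree on one feature by least squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keeps both sides non-empty
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((t - left_mean) ** 2 for t in left)
                 + sum((t - right_mean) ** 2 for t in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    return best[1:]


def stump_predict(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def fit_boosted(xs, ys, rounds):
    """Start from the mean, then fit each stump to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + stump_predict(stump, x)
                       for p, x in zip(predictions, xs)]
    return base, stumps


def boosted_predict(model, x):
    """Sum the base value and the output of every stump."""
    base, stumps = model
    return base + sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((boosted_predict(model, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(ys)


# Hypothetical feature values and retention rates (made-up numbers).
xs = [1, 2, 3, 4, 5, 6]
ys = [0.2, 0.3, 0.25, 0.7, 0.8, 0.75]
one_round = fit_boosted(xs, ys, rounds=1)
three_rounds = fit_boosted(xs, ys, rounds=3)
```

On this toy data the training error after three rounds is lower than after one. Practical implementations usually add a learning rate and deeper trees, which this sketch leaves out.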

3) NN: neural network. Both CART and GBDT learn in a model-driven way, in that a reasonable splitting criterion has to be chosen.
A neural network learns in a data-driven way: the mapping between input data and output is learned through the way the network is connected.
In general, a neural network combines an input layer, activation layers, fully connected layers, regularization layers, and an output layer into different network topologies.
In the figure below, delayed indicator Y again depends on three features X1, X2, X3. Each feature is first preprocessed by subtracting its mean and dividing by its variance. The features then pass through a fully connected layer, a ReLU activation layer, and a Dropout regularization layer, and a sigmoid function finally outputs the prediction of Y.
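
The inference path of such a network can be sketched in plain Python. Everything numeric here is a placeholder: the layer sizes and weights are made up rather than trained, and the sketch scales by a standard deviation, a common convention, whereas the text above mentions the variance. Dropout is only active during training, so it is an identity step at prediction time.

```python
import math


def standardize(x, means, scales):
    """Subtract the per-feature mean and divide by the per-feature scale."""
    return [(v - m) / s for v, m, s in zip(x, means, scales)]


def dense(x, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, params):
    """Standardize -> dense -> ReLU -> (dropout skipped at inference) -> dense -> sigmoid."""
    hidden = standardize(x, params["means"], params["scales"])
    hidden = relu(dense(hidden, params["w1"], params["b1"]))
    (logit,) = dense(hidden, params["w2"], params["b2"])
    return sigmoid(logit)


# Hypothetical, untrained parameters for three inputs and two hidden units.
params = {
    "means": [0.5, 0.2, 0.1],
    "scales": [0.1, 0.05, 0.02],
    "w1": [[0.4, -0.2, 0.1], [0.3, 0.5, -0.6]],
    "b1": [0.0, 0.1],
    "w2": [[0.7, -0.3]],
    "b2": [-0.2],
}
prediction = forward([0.55, 0.22, 0.09], params)
```

The sigmoid output always lies strictly between 0 and 1, which suits a target such as a retention rate.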

Results

Fast simulation has been applied to Xianyu (Idle Fish) growth and activation campaigns. Take Xianyu's 222 activation campaign as an example. It includes eight venues:
fruit venue, coupons, general merchandise, fruits, costume venue, jewelry and entertainment, game packs, and bestsellers. The main real-time indicators of each activity are per capita ipv, jump click-through rate, purchase rate, release rate, and so on.
All observed and real-time indicators are used to build the prediction model for the next-day retention rate.
The 11:00 data from February 21 is used for validation. Table 1 shows the results for the eight activities under the all-users label (only data with more than 100 exposed users is included).
Table 1 shows that, overall, the mean squared errors of NN and GBDT are lower than that of CART.

The left-hand figure below compares the next-day retention rates predicted by the three models for the eight activities across the whole population. For the costume venue, the retention rate predicted by NN is significantly lower than those predicted by GBDT and CART. For the game gift pack activity, the retention rate predicted by NN is clearly better than those predicted by GBDT and CART.
The right-hand figure takes a randomly selected case, the costume venue under the label "number of Xianyu purchases in the last 180 days", and compares the next-day retention rates predicted by the different models for each group under that label. The predicted retention rates are close to the true values.

Table 2 shows the results for the eight activities under the all-users label (only data with more than 50 exposed users is included). Table 2 shows that, overall, the mean squared errors of NN and GBDT are again lower than that of CART.
Compared with Table 1, lowering the exposure threshold from 100 to 50 increases the mean squared error significantly. When the number of users is small, the real-time indicators contain some random error.
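
The effect of the exposure threshold on the reported error can be illustrated with a small helper that filters rows before computing the mean squared error. The row layout and every number below are made up for the illustration and are not the values from Table 1 or Table 2.

```python
def filtered_mse(rows, min_exposed):
    """Mean squared error over rows whose exposure count exceeds the threshold.

    Each row is (exposed_users, predicted_retention, actual_retention).
    """
    kept = [(p, a) for exposed, p, a in rows if exposed > min_exposed]
    if not kept:
        return None
    return sum((p - a) ** 2 for p, a in kept) / len(kept)


# Hypothetical rows: the thinly exposed group has a much noisier prediction.
rows = [
    (500, 0.30, 0.32),
    (120, 0.25, 0.20),
    (60, 0.10, 0.30),
]

mse_over_100 = filtered_mse(rows, 100)
mse_over_50 = filtered_mse(rows, 50)
```

In this toy data, admitting the group with only 60 exposed users raises the error, in the same direction as the comparison between the two thresholds described above.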

Outlook

This article described how to run campaign simulations quickly, including how to organize the data in the simulation scheme, how to design the model according to the indicator type, and how to choose a suitable model for prediction.
Future work can explore how to obtain more accurate predictions when few users are exposed, how to incorporate the influence of adjacent hourly data into the prediction model, and how to derive further operational suggestions from the predictions.
