There is a widely accepted view in industry: "Data and features determine the upper limit of a machine learning project, and algorithms merely approach that upper limit." In practice, feature engineering takes up almost half of the time and is a very important part of a project. Missing values, outliers, data standardization, class imbalance, and similar issues are relatively easy to handle; in this article we discuss a pitfall that is easy to overlook: data consistency.

As is well known, most machine learning algorithms rest on a premise: the training samples and the test samples come from the same distribution. If the distribution of the test data is inconsistent with that of the training data, the performance of the model will suffer.

In machine learning competitions, the distributions of some features in the given training set and test set may well be inconsistent. In practical applications, as the business evolves, the distribution of incoming samples drifts away from that of the training samples, and the generalization ability of the model eventually becomes insufficient.

Below we introduce several methods for checking whether the feature distributions of the training set and the test set are consistent:

KDE (kernel density estimation) distribution plot

Kernel density estimation (KDE) is a nonparametric method for estimating an unknown probability density function. A kernel density plot lets us see the distribution characteristics of a data sample directly.
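For reference, given samples x_1, ..., x_n, the kernel density estimate with kernel K and bandwidth h (a Gaussian kernel is the common default, including in seaborn) is:

\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)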

In seaborn, kdeplot can be used for kernel density estimation and visualization of univariate and bivariate data.

A small example:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

train_set = pd.read_csv(r'D:\...\train_set.csv')
test_set = pd.read_csv(r'D:\...\test_set.csv')

# Overlay the kernel density estimates of the balance feature for both sets
plt.figure(figsize=(12, 9))
ax1 = sns.kdeplot(train_set.balance, label='train_set')
ax2 = sns.kdeplot(test_set.balance, label='test_set')
plt.legend()

KS test (Kolmogorov-Smirnov)

The KS test is based on the cumulative distribution function. It is used to test whether a sample follows a given theoretical distribution, or whether two empirical distributions differ significantly. The two-sample K-S test is sensitive to differences in both the location and the shape of the two samples' empirical distribution functions, which makes it one of the most useful and common nonparametric methods for comparing two samples.
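For reference, writing F_train and F_test for the empirical cumulative distribution functions of the two samples, the two-sample KS statistic is simply their largest vertical gap:

D = \sup_x \left| F_{train}(x) - F_{test}(x) \right|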

We can use ks_2samp from the scipy.stats library to run the KS test:
from scipy.stats import ks_2samp

ks_2samp(train_set.balance, test_set.balance)

The KS test returns two values. The first is the KS statistic, the maximum distance between the two empirical distribution functions; the smaller it is, the smaller the difference between the two distributions, i.e. the more consistent they are. The second is the p-value, used to judge the outcome of the hypothesis test: the larger the p-value, the less we can reject the null hypothesis (that the two distributions under test are identical), i.e. the more likely the two distributions are the same.
Ks_2sampResult(statistic=0.005976590587342234, pvalue=0.9489915858135447)
From this result we can see that the balance feature follows the same distribution in the training set and the test set.
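In practice we usually want to check more than one column. Below is a minimal sketch, assuming train_set and test_set are the DataFrames loaded above and looping only over their shared numeric columns; the 0.05 threshold is a conventional significance level, not something from the original example:

from scipy.stats import ks_2samp

# Numeric columns present in both DataFrames
common_cols = [c for c in train_set.columns
               if c in test_set.columns and train_set[c].dtype.kind in 'if']

for col in common_cols:
    statistic, p_value = ks_2samp(train_set[col], test_set[col])
    # A p-value below 0.05 suggests this feature's distributions differ
    flag = 'inconsistent' if p_value < 0.05 else 'ok'
    print(f'{col}: statistic={statistic:.4f}, p-value={p_value:.4f} -> {flag}')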

Adversarial validation

Besides KDE plots and the KS test, adversarial validation is currently quite popular. It is not a way to evaluate model performance, but a method to confirm whether the distributions of the training set and the test set differ.
The specific procedure is as follows (a code sketch is given after the list):
1. Merge the training set and the test set into one data set and add a new label column: samples from the training set are labeled 0, samples from the test set are labeled 1.
2. Make a new split into train_set and test_set (different from the original training set and test set).
3. Train a binary classification model on train_set; LR, RF, XGBoost, LightGBM, etc. can all be used, with AUC as the model metric.
4. If the AUC is around 0.5, the model cannot distinguish the original training set from the test set, which means their distributions are consistent. If the AUC is clearly larger, the original training set and test set differ considerably, i.e. the distributions are inconsistent.
5. Use the binary classifier from step 3 to score the original training set and sort the samples by score in descending order; the higher the score, the closer the sample is to the test set. Then take the top N samples of the training set as the validation set of the target task, so that the original samples are split into a training set, a validation set, and a test set.
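Here is a minimal sketch of the procedure above. It assumes train_set and test_set are the DataFrames loaded earlier, that their shared columns are numeric with no missing values (categorical features would need encoding first), that any target column of the original task has already been dropped, and that top_n is a hypothetical validation-set size:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Step 1: merge the two sets and label their origin (train -> 0, test -> 1)
features = [c for c in train_set.columns if c in test_set.columns]
merged = pd.concat([train_set[features], test_set[features]], ignore_index=True)
is_test = np.concatenate([np.zeros(len(train_set)), np.ones(len(test_set))])

# Step 2: a new split, different from the original train/test split
X_tr, X_val, y_tr, y_val = train_test_split(
    merged, is_test, test_size=0.3, random_state=42, stratify=is_test)

# Step 3: a binary classifier (LR here; RF, XGBoost, LightGBM also work)
clf = LogisticRegression(max_iter=1000)
clf.fit(X_tr, y_tr)

# Step 4: AUC near 0.5 -> the sets look alike; a high AUC -> distributions differ
auc = roc_auc_score(y_val, clf.predict_proba(X_val)[:, 1])
print(f'adversarial validation AUC: {auc:.3f}')

# Step 5: score the original training samples; the highest-scoring ones look
# the most like the test set and can serve as the target task's validation set
train_scores = clf.predict_proba(train_set[features])[:, 1]
top_n = 1000  # hypothetical validation-set size
val_idx = np.argsort(-train_scores)[:top_n]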

In addition to checking the consistency of feature distributions between the training set and the test set, adversarial validation can also be used for feature selection. If you are interested, please give this a like and a follow; the next article, "Feature Selection", will use an example to show the specific usage and effect of adversarial validation.
