There is a widely accepted saying in industry: "Data and features determine the upper limit of a machine learning project; algorithms merely approach that limit." In practice, feature engineering takes up nearly half of the work and is a crucial step. Missing values, outliers, standardization, and class imbalance are all relatively straightforward to handle; this article discusses a pitfall that is easy to overlook: data consistency.
As is well known, most machine learning algorithms rest on a premise: the training samples and the test samples come from the same distribution. If the distribution of the test data is inconsistent with that of the training data, the model's performance will suffer.
In machine learning competitions, the distributions of some features in the given training and test sets may well be inconsistent. In real-world applications, as the business evolves, the distribution of incoming samples drifts away from that of the training samples, and the model's generalization ability eventually degrades.
Below we introduce several methods for checking whether the feature distributions of the training set and the test set are consistent:
KDE (Kernel Density Estimation) plots
Kernel density estimation (KDE) is a nonparametric method for estimating an unknown probability density function. A kernel density plot gives an intuitive view of the distribution of the data sample itself.
The kdeplot function in seaborn performs kernel density estimation and visualization for both univariate and bivariate data.
Here is a small example:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# The file paths are elided in the original; train_set is assumed to be
# loaded the same way as test_set.
train_set = pd.read_csv(r'D:\...\train_set.csv')
test_set = pd.read_csv(r'D:\...\test_set.csv')

plt.figure(figsize=(12, 9))
ax1 = sns.kdeplot(train_set.balance, label='train_set')
ax2 = sns.kdeplot(test_set.balance, label='test_set')  # truncated in the original; reconstructed to mirror ax1
plt.legend()
plt.show()
```
KS test (Kolmogorov-Smirnov)
The KS test is based on the cumulative distribution function. It is used to test whether a sample conforms to a given theoretical distribution, or to compare whether two empirical distributions differ significantly. The two-sample K-S test is sensitive to differences in both the location and the shape of the two samples' empirical distribution functions, which makes it one of the most useful and most common nonparametric methods for comparing two samples.
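To make the definition concrete, the two-sample KS statistic can be computed by hand as the largest gap between the two empirical CDFs and checked against scipy. This is an illustrative sketch with synthetic data; `ks_statistic` is a helper defined here, not a library function:

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_statistic(x, y):
    """Two-sample KS statistic: max |ECDF_x(t) - ECDF_y(t)| over all t."""
    # Both ECDFs are step functions that only change at data points,
    # so evaluating them on the pooled sample points suffices.
    pooled = np.sort(np.concatenate([x, y]))
    cdf_x = np.searchsorted(np.sort(x), pooled, side='right') / len(x)
    cdf_y = np.searchsorted(np.sort(y), pooled, side='right') / len(y)
    return np.max(np.abs(cdf_x - cdf_y))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 300)
y = rng.normal(0.5, 1.0, 400)
# The hand-rolled statistic matches scipy's ks_2samp statistic.
d_manual = ks_statistic(x, y)
d_scipy = ks_2samp(x, y).statistic
```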
We can use ks_2samp from scipy.stats to run a KS test:
```python
from scipy.stats import ks_2samp

ks_2samp(train_set.balance, test_set.balance)
```
The KS test returns two values. The first is the KS statistic, the maximum distance between the two empirical distributions: the smaller it is, the smaller the difference between the two distributions, and the more consistent they are. The second is the p-value, used to judge the outcome of the hypothesis test: the larger the p-value, the weaker the grounds for rejecting the null hypothesis (that the two tested distributions are identical), i.e. the more likely the two distributions are the same.
From the result we can see that the balance feature follows the same distribution in the training set and the test set.
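To get a feel for how the statistic and the p-value behave, we can run ks_2samp on synthetic samples (illustrative data, not the bank dataset used above): two samples from the same normal distribution versus two with shifted means:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, 1000)
b = rng.normal(0.0, 1.0, 1000)  # drawn from the same distribution as a
c = rng.normal(1.0, 1.0, 1000)  # mean shifted by 1

stat_same, p_same = ks_2samp(a, b)
stat_diff, p_diff = ks_2samp(a, c)

# Same distribution: the statistic stays near 0 and the p-value gives
# no strong evidence against the null hypothesis.
# Shifted distribution: a large statistic and a tiny p-value, so we
# reject the hypothesis that the distributions are identical.
```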
Adversarial validation
Besides KDE plots and the KS test, adversarial validation is currently quite popular. It is not a way to evaluate model performance, but a method to check whether the distributions of the training set and the test set differ.
The specific procedure:
1. Concatenate the training set and the test set into a single dataset, and add a new label column: training-set samples are labeled 0, test-set samples are labeled 1.
2. Re-split this dataset into a new train_set and test_set (different from the original training and test sets).
3. Train a binary classifier on the new train_set; LR, RF, XGBoost, LightGBM, etc. can all be used, with AUC as the evaluation metric.
4. If the AUC is around 0.5, the model cannot distinguish the original training set from the test set, meaning their distributions are consistent. If the AUC is large, the original training and test sets differ considerably, meaning their distributions are inconsistent.
5. Use the binary classifier from step 3 to score the original training set, and sort the samples by score in descending order: the higher a sample's score, the more it resembles the test set. Then take the top N training samples as the validation set for the target task. In this way the original samples can be split into a training set, a validation set, and a test set.
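The steps above can be sketched as follows. This is a minimal illustration, not the article's original code: the function name, the random forest choice, and the synthetic columns are all assumptions, and any of the classifiers mentioned in step 3 could be substituted:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def adversarial_validation(train_df, test_df, n_top=1000, random_state=0):
    # Step 1: merge and label (0 = original train, 1 = original test).
    data = pd.concat([train_df, test_df], ignore_index=True)
    labels = np.r_[np.zeros(len(train_df)), np.ones(len(test_df))]

    # Step 2: re-split into a new train/test for the adversarial classifier.
    X_tr, X_te, y_tr, y_te = train_test_split(
        data, labels, test_size=0.3, stratify=labels, random_state=random_state)

    # Step 3: fit a binary classifier (RF here; LR/XGBoost/LightGBM also work).
    clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    clf.fit(X_tr, y_tr)

    # Step 4: AUC near 0.5 means train and test are indistinguishable.
    auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

    # Step 5: score the ORIGINAL training samples; the highest-scoring ones
    # look most like the test set and can serve as the validation set.
    train_scores = clf.predict_proba(train_df)[:, 1]
    top_n_index = np.argsort(train_scores)[::-1][:n_top]
    return auc, top_n_index
```

Called on a training and a test DataFrame with the same feature columns, this returns the adversarial AUC plus the indices of the training samples most similar to the test set.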
Besides checking the consistency of feature distributions between the training and test sets, adversarial validation can also be used for feature selection. The next article, 《Feature Selection》, will use an example to show the concrete usage and effect of adversarial validation.