There are three main methods of model evaluation: the hold-out method, cross validation, and the bootstrap method. All three divide the data set into a training set and a test set, and use the "test error" on the test set to approximate the generalization error of the model. The test set should be kept mutually exclusive with the training set as far as possible; only in this way can a reliable estimate of the model's generalization performance be obtained.

1. Hold-out method

The hold-out method directly partitions the data set D into two mutually exclusive sets: a training set S and a test set T. The test set T is then used to estimate the generalization error of the model.

In general, about 2/3 to 4/5 of the samples are used for training, and the remaining samples are used for testing.

When splitting the samples, the consistency of the data distribution should be preserved as much as possible; at the very least, the class proportions should remain similar in order to avoid introducing bias. Sampling that preserves the class proportions is commonly called "stratified sampling".

Even with the split ratio fixed, there are many ways to partition the data set D. To make the estimate more stable and reliable, the split and evaluation should be repeated several times and the average of the evaluation results taken.

The StratifiedShuffleSplit class in sklearn can be used for this; of course, a simple version is easy to write by hand as well.

#!/usr/bin/env python
# -*- coding: utf-8 -*-
from sklearn.model_selection import StratifiedShuffleSplit
import numpy as np

X = np.array([[0, 1], [0, 2], [0, 3], [1, 8], [1, 9], [1, 10]])
y = np.array([0, 0, 0, 1, 1, 1])
# test set proportion 0.333, stratified sampling, 5 random splits
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.333)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

Result:

2. Cross validation

Cross validation divides the data set D into k mutually exclusive subsets of approximately equal size, while keeping the data distribution in each subset as consistent as possible (stratified sampling).

Each round, k-1 of the subsets are used for training and the remaining subset is used as the test set, so k rounds of training and testing can be carried out, and the average of the k test results is returned. This is called "k-fold cross validation" (k-fold cross validation). 10-fold cross validation is the most common choice, and there is also a special case, leave-one-out cross validation (Leave-One-Out cross validation, LOOCV).
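When no parameter tuning is needed, the k test results can also be obtained directly with sklearn's cross_val_score; a minimal sketch on made-up data (the X/y arrays below are placeholders, not data from this article):

import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.random.rand(100, 5)               # placeholder features
y = np.random.randint(0, 2, size=100)    # placeholder binary labels
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)  # stratified 10-fold split
clf = svm.SVC(C=1.0, kernel='linear')
scores = cross_val_score(clf, X, y, cv=skf)   # one test accuracy per fold
print(scores, scores.mean())                  # the k results and their average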

(1) 10-fold cross validation

Divide the data set into 10 folds; each time, take 9 folds for training and 1 fold for testing, and finally take the average of the ten test results (the outer cross validation).

During each round of training on the 9 folds, if the model's parameters need to be tuned, a further cross validation can be carried out inside the training process (the inner layer). For example: of the 9 folds, 8 are used for training and the remaining one for validation, and the 9 validation results are used to evaluate the current parameter value. Below is an example of 3-fold cross validation, where C is the parameter to be optimized.

The StratifiedKFold class in sklearn can be used to partition the data set and implement 10-fold cross validation. The following example uses an SVM classifier:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import loadData
import copy
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
feature, label = loadData.loadData()  # self-written function, loads the data
smps = len(label)  # number of samples
foldsList = []  # sample indices of each fold
ss = StratifiedKFold(n_splits=10, shuffle=True)
for train_index, test_index in ss.split(feature, label):
    print("TEST:", test_index)  # index values of the current fold
    foldsList.append(test_index)
test_accur_list = []
for i in range(10):  # outer loop: 9 folds for training, 1 fold for testing
    train_index = list(set(range(0, smps)) - set(foldsList[i]))
    test_index = foldsList[i]
    train_fea, test_fea = feature[train_index], feature[test_index]
    train_lab, test_lab = label[train_index], label[test_index]
    foldLi = copy.deepcopy(foldsList)
    del foldLi[i]  # delete the test fold
    foldL = [x for p in foldLi for x in p]  # merge the lists inside foldLi
    print('for %s time gridSearch process:' % i)
    c_can = np.logspace(-15, 15, 10, base=2)  # candidate values of the SVM parameter C
    n_search = len(c_can)
    bestC = 0      # best C for the current 9 training folds, chosen by the inner CV below
    bestAccur = 0  # best inner-CV validation accuracy, i.e. the accuracy when C = bestC
    for j in range(n_search):  # try each candidate C in turn
        Accur = 0
        for n in range(9):  # inner cross validation with C = c_can[j]
            train_i = list(set(foldL) - set(foldLi[n]))
            test_i = foldLi[n]
            train_f, test_f = feature[train_i], feature[test_i]  # training-set features
            train_l, test_l = label[train_i], label[test_i]      # corresponding labels
            clf = svm.SVC(C=c_can[j], kernel='linear')
            clf.fit(train_f, train_l)
            y_hat = clf.predict(test_f)
            Accur += accuracy_score(test_l, y_hat) / 9
        print(' Accur:%s' % Accur)
        if Accur > bestAccur:  # pick bestC according to the inner-CV results
            bestAccur = copy.deepcopy(Accur)
            bestC = copy.deepcopy(c_can[j])
    print(' Best validation accuracy on current dataset split:', bestAccur)
    print(' Best para C:', bestC)
    # after bestC is found, the outer layer: train and test
    clf = svm.SVC(C=bestC, kernel='linear')
    clf.fit(train_fea, train_lab)
    y_hat = clf.predict(test_fea)
    test_accur_list.append(accuracy_score(test_lab, y_hat))
    print(' test accur:', test_accur_list[i])
    print()
# finally, the averaged result of the 10-fold cross validation
print('average test accur:', sum(test_accur_list) / len(test_accur_list))
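For reference, the same nested scheme can be expressed much more compactly by letting GridSearchCV handle the inner search over C; a sketch under the assumption that feature and label are the numpy arrays loaded above:

import numpy as np
from sklearn import svm
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

param_grid = {'C': np.logspace(-15, 15, 10, base=2)}   # same candidate C values as above
inner_cv = StratifiedKFold(n_splits=9, shuffle=True)   # inner layer: search for the best C
outer_cv = StratifiedKFold(n_splits=10, shuffle=True)  # outer layer: estimate test accuracy
search = GridSearchCV(svm.SVC(kernel='linear'), param_grid, cv=inner_cv)
scores = cross_val_score(search, feature, label, cv=outer_cv)  # nested cross validation
print('average test accur:', scores.mean())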

The above is a single run of 10-fold cross validation. To avoid the extra error introduced by a particular random partition of the samples, the whole procedure should be repeated several times and the mean of the test results taken, e.g. 10 times 10-fold cross validation.
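sklearn's RepeatedStratifiedKFold can generate the repeated partitions directly; a minimal sketch for 10 times 10-fold cross validation, again assuming the feature/label arrays above and, for simplicity, a fixed C with no inner search:

from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold

# 10-fold cross validation repeated over 10 different random partitions
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = []
for train_index, test_index in rskf.split(feature, label):
    clf = svm.SVC(C=1.0, kernel='linear')   # fixed C; plug in the inner search if needed
    clf.fit(feature[train_index], label[train_index])
    scores.append(accuracy_score(label[test_index], clf.predict(feature[test_index])))
print('10x10 CV average test accur:', sum(scores) / len(scores))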

(2) Leave-one-out (LOOCV)

Leave-one-out is a special case of cross validation. If the data set D contains m samples, leave-one-out corresponds to taking k = m. In this case there is only one possible way to partition the samples, with each subset containing exactly one sample, so the result of leave-one-out is not affected by the randomness of the split. In addition, the training set used by leave-one-out differs from the original data set D by only one sample, so the trained model is very close to the model one would obtain from all of D, and the evaluation result is often considered quite accurate. However, when the data set D is large, leave-one-out suffers from high computational cost. In practice, one sample is taken as the test set each time and the rest as the training set (the outer layer). If the model's parameters need to be tuned, 10-fold cross validation can be used during training to determine them (the inner layer). Again taking an SVM classifier as an example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import loadData
import copy
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
feature, label = loadData.loadData()  # self-written function, loads the data
smps = len(label)  # number of samples
test_accur_list = []
for i in range(smps):  # outer loop, LOOCV: one sample for testing, the rest for training
    train_index = list(set(range(0, smps)) - set([i]))
    test_index = [i]
    train_fea, test_fea = feature[train_index], feature[test_index]
    train_lab, test_lab = label[train_index], label[test_index]
    foldLi = []
    ss = StratifiedKFold(n_splits=10, shuffle=True)
    for train_i, test_i in ss.split(train_fea, train_lab):
        print("TEST:", test_i)  # fold indices, relative to the current training subset
        foldLi.append(test_i)
    foldL = [x for p in foldLi for x in p]  # merge the lists inside foldLi
    print('for %s time gridSearch process:' % i)
    c_can = np.logspace(-15, 15, 10, base=2)  # candidate values of the SVM parameter C
    n_search = len(c_can)
    bestC = 0      # best C for the current training data, chosen by the inner CV below
    bestAccur = 0  # best inner-CV validation accuracy, i.e. the accuracy when C = bestC
    for j in range(n_search):  # try each candidate C in turn
        Accur = 0
        for n in range(10):  # inner cross validation with C = c_can[j]
            train_i = list(set(foldL) - set(foldLi[n]))
            test_i = foldLi[n]
            train_f, test_f = train_fea[train_i], train_fea[test_i]  # index the training subset
            train_l, test_l = train_lab[train_i], train_lab[test_i]  # corresponding labels
            clf = svm.SVC(C=c_can[j], kernel='linear')
            clf.fit(train_f, train_l)
            y_hat = clf.predict(test_f)
            Accur += accuracy_score(test_l, y_hat) / 10
        print(' Accur:%s' % Accur)
        if Accur > bestAccur:  # pick bestC according to the inner-CV results
            bestAccur = copy.deepcopy(Accur)
            bestC = copy.deepcopy(c_can[j])
    print(' Best validation accuracy on current dataset split:', bestAccur)
    print(' Best para C:', bestC)
    # after bestC is found, the outer layer: train and test
    clf = svm.SVC(C=bestC, kernel='linear')
    clf.fit(train_fea, train_lab)
    y_hat = clf.predict(test_fea)
    test_accur_list.append(accuracy_score(test_lab, y_hat))
    print(' test accur:', test_accur_list[i])
    print()
# the final LOOCV result
print('average test accur:', sum(test_accur_list) / len(test_accur_list))
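sklearn also ships a LeaveOneOut splitter, so the outer loop above can be generated without building the index lists by hand; a minimal sketch without the inner parameter search, assuming the same feature/label arrays:

from sklearn import svm
from sklearn.model_selection import LeaveOneOut, cross_val_score

loo = LeaveOneOut()                       # each split holds out exactly one sample
clf = svm.SVC(C=1.0, kernel='linear')     # fixed C here; no inner search in this sketch
scores = cross_val_score(clf, feature, label, cv=loo)
print('LOOCV average test accur:', scores.mean())
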
3. Bootstrap method

On the one hand, leave-one-out has a high computational cost; on the other hand, if a larger test set is held out, the trained model deviates somewhat from the model that would be trained on the full data set. The bootstrap method offers a compromise. Suppose the data set D has m samples. Each time, one sample is drawn at random from D with replacement; after m draws we obtain a data set D' containing m samples. Clearly, some samples of D appear in D' several times, while others do not appear at all. The probability that a given sample is never drawn in the m draws is (1 - 1/m)^m, which is about 0.368, so roughly 36.8% of the samples in the original data set D do not appear in D'. In this way, the actually evaluated model and the desired model both use m training samples, while about 1/3 of the samples remain available for testing.
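A quick numerical check of the 0.368 figure (a small sketch; the sample size m below is arbitrary):

import numpy as np

m = 10000                                   # arbitrary sample size for the check
rng = np.random.default_rng(0)
draws = rng.integers(0, m, size=m)          # m draws with replacement from m samples
oob = 1 - len(np.unique(draws)) / m         # fraction of samples never drawn (out-of-bag)
print(oob, (1 - 1 / m) ** m, 1 / np.e)      # all three values are close to 0.368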

The bootstrap method is mainly useful for small data sets, where it is hard to split off an effective training/test set. Because each bootstrap run generates a different training set from the original data, and these training sets are similar but not identical, the method is also useful in ensemble learning algorithms. However, bootstrap sampling changes the distribution of the original data set and introduces estimation bias, so when there is enough data, the hold-out method or cross validation should be preferred.

Sample code:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import random
import loadData
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
feature, label = loadData.loadData()  # self-written function, loads the data
testAccur_list = []
samps = len(label)  # number of samples
# bootstrap sampling, repeated 10 times
for i in range(10):
    train_index = []
    # bootstrap sample: draw samps samples with replacement
    for j in range(samps):
        train_index.append(random.randint(0, samps - 1))
    # samples that were never drawn form the test set
    test_index = list(set(range(0, samps)) - set(train_index))
    print("TRAIN:", train_index, "TEST:", test_index)
    train_fea, test_fea = feature[train_index], feature[test_index]  # training-set features
    train_lab, test_lab = label[train_index], label[test_index]      # corresponding labels
    # cross-validate on the training samples to obtain the parameter C = bestC (omitted here)
    clf = svm.SVC(C=bestC, kernel='linear')
    clf.fit(train_fea, train_lab)
    y_hat = clf.predict(test_fea)
    accur = accuracy_score(test_lab, y_hat)
    print(i, ' time run testAccuracy:', accur)
    testAccur_list.append(accur)
print('10 times run testAccur_list:', testAccur_list)
print('average testAccur value:', sum(testAccur_list) / 10)
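The with-replacement sampling in the loop above can also be done with sklearn.utils.resample; a small sketch (indices only), assuming samps is defined as in the code above:

import numpy as np
from sklearn.utils import resample

all_index = np.arange(samps)
train_index = resample(all_index, replace=True, n_samples=samps)  # bootstrap sample D'
test_index = np.setdiff1d(all_index, train_index)                 # out-of-bag samples for testing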

References:

1. Zhou Zhihua. Machine Learning (《机器学习》).

2. Statnikov A, Aliferis C F, Tsamardinos I, et al. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics, 2004, 21(5): 631-643.
