- 2020-07-02 08:38
- machine learning
- Python

There are three main methods of model evaluation: the hold-out method, cross-validation, and the bootstrap method.

All three divide the dataset into a training set and a test set, and use the "test error" to approximate the model's generalization error. The test set and the training set should be mutually exclusive as far as possible; only then does the evaluation give a reliable picture of the model's generalization performance.

1. Hold-out method

The hold-out method splits the dataset D directly into two mutually exclusive sets: a training set S and a test set T. The test set T is then used to estimate the model's generalization error.

Typically, 2/3 to 4/5 of the samples are used for training, and the remaining samples are used for testing.

When splitting the samples, the data distribution should be kept as consistent as possible: at the very least the class proportions should be preserved, to avoid introducing extra bias. A split that preserves the class proportions is called "stratified sampling".

Even with the split ratio fixed, there are many ways to partition the dataset D. To make the estimate more stable and reliable, the split and evaluation should be repeated several times and the results averaged.

You can use the StratifiedShuffleSplit class in sklearn, or of course write a simple version yourself.

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[0, 1], [0, 2], [0, 3], [1, 8], [1, 9], [1, 10]])
y = np.array([0, 0, 0, 1, 1, 1])

# Test set takes 0.333 of the data, with stratified sampling, repeated 5 times at random
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.333)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
```

Result: each of the 5 iterations prints the train and test indices of a stratified random split; the exact indices vary from run to run.

2. Cross-validation

Cross-validation divides the dataset D into k mutually exclusive subsets of roughly equal size, again keeping the data distribution as consistent as possible (stratified sampling).

Each time, k-1 subsets are used for training and the remaining one as the test set, so k rounds of training and testing can be performed; the final result is the average of the k test results. This is called "k-fold cross-validation" (k-fold cross validation). The 10-fold cross-validation method is the most common choice; a special case is leave-one-out cross-validation (Leave-One-Out cross validation, LOOCV).
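As a quick sanity check (a minimal sketch, not from the original post), sklearn's cross_val_score runs the whole k-fold loop in one call; for a classifier, the folds are stratified by default:

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
clf = svm.SVC(kernel='linear', C=1)
# cv=10 gives 10-fold cross-validation; folds are stratified for classifiers
scores = cross_val_score(clf, X, y, cv=10)
print(len(scores), scores.mean())  # 10 test results and their average
```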

(1) 10-fold cross-validation

Divide the dataset into 10 folds; each time, take 9 for training and 1 for testing, and finally average the ten test results (the outer cross-validation).

When training on the 9 folds, if the model's parameters need tuning, a further cross-validation can be run inside the training process (the inner layer). For example: of the 9 folds, use 8 for training and the remaining 1 for validation each time, and evaluate the current parameter over the 9 runs. (The accompanying figure, omitted here, illustrated this with 3-fold cross-validation, where C is the parameter to be tuned.)
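This inner/outer scheme can also be sketched compactly with sklearn (a minimal sketch, not the original post's code): GridSearchCV plays the inner tuning loop, and cross_val_score plays the outer evaluation loop:

```python
import numpy as np
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = load_iris(return_X_y=True)
# Inner loop: 3-fold grid search over candidate C values
inner = GridSearchCV(svm.SVC(kernel='linear'),
                     {'C': np.logspace(-3, 3, 7)}, cv=3)
# Outer loop: 10-fold estimate of the tuned model's generalization error
scores = cross_val_score(inner, X, y, cv=10)
print(len(scores))  # 10 outer test results
```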

You can use the StratifiedKFold class in sklearn to partition the dataset and implement 10-fold cross-validation. Below, an SVM classifier is used as the example:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import copy
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import loadData

feature, label = loadData.loadData()  # self-written function that loads the data
smps = len(label)  # number of samples

# Sample indices for each fold
foldsList = []
ss = StratifiedKFold(n_splits=10, shuffle=True)
for train_index, test_index in ss.split(feature, label):
    print("TEST:", test_index)  # index values of this fold
    foldsList.append(test_index)

test_accur_list = []
for i in range(10):  # outer loop: 9 folds for training, 1 for testing
    train_index = list(set(range(0, smps)) - set(foldsList[i]))
    test_index = foldsList[i]
    train_fea, test_fea = feature[train_index], feature[test_index]
    train_lab, test_lab = label[train_index], label[test_index]

    foldLi = copy.deepcopy(foldsList)
    del foldLi[i]  # drop the test fold
    foldL = [x for p in foldLi for x in p]  # merge the remaining folds' index lists
    print('for %s time gridSearch process:' % i)

    c_can = np.logspace(-15, 15, 10, base=2)  # candidate values of the SVM parameter C
    n_search = len(c_can)
    bestC = 0      # best C found by inner cross-validation on the 9 training folds
    bestAccur = 0  # best inner validation accuracy, obtained when C takes bestC
    for j in range(n_search):  # try each candidate C in turn
        Accur = 0
        for n in range(9):  # inner cross-validation with C = c_can[j]
            train_i = list(set(foldL) - set(foldLi[n]))
            test_i = foldLi[n]
            train_f, test_f = feature[train_i], feature[test_i]  # inner training/validation features
            train_l, test_l = label[train_i], label[test_i]      # inner training/validation labels
            clf = svm.SVC(C=c_can[j], kernel='linear')
            clf.fit(train_f, train_l)
            y_hat = clf.predict(test_f)
            Accur += accuracy_score(test_l, y_hat) / 9
        print(' Accur:%s' % Accur)
        if Accur > bestAccur:  # keep the C with the best inner validation accuracy
            bestAccur = Accur
            bestC = c_can[j]
    print(' Best validation accuracy on current dataset split:', bestAccur)
    print(' Best para C:', bestC)

    # With bestC found, run the outer layer: train and test
    clf = svm.SVC(C=bestC, kernel='linear')
    clf.fit(train_fea, train_lab)
    y_hat = clf.predict(test_fea)
    test_accur_list.append(accuracy_score(test_lab, y_hat))
    print(' test accur:', test_accur_list[i])
    print()

# Finally, the 10-fold cross-validation result
print('average test accur:', sum(test_accur_list) / len(test_accur_list))
```

The above implements 10-fold cross-validation once. To avoid introducing extra error from any particular sample split, the whole process should be repeated and the test results averaged, e.g. 10 times 10-fold cross-validation.
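sklearn also provides RepeatedStratifiedKFold for exactly this repetition (a minimal sketch, not from the original post, on a toy dataset): 10 repeats of 10-fold yield 100 train/test runs:

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

X = np.arange(40).reshape(20, 2)
y = np.array([0] * 10 + [1] * 10)

# 10 repeats of stratified 10-fold, each repeat with a fresh random shuffle
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
n_runs = sum(1 for _ in rskf.split(X, y))
print(n_runs)  # 100 train/test runs = 10 x 10-fold
```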

(2) Leave-one-out (LOOCV)

Leave-one-out is a special case of cross-validation. If the dataset D contains m samples, leave-one-out sets k = m. There is only one possible split (each subset contains exactly one sample), so the result is not affected by the randomness of sample partitioning. Moreover, each training set differs from the original dataset D by only one sample, so the trained model is very close to the model trained on all of D, and the evaluation is often considered quite accurate. The drawback is high computational cost when D is large. In practice, each round takes one sample as the test set and the rest as the training set (the outer layer); if the model needs tuning, 10-fold cross-validation can be run during training to determine the parameters (the inner layer). Again with an SVM classifier as the example:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
import loadData

feature, label = loadData.loadData()  # self-written function that loads the data
smps = len(label)  # number of samples
test_accur_list = []

for i in range(smps):  # outer loop, LOOCV: one sample for testing, the rest for training
    train_index = list(set(range(0, smps)) - set([i]))
    test_index = [i]
    train_fea, test_fea = feature[train_index], feature[test_index]
    train_lab, test_lab = label[train_index], label[test_index]

    # Inner 10-fold split of the training data; these indices refer to train_fea
    foldLi = []
    ss = StratifiedKFold(n_splits=10, shuffle=True)
    for train_i, test_i in ss.split(train_fea, train_lab):
        print("TEST:", test_i)  # index values of this fold
        foldLi.append(test_i)
    foldL = [x for p in foldLi for x in p]  # merge the folds' index lists

    print('for %s time gridSearch process:' % i)
    c_can = np.logspace(-15, 15, 10, base=2)  # candidate values of the SVM parameter C
    bestC = 0      # best C found by inner cross-validation on the training data
    bestAccur = 0  # best inner validation accuracy, obtained when C takes bestC
    for j in range(len(c_can)):  # try each candidate C in turn
        Accur = 0
        for n in range(10):  # inner cross-validation with C = c_can[j]
            train_i = list(set(foldL) - set(foldLi[n]))
            test_i = foldLi[n]
            # Index into train_fea/train_lab: the inner indices are relative to the training set
            train_f, test_f = train_fea[train_i], train_fea[test_i]
            train_l, test_l = train_lab[train_i], train_lab[test_i]
            clf = svm.SVC(C=c_can[j], kernel='linear')
            clf.fit(train_f, train_l)
            y_hat = clf.predict(test_f)
            Accur += accuracy_score(test_l, y_hat) / 10
        print(' Accur:%s' % Accur)
        if Accur > bestAccur:  # keep the C with the best inner validation accuracy
            bestAccur = Accur
            bestC = c_can[j]
    print(' Best validation accuracy on current dataset split:', bestAccur)
    print(' Best para C:', bestC)

    # With bestC found, run the outer layer: train and test
    clf = svm.SVC(C=bestC, kernel='linear')
    clf.fit(train_fea, train_lab)
    y_hat = clf.predict(test_fea)
    test_accur_list.append(accuracy_score(test_lab, y_hat))
    print(' test accur:', test_accur_list[i])
    print()

# Finally, the LOOCV result
print('average test accur:', sum(test_accur_list) / len(test_accur_list))
```
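For the outer leave-one-out split itself, sklearn also provides a ready-made LeaveOneOut class (a minimal sketch, not from the original post); it yields exactly m splits, one per sample:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.array([[0, 1], [0, 2], [0, 3], [1, 8], [1, 9], [1, 10]])
y = np.array([0, 0, 0, 1, 1, 1])

loo = LeaveOneOut()
print(loo.get_n_splits(X))  # m splits, one per sample
for train_index, test_index in loo.split(X):
    pass  # each test_index holds exactly one sample
```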

3. Bootstrap method

On the one hand, leave-one-out is computationally expensive; on the other hand, reserving a larger test set means the trained model deviates somewhat from the model trained on all of D. The bootstrap method offers a good compromise. Suppose the dataset D has m samples. Each time, one sample is drawn at random from D (with replacement); after m draws we obtain a dataset D' containing m samples. Clearly, some samples of D appear in D' several times, while others never appear. The probability that a given sample is never drawn in m draws is about 0.368, so roughly 36.8% of the samples of the initial dataset D do not appear in D'. In this way, both the actually evaluated model and the desired model are trained on m samples, and about 1/3 of the samples, which never appear in D', remain available for testing.
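The 0.368 figure is easy to check numerically (a quick sketch, not from the original post): draw m indices with replacement and count the fraction never drawn:

```python
import random

m = 100000
drawn = {random.randrange(m) for _ in range(m)}  # m draws with replacement
oob = 1 - len(drawn) / m  # fraction of samples never drawn ("out of bag")
print(round(oob, 3))  # close to (1 - 1/m)**m, which tends to 1/e ≈ 0.368
```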

The bootstrap method is most useful for small datasets, where it is hard to carve out an effective training/test split. Since each bootstrap round generates a different training set from the original data (similar, but not identical), it is also useful in ensemble learning algorithms. However, bootstrap sampling changes the distribution of the initial dataset and introduces estimation bias, so when there is enough data, prefer the hold-out method or cross-validation.

Sample code:

```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import random
import numpy as np
from sklearn import svm
from sklearn.metrics import accuracy_score
import loadData

feature, label = loadData.loadData()  # self-written function that loads the data
testAccur_list = []
samps = len(label)

# Bootstrap sampling, repeated 10 times
for i in range(10):
    train_index = []
    # Sampling with replacement: draw samps samples at random
    for _ in range(samps):
        train_index.append(random.randint(0, samps - 1))
    test_index = list(set(range(0, samps)) - set(train_index))
    print("TRAIN:", train_index, "TEST:", test_index)  # index values
    train_fea, test_fea = feature[train_index], feature[test_index]
    train_lab, test_lab = label[train_index], label[test_index]

    '''
    Cross-validate on the training samples to obtain the parameter: C = bestC
    '''
    clf = svm.SVC(C=bestC, kernel='linear')
    clf.fit(train_fea, train_lab)
    y_hat = clf.predict(test_fea)
    accur = accuracy_score(test_lab, y_hat)
    print(i, ' time run testAccuracy:', accur)
    testAccur_list.append(accur)

print('10 times run testAccur_list:', testAccur_list)
print('average testAccur value:', sum(testAccur_list) / 10)
```

References:

1. 《Machine Learning》, Zhou Zhihua.

2. Statnikov A, Aliferis C F, Tsamardinos I, et al. A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis[J]. Bioinformatics, 2004, 21(5): 631-643.
