In engineering practice, the data we receive often contains missing values, duplicate values, and similar defects, so it must be preprocessed before use. There is no standard preprocessing pipeline; the steps usually vary with the task and the attributes of the data set. Common steps are: removing unique attributes, handling missing values, feature (attribute) encoding, data standardization and regularization, and feature selection.
Overview of data preprocessing methods:
1, Removing unique attributes:
Unique attributes are usually ID-like attributes. Because they do not describe the distribution of the samples themselves, they can simply be deleted.
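A minimal sketch of this step, assuming the data is a list of dictionaries and that "unique" means every sample has a different value in that column. The function name drop_unique_columns and the sample rows are made up for illustration. A continuous numeric feature can also have all-distinct values, so in practice the check should probably be limited to integer or string columns.

```python
def drop_unique_columns(rows):
    """Drop columns whose values are all distinct (ID-like columns)."""
    if not rows:
        return rows
    n = len(rows)
    # A column is ID-like when it has as many distinct values as samples.
    id_like = {
        col for col in rows[0]
        if len({row[col] for row in rows}) == n
    }
    return [{k: v for k, v in row.items() if k not in id_like} for row in rows]


rows = [
    {"id": 101, "color": "red", "size": 1},
    {"id": 102, "color": "blue", "size": 1},
    {"id": 103, "color": "red", "size": 2},
]
cleaned = drop_unique_columns(rows)  # the "id" column is removed
```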
2, Handling missing values:
There are three ways to handle missing values: use the features with missing values as they are; delete the features that have missing values; or fill in (impute) the missing values.
Common imputation methods: mean filling, class-mean filling, the K-nearest-neighbor method, regression, and expectation maximization (EM).
① Mean filling: The attributes in the initial data set are split into numeric and non-numeric attributes, which are handled separately. If the missing value is numeric, it is filled with the mean of that attribute over all other objects; if it is non-numeric, it is filled, following the statistical mode principle, with the value that attribute takes most often among all other objects.
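Mean and mode filling can be sketched with the standard library alone. The helper name fill_column and the use of None to mark a missing value are assumptions made for this example.

```python
from statistics import mean, mode


def fill_column(values):
    """Fill None entries with the mean (numeric) or the mode (non-numeric)."""
    observed = [v for v in values if v is not None]
    numeric = all(isinstance(v, (int, float)) for v in observed)
    # Numeric attribute: mean of the observed values.
    # Non-numeric attribute: most frequent observed value.
    fill = mean(observed) if numeric else mode(observed)
    return [fill if v is None else v for v in values]


ages = fill_column([20, None, 40])
colors = fill_column(["red", "blue", None, "red"])
```

A column with no observed values at all has no mean or mode, so this sketch raises an error in that case rather than guessing.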
② Class-mean filling: First classify the samples, then fill each missing value with the mean of that attribute over the samples in the same class.
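A short NumPy sketch of class-mean filling for a single numeric attribute. It assumes the class of each sample is already known and passed in as labels; the function name fill_by_class_mean is hypothetical, and missing values are assumed to be stored as NaN.

```python
import numpy as np


def fill_by_class_mean(x, labels):
    """Fill NaN entries of x with the mean of x within the same class."""
    x = np.asarray(x, dtype=float).copy()
    labels = np.asarray(labels)
    for c in np.unique(labels):
        in_class = labels == c
        # Mean over the observed (non-NaN) values of this class only.
        class_mean = np.nanmean(x[in_class])
        x[in_class & np.isnan(x)] = class_mean
    return x


x = fill_by_class_mean(
    [1.0, np.nan, 3.0, 10.0, np.nan],
    ["a", "a", "a", "b", "b"],
)
```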
③ K-nearest-neighbor method: First use Euclidean distance to find the K samples nearest to the sample with missing data, then combine the values of those K samples (typically as a weighted average) to estimate the sample's missing data.
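The idea can be sketched in NumPy as follows. Several choices here are assumptions rather than part of the method's definition: neighbors are drawn only from rows with no missing values, distance is computed over the attributes the incomplete row actually has, and the neighbors are combined with inverse-distance weights. The function name knn_impute is hypothetical.

```python
import numpy as np


def knn_impute(X, k=2):
    """Fill NaN entries using the k nearest complete rows."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i, row in enumerate(X):
        miss = np.isnan(row)
        if not miss.any():
            continue
        # Euclidean distance over the attributes this row actually has.
        diff = complete[:, ~miss] - row[~miss]
        dist = np.sqrt((diff ** 2).sum(axis=1))
        nearest = np.argsort(dist)[:k]
        # Inverse-distance weights; the small constant avoids division by zero.
        weights = 1.0 / (dist[nearest] + 1e-9)
        filled[i, miss] = weights @ complete[nearest][:, miss] / weights.sum()
    return filled


X = [[1.0, 1.0], [2.0, 2.0], [10.0, 10.0], [1.5, np.nan]]
filled = knn_impute(X, k=2)
```

In this toy data the two nearest rows are equally far from the incomplete row, so the estimate is the plain average of their values.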
④ Regression: Fit a regression equation on the complete part of the data set. For an object with a missing value, substitute its known attribute values into the equation to estimate the unknown attribute value, and fill in that estimate. If the relationship between the variables is not linear, a linear regression equation gives biased estimates.
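A minimal sketch of regression filling, assuming an ordinary least-squares linear model with an intercept and one target column to fill. The function name regression_impute and its target parameter are made up for this example.

```python
import numpy as np


def regression_impute(X, target):
    """Fill NaN entries in column `target` with linear-regression predictions."""
    X = np.asarray(X, dtype=float)
    filled = X.copy()
    others = [j for j in range(X.shape[1]) if j != target]
    complete = ~np.isnan(X).any(axis=1)
    # Fit target ~ intercept + other attributes on the complete rows.
    A = np.column_stack([np.ones(complete.sum()), X[complete][:, others]])
    coef, *_ = np.linalg.lstsq(A, X[complete, target], rcond=None)
    # Predict only rows that miss the target but have all predictors.
    need = np.isnan(X[:, target]) & ~np.isnan(X[:, others]).any(axis=1)
    B = np.column_stack([np.ones(need.sum()), X[need][:, others]])
    filled[need, target] = B @ coef
    return filled


X = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [4.0, np.nan]]
filled = regression_impute(X, target=1)  # the data follows y = 2x + 1
```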
⑤ Expectation maximization (EM): The EM algorithm is an iterative algorithm for computing maximum likelihood estimates from incomplete data. Each iteration alternates two steps. E step (expectation step): compute the conditional expectation of the complete-data log-likelihood, given the observed data and the parameter estimates from the previous iteration. M step (maximization step): choose the parameter values that maximize this expected log-likelihood, and use them in the next iteration. The algorithm alternates between the E step and the M step until it converges, that is, until the change in the parameters between two successive iterations falls below a given threshold. This method may get stuck in a local extremum, its convergence is not very fast, and the computation is fairly complex.
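As an illustration only, here is a sketch of EM imputation under one specific assumption that the text does not make: the samples follow a multivariate Gaussian distribution. The E step fills each missing entry with its conditional mean given the observed entries and keeps the conditional covariance; the M step re-estimates the mean and covariance. The function name em_impute and the parameters n_iter and tol are hypothetical.

```python
import numpy as np


def em_impute(X, n_iter=100, tol=1e-6):
    """EM imputation assuming multivariate Gaussian data (NaN = missing)."""
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    missing = np.isnan(X)
    # Start from per-column means and a diagonal covariance.
    mu = np.nanmean(X, axis=0)
    sigma = np.diag(np.nanvar(X, axis=0))
    filled = np.where(missing, mu, X)
    for _ in range(n_iter):
        # E step: conditional mean of the missing part of each row, plus
        # the conditional covariance needed by the M step.
        cov_correction = np.zeros((d, d))
        for i in range(n):
            m = missing[i]
            if not m.any():
                continue
            o = ~m
            if not o.any():
                filled[i] = mu
                cov_correction += sigma
                continue
            s_mo = sigma[np.ix_(m, o)]
            coef = s_mo @ np.linalg.pinv(sigma[np.ix_(o, o)])
            filled[i, m] = mu[m] + coef @ (X[i, o] - mu[o])
            cov_correction[np.ix_(m, m)] += sigma[np.ix_(m, m)] - coef @ s_mo.T
        # M step: maximum likelihood estimates from the expected statistics.
        new_mu = filled.mean(axis=0)
        centered = filled - new_mu
        new_sigma = (centered.T @ centered + cov_correction) / n
        # Stop when the parameters change by less than the threshold.
        change = max(np.abs(new_mu - mu).max(), np.abs(new_sigma - sigma).max())
        mu, sigma = new_mu, new_sigma
        if change < tol:
            break
    return filled, mu, sigma


X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan], [5.0, 10.0]]
filled, mu, sigma = em_impute(X)
```

The convergence check covers both the mean and the covariance, because with mean-filled starting values the mean alone does not change in the first iteration.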
3, Feature (attribute) encoding:
① Feature binarization: Binarization converts a numeric attribute into a Boolean attribute by choosing a threshold that serves as the cut point between the attribute values 0 and 1.
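Binarization is a one-line comparison in NumPy. The sketch below assumes that values strictly greater than the threshold map to 1 and all others to 0; which side the threshold itself falls on is a choice, not something the text specifies.

```python
import numpy as np


def binarize(x, threshold):
    """Map values above the threshold to 1 and all others to 0."""
    return (np.asarray(x) > threshold).astype(int)


flags = binarize([0.2, 0.5, 0.9], threshold=0.5)
```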
② One-hot encoding (One-Hot Encoding): One-hot encoding uses an N-bit state register to encode N possible values. Each state has its own register bit, and only one bit is active at any time.
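A small sketch of one-hot encoding for a list of categorical values. The function name one_hot is hypothetical, and sorting the categories to fix the column order is an assumption made so that the output is reproducible.

```python
import numpy as np


def one_hot(values):
    """Return the sorted categories and an N-column 0/1 matrix."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    encoded = np.zeros((len(values), len(categories)), dtype=int)
    # Exactly one position per row is set to 1.
    encoded[np.arange(len(values)), [index[v] for v in values]] = 1
    return categories, encoded


categories, encoded = one_hot(["red", "green", "red", "blue"])
```

Each row has exactly one active bit, which matches the register description above.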
4, Data standardization and regularization:
Data standardization scales the attributes of the samples to a specified range. There are two reasons to standardize: some algorithms require the samples to have zero mean and unit variance, and differences in order of magnitude between attributes would otherwise distort the result.
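Two common forms of standardization can be sketched in NumPy: z-score scaling (zero mean, unit variance per attribute) and min-max scaling to a chosen range. The function names and the low and high parameters are made up for this example, and constant columns are left unscaled to avoid division by zero.

```python
import numpy as np


def standardize(X):
    """Scale each column to zero mean and unit variance."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant columns unscaled
    return (X - X.mean(axis=0)) / std


def min_max_scale(X, low=0.0, high=1.0):
    """Scale each column linearly into the range [low, high]."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0  # leave constant columns unscaled
    return low + (X - X.min(axis=0)) / span * (high - low)


X = [[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]]
Z = standardize(X)
M = min_max_scale(X)
```

After either transform the two columns are on the same scale, even though the raw values differ by two orders of magnitude.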
Data regularization (more often called normalization) scales a chosen norm of each sample to 1. It operates on one sample at a time: each sample is rescaled to unit norm.
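A minimal sketch of per-sample scaling to unit norm, assuming each row of the matrix is one sample. The function name normalize_rows is hypothetical; all-zero rows are left unchanged because they have no direction to preserve.

```python
import numpy as np


def normalize_rows(X, ord=2):
    """Scale each row (sample) so that its norm of the given order is 1."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, ord=ord, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # keep all-zero rows as they are
    return X / norms


unit = normalize_rows([[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]])
```

Unlike standardization, which works column by column, this operation works row by row.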
5, Feature selection:
Feature selection is the process of choosing a relevant subset of features from a given feature set. There are two main reasons to do it: to mitigate the curse of dimensionality, and to make the learning task easier. Feature selection must make sure that important features are not lost. Common dimensionality reduction methods: SVD, PCA, LDA.
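Of the methods listed, PCA is the easiest to sketch, and it can be computed with an SVD of the centered data. The function name pca and its return values are choices made for this example. Note that PCA builds new features as linear combinations of the original ones rather than picking a subset of them.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its first principal components using an SVD."""
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    components = Vt[:n_components]
    explained = (S ** 2) / (S ** 2).sum()
    return centered @ components.T, components, explained[:n_components]


X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
projected, components, explained = pca(X, n_components=1)
```

The explained-variance ratio gives a rough check that important information is kept: in this toy data the two columns are perfectly correlated, so a single component retains essentially all of the variance.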