Outlier analysis is a prerequisite for ensuring data quality in data mining. It belongs to the data exploration stage of data processing. In short, finding the outliers in the data helps keep the final model stable.
Common methods for outlier analysis include:
1. Simple statistical analysis
We can start with descriptive statistics of the collected data. The most commonly used are the maximum and the minimum, which show whether a variable goes beyond what common sense allows. For example, if we compute statistics on a column holding a person's age and get a minimum of -1 and a maximum of 130, something is clearly wrong. In the outlier analysis phase, such records need to be filtered out.
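The age check described above can be sketched with Pandas. This is only a minimal illustration: the `people` table, its values, and the limits `AGE_MIN` and `AGE_MAX` are assumptions chosen to mirror the -1 and 130 example, not data from the original study.

```python
import pandas as pd

# Hypothetical sample data; -1 and 130 mirror the example in the text.
people = pd.DataFrame({"age": [25, 41, -1, 67, 130, 33]})

# Assumed plausibility limits for a human age; pick them from domain knowledge.
AGE_MIN, AGE_MAX = 0, 120

# Step 1: look at the extremes of the column.
print("min:", people["age"].min(), "max:", people["age"].max())

# Step 2: flag the rows that fall outside the plausible range.
implausible = (people["age"] < AGE_MIN) | (people["age"] > AGE_MAX)

# Step 3: keep only the plausible rows.
cleaned = people[~implausible]
print(cleaned)
```

The limits are a judgment call; the point is only that the minimum and maximum make such impossible values easy to notice.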
2. Box plot analysis
A box plot is the most intuitive way to spot abnormal values. Candidate outliers lie above the upper quartile or below the lower quartile; by the usual convention, a value is flagged when it is more than 1.5 times the interquartile range beyond either quartile. Of course, not every value in those regions is an outlier, but every outlier must come from them.
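The quartile-based rule can also be computed directly, without drawing anything. The sketch below assumes the common 1.5 x IQR convention; the helper name `iqr_bounds` and the sample values are hypothetical.

```python
import pandas as pd


def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds as Q1 - k*IQR and Q3 + k*IQR."""
    q_low = values.quantile(0.25)   # lower quartile
    q_high = values.quantile(0.75)  # upper quartile
    iqr = q_high - q_low            # interquartile range
    return q_low - k * iqr, q_high + k * iqr


# Hypothetical daily sales figures, for illustration only.
sales = pd.Series([22.0, 2400.0, 2600.0, 2700.0, 2800.0, 3000.0, 3200.0, 9100.0])

lower, upper = iqr_bounds(sales)
# Values outside [lower, upper] are the candidates a box plot would flag.
flagged = sales[(sales < lower) | (sales > upper)]
print(lower, upper)
print(flagged)
```

A larger `k` flags fewer points; 1.5 is a convention rather than a rule, so the flagged values still need to be reviewed.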
To get a first impression of the data, read it in with Python's Pandas library and call the describe() function to view its basic statistics. The output covers several properties of the data: the count of non-null values (from which the number of missing values can be derived), the minimum, the maximum, and so on.
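As a rough illustration of what describe() returns, the sketch below uses a small hypothetical table instead of the real file; the column name `sales` and all values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sales table; one value is missing on purpose.
data = pd.DataFrame({"sales": [2618.2, 2608.4, np.nan, 22.0, 3614.7, 9106.44]})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = data.describe()
print(summary)

# "count" only includes non-null values, so the number of missing values
# is the total number of rows minus that count.
missing = len(data) - summary.loc["count", "sales"]
print("missing values:", missing)
```

Comparing the count with the number of rows is one simple way to read missing values off the summary; `data.isnull().sum()` gives the same number directly.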
To make the outlier analysis more intuitive, we can draw a box plot.
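# -*- coding: utf-8 -*-
"""
Created on Tue Apr 10 20:58:14 2018

@author: Administrator
"""
import matplotlib.pyplot as plt
import pandas as pd

catering_sale = 'E:/catering_sale.xls'
# Read the data with the "date" column as the index
# (the name must match the column header in the file).
data = pd.read_excel(catering_sale, index_col='date')

plt.rcParams['font.sans-serif'] = ['SimHei']  # font able to render Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly

plt.figure()  # create the figure
# Draw the box plot; return_type='dict' gives access to the plotted artists.
p = data.boxplot(return_type='dict')
# 'fliers' holds the points drawn beyond the whiskers (one Line2D per column).
x = p['fliers'][0].get_xdata()
y = sorted(p['fliers'][0].get_ydata())

# Label each flier with its value.
for i in range(len(x)):
    if i > 0:
        # Assumed offset (not given in the original listing): alternate sides
        # so that labels of nearby points do not overlap.
        dx = 0.08 if i % 2 == 0 else -0.25
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + dx, y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))
plt.show()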
The code above draws the box plot. The resulting plot can be read as follows:
Data between the lower and upper bounds is considered normal; on the box plot this is the region between the two horizontal lines. The two regions outside these lines are where outliers occur, and here the box plot shows eight such points. Points close to the bounds, such as 4065.2 and 4060.3, differ little from the upper bound and can also be treated as normal data. Points far from the bounds, such as 22.0, can be regarded as abnormal.
The remaining question is the criterion for an outlier: we need to decide in which interval the data counts as normal. For example, if we treat values in the interval 400-5000 as normal, the rest of the data can be regarded as abnormal.
Once the abnormal data has been identified, it can be deleted.
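The final filtering step might look like the sketch below. It assumes the 400-5000 interval from the example; the table and the column name `sales` are hypothetical, and only 22.0, 4060.3 and 4065.2 are values mentioned in the text.

```python
import pandas as pd

# Hypothetical sales values; 22.0, 4060.3 and 4065.2 come from the text above.
sales = pd.DataFrame(
    {"sales": [22.0, 51.0, 865.0, 2618.2, 4060.3, 4065.2, 6607.4, 9106.44]}
)

# Assumed interval for normal data, taken from the 400-5000 example.
LOW, HIGH = 400, 5000

# between() includes both endpoints by default.
is_normal = sales["sales"].between(LOW, HIGH)

outliers = sales[~is_normal]  # rows to inspect before deleting
cleaned = sales[is_normal]    # rows kept for modelling

print(outliers)
print(cleaned)
```

Keeping the rejected rows in `outliers` rather than discarding them immediately makes it easier to check whether the chosen interval is too strict.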