- 2020-08-04 18:58
- data mining

Outlier analysis is a prerequisite for ensuring data quality in data mining. It belongs to the data exploration stage of data processing. In short, finding the outliers in the data helps keep the final model stable.

There are three main methods for outlier analysis; two of them are covered here:

1. Simple statistical analysis

We can start by computing descriptive statistics for the collected data. The most commonly used are the maximum and the minimum, which show whether a variable falls outside a plausible range. For example, suppose we compute statistics for a person's age column and get a minimum of -1 and a maximum of 130. Both values are clearly unreasonable, so in the outlier analysis stage such records need to be dropped.
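The check described above can be sketched in a few lines of pandas. The sample values, the `age` column name, and the plausible range of 0 to 120 are assumptions made for illustration; they are not taken from any real dataset.

```python
import pandas as pd

# Hypothetical sample data; the 'age' column name and values are illustrative only.
people = pd.DataFrame({"age": [25, 37, -1, 52, 130, 41]})

# The minimum and maximum quickly reveal implausible values.
age_min = people["age"].min()
age_max = people["age"].max()
print(age_min, age_max)

# Assumed plausible range for a human age; adjust it to your own domain knowledge.
LOW, HIGH = 0, 120

# Keep only the rows whose age lies inside the plausible range (bounds inclusive).
cleaned = people[people["age"].between(LOW, HIGH)]
print(cleaned["age"].tolist())
```

Whether to drop such rows or correct them depends on how the bad values arose; dropping is simply the option the text mentions.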

2. Box plot analysis

A box plot is the most intuitive way to spot abnormal values. Outliers can only appear above the upper quartile or below the lower quartile. Of course, not every value in those regions is an outlier, but every outlier lies there. The usual convention flags values that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles.
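To make the box plot rule concrete, here is a minimal sketch of the common 1.5 × IQR convention. The helper name `iqr_bounds` and the sample values are hypothetical, and the factor 1.5 is the usual default rather than something fixed by the article.

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the (lower, upper) fences used by the k * IQR box plot rule."""
    q1 = values.quantile(0.25)  # lower quartile
    q3 = values.quantile(0.75)  # upper quartile
    iqr = q3 - q1               # interquartile range
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical sales figures, for illustration only.
sales = pd.Series([22.0, 2500.0, 2600.0, 2700.0, 2800.0, 2900.0, 3000.0, 9000.0])

lower, upper = iqr_bounds(sales)
# Values outside the fences are the points a box plot would draw as fliers.
outliers = sales[(sales < lower) | (sales > upper)]
print(lower, upper)
print(outliers.tolist())
```

Values between a quartile and its fence sit in the whisker region: outside the box, but not flagged as outliers.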

To get a first impression of the data, read it in with Python's pandas library and call the describe() function to view its basic statistics. The summary covers several properties of the data, such as the count of non-missing values (from which the number of missing values can be inferred), the minimum, and the maximum.
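As a small illustration of what `describe()` reports, the sketch below uses a made-up `sales` column with one missing value. The column name and numbers are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value; not taken from the article's file.
sales = pd.DataFrame({"sales": [2618.2, 2608.4, np.nan, 2651.9, 3442.1]})

# describe() reports count, mean, std, min, quartiles and max per numeric column.
summary = sales.describe()
print(summary)

# 'count' excludes NaN, so the number of missing values is the difference
# between the row count and the reported count.
missing = len(sales) - int(summary.loc["count", "sales"])
print(missing)
```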

To make the outlier analysis more intuitive, we can draw a box plot.

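```python
# -*- coding: utf-8 -*-
"""
Created on Tue Apr 10 20:58:14 2018

@author: Administrator
"""
import pandas as pd
import matplotlib.pyplot as plt

catering_sale = 'E:/catering_sale.xls'
# Read the data, using the "date" column as the index
# (column name as given in the source; adjust it to match your file's header)
data = pd.read_excel(catering_sale, index_col='date')

plt.rcParams['font.sans-serif'] = ['SimHei']  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False    # render minus signs correctly

plt.figure()  # create the figure
# Draw the box plot; return_type='dict' exposes the plot's line objects
p = data.boxplot(return_type='dict')
# 'fliers' holds the outlier markers
x = p['fliers'][0].get_xdata()
y = p['fliers'][0].get_ydata()
y.sort()  # sort so each label can be offset relative to the previous outlier

# Annotate each outlier with its value
for i in range(len(x)):
    if i > 0:
        # Shift the label horizontally based on the gap to the previous outlier
        plt.annotate(y[i], xy=(x[i], y[i]),
                     xytext=(x[i] + 0.05 - 0.8 / (y[i] - y[i - 1]), y[i]))
    else:
        plt.annotate(y[i], xy=(x[i], y[i]), xytext=(x[i] + 0.08, y[i]))

plt.show()
```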

The code above draws the box plot. The resulting plot can be read as follows:

Data between the lower and upper bounds is considered normal; on the box plot this is the part between the two horizontal lines. The two regions outside those lines are where outliers occur, and the box plot shows eight such points here. We can reason that points close to the bounds, such as 4065.2 and 4060.3, differ little from the upper bound and can also be treated as normal data. Points far from the bounds, such as 22.0, can be regarded as abnormal.

The remaining problem is to fix a criterion for what counts as an outlier, that is, to decide in which interval the data is normal. For example, if we treat data in the interval 400-5000 as normal, then everything outside it can be considered abnormal.

Once the abnormal data has been identified and analyzed, it can be deleted.
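The interval criterion and the deletion step can be sketched as follows, using the 400-5000 interval from the example above. The `sales` column name and the sample values are assumptions for illustration and do not reproduce the article's dataset.

```python
import pandas as pd

# Hypothetical sales values, for illustration only.
sales = pd.DataFrame({"sales": [22.0, 865.0, 2618.2, 4065.2, 6607.4]})

# Interval treated as normal in the text's example.
LOW, HIGH = 400, 5000

# Boolean mask: True where the value lies inside the normal interval (inclusive).
is_normal = sales["sales"].between(LOW, HIGH)

abnormal = sales[~is_normal]  # rows to inspect before removing them
cleaned = sales[is_normal]    # data with the abnormal rows deleted

print(abnormal["sales"].tolist())
print(cleaned["sales"].tolist())
```

Keeping the abnormal rows in a separate frame makes it easy to review them before they are discarded.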
