Datawhale Pandas Study Group: Fourth Check-in
Key takeaways first:
<> 1. Summary of pandas data reshaping functions
<> 2. Data reshaping in data preprocessing
<> Why clean and reshape data?
Generally speaking, whether data is collected manually or by sensors, it contains errors or flaws to some degree: duplicate or inaccurate records, many default values, missing values, outliers, and so on. The data we receive therefore usually cannot be analyzed directly; data exploration and preprocessing have to come first. By common experience, preprocessing takes up 70% to 80% of a whole data analysis project. In other words, much of the time spent "doing data analysis" actually goes into data processing: turning "dirty" data into "clean", "tidy" data, from a form that cannot be analyzed into one that can be analyzed, modeled, and visualized. That is why it is necessary to master data cleaning and reshaping.
<> What is clean data?
Since data cleaning and reshaping are so important, and the goal is to make the data "clean", the natural question is what that means. What exactly is "dirty" data, and what is "clean" data?
There is no fixed standard for "dirty" data. Any dataset that does not meet the needs of the analysis at hand can be called "dirty", and each dataset can be dirty in its own way. For "clean, tidy" data, however, there is a shared standard.
The commonly accepted definition of "clean" data comes down to three points:
1. Each variable (values of the same attribute) forms a column.
2. Each observation forms a row.
3. Each attribute value stands on its own, in its own cell.
In other words, the data should be laid out so that each row is an observation and each column is an attribute (or variable).
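As a minimal sketch of the third point, assume a hypothetical table in which two attributes are packed into one column. The column names and values below are made up for illustration. Splitting the packed column gives each attribute its own column:

```python
import pandas as pd

# Hypothetical messy table: height and weight share a single column.
messy = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "height_weight": ["165/52", "180/75"],
})

# Split the packed column so that each attribute gets its own column.
parts = messy["height_weight"].str.split("/", expand=True).astype(int)
parts.columns = ["height_cm", "weight_kg"]

# Each row is still one observation; each column is now one attribute.
tidy = pd.concat([messy[["name"]], parts], axis=1)
print(tidy)
```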
<> Forms of data
In terms of layout, data can be divided into two types: long data and wide data. Most of the data we come across is "wide".
* Wide data: values of the same variable are spread across several columns. For example, the gender values "male" and "female" each appear as a separate column.
* Long data: values of the same variable are gathered into a single column.
Wide data matches the way people usually read Excel-style tables, while long data is easier for a computer to recognize, store, and compute on. In other words, humans intuitively prefer wide data and computers prefer long data, so we often need to reshape wide data into long data before feeding it to the computer.
Note that there is no fixed correspondence between "clean" data and long or wide data: clean data can be either wide or long.
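The gender example can be sketched as follows. The table and its counts are hypothetical; `melt` gathers the "male" and "female" columns into a single "gender" column, turning the wide table into a long one:

```python
import pandas as pd

# Hypothetical wide table: one column per gender value.
wide = pd.DataFrame({
    "class": ["A", "B"],
    "male": [12, 9],
    "female": [10, 14],
})

# Wide to long: the former column labels become values of "gender",
# and the cell values move into a single "count" column.
long = wide.melt(id_vars="class", var_name="gender", value_name="count")
print(long)
```

Both tables hold the same information. The long one has one row per (class, gender) pair instead of one row per class.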
<> Types of data reshaping
Since data mainly comes in the two forms "long" and "wide", reshaping is, as the name suggests, mostly a conversion between these two states. It falls into two types: "long to wide" and "wide to long".
In terms of what each conversion is used for:
* Wide to long: often used for visualization and similar tasks.
* Long to wide: generally used to build statistical tables that summarize the original information and quantitative relationships in the data, for presentation and similar purposes.
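As an illustration of the long-to-wide use case, the sketch below builds a small summary table with `pivot_table`. The records are hypothetical, and taking the mean is just one possible aggregation:

```python
import pandas as pd

# Hypothetical long-format records: one row per score.
scores = pd.DataFrame({
    "class": ["A", "A", "A", "B", "B", "B"],
    "subject": ["math", "math", "art", "math", "art", "art"],
    "score": [80, 90, 70, 60, 85, 95],
})

# Long to wide: one row per class, one column per subject,
# with the mean score in each cell.
summary = scores.pivot_table(
    index="class", columns="subject", values="score", aggfunc="mean"
)
print(summary)
```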
If we group the functions covered in this chapter by the reshaping task they perform, they can be roughly summarized as:
* Wide to long: melt, stack, etc.
* Long to wide: the pivot-table functions (pivot, pivot_table, crosstab), unstack, get_dummies, etc.
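The functions listed above can be tried side by side in a rough sketch like the one below. All tables and names are hypothetical, and only the most basic arguments are used:

```python
import pandas as pd

# Hypothetical wide table indexed by class, one column per gender.
wide = pd.DataFrame(
    {"male": [12, 9], "female": [10, 14]},
    index=pd.Index(["A", "B"], name="class"),
)
wide.columns.name = "gender"

# Wide to long: stack moves the column labels into an inner index level,
# giving a Series indexed by (class, gender).
long = wide.stack()

# Long to wide: unstack moves that index level back into columns.
back = long.unstack()

# pivot does the same job starting from a flat long DataFrame.
flat = long.rename("count").reset_index()
pivoted = flat.pivot(index="class", columns="gender", values="count")

# crosstab counts how often each pair of category values occurs together.
people = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "gender": ["male", "female", "female", "female", "male"],
})
counts = pd.crosstab(people["class"], people["gender"])

# get_dummies spreads one categorical column into indicator columns.
dummies = pd.get_dummies(people["gender"])

print(long, back, pivoted, counts, dummies, sep="\n\n")
```

Note that `pivot` requires each (index, columns) pair to be unique, while `pivot_table` aggregates duplicates.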
R offers similar functionality in the reshape2 and tidyr packages. Specifically, melt and dcast in reshape2, or gather and spread in tidyr, can be combined to reshape data. I prefer tidyr because it is part of the tidyverse family and works well with the other packages. Grouped by function:
R language (tidyr / reshape2, etc.)
* Wide to long: melt, gather.
* Long to wide: dcast, spread.
References:
* R Data Science in Practice: Tools Explained and Case Studies
* Datawhale community Joyful Pandas course, Chapter 4: Reshaping