Datawhale Pandas Study Group: Fourth Check-in Record
Key takeaways first:

<> 1. Summary of pandas data reshaping functions

<> 2. Data reshaping in data preprocessing

<> Why clean and reshape data?

Generally speaking, whether data is collected manually or by sensors, it contains errors or flaws to some degree: duplicate or inaccurate records, default values, missing values, outliers, and so on. The data we obtain therefore cannot be analyzed directly, and data exploration and preprocessing usually have to come first. A commonly cited rule of thumb is that data preprocessing takes up 70% to 80% of a whole data analysis project. In other words, much of the time spent “doing data analysis” actually goes into data processing: turning “dirty” data into “clean”, “tidy” data, from a form that cannot be analyzed into one that can be analyzed, modeled, and visualized. That is why it is necessary to master data cleaning and reshaping.

<> What is clean data?

Data cleaning and reshaping are this important, and our goal is to make the data “clean”, but that sounds rather abstract. What exactly is “dirty” data, and what is “clean” data?

“Dirty” data has no clear standard: any dataset that does not meet the needs of your analysis can be called “dirty”. Each dataset may be “dirty” in its own way, but for “clean, tidy” data there is a common standard.
The currently accepted definition of “clean” data can be summed up in three points:
1. Variables with the same attribute form a column.
2. Each single observation forms a row.
3. Each attribute value stands on its own.
In other words, the data should take this form: each row is an observation, and each column is an attribute (or variable).
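
As a small illustration of the third point, here is a minimal pandas sketch. The table, its column names, and the "gender/age" packing are all hypothetical, made up only to show one way a value that does not stand on its own can be split into separate columns.

```python
import pandas as pd

# Hypothetical untidy table: one cell packs two attributes ("gender/age").
untidy = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "gender_age": ["F/23", "M/31"],
})

# Split the packed cell so that every attribute gets its own column.
parts = untidy["gender_age"].str.split("/", expand=True)
tidy = pd.DataFrame({
    "name": untidy["name"],
    "gender": parts[0],
    "age": parts[1].astype(int),
})

print(tidy)
```

After the split, each row is still one observation (a person), and each column holds exactly one attribute.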

<> Forms of data

In terms of form, data can be divided into two types, long data and wide data. Generally speaking, most of the data we come across is “wide data”.

* Wide data: variables of the same or different types sit side by side as columns. For example, the gender values “male” and “female” each become a column of their own.
* Long data: variables of the same type are gathered into a single column.

“Wide data” matches the way people usually think of Excel-style tables, while “long data” is easier for computers to recognize, store, and compute on. In other words, humans intuitively prefer “wide data”, whereas computers prefer “long data”, so we often need to reshape the data, turning “wide data” into the “long data” that computers prefer before feeding it in.
Note that “clean” data has no fixed correspondence with “long data” or “wide data”: “clean” data can be either wide or long.
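
The difference between the two forms can be sketched with a small pandas example. The table and its column names (`class`, `male`, `female`, `gender`, `count`) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical wide table: "male" and "female" are values of one variable
# (gender), but each appears as a separate column.
wide = pd.DataFrame({
    "class": ["A", "B"],
    "male": [10, 12],
    "female": [14, 9],
})

# Wide to long: gender becomes a single column and the counts another.
long_df = wide.melt(id_vars="class", var_name="gender", value_name="count")

# Long to wide: pivot brings back one column per gender.
wide_again = long_df.pivot(index="class", columns="gender", values="count")

print(long_df)
print(wide_again)
```

Here the two columns and two rows of counts in the wide table turn into four rows in the long table, one per combination of class and gender.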

<> Types of data reshaping

Since data mainly comes in two forms, “long” and “wide”, data reshaping is, as the name suggests, mainly a conversion between these two states. It falls into two types: “long to wide” and “wide to long”.
In terms of what each kind of reshaping is used for:

* Wide to long: often used for visualization and similar tasks.
* Long to wide: generally used to build statistical tables, summarize the original information and quantitative relationships in the data, prepare presentations, and so on.
So if we group the functions covered in this chapter by the reshaping task they perform, they can be roughly summarized as follows:

Python pandas
Wide to long: melt, stack, etc.
Long to wide: pivot table functions (pivot, pivot_table, crosstab), unstack, get_dummies, etc.
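
The functions listed above can be tried out on a small table. This is a minimal sketch on a hypothetical long-format dataset (the `student`, `subject`, and `score` columns are made up for illustration), not a full tour of each function's parameters.

```python
import pandas as pd

# Hypothetical long-format records: one row per (student, subject) score.
scores = pd.DataFrame({
    "student": ["Ann", "Ann", "Bob", "Bob"],
    "subject": ["math", "art", "math", "art"],
    "score": [90, 85, 70, 95],
})

# Long to wide with pivot: one row per student, one column per subject.
wide = scores.pivot(index="student", columns="subject", values="score")

# Wide to long with stack: the column labels move into the row index.
stacked = wide.stack()

# unstack reverses stack and gives a wide table again.
unstacked = stacked.unstack()

# pivot_table aggregates while reshaping (here, the mean score per subject).
mean_by_subject = scores.pivot_table(index="subject", values="score", aggfunc="mean")

# crosstab counts how often each (student, subject) pair occurs.
counts = pd.crosstab(scores["student"], scores["subject"])

# get_dummies spreads a categorical column into indicator columns.
dummies = pd.get_dummies(scores["subject"])

print(wide, stacked, mean_by_subject, counts, dummies, sep="\n\n")
```

Roughly speaking, pivot only rearranges values and expects each index/column pair to be unique, while pivot_table and crosstab also aggregate, which is why they suit summary tables.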

In R, the reshape2 and tidyr packages offer similar functionality. Specifically, the melt and dcast pair in reshape2 and the gather and spread pair in tidyr can both be used for data reshaping. I prefer tidyr because it is part of the tidyverse family of packages and works well with the others. In terms of what the functions do:

R (tidyr / reshape2, etc.)
Wide to long: melt, gather.
Long to wide: dcast, spread.

References:
R Detailed explanation and case analysis of data science combat tools
Chapter 4 (Reshaping) of the Datawhale community Joyful Pandas course
