Datawhale Pandas Study Group: Fourth Check-in
The key takeaways first:

<> 1. Summary of pandas reshaping functions

<> 2. Data reshaping in data preprocessing

<> Why clean and reshape data?

Generally speaking, whether data are collected manually or by sensors, they contain errors or flaws to some degree: there may be duplicate or inaccurate records, many default values, missing values, outliers, and so on. The raw data therefore cannot be analyzed directly, and data exploration and preprocessing usually have to come first. By common experience, preprocessing takes up 70% to 80% of a whole data analysis project. In other words, much of the time spent "doing data analysis" actually goes into data processing: making "dirty" data "clean" and "tidy", and turning it from a form that cannot be analyzed into one that can be analyzed, modeled, and visualized. That is why it is necessary to master data cleaning and reshaping.

<> What is clean data?

Data cleaning and reshaping are this important because the goal is to make the data "clean", but that sounds rather abstract. What exactly is "dirty" data, and what is "clean" data?

"Dirty" data has no clear standard: any dataset that does not meet the needs of your own analysis can be called "dirty". Each dataset may be "dirty" in its own way, but for "clean, tidy" data there is a unified standard.
The currently accepted definition of "clean" data comes down to three points:
1. Variables with the same attribute form a column.
2. Each single observation forms a row.
3. Each individual attribute value must exist independently.
In other words, the data should take this form: each row is an observation, and each column is an attribute (or variable).
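As a minimal sketch of the third point, read here as "one value per cell" (an interpretation, not something the source spells out), the snippet below splits a column that packs two attributes into one value. The table, the column names, and the `_` separator are all hypothetical.

```python
import pandas as pd

# Hypothetical untidy table: "sex_age" packs two attributes into one value.
untidy = pd.DataFrame({
    "name": ["Ann", "Bob"],
    "sex_age": ["female_31", "male_27"],
})

# Split the packed value on "_" so each attribute gets its own column.
parts = untidy["sex_age"].str.split("_", expand=True)
tidy = untidy.drop(columns="sex_age").assign(
    sex=parts[0],
    age=parts[1].astype(int),
)
print(tidy)
```

After the split, every row is still one observation (one person), and every column holds exactly one attribute.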

<> Forms of data

In terms of form, data can be divided into two types: long data and wide data. Generally speaking, most of the data we get is "wide data".

* Wide data: variables of the same type and of different types generally sit side by side as columns. For example, the gender values "male" and "female" each appear as a separate variable.
* Long data: variables of the same type are gathered into a column of their own.

"Wide data" matches the Excel-style tables people work with every day, while "long data" is easier for computers to recognize, store, and compute on. In other words, humans intuitively prefer "wide data" and computers prefer "long data", so we often need to reshape the data, turning "wide data" into the "long data" the computer prefers before feeding it in.
Note that "clean" data has no fixed correspondence with "long data" or "wide data": "clean" data can be either wide or long.
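To make the two forms concrete, here is a small sketch that builds a wide table and converts it to long form with `DataFrame.melt`. The table contents and the column names (`class`, `sex`, `count`) are made up for illustration.

```python
import pandas as pd

# Hypothetical wide table: one row per class, one column per gender value.
wide_df = pd.DataFrame({
    "class": ["A", "B"],
    "male": [12, 15],
    "female": [14, 11],
})

# Wide to long: the "male"/"female" column names become values of a single
# "sex" column, and the cell values move into a "count" column.
long_df = wide_df.melt(id_vars="class", var_name="sex", value_name="count")
print(long_df)
```

The wide table has 2 rows and 3 columns; the long table has 4 rows (one per class and sex combination) and carries the same information.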

<> Types of data reshaping

Since data mainly comes in the two forms "long" and "wide", data reshaping is, as the name suggests, mainly a conversion between these two states. It falls into two types: "long to wide" and "wide to long".
In terms of what each reshaping is used for:

* Wide to long: often used for visualization and similar tasks.
* Long to wide: generally used to build statistical tables, to summarize the original information and quantitative relationships in the data, for presentation, and so on.
Grouping the functions covered in this chapter by the reshaping task they perform gives roughly the following summary:

Python (pandas)
* Wide to long: melt, stack, etc.
* Long to wide: pivot functions (pivot, pivot_table, crosstab), unstack, get_dummies, etc.
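The following sketch exercises the pandas functions listed above on one small long table. The data and column names are hypothetical, and the calls show only the most basic usage of each function.

```python
import pandas as pd

# Hypothetical long table: one row per (class, sex) pair.
long_df = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "sex": ["male", "female", "male", "female"],
    "count": [12, 14, 15, 11],
})

# Long to wide with pivot: one row per class, one column per sex.
wide_df = long_df.pivot(index="class", columns="sex", values="count")

# stack moves the column labels into the row index (wide to long);
# unstack moves them back into columns (long to wide).
stacked = wide_df.stack()
unstacked = stacked.unstack()

# pivot_table also reshapes, but aggregates values that share the same keys.
totals = long_df.pivot_table(index="class", values="count", aggfunc="sum")

# crosstab counts how often each (class, sex) combination occurs.
freq = pd.crosstab(long_df["class"], long_df["sex"])

# get_dummies spreads the categories of one column into indicator columns.
dummies = pd.get_dummies(long_df["sex"])

print(wide_df, stacked, totals, freq, dummies, sep="\n\n")
```

`pivot` requires each index and column pair to be unique, whereas `pivot_table` tolerates duplicates because it aggregates them, which is why both appear in the list.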

In R, the reshape2 and tidyr packages provide similar functions. Specifically, the melt and dcast pair in reshape2 and the gather and spread pair in tidyr can both be used for data reshaping. I prefer tidyr because it is part of the tidyverse family of packages and works well with the others. Grouped by function:

R (tidyr / reshape2, etc.)
* Wide to long: melt, gather.
* Long to wide: dcast, spread.

References:
R Data Science in Practice: Detailed Tool Explanations and Case Studies
Datawhale community Joyful Pandas course, Chapter 4: Reshaping
