pandas is very good at handling one-dimensional and two-dimensional data and is widely used in production environments. This article goes straight to the point and explains its two core data structures: Series and DataFrame.

I. Series (one-dimensional labeled array)

A Series is a fixed-length, ordered, dictionary-like sequence. It is fixed-length because it is stored as the equivalent of two ndarrays: one array holds the keys of the object (index) and the other holds its values (values). This is also the biggest difference from a dictionary, whose number of elements is not fixed.

A Series has two basic properties: index and values. By default, the index of a Series is the increasing integer sequence 0, 1, 2, ...
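
As a small illustration of the index/values pairing described above, the sketch below builds a Series without an explicit index and inspects both properties. The variable name `s` and the sample numbers are made up for the example.

```python
import numpy as np
import pandas as pd

s = pd.Series([10, 20, 30])

# With no index given, pandas builds the default 0, 1, 2, ... index.
print(list(s.index))                      # [0, 1, 2]

# The values are held in a NumPy array.
print(isinstance(s.values, np.ndarray))   # True

# Index and values always have the same length.
print(len(s.index) == len(s.values))      # True
```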

1. Creation
from pandas import Series, DataFrame

x1 = Series([1, 2, 3, 4])
x2 = Series(data=[1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
d = {'a': 1, 'b': 2, 'c': 3, 'd': 4}
x3 = Series(d)

print(x1)
# Output of x1:
# 0    1
# 1    2
# 2    3
# 3    4
# dtype: int64

print(x2)
# Output of x2:
# a    1
# b    2
# c    3
# d    4
# dtype: int64

print(x3)
# Output of x3:
# a    1
# b    2
# c    3
# d    4
# dtype: int64

2. Slicing and indexing
import string

import numpy as np
import pandas as pd

t = pd.Series(np.arange(10), index=list(string.ascii_uppercase[:10]))
print(t)
# Output:
# A    0
# B    1
# C    2
# D    3
# E    4
# F    5
# G    6
# H    7
# I    8
# J    9
# dtype: int64

# 1. Slicing; the parameters are start, end, step
se1 = t[2:10:2]
print(se1)
# C    2
# E    4
# G    6
# I    8
# dtype: int64

# 2. Indexing: get the value at position 1; equivalent to se2 = t['B'].
#    .iloc is used because plain t[1] on a label index is deprecated
#    in recent pandas versions.
se2 = t.iloc[1]
print(se2)
# 1

# 3. Indexing: get the values at several positions;
#    equivalent to se3 = t[['C', 'D', 'G']]
se3 = t.iloc[[2, 3, 6]]
print(se3)
# C    2
# D    3
# G    6
# dtype: int64

# 4. Boolean indexing: get the entries whose value is greater than 5
se4 = t[t > 5]
print(se4)
# G    6
# H    7
# I    8
# J    9
# dtype: int64
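
One detail worth seeing next to the slicing examples above: a positional slice excludes its end position, while a label slice includes its end label. The sketch below uses `.iloc` and `.loc` to make the two cases explicit; the Series `s` is a made-up example.

```python
import pandas as pd

s = pd.Series([0, 1, 2, 3, 4], index=list('ABCDE'))

by_position = s.iloc[1:3]    # positions 1 and 2; the end position is excluded
by_label = s.loc['B':'D']    # labels B, C and D; the end label is included

print(by_position.index.tolist())   # ['B', 'C']
print(by_label.index.tolist())      # ['B', 'C', 'D']
```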

II. DataFrame (two-dimensional container of Series)

A DataFrame is a data structure similar to a 2D table. It has both a row index and a column index, and it can be regarded as a dictionary of Series that all share the same index.

* Row index: identifies the rows (the horizontal index); called index, axis 0 (axis=0)
* Column index: identifies the columns (the vertical index); called columns, axis 1 (axis=1)
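
The "dictionary of Series" view and the two axes can be sketched as follows. The names `chinese`, `math`, `col_totals` and `row_totals` are illustrative only; the scores are borrowed from the examples in this article.

```python
import pandas as pd

chinese = pd.Series([66, 95], index=['ZhangSan', 'LiSi'])
math = pd.Series([30, 98], index=['ZhangSan', 'LiSi'])

# A DataFrame built from Series that share the same index.
df = pd.DataFrame({'Chinese': chinese, 'Math': math})

# axis=0 collapses the rows: one result per column.
col_totals = df.sum(axis=0)
# axis=1 collapses the columns: one result per row.
row_totals = df.sum(axis=1)

print(int(col_totals['Chinese']))    # 161
print(int(row_totals['ZhangSan']))   # 96
```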

1. Creation
from pandas import Series, DataFrame

data = {'Chinese': [66, 95, 93, 90, 80],
        'English': [65, 85, 92, 88, 90],
        'Math': [30, 98, 96, 77, 90]}
df1 = DataFrame(data)
df2 = DataFrame(data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua'],
                columns=['English', 'Math', 'Chinese'])

print(df1)
# Output of df1:
#    Chinese  English  Math
# 0       66       65    30
# 1       95       85    98
# 2       93       92    96
# 3       90       88    77
# 4       80       90    90

print('------------------')

print(df2)
# Output of df2:
#           English  Math  Chinese
# ZhangSan       65    30       66
# LiSi           85    98       95
# WangWu         92    96       93
# XiaoMing       88    77       90
# LiHua          90    90       80

2. Basic DataFrame attributes and methods

* df.shape       # (number of rows, number of columns)
* df.dtypes      # data type of each column
* df.ndim        # number of dimensions
* df.index       # row index
* df.columns     # column index
* df.values      # the values, as a two-dimensional ndarray
* df.head(3)     # show the first few rows; the default is 5
* df.tail(3)     # show the last few rows; the default is 5
* df.info()      # overview of the frame (column types, non-null counts, memory usage)
* df.describe()  # quick summary statistics
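
A quick way to get a feel for the attributes listed above is to try them on a tiny frame. This is only a sketch; the frame `df` and its numbers are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({'English': [65, 85, 92], 'Math': [30.0, 98.5, 96.0]})

print(df.shape)              # (3, 2)
print(df.ndim)               # 2
print(list(df.columns))      # ['English', 'Math']
print(df.values.shape)       # (3, 2)
print(len(df.head(2)))       # 2
print(df.dtypes['Math'])     # float64
```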

III. Data processing

1. Data import and export

pandas can import data directly from JSON, TXT, XLSX, CSV and similar files, and it can export to the same formats, which is very convenient.
import pandas as pd

# Read and write an xlsx file
score = pd.read_excel('data.xlsx')
score.to_excel('data1.xlsx')

# Read and write a json file
data = pd.read_json('data.json')
data.to_json('another_data.json')

# Read and write a csv file
data2 = pd.read_csv('data.csv')
data2.to_csv('another_data.csv')

# Read and write a txt file. read_hdf/to_hdf are meant for HDF5 files,
# not plain text, so read_csv/to_csv are used here instead.
# sep='\t' assumes the file is tab-delimited.
data3 = pd.read_csv('./data.txt', sep='\t')
data3.to_csv('./another_data.txt', sep='\t')
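
The CSV case can be tried without any files on disk by writing to an in-memory buffer. This is a minimal sketch: `io.StringIO` stands in for a real file, and the frame `df` is made up.

```python
import io

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b'], 'score': [1, 2]})

buffer = io.StringIO()
df.to_csv(buffer, index=False)   # index=False keeps the row index out of the output
buffer.seek(0)                   # rewind so the buffer can be read from the start

restored = pd.read_csv(buffer)
print(restored['score'].tolist())   # [1, 2]
```

Without `index=False`, the row index is written as an extra unnamed column and comes back as a regular column when the data is read again.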

2. Deleting unneeded columns or rows from a DataFrame

2.1 Deleting a column
data = {'Chinese': [66, 95, 93, 90, 80],
        'English': [65, 85, 92, 88, 90],
        'Math': [30, 98, 96, 77, 90]}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua'],
                columns=['English', 'Math', 'Chinese'])
# df2 contents:
#           English  Math  Chinese
# ZhangSan       65    30       66
# LiSi           85    98       95
# WangWu         92    96       93
# XiaoMing       88    77       90
# LiHua          90    90       80

# Delete the 'Chinese' column
df2 = df2.drop(columns=['Chinese'])
print(df2)
# Result:
#           English  Math
# ZhangSan       65    30
# LiSi           85    98
# WangWu         92    96
# XiaoMing       88    77
# LiHua          90    90
 

2.2 Deleting a row
data = {'Chinese': [66, 95, 93, 90, 80],
        'English': [65, 85, 92, 88, 90],
        'Math': [30, 98, 96, 77, 90]}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua'],
                columns=['English', 'Math', 'Chinese'])
# df2 contents:
#           English  Math  Chinese
# ZhangSan       65    30       66
# LiSi           85    98       95
# WangWu         92    96       93
# XiaoMing       88    77       90
# LiHua          90    90       80

# Delete the 'ZhangSan' row
df2 = df2.drop(index=['ZhangSan'])
print(df2)
# Result:
#           English  Math  Chinese
# LiSi           85    98       95
# WangWu         92    96       93
# XiaoMing       88    77       90
# LiHua          90    90       80

3. Renaming columns or the row index so that column and row names are easier to identify
data = {'Chinese': [66, 95, 93, 90, 80],
        'English': [65, 85, 92, 88, 90],
        'Math': [30, 98, 96, 77, 90]}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua'],
                columns=['English', 'Math', 'Chinese'])
# df2 contents:
#           English  Math  Chinese
# ZhangSan       65    30       66
# LiSi           85    98       95
# WangWu         92    96       93
# XiaoMing       88    77       90
# LiHua          90    90       80

df2.rename(columns={'Chinese': 'Language', 'English': 'English', 'Math': 'Mathematics'},
           inplace=True)
df2.rename(index={'ZhangSan': 'Zhang San'}, inplace=True)
print(df2)
# Result:
#            English  Mathematics  Language
# Zhang San       65           30        66
# LiSi            85           98        95
# WangWu          92           96        93
# XiaoMing        88           77        90
# LiHua           90           90        80

4. Removing duplicate values
data = {'Chinese': [66, 95, 93, 90, 80, 80],
        'English': [65, 85, 92, 88, 90, 90],
        'Math': [30, 98, 96, 77, 90, 90]}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua', 'LiHua'],
                columns=['English', 'Math', 'Chinese'])
# df2 contents:
#           English  Math  Chinese
# ZhangSan       65    30       66
# LiSi           85    98       95
# WangWu         92    96       93
# XiaoMing       88    77       90
# LiHua          90    90       80
# LiHua          90    90       80

# Remove duplicate rows
df2 = df2.drop_duplicates()
print(df2)
# Result:
#           English  Math  Chinese
# ZhangSan       65    30       66
# LiSi           85    98       95
# WangWu         92    96       93
# XiaoMing       88    77       90
# LiHua          90    90       80
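
By default, drop_duplicates compares entire rows. When only some columns should decide what counts as a duplicate, the `subset` and `keep` parameters can help. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'name': ['LiHua', 'LiHua', 'LiSi'],
                   'Math': [90, 70, 98]})

# The two LiHua rows differ in Math, so a full-row comparison keeps all three.
print(len(df.drop_duplicates()))    # 3

# Compare only the 'name' column and keep the last occurrence of each name.
last = df.drop_duplicates(subset=['name'], keep='last')
print(last['Math'].tolist())        # [70, 98]
```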

5. Changing the data format

The astype function can be used to standardize the data format, that is, to convert data types.
data = {'Chinese': [66, 95, 93, 90, 80, 80],
        'English': [65, 85, 92, 88, 90, 90],
        'Math': [30, 98, 96, 77, 90, 90]}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua', 'LiHua'],
                columns=['English', 'Math', 'Chinese'])
# df2 contents:
#           English  Math  Chinese
# ZhangSan       65    30       66
# LiSi           85    98       95
# WangWu         92    96       93
# XiaoMing       88    77       90
# LiHua          90    90       80
# LiHua          90    90       80

# astype returns a new object, so the result is assigned back to the column.
df2['Math'] = df2['Math'].astype('str')
# np.float has been removed from NumPy; the built-in float is used instead.
df2['Math'] = df2['Math'].astype(float)
print(df2)
# Result:
#           English  Math  Chinese
# ZhangSan       65  30.0       66
# LiSi           85  98.0       95
# WangWu         92  96.0       93
# XiaoMing       88  77.0       90
# LiHua          90  90.0       80
# LiHua          90  90.0       80
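
astype also accepts a dictionary that maps column names to target types, which converts several columns in one call. A small sketch with invented data; the name `converted` is illustrative:

```python
import pandas as pd

df = pd.DataFrame({'English': [65, 85], 'Math': [30, 98]})

# One call, one target type per column.
converted = df.astype({'English': 'str', 'Math': 'float64'})

print(converted['Math'].dtype)         # float64
print(converted['English'].tolist())   # ['65', '85']
print(df['Math'].dtype)                # int64 (the original frame is unchanged)
```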

6. Removing whitespace from data
data = {'Chinese': [66, 95, 93, 90, 80, 80],
        'English': [65, 85, 92, 88, 90, 90],
        'Math': [30, 98, 96, 77, 90, 90],
        'address': ['Beijing', 'Chendu', 'Xian', 'Shanghai', 'Beijing', 'Beijing']}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua', 'LiHua'],
                columns=['English', 'Math', 'Chinese', 'address'])
# df2 contents:
#           English  Math  Chinese   address
# ZhangSan       65    30       66   Beijing
# LiSi           85    98       95    Chendu
# WangWu         92    96       93      Xian
# XiaoMing       88    77       90  Shanghai
# LiHua          90    90       80   Beijing
# LiHua          90    90       80   Beijing

# Remove whitespace on both sides
df2['address'] = df2['address'].map(str.strip)
# Remove whitespace on the left
df2['address'] = df2['address'].map(str.lstrip)
# Remove whitespace on the right
df2['address'] = df2['address'].map(str.rstrip)
print(df2)
# Result:
#           English  Math  Chinese   address
# ZhangSan       65    30       66   Beijing
# LiSi           85    98       95    Chendu
# WangWu         92    96       93      Xian
# XiaoMing       88    77       90  Shanghai
# LiHua          90    90       80   Beijing
# LiHua          90    90       80   Beijing
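
The same cleanup can also be written with the `.str` accessor, which skips missing values instead of failing on them (mapping `str.strip` over a column that contains a missing value typically raises a TypeError). The sketch below uses made-up strings with real surrounding whitespace so the effect is visible:

```python
import pandas as pd

addresses = pd.Series(['  Beijing ', 'Xian  ', None])

stripped = addresses.str.strip()    # both sides
left = addresses.str.lstrip()       # leading whitespace only
right = addresses.str.rstrip()      # trailing whitespace only

print(stripped[0] == 'Beijing')     # True
print(left[0] == 'Beijing ')        # True
print(right[0] == '  Beijing')      # True
print(stripped.isna().tolist())     # [False, False, True]
```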

7. Removing special characters
data = {'Chinese': [66, 95, 93, 90, 80, 80],
        'English': [65, 85, 92, 88, 90, 90],
        'Math': [30, 98, 96, 77, 90, 90],
        'address': ['Beijing', 'Chendu#', 'Xian', 'Shanghai#', 'Beijing', 'Beijing']}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua', 'LiHua'],
                columns=['English', 'Math', 'Chinese', 'address'])
print(df2)
# df2 contents:
#           English  Math  Chinese    address
# ZhangSan       65    30       66    Beijing
# LiSi           85    98       95    Chendu#
# WangWu         92    96       93       Xian
# XiaoMing       88    77       90  Shanghai#
# LiHua          90    90       80    Beijing
# LiHua          90    90       80    Beijing

# Remove '#' from both ends of each string
df2['address'] = df2['address'].str.strip('#')
print(df2)
# Result:
#           English  Math  Chinese   address
# ZhangSan       65    30       66   Beijing
# LiSi           85    98       95    Chendu
# WangWu         92    96       93      Xian
# XiaoMing       88    77       90  Shanghai
# LiHua          90    90       80   Beijing
# LiHua          90    90       80   Beijing

8. Converting a column to uppercase, lowercase, or title case
data = {'Chinese': [66, 95, 93, 90, 80, 80],
        'English': [65, 85, 92, 88, 90, 90],
        'Math': [30, 98, 96, 77, 90, 90],
        'address': ['Beijing', 'Chendu#', 'Xian', 'Shanghai#', 'Beijing', 'Beijing']}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua', 'LiHua'],
                columns=['English', 'Math', 'Chinese', 'address'])
print(df2)
# df2 contents:
#           English  Math  Chinese    address
# ZhangSan       65    30       66    Beijing
# LiSi           85    98       95    Chendu#
# WangWu         92    96       93       Xian
# XiaoMing       88    77       90  Shanghai#
# LiHua          90    90       80    Beijing
# LiHua          90    90       80    Beijing

# Convert the 'address' column to uppercase
df2['address'] = df2['address'].str.upper()
# Convert the 'address' column to lowercase
# df2['address'] = df2['address'].str.lower()
# Convert the 'address' column to title case (initial capital)
# df2['address'] = df2['address'].str.title()
print(df2)
# Result:
#           English  Math  Chinese    address
# ZhangSan       65    30       66    BEIJING
# LiSi           85    98       95    CHENDU#
# WangWu         92    96       93       XIAN
# XiaoMing       88    77       90  SHANGHAI#
# LiHua          90    90       80    BEIJING
# LiHua          90    90       80    BEIJING

9. Finding, filling, and deleting null values
import numpy as np

data = {'Chinese': [66, 95, 93, 90, 80, np.nan],
        'English': [65, 85, 92, 88, 90, np.nan],
        'Math': [30, 98, 96, 77, 90, 90],
        'address': ['Beijing', 'Chendu#', 'Xian', 'Shanghai#', 'Beijing', 'Beijing']}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua', 'LiHua'],
                columns=['English', 'Math', 'Chinese', 'address'])
print(df2)
# df2 contents:
#           English  Math  Chinese    address
# ZhangSan     65.0    30     66.0    Beijing
# LiSi         85.0    98     95.0    Chendu#
# WangWu       92.0    96     93.0       Xian
# XiaoMing     88.0    77     90.0  Shanghai#
# LiHua        90.0    90     80.0    Beijing
# LiHua         NaN    90      NaN    Beijing

# 1. Check whether the data contains NaN: df2.isna() or pd.isna(df2);
#    the inverse is df2.notna() or pd.notna(df2).
#    Tip: isnull is an alias for isna.
print(df2.isna())
# Result:
#           English   Math  Chinese  address
# ZhangSan    False  False    False    False
# LiSi        False  False    False    False
# WangWu      False  False    False    False
# XiaoMing    False  False    False    False
# LiHua       False  False    False    False
# LiHua        True  False     True    False

# 2. Determine which columns contain null values
print(df2.isna().any())
# Result:
# English     True
# Math       False
# Chinese     True
# address    False
# dtype: bool

# 3. Replace every NaN in the whole frame with 0
df2 = df2.fillna(0)
print(df2)
# Result:
#           English  Math  Chinese    address
# ZhangSan     65.0    30     66.0    Beijing
# LiSi         85.0    98     95.0    Chendu#
# WangWu       92.0    96     93.0       Xian
# XiaoMing     88.0    77     90.0  Shanghai#
# LiHua        90.0    90     80.0    Beijing
# LiHua         0.0    90      0.0    Beijing

# The calls below each return a new DataFrame and leave df2 unchanged.
# 4. Fill null values with the column mean
#    (numeric_only=True skips the text column 'address')
df2.fillna(df2.mean(numeric_only=True))
# 5. Fill null values with the column median
df2.fillna(df2.median(numeric_only=True))
# 6. Drop rows with missing values in the 'English' or 'Chinese' columns
df2.dropna(axis=0, subset=["English", "Chinese"])
# 7. Drop columns that are entirely null
df2.dropna(axis=1, how="all")
# 8. Drop columns that contain any missing value
#    (usually not done, because it deletes a feature)
df2.dropna(axis=1)
# 9. Drop only the rows in which every value is null
df2.dropna(how='all')
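
The difference between `how='any'` and `how='all'`, and filling with a per-column mean, is easier to see on a frame that still contains its NaN values. A minimal sketch with made-up numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'English': [65.0, np.nan, np.nan],
                   'Chinese': [66.0, 95.0, np.nan]})

print(len(df.dropna(how='any')))   # 1: only the first row has no NaN at all
print(len(df.dropna(how='all')))   # 2: only the last row is entirely NaN

# Fill each column's gaps with that column's own mean.
filled = df.fillna(df.mean(numeric_only=True))
print(filled['English'].tolist())  # [65.0, 65.0, 65.0]
print(filled['Chinese'].tolist())  # [66.0, 95.0, 80.5]
```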

10. Processing data with the apply function
import numpy as np

data = {'Chinese': [66, 95, 93, 90, 80, np.nan],
        'English': [65, 85, 92, 88, 90, np.nan],
        'Math': [30, 98, 96, 77, 90, 90],
        'address': ['Beijing', 'Chendu#', 'Xian', 'Shanghai#', 'Beijing', 'Beijing']}
df2 = DataFrame(data=data,
                index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua', 'LiHua'],
                columns=['English', 'Math', 'Chinese', 'address'])
print(df2)
# df2 contents:
#           English  Math  Chinese    address
# ZhangSan     65.0    30     66.0    Beijing
# LiSi         85.0    98     95.0    Chendu#
# WangWu       92.0    96     93.0       Xian
# XiaoMing     88.0    77     90.0  Shanghai#
# LiHua        90.0    90     80.0    Beijing
# LiHua         NaN    90      NaN    Beijing

# Convert the 'address' column to uppercase
df2['address'] = df2['address'].apply(str.upper)
print(df2)
# Result:
#           English  Math  Chinese    address
# ZhangSan     65.0    30     66.0    BEIJING
# LiSi         85.0    98     95.0    CHENDU#
# WangWu       92.0    96     93.0       XIAN
# XiaoMing     88.0    77     90.0  SHANGHAI#
# LiHua        90.0    90     80.0    BEIJING
# LiHua         NaN    90      NaN    BEIJING

11. Using the apply function for more complex processing
import numpy as np
from pandas import DataFrame


def plus(row, n, m):
    # With axis=1, apply passes each row to this function as a Series.
    row['another1'] = (row['Chinese'] + row['English']) * m
    row['another2'] = (row['Chinese'] + row['English']) * n
    return row


def method3():
    data = {'Chinese': [66, 95, 93, 90, 80, np.nan],
            'English': [65, 85, 92, 88, 90, np.nan],
            'Math': [30, 98, 96, 77, 90, 90],
            'address': ['Beijing', 'Chendu#', 'Xian', 'Shanghai#', 'Beijing', 'Beijing']}
    df2 = DataFrame(data=data,
                    index=['ZhangSan', 'LiSi', 'WangWu', 'XiaoMing', 'LiHua', 'LiHua'],
                    columns=['English', 'Math', 'Chinese', 'address'])
    print(df2)
    # df2 contents:
    #           English  Math  Chinese    address
    # ZhangSan     65.0    30     66.0    Beijing
    # LiSi         85.0    98     95.0    Chendu#
    # WangWu       92.0    96     93.0       Xian
    # XiaoMing     88.0    77     90.0  Shanghai#
    # LiHua        90.0    90     80.0    Beijing
    # LiHua         NaN    90      NaN    Beijing

    # apply calls plus once per row (axis=1); args passes the two
    # extra parameters that plus requires (n=2, m=3).
    df2 = df2.apply(plus, axis=1, args=(2, 3))
    print(df2)
    # Result:
    #           English  Math  Chinese    address  another1  another2
    # ZhangSan     65.0    30     66.0    Beijing     393.0     262.0
    # LiSi         85.0    98     95.0    Chendu#     540.0     360.0
    # WangWu       92.0    96     93.0       Xian     555.0     370.0
    # XiaoMing     88.0    77     90.0  Shanghai#     534.0     356.0
    # LiHua        90.0    90     80.0    Beijing     510.0     340.0
    # LiHua         NaN    90      NaN    Beijing       NaN       NaN
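
For simple arithmetic like this, the same two columns can also be computed without apply, using ordinary column arithmetic. This is a different technique (vectorized operations), shown only as an alternative sketch; the small frame and the name `total` are made up.

```python
import pandas as pd

df = pd.DataFrame({'Chinese': [66, 95], 'English': [65, 85]})
n, m = 2, 3

# Column arithmetic works on whole columns at once.
total = df['Chinese'] + df['English']
df['another1'] = total * m
df['another2'] = total * n

print(df['another1'].tolist())   # [393, 540]
print(df['another2'].tolist())   # [262, 360]
```

apply with axis=1 remains useful when the per-row logic cannot be expressed as column arithmetic.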

12. Functions for data statistics

 

13. Merging data tables
import pandas as pd
from pandas import DataFrame

df1 = DataFrame({'name': ['joel', 'ziwen', 'a', 'b', 'c'], 'data1': range(5)})
df2 = DataFrame({'name': ['joel', 'ziwen', 'A', 'B', 'C'], 'data2': range(5)})

# 1. Join on the specified column
df3 = pd.merge(df1, df2, on='name')
print(df3)
# Result:
#     name  data1  data2
# 0   joel      0      0
# 1  ziwen      1      1

# 2. Inner join
df4 = pd.merge(df1, df2, how='inner')
print(df4)
# Result:
#     name  data1  data2
# 0   joel      0      0
# 1  ziwen      1      1

# 3. Left join
df5 = pd.merge(df1, df2, how='left')
print(df5)
# Result:
#     name  data1  data2
# 0   joel      0    0.0
# 1  ziwen      1    1.0
# 2      a      2    NaN
# 3      b      3    NaN
# 4      c      4    NaN

# 4. Right join
df6 = pd.merge(df1, df2, how='right')
print(df6)
# Result:
#     name  data1  data2
# 0   joel    0.0      0
# 1  ziwen    1.0      1
# 2      A    NaN      2
# 3      B    NaN      3
# 4      C    NaN      4

# 5. Outer join (the row order can differ between pandas versions)
df7 = pd.merge(df1, df2, how='outer')
print(df7)
# Result:
#     name  data1  data2
# 0   joel    0.0    0.0
# 1  ziwen    1.0    1.0
# 2      a    2.0    NaN
# 3      b    3.0    NaN
# 4      c    4.0    NaN
# 5      A    NaN    2.0
# 6      B    NaN    3.0
# 7      C    NaN    4.0
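
When inspecting an outer join, it can help to see which table each row came from. `pd.merge` has an `indicator` parameter that adds a `_merge` column for this. A small sketch with made-up tables; `left`, `right` and `origin` are illustrative names:

```python
import pandas as pd

left = pd.DataFrame({'name': ['joel', 'a'], 'data1': [0, 1]})
right = pd.DataFrame({'name': ['joel', 'A'], 'data2': [0, 1]})

merged = pd.merge(left, right, on='name', how='outer', indicator=True)

# Look up the origin of each name without relying on the row order.
origin = dict(zip(merged['name'], merged['_merge'].astype(str)))
print(origin['joel'])   # both
print(origin['a'])      # left_only
print(origin['A'])      # right_only
```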

14. df.loc: getting data by label (row index and column index)
import string

import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape((3, 4)),
                 index=list(string.ascii_uppercase[:3]),
                 columns=list(string.ascii_uppercase[-4:]))
print(t)
# Output of t:
#    W  X   Y   Z
# A  0  1   2   3
# B  4  5   6   7
# C  8  9  10  11

# 1. Quick and direct selection
tmp = t[:-1][['W', 'Z']]
print(tmp)
#    W  Z
# A  0  3
# B  4  7

# 2. Get a single value by row label and column label
print(t.loc['A', 'W'])
# 0

# 3. Row 'A', several columns
print(t.loc['A', ['W', 'Z']])
# W    0
# Z    3

# 4. Several rows and several columns
print(t.loc[['A', 'B'], ['W', 'X']])
#    W  X
# A  0  1
# B  4  5

# 5. Several rows and several columns; note that with loc
#    a colon slice is a closed interval (both ends included)
print(t.loc['A':'C', 'W':'Z'])
#    W  X   Y   Z
# A  0  1   2   3
# B  4  5   6   7
# C  8  9  10  11

15. df.iloc: getting data by position
import string

import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape((3, 4)),
                 index=list(string.ascii_uppercase[:3]),
                 columns=list(string.ascii_uppercase[-4:]))
print(t)
# Output of t:
#    W  X   Y   Z
# A  0  1   2   3
# B  4  5   6   7
# C  8  9  10  11

# 1. Several rows and columns; note that 1:3 is a half-open
#    interval (start included, end excluded)
print(t.iloc[1:3, [2, 3]])
#     Y   Z
# B   6   7
# C  10  11

# 2. Several rows and columns; both 1:3 slices are half-open intervals
print(t.iloc[1:3, 1:3])
#    X   Y
# B  5   6
# C  9  10

# 3. Modify data
t.loc['A', 'Y'] = 100
print(t)
#    W  X    Y   Z
# A  0  1  100   3
# B  4  5    6   7
# C  8  9   10  11
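
Assignment works through both accessors. The sketch below, on the same kind of 3x4 frame, sets one cell by position with iloc and then sets several cells at once by combining a boolean row condition with a column label in loc. The value -1 and the condition are arbitrary choices for the example.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame(np.arange(12).reshape((3, 4)),
                 index=list('ABC'), columns=list('WXYZ'))

t.iloc[0, 2] = 100            # row position 0, column position 2 -> ('A', 'Y')
t.loc[t['W'] > 3, 'Z'] = -1   # every row where W > 3 -> rows B and C

print(t.loc['A', 'Y'])        # 100
print(t['Z'].tolist())        # [3, -1, -1]
```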

16. Boolean indexing in pandas
import pandas as pd

data = {"Row_Labels": ['BELLS', 'CHARLIE', 'LUCKY', 'ROCKY', 'E', 'M'],
        "Count_AnimalName": [1195, 856, 723, 823, 100, 20]}
t = pd.DataFrame(data=data, index=[1156, 2660, 8552, 12368, 11, 12])
print(t)
# Contents of t:
#       Row_Labels  Count_AnimalName
# 1156       BELLS              1195
# 2660     CHARLIE               856
# 8552       LUCKY               723
# 12368      ROCKY               823
# 11             E               100
# 12             M                20

# Keep the rows whose name is longer than 2 characters and whose count exceeds 700.
# Each condition needs its own parentheses when combined with &.
t = t[(t['Row_Labels'].str.len() > 2) & (t['Count_AnimalName'] > 700)]
print(t)
# Result:
#       Row_Labels  Count_AnimalName
# 1156       BELLS              1195
# 2660     CHARLIE               856
# 8552       LUCKY               723
# 12368      ROCKY               823
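
Besides `&` (and), boolean masks can be combined with `|` (or) and inverted with `~` (not). A minimal sketch on a smaller made-up table; storing each condition in its own variable (`short_name`, `popular`) is just one way to keep the expression readable:

```python
import pandas as pd

t = pd.DataFrame({'Row_Labels': ['BELLS', 'ROCKY', 'E'],
                  'Count_AnimalName': [1195, 500, 100]})

short_name = t['Row_Labels'].str.len() <= 2
popular = t['Count_AnimalName'] > 700

either = t[short_name | popular]       # OR: at least one condition holds
neither = t[~(short_name | popular)]   # NOT: neither condition holds

print(either['Row_Labels'].tolist())   # ['BELLS', 'E']
print(neither['Row_Labels'].tolist())  # ['ROCKY']
```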

17. String methods in pandas
