import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': list('aabba'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})
df

grouped = df['data1'].groupby(df['key1'])
grouped.mean()
The group keys above are all Series. In fact, the group key can be any array of appropriate length:
states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
df['data1'].groupby([states, years]).mean()


As you can see, there is no key2 column in the result: df['key2'] is not numeric data, so it is excluded. By default, all numeric columns are aggregated, although the result can be filtered down to a subset.

Iterating over groups
for name, group in df.groupby('key1'):
    print(name)
    print(group)

As you can see, name is the key1 value of each group, and group is the corresponding sub-DataFrame. The same works with multiple keys:
for (k1, k2), group in df.groupby(['key1', 'key2']):
    print('===k1,k2:')
    print(k1, k2)
    print('===group:')
    print(group)

The result of a groupby operation can also be converted into a dictionary:
piece = dict(list(df.groupby('key1')))
piece
{'a':       data1     data2 key1 key2
 0  -0.233405 -0.756316    a  one
 1  -0.232103 -0.095894    a  two
 4   1.056224  0.736629    a  one,
 'b':       data1     data2 key1 key2
 2   0.200875  0.598282    b  one
 3  -1.437782  0.107547    b  two}
piece['a']

groupby groups on axis=0 by default; you can group on any other axis by setting axis explicitly.
grouped = df.groupby(df.dtypes, axis=1)
dict(list(grouped))
{dtype('float64'):       data1     data2
 0  -0.233405 -0.756316
 1  -0.232103 -0.095894
 2   0.200875  0.598282
 3  -1.437782  0.107547
 4   1.056224  0.736629,
 dtype('O'):   key1 key2
 0    a  one
 1    a  two
 2    b  one
 3    b  two
 4    a  one}
Select a column or group of columns

With large datasets, it is often necessary to aggregate only some of the columns.
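As a minimal, self-contained sketch of this column selection (rebuilding the same df as at the top of the page; the data is random, so values differ per run):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': list('aabba'),
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.random.randn(5),
                   'data2': np.random.randn(5)})

# Indexing the GroupBy object with a column name is syntactic sugar for
# grouping just that column: both lines compute the same Series.
s1 = df.groupby('key1')['data1'].mean()
s2 = df['data1'].groupby(df['key1']).mean()

# A list of names keeps the result a DataFrame with only those columns.
sub = df.groupby(['key1', 'key2'])[['data2']].mean()
```

Either spelling avoids aggregating columns you do not need.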

Grouping with a dictionary or Series
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=list('abcde'),
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people.iloc[2:3, [1, 2]] = np.nan  # introduce a few NaN values (columns 'b' and 'c' of 'Wes')
people

Suppose we have a known grouping relation for the columns (the mapping shown as a Series further below):
mapping = {'a': 'red', 'b': 'red', 'c': 'blue', 'd': 'blue', 'e': 'red', 'f': 'orange'}
by_column = people.groupby(mapping, axis=1)
by_column.sum()

If axis=1 were omitted, grouping would be attempted along the row index instead, and the columns a b c d e would not be combined into the color groups.

A Series works the same way:
map_series = pd.Series(mapping)
map_series
a       red
b       red
c      blue
d      blue
e       red
f    orange
dtype: object
people.groupby(map_series, axis=1).count()

Grouping by function

Compared with a dict or Series, a Python function is a more flexible way to define a group mapping. Any function passed as a group key is called once per index value, and its return value is used as the group name. Suppose you want to group people by the length of their names: just pass in len.
people.groupby(len).sum()
          a         b         c         d         e
3 -1.308709 -2.353354  1.585584  2.908360 -1.267162
5 -0.688506 -0.187575 -0.048742  1.491272 -0.636704
6  0.110028 -0.932493  1.343791 -1.928363 -0.364745
Mixing functions with arrays, lists, dicts, or Series is not a problem, because everything is converted to an array internally.
key_list = ['one', 'one', 'one', 'two', 'two']
people.groupby([len, key_list]).sum()
Group by index level

The most convenient feature of a hierarchical index is that you can aggregate by index level. To do so, pass the level number or name via the level keyword:
columns = pd.MultiIndex.from_arrays([['US', 'US', 'US', 'JP', 'JP'],
                                     [1, 3, 5, 1, 3]],
                                    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)
hier_df
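A self-contained sketch of level-based grouping (the concrete 'US'/'JP' column values here are an assumption for illustration; the cty/tenor level names come from the text above):

```python
import numpy as np
import pandas as pd

# Hierarchical columns with two levels, named 'cty' and 'tenor'.
columns = pd.MultiIndex.from_arrays(
    [['US', 'US', 'US', 'JP', 'JP'], [1, 3, 5, 1, 3]],
    names=['cty', 'tenor'])
hier_df = pd.DataFrame(np.random.randn(4, 5), columns=columns)

# Pass the level name (or number) via the level keyword; axis=1 groups
# the columns rather than the rows.
counts = hier_df.groupby(level='cty', axis=1).count()
```

Each row of counts reports how many non-NA entries fall under each cty value (3 US columns, 2 JP columns here).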


Data aggregation

Calling custom aggregation functions

Column-oriented multi-function application

Aggregating a Series or the columns of a DataFrame really means using aggregate with a custom function, or calling a method such as mean or std. Sometimes, however, you want to use different aggregation functions for different columns, or apply multiple functions at once.
grouped = tips.groupby(['sex', 'smoker'])
grouped_pct = grouped['tip_pct']  # the tip_pct column
grouped_pct.agg('mean')  # for the descriptive statistics in table 9-1, you can pass the function name as a string
# If you pass a list of functions, the resulting DataFrame's columns are named after them

Such automatic column names are not very recognizable. If you instead pass a list of (name, function) tuples, the first element of each tuple is used as the DataFrame's column name.
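A minimal sketch of both spellings, using a tiny hypothetical stand-in for the tips data (the values are invented for illustration):

```python
import pandas as pd

# Hypothetical miniature of the tips data used above.
tips = pd.DataFrame({'smoker': ['No', 'No', 'Yes', 'Yes'],
                     'tip_pct': [0.06, 0.16, 0.10, 0.20]})
grouped_pct = tips.groupby('smoker')['tip_pct']

# A plain list of functions: result columns are named after the functions.
auto = grouped_pct.agg(['mean', 'std'])

# (name, function) tuples: the first element becomes the column name.
named = grouped_pct.agg([('average', 'mean'), ('deviation', 'std')])
```

The second form gives you descriptive column names in the aggregated result.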

For a DataFrame, you can define a list of functions to apply to all columns, or apply different functions to different columns.

To apply different functions to different columns, pass agg a dict that maps column names to functions.

The DataFrame has hierarchical columns only when multiple functions are applied to at least one column.
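The two points above can be sketched together on a hypothetical miniature of the tips data (values invented for illustration):

```python
import pandas as pd

tips = pd.DataFrame({'smoker': ['No', 'No', 'Yes', 'Yes'],
                     'tip': [1.0, 3.0, 2.0, 4.0],
                     'size': [2, 3, 2, 4]})
grouped = tips.groupby('smoker')

# A dict maps each column to its own aggregation function: flat columns.
flat = grouped.agg({'tip': 'max', 'size': 'sum'})

# Hierarchical columns appear only because 'tip' gets multiple functions.
nested = grouped.agg({'tip': ['min', 'max'], 'size': 'sum'})
```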

Group-wise operations and transformations

Aggregation is only one kind of grouping operation; it is a special case of data transformation. transform and apply are more flexible.

transform applies a function to each group and then places the results in the appropriate locations. If each group produces a scalar value, that scalar is broadcast across the group.

transform is a special-purpose function with strict requirements: the passed function must produce one of two results, either a scalar value that can be broadcast (e.g. np.mean), or an array of results of the same size as the group.
people = pd.DataFrame(np.random.randn(5, 5),
                      columns=list('abcde'),
                      index=['Joe', 'Steve', 'Wes', 'Jim', 'Travis'])
people

key = ['one', 'two', 'one', 'two', 'one']
people.groupby(key).mean()


A transform with np.mean would broadcast these group means back, producing many values identical to the table above. To subtract each group's mean instead, define a demeaning function:
def demean(arr):
    return arr - arr.mean()

demeaned = people.groupby(key).transform(demean)
demeaned
demeaned.groupby(key).mean()  # check: the group means are now (approximately) zero
The most general groupby method is apply.
tips = pd.read_csv('C:\\Users\\ecaoyng\\Desktop\\work space\\Python\\py_for_analysis_code\\pydata-book-master\\ch08\\tips.csv')
tips[:5]

Create a new column
tips['tip_pct'] = tips['tip'] / tips['total_bill']
tips[:6]

Select the top 5 tip_pct values within each group:
def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

Group by smoker and apply the function:
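A self-contained sketch of this step (a hypothetical miniature of the tips data; top matches the function defined above, with sort_values as the modern spelling):

```python
import pandas as pd

tips = pd.DataFrame({'smoker': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
                     'tip_pct': [0.05, 0.12, 0.09, 0.20, 0.07, 0.15]})

def top(df, n=5, column='tip_pct'):
    # Return the n rows with the largest values in the given column.
    return df.sort_values(by=column)[-n:]

# apply calls top once per smoker group, then concatenates the pieces.
best = tips.groupby('smoker').apply(top, n=2)
```

Extra arguments after the function (here n=2) are forwarded to top.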

Multi parameter version
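Sketching the multi-parameter call on the same kind of hypothetical data: keyword arguments after the function name are forwarded to it, so n and column can be tuned per call.

```python
import pandas as pd

tips = pd.DataFrame({'smoker': ['No', 'No', 'No', 'Yes', 'Yes', 'Yes'],
                     'day': ['Sun', 'Sun', 'Sat', 'Sun', 'Sat', 'Sat'],
                     'total_bill': [10.0, 12.0, 25.0, 15.0, 30.0, 8.0],
                     'tip_pct': [0.10, 0.11, 0.15, 0.12, 0.18, 0.09]})

def top(df, n=5, column='tip_pct'):
    return df.sort_values(by=column)[-n:]

# Largest single total_bill within each (smoker, day) group.
result = tips.groupby(['smoker', 'day']).apply(top, n=1, column='total_bill')
```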

Quantile and bucket analysis

Combining cut and qcut with groupby makes it easy to perform bucket (bucket) or quantile (quantile) analysis.
frame = pd.DataFrame({'data1': np.random.randn(1000),
                      'data2': np.random.randn(1000)})
frame[:5]

factor = pd.cut(frame.data1, 4)
factor[:10]
0     (0.281, 2.00374]
1     (0.281, 2.00374]
2    (-3.172, -1.442]
3     (-1.442, 0.281]
4     (0.281, 2.00374]
5     (0.281, 2.00374]
6     (-1.442, 0.281]
7     (-1.442, 0.281]
8     (-1.442, 0.281]
9     (-1.442, 0.281]
Name: data1, dtype: category
Categories (4, object): [(-3.172, -1.442] < (-1.442, 0.281] < (0.281, 2.00374] < (2.00374, 3.727]]

def get_stats(group):
    return {'min': group.min(), 'max': group.max(),
            'count': group.count(), 'mean': group.mean()}

grouped = frame.data2.groupby(factor)
grouped.apply(get_stats).unstack()

These are equal-length buckets. To get equal-size buckets instead (the same number of data points in each), use qcut.

Equal-length buckets: equal interval width
Equal-size buckets: equal number of data points
grouping = pd.qcut(frame.data1, 10, labels=False)  # labels=False returns the quantile numbers
grouped = frame.data2.groupby(grouping)
grouped.apply(get_stats).unstack()
