Theory part

* understand SAC Process and groupby mechanism
* Master the three operations of grouping ： polymerization , Filtering and transformation
* be familiar with apply Function usage
Practice section

* Diamond data set analysis
* Analysis of illicit drug data sets
<> Exercise explanation

【 Exercise one 】： There is a copy about diamonds Data set of , The number of carats is recorded in columns , colour , Mining depth , Price , Please solve the following problems ： The description is as follows

(a) At all weights over 1 In carats of diamonds , What is the price range ?
'price'] df_r.max()-df_r.min()

(b) If the mining depth is 0.2\0.4\0.6\0.8 Quantile is the basis of grouping , Which is the most colorful diamond in each group ? Is this color the most expensive per unit weight in the group on average ?
bins = df['depth'].quantile(np.linspace(0,1,6)).tolist() cuts = pd.cut(df[

color_result = df.groupby('cuts')['color'].describe() color_result

df[' Average price ']=df['price']/df['carat'] color_result['top'] == [i for i in df.
groupby(['cuts','color'])[' Average price '].mean().groupby(['cuts']).idxmax().values]

© Group by weight (0-0.5,0.5-1,1-1.5,1.5-2,2+), Sort indexes by increasing depth , Find the maximum of the length of the continuous strictly increasing price series in each group .
df = df.drop(columns=' Average price ') cuts = pd.cut(df['carat'],bins=[0,0.5,1,1.5,2,np.

def f(nums): if not nums: return 0 res = 1 cur_len = 1 for i in range(1, len(
nums)): if nums[i-1] < nums[i]: cur_len += 1 res = max(cur_len, res) else:
cur_len= 1 return res for name,group in df.groupby('cuts'): group = group.
sort_values(by='depth') s = group['price'] print(name,f(s.tolist()))

(d) Please group by color , The regression coefficients of price with respect to carat number are calculated respectively .（ Simple linear regression of single variable , And only use Pandas and Numpy complete ）
for name,group in df[['carat','price','color']].groupby('color'): L1 = np.array
([np.ones(group.shape),group['carat']]).reshape(2,group.shape) L2 = group[
'price'] result = (np.linalg.inv(L1.dot(L1.T)).dot(L1)).dot(L2).reshape(2,1)
print(' When the color is %s Time , The intercept is ：%f, The regression coefficient is ：%f'%(name,result,result))

【 Exercise 2 】： There's a copy about America 10 Year to 17 Illicit drug data sets for , Columns record the year separately , State （5 individual ）, county , Drug type , Number of reports , Please solve the following problems

(a) Statistics by year , Which county has the most reports ? Was the county's state also the most reported in that year ?
2010,2018): county = (df.groupby(['COUNTY','YYYY']).sum().loc[idx[:,i],:].idxmax
()) state = df.query('COUNTY == "%s"'%county)['State'].iloc state_true
= df.groupby(['State','YYYY']).sum().loc[idx[:,i],:].idxmax() if state==
state_true: print(' stay %d year ,%s Counties have the largest number of reports , The state it belongs to %s It is also the largest number of reports '%(i,county,state)) else: print
(' stay %d year ,%s Counties have the largest number of reports , But the state it belongs to %s Not the most reported ,%s State reports the most '%(i,county,state,state_true))

(b) from 14 Year to year 15 year ,Heroin Which state has increased the most ? Is it the biggest increase of all drugs in the state ? If not , Please find the medicine that meets the condition .
df_b = df[(df['YYYY'].isin([2014,2015]))&(df['SubstanceName']=='Heroin')]