Theory part

* understand SAC Process and groupby mechanism
* Master the three operations of grouping : polymerization , Filtering and transformation
* be familiar with apply Function usage
Practice section

* Diamond data set analysis
* Analysis of illicit drug data sets
<> Exercise explanation

【 Exercise one 】: There is a copy about diamonds Data set of , The number of carats is recorded in columns , colour , Mining depth , Price , Please solve the following problems : The description is as follows

(a) At all weights over 1 In carats of diamonds , What is the price range ?
df = pd.read_csv('data/Diamonds.csv') df.head() df_r = df.query('carat>1')[
'price'] df_r.max()-df_r.min()

(b) If the mining depth is 0.2\0.4\0.6\0.8 Quantile is the basis of grouping , Which is the most colorful diamond in each group ? Is this color the most expensive per unit weight in the group on average ?
bins = df['depth'].quantile(np.linspace(0,1,6)).tolist() cuts = pd.cut(df[
'depth'],bins=bins) # Optional label Add custom label df['cuts'] = cuts df.head()

color_result = df.groupby('cuts')['color'].describe() color_result

df[' Average price ']=df['price']/df['carat'] color_result['top'] == [i[1] for i in df.
groupby(['cuts','color'])[' Average price '].mean().groupby(['cuts']).idxmax().values]

© Group by weight (0-0.5,0.5-1,1-1.5,1.5-2,2+), Sort indexes by increasing depth , Find the maximum of the length of the continuous strictly increasing price series in each group .
df = df.drop(columns=' Average price ') cuts = pd.cut(df['carat'],bins=[0,0.5,1,1.5,2,np.
inf]) # Optional label Add custom label df['cuts'] = cuts df.head()

def f(nums): if not nums: return 0 res = 1 cur_len = 1 for i in range(1, len(
nums)): if nums[i-1] < nums[i]: cur_len += 1 res = max(cur_len, res) else:
cur_len= 1 return res for name,group in df.groupby('cuts'): group = group.
sort_values(by='depth') s = group['price'] print(name,f(s.tolist()))

(d) Please group by color , The regression coefficients of price with respect to carat number are calculated respectively .( Simple linear regression of single variable , And only use Pandas and Numpy complete )
for name,group in df[['carat','price','color']].groupby('color'): L1 = np.array
([np.ones(group.shape[0]),group['carat']]).reshape(2,group.shape[0]) L2 = group[
'price'] result = (np.linalg.inv(L1.dot(L1.T)).dot(L1)).dot(L2).reshape(2,1)
print(' When the color is %s Time , The intercept is :%f, The regression coefficient is :%f'%(name,result[0],result[1]))

【 Exercise 2 】: There's a copy about America 10 Year to 17 Illicit drug data sets for , Columns record the year separately , State (5 individual ), county , Drug type , Number of reports , Please solve the following problems

(a) Statistics by year , Which county has the most reports ? Was the county's state also the most reported in that year ?
df = pd.read_csv('data/Drugs.csv') df.head() idx=pd.IndexSlice for i in range(
2010,2018): county = (df.groupby(['COUNTY','YYYY']).sum().loc[idx[:,i],:].idxmax
()[0][0]) state = df.query('COUNTY == "%s"'%county)['State'].iloc[0] state_true
= df.groupby(['State','YYYY']).sum().loc[idx[:,i],:].idxmax()[0][0] if state==
state_true: print(' stay %d year ,%s Counties have the largest number of reports , The state it belongs to %s It is also the largest number of reports '%(i,county,state)) else: print
(' stay %d year ,%s Counties have the largest number of reports , But the state it belongs to %s Not the most reported ,%s State reports the most '%(i,county,state,state_true))

(b) from 14 Year to year 15 year ,Heroin Which state has increased the most ? Is it the biggest increase of all drugs in the state ? If not , Please find the medicine that meets the condition .
df_b = df[(df['YYYY'].isin([2014,2015]))&(df['SubstanceName']=='Heroin')]
df_add= df_b.groupby(['YYYY','State']).sum() (df_add.loc[2015]-df_add.loc[2014])
.idxmax() df_b = df[(df['YYYY'].isin([2014,2015]))&(df['State']=='OH')] df_add =
df_b.groupby(['YYYY','SubstanceName']).sum() display((df_add.loc[2015]-df_add.
loc[2014]).idxmax()) # The feature of index alignment is used here display((df_add.loc[2015]/df_add.loc[2014]).
idxmax())

This part is about grouping exercises , Next is the exercise of deformation

Technology
©2019-2020 Toolsou All rights reserved,
ELementUI select Multi select drop-down box to get all properties of the selected item Conflict statement (conflicting declaration) solve common 5 species JAVA Runtime exception TypeScript Data types in is enough Redis Counter High concurrency applications npm ERR! code ELIFECYCLE npm ERR! errno 1 npm ERR! test_vue_0613@1.0.0 dev: Solutions to errors ( Essence )2020 year 6 month 26 day C# Class library read json Profile help class ( Essence )2020 year 6 month 26 day C# Class library Log help class unity Get the car speed and forward or backward PHP Printout log