1.5.2 2. practice
1.5.2.1 【 Exercise one 】： There is a copy about diamonds Data set of , The number of carats is recorded in columns , colour , Mining depth , Price , Please solve the following problems ：
carat color depth price
0 0.23 E 61.5 326
1 0.21 E 59.8 326
2 0.23 E 56.9 327
3 0.29 I 62.4 334
4 0.31 J 63.3 335
1.5.2.2 (a) At all weights over 1 In carats of diamonds , What is the price range ?
1.5.2.3 (b)
If the mining depth is 0.2\0.4\0.6\0.8 Quantile is the basis of grouping , Which is the most colorful diamond in each group ? Is this color the most expensive per unit weight in the group on average ?
1.5.2.4 © Group by weight (0-0.5,0.5-1,1-1.5,1.5-2,2+), Sort indexes by increasing depth , Find the maximum of the length of the continuous strictly increasing price series in each group .
1.5.2.5 (d) Please group by color , The regression coefficients of price with respect to carat number are calculated respectively .（ Simple linear regression of single variable , And only use Pandas and Numpy complete ）
1.5.2.6 【 Exercise 2 】： There's a copy about America 10 Year to 17 Illicit drug data sets for , Columns record the year separately , State （5 individual ）, county , Drug type , Number of reports , Please solve the following problems ：
YYYY State COUNTY SubstanceName DrugReports
0 2010 VA ACCOMACK Propoxyphene 1
1 2010 OH ADAMS Morphine 9
3 2010 VA ALEXANDRIA CITY Heroin 5
4 2010 PA ALLEGHENY Hydromorphone 5
1.5.2.7 (a) Statistics by year , Which county has the most reports ? Was the county's state also the most reported in that year ?
1.5.2.8 (b) from 14 Year to year 15 year ,Heroin Which state has increased the most ? Is it the biggest increase of all drugs in the state ? If not , Please find the medicine that meets the condition .

1.5 The first 3 chapter ： Exercise one
1.5.1 (a) The range is 17561
carat color depth price
0 0.23 E 61.5 326
1 0.21 E 59.8 326
2 0.23 E 56.9 327
3 0.29 I 62.4 334
4 0.31 J 63.3 335
df_r = df.query(‘carat>1’)[‘price’]
df_r.max()-df_r.min()
17561
1.5.2 (b) 0-0.2 The most quantile range was ‘E’, The rest of the interval is ‘G’
bins = df[‘depth’].quantile(np.linspace(0,1,6)).tolist()
cuts = pd.cut(df[‘depth’],bins=bins) # Optional label Add custom label
df[‘cuts’] = cuts
carat color depth price cuts
0 0.23 E 61.5 326 (60.8, 61.6]
1 0.21 E 59.8 326 (43.0, 60.8]
2 0.23 E 56.9 327 (43.0, 60.8]
3 0.29 I 62.4 334 (62.1, 62.7]
4 0.31 J 63.3 335 (62.7, 79.0]
color_result = df.groupby(‘cuts’)[‘color’].describe()
color_result
count unique top freq
cuts
(43.0, 60.8] 11294 7 E 2259
(60.8, 61.6] 11831 7 G 2593
(61.6, 62.1] 10403 7 G 2247
(62.1, 62.7] 10137 7 G 2193
(62.7, 79.0] 10273 7 G 2000
1.5.3 The first three quantiles do not satisfy the condition , The color with the most quantity in the last two intervals is indeed the most expensive in average weight price
df[‘ Average price ’]=df[‘price’]/df[‘carat’]
color_result[‘top’] == [i[1] for i in df.groupby([‘cuts’
,‘color’])[‘ Average price ’].mean().groupby([‘cuts’]).idxmax().values]
cuts
(43.0, 60.8] False
(60.8, 61.6] False
(61.6, 62.1] False
(62.1, 62.7] True
(62.7, 79.0] True
Name: top, dtype: bool
1.5.4 © The results are shown below ：
df = df.drop(columns=‘ Average price ’)
cuts = pd.cut(df[‘carat’],bins=[0,0.5,1,1.5,2,np.inf]) # Optional label Add custom label
df[‘cuts’] = cuts
carat color depth price cuts
0 0.23 E 61.5 326 (0.0, 0.5]
1 0.21 E 59.8 326 (0.0, 0.5]
2 0.23 E 56.9 327 (0.0, 0.5]
3 0.29 I 62.4 334 (0.0, 0.5]
4 0.31 J 63.3 335 (0.0, 0.5]
def f(nums):
if not nums:
return 0
res = 1
cur_len = 1
for i in range(1, len(nums)):
if nums[i-1] < nums[i]:
cur_len += 1
res = max(cur_len, res)
else:
cur_len = 1
return res
for name,group in df.groupby(‘cuts’):
group = group.sort_values(by=‘depth’)
s = group[‘price’]
print(name,f(s.tolist()))
(0.0, 0.5] 8
(0.5, 1.0] 8
(1.0, 1.5] 7
(1.5, 2.0] 11
(2.0, inf] 7
1.5.5 (d) The results are as follows ：
for name,group in df[[‘carat’,‘price’,‘color’]].groupby(‘color’):
L1 =
np.array([np.ones(group.shape[0]),group[‘carat’]]).reshape(2,group.shape[0])
L2 = group[‘price’]
result = (np.linalg.inv(L1.dot(L1.T)).dot(L1)).dot(L2).reshape(2,1)
print(‘ When the color is %s Time , The intercept is ：%f, The regression coefficient is ：%f’%(name,result[0],result[1]))
When the color is D Time , The intercept is ：-2361.017152, The regression coefficient is ：8408.353126
When the color is E Time , The intercept is ：-2381.049600, The regression coefficient is ：8296.212783
When the color is F Time , The intercept is ：-2665.806191, The regression coefficient is ：8676.658344
When the color is G Time , The intercept is ：-2575.527643, The regression coefficient is ：8525.345779
When the color is H Time , The intercept is ：-2460.418046, The regression coefficient is ：7619.098320
When the color is I Time , The intercept is ：-2878.150356, The regression coefficient is ：7761.041169
When the color is J Time , The intercept is ：-2920.603337, The regression coefficient is ：7094.192092
1.6 The first 3 chapter ： Exercise 2
1.6.1 (a)
YYYY State COUNTY SubstanceName DrugReports
0 2010 VA ACCOMACK Propoxyphene 1
1 2010 OH ADAMS Morphine 9
3 2010 VA ALEXANDRIA CITY Heroin 5
4 2010 PA ALLEGHENY Hydromorphone 5
idx=pd.IndexSlice
for i in range(2010,2018):
county = (df.groupby([‘COUNTY’,‘YYYY’]).sum().loc[idx[:,i],:].idxmax()[0][0])
state = df.query(‘COUNTY == “%s”’%county)[‘State’].iloc[0]
state_true = df.groupby([‘State’,‘YYYY’]).sum().loc[idx[:,i],:].idxmax()[0][0]
if statestate_true:
print(‘ stay %d year ,%s Counties have the largest number of reports , The state it belongs to %s It is also the largest number of reports ’%(i,county,state))
else:
print(‘ stay %d year ,%s Counties have the largest number of reports , But the state it belongs to %s Not the most reported ,%s State reports the most ’%(i,county,state,state_true))
stay 2010 year ,PHILADELPHIA Counties have the largest number of reports , The state it belongs to PA It is also the largest number of reports
stay 2011 year ,PHILADELPHIA Counties have the largest number of reports , But the state it belongs to PA Not the most reported ,OH State reports the most
stay 2012 year ,PHILADELPHIA Counties have the largest number of reports , But the state it belongs to PA Not the most reported ,OH State reports the most
stay 2013 year ,PHILADELPHIA Counties have the largest number of reports , But the state it belongs to PA Not the most reported ,OH State reports the most
stay 2014 year ,PHILADELPHIA Counties have the largest number of reports , But the state it belongs to PA Not the most reported ,OH State reports the most
stay 2015 year ,PHILADELPHIA Counties have the largest number of reports , But the state it belongs to PA Not the most reported ,OH State reports the most
stay 2016 year ,HAMILTON Counties have the largest number of reports , The state it belongs to OH It is also the largest number of reports
stay 2017 year ,HAMILTON Counties have the largest number of reports , The state it belongs to OH It is also the largest number of reports
1.6.2 (b) OH States increased the most ,Heroin It's the largest increment , But the biggest increase was Acetyl fentanyl
df_b = df[(df[‘YYYY’].isin([2014,2015]))&(df[‘SubstanceName’]‘Heroin’)]
DrugReports OH
dtype: object
df_b = df[(df[‘YYYY’].isin([2014,2015]))&(df[‘State’]==‘OH’)]