The conclusion comes first, so you don't have to scroll to the bottom for the results ~
modin.pandas can speed up some functions by using a multi-core CPU. It is not complete yet, though: functions it has not implemented still fall back to the default pandas implementation …
Which functions are actually accelerated can be seen in the tests below.
I mainly tested apply, groupby and read_csv.

<> One, modin.pandas

Before talking about Modin, a brief introduction to pandas. pandas is a Python library for processing data. For efficiency, its performance-critical internals are implemented in C (via Cython) rather than in pure Python, and its developers have optimized the computing logic to a fairly high level.

Even so, because of the characteristics of Python itself, pandas computations generally run on a single core. So when we want to speed up pandas on large amounts of data, using a multi-core CPU is the first thing to consider.
This is where the Modin project comes in.

Modin is an early-stage project from UC Berkeley's RISELab that aims to promote the use of distributed computing in data science. It is a multi-process DataFrame library with the same API as pandas, which lets users speed up their pandas workflows.

In short, modin.pandas wraps pandas in an extra layer that uses a multi-core CPU to accelerate the computation.
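Because the API is the same, switching is mostly a matter of changing one import. Here is a minimal sketch of that idea; the helper names `pick_backend` and `load_backend` are my own, not part of pandas or Modin, and the sketch only decides which module to import based on whether Modin is installed.

```python
import importlib
import importlib.util


def pick_backend() -> str:
    """Return 'modin.pandas' if Modin is installed, otherwise 'pandas'."""
    if importlib.util.find_spec("modin") is not None:
        return "modin.pandas"
    return "pandas"


def load_backend():
    """Import and return the chosen DataFrame module.

    Needs pandas (and, for Modin, one of its execution engines) to be installed.
    """
    return importlib.import_module(pick_backend())


backend_name = pick_backend()
print("DataFrame backend:", backend_name)
# pd = load_backend()
# After this, pd.DataFrame(...), pd.read_csv(...) etc. are called exactly as before.
```

In a real script you would normally just write `import modin.pandas as pd` instead of `import pandas as pd`; the helper is only there to make the swap explicit.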

Enough talk, let's go straight to the test results.

<> Two, modin.pandas test

The test code is as follows:

def pandas_test():
    import pandas as pd
    from time import time

    # One million rows: a = 0..999999, b = 1000000..1999999
    df = pd.DataFrame(zip(range(1000000), range(1000000, 2000000)), columns=['a', 'b'])

    # Row-wise apply
    start = time()
    df['c'] = df.apply(lambda x: x.a + x.b, axis=1)
    df['d'] = df.apply(lambda x: 1 if x.a % 2 == 0 else 0, axis=1)
    print('pandas_df.apply Time: {:5.2f}s'.format(time() - start))

    # groupby on a column with only two distinct values
    start = time()
    group_df = df[['d', 'a']].groupby('d', as_index=False).agg({"a": ['sum', 'max', 'min', 'mean']})
    print('pandas_df.groupby Time: {:5.2f}s'.format(time() - start))

    # Read the CSV file
    start = time()
    data = pd.read_csv('test_modin.csv')
    print('pandas_df.read_csv Time: {:5.2f}s'.format(time() - start))


def modin_pandas_test():
    import modin.pandas as pd
    from time import time

    # Same data and same steps as pandas_test, only the import differs
    df = pd.DataFrame(zip(range(1000000), range(1000000, 2000000)), columns=['a', 'b'])

    start = time()
    df['c'] = df.apply(lambda x: x.a + x.b, axis=1)
    df['d'] = df.apply(lambda x: 1 if x.a % 2 == 0 else 0, axis=1)
    print('modin_pandas_df.apply Time: {:5.2f}s'.format(time() - start))

    start = time()
    group_df = df[['d', 'a']].groupby('d', as_index=False).agg({"a": ['sum', 'max', 'min', 'mean']})
    print('modin_pandas_df.groupby Time: {:5.2f}s'.format(time() - start))

    start = time()
    data = pd.read_csv('test_modin.csv')
    print('modin_pandas_df.read_csv Time: {:5.2f}s'.format(time() - start))


if __name__ == '__main__':
    pandas_test()
    modin_pandas_test()
The test creates a one-million-row DataFrame and runs apply and groupby on it.
test_modin.csv is about 70 MB in size.
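The post does not show how test_modin.csv was produced, so here is one possible way to generate a file for the read_csv test. The column names a, b, c, the helper name `write_test_csv` and the row contents are purely hypothetical; only the standard library is used.

```python
import csv
import os
import tempfile


def write_test_csv(path, n_rows):
    """Write n_rows of integer data with hypothetical columns a, b, c to path.

    Returns the size of the resulting file in bytes.
    """
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["a", "b", "c"])
        for i in range(n_rows):
            writer.writerow([i, i + 1000000, i % 2])
    return os.path.getsize(path)


# Small demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    demo_size = write_test_csv(demo_path, 1000)
    with open(demo_path, newline="") as f:
        demo_line_count = sum(1 for _ in f)

print(demo_line_count, "lines,", demo_size, "bytes")
```

For a real benchmark you would write to 'test_modin.csv' and raise n_rows until the returned size is roughly what you want to test with.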

Now look at the CPU usage on the server (4 cores) during each run:
1, pandas

2, modin.pandas

The CPU usage shows that modin.pandas spreads the computation over multiple cores, whereas plain pandas does not.
modin.pandas also uses all cores automatically, which means the CPU is maxed out almost immediately ..
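If maxing out every core is a problem on a shared server, the number of cores Modin uses can be capped. The sketch below relies on the `MODIN_CPUS` environment variable, which, as far as I know, Modin reads when it starts up; check this against the Modin version you have installed. Leaving one core free is just my own choice for the example.

```python
import os

# Number of cores the machine reports (falls back to 1 if it cannot be determined).
total_cores = os.cpu_count() or 1

# Assumption for this example: leave one core free for everything else.
workers = max(1, total_cores - 1)

# Must be set before modin.pandas is imported for the first time.
os.environ["MODIN_CPUS"] = str(workers)

print("cores available:", total_cores, "| cores given to Modin:", workers)

# import modin.pandas as pd   # import only after the variable is set
```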

OK, let's look at the output of the code.

You can see that for apply and read_csv, modin.pandas really is faster than pandas, just not by as much as you might expect.
For groupby, however, it is not faster; it is actually slower.
My guess as to why:
Running groupby on multiple cores is a bit like starting several processes by hand to compute in parallel. The natural way to split the work is one process per category of the groupby field, after which the partial results are combined.
In the test above there are only 2 categories, and the data still has to be combined afterwards. So the cores are probably not being used very efficiently, and the combining step adds extra time on top.
If any expert knows the real reason, please let me know ~
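To make the guess above concrete, here is a small pure-Python sketch of the "split, aggregate each part, then combine" pattern. It is not how Modin is actually implemented; the function names are my own, and the partitions are processed one after another here, where a real system would hand each one to a separate worker.

```python
def partial_agg(rows):
    """Aggregate one partition: per key, track sum, max, min and count."""
    acc = {}
    for key, value in rows:
        if key not in acc:
            acc[key] = {"sum": value, "max": value, "min": value, "count": 1}
        else:
            s = acc[key]
            s["sum"] += value
            s["max"] = max(s["max"], value)
            s["min"] = min(s["min"], value)
            s["count"] += 1
    return acc


def merge(partials):
    """Combine the per-partition results into the final per-key statistics."""
    total = {}
    for part in partials:
        for key, s in part.items():
            if key not in total:
                total[key] = dict(s)
            else:
                t = total[key]
                t["sum"] += s["sum"]
                t["max"] = max(t["max"], s["max"])
                t["min"] = min(t["min"], s["min"])
                t["count"] += s["count"]
    # The mean cannot be merged directly; it is derived from sum and count.
    for t in total.values():
        t["mean"] = t["sum"] / t["count"]
    return total


def partitioned_groupby(rows, n_partitions):
    """Split rows into partitions, aggregate each one, then merge the results."""
    size = max(1, -(-len(rows) // n_partitions))  # ceiling division
    chunks = [rows[i:i + size] for i in range(0, len(rows), size)]
    # Sequential here; each chunk could be sent to its own worker process instead.
    partials = list(map(partial_agg, chunks))
    return merge(partials)


# Same shape as the test data: key d is 1 for even a, 0 for odd a.
rows = [(1 if a % 2 == 0 else 0, a) for a in range(10)]
result = partitioned_groupby(rows, n_partitions=4)
print(result)
```

Even in this toy version, every partition produces an entry for both keys, and all of them have to go through the merge step. That extra step is the overhead I suspect in the groupby test.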

<> Three, summary

That concludes this preliminary test.

Judging by the results, Modin's acceleration of pandas is noticeable for apply and read_csv (groupby aside), especially when the data volume is large. The data I tested with above is only a demo compared with real big data. So if you are processing large data and pandas is not fast enough for you, you can consider using modin.pandas to speed it up.
But some functions are not fully implemented in modin.pandas yet, so pay attention when using it.

I am a moving ant, hoping we can move forward together.

If this helped you a little, a like is enough, thanks!

Note: if you find any mistakes in this post or have suggestions, please point them out. I would appreciate it!!!
