Celebrity face recognition is a signature feature of Weibo: the platform hosts a huge number of celebrity photos, so there is strong demand for identifying them. Celebrity face recognition is particularly difficult. Common face recognition schemes work on photos with plain expressions and little styling, where different people look clearly different. Celebrities' faces, by contrast, show rich expressions and widely varying styling, and some faces altered by cosmetic surgery are hard even for a human to tell apart, let alone a machine.
At present, ads are recommended within web pages. If an ad matches what the user actually wants, the user does not have to search hard for it, and the ad naturally performs much better.
Here, the machine learning platform handles both offline training and online prediction. Data generated by the business is sent to the back end, where it is used for feature extraction and offline training. As more and more business lines adopt deep learning, the TensorFlow and Caffe frameworks have been integrated into the platform.
At present, offline training mainly runs on GPU clusters, and part of the compute cluster is hosted on Alibaba Cloud.
The distributed computing model provided by TensorFlow
TensorFlow's distributed computing is quite different from MPI-based distributed computing in HPC. MPI processes are peers, which ensures that no single process becomes a bottleneck. MPI-IO is likewise designed so that every host shares the I/O load evenly. Computing tasks are also partitioned evenly across MPI processes so that all processes progress at the same pace. MPI processes typically exchange only the boundaries of their data blocks, which minimizes network traffic and shortens communication time.
TensorFlow's distributed design is simple and crude. Several parameter servers and a number of workers form a cluster. After each step, every worker submits the parameters it computed to the parameter server. The parameter server merges the parameters from all workers into the global parameters and sends them back to the workers, which then carry out the next step of computation on the basis of the global parameters.
TensorFlow uses a master-slave model, and the parameter server is the bottleneck. All parameters are transferred at every step, which generates too much network traffic. Suppose each worker's parameters occupy 1 GB of memory and the cluster consists of 1 parameter server and 10 workers. Each iteration step then produces 20 GB of network traffic (10 GB sent to the parameter server and 10 GB sent back). On a 10GbE network, which carries at most 1.25 GB/s, communication takes at least 16 seconds. In practice, computing one batch of data may take less than 1 second, and the model parameters may occupy far more than 1 GB of memory. From a theoretical standpoint, TensorFlow's distributed computing is less efficient than MPI.
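The estimate above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the function name and its parameters are made up for illustration, and it assumes all traffic passes through a single link at full line rate.

```python
def min_comm_seconds(param_gb, num_workers, link_gbit_per_s):
    """Lower bound on per-step communication time through one parameter server."""
    # Each worker uploads its parameters and downloads the global ones.
    traffic_gb = 2 * num_workers * param_gb
    # Convert the link speed from gigabits to gigabytes per second.
    link_gb_per_s = link_gbit_per_s / 8
    return traffic_gb / link_gb_per_s

# 1 GB of parameters, 10 workers, 10GbE network.
print(min_comm_seconds(param_gb=1, num_workers=10, link_gbit_per_s=10))  # 16.0
```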
Some people say that deep learning is just a special case of high performance computing, but it is not: a TensorFlow cluster is very different from an HPC cluster.
Differences between TensorFlow and HPC clusters
An HPC cluster has 3 major features: high-performance computing chips, a high-speed computing network, and high-speed parallel storage. A TensorFlow cluster has only one of them: high-end GPUs.
When a worker trains on one batch of data and obtains ∆W and ∆B (collectively called ∆P), this is called one training step. After finishing a training step, each worker pauses training and sends its ∆P to the parameter server. The parameter server waits until it has received ∆P from every worker, then sums and averages them, uses the average to update the old parameters, and obtains the new parameters P. It then sends P to all workers, and each worker starts its next step after receiving the new parameters.
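The synchronous step described above can be sketched in plain Python. This is a minimal illustration rather than TensorFlow's actual implementation: `sync_update` is a hypothetical helper, parameters are flat lists of floats, and the update is assumed to be "old parameters plus the averaged ∆P".

```python
def sync_update(params, worker_deltas):
    """Apply one synchronous step: average every worker's delta, then update."""
    n = len(worker_deltas)
    # Average the changes element by element across all workers.
    mean_delta = [sum(d[i] for d in worker_deltas) / n for i in range(len(params))]
    # New global parameters P = old parameters + averaged change.
    return [p + dp for p, dp in zip(params, mean_delta)]

params = [1.0, 2.0]
deltas = [[0.5, -1.0], [1.5, 1.0]]  # hypothetical deltas from two workers
new_params = sync_update(params, deltas)
print(new_params)  # [2.0, 2.0]
```

The server can only call `sync_update` once every worker has reported, which is where the waiting comes from.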
Compared with using a single server, training on N workers at the same time with synchronous parameter updates is equivalent to enlarging the batch N times. Specifically, if each training step uses 100 pictures (batch=100) on 1 server, then synchronously updating with the parameter changes (∆P) obtained by 4 workers is equivalent to training on 400 pictures per step (batch=400). As a result, the parameters change more smoothly and convergence is faster.
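The batch equivalence can be checked numerically. The sketch below assumes a toy one-parameter model with squared-error loss, made-up data, and equal-sized shards; under those assumptions, averaging the 4 per-worker mean gradients gives the same value as the mean gradient over all 400 samples.

```python
import math

def mean_gradient(w, samples):
    """Mean gradient of 0.5 * (w*x - y)**2 with respect to w over a batch."""
    return sum((w * x - y) * x for x, y in samples) / len(samples)

# 400 hypothetical (x, y) samples, split evenly across 4 workers.
data = [(float(i % 7), float(i % 5)) for i in range(400)]
shards = [data[k * 100:(k + 1) * 100] for k in range(4)]

w = 0.3
per_worker = [mean_gradient(w, shard) for shard in shards]
averaged = sum(per_worker) / len(per_worker)  # what the parameter server computes
full_batch = mean_gradient(w, data)           # one server with batch=400

print(math.isclose(averaged, full_batch))
```

With plain gradient descent ∆P is the gradient scaled by the learning rate, so the same equality holds for the averaged ∆P.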
But synchronous updating also has a drawback: the overall speed depends on the slowest worker. If the workers differ significantly in hardware or software, their speeds differ noticeably and synchronous updating becomes slow. To avoid this problem, TensorFlow provides an asynchronous update strategy.
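A tiny sketch of the straggler effect, using made-up per-step timings: in synchronous mode the time of one step is the time of the slowest worker, however fast the others are.

```python
def sync_step_time(worker_seconds):
    # Every worker waits for the slowest one before the next step can begin.
    return max(worker_seconds)

times = [0.8, 0.9, 0.85, 2.4]  # hypothetical seconds per step; the last worker is slow
print(sync_step_time(times))   # 2.4
```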
When a worker, say Device A in the figure, finishes a training step and obtains a parameter change ∆P, it immediately sends ∆P to the parameter server. After receiving ∆P from Device A, the parameter server does not wait for the other workers. It uses this ∆P right away to update the global parameters, obtains the new global parameters P, and sends P back to Device A. After receiving P, Device A starts its next step immediately. Updating parameters asynchronously is equivalent to training with a single server: each step trains on one small batch of pictures, and only the order of the batches is determined by the random run-time state of the workers.
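The asynchronous flow can be imitated with a small single-process simulation. Everything here is an assumption made for illustration: one scalar parameter, a toy quadratic loss whose minimum is at `target`, and a seeded random generator standing in for the run-time state that decides which worker reports next.

```python
import random

def simulate_async(num_workers=4, num_events=400, lr=0.05, target=3.0, seed=0):
    """Simulate asynchronous updates on a single scalar parameter."""
    rng = random.Random(seed)
    global_p = 0.0
    local_p = [global_p] * num_workers  # each worker's last received copy of P
    for _ in range(num_events):
        # Which worker finishes next is random (stands in for run-time state).
        w = rng.randrange(num_workers)
        # The worker computes its change from its own, possibly stale, copy.
        delta = -lr * (local_p[w] - target)
        # The server applies the change at once, without waiting for others.
        global_p += delta
        # Only the reporting worker receives the new global parameters.
        local_p[w] = global_p
    return global_p

print(abs(simulate_async() - 3.0) < 1e-3)
```

Each update is applied one at a time, as on a single server, but a worker may compute its ∆P from parameters that other workers have already moved on from.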