Regression analysis , Only one independent variable and one dependent variable are included , The relationship between them can be expressed by a straight line , This kind of regression analysis is called unitary linear regression analysis ; If two or more independent variables are included in the regression analysis , The relationship between dependent variable and independent variable is linear , It is called multiple linear regression analysis . The commonly used methods are gradient descent method and least square method .
1. Gradient descent method （GD）
1.1 principle ：
among , Is the learning rate or step size （Learning rate）
1.2 Hypothesis function ：
1.3 loss function ：
1.4 Analysis process ：
1.4.1 Batch gradient descent method （BGD）
Batch gradient descent method , It is the most commonly used form of gradient descent method , This is done by using all m To update . Update the formula to ：
1.4.2 Stochastic gradient descent method （SGD）
Stochastic gradient descent method , In fact, it is similar to the principle of batch gradient descent method , The difference between the gradient and not all m Data of samples , It's just a sample i
To find the gradient . Update the formula to ：
Stochastic gradient descent method , and 4.1 The batch gradient descent method is two extremes , One uses all the data for gradient descent , One uses one sample for gradient descent . Naturally, their advantages and disadvantages are very prominent . For training speed , The stochastic gradient descent method uses only one sample at a time , Training is fast , And the batch gradient descent method is used when the sample size is large , Training speed is not satisfactory . For accuracy , The random gradient descent method is used to determine the direction of the gradient with only one sample , The resulting solution is probably not optimal .
1.4.3 Small batch gradient descent method （MBGD）
Small batch gradient descent method is a compromise between batch gradient descent method and random gradient descent method , That is, for m Samples , We use x Let's iterate ,1<x<m. Generally, it can be taken x=10, Of course, according to the sample data , You can adjust this x Value of . Update the formula to ：
1.5 Algorithm optimization ：
When using the gradient descent method , Tuning is needed . What needs to be tuned ?
Step size selection of algorithm . In the previous algorithm description , I mentioned taking the step size as 1, But the actual value depends on the data sample , You can take more values , From big to small , Run the algorithm separately , Look at the iteration effect , If the loss function is decreasing , It indicates that the value is valid , Otherwise, increase the step size . I said it earlier . The step size is too large , It will cause the iteration to be too fast , It may even miss the optimal solution . The step size is too small , The iteration speed is too slow , The algorithm won't end for a long time . Therefore, the step size of the algorithm needs to be run many times to get a better value . For step size , The following data can be used for testing ：
0.001,0.01,0.1,1,10... Or you can use it 0.001,0.003,0.01,0.03,0.1,0.3,1... It can be used immediately 3 Times or 10 Times the speed , Slowly adjust it to a range , Then fine tune it
Initial value selection of algorithm parameters . The initial values are different , The minimum values obtained may also be different , Therefore, only the local minimum is obtained by gradient descent ; Of course, if the loss function is convex, it must be the optimal solution . Because of the risk of local optimal solution , The algorithm needs to be run multiple times with different initial values , Minimum value of critical loss function , Select the initial value of loss function minimization .
normalization . Due to different samples, the range of values of characteristics is different , It can lead to slow iterations , In order to reduce the influence of feature value , Feature data can be normalized , That is, for each feature x, The standard deviation of the sum of its expectations std(x), And then it turns into ：
The new expectation of this feature is 0, The new variance is 1, The iteration speed can be greatly accelerated .
2. least square method (OLS)
Matrix expression ：
of course , It may not exist , This is very rare . If it's irreversible , The following two cases are generally considered ：
（1） Remove redundant features . There is a linear dependence between some features
（2） When there are too many features , Delete some features , such as （m<n）, For small sample data, regularization is used .
3. Selection of gradient descent method and least square method
Gradient descent method least square method
* Need to choose learning rate
* Multiple iterations are required
* When the range of eigenvalues is too different , Normalization is required
* When the characteristic number n When I was very young , Be able to work well
* There is no need to choose the learning rate
* Multiple iterations are not required
* Normalization is not required
* When the characteristic number n When I was very young , It's very slow , Because the inverse matrix is slow
under normal conditions , When n<10000 Time , Using the least square method . When n>=10000 Time , Consider the gradient descent method .
For some more complicated algorithms, only gradient descent method can be selected