Regression analysis , Only one independent variable and one dependent variable are included , The relationship between them can be expressed by a straight line , This kind of regression analysis is called unitary linear regression analysis ; If two or more independent variables are included in the regression analysis , The relationship between dependent variable and independent variable is linear , It is called multiple linear regression analysis . The commonly used methods are gradient descent method and least square method .

1. Gradient descent method (GD)

1.1 principle :


        among , Is the learning rate or step size (Learning rate)

1.2 Hypothesis function :


1.3 loss function :


1.4 Analysis process :

        1.4.1 Batch gradient descent method (BGD)

                Batch gradient descent method , It is the most commonly used form of gradient descent method , This is done by using all m To update . Update the formula to :


        1.4.2 Stochastic gradient descent method (SGD)

                 Stochastic gradient descent method , In fact, it is similar to the principle of batch gradient descent method , The difference between the gradient and not all m Data of samples , It's just a sample i
To find the gradient . Update the formula to :


   Stochastic gradient descent method , and 4.1 The batch gradient descent method is two extremes , One uses all the data for gradient descent , One uses one sample for gradient descent . Naturally, their advantages and disadvantages are very prominent . For training speed , The stochastic gradient descent method uses only one sample at a time , Training is fast , And the batch gradient descent method is used when the sample size is large , Training speed is not satisfactory . For accuracy , The random gradient descent method is used to determine the direction of the gradient with only one sample , The resulting solution is probably not optimal .

        1.4.3  Small batch gradient descent method (MBGD)

  Small batch gradient descent method is a compromise between batch gradient descent method and random gradient descent method , That is, for m Samples , We use x Let's iterate ,1<x<m. Generally, it can be taken x=10, Of course, according to the sample data , You can adjust this x Value of . Update the formula to :


1.5  Algorithm optimization :

         When using the gradient descent method , Tuning is needed . What needs to be tuned ?

Step size selection of algorithm . In the previous algorithm description , I mentioned taking the step size as 1, But the actual value depends on the data sample , You can take more values , From big to small , Run the algorithm separately , Look at the iteration effect , If the loss function is decreasing , It indicates that the value is valid , Otherwise, increase the step size . I said it earlier . The step size is too large , It will cause the iteration to be too fast , It may even miss the optimal solution . The step size is too small , The iteration speed is too slow , The algorithm won't end for a long time . Therefore, the step size of the algorithm needs to be run many times to get a better value . For step size , The following data can be used for testing :

  0.001,0.01,0.1,1,10... Or you can use it 0.001,0.003,0.01,0.03,0.1,0.3,1... It can be used immediately 3 Times or 10 Times the speed , Slowly adjust it to a range , Then fine tune it

Initial value selection of algorithm parameters . The initial values are different , The minimum values obtained may also be different , Therefore, only the local minimum is obtained by gradient descent ; Of course, if the loss function is convex, it must be the optimal solution . Because of the risk of local optimal solution , The algorithm needs to be run multiple times with different initial values , Minimum value of critical loss function , Select the initial value of loss function minimization .

normalization .  Due to different samples, the range of values of characteristics is different , It can lead to slow iterations , In order to reduce the influence of feature value , Feature data can be normalized , That is, for each feature x, The standard deviation of the sum of its expectations std(x), And then it turns into :


            The new expectation of this feature is 0, The new variance is 1, The iteration speed can be greatly accelerated .

2. least square method (OLS)

Matrix expression : 

order :


that ,


of course , It may not exist , This is very rare . If it's irreversible , The following two cases are generally considered :

        (1) Remove redundant features . There is a linear dependence between some features

        (2) When there are too many features , Delete some features , such as (m<n), For small sample data, regularization is used .

3. Selection of gradient descent method and least square method

Gradient descent method least square method

shortcoming :

* Need to choose learning rate
* Multiple iterations are required
* When the range of eigenvalues is too different , Normalization is required
advantage :

* When the characteristic number n When I was very young , Be able to work well
advantage :

* There is no need to choose the learning rate
* Multiple iterations are not required
* Normalization is not required
shortcoming :

* When the characteristic number n When I was very young , It's very slow , Because the inverse matrix is slow

under normal conditions , When n<10000 Time , Using the least square method . When n>=10000 Time , Consider the gradient descent method .

For some more complicated algorithms, only gradient descent method can be selected

©2019-2020 Toolsou All rights reserved,
The 11th Blue Bridge Cup python The real topic of the University Group National Games JavaSwing To achieve a simple Lianliankan games 【Spring Source code analysis 】42-@Conditional Detailed explanation element-ui Step on pit record 2019PHP Interview questions ( Continuously updated )PHPJava Misunderstanding —— Method overloading is a manifestation of polymorphism ? First issue 500 100 million , Set up a new Department , What is Tencent going to do ? Google chrome The browser can't open the web page , But what if other browsers can open it ? Regression of dependent variable order categories (R language )【Golang Basic series 10 】Go language On conditional sentences if