A very common method used for defining a model and estimating its parameters, it is to minimize model’s errors with Gradient Descend. The Gradient Descent estimates the weights of the model in many iterations by minimizing a cost function at every step.

According to wiki,** gradient descent** is a first-order optimization algorithm. To find a local minimum of a function using **gradient descent**, one takes steps proportional to the negative of the **gradient** (or of the approximate **gradient**) of the function at the current point. If instead one takes steps proportional to the positive of the gradient, one approaches a local maximum of that function; the procedure is then known as gradient ascent.

**Hypothesis: **h_{θ}(x) = θ^{T}x = θ_{0}x_{0} + θ_{1}x_{1} + θ_{2}x_{2} + … + θ_{n}x_{n
}**Parameters: **θ_{0}, θ_{1}, …, θ_{n} (you can think of the parameters of the model as being a vector θ)

**Cost function: Gradient decent: **

, **α – learning rate** (simultaneously update for every j = 0, …, n).

Normally, cost function for gradient descent should look like in the graph below:

*Fig. 1.1.*

As it can be seen from the previous graph, after a certain number of iterations, the curve has flattened, so that gradient descend starts to converge because the cost function isn’t going down much more. The number of iterations the gradient descent takes to converge for a physical application can vary a lot, so it turns out to be difficult to tell in advance how many iterations gradient descendant needs to converge. Usually, to find out the proper number of iterations you can plot the cost function as shown above.

It is also possible to come up with automatic convergence test.

E.g. **Automatic convergence test.** Declare convergence if J(θ) decreases by less than 10^{-3} in one iteration. Usually, it is pretty difficult to choose an appropriate threshold. Thus, in order to check your gradient descendant’s converge, it is a good idea to take use of this kind of plots, rather than rely on an automatic convergence test.

Also, looking at this sort of figure can also tell you, or give you an advance warning if maybe gradient descent is not working correctly.** **If you plot J(θ) as a function of the number of iterations, then if you see a figure like the one below, where J(θ) is actually increasing, it is clear that something is wrong with your gradient descent. A θ like this usually means that you should use a smaller learning rate (α).

*Fig. 1.2.*

Similarly, sometimes you may also see J(θ) do something like in Fig. 1.3. Also, a fix fore something like this is to use a smaller value of alpha.

*Fig. 1.3.*

Assumptions about the cost function:

- For sufficient small α, J(θ) should decrease on every iteration.
- If this doesn’t happen, probably α is to big, so you should set it smaller.

- But if α is too small, gradient descent can be slow to converge.

For a better understanding, below you can find an example provided by Coursera.

Summary:

- If α is too small: slow convergence.
- If α is too large: J(θ) may not decrease on every iteration; may not converge.

To choose α, try: …, 0.001, 0.01, 1,…

**Thanks:**

wiki

coursera

blog.datumbox.com

[…] very different ranges of values and it is important to apply feature scaling, if you’re using gradient descent, to get them into comparable ranges of […]

LikeLike

[…] and disadvantages between gradient descent and normal […]

LikeLike