For the implementation of gradient descent for linear regression I draw from Digithead’s blog post.
Andrew Ng defines gradient descent for linear regression as:
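$$\theta_0 := \theta_0-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)$$

$$\theta_1 := \theta_1-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x^{(i)}$$

repeating the updates until convergence, with $\theta_0$ and $\theta_1$ updated simultaneously.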
Where $\alpha$ is the learning rate, $m$ is the number of training examples, and the term on the right is the familiar squared error term after taking the partial derivative $\frac{\partial}{\partial\theta_0}$ or $\frac{\partial}{\partial\theta_1}$ as appropriate.
Digithead’s implementation of this is quite slick, and it took me a while to get my head around the vectorised implementation, so I will break it down here, for my own memory if nothing else:
To run linear regression as a matrix multiplication it is necessary to add a column of ones, so that $x_0 = 1$. This means that when the matrix $X$ is multiplied by the parameter vector $\theta$, the intercept enters as $\theta_0\times1=\theta_0$. i.e.:
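$$X\theta=\begin{bmatrix}1 & x^{(1)}\\ 1 & x^{(2)}\\ \vdots & \vdots\\ 1 & x^{(m)}\end{bmatrix}\begin{bmatrix}\theta_0\\ \theta_1\end{bmatrix}=\begin{bmatrix}\theta_0+\theta_1 x^{(1)}\\ \theta_0+\theta_1 x^{(2)}\\ \vdots\\ \theta_0+\theta_1 x^{(m)}\end{bmatrix}$$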
We define the usual squared error cost function:
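$$J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)^2$$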
Except that in Digithead’s implementation below $h_\theta$ is defined by the matrix multiplication of $X$ and $\theta$ as described above, and rather than multiplying by $\frac{1}{2m}$, he divides by $2m$.
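In R this boils down to something like the following sketch (the function name here is my own choice, and the details may differ cosmetically from the original):

```r
# squared error cost function; h_theta is the matrix product X %*% theta,
# and dividing by 2 * length(y) is the same as multiplying by 1/(2m)
cost <- function(X, y, theta) {
  sum((X %*% theta - y)^2) / (2 * length(y))
}
```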
The learning rate $\alpha$ is set to a low number initially, and the number of iterations is set to 1000.
A vector and a list are initialised to handle the history of the cost function $J(\theta_0,\theta_1)$ and the parameters $\theta$ at each iteration.
The coefficients for regression are initialised to zero.
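Putting that setup together, a sketch might look like this (the example data and the exact value of $\alpha$ are my own assumptions, not taken from the original post):

```r
# hypothetical example data: the original post uses its own data set
set.seed(1)
x <- runif(100, -5, 5)
y <- 1.5 + 2 * x + rnorm(100)

X <- cbind(1, x)                            # add the column of ones so that x0 = 1

alpha <- 0.01                               # learning rate: "a low number" (exact value is my guess)
num_iters <- 1000                           # number of iterations
cost_history <- double(num_iters)           # vector to hold J(theta) at each iteration
theta_history <- vector("list", num_iters)  # list to hold theta at each iteration
theta <- matrix(c(0, 0), nrow = 2)          # coefficients initialised to zero
```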
Finally, the gradient descent is implemented as a for loop:
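A reconstruction of that loop, based on the line-by-line description that follows (the comments and exact layout are mine, so it may differ cosmetically from Digithead's original):

```r
for (i in 1:num_iters) {
  error <- (X %*% theta - y)             # first line: h_theta(x) - y for every example
  delta <- t(X) %*% error / length(y)    # second line: errors times x, summed and divided by m
  theta <- theta - alpha * delta         # third line: update theta by alpha * delta
  cost_history[i] <- cost(X, y, theta)   # fourth line: record J(theta) after the update
  theta_history[[i]] <- theta            # last line: record the parameters themselves
}
```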
This makes a reasonably large jump, so I’ll break down each line of this loop, for my own understanding.
Recall that Andrew Ng defines the final algorithm for gradient descent for linear regression to be:
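$$\theta_j := \theta_j-\alpha\frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)})-y^{(i)}\right)\cdot x_j^{(i)}$$

for $j=0,1$, where $x_0^{(i)}=1$, so the same expression covers both the intercept and the slope.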
The first and second lines of the loop handle the term $(h_{\theta}(x^{(i)})-y^{(i)})\cdot x^{(i)}$. The first line:
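```r
error <- (X %*% theta - y)   # from the loop sketch above
```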
does our linear regression by matrix multiplication, as mentioned above. The second line:
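```r
delta <- t(X) %*% error / length(y)
```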
does both the sum ($\sum_{i=1}^{m}$) and the multiplication by $x^{(i)}$ in a single matrix product: it takes every error (predicted minus observed) from the vector computed on the first line and multiplies it by the transpose of $X$ (`t(X)`), which includes the $x_0=1$ column, summing over the training examples as it does so.
To give a snippet of this calculation from the first iteration:
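With $\theta$ initialised at zero, every prediction $h_\theta(x^{(i)})$ is zero on the first pass, so the error vector is just $-y$, and the first `delta` works out as

$$\frac{1}{m}X^{T}(X\theta-y)=-\frac{1}{m}X^{T}y$$

with the exact numbers depending on the data.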
So this will end with a two-dimensional vector (or a $2\times1$ matrix) `delta` $\in\mathbb{R}^{2}$. The end of this line divides by the length of the vector `y`, or $m$ in the notation I have been using so far; this stands in for multiplying by $\frac{1}{m}$.
The third line of the loop:
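```r
theta <- theta - alpha * delta
```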
updates $\theta$ using the learning rate $\alpha$ multiplied by `delta`, whilst the next line:
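```r
cost_history[i] <- cost(X, y, theta)
```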
applies the sum of squares cost function to the parameters $\theta$ following the update, and saves this out to the double-precision vector `cost_history` initialised earlier.
Finally, the last line of the code saves out the parameter vector $\theta$ to the list `theta_history`. The loop then repeats.
So let’s run it and see what happens… and just out of interest, I have included a call to `system.time` so we can measure how long it takes.
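Something along these lines, with the loop wrapped in `system.time()` (the timings will of course depend on the machine and the data):

```r
theta <- matrix(c(0, 0), nrow = 2)   # reset the starting point before timing
gd_time <- system.time({
  for (i in 1:num_iters) {
    error <- (X %*% theta - y)
    delta <- t(X) %*% error / length(y)
    theta <- theta - alpha * delta
    cost_history[i] <- cost(X, y, theta)
    theta_history[[i]] <- theta
  }
})
gd_time   # elapsed time for 1000 iterations
theta     # the coefficients gradient descent has arrived at
```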
We can check these values using the built-in regression function in R, `lm`, which solves for the coefficients directly (conceptually the normal equation, though `lm` in fact uses a QR decomposition), also with a call to `system.time`.
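For example (again just a sketch, assuming the predictor and response are in `x` and `y`):

```r
# fit the same model with R's built-in lm(), timed for comparison
lm_time <- system.time(fit <- lm(y ~ x))
coef(fit)   # intercept and slope to compare against theta
lm_time
```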
So, interestingly, this shows us that the gradient descent run for 1000 iterations has stopped short of finding the correct answer, and also took much longer. This may mean that $\alpha$ is too small, or that there were not enough iterations. The answer is close, but still not quite the same as the answer derived from the normal equation:
Plotting $J(\theta_0,\theta_1)$ for each iteration would indicate that we had not yet minimised $J(\theta_0,\theta_1)$, and that it is continuing to fall with each further iteration:
The right hand plot just zooms in on the left hand plot, at the very end of the curve, so we can see that it is still dropping at quite a pace.
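A sketch of how such a pair of plots could be produced (the zoom range on the right-hand panel is an arbitrary choice of mine):

```r
# left: J(theta) over all iterations; right: zoom in on the final iterations
par(mfrow = c(1, 2))
plot(cost_history, type = "l", xlab = "Iteration", ylab = "J(theta)",
     main = "Cost over 1000 iterations")
plot(900:1000, cost_history[900:1000], type = "l", xlab = "Iteration",
     ylab = "J(theta)", main = "Final 100 iterations")
```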
This time I try gradient descent again, having rolled the code into a self-contained function that takes arguments and follows the notation Andrew Ng sticks to in the machine learning course. In addition, the cost function has been changed to the vectorised form:
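$$J(\theta)=\frac{1}{2m}(X\theta-y)^{T}(X\theta-y)$$

A sketch of what such a self-contained function might look like (the function and argument names are my own reconstruction rather than the exact original code):

```r
# gradient descent for linear regression, vectorised throughout
gradient_descent <- function(X, y, theta, alpha, num_iters) {
  m <- length(y)
  cost_history <- double(num_iters)
  theta_history <- vector("list", num_iters)
  for (i in 1:num_iters) {
    theta <- theta - (alpha / m) * (t(X) %*% (X %*% theta - y))
    # vectorised cost: J(theta) = (X theta - y)'(X theta - y) / 2m
    cost_history[i] <- drop(crossprod(X %*% theta - y)) / (2 * m)
    theta_history[[i]] <- theta
  }
  list(theta = theta, cost_history = cost_history, theta_history = theta_history)
}
```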
So we define $\theta$ and then save the result out to the object `out`, this time specifying many more iterations.
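Something like the following, where the number of iterations (and the learning rate) are just my guesses at "many more":

```r
theta <- matrix(c(0, 0), nrow = 2)   # start from zero again
out <- gradient_descent(X, y, theta, alpha = 0.01, num_iters = 100000)
out$theta   # the estimated coefficients
```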
Now let’s plot it, along with $J(\theta_0,\theta_1)$.
So this time it appears to converge, although it took many iterations to get here. Some adjustment of the learning rate would probably bring this down.