In my last post I compared vectorised logistic regression, solved with a general-purpose optimisation algorithm, against a generalised linear model. I tested it out on a very simple dataset which could be classified using a linear boundary. In this post I’m following the next part of Andrew Ng’s Machine Learning course on Coursera and implementing regularisation and feature mapping, allowing logistic regression to fit non-linear decision boundaries. And of course, I’m doing it in R, not Matlab or Octave.
As ever, the full code to produce this page is available on GitHub.
Visualising the data
First I plot the data… and it’s pretty clear that creating an accurate decision boundary will require some degree of polynomial features to account for the circular shape of the data.
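A minimal sketch of this step (the file name ex2data2.txt and the column names are assumptions on my part; the actual code is in the repository linked above):

```r
library(ggplot2)

# Assumed file: two numeric features and a binary label, comma separated
df <- read.csv("ex2data2.txt", header = FALSE,
               col.names = c("x1", "x2", "y"))

# Scatter plot coloured by class
ggplot(df, aes(x1, x2, colour = factor(y))) +
  geom_point() +
  labs(colour = "y")
```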
Feature mapping
In this example I’ll map the features into all polynomial terms of $x_1$ and $x_2$ up to the twelfth power, giving a crazy number of input features. Hence:
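$$
\text{mapFeature}(x_1, x_2) = \begin{bmatrix} 1 \\ x_1 \\ x_2 \\ x_1^2 \\ x_1 x_2 \\ x_2^2 \\ x_1^3 \\ \vdots \\ x_1 x_2^{11} \\ x_2^{12} \end{bmatrix}
$$

This gives $\frac{(12+1)(12+2)}{2} = 91$ terms in total, including the constant.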
These polynomials can be calculated with the following code. In future I will update this to take more than two input features.
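A minimal sketch of such a function (the name map_feature and its arguments are my own; the actual implementation may differ):

```r
# Expand two feature vectors into all polynomial terms up to `degree`,
# including a leading column of ones for the intercept
map_feature <- function(x1, x2, degree = 12) {
  out <- matrix(1, nrow = length(x1), ncol = 1)
  for (i in 1:degree) {
    for (j in 0:i) {
      out <- cbind(out, x1^(i - j) * x2^j)
    }
  }
  out
}

# For degree 12 this gives (12 + 1) * (12 + 2) / 2 = 91 columns
X_poly <- map_feature(df$x1, df$x2)
dim(X_poly)
```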
And to list the 91 features:
Now run the logistic regression I implemented in my previous post.
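For reference, a sketch of what this step might look like, re-creating the vectorised cost and gradient from the previous post (the function names are assumptions, apart from ucminf_out which is referenced below):

```r
library(ucminf)

sigmoid <- function(z) 1 / (1 + exp(-z))

# Unregularised, vectorised cost and gradient for logistic regression
cost <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  as.numeric(-(t(y) %*% log(h) + t(1 - y) %*% log(1 - h)) / length(y))
}

grad <- function(theta, X, y) {
  h <- sigmoid(X %*% theta)
  as.vector(t(X) %*% (h - y)) / length(y)
}

# Minimise the cost over the 91 parameters, starting from zero
theta_init <- rep(0, ncol(X_poly))
ucminf_out <- ucminf(theta_init, fn = cost, gr = grad,
                     X = X_poly, y = df$y)
ucminf_out$convergence
```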
So the optimisation algorithm converged successfully, and if I were to call ucminf_out$par, it would return our 91 fitted parameters.
At this point it is probably worth defining some sort of measure of accuracy. A simple proportion error will suffice in this case.
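One way to sketch such a measure, assuming a 0.5 threshold on the predicted probability:

```r
# Proportion of training examples misclassified at a 0.5 threshold
prop_error <- function(theta, X, y) {
  pred <- sigmoid(X %*% theta) >= 0.5
  mean(pred != y)
}

prop_error(ucminf_out$par, X_poly, df$y)
```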
So the present model correctly classifies 93% of the training data, but the decision boundary has a pretty specific shape that is likely to be overfitted.
With just two original input features, we can quite easily plot the decision boundary. To do so I create a matrix $X$ of $m$ rows corresponding to a grid of points, for which we can then generate predictions using the output $\theta$ derived from the model fit above. We then combine the predictions from the grid of points in a contour plot.
The function to create the boundary thus takes a sequence of numbers xy delineating the limits of the plot (alongside the fitted parameters $\theta$). This works for situations where the ranges of the two features are similar, but it would need to be adapted for features with different ranges (although it would probably be fine if feature scaling were used).
Create the grid of predictions:
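A sketch of this step, re-using the map_feature and sigmoid functions sketched above (the range of xy is an assumption):

```r
# A square grid of points spanning the feature space
xy <- seq(-1, 1.5, length.out = 100)
grid <- expand.grid(x1 = xy, x2 = xy)

# Map the grid through the same polynomial expansion, then predict
grid_poly <- map_feature(grid$x1, grid$x2)
grid$prob <- as.vector(sigmoid(grid_poly %*% ucminf_out$par))
```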
And now for the decision boundary:
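A minimal sketch of such a contour plot (the aesthetics in the post will differ):

```r
# Decision boundary: the contour where the predicted probability is 0.5
ggplot() +
  geom_point(data = df, aes(x1, x2, colour = factor(y))) +
  geom_contour(data = grid, aes(x1, x2, z = prob),
               breaks = 0.5, colour = "black") +
  labs(colour = "y")
```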
So this looks like it is capturing the positive values pretty well, but it could probably be improved, especially in the top and bottom left where new cases are likely to be misclassified.
Regularisation - cost function and gradient
To improve on the boundary above we can implement regularisation; this should reduce some of the overfitting seen in the last plot.
Andrew Ng gives us the regularised cost function as:
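$$
J(\theta) = \frac{1}{m}\sum_{i=1}^{m}\left[-y^{(i)}\log\left(h_\theta(x^{(i)})\right) - \left(1 - y^{(i)}\right)\log\left(1 - h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2
$$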
Note that the parameter $\theta_0$ is not regularised as this corresponds to the intercept.
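A sketch of how this might be implemented, building on the unregularised cost above (the names are my own):

```r
# Regularised cost: the unregularised cost plus a penalty on every
# parameter except theta_0 (the intercept, stored in theta[1] in R)
cost_reg <- function(theta, X, y, lambda = 0) {
  m <- length(y)
  cost(theta, X, y) + lambda / (2 * m) * sum(theta[-1]^2)
}
```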
So let’s test this in comparison with the cost function that I defined in the previous post by setting the parameter $\lambda=0$, i.e. no regularisation.
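For example, comparing the two at the zero starting vector:

```r
theta_zero <- rep(0, ncol(X_poly))

# With lambda = 0 the regularised cost should equal the original cost
all.equal(cost_reg(theta_zero, X_poly, df$y, lambda = 0),
          cost(theta_zero, X_poly, df$y))
```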
Great, the function passes this basic test. And the cost with all values of $\theta$ initialised to zero should be around $0.693$: with $\theta = 0$ every prediction is $0.5$, so the cost is $-\log(0.5) \approx 0.693$.
Now for the gradient function. As noted, we don’t regularise $\theta_0$, so the gradient needs a separate case for the intercept:
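$$
\frac{\partial J(\theta)}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_0^{(i)}
$$

$$
\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\theta_j \qquad \text{for } j \geq 1
$$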
This can be implemented in vectorised fashion:
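A possible sketch, again building on the unregularised grad function above:

```r
# Regularised gradient: penalise every parameter except theta_0
grad_reg <- function(theta, X, y, lambda = 0) {
  m <- length(y)
  penalty <- (lambda / m) * theta
  penalty[1] <- 0  # do not regularise the intercept
  grad(theta, X, y) + penalty
}
```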
Now check that this gives the same result for the implementation without regularisation.
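For instance:

```r
# With lambda = 0 the regularised gradient should match the original
all.equal(grad_reg(theta_zero, X_poly, df$y, lambda = 0),
          grad(theta_zero, X_poly, df$y))
```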
So far so good. Now I’ll try running regularised logistic regression for the polynomial example, but first I’ll wrap this into a function to save having to explicitly declare the parameters each time.
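A sketch of such a wrapper (the name and return value are assumptions):

```r
# Fit regularised logistic regression for a given lambda and return the
# optimiser output along with the training-set error
reg_logistic <- function(X, y, lambda = 0) {
  theta_init <- rep(0, ncol(X))
  fit <- ucminf(theta_init, fn = cost_reg, gr = grad_reg,
                X = X, y = y, lambda = lambda)
  list(fit = fit, error = prop_error(fit$par, X, y))
}
```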
So we can try this…
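For example, with $\lambda = 1$:

```r
fit_reg <- reg_logistic(X_poly, df$y, lambda = 1)
fit_reg$error  # the post reports a training error of roughly 0.15 here
```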
And it seems to be working, but notice that with $\lambda=1$ the error on the training set has increased to 0.15. This doesn’t tell the whole story though, as the previous decision boundary suggests overfitting. So what about the decision boundary for $\lambda=1$?
Regularisation has smoothed away much of the overfitting. We can’t tell how successful this will be without evaluating the model on a test set, but we can also try a range of values for $\lambda$ and see what effect this has.
First compute the percentage errors for $\lambda={0,0.0001,0.001,0.01,0.1,1}$.
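One way to do this with the wrapper sketched above:

```r
lambdas <- c(0, 0.0001, 0.001, 0.01, 0.1, 1)

# Training-set error for each value of lambda
errors <- sapply(lambdas, function(l) {
  reg_logistic(X_poly, df$y, lambda = l)$error
})

data.frame(lambda = lambdas, error = errors)
```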
So it looks like increasing $\lambda$ is reducing the accuracy of the model on the training set. But again, this isn’t the whole story. What about the decision boundaries?
Now use tidyr::gather to turn this wide data into long data so it can be passed to ggplot2::facet_wrap.
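A sketch of this reshaping and plotting step, assuming a wide data frame of grid predictions with one probability column per value of $\lambda$ (the column names are my own):

```r
library(dplyr)
library(tidyr)

# Predictions over the grid for each lambda, one column per lambda
probs <- sapply(lambdas, function(l) {
  theta <- reg_logistic(X_poly, df$y, lambda = l)$fit$par
  as.vector(sigmoid(grid_poly %*% theta))
})
colnames(probs) <- paste0("lambda_", lambdas)

wide <- bind_cols(grid[, c("x1", "x2")], as.data.frame(probs))

# Wide to long, so that lambda becomes a faceting variable
long <- gather(wide, key = "lambda", value = "prob", -x1, -x2)

ggplot() +
  geom_point(data = df, aes(x1, x2, colour = factor(y)), size = 0.5) +
  geom_contour(data = long, aes(x1, x2, z = prob),
               breaks = 0.5, colour = "black") +
  facet_wrap(~lambda) +
  labs(colour = "y")
```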
So it’s clear that increasing $\lambda$ leads to progressively greater smoothing of the decision boundary. And despite decreasing accuracy on the training set, these regularised decision boundaries would certainly perform better against a test set.