
I’m currently working through the excellent Machine Learning course by Andrew Ng, available on Coursera. I’ve been completing the exercises in R, rather than Matlab or Octave as required by the course. This is the first programming exercise: implementing linear regression using the gradient descent algorithm rather than the normal equation method.

Gradient descent for a function with one parameter

Rather than calculating the optimal solution for the linear regression in a single step, in this exercise we use gradient descent to find a solution iteratively. To get the concept behind gradient descent, I start by implementing it for a function which takes just one parameter (rather than two, like linear regression).

In this instance I have adapted code from Matt Bogard’s excellent blog Econometric Sense, and will use the same function:

$$f(x) = 1.2(x-2)^2 + 3.2$$

So we can state our objective: minimise $J(\theta_1)$ with respect to $\theta_1$, where $\theta_1$ is a real number; or, put mathematically, $\min\limits_{\theta_1}J(\theta_1)$ with $\theta_1\in\mathbb{R}$.

Cost function

The function $f(x)$ above plays the role of the cost function $J(\theta_1)$; using calculus, its gradient is $\frac{d}{dx}J(x)=2.4(x-2)$ (see Matt’s blog).
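Making the calculus explicit (a worked step added here), differentiating $f(x) = 1.2(x-2)^2 + 3.2$ with the chain rule gives

$$\frac{d}{dx}\left[1.2(x-2)^2 + 3.2\right] = 1.2 \cdot 2(x-2) = 2.4(x-2)$$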

Gradient descent is defined by Andrew Ng as repeating until convergence:

$$\theta_1 := \theta_1 - \alpha\frac{d}{d\theta_1}J(\theta_1)$$

where $\alpha$ is the learning rate governing the size of the step taken with each iteration.
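Substituting the derivative above into the update rule gives, for this one-parameter example (a worked step added for clarity):

$$x := x - \alpha \cdot 2.4(x - 2)$$

Each iteration therefore rescales the distance from the minimum, $x_{n+1} - 2 = (1 - 2.4\alpha)(x_n - 2)$, so the sequence shrinks towards $x = 2$ only when $|1 - 2.4\alpha| < 1$, i.e. $0 < \alpha < 5/6$. How close $\alpha$ sits to these bounds governs how quickly the runs below converge.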

Here I define a function to plot the results of gradient descent, so we can get a sense of what is happening.

library(dplyr)
library(magrittr) # provides the %>% pipe used later
library(ggplot2)
 
xs <- seq(0, 4, len = 100) # create some values
 
# define the function we want to minimise
 
f <- function(x) {
  1.2 * (x - 2)^2 + 3.2
}
 
# plot the function, marking the known minimum at (2, 3.2)
 
create_plot <- function(title) {
  plot(
    ylim = c(3, 8),
    x = xs,
    y = f(xs),
    type = "l",
    ylab = expression(1.2(x - 2)^2 + 3.2),
    xlab = "x",
    main = title
  )
  
  # dashed red lines through the minimum
  abline(
    h = 3.2,
    v = 2,
    col = "red",
    lty = 2
  )
  
}
 
# gradient of J(theta): dJ/dx = 2.4(x - 2)
 
cost <- function(x) {
  1.2 * 2 * (x - 2)
}
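As a quick check that the analytic gradient is right (my addition; num_grad is just an illustrative helper, not from the course), we can compare cost() against a central finite difference:

# numerical gradient via central differences, for checking cost()
num_grad <- function(f, x, h = 1e-6) {
  (f(x + h) - f(x - h)) / (2 * h)
}
 
cost(0.5)        # analytic: 2.4 * (0.5 - 2) = -3.6
num_grad(f, 0.5) # should agree to several decimal places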

Below is the actual implementation of gradient descent.

# gradient descent implementation
 
grad <- function(x = 0.1, alpha = 0.6, j = 1000) {
  
  # record the starting point
  xtrace <- x
  ftrace <- f(x)
  
  for (i in 1:j) {
    
    # the gradient descent update: step downhill by alpha times the gradient
    x <- x - alpha * cost(x)
    
    # keep a trace of x and f(x) at every iteration
    xtrace <- c(xtrace, x)
    ftrace <- c(ftrace, f(x))
    
  }
  
  data.frame(
    "x" = xtrace,
    "f_x" = ftrace
  )
}
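Before plotting, a quick sanity check (this snippet is my addition, not part of the exercise): run the function and confirm that the trace settles at the minimum, $x = 2$ and $f(x) = 3.2$.

head(grad(x = 0.1, alpha = 0.6, j = 5))      # x should be heading towards 2
tail(grad(x = 0.1, alpha = 0.6, j = 100), 1) # x ~ 2, f_x ~ 3.2 after convergence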

Pretty simple! Now I use the plotting function to produce a plot for each of three learning rates, and overlay the points visited by the gradient descent algorithm.

alpha_too_low <- grad(
  x = 0.1,     # initialisation of x
  alpha = 0.1, # learning rate
  j = 100      # iterations
)
 
create_plot(expression(Low~alpha))
 
with(
  alpha_too_low,
  points(
    x,
    f_x,
    type = "b",
    col = "green"
  )
)

[Plot: gradient descent steps (green) on $f(x)$ with a low learning rate]

alpha_just_right <- grad(
  x = 0.1,     # initialisation of x
  alpha = 0.6, # learning rate
  j = 100      # iterations
)
 
create_plot(expression(alpha~just~right))
 
with(
  alpha_just_right,
  points(
    x,
    f_x,
    type = "b",
    col = "blue"
  )
)

[Plot: gradient descent steps (blue) on $f(x)$ with a well-chosen learning rate]

alpha_too_high <- grad(
  x = 0.1,     # initialisation of x
  alpha = 0.8, # learning rate
  j = 100      # iterations
)
 
create_plot(expression(High~alpha))
 
with(
  alpha_too_high,
  points(
    x,
    f_x,
    type = "b",
    col = "red"
  )
)

[Plot: gradient descent steps (red) on $f(x)$ with a high learning rate]

Another way to look at the rate of convergence is to plot the value of $x$ at each iteration. Vertical lines mark the iteration at which $x$, rounded to four decimal places, reaches the minimum. When $\alpha$ is set very low, convergence takes much longer than necessary (although it does occur). When $\alpha$ is too high, each step overshoots the minimum, and convergence doesn’t occur at all within a hundred iterations.
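To put numbers on this, here is a small helper (my own addition; iters_to_converge is a hypothetical name, not from the course) reporting the first iteration at which $x$ falls within a tolerance of the known minimum:

# first iteration at which x is within tol of the minimum (x = 2);
# NA if that never happens within the trace
iters_to_converge <- function(trace, target = 2, tol = 1e-4) {
  converged <- abs(trace$x - target) < tol
  if (!any(converged)) return(NA)
  min(which(converged)) - 1 # row 1 of the trace is the starting point
}
 
iters_to_converge(alpha_too_low)    # slow, but gets there
iters_to_converge(alpha_just_right) # quickest of the three
iters_to_converge(alpha_too_high)   # NA within 100 iterations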

par(mfrow = c(1, 3))
 
# plot x at each iteration for the three runs; the vertical line marks
# the point at which x, rounded to 4 decimal places, reaches 2
 
plot(alpha_too_low$x, type = "l", col = "green")
abline(v = (round(alpha_too_low$x, 4) != 2) %>% which %>% length)
 
plot(alpha_just_right$x, type = "l", col = "blue")
abline(v = (round(alpha_just_right$x, 4) != 2) %>% which %>% length)
 
plot(alpha_too_high$x, type = "l", col = "red")
abline(v = (round(alpha_too_high$x, 4) != 2) %>% which %>% length)

[Plot: $x$ at each iteration for the three learning rates, with vertical lines marking convergence]