Supervised learning

Introduction

“Supervised learning” refers to building a model with a data set that includes records that have values for a response variable of interest (for example, crop yield), and for predictor variables (for example, rainfall and fertilizer application rates).

In matrix notation these models have the form \(Y = f(X) + \epsilon\), where \(Y\) is a vector with observed values of the variable of interest; \(X\) is the matrix with predictor variables; and \(\epsilon\) (epsilon) is the unexplained variation. The art is to find the most appropriate function \(\hat{f}\) (f-hat), that is, the estimate that most closely resembles the true function \(f\), so that the model predictions \(\hat{Y} = \hat{f}(X)\) are accurate. In the next section, we illustrate that the best model is not necessarily the one that minimizes the apparent (training) error.

Over-fitting versus under-fitting

Linear regression model

Over-fitting a model means that the model fits the data that was used to build it very well; in fact, too well. Let’s look at this with a simple data set where we have a response variable y and a single predictor variable x.

x <- c(35, 47.2, 20.7, 27.1, 42.5, 40.1, 28.2, 35.8, 47.4,
        21.3, 39, 30.6, 18.7, 8.4, 45.8, 14, 47.3, 31.5, 38, 32)
y <- c(9, 11.5, 7.6, 8.8, 9.8, 11, 8.2, 8.6, 11.4, 7.4,
        11.1, 7.9, 6, 4.6, 12.3, 6.1, 11.4, 8.8, 10.3, 9.4)
plot(x, y)

[Figure: scatter plot of y versus x]

We could build a simple linear regression model like this

mlr <- lm(y~x)
plot(x, y, pch=20, cex=1.5, las=1)
abline(mlr, col="red", lwd=3)

[Figure: scatter plot of x and y with the fitted regression line in red]

summary(mlr)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.02846 -0.32128 -0.09411  0.47323  0.93314
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.4050     0.4266   7.982 2.53e-07 ***
## x             0.1738     0.0124  14.021 3.97e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6214 on 18 degrees of freedom
## Multiple R-squared:  0.9161, Adjusted R-squared:  0.9115
## F-statistic: 196.6 on 1 and 18 DF,  p-value: 3.967e-11

A linear regression model takes the form \(\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ...\), where \(\beta_0\) is the intercept and the other \(\beta\)s are slopes. In this case we found \(\hat{f}\) to be \(\hat{y} = 3.4 + 0.17 x\).

I hope you will agree that this looks very good. The line seems to capture the general trend; the model \(R^2\) is 0.92, and the p-values for the model parameters are very small (strong support for the intercept and slope being different from zero). But can we do better?
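As an aside, you can extract the estimated coefficients (and their confidence intervals) directly from the fitted model object with the standard coef and confint functions (output not shown):

# estimated intercept and slope
coef(mlr)
# 95% confidence intervals for the coefficients
confint(mlr)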

Spline model

Linear regression models with a single predictor variable are not very flexible. Let’s try another extreme: a spline function. A spline model is referred to as a non-parametric model (this is a misnomer; it has, in fact, many parameters, so many that we do not want to consider them individually, as we cannot easily interpret them).

# create the function from the data
sf <- splinefun(x, y)
# x prediction points
px <- seq(1, 50, 0.25)
# predictions
py <- sf(px)
plot(x, y, pch=20, cex=2, las=1)
abline(mlr, col="red", lwd=3)
lines(px, py, col="blue", lwd=3)

[Figure: data with the linear regression (red) and spline (blue) fits]

Wow. Model sf seems perfect. Let’s compare the two models with the Root Mean Squared Error (RMSE) statistic.

Root Mean Squared Error

The Root Mean Squared Error (RMSE) is a commonly used metric to compare models with a quantitative response. For qualitative responses, measures such as kappa are used; and if the response is true/false (presence/absence), ROC-AUC is commonly used.
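For \(n\) observed values \(y_i\) with model predictions \(\hat{y}_i\), it is defined as \(\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}\).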

Here is a function that implements RMSE.

# Root Mean Squared Error of predictions (prd), given observations (obs)
rmse <- function(obs, prd, na.rm=FALSE) {
    sqrt(mean((obs - prd)^2, na.rm=na.rm))
}

Let’s test this new function. We have observed values of 1 to 10, and three different predictions.

# no difference, perfect model
rmse(1:10, 1:10)
## [1] 0
# a small difference, not bad
rmse(1:10, 2:11)
## [1] 1
# rather different, much worse than the above
rmse(1:10, 10:1)
## [1] 5.744563

So the more the observed and predicted data differ, the higher the RMSE. The RMSE is expressed in the units of the data. To make RMSE values easier to interpret, you can express them relative to a Null model. A Null model is a simple, perhaps naive, model that can serve as an important benchmark. When evaluating a complex model, it is often not of prime interest to know how well it does in absolute terms, but rather how much better it does than a Null model.

With the example data above, a good Null Model might be the mean value of the observed data.

null <- rmse(1:10, mean(1:10))
null
## [1] 2.872281

A function that computes the RMSE relative to a Null model could look like this.

rmsenull <- function(obs, prd, na.rm=FALSE) {
    # RMSE of the model
    r <- rmse(obs, prd, na.rm=na.rm)
    # RMSE of the Null model (predicting the mean for all cases)
    null <- rmse(obs, mean(obs, na.rm=na.rm), na.rm=na.rm)
    (null - r) / null
}
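In other words, rmsenull computes \(1 - \mathrm{RMSE}_{model} / \mathrm{RMSE}_{null}\): a perfect model scores 1, a model that does no better than the Null model scores 0, and a model that does worse than the Null model gets a negative score.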

Let’s use it.

rmsenull(1:10, 1:10)
## [1] 1
rmsenull(1:10, 2:11)
## [1] 0.6518447
rmsenull(1:10, 10:1)
## [1] -1

A perfect model explains 100% of the RMSE of the Null model. In the example above, the prediction 2:11 explains 65% of the RMSE of the Null model. That is very good! The prediction 10:1 is worse than the Null model.

Compare models

Now we can use our rmsenull function with our two models, and compare their performance.

For the linear regression model

plr <- predict(mlr, data.frame(x=x))
rmsenull(y, plr)
## [1] 0.7103717

And for the spline model

psf <- sf(x)
rmsenull(y, psf)
## [1] 1

By this measure, the spline model is perfect. The linear regression model might still be fit for purpose, but it does not look as good as the spline model.

However, we made a major mistake in our model evaluation procedure. We should not evaluate a model with the same data that we used to build it, because if we do, we are guaranteed to select models that overfit. To illustrate this, let’s split the data into two samples (this is not generally the best practice; see the section on cross-validation).

n <- length(x)
set.seed(321)
i <- sample(n, 0.5 * n)
i
##  [1] 18 13 16 17  4 15 11  9 12  2
xa <- x[i]
ya <- y[i]
xb <- x[-i]
yb <- y[-i]

Now train (fit) and test (evaluate) the models again. First use sample “a” to train and sample “b” to test.

mlr_a <- lm(ya~xa)
sf_a <- splinefun(xa, ya)
plr_a <- predict(mlr_a, data.frame(xa=xb))
psf_a <- sf_a(xb)
rmsenull(yb, plr_a)
## [1] 0.6423837
rmsenull(yb, psf_a)
## [1] -0.929826

Now use sample “a” to test and sample “b” to train.

mlr_b <- lm(yb~xb)
sf_b <- splinefun(xb, yb)
plr_b <- predict(mlr_b, data.frame(xb=xa))
psf_b <- sf_b(xa)
rmsenull(ya, psf_b)
## [1] -2.512748
rmsenull(ya, plr_b)
## [1] 0.7028963

In both cases the linear regression model performs much better. The mean RMSE, relative to the Null model, for the linear regression model is (0.64 + 0.7) / 2 = 0.67, much better than for the spline model: (-0.93 + -2.51) / 2 = -1.72. Also note that the relative RMSE for the linear regression model is similar to the value computed with the training data when the model was fit with all data (0.71). That is also easy to see visually.
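You can also compute these means directly from the values returned above (output not shown; it matches the 0.67 and -1.72 computed by hand):

# mean relative RMSE over the two train/test splits
mean(c(rmsenull(yb, plr_a), rmsenull(ya, plr_b)))   # linear regression
mean(c(rmsenull(yb, psf_a), rmsenull(ya, psf_b)))   # spline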

#set up the plot
plot(x, y, las=1, cex=0, xlim=c(0,80), ylim=c(3,15))
# original models
abline(mlr, col="red", lwd=2, lty=3)
lines(px, py, col="blue", lwd=2, lty=3)
# sample A models
abline(mlr_a, lwd=2, col="red")
lines(px, sf_a(px), lwd=2, col="blue")
# sample B models
abline(mlr_b, lwd=2, col="red", lty=2)
lines(px, sf_b(px), lwd=2, col="blue", lty=2)
# sample A and B points
points(xa, ya, cex=1.5)
points(xb, yb, cex=1.5, pch=19)
# a complex legend
legend("bottomright", pch=c(1,19,rep(NA, 10)), lty=c(NA, NA, rep(c(NA, NA,1,2,3),2)), lwd=2, , pt.cex=1.5, bg="white",
    legend=c("Sample A", "Sample B", "", "Linear regression", "sample a", "sample b", "all data", "", "Spline model", "sample a", "sample b", "all data"),
    col=c("black", "black", "white", "white", rep("red", 3), "white", "white", rep("blue", 3))
)

[Figure: linear regression (red) and spline (blue) models fit to all data and to samples A and B]

The spline model fit the training data perfectly, but that made it predict spectacularly badly for some of the testing data. The linear regression model is not very flexible, so it did not change much when it only got a subset of the data. This is an important feature: the linear model has low variance. That is, the estimated model does not change much when it is estimated with slightly different input data (do not confuse this use of the word “variance” with how it can be used to describe variability in a sample of quantitative observations). This is very important, as a model only has value if it has some generality, not if it just fits a particular dataset.

Note that these RMSE values essentially describe an interpolation problem. When x is between 15 and 40, the spline model is much worse than the linear regression model, but the three models (all data, sample A, and sample B) are somewhat similar. However, if we were to use these models for extrapolation (in practical terms, think predicting to a different region, group of people, or time period), the results would be dramatically unreliable, as the different spline models show extreme, but opposite, responses. This does not mean that the linear model is right: we have no data for x < 8 or x > 48.
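You can check this by predicting at an x value outside the observed range; the exact numbers depend on the fitted models (so no output is shown here), but the spline predictions can diverge wildly. The choice of newx = 60 is arbitrary.

# predict well beyond the largest observed x (about 47)
newx <- 60
predict(mlr, data.frame(x=newx))
sf(newx)
sf_a(newx)
sf_b(newx)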

The true model

Well, in this case, we actually do have the true model. The data for x and y were sampled (and then rounded to 1 decimal) from a known function; we do not have such functions for the real world. But we can use functions like this to generate data to learn about modeling methods. We used the function f below to generate the data used in this section.

f <- function(x) x/10 + sin(x) + sqrt(x)
X <- seq(1,50,0.1)
plot(X, f(X), type="l")

[Figure: the true function f plotted over x from 1 to 50]

And the following procedure to create the sample data.

set.seed(2)
sx <- sample(X, 20)
sy <- f(sx)
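The rounding step mentioned above is not shown; it would have been something like this (an assumption, as only the sampling code is given).

# round the simulated response to 1 decimal (assumed step, see text above)
sy <- round(sy, 1)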

So, the linear model clearly was wrong. But, then, all models are wrong. Some models are more useful than others, and the linear model might be close enough to be fit for purpose.

Can flexible models be good?

So is a single-variable linear regression always better than a spline model? No. It depends on how linear the relationship between x and y is, and it also depends on the sample size. The more complex the true relationship, and the larger the sample size, the better the spline might do. Here is a simple example where the spline model does better.

The true model:

f <- function(x) {x - x^2 + x^3*sin(x)}
X <- seq(1,50,0.1)
Y <- f(X)
plot(X, Y, type="l")

[Figure: the second true function f plotted over x from 1 to 50]

Two samples from the true model. Here we do not add error to the data, but that is often done to simulate the effect of measurement error. It is easy to do with a function like runif or rnorm (see the sketch after the next code block).

set.seed(2)
xa <- sample(X, 40)
ya <- f(xa)
xb <- sample(X, 40)
yb <- f(xb)
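For example, to simulate measurement error, you could add normally distributed noise to the response. The noise level used here (10% of the standard deviation of ya) is an arbitrary choice, and ya_noisy is not used further below.

# hypothetical: add Gaussian noise to the simulated response
ya_noisy <- ya + rnorm(length(ya), mean=0, sd=0.1*sd(ya))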

And compare the two models again

mlr <- lm(ya~xa)
sf <- splinefun(xa, ya)
rmsenull(yb, predict(mlr, data.frame(xa=xb)))
## [1] -0.0312567
rmsenull(yb, sf(xb))
## [1] 0.8082887
mlr <- lm(yb~xb)
sf <- splinefun(xb, yb)
rmsenull(ya, predict(mlr, data.frame(xb=xa)))
## [1] -0.04616483
rmsenull(ya, sf(xa))
## [1] 0.7672714

In both instances, the spline model RMSE is about five times smaller.

px <- seq(1, 50, 0.25)
py <- sf(px)
plot(X, Y, type="l", lwd=3, las=1)
points(xa, ya, cex=1.5)
points(xb, yb, pch=19, cex=1.5)
abline(mlr, col="red", lwd=2)
lines(px, py, col="blue", lwd=2)
legend("bottomleft", pch=c(1,19,rep(NA, 10)), lty=c(NA, NA, rep(c(NA, NA,1,2,3),2)), lwd=2, , pt.cex=1.5, bg="white",
    legend=c("Sample A", "Sample B", "", "Linear regression", "sample a", "sample b", "all data", "", "Spline model", "sample a", "sample b", "all data"),
    col=c("black", "black", "white", "white", rep("red", 3), "white", "white", rep("blue", 3))
)

[Figure: the true model with both samples, and the linear regression (red) and spline (blue) fits]

What have we learned?

We should note that these stylized examples have little to do with real data, where you might have a lot of noise and multiple observations of y for each x. Perhaps you can look at that in a simulation of your own? But we hope that this section helped illustrate (a) the risk of overfitting; and (b) that models should be evaluated with data that was not used to train the model.

There is much more to say about linear regression. In practice you often have many predictor variables. Or you can create new predictor variables by computing squares and cubes, or interactions. For example, you can do

mlr <- lm(ya ~ xa+I(xa^2)+I(xa^3))
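The same model can be written more compactly with the poly function (raw=TRUE requests plain powers rather than orthogonal polynomials).

# equivalent third-degree polynomial model
mlr <- lm(ya ~ poly(xa, 3, raw=TRUE))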

With each parameter you add flexibility to the model. That can lead to overfitting, and a loss of interpretability. See the sections on stepwise model selection, and on lasso and ridge regression in ISLR for detailed discussion.

Like in lasso and ridge regression, spline models can be constrained (made less flexible) through some form of regularization. In these “smoothing spline” models, there is a penalty function that reduces the amount of flexibility.
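R has a built-in implementation of such models in the smooth.spline function. Here is a minimal sketch using the x and y data from the start of this section; the plotting details are just one way to show the result.

# fit a smoothing spline; the penalty is chosen by (generalized) cross-validation
ss <- smooth.spline(x, y)
# predict at regularly spaced points and plot
ps <- predict(ss, px)
plot(x, y, pch=20, cex=1.5, las=1)
lines(ps$x, ps$y, col="purple", lwd=2)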

Complex models

There are various algorithms that can be used to make relatively complex models for prediction with large data sets with many variables, of which several may contribute to the model, perhaps with complex interactions, and likely with noisy data. Popular examples of such (“machine learning”) algorithms include Random Forest, Support Vector Machines (SVM), and Artificial Neural Networks (ANN).

The next chapter focuses on the Random Forest algorithm as an example.

Citation

Hijmans, R.J., 2019. Statistical modeling. In: Hijmans, R.J. and J. Chamberlin. Regional Agronomy: a practical handbook. CIMMYT. https://reagro.org/tools/statistical/