Yield Prediction¶
Introduction¶
In the previous chapters we worked on identifying rice growing areas, and on computing vegetation metrics that can be used to predict rice yield. In this chapter we explore that relationship. The goal is to later use it to create an insurance product.
Yield data¶
We use a pre-prepared dataset with rice yield data that was collected through a farm survey. The numbers are based on the recollection of the farmers.
library(agrodata)
d <- data_rice("yield")
head(d, n=3)
## region zone fid year y
## 1401 Ndungu Irr Scheme Ndungu N 4 2003 40
## 1403 Ndungu Irr Scheme Ndungu N 4 2005 1296
## 1404 Ndungu Irr Scheme Ndungu N 4 2006 1296
We have rice yields for individual farmers (“fid”), by year.
The number of farmers is
length(unique(d$fid))
## [1] 309
There are no missing values in this data set, so we can tabulate the number of observations by year like this:
table(d$year)
##
## 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 78 91 116 117 140 181 232 277 299 303
Farmers are grouped into zones
unique(d$zone)
## [1] "Ndungu N" "Ndungu S" "Maore N"
## [4] "Maore SE" "Maore SW" "Ndungu E"
## [7] "Ndungu W" "Sounthern Plain N" "Sounthern Plain S"
## [10] "Sounthern Plain W"
We can look at the distribution of rice yield values (they are expressed in kg/ha)
boxplot(d$y)
Some of the crop yields are extremely low. Here are the lowest 40 observations
sort(d$y)[1:40]
## [1] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
## [8] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
## [15] 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000 0.00000
## [22] 0.00000 0.00000 0.00000 0.00000 0.00000 4.00000 40.00000
## [29] 40.00000 40.00000 48.00000 66.66667 66.66667 66.66667 70.00000
## [36] 80.00000 80.00000 80.00000 80.00000 100.00000
We may want to check whether these very low values are reasonable, or whether they are errors.
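For example, we can check in which zones and years the zero yields were reported (a quick sketch; zeros is a name introduced here):

# inspect the zero yields
zeros <- d[d$y == 0, ]
table(zeros$zone, zeros$year)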
Aggregation¶
Regression models between yield and vegetation metrics may work better if we use average data for zones, rather than for individual farmers or fields; and that is what we do here.
Also, instead of the average yield, we could consider using the yield deviation, that is, the difference between the yield in a given year (for a farmer or for a zone) and the expected yield.
The expected yield would be the long-term average (mean or median) yield. The median could be a good choice: if there are outliers, it may better represent a “typical” (expected) yield. A benefit of this approach is that it accounts for inherent differences between zones. That is, some zones will have higher average yields than others; but they may be very similar in relative terms. We could also do this for the vegetation metrics.
So let’s create these aggregated variables for zones.
First the average (here, the median) yield for each zone/year combination:
yzt <- aggregate(d$y, d[, c("zone", "year")], median, na.rm=TRUE)
colnames(yzt)[3] = "y_zt"
head(yzt, 3)
## zone year y_zt
## 1 Maore N 2003 1360
## 2 Maore SE 2003 1400
## 3 Maore SW 2003 1230
And the long-term average (median) yield for each zone, over the years:
yz <- aggregate(d$y, d[, "zone", drop=FALSE], median, na.rm=TRUE)
colnames(yz)[2] = "y_z"
head(yz, 3)
## zone y_z
## 1 Maore N 1600
## 2 Maore SE 1440
## 3 Maore SW 1440
We can also use aggregate to compute the number of farmers per zone per year (note that this would not give the correct result if there were missing values, as all records are counted, missing or not).
n <- aggregate(d$fid, d[,c("zone", "year")], length)
colnames(n)[3] = "n"
head(n, 3)
## zone year n
## 1 Maore N 2003 13
## 2 Maore SE 2003 4
## 3 Maore SW 2003 10
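If the data could have missing values, it would be safer to count only the non-missing yield records, for example like this:

# count records with a non-missing yield
n2 <- aggregate(d$y, d[, c("zone", "year")], function(x) sum(!is.na(x)))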
Merge the zone level aggregates.
z <- merge(n, yzt)
z <- merge(z, yz)
head(z, 3)
## zone year n y_zt y_z
## 1 Maore N 2003 13 1360 1600
## 2 Maore N 2004 18 1580 1600
## 3 Maore N 2005 22 1470 1600
The zone-level relative yield can be computed like this:
z$y_dz <- z$y_zt / z$y_z
head(z, 3)
## zone year n y_zt y_z y_dz
## 1 Maore N 2003 13 1360 1600 0.85000
## 2 Maore N 2004 18 1580 1600 0.98750
## 3 Maore N 2005 22 1470 1600 0.91875
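In the same way, we could compute a relative yield for each individual farmer. Here is a sketch (yf, y_f, and y_df are names introduced for this illustration; we do not use them further):

# long-term median yield for each farmer
yf <- aggregate(d$y, d[, "fid", drop=FALSE], median, na.rm=TRUE)
colnames(yf)[2] <- "y_f"
df <- merge(d, yf)
# a farmer's yield in a year, relative to their long-term median yield
df$y_df <- df$y / df$y_f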
Here are the variables we now have (in data.frame d for the farmer level, and z for the zone level):
| variable | description |
|---|---|
| region | Region |
| zone | zone name |
| year | year |
| fid | farmer ID |
| y | reported yield for a farmer (i) in a year (t) |
| y_zt | median zone yield in a year (t) |
| y_z | long-term median zone yield |
| n | number of farmers observed in a zone in a year (t) |
| y_dz | relative zone yield in a year (t), that is, y_zt / y_z |
We save the zonal yield data for future use.
saveRDS(z, "zonal_rice_yield.rds")
We also merge the individual level data with the zone level data for use in the next chapter.
dz <- merge(d, z)
saveRDS(dz, "hh_rice_yield.rds")
Vegetation metrics¶
First read the pre-computed vegetation metrics.
idx <- data_rice("indices")
head(idx, n=3)
## region zone year ndvi evi gpp et lai
## 1 Northern_zone Maore N 2003 1.420657 0.8657794 -696.7520 16.15727 39.73491
## 2 Northern_zone Maore N 2004 3.131069 1.9807431 630.2755 13.13894 115.83244
## 3 Northern_zone Maore N 2005 2.469837 1.9425927 -266.2523 24.80350 -53.02961
## rain
## 1 689.2841
## 2 629.2459
## 3 543.0463
We have five different metrics computed for the growing season from MODIS data, plus seasonal rainfall. They are all aggregated by zone.
| variable | description |
|---|---|
| region | Region |
| zone | zone name |
| year | year |
| ndvi | NDVI index |
| evi | EVI index |
| gpp | Gross Primary Productivity index |
| et | evapo-transpiration index |
| lai | leaf area index |
| rain | seasonal rainfall |
The values have been transformed, and some may look a little odd (for example, they can be negative) — we are not going into these details here; let’s just accept them as they are.
We can merge the zonal yield and satellite index data.
z <- merge(z, idx[,-1])
head(z, 3)
## zone year n y_zt y_z y_dz ndvi evi gpp et
## 1 Maore N 2003 13 1360 1600 0.85000 1.420657 0.8657794 -696.7520 16.15727
## 2 Maore N 2004 18 1580 1600 0.98750 3.131069 1.9807431 630.2755 13.13894
## 3 Maore N 2005 22 1470 1600 0.91875 2.469837 1.9425927 -266.2523 24.80350
## lai rain
## 1 39.73491 689.2841
## 2 115.83244 629.2459
## 3 -53.02961 543.0463
Explore¶
Let’s do a quick exploration of the combined data to see whether there are any associations. Here are the correlation coefficients:
cr <- cor(z[,-c(1:3, 5)])
cr[,1:2]
## y_zt y_dz
## y_zt 1.00000000 0.892855126
## y_dz 0.89285513 1.000000000
## ndvi 0.40445628 0.379001496
## evi 0.41859193 0.389936563
## gpp 0.43046794 0.404100931
## et 0.27167524 0.254701878
## lai -0.01201251 -0.036259112
## rain 0.02558946 0.002945276
These suggest that the strongest relationships are between y_zt and gpp and evi, and also with ndvi.
Let’s plot the first two.
par(mfrow=c(1,2))
plot(z$gpp, z$y_zt, xlab="GPP index", ylab="Yield (kg/ha)")
plot(z$evi, z$y_zt, xlab="EVI index", ylab="Yield (kg/ha)")
Regression¶
Now on to the creation of regression models. There are many different models that could be formulated. We try a few, and compare them. Feel free to try other models.
First, two single-variable models. We start with zonal yield as a function of the gpp index.
m1 <- lm (y_zt ~ gpp, data=z)
cf <- coefficients(m1)
cf
## (Intercept) gpp
## 1516.4412306 0.1106993
plot(y_zt ~ gpp, data=z)
abline(m1, col="red")
summary(m1)
##
## Call:
## lm(formula = y_zt ~ gpp, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -617.33 -157.15 -18.93 115.10 1424.88
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.516e+03 2.665e+01 56.904 < 2e-16 ***
## gpp 1.107e-01 2.345e-02 4.721 7.81e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 265.3 on 98 degrees of freedom
## Multiple R-squared: 0.1853, Adjusted R-squared: 0.177
## F-statistic: 22.29 on 1 and 98 DF, p-value: 7.811e-06
Have a good look at what is returned by summary. First it shows the model that we formulated. Then it shows the distribution of the “residuals”, that is, the differences between the observed values of y_zt and what the model predicts (the red line). You can get the residuals with residuals(m1).
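To see this for yourself, you can check that the residuals equal the observed values minus the fitted values:

# the residuals are the observed values minus the fitted (predicted) values
all.equal(residuals(m1), z$y_zt - fitted(m1))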
Below that, the regression coefficients (under “Estimate”) are shown. There is an intercept of 1516.4 (that is the prediction when gpp=0), and a slope for gpp of 0.111. The good news is that the slope is positive (we expect that a higher gpp index is associated with higher yields). You can get the coefficients with coefficients(m1).
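We can use these coefficients to make a prediction “by hand”. For example, for a (made-up) gpp index value of 10000:

# predicted yield = intercept + slope * gpp
cf[1] + cf[2] * 10000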
The standard error for each estimated coefficient is also shown. The t value is used to compute statistical significance. The p-values (the probability of finding, by chance, a t value larger than the one we found) are very low, and the coefficients are highly significant. In this context this means that it is very unlikely that they are actually zero (no intercept, or no effect of gpp).
Let’s try another model: zonal yield as a function of the evi index.
m2 <- lm (y_zt ~ evi, data=z)
summary(m2)
##
## Call:
## lm(formula = y_zt ~ evi, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -607.49 -141.87 -4.27 126.25 1337.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1437.86 33.22 43.277 < 2e-16 ***
## evi 88.46 19.39 4.563 1.46e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 267 on 98 degrees of freedom
## Multiple R-squared: 0.1752, Adjusted R-squared: 0.1668
## F-statistic: 20.82 on 1 and 98 DF, p-value: 1.464e-05
The diagnostics for both models (as shown by summary) are very similar: highly statistically significant, but with a low R^2. Although p-values carry information, we do not care that much about significance here, as we are not testing a hypothesis. We do care about the R^2, as that is a measure of how well the regression model fits the data; for simple models it is also a good indicator of how well the model can make predictions.
Now on to (just a little) more complex models. We can make a multiple regression model that explains y_zt from both gpp and evi.
m3 <- lm (y_zt ~ gpp + evi, data=z)
summary(m3)
##
## Call:
## lm(formula = y_zt ~ gpp + evi, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -620.49 -146.08 -11.51 105.40 1390.32
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.478e+03 4.131e+01 35.790 <2e-16 ***
## gpp 6.867e-02 4.211e-02 1.631 0.106
## evi 4.154e+01 3.461e+01 1.200 0.233
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 264.8 on 97 degrees of freedom
## Multiple R-squared: 0.1972, Adjusted R-squared: 0.1807
## F-statistic: 11.92 on 2 and 97 DF, p-value: 2.359e-05
We can add quadratic or interaction terms. Here we use a quadratic term for gpp. There is also a “zone” effect.
m4 <- lm (y_zt ~ gpp + evi +I(gpp^2) + zone, data=z)
summary(m4)
##
## Call:
## lm(formula = y_zt ~ gpp + evi + I(gpp^2) + zone, data = z)
##
## Residuals:
## Min 1Q Median 3Q Max
## -573.13 -104.25 9.65 97.45 1316.10
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.390e+03 1.057e+02 13.141 <2e-16 ***
## gpp 6.311e-02 4.359e-02 1.448 0.1513
## evi 6.437e+01 4.339e+01 1.484 0.1415
## I(gpp^2) -1.212e-05 1.541e-05 -0.787 0.4335
## zoneMaore SE 4.805e+01 1.173e+02 0.410 0.6831
## zoneMaore SW -9.987e+01 1.155e+02 -0.865 0.3896
## zoneNdungu E 2.158e+02 1.278e+02 1.688 0.0949 .
## zoneNdungu N 2.975e+02 1.161e+02 2.562 0.0121 *
## zoneNdungu S 6.102e+01 1.204e+02 0.507 0.6134
## zoneNdungu W 1.367e+02 1.138e+02 1.201 0.2330
## zoneSounthern Plain N 2.750e+01 1.299e+02 0.212 0.8328
## zoneSounthern Plain S 4.883e+01 1.280e+02 0.381 0.7038
## zoneSounthern Plain W 8.300e+01 1.274e+02 0.651 0.5165
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 253.9 on 87 degrees of freedom
## Multiple R-squared: 0.3378, Adjusted R-squared: 0.2464
## F-statistic: 3.698 on 12 and 87 DF, p-value: 0.0001642
The “zone” parameter allows each zone to have its own intercept. Note that the R^2 is higher now, but that only the base intercept is highly significant.
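For example, the intercept for zone “Ndungu N” is the base intercept plus that zone’s coefficient:

cf4 <- coefficients(m4)
# intercept for zone "Ndungu N" (the base intercept plus the zone effect)
cf4["(Intercept)"] + cf4["zoneNdungu N"]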
We can also try something entirely different, such as a Random Forest model.
library(randomForest)
rf <- randomForest(y_zt ~ gpp+evi+ndvi+et, data=z)
rf
##
## Call:
## randomForest(formula = y_zt ~ gpp + evi + ndvi + et, data = z)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 73715.23
## % Var explained: 12.96
We see that the fit of the Random Forest model is not great. There is no machine learning magic here; apparently there are no complex interactions to discover in this dataset.
Model comparison¶
We have made four models. Which one is the best? We can compare R^2, adjusted R^2 and AIC.
models <- list(m1, m2, m3, m4)
r2 <- sapply(models, function(i) round(summary(i)$r.squared, 2))
ar2 <- sapply(models, function(i) round(summary(i)$adj.r.squared, 2))
aic <- sapply(models, AIC)
r <- rbind(r2=r2, ar2=ar2, aic=round(aic, 2))
colnames(r) <- c("m1", "m2", "m3", "m4")
r
## m1 m2 m3 m4
## r2 0.19 0.18 0.20 0.34
## ar2 0.18 0.17 0.18 0.25
## aic 1403.97 1405.20 1404.50 1405.25
The R^2 values are between 0.18 and 0.34. The problem with using R^2 on the model training data is that it will always get higher as you add parameters to the model. It is thus not surprising that the most complex model has the highest R^2.
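To illustrate this, we can add a meaningless predictor to m3. Below, noise is a made-up random variable; the R^2 cannot go down when it is added, and will almost surely go up a little:

# copy the data and add a random variable with no predictive value
zz <- z
set.seed(1)
zz$noise <- runif(nrow(zz))
# compare this R^2 with that of m3 (0.1972)
summary(lm(y_zt ~ gpp + evi + noise, data=zz))$r.squared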
The Adjusted R^2 corrects for this problem: for each additional parameter in the model, the R^2 is penalized. Thus, for it to go up, the benefit of a new variable must outweigh its cost (because of this the adjusted R^2 can even be negative, which may seem very odd for a squared quantity). We see that for the complex model (m4) the adjusted R^2 is much lower than the R^2, whereas for the simpler models the two are nearly the same.
AIC is a measure that can be used to find the most parsimonious model; that is, the simplest model that describes the data well. Lower AIC values are better.
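For a fitted model, AIC is computed as -2 times the log-likelihood plus 2 times the number of estimated parameters. We can verify this for m1:

# AIC = -2 * log-likelihood + 2 * number of parameters
ll <- logLik(m1)
-2 * as.numeric(ll) + 2 * attr(ll, "df")
# the same value is returned by
AIC(m1)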
Cross-validation¶
We need a function that returns the Root Mean Squared Error (or another statistic of interest). Note that this function only evaluates observations with yields below 1500 kg/ha; for designing an insurance product, what matters most is how well the lower yields are predicted.
rmse <- function(observed, predicted){
    # only consider the lower yields (below 1500 kg/ha)
    i <- observed < 1500
    error <- observed[i] - predicted[i]
    sqrt(mean(error^2))
}
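For example, the RMSE of m1 on the data it was fitted to:

# in-sample RMSE of m1 (evaluated for the low-yield observations only)
rmse(z$y_zt, predict(m1))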
Now set up the “k-folds” and the output structure.
library(agro)
k <- 5
set.seed(123)
f <- agro::make_groups(z, k)
result <- matrix(nrow=k, ncol=5)
colnames(result) <- c("m1", "m2", "m3", "m4", "rf")
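If you cannot install the agro package, a minimal substitute would be to assign the records to k random groups yourself. This sketch assumes that is all make_groups does; the package function may differ in detail:

# hypothetical stand-in for agro::make_groups: random assignment to k groups
f <- sample(rep(1:k, length.out=nrow(z)))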
And run the actual cross-validation.
for (i in 1:k) {
    # split the data into training and test sets
    train <- z[f!=i, ]
    test <- z[f==i, ]
    # fit the models on the training data
    cm1 <- lm(y_zt ~ gpp, data=train)
    cm2 <- lm(y_zt ~ evi, data=train)
    cm3 <- lm(y_zt ~ gpp + evi, data=train)
    cm4 <- lm(y_zt ~ gpp + evi + I(gpp^2) + zone, data=train)
    # predict for the test data, and compute the RMSE
    p <- predict(cm1, test)
    result[i,"m1"] <- rmse(test$y_zt, p)
    result[i,"m2"] <- rmse(test$y_zt, predict(cm2, test))
    result[i,"m3"] <- rmse(test$y_zt, predict(cm3, test))
    result[i,"m4"] <- rmse(test$y_zt, predict(cm4, test))
    crf <- randomForest(y_zt ~ gpp+evi+ndvi+et, data=train)
    result[i,"rf"] <- rmse(test$y_zt, predict(crf, test))
}
colMeans(result)
## m1 m2 m3 m4 rf
## 213.6315 229.9579 225.4810 271.2868 243.1905
Note that m1 comes out as the best model (lowest RMSE), but m2 and m3 are very close. Of these three models, we should probably prefer m1 for the contract design. m4 looked the best before cross-validation, but these results suggest that m4 is overfitted and should not be used.
We can make a plot of observed vs predicted values for the models. Here for m1:
plot(z$y_zt, predict(m1), xlim=c(750, 3000), ylim=c(750, 3000), xlab="observed yield (kg/ha)", ylab="predicted yield (kg/ha)")
abline(a=0,b=1, col="red")
Finally, we save the regression models for later use:
mods <- list(m1=m1, m2=m2, m3=m3, m4=m4)
saveRDS(mods, "rice_models.rds")