Sweetpotato Cultivar Selection¶
Introduction¶
This case study looks at data from a study on sweetpotato cultivars in Mozambique. Specifically, we will demonstrate methodology used in the article published in Genotype x Environment interaction and selection for drought adaptation in sweetpotato in Mozambique. In this study, 58 cultivars were evaluated over three years, looking at performance under both full irrigation and limited irrigation. Methods used include analysis of variance, regression, and additive main effects multiplicative interaction (AMMI), which will be shown in this case study.
The data used is available on the CIP datavarse. Unfortunately, the dataset is missing data from the 2009 study, so we will use the same methodology with only 2006 and 2008 data. Results will differ from those reported in the paper.
Download¶
library(agro)
ff <- get_data_from_uri("https://data.cipotato.org/dataset.xhtml?persistentId=doi:10.21223/P3/M0HGJ4", ".")
ff
## [1] "./doi_10.21223_P3_M0HGJ4/Cluster_Analysis_Drought.xlsx"
## [2] "./doi_10.21223_P3_M0HGJ4/Data_Dictionary_Drought.xlsx"
## [3] "./doi_10.21223_P3_M0HGJ4/Drought_06_08.xls"
Read¶
The dataset consisits of three files, including the data itself,
information on the clones, and a data dictionary. Two of the files are
“.tab”, and the third is an excel file. Below, we read in each of the
files in the dataset. To read the excel file we use the readxl
package.
Now that we have the files, we can read them into R.
cluster <- read.delim(ff[1], stringsAsFactors=FALSE)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## line 1 appears to contain embedded nulls
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## './doi_10.21223_P3_M0HGJ4/Cluster_Analysis_Drought.xlsx'
## Error in type.convert.default(data[[i]], as.is = as.is[i], dec = dec, : invalid multibyte string at '<ef><ef><d3>'
dict <- read.delim(ff[2], stringsAsFactors=FALSE)
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## line 1 appears to contain embedded nulls
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## line 5 appears to contain embedded nulls
## Warning in read.table(file = file, header = header, sep = sep, quote = quote, :
## incomplete final line found by readTableHeader on
## './doi_10.21223_P3_M0HGJ4/Data_Dictionary_Drought.xlsx'
## Error in read.table(file = file, header = header, sep = sep, quote = quote, : more columns than column names
library(readxl)
drought <- read_excel(ff[3])
We need to convert some our variables of interest to numeric. The reason
is that some records for these variable have the odd values “*”. I first
set these to NA
to avoid the warning NAs introduced by coercion
.
You need to treat warnings like errors, never ignore them, and make them
go away if you can.
drought$RYTHA[drought$RYTHA == "*"] <- NA
drought$FYTHA[drought$FYTHA == "*"] <- NA
drought$BIOM[drought$BIOM == "*"] <- NA
Now the conversion
drought$RYTHA <- as.numeric(drought$RYTHA)
drought$FYTHA <- as.numeric(drought$FYTHA)
drought$BIOM <- as.numeric(drought$BIOM)
Remove NAs
drought <- drought[!is.na(drought$RYTHA),]
Summary Statistics¶
First, we will plot some graphs to demonstrate basic summary statistics of the data. We will use the package ‘ggplot’, which has more flexibility in the plots produced. We begin by adding an important variable to the data, environment (a combination of year and treatment), then plot a histogram and boxplot.
To create a variable environment
we combine the year and treatment
(irrigated/non-irrigated)
drought$env <- as.factor(paste(drought$YEAR, drought$TREATMENT))
First, we will represent the data with a histogram, using the argument ‘geom_histogram’. This shows us the differences in distribution of observations in the four environments of interest. We can see that in both non-irrigated environments, the majority of observations fall in the smallest bin, while this is not the case for the irrigated environments.
library(ggplot2)
ggplot(drought, aes(x=RYTHA)) +
geom_histogram(bins=10) +
facet_wrap(~drought$env, nrow=1)
Next, we can represent the data with a box plot. Here we graph the distribution of the two treatments, irrigated and non-irrigated. From initial observation, it appears that the irrigated plots had higher and less variable root yields than the non-irrigated plots.
ggplot(drought, aes(y=RYTHA, x=TREATMENT, fill = REP)) +
geom_boxplot() +
theme_classic()
## Warning: The following aesthetics were dropped during statistical transformation: fill
## ℹ This can happen when ggplot fails to infer the correct grouping structure in the data.
## ℹ Did you forget to specify a `group` aesthetic or to convert a numerical variable into a factor?
The article makes a distinction between farmer landraces and bred/modern cultivars. In the dataframe “cluster”, we can split the cultivars between farmer landraces and modern varieties, following the distinction on table 5 of the paper. Here we show boxplots for four variables of interest: dry matter (DM), total biomass (BIOM), harvest index (HI), and geometric mean (GM). For more information on these variables, see the table ‘dict’. We also use the package ‘gridExtra’ to plot all the graphs at once.
#New variable for type of cultivar
cluster$type <- as.numeric(row.names(cluster))
## Error in eval(expr, envir, enclos): object 'cluster' not found
#The first 40 cultivars are landrace, the rest are modern (refer to table 5)
cluster$type <- ifelse(cluster$type <41, "landrace", "modern")
## Error in eval(expr, envir, enclos): object 'cluster' not found
#Remove x-asis title (on first two) and legend (on all) to avoid repetition
p1 <- ggplot(cluster, aes(y=DM, x=type, color=type)) +
geom_boxplot()+
theme(legend.position="none", axis.title.x=element_blank())
## Error in eval(expr, envir, enclos): object 'cluster' not found
p2 <- ggplot(cluster, aes(y=BIOM, x=type, color=type)) +
geom_boxplot()+
theme(legend.position="none", axis.title.x=element_blank())
## Error in eval(expr, envir, enclos): object 'cluster' not found
p3 <- ggplot(cluster, aes(y=HI, x=type, color=type)) +
geom_boxplot()+
theme(legend.position="none")
## Error in eval(expr, envir, enclos): object 'cluster' not found
p4 <- ggplot(cluster, aes(y=GM, x=type, color=type)) +
geom_boxplot()+
theme(legend.position="none")
## Error in eval(expr, envir, enclos): object 'cluster' not found
g_legend <- function(a.gplot) {
tpm <- ggplot_gtable(ggpl)
}
library(gridExtra)
grid.arrange(p1, p2, p3, p4, nrow=2)
## Error in eval(expr, envir, enclos): object 'p1' not found
Cluster Analysis¶
The authors use cluser analysis to show linkages between the different cultivars, based on many of the variables of interest: geometric yield, drought sensitivity index, drought tolerance expression, percent reduction, harvest index, and root dry matter. Below, we do the same using the function ‘hclust’.
row.names(cluster) <- cluster$Cultivar
## Error in eval(expr, envir, enclos): object 'cluster' not found
j <- hclust(dist(cluster[,c(2,3,5,7,8)]), method="average")
## Error in eval(expr, envir, enclos): object 'cluster' not found
plot(j, xlab="Cultivar", hang= -1, cex = 0.6)
## Error in eval(expr, envir, enclos): object 'j' not found
Analysis of Variance¶
The performance of cultivars was assessed with analysis of variance. The three traits of interest were storage root yield, vine yield, and biomass, and were observed across variation in environment (E), genotype (G), and G x E. While the study reports 5 degrees of freedom for the environment, we will only have 3 since we are missing data from 2009. We will report the mean square, stability variance, and relative variance as shown in Table 3 of the paper. In this section, we will use the function ‘aov’ for analysis of variance.
#Storage Root Yield
RYTHA.anova <- aov(RYTHA ~ env + CULTIVAR + CULTIVAR*env, data=drought)
summary(RYTHA.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## env 3 2200 733.2 69.486 <2e-16 ***
## CULTIVAR 57 6784 119.0 11.280 <2e-16 ***
## env:CULTIVAR 170 5048 29.7 2.814 <2e-16 ***
## Residuals 427 4506 10.6
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Vine Yield
FYTHA.anova <- aov(FYTHA ~ env + CULTIVAR + CULTIVAR*env, data=drought)
summary(FYTHA.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## env 3 27944 9315 90.332 < 2e-16 ***
## CULTIVAR 57 68062 1194 11.580 < 2e-16 ***
## env:CULTIVAR 170 31036 183 1.771 1.86e-06 ***
## Residuals 425 43824 103
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
#Biomass
BIOM.anova <- aov(BIOM ~ env + CULTIVAR + CULTIVAR*env, data=drought)
summary(BIOM.anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## env 3 43828 14609 113.66 < 2e-16 ***
## CULTIVAR 57 76124 1336 10.39 < 2e-16 ***
## env:CULTIVAR 170 40432 238 1.85 2.98e-07 ***
## Residuals 425 54630 129
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 2 observations deleted due to missingness
Regression¶
Next, we replicate table 4, which uses linear regression and takes into account interactions between environment and genotypes.
#create dummy variables for the different environments
drought$IR6 <- ifelse(drought$env=="2006 Irrigation", 1, 0)
drought$NI6 <- ifelse(drought$env=="2006 Non_Irrigation", 1, 0)
drought$IR8 <- ifelse(drought$env=="2008 Irrigation", 1, 0)
drought$NI8 <- ifelse(drought$env=="2008 Non_Irrigation", 1, 0)
#To see the effect of each environment on root yield, we can use the linear model with root yield as the y axis, and each environment as the x-axis.
linear <- lm(RYTHA ~IR6 + NI6 + IR8, data=drought)
summary(linear)
##
## Call:
## lm(formula = RYTHA ~ IR6 + NI6 + IR8, data = drought)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.938 -3.341 -1.464 1.994 32.868
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2562 0.3789 5.955 4.26e-09 ***
## IR6 3.2641 0.5539 5.893 6.09e-09 ***
## NI6 3.4561 0.5502 6.282 6.10e-10 ***
## IR8 4.8595 0.5359 9.069 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.998 on 654 degrees of freedom
## Multiple R-squared: 0.1187, Adjusted R-squared: 0.1146
## F-statistic: 29.35 on 3 and 654 DF, p-value: < 2.2e-16
#Because we have so many cultivars, we can make them into dummy variables with the following code:
cultivar.f = factor(drought$CULTIVAR)
dummies = model.matrix(~cultivar.f)
linear2 <- lm(drought$RYTHA ~ dummies)
#summary(linear2)
AMMI Model¶
The fourth step used in this article for the analysis of sweet potato
cultivars is the additive main effects and multiplicative interactions
model (AMMI). AMMI combines both ANOVA methodology (additive) and PCA
(multiplicative) to better understand how environments and genotypes
interact. Here, we will use the AMMI
function from the agricolae
package.
library(agricolae)
## Warning: package 'agricolae' was built under R version 4.3.1
#To make for easier viewing, we will use the codes rather than cultivar names, as charactes.
drought$CODE_CULTIVAR <- as.character(drought$CODE_CULTIVAR)
#the function AMMI specifies the environment variable, genotype, and repition, as well as the y-variable of interest.
ammi <- AMMI(ENV=drought$env, GEN=drought$CODE_CULTIVAR, REP=drought$REP, Y = drought$RYTHA, PC=T)
ammi$ANOVA
## Analysis of Variance Table
##
## Response: Y
## Df Sum Sq Mean Sq F value Pr(>F)
## ENV 3 2199.6 733.20 15.9863 0.0009673 ***
## REP(ENV) 8 366.9 45.86 4.6448 1.867e-05 ***
## GEN 57 6770.1 118.77 12.0284 < 2.2e-16 ***
## ENV:GEN 170 5063.1 29.78 3.0162 < 2.2e-16 ***
## Residuals 419 4137.4 9.87
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ammi$analysis
## percent acum Df Sum.Sq Mean.Sq F.value Pr.F
## PC1 58.9 58.9 59 3171.8441 53.760070 5.44 0
## PC2 37.2 96.1 57 2005.9770 35.192580 3.56 0
## PC3 3.9 100.0 55 207.6749 3.775907 0.38 1
We can graph the results from the ammi model on a bi-plot. Below we plot two biplots. The first shows the first and second principal componenets plotted against each other, and the second shows the first principal component plotted against the yield variable.
#The regular plot(ammi) will plot PC1 against PC2
plot(ammi)
#To graph the dependent variable against pc1, the following code is used. 0 is the y-dependent variable (RYTHA), and 1 is PC1.
plot(ammi, 0, 1, gcol="blue", ecol="green")