INTRODUCTION
After rice and maize, wheat (Triticum aestivum L.) is the third most important food crop in the world, and it is grown in a wide variety of environments. Wheat yield is affected by agronomic, phenological, physiological, and climatic factors, nutrient storage, crop management (e.g., fertilizer amount), infestation by pests and diseases, land management, and land conditions (Farokhzadeh et al., 2020). Soil properties such as texture, the percentage or concentration of nitrogen (N), potassium (K), and phosphorus (P), the percentage of organic matter, and electrical conductivity (EC) also affect wheat yield (Asseng et al., 2001; Takahashi & Anwar, 2007).
Many studies have examined the relationship between yield and related traits in order to identify the traits that affect the growth and development of wheat (Zhang et al., 2016b; Farokhzadeh et al., 2020). For instance, Yang et al. (2022) carried out a two-site, multi-cultivar test and, using principal component analysis (PCA), a structural equation model, and a partial least squares model, identified 11 key phenotypes that contributed most to grain yield: spike density, leaf area index, biomass, harvest index, net photosynthetic rate, leaf chlorophyll, canopy temperature, carboxylation efficiency, stomatal conductance, leaf nitrate reductase, and transpiration rate. Norouzi et al. (2010) applied artificial neural networks to predict dryland wheat yield in semi-arid, mountainous areas of western Iran and reported that the sediment transport index was the most significant topographic factor for wheat yield. Barikloo et al. (2017) evaluated the performance of a neuro-genetic hybrid model for predicting wheat yield from land characteristics. Their sensitivity analysis indicated that soil parameters such as available phosphorus, total nitrogen, gravel content, organic matter percentage, and soil reaction play the main role in determining wheat yield. They found that total soil organic matter and nitrogen had the highest and lowest correlations, respectively, with the quantity and quality of wheat yield. They also argued that some chemical and physical soil properties, such as nitrogen content, affect soil fertility and soil water storage, which are major factors in wheat yield.
Characterizing the traits that contribute most to yield diversity could provide the basis for developing high-yielding cultivars. Multivariate analyses, including PCA, stepwise regression, and machine learning, can both extract hidden patterns in the data and help identify notable traits (Farokhzadeh et al., 2021).
PCA is a common procedure for condensing a large set of correlated variables into a smaller number of readily interpretable axes of variation. It helps reveal the main structure of the data and yields a smaller number of uncorrelated variables. The objective of PCA is to capture the highest possible variance with the lowest possible number of components (Farokhzadeh et al., 2022).
In recent decades, various yield models, both linear and non-linear, have been used to model the relationships among variables. Yield modeling not only allows plant production to be predicted but also helps explain how yield is affected by environmental factors and yield components.
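As a minimal illustration of this idea (not the analysis reported in this paper), the sketch below runs PCA on a small synthetic trait matrix with NumPy. The data, the function name `pca`, and the choice to standardize each trait before decomposition are assumptions made for the example only.

```python
import numpy as np

def pca(X):
    """Return (scores, loadings, explained-variance ratios) for X (samples x traits)."""
    X = np.asarray(X, dtype=float)
    # Standardize so that traits measured in different units contribute equally
    # (an assumption of this sketch).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # SVD of the standardized data: the rows of Vt are the principal axes.
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)   # share of total variance per component
    scores = U * s                    # coordinates of each sample on the components
    return scores, Vt.T, explained

rng = np.random.default_rng(0)
# Synthetic example: 90 samples, 4 traits, the first two strongly correlated.
base = rng.normal(size=(90, 1))
X = np.hstack([
    base + 0.1 * rng.normal(size=(90, 1)),
    base + 0.1 * rng.normal(size=(90, 1)),
    rng.normal(size=(90, 2)),
])
scores, loadings, explained = pca(X)
print(np.round(explained, 3))
```

Because two of the four synthetic traits are nearly redundant, the first component absorbs roughly their combined variance, which is the condensing behaviour described above; the component scores are mutually uncorrelated.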
Partial least squares regression (PLSR) is generally accepted as one of the most efficient methods for building reliable predictive models. It has been used to predict the chemical composition of sunflower seed (Fassio & Gozzolino, 2003), predict maize grain yield from drought tolerance traits (Shaibu & Adnan, 2015), examine rice leaf growth and nitrogen level (Nguyen & Lee, 2006), determine the factors affecting rice yield and the yield of winter wheat fields (Zhang et al., 2020), and rank the factors affecting winter wheat yield (Hu et al., 2018). PLSR combines multiple linear regression and PCA to transform the data matrix efficiently and alleviate collinearity among the independent variables (Costa et al., 2012). The support vector machine (SVM) was first introduced by Vapnik et al. (1995) and has become a popular machine learning tool for both classification (SVM classification) and regression (support vector regression, SVR). SVR offers high flexibility regarding the distribution of the underlying variables, the relationship between the independent and dependent variables, and control of the penalty term (Hu et al., 2018; Zhang et al., 2020). Like PLSR, SVR has been used in crop research, including agricultural drought prediction (Tian et al., 2018), yield prediction in pepper (Wilson et al., 2021), and prediction of the mass of ber fruits (Abdel-Sattar et al., 2021), but few studies have applied SVR to crop yield prediction.
PLSR is also highly recommended when a large number of relevant predictor variables must be analyzed with a sample size that is small relative to the number of independent variables (Carrascal et al., 2009), while SVR can process high-dimensional data and is less affected by sample size (Meng & Zhao, 2015). For these reasons, and given the sample size of the present study, these two machine learning methods were used to fit yield prediction models based on agronomic traits. Moreover, to the best of our knowledge, no study has directly compared the performance of PLSR and SVR in developing models for predicting wheat grain yield.
On the other hand, the relationship between agronomic traits and yield may be influenced by the environment, yet few studies have examined yield-related traits using data from multiple locations. For instance, Gustavo et al. (2022) compiled a large database from 367 published papers to identify the main determinants of the number of grains per unit area in response to environmental and genetic factors. They suggested that the responsiveness of the number of grains per unit area was explained to a similar extent by changes in the number of spikes m⁻² and the number of grains spike⁻¹. To fill this gap, the current study comprehensively investigated the key traits contributing to yield using 22 agronomic traits collected from 90 farms in six different regions, in order to minimize the effect of the environment on the relationships between traits and yield. Multivariate statistical analyses were used to clarify and assess the underlying determinants of wheat yield diversity. In addition, two machine learning methods, PLSR as a linear model and SVR as a non-linear model, were applied to compare their power in predicting grain yield from the other 21 traits and to assess the selected yield-related traits.
MATERIAL AND METHODS
Data collection
The data were collected during 2020-2021 from 90 farms in six regions (Darab, Kavar, Marvdasht, Fasa, Lar, and Khonj) of Fars province, Iran, which are the most important wheat-growing regions. On every farm, repeated random sampling with a 1 m² quadrat was used to measure the agronomic traits: grain yield (t ha⁻¹), thousand-seed weight (g), number of spikes m⁻², grain number spike⁻¹, awn length (cm), spike length (cm), plant height (cm), number of weeds m⁻², and percentage of pest and disease infestation. Soil EC (dS m⁻¹) was measured with an EC meter in a 1:5 (w/v) soil/distilled water slurry (Bao et al., 2005). Rainfall (mm) was obtained from weather stations. The other traits, namely nitrogen fertilizer (N, kg ha⁻¹), phosphorus fertilizer (P, kg ha⁻¹), potassium chloride fertilizer (K, kg ha⁻¹), animal manure application, number of irrigation cycles, seed rate (kg ha⁻¹), herbicide use (narrow-leaf and broadleaf herbicides), time to plant maturity (month), and planting depth (cm), were collected with a questionnaire on each farm. Within each quadrat, pest-damaged and diseased wheat plants were counted separately, and the percentages of pest and disease infestation were calculated as (number of pest-damaged or diseased plants / total number of plants) × 100.
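The infestation percentage described above is simple arithmetic and can be sketched as follows. The function name, the example counts, and the averaging of quadrats into a farm-level value are hypothetical and are not taken from the study.

```python
def infestation_percentage(affected_plants, total_plants):
    """Percentage of pest-damaged (or diseased) plants counted in one quadrat."""
    if total_plants <= 0:
        raise ValueError("total_plants must be positive")
    if not 0 <= affected_plants <= total_plants:
        raise ValueError("affected_plants must lie between 0 and total_plants")
    return affected_plants / total_plants * 100

# Hypothetical (affected, total) plant counts from three quadrats on one farm.
quadrats = [(12, 400), (8, 380), (15, 420)]
per_quadrat = [infestation_percentage(a, t) for a, t in quadrats]

# Averaging the quadrats into one farm-level value is an assumption of this sketch.
farm_mean = sum(per_quadrat) / len(per_quadrat)
print([round(p, 2) for p in per_quadrat], round(farm_mean, 2))
```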
Multivariate statistical analysis
The Shapiro-Wilk test was used to check whether each variable followed a normal distribution. Pearson’s correlation coefficient was used to measure the strength of the linear relationship between pairs of variables. Stepwise regression was applied to identify the traits with the greatest effect on grain yield, with grain yield as the dependent variable and the remaining traits as independent variables.
Because the independent variables were correlated, multicollinearity was tested as a check of the regression assumptions by computing the tolerance (TOL) and the variance inflation factor (VIF) in SPSS. All VIF values were smaller than 10 and all TOL values were greater than 0.1, indicating the absence of multicollinearity among the variables.
PCA was performed to derive new variables (principal components) made up of the trait combinations that account for the most variation. In this study, PCA was also used to assess the association of some important traits with grain yield.
SAS (Statistical Analysis System, v. 9.2) was used to check the normality of the data (Shapiro-Wilk test), to perform stepwise regression analysis, and to compute descriptive statistics; SPSS (Statistical Package for the Social Sciences, v. 24) was used for Pearson’s correlation analysis. The pheatmap, factoextra, and ggplot2 packages in RStudio (v. 4.0.3) were used to plot the correlation heatmap and the PCA biplot.
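For readers who want the correlation step spelled out, a minimal sketch of Pearson’s coefficient is given below. It is not the software used in this study; the function name and the six paired values are hypothetical illustration data.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y scaled by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Hypothetical paired observations (not study data).
spikes_per_m2 = [310, 355, 402, 428, 465, 490]
grain_yield = [3.1, 3.6, 4.0, 4.5, 4.6, 5.2]   # t/ha

r = pearson_r(spikes_per_m2, grain_yield)
print(round(r, 3))
```

A value near +1 or -1 indicates a nearly linear relationship, and a value near 0 indicates little linear association.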
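The two indices can be illustrated with a short NumPy sketch. It is not the SPSS procedure used here: the function name and the synthetic predictors are assumptions, and only the thresholds (VIF of 10, TOL of 0.1) come from the text. Each predictor is regressed on all the others; TOL is 1 - R² of that regression and VIF is its reciprocal.

```python
import numpy as np

def vif_and_tolerance(X):
    """Return (VIF, TOL) per column of X, where TOL_j = 1 - R_j^2 and VIF_j = 1 / TOL_j."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vif = np.empty(p)
    tol = np.empty(p)
    for j in range(p):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(n), others])        # intercept + other predictors
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # ordinary least squares fit
        resid = y - A @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        tol[j] = 1.0 - r2
        vif[j] = 1.0 / tol[j]
    return vif, tol

rng = np.random.default_rng(1)
X = rng.normal(size=(90, 3))                                    # three unrelated predictors
X = np.column_stack([X, X[:, 0] + 0.1 * rng.normal(size=90)])  # near-copy of column 0
vif, tol = vif_and_tolerance(X)

# Flag predictors that breach either threshold quoted in the text.
flagged = [j for j in range(X.shape[1]) if vif[j] >= 10 or tol[j] <= 0.1]
print(np.round(vif, 2), flagged)
```

In this synthetic case only the duplicated pair of predictors is flagged, while the unrelated predictors have VIF values close to 1.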
Machine learning methods
Machine learning analyses, namely PLSR and SVR, were conducted in Python (v. 3.10.5) for prediction, and Excel was used for statistical analyses and for drawing graphs. For this purpose, 80% of the samples were used for training and the remaining 20% for testing. The mathematical background of PLSR and SVR is as follows:
- SVR. The principles of SVR are identical to those of SVM classification. SVR is a modified form of SVM in which the dependent variable is numerical rather than categorical. It allows non-linear models to be built without altering the explanatory variables, which keeps the resulting model interpretable. SVR applies the maximal margin principle, which allows it to be formulated as a convex optimization problem: prediction errors are tolerated as long as they do not exceed a set value. Moreover, over-fitting is limited because SVR penalizes larger errors through a cost parameter.
- PLSR. This is a robust, efficient regression method for multivariate analysis over a wide range of data (Martens & Martens, 2000). It reduces the predictors to a smaller set of uncorrelated components, on which least squares regression is then performed. PLSR is particularly useful when the predictors are highly collinear or outnumber the observations, cases in which ordinary least squares regression would fail completely or yield coefficients with high standard errors. In addition, PLSR uses a linear multivariate model to relate two data matrices, X and Y, and goes further than traditional regression by also modeling the structure of both matrices. Its ability to analyze numerous, noisy, collinear, and incomplete variables in both X and Y (Wold et al., 2001) accounts for its efficiency. The theory, principles, and applications of PLSR have been extensively reviewed by Abdi (2010).
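The two ideas above can be made concrete with a small NumPy sketch. It is not the code used in this study. It shows (i) the epsilon-insensitive loss that underlies SVR, in which errors below a set value cost nothing, and (ii) a single-response PLS regression fitted with the NIPALS algorithm on deliberately collinear synthetic predictors, using an 80/20 split. All function names, the synthetic data, the number of components, and the value of epsilon are assumptions of the example.

```python
import numpy as np

def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """SVR-style loss: errors within +/- epsilon cost nothing, larger ones grow linearly."""
    residual = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    return np.maximum(0.0, residual - epsilon)

def pls1_fit(X, y, n_components):
    """Single-response PLS regression (NIPALS); returns (coefficients, intercept)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xr, yr = X - x_mean, y - y_mean      # centred copies, deflated in the loop
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)           # weights: direction of maximal covariance with y
        t = Xr @ w                       # component scores
        tt = t @ t
        p = Xr.T @ t / tt                # X loadings
        c = yr @ t / tt                  # y loading
        Xr = Xr - np.outer(t, p)         # remove what this component explained
        yr = yr - c * t
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return coef, y_mean - x_mean @ coef

rng = np.random.default_rng(42)
n = 90

def small_noise():
    return 0.05 * rng.normal(size=n)

# Five highly collinear synthetic predictors driven by two latent factors.
t1, t2 = rng.normal(size=n), rng.normal(size=n)
X = np.column_stack([t1, t1 + small_noise(), t2, t2 + small_noise(),
                     t1 - t2 + small_noise()])
y = 2.0 * t1 - t2 + 0.1 * rng.normal(size=n)

# 80% of the samples for training, the remaining 20% for testing.
idx = rng.permutation(n)
n_train = round(0.8 * n)
train, test = idx[:n_train], idx[n_train:]

coef, intercept = pls1_fit(X[train], y[train], n_components=2)
y_pred = X[test] @ coef + intercept
rmse_test = float(np.sqrt(np.mean((y[test] - y_pred) ** 2)))
mean_eps_loss = float(epsilon_insensitive_loss(y[test], y_pred, epsilon=0.1).mean())
print(round(rmse_test, 3), round(mean_eps_loss, 3))
```

When the number of components equals the number of (full-rank) predictors, this PLS fit coincides with ordinary least squares; using fewer components is what stabilizes the coefficients under collinearity. A full SVR fit additionally requires solving the convex optimization problem, which is normally left to a library implementation.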
Model performance
Four statistical criteria were used to assess model performance on the training and testing datasets: the coefficient of determination (R²), the root-mean-square error (RMSE), the mean squared error (MSE), and BIAS. These indices were calculated from Xmi, the measured yield value in the field, Xpi, the predicted yield value, and n, the number of samples. A combination of these statistical parameters is sufficient for model evaluation; Sheikh Khozani et al. (2020) used the same indices to examine model performance.
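For reference, one common set of definitions in terms of the symbols above is sketched below. The symbol \bar{X}_m (the mean of the measured yields) is introduced here for the example. Both the form of R² (some authors use the squared Pearson correlation between measured and predicted values instead) and the sign convention of BIAS (predicted minus measured) are assumptions, not necessarily the expressions used by the authors.

```latex
\begin{align}
R^2 &= 1 - \frac{\sum_{i=1}^{n} \left( X_{mi} - X_{pi} \right)^2}
               {\sum_{i=1}^{n} \left( X_{mi} - \bar{X}_m \right)^2} \\
\mathrm{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( X_{mi} - X_{pi} \right)^2} \\
\mathrm{MSE} &= \frac{1}{n} \sum_{i=1}^{n} \left( X_{mi} - X_{pi} \right)^2 \\
\mathrm{BIAS} &= \frac{1}{n} \sum_{i=1}^{n} \left( X_{pi} - X_{mi} \right)
\end{align}
```

Under these definitions RMSE is the square root of MSE, so the two carry the same information on different scales, while BIAS indicates whether a model tends to over-predict (positive) or under-predict (negative).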