Use of multivariate analysis and machine learning methods to characterize traits contributing to wheat yield diversity

Ali Behpouri

Department of Agroecology, College of Agriculture and Natural Resources of Darab, Shiraz University, Iran.

https://orcid.org/0000-0003-4349-9676

Sara Farokhzadeh

Department of Agroecology, College of Agriculture and Natural Resources of Darab, Shiraz University, Iran.

https://orcid.org/0000-0001-6786-854X

Zahra Zinati

Department of Agroecology, College of Agriculture and Natural Resources of Darab, Shiraz University, Iran.

https://orcid.org/0000-0001-9362-0417

Zobeir Khosravi

Department of Agroecology, College of Agriculture and Natural Resources of Darab, Shiraz University, Iran.

https://orcid.org/0000-0003-0687-1598

Abstract

Aim of study: Wheat is the third largest staple food crop in the world, so determining the factors affecting its yield is of great importance. This study aimed to determine useful subsets of agronomic traits and to evaluate the order of importance of traits for grain yield.

Area of study: Fars province, Iran.

Material and methods: Data on 22 agronomic traits were collected from 90 farms in six regions of Fars province, Iran (Darab, Kavar, Marvdasht, Fasa, Lar, and Khonj), the most important wheat-growing regions. Multivariate statistical analyses (correlation, stepwise regression, and principal component analysis (PCA)) and machine learning modeling approaches, namely partial least squares regression (PLSR) and support vector regression (SVR), were applied to the agronomic traits.

Main results: The findings, based on integrated approaches such as correlation, stepwise regression, and PCA, highlighted that number of spikes m^-2, grain number spike^-1, and thousand-grain weight had a major impact on the yield followed by awn length, spike length, narrow leaf herbicide, broadleaf herbicide, time to plant maturity (month), and soil salinity. Besides, PLSR with nine inputs (nine selected traits) displayed better prediction capability (R²=85 %, RMSE=0.32, MSE=0.10, and BIAS=-0.05) than that with all twenty-two input traits.

Research highlights: Integrated multivariate statistical analyses and machine learning regression methods could be a powerful tool in determining traits that have a significant impact on yield. These achievements can be considered for future breeding programs.

Additional key words: Triticum aestivum; multivariate statistical analysis; partial least squares regression; support vector regression.

Abbreviations used: MSE (mean squared error); PCA (principal component analysis); PLSR (partial least squares regression); RMSE (root-mean-square error); SVR (support vector regression); TGW (thousand-grain weight); TOL (tolerance); VIF (variance inflation factor).

Citation: Behpouri, A; Farokhzadeh, S; Zinati, Z; Khosravi, Z (2023). Use of multivariate analysis and machine learning methods to characterize traits contributing to wheat yield diversity. Spanish Journal of Agricultural Research, Volume 21, Issue 1, e0901.
https://doi.org/10.5424/sjar/2023211-19835

Received: 16 Sep 2022. Accepted: 31 Jan 2023.

Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Competing interests: The authors have declared that no competing interests exist.

Correspondence should be addressed to Zahra Zinati: zahrazinati@shirazu.ac.ir

RESULTS

The Shapiro-Wilk test of data collected for different traits indicated a normal distribution for variables. Statistical information related to 22 different traits is presented in Table 1.

Table 1. Descriptive statistics of the model dataset (total sample) used in the study.

Traits	Variable	Maximum	Minimum	Mean	Std Deviation	Correlation coefficients of traits with grain yield
Seeding rate (kg ha^-1)	x1	350.00	220.00	285.17	3.95	0.228*
Awn length (cm)	x2	12.00	4.00	8.13	1.87	0.353**
Spike length (cm)	x3	13.00	5.00	10.59	1.35	0.316**
Plant height (cm)	x4	98.00	70.00	83.61	6.94	0.452**
Nitrogen fertilizer (N, kg ha^-1)	x5	350.00	100.00	210.22	61.83	0.381**
Phosphorus fertilizer (P, kg ha^-1)	x6	150.00	0.00	71.28	44.64	0.441**
Potassium chloride fertilizer (K, kg ha^-1)	x7	100.00	0.00	33.17	30.99	0.254*
Narrow leaf herbicide	x8	2.00	0.00	1.34	0.47	0.278**
Broadleaf herbicide	x9	2.00	0.00	1.05	0.52	0.368**
Time to plant maturity (month)	x10	7.50	5.00	6.35	0.66	0.392**
Number of irrigation cycles	x11	10.00	5.00	7.16	1.34	0.230*
Animal manure application	x12	3000.00	0.00	111.11	507.13	0.375**
Pest infestation (%)	x13	18.00	0.00	5.76	3.20	-0.276**
Disease infestation (%)	x14	15.00	0.00	4.61	3.56	-0.433**
Number of weeds m^-2	x15	20.00	3.00	8.60	3.95	-0.302**
Rainfall (mm)	x16	104.70	66.60	93.32	12.87	-0.159
Planting depth (cm)	x17	5.00	2.00	3.41	0.68	-0.299**
Soil salinity (dS m^-1)	x18	3.50	0.50	1.63	0.81	-0.585**
Number of spikes m^-2	x19	402.00	220.00	277.74	34.86	0.915**
Grain number spike^-1	x20	66.00	27.00	41.46	8.49	0.841**
Thousand grain weight (g)	x21	45.00	31.00	37.66	3.67	0.798**
Grain yield (t ha^-1)	x22	8.00	3.80	5.66	0.91	1.00

Multivariate statistical analysis

Correlation

The phenotypic correlations (r_p) between grain yield and the other traits are listed in Table 1, and a heatmap of the correlation matrix among the 22 agronomic traits is presented in Fig. 1. The absolute values of the significant correlation coefficients (r) ranged from 0.228 to 0.915. Grain yield was significantly and positively correlated with the majority of traits. The exceptions were pest and disease infestation percentages, number of weeds m^-2, planting depth, and soil salinity, which were significantly and negatively correlated with grain yield, and rainfall, which showed no significant correlation with grain yield (r = -0.159).
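
As a rough check on the significance markers in Table 1, the t statistic of a Pearson correlation can be computed from r and the sample size. The sketch below assumes n = 90 farms and a two-tailed test, and uses approximate critical t values for 88 degrees of freedom (about 1.987 at the 5% level and 2.633 at the 1% level). The function names are illustrative and are not taken from the study.

```python
import math

def correlation_t(r, n):
    """t statistic for testing whether a Pearson correlation r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)

def significance_marker(r, n=90, t_05=1.987, t_01=2.633):
    """Return '**', '*' or 'ns' using approximate two-tailed critical t values
    for n - 2 = 88 degrees of freedom (assumed, not reported in the study)."""
    t = abs(correlation_t(r, n))
    if t >= t_01:
        return "**"
    if t >= t_05:
        return "*"
    return "ns"

# Correlations with grain yield taken from Table 1.
table1_examples = {
    "Seeding rate": 0.228,
    "Rainfall": -0.159,
    "Soil salinity": -0.585,
    "Number of spikes m^-2": 0.915,
}
for trait, r in table1_examples.items():
    print(trait, round(correlation_t(r, 90), 2), significance_marker(r))
```

Under these assumptions the markers agree with Table 1: seeding rate is significant at the 5% level only, rainfall is not significant, and soil salinity and number of spikes m^-2 are significant at the 1% level.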

Figure 1. Heatmap correlation matrix among 22 agronomic traits: x1, seeding rate (kg ha^-1); x2, awn length (cm); x3, spike length (cm); x4, plant height (cm); x5, nitrogen fertilizer (N; kg ha^-1); x6, phosphorus fertilizer (P; kg ha^-1); x7, potassium chloride fertilizer (K; kg ha^-1); x8, narrow-leaf herbicide; x9, broadleaf herbicide; x10, time to plant maturity (month); x11, number of irrigation cycles; x12, animal manure application; x13, pest infestation (%); x14, disease infestation (%); x15, number of weeds m^-2; x16, rainfall (mm); x17, planting depth (cm); x18, soil salinity (dS m^-1); x19, number of spikes m^-2; x20, grain number spike^-1; x21, thousand-grain weight (g); and x22, grain yield (t ha^-1).

Stepwise linear regression

Regression analysis showed that the number of spikes m^-2, grain number spike^-1, spike length, broadleaf herbicide, soil salinity, time to plant maturity, narrow leaf herbicide, thousand-grain weight (TGW), rainfall, and awn length together explained most of the variation in grain yield (Table 2).

Table 2. Stepwise regression analysis of grain yield as dependent and other traits as independent variables in wheat.

Traits	B (unstandardized coefficient)	Std. error	Beta (standardized coefficient)	F	Partial R²	R²	Order of entry into the model
(Constant)	-2.603	0.573	-	-	-	-	-
Number of spikes m^-2	0.008	0.002	0.313	452.40**	0.8371	0.8371	1
Grain number spike^-1	0.030	0.006	0.278	15.37**	0.0245	0.8616	2
Spike length (cm)	0.064	0.023	0.095	6.52**	0.0098	0.8713	3
Broadleaf herbicide	0.162	0.062	0.092	4.88**	0.0070	0.8783	4
Soil salinity (dS m^-1)	-0.184	0.052	-0.162	5.11**	0.0070	0.8853	5
Time to plant maturity (month)	0.209	0.055	0.152	10.01**	0.0123	0.8976	6
Narrow leaf herbicide	0.209	0.065	0.107	5.06*	0.0059	0.9036	7
Thousand-grain weight (g)	0.042	0.015	0.167	4.94*	0.0055	0.9091	8
Rainfall (mm)	0.007	0.002	0.095	5.75*	0.0061	0.9152	9
Awn length (cm)	0.048	0.02	0.098	6.05*	0.0060	0.9212	10
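
To illustrate how the fitted model in Table 2 is read, the sketch below evaluates the regression equation at the trait means reported in Table 1, using the unstandardized coefficients (B). It is only a worked example: the dictionary keys and the function name are illustrative, and the coefficients are used as rounded in the table.

```python
# Unstandardized coefficients (B) from Table 2.
INTERCEPT = -2.603
COEFFICIENTS = {
    "Number of spikes m^-2": 0.008,
    "Grain number spike^-1": 0.030,
    "Spike length (cm)": 0.064,
    "Broadleaf herbicide": 0.162,
    "Soil salinity (dS m^-1)": -0.184,
    "Time to plant maturity (month)": 0.209,
    "Narrow leaf herbicide": 0.209,
    "Thousand-grain weight (g)": 0.042,
    "Rainfall (mm)": 0.007,
    "Awn length (cm)": 0.048,
}

# Trait means from Table 1.
TRAIT_MEANS = {
    "Number of spikes m^-2": 277.74,
    "Grain number spike^-1": 41.46,
    "Spike length (cm)": 10.59,
    "Broadleaf herbicide": 1.05,
    "Soil salinity (dS m^-1)": 1.63,
    "Time to plant maturity (month)": 6.35,
    "Narrow leaf herbicide": 1.34,
    "Thousand-grain weight (g)": 37.66,
    "Rainfall (mm)": 93.32,
    "Awn length (cm)": 8.13,
}

def predict_yield(traits):
    """Grain yield (t ha^-1) predicted by the stepwise regression equation."""
    return INTERCEPT + sum(COEFFICIENTS[name] * traits[name] for name in COEFFICIENTS)

yield_at_means = predict_yield(TRAIT_MEANS)
print(round(yield_at_means, 2))
```

Evaluated at the trait means, the equation gives about 5.64 t ha^-1, close to the observed mean grain yield of 5.66 t ha^-1 in Table 1. The small gap is expected from rounding of the published coefficients.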

Principal component analysis (PCA)

The PCA yielded six principal components with eigenvalues greater than one, which together explained 70.77% of the total variance among the 90 wheat samples (Table 3). The first three principal components explained 28.22, 13.02, and 10.48% of the total variance, respectively, or 51.72% in total (Table 3). The greatest variability in the data was explained by PC1, in which number of spikes m^-2, grain number spike^-1, TGW, and grain yield had the largest positive loadings, while soil salinity had the largest negative loading. In PC2, number of irrigation cycles and time to plant maturity had the largest positive loadings, while awn length and plant height had the largest negative loadings. In PC3, K, P, number of weeds m^-2, and disease infestation had the largest positive loadings. In PC4, seeding rate and rainfall had the largest positive and negative loadings, respectively. In PC5, animal manure application, pest infestation, and planting depth had the largest positive loadings. Finally, spike length and narrow leaf herbicide had the largest positive loadings in PC6.

Table 3. The results of the principal component analysis for different traits in wheat.

Traits	PC1	PC2	PC3	PC4	PC5	PC6
Seeding rate (kg ha^-1)	0.103	0.046	0.022	0.585	0.010	0.052
Awn length (cm)	0.183	-0.320	-0.104	0.002	0.250	-0.071
Spike length (cm)	0.121	0.070	-0.188	-0.153	0.074	0.503
Plant height (cm)	0.257	-0.312	-0.026	0.077	0.110	0.246
Nitrogen fertilizer (N, kg ha^-1)	0.238	-0.235	0.254	-0.008	-0.171	0.197
Phosphorus fertilizer (P, kg ha^-1)	0.196	0.064	0.399	-0.233	-0.087	0.125
Potassium chloride fertilizer (K, kg ha^-1)	0.122	0.063	0.451	-0.083	-0.207	-0.086
Narrow leaf herbicide	0.116	0.138	0.273	0.075	-0.097	0.521
Broadleaf herbicide	0.165	0.156	0.166	0.123	-0.224	-0.105
Time to plant maturity (month)	0.124	0.378	-0.237	0.174	-0.181	0.097
Number of irrigation cycles	0.017	0.500	-0.132	-0.113	-0.010	0.108
Animal manure application	0.131	0.167	-0.005	-0.035	0.315	0.231
Pest infestation (%)	-0.201	0.223	0.237	0.067	0.407	-0.010
Disease infestation (%)	-0.235	0.205	0.300	0.062	0.176	-0.149
Number of weeds m^-2	-0.151	-0.024	0.419	0.222	0.297	0.037
Rainfall (mm)	-0.123	0.041	0.016	-0.584	0.288	0.053
Planting depth (cm)	-0.156	-0.076	-0.120	0.296	0.354	0.261
Soil salinity (dS m^-1)	-0.283	0.263	-0.101	0.036	-0.194	0.142
Number of spikes m^-2	0.357	0.111	-0.032	-0.012	0.183	-0.138
Grain number spike^-1	0.322	0.139	-0.034	-0.060	0.154	-0.153
Thousand-grain weight (g)	0.296	0.214	-0.035	0.102	0.162	-0.319
Grain yield (t ha^-1)	0.367	0.141	-0.015	-0.024	0.187	-0.042
Eigenvalue	6.21	2.86	2.30	1.83	1.29	1.07
Proportional variance (%)	28.22	13.02	10.48	8.33	5.85	4.87
Cumulative variance (%)	28.22	41.25	51.72	60.05	65.90	70.77
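
The last three rows of Table 3 are linked by simple arithmetic, which the sketch below reproduces. It assumes that the PCA was run on the correlation matrix of the 22 standardized variables, so that the total variance equals 22. This assumption is not stated in the text but is consistent with the reported percentages.

```python
# Eigenvalues of the six retained components, from Table 3.
eigenvalues = [6.21, 2.86, 2.30, 1.83, 1.29, 1.07]
n_variables = 22  # assumption: total variance of 22 standardized variables

# Retain components with eigenvalue greater than one.
retained = [ev for ev in eigenvalues if ev > 1.0]

# Percentage of total variance explained by each retained component.
proportions = [100.0 * ev / n_variables for ev in retained]

# Running total of explained variance.
cumulative = []
running_total = 0.0
for share in proportions:
    running_total += share
    cumulative.append(running_total)

for ev, share, total in zip(retained, proportions, cumulative):
    print(ev, round(share, 2), round(total, 2))
```

The computed shares match the table to within the rounding of the published eigenvalues, with a cumulative value of roughly 70.7% for the six components.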

Next, a PCA biplot was drawn with the R packages factoextra and ggplot2. It illustrates the relationships among the samples as well as among the measured traits in wheat. The size of each sample point reflects its yield. The cosine of the angle between two trait vectors indicates the extent of correlation between the traits. The angles between the grain yield vector and the vectors for planting depth (x17), rainfall (x16), and number of weeds m^-2 (x15) were obtuse, indicating a negative correlation, while the grain yield vector formed acute angles with the rest of the traits, indicating a positive correlation (Fig. 2).

Figure 2. PCA-Bi-plot of the measured traits in wheat. Larger circles represent higher yield, and smaller circles show lower yield. Traits x1 to x22: see Figure 1.
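
The angle rule used to read the biplot can be illustrated numerically. The sketch below takes the PC1 and PC2 loadings from Table 3 as approximate vector coordinates, which is an assumption: the plotted biplot may scale the axes differently, so the exact angles may differ. The function names are illustrative.

```python
import math

# (PC1, PC2) loadings from Table 3, used here as approximate biplot vectors.
loadings = {
    "Grain yield": (0.367, 0.141),
    "Number of spikes m^-2": (0.357, 0.111),
    "Planting depth": (-0.156, -0.076),
    "Rainfall": (-0.123, 0.041),
}

def cosine(u, v):
    """Cosine of the angle between two 2-D vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(u[0], u[1]) * math.hypot(v[0], v[1]))

def angle_type(u, v):
    """'acute' for a positive cosine, otherwise 'obtuse'
    (a right angle, cosine 0, is not handled separately)."""
    return "acute" if cosine(u, v) > 0 else "obtuse"

yield_vector = loadings["Grain yield"]
for trait, vector in loadings.items():
    if trait != "Grain yield":
        print(trait, round(cosine(yield_vector, vector), 3), angle_type(yield_vector, vector))
```

With these coordinates, number of spikes m^-2 forms an acute angle with grain yield (positive association), while planting depth and rainfall form obtuse angles (negative association), in line with the reading of Fig. 2 given above.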

Finally, nine traits were selected by integrating stepwise regression, correlation, and PCA. Stepwise regression was applied to identify the traits with the greatest effect on grain yield; in the resulting model, 10 traits explained 92.12% of the variation in yield (R² = 0.9212). Correlation analysis indicated that 9 of these 10 traits were significantly correlated with grain yield. In addition, PCA was conducted to identify traits related to grain yield. Because the loadings of a PC show how strongly each variable influences that component, and thus reflect the variable's importance, the variables with large loadings were enumerated; these 9 traits had large loadings in PC1 and PC2. In other words, all three methods (stepwise linear regression, correlation, and PCA) jointly identified 9 traits as important for grain yield: awn length (cm), spike length (cm), narrow leaf herbicide, broadleaf herbicide, time to plant maturity (month), soil salinity (dS m^-1), number of spikes m^-2, grain number spike^-1, and TGW (g).
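
The integration step described above amounts to intersecting the trait sets retained by each method. A minimal sketch follows. The stepwise set comes from Table 2 and the correlation set from the significance markers in Table 1, whereas the PCA set is simply taken as the nine traits named in the text (it is not re-derived from the loadings here, and the loading threshold used by the authors is not reported).

```python
# Traits entered by stepwise regression (Table 2).
stepwise_traits = {
    "Number of spikes m^-2", "Grain number spike^-1", "Spike length",
    "Broadleaf herbicide", "Soil salinity", "Time to plant maturity",
    "Narrow leaf herbicide", "Thousand-grain weight", "Rainfall", "Awn length",
}

# Of the stepwise traits, only rainfall lacks a significant correlation
# with grain yield in Table 1.
correlated_traits = stepwise_traits - {"Rainfall"}

# Traits reported in the text as having large loadings in PC1 and PC2
# (taken as stated, not recomputed).
pca_traits = {
    "Awn length", "Spike length", "Narrow leaf herbicide", "Broadleaf herbicide",
    "Time to plant maturity", "Soil salinity", "Number of spikes m^-2",
    "Grain number spike^-1", "Thousand-grain weight",
}

# Traits identified by all three methods.
selected_traits = stepwise_traits & correlated_traits & pca_traits
print(len(selected_traits), sorted(selected_traits))
```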

Grain yield modeling by SVR for all traits

The results of the training (I) and testing (II) stages for all traits are shown in Fig. 3 as a scatter plot (A) and a line-graph (B). According to these results, the SVR model applied to all traits predicted grain yield with R² of 0.7292 and 0.6239 for the training and testing datasets, respectively. Moreover, the line-graph indicates that this model could not estimate grain yield satisfactorily when all traits were used; in particular, it was not efficient at predicting the minimum or maximum grain yield.

Figure 3. Scatter plot (A) and line-graph (B) of predicted vs. measured grain yield for the training (I) and testing (II) datasets of all traits, using SVR. Traits include seeding rate (kg ha^-1), awn length (cm), spike length (cm), plant height (cm), nitrogen fertilizer (N, kg ha^-1), phosphorus fertilizer (P, kg ha^-1), potassium chloride fertilizer (K, kg ha^-1), narrow-leaf herbicide, broadleaf herbicide, time to plant maturity (month), number of irrigation cycles, animal manure application, pest infestation (%), disease infestation (%), number of weeds m^-2, rainfall (mm), planting depth (cm), soil salinity (dS m^-1), number of spikes m^-2, grain number spike^-1, thousand-grain weight (g), and grain yield (t ha^-1).

Grain yield modeling by PLSR for all traits

The results of the training (I) and testing (II) stages for all traits are shown in Fig. 4 as a scatter plot (A) and a line-graph (B). Taking all traits as inputs, the PLSR model predicted grain yield with R² of 0.92 and 0.76 for the training and testing datasets, respectively. As shown in the line-graphs, the PLSR model with all traits performed well in predicting the maximum and minimum grain yield. Accordingly, of the two prediction models, PLSR was chosen for further evaluation of the selected traits associated with yield.

Figure 4. Scatter plot (A) and line-graph (B) of predicted vs. measured grain yield for the training (I) and testing (II) datasets of all traits, using PLSR. Traits as Figure 3.

Grain yield modeling by SVR for the most important variables

The results of the training and testing stages for the 9 selected traits are shown in Table 4. The SVR model applied to all traits predicted grain yield with R² of 0.73 and 0.62 for the training and testing datasets, respectively, whereas the SVR model applied to the 9 selected traits predicted grain yield with R² of 0.88 and 0.59, respectively. Accordingly, this model could not estimate grain yield satisfactorily with either all traits or the 9 selected traits.

Table 4. Grain yield estimation model results based on the SVR (support vector regression) and PLSR (partial least squares regression) models for training and testing datasets.

Models	Training R²	Training RMSE	Training MSE	Training BIAS	Testing R²	Testing RMSE	Testing MSE	Testing BIAS
SVR (All data)	0.73	0.57	0.33	0.04	0.62	0.67	0.44	0.29
PLSR (All data)	0.92	0.26	0.07	0.00	0.76	0.46	0.21	0.27
SVR (9 traits)	0.88	0.35	0.12	-0.02	0.59	0.53	0.29	0.10
PLSR (9 traits)	0.92	0.26	0.07	0.00	0.85	0.32	0.10	-0.05

Grain yield modeling by PLSR for the most important variables

The results of the training (I) and testing (II) stages are illustrated in Fig. 5 as a scatter plot (A) and a line-graph (B) for the 9 selected traits. The figures show satisfactory agreement between the measured values and those predicted by the PLSR model. The PLSR model applied to the 9 selected traits predicted grain yield with R² of 0.92 and 0.85 for the training and testing datasets, respectively.

Figure 5. Scatter plot (A) and line-graph (B) of predicted vs. measured grain yield for the training (I) and testing (II) datasets of nine traits, using PLSR. Traits include awn length (cm), spike length (cm), narrow-leaf herbicide, broadleaf herbicide, time to plant maturity (month), soil salinity (dS m^-1), number of spikes m^-2, grain number spike^-1, and thousand-grain weight (g).

Table 4 presents the R², RMSE, MSE, and BIAS values of the applied models for both the training and testing stages, for all traits and for the 9 traits selected by the integrated approaches (stepwise regression, correlation, and PCA). As shown in Table 4, the SVR model gave the least accurate estimates of grain yield, with the highest RMSE, MSE, and BIAS and the lowest R² in both the training and testing stages, for all data and for the 9 selected traits. Accordingly, this model could not estimate grain yield satisfactorily with either all traits or the 9 selected traits. Table 4 also shows that the PLSR model outperformed the SVR model: for both all data and the nine selected traits, its RMSE, MSE, and BIAS were the lowest and its R² was the highest. Therefore, the PLSR model with the nine selected traits was adopted as the best model for obtaining the most reliable estimate of grain yield.
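
For readers who want to reproduce the kind of statistics reported in Table 4, the sketch below computes R², RMSE, MSE, and BIAS from paired measured and predicted values. The exact formulas used by the authors are not given in this section, so two choices here are assumptions: R² is computed as the coefficient of determination (1 − SS_res/SS_tot), and BIAS as the mean of predicted minus measured values. The yield values are hypothetical and serve only to exercise the function.

```python
import math

def regression_metrics(measured, predicted):
    """Return R2, RMSE, MSE and BIAS for paired measured/predicted values."""
    n = len(measured)
    mean_measured = sum(measured) / n
    # Residuals defined as predicted minus measured (assumed sign convention).
    residuals = [p - m for m, p in zip(measured, predicted)]
    ss_res = sum(e * e for e in residuals)
    ss_tot = sum((m - mean_measured) ** 2 for m in measured)
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(mse),
        "MSE": mse,
        "BIAS": sum(residuals) / n,
    }

# Hypothetical grain yields (t ha^-1), not data from the study.
measured = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 5.0, 5.5, 7.0]
metrics = regression_metrics(measured, predicted)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because RMSE is the square root of MSE, the two columns in Table 4 carry the same information; for example, a testing RMSE of 0.32 corresponds to an MSE of about 0.10.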

DISCUSSION

Because wheat is the third main staple food crop and one of the most economically significant crops worldwide, determining the factors affecting its yield is of paramount importance for subsequently improving yield and ensuring food security (Farokhzadeh et al., 2021).

The factors affecting yield are usually correlated, which leads to redundancies or potentially misleading outcomes when identifying the dominant variables that control yield (Carrascal et al., 2009) and may misdirect plant breeders from achieving their main breeding objectives. In this study, too, Pearson correlation analysis revealed relationships among the 22 traits (Fig. 1). Therefore, to better understand the key determinants of wheat yield, a comprehensive data analysis was carried out that combined multivariate statistical analysis (stepwise regression, correlation analysis, and PCA) with machine learning methods (PLSR and SVR). Several studies have documented the relationship between yield and some yield components in wheat (Leilah & Al-Khateeb, 2005; Baye et al., 2020). In this context, to clarify the underlying determinants of wheat yield diversity, the yield components were also included in the list of input variables.

Correlation and stepwise linear regression

Analysis of the correlation coefficient indicated that among traits, number of spikes m^-2 was highly correlated with grain yield followed by grain number spike^-1 and TGW in wheat. These findings were in accordance with previous studies carried out to determine traits affecting grain yield (Farokhzadeh et al., 2013; 2020).

Stepwise regression analysis was used to measure the effect of agronomic traits on wheat grain yield. With this method, 92.12% of the total variation in grain yield was attributed to 10 traits. All of these traits except rainfall were significantly correlated with grain yield, positively in every case except soil salinity, which was negatively correlated (Table 1). Although rainfall had no significant correlation with grain yield, it had a direct effect on grain yield, as indicated by its standardized coefficient in the stepwise regression model. The lack of a significant correlation between rainfall and grain yield may be because irrigation neutralizes the effect of rainfall. Farokhzadeh et al. (2022), using stepwise regression, reported that wheat grain yield as a dependent variable was modeled as a function of the independent variables grain number spike^-1, days to heading, and spikelet number spike^-1.

Principal component analysis (PCA)

PCA was performed, and the first two components were used to construct a bi-plot visualizing the relationships among the samples as well as among the measured traits (Fig. 2). As seen, samples with low or average yield are on the left side of the bi-plot, and samples with high and very high yield are on the right side. Samples with larger PC1 and PC2 scores showed a very high yield. As clearly revealed in the PCA bi-plot, number of spikes m^-2, grain number spike^-1, and TGW were the dominant and shared traits of samples with very high yield, while pest infestation (x13), disease infestation (x14), and soil salinity (x18) were the dominant traits of samples with low or average yield. The PCA results confirm the correlation and stepwise regression results, which indicate that increases in the number of spikes m^-2, grain number spike^-1, and TGW result in higher yields. Other reports confirm the importance of the traits identified in this study. For example, the importance of awn length, spike length, time to plant maturity (month), weed percentage, soil salinity, number of spikes m^-2, grain number spike^-1, and TGW has been highlighted in several studies (Shamsi et al., 2011; Farokhzadeh et al., 2013; 2020; 2022).

The main goal of wheat breeding programs is to improve grain yield potential. Grain yield and yield components, including the number of spikes, grain number spike^-1, and TGW, are complex polygenic traits that are influenced by genetic background, environment, and the interaction between them. Usually, grain yield can be determined by combining two components: grain number and TGW (Zhang et al., 2016a). Physiological studies have shown that increased grain yield is positively associated with an increase in grain number, accompanied by a negative association between grain number and grain weight (Miralles & Slafer, 2007). Grain weight is a yield component that breeders have made considerable attempts to improve (Lopes et al., 2012). Larger seeds are not only directly related to grain yield but also beneficially affect seedling vigor and early growth, thus promoting and stabilizing production ability (Botwright et al., 2002). Under current agronomic production systems, improving the number of spikes or grains m^-2 is more important than improving other yield components for enhancing grain yield potential (Gaju et al., 2009).

It is important to highlight that awn length affects wheat yield. Abebe et al. (2009) revealed that awn length influences grain yield. Their study showed that the awn significantly affects the photosynthetic characteristics of the spike and tolerance to stress in barley. In the absence of awns, starch content was reduced, which in turn reduced seed weight and size.

As expected, soil salinity was another key trait identified by the multivariate statistical analysis in this study. Soil salinity problems generally occur in arid and semi-arid areas and decrease crop productivity to varying degrees. Salinity is also a major factor limiting crop yield in poorly drained soils (Patel et al., 2002; Rogers, 2002). Rajpar et al. (2006) reported that spike length, plant height, number of spikelets spike^-1, and grain yield progressively decreased with increasing soil salinity. Moreover, as soil ECe increases, the Na⁺ concentration increases and the K⁺ concentration decreases, which reduces the K⁺/Na⁺ ratio in the flag leaf sap and grains (Rajpar et al., 2006). Weed infestation is one of the most important yield-reducing factors in agricultural systems. Weeds are major consumers of nutrients such as chemical fertilizers, which can increase their growth and development relative to crops (Jalilian et al., 2018). Weed infestation reduces grain yield and yield components (Mekonnen, 2022).

Yield estimation model results based on the SVR and PLSR models

In the current study, two analytical methods, SVR and PLSR, were applied to agronomic traits obtained from 90 farms in six different areas for model development. With all traits as inputs, PLSR performed better than SVR on the testing dataset (R² of 0.76 vs. 0.62 and RMSE of 0.46 vs. 0.67) and showed good predictive ability and robustness compared with SVR. This result is consistent with a previous long-term experiment, which concluded that PLSR can prioritize the most important factors controlling winter wheat yield (Hu et al., 2018). Duan et al. (2020) argued that PLSR offers a practical way to determine the factors affecting yield, as it effectively removes the correlation among variables and corrects the bias in the estimated role of each factor in rice yield. Also, Zhang et al. (2020) estimated the yield of field-grown winter wheat by applying the PLSR model. These reports are in accordance with the present study, in which PLSR displayed better prediction capability.
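
The way PLSR copes with correlated predictors can be illustrated with a deliberately simplified sketch. The code below fits a PLS1 model with a single latent component in plain Python. It is not the implementation used in the study (the software, number of components, and preprocessing are not described in this section), and the data are hypothetical. The two predictors are made perfectly collinear to show that the latent-score regression still yields a usable prediction.

```python
import math

def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model and return a prediction function."""
    n, p = len(X), len(X[0])
    # Center predictors and response.
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [value - y_mean for value in y]
    # Weight vector: unit direction in predictor space with maximal covariance with y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = math.sqrt(sum(value * value for value in w))
    w = [value / norm for value in w]
    # Latent scores: projection of each centered sample onto the weight vector.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    # Regress the centered response on the single latent score.
    q = sum(t[i] * yc[i] for i in range(n)) / sum(value * value for value in t)

    def predict(row):
        score = sum((row[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return predict

# Hypothetical data: the second predictor is exactly twice the first, so the
# predictors are perfectly collinear and ordinary least squares has no unique solution.
X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
y = [2.0, 4.0, 6.0, 8.0]
predict = fit_pls1_one_component(X, y)
print(round(predict([5.0, 10.0]), 3))
```

The correlated predictors are combined into one latent score before regression, which is the property the cited studies point to when recommending PLSR for correlated yield factors.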

With all traits as inputs to the superior model (PLSR), R² and RMSE for the validation (testing) dataset were 0.76 and 0.46, respectively, whereas with the 9 selected traits as inputs, R² and RMSE on the testing dataset improved to 0.85 and 0.32, respectively. Thus, the PLSR model with the 9 selected traits was at least as accurate as the model with all traits. This result indicates the substantial influence of the nine selected traits in explaining grain yield variation. In other words, multivariate statistical analyses offer great potential to reduce the number of model inputs (traits) and consequently to increase the prediction accuracy of PLSR.

The high accuracy of the PLSR model with 9 inputs (traits) implies that the selected traits contribute to the improvement of grain yield. From a plant breeding perspective, our study recommends considering the number of spikes m^-2, grain number spike^-1, TGW, awn length, spike length, and time to plant maturity (month) in breeding programs aimed at improving grain yield in wheat. Farokhzadeh et al. (2022), by integrating the results of multivariate statistics and supervised learning methods, showed that spikelet number spike^-1 and grain number spike^-1 can be used to create a selection index for high grain yield in wheat.

CONCLUSIONS

Grain is the most economically important part of the wheat plant, and grain yield is the result of yield components and other related traits. Identifying the main components of grain yield and their relationship with grain yield can be essential in directing breeding and management programs to increase yield. In the current study, multivariate analyses including correlation, stepwise regression, and principal component analysis (PCA) were applied to select useful subsets of agronomic traits from 90 farms and to evaluate the order of importance of the traits influencing grain yield. Moreover, partial least squares regression (PLSR) displayed better prediction capability with nine inputs (nine selected traits) than with all traits. In brief, integrated multivariate statistical analyses and machine learning regression methods could be a powerful tool in determining key contributing traits to wheat yield.

ACKNOWLEDGMENTS

The authors are very grateful to Zohreh Sheikh Khozani for her help in the data modeling and analysis and Ehsan Bijanzadeh for valuable guidance throughout this study.

REFERENCES

Abdel-Sattar M, Aboukarima AM, Alnahdi BM, 2021. Application of artificial neural network and support vector regression in predicting mass of ber fruits (Ziziphus mauritiana Lamk.) based on fruit axial dimensions. PLoS ONE 16(1): e0245228. https://doi.org/10.1371/journal.pone.0245228

Abdi H, 2010. Partial least squares regression and projection on latent structure regression (PLS regression). Wiley Interdisciplinary Reviews: Comput Stat 2: 97-106. https://doi.org/10.1002/wics.51

Abebe T, Wise RP, Skadsen RW, 2009. Comparative transcriptional profiling established the awn as the major photosynthetic organ of the barley spike while the lemma and the palea primarily protect the seed. Plant Genome 2: 247-259. https://doi.org/10.3835/plantgenome.2009.07.0019

Asseng S, Turner NC, Keating BA, 2001. Analysis of water- and nitrogen-use efficiency of wheat in a Mediterranean climate. Plant Soil 233: 127-143. https://doi.org/10.1023/A:1010381602223

Bao SD (ed), 2005. Analysis of soil agrochemistry. China Agriculture Press, Beijing, China. 495 pp.

Barikloo A, Alamdari P, Moravej K, Servati M, 2017. Prediction of irrigated wheat yield by using hybrid algorithm methods of artificial neural networks and genetic algorithm. J Water Soil 31: 715-726.

Baye A, Berihun B, Bantayehu M, Derebe B, 2020. Genotypic and phenotypic correlation and path coefficient analysis for yield and yield-related traits in advanced bread wheat (Triticum aestivum L.) lines. Cogent Food Agric 6(1): 1752603. https://doi.org/10.1080/23311932.2020.1752603

Botwright TL, Condon AG, Rebetzke AG, Richards RA, 2002. Field evaluation of early vigour for genetic improvement of grain wheat. Aust J Agric Res 53(10): 1137-1145. https://doi.org/10.1071/AR02007

Carrascal LM, Galván I, Gordo O, 2009. Partial least squares regression as an alternative to current regression methods used in ecology. Oikos 118(5): 681-690. https://doi.org/10.1111/j.1600-0706.2008.16881.x

Costa C, Menesatti P, Spinelli R, 2012. Performance modelling in forest operations through partial least square regression. Silva Fenn 46(2): 241-252. https://doi.org/10.14214/sf.57

Duan L, Xie H, Li Z, Yuan H, Guo Y, Xiao X, Zhou Q, 2020. Use of partial least squares regression to identify factors controlling rice yield in Southern China. Agron J 112(3): 1502-1516. https://doi.org/10.1002/agj2.20161

Farokhzadeh S, Shahsavand-Hassani H, Mohammadi-Nejad GH, 2013. Evaluation of genetic diversity of primary tritipyrum, triticale and bread wheat genotypes. Iran J Agron Sci 5: 93-112.

Farokhzadeh S, Fakheri BA, Mahdinejad N, Tahmasebi S, Mirsoleimani A, Heidari B, 2020. Mapping QTLs associated with grain yield and yield-related traits under aluminum stress in bread wheat. Crop Pasture Sci 71: 429-444. https://doi.org/10.1071/CP19511

Farokhzadeh S, Fakheri BA, Zinati Z, Tahmasebi S, 2021. New selection strategies for determining the traits contributing to increased grain yield in wheat (Triticum aestivum L.) under aluminum stress. Genet Resour Crop Evol 68: 2061-2073. https://doi.org/10.1007/s10722-021-01117-4

Farokhzadeh S, Shahsavand-Hassani H, Zinati Z, Rajaei M, 2022. Evaluation of triticale lines compared to wheat cultivars in terms of agronomic traits using supervised learning methods and multivariate statistics. Philipp Agric Sci 105(4): 369-389.

Fassio A, Cozzolino D, 2003. Non-destructive prediction of chemical composition in sunflower seeds by near infrared spectroscopy. Indust Crops Prod 20: 321-329. https://doi.org/10.1016/j.indcrop.2003.11.004

Gaju O, Reynolds MP, Sparkes DL, Foulkes MJ, 2009. Relationships between large-spike phenotype, grain number, and yield potential in spring wheat. Crop Sci 49: 961-973. https://doi.org/10.2135/cropsci2008.05.0285

Gustavo AS, Guillermo AG, Roman AS, Daniel JM, 2022. Physiological drivers of responses of grains per m2 to environmental and genetic factors in wheat. Field Crops Res 285: 108593. https://doi.org/10.1016/j.fcr.2022.108593

Hu Y, Wei X, Hao M, Fu W, Zhao J, Wang Z, 2018. Partial least squares regression for determining factors controlling winter wheat yield. Agron J 110: 281-292. https://doi.org/10.2134/agronj2017.02.0108

Jalilian A, Mondani F, Khoramivafa M, Bagheri A, 2018. Evaluation of Clipest model in simulation of winter wheat (Triticum aestivum L.) and wild oat (Avena ludoviciana L.) competition in Kermanshah. Iran J Agroeco 10: 248-266.

Leilah AA, Al-Khateeb SA, 2005. Statistical analysis of wheat yield under drought conditions. J Arid Environ 61(3): 483-496. https://doi.org/10.1016/j.jaridenv.2004.10.011

Lopes MS, Reynolds MP, Manes Y, Singh RP, Crossa J, Braun HJ, 2012. Genetic yield gains and changes in associated traits of CIMMYT spring bread wheat in a “historic” set representing 30 years of breeding. Crop Sci 52(3): 1123-1131. https://doi.org/10.2135/cropsci2011.09.0467

Martens H, Martens M, 2000. Modified Jack-knife estimation of parameter uncertainty in bilinear modeling by partial least squares regression (PLSR). Food Qual Prefer 11: 5-16. https://doi.org/10.1016/S0950-3293(99)00039-7

Mekonnen G, 2022. Wheat (Triticum aestivum L.) yield and yield components as influenced by herbicide application in Kaffa Zone, Southwestern Ethiopia. Int J Agron 2022: 3202931. https://doi.org/10.1155/2022/3202931

Meng M, Zhao C, 2015. Application of support vector machines to a small-sample prediction. Adv Petrol Explor Dev 10(2): 72-75.

Miralles DJ, Slafer GA, 2007. Sink limitations to yield in wheat: How could it be reduced? J Agric Sci 145: 139-149. https://doi.org/10.1017/S0021859607006752

Nguyen HT, Lee BW, 2006. Assessment of rice leaf growth and nitrogen status by hyperspectral canopy reflectance and partial least square regression. Eur J Agron 24: 349-356. https://doi.org/10.1016/j.eja.2006.01.001

Norouzi M, Ayoubi S, Jalalian A, Khademi H, Dehghani AA, 2010. Predicting rainfed wheat quality and quantity by artificial neural network using terrain and soil characteristics. Acta Agric Scand - B Soil Plant Sci 60(4): 341-352. https://doi.org/10.1080/09064710903005682

Patel R, Prasher S, Bonnell R, Boughton R, 2002. Development of comprehensive soil salinity index. J Irrig Drain Eng 128: 185-188. https://doi.org/10.1061/(ASCE)0733-9437(2002)128:3(185)

Rajpar I, Khanif YM, Soomro FM, Suthar JK, 2006. Effect of NaCl salinity on the growth and yield of Inqlab wheat (Triticum aestivum L.) variety. Am J Plant Physiol 1: 34-40. https://doi.org/10.3923/ajpp.2006.34.40

Rogers ME, 2002. Irrigating perennial pasture with saline water: Effects on soil chemistry, pasture production and composition. Aust J Exp Agric 42: 265-272. https://doi.org/10.1071/EA00128

Shaibu AS, Adnan AA, 2015. Predicting grain yield of maize using drought tolerance traits. Afr J Agric Res 10(33): 3332-3337. https://doi.org/10.5897/AJAR2015.9561

Shamsi K, Petrosyan M, Noor-Mohammadi G, Haghparas A, Kobraee S, et al., 2011. Differential agronomic responses of bread wheat cultivars to drought stress in the west of Iran. Afr J Biotechnol 10: 2708-2715. https://doi.org/10.5897/AJB10.1133

Sheikh Khozani Z, Khosravi KH, Torabi M, Mosavi A, Rezaei B, Rabczuk T, 2020. Shear stress distribution prediction in symmetric compound channels using data mining and machine learning models. Front Struct Civ Eng 14: 1097-1109. https://doi.org/10.1007/s11709-020-0634-3

Takahashi S, Anwar MR, 2007. Wheat grain yield, phosphorus uptake and soil phosphorus fraction after 23 years of annual fertilizer application to an Andosol. Field Crops Res 101: 160-171. https://doi.org/10.1016/j.fcr.2006.11.003

Tian Y, Xu YP, Wang G, 2018. Agricultural drought prediction using climate indices based on support vector regression in Xiangjiang River basin. Sci Total Environ 622: 710-720. https://doi.org/10.1016/j.scitotenv.2017.12.025

Vapnik V (ed), 1995. The nature of statistical learning theory. Springer, NY. https://doi.org/10.1007/978-1-4757-2440-0

Wilson A, Hemalatha N, Sukumar R, 2021. Computational prediction model for pepper yield prediction using support vector regression. AgriRxiv 10310468. https://doi.org/10.31220/agriRxiv.2021.00069

Wold S, Sjostrom M, Eriksson L, 2001. PLS-regression: A basic tool of chemometrics. Chemometr Intell Lab Syst 58: 109-130. https://doi.org/10.1016/S0169-7439(01)00155-1

Yang Y, Li N, Wu Y, Liu B, Li S, Tao L, et al., 2022. Key phenotypes related to wheat grain yield in a two-site multi-cultivar test. Agron J 114(5): 2874-2885. https://doi.org/10.1002/agj2.21098

Zhang H, Chen J, Li R, Deng Z, Zhang K, Liu B, Tian J, 2016a. Conditional QTL mapping of three yield components in common wheat (Triticum aestivum L.). Crop J 4: 220-228. https://doi.org/10.1016/j.cj.2016.01.007

Zhang PP, Zhou XX, Wang ZX, Mao W, Li WX, Yun F, et al., 2020. Using HJ-CCD image and PLS algorithm to estimate the yield of field-grown winter wheat. Sci Rep 10: 5173. https://doi.org/10.1038/s41598-020-62125-5

Zhang Y, Xu W, Wang W, Dong H, Qi X, Zhao M, et al., 2016b. Progress in genetic improvement of grain yield and related physiological traits of Chinese wheat in Henan Province. Field Crops Res 199: 117-128. https://doi.org/10.1016/j.fcr.2016.09.022
