optRF: Optimising random forest stability by determining the optimal number of trees
Thomas M. Lange*1, Felix Heinrich1, Mehmet Gültas2,3, Armin O. Schmitt1,3
1Breeding Informatics Group, Georg-August University,
Margarethe von Wrangell-Weg 7, 37075 Göttingen, Germany
2Faculty of Agriculture, South Westphalia University of Applied Sciences,
Lübecker Ring 2, 59494 Soest, Germany
3Center for Integrated Breeding Research (CiBreed), Georg-August University,
Albrecht-Thaer-Weg 3, 37075 Göttingen, Germany
ORCIDs
Thomas M. Lange: https://orcid.org/0000-0003-4351-7950
Felix Heinrich: https://orcid.org/0000-0002-6093-8522
Mehmet Gültas: https://orcid.org/0000-0003-3297-3192
Armin O. Schmitt: https://orcid.org/0000-0002-4910-9467
Abstract
Machine learning is frequently used to make decisions based on big data. Among these techniques, random forest is particularly prominent in genomic research, where it is used for selecting the best individuals within a test population or for identifying the most important genomic markers. Although random forest is known to have many advantages, one aspect that is often overlooked is that it is a non-deterministic method that can produce different models from the same input data. This can have severe consequences for decision-making processes. In this study, we introduce a method to quantify the impact of non-determinism on predictions, variable importance estimates, and the selection process. Our findings demonstrate that increasing the number of trees in a random forest enhances its stability in a non-linear way while the computation time increases linearly. Consequently, we conclude that there exists an optimal number of trees for any given data set that maximises the stability without unnecessarily extending the computation time. Based on these results, we have developed the R package optRF which models the relationship between the number of trees and the stability of random forest, providing recommendations for the optimal number of trees for any given data set.
Keywords:
parameter optimisation
random forest
machine learning
non-determinism
decision-making
genomic selection
variable selection
*Corresponding author: Thomas Martin Lange - thomas.lange@uni-goettingen.de.
Introduction
Machine learning is a powerful tool to analyse complex and large data sets and is frequently used across various scientific disciplines in decision-making processes. The term machine learning describes methods that enable the computer to recognise patterns and, in this way, to “learn” from the data and to make predictions for novel data [1, 2]. Beyond predictions, these models are increasingly used to make decisions based on their outputs, enabling data-driven decision-making processes [3]. Such processes are widely applied in fields such as finance, healthcare, engineering, and genomic research [4, 5, 6]. One widely used method for such tasks is random forest. Its popularity is due to the fact that it is a non-parametric method that performs very well, requires very simple input preparation, and can handle binary, categorical, count, and continuous response variables [7].
One important task in genomic research is genomic selection, where the breeding value or the phenotype of individuals is predicted using genomic data and subsequently used to select the best individuals in a population where the phenotype is unknown [7, 8]. Such studies rely heavily on the use of single nucleotide polymorphisms (SNPs), which can be analysed in large quantities using SNP chips [9, 10, 11]. However, such studies are often limited by the number of individuals that can be analysed, thus leading to a small number of observations (n) and a large number of variables (p). This so-called “small n, large p problem” is a general problem in genomic research [12]. While random forest is known to work adequately even when the small n, large p problem occurs [13, 14, 15], it has been discussed that interactions between variables might not be detected and correctly used for predictions when p is too large [16, 17].
In order to mitigate the small n, large p problem as well as to reduce costs, the number of necessary SNPs on a SNP chip can be reduced, leading to the development of low-density SNP chips such as the 35K Axiom Wheat Breeder’s genotyping array, which was derived from the 820K Axiom Wheat high density genotyping array [18, 19, 20]. Random forest has the potential to be used for this purpose since it not only allows one to build prediction models but also to determine the importance of each variable in the data set [21, 22, 23]. In this way, random forest can also be used to analyse associations between the genotype and the phenotype [24, 25, 26]. Thus, genomic markers that show the highest association with the phenotype can be selected [17, 27, 28] and genes related to certain traits can be identified [29].
Despite the many advantages of random forest, it is often overlooked that random forest is a non-deterministic method. This means that repeated runs of random forest with the same data set will lead to different prediction models and varying variable importance estimates. This variability is particularly problematic when the predictions or the estimated variable importances are used in decision-making processes, as is the case in genomic selection and in the selection of the most important variables. Even though the impact of non-determinism on the selection process may have severe consequences, it has not been adequately addressed to date.
Random forest works by growing multiple decision or regression trees and making predictions by averaging the predictions of the trees. In order to grow the trees, at each node, the variable is chosen that best splits the data set into two distinct groups [30]. Rather than choosing this variable from all available variables, the best variable can be chosen from a random sample of all variables [31]. While some publications state that the value for this parameter, which is typically called mtry, can influence the prediction accuracy [32, 33, 34], it was also described that the value for mtry did not influence the out of bag error rate in experiments with microarray data [35]. Besides the number of variables from which the best variable is chosen, the number of observations that are drawn for the training of each tree can also be adjusted. On the one hand, a small value for the sample size can reduce computation time and will lead to more diverse trees, which has a positive effect on prediction accuracy. On the other hand, more diverse trees will lead to less stability of the random forest model when aggregating the trees [32]. The depth of the trees can be controlled by defining the minimum number of observations in the terminal node (node size) [32]. Tuning this hyperparameter can increase prediction accuracy, especially for data sets with a large sample size and a small number of variables [36]. However, it must be pointed out that in the field of genomic research we are more often confronted with the opposite problem: a large number of predictors and a small number of observations [13, 14].
Even though the number of trees grown in a random forest can have a severe effect on the prediction accuracy, this parameter has been described as not tunable because there is no single best value for it [32]. While R packages for performing random forest such as randomForest, ranger, and Rborist use 500 trees as default [31, 37, 38], it is widely agreed that a greater number of trees in random forest has only benefits for the quality of the model, such as more stable and accurate results without overfitting [32, 39, 40, 41]. However, it is also known that the computation time increases linearly with the number of trees [32, 41]. This leads to the question of how to set this parameter. It has been recommended to set the number of trees to ten times the number of variables in the data set and from there to increase the number of trees until the error rate stabilises [39]. It should be noted that when following this recommendation, analysing a data set with 10,000 genomic markers would require performing random forest multiple times, starting with 100,000 trees. This recommendation was not followed in studies which often perform genomic selection with a smaller number of trees [4, 42, 43]. Despite the suggestion to set the number of trees as high as the computational power allows, research with biomedical data sets has shown that beyond a certain number of trees, a further increase in the number of trees did not lead to a further improvement in quality but only to longer computation time [44].
Here, we present the first study that quantifies the stability of random forests with a special focus on selection processes. Moreover, we analysed the relationship between the number of trees and the stability of the random forest. We analysed this relationship using multiple data sets from plant and animal breeding and show that for all data sets under investigation, the number of trees had a severe effect on the stability of random forest and that this relationship is non-linear. Based on these results, we developed the R package optRF which models this relationship in order to determine the optimal number of trees that leads to a high stability of the random forest without unnecessarily increasing the computation time. While it is possible to tune the mtry value for random forest using the caret package [45] or the mtry, node size, and sample size parameters using the tuneRanger package [32], there is no package available that recommends a number of trees, even though we show that this parameter is highly important for the stability of random forest. We fill this void and show that the optRF package recommends a number of trees that leads to stable random forest models without unnecessarily increasing the computation time, which offers a considerable advantage for data-driven decision-making processes based on random forest.
Materials and methods
The optRF package
We developed an R package that (I) calculates the stability of a random forest model with t trees, (II) models the relationship between the number of trees and the stability, and (III) determines an optimal number of trees beyond which additional trees would increase the computation time but barely improve the stability. To this end, the R package contains two main functions: opt_prediction to optimise the number of trees for predictions and for the selection of the best individuals in a test population, and opt_importance to optimise the number of trees for estimating the variable importance of each variable and selecting the most important variables in a data set.
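The intended workflow can be sketched as follows; the function names and the y and X arguments are described below, while the data objects themselves are placeholders:

library(optRF)

# y: response variable of the training population (vector)
# X: predictor variables of the training population (e.g. a SNP matrix)
opt_pred_result <- opt_prediction(y = y, X = X)

# Optimal number of trees for variable importance estimation and variable selection
opt_imp_result <- opt_importance(y = y, X = X)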
The opt_prediction function
The opt_prediction function needs as input a training data set where the response variable is inserted in the y argument and the predictor variables are inserted in the X argument. Optionally, the user can specify a test data set using the X_Test argument containing the same predictor variables as in X. If no such data set is given, the function will randomly create a test data set based on the variable distributions in X. The opt_prediction function will use the training data to construct a random forest model with a certain number of trees using the R package ranger, version 0.16.0 [37], and use this model to predict the response in the test data set. By default, this process is repeated ten times, leading to a matrix that contains, for each individual in the test data set, the predictions from the ten repetitions. With this prediction matrix, the prediction stability is calculated and the process is repeated with different numbers of trees, leading to a result table that contains the number of trees and the corresponding prediction stability. By default, opt_prediction will calculate the prediction stability for random forests with 250, 500, 750, 1,000, and 2,000 trees.
If the response variable is metric, random forest regression is performed and the prediction matrix will contain metric values. If the random forest is stable, the repeated predictions will be highly correlated. The prediction stability is, thus, defined as the intraclass correlation coefficient (ICC) [46] between the ten repetitions of random forest. The ICC is calculated using a one-way model and single measures, ICC(1,1) [47, 48], using the function icc from the R package irr, version 0.84.1 [49]. If the response variable is categorical, random forest classification is performed and the prediction matrix will contain categorical values. In this case, if the random forest model is stable, the same class will repeatedly be predicted for the same individual in the test data set. Thus, the prediction stability is defined as Fleiss’ kappa (κ) [50, 51], calculated using the function kappam.fleiss from the irr package. For both the ICC and Fleiss’ kappa, a value of 1 indicates perfect prediction stability and a value of zero indicates poor prediction stability [52, 53].
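As an illustration of these two stability measures, the following sketch applies the irr functions named above to placeholder prediction matrices (rows are test individuals, columns are the ten repetitions; the values are random and serve only to show the calls):

library(irr)

# Regression: ten repeated random forest predictions for 100 test individuals
pred_metric <- replicate(10, rnorm(100))
icc(pred_metric, model = "oneway", unit = "single")$value   # prediction stability as ICC(1,1)

# Classification: same structure, but with predicted classes
pred_class <- matrix(sample(c("A", "B"), 1000, replace = TRUE), ncol = 10)
kappam.fleiss(pred_class)$value                             # prediction stability as Fleiss' kappa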
To evaluate the selection stability when random forest regression is performed, opt_prediction will by default select the 15% of individuals with the highest predicted values in the test data set in each repetition of random forest. However, both the number of individuals to be selected and the selection criterion can be adjusted by the user. The selection criterion is set via the select_for argument, where the value "high" (default) selects the individuals with the highest predicted values, the value "low" selects the individuals with the lowest predicted values, and the value "zero" selects the individuals with predicted values closest to zero. If random forest classification is performed, in each repetition of random forest, those individuals are selected for which a certain class was predicted. Thus, the user has to specify the name of the desired class in the select_for argument when using opt_prediction with a categorical response variable. In both cases, a selection matrix is derived from the prediction matrix that contains a binary classification of "selected" or "rejected" for each individual in the test data set in each repetition of random forest and can be used to calculate the selection stability as Fleiss’ kappa.
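How such a selection matrix can be derived from a prediction matrix and scored can be sketched as follows (select_for = "high" with the default 15% selection, reusing the placeholder matrix pred_metric from the sketch above):

# In each repetition, mark the top 15% of individuals as "selected"
n_sel <- ceiling(0.15 * nrow(pred_metric))
sel_matrix <- apply(pred_metric, 2, function(p)
  ifelse(rank(-p) <= n_sel, "selected", "rejected"))

# Selection stability: Fleiss' kappa over the repeated binary classifications
kappam.fleiss(sel_matrix)$value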
The opt_importance function
The opt_importance function needs as input the response variable in the y argument and the predictor variables in the X argument. By default, it will repeat random forest ten times and in each repetition, the variable importance of each variable will be calculated, leading to a variable importance matrix that contains, for each variable, the importance estimates from the ten repetitions. If the random forest model is stable, the variable importance estimates will be highly correlated across all repetitions. Hence, the variable importance stability is also defined as the ICC. Since the variable importance estimates are always metric, it does not make a difference whether random forest regression or random forest classification is performed. Like opt_prediction, opt_importance calculates the variable importance stability by default with 250, 500, 750, 1,000, and 2,000 trees and derives a result table that contains the number of trees and the corresponding variable importance stability. In addition, a selection matrix is derived from the variable importance matrix: by default, the 5% of variables with the highest variable importance are selected in each repetition of random forest. With this selection matrix, the selection stability is calculated, which is defined as Fleiss’ kappa.
Modelling the stability
While the function opt_prediction calculates the prediction stability and the selection stability for selecting the best individuals in a test population, opt_importance calculates the variable importance stability and the selection stability for variable selection. In a next step, both functions model the stability S_j of a random forest model j with t_j trees with the two parameter logistic (2PL) model

S_j = 1 / (1 + e^((a − t_j) / b))

where a denotes the number of trees at which S_j = 0.5 and b is a scale parameter that determines the slope at t_j = a. This 2PL model equals the three parameter logistic model described in [54] where the maximum effect is set to 1, since the maximum value of the stability measures described above is 1 and the minimum value is 0. The 2PL model can also be derived from the extended four and five parameter logistic models described in [55, 56, 57], which have been adapted in multiple studies in the natural sciences [58, 59, 60], by setting the asymmetry parameter to 1, assuming no asymmetry within the logistic model. In order to estimate the parameter values, the Levenberg-Marquardt algorithm was applied using the R package minpack.lm, version 1.2-4 [61].
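This modelling step can be sketched with minpack.lm as follows; the stability values below are invented for illustration, and the package's internal implementation may differ in details:

library(minpack.lm)

# Stability observed with the default grid of tree numbers (illustrative values)
d <- data.frame(num_trees = c(250, 500, 750, 1000, 2000),
                stability = c(0.55, 0.72, 0.80, 0.84, 0.91))

# Fit the 2PL model with the Levenberg-Marquardt algorithm
fit <- nlsLM(stability ~ 1 / (1 + exp((a - num_trees) / b)),
             data = d, start = list(a = 250, b = 500))
coef(fit)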
To define an optimal number of trees, the model is used to estimate the increase in stability for each 10 trees added to the random forest model in the interval from 10 to 10,000,000 trees. Based on the estimated stability per 10 trees added to the random forest, the optimal number of trees can be determined: the number of trees will be recommended from which an additional 10 trees in random forest leads to an increase in stability of 10⁻⁶ or less. This threshold was set arbitrarily and can be defined by the user in the rec.thresh argument of the opt_prediction and the opt_importance functions. By default, opt_prediction will calculate the optimal number of trees based on the prediction stability and opt_importance based on the variable importance stability. However, the user can set the recommendation to be based on the selection stability in both functions.
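Continuing the sketch above, the recommendation rule can be written as follows (rec.thresh at its default of 10⁻⁶):

rec_thresh <- 1e-6
trees <- seq(10, 1e7, by = 10)
stab <- predict(fit, newdata = data.frame(num_trees = trees))

# Increase in modelled stability per 10 additional trees
gain <- diff(stab)

# Smallest number of trees where 10 additional trees gain less than the threshold
recommended <- trees[which(gain < rec_thresh)[1] + 1]
recommended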
Application with real life data
In order to demonstrate the effectiveness of the optRF package, we estimated the optimal number of trees both for genomic selection and for variable selection with multiple data sets from plant and animal breeding covering different species and traits (see Supplementary Data). To this end, we ran random forest with (I) the default setting of the ranger function, which is 500 trees, (II) the number of trees recommended by optRF, and (III) the number of trees defined as 10 times the number of genomic markers, as recommended in [39], and calculated the prediction or variable importance stability, the selection stability, and the computation time.
All in all, 45 data sets were analysed. Two barley (Hordeum vulgare L.) data sets with the traits β-glucan and yield [62], one chicken (Gallus domesticus) data set where the egg weight was measured in 36-week-old chickens [63], four maize (Zea mays L.) data sets with the traits anthesis-silk interval (ASI) and yield under irrigated and drought stress conditions [64], six rice (Oryza sativa L.) data sets with the traits amylose content, grain yield, yield after milling, and infestation with Nakataea oryzae, Rhizoctonia oryzae-sativae, and Pyricularia oryzae [65], eight rye (Secale cereale L.) data sets with the traits plant height and yield measured in four successive years [66], two strawberry (Fragaria × ananassa) data sets where the infestation with Phytophthora cactorum was scored in two successive years [67], two strawberry data sets where the infestation with Verticillium wilt was scored in two successive years [68], one sugar beet (Beta vulgaris L.) data set where the infestation with the Beet necrotic yellow vein virus (BNYVV) was analysed [17], two data sets where the yield of wheat (Triticum aestivum L.) cultivars was measured in two successive years [69], six data sets where the yield of synthetic hexaploid wheat lines was measured in three successive years under irrigated and drought stress conditions [69], four wheat data sets where the yield of two F5 populations and two populations of doubled haploid lines was measured in two successive years [70], two wheat data sets where the root length and the yield were measured [71], one data set where the germination rate of wheat lines was measured, and four data sets where the infestation with Puccinia striiformis and the yield were measured in two successive years [72]. All data sets were prepared for the analysis as follows: if individuals with the same genotype were replicated in the experiment, the mean of their response values was calculated; missing genomic data were imputed with k-nearest neighbours using the function kNN from the R package VIM, version 6.2.2, with default settings [73]; and SNPs with a minor allele frequency of less than 1% were removed.
All data sets under investigation contained metric response variables. In order to estimate the selection stability, the default settings were used, selecting the top 15% of individuals in the test data set with opt_prediction and the 5% most important genomic markers with opt_importance. Regarding opt_prediction, the select_for argument was adjusted depending on the trait under investigation. When performing genomic selection with yield data, individuals with the highest values were naturally selected. The same applies to the root length and the germination rate in plants as well as the egg weight in chicken. While one could select individuals with high β-glucan content when breeding barley for human consumption [74, 75], we selected individuals with low β-glucan content, as is done to avoid filtration problems in the production of alcoholic beverages [62]. Similarly, for plant height in rye and amylose content in rice, plant breeding tends to select varieties with low values, since shorter rye varieties reduce the risk of lodging and allow for easier handling during harvest [76] and rice varieties with moderately low amylose content have gained increasing popularity among consumers [77]. Naturally, regarding the infestation with diseases, plant breeding also selects varieties with low levels of infestation. When performing genomic selection for ASI, individuals should be selected with values closest to zero [42].
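With the select_for argument described above, these selection directions translate into calls such as the following (the data objects are placeholders):

# Disease scores, plant height, amylose or β-glucan content: select low predicted values
opt_prediction(y = y, X = X, X_Test = X_Test, select_for = "low")

# Anthesis-silk interval: select predicted values closest to zero
opt_prediction(y = y, X = X, X_Test = X_Test, select_for = "zero")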
One important feature of this method is that the prediction stability is analysed, not the prediction accuracy. While the calculation of the prediction accuracy requires comparing the predicted values to the observed values in the test data set, the observed values of the test data set do not need to be known to calculate the prediction stability. Thus, in a realistic scenario where the variables, in the case of genomic selection the SNPs, are known for the test data set but the phenotypic values are unknown, this method can still be applied.
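This is reflected in the interface: opt_prediction takes the test genotypes via X_Test but never the test phenotypes, so it can be applied in exactly this scenario (placeholder objects):

# y_train and X_train stem from the phenotyped training population;
# X_test contains only the SNPs of the unphenotyped selection candidates
opt_prediction(y = y_train, X = X_train, X_Test = X_test)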
Results
Results from applying optRF onto two exemplary data sets
To explain the application of the optRF package, we show the detailed results of applying opt_prediction and opt_importance to the maize data set in which yield was measured in well irrigated plots, the smallest data set under consideration (264 observations, 1,134 SNPs; data set 7 in the Supplementary Data), and to the chicken data set, the largest data set under consideration (1,063 observations, 139,101 SNPs; data set 3 in the Supplementary Data).
The results of applying opt_prediction and opt_importance to the maize data set are shown in Fig. 1. Both functions estimate the relationship between the number of trees and the stability of the random forest using 250, 500, 750, 1,000, and 2,000 trees, shown as red dots in Fig. 1. The relationship is modelled using the 2PL model, visualised in Fig. 1 as a blue line, and the estimated stability with the recommended number of trees is indicated with a red line. Furthermore, to show that the model can describe the relationship between the stability and the number of trees even when extrapolating to higher numbers of trees, the prediction and variable importance stability was additionally calculated with 5,000, 7,500, 10,000, 12,500, 15,000, 17,500, and 20,000 trees, visualised as black dots in Fig. 1.
Fig. 1
Example of the application of the optRF package to the maize data set (data set 7 in Supplementary Data). The relationship between the number of trees and the stability of the random forest was analysed for five numbers of trees (red dots) and estimated using a non-linear model (blue curve) for prediction (left graph) and variable importance stability (right graph). The prediction and variable importance stability with higher numbers of trees was also calculated (black dots). The horizontal red line indicates the stability with the recommended number of trees (14,000 for prediction, 17,000 for variable importance estimation)
One can see in the left graph of Fig. 1 that even though the 2PL model was derived using only the first five data points, the deviation between the model estimates and the observed values for the following data points is very small. Thus, it can be concluded that the model describes the relationship between the prediction stability and the number of trees in a random forest accurately even when it is derived only with small values for the number of trees, which is very important for the run time of the opt_prediction function. The same applies to the opt_importance function and the relationship between the variable importance stability and the number of trees, which is visualised in the right graph of Fig. 1.
In this example, the opt_prediction function recommended using 14,000 trees in the random forest, which led to a prediction stability of 0.986, an increase of 32% compared to the prediction stability of 0.746 obtained with the default value of 500 trees. Using the function opt_importance, the variable importance stability was increased from 0.615 with 500 trees to 0.981 with the recommended number of 17,000 trees. In both cases, a further increase of the prediction or variable importance stability is barely possible; thus, a further increase of the number of trees would unnecessarily increase the computation time.
The results of applying the functions of the optRF package to the chicken data set are shown in Fig. 2. Here as well, the red dots visualise the calculated stability with 250, 500, 750, 1,000, and 2,000 trees and the blue line shows the model that describes the relationship between the number of trees and the stability for any number of trees. To show that the model can accurately predict the stability for higher numbers of trees, the prediction and variable importance stability were calculated with 25,000, 50,000, 75,000, 100,000, 125,000, and 150,000 trees, visualised as black dots. Furthermore, the predicted stability with the recommended number of trees is shown as a red line.
Fig. 2
Example of the application of the optRF package to the chicken data set (data set 3 in Supplementary Data). The relationship between the number of trees and the stability of the random forest was analysed for five numbers of trees (red dots) and estimated using a non-linear model (blue curve) for prediction (left graph) and variable importance stability (right graph). The prediction and variable importance stability with higher numbers of trees was also calculated (black dots). The horizontal red line indicates the stability with the recommended number of trees (112,000 for prediction, 137,000 for variable importance estimation)
In this example, the functions of the optRF package were applied to the largest data set under consideration and a larger number of trees was necessary to build stable random forest models. Here, the prediction stability was only 0.0289 when using random forest with the default of 500 trees. The opt_prediction function recommended using 112,000 trees, which led to a prediction stability of 0.8725. Using the function opt_importance, the stability was also increased markedly: the variable importance stability was increased from 0.0193 with 500 trees to 0.8309 with the recommended number of 137,000 trees. One can see in Fig. 2 that even for this large data set, where a high number of trees was recommended, the model describes the calculated prediction and variable importance stability very well, even though it was derived with only five data points and up to 2,000 trees.
Results from applying optRF onto all data sets
To show the effectiveness of the optRF package, we applied both opt_prediction and opt_importance to 43 further data sets from animal and plant breeding with various phenotypic data, numbers of observations, and numbers of predictor variables. For each data set, we calculated the optimal number of trees to perform genomic selection or to select the most important variables. The data sets and the recommended numbers of trees as well as the resulting stability and computation time are given in the Supplementary Data. While we generally observed that a high number of trees is necessary to build stable random forest models for data sets with a large number of variables, the results also show that different numbers of trees were recommended for data sets with similar numbers of variables. For example, data sets 35 and 38 contain the same trait for the same species with the same number of SNPs. However, data set 35 contains 61 observations while data set 38 contains 759 observations. While it is optimal to use 80,000 trees for variable selection in data set 35, 10,000 trees are already sufficient for variable selection in data set 38. This indicates that not only the number of variables but also the number of observations in the training data set affects the optimal number of trees.
In the same way, data sets 43 and 44 can be compared. Both data sets stem from the same species and have the same number of variables and the same number of observations. However, while data set 43 contains data about the infestation with P. striiformis, data set 44 contains yield data. While it is optimal to use 28,000 trees for genomic selection and 18,000 trees for variable selection for P. striiformis infestation, it is optimal to use 47,000 trees for genomic selection and 50,000 trees for variable selection for yield. This indicates that not only the dimensionality of the data affects the recommended number of trees but also the response variable.
Fig. 3
The prediction stability (left graph) and the selection stability (right graph) for all 45 data sets under investigation when using random forest for genomic selection with (I) 500 trees, (II) the recommended number of trees from the optRF package, and (III) 10 times the number of variables as the number of trees
Figure 3 visualises the prediction and selection stability of random forest used for genomic prediction with the 45 data sets under investigation with (I) 500 trees, which is the default setting in the ranger function, (II) the recommended number of trees from the optRF package, and (III) the number of trees set to 10 times the number of variables. One can see in Fig. 3 that both the prediction and selection stability increased markedly when the number of trees was increased from 500 to the optimal number of trees recommended by the opt_prediction function. The average prediction stability increased from 0.306 to 0.959 and the average selection stability increased from 0.162 to 0.868. At the same time, the random forest with the optimal number of trees needed on average around 6 times as much computation time as the random forest with 500 trees. Furthermore, one can see that setting the number of trees to 10 times the number of variables increased the prediction and selection stability only slightly compared to the random forests with the optimal number of trees. The random forests with 10 times the number of variables led to an average prediction stability of 0.987 and an average selection stability of 0.946. However, this method of defining the number of trees needed on average around 8 times as much computation time as the random forests with the optimal number of trees and 50 times as much as the random forests with 500 trees. Thus, this slight increase in stability comes at the cost of a large increase in computation time.
Figure 4 visualises the variable importance and selection stability of random forest used for variable selection with the 45 data sets under investigation. Here as well, the number of trees was defined as 500, as the recommended number of trees from the opt_importance function, and as 10 times the number of variables. One can see that the variable importance stability and the selection stability increased markedly when the number of trees was increased from 500 to the optimal number of trees. Here, the variable importance stability increased from an average of 0.379 to an average of 0.956 and the selection stability increased from an average of 0.256 to an average of 0.796.
Fig. 4
The variable importance stability (left graph) and the selection stability (right graph) when using random forest to select the most important variables with (I) 500 trees, (II) the recommended number of trees from the optRF package, and (III) 10 times the number of variables as the number of trees
To achieve this increase in stability, variable selection with the optimal number of trees needed on average around 20 times as long as random forest with 500 trees. Here as well, the variable importance and selection stability increased in most cases when the number of trees was set to 10 times the number of variables. While the variable importance stability increased on average to 0.969, the selection stability increased on average to 0.88. Random forest with this number of trees needed around 11 times as long as random forest with the optimal number of trees and around 231 times as long as random forest with 500 trees. Here as well, the slight increase in stability comes at the cost of a large increase in computation time.
The application of random forest with 500 trees showed very poor stability for all data sets under investigation. The functions from the optRF package recommended an optimal number of trees for random forest that increased the stability markedly. These results also show that the number of trees can be set to 10 times the number of variables, which led in most cases to the highest stability. However, on average, this method increased the stability only slightly while increasing the computation time severely compared to the number of trees recommended by the optRF package. Furthermore, these results indicate that the number of trees necessary to obtain stable random forest results depends not only on the number of variables.
Discussion
Random forest is a useful tool for data-driven decision-making processes. However, in order to fully exploit the potential of random forest, optimal hyperparameters must be set. While it is generally agreed that the number of trees is an extremely important parameter in random forest, it has been argued that this parameter cannot be tuned since there is no value that would maximise the quality of random forest. Instead, it is generally recognised that the quality of random forest increases with a higher number of trees and it was, thus, recommended to use as many trees as the computational power allows. However, we showed that the stability of random forest increases non-linearly with higher numbers of trees while the computation time increases linearly. Thus, it is possible to search for an optimal number of trees that increases the stability of random forest up to the point where a further increase of the number of trees only leads to a negligible increase in stability.
With the default setting of 500 trees, the resulting random forest models exhibited poor stability across all data sets under investigation. Notably, for the largest data set, all stability measures were close to zero. But even for the smallest data set, random forest with 500 trees resulted in poor stability. This indicates that the default setting of 500 trees is insufficient to achieve stable results even with low-dimensional data.
In all data sets, the number of trees recommended by the optRF package markedly improved stability. However, considering Fig. 2, the optimal number of trees recommended by the optRF package led to prediction and variable importance stability below 0.9. This is because the optimal number of trees is defined as the number at which an increase of ten additional trees would increase the stability by less than 10⁻⁶. While we found this threshold to be adequate, it can be adjusted by the user via the rec.thresh argument in both opt_prediction and opt_importance. Alternatively, opt_prediction and opt_importance can be used to study the relationship between the number of trees and the stability, and the functions estimate_stability and estimate_numtrees can then be used to either analyse the stability of a random forest model with a certain number of trees or to determine the smallest number of trees to achieve a desired level of stability. Moreover, these functions also estimate the computation time, enabling the user to adjust the criteria for determining the optimal number of trees as desired.
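A sketch of this workflow follows; the function names are given above, but the argument names used here are assumptions for illustration rather than the documented interface:

res <- opt_prediction(y = y, X = X)

# Hypothetical argument names, for illustration only
estimate_stability(res, num.trees = 5000)   # expected stability with 5,000 trees
estimate_numtrees(res, stability = 0.95)    # smallest number of trees for a stability of 0.95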
While R packages such as caret or tuneRanger recommend specific values for mtry, the node size, or the sample size [45, 32], optRF recommends a certain value for the number of trees. Furthermore, while caret and tuneRanger aim to maximise the prediction accuracy, optRF optimises the number of trees based on the stability of random forest. This leads to an advantage when using random forest to predict the phenotype in a test data set because the true phenotype of the test data set does not need to be known. While caret and tuneRanger tune parameters in the training data set, optRF can set the optimal number of trees for a specific combination of training and test data set. Furthermore, since the optRF package recommends values for a hyperparameter that is not tuned by the other packages, they can be effortlessly combined to set values for mtry, sample size, node size, and the number of trees. However, Probst et al. (2019) pointed out that a smaller sample size will lead to higher accuracy but smaller stability [32]. Since the tuneRanger package recommends hyperparameter values that maximise accuracy, it could decrease stability. Thus, we recommend first optimising hyperparameters such as mtry, sample size, and node size and then optimising the number of trees using the optRF package.
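This recommended order can be sketched as follows, using tuneRanger's mlr-task interface; whether and how the tuned values are then passed on to optRF's functions is left open here, since the text above only states that the packages can be combined:

library(mlr)
library(tuneRanger)
library(optRF)

# Step 1: tune mtry, node size, and sample fraction for accuracy
task <- makeRegrTask(data = data.frame(y = y, X), target = "y")
tuned <- tuneRanger(task)
tuned$recommended.pars

# Step 2: with those hyperparameters fixed, optimise the number of trees for stability
opt_prediction(y = y, X = X)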
Despite the many advantages of optimised hyperparameters for the quality of random forest, searching for optimal parameter values can be a computational burden [41]. We have developed a procedure where the relationship between the stability and the number of trees in random forest is modelled with small numbers of trees and, with this model, the relationship is extrapolated to higher numbers of trees. In this way, the computation time to find the optimal number of trees can be reduced immensely. However, this process is based on parametric statistical modelling and, thus, requires assumptions to be made about the relationship and the distribution of the model's residuals.
Furthermore, the process of modelling the relationship between the number of trees and the stability of the random forest model depends on the data used to build the statistical model. Thus, the numbers of trees that are analysed could theoretically have an impact on the recommendation. That is why we give the user the possibility to enter any set of values for the number of trees that should be analysed and used to derive the statistical model from which the optimal number of trees is determined. This way, if the user questions the results based on the initial numbers of trees, the procedure can be repeated with different values. Nevertheless, since the recommended number of trees is based on the statistical model that describes the relationship between the number of trees and the observed stability, it can also be influenced by randomness. That is why, for any number of trees, the process of estimating the prediction stability and the variable importance stability must be repeated and averaged. We found that repeating the process ten times with each number of trees leads to stable results for the averaged prediction stability and the variable importance stability. The user can increase this value to ensure even more stable estimates; however, this will also increase the computation time. Thus, we propose to reduce non-determinism in random forest using a method that is itself non-deterministic. However, with the 45 data sets under investigation, we observed stable recommendations using the default settings of the optRF package.
Moreover, the functions in the optRF package can not only be used to determine the optimal number of trees but also to analyse the stability of a random forest model with a certain number of trees. Since the stability of random forest is crucial for the reproducibility of results, we highly recommend stating the number of trees and the stability of the random forest model, either as prediction and selection stability or as variable importance and selection stability, when publishing results that were determined using random forest.
Conclusion
The results presented here show that the number of trees has an important effect on the stability of random forest. Furthermore, they show that a random forest model with the default setting of 500 trees was too unstable for decision-making processes in all data sets that were analysed. Moreover, the results indicate that the number of trees necessary to reach a stable random forest model depends not only on the number of variables but also on the number of observations and the response variable.
While other R packages aim to maximise the prediction accuracy by tuning mtry, the sample size, and the node size, we present a method that optimises the stability of random forest by determining the optimal number of trees. We share the optRF package via CRAN to enable others either to search for an optimal number of trees when using random forest for decision-making processes or to estimate the stability of a random forest model with a given number of trees. Since the stability of random forest is crucial for the reproducibility of results, we highly recommend increasing the number of trees until stable results can be obtained from random forest and publishing the number of trees as well as the stability of the random forest model whenever random forest is used for prediction, variable importance estimation, or decision-making processes.
Declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Data Availability
The raw data used in this study are publicly available (see citations in the Materials and methods section). The stability measures when applying random forest with 500 trees and with the optimised numbers of trees are given in the Supplementary Data. The optRF package is freely available at CRAN.
Competing interests
The authors declare that they have no competing interests.
Funding
This research received no specific grant from any funding agency in the public, commercial, or non-profit sectors.
Author Contribution
TML developed the method, wrote the R package, performed the data analysis, interpreted the results and wrote the manuscript. FH contributed to the development of the method and the R package. AOS conceived and supervised the study. FH, MG, and AOS contributed to the discussion of the results and edited the manuscript. All authors read and approved the final manuscript.
Acknowledgement
We would like to thank Johanna-Sophie Schlüter for helpful discussions during the development of the method. Furthermore, we would like to thank David Omar Gonzalez Dieguez from the International Maize and Wheat Improvement Center (CIMMYT) for valuable data to test the method and the R package.
Electronic Supplementary Material
Below is the link to the electronic supplementary material
References
1.
Bini SA. Artificial intelligence, machine learning, deep learning, and cognitive computing: what do these terms mean and how will they impact health care? J Arthroplast. 2018;33(8):2358–61.
2.
Helm JM, Swiergosz AM, Haeberle HS, Karnuta JM, Schaffer JL, Krebs VE, Spitzer AI, Ramkumar P. Machine learning and artificial intelligence: definitions, applications, and future directions. Curr Rev Musculoskelet Med. 2020;13(1):69–76.
3.
Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349(6245):255–60.
4.
González-Camacho JM, Ornella L, Pérez-Rodríguez P, Gianola D, Dreisigacker S, Crossa J. Applications of machine learning methods to genomic selection in breeding wheat for rust resistance. Plant Genome. 2018;11(2).
5.
Li B, Zhang N, Wang YG, George AW, Reverter A, Li Y. Genomic prediction of breeding values using a subset of SNPs identified by three machine learning methods. Front Genet. 2018;9:237.
6.
van der Heide EMM, Veerkamp RF, van Pelt ML, Kamphuis C, Athanasiadis I, Ducro BJ. Comparing regression, naive Bayes, and random forest methods in the prediction of individual survival to second lactation in Holstein cattle. J Dairy Sci. 2019;102(10):9409–21.
7.
Montesinos-López OA, Montesinos-López A, Crossa J. Multivariate statistical machine learning methods for genomic prediction. Cham: Springer Nature; 2022.
8.
Crossa J, Pérez-Rodríguez P, Cuevas J, Montesinos-López O, Jarquín D, de los Campos G, et al. Genomic selection in plant breeding: methods, models, and perspectives. Trends Plant Sci. 2017;22(11):961–75.
9.
Jenkins S, Gibson N. High-throughput SNP genotyping. Comp Funct Genomics. 2002;3(1):57–66.
10.
Syvänen AC. Toward genome-wide SNP genotyping. Nat Genet. 2005;37(Suppl 6):S5–S10.
11.
Ganal MW, Polley A, Graner EM, Plieske J, Wieseke R, Luerssen H, Durstewitz G. Large SNP arrays for genotyping in crop plants. J Biosci. 2012;37(5):821–8.
12.
Heslot N, Jannink JL, Sorrells ME. Perspectives for genomic selection applications and research in plants. Crop Sci. 2015;55(1):1–12.
13.
Long N, Gianola D, Rosa GJ, Weigel KA, Avendano S. Machine learning classification procedure for selecting SNPs in genomic selection: application to early mortality in broilers. J Anim Breed Genet. 2007;124(6):377–89.
14.
Jannink JL, Lorenz AJ, Iwata H. Genomic selection in plant breeding: from theory to practice. Brief Funct Genomics. 2010;9(2):166–77.
15.
Chen X, Ishwaran H. Random forests for genomic data analysis. Genomics. 2012;99(6):323–9.
16.
Wright MN, Ziegler A, König IR. Do little interactions get lost in dark random forests? BMC Bioinformatics. 2016;17:1–10.
17.
Lange TM, Heinrich F, Kopisch-Obuch F, Keunecke H, Gültas M, Schmitt AO. Improving genomic prediction of rhizomania resistance in sugar beet (Beta vulgaris L.) by implementing epistatic effects and feature selection. F1000Research. 2023;12:280.
18.
Winfield MO, Allen AM, Burridge AJ, et al. High-density SNP genotyping array for hexaploid wheat and its secondary and tertiary gene pool. Plant Biotechnol J. 2016;14(5):1195–206.
19.
Allen AM, Winfield MO, Burridge AJ, et al. Characterization of a Wheat Breeders’ Array suitable for high-throughput SNP genotyping of global accessions of hexaploid bread wheat (Triticum aestivum). Plant Biotechnol J. 2017;15(3):390–401.
20.
Lange TM, Heinrich F, Enders M, Wolf M, Schmitt AO. In silico quality assessment of SNPs—A case study on the Axiom Wheat genotyping arrays. Curr Plant Biol. 2020;21.
21.
Genuer R, Poggi JM, Tuleau-Malot C. Variable selection using random forests. Pattern Recognit Lett. 2010;31(14):2225–36.
22.
Goldstein BA, Polley EC, Briggs FB. Random forests for genetic association studies. Stat Appl Genet Mol Biol. 2011;10(1).
23.
Grömping U. Variable importance in regression models. Wiley Interdisciplinary Reviews: WIREs Comput Stat. 2015;7(2):137–52.
24.
Lunetta KL, Hayward LB, Segal J, van Eerdewegh P. Screening large-scale association study data: exploiting interactions using random forests. BMC Genet. 2004;5:1–13.
25.
Schwarz DF, König IR, Ziegler A. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics. 2010;26(14):1752–8.
26.
Degenhardt F, Seifert S, Szymczak S. Evaluation of variable selection methods for random forests and omics data sets. Brief Bioinform. 2019;20(2):492–503.
27.
Klees S, Lange TM, Bertram H, Rajavel A, Schlüter JS, Lu K, Schmitt AO, Gültas M. In silico identification of the complex interplay between regulatory SNPs, transcription factors, and their related genes in Brassica napus L. using multi-omics data. Int J Mol Sci. 2021;22(2).
28.
Haleem A, Klees S, Schmitt AO, Gültas M. Deciphering pleiotropic signatures of regulatory SNPs in Zea mays L. using multi-omics data and machine learning algorithms. Int J Mol Sci. 2022;23(9).
29.
Kursa MB. Robustness of Random Forest-based gene selection methods. BMC Bioinformatics. 2014;15:1–8.
30.
Breiman L. Random forests. Mach Learn. 2001;45:5–32.
31.
Liaw A, Wiener M. Classification and regression by randomForest. R News. 2002;2(3):18–22.
32.
Probst P, Wright MN, Boulesteix AL. Hyperparameters and tuning strategies for random forest. Wiley Interdiscip Rev Data Min Knowl Discov. 2019;9(3).
33.
Bernard S, Heutte L, Adam S. Influence of hyperparameters on random forest accuracy. In: Multiple Classifier Systems: 8th International Workshop, MCS 2009, Reykjavik, Iceland, June 10–12, 2009; 2009.
34.
Scornet E. Tuning parameters in random forests. ESAIM: Proc Surv. 2017;60:144–62.
35.
Díaz-Uriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7:1–13.
36.
Lin Y, Jeon Y. Random forests and adaptive nearest neighbors. J Am Stat Assoc. 2006;101(474):578–90.
37.
Wright MN, Ziegler A. ranger: A Fast Implementation of Random Forests. J Stat Softw. 2017;77(1):1–17.
38.
Seligman M. Rborist: Extensible, Parallelizable Implementation of the Random Forest Algorithm; 2024. R package.
39.
Boehmke B, Greenwell BM. Hands-on machine learning with R. 1 ed. Chapman and Hall/CRC; 2019.
40.
Hastie T, Tibshirani R, Friedman JH, Friedman JH. The elements of statistical learning: data mining, inference, and prediction. Volume 2. NY: Springer New York; 2009.
41.
Biau G, Scornet E. A random forest guided tour. Test. 2016;25:197–227.
42.
Heredity. 2014;112(6):616–26.
43.
Sirsat MS, Oblessuc PR, Ramiro RS. Genomic prediction of wheat grain yield using machine learning. Agriculture. 2022;12(9).
44.
Oshiro TM, Perez PS, Baranauskas JA. How many trees in a random forest? In: Machine Learning and Data Mining in Pattern Recognition: 8th International Conference, MLDM 2012, Berlin, Germany, July 13–20, 2012; 2012.
45.
Kuhn M. Building Predictive Models in R. J Stat Softw. 2008;28(5):1–26.
46.
Bartko JJ. The intraclass correlation coefficient as a measure of reliability. Psychol Rep. 1966;19(1):3–11.
47.
McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1(1):30–46.
48.
Trevethan R. Intraclass correlation coefficients: clearing the air, extending some cautions, and making some requests. Health Serv Outcomes Res Method. 2017;17(2):127–43.
49.
Gamer M, Lemon J, Singh P, Fellows I. irr: Various Coefficients of Interrater Reliability and Agreement; 2019. R package version 0.84.1.
50.
Fleiss JL. Measuring nominal scale agreement among many raters. Psychol Bull. 1971;76(5):378–82.
51.
Konstantinidis M, Le LW, Gao X. An empirical comparative assessment of inter-rater agreement of binary outcomes and multiple raters. Symmetry. 2022;14(2).
52.
Zou G. Sample size formulas for estimating intraclass correlation coefficients with precision and assurance. Stat Med. 2012;31(29):3972–81.
53.
Jonsdottir G, Haraldsdottir E, Sigurdardottir V, Thoroddsen A, Vilhjalmsson R, Tryggvadottir GB, Jonsdottir H. Developing and testing inter-rater reliability of a data collection tool for patient health records on end‐of‐life care of neurological patients in an acute hospital ward. Nurs Open. 2023;10(8):5500–8.
54.
Pinheiro J, Bates D. Mixed-effects models in S and S-PLUS. 1 ed. NY: Springer New York; 2006.
55.
Ricketts JH, Head GA. A five-parameter logistic equation for investigating asymmetry of curvature in baroreflex studies. Am J Physiol Regul Integr Comp Physiol. 1999;277(2):R441–R454.
56.
Gottschalk PG, Dunn JR. The five-parameter logistic: a characterization and comparison with the four-parameter logistic. Anal Biochem. 2005;343(1):54–65.
57.
Lin D, Shkedy Z, Yekutieli D, Amaratunga D, Bijnens L. Modeling dose-response microarray data in early drug development experiments using R: order-restricted analysis of microarray data. 1 ed. Berlin Heidelberg: Springer; 2012.
58.
Vølund A. Application of the four-parameter logistic model to bioassay: comparison with slope ratio and parallel line models. Biometrics. 1978;34:357–65.
59.
J Biopharm Stat. 2005;15(2):205–23.
60.
Lange TM, Rotärmel M, Müller D, Mahone GS, Kopisch-Obuch F, Keunecke H, Schmitt AO. Non-linear transformation of enzyme-linked immunosorbent assay (ELISA) measurements allows usage of linear models for data analysis. Virol J. 2022;19(1).
61.
Elzhov TV, Mullen KM, Spiess A, Bolker B. minpack.lm: R Interface to the Levenberg-Marquardt Nonlinear Least-Squares Algorithm Found in MINPACK, Plus Support for Bounds; 2023. R package version 1.2-4.
62.
Front Plant Sci. 2020;11.
63.
Liu Z, Sun C, Yan Y, Li G, Wu G, Liu A, Yang N. Genome-wide association analysis of age-dependent egg weights in chickens. Front Genet. 2018;9.
64.
Crossa J, de los Campos G, Pérez P, Gianola D, et al. Prediction of genetic values of quantitative traits in plant breeding using pedigree and molecular markers. Genetics. 2010;186(2):713–24.
65.
Crop Sci. 2023;63(3):1300–15.
66.
Eckhoff W. Phenotypic data, Genotypic data, Rye, Secale cereale, Dry matter yield, Plant height, GCA, Hybrid Breeding, Inbred Line, 2023.
67.
Jiménez NP, Feldmann MJ, Famula RA, Pincot DD, Bjornson M, Cole GS, Knapp SJ. Harnessing underutilized gene bank diversity and genomic prediction of cross usefulness to enhance resistance to Phytophthora cactorum in strawberry. Plant Genome. 2023;16(1).
68.
Pincot DD, Hardigan MA, Cole GS, Famula RA, Henry PM, Gordon TR, Knapp SJ. Accuracy of genomic selection and long-term genetic gain for resistance to Verticillium wilt in strawberry. Plant Genome. 2020;13(3).
69.
G3: Genes, Genomes, Genetics. 2023;13(5).
70.
Lozada DN, Ward BP, Carter AH. Gains through selection for grain yield in a winter wheat breeding program. PLoS ONE. 2020;15(4).
71.
Guo X, Svane SF, Füchtbauer WS, Andersen JR, Jensen J, Thorup-Kristensen K. Genomic prediction of yield and root development in wheat under changing water availability. Plant Methods. 2020;16:1–15.
72.
Genome Biol. 2021;22(1).
73.
Kowarik A, Templ M. Imputation with the R Package VIM. J Stat Softw. 2016;74:1–16.
74.
Kerckhoffs DA, Hornstra G, Mensink RP. Cholesterol-lowering effect of β-glucan from oat bran in mildly hypercholesterolemic subjects may decrease when β-glucan is incorporated into bread and cookies. Am J Clin Nutr. 2003;78(2):221–7.
75.
β-glucan and amylopectin molecular structure. Carbohydr Polym. 2023;316.
76.
Miedaner T, Müller BU, Piepho H-P, Falke KC. Genetic architecture of plant height in winter rye introgression libraries. Plant Breeding. 2011;130(2):209–16.
77.
Fine-tuning the amylose content of rice by precise base editing of the Wx gene. Plant Biotechnol J. 2021;19(1):11–13.
78.
Lourakis MI. A brief description of the Levenberg-Marquardt algorithm implemented by levmar. Foundation Res Technol. 2005;4(1):1–6.