Recursive Random Forests Enable Better Predictive Performance and Model Interpretation than Variable Selection by LASSO

Journal of Chemical Information and Modeling 55(4): 736-746, 2015



Variable selection is of crucial significance in QSAR modeling since it increases the model's predictive ability and reduces noise. Selecting the right variables is far more complicated than developing the predictive models themselves. In this study, eight continuous and categorical data sets were employed to explore the applicability of two distinct variable selection methods: random forests (RF) and the least absolute shrinkage and selection operator (LASSO). Variable selection was performed (1) by using recursive random forests to rule out a quarter of the least important descriptors at each iteration and (2) by using LASSO modeling with 10-fold inner cross-validation to tune its penalty λ for each data set. Along with regular statistical parameters of model performance, we proposed the highest pairwise correlation rate, the average pairwise Pearson's correlation coefficient, and the Tanimoto coefficient to evaluate the optimal variables selected by RF and LASSO more comprehensively. Results showed that variable selection allowed a tremendous reduction of noisy descriptors (up to 96% with the RF method in this study) and clearly enhanced the models' predictive performance as well. Furthermore, random forests gathered important predictors without restricting their pairwise correlations, in contrast to LASSO. The mutual exclusion of highly correlated variables in LASSO modeling tends to skip important variables that are highly related to response endpoints and thus undermines the model's predictive performance. The optimal variables selected by RF shared low similarity with those selected by LASSO (e.g., the Tanimoto coefficients were smaller than 0.20 in seven out of eight data sets). We found that the differences between the predictive performances of RF and LASSO mainly resulted from the variables selected by the different strategies rather than from the learning algorithms themselves. Our study showed that the right selection of variables is more important for modeling than the choice of learning algorithm. We hope that a standard procedure can be developed, based on these proposed statistical metrics, to select the truly important variables for model interpretation as well as for further use in drug discovery and environmental toxicity assessment.
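The two selection strategies described in the abstract can be illustrated with a short sketch. Below is a minimal Python example using scikit-learn; the function names, synthetic regression data, forest size, folds, and stopping rule are illustrative assumptions, not the authors' original implementation or parameter settings. It shows (1) recursive random-forest elimination that discards the least important quarter of the descriptors at each iteration, (2) LASSO with 10-fold inner cross-validation to tune the penalty, and (3) the Tanimoto coefficient between the two selected variable subsets.

```python
# Minimal sketch (not the authors' code) of the two variable-selection
# strategies compared in the paper. All names and parameters here are
# illustrative assumptions.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_score


def recursive_rf_selection(X, y, names, drop_frac=0.25, min_vars=4):
    """Recursive random forests: at each iteration, fit an RF, rank the
    descriptors by importance, and discard the least important quarter."""
    keep = list(range(X.shape[1]))
    best_score, best_keep = -np.inf, keep[:]
    while len(keep) > min_vars:
        rf = RandomForestRegressor(n_estimators=500, random_state=0)
        score = cross_val_score(rf, X[:, keep], y, cv=5).mean()
        if score > best_score:
            best_score, best_keep = score, keep[:]
        rf.fit(X[:, keep], y)
        order = np.argsort(rf.feature_importances_)      # ascending importance
        n_drop = max(1, int(drop_frac * len(keep)))
        keep = [keep[i] for i in sorted(order[n_drop:])]  # drop least important 25%
    # evaluate the final, smallest subset as well
    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    if cross_val_score(rf, X[:, keep], y, cv=5).mean() > best_score:
        best_keep = keep[:]
    return [names[i] for i in best_keep]


def lasso_selection(X, y, names):
    """LASSO with 10-fold inner cross-validation to tune the penalty;
    descriptors with nonzero coefficients are the selected variables."""
    lasso = LassoCV(cv=10, random_state=0).fit(X, y)
    return [n for n, c in zip(names, lasso.coef_) if c != 0.0]


def tanimoto(set_a, set_b):
    """Tanimoto (Jaccard) coefficient between two selected-variable sets."""
    a, b = set(set_a), set(set_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


if __name__ == "__main__":
    X, y = make_regression(n_samples=200, n_features=50, n_informative=8,
                           noise=5.0, random_state=0)
    names = [f"x{i}" for i in range(X.shape[1])]
    rf_vars = recursive_rf_selection(X, y, names)
    lasso_vars = lasso_selection(X, y, names)
    print("RF-selected:", rf_vars)
    print("LASSO-selected:", lasso_vars)
    print("Tanimoto similarity:", round(tanimoto(rf_vars, lasso_vars), 2))
```

In this sketch the RF subset is taken from the iteration with the best cross-validated score, which is one plausible reading of "optimal variables"; the paper's exact selection criterion may differ.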

Accession: 058713553

PMID: 25746224

DOI: 10.1021/ci500715e



Related references

Boosting model performance and interpretation by entangling preprocessing selection and variable selection. Analytica Chimica Acta 938: 44-52, 2016

The lasso method for variable selection in the Cox model. Statistics in Medicine 16(4): 385-395, 1997

A Systematic Approach for Variable Selection With Random Forests: Achieving Stable Variable Importance Values. IEEE Geoscience and Remote Sensing Letters 14(11): 1988-1992, 2017

Recursive Random Lasso (RRLasso) for Identifying Anti-Cancer Drug Targets. PLoS One 10(11): e0141869, 2016

Evaluation of variable selection methods for random forests and omics data sets. Briefings in Bioinformatics, 2017

Variable selection using support vector regression and random forests: A comparative study. Intelligent Data Analysis 20(1): 83-104, 2016

A Weighted Random Forests Approach to Improve Predictive Performance. Statistical Analysis and Data Mining 6(6): 496-505, 2014

r2VIM: A new variable selection method for random forests in genome-wide association studies. BioData Mining 9: 7, 2016

Recursive variable selection to update near-infrared spectroscopy model for the determination of soil nitrogen and organic carbon. Geoderma 268: 92-99, 2016

Unbiased split variable selection for random survival forests using maximally selected rank statistics. Statistics in Medicine 36(8): 1272-1284, 2017

Use of wrapper algorithms coupled with a random forests classifier for variable selection in large-scale genomic association studies. Journal of Computational Biology 16(12): 1705-1718, 2010

Fast detection of fenthion on fruit and vegetable peel using dynamic surface-enhanced Raman spectroscopy and random forests with variable selection. Spectrochimica Acta. Part A, Molecular and Biomolecular Spectroscopy 200: 20-25, 2018

A predictive model of subcutaneous glucose concentration in type 1 diabetes based on Random Forests. Conference Proceedings 2012: 2889-2892, 2013

Variable selection in the cox regression model with covariates missing at random. Biometrics 66(1): 97-104, 2010