# Regression Modeling Strategies: With Applicatio...

Frank E. Harrell, Jr. is Professor of Biostatistics and Chair, Department of Biostatistics, Vanderbilt University School of Medicine, Nashville. He has developed numerous methods for predictive modeling, quantifying predictive accuracy, and model validation, and has published numerous predictive models and articles on applied statistics, medical research, and clinical trials. He is on the editorial board of several biomedical and methodologic journals. He is a Fellow of the American Statistical Association (ASA) and a consultant to the U.S. Food and Drug Administration and to the pharmaceutical industry. He teaches a graduate course in regression modeling strategies and a course in biostatistics for medical researchers. In 2014 he was chosen to receive the WJ Dixon Award for Excellence in Statistical Consulting by the ASA.

"This is the latest volume in the generally excellent Springer Series in Statistics, and it has to be one of the best. Professor Harrell has produced a book that offers many new and imaginative insights into multiple regression, logistic regression and survival analysis, topics that form the core of much of the statistical analysis carried out in a variety of disciplines, particularly in medicine. ... Regression Modelling Strategies is a book that many statisticians will enjoy and learn from. The problems given at the end of each chapter may also make it suitable for some postgraduate courses, particularly those for medical students in which S-PLUS is a major component. Working through the case studies in the book will demonstrate what can be achieved with a little imagination, when modelling complex and challenging data sets. So here we have a truly excellent, informative and attractive text that is highly recommended."

"Over the past 7 years, I have probably read this book, or its previous version, a half-dozen times, and I refer to it routinely. If my work bookshelf held only one book, it would be this one. The book covers, very completely, the nuances of regression modeling with particular emphasis on binary and ordinal logistic regression and parametric and nonparametric survival analysis...Harrell very nicely walks the reader through numerous analyses, explaining and defining his model-building choices at each step in the process. It is refreshing to have an author present choices and actually defend an approach in this manner."

Regression models such as the Cox proportional hazards model have had increasing use in modelling and estimating the prognosis of patients with a variety of diseases. Many applications involve a large number of variables to be modelled using a relatively small patient sample. Problems of overfitting and of identifying important covariates are exacerbated in analysing prognosis because the accuracy of a model is more a function of the number of events than of the sample size. We used a general index of predictive discrimination to measure the ability of a model developed on training samples of varying sizes to predict survival in an independent test sample of patients suspected of having coronary artery disease. We compared three methods of model fitting: (1) standard 'step-up' variable selection, (2) incomplete principal components regression, and (3) Cox model regression after developing clinical indices from variable clusters. We found regression using principal components to offer superior predictions in the test sample, whereas regression using indices offers easily interpretable models nearly as good as the principal components models. Standard variable selection has a number of deficiencies.
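The second strategy above, incomplete principal components regression, can be illustrated with a short sketch. This is not the paper's actual analysis (which used a Cox model on coronary artery disease data); it is a minimal stand-in using simulated data, a logistic outcome for simplicity, and scikit-learn, with all variable names hypothetical. The key idea is that the outcome is regressed on only the first few principal components of the predictors rather than on the predictors themselves, which caps the effective number of fitted parameters.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated stand-in for a clinical data set: many correlated
# predictors, relatively few events (binary outcome).
n, p = 200, 20
X = rng.normal(size=(n, p)) + rng.normal(size=(n, 1))  # shared factor -> correlation
beta = np.zeros(p)
beta[:3] = 0.8
y = (X @ beta + rng.normal(size=n) > 0).astype(int)

# Incomplete principal components regression: keep only the
# first k components, then regress the outcome on those scores.
k = 5
pca = PCA(n_components=k).fit(X)
Z = pca.transform(X)                    # n x k component scores
model = LogisticRegression().fit(Z, y)  # outcome ~ components only

print(model.coef_.shape)  # (1, 5): one coefficient per retained component
```

Because only k coefficients are estimated regardless of p, this strategy is far less prone to overfitting in small-event samples, at the cost of components that are harder to interpret than clinical indices.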

The performance of prediction modelling strategies is a data-dependent process and can be highly variable between data sets within the same clinical domain. A priori strategy comparison can be used to determine an optimal logistic regression modelling strategy for a given data set before selecting a final modelling approach.

We present an extended framework for comparing strategies in linear and logistic regression model building. A wrapper approach is utilized [18], in which repeated bootstrap resampling of a given data set is used to estimate the relative predictive performance of different modelling strategies. Attention is centred on a single aspect of the model building process, namely, shrinkage-based model adjustment, to illustrate the concept of a priori strategy comparison. We demonstrate applications of the framework in four examples of empirical clinical data, all within the setting of deep vein thrombosis (DVT) diagnostic prediction research. Following from this, simulations highlighting the data-dependent nature of strategy performance are presented. Finally, the outlined comparison framework is applied in a case study, and the impact of a priori strategy selection is investigated.

When comparisons were extended to additional DVT prediction data sets, a large degree of heterogeneity was observed in the victory rates for each strategy across the different sets. The results of these comparisons are summarized in Table 2. The victory rates of the heuristic strategy showed the greatest variation between data sets, ranging from 3.9 to 63.8 %. This is reflected by the broad range in values of the estimated shrinkage factor, with poorest performance coinciding with severe shrinkage of the regression coefficients. Firth regression showed the greatest consistency between data sets, with victory rates ranging from 65.8 to 73.8 %, and good performance in the Oudega and Toll data sets, but relatively poor performance compared to the split-sample, cross-validation and bootstrap strategies in the Deepvein data set.

There are numerous approaches for developing a clinical prediction model, and in many cases no approach is universally superior. We demonstrate here that the performance of regression modelling strategies is data set-specific and influenced by a combination of different data characteristics. We outline a means of comparing modelling strategies in a data set before deciding on a final approach. A concept that was previously outlined for linear regression has now been extended to logistic regression, using the model likelihood as a means of comparing the performance of two strategies. The resulting distribution of comparisons can then provide researchers with evidence on which to base their decisions for model building. Three summary measures, the victory rate, the distribution median, and the distribution interquartile range, can be used to guide researchers in their analytical decision making.
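The three summary measures follow directly from the distribution of paired performance differences between two strategies. A minimal sketch, here using simulated differences rather than real comparison output (the location and spread are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical paired log-likelihood differences (strategy A minus
# strategy B) collected over repeated bootstrap comparisons.
diffs = rng.normal(loc=0.4, scale=1.0, size=500)

# The three summary measures described above:
victory_rate = float(np.mean(diffs > 0))  # how often A beats B
median = float(np.median(diffs))          # typical size of the gain
q1, q3 = np.percentile(diffs, [25, 75])
iqr = float(q3 - q1)                      # stability of the comparison

print(victory_rate, round(median, 2), round(iqr, 2))
```

A victory rate near 0.5 with a wide interquartile range suggests the strategies are practically indistinguishable on that data set, whereas a high victory rate with a positive median gives concrete grounds for preferring one strategy.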

In some settings, particularly the Oudega subset and Toll data, we observed problems with model convergence in logistic regression due to separation [39]. This problem was most apparent in data with only dichotomous variables in the models, and few EPV. The drop in victory rates for sampling-based strategies, from 66.8 to 61.9 % for sample splitting, 48.0 to 38.3 % for 10-fold cross-validation, and 66.4 to 56.4 % for bootstrapping, could in part be explained by this phenomenon. We found that some strategies may exacerbate problems with separation, and that low victory rates with extremely skewed comparison distributions may indicate the occurrence of separation. In such a case, researchers may wish to consider alternative strategies.
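Separation is easy to reproduce and to flag heuristically. In the sketch below (a hypothetical illustration, not the paper's diagnostic), a single dichotomous predictor determines the outcome perfectly, so the maximum likelihood coefficient diverges; a near-unpenalized fit therefore returns an implausibly large coefficient and fitted probabilities piled up at 0 and 1, either of which can serve as a crude separation flag.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Perfectly separated toy data: the dichotomous predictor
# determines the outcome exactly.
x = np.array([0, 0, 0, 0, 1, 1, 1, 1]).reshape(-1, 1)
y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Near-unpenalized fit: under separation the ML estimate is
# infinite, so the optimizer stops at an extreme coefficient.
m = LogisticRegression(C=1e8, max_iter=10_000).fit(x, y)
coef = float(m.coef_[0, 0])
pr = m.predict_proba(x)[:, 1]

# Crude separation flag: huge coefficient, or fitted
# probabilities within epsilon of 0 or 1.
separated = abs(coef) > 5 or bool(np.any((pr < 1e-6) | (pr > 1 - 1e-6)))
print(separated)
```

Penalized approaches such as Firth regression keep the coefficients finite in exactly this situation, which is consistent with its more stable victory rates reported above.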

Several authors have previously noted that regression methods may perform quite differently according to certain data parameters [7, 40], and it has been recognized that data structure as a whole should be considered during model building [41]. Our simulations in linear regression confirm the findings of others in a tightly controlled setting, and similar trends are seen upon extending these simulations to empirically derived settings for logistic regression. Through assessing the influence of EPV on strategy performance in two different data sets, we find that while trends are present, they may differ between data sets. In combination with the findings from comparisons between strategies in four clinical data sets, this supports the idea that strategy performance is data-dependent. This may have implications for the generalizability of currently existing recommendations for several stages of the model building process that were originally based on a small number of clinical examples.

It must be noted that there are limitations within the current framework. Our study only focuses on comparisons within the domain of shrinkage, which is only one stage of the prediction modelling process. It may be that our approach is not suitable for certain aspects of model building that we have not explored. For example, strategies that yield models that use varying numbers of degrees of freedom should not be compared directly by their model likelihoods. Furthermore, we currently only provide a framework for linear and logistic regression problems, and while this is most useful for diagnostic settings, a natural extension would be to enable the comparison of survival models, such as Cox proportional hazards models, as these are the most commonly used methods in prognostic prediction modelling [42].

Furthermore, the interpretation of the results of comparisons warrants some caution when using logistic regression in sparse data settings. We encountered many difficulties with separation in logistic regression, especially when resampling or sample-splitting methods were used in the model building process. When separation occurs, the models may exhibit problems with convergence, and this complicates the interpretation of victory rates and other summary measures. While there is no straightforward solution to this problem, we argue that there may be some value in observing the frequency and severity of separation that occurs during strategy comparison.