In this study, we described an ML-based strategy to identify the exposures that best predict self-perceived health in a 30-year cohort study. Our approach involves (1) pre-processing the repeated exposure measurements by constructing measures of the mean and time trend of the exposures, (2) applying RF to build and optimize the predictive model, using the AUC to quantify the corresponding predictive power, (3) ranking exposures according to their contribution to predictive performance, (4) selecting the exposures that together largely determine the overall predictive performance, and (5) using PDPs and ALE plots to characterize the nature of their relationship with the outcome.
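The five steps above can be sketched as follows. This is a minimal illustration on synthetic data, not the study's actual pipeline: the number of exposures, sample size, and signal structure are all made up, and standard scikit-learn tools stand in for the study's implementation.

```python
# Minimal sketch of the pipeline: (1) preprocessed exposure summaries,
# (2) RF + AUC, (3) ranking by contribution to predictive performance.
# Synthetic data only; all parameters are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 500
# Step 1 stand-in: pretend columns are per-subject means/trends of exposures.
X = rng.normal(size=(n, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n) > 0).astype(int)

# Step 2: fit a random forest and estimate predictive power via AUC.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()

# Step 3: rank exposures by their contribution to predictive performance
# (permutation importance, one of several possible importance measures).
rf.fit(X, y)
imp = permutation_importance(rf, X, y, scoring="roc_auc",
                             n_repeats=10, random_state=0)
ranking = np.argsort(imp.importances_mean)[::-1]

print(round(auc, 2), ranking[:3])
```

Steps (4) and (5) would then proceed by retaining the top-ranked exposures and inspecting their PDP/ALE curves.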
Our approach is based on several key principles. First, a non-parametric approach seems well suited to an exploratory study. From a statistician's viewpoint, the data are generated by a stochastic model y = f(x). In contrast to traditional regression approaches, ML approaches generally make very few assumptions about the functional form of f(x)5. (An exception would be the LASSO, for example4.) The goal of many exposome studies is to investigate associations between exposure and outcome, when often there is little to no a priori knowledge of how each exposure relates to the outcome, or of their relative importance. There is not necessarily a strong reason for such studies to assume any particular functional form, especially when the data are high-dimensional. Such assumptions might include the number of exposures to be included, the linearity of the relationships, and the absence of interaction effects. Assuming a wrong functional form can even lead to wrong conclusions in some cases6. For example, if a linear relationship between exposure and outcome is imposed on what is actually a parabolic relationship, the corresponding estimate of the regression parameter is not meaningful and may result in that exposure not being identified as a relevant predictor. In our application, we found that most exposures showed non-linear relationships, suggesting that the risk of erroneously imposing a linear relationship is not negligible.
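The parabolic-relationship pitfall is easy to demonstrate numerically. In this sketch (synthetic data, illustrative noise level), a misspecified linear fit to a purely parabolic exposure-outcome relationship yields a near-zero slope, which would wrongly suggest the exposure is irrelevant, while a model of the correct form recovers the relationship:

```python
# A linear fit on a parabolic relationship: the slope is ~0, so the
# exposure would look irrelevant. Synthetic data for illustration only.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=1000)                  # "exposure"
y = x**2 + rng.normal(scale=0.1, size=1000)        # truly parabolic outcome

slope = np.polyfit(x, y, 1)[0]   # misspecified linear model: slope ~ 0
curv = np.polyfit(x, y, 2)[0]    # quadratic term of a correctly specified model

print(round(slope, 2), round(curv, 2))
```

A flexible learner such as RF would likewise capture the parabola without the functional form being specified in advance.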
Second, it is difficult for any researcher to perform model and variable selection in practice, especially for high-dimensional data. There is also a risk of overfitting in our setting (96 exposures)4. Severe overfitting casts doubt not only on the prediction model, but also on the predictors indirectly derived from it during training. ML approaches automate model selection by finding a functional form that maximizes prediction accuracy, while using techniques (based on cross-validation and related methods) to assess out-of-sample error and reduce the risk of overfitting. In contrast, stepwise selection methods completely neglect out-of-sample error and are therefore prone to overfitting28, yet they remain among the most popular variable selection methods in epidemiology29. Moreover, these methods completely neglect multiple testing issues, which is a problem especially in high-dimensional settings30.
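The value of out-of-sample assessment can be illustrated with a small experiment (synthetic, noise-only data; the sample size and number of "exposures" are arbitrary): a forest can fit a training set almost perfectly even when the outcome is pure noise, while cross-validated AUC correctly stays near chance.

```python
# In-sample vs out-of-sample performance on pure-noise data: the training
# AUC is near 1 (overfitting), the cross-validated AUC is near 0.5 (chance).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))           # 50 noise "exposures"
y = rng.integers(0, 2, size=200)         # outcome unrelated to X

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
train_auc = roc_auc_score(y, rf.predict_proba(X)[:, 1])
cv_auc = cross_val_score(rf, X, y, cv=5, scoring="roc_auc").mean()

print(round(train_auc, 2), round(cv_auc, 2))
```

A stepwise procedure evaluated only in-sample would be misled in exactly the same way as the training AUC here.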
Third, in general, a combination of data preprocessing and post-hoc visualization techniques can be used to make ML models more interpretable in longitudinal exposome studies. Because individual exposure can change over time, the course of exposure can itself be predictive. Therefore, to facilitate interpretation, we constructed aggregations of the repeated exposure measurements, as previously recommended12. In our study, we represented the exposure histories by considering both the average exposure over time and the average trend in exposure, which describe the persistence or development of exposure. These representation measures can then be used in the ML model. After training the ML model, visualization techniques such as PDPs31 and ALE plots25 can help interpret it. These plots illustrate, for any given exposure, how the predicted outcome changes on average when the values of that exposure are modified while all other exposures remain fixed at their original values. Although simple regression coefficients cannot be constructed, such plots can still be used to obtain a similar interpretation in terms of the sign and magnitude of an effect.
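The preprocessing step can be sketched as follows. The measurement times and exposure values here are invented for illustration: each subject's repeated measurements are summarized by their mean and a linear time trend (the slope of an ordinary least-squares fit against time).

```python
# Summarise each subject's repeated exposure measurements by their mean
# and their linear time trend (slope). Toy data: two subjects, four visits.
import numpy as np

times = np.array([0.0, 5.0, 10.0, 15.0])          # follow-up years
exposure = np.array([[2.0, 2.5, 3.0, 3.5],        # subject 1: rising
                     [4.0, 4.0, 4.0, 4.0]])       # subject 2: constant

mean_exposure = exposure.mean(axis=1)             # average exposure over time
trend = np.array([np.polyfit(times, row, 1)[0]    # slope per subject
                  for row in exposure])

print(mean_exposure, trend)
```

The two resulting columns (mean and trend per exposure) are what enters the RF model in place of the raw repeated measurements.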
In the current study, all domains examined (demographics, lifestyle, environmental and biological stressors) were represented among the identified predictors of self-perceived health. This is in line with previous prediction and risk assessment studies of health outcomes such as self-perceived health, mortality, and disability-adjusted life years, which also identified exposures from different domains as important in predicting these outcomes32,33,34. Although biological factors were relatively overrepresented among the top predictors, this exposure domain did not outperform the other domains in its relative contribution to predicting self-perceived health (Table 3). Therefore, it cannot be concluded that self-perceived health is primarily predicted by any particular domain. Instead, it seems more appropriate in this context to use a wide range of exposures across domains (i.e., an exposome framework). The approach used in the current study is well suited to this, as it allows a direct comparison and ranking of the predictive performance of different types of predictors of self-perceived health.
Across all domains, the average number of hours worked over time was by far the most important predictor of self-perceived health in old age. In particular, having no working hours on average over time was predictive of a poorly perceived health status (Fig. 6). Accordingly, in previous studies on the predictive value of exposures from various domains for health outcomes, a history of unemployment was among the top five factors associated with the highest risk of ill health and mortality33,35.
This paper aims to provide other researchers with an example of, and guidance on, how ML can serve as a useful addition to the epidemiologist's toolbox. In this way, it can offer other researchers a template for using an ML algorithm to answer a public health research question. However, the proposed approach covers only the essentials and should therefore be seen as a starting point for epidemiologists. The limitations of our approach include the following. First, our approach was illustrated using RF, but many other algorithms exist. Because the focus of many epidemiologists and public health researchers is on the application itself and its relevance to public health policy, only one algorithm was included in this paper, and RF was considered a sensible choice for this purpose. However, other algorithms that could be considered include other tree-based methods (e.g.,36), support vector machines, and neural networks7,14. In addition, we used the AUC of the ROC curve to assess the discriminative ability of our model, but alternative performance measures are also available (e.g., the scaled Brier score)37.
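For completeness, both performance measures mentioned above can be computed side by side. This is a toy illustration with invented predicted risks; the scaled Brier score is computed here, following common practice, as one minus the ratio of the model's Brier score to that of a non-informative model that always predicts the outcome prevalence.

```python
# AUC and scaled Brier score on toy predictions. The outcome vector and
# predicted risks are invented purely to show the calculation.
import numpy as np
from sklearn.metrics import brier_score_loss, roc_auc_score

y = np.array([0, 0, 0, 1, 1, 1])
p = np.array([0.1, 0.3, 0.4, 0.6, 0.7, 0.9])   # hypothetical predicted risks

auc = roc_auc_score(y, p)
bs = brier_score_loss(y, p)
prev = y.mean()
# Brier score of always predicting the prevalence is prev * (1 - prev).
scaled_brier = 1 - bs / (prev * (1 - prev))

print(auc, round(scaled_brier, 2))
```

Unlike the AUC, the scaled Brier score rewards calibration as well as discrimination, which is why it is sometimes preferred as an overall measure.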
Second, alternative strategies may exist for selecting the most important variables. Our strategy is based on treating the number of exposures as a tuning parameter, using cross-validation and visual inspection to identify the exposures that contribute substantially to predictive performance. There is room for differences of interpretation here. Moreover, interpretation is complicated by the modest contribution of many exposome variables. Such exposures may in fact be relevant, but with a metric based on predictive performance they tend not to be easy to identify. It may therefore be worthwhile to consider alternative variable selection strategies38,39 or to use p-values of variable importance40,41. In addition, highly correlated exposures may be more difficult to interpret in variable importance rankings and may require other approaches to improve interpretation42.
Third, our approach does not account for the possibility of informative censoring and/or missing data in longitudinal studies. Early dropout of individuals from the study may be related to their characteristics, and some approaches have been developed to deal with this43,44.
Fourth, our approach did not account for class imbalance in the outcome. When the data set is severely imbalanced, i.e., one class of the outcome is heavily overrepresented compared to the other, the ML algorithm mainly focuses on predicting the majority class well, whereas the minority class is most likely the class of interest45. In our case study, the class imbalance was limited, but in cases of severe imbalance (e.g., when one class of the outcome contains, say, 1% and the other 99% of the cases), it may be worth using a balancing technique such as over-sampling or under-sampling45,46.
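Random over-sampling of the minority class, one of the balancing techniques mentioned above, can be sketched as follows (synthetic data with a rare positive class; the simulation parameters are illustrative, and alternatives such as class weights or dedicated resampling libraries also exist):

```python
# Simple random over-sampling of the minority class before fitting an RF.
# Synthetic data: the positive class is rare (~2% of cases).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (X[:, 0] > 2.0).astype(int)          # rare positive class

# Duplicate minority-class rows (with replacement) until classes balance.
minority = np.flatnonzero(y == 1)
extra = rng.choice(minority, size=(y == 0).sum() - minority.size, replace=True)
idx = np.concatenate([np.arange(len(y)), extra])
X_bal, y_bal = X[idx], y[idx]

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_bal, y_bal)
print(y_bal.mean())  # classes are now balanced (prevalence 0.5)
```

Resampling must be done inside each cross-validation fold (not before splitting), otherwise duplicated rows leak between training and validation sets.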
Finally, it is important to note that the proposed approach focuses on predicting a health outcome and does not aim to estimate causal effects. Although there has been less emphasis in the literature on the use of ML for causal inference, this is currently a rapidly growing area of research9. Some interesting new developments include, for example, causal forests and causal structure learning47,48.