At the heart of much medical research is a seemingly simple question: is factor X (e.g., smoking) causally related to factor Y (e.g., cancer)? To infer causality, the standard experimental design is the “randomized intervention”. Specifically, individuals are drawn from the population (say, from urban areas in the US) and randomly assigned to one of two groups, one of which is “exposed” to an intervention (such as a prescription drug). Then, both groups are followed over time to assess whether the individuals who had the intervention are at higher or lower risk for a particular health or disease outcome. In this elegant design, the experimenter has full control of the intervention through randomization. In other words, by randomly choosing who does and who doesn’t get the intervention, the experimenter can be reasonably confident that any difference in the health outcome between the groups is caused by the intervention, and not by some other factor.
A big problem exists with this design, however. A randomized experiment is only ethical when X is not known a priori to be harmful. The example above (smoking and cancer) is untenable in human experiments, because researchers cannot ethically assign people at random to smoke.
What can one do instead? One solution is to use “observational data”, or non-randomized cohort datasets, that follow individuals through time. Simply put, individuals are selected from the population and their risk factors (e.g., smoking, diet, and body weight) are observed over time. These observations are then correlated with the presence (or absence) of future disease. The problem with observation versus randomized intervention is that it is difficult, and often impossible, to separate the effect of the exposure from that of other confounding factors. In short, smokers are also likely to have other risk factors (such as alcohol consumption) whose effects are difficult to tease apart from those of smoking.
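To make the contrast between observation and randomization concrete, here is a minimal simulation sketch. Everything in it is an illustrative assumption rather than real data or a method described in this post: the names `pearson` and `simulate` are hypothetical, the confounder stands in for something like alcohol consumption, and the exposure is deliberately given no causal effect on the outcome.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate(n_people, randomized, seed=0):
    """Correlation between exposure X and outcome Y in a toy cohort.

    Assumption: Y depends only on a hidden confounder, never on X.
    """
    rng = random.Random(seed)
    exposure, outcome = [], []
    for _ in range(n_people):
        confounder = rng.gauss(0, 1)  # hypothetical, e.g. alcohol consumption
        if randomized:
            # Experimenter flips a coin: exposure ignores the confounder.
            x = 1 if rng.random() < 0.5 else 0
        else:
            # Observational: people with a high confounder are more often exposed.
            x = 1 if confounder + rng.gauss(0, 1) > 0 else 0
        y = confounder + rng.gauss(0, 1)  # outcome driven by the confounder only
        exposure.append(x)
        outcome.append(y)
    return pearson(exposure, outcome)


observational_r = simulate(20_000, randomized=False)
randomized_r = simulate(20_000, randomized=True)
print(f"observational correlation: {observational_r:.3f}")
print(f"randomized correlation:    {randomized_r:.3f}")
```

Under these assumptions, the observational cohort shows a clearly positive correlation between exposure and outcome even though the exposure does nothing, while the randomized cohort shows a correlation near zero.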
Famous examples of observational studies include the Nurses’ Health Study and the Framingham Heart Study, whose objectives were to relate modifiable factors to future disease risk (such as cancer and heart disease).
While traditional research cohort studies are important, most such studies have been constrained to examining a few relationships at a time.
Another rapidly growing source of large-scale cohort data is real-world data, such as those extracted from health insurance claims and electronic health records. These data are known as “administrative”, as they are used primarily for financial accounting rather than for addressing medical research questions.
However, these data can be coupled with “biobanks”, or blood banks that store biological samples for measurement (e.g., the UK Biobank), and with residential addresses to impute both genetic and environmental data (see our recent paper in Nature Genetics for an example or our writeup in the Harvard Health Blog). Together, these resources are a new substrate for potential discovery, containing thousands to millions of individuals and billions of observations, such as who gets what drug and diagnosis. Because the primary use-case for real-world data is administrative, real-world data may mean real-world access to individuals – specifically, people in these real-world data may be a phone call away from being recruited to a clinical study.
Challenges solved? No.
(1) Confounding. Real-world data are observational. Because individuals in cohorts built from real-world data are neither randomized nor given an intervention, testing the question above (is X associated with Y?) is fraught with challenges. X could certainly be associated with Y, but a whole host of other factors are associated with both X and Y – these are known as “confounding” factors. In short, confounders make it hard to conclude that an association between X and Y is causal.
(2) Huge hypothesis space. There are tons of possible Xs and Ys that one can test in these real-world data. The datasets are huge in both the number of people and the number of factors (X and Y). So, which ones does an analyst test? Their favorite X? Their favorite Y? When analysts cherry-pick single Xs and Ys out of so many candidates, the chances are high that the statistical correlations they report are spurious (see Figure 1).
(3) Commoditization of analytical approaches and flexibility of analyses. There are tons of possible analytic tools in the armamentarium of an analyst. Machine learning, regression, artificial intelligence, you have probably heard it all. Many of the software packages to do these analyses are freely available. Which one should you use? How do you train the model? How do you decide you've trained your model long enough? The answers are often ad hoc and left to the analyst. And whenever we choose arbitrarily, we are at risk of producing findings that depend on the choice rather than on the data.
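The hypothesis-space problem in (2) can be sketched with a small simulation. This is only an illustration under stated assumptions: the function names `pearson` and `count_hits` are hypothetical, every candidate factor is pure noise with no relationship to the outcome, significance is judged with a large-sample normal approximation for the correlation coefficient, and the Bonferroni correction is shown as one common remedy, not as the approach used by any particular group.

```python
import random
from statistics import NormalDist


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_hits(n_people=300, n_factors=500, alpha=0.05, seed=0):
    """Test many noise factors against one noise outcome.

    Returns (naive_hits, bonferroni_hits): how many factors look
    "significant" at level alpha without and with Bonferroni correction.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]

    # Two-sided cutoffs on z = |r| * sqrt(n), a large-sample approximation.
    z_naive = NormalDist().inv_cdf(1 - alpha / 2)
    z_bonferroni = NormalDist().inv_cdf(1 - alpha / (2 * n_factors))

    naive_hits = 0
    bonferroni_hits = 0
    for _ in range(n_factors):
        factor = [rng.gauss(0, 1) for _ in range(n_people)]  # unrelated to outcome
        z = abs(pearson(factor, outcome)) * n_people ** 0.5
        if z > z_naive:
            naive_hits += 1
        if z > z_bonferroni:
            bonferroni_hits += 1
    return naive_hits, bonferroni_hits


naive, corrected = count_hits()
print(f"'significant' noise factors, uncorrected: {naive}")
print(f"'significant' noise factors, Bonferroni:  {corrected}")
```

With no true associations at all, roughly 5% of the uncorrected tests are still expected to pass the 0.05 threshold, so an analyst who reports only a favorite X from such a search can easily report noise. Correcting for the number of tests performed removes most of these false hits, at the cost of reduced power to detect real ones.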
A spectacular summary of the challenges of real-world science comes from scientists Jonathan Schoenfeld and John Ioannidis. In their study, they examined the consistency of associations between dietary factors (e.g., wine) and cancer (e.g., all cancer, breast cancer, etc.). Specifically, they found that for many dietary factors, the studies linking them to health outcomes reported contradictory findings, indicating both increased risk AND decreased risk (e.g., wine associated with both increased and decreased risk for breast cancer).
We leave it to the one and only John Oliver to describe their findings:
If some studies indicate increased risk and others decreased risk for a factor, what should we believe?
We at XY.ai are working on approaches to map health and disease while mitigating these challenges of real-world data. We will dive deeper in future posts!