Introduction to the Exposome Data Warehouse

August 22, 2019

How do we scrape and use terabytes of exposure data to predict health? At XY, we have developed a suite of data science tools and application programming interfaces (APIs) that integrate large datasets and enable disease prediction at large scale, a capability that is instrumental for precision public health.

What is the exposome?

The exposome is the collection of elements we are exposed to, such as air pollution, climate, and even where we live, all of which play a large role in our health. We have scraped and compiled terabytes of diverse geographical environmental exposure data, spanning from 1901 to the present day, from sources including but not limited to:

  1. Environmental Protection Agency
  2. National Oceanic and Atmospheric Administration
  3. United States Census
  4. United States Centers for Disease Control and Prevention

These data vary over space and time. Specifically, they vary in the type of measure (e.g., air pollution versus climate), the frequency of measure (e.g., daily, monthly, yearly), and the spatial resolution (e.g., zip code versus county versus tract). Integrating over space, time, and populations is the Exposome Data Warehouse’s strongest feature and a prerequisite for precision health and medicine.
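As a rough illustration of what reconciling measurement frequencies involves, the sketch below rolls daily readings up to monthly means so that they can sit alongside a monthly dataset. It is a simplified Python stand-in, not the warehouse's own code; the function name and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def monthly_means(daily_readings):
    """Average (date, value) pairs into {(year, month): mean value}."""
    buckets = defaultdict(list)
    for day, value in daily_readings:
        buckets[(day.year, day.month)].append(value)
    return {month: mean(values) for month, values in buckets.items()}


# Hypothetical daily readings for a single sensor.
daily = [
    (date(2015, 6, 1), 40.0),
    (date(2015, 6, 2), 50.0),
    (date(2015, 7, 15), 60.0),
]

monthly = monthly_means(daily)
print(monthly)
```

The same idea applies in the spatial dimension: finer units (tracts) can be aggregated up to coarser ones (counties) before datasets are compared.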

Here, we will describe how we built these data resources for research and clinical use cases.

Design of the Exposome Data Warehouse

The Exposome Data Warehouse is stored on the fully managed Google Cloud SQL platform, allowing us to easily scale our analytical queries. We store all of our raw data and shapefiles in a PostGIS database. We use PostGIS spatial indexing together with PostgreSQL JSON data types to store and quickly query many complicated datasets using a single data model. Because a single observation in a dataset may have multiple key-value pairs (e.g., temperature, dew point, and rain totals for a particular NOAA sensor on a particular day), we store all of these values as a JSON object in our ‘data’ column.
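To make the single data model more concrete, here is a minimal Python sketch of one observation row. The field names data_id, shape_id, startdate, and enddate are borrowed from the sensor query later in this post, and the nested measure -> 'data' -> 'value' layout mirrors the JSON paths used there. The measure keys and all values are hypothetical.

```python
import json
from dataclasses import dataclass
from datetime import date


@dataclass
class Observation:
    data_id: str      # identifies the dataset or measure family
    shape_id: int     # links the row to a shapefile geometry
    startdate: date
    enddate: date
    data: str         # JSON text holding every key-value pair for this row


def get_value(obs, measure):
    """Return the stored value for one measure, or None if it is absent."""
    payload = json.loads(obs.data)
    entry = payload.get(measure)
    return None if entry is None else entry["data"]["value"]


# Hypothetical NOAA observation: one sensor, one day, three measures.
noaa_obs = Observation(
    data_id="noaa_daily",  # hypothetical identifier
    shape_id=101,          # hypothetical identifier
    startdate=date(2015, 7, 15),
    enddate=date(2015, 7, 15),
    data=json.dumps({
        "temperature": {"data": {"value": 21.5}},
        "dew_point": {"data": {"value": 15.0}},
        "rain_total": {"data": {"value": 0.2}},
    }),
)

print(get_value(noaa_obs, "temperature"))
```

Keeping the measures inside one JSON field means a dataset with three measures and a dataset with thirty can share the same table definition.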

This model structure gives us the flexibility to handle disparate datasets by combining the relative advantages of relational and NoSQL-style databases. By linking the raw data to shapefiles, we can perform complex geospatial joins that aggregate multiple datasets on the basis of location. We use the PostGIS spatial indexing capabilities along with temporal indexing to quickly retrieve the relevant data within our >10 TB database.
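The sketch below illustrates the idea of combining datasets by location and time. It is a deliberately simplified stand-in for a geospatial join: it assumes every record has already been resolved to a shared shape identifier, whereas the warehouse performs this matching on geometries inside PostGIS. The function name, dataset names, and values are hypothetical.

```python
from collections import defaultdict


def join_by_shape_and_date(*datasets):
    """Merge (name, records) datasets on a shared (shape_id, day) key.

    Each record is a (shape_id, day, value) tuple, with day as an ISO string.
    """
    merged = defaultdict(dict)
    for name, records in datasets:
        for shape_id, day, value in records:
            merged[(shape_id, day)][name] = value
    return dict(merged)


# Hypothetical records from two sources that share one shape and one day.
ozone = ("ozone_aqi", [(2580363, "2015-07-15", 42)])
weather = ("temperature_c", [(2580363, "2015-07-15", 29.0),
                             (2580363, "2015-07-16", 30.5)])

joined = join_by_shape_and_date(ozone, weather)
print(joined[(2580363, "2015-07-15")])
```

Keying on both location and date is what lets a single lookup return every exposure recorded for a place on a given day.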

The Exposome Data Warehouse also contains the entire Census TIGER database, which allows us to geocode any US address directly from our database rather than sending private information to a third-party provider such as the Google Maps API. This is particularly important when dealing with sensitive patient data. Check out Figure 1.

Figure 1. Example of Exposome Data Warehouse integration across air pollution (as air quality index, left panel, top row), average monthly temperature (left panel, middle row), and “poverty level” (left panel, bottom row) for all regions shown in the map in the right panel. The right panel shows a choropleth map of poverty level for Boston, MA.

Finding your nearest sensor

The spatial indexing capabilities allow us to easily find the k nearest EPA or NOAA sensors to a particular zip code or address. The queries below show how the Exposome Data Warehouse returns the ozone levels (data_id = '4977') on July 15, 2015 for the 5 EPA sensors closest to zip code 10027 (shape_id = 2580363).

For example, first geocode an address with the PostGIS TIGER geocoder:

SELECT (addy).address As stno, (addy).streetname As street,
(addy).streettypeabbrev As styp, (addy).location As city, (addy).stateabbrev As st,(addy).zip,
ST_X(g.geomout) AS longitude, ST_Y(g.geomout) AS latitude 
FROM geocode('2950 Broadway New York, NY 10027') as g;

Now find the k nearest sensors (here, k = 5):

-- Reconstructed sketch (assumed: LATERAL join, a.data JSON column, c.geographywkt); <-> orders by nearest neighbor.
SELECT d.geoid AS zipcode, table_EPA.*,
       ST_Distance(table_EPA.geographywkt, d.geographywkt) / 1000.0 AS distance_km
FROM exposome_pici.shapefile AS d
CROSS JOIN LATERAL (
    SELECT a.startdate, a.enddate, c.geographywkt,
           a.data -> 'AQI' -> 'data' ->> 'value' AS AQI,
           a.data -> '1st_Max_Hour' -> 'data' ->> 'value' AS first_Max_Hour,
           a.data -> '1st_Max_Value' -> 'data' ->> 'value' AS first_Max_Value,
           a.data -> 'Arithmetic_Mean' -> 'data' ->> 'value' AS Arithmetic_Mean,
           a.data -> 'Observation_Count' -> 'data' ->> 'value' AS Observation_Count,
           a.data -> 'Observation_Percent' -> 'data' ->> 'value' AS Observation_Percent
    FROM exposome_pici.facttable a
    INNER JOIN exposome_pici.shapefile c ON (a.shape_id = c.shape_id)
    WHERE a.data_id = '4977' AND date(a.startdate) = date('2015-07-15')
    ORDER BY d.geometrywkt <-> c.geometrywkt LIMIT 5
) AS table_EPA
WHERE d.shape_id = 2580363;
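For readers without a PostGIS instance at hand, the k-nearest lookup can be approximated in plain Python. The sketch below uses the haversine great-circle distance and a brute-force scan. This is a different technique from the index-backed `<->` ordering in the query above, and the sensor names and coordinates are hypothetical.

```python
import heapq
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def k_nearest(origin, sensors, k=5):
    """Return the k sensors closest to origin, nearest first.

    origin is (lat, lon); each sensor is (sensor_id, lat, lon).
    """
    lat, lon = origin
    return heapq.nsmallest(
        k, sensors, key=lambda s: haversine_km(lat, lon, s[1], s[2])
    )


# Hypothetical sensor locations; the origin is a point in upper Manhattan.
sensors = [
    ("A", 40.82, -73.95),
    ("B", 40.70, -74.00),
    ("C", 41.50, -73.00),
    ("D", 34.05, -118.24),
]
nearest = k_nearest((40.81, -73.96), sensors, k=2)
print([sensor_id for sensor_id, _, _ in nearest])
```

A brute-force scan like this costs one distance computation per sensor, which is why the warehouse relies on a spatial index once the sensor table grows large.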

How Can These Data Be Used in Biomedical Research for Precision Medicine?

An outstanding question in precision medicine and precision public health is how much genetics and the exposome each contribute to disease risk. In a novel analysis reported in Nature Genetics, we linked the Exposome Data Warehouse to participants in a large insurance claims dataset in order to conduct one of the largest twin and sibling studies in the United States. We aimed to systematically quantify the relative contributions of genetics and environmental exposures in 560 diseases. With the Exposome Data Warehouse, we mapped each insurance claimant to their geographical socioeconomic status, pollution, and climate exposure (based on their home zip code). We then developed methods to quantify the contribution of these factors along with genetics. We found that most diseases have contributions from both genetics and environment; however, the role of each varied by disease. For example, diseases such as intellectual disability had a large contribution from genetics; others, such as obesity, were influenced by both environment and genetics. We also found that socioeconomic status influenced morbid obesity, and that climate made a modest contribution to lead poisoning and influenza.

Check out the recent news on our initial findings at The Washington Post, CNN, The Verge, and the Harvard Gazette!