Healthy Biome Score

February 24, 2021

Executive Summary

The environment has been linked to a variety of chronic diseases, ranging from asthma to obesity and heart disease [1,2]. The environment comprises everything we are exposed to within and outside our homes. It could be the pollution in the air we breathe, the cars that pass by us on our walk to work, the trees in the park we frequent, and even the density of houses, restaurants, and buildings we pass by. Given the amount of time we spend interacting with the environment, it stands to reason that there should exist some interpretable and actionable metric quantifying how the environment impacts us.

XY's Healthy Biome Score uses unsupervised machine learning to condense features such as air quality, the amount of green space, the natural disaster risk index, and building density into a single, static measure at the census tract (CT) level.

We additionally incorporate a time-varying, dynamic component using hourly air quality measures to more accurately reflect the current state of the CT. In addition to informing the individual about relevant environmental trends in their area and surrounding areas they may frequent, our Healthy Biome Score could help inform public health departments around the U.S. about areas that need more attention.

In the series of posts to come, we provide a framework for understanding your community’s health from a multitude of vantage points. In this blog post, we specifically focus on measuring a community’s environmental health.

Full Methodology

Calculating the Healthy Biome Score


Features (all at the CT-level):

  • % Green Cover: The percent area of green space (trees, grass, etc.) in the CT. This is directly relevant to CT environmental health, as photosynthetic organisms are key in carbon sequestration.
  • % Water Cover: The percent area of water (ponds, lakes, etc.) in the CT. This is directly relevant to the health of wildlife and sustaining the natural environment.
  • Air Quality Index (pollution): The concentration in the air of PM2.5, fine particulate matter with a diameter of 2.5 micrometers or less. A large body of research has linked PM2.5 exposure with chronic disease.
  • Ozone Concentration: The concentration of ozone in the air. Like PM2.5, ozone has been linked to various chronic diseases.
  • % Building Cover: The percent area covered by buildings in the CT. Building cover directly reflects the amount of development in an area, and many studies link development with environmental health.
  • Natural Disaster Risk Index: Measures the environmental safety of a CT, developed by:
  • Flood Risk: Measures the flood safety of a CT. We believe that safety indices should be incorporated into any metric on environmental health.

All of these environmental health indicators are also available through the N3 API.

Data Pre-Processing

We start by investigating the distributions of each feature and determining whether we need to perform any transformations on these features, such as a log transformation to reduce skew and/or scale.

Above, we see that the original distribution for building density was heavily right-skewed. The log transformation takes care of this issue, as the resulting distribution on the right looks much less skewed and closer to normal.
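The effect of such a transformation can be sketched on synthetic data. The snippet below is a minimal illustration, not our production pipeline: the log-normal sample is a made-up stand-in for building density, and the choice of log1p (rather than a plain log) is an assumption made so that tracts with zero building cover remain defined.

```python
import numpy as np

def skewness(x):
    """Sample skewness: the mean of the cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Synthetic, right-skewed stand-in for a feature such as building density.
rng = np.random.default_rng(0)
building_density = rng.lognormal(mean=0.0, sigma=1.0, size=5000)

# log1p(x) = log(1 + x); assumed here so that zero values stay finite.
log_density = np.log1p(building_density)

skew_before = skewness(building_density)
skew_after = skewness(log_density)
print(f"skew before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A skewness close to zero indicates a roughly symmetric distribution, so a drop in this statistic is one quick way to check that the transformation helped.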

Our final metric, the static Healthy Biome Score, should increase as the environment gets healthier. Each individual indicator likewise has a direction: we call an indicator positive if it increases as the environment gets healthier, and negative otherwise. We determine the direction of each indicator a priori and flip the sign of every negative indicator so that it aligns with the static Healthy Biome Score. For example, increases in PM2.5 concentration are a priori negative, so we flip the sign of the PM2.5 feature as a pre-processing step. Green space, on the other hand, is a priori positive: increases in green space are already aligned with a healthier environment, so we leave it unchanged. This step results in a set of only positive features, where an increase in any feature is aligned with an increase in the Healthy Biome Score.
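The sign-flipping step can be sketched as follows. This is an illustrative sketch only: the ORIENTATION table, the align_features helper, and the toy values are hypothetical names and numbers, and only the two directions stated above (PM2.5 negative, green cover positive) are encoded.

```python
import numpy as np

# Hypothetical orientation table: +1 if larger values mean a healthier
# environment, -1 otherwise. Only the two directions named in the text appear.
ORIENTATION = {"mean_green_cover": +1, "pm25": -1}

def align_features(features, orientation):
    """Multiply each feature by its orientation so that 'up' always means healthier."""
    return {
        name: orientation[name] * np.asarray(values, dtype=float)
        for name, values in features.items()
    }

# Toy values for three census tracts (made up for illustration).
raw = {"mean_green_cover": [10.0, 40.0, 25.0], "pm25": [12.0, 5.0, 8.0]}
aligned = align_features(raw, ORIENTATION)

corr_raw = np.corrcoef(raw["mean_green_cover"], raw["pm25"])[0, 1]
corr_aligned = np.corrcoef(aligned["mean_green_cover"], aligned["pm25"])[0, 1]
print(f"correlation before flip: {corr_raw:.2f}, after flip: {corr_aligned:.2f}")
```

Flipping a feature's sign also flips the sign of its correlation with every other feature, which is why a negative relationship between raw PM2.5 and green cover becomes a positive one after alignment.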

Healthy Biome Score

After the pre-processing steps, we begin the Healthy Biome Score calculations. As can be seen below, there are some strong correlations, such as between PM2.5 and mean_green_cover (% green cover of the CT). Note that this is a strong positive rather than negative correlation because we have flipped the sign of the PM2.5 feature, thus modifying it to be a “good” feature.

Due to the strong correlations between some of these features, we’d like to take these correlations into account before computing any metric utilizing these features. Why is this necessary? Consider the extreme case where two features are generated by the same mechanism and differ from each other by only noise in their measurement. In this scenario, using the raw data directly would result in a double-counting of the underlying mechanism in the final score. In the extreme case we’d definitely be better off simply removing one of the duplicate features, but in the real world we’ll often see highly correlated features (such as PM2.5 and mean_green_cover with a correlation of 0.7) instead of literally duplicate features. Highly correlated features run into a similar issue as duplicate features, but still have enough useful information independently to keep in the dataset.
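The double-counting problem can be made concrete with a small simulation. The data below are synthetic and the variable names are hypothetical; the sketch assumes two features that are noisy measurements of one shared mechanism, plus a third independent feature, and looks at how much of a naive summed score is driven by the shared mechanism.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
mechanism = rng.normal(size=n)    # shared underlying driver
independent = rng.normal(size=n)  # unrelated driver

# Two noisy measurements of the same mechanism, plus one independent feature.
feature_a = mechanism + 0.1 * rng.normal(size=n)
feature_b = mechanism + 0.1 * rng.normal(size=n)
feature_c = independent

# Naive score: add the raw features together.
naive_score = feature_a + feature_b + feature_c

corr_ab = np.corrcoef(feature_a, feature_b)[0, 1]
# Approximate share of the naive score's variance coming from the duplicated pair.
share_mechanism = np.var(feature_a + feature_b) / np.var(naive_score)
print(f"corr(a, b) = {corr_ab:.2f}, variance share of a + b = {share_mechanism:.2f}")
```

Although the shared mechanism is only one of two independent drivers, it accounts for far more than half of the naive score's variance, because it enters the sum twice.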

PCA 101

One way to take correlations between features into account without simply removing features is to de-correlate the data, that is, to transform it into a set of new features that all have zero correlation with each other. Principal component analysis, or PCA, is an unsupervised machine learning algorithm that achieves this by selecting linear combinations of the original features that have zero correlation with each other. The weights of each linear combination, collected into a vector that we constrain to have unit norm, form a principal component, or PC. The weights themselves, called loadings, tell us how much each PC loads up on each of the original features when computing the new, de-correlated features. If two original features are highly correlated, such as PM2.5 and green cover, we will often see a principal component that has a high loading on both of these features, consolidating their effect into a single feature.
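For readers who want to see the mechanics, here is a minimal PCA sketch built on the eigendecomposition of the covariance matrix. It is one standard way to compute PCA and not necessarily the solver we used; the pca helper and the synthetic three-feature dataset are assumptions made for illustration.

```python
import numpy as np

def pca(X):
    """Return (components, explained_variance, transformed) for data X (rows = CTs)."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder by descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    transformed = X_centered @ eigvecs       # the new, de-correlated features
    return eigvecs.T, eigvals, transformed   # each row of eigvecs.T is one PC

# Synthetic data: features 0 and 1 share a driver, feature 2 is independent.
rng = np.random.default_rng(2)
n = 2000
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.5 * rng.normal(size=n),
    shared + 0.5 * rng.normal(size=n),
    rng.normal(size=n),
])

components, explained_variance, transformed = pca(X)
transformed_corr = np.corrcoef(transformed, rowvar=False)
print("PC1 loadings:", np.round(components[0], 2))
```

On this toy dataset, PC1 loads heavily (and with the same sign) on the two correlated features, and the correlation matrix of the transformed features is the identity up to floating-point error.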

In addition to orthogonalizing the data, PCA selects principal components that explain the most variance in the original data. One can think of PCA as a constrained optimization algorithm, where the first PC’s weights are selected such that it explains the most variance in the dataset, while each PC’s weights after the first are selected such that it explains as much of the remaining variance as possible while also being orthogonal to each of the preceding PCs. To ensure that PCA is not just optimizing for variance in the data generation process of the individual, original features, we z-score the original features to ensure that their variances are all the same (equal to one). This puts all of the original features on the same “scale”, allowing us to disregard differences in variance in the data generation process and instead focus on the relationships between features, i.e. their correlations.
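The following sketch illustrates why z-scoring matters. It uses synthetic data in which one independent feature is measured on a much larger scale than two correlated ones; the zscore and first_pc helpers are hypothetical names introduced only for this example.

```python
import numpy as np

def zscore(X):
    """Center each column and scale it to unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def first_pc(X):
    """Unit-norm direction of maximum variance."""
    cov = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, -1]  # eigh sorts ascending, so the last column is PC1

rng = np.random.default_rng(3)
n = 5000
shared = rng.normal(size=n)
X = np.column_stack([
    shared + 0.5 * rng.normal(size=n),
    shared + 0.5 * rng.normal(size=n),
    1000.0 * rng.normal(size=n),  # independent feature on a much larger scale
])

pc1_raw = first_pc(X)
pc1_scaled = first_pc(zscore(X))
print("PC1 without z-scoring:", np.round(np.abs(pc1_raw), 2))
print("PC1 with z-scoring:   ", np.round(np.abs(pc1_scaled), 2))
```

Without z-scoring, PC1 simply points along the large-scale feature. After z-scoring, PC1 instead captures the relationship between the two correlated features.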


The figures below show the results of the PCA. The figure on the left depicts the first two features of the transformed data (where order is determined by explained variance), and the red regression line along the x-axis shows the desired zero correlation between the two transformed features. Note that this would be the case for any selection of two features, but we only show the first two here due to space constraints. The explained variance ratio graph on the right shows the decrease in explained variance by each PC as we progress from the first to the last, where the ratio is defined as the fraction of the total variance. Also note that if the original features were all uncorrelated, each transformed feature would have an explained variance ratio of 1/7. In contrast, the explained variance ratio of the first transformed feature here is ~0.38, which encompasses the variance of almost three features in the uncorrelated case! This difference emphasizes the amount of correlation that exists between some of the original features, further showing the need for de-correlation.
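The 1/7 baseline is easy to reproduce on synthetic data. The sketch below compares seven uncorrelated features with a variant in which three of them share a driver. The data and the explained_variance_ratio helper are assumptions for illustration, and the numbers are not meant to match our real dataset.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance explained by each PC of the z-scored data."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # descending
    return eigvals / eigvals.sum()

rng = np.random.default_rng(4)
n, p = 20_000, 7

# Case 1: seven independent features.
uncorrelated = rng.normal(size=(n, p))

# Case 2: the first three features also share a common driver.
shared = rng.normal(size=(n, 1))
correlated = uncorrelated.copy()
correlated[:, :3] += 1.5 * shared

ratio_uncorrelated = explained_variance_ratio(uncorrelated)
ratio_correlated = explained_variance_ratio(correlated)
print("uncorrelated:", np.round(ratio_uncorrelated, 3))
print("correlated:  ", np.round(ratio_correlated, 3))
```

In the uncorrelated case every ratio sits near 1/7 (about 0.143), while in the correlated case the first PC absorbs the shared variance of the three related features.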

It’s also important to recognize that since the new features we’re using are now linear combinations of the original features, where the loadings are both negative and positive, it is difficult to assign an a priori notion of positivity or negativity to these transformed features. As a proxy for positivity, we sum the PC loadings: if the sum is positive, we keep the sign of the transformed feature; if it is negative, we reverse the sign. Note that this is also effective in the general case where the PC solver arbitrarily returns the negative of the desired PC, since there is no guarantee on the absolute signs of the PCs, but only on the relative signs between loadings within each PC.
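The sum-of-loadings rule can be written in a few lines. This is a minimal sketch with made-up loadings; the align_pc_signs helper is a hypothetical name, and it assumes each row of components is one PC and each column of transformed is the matching transformed feature.

```python
import numpy as np

def align_pc_signs(components, transformed):
    """Flip any PC (and its transformed feature) whose loadings sum to a negative value."""
    signs = np.where(components.sum(axis=1) < 0, -1.0, 1.0)
    return components * signs[:, None], transformed * signs[None, :]

# Toy example: PC1's loadings sum to -1.4 (flip), PC2's sum to +0.2 (keep).
components = np.array([[-0.6, -0.8],
                       [ 0.8, -0.6]])
transformed = np.array([[ 1.0, 2.0],
                        [-3.0, 4.0]])

aligned_components, aligned_transformed = align_pc_signs(components, transformed)
print(aligned_components)
print(aligned_transformed)
```

Negating a PC together with its transformed feature leaves the explained variance and the zero correlations untouched, so the flip only changes which direction counts as "healthier".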

To calculate the Healthy Biome Score, we sum the first two transformed features (those from PC1 and PC2) for each CT and normalize the result to be in the range (0,100) via a modified min-max normalization. In this sum, we’d like PC1 to, on average, contribute more to the static score than PC2, in proportion to the ratio of their explained variances.
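As a rough sketch of this step, the snippet below sums the first two transformed features and rescales the result to 0-100. Two assumptions are worth flagging: the static_score helper is a hypothetical name, and it uses plain min-max normalization, which maps the extremes to exactly 0 and 100, because the details of the modified version are not spelled out here.

```python
import numpy as np

def static_score(transformed, n_components=2):
    """Sum the leading transformed features and rescale to 0-100 (plain min-max)."""
    combined = transformed[:, :n_components].sum(axis=1)
    lo, hi = combined.min(), combined.max()
    if hi == lo:
        raise ValueError("all tracts have the same combined value; cannot normalize")
    return 100.0 * (combined - lo) / (hi - lo)

# Toy transformed features for four census tracts (columns: PC1, PC2).
transformed = np.array([
    [ 2.0,  1.0],
    [ 0.0,  0.0],
    [-2.0, -1.0],
    [ 1.0, -1.0],
])

scores = static_score(transformed)
print(scores)
```

Because standard PCA output is not rescaled, each transformed feature has a variance equal to the variance its PC explains, so an unweighted sum already lets PC1 move the score more than PC2 on average.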

When examining the PC loadings, we noticed that PC1 places a large emphasis on the features PM2.5, ozone, and % green cover of the CT. This is in line with our expectations, since these features are significantly correlated as can be seen from the correlation matrix displayed earlier. Because of this, Florida lights up heavily under our static metric, while California gets the short end of the stick due to both its generally poor air quality and lack of green space (in addition to color, notice the difference in scale on the right of each map between Davie and Inglewood):

Real-Time Biome Monitoring

The exposures you experience from your environment change constantly -- and the Healthy Biome Score should, too.

So far, we have computed a Healthy Biome Score that acts as a single metric that represents a CT’s environmental health. However, we can make this score dynamic by incorporating recent, hourly trends in the features. The N3 API provides hourly updates to air pollution (PM2.5), weather, humidity, and barometric pressure data, which we can compare against the EPA’s standards to determine which one of six zones the CT is currently in: good, moderate, unhealthy for sensitive groups, unhealthy, very unhealthy, or hazardous. We take a weighted average of the static score with the real-time, dynamic score to get a final Healthy Biome Score (HBS) that is dynamic. Below, we see a GIF of the hourly HBS in California during their August wildfires: the HBS gets progressively worse as the 24th nears.
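The blending step might look something like the sketch below. Several pieces are assumptions rather than our actual implementation: the PM2.5 cut-offs are the EPA's 24-hour PM2.5 AQI breakpoints as they stood in 2021 (check the current EPA table before reusing them), and the zone-to-score mapping, the 0.5 weight, and the function names are hypothetical.

```python
# Upper PM2.5 bound (micrograms per cubic meter) for each EPA zone.
# Assumption: 24-hour PM2.5 AQI breakpoints in effect in 2021.
PM25_BREAKPOINTS = [
    (12.0, "good"),
    (35.4, "moderate"),
    (55.4, "unhealthy for sensitive groups"),
    (150.4, "unhealthy"),
    (250.4, "very unhealthy"),
]

# Hypothetical mapping from zone to a 0-100 dynamic score (higher is healthier).
ZONE_SCORE = {
    "good": 100.0,
    "moderate": 80.0,
    "unhealthy for sensitive groups": 60.0,
    "unhealthy": 40.0,
    "very unhealthy": 20.0,
    "hazardous": 0.0,
}

def epa_zone(pm25):
    """Return the zone whose PM2.5 range contains the reading."""
    for upper, zone in PM25_BREAKPOINTS:
        if pm25 <= upper:
            return zone
    return "hazardous"

def dynamic_hbs(static_score, pm25, weight_static=0.5):
    """Weighted average of the static score and the real-time zone score."""
    realtime = ZONE_SCORE[epa_zone(pm25)]
    return weight_static * static_score + (1.0 - weight_static) * realtime

print(dynamic_hbs(70.0, pm25=8.0))    # clean-air hour
print(dynamic_hbs(70.0, pm25=300.0))  # wildfire-smoke hour
```

With these illustrative settings, a tract with a static score of 70 rises during a clean-air hour and drops sharply during a smoke event, which is the behavior described above.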

What’s Next?

This post is the first in a series of blog posts that will demonstrate some of the innovation we are leading around using machine learning to understand community health. In future blog posts, we will spotlight more analysis around environmental health, as well as COVID-19 disease risk, chronic disease, and mental health.

Our next post will show how we use time-series analysis to either temper or exacerbate the impact of real-time data on the Healthy Biome Score calculation, based on historical norms for the community.


  2. Prüss-Üstün et al. The impact of the environment on health by country: a meta-synthesis. BMC Environmental Health (2008).