COVID-19: Identifying communities at risk

April 07, 2020

In this blog post, we identify communities in the U.S. that might be at increased risk for COVID-19-related morbidity and mortality.

Coronavirus disease 2019, or COVID-19, is a respiratory disease that was first identified in Wuhan, China, in December 2019. The virus is thought to spread through droplet transmission and can result in severe pneumonia and death. Information about the virus is changing rapidly; at the time of writing, over 150,000 people have been infected worldwide and more than 6,000 deaths have been attributed to the disease.

An important part of determining which populations in the United States may be most affected is finding communities with a high prevalence of co-morbid conditions associated with poorer outcomes in COVID-19 cases. Demographic factors (e.g., age > 60 years) also appear to play a large role in outcomes.

In this blog post, we show how data scientists can access data on co-morbid conditions relevant to COVID-19, and then describe work in progress towards a "risk score" for identifying the communities that might be most at risk. Relevant population characteristics include:

  1. Age
  2. Immunosuppression due to medications and chronic conditions
  3. Respiratory conditions such as COPD and asthma in older individuals
  4. Smoking status, which may compromise lung function and predispose to a respiratory infection
  5. Access to healthcare

Since the epidemiology of COVID-19 is evolving, a "risk index" will have to evolve with new data. Below we walk through code examples that data scientists can use to gather these data and build a COVID-19 community risk index. We also describe how data sources such as satellite imagery can be used to predict health and disease outcomes and to fill the gaps left by publicly available data, which mostly cover large cities. Complete code can be found here.

Finding Regions at Risk Using the CDC 500 Cities Data and XY's Exposome Data Warehouse

Using the CDC's 500 Cities data, we can look into the relationship between several potential risk factors for COVID-19 infection, such as cancer, asthma, COPD, CHD, stroke, obesity, smoking, diabetes, and access to healthcare. For age, we'll use the median age calculated in the US census for each census tract. We have prepared a .csv with the CDC 500 Cities data available here. Once we have loaded the data into an R object called fh_cities, we can take a look at the available features:
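A minimal sketch of that loading step, assuming the .csv has been saved locally. The file name below is a placeholder, not the actual download name, and the small stand-in data frame (with made-up values) exists only so the snippet runs without the file. The column names are the ones used later in this post.

```r
# Placeholder path: point this at the downloaded CDC 500 Cities .csv.
csv_path <- "cdc_500_cities.csv"

if (file.exists(csv_path)) {
  fh_cities <- read.csv(csv_path, stringsAsFactors = FALSE)
} else {
  # Stand-in with made-up values so the snippet runs without the file.
  fh_cities <- data.frame(
    placename = c("City A", "City B", "City C"),
    stateabbr = c("AA", "BB", "CC"),
    population_2010 = c(4200, 3100, 5600),
    median_age = c(34.5, 51.2, 42.0),
    COPD_CrudePrev = c(5.1, 9.8, 6.7),
    ACCESS2_CrudePrev = c(12.3, 27.9, 18.4),
    stringsAsFactors = FALSE
  )
}

# List the available features and their types.
names(fh_cities)
str(fh_cities)

# Pick out the prevalence columns, which share the "_CrudePrev" suffix.
prev_cols <- grep("_CrudePrev$", names(fh_cities), value = TRUE)
summary(fh_cities[, prev_cols, drop = FALSE])
```

Selecting columns by the shared suffix is just a convenience for a first look; the analysis below names each column explicitly.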


This data frame has the prevalence of several relevant conditions and health indicators, for example lack of access to care (ACCESS2_CrudePrev) and COPD (COPD_CrudePrev). We merged the CDC 500 Cities data with census information gathered from our Exposome Data Warehouse.

We can investigate communities with combined co-morbidity risk in many ways. For simplicity, let's plot the combined prevalence (summed across conditions and z-scored) on one axis and the median age of the more populous census tracts on the other axis:

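library(dplyr)     # %>%, mutate, filter, group_by, top_n
library(ggplot2)
library(ggrepel)   # geom_text_repel
library(ggthemes)  # theme_fivethirtyeight

# sum the crude prevalences of the co-morbid conditions (equal weights)
fh_cities <- fh_cities %>% 
  mutate(comorbidity_risk_score = 
    CANCER_CrudePrev + STROKE_CrudePrev + CASTHMA_CrudePrev + COPD_CrudePrev + 
    CHD_CrudePrev + OBESITY_CrudePrev + CSMOKING_CrudePrev + DIABETES_CrudePrev)

# keep the more populous census tracts (above the median 2010 population)
fh_cities <- fh_cities %>% filter(population_2010 > median(population_2010))
# z-score the risk score; as.numeric() drops the one-column matrix scale() returns
fh_cities$comorbidity_risk_score <- as.numeric(scale(fh_cities$comorbidity_risk_score))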
# Incomplete in the source (filter condition missing), so left commented out:
# fh_cities <- fh_cities %>% filter(!

# top census tract per state by risk score, kept only if the score is high (z >= 2.5)
top_per_state <- fh_cities %>% group_by(stateabbr) %>% 
  top_n(1, comorbidity_risk_score) %>% ungroup() %>% 
  filter(comorbidity_risk_score >= 2.5)

# oldest census tract per state, kept only if its median age is at least 50
oldest_tracts <- fh_cities %>% group_by(stateabbr) %>% 
  top_n(1, median_age) %>% ungroup() %>% 
  filter(median_age >= 50)

# tracts that are both high-risk and older (defined here but not plotted below)
top_both <- fh_cities %>% filter(comorbidity_risk_score > 2 & median_age >= 55)

### main plot 
p <- ggplot(fh_cities, aes(median_age, comorbidity_risk_score))
p <- p + geom_point(alpha = 0.5, color = 'gray')

# highest-risk tract per state (black)
p <- p + geom_point(data = top_per_state, 
  aes(median_age, comorbidity_risk_score, size = population_2010))
p <- p + geom_text_repel(data = top_per_state, 
  aes(median_age, comorbidity_risk_score, label = paste(placename, stateabbr)), size = 3)

# oldest tract per state (red)
p <- p + geom_point(data = oldest_tracts, 
  aes(median_age, comorbidity_risk_score, size = population_2010), color = 'red')
p <- p + geom_text_repel(data = oldest_tracts, 
  aes(median_age, comorbidity_risk_score, label = paste(placename, stateabbr)), color = 'red', size = 3)

p <- p + theme_fivethirtyeight() + 
  theme(axis.title = element_text(), legend.position = 'none') + 
  labs(x = 'Median Age of Tract', y = 'Comorbidity Risk Score')
p # display the plot

Each of the ~13K points in the plot above is a census tract. Highlighted tracts are labeled with the city in which they are located. Communities with a high co-morbidity risk score are highlighted in black, and communities with a high median age are highlighted in red.

This suggests that younger communities with a high co-morbidity risk score, in places such as Louisiana, Florida, Ohio, and Pennsylvania, are at additional risk for morbidity and mortality associated with COVID-19. Furthermore, older communities such as those in Nevada, Arizona, New Mexico, and Florida may also be at increased risk.

In this analysis, we have weighted the underlying morbidities evenly, an assumption that is very unlikely to be optimal. We are currently working on inferring these weights from COVID-19 and healthcare claims data, and on combining age and co-morbidity structure to form a single risk index that ranks communities. We are also integrating community-level information on healthcare capacity and resources (e.g., ICU beds), centrality of hospitals, and availability of licensed medical professionals across the country. As data from both public health bureaus and healthcare organizations such as insurance companies come online, we expect this risk index to improve.
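To make the even-weighting assumption concrete, the sketch below computes the score as a weighted sum, where equal weights reproduce the plain sum used above. The weights and the toy prevalence values are invented for illustration; they are not estimates from COVID-19 or claims data, and the helper name weighted_risk_score is hypothetical. It is written in base R so that it runs without additional packages.

```r
# Hypothetical helper: weighted sum of prevalence columns, then z-scored.
weighted_risk_score <- function(df, weights) {
  stopifnot(all(names(weights) %in% names(df)))
  prev <- as.matrix(df[, names(weights), drop = FALSE])
  raw <- as.vector(prev %*% weights)  # weighted sum per tract
  as.numeric(scale(raw))              # z-score across tracts
}

# Toy tracts with made-up prevalences (percent of adults).
toy <- data.frame(
  tract = c("t1", "t2", "t3", "t4"),
  COPD_CrudePrev = c(4, 10, 6, 8),
  DIABETES_CrudePrev = c(8, 15, 10, 12),
  CSMOKING_CrudePrev = c(14, 25, 18, 30),
  stringsAsFactors = FALSE
)

# Equal weights reproduce the unweighted sum used in this post.
equal_w <- c(COPD_CrudePrev = 1, DIABETES_CrudePrev = 1, CSMOKING_CrudePrev = 1)
# Placeholder weights, NOT inferred from data: up-weight COPD.
placeholder_w <- c(COPD_CrudePrev = 3, DIABETES_CrudePrev = 1, CSMOKING_CrudePrev = 1)

toy$score_equal <- weighted_risk_score(toy, equal_w)
toy$score_weighted <- weighted_risk_score(toy, placeholder_w)

# Rank tracts from highest to lowest weighted score.
toy[order(-toy$score_weighted), c("tract", "score_equal", "score_weighted")]
```

Once weights have been inferred from data, they could be passed in place of placeholder_w without changing the rest of the pipeline.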

A Significant Population at Risk for COVID-19 Outcomes Lacks Access to Care

We conducted a similar analysis using the data in fh_cities on (lack of) access to care:

# high-risk tracts (z-score >= 3) where more than 25% lack access to care
lack_access <- fh_cities %>% filter(comorbidity_risk_score >= 3, ACCESS2_CrudePrev > 25)

p <- ggplot(fh_cities, aes(ACCESS2_CrudePrev, comorbidity_risk_score, size = population_2010))
p <- p + geom_point(alpha = 0.5, color = 'gray')
p <- p + geom_point(data = lack_access, 
  aes(ACCESS2_CrudePrev, comorbidity_risk_score), color = 'black')
p <- p + geom_text_repel(data = lack_access, 
  aes(ACCESS2_CrudePrev, comorbidity_risk_score, 
  label = paste(placename, stateabbr)), color = 'black', size = 3)
p <- p + theme_fivethirtyeight() + 
  theme(axis.title = element_text(), legend.position = 'none') + 
  labs(x = '% Lack of access to Healthcare', y = 'Comorbidity Risk Score')
p # display the plot

We find that communities such as those in Louisiana, Tennessee, and Florida are at risk for morbidity and mortality due to COVID-19, and that a significant proportion of these at-risk populations lack access to care.
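One way to put a number on that overlap is to compute the share of high-risk tracts that also exceed the access threshold. The sketch below uses the same cutoffs as the filter above (risk score >= 3, ACCESS2_CrudePrev > 25) on a toy data frame with made-up values, so the resulting shares are illustrative only and say nothing about the real data.

```r
# Toy stand-in for fh_cities with made-up values.
toy <- data.frame(
  comorbidity_risk_score = c(3.4, 3.1, 0.2, -0.5, 3.8, 1.1),
  ACCESS2_CrudePrev = c(31, 18, 27, 9, 40, 22),
  population_2010 = c(5000, 2000, 6000, 7000, 3000, 5500)
)

# Same thresholds as the lack_access filter in this post.
high_risk <- toy$comorbidity_risk_score >= 3
low_access <- toy$ACCESS2_CrudePrev > 25

# Share of high-risk tracts that also exceed the access threshold.
share_tracts <- mean(low_access[high_risk])

# The same share, weighted by tract population.
share_pop <- sum(toy$population_2010[high_risk & low_access]) /
  sum(toy$population_2010[high_risk])

share_tracts
share_pop
```

Weighting by population matters when the high-risk tracts differ a lot in size, since a tract-level share treats a small tract and a large one alike.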

Going Forward: Using Satellite Imagery

XY is using satellite imagery coupled with deep learning techniques to predict health and disease outcomes across the country. XY has also used satellite imagery and geospatial analysis to address other public health issues, such as wildfires.

The "built environment," which includes how cities and towns are constructed and laid out, can affect how diseases spread: if people live closer together or there are more public meeting places, then diseases that spread through respiratory transmission (such as COVID-19) might spread more quickly.

Understanding how the built environment affects disease spread and the progression of a pandemic is crucial. To complement the public sources of data in this blog post, which primarily focus on large U.S. cities, we are currently using satellite imagery to understand the evolution of the COVID-19 pandemic and gain a high-resolution view of where co-morbidities and the environment conspire to create pockets of risk in the United States.