For the last two weeks, the New York City Department of Health has been releasing a dataset of zipcodes with their positive coronavirus case numbers. We have previously used that dataset to make an interactive map showing which neighborhoods have the most cases per 100,000 population, along with map layers you can use to compare the pattern of infection to other indicators, like median income and age. That map currently indicates the epicenter of the New York outbreak is in north Queens, in Jackson Heights, Corona, and Elmhurst.
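For readers who want to reproduce the per-capita figure, "cases per 100,000 population" is simple arithmetic: positive cases divided by population, scaled to 100,000. Below is a minimal Python sketch; the function name and the example numbers are hypothetical and are not taken from the city's dataset.

```python
def cases_per_100k(positive_cases: int, population: int) -> float:
    """Scale a raw case count to a rate per 100,000 residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    # Multiply before dividing so whole-number results stay exact.
    return positive_cases * 100_000 / population


# Hypothetical zipcode (made-up numbers): 500 cases among 40,000 residents.
rate = cases_per_100k(500, 40_000)
print(rate)  # 1250.0
```

Scaling to a common population size is what makes a small zipcode and a large one comparable on the same map or plot.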

While the map is a useful illustration of the data, it can make it difficult to discern patterns of infection across all New York neighborhoods. So today we have created an interactive scatterplot that lets you view all NYC zipcodes at once and examine possible correlations between infections and income, age, and racial data (use the menu button at the top right to change views):

Let's look at each of the four scatterplots. First, COVID-19 cases vs. median income:

Here we see a clear association: the richer neighborhoods in Manhattan have the lowest per capita rates of COVID infection. Poorer neighborhoods, particularly in Queens, have the highest rates.
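One common way to put a number on an association like this is the Pearson correlation coefficient, which runs from -1 (a perfect downward line) to +1 (a perfect upward line). The Python sketch below is only an illustration: the helper name `pearson_r` is hypothetical, and the income and case figures are made up, not drawn from the city's data or from the scatterplot above.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for illustration only, NOT the city's data.
median_income = [35_000, 48_000, 60_000, 85_000, 120_000]
cases_per_100k = [1_900, 1_600, 1_300, 900, 600]

r = pearson_r(median_income, cases_per_100k)
print(f"r = {r:.2f}")  # negative: higher income goes with fewer cases here
```

A value near -1 or +1 indicates a strong linear association; a value near 0 indicates little or none. The coefficient says nothing about why the two measures move together.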

Next, let's look at the correlation between median age and COVID-19 cases:

Here we can see a weak correlation: neighborhoods with a higher median age seem to have more cases, but there's so much variation that the trend could also be due to random noise.
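One way to judge whether a weak trend could be random noise is a permutation test: shuffle one variable many times and count how often the shuffled data produce a correlation at least as strong as the real one. The sketch below assumes Python's standard library only; the function names and the age and case figures are hypothetical, and this is not necessarily the analysis behind the plot.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_shuffles=10_000, seed=0):
    """Share of random shuffles whose |r| is at least the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Add 1 to both counts so the estimate is never exactly zero.
    return (hits + 1) / (n_shuffles + 1)


# Made-up values for illustration only, NOT the city's data.
median_age = [31, 33, 34, 36, 38, 40, 42, 45]
cases_per_100k = [1_200, 900, 1_500, 1_000, 1_400, 1_100, 1_600, 1_300]

p_value = permutation_p_value(median_age, cases_per_100k)
print(f"r = {pearson_r(median_age, cases_per_100k):.2f}, p = {p_value:.3f}")
```

A large p-value means shuffled data often look just as correlated as the real data, so the trend is hard to distinguish from noise. A small one suggests the pattern is unlikely to be chance alone, though it still says nothing about cause.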

Now let's look at the two plots of COVID-19 cases vs. the percentage of the neighborhood population that is black or Hispanic:

Here we see a moderate association: neighborhoods with higher percentages of black or Hispanic residents seem to have higher rates of COVID-19 per capita.

We should note, as statisticians always insist, that correlation is not causation. Though coronavirus shows a higher presence in neighborhoods whose residents are poorer, older, and more likely to be black or Hispanic, that does not mean poverty, age, or race causes COVID-19. It means only that higher numbers of cases per capita are associated with these neighborhood markers. It could be the case that these markers are associated with an underlying cause that is the real driver of higher rates of COVID-19 infections, such as poorer health.

There is also a chance that these correlations are distorted by varying access to testing. Normally, though, less affluent populations have less access to tests, not more, which tends to depress their numbers of positive cases rather than increase them.

We will continue to investigate the city data looking for more correlations—if you have an idea for a graph or questions about this work, please email [email protected]. You can see our daily-updated charts and graphs tracking coronavirus here.

Update 4/17: we've updated the language of this post to be more precise with statistical terms like "correlation" and "association".