Querying and Using SANDAG Covid19 Data from ARCGIS in Python

Note: A Jupyter Notebook version of this post is available here.

The San Diego County Health Department has made county-level Covid19 data available on ArcGIS. See here for a list of available datasets.

To create your customized query links you may use the interactive tool on each dataset’s API Explorer. For example, the query string for general county data can be constructed on this page.

The examples below demonstrate simple queries of the ArcGIS data and show how to import the data into pandas DataFrames for use in python. These examples were run in python 3.8. In addition to python, you will need to install the python packages requests, numpy, and pandas.

import pandas as pd
import numpy as np
import requests

import datetime
import sys


print("Python version: {}".format(sys.version))
print("Numpy Version: {}".format(np.__version__))
print("Pandas Version: {}".format(pd.__version__))
print("Requests Version: {}".format(requests.__version__))

    
Python version: 3.8.2 (v3.8.2:7b3ab5921f, Feb 24 2020, 17:52:18) 
[Clang 6.0 (clang-600.0.57)]
Numpy Version: 1.18.2
Pandas Version: 1.0.3
Requests Version: 2.23.0

Preliminaries

The following function is used for all of the queries. Pass in a well-formed query URL to return a cleaned json string used to instantiate a pandas DataFrame.

def query_database(link):
    ''' Query San Diego County COVID-19 ArcGIS Database
        
        Parameters:
            LINK:  A well formed database query url. 
        
        Returns:
            JSON string with headers and GIS geometry removed
    '''

    f = requests.get(link)
    j = f.json()

    # Clean up the json string
    j.pop('exceededTransferLimit',None)
    [rec.pop('geometry',None) for rec in j['features']]
    
    return j

Example 1: Query County-Level Test Results and Rate of New Cases by Date

This example shows how to download data from the Covid 19 Statistics San Diego County Dataset, which includes the data displayed on the San Diego County 2019 Coronavirus Disease Local Situation landing page. Use the API Explorer tab to create the desired query URL string.

# Perform query
link = 'https://gis-public.sandiegocounty.gov/arcgis/rest/services/Hosted/COVID_19_Statistics_San_Diego_County/FeatureServer/0/query?where=1%3D1&outFields=objectid,date,tests,positives,newtests,rolling_perc_pos_cases,newcases&outSR=4326&f=json'
j = query_database(link)

# Create DataFrame 
df = pd.DataFrame.from_dict([itm['attributes'] for itm in j['features']])

# Translate date values to human readable
df['date']=df['date'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000).strftime('%Y-%m-%d'))

# Reindex
df.index = df['date']
df = df.sort_index()

# Plot
df.loc[:,['newtests','rolling_perc_pos_cases']].plot(secondary_y=['rolling_perc_pos_cases'],title='San Diego Covid19 Test Statistics')

Example 2: Query Rate of New Cases by Zip Code by Date

This example shows how to download case incidence by ZIP code from the San Diego Covid-19 Statistics by ZIP Code Dataset. For the query builder, click here. to create your custom query URL.

# Specify start and end dates to limit returned data objects
start_date = "2020-04-15"
end_date = "2020-05-01"

# Perform query
link = "https://gis-public.sandiegocounty.gov/arcgis/rest/services/Hosted/COVID_19_Statistics__by_ZIP_Code/"\
    "FeatureServer/0/query?where=updatedate%20%3E%3D%20TIMESTAMP%20'{start_date}%2000%3A00%3A00'%20"\
    "AND%20updatedate%20%3C%3D%20TIMESTAMP%20'{end_date}%2000%3A00%3A00'&outFields=ziptext,zipcode_zip,case_count,"\
    "updatedate&outSR=4326&f=json".format(start_date=start_date,end_date=end_date)    
j = query_database(link)

# Create DataFrame 
df = pd.DataFrame.from_dict([itm['attributes'] for itm in j['features']])

# Translate date values to human readable
df['updatedate']=df['updatedate'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000).strftime('%Y-%m-%d'))

# Reindex
df.index = df['updatedate']
df = df.sort_index().drop(columns='updatedate')

# If 'ziptext' is NaN, copy value from 'zipcode_zip' field.
df.loc[df.ziptext.isnull(),'ziptext'] = df.loc[df.ziptext.isnull(),'ziptext'].apply(lambda x: str(x))

# Pivot table to create ZIP code columns
# Note when pivoting use more generalized `pd.pivot_table` instead of `pd.DataFrame.pivot` 
# so that duplicate entries (i.e. same updatedate, zipcode) can be dealt with by taking the max value
df=pd.pivot_table(df,index = df.index, columns='ziptext',values='case_count', aggfunc = np.max)
df

Example 3: Query Cases by Episode Date

San Diego County’s Covid-19 Triggers Dashboard requires calculation of the 7-day average case rate measured using episode date with 7-day lag as described in this pdf document. The data to calculate the 7-day lag is found in the Covid 19 Cases by Episode Date Dataset. This example shows how to download the data from that dataset and for each episode date, select only the datum with the latest update date.

link = 'https://gis-public.sandiegocounty.gov/arcgis/rest/services/Hosted/COVID19_Cases_by_Episode_Date/FeatureServer/0/query?where=1%3D1&outFields=episode_date,update_date,case_count_excluding_inmates&outSR=4326&f=json'
    
j = query_database(link)

# Create DataFrame
df = pd.DataFrame.from_dict([itm['attributes'] for itm in j['features']])

# Drop NaN Values
df = df.dropna(subset=['case_count_excluding_inmates'])

# Translate date values to human readable
df['update_date']=pd.to_datetime(df['update_date'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000).strftime('%Y-%m-%d')))
df['episode_date']=pd.to_datetime(df['episode_date'].apply(lambda x: datetime.datetime.fromtimestamp(x/1000).strftime('%Y-%m-%d')))

# For each episode_date, keep only the datum with the latest update_date
df = df.groupby('episode_date').apply(lambda x: x.sort_values(by='update_date').iloc[-1,:])

# Reindex
df.index=df['episode_date']
df = df.drop(columns=['episode_date'])
df = df.sort_index()

# Plot
df['case_count_excluding_inmates'].plot(title='San Diego New Cases by Episode Date')