It's yet another summer when people just want to travel without fear. It has probably been "back to normal" for some people, but for me, that feeling has to be backed by some data. Plenty of websites publish COVID case data and visualize it by city (even in a Google onebox - by the way, Google Travel shows some helpful metrics like hotel reservation rates as well).

However, with a lot of travel destinations in mind, it would be very helpful to have a dashboard where I can compare COVID cases between different cities and figure out the trends myself. I haven't seen a handy one, and I really want to play around with the data using some home-baked solution.

I have been using the free Grafana Cloud as a playground to visualize data, but the data sources Grafana supports would require building a full ETL pipeline to ingest data into a supported database, which I want to avoid since I only want to look at a few places.

This is where the Python data science ecosystem comes into play - pandas, Plotly, Jupyter, and Colab. Let's play around with it!

I will share a Jupyter notebook with what I have so far, but before that, here are some notes from my novice experience working with pandas.

Why the pandas stack

  • I do have some MATLAB experience, but I haven't played with pandas that intensively, and everyone seems familiar with it
  • I need Python to do some data transformation on whatever data is available online - it's easy to work with data in Python (and, as I figured out, with pandas)
  • Plotly generates a reasonably interactive graph that I can dive into in the browser (zooming and data tooltips specifically)
  • Everything runs in the browser and it's free - the most expensive part is solved by Google's Colab

Where to get data

Choosing the right data is actually not that easy - especially once you realize that some states and 3P aggregators ended their data publishing in 2021. (I get it; it's not a small effort, and most states have ended their emergency orders, which means less money.)

When I started, I wanted accurate, up-to-date COVID case data in the US, with granularity at the city/county level. Accuracy-wise, these are the data sources closest to the origin:

  • State-level reporting systems
    • most of them are published as PDFs or all kinds of interactive dashboards (all players in the field definitely provide a much better public awareness experience than ever before - I have seen Power BI and Tableau a lot)
    • to be honest, I don't want to spend time scraping raw data out of PDFs (:/) and dashboard APIs
  • CDC reporting system (aggregating state-level data with a 1-day delay)
    • data.cdc.gov publishes machine-readable data only on a weekly basis, which is a bit too lagged for me

Edit: I dug into CDC's datasets and finally found something that works for me - but I had already built something by then, so that could be homework if you want 🙂

After realizing that data from 3P aggregators might be more developer-friendly, and their delay is not that terrible compared to data.cdc.gov, I looked into data sources from JHU (CCTV's favorite), 1point3acres, and CovidActNow.

One interesting note about scraping from 1point3acres (and the blog post they attached is a good read):

"We do not scrape from others, so please, be respectful of our effort by only fair use of our data." - Web Scraping and Crawling Are Perfectly Legal, Right?

I ended up using the API from CovidActNow - it's publicly available, just one email address away, and it's just a collection of JSON files, possibly powered by S3 and some authorization middleware, which is surprisingly easy to work with. It takes only about 10 lines of Python to download the data with urllib.request and some file-based caching, and Colab's 12-hour maximum VM lifetime maintains a reasonable data TTL without extra effort - for COVID data, 24 hours is a reasonable TTL.
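For reference, a download-with-caching helper along those lines might look like the sketch below. The URL pattern follows CovidActNow's v2 county timeseries endpoint; the API key placeholder, function name, and cache directory are my own illustrative choices, not from the original notebook:

```python
import json
import os
import urllib.request

API_KEY = "YOUR_API_KEY"  # free key, one email address away


def fetch_county_timeseries(fips, cache_dir="/tmp/can_cache"):
    """Download a county's timeseries JSON, caching it on disk.

    Colab VMs live at most ~12 hours, so the on-disk cache naturally
    expires well within the 24-hour TTL that COVID data calls for.
    """
    os.makedirs(cache_dir, exist_ok=True)
    cache_path = os.path.join(cache_dir, f"{fips}.json")
    if not os.path.exists(cache_path):
        url = (f"https://api.covidactnow.org/v2/county/"
               f"{fips}.timeseries.json?apiKey={API_KEY}")
        urllib.request.urlretrieve(url, cache_path)
    with open(cache_path) as f:
        return json.load(f)
```

Since the cache key is just the FIPS code, re-running a notebook cell is free once the file exists.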

Sorry this has been a long passage about data sources even before we get to the actual tech - but that's the reality of any data analysis effort.

pandas datatypes

Before I started this, I only knew about pd.DataFrame. From CovidActNow, I get a list of JSON objects like this:

[
  {
    "date": "2022-01-01",
    "caseDensity": 0,
    "vaccinationsInitiatedRatio": 0.75,
    "vaccinationsCompletedRatio": 0.8
  }
]

It's surprisingly easy to convert this to a DataFrame:

pd.DataFrame.from_records(data['metricsTimeseries'], index='date')

With the help of a Colab extension (%load_ext google.colab.data_table), you can just browse the DataFrame as an HTML table, which is very nice.

Each DataFrame consists of multiple series. df['series_name'] returns a pandas.Series (I was using df.loc[], which is definitely not the right way in most cases), which holds the mapping between "index" (key) and value. This is neat because you can then use pd.concat() to join series by their index.
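As a sketch of that index-based join (the county names and numbers below are made up, and the data shape just mimics the CovidActNow records):

```python
import pandas as pd

# Hypothetical per-county timeseries records, same shape as the API output
records_a = [
    {"date": "2022-01-01", "caseDensity": 10.0},
    {"date": "2022-01-02", "caseDensity": 12.5},
]
records_b = [
    {"date": "2022-01-02", "caseDensity": 8.0},
    {"date": "2022-01-03", "caseDensity": 9.5},
]

df_a = pd.DataFrame.from_records(records_a, index="date")
df_b = pd.DataFrame.from_records(records_b, index="date")

# Each column is a pandas.Series mapping index (date) -> value;
# pd.concat() outer-joins them on that shared index.
combined = pd.concat(
    [df_a["caseDensity"].rename("County A"),
     df_b["caseDensity"].rename("County B")],
    axis=1,
)
# Dates missing from one county show up as NaN after the join
```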

Data transformation

I will just list some transformation methods I used:

  • pd.concat()
  • pd.to_datetime()
  • df.filter()
  • df.dropna()
  • df.interpolate()
  • series.rename()
  • series.truncate()
  • series.shift() - useful for week-over-week comparisons

Sometimes it takes a while to find the method I need, but the API is generally very intuitive.
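A few of these chain together nicely. A small sketch with invented numbers (only the shape of the data mirrors the case-density series; the values are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical daily case-density series with one missing day
s = pd.Series(
    [20.0, np.nan, 26.0, 29.0, 31.0, 28.0, 24.0, 22.0],
    index=pd.to_datetime([
        "2022-01-01", "2022-01-02", "2022-01-03", "2022-01-04",
        "2022-01-05", "2022-01-06", "2022-01-07", "2022-01-08",
    ]),
    name="caseDensity",
)

s = s.interpolate()   # fill the missing day linearly
wow = s - s.shift(7)  # week-over-week change (NaN for the first week)
```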

Finally... the chart

If you configure the pandas plotting backend, it's very easy to plot the data table as a chart:

fig = case_density_df.plot()
fig.show()

Plotly supports a lot of configuration. It doesn't cover everything, but it fits my needs.

  • layout config: hovermode - show labels for all series on the same day, rather than showing up one by one
  • traces config: hovertemplate - make the label short and crispy
  • figure add_annotation() - annotate datapoint

The notebook and thought

If you are interested in playing around with the data as well, here is the Jupyter notebook I can share. I used it to compare new case density between counties, as well as to do a deep dive by shifting the vaccinationsInitiated rate forward by 25 days to see how its trend compares to the vaccinationsCompleted rate.

Gist: https://gist.github.com/phy25/9e0b7f130acf13e5a84b0fae627c304b

It's an interesting time when data is available and can be transformed so easily, benefiting not only institutions' big decisions but also informing our small ones. Is it improving fast enough though? Not as fast as I wished.
