How to update a Pandas data frame using an API

Posted on Sun 17 January 2021 in jupyter-notebook

We are going to use data from dataMontgomery, Montgomery County, MD's public data website, to demonstrate how to:

  1. Download and Load a .csv File into a Pandas Dataframe Automatically Using the URL
  2. Access Data from an API in JSON Format
  3. Combine these two methods to update our base dataframe with updates from an API

Import the Necessary Python Libraries

In [12]:
import json
import numpy as np
import pandas as pd
import requests

1. Download and Load a .csv File into a Pandas Dataframe Automatically Using the URL

The Pandas pd.read_csv function is very powerful: it will read a file directly from your computer if you provide a file name (including the path if the file is not in the same directory as your Jupyter Notebook). You are probably very familiar with this use of pd.read_csv.

You can also pull in a csv file by providing a URL in place of a file name. In this example, we will use the Adoptable Pets file from dataMontgomery. The advantage of this approach is that you can skip the step of downloading the file to your computer.

Either of these methods works well if your data file is relatively small.

In [2]:
url = 'https://data.montgomerycountymd.gov/api/views/e54u-qx42/rows.csv?accessType=DOWNLOAD'
df_csv = pd.read_csv(url)
df_csv.head()
Out[2]:
Animal ID Intake Type In Date Pet name Animal Type Pet Age Pet Size Color Breed Sex URL Link Crossing
0 A470102 OWNER SUR 11/18/2020 BRITTANY SPEARS OTHER 4 YEARS LARGE GREEN TURTLE / REDEARED SLIDER F http://www.petharbor.com/get_image.asp?res=DET... NaN
1 A470644 OWNER SUR 12/06/2020 GHOSI OTHER 7 YEARS MED GREEN TURTLE / REDEARED SLIDER U http://www.petharbor.com/get_image.asp?res=DET... NaN
2 A470132 OWNER SUR 11/19/2020 XENA CAT 9 YEARS MED BRN TABBY DOMESTIC SH S http://www.petharbor.com/get_image.asp?res=DET... NaN
3 A467487 OWNER SUR 09/06/2020 TUTTI OTHER NO AGE SMALL GREEN / BLACK TURTLE U http://www.petharbor.com/get_image.asp?res=DET... NaN
4 A471166 OWNER SUR 12/20/2020 OLIVER OTHER 18 YEARS SMALL BROWN TURTLE M http://www.petharbor.com/get_image.asp?res=DET... NaN
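If the file were too large to load comfortably in one go, read_csv's chunksize parameter can stream it in pieces instead. A minimal sketch, using a tiny inline CSV so the example is self-contained (the column values are hypothetical; with the real URL it works the same way):

```python
import io
import pandas as pd

# Tiny inline CSV standing in for a large download (hypothetical rows)
csv_data = "animalid,petname\nA1,REX\nA2,MITTENS\nA3,BUDDY\nA4,LUNA\n"

# chunksize makes read_csv return an iterator of DataFrames rather than
# loading everything at once; pass a URL here in the real case
chunks = pd.read_csv(io.StringIO(csv_data), chunksize=2)

# Process each chunk (filter, aggregate, ...) and reassemble as needed
frames = [chunk for chunk in chunks]
df = pd.concat(frames, ignore_index=True)
print(len(frames), len(df))  # 2 4
```

Each chunk is an ordinary DataFrame, so you can filter or aggregate it before concatenating, keeping memory use bounded.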

2. Access Data from an API in JSON Format

The next method is to access the data via an API. The dataMontgomery site provides access to all of its datasets via API as well as file download. This is really convenient if the data is updated frequently and we just need the most recently updated records, especially if the data file we would otherwise download is very large. There are two ways we can do this.

2a. The first is to use the Pandas read_json command instead of read_csv.

In [3]:
df_api = pd.read_json('https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
df_api.head()
Out[3]:
animalid intype indate petname animaltype petage petsize color breed sex url crossing :@computed_region_vu5j_pcmz :@computed_region_tx5f_5em3 :@computed_region_kbsp_ykn9 :@computed_region_d7bw_bq6x :@computed_region_rbt8_3x7n
0 A472016 OWNER SUR 2021-01-17T00:00:00.000 *BIG DADDY BIRD 1 YEAR LARGE RED CHICKEN M {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
1 A472015 OWNER SUR 2021-01-17T00:00:00.000 *GREGORY PECK BIRD 1 YEAR MED GRAY / WHITE CHICKEN M {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
2 A472018 OWNER SUR 2021-01-17T00:00:00.000 SANZON BIRD 5 MONTHS LARGE BLACK CHICKEN M {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
3 A472017 OWNER SUR 2021-01-17T00:00:00.000 ATILA BIRD 5 MONTHS LARGE BLACK CHICKEN M {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
4 A471545 OWNER SUR 2021-01-04T00:00:00.000 *RUFFLES OTHER 15 WEEKS MED BLACK / WHITE RABBIT SH F {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
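Notice in the output above that the url column came through as a nested dict ({'url': ...}) rather than a plain string. If that matters for your analysis, pd.json_normalize can flatten nested records. A sketch with a record shaped like the API's response (the values are made up):

```python
import pandas as pd

# A record shaped like the API response (hypothetical values)
records = [{'animalid': 'A0000', 'petname': 'EXAMPLE',
            'url': {'url': 'http://www.petharbor.com/example'}}]

# json_normalize flattens nested dicts into dotted column names
flat = pd.json_normalize(records)
print(sorted(flat.columns))  # ['animalid', 'petname', 'url.url']
```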

2b. The second way is to use the Requests library, which lets us communicate over HTTP and then parse the response as JSON with the Response object's .json() method. Of course, with this method we need to create the dataframe ourselves using pd.DataFrame on the JSON object.

This is a little more verbose but gives you a good sense of exactly what is happening in the background. Also, you won't always want to pull your data into a Pandas dataframe, so it is good practice to know how to use the Requests library for this task: you could just as easily assign the JSON object to any variable and skip the dataframe entirely.

In [4]:
df_api2 = pd.DataFrame(requests.get('https://data.montgomerycountymd.gov/resource/e54u-qx42.json').json())
df_api2.head()
Out[4]:
animalid intype indate petname animaltype petage petsize color breed sex url crossing :@computed_region_vu5j_pcmz :@computed_region_tx5f_5em3 :@computed_region_kbsp_ykn9 :@computed_region_d7bw_bq6x :@computed_region_rbt8_3x7n
0 A472015 OWNER SUR 2021-01-17T00:00:00.000 *GREGORY PECK BIRD 1 YEAR MED GRAY / WHITE CHICKEN M {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
1 A472016 OWNER SUR 2021-01-17T00:00:00.000 *BIG DADDY BIRD 1 YEAR LARGE RED CHICKEN M {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
2 A472018 OWNER SUR 2021-01-17T00:00:00.000 SANZON BIRD 5 MONTHS LARGE BLACK CHICKEN M {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
3 A472017 OWNER SUR 2021-01-17T00:00:00.000 ATILA BIRD 5 MONTHS LARGE BLACK CHICKEN M {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
4 A471545 OWNER SUR 2021-01-04T00:00:00.000 *RUFFLES OTHER 15 WEEKS MED BLACK / WHITE RABBIT SH F {'url': 'http://www.petharbor.com/get_image.as... NaN NaN NaN NaN NaN NaN
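As mentioned, you won't always want a DataFrame at all. The .json() call returns a plain Python list of dicts that you can work with directly. A sketch using records shaped like the API's response (hypothetical values, so we don't need to hit the live endpoint here):

```python
# The .json() call returns a plain Python list of dicts, e.g.:
data = [{'animalid': 'A0000', 'petname': 'EXAMPLE', 'animaltype': 'CAT'},
        {'animalid': 'A0001', 'petname': 'SAMPLE', 'animaltype': 'DOG'}]
# (in practice: data = requests.get(endpoint).json())

# Work with the list directly -- e.g., collect the IDs of all cats
cat_ids = [rec['animalid'] for rec in data if rec['animaltype'] == 'CAT']
print(cat_ids)  # ['A0000']
```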

Let's find the shape (number of rows and columns) of our original dataframe.

In [5]:
df_csv.shape
Out[5]:
(40, 12)

Convert In Date column to datetime

Although these two sources represent the same dataset, there are some subtle differences, probably due to when each version was created, or perhaps because they were created by different developers.

The first thing we notice is that the columns are labeled differently: Animal ID vs. animalid, Intake Type vs. intype, In Date vs. indate, etc. We will fix this later. For now, the only thing we need to do is change the In Date column in df_csv to a datetime type so we can use it properly in any further analysis. We could have done this straight away when we first imported the file, but since we didn't, let's do it here.

In [6]:
df_csv['In Date'] = pd.to_datetime(df_csv['In Date'])
df_csv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 12 columns):
Animal ID      40 non-null object
Intake Type    40 non-null object
In Date        40 non-null datetime64[ns]
Pet name       40 non-null object
Animal Type    40 non-null object
Pet Age        40 non-null object
Pet Size       40 non-null object
Color          40 non-null object
Breed          40 non-null object
Sex            40 non-null object
URL Link       40 non-null object
Crossing       10 non-null object
dtypes: datetime64[ns](1), object(11)
memory usage: 3.9+ KB
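For reference, the "straight away" version mentioned above would pass parse_dates to read_csv so the column arrives as datetime64 in the first place. A sketch with a small inline CSV standing in for the download:

```python
import io
import pandas as pd

# Inline CSV standing in for the first rows of the dataMontgomery download
csv_data = "Animal ID,In Date\nA470102,11/18/2020\nA470644,12/06/2020\n"

# parse_dates converts the named column to datetime64 during the read itself
df = pd.read_csv(io.StringIO(csv_data), parse_dates=['In Date'])
print(df['In Date'].dtype)  # datetime64[ns]
```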

3. Combine these two methods to update our base dataframe with updates from an API

As we can see from the cell above, our dataframe in this case is not too large, so we could pull it down in its entirety each time we wanted updates, using either the csv or JSON method, without any trouble. But what if we had a file with hundreds of thousands or even millions of records? We wouldn't want to wait for such a large download each time we wanted an update. And what if, as updates occur, the data file does not grow indefinitely but is capped, say, at a million records? With each update we would lose the earliest records in our dataframe.

To avoid these issues, we can load our data once via csv and then append new records retrieved from the API to our dataframe using concat.

This is where combining these two methods comes in handy.
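Worth knowing as an aside: Socrata-backed endpoints like this one also accept SoQL query parameters (e.g. $where, $order, $limit), so you can ask the server for only the records newer than what you already have rather than the whole file. A sketch; the cutoff date is illustrative and the parameter names assume Socrata's SODA API:

```python
import requests

endpoint = 'https://data.montgomerycountymd.gov/resource/e54u-qx42.json'

# SoQL parameters ask the server to filter and sort before sending anything;
# the cutoff date here is illustrative
params = {'$where': "indate > '2021-01-01T00:00:00'", '$order': 'indate DESC'}

# Uncomment to hit the live API:
# df_recent = pd.DataFrame(requests.get(endpoint, params=params).json())
print(params['$order'])  # indate DESC
```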

First, let's rename all of the df_csv columns to match the names used in df_api. Notice there is not much rhyme or reason to the column names between the two data sets. There is even a trailing space after URL Link! Arggggg. This is why data wrangling takes 80% of your time :-(

In [7]:
df_csv = df_csv.rename(columns={'Animal ID': 'animalid',
                                'Intake Type': 'intype',
                                'In Date': 'indate',
                                'Pet name': 'petname',
                                'Animal Type': 'animaltype',
                                'Pet Age': 'petage',
                                'Pet Size': 'petsize',
                                'Color': 'color',
                                'Breed': 'breed',
                                'Sex': 'sex',
                                'URL Link ': 'url',
                                'Crossing': 'crossing'
                               })
df_csv.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 12 columns):
animalid      40 non-null object
intype        40 non-null object
indate        40 non-null datetime64[ns]
petname       40 non-null object
animaltype    40 non-null object
petage        40 non-null object
petsize       40 non-null object
color         40 non-null object
breed         40 non-null object
sex           40 non-null object
url           40 non-null object
crossing      10 non-null object
dtypes: datetime64[ns](1), object(11)
memory usage: 3.9+ KB
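An alternative to spelling out every mapping by hand: most of the API names are just the csv headers lowercased with spaces stripped, so you can normalize them programmatically and patch only the genuine renames. A sketch on a copy of the column labels (the fixes dict handles the two names the API truly shortens):

```python
import pandas as pd

# The csv-side headers (note the trailing space on 'URL Link ')
cols = pd.Index(['Animal ID', 'Intake Type', 'In Date', 'Pet name',
                 'Animal Type', 'Pet Age', 'Pet Size', 'Color',
                 'Breed', 'Sex', 'URL Link ', 'Crossing'])

# Lowercase and strip spaces mechanically...
normalized = cols.str.strip().str.lower().str.replace(' ', '', regex=False)

# ...then patch the few names the API truly shortens
fixes = {'intaketype': 'intype', 'urllink': 'url'}
normalized = [fixes.get(c, c) for c in normalized]
print(normalized[:3])  # ['animalid', 'intype', 'indate']
```

This scales better when a dataset has dozens of columns, though the explicit rename dict above is arguably clearer for a dozen.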

Now we will pull in the API JSON data again, and this time we will drop the last 5 columns, as these do not appear to contain any valuable data and are mostly NaNs. We'll also set the indate column to a proper datetime data type.

In [8]:
df_api = pd.read_json('https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
df_api.drop(columns=df_api.columns[12:17], inplace=True)
df_api['indate'] = pd.to_datetime(df_api['indate'])
df_api.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 40 entries, 0 to 39
Data columns (total 12 columns):
animalid      40 non-null object
intype        40 non-null object
indate        40 non-null datetime64[ns]
petname       40 non-null object
animaltype    40 non-null object
petage        40 non-null object
petsize       40 non-null object
color         40 non-null object
breed         40 non-null object
sex           40 non-null object
url           40 non-null object
crossing      10 non-null object
dtypes: datetime64[ns](1), object(11)
memory usage: 3.9+ KB

Use concat to Combine the Dataframes

First we will use the concat function to combine the dataframes row wise, keeping the columns consistent and stacking one dataframe on top of the other. This is why we had to standardize the column names for each dataframe.

We'll use the join='inner' option to keep from repeating the columns. The join operates on the axis we are NOT concatenating along. The default is axis=0, which is rows, so by setting join='inner' we get just the columns the two dataframes have in common.

In [9]:
df_new = pd.concat([df_csv, df_api], join='inner', copy=False)
df_new.shape
Out[9]:
(80, 12)

Hmm, it seems the number of records in our dataframe has doubled.

This is because concat does not eliminate duplicates. To do this we need to use drop_duplicates. Since we know that animalid is unique for this data set, we can specify this column to ensure we drop records with the same animalid.

Let's also sort the dataframe on the indate column in descending order to see the most recent records first.

In [10]:
df_new = df_new.drop_duplicates(subset=['animalid'])
df_new.sort_values(by='indate', ascending=False).head()
Out[10]:
animalid intype indate petname animaltype petage petsize color breed sex url crossing
39 A472015 OWNER SUR 2021-01-17 *GREGORY PECK BIRD 1 YEAR MED GRAY / WHITE CHICKEN M http://www.petharbor.com/get_image.asp?res=DET... NaN
38 A472017 OWNER SUR 2021-01-17 ATILA BIRD 5 MONTHS LARGE BLACK CHICKEN M http://www.petharbor.com/get_image.asp?res=DET... NaN
37 A472018 OWNER SUR 2021-01-17 SANZON BIRD 5 MONTHS LARGE BLACK CHICKEN M http://www.petharbor.com/get_image.asp?res=DET... NaN
36 A472016 OWNER SUR 2021-01-17 *BIG DADDY BIRD 1 YEAR LARGE RED CHICKEN M http://www.petharbor.com/get_image.asp?res=DET... NaN
30 A471545 OWNER SUR 2021-01-04 *RUFFLES OTHER 15 WEEKS MED BLACK / WHITE RABBIT SH F http://www.petharbor.com/get_image.asp?res=DET... NaN
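One subtlety worth knowing: drop_duplicates keeps the first occurrence by default, so when an animalid appears in both frames, the older csv row wins. If the API rows might carry updated fields for an existing animal, keep='last' would prefer them instead. A toy sketch (the pet names are made up):

```python
import pandas as pd

# Toy stand-ins for the base frame and the API refresh (hypothetical values)
old = pd.DataFrame({'animalid': ['A1', 'A2'], 'petname': ['REX', 'MITTENS']})
new = pd.DataFrame({'animalid': ['A2', 'A3'], 'petname': ['MR. MITTENS', 'BUDDY']})

combined = pd.concat([old, new])

# keep='last' retains the later (API) version of a duplicated animalid
deduped = combined.drop_duplicates(subset='animalid', keep='last')
print(sorted(deduped['petname']))  # ['BUDDY', 'MR. MITTENS', 'REX']
```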

Create a get_update Function

Now we have our updated dataframe df_new, and we can use the code in the last three cells each time we want to refresh the data, ensuring we have the most up-to-date records while also ensuring we are not dropping any records we retrieved in the past.

If we want to get fancy, we can create a function to combine all of these steps, with the base dataframe and the API endpoint as our arguments.

In [11]:
def get_update(df, api_endpoint):
    df_api = pd.read_json(api_endpoint)
    df_api['indate'] = pd.to_datetime(df_api['indate'])
    df_new = pd.concat([df, df_api], join='inner', copy=False)
    df_new = df_new.drop_duplicates(subset='animalid')
    return df_new.sort_values(by='indate', ascending=False)

df_update = get_update(df_csv, 'https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
df_update.head()
Out[11]:
animalid intype indate petname animaltype petage petsize color breed sex url crossing
39 A472015 OWNER SUR 2021-01-17 *GREGORY PECK BIRD 1 YEAR MED GRAY / WHITE CHICKEN M http://www.petharbor.com/get_image.asp?res=DET... NaN
38 A472017 OWNER SUR 2021-01-17 ATILA BIRD 5 MONTHS LARGE BLACK CHICKEN M http://www.petharbor.com/get_image.asp?res=DET... NaN
37 A472018 OWNER SUR 2021-01-17 SANZON BIRD 5 MONTHS LARGE BLACK CHICKEN M http://www.petharbor.com/get_image.asp?res=DET... NaN
36 A472016 OWNER SUR 2021-01-17 *BIG DADDY BIRD 1 YEAR LARGE RED CHICKEN M http://www.petharbor.com/get_image.asp?res=DET... NaN
30 A471545 OWNER SUR 2021-01-04 *RUFFLES OTHER 15 WEEKS MED BLACK / WHITE RABBIT SH F http://www.petharbor.com/get_image.asp?res=DET... NaN
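To make the "never lose old records" guarantee survive between sessions, you can persist the merged frame at the end of a run and reload it as the base next time, then call get_update on it. A sketch with a tiny stand-in frame (the filename and values are arbitrary):

```python
import os
import pandas as pd

# Tiny stand-in for the merged df_update frame (hypothetical values)
df_merged = pd.DataFrame({'animalid': ['A1'], 'petname': ['REX']})

path = 'adoptable_pets.csv'  # arbitrary filename

# Save the merged result at the end of a session...
df_merged.to_csv(path, index=False)

# ...and load it back as the base frame next time
df_base = pd.read_csv(path)
print(df_base.shape)  # (1, 2)

os.remove(path)  # tidy up the sketch's temporary file
```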

And there you have it. To download this Jupyter Notebook, go to my GitHub repo.