How to update a Pandas data frame using an API
Posted on Sun 17 January 2021 in jupyter-notebook
We are going to use data from dataMontgomery, Montgomery County, MD's public data website, to demonstrate how to:
- Download and Load a .csv File into a Pandas Dataframe Automatically Using the URL
- Access Data from an API in JSON Format
- Combine these two methods to update our base dataframe with updates from an API
Import the Necessary Python Libraries
import json
import numpy as np
import pandas as pd
import requests
1. Download and Load a .csv File into a Pandas Dataframe Automatically Using the URL
The Pandas pd.read_csv function is very powerful in that it will read a file directly from your computer if you provide a file name (including the file path if the file is not located in the same directory as your Jupyter Notebook). You are all probably very familiar with this use of pd.read_csv.
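For comparison, the familiar local-file version looks something like this sketch (the path is just a hypothetical example):
# Hypothetical local path -- adjust to wherever the file lives on your machine
df_local = pd.read_csv('data/adoptable_pets.csv')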
You can also pull in a csv file by providing a URL in place of a file name. In this example, we will use the Adoptable Pets dataset from dataMontgomery. The advantage of this approach is that you can skip the step of downloading the file to your computer.
Either of these methods works well if your data file is relatively small.
url = 'https://data.montgomerycountymd.gov/api/views/e54u-qx42/rows.csv?accessType=DOWNLOAD'
df_csv = pd.read_csv(url)
df_csv.head()
2. Access Data from an API in JSON Format
The next method we will use is via an API. The dataMontgomery site provides access to all its datasets via API as well as file download. This is really convenient if the data is updated frequently and we just need the most recently updated records. This is especially true if the data file we are downloading is very large. There are two ways we can do this.
2a. The first is to use the Pandas read_json command instead of read_csv.
df_api = pd.read_json('https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
df_api.head()
2b. The second way is to use the Requests library, which allows us to communicate over HTTP and then read the response as JSON with the response object's .json() method. Of course, with this method we need to create the dataframe ourselves using the pd.DataFrame constructor on the JSON object.
This is a little more verbose, but it gives you a good sense of exactly what is happening in the background. Also, you won't always want to pull your data into a Pandas dataframe, so it is good practice to know how to use the Requests library for this task: I could just as easily assign the JSON object to any variable and not convert it to a dataframe.
df_api2 = pd.DataFrame(requests.get('https://data.montgomerycountymd.gov/resource/e54u-qx42.json').json())
df_api2.head()
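As a minimal sketch of that last point, here is the same request kept as a plain Python list of dicts (no dataframe at all), with a basic status check added:
response = requests.get('https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
response.raise_for_status()   # raise an error if the request failed
pets = response.json()        # a plain Python list of dicts
pets[0]                       # peek at the first record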
Let's find the shape (number of rows and columns) of our original dataframe.
df_csv.shape
Convert the In Date column to datetime
Although both sources represent the same dataset, there are some subtle differences, probably just due to when each version was created, or perhaps because they were created by different developers.
The first thing we notice is that the columns are labeled differently: Animal ID vs. animalid, Intake Type vs. intype, In Date vs. indate, etc. We will fix this later. For now, the only thing we need to do is change the In Date column in df_csv to a datetime type so we can use it properly in any further analysis. We could have done this straight away when we first imported the file, but since we didn't, let's do it here.
df_csv['In Date'] = pd.to_datetime(df_csv['In Date'])
df_csv.info()
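For reference, here is how we could have handled this at import time instead, using read_csv's parse_dates argument (a sketch using the same URL as above):
# Parse the In Date column while reading, instead of converting afterwards
df_csv = pd.read_csv(url, parse_dates=['In Date'])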
3. Combine these two methods to update our base dataframe with updates from an API
As we can see from the cell above, our dataframe in this case is not too large, so we could pull it down in its entirety each time we wanted an update, using either the .csv or the JSON method, without any trouble. But what if we had a file with hundreds of thousands or even millions of records? We wouldn't want to wait for such a large download every time we wanted an update. And what if, as updates occur, our data file does not keep growing but is instead capped at, say, a million records? With each update we would lose the earliest records in our dataframe.
To avoid these issues, we can load our data once via .csv and then append new records retrieved from the API using concat.
This is where combining these two methods comes in handy.
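A quick aside before we start: the resource URL we are hitting looks like a Socrata SODA endpoint, and those endpoints generally accept SoQL query parameters such as $order and $limit. Assuming that holds here (an assumption on my part, not something demonstrated in this post), you could even ask the API directly for just the most recent records, something like this sketch:
# Hypothetical: request only the 100 most recently admitted animals.
# Assumes the endpoint honors Socrata SoQL parameters ($order, $limit).
params = {'$order': 'indate DESC', '$limit': 100}
recent = requests.get('https://data.montgomerycountymd.gov/resource/e54u-qx42.json',
                      params=params).json()
df_recent = pd.DataFrame(recent)
For the rest of this post, though, we'll stick with the plain endpoint and handle duplicates on the Pandas side.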
First, let's change all of the df_csv column names to match the names used in df_api. Notice there is not much rhyme or reason to the column names between the two data sets. There is even a trailing space after URL Link! Arggggg. This is why data wrangling takes 80% of your time :-(
df_csv = df_csv.rename(columns={'Animal ID': 'animalid',
'Intake Type': 'intype',
'In Date': 'indate',
'Pet name': 'petname',
'Animal Type': 'animaltype',
'Pet Age': 'petage',
'Pet Size': 'petsize',
'Color': 'color',
'Breed': 'breed',
'Sex': 'sex',
'URL Link ': 'url',
'Crossing': 'crossing'
})
df_csv.info()
Now we will pull in the API JSON data again, and this time we will drop the last 5 columns, as these do not appear to contain any valuable data and are mostly NaNs. We'll also convert the indate column to a proper datetime data type.
df_api = pd.read_json('https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
df_api.drop(df_api.columns[12:17], axis=1, inplace=True)  # drop the last 5 columns by position
df_api['indate'] = pd.to_datetime(df_api['indate'])
df_api.info()
Use concat to Combine the Dataframes
First we will use the concat function to combine the dataframes row-wise, keeping the columns consistent and stacking one dataframe on top of the other. This is why we had to standardize the column names for each dataframe.
We'll use the join='inner' option to keep from repeating columns. The inner join operates on the axis we are NOT concatenating along. The default is axis=0, which stacks rows, so by setting join='inner' we keep just the columns the two dataframes have in common.
df_new = pd.concat([df_csv, df_api], join='inner', copy=False)
df_new.shape
Hmm, seems our dataframe has doubled in terms of the number of records.
This is because concat does not eliminate duplicates. To do this we need to use drop_duplicates. Since we know that animalid is unique for this data set, we can specify that column to ensure we drop records with the same animalid. Note that drop_duplicates keeps the first occurrence by default, so when a record appears in both dataframes, the version from df_csv (listed first in the concat) is the one that survives.
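If you want to confirm that the extra rows really are duplicates rather than new animals, a quick check (a sketch, not part of the original notebook) is to count repeated animalid values:
# Number of rows whose animalid has already appeared earlier in df_new
df_new['animalid'].duplicated().sum()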
Let's also sort the dataframe on the indate column in descending order so the most recent records appear first.
df_new = df_new.drop_duplicates(subset=['animalid'])
df_new.sort_values(by='indate', ascending=False).head()
Create a get_update Function
Now we have our updated dataframe df_new, and we can re-run the code in the last three cells each time we want to refresh the data, ensuring we have the most up-to-date records while also ensuring we are not dropping any records we retrieved in the past.
If we want to get fancy, we can create a function that combines all of these steps, taking our base dataframe and the API endpoint as its arguments.
def get_update(df, api_endpoint):
    # Pull the latest records from the API
    df_api = pd.read_json(api_endpoint)
    df_api['indate'] = pd.to_datetime(df_api['indate'])
    # Stack the new records under our existing dataframe, keeping only the common columns
    df_new = pd.concat([df, df_api], join='inner', copy=False)
    # Drop any records we already had, keeping the earlier copy
    df_new = df_new.drop_duplicates(subset='animalid')
    return df_new.sort_values(by='indate', ascending=False)
df_update = get_update(df_csv, 'https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
df_update.head()
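Since the whole point is to hold on to every record we have ever retrieved, you will probably also want to persist the combined dataframe between sessions. A minimal sketch (the filename is just a hypothetical example):
# Hypothetical filename; save the combined data so the next session can start from it
df_update.to_csv('adoptable_pets_combined.csv', index=False)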
And there you have it. To download this Jupyter Notebook, go to my GitHub repo.