How to update a Pandas data frame using an API
Posted on Sun 17 January 2021 in jupyter-notebook
We are going to use data from dataMontgomery, Montgomery County, MD's public data website, to demonstrate how to:
- Download and Load a .csv File into a Pandas Dataframe Automatically Using the URL
- Access Data from an API in JSON Format
- Combine these two methods to update our base dataframe with updates from an API
Import the Necessary Python Libraries
import json
import numpy as np
import pandas as pd
import requests
1. Download and Load a .csv File into a Pandas Dataframe Automatically Using the URL
The Pandas pd.read_csv function is very powerful in that it will read a file directly from your computer if you provide a file name (including the file path if the file is not located in the same directory as your Jupyter Notebook). You are all probably very familiar with this use of pd.read_csv.
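For comparison, the familiar local-file version looks something like this sketch (the path is just a hypothetical example):
# Hypothetical local path -- adjust to wherever the file lives on your machine
df_local = pd.read_csv('data/adoptable_pets.csv')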
You can also pull in a csv file by providing a URL in place of a file name. In this example, we will use the Adoptable Pets dataset from dataMontgomery. The advantage of this approach is that you can skip the step of downloading the file to your computer.
Either of these methods works well if your data file is relatively small.
url = 'https://data.montgomerycountymd.gov/api/views/e54u-qx42/rows.csv?accessType=DOWNLOAD'
df_csv = pd.read_csv(url)
df_csv.head()
2. Access Data from an API in JSON Format
The next method we will use is via an API. The dataMontgomery site provides access to all its datasets via API as well as file download. This is really convenient if the data is updated frequently and we just need the most recently updated records. This is especially true if the data file we are downloading is very large. There are two ways we can do this.
2a. The first is to use the Pandas read_json command instead of read_csv.
df_api = pd.read_json('https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
df_api.head()
2b. The second way is to use the Requests library, which allows us to communicate over HTTP and then read the response as JSON with the response object's .json() method. Of course, with this method we need to create the dataframe ourselves using the pd.DataFrame constructor on the JSON object.
This is a little more verbose, but it gives you a good sense of exactly what is happening in the background. Also, you won't always want to pull your data into a Pandas dataframe, so it is good practice to know how to use the Requests library for this task: I could just as easily assign the JSON object to any variable and not convert it to a dataframe.
df_api2 = pd.DataFrame(requests.get('https://data.montgomerycountymd.gov/resource/e54u-qx42.json').json())
df_api2.head()
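As a minimal sketch of that last point, here is the same request kept as a plain Python list of dicts (no dataframe at all), with a basic status check added:
response = requests.get('https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
response.raise_for_status()   # raise an error if the request failed
pets = response.json()        # a plain Python list of dicts
pets[0]                       # peek at the first record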
Let's find the shape (number of rows and columns) of our original dataframe.
df_csv.shape
Convert the In Date column to datetime
Although both sources represent the same dataset, there are some subtle differences, probably just due to when each version was created, or perhaps because they were created by different developers.
The first thing we notice is that the columns are labeled differently: Animal ID vs. animalid, Intake Type vs. intype, In Date vs. indate, etc. We will fix this later. For now, the only thing we need to do is change the In Date column in df_csv to a datetime type so we can use it properly in any further analysis. We could have done this straight away when we first imported the file, but since we didn't, let's do it here.
df_csv['In Date'] = pd.to_datetime(df_csv['In Date'])
df_csv.info()
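For reference, here is how we could have handled this at import time instead, using read_csv's parse_dates argument (a sketch using the same URL as above):
# Parse the In Date column while reading, instead of converting afterwards
df_csv = pd.read_csv(url, parse_dates=['In Date'])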
3. Combine these two methods to update our base dataframe with updates from an API
As we can see from the cell above, our dataframe in this case is not too large, so we could pull it down in its entirety each time we wanted an update, using either the .csv or the JSON method, without any trouble. But what if we had a file with hundreds of thousands or even millions of records? We wouldn't want to wait for such a large download every time we wanted an update. And what if, as updates occur, our data file does not keep growing but is instead capped at, say, a million records? With each update we would lose the earliest records in our dataframe.
To avoid these issues, we can load our data once via .csv and then append new records retrieved from the API using concat.
This is where combining these two methods comes in handy.
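A quick aside before we start: the resource URL we are hitting looks like a Socrata SODA endpoint, and those endpoints generally accept SoQL query parameters such as $order and $limit. Assuming that holds here (an assumption on my part, not something demonstrated in this post), you could even ask the API directly for just the most recent records, something like this sketch:
# Hypothetical: request only the 100 most recently admitted animals.
# Assumes the endpoint honors Socrata SoQL parameters ($order, $limit).
params = {'$order': 'indate DESC', '$limit': 100}
recent = requests.get('https://data.montgomerycountymd.gov/resource/e54u-qx42.json',
                      params=params).json()
df_recent = pd.DataFrame(recent)
For the rest of this post, though, we'll stick with the plain endpoint and handle duplicates on the Pandas side.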
First, let's change all of the df_csv column names to match the names used in df_api. Notice there is not much rhyme or reason to the column names between the two data sets. There is even a trailing space after URL Link! Arggggg. This is why data wrangling takes 80% of your time :-(
df_csv = df_csv.rename(columns={'Animal ID': 'animalid',
'Intake Type': 'intype',
'In Date': 'indate',
'Pet name': 'petname',
'Animal Type': 'animaltype',
'Pet Age': 'petage',
'Pet Size': 'petsize',
'Color': 'color',
'Breed': 'breed',
'Sex': 'sex',
'URL Link ': 'url',
'Crossing': 'crossing'
})
df_csv.info()
Now we will pull in the API JSON data again, and this time we will drop the last 5 columns, as these do not appear to contain any valuable data and are mostly NaNs. We'll also convert the indate column to a proper datetime data type.
df_api = pd.read_json('https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
df_api.drop(df_api.columns[12:17], axis=1, inplace=True)  # drop the last 5 columns by position
df_api['indate'] = pd.to_datetime(df_api['indate'])
df_api.info()
Use concat to Combine the Dataframes
First we will use the concat function to combine the dataframes row-wise, keeping the columns consistent and stacking one dataframe on top of the other. This is why we had to standardize the column names for each dataframe.
We'll use the join='inner' option to keep from repeating columns. The inner join operates on the axis we are NOT concatenating along. The default is axis=0, which stacks rows, so by setting join='inner' we keep just the columns the two dataframes have in common.
df_new = pd.concat([df_csv, df_api], join='inner', copy=False)
df_new.shape
Hmm, seems our dataframe has doubled in terms of the number of records.
This is because concat does not eliminate duplicates. To do this we need to use drop_duplicates. Since we know that animalid is unique for this data set, we can specify that column to ensure we drop records with the same animalid. Note that drop_duplicates keeps the first occurrence by default, so when a record appears in both dataframes, the version from df_csv (listed first in the concat) is the one that survives.
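If you want to confirm that the extra rows really are duplicates rather than new animals, a quick check (a sketch, not part of the original notebook) is to count repeated animalid values:
# Number of rows whose animalid has already appeared earlier in df_new
df_new['animalid'].duplicated().sum()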
Let's also sort the dataframe on the indate column in descending order so the most recent records appear first.
df_new = df_new.drop_duplicates(subset=['animalid'])
df_new.sort_values(by='indate', ascending=False).head()
Create a get_update Function
Now we have our updated dataframe df_new, and we can re-run the code in the last three cells each time we want to refresh the data, ensuring we have the most up-to-date records while also ensuring we are not dropping any records we retrieved in the past.
If we want to get fancy, we can create a function that combines all of these steps, taking our base dataframe and the API endpoint as its arguments.
def get_update(df, api_endpoint):
    # Pull the latest records from the API
    df_api = pd.read_json(api_endpoint)
    df_api['indate'] = pd.to_datetime(df_api['indate'])
    # Stack the new records under our existing dataframe, keeping only the common columns
    df_new = pd.concat([df, df_api], join='inner', copy=False)
    # Drop any records we already had, keeping the earlier copy
    df_new = df_new.drop_duplicates(subset='animalid')
    return df_new.sort_values(by='indate', ascending=False)
df_update = get_update(df_csv, 'https://data.montgomerycountymd.gov/resource/e54u-qx42.json')
df_update.head()
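Since the whole point is to hold on to every record we have ever retrieved, you will probably also want to persist the combined dataframe between sessions. A minimal sketch (the filename is just a hypothetical example):
# Hypothetical filename; save the combined data so the next session can start from it
df_update.to_csv('adoptable_pets_combined.csv', index=False)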
And there you have it. To download this Jupyter Notebook, go to my GitHub repo.