This example will show you how to leverage Plotly’s API for Python (and Pandas) to visualize data from a Socrata dataset. We’ll be using Plotly’s recently open-sourced library and connecting it to an IPython/Pandas setup with cufflinks. Cufflinks patches Pandas so that you can visualize straight from a dataframe object (very convenient!).
Let’s start by importing libraries…
We’ll be taking a look at NYPD’s Motor Vehicle Collisions dataset. The dataset contains 3 years of data (from 2012 to 2015) and is updated continuously. It has very valuable information, like the coordinates where the incident happened, the borough, the number of people injured, and more. I’m only interested in last year’s data, so I’ll factor that into my query below using SoQL:
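As a rough sketch of what such a query can look like, the snippet below builds a SODA request URL with a SoQL `$where` filter on the `date` column. The domain, the dataset identifier, the date window, and the helper name `build_soql_url` are all assumptions made for illustration; check them against the dataset page before relying on them. The snippet only builds the URL and makes no network call.

```python
from urllib.parse import urlencode

# Assumptions, not taken from the original post: the Socrata domain, the
# dataset identifier, and the date window below are illustrative placeholders.
DOMAIN = "data.cityofnewyork.us"
DATASET_ID = "h9gi-nx95"


def build_soql_url(start, end, limit=50000):
    """Return a SODA URL selecting rows whose `date` falls in [start, end)."""
    params = {
        # SoQL filter: keep only rows inside the requested date window.
        "$where": f"date >= '{start}' AND date < '{end}'",
        "$order": "date",
        "$limit": limit,
    }
    # urlencode percent-escapes the SoQL text so it is safe inside a URL.
    return f"https://{DOMAIN}/resource/{DATASET_ID}.json?{urlencode(params)}"


url = build_soql_url("2014-01-01T00:00:00", "2015-01-01T00:00:00")
print(url)
```

A URL like this can then be handed to `pandas.read_json(url)` to get a dataframe, assuming the endpoint is reachable and the column is still called `date`.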
Now that we have our data, let’s list the columns and see what we have to work with:
Index(['borough', 'contributing_factor_vehicle_1',
'contributing_factor_vehicle_2', 'contributing_factor_vehicle_3',
'contributing_factor_vehicle_4', 'contributing_factor_vehicle_5',
'cross_street_name', 'date', 'latitude', 'location', 'longitude',
'number_of_cyclist_injured', 'number_of_cyclist_killed',
'number_of_motorist_injured', 'number_of_motorist_killed',
'number_of_pedestrians_injured', 'number_of_pedestrians_killed',
'number_of_persons_injured', 'number_of_persons_killed',
'off_street_name', 'on_street_name', 'time', 'unique_key',
'vehicle_type_code1', 'vehicle_type_code2', 'vehicle_type_code_3',
'vehicle_type_code_4', 'vehicle_type_code_5', 'zip_code'],
dtype='object')
Let’s look at the contributing factors of vehicle collisions. The factors are inconveniently divided into 5 columns; however, pandas’ concat function should help us combine them into one:
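A minimal sketch of that step, assuming the five factor columns are stacked into a single Series and then counted. The toy dataframe below is made up and only mimics the column names from the listing above.

```python
import pandas as pd

# Toy stand-in for the collisions dataframe; the rows are made up.
df = pd.DataFrame({
    "contributing_factor_vehicle_1": ["Driver Inattention/Distraction", "Unspecified", "Fatigued/Drowsy"],
    "contributing_factor_vehicle_2": ["Unspecified", "Unspecified", "Driver Inattention/Distraction"],
    "contributing_factor_vehicle_3": [None, None, "Unspecified"],
    "contributing_factor_vehicle_4": [None, "Unspecified", None],
    "contributing_factor_vehicle_5": ["Unspecified", None, None],
})

factor_cols = [f"contributing_factor_vehicle_{i}" for i in range(1, 6)]

# Stack the five columns end to end, then drop the empty slots.
factors = pd.concat([df[col] for col in factor_cols], ignore_index=True).dropna()

# Count how often each factor appears across all vehicles.
factor_counts = factors.value_counts()
print(factor_counts)
```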
Now let’s plot! Cufflinks conveniently adds an iplot method to my dataframe that connects it to Plotly. Let’s plot the occurrence of each factor in a bar chart:
That’s a nice and fast way to visualize this data, but there is room for improvement. Plotly charts have two main components, Data and Layout, and both are very customizable. Let’s recreate the bar chart in a horizontal orientation and with more space for the labels. Also, let’s get rid of the Unspecified values.
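One way to picture the Data/Layout split is as a plain dictionary with a `data` list of traces and a `layout` mapping. The sketch below is not the original notebook code: the counts are made up, and it only builds the figure description (using the `orientation` trace attribute and the `margin.l` layout attribute) rather than rendering it.

```python
# Made-up factor counts, standing in for the result of value_counts().
factor_counts = {
    "Unspecified": 6,
    "Driver Inattention/Distraction": 2,
    "Fatigued/Drowsy": 1,
}

# Drop the "Unspecified" bucket and sort ascending; horizontal bars are drawn
# bottom to top, so the largest bar ends up at the top of the chart.
filtered = sorted(
    ((name, count) for name, count in factor_counts.items() if name != "Unspecified"),
    key=lambda item: item[1],
)

# Data: what gets drawn. For a horizontal bar chart, x holds the values
# and y holds the category labels.
data = [{
    "type": "bar",
    "orientation": "h",
    "x": [count for _, count in filtered],
    "y": [name for name, _ in filtered],
}]

# Layout: how it is drawn. A wide left margin leaves room for long labels.
layout = {"margin": {"l": 300}}

figure = {"data": data, "layout": layout}
print(figure)
```

A dictionary of this shape can be passed to Plotly for rendering, for example through `plotly.graph_objects.Figure(figure)` in recent versions of the library.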
Now let’s look at incidents over time. I’m going to convert the date column into actual datetime objects so that Plotly is able to graph it as a time series. In addition, we want to make sure that the dataframe is sorted by date:
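A small sketch of the conversion and sort, using a made-up three-row dataframe with dates stored as strings (which is how they typically arrive from a JSON endpoint):

```python
import pandas as pd

# Toy rows, deliberately out of order; the values are made up.
df = pd.DataFrame({
    "date": ["2014-03-02T00:00:00", "2014-01-15T00:00:00", "2014-01-15T00:00:00"],
    "unique_key": ["3", "1", "2"],
})

# Parse the strings into real datetime values.
df["date"] = pd.to_datetime(df["date"])

# Sort chronologically and renumber the rows.
df = df.sort_values("date").reset_index(drop=True)
print(df.dtypes)
```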
Now we can use the .groupby method to aggregate incidents by date as well as sum deaths per day. And again, plotting them is as easy as calling the .iplot method on our dataframe.
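The aggregation could look roughly like this. The rows are made up, and the choice to count `unique_key` for incidents and sum `number_of_persons_killed` for deaths is an assumption about which columns the original used.

```python
import pandas as pd

# Toy rows; the values are made up.
df = pd.DataFrame({
    "date": pd.to_datetime(["2014-01-15", "2014-01-15", "2014-03-02"]),
    "unique_key": ["1", "2", "3"],
    "number_of_persons_killed": ["1", "0", "0"],
})

# Assumption: numeric fields arrive as strings, so convert them before summing.
df["number_of_persons_killed"] = pd.to_numeric(df["number_of_persons_killed"])

# One row per day: number of incidents and total deaths.
daily = df.groupby("date").agg(
    incidents=("unique_key", "count"),
    deaths=("number_of_persons_killed", "sum"),
)
print(daily)
```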
Finally, the previous charts can be merged into one, making good use of the data and layout components:
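As a hedged illustration of such a merge, the figure description below puts two traces in `data` and uses `layout` to give the second trace its own y-axis on the right. The daily numbers are made up, and the choice of a line for incidents and bars for deaths is an assumption, not necessarily what the original chart used.

```python
# Made-up daily aggregates standing in for the grouped dataframe.
dates = ["2014-01-15", "2014-01-16", "2014-01-17"]
incidents = [2, 5, 3]
deaths = [1, 0, 0]

# Data: one trace per series. The second trace is bound to a secondary y-axis.
data = [
    {"type": "scatter", "mode": "lines", "name": "Incidents", "x": dates, "y": incidents},
    {"type": "bar", "name": "Deaths", "x": dates, "y": deaths, "yaxis": "y2"},
]

# Layout: overlay the secondary axis on the first one and draw it on the right.
layout = {
    "yaxis2": {"overlaying": "y", "side": "right"},
}

figure = {"data": data, "layout": layout}
print(figure)
```

Giving deaths a separate axis keeps the much smaller numbers readable next to the incident counts.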
And there you have it.