Data Visualization with Plotly and Pandas

This example will show you how to leverage Plotly’s API for Python (and Pandas) to visualize data from a Socrata dataset. We’ll be using Plotly’s recently open sourced library and connecting it to a IPython/Pandas setup with cufflinks. Cufflinks patches Pandas so that you can visualize straight from a dataframe object(Very convenient!).

Let’s start by importing libraries…

import pandas as pd
import numpy as np
import matplotlib
import cufflinks as cf
import plotly
import plotly.offline as py
import plotly.graph_objs as go

cf.go_offline() # required to use plotly offline (no account required).
py.init_notebook_mode() # graphs charts inline (IPython).

We’ll be taking a look at NYPD’s Motor Vehicle Collisions dataset. The dataset contains 3 years of data (from 2012 to 2015) and gets constantly updated. It has very valuable information like the coordinates where the incident happened, the borough, amount of injured people and more. I’m only interested in last year’s data, so I’ll factor that into my query below using SoQL:

url = 'https://data.cityofnewyork.us/resource/qiz3-axqb.json?$limit=1000000&\
$where=date%20between%20%272014-01-01T00:00:00%27%20and%20%272015-01-01T00:00:00%27'
collisions = pd.read_json(url)

Now that we got our data, let’s list the columns and see what we have to work with:

collisions.columns

Index(['borough', 'contributing_factor_vehicle_1',
       'contributing_factor_vehicle_2', 'contributing_factor_vehicle_3',
       'contributing_factor_vehicle_4', 'contributing_factor_vehicle_5',
       'cross_street_name', 'date', 'latitude', 'location', 'longitude',
       'number_of_cyclist_injured', 'number_of_cyclist_killed',
       'number_of_motorist_injured', 'number_of_motorist_killed',
       'number_of_pedestrians_injured', 'number_of_pedestrians_killed',
       'number_of_persons_injured', 'number_of_persons_killed',
       'off_street_name', 'on_street_name', 'time', 'unique_key',
       'vehicle_type_code1', 'vehicle_type_code2', 'vehicle_type_code_3',
       'vehicle_type_code_4', 'vehicle_type_code_5', 'zip_code'],
      dtype='object')

Let’s look at the contributing factors of vehicle collisions. The factors are inconveniently divided into 5 columns, however pandas’ concat method should help us concatenate them into one:

contributing_factors = pd.concat(
          [collisions.contributing_factor_vehicle_1,
           collisions.contributing_factor_vehicle_2,
           collisions.contributing_factor_vehicle_3,
           collisions.contributing_factor_vehicle_4,
           collisions.contributing_factor_vehicle_5])

Now let’s plot! Cufflinks conviniently connects plotly to the iplot method in my dataframe. Let’s plot the occurence of each factor in a bar chart:

contributing_factors.value_counts().iplot(kind='bar')

That’s a nice and fast way to visuzlie this data, but there is room for improvement: Plotly charts have two main components, Data and Layout . These components are very customizable. Let’s recreate the bar chart in a horizontal orientation and with more space for the labels. Also, let’s get rid of the Unspecified values.

temp = pd.DataFrame({'contributing_factors':contributing_factors.value_counts()})
df = temp[temp.index != 'Unspecified']
df = df.sort_values(by='contributing_factors', ascending=True)
data  = go.Data([
            go.Bar(
              y = df.index,
              x = df.contributing_factors,
              orientation='h'
        )])
layout = go.Layout(
        height = '1000',
        margin=go.Margin(l=300),
        title = "Contributing Factors for Vehicle Collisions in 2015"
)
fig  = go.Figure(data=data, layout=layout)
py.iplot(fig)

Now let’s look at incidents over time. I’m gonna transform the date column into an actual date object so that plotly is able to graph it in a time series. In addition, we want to make sure that the df is sorted by date:

collisions.date = pd.to_datetime(collisions.date)
collisions.date.sort_values().index
df_by_date = collisions.iloc[collisions.date.sort_values().index]

Now we can use the .groupby method to aggregate incidents by date as well as sum deaths per day. And again, plotting them is as easy as calling the .plot method in our dataframe.

collisions_by_date = df_by_date.groupby('date').date.count()
collisions_by_date.iplot(kind='scatter', title='Collisions Per Day')

deaths_by_date = df_by_date.groupby('date')['number_of_persons_killed'].sum()
deaths_by_date.iplot(kind='bar', title='Deaths per Day')

Finally, the previous charts can be merged into one, making good use of the data and layout components:

colli_deaths = pd.DataFrame({'collisions':collisions_by_date, 'deaths':deaths_by_date })

color1 = '#9467bd'
color2 = '#F08B00'

trace1 = go.Scatter(
    x = colli_deaths.index,
    y = colli_deaths['collisions'],
    name='collisions',
    line = dict(
        color = color1
    )
)
trace2 = go.Scatter(
    x= colli_deaths.index,
    y =colli_deaths['deaths'] ,
    name='deaths',
    yaxis='y2',
    mode='markers'

)
data = [trace1, trace2]
layout = go.Layout(
    title= "Collisions and Deaths per day",
    yaxis=dict(
        title='collisions',
        titlefont=dict(
            color=color1
        ),
        tickfont=dict(
            color=color1
        )
    ),
    yaxis2=dict(
        title='deaths',
        overlaying='y',
        side='right',
        titlefont=dict(
            color=color2
        ),
        tickfont=dict(
            color=color2
        )

    )

)
fig = go.Figure(data=data, layout=layout)
plot_url = py.iplot(fig)

And there you have it.

Data Visualization with Plotly and Pandas

jsanch

Related Posts