Interactive data visualization with Python

Posted at — Dec 24, 2020

This tutorial builds on previous posts on data analysis and mapping by demonstrating how to create simple interactive charts in Python. For this example we will use NYC’s tree census dataset, accessed through the SODA API. Plotting will be done with Plotly, an open source visualization library for Python.

Let’s start by importing all the necessary libraries:

import os
import pandas as pd
import numpy as np
from sodapy import Socrata
import plotly.express as px
import plotly.graph_objects as go

Next we can use the Socrata Python client (sodapy) to download the city tree data. Below we specify the SODA developer token and the location of the data. By default, the SODA API returns only 1,000 rows per request, so below we add the limit argument and raise it to 600k, approximately the number of rows in the tree census.

client = Socrata("data.cityofnewyork.us", os.environ['nyc_soda_cuny_token'])
results = client.get("uvpi-gqnh", limit=600000)
trees = pd.DataFrame.from_records(results)
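For datasets too large to request comfortably in one call, the rows can also be fetched in pages. The sketch below assumes only that the client exposes a get() method accepting limit, offset, and order keyword arguments, as sodapy's Socrata client does. The fetch_all helper and its page_size and max_rows parameters are hypothetical names chosen for illustration, not part of sodapy.

```python
def fetch_all(client, dataset_id, page_size=50000, max_rows=None):
    """Collect rows from a SODA dataset page by page (illustrative helper)."""
    rows = []
    offset = 0
    while max_rows is None or len(rows) < max_rows:
        # Never request more rows than the caller asked for in total.
        limit = page_size
        if max_rows is not None:
            limit = min(page_size, max_rows - len(rows))
        # A stable sort order keeps pages from overlapping between requests.
        page = client.get(dataset_id, limit=limit, offset=offset, order=":id")
        if not page:
            break
        rows.extend(page)
        offset += len(page)
        # A short page means the dataset has been exhausted.
        if len(page) < limit:
            break
    return rows

# Possible usage with the client created above (requires network access):
# trees = pd.DataFrame.from_records(fetch_all(client, "uvpi-gqnh"))
```

Paging trades one large request for several smaller ones, which can be easier to retry if a request fails partway through.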

Columns in datasets downloaded from Socrata usually arrive as strings (the object dtype in Pandas). Below we convert the numeric columns to integers or floats.

cols_int = ['tree_dbh', 'stump_diam']
cols_float = ['latitude', 'longitude',
              'x_sp', 'y_sp']
for column in cols_int:
    trees[column] = trees[column].astype(int)
for column in cols_float:
    trees[column] = trees[column].astype(float)
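The astype() calls above raise an error if a column contains a value that cannot be parsed as a number. A more forgiving variant, sketched here on a small made-up sample rather than the real download, uses pd.to_numeric with errors='coerce' so that unparseable values become missing values instead.

```python
import pandas as pd

# Made-up stand-in for the downloaded data; real values arrive as strings.
sample = pd.DataFrame({
    'tree_dbh': ['3', '21', 'unknown'],
    'latitude': ['40.72309177', 'n/a', '40.66677776'],
})

# errors='coerce' turns unparseable values into NaN instead of raising.
sample['latitude'] = pd.to_numeric(sample['latitude'], errors='coerce')

# The nullable 'Int64' dtype can hold integers alongside missing values.
sample['tree_dbh'] = (pd.to_numeric(sample['tree_dbh'], errors='coerce')
                        .astype('Int64'))

print(sample.dtypes)
```

Whether silently coercing bad values is acceptable depends on the analysis; counting the resulting missing values with isna().sum() is a quick way to see how much was affected.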

Simple Bar Chart

The first data point we’ll visualize is the number of trees in each of the five boroughs in NYC. There are many ways to do this with Pandas – below we use the value_counts() method on a Series object. This method counts the occurrences of each unique value in the Series, and returns a new Series object where the unique values are the index and the counts are the values.

trees['boroname'].value_counts()

Now that we have the number of trees in each borough, we need the information in a format that is understood by Plotly. Below we turn the resulting Series object above back into a DataFrame using the reset_index() method.

boro_data = trees['boroname'].value_counts().reset_index()
boro_data

index	boroname
Queens	224748
Brooklyn	150555
Staten Island	97500
Bronx	78593
Manhattan	48604

Notice the column names in the above DataFrame are misleading – the boroname column holds the number of trees and the actual names of the boros are in a column called index. We can fix that by using rename() and passing a dictionary that maps the old names to new ones.

boro_data = boro_data.rename(columns={'index':'boro',
                                      'boroname': 'count'})
boro_data

boro	count
Queens	224748
Brooklyn	150555
Staten Island	97500
Bronx	78593
Manhattan	48604
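The column names produced by value_counts().reset_index() depend on the Pandas version: from Pandas 2.0 onward the result already has the columns boroname and count, so the rename() mapping above would no longer match. A sketch that names both columns explicitly, and so should behave the same on either side of that change, is shown below on a small made-up sample (trees_sample is a hypothetical stand-in for the full trees DataFrame).

```python
import pandas as pd

# Made-up stand-in for the full trees DataFrame.
trees_sample = pd.DataFrame({'boroname': ['Queens', 'Queens', 'Bronx']})

boro_data = (
    trees_sample['boroname']
    .value_counts()
    .rename_axis('boro')          # name the index (the borough names)
    .reset_index(name='count')    # name the values column (the counts)
)
print(boro_data)
```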

Now we are ready to pass the data to Plotly for visualization.

Two of the main drivers behind nearly all Plotly visuals are the Figure object and a graph type, in this case Bar. To create a bar chart we place the Bar object inside the figure and specify the X and Y columns by passing the columns of our DataFrame above.

By default, Plotly shows a control bar at the top of each plot – below we turn this off by passing a dictionary to the config argument.

fig = go.Figure(
    go.Bar(
        x=boro_data['boro'], y=boro_data['count'])
)

fig.show(
    config= {'displaylogo': False,
             'displayModeBar': False}
)

The resulting graph is raw and begs for a little customization. Customizing graphs is easy with Plotly’s Layout object. Below we specify a new background color (as an RGBA value), a title, and a width of 500 pixels. We also change the color of the bars themselves through the marker_color argument of the Bar object.

layout = go.Layout(
    plot_bgcolor='rgba(255,255,255,1)',
    title='Number of trees in each borough',
    autosize=False,
    width=500
)

fig = go.Figure(
    go.Bar(
        x=boro_data['boro'], y=boro_data['count'],
        marker_color='rgb(55, 83, 109)'),
    layout=layout
)

fig.show(
    config= {'displaylogo': False,
             'displayModeBar': False}
)

Stacked Bar Charts

Next, to demonstrate how to create a stacked bar chart, we’ll use the groupby() method in Pandas to count the trees in each health category within each borough. Groupby is one of the most powerful concepts in Pandas: it lets you group a DataFrame by one or more columns, in this case borough and health condition. Once grouped, Pandas lets you run aggregation functions on the resulting groups. For example, below we group our DataFrame by boroname and health, and then count the good, fair, and poor trees within each borough.

trees.groupby(['boroname', 'health']).agg({'tree_id':'count'})

As we did previously, below we convert the above groupby result into a dataframe and change the column names.

boro_health = trees.groupby(['boroname', 'health'])\
                   .agg({'tree_id':'count'})\
                   .reset_index()\
                   .rename(columns={'tree_id': 'count'})
boro_health.head()

boroname	health	count
Bronx	Fair	9587
Bronx	Good	61985
Bronx	Poor	2774
Brooklyn	Fair	20298
Brooklyn	Good	118480
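As a side note, the same counts can be produced with size(), which counts rows per group without naming a column to aggregate, and unstack() can pivot the result into one column per health level. The sketch below uses a small made-up sample (trees_sample is a hypothetical stand-in for trees). Note that size() counts every row in a group, whereas count on tree_id counts only rows with a non-missing tree_id.

```python
import pandas as pd

# Made-up stand-in for the full trees DataFrame.
trees_sample = pd.DataFrame({
    'boroname': ['Bronx', 'Bronx', 'Bronx', 'Queens', 'Queens'],
    'health':   ['Good', 'Good', 'Poor', 'Fair', 'Good'],
})

grouped = trees_sample.groupby(['boroname', 'health']).size()

# Long format: one row per (borough, health) pair, as used in this post.
boro_health_long = grouped.reset_index(name='count')

# Wide format: one row per borough, one column per health level.
boro_health_wide = grouped.unstack(fill_value=0)

print(boro_health_long)
print(boro_health_wide)
```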

Now we’re ready to feed the above dataframe into the Plotly Bar object again. However, to create a stacked bar chart, Plotly needs a separate Bar trace for each group. Below we filter the data by health condition, create 3 separate Bar objects within the Figure, and set the barmode layout argument to stack.

fig = go.Figure(
    data=[
        go.Bar(name='Poor',
               x=boro_health['boroname'].loc[boro_health['health']=='Poor'],
               y=boro_health['count'].loc[boro_health['health']=='Poor']),
        go.Bar(name='Fair',
               x=boro_health['boroname'].loc[boro_health['health']=='Fair'],
               y=boro_health['count'].loc[boro_health['health']=='Fair']),
        go.Bar(name='Good',
               x=boro_health['boroname'].loc[boro_health['health']=='Good'],
               y=boro_health['count'].loc[boro_health['health']=='Good'])
])

fig.update_layout(barmode='stack')

fig.show(
    config= {'displaylogo': False,
             'displayModeBar': False}
)

We can improve our code above slightly by using a list comprehension. Rather than manually writing out a Bar object for each of the 3 health conditions, we let the comprehension build them for us.

Further, we add custom colors for the bars and change the width and background color.

health_levels = ['Poor', 'Fair', 'Good']
colors = ['steelblue', 'grey', 'firebrick']

fig = go.Figure(
    data=[
        go.Bar(name=health,
               x=boro_health['boroname'].loc[boro_health['health']==health],
               y=boro_health['count'].loc[boro_health['health']==health],
               marker_color=colors[idx]) for idx, health in enumerate(health_levels)
])

fig.update_layout(barmode='stack',
                  plot_bgcolor='rgba(255,255,255,1)',
                  autosize=False,
                  width=500)

fig.show(
    config= {'displaylogo': False,
             'displayModeBar': False}
)

Scatter Plots

For the final example we’ll create a scatter plot to visualize two variables simultaneously and at a more granular geographic unit, the zip code. We use the groupby method once again to gather two counts: how many trees are alive in each zip code, and how many trees are in poor health. After merging the two results, we keep only the rows that pair the Alive count with the Poor count.

zip_status = trees.groupby(['boroname', 'zipcode', 'status'])\
                  .agg({'status':'count'})\
                  .rename(columns={'status':'count'})\
                  .reset_index()
zip_health = trees.groupby(['boroname', 'zipcode', 'health'])\
                  .agg({'health':'count'})\
                  .rename(columns={'health':'h_count'})\
                  .reset_index()

data = pd.merge(zip_status, zip_health, on=['boroname', 'zipcode'])
data.head()

boroname	zipcode	status	count	health	h_count
Bronx	10451	Alive	2189	Fair	439
Bronx	10451	Alive	2189	Good	1565
Bronx	10451	Alive	2189	Poor	185
Bronx	10451	Dead	109	Fair	439
Bronx	10451	Dead	109	Good	1565

data = data[(data['status']=='Alive') & (data['health']=='Poor')]
data.head()

boroname	zipcode	status	count	health	h_count
Bronx	10451	Alive	2189	Poor	185
Bronx	10452	Alive	2845	Poor	165
Bronx	10453	Alive	2793	Poor	98
Bronx	10454	Alive	1352	Poor	46
Bronx	10455	Alive	1639	Poor	57
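The merge above pairs every status row with every health row for a zip code, and the filter then discards most of those pairs. An alternative sketch, shown on a small made-up sample (trees_sample is a hypothetical stand-in for trees), filters first and merges second, so only the two counts of interest are ever computed. Unlike the result above, it does not carry the status and health columns.

```python
import pandas as pd

# Made-up stand-in for the full trees DataFrame.
trees_sample = pd.DataFrame({
    'boroname': ['Bronx'] * 5,
    'zipcode':  ['10451'] * 3 + ['10452'] * 2,
    'status':   ['Alive', 'Alive', 'Dead', 'Alive', 'Alive'],
    'health':   ['Poor', 'Good', None, 'Poor', 'Poor'],
})

keys = ['boroname', 'zipcode']

# Number of living trees per zip code.
alive = (trees_sample[trees_sample['status'] == 'Alive']
         .groupby(keys).size().reset_index(name='count'))

# Number of trees in poor health per zip code.
poor = (trees_sample[trees_sample['health'] == 'Poor']
        .groupby(keys).size().reset_index(name='h_count'))

# Inner merge keeps zip codes that appear in both counts.
data = pd.merge(alive, poor, on=keys)
print(data)
```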

fig = px.scatter(data, x="count", y="h_count",
                 size="h_count", color="boroname",
                 hover_name="zipcode", log_y=True, size_max=60)
fig.update_layout(plot_bgcolor='rgba(255,255,255,1)')
fig.show(
    config= {'displaylogo': False,
             'displayModeBar': False}
)

Exporting charts

Finally, Plotly has various methods for exporting charts as static images or interactive HTML files.

Exporting images can be done with the write_image() method. The image type is inferred from the file extension.

Note: ensure the Kaleido library is installed by running: pip install -U kaleido

fig.write_image("fig1.png")

Similarly, exporting HTML files can be done with the write_html() method.

fig.write_html("fig1.html")

End!

Location Intelligence

Exploring the use of machine learning and advanced analytics for geospatial and urban applications
