Learn how to use group-by aggregation to uncover insights from your data
Exploratory Data Analysis (EDA) is the core competency of a data analyst. Every day, data analysts are tasked with seeing the “unseen,” or extracting useful insights from a vast ocean of data.
In this regard, I'd like to share a technique that I find beneficial for extracting relevant insights from data: group-by aggregation.
To this end, the rest of this article will be arranged as follows:
1. Explanation of group-by aggregation in pandas
2. The dataset: Metro Interstate Traffic Volume
3. Metro Traffic EDA
Group-By Aggregation
Group-by aggregation is a data manipulation technique that consists of two steps. First, we group the data based on the values of specific columns. Second, we perform some aggregation operations on top of the grouped data.
Group-by aggregation is especially useful when our data is granular, as in typical fact tables (transactions data) and time series data with narrow intervals. By aggregating at a higher level than raw data granularity, we can represent the data in a more compact way — and may distill useful insights in the process.
In pandas, we can perform group-by aggregation using the following general syntax form.
df.groupby(['base_col']).agg(
    agg_col=('ori_col', 'agg_func')
)
Here, base_col is the column whose values become the grouping basis, and agg_col is the new column created by applying the agg_func aggregation to the ori_col column.
For example, consider the famous Titanic dataset, whose first five rows are displayed below.
import pandas as pd
import seaborn as sns
# import titanic dataset
titanic = sns.load_dataset("titanic")
titanic.head()
Titanic data's first 5 rows (Image by Author)
We can group this data by the survived column and then aggregate it by taking the median of the fare column to get the results below.
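The snippet for this step might look like the following (a minimal sketch; the variable name survival_fare is just for illustration).
# group by survival status, then take the median fare per group
survival_fare = titanic.groupby('survived', as_index=False).agg(
    median_fare=('fare', 'median')
)
survival_fare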
Median fare of titanic passengers, by survival status (Image by Author)
Immediately, we see an interesting insight: passengers who survived paid a median fare more than double that of those who did not. This could be related to lifeboats being prioritized for higher cabin classes (i.e., passengers with more expensive tickets).
Hopefully, this simple example demonstrates the potential of group-by aggregation for gathering insights from data. Now, let's try it on a more interesting dataset!
The Dataset
We will use the Metro Interstate Traffic Volume dataset. It's a publicly available dataset with a Creative Commons 4.0 license, which allows sharing and adaptation of the dataset for any purpose.
The dataset contains hourly traffic volume for westbound I-94 in Minneapolis–St Paul, MN, along with weather details, covering 2012–2018. The data dictionary can be found on its UCI Machine Learning Repository page.
import pandas as pd
# load dataset
df = pd.read_csv("dir/to/Metro_Interstate_Traffic_Volume.csv")
# convert date_time column from object to proper datetime format
df['date_time'] = pd.to_datetime(df['date_time'])
# head
df.head()
Traffic data (df) head (Image by Author)
For this blog demo, we will only use data from 2016 onwards, since earlier periods have missing traffic data (try checking this yourself as an exercise!).
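One quick way to spot the gap (a rough sketch, not from the original post) is to count hourly rows per year; a complete year of hourly data should have roughly 8,760 rows, so much smaller counts point to missing periods (keep in mind some hours may appear more than once if multiple weather records exist for the same timestamp).
# count records per year; unusually small counts hint at missing data
df['date_time'].dt.year.value_counts().sort_index()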
Furthermore, we will add a new column is_congested, which will have a value of 1 if the traffic_volume exceeds 5000 and 0 otherwise.
# only consider 2016 onwards data
df = df.loc[df['date_time'] >= "2016-01-01", :]
# feature engineering is_congested column
df['is_congested'] = df['traffic_volume'].apply(lambda x: 1 if x > 5000 else 0)
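As a side note, the same flag can also be built without apply, using a vectorized comparison (an equivalent alternative, not the original post's code).
# vectorized equivalent of the is_congested flag
df['is_congested'] = (df['traffic_volume'] > 5000).astype(int)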
Metro Traffic EDA
Using group-by aggregation as the main weapon, we will try to answer the following analysis questions.
1. How does traffic volume progress from month to month?
2. What is the traffic profile of each day of the week (Monday, Tuesday, etc.)?
3. What does typical hourly traffic volume look like across 24 hours, broken down by weekday vs. weekend?
4. Which weather conditions correspond to the highest congestion rates?
Monthly progression of traffic volume
This question requires us to aggregate (sum) traffic volumes at the month level. Because we don't have a month column, we need to derive one from the date_time column.
With the month column in place, we can group by it and take the sum of traffic_volume. The code is given below.
# create month column based on date_time
# sample values: 2016-01, 2016-02
df['month'] = df['date_time'].dt.to_period("M")
# get sum of traffic_volume by month
monthly_traffic = df.groupby('month', as_index=False).agg(
    total_traffic=('traffic_volume', 'sum')
)
# convert month column to string for viz
monthly_traffic['month'] = monthly_traffic['month'].astype(str)
monthly_traffic.head()
monthly_traffic head (Image by Author)
We can draw a line plot from this dataframe!
import matplotlib.pyplot as plt
# draw time series plot
plt.figure(figsize=(12,5))
sns.lineplot(data=monthly_traffic, x="month", y="total_traffic")
plt.xticks(rotation=90)
plt.title("Monthly Traffic Volume")
plt.show()
Monthly traffic volume (Image by Author)
The above visualization shows that traffic volume has generally increased over the months within the considered data period.
Daily traffic profile
To analyze this, we need to create two additional columns: date and dayname. The former is used as the primary group-by basis, whereas the latter is used as a breakdown when displaying the data.
In the following code, we define the date and dayname columns, then group by both columns to get the sum of traffic_volume. Note that since dayname is coarser (a higher aggregation level) than date, grouping by both effectively means we aggregate at the date level.
# create column date from date_time
# sample values: 2016-01-01, 2016-01-02
df['date'] = df['date_time'].dt.to_period('D')
# create dayname column
# sample values: Monday, Tuesday
df['dayname'] = df['date_time'].dt.day_name()
# get sum of traffic, at date level
daily_traffic = df.groupby(['dayname', 'date'], as_index=False).agg(
    total_traffic=('traffic_volume', 'sum')
)
# map dayname to number for viz later
dayname_map = {
    'Monday': 1,
    'Tuesday': 2,
    'Wednesday': 3,
    'Thursday': 4,
    'Friday': 5,
    'Saturday': 6,
    'Sunday': 7
}
daily_traffic['dayname_index'] = daily_traffic['dayname'].map(dayname_map)
daily_traffic = daily_traffic.sort_values(by='dayname_index')
daily_traffic.head()
daily_traffic head (Image by Author)
The above table contains multiple realizations of daily total traffic volume for each day name. A box plot is an appropriate way to show these variations, allowing us to see how traffic volume differs on Monday, Tuesday, and so on.
# draw boxplot per day name
plt.figure(figsize=(12,5))
sns.boxplot(data=daily_traffic, x="dayname", y="total_traffic")
plt.xticks(rotation=90)
plt.title("Daily Traffic Volume")
plt.show()
The above plot shows that all weekdays (Mon–Fri) have roughly the same traffic volume. Weekend days (Saturday and Sunday) have lower traffic, with Sunday being the lower of the two.
Hourly traffic patterns, broken down by weekend status
As with the previous questions, we need to engineer two new columns to answer this one: hour and is_weekend.
Using the same trick, we will group by the is_weekend and hour columns to get the average of traffic_volume.
# extract hour digit from date_time
# sample values: 1,2,3
df['hour'] = df['date_time'].dt.hour
# create is_weekend flag based on dayname
df['is_weekend'] = df['dayname'].apply(lambda x: 1 if x in ['Saturday', 'Sunday'] else 0)
# get average traffic at hour level, broken down by is_weekend flag
hourly_traffic = df.groupby(['is_weekend', 'hour'], as_index=False).agg(
    avg_traffic=('traffic_volume', 'mean')
)
hourly_traffic.head()
hourly_traffic head (Image by Author)
For the visualization, we can use a bar chart with a breakdown by the is_weekend flag.
# draw as barplot with hue = is_weekend
plt.figure(figsize=(20,6))
sns.barplot(data=hourly_traffic, x='hour', y='avg_traffic', hue='is_weekend')
plt.title("Average Hourly Traffic Volume: Weekdays (blue) vs Weekend (orange)", fontsize=14)
plt.show()
Hourly traffic pattern, by weekend status (Image by Author)
Very interesting and rich visualization! Observations:
Weekday traffic has a bimodal pattern. It peaks between 6 and 8 a.m. and again between 4 and 5 p.m. (16:00–17:00). This is intuitive, as those windows correspond to people commuting to and from work.
Weekend traffic follows a completely different pattern: a unimodal shape with a broad peak window (12:00–17:00). Although weekend traffic is generally lower than weekday traffic at equivalent hours, it is actually higher during late-night hours (22:00–02:00). This could be because people stay out late on weekend nights.
Top weather associated with congestion
To answer this question, we need to calculate the congestion rate for each weather condition in the dataset (using the is_congested column). Can we calculate it with group-by aggregation? Yes, we can!
The key observation is that the is_congested column is binary. Thus, the congestion rate can be calculated by simply averaging this column! The average of a binary column equals the count of 1s divided by the total number of rows, which is exactly the congestion rate.
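A tiny toy example with made-up numbers (not from the dataset) to illustrate the point:
# the mean of a binary column is the share of 1s
flags = pd.Series([1, 0, 0, 1, 0])
print(flags.mean())  # 0.4, i.e., 2 ones out of 5 rows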
Based on this neat observation, all we need to do is take the average (mean) of is_congested grouped by weather_description, and then sort the results in descending order of congested_rate.
# rate of congestion (is_congested) , grouped by weather description
congested_weather = df.groupby('weather_description', as_index=False).agg(
    congested_rate=('is_congested', 'mean')
).sort_values(by='congested_rate', ascending=False, ignore_index=True)
congested_weather.head()
congested_weather head (Image by Author)
# draw as barplot
plt.figure(figsize=(20,6))
sns.barplot(data=congested_weather, x='weather_description', y='congested_rate')
plt.xticks(rotation=90)
plt.title('Top Weather with High Congestion Rates')
plt.show()
Top weather based on congestion rate (Image by Author)
From the graph:
The top three weather conditions with the highest congestion rates are sleet, light shower snow, and very heavy rain.
Meanwhile, light rain and snow, thunderstorms with drizzle, freezing rain, and squalls are not associated with any congestion. People must be staying indoors during such extreme weather!
Closing
In this blog post, we covered how to use group-by aggregation in EDA exercises. As we have seen, this technique is highly effective at revealing interesting, useful insights, particularly when dealing with granular data.
I hope you get to practice group-by aggregation in your next EDA project! Thanks for reading, and feel free to connect with me on LinkedIn! 👋
