16 Data Reshaping

16.1 Introduction to Data Reshaping

Data reshaping involves transforming the structure of your data without changing the underlying information. Think of it like rearranging furniture in a room—the furniture stays the same, but its organization changes to suit different purposes.

In data analysis, the same dataset can be organized in different ways:

Wide format spreads variables across multiple columns
Long format stacks variables into fewer columns with more rows

Neither format is inherently “better”—the right format depends on what you’re trying to accomplish.

16.1.1 Why Data Reshaping Matters

1. Different Tools Require Different Formats

Visualization libraries (like seaborn and matplotlib) often prefer long format for grouped plotting
Statistical models may require wide format with one row per subject
Machine learning algorithms typically need wide format where each row is an observation and columns are features

2. Facilitates Specific Analyses

Time series comparisons are easier in wide format (comparing 2007 vs 1957 life expectancy)
Grouped aggregations work naturally with long format
Cross-sectional analysis benefits from wide format

3. Simplifies Data Operations

Some calculations are trivial in one format but complex in another
Merging datasets often requires matching formats
Creating summary tables may need pivoting

16.1.2 Understanding Wide vs. Long Format

Let’s use a concrete example to understand these formats. Imagine we have life expectancy data for three countries across three years.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

# Create a simple example dataset in LONG format
long_data = pd.DataFrame({
    'Country': ['USA', 'USA', 'USA', 'Canada', 'Canada', 'Canada', 'Mexico', 'Mexico', 'Mexico'],
    'Year': [2000, 2005, 2010, 2000, 2005, 2010, 2000, 2005, 2010],
    'LifeExp': [76.8, 77.5, 78.5, 79.2, 80.1, 81.0, 74.1, 75.2, 76.5]
})

print("LONG FORMAT (Tidy Data):")
print("="*50)
print(long_data)
print("\nShape:", long_data.shape)
print("Characteristics: More rows, fewer columns, each observation is one row")

LONG FORMAT (Tidy Data):
==================================================
  Country  Year  LifeExp
0     USA  2000     76.8
1     USA  2005     77.5
2     USA  2010     78.5
3  Canada  2000     79.2
4  Canada  2005     80.1
5  Canada  2010     81.0
6  Mexico  2000     74.1
7  Mexico  2005     75.2
8  Mexico  2010     76.5

Shape: (9, 3)
Characteristics: More rows, fewer columns, each observation is one row

# Convert to WIDE format
wide_data = long_data.pivot(index='Country', columns='Year', values='LifeExp')

print("\nWIDE FORMAT:")
print("="*50)
print(wide_data)
print("\nShape:", wide_data.shape)
print("Characteristics: Fewer rows, more columns, years spread across columns")


WIDE FORMAT:
==================================================
Year     2000  2005  2010
Country                  
Canada   79.2  80.1  81.0
Mexico   74.1  75.2  76.5
USA      76.8  77.5  78.5

Shape: (3, 3)
Characteristics: Fewer rows, more columns, years spread across columns

Key Observations:

Format	Rows	Columns	Use Case
Long	9	3	Better for plotting with seaborn, groupby operations, statistical modeling
Wide	3	3 (+index)	Better for comparing across years, reading tables, some statistical tests

Same data, different organization!

16.1.3 Reshaping Tools in Pandas

Pandas provides several functions to reshape data between wide and long formats:

Function	Direction	Primary Use
`pivot_table()`	Long → Wide	Create summary tables with aggregation
`melt()`	Wide → Long	Unpivot columns into rows
`stack()`	Wide → Long	Move column index to row index (works with MultiIndex)
`unstack()`	Long → Wide	Move row index to column index (works with MultiIndex)

In this chapter, we’ll explore each method with practical examples using the gdp_lifeExpectancy.csv dataset.

Let’s start by reading the CSV file into a pandas DataFrame.

# Load the data
gdp_lifeExp_data = pd.read_csv('./Datasets/gdp_lifeExpectancy.csv')
gdp_lifeExp_data.head()

	country	continent	year	lifeExp	pop	gdpPercap
0	Afghanistan	Asia	1952	28.801	8425333	779.445314
1	Afghanistan	Asia	1957	30.332	9240934	820.853030
2	Afghanistan	Asia	1962	31.997	10267083	853.100710
3	Afghanistan	Asia	1967	34.020	11537966	836.197138
4	Afghanistan	Asia	1972	36.088	13079460	739.981106

16.2 Pivoting from Long to Wide: `pivot_table()`

The pivot_table() function is one of pandas’ most versatile tools—it serves a dual purpose as both an aggregation tool and a data reshaping tool. While you may have seen pivot tables in Excel for summarizing data, pandas’ pivot_table() is even more powerful.

16.2.1 What is Pivoting?

Pivoting transforms long format data into wide format by:

Taking values from rows and spreading them across columns
Creating a cross-tabulation structure that’s easy to read
Automatically aggregating when there are multiple values per group

Think of it as “pivoting” your data 90 degrees—what was vertical becomes horizontal.

Let’s start with a simple example before moving to more complex scenarios.

16.2.2 Example 1: Simple Pivot with Single Index

Let’s start by pivoting life expectancy data to compare specific years. We’ll create a table where continents are rows and years are columns.

# Simple pivot: continents as rows, years as columns
simple_pivot = gdp_lifeExp_data.pivot_table(
    index='continent',       # Rows
    columns='year',          # Columns
    values='lifeExp',        # Values to display
    aggfunc='mean'           # How to aggregate (mean life expectancy)
)

print("Mean Life Expectancy by Continent and Year:")
simple_pivot

Mean Life Expectancy by Continent and Year:

year	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
continent
Africa	39.135500	41.266346	43.319442	45.334538	47.450942	49.580423	51.592865	53.344788	53.629577	53.598269	53.325231	54.806038
Americas	53.279840	55.960280	58.398760	60.410920	62.394920	64.391560	66.228840	68.090720	69.568360	71.150480	72.422040	73.608120
Asia	46.314394	49.318544	51.563223	54.663640	57.319269	59.610556	62.617939	64.851182	66.537212	68.020515	69.233879	70.728485
Europe	64.408500	66.703067	68.539233	69.737600	70.775033	71.937767	72.806400	73.642167	74.440100	75.505167	76.700600	77.648600
Oceania	69.255000	70.295000	71.085000	71.310000	71.910000	72.855000	74.290000	75.320000	76.945000	78.190000	79.740000	80.719500

Understanding the Parameters:

index='continent' - Values from this column become row labels
columns='year' - Unique values from this column become column headers
values='lifeExp' - The data to display in the table cells
aggfunc='mean' - Since multiple countries exist per continent-year, we average them

Result: Each cell shows the mean life expectancy for a continent in a specific year. This wide format makes it easy to compare across years (reading horizontally across a row).

16.2.3 Example 2: Multi-Level Index (Hierarchical Pivot)

Now let’s create a more detailed pivot table with both continent and country as row indices. This creates a hierarchical structure that allows drill-down analysis.

# Multi-level pivot: both continent and country as row indices
gdp_lifeExp_data_pivot = gdp_lifeExp_data.pivot_table(
    index=['continent', 'country'],  # Two-level row index
    columns='year',                   # Years as columns
    values='lifeExp',                 # Life expectancy values
    aggfunc='mean'                    # Aggregation function
)

# Display first 10 rows
gdp_lifeExp_data_pivot.head(10)

	year	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
continent	country
Africa	Algeria	43.077	45.685	48.303	51.407	54.518	58.014	61.368	65.799	67.744	69.152	70.994	72.301
	Angola	30.015	31.999	34.000	35.985	37.928	39.483	39.942	39.906	40.647	40.963	41.003	42.731
	Benin	38.223	40.358	42.618	44.885	47.014	49.190	50.904	52.337	53.919	54.777	54.406	56.728
	Botswana	47.622	49.618	51.520	53.298	56.024	59.319	61.484	63.622	62.745	52.556	46.634	50.728
	Burkina Faso	31.975	34.906	37.814	40.697	43.591	46.137	48.122	49.557	50.260	50.324	50.650	52.295
	Burundi	39.031	40.533	42.045	43.548	44.057	45.910	47.471	48.211	44.736	45.326	47.360	49.580
	Cameroon	38.523	40.428	42.643	44.799	47.049	49.355	52.961	54.985	54.314	52.199	49.856	50.430
	Central African Republic	35.463	37.464	39.475	41.478	43.457	46.775	48.295	50.485	49.396	46.066	43.308	44.741
	Chad	38.092	39.881	41.716	43.601	45.569	47.383	49.517	51.051	51.724	51.573	50.525	50.651
	Comoros	40.715	42.460	44.467	46.472	48.944	50.939	52.933	54.926	57.939	60.660	62.974	65.152

Understanding Multi-Level Indices:

When you pass a list to the index parameter, pandas creates a hierarchical (multi-level) index:

Level 0 (outermost): continent - Groups countries by continent
Level 1 (inner): country - Individual countries within each continent

Benefits of this structure:

Easy to see all years for a country in one row
Natural grouping by continent
Facilitates comparisons across time periods

16.2.4 Practical Use: Temporal Comparisons

With years as columns, comparing life expectancy across different time periods becomes trivial. Let’s visualize how life expectancy changed over 50 years (1957 vs 2007).

# Create a scatter plot comparing 1957 vs 2007 life expectancy
plt.figure(figsize=(10, 6))

# Add continent information for coloring
# Reset index to access 'continent' column for hue parameter
plot_data = gdp_lifeExp_data_pivot.reset_index()

# Create scatter plot
sns.scatterplot(data=plot_data, x=1957, y=2007, hue='continent', s=100, alpha=0.7)

# Add diagonal reference line (y = x) showing where 1957 = 2007
plt.plot([40, 80], [40, 80], 'r--', linewidth=2, label='No change (1957 = 2007)', alpha=0.5)

# Labels and formatting
plt.xlabel('Life Expectancy in 1957', fontsize=12)
plt.ylabel('Life Expectancy in 2007', fontsize=12)
plt.title('50-Year Change in Life Expectancy by Country', fontsize=14, fontweight='bold')
plt.legend(title='Continent', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Interpreting the Visualization:

Points above the red line: Countries where life expectancy increased from 1957 to 2007 (most countries)
Points below the red line: Countries where life expectancy decreased (concerning trend worth investigating)
Distance from the line: Magnitude of change over the 50-year period

Key Insight: A few African countries show decreased life expectancy after 50 years, likely due to factors like HIV/AIDS epidemic, conflicts, or healthcare system challenges. This pattern would be harder to spot without the wide format pivot table!

16.2.5 Visualizing Pivot Tables with Heatmaps

Another powerful way to visualize pivot tables is using heatmaps, which represent the table as a color-coded matrix. This makes it easy to spot patterns, trends, and outliers at a glance.

# Create a heatmap of the pivot table
plt.figure(figsize=(12, 20))
plt.title("Life Expectancy by Country and Year (Heatmap)", fontsize=16, fontweight='bold', pad=20)

# Create heatmap with better formatting
sns.heatmap(
    gdp_lifeExp_data_pivot, 
    cmap='YlGnBu',           # Yellow-Green-Blue color scheme
    cbar_kws={'label': 'Life Expectancy (years)'},
    linewidths=0.5,          # Add grid lines
    linecolor='gray',
    fmt='.1f'
)

plt.xlabel('Year', fontsize=12)
plt.ylabel('Country (grouped by Continent)', fontsize=12)
plt.tight_layout()
plt.show()

Reading the Heatmap:

Darker (blue) colors indicate higher life expectancy
Lighter (yellow) colors indicate lower life expectancy
Horizontal patterns show trends over time for individual countries
Vertical patterns show disparities across countries in a specific year

Patterns to Notice:

General darkening from left to right = life expectancy increasing over time globally
Top rows (Oceania, Europe) consistently darker = higher life expectancy
Bottom rows (Africa) show more yellow = lower life expectancy, though improving

Common Aggregation Functions in pivot_table()

You can use various aggregation functions within pivot_table() to summarize data:

mean – Calculates the average of values within each group.
sum – Computes the total of values within each group.
count – Counts the number of non-null entries within each group.
min and max – Finds the minimum and maximum values within each group.

We can also define our own custom aggregation function and pass it to the aggfunc parameter of the pivot_table() function

For Example: Find the \(90^{th}\) percentile of life expectancy for each country and year combination

pd.pivot_table(data = gdp_lifeExp_data, values = 'lifeExp',index = 'country', columns ='year',aggfunc = lambda x:np.percentile(x,90))

year	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
country
Afghanistan	28.801	30.332	31.997	34.020	36.088	38.438	39.854	40.822	41.674	41.763	42.129	43.828
Albania	55.230	59.280	64.820	66.220	67.690	68.930	70.420	72.000	71.581	72.950	75.651	76.423
Algeria	43.077	45.685	48.303	51.407	54.518	58.014	61.368	65.799	67.744	69.152	70.994	72.301
Angola	30.015	31.999	34.000	35.985	37.928	39.483	39.942	39.906	40.647	40.963	41.003	42.731
Argentina	62.485	64.399	65.142	65.634	67.065	68.481	69.942	70.774	71.868	73.275	74.340	75.320
...	...	...	...	...	...	...	...	...	...	...	...	...
Vietnam	40.412	42.887	45.363	47.838	50.254	55.764	58.816	62.820	67.662	70.672	73.017	74.249
West Bank and Gaza	43.160	45.671	48.127	51.631	56.532	60.765	64.406	67.046	69.718	71.096	72.370	73.422
Yemen, Rep.	32.548	33.970	35.180	36.984	39.848	44.175	49.113	52.922	55.599	58.020	60.308	62.698
Zambia	42.038	44.077	46.023	47.768	50.107	51.386	51.821	50.821	46.100	40.238	39.193	42.384
Zimbabwe	48.451	50.469	52.358	53.995	55.635	57.674	60.363	62.351	60.377	46.809	39.989	43.487

142 rows × 12 columns

Example: Consider the life expectancy dataset, calculate the average life expectancy for each country and year combination

pd.pivot_table(data = gdp_lifeExp_data, values = 'lifeExp',index = 'country', columns ='year',aggfunc = 'mean',margins = True)

year	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007	All
country
Afghanistan	28.80100	30.332000	31.997000	34.02000	36.088000	38.438000	39.854000	40.822000	41.674000	41.763000	42.129000	43.828000	37.478833
Albania	55.23000	59.280000	64.820000	66.22000	67.690000	68.930000	70.420000	72.000000	71.581000	72.950000	75.651000	76.423000	68.432917
Algeria	43.07700	45.685000	48.303000	51.40700	54.518000	58.014000	61.368000	65.799000	67.744000	69.152000	70.994000	72.301000	59.030167
Angola	30.01500	31.999000	34.000000	35.98500	37.928000	39.483000	39.942000	39.906000	40.647000	40.963000	41.003000	42.731000	37.883500
Argentina	62.48500	64.399000	65.142000	65.63400	67.065000	68.481000	69.942000	70.774000	71.868000	73.275000	74.340000	75.320000	69.060417
...	...	...	...	...	...	...	...	...	...	...	...	...	...
West Bank and Gaza	43.16000	45.671000	48.127000	51.63100	56.532000	60.765000	64.406000	67.046000	69.718000	71.096000	72.370000	73.422000	60.328667
Yemen, Rep.	32.54800	33.970000	35.180000	36.98400	39.848000	44.175000	49.113000	52.922000	55.599000	58.020000	60.308000	62.698000	46.780417
Zambia	42.03800	44.077000	46.023000	47.76800	50.107000	51.386000	51.821000	50.821000	46.100000	40.238000	39.193000	42.384000	45.996333
Zimbabwe	48.45100	50.469000	52.358000	53.99500	55.635000	57.674000	60.363000	62.351000	60.377000	46.809000	39.989000	43.487000	52.663167
All	49.05762	51.507401	53.609249	55.67829	57.647386	59.570157	61.533197	63.212613	64.160338	65.014676	65.694923	67.007423	59.474439

143 rows × 13 columns

16.3 Melting from Wide to Long: `melt()`

melt() is the inverse operation of pivoting—it transforms wide format data into long format by “unpivoting” columns into rows. This is also called “melting” because you’re “melting” wide columns down into a taller, narrower structure.

16.3.1 Key Parameters

Parameter	Description
`id_vars`	Columns that remain as identifiers (not melted)
`value_vars`	Columns to unpivot (default: all except `id_vars`)
`var_name`	Name for the “variable” column
`value_name`	Name for the “value” column

16.3.2 Why Melt Your Data?

Visualization Libraries Prefer Long Format:

Seaborn and many plotting tools expect data in long format
Makes it easy to create grouped plots with hue, style, or col parameters

Better for Statistical Analysis:

Long format is “tidy data”—each variable in one column, each observation in one row
Required for many statistical models and tests

More Flexible for groupby() Operations:

Easier to filter and aggregate when variables are in rows rather than column names

Let’s see this in action!

16.3.3 Example 1: Basic Melt

Let’s take our pivoted data (wide format) and melt it back to long format. First, let’s look at what we’re starting with:

# Let's look at our wide format data (first 5 rows)
print(f"\nShape: {gdp_lifeExp_data_pivot.shape}")
print("Notice: Years are columns (wide), each country is one row")

print("WIDE FORMAT (pivoted data):")
print("="*60)
gdp_lifeExp_data_pivot.head()


Shape: (142, 12)
Notice: Years are columns (wide), each country is one row
WIDE FORMAT (pivoted data):
============================================================

	year	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
continent	country
Africa	Algeria	43.077	45.685	48.303	51.407	54.518	58.014	61.368	65.799	67.744	69.152	70.994	72.301
	Angola	30.015	31.999	34.000	35.985	37.928	39.483	39.942	39.906	40.647	40.963	41.003	42.731
	Benin	38.223	40.358	42.618	44.885	47.014	49.190	50.904	52.337	53.919	54.777	54.406	56.728
	Botswana	47.622	49.618	51.520	53.298	56.024	59.319	61.484	63.622	62.745	52.556	46.634	50.728
	Burkina Faso	31.975	34.906	37.814	40.697	43.591	46.137	48.122	49.557	50.260	50.324	50.650	52.295

# Melt the wide format back to long format
melted_data = gdp_lifeExp_data_pivot.melt(
    var_name='year',          # Name for the column that will hold year values
    value_name='lifeExp',     # Name for the column that will hold life expectancy values
    ignore_index=False        # Keep the original index (continent, country)
)

print(f"\nShape: {melted_data.shape}")
print("Notice: Years are now rows (long), multiple rows per country")


print("\nLONG FORMAT (after melting):")
print("="*60)
melted_data.head(15)  # Show more rows to see the pattern


Shape: (1704, 2)
Notice: Years are now rows (long), multiple rows per country

LONG FORMAT (after melting):
============================================================

		year	lifeExp
continent	country
Africa	Algeria	1952	43.077
	Angola	1952	30.015
	Benin	1952	38.223
	Botswana	1952	47.622
	Burkina Faso	1952	31.975
	Burundi	1952	39.031
	Cameroon	1952	38.523
	Central African Republic	1952	35.463
	Chad	1952	38.092
	Comoros	1952	40.715
	Congo, Dem. Rep.	1952	39.143
	Congo, Rep.	1952	42.111
	Cote d'Ivoire	1952	40.477
	Djibouti	1952	34.812
	Egypt	1952	41.893

What Just Happened?

Wide Format (before melt):
- 142 rows × 12 columns (one column per year)
- Each country had one row with year values spread across columns
Long Format (after melt):
- 1,704 rows × 1 column (142 countries × 12 years = 1,704 rows)
- Each country-year combination gets its own row
- The year column now contains what were previously column headers
- The lifeExp column contains the actual values

Key Parameters Used:

var_name='year' - Column header values (1952, 1957, etc.) go into a new column called ‘year’
value_name='lifeExp' - Cell values go into a new column called ‘lifeExp’
ignore_index=False - Preserve the multi-level index (continent, country)

This is now in tidy format, perfect for plotting and analysis!

16.3.4 Example 2: Using Melted Data for Visualization

Now that we have data in long format, we can easily create time series plots grouped by continent. This would be much harder with wide format data!

# Prepare data for plotting by resetting index to access continent column
plot_data = melted_data.reset_index()

# Create a line plot showing life expectancy trends over time
plt.figure(figsize=(12, 6))
sns.lineplot(
    data=plot_data, 
    x='year', 
    y='lifeExp', 
    hue='continent',
    estimator='mean',       # Average across countries in each continent
    errorbar='sd',          # Show standard deviation as error bars
    linewidth=2.5
)

plt.title('Life Expectancy Trends by Continent (1952-2007)', fontsize=14, fontweight='bold')
plt.xlabel('Year', fontsize=12)
plt.ylabel('Mean Life Expectancy (years)', fontsize=12)
plt.legend(title='Continent', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

Why This Works So Well:

This visualization would be extremely difficult with wide format data! With long format:

Seaborn automatically groups by continent using the hue parameter
The x='year' and y='lifeExp' mappings are straightforward
Error bars show variation across countries within each continent

Key Insights from the Plot:

All continents show upward trends (global improvement in healthcare)
Oceania maintains the highest life expectancy throughout
Africa starts lowest but shows improvement
The gap between continents has narrowed over time (error bars converge)

This is the power of having data in the right format!

16.3.5 Pivot ↔︎ Melt: The Full Circle

Let’s verify that pivot_table() and melt() are truly inverse operations by starting with long format, pivoting to wide, then melting back to long:

# Step 1: Start with original long format data
print("Step 1 - ORIGINAL (Long Format):")
print(f"Shape: {gdp_lifeExp_data.shape}")
print(gdp_lifeExp_data.head(3))

# Step 2: Pivot to wide format
wide = gdp_lifeExp_data.pivot_table(
    index=['continent', 'country'], 
    columns='year', 
    values='lifeExp'
)
print("\n" + "="*60)
print("Step 2 - PIVOTED (Wide Format):")
print(f"Shape: {wide.shape}")
print(wide.head(3))

# Step 3: Melt back to long format
long_again = wide.melt(var_name='year', value_name='lifeExp', ignore_index=False).reset_index()
print("\n" + "="*60)
print("Step 3 - MELTED BACK (Long Format):")
print(f"Shape: {long_again.shape}")
print(long_again.head(3))

print("\n✓ We're back where we started! pivot_table() and melt() are inverses.")

Step 1 - ORIGINAL (Long Format):
Shape: (1704, 6)
       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710

============================================================
Step 2 - PIVOTED (Wide Format):
Shape: (142, 12)
year                 1952    1957    1962    1967    1972    1977    1982  \
continent country                                                           
Africa    Algeria  43.077  45.685  48.303  51.407  54.518  58.014  61.368   
          Angola   30.015  31.999  34.000  35.985  37.928  39.483  39.942   
          Benin    38.223  40.358  42.618  44.885  47.014  49.190  50.904   

year                 1987    1992    1997    2002    2007  
continent country                                          
Africa    Algeria  65.799  67.744  69.152  70.994  72.301  
          Angola   39.906  40.647  40.963  41.003  42.731  
          Benin    52.337  53.919  54.777  54.406  56.728  

============================================================
Step 3 - MELTED BACK (Long Format):
Shape: (1704, 4)
  continent  country  year  lifeExp
0    Africa  Algeria  1952   43.077
1    Africa   Angola  1952   30.015
2    Africa    Benin  1952   38.223

✓ We're back where we started! pivot_table() and melt() are inverses.

16.4 Stacking and Unstacking using `stack()` and `unstack()`

Stacking and unstacking are powerful techniques used to reshape DataFrames by moving data between rows and columns. These operations are particularly useful when working with multi-level index or multi-level columns, where the data is structured in a hierarchical format.

stack(): Converts columns into rows, which means it “stacks” the data into a long format.
unstack(): Converts rows into columns, reversing the effect of stack().

Let’s use the GDP per capita dataset and create a DataFrame with a multi-level index (e.g., continent, country) and multi-level columns (e.g., lifeExp , gdpPercap, and pop).

# calculate the average and standard deviation of life expectancy, population, and GDP per capita for countries across the years.

gdp_lifeExp_multilevel_data = gdp_lifeExp_data.groupby(['continent', 'country'])[['lifeExp', 'pop', 'gdpPercap']].agg(['mean', 'std'])
gdp_lifeExp_multilevel_data

		lifeExp		pop		gdpPercap
		mean	std	mean	std	mean	std
continent	country
Africa	Algeria	59.030167	10.340069	1.987541e+07	8.613355e+06	4426.025973	1310.337656
	Angola	37.883500	4.005276	7.309390e+06	2.672281e+06	3607.100529	1165.900251
	Benin	48.779917	6.128681	4.017497e+06	2.105002e+06	1155.395107	159.741306
	Botswana	54.597500	5.929476	9.711862e+05	4.710965e+05	5031.503557	4178.136987
	Burkina Faso	44.694000	6.845792	7.548677e+06	3.247808e+06	843.990665	183.430087
...	...	...	...	...	...	...	...
Europe	Switzerland	75.565083	4.011572	6.384293e+06	8.582009e+05	27074.334405	6886.463308
	Turkey	59.696417	9.030591	4.590901e+07	1.667768e+07	4469.453380	2049.665102
	United Kingdom	73.922583	3.378943	5.608780e+07	3.174339e+06	19380.472986	7388.189399
Oceania	Australia	74.662917	4.147774	1.464931e+07	3.915203e+06	19980.595634	7815.405220
Oceania	New Zealand	73.989500	3.559724	3.100032e+06	6.547108e+05	17262.622813	4409.009167

142 rows × 6 columns

The resulting DataFrame has two levels of index and two levels of columns, as shown below:

# check the level of row and column index
gdp_lifeExp_multilevel_data.index.nlevels

gdp_lifeExp_multilevel_data.columns.nlevels

if you want to see the data along with the index and columns (in a more accessible way), you can directly print the .index and .columns attributes.

gdp_lifeExp_multilevel_data.index

MultiIndex([( 'Africa',                  'Algeria'),
            ( 'Africa',                   'Angola'),
            ( 'Africa',                    'Benin'),
            ( 'Africa',                 'Botswana'),
            ( 'Africa',             'Burkina Faso'),
            ( 'Africa',                  'Burundi'),
            ( 'Africa',                 'Cameroon'),
            ( 'Africa', 'Central African Republic'),
            ( 'Africa',                     'Chad'),
            ( 'Africa',                  'Comoros'),
            ...
            ( 'Europe',                   'Serbia'),
            ( 'Europe',          'Slovak Republic'),
            ( 'Europe',                 'Slovenia'),
            ( 'Europe',                    'Spain'),
            ( 'Europe',                   'Sweden'),
            ( 'Europe',              'Switzerland'),
            ( 'Europe',                   'Turkey'),
            ( 'Europe',           'United Kingdom'),
            ('Oceania',                'Australia'),
            ('Oceania',              'New Zealand')],
           names=['continent', 'country'], length=142)

gdp_lifeExp_multilevel_data.columns

MultiIndex([(  'lifeExp', 'mean'),
            (  'lifeExp',  'std'),
            (      'pop', 'mean'),
            (      'pop',  'std'),
            ('gdpPercap', 'mean'),
            ('gdpPercap',  'std')],
           )

As seen, the index and columns each contain tuples. This is the reason that when performing data subsetting on a DataFrame with multi-level index and columns, you need to supply a tuple to specify the exact location of the data.

gdp_lifeExp_multilevel_data.loc[('Americas', 'United States'), ('lifeExp', 'mean')]

73.4785

gdp_lifeExp_multilevel_data.columns.values

array([('lifeExp', 'mean'), ('lifeExp', 'std'), ('pop', 'mean'),
       ('pop', 'std'), ('gdpPercap', 'mean'), ('gdpPercap', 'std')],
      dtype=object)

16.4.1 Understanding the `level` in multi-level index and columns

The gdp_lifeExp_multilevel_data dataframe have two levels of index and two level of columns, You can reference these levels in various operations, such as grouping, subsetting, or performing aggregations. The level parameter helps identify which part of the multi-level structure you’re working with.

gdp_lifeExp_multilevel_data.index.get_level_values(0)

Index(['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa',
       'Africa', 'Africa', 'Africa',
       ...
       'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe',
       'Europe', 'Oceania', 'Oceania'],
      dtype='object', name='continent', length=142)

gdp_lifeExp_multilevel_data.index.get_level_values(1)

Index(['Algeria', 'Angola', 'Benin', 'Botswana', 'Burkina Faso', 'Burundi',
       'Cameroon', 'Central African Republic', 'Chad', 'Comoros',
       ...
       'Serbia', 'Slovak Republic', 'Slovenia', 'Spain', 'Sweden',
       'Switzerland', 'Turkey', 'United Kingdom', 'Australia', 'New Zealand'],
      dtype='object', name='country', length=142)

gdp_lifeExp_multilevel_data.columns.get_level_values(0)

Index(['lifeExp', 'lifeExp', 'pop', 'pop', 'gdpPercap', 'gdpPercap'], dtype='object')

gdp_lifeExp_multilevel_data.columns.get_level_values(1)

Index(['mean', 'std', 'mean', 'std', 'mean', 'std'], dtype='object')

gdp_lifeExp_multilevel_data.columns.get_level_values(-1)

Index(['mean', 'std', 'mean', 'std', 'mean', 'std'], dtype='object')

As seen above, the outermost level corresponds to level=0, and it increases as we move inward to the inner levels. The innermost level can be referred to as -1.

Now, let’s use the level number in the droplevel() method to see how it works. The syntax for this method is as follows:

DataFrame.droplevel(level, axis=0)

# drop the first level of row index
gdp_lifeExp_multilevel_data.droplevel(0, axis=0)

	lifeExp		pop		gdpPercap
	mean	std	mean	std	mean	std
country
Algeria	59.030167	10.340069	1.987541e+07	8.613355e+06	4426.025973	1310.337656
Angola	37.883500	4.005276	7.309390e+06	2.672281e+06	3607.100529	1165.900251
Benin	48.779917	6.128681	4.017497e+06	2.105002e+06	1155.395107	159.741306
Botswana	54.597500	5.929476	9.711862e+05	4.710965e+05	5031.503557	4178.136987
Burkina Faso	44.694000	6.845792	7.548677e+06	3.247808e+06	843.990665	183.430087
...	...	...	...	...	...	...
Switzerland	75.565083	4.011572	6.384293e+06	8.582009e+05	27074.334405	6886.463308
Turkey	59.696417	9.030591	4.590901e+07	1.667768e+07	4469.453380	2049.665102
United Kingdom	73.922583	3.378943	5.608780e+07	3.174339e+06	19380.472986	7388.189399
Australia	74.662917	4.147774	1.464931e+07	3.915203e+06	19980.595634	7815.405220
New Zealand	73.989500	3.559724	3.100032e+06	6.547108e+05	17262.622813	4409.009167

142 rows × 6 columns

# drop the first level of column index
gdp_lifeExp_multilevel_data.droplevel(0, axis=1)

		mean	std	mean	std	mean	std
continent	country
Africa	Algeria	59.030167	10.340069	1.987541e+07	8.613355e+06	4426.025973	1310.337656
	Angola	37.883500	4.005276	7.309390e+06	2.672281e+06	3607.100529	1165.900251
	Benin	48.779917	6.128681	4.017497e+06	2.105002e+06	1155.395107	159.741306
	Botswana	54.597500	5.929476	9.711862e+05	4.710965e+05	5031.503557	4178.136987
	Burkina Faso	44.694000	6.845792	7.548677e+06	3.247808e+06	843.990665	183.430087
...	...	...	...	...	...	...	...
Europe	Switzerland	75.565083	4.011572	6.384293e+06	8.582009e+05	27074.334405	6886.463308
	Turkey	59.696417	9.030591	4.590901e+07	1.667768e+07	4469.453380	2049.665102
	United Kingdom	73.922583	3.378943	5.608780e+07	3.174339e+06	19380.472986	7388.189399
Oceania	Australia	74.662917	4.147774	1.464931e+07	3.915203e+06	19980.595634	7815.405220
Oceania	New Zealand	73.989500	3.559724	3.100032e+06	6.547108e+05	17262.622813	4409.009167

142 rows × 6 columns

16.4.2 Stacking (Columns to Rows ): `stack()`

The stack() function pivots the columns into rows. You can specify which level you want to stack using the level parameter. By default, it will stack the innermost level of the columns.

gdp_lifeExp_multilevel_data.stack(future_stack=True)

			lifeExp	pop	gdpPercap
continent	country
Africa	Algeria	mean	59.030167	1.987541e+07	4426.025973
	Algeria	std	10.340069	8.613355e+06	1310.337656
	Angola	mean	37.883500	7.309390e+06	3607.100529
	Angola	std	4.005276	2.672281e+06	1165.900251
	Benin	mean	48.779917	4.017497e+06	1155.395107
...	...	...	...	...	...
Europe	United Kingdom	std	3.378943	3.174339e+06	7388.189399
Oceania	Australia	mean	74.662917	1.464931e+07	19980.595634
	Australia	std	4.147774	3.915203e+06	7815.405220
	New Zealand	mean	73.989500	3.100032e+06	17262.622813
	New Zealand	std	3.559724	6.547108e+05	4409.009167

284 rows × 3 columns

Let’s change the default setting by specifying the level=0(the outmost level)

gdp_lifeExp_multilevel_data.stack(level=0, future_stack=True)

			mean	std
continent	country
Africa	Algeria	lifeExp	5.903017e+01	1.034007e+01
		pop	1.987541e+07	8.613355e+06
		gdpPercap	4.426026e+03	1.310338e+03
	Angola	lifeExp	3.788350e+01	4.005276e+00
	Angola	pop	7.309390e+06	2.672281e+06
...	...	...	...	...
Oceania	Australia	pop	1.464931e+07	3.915203e+06
	Australia	gdpPercap	1.998060e+04	7.815405e+03
	New Zealand	lifeExp	7.398950e+01	3.559724e+00
		pop	3.100032e+06	6.547108e+05
		gdpPercap	1.726262e+04	4.409009e+03

426 rows × 2 columns

16.4.3 Unstacking (Rows to Columns) : `unstack()`

The unstack() function pivots the rows back into columns. By default, it will unstack the innermost level of the index.

Let’s reverse the prevous dataframe to its original shape using unstack().

gdp_lifeExp_multilevel_data.stack(level=0, future_stack=True).unstack()

		mean			std
		lifeExp	pop	gdpPercap	lifeExp	pop	gdpPercap
continent	country
Africa	Algeria	59.030167	1.987541e+07	4426.025973	10.340069	8.613355e+06	1310.337656
	Angola	37.883500	7.309390e+06	3607.100529	4.005276	2.672281e+06	1165.900251
	Benin	48.779917	4.017497e+06	1155.395107	6.128681	2.105002e+06	159.741306
	Botswana	54.597500	9.711862e+05	5031.503557	5.929476	4.710965e+05	4178.136987
	Burkina Faso	44.694000	7.548677e+06	843.990665	6.845792	3.247808e+06	183.430087
...	...	...	...	...	...	...	...
Europe	Switzerland	75.565083	6.384293e+06	27074.334405	4.011572	8.582009e+05	6886.463308
	Turkey	59.696417	4.590901e+07	4469.453380	9.030591	1.667768e+07	2049.665102
	United Kingdom	73.922583	5.608780e+07	19380.472986	3.378943	3.174339e+06	7388.189399
Oceania	Australia	74.662917	1.464931e+07	19980.595634	4.147774	3.915203e+06	7815.405220
Oceania	New Zealand	73.989500	3.100032e+06	17262.622813	3.559724	6.547108e+05	4409.009167

142 rows × 6 columns

16.4.4 Transposing using `.T` attribute

If you simply want to swap rows and columns, you can use the .T attribute.

gdp_lifeExp_multilevel_data.T

	continent	Africa										...	Europe								Oceania
	country	Algeria	Angola	Benin	Botswana	Burkina Faso	Burundi	Cameroon	Central African Republic	Chad	Comoros	...	Serbia	Slovak Republic	Slovenia	Spain	Sweden	Switzerland	Turkey	United Kingdom	Australia	New Zealand
lifeExp	mean	5.903017e+01	3.788350e+01	4.877992e+01	54.597500	4.469400e+01	4.481733e+01	4.812850e+01	4.386692e+01	4.677358e+01	52.381750	...	6.855100e+01	7.069608e+01	7.160075e+01	7.420342e+01	7.617700e+01	7.556508e+01	5.969642e+01	7.392258e+01	7.466292e+01	7.398950e+01
lifeExp	std	1.034007e+01	4.005276e+00	6.128681e+00	5.929476	6.845792e+00	3.174882e+00	5.467960e+00	4.720690e+00	4.887978e+00	8.132353	...	4.906736e+00	2.715787e+00	3.677979e+00	5.155859e+00	3.003990e+00	4.011572e+00	9.030591e+00	3.378943e+00	4.147774e+00	3.559724e+00
pop	mean	1.987541e+07	7.309390e+06	4.017497e+06	971186.166667	7.548677e+06	4.651608e+06	9.816648e+06	2.560963e+06	5.329256e+06	361683.916667	...	8.783887e+06	4.774507e+06	1.794381e+06	3.585180e+07	8.220029e+06	6.384293e+06	4.590901e+07	5.608780e+07	1.464931e+07	3.100032e+06
pop	std	8.613355e+06	2.672281e+06	2.105002e+06	471096.527435	3.247808e+06	1.874538e+06	4.363640e+06	1.072158e+06	2.464304e+06	182998.737990	...	1.192153e+06	6.425096e+05	2.022077e+05	4.323928e+06	6.365660e+05	8.582009e+05	1.667768e+07	3.174339e+06	3.915203e+06	6.547108e+05
gdpPercap	mean	4.426026e+03	3.607101e+03	1.155395e+03	5031.503557	8.439907e+02	4.716630e+02	1.774634e+03	9.587847e+02	1.165454e+03	1314.380339	...	9.305049e+03	1.041553e+04	1.407458e+04	1.402983e+04	1.994313e+04	2.707433e+04	4.469453e+03	1.938047e+04	1.998060e+04	1.726262e+04
gdpPercap	std	1.310338e+03	1.165900e+03	1.597413e+02	4178.136987	1.834301e+02	9.932972e+01	4.195899e+02	1.919529e+02	2.305481e+02	298.334389	...	3.829907e+03	3.650231e+03	6.470288e+03	8.046635e+03	7.723248e+03	6.886463e+03	2.049665e+03	7.388189e+03	7.815405e+03	4.409009e+03

6 rows × 142 columns

16.5 Summary of Common Reshaping Functions

Function	Description	Example
`pivot()`	Reshape data by creating new columns from existing ones.	Pivot data on index/columns.
`melt()`	Unpivot data, turning columns into rows.	Convert wide format to long.
`stack()`	Convert columns to rows.	Move column data to rows.
`unstack()`	Convert rows back to columns.	Move row data to columns.
`.T`	Transpose the DataFrame (swap rows and columns).	Transpose the entire DataFrame.

Practical Use Cases:

Pivoting: Useful for time series analysis or when you want to group and summarize data by specific categories.
Melting: Helpful when you need to reshape wide data for easier analysis or when preparing data for machine learning algorithms.
Stacking/Unstacking: Useful when you are working with hierarchical index DataFrames.
Transpose: Helpful for reorienting data for analysis or visualization.

16.6 Converting from Multi-Level Index to Single-Level Index DataFrame

If you’re uncomfortable working with DataFrames that have multi-level indexes and columns, you can convert them into single-level DataFrames using the methods you learned in the previous chapter:

reset_index()
Flattening the multi-level columns

# reset the index to convert the multi-level index to a single-level index
gdp_lifeExp_multilevel_data.reset_index()

	continent	country	lifeExp		pop		gdpPercap
			mean	std	mean	std	mean	std
0	Africa	Algeria	59.030167	10.340069	1.987541e+07	8.613355e+06	4426.025973	1310.337656
1	Africa	Angola	37.883500	4.005276	7.309390e+06	2.672281e+06	3607.100529	1165.900251
2	Africa	Benin	48.779917	6.128681	4.017497e+06	2.105002e+06	1155.395107	159.741306
3	Africa	Botswana	54.597500	5.929476	9.711862e+05	4.710965e+05	5031.503557	4178.136987
4	Africa	Burkina Faso	44.694000	6.845792	7.548677e+06	3.247808e+06	843.990665	183.430087
...	...	...	...	...	...	...	...	...
137	Europe	Switzerland	75.565083	4.011572	6.384293e+06	8.582009e+05	27074.334405	6886.463308
138	Europe	Turkey	59.696417	9.030591	4.590901e+07	1.667768e+07	4469.453380	2049.665102
139	Europe	United Kingdom	73.922583	3.378943	5.608780e+07	3.174339e+06	19380.472986	7388.189399
140	Oceania	Australia	74.662917	4.147774	1.464931e+07	3.915203e+06	19980.595634	7815.405220
141	Oceania	New Zealand	73.989500	3.559724	3.100032e+06	6.547108e+05	17262.622813	4409.009167

142 rows × 8 columns

# flatten the multi-level column
gdp_lifeExp_multilevel_data.columns = ['_'.join(col).strip() for col in gdp_lifeExp_multilevel_data.columns.values]
gdp_lifeExp_multilevel_data

		lifeExp_mean	lifeExp_std	pop_mean	pop_std	gdpPercap_mean	gdpPercap_std
continent	country
Africa	Algeria	59.030167	10.340069	1.987541e+07	8.613355e+06	4426.025973	1310.337656
	Angola	37.883500	4.005276	7.309390e+06	2.672281e+06	3607.100529	1165.900251
	Benin	48.779917	6.128681	4.017497e+06	2.105002e+06	1155.395107	159.741306
	Botswana	54.597500	5.929476	9.711862e+05	4.710965e+05	5031.503557	4178.136987
	Burkina Faso	44.694000	6.845792	7.548677e+06	3.247808e+06	843.990665	183.430087
...	...	...	...	...	...	...	...
Europe	Switzerland	75.565083	4.011572	6.384293e+06	8.582009e+05	27074.334405	6886.463308
	Turkey	59.696417	9.030591	4.590901e+07	1.667768e+07	4469.453380	2049.665102
	United Kingdom	73.922583	3.378943	5.608780e+07	3.174339e+06	19380.472986	7388.189399
Oceania	Australia	74.662917	4.147774	1.464931e+07	3.915203e+06	19980.595634	7815.405220
Oceania	New Zealand	73.989500	3.559724	3.100032e+06	6.547108e+05	17262.622813	4409.009167

142 rows × 6 columns

16.7 Independent Study

16.7.1 Preparing GDP per capita data

Read the GDP per capita data from https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita

Drop all the Year columns. Use the drop() method with the columns, level and inplace arguments. Print the first 2 rows of the updated DataFrame. If the first row of the DataFrame has missing values for all columns, drop that row as well.

16.7.1.1

Drop the inner level of column labels. Use the droplevel() method. Print the first 2 rows of the updated DataFrame.

16.7.1.2

Convert the columns consisting of GDP per capita by IMF, World Bank, and the United Nations to numeric. Apply a lambda function on these columns to convert them to numeric. Print the number of missing values in each column of the updated DataFrame.

Note: Do not apply the function 3 times. Apply it once on a DataFrame consisting of these 3 columns.

16.7.1.3

Apply the lambda function below on all the column names of the dataset obtained in the previous question to clean the column names.

import re

column_name_cleaner = lambda x:re.split(r'\[|/', x)[0]

Note: You will need to edit the parameter of the function, i.e., x in the above function to make sure it is applied on column names and not columns.

Print the first 2 rows of the updated DataFrame.

16.7.1.4

Create a new column GDP_per_capita that copies the GDP per capita values of the United Nations. If the GDP per capita is missing in the United Nations column, then copy it from the World Bank column. If the GDP per capita is missing both in the United Nations and the World Bank columns, then copy it from the IMF column.

Print the number of missing values in the GDP_per_capita column.

16.7.1.5

Drop all the columns except Country and GDP_per_capita. Print the first 2 rows of the updated DataFrame.

16.7.1.6

The country names contain some special characters (characters other than letters) and need to be cleaned. The following function can help clean country names:

import re

country_names_clean_gdp_data = lambda x: re.sub(r'[^\w\s]', '', x).strip()

Apply the above lambda function on the country column to clean country names. Save the cleaned dataset as gdp_per_capita_data. Print the first 2 rows of the updated DataFrame.

16.7.2 Preparing population data

Read the population data from https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations).

Drop all columns except Country, UN Continental Region[1], and Population (1 July 2023).
Drop the first row since it is the total population in the world

16.7.2.1

Apply the lambda function below on all the column names of the dataset obtained in the previous question to clean the column names.

import re

column_name_cleaner = lambda x:re.split(r'\[|/|\(| ', x.name)[0]

Note: You will need to edit the parameter of the function, i.e., x in the above function to make sure it is applied on column names and not columns.

Print the first 2 rows of the updated DataFrame.

16.7.2.2

The country names contain some special characters (characters other than letters) and need to be cleaned. The following function can help clean country names:

import re

country_names_clean_population_data = lambda x: re.sub("[\(\[].*?[\)\]]", "", x).strip()

Apply the above lambda function on the country column to clean country names. Save the cleaned dataset as population_data.

16.1 Introduction to Data Reshaping

16.1.1 Why Data Reshaping Matters

16.1.2 Understanding Wide vs. Long Format

16.1.3 Reshaping Tools in Pandas

16.2 Pivoting from Long to Wide: pivot_table()

16.2.1 What is Pivoting?

16.2.2 Example 1: Simple Pivot with Single Index

16.2.3 Example 2: Multi-Level Index (Hierarchical Pivot)

16.2.4 Practical Use: Temporal Comparisons

16.2.5 Visualizing Pivot Tables with Heatmaps

16.3 Melting from Wide to Long: melt()

16.3.1 Key Parameters

16.3.2 Why Melt Your Data?

16.3.3 Example 1: Basic Melt

16.3.4 Example 2: Using Melted Data for Visualization

16.3.5 Pivot ↔︎ Melt: The Full Circle

16.4 Stacking and Unstacking using stack() and unstack()

16.4.1 Understanding the level in multi-level index and columns

16.4.2 Stacking (Columns to Rows ): stack()

16.4.3 Unstacking (Rows to Columns) : unstack()

16.4.4 Transposing using .T attribute

16.5 Summary of Common Reshaping Functions

16.6 Converting from Multi-Level Index to Single-Level Index DataFrame

16.7 Independent Study

16.7.1 Preparing GDP per capita data

16.7.1.1

16.7.1.2

16.7.1.3

16.7.1.4

16.7.1.5

16.7.1.6

16.7.2 Preparing population data

16.7.2.1

16.7.2.2

16.2 Pivoting from Long to Wide: `pivot_table()`

16.3 Melting from Wide to Long: `melt()`

16.4 Stacking and Unstacking using `stack()` and `unstack()`

16.4.1 Understanding the `level` in multi-level index and columns

16.4.2 Stacking (Columns to Rows ): `stack()`

16.4.3 Unstacking (Rows to Columns) : `unstack()`

16.4.4 Transposing using `.T` attribute