8 Pandas Intermediate
Now that you’ve learned the fundamentals of pandas, this chapter shifts focus to transformations — techniques that reshape and enrich your data.
You’ll practice vectorized arithmetic, automatic alignment, conditional logic, and row- vs. column-wise operations that turn raw tables into analysis-ready features.
8.1 Learning Objectives
By the end of this chapter, you will learn:
- Arithmetic Operations — perform calculations within and between DataFrames
- Broadcasting — apply operations efficiently across different shapes with automatic alignment
- Transformations — use map() and apply() to transform your data flexibly
- Lambda Functions — create quick, inline transformations with map() or apply()
- Advanced NLP Preprocessing — practice transformations that support your final project
🔑 Note on Vectorization:
Vectorization means executing an operation on an entire array at once through NumPy's optimized, compiled routines, instead of looping over values one at a time in slow Python code.
In this chapter, we will motivate NumPy vectorization and introduce its benefits. A deeper dive will come in the dedicated NumPy chapter.
First of all, let's import the libraries used throughout this chapter:
import numpy as np
import pandas as pd
import time
8.2 Arithmetic Operations: Within and Between DataFrames
Arithmetic operations are essential when working with numerical data columns in pandas.
8.2.1 Why Arithmetic Operations Matter in Data Science
They form the backbone of many common tasks, including:
- Creating new metrics — e.g.,
profit = revenue - cost
- Normalizing data — z-scores, percentages, or scaling
- Comparing datasets — growth rates, ratios, or relative differences
- Financial analysis — ROI, margins, compound growth
- Scientific computations — applying formulas and transformations
🚀 Pandas Advantage: Arithmetic in pandas is intuitive and concise — operations align automatically by rows and columns, saving you from manual looping.
8.2.2 Arithmetic Operations Within a DataFrame
Core Concept: Perform calculations between columns, with constants, or across rows
| Operation | Symbol | Example | Business Use Case |
|---|---|---|---|
| Addition | + | df['A'] + df['B'] | Total sales = Online + Retail |
| Subtraction | - | df['Revenue'] - df['Cost'] | Profit calculation |
| Multiplication | * | df['Price'] * df['Quantity'] | Total value |
| Division | / | df['Profit'] / df['Revenue'] | Margin percentage |
| Power | ** | df['Principal'] * (1 + rate)**years | Compound interest |
| Floor Division | // | df['Total'] // df['BatchSize'] | Number of full batches |
| Modulo | % | df['ID'] % 10 | Grouping by last digit |
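The last three operators in the table (power, floor division, modulo) are not used in the examples below, so here is a minimal, self-contained sketch of them; the column names and numbers are made up for illustration.
import pandas as pd

# Hypothetical data just to exercise the less common operators from the table
ops_demo = pd.DataFrame({'Total': [103, 250, 48], 'BatchSize': [20, 60, 10]})
ops_demo['Full_Batches'] = ops_demo['Total'] // ops_demo['BatchSize']      # floor division
ops_demo['Leftover_Units'] = ops_demo['Total'] % ops_demo['BatchSize']     # modulo (remainder)
ops_demo['Projected_Total'] = ops_demo['Total'] * (1 + 0.05) ** 2          # power: 5% growth, 2 periods
print(ops_demo)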
8.2.2.1 Column-to-Column Operations
# 🎯 Create a business-like DataFrame for demonstrations
business_data = {
'Product': ['Laptop', 'Phone', 'Tablet', 'Watch', 'Headphones'],
'Units_Sold': [150, 300, 200, 500, 250],
'Price_per_Unit': [1200, 800, 600, 400, 200],
'Cost_per_Unit': [800, 500, 400, 250, 120],
'Marketing_Spend': [15000, 25000, 18000, 30000, 12000]
}
business_df = pd.DataFrame(business_data)
print("📊 Original Business Data:")
print(business_df)
# 💰 Basic arithmetic operations between columns
business_df['Total_Revenue'] = business_df['Units_Sold'] * business_df['Price_per_Unit']
business_df['Total_Cost'] = business_df['Units_Sold'] * business_df['Cost_per_Unit']
business_df['Gross_Profit'] = business_df['Total_Revenue'] - business_df['Total_Cost']
business_df['Net_Profit'] = business_df['Gross_Profit'] - business_df['Marketing_Spend']
# 📈 Calculate ratios and percentages
business_df['Profit_Margin'] = (business_df['Gross_Profit'] / business_df['Total_Revenue']) * 100
business_df['ROI'] = (business_df['Net_Profit'] / business_df['Marketing_Spend']) * 100
print("\n💡 Enhanced Business Data with Calculations:")
print(business_df[['Product', 'Total_Revenue', 'Gross_Profit', 'Net_Profit', 'Profit_Margin', 'ROI']].round(2))
📊 Original Business Data:
Product Units_Sold Price_per_Unit Cost_per_Unit Marketing_Spend
0 Laptop 150 1200 800 15000
1 Phone 300 800 500 25000
2 Tablet 200 600 400 18000
3 Watch 500 400 250 30000
4 Headphones 250 200 120 12000
💡 Enhanced Business Data with Calculations:
Product Total_Revenue Gross_Profit Net_Profit Profit_Margin ROI
0 Laptop 180000 60000 45000 33.33 300.00
1 Phone 240000 90000 65000 37.50 260.00
2 Tablet 120000 40000 22000 33.33 122.22
3 Watch 200000 75000 45000 37.50 150.00
4 Headphones 50000 20000 8000 40.00 66.67
8.2.2.2 Operations with Constants and Scalars
In pandas, you can perform arithmetic operations not only between columns, but also with constants (scalars).
The operation is applied element-wise across the entire column or DataFrame — thanks to vectorization.
# 🎯 Operations with constants (scalars)
print("Operations with Constants:")
# Apply 10% discount to all prices
business_df['Discounted_Price'] = business_df['Price_per_Unit'] * 0.9
# Add fixed shipping cost of $50 to all products
business_df['Price_with_Shipping'] = business_df['Price_per_Unit'] + 50
# Calculate tax (8% of revenue)
business_df['Tax_Amount'] = business_df['Total_Revenue'] * 0.08
# Square the units sold (for some statistical analysis)
business_df['Units_Squared'] = business_df['Units_Sold'] ** 2
print("Sample with scalar operations:")
sample_cols = ['Product', 'Price_per_Unit', 'Discounted_Price', 'Price_with_Shipping', 'Tax_Amount']
print(business_df[sample_cols].head(3).to_string(index=False))
Operations with Constants:
Sample with scalar operations:
Product Price_per_Unit Discounted_Price Price_with_Shipping Tax_Amount
Laptop 1200 1080.0 1250 14400.0
Phone 800 720.0 850 19200.0
Tablet 600 540.0 650 9600.0
8.2.2.3 Row-wise Operations (Across Columns)
By default, methods like .mean(), .max(), and .min() operate down each column (axis=0).
However, if you need to perform calculations across all columns in each row, set axis=1.
# 📊 Row-wise arithmetic operations (axis=1 for across columns)
print("🔄 Row-wise Operations:")
# Create a sample DataFrame with numerical data
sample_data = {
'Q1_Sales': [100, 150, 200, 120, 180],
'Q2_Sales': [110, 140, 220, 130, 190],
'Q3_Sales': [120, 160, 210, 140, 200],
'Q4_Sales': [130, 170, 230, 150, 210]
}
quarterly_sales = pd.DataFrame(sample_data,
index=['Product_A', 'Product_B', 'Product_C', 'Product_D', 'Product_E'])
print("📈 Quarterly Sales Data:")
print(quarterly_sales)
# Calculate row-wise statistics
quarterly_sales['Total_Sales'] = quarterly_sales.sum(axis=1) # Sum across columns
quarterly_sales['Average_Sales'] = quarterly_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales', 'Q4_Sales']].mean(axis=1)
quarterly_sales['Max_Quarter'] = quarterly_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales', 'Q4_Sales']].max(axis=1)
quarterly_sales['Sales_Range'] = quarterly_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales', 'Q4_Sales']].max(axis=1) - quarterly_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales', 'Q4_Sales']].min(axis=1)
print("\n📊 With Row-wise Calculations:")
print(quarterly_sales.round(2))
🔄 Row-wise Operations:
📈 Quarterly Sales Data:
Q1_Sales Q2_Sales Q3_Sales Q4_Sales
Product_A 100 110 120 130
Product_B 150 140 160 170
Product_C 200 220 210 230
Product_D 120 130 140 150
Product_E 180 190 200 210
📊 With Row-wise Calculations:
Q1_Sales Q2_Sales Q3_Sales Q4_Sales Total_Sales Average_Sales \
Product_A 100 110 120 130 460 115.0
Product_B 150 140 160 170 620 155.0
Product_C 200 220 210 230 860 215.0
Product_D 120 130 140 150 540 135.0
Product_E 180 190 200 210 780 195.0
Max_Quarter Sales_Range
Product_A 130 30
Product_B 170 30
Product_C 230 30
Product_D 150 30
Product_E 210 30
8.2.3 Arithmetic Operations Between DataFrames
Key Concept: When operating on two DataFrames, pandas automatically aligns data by index and column labels.
8.2.3.1 Alignment Behavior
| Scenario | Result | Example |
|---|---|---|
| Exact match | Normal operation | df1 + df2 |
| Missing index/column | Fills with NaN | Automatic alignment |
| Different shapes | Broadcasts when possible | Scalar operations |
| Using fill_value | Replaces NaN with specified value | df1.add(df2, fill_value=0) |
💡 Pro Tip: Use explicit methods (add(), sub(), mul(), div()) for better control over missing data!
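Before the store example below, here is a minimal Series-level sketch of the alignment rules and of fill_value; the labels and numbers are made up.
import pandas as pd

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2, 3], index=['b', 'c', 'd'])

print(s1 + s2)                   # 'a' and 'd' have no partner, so they become NaN
print(s1.add(s2, fill_value=0))  # missing partners are treated as 0 instead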
8.2.3.2 Basic DataFrame-to-DataFrame Operations
# 🏢 Create two related business DataFrames for comparison
print("🔀 DataFrame-to-DataFrame Operations:")
# 📊 Store A performance data
store_a_data = {
'Electronics': [50000, 55000, 60000],
'Clothing': [30000, 32000, 35000],
'Books': [8000, 8500, 9000]
}
store_a = pd.DataFrame(store_a_data, index=['Q1', 'Q2', 'Q3'])
# 📊 Store B performance data
store_b_data = {
'Electronics': [45000, 50000, 58000],
'Clothing': [28000, 30000, 33000],
'Books': [7500, 8000, 8800]
}
store_b = pd.DataFrame(store_b_data, index=['Q1', 'Q2', 'Q3'])
print("🏪 Store A Sales:")
print(store_a)
print("\n🏪 Store B Sales:")
print(store_b)
# 📈 Basic arithmetic operations between DataFrames
print("\n💡 Arithmetic Operations Between Stores:")
# Total sales across both stores
total_sales = store_a + store_b
print("\n📊 Combined Sales (A + B):")
print(total_sales)
# Difference in performance
sales_diff = store_a - store_b
print("\n📊 Sales Difference (A - B):")
print(sales_diff)
# Growth ratio (Store A relative to Store B)
growth_ratio = store_a / store_b
print("\n📊 Performance Ratio (A / B):")
print(growth_ratio.round(3))
🔀 DataFrame-to-DataFrame Operations:
🏪 Store A Sales:
Electronics Clothing Books
Q1 50000 30000 8000
Q2 55000 32000 8500
Q3 60000 35000 9000
🏪 Store B Sales:
Electronics Clothing Books
Q1 45000 28000 7500
Q2 50000 30000 8000
Q3 58000 33000 8800
💡 Arithmetic Operations Between Stores:
📊 Combined Sales (A + B):
Electronics Clothing Books
Q1 95000 58000 15500
Q2 105000 62000 16500
Q3 118000 68000 17800
📊 Sales Difference (A - B):
Electronics Clothing Books
Q1 5000 2000 500
Q2 5000 2000 500
Q3 2000 2000 200
📊 Performance Ratio (A / B):
Electronics Clothing Books
Q1 1.111 1.071 1.067
Q2 1.100 1.067 1.062
Q3 1.034 1.061 1.023
8.2.3.3 Handling Misaligned Data
# 🚨 Demonstrate alignment issues and solutions
print("⚠️ Handling Misaligned DataFrames:")
# Create DataFrames with different indices and columns
df1_partial = pd.DataFrame({
'A': [10, 20, 30],
'B': [40, 50, 60]
}, index=[0, 1, 2])
df2_partial = pd.DataFrame({
'B': [5, 10, 15, 20], # Extra row
'C': [100, 200, 300, 400] # Different column
}, index=[1, 2, 3, 4]) # Different indices
print("📊 DataFrame 1:")
print(df1_partial)
print("\n📊 DataFrame 2:")
print(df2_partial)
# 🔄 Default behavior - creates NaN for missing alignments
print("\n❌ Default Addition (creates NaN):")
result_nan = df1_partial + df2_partial
print(result_nan)
# 🔧 Using explicit methods with fill_value to handle NaN
print("\n✅ Addition with fill_value=0:")
result_filled = df1_partial.add(df2_partial, fill_value=0)
print(result_filled)
# 🎯 Other explicit arithmetic methods
print("\n🔧 Explicit Methods Available:")
print("• add() - Addition with fill_value option")
print("• sub() - Subtraction with fill_value option")
print("• mul() - Multiplication with fill_value option")
print("• div() - Division with fill_value option")
# Example with subtraction
result_sub = df1_partial.sub(df2_partial, fill_value=1)
print("\n➖ Subtraction with fill_value=1:")
print(result_sub)
⚠️ Handling Misaligned DataFrames:
📊 DataFrame 1:
A B
0 10 40
1 20 50
2 30 60
📊 DataFrame 2:
B C
1 5 100
2 10 200
3 15 300
4 20 400
❌ Default Addition (creates NaN):
A B C
0 NaN NaN NaN
1 NaN 55.0 NaN
2 NaN 70.0 NaN
3 NaN NaN NaN
4 NaN NaN NaN
✅ Addition with fill_value=0:
A B C
0 10.0 40.0 NaN
1 20.0 55.0 100.0
2 30.0 70.0 200.0
3 NaN 15.0 300.0
4 NaN 20.0 400.0
🔧 Explicit Methods Available:
• add() - Addition with fill_value option
• sub() - Subtraction with fill_value option
• mul() - Multiplication with fill_value option
• div() - Division with fill_value option
➖ Subtraction with fill_value=1:
A B C
0 9.0 39.0 NaN
1 19.0 45.0 -99.0
2 29.0 50.0 -199.0
3 NaN -14.0 -299.0
4 NaN -19.0 -399.0
8.2.4 Advanced Arithmetic Techniques
8.2.4.1 Broadcasting and Shape Compatibility
# 📐 Broadcasting: Operations between different shapes
print("📐 Broadcasting Examples:")
# DataFrame with different dimensions
sales_matrix = pd.DataFrame({
'Jan': [1000, 1500, 2000],
'Feb': [1100, 1600, 2100],
'Mar': [1200, 1700, 2200]
}, index=['Store_A', 'Store_B', 'Store_C'])
# Series for operations
monthly_bonus = pd.Series([100, 150, 200], index=['Jan', 'Feb', 'Mar']) # Column-wise
store_multiplier = pd.Series([1.1, 1.2, 1.05], index=['Store_A', 'Store_B', 'Store_C']) # Row-wise
print("📊 Original Sales Matrix:")
print(sales_matrix)
print("\n💰 Monthly Bonus (Series):")
print(monthly_bonus)
print("\n📈 Store Multiplier (Series):")
print(store_multiplier)
# 🔄 Broadcasting operations
print("\n✨ Broadcasting Results:")
# Add monthly bonus to each column
sales_with_bonus = sales_matrix + monthly_bonus
print("📊 Sales + Monthly Bonus (broadcasting across columns):")
print(sales_with_bonus)
# Multiply each row by store multiplier
sales_adjusted = sales_matrix.mul(store_multiplier, axis=0) # axis=0 for row-wise
print("\n📊 Sales × Store Multiplier (broadcasting across rows):")
print(sales_adjusted.round(0))
📐 Broadcasting Examples:
📊 Original Sales Matrix:
Jan Feb Mar
Store_A 1000 1100 1200
Store_B 1500 1600 1700
Store_C 2000 2100 2200
💰 Monthly Bonus (Series):
Jan 100
Feb 150
Mar 200
dtype: int64
📈 Store Multiplier (Series):
Store_A 1.10
Store_B 1.20
Store_C 1.05
dtype: float64
✨ Broadcasting Results:
📊 Sales + Monthly Bonus (broadcasting across columns):
Jan Feb Mar
Store_A 1100 1250 1400
Store_B 1600 1750 1900
Store_C 2100 2250 2400
📊 Sales × Store Multiplier (broadcasting across rows):
Jan Feb Mar
Store_A 1100.0 1210.0 1320.0
Store_B 1800.0 1920.0 2040.0
Store_C 2100.0 2205.0 2310.0
8.2.5 Error Handling and Edge Cases
Common issues and solutions when working with arithmetic operations
# 🚨 Handling common arithmetic errors and edge cases
print("🛡️ Error Handling and Edge Cases:")
# Create data with potential issues
problematic_data = pd.DataFrame({
'Values': [10, 0, -5, None, float('inf')],
'Divisors': [2, 0, 1, 3, 2],
'Mixed_Types': ['10', 20, '30', 40, '50'] # Mixed string/numeric
})
print("⚠️ Problematic Data:")
print(problematic_data)
print(f"Data types:\n{problematic_data.dtypes}")
# 🔧 Issue 1: Division by zero
print("\n❌ Division by Zero Issue:")
try:
result = problematic_data['Values'] / problematic_data['Divisors']
print("Result with division by zero:")
print(result)
except Exception as e:
print(f"Error: {e}")
# ✅ Solution: Handle division by zero
print("\n✅ Safe Division:")
safe_division = problematic_data['Values'].div(problematic_data['Divisors']).replace([float('inf'), float('-inf')], 0)
print(safe_division)
# 🔧 Issue 2: Mixed data types
print("\n❌ Mixed Data Types Issue:")
try:
# Convert string column to numeric first
mixed_numeric = pd.to_numeric(problematic_data['Mixed_Types'], errors='coerce')
print("Converted mixed types:")
print(mixed_numeric)
# Now arithmetic works
result = mixed_numeric + 5
print("After adding 5:")
print(result)
except Exception as e:
print(f"Error: {e}")
# 🔧 Issue 3: NaN and infinity handling
print("\n🔧 NaN and Infinity Handling:")
# Create data with NaN and inf
test_data = pd.Series([1, 2, None, float('inf'), float('-inf'), 5])
print("Original data with NaN and infinity:")
print(test_data)
# Fill NaN with specific values
filled_data = test_data.fillna(0)
print("\nAfter filling NaN with 0:")
print(filled_data)
# Replace infinity with finite values
clean_data = filled_data.replace([float('inf'), float('-inf')], [999999, -999999])
print("\nAfter replacing infinity:")
print(clean_data)
# 🎯 Best practices summary
print("\n💡 Best Practices for Robust Arithmetic:")
print("1. Check data types before operations: df.dtypes")
print("2. Handle NaN values: fillna(), dropna(), or errors='coerce'")
print("3. Check for infinity: df.replace([inf, -inf], value)")
print("4. Use explicit methods for better control: add(), sub(), mul(), div()")
print("5. Validate results: df.isna().sum(), df.isinf().sum() if hasattr(df, 'isinf')")🛡️ Error Handling and Edge Cases:
⚠️ Problematic Data:
Values Divisors Mixed_Types
0 10.0 2 10
1 0.0 0 20
2 -5.0 1 30
3 NaN 3 40
4 inf 2 50
Data types:
Values float64
Divisors int64
Mixed_Types object
dtype: object
❌ Division by Zero Issue:
Result with division by zero:
0 5.0
1 NaN
2 -5.0
3 NaN
4 inf
dtype: float64
✅ Safe Division:
0 5.0
1 NaN
2 -5.0
3 NaN
4 0.0
dtype: float64
❌ Mixed Data Types Issue:
Converted mixed types:
0 10
1 20
2 30
3 40
4 50
Name: Mixed_Types, dtype: int64
After adding 5:
0 15
1 25
2 35
3 45
4 55
Name: Mixed_Types, dtype: int64
🔧 NaN and Infinity Handling:
Original data with NaN and infinity:
0 1.0
1 2.0
2 NaN
3 inf
4 -inf
5 5.0
dtype: float64
After filling NaN with 0:
0 1.0
1 2.0
2 0.0
3 inf
4 -inf
5 5.0
dtype: float64
After replacing infinity:
0 1.0
1 2.0
2 0.0
3 999999.0
4 -999999.0
5 5.0
dtype: float64
💡 Best Practices for Robust Arithmetic:
1. Check data types before operations: df.dtypes
2. Handle NaN values: fillna(), dropna(), or errors='coerce'
3. Check for infinity: df.replace([inf, -inf], value)
4. Use explicit methods for better control: add(), sub(), mul(), div()
5. Validate results: df.isna().sum() and np.isinf() on numeric columns
8.2.6 Performance and Method Comparison
| Operation Type | Operator | Explicit Method | Fill Value | Performance | Use Case |
|---|---|---|---|---|---|
| Addition | df1 + df2 | df1.add(df2, fill_value=0) | ✅ Yes | ⚡ Fast | Missing data control |
| Subtraction | df1 - df2 | df1.sub(df2, fill_value=0) | ✅ Yes | ⚡ Fast | Missing data control |
| Multiplication | df1 * df2 | df1.mul(df2, fill_value=1) | ✅ Yes | ⚡ Fast | Broadcasting control |
| Division | df1 / df2 | df1.div(df2, fill_value=1) | ✅ Yes | ⚡ Fast | Zero division safety |
| Power | df1 ** 2 | df1.pow(2) | ✅ Yes | ⚡ Fast | Exponential operations |
| Floor Division | df1 // df2 | df1.floordiv(df2) | ✅ Yes | ⚡ Fast | Integer division |
| Modulo | df1 % df2 | df1.mod(df2) | ✅ Yes | ⚡ Fast | Remainder operations |
8.2.7 Are Arithmetic Operations Vectorized?
Yes — arithmetic operations in pandas are vectorized because they rely on NumPy under the hood.
This means operations are applied element-wise across entire columns or DataFrames at once, instead of looping through rows in Python.
✅ Benefits of Vectorization:
- Much faster than Python loops (implemented in optimized C code)
- Cleaner and more concise syntax
- Handles automatic alignment on indexes and columns
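To make that concrete, here is a small, hedged timing sketch (the data is synthetic and the exact numbers will vary by machine): the same profit calculation done with a Python-level loop over rows and as a single vectorized expression.
import time
import numpy as np
import pandas as pd

demo = pd.DataFrame({'revenue': np.random.rand(20_000) * 1000,
                     'cost': np.random.rand(20_000) * 800})

start = time.time()
profit_loop = [row['revenue'] - row['cost'] for _, row in demo.iterrows()]  # Python loop
loop_time = time.time() - start

start = time.time()
profit_vec = demo['revenue'] - demo['cost']  # vectorized column arithmetic
vec_time = time.time() - start

print(f"Loop: {loop_time:.3f}s | Vectorized: {vec_time:.5f}s")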
8.3 Correlation Operations: Understanding Relationships Between Variables
8.3.1 What is Correlation?
Correlation measures the strength and direction of the linear relationship between two numerical variables. In business and data science, correlation analysis helps you:
- Discover patterns in your data
- Identify relationships between metrics
- Make predictions based on variable relationships
- Generate insights for decision-making
Key Concepts:
- Correlation ≠ Causation: High correlation doesn’t mean one variable causes the other
- Range: Correlation values range from -1 to +1
- Interpretation: Closer to ±1 = stronger relationship, closer to 0 = weaker relationship
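As a quick sanity check on what .corr() computes (a minimal sketch with made-up numbers): the Pearson coefficient is the covariance of the two variables divided by the product of their standard deviations.
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 5, 4, 5])

manual_r = x.cov(y) / (x.std() * y.std())       # covariance scaled by both standard deviations
print(round(manual_r, 4), round(x.corr(y), 4))  # both values agree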
8.3.2 Correlation Coefficient Interpretation
| Range | Strength | Direction | Business Example |
|---|---|---|---|
| +0.8 to +1.0 | Very Strong Positive | As X ↗️, Y ↗️ | More ads → More sales |
| +0.6 to +0.8 | Strong Positive | As X ↗️, Y ↗️ | Higher price → Higher quality perception |
| +0.3 to +0.6 | Moderate Positive | As X ↗️, Y ↗️ | Customer age → Spending |
| -0.1 to +0.3 | Weak/No Relationship | Random pattern | Product color → Sales |
| -0.3 to -0.6 | Moderate Negative | As X ↗️, Y ↘️ | Discounts → Profit margin |
| -0.6 to -0.8 | Strong Negative | As X ↗️, Y ↘️ | Product defects → Customer satisfaction |
| -0.8 to -1.0 | Very Strong Negative | As X ↗️, Y ↘️ | Price increase → Demand |
8.3.3 Correlation Types
Pandas provides multiple correlation methods that you can specify with
df.corr(method=...). By default, it uses Pearson, but it’s important to know the alternatives and when to use them:
| Method | Best For | Advantages | Limitations |
|---|---|---|---|
| Pearson | Linear relationships | Simple, standard, interpretable | Very sensitive to outliers |
| Spearman | Monotonic (rank-based) relationships | Handles non-linearity well | Loses information about magnitude |
| Kendall | Robust rank correlation | Highly resistant to outliers | Slower, computationally intensive |
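A short sketch of why the choice matters (made-up data): a single extreme outlier pulls the Pearson coefficient well below 1, while the rank-based Spearman coefficient stays at 1 because the ordering is unchanged.
import pandas as pd

df_out = pd.DataFrame({'x': [1, 2, 3, 4, 5, 6],
                       'y': [2, 4, 6, 8, 10, 120]})  # the last y value is an outlier

print(df_out['x'].corr(df_out['y'], method='pearson'))   # noticeably below 1
print(df_out['x'].corr(df_out['y'], method='spearman'))  # 1.0 because the ranks are still perfectly ordered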
8.3.3.1 Decision Guide: Choosing the Right Correlation
# Simple decision guide for choosing a correlation method
def choose_correlation_method(data_has_outliers, is_non_linear_monotonic,
                              small_sample_size, need_extra_robustness):
    if data_has_outliers or is_non_linear_monotonic:
        return "spearman"   # Rank-based, robust to outliers
    elif small_sample_size or need_extra_robustness:
        return "kendall"    # Most robust, but slower
    else:
        return "pearson"    # Standard choice for linear correlation
💡 Note for this chapter: We will focus on Pearson correlation, since it is the default in pandas and most commonly used in practice.
8.3.4 Pandas Correlation Methods
Pandas offers several methods for calculating correlations:
| Method | Purpose | Use Case |
|---|---|---|
| df.corr() | Full correlation matrix | Explore all relationships |
| df['A'].corr(df['B']) | Pairwise correlation | Specific relationship |
| df.corrwith(series) | Correlate with external series | Compare with benchmark |
💡 In this chapter, we’ll focus on the first two methods (df.corr() and pairwise .corr()), and skip corrwith() for simplicity.
import numpy as np
import pandas as pd
# 📊 Create sample data for correlation demonstration
print("🔍 CORRELATION OPERATIONS DEMONSTRATION")
print("="*50)
# Create sample business dataset
np.random.seed(42) # For reproducible results
n_samples = 100
# Create correlated business metrics
sales_data = pd.DataFrame({
'Advertising_Spend': np.random.normal(1000, 200, n_samples),
'Website_Traffic': np.random.normal(5000, 1000, n_samples),
'Customer_Satisfaction': np.random.normal(4.0, 0.5, n_samples),
'Product_Price': np.random.normal(50, 10, n_samples)
})
# Add some realistic correlations
sales_data['Sales_Revenue'] = (
0.8 * sales_data['Advertising_Spend'] +
0.3 * sales_data['Website_Traffic'] +
200 * sales_data['Customer_Satisfaction'] +
np.random.normal(0, 500, n_samples)
)
sales_data['Conversion_Rate'] = (
0.02 + 0.01 * sales_data['Customer_Satisfaction'] -
0.0002 * sales_data['Product_Price'] +
np.random.normal(0, 0.005, n_samples)
)
# Ensure realistic ranges
sales_data['Customer_Satisfaction'] = sales_data['Customer_Satisfaction'].clip(1, 5)
sales_data['Conversion_Rate'] = sales_data['Conversion_Rate'].clip(0.001, 0.1)
print(f"\nDataset shape: {sales_data.shape}")
print("📋 Sample Business Dataset:")
sales_data.head()
🔍 CORRELATION OPERATIONS DEMONSTRATION
==================================================
Dataset shape: (100, 6)
📋 Sample Business Dataset:
| | Advertising_Spend | Website_Traffic | Customer_Satisfaction | Product_Price | Sales_Revenue | Conversion_Rate |
|---|---|---|---|---|---|---|
| 0 | 1099.342831 | 3584.629258 | 4.178894 | 41.710050 | 1993.427949 | 0.058078 |
| 1 | 972.347140 | 4579.354677 | 4.280392 | 44.398190 | 2708.075056 | 0.063471 |
| 2 | 1129.537708 | 4657.285483 | 4.541526 | 57.472936 | 3211.742785 | 0.046928 |
| 3 | 1304.605971 | 4197.722731 | 4.526901 | 56.103703 | 3231.872098 | 0.056863 |
| 4 | 953.169325 | 4838.714288 | 3.311165 | 49.790984 | 2651.350074 | 0.039900 |
8.3.4.1 Method 1: Full Correlation Matrix
The correlation matrix shows relationships between ALL numerical variables at once:
# 🔍 METHOD 1: FULL CORRELATION MATRIX
print("📊 CORRELATION MATRIX ANALYSIS")
print("="*40)
# Calculate correlation matrix using Pearson (default method)
print("🔢 PEARSON CORRELATION MATRIX")
print("-"*30)
correlation_matrix = sales_data.corr() # Pearson is default
print(correlation_matrix.round(3))
# Business interpretation
print("\n💼 BUSINESS INTERPRETATION:")
print("-"*25)
# Find strongest correlations
correlation_pairs = []
for i, col1 in enumerate(correlation_matrix.columns):
for j, col2 in enumerate(correlation_matrix.columns):
if i < j: # Avoid duplicates and self-correlation
corr_val = correlation_matrix.loc[col1, col2]
correlation_pairs.append((col1, col2, corr_val))
# Sort by absolute correlation strength
correlation_pairs.sort(key=lambda x: abs(x[2]), reverse=True)
print("🎯 Top 3 Strongest Relationships:")
for i, (col1, col2, corr_val) in enumerate(correlation_pairs[:3]):
strength = "Strong" if abs(corr_val) > 0.7 else "Moderate" if abs(corr_val) > 0.3 else "Weak"
direction = "Positive" if corr_val > 0 else "Negative"
print(f"{i+1}. {col1} ↔️ {col2}")
print(f" Correlation: {corr_val:.3f} ({strength} {direction})")
if corr_val > 0:
print(f" 📈 As {col1} increases, {col2} tends to increase")
else:
print(f" 📉 As {col1} increases, {col2} tends to decrease")
print()
📊 CORRELATION MATRIX ANALYSIS
========================================
🔢 PEARSON CORRELATION MATRIX
------------------------------
Advertising_Spend Website_Traffic \
Advertising_Spend 1.000 -0.136
Website_Traffic -0.136 1.000
Customer_Satisfaction 0.185 -0.037
Product_Price -0.170 -0.018
Sales_Revenue 0.085 0.565
Conversion_Rate 0.094 -0.114
Customer_Satisfaction Product_Price Sales_Revenue \
Advertising_Spend 0.185 -0.170 0.085
Website_Traffic -0.037 -0.018 0.565
Customer_Satisfaction 1.000 0.005 0.085
Product_Price 0.005 1.000 0.131
Sales_Revenue 0.085 0.131 1.000
Conversion_Rate 0.696 -0.214 -0.079
Conversion_Rate
Advertising_Spend 0.094
Website_Traffic -0.114
Customer_Satisfaction 0.696
Product_Price -0.214
Sales_Revenue -0.079
Conversion_Rate 1.000
💼 BUSINESS INTERPRETATION:
-------------------------
🎯 Top 3 Strongest Relationships:
1. Customer_Satisfaction ↔️ Conversion_Rate
Correlation: 0.696 (Moderate Positive)
📈 As Customer_Satisfaction increases, Conversion_Rate tends to increase
2. Website_Traffic ↔️ Sales_Revenue
Correlation: 0.565 (Moderate Positive)
📈 As Website_Traffic increases, Sales_Revenue tends to increase
3. Product_Price ↔️ Conversion_Rate
Correlation: -0.214 (Weak Negative)
📉 As Product_Price increases, Conversion_Rate tends to decrease
8.3.4.2 Method 2: Pairwise Correlation
For analyzing specific relationships between two variables:
# METHOD 2: PAIRWISE CORRELATION ANALYSIS
print("🔍 PAIRWISE CORRELATION ANALYSIS")
print("="*50)
# Select two interesting columns for detailed analysis
col1, col2 = 'Advertising_Spend', 'Sales_Revenue'
print(f"🎬 Analyzing relationship between '{col1}' and '{col2}'")
print("-"*60)
# Calculate Pearson correlation for the pair (default method)
correlation_value = sales_data[col1].corr(sales_data[col2])
print(f"📊 Correlation coefficient: {correlation_value:.4f}")
# Detailed interpretation
print(f"\n💡 INTERPRETATION:")
print("-"*20)
def interpret_correlation(corr_val):
abs_corr = abs(corr_val)
if abs_corr >= 0.8:
strength = "Very Strong"
elif abs_corr >= 0.6:
strength = "Strong"
elif abs_corr >= 0.3:
strength = "Moderate"
elif abs_corr >= 0.1:
strength = "Weak"
else:
strength = "Very Weak/None"
direction = "Positive" if corr_val > 0 else "Negative"
return f"{strength} {direction}"
print(f"🔢 Relationship strength: {interpret_correlation(correlation_value)}")
# Business context
print(f"\n🎯 BUSINESS CONTEXT:")
print("-"*20)
if abs(correlation_value) > 0.5:
print("✅ Strong relationship detected!")
print(f" • This suggests {col1} and {col2} move together")
print(" • Could be useful for predictions or business decisions")
if correlation_value > 0:
print(f" • Higher {col1} is associated with higher {col2}")
else:
print(f" • Higher {col1} is associated with lower {col2}")
else:
print("⚠️ Moderate/Weak relationship detected")
print(f" • {col1} and {col2} don't have a strong linear relationship")
print(" • Consider other factors or non-linear relationships")
# Demonstrate different variable pairs
print(f"\n🔍 OTHER INTERESTING PAIRS:")
print("-"*25)
pairs_to_check = [
('Customer_Satisfaction', 'Sales_Revenue'),
('Product_Price', 'Conversion_Rate'),
('Website_Traffic', 'Sales_Revenue')
]
for pair_col1, pair_col2 in pairs_to_check:
pair_corr = sales_data[pair_col1].corr(sales_data[pair_col2])
print(f"{pair_col1} ↔️ {pair_col2}: {pair_corr:.3f}")🔍 PAIRWISE CORRELATION ANALYSIS
==================================================
🎬 Analyzing relationship between 'Advertising_Spend' and 'Sales_Revenue'
------------------------------------------------------------
📊 Correlation coefficient: 0.0847
💡 INTERPRETATION:
--------------------
🔢 Relationship strength: Very Weak/None Positive
🎯 BUSINESS CONTEXT:
--------------------
⚠️ Moderate/Weak relationship detected
• Advertising_Spend and Sales_Revenue don't have a strong linear relationship
• Consider other factors or non-linear relationships
🔍 OTHER INTERESTING PAIRS:
-------------------------
Customer_Satisfaction ↔️ Sales_Revenue: 0.085
Product_Price ↔️ Conversion_Rate: -0.214
Website_Traffic ↔️ Sales_Revenue: 0.565
8.3.4.3 Advanced Correlation Analysis: Finding Strong Correlations Programmatically
# 🔍 ADVANCED CORRELATION ANALYSIS
print(" ADVANCED CORRELATION TECHNIQUES")
print("="*50)
# 1. Find strongest correlations programmatically
print(" 1. FINDING STRONGEST CORRELATIONS")
print("-"*35)
correlation_matrix = sales_data.corr() # Uses Pearson correlation by default
# Create a function to find strong correlations
def find_strong_correlations(corr_matrix, threshold=0.5):
"""Find correlations above threshold, excluding diagonal"""
strong_corrs = []
for i in range(len(corr_matrix.columns)):
for j in range(i+1, len(corr_matrix.columns)): # Avoid duplicates and diagonal
col1 = corr_matrix.columns[i]
col2 = corr_matrix.columns[j]
corr_val = corr_matrix.iloc[i, j]
if pd.notna(corr_val) and abs(corr_val) >= threshold:
strong_corrs.append({
'Variable1': col1,
'Variable2': col2,
'Correlation': corr_val,
'Strength': abs(corr_val)
})
return pd.DataFrame(strong_corrs).sort_values('Strength', ascending=False)
strong_correlations = find_strong_correlations(correlation_matrix, threshold=0.3)
if len(strong_correlations) > 0:
print(f"📊 Found {len(strong_correlations)} correlations above 0.3 threshold:")
print(strong_correlations.round(3))
print(f"\n🎯 TOP CORRELATION:")
if len(strong_correlations) > 0:
top_corr = strong_correlations.iloc[0]
print(f" {top_corr['Variable1']} ↔️ {top_corr['Variable2']}")
print(f" Strength: {top_corr['Correlation']:.3f}")
if top_corr['Correlation'] > 0:
print(" 📈 Positive relationship: both increase/decrease together")
else:
print(" 📉 Negative relationship: one increases as other decreases")
else:
print("📊 No strong correlations found above 0.3 threshold")
print("💡 Try lowering threshold or check data quality")
# 2. Correlation matrix summary
print(f"\n🌡️ 2. CORRELATION MATRIX SUMMARY")
print("-"*30)
if len(correlation_matrix) > 1:
# Summary statistics of correlations
corr_values = correlation_matrix.values
# Get upper triangle (excluding diagonal)
mask = np.triu(np.ones_like(corr_values, dtype=bool), k=1)
upper_triangle = corr_values[mask]
print(f"📊 Matrix size: {correlation_matrix.shape[0]} x {correlation_matrix.shape[1]}")
print(f"📈 Highest correlation: {np.nanmax(upper_triangle):.3f}")
print(f"📉 Lowest correlation: {np.nanmin(upper_triangle):.3f}")
print(f"📊 Average correlation: {np.nanmean(upper_triangle):.3f}")
# Count correlations by strength
strong_positive = np.sum(upper_triangle >= 0.5)
strong_negative = np.sum(upper_triangle <= -0.5)
moderate = np.sum((upper_triangle >= 0.3) & (upper_triangle < 0.5)) + np.sum((upper_triangle <= -0.3) & (upper_triangle > -0.5))
weak = np.sum((upper_triangle > -0.3) & (upper_triangle < 0.3))
print(f"\n📊 CORRELATION STRENGTH DISTRIBUTION:")
print(f" 🔥 Strong (|r| ≥ 0.5): {strong_positive + strong_negative}")
print(f" 📊 Moderate (0.3 ≤ |r| < 0.5): {moderate}")
print(f" ⚪ Weak (|r| < 0.3): {weak}")
print(f"\n💡 PRO TIPS FOR CORRELATION ANALYSIS:")
print("-"*35)
print("✅ Always visualize correlations with scatter plots")
print("✅ Check for non-linear relationships")
print("✅ Consider domain knowledge when interpreting")
print("✅ Remember: correlation ≠ causation")
print("✅ Look for outliers that might affect correlations")
print("✅ Use different correlation methods for robustness") ADVANCED CORRELATION TECHNIQUES
==================================================
1. FINDING STRONGEST CORRELATIONS
-----------------------------------
📊 Found 2 correlations above 0.3 threshold:
Variable1 Variable2 Correlation Strength
1 Customer_Satisfaction Conversion_Rate 0.696 0.696
0 Website_Traffic Sales_Revenue 0.565 0.565
🎯 TOP CORRELATION:
Customer_Satisfaction ↔️ Conversion_Rate
Strength: 0.696
📈 Positive relationship: both increase/decrease together
🌡️ 2. CORRELATION MATRIX SUMMARY
------------------------------
📊 Matrix size: 6 x 6
📈 Highest correlation: 0.696
📉 Lowest correlation: -0.214
📊 Average correlation: 0.072
📊 CORRELATION STRENGTH DISTRIBUTION:
🔥 Strong (|r| ≥ 0.5): 2
📊 Moderate (0.3 ≤ |r| < 0.5): 0
⚪ Weak (|r| < 0.3): 13
💡 PRO TIPS FOR CORRELATION ANALYSIS:
-----------------------------------
✅ Always visualize correlations with scatter plots
✅ Check for non-linear relationships
✅ Consider domain knowledge when interpreting
✅ Remember: correlation ≠ causation
✅ Look for outliers that might affect correlations
✅ Use different correlation methods for robustness
8.3.5 Summary: Correlation Operations
| Method | Best For | Advantages | Limitations |
|---|---|---|---|
| df.corr() | Overview of all relationships | Complete picture | Can be overwhelming |
| df['A'].corr(df['B']) | Specific pair analysis | Focused, detailed | Limited scope |
| df.corrwith(series) | Benchmark comparisons | External validation | Requires external data |
Key Takeaways:
- Correlation reveals relationships but not causation
- Always visualize data alongside correlation coefficients
- Use multiple correlation methods for robust analysis
- Consider business context when interpreting results
- Strong correlations (|r| > 0.7) suggest predictive potential
8.3.6 Are Correlation Operations Vectorized?
Yes! Correlation operations in pandas (and NumPy under the hood) are vectorized:
- No explicit loops: Correlations are computed using optimized NumPy routines in C, avoiding slow Python row-by-row iteration.
- Matrix-based computation: df.corr() computes the covariance matrix of all numeric columns at once, then scales it into a correlation matrix.
- Pairwise efficiency: Even df['A'].corr(df['B']) leverages array-level operations on entire columns, not element-by-element loops.
- Scalability: Vectorization makes correlation fast and efficient for large datasets, with memory (not speed) usually being the main limitation.
💡 This is why pandas can compute full n × n correlation matrices quickly, even for wide datasets.
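Here is a brief sketch of that matrix-based computation, reusing the sales_data DataFrame built earlier in this section: the correlation matrix is just the covariance matrix rescaled by the outer product of the column standard deviations.
import numpy as np

cov = sales_data.cov()                    # covariance matrix, computed in one vectorized pass
stds = sales_data.std()
manual_corr = cov / np.outer(stds, stds)  # divide each cov[i, j] by std_i * std_j
print(np.allclose(manual_corr, sales_data.corr()))  # True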
8.4 Advanced Operations: Function Application & Data Transformation
Pandas provides powerful methods for applying functions to DataFrame rows, columns, or individual elements. These methods are the cornerstone of advanced data manipulation, enabling complex transformations that would be cumbersome with basic operations.
8.4.1 apply(): The Swiss Army Knife of Data Transformation
The apply() function is one of pandas’ most versatile tools. It can apply any custom or built-in function across DataFrame axes or to elements in a Series, making it ideal for flexible transformations.
8.4.1.1 Core Concepts
- Row-wise operations (axis=1) → apply a function across rows (each row passed as a Series)
- Column-wise operations (axis=0) → apply a function across columns (each column passed as a Series)
- Element-wise operations → apply a function to each value in a Series
8.4.1.2 Syntax Patterns
# DataFrame apply() - row-wise or column-wise
df.apply(function, axis=1) # Row-wise (rows as Series)
df.apply(function, axis=0) # Column-wise (columns as Series)
# Series apply() - element-wise
df['column'].apply(function) # Each element passed individually
# With additional arguments
df.apply(function, axis=1, args=(arg1, arg2))
8.4.1.3 Real-World Example: Financial Analysis
# 📈 Create a comprehensive financial dataset
# Generate realistic financial data
np.random.seed(42)
n_companies = 100
financial_data = {
'Company': [f'Company_{i:03d}' for i in range(1, n_companies + 1)],
'Revenue': np.random.normal(50000, 15000, n_companies).round(0),
'Expenses': np.random.normal(35000, 10000, n_companies).round(0),
'Employees': np.random.randint(10, 500, n_companies),
'Market_Cap': np.random.lognormal(15, 1, n_companies).round(0),
'Industry': np.random.choice(['Tech', 'Healthcare', 'Finance', 'Retail', 'Manufacturing'], n_companies),
'Founded_Year': np.random.randint(1990, 2020, n_companies),
'Country': np.random.choice(['USA', 'Canada', 'UK', 'Germany', 'Japan'], n_companies)
}
df_financial = pd.DataFrame(financial_data)
# Ensure expenses don't exceed revenue (business logic)
def cap_expenses(row):
"""
Ensure expenses don't exceed 95% of revenue
This prevents companies from having expenses that are too close to or exceed their revenue
"""
# Calculate the maximum allowed expenses (95% of revenue)
max_allowed_expenses = row['Revenue'] * 0.95
# Return the smaller value: either current expenses or the maximum allowed
if row['Expenses'] > max_allowed_expenses:
return max_allowed_expenses
else:
return row['Expenses']
# Apply the function to each row
df_financial['Expenses'] = df_financial.apply(cap_expenses, axis=1)
# Calculate profit and profit margin for each company
def calculate_profit(row):
return row['Revenue'] - row['Expenses']
def calculate_profit_margin(row):
profit = row['Revenue'] - row['Expenses']
return round((profit / row['Revenue']) * 100, 2) if row['Revenue'] > 0 else 0
# Apply functions to create new columns
df_financial['Profit'] = df_financial.apply(calculate_profit, axis=1)
df_financial['Profit_Margin'] = df_financial.apply(calculate_profit_margin, axis=1)
print(f"\nDataset shape: {df_financial.shape}")
print(f"Industries: {df_financial['Industry'].unique()}")
print("📊 Financial Dataset Sample:")
df_financial.head(10)
Dataset shape: (100, 10)
Industries: ['Retail' 'Manufacturing' 'Tech' 'Finance' 'Healthcare']
📊 Financial Dataset Sample:
| | Company | Revenue | Expenses | Employees | Market_Cap | Industry | Founded_Year | Country | Profit | Profit_Margin |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Company_001 | 57451.0 | 20846.0 | 395 | 2340214.0 | Retail | 2010 | USA | 36605.0 | 63.72 |
| 1 | Company_002 | 47926.0 | 30794.0 | 483 | 2378910.0 | Manufacturing | 1994 | UK | 17132.0 | 35.75 |
| 2 | Company_003 | 59715.0 | 31573.0 | 282 | 4491813.0 | Tech | 1999 | Germany | 28142.0 | 47.13 |
| 3 | Company_004 | 72845.0 | 26977.0 | 113 | 651982.0 | Tech | 1999 | Germany | 45868.0 | 62.97 |
| 4 | Company_005 | 46488.0 | 33387.0 | 426 | 8413782.0 | Finance | 2008 | USA | 13101.0 | 28.18 |
| 5 | Company_006 | 46488.0 | 39041.0 | 402 | 11598738.0 | Manufacturing | 2006 | UK | 7447.0 | 16.02 |
| 6 | Company_007 | 73688.0 | 53862.0 | 308 | 4767327.0 | Retail | 2010 | UK | 19826.0 | 26.91 |
| 7 | Company_008 | 61512.0 | 36746.0 | 255 | 17044719.0 | Tech | 2003 | UK | 24766.0 | 40.26 |
| 8 | Company_009 | 42958.0 | 37576.0 | 185 | 867324.0 | Retail | 1998 | Japan | 5382.0 | 12.53 |
| 9 | Company_010 | 58138.0 | 34256.0 | 48 | 20660119.0 | Tech | 2003 | Canada | 23882.0 | 41.08 |
8.4.1.4 How apply() Actually Works
apply() is essentially a Python-level loop that iterates over each row or column and applies your function one-by-one.
It’s much slower than true vectorized operations (NumPy ufuncs, pandas built-ins).
✅ Prefer vectorized operations whenever possible.
🔁 Use apply() only when logic can't be expressed with vectorized methods.
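For instance, the cap_expenses() row function from the financial example above can be rewritten as a single vectorized expression. A minimal sketch using NumPy's np.minimum and np.where (both standard NumPy functions):
import numpy as np

# Vectorized equivalent of the row-wise cap_expenses(): clip expenses at 95% of revenue
capped = np.minimum(df_financial['Expenses'], df_financial['Revenue'] * 0.95)

# The same logic written as an explicit condition with np.where(condition, if_true, if_false)
capped_alt = np.where(df_financial['Expenses'] > df_financial['Revenue'] * 0.95,
                      df_financial['Revenue'] * 0.95,
                      df_financial['Expenses'])

print((capped == capped_alt).all())  # both approaches give identical results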
8.4.2 map(): Efficient Element-wise Transformations (on a Series)
Series.map() applies a transformation element by element and is often faster/simpler than apply() for 1-to-1 lookups or simple functions.
8.4.2.1 ✅ When to use
- Value mapping / recoding (e.g., codes → labels)
- Category encoding (labels → numbers)
- Simple per-value transforms (string cleanup, parsing, normalization)
- Lookup from a codebook stored as a dict or another Series
8.4.2.2 Three Ways to Use map()
# Sample data
df = pd.DataFrame({
'Industry': ['Tech', 'Healthcare', 'Finance', 'Tech', 'Retail'],
'Size': ['Small', 'Large', 'Medium', 'Small', 'Large'],
'Revenue': [100000, 150000, 80000, 120000, 90000]
})
# 1. **Dictionary Mapping** - Most Common
industry_codes = {'Tech': 1, 'Healthcare': 2, 'Finance': 3, 'Retail': 4}
df['Industry_Code'] = df['Industry'].map(industry_codes)
# 2. **Function Mapping**
def add_prefix(industry):
return f"IND_{industry}"
df['Industry_Prefixed'] = df['Industry'].map(add_prefix)
# 3. **Series Mapping**
size_priority = pd.Series({'Small': 3, 'Medium': 2, 'Large': 1})
df['Size_Priority'] = df['Size'].map(size_priority)
print(df)
Industry   Size  Revenue  Industry_Code Industry_Prefixed  Size_Priority
0 Tech Small 100000 1 IND_Tech 3
1 Healthcare Large 150000 2 IND_Healthcare 1
2 Finance Medium 80000 3 IND_Finance 2
3 Tech Small 120000 1 IND_Tech 3
4 Retail Large 90000 4 IND_Retail 1
8.4.2.3 How map() Actually Works
map() iterates through each element in the Series and applies the mapping function/dictionary lookup individually.
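One behavioral detail worth knowing before the timing tests below (a small sketch with made-up values): map() returns NaN for any value that is missing from the mapping, whereas replace() leaves unmapped values untouched.
import pandas as pd

colors = pd.Series(['red', 'green', 'blue'])
mapping = {'red': 'R', 'green': 'G'}  # 'blue' is intentionally missing

print(colors.map(mapping))      # unmapped 'blue' becomes NaN
print(colors.replace(mapping))  # unmapped 'blue' stays 'blue'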
# 🚀 Practical Performance Comparison: Real-World Examples
# Create a realistic dataset for performance testing
np.random.seed(42)
n_rows = 50000
performance_data = pd.DataFrame({
'customer_id': range(1, n_rows + 1),
'purchase_amount': np.random.uniform(10, 1000, n_rows),
'discount_pct': np.random.choice([0, 5, 10, 15, 20], n_rows),
'category': np.random.choice(['A', 'B', 'C', 'D'], n_rows),
'rating': np.random.randint(1, 6, n_rows),
'is_premium': np.random.choice([True, False], n_rows)
})
print("📊 Performance Test Dataset:")
print(f"Dataset shape: {performance_data.shape}")
print(performance_data.head())
# Define category mapping for testing
category_mapping = {'A': 'Electronics', 'B': 'Clothing', 'C': 'Books', 'D': 'Sports'}
print(f"\n🏁 Performance Test Results ({n_rows:,} rows):")
print("-" * 60)📊 Performance Test Dataset:
Dataset shape: (50000, 6)
customer_id purchase_amount discount_pct category rating is_premium
0 1 380.794718 0 C 1 True
1 2 951.207163 15 D 1 False
2 3 734.674002 5 D 1 True
3 4 602.671899 5 A 4 True
4 5 164.458454 0 D 3 True
🏁 Performance Test Results (50,000 rows):
------------------------------------------------------------
# 📊 Test 1: Simple Mathematical Operations
print("🧮 Test 1: Calculate final price after discount")
# Method 1: VECTORIZED (Fastest)
start_time = time.time()
performance_data['final_price_vectorized'] = performance_data['purchase_amount'] * (1 - performance_data['discount_pct'] / 100)
vectorized_time = time.time() - start_time
# Method 2: APPLY with function (Slower)
def calculate_final_price(row):
return row['purchase_amount'] * (1 - row['discount_pct'] / 100)
start_time = time.time()
performance_data['final_price_apply'] = performance_data.apply(calculate_final_price, axis=1)
apply_time = time.time() - start_time
print(f"✅ Vectorized: {vectorized_time:.4f}s (1.0x baseline)")
print(f"⚙️ Apply function: {apply_time:.4f}s ({apply_time/vectorized_time:.1f}x slower)")
# Verify results are identical
results_match = (performance_data['final_price_vectorized'].round(2) == performance_data['final_price_apply'].round(2)).all()
print(f"🔍 Results identical: {results_match}")
# Clean up test columns
performance_data.drop(['final_price_vectorized', 'final_price_apply'], axis=1, inplace=True)
🧮 Test 1: Calculate final price after discount
✅ Vectorized: 0.0013s (1.0x baseline)
⚙️ Apply function: 0.2778s (207.6x slower)
🔍 Results identical: True
8.4.3 replace(): Replace specific values with other values
s_with_nan = pd.Series([1, 2, np.nan, 4])
# map() with na_action
mapped_nan = s_with_nan.map({1: 'one', 2: 'two', 4: 'four'}, na_action='ignore')
print("map() with na_action='ignore':")
print(mapped_nan)
# replace() can target NaN directly
replaced_nan = s_with_nan.replace(np.nan, 'missing')
print("\nreplace() handling NaN:")
print(replaced_nan)
map() with na_action='ignore':
0 one
1 two
2 NaN
3 four
dtype: object
replace() handling NaN:
0 1.0
1 2.0
2 missing
3 4.0
dtype: object
Note: Unlike map() and apply(), replace() is a vectorized operation in pandas and is therefore much faster than map() or apply().
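As an additional small sketch (made-up labels), replace() also accepts a dictionary, swapping several values in a single vectorized call:
import pandas as pd

status = pd.Series(['A', 'B', 'C', 'A', 'B'])
print(status.replace({'A': 'Active', 'B': 'Blocked', 'C': 'Closed'}))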
8.4.4 Performance Comparison
# 📊 Performance Comparison
print("\n🗺️ Test 2: Category mapping")
# Method 1: MAP (Optimal for this task)
start_time = time.time()
performance_data['category_name_map'] = performance_data['category'].map(category_mapping)
map_time = time.time() - start_time
# Method 2: APPLY (Overkill for simple mapping)
start_time = time.time()
performance_data['category_name_apply'] = performance_data['category'].apply(lambda x: category_mapping[x])
apply_time = time.time() - start_time
# Method 3: VECTORIZED replacement (Alternative approach)
start_time = time.time()
performance_data['category_name_replace'] = performance_data['category'].replace(category_mapping)
replace_time = time.time() - start_time
print(f"✅ Map(): {map_time:.4f}s (1.0x baseline)")
print(f"🔄 Apply(): {apply_time:.4f}s ({apply_time/map_time:.1f}x slower)")
print(f"🔄 Replace(): {replace_time:.4f}s ({replace_time/map_time:.1f}x slower)")
# Clean up test columns
performance_data.drop(['category_name_apply', 'category_name_replace'], axis=1, inplace=True)
🗺️ Test 2: Category mapping
✅ Map(): 0.0025s (1.0x baseline)
🔄 Apply(): 0.0101s (4.0x slower)
🔄 Replace(): 0.0000s (0.0x slower)
Key Takeaway
apply() and map() are not vectorized — under the hood they behave like Python loops. Pandas iterates through each row or column and calls your function, which can be slow on large datasets.
For value substitutions, use replace() — it's vectorized and far faster than writing an apply() with a custom function.
8.5 Core Vectorized Operations in Pandas
Pandas is built on top of NumPy, which means that vectorized operations are actually compiled C code running under the hood. This is why vectorized operations can be orders of magnitude faster than Python loops!
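A tiny sketch of what "built on top of NumPy" means in practice: every numeric column is backed by a NumPy array, and that array is what the compiled routines operate on.
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
print(type(s.to_numpy()))  # <class 'numpy.ndarray'>: the underlying array
print(s.to_numpy() * 2)    # the same kind of array math that NumPy runs in C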
8.5.1 Why Vectorization Matters
The Speed Difference:
- Python loop: Interpreted, slow, one operation at a time
- Vectorized operation: Compiled C code, operates on entire arrays at once
Performance gain: Typically 10-100x faster than equivalent loops!
8.5.2 Categories of Vectorized Operations
| Category | Operations | Example |
|---|---|---|
| Arithmetic | +, -, *, /, **, //, % | df['total'] = df['price'] * df['qty'] |
| Comparison | >, <, >=, <=, ==, != | df['expensive'] = df['price'] > 100 |
| Aggregation | .sum(), .mean(), .max(), .std() | df['revenue'].sum() |
| Boolean Logic | &, \|, ~ | df[(df['price'] > 50) & (df['qty'] > 10)] |
| String Ops | .str.upper(), .str.contains(), etc. | df['name'].str.upper() |
| DateTime Ops | .dt.year, .dt.month, .dt.dayofweek | df['date'].dt.year |
| Math Functions | np.sqrt(), np.log(), np.exp() | np.log(df['revenue']) |
Let’s see each category with practical examples!
8.5.3 Arithmetic Operations
The most common vectorized operations — perform math on entire columns at once.
print("📊 ARITHMETIC OPERATIONS")
print("=" * 60)
# Sample e-commerce data
orders_df = pd.DataFrame({
'product': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig', 'Whatsit'],
'price': [29.99, 49.99, 19.99, 99.99, 39.99],
'quantity': [2, 1, 5, 1, 3],
'tax_rate': [0.08, 0.08, 0.08, 0.08, 0.08],
'discount_pct': [10, 0, 15, 5, 20]
})
print("Original order data:")
print(orders_df)
# ✅ VECTORIZED: All calculations happen at once on entire columns
orders_df['subtotal'] = orders_df['price'] * orders_df['quantity']
orders_df['discount_amount'] = orders_df['subtotal'] * (orders_df['discount_pct'] / 100)
orders_df['after_discount'] = orders_df['subtotal'] - orders_df['discount_amount']
orders_df['tax'] = orders_df['after_discount'] * orders_df['tax_rate']
orders_df['total'] = orders_df['after_discount'] + orders_df['tax']
print("\n✅ Calculated columns (all vectorized!):")
orders_df[['product', 'subtotal', 'discount_amount', 'total']].round(2)
📊 ARITHMETIC OPERATIONS
============================================================
Original order data:
product price quantity tax_rate discount_pct
0 Widget 29.99 2 0.08 10
1 Gadget 49.99 1 0.08 0
2 Doohickey 19.99 5 0.08 15
3 Thingamajig 99.99 1 0.08 5
4 Whatsit 39.99 3 0.08 20
✅ Calculated columns (all vectorized!):
| | product | subtotal | discount_amount | total |
|---|---|---|---|---|
| 0 | Widget | 59.98 | 6.00 | 58.30 |
| 1 | Gadget | 49.99 | 0.00 | 53.99 |
| 2 | Doohickey | 99.95 | 14.99 | 91.75 |
| 3 | Thingamajig | 99.99 | 5.00 | 102.59 |
| 4 | Whatsit | 119.97 | 23.99 | 103.65 |
8.5.4 Comparison Operations
Create boolean masks for filtering, flagging, and categorizing data.
print("\n🔍 COMPARISON OPERATIONS")
print("=" * 60)
# Student grades data
grades_df = pd.DataFrame({
'student': ['Alice', 'Bob', 'Carol', 'David', 'Eve', 'Frank'],
'math': [92, 78, 88, 65, 95, 72],
'science': [88, 82, 90, 70, 91, 68],
'english': [85, 90, 87, 75, 89, 80]
})
print("Student grades:")
print(grades_df)
# ✅ VECTORIZED: Compare entire columns at once
grades_df['math_pass'] = grades_df['math'] >= 70
grades_df['high_achiever'] = grades_df['math'] >= 90
grades_df['needs_help'] = grades_df['math'] < 70
# Calculate average and compare
grades_df['average'] = (grades_df['math'] + grades_df['science'] + grades_df['english']) / 3
grades_df['above_average'] = grades_df['average'] > grades_df['average'].mean()
print("\n✅ Comparison results:")
print(grades_df[['student', 'math', 'average', 'math_pass', 'high_achiever', 'above_average']])
# Count students in each category
print(f"\n📈 Statistics:")
print(f" • Students passing math: {grades_df['math_pass'].sum()}")
print(f" • High achievers (math ≥ 90): {grades_df['high_achiever'].sum()}")
print(f" • Students above class average: {grades_df['above_average'].sum()}")
🔍 COMPARISON OPERATIONS
============================================================
Student grades:
student math science english
0 Alice 92 88 85
1 Bob 78 82 90
2 Carol 88 90 87
3 David 65 70 75
4 Eve 95 91 89
5 Frank 72 68 80
✅ Comparison results:
student math average math_pass high_achiever above_average
0 Alice 92 88.333333 True True True
1 Bob 78 83.333333 True False True
2 Carol 88 88.333333 True False True
3 David 65 70.000000 False False False
4 Eve 95 91.666667 True True True
5 Frank 72 73.333333 True False False
📈 Statistics:
• Students passing math: 5
• High achievers (math ≥ 90): 2
• Students above class average: 4
8.5.5 Boolean Indexing & Filtering
Combine comparison operations with boolean logic to filter data.
print("\n🎯 BOOLEAN INDEXING & FILTERING")
print("=" * 60)
# Product inventory data
inventory_df = pd.DataFrame({
'product_id': range(1, 11),
'name': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig', 'Whatsit',
'Gizmo', 'Contraption', 'Device', 'Apparatus', 'Mechanism'],
'price': [29.99, 149.99, 19.99, 299.99, 39.99, 89.99, 199.99, 59.99, 179.99, 249.99],
'stock': [45, 12, 8, 3, 67, 23, 15, 5, 28, 9],
'category': ['A', 'B', 'A', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
})
print("Product inventory:")
print(inventory_df)
# ✅ VECTORIZED FILTERING: Multiple conditions combined
# Find expensive, low-stock items
expensive = inventory_df['price'] > 100
low_stock = inventory_df['stock'] < 15
urgent_restock = expensive & low_stock
print("\n🚨 Urgent restock needed (expensive + low stock):")
print(inventory_df[urgent_restock][['name', 'price', 'stock']])
# Find good deals in category A
category_a = inventory_df['category'] == 'A'
affordable = inventory_df['price'] < 50
good_deals = category_a & affordable
print("\n💰 Good deals in Category A (price < $50):")
print(inventory_df[good_deals][['name', 'price', 'category']])
# Complex filter: Either (expensive AND low stock) OR (cheap AND high stock)
cheap = inventory_df['price'] < 50
high_stock = inventory_df['stock'] > 30
complex_filter = (expensive & low_stock) | (cheap & high_stock)
print("\n🔄 Complex filter results:")
print(inventory_df[complex_filter][['name', 'price', 'stock']])
print(f"\n📊 Filter statistics:")
print(f" • Total products: {len(inventory_df)}")
print(f" • Urgent restock: {urgent_restock.sum()}")
print(f" • Good deals: {good_deals.sum()}")
print(f" • Complex filter matches: {complex_filter.sum()}")
🎯 BOOLEAN INDEXING & FILTERING
============================================================
Product inventory:
product_id name price stock category
0 1 Widget 29.99 45 A
1 2 Gadget 149.99 12 B
2 3 Doohickey 19.99 8 A
3 4 Thingamajig 299.99 3 C
4 5 Whatsit 39.99 67 A
5 6 Gizmo 89.99 23 B
6 7 Contraption 199.99 15 C
7 8 Device 59.99 5 A
8 9 Apparatus 179.99 28 B
9 10 Mechanism 249.99 9 C
🚨 Urgent restock needed (expensive + low stock):
name price stock
1 Gadget 149.99 12
3 Thingamajig 299.99 3
9 Mechanism 249.99 9
💰 Good deals in Category A (price < $50):
name price category
0 Widget 29.99 A
2 Doohickey 19.99 A
4 Whatsit 39.99 A
🔄 Complex filter results:
name price stock
0 Widget 29.99 45
1 Gadget 149.99 12
3 Thingamajig 299.99 3
4 Whatsit 39.99 67
9 Mechanism 249.99 9
📊 Filter statistics:
• Total products: 10
• Urgent restock: 3
• Good deals: 3
• Complex filter matches: 5
8.5.6 Aggregation Operations
Reduce columns to summary statistics using vectorized aggregations.
print("\n📈 AGGREGATION OPERATIONS")
print("=" * 60)
# Sales data by region
sales_df = pd.DataFrame({
'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'] * 3,
'month': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
'revenue': [45000, 52000, 48000, 51000, 47000, 54000, 49000, 53000,
46000, 55000, 50000, 52000, 44000, 53000, 47000, 50000,
48000, 56000, 51000, 54000, 45000, 57000, 49000, 55000]
})
print("Sales data (first 8 rows):")
print(sales_df.head(8))
# ✅ VECTORIZED AGGREGATIONS: Instant calculations on entire columns
print("\n📊 Overall statistics (vectorized):")
print(f" • Total revenue: ${sales_df['revenue'].sum():,.0f}")
print(f" • Average revenue: ${sales_df['revenue'].mean():,.0f}")
print(f" • Median revenue: ${sales_df['revenue'].median():,.0f}")
print(f" • Min revenue: ${sales_df['revenue'].min():,.0f}")
print(f" • Max revenue: ${sales_df['revenue'].max():,.0f}")
print(f" • Std deviation: ${sales_df['revenue'].std():,.0f}")
# Group by region and aggregate (still vectorized!)
print("\n🌍 Revenue by region:")
region_stats = sales_df.groupby('region')['revenue'].agg(['sum', 'mean', 'count'])
print(region_stats)
# Multiple aggregations at once
print("\n📉 Comprehensive statistics:")
all_stats = sales_df.groupby('region')['revenue'].describe()
print(all_stats)
📈 AGGREGATION OPERATIONS
============================================================
Sales data (first 8 rows):
region month revenue
0 North 1 45000
1 South 1 52000
2 East 1 48000
3 West 1 51000
4 North 2 47000
5 South 2 54000
6 East 2 49000
7 West 2 53000
📊 Overall statistics (vectorized):
• Total revenue: $1,211,000
• Average revenue: $50,458
• Median revenue: $50,500
• Min revenue: $44,000
• Max revenue: $57,000
• Std deviation: $3,730
🌍 Revenue by region:
sum mean count
region
East 294000 49000.000000 6
North 275000 45833.333333 6
South 327000 54500.000000 6
West 315000 52500.000000 6
📉 Comprehensive statistics:
count mean std min 25% 50% 75% \
region
East 6.0 49000.000000 1414.213562 47000.0 48250.0 49000.0 49750.0
North 6.0 45833.333333 1471.960144 44000.0 45000.0 45500.0 46750.0
South 6.0 54500.000000 1870.828693 52000.0 53250.0 54500.0 55750.0
West 6.0 52500.000000 1870.828693 50000.0 51250.0 52500.0 53750.0
max
region
East 51000.0
North 48000.0
South 57000.0
West 55000.0
8.5.7 String Operations (.str accessor)
Vectorized string manipulation — no loops needed!
print("\n🔤 STRING OPERATIONS (.str accessor)")
print("=" * 60)
# Customer data with messy strings
customers_df = pd.DataFrame({
'customer_id': [1, 2, 3, 4, 5],
'name': [' john DOE ', 'jane SMITH', 'Bob_Jones', 'alice WILLIAMS ', 'CHARLIE brown'],
'email': ['john@EXAMPLE.com', 'jane@test.COM', 'bob@demo.ORG', 'alice@SITE.net', 'charlie@mail.COM'],
'phone': ['123-456-7890', '(234) 567-8901', '345.678.9012', '456-567-8901', '567-678-9012']
})
print("Original customer data (messy):")
print(customers_df)
# ✅ VECTORIZED STRING OPERATIONS: Clean all at once
customers_df['name_clean'] = customers_df['name'].str.strip().str.title()
customers_df['email_clean'] = customers_df['email'].str.lower()
customers_df['first_name'] = customers_df['name_clean'].str.split().str[0]
customers_df['last_name'] = customers_df['name_clean'].str.split().str[-1]
customers_df['phone_clean'] = customers_df['phone'].str.replace(r'[^\d]', '', regex=True)
# Check for patterns
customers_df['has_underscore'] = customers_df['name'].str.contains('_')
customers_df['email_domain'] = customers_df['email'].str.split('@').str[1]
print("\n✨ Cleaned data:")
print(customers_df[['name_clean', 'email_clean', 'first_name', 'last_name', 'email_domain']])
print("\n🔍 String pattern checks:")
print(f" • Customers with underscore in name: {customers_df['has_underscore'].sum()}")
print(f" • Unique email domains: {customers_df['email_domain'].nunique()}")8.5.8 DateTime Operations (.dt accessor)
Extract time components and perform date calculations — all vectorized!
print("\n📅 DATETIME OPERATIONS (.dt accessor)")
print("=" * 60)
# Transaction data with dates
transactions_df = pd.DataFrame({
'transaction_id': range(1, 7),
'date': pd.to_datetime([
'2024-01-15', '2024-02-20', '2024-03-10',
'2024-06-25', '2024-09-05', '2024-12-18'
]),
'amount': [250, 180, 420, 310, 275, 390]
})
print("Transaction data:")
print(transactions_df)
# ✅ VECTORIZED DATETIME OPERATIONS: Extract components
transactions_df['year'] = transactions_df['date'].dt.year
transactions_df['month'] = transactions_df['date'].dt.month
transactions_df['month_name'] = transactions_df['date'].dt.month_name()
transactions_df['quarter'] = transactions_df['date'].dt.quarter
transactions_df['day_of_week'] = transactions_df['date'].dt.day_name()
transactions_df['is_weekend'] = transactions_df['date'].dt.dayofweek >= 5
# Calculate days since first transaction
first_date = transactions_df['date'].min()
transactions_df['days_since_first'] = (transactions_df['date'] - first_date).dt.days
print("\n📊 Extracted date components:")
print(transactions_df[['date', 'month_name', 'quarter', 'day_of_week', 'days_since_first']])
# Aggregate by quarter
print("\n📈 Revenue by quarter:")
quarterly = transactions_df.groupby('quarter')['amount'].sum()
print(quarterly)
print(f"\n🎯 Date insights:")
print(f" • Transactions on weekends: {transactions_df['is_weekend'].sum()}")
print(f" • Date range: {transactions_df['date'].min()} to {transactions_df['date'].max()}")
print(f" • Total days covered: {transactions_df['days_since_first'].max()}")8.5.9 Mathematical Functions (NumPy Integration)
Pandas seamlessly integrates with NumPy functions for advanced math operations.
print("\n🔢 MATHEMATICAL FUNCTIONS (NumPy + Pandas)")
print("=" * 60)
# Scientific measurement data
measurements_df = pd.DataFrame({
'sample_id': range(1, 6),
'concentration': [2.5, 5.0, 10.0, 20.0, 40.0],
'absorbance': [0.15, 0.32, 0.65, 1.28, 2.45],
'temperature': [20, 25, 30, 35, 40]
})
print("Laboratory measurements:")
print(measurements_df)
# ✅ VECTORIZED MATH OPERATIONS: Apply functions to entire columns
measurements_df['log_concentration'] = np.log10(measurements_df['concentration'])
measurements_df['sqrt_absorbance'] = np.sqrt(measurements_df['absorbance'])
measurements_df['exp_temp'] = np.exp(measurements_df['temperature'] / 100)
measurements_df['temp_squared'] = np.power(measurements_df['temperature'], 2)
# Trigonometric functions (useful for periodic data)
measurements_df['sin_temp'] = np.sin(np.radians(measurements_df['temperature']))
# Statistical transformations
measurements_df['concentration_zscore'] = (
(measurements_df['concentration'] - measurements_df['concentration'].mean()) /
measurements_df['concentration'].std()
)
print("\n🧪 Calculated transformations:")
print(measurements_df[['sample_id', 'concentration', 'log_concentration',
'absorbance', 'sqrt_absorbance']].round(3))
print("\n📊 Statistical summary:")
print(f" • Mean concentration: {measurements_df['concentration'].mean():.2f}")
print(f" • Geometric mean (via log): {np.exp(np.log(measurements_df['concentration']).mean()):.2f}")
print(f" • Correlation (conc. vs abs.): {measurements_df['concentration'].corr(measurements_df['absorbance']):.3f}")8.5.10 Performance Comparison: Loops vs Vectorization
Let’s see the dramatic speed difference between loops and vectorized operations with real timing data.
import time
print("\n⚡ PERFORMANCE SHOWDOWN: Loops vs Vectorization")
print("=" * 60)
# Create a large dataset for meaningful timing
large_df = pd.DataFrame({
'price': np.random.uniform(10, 500, 50000),
'quantity': np.random.randint(1, 20, 50000),
'tax_rate': 0.08,
'discount': np.random.uniform(0, 0.3, 50000)
})
print(f"Dataset size: {len(large_df):,} rows")
# ❌ METHOD 1: Python loop with iterrows() - SLOW!
print("\n🐌 Method 1: Python loop with iterrows()")
start_time = time.time()
total_loop = []
for idx, row in large_df.iterrows():
subtotal = row['price'] * row['quantity']
discount_amt = subtotal * row['discount']
after_discount = subtotal - discount_amt
tax = after_discount * row['tax_rate']
total = after_discount + tax
total_loop.append(total)
loop_time = time.time() - start_time
print(f" Time: {loop_time:.4f} seconds")
# ❌ METHOD 2: List comprehension - Still slow!
print("\n🐢 Method 2: List comprehension")
start_time = time.time()
total_list_comp = [
(row['price'] * row['quantity'] * (1 - row['discount']) * (1 + row['tax_rate']))
for idx, row in large_df.iterrows()
]
list_comp_time = time.time() - start_time
print(f" Time: {list_comp_time:.4f} seconds")
# ❌ METHOD 3: apply() - Better but still not optimal
print("\n🔄 Method 3: apply() function")
start_time = time.time()
def calculate_total(row):
subtotal = row['price'] * row['quantity']
after_discount = subtotal * (1 - row['discount'])
return after_discount * (1 + row['tax_rate'])
large_df['total_apply'] = large_df.apply(calculate_total, axis=1)
apply_time = time.time() - start_time
print(f" Time: {apply_time:.4f} seconds")
# ✅ METHOD 4: Vectorized operations - FAST!
print("\n⚡ Method 4: Vectorized operations")
start_time = time.time()
large_df['total_vectorized'] = (
large_df['price'] *
large_df['quantity'] *
(1 - large_df['discount']) *
(1 + large_df['tax_rate'])
)
vectorized_time = time.time() - start_time
print(f" Time: {vectorized_time:.4f} seconds")
# Performance comparison
print("\n" + "=" * 60)
print("🏆 PERFORMANCE RESULTS")
print("=" * 60)
print(f"{'Method':<30} {'Time (s)':<12} {'Speedup':<12} {'Status'}")
print("-" * 60)
# Prevent ZeroDivisionError by ensuring minimum time value
min_time = 0.0001 # Minimum time threshold to avoid division by zero
safe_vectorized_time = max(vectorized_time, min_time)
# Calculate speedup safely
loop_speedup = loop_time / safe_vectorized_time if safe_vectorized_time > 0 else float('inf')
list_comp_speedup = list_comp_time / safe_vectorized_time if safe_vectorized_time > 0 else float('inf')
apply_speedup = apply_time / safe_vectorized_time if safe_vectorized_time > 0 else float('inf')
print(f"{'🐌 iterrows() loop':<30} {loop_time:>10.4f} {loop_speedup:>8.1f}x ❌ AVOID")
print(f"{'🐢 List comprehension':<30} {list_comp_time:>10.4f} {list_comp_speedup:>8.1f}x ❌ AVOID")
print(f"{'🔄 apply() function':<30} {apply_time:>10.4f} {apply_speedup:>8.1f}x ⚠️ USE SPARINGLY")
print(f"{'⚡ Vectorized operations':<30} {vectorized_time:>10.4f} {'1.0x':>8} ✅ BEST")
print("\n💡 Key Insight:")
if vectorized_time > 0:
speedup_factor = loop_time / safe_vectorized_time
time_reduction = ((loop_time - vectorized_time) / loop_time * 100) if loop_time > 0 else 0
print(f" Vectorization is {speedup_factor:.0f}x faster than loops!")
print(f" That's a {time_reduction:.1f}% reduction in execution time.")
else:
print(f" Vectorization is extremely fast (< 0.0001 seconds)!")
print(f" Loop time: {loop_time:.4f}s vs Vectorized: {vectorized_time:.6f}s")
# Verify all methods produce same results
print("\n✅ Verification: All methods produce identical results:",
np.allclose(total_loop, large_df['total_vectorized']))
⚡ PERFORMANCE SHOWDOWN: Loops vs Vectorization
============================================================
Dataset size: 50,000 rows
🐌 Method 1: Python loop with iterrows()
Time: 0.8976 seconds
🐢 Method 2: List comprehension
Time: 0.8167 seconds
🔄 Method 3: apply() function
Time: 0.4004 seconds
⚡ Method 4: Vectorized operations
Time: 0.0000 seconds
============================================================
🏆 PERFORMANCE RESULTS
============================================================
Method Time (s) Speedup Status
------------------------------------------------------------
🐌 iterrows() loop 0.8976 8975.7x ❌ AVOID
🐢 List comprehension 0.8167 8166.9x ❌ AVOID
🔄 apply() function 0.4004 4003.8x ⚠️ USE SPARINGLY
⚡ Vectorized operations 0.0000 1.0x ✅ BEST
💡 Key Insight:
Vectorization is extremely fast (< 0.0001 seconds)!
Loop time: 0.8976s vs Vectorized: 0.000000s
✅ Verification: All methods produce identical results: True
8.5.11 Summary: Vectorization Best Practices
8.5.11.1 Always Try Vectorization First
✅ Use vectorized operations for (a combined sketch follows this list):
- Arithmetic: df['total'] = df['price'] * df['qty']
- Comparisons: df['high_value'] = df['amount'] > 1000
- String operations: df['upper_name'] = df['name'].str.upper()
- Date operations: df['year'] = df['date'].dt.year
- Math functions: df['log_revenue'] = np.log(df['revenue'])
- Aggregations: df['revenue'].sum(), df.groupby('category').mean()
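To tie these patterns together, here is a minimal, self-contained sketch on a tiny made-up DataFrame (the column names price, qty, name, date, revenue, and category are illustrative only, not from this chapter's datasets):
import numpy as np
import pandas as pd
# Hypothetical toy data (column names are illustrative only)
df = pd.DataFrame({
    'price': [10.0, 20.0, 30.0],
    'qty': [2, 5, 1],
    'name': ['alice', 'bob', 'carol'],
    'date': pd.to_datetime(['2024-01-15', '2024-06-01', '2024-12-31']),
    'revenue': [100.0, 250.0, 80.0],
    'category': ['A', 'B', 'A']
})
df['total'] = df['price'] * df['qty']            # arithmetic
df['high_value'] = df['total'] > 50              # comparison
df['upper_name'] = df['name'].str.upper()        # string operation
df['year'] = df['date'].dt.year                  # date operation
df['log_revenue'] = np.log(df['revenue'])        # NumPy math function
print(df)
print(df['revenue'].sum())                       # aggregation
print(df.groupby('category')['revenue'].mean())  # grouped aggregation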
8.5.11.2 Performance Hierarchy
Speed Ranking (fastest to slowest):
1. ⚡ Vectorized operations (NumPy/Pandas built-ins) ← Always try first!
2. 🔄 .map() for simple 1:1 mappings ← Good for dictionaries
3. 🔄 .apply() for complex row-wise logic ← Use when necessary
4. 🐌 Python loops (iterrows, comprehensions) ← AVOID AT ALL COSTS
8.5.11.3 ⚠️ Common Mistakes to Avoid
# ❌ DON'T: Use loops for arithmetic
for i in range(len(df)):
df.loc[i, 'total'] = df.loc[i, 'price'] * df.loc[i, 'qty']
# ✅ DO: Use vectorized operations
df['total'] = df['price'] * df['qty']
# ❌ DON'T: Use apply() for simple operations
df['price_doubled'] = df['price'].apply(lambda x: x * 2)
# ✅ DO: Use direct vectorized operation
df['price_doubled'] = df['price'] * 2
# ❌ DON'T: Use iterrows() ever!
for idx, row in df.iterrows(): # 100x slower!
# ... any operation
# ✅ DO: Find a vectorized alternative
df['result'] = some_vectorized_operation
8.5.11.4 Quick Decision Guide
How should I perform this operation?
1. **Can I use a basic operator? (+, -, *, /, >, <, etc.)**
   → ✅ Use it directly: df['col1'] + df['col2']
2. **Is it a string or date operation?**
   → ✅ Use the .str or .dt accessor: df['text'].str.upper()
3. **Is it a NumPy function?**
   → ✅ Apply it to the column: np.sqrt(df['values'])
4. **Is it a simple dictionary mapping?**
   → ✅ Use .map(): df['code'].map(mapping_dict)
5. **Is it a built-in aggregation?**
   → ✅ Use the aggregation method: df['sales'].sum()
6. **Does it require complex multi-column logic?**
   → 🔄 Use .apply() (only when necessary)
7. **None of the above work?**
   → 🤔 Re-think your approach! There's almost always a vectorized way. (A short sketch illustrating steps 4 and 6 follows this guide.)
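As a quick illustration of steps 4 and 6, here is a short sketch with made-up columns (code, amount, region): a simple dictionary lookup goes through .map(), while a rule that needs several columns at once falls back to .apply():
import pandas as pd
# Hypothetical data (column names and keys are illustrative only)
orders = pd.DataFrame({
    'code': ['US', 'DE', 'US', 'JP'],
    'amount': [120, 80, 300, 50],
    'region': ['Americas', 'EMEA', 'Americas', 'APAC']
})
# Step 4: simple 1:1 mapping -> .map() with a dictionary
currency_map = {'US': 'USD', 'DE': 'EUR', 'JP': 'JPY'}
orders['currency'] = orders['code'].map(currency_map)
# Step 6: multi-column business rule -> .apply() only when necessary
orders['priority'] = orders.apply(
    lambda row: 'High' if row['amount'] > 100 and row['region'] == 'Americas' else 'Normal',
    axis=1
)
print(orders)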
8.5.11.5 💡 Pro Tips
- Profile your code: Use %%timeit in Jupyter to measure performance (see the example below this list)
- Start vectorized: Don't write loops first because it's easier
- Think in columns: Pandas works best with column-wise operations
- Combine operations: Chain multiple vectorized operations together
- Use NumPy: Many NumPy functions work directly on pandas Series
- Avoid loops: Seriously, avoid them. There's almost always a better way.
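For the profiling tip, a minimal sketch could look like the cell below. It uses the line-magic form %timeit (a close relative of %%timeit) so that both measurements fit in one Jupyter cell, and it assumes a DataFrame df with hypothetical price and qty columns already exists in the session:
# Run inside a Jupyter cell: %timeit is an IPython magic, not plain Python
%timeit df['price'] * df['qty']                                  # vectorized
%timeit df.apply(lambda row: row['price'] * row['qty'], axis=1)  # row-wise apply (much slower)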
8.6 Summary: Performance-First Mindset
8.6.1 Key Takeaways
- Vectorization is King: Always your first choice - leverages NumPy’s C-speed processing
- map() for Mappings: Perfect for dictionaries and simple element-wise transformations
- apply() for Complex Logic: Use when business rules require multiple columns or complex conditions
- Avoid Python Loops: iterrows(), list comprehensions are performance killers
8.6.2 Looking Ahead: Why NumPy Matters
The next chapter will dive deep into NumPy, the foundation that makes pandas so powerful. You’ll learn:
- Array Architecture: How NumPy’s C-based arrays enable lightning-fast operations
- Broadcasting Rules: Advanced techniques for working with different array shapes
- Universal Functions (ufuncs): The secret weapons behind vectorized operations
- Custom Vectorization: How to write your own fast, NumPy-powered functions
8.6.3 The Performance Pyramid
🚀 NumPy Vectorized (Next Chapter!)
↑ 10-100x faster
⚡ Pandas Vectorized
↑ 2-5x faster
🔄 map() / apply()
↑ 50-200x faster
🐌 Python Loops (iterrows, comprehensions)
Understanding NumPy will unlock the full potential of pandas and give you the tools to write production-ready, high-performance data analysis code!
8.7 Lambda Functions: Inline Power for Data Transformation
In the previous sections, we saw how to define a function and then use map() or apply() to apply it.
With lambda functions, you can skip the definition step and write the function inline.
Lambda functions are anonymous, one-line functions that provide a concise way to express simple operations without creating a separate def block.
They’re the Swiss Army knife of pandas transformations — compact, versatile, and perfect for quick, one-off operations.
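For example, the two versions below produce identical results: the named function is defined once and passed to map(), while the lambda is written inline (the prices Series is a made-up example):
import pandas as pd
prices = pd.Series([100, 250, 80])   # hypothetical data
# Version 1: define a named function, then pass it to map()
def add_tax(x):
    return x * 1.08
with_tax_named = prices.map(add_tax)
# Version 2: skip the definition and write the same logic inline as a lambda
with_tax_lambda = prices.map(lambda x: x * 1.08)
print(with_tax_named.equals(with_tax_lambda))   # True: same result either way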
8.7.1 Syntax & Structure
lambda arguments: expression
8.7.2 Key Characteristics
| Feature | Description | Example |
|---|---|---|
| Anonymous | No function name required | lambda x: x * 2 |
| Single Expression | Only one expression allowed | lambda x, y: x + y |
| Inline Definition | Defined at point of use | df.apply(lambda row: row.sum(), axis=1) |
| Return Value | Automatically returns expression result | lambda x: x ** 2 returns squared value |
| Multiple Arguments | Can accept multiple parameters | lambda x, y, z: x * y + z |
8.7.3 Real-World Lambda Applications
In this section, we’ll explore practical use cases of lambda functions in pandas and demonstrate common transformation patterns.
# 🎯 Create a comprehensive e-commerce dataset for lambda demonstrations
from datetime import datetime, timedelta
np.random.seed(42)
# Generate realistic e-commerce transaction data
n_transactions = 1000
ecommerce_data = {
'transaction_id': [f'TXN_{i:05d}' for i in range(1, n_transactions + 1)],
'customer_id': np.random.randint(1, 200, n_transactions),
'product_name': np.random.choice([
'Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Camera',
'Watch', 'Speaker', 'Monitor', 'Keyboard', 'Mouse'
], n_transactions),
'price': np.random.uniform(50, 2000, n_transactions).round(2),
'quantity': np.random.randint(1, 5, n_transactions),
'discount_pct': np.random.choice([0, 5, 10, 15, 20, 25], n_transactions),
'shipping_cost': np.random.uniform(5, 50, n_transactions).round(2),
'customer_rating': np.random.choice([1, 2, 3, 4, 5], n_transactions),
'purchase_date': pd.date_range('2024-01-01', periods=n_transactions, freq='h'),
'customer_type': np.random.choice(['Premium', 'Regular', 'New'], n_transactions),
'payment_method': np.random.choice(['Credit Card', 'PayPal', 'Bank Transfer'], n_transactions)
}
df_ecommerce = pd.DataFrame(ecommerce_data)
print("🛒 E-commerce Dataset Sample:")
print(df_ecommerce.head())
print(f"\nDataset shape: {df_ecommerce.shape}")🛒 E-commerce Dataset Sample:
transaction_id customer_id product_name price quantity discount_pct \
0 TXN_00001 103 Camera 1892.00 1 15
1 TXN_00002 180 Watch 1382.33 1 0
2 TXN_00003 93 Headphones 1019.49 2 0
3 TXN_00004 15 Mouse 1254.80 3 5
4 TXN_00005 107 Smartphone 1744.36 4 10
shipping_cost customer_rating purchase_date customer_type \
0 40.67 3 2024-01-01 00:00:00 New
1 33.23 2 2024-01-01 01:00:00 New
2 38.76 3 2024-01-01 02:00:00 New
3 11.84 2 2024-01-01 03:00:00 New
4 25.62 4 2024-01-01 04:00:00 New
payment_method
0 Bank Transfer
1 PayPal
2 PayPal
3 Bank Transfer
4 Credit Card
Dataset shape: (1000, 11)
8.7.3.1 Simple Mathematical Operations
# 📊 Example 1: Simple Mathematical Operations with Lambda
print("=== SIMPLE MATHEMATICAL OPERATIONS ===")
# Calculate discounted price
df_ecommerce['discounted_price'] = df_ecommerce.apply(
lambda row: row['price'] * (1 - row['discount_pct'] / 100), axis=1
)
# Calculate total cost (price + shipping)
df_ecommerce['total_cost'] = df_ecommerce.apply(
lambda row: row['discounted_price'] * row['quantity'] + row['shipping_cost'], axis=1
)
# Calculate savings amount
df_ecommerce['savings'] = df_ecommerce.apply(
lambda row: (row['price'] - row['discounted_price']) * row['quantity'], axis=1
)
print("💰 Price Calculations (First 10 transactions):")
price_cols = ['transaction_id', 'price', 'discount_pct', 'discounted_price', 'quantity', 'total_cost', 'savings']
print(df_ecommerce[price_cols].head(10).to_string(index=False))
=== SIMPLE MATHEMATICAL OPERATIONS ===
💰 Price Calculations (First 10 transactions):
transaction_id price discount_pct discounted_price quantity total_cost savings
TXN_00001 1892.00 15 1608.2000 1 1648.8700 283.8000
TXN_00002 1382.33 0 1382.3300 1 1415.5600 0.0000
TXN_00003 1019.49 0 1019.4900 2 2077.7400 0.0000
TXN_00004 1254.80 5 1192.0600 3 3588.0200 188.2200
TXN_00005 1744.36 10 1569.9240 4 6305.3160 697.7440
TXN_00006 1162.69 25 872.0175 1 892.8175 290.6725
TXN_00007 109.25 0 109.2500 4 446.2200 0.0000
TXN_00008 1865.35 5 1772.0825 3 5343.1275 279.8025
TXN_00009 1394.58 20 1115.6640 3 3393.4220 836.7480
TXN_00010 1369.20 25 1026.9000 4 4114.4000 1369.2000
8.7.3.2 Conditional Logic
df['category'] = df['score'].apply(lambda x: 'High' if x > 80 else 'Low')
# 📊 Example 2: Conditional Logic with Lambda
print("\n=== CONDITIONAL LOGIC OPERATIONS ===")
# Categorize customers based on spending
df_ecommerce['customer_segment'] = df_ecommerce['total_cost'].apply(
lambda x: 'High Value' if x > 500 else 'Medium Value' if x > 200 else 'Low Value'
)
# Classify ratings
df_ecommerce['rating_category'] = df_ecommerce['customer_rating'].apply(
lambda rating: 'Excellent' if rating == 5
else 'Good' if rating >= 4
else 'Average' if rating == 3
else 'Poor'
)
# Determine shipping urgency
df_ecommerce['shipping_urgency'] = df_ecommerce.apply(
lambda row: 'Express' if row['customer_type'] == 'Premium' and row['total_cost'] > 300
else 'Standard' if row['customer_type'] == 'Premium'
else 'Economy', axis=1
)
print("🎯 Customer Segmentation (Sample):")
segment_cols = ['customer_id', 'total_cost', 'customer_segment', 'customer_rating', 'rating_category', 'shipping_urgency']
print(df_ecommerce[segment_cols].head(10).to_string(index=False))
=== CONDITIONAL LOGIC OPERATIONS ===
🎯 Customer Segmentation (Sample):
customer_id total_cost customer_segment customer_rating rating_category shipping_urgency
103 1648.8700 High Value 3 Average Economy
180 1415.5600 High Value 2 Poor Economy
93 2077.7400 High Value 3 Average Economy
15 3588.0200 High Value 2 Poor Economy
107 6305.3160 High Value 4 Good Economy
72 892.8175 High Value 3 Average Economy
189 446.2200 Medium Value 3 Average Economy
21 5343.1275 High Value 5 Excellent Economy
103 3393.4220 High Value 3 Average Economy
122 4114.4000 High Value 2 Poor Economy
8.7.3.3 String Operations
df['clean_name'] = df['name'].apply(lambda x: x.strip().title())
# 🔤 Example 3: String Operations with Lambda
print("\n=== STRING OPERATIONS ===")
# Clean and standardize product names
df_ecommerce['product_name_clean'] = df_ecommerce['product_name'].apply(
lambda name: name.strip().upper().replace(' ', '_')
)
# Create transaction codes
df_ecommerce['transaction_code'] = df_ecommerce.apply(
lambda row: f"{row['customer_type'][:3].upper()}-{row['product_name_clean'][:4]}-{row['transaction_id'][-3:]}", axis=1
)
# Format purchase date for display
df_ecommerce['purchase_display'] = df_ecommerce['purchase_date'].apply(
lambda date: date.strftime('%Y-%m-%d %H:%M')
)
print("🔤 String Processing Results (Sample):")
string_cols = ['product_name', 'product_name_clean', 'transaction_code', 'purchase_display']
print(df_ecommerce[string_cols].head(8).to_string(index=False))
=== STRING OPERATIONS ===
🔤 String Processing Results (Sample):
product_name product_name_clean transaction_code purchase_display
Camera CAMERA NEW-CAME-001 2024-01-01 00:00
Watch WATCH NEW-WATC-002 2024-01-01 01:00
Headphones HEADPHONES NEW-HEAD-003 2024-01-01 02:00
Mouse MOUSE NEW-MOUS-004 2024-01-01 03:00
Smartphone SMARTPHONE NEW-SMAR-005 2024-01-01 04:00
Monitor MONITOR NEW-MONI-006 2024-01-01 05:00
Watch WATCH REG-WATC-007 2024-01-01 06:00
Camera CAMERA REG-CAME-008 2024-01-01 07:00
8.7.3.4 Multiple Column Operations
df['total'] = df.apply(lambda row: row['price'] * row['quantity'], axis=1)
# 🚀 Example 4: Performance Optimization & Advanced Lambda Patterns
print("\n=== ADVANCED LAMBDA PATTERNS ===")
# Complex feature engineering
df_ecommerce['value_score'] = df_ecommerce.apply(
lambda row: (
(row['total_cost'] / 100) * 0.4 + # 40% weight on spending
(row['customer_rating'] / 5) * 0.3 + # 30% weight on satisfaction
(1 if row['customer_type'] == 'Premium' else 0.5 if row['customer_type'] == 'Regular' else 0) * 0.2 + # 20% weight on loyalty
(min(row['quantity'], 3) / 3) * 0.1 # 10% weight on quantity (capped at 3)
), axis=1
)
# Risk assessment using multiple factors
df_ecommerce['risk_flag'] = df_ecommerce.apply(
lambda row: 'High Risk' if (
row['total_cost'] > 1000 and
row['payment_method'] == 'Bank Transfer' and
row['customer_rating'] <= 2
) else 'Medium Risk' if (
row['total_cost'] > 500 or
row['customer_rating'] <= 2
) else 'Low Risk', axis=1
)
# Calculate percentile ranking
total_cost_90th = df_ecommerce['total_cost'].quantile(0.9)
df_ecommerce['is_top_spender'] = df_ecommerce['total_cost'].apply(
lambda cost: cost >= total_cost_90th
)
print("🎯 Advanced Analytics Results:")
advanced_cols = ['customer_id', 'total_cost', 'value_score', 'risk_flag', 'is_top_spender']
print(df_ecommerce[advanced_cols].head(10).round(2).to_string(index=False))
print(f"\n📊 Summary Statistics:")
print(f"• Average value score: {df_ecommerce['value_score'].mean():.2f}")
print(f"• High risk transactions: {(df_ecommerce['risk_flag'] == 'High Risk').sum()}")
print(f"• Top spenders (90th percentile): {df_ecommerce['is_top_spender'].sum()}")
print(f"• Top spender threshold: ${total_cost_90th:.2f}")
=== ADVANCED LAMBDA PATTERNS ===
🎯 Advanced Analytics Results:
customer_id total_cost value_score risk_flag is_top_spender
103 1648.87 6.81 Medium Risk False
180 1415.56 5.82 Medium Risk False
93 2077.74 8.56 Medium Risk False
15 3588.02 14.57 High Risk False
107 6305.32 25.56 Medium Risk True
72 892.82 3.78 Medium Risk False
189 446.22 2.16 Low Risk False
21 5343.13 21.87 Medium Risk True
103 3393.42 13.95 Medium Risk False
122 4114.40 16.78 Medium Risk False
📊 Summary Statistics:
• Average value score: 9.26
• High risk transactions: 117
• Top spenders (90th percentile): 100
• Top spender threshold: $4875.15
8.7.4 Performance Benchmarking: Lambda vs Named Functions vs Vectorized
# 🏁 Performance Benchmarking: Lambda vs Named Functions vs Vectorized
import pandas as pd
import numpy as np
import time
# Create LARGER test data to get measurable times
test_data = pd.Series(np.random.randint(1, 100, 1000000)) # 1M elements instead of 100K
print("⚡ Performance Comparison (1,000,000 operations):")
print("-" * 50)
# Method 1: Lambda function
start_time = time.time()
result_lambda = test_data.apply(lambda x: x ** 2 + 10)
lambda_time = time.time() - start_time
# Method 2: Named function
def square_plus_ten(x):
return x ** 2 + 10
start_time = time.time()
result_named = test_data.apply(square_plus_ten)
named_time = time.time() - start_time
# Method 3: Vectorized operations (fastest)
start_time = time.time()
result_vectorized = test_data ** 2 + 10
vectorized_time = time.time() - start_time
# Method 4: List comprehension
start_time = time.time()
result_listcomp = pd.Series([x ** 2 + 10 for x in test_data])
listcomp_time = time.time() - start_time
# Safe comparison - avoid division by zero when vectorized_time is effectively 0
safe_vec_time = max(vectorized_time, 1e-9)
print(f"Vectorized operations: {vectorized_time:.6f}s (baseline)")
print(f"Lambda function: {lambda_time:.6f}s ({lambda_time/safe_vec_time:.1f}x slower)")
print(f"Named function: {named_time:.6f}s ({named_time/safe_vec_time:.1f}x slower)")
print(f"List comprehension: {listcomp_time:.6f}s ({listcomp_time/safe_vec_time:.1f}x slower)")
# Verify results are identical
print(f"\n✅ Results identical: {result_lambda.equals(result_named) and result_named.equals(result_vectorized)}")
print("\n💡 Performance Insights:")
print("• Vectorized operations are fastest for simple math")
print("• Named functions have slight overhead advantage over lambdas")
print("• Lambda functions are good compromise between speed and readability")
print("• List comprehensions are generally slower than pandas methods")⚡ Performance Comparison (1,000,000 operations):
--------------------------------------------------
Vectorized operations: 0.002011s (baseline)
Lambda function: 0.263592s (131.1x slower)
Named function: 0.280185s (139.3x slower)
List comprehension: 0.251086s (124.9x slower)
✅ Results identical: False
💡 Performance Insights:
• Vectorized operations are fastest for simple math
• Named functions and lambdas have essentially the same overhead
• Lambda functions are a good compromise between speed and readability
• List comprehensions are generally slower than pandas methods
8.7.4.1 Lambda Best Practices & Advanced Techniques
| Scenario | ✅ Good Lambda Usage | ❌ Poor Lambda Usage |
|---|---|---|
| Simple Operations | df['x2'] = df['x'].apply(lambda x: x * 2) | df['x2'] = df['x'].apply(lambda x: complex_algorithm(x)) |
| Conditional Logic | df['cat'] = df['score'].apply(lambda x: 'High' if x > 80 else 'Low') | lambda x: 'A' if x > 90 else 'B' if x > 80 else 'C' if x > 70 else 'D' if x > 60 else 'F' |
| Multiple Columns | df['total'] = df.apply(lambda row: row['price'] * row['qty'], axis=1) | Complex calculations with many variables |
| String Processing | df['clean'] = df['name'].apply(lambda x: x.strip().title()) | Complex regex operations |
8.7.4.2 Lambda Decision Framework
# Use Lambda when:
simple_operation = lambda x: x * 2 # ✅ Simple math
conditional = lambda x: 'A' if x > 5 else 'B' # ✅ Basic conditions
string_ops = lambda s: s.upper() # ✅ Simple string ops
# Avoid Lambda when:
# ❌ Complex logic (use named functions)
# ❌ Multiple steps (break into separate operations)
# ❌ Reusable logic (define proper functions)
# ❌ Available vectorized operations
8.8 Advanced NLP Preprocessing
For text data stored in object (string) columns, pandas works seamlessly with Python’s built-in string functions.
Beyond that, you can also use the re (regular expressions) module and the NLTK (Natural Language Toolkit) library for more advanced text preprocessing.
These tools are not commonly needed for this sequence course, but they are powerful for tasks such as text cleaning, tokenization, and feature extraction — especially if you choose an NLP-focused topic for your final project.
Below, we’ll demonstrate how to use the re module and NLTK with pandas columns.
8.8.1 The re Module for Regular Expressions
The re module in Python is used for regular expressions, which are powerful tools for text analysis and manipulation. Regular expressions allow you to search, match, and manipulate strings based on specific patterns.
Common Use Cases of re in String Text Analysis
- Finding patterns in text: re.search(r'\d{4}-\d{2}-\d{2}', text) searches for a date in the format YYYY-MM-DD
- Extracting specific parts of a string: re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text) extracts email addresses from the text
- Replacing parts of a string: re.sub(r'\$\d+', '[price]', text) replaces any price written as $ followed by digits with [price]
8.8.1.0.1 Commonly Used Patterns in re
- \d: Matches any digit (0-9)
- \w: Matches any alphanumeric character (letters and numbers)
- \s: Matches any whitespace character (spaces, tabs, newlines)
- [a-z]: Matches any lowercase letter from a to z
- [A-Z]: Matches any uppercase letter from A to Z
- *: Matches 0 or more occurrences of the preceding character
- +: Matches 1 or more occurrences of the preceding character
- ?: Matches 0 or 1 occurrence of the preceding character
- ^: Matches the beginning of a string
- $: Matches the end of a string
- |: Acts as an OR operator
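Before moving on to the web-scraping example below, here is a short sketch (the sample text is made up) showing re.search, re.findall, and re.sub, plus the same idea applied to a pandas column via .str.replace:
import re
import pandas as pd
text = "Order placed on 2024-03-15 by jane.doe@example.com for $250 and $19."
# Search for a date in YYYY-MM-DD format
match = re.search(r'\d{4}-\d{2}-\d{2}', text)
print(match.group())                              # 2024-03-15
# Extract email addresses
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print(emails)                                     # ['jane.doe@example.com']
# Replace any dollar amount with a placeholder
print(re.sub(r'\$\d+', '[price]', text))
# The same pattern on a pandas column (regex=True uses re under the hood)
notes = pd.Series(['Paid $40 cash', 'Refund of $15 issued'])
print(notes.str.replace(r'\$\d+', '[price]', regex=True))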
import requests
def read_html_tables_from_url(url, match=None, user_agent=None, timeout=10, **kwargs):
"""
Read HTML tables from a website URL with error handling and customizable options.
Parameters:
-----------
url : str
The website URL to scrape tables from
match : str, optional
String or regex pattern to match tables (passed to pd.read_html)
user_agent : str, optional
Custom User-Agent header. If None, uses a default Chrome User-Agent
timeout : int, default 10
Request timeout in seconds
**kwargs : dict
Additional arguments passed to pd.read_html()
Returns:
--------
list of DataFrames or None
List of pandas DataFrames (one per table found), or None if error occurs
Examples:
---------
# Basic usage
tables = read_html_tables_from_url("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita")
# Filter tables containing specific text
tables = read_html_tables_from_url(url, match="Country")
# With custom user agent
tables = read_html_tables_from_url(url, user_agent="MyBot/1.0")
"""
import pandas as pd
import requests
# Default User-Agent if none provided
if user_agent is None:
user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
headers = {'User-Agent': user_agent}
try:
print(f"🌐 Fetching data from: {url}")
# Make the HTTP request
response = requests.get(url, headers=headers, timeout=timeout)
response.raise_for_status() # Raise an exception for bad status codes (4xx or 5xx)
print(f"✅ Successfully retrieved webpage (Status: {response.status_code})")
# Parse HTML tables
read_html_kwargs = {'match': match} if match else {}
read_html_kwargs.update(kwargs) # Add any additional arguments
tables = pd.read_html(response.content, **read_html_kwargs)
print(f"📊 Found {len(tables)} table(s) on the page")
# Display table information
if tables:
print(f"\n📋 Table Overview:")
for i, table in enumerate(tables):
print(f" Table {i}: {table.shape} (rows × columns)")
if i >= 4: # Limit output for readability
print(f" ... and {len(tables) - 5} more tables")
break
return tables
except requests.exceptions.HTTPError as err:
print(f"❌ HTTP Error: {err}")
return None
except requests.exceptions.ConnectionError as err:
print(f"❌ Connection Error: {err}")
return None
except requests.exceptions.Timeout as err:
print(f"❌ Timeout Error: Request took longer than {timeout} seconds")
return None
except requests.exceptions.RequestException as err:
print(f"❌ Request Error: {err}")
return None
except ValueError as err:
print(f"❌ No tables found on the page or parsing error: {err}")
return None
except Exception as e:
print(f"❌ An unexpected error occurred: {e}")
return None
# Example 1: Basic usage - Get all tables from the Chicago Bulls seasons Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_Chicago_Bulls_seasons'
tables = read_html_tables_from_url(url)
ChicagoBulls = tables[2] if tables else None
ChicagoBulls.head()
🌐 Fetching data from: https://en.wikipedia.org/wiki/List_of_Chicago_Bulls_seasons
✅ Successfully retrieved webpage (Status: 200)
📊 Found 18 table(s) on the page
📋 Table Overview:
Table 0: (13, 2) (rows × columns)
Table 1: (1, 5) (rows × columns)
Table 2: (59, 13) (rows × columns)
Table 3: (3, 4) (rows × columns)
Table 4: (9, 2) (rows × columns)
... and 13 more tables
| | Season | Team | Conference | Finish | Division | Finish.1 | Wins | Losses | Win% | GB | Playoffs | Awards | Head coach |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1966–67 | 1966–67 | — | — | Western | 4th | 33 | 48 | 0.407 | 11 | Lost Division semifinals (Hawks) 3–0[19] | Johnny Kerr (COY)[6] | Johnny Kerr |
| 1 | 1967–68 | 1967–68 | — | — | Western | 4th | 29 | 53 | 0.354 | 27 | Lost Division semifinals (Lakers) 4–1[20] | — | Johnny Kerr |
| 2 | 1968–69 | 1968–69 | — | — | Western | 5th | 33 | 49 | 0.402 | 22 | NaN | — | Dick Motta |
| 3 | 1969–70 | 1969–70 | — | — | Western | 3rd[c] | 39 | 43 | 0.476 | 9 | Lost Division semifinals (Hawks) 4–1[22] | — | Dick Motta |
| 4 | 1970–71 | 1970–71 | Western | 3rd | Midwest[d] | 2nd | 51 | 31 | 0.622 | 2 | Lost conference semifinals (Lakers) 4–3[23] | Dick Motta (COY)[6] | Dick Motta |
# Remove all characters between square brackets (including the brackets themselves) in the Division, Finish, and Finish.1 columns
import re
def remove_brackets(x):
return re.sub(r'\[.*?\]', '', x)
# Apply the function to each column separately using map
ChicagoBulls['Division'] = ChicagoBulls['Division'].map(remove_brackets)
ChicagoBulls['Finish'] = ChicagoBulls['Finish'].map(remove_brackets)
ChicagoBulls['Finish.1'] = ChicagoBulls['Finish.1'].map(remove_brackets)
ChicagoBulls.head()
| | Season | Team | Conference | Finish | Division | Finish.1 | Wins | Losses | Win% | GB | Playoffs | Awards | Head coach |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1966–67 | 1966–67 | — | — | Western | 4th | 33 | 48 | 0.407 | 11 | Lost Division semifinals (Hawks) 3–0[19] | Johnny Kerr (COY)[6] | Johnny Kerr |
| 1 | 1967–68 | 1967–68 | — | — | Western | 4th | 29 | 53 | 0.354 | 27 | Lost Division semifinals (Lakers) 4–1[20] | — | Johnny Kerr |
| 2 | 1968–69 | 1968–69 | — | — | Western | 5th | 33 | 49 | 0.402 | 22 | NaN | — | Dick Motta |
| 3 | 1969–70 | 1969–70 | — | — | Western | 3rd | 39 | 43 | 0.476 | 9 | Lost Division semifinals (Hawks) 4–1[22] | — | Dick Motta |
| 4 | 1970–71 | 1970–71 | Western | 3rd | Midwest | 2nd | 51 | 31 | 0.622 | 2 | Lost conference semifinals (Lakers) 4–3[23] | Dick Motta (COY)[6] | Dick Motta |
8.8.2 The NLTK Library for Advanced NLP
8.8.2.1 What is NLTK?
NLTK (Natural Language Toolkit) is a comprehensive Python library for natural language processing (NLP) and text analysis. It offers tools for tokenization, preprocessing, part-of-speech tagging, parsing, named-entity recognition, sentiment analysis, and more—useful from classroom demos to research prototypes.
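This section only demonstrates tokenization, stop-word removal, stemming/lemmatization, and sentiment analysis, so as a small taste of the tagging capabilities mentioned above, here is a hedged sketch of part-of-speech tagging. It assumes the punkt and averaged_perceptron_tagger resources have been downloaded (newer NLTK releases may name them punkt_tab and averaged_perceptron_tagger_eng):
import nltk
from nltk import pos_tag, word_tokenize
# Assumes the required resources are already downloaded, e.g.:
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')
tokens = word_tokenize("Pandas makes data cleaning much easier.")
print(pos_tag(tokens))
# Example output (tags can vary slightly by NLTK version):
# [('Pandas', 'NNP'), ('makes', 'VBZ'), ...]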
8.8.2.2 Why NLTK Matters in Data Science
| Application Domain | NLTK Use Cases | Business Impact |
|---|---|---|
| Social Media Analytics | Sentiment, subjectivity, emotion, polarity detection | Customer insights, brand monitoring |
| Customer Support | Text classification, intent detection | Automation, faster resolution |
| Content Analysis | Keyword extraction, topic modeling, summarization | SEO optimization, content strategy |
| Research & Academia | Corpus analysis, linguistic parsing, concordance tools | Rigorous analysis, reproducible studies |
| E-commerce | Review mining, aspect sentiment, recommendation signals | Better recommendations, higher conversion |
8.8.2.3 💻 NLTK Installation and Setup
Let’s start by installing NLTK and downloading the essential datasets:
pip install nltk
Collecting nltk
Downloading nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting click (from nltk)
Downloading click-8.3.0-py3-none-any.whl.metadata (2.6 kB)
Collecting joblib (from nltk)
Using cached joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting regex>=2021.8.3 (from nltk)
Using cached regex-2025.9.18-cp312-cp312-win_amd64.whl.metadata (41 kB)
Collecting tqdm (from nltk)
Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Requirement already satisfied: colorama in c:\users\lsi8012\onedrive - northwestern university\fa25\stat303-1\workspace\.venv\lib\site-packages (from click->nltk) (0.4.6)
Downloading nltk-3.9.2-py3-none-any.whl (1.5 MB)
Using cached regex-2025.9.18-cp312-cp312-win_amd64.whl (275 kB)
Downloading click-8.3.0-py3-none-any.whl (107 kB)
Using cached joblib-1.5.2-py3-none-any.whl (308 kB)
Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, joblib, click, nltk
Successfully installed click-8.3.0 joblib-1.5.2 nltk-3.9.2 regex-2025.9.18 tqdm-4.67.1
Note: you may need to restart the kernel to use updated packages.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /simple/nltk/
# 🚀 NLTK Installation and Essential Downloads
# Install NLTK (uncomment if not already installed)
# !pip install nltk
try:
import nltk
print("✅ NLTK is already installed")
# Download essential datasets (this might take a few minutes first time)
print("📦 Downloading essential NLTK data...")
# Download essential components
essential_downloads = [
'punkt', # Sentence tokenizer
'stopwords', # Stop words for multiple languages
'vader_lexicon', # Sentiment analysis
'averaged_perceptron_tagger', # POS tagger
'wordnet', # WordNet lexical database
'omw-1.4', # Open Multilingual Wordnet
'movie_reviews', # Sample movie reviews for sentiment
]
for item in essential_downloads:
try:
nltk.download(item, quiet=True)
print(f" ✓ Downloaded: {item}")
except:
print(f" ⚠️ Could not download: {item}")
print("\n🎯 NLTK Setup Complete!")
print("Ready for natural language processing!")
except ImportError:
print("❌ NLTK not found. Please install with: pip install nltk")
except Exception as e:
print(f"⚠️ Setup issue: {e}")
print("You may need internet connection for first-time downloads")✅ NLTK is already installed
📦 Downloading essential NLTK data...
✓ Downloaded: punkt
✓ Downloaded: stopwords
✓ Downloaded: vader_lexicon
✓ Downloaded: averaged_perceptron_tagger
✓ Downloaded: wordnet
✓ Downloaded: omw-1.4
✓ Downloaded: movie_reviews
🎯 NLTK Setup Complete!
Ready for natural language processing!
[nltk_data] Error downloading 'vader_lexicon' from
[nltk_data] <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data] pages/packages/sentiment/vader_lexicon.zip>:
[nltk_data] <urlopen error [WinError 10054] An existing connection
[nltk_data] was forcibly closed by the remote host>
[nltk_data] Error downloading 'averaged_perceptron_tagger' from
[nltk_data] <https://raw.githubusercontent.com/nltk/nltk_data/gh-p
[nltk_data] ages/packages/taggers/averaged_perceptron_tagger.zip>:
[nltk_data] <urlopen error [WinError 10054] An existing connection
[nltk_data] was forcibly closed by the remote host>
[nltk_data] Error downloading 'omw-1.4' from
[nltk_data] <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data] pages/packages/corpora/omw-1.4.zip>: <urlopen error
[nltk_data] [WinError 10054] An existing connection was forcibly
[nltk_data] closed by the remote host>
[nltk_data] Error downloading 'movie_reviews' from
[nltk_data] <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data] pages/packages/corpora/movie_reviews.zip>: <urlopen
[nltk_data] error [WinError 10054] An existing connection was
[nltk_data] forcibly closed by the remote host>
8.8.2.4 Tokenization: Breaking Text into Pieces
Tokenization is the foundation of NLP - splitting text into meaningful units (words, sentences, or phrases).
# 🔤 Tokenization Examples: Word and Sentence Level
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer
import pandas as pd
# Sample text for demonstration
sample_text = """
Natural Language Processing (NLP) is amazing! It helps computers understand human language.
Companies like Google, Amazon, and Microsoft use NLP for various applications.
Did you know that NLTK has been around since 2001? It's a powerful toolkit.
Let's explore tokenization: word-level, sentence-level, and custom patterns.
"""
print("📝 Original Text:")
print(sample_text.strip())
print("\n" + "="*60)
print("🔤 TOKENIZATION EXAMPLES")
print("="*60)
# 1. SENTENCE TOKENIZATION
sentences = sent_tokenize(sample_text)
print(f"\n📄 Sentence Tokenization ({len(sentences)} sentences):")
for i, sentence in enumerate(sentences, 1):
print(f" {i}. {sentence.strip()}")
# 2. WORD TOKENIZATION
words = word_tokenize(sample_text)
print(f"\n🔤 Word Tokenization ({len(words)} tokens):")
print("First 15 tokens:", words[:15])
# 3. CUSTOM TOKENIZATION
# Only alphabetic tokens (removes punctuation and numbers)
alpha_tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
alpha_tokens = alpha_tokenizer.tokenize(sample_text)
print(f"\n🔤 Alphabetic Only ({len(alpha_tokens)} tokens):")
print("First 15 tokens:", alpha_tokens[:15])
# 4. WHITESPACE TOKENIZATION
whitespace_tokenizer = WhitespaceTokenizer()
whitespace_tokens = whitespace_tokenizer.tokenize(sample_text)
print(f"\n🔤 Whitespace Tokenization ({len(whitespace_tokens)} tokens):")
print("First 10 tokens:", whitespace_tokens[:10])
# Create DataFrame for comparison
tokenization_comparison = pd.DataFrame({
'Method': ['Sentences', 'Words (with punct.)', 'Alphabetic only', 'Whitespace'],
'Token_Count': [len(sentences), len(words), len(alpha_tokens), len(whitespace_tokens)],
'Sample_Tokens': [
[s[:30] + "..." if len(s) > 30 else s for s in sentences[:2]],
words[:3],
alpha_tokens[:3],
whitespace_tokens[:3]
]
})
print(f"\n📊 Tokenization Comparison:")
print(tokenization_comparison.to_string(index=False))
📝 Original Text:
Natural Language Processing (NLP) is amazing! It helps computers understand human language.
Companies like Google, Amazon, and Microsoft use NLP for various applications.
Did you know that NLTK has been around since 2001? It's a powerful toolkit.
Let's explore tokenization: word-level, sentence-level, and custom patterns.
============================================================
🔤 TOKENIZATION EXAMPLES
============================================================
📄 Sentence Tokenization (6 sentences):
1. Natural Language Processing (NLP) is amazing!
2. It helps computers understand human language.
3. Companies like Google, Amazon, and Microsoft use NLP for various applications.
4. Did you know that NLTK has been around since 2001?
5. It's a powerful toolkit.
6. Let's explore tokenization: word-level, sentence-level, and custom patterns.
🔤 Word Tokenization (60 tokens):
First 15 tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'amazing', '!', 'It', 'helps', 'computers', 'understand', 'human', 'language']
🔤 Alphabetic Only (48 tokens):
First 15 tokens: ['Natural', 'Language', 'Processing', 'NLP', 'is', 'amazing', 'It', 'helps', 'computers', 'understand', 'human', 'language', 'Companies', 'like', 'Google']
🔤 Whitespace Tokenization (45 tokens):
First 10 tokens: ['Natural', 'Language', 'Processing', '(NLP)', 'is', 'amazing!', 'It', 'helps', 'computers', 'understand']
📊 Tokenization Comparison:
Method Token_Count Sample_Tokens
Sentences 6 [\nNatural Language Processing (..., It helps computers understand ...]
Words (with punct.) 60 [Natural, Language, Processing]
Alphabetic only 48 [Natural, Language, Processing]
Whitespace 45 [Natural, Language, Processing]
8.8.2.5 Text Preprocessing: Stop Words and Cleaning
Stop words are common words that typically don’t carry significant meaning and are often removed to focus on important content.
# 🧹 Stop Words and Text Cleaning
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import pandas as pd
# Sample product reviews for analysis
product_reviews = [
"This product is absolutely amazing! I love it so much.",
"The quality is terrible. I want my money back immediately.",
"It's okay, nothing special but does the job well enough.",
"Fantastic value for money! Highly recommend to everyone.",
"Poor customer service and the product broke after one day."
]
print("🛍️ Sample Product Reviews:")
for i, review in enumerate(product_reviews, 1):
print(f" {i}. {review}")
print("\n" + "="*60)
print("🧹 TEXT PREPROCESSING PIPELINE")
print("="*60)
# Get English stop words
stop_words = set(stopwords.words('english'))
print(f"\n📋 English Stop Words ({len(stop_words)} words):")
print("Sample stop words:", sorted(list(stop_words))[:20])
def preprocess_text(text):
"""Complete text preprocessing pipeline"""
# 1. Convert to lowercase
text = text.lower()
# 2. Tokenize
tokens = word_tokenize(text)
# 3. Remove punctuation and non-alphabetic tokens
tokens = [token for token in tokens if token.isalpha()]
# 4. Remove stop words
filtered_tokens = [token for token in tokens if token not in stop_words]
return tokens, filtered_tokens
# Process all reviews
processed_reviews = []
print(f"\n🔍 Processing Each Review:")
for i, review in enumerate(product_reviews, 1):
original_tokens, filtered_tokens = preprocess_text(review)
processed_reviews.append({
'Review_ID': f'Review_{i}',
'Original_Text': review,
'Original_Tokens': len(original_tokens),
'After_Cleaning': len(filtered_tokens),
'Tokens_Removed': len(original_tokens) - len(filtered_tokens),
'Key_Words': filtered_tokens[:5] # First 5 meaningful words
})
print(f"\n 📝 Review {i}:")
print(f" Original: {review}")
print(f" Key words: {filtered_tokens[:8]}")
print(f" Tokens: {len(original_tokens)} → {len(filtered_tokens)} (-{len(original_tokens) - len(filtered_tokens)})")
# Create summary DataFrame
df_preprocessing = pd.DataFrame(processed_reviews)
print(f"\n📊 Preprocessing Summary:")
print(df_preprocessing[['Review_ID', 'Original_Tokens', 'After_Cleaning', 'Tokens_Removed']].to_string(index=False))
print(f"\n📈 Overall Statistics:")
total_original = df_preprocessing['Original_Tokens'].sum()
total_cleaned = df_preprocessing['After_Cleaning'].sum()
print(f"• Total tokens before cleaning: {total_original}")
print(f"• Total tokens after cleaning: {total_cleaned}")
print(f"• Tokens removed: {total_original - total_cleaned} ({((total_original - total_cleaned)/total_original)*100:.1f}%)")
print(f"• Average tokens per review (cleaned): {total_cleaned/len(product_reviews):.1f}")🛍️ Sample Product Reviews:
1. This product is absolutely amazing! I love it so much.
2. The quality is terrible. I want my money back immediately.
3. It's okay, nothing special but does the job well enough.
4. Fantastic value for money! Highly recommend to everyone.
5. Poor customer service and the product broke after one day.
============================================================
🧹 TEXT PREPROCESSING PIPELINE
============================================================
📋 English Stop Words (198 words):
Sample stop words: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']
🔍 Processing Each Review:
📝 Review 1:
Original: This product is absolutely amazing! I love it so much.
Key words: ['product', 'absolutely', 'amazing', 'love', 'much']
Tokens: 10 → 5 (-5)
📝 Review 2:
Original: The quality is terrible. I want my money back immediately.
Key words: ['quality', 'terrible', 'want', 'money', 'back', 'immediately']
Tokens: 10 → 6 (-4)
📝 Review 3:
Original: It's okay, nothing special but does the job well enough.
Key words: ['okay', 'nothing', 'special', 'job', 'well', 'enough']
Tokens: 10 → 6 (-4)
📝 Review 4:
Original: Fantastic value for money! Highly recommend to everyone.
Key words: ['fantastic', 'value', 'money', 'highly', 'recommend', 'everyone']
Tokens: 8 → 6 (-2)
📝 Review 5:
Original: Poor customer service and the product broke after one day.
Key words: ['poor', 'customer', 'service', 'product', 'broke', 'one', 'day']
Tokens: 10 → 7 (-3)
📊 Preprocessing Summary:
Review_ID Original_Tokens After_Cleaning Tokens_Removed
Review_1 10 5 5
Review_2 10 6 4
Review_3 10 6 4
Review_4 8 6 2
Review_5 10 7 3
📈 Overall Statistics:
• Total tokens before cleaning: 48
• Total tokens after cleaning: 30
• Tokens removed: 18 (37.5%)
• Average tokens per review (cleaned): 6.0
8.8.2.6 Stemming and Lemmatization: Finding Word Roots
Both stemming and lemmatization reduce words to their base form, but they work differently:
- Stemming: Fast, rule-based chopping (running → run)
- Lemmatization: Slower, dictionary-based reduction (better → good)
# 🌱 Stemming vs Lemmatization Comparison
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
import pandas as pd
# Initialize stemmers and lemmatizer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()
# Test words that demonstrate differences
test_words = [
'running', 'runs', 'ran', 'runner', # Run variations
'better', 'good', 'best', # Good variations
'dogs', 'cats', 'children', 'geese', # Plural forms
'walking', 'walked', 'walks', # Walk variations
'flying', 'flies', 'flew', # Fly variations
'studies', 'studying', 'studied', # Study variations
'beautiful', 'beautifully', # Adjective/adverb
'happiness', 'happier', 'happiest' # Happy variations
]
print("🌱 STEMMING vs LEMMATIZATION COMPARISON")
print("="*60)
# Process each word
results = []
for word in test_words:
porter_stem = porter.stem(word)
snowball_stem = snowball.stem(word)
lemma = lemmatizer.lemmatize(word)
lemma_verb = lemmatizer.lemmatize(word, pos='v') # As verb
results.append({
'Original': word,
'Porter_Stem': porter_stem,
'Snowball_Stem': snowball_stem,
'Lemma_Noun': lemma,
'Lemma_Verb': lemma_verb
})
# Create comparison DataFrame
df_comparison = pd.DataFrame(results)
print("📊 Detailed Comparison:")
print(df_comparison.to_string(index=False))
print(f"\n🎯 Key Differences Analysis:")
# Count unique results for each method
porter_unique = df_comparison['Porter_Stem'].nunique()
snowball_unique = df_comparison['Snowball_Stem'].nunique()
lemma_unique = df_comparison['Lemma_Noun'].nunique()
print(f"• Porter Stemmer: {porter_unique} unique stems")
print(f"• Snowball Stemmer: {snowball_unique} unique stems")
print(f"• WordNet Lemmatizer: {lemma_unique} unique lemmas")
print(f"\n💡 Notable Examples:")
# Find interesting cases where methods differ significantly
interesting_cases = [
('better', 'good'), # Lemmatization handles irregular forms
('studies', 'study'), # Different handling of -ies ending
('flying', 'fly'), # Verb vs noun treatment
('geese', 'goose') # Irregular plurals
]
for original, expected in interesting_cases:
if original in test_words:
row = df_comparison[df_comparison['Original'] == original].iloc[0]
print(f"• '{original}' → Porter: '{row['Porter_Stem']}', Lemma: '{row['Lemma_Noun']}' (expected: '{expected}')")
# Real-world application example
sample_sentence = "The children were running happily through the beautiful gardens, studying the flying geese."
words_to_process = sample_sentence.lower().replace(',', '').replace('.', '').split()
print(f"\n🔬 Real-world Example:")
print(f"Sentence: {sample_sentence}")
print(f"\nWord processing:")
sentence_results = []
for word in words_to_process:
if word.isalpha(): # Skip punctuation
stem = porter.stem(word)
lemma = lemmatizer.lemmatize(word)
sentence_results.append(f"{word} → stem: {stem}, lemma: {lemma}")
for result in sentence_results[:6]: # Show first 6 words
print(f" {result}")
print(f"\n⚡ Performance Guidelines:")
print("• Use STEMMING when: Speed is crucial, approximate matching is okay")
print("• Use LEMMATIZATION when: Accuracy is important, working with formal text")
print("• Stemming is ~10x faster but less accurate than lemmatization")🌱 STEMMING vs LEMMATIZATION COMPARISON
============================================================
📊 Detailed Comparison:
Original Porter_Stem Snowball_Stem Lemma_Noun Lemma_Verb
running run run running run
runs run run run run
ran ran ran ran run
runner runner runner runner runner
better better better better better
good good good good good
best best best best best
dogs dog dog dog dog
cats cat cat cat cat
children children children child children
geese gees gees goose geese
walking walk walk walking walk
walked walk walk walked walk
walks walk walk walk walk
flying fli fli flying fly
flies fli fli fly fly
flew flew flew flew fly
studies studi studi study study
studying studi studi studying study
studied studi studi studied study
beautiful beauti beauti beautiful beautiful
beautifully beauti beauti beautifully beautifully
happiness happi happi happiness happiness
happier happier happier happier happier
happiest happiest happiest happiest happiest
🎯 Key Differences Analysis:
• Porter Stemmer: 18 unique stems
• Snowball Stemmer: 18 unique stems
• WordNet Lemmatizer: 25 unique lemmas
💡 Notable Examples:
• 'better' → Porter: 'better', Lemma: 'better' (expected: 'good')
• 'studies' → Porter: 'studi', Lemma: 'study' (expected: 'study')
• 'flying' → Porter: 'fli', Lemma: 'flying' (expected: 'fly')
• 'geese' → Porter: 'gees', Lemma: 'goose' (expected: 'goose')
🔬 Real-world Example:
Sentence: The children were running happily through the beautiful gardens, studying the flying geese.
Word processing:
the → stem: the, lemma: the
children → stem: children, lemma: child
were → stem: were, lemma: were
running → stem: run, lemma: running
happily → stem: happili, lemma: happily
through → stem: through, lemma: through
⚡ Performance Guidelines:
• Use STEMMING when: Speed is crucial, approximate matching is okay
• Use LEMMATIZATION when: Accuracy is important, working with formal text
• Stemming is ~10x faster but less accurate than lemmatization
8.8.2.7 Sentiment Analysis: Understanding Emotions
NLTK provides powerful tools for analyzing sentiment - determining whether text expresses positive, negative, or neutral emotions.
# 📊 Sentiment Analysis with NLTK VADER
from nltk.sentiment import SentimentIntensityAnalyzer
import pandas as pd
import matplotlib.pyplot as plt
nltk.download('vader_lexicon')
# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()
# Diverse sample texts for sentiment analysis
sample_texts = [
# Positive examples
"This product is absolutely amazing! I love everything about it!",
"Fantastic customer service and fast delivery. Highly recommend!",
"Great value for money. Very satisfied with my purchase.",
# Negative examples
"Terrible quality, waste of money. Very disappointed.",
"Poor customer service and the product broke immediately.",
"Worst purchase I've ever made. Completely useless.",
# Neutral examples
"The product arrived on time. It's okay, nothing special.",
"Standard quality for the price. Does what it's supposed to do.",
"Average product with typical features. Works as expected.",
# Mixed sentiment
"Great design but poor durability. Mixed feelings about this.",
"Love the features but hate the price. It's complicated.",
]
print("📊 SENTIMENT ANALYSIS WITH NLTK VADER")
print("="*60)
# Analyze sentiment for each text
sentiment_results = []
for i, text in enumerate(sample_texts, 1):
# Get sentiment scores
scores = analyzer.polarity_scores(text)
# Determine overall sentiment
if scores['compound'] >= 0.05:
overall = 'Positive'
emoji = '😊'
elif scores['compound'] <= -0.05:
overall = 'Negative'
emoji = '😞'
else:
overall = 'Neutral'
emoji = '😐'
sentiment_results.append({
'Text_ID': f'Text_{i:02d}',
'Text': text[:50] + "..." if len(text) > 50 else text,
'Positive': round(scores['pos'], 3),
'Neutral': round(scores['neu'], 3),
'Negative': round(scores['neg'], 3),
'Compound': round(scores['compound'], 3),
'Sentiment': overall,
'Emoji': emoji,
'Full_Text': text
})
print(f"{emoji} Text {i}: {overall} (compound: {scores['compound']:.3f})")
print(f" \"{text}\"")
print(f" Pos: {scores['pos']:.2f}, Neu: {scores['neu']:.2f}, Neg: {scores['neg']:.2f}")
print()
# Create DataFrame for analysis
df_sentiment = pd.DataFrame(sentiment_results)
print("📈 SENTIMENT ANALYSIS SUMMARY")
print("="*40)
# Overall statistics
sentiment_counts = df_sentiment['Sentiment'].value_counts()
print("📊 Sentiment Distribution:")
for sentiment, count in sentiment_counts.items():
percentage = (count / len(df_sentiment)) * 100
print(f" {sentiment}: {count} texts ({percentage:.1f}%)")
print(f"\n📏 Score Ranges:")
print(f"• Positive scores: {df_sentiment['Positive'].min():.3f} - {df_sentiment['Positive'].max():.3f}")
print(f"• Negative scores: {df_sentiment['Negative'].min():.3f} - {df_sentiment['Negative'].max():.3f}")
print(f"• Compound scores: {df_sentiment['Compound'].min():.3f} - {df_sentiment['Compound'].max():.3f}")
# Show most extreme examples
print(f"\n🎯 Most Extreme Examples:")
most_positive = df_sentiment.loc[df_sentiment['Compound'].idxmax()]
most_negative = df_sentiment.loc[df_sentiment['Compound'].idxmin()]
print(f"😊 Most Positive (score: {most_positive['Compound']:.3f}):")
print(f" \"{most_positive['Full_Text']}\"")
print(f"😞 Most Negative (score: {most_negative['Compound']:.3f}):")
print(f" \"{most_negative['Full_Text']}\"")
# Business application example
print(f"\n💼 Business Application Example:")
print("This could be used for:")
print("• Customer review analysis")
print("• Social media monitoring")
print("• Product feedback categorization")
print("• Brand sentiment tracking")
print("• Customer support prioritization")
# Display summary table
print(f"\n📋 Detailed Results:")
display_df = df_sentiment[['Text_ID', 'Text', 'Compound', 'Sentiment', 'Emoji']].copy()
print(display_df.to_string(index=False))
[nltk_data] Downloading package vader_lexicon to
[nltk_data] C:\Users\lsi8012\AppData\Roaming\nltk_data...
📊 SENTIMENT ANALYSIS WITH NLTK VADER
============================================================
😊 Text 1: Positive (compound: 0.879)
"This product is absolutely amazing! I love everything about it!"
Pos: 0.57, Neu: 0.43, Neg: 0.00
😊 Text 2: Positive (compound: 0.771)
"Fantastic customer service and fast delivery. Highly recommend!"
Pos: 0.53, Neu: 0.47, Neg: 0.00
😊 Text 3: Positive (compound: 0.862)
"Great value for money. Very satisfied with my purchase."
Pos: 0.61, Neu: 0.39, Neg: 0.00
😞 Text 4: Negative (compound: -0.852)
"Terrible quality, waste of money. Very disappointed."
Pos: 0.00, Neu: 0.30, Neg: 0.70
😞 Text 5: Negative (compound: -0.710)
"Poor customer service and the product broke immediately."
Pos: 0.00, Neu: 0.50, Neg: 0.50
😞 Text 6: Negative (compound: -0.802)
"Worst purchase I've ever made. Completely useless."
Pos: 0.00, Neu: 0.41, Neg: 0.59
😞 Text 7: Negative (compound: -0.092)
"The product arrived on time. It's okay, nothing special."
Pos: 0.17, Neu: 0.63, Neg: 0.20
😐 Text 8: Neutral (compound: 0.000)
"Standard quality for the price. Does what it's supposed to do."
Pos: 0.00, Neu: 1.00, Neg: 0.00
😐 Text 9: Neutral (compound: 0.000)
"Average product with typical features. Works as expected."
Pos: 0.00, Neu: 1.00, Neg: 0.00
😞 Text 10: Negative (compound: -0.382)
"Great design but poor durability. Mixed feelings about this."
Pos: 0.19, Neu: 0.51, Neg: 0.30
😞 Text 11: Negative (compound: -0.535)
"Love the features but hate the price. It's complicated."
Pos: 0.18, Neu: 0.48, Neg: 0.34
📈 SENTIMENT ANALYSIS SUMMARY
========================================
📊 Sentiment Distribution:
Negative: 6 texts (54.5%)
Positive: 3 texts (27.3%)
Neutral: 2 texts (18.2%)
📏 Score Ranges:
• Positive scores: 0.000 - 0.615
• Negative scores: 0.000 - 0.699
• Compound scores: -0.852 - 0.879
🎯 Most Extreme Examples:
😊 Most Positive (score: 0.879):
"This product is absolutely amazing! I love everything about it!"
😞 Most Negative (score: -0.852):
"Terrible quality, waste of money. Very disappointed."
💼 Business Application Example:
This could be used for:
• Customer review analysis
• Social media monitoring
• Product feedback categorization
• Brand sentiment tracking
• Customer support prioritization
📋 Detailed Results:
Text_ID Text Compound Sentiment Emoji
Text_01 This product is absolutely amazing! I love everyth... 0.879 Positive 😊
Text_02 Fantastic customer service and fast delivery. High... 0.771 Positive 😊
Text_03 Great value for money. Very satisfied with my purc... 0.862 Positive 😊
Text_04 Terrible quality, waste of money. Very disappointe... -0.852 Negative 😞
Text_05 Poor customer service and the product broke immedi... -0.710 Negative 😞
Text_06 Worst purchase I've ever made. Completely useless. -0.802 Negative 😞
Text_07 The product arrived on time. It's okay, nothing sp... -0.092 Negative 😞
Text_08 Standard quality for the price. Does what it's sup... 0.000 Neutral 😐
Text_09 Average product with typical features. Works as ex... 0.000 Neutral 😐
Text_10 Great design but poor durability. Mixed feelings a... -0.382 Negative 😞
Text_11 Love the features but hate the price. It's complic... -0.535 Negative 😞
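To connect VADER back to this chapter's map()/apply() theme, here is a minimal sketch that scores a text column directly with Series.apply(). The DataFrame df_reviews and its review column are made up for illustration, and the vader_lexicon is assumed to be downloaded as above:
# 🧪 Sketch: scoring a hypothetical 'review' column with apply()
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

df_reviews = pd.DataFrame({
    'review': [
        "Absolutely love it, works perfectly!",
        "Broke after two days, very disappointed.",
        "It is fine. Nothing special."
    ]
})

# apply() runs the scorer on every row and keeps only the compound score
df_reviews['compound'] = df_reviews['review'].apply(
    lambda text: analyzer.polarity_scores(text)['compound']
)

# Same thresholds as above: >= 0.05 positive, <= -0.05 negative, otherwise neutral
df_reviews['sentiment'] = df_reviews['compound'].apply(
    lambda c: 'Positive' if c >= 0.05 else ('Negative' if c <= -0.05 else 'Neutral')
)
print(df_reviews)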
8.8.2.8 Part-of-Speech (POS) Tagging & Named Entity Recognition
POS tagging identifies grammatical roles of words (noun, verb, adjective), while Named Entity Recognition (NER) finds important entities like names, places, and organizations.
# 🏷️ POS Tagging and Named Entity Recognition
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from nltk.tree import Tree
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')
import pandas as pd
# Sample business-related text
business_text = """
Apple Inc. is planning to open a new headquarters in Austin, Texas next year.
CEO Tim Cook announced this during a meeting with investors on Wall Street.
The company expects to hire 5,000 new employees, including software engineers
and data scientists. Microsoft and Google are also expanding their operations
in the region. The project will cost approximately $1 billion dollars.
"""
print("🏷️ POS TAGGING & NAMED ENTITY RECOGNITION")
print("="*60)
print("📝 Sample Text:")
print(business_text.strip())
# 1. POS TAGGING
print(f"\n🔤 PART-OF-SPEECH TAGGING:")
tokens = word_tokenize(business_text)
pos_tags = pos_tag(tokens)
# Group by POS types for analysis
pos_groups = {}
for word, pos in pos_tags:
if pos not in pos_groups:
pos_groups[pos] = []
pos_groups[pos].append(word)
print(f"Found {len(set([pos for _, pos in pos_tags]))} different POS tags:")
# Show most common POS categories
pos_description = {
'NN': 'Noun (singular)',
'NNS': 'Noun (plural)',
'NNP': 'Proper noun (singular)',
'NNPS': 'Proper noun (plural)',
'VB': 'Verb (base form)',
'VBD': 'Verb (past tense)',
'VBG': 'Verb (gerund/present participle)',
'VBN': 'Verb (past participle)',
'VBP': 'Verb (present, not 3rd person singular)',
'VBZ': 'Verb (present, 3rd person singular)',
'JJ': 'Adjective',
'JJR': 'Adjective (comparative)',
'JJS': 'Adjective (superlative)',
'RB': 'Adverb',
'DT': 'Determiner',
'IN': 'Preposition/subordinating conjunction',
'TO': 'to',
'CD': 'Cardinal number'
}
for pos in sorted(pos_groups.keys()):
if pos in pos_description:
example_words = pos_groups[pos][:3] # Show first 3 examples
print(f" {pos} ({pos_description[pos]}): {example_words}")
# 2. NAMED ENTITY RECOGNITION
print(f"\n🏢 NAMED ENTITY RECOGNITION:")
named_entities = ne_chunk(pos_tags)
# Extract entities
entities = []
def extract_entities(tree):
for subtree in tree:
if isinstance(subtree, Tree): # It's a named entity
entity_name = ' '.join([token for token, pos in subtree.leaves()])
entity_label = subtree.label()
entities.append((entity_name, entity_label))
extract_entities(named_entities)
if entities:
print("🎯 Detected Entities:")
entity_types = {}
for entity, label in entities:
if label not in entity_types:
entity_types[label] = []
entity_types[label].append(entity)
for label, entity_list in entity_types.items():
print(f" {label}: {entity_list}")
else:
print(" No named entities detected with basic NER")
# 3. MANUAL ENTITY EXTRACTION (for demonstration)
print(f"\n🔍 MANUAL ENTITY IDENTIFICATION:")
# Find potential entities using POS tags
companies = [word for word, pos in pos_tags if pos == 'NNP' and word in ['Apple', 'Microsoft', 'Google']]
people = [word for word, pos in pos_tags if pos == 'NNP' and word in ['Tim', 'Cook']]
locations = [word for word, pos in pos_tags if pos == 'NNP' and word in ['Austin', 'Texas', 'Wall', 'Street']]
numbers = [word for word, pos in pos_tags if pos == 'CD']
manual_entities = {
'Companies': companies,
'People': people,
'Locations': locations,
'Numbers': numbers
}
for category, items in manual_entities.items():
if items:
print(f" {category}: {items}")
# 4. BUSINESS APPLICATIONS
print(f"\n💼 BUSINESS APPLICATIONS:")
print("POS Tagging can be used for:")
print("• Information extraction from documents")
print("• Improving search relevance")
print("• Text summarization")
print("• Grammar checking tools")
print("• Keyword extraction")
print("\nNamed Entity Recognition can be used for:")
print("• Customer relationship management")
print("• Compliance and regulatory analysis")
print("• News and social media monitoring")
print("• Document classification")
print("• Knowledge graph construction")
# Create summary DataFrame
word_analysis = []
for word, pos in pos_tags[:15]: # First 15 words
if word.isalpha(): # Skip punctuation
is_entity = any(word in entity for entity, _ in entities)  # loose substring check: short words like 'a' or 'in' can be flagged as entities
word_analysis.append({
'Word': word,
'POS_Tag': pos,
'POS_Description': pos_description.get(pos, 'Other'),
'Is_Entity': is_entity
})
df_analysis = pd.DataFrame(word_analysis)
print(f"\n📋 Sample Word Analysis:")
print(df_analysis.to_string(index=False))
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data] C:\Users\lsi8012\AppData\Roaming\nltk_data...
[nltk_data] Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data] date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data] C:\Users\lsi8012\AppData\Roaming\nltk_data...
[nltk_data] Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data] C:\Users\lsi8012\AppData\Roaming\nltk_data...
[nltk_data] Unzipping corpora\words.zip.
🏷️ POS TAGGING & NAMED ENTITY RECOGNITION
============================================================
📝 Sample Text:
Apple Inc. is planning to open a new headquarters in Austin, Texas next year.
CEO Tim Cook announced this during a meeting with investors on Wall Street.
The company expects to hire 5,000 new employees, including software engineers
and data scientists. Microsoft and Google are also expanding their operations
in the region. The project will cost approximately $1 billion dollars.
🔤 PART-OF-SPEECH TAGGING:
Found 20 different POS tags:
CD (Cardinal number): ['5,000', '1', 'billion']
DT (Determiner): ['a', 'this', 'a']
IN (Preposition/subordinating conjunction): ['in', 'during', 'with']
JJ (Adjective): ['new', 'next', 'new']
NN (Noun (singular)): ['headquarters', 'year', 'meeting']
NNP (Proper noun (singular)): ['Apple', 'Inc.', 'Austin']
NNS (Noun (plural)): ['investors', 'employees', 'engineers']
RB (Adverb): ['also', 'approximately']
TO (to): ['to', 'to']
VB (Verb (base form)): ['open', 'hire', 'cost']
VBD (Verb (past tense)): ['announced']
VBG (Verb (gerund/present participle)): ['planning', 'including', 'expanding']
VBP (Verb (present, not 3rd person singular)): ['are']
VBZ (Verb (present, 3rd person singular)): ['is', 'expects']
🏢 NAMED ENTITY RECOGNITION:
🎯 Detected Entities:
PERSON: ['Apple', 'Microsoft']
ORGANIZATION: ['Inc.', 'CEO Tim Cook']
GPE: ['Austin', 'Texas', 'Google']
FACILITY: ['Wall Street']
🔍 MANUAL ENTITY IDENTIFICATION:
Companies: ['Apple', 'Microsoft', 'Google']
People: ['Tim', 'Cook']
Locations: ['Austin', 'Texas', 'Wall', 'Street']
Numbers: ['5,000', '1', 'billion']
💼 BUSINESS APPLICATIONS:
POS Tagging can be used for:
• Information extraction from documents
• Improving search relevance
• Text summarization
• Grammar checking tools
• Keyword extraction
Named Entity Recognition can be used for:
• Customer relationship management
• Compliance and regulatory analysis
• News and social media monitoring
• Document classification
• Knowledge graph construction
📋 Sample Word Analysis:
Word POS_Tag POS_Description Is_Entity
Apple NNP Proper noun (singular) True
is VBZ Verb (present, 3rd person singular) False
planning VBG Verb (gerund/present participle) False
to TO to False
open VB Verb (base form) False
a DT Determiner True
new JJ Adjective False
headquarters NN Noun (singular) False
in IN Preposition/subordinating conjunction True
Austin NNP Proper noun (singular) True
Texas NNP Proper noun (singular) True
next JJ Adjective False
year NN Noun (singular) False
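In practice you would wrap this workflow in a small helper so it can be applied to a whole column of documents, echoing the apply() pattern from earlier in the chapter. Below is a minimal sketch; the function name and the sample Series are hypothetical, and the NLTK resources downloaded above are assumed to be available:
# 🧪 Sketch: reusable NER helper applied to a pandas Series
import pandas as pd
from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

def extract_named_entities(text):
    # Return a list of (entity_text, entity_label) tuples for one document
    entities = []
    for subtree in ne_chunk(pos_tag(word_tokenize(text))):
        if isinstance(subtree, Tree):
            entity = ' '.join(token for token, _ in subtree.leaves())
            entities.append((entity, subtree.label()))
    return entities

docs = pd.Series([
    "Apple Inc. hired engineers in Austin, Texas.",
    "Tim Cook met investors on Wall Street."
])
print(docs.apply(extract_named_entities))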
8.8.2.9 NLTK Summary & Real-World Applications
8.8.2.9.1 📊 NLTK Capabilities Overview
| Feature | Purpose | Business Use Case | Performance |
|---|---|---|---|
| Tokenization | Split text into words/sentences | Document processing, search | ⚡ Fast |
| Stop Words | Remove common words | Focus on meaningful content | ⚡ Fast |
| Stemming | Reduce words to stems | Search optimization, indexing | ⚡ Very Fast |
| Lemmatization | Reduce words to dictionary form | Text analysis, normalization | 🔄 Medium |
| Sentiment Analysis | Detect emotions and opinions | Customer feedback, social monitoring | ⚡ Fast |
| POS Tagging | Identify grammatical roles | Information extraction | 🔄 Medium |
| Named Entity Recognition | Find people, places, organizations | Document analysis, CRM | 🔄 Medium |
8.8.2.9.2 💡 Best Practices & Tips
Performance Optimization:
- Use stemming for speed-critical applications
- Use lemmatization for accuracy-critical applications
- Cache processed results for repeated text analysis
- Batch process large datasets instead of one-by-one (a short sketch of both ideas appears after these lists)
Data Quality:
- Clean text before processing (remove HTML, special characters)
- Handle multiple languages appropriately
- Validate results with domain experts
- Consider context when interpreting sentiment
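A minimal sketch of the caching and batching ideas, assuming the NLTK tokenizer and WordNet data are already downloaded; the helper names are made up for illustration:
# 🧪 Sketch: cache lemmas per word, then batch-process a Series of documents
from functools import lru_cache
import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()

@lru_cache(maxsize=50_000)
def lemmatize_cached(word):
    # Each distinct word is lemmatized only once; repeats come from the cache
    return lemmatizer.lemmatize(word.lower())

def lemmatize_text(text):
    # Process a whole document at once instead of calling the lemmatizer word-by-word from the caller
    return ' '.join(lemmatize_cached(token) for token in word_tokenize(text))

docs = pd.Series([
    "The children were studying the flying geese.",
    "The geese were flying over the studying children."
])
print(docs.apply(lemmatize_text))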
NLTK provides an excellent foundation for understanding NLP concepts before moving to more advanced tools!
8.9 Independent Study
8.9.1 Practice exercise 1
8.9.1.1
Read the file STAT303-1 survey for data analysis.csv.
survey_data = pd.read_csv('./Datasets/STAT303-1 survey for data analysis.csv')
print("Survey data loaded successfully!")
print(f"Shape: {survey_data.shape}")
survey_data.head()
Survey data loaded successfully!
Shape: (192, 51)
| Timestamp | fav_alcohol | On average (approx.) how many parties a month do you attend during the school year? Enter a whole number (0, 1, 2, 3, 4, ...) | smoke | weed | introvert_extrovert | love_first_sight | learning_style | left_right_brained | personality_type | ... | used_python_before | dominant_hand | childhood_in_US | gender | region_of_residence | political_affliation | cant_change_math_ability | can_change_math_ability | math_is_genetic | much_effort_is_lack_of_talent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022/09/13 1:43:34 pm GMT-5 | I don't drink | 1 | No | Occasionally | Introvert | No | Visual (learn best through images or graphic o... | Left-brained (logic, science, critical thinkin... | INFJ | ... | Yes | Right | Yes | Female | Northeast | Democrat | Disagree | Agree | Disagree | Disagree |
| 1 | 2022/09/13 5:28:17 pm GMT-5 | Hard liquor/Mixed drink | 3 | No | Occasionally | Extrovert | No | Visual (learn best through images or graphic o... | Left-brained (logic, science, critical thinkin... | ESFJ | ... | Yes | Right | Yes | Male | West | Democrat | Disagree | Agree | Disagree | Disagree |
| 2 | 2022/09/13 7:56:38 pm GMT-5 | Hard liquor/Mixed drink | 3 | No | Yes | Introvert | No | Kinesthetic (learn best through figuring out h... | Left-brained (logic, science, critical thinkin... | ISTJ | ... | No | Right | No | Female | International | No affiliation | Disagree | Agree | Disagree | Disagree |
| 3 | 2022/09/13 10:34:37 pm GMT-5 | Hard liquor/Mixed drink | 12 | No | No | Extrovert | No | Visual (learn best through images or graphic o... | Left-brained (logic, science, critical thinkin... | ENFJ | ... | No | Right | Yes | Female | Southeast | Democrat | Disagree | Agree | Disagree | Disagree |
| 4 | 2022/09/14 4:46:19 pm GMT-5 | I don't drink | 1 | No | No | Extrovert | Yes | Reading/Writing (learn best through words ofte... | Right-brained (creative, art, imaginative, int... | ENTJ | ... | No | Right | Yes | Female | Northeast | Democrat | Agree | Disagree | Disagree | Disagree |
5 rows × 51 columns
8.9.1.2
Rename all the columns of the data, except the first two columns, with the shorter names in the list new_col_names given below. The order of column names in the list is the same as the order in which the columns are to be renamed starting with the third column from the left.
new_col_names = ['parties_per_month', 'do_you_smoke', 'weed', 'are_you_an_introvert_or_extrovert', 'love_first_sight', 'learning_style', 'left_right_brained', 'personality_type', 'social_media', 'num_insta_followers', 'streaming_platforms', 'expected_marriage_age', 'expected_starting_salary', 'fav_sport', 'minutes_ex_per_week', 'sleep_hours_per_day', 'how_happy', 'farthest_distance_travelled', 'fav_number', 'fav_letter', 'internet_hours_per_day', 'only_child', 'birthdate_odd_even', 'birth_month', 'fav_season', 'living_location_on_campus', 'major', 'num_majors_minors', 'high_school_GPA', 'NU_GPA', 'age', 'height', 'height_father', 'height_mother', 'school_year', 'procrastinator', 'num_clubs', 'student_athlete', 'AP_stats', 'used_python_before', 'dominant_hand', 'childhood_in_US', 'gender', 'region_of_residence', 'political_affliation', 'cant_change_math_ability', 'can_change_math_ability', 'math_is_genetic', 'much_effort_is_lack_of_talent']
survey_data.columns = list(survey_data.columns[0:2]) + new_col_names
8.9.1.3
Rename the following columns again:
- Rename do_you_smoke to smoke.
- Rename are_you_an_introvert_or_extrovert to introvert_extrovert.
Hint: Use the function rename()
# Rename specific columns using rename function
survey_data = survey_data.rename(columns={
'do_you_smoke': 'smoke',
'are_you_an_introvert_or_extrovert': 'introvert_extrovert'})
print(f"Updated columns: {list(survey_data.columns)}")
print("Specific columns renamed successfully!")
Updated columns: ['Timestamp', 'fav_alcohol', 'parties_per_month', 'smoke', 'weed', 'introvert_extrovert', 'love_first_sight', 'learning_style', 'left_right_brained', 'personality_type', 'social_media', 'num_insta_followers', 'streaming_platforms', 'expected_marriage_age', 'expected_starting_salary', 'fav_sport', 'minutes_ex_per_week', 'sleep_hours_per_day', 'how_happy', 'farthest_distance_travelled', 'fav_number', 'fav_letter', 'internet_hours_per_day', 'only_child', 'birthdate_odd_even', 'birth_month', 'fav_season', 'living_location_on_campus', 'major', 'num_majors_minors', 'high_school_GPA', 'NU_GPA', 'age', 'height', 'height_father', 'height_mother', 'school_year', 'procrastinator', 'num_clubs', 'student_athlete', 'AP_stats', 'used_python_before', 'dominant_hand', 'childhood_in_US', 'gender', 'region_of_residence', 'political_affliation', 'cant_change_math_ability', 'can_change_math_ability', 'math_is_genetic', 'much_effort_is_lack_of_talent']
Specific columns renamed successfully!
8.9.1.4
Examine the column num_insta_followers. Some numbers in the column contain a comma (,) or a tilde (~). Remove both of these characters from the numbers in the column.
Hint: You may use the function str.replace() of the Pandas Series class.
survey_data_insta = survey_data.copy()
survey_data_insta['num_insta_followers']=survey_data_insta['num_insta_followers'].str.replace(',','')
survey_data_insta['num_insta_followers']=survey_data_insta['num_insta_followers'].str.replace('~','')
survey_data_insta.head()
| Timestamp | fav_alcohol | parties_per_month | smoke | weed | introvert_extrovert | love_first_sight | learning_style | left_right_brained | personality_type | ... | used_python_before | dominant_hand | childhood_in_US | gender | region_of_residence | political_affliation | cant_change_math_ability | can_change_math_ability | math_is_genetic | much_effort_is_lack_of_talent | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022/09/13 1:43:34 pm GMT-5 | I don't drink | 1 | No | Occasionally | Introvert | No | Visual (learn best through images or graphic o... | Left-brained (logic, science, critical thinkin... | INFJ | ... | Yes | Right | Yes | Female | Northeast | Democrat | Disagree | Agree | Disagree | Disagree |
| 1 | 2022/09/13 5:28:17 pm GMT-5 | Hard liquor/Mixed drink | 3 | No | Occasionally | Extrovert | No | Visual (learn best through images or graphic o... | Left-brained (logic, science, critical thinkin... | ESFJ | ... | Yes | Right | Yes | Male | West | Democrat | Disagree | Agree | Disagree | Disagree |
| 2 | 2022/09/13 7:56:38 pm GMT-5 | Hard liquor/Mixed drink | 3 | No | Yes | Introvert | No | Kinesthetic (learn best through figuring out h... | Left-brained (logic, science, critical thinkin... | ISTJ | ... | No | Right | No | Female | International | No affiliation | Disagree | Agree | Disagree | Disagree |
| 3 | 2022/09/13 10:34:37 pm GMT-5 | Hard liquor/Mixed drink | 12 | No | No | Extrovert | No | Visual (learn best through images or graphic o... | Left-brained (logic, science, critical thinkin... | ENFJ | ... | No | Right | Yes | Female | Southeast | Democrat | Disagree | Agree | Disagree | Disagree |
| 4 | 2022/09/14 4:46:19 pm GMT-5 | I don't drink | 1 | No | No | Extrovert | Yes | Reading/Writing (learn best through words ofte... | Right-brained (creative, art, imaginative, int... | ENTJ | ... | No | Right | Yes | Female | Northeast | Democrat | Agree | Disagree | Disagree | Disagree |
5 rows × 51 columns
8.9.1.5
Use the cleaned dataset where the column num_insta_followers has been updated.
The last four variables in the dataset are:
- cant_change_math_ability
- can_change_math_ability
- math_is_genetic
- much_effort_is_lack_of_talent
Each of the above variables has values - Agree / Disagree. Replace Agree with 1 and Disagree with 0.
Hint: You can do it with either of the following methods:
- Use the map() function
- Use the apply() function with a lambda function
# Map the Agree/Disagree responses to 1/0 in the four columns below
# Define the columns to convert
columns_to_convert = ['cant_change_math_ability', 'can_change_math_ability',
'math_is_genetic', 'much_effort_is_lack_of_talent']
# Method 1: Using map() with a dictionary (vectorized over rows; the loop is only over the four columns)
for col in columns_to_convert:
survey_data_insta[col] = survey_data_insta[col].map({'Agree': 1, 'Disagree': 0})
print("Method 1: Using map() function")
print(survey_data_insta[columns_to_convert].head())
print("\nData types:")
print(survey_data_insta[columns_to_convert].dtypes)
# Method 2: Using apply() with a lambda function (again looping only over the columns)
# First, let's reload the data to show this method
survey_data_insta2 = survey_data.copy()
survey_data_insta2['num_insta_followers'] = survey_data_insta2['num_insta_followers'].str.replace(',','')
survey_data_insta2['num_insta_followers'] = survey_data_insta2['num_insta_followers'].str.replace('~','')
for col in columns_to_convert:
survey_data_insta2[col] = survey_data_insta2[col].apply(lambda x: 1 if x == 'Agree' else 0)
print("\n\nMethod 2: Using apply() with lambda function")
print(survey_data_insta2[columns_to_convert].head())
Method 1: Using map() function
cant_change_math_ability can_change_math_ability math_is_genetic \
0 0 1 0
1 0 1 0
2 0 1 0
3 0 1 0
4 1 0 0
much_effort_is_lack_of_talent
0 0
1 0
2 0
3 0
4 0
Data types:
cant_change_math_ability int64
can_change_math_ability int64
math_is_genetic int64
much_effort_is_lack_of_talent int64
dtype: object
Method 2: Using apply() with lambda function
cant_change_math_ability can_change_math_ability math_is_genetic \
0 0 1 0
1 0 1 0
2 0 1 0
3 0 1 0
4 1 0 0
much_effort_is_lack_of_talent
0 0
1 0
2 0
3 0
4 0
8.9.2 Practice exercise 2
This exercise will motivate vectorized computations with NumPy. Vectorized computations are more efficient than explicit Python loops and also keep the code concise.
Read (1) the quantities of roll, bun, cake, and bread required by 3 people (Ben, Barbara, and Beth) from food_quantity.csv, and (2) the prices of these food items at two shops (Target and Kroger) from price.csv. Find out which shop each person should go to in order to minimize their expenses.
food_df = pd.read_csv('./datasets/food_quantity.csv')
food_df.head()
| Person | roll | bun | cake | bread | |
|---|---|---|---|---|---|
| 0 | Ben | 6 | 5 | 3 | 1 |
| 1 | Barbara | 3 | 6 | 2 | 2 |
| 2 | Beth | 3 | 4 | 3 | 1 |
price_df = pd.read_csv('./datasets/price.csv')
price_df.head()
| Item | Target | Kroger | |
|---|---|---|---|
| 0 | roll | 1.5 | 1.0 |
| 1 | bun | 2.0 | 2.5 |
| 2 | cake | 5.0 | 4.5 |
| 3 | bread | 16.0 | 17.0 |
8.9.2.1
Compute the expenses of Ben if he prefers to buy all food items from Target
# Step 1: Get Ben's quantities
ben_quantities = food_df[food_df['Person'] == 'Ben'].iloc[0] # Get Ben's row
print("Ben's food quantities:")
print(ben_quantities)
print()
# Step 2: Get Target prices
target_prices = price_df.set_index('Item')['Target'] # Set Item as index, get Target column
print("Target prices:")
print(target_prices)
print()
# Step 3: Calculate Ben's total expenses at Target
# Multiply quantities by corresponding Target prices
food_items = ['roll', 'bun', 'cake', 'bread']
total_expense = 0
print("Ben's expense breakdown at Target:")
for item in food_items:
quantity = ben_quantities[item]
price = target_prices[item]
expense = quantity * price
total_expense += expense
print(f"{item}: {quantity} × ${price} = ${expense}")
print(f"\nBen's total expense at Target: ${total_expense}")
Ben's food quantities:
Person Ben
roll 6
bun 5
cake 3
bread 1
Name: 0, dtype: object
Target prices:
Item
roll 1.5
bun 2.0
cake 5.0
bread 16.0
Name: Target, dtype: float64
Ben's expense breakdown at Target:
roll: 6 × $1.5 = $9.0
bun: 5 × $2.0 = $10.0
cake: 3 × $5.0 = $15.0
bread: 1 × $16.0 = $16.0
Ben's total expense at Target: $50.0
8.9.2.2
Compute Ben’s total expenses at both Target and Kroger. Which store is cheaper for him?
# Step 1: Get Ben's quantities (reuse from previous calculation)
ben_quantities = food_df[food_df['Person'] == 'Ben'].iloc[0]
food_items = ['roll', 'bun', 'cake', 'bread']
# Step 2: Get prices from both stores
target_prices = price_df.set_index('Item')['Target']
kroger_prices = price_df.set_index('Item')['Kroger']
print("🛒 BEN'S EXPENSE COMPARISON")
print("=" * 40)
print("Ben's quantities:", [ben_quantities[item] for item in food_items])
print()
# Step 3: Calculate expenses at both stores
target_total = 0
kroger_total = 0
print("Detailed breakdown:")
print(f"{'Item':<8} {'Qty':<4} {'Target':<8} {'Kroger':<8} {'Target $':<10} {'Kroger $':<10}")
print("-" * 60)
for item in food_items:
quantity = ben_quantities[item]
target_price = target_prices[item]
kroger_price = kroger_prices[item]
target_expense = quantity * target_price
kroger_expense = quantity * kroger_price
target_total += target_expense
kroger_total += kroger_expense
print(f"{item:<8} {quantity:<4} ${target_price:<7} ${kroger_price:<7} ${target_expense:<9} ${kroger_expense:<9}")
print("-" * 60)
print(f"{'TOTAL':<8} {'':<4} {'':<8} {'':<8} ${target_total:<9} ${kroger_total:<9}")
# Step 4: Determine which store is cheaper
print(f"\n💰 COMPARISON RESULTS:")
print(f"Ben's total at Target: ${target_total}")
print(f"Ben's total at Kroger: ${kroger_total}")
if target_total < kroger_total:
savings = kroger_total - target_total
print(f"🏆 TARGET is cheaper by ${savings}")
print(f"Ben should shop at TARGET to save money!")
elif kroger_total < target_total:
savings = target_total - kroger_total
print(f"🏆 KROGER is cheaper by ${savings}")
print(f"Ben should shop at KROGER to save money!")
else:
print("🤝 Both stores cost the same for Ben!")
🛒 BEN'S EXPENSE COMPARISON
========================================
Ben's quantities: [6, 5, 3, 1]
Detailed breakdown:
Item Qty Target Kroger Target $ Kroger $
------------------------------------------------------------
roll 6 $1.5 $1.0 $9.0 $6.0
bun 5 $2.0 $2.5 $10.0 $12.5
cake 3 $5.0 $4.5 $15.0 $13.5
bread 1 $16.0 $17.0 $16.0 $17.0
------------------------------------------------------------
TOTAL $50.0 $49.0
💰 COMPARISON RESULTS:
Ben's total at Target: $50.0
Ben's total at Kroger: $49.0
🏆 KROGER is cheaper by $1.0
Ben should shop at KROGER to save money!
8.9.2.3
Compute each person’s total expenses at both Target and Kroger. Which store is cheaper for them?
# Step 1: Set up data
food_items = ['roll', 'bun', 'cake', 'bread']
target_prices = price_df.set_index('Item')['Target']
kroger_prices = price_df.set_index('Item')['Kroger']
print("🛒 COMPLETE EXPENSE ANALYSIS FOR ALL PEOPLE")
print("=" * 60)
print(f"Food items: {food_items}")
print(f"Target prices: {target_prices.values}")
print(f"Kroger prices: {kroger_prices.values}")
print()
# Step 2: Calculate expenses for each person
results = []
for index, person_row in food_df.iterrows():
person_name = person_row['Person']
# Calculate total expenses at both stores
target_total = 0
kroger_total = 0
print(f"👤 {person_name.upper()}'S ANALYSIS:")
print("-" * 30)
print(f"{'Item':<8} {'Qty':<4} {'Target $':<10} {'Kroger $':<10}")
print("-" * 30)
for item in food_items:
quantity = person_row[item]
target_expense = quantity * target_prices[item]
kroger_expense = quantity * kroger_prices[item]
target_total += target_expense
kroger_total += kroger_expense
print(f"{item:<8} {quantity:<4} ${target_expense:<9.2f} ${kroger_expense:<9.2f}")
print("-" * 30)
print(f"{'TOTAL':<8} {'':<4} ${target_total:<9.2f} ${kroger_total:<9.2f}")
# Determine cheaper store
if target_total < kroger_total:
cheaper_store = "TARGET"
savings = kroger_total - target_total
print(f"🏆 {cheaper_store} is cheaper by ${savings:.2f}")
elif kroger_total < target_total:
cheaper_store = "KROGER"
savings = target_total - kroger_total
print(f"🏆 {cheaper_store} is cheaper by ${savings:.2f}")
else:
cheaper_store = "TIE"
savings = 0
print("🤝 Both stores cost the same!")
# Store results
results.append({
'Person': person_name,
'Target_Total': target_total,
'Kroger_Total': kroger_total,
'Cheaper_Store': cheaper_store,
'Savings': savings
})
print()
# Step 3: Create summary DataFrame
results_df = pd.DataFrame(results)
print("📊 SUMMARY TABLE:")
print("=" * 50)
print(results_df.to_string(index=False))
print(f"\n FINAL RECOMMENDATIONS:")
print("=" * 30)
for _, row in results_df.iterrows():
if row['Cheaper_Store'] == 'TIE':
print(f"• {row['Person']}: Can shop at either store (same cost)")
else:
print(f"• {row['Person']}: Shop at {row['Cheaper_Store']} to save ${row['Savings']:.2f}")
# Step 4: Additional insights
print(f"\n ADDITIONAL INSIGHTS:")
print("=" * 25)
target_shoppers = results_df[results_df['Cheaper_Store'] == 'TARGET']['Person'].tolist()
kroger_shoppers = results_df[results_df['Cheaper_Store'] == 'KROGER']['Person'].tolist()
tie_shoppers = results_df[results_df['Cheaper_Store'] == 'TIE']['Person'].tolist()
if target_shoppers:
print(f"Target is better for: {', '.join(target_shoppers)}")
if kroger_shoppers:
print(f"Kroger is better for: {', '.join(kroger_shoppers)}")
if tie_shoppers:
print(f"No difference for: {', '.join(tie_shoppers)}")
total_savings = results_df['Savings'].sum()
print(f"Total potential savings if everyone shops optimally: ${total_savings:.2f}")
🛒 COMPLETE EXPENSE ANALYSIS FOR ALL PEOPLE
============================================================
Food items: ['roll', 'bun', 'cake', 'bread']
Target prices: [ 1.5 2. 5. 16. ]
Kroger prices: [ 1. 2.5 4.5 17. ]
👤 BEN'S ANALYSIS:
------------------------------
Item Qty Target $ Kroger $
------------------------------
roll 6 $9.00 $6.00
bun 5 $10.00 $12.50
cake 3 $15.00 $13.50
bread 1 $16.00 $17.00
------------------------------
TOTAL $50.00 $49.00
🏆 KROGER is cheaper by $1.00
👤 BARBARA'S ANALYSIS:
------------------------------
Item Qty Target $ Kroger $
------------------------------
roll 3 $4.50 $3.00
bun 6 $12.00 $15.00
cake 2 $10.00 $9.00
bread 2 $32.00 $34.00
------------------------------
TOTAL $58.50 $61.00
🏆 TARGET is cheaper by $2.50
👤 BETH'S ANALYSIS:
------------------------------
Item Qty Target $ Kroger $
------------------------------
roll 3 $4.50 $3.00
bun 4 $8.00 $10.00
cake 3 $15.00 $13.50
bread 1 $16.00 $17.00
------------------------------
TOTAL $43.50 $43.50
🤝 Both stores cost the same!
📊 SUMMARY TABLE:
==================================================
Person Target_Total Kroger_Total Cheaper_Store Savings
Ben 50.0 49.0 KROGER 1.0
Barbara 58.5 61.0 TARGET 2.5
Beth 43.5 43.5 TIE 0.0
FINAL RECOMMENDATIONS:
==============================
• Ben: Shop at KROGER to save $1.00
• Barbara: Shop at TARGET to save $2.50
• Beth: Can shop at either store (same cost)
ADDITIONAL INSIGHTS:
=========================
Target is better for: Barbara
Kroger is better for: Ben
No difference for: Beth
Total potential savings if everyone shops optimally: $3.50
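Finally, since this exercise is meant to motivate vectorization: all of the loops above can be collapsed into a single matrix product. Here is a minimal sketch, assuming food_df and price_df are loaded as above (note that idxmin() breaks ties by taking the first column, so Beth would show as Target here):
# 🚀 Vectorized alternative: quantities (3×4) · prices (4×2) = totals (3×2)
quantities = food_df.set_index('Person')   # rows: people, columns: food items
prices = price_df.set_index('Item')        # rows: food items, columns: stores
totals = quantities.dot(prices)            # total expense per person per store
totals['Cheaper_Store'] = totals.idxmin(axis=1)
print(totals)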