8  Pandas Intermediate

Now that you’ve learned the fundamentals of pandas, this chapter shifts focus to transformations — techniques that reshape and enrich your data.
You’ll practice vectorized arithmetic, automatic alignment, conditional logic, and row- vs. column-wise operations that turn raw tables into analysis-ready features.

8.1 Learning Objectives

By the end of this chapter, you will be able to work with:

  • Arithmetic Operations — perform calculations within and between DataFrames
  • Broadcasting — apply operations efficiently across different shapes with automatic alignment
  • Transformations — use map() and apply() to transform your data flexibly
  • Lambda Functions — create quick, inline transformations with map() or apply()
  • Advanced NLP Preprocessing — practice transformations that support your final project

🔑 Note on Vectorization:
Vectorization means applying an operation to entire arrays at once using optimized, compiled NumPy routines instead of looping over elements in slow Python code.
In this chapter, we will motivate NumPy vectorization and introduce its benefits. A deeper dive will come in the dedicated NumPy chapter.

First, let's import the libraries we'll use throughout this chapter:

import numpy as np
import pandas as pd
import time

8.2 Arithmetic Operations: Within and Between DataFrames

Arithmetic operations are essential when working with numerical data columns in pandas.

8.2.1 Why Arithmetic Operations Matter in Data Science

They form the backbone of many common tasks, including:

  • Creating new metrics — e.g., profit = revenue - cost
  • Normalizing data — z-scores, percentages, or scaling
  • Comparing datasets — growth rates, ratios, or relative differences
  • Financial analysis — ROI, margins, compound growth
  • Scientific computations — applying formulas and transformations

🚀 Pandas Advantage: Arithmetic in pandas is intuitive and concise — operations align automatically by rows and columns, saving you from manual looping.

8.2.2 Arithmetic Operations Within a DataFrame

Core Concept: Perform calculations between columns, with constants, or across rows

Operation      | Symbol | Example                             | Business Use Case
Addition       | +      | df['A'] + df['B']                   | Total sales = Online + Retail
Subtraction    | -      | df['Revenue'] - df['Cost']          | Profit calculation
Multiplication | *      | df['Price'] * df['Quantity']        | Total value
Division       | /      | df['Profit'] / df['Revenue']        | Margin percentage
Power          | **     | df['Principal'] * (1 + rate)**years | Compound interest
Floor Division | //     | df['Total'] // df['BatchSize']      | Number of full batches
Modulo         | %      | df['ID'] % 10                       | Grouping by last digit
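
The floor-division and modulo rows above are not revisited in the worked example that follows, so here is a minimal illustrative sketch. The batch_df DataFrame and its Total, BatchSize, and ID columns are invented purely for this demonstration.

import pandas as pd

batch_df = pd.DataFrame({
    'ID': [101, 205, 317, 428],
    'Total': [53, 120, 87, 240],      # units produced
    'BatchSize': [10, 25, 20, 50]     # units per shipping batch
})

batch_df['Full_Batches'] = batch_df['Total'] // batch_df['BatchSize']   # complete batches only
batch_df['Leftover_Units'] = batch_df['Total'] % batch_df['BatchSize']  # units that don't fill a batch
batch_df['ID_Last_Digit'] = batch_df['ID'] % 10                         # e.g., for grouping by last digit
print(batch_df)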

8.2.2.1 Column-to-Column Operations

# 🎯 Create a business-like DataFrame for demonstrations
business_data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Watch', 'Headphones'],
    'Units_Sold': [150, 300, 200, 500, 250],
    'Price_per_Unit': [1200, 800, 600, 400, 200],
    'Cost_per_Unit': [800, 500, 400, 250, 120],
    'Marketing_Spend': [15000, 25000, 18000, 30000, 12000]
}
business_df = pd.DataFrame(business_data)

print("📊 Original Business Data:")
print(business_df)

# 💰 Basic arithmetic operations between columns
business_df['Total_Revenue'] = business_df['Units_Sold'] * business_df['Price_per_Unit']
business_df['Total_Cost'] = business_df['Units_Sold'] * business_df['Cost_per_Unit']  
business_df['Gross_Profit'] = business_df['Total_Revenue'] - business_df['Total_Cost']
business_df['Net_Profit'] = business_df['Gross_Profit'] - business_df['Marketing_Spend']

# 📈 Calculate ratios and percentages
business_df['Profit_Margin'] = (business_df['Gross_Profit'] / business_df['Total_Revenue']) * 100
business_df['ROI'] = (business_df['Net_Profit'] / business_df['Marketing_Spend']) * 100

print("\n💡 Enhanced Business Data with Calculations:")
print(business_df[['Product', 'Total_Revenue', 'Gross_Profit', 'Net_Profit', 'Profit_Margin', 'ROI']].round(2))
📊 Original Business Data:
      Product  Units_Sold  Price_per_Unit  Cost_per_Unit  Marketing_Spend
0      Laptop         150            1200            800            15000
1       Phone         300             800            500            25000
2      Tablet         200             600            400            18000
3       Watch         500             400            250            30000
4  Headphones         250             200            120            12000

💡 Enhanced Business Data with Calculations:
      Product  Total_Revenue  Gross_Profit  Net_Profit  Profit_Margin     ROI
0      Laptop         180000         60000       45000          33.33  300.00
1       Phone         240000         90000       65000          37.50  260.00
2      Tablet         120000         40000       22000          33.33  122.22
3       Watch         200000         75000       45000          37.50  150.00
4  Headphones          50000         20000        8000          40.00   66.67

8.2.2.2 Operations with Constants and Scalars

In pandas, you can perform arithmetic operations not only between columns, but also with constants (scalars).
The operation is applied element-wise across the entire column or DataFrame — thanks to vectorization.

# 🎯 Operations with constants (scalars)
print("Operations with Constants:")

# Apply 10% discount to all prices
business_df['Discounted_Price'] = business_df['Price_per_Unit'] * 0.9

# Add fixed shipping cost of $50 to all products
business_df['Price_with_Shipping'] = business_df['Price_per_Unit'] + 50

# Calculate tax (8% of revenue)
business_df['Tax_Amount'] = business_df['Total_Revenue'] * 0.08

# Square the units sold (for some statistical analysis)
business_df['Units_Squared'] = business_df['Units_Sold'] ** 2

print("Sample with scalar operations:")
sample_cols = ['Product', 'Price_per_Unit', 'Discounted_Price', 'Price_with_Shipping', 'Tax_Amount']
print(business_df[sample_cols].head(3).to_string(index=False))
Operations with Constants:
Sample with scalar operations:
Product  Price_per_Unit  Discounted_Price  Price_with_Shipping  Tax_Amount
 Laptop            1200            1080.0                 1250     14400.0
  Phone             800             720.0                  850     19200.0
 Tablet             600             540.0                  650      9600.0

8.2.2.3 Row-wise Operations (Across Columns)

By default, methods like .mean(), .max(), and .min() operate down each column (axis=0).
However, if you need to perform calculations across all columns in each row, set axis=1.

# 📊 Row-wise arithmetic operations (axis=1 for across columns)
print("🔄 Row-wise Operations:")

# Create a sample DataFrame with numerical data
sample_data = {
    'Q1_Sales': [100, 150, 200, 120, 180],
    'Q2_Sales': [110, 140, 220, 130, 190], 
    'Q3_Sales': [120, 160, 210, 140, 200],
    'Q4_Sales': [130, 170, 230, 150, 210]
}
quarterly_sales = pd.DataFrame(sample_data, 
                              index=['Product_A', 'Product_B', 'Product_C', 'Product_D', 'Product_E'])

print("📈 Quarterly Sales Data:")
print(quarterly_sales)

# Calculate row-wise statistics
quarterly_sales['Total_Sales'] = quarterly_sales.sum(axis=1)  # Sum across columns
quarterly_sales['Average_Sales'] = quarterly_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales', 'Q4_Sales']].mean(axis=1)
quarterly_sales['Max_Quarter'] = quarterly_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales', 'Q4_Sales']].max(axis=1)
quarterly_sales['Sales_Range'] = quarterly_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales', 'Q4_Sales']].max(axis=1) - quarterly_sales[['Q1_Sales', 'Q2_Sales', 'Q3_Sales', 'Q4_Sales']].min(axis=1)

print("\n📊 With Row-wise Calculations:")
print(quarterly_sales.round(2))
🔄 Row-wise Operations:
📈 Quarterly Sales Data:
           Q1_Sales  Q2_Sales  Q3_Sales  Q4_Sales
Product_A       100       110       120       130
Product_B       150       140       160       170
Product_C       200       220       210       230
Product_D       120       130       140       150
Product_E       180       190       200       210

📊 With Row-wise Calculations:
           Q1_Sales  Q2_Sales  Q3_Sales  Q4_Sales  Total_Sales  Average_Sales  \
Product_A       100       110       120       130          460          115.0   
Product_B       150       140       160       170          620          155.0   
Product_C       200       220       210       230          860          215.0   
Product_D       120       130       140       150          540          135.0   
Product_E       180       190       200       210          780          195.0   

           Max_Quarter  Sales_Range  
Product_A          130           30  
Product_B          170           30  
Product_C          230           30  
Product_D          150           30  
Product_E          210           30  

8.2.3 Arithmetic Operations Between DataFrames

Key Concept: When operating on two DataFrames, pandas automatically aligns data by index and column labels.

8.2.3.1 Alignment Behavior

Scenario             | Result                                               | Example
Exact match          | Normal operation                                     | df1 + df2
Missing index/column | Fills with NaN                                       | Automatic alignment
Different shapes     | Broadcasts when possible                             | Scalar operations
Using fill_value     | Missing input values are filled before the operation | df1.add(df2, fill_value=0)

💡 Pro Tip: Use explicit methods (add(), sub(), mul(), div()) for better control over missing data!

8.2.3.2 Basic DataFrame-to-DataFrame Operations

# 🏢 Create two related business DataFrames for comparison
print("🔀 DataFrame-to-DataFrame Operations:")

# 📊 Store A performance data
store_a_data = {
    'Electronics': [50000, 55000, 60000],
    'Clothing': [30000, 32000, 35000],
    'Books': [8000, 8500, 9000]
}
store_a = pd.DataFrame(store_a_data, index=['Q1', 'Q2', 'Q3'])

# 📊 Store B performance data  
store_b_data = {
    'Electronics': [45000, 50000, 58000],
    'Clothing': [28000, 30000, 33000],
    'Books': [7500, 8000, 8800]
}
store_b = pd.DataFrame(store_b_data, index=['Q1', 'Q2', 'Q3'])

print("🏪 Store A Sales:")
print(store_a)
print("\n🏪 Store B Sales:")
print(store_b)

# 📈 Basic arithmetic operations between DataFrames
print("\n💡 Arithmetic Operations Between Stores:")

# Total sales across both stores
total_sales = store_a + store_b
print("\n📊 Combined Sales (A + B):")
print(total_sales)

# Difference in performance
sales_diff = store_a - store_b  
print("\n📊 Sales Difference (A - B):")
print(sales_diff)

# Growth ratio (Store A relative to Store B)
growth_ratio = store_a / store_b
print("\n📊 Performance Ratio (A / B):")
print(growth_ratio.round(3))
🔀 DataFrame-to-DataFrame Operations:
🏪 Store A Sales:
    Electronics  Clothing  Books
Q1        50000     30000   8000
Q2        55000     32000   8500
Q3        60000     35000   9000

🏪 Store B Sales:
    Electronics  Clothing  Books
Q1        45000     28000   7500
Q2        50000     30000   8000
Q3        58000     33000   8800

💡 Arithmetic Operations Between Stores:

📊 Combined Sales (A + B):
    Electronics  Clothing  Books
Q1        95000     58000  15500
Q2       105000     62000  16500
Q3       118000     68000  17800

📊 Sales Difference (A - B):
    Electronics  Clothing  Books
Q1         5000      2000    500
Q2         5000      2000    500
Q3         2000      2000    200

📊 Performance Ratio (A / B):
    Electronics  Clothing  Books
Q1        1.111     1.071  1.067
Q2        1.100     1.067  1.062
Q3        1.034     1.061  1.023

8.2.3.3 Handling Misaligned Data

# 🚨 Demonstrate alignment issues and solutions
print("⚠️ Handling Misaligned DataFrames:")

# Create DataFrames with different indices and columns
df1_partial = pd.DataFrame({
    'A': [10, 20, 30],
    'B': [40, 50, 60]
}, index=[0, 1, 2])

df2_partial = pd.DataFrame({
    'B': [5, 10, 15, 20],  # Extra row
    'C': [100, 200, 300, 400]  # Different column
}, index=[1, 2, 3, 4])  # Different indices

print("📊 DataFrame 1:")
print(df1_partial)
print("\n📊 DataFrame 2:")  
print(df2_partial)

# 🔄 Default behavior - creates NaN for missing alignments
print("\n❌ Default Addition (creates NaN):")
result_nan = df1_partial + df2_partial
print(result_nan)

# 🔧 Using explicit methods with fill_value to handle NaN
print("\n✅ Addition with fill_value=0:")
result_filled = df1_partial.add(df2_partial, fill_value=0)
print(result_filled)

# 🎯 Other explicit arithmetic methods
print("\n🔧 Explicit Methods Available:")
print("• add() - Addition with fill_value option")
print("• sub() - Subtraction with fill_value option")  
print("• mul() - Multiplication with fill_value option")
print("• div() - Division with fill_value option")

# Example with subtraction
result_sub = df1_partial.sub(df2_partial, fill_value=1)
print("\n➖ Subtraction with fill_value=1:")
print(result_sub)
⚠️ Handling Misaligned DataFrames:
📊 DataFrame 1:
    A   B
0  10  40
1  20  50
2  30  60

📊 DataFrame 2:
    B    C
1   5  100
2  10  200
3  15  300
4  20  400

❌ Default Addition (creates NaN):
    A     B   C
0 NaN   NaN NaN
1 NaN  55.0 NaN
2 NaN  70.0 NaN
3 NaN   NaN NaN
4 NaN   NaN NaN

✅ Addition with fill_value=0:
      A     B      C
0  10.0  40.0    NaN
1  20.0  55.0  100.0
2  30.0  70.0  200.0
3   NaN  15.0  300.0
4   NaN  20.0  400.0

🔧 Explicit Methods Available:
• add() - Addition with fill_value option
• sub() - Subtraction with fill_value option
• mul() - Multiplication with fill_value option
• div() - Division with fill_value option

➖ Subtraction with fill_value=1:
      A     B      C
0   9.0  39.0    NaN
1  19.0  45.0  -99.0
2  29.0  50.0 -199.0
3   NaN -14.0 -299.0
4   NaN -19.0 -399.0

8.2.4 Advanced Arithmetic Techniques

8.2.4.1 Broadcasting and Shape Compatibility

# 📐 Broadcasting: Operations between different shapes
print("📐 Broadcasting Examples:")

# DataFrame with different dimensions
sales_matrix = pd.DataFrame({
    'Jan': [1000, 1500, 2000],
    'Feb': [1100, 1600, 2100], 
    'Mar': [1200, 1700, 2200]
}, index=['Store_A', 'Store_B', 'Store_C'])

# Series for operations
monthly_bonus = pd.Series([100, 150, 200], index=['Jan', 'Feb', 'Mar'])  # Column-wise
store_multiplier = pd.Series([1.1, 1.2, 1.05], index=['Store_A', 'Store_B', 'Store_C'])  # Row-wise

print("📊 Original Sales Matrix:")
print(sales_matrix)
print("\n💰 Monthly Bonus (Series):")
print(monthly_bonus)
print("\n📈 Store Multiplier (Series):")
print(store_multiplier)

# 🔄 Broadcasting operations
print("\n✨ Broadcasting Results:")

# Add monthly bonus to each column
sales_with_bonus = sales_matrix + monthly_bonus
print("📊 Sales + Monthly Bonus (broadcasting across columns):")
print(sales_with_bonus)

# Multiply each row by store multiplier
sales_adjusted = sales_matrix.mul(store_multiplier, axis=0)  # axis=0 for row-wise
print("\n📊 Sales × Store Multiplier (broadcasting across rows):")
print(sales_adjusted.round(0))
📐 Broadcasting Examples:
📊 Original Sales Matrix:
          Jan   Feb   Mar
Store_A  1000  1100  1200
Store_B  1500  1600  1700
Store_C  2000  2100  2200

💰 Monthly Bonus (Series):
Jan    100
Feb    150
Mar    200
dtype: int64

📈 Store Multiplier (Series):
Store_A    1.10
Store_B    1.20
Store_C    1.05
dtype: float64

✨ Broadcasting Results:
📊 Sales + Monthly Bonus (broadcasting across columns):
          Jan   Feb   Mar
Store_A  1100  1250  1400
Store_B  1600  1750  1900
Store_C  2100  2250  2400

📊 Sales × Store Multiplier (broadcasting across rows):
            Jan     Feb     Mar
Store_A  1100.0  1210.0  1320.0
Store_B  1800.0  1920.0  2040.0
Store_C  2100.0  2205.0  2310.0

8.2.5 Error Handling and Edge Cases

Common issues and solutions when working with arithmetic operations

# 🚨 Handling common arithmetic errors and edge cases
print("🛡️ Error Handling and Edge Cases:")

# Create data with potential issues
problematic_data = pd.DataFrame({
    'Values': [10, 0, -5, None, float('inf')],
    'Divisors': [2, 0, 1, 3, 2],
    'Mixed_Types': ['10', 20, '30', 40, '50']  # Mixed string/numeric
})

print("⚠️ Problematic Data:")
print(problematic_data)
print(f"Data types:\n{problematic_data.dtypes}")

# 🔧 Issue 1: Division by zero
print("\n❌ Division by Zero Issue:")
try:
    result = problematic_data['Values'] / problematic_data['Divisors']
    print("Result with division by zero:")
    print(result)
except Exception as e:
    print(f"Error: {e}")

# ✅ Solution: Handle division by zero
print("\n✅ Safe Division:")
safe_division = problematic_data['Values'].div(problematic_data['Divisors']).replace([float('inf'), float('-inf')], 0)
print(safe_division)

# 🔧 Issue 2: Mixed data types
print("\n❌ Mixed Data Types Issue:")
try:
    # Convert string column to numeric first
    mixed_numeric = pd.to_numeric(problematic_data['Mixed_Types'], errors='coerce')
    print("Converted mixed types:")
    print(mixed_numeric)
    
    # Now arithmetic works
    result = mixed_numeric + 5
    print("After adding 5:")
    print(result)
except Exception as e:
    print(f"Error: {e}")

# 🔧 Issue 3: NaN and infinity handling
print("\n🔧 NaN and Infinity Handling:")

# Create data with NaN and inf
test_data = pd.Series([1, 2, None, float('inf'), float('-inf'), 5])
print("Original data with NaN and infinity:")
print(test_data)

# Fill NaN with specific values
filled_data = test_data.fillna(0)
print("\nAfter filling NaN with 0:")
print(filled_data)

# Replace infinity with finite values
clean_data = filled_data.replace([float('inf'), float('-inf')], [999999, -999999])
print("\nAfter replacing infinity:")
print(clean_data)

# 🎯 Best practices summary
print("\n💡 Best Practices for Robust Arithmetic:")
print("1. Check data types before operations: df.dtypes")
print("2. Handle NaN values: fillna(), dropna(), or errors='coerce'")
print("3. Check for infinity: df.replace([inf, -inf], value)")
print("4. Use explicit methods for better control: add(), sub(), mul(), div()")
print("5. Validate results: df.isna().sum(), df.isinf().sum() if hasattr(df, 'isinf')")
🛡️ Error Handling and Edge Cases:
⚠️ Problematic Data:
   Values  Divisors Mixed_Types
0    10.0         2          10
1     0.0         0          20
2    -5.0         1          30
3     NaN         3          40
4     inf         2          50
Data types:
Values         float64
Divisors         int64
Mixed_Types     object
dtype: object

❌ Division by Zero Issue:
Result with division by zero:
0    5.0
1    NaN
2   -5.0
3    NaN
4    inf
dtype: float64

✅ Safe Division:
0    5.0
1    NaN
2   -5.0
3    NaN
4    0.0
dtype: float64

❌ Mixed Data Types Issue:
Converted mixed types:
0    10
1    20
2    30
3    40
4    50
Name: Mixed_Types, dtype: int64
After adding 5:
0    15
1    25
2    35
3    45
4    55
Name: Mixed_Types, dtype: int64

🔧 NaN and Infinity Handling:
Original data with NaN and infinity:
0    1.0
1    2.0
2    NaN
3    inf
4   -inf
5    5.0
dtype: float64

After filling NaN with 0:
0    1.0
1    2.0
2    0.0
3    inf
4   -inf
5    5.0
dtype: float64

After replacing infinity:
0         1.0
1         2.0
2         0.0
3    999999.0
4   -999999.0
5         5.0
dtype: float64

💡 Best Practices for Robust Arithmetic:
1. Check data types before operations: df.dtypes
2. Handle NaN values: fillna(), dropna(), or errors='coerce'
3. Check for infinity: df.replace([np.inf, -np.inf], value)
4. Use explicit methods for better control: add(), sub(), mul(), div()
5. Validate results: df.isna().sum(), np.isinf(df).sum()

8.2.6 Performance and Method Comparison

Operation Type | Operator   | Explicit Method            | Fill Value | Performance | Use Case
Addition       | df1 + df2  | df1.add(df2, fill_value=0) | ✅ Yes     | ⚡ Fast     | Missing data control
Subtraction    | df1 - df2  | df1.sub(df2, fill_value=0) | ✅ Yes     | ⚡ Fast     | Missing data control
Multiplication | df1 * df2  | df1.mul(df2, fill_value=1) | ✅ Yes     | ⚡ Fast     | Broadcasting control
Division       | df1 / df2  | df1.div(df2, fill_value=1) | ✅ Yes     | ⚡ Fast     | Zero division safety
Power          | df1 ** 2   | df1.pow(2)                 | ✅ Yes     | 🔄 Medium   | Exponential operations
Floor Division | df1 // df2 | df1.floordiv(df2)          | ✅ Yes     | 🔄 Medium   | Integer division
Modulo         | df1 % df2  | df1.mod(df2)               | ✅ Yes     | 🔄 Medium   | Remainder operations
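
As an optional sanity check (not part of the original walkthrough), the remaining flexible methods can be tried on df1_partial and df2_partial from Section 8.2.3.3; like add() and sub(), they also accept fill_value.

print(df1_partial.pow(2))                               # same result as df1_partial ** 2
print(df1_partial.floordiv(df2_partial, fill_value=1))  # missing entries treated as 1 before dividing
print(df1_partial.mod(df2_partial, fill_value=1))       # missing entries treated as 1 before taking the remainder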

8.2.7 Are Arithmetic Operations Vectorized?

Yes — arithmetic operations in pandas are vectorized because they rely on NumPy under the hood.
This means operations are applied element-wise across entire columns or DataFrames at once, instead of looping through rows in Python.

Benefits of Vectorization:
- Much faster than Python loops (implemented in optimized C code)
- Cleaner and more concise syntax
- Handles automatic alignment on indexes and columns
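
As a rough, machine-dependent sanity check, the sketch below times a vectorized column subtraction against an equivalent Python-level loop on a made-up demo DataFrame; the exact numbers will differ on your machine, but the vectorized version should win comfortably.

import time
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'revenue': np.random.rand(200_000) * 1000,
    'cost': np.random.rand(200_000) * 800
})

start = time.time()
demo['profit_vectorized'] = demo['revenue'] - demo['cost']   # whole columns at once
vectorized_time = time.time() - start

start = time.time()
demo['profit_loop'] = [r - c for r, c in zip(demo['revenue'], demo['cost'])]  # element-by-element Python loop
loop_time = time.time() - start

print(f"Vectorized: {vectorized_time:.4f}s | Loop: {loop_time:.4f}s")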

8.3 Correlation Operations: Understanding Relationships Between Variables

8.3.1 What is Correlation?

Correlation measures the strength and direction of the linear relationship between two numerical variables. In business and data science, correlation analysis helps you:

  • Discover patterns in your data
  • Identify relationships between metrics
  • Make predictions based on variable relationships
  • Generate insights for decision-making

Key Concepts:

  • Correlation ≠ Causation: High correlation doesn’t mean one variable causes the other
  • Range: Correlation values range from -1 to +1
  • Interpretation: Closer to ±1 = stronger relationship, closer to 0 = weaker relationship
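
To make the definition concrete, here is a small sketch with made-up numbers (the x and y Series are invented for this example) showing that Pearson correlation is simply the covariance of two variables scaled by their standard deviations; pandas' .corr() returns the same value.

import pandas as pd

x = pd.Series([1, 2, 3, 4, 5], dtype=float)
y = pd.Series([2, 4, 5, 4, 6], dtype=float)

# Pearson r = cov(x, y) / (std(x) * std(y))
manual_r = x.cov(y) / (x.std() * y.std())
print(round(manual_r, 4))    # computed from the formula
print(round(x.corr(y), 4))   # pandas' built-in Pearson correlation gives the same value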

8.3.2 Correlation Coefficient Interpretation

Range        | Strength             | Direction      | Business Example
+0.8 to +1.0 | Very Strong Positive | As X ↗️, Y ↗️  | More ads → More sales
+0.6 to +0.8 | Strong Positive      | As X ↗️, Y ↗️  | Higher price → Higher quality perception
+0.3 to +0.6 | Moderate Positive    | As X ↗️, Y ↗️  | Customer age → Spending
-0.1 to +0.3 | Weak/No Relationship | Random pattern | Product color → Sales
-0.3 to -0.6 | Moderate Negative    | As X ↗️, Y ↘️  | Discounts → Profit margin
-0.6 to -0.8 | Strong Negative      | As X ↗️, Y ↘️  | Product defects → Customer satisfaction
-0.8 to -1.0 | Very Strong Negative | As X ↗️, Y ↘️  | Price increase → Demand

8.3.3 Correlation Types

Pandas provides multiple correlation methods that you can specify with
df.corr(method=...). By default, it uses Pearson, but it’s important to know the alternatives and when to use them:

Method   | Best For                             | Advantages                       | Limitations
Pearson  | Linear relationships                 | Simple, standard, interpretable  | Very sensitive to outliers
Spearman | Monotonic (rank-based) relationships | Handles non-linearity well       | Loses information about magnitude
Kendall  | Robust rank correlation              | Highly resistant to outliers     | Slower, computationally intensive
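
To see why the choice matters, the short sketch below uses toy data with one deliberate outlier (the demo DataFrame is invented for this example): Pearson is pulled well below 1 by the outlier, while the rank-based methods still report a perfect monotonic relationship.

import pandas as pd

demo = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],   # one extreme outlier
    'y': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
})

print(round(demo['x'].corr(demo['y'], method='pearson'), 3))   # noticeably below 1 because of the outlier
print(round(demo['x'].corr(demo['y'], method='spearman'), 3))  # 1.0, since the ranks are perfectly aligned
print(round(demo['x'].corr(demo['y'], method='kendall'), 3))   # 1.0, also rank-based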

8.3.3.1 Decision Guide: Choosing the Right Correlation

# Simple decision guide for choosing a correlation method
def choose_correlation_method(data_has_outliers, is_non_linear_monotonic,
                              small_sample_size, need_extra_robustness):
    if data_has_outliers or is_non_linear_monotonic:
        return "spearman"   # Rank-based, robust to outliers
    elif small_sample_size or need_extra_robustness:
        return "kendall"    # Most robust, but slower
    else:
        return "pearson"    # Standard choice for linear correlation

print(choose_correlation_method(data_has_outliers=True, is_non_linear_monotonic=False,
                                small_sample_size=False, need_extra_robustness=False))  # -> spearman

💡 Note for this chapter: We will focus on Pearson correlation, since it is the default in pandas and most commonly used in practice.

8.3.4 Pandas Correlation Methods

Pandas offers several methods for calculating correlations:

Method                | Purpose                        | Use Case
df.corr()             | Full correlation matrix        | Explore all relationships
df['A'].corr(df['B']) | Pairwise correlation           | Specific relationship
df.corrwith(series)   | Correlate with external series | Compare with benchmark

💡 In this chapter, we’ll focus on the first two methods (df.corr() and pairwise .corr()), and skip corrwith() for simplicity.

import numpy as np
import pandas as pd

# 📊 Create sample data for correlation demonstration
print("🔍 CORRELATION OPERATIONS DEMONSTRATION")
print("="*50)

# Create sample business dataset
np.random.seed(42)  # For reproducible results
n_samples = 100

# Create correlated business metrics
sales_data = pd.DataFrame({
    'Advertising_Spend': np.random.normal(1000, 200, n_samples),
    'Website_Traffic': np.random.normal(5000, 1000, n_samples),
    'Customer_Satisfaction': np.random.normal(4.0, 0.5, n_samples),
    'Product_Price': np.random.normal(50, 10, n_samples)
})

# Add some realistic correlations
sales_data['Sales_Revenue'] = (
    0.8 * sales_data['Advertising_Spend'] + 
    0.3 * sales_data['Website_Traffic'] + 
    200 * sales_data['Customer_Satisfaction'] + 
    np.random.normal(0, 500, n_samples)
)

sales_data['Conversion_Rate'] = (
    0.02 + 0.01 * sales_data['Customer_Satisfaction'] - 
    0.0002 * sales_data['Product_Price'] +
    np.random.normal(0, 0.005, n_samples)
)

# Ensure realistic ranges
sales_data['Customer_Satisfaction'] = sales_data['Customer_Satisfaction'].clip(1, 5)
sales_data['Conversion_Rate'] = sales_data['Conversion_Rate'].clip(0.001, 0.1)

print(f"\nDataset shape: {sales_data.shape}")
print("📋 Sample Business Dataset:")
sales_data.head()
🔍 CORRELATION OPERATIONS DEMONSTRATION
==================================================

Dataset shape: (100, 6)
📋 Sample Business Dataset:
   Advertising_Spend  Website_Traffic  Customer_Satisfaction  Product_Price  Sales_Revenue  Conversion_Rate
0        1099.342831      3584.629258               4.178894      41.710050    1993.427949         0.058078
1         972.347140      4579.354677               4.280392      44.398190    2708.075056         0.063471
2        1129.537708      4657.285483               4.541526      57.472936    3211.742785         0.046928
3        1304.605971      4197.722731               4.526901      56.103703    3231.872098         0.056863
4         953.169325      4838.714288               3.311165      49.790984    2651.350074         0.039900

8.3.4.1 Method 1: Full Correlation Matrix

The correlation matrix shows relationships between ALL numerical variables at once:

# 🔍 METHOD 1: FULL CORRELATION MATRIX
print("📊 CORRELATION MATRIX ANALYSIS")
print("="*40)

# Calculate correlation matrix using Pearson (default method)
print("🔢 PEARSON CORRELATION MATRIX")
print("-"*30)
correlation_matrix = sales_data.corr()  # Pearson is default
print(correlation_matrix.round(3))

# Business interpretation
print("\n💼 BUSINESS INTERPRETATION:")
print("-"*25)

# Find strongest correlations
correlation_pairs = []
for i, col1 in enumerate(correlation_matrix.columns):
    for j, col2 in enumerate(correlation_matrix.columns):
        if i < j:  # Avoid duplicates and self-correlation
            corr_val = correlation_matrix.loc[col1, col2]
            correlation_pairs.append((col1, col2, corr_val))

# Sort by absolute correlation strength
correlation_pairs.sort(key=lambda x: abs(x[2]), reverse=True)

print("🎯 Top 3 Strongest Relationships:")
for i, (col1, col2, corr_val) in enumerate(correlation_pairs[:3]):
    strength = "Strong" if abs(corr_val) > 0.7 else "Moderate" if abs(corr_val) > 0.3 else "Weak"
    direction = "Positive" if corr_val > 0 else "Negative"
    print(f"{i+1}. {col1} ↔️ {col2}")
    print(f"   Correlation: {corr_val:.3f} ({strength} {direction})")
    
    if corr_val > 0:
        print(f"   📈 As {col1} increases, {col2} tends to increase")
    else:
        print(f"   📉 As {col1} increases, {col2} tends to decrease")
    print()
📊 CORRELATION MATRIX ANALYSIS
========================================
🔢 PEARSON CORRELATION MATRIX
------------------------------
                       Advertising_Spend  Website_Traffic  \
Advertising_Spend                  1.000           -0.136   
Website_Traffic                   -0.136            1.000   
Customer_Satisfaction              0.185           -0.037   
Product_Price                     -0.170           -0.018   
Sales_Revenue                      0.085            0.565   
Conversion_Rate                    0.094           -0.114   

                       Customer_Satisfaction  Product_Price  Sales_Revenue  \
Advertising_Spend                      0.185         -0.170          0.085   
Website_Traffic                       -0.037         -0.018          0.565   
Customer_Satisfaction                  1.000          0.005          0.085   
Product_Price                          0.005          1.000          0.131   
Sales_Revenue                          0.085          0.131          1.000   
Conversion_Rate                        0.696         -0.214         -0.079   

                       Conversion_Rate  
Advertising_Spend                0.094  
Website_Traffic                 -0.114  
Customer_Satisfaction            0.696  
Product_Price                   -0.214  
Sales_Revenue                   -0.079  
Conversion_Rate                  1.000  

💼 BUSINESS INTERPRETATION:
-------------------------
🎯 Top 3 Strongest Relationships:
1. Customer_Satisfaction ↔️ Conversion_Rate
   Correlation: 0.696 (Moderate Positive)
   📈 As Customer_Satisfaction increases, Conversion_Rate tends to increase

2. Website_Traffic ↔️ Sales_Revenue
   Correlation: 0.565 (Moderate Positive)
   📈 As Website_Traffic increases, Sales_Revenue tends to increase

3. Product_Price ↔️ Conversion_Rate
   Correlation: -0.214 (Weak Negative)
   📉 As Product_Price increases, Conversion_Rate tends to decrease

8.3.4.2 Method 2: Pairwise Correlation

For analyzing specific relationships between two variables:

#  METHOD 2: PAIRWISE CORRELATION ANALYSIS
print("🔍 PAIRWISE CORRELATION ANALYSIS")
print("="*50)

# Select two interesting columns for detailed analysis
col1, col2 = 'Advertising_Spend', 'Sales_Revenue'

print(f"🎬 Analyzing relationship between '{col1}' and '{col2}'")
print("-"*60)

# Calculate Pearson correlation for the pair (default method)
correlation_value = sales_data[col1].corr(sales_data[col2])

print(f"📊 Correlation coefficient: {correlation_value:.4f}")

# Detailed interpretation
print(f"\n💡 INTERPRETATION:")
print("-"*20)

def interpret_correlation(corr_val):
    abs_corr = abs(corr_val)
    if abs_corr >= 0.8:
        strength = "Very Strong"
    elif abs_corr >= 0.6:
        strength = "Strong"  
    elif abs_corr >= 0.3:
        strength = "Moderate"
    elif abs_corr >= 0.1:
        strength = "Weak"
    else:
        strength = "Very Weak/None"
        
    direction = "Positive" if corr_val > 0 else "Negative"
    return f"{strength} {direction}"

print(f"🔢 Relationship strength: {interpret_correlation(correlation_value)}")

# Business context
print(f"\n🎯 BUSINESS CONTEXT:")
print("-"*20)
if abs(correlation_value) > 0.5:
    print("✅ Strong relationship detected!")
    print(f"   • This suggests {col1} and {col2} move together")
    print("   • Could be useful for predictions or business decisions")
    if correlation_value > 0:
        print(f"   • Higher {col1} is associated with higher {col2}")
    else:
        print(f"   • Higher {col1} is associated with lower {col2}")
else:
    print("⚠️  Moderate/Weak relationship detected")
    print(f"   • {col1} and {col2} don't have a strong linear relationship")
    print("   • Consider other factors or non-linear relationships")

# Demonstrate different variable pairs
print(f"\n🔍 OTHER INTERESTING PAIRS:")
print("-"*25)
pairs_to_check = [
    ('Customer_Satisfaction', 'Sales_Revenue'),
    ('Product_Price', 'Conversion_Rate'),
    ('Website_Traffic', 'Sales_Revenue')
]

for pair_col1, pair_col2 in pairs_to_check:
    pair_corr = sales_data[pair_col1].corr(sales_data[pair_col2])
    print(f"{pair_col1} ↔️ {pair_col2}: {pair_corr:.3f}")
🔍 PAIRWISE CORRELATION ANALYSIS
==================================================
🎬 Analyzing relationship between 'Advertising_Spend' and 'Sales_Revenue'
------------------------------------------------------------
📊 Correlation coefficient: 0.0847

💡 INTERPRETATION:
--------------------
🔢 Relationship strength: Very Weak/None Positive

🎯 BUSINESS CONTEXT:
--------------------
⚠️  Moderate/Weak relationship detected
   • Advertising_Spend and Sales_Revenue don't have a strong linear relationship
   • Consider other factors or non-linear relationships

🔍 OTHER INTERESTING PAIRS:
-------------------------
Customer_Satisfaction ↔️ Sales_Revenue: 0.085
Product_Price ↔️ Conversion_Rate: -0.214
Website_Traffic ↔️ Sales_Revenue: 0.565

8.3.4.3 Advanced Correlation Analysis: Finding Strong Correlations Programmatically

# 🔍 ADVANCED CORRELATION ANALYSIS
print(" ADVANCED CORRELATION TECHNIQUES")
print("="*50)

# 1. Find strongest correlations programmatically
print(" 1. FINDING STRONGEST CORRELATIONS")
print("-"*35)

correlation_matrix = sales_data.corr()  # Uses Pearson correlation by default

# Create a function to find strong correlations
def find_strong_correlations(corr_matrix, threshold=0.5):
    """Find correlations above threshold, excluding diagonal"""
    strong_corrs = []
    
    for i in range(len(corr_matrix.columns)):
        for j in range(i+1, len(corr_matrix.columns)):  # Avoid duplicates and diagonal
            col1 = corr_matrix.columns[i]
            col2 = corr_matrix.columns[j] 
            corr_val = corr_matrix.iloc[i, j]
            
            if pd.notna(corr_val) and abs(corr_val) >= threshold:
                strong_corrs.append({
                    'Variable1': col1,
                    'Variable2': col2,
                    'Correlation': corr_val,
                    'Strength': abs(corr_val)
                })
    
    return pd.DataFrame(strong_corrs).sort_values('Strength', ascending=False)

strong_correlations = find_strong_correlations(correlation_matrix, threshold=0.3)

if len(strong_correlations) > 0:
    print(f"📊 Found {len(strong_correlations)} correlations above 0.3 threshold:")
    print(strong_correlations.round(3))
    
    print(f"\n🎯 TOP CORRELATION:")
    if len(strong_correlations) > 0:
        top_corr = strong_correlations.iloc[0]
        print(f"   {top_corr['Variable1']} ↔️ {top_corr['Variable2']}")
        print(f"   Strength: {top_corr['Correlation']:.3f}")
        
        if top_corr['Correlation'] > 0:
            print("   📈 Positive relationship: both increase/decrease together")
        else:
            print("   📉 Negative relationship: one increases as other decreases")
else:
    print("📊 No strong correlations found above 0.3 threshold")
    print("💡 Try lowering threshold or check data quality")

# 2. Correlation matrix summary
print(f"\n🌡️ 2. CORRELATION MATRIX SUMMARY")
print("-"*30)

if len(correlation_matrix) > 1:
    # Summary statistics of correlations
    corr_values = correlation_matrix.values
    # Get upper triangle (excluding diagonal)
    mask = np.triu(np.ones_like(corr_values, dtype=bool), k=1)
    upper_triangle = corr_values[mask]
    
    print(f"📊 Matrix size: {correlation_matrix.shape[0]} x {correlation_matrix.shape[1]}")
    print(f"📈 Highest correlation: {np.nanmax(upper_triangle):.3f}")
    print(f"📉 Lowest correlation: {np.nanmin(upper_triangle):.3f}")
    print(f"📊 Average correlation: {np.nanmean(upper_triangle):.3f}")
    
    # Count correlations by strength
    strong_positive = np.sum(upper_triangle >= 0.5)
    strong_negative = np.sum(upper_triangle <= -0.5)
    moderate = np.sum((upper_triangle >= 0.3) & (upper_triangle < 0.5)) + np.sum((upper_triangle <= -0.3) & (upper_triangle > -0.5))
    weak = np.sum((upper_triangle > -0.3) & (upper_triangle < 0.3))
    
    print(f"\n📊 CORRELATION STRENGTH DISTRIBUTION:")
    print(f"   🔥 Strong (|r| ≥ 0.5):    {strong_positive + strong_negative}")
    print(f"   📊 Moderate (0.3 ≤ |r| < 0.5): {moderate}")
    print(f"   ⚪ Weak (|r| < 0.3):       {weak}")

print(f"\n💡 PRO TIPS FOR CORRELATION ANALYSIS:")
print("-"*35)
print("✅ Always visualize correlations with scatter plots")
print("✅ Check for non-linear relationships")
print("✅ Consider domain knowledge when interpreting")
print("✅ Remember: correlation ≠ causation")
print("✅ Look for outliers that might affect correlations")
print("✅ Use different correlation methods for robustness")
 ADVANCED CORRELATION TECHNIQUES
==================================================
 1. FINDING STRONGEST CORRELATIONS
-----------------------------------
📊 Found 2 correlations above 0.3 threshold:
               Variable1        Variable2  Correlation  Strength
1  Customer_Satisfaction  Conversion_Rate        0.696     0.696
0        Website_Traffic    Sales_Revenue        0.565     0.565

🎯 TOP CORRELATION:
   Customer_Satisfaction ↔️ Conversion_Rate
   Strength: 0.696
   📈 Positive relationship: both increase/decrease together

🌡️ 2. CORRELATION MATRIX SUMMARY
------------------------------
📊 Matrix size: 6 x 6
📈 Highest correlation: 0.696
📉 Lowest correlation: -0.214
📊 Average correlation: 0.072

📊 CORRELATION STRENGTH DISTRIBUTION:
   🔥 Strong (|r| ≥ 0.5):    2
   📊 Moderate (0.3 ≤ |r| < 0.5): 0
   ⚪ Weak (|r| < 0.3):       13

💡 PRO TIPS FOR CORRELATION ANALYSIS:
-----------------------------------
✅ Always visualize correlations with scatter plots
✅ Check for non-linear relationships
✅ Consider domain knowledge when interpreting
✅ Remember: correlation ≠ causation
✅ Look for outliers that might affect correlations
✅ Use different correlation methods for robustness

8.3.5 Summary: Correlation Operations

Method                | Best For                      | Advantages          | Limitations
df.corr()             | Overview of all relationships | Complete picture    | Can be overwhelming
df['A'].corr(df['B']) | Specific pair analysis        | Focused, detailed   | Limited scope
df.corrwith(series)   | Benchmark comparisons         | External validation | Requires external data

Key Takeaways:

  • Correlation reveals relationships but not causation
  • Always visualize data alongside correlation coefficients
  • Use multiple correlation methods for robust analysis
  • Consider business context when interpreting results
  • Strong correlations (|r| > 0.7) suggest predictive potential

8.3.6 Are Correlation Operations Vectorized?

Yes! Correlation operations in pandas (and NumPy under the hood) are vectorized:

  • No explicit loops: Correlations are computed using optimized NumPy routines in C, avoiding slow Python row-by-row iteration.
  • Matrix-based computation: df.corr() computes the covariance matrix of all numeric columns at once, then scales it into a correlation matrix.
  • Pairwise efficiency: Even df['A'].corr(df['B']) leverages array-level operations on entire columns, not element-by-element loops.
  • Scalability: Vectorization makes correlation fast and efficient for large datasets, with memory (not speed) usually being the main limitation.

💡 This is why pandas can compute full n × n correlation matrices quickly, even for wide datasets.
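
As a small sanity check of the "covariance matrix scaled into a correlation matrix" point above, the sketch below reuses sales_data from this section and rebuilds the correlation matrix from df.cov() and the column standard deviations (std_devs and manual_corr are names introduced just for this illustration).

import numpy as np

std_devs = sales_data.std()
manual_corr = sales_data.cov().div(std_devs, axis=0).div(std_devs, axis=1)  # cov_ij / (std_i * std_j)
print(np.allclose(manual_corr, sales_data.corr()))  # True: matches the built-in Pearson correlation matrix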

8.4 Advanced Operations: Function Application & Data Transformation

Pandas provides powerful methods for applying functions to DataFrame rows, columns, or individual elements. These methods are the cornerstone of advanced data manipulation, enabling complex transformations that would be cumbersome with basic operations.

8.4.1 apply(): The Swiss Army Knife of Data Transformation

The apply() function is one of pandas’ most versatile tools. It can apply any custom or built-in function across DataFrame axes or to elements in a Series, making it ideal for flexible transformations.

8.4.1.1 Core Concepts

  • Row-wise operations (axis=1) → the function is applied to each row (the row is passed as a Series)
  • Column-wise operations (axis=0) → the function is applied to each column (the column is passed as a Series)
  • Element-wise operations → the function is applied to each individual value in a Series

8.4.1.2 Syntax Patterns

# DataFrame apply() - row-wise or column-wise
df.apply(function, axis=1)        # Row-wise (rows as Series)
df.apply(function, axis=0)        # Column-wise (columns as Series)

# Series apply() - element-wise
df['column'].apply(function)      # Each element passed individually

# With additional arguments
df.apply(function, axis=1, args=(arg1, arg2))

8.4.1.3 Real-World Example: Financial Analysis

# 📈 Create a comprehensive financial dataset

# Generate realistic financial data
np.random.seed(42)
n_companies = 100

financial_data = {
    'Company': [f'Company_{i:03d}' for i in range(1, n_companies + 1)],
    'Revenue': np.random.normal(50000, 15000, n_companies).round(0),
    'Expenses': np.random.normal(35000, 10000, n_companies).round(0),
    'Employees': np.random.randint(10, 500, n_companies),
    'Market_Cap': np.random.lognormal(15, 1, n_companies).round(0),
    'Industry': np.random.choice(['Tech', 'Healthcare', 'Finance', 'Retail', 'Manufacturing'], n_companies),
    'Founded_Year': np.random.randint(1990, 2020, n_companies),
    'Country': np.random.choice(['USA', 'Canada', 'UK', 'Germany', 'Japan'], n_companies)
}

df_financial = pd.DataFrame(financial_data)

# Ensure expenses don't exceed revenue (business logic)
def cap_expenses(row):
    """
    Ensure expenses don't exceed 95% of revenue
    This prevents companies from having expenses that are too close to or exceed their revenue
    """
    # Calculate the maximum allowed expenses (95% of revenue)
    max_allowed_expenses = row['Revenue'] * 0.95
    
    # Return the smaller value: either current expenses or the maximum allowed
    if row['Expenses'] > max_allowed_expenses:
        return max_allowed_expenses
    else:
        return row['Expenses']

# Apply the function to each row
df_financial['Expenses'] = df_financial.apply(cap_expenses, axis=1)

# Calculate profit and profit margin for each company
def calculate_profit(row):
    return row['Revenue'] - row['Expenses']

def calculate_profit_margin(row):
    profit = row['Revenue'] - row['Expenses']
    return round((profit / row['Revenue']) * 100, 2) if row['Revenue'] > 0 else 0

# Apply functions to create new columns
df_financial['Profit'] = df_financial.apply(calculate_profit, axis=1)
df_financial['Profit_Margin'] = df_financial.apply(calculate_profit_margin, axis=1)


print(f"\nDataset shape: {df_financial.shape}")
print(f"Industries: {df_financial['Industry'].unique()}")

print("📊 Financial Dataset Sample:")
df_financial.head(10)

Dataset shape: (100, 10)
Industries: ['Retail' 'Manufacturing' 'Tech' 'Finance' 'Healthcare']
📊 Financial Dataset Sample:
       Company  Revenue  Expenses  Employees  Market_Cap       Industry  Founded_Year  Country   Profit  Profit_Margin
0  Company_001  57451.0   20846.0        395   2340214.0         Retail          2010      USA  36605.0          63.72
1  Company_002  47926.0   30794.0        483   2378910.0  Manufacturing          1994       UK  17132.0          35.75
2  Company_003  59715.0   31573.0        282   4491813.0           Tech          1999  Germany  28142.0          47.13
3  Company_004  72845.0   26977.0        113    651982.0           Tech          1999  Germany  45868.0          62.97
4  Company_005  46488.0   33387.0        426   8413782.0        Finance          2008      USA  13101.0          28.18
5  Company_006  46488.0   39041.0        402  11598738.0  Manufacturing          2006       UK   7447.0          16.02
6  Company_007  73688.0   53862.0        308   4767327.0         Retail          2010       UK  19826.0          26.91
7  Company_008  61512.0   36746.0        255  17044719.0           Tech          2003       UK  24766.0          40.26
8  Company_009  42958.0   37576.0        185    867324.0         Retail          1998    Japan   5382.0          12.53
9  Company_010  58138.0   34256.0         48  20660119.0           Tech          2003   Canada  23882.0          41.08

8.4.1.4 How apply() Actually Works

apply() is essentially a Python-level loop that iterates over each row or column and applies your function one-by-one.
It’s much slower than true vectorized operations (NumPy ufuncs, pandas built-ins).

✅ Prefer vectorized operations whenever possible.
🔁 Use apply() only when logic can’t be expressed with vectorized methods.
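
For the financial example above, the same results can be computed without apply(). The sketch below is one possible vectorized alternative (reusing df_financial from Section 8.4.1.3), not the only way to write it.

import numpy as np

# Cap expenses at 95% of revenue, element-wise, with no Python-level loop
df_financial['Expenses'] = np.minimum(df_financial['Expenses'], df_financial['Revenue'] * 0.95)

# Profit and profit margin as plain column arithmetic
df_financial['Profit'] = df_financial['Revenue'] - df_financial['Expenses']
df_financial['Profit_Margin'] = np.where(
    df_financial['Revenue'] > 0,
    (df_financial['Profit'] / df_financial['Revenue'] * 100).round(2),
    0
)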

8.4.2 map(): Efficient Element-wise Transformations (on a Series)

Series.map() applies a transformation element by element and is often faster/simpler than apply() for 1-to-1 lookups or simple functions.

8.4.2.1 ✅ When to use

  • Value mapping / recoding (e.g., codes → labels)
  • Category encoding (labels → numbers)
  • Simple per-value transforms (string cleanup, parsing, normalization)
  • Lookup from a codebook stored as a dict or another Series

8.4.2.2 Three Ways to Use map()

# Sample data
df = pd.DataFrame({
    'Industry': ['Tech', 'Healthcare', 'Finance', 'Tech', 'Retail'],
    'Size': ['Small', 'Large', 'Medium', 'Small', 'Large'],
    'Revenue': [100000, 150000, 80000, 120000, 90000]
})

# 1. **Dictionary Mapping** - Most Common
industry_codes = {'Tech': 1, 'Healthcare': 2, 'Finance': 3, 'Retail': 4}
df['Industry_Code'] = df['Industry'].map(industry_codes)

# 2. **Function Mapping**
def add_prefix(industry):
    return f"IND_{industry}"

df['Industry_Prefixed'] = df['Industry'].map(add_prefix)

# 3. **Series Mapping**
size_priority = pd.Series({'Small': 3, 'Medium': 2, 'Large': 1})
df['Size_Priority'] = df['Size'].map(size_priority)

print(df)
     Industry    Size  Revenue  Industry_Code Industry_Prefixed  Size_Priority
0        Tech   Small   100000              1          IND_Tech              3
1  Healthcare   Large   150000              2    IND_Healthcare              1
2     Finance  Medium    80000              3       IND_Finance              2
3        Tech   Small   120000              1          IND_Tech              3
4      Retail   Large    90000              4        IND_Retail              1

8.4.2.3 How map() Actually Works

map() iterates through each element in the Series and applies the mapping function/dictionary lookup individually.

# 🚀 Practical Performance Comparison: Real-World Examples

# Create a realistic dataset for performance testing
np.random.seed(42)
n_rows = 50000

performance_data = pd.DataFrame({
    'customer_id': range(1, n_rows + 1),
    'purchase_amount': np.random.uniform(10, 1000, n_rows),
    'discount_pct': np.random.choice([0, 5, 10, 15, 20], n_rows),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_rows),
    'rating': np.random.randint(1, 6, n_rows),
    'is_premium': np.random.choice([True, False], n_rows)
})

print("📊 Performance Test Dataset:")
print(f"Dataset shape: {performance_data.shape}")
print(performance_data.head())

# Define category mapping for testing
category_mapping = {'A': 'Electronics', 'B': 'Clothing', 'C': 'Books', 'D': 'Sports'}

print(f"\n🏁 Performance Test Results ({n_rows:,} rows):")
print("-" * 60)
📊 Performance Test Dataset:
Dataset shape: (50000, 6)
   customer_id  purchase_amount  discount_pct category  rating  is_premium
0            1       380.794718             0        C       1        True
1            2       951.207163            15        D       1       False
2            3       734.674002             5        D       1        True
3            4       602.671899             5        A       4        True
4            5       164.458454             0        D       3        True

🏁 Performance Test Results (50,000 rows):
------------------------------------------------------------
# 📊 Test 1: Simple Mathematical Operations
print("🧮 Test 1: Calculate final price after discount")

# Method 1: VECTORIZED (Fastest)
start_time = time.time()
performance_data['final_price_vectorized'] = performance_data['purchase_amount'] * (1 - performance_data['discount_pct'] / 100)
vectorized_time = time.time() - start_time

# Method 2: APPLY with function (Slower)
def calculate_final_price(row):
    return row['purchase_amount'] * (1 - row['discount_pct'] / 100)

start_time = time.time()
performance_data['final_price_apply'] = performance_data.apply(calculate_final_price, axis=1)
apply_time = time.time() - start_time

print(f"✅ Vectorized:    {vectorized_time:.4f}s (1.0x baseline)")
print(f"⚙️  Apply function: {apply_time:.4f}s ({apply_time/vectorized_time:.1f}x slower)")

# Verify results are identical
results_match = (performance_data['final_price_vectorized'].round(2) == performance_data['final_price_apply'].round(2)).all()
print(f"🔍 Results identical: {results_match}")

# Clean up test columns
performance_data.drop(['final_price_vectorized', 'final_price_apply'], axis=1, inplace=True)
🧮 Test 1: Calculate final price after discount
✅ Vectorized:    0.0013s (1.0x baseline)
⚙️  Apply function: 0.2778s (207.6x slower)
🔍 Results identical: True

8.4.3 replace(): Replace Specific Values with Other Values

s_with_nan = pd.Series([1, 2, np.nan, 4])

# map() with na_action
mapped_nan = s_with_nan.map({1: 'one', 2: 'two', 4: 'four'}, na_action='ignore')
print("map() with na_action='ignore':")
print(mapped_nan)

# replace() can target NaN directly
replaced_nan = s_with_nan.replace(np.nan, 'missing')
print("\nreplace() handling NaN:")
print(replaced_nan)
map() with na_action='ignore':
0     one
1     two
2     NaN
3    four
dtype: object

replace() handling NaN:
0        1.0
1        2.0
2    missing
3        4.0
dtype: object

Note: Unlike map() and apply(), replace() is a vectorized operation in pandas, which makes it much faster for simple value substitutions.
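
replace() also accepts a dictionary, which makes it handy for several substitutions at once. The status codes in this small sketch are invented for the example.

import pandas as pd

status = pd.Series(['A', 'I', 'A', 'P', 'I'])
print(status.replace({'A': 'Active', 'I': 'Inactive', 'P': 'Pending'}))  # maps every listed code in one call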

8.4.4 Performance Comparison

# 📊 Performance Comparison
print("\n🗺️ Test 2: Category mapping")

# Method 1: MAP (Optimal for this task)
start_time = time.time()
performance_data['category_name_map'] = performance_data['category'].map(category_mapping)
map_time = time.time() - start_time

# Method 2: APPLY (Overkill for simple mapping)
start_time = time.time()
performance_data['category_name_apply'] = performance_data['category'].apply(lambda x: category_mapping[x])
apply_time = time.time() - start_time

# Method 3: VECTORIZED replacement (Alternative approach)
start_time = time.time()
performance_data['category_name_replace'] = performance_data['category'].replace(category_mapping)
replace_time = time.time() - start_time

print(f"✅ Map():        {map_time:.4f}s (1.0x baseline)")
print(f"🔄 Apply():      {apply_time:.4f}s ({apply_time/map_time:.1f}x slower)")
print(f"🔄 Replace():    {replace_time:.4f}s ({replace_time/map_time:.1f}x slower)")

# Clean up test columns
performance_data.drop(['category_name_apply', 'category_name_replace'], axis=1, inplace=True)

🗺️ Test 2: Category mapping
✅ Map():        0.0025s (1.0x baseline)
🔄 Apply():      0.0101s (4.0x slower)
🔄 Replace():    0.0000s (0.0x slower)

Key Takeaways

  • apply() and map() are not vectorized — under the hood they behave like Python loops.
    Pandas iterates through each row or column and calls your function, which can be slow on large datasets.

  • For value substitutions, use replace() — it's vectorized and far faster than writing an apply() with a custom function.

8.5 Core Vectorized Operations in Pandas

Pandas is built on top of NumPy, which means that vectorized operations are actually compiled C code running under the hood. This is why vectorized operations can be orders of magnitude faster than Python loops!

8.5.1 Why Vectorization Matters

The Speed Difference:

  • Python loop: Interpreted, slow, one operation at a time
  • Vectorized operation: Compiled C code, operates on entire arrays at once

Performance gain: Typically 10-100x faster than equivalent loops!

8.5.2 Categories of Vectorized Operations

Category       | Operations                          | Example
Arithmetic     | +, -, *, /, **, //, %               | df['total'] = df['price'] * df['qty']
Comparison     | >, <, >=, <=, ==, !=                | df['expensive'] = df['price'] > 100
Aggregation    | .sum(), .mean(), .max(), .std()     | df['revenue'].sum()
Boolean Logic  | & (and), | (or), ~ (not)            | df[(df['price'] > 50) & (df['qty'] > 10)]
String Ops     | .str.upper(), .str.contains(), etc. | df['name'].str.upper()
DateTime Ops   | .dt.year, .dt.month, .dt.dayofweek  | df['date'].dt.year
Math Functions | np.sqrt(), np.log(), np.exp()       | np.log(df['revenue'])

Let’s see each category with practical examples!

8.5.3 Arithmetic Operations

The most common vectorized operations — perform math on entire columns at once.

print("📊 ARITHMETIC OPERATIONS")
print("=" * 60)

# Sample e-commerce data
orders_df = pd.DataFrame({
    'product': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig', 'Whatsit'],
    'price': [29.99, 49.99, 19.99, 99.99, 39.99],
    'quantity': [2, 1, 5, 1, 3],
    'tax_rate': [0.08, 0.08, 0.08, 0.08, 0.08],
    'discount_pct': [10, 0, 15, 5, 20]
})

print("Original order data:")
print(orders_df)

# ✅ VECTORIZED: All calculations happen at once on entire columns
orders_df['subtotal'] = orders_df['price'] * orders_df['quantity']
orders_df['discount_amount'] = orders_df['subtotal'] * (orders_df['discount_pct'] / 100)
orders_df['after_discount'] = orders_df['subtotal'] - orders_df['discount_amount']
orders_df['tax'] = orders_df['after_discount'] * orders_df['tax_rate']
orders_df['total'] = orders_df['after_discount'] + orders_df['tax']

print("\n✅ Calculated columns (all vectorized!):")
orders_df[['product', 'subtotal', 'discount_amount', 'total']].round(2)
📊 ARITHMETIC OPERATIONS
============================================================
Original order data:
       product  price  quantity  tax_rate  discount_pct
0       Widget  29.99         2      0.08            10
1       Gadget  49.99         1      0.08             0
2    Doohickey  19.99         5      0.08            15
3  Thingamajig  99.99         1      0.08             5
4      Whatsit  39.99         3      0.08            20

✅ Calculated columns (all vectorized!):
       product  subtotal  discount_amount   total
0       Widget     59.98             6.00   58.30
1       Gadget     49.99             0.00   53.99
2    Doohickey     99.95            14.99   91.75
3  Thingamajig     99.99             5.00  102.59
4      Whatsit    119.97            23.99  103.65

8.5.4 Comparison Operations

Create boolean masks for filtering, flagging, and categorizing data.

print("\n🔍 COMPARISON OPERATIONS")
print("=" * 60)

# Student grades data
grades_df = pd.DataFrame({
    'student': ['Alice', 'Bob', 'Carol', 'David', 'Eve', 'Frank'],
    'math': [92, 78, 88, 65, 95, 72],
    'science': [88, 82, 90, 70, 91, 68],
    'english': [85, 90, 87, 75, 89, 80]
})

print("Student grades:")
print(grades_df)

# ✅ VECTORIZED: Compare entire columns at once
grades_df['math_pass'] = grades_df['math'] >= 70
grades_df['high_achiever'] = grades_df['math'] >= 90
grades_df['needs_help'] = grades_df['math'] < 70

# Calculate average and compare
grades_df['average'] = (grades_df['math'] + grades_df['science'] + grades_df['english']) / 3
grades_df['above_average'] = grades_df['average'] > grades_df['average'].mean()

print("\n✅ Comparison results:")
print(grades_df[['student', 'math', 'average', 'math_pass', 'high_achiever', 'above_average']])

# Count students in each category
print(f"\n📈 Statistics:")
print(f"  • Students passing math: {grades_df['math_pass'].sum()}")
print(f"  • High achievers (math ≥ 90): {grades_df['high_achiever'].sum()}")
print(f"  • Students above class average: {grades_df['above_average'].sum()}")

🔍 COMPARISON OPERATIONS
============================================================
Student grades:
  student  math  science  english
0   Alice    92       88       85
1     Bob    78       82       90
2   Carol    88       90       87
3   David    65       70       75
4     Eve    95       91       89
5   Frank    72       68       80

✅ Comparison results:
  student  math    average  math_pass  high_achiever  above_average
0   Alice    92  88.333333       True           True           True
1     Bob    78  83.333333       True          False           True
2   Carol    88  88.333333       True          False           True
3   David    65  70.000000      False          False          False
4     Eve    95  91.666667       True           True           True
5   Frank    72  73.333333       True          False          False

📈 Statistics:
  • Students passing math: 5
  • High achievers (math ≥ 90): 2
  • Students above class average: 4

8.5.5 Boolean Indexing & Filtering

Combine comparison operations with boolean logic to filter data.

print("\n🎯 BOOLEAN INDEXING & FILTERING")
print("=" * 60)

# Product inventory data
inventory_df = pd.DataFrame({
    'product_id': range(1, 11),
    'name': ['Widget', 'Gadget', 'Doohickey', 'Thingamajig', 'Whatsit', 
             'Gizmo', 'Contraption', 'Device', 'Apparatus', 'Mechanism'],
    'price': [29.99, 149.99, 19.99, 299.99, 39.99, 89.99, 199.99, 59.99, 179.99, 249.99],
    'stock': [45, 12, 8, 3, 67, 23, 15, 5, 28, 9],
    'category': ['A', 'B', 'A', 'C', 'A', 'B', 'C', 'A', 'B', 'C']
})

print("Product inventory:")
print(inventory_df)

# ✅ VECTORIZED FILTERING: Multiple conditions combined
# Find expensive, low-stock items
expensive = inventory_df['price'] > 100
low_stock = inventory_df['stock'] < 15
urgent_restock = expensive & low_stock

print("\n🚨 Urgent restock needed (expensive + low stock):")
print(inventory_df[urgent_restock][['name', 'price', 'stock']])

# Find good deals in category A
category_a = inventory_df['category'] == 'A'
affordable = inventory_df['price'] < 50
good_deals = category_a & affordable

print("\n💰 Good deals in Category A (price < $50):")
print(inventory_df[good_deals][['name', 'price', 'category']])

# Complex filter: Either (expensive AND low stock) OR (cheap AND high stock)
cheap = inventory_df['price'] < 50
high_stock = inventory_df['stock'] > 30
complex_filter = (expensive & low_stock) | (cheap & high_stock)

print("\n🔄 Complex filter results:")
print(inventory_df[complex_filter][['name', 'price', 'stock']])

print(f"\n📊 Filter statistics:")
print(f"  • Total products: {len(inventory_df)}")
print(f"  • Urgent restock: {urgent_restock.sum()}")
print(f"  • Good deals: {good_deals.sum()}")
print(f"  • Complex filter matches: {complex_filter.sum()}")

🎯 BOOLEAN INDEXING & FILTERING
============================================================
Product inventory:
   product_id         name   price  stock category
0           1       Widget   29.99     45        A
1           2       Gadget  149.99     12        B
2           3    Doohickey   19.99      8        A
3           4  Thingamajig  299.99      3        C
4           5      Whatsit   39.99     67        A
5           6        Gizmo   89.99     23        B
6           7  Contraption  199.99     15        C
7           8       Device   59.99      5        A
8           9    Apparatus  179.99     28        B
9          10    Mechanism  249.99      9        C

🚨 Urgent restock needed (expensive + low stock):
          name   price  stock
1       Gadget  149.99     12
3  Thingamajig  299.99      3
9    Mechanism  249.99      9

💰 Good deals in Category A (price < $50):
        name  price category
0     Widget  29.99        A
2  Doohickey  19.99        A
4    Whatsit  39.99        A

🔄 Complex filter results:
          name   price  stock
0       Widget   29.99     45
1       Gadget  149.99     12
3  Thingamajig  299.99      3
4      Whatsit   39.99     67
9    Mechanism  249.99      9

📊 Filter statistics:
  • Total products: 10
  • Urgent restock: 3
  • Good deals: 3
  • Complex filter matches: 5

8.5.6 Aggregation Operations

Reduce columns to summary statistics using vectorized aggregations.

print("\n📈 AGGREGATION OPERATIONS")
print("=" * 60)

# Sales data by region
sales_df = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'] * 3,
    'month': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    'revenue': [45000, 52000, 48000, 51000, 47000, 54000, 49000, 53000, 
                46000, 55000, 50000, 52000, 44000, 53000, 47000, 50000,
                48000, 56000, 51000, 54000, 45000, 57000, 49000, 55000]
})

print("Sales data (first 8 rows):")
print(sales_df.head(8))

# ✅ VECTORIZED AGGREGATIONS: Instant calculations on entire columns
print("\n📊 Overall statistics (vectorized):")
print(f"  • Total revenue: ${sales_df['revenue'].sum():,.0f}")
print(f"  • Average revenue: ${sales_df['revenue'].mean():,.0f}")
print(f"  • Median revenue: ${sales_df['revenue'].median():,.0f}")
print(f"  • Min revenue: ${sales_df['revenue'].min():,.0f}")
print(f"  • Max revenue: ${sales_df['revenue'].max():,.0f}")
print(f"  • Std deviation: ${sales_df['revenue'].std():,.0f}")

# Group by region and aggregate (still vectorized!)
print("\n🌍 Revenue by region:")
region_stats = sales_df.groupby('region')['revenue'].agg(['sum', 'mean', 'count'])
print(region_stats)

# Multiple aggregations at once
print("\n📉 Comprehensive statistics:")
all_stats = sales_df.groupby('region')['revenue'].describe()
print(all_stats)

📈 AGGREGATION OPERATIONS
============================================================
Sales data (first 8 rows):
  region  month  revenue
0  North      1    45000
1  South      1    52000
2   East      1    48000
3   West      1    51000
4  North      2    47000
5  South      2    54000
6   East      2    49000
7   West      2    53000

📊 Overall statistics (vectorized):
  • Total revenue: $1,211,000
  • Average revenue: $50,458
  • Median revenue: $50,500
  • Min revenue: $44,000
  • Max revenue: $57,000
  • Std deviation: $3,730

🌍 Revenue by region:
           sum          mean  count
region                             
East    294000  49000.000000      6
North   275000  45833.333333      6
South   327000  54500.000000      6
West    315000  52500.000000      6

📉 Comprehensive statistics:
        count          mean          std      min      25%      50%      75%  \
region                                                                         
East      6.0  49000.000000  1414.213562  47000.0  48250.0  49000.0  49750.0   
North     6.0  45833.333333  1471.960144  44000.0  45000.0  45500.0  46750.0   
South     6.0  54500.000000  1870.828693  52000.0  53250.0  54500.0  55750.0   
West      6.0  52500.000000  1870.828693  50000.0  51250.0  52500.0  53750.0   

            max  
region           
East    51000.0  
North   48000.0  
South   57000.0  
West    55000.0  

8.5.7 String Operations (.str accessor)

Vectorized string manipulation — no loops needed!

print("\n🔤 STRING OPERATIONS (.str accessor)")
print("=" * 60)

# Customer data with messy strings
customers_df = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['  john DOE  ', 'jane SMITH', 'Bob_Jones', 'alice WILLIAMS  ', 'CHARLIE brown'],
    'email': ['john@EXAMPLE.com', 'jane@test.COM', 'bob@demo.ORG', 'alice@SITE.net', 'charlie@mail.COM'],
    'phone': ['123-456-7890', '(234) 567-8901', '345.678.9012', '456-567-8901', '567-678-9012']
})

print("Original customer data (messy):")
print(customers_df)

# ✅ VECTORIZED STRING OPERATIONS: Clean all at once
customers_df['name_clean'] = customers_df['name'].str.strip().str.title()
customers_df['email_clean'] = customers_df['email'].str.lower()
customers_df['first_name'] = customers_df['name_clean'].str.split().str[0]
customers_df['last_name'] = customers_df['name_clean'].str.split().str[-1]
customers_df['phone_clean'] = customers_df['phone'].str.replace(r'[^\d]', '', regex=True)

# Check for patterns
customers_df['has_underscore'] = customers_df['name'].str.contains('_')
customers_df['email_domain'] = customers_df['email'].str.split('@').str[1]

print("\n✨ Cleaned data:")
print(customers_df[['name_clean', 'email_clean', 'first_name', 'last_name', 'email_domain']])

print("\n🔍 String pattern checks:")
print(f"  • Customers with underscore in name: {customers_df['has_underscore'].sum()}")
print(f"  • Unique email domains: {customers_df['email_domain'].nunique()}")

8.5.8 DateTime Operations (.dt accessor)

Extract time components and perform date calculations — all vectorized!

print("\n📅 DATETIME OPERATIONS (.dt accessor)")
print("=" * 60)

# Transaction data with dates
transactions_df = pd.DataFrame({
    'transaction_id': range(1, 7),
    'date': pd.to_datetime([
        '2024-01-15', '2024-02-20', '2024-03-10', 
        '2024-06-25', '2024-09-05', '2024-12-18'
    ]),
    'amount': [250, 180, 420, 310, 275, 390]
})

print("Transaction data:")
print(transactions_df)

# ✅ VECTORIZED DATETIME OPERATIONS: Extract components
transactions_df['year'] = transactions_df['date'].dt.year
transactions_df['month'] = transactions_df['date'].dt.month
transactions_df['month_name'] = transactions_df['date'].dt.month_name()
transactions_df['quarter'] = transactions_df['date'].dt.quarter
transactions_df['day_of_week'] = transactions_df['date'].dt.day_name()
transactions_df['is_weekend'] = transactions_df['date'].dt.dayofweek >= 5

# Calculate days since first transaction
first_date = transactions_df['date'].min()
transactions_df['days_since_first'] = (transactions_df['date'] - first_date).dt.days

print("\n📊 Extracted date components:")
print(transactions_df[['date', 'month_name', 'quarter', 'day_of_week', 'days_since_first']])

# Aggregate by quarter
print("\n📈 Revenue by quarter:")
quarterly = transactions_df.groupby('quarter')['amount'].sum()
print(quarterly)

print(f"\n🎯 Date insights:")
print(f"  • Transactions on weekends: {transactions_df['is_weekend'].sum()}")
print(f"  • Date range: {transactions_df['date'].min()} to {transactions_df['date'].max()}")
print(f"  • Total days covered: {transactions_df['days_since_first'].max()}")

8.5.9 Mathematical Functions (NumPy Integration)

Pandas seamlessly integrates with NumPy functions for advanced math operations.

print("\n🔢 MATHEMATICAL FUNCTIONS (NumPy + Pandas)")
print("=" * 60)

# Scientific measurement data
measurements_df = pd.DataFrame({
    'sample_id': range(1, 6),
    'concentration': [2.5, 5.0, 10.0, 20.0, 40.0],
    'absorbance': [0.15, 0.32, 0.65, 1.28, 2.45],
    'temperature': [20, 25, 30, 35, 40]
})

print("Laboratory measurements:")
print(measurements_df)

# ✅ VECTORIZED MATH OPERATIONS: Apply functions to entire columns
measurements_df['log_concentration'] = np.log10(measurements_df['concentration'])
measurements_df['sqrt_absorbance'] = np.sqrt(measurements_df['absorbance'])
measurements_df['exp_temp'] = np.exp(measurements_df['temperature'] / 100)
measurements_df['temp_squared'] = np.power(measurements_df['temperature'], 2)

# Trigonometric functions (useful for periodic data)
measurements_df['sin_temp'] = np.sin(np.radians(measurements_df['temperature']))

# Statistical transformations
measurements_df['concentration_zscore'] = (
    (measurements_df['concentration'] - measurements_df['concentration'].mean()) / 
    measurements_df['concentration'].std()
)

print("\n🧪 Calculated transformations:")
print(measurements_df[['sample_id', 'concentration', 'log_concentration', 
                       'absorbance', 'sqrt_absorbance']].round(3))

print("\n📊 Statistical summary:")
print(f"  • Mean concentration: {measurements_df['concentration'].mean():.2f}")
print(f"  • Geometric mean (via log): {np.exp(np.log(measurements_df['concentration']).mean()):.2f}")
print(f"  • Correlation (conc. vs abs.): {measurements_df['concentration'].corr(measurements_df['absorbance']):.3f}")

8.5.10 Performance Comparison: Loops vs Vectorization

Let’s see the dramatic speed difference between loops and vectorized operations with real timing data.

import time

print("\n⚡ PERFORMANCE SHOWDOWN: Loops vs Vectorization")
print("=" * 60)

# Create a large dataset for meaningful timing
large_df = pd.DataFrame({
    'price': np.random.uniform(10, 500, 50000),
    'quantity': np.random.randint(1, 20, 50000),
    'tax_rate': 0.08,
    'discount': np.random.uniform(0, 0.3, 50000)
})

print(f"Dataset size: {len(large_df):,} rows")

# ❌ METHOD 1: Python loop with iterrows() - SLOW!
print("\n🐌 Method 1: Python loop with iterrows()")
start_time = time.time()
total_loop = []
for idx, row in large_df.iterrows():
    subtotal = row['price'] * row['quantity']
    discount_amt = subtotal * row['discount']
    after_discount = subtotal - discount_amt
    tax = after_discount * row['tax_rate']
    total = after_discount + tax
    total_loop.append(total)
loop_time = time.time() - start_time
print(f"   Time: {loop_time:.4f} seconds")

# ❌ METHOD 2: List comprehension - Still slow!
print("\n🐢 Method 2: List comprehension")
start_time = time.time()
total_list_comp = [
    (row['price'] * row['quantity'] * (1 - row['discount']) * (1 + row['tax_rate']))
    for idx, row in large_df.iterrows()
]
list_comp_time = time.time() - start_time
print(f"   Time: {list_comp_time:.4f} seconds")

# ❌ METHOD 3: apply() - Better but still not optimal
print("\n🔄 Method 3: apply() function")
start_time = time.time()
def calculate_total(row):
    subtotal = row['price'] * row['quantity']
    after_discount = subtotal * (1 - row['discount'])
    return after_discount * (1 + row['tax_rate'])
large_df['total_apply'] = large_df.apply(calculate_total, axis=1)
apply_time = time.time() - start_time
print(f"   Time: {apply_time:.4f} seconds")

# ✅ METHOD 4: Vectorized operations - FAST!
print("\n⚡ Method 4: Vectorized operations")
start_time = time.time()
large_df['total_vectorized'] = (
    large_df['price'] * 
    large_df['quantity'] * 
    (1 - large_df['discount']) * 
    (1 + large_df['tax_rate'])
)
vectorized_time = time.time() - start_time
print(f"   Time: {vectorized_time:.4f} seconds")

# Performance comparison
print("\n" + "=" * 60)
print("🏆 PERFORMANCE RESULTS")
print("=" * 60)
print(f"{'Method':<30} {'Time (s)':<12} {'Speedup':<12} {'Status'}")
print("-" * 60)

# Prevent ZeroDivisionError by ensuring minimum time value
min_time = 0.0001  # Minimum time threshold to avoid division by zero
safe_vectorized_time = max(vectorized_time, min_time)

# Calculate speedups (safe_vectorized_time is always > 0, so no extra guard is needed)
loop_speedup = loop_time / safe_vectorized_time
list_comp_speedup = list_comp_time / safe_vectorized_time
apply_speedup = apply_time / safe_vectorized_time

print(f"{'🐌 iterrows() loop':<30} {loop_time:>10.4f}   {loop_speedup:>8.1f}x   ❌ AVOID")
print(f"{'🐢 List comprehension':<30} {list_comp_time:>10.4f}   {list_comp_speedup:>8.1f}x   ❌ AVOID")
print(f"{'🔄 apply() function':<30} {apply_time:>10.4f}   {apply_speedup:>8.1f}x   ⚠️  USE SPARINGLY")
print(f"{'⚡ Vectorized operations':<30} {vectorized_time:>10.4f}   {'1.0x':>8}   ✅ BEST")

print("\n💡 Key Insight:")
if vectorized_time > 0:
    speedup_factor = loop_time / safe_vectorized_time
    time_reduction = ((loop_time - vectorized_time) / loop_time * 100) if loop_time > 0 else 0
    print(f"   Vectorization is {speedup_factor:.0f}x faster than loops!")
    print(f"   That's a {time_reduction:.1f}% reduction in execution time.")
else:
    print(f"   Vectorization is extremely fast (< 0.0001 seconds)!")
    print(f"   Loop time: {loop_time:.4f}s vs Vectorized: {vectorized_time:.6f}s")

# Verify all methods produce same results
print("\n✅ Verification: All methods produce identical results:", 
      np.allclose(total_loop, large_df['total_vectorized']))

⚡ PERFORMANCE SHOWDOWN: Loops vs Vectorization
============================================================
Dataset size: 50,000 rows

🐌 Method 1: Python loop with iterrows()
   Time: 0.8976 seconds

🐢 Method 2: List comprehension
   Time: 0.8167 seconds

🔄 Method 3: apply() function
   Time: 0.4004 seconds

⚡ Method 4: Vectorized operations
   Time: 0.0000 seconds

============================================================
🏆 PERFORMANCE RESULTS
============================================================
Method                         Time (s)     Speedup      Status
------------------------------------------------------------
🐌 iterrows() loop                  0.8976     8975.7x   ❌ AVOID
🐢 List comprehension               0.8167     8166.9x   ❌ AVOID
🔄 apply() function                 0.4004     4003.8x   ⚠️  USE SPARINGLY
⚡ Vectorized operations            0.0000       1.0x   ✅ BEST

💡 Key Insight:
   Vectorization is extremely fast (< 0.0001 seconds)!
   Loop time: 0.8976s vs Vectorized: 0.000000s

✅ Verification: All methods produce identical results: True

8.5.11 Summary: Vectorization Best Practices

8.5.11.1 Always Try Vectorization First

✅ Use vectorized operations for:

  • Arithmetic: df['total'] = df['price'] * df['qty']
  • Comparisons: df['high_value'] = df['amount'] > 1000
  • String operations: df['upper_name'] = df['name'].str.upper()
  • Date operations: df['year'] = df['date'].dt.year
  • Math functions: df['log_revenue'] = np.log(df['revenue'])
  • Aggregations: df['revenue'].sum(), df.groupby('category').mean()
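
To make the bullets above concrete, here is a minimal sketch that exercises each one on a small made-up DataFrame (all column names and values are illustrative):

# Toy DataFrame for demonstrating each vectorized pattern listed above
demo = pd.DataFrame({
    'price': [10.0, 25.0, 40.0],
    'qty': [3, 1, 2],
    'name': ['widget', 'gadget', 'gizmo'],
    'date': pd.to_datetime(['2024-01-05', '2024-06-20', '2024-11-02']),
    'revenue': [30.0, 25.0, 80.0],
    'category': ['A', 'B', 'A']
})

demo['total'] = demo['price'] * demo['qty']           # arithmetic
demo['high_value'] = demo['total'] > 50               # comparison
demo['upper_name'] = demo['name'].str.upper()         # string accessor
demo['year'] = demo['date'].dt.year                   # date accessor
demo['log_revenue'] = np.log(demo['revenue'])         # NumPy math function
print(demo)
print(demo['revenue'].sum())                          # aggregation on one column
print(demo.groupby('category')['revenue'].mean())     # grouped aggregation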

8.5.11.2 Performance Hierarchy

Speed Ranking (fastest to slowest):
1. ⚡ Vectorized operations (NumPy/Pandas built-ins)     ← Always try first!
2. 🔄 .map() for simple 1:1 mappings                    ← Good for dictionaries
3. 🔄 .apply() for complex row-wise logic               ← Use when necessary
4. 🐌 Python loops (iterrows, comprehensions)           ← AVOID AT ALL COSTS

8.5.11.3 ⚠️ Common Mistakes to Avoid

# ❌ DON'T: Use loops for arithmetic
for i in range(len(df)):
    df.loc[i, 'total'] = df.loc[i, 'price'] * df.loc[i, 'qty']

# ✅ DO: Use vectorized operations
df['total'] = df['price'] * df['qty']

# ❌ DON'T: Use apply() for simple operations
df['price_doubled'] = df['price'].apply(lambda x: x * 2)

# ✅ DO: Use direct vectorized operation
df['price_doubled'] = df['price'] * 2

# ❌ DON'T: Use iterrows() ever!
for idx, row in df.iterrows():  # orders of magnitude slower
    pass  # ... any per-row operation

# ✅ DO: Find a vectorized alternative
df['result'] = some_vectorized_operation

8.5.11.4 Quick Decision Guide

How should I perform this operation?

  1. Can I use a basic operator? (+, -, *, /, >, <, etc.)
    → ✅ Use it directly: df['col1'] + df['col2']

  2. Is it a string or date operation?
    → ✅ Use .str or .dt accessor: df['text'].str.upper()

  3. Is it a NumPy function?
    → ✅ Apply to column: np.sqrt(df['values'])

  4. Is it a simple dictionary mapping?
    → ✅ Use .map(): df['code'].map(mapping_dict)

  5. Is it a built-in aggregation?
    → ✅ Use aggregation method: df['sales'].sum()

  6. Does it require complex multi-column logic?
    → 🔄 Use .apply() (only when necessary)

  7. None of the above work?
    → 🤔 Re-think your approach! There’s almost always a vectorized way.
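
As a quick illustration of steps 4 and 6, here is a small sketch contrasting a dictionary .map() with a multi-column .apply() fallback (the orders table and its columns are made up):

# Hypothetical orders table: .map() for a simple lookup, .apply() only for multi-column logic
orders = pd.DataFrame({
    'code': ['US', 'DE', 'JP'],
    'sales': [100.0, 250.0, 175.0],
    'units': [5, 20, 12]
})

country_map = {'US': 'United States', 'DE': 'Germany', 'JP': 'Japan'}
orders['country'] = orders['code'].map(country_map)   # step 4: simple 1:1 dictionary mapping

orders['flag'] = orders.apply(                        # step 6: genuinely needs two columns + a condition
    lambda row: 'priority' if row['sales'] > 150 and row['units'] >= 10 else 'standard',
    axis=1
)
print(orders)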

8.5.11.5 💡 Pro Tips

  1. Profile your code: Use %%timeit in Jupyter to measure performance
  2. Start vectorized: Don’t reach for loops first just because they feel easier
  3. Think in columns: Pandas works best with column-wise operations
  4. Combine operations: Chain multiple vectorized operations together
  5. Use NumPy: Many NumPy functions work directly on pandas Series
  6. Avoid loops: Seriously, avoid them. There’s almost always a better way.
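
For tip 1, %%timeit works only inside Jupyter; outside a notebook, the standard-library timeit module gives comparable measurements. A minimal sketch (the DataFrame below is synthetic):

import timeit

bench_df = pd.DataFrame({
    'price': np.random.rand(10_000),
    'qty': np.random.randint(1, 5, 10_000)
})

# timeit accepts a zero-argument callable and repeats it `number` times
vec_time = timeit.timeit(lambda: bench_df['price'] * bench_df['qty'], number=100)
apply_time = timeit.timeit(lambda: bench_df.apply(lambda r: r['price'] * r['qty'], axis=1), number=10)

print(f"vectorized, 100 runs: {vec_time:.4f}s")
print(f"apply,       10 runs: {apply_time:.4f}s")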

8.6 Summary: Performance-First Mindset

8.6.1 Key Takeaways

  1. Vectorization is King: Always your first choice - leverages NumPy’s C-speed processing
  2. map() for Mappings: Perfect for dictionaries and simple element-wise transformations
  3. apply() for Complex Logic: Use when business rules require multiple columns or complex conditions
  4. Avoid Python Loops: iterrows() and row-wise list comprehensions are performance killers

8.6.2 Looking Ahead: Why NumPy Matters

The next chapter will dive deep into NumPy, the foundation that makes pandas so powerful. You’ll learn:

  • Array Architecture: How NumPy’s C-based arrays enable lightning-fast operations
  • Broadcasting Rules: Advanced techniques for working with different array shapes
  • Universal Functions (ufuncs): The secret weapons behind vectorized operations
  • Custom Vectorization: How to write your own fast, NumPy-powered functions

8.6.3 The Performance Pyramid

🚀 NumPy Vectorized (Next Chapter!)
    ↑ often several times faster
⚡ Pandas Vectorized  
    ↑ typically 100x+ faster (see the benchmarks above)
🔄 map() / apply()
    ↑ roughly 2-10x faster  
🐌 Python Loops (iterrows, comprehensions)

Understanding NumPy will unlock the full potential of pandas and give you the tools to write production-ready, high-performance data analysis code!

8.7 Lambda Functions: Inline Power for Data Transformation

In the previous sections, we saw how to define a function and then use map() or apply() to apply it.
With lambda functions, you can skip the definition step and write the function inline.

Lambda functions are anonymous, one-line functions that provide a concise way to express simple operations without creating a separate def block.
They’re the Swiss Army knife of pandas transformations — compact, versatile, and perfect for quick, one-off operations.

8.7.1 Syntax & Structure

lambda arguments: expression
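
For instance, the def block and the lambda below are equivalent; the lambda just skips the name and the return statement (these are minimal illustrations):

# A named function ...
def double(x):
    return x * 2

# ... and its inline lambda equivalent
double_lambda = lambda x: x * 2

print(double(21), double_lambda(21))        # 42 42

# The inline form shines when passed straight into map()/apply()
s = pd.Series([1, 2, 3])
print(s.apply(lambda x: x * 2).tolist())    # [2, 4, 6]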

8.7.2 Key Characteristics

Feature Description Example
Anonymous No function name required lambda x: x * 2
Single Expression Only one expression allowed lambda x, y: x + y
Inline Definition Defined at point of use df.apply(lambda row: row.sum(), axis=1)
Return Value Automatically returns expression result lambda x: x ** 2 returns squared value
Multiple Arguments Can accept multiple parameters lambda x, y, z: x * y + z

8.7.3 Real-World Lambda Applications

In this section, we’ll explore practical use cases of lambda functions in pandas and demonstrate common transformation patterns.

# 🎯 Create a comprehensive e-commerce dataset for lambda demonstrations

from datetime import datetime, timedelta

np.random.seed(42)

# Generate realistic e-commerce transaction data
n_transactions = 1000

ecommerce_data = {
    'transaction_id': [f'TXN_{i:05d}' for i in range(1, n_transactions + 1)],
    'customer_id': np.random.randint(1, 200, n_transactions),
    'product_name': np.random.choice([
        'Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Camera', 
        'Watch', 'Speaker', 'Monitor', 'Keyboard', 'Mouse'
    ], n_transactions),
    'price': np.random.uniform(50, 2000, n_transactions).round(2),
    'quantity': np.random.randint(1, 5, n_transactions),
    'discount_pct': np.random.choice([0, 5, 10, 15, 20, 25], n_transactions),
    'shipping_cost': np.random.uniform(5, 50, n_transactions).round(2),
    'customer_rating': np.random.choice([1, 2, 3, 4, 5], n_transactions),
    'purchase_date': pd.date_range('2024-01-01', periods=n_transactions, freq='h'),
    'customer_type': np.random.choice(['Premium', 'Regular', 'New'], n_transactions),
    'payment_method': np.random.choice(['Credit Card', 'PayPal', 'Bank Transfer'], n_transactions)
}

df_ecommerce = pd.DataFrame(ecommerce_data)

print("🛒 E-commerce Dataset Sample:")
print(df_ecommerce.head())
print(f"\nDataset shape: {df_ecommerce.shape}")
🛒 E-commerce Dataset Sample:
  transaction_id  customer_id product_name    price  quantity  discount_pct  \
0      TXN_00001          103       Camera  1892.00         1            15   
1      TXN_00002          180        Watch  1382.33         1             0   
2      TXN_00003           93   Headphones  1019.49         2             0   
3      TXN_00004           15        Mouse  1254.80         3             5   
4      TXN_00005          107   Smartphone  1744.36         4            10   

   shipping_cost  customer_rating       purchase_date customer_type  \
0          40.67                3 2024-01-01 00:00:00           New   
1          33.23                2 2024-01-01 01:00:00           New   
2          38.76                3 2024-01-01 02:00:00           New   
3          11.84                2 2024-01-01 03:00:00           New   
4          25.62                4 2024-01-01 04:00:00           New   

  payment_method  
0  Bank Transfer  
1         PayPal  
2         PayPal  
3  Bank Transfer  
4    Credit Card  

Dataset shape: (1000, 11)

8.7.3.1 Simple Mathematical Operations

# 📊 Example 1: Simple Mathematical Operations with Lambda
print("=== SIMPLE MATHEMATICAL OPERATIONS ===")

# Calculate discounted price
df_ecommerce['discounted_price'] = df_ecommerce.apply(
    lambda row: row['price'] * (1 - row['discount_pct'] / 100), axis=1
)

# Calculate total cost (price + shipping)
df_ecommerce['total_cost'] = df_ecommerce.apply(
    lambda row: row['discounted_price'] * row['quantity'] + row['shipping_cost'], axis=1
)

# Calculate savings amount
df_ecommerce['savings'] = df_ecommerce.apply(
    lambda row: (row['price'] - row['discounted_price']) * row['quantity'], axis=1
)

print("💰 Price Calculations (First 10 transactions):")
price_cols = ['transaction_id', 'price', 'discount_pct', 'discounted_price', 'quantity', 'total_cost', 'savings']
print(df_ecommerce[price_cols].head(10).to_string(index=False))
=== SIMPLE MATHEMATICAL OPERATIONS ===
💰 Price Calculations (First 10 transactions):
transaction_id   price  discount_pct  discounted_price  quantity  total_cost   savings
     TXN_00001 1892.00            15         1608.2000         1   1648.8700  283.8000
     TXN_00002 1382.33             0         1382.3300         1   1415.5600    0.0000
     TXN_00003 1019.49             0         1019.4900         2   2077.7400    0.0000
     TXN_00004 1254.80             5         1192.0600         3   3588.0200  188.2200
     TXN_00005 1744.36            10         1569.9240         4   6305.3160  697.7440
     TXN_00006 1162.69            25          872.0175         1    892.8175  290.6725
     TXN_00007  109.25             0          109.2500         4    446.2200    0.0000
     TXN_00008 1865.35             5         1772.0825         3   5343.1275  279.8025
     TXN_00009 1394.58            20         1115.6640         3   3393.4220  836.7480
     TXN_00010 1369.20            25         1026.9000         4   4114.4000 1369.2000

8.7.3.2 Conditional Logic

# General pattern (illustrative):
df['category'] = df['score'].apply(lambda x: 'High' if x > 80 else 'Low')

# 📊 Example 2: Conditional Logic with Lambda
print("\n=== CONDITIONAL LOGIC OPERATIONS ===")

# Categorize customers based on spending
df_ecommerce['customer_segment'] = df_ecommerce['total_cost'].apply(
    lambda x: 'High Value' if x > 500 else 'Medium Value' if x > 200 else 'Low Value'
)

# Classify ratings  
df_ecommerce['rating_category'] = df_ecommerce['customer_rating'].apply(
    lambda rating: 'Excellent' if rating == 5 
                  else 'Good' if rating >= 4
                  else 'Average' if rating == 3
                  else 'Poor'
)

# Determine shipping urgency
df_ecommerce['shipping_urgency'] = df_ecommerce.apply(
    lambda row: 'Express' if row['customer_type'] == 'Premium' and row['total_cost'] > 300
               else 'Standard' if row['customer_type'] == 'Premium'
               else 'Economy', axis=1
)

print("🎯 Customer Segmentation (Sample):")
segment_cols = ['customer_id', 'total_cost', 'customer_segment', 'customer_rating', 'rating_category', 'shipping_urgency']
print(df_ecommerce[segment_cols].head(10).to_string(index=False))

=== CONDITIONAL LOGIC OPERATIONS ===
🎯 Customer Segmentation (Sample):
 customer_id  total_cost customer_segment  customer_rating rating_category shipping_urgency
         103   1648.8700       High Value                3         Average          Economy
         180   1415.5600       High Value                2            Poor          Economy
          93   2077.7400       High Value                3         Average          Economy
          15   3588.0200       High Value                2            Poor          Economy
         107   6305.3160       High Value                4            Good          Economy
          72    892.8175       High Value                3         Average          Economy
         189    446.2200     Medium Value                3         Average          Economy
          21   5343.1275       High Value                5       Excellent          Economy
         103   3393.4220       High Value                3         Average          Economy
         122   4114.4000       High Value                2            Poor          Economy

8.7.3.3 String Operations

# General pattern (illustrative):
df['clean_name'] = df['name'].apply(lambda x: x.strip().title())

# 🔤 Example 3: String Operations with Lambda
print("\n=== STRING OPERATIONS ===")

# Clean and standardize product names
df_ecommerce['product_name_clean'] = df_ecommerce['product_name'].apply(
    lambda name: name.strip().upper().replace(' ', '_')
)

# Create transaction codes
df_ecommerce['transaction_code'] = df_ecommerce.apply(
    lambda row: f"{row['customer_type'][:3].upper()}-{row['product_name_clean'][:4]}-{row['transaction_id'][-3:]}", axis=1
)

# Format purchase date for display
df_ecommerce['purchase_display'] = df_ecommerce['purchase_date'].apply(
    lambda date: date.strftime('%Y-%m-%d %H:%M')
)

print("🔤 String Processing Results (Sample):")
string_cols = ['product_name', 'product_name_clean', 'transaction_code', 'purchase_display']
print(df_ecommerce[string_cols].head(8).to_string(index=False))

=== STRING OPERATIONS ===
🔤 String Processing Results (Sample):
product_name product_name_clean transaction_code purchase_display
      Camera             CAMERA     NEW-CAME-001 2024-01-01 00:00
       Watch              WATCH     NEW-WATC-002 2024-01-01 01:00
  Headphones         HEADPHONES     NEW-HEAD-003 2024-01-01 02:00
       Mouse              MOUSE     NEW-MOUS-004 2024-01-01 03:00
  Smartphone         SMARTPHONE     NEW-SMAR-005 2024-01-01 04:00
     Monitor            MONITOR     NEW-MONI-006 2024-01-01 05:00
       Watch              WATCH     REG-WATC-007 2024-01-01 06:00
      Camera             CAMERA     REG-CAME-008 2024-01-01 07:00

8.7.3.4 Multiple Column Operations

# General pattern (illustrative):
df['total'] = df.apply(lambda row: row['price'] * row['quantity'], axis=1)

# 🚀 Example 4: Performance Optimization & Advanced Lambda Patterns
print("\n=== ADVANCED LAMBDA PATTERNS ===")

# Complex feature engineering
df_ecommerce['value_score'] = df_ecommerce.apply(
    lambda row: (
        (row['total_cost'] / 100) * 0.4 +  # 40% weight on spending
        (row['customer_rating'] / 5) * 0.3 +  # 30% weight on satisfaction  
        (1 if row['customer_type'] == 'Premium' else 0.5 if row['customer_type'] == 'Regular' else 0) * 0.2 +  # 20% weight on loyalty
        (min(row['quantity'], 3) / 3) * 0.1  # 10% weight on quantity (capped at 3)
    ), axis=1
)

# Risk assessment using multiple factors
df_ecommerce['risk_flag'] = df_ecommerce.apply(
    lambda row: 'High Risk' if (
        row['total_cost'] > 1000 and 
        row['payment_method'] == 'Bank Transfer' and 
        row['customer_rating'] <= 2
    ) else 'Medium Risk' if (
        row['total_cost'] > 500 or 
        row['customer_rating'] <= 2
    ) else 'Low Risk', axis=1
)

# Calculate percentile ranking
total_cost_90th = df_ecommerce['total_cost'].quantile(0.9)
df_ecommerce['is_top_spender'] = df_ecommerce['total_cost'].apply(
    lambda cost: cost >= total_cost_90th
)

print("🎯 Advanced Analytics Results:")
advanced_cols = ['customer_id', 'total_cost', 'value_score', 'risk_flag', 'is_top_spender']
print(df_ecommerce[advanced_cols].head(10).round(2).to_string(index=False))

print(f"\n📊 Summary Statistics:")
print(f"• Average value score: {df_ecommerce['value_score'].mean():.2f}")
print(f"• High risk transactions: {(df_ecommerce['risk_flag'] == 'High Risk').sum()}")
print(f"• Top spenders (90th percentile): {df_ecommerce['is_top_spender'].sum()}")
print(f"• Top spender threshold: ${total_cost_90th:.2f}")

=== ADVANCED LAMBDA PATTERNS ===
🎯 Advanced Analytics Results:
 customer_id  total_cost  value_score   risk_flag  is_top_spender
         103     1648.87         6.81 Medium Risk           False
         180     1415.56         5.82 Medium Risk           False
          93     2077.74         8.56 Medium Risk           False
          15     3588.02        14.57   High Risk           False
         107     6305.32        25.56 Medium Risk            True
          72      892.82         3.78 Medium Risk           False
         189      446.22         2.16    Low Risk           False
          21     5343.13        21.87 Medium Risk            True
         103     3393.42        13.95 Medium Risk           False
         122     4114.40        16.78 Medium Risk           False

📊 Summary Statistics:
• Average value score: 9.26
• High risk transactions: 117
• Top spenders (90th percentile): 100
• Top spender threshold: $4875.15

8.7.4 Performance Benchmarking: Lambda vs Named Functions vs Vectorized

# 🏁 Performance Benchmarking: Lambda vs Named Functions vs Vectorized
import pandas as pd
import numpy as np
import time

# Create LARGER test data to get measurable times
test_data = pd.Series(np.random.randint(1, 100, 1000000))  # 1M elements instead of 100K

print("⚡ Performance Comparison (1,000,000 operations):")
print("-" * 50)

# Method 1: Lambda function
start_time = time.time()
result_lambda = test_data.apply(lambda x: x ** 2 + 10)
lambda_time = time.time() - start_time

# Method 2: Named function
def square_plus_ten(x):
    return x ** 2 + 10

start_time = time.time()
result_named = test_data.apply(square_plus_ten)
named_time = time.time() - start_time

# Method 3: Vectorized operations (fastest)
start_time = time.time()
result_vectorized = test_data ** 2 + 10
vectorized_time = time.time() - start_time

# Method 4: List comprehension
start_time = time.time()
result_listcomp = pd.Series([x ** 2 + 10 for x in test_data])
listcomp_time = time.time() - start_time

# Guard against a zero timer reading before dividing
baseline = max(vectorized_time, 1e-6)
print(f"Vectorized operations: {vectorized_time:.6f}s (baseline)")
print(f"Lambda function:       {lambda_time:.6f}s ({lambda_time/baseline:.1f}x slower)")
print(f"Named function:        {named_time:.6f}s ({named_time/baseline:.1f}x slower)")
print(f"List comprehension:    {listcomp_time:.6f}s ({listcomp_time/baseline:.1f}x slower)")

# Verify results are identical
# Caveat: Series.equals() also compares dtypes; apply() may return a different integer
# dtype than the vectorized result, so this can print False even when every value matches.
print(f"\n✅ Results identical: {result_lambda.equals(result_named) and result_named.equals(result_vectorized)}")

print("\n💡 Performance Insights:")
print("• Vectorized operations are fastest for simple math")
print("• Named functions have slight overhead advantage over lambdas")  
print("• Lambda functions are good compromise between speed and readability")
print("• List comprehensions are generally slower than pandas methods")
⚡ Performance Comparison (1,000,000 operations):
--------------------------------------------------
Vectorized operations: 0.002011s (baseline)
Lambda function:       0.263592s (131.1x slower)
Named function:        0.280185s (139.3x slower)
List comprehension:    0.251086s (124.9x slower)

✅ Results identical: False

💡 Performance Insights:
• Vectorized operations are fastest for simple math
• Lambda and named functions perform almost identically inside apply()
• apply() carries large per-element overhead, so reserve it for logic you can't vectorize
• List comprehensions are no better than apply() and far slower than vectorization

8.7.4.1 Lambda Best Practices & Advanced Techniques

Scenario ✅ Good Lambda Usage ❌ Poor Lambda Usage
Simple Operations df['x2'] = df['x'].apply(lambda x: x * 2) df['x2'] = df['x'].apply(lambda x: complex_algorithm(x))
Conditional Logic df['cat'] = df['score'].apply(lambda x: 'High' if x > 80 else 'Low') lambda x: 'A' if x > 90 else 'B' if x > 80 else 'C' if x > 70 else 'D' if x > 60 else 'F'
Multiple Columns df['total'] = df.apply(lambda row: row['price'] * row['qty'], axis=1) Complex calculations with many variables
String Processing df['clean'] = df['name'].apply(lambda x: x.strip().title()) Complex regex operations

8.7.4.2 Lambda Decision Framework

# Use Lambda when:
simple_operation = lambda x: x * 2           # ✅ Simple math
conditional = lambda x: 'A' if x > 5 else 'B'  # ✅ Basic conditions  
string_ops = lambda s: s.upper()            # ✅ Simple string ops

# Avoid Lambda when:
# ❌ Complex logic (use named functions)
# ❌ Multiple steps (break into separate operations)  
# ❌ Reusable logic (define proper functions)
# ❌ Available vectorized operations

8.8 Advanced NLP Preprocessing

For text data stored in object (string) columns, pandas works seamlessly with Python’s built-in string functions.
Beyond that, you can also use the re (regular expressions) module and the NLTK (Natural Language Toolkit) library for more advanced text preprocessing.

These tools are not commonly needed for this sequence course, but they are powerful for tasks such as text cleaning, tokenization, and feature extraction — especially if you choose an NLP-focused topic for your final project.

Below, we’ll demonstrate how to use the re module and NLTK with pandas columns.

8.8.1 The re Module for Regular Expressions

The re module in Python is used for regular expressions, which are powerful tools for text analysis and manipulation. Regular expressions allow you to search, match, and manipulate strings based on specific patterns.

Common Use Cases of re in Text Analysis

  • Finding patterns in text: re.search(r'\d{4}-\d{2}-\d{2}', text) # searches for a date in the format YYYY-MM-DD
  • Extracting specific parts of a string: re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text) # extracts email addresses from the text
  • Replacing parts of a string: re.sub(r'\$\d+', '[price]', text) # replaces any price written as $amount with [price]

8.8.1.0.1 Commonly Used Patterns in re
  • \d: Matches any digit (0-9).
  • \w: Matches any alphanumeric character (letters and numbers).
  • \s: Matches any whitespace character (spaces, tabs, newlines).
  • [a-z]: Matches any lowercase letter from a to z.
  • [A-Z]: Matches any uppercase letter from A to Z.
  • *: Matches 0 or more occurrences of the preceding character.
  • +: Matches 1 or more occurrences of the preceding character.
  • ?: Matches 0 or 1 occurrence of the preceding character.
  • ^: Matches the beginning of a string.
  • $: Matches the end of a string.
  • |: Acts as an OR operator.
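
As a small self-contained illustration of these patterns (the sample sentence and Series below are made up), the sketch applies re.search, re.findall, and re.sub directly, and then the same kind of pattern column-wide through the .str accessor:

import re

text = "Order placed on 2024-03-15 by ana@example.com for $250."

print(re.search(r'\d{4}-\d{2}-\d{2}', text).group())                              # '2024-03-15'
print(re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text))    # ['ana@example.com']
print(re.sub(r'\$\d+', '[price]', text))                                          # price masked

# The same idea works on a whole column via the vectorized .str accessor
notes = pd.Series(["Paid $40 on 2024-01-02", "Refund $15 on 2024-02-10"])
print(notes.str.replace(r'\$\d+', '[price]', regex=True))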

To see these tools in action on real data, let's scrape a table from Wikipedia and then use re to clean it. The helper below wraps requests and pd.read_html with error handling:

import requests

def read_html_tables_from_url(url, match=None, user_agent=None, timeout=10, **kwargs):
    """
    Read HTML tables from a website URL with error handling and customizable options.
    
    Parameters:
    -----------
    url : str
        The website URL to scrape tables from
    match : str, optional
        String or regex pattern to match tables (passed to pd.read_html)
    user_agent : str, optional
        Custom User-Agent header. If None, uses a default Chrome User-Agent
    timeout : int, default 10
        Request timeout in seconds
    **kwargs : dict
        Additional arguments passed to pd.read_html()
    
    Returns:
    --------
    list of DataFrames or None
        List of pandas DataFrames (one per table found), or None if error occurs
    
    Examples:
    ---------
    # Basic usage
    tables = read_html_tables_from_url("https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)_per_capita")
    
    # Filter tables containing specific text
    tables = read_html_tables_from_url(url, match="Country")
    
    # With custom user agent
    tables = read_html_tables_from_url(url, user_agent="MyBot/1.0")
    """
    import pandas as pd
    import requests
    
    # Default User-Agent if none provided
    if user_agent is None:
        user_agent = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    
    headers = {'User-Agent': user_agent}
    
    try:
        print(f"🌐 Fetching data from: {url}")
        
        # Make the HTTP request
        response = requests.get(url, headers=headers, timeout=timeout)
        response.raise_for_status()  # Raise an exception for bad status codes (4xx or 5xx)
        
        print(f"✅ Successfully retrieved webpage (Status: {response.status_code})")
        
        # Parse HTML tables
        read_html_kwargs = {'match': match} if match else {}
        read_html_kwargs.update(kwargs)  # Add any additional arguments
        
        tables = pd.read_html(response.content, **read_html_kwargs)
        
        print(f"📊 Found {len(tables)} table(s) on the page")
        
        # Display table information
        if tables:
            print(f"\n📋 Table Overview:")
            for i, table in enumerate(tables):
                print(f"  Table {i}: {table.shape} (rows × columns)")
                if i >= 4:  # Limit output for readability
                    print(f"  ... and {len(tables) - 5} more tables")
                    break
        
        return tables
        
    except requests.exceptions.HTTPError as err:
        print(f"❌ HTTP Error: {err}")
        return None
    except requests.exceptions.ConnectionError as err:
        print(f"❌ Connection Error: {err}")
        return None
    except requests.exceptions.Timeout as err:
        print(f"❌ Timeout Error: Request took longer than {timeout} seconds")
        return None
    except requests.exceptions.RequestException as err:
        print(f"❌ Request Error: {err}")
        return None
    except ValueError as err:
        print(f"❌ No tables found on the page or parsing error: {err}")
        return None
    except Exception as e:
        print(f"❌ An unexpected error occurred: {e}")
        return None
# Example 1: Basic usage - Get all tables from GDP Wikipedia page
url = 'https://en.wikipedia.org/wiki/List_of_Chicago_Bulls_seasons'
tables = read_html_tables_from_url(url)
ChicagoBulls = tables[2] if tables else None
ChicagoBulls.head()
🌐 Fetching data from: https://en.wikipedia.org/wiki/List_of_Chicago_Bulls_seasons
✅ Successfully retrieved webpage (Status: 200)
📊 Found 18 table(s) on the page

📋 Table Overview:
  Table 0: (13, 2) (rows × columns)
  Table 1: (1, 5) (rows × columns)
  Table 2: (59, 13) (rows × columns)
  Table 3: (3, 4) (rows × columns)
  Table 4: (9, 2) (rows × columns)
  ... and 13 more tables
Season Team Conference Finish Division Finish.1 Wins Losses Win% GB Playoffs Awards Head coach
0 1966–67 1966–67 Western 4th 33 48 0.407 11 Lost Division semifinals (Hawks) 3–0[19] Johnny Kerr (COY)[6] Johnny Kerr
1 1967–68 1967–68 Western 4th 29 53 0.354 27 Lost Division semifinals (Lakers) 4–1[20] Johnny Kerr
2 1968–69 1968–69 Western 5th 33 49 0.402 22 NaN Dick Motta
3 1969–70 1969–70 Western 3rd[c] 39 43 0.476 9 Lost Division semifinals (Hawks) 4–1[22] Dick Motta
4 1970–71 1970–71 Western 3rd Midwest[d] 2nd 51 31 0.622 2 Lost conference semifinals (Lakers) 4–3[23] Dick Motta (COY)[6] Dick Motta
# Remove all characters between square brackets (including the brackets themselves)
# from the Finish, Division, and Finish.1 columns
import re
def remove_brackets(x):
    # Non-greedy regex: strips footnote markers such as "[c]" or "[19]"
    return re.sub(r'\[.*?\]', '', x)

# Apply the function to each column separately using map
ChicagoBulls['Division'] = ChicagoBulls['Division'].map(remove_brackets)
ChicagoBulls['Finish'] = ChicagoBulls['Finish'].map(remove_brackets)
ChicagoBulls['Finish.1'] = ChicagoBulls['Finish.1'].map(remove_brackets)

ChicagoBulls.head()
Season Team Conference Finish Division Finish.1 Wins Losses Win% GB Playoffs Awards Head coach
0 1966–67 1966–67 Western 4th 33 48 0.407 11 Lost Division semifinals (Hawks) 3–0[19] Johnny Kerr (COY)[6] Johnny Kerr
1 1967–68 1967–68 Western 4th 29 53 0.354 27 Lost Division semifinals (Lakers) 4–1[20] Johnny Kerr
2 1968–69 1968–69 Western 5th 33 49 0.402 22 NaN Dick Motta
3 1969–70 1969–70 Western 3rd 39 43 0.476 9 Lost Division semifinals (Hawks) 4–1[22] Dick Motta
4 1970–71 1970–71 Western 3rd Midwest 2nd 51 31 0.622 2 Lost conference semifinals (Lakers) 4–3[23] Dick Motta (COY)[6] Dick Motta

8.8.2 The NLTK Library for Advanced NLP

8.8.2.1 What is NLTK?

NLTK (Natural Language Toolkit) is a comprehensive Python library for natural language processing (NLP) and text analysis. It offers tools for tokenization, preprocessing, part-of-speech tagging, parsing, named-entity recognition, sentiment analysis, and more—useful from classroom demos to research prototypes.

8.8.2.2 Why NLTK Matters in Data Science

Application Domain NLTK Use Cases Business Impact
Social Media Analytics Sentiment, subjectivity, emotion, polarity detection Customer insights, brand monitoring
Customer Support Text classification, intent detection Automation, faster resolution
Content Analysis Keyword extraction, topic modeling, summarization SEO optimization, content strategy
Research & Academia Corpus analysis, linguistic parsing, concordance tools Rigorous analysis, reproducible studies
E-commerce Review mining, aspect sentiment, recommendation signals

8.8.2.3 💻 NLTK Installation and Setup

Let’s start by installing NLTK and downloading the essential datasets:

pip install nltk
Collecting nltk
  Downloading nltk-3.9.2-py3-none-any.whl.metadata (3.2 kB)
Collecting click (from nltk)
  Downloading click-8.3.0-py3-none-any.whl.metadata (2.6 kB)
Collecting joblib (from nltk)
  Using cached joblib-1.5.2-py3-none-any.whl.metadata (5.6 kB)
Collecting regex>=2021.8.3 (from nltk)
  Using cached regex-2025.9.18-cp312-cp312-win_amd64.whl.metadata (41 kB)
Collecting tqdm (from nltk)
  Using cached tqdm-4.67.1-py3-none-any.whl.metadata (57 kB)
Requirement already satisfied: colorama in c:\users\lsi8012\onedrive - northwestern university\fa25\stat303-1\workspace\.venv\lib\site-packages (from click->nltk) (0.4.6)
Downloading nltk-3.9.2-py3-none-any.whl (1.5 MB)
Using cached regex-2025.9.18-cp312-cp312-win_amd64.whl (275 kB)
Downloading click-8.3.0-py3-none-any.whl (107 kB)
Using cached joblib-1.5.2-py3-none-any.whl (308 kB)
Using cached tqdm-4.67.1-py3-none-any.whl (78 kB)
Installing collected packages: tqdm, regex, joblib, click, nltk

Successfully installed click-8.3.0 joblib-1.5.2 nltk-3.9.2 regex-2025.9.18 tqdm-4.67.1
Note: you may need to restart the kernel to use updated packages.
WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ProtocolError('Connection aborted.', ConnectionResetError(10054, 'An existing connection was forcibly closed by the remote host', None, 10054, None))': /simple/nltk/
# 🚀 NLTK Installation and Essential Downloads

# Install NLTK (uncomment if not already installed)
# !pip install nltk

try:
    import nltk
    print("✅ NLTK is already installed")
    
    # Download essential datasets (this might take a few minutes first time)
    print("📦 Downloading essential NLTK data...")
    
    # Download essential components
    essential_downloads = [
        'punkt',           # Sentence tokenizer
        'stopwords',       # Stop words for multiple languages  
        'vader_lexicon',   # Sentiment analysis
        'averaged_perceptron_tagger',  # POS tagger
        'wordnet',         # WordNet lexical database
        'omw-1.4',         # Open Multilingual Wordnet
        'movie_reviews',   # Sample movie reviews for sentiment
    ]
    
    for item in essential_downloads:
        try:
            # Note: nltk.download() usually returns False instead of raising when a
            # download fails, so a failure may still print "✓" here and only surface
            # as an [nltk_data] error message in the console
            nltk.download(item, quiet=True)
            print(f"  ✓ Downloaded: {item}")
        except Exception:
            print(f"  ⚠️ Could not download: {item}")
    
    print("\n🎯 NLTK Setup Complete!")
    print("Ready for natural language processing!")
    
except ImportError:
    print("❌ NLTK not found. Please install with: pip install nltk")
except Exception as e:
    print(f"⚠️ Setup issue: {e}")
    print("You may need internet connection for first-time downloads")
✅ NLTK is already installed
📦 Downloading essential NLTK data...
  ✓ Downloaded: punkt
  ✓ Downloaded: stopwords
  ✓ Downloaded: vader_lexicon
  ✓ Downloaded: averaged_perceptron_tagger
  ✓ Downloaded: wordnet
  ✓ Downloaded: omw-1.4
  ✓ Downloaded: movie_reviews

🎯 NLTK Setup Complete!
Ready for natural language processing!
[nltk_data] Error downloading 'vader_lexicon' from
[nltk_data]     <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data]     pages/packages/sentiment/vader_lexicon.zip>:
[nltk_data]     <urlopen error [WinError 10054] An existing connection
[nltk_data]     was forcibly closed by the remote host>
[nltk_data] Error downloading 'averaged_perceptron_tagger' from
[nltk_data]     <https://raw.githubusercontent.com/nltk/nltk_data/gh-p
[nltk_data]     ages/packages/taggers/averaged_perceptron_tagger.zip>:
[nltk_data]     <urlopen error [WinError 10054] An existing connection
[nltk_data]     was forcibly closed by the remote host>
[nltk_data] Error downloading 'omw-1.4' from
[nltk_data]     <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data]     pages/packages/corpora/omw-1.4.zip>:   <urlopen error
[nltk_data]     [WinError 10054] An existing connection was forcibly
[nltk_data]     closed by the remote host>
[nltk_data] Error downloading 'movie_reviews' from
[nltk_data]     <https://raw.githubusercontent.com/nltk/nltk_data/gh-
[nltk_data]     pages/packages/corpora/movie_reviews.zip>:   <urlopen
[nltk_data]     error [WinError 10054] An existing connection was
[nltk_data]     forcibly closed by the remote host>

8.8.2.4 Tokenization: Breaking Text into Pieces

Tokenization is the foundation of NLP - splitting text into meaningful units (words, sentences, or phrases).

# 🔤 Tokenization Examples: Word and Sentence Level
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.tokenize import RegexpTokenizer, WhitespaceTokenizer
import pandas as pd

# Sample text for demonstration
sample_text = """
Natural Language Processing (NLP) is amazing! It helps computers understand human language.
Companies like Google, Amazon, and Microsoft use NLP for various applications.
Did you know that NLTK has been around since 2001? It's a powerful toolkit.
Let's explore tokenization: word-level, sentence-level, and custom patterns.
"""

print("📝 Original Text:")
print(sample_text.strip())

print("\n" + "="*60)
print("🔤 TOKENIZATION EXAMPLES")
print("="*60)

# 1. SENTENCE TOKENIZATION
sentences = sent_tokenize(sample_text)
print(f"\n📄 Sentence Tokenization ({len(sentences)} sentences):")
for i, sentence in enumerate(sentences, 1):
    print(f"  {i}. {sentence.strip()}")

# 2. WORD TOKENIZATION  
words = word_tokenize(sample_text)
print(f"\n🔤 Word Tokenization ({len(words)} tokens):")
print("First 15 tokens:", words[:15])

# 3. CUSTOM TOKENIZATION
# Only alphabetic tokens (removes punctuation and numbers)
alpha_tokenizer = RegexpTokenizer(r'[a-zA-Z]+')
alpha_tokens = alpha_tokenizer.tokenize(sample_text)
print(f"\n🔤 Alphabetic Only ({len(alpha_tokens)} tokens):")
print("First 15 tokens:", alpha_tokens[:15])

# 4. WHITESPACE TOKENIZATION
whitespace_tokenizer = WhitespaceTokenizer()
whitespace_tokens = whitespace_tokenizer.tokenize(sample_text)
print(f"\n🔤 Whitespace Tokenization ({len(whitespace_tokens)} tokens):")
print("First 10 tokens:", whitespace_tokens[:10])

# Create DataFrame for comparison
tokenization_comparison = pd.DataFrame({
    'Method': ['Sentences', 'Words (with punct.)', 'Alphabetic only', 'Whitespace'],
    'Token_Count': [len(sentences), len(words), len(alpha_tokens), len(whitespace_tokens)],
    'Sample_Tokens': [
        [s[:30] + "..." if len(s) > 30 else s for s in sentences[:2]],
        words[:3],
        alpha_tokens[:3], 
        whitespace_tokens[:3]
    ]
})

print(f"\n📊 Tokenization Comparison:")
print(tokenization_comparison.to_string(index=False))
📝 Original Text:
Natural Language Processing (NLP) is amazing! It helps computers understand human language.
Companies like Google, Amazon, and Microsoft use NLP for various applications.
Did you know that NLTK has been around since 2001? It's a powerful toolkit.
Let's explore tokenization: word-level, sentence-level, and custom patterns.

============================================================
🔤 TOKENIZATION EXAMPLES
============================================================

📄 Sentence Tokenization (6 sentences):
  1. Natural Language Processing (NLP) is amazing!
  2. It helps computers understand human language.
  3. Companies like Google, Amazon, and Microsoft use NLP for various applications.
  4. Did you know that NLTK has been around since 2001?
  5. It's a powerful toolkit.
  6. Let's explore tokenization: word-level, sentence-level, and custom patterns.

🔤 Word Tokenization (60 tokens):
First 15 tokens: ['Natural', 'Language', 'Processing', '(', 'NLP', ')', 'is', 'amazing', '!', 'It', 'helps', 'computers', 'understand', 'human', 'language']

🔤 Alphabetic Only (48 tokens):
First 15 tokens: ['Natural', 'Language', 'Processing', 'NLP', 'is', 'amazing', 'It', 'helps', 'computers', 'understand', 'human', 'language', 'Companies', 'like', 'Google']

🔤 Whitespace Tokenization (45 tokens):
First 10 tokens: ['Natural', 'Language', 'Processing', '(NLP)', 'is', 'amazing!', 'It', 'helps', 'computers', 'understand']

📊 Tokenization Comparison:
             Method  Token_Count                                                           Sample_Tokens
          Sentences            6 [\nNatural Language Processing (..., It helps computers understand ...]
Words (with punct.)           60                                         [Natural, Language, Processing]
    Alphabetic only           48                                         [Natural, Language, Processing]
         Whitespace           45                                         [Natural, Language, Processing]

8.8.2.5 Text Preprocessing: Stop Words and Cleaning

Stop words are common words that typically don’t carry significant meaning and are often removed to focus on important content.

# 🧹 Stop Words and Text Cleaning
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import pandas as pd

# Sample product reviews for analysis
product_reviews = [
    "This product is absolutely amazing! I love it so much.",
    "The quality is terrible. I want my money back immediately.",  
    "It's okay, nothing special but does the job well enough.",
    "Fantastic value for money! Highly recommend to everyone.",
    "Poor customer service and the product broke after one day."
]

print("🛍️ Sample Product Reviews:")
for i, review in enumerate(product_reviews, 1):
    print(f"  {i}. {review}")

print("\n" + "="*60)
print("🧹 TEXT PREPROCESSING PIPELINE")
print("="*60)

# Get English stop words
stop_words = set(stopwords.words('english'))
print(f"\n📋 English Stop Words ({len(stop_words)} words):")
print("Sample stop words:", sorted(list(stop_words))[:20])

def preprocess_text(text):
    """Complete text preprocessing pipeline"""
    
    # 1. Convert to lowercase
    text = text.lower()
    
    # 2. Tokenize
    tokens = word_tokenize(text)
    
    # 3. Remove punctuation and non-alphabetic tokens
    tokens = [token for token in tokens if token.isalpha()]
    
    # 4. Remove stop words
    filtered_tokens = [token for token in tokens if token not in stop_words]
    
    return tokens, filtered_tokens

# Process all reviews
processed_reviews = []

print(f"\n🔍 Processing Each Review:")
for i, review in enumerate(product_reviews, 1):
    original_tokens, filtered_tokens = preprocess_text(review)
    
    processed_reviews.append({
        'Review_ID': f'Review_{i}',
        'Original_Text': review,
        'Original_Tokens': len(original_tokens),
        'After_Cleaning': len(filtered_tokens),
        'Tokens_Removed': len(original_tokens) - len(filtered_tokens),
        'Key_Words': filtered_tokens[:5]  # First 5 meaningful words
    })
    
    print(f"\n  📝 Review {i}:")
    print(f"     Original: {review}")
    print(f"     Key words: {filtered_tokens[:8]}")
    print(f"     Tokens: {len(original_tokens)}{len(filtered_tokens)} (-{len(original_tokens) - len(filtered_tokens)})")

# Create summary DataFrame
df_preprocessing = pd.DataFrame(processed_reviews)

print(f"\n📊 Preprocessing Summary:")
print(df_preprocessing[['Review_ID', 'Original_Tokens', 'After_Cleaning', 'Tokens_Removed']].to_string(index=False))

print(f"\n📈 Overall Statistics:")
total_original = df_preprocessing['Original_Tokens'].sum()
total_cleaned = df_preprocessing['After_Cleaning'].sum()
print(f"• Total tokens before cleaning: {total_original}")
print(f"• Total tokens after cleaning: {total_cleaned}")
print(f"• Tokens removed: {total_original - total_cleaned} ({((total_original - total_cleaned)/total_original)*100:.1f}%)")
print(f"• Average tokens per review (cleaned): {total_cleaned/len(product_reviews):.1f}")
🛍️ Sample Product Reviews:
  1. This product is absolutely amazing! I love it so much.
  2. The quality is terrible. I want my money back immediately.
  3. It's okay, nothing special but does the job well enough.
  4. Fantastic value for money! Highly recommend to everyone.
  5. Poor customer service and the product broke after one day.

============================================================
🧹 TEXT PREPROCESSING PIPELINE
============================================================

📋 English Stop Words (198 words):
Sample stop words: ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an', 'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']

🔍 Processing Each Review:

  📝 Review 1:
     Original: This product is absolutely amazing! I love it so much.
     Key words: ['product', 'absolutely', 'amazing', 'love', 'much']
     Tokens: 10 → 5 (-5)

  📝 Review 2:
     Original: The quality is terrible. I want my money back immediately.
     Key words: ['quality', 'terrible', 'want', 'money', 'back', 'immediately']
     Tokens: 10 → 6 (-4)

  📝 Review 3:
     Original: It's okay, nothing special but does the job well enough.
     Key words: ['okay', 'nothing', 'special', 'job', 'well', 'enough']
     Tokens: 10 → 6 (-4)

  📝 Review 4:
     Original: Fantastic value for money! Highly recommend to everyone.
     Key words: ['fantastic', 'value', 'money', 'highly', 'recommend', 'everyone']
     Tokens: 8 → 6 (-2)

  📝 Review 5:
     Original: Poor customer service and the product broke after one day.
     Key words: ['poor', 'customer', 'service', 'product', 'broke', 'one', 'day']
     Tokens: 10 → 7 (-3)

📊 Preprocessing Summary:
Review_ID  Original_Tokens  After_Cleaning  Tokens_Removed
 Review_1               10               5               5
 Review_2               10               6               4
 Review_3               10               6               4
 Review_4                8               6               2
 Review_5               10               7               3

📈 Overall Statistics:
• Total tokens before cleaning: 48
• Total tokens after cleaning: 30
• Tokens removed: 18 (37.5%)
• Average tokens per review (cleaned): 6.0
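
Because this chapter is about transformations, the same pipeline plugs directly into pandas. The short sketch below is illustrative only (it assumes the preprocess_text function and df_preprocessing defined above): it adds a cleaned-text column with apply() and a lambda.

# 🧪 Illustrative sketch: reuse preprocess_text() as a column-wise transformation
df_preprocessing['Clean_Text'] = df_preprocessing['Original_Text'].apply(
    lambda text: ' '.join(preprocess_text(text)[1])   # join only the stop-word-filtered tokens
)
print(df_preprocessing[['Review_ID', 'Clean_Text']].to_string(index=False))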

8.8.2.6 Stemming and Lemmatization: Finding Word Roots

Both stemming and lemmatization reduce words to their base form, but they work differently:

  • Stemming: fast, rule-based suffix chopping (running → run)
  • Lemmatization: slower, dictionary-based reduction to a real word (better → good)

# 🌱 Stemming vs Lemmatization Comparison
from nltk.stem import PorterStemmer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
import pandas as pd

# Initialize stemmers and lemmatizer
porter = PorterStemmer()
snowball = SnowballStemmer('english')  
lemmatizer = WordNetLemmatizer()

# Test words that demonstrate differences
test_words = [
    'running', 'runs', 'ran', 'runner',        # Run variations
    'better', 'good', 'best',                  # Good variations  
    'dogs', 'cats', 'children', 'geese',      # Plural forms
    'walking', 'walked', 'walks',             # Walk variations
    'flying', 'flies', 'flew',                # Fly variations
    'studies', 'studying', 'studied',         # Study variations
    'beautiful', 'beautifully',               # Adjective/adverb
    'happiness', 'happier', 'happiest'        # Happy variations
]

print("🌱 STEMMING vs LEMMATIZATION COMPARISON")
print("="*60)

# Process each word
results = []
for word in test_words:
    porter_stem = porter.stem(word)
    snowball_stem = snowball.stem(word) 
    lemma = lemmatizer.lemmatize(word)
    lemma_verb = lemmatizer.lemmatize(word, pos='v')  # As verb
    
    results.append({
        'Original': word,
        'Porter_Stem': porter_stem,
        'Snowball_Stem': snowball_stem,
        'Lemma_Noun': lemma,
        'Lemma_Verb': lemma_verb
    })

# Create comparison DataFrame
df_comparison = pd.DataFrame(results)

print("📊 Detailed Comparison:")
print(df_comparison.to_string(index=False))

print(f"\n🎯 Key Differences Analysis:")

# Count unique results for each method
porter_unique = df_comparison['Porter_Stem'].nunique()
snowball_unique = df_comparison['Snowball_Stem'].nunique() 
lemma_unique = df_comparison['Lemma_Noun'].nunique()

print(f"• Porter Stemmer: {porter_unique} unique stems")
print(f"• Snowball Stemmer: {snowball_unique} unique stems") 
print(f"• WordNet Lemmatizer: {lemma_unique} unique lemmas")

print(f"\n💡 Notable Examples:")

# Find interesting cases where methods differ significantly
interesting_cases = [
    ('better', 'good'),   # Lemmatization handles irregular forms
    ('studies', 'study'), # Different handling of -ies ending
    ('flying', 'fly'),    # Verb vs noun treatment
    ('geese', 'goose')    # Irregular plurals
]

for original, expected in interesting_cases:
    if original in test_words:
        row = df_comparison[df_comparison['Original'] == original].iloc[0]
        print(f"• '{original}' → Porter: '{row['Porter_Stem']}', Lemma: '{row['Lemma_Noun']}' (expected: '{expected}')")

# Real-world application example
sample_sentence = "The children were running happily through the beautiful gardens, studying the flying geese."
words_to_process = sample_sentence.lower().replace(',', '').replace('.', '').split()

print(f"\n🔬 Real-world Example:")
print(f"Sentence: {sample_sentence}")
print(f"\nWord processing:")

sentence_results = []
for word in words_to_process:
    if word.isalpha():  # Skip punctuation
        stem = porter.stem(word)
        lemma = lemmatizer.lemmatize(word)
        sentence_results.append(f"{word} → stem: {stem}, lemma: {lemma}")

for result in sentence_results[:6]:  # Show first 6 words
    print(f"  {result}")

print(f"\n⚡ Performance Guidelines:")
print("• Use STEMMING when: Speed is crucial, approximate matching is okay")
print("• Use LEMMATIZATION when: Accuracy is important, working with formal text")
print("• Stemming is ~10x faster but less accurate than lemmatization")
🌱 STEMMING vs LEMMATIZATION COMPARISON
============================================================
📊 Detailed Comparison:
   Original Porter_Stem Snowball_Stem  Lemma_Noun  Lemma_Verb
    running         run           run     running         run
       runs         run           run         run         run
        ran         ran           ran         ran         run
     runner      runner        runner      runner      runner
     better      better        better      better      better
       good        good          good        good        good
       best        best          best        best        best
       dogs         dog           dog         dog         dog
       cats         cat           cat         cat         cat
   children    children      children       child    children
      geese        gees          gees       goose       geese
    walking        walk          walk     walking        walk
     walked        walk          walk      walked        walk
      walks        walk          walk        walk        walk
     flying         fli           fli      flying         fly
      flies         fli           fli         fly         fly
       flew        flew          flew        flew         fly
    studies       studi         studi       study       study
   studying       studi         studi    studying       study
    studied       studi         studi     studied       study
  beautiful      beauti        beauti   beautiful   beautiful
beautifully      beauti        beauti beautifully beautifully
  happiness       happi         happi   happiness   happiness
    happier     happier       happier     happier     happier
   happiest    happiest      happiest    happiest    happiest

🎯 Key Differences Analysis:
• Porter Stemmer: 18 unique stems
• Snowball Stemmer: 18 unique stems
• WordNet Lemmatizer: 25 unique lemmas

💡 Notable Examples:
• 'better' → Porter: 'better', Lemma: 'better' (expected: 'good')
• 'studies' → Porter: 'studi', Lemma: 'study' (expected: 'study')
• 'flying' → Porter: 'fli', Lemma: 'flying' (expected: 'fly')
• 'geese' → Porter: 'gees', Lemma: 'goose' (expected: 'goose')

🔬 Real-world Example:
Sentence: The children were running happily through the beautiful gardens, studying the flying geese.

Word processing:
  the → stem: the, lemma: the
  children → stem: children, lemma: child
  were → stem: were, lemma: were
  running → stem: run, lemma: running
  happily → stem: happili, lemma: happily
  through → stem: through, lemma: through

⚡ Performance Guidelines:
• Use STEMMING when: Speed is crucial, approximate matching is okay
• Use LEMMATIZATION when: Accuracy is important, working with formal text
• Stemming is ~10x faster but less accurate than lemmatization

8.8.2.7 Sentiment Analysis: Understanding Emotions

NLTK provides tools for sentiment analysis, i.e., determining whether a piece of text expresses positive, negative, or neutral emotion. The example below uses VADER, a lexicon-based analyzer shipped with NLTK.

# 📊 Sentiment Analysis with NLTK VADER
import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download('vader_lexicon')

# Initialize VADER sentiment analyzer
analyzer = SentimentIntensityAnalyzer()

# Diverse sample texts for sentiment analysis
sample_texts = [
    # Positive examples
    "This product is absolutely amazing! I love everything about it!",
    "Fantastic customer service and fast delivery. Highly recommend!",
    "Great value for money. Very satisfied with my purchase.",
    
    # Negative examples  
    "Terrible quality, waste of money. Very disappointed.",
    "Poor customer service and the product broke immediately.",
    "Worst purchase I've ever made. Completely useless.",
    
    # Neutral examples
    "The product arrived on time. It's okay, nothing special.",
    "Standard quality for the price. Does what it's supposed to do.",
    "Average product with typical features. Works as expected.",
    
    # Mixed sentiment
    "Great design but poor durability. Mixed feelings about this.",
    "Love the features but hate the price. It's complicated.",
]

print("📊 SENTIMENT ANALYSIS WITH NLTK VADER")
print("="*60)

# Analyze sentiment for each text
sentiment_results = []

for i, text in enumerate(sample_texts, 1):
    # Get sentiment scores
    scores = analyzer.polarity_scores(text)
    
    # Determine overall sentiment
    if scores['compound'] >= 0.05:
        overall = 'Positive'
        emoji = '😊'
    elif scores['compound'] <= -0.05:
        overall = 'Negative'  
        emoji = '😞'
    else:
        overall = 'Neutral'
        emoji = '😐'
    
    sentiment_results.append({
        'Text_ID': f'Text_{i:02d}',
        'Text': text[:50] + "..." if len(text) > 50 else text,
        'Positive': round(scores['pos'], 3),
        'Neutral': round(scores['neu'], 3), 
        'Negative': round(scores['neg'], 3),
        'Compound': round(scores['compound'], 3),
        'Sentiment': overall,
        'Emoji': emoji,
        'Full_Text': text
    })
    
    print(f"{emoji} Text {i}: {overall} (compound: {scores['compound']:.3f})")
    print(f"   \"{text}\"")
    print(f"   Pos: {scores['pos']:.2f}, Neu: {scores['neu']:.2f}, Neg: {scores['neg']:.2f}")
    print()

# Create DataFrame for analysis
df_sentiment = pd.DataFrame(sentiment_results)

print("📈 SENTIMENT ANALYSIS SUMMARY")
print("="*40)

# Overall statistics
sentiment_counts = df_sentiment['Sentiment'].value_counts()
print("📊 Sentiment Distribution:")
for sentiment, count in sentiment_counts.items():
    percentage = (count / len(df_sentiment)) * 100
    print(f"  {sentiment}: {count} texts ({percentage:.1f}%)")

print(f"\n📏 Score Ranges:")
print(f"• Positive scores: {df_sentiment['Positive'].min():.3f} - {df_sentiment['Positive'].max():.3f}")
print(f"• Negative scores: {df_sentiment['Negative'].min():.3f} - {df_sentiment['Negative'].max():.3f}")
print(f"• Compound scores: {df_sentiment['Compound'].min():.3f} - {df_sentiment['Compound'].max():.3f}")

# Show most extreme examples
print(f"\n🎯 Most Extreme Examples:")
most_positive = df_sentiment.loc[df_sentiment['Compound'].idxmax()]
most_negative = df_sentiment.loc[df_sentiment['Compound'].idxmin()]

print(f"😊 Most Positive (score: {most_positive['Compound']:.3f}):")
print(f"   \"{most_positive['Full_Text']}\"")
print(f"😞 Most Negative (score: {most_negative['Compound']:.3f}):")
print(f"   \"{most_negative['Full_Text']}\"")

# Business application example
print(f"\n💼 Business Application Example:")
print("This could be used for:")
print("• Customer review analysis")
print("• Social media monitoring") 
print("• Product feedback categorization")
print("• Brand sentiment tracking")
print("• Customer support prioritization")

# Display summary table
print(f"\n📋 Detailed Results:")
display_df = df_sentiment[['Text_ID', 'Text', 'Compound', 'Sentiment', 'Emoji']].copy()
print(display_df.to_string(index=False))
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\lsi8012\AppData\Roaming\nltk_data...
📊 SENTIMENT ANALYSIS WITH NLTK VADER
============================================================
😊 Text 1: Positive (compound: 0.879)
   "This product is absolutely amazing! I love everything about it!"
   Pos: 0.57, Neu: 0.43, Neg: 0.00

😊 Text 2: Positive (compound: 0.771)
   "Fantastic customer service and fast delivery. Highly recommend!"
   Pos: 0.53, Neu: 0.47, Neg: 0.00

😊 Text 3: Positive (compound: 0.862)
   "Great value for money. Very satisfied with my purchase."
   Pos: 0.61, Neu: 0.39, Neg: 0.00

😞 Text 4: Negative (compound: -0.852)
   "Terrible quality, waste of money. Very disappointed."
   Pos: 0.00, Neu: 0.30, Neg: 0.70

😞 Text 5: Negative (compound: -0.710)
   "Poor customer service and the product broke immediately."
   Pos: 0.00, Neu: 0.50, Neg: 0.50

😞 Text 6: Negative (compound: -0.802)
   "Worst purchase I've ever made. Completely useless."
   Pos: 0.00, Neu: 0.41, Neg: 0.59

😞 Text 7: Negative (compound: -0.092)
   "The product arrived on time. It's okay, nothing special."
   Pos: 0.17, Neu: 0.63, Neg: 0.20

😐 Text 8: Neutral (compound: 0.000)
   "Standard quality for the price. Does what it's supposed to do."
   Pos: 0.00, Neu: 1.00, Neg: 0.00

😐 Text 9: Neutral (compound: 0.000)
   "Average product with typical features. Works as expected."
   Pos: 0.00, Neu: 1.00, Neg: 0.00

😞 Text 10: Negative (compound: -0.382)
   "Great design but poor durability. Mixed feelings about this."
   Pos: 0.19, Neu: 0.51, Neg: 0.30

😞 Text 11: Negative (compound: -0.535)
   "Love the features but hate the price. It's complicated."
   Pos: 0.18, Neu: 0.48, Neg: 0.34

📈 SENTIMENT ANALYSIS SUMMARY
========================================
📊 Sentiment Distribution:
  Negative: 6 texts (54.5%)
  Positive: 3 texts (27.3%)
  Neutral: 2 texts (18.2%)

📏 Score Ranges:
• Positive scores: 0.000 - 0.615
• Negative scores: 0.000 - 0.699
• Compound scores: -0.852 - 0.879

🎯 Most Extreme Examples:
😊 Most Positive (score: 0.879):
   "This product is absolutely amazing! I love everything about it!"
😞 Most Negative (score: -0.852):
   "Terrible quality, waste of money. Very disappointed."

💼 Business Application Example:
This could be used for:
• Customer review analysis
• Social media monitoring
• Product feedback categorization
• Brand sentiment tracking
• Customer support prioritization

📋 Detailed Results:
Text_ID                                                  Text  Compound Sentiment Emoji
Text_01 This product is absolutely amazing! I love everyth...     0.879  Positive     😊
Text_02 Fantastic customer service and fast delivery. High...     0.771  Positive     😊
Text_03 Great value for money. Very satisfied with my purc...     0.862  Positive     😊
Text_04 Terrible quality, waste of money. Very disappointe...    -0.852  Negative     😞
Text_05 Poor customer service and the product broke immedi...    -0.710  Negative     😞
Text_06    Worst purchase I've ever made. Completely useless.    -0.802  Negative     😞
Text_07 The product arrived on time. It's okay, nothing sp...    -0.092  Negative     😞
Text_08 Standard quality for the price. Does what it's sup...     0.000   Neutral     😐
Text_09 Average product with typical features. Works as ex...     0.000   Neutral     😐
Text_10 Great design but poor durability. Mixed feelings a...    -0.382  Negative     😞
Text_11 Love the features but hate the price. It's complic...    -0.535  Negative     😞

8.8.2.8 Part-of-Speech (POS) Tagging & Named Entity Recognition

POS tagging identifies grammatical roles of words (noun, verb, adjective), while Named Entity Recognition (NER) finds important entities like names, places, and organizations.

# 🏷️ POS Tagging and Named Entity Recognition
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
from nltk.tree import Tree
nltk.download('averaged_perceptron_tagger_eng')
nltk.download('maxent_ne_chunker_tab')
nltk.download('words')

import pandas as pd

# Sample business-related text
business_text = """
Apple Inc. is planning to open a new headquarters in Austin, Texas next year.
CEO Tim Cook announced this during a meeting with investors on Wall Street.
The company expects to hire 5,000 new employees, including software engineers 
and data scientists. Microsoft and Google are also expanding their operations
in the region. The project will cost approximately $1 billion dollars.
"""

print("🏷️ POS TAGGING & NAMED ENTITY RECOGNITION")
print("="*60)

print("📝 Sample Text:")
print(business_text.strip())

# 1. POS TAGGING
print(f"\n🔤 PART-OF-SPEECH TAGGING:")
tokens = word_tokenize(business_text)
pos_tags = pos_tag(tokens)

# Group by POS types for analysis
pos_groups = {}
for word, pos in pos_tags:
    if pos not in pos_groups:
        pos_groups[pos] = []
    pos_groups[pos].append(word)

print(f"Found {len(set([pos for _, pos in pos_tags]))} different POS tags:")

# Show most common POS categories
pos_description = {
    'NN': 'Noun (singular)',
    'NNS': 'Noun (plural)', 
    'NNP': 'Proper noun (singular)',
    'NNPS': 'Proper noun (plural)',
    'VB': 'Verb (base form)',
    'VBD': 'Verb (past tense)',
    'VBG': 'Verb (gerund/present participle)',
    'VBN': 'Verb (past participle)',
    'VBP': 'Verb (present, not 3rd person singular)',
    'VBZ': 'Verb (present, 3rd person singular)',
    'JJ': 'Adjective',
    'JJR': 'Adjective (comparative)', 
    'JJS': 'Adjective (superlative)',
    'RB': 'Adverb',
    'DT': 'Determiner',
    'IN': 'Preposition/subordinating conjunction',
    'TO': 'to',
    'CD': 'Cardinal number'
}

for pos in sorted(pos_groups.keys()):
    if pos in pos_description:
        example_words = pos_groups[pos][:3]  # Show first 3 examples
        print(f"  {pos} ({pos_description[pos]}): {example_words}")

# 2. NAMED ENTITY RECOGNITION  
print(f"\n🏢 NAMED ENTITY RECOGNITION:")
named_entities = ne_chunk(pos_tags)

# Extract entities
entities = []
def extract_entities(tree):
    for subtree in tree:
        if isinstance(subtree, Tree):  # It's a named entity
            entity_name = ' '.join([token for token, pos in subtree.leaves()])
            entity_label = subtree.label()
            entities.append((entity_name, entity_label))
            
extract_entities(named_entities)

if entities:
    print("🎯 Detected Entities:")
    entity_types = {}
    for entity, label in entities:
        if label not in entity_types:
            entity_types[label] = []
        entity_types[label].append(entity)
    
    for label, entity_list in entity_types.items():
        print(f"  {label}: {entity_list}")
else:
    print("  No named entities detected with basic NER")

# 3. MANUAL ENTITY EXTRACTION (for demonstration)
print(f"\n🔍 MANUAL ENTITY IDENTIFICATION:")

# Find potential entities using POS tags
companies = [word for word, pos in pos_tags if pos == 'NNP' and word in ['Apple', 'Microsoft', 'Google']]
people = [word for word, pos in pos_tags if pos == 'NNP' and word in ['Tim', 'Cook']]
locations = [word for word, pos in pos_tags if pos == 'NNP' and word in ['Austin', 'Texas', 'Wall', 'Street']]
numbers = [word for word, pos in pos_tags if pos == 'CD']

manual_entities = {
    'Companies': companies,
    'People': people,
    'Locations': locations, 
    'Numbers': numbers
}

for category, items in manual_entities.items():
    if items:
        print(f"  {category}: {items}")

# 4. BUSINESS APPLICATIONS
print(f"\n💼 BUSINESS APPLICATIONS:")
print("POS Tagging can be used for:")
print("• Information extraction from documents")
print("• Improving search relevance")  
print("• Text summarization")
print("• Grammar checking tools")
print("• Keyword extraction")

print("\nNamed Entity Recognition can be used for:")
print("• Customer relationship management")
print("• Compliance and regulatory analysis")
print("• News and social media monitoring")
print("• Document classification")
print("• Knowledge graph construction")

# Create summary DataFrame
word_analysis = []
for word, pos in pos_tags[:15]:  # First 15 words
    if word.isalpha():  # Skip punctuation
        is_entity = any(word in entity.split() for entity, _ in entities)  # match whole words, not substrings
        word_analysis.append({
            'Word': word,
            'POS_Tag': pos,
            'POS_Description': pos_description.get(pos, 'Other'),
            'Is_Entity': is_entity
        })

df_analysis = pd.DataFrame(word_analysis)
print(f"\n📋 Sample Word Analysis:")
print(df_analysis.to_string(index=False))
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     C:\Users\lsi8012\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker_tab to
[nltk_data]     C:\Users\lsi8012\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker_tab is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\lsi8012\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\words.zip.
🏷️ POS TAGGING & NAMED ENTITY RECOGNITION
============================================================
📝 Sample Text:
Apple Inc. is planning to open a new headquarters in Austin, Texas next year.
CEO Tim Cook announced this during a meeting with investors on Wall Street.
The company expects to hire 5,000 new employees, including software engineers 
and data scientists. Microsoft and Google are also expanding their operations
in the region. The project will cost approximately $1 billion dollars.

🔤 PART-OF-SPEECH TAGGING:
Found 20 different POS tags:
  CD (Cardinal number): ['5,000', '1', 'billion']
  DT (Determiner): ['a', 'this', 'a']
  IN (Preposition/subordinating conjunction): ['in', 'during', 'with']
  JJ (Adjective): ['new', 'next', 'new']
  NN (Noun (singular)): ['headquarters', 'year', 'meeting']
  NNP (Proper noun (singular)): ['Apple', 'Inc.', 'Austin']
  NNS (Noun (plural)): ['investors', 'employees', 'engineers']
  RB (Adverb): ['also', 'approximately']
  TO (to): ['to', 'to']
  VB (Verb (base form)): ['open', 'hire', 'cost']
  VBD (Verb (past tense)): ['announced']
  VBG (Verb (gerund/present participle)): ['planning', 'including', 'expanding']
  VBP (Verb (present, not 3rd person singular)): ['are']
  VBZ (Verb (present, 3rd person singular)): ['is', 'expects']

🏢 NAMED ENTITY RECOGNITION:
🎯 Detected Entities:
  PERSON: ['Apple', 'Microsoft']
  ORGANIZATION: ['Inc.', 'CEO Tim Cook']
  GPE: ['Austin', 'Texas', 'Google']
  FACILITY: ['Wall Street']

🔍 MANUAL ENTITY IDENTIFICATION:
  Companies: ['Apple', 'Microsoft', 'Google']
  People: ['Tim', 'Cook']
  Locations: ['Austin', 'Texas', 'Wall', 'Street']
  Numbers: ['5,000', '1', 'billion']

💼 BUSINESS APPLICATIONS:
POS Tagging can be used for:
• Information extraction from documents
• Improving search relevance
• Text summarization
• Grammar checking tools
• Keyword extraction

Named Entity Recognition can be used for:
• Customer relationship management
• Compliance and regulatory analysis
• News and social media monitoring
• Document classification
• Knowledge graph construction

📋 Sample Word Analysis:
        Word POS_Tag                       POS_Description  Is_Entity
       Apple     NNP                Proper noun (singular)       True
          is     VBZ   Verb (present, 3rd person singular)      False
    planning     VBG      Verb (gerund/present participle)      False
          to      TO                                    to      False
        open      VB                      Verb (base form)      False
           a      DT                            Determiner      False
         new      JJ                             Adjective      False
headquarters      NN                       Noun (singular)      False
          in      IN Preposition/subordinating conjunction      False
      Austin     NNP                Proper noun (singular)       True
       Texas     NNP                Proper noun (singular)       True
        next      JJ                             Adjective      False
        year      NN                       Noun (singular)      False

8.8.2.9 NLTK Summary & Real-World Applications

8.8.2.9.1 📊 NLTK Capabilities Overview
Feature Purpose Business Use Case Performance
Tokenization Split text into words/sentences Document processing, search ⚡ Fast
Stop Words Remove common words Focus on meaningful content ⚡ Fast
Stemming Reduce words to stems Search optimization, indexing ⚡ Very Fast
Lemmatization Reduce words to dictionary form Text analysis, normalization 🔄 Medium
Sentiment Analysis Detect emotions and opinions Customer feedback, social monitoring ⚡ Fast
POS Tagging Identify grammatical roles Information extraction 🔄 Medium
Named Entity Recognition Find people, places, organizations Document analysis, CRM 🔄 Medium
8.8.2.9.2 💡 Best Practices & Tips

Performance Optimization:

  • Use stemming for speed-critical applications
  • Use lemmatization for accuracy-critical applications
  • Cache processed results for repeated text analysis
  • Batch process large datasets instead of one-by-one (see the sketch after this list)
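
A minimal caching-and-batching sketch, assuming the preprocess_text function and df_preprocessing from earlier in this chapter are available; the helper name preprocess_cached is an illustrative assumption, not part of NLTK.

from functools import lru_cache

# Illustrative sketch: cache repeated preprocessing and batch it with one column-wise apply()
@lru_cache(maxsize=None)
def preprocess_cached(text):
    _, filtered = preprocess_text(text)   # reuse the pipeline defined earlier
    return tuple(filtered)                # return an immutable result so the cache stays safe

# Batch processing: one apply() over the column instead of calling it review-by-review
df_preprocessing['Cached_Tokens'] = df_preprocessing['Original_Text'].apply(preprocess_cached)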

Data Quality:

  • Clean text before processing (remove HTML, special characters; see the sketch after this list)
  • Handle multiple languages appropriately
  • Validate results with domain experts
  • Consider context when interpreting sentiment
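
A minimal cleaning sketch for the first point above; the regular expressions are illustrative assumptions, not a complete cleaner.

import re

def basic_clean(text):
    text = re.sub(r'<[^>]+>', ' ', text)       # strip HTML tags
    text = re.sub(r'[^A-Za-z\s]', ' ', text)   # drop digits and special characters
    return re.sub(r'\s+', ' ', text).strip().lower()

print(basic_clean("<p>Great product!!! Worth every $ :)</p>"))   # → 'great product worth every'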

NLTK provides an excellent foundation for understanding NLP concepts before moving to more advanced tools!

8.9 Independent Study

8.9.1 Practice exercise 1

8.9.1.1

Read the file STAT303-1 survey for data analysis.csv.

survey_data = pd.read_csv('./Datasets/STAT303-1 survey for data analysis.csv')
print("Survey data loaded successfully!")
print(f"Shape: {survey_data.shape}")
survey_data.head()
Survey data loaded successfully!
Shape: (192, 51)
Timestamp fav_alcohol On average (approx.) how many parties a month do you attend during the school year? Enter a whole number (0, 1, 2, 3, 4, ...) smoke weed introvert_extrovert love_first_sight learning_style left_right_brained personality_type ... used_python_before dominant_hand childhood_in_US gender region_of_residence political_affliation cant_change_math_ability can_change_math_ability math_is_genetic much_effort_is_lack_of_talent
0 2022/09/13 1:43:34 pm GMT-5 I don't drink 1 No Occasionally Introvert No Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... INFJ ... Yes Right Yes Female Northeast Democrat Disagree Agree Disagree Disagree
1 2022/09/13 5:28:17 pm GMT-5 Hard liquor/Mixed drink 3 No Occasionally Extrovert No Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... ESFJ ... Yes Right Yes Male West Democrat Disagree Agree Disagree Disagree
2 2022/09/13 7:56:38 pm GMT-5 Hard liquor/Mixed drink 3 No Yes Introvert No Kinesthetic (learn best through figuring out h... Left-brained (logic, science, critical thinkin... ISTJ ... No Right No Female International No affiliation Disagree Agree Disagree Disagree
3 2022/09/13 10:34:37 pm GMT-5 Hard liquor/Mixed drink 12 No No Extrovert No Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... ENFJ ... No Right Yes Female Southeast Democrat Disagree Agree Disagree Disagree
4 2022/09/14 4:46:19 pm GMT-5 I don't drink 1 No No Extrovert Yes Reading/Writing (learn best through words ofte... Right-brained (creative, art, imaginative, int... ENTJ ... No Right Yes Female Northeast Democrat Agree Disagree Disagree Disagree

5 rows × 51 columns

8.9.1.2

Rename all the columns of the data except the first two with the shorter names in the list new_col_names given below. The names in the list appear in the same order as the columns to be renamed, starting with the third column from the left.

new_col_names = ['parties_per_month', 'do_you_smoke', 'weed', 'are_you_an_introvert_or_extrovert', 'love_first_sight', 'learning_style', 'left_right_brained', 'personality_type', 'social_media', 'num_insta_followers', 'streaming_platforms', 'expected_marriage_age', 'expected_starting_salary', 'fav_sport', 'minutes_ex_per_week', 'sleep_hours_per_day', 'how_happy', 'farthest_distance_travelled', 'fav_number', 'fav_letter', 'internet_hours_per_day', 'only_child', 'birthdate_odd_even', 'birth_month', 'fav_season', 'living_location_on_campus', 'major', 'num_majors_minors', 'high_school_GPA', 'NU_GPA', 'age', 'height', 'height_father', 'height_mother', 'school_year', 'procrastinator', 'num_clubs', 'student_athlete', 'AP_stats', 'used_python_before', 'dominant_hand', 'childhood_in_US', 'gender', 'region_of_residence', 'political_affliation', 'cant_change_math_ability', 'can_change_math_ability', 'math_is_genetic', 'much_effort_is_lack_of_talent']
survey_data.columns = list(survey_data.columns[0:2])+new_col_names

8.9.1.3

Rename the following columns again:

  1. Rename do_you_smoke to smoke.

  2. Rename are_you_an_introvert_or_extrovert to introvert_extrovert.

Hint: Use the function rename()


# Rename specific columns using the rename() function
survey_data = survey_data.rename(columns={
    'do_you_smoke': 'smoke',
    'are_you_an_introvert_or_extrovert': 'introvert_extrovert'
})

print(f"Updated columns: {list(survey_data.columns)}")
print("Specific columns renamed successfully!")
Updated columns: ['Timestamp', 'fav_alcohol', 'parties_per_month', 'smoke', 'weed', 'introvert_extrovert', 'love_first_sight', 'learning_style', 'left_right_brained', 'personality_type', 'social_media', 'num_insta_followers', 'streaming_platforms', 'expected_marriage_age', 'expected_starting_salary', 'fav_sport', 'minutes_ex_per_week', 'sleep_hours_per_day', 'how_happy', 'farthest_distance_travelled', 'fav_number', 'fav_letter', 'internet_hours_per_day', 'only_child', 'birthdate_odd_even', 'birth_month', 'fav_season', 'living_location_on_campus', 'major', 'num_majors_minors', 'high_school_GPA', 'NU_GPA', 'age', 'height', 'height_father', 'height_mother', 'school_year', 'procrastinator', 'num_clubs', 'student_athlete', 'AP_stats', 'used_python_before', 'dominant_hand', 'childhood_in_US', 'gender', 'region_of_residence', 'political_affliation', 'cant_change_math_ability', 'can_change_math_ability', 'math_is_genetic', 'much_effort_is_lack_of_talent']
Specific columns renamed successfully!

8.9.1.4

Examine the column num_insta_followers. Some numbers in the column contain a comma (,) or a tilde (~). Remove both of these characters from the numbers in the column.

Hint: You may use the function str.replace() of the Pandas Series class.


survey_data_insta = survey_data.copy()
survey_data_insta['num_insta_followers']=survey_data_insta['num_insta_followers'].str.replace(',','')
survey_data_insta['num_insta_followers']=survey_data_insta['num_insta_followers'].str.replace('~','')
survey_data_insta.head()
Timestamp fav_alcohol parties_per_month smoke weed introvert_extrovert love_first_sight learning_style left_right_brained personality_type ... used_python_before dominant_hand childhood_in_US gender region_of_residence political_affliation cant_change_math_ability can_change_math_ability math_is_genetic much_effort_is_lack_of_talent
0 2022/09/13 1:43:34 pm GMT-5 I don't drink 1 No Occasionally Introvert No Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... INFJ ... Yes Right Yes Female Northeast Democrat Disagree Agree Disagree Disagree
1 2022/09/13 5:28:17 pm GMT-5 Hard liquor/Mixed drink 3 No Occasionally Extrovert No Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... ESFJ ... Yes Right Yes Male West Democrat Disagree Agree Disagree Disagree
2 2022/09/13 7:56:38 pm GMT-5 Hard liquor/Mixed drink 3 No Yes Introvert No Kinesthetic (learn best through figuring out h... Left-brained (logic, science, critical thinkin... ISTJ ... No Right No Female International No affiliation Disagree Agree Disagree Disagree
3 2022/09/13 10:34:37 pm GMT-5 Hard liquor/Mixed drink 12 No No Extrovert No Visual (learn best through images or graphic o... Left-brained (logic, science, critical thinkin... ENFJ ... No Right Yes Female Southeast Democrat Disagree Agree Disagree Disagree
4 2022/09/14 4:46:19 pm GMT-5 I don't drink 1 No No Extrovert Yes Reading/Writing (learn best through words ofte... Right-brained (creative, art, imaginative, int... ENTJ ... No Right Yes Female Northeast Democrat Agree Disagree Disagree Disagree

5 rows × 51 columns
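
As a side note, both characters can also be removed in a single pass with a regular-expression replace; this is an optional alternative to the two chained calls above, not part of the original solution.

# One-pass alternative: a regex character class matches either ',' or '~'
survey_data_insta['num_insta_followers'] = (
    survey_data['num_insta_followers'].str.replace(r'[,~]', '', regex=True)
)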

8.9.1.5

Use the cleaned dataset where the column num_insta_followers has been updated.

The last four variables in the dataset are:

  1. cant_change_math_ability

  2. can_change_math_ability

  3. math_is_genetic

  4. much_effort_is_lack_of_talent

Each of the above variables has values - Agree / Disagree. Replace Agree with 1 and Disagree with 0.

Hint : You can do it with any one of the following methods:

  1. Use the map() function

  2. Use the apply() function with the lambda function

# Convert the last four Agree/Disagree columns to 1/0

# Define the columns to convert
columns_to_convert = ['cant_change_math_ability', 'can_change_math_ability', 
                      'math_is_genetic', 'much_effort_is_lack_of_talent']

# Method 1: Series.map() with a replacement dictionary (the loop runs over columns, not rows)
for col in columns_to_convert:
    survey_data_insta[col] = survey_data_insta[col].map({'Agree': 1, 'Disagree': 0})

print("Method 1: Using map() function")
print(survey_data_insta[columns_to_convert].head())
print("\nData types:")
print(survey_data_insta[columns_to_convert].dtypes)

# Method 2: apply() with a lambda function (again, the loop runs over columns only)
# Re-create the cleaned copy so this method can be demonstrated on the original Agree/Disagree values
survey_data_insta2 = survey_data.copy()
survey_data_insta2['num_insta_followers'] = survey_data_insta2['num_insta_followers'].str.replace(',','')
survey_data_insta2['num_insta_followers'] = survey_data_insta2['num_insta_followers'].str.replace('~','')

for col in columns_to_convert:
    survey_data_insta2[col] = survey_data_insta2[col].apply(lambda x: 1 if x == 'Agree' else 0)

print("\n\nMethod 2: Using apply() with lambda function")
print(survey_data_insta2[columns_to_convert].head())
Method 1: Using map() function
   cant_change_math_ability  can_change_math_ability  math_is_genetic  \
0                         0                        1                0   
1                         0                        1                0   
2                         0                        1                0   
3                         0                        1                0   
4                         1                        0                0   

   much_effort_is_lack_of_talent  
0                              0  
1                              0  
2                              0  
3                              0  
4                              0  

Data types:
cant_change_math_ability         int64
can_change_math_ability          int64
math_is_genetic                  int64
much_effort_is_lack_of_talent    int64
dtype: object


Method 2: Using apply() with lambda function
   cant_change_math_ability  can_change_math_ability  math_is_genetic  \
0                         0                        1                0   
1                         0                        1                0   
2                         0                        1                0   
3                         0                        1                0   
4                         1                        0                0   

   much_effort_is_lack_of_talent  
0                              0  
1                              0  
2                              0  
3                              0  
4                              0  
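
A third option, sketched here on a fresh copy of the data: DataFrame.replace() accepts a value-mapping dictionary and converts all four columns in one call.

# Method 3 (sketch): replace() with a mapping dictionary across all four columns at once
survey_data_insta3 = survey_data.copy()
survey_data_insta3[columns_to_convert] = (
    survey_data_insta3[columns_to_convert].replace({'Agree': 1, 'Disagree': 0})
)
print(survey_data_insta3[columns_to_convert].head())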

8.9.2 Practice exercise 2

This exercise motivates vectorized computation (pandas operations built on NumPy). Vectorized operations run more efficiently than explicit Python loops and keep the code concise; loop-free alternatives are sketched after the loop-based solutions below.

Read (1) the quantities of roll, bun, cake, and bread required by 3 people (Ben, Barbara, and Beth) from food_quantity.csv, and (2) the prices of these food items at two shops (Target and Kroger) from price.csv. Find out which shop each person should go to in order to minimize their expenses.


food_df = pd.read_csv('./datasets/food_quantity.csv')
food_df.head()
Person roll bun cake bread
0 Ben 6 5 3 1
1 Barbara 3 6 2 2
2 Beth 3 4 3 1

price_df = pd.read_csv('./datasets/price.csv')
price_df.head()
Item Target Kroger
0 roll 1.5 1.0
1 bun 2.0 2.5
2 cake 5.0 4.5
3 bread 16.0 17.0

8.9.2.1

Compute Ben's total expenses if he buys all of the food items from Target.


# Step 1: Get Ben's quantities
ben_quantities = food_df[food_df['Person'] == 'Ben'].iloc[0]  # Get Ben's row
print("Ben's food quantities:")
print(ben_quantities)
print()

# Step 2: Get Target prices
target_prices = price_df.set_index('Item')['Target']  # Set Item as index, get Target column
print("Target prices:")
print(target_prices)
print()

# Step 3: Calculate Ben's total expenses at Target
# Multiply quantities by corresponding Target prices
food_items = ['roll', 'bun', 'cake', 'bread']
total_expense = 0

print("Ben's expense breakdown at Target:")
for item in food_items:
    quantity = ben_quantities[item]
    price = target_prices[item]
    expense = quantity * price
    total_expense += expense
    print(f"{item}: {quantity} × ${price} = ${expense}")

print(f"\nBen's total expense at Target: ${total_expense}")
Ben's food quantities:
Person    Ben
roll        6
bun         5
cake        3
bread       1
Name: 0, dtype: object

Target prices:
Item
roll      1.5
bun       2.0
cake      5.0
bread    16.0
Name: Target, dtype: float64

Ben's expense breakdown at Target:
roll: 6 × $1.5 = $9.0
bun: 5 × $2.0 = $10.0
cake: 3 × $5.0 = $15.0
bread: 1 × $16.0 = $16.0

Ben's total expense at Target: $50.0
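
Since this exercise is meant to motivate vectorization, here is a loop-free sketch of the same calculation; it relies on pandas index alignment (it assumes food_df and price_df as loaded above).

# Vectorized sketch: alignment on the item labels replaces the explicit loop
ben_qty = food_df.set_index('Person').loc['Ben']        # Ben's quantities, indexed by item
target_prices = price_df.set_index('Item')['Target']    # Target prices, indexed by item
print((ben_qty * target_prices).sum())                  # 50.0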

8.9.2.2

Compute Ben’s total expenses at both Target and Kroger. Which store is cheaper for him?



# Step 1: Get Ben's quantities (reuse from previous calculation)
ben_quantities = food_df[food_df['Person'] == 'Ben'].iloc[0]
food_items = ['roll', 'bun', 'cake', 'bread']

# Step 2: Get prices from both stores
target_prices = price_df.set_index('Item')['Target']
kroger_prices = price_df.set_index('Item')['Kroger']

print("🛒 BEN'S EXPENSE COMPARISON")
print("=" * 40)
print("Ben's quantities:", [ben_quantities[item] for item in food_items])
print()

# Step 3: Calculate expenses at both stores
target_total = 0
kroger_total = 0

print("Detailed breakdown:")
print(f"{'Item':<8} {'Qty':<4} {'Target':<8} {'Kroger':<8} {'Target $':<10} {'Kroger $':<10}")
print("-" * 60)

for item in food_items:
    quantity = ben_quantities[item]
    target_price = target_prices[item]
    kroger_price = kroger_prices[item]
    
    target_expense = quantity * target_price
    kroger_expense = quantity * kroger_price
    
    target_total += target_expense
    kroger_total += kroger_expense
    
    print(f"{item:<8} {quantity:<4} ${target_price:<7} ${kroger_price:<7} ${target_expense:<9} ${kroger_expense:<9}")

print("-" * 60)
print(f"{'TOTAL':<8} {'':<4} {'':<8} {'':<8} ${target_total:<9} ${kroger_total:<9}")

# Step 4: Determine which store is cheaper
print(f"\n💰 COMPARISON RESULTS:")
print(f"Ben's total at Target: ${target_total}")
print(f"Ben's total at Kroger: ${kroger_total}")

if target_total < kroger_total:
    savings = kroger_total - target_total
    print(f"🏆 TARGET is cheaper by ${savings}")
    print(f"Ben should shop at TARGET to save money!")
elif kroger_total < target_total:
    savings = target_total - kroger_total
    print(f"🏆 KROGER is cheaper by ${savings}")
    print(f"Ben should shop at KROGER to save money!")
else:
    print("🤝 Both stores cost the same for Ben!")
🛒 BEN'S EXPENSE COMPARISON
========================================
Ben's quantities: [6, 5, 3, 1]

Detailed breakdown:
Item     Qty  Target   Kroger   Target $   Kroger $  
------------------------------------------------------------
roll     6    $1.5     $1.0     $9.0       $6.0      
bun      5    $2.0     $2.5     $10.0      $12.5     
cake     3    $5.0     $4.5     $15.0      $13.5     
bread    1    $16.0    $17.0    $16.0      $17.0     
------------------------------------------------------------
TOTAL                           $50.0      $49.0     

💰 COMPARISON RESULTS:
Ben's total at Target: $50.0
Ben's total at Kroger: $49.0
🏆 KROGER is cheaper by $1.0
Ben should shop at KROGER to save money!
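
A loop-free sketch of the same comparison (assuming food_df and price_df as loaded above): multiplying the price table by Ben's quantities broadcasts across both store columns at once.

# Vectorized sketch: one multiplication covers both stores
ben_qty = food_df.set_index('Person').loc['Ben']
store_totals = price_df.set_index('Item').mul(ben_qty, axis=0).sum()
print(store_totals)            # Target 50.0, Kroger 49.0
print(store_totals.idxmin())   # 'Kroger'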

8.9.2.3

Compute each person’s total expenses at both Target and Kroger. Which store is cheaper for them?



# Step 1: Set up data
food_items = ['roll', 'bun', 'cake', 'bread']
target_prices = price_df.set_index('Item')['Target']
kroger_prices = price_df.set_index('Item')['Kroger']

print("🛒 COMPLETE EXPENSE ANALYSIS FOR ALL PEOPLE")
print("=" * 60)
print(f"Food items: {food_items}")
print(f"Target prices: {target_prices.values}")
print(f"Kroger prices: {kroger_prices.values}")
print()

# Step 2: Calculate expenses for each person
results = []

for index, person_row in food_df.iterrows():
    person_name = person_row['Person']
    
    # Calculate total expenses at both stores
    target_total = 0
    kroger_total = 0
    
    print(f"👤 {person_name.upper()}'S ANALYSIS:")
    print("-" * 30)
    print(f"{'Item':<8} {'Qty':<4} {'Target $':<10} {'Kroger $':<10}")
    print("-" * 30)
    
    for item in food_items:
        quantity = person_row[item]
        target_expense = quantity * target_prices[item]
        kroger_expense = quantity * kroger_prices[item]
        
        target_total += target_expense
        kroger_total += kroger_expense
        
        print(f"{item:<8} {quantity:<4} ${target_expense:<9.2f} ${kroger_expense:<9.2f}")
    
    print("-" * 30)
    print(f"{'TOTAL':<8} {'':<4} ${target_total:<9.2f} ${kroger_total:<9.2f}")
    
    # Determine cheaper store
    if target_total < kroger_total:
        cheaper_store = "TARGET"
        savings = kroger_total - target_total
        print(f"🏆 {cheaper_store} is cheaper by ${savings:.2f}")
    elif kroger_total < target_total:
        cheaper_store = "KROGER"
        savings = target_total - kroger_total
        print(f"🏆 {cheaper_store} is cheaper by ${savings:.2f}")
    else:
        cheaper_store = "TIE"
        savings = 0
        print("🤝 Both stores cost the same!")
    
    # Store results
    results.append({
        'Person': person_name,
        'Target_Total': target_total,
        'Kroger_Total': kroger_total,
        'Cheaper_Store': cheaper_store,
        'Savings': savings
    })
    
    print()

# Step 3: Create summary DataFrame
results_df = pd.DataFrame(results)

print("📊 SUMMARY TABLE:")
print("=" * 50)
print(results_df.to_string(index=False))

print(f"\n FINAL RECOMMENDATIONS:")
print("=" * 30)
for _, row in results_df.iterrows():
    if row['Cheaper_Store'] == 'TIE':
        print(f"• {row['Person']}: Can shop at either store (same cost)")
    else:
        print(f"• {row['Person']}: Shop at {row['Cheaper_Store']} to save ${row['Savings']:.2f}")

# Step 4: Additional insights
print(f"\n ADDITIONAL INSIGHTS:")
print("=" * 25)
target_shoppers = results_df[results_df['Cheaper_Store'] == 'TARGET']['Person'].tolist()
kroger_shoppers = results_df[results_df['Cheaper_Store'] == 'KROGER']['Person'].tolist()
tie_shoppers = results_df[results_df['Cheaper_Store'] == 'TIE']['Person'].tolist()

if target_shoppers:
    print(f"Target is better for: {', '.join(target_shoppers)}")
if kroger_shoppers:
    print(f"Kroger is better for: {', '.join(kroger_shoppers)}")
if tie_shoppers:
    print(f"No difference for: {', '.join(tie_shoppers)}")

total_savings = results_df['Savings'].sum()
print(f"Total potential savings if everyone shops optimally: ${total_savings:.2f}")
🛒 COMPLETE EXPENSE ANALYSIS FOR ALL PEOPLE
============================================================
Food items: ['roll', 'bun', 'cake', 'bread']
Target prices: [ 1.5  2.   5.  16. ]
Kroger prices: [ 1.   2.5  4.5 17. ]

👤 BEN'S ANALYSIS:
------------------------------
Item     Qty  Target $   Kroger $  
------------------------------
roll     6    $9.00      $6.00     
bun      5    $10.00     $12.50    
cake     3    $15.00     $13.50    
bread    1    $16.00     $17.00    
------------------------------
TOTAL         $50.00     $49.00    
🏆 KROGER is cheaper by $1.00

👤 BARBARA'S ANALYSIS:
------------------------------
Item     Qty  Target $   Kroger $  
------------------------------
roll     3    $4.50      $3.00     
bun      6    $12.00     $15.00    
cake     2    $10.00     $9.00     
bread    2    $32.00     $34.00    
------------------------------
TOTAL         $58.50     $61.00    
🏆 TARGET is cheaper by $2.50

👤 BETH'S ANALYSIS:
------------------------------
Item     Qty  Target $   Kroger $  
------------------------------
roll     3    $4.50      $3.00     
bun      4    $8.00      $10.00    
cake     3    $15.00     $13.50    
bread    1    $16.00     $17.00    
------------------------------
TOTAL         $43.50     $43.50    
🤝 Both stores cost the same!

📊 SUMMARY TABLE:
==================================================
 Person  Target_Total  Kroger_Total Cheaper_Store  Savings
    Ben          50.0          49.0        KROGER      1.0
Barbara          58.5          61.0        TARGET      2.5
   Beth          43.5          43.5           TIE      0.0

 FINAL RECOMMENDATIONS:
==============================
• Ben: Shop at KROGER to save $1.00
• Barbara: Shop at TARGET to save $2.50
• Beth: Can shop at either store (same cost)

 ADDITIONAL INSIGHTS:
=========================
Target is better for: Barbara
Kroger is better for: Ben
No difference for: Beth
Total potential savings if everyone shops optimally: $3.50
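
Finally, the whole exercise collapses into a single matrix product once the two tables are aligned on the item labels; the sketch below reproduces the summary table above without any Python loops (again assuming food_df and price_df as loaded above).

# Vectorized sketch: (people × items) @ (items × stores) = (people × stores)
qty = food_df.set_index('Person')        # 3 × 4 quantity matrix
prices = price_df.set_index('Item')      # 4 × 2 price matrix
totals = qty @ prices                    # total cost per person per store
print(totals)
print(totals.idxmin(axis=1))             # cheapest store per person (a tie returns the first column)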