10  NumPy Intermediate

10.1 Learning Objectives

By the end of this chapter, you will be able to:

  • Optimize code performance using vectorization instead of loops
  • Perform vectorized computations that are 10–100x faster than Python loops
  • Generate random data for simulations and statistical analysis
  • Convert between NumPy and Pandas for optimal workflow efficiency
Code
import numpy as np
import pandas as pd
import time
import warnings

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

10.2 Vectorization in NumPy

Vectorization applies operations to entire arrays simultaneously instead of looping through individual elements. This enables NumPy to leverage highly optimized C/Fortran libraries for dramatic performance gains.

10.2.1 Why Vectorization Matters

Vectorized operations require fewer lines of code and are easier to read compared to Python loops.
Moreover, since they rely on compiled C and Fortran libraries that leverage CPU SIMD instructions and efficient cache utilization, they can be significantly faster and more scalable than pure Python loops.

Loop Approach:

result = []
for i in range(len(array1)):
    for j in range(len(array2)):
        result.append(array1[i] * array2[j] + some_value)

Vectorized Approach:

result = array1[:, np.newaxis] * array2 + some_value
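
A runnable version of the two snippets above, using small example arrays (array1, array2, and some_value here are arbitrary placeholder inputs):

Code
# Example inputs (placeholders for the snippets above)
array1 = np.array([1.0, 2.0, 3.0])
array2 = np.array([10.0, 20.0])
some_value = 0.5

# Loop approach: computes each pair one at a time into a flat list
result_loop = []
for i in range(len(array1)):
    for j in range(len(array2)):
        result_loop.append(array1[i] * array2[j] + some_value)

# Vectorized approach: broadcasting computes all pairs at once as a 2D array
result_vec = array1[:, np.newaxis] * array2 + some_value

# Same values, different container shapes
print(np.allclose(result_loop, result_vec.ravel()))  # True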

10.2.1.1 Performance Comparison: NumPy Vectorization vs. Python Loops

Understanding the performance benefits of vectorization is crucial for writing efficient scientific computing code. Let’s compare different scenarios to see when and why vectorization matters.

10.2.1.1.1 Scenario 1: Basic Mathematical Operations

np.dot is a vectorized operation in NumPy that performs matrix multiplication or dot product between arrays. It efficiently computes the element-wise multiplications and then sums them up. To better understand its efficiency, let’s first implement the dot product using for loops, and then compare its performance with np.dot to see the benefits of vectorization.

Code
# Function to calculate dot product using for loops
def dot_product_loops(arr1, arr2):
    result = 0
    for i in range(len(arr1)):
        result += arr1[i] * arr2[i]
    return result

# Create sample arrays of different sizes for comparison
sizes = [1000, 10000, 100000, 1000000]

print("Performance Comparison: Loops vs NumPy Vectorization")
print("=" * 60)

for size in sizes:
    # Create random arrays
    arr1 = np.random.rand(size)
    arr2 = np.random.rand(size)
    
    # Measure time for the loop-based implementation
    start_time = time.time()
    loop_result = dot_product_loops(arr1, arr2)
    loop_time = time.time() - start_time
    
    # Measure time for np.dot
    start_time = time.time()
    numpy_result = np.dot(arr1, arr2)
    numpy_time = time.time() - start_time
    
    # Calculate speedup
    speedup = loop_time / numpy_time if numpy_time > 0 else float('inf')
    
    print(f"\nArray size: {size:,}")
    print(f"Loop result: {loop_result:.5f}, Time: {loop_time:.5f}s")
    print(f"NumPy result: {numpy_result:.5f}, Time: {numpy_time:.5f}s")
    print(f"Speedup: {speedup:.1f}x faster")
    
    

print(f"\n{'='*60}")
print("Key Observations:")
print("• NumPy becomes dramatically faster as array size increases")
print("• Speedup can reach over 100x for large arrays")
print("• NumPy overhead is minimal for small arrays but pays off quickly")
Performance Comparison: Loops vs NumPy Vectorization
============================================================

Array size: 1,000
Loop result: 240.51012, Time: 0.00100s
NumPy result: 240.51012, Time: 0.00000s
Speedup: infx faster

Array size: 10,000
Loop result: 2483.13248, Time: 0.00450s
NumPy result: 2483.13248, Time: 0.00000s
Speedup: infx faster

Array size: 100,000
Loop result: 24918.72212, Time: 0.02491s
NumPy result: 24918.72212, Time: 0.00000s
Speedup: infx faster

Array size: 1,000,000
Loop result: 250291.73101, Time: 0.18839s
NumPy result: 250291.73101, Time: 0.00098s
Speedup: 192.8x faster

============================================================
Key Observations:
• NumPy becomes dramatically faster as array size increases
• Speedup can reach over 100x for large arrays
• NumPy overhead is minimal for small arrays but pays off quickly

Note: The "inf" speedups simply mean that np.dot finished before time.time() could register any elapsed time; a higher-resolution timer such as time.perf_counter() would report a large but finite ratio.

10.2.1.1.2 Scenario 2: Complex Mathematical Operations

Let’s compare more complex operations that show even greater performance differences:

Code
# Complex mathematical operation: polynomial evaluation
# f(x) = 3x³ + 2x² - 5x + 1

def polynomial_loops(x_values):
    """Evaluate polynomial using loops"""
    results = []
    for x in x_values:
        result = 3*x**3 + 2*x**2 - 5*x + 1
        results.append(result)
    return results

def polynomial_vectorized(x_values):
    """Evaluate polynomial using vectorized operations"""
    return 3*x_values**3 + 2*x_values**2 - 5*x_values + 1

# Test with different sizes
size = 5000000
x_data = np.random.uniform(-10, 10, size)

print("Complex Operations: Polynomial Evaluation")
print("f(x) = 3x³ + 2x² - 5x + 1")
print("-" * 40)

# Loop version
start_time = time.time()
loop_results = polynomial_loops(x_data)
loop_time = time.time() - start_time

# Vectorized version
start_time = time.time()
vector_results = polynomial_vectorized(x_data)
vector_time = time.time() - start_time

speedup = loop_time / vector_time

print(f"Array size: {size:,}")
print(f"Loop time: {loop_time:.4f}s")
print(f"Vectorized time: {vector_time:.4f}s")
print(f"Speedup: {speedup:.1f}x")

# Verify results match
print(f"Results match: {np.allclose(loop_results, vector_results)}")
Complex Operations: Polynomial Evaluation
f(x) = 3x³ + 2x² - 5x + 1
----------------------------------------
Array size: 5,000,000
Loop time: 2.8250s
Vectorized time: 0.2053s
Speedup: 13.8x
Results match: True

The performance comparisons clearly demonstrate that vectorized NumPy operations significantly outperform equivalent Python loops.

10.3 Vectorized Operations in NumPy

NumPy provides the foundation for efficient computation in the scientific Python stack; even pandas' vectorized operations are built on top of NumPy.

10.3.1 Types of Vectorized Operations

NumPy supports several categories of vectorized operations:

10.3.1.1 Element-wise Operations (Universal Functions - ufuncs)

These operations apply a function to each element of an array:

Code
# Basic arithmetic operations (all vectorized)
arr = np.array([1, 4, 9, 16, 25])

print("Original array:", arr)
print("Square root:", np.sqrt(arr))
print("Natural log:", np.log(arr))
print("Sine:", np.sin(arr))
print("Power of 2:", np.power(arr, 2))

# Comparison operations
print("\nComparison operations:")
print("Greater than 10:", arr > 10)
print("Equal to 9:", arr == 9)
print("Between 5 and 20:", (arr >= 5) & (arr <= 20))
Original array: [ 1  4  9 16 25]
Square root: [1. 2. 3. 4. 5.]
Natural log: [0.         1.38629436 2.19722458 2.77258872 3.21887582]
Sine: [ 0.84147098 -0.7568025   0.41211849 -0.28790332 -0.13235175]
Power of 2: [  1  16  81 256 625]

Comparison operations:
Greater than 10: [False False False  True  True]
Equal to 9: [False False  True False False]
Between 5 and 20: [False False  True  True False]

10.3.1.2 Aggregation Operations

These reduce arrays to scalar values or smaller arrays:

Code
# 2D array for aggregation examples
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])

print("Matrix:")
print(matrix)

# Aggregations across entire array
print(f"\nSum of all elements: {np.sum(matrix)}")
print(f"Mean: {np.mean(matrix)}")
print(f"Standard deviation: {np.std(matrix)}")
print(f"Min: {np.min(matrix)}, Max: {np.max(matrix)}")

# Aggregations along specific axes
print(f"\nSum along axis 0 (columns): {np.sum(matrix, axis=0)}")
print(f"Sum along axis 1 (rows): {np.sum(matrix, axis=1)}")
print(f"Mean along axis 0: {np.mean(matrix, axis=0)}")
print(f"Max along axis 1: {np.max(matrix, axis=1)}")
Matrix:
[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]

Sum of all elements: 78
Mean: 6.5
Standard deviation: 3.452052529534663
Min: 1, Max: 12

Sum along axis 0 (columns): [15 18 21 24]
Sum along axis 1 (rows): [10 26 42]
Mean along axis 0: [5. 6. 7. 8.]
Max along axis 1: [ 4  8 12]

10.3.1.3 Boolean Operations and Fancy Indexing

Vectorized selection and filtering operations:

Code
# Boolean indexing - vectorized filtering
data = np.array([1, 15, 8, 23, 4, 16, 42, 3, 19, 7])

# Find elements meeting multiple conditions (vectorized)
mask = (data > 5) & (data < 20)
filtered_data = data[mask]
print(f"Original data: {data}")
print(f"Elements between 5 and 20: {filtered_data}")

# Using np.where for conditional replacement (vectorized)
# Replace values > 15 with -1, others with original value
result = np.where(data > 15, -1, data)
print(f"After conditional replacement: {result}")

# Fancy indexing - select multiple elements at once
indices = [0, 2, 4, 7]
selected = data[indices]
print(f"Elements at indices {indices}: {selected}")

# Count how many elements meet condition (vectorized)
count_large = np.sum(data > 10)
print(f"Count of elements > 10: {count_large}")
Original data: [ 1 15  8 23  4 16 42  3 19  7]
Elements between 5 and 20: [15  8 16 19  7]
After conditional replacement: [ 1 15  8 -1  4 -1 -1  3 -1  7]
Elements at indices [0, 2, 4, 7]: [1 8 4 3]
Count of elements > 10: 5

10.3.2 Matrix Multiplication in NumPy with Vectorization

Matrix multiplication is one of the most common and computationally intensive operations in numerical computing and deep learning. NumPy offers efficient and highly optimized methods for performing matrix multiplication, which leverage vectorization to handle large matrices quickly and accurately.

Note: NumPy matrix operations follow the standard rules of linear algebra, so it’s important to ensure that the shapes of the matrices are compatible. If they are not, consider reshaping the matrices before performing multiplication.

There are two common methods for matrix multiplication:

10.3.2.1 Method 1: Matrix Multiplication Using np.dot()

Code
# Define two 2D arrays (matrices)
matrix1 = np.array([[1, 2, 3], 
                    [4, 5, 6]])
matrix2 = np.array([[7, 8], 
                    [9, 10], 
                    [11, 12]])

# Matrix multiplication using np.dot
result_dot = np.dot(matrix1, matrix2)
print("Matrix Multiplication using np.dot:\n", result_dot)

# Another way to perform np.dot for matrix multiplication
result_dot2 = matrix1.dot(matrix2)
print("\nMatrix Multiplication using dot method:\n", result_dot2)
Matrix Multiplication using np.dot:
 [[ 58  64]
 [139 154]]

Matrix Multiplication using dot method:
 [[ 58  64]
 [139 154]]

10.3.2.2 Method 2: Matrix Multiplication using np.matmul() or @

Code
# Matrix multiplication using np.matmul or @ operator
result_matmul = np.matmul(matrix1, matrix2)
result_operator = matrix1 @ matrix2
print("\nMatrix Multiplication using np.matmul:\n", result_matmul)
print("\nMatrix Multiplication using @ operator:\n", result_operator)

Matrix Multiplication using np.matmul:
 [[ 58  64]
 [139 154]]

Matrix Multiplication using @ operator:
 [[ 58  64]
 [139 154]]

Important: Note that the * operator in NumPy performs element-wise multiplication, not matrix multiplication.

Code
# Using * operator for element-wise multiplication will cause an error if shapes don't match
# Uncomment the code below to see the error:
# element_wise = matrix1 * matrix2
# print("\nElement-wise Multiplication:\n", element_wise)
Code
# reshape the array for element-wise multiplication
matrix2_reshaped = matrix2.reshape(2, 3)
element_wise = matrix1 * matrix2_reshaped
print("\nElement-wise Multiplication after reshaping:\n", element_wise)

Element-wise Multiplication after reshaping:
 [[ 7 16 27]
 [40 55 72]]

10.4 Converting Between NumPy Arrays and Pandas DataFrames

Seamless data conversion between NumPy and pandas is essential for efficient data science workflows. Each library excels in different areas, and knowing when and how to convert between them maximizes your analytical power.

Typical Data Science Workflow

1.  LOAD DATA → Use Pandas (CSV, Excel, databases)
2.  CLEAN & PREPARE → Use Pandas (filtering, grouping, handling missing data)
3.  COMPUTE → Convert to NumPy (ML algorithms, linear algebra, heavy math)
4.  ANALYZE & VISUALIZE → Convert back to Pandas (results interpretation)
5.  EXPORT → Use Pandas (CSV, Parquet, SQL)

NumPy and pandas complement each other perfectly:

Aspect | NumPy Arrays | Pandas DataFrames
------ | ------------ | -----------------
Strength | Mathematical computations | Data manipulation & analysis
Speed | Faster numerical operations | Easier data exploration
Memory | More memory efficient | Rich metadata (labels, dtypes)
Use Cases | Linear algebra, statistics, ML algorithms | Data cleaning, grouping, merging
Indexing | Integer-based only | Label-based + integer-based
Missing Data | Limited NaN support | Robust NaN handling

The key question: When should you stay in pandas vs convert to NumPy?

10.4.1 Decision Framework: When to Use Which Library

10.4.1.1 Stay in Pandas for Vectorized Operations

Vectorized operations in Pandas are built on top of NumPy and are often fast enough for most tasks!
These operations include, but are not limited to:

1. Standard Mathematical Operations

  • Arithmetic: df['total'] = df['price'] * df['quantity']
  • Normalization: df_norm = (df - df.mean()) / df.std()
  • Scaling: df['scaled'] = df['value'] / df['value'].max()

2. Statistical Computations

  • Aggregations: df.mean(), df.sum(), df.std()
  • Correlations: df.corr(), df['A'].corr(df['B'])
  • Rolling windows: df['MA'] = df['price'].rolling(7).mean()

3. Element-wise Transformations

  • Vectorized operations: df['log_val'] = np.log(df['value'])
  • Conditional logic: df['flag'] = df['amount'] > 100
  • String operations: df['upper'] = df['name'].str.upper()

Let’s see this in action:

Code
print("💡 STAYING IN PANDAS: When Conversion Isn't Needed")
print("=" * 60)

# Example: Normalize data WITHOUT converting to NumPy
df_data = pd.DataFrame({
    "Sales_Q1": [10000, 20000, 30000, 40000],
    "Sales_Q2": [12000, 22000, 28000, 38000],
    "Sales_Q3": [15000, 25000, 35000, 45000]
}, index=["Product_A", "Product_B", "Product_C", "Product_D"])

print("Original Sales Data:")
print(df_data)
print(f"\nDtypes: {df_data.dtypes.unique()}")

# ❌ METHOD 1: Using NumPy (requires metadata handling)
print("\n" + "="*60)
print("❌ Method 1: NumPy Approach (more complex)")
print("="*60)

# Save metadata
saved_index = df_data.index
saved_columns = df_data.columns

# Convert to NumPy
arr = df_data.to_numpy()
print(f"1. Converted to NumPy array (shape: {arr.shape})")

# Normalize
arr_norm = (arr - arr.mean(axis=0)) / arr.std(axis=0)
print(f"2. Normalized using NumPy")

# Convert back and restore metadata
df_norm_numpy = pd.DataFrame(arr_norm, index=saved_index, columns=saved_columns)
print(f"3. Converted back to DataFrame with manual metadata restoration")
print(f"\nResult:")
print(df_norm_numpy.round(3))

# ✅ METHOD 2: Pure Pandas (simpler and automatic!)
print("\n" + "="*60)
print("✅ Method 2: Pandas Approach (simpler!)")
print("="*60)

df_norm_pandas = (df_data - df_data.mean()) / df_data.std()
print(f"Result (metadata preserved automatically):")
print(df_norm_pandas.round(3))

# Verify they're the same
print(f"\n✓ Results identical? {np.allclose(df_norm_numpy.values, df_norm_pandas.values)}")
print(f"✓ Metadata preserved? {df_norm_pandas.index.equals(df_data.index) and df_norm_pandas.columns.equals(df_data.columns)}")

print("\n💡 Takeaway: For standard operations, pandas is simpler and just as fast!")
💡 STAYING IN PANDAS: When Conversion Isn't Needed
============================================================
Original Sales Data:
           Sales_Q1  Sales_Q2  Sales_Q3
Product_A     10000     12000     15000
Product_B     20000     22000     25000
Product_C     30000     28000     35000
Product_D     40000     38000     45000

Dtypes: [dtype('int64')]

============================================================
❌ Method 1: NumPy Approach (more complex)
============================================================
1. Converted to NumPy array (shape: (4, 3))
2. Normalized using NumPy
3. Converted back to DataFrame with manual metadata restoration

Result:
           Sales_Q1  Sales_Q2  Sales_Q3
Product_A    -1.342    -1.378    -1.342
Product_B    -0.447    -0.318    -0.447
Product_C     0.447     0.318     0.447
Product_D     1.342     1.378     1.342

============================================================
✅ Method 2: Pandas Approach (simpler!)
============================================================
Result (metadata preserved automatically):
           Sales_Q1  Sales_Q2  Sales_Q3
Product_A    -1.162    -1.193    -1.162
Product_B    -0.387    -0.275    -0.387
Product_C     0.387     0.275     0.387
Product_D     1.162     1.193     1.162

✓ Results identical? False
✓ Metadata preserved? True

💡 Takeaway: For standard operations, pandas is simpler and just as fast!

Why do the two results differ? NumPy's std() defaults to the population standard deviation (ddof=0), while pandas' .std() defaults to the sample standard deviation (ddof=1). Pass ddof=1 to NumPy (or ddof=0 to pandas) and the two normalizations agree exactly.
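
A quick check of that claim, reusing df_data from above:

Code
# Match pandas' default by using the sample std (ddof=1) in NumPy
arr = df_data.to_numpy()
arr_norm_sample = (arr - arr.mean(axis=0)) / arr.std(axis=0, ddof=1)
pandas_norm = (df_data - df_data.mean()) / df_data.std()
print(np.allclose(arr_norm_sample, pandas_norm.to_numpy()))  # True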

10.4.1.2 Convert to NumPy When You Must

Convert your DataFrame to a NumPy array only in specific scenarios where Pandas alone isn’t enough:

1. Specialized NumPy Functions

  • Linear algebra: np.linalg.inv(), np.linalg.eig(), matrix decomposition
  • FFT/signal processing: np.fft.fft(), np.convolve()
  • Advanced random sampling: np.random.multivariate_normal()
  • Custom mathematical operations not available in pandas

2. ML Library Integration

  • scikit-learn: model.fit(X_train, y_train) expects NumPy arrays
  • TensorFlow/PyTorch: Neural networks require tensor/array inputs
  • SciPy: Scientific functions expect NumPy arrays

3. Performance-Critical Code

  • Nested loops that can be vectorized with NumPy broadcasting
  • Custom algorithms requiring low-level array manipulation
  • Very large datasets (> 10 million rows) where memory matters

4. Complex Conditional Logic

  • Multi-condition selections with np.where(), np.select()
  • Better performance than chained pandas operations

Let’s see when NumPy is actually necessary:

Code
print("🚀 WHEN NUMPY IS NECESSARY: Real Use Cases")
print("=" * 60)

# Create a realistic dataset
np.random.seed(42)
n_rows = 50000

performance_data = pd.DataFrame({
    'customer_id': range(1, n_rows + 1),
    'purchase_amount': np.random.uniform(10, 1000, n_rows),
    'discount_pct': np.random.choice([0, 5, 10, 15, 20], n_rows),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_rows),
    'rating': np.random.randint(1, 6, n_rows),
    'is_premium': np.random.choice([True, False], n_rows)
})

print(f"📊 Dataset: {performance_data.shape[0]:,} rows × {performance_data.shape[1]} columns")
print(f"First 3 rows:")
print(performance_data.head(3))
🚀 WHEN NUMPY IS NECESSARY: Real Use Cases
============================================================
📊 Dataset: 50,000 rows × 6 columns
First 3 rows:
   customer_id  purchase_amount  discount_pct category  rating  is_premium
0            1       380.794718             0        C       1        True
1            2       951.207163            15        D       1       False
2            3       734.674002             5        D       1        True
Code
print("\n" + "="*60)
print("USE CASE 1: Complex Multi-Condition Logic with np.select()")
print("="*60)

def customer_segment_function(row):
    """Complex business logic - slow with apply()"""
    if row['is_premium'] and row['purchase_amount'] > 500:
        return 'VIP'
    elif row['is_premium'] and row['rating'] >= 4:
        return 'Premium+'  
    elif row['purchase_amount'] > 200 and row['rating'] >= 4:
        return 'High Value'
    elif row['purchase_amount'] > 100:
        return 'Standard'
    else:
        return 'Basic'

# Method 1: APPLY (Easy to write but slow)
print("\n❌ Method 1: Using apply() with function")
start_time = time.time()
performance_data['segment_apply'] = performance_data.apply(customer_segment_function, axis=1)
apply_time = time.time() - start_time
print(f"Time: {apply_time:.4f}s")

# Method 2: VECTORIZED with np.select() (Faster!)
print("\n✅ Method 2: NumPy np.select() - vectorized")
start_time = time.time()

# Define conditions using NumPy/pandas vectorized operations
conditions = [
    (performance_data['is_premium'] & (performance_data['purchase_amount'] > 500)),
    (performance_data['is_premium'] & (performance_data['rating'] >= 4)),
    ((performance_data['purchase_amount'] > 200) & (performance_data['rating'] >= 4)),
    (performance_data['purchase_amount'] > 100)
]
choices = ['VIP', 'Premium+', 'High Value', 'Standard']

# np.select() evaluates conditions vectorized and picks corresponding choice
performance_data['segment_vectorized'] = np.select(conditions, choices, default='Basic')
vectorized_time = time.time() - start_time
print(f"Time: {vectorized_time:.4f}s")

# Performance comparison
speedup = apply_time / vectorized_time if vectorized_time > 0 else float('inf')
print(f"\n🏆 Speedup: {speedup:.1f}x faster with np.select()!")
print(f"Time saved: {((apply_time - vectorized_time) / apply_time * 100):.1f}%")

# Verify results match
results_match = (performance_data['segment_apply'] == performance_data['segment_vectorized']).all()
print(f"✓ Results identical: {results_match}")

# Show distribution
print(f"\n📊 Customer Segment Distribution:")
segment_counts = performance_data['segment_apply'].value_counts().sort_index()
for segment, count in segment_counts.items():
    print(f"  {segment}: {count:,} ({count/len(performance_data)*100:.1f}%)")

# Clean up
performance_data.drop(['segment_vectorized'], axis=1, inplace=True)

============================================================
USE CASE 1: Complex Multi-Condition Logic with np.select()
============================================================

❌ Method 1: Using apply() with function
Time: 0.3499s

✅ Method 2: NumPy np.select() - vectorized
Time: 0.0124s

🏆 Speedup: 28.2x faster with np.select()!
Time saved: 96.5%
✓ Results identical: True

📊 Customer Segment Distribution:
  Basic: 3,648 (7.3%)
  High Value: 8,116 (16.2%)
  Premium+: 4,967 (9.9%)
  Standard: 20,606 (41.2%)
  VIP: 12,663 (25.3%)

10.4.1.3 Quick Decision Guide

Summary Table

Operation Type | Pandas | NumPy | Recommendation
-------------- | ------ | ----- | --------------
Arithmetic operations | ✅ Fast | ✅ Faster | Use Pandas (simpler)
Aggregations (sum, mean) | ✅ Fast | ✅ Faster | Use Pandas (metadata)
String operations | ✅ Built-in | ❌ Manual | Use Pandas
DateTime operations | ✅ Built-in | ❌ Manual | Use Pandas
Linear algebra | ❌ Limited | ✅ Comprehensive | Use NumPy
ML library input | ⚠️ Convert | ✅ Native | Convert to NumPy
Complex conditionals | ⚠️ Slow | ✅ Fast (np.select) | Use NumPy
Large datasets (> 10M rows) | ⚠️ Slower | ✅ Faster | Use NumPy

Now let’s learn the actual conversion methods!

10.5 Conversion Methods: Pandas ↔ NumPy

Now that you know when to convert, let’s learn how to convert between pandas and NumPy.

10.5.1 The Two-Way Street

┌──────────────────────┐                      ┌──────────────────────┐
│  Pandas DataFrame    │   .to_numpy()        │  NumPy Array         │
│                      │ ──────────────────→  │                      │
│  ✅ Column names     │                      │  ❌ No labels        │
│  ✅ Index labels     │                      │  ❌ No index         │
│  ✅ Mixed dtypes     │                      │  ⚠️  Homogeneous     │
│  ✅ Missing data     │ ←──────────────────  │  ⚠️  Limited NaN     │
│                      │   pd.DataFrame()     │                      │
└──────────────────────┘                      └──────────────────────┘

Challenge: Converting a DataFrame to a NumPy array drops all metadata:

  • ❌ Column names are lost
  • ❌ Index labels are removed
  • ❌ Data type information is simplified or lost
  • ❌ Categorical mappings are gone

Why this matters: After performing NumPy computations, you often need to convert results back to a DataFrame with the original structure for analysis, visualization, or export.

Solution: Save & Restore Metadata Manually

10.5.2 Save & Restore Metadata Manually

The most straightforward approach: save metadata before conversion, then restore it after computation.

Code
print("🔄 Manual Metadata Preservation")
print("=" * 50)

# Create a DataFrame with rich metadata
df_original = pd.DataFrame(
    {
        "Temperature": [72.5, 68.3, 75.1, 71.9],
        "Humidity": [45, 52, 48, 50],
        "WindSpeed": [12.3, 8.7, 15.2, 10.1]
    },
    index=["Monday", "Tuesday", "Wednesday", "Thursday"]
)

print("Original DataFrame:")
print(df_original)
print(f"\nIndex: {df_original.index.tolist()}")
print(f"Columns: {df_original.columns.tolist()}")
print(f"Dtypes:\n{df_original.dtypes}")

# STEP 1: Save metadata BEFORE converting to NumPy
saved_index = df_original.index.copy()
saved_columns = df_original.columns.copy()
saved_dtypes = df_original.dtypes.to_dict()

print("\n✅ Metadata saved!")

# STEP 2: Convert to NumPy for computation
arr = df_original.to_numpy()

print(f"\nNumPy array (metadata lost):")
print(arr)
print(f"Shape: {arr.shape}, dtype: {arr.dtype}")

# STEP 3: Perform computations (example: normalize values)
# Subtract mean and divide by std for each column
arr_normalized = (arr - arr.mean(axis=0)) / arr.std(axis=0)

print(f"\nNormalized array (still no metadata):")
print(arr_normalized)

# STEP 4: Restore metadata when converting back to DataFrame
df_result = pd.DataFrame(
    arr_normalized,
    index=saved_index,
    columns=saved_columns
)

print(f"\n📊 Reconstructed DataFrame with metadata:")
print(df_result)
print(f"\nIndex restored: {df_result.index.tolist()}")
print(f"Columns restored: {df_result.columns.tolist()}")
print(f"Dtypes restored (may differ due to normalization):\n{df_result.dtypes}")
🔄 Manual Metadata Preservation
==================================================
Original DataFrame:
           Temperature  Humidity  WindSpeed
Monday            72.5        45       12.3
Tuesday           68.3        52        8.7
Wednesday         75.1        48       15.2
Thursday          71.9        50       10.1

Index: ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
Columns: ['Temperature', 'Humidity', 'WindSpeed']
Dtypes:
Temperature    float64
Humidity         int64
WindSpeed      float64
dtype: object

✅ Metadata saved!

NumPy array (metadata lost):
[[72.5 45.  12.3]
 [68.3 52.   8.7]
 [75.1 48.  15.2]
 [71.9 50.  10.1]]
Shape: (4, 3), dtype: float64

Normalized array (still no metadata):
[[ 0.22667166 -1.45010473  0.29531936]
 [-1.50427558  1.25675744 -1.171094  ]
 [ 1.29821043 -0.29002095  1.47659679]
 [-0.02060651  0.48336824 -0.60082214]]

📊 Reconstructed DataFrame with metadata:
           Temperature  Humidity  WindSpeed
Monday        0.226672 -1.450105   0.295319
Tuesday      -1.504276  1.256757  -1.171094
Wednesday     1.298210 -0.290021   1.476597
Thursday     -0.020607  0.483368  -0.600822

Index restored: ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
Columns restored: ['Temperature', 'Humidity', 'WindSpeed']
Dtypes restored (may differ due to normalization):
Temperature    float64
Humidity       float64
WindSpeed      float64
dtype: object

10.5.3 Pandas DataFrame → NumPy Array

When to convert: You need NumPy’s computational speed and mathematical functions.

Common scenarios:

  • Performance-critical computations (linear algebra, statistics)
  • Integration with scientific computing libraries (SciPy, scikit-learn)
  • Machine learning algorithms that expect NumPy arrays
  • Mathematical operations not available in pandas

10.5.3.1 Conversion Methods & Performance

Let’s create a pandas DataFrame to demonstrate the conversion:

Code
print(" PANDAS TO NUMPY CONVERSION")
print("=" * 40)

# Start with a DataFrame of tech stock prices
stock_data = pd.DataFrame(
    {
        "AAPL":  [150.0, 152.5, 148.2, 155.1],
        "GOOGL": [2800.0, 2750.0, 2825.0, 2900.0],
        "MSFT":  [300.0, 305.0, 298.5, 310.0],
        "TSLA":  [800.0, 795.0, 820.0, 815.0],
    },
    index=["Day 1", "Day 2", "Day 3", "Day 4"]
)

print("📈 Original DataFrame:")
print(stock_data)
print(f"\nShape: {stock_data.shape}")
print("Dtypes:")
print(stock_data.dtypes)
print(f"Memory usage (deep): {stock_data.memory_usage(deep=True).sum()} bytes")
 PANDAS TO NUMPY CONVERSION
========================================
📈 Original DataFrame:
        AAPL   GOOGL   MSFT   TSLA
Day 1  150.0  2800.0  300.0  800.0
Day 2  152.5  2750.0  305.0  795.0
Day 3  148.2  2825.0  298.5  820.0
Day 4  155.1  2900.0  310.0  815.0

Shape: (4, 4)
Dtypes:
AAPL     float64
GOOGL    float64
MSFT     float64
TSLA     float64
dtype: object
Memory usage (deep): 344 bytes
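
10.5.3.1.1 Method 1: .to_numpy() (recommended)

.to_numpy() is the modern, explicit way to pull a DataFrame's values out as a NumPy array. A minimal example; it also defines the array_modern that Method 2 compares against below:

Code
# Method 1: .to_numpy() - the recommended approach
print("✅ Method 1 — .to_numpy() (recommended):")
array_modern = stock_data.to_numpy()

print("Converted array using .to_numpy():")
print(array_modern)
print(f"\nShape: {array_modern.shape}")
print(f"Data type: {array_modern.dtype}")

✅ Method 1 — .to_numpy() (recommended):
Converted array using .to_numpy():
[[ 150.  2800.   300.   800. ]
 [ 152.5 2750.   305.   795. ]
 [ 148.2 2825.   298.5  820. ]
 [ 155.1 2900.   310.   815. ]]

Shape: (4, 4)
Data type: float64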
10.5.3.1.2 Method 2: .values (legacy but still works)

An older method that still works but is not recommended for new code. It behaves similarly to .to_numpy() but can produce unexpected results with certain pandas dtypes.

Code
# Method 2: .values - Legacy approach (avoid in new code)
print("\n⚠️  Method 2 — .values (legacy):")
array_legacy = stock_data.values

print("Converted array using .values:")
print(array_legacy)
print(f"\nShape: {array_legacy.shape}")
print(f"Data type: {array_legacy.dtype}")

# Verify both methods produce same result (in this case)
print(f"\nSame result as .to_numpy()? {np.array_equal(array_modern, array_legacy)}")
print("✅ For simple numeric DataFrames, both methods work identically")

⚠️  Method 2 — .values (legacy):
Converted array using .values:
[[ 150.  2800.   300.   800. ]
 [ 152.5 2750.   305.   795. ]
 [ 148.2 2825.   298.5  820. ]
 [ 155.1 2900.   310.   815. ]]

Shape: (4, 4)
Data type: float64

Same result as .to_numpy()? True
✅ For simple numeric DataFrames, both methods work identically

Why avoid .values?

  • Less clear intent (what does values mean?)
  • Can behave unpredictably with pandas extension dtypes (nullable integers, strings, etc.)
  • Not officially recommended in pandas documentation
  • .to_numpy() is more explicit and future-proof
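
A quick illustration of that unpredictability with a pandas extension dtype (a sketch; the nullable-integer Series is an arbitrary example):

Code
# Nullable integer ("Int64") is a pandas extension dtype
s = pd.Series([1, 2, None], dtype="Int64")

print(type(s.values))      # IntegerArray: an extension array, not an ndarray!
print(type(s.to_numpy()))  # numpy.ndarray: always a genuine NumPy array
print(s.to_numpy())        # [1 2 <NA>] as an object-dtype array
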
10.5.3.1.3 Method 3: Force a specific dtype (e.g., int32)
Code
# Method 3: Force a specific dtype during conversion
print("\n🔧 Method 3 — Force data type (int32):")
array_int = stock_data.to_numpy(dtype=np.int32)

print("Array with forced int32 dtype:")
print(array_int)
print(f"\nOriginal dtype: {stock_data.to_numpy().dtype}")
print(f"Forced dtype: {array_int.dtype}")
print(f"\n⚠️ Warning: Float values were truncated (not rounded) to integers!")
print(f"Example: 150.0 → {array_int[0, 0]}, but 152.5 → {array_int[1, 0]} (loses 0.5)")

🔧 Method 3 — Force data type (int32):
Array with forced int32 dtype:
[[ 150 2800  300  800]
 [ 152 2750  305  795]
 [ 148 2825  298  820]
 [ 155 2900  310  815]]

Original dtype: float64
Forced dtype: int32

⚠️ Warning: Float values were truncated (not rounded) to integers!
Example: 150.0 → 150, but 152.5 → 152 (loses 0.5)

When to use dtype parameter:

Good use cases:

  • Ensuring consistent data types for mathematical operations
  • Reducing memory usage (e.g., float64 → float32)
  • Preparing data for ML libraries that require specific types
  • Converting to integer types when you know there are no decimals

Be careful:

  • Data loss when converting float to int (truncation, not rounding)
  • May raise errors if conversion is impossible (e.g., strings to numbers)
  • Check for NaN values before converting to integer types
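
A small sketch of that last check (the fill value 0 is an arbitrary policy choice for illustration):

Code
df_nan = pd.DataFrame({"price": [19.99, np.nan, 5.25]})

# Decide on a fill policy before casting to an integer dtype;
# NaN has no integer representation
if df_nan.isna().any().any():
    arr_int = df_nan.fillna(0).to_numpy(dtype=np.int64)
else:
    arr_int = df_nan.to_numpy(dtype=np.int64)

print(arr_int.ravel())  # [19  0  5]: floats truncated, NaN replaced by the fill
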
10.5.3.1.4 Method 4: Convert specific columns only

Why select specific columns?

  • Performance: Process only needed data
  • Memory: Smaller arrays use less RAM
  • Clarity: Makes your intent explicit
  • Debugging: Easier to track which data is being used
Code
# Method 4: Select and convert specific columns
print("\n🎯 Method 4 — Specific columns only (AAPL, MSFT):")

# Select only AAPL and MSFT columns, then convert
tech_stocks = stock_data[["AAPL", "MSFT"]].to_numpy()

print("Selected columns converted to array:")
print(tech_stocks)
print(f"\nOriginal DataFrame shape: {stock_data.shape} (4 rows × 4 columns)")
print(f"Selected array shape: {tech_stocks.shape} (4 rows × 2 columns)")
print(f"Data type: {tech_stocks.dtype}")
print(f"\n💡 Memory saved: {(stock_data.shape[1] - tech_stocks.shape[1]) / stock_data.shape[1] * 100:.0f}% fewer columns!")

🎯 Method 4 — Specific columns only (AAPL, MSFT):
Selected columns converted to array:
[[150.  300. ]
 [152.5 305. ]
 [148.2 298.5]
 [155.1 310. ]]

Original DataFrame shape: (4, 4) (4 rows × 4 columns)
Selected array shape: (4, 2) (4 rows × 2 columns)
Data type: float64

💡 Memory saved: 50% fewer columns!

Summary table

Method | Recommendation | Pros | Cons
------ | -------------- | ---- | ----
.to_numpy() | Use by default | Clear intent, stable API, future-proof | May copy data depending on dtypes/backing storage
.to_numpy(dtype=...) | When you need a dtype | Explicit, consistent numeric types; easier math | Casting may be lossy or fail; extra param to manage
.values | ⚠️ Legacy / avoid | Works on old pandas; quick | Ambiguous with mixed/extension dtypes; not future-proof

Best practices

  • Convert only needed columns to reduce memory and copying.
  • Specify dtype when downstream computations require consistent numeric types.
  • Check for missing values (NaN, NaT) before casting to integer dtypes.
  • Prefer .to_numpy() over .values for clarity and forward compatibility.

10.5.4 NumPy Array → Pandas DataFrame

Let’s create a NumPy array to demonstrate the conversion:

Code
# 📊 COMPREHENSIVE ARRAY → DATAFRAME CONVERSION
print("🔄 NUMPY TO PANDAS CONVERSION")
print("=" * 40)

# Original NumPy array
scores = np.array([
    [85, 92, 78, 95],
    [88, 76, 91, 82],
    [95, 89, 84, 90]
])

students = ['Alice', 'Bob', 'Carol']
subjects = ['Math', 'Science', 'English', 'History']

print("Original NumPy array:")
print(scores)
print(f"Shape: {scores.shape}")
🔄 NUMPY TO PANDAS CONVERSION
========================================
Original NumPy array:
[[85 92 78 95]
 [88 76 91 82]
 [95 89 84 90]]
Shape: (3, 4)

10.5.4.1 Method 1: Basic Conversion

Create a DataFrame directly from a NumPy array using pd.DataFrame(). By default, pandas will auto-generate integer labels (0, 1, 2, …) for both rows and columns.

Code
# Basic conversion - auto-generated index and column labels
df_basic = pd.DataFrame(scores)
print("\n📋 Basic DataFrame (auto index/columns):")
df_basic

📋 Basic DataFrame (auto index/columns):
   0   1   2   3
0  85  92  78  95
1  88  76  91  82
2  95  89  84  90

Notice that the row indices are [0, 1, 2] and column names are [0, 1, 2, 3] — these are automatically generated when no labels are specified.

10.5.4.2 Method 2: Custom Row and Column Labels

For better readability and data analysis, you can specify meaningful names for rows (index) and columns when creating the DataFrame.

Code
# Custom labels for better data interpretation
df_labeled = pd.DataFrame(scores, index=students, columns=subjects)
print("\n📊 Labeled DataFrame (custom index & columns):")
df_labeled

📊 Labeled DataFrame (custom index & columns):
       Math  Science  English  History
Alice    85       92       78       95
Bob      88       76       91       82
Carol    95       89       84       90

Now the DataFrame has meaningful labels:

  • Row indices: Student names (Alice, Bob, Carol)
  • Column names: Subject names (Math, Science, English, History)

This makes the data much more interpretable and easier to work with!

10.5.4.3 Key Parameters & Options

When converting NumPy arrays to DataFrames, you can customize the output using these parameters:

Parameter | Purpose | Example | Default
--------- | ------- | ------- | -------
data | NumPy array to convert | np.array([[1,2],[3,4]]) | Required
columns | Column labels | ['A', 'B'] | [0, 1, 2, ...]
index | Row labels | ['row1', 'row2'] | [0, 1, 2, ...]

💡 Best Practices:

  • Always specify meaningful column names for better code readability and maintenance
  • Use descriptive index names when your rows represent specific entities (people, dates, etc.)
  • Check data types and index after conversion with df.dtypes to ensure correct interpretation
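
A quick sanity check along those lines, reusing scores, students, and subjects from above:

Code
df_check = pd.DataFrame(scores, index=students, columns=subjects)
print(df_check.dtypes)          # every column should be int64 for this array
print(df_check.index.tolist())  # confirms the custom row labels were applied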

10.6 Random Number Generation in NumPy

Random number generation is a fundamental tool in data science, used for:

  • Simulations: Monte Carlo methods, risk analysis, game theory
  • Statistical Sampling: Bootstrap, cross-validation, hypothesis testing
  • Machine Learning: Data augmentation, weight initialization, dropout
  • Scientific Computing: Stochastic modeling, numerical experiments

NumPy’s random module provides fast, vectorized random number generation that’s orders of magnitude faster than Python’s built-in random module.

10.6.1 Why NumPy Random Over Python Random?

Aspect | Python random | NumPy random
------ | ------------- | ------------
Speed | Slow (one at a time) | Fast (vectorized)
Output | Single values | Arrays of any shape
Distributions | Limited | 40+ distributions
Use Case | Simple scripts | Scientific computing
Performance | ~0.1M numbers/sec | ~10M+ numbers/sec

💡 Rule: Use NumPy random for data science; use Python random for simple scripts.

10.6.2 Core Random Number Functions

NumPy provides functions for various probability distributions. Let’s explore the most commonly used ones with practical examples.

10.6.2.1 📊 Comparison Table: Which Function to Use?

Function | Distribution | Parameters | Use Case
-------- | ------------ | ---------- | --------
rand() | Uniform [0, 1) | shape | Quick random arrays, probabilities
randn() | Normal (μ=0, σ=1) | shape | Standard normal samples
randint() | Discrete uniform | low, high, size | Random integers, sampling
choice() | Sample from array | array, size, replace | Random selection, bootstrapping
uniform() | Uniform [a, b] | low, high, size | Custom range uniform
normal() | Normal (μ, σ) | mean, std, size | Real-world measurements
binomial() | Binomial | n, p, size | Coin flips, success/failure
poisson() | Poisson | lambda, size | Event counts, arrivals

Let’s see each in action with real examples!

10.6.3 Uniform Distribution: np.random.rand() and np.random.uniform()

Uniform distribution: All values in a range are equally likely.

10.6.3.1 np.random.rand() - Quick Uniform [0, 1)

Code
print("📊 UNIFORM DISTRIBUTION: np.random.rand()")
print("=" * 60)

# Generate a 2x3 array of random values between 0 and 1
rand_array = np.random.rand(2, 3)
print("2×3 array of uniform random numbers [0, 1):")
print(rand_array)
print(f"\nShape: {rand_array.shape}")
print(f"Min: {rand_array.min():.4f}, Max: {rand_array.max():.4f}")
print(f"Mean: {rand_array.mean():.4f} (expected ~0.5)")

# Practical use: Generate random probabilities
print("\n💡 Practical Use: Simulate coin flip probabilities")
probabilities = np.random.rand(10)
results = ['Heads' if p > 0.5 else 'Tails' for p in probabilities]
print(f"Probabilities: {probabilities.round(3)}")
print(f"Results: {results}")

10.6.3.2 np.random.uniform() - Custom Range Uniform [a, b]

More flexible - specify your own range.

Code
print("\n📊 UNIFORM DISTRIBUTION: np.random.uniform()")
print("=" * 60)

# Generate random numbers in a custom range
print("Example 1: Random temperatures between 60°F and 90°F")
temperatures = np.random.uniform(60, 90, size=7)
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for day, temp in zip(days, temperatures):
    print(f"  {day}: {temp:.1f}°F")

# 2D array example
print("\nExample 2: 3×5 matrix of random prices between $10 and $100")
prices = np.random.uniform(10, 100, size=(3, 5))
print(prices.round(2))
print(f"\nAverage price: ${prices.mean():.2f}")

10.6.4 Normal (Gaussian) Distribution: np.random.randn() and np.random.normal()

Normal distribution: Bell-shaped curve, most values cluster around the mean.
Used for: heights, weights, test scores, measurement errors, etc.

10.6.4.1 np.random.randn() - Standard Normal (μ=0, σ=1)

Code
print("📊 NORMAL DISTRIBUTION: np.random.randn()")
print("=" * 60)

# Generate standard normal distribution
normal_array = np.random.randn(1000)

print(f"Generated {len(normal_array)} samples from standard normal distribution")
print(f"Mean: {normal_array.mean():.4f} (expected: 0)")
print(f"Std Dev: {normal_array.std():.4f} (expected: 1)")
print(f"Min: {normal_array.min():.2f}, Max: {normal_array.max():.2f}")

# Show distribution statistics
print("\n📈 Distribution check (68-95-99.7 rule):")
within_1std = np.sum(np.abs(normal_array) <= 1) / len(normal_array) * 100
within_2std = np.sum(np.abs(normal_array) <= 2) / len(normal_array) * 100
within_3std = np.sum(np.abs(normal_array) <= 3) / len(normal_array) * 100

print(f"  Within ±1σ: {within_1std:.1f}% (expected: ~68%)")
print(f"  Within ±2σ: {within_2std:.1f}% (expected: ~95%)")
print(f"  Within ±3σ: {within_3std:.1f}% (expected: ~99.7%)")

10.6.4.2 np.random.normal() - Custom Normal (μ, σ)

Specify your own mean and standard deviation.

Code
print("\n📊 NORMAL DISTRIBUTION: np.random.normal()")
print("=" * 60)

# Example 1: Student exam scores (mean=75, std=10)
print("Example 1: Simulate exam scores (μ=75, σ=10)")
exam_scores = np.random.normal(75, 10, size=20)
print(f"Scores: {exam_scores.round(1)}")
print(f"Class average: {exam_scores.mean():.1f}")
print(f"Passing (≥60): {np.sum(exam_scores >= 60)}/{len(exam_scores)}")

# Example 2: Heights in inches (mean=68, std=3)
print("\nExample 2: Adult heights in inches (μ=68\", σ=3\")")
heights = np.random.normal(68, 3, size=100)
print(f"Generated {len(heights)} height measurements")
print(f"Average height: {heights.mean():.2f}\"")
print(f"Tallest: {heights.max():.1f}\", Shortest: {heights.min():.1f}\"")
print(f"Between 65\" and 71\": {np.sum((heights >= 65) & (heights <= 71))}")

10.6.5 Integer Random Numbers: np.random.randint()

Generate random integers within a specific range.

Use cases: Dice rolls, random IDs, sampling indices, game mechanics

Code
print("🎲 RANDOM INTEGERS: np.random.randint()")
print("=" * 60)

# Example 1: Dice rolls
print("Example 1: Roll a six-sided die 20 times")
dice_rolls = np.random.randint(1, 7, size=20)  # 1 to 6 inclusive
print(f"Rolls: {dice_rolls}")
print(f"Distribution: {dict(zip(*np.unique(dice_rolls, return_counts=True)))}")

# Example 2: Random matrix of integers
print("\nExample 2: 4×4 matrix of random integers [10, 20)")
int_matrix = np.random.randint(10, 20, size=(4, 4))
print(int_matrix)
print(f"Sum: {int_matrix.sum()}, Average: {int_matrix.mean():.2f}")

# Example 3: Random customer IDs
print("\nExample 3: Generate 10 random customer IDs [1000, 9999]")
customer_ids = np.random.randint(1000, 10000, size=10)
print(f"IDs: {customer_ids}")

10.6.6 Random Selection: np.random.choice()

Randomly select elements from an array with or without replacement.

Use cases: Bootstrapping, random sampling, A/B testing, lottery

Code
print("🎯 RANDOM SELECTION: np.random.choice()")
print("=" * 60)

# Example 1: Random selection WITH replacement
print("Example 1: Select 5 numbers WITH replacement [1-5]")
choice_array = np.random.choice([1, 2, 3, 4, 5], size=10, replace=True)
print(f"Selected: {choice_array}")
print(f"Notice: Numbers can repeat!")

# Example 2: Random selection WITHOUT replacement
print("\nExample 2: Select 3 winners from 10 contestants (no duplicates)")
contestants = np.array(['Alice', 'Bob', 'Carol', 'David', 'Eve', 
                        'Frank', 'Grace', 'Henry', 'Iris', 'Jack'])
winners = np.random.choice(contestants, size=3, replace=False)
print(f"Winners: {winners}")

# Example 3: Weighted random choice (probabilities)
print("\nExample 3: Biased coin flip (70% Heads, 30% Tails)")
outcomes = ['Heads', 'Tails']
probabilities = [0.7, 0.3]
flips = np.random.choice(outcomes, size=100, p=probabilities)
unique, counts = np.unique(flips, return_counts=True)
for outcome, count in zip(unique, counts):
    print(f"  {outcome}: {count}/100 ({count}%)")

# Example 4: Bootstrap sampling
print("\nExample 4: Bootstrap sampling (sampling with replacement)")
data = np.array([23, 45, 67, 34, 89, 12, 56])
bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
print(f"Original: {data}")
print(f"Bootstrap: {bootstrap_sample}")
print(f"Bootstrap mean: {bootstrap_sample.mean():.2f}, Original mean: {data.mean():.2f}")

10.6.7 Other Useful Distributions

NumPy provides many more distributions for specialized use cases.

Code
print("📊 OTHER USEFUL DISTRIBUTIONS")
print("=" * 60)

# 1. Binomial: Number of successes in n trials
print("1. BINOMIAL: Coin flips (10 flips, 50% probability)")
coin_flips = np.random.binomial(n=10, p=0.5, size=1000)
print(f"   Average heads in 10 flips: {coin_flips.mean():.2f} (expected: 5)")

# 2. Poisson: Count of events in fixed interval
print("\n2. POISSON: Customer arrivals (λ=3 per hour)")
arrivals = np.random.poisson(lam=3, size=24)  # 24 hours
print(f"   Arrivals per hour: {arrivals}")
print(f"   Total customers: {arrivals.sum()}, Average: {arrivals.mean():.2f}")

# 3. Exponential: Time between events
print("\n3. EXPONENTIAL: Time between customer arrivals (scale=5 min)")
wait_times = np.random.exponential(scale=5, size=10)
print(f"   Wait times (minutes): {wait_times.round(1)}")
print(f"   Average wait: {wait_times.mean():.2f} min")

# 4. Beta: Probabilities and proportions
print("\n4. BETA: Probability distribution (α=2, β=5)")
probabilities = np.random.beta(2, 5, size=1000)
print(f"   Mean probability: {probabilities.mean():.3f}")
print(f"   Range: [{probabilities.min():.3f}, {probabilities.max():.3f}]")

10.6.8 Reproducibility: Random Seeds

Problem: Random numbers are different every time you run your code!
Solution: Set a seed for reproducible results.
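
A minimal demonstration (the seed value 42 is arbitrary). The legacy np.random.seed() call sets NumPy's global random state; the newer Generator API from np.random.default_rng() gives an independent, seeded stream and is generally preferred in new code:

Code
# Legacy global seed: same seed → identical sequence
np.random.seed(42)
print(np.random.rand(3))
np.random.seed(42)
print(np.random.rand(3))   # exactly the same three numbers as above

# Modern Generator API: seeded and independent of the global state
rng = np.random.default_rng(42)
print(rng.random(3))       # reproducible across runs of the script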

10.6.9 Performance: NumPy vs Python Random

Let’s prove that NumPy random is dramatically faster than Python’s built-in random.

Code
import random as py_random

print(" PERFORMANCE COMPARISON: NumPy vs Python random")
print("=" * 60)

n_numbers = 1_000_000

# Python random (slow)
print(f"\n Python random module (generating {n_numbers:,} numbers):")
start_time = time.time()
python_randoms = [py_random.random() for _ in range(n_numbers)]
python_time = time.time() - start_time
print(f"   Time: {python_time:.4f} seconds")

# NumPy random (fast)
print(f"\n NumPy random module (generating {n_numbers:,} numbers):")
start_time = time.time()
numpy_randoms = np.random.rand(n_numbers)
numpy_time = time.time() - start_time
print(f"   Time: {numpy_time:.4f} seconds")

# Comparison
speedup = python_time / numpy_time if numpy_time > 0 else float('inf')
print(f"\n RESULT:")
print(f"   NumPy is {speedup:.1f}x faster!")
print(f"   Time saved: {(python_time - numpy_time):.4f} seconds ({(python_time - numpy_time)/python_time*100:.1f}%)")

print("\n Key Takeaway:")
print("   Use NumPy random for any serious data science work!")
 PERFORMANCE COMPARISON: NumPy vs Python random
============================================================

 Python random module (generating 1,000,000 numbers):
   Time: 0.1046 seconds

 NumPy random module (generating 1,000,000 numbers):
   Time: 0.0061 seconds

 RESULT:
   NumPy is 17.3x faster!
   Time saved: 0.0986 seconds (94.2%)

 Key Takeaway:
   Use NumPy random for any serious data science work!

10.6.10 Quick Reference Guide

# Uniform [0, 1)
np.random.rand(5)                    # 1D array of 5 numbers
np.random.rand(3, 4)                 # 3×4 matrix

# Uniform [a, b]
np.random.uniform(10, 20, size=10)   # 10 numbers between 10 and 20

# Normal (Gaussian)
np.random.randn(100)                 # Standard normal (μ=0, σ=1)
np.random.normal(50, 10, size=100)   # Custom μ=50, σ=10

# Integers
np.random.randint(1, 100, size=50)   # 50 random integers [1, 100)

# Random selection
np.random.choice([1,2,3,4,5], size=3, replace=False)  # No duplicates
np.random.choice(['A','B','C'], size=100, p=[0.5, 0.3, 0.2])  # Weighted

# Shuffling
arr = np.array([1, 2, 3, 4, 5])
np.random.shuffle(arr)               # Shuffles in-place

# Permutation
np.random.permutation(10)            # Random permutation of 0-9

10.7 Independent Study

This is where you apply everything you’ve learned about NumPy vectorization to solve real-world problems. Each exercise progressively builds your skills in:

  • Vectorized computations for performance optimization
  • Matrix multiplication for multi-dimensional operations
  • Random number generation for simulations
  • Statistical methods like bootstrapping

10.7.1 Practice Exercise 1: Shopping Optimization with Vectorization

📋 Problem Statement: Three shoppers (Ben, Barbara, and Beth) need to buy groceries (rolls, buns, cakes, and bread). Two stores (Target and Kroger) have different prices. Which store should each person choose to minimize their total cost?

📂 Data Files:

  • food_quantity.csv: How many of each item each person needs
  • price.csv: Price of each item at each store

Strategy: We’ll solve this in 3 progressive steps to understand the power of vectorization:

  1. Step 1: Calculate Ben’s cost at Target (simplest)
  2. Step 2: Calculate Ben’s cost at both stores (intermediate)
  3. Step 3: Calculate everyone’s cost at all stores (complete solution)

10.7.1.1 Step 1: Calculate Ben’s Cost at Target (Simplest Case)

Goal: Find how much Ben will spend if he buys everything at Target.

✅ Key Insight: Ben will spend $50 at Target. Vectorized operations are cleaner and faster!

10.7.1.2 Step 2: Calculate Ben’s Cost at BOTH Stores (Intermediate)

Goal: Find Ben’s cost at Target AND Kroger to determine which is cheaper.

✅ Key Insight: Ben spends $50 at Target vs $49 at Kroger → Kroger saves $1!

10.7.1.3 Step 3: Calculate EVERYONE’S Cost at ALL Stores (Complete Solution)

Goal: Find the best store for Ben, Barbara, AND Beth.

Method 1: Using pandas DataFrame nested loops

Method 2: Using NumPy matrix multiplication

This is where vectorization truly shines! 🌟
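
A hedged sketch of Method 2 (it assumes quantities is a 3 people × 4 items array loaded from food_quantity.csv and prices is a 4 items × 2 stores array loaded from price.csv):

Code
# (3 × 4) @ (4 × 2) → (3 × 2): total cost for every (person, store) pair at once
costs = quantities @ prices

# Cheapest store per person: index of the minimum along the store axis
best_store = costs.argmin(axis=1)
print(costs)       # one row per shopper, one column per store
print(best_store)  # e.g., 0 = Target, 1 = Kroger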

10.7.2 Practice Exercise 2: Movie Rating Analysis with Matrix Multiplication

Problem Statement:

You have a dataset of movies with:

  • Ratings: IMDB Rating and Rotten Tomatoes Rating
  • Genres: Comedy, Action, Drama, Horror (binary flags: 1 = movie is in genre, 0 = not)

📂 Dataset: movies_cleaned.csv

Questions to Answer:

  1. What is the average IMDB rating for each genre?
  2. What is the average Rotten Tomatoes rating for each genre?
  3. Which genre is most preferred by IMDB users?
  4. Which genre is least preferred by Rotten Tomatoes critics?

10.7.2.1 Step 0: Load movies dataset

10.7.2.2 Step 1: Create Rating Matrix (N movies × 2 ratings)

10.7.2.3 Step 2: Create Genre Matrix (N movies × 4 genres)

10.7.2.4 Step 3: Matrix Multiplication for Total Ratings

Goal: Find total IMDB and Rotten Tomatoes ratings for each genre.

Matrix Operation: Ratings.T @ Genres

⚠️ Dimension Check:

  • Ratings matrix: (N × 2)
  • Genres matrix: (N × 4)
  • For multiplication, we need: (2 × N) @ (N × 4) = (2 × 4) ✅
  • Solution: Transpose the ratings matrix!

10.7.2.5 Step 4: Count Movies per Genre

To get averages, we need to divide total ratings by the number of movies in each genre.

10.7.2.6 Step 5: Compute the average rating per Genre
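
A hedged sketch of Steps 3–5 combined (it assumes ratings is an (N × 2) array of [IMDB, Rotten Tomatoes] scores and genres is an (N × 4) binary array of [comedy, Action, drama, horror] flags built in Steps 1–2):

Code
totals = ratings.T @ genres    # (2 × 4): summed ratings per (source, genre)
counts = genres.sum(axis=0)    # (4,): number of movies in each genre
averages = totals / counts     # broadcasting divides each row by the counts
print(averages)                # row 0: IMDB averages, row 1: Rotten Tomatoes averages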

Key Findings:

IMDB users prefer DRAMA (highest average rating) and are least amused by COMEDY movies on average.

Rotten Tomatoes critics prefer DRAMA over HORROR movies on average.

10.7.3 Practice Exercise 3: Simulation Study with Random Number Generation

📋 Problem Statement:

Two food carts serve 500 customers each, every day for 30 days. The waiting times follow different distributions:

  • Food Cart 1: Uniform distribution [5, 25] minutes (unpredictable service)
  • Food Cart 2: Normal distribution with μ=8 min, σ=3 min (consistent service)

Assumptions:

  • Waiting times are measured simultaneously (paired observations)
  • Same 500 people visit daily over 30 days

Questions to Answer:

  1. On how many days is the average waiting time at Food Cart 2 higher than Food Cart 1?
  2. What percentage of individual waiting times at Food Cart 2 exceed Food Cart 1?
  3. How much faster is vectorized random generation vs loops?

Strategy:

  • Simulation size: 500 people × 30 days = 15,000 observations per cart
  • Method 1: Nested loops (slow but explicit)
  • Method 2: Vectorized NumPy (fast and elegant; sketched below)
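
A hedged sketch of Method 2 (the seed 0 is arbitrary; the shapes follow the 500 people × 30 days setup above):

Code
rng = np.random.default_rng(0)
cart1 = rng.uniform(5, 25, size=(30, 500))   # Cart 1: unpredictable service
cart2 = rng.normal(8, 3, size=(30, 500))     # Cart 2: consistent service

# Q1: days on which Cart 2's average wait exceeds Cart 1's
days_cart2_slower = int((cart2.mean(axis=1) > cart1.mean(axis=1)).sum())

# Q2: share of paired individual waits where Cart 2 is slower
pct_cart2_slower = (cart2 > cart1).mean() * 100

print(f"Days Cart 2 slower on average: {days_cart2_slower}/30")
print(f"Individual waits where Cart 2 is slower: {pct_cart2_slower:.1f}%")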

10.7.4 Practice Exercise 4: Bootstrapping for Confidence Intervals

Problem Statement:

Find the 95% confidence interval for the mean profit of Action movies using bootstrapping.

What is Bootstrapping?

Bootstrapping is a non-parametric statistical method that estimates the sampling distribution of a statistic by resampling from the observed data. It’s used when:

  • Sample size is small
  • Distribution is unknown or non-normal
  • Theoretical formulas are complex or unavailable

Bootstrap Algorithm:

  1. Extract profit data for all Action movies (N movies)
  2. Resample N values WITH replacement from the profit data
  3. Calculate the mean of the resampled data
  4. Repeat steps 2-3 M=1000 times
  5. Find the 2.5th and 97.5th percentiles of the 1000 means

Result: [2.5th percentile, 97.5th percentile] = 95% Confidence Interval

Dataset: movies_cleaned.csv

Hints: Use np.random.choice() for sampling with replacement
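
A hedged sketch of the algorithm (it assumes profits is a 1D array of Action-movie profits already extracted from movies_cleaned.csv; the seed is arbitrary):

Code
rng = np.random.default_rng(42)
M = 1000

# Steps 2-4: resample with replacement M times, recording each resample's mean
boot_means = np.array([
    rng.choice(profits, size=len(profits), replace=True).mean()
    for _ in range(M)
])

# Step 5: the middle 95% of the bootstrap means is the confidence interval
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for mean Action-movie profit: [{ci_low:.2f}, {ci_high:.2f}]")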

======================================================================
🎬 BOOTSTRAPPING: 95% CI for Action Movie Profits
======================================================================
  | Title | IMDB Rating | Rotten Tomatoes Rating | Running Time min | Release Date | US Gross | Worldwide Gross | Production Budget | comedy | Action | drama | horror
0 | Broken Arrow | 5.8 | 55 | 108 | Feb 09 1996 | 70645997 | 148345997 | 65000000 | 0 | 1 | 0 | 0
1 | Brazil | 8.0 | 98 | 136 | Dec 18 1985 | 9929135 | 9929135 | 15000000 | 1 | 0 | 0 | 0
2 | The Cable Guy | 5.8 | 52 | 95 | Jun 14 1996 | 60240295 | 102825796 | 47000000 | 1 | 0 | 0 | 0
3 | Chain Reaction | 5.2 | 13 | 106 | Aug 02 1996 | 21226204 | 60209334 | 55000000 | 0 | 1 | 0 | 0
4 | Clash of the Titans | 5.9 | 65 | 108 | Jun 12 1981 | 30000000 | 30000000 | 15000000 | 0 | 1 | 0 | 0
Confidence interval = [$133.5 million, $181.08 million]