10 NumPy Intermediate
10.1 Learning Objectives
By the end of this chapter, you will be able to:
- Optimize code performance using vectorization instead of loops
- Perform vectorized computations that are 10–100x faster than Python loops
- Generate random data for simulations and statistical analysis
- Convert between NumPy and Pandas for optimal workflow efficiency
Code
import numpy as np
import pandas as pd
import time
import warnings
# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')
10.2 Vectorization in NumPy
Vectorization applies operations to entire arrays simultaneously instead of looping through individual elements. This enables NumPy to leverage highly optimized C/Fortran libraries for dramatic performance gains.

10.2.1 Why Vectorization Matters
Vectorized operations require fewer lines of code and are easier to read compared to Python loops.
Moreover, since they rely on compiled C and Fortran libraries that leverage CPU SIMD instructions and efficient cache utilization, they can be significantly faster and more scalable than pure Python loops.
Loop Approach:
result = []
for i in range(len(array1)):
    for j in range(len(array2)):
        result.append(array1[i] * array2[j] + some_value)
Vectorized Approach:
result = array1[:, np.newaxis] * array2 + some_value
10.2.1.1 Performance Comparison: NumPy Vectorization vs. Python Loops
Understanding the performance benefits of vectorization is crucial for writing efficient scientific computing code. Let’s compare different scenarios to see when and why vectorization matters.
10.2.1.1.1 Scenario 1: Basic Mathematical Operations
np.dot is a vectorized operation in NumPy that performs matrix multiplication or dot product between arrays. It efficiently computes the element-wise multiplications and then sums them up. To better understand its efficiency, let’s first implement the dot product using for loops, and then compare its performance with np.dot to see the benefits of vectorization.
Code
# Function to calculate dot product using for loops
def dot_product_loops(arr1, arr2):
    result = 0
    for i in range(len(arr1)):
        result += arr1[i] * arr2[i]
    return result

# Create sample arrays of different sizes for comparison
sizes = [1000, 10000, 100000, 1000000]
print("Performance Comparison: Loops vs NumPy Vectorization")
print("=" * 60)
for size in sizes:
    # Create random arrays
    arr1 = np.random.rand(size)
    arr2 = np.random.rand(size)
    # Measure time for the loop-based implementation
    start_time = time.time()
    loop_result = dot_product_loops(arr1, arr2)
    loop_time = time.time() - start_time
    # Measure time for np.dot
    start_time = time.time()
    numpy_result = np.dot(arr1, arr2)
    numpy_time = time.time() - start_time
    # Calculate speedup
    speedup = loop_time / numpy_time if numpy_time > 0 else float('inf')
    print(f"\nArray size: {size:,}")
    print(f"Loop result: {loop_result:.5f}, Time: {loop_time:.5f}s")
    print(f"NumPy result: {numpy_result:.5f}, Time: {numpy_time:.5f}s")
    print(f"Speedup: {speedup:.1f}x faster")

print(f"\n{'='*60}")
print("Key Observations:")
print("• NumPy becomes dramatically faster as array size increases")
print("• Speedup can reach over 100x for large arrays")
print("• NumPy overhead is minimal for small arrays but pays off quickly")
Performance Comparison: Loops vs NumPy Vectorization
============================================================
Array size: 1,000
Loop result: 240.51012, Time: 0.00100s
NumPy result: 240.51012, Time: 0.00000s
Speedup: infx faster
Array size: 10,000
Loop result: 2483.13248, Time: 0.00450s
NumPy result: 2483.13248, Time: 0.00000s
Speedup: infx faster
Array size: 100,000
Loop result: 24918.72212, Time: 0.02491s
NumPy result: 24918.72212, Time: 0.00000s
Speedup: infx faster
Array size: 1,000,000
Loop result: 250291.73101, Time: 0.18839s
NumPy result: 250291.73101, Time: 0.00098s
Speedup: 192.8x faster
============================================================
Key Observations:
• NumPy becomes dramatically faster as array size increases
• Speedup can reach over 100x for large arrays
• NumPy overhead is minimal for small arrays but pays off quickly
Note: the "infx" speedups above appear because time.time() has limited resolution, so the NumPy timings for the smaller arrays round to zero; time.perf_counter() would give finer-grained measurements.
10.2.1.1.2 Scenario 2: Complex Mathematical Operations
Let’s compare more complex operations that show even greater performance differences:
Code
# Complex mathematical operation: polynomial evaluation
# f(x) = 3x³ + 2x² - 5x + 1
def polynomial_loops(x_values):
    """Evaluate polynomial using loops"""
    results = []
    for x in x_values:
        result = 3*x**3 + 2*x**2 - 5*x + 1
        results.append(result)
    return results

def polynomial_vectorized(x_values):
    """Evaluate polynomial using vectorized operations"""
    return 3*x_values**3 + 2*x_values**2 - 5*x_values + 1

# Test with a large array
size = 5000000
x_data = np.random.uniform(-10, 10, size)
print("Complex Operations: Polynomial Evaluation")
print("f(x) = 3x³ + 2x² - 5x + 1")
print("-" * 40)
# Loop version
start_time = time.time()
loop_results = polynomial_loops(x_data)
loop_time = time.time() - start_time
# Vectorized version
start_time = time.time()
vector_results = polynomial_vectorized(x_data)
vector_time = time.time() - start_time
speedup = loop_time / vector_time
print(f"Array size: {size:,}")
print(f"Loop time: {loop_time:.4f}s")
print(f"Vectorized time: {vector_time:.4f}s")
print(f"Speedup: {speedup:.1f}x")
# Verify results match
print(f"Results match: {np.allclose(loop_results, vector_results)}")
Complex Operations: Polynomial Evaluation
f(x) = 3x³ + 2x² - 5x + 1
----------------------------------------
Array size: 5,000,000
Loop time: 2.8250s
Vectorized time: 0.2053s
Speedup: 13.8x
Results match: True
The performance comparisons clearly demonstrate that vectorized NumPy operations significantly outperform equivalent Python loops.
10.3 Vectorized Operations in NumPy
NumPy provides the foundation for efficient computation across the scientific Python stack; even pandas' vectorized operations are built on top of it.
10.3.1 Types of Vectorized Operations
NumPy supports several categories of vectorized operations:
10.3.1.1 Element-wise Operations (Universal Functions - ufuncs)
These operations apply a function to each element of an array:
Code
# Basic arithmetic operations (all vectorized)
arr = np.array([1, 4, 9, 16, 25])
print("Original array:", arr)
print("Square root:", np.sqrt(arr))
print("Natural log:", np.log(arr))
print("Sine:", np.sin(arr))
print("Power of 2:", np.power(arr, 2))
# Comparison operations
print("\nComparison operations:")
print("Greater than 10:", arr > 10)
print("Equal to 9:", arr == 9)
print("Between 5 and 20:", (arr >= 5) & (arr <= 20))Original array: [ 1 4 9 16 25]
Square root: [1. 2. 3. 4. 5.]
Natural log: [0. 1.38629436 2.19722458 2.77258872 3.21887582]
Sine: [ 0.84147098 -0.7568025 0.41211849 -0.28790332 -0.13235175]
Power of 2: [ 1 16 81 256 625]
Comparison operations:
Greater than 10: [False False False True True]
Equal to 9: [False False True False False]
Between 5 and 20: [False False True True False]
10.3.1.2 Aggregation Operations
These reduce arrays to scalar values or smaller arrays:
Code
# 2D array for aggregation examples
matrix = np.array([[1, 2, 3, 4],
                   [5, 6, 7, 8],
                   [9, 10, 11, 12]])
print("Matrix:")
print(matrix)
# Aggregations across entire array
print(f"\nSum of all elements: {np.sum(matrix)}")
print(f"Mean: {np.mean(matrix)}")
print(f"Standard deviation: {np.std(matrix)}")
print(f"Min: {np.min(matrix)}, Max: {np.max(matrix)}")
# Aggregations along specific axes
print(f"\nSum along axis 0 (columns): {np.sum(matrix, axis=0)}")
print(f"Sum along axis 1 (rows): {np.sum(matrix, axis=1)}")
print(f"Mean along axis 0: {np.mean(matrix, axis=0)}")
print(f"Max along axis 1: {np.max(matrix, axis=1)}")Matrix:
[[ 1 2 3 4]
[ 5 6 7 8]
[ 9 10 11 12]]
Sum of all elements: 78
Mean: 6.5
Standard deviation: 3.452052529534663
Min: 1, Max: 12
Sum along axis 0 (columns): [15 18 21 24]
Sum along axis 1 (rows): [10 26 42]
Mean along axis 0: [5. 6. 7. 8.]
Max along axis 1: [ 4 8 12]
10.3.1.3 Boolean Operations and Fancy Indexing
Vectorized selection and filtering operations:
Code
# Boolean indexing - vectorized filtering
data = np.array([1, 15, 8, 23, 4, 16, 42, 3, 19, 7])
# Find elements meeting multiple conditions (vectorized)
mask = (data > 5) & (data < 20)
filtered_data = data[mask]
print(f"Original data: {data}")
print(f"Elements between 5 and 20: {filtered_data}")
# Using np.where for conditional replacement (vectorized)
# Replace values > 15 with -1, others with original value
result = np.where(data > 15, -1, data)
print(f"After conditional replacement: {result}")
# Fancy indexing - select multiple elements at once
indices = [0, 2, 4, 7]
selected = data[indices]
print(f"Elements at indices {indices}: {selected}")
# Count how many elements meet condition (vectorized)
count_large = np.sum(data > 10)
print(f"Count of elements > 10: {count_large}")Original data: [ 1 15 8 23 4 16 42 3 19 7]
Elements between 5 and 20: [15 8 16 19 7]
After conditional replacement: [ 1 15 8 -1 4 -1 -1 3 -1 7]
Elements at indices [0, 2, 4, 7]: [1 8 4 3]
Count of elements > 10: 5
10.3.2 Matrix Multiplication in NumPy with Vectorization
Matrix multiplication is one of the most common and computationally intensive operations in numerical computing and deep learning. NumPy offers efficient and highly optimized methods for performing matrix multiplication, which leverage vectorization to handle large matrices quickly and accurately.
Note: NumPy matrix operations follow the standard rules of linear algebra, so it’s important to ensure that the shapes of the matrices are compatible. If they are not, consider reshaping the matrices before performing multiplication.
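For example, a quick compatibility check before multiplying (a minimal sketch with placeholder arrays):
Code
A = np.ones((2, 3))
B = np.ones((3, 4))
# Inner dimensions must match: (2, 3) @ (3, 4) → (2, 4)
if A.shape[1] == B.shape[0]:
    print((A @ B).shape)  # (2, 4)
else:
    print("Incompatible shapes; reshape or transpose one matrix first")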
There are two common methods for matrix multiplication:
10.3.2.1 Method 1: Matrix Multiplication Using np.dot()
Code
# Define two 2D arrays (matrices)
matrix1 = np.array([[1, 2, 3],
                    [4, 5, 6]])
matrix2 = np.array([[7, 8],
                    [9, 10],
                    [11, 12]])
# Matrix multiplication using np.dot
result_dot = np.dot(matrix1, matrix2)
print("Matrix Multiplication using np.dot:\n", result_dot)
# Another way to perform np.dot for matrix multiplication
result_dot2 = matrix1.dot(matrix2)
print("\nMatrix Multiplication using dot method:\n", result_dot2)
Matrix Multiplication using np.dot:
[[ 58 64]
[139 154]]
Matrix Multiplication using dot method:
[[ 58 64]
[139 154]]
10.3.2.2 Method 2: Matrix Multiplication using np.matmul() or @
Code
# Matrix multiplication using np.matmul or @ operator
result_matmul = np.matmul(matrix1, matrix2)
result_operator = matrix1 @ matrix2
print("\nMatrix Multiplication using np.matmul:\n", result_matmul)
print("\nMatrix Multiplication using @ operator:\n", result_operator)
Matrix Multiplication using np.matmul:
[[ 58 64]
[139 154]]
Matrix Multiplication using @ operator:
[[ 58 64]
[139 154]]
Important: Note that the * operator in NumPy performs element-wise multiplication, not matrix multiplication.
Code
# Using * operator for element-wise multiplication will cause an error if shapes don't match
# Uncomment the code below to see the error:
# element_wise = matrix1 * matrix2
# print("\nElement-wise Multiplication:\n", element_wise)Code
# reshape the array for element-wise multiplication
matrix2_reshaped = matrix2.reshape(2, 3)
element_wise = matrix1 * matrix2_reshaped
print("\nElement-wise Multiplication after reshaping:\n", element_wise)
Element-wise Multiplication after reshaping:
[[ 7 16 27]
[40 55 72]]
10.4 Converting Between NumPy Arrays and Pandas DataFrames
Seamless data conversion between NumPy and pandas is essential for efficient data science workflows. Each library excels in different areas, and knowing when and how to convert between them maximizes your analytical power.
Typical Data Science Workflow
1. LOAD DATA → Use Pandas (CSV, Excel, databases)
2. CLEAN & PREPARE → Use Pandas (filtering, grouping, handling missing data)
3. COMPUTE → Convert to NumPy (ML algorithms, linear algebra, heavy math)
4. ANALYZE & VISUALIZE → Convert back to Pandas (results interpretation)
5. EXPORT → Use Pandas (CSV, Parquet, SQL)
NumPy and pandas complement each other perfectly:
| Aspect | NumPy Arrays | Pandas DataFrames |
|---|---|---|
| Strength | Mathematical computations | Data manipulation & analysis |
| Speed | Faster numerical operations | Easier data exploration |
| Memory | More memory efficient | Rich metadata (labels, dtypes) |
| Use Cases | Linear algebra, statistics, ML algorithms | Data cleaning, grouping, merging |
| Indexing | Integer-based only | Label-based + integer-based |
| Missing Data | Limited NaN support | Robust NaN handling |
The key question: When should you stay in pandas vs convert to NumPy?
10.4.1 Decision Framework: When to Use Which Library
10.4.1.1 Stay in Pandas for Vectorized Operations
Vectorized operations in Pandas are built on top of NumPy and are often fast enough for most tasks!
These operations include, but are not limited to:
1. Standard Mathematical Operations
   - Arithmetic: `df['total'] = df['price'] * df['quantity']`
   - Normalization: `df_norm = (df - df.mean()) / df.std()`
   - Scaling: `df['scaled'] = df['value'] / df['value'].max()`
2. Statistical Computations
   - Aggregations: `df.mean()`, `df.sum()`, `df.std()`
   - Correlations: `df.corr()`, `df['A'].corr(df['B'])`
   - Rolling windows: `df['MA'] = df['price'].rolling(7).mean()`
3. Element-wise Transformations
   - Vectorized operations: `df['log_val'] = np.log(df['value'])`
   - Conditional logic: `df['flag'] = df['amount'] > 100`
   - String operations: `df['upper'] = df['name'].str.upper()`
Let’s see this in action:
Code
print("💡 STAYING IN PANDAS: When Conversion Isn't Needed")
print("=" * 60)
# Example: Normalize data WITHOUT converting to NumPy
df_data = pd.DataFrame({
    "Sales_Q1": [10000, 20000, 30000, 40000],
    "Sales_Q2": [12000, 22000, 28000, 38000],
    "Sales_Q3": [15000, 25000, 35000, 45000]
}, index=["Product_A", "Product_B", "Product_C", "Product_D"])
print("Original Sales Data:")
print(df_data)
print(f"\nDtypes: {df_data.dtypes.unique()}")
# ❌ METHOD 1: Using NumPy (requires metadata handling)
print("\n" + "="*60)
print("❌ Method 1: NumPy Approach (more complex)")
print("="*60)
# Save metadata
saved_index = df_data.index
saved_columns = df_data.columns
# Convert to NumPy
arr = df_data.to_numpy()
print(f"1. Converted to NumPy array (shape: {arr.shape})")
# Normalize
arr_norm = (arr - arr.mean(axis=0)) / arr.std(axis=0)
print(f"2. Normalized using NumPy")
# Convert back and restore metadata
df_norm_numpy = pd.DataFrame(arr_norm, index=saved_index, columns=saved_columns)
print(f"3. Converted back to DataFrame with manual metadata restoration")
print(f"\nResult:")
print(df_norm_numpy.round(3))
# ✅ METHOD 2: Pure Pandas (simpler and automatic!)
print("\n" + "="*60)
print("✅ Method 2: Pandas Approach (simpler!)")
print("="*60)
df_norm_pandas = (df_data - df_data.mean()) / df_data.std()
print(f"Result (metadata preserved automatically):")
print(df_norm_pandas.round(3))
# Verify they're the same
print(f"\n✓ Results identical? {np.allclose(df_norm_numpy.values, df_norm_pandas.values)}")
print(f"✓ Metadata preserved? {df_norm_pandas.index.equals(df_data.index) and df_norm_pandas.columns.equals(df_data.columns)}")
print("\n💡 Takeaway: For standard operations, pandas is simpler and just as fast!")💡 STAYING IN PANDAS: When Conversion Isn't Needed
============================================================
Original Sales Data:
Sales_Q1 Sales_Q2 Sales_Q3
Product_A 10000 12000 15000
Product_B 20000 22000 25000
Product_C 30000 28000 35000
Product_D 40000 38000 45000
Dtypes: [dtype('int64')]
============================================================
❌ Method 1: NumPy Approach (more complex)
============================================================
1. Converted to NumPy array (shape: (4, 3))
2. Normalized using NumPy
3. Converted back to DataFrame with manual metadata restoration
Result:
Sales_Q1 Sales_Q2 Sales_Q3
Product_A -1.342 -1.378 -1.342
Product_B -0.447 -0.318 -0.447
Product_C 0.447 0.318 0.447
Product_D 1.342 1.378 1.342
============================================================
✅ Method 2: Pandas Approach (simpler!)
============================================================
Result (metadata preserved automatically):
Sales_Q1 Sales_Q2 Sales_Q3
Product_A -1.162 -1.193 -1.162
Product_B -0.387 -0.275 -0.387
Product_C 0.387 0.275 0.387
Product_D 1.162 1.193 1.162
✓ Results identical? False
✓ Metadata preserved? True
💡 Takeaway: For standard operations, pandas is simpler and just as fast!
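Why does the identity check print False? pandas' .std() defaults to the sample standard deviation (ddof=1), while NumPy's .std() defaults to the population version (ddof=0), so the two normalizations divide by slightly different values. A minimal sketch of the reconciliation, reusing df_data from above:
Code
arr = df_data.to_numpy()

# pandas defaults to ddof=1 (sample std); NumPy defaults to ddof=0 (population std)
print(df_data.std().to_numpy())  # sample standard deviations
print(arr.std(axis=0))           # population standard deviations (slightly smaller)

# Matching the ddof makes the two approaches agree
df_norm_pop = (df_data - df_data.mean()) / df_data.std(ddof=0)
arr_norm = (arr - arr.mean(axis=0)) / arr.std(axis=0)
print(np.allclose(df_norm_pop.to_numpy(), arr_norm))  # True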
10.4.1.2 Convert to NumPy When You Must
Convert your DataFrame to a NumPy array only in specific scenarios where Pandas alone isn’t enough:
1. Specialized NumPy Functions
   - Linear algebra: `np.linalg.inv()`, `np.linalg.eig()`, matrix decompositions
   - FFT/signal processing: `np.fft.fft()`, `np.convolve()`
   - Advanced random sampling: `np.random.multivariate_normal()`
   - Custom mathematical operations not available in pandas
2. ML Library Integration
   - scikit-learn: `model.fit(X_train, y_train)` expects NumPy arrays
   - TensorFlow/PyTorch: Neural networks require tensor/array inputs
   - SciPy: Scientific functions expect NumPy arrays
3. Performance-Critical Code
   - Nested loops that can be vectorized with NumPy broadcasting
   - Custom algorithms requiring low-level array manipulation
   - Very large datasets (> 10 million rows) where memory matters
4. Complex Conditional Logic
   - Multi-condition selections with `np.where()`, `np.select()`
   - Better performance than chained pandas operations
Let’s see when NumPy is actually necessary:
Code
print("🚀 WHEN NUMPY IS NECESSARY: Real Use Cases")
print("=" * 60)
# Create a realistic dataset
np.random.seed(42)
n_rows = 50000
performance_data = pd.DataFrame({
    'customer_id': range(1, n_rows + 1),
    'purchase_amount': np.random.uniform(10, 1000, n_rows),
    'discount_pct': np.random.choice([0, 5, 10, 15, 20], n_rows),
    'category': np.random.choice(['A', 'B', 'C', 'D'], n_rows),
    'rating': np.random.randint(1, 6, n_rows),
    'is_premium': np.random.choice([True, False], n_rows)
})
print(f"📊 Dataset: {performance_data.shape[0]:,} rows × {performance_data.shape[1]} columns")
print(f"First 3 rows:")
print(performance_data.head(3))
🚀 WHEN NUMPY IS NECESSARY: Real Use Cases
============================================================
📊 Dataset: 50,000 rows × 6 columns
First 3 rows:
customer_id purchase_amount discount_pct category rating is_premium
0 1 380.794718 0 C 1 True
1 2 951.207163 15 D 1 False
2 3 734.674002 5 D 1 True
Code
print("\n" + "="*60)
print("USE CASE 1: Complex Multi-Condition Logic with np.select()")
print("="*60)
def customer_segment_function(row):
    """Complex business logic - slow with apply()"""
    if row['is_premium'] and row['purchase_amount'] > 500:
        return 'VIP'
    elif row['is_premium'] and row['rating'] >= 4:
        return 'Premium+'
    elif row['purchase_amount'] > 200 and row['rating'] >= 4:
        return 'High Value'
    elif row['purchase_amount'] > 100:
        return 'Standard'
    else:
        return 'Basic'
# Method 1: APPLY (Easy to write but slow)
print("\n❌ Method 1: Using apply() with function")
start_time = time.time()
performance_data['segment_apply'] = performance_data.apply(customer_segment_function, axis=1)
apply_time = time.time() - start_time
print(f"Time: {apply_time:.4f}s")
# Method 2: VECTORIZED with np.select() (Faster!)
print("\n✅ Method 2: NumPy np.select() - vectorized")
start_time = time.time()
# Define conditions using NumPy/pandas vectorized operations
conditions = [
    (performance_data['is_premium'] & (performance_data['purchase_amount'] > 500)),
    (performance_data['is_premium'] & (performance_data['rating'] >= 4)),
    ((performance_data['purchase_amount'] > 200) & (performance_data['rating'] >= 4)),
    (performance_data['purchase_amount'] > 100)
]
choices = ['VIP', 'Premium+', 'High Value', 'Standard']
# np.select() evaluates conditions vectorized and picks corresponding choice
performance_data['segment_vectorized'] = np.select(conditions, choices, default='Basic')
vectorized_time = time.time() - start_time
print(f"Time: {vectorized_time:.4f}s")
# Performance comparison
speedup = apply_time / vectorized_time if vectorized_time > 0 else float('inf')
print(f"\n🏆 Speedup: {speedup:.1f}x faster with np.select()!")
print(f"Time saved: {((apply_time - vectorized_time) / apply_time * 100):.1f}%")
# Verify results match
results_match = (performance_data['segment_apply'] == performance_data['segment_vectorized']).all()
print(f"✓ Results identical: {results_match}")
# Show distribution
print(f"\n📊 Customer Segment Distribution:")
segment_counts = performance_data['segment_apply'].value_counts().sort_index()
for segment, count in segment_counts.items():
    print(f"  {segment}: {count:,} ({count/len(performance_data)*100:.1f}%)")
# Clean up
performance_data.drop(['segment_vectorized'], axis=1, inplace=True)
============================================================
USE CASE 1: Complex Multi-Condition Logic with np.select()
============================================================
❌ Method 1: Using apply() with function
Time: 0.3499s
✅ Method 2: NumPy np.select() - vectorized
Time: 0.0124s
🏆 Speedup: 28.2x faster with np.select()!
Time saved: 96.5%
✓ Results identical: True
📊 Customer Segment Distribution:
Basic: 3,648 (7.3%)
High Value: 8,116 (16.2%)
Premium+: 4,967 (9.9%)
Standard: 20,606 (41.2%)
VIP: 12,663 (25.3%)
10.4.1.3 Quick Decision Guide
Summary Table
| Operation Type | Pandas | NumPy | Recommendation |
|---|---|---|---|
| Arithmetic operations | ✅ Fast | ✅ Faster | Use Pandas (simpler) |
| Aggregations (sum, mean) | ✅ Fast | ✅ Faster | Use Pandas (metadata) |
| String operations | ✅ Built-in | ❌ Manual | Use Pandas |
| DateTime operations | ✅ Built-in | ❌ Manual | Use Pandas |
| Linear algebra | ❌ Limited | ✅ Comprehensive | Use NumPy |
| ML library input | ⚠️ Convert | ✅ Native | Convert to NumPy |
| Complex conditionals | ⚠️ Slow | ✅ Fast (np.select) | Use NumPy |
| Large datasets (> 10M rows) | ⚠️ Slower | ✅ Faster | Use NumPy |
Now let’s learn the actual conversion methods!
10.5 Conversion Methods: Pandas ↔︎ NumPy
Now that you know when to convert, let’s learn how to convert between pandas and NumPy.
10.5.1 The Two-Way Street
┌──────────────────────┐ ┌──────────────────────┐
│ Pandas DataFrame │ .to_numpy() │ NumPy Array │
│ │ ──────────────────→ │ │
│ ✅ Column names │ │ ❌ No labels │
│ ✅ Index labels │ │ ❌ No index │
│ ✅ Mixed dtypes │ │ ⚠️ Homogeneous │
│ ✅ Missing data │ ←────────────────── │ ⚠️ Limited NaN │
│ │ pd.DataFrame() │ │
└──────────────────────┘ └──────────────────────┘
Challenge: Converting a DataFrame to a NumPy array drops all metadata:
- ❌ Column names are lost
- ❌ Index labels are removed
- ❌ Data type information is simplified or lost
- ❌ Categorical mappings are gone
Why this matters: After performing NumPy computations, you often need to convert results back to a DataFrame with the original structure for analysis, visualization, or export.
Solution: Save & Restore Metadata Manually
10.5.2 Save & Restore Metadata Manually
The most straightforward approach: save metadata before conversion, then restore it after computation.
Code
print("🔄 Manual Metadata Preservation")
print("=" * 50)
# Create a DataFrame with rich metadata
df_original = pd.DataFrame(
    {
        "Temperature": [72.5, 68.3, 75.1, 71.9],
        "Humidity": [45, 52, 48, 50],
        "WindSpeed": [12.3, 8.7, 15.2, 10.1]
    },
    index=["Monday", "Tuesday", "Wednesday", "Thursday"]
)
print("Original DataFrame:")
print(df_original)
print(f"\nIndex: {df_original.index.tolist()}")
print(f"Columns: {df_original.columns.tolist()}")
print(f"Dtypes:\n{df_original.dtypes}")
# STEP 1: Save metadata BEFORE converting to NumPy
saved_index = df_original.index.copy()
saved_columns = df_original.columns.copy()
saved_dtypes = df_original.dtypes.to_dict()
print("\n✅ Metadata saved!")
# STEP 2: Convert to NumPy for computation
arr = df_original.to_numpy()
print(f"\nNumPy array (metadata lost):")
print(arr)
print(f"Shape: {arr.shape}, dtype: {arr.dtype}")
# STEP 3: Perform computations (example: normalize values)
# Subtract mean and divide by std for each column
arr_normalized = (arr - arr.mean(axis=0)) / arr.std(axis=0)
print(f"\nNormalized array (still no metadata):")
print(arr_normalized)
# STEP 4: Restore metadata when converting back to DataFrame
df_result = pd.DataFrame(
    arr_normalized,
    index=saved_index,
    columns=saved_columns
)
print(f"\n📊 Reconstructed DataFrame with metadata:")
print(df_result)
print(f"\nIndex restored: {df_result.index.tolist()}")
print(f"Columns restored: {df_result.columns.tolist()}")
print(f"Dtypes restored (may differ due to normalization):\n{df_result.dtypes}")🔄 STRATEGY 1: Manual Metadata Preservation
==================================================
Original DataFrame:
Temperature Humidity WindSpeed
Monday 72.5 45 12.3
Tuesday 68.3 52 8.7
Wednesday 75.1 48 15.2
Thursday 71.9 50 10.1
Index: ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
Columns: ['Temperature', 'Humidity', 'WindSpeed']
Dtypes:
Temperature float64
Humidity int64
WindSpeed float64
dtype: object
✅ Metadata saved!
NumPy array (metadata lost):
[[72.5 45. 12.3]
[68.3 52. 8.7]
[75.1 48. 15.2]
[71.9 50. 10.1]]
Shape: (4, 3), dtype: float64
Normalized array (still no metadata):
[[ 0.22667166 -1.45010473 0.29531936]
[-1.50427558 1.25675744 -1.171094 ]
[ 1.29821043 -0.29002095 1.47659679]
[-0.02060651 0.48336824 -0.60082214]]
📊 Reconstructed DataFrame with metadata:
Temperature Humidity WindSpeed
Monday 0.226672 -1.450105 0.295319
Tuesday -1.504276 1.256757 -1.171094
Wednesday 1.298210 -0.290021 1.476597
Thursday -0.020607 0.483368 -0.600822
Index restored: ['Monday', 'Tuesday', 'Wednesday', 'Thursday']
Columns restored: ['Temperature', 'Humidity', 'WindSpeed']
Dtypes restored (may differ due to normalization):
Temperature float64
Humidity float64
WindSpeed float64
dtype: object
10.5.3 Pandas DataFrame → NumPy Array
When to convert: You need NumPy’s computational speed and mathematical functions.
Common scenarios:
- Performance-critical computations (linear algebra, statistics)
- Integration with scientific computing libraries (SciPy, scikit-learn)
- Machine learning algorithms that expect NumPy arrays
- Mathematical operations not available in pandas
10.5.3.1 Conversion Methods & Performance
Let’s create a pandas DataFrame to demonstrate the conversion:
Code
print(" PANDAS TO NUMPY CONVERSION")
print("=" * 40)
# Start with a DataFrame of tech stock prices
stock_data = pd.DataFrame(
    {
        "AAPL": [150.0, 152.5, 148.2, 155.1],
        "GOOGL": [2800.0, 2750.0, 2825.0, 2900.0],
        "MSFT": [300.0, 305.0, 298.5, 310.0],
        "TSLA": [800.0, 795.0, 820.0, 815.0],
    },
    index=["Day 1", "Day 2", "Day 3", "Day 4"]
)
print("📈 Original DataFrame:")
print(stock_data)
print(f"\nShape: {stock_data.shape}")
print("Dtypes:")
print(stock_data.dtypes)
print(f"Memory usage (deep): {stock_data.memory_usage(deep=True).sum()} bytes")
PANDAS TO NUMPY CONVERSION
========================================
📈 Original DataFrame:
AAPL GOOGL MSFT TSLA
Day 1 150.0 2800.0 300.0 800.0
Day 2 152.5 2750.0 305.0 795.0
Day 3 148.2 2825.0 298.5 820.0
Day 4 155.1 2900.0 310.0 815.0
Shape: (4, 4)
Dtypes:
AAPL float64
GOOGL float64
MSFT float64
TSLA float64
dtype: object
Memory usage (deep): 344 bytes
10.5.3.1.1 Method 1: .to_numpy() (Recommended ✅)
The preferred modern approach for converting DataFrames to NumPy arrays. This method:
- Converts DataFrame data into a NumPy array
- Discards index and column labels (keeps only values)
- Preserves the same row and column order as the original DataFrame
- Returns a clean 2D array ready for numerical computations
Code
# Method 1: .to_numpy() - The modern, recommended approach
print("\n🔢 Method 1 — .to_numpy() (recommended):")
array_modern = stock_data.to_numpy()
print("Converted NumPy array:")
print(array_modern)
print(f"\nShape: {array_modern.shape}")
print(f"Data type: {array_modern.dtype}")
print(f"\n⚠️ Note: Row indices (Day 1, Day 2...) and column names (AAPL, GOOGL...) are lost!")
🔢 Method 1 — .to_numpy() (recommended):
Converted NumPy array:
[[ 150.  2800.   300.   800. ]
 [ 152.5 2750.   305.   795. ]
 [ 148.2 2825.   298.5  820. ]
 [ 155.1 2900.   310.   815. ]]
Shape: (4, 4)
Data type: float64
⚠️ Note: Row indices (Day 1, Day 2...) and column names (AAPL, GOOGL...) are lost!
Key Point: Notice how the resulting array contains only the numerical values. All the metadata (index labels like Day 1, column names like AAPL) is removed. This is what makes NumPy arrays fast but less descriptive than DataFrames.
10.5.3.1.2 Method 2: .values (legacy but still works)
An older method that still works but is not recommended for new code. It behaves similarly to .to_numpy() but can produce unexpected results with certain pandas dtypes.
Code
# Method 2: .values - Legacy approach (avoid in new code)
print("\n⚠️ Method 2 — .values (legacy):")
array_legacy = stock_data.values
print("Converted array using .values:")
print(array_legacy)
print(f"\nShape: {array_legacy.shape}")
print(f"Data type: {array_legacy.dtype}")
# Verify both methods produce same result (in this case)
print(f"\nSame result as .to_numpy()? {np.array_equal(array_modern, array_legacy)}")
print("✅ For simple numeric DataFrames, both methods work identically")
⚠️ Method 2 — .values (legacy):
Converted array using .values:
[[ 150.  2800.   300.   800. ]
 [ 152.5 2750.   305.   795. ]
 [ 148.2 2825.   298.5  820. ]
 [ 155.1 2900.   310.   815. ]]
Shape: (4, 4)
Data type: float64
Same result as .to_numpy()? True
✅ For simple numeric DataFrames, both methods work identically
Why avoid .values?
- Less clear intent (what does `values` mean?)
- Can behave unpredictably with pandas extension dtypes (nullable integers, strings, etc.)
- Not officially recommended in pandas documentation
- `.to_numpy()` is more explicit and future-proof
10.5.3.1.3 Method 3: Force a specific dtype (e.g., int32)
Code
# Method 3: Force a specific dtype during conversion
print("\n🔧 Method 3 — Force data type (int32):")
array_int = stock_data.to_numpy(dtype=np.int32)
print("Array with forced int32 dtype:")
print(array_int)
print(f"\nOriginal dtype: {stock_data.to_numpy().dtype}")
print(f"Forced dtype: {array_int.dtype}")
print(f"\n⚠️ Warning: Float values were truncated (not rounded) to integers!")
print(f"Example: 150.0 → {array_int[0, 0]}, but 152.5 → {array_int[1, 0]} (loses 0.5)")
🔧 Method 3 — Force data type (int32):
Array with forced int32 dtype:
[[ 150 2800  300  800]
 [ 152 2750  305  795]
 [ 148 2825  298  820]
 [ 155 2900  310  815]]
Original dtype: float64
Forced dtype: int32
⚠️ Warning: Float values were truncated (not rounded) to integers!
Example: 150.0 → 150, but 152.5 → 152 (loses 0.5)
When to use the dtype parameter:
✅ Good use cases:
- Ensuring consistent data types for mathematical operations
- Reducing memory usage (e.g., `float64` → `float32`)
- Preparing data for ML libraries that require specific types
- Converting to integer types when you know there are no decimals
❌ Be careful:
- Data loss when converting float to int (truncation, not rounding)
- May raise errors if conversion is impossible (e.g., strings to numbers)
- Check for NaN values before converting to integer types (see the sketch below)
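A small sketch of both pitfalls, using throwaway frames (the variable names here are illustrative only):
Code
# Floats are truncated toward zero, not rounded
df_float = pd.DataFrame({"x": [1.9, -1.9, 2.5]})
print(df_float.to_numpy(dtype=np.int32).ravel())  # [ 1 -1  2]

# NaN has no integer representation, so handle it before casting
df_nan = pd.DataFrame({"x": [1.0, np.nan, 3.0]})
if df_nan["x"].isna().any():
    df_nan = df_nan.fillna(0)  # or dropna(), depending on the analysis
print(df_nan.to_numpy(dtype=np.int32).ravel())    # [1 0 3]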
10.5.3.1.4 Method 4: Convert specific columns only
Why select specific columns?
- Performance: Process only needed data
- Memory: Smaller arrays use less RAM
- Clarity: Makes your intent explicit
- Debugging: Easier to track which data is being used
Code
# Method 4: Select and convert specific columns
print("\n🎯 Method 4 — Specific columns only (AAPL, MSFT):")
# Select only AAPL and MSFT columns, then convert
tech_stocks = stock_data[["AAPL", "MSFT"]].to_numpy()
print("Selected columns converted to array:")
print(tech_stocks)
print(f"\nOriginal DataFrame shape: {stock_data.shape} (4 rows × 4 columns)")
print(f"Selected array shape: {tech_stocks.shape} (4 rows × 2 columns)")
print(f"Data type: {tech_stocks.dtype}")
print(f"\n💡 Memory saved: {(stock_data.shape[1] - tech_stocks.shape[1]) / stock_data.shape[1] * 100:.0f}% fewer columns!")
🎯 Method 4 — Specific columns only (AAPL, MSFT):
Selected columns converted to array:
[[150.  300. ]
 [152.5 305. ]
 [148.2 298.5]
 [155.1 310. ]]
Original DataFrame shape: (4, 4) (4 rows × 4 columns)
Selected array shape: (4, 2) (4 rows × 2 columns)
Data type: float64
💡 Memory saved: 50% fewer columns!
Summary table

| Method | Recommendation | Pros | Cons |
|---|---|---|---|
| .to_numpy() | ✅ Use by default | Clear intent, stable API, future-proof | May copy data depending on dtypes/backing storage |
| .to_numpy(dtype=...) | ✅ When you need a dtype | Explicit, consistent numeric types; easier math | Casting may be lossy or fail; extra param to manage |
| .values | ⚠️ Legacy / avoid | Works on old pandas; quick | Ambiguous with mixed/extension dtypes; not future-proof |

Best practices
- Convert only needed columns to reduce memory and copying.
- Specify `dtype` when downstream computations require consistent numeric types.
- Check for missing values (`NaN`, `NaT`) before casting to integer dtypes.
- Prefer `.to_numpy()` over `.values` for clarity and forward compatibility.
10.5.4 NumPy Array → Pandas DataFrame
Let’s create a NumPy array to demonstrate the conversion:
Code
# 📊 COMPREHENSIVE ARRAY → DATAFRAME CONVERSION
print("🔄 NUMPY TO PANDAS CONVERSION")
print("=" * 40)
# Original NumPy array
scores = np.array([
    [85, 92, 78, 95],
    [88, 76, 91, 82],
    [95, 89, 84, 90]
])
students = ['Alice', 'Bob', 'Carol']
subjects = ['Math', 'Science', 'English', 'History']
print("Original NumPy array:")
print(scores)
print(f"Shape: {scores.shape}")
🔄 NUMPY TO PANDAS CONVERSION
========================================
Original NumPy array:
[[85 92 78 95]
[88 76 91 82]
[95 89 84 90]]
Shape: (3, 4)
10.5.4.1 Method 1: Basic Conversion
Create a DataFrame directly from a NumPy array using pd.DataFrame(). By default, pandas will auto-generate integer labels (0, 1, 2, …) for both rows and columns.
Code
# Basic conversion - auto-generated index and column labels
df_basic = pd.DataFrame(scores)
print("\n📋 Basic DataFrame (auto index/columns):")
df_basic
📋 Basic DataFrame (auto index/columns):
|  | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | 85 | 92 | 78 | 95 |
| 1 | 88 | 76 | 91 | 82 |
| 2 | 95 | 89 | 84 | 90 |
Notice that the row indices are [0, 1, 2] and column names are [0, 1, 2, 3] — these are automatically generated when no labels are specified.
10.5.4.2 Method 2: Custom Row and Column Labels
For better readability and data analysis, you can specify meaningful names for rows (index) and columns when creating the DataFrame.
Code
# Custom labels for better data interpretation
df_labeled = pd.DataFrame(scores, index=students, columns=subjects)
print("\n📊 Labeled DataFrame (custom index & columns):")
df_labeled
📊 Labeled DataFrame (custom index & columns):
|  | Math | Science | English | History |
|---|---|---|---|---|
| Alice | 85 | 92 | 78 | 95 |
| Bob | 88 | 76 | 91 | 82 |
| Carol | 95 | 89 | 84 | 90 |
Now the DataFrame has meaningful labels:
- Row indices: Student names (Alice, Bob, Carol)
- Column names: Subject names (Math, Science, English, History)
This makes the data much more interpretable and easier to work with!
10.5.4.3 Key Parameters & Options
When converting NumPy arrays to DataFrames, you can customize the output using these parameters:
| Parameter | Purpose | Example | Default |
|---|---|---|---|
| data | NumPy array to convert | np.array([[1,2],[3,4]]) | Required |
| columns | Column labels | ['A', 'B'] | [0, 1, 2, ...] |
| index | Row labels | ['row1', 'row2'] | [0, 1, 2, ...] |
💡 Best Practices:
- ✅ Always specify meaningful column names for better code readability and maintenance
- ✅ Use descriptive index names when your rows represent specific entities (people, dates, etc.)
- ✅ Check data types and the index after conversion with `df.dtypes` to ensure correct interpretation, as sketched below
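For instance, a quick post-conversion check (a minimal sketch reusing the scores, students, and subjects variables from above):
Code
df_check = pd.DataFrame(scores, index=students, columns=subjects)
print(df_check.dtypes)   # each subject column should be an integer dtype
print(df_check.index)    # confirms the student names were applied
assert df_check.shape == scores.shape  # conversion preserved the layout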
10.6 Random Number Generation in NumPy
Random number generation is a fundamental tool in data science, used for:
- Simulations: Monte Carlo methods, risk analysis, game theory
- Statistical Sampling: Bootstrap, cross-validation, hypothesis testing
- Machine Learning: Data augmentation, weight initialization, dropout
- Scientific Computing: Stochastic modeling, numerical experiments
NumPy’s random module provides fast, vectorized random number generation that’s orders of magnitude faster than Python’s built-in random module.
10.6.1 Why NumPy Random Over Python Random?
| Aspect | Python random | NumPy random |
|---|---|---|
| Speed | Slow (one at a time) | Fast (vectorized) |
| Output | Single values | Arrays of any shape |
| Distributions | Limited | 40+ distributions |
| Use Case | Simple scripts | Scientific computing |
| Performance | ~0.1M numbers/sec | ~10M+ numbers/sec |
💡 Rule: Use NumPy random for data science; use Python random for simple scripts.
10.6.2 Core Random Number Functions
NumPy provides functions for various probability distributions. Let’s explore the most commonly used ones with practical examples.
10.6.2.1 📊 Comparison Table: Which Function to Use?
| Function | Distribution | Parameters | Use Case |
|---|---|---|---|
| rand() | Uniform [0, 1) | shape | Quick random arrays, probabilities |
| randn() | Normal (μ=0, σ=1) | shape | Standard normal samples |
| randint() | Discrete uniform | low, high, size | Random integers, sampling |
| choice() | Sample from array | array, size, replace | Random selection, bootstrapping |
| uniform() | Uniform [a, b) | low, high, size | Custom range uniform |
| normal() | Normal (μ, σ) | mean, std, size | Real-world measurements |
| binomial() | Binomial | n, p, size | Coin flips, success/failure |
| poisson() | Poisson | lambda, size | Event counts, arrivals |
Let’s see each in action with real examples!
10.6.3 Uniform Distribution: np.random.rand() and np.random.uniform()
Uniform distribution: All values in a range are equally likely.
10.6.3.1 np.random.rand() - Quick Uniform [0, 1)
Code
print("📊 UNIFORM DISTRIBUTION: np.random.rand()")
print("=" * 60)
# Generate a 2x3 array of random values between 0 and 1
rand_array = np.random.rand(2, 3)
print("2×3 array of uniform random numbers [0, 1):")
print(rand_array)
print(f"\nShape: {rand_array.shape}")
print(f"Min: {rand_array.min():.4f}, Max: {rand_array.max():.4f}")
print(f"Mean: {rand_array.mean():.4f} (expected ~0.5)")
# Practical use: Generate random probabilities
print("\n💡 Practical Use: Simulate coin flip probabilities")
probabilities = np.random.rand(10)
results = ['Heads' if p > 0.5 else 'Tails' for p in probabilities]
print(f"Probabilities: {probabilities.round(3)}")
print(f"Results: {results}")10.6.3.2 np.random.uniform() - Custom Range Uniform [a, b]
More flexible - specify your own range.
Code
print("\n📊 UNIFORM DISTRIBUTION: np.random.uniform()")
print("=" * 60)
# Generate random numbers in a custom range
print("Example 1: Random temperatures between 60°F and 90°F")
temperatures = np.random.uniform(60, 90, size=7)
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
for day, temp in zip(days, temperatures):
print(f" {day}: {temp:.1f}°F")
# 2D array example
print("\nExample 2: 3×5 matrix of random prices between $10 and $100")
prices = np.random.uniform(10, 100, size=(3, 5))
print(prices.round(2))
print(f"\nAverage price: ${prices.mean():.2f}")[[ 0.5376299 0.42818625 -0.2799933 ]
[-1.27903074 0.51529432 -0.83428181]
[ 2.18409673 0.57050721 -0.5806483 ]]
10.6.4 Normal (Gaussian) Distribution: np.random.randn() and np.random.normal()
Normal distribution: Bell-shaped curve, most values cluster around the mean.
Used for: heights, weights, test scores, measurement errors, etc.
10.6.4.1 np.random.randn() - Standard Normal (μ=0, σ=1)
Code
print("📊 NORMAL DISTRIBUTION: np.random.randn()")
print("=" * 60)
# Generate standard normal distribution
normal_array = np.random.randn(1000)
print(f"Generated {len(normal_array)} samples from standard normal distribution")
print(f"Mean: {normal_array.mean():.4f} (expected: 0)")
print(f"Std Dev: {normal_array.std():.4f} (expected: 1)")
print(f"Min: {normal_array.min():.2f}, Max: {normal_array.max():.2f}")
# Show distribution statistics
print("\n📈 Distribution check (68-95-99.7 rule):")
within_1std = np.sum(np.abs(normal_array) <= 1) / len(normal_array) * 100
within_2std = np.sum(np.abs(normal_array) <= 2) / len(normal_array) * 100
within_3std = np.sum(np.abs(normal_array) <= 3) / len(normal_array) * 100
print(f" Within ±1σ: {within_1std:.1f}% (expected: ~68%)")
print(f" Within ±2σ: {within_2std:.1f}% (expected: ~95%)")
print(f" Within ±3σ: {within_3std:.1f}% (expected: ~99.7%)")10.6.4.2 np.random.normal() - Custom Normal (μ, σ)
Specify your own mean and standard deviation.
Code
print("\n📊 NORMAL DISTRIBUTION: np.random.normal()")
print("=" * 60)
# Example 1: Student exam scores (mean=75, std=10)
print("Example 1: Simulate exam scores (μ=75, σ=10)")
exam_scores = np.random.normal(75, 10, size=20)
print(f"Scores: {exam_scores.round(1)}")
print(f"Class average: {exam_scores.mean():.1f}")
print(f"Passing (≥60): {np.sum(exam_scores >= 60)}/{len(exam_scores)}")
# Example 2: Heights in inches (mean=68, std=3)
print("\nExample 2: Adult heights in inches (μ=68\", σ=3\")")
heights = np.random.normal(68, 3, size=100)
print(f"Generated {len(heights)} height measurements")
print(f"Average height: {heights.mean():.2f}\"")
print(f"Tallest: {heights.max():.1f}\", Shortest: {heights.min():.1f}\"")
print(f"Between 65\" and 71\": {np.sum((heights >= 65) & (heights <= 71))}")10.6.5 Integer Random Numbers: np.random.randint()
Generate random integers within a specific range.
Use cases: Dice rolls, random IDs, sampling indices, game mechanics
Code
print("🎲 RANDOM INTEGERS: np.random.randint()")
print("=" * 60)
# Example 1: Dice rolls
print("Example 1: Roll a six-sided die 20 times")
dice_rolls = np.random.randint(1, 7, size=20) # 1 to 6 inclusive
print(f"Rolls: {dice_rolls}")
print(f"Distribution: {dict(zip(*np.unique(dice_rolls, return_counts=True)))}")
# Example 2: Random matrix of integers
print("\nExample 2: 4×4 matrix of random integers [10, 20)")
int_matrix = np.random.randint(10, 20, size=(4, 4))
print(int_matrix)
print(f"Sum: {int_matrix.sum()}, Average: {int_matrix.mean():.2f}")
# Example 3: Random customer IDs
print("\nExample 3: Generate 10 random customer IDs [1000, 9999]")
customer_ids = np.random.randint(1000, 10000, size=10)
print(f"IDs: {customer_ids}")10.6.6 Random Selection: np.random.choice()
Randomly select elements from an array with or without replacement.
Use cases: Bootstrapping, random sampling, A/B testing, lottery
Code
print("🎯 RANDOM SELECTION: np.random.choice()")
print("=" * 60)
# Example 1: Random selection WITH replacement
print("Example 1: Select 5 numbers WITH replacement [1-5]")
choice_array = np.random.choice([1, 2, 3, 4, 5], size=10, replace=True)
print(f"Selected: {choice_array}")
print(f"Notice: Numbers can repeat!")
# Example 2: Random selection WITHOUT replacement
print("\nExample 2: Select 3 winners from 10 contestants (no duplicates)")
contestants = np.array(['Alice', 'Bob', 'Carol', 'David', 'Eve',
                        'Frank', 'Grace', 'Henry', 'Iris', 'Jack'])
winners = np.random.choice(contestants, size=3, replace=False)
print(f"Winners: {winners}")
# Example 3: Weighted random choice (probabilities)
print("\nExample 3: Biased coin flip (70% Heads, 30% Tails)")
outcomes = ['Heads', 'Tails']
probabilities = [0.7, 0.3]
flips = np.random.choice(outcomes, size=100, p=probabilities)
unique, counts = np.unique(flips, return_counts=True)
for outcome, count in zip(unique, counts):
print(f" {outcome}: {count}/100 ({count}%)")
# Example 4: Bootstrap sampling
print("\nExample 4: Bootstrap sampling (sampling with replacement)")
data = np.array([23, 45, 67, 34, 89, 12, 56])
bootstrap_sample = np.random.choice(data, size=len(data), replace=True)
print(f"Original: {data}")
print(f"Bootstrap: {bootstrap_sample}")
print(f"Bootstrap mean: {bootstrap_sample.mean():.2f}, Original mean: {data.mean():.2f}")10.6.7 Other Useful Distributions
NumPy provides many more distributions for specialized use cases.
Code
print("📊 OTHER USEFUL DISTRIBUTIONS")
print("=" * 60)
# 1. Binomial: Number of successes in n trials
print("1. BINOMIAL: Coin flips (10 flips, 50% probability)")
coin_flips = np.random.binomial(n=10, p=0.5, size=1000)
print(f" Average heads in 10 flips: {coin_flips.mean():.2f} (expected: 5)")
# 2. Poisson: Count of events in fixed interval
print("\n2. POISSON: Customer arrivals (λ=3 per hour)")
arrivals = np.random.poisson(lam=3, size=24) # 24 hours
print(f" Arrivals per hour: {arrivals}")
print(f" Total customers: {arrivals.sum()}, Average: {arrivals.mean():.2f}")
# 3. Exponential: Time between events
print("\n3. EXPONENTIAL: Time between customer arrivals (scale=5 min)")
wait_times = np.random.exponential(scale=5, size=10)
print(f" Wait times (minutes): {wait_times.round(1)}")
print(f" Average wait: {wait_times.mean():.2f} min")
# 4. Beta: Probabilities and proportions
print("\n4. BETA: Probability distribution (α=2, β=5)")
probabilities = np.random.beta(2, 5, size=1000)
print(f" Mean probability: {probabilities.mean():.3f}")
print(f" Range: [{probabilities.min():.3f}, {probabilities.max():.3f}]")10.6.8 Reproducibility: Random Seeds
Problem: Random numbers are different every time you run your code!
Solution: Set a seed for reproducible results.
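A minimal sketch (the exact numbers depend on the NumPy version, but the two runs will always match each other; newer code often uses np.random.default_rng(seed) instead, while the legacy np.random.seed() shown here matches the functions used in this chapter):
Code
# Setting the same seed before each run reproduces the same "random" numbers
np.random.seed(42)
first_run = np.random.rand(3)
np.random.seed(42)
second_run = np.random.rand(3)
print(first_run)
print(second_run)
print(f"Identical? {np.array_equal(first_run, second_run)}")  # True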
10.6.9 Performance: NumPy vs Python Random
Let’s prove that NumPy random is dramatically faster than Python’s built-in random.
Code
import random as py_random
print(" PERFORMANCE COMPARISON: NumPy vs Python random")
print("=" * 60)
n_numbers = 1_000_000
# Python random (slow)
print(f"\n Python random module (generating {n_numbers:,} numbers):")
start_time = time.time()
python_randoms = [py_random.random() for _ in range(n_numbers)]
python_time = time.time() - start_time
print(f" Time: {python_time:.4f} seconds")
# NumPy random (fast)
print(f"\n NumPy random module (generating {n_numbers:,} numbers):")
start_time = time.time()
numpy_randoms = np.random.rand(n_numbers)
numpy_time = time.time() - start_time
print(f" Time: {numpy_time:.4f} seconds")
# Comparison
speedup = python_time / numpy_time if numpy_time > 0 else float('inf')
print(f"\n RESULT:")
print(f" NumPy is {speedup:.1f}x faster!")
print(f" Time saved: {(python_time - numpy_time):.4f} seconds ({(python_time - numpy_time)/python_time*100:.1f}%)")
print("\n Key Takeaway:")
print(" Use NumPy random for any serious data science work!") PERFORMANCE COMPARISON: NumPy vs Python random
============================================================
Python random module (generating 1,000,000 numbers):
Time: 0.1046 seconds
NumPy random module (generating 1,000,000 numbers):
Time: 0.0061 seconds
RESULT:
NumPy is 17.3x faster!
Time saved: 0.0986 seconds (94.2%)
Key Takeaway:
Use NumPy random for any serious data science work!
10.6.10 Quick Reference Guide
# Uniform [0, 1)
np.random.rand(5) # 1D array of 5 numbers
np.random.rand(3, 4) # 3×4 matrix
# Uniform [a, b]
np.random.uniform(10, 20, size=10) # 10 numbers between 10 and 20
# Normal (Gaussian)
np.random.randn(100) # Standard normal (μ=0, σ=1)
np.random.normal(50, 10, size=100) # Custom μ=50, σ=10
# Integers
np.random.randint(1, 100, size=50) # 50 random integers [1, 100)
# Random selection
np.random.choice([1,2,3,4,5], size=3, replace=False) # No duplicates
np.random.choice(['A','B','C'], size=100, p=[0.5, 0.3, 0.2]) # Weighted
# Shuffling
arr = np.array([1, 2, 3, 4, 5])
np.random.shuffle(arr) # Shuffles in-place
# Permutation
np.random.permutation(10) # Random permutation of 0-9
10.7 Independent Study
This is where you apply everything you’ve learned about NumPy vectorization to solve real-world problems. Each exercise progressively builds your skills in:
- Vectorized computations for performance optimization
- Matrix multiplication for multi-dimensional operations
- Random number generation for simulations
- Statistical methods like bootstrapping
10.7.1 Practice Exercise 1: Shopping Optimization with Vectorization
📋 Problem Statement: Three shoppers (Ben, Barbara, and Beth) need to buy groceries (rolls, buns, cakes, and bread). Two stores (Target and Kroger) have different prices. Which store should each person choose to minimize their total cost?
📂 Data Files:
- food_quantity.csv: How many of each item each person needs
- price.csv: Price of each item at each store
Strategy: We'll solve this in 3 progressive steps to understand the power of vectorization:
1. Step 1: Calculate Ben's cost at Target (simplest)
2. Step 2: Calculate Ben's cost at both stores (intermediate)
3. Step 3: Calculate everyone's cost at all stores (complete solution)
10.7.1.1 Step 1: Calculate Ben’s Cost at Target (Simplest Case)
Goal: Find how much Ben will spend if he buys everything at Target.
✅ Key Insight: Ben will spend $50 at Target. Vectorized operations are cleaner and faster!
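One possible sketch of Step 1, assuming food_quantity.csv stores people as rows and items as columns, and price.csv stores items as rows and stores as columns (the actual files may be organized differently):
Code
# Hypothetical layout: rows = people, columns = items (rolls, buns, cakes, bread)
quantity = pd.read_csv("food_quantity.csv", index_col=0)
# Hypothetical layout: rows = items, columns = stores (Target, Kroger)
price = pd.read_csv("price.csv", index_col=0)

# Ben's cost at Target: one vectorized dot product instead of an item-by-item loop
ben_target = np.dot(quantity.loc["Ben"].to_numpy(), price["Target"].to_numpy())
print(f"Ben's total at Target: ${ben_target:.2f}")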
10.7.1.2 Step 2: Calculate Ben’s Cost at BOTH Stores (Intermediate)
Goal: Find Ben’s cost at Target AND Kroger to determine which is cheaper.
✅ Key Insight: Ben spends $50 at Target vs $49 at Kroger → Kroger saves $1!
10.7.1.3 Step 3: Calculate EVERYONE’S Cost at ALL Stores (Complete Solution)
Goal: Find the best store for Ben, Barbara, AND Beth.
Method 1: Using pandas DataFrame nested loops
Method 2: Using NumPy matrix multiplication (sketched below)
This is where vectorization truly shines! 🌟
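Under the same assumed file layout as the Step 1 sketch, Method 2 reduces to one matrix multiplication: a (3 people × 4 items) quantity matrix times a (4 items × 2 stores) price matrix yields a (3 × 2) cost matrix:
Code
# (3 people × 4 items) @ (4 items × 2 stores) = (3 people × 2 stores)
costs = quantity.to_numpy() @ price.to_numpy()
cost_df = pd.DataFrame(costs, index=quantity.index, columns=price.columns)
print(cost_df)
# Best store per person: the column with the minimum cost in each row
print(cost_df.idxmin(axis=1))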
10.7.2 Practice Exercise 2: Movie Rating Analysis with Matrix Multiplication
Problem Statement:
You have a dataset of movies with:
- Ratings: IMDB Rating and Rotten Tomatoes Rating
- Genres: Comedy, Action, Drama, Horror (binary flags: 1 = movie is in genre, 0 = not)
📂 Dataset: movies_cleaned.csv
Questions to Answer:
1. What is the average IMDB rating for each genre?
2. What is the average Rotten Tomatoes rating for each genre?
3. Which genre is most preferred by IMDB users?
4. Which genre is least preferred by Rotten Tomatoes critics?
10.7.2.1 Step 0: Load movies dataset
10.7.2.2 Step 1: Create Rating Matrix (N movies × 2 ratings)
10.7.2.3 Step 2: Create Genre Matrix (N movies × 4 genres)
10.7.2.4 Step 3: Matrix Multiplication for Total Ratings
Goal: Find total IMDB and Rotten Tomatoes ratings for each genre.
Matrix Operation: Ratings.T @ Genres
⚠️ Dimension Check:
- Ratings matrix: (N × 2)
- Genres matrix: (N × 4)
- For multiplication, we need: (2 × N) @ (N × 4) = (2 × 4) ✅
- Solution: Transpose the ratings matrix!
10.7.2.5 Step 4: Count Movies per Genre
To get averages, we need to divide total ratings by the number of movies in each genre.
10.7.2.6 Step 5: Compute the average rating per Genre
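A minimal sketch of Steps 1-5, assuming the column names shown in the dataset preview at the end of this chapter (IMDB Rating, Rotten Tomatoes Rating, and the binary genre flags comedy, Action, drama, horror) and no missing ratings:
Code
movies = pd.read_csv("movies_cleaned.csv")

ratings = movies[["IMDB Rating", "Rotten Tomatoes Rating"]].to_numpy()  # (N × 2)
genres = movies[["comedy", "Action", "drama", "horror"]].to_numpy()     # (N × 4)

totals = ratings.T @ genres   # (2 × 4): total rating per genre
counts = genres.sum(axis=0)   # movies per genre
averages = totals / counts    # broadcasting divides each genre column by its count

avg_df = pd.DataFrame(averages,
                      index=["IMDB", "Rotten Tomatoes"],
                      columns=["comedy", "Action", "drama", "horror"])
print(avg_df.round(2))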
Key Findings:
✅ IMDB users prefer DRAMA (highest average rating), and are least amused by COMEDY movies on average.
✅ Rotten Tomatoes critics also prefer DRAMA over HORROR movies on average.
10.7.3 Practice Exercise 3: Simulation Study with Random Number Generation
📋 Problem Statement:
Two food carts serve 500 customers each, every day for 30 days. The waiting times follow different distributions:
- Food Cart 1: Uniform distribution [5, 25] minutes (unpredictable service)
- Food Cart 2: Normal distribution with μ=8 min, σ=3 min (consistent service)
Assumptions:
- Waiting times are measured simultaneously (paired observations)
- Same 500 people visit daily over 30 days
Questions to Answer:
- On how many days is the average waiting time at Food Cart 2 higher than Food Cart 1?
- What percentage of individual waiting times at Food Cart 2 exceed Food Cart 1?
- How much faster is vectorized random generation vs loops?
Strategy:
- Simulation size: 500 people × 30 days = 15,000 observations per cart
- Method 1: Nested loops (slow but explicit)
- Method 2: Vectorized NumPy (fast and elegant)
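A minimal sketch of Method 2 (the vectorized approach); one call per cart generates all 30 × 500 waiting times at once:
Code
np.random.seed(42)  # reproducibility (arbitrary seed)
days, customers = 30, 500

# Food Cart 1: uniform waiting times between 5 and 25 minutes
cart1 = np.random.uniform(5, 25, size=(days, customers))
# Food Cart 2: normal waiting times with mean 8 and std 3 minutes
cart2 = np.random.normal(8, 3, size=(days, customers))

# Q1: days where Cart 2's daily average exceeds Cart 1's
slower_days = cart2.mean(axis=1) > cart1.mean(axis=1)
print(f"Days Cart 2 is slower on average: {slower_days.sum()} / {days}")

# Q2: percentage of paired individual waits where Cart 2 exceeds Cart 1
pct = (cart2 > cart1).mean() * 100
print(f"Individual waits where Cart 2 is slower: {pct:.1f}%")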
10.7.4 Practice Exercise 4: Bootstrapping for Confidence Intervals
Problem Statement:
Find the 95% confidence interval for the mean profit of Action movies using bootstrapping.
What is Bootstrapping?
Bootstrapping is a non-parametric statistical method that estimates the sampling distribution of a statistic by resampling from the observed data. It's used when:
- Sample size is small
- The distribution is unknown or non-normal
- Theoretical formulas are complex or unavailable
Bootstrap Algorithm:
1. Extract profit data for all Action movies (N movies)
2. Resample N values WITH replacement from the profit data
3. Calculate the mean of the resampled data
4. Repeat steps 2-3 M=1000 times
5. Find the 2.5th and 97.5th percentiles of the 1000 means
Result: [2.5th percentile, 97.5th percentile] = 95% Confidence Interval
Dataset: movies_cleaned.csv
Hints: Use np.random.choice() for sampling with replacement
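A minimal sketch of the algorithm, assuming profit is worldwide gross minus production budget (column names as in the preview below):
Code
np.random.seed(42)  # reproducibility (arbitrary seed)
movies = pd.read_csv("movies_cleaned.csv")

# Profit (assumed definition): worldwide gross minus production budget
action = movies[movies["Action"] == 1]
profits = (action["Worldwide Gross"] - action["Production Budget"]).to_numpy()

# Bootstrap: M resampled means, each from N draws with replacement
M, N = 1000, len(profits)
boot_means = np.array([
    np.random.choice(profits, size=N, replace=True).mean()
    for _ in range(M)
])

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% CI for mean Action-movie profit: "
      f"[${ci_low/1e6:.1f} million, ${ci_high/1e6:.1f} million]")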
======================================================================
🎬 BOOTSTRAPPING: 95% CI for Action Movie Profits
======================================================================
|  | Title | IMDB Rating | Rotten Tomatoes Rating | Running Time min | Release Date | US Gross | Worldwide Gross | Production Budget | comedy | Action | drama | horror |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Broken Arrow | 5.8 | 55 | 108 | Feb 09 1996 | 70645997 | 148345997 | 65000000 | 0 | 1 | 0 | 0 |
| 1 | Brazil | 8.0 | 98 | 136 | Dec 18 1985 | 9929135 | 9929135 | 15000000 | 1 | 0 | 0 | 0 |
| 2 | The Cable Guy | 5.8 | 52 | 95 | Jun 14 1996 | 60240295 | 102825796 | 47000000 | 1 | 0 | 0 | 0 |
| 3 | Chain Reaction | 5.2 | 13 | 106 | Aug 02 1996 | 21226204 | 60209334 | 55000000 | 0 | 1 | 0 | 0 |
| 4 | Clash of the Titans | 5.9 | 65 | 108 | Jun 12 1981 | 30000000 | 30000000 | 15000000 | 0 | 1 | 0 | 0 |
Confidence interval = [$133.5 million, $181.08 million]