NumPy is a foundational library in Python, providing support for large, multi-dimensional arrays and matrices, along with a variety of mathematical functions. It’s a critical tool in data science and machine learning because it enables efficient numerical computations, data manipulation, and linear algebra operations. Many data science packcages and machine learning algorithms rely on these operations to process data and perform complex calculations quickly. Moreover, popular libraries like Pandas, SciPy, and TensorFlow are built on top of NumPy, making it essential to understand for implementing and optimizing machine learning models.
9.1 Learning Objectives
By the end of this tutorial, you will be able to:
Create NumPy arrays using various functions (np.array(), np.zeros(), np.arange(), etc.)
Interpret array attributes (shape, dtype, size, ndim) and understand their roles in computation
Apply indexing and slicing to extract and modify array elements
Use conditional selection for advanced element filtering
Reshape and concatenate arrays efficiently for flexible data manipulation
9.2 Getting Started with NumPy
Code
# Essential imports for numerical computingimport numpy as npimport pandas as pdimport timeimport sys# Display NumPy version and configurationprint(f"NumPy version: {np.__version__}")print(f"NumPy is installed at: {np.__file__}")print("\n✅ Ready to explore the world of numerical computing!")
NumPy version: 1.26.4
NumPy is installed at: c:\Users\lsi8012\AppData\Local\anaconda3\Lib\site-packages\numpy\__init__.py
✅ Ready to explore the world of numerical computing!
If you encounter a ModuleNotFoundError, install NumPy using:
pip install numpy
9.3 Why NumPy?
Think of NumPy arrays as an upgrade to Python’s built-in lists and tuples.
You can store data in them and perform the same kinds of computations—but with a big advantage:
Memory Savings: Arrays use a fixed, compact layout in memory, reducing overhead.
Efficiency: NumPy is optimized at the C level, making it far faster than pure Python structures.
Scalability: Designed to handle large datasets smoothly without slowing down.
👉 This is why NumPy is the backbone of scientific computing and data science in Python.
9.3.1 NumPy Arrays are Memory Efficient: Homogeneity & Contiguous Storage
A NumPy array stores elements of the same data type in contiguous memory locations.
This contrasts with Python lists, which can hold mixed data types and store elements in scattered memory locations.
Because of this:
NumPy arrays are densely packed, minimizing memory overhead.
Operations can be executed much faster, since data is laid out predictably in memory.
👉 The result: lower memory consumption and higher performance when working with large datasets.
The following example compares the memory usage of NumPy arrays with Python lists.
Code
import sys# Create a NumPy array, Python list, and tuple with the same elementsarray = np.arange(1000)py_list =list(range(1000))py_tuple =tuple(range(1000))# Calculate memory usagearray_memory = array.nbyteslist_memory = sys.getsizeof(py_list) +sum(sys.getsizeof(item) for item in py_list)tuple_memory = sys.getsizeof(py_tuple) +sum(sys.getsizeof(item) for item in py_tuple)# Display the memory usagememory_usage = {"NumPy Array (in bytes)": array_memory,"Python List (in bytes)": list_memory,"Python Tuple (in bytes)": tuple_memory}memory_usage
{'NumPy Array (in bytes)': 4000,
'Python List (in bytes)': 36056,
'Python Tuple (in bytes)': 36040}
Code
# each element in the array is a 64-bit integerarray.dtype
dtype('int32')
9.3.2 NumPy arrays are fast
NumPy arrays enable mathematical computations to run much faster than with native Python data structures. This speed advantage comes from three key factors:
Densely Packed & Homogeneous Data
Since NumPy arrays store elements of the same type in contiguous memory, data retrieval is faster and computations can be executed more efficiently.
Vectorized Computations
NumPy replaces slow Python for loops with vectorized operations, which are internally broken into optimized fragments and executed in parallel. This means calculations are applied to entire arrays at once, rather than element by element.
Integration with C/C++
Under the hood, NumPy is implemented in C and C++, which are compiled languages with very low execution overhead compared to Python. This gives NumPy both the speed of C and the usability of Python.
We’ll see the faster speed on NumPy computations in the example below.
Code
def my_dot(a, b): """ Compute the dot product of two vectors Args: a (ndarray (n,)): input vector b (ndarray (n,)): input vector with same dimension as a Returns: x (scalar): """ x=0for i inrange(a.shape[0]): x = x + a[i] * b[i]return x
Code
np.random.seed(1)a = np.random.rand(10000000) # very large arraysb = np.random.rand(10000000)tic = time.time() # capture start timec = np.dot(a, b)toc = time.time() # capture end timeprint(f"np.dot(a, b) = {c:.4f}")print(f"Vectorized version duration: {1000*(toc-tic):.4f} ms ")tic = time.time() # capture start timec = my_dot(a,b)toc = time.time() # capture end timeprint(f"my_dot(a, b) = {c:.4f}")print(f"loop version duration: {1000*(toc-tic):.4f} ms ")del(a);del(b)
np.dot(a, b) = 2501072.5817
Vectorized version duration: 34.6291 ms
my_dot(a, b) = 2501072.5817
loop version duration: 1979.4111 ms
✅ Both methods produce the same result, but the vectorized np.dot() executes almost instantly, while the manual Python loop takes over a second.
👉 This clearly demonstrates how NumPy’s vectorization and C backend make computations orders of magnitude faster than pure Python loops.
9.4 Building Blocks: NumPy Arrays Fundamentals
9.4.1 Array Creation: Your Complete Toolkit
NumPy offers multiple ways to create arrays, each optimized for different scenarios.
9.4.1.1 From Existing Data
Method
Purpose
Example Use Case
np.array()
Convert lists/tuples to arrays
Transform Python data structures
df.to_numpy()
Convert a Pandas DataFrame
Bridge between Pandas and NumPy
Code
# 1️⃣ Creating arrays from existing dataprint("1️⃣ FROM EXISTING DATA")print("="*30)# From Python listsdata_1d = [1, 2, 3, 4, 5]arr_from_list = np.array(data_1d)print(f"From list: {arr_from_list}")# From nested lists (2D array)data_2d = [[1, 2, 3], [4, 5, 6]]arr_2d = np.array(data_2d)print(f"2D array:\n{arr_2d}")# From tuplesarr_from_tuple = np.array((10, 20, 30))print(f"From tuple: {arr_from_tuple}")
1️⃣ FROM EXISTING DATA
==============================
From list: [1 2 3 4 5]
2D array:
[[1 2 3]
[4 5 6]]
From tuple: [10 20 30]
The to_numpy() method is used to convert a pandas DataFrame into a NumPy array.
DataFrame:
A B C
0 1 4 7
1 2 5 8
2 3 6 9
Converted NumPy array:
[[1 4 7]
[2 5 8]
[3 6 9]]
Notes:
.to_numpy() returns a NumPyndarray.
The data type (dtype) is inferred automatically, but you can specify it if needed:
Code
df.to_numpy(dtype=float)
array([[1., 4., 7.],
[2., 5., 8.],
[3., 6., 9.]])
.to_numpy() is preferred over the older .values property.
This is often useful when you need to perform numerical operations with NumPy or machine learning libraries.
We will explore more examples and advanced operations in the next chapter: NumPy Intermediate.
9.4.1.2 Specialized Constructors
Method
Creates
When to Use
np.zeros(shape)
Array of zeros
Initialize arrays, placeholders
np.ones(shape)
Array of ones
Mathematical operations, masks
np.full(shape, val)
Array filled with a value
Default values, initialization
np.eye(n)
Identity matrix
Linear algebra operations
np.empty(shape)
Uninitialized array
Maximum speed (use with caution)
Code
print("\n2️⃣ SPECIALIZED CONSTRUCTORS")print("="*30)# Arrays of zeros and oneszeros_3x3 = np.zeros((3, 3))ones_2x4 = np.ones((2, 4))full_matrix = np.full((2, 3), 7) # Fill with custom valueprint(f"Zeros (3x3):\n{zeros_3x3}")print(f"Ones (2x4):\n{ones_2x4}")print(f"Full of 7s:\n{full_matrix}")# Identity matrixidentity = np.eye(3)print(f"Identity matrix:\n{identity}")
Shows the number of dimensions (or axes) of the array.
Code
numpy_ex.ndim
2
Numpy arrays can have any number of dimensions and different lengths along each dimension. We can inspect the length along each dimension using the .shape property of an array.
This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, the shape will be (n,m). The length of the shape tuple is therefore the rank, or the number of dimensions, ndim.
Unlike lists and tuples, NumPy arrays are designed to store elements of the same type, enabling more efficient memory usage and faster computations. The data type of the elements in a NumPy array can be accessed using the .dtype attribute
Code
numpy_ex.dtype
dtype('int32')
9.5 Data Types and Memory Optimization
NumPy supports a wide range of data types, each with a defined memory size. These types map directly to C data types, since NumPy is implemented in C at its computational core.
This design allows NumPy to handle array operations far more efficiently than native Python data structures.
9.5.1 Common NumPy Data Types
Data Type
Memory Size
np.int8
1 byte
np.int16
2 bytes
np.int32
4 bytes
np.int64
8 bytes
np.uint8
1 byte
np.uint16
2 bytes
np.uint32
4 bytes
np.uint64
8 bytes
np.float16
2 bytes
np.float32
4 bytes
np.float64
8 bytes
np.complex64
8 bytes
np.complex128
16 bytes
np.bool_
1 byte
np.string_
1 byte per character
np.unicode_
4 bytes per character
np.object_
Variable (Python objects)
np.datetime64
8 bytes
np.timedelta64
8 bytes
💡 Tip: Choosing the right data type is crucial for memory efficiency and performance. For large datasets, using smaller types (e.g., int16 instead of int64) can save significant memory.
Code
# Add practical examples for data type selectionprint("📊 CHOOSING THE RIGHT DATA TYPE")print("="*40)# Memory efficiency examplelarge_numbers = np.array([1000, 2000, 3000], dtype=np.int16) # Efficient for small rangessmall_numbers = np.array([1, 2, 3], dtype=np.int8) # Very memory efficientprint(f"int16 array memory: {large_numbers.nbytes} bytes")print(f"int8 array memory: {small_numbers.nbytes} bytes")# Precision example high_precision = np.array([3.14159265359], dtype=np.float64)low_precision = np.array([3.14159265359], dtype=np.float32)print(f"float64 precision: {high_precision}")print(f"float32 precision: {low_precision}")
📊 CHOOSING THE RIGHT DATA TYPE
========================================
int16 array memory: 6 bytes
int8 array memory: 3 bytes
float64 precision: [3.14159265]
float32 precision: [3.1415927]
9.5.2 Upcasting
When you create a NumPy array with elements of different data types, NumPy automatically upcasts them to a common type that can represent all values.
This process—also called type coercion or type promotion—follows a hierarchy of data types to prevent data loss.
Below are two common cases of upcasting, with examples:
Numeric Upcasting: If you mix integers and floats, NumPy will convert the entire array to floats.
Code
arr = np.array([1, 2.5, 3])print(arr.dtype)
float64
String Upcasting: If you mix numbers and strings, NumPy will upcast all elements to strings.
Code
arr = np.array([1, 'hello', 3.5])print(arr.dtype)
<U32
<U32 means: a Unicode string array where each element can hold up to 32 characters, using little-endian byte order.
9.6 Array Indexing and Slicing: Accessing Your Data
9.6.1 Array Indexing
Similar to Python lists, NumPy uses zero-based indexing, meaning the first element of an array is accessed using index 0. You can use positive or negative indices to access elements
In multi-dimensional arrays, indices are separated by commas. The first index refers to the row, and the second index refers to the column in a 2D array.
For slicing in Multi-Dimensional Arrays, use commas to separate slicing for different dimensions
Code
### Indexing & Slicing Quick Reference# Extract a sub-array: elements from the first two rows and the first two columnssub_array = array_2d[:2, :2]print(sub_array) # Extract all rows for the second columncol = array_2d[:, 1]print(col) # Extract the last two rows and last two columnssub_array = array_2d[-2:, -2:]print(sub_array)
[[1 2]
[4 5]]
[2 5 8]
[[5 6]
[8 9]]
The step parameter can be used to select elements at regular intervals.
Code
# the step parameter in slicingprint(array[1:8:2]) print(array[::-2])
[20 40]
[50 30 10]
9.6.3 Combining Indexing and Slicing
You can combine indexing and slicing to extract specific elements or sub-arrays
Code
# Create a 3D arrayarray_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])# Select specific elements and slicesprint(array_3d[0, :, 1]) # Output: [2 5] (second element from each row in the first sub-array)print(array_3d[1, 1, :2]) # Output: [10 11] (first two elements in the last row of the second sub-array)
[2 5]
[10 11]
9.6.4 Boolean Mask: Conditional Selection
9.6.4.1 Single Condition in NumPy
You can use boolean arrays to filter or select elements that meet a specific condition.
You can nest multiple np.where() calls, but prefer np.select for 3+ branches.
Code
# Conditional replacement with different values (letter grades)letter_grades = np.where(scores >=90, 'A', np.where(scores >=80, 'B', np.where(scores >=70, 'C', np.where(scores >=60, 'D', 'F'))))# Example 3: Using np.where to find indicesprint("\n3️⃣ Indices of A's and F's:")a_idx = np.where(letter_grades =='A')[0]f_idx = np.where(letter_grades =='F')[0]print(f"A indices: {a_idx}")print(f"F indices: {f_idx}")
3️⃣ Indices of A's and F's:
A indices: [0 5 9]
F indices: [6]
While you can nest multiple np.where() calls for complex conditions, np.select() is preferred for 3+ branches because it’s more readable and maintainable.
9.7.3np.select: Multi-way Conditional Logic
9.7.3.1 Syntax
np.select(conditions, choices, default=...)
conditions: list of boolean arrays (must be same shape or broadcastable)
choices: list of result values/arrays (same length as conditions)
default: value used where no condition is True (optional)
9.7.3.2 Example 1 — Letter grades (cleaner than nested np.where())
If you omit default=... in np.select, NumPy uses 0 as the fallback.
When your choices are strings, mixing them with the integer 0 forces a common dtype—strings and integers have no shared dtype—so NumPy raises a TypeError.
# Create a 2D array (e.g., student test scores)scores = np.array([ [85, 92, 78, 95], # Student 0 [88, 76, 91, 82], # Student 1 [95, 89, 84, 90], # Student 2 [72, 85, 79, 88] # Student 3])print("📊 Student Test Scores (4 students × 4 tests):")print(scores)print()# ============================================# 1️⃣ FLATTENED (no axis) - Returns single index# ============================================print("="*60)print("1️⃣ FLATTENED VIEW (axis=None) - Single Index")print("="*60)min_idx = np.argmin(scores)max_idx = np.argmax(scores)print(f"Lowest score index (flattened): {min_idx}")print(f"Highest score index (flattened): {max_idx}")# Convert flat index to 2D coordinatesmin_row, min_col = np.unravel_index(min_idx, scores.shape)max_row, max_col = np.unravel_index(max_idx, scores.shape)print(f"\n🔻 Minimum: {scores[min_row, min_col]} at position ({min_row}, {min_col})")print(f" → Student {min_row}, Test {min_col}")print(f"🔺 Maximum: {scores[max_row, max_col]} at position ({max_row}, {max_col})")print(f" → Student {max_row}, Test {max_col}")print()
📊 Student Test Scores (4 students × 4 tests):
[[85 92 78 95]
[88 76 91 82]
[95 89 84 90]
[72 85 79 88]]
============================================================
1️⃣ FLATTENED VIEW (axis=None) - Single Index
============================================================
Lowest score index (flattened): 12
Highest score index (flattened): 3
🔻 Minimum: 72 at position (3, 0)
→ Student 3, Test 0
🔺 Maximum: 95 at position (0, 3)
→ Student 0, Test 3
Code
# ============================================# 2️⃣ AXIS=0 (down columns) - Compare students# ============================================print("="*60)print("2️⃣ AXIS=0 (down columns) - Which STUDENT performed best/worst?")print("="*60)min_student_per_test = np.argmin(scores, axis=0)max_student_per_test = np.argmax(scores, axis=0)print(f"Lowest scoring student per test: {min_student_per_test}")print(f"Highest scoring student per test: {max_student_per_test}")print("\nDetailed breakdown:")for test_num inrange(scores.shape[1]): worst_student = min_student_per_test[test_num] best_student = max_student_per_test[test_num]print(f" Test {test_num}: "f"Worst = Student {worst_student} ({scores[worst_student, test_num]}), "f"Best = Student {best_student} ({scores[best_student, test_num]})")print()# ============================================# 3️⃣ AXIS=1 (across rows) - Compare tests# ============================================print("="*60)print("3️⃣ AXIS=1 (across rows) - Which TEST was easiest/hardest?")print("="*60)min_test_per_student = np.argmin(scores, axis=1)max_test_per_student = np.argmax(scores, axis=1)print(f"Worst test per student: {min_test_per_student}")print(f"Best test per student: {max_test_per_student}")print("\nDetailed breakdown:")for student_num inrange(scores.shape[0]): worst_test = min_test_per_student[student_num] best_test = max_test_per_student[student_num]print(f" Student {student_num}: "f"Worst = Test {worst_test} ({scores[student_num, worst_test]}), "f"Best = Test {best_test} ({scores[student_num, best_test]})")print()
============================================================
2️⃣ AXIS=0 (down columns) - Which STUDENT performed best/worst?
============================================================
Lowest scoring student per test: [3 1 0 1]
Highest scoring student per test: [2 0 1 0]
Detailed breakdown:
Test 0: Worst = Student 3 (72), Best = Student 2 (95)
Test 1: Worst = Student 1 (76), Best = Student 0 (92)
Test 2: Worst = Student 0 (78), Best = Student 1 (91)
Test 3: Worst = Student 1 (82), Best = Student 0 (95)
============================================================
3️⃣ AXIS=1 (across rows) - Which TEST was easiest/hardest?
============================================================
Worst test per student: [2 1 2 0]
Best test per student: [3 2 0 3]
Detailed breakdown:
Student 0: Worst = Test 2 (78), Best = Test 3 (95)
Student 1: Worst = Test 1 (76), Best = Test 2 (91)
Student 2: Worst = Test 2 (84), Best = Test 0 (95)
Student 3: Worst = Test 0 (72), Best = Test 3 (88)
Tips
argmin/argmax return indices, not values (use them to index back into the array).
Axis behavior
axis=None (default): flattens the entire array and returns the global index.
axis=0 (2D): compares down the rows within each column → returns row indices per column.
axis=1 (2D): compares across columns within each row → returns column indices per row.
General ND: reduces along the specified axis; the result shape equals the input shape with that axis removed.
For values at those positions:
Column-wise: i = np.argmax(M, axis=0); vals = M[i, np.arange(M.shape[1])]
a = np.array([3, 10, 7, 10])k =2# Indices that would sort ascendingidx_asc = np.argsort(a) # e.g., [0, 2, 1, 3]print("Indices to sort ascending:", idx_asc)# top=k indeces (ascending by value)topk_idx_asc = idx_asc[:k] # e.g., [0, 2]print(f"Top-{k} indices (ascending by value):", topk_idx_asc)# the values corresponding to the top-k indices (ascending)topk_vals_asc = a[topk_idx_asc]print(f"Top-{k} values (ascending by value):", topk_vals_asc)# Top-k indices (descending by value)topk_idx_des = idx_asc[-k:][::-1] # e.g., [3, 1]print(f"Top-{k} indices (descending by value):", topk_idx_des)# the values corresponding to the top-k indices (descending)topk_vals_des = a[topk_idx_des]print(f"Top-{k} values (descending by value):", topk_vals_des)
Indices to sort ascending: [0 2 1 3]
Top-2 indices (ascending by value): [0 2]
Top-2 values (ascending by value): [3 7]
Top-2 indices (descending by value): [3 1]
Top-2 values (descending by value): [10 10]
Shortcut (numeric arrays):
Code
topk_idx = np.argsort(-a)[:k] # also descending indicestopk_vals = a[topk_idx]print(topk_idx)print(topk_vals)
[1 3]
[10 10]
9.7.5.3 2D arrays (row-wise or column-wise)
Code
A = np.array([[3, 7, 2], [5, 1, 9]])k =2# Row-wise: get the top-k column indices for each rowrow_idx_sorted = np.argsort(A, axis=1) # shape (n_rows, n_cols)topk_cols_per_row = row_idx_sorted[:, -k:][:, ::-1]# Column-wise: get the top-k row indices for each columncol_idx_sorted = np.argsort(A, axis=0)topk_rows_per_col = col_idx_sorted[-k:, :][::-1, :]
To retrieve the values:
Code
# Row-wise values (gather with advanced indexing)rows = np.arange(A.shape[0])[:, None] # column vector of row indicestopk_vals_per_row = A[rows, topk_cols_per_row]print("Top-k values per row:\n", topk_vals_per_row)# Column-wise valuescols = np.arange(A.shape[1])[None, :]topk_vals_per_col = A[topk_rows_per_col, cols]print("Top-k values per column:\n", topk_vals_per_col)
Top-k values per row:
[[7 3]
[9 5]]
Top-k values per column:
[[5 7 9]
[3 1 2]]
9.8 Array Operations: Mathematical Power at Scale
NumPy arrays excel at vectorized operations that process entire arrays efficiently. These operations are the foundation of scientific computing and data analysis.
9.8.1 Arithmetic Operations
NumPy arrays support all standard arithmetic operators (+, -, *, /, **, %) and apply them element-wise. You can perform operations between:
Array and Array → Operands must have compatible shapes (same shape or broadcastable)
Array and Vector → The vector is broadcast across rows or columns depending on its shape, enabling element-wise operations
Array and Scalar → The scalar is applied to every element of the array (broadcasting)
💡 Key Advantage: These operations are vectorized - they execute at C speed, not Python loop speed!
9.8.1.1 Array and Array
Let’s explore arithmetic operations between two arrays through practical examples:
Code
# 📊 Sample Data: Sales figures for 3 stores × 4 quartersstore_sales_q1_q4 = np.array([[120, 150, 180, 200], # Store A [100, 130, 160, 190], # Store B [140, 110, 150, 170]]) # Store C# Bonus amounts for each store and quarterbonus_amounts = np.array([[10, 15, 20, 25], [12, 18, 22, 28], [8, 12, 18, 22]])print("🏪 Quarterly Sales (in thousands):")print("Store Q1 Q2 Q3 Q4")print("A ", store_sales_q1_q4[0])print("B ", store_sales_q1_q4[1]) print("C ", store_sales_q1_q4[2])print("\n💰 Quarterly Bonuses (in thousands):")print("Store Q1 Q2 Q3 Q4")print("A ", bonus_amounts[0])print("B ", bonus_amounts[1])print("C ", bonus_amounts[2])# For compatibility with existing examplesarr1, arr2 = store_sales_q1_q4, bonus_amounts
🏪 Quarterly Sales (in thousands):
Store Q1 Q2 Q3 Q4
A [120 150 180 200]
B [100 130 160 190]
C [140 110 150 170]
💰 Quarterly Bonuses (in thousands):
Store Q1 Q2 Q3 Q4
A [10 15 20 25]
B [12 18 22 28]
C [ 8 12 18 22]
9.8.1.2 Broadcasting: The Heart of NumPy Vectorization
Besides supporting arithmetic operations between arrays, broadcasting is NumPy’s powerful mechanism that enables operations between arrays of different shapes without explicitly reshaping them.
Array and Vector → The vector is broadcast across rows or columns depending on its shape, enabling element-wise operations
Array and Scalar → The scalar is applied to every element of the array (broadcasting)
Broadcasting is the foundation that makes vectorization in NumPy so flexible and intuitive.
9.8.1.2.1 Broadcasting Rules
NumPy broadcasting follows these rules:
Start from the trailing dimension and work backwards
Dimensions are compatible if:
They are equal, OR
One of them is 1, OR
One of them doesn’t exist (missing dimension)
Missing dimensions are assumed to be size 1
Code
# Broadcasting Rules Demonstrationdef check_broadcast_compatibility(shape1, shape2):""" Check if two shapes are compatible for broadcasting """# Pad shorter shape with 1s on the left ndim1, ndim2 =len(shape1), len(shape2) max_ndim =max(ndim1, ndim2) shape1_padded = [1] * (max_ndim - ndim1) +list(shape1) shape2_padded = [1] * (max_ndim - ndim2) +list(shape2) result_shape = [] compatible =Truefor i inrange(max_ndim): dim1, dim2 = shape1_padded[i], shape2_padded[i]if dim1 == dim2: result_shape.append(dim1)elif dim1 ==1: result_shape.append(dim2)elif dim2 ==1: result_shape.append(dim1)else: compatible =Falsebreakreturn compatible, tuple(result_shape) if compatible elseNone# Test different shape combinationstest_cases = [ ((3, 4), (4,)), # Compatible ((3, 4), (3, 1)), # Compatible ((3, 4), (2,)), # Incompatible ((2, 3, 4), (4,)), # Compatible ((2, 3, 4), (3, 4)), # Compatible ((2, 3, 4), (2, 1, 4)), # Compatible ((3, 4), (2, 3)), # Incompatible]print("Broadcasting Compatibility Check:")print("-"*50)for shape1, shape2 in test_cases: compatible, result_shape = check_broadcast_compatibility(shape1, shape2) status ="✅ Compatible"if compatible else"❌ Incompatible"if compatible:print(f"{str(shape1):>10} + {str(shape2):>10} → {str(result_shape):>12}{status}")else:print(f"{str(shape1):>10} + {str(shape2):>10} → {'N/A':>12}{status}")
# Invalid operations (will raise ValueError)try: a = np.ones((3, 4)) b = np.ones((3, 3)) result = a + bexceptValueErroras e:print(f"Error: {e}")try: a = np.ones((2, 3)) b = np.ones((2, 1, 1)) result = a + bexceptValueErroras e:print(f"Error: {e}")
Error: operands could not be broadcast together with shapes (3,4) (3,3)
In the above example, a + b cannot be broadcast together and evaluated successfully. See the broadcasting documentation to learn more about it.
9.8.2 Aggregate Functions: Statistical Summaries
Aggregate functions reduce array dimensions by applying statistical operations. The axis parameter controls the direction of aggregation.
9.8.2.1 Global Aggregation (All Elements)
Function
Purpose
Example
np.sum(arr)
Sum all elements
np.sum([[1,2],[3,4]]) → 10
np.mean(arr)
Average of all elements
np.mean([[1,2],[3,4]]) → 2.5
np.min(arr)
Minimum value
np.min([[1,2],[3,4]]) → 1
np.max(arr)
Maximum value
np.max([[1,2],[3,4]]) → 4
np.std(arr)
Standard deviation
np.std([[1,2],[3,4]]) → 1.118
9.8.2.2 Axis-Specific Aggregation
Understanding Axes in 2D Arrays:
axis=0: Down the rows (column-wise aggregation) → Result has shape (n_cols,)
axis=1: Across the columns (row-wise aggregation) → Result has shape (n_rows,)
# Student grades: 3 students × 4 subjects (Math, Science, English, History)# Create sample data for demonstrationarray = np.array([[4, 7, 1, 3], [5, 8, 2, 6], [9, 3, 5, 2]])# Display the original arrayprint("Original Array:\n", array)# Calculate the sum, mean, minimum, and maximum for the entire arraytotal_sum = np.sum(array)mean_value = np.mean(array)min_value = np.min(array)max_value = np.max(array)print(f"\nSum of all elements: {total_sum}") print(f"Mean of all elements: {mean_value}") print(f"Minimum value in the array: {min_value}") print(f"Maximum value in the array: {max_value}") # Calculate the sum, mean, minimum, and maximum along each row (axis=1)row_sum = np.sum(array, axis=1)row_mean = np.mean(array, axis=1)row_min = np.min(array, axis=1)row_max = np.max(array, axis=1)print("\nSum along each row:", row_sum) print("Mean along each row:", row_mean) print("Minimum value along each row:", row_min) print("Maximum value along each row:", row_max) # Calculate the sum, mean, minimum, and maximum along each column (axis=0)col_sum = np.sum(array, axis=0)col_mean = np.mean(array, axis=0)col_min = np.min(array, axis=0)col_max = np.max(array, axis=0)print("\nSum along each column:", col_sum) print("Mean along each column:", col_mean) print("Minimum value along each column:", col_min) print("Maximum value along each column:", col_max)
Original Array:
[[4 7 1 3]
[5 8 2 6]
[9 3 5 2]]
Sum of all elements: 55
Mean of all elements: 4.583333333333333
Minimum value in the array: 1
Maximum value in the array: 9
Sum along each row: [15 21 19]
Mean along each row: [3.75 5.25 4.75]
Minimum value along each row: [1 2 2]
Maximum value along each row: [7 8 9]
Sum along each column: [18 18 8 11]
Mean along each column: [6. 6. 2.66666667 3.66666667]
Minimum value along each column: [4 3 1 2]
Maximum value along each column: [9 8 5 6]
9.9 Array Reshaping: Transforming Data Dimensions
Array reshaping is essential for preparing data for different computational tasks. Many operations require specific array shapes:
Machine Learning: Models often expect specific input dimensions (e.g., 2D for tabular data, 4D for images)
Matrix Operations: Linear algebra operations require compatible shapes for multiplicationLet’s explore the essential tools for transforming array dimensions:
Data Analysis: Different analysis techniques may need data in specific formats
Broadcasting: Reshaping enables efficient element-wise operations between arrays
💡 Key Insight: Reshaping never changes the data—only how it’s organized in memory!
9.9.1 Core Reshaping Methods
9.9.1.1reshape(): Intelligent Dimension Control
The reshape() method creates a new view of the array with different dimensions. The total number of elements must remain the same.
Syntax: array.reshape(new_shape) or np.reshape(array, new_shape)
⚠️ Important: You can only use -1 for one dimension per reshape operation!
9.9.1.3flatten() vs ravel(): Converting to 1D
Both convert an array to 1D, but they differ in copy vs view, speed, and memory order handling.
Method
Returns
Copy/View
Speed (typical)
When to Use
flatten()
1D ndarray
Always makes a copy
Slower
You want an independent array you can modify safely
ravel()
1D ndarray
View if possible, else copy (not guaranteed)
Faster
You want a 1D view for efficiency (and accept that edits may affect the original)
Reading order options (both functions support order=): - 'C' (default): Row-major (left→right, top→bottom) - 'F': Column-major (top→bottom, left→right) - 'A': 'F' if the array is Fortran-contiguous, otherwise 'C' - 'K': As close as possible to the array’s in-memory layout (preserves current strides, useful after slicing)
⚠️ Gotcha:ravel() may return a copy if a view isn’t possible (e.g., non-contiguous slices or mismatched order). If you must ensure independence, use flatten() or call .copy() on the result of ravel().
Code
print("📊 FLATTEN() vs RAVEL() COMPARISON")print("="*45)# Original 2D arrayoriginal_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])print("Original array:")print(original_2d)# Method 1: flatten() — always returns a COPYflat_copy = original_2d.flatten()# Method 2: ravel() — returns a VIEW when possible (else a copy)flat_view = original_2d.ravel()print("\nResults:")print(f"flatten() result: {flat_copy}")print(f"ravel() result: {flat_view}")print("\nMemory sharing checks:")print(f"flatten() shares memory with original? {np.shares_memory(original_2d, flat_copy)}")print(f"ravel() shares memory with original? {np.shares_memory(original_2d, flat_view)}")
📚 STUDENT SCORE CONCATENATION
========================================
Semester 1 Scores:
Alice : [85 92 78]
Bob : [88 76 91]
Carol : [95 89 84]
Semester 2 Scores:
Alice : [87 89 82]
Bob : [91 78 89]
Carol : [93 92 88]
Code
# 📈 AXIS=0: Vertical Concatenation (More Students)print("\n"+"="*50)print("🔽 AXIS=0 CONCATENATION (Vertical Stacking)")print("="*50)# Add new students to the classnew_students = np.array([[82, 85, 79], # David: Math, Science, English [90, 88, 93]]) # Emma: Math, Science, English# Concatenate along axis=0 (add rows = add students)expanded_class = np.concatenate((semester1, new_students), axis=0)print(f"Original class size: {semester1.shape[0]} students")print(f"New students added: {new_students.shape[0]} students") print(f"Total class size: {expanded_class.shape[0]} students")print(f"\nExpanded class scores (5 students × 3 subjects):")print(expanded_class)all_students = students + ['David', 'Emma']for i, student inenumerate(all_students):print(f"{student:6}: {expanded_class[i]}")
==================================================
🔽 AXIS=0 CONCATENATION (Vertical Stacking)
==================================================
Original class size: 3 students
New students added: 2 students
Total class size: 5 students
Expanded class scores (5 students × 3 subjects):
[[85 92 78]
[88 76 91]
[95 89 84]
[82 85 79]
[90 88 93]]
Alice : [85 92 78]
Bob : [88 76 91]
Carol : [95 89 84]
David : [82 85 79]
Emma : [90 88 93]
Code
# ➡️ AXIS=1: Horizontal Concatenation (More Subjects)print("\n"+"="*50)print("➡️ AXIS=1 CONCATENATION (Horizontal Stacking)") print("="*50)# Add new subjects (History and Art scores)new_subjects_scores = np.array([[80, 92], # Alice: History, Art [85, 88], # Bob: History, Art [91, 95]]) # Carol: History, Art# Concatenate along axis=1 (add columns = add subjects)expanded_subjects = np.concatenate((semester1, new_subjects_scores), axis=1)print(f"Original subjects: {semester1.shape[1]} subjects") print(f"New subjects added: {new_subjects_scores.shape[1]} subjects")print(f"Total subjects: {expanded_subjects.shape[1]} subjects")print(f"\nExpanded scores (3 students × 5 subjects):")all_subjects = subjects + ['History', 'Art']print(f"Subjects: {all_subjects}")print(expanded_subjects)for i, student inenumerate(students):print(f"{student:6}: {expanded_subjects[i]}")
==================================================
➡️ AXIS=1 CONCATENATION (Horizontal Stacking)
==================================================
Original subjects: 3 subjects
New subjects added: 2 subjects
Total subjects: 5 subjects
Expanded scores (3 students × 5 subjects):
Subjects: ['Math', 'Science', 'English', 'History', 'Art']
[[85 92 78 80 92]
[88 76 91 85 88]
[95 89 84 91 95]]
Alice : [85 92 78 80 92]
Bob : [88 76 91 85 88]
Carol : [95 89 84 91 95]
9.10.2.2 Visual Understanding of Concatenation
Here’s how axis=0 and axis=1 concatenation work visually:
⚠️ CONCATENATION ERROR DEMONSTRATION
=============================================
Original 2D array shape: (2, 3)
Original 2D array:
[[1 2 3]
[4 5 6]]
Problematic 1D array shape: (3,)
Problematic 1D array: [7 8 9]
🚫 Why this fails:
2D array: (2, 3) (2 dimensions)
1D array: (3,) (1 dimension)
❌ Different number of dimensions!
Code
# Demonstrate the actual errorprint("\n🧪 Attempting concatenation (this will fail):")try: result = np.concatenate((original_2d, problem_1d), axis=0)print("Success!")exceptValueErroras e:print(f"❌ Error: {e}")print(" → Arrays have different numbers of dimensions!")
🧪 Attempting concatenation (this will fail):
❌ Error: all the input arrays must have same number of dimensions, but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)
→ Arrays have different numbers of dimensions!
9.10.2.4 Solutions to Shape Mismatch Problems
When arrays don’t have compatible shapes, here are the most common fixes:
Code
# 🔧 SOLUTION 1: Reshape to match dimensionsprint("✅ SOLUTION 1: Reshape 1D → 2D")print("="*35)print(f"Original problematic shape: {problem_1d.shape}")# Method 1: Add row dimensionsolution_1 = problem_1d.reshape(1, 3) # Convert to (1, 3)print(f"Reshaped to row: {solution_1.shape}")print(f"Reshaped array:\n{solution_1}")# Method 2: Add column dimensionsolution_2 = problem_1d.reshape(3, 1) # Convert to (3, 1)print(f"\nReshaped to column: {solution_2.shape}")print(f"Reshaped array:\n{solution_2}")# Method 3: Using newaxis (more explicit)solution_3 = problem_1d[np.newaxis, :] # Same as reshape(1, 3)solution_4 = problem_1d[:, np.newaxis] # Same as reshape(3, 1)print(f"\nUsing newaxis:")print(f"Row format: {solution_3.shape} → {solution_3}")print(f"Col format: {solution_4.shape} → \n{solution_4}")
✅ SOLUTION 1: Reshape 1D → 2D
===================================
Original problematic shape: (3,)
Reshaped to row: (1, 3)
Reshaped array:
[[7 8 9]]
Reshaped to column: (3, 1)
Reshaped array:
[[7]
[8]
[9]]
Using newaxis:
Row format: (1, 3) → [[7 8 9]]
Col format: (3, 1) →
[[7]
[8]
[9]]
9.10.2.5 ✅ Successful Concatenation After Reshaping
Code
# ✅ Now concatenation works!print("🎉 SUCCESSFUL CONCATENATION")print("="*30)# Using the reshaped arrayfixed_array = problem_1d.reshape(1, 3)print(f"Original 2D: {original_2d.shape}")print(f"Fixed 1D→2D: {fixed_array.shape}")# Now concatenation workssuccess_result = np.concatenate((original_2d, fixed_array), axis=0)print(f"\n✅ Concatenation successful!")print(f"Result shape: {success_result.shape}")print(f"Result:\n{success_result}")print(f"\n📊 What happened:")print(f" • Started with (2,3) + (3,) → ❌ Incompatible")print(f" • Reshaped to (2,3) + (1,3) → ✅ Compatible") print(f" • Final result: (3,3) array")
🎉 SUCCESSFUL CONCATENATION
==============================
Original 2D: (2, 3)
Fixed 1D→2D: (1, 3)
✅ Concatenation successful!
Result shape: (3, 3)
Result:
[[1 2 3]
[4 5 6]
[7 8 9]]
📊 What happened:
• Started with (2,3) + (3,) → ❌ Incompatible
• Reshaped to (2,3) + (1,3) → ✅ Compatible
• Final result: (3,3) array
9.10.2.6 Multiple Array Concatenation
np.concatenate() can join more than two arrays at once:
9.11.1 🌍 Capitals & Distance from Washington, D.C.
Data:country-capital-lat-long-population.csv
Task:
Use NumPy to print the name and coordinates of the capital city closest to the U.S. capital, Washington, D.C. (exclude Washington, D.C. itself).
Notes
The Country Name for the U.S. is United States of America in the dataset.
Closeness is measured by Euclidean distance computed on the latitude/longitude pairs (for this exercise).
Exclude the U.S. capital row to avoid the trivial zero distance.
Hints
Use DataFrame.to_numpy() to convert columns to NumPy arrays.
Use broadcasting to compute euclidean distances from Washington, D.C. to all other capitals.
Use np.argmin to get the index of the minimum distance (closest capital).
Use np.argsort() to get the indices of the 10 smallest distances (nearest capitals) and index back into the DataFrame for names/coordinates.
Use np.argsort() to get the indices of the 10 largest distances (farthest capitals) similarly.
Remember to mask out or drop the Washington, D.C. row before computing distances.
Closest capital city is: Ottawa-Gatineau
Coordinates of the closest capital city are: [ 45.4166 -75.698 ]
Top 10 closest capital cities to Washington DC are:
Capital City Country
36 Ottawa-Gatineau Canada
14 Nassau Bahamas
22 Hamilton Bermuda
55 La Habana (Havana) Cuba
215 Cockburn Town Turks and Caicos Islands
38 George Town Cayman Islands
94 Port-au-Prince Haiti
107 Kingston Jamaica
64 Santo Domingo Dominican Republic
177 Saint-Pierre Saint Pierre and Miquelon
Task 2: Print the name and coordinates of the capital city farthest from the U.S. capital, Washington, D.C.
Country
Capital City
Latitude
Longitude
Population
Capital Type
Distance
149
New Zealand
Wellington
-41.2866
174.7756
411346
Capital
264.269537
74
Fiji
Suva
-18.1416
178.4415
178339
Capital
261.767344
216
Tuvalu
Funafuti
-8.5189
179.1991
7042
Capital
260.585339
112
Kiribati
Tarawa
1.3272
172.9813
64011
Capital
252.824440
226
Vanuatu
Port Vila
-17.7338
168.3219
52690
Capital
251.808514
148
New Caledonia
Nouméa
-22.2763
166.4572
197787
Capital
251.059900
130
Marshall Islands
Majuro
7.0897
171.3803
30661
Capital
250.444485
145
Nauru
Nauru
-0.5308
166.9112
11312
Capital
247.112997
191
Solomon Islands
Honiara
-9.4333
159.9500
81801
Capital
241.863987
11
Australia
Canberra
-35.2835
149.1281
447692
Capital
238.018583
9.11.2 ⭐ Bonus Task: Top 10 Nearest & Farthest Capitals from Washington, D.C. (Haversine)
Using the haversine (great-circle) distance, find:
The top 10 closest capital cities to Washington, D.C.
The top 10 farthest capital cities from Washington, D.C.
Notes
Use lat/lon in degrees (your haversine function should convert to radians internally).
Exclude Washington, D.C. itself from the results.
Handle ties deterministically (e.g., kind="stable" in argsort if needed).
🔎 Why haversine? It computes real geodesic distance on a sphere, which is more accurate than Euclidean distance on raw lat/lon.