5  Data Structures

In data science, you shape raw data before modeling it. Python's built-in data structures are the foundation for that work, and they help explain why Series/DataFrame (pandas) and ndarray (NumPy) behave the way they do.

5.1 Primitives vs. Containers

  • Primitives (hold a single value): int, float, bool, None
  • Containers (hold multiple values): str, list, tuple, set, dict

Tip: Check any object’s type with type(x).
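As a quick illustration of the tip above, the sketch below calls type() on one value from each group. The variable names samples and type_names are chosen here for illustration only.

```python
# Hypothetical sample values: four primitives followed by five containers.
samples = [42, 3.14, True, None, "data", [1, 2], (1, 2), {1, 2}, {"a": 1}]

type_names = []
for x in samples:
    # type(x).__name__ is the short name of the object's type.
    type_names.append(type(x).__name__)
    print(repr(x), "->", type(x).__name__)
```

Note that None reports its type as NoneType.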

5.2 Core Built-in Data Structures

5.2.1 Sequences (ordered, indexable)

Types: list, tuple, str

# list (mutable)
nums = [10, 20, 20, 30]
nums.append(40)        # mutate in place
nums[0] = 11           # item assignment
nums_slice = nums[1:3] # slicing

# tuple (immutable)
pt = (42.0, -1.5)      # cannot reassign pt[0]

# string (immutable sequence of characters)
s = "data"
s2 = s.upper()         # returns a new string; s unchanged

When to use:

  • list: grow/shrink, frequent edits, ordered data

  • tuple: fixed-size records, multiple return values from a function, dict keys (a tuple is hashable when all of its elements are)

  • str: text processing (we’ll also use nltk later)
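The mutable/immutable distinction in the list above can be sketched as follows. The coordinates and site labels are made up for illustration.

```python
pt = (42.0, -1.5)

try:
    pt[0] = 0.0                  # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)
    print("TypeError:", error_message)

# A tuple of hashable values can serve as a dict key; a list cannot.
site_names = {(42.0, -1.5): "site A", (40.7, -74.0): "site B"}
print(site_names[pt])            # site A
```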

5.2.2 Sets (unordered, unique)

Types: set

a = {1, 2, 2, 3}       # {1, 2, 3}
b = {3, 4}
a | b                  # union -> {1, 2, 3, 4}
a & b                  # intersection -> {3}
a - b                  # difference -> {1, 2}
{1, 2}

When to use:

  • Deduplication, membership tests, fast set algebra.
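A minimal sketch of deduplication and a membership test, using a hypothetical list of email addresses:

```python
emails = ["a@x.org", "b@x.org", "a@x.org", "c@x.org"]   # hypothetical data

unique_emails = set(emails)               # duplicates dropped; order not guaranteed
print(len(emails), len(unique_emails))    # 4 3
print("b@x.org" in unique_emails)         # True

# sorted() turns the set into a list with a predictable order.
ordered_emails = sorted(unique_emails)
print(ordered_emails)
```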

5.2.3 Mappings (key–value)

Types: dict

student = {"name": "Alex", "year": 3, "major": "Stats"}
student["year"] = 4               # update
student["gpa"] = 3.7              # insert
for k, v in student.items():      # iterate keys & values
    print(k, "->", v)
name -> Alex
year -> 4
major -> Stats
gpa -> 3.7

When to use:

  • Labeled data, lookups, configuration/state.
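One common use of a dict as a lookup table is tallying how often each label occurs. The sketch below assumes a hypothetical list of majors.

```python
majors = ["Stats", "CS", "Stats", "Math", "CS", "Stats"]   # hypothetical data

counts = {}
for major in majors:
    if major in counts:          # key already present: increment its tally
        counts[major] += 1
    else:                        # first time this key is seen
        counts[major] = 1

print(counts)   # {'Stats': 3, 'CS': 2, 'Math': 1}
```

Dictionaries keep keys in insertion order, which is why Stats appears first.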

5.3 Iterables vs. Iterators

In Python, an iterable is any object capable of returning its members one at a time, such as lists, tuples, strings, and dictionaries. You can loop over iterables using a for loop.

An iterator is an object that represents a stream of data; it produces the next value when you call next() on it. Iterators are created from iterables using the iter() function.

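Key differences:

  • Iterables can be looped over repeatedly; each loop starts from the beginning because the iterable itself does not track a position.
  • Iterators remember their position, advance one item at a time, and are exhausted once every item has been produced.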

Example:

my_list = [1, 2, 3]
my_iter = iter(my_list)
print(next(my_iter))  # 1
print(next(my_iter))  # 2
print(next(my_iter))  # 3

Understanding the difference helps you write efficient loops and work with data streams in Python.
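One practical consequence is that an iterator can be consumed only once, while the list it came from can be iterated again. A small sketch, reusing my_list from the example above:

```python
my_list = [1, 2, 3]
my_iter = iter(my_list)

first_pass = list(my_iter)     # consumes the iterator: [1, 2, 3]
second_pass = list(my_iter)    # the iterator is now exhausted: []

# next() on an exhausted iterator raises StopIteration,
# unless a default value is passed as the second argument.
after_end = next(my_iter, "done")

# The list itself is unaffected; iter() gives a fresh iterator each time.
fresh_pass = list(iter(my_list))

print(first_pass, second_pass, after_end, fresh_pass)
```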

5.4 Common, Efficient Operations

5.4.1 Comprehensions

Comprehensions are a concise way to create lists, sets, or dictionaries from iterables by applying an expression to each item in an iterable (such as a list, tuple, or range) and optionally filtering the items based on a condition. They are a powerful and efficient way to generate new collections without the need for explicit loops.

Basic Syntax:

new_list = [expression for item in iterable if condition]
  • expression: What you want to include in the new list (or set/dict).
  • item: Represents each element in the iterable as the comprehension iterates through it.
  • iterable: The source of elements (list, tuple, range, etc.).
  • condition (optional): A filter to control which items are included. If omitted, all items are included.

Why use comprehensions?

  • More readable and concise than loops.
  • Often faster than equivalent for-loops.
  • Preferred for simple transformations and filtering.
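To make the syntax concrete, the sketch below writes the same transformation twice: once as an explicit loop and once as a comprehension. The variable names are illustrative.

```python
nums = [1, 2, 3, 4, 5, 6]

# Explicit loop: collect the squares of the even numbers.
even_squares_loop = []
for n in nums:
    if n % 2 == 0:
        even_squares_loop.append(n ** 2)

# Equivalent list comprehension: expression, item, iterable, condition.
even_squares = [n ** 2 for n in nums if n % 2 == 0]

print(even_squares)                        # [4, 16, 36]
print(even_squares == even_squares_loop)   # True
```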

List comprehension example:

Create a list of the squares of the natural numbers from 5 to 15.

squares_5_15 = [x**2 for x in range(5, 16)]
print(squares_5_15)
[25, 36, 49, 64, 81, 100, 121, 144, 169, 196, 225]

Create a list of tuples, where each tuple consists of a natural number and its square, for natural numbers ranging from 5 to 15.

num_square_pairs_5_15 = [(x, x**2) for x in range(5, 16)]
print(num_square_pairs_5_15)
[(5, 25), (6, 36), (7, 49), (8, 64), (9, 81), (10, 100), (11, 121), (12, 144), (13, 169), (14, 196), (15, 225)]

Create a list of the words that start with the letter a from a given list of words.

words = ['apple', 'banana', 'avocado', 'grape', 'apricot']
a_words = [word for word in words if word.startswith('a')]
print(a_words)
['apple', 'avocado', 'apricot']

Set comprehension:

unique_lengths = {len(word) for word in ['cat', 'dog', 'mouse']}  
uniq_initials = {name[0].upper() for name in ["amy","ann","bob","amy"]}  
print(uniq_initials)
print(unique_lengths)
{'B', 'A'}
{3, 5}

Dictionary comprehension:

word_lengths = {word: len(word) for word in ['cat', 'dog', 'mouse']} 
print(word_lengths)
{'cat': 3, 'dog': 3, 'mouse': 5}

Comprehensions are preferred for simple transformations and filtering.

5.4.2 Membership & Lookups

Membership operations let you check whether an item exists in a collection (such as a list, set, or dictionary). Use in and not in for these checks. Lookups retrieve a value by index (sequences) or by key (dictionaries). Membership tests are fastest in sets and dictionaries.

Examples:

  • 3 in [1, 2, 3] -> True (checks if 3 is in the list)
  • 'a' in {'a': 1, 'b': 2} -> True (checks if 'a' is a key in the dictionary)
  • 5 not in {1, 2, 3} -> True (checks if 5 is not in the set)

Tip:

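  • Dictionary and set membership checks are very fast (constant time on average).
  • List and tuple membership checks are slower (linear time), because elements are compared one by one.
  • For safe dictionary lookups, use .get(key): it returns None (or a default you supply) instead of raising KeyError when the key is missing.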
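The difference between square-bracket lookup and .get() can be sketched with a small, hypothetical student record:

```python
student = {"name": "Alex", "year": 3}

print("year" in student)          # True: in tests the keys
print(3 in student)               # False: values are not checked
print(student.get("gpa"))         # None: missing key, no exception
print(student.get("gpa", 0.0))    # 0.0: caller-supplied default

try:
    student["gpa"]                # square brackets raise on a missing key
except KeyError as err:
    missing_key = err.args[0]
    print("KeyError for key:", missing_key)
```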

5.4.3 Unpacking

Unpacking lets you assign elements of a collection (like a list, tuple, or string) to multiple variables in a single step. This makes your code more readable and concise, especially when working with structured data.

Basic Syntax:

pt = (3, 4)
x, y = pt  # x=3, y=4 

a, b, c = [1, 2, 3]  # a=1, b=2, c=3

You can also use unpacking in loops and with function arguments.
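For function arguments, the * and ** operators spread a sequence or a dict across a function's parameters. The describe function below is a hypothetical helper written only for this sketch.

```python
def describe(name, year, major):
    """Hypothetical helper: format a student record as text."""
    return f"{name} (year {year}, {major})"

record = ("Alex", 3, "Stats")
from_tuple = describe(*record)     # * spreads a sequence into positional arguments

fields = {"name": "Alex", "year": 3, "major": "Stats"}
from_dict = describe(**fields)     # ** spreads a dict into keyword arguments

print(from_tuple)                  # Alex (year 3, Stats)
print(from_tuple == from_dict)     # True
```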

Extended Unpacking: Python allows you to use the * operator to capture multiple elements:

# Extended unpacking:
first, *rest = [10, 20, 30, 40]  # first=10, rest=[20, 30, 40]

a, *mid, z = [1,2,3,4] 

print(rest)
print(mid)
[20, 30, 40]
[2, 3]

Unpacking in Loops:

pairs = [(1, 'a'), (2, 'b'), (3, 'c')]
for num, char in pairs:
    print(num, char)
1 a
2 b
3 c

Unpacking with Dictionaries:

student = {"name": "Alex", "year": 3}
for key, value in student.items():
    print(key, value)
name Alex
year 3

Why use unpacking?

  • Makes code cleaner and more readable
  • Quick variable assignment
  • Useful for working with structured data (e.g., tuples, lists, dicts)
  • Avoids manual indexing

Tip: Unpacking works with any iterable, including lists, tuples, strings, and even the results of functions that return multiple values.
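The tip above can be sketched as follows. min_max is a hypothetical helper, and the last part shows what happens when the number of names does not match the number of items.

```python
def min_max(values):
    """Hypothetical helper: returns two values, which Python packs into a tuple."""
    return min(values), max(values)

low, high = min_max([3, 1, 4, 1, 5])   # low=1, high=5
c1, c2, c3 = "abc"                      # strings unpack character by character

left, right = 1, 2
left, right = right, left               # swap without a temporary variable

try:
    x, y = [1, 2, 3]                    # three items, only two names
except ValueError as err:
    unpack_error = str(err)
    print("ValueError:", unpack_error)
```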

5.4.4 Sorting in Python Iterables

Sorting is a common task when working with data. Python provides flexible ways to order lists, tuples, and other iterables.

5.4.4.1 sorted() (built-in function)

  • Returns a new sorted list from any iterable.
  • Works with lists, tuples, strings, sets, and more.
  • Does not modify the original iterable.
nums = [3, 1, 4, 1, 5]
print(sorted(nums))       # [1, 1, 3, 4, 5]
print(nums)               # [3, 1, 4, 1, 5] (unchanged)

word = "python"
print(sorted(word))       # ['h', 'n', 'o', 'p', 't', 'y']
[1, 1, 3, 4, 5]
[3, 1, 4, 1, 5]
['h', 'n', 'o', 'p', 't', 'y']

5.4.4.2 .sort() (list method)

  • In-place sort: modifies the list itself.

  • Only available for lists (not other iterables).

  • Returns None.

nums = [3, 1, 4, 1, 5]
nums.sort()
print(nums)    # [1, 1, 3, 4, 5]
[1, 1, 3, 4, 5]
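Because .sort() returns None, assigning its result to a variable is a common mistake. A short sketch of the pitfall, and of sorted() as the option for non-list iterables:

```python
nums = [3, 1, 4, 1, 5]
result = nums.sort()           # sorts nums in place
print(result)                  # None
print(nums)                    # [1, 1, 3, 4, 5]

letters = ("c", "a", "b")      # tuples have no .sort() method
ordered = sorted(letters)      # sorted() accepts any iterable and returns a list
print(ordered)                 # ['a', 'b', 'c']
```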

5.4.4.3 Reverse Sorting

Both sorted() and .sort() accept a reverse argument.

nums = [3, 1, 4, 1, 5]
print(sorted(nums, reverse=True))
[5, 4, 3, 1, 1]

5.4.4.4 Sorting with a key

The key parameter lets you customize sorting logic.

# Sort by string length
words = ["pear", "apple", "banana", "kiwi"]
print(sorted(words, key=len))
# ['pear', 'kiwi', 'apple', 'banana'] (ties keep their original order)

# Sort by last character
print(sorted(words, key=lambda w: w[-1]))
# ['banana', 'apple', 'kiwi', 'pear']
['pear', 'kiwi', 'apple', 'banana']
['banana', 'apple', 'kiwi', 'pear']

5.4.4.5 Sorting Complex Data

For lists of tuples or dicts, use key.

# Sort by the second element of each tuple
pairs = [("a", 3), ("b", 1), ("c", 2)]
print(sorted(pairs, key=lambda t: t[1]))  
# [('b', 1), ('c', 2), ('a', 3)]

# Sort list of dicts by a field
students = [
    {"name": "Alice", "grade": 85},
    {"name": "Bob", "grade": 92},
    {"name": "Chen", "grade": 78}
]
print(sorted(students, key=lambda s: s["grade"]))
[('b', 1), ('c', 2), ('a', 3)]
[{'name': 'Chen', 'grade': 78}, {'name': 'Alice', 'grade': 85}, {'name': 'Bob', 'grade': 92}]

Sorting strings alphabetically vs. numerically:

nums = ["10", "2", "1"]
print(sorted(nums))           
print(sorted(nums, key=int))
['1', '10', '2']
['1', '2', '10']
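A key function can also return a tuple, which sorts by the first element and breaks ties with the next. The sketch below, on made-up scores, orders by score from high to low and then by name.

```python
scores = [("Bob", 92), ("Alice", 85), ("Chen", 92), ("Dana", 85)]   # hypothetical data

# Negating the score sorts it in descending order; the name breaks ties alphabetically.
ranked = sorted(scores, key=lambda r: (-r[1], r[0]))
print(ranked)
# [('Bob', 92), ('Chen', 92), ('Alice', 85), ('Dana', 85)]
```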

5.4.5 Lambda Functions in Python

Sometimes you only need a tiny one-off function for a specific task (like sorting by length or filtering items). Writing a full def feels heavy. That’s where lambda functions help.

Syntax

lambda parameters: expression
  • Creates an anonymous function (no name required).
  • Must contain a single expression (not multiple statements).
  • Automatically returns the value of the expression.
square = lambda x: x**2
print(square(5))   # 25
25

Using Lambda Functions

  • With sorted()
words = ["pear", "apple", "banana", "kiwi"]

# Sort by word length
print(sorted(words, key=lambda w: len(w)))

# Sort by last character
print(sorted(words, key=lambda w: w[-1]))
['pear', 'kiwi', 'apple', 'banana']
['banana', 'apple', 'kiwi', 'pear']
  • With map() and filter()
nums = [1, 2, 3, 4, 5]

# Square each number
print(list(map(lambda x: x**2, nums)))  

# Keep only even numbers
print(list(filter(lambda x: x % 2 == 0, nums)))
[1, 4, 9, 16, 25]
[2, 4]
  • With reduce() (from functools)
from functools import reduce

nums = [1, 2, 3, 4, 5]
product = reduce(lambda x, y: x * y, nums)
print(product)   # 120
120

Key Takeaways:

  • lambda = quick, throwaway function for one-liners
  • use def if the function
    • has multiple steps
    • is reused often
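To illustrate the takeaway, the sketch below writes the same sort key as a lambda and as a named function. last_char is a hypothetical name chosen for this example.

```python
words = ["pear", "apple", "banana", "kiwi"]

def last_char(w):
    """Return the final character of a word."""
    return w[-1]

by_lambda = sorted(words, key=lambda w: w[-1])   # throwaway key function
by_def = sorted(words, key=last_char)            # named, reusable key function
print(by_lambda == by_def)                        # True

# When an existing function already does the job, pass it directly.
by_length = sorted(words, key=len)
print(by_length)                                  # ['pear', 'kiwi', 'apple', 'banana']
```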

5.4.6 enumerate()

The enumerate() function is a built-in Python tool that adds a counter to any iterable, returning pairs of (index, item) as you loop. This is especially useful when you need both the item and its position in a loop.

Syntax:

for index, value in enumerate(iterable, start=0):
    # use index and value
  • iterable: Any iterable (list, tuple, string, etc.)
  • start: Optional, sets the starting index (default is 0)

Why use enumerate()?

  • Makes code cleaner and less error-prone
  • Avoids manual index tracking with a separate variable
  • Works with lists, tuples, strings, and more

Example:

fruits = ['apple', 'banana', 'cherry']
for idx, fruit in enumerate(fruits):
    print(idx, fruit)
0 apple
1 banana
2 cherry

Tip: You can set the starting index with the start argument, e.g. enumerate(my_list, start=1).

Use enumerate() for readable, efficient loops when you need both index and value.
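A short sketch of the start argument, which is handy for human-friendly numbering that begins at 1:

```python
fruits = ['apple', 'banana', 'cherry']

for rank, fruit in enumerate(fruits, start=1):
    print(rank, fruit)

# enumerate() is lazy; wrapping it in list() shows the pairs it produces.
numbered = list(enumerate(fruits, start=1))
print(numbered)   # [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
```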

5.4.7 zip()

The zip() function combines two or more iterables (like lists, tuples, or strings) into tuples, pairing elements by their position. This is useful for parallel iteration, creating pairs, or merging data from multiple sources.

Syntax:

zip(iterable1, iterable2, ...)
  • Each tuple contains one element from each iterable, matched by position.
  • Stops at the shortest iterable.

Why use zip()?

  • Parallel iteration over multiple sequences
  • Pairing related data (e.g., names and scores)
  • Creating dictionaries from two lists

Example:

names = ['Alice', 'Bob', 'Chen']
scores = [85, 92, 78]
for name, score in zip(names, scores):
    print(name, score)
Alice 85
Bob 92
Chen 78
# Creating a dictionary from two lists:
gradebook = dict(zip(names, scores))
print(gradebook)
{'Alice': 85, 'Bob': 92, 'Chen': 78}

Tip:

  • You can use zip(*zipped) to unzip a list of tuples back into separate sequences (each one comes back as a tuple).
  • If the input iterables have different lengths, zip() stops at the shortest one.
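Both points can be sketched briefly. The variable names below are illustrative.

```python
pairs = [("Alice", 85), ("Bob", 92), ("Chen", 78)]

# zip(*pairs) regroups the first items together and the second items together.
names, scores = zip(*pairs)
print(names)    # ('Alice', 'Bob', 'Chen')
print(scores)   # (85, 92, 78)

# With inputs of different lengths, the extra items are silently dropped.
truncated = list(zip([1, 2, 3], "ab"))
print(truncated)   # [(1, 'a'), (2, 'b')]
```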

5.4.8 Common Functions for Iterables

Python provides a variety of built-in functions to operate on iterables, making it easy to manipulate, process, and analyze collections like lists, tuples, strings, sets, and dictionaries. Below is a list of commonly used built-in functions specifically designed for iterables.

Function | Description | Example
len() | Returns the number of elements in a collection. | len([1, 2, 3]) -> 3
min() | Returns the smallest element in an iterable. | min([3, 1, 4]) -> 1
max() | Returns the largest element in an iterable. | max([3, 1, 4]) -> 4
sum() | Returns the sum of elements in an iterable (numeric types only). | sum([1, 2, 3]) -> 6
sorted() | Returns a sorted list from an iterable (does not modify the original). | sorted([3, 1, 2]) -> [1, 2, 3]
reversed() | Returns an iterator that accesses the elements of a sequence in reverse. | list(reversed([1, 2, 3])) -> [3, 2, 1]
enumerate() | Returns an iterator of tuples containing indices and elements of the iterable. | list(enumerate(['a', 'b', 'c'])) -> [(0, 'a'), (1, 'b'), (2, 'c')]
all() | Returns True if all elements of the iterable are true (or if it is empty). | all([True, 1, 'a']) -> True
any() | Returns True if any element of the iterable is true. | any([False, 0, 'b']) -> True
str.join(iterable) | Joins the strings in an iterable (e.g., list, tuple) into a single string, using the given string as a separator. | ''.join(['a', 'b', 'c']) -> 'abc'
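Several of these functions are often combined. The sketch below applies them to a small, made-up list of grades.

```python
grades = [85, 92, 78, 95]   # hypothetical data

print(len(grades), min(grades), max(grades), sum(grades))   # 4 78 95 350

average = sum(grades) / len(grades)
all_passing = all([g >= 60 for g in grades])    # every grade is at least 60
any_perfect = any([g == 100 for g in grades])   # no grade equals 100

# join() needs strings, so each number is converted first.
summary = ", ".join([str(g) for g in sorted(grades, reverse=True)])

print(average, all_passing, any_perfect)   # 87.5 True False
print(summary)                             # 95, 92, 85, 78
```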

5.5 Independent Study

5.5.1 Sorting the data

  • Sort [10, 2, 33, 25, 7] in descending order.

  • Given words = ["data", "python", "AI", "science"], sort alphabetically ignoring case.

  • Sort [("Ann", 22), ("Bob", 19), ("Chen", 22)] by age, preserving name order when ages match.

5.5.2 lambda

  • Use lambda with sorted() to order the list:
nums = [-3, 1, -2, 5, 0]
  • Use lambda with filter() to keep only names starting with a vowel:
names = ["Alice", "Bob", "Eve", "Uma", "Sam"]
  • Use lambda with map() to convert Celsius to Fahrenheit:
temps_c = [0, 20, 37, 100]

5.5.3

You are given a list of strings representing student names. Some students appear more than once because they signed up for multiple activities.

Write a Python function that:

  1. Removes duplicate names while preserving the first occurrence order.
  2. Returns a list of tuples (index, name_length) for each unique name, where:
    • index is the original position of the name in the list (first occurrence).
    • name_length is the length of the name.

Finally, sort the output list by name_length in descending order.

names = ["Alice", "Bob", "Alice", "Charlie", "Bob", "Dave"]

5.5.4 Student data

# CELL 1: Student Data Setup
students = [
    (101, "Alice", "STAT303", 85),
    (102, "Bob", "STAT301", 92),
    (103, "Charlie", "STAT303", 78),
    (104, "Diana", "STAT301", 95),
    (105, "Eve", "STAT303", 88),
    (106, "Frank", "STAT304", 82),
    (107, "Grace", "STAT301", 90),
    (108, "Hank", "STAT304", 87),
    (109, "Ivy", "STAT303", 91),
    (101, "Alice", "STAT301", 88),
    (101, "Alice", "STAT304", 82),
    (101, "Alice", "STAT302", 86),
    (102, "Bob", "STAT303", 79),
    (102, "Bob", "STAT304", 85),
    (102, "Bob", "STAT302", 83),
    (103, "Charlie", "STAT301", 91),
    (103, "Charlie", "STAT304", 84),
    (103, "Charlie", "STAT302", 80),
    (104, "Diana", "STAT303", 89),
    (104, "Diana", "STAT304", 93),
    (104, "Diana", "STAT302", 96),
    (105, "Eve", "STAT301", 87),
    (105, "Eve", "STAT304", 90),
    (105, "Eve", "STAT302", 85),
    (106, "Frank", "STAT303", 76),
    (106, "Frank", "STAT301", 81),
    (106, "Frank", "STAT302", 79),
    (107, "Grace", "STAT303", 84),
    (107, "Grace", "STAT304", 88),
    (107, "Grace", "STAT302", 91),
    (108, "Hank", "STAT303", 80),
    (108, "Hank", "STAT301", 83),
    (108, "Hank", "STAT302", 82),
    (109, "Ivy", "STAT301", 94),
    (109, "Ivy", "STAT304", 89),
    (109, "Ivy", "STAT302", 92),
    (110, "Jack", "STAT303", 83),
    (110, "Jack", "STAT301", 86),
    (110, "Jack", "STAT304", 79),
    (110, "Jack", "STAT302", 81),
    (111, "Karen", "STAT303", 96),
    (111, "Karen", "STAT301", 92),
    (111, "Karen", "STAT304", 94),
    (111, "Karen", "STAT302", 98),
    (112, "Leo", "STAT303", 72),
    (112, "Leo", "STAT301", 75),
    (112, "Leo", "STAT304", 78),
    (112, "Leo", "STAT302", 74)
]
  • What is the data type of this variable?
  • How many unique students are in the dataset?
  • How many unique courses are in the dataset?
  • What is the grade range?
  • What is the average grade?