13  R: Data structures

R has several data structures to store and manipulate data efficiently. These structures can be classified into homogeneous (same type) and heterogeneous (different types) categories.

13.1 Atomic vectors ((1D, Homogeneous))

Atomic vectors are the most fundamental data type in R. They are one-dimensional collections of elements, where all elements must share the same data type.

Since atomic vectors were covered in the previous chapter, this chapter will focus on the remaining three data structures: matrices, lists, and data frames.

13.2 Matrix ((2D, Homogeneous))

Matrices are two-dimensional arrays.

13.2.1 Creating a matrix

The built-in function matrix() is used to define a matrix. An atomic vector can be organized as a matrix by specifying the number of rows and columns.

For example, let us define a 3x3 matrix (3 rows and 3 columns) consisting of consecutive integers from 1 to 9.

mat <- matrix(1:9, 3, 3)
mat
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

Note that the integers fill up column-wise in the matrix. If we wish to fill-up the matrix by row, we can use the byrow argument.

mat <- matrix(1:9, 3, 3, byrow = TRUE)
mat
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
[3,]    7    8    9

13.2.2 Getting matrix dimensions, the number of rows, the number of columns

Get matrix dimensions will give vector of c(rows, cols): - dim()

dim(mat)
[1] 3 3

Get number of elements: length()

length(mat)
[1] 9

The functions nrow() and ncol() can be used to get the number of rows and columns of the matrix respectively.

nrow(mat)
[1] 3
ncol(mat)
[1] 3

13.2.3 Matrix indexing and Slicing

Matrices can be sliced using the indices of row and column separated by a , in box brackets. Suppose we wish to get the element in the \(2^{nd}\) row and \(3^{rd}\) column of the matrix:

mat[2, 3]
[1] 6

For selecting all rows or columns of a matrix, the index for the row/column can be left blank. Suppose we wish to get all the elements of the \(1^{st}\) of the matrix:

mat[1, ]
[1] 1 2 3

Row and columns of the matrix can be sliced using the : operator. Suppose we want to select a sub-matrix that has elements in the first two rows and columns 2 and 3 of the matrix mat:

mat[1:2, 2:3]
     [,1] [,2]
[1,]    2    3
[2,]    5    6

13.2.4 Working with rownames() and colnames()

In R, row names (rownames()) and column names (colnames()) allow you to label and reference rows and columns in matrices and data frames efficiently.

# Create a matrix
mat <- matrix(1:9, nrow = 3, byrow = TRUE)

# Assign row names
rownames(mat) <- c("Row1", "Row2", "Row3")

# Assign column names
colnames(mat) <- c("Col1", "Col2", "Col3")

# Print matrix
print(mat)
     Col1 Col2 Col3
Row1    1    2    3
Row2    4    5    6
Row3    7    8    9

Getting and Modifying Names

rownames(mat)  # Get row names
[1] "Row1" "Row2" "Row3"
colnames(mat)  # Get column names
[1] "Col1" "Col2" "Col3"
rownames(mat) <- c("A", "B", "C")  # Modify row names
colnames(mat) <- c("X", "Y", "Z")  # Modify column names

print(mat)
  X Y Z
A 1 2 3
B 4 5 6
C 7 8 9

13.2.5 Adding and Removing

  • use cbind() to add a column of correct length
  • use rbind() to add a row of correct length
  • if length is incorrect there will be a warning (no error it will still run), however it will “recycle” values.
  • When adding a row or a col, you CANNOT insert it between the existing rows/cols. However, you can re-arrange AFTER it is added using indexing.
  • use negative indexes to remove rows/columns
new_col <- c(7, 8, 9)

# add the column 
mat_new <- cbind(mat, new_col)
print(mat_new)
  X Y Z new_col
A 1 2 3       7
B 4 5 6       8
C 7 8 9       9
mat_new <- cbind(mat, new_col = 5)
print(mat_new)
  X Y Z new_col
A 1 2 3       5
B 4 5 6       5
C 7 8 9       5
new_row <- c(7, 8, 9)

# add the row 
mat_new <- rbind(mat, new_row)
print(mat_new)
        X Y Z
A       1 2 3
B       4 5 6
C       7 8 9
new_row 7 8 9
mat_new <- rbind(mat, new_row = 5)
print(mat_new)
        X Y Z
A       1 2 3
B       4 5 6
C       7 8 9
new_row 5 5 5
# remove the added row
mat_new <- mat_new[-4, ]
print(mat_new)
  X Y Z
A 1 2 3
B 4 5 6
C 7 8 9
# remove the added column
mat_new <- mat_new[, -4]
print(mat_new)
  X Y Z
A 1 2 3
B 4 5 6
C 7 8 9

13.2.6 Iterating

To iterate through each element you generally need a nested loop combined with nrow() and ncol()

for (i in 1: nrow(mat_new)){
  for (j in 1: ncol(mat_new)) {
    print(paste("Element at (", i, ",", j, "):", mat[i, j]))
  }
}
[1] "Element at ( 1 , 1 ): 1"
[1] "Element at ( 1 , 2 ): 2"
[1] "Element at ( 1 , 3 ): 3"
[1] "Element at ( 2 , 1 ): 4"
[1] "Element at ( 2 , 2 ): 5"
[1] "Element at ( 2 , 3 ): 6"
[1] "Element at ( 3 , 1 ): 7"
[1] "Element at ( 3 , 2 ): 8"
[1] "Element at ( 3 , 3 ): 9"

13.2.7 Element-wise arithmetic operations

Element-wise arithmetic operations have the same logic with atomic vectors, they can be performed between 2 matrices of the same shape.

mat1 <- matrix(1:6, 2, 3)
mat2 <- matrix(c(9, 2, 6, 5, 1, 0), 2, 3)
mat1 + mat2
     [,1] [,2] [,3]
[1,]   10    9    6
[2,]    4    9    6
mat1 - mat2
     [,1] [,2] [,3]
[1,]   -8   -3    4
[2,]    0   -1    6

Suppose we need to sum up all the rows of the matrix. We can do it using a for loop as follows:

row_sum <- c(0,0)
for (i in 1:nrow(mat)) {
  for (j in 1:ncol(mat)) {
    row_sum[i] <- row_sum[i] + mat[i, j]
  }
}
row_sum
[1]  6 15 NA

Observe that in the above for loop, elements of each row are added one at a time. We can add all the elements of a row simultaneously using the sum() function. This will reduce a for loop from the above code:

row_sum <- c(0,0)
for (i in 1:nrow(mat)){
  row_sum[i] <- sum(mat[i,])
}
row_sum
[1]  6 15 24

In the above code, we sum up all the elements of the row simultaneously. However, we still need to sum up the elements of each row one at a time.

13.2.7.1 Matrix multiplication vs. element-wise multiplication

  • Matrix multiplication: using %*%
# Create another 3x3 matrix
mat2 <- matrix(9:1, nrow = 3)

# Perform matrix multiplication
result <- mat %*% mat2
print(result)
  [,1] [,2] [,3]
A   46   28   10
B  118   73   28
C  190  118   46
  • Element-wise multiplication: using *
# Element-wise multiplication
mat * mat
   X  Y  Z
A  1  4  9
B 16 25 36
C 49 64 81

13.2.8 The apply() function

The apply() function can be used to apply a function on each row or column of a matrix. Thus, this function helps avoid the user to write a for() loop in R to iterate over all the rows and columns of the matrix. Note that each row / column of a matrix is an atomic vector. Thus, vectorized computations can be performed within the function, resulting in efficient computations.

Note that the apply functions use a for() loop under-the-hood, and thus the function will be applied sequentially on each row / column of the matrix. However, as the implementation of the for() loop is in C, it is likely to be faster than a for() loop in R.

Let us use the apply() function to sum up all the rows of the matrix mat.

apply(mat, 1, sum)
 A  B  C 
 6 15 24 

Let us compare the time taken to sum up rows of a matrix using a for loop with the time taken using the apply() function.

options(digits.secs = 6)
start.time <- Sys.time()
row_sum<-c(0, 0)
for (i in 1:nrow(mat)){
  row_sum[i] <- sum(mat[i,])
}
row_sum
[1]  6 15 24
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.003321886 secs
start.time <- Sys.time()
apply(mat, 1, sum)
 A  B  C 
 6 15 24 
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.001377106 secs

Observe that the apply() function takes much lesser time to sum up all the rows of the matrix as compared to the for loop.

13.2.9 Misc useful functions:

Check if a value/string is in the matrix:

  • "val" %in% my_mat
# Create a matrix
fruit_mat <- matrix(c("apple", "banana", "cherry", "date"), nrow=2, byrow=TRUE)

# Check if "banana" is in the matrix
"banana" %in% fruit_mat 
[1] TRUE
# Check if "grape" is in the matrix
"grape" %in% fruit_mat  
[1] FALSE

If you want to return TRUE or FALSE directly:

any(fruit_mat == "banana")  
[1] TRUE
any(fruit_mat == "grape") 
[1] FALSE
which(mat == "banana", arr.ind = TRUE)
     row col

Check if a value is missing (ie: NA):

  • is.na(val)

13.3 Lists (1D, Heterogeneous)

Atomic vectors and matrices are quite useful in R. However, a constraint with them is that they can only contain objects of the same datatype. For example, an atomic vector can contain all numeric objects, all character objects, or all logical objects, but not a mixture of multiple types of objects. Thus, there arises a need for a list data structure that can store objects of multiple datatypes.

13.3.1 Creating and Indexing

A list can be defined using the list() function. For example, consider the list below:

list_ex <- list(1, "apple", TRUE, list("another list", TRUE))

The list list_ex consists of objects of mutiple datatypes. The length of the list can be obtained using the length()function:

length(list_ex)
[1] 4

A list is an ordered collection of objects. Each object of the list is associated with an index that corresponds to its order of occurrence in the list.

A single element can be sliced from the list by specifying its index within the [[]] operator. Let us slice the \(2^{nd}\) element of the list list_ex:

list_ex[[2]]
[1] "apple"

Multiple elements can be sliced from the list by specifying the indices as an atomic vector within the [] operator. Let us slice the \(1^{st}\) and \(3^{rd}\) elements from the list list_ex:

list_ex[c(1,3)]
[[1]]
[1] 1

[[2]]
[1] TRUE

13.3.2 Naming a list

Elements of a list can be named using the names() function. Let us name the elements of list_ex:

names(list_ex) <- c("Name1", "second_name", "3rd_element", "Number 4")

A single element can be sliced from the list using the name of the element with the $ operator. Let us slice the element named as second_name from the list list_ex:

list_ex$second_name
[1] "apple"

Note that if the name of the element does not begin with a letter or has special characters such as a space, then it should be specified within single quotes after the $ operator. For example, let us slice the element named as 3rd_element from the list list_ex:

list_ex$`3rd_element`
[1] TRUE

Names of elements of a list can also be specified while defining the list, as in the example below:

list_ex_with_names <- list(movie = 'The Dark Knight', IMDB_rating = 9)

13.3.3 Using as Function Input

No arithmetic or logical operations with a list, since elements are with different data types. A list can be converted to an atomic vector using the unlist() function. For example, let us convert the list list_ex to a vector:

unlist(list_ex)
         Name1    second_name    3rd_element      Number 41      Number 42 
           "1"        "apple"         "TRUE" "another list"         "TRUE" 

Since a vector can contain objects of a single datatype, note that all objects have been converted to the character datatype in the vector above.

Another example:

# Define a list with mixed data types
my_list <- list(a = 1, b = 2, c = "text", d = 4, e = TRUE)

# Function to sum numeric elements in the list
sum_list_numeric <- function(lst) {
  # Convert list to a vector, forcing non-numeric elements to NA
  vec <- unlist(lst)
  
  # Keep only numeric values (ignoring non-numeric ones)
  numeric_values <- as.numeric(vec[!is.na(as.numeric(vec))])
  
  # Compute sum
  return(sum(numeric_values))
}

# Call the function
result <- sum_list_numeric(my_list)
Warning in sum_list_numeric(my_list): NAs introduced by coercion
print(result) 
[1] 7

13.3.4 Applying a function to each element of a list: the lapply() function

# Define a list with numeric values
num_list <- list(a = 1:5, b = 6:10, c = 11:15)

# Use lapply() to compute the sum of each list element
sum_results <- lapply(num_list, sum)

# Print the result
print(sum_results)
$a
[1] 15

$b
[1] 40

$c
[1] 65
  • Using lapply() with an Anonymous Function
# Compute the mean of each element in the list
mean_results <- lapply(num_list, function(x) mean(x))

# Print the result
print(mean_results)
$a
[1] 3

$b
[1] 8

$c
[1] 13

13.4 Data Frames (2D, Heterogeneous)

A data frame is a table-like structure where each column is a vector. Unlike matrices, columns can have different types.

13.4.1 Creating data Frames

df <- data.frame(Name=c("Alice", "Bob", "Charlie"),
                 Age=c(25, 30, 35),
                 Score=c(90, 85, 88))

print(df)
     Name Age Score
1   Alice  25    90
2     Bob  30    85
3 Charlie  35    88

13.4.2 Common functions

Function Purpose
nrow(df) Number of rows
ncol(df) Number of columns
dim(df) Returns (rows, columns) as a vector
length(df) Number of columns (since a data frame is a list of columns)
rownames(df) Returns row names
colnames(df) Returns column names
str(df) Displays the structure of the data frame, including column types and values
head(df) Returns the first six rows of the data frame

13.4.3 Missing values

13.4.3.1 Using is.na()

# Sample data frame with missing values
df_missing <- data.frame(Name = c("Alice", "Bob", "Charlie"),
                 Age = c(25, NA, 35),
                 Score = c(90, 85, NA))

# Check for missing values (returns TRUE/FALSE matrix)
is.na(df_missing)
      Name   Age Score
[1,] FALSE FALSE FALSE
[2,] FALSE  TRUE FALSE
[3,] FALSE FALSE  TRUE

is.na(df) returns a matrix where TRUE indicates a missing value.

13.4.3.2 Count the number of total missing values

sum(is.na(df))
[1] 0

Counts all NA values in the entire data frame.

13.4.3.3 Count missing values by column

colSums(is.na(df))
 Name   Age Score 
    0     0     0 

13.4.3.4 Find Rows Containing NA Values

df[!complete.cases(df), ]
[1] Name  Age   Score
<0 rows> (or 0-length row.names)

Common functions on missing values

Function Purpose
is.na(df) Returns a matrix of TRUE/FALSE for missing values
sum(is.na(df)) Counts the total number of missing values
colSums(is.na(df)) Counts missing values per column
df[!complete.cases(df), ] Shows only rows that contain missing values
any(is.na(df)) Returns TRUE if any missing values exist
mean(is.na(df)) Returns the proportion of missing values

13.4.4 Accessing Data Frame Elements

df$Name   # Accessing a column
[1] "Alice"   "Bob"     "Charlie"
df[2, ]   # Second row
  Name Age Score
2  Bob  30    85
df[ , "Age"]  # Age column
[1] 25 30 35
df[1:2, c("Name", "Score")]  # Subset
   Name Score
1 Alice    90
2   Bob    85

13.4.5 Using Logical Operators for Data Frame Subsetting

13.4.5.1 Using Logical Operators ([ ])

You can filter rows based on conditions using comparison operators (==, >, <, >=, <=, !=):

Filter rows where age is greater than 25

filtered_df <- df[df$Score == 90, ]
print(filtered_df)
   Name Age Score
1 Alice  25    90
filtered_df <- df[df$Age > 25, ]
print(filtered_df)
     Name Age Score
2     Bob  30    85
3 Charlie  35    88
  • df$Age > 25 creates a logical vector (FALSE, TRUE, TRUE).
  • df[df$Age > 25, ] selects rows where Age is greater than 25.
  • Uses == to filter rows where Score is exactly 90.

13.4.5.2 Using subset() function

filtered_df <- subset(df, Age > 25)
print(filtered_df)
     Name Age Score
2     Bob  30    85
3 Charlie  35    88

13.4.5.3 Filter using multiple conditions

You can combine multiple conditions with & (AND) and | (OR):

filtered_df <- df[df$Age > 25 & df$Score > 85, ]
print(filtered_df)
     Name Age Score
3 Charlie  35    88

Using subset()

filtered_df <- subset(df, Age > 28 & Score > 85)
print(filtered_df)
     Name Age Score
3 Charlie  35    88
filtered_df <- df[df$Age < 30 | df$Score > 85, ]
print(filtered_df)
     Name Age Score
1   Alice  25    90
3 Charlie  35    88
# Filter and select only the "Name" and "Score" columns
filtered_df <- subset(df, Age > 28 & Score > 85, select = c(Name, Score))
print(filtered_df)
     Name Score
3 Charlie    88

13.4.6 Iterating

# Iterate over rows
for (i in 1:nrow(df)) {
  print(paste("Row", i, ":", df[i, ]))
}
[1] "Row 1 : Alice" "Row 1 : 25"    "Row 1 : 90"   
[1] "Row 2 : Bob" "Row 2 : 30"  "Row 2 : 85" 
[1] "Row 3 : Charlie" "Row 3 : 35"      "Row 3 : 88"     
for (col in names(df)) {
  print(paste("Column:", col))
  print(df[[col]])  # Access column by name
}
[1] "Column: Name"
[1] "Alice"   "Bob"     "Charlie"
[1] "Column: Age"
[1] 25 30 35
[1] "Column: Score"
[1] 90 85 88
for (i in 1:nrow(df)) {
  for (j in 1:ncol(df)) {
    print(paste("Element at [", i, ",", j, "]:", df[i, j]))
  }
}
[1] "Element at [ 1 , 1 ]: Alice"
[1] "Element at [ 1 , 2 ]: 25"
[1] "Element at [ 1 , 3 ]: 90"
[1] "Element at [ 2 , 1 ]: Bob"
[1] "Element at [ 2 , 2 ]: 30"
[1] "Element at [ 2 , 3 ]: 85"
[1] "Element at [ 3 , 1 ]: Charlie"
[1] "Element at [ 3 , 2 ]: 35"
[1] "Element at [ 3 , 3 ]: 88"

13.5 Summary

Data Structure Dimensions Homogeneous? Example
Atomic Vector 1D ✅ Yes c(1, 2, 3)
List 1D ❌ No list(name="Alice", age=25)
Matrix 2D ✅ Yes matrix(1:6, nrow=2)
Data Frame 2D ❌ No data.frame(name, age)

13.6 Practice exericises

13.6.1 Exercise 1

Recall the earlier example where we computed year’s in which the increase in GDP per capita was more than 10%.

13.6.1.1

Please use matrices to solve the same problem. Also we will time the execution time

Hints:

  • Let the first column of the matrix be the GDP of all the years except 1960, and the second column be the GDP of all the years except 2021.
  • The percent increase in GDP can be computed by performing computations using the 2 columns of the matrix
G = c(3007, 3067, 3244, 3375,3574, 3828, 4146, 4336, 4696, 5032,5234,5609,6094,6726,7226,7801,8592,9453,10565,11674,12575,13976,14434,15544,17121,18237,19071,20039,21417,22857,23889,24342,25419,26387,27695,28691,29968,31459,32854,34515,36330,37134,37998,39490,41725,44123,46302,48050,48570,47195,48651,50066,51784,53291,55124,56763,57867,59915,62805,65095,63028,69288)

start.time <- Sys.time()

#Let the first column of the matrix be the GDP of all the years except 1960, and the second column be the GDP of all the years except 2021.
GDP_mat <- matrix(c(G[-1], G[-length(G)]), length(G) - 1, 2)

#The percent increase in GDP can be computed by performing computations using the 2 columns of the matrix
inc <- (GDP_mat[,1] - GDP_mat[,2]) / GDP_mat[,2]
years <- 1961:2021
years <- years[inc > 0.1]
years
[1] 1973 1976 1977 1978 1979 1981 1984
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.002815008 secs

Without matrices, the time taken to perform the same computation is measured with the code below.

start.time <- Sys.time()
years <- c()
for (i in 1:(length(G) - 1)) {
  diff = (G[i+1] - G[i]) / G[i]
  if (diff > 0.1) years <- c(years, 1960 + i)
}
print(years)
[1] 1973 1976 1977 1978 1979 1981 1984
#print(proc.time()[3]-start_time)
end.time <- Sys.time()
time.taken <- end.time - start.time
time.taken
Time difference of 0.005828857 secs

Observe that matrices reduce the execution time of the code as computations are performed using vectorization, in contrast to a for loop where computations are performed one at a time.

13.6.1.2

Determine the highest GDP per capita in the United States for each five-year period from 1961-1965 to 2015-2020.

Hint:

  • First, create a matrix where each row represents a five-year period of GDP per capita.
  • Then, use the apply() function to find the maximum GDP per capita within each period
GDP_5year <- matrix(G[-c(1, length(G))], 12, 5, byrow = TRUE)
GDP_max_5year <- apply(GDP_5year, 1, max)

In the above code, we applied the built-in function max to each row (5-year period).

13.6.1.3

Identify the highest GDP per capita in the United States across all the five-year periods defined in the previous question. Additionally, determine the specific five-year period and the exact year in which this maximum GDP per capita occurred.

# Define start year
start_year <- 1961

# Generate period labels
year_intervals <- seq(start_year, start_year + (nrow(GDP_5year) - 1) * 5, by = 5)
period_labels <- paste(year_intervals, year_intervals + 4, sep = "-")

# Assign row names
rownames(GDP_5year) <- period_labels

# Find the maximum GDP across all 5-year periods
max_gdp_overall <- max(GDP_5year)

# Find the row and column index of the maximum value
max_position <- which(GDP_5year == max_gdp_overall, arr.ind = TRUE)

# Extract the corresponding 5-year period
max_period <- rownames(GDP_5year)[max_position[1]]

# Compute the exact year within the period
max_year_within_period <- start_year + (max_position[1] - 1) * 5 + (max_position[2] - 1)

# Print results
print(paste("Maximum GDP per capita:", max_gdp_overall))
[1] "Maximum GDP per capita: 65095"
print(paste("Occurred in period:", max_period))
[1] "Occurred in period: 2016-2020"
print(paste("Occurred in year:", max_year_within_period))
[1] "Occurred in year: 2019"

If a built-in function is not available for the required computations, we can create a user-defined function and pass it to the apply() function to perform the operation efficiently. See example below

13.6.1.4

Identify the five-year periods which the GDP per capita experienced a continuous decline each year, indicating a period of financial crisis or economic degradation

# Define a function to check if all values in a row are decreasing
is_decreasing <- function(row) {
  return(all(diff(row) < 0))
}

# Identify periods where GDP decreases year by year
decreasing_periods <- apply(GDP_5year, 1, is_decreasing)

# Extract period names where GDP is continuously decreasing
decreasing_period_names <- rownames(GDP_5year)[decreasing_periods]

# Print the periods of financial decline
print(decreasing_period_names)
character(0)

Instead of defining a named function, you can also use an anonymous function directly within the apply() function for a more concise approach

# Apply a user-defined function to check for decreasing GDP in each period
decreasing_periods <- apply(GDP_5year, 1, function(row) all(diff(row) < 0))

# Extract the period names where GDP decreased every year
decreasing_period_names <- rownames(GDP_5year)[decreasing_periods]

# Print the periods of financial decline
print(decreasing_period_names)
character(0)
# The 5 year periods during which the GDP per capita decreased as compared to the previous year are 2006-2010, and 2016-2020.

13.6.2 Exercise 2

setwd("C:\\Users\\akl0407\\Desktop\\STAT303-1\\Quarto Book\\Intro_to_programming_for_data_sci\\Datasets")

Download the file capital_cities.csv from here. Make sure the file is in your current working directory. Execute the following code to obtain the objects coordinates_capital_cities and country_names.

The object country_names is an atomic vector consisting of country names. The object coordinates_capital_cities is a matrix consisting of the latitude-longitude pair of the capital city of the respective country. The order of countries in country_names is the same as the order in which their capital city coordinates (latitude-longitude) appear in the matrix coordinates_capital_cities.

capital_cities <- read.csv('capital_cities.csv')
coordinates_capital_cities <- as.matrix(capital_cities[,c(3, 4)])
country_names <- capital_cities[,1]

13.6.2.1 Top 10 countries closest to DC

  1. Print the names of the countries of the top 10 capital cities closest to the US capital - Washington DC.

  2. Create and print a matrix containing the coordinates of the top 10 capital cities closest to Washington DC.

Note that:

  1. The Country Name for US is given as United States in the data.
  2. The ‘closeness’ of capital cities from the US capital is based on the Euclidean distance of their coordinates to those of the US capital.
  3. The dataset includes two records: United States and US Minor Outlying Islands, both belonging to the U.S., with a distance of zero. These records must be excluded to accurately identify the closest foreign capital cities.

Hint:

  1. Get the coordinates of Washington DC from coordinates_capital_cities. The row that contains the coordinates of DC will have the same index as United States has in the vector country_names

  2. Create a matrix that has coordinates of Washington DC in each row, and has the same number of rows as the matrix coordinates_capital_cities.

  3. Subtract coordinates_capital_cities from the matrix created in (2). Element-wise subtraction will occur between the matrices.

  4. Use the apply() function on the matrix obtained above to find the Euclidean distance of Washington DC from the rest of the capital cities.

  5. Using the distances obtained above, find the country that has the closest capital to DC (Euclidean distance > 0).

# Find the index of the US capital (Washington, D.C.)
US_index = which(country_names == 'United States')
dc_coord <- coordinates_capital_cities[US_index,]

# Compute Euclidean distances to Washington, D.C.
distances_to_DC <- apply(coordinates_capital_cities, 1, 
                  function(city_coord) sqrt(sum((city_coord - dc_coord)**2)))

# Create a matrix with country index and distance                  
num_of_countries <- length(country_names)
distances_to_DC_matrix <- cbind(1:num_of_countries, distances_to_DC)

# Exclude records where the distance is exactly 0
distances_to_DC_matrix <- distances_to_DC_matrix[distances_to_DC_matrix[, 2] > 0, ]

# Sort by distance
sorted <- distances_to_DC_matrix[order(distances_to_DC_matrix[,2]),]

Top 10 countries with capitals closest to Washington DC are the following:

# Get the top 10 closest countries (after exclusions)
top_10_countries <- sorted[1:10, 1]

# Print the country name
print(country_names[sorted[1:10, 1]])

You can also use the following code to exclude the U.S. records

#Exclude rows where the country is "United States" or "US Minor Outlying Islands"
filtered_sorted <- sorted[!(country_names[sorted[, 1]] %in% c("United States", "US Minor Outlying Islands")), ]

The coordinates of the top 10 capital cities closest to Washington DC are:

coordinates_capital_cities[top_10_countries, ]

13.6.2.2 Use data.frame to solve the above task

13.6.3 Exercise 3

Download the dataset movie_ratings_orig.csv. Execute the following code to read the data into the object movies:

movies <- read.csv("movie_ratings_orig.csv", stringsAsFactors = FALSE)

13.6.3.1

What is the datatype of the object movies?

class(movies)

The datatype of the object movies is list.

13.6.3.2

Count the number movies having a negative profit, i.e., their production budget is higher than their worldwide gross.

Ignore the movies that:

  1. Have missing values of production budget or worldwide gross.

  2. Have a zero worldwide gross (A zero worldwide gross is probably an incorrect value).

# Filter the dataset to exclude rows with missing or zero values in budget or gross
filtered_movies <- subset(
  movies,
  !is.na(Production.Budget) &
  !is.na(Worldwide.Gross) &
  Worldwide.Gross != 0
)

# Count movies with negative profit (budget > gross)
negative_profit_count <- sum(filtered_movies$Production.Budget > filtered_movies$Worldwide.Gross)

# Print the result
negative_profit_count