13  Data Cleaning

Real-world data is rarely clean and ready for analysis. Data scientists often spend 60-80% of their time on data preprocessing, with data cleaning being a critical first step that directly impacts the quality of insights and model performance.

This chapter focuses on essential data cleaning techniques - identifying and addressing issues in raw data quality that can compromise your analysis.

13.1 Common Data Quality Issues

  • Missing values (e.g., NaN in an income column): reduce the sample size and bias results. Solution: impute or drop strategically.
  • Duplicates (e.g., multiple identical rows): inflate counts and skew statistics. Solution: identify and remove.
  • Inconsistent types (e.g., '1', 1.0, and 1 for the same value): cause errors in operations. Solution: standardize data types.
  • Outliers (e.g., a salary of $1,000,000,000): distort means and regression lines. Solution: cap, transform, or remove.


Note: We previously covered handling inconsistent data types (e.g., converting strings to datetime for time-based feature engineering). This chapter focuses on the other three major issues.
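As a quick refresher, here is a minimal, self-contained sketch of that kind of type standardization. The toy column names amount and recorded_on are hypothetical, used only for illustration.

import pandas as pd

# Toy data: the same quantity stored as str, float, and int, plus date strings
raw = pd.DataFrame({'amount': ['1', 1.0, 1, '2.5'],
                    'recorded_on': ['2021-01-01', '2021-02-15', '2021-03-15', '2021-04-04']})

# Coerce 'amount' to a single numeric dtype (unparseable entries would become NaN)
raw['amount'] = pd.to_numeric(raw['amount'], errors='coerce')

# Parse the date strings into a proper datetime dtype
raw['recorded_on'] = pd.to_datetime(raw['recorded_on'])

raw.dtypes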

Let’s first import the necessary libraries:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import seaborn as sns

sns.set(font_scale=1.5)

%matplotlib inline

13.2 Handling Missing Data

Missing values in a dataset can occur for several reasons:

  • Equipment failure - Breakdown of measuring equipment
  • Human error - Accidental deletion or failure to record observations
  • Non-response - Survey respondents skip questions
  • Data entry mistakes - Errors made by researchers or data collectors

13.2.1 Dataset Introduction

For this chapter, we’ll use two related datasets:

  • GDP_missing_data.csv - Contains artificially introduced missing values
  • GDP_complete_data.csv - The complete dataset without missing values

We’ll use the complete dataset later to evaluate the accuracy of our imputation methods by comparing imputed values against actual values.

gdp_missing_values_data = pd.read_csv('./Datasets/GDP_missing_data.csv')
gdp_complete_data = pd.read_csv('./Datasets/GDP_complete_data.csv')
gdp_missing_values_data.head()
economicActivityFemale country lifeMale infantMortality gdpPerCapita economicActivityMale illiteracyMale illiteracyFemale lifeFemale geographic_location contraception continent
0 7.2 Afghanistan 45.0 154.0 2474.0 87.5 NaN 85.0 46.0 Southern Asia NaN Asia
1 7.8 Algeria 67.5 44.0 11433.0 76.4 26.1 51.0 70.3 Northern Africa NaN Africa
2 41.3 Argentina 69.6 22.0 NaN 76.2 3.8 3.8 76.8 South America NaN South America
3 52.0 Armenia 67.2 25.0 13638.0 65.0 NaN 0.5 74.0 Western Asia NaN Asia
4 53.8 Australia NaN 6.0 54891.0 NaN 1.0 1.0 81.2 Oceania NaN Oceania

Observe that the gdp_missing_values_data dataset contains some missing values, shown as NaN (Not a Number).

13.2.2 Identifying missing values in a dataframe

There are multiple ways to identify missing values in a dataframe:

13.2.2.1 describe() Method

Note that the descriptive statistics methods associated with Pandas objects ignore missing values by default. Consider the summary statistics of gdp_missing_values_data:

gdp_missing_values_data.describe()
economicActivityFemale lifeMale infantMortality gdpPerCapita economicActivityMale illiteracyMale illiteracyFemale lifeFemale contraception
count 145.000000 145.000000 145.000000 145.000000 145.000000 145.000000 145.000000 145.000000 84.000000
mean 45.935172 65.491724 37.158621 24193.482759 76.563448 13.570028 21.448897 70.615862 51.773810
std 16.875922 9.099256 34.465699 22748.764444 7.854730 16.497954 25.497045 9.923791 31.930026
min 1.900000 36.000000 3.000000 772.000000 51.200000 0.000000 0.000000 39.100000 0.000000
25% 35.500000 62.900000 10.000000 6837.000000 72.000000 1.000000 2.300000 67.500000 17.000000
50% 47.600000 67.800000 24.000000 15184.000000 77.300000 6.600000 9.720000 73.900000 65.000000
75% 55.900000 72.400000 54.000000 35957.000000 81.600000 19.500000 30.200000 78.100000 77.000000
max 90.600000 77.400000 169.000000 122740.000000 93.000000 70.500000 90.800000 82.900000 79.000000

Observe that the count row reports the number of non-missing values in each column: the dataset has more rows (see the code below) than the count shown for any of the variables in the table above. Similarly, the other statistics, such as mean, std, etc., are computed with the missing values ignored.

#The dataset gdp_missing_values_data has 155 rows
gdp_missing_values_data.shape[0]
155

13.2.2.2 info() Method

Shows the count of non-null entries in each column, helping you quickly identify columns with missing values.

gdp_missing_values_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   economicActivityFemale  145 non-null    float64
 1   country                 155 non-null    object 
 2   lifeMale                145 non-null    float64
 3   infantMortality         145 non-null    float64
 4   gdpPerCapita            145 non-null    float64
 5   economicActivityMale    145 non-null    float64
 6   illiteracyMale          145 non-null    float64
 7   illiteracyFemale        145 non-null    float64
 8   lifeFemale              145 non-null    float64
 9   geographic_location     155 non-null    object 
 10  contraception           84 non-null     float64
 11  continent               155 non-null    object 
dtypes: float64(9), object(3)
memory usage: 14.7+ KB

13.2.2.3 isnull() Method

This is one of the most direct methods. Using df.isnull() returns a DataFrame of Boolean values where True indicates a missing value. To get a summary, you can use df.isnull().sum() to see the count of missing values in each column.

For finding the number of missing values in each column of gdp_missing_values_data, we will sum up the missing values in each column of the dataset:

gdp_missing_values_data.isnull().sum()
economicActivityFemale    10
country                    0
lifeMale                  10
infantMortality           10
gdpPerCapita              10
economicActivityMale      10
illiteracyMale            10
illiteracyFemale          10
lifeFemale                10
geographic_location        0
contraception             71
continent                  0
dtype: int64

13.2.3 Types of Missing Values

In data science, missing values typically fall into three main types, each requiring different handling strategies:

13.2.3.1 Missing Completely at Random (MCAR)

  • Definition: Missing values are entirely independent of any variables in the dataset.
  • Example: A respondent accidentally skips a question on a survey.
  • Impact: MCAR data can usually be ignored or imputed without biasing the analysis.
  • Handling: Simple imputation methods, like filling with mean or median values, are often appropriate.

13.2.3.2 Missing at Random (MAR)

  • Definition: The likelihood of a value being missing is related to other observed variables but not to the missing data itself.
  • Example: People with higher incomes may be less likely to report their spending, but income data itself is not missing.
  • Impact: Ignoring MAR values may bias results, so careful imputation based on related variables is recommended.
  • Handling: More complex imputation methods, like conditional mean imputation or predictive modeling, are suitable.

13.2.3.3 Missing Not at Random (MNAR)

  • Definition: The probability of missingness is related to the missing data itself, meaning the value is systematically missing.
  • Example: Patients with severe health conditions might be less likely to report their health status, or students with low scores may be less likely to submit their grades.
  • Impact: MNAR is the most challenging type, as missing values may introduce significant bias.
  • Handling: Solutions often include sensitivity analysis, data augmentation, or modeling techniques that account for the missing mechanism, though sometimes domain-specific approaches are necessary.

Understanding the type of missing data helps in selecting the right imputation method and mitigating potential biases in the analysis.
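To make the distinction concrete, here is a small synthetic sketch (hypothetical income and spending variables, not the GDP data) contrasting MCAR and MAR missingness, using the numpy import from the start of the chapter:

# Synthetic illustration of MCAR vs MAR missingness (hypothetical variables)
rng = np.random.default_rng(0)
n = 1000
income = rng.normal(60, 15, n)                   # fully observed variable
spending = 0.6 * income + rng.normal(0, 5, n)    # variable that will receive missing values

# MCAR: every observation has the same 20% chance of being missing
spending_mcar = spending.copy()
spending_mcar[rng.random(n) < 0.2] = np.nan

# MAR: the chance of missingness depends on the *observed* income, not on spending itself
p_missing = np.where(income > np.percentile(income, 75), 0.45, 0.05)
spending_mar = spending.copy()
spending_mar[rng.random(n) < p_missing] = np.nan

# Under MCAR the observed mean is roughly unbiased; under MAR it is biased downward,
# because high spenders (the high earners) are the ones most likely to be missing
print(np.nanmean(spending), np.nanmean(spending_mcar), np.nanmean(spending_mar))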

13.2.3.4 Questions

13.2.3.4.1

Why can we ignore observations with missing values without risking skewing the analysis or trends in the case of Missing Completely at Random (MCAR)?

13.2.3.4.2

Why could ignoring missing values lead to biased results for Missing at Random (MAR) and Missing Not at Random (MNAR) data?

13.2.3.4.3

For the dataset consisting of GDP per capita, think of hypothetical scenarios in which the missing values of GDP per capita could correspond to MCAR / MAR / MNAR.

13.2.4 Methods for Handling Missing Values

13.2.4.1 Removing Missing Values

  • Row/Column Removal — use df.dropna()
    • When to use: If the missing values are few or the affected rows/columns are not critical.
    • Risk: May reduce the dataset’s size, potentially removing useful information.

Let’s drop all rows that contain any missing value from gdp_missing_values_data.

gdp_no_missing_data = gdp_missing_values_data.dropna()

By default, df.dropna() will drop any row that contains at least one missing value, keeping only the rows that are completely free of missing values.

#Shape of gdp_no_missing_data
gdp_no_missing_data.shape
(42, 12)

13.2.4.2 Limitations of Using dropna() Directly

Using df.dropna() to remove rows with missing values can sometimes lead to a significant reduction in data, which can be problematic if much of the data is valuable and non-missing.

13.2.4.2.1 Example
  • Impact of Default Behavior:
    Dropping rows with even a single missing value reduced the number of rows from 155 to 42!
    This drastic reduction occurs because, by default, dropna() removes any row containing at least one missing value, keeping only rows with no missing values at all.

  • Loss of Non-Missing Data:
    Even if some columns have only a few missing values (e.g., at most 10), using dropna() without modification causes all non-missing data in those rows to be lost.
    This is usually a poor choice—valuable, non-missing data may be removed unnecessarily.

13.2.4.3 Adjusting dropna() Behavior

To avoid losing too much data, you can modify how dropna() operates using its parameters (see the short sketch after this list):

  • how parameter
    • Controls the criteria for dropping rows or columns.
    • how='any' (default): Drops a row or column if any value is missing.
    • how='all': Drops a row or column only if all values are missing.
  • thresh parameter
    • Specifies the minimum number of non-missing values required to retain a row or column.
    • Useful when you want to keep rows or columns that are mostly complete, even if a few values are missing.
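A minimal sketch of these parameters on our dataset; the threshold of 10 non-missing values (out of 12 columns) is an arbitrary choice for illustration:

# Keep a row only if it has at least 10 non-missing values (out of 12 columns)
mostly_complete = gdp_missing_values_data.dropna(thresh=10)

# Drop a row only if every value in it is missing
no_fully_empty_rows = gdp_missing_values_data.dropna(how='all')

mostly_complete.shape, no_fully_empty_rows.shape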

If only a few values in a column are missing, we can often estimate them using the remaining data to maximize the information retained in the dataset. However, if most of a column’s values are missing, it becomes much more difficult to reliably estimate or impute those values.

# Calculate percentage of missing values per column
missing_percent = gdp_missing_values_data.isna().mean() * 100

# Display results sorted by percentage (optional)
missing_percent.sort_values(ascending=False)
contraception             45.806452
economicActivityFemale     6.451613
lifeMale                   6.451613
infantMortality            6.451613
gdpPerCapita               6.451613
economicActivityMale       6.451613
illiteracyMale             6.451613
illiteracyFemale           6.451613
lifeFemale                 6.451613
country                    0.000000
geographic_location        0.000000
continent                  0.000000
dtype: float64

In this dataset, about 46% of the values in the contraception column are missing. Therefore, we’ll drop this column, since imputing it would be unreliable given the limited number of non-missing observations.

#Deleting column with missing values in almost half of the observations
gdp_missing_values_data.drop(['contraception'],axis=1,inplace=True)
gdp_missing_values_data.head()
economicActivityFemale country lifeMale infantMortality gdpPerCapita economicActivityMale illiteracyMale illiteracyFemale lifeFemale geographic_location continent
0 7.2 Afghanistan 45.0 154.0 2474.0 87.5 NaN 85.0 46.0 Southern Asia Asia
1 7.8 Algeria 67.5 44.0 11433.0 76.4 26.1 51.0 70.3 Northern Africa Africa
2 41.3 Argentina 69.6 22.0 NaN 76.2 3.8 3.8 76.8 South America South America
3 52.0 Armenia 67.2 25.0 13638.0 65.0 NaN 0.5 74.0 Western Asia Asia
4 53.8 Australia NaN 6.0 54891.0 NaN 1.0 1.0 81.2 Oceania Oceania

For the remaining columns, only about 6% of the values are missing. In this case, instead of dropping rows or columns, it’s often better to impute (i.e., fill in) the missing values using the available data.

Let’s explore several methods for imputing missing values and how they can help us preserve as much information as possible from the dataset.

13.2.5 Imputing Missing Values

There are many ways to impute missing values, ranging from simple replacements to complex model-based techniques. Some common approaches are described in the Pandas documentation.

The best imputation strategy depends on the mechanism of missingness:

  • MCAR (Missing Completely at Random) — Simple methods (e.g., mean or median imputation) are generally acceptable.
  • MAR (Missing at Random) — Imputation should account for relationships among variables, such as using conditional means or regression-based methods.
  • MNAR (Missing Not at Random) — Imputation is challenging and often requires domain expertise to avoid introducing bias.

13.2.5.1 Evaluation Approach

We will apply several imputation techniques to gdp_missing_values_data and evaluate their effectiveness by:

  1. Comparing imputed gdpPerCapita values against actual values from gdp_complete_data
  2. Computing the Root Mean Square Error (RMSE) for each method (defined right after this list)
  3. Visualizing the relationship between imputed and actual values
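For reference, if \(y_i\) denotes the actual GDP per capita and \(\hat{y}_i\) the imputed value for the \(i^{th}\) country whose entry was artificially removed, the RMSE over the \(n\) such countries is \(\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}\); lower values indicate more accurate imputation.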

First, let’s define a helper function to visualize and evaluate our imputation results:

# Index of rows with missing values for GDP per capita
null_ind_gdpPC = gdp_missing_values_data.index[gdp_missing_values_data.gdpPerCapita.isnull()]

# Define a function to plot and evaluate imputed values vs actual values 
def plot_actual_vs_predicted(y, title_suffix=""):
    """
    Plot imputed vs actual GDP per capita values and display RMSE.
    
    Parameters:
    - y: DataFrame with imputed gdpPerCapita values
    - title_suffix: Optional string to add to the plot (e.g., method name)
    """
    fig, ax = plt.subplots(figsize=(8, 6))
    
    # Extract actual and imputed values
    x = gdp_complete_data.loc[null_ind_gdpPC, 'gdpPerCapita']
    y_imputed = y.loc[null_ind_gdpPC, 'gdpPerCapita']
    
    # Create scatter plot
    ax.scatter(x, y_imputed, alpha=0.6, s=50)
    
    # Add perfect prediction line (45-degree line)
    ax.plot(x, x, color='orange', linewidth=2, label='Perfect imputation')
    
    # Optional title, labels, and formatting
    if title_suffix:
        ax.set_title(f'Imputed vs actual GDP per capita ({title_suffix})', fontsize=15)
    ax.set_xlabel('Actual GDP per capita', fontsize=14)
    ax.set_ylabel('Imputed GDP per capita', fontsize=14)
    ax.xaxis.set_major_formatter('${x:,.0f}')
    ax.yaxis.set_major_formatter('${x:,.0f}')
    ax.tick_params(labelsize=12)
    ax.grid(True, alpha=0.3)
    ax.legend(fontsize=11)
    
    # Calculate and display RMSE
    rmse = np.sqrt(np.mean((y_imputed - x)**2))
    
    # Position text dynamically based on data range
    x_pos = x.min() + (x.max() - x.min()) * 0.05
    y_pos = y_imputed.max() - (y_imputed.max() - y_imputed.min()) * 0.1
    ax.text(x_pos, y_pos, f'RMSE = ${rmse:,.2f}', 
            fontsize=13, bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))
    
    plt.tight_layout()
    plt.show()
    
    return rmse

13.2.5.2 Method 1: Mean/Median Imputation

Simple imputation replaces missing values with a statistical measure of central tendency from the non-missing values in the same column.

  • Mean imputation — Best for normally distributed data without outliers (MCAR scenarios)
  • Median imputation — Better for skewed data or when outliers are present
  • Mode imputation — Appropriate for categorical variables

Advantages:

  • Simple and fast to implement
  • Works well for MCAR data

Disadvantages:

  • Reduces variance in the data
  • Ignores relationships between variables
  • May introduce bias for MAR or MNAR data

Let’s impute missing gdpPerCapita values using the mean of non-missing values:

# Extracting the columns with missing values
columns_with_missing = [col for col in gdp_missing_values_data.columns if gdp_missing_values_data[col].isnull().any()]
columns_with_missing
['economicActivityFemale',
 'lifeMale',
 'infantMortality',
 'gdpPerCapita',
 'economicActivityMale',
 'illiteracyMale',
 'illiteracyFemale',
 'lifeFemale']
# Imputing missing values using the mean
gdp_imputed_data_mean = gdp_missing_values_data[columns_with_missing].fillna(gdp_missing_values_data[columns_with_missing].mean())
plot_actual_vs_predicted(gdp_imputed_data_mean)

30793.549983587087

Observation: The RMSE of roughly $30,800 serves as a baseline against which we can compare the more sophisticated imputation methods below.

Note on data types:

  • Numerical data: Use mean or median
  • Categorical data: Use mode (the most frequently occurring value)

Since all columns with missing values in this dataset are numerical, mean imputation is appropriate. For categorical columns, you would use fillna(df['column'].mode()[0]).
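Here is a minimal sketch of mode imputation on a hypothetical categorical column, relying on the pandas and numpy imports from earlier:

# Toy categorical column with missing values
toy_cat = pd.DataFrame({'continent': ['Asia', 'Africa', np.nan, 'Asia', np.nan, 'Europe']})

# mode() can return several values in case of a tie; [0] picks the first
most_common = toy_cat['continent'].mode()[0]
toy_cat['continent'] = toy_cat['continent'].fillna(most_common)
toy_cat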

13.2.5.3 Method 2: Conditional Imputation

Conditional imputation uses relationships between variables to predict missing values more accurately than simple mean/median imputation. This approach is better suited for MAR (Missing at Random) data.

We’ll explore three conditional imputation techniques:

13.2.5.3.1 A. Group-based Imputation (Conditional Mean)

Idea: Impute missing values using the mean of a subgroup rather than the overall mean.

If categorical variables show distinct patterns (e.g., GDP varies significantly by continent), we can impute missing values using the mean of the corresponding group. This is more accurate than using the global mean because it accounts for systematic differences between groups.

Let us visualize the distribution of GDP per capita for different continents.

plt.rcParams["figure.figsize"] = (12,6)
sns.boxplot(x = 'continent',y='gdpPerCapita',data = gdp_missing_values_data);

We observe that there is a distinct difference between the GDPs per capita of some of the continents. Let us impute the missing GDP per capita of a country as the mean GDP per capita of the corresponding continent. This imputation should be better than imputing the missing GDP per capita as the mean of all the non-missing values, as the GDP per capita of a country is likely to be closer to the mean GDP per capita of its continent than to the mean GDP per capita of the whole world.

#Finding the mean GDP per capita of the continent - please defer the understanding of this code to chapter 9.
avg_gdpPerCapita = gdp_missing_values_data['gdpPerCapita'].groupby(gdp_missing_values_data['continent']).mean()
avg_gdpPerCapita
continent
Africa            7638.178571
Asia             25922.750000
Europe           45455.303030
North America    19625.210526
Oceania          15385.857143
South America    15360.909091
Name: gdpPerCapita, dtype: float64
#Creating a copy of missing data to impute missing values
gdp_imputed_data_group_mean = gdp_missing_values_data.copy()
#Replacing missing GDP per capita with the mean GDP per capita for the corresponding continent

gdp_imputed_data_group_mean.gdpPerCapita = \
gdp_imputed_data_group_mean['gdpPerCapita'].fillna(gdp_imputed_data_group_mean['continent'].map(avg_gdpPerCapita))
plot_actual_vs_predicted(gdp_imputed_data_group_mean)

25473.20645170116

Note that the imputed values are closer to the actual values, and the RMSE is lower than with simple mean imputation, as expected.

Suppose we wish to impute the missing values of each numeric column with the average of the non-missing values of the respective column corresponding to the continent of the observation. The above logic can be extended to each column as shown in the code below.

all_columns_imputed_data = gdp_missing_values_data.iloc[:,[0,2,3,4,5,6,7,8]].apply(lambda x:\
x.fillna(gdp_imputed_data_group_mean['continent'].map(x.groupby(gdp_missing_values_data['continent']).mean())))
13.2.5.3.2 B. Regression-based Imputation

Idea: Use linear regression to predict missing values based on their relationship with other variables.

If a variable is highly correlated with another variable in the dataset, we can build a regression model to predict missing values. This method leverages the statistical relationship between variables.

#Let us identify the variable highly correlated with GDP per capita.
gdp_missing_values_data.select_dtypes(include='number').corrwith(gdp_missing_values_data.gdpPerCapita)
economicActivityFemale    0.078332
lifeMale                  0.579850
infantMortality          -0.572201
gdpPerCapita              1.000000
economicActivityMale     -0.134108
illiteracyMale           -0.479143
illiteracyFemale         -0.448273
lifeFemale                0.615954
dtype: float64
#The variable *lifeFemale* has the strongest correlation with GDP per capita. Let us use it to impute missing values of GDP per capita.

#  Extract the variables lifeFemale and GDP per capita
x = gdp_missing_values_data.lifeFemale
y = gdp_missing_values_data.gdpPerCapita

#  Identify non-missing indices
idx_non_missing = np.isfinite(x) & np.isfinite(y)

# Fit a linear regression model (degree=1) to predict GDP per capita given lifeFemale
slope_intercept_trendline = np.polyfit(x[idx_non_missing],y[idx_non_missing],1)   #Finding the slope and intercept for the trendline
compute_y_given_x = np.poly1d(slope_intercept_trendline)

#Creating a copy of missing data to impute missing values
gdp_imputed_data_lr = gdp_missing_values_data.copy()

#Imputing missing values of GDP per capita using the linear regression model
gdp_imputed_data_lr.loc[null_ind_gdpPC,'gdpPerCapita']=compute_y_given_x(gdp_missing_values_data.loc[null_ind_gdpPC,'lifeFemale'])
plot_actual_vs_predicted(gdp_imputed_data_lr)

25570.361516956993
13.2.5.3.3 C. K-Nearest Neighbors (KNN) Imputation

Idea: Impute missing values using the mean of K most similar observations (nearest neighbors).

KNN finds observations that are similar based on Euclidean distance calculated from other variables. This method is more sophisticated and often produces better results than simple mean imputation because it considers the overall similarity between observations.

In this method, we’ll impute the missing value of a variable as the mean value of the \(K\) nearest neighbors having non-missing values for that variable. The neighbors of a data point are identified based on their Euclidean distance to the point, computed from the standardized values of the rest of the variables in the data.

Let’s consider a toy example to understand missing value imputation by KNN. Suppose we have to impute missing values in a toy dataset, named as toy_data having 4 observations and 3 variables.

#Toy example - A 4x3 array with missing values
nan = np.nan
toy_data = np.array([[1, 2, nan], [3, 4, 3], [nan, 6, 5], [8, 8, 7]])
toy_data
array([[ 1.,  2., nan],
       [ 3.,  4.,  3.],
       [nan,  6.,  5.],
       [ 8.,  8.,  7.]])

We’ll use some functions from the sklearn library to perform the KNN imputation. It is much easier to directly use the algorithm from sklearn, instead of coding it from scratch.

#Library to compute pair-wise Euclidean distance between all observations in the data
from sklearn import metrics

#Library to impute missing values with the KNN algorithm
from sklearn import impute

We’ll use the sklearn function nan_euclidean_distances() to compute the Euclidean distance between all pairs of observations in the data.

#This is the distance matrix containing the distance of the ith observation from the jth observation at the (i,j) position in the matrix
metrics.pairwise.nan_euclidean_distances(toy_data,toy_data)
array([[ 0.        ,  3.46410162,  6.92820323, 11.29158979],
       [ 3.46410162,  0.        ,  3.46410162,  7.54983444],
       [ 6.92820323,  3.46410162,  0.        ,  3.46410162],
       [11.29158979,  7.54983444,  3.46410162,  0.        ]])

Note that the size of the above matrix is 4x4. This is because the \((i,j)^{th}\) element of the matrix is the distance of the \(i^{th}\) observation from the \(j^{th}\) observation. The matrix is symmetric because the distance of \(i^{th}\) observation to the \(j^{th}\) observation is the same as the distance of the \(j^{th}\) observation to the \(i^{th}\) observation.

We’ll use the sklearn function KNNImputer() to impute the missing value of a column in toy_data as the mean of the values of the \(K\) nearest neighbors to the observation that have non-missing values for that column.

Let us impute the missing values in toy_data using the values of \(K=2\) nearest neighbors from the corresponding observation.

#imputing missing values with 2 nearest neighbors, where the neighbors have equal weights

#Define an object of type KNNImputer
imputer = impute.KNNImputer(n_neighbors=2)

#Use the object method 'fit_transform' to impute missing values
imputer.fit_transform(toy_data)
array([[1. , 2. , 4. ],
       [3. , 4. , 3. ],
       [5.5, 6. , 5. ],
       [8. , 8. , 7. ]])

The third observation is the closest to the \(2nd\) and \(4th\) observations based on the Euclidean distance matrix. Thus, the missing value in the \(3rd\) row of toy_data has been imputed as the mean of the values in the \(2nd\) and \(4th\) observations for the corresponding column: \((3 + 8)/2 = 5.5\). Similarly, the \(1st\) observation is the closest to the \(2nd\) and \(3rd\) observations, so the missing value in the \(1st\) row of toy_data has been imputed as the mean of the values in the \(2nd\) and \(3rd\) observations for the corresponding column: \((3 + 5)/2 = 4\).

Let us use KNN to impute the missing values of gdpPerCapita in gdp_missing_values_data. We’ll use only the numeric columns of the data in imputing the missing values. (Recall that contraception, which had a large share of missing values, was already dropped earlier in the chapter.)

#Considering numeric columns in the data to use KNN
num_cols = list(range(0,1))+list(range(2,9))
num_cols
[0, 2, 3, 4, 5, 6, 7, 8]

Before computing the pair-wise Euclidean distance of observations, we must standardize the data so that all columns are at the same scale. This will avoid columns with a higher magnitude of values having a higher weight in determining the Euclidean distance. Unless there is a reason to give a higher weight to a column, we assume all columns to have the same weight in the Euclidean distance computation.

We can use the code below to scale the data. However, after imputing the missing values, the data must be scaled back to the original scale so that each variable is in the same units as in the original dataset; if the code below is used, we lose the information needed to undo the scaling.

#Scaling data to compute equally weighted distances from the 'k' nearest neighbors
scaled_data = gdp_missing_values_data.iloc[:,num_cols].apply(lambda x:(x-x.min())/(x.max()-x.min()))

To avoid losing the original scale of the data, we’ll use the MinMaxScaler object from the sklearn library. The object stores the original scale of the data, which lets us transform the data back to the original scale once the missing values have been imputed in the standardized data.

# Scaling data - using sklearn

#Importing the preprocessing submodule explicitly so that it is available under the 'sk' alias
import sklearn.preprocessing

#Create an object of type MinMaxScaler
scaler = sk.preprocessing.MinMaxScaler()

#Use the object method 'fit_transform' to scale the values to a standard uniform distribution
scaled_data = pd.DataFrame(scaler.fit_transform(gdp_missing_values_data.iloc[:,num_cols]))
#Imputing missing values with KNNImputer

#Define an object of type KNNImputer
imputer = impute.KNNImputer(n_neighbors=3, weights="uniform")

#Use the object method 'fit_transform' to impute missing values
imputed_arr = imputer.fit_transform(scaled_data)
#Scaling back the scaled array to obtain the data at the original scale

#Use the object method 'inverse_transform' to scale back the values to the original scale of the data
unscaled_data = scaler.inverse_transform(imputed_arr)
#Note the method imputes the missing value of all the columns
#However, we are interested in imputing the missing values of only the 'gdpPerCapita' column
gdp_imputed_data_knn = gdp_missing_values_data.copy()
gdp_imputed_data_knn.loc[:,'gdpPerCapita'] = unscaled_data[:,3]
#Visualizing the accuracy of missing value imputation with KNN
plot_actual_vs_predicted(gdp_imputed_data_knn)

16804.195967740387

Note that the RMSE is the lowest with this method. This is because it imputes each missing value as the average of the values of similar observations (similar across all the numeric variables), which is more informative and more robust than the previous methods.

We chose \(K=3\) in the missing value imputation for GDP per capita. However, the value of \(K\) is typically chosen using a method known as cross validation. We’ll learn about cross-validation in the next course of the sequence.
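As a rough preview (not actual cross-validation), we can compare a few values of \(K\) here only because we happen to know the true values from gdp_complete_data; a minimal sketch:

# Comparing a few values of K against the known true values (possible only because
# we have gdp_complete_data; real applications would use cross-validation instead)
for k in [1, 3, 5, 10]:
    imputer_k = impute.KNNImputer(n_neighbors=k, weights="uniform")
    unscaled_k = scaler.inverse_transform(imputer_k.fit_transform(scaled_data))
    imputed_gdp = pd.Series(unscaled_k[:, 3], index=gdp_missing_values_data.index)
    rmse_k = np.sqrt(np.mean((imputed_gdp[null_ind_gdpPC] -
                              gdp_complete_data.loc[null_ind_gdpPC, 'gdpPerCapita'])**2))
    print(f"K = {k:2d}: RMSE = ${rmse_k:,.2f}")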

13.2.5.4 Method 3: Forward Fill / Backward Fill

Idea: Fill missing values by propagating the previous (forward fill) or next (backward fill) non-missing value.

When to use:

  • Time series data where values change gradually
  • Data sorted in a meaningful order
  • Assumes temporal or sequential dependency

Limitations:

  • Inappropriate for cross-sectional data (like our GDP dataset)
  • Assumes the missing value should be similar to adjacent observations
  • Can introduce bias if data order is arbitrary

Let’s demonstrate forward fill on our dataset (though it’s not ideal for this use case):

#Filling missing values by propagating the previous non-missing value (forward fill)
gdp_imputed_data_ffill = gdp_missing_values_data.ffill()

Let us next check how good this method is at imputing missing values.

plot_actual_vs_predicted(gdp_imputed_data_ffill)

34843.91091137732

We observe that the accuracy of imputation is poor as GDP per capita can vary a lot across countries, and the data is not sorted by GDP per capita. There is no reason why the GDP per capita of a country should be close to the GDP per capita of the country in the observation above it.

#Checking if any missing values are remaining
gdp_imputed_data_ffill.isnull().sum()
economicActivityFemale    0
country                   0
lifeMale                  0
infantMortality           0
gdpPerCapita              0
economicActivityMale      0
illiteracyMale            1
illiteracyFemale          0
lifeFemale                0
geographic_location       0
continent                 0
dtype: int64

After imputing missing values, note there is still one missing value for illiteracyMale. Can you guess why one missing value remained?
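A common pattern, sketched below, is to chain a backward fill after the forward fill, so that a missing value with no earlier value to propagate from is filled from the row below it instead:

#Chaining a backward fill after the forward fill
gdp_imputed_data_fbfill = gdp_missing_values_data.ffill().bfill()

#Checking that no missing values remain
gdp_imputed_data_fbfill.isnull().sum().sum()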

13.3 Dealing with Duplicate Records

Duplicate records can occur in datasets for various reasons:

  • Data entry errors (same record entered multiple times)
  • Merging datasets from different sources
  • System errors during data collection
  • Intentional duplicates that need investigation

Fortunately, detecting and removing duplicates is relatively straightforward in pandas. Let’s explore the methods for identifying and handling duplicate records.

13.3.1 Creating a Sample Dataset with Duplicates

First, let’s manually create a dataset that contains duplicate records to demonstrate the various duplicate detection and removal techniques.

# Create a sample dataframe with duplicate records
student_data = pd.DataFrame({
    'StudentID': [101, 102, 103, 101, 104, 102, 105, 103],
    'Name': ['Alice', 'Bob', 'Charlie', 'Alice', 'David', 'Bob', 'Eve', 'Charlie'],
    'Major': ['CS', 'Math', 'CS', 'CS', 'Physics', 'Math', 'Biology', 'CS'],
    'GPA': [3.8, 3.5, 3.9, 3.8, 3.7, 3.5, 3.6, 3.9]
})

print("Original dataset with duplicates:")
student_data
Original dataset with duplicates:
StudentID Name Major GPA
0 101 Alice CS 3.8
1 102 Bob Math 3.5
2 103 Charlie CS 3.9
3 101 Alice CS 3.8
4 104 David Physics 3.7
5 102 Bob Math 3.5
6 105 Eve Biology 3.6
7 103 Charlie CS 3.9

13.3.2 Identifying Duplicate Records

There are several ways to identify duplicates in a DataFrame:

13.3.2.1 Check How Many Duplicate Rows Exist

The duplicated() method returns a boolean Series indicating whether each row is a duplicate. We can sum this to count the total number of duplicate rows.

# Count the number of duplicate rows
num_duplicates = student_data.duplicated().sum()
print(f"Number of duplicate rows: {num_duplicates}")

# Check which rows are duplicates (returns boolean Series)
print("\nDuplicate flags for each row:")
student_data.duplicated()
Number of duplicate rows: 3

Duplicate flags for each row:
0    False
1    False
2    False
3     True
4    False
5     True
6    False
7     True
dtype: bool

Note: By default, duplicated() marks all duplicate rows as True except the first occurrence. So if a row appears 3 times, the first occurrence is marked False and the next 2 are marked True.

13.3.2.2 Display the Duplicated Rows

To see the actual duplicate records, we can filter the DataFrame using the boolean mask returned by duplicated().

# Display the duplicate rows (excluding the first occurrence)
print("Duplicate rows:")
student_data[student_data.duplicated()]
Duplicate rows:
StudentID Name Major GPA
3 101 Alice CS 3.8
5 102 Bob Math 3.5
7 103 Charlie CS 3.9

To see all occurrences of duplicate rows (including the first occurrence), use keep=False:

# Display ALL occurrences of duplicate rows (including first occurrence)
print("All occurrences of duplicate rows:")
student_data[student_data.duplicated(keep=False)]
All occurrences of duplicate rows:
StudentID Name Major GPA
0 101 Alice CS 3.8
1 102 Bob Math 3.5
2 103 Charlie CS 3.9
3 101 Alice CS 3.8
5 102 Bob Math 3.5
7 103 Charlie CS 3.9

13.3.2.3 Check Duplicates Based on Specific Columns

Sometimes you want to identify duplicates based on only certain columns, not all columns. For example, you might consider two students duplicates if they have the same StudentID, even if other details differ.

# Check duplicates based on StudentID only
print("Duplicates based on StudentID:")
print(f"Number of duplicate StudentIDs: {student_data.duplicated(subset=['StudentID']).sum()}")

print("\nRows with duplicate StudentID:")
student_data[student_data.duplicated(subset=['StudentID'], keep=False)]
Duplicates based on StudentID:
Number of duplicate StudentIDs: 3

Rows with duplicate StudentID:
StudentID Name Major GPA
0 101 Alice CS 3.8
1 102 Bob Math 3.5
2 103 Charlie CS 3.9
3 101 Alice CS 3.8
5 102 Bob Math 3.5
7 103 Charlie CS 3.9
# Check duplicates based on multiple columns (Name and Major)
print("Duplicates based on Name and Major:")
print(f"Number of duplicate Name-Major combinations: {student_data.duplicated(subset=['Name', 'Major']).sum()}")

print("\nRows with duplicate Name-Major combinations:")
student_data[student_data.duplicated(subset=['Name', 'Major'], keep=False)]
Duplicates based on Name and Major:
Number of duplicate Name-Major combinations: 3

Rows with duplicate Name-Major combinations:
StudentID Name Major GPA
0 101 Alice CS 3.8
1 102 Bob Math 3.5
2 103 Charlie CS 3.9
3 101 Alice CS 3.8
5 102 Bob Math 3.5
7 103 Charlie CS 3.9

13.3.3 Removing Duplicate Records

Once duplicates are identified, we can remove them using the drop_duplicates() method.

13.3.3.1 Remove All Duplicated Rows

By default, drop_duplicates() keeps the first occurrence and removes subsequent duplicates.

# Remove all duplicate rows (keeps first occurrence)
student_data_cleaned = student_data.drop_duplicates()

print("Dataset after removing duplicates:")
print(student_data_cleaned)

print(f"\nOriginal dataset size: {len(student_data)} rows")
print(f"Cleaned dataset size: {len(student_data_cleaned)} rows")
print(f"Removed: {len(student_data) - len(student_data_cleaned)} duplicate rows")

13.3.3.2 Remove Duplicates Based on Specific Columns

You can specify which columns to consider when identifying duplicates using the subset parameter.

# Remove duplicates based on StudentID only
student_data_unique_id = student_data.drop_duplicates(subset=['StudentID'])

print("Dataset after removing duplicates based on StudentID:")
print(student_data_unique_id)

print(f"\nOriginal dataset: {len(student_data)} rows")
print(f"After removing duplicate StudentIDs: {len(student_data_unique_id)} rows")
# Remove duplicates based on Name and Major
student_data_unique_name_major = student_data.drop_duplicates(subset=['Name', 'Major'])

print("Dataset after removing duplicates based on Name and Major:")
print(student_data_unique_name_major)

print(f"\nOriginal dataset: {len(student_data)} rows")
print(f"After removing duplicate Name-Major combinations: {len(student_data_unique_name_major)} rows")

13.3.3.3 Keep First or Last Occurrence

The keep parameter controls which occurrence to keep:

  • keep='first' (default): Keep the first occurrence, remove subsequent duplicates
  • keep='last': Keep the last occurrence, remove earlier duplicates
  • keep=False: Remove all duplicates (including all occurrences)

# Keep the first occurrence (default behavior)
keep_first = student_data.drop_duplicates(keep='first')
print("Keep first occurrence:")
print(keep_first)
print()
# Keep the last occurrence
keep_last = student_data.drop_duplicates(keep='last')
print("Keep last occurrence:")
print(keep_last)
print()
# Remove ALL occurrences of duplicates (keep none)
keep_none = student_data.drop_duplicates(keep=False)
print("Remove all occurrences of duplicates:")
print(keep_none)

print(f"\nOriginal dataset: {len(student_data)} rows")
print(f"After removing ALL duplicate occurrences: {len(keep_none)} rows")
print("Note: Only rows that appear exactly once are kept!")

13.3.4 Applying to Real Data

Now let’s check for duplicates in our GDP dataset:

# Check for duplicate records in GDP dataset
num_duplicates_gdp = gdp_imputed_data_ffill.duplicated().sum()
print(f"Number of duplicate rows in GDP dataset: {num_duplicates_gdp}")

if num_duplicates_gdp > 0:
    print("\nDuplicate rows:")
    print(gdp_imputed_data_ffill[gdp_imputed_data_ffill.duplicated(keep=False)])
else:
    print("\nNo duplicate records found in the GDP dataset!")
Number of duplicate rows in GDP dataset: 0

No duplicate records found in the GDP dataset!

If the dataset had duplicates, we could remove them using:

# Remove duplicate records (if they existed)
gdp_imputed_data_ffill = gdp_imputed_data_ffill.drop_duplicates()

Summary:

  • Use duplicated() to identify duplicate rows
  • Use drop_duplicates() to remove them
  • The subset parameter lets you specify which columns to check
  • The keep parameter controls which occurrence to keep (first, last, or False to remove all)

13.4 Outlier detection and handling

An outlier is an observation that is significantly different from the rest of the data. Outlier detection and handling are important in data science because outliers can significantly impact the quality, accuracy, and interpretability of data analysis and model performance. Here are the main reasons why it’s crucial to address outliers:

  • Preventing Skewed Results: Outliers can distort statistical measures, such as the mean and standard deviation, leading to misleading interpretations of the data. For example, a few extreme values can inflate the mean, making typical values appear larger than they really are (a small numeric illustration follows this list).
  • Improving Model Accuracy: Many machine learning algorithms are sensitive to outliers and can perform poorly if outliers are present. For instance, in linear regression, outliers can disproportionately affect the model by pulling the line of best fit toward them, reducing predictive accuracy.
  • Enhancing Robustness of Models: Identifying and handling outliers can make models more robust and stable. By minimizing the influence of extreme values, models become less sensitive to noise, which improves generalization and reduces overfitting.
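For instance, using the billion-dollar salary from the table at the start of the chapter alongside a few ordinary salaries (the other values are made up for illustration):

# A single extreme value pulls the mean far more than the median
salaries = pd.Series([45_000, 52_000, 58_000, 61_000, 1_000_000_000])
print(f"Mean:   ${salaries.mean():,.0f}")
print(f"Median: ${salaries.median():,.0f}")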

13.4.1 Outlier detection

Outlier detection is crucial for identifying unusual values in your data that may need to be investigated, removed, or transformed. Here are several popular methods for detecting outliers:

In this section, we will use College.csv, which contains information about US universities, as our dataset. The description of the variables in the dataset can be found on page 65 of this book.

# Load the College dataset
college = pd.read_csv('./Datasets/College.csv')
college.head()
Unnamed: 0 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate Room.Board Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate
0 Abilene Christian University Yes 1660 1232 721 23 52 2885 537 7440 3300 450 2200 70 78 18.1 12 7041 60
1 Adelphi University Yes 2186 1924 512 16 29 2683 1227 12280 6450 750 1500 29 30 12.2 16 10527 56
2 Adrian College Yes 1428 1097 336 22 50 1036 99 11250 3750 400 1165 53 66 12.9 30 8735 54
3 Agnes Scott College Yes 417 349 137 60 89 510 63 12960 5450 450 875 92 97 7.7 37 19016 59
4 Alaska Pacific University Yes 193 146 55 16 44 249 869 7560 4120 800 1500 76 72 11.9 2 10922 15

13.4.1.1 Visualization Methods

  • Box Plot: A box plot is a quick and visual way to detect outliers, where points outside the whiskers are considered outliers.

Let us visualize outliers in average instructional expenditure per student given by the variable Expend.

ax=college.boxplot(column = 'Expend')
ax.yaxis.set_major_formatter('${x:,.0f}')

There are several outliers (shown as circles in the above boxplot), which correspond to high values of average instructional expenditure per student.

  • Histogram: Histograms can reveal unusual values as isolated bars, helping to identify outliers in a single feature.
# create a histogram of expend
college['Expend'].plot(kind='hist', edgecolor='black', bins=20)
plt.xlabel('Expenditure')
plt.ylabel('Frequency')
plt.title('Histogram of Expenditure')
plt.show()

  • Scatter Plot: A scatter plot can reveal outliers, especially in two-dimensional data. Outliers will often appear as points separated from the main cluster.

Let’s make a scatterplot of Grad.Rate vs Expend with a trendline to visualize potential outliers:

sns.set(font_scale=1.5)
ax=sns.regplot(data = college, x = "Expend", y = "Grad.Rate",scatter_kws={"color": "orange"}, line_kws={"color": "blue"})
ax.xaxis.set_major_formatter('${x:,.0f}')
ax.set_xlabel('Expenditure per student')
ax.set_ylabel('Graduation rate')
plt.show()

The trendline indicates a positive correlation between Expend and Grad.Rate. However, there seems to be a lot of noise and presence of outliers in the data.

13.4.1.2 Statistical Methods

Visualization helps us identify potential outliers. To address them effectively, we need to extract these outlier instances using a defined threshold. Two commonly used statistical methods for this purpose are the Z-Score method and the Interquartile Range (IQR) (Tukey’s fences).

13.4.1.2.1 Z-Score (Standard Score)

Calculates how many standard deviations a data point is from the mean. Commonly, data points with a Z-score greater than 3 (or less than -3) are considered outliers.

from scipy import stats

# Calculate Z-scores and identify outliers
college['z_score'] = stats.zscore(college['Expend'])
college['is_z_score_outlier'] = college['z_score'].abs() > 3

# Filter to show only the outliers
z_score_outliers = college[college['is_z_score_outlier']]
print(z_score_outliers.shape)
z_score_outliers.head()
(16, 21)
Unnamed: 0 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate ... Books Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate z_score is_z_score_outlier
20 Antioch University Yes 713 661 252 25 44 712 23 15476 ... 400 1100 69 82 11.3 35 42926 48 6.374709 True
144 Columbia University Yes 6756 1930 871 78 96 3376 55 18624 ... 550 300 97 98 5.9 21 30639 99 4.020159 True
158 Dartmouth College Yes 8587 2273 1087 87 99 3918 32 19545 ... 550 1100 95 99 4.7 49 29619 98 3.824698 True
174 Duke University Yes 13789 3893 1583 90 98 6188 53 18590 ... 625 1162 95 96 5.0 44 27206 97 3.362296 True
191 Emory University Yes 8506 4168 1236 76 97 5544 192 17600 ... 600 870 97 98 5.0 28 28457 96 3.602024 True

5 rows × 21 columns

The boxplot shown earlier flags outliers using Tukey’s fences criterion, described next:

13.4.1.2.2 IQR method (Tukey’s fences)

John Tukey proposed that observations outside the range \([Q1 - 1.5(Q3-Q1), Q3+1.5(Q3-Q1)]\) are outliers, where \(Q1\) and \(Q3\) are the lower \((25\%)\) and upper \((75\%)\) quartiles respectively. Let us detect outliers based on this threshold.

# Calculate the first and third quartiles, and the IQR
Q1 = np.percentile(college['Expend'], 25)
Q3 = np.percentile(college['Expend'], 75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Add a new column to indicate IQR outliers
college['is_IQR_outlier'] = (college['Expend'] < lower_bound) | (college['Expend'] > upper_bound)

# Filter to show only the IQR outliers
iqr_outliers = college[college['is_IQR_outlier']]
print(iqr_outliers.shape)
iqr_outliers.head()
(48, 22)
Unnamed: 0 Private Apps Accept Enroll Top10perc Top25perc F.Undergrad P.Undergrad Outstate ... Personal PhD Terminal S.F.Ratio perc.alumni Expend Grad.Rate z_score is_z_score_outlier is_IQR_outlier
3 Agnes Scott College Yes 417 349 137 60 89 510 63 12960 ... 875 92 97 7.7 37 19016 59 1.792851 False True
16 Amherst College Yes 4302 992 418 83 96 1593 5 19760 ... 1598 93 98 8.4 63 21424 100 2.254295 False True
20 Antioch University Yes 713 661 252 25 44 712 23 15476 ... 1100 69 82 11.3 35 42926 48 6.374709 True True
60 Bowdoin College Yes 3356 1019 418 76 100 1490 8 19030 ... 875 93 96 11.2 52 20447 96 2.067073 False True
64 Brandeis University Yes 4186 2743 740 48 77 2819 62 19380 ... 1000 90 97 9.8 24 17150 84 1.435271 False True

5 rows × 22 columns

Using the IQR method, more instances in the college dataset are marked as outliers.

13.4.1.2.3 When to use each:

  • Use IQR when:
    • Data is skewed
    • You don’t want to assume any distribution
    • You want a more sensitive detector
  • Use Z-score when:
    • Data is approximately normal
    • You want a more conservative approach
    • You need a method that’s easily interpretable
The actual number of outliers marked by each method will depend on your specific dataset’s distribution and characteristics.
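One quick way to see how the two criteria overlap on this dataset, using the flag columns created above:

# Cross-tabulate the two outlier flags
pd.crosstab(college['is_z_score_outlier'], college['is_IQR_outlier'])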

13.4.2 Common Methods for Handling outliers

Once we identify outlier instances using statistical methods, the next step is to handle them. Below are some commonly used approaches:

  • Removing Outliers: Discarding extreme values if they are likely due to errors or are irrelevant for the analysis.
  • Winsorizing: Capping/flooring outliers to a specific percentile to reduce their influence without removing them completely (a percentile-based sketch follows the code below).
  • Replacing with the median: Replacing outlier values with the median of the remaining data. The median is less affected by extreme values than the mean, making it an ideal choice for imputation when dealing with outliers
  • Transforming Data: Applying transformations (e.g., log, square root) to reduce the impact of outliers.
The functions below sketch each approach for the Expend column, using the is_IQR_outlier flag and the IQR bounds (lower_bound, upper_bound) computed earlier:

def method_removal():
    """Method 1: Remove outliers entirely"""
    data_cleaned = college.loc[~college['is_IQR_outlier'], 'Expend']
    return data_cleaned

def method_capping():
    """Method 2: Capping (Winsorization) at the IQR fences"""
    data_capped = college['Expend'].clip(lower=lower_bound, upper=upper_bound)
    return data_capped

def method_median_replacement():
    """Method 3: Replace outliers with the median of the non-outlier values"""
    data_median = college['Expend'].copy()
    median_value = college.loc[~college['is_IQR_outlier'], 'Expend'].median()
    data_median[college['is_IQR_outlier']] = median_value
    return data_median

def method_log_transformation():
    """Method 4: Log transformation (log(1 + x)) to compress large values"""
    data_log = np.log1p(college['Expend'])
    return data_log
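The method_capping() sketch above caps at the IQR fences; a percentile-based variant, using arbitrary 5th/95th percentile cutoffs for illustration, could look like this:

# Percentile-based winsorization: cap 'Expend' at its 5th and 95th percentiles
lower_p, upper_p = college['Expend'].quantile([0.05, 0.95])
expend_winsorized_pct = college['Expend'].clip(lower=lower_p, upper=upper_p)
expend_winsorized_pct.describe()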

Your Practice: Try each method and compare their results to see how they differ in handling outliers.

In real-world applications, handling outliers should be approached on a case-by-case basis. However, here are some general recommendations:

Recommendations for Different Scenarios:

  • Data Removal:

    • Use when: You have a large dataset and can afford to lose data
    • Pros: Simple, removes all outlier influence
    • Cons: Loss of data, potential loss of important information
  • Capping (Winsorization):

    • Use when: You want to preserve data count while limiting extreme values
    • Pros: Preserves data size, reduces impact of outliers
    • Cons: Artificial boundary creation, potential loss of genuine extreme events
  • Median Replacement:

    • Use when: The data is normally distributed, and outliers are verified as errors or anomalies not representing true values
    • Pros: Maintains data size, simple to implement
    • Cons: Reduces variance, may not represent the true distribution
  • Log Transformation:

    • Use when: Data has a right-skewed distribution, and you want to reduce the impact of large outliers.
    • Pros: Reduces skewness, minimizes the effect of extreme values, can make data more normal-like.
    • Cons: Only applicable to positive values, may not be effective for extremely high outliers.

13.5 Independent Study

13.5.1 Exercise 1: Missing Value Imputation Comparison

Using the GDP_missing_data.csv dataset:

  1. Compare the RMSE for imputing missing lifeFemale values using:

    • Mean imputation
    • Median imputation
    • Regression with the most correlated variable
    • KNN imputation (k=3)
  2. Which method performs best? Why do you think this is the case?

  3. How does the choice of imputation method affect the distribution of the lifeFemale variable? Create histograms to visualize the differences.

13.5.2 Exercise 2: Outlier Detection and Impact Analysis

Using the College.csv dataset:

  1. Identify outliers in the Grad.Rate column using both Z-score and IQR methods.

  2. How many outliers are identified by each method? Which method is more conservative?

  3. Apply three different outlier handling methods (removal, capping, median replacement) to the Grad.Rate column.

  4. For each method, calculate:

    • The mean graduation rate before and after handling outliers
    • The standard deviation before and after
    • Create box plots to visualize the differences
  5. Which outlier handling method would you recommend for this dataset and why? Consider the context of graduation rates when making your decision.

13.5.3 Exercise 3: Combined Data Cleaning Pipeline

Using the GDP_missing_data.csv dataset:

  1. Create a comprehensive data cleaning pipeline that:

    • Removes columns with more than 40% missing values
    • Identifies the type of missing data (MCAR, MAR, or MNAR) for each remaining column with missing values
    • Applies appropriate imputation methods based on the missing data type
    • Detects outliers in all numeric columns using the IQR method
    • Handles outliers using an appropriate strategy
  2. Document your decisions at each step with justifications.

  3. Compare key statistics (mean, median, standard deviation) before and after your cleaning pipeline.

  4. Create a summary report showing:

    • Number of missing values handled
    • Number of outliers detected and handled
    • RMSE for your imputation methods (where applicable)
    • Visual comparisons (scatter plots or box plots) showing the impact of your cleaning steps