import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import seaborn as sns
sns.set(font_scale=1.5)
%matplotlib inline
11 Data Cleaning and Preparation
11.1 Handling missing data
Missing values in a dataset can occur due to several reasons such as breakdown of measuring equipment, accidental removal of observations, lack of response by respondents, error on the part of the researcher, etc.
Let us read the dataset GDP_missing_data.csv, in which we have randomly removed some values, or put missing values in some of the columns.
We’ll also read GDP_complete_data.csv, in which we have not removed any values. We’ll use this data later to assess the accuracy of our guess or estimate of missing values in GDP_missing_data.csv.
= pd.read_csv('./Datasets/GDP_missing_data.csv')
gdp_missing_values_data = pd.read_csv('./Datasets/GDP_complete_data.csv') gdp_complete_data
gdp_missing_values_data.head()
economicActivityFemale | country | lifeMale | infantMortality | gdpPerCapita | economicActivityMale | illiteracyMale | illiteracyFemale | lifeFemale | geographic_location | contraception | continent | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 7.2 | Afghanistan | 45.0 | 154.0 | 2474.0 | 87.5 | NaN | 85.0 | 46.0 | Southern Asia | NaN | Asia |
1 | 7.8 | Algeria | 67.5 | 44.0 | 11433.0 | 76.4 | 26.1 | 51.0 | 70.3 | Northern Africa | NaN | Africa |
2 | 41.3 | Argentina | 69.6 | 22.0 | NaN | 76.2 | 3.8 | 3.8 | 76.8 | South America | NaN | South America |
3 | 52.0 | Armenia | 67.2 | 25.0 | 13638.0 | 65.0 | NaN | 0.5 | 74.0 | Western Asia | NaN | Asia |
4 | 53.8 | Australia | NaN | 6.0 | 54891.0 | NaN | 1.0 | 1.0 | 81.2 | Oceania | NaN | Oceania |
Observe that the gdp_missing_values_data dataset contains some missing values, shown as NaN (Not a Number).
11.1.1 Identifying missing values in a dataframe
There are multiple ways to identify missing values in a DataFrame:
11.1.1.1 describe() Method
Note that the descriptive statistics methods associated with Pandas objects ignore missing values by default. Consider the summary statistics of gdp_missing_values_data:
gdp_missing_values_data.describe()
economicActivityFemale | lifeMale | infantMortality | gdpPerCapita | economicActivityMale | illiteracyMale | illiteracyFemale | lifeFemale | contraception | |
---|---|---|---|---|---|---|---|---|---|
count | 145.000000 | 145.000000 | 145.000000 | 145.000000 | 145.000000 | 145.000000 | 145.000000 | 145.000000 | 84.000000 |
mean | 45.935172 | 65.491724 | 37.158621 | 24193.482759 | 76.563448 | 13.570028 | 21.448897 | 70.615862 | 51.773810 |
std | 16.875922 | 9.099256 | 34.465699 | 22748.764444 | 7.854730 | 16.497954 | 25.497045 | 9.923791 | 31.930026 |
min | 1.900000 | 36.000000 | 3.000000 | 772.000000 | 51.200000 | 0.000000 | 0.000000 | 39.100000 | 0.000000 |
25% | 35.500000 | 62.900000 | 10.000000 | 6837.000000 | 72.000000 | 1.000000 | 2.300000 | 67.500000 | 17.000000 |
50% | 47.600000 | 67.800000 | 24.000000 | 15184.000000 | 77.300000 | 6.600000 | 9.720000 | 73.900000 | 65.000000 |
75% | 55.900000 | 72.400000 | 54.000000 | 35957.000000 | 81.600000 | 19.500000 | 30.200000 | 78.100000 | 77.000000 |
max | 90.600000 | 77.400000 | 169.000000 | 122740.000000 | 93.000000 | 70.500000 | 90.800000 | 82.900000 | 79.000000 |
Observe that the count statistic reports the number of non-missing values in each column: the dataset has 155 rows (see the code below), which is more than the number of non-missing values of every variable in the table above. Similarly, the rest of the statistics, such as mean, std, etc., ignore the missing values.
#The dataset gdp_missing_values_data has 155 rows
gdp_missing_values_data.shape[0]
155
11.1.1.2 info() Method
The info() method shows the count of non-null entries in each column, helping you quickly identify columns with missing values.
gdp_missing_values_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 155 entries, 0 to 154
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 economicActivityFemale 145 non-null float64
1 country 155 non-null object
2 lifeMale 145 non-null float64
3 infantMortality 145 non-null float64
4 gdpPerCapita 145 non-null float64
5 economicActivityMale 145 non-null float64
6 illiteracyMale 145 non-null float64
7 illiteracyFemale 145 non-null float64
8 lifeFemale 145 non-null float64
9 geographic_location 155 non-null object
10 contraception 84 non-null float64
11 continent 155 non-null object
dtypes: float64(9), object(3)
memory usage: 14.7+ KB
11.1.1.3 isnull() Method
This is one of the most direct methods. Using df.isnull()
returns a DataFrame of Boolean values where True
indicates a missing value. To get a summary, you can use df.isnull().sum()
to see the count of missing values in each column.
For finding the number of missing values in each column of gdp_missing_values_data
, we will sum up the missing values in each column of the dataset:
gdp_missing_values_data.isnull().sum()
economicActivityFemale 10
country 0
lifeMale 10
infantMortality 10
gdpPerCapita 10
economicActivityMale 10
illiteracyMale 10
illiteracyFemale 10
lifeFemale 10
geographic_location 0
contraception 71
continent 0
dtype: int64
11.1.2 Types of Missing Values
In data science, missing values typically fall into three main types, each requiring different handling strategies:
11.1.2.1 Missing Completely at Random (MCAR)
- Definition: Missing values are entirely independent of any variables in the dataset.
- Example: A respondent accidentally skips a question on a survey.
- Impact: MCAR data can usually be ignored or imputed without biasing the analysis.
- Handling: Simple imputation methods, like filling with mean or median values, are often appropriate.
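For instance, a minimal sketch of simple imputation under MCAR (the small DataFrame and its column names below are purely illustrative):
# Illustrative data: fill each column's missing values with that column's median
toy = pd.DataFrame({'income': [50.0, 60.0, np.nan, 55.0], 'spend': [20.0, np.nan, 25.0, 22.0]})
toy_imputed = toy.fillna(toy.median(numeric_only=True))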
11.1.2.2 Missing at Random (MAR)
- Definition: The likelihood of a value being missing is related to other observed variables but not to the missing data itself.
- Example: People with higher incomes may be less likely to report their spending, but income data itself is not missing.
- Impact: Ignoring MAR values may bias results, so careful imputation based on related variables is recommended.
- Handling: More complex imputation methods, like conditional mean imputation or predictive modeling, are suitable.
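For instance, a hedged sketch of conditional (group-wise) mean imputation, where the missing variable is imputed using the mean within groups defined by an observed variable (the DataFrame and column names below are purely illustrative):
# Illustrative data: impute 'spend' using the mean spend within each observed 'income_band'
toy = pd.DataFrame({'income_band': ['low', 'low', 'high', 'high'],
                    'spend': [20.0, np.nan, 45.0, np.nan]})
toy['spend'] = toy.groupby('income_band')['spend'].transform(lambda s: s.fillna(s.mean()))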
11.1.2.3 Missing Not at Random (MNAR)
- Definition: The probability of missingness is related to the missing data itself, meaning the value is systematically missing.
- Example: Patients with severe health conditions might be less likely to report their health status, or students with low scores may be less likely to submit their grades.
- Impact: MNAR is the most challenging type, as missing values may introduce significant bias.
- Handling: Solutions often include sensitivity analysis, data augmentation, or modeling techniques that account for the missing mechanism, though sometimes domain-specific approaches are necessary.
Understanding the type of missing data helps in selecting the right imputation method and mitigating potential biases in the analysis.
11.1.2.4 Questions
11.1.2.4.1
Why can we ignore observations with missing values without risking skewing the analysis or trends in the case of Missing Completely at Random (MCAR)?
11.1.2.4.2
Why could ignoring missing values lead to biased results for Missing at Random (MAR) and Missing Not at Random (MNAR) data?
11.1.2.4.3
For the dataset consisting of GDP per capita, think of hypothetical scenarios in which the missing values of GDP per capita can correspond to MCAR / MAR / MNAR.
11.1.3 Methods for Handling missing values
11.1.3.1 Removing Missing values
- Row/Column Removal: Use [df.dropna()](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html)
- When to Use: If the missing values are few or the rows/columns are not critical.
- Risks: Can reduce the dataset’s size, potentially losing valuable information.
Let us drop the rows containing even a single missing value from gdp_missing_values_data.
gdp_no_missing_data = gdp_missing_values_data.dropna()
By default, df.dropna()
will drop any row that contains at least one missing value, leaving only the rows that are completely free of missing values.
#Shape of gdp_no_missing_data
gdp_no_missing_data.shape
(42, 12)
Using df.dropna()
to remove rows with missing values can sometimes lead to a significant reduction in data, which can be problematic if much of the data is valuable and non-missing. For example:
- Impact of Default Behavior: Dropping rows with even a single missing value reduced the number of rows from 155 to 42! This drastic reduction happens because, by default, dropna() removes any row with at least one missing value, keeping only rows that are entirely free of missing values.
- Loss of Non-Missing Data: Even though some columns have very few missing values (at most 10), using dropna() without modification discards all the non-missing data in the affected rows. This is typically a poor choice, especially when valuable, non-missing data is removed unnecessarily.
To avoid losing too much data, you can adjust the behavior of dropna()
using these parameters:
- how Parameter: Controls the criteria for dropping rows or columns.
  - how='any' (default): Drops a row or column if any of its values are missing.
  - how='all': Drops a row or column only if all of its values are missing.
- thresh Parameter: Sets the minimum number of non-missing values required to retain a row or column.
  - Useful when you want to keep rows or columns with substantial, but not complete, data (see the sketch below).
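As a quick hedged illustration of these parameters on gdp_missing_values_data (the threshold of 10 below is an arbitrary choice for illustration):
# Drop rows only if every value in the row is missing (no row is, so nothing is dropped)
gdp_drop_all = gdp_missing_values_data.dropna(how='all')

# Keep only rows with at least 10 non-missing values
gdp_drop_thresh = gdp_missing_values_data.dropna(thresh=10)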
If a few values of a column are missing, we can possibly estimate them using the rest of the data, so that we can (hopefully) maximize the information that can be extracted from the data. However, if most of the values of a column are missing, it may be harder to estimate its values.
In this dataset, we see that almost half of the values of the contraception column (71 of 155) are missing. Thus, we'll drop the column, as it may be hard to impute its values based on the relatively small number of non-missing values.
#Deleting column with missing values in almost half of the observations
gdp_missing_values_data.drop(['contraception'], axis=1, inplace=True)
gdp_missing_values_data.shape
(155, 11)
11.1.3.2 Imputing Missing values
There are an unlimited number of ways to impute missing values. Some imputation methods are provided in the Pandas documentation.
The best way to impute them will depend on the problem:
- For MCAR: Simple imputation is generally acceptable.
- For MAR: Imputation should consider relationships with other variables, such as using conditional mean imputation.
- For MNAR: Imputation requires careful analysis, and domain knowledge is essential to avoid bias.
Below are some of the most common methods. Recall that we randomly introduced missing values in gdp_missing_values_data
, while the actual values are preserved in gdp_complete_data
.
We will apply these methods to gdp_missing_values_data
. To evaluate each method’s effectiveness in imputing missing values, we’ll compare the imputed gdpPerCapita values with the actual values and calculate the Root Mean Square Error
(RMSE).
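For reference, with \(n\) imputed observations, \(\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}\), where \(\hat{y}_i\) is the imputed value and \(y_i\) is the actual value of gdpPerCapita.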
To visualize the imputed vs. actual values, let's define a helper function:
#Index of rows with missing values for GDP per capita
null_ind_gdpPC = gdp_missing_values_data.index[gdp_missing_values_data.gdpPerCapita.isnull()]

#Defining a function to plot the imputed values vs actual values
def plot_actual_vs_predicted(y):
    fig, ax = plt.subplots(figsize=(8, 6))
    plt.rc('xtick', labelsize=15)
    plt.rc('ytick', labelsize=15)
    x = gdp_complete_data.loc[null_ind_gdpPC, 'gdpPerCapita']
    plt.scatter(x, y.loc[null_ind_gdpPC, 'gdpPerCapita'])
    z = np.polyfit(x, y.loc[null_ind_gdpPC, 'gdpPerCapita'], 1)
    p = np.poly1d(z)
    plt.plot(x, x, color='orange')
    plt.xlabel('Actual GDP per capita', fontsize=15)
    plt.ylabel('Imputed GDP per capita', fontsize=15)
    ax.xaxis.set_major_formatter('${x:,.0f}')
    ax.yaxis.set_major_formatter('${x:,.0f}')
    plt.rc('axes', labelsize=10)  # Set all axis labels to fontsize 10
    rmse = np.sqrt(np.mean((y.loc[null_ind_gdpPC, 'gdpPerCapita'] - x)**2))
    plt.text(10000, 50000, 'RMSE = %.2f' % rmse, fontsize=15)
11.1.3.2.1 Mean/Median/Mode Imputation
This approach replaces missing values with the mean or median of the column.
Let’s impute missing values in each column by substituting them with the average of that column’s non-missing values. Imputing with the mean tends to minimize the sum of squared differences between actual and imputed values, especially when data is Missing Completely at Random (MCAR). However, this might not hold true for other types of missing data, such as Missing at Random (MAR) or Missing Not at Random (MNAR).
# Extracting the columns with missing values
columns_with_missing = [col for col in gdp_missing_values_data.columns if gdp_missing_values_data[col].isnull().any()]
columns_with_missing
['economicActivityFemale',
'lifeMale',
'infantMortality',
'gdpPerCapita',
'economicActivityMale',
'illiteracyMale',
'illiteracyFemale',
'lifeFemale']
# Imputing missing values using the mean
gdp_imputed_data_mean = gdp_missing_values_data[columns_with_missing].fillna(gdp_missing_values_data[columns_with_missing].mean())
plot_actual_vs_predicted(gdp_imputed_data_mean)
Here, since all columns with missing values are numerical, we could use the mean to impute these values. Using the mean is generally suitable only for numerical data, as it represents a central tendency specific to numbers. For categorical data, however, mean imputation would be inappropriate. Instead, the mode, which identifies the most frequently occurring value, is a more suitable choice for imputing missing values in categorical columns.
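For instance, a brief sketch of mode imputation for a categorical column (the DataFrame below is purely illustrative, since the GDP data has no missing categorical values after dropping contraception):
# Illustrative data: fill missing categories with the most frequent category (the mode)
toy = pd.DataFrame({'continent': ['Asia', 'Africa', np.nan, 'Asia']})
toy['continent'] = toy['continent'].fillna(toy['continent'].mode().iloc[0])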
11.1.3.2.3 Forward Fill/Backward Fill
To fill missing values in a column by copying the value of the previous or next non-missing observation, we can use forward fill (propagating the last valid observation forward) or backward fill (propagating the next valid observation backward).
Below, we demonstrate forward fill. To perform backward fill instead, simply replace ffill with bfill, as sketched after the code below.
#Filling missing values: Method 1- Naive way (forward fill)
gdp_imputed_data_ffill = gdp_missing_values_data.ffill()
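For completeness, backward fill is a one-line change (a sketch only; it is not used in the rest of this section):
#Filling missing values by propagating the next valid observation backward
gdp_imputed_data_bfill = gdp_missing_values_data.bfill()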
Let us next check how well forward fill imputes the missing values.
plot_actual_vs_predicted(gdp_imputed_data_ffill)
We observe that the accuracy of imputation is poor as GDP per capita can vary a lot across countries, and the data is not sorted by GDP per capita. There is no reason why the GDP per capita of a country should be close to the GDP per capita of the country in the observation above it.
#Checking if any missing values are remaining
gdp_imputed_data_ffill.isnull().sum()
economicActivityFemale 0
country 0
lifeMale 0
infantMortality 0
gdpPerCapita 0
economicActivityMale 0
illiteracyMale 1
illiteracyFemale 0
lifeFemale 0
geographic_location 0
continent 0
dtype: int64
After imputing missing values, note there is still one missing value for illiteracyMale. Can you guess why one missing value remained?
11.2 Outlier detection and handling
An outlier is an observation that is significantly different from the rest of the data. Outlier detection and handling are important in data science because outliers can significantly impact the quality, accuracy, and interpretability of data analysis and model performance. Here are the main reasons why it’s crucial to address outliers:
- Preventing Skewed Results: Outliers can distort statistical measures, such as the mean and standard deviation, leading to misleading interpretations of the data. For example, a few extreme values can inflate the mean, making the data seem larger than it is in reality.
- Improving Model Accuracy: Many machine learning algorithms are sensitive to outliers and can perform poorly if outliers are present. For instance, in linear regression, outliers can disproportionately affect the model by pulling the line of best fit toward them, reducing predictive accuracy.
- Enhancing Robustness of Models: Identifying and handling outliers can make models more robust and stable. By minimizing the influence of extreme values, models become less sensitive to noise, which improves generalization and reduces overfitting.
11.2.1 Outlier detection
Outlier detection is crucial for identifying unusual values in your data that may need to be investigated, removed, or transformed. Here are several popular methods for detecting outliers:
We will use College.csv, which contains information about US universities, as our dataset in this section. The description of the variables in the dataset can be found on page 65 of this book.
# Load the College dataset
college = pd.read_csv('./Datasets/College.csv')
college.head()
Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
11.2.1.1 Visualization Methods
- Box Plot: A box plot is a quick and visual way to detect outliers, where points outside the “whiskers” are considered outliers.
Let us visualize outliers in average instructional expenditure per student given by the variable Expend
.
ax = college.boxplot(column='Expend')
ax.yaxis.set_major_formatter('${x:,.0f}')
There are several outliers (shown as circles in the above boxplot), which correspond to high values of average instructional expenditure per student.
- Histogram: Histograms can reveal unusual values as isolated bars, helping to identify outliers in a single feature.
# create a histogram of expend
college['Expend'].plot(kind='hist', edgecolor='black', bins=20)
plt.xlabel('Expenditure')
plt.ylabel('Frequency')
plt.title('Histogram of Expenditure')
plt.show()
- Scatter Plot: A scatter plot can reveal outliers, especially in two-dimensional data. Outliers will often appear as points separated from the main cluster.
Let’s make a scatterplot of ‘Grad.Rate’ vs ‘Expend’ with a trendline, to visualize potential outliers:
sns.set(font_scale=1.5)
ax = sns.regplot(data=college, x="Expend", y="Grad.Rate", scatter_kws={"color": "orange"}, line_kws={"color": "blue"})
ax.xaxis.set_major_formatter('${x:,.0f}')
ax.set_xlabel('Expenditure per student')
ax.set_ylabel('Graduation rate')
plt.show()
The trendline indicates a positive correlation between Expend
and Grad.Rate
. However, there seems to be a lot of noise and presence of outliers in the data.
11.2.1.2 Statistical Methods
Visualization helps us identify potential outliers. To address them effectively, we need to extract these outlier instances using a defined threshold. Two commonly used statistical methods for this purpose are the Z-Score method and the Interquartile Range (IQR) (Tukey’s fences).
11.2.1.2.1 Z-Score (Standard Score)
The Z-score measures how many standard deviations a data point is from the mean. Commonly, data points with a Z-score greater than 3 (or less than -3) are considered outliers.
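In symbols, the Z-score of a value \(x_i\) is \(z_i = \frac{x_i - \bar{x}}{s}\), where \(\bar{x}\) is the column mean and \(s\) is the column standard deviation.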
from scipy import stats

# Calculate Z-scores and identify outliers
college['z_score'] = stats.zscore(college['Expend'])
college['is_z_score_outlier'] = college['z_score'].abs() > 3

# Filter to show only the outliers
z_score_outliers = college[college['is_z_score_outlier']]
print(z_score_outliers.shape)
z_score_outliers.head()
(16, 21)
Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | ... | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | z_score | is_z_score_outlier | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
20 | Antioch University | Yes | 713 | 661 | 252 | 25 | 44 | 712 | 23 | 15476 | ... | 400 | 1100 | 69 | 82 | 11.3 | 35 | 42926 | 48 | 6.374709 | True |
144 | Columbia University | Yes | 6756 | 1930 | 871 | 78 | 96 | 3376 | 55 | 18624 | ... | 550 | 300 | 97 | 98 | 5.9 | 21 | 30639 | 99 | 4.020159 | True |
158 | Dartmouth College | Yes | 8587 | 2273 | 1087 | 87 | 99 | 3918 | 32 | 19545 | ... | 550 | 1100 | 95 | 99 | 4.7 | 49 | 29619 | 98 | 3.824698 | True |
174 | Duke University | Yes | 13789 | 3893 | 1583 | 90 | 98 | 6188 | 53 | 18590 | ... | 625 | 1162 | 95 | 96 | 5.0 | 44 | 27206 | 97 | 3.362296 | True |
191 | Emory University | Yes | 8506 | 4168 | 1236 | 76 | 97 | 5544 | 192 | 17600 | ... | 600 | 870 | 97 | 98 | 5.0 | 28 | 28457 | 96 | 3.602024 | True |
5 rows × 21 columns
The boxplot identifies outliers based on Tukey's fences criterion:
11.2.1.2.2 IQR method (Tukey’s fences)
John Tukey proposed that observations outside the range \([Q1 - 1.5(Q3-Q1), Q3+1.5(Q3-Q1)]\) are outliers, where \(Q1\) and \(Q3\) are the lower \((25\%)\) and upper \((75\%)\) quartiles respectively. Let us detect outliers based on this threshold.
# Calculate the first and third quartiles, and the IQR
Q1 = np.percentile(college['Expend'], 25)
Q3 = np.percentile(college['Expend'], 75)
IQR = Q3 - Q1

# Define the lower and upper bounds for outliers
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Add a new column to indicate IQR outliers
college['is_IQR_outlier'] = (college['Expend'] < lower_bound) | (college['Expend'] > upper_bound)

# Filter to show only the IQR outliers
iqr_outliers = college[college['is_IQR_outlier']]
print(iqr_outliers.shape)
iqr_outliers.head()
(48, 22)
Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | ... | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | z_score | is_z_score_outlier | is_IQR_outlier | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | ... | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 | 1.792851 | False | True |
16 | Amherst College | Yes | 4302 | 992 | 418 | 83 | 96 | 1593 | 5 | 19760 | ... | 1598 | 93 | 98 | 8.4 | 63 | 21424 | 100 | 2.254295 | False | True |
20 | Antioch University | Yes | 713 | 661 | 252 | 25 | 44 | 712 | 23 | 15476 | ... | 1100 | 69 | 82 | 11.3 | 35 | 42926 | 48 | 6.374709 | True | True |
60 | Bowdoin College | Yes | 3356 | 1019 | 418 | 76 | 100 | 1490 | 8 | 19030 | ... | 875 | 93 | 96 | 11.2 | 52 | 20447 | 96 | 2.067073 | False | True |
64 | Brandeis University | Yes | 4186 | 2743 | 740 | 48 | 77 | 2819 | 62 | 19380 | ... | 1000 | 90 | 97 | 9.8 | 24 | 17150 | 84 | 1.435271 | False | True |
5 rows × 22 columns
Using the IQR method, more instances in the college dataset are marked as outliers.
11.2.1.2.3 When to use each:
Use IQR when:
- Data is skewed
- You don’t want to assume any distribution
- You want a more sensitive detector
Use Z-score when:
- Data is approximately normal
- You want a more conservative approach
- You need a method that’s easily interpretable
The actual number of outliers marked by each method will depend on your specific dataset’s distribution and characteristics.
11.2.2 Common Methods for Handling outliers
Once we identify outlier instances using statistical methods, the next step is to handle them. Below are some commonly used approaches:
- Removing Outliers: Discarding extreme values if they are likely due to errors or are irrelevant for the analysis.
- Winsorizing: Capping/flooring outliers to a specific percentile to reduce their influence without removing them completely.
- Replacing with the median: Replacing outlier values with the median of the remaining data. The median is less affected by extreme values than the mean, making it an ideal choice for imputation when dealing with outliers
- Transforming Data: Applying transformations (e.g., log, square root) to reduce the impact of outliers.
def method_removal():
    """Method 1: Remove outliers"""
    # Uses the IQR-based outlier flag created above
    data_cleaned = college[~college['is_IQR_outlier']]['Expend']
    return data_cleaned

def method_capping():
    """Method 2: Capping (Winsorization)"""
    data_capped = college['Expend'].copy()
    data_capped[data_capped < lower_bound] = lower_bound
    data_capped[data_capped > upper_bound] = upper_bound
    return data_capped

def method_median_replacement():
    """Method 3: Replace outliers with the median"""
    data_median = college['Expend'].copy()
    median_value = college[~college['is_IQR_outlier']]['Expend'].median()
    data_median[college['is_IQR_outlier']] = median_value
    return data_median

def method_log_transformation():
    """Method 4: Log transformation"""
    data_log = np.log1p(college['Expend'])
    return data_log
Your Practice: Try each method and compare their results to see how they differ in handling outliers.
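One possible (hedged) starting point for that comparison is to print a few summary statistics for each handled series:
# Compare summary statistics of the 'Expend' series after each handling method
# (note: the log-transformed series is on a different scale than the others)
results = {
    'removal': method_removal(),
    'capping': method_capping(),
    'median replacement': method_median_replacement(),
    'log transform': method_log_transformation(),
}
for name, series in results.items():
    print(f"{name}: n = {series.shape[0]}, mean = {series.mean():.2f}, std = {series.std():.2f}")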
In real-world applications, handling outliers should be approached on a case-by-case basis. However, here are some general recommendations:
Recommendations for Different Scenarios:
Data Removal:
- Use when: You have a large dataset and can afford to lose data
- Pros: Simple, removes all outlier influence
- Cons: Loss of data, potential loss of important information
Capping (Winsorization):
- Use when: You want to preserve data count while limiting extreme values
- Pros: Preserves data size, reduces impact of outliers
- Cons: Artificial boundary creation, potential loss of genuine extreme events
Median Replacement:
- Use when: The data is normally distributed, and outliers are verified as errors or anomalies not representing true values
- Pros: Maintains data size, simple to implement
- Cons: Reduces variance, may not represent the true distribution
Log Transformation:
- Use when: Data has a right-skewed distribution, and you want to reduce the impact of large outliers.
- Pros: Reduces skewness, minimizes the effect of extreme values, can make data more normal-like.
- Cons: Only applicable to positive values, may not be effective for extremely high outliers.
11.3 Data binning
Data binning is a powerful technique in data analysis where continuous numerical data is grouped into discrete intervals or “bins.” Imagine it as sorting items into distinct categories or “buckets” based on their values, making it easier to observe trends, patterns, and groupings within large datasets. We encounter binning in everyday life without even realizing it.
For example, consider age groups: instead of listing every individual age, we often refer to broader categories like “under 20,” “20-30,” “30-40,” or “40 and over.” This binning approach helps simplify data interpretation and comparison across different age groups. Another common example is tax brackets. Tax rates are often applied to ranges of income, like “up to $11,000,” “$11,001 to $44,725,” and so forth. Here, income values are grouped into bins to determine applicable tax rates, making complex tax data easier to manage and understand.
![](./Datasets/taxable-income-2024.png)
![](./Datasets/bin_example1b.png)
11.3.1 Key reasons for binning:
- Better interpretation of data
- Making better recommendations
- Smooth data, reduce noise
Examples:
Binning to better interpret data:
- The number of flu cases each day may be binned into seasons, such as fall, spring, winter, and summer, to understand the effect of season on flu.
Binning to make recommendations:
- A doctor may like to group patient ages into bins. Grouping patient ages into categories such as Age <= 12, 12 < Age <= 18, 18 < Age <= 65, and Age > 65 may help recommend the kind/doses of COVID vaccine a patient needs.
- A credit card company may want to bin customers based on their spend, as “High spenders”, “Medium spenders”, and “Low spenders”. Binning will help them design customized marketing campaigns for each bin, thereby increasing customer response (or revenue). On the other hand, they can use the same campaign for all customers within a bin, thus minimizing marketing costs.
Binning to smooth data and reduce noise:
- A sales company may want to bin their total sales to a weekly / monthly / yearly level to reduce the noise in day-to-day sales.
Using our college dataset, let’s apply binning to better interpret the relationship between instructional expenditure per student (Expend
) and graduation rate (Grad.Rate
) for U.S. universities, and make relevant recommendations.
Here is the previous scatterplot of ‘Grad.Rate’ vs ‘Expend’ with a trendline.
# scatter plot of Expend vs Grad.Rate
ax = sns.regplot(data=college, x="Expend", y="Grad.Rate", scatter_kws={"color": "orange"}, line_kws={"color": "blue"})
ax.xaxis.set_major_formatter('${x:,.0f}')
ax.set_xlabel('Expenditure per student')
ax.set_ylabel('Graduation rate')
plt.show()
The trendline indicates a positive correlation between Expend
and Grad.Rate
. However, there seems to be a lot of noise and presence of outliers in the data, which makes it hard to interpret the overall trend.
11.3.2 Common binning methods:
- Equal-width: Divides range into equal-sized intervals
- Equal-size: Places equal number of observations in each bin
- Custom: Based on domain knowledge (like age groups above)
We’ll bin Expend
to see if we can better analyze its association with Grad.Rate
. However, let us first visualize the distribution of Expend
.
#Visualizing the distribution of expend
ax = sns.histplot(data=college, x='Expend')
ax.xaxis.set_major_formatter('${x:,.0f}')
The distribution of Expend is right-skewed, with potentially some extremely high outlying values.
11.3.2.1 Binning with equal width bins
We’ll use the Pandas function cut() to bin Expend
. This function creates bins such that all bins have the same width.
#Using the cut() function in Pandas to bin "Expend"
Binned_expend = pd.cut(college['Expend'], 3, retbins=True)
Binned_expend
(0 (3132.953, 20868.333]
1 (3132.953, 20868.333]
2 (3132.953, 20868.333]
3 (3132.953, 20868.333]
4 (3132.953, 20868.333]
...
772 (3132.953, 20868.333]
773 (3132.953, 20868.333]
774 (3132.953, 20868.333]
775 (38550.667, 56233.0]
776 (3132.953, 20868.333]
Name: Expend, Length: 777, dtype: category
Categories (3, interval[float64, right]): [(3132.953, 20868.333] < (20868.333, 38550.667] < (38550.667, 56233.0]],
array([ 3132.953 , 20868.33333333, 38550.66666667, 56233. ]))
The cut() function returns a tuple of length 2. The first element of the tuple is a Series giving the bin of each observation, while the second element is an array containing the cut-off values for the bins.
type(Binned_expend)
tuple
len(Binned_expend)
2
Once the bins are obtained, we’ll add a column in the dataset that indicates the bin for Expend
.
#Creating a categorical variable to store the level of expenditure on a student
college['Expend_bin'] = Binned_expend[0]
college.head()
Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | ... | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | z_score | is_z_score_outlier | is_IQR_outlier | Expend_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | ... | 70 | 78 | 18.1 | 12 | 7041 | 60 | -0.501910 | False | False | (3132.953, 20868.333] |
1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | ... | 29 | 30 | 12.2 | 16 | 10527 | 56 | 0.166110 | False | False | (3132.953, 20868.333] |
2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | ... | 53 | 66 | 12.9 | 30 | 8735 | 54 | -0.177290 | False | False | (3132.953, 20868.333] |
3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | ... | 92 | 97 | 7.7 | 37 | 19016 | 59 | 1.792851 | False | True | (3132.953, 20868.333] |
4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | ... | 76 | 72 | 11.9 | 2 | 10922 | 15 | 0.241803 | False | False | (3132.953, 20868.333] |
5 rows × 23 columns
See the variable Expend_bin
in the above dataset.
Let us visualize the Expend
bins over the distribution of the Expend
variable.
#Visualizing the bins for instructional expenditure on a student
sns.set(font_scale=1.25)
plt.rcParams["figure.figsize"] = (9,6)
ax = sns.histplot(data=college, x='Expend')
plt.vlines(Binned_expend[1], 0, 100, color='red')
plt.xlabel('Expenditure per student')
plt.ylabel('Count')
ax.xaxis.set_major_formatter('${x:,.0f}')
By default, the bins created have equal width. They are created by dividing the range between the maximum and minimum value of Expend
into the desired number of equal-width intervals. We can also label the bins, as follows.
college['Expend_bin'] = pd.cut(college['Expend'], 3, labels=['Low expend', 'Med expend', 'High expend'])
college['Expend_bin']
0 Low expend
1 Low expend
2 Low expend
3 Low expend
4 Low expend
...
772 Low expend
773 Low expend
774 Low expend
775 High expend
776 Low expend
Name: Expend_bin, Length: 777, dtype: category
Categories (3, object): ['Low expend' < 'Med expend' < 'High expend']
Now that we have binned the variable Expend
, let us see if we can better visualize the association of graduation rate with expenditure per student using Expend_bin
.
#Visualizing average graduation rate vs categories of instructional expenditure per student
sns.barplot(x='Expend_bin', y='Grad.Rate', data=college)
<Axes: xlabel='Expend_bin', ylabel='Grad.Rate'>
It seems that the graduation rate is the highest for universities with medium level of expenditure per student. This is different from the trend we saw earlier in the scatter plot. Let us investigate.
Let us find the number of universities in each bin.
college['Expend_bin'].value_counts()
Expend_bin
Low expend 751
Med expend 21
High expend 5
Name: count, dtype: int64
The bin High expend
consists of only 5 universities, or 0.6% of all the universities in the dataset. These universities may be outliers that are skewing the trend (as also evident in the histogram above).
Let us see if we get the correct trend with the outliers removed from the data.
#Data without outliers
college_data_without_outliers = college[((college.Expend >= lower_bound) & (college.Expend <= upper_bound))]

Binned_data = pd.cut(college_data_without_outliers['Expend'], 3, labels=['Low expend', 'Med expend', 'High expend'], retbins=True)
college_data_without_outliers.loc[:, 'Expend_bin'] = Binned_data[0]
sns.barplot(x='Expend_bin', y='Grad.Rate', data=college_data_without_outliers)
<Axes: xlabel='Expend_bin', ylabel='Grad.Rate'>
With the outliers removed, we obtain the correct overall trend, even in the case of equal-width bins. Note that these bins have an unequal number of observations, as shown below.
ax = sns.histplot(data=college_data_without_outliers, x='Expend')
for i in range(4):
    plt.axvline(Binned_data[1][i], color='red')
plt.xlabel('Expenditure per student')
plt.ylabel('Count')
ax.xaxis.set_major_formatter('${x:,.0f}')
Note that the right tail of the histogram has disappeared since we removed the outliers.
college_data_without_outliers['Expend_bin'].value_counts()
Expend_bin
Med expend 327
Low expend 314
High expend 88
Name: count, dtype: int64
Instead of removing outliers, we can address the issue by binning observations into equal-sized bins, ensuring each bin has the same number of observations. Let’s explore this approach next.
11.3.2.2 Binning with equal sized bins
Let us bin the variable Expend
such that each bin consists of the same number of observations.
We’ll use the Pandas function qcut() to make equal-sized bins (in contrast to equal-width bins in the previous section).
#Using the Pandas function qcut() to create bins with the same number of observations
Binned_expend = pd.qcut(college['Expend'], 3, retbins=True)
college['Expend_bin'] = Binned_expend[0]
Each bin has the same number of observations with qcut()
:
college['Expend_bin'].value_counts()
Expend_bin
(3185.999, 7334.333] 259
(7334.333, 9682.333] 259
(9682.333, 56233.0] 259
Name: count, dtype: int64
Let us visualize the Expend
bins over the distribution of the Expend
variable.
#Visualizing the bins for instructional expenditure on a student
sns.set(font_scale=1.25)
plt.rcParams["figure.figsize"] = (9,6)
ax = sns.histplot(data=college, x='Expend')
plt.vlines(Binned_expend[1], 0, 100, color='red')
plt.xlabel('Expenditure per student')
plt.ylabel('Count')
ax.xaxis.set_major_formatter('${x:,.0f}')
Note that the bin-widths have been adjusted to have the same number of observations in each bin. The bins are narrower in domains of high density, and wider in domains of sparse density.
Let us again make the barplot visualizing the average graduate rate with level of instructional expenditure per student.
college['Expend_bin'] = pd.qcut(college['Expend'], 3, labels=['Low expend', 'Med expend', 'High expend'])
ax = sns.barplot(x='Expend_bin', y='Grad.Rate', data=college)
Now we see the same trend that we saw in the scatterplot, but without the noise. We have smoothed the data. Note that making equal-sized bins helps reduce the effect of outliers in the overall trend.
Suppose this analysis was done to provide recommendations to universities for increasing their graduation rate. With binning, we can provide one recommendation to ‘Low expend’ universities, and another one to ‘Med expend’ universities. For example, the recommendations can be:
- ‘Low expend’ universities can expect an increase of 9 percentage points in Grad.Rate if they migrate to the ‘Med expend’ category.
- ‘Med expend’ universities can expect an increase of 7 percentage points in Grad.Rate if they migrate to the ‘High expend’ category.
The numbers in the above recommendations are based on the table below.
college.groupby(college.Expend_bin, observed=False)['Grad.Rate'].mean()
Expend_bin
Low expend 57.343629
Med expend 66.057915
High expend 72.988417
Name: Grad.Rate, dtype: float64
We can also provide recommendations based on the confidence intervals of the mean graduation rate (Grad.Rate
). The confidence intervals, calculated below, are derived using a method known as bootstrapping. For a detailed explanation of bootstrapping, please refer to this article on Wikipedia.
# Bootstrapping to find 95% confidence intervals of Graduation Rate of US universities based on average expenditure per student
confidence_intervals = {}

# Loop through each expenditure bin
for expend_bin, data_sub in college.groupby('Expend_bin', observed=False):
    # Generate bootstrap samples for the graduation rate
    samples = np.random.choice(data_sub['Grad.Rate'], size=(10000, len(data_sub)), replace=True)

    # Calculate the mean of each sample
    sample_means = samples.mean(axis=1)

    # Calculate the 95% confidence interval
    lower_bound = np.percentile(sample_means, 2.5)
    upper_bound = np.percentile(sample_means, 97.5)

    # Store the result in the dictionary
    confidence_intervals[expend_bin] = (round(lower_bound, 2), round(upper_bound, 2))

# Print the results
for expend_bin, ci in confidence_intervals.items():
    print(f"95% Confidence interval of Grad.Rate for {expend_bin} universities = [{ci[0]}, {ci[1]}]")
95% Confidence interval of Grad.Rate for Low expend universities = [55.36, 59.36]
95% Confidence interval of Grad.Rate for Med expend universities = [64.15, 67.95]
95% Confidence interval of Grad.Rate for High expend universities = [71.03, 74.92]
11.3.2.3 Binning with custom bins
Apart from equal-width and equal-sized bins, custom bins can be created using the bins argument. Suppose bins are to be created for Expend with cutoffs \(\$10,000, \$20,000, \$30,000, ..., \$60,000\). Then, we can use the bins argument as in the code below:
Binned_expend = pd.cut(college.Expend, bins=list(range(0, 70000, 10000)), retbins=True)
#Visualizing the bins for instructional expenditure on a student
ax = sns.histplot(data=college, x='Expend')
plt.vlines(Binned_expend[1], 0, 100, color='red')
plt.xlabel('Expenditure per student')
plt.ylabel('Count')
ax.xaxis.set_major_formatter('${x:,.0f}')
Next, let’s create a custom bin with unequal sizes and widths.
# Create bins and labels
bins = [college['Expend'].min(), 10000, 40000, college['Expend'].max()]
labels = ['Low Expenditure', 'Medium Expenditure', 'High Expenditure']

# Binning the 'Expend' column
college['Expend_Binned'] = pd.cut(college['Expend'], bins=bins, labels=labels)

# Visualizing the bins using a histogram
ax = sns.histplot(data=college, x='Expend', kde=False)

# Add vertical lines for the bin edges (the bin edges are the values in 'bins')
for bin_edge in bins[1:-1]:  # skip the first and last values since they are the edges
    plt.vlines(bin_edge, 0, ax.get_ylim()[1], color='red')

# Add labels and formatting
plt.xlabel('Expenditure per student')
plt.ylabel('Frequency')
ax.xaxis.set_major_formatter('${x:,.0f}')

# Display the plot
plt.show()
As custom bin-cutoffs can be specified with the cut()
function, custom bin quantiles can be specified with the qcut()
function.
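For instance, a hedged sketch with custom quantile cut-offs (the quantiles below are arbitrary choices for illustration):
# Bin 'Expend' at the 0th, 10th, 50th, 90th, and 100th percentiles
custom_quantile_bins = pd.qcut(college['Expend'], q=[0, 0.1, 0.5, 0.9, 1.0])
custom_quantile_bins.value_counts()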
11.4 Dummy / Indicator variables
Dummy or indicator variables are binary variables created to represent categorical data in numerical form, typically used in regression and machine learning models. Each category is represented by a separate dummy variable with values of 0 or 1.
11.4.1 Purpose:
- Enables categorical data to be used in quantitative analysis: Models like linear regression and many machine learning algorithms require numerical input, making dummy variables essential for handling categorical data.
- Preserves information: Converts categorical information into a format that models can interpret without losing meaning.
11.4.2 When to use dummy variables
Dummy variables are especially useful in:
- Regression models: For categorical predictors, such as region, gender, or type of product.
- Machine learning algorithms: Many models, including tree-based methods, can benefit from dummy variables for categorical features.
11.4.3 How to Create Dummy variables
The pandas library in Python has a built-in function, pd.get_dummies()
, to generate dummy variables. If a column in a DataFrame has \(k\) distinct values, we will get a DataFrame with \(k\) columns containing 0s and 1s.
Let us make dummy variables with the equal-sized bins we created for the average instruction expenditure per student.
#Using the Pandas function qcut() to create bins with the same number of observations
Binned_expend = pd.qcut(college['Expend'], 3, retbins=True, labels=['Low_expend', 'Med_expend', 'High_expend'])
college['Expend_bin'] = Binned_expend[0]
#Making dummy variables based on the levels (categories) of the 'Expend_bin' variable
dummy_Expend = pd.get_dummies(college['Expend_bin'], dtype=int)
The dummy data dummy_Expend
has a value of \(1\) if the observation corresponds to the category referenced by the column name.
dummy_Expend.head()
Low_expend | Med_expend | High_expend | |
---|---|---|---|
0 | 1 | 0 | 0 |
1 | 0 | 0 | 1 |
2 | 0 | 1 | 0 |
3 | 0 | 0 | 1 |
4 | 0 | 0 | 1 |
Adding the dummy columns back to the original dataframe
#Concatenating the dummy variables to the original data
college = pd.concat([college, dummy_Expend], axis=1)
print(college.shape)
college.head()
(777, 26)
Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | ... | perc.alumni | Expend | Grad.Rate | z_score | is_z_score_outlier | is_IQR_outlier | Expend_bin | Low_expend | Med_expend | High_expend | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | ... | 12 | 7041 | 60 | -0.501910 | False | False | Low_expend | 1 | 0 | 0 |
1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | ... | 16 | 10527 | 56 | 0.166110 | False | False | High_expend | 0 | 0 | 1 |
2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | ... | 30 | 8735 | 54 | -0.177290 | False | False | Med_expend | 0 | 1 | 0 |
3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | ... | 37 | 19016 | 59 | 1.792851 | False | True | High_expend | 0 | 0 | 1 |
4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | ... | 2 | 10922 | 15 | 0.241803 | False | False | High_expend | 0 | 0 | 1 |
5 rows × 26 columns
We can find the correlation between the dummy variables and graduation rate to identify if any of the dummy variables will be useful to estimate graduation rate (Grad.Rate
).
#Finding if dummy variables will be useful to estimate 'Grad.Rate'
dummy_Expend.corrwith(college['Grad.Rate'])
Low_expend -0.334456
Med_expend 0.024492
High_expend 0.309964
dtype: float64
The dummy variables Low_expend and High_expend may help explain Grad.Rate in a regression model.
11.4.4 Using drop_first to Avoid Multicollinearity
In cases like linear regression, you may want to drop one dummy variable to avoid multicollinearity (perfect correlation between variables).
# create dummy variables, dropping the first to avoid multicollinearity
dummy_Expend = pd.get_dummies(college['Expend_bin'], dtype=int, drop_first=True)

# add the dummy variables to the original dataset
college = pd.concat([college, dummy_Expend], axis=1)

college.head()
Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | ... | Grad.Rate | z_score | is_z_score_outlier | is_IQR_outlier | Expend_bin | Low_expend | Med_expend | High_expend | Med_expend | High_expend | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | ... | 60 | -0.501910 | False | False | Low_expend | 1 | 0 | 0 | 0 | 0 |
1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | ... | 56 | 0.166110 | False | False | High_expend | 0 | 0 | 1 | 0 | 1 |
2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | ... | 54 | -0.177290 | False | False | Med_expend | 0 | 1 | 0 | 1 | 0 |
3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | ... | 59 | 1.792851 | False | True | High_expend | 0 | 0 | 1 | 0 | 1 |
4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | ... | 15 | 0.241803 | False | False | High_expend | 0 | 0 | 1 | 0 | 1 |
5 rows × 28 columns
11.5 Independent Study
11.5.1 Practice exercise 1
Read survey_data_clean.csv. Split the columns of the dataset, such that all columns with categorical values transform into dummy variables with each category corresponding to a column of 0s and 1s. Leave the Timestamp column.
As all categorical columns are transformed to dummy variables, all columns have numeric values.
What is the total number of columns in the transformed data? What is the total number of columns of the original data?
Find the:
- Top 5 variables having the highest positive correlation with NU_GPA.
- Top 5 variables having the highest negative correlation with NU_GPA.
11.5.2 Practice exercise 2
Consider the dataset survey_data_clean.csv . Find the number of outliers in each column of the dataset based on both the Z-score and IQR criteria. Do not use a for
loop.
Which column(s) have the maximum number of outliers using Z-score?
Which column(s) have the maximum number of outliers using IQR?
Do you think the outlying observations identified for those columns(s) should be considered as outliers? If not, then which type of columns should be considered when finding outliers?