14 Data Preparation
After data cleaning removes quality issues (missing values, outliers, duplicates), data preparation transforms the cleaned data into a format optimized for modeling and analysis. While data cleaning ensures data quality, data preparation ensures data is in the right structure, scale, and format for algorithms to work effectively.
Data scientists often spend 20-30% of their time on data preparation after cleaning, as most machine learning algorithms have specific requirements:
- Numerical inputs only (requiring encoding of categorical variables)
- Similar feature scales (preventing some features from dominating others)
- Meaningful representations (creating derived features that capture important patterns)
14.1 Why Data Preparation Matters
| Challenge | Issue | Solution |
|---|---|---|
| Information loss | Continuous data may hide useful patterns | Bin values into meaningful groups |
| Categorical data | Algorithms require numbers, not text | Encode categories as dummy variables |
| Scale differences | Income ($50K) vs. Age (30) have different magnitudes | Normalize or standardize features |
| Model performance | Poorly prepared data leads to biased or inaccurate models | Apply appropriate transformations |
Note: This chapter focuses on fundamental preparation techniques. Advanced feature engineering (creating interaction terms, polynomial features, domain-specific transformations) is typically covered in machine learning courses.
Let’s first import the necessary libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import sklearn as sk
import seaborn as sns
sns.set(font_scale=1.5)
%matplotlib inline
14.2 Dataset Introduction
Throughout this chapter, we’ll use the College dataset (College.csv), which contains information about US universities. A description of the dataset’s variables can be found on page 65 of this book.
We’ll demonstrate how binning and encoding can reveal meaningful patterns and support actionable recommendations for universities. Finally, we’ll apply feature scaling to prepare the data for predictive modeling.
Let’s read the dataset into a pandas DataFrame.
college = pd.read_csv('./datasets/College.csv', index_col=0)
college.head()
| | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 |
| Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 |
| Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 |
| Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 |
| Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 |
14.3 Data Binning
Data binning is a practical technique in data analysis where continuous numerical values are grouped into discrete intervals, or bins.
Think of it as sorting a long list of numbers into a few meaningful categories—like putting marbles of different colors into separate jars. Binning helps us simplify complex data, spot patterns, and make comparisons that would be hard to see otherwise.
Why do we use binning?
- Easier interpretation: Instead of analyzing hundreds of unique values, we can summarize data into a handful of groups. For example, rather than looking at every possible age, we might use age groups like under 20, 20-30, 30-40, and 40 and over (a short code sketch follows after the examples below).
- Better recommendations: Grouping data helps us make targeted decisions. For instance, a doctor might recommend different treatments for children, adults, and seniors, each defined by an age bin.
- Noise reduction: Binning can smooth out random fluctuations in data, making trends clearer. For example, daily sales numbers might be noisy, but weekly or monthly bins reveal the bigger picture.
Real-world examples:
- Tax brackets: Income is grouped into ranges (bins) to determine tax rates, such as up to $11,000, $11,001 to $44,725, etc.
- Credit card marketing: Customers are binned as High spenders, Medium spenders, or Low spenders to tailor marketing strategies.
- Medical guidelines: Vaccine recommendations often depend on age bins (e.g., under 12, 12-18, 18-65, over 65).
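To make this concrete, here is a minimal sketch (using a handful of hypothetical ages, not data from this chapter) of how the pandas cut() function can turn raw ages into the kinds of groups described above:
# A minimal sketch of binning: hypothetical ages grouped with custom cutoffs
ages = pd.Series([8, 15, 23, 31, 37, 42, 58, 71])
age_groups = pd.cut(ages, bins=[0, 20, 30, 40, 120], labels=['under 20', '20-30', '30-40', '40 and over'])
print(age_groups.value_counts().sort_index())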
In this section, we’ll see how binning can help us interpret and communicate insights from data, using real datasets and practical code examples. We’ll also compare different binning methods and discuss when each is most useful.
Now, let’s see how binning can help us uncover meaningful insights from our college dataset. We’ll focus on two key variables: instructional expenditure per student (Expend) and graduation rate (Grad.Rate) for U.S. universities.
By grouping universities into expenditure bins, we can more clearly observe how spending levels relate to graduation outcomes. This approach not only simplifies the analysis but also enables us to make practical, data-driven recommendations for different types of institutions.
Before we apply binning, let’s revisit the scatterplot of Grad.Rate versus Expend with a trendline. This plot shows the overall relationship, but as we’ll see, binning can reveal patterns and actionable insights that might be hidden in the raw data.
# scatter plot of Expend vs Grad.Rate
ax = sns.regplot(data=college, x="Expend", y="Grad.Rate", scatter_kws={"color": "orange"}, line_kws={"color": "blue"})
ax.xaxis.set_major_formatter('${x:,.0f}')
ax.set_xlabel('Expenditure per student')
ax.set_ylabel('Graduation rate')
plt.show()
The trendline indicates a positive correlation between Expend and Grad.Rate. However, there is a lot of noise, and there are outliers in the data, which makes the overall trend hard to interpret.
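To put a number on this visual impression (an optional check), we can compute the correlation between the two variables; the rank-based Spearman version is less sensitive to outlying values than the Pearson version:
# Pearson measures the linear association; Spearman uses ranks and is more robust to outliers
print('Pearson: ', college['Expend'].corr(college['Grad.Rate']).round(3))
print('Spearman:', college['Expend'].corr(college['Grad.Rate'], method='spearman').round(3))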
Common Data Binning Methods
There are several ways to create bins, each with its own advantages depending on your data and analysis goals:
- Equal-width binning: Divides the range of a variable into intervals of the same size. This is simple and works well when the data is uniformly distributed, but can result in bins with very different numbers of observations if the data is skewed or has outliers.
- Example: Splitting ages 0–100 into five bins of 20 years each: 0–20, 21–40, 41–60, 61–80, 81–100.
- Equal-size (quantile) binning: Each bin contains (roughly) the same number of observations, regardless of the actual range of values. This is useful for comparing groups of similar size and for handling skewed data.
- Example: Dividing 300 students into three bins so that each bin has 100 students, even if the score ranges are different.
- Custom binning: Bins are defined based on domain knowledge or specific cutoffs relevant to the problem. This approach is flexible and can make results more interpretable for your audience.
- Example: Income groups defined as Low (<$30,000), Middle ($30,000–$70,000), and High (>$70,000) based on policy or business needs.
In practice, the choice of binning method should be guided by the data distribution and the questions you want to answer. We’ll demonstrate each of these methods in the following examples.
We’ll bin Expend to see if we can better analyze its association with Grad.Rate. However, let us first visualize the distribution of Expend.
#Visualizing the distribution of expend
ax=sns.histplot(data = college, x= 'Expend')
ax.xaxis.set_major_formatter('${x:,.0f}')
The distribution of Expend is right skewed with potentially some extremely high outlying values.
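As a quick numeric check (an optional aside), the sample skewness of Expend should be clearly positive, consistent with the long right tail:
# Sample skewness of Expend; a positive value indicates right skew
print(college['Expend'].skew().round(2))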
14.3.1 Equal-width Binning
Equal-width binning divides the range of Expend into intervals of the same size. This is useful for evenly spaced data, but can result in bins with very different numbers of universities if the data is skewed.
We’ll use the Pandas function cut() to create three equal-width bins for instructional expenditure.
#Using the cut() function in Pandas to bin "Expend"
Binned_expend = pd.cut(college['Expend'],3,retbins = True)
Binned_expend
(Abilene Christian University    (3132.953, 20868.333]
Adelphi University (3132.953, 20868.333]
Adrian College (3132.953, 20868.333]
Agnes Scott College (3132.953, 20868.333]
Alaska Pacific University (3132.953, 20868.333]
...
Worcester State College (3132.953, 20868.333]
Xavier University (3132.953, 20868.333]
Xavier University of Louisiana (3132.953, 20868.333]
Yale University (38550.667, 56233.0]
York College of Pennsylvania (3132.953, 20868.333]
Name: Expend, Length: 777, dtype: category
Categories (3, interval[float64, right]): [(3132.953, 20868.333] < (20868.333, 38550.667] < (38550.667, 56233.0]],
array([ 3132.953 , 20868.33333333, 38550.66666667, 56233. ]))
With retbins = True, the cut() function returns a tuple of length 2. The first element is the binned data (a categorical Series of intervals), while the second element is an array containing the cut-off values for the bins.
type(Binned_expend)
tuple
len(Binned_expend)
2
Once the bins are obtained, we’ll add a column in the dataset that indicates the bin for Expend.
#Creating a categorical variable to store the level of expenditure on a student
college['Expend_bin'] = Binned_expend[0]
college.head()
| | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate | Expend_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | 3300 | 450 | 2200 | 70 | 78 | 18.1 | 12 | 7041 | 60 | (3132.953, 20868.333] |
| Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | 6450 | 750 | 1500 | 29 | 30 | 12.2 | 16 | 10527 | 56 | (3132.953, 20868.333] |
| Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | 3750 | 400 | 1165 | 53 | 66 | 12.9 | 30 | 8735 | 54 | (3132.953, 20868.333] |
| Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | 5450 | 450 | 875 | 92 | 97 | 7.7 | 37 | 19016 | 59 | (3132.953, 20868.333] |
| Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | 4120 | 800 | 1500 | 76 | 72 | 11.9 | 2 | 10922 | 15 | (3132.953, 20868.333] |
See the variable Expend_bin in the above dataset.
Let us visualize the Expend bins over the distribution of the Expend variable.
#Visualizing the bins for instructional expenditure per student
sns.set(font_scale=1.25)
plt.rcParams["figure.figsize"] = (9,6)
ax=sns.histplot(data = college, x= 'Expend')
plt.vlines(Binned_expend[1], 0,100,color='red')
plt.xlabel('Expenditure per student');
plt.ylabel('Frequency');
ax.xaxis.set_major_formatter('${x:,.0f}')
By default, the bins created have equal width: the range between the maximum and minimum value of Expend is divided into the desired number of equal-width intervals.
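As an optional check, we can reproduce these cut-offs from the equal-width formula; note that cut() pads the lowest edge slightly (by 0.1% of the range) so that the minimum value falls inside the first bin:
# Optional check: the equal-width cut-offs are (approximately) min + i*(max - min)/3
expend_min, expend_max = college['Expend'].min(), college['Expend'].max()
width = (expend_max - expend_min) / 3
print([round(expend_min + i * width, 3) for i in range(4)])
print(Binned_expend[1])  # cut-offs returned by cut() with retbins = True
We can also label the bins, as follows.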
college['Expend_bin'] = pd.cut(college['Expend'],3,labels = ['Low expend','Med expend','High expend'])
college['Expend_bin']
Abilene Christian University    Low expend
Adelphi University Low expend
Adrian College Low expend
Agnes Scott College Low expend
Alaska Pacific University Low expend
...
Worcester State College Low expend
Xavier University Low expend
Xavier University of Louisiana Low expend
Yale University High expend
York College of Pennsylvania Low expend
Name: Expend_bin, Length: 777, dtype: category
Categories (3, object): ['Low expend' < 'Med expend' < 'High expend']
Now that we have binned the variable Expend, let us see if we can better visualize the association of graduation rate with expenditure per student using Expend_bin.
#Visualizing average graduation rate vs categories of instructional expenditure per student
sns.barplot(x = 'Expend_bin', y = 'Grad.Rate', data = college)
It seems that the graduation rate is the highest for universities with a medium level of expenditure per student. This is different from the trend we saw earlier in the scatter plot. Let us investigate.
Let us find the number of universities in each bin.
college['Expend_bin'].value_counts()
Expend_bin
Low expend 751
Med expend 21
High expend 5
Name: count, dtype: int64
The bin High expend consists of only 5 universities, or 0.6% of all the universities in the dataset. These universities may be outliers that are skewing the trend (as also evident in the histogram above).
Let us see if we recover the expected trend once the outliers are removed from the data.
14.3.1.1 Removing outliers in Expend (setup for fair comparisons)
Before we compare graduation rates across expenditure levels, we’ll remove extreme outliers in instructional expenditure (Expend) so they don’t distort averages. We’ll use the IQR rule (Q1 − 1.5×IQR, Q3 + 1.5×IQR) to define lower and upper bounds and filter the dataset. This also defines lower_bound and upper_bound used later.
#Data without outliers
Q1 = college['Expend'].quantile(0.25)
Q3 = college['Expend'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
college_data_without_outliers = college[((college.Expend >= lower_bound) & (college.Expend <= upper_bound))]
Let’s visualize Expend before and after outlier removal, showing the IQR bounds.
fig, axes = plt.subplots(1, 2, figsize=(14, 4))
sns.histplot(college['Expend'], ax=axes[0], color='tab:blue')
axes[0].axvline(lower_bound, color='red', linestyle='--', label='Lower bound')
axes[0].axvline(upper_bound, color='red', linestyle='--', label='Upper bound')
axes[0].set_title('Original Expend (with outliers)')
axes[0].set_xlabel('Expenditure per student')
axes[0].set_ylabel('Frequency')
axes[0].xaxis.set_major_formatter('${x:,.0f}')
axes[0].legend()
sns.histplot(college_data_without_outliers['Expend'], ax=axes[1], color='tab:green')
axes[1].axvline(lower_bound, color='red', linestyle='--')
axes[1].axvline(upper_bound, color='red', linestyle='--')
axes[1].set_title('Expend after outlier removal (IQR)')
axes[1].set_xlabel('Expenditure per student')
axes[1].set_ylabel('Frequency')
axes[1].xaxis.set_major_formatter('${x:,.0f}')
plt.tight_layout()
plt.show()
print(f"Q1 = {Q1:,.0f}, Q3 = {Q3:,.0f}")
print(f"IQR = {IQR:,.0f}")
print(f"Lower bound = {lower_bound:,.0f}, Upper bound = {upper_bound:,.0f}")
Q1 = 6,751, Q3 = 10,830
IQR = 4,079
Lower bound = 632, Upper bound = 16,948
Note that the right tail of the histogram has disappeared since we removed outliers.
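As a quick sanity check, we can count how many universities the IQR rule removed:
# Number and share of universities dropped as Expend outliers
n_removed = len(college) - len(college_data_without_outliers)
print(f"{n_removed} universities removed ({n_removed / len(college):.1%} of the data)")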
Let’s recreate the equal-width bins on the data without outliers and visualize the relationship.
Binned_data = pd.cut(college_data_without_outliers['Expend'],3,labels = ['Low expend','Med expend','High expend'],retbins = True)
college_data_without_outliers.loc[:,'Expend_bin'] = Binned_data[0]
sns.barplot(x = 'Expend_bin', y = 'Grad.Rate', data = college_data_without_outliers)
With the outliers removed, we recover the expected overall trend, even with equal-width bins. Note that these bins contain unequal numbers of observations, as shown below.
ax=sns.histplot(data = college_data_without_outliers, x= 'Expend')
for i in range(4):
    plt.axvline(Binned_data[1][i], color='red')
plt.xlabel('Expenditure per student');
plt.ylabel('Frequency');
ax.xaxis.set_major_formatter('${x:,.0f}')
college_data_without_outliers['Expend_bin'].value_counts()
Expend_bin
Med expend 327
Low expend 314
High expend 88
Name: count, dtype: int64
Instead of removing outliers, we can address the issue by binning observations into equal-sized bins, ensuring each bin has the same number of observations. Let’s explore this approach next.
14.3.2 Equal-size (quantile) Binning
Equal-sized (quantile) binning splits a variable so that each bin contains approximately the same number of observations. This is especially helpful for skewed distributions and makes comparisons across groups fair. We’ll use pd.qcut() to divide Expend into three equally sized groups (tertiles).
Let us bin the variable Expend such that each bin consists of the same number of observations.
#Using the Pandas function qcut() to create bins with the same number of observations
Binned_expend = pd.qcut(college['Expend'],3,retbins = True)
college['Expend_bin'] = Binned_expend[0]
Let’s check the number of observations in each bin.
college['Expend_bin'].value_counts()
Expend_bin
(3185.999, 7334.333] 259
(7334.333, 9682.333] 259
(9682.333, 56233.0] 259
Name: count, dtype: int64
Let us visualize the Expend bins over the distribution of the Expend variable.
#Visualizing the bins for instructional expenditure per student
sns.set(font_scale=1.25)
plt.rcParams["figure.figsize"] = (9,6)
ax=sns.histplot(data = college, x= 'Expend')
plt.vlines(Binned_expend[1], 0,100,color='red')
plt.xlabel('Expenditure per student');
plt.ylabel('Frequency');
ax.xaxis.set_major_formatter('${x:,.0f}')
Note that the bin-widths have been adjusted to have the same number of observations in each bin. The bins are narrower in domains of high density, and wider in domains of sparse density.
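We can confirm this by computing the widths of the three quantile bins from the cut-offs returned by qcut() (a quick check):
# Widths of the quantile bins: narrow where the data is dense, wide in the sparse right tail
print(np.diff(Binned_expend[1]).round(1))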
Let us again make the barplot visualizing the average graduation rate by level of instructional expenditure per student.
college['Expend_bin'] = pd.qcut(college['Expend'],3,labels = ['Low expend','Med expend','High expend'])
a=sns.barplot(x = 'Expend_bin', y = 'Grad.Rate', data = college)
Now we see the same trend that we saw in the scatterplot, but without the noise. We have smoothed the data. Note that making equal-sized bins helps reduce the effect of outliers in the overall trend.
Suppose this analysis was done to provide recommendations to universities for increasing their graduation rate. With binning, we can provide one recommendation to Low expend universities, and another one to Med expend universities. For example, the recommendations can be:
- Low expend universities can expect an increase of about 9 percentage points in Grad.Rate if they migrate to the Med expend category.
- Med expend universities can expect an increase of about 7 percentage points in Grad.Rate if they migrate to the High expend category.
The numbers in the above recommendations are based on the table below.
college.groupby(college.Expend_bin, observed=False)['Grad.Rate'].mean()
Expend_bin
Low expend 57.343629
Med expend 66.057915
High expend 72.988417
Name: Grad.Rate, dtype: float64
We can also provide recommendations based on the confidence intervals of the mean graduation rate (Grad.Rate). The confidence intervals, calculated below, are derived using a method known as bootstrapping. For a detailed explanation of bootstrapping, please refer to the Wikipedia article on the topic.
# Bootstrapping to find 95% confidence intervals of Graduation Rate of US universities based on average expenditure per student
confidence_intervals = {}
# Loop through each expenditure bin
for expend_bin, data_sub in college.groupby('Expend_bin', observed=False):
    # Generate bootstrap samples for the graduation rate
    samples = np.random.choice(data_sub['Grad.Rate'], size=(10000, len(data_sub)), replace=True)
    # Calculate the mean of each sample
    sample_means = samples.mean(axis=1)
    # Calculate the 95% confidence interval
    lower_bound = np.percentile(sample_means, 2.5)
    upper_bound = np.percentile(sample_means, 97.5)
    # Store the result in the dictionary
    confidence_intervals[expend_bin] = (round(lower_bound, 2), round(upper_bound, 2))
# Print the results
for expend_bin, ci in confidence_intervals.items():
    print(f"95% Confidence interval of Grad.Rate for {expend_bin} universities = [{ci[0]}, {ci[1]}]")
95% Confidence interval of Grad.Rate for Low expend universities = [55.36, 59.36]
95% Confidence interval of Grad.Rate for Med expend universities = [64.15, 67.95]
95% Confidence interval of Grad.Rate for High expend universities = [71.03, 74.92]
Apart from equal-width and equal-sized bins, custom bins can be created using the bins argument of cut(). Suppose bins are to be created for Expend with cutoffs \(\$10{,}000, \$20{,}000, \ldots, \$60{,}000\). Then we can use the bins argument as in the code below.
14.3.3 Custom Binning
Sometimes your analysis calls for bins that match business rules or domain thresholds rather than equal-width or equal-size splits. With pd.cut(), you can supply exact edges and readable labels to get bins that make sense for your audience.
Practical tips:
- Choose edges based on policy or domain knowledge (e.g., <$10K, $10K–$40K, >$40K).
- Use include_lowest=True to ensure the minimum value is captured by the first bin, and right to control which side of the interval is closed (default is right-closed).
- To avoid leaving values outside your bins, use open-ended edges with -np.inf and np.inf.
- Provide human-friendly labels; the result is an ordered categorical, so plots and tables keep the intended order.
We’ll demonstrate two patterns:
- Fixed-width $10K bins (for a quick grid-like view)
- Domain-based bins (Low/Medium/High) with open-ended edges
# Example 1: fixed-width $10K bins up to $60K
edges_10k = list(range(0, 70000, 10000))
Expend_bins_10k = pd.cut(
college['Expend'],
bins=edges_10k,
include_lowest=True,
right=False, # intervals are [left, right)
ordered=True,
retbins=True
)
# Save the categorical bins as a new column
college['Expend_bin_custom_10k'] = Expend_bins_10k[0]
# Show counts and proportions per bin
counts_10k = college['Expend_bin_custom_10k'].value_counts().sort_index()
summary_10k = pd.DataFrame({
'count': counts_10k,
'proportion': (counts_10k / len(college)).round(3)
})
summary_10k
| | count | proportion |
|---|---|---|
| Expend_bin_custom_10k | ||
| [0, 10000) | 535 | 0.689 |
| [10000, 20000) | 214 | 0.275 |
| [20000, 30000) | 19 | 0.024 |
| [30000, 40000) | 4 | 0.005 |
| [40000, 50000) | 4 | 0.005 |
| [50000, 60000) | 1 | 0.001 |
# Visualizing the fixed $10K bins over the Expend distribution
sns.set(font_scale=1.25)
plt.rcParams["figure.figsize"] = (9,6)
ax = sns.histplot(data=college, x='Expend')
for edge in edges_10k:
    ax.axvline(edge, color='red', linestyle='--', alpha=0.7)
plt.xlabel('Expenditure per student')
plt.ylabel('Frequency')
ax.xaxis.set_major_formatter('${x:,.0f}')
plt.show()
Next, let’s create a domain-based set of custom bins with unequal widths using open-ended edges. This pattern ensures every value is assigned to a bin (no NaNs from falling outside the range) and keeps the labels interpretable for stakeholders.
# Example 2: domain-based custom bins with open-ended edges
bins = [-np.inf, 10000, 40000, np.inf]
labels = ['Low Expenditure', 'Medium Expenditure', 'High Expenditure']
# Binning the 'Expend' column (ordered categorical)
college['Expend_bin_custom'] = pd.cut(
college['Expend'], bins=bins, labels=labels, ordered=True, include_lowest=True
)
# Visualizing the bins using a histogram
ax = sns.histplot(data=college, x='Expend', kde=False)
# Add vertical lines for the finite bin edges
for edge in bins:
    if np.isfinite(edge):
        ax.axvline(edge, color='red', linestyle='--', alpha=0.7)
# Labels and formatting
plt.xlabel('Expenditure per student')
plt.ylabel('Frequency')
ax.xaxis.set_major_formatter('${x:,.0f}')
plt.show()
# Check counts and proportions per custom (domain-based) bin
counts_custom = college['Expend_bin_custom'].value_counts().sort_index()
summary_custom = pd.DataFrame({
'count': counts_custom,
'proportion': (counts_custom / len(college)).round(3)
})
summary_custom
| | count | proportion |
|---|---|---|
| Expend_bin_custom | ||
| Low Expenditure | 535 | 0.689 |
| Medium Expenditure | 237 | 0.305 |
| High Expenditure | 5 | 0.006 |
Now let's compute the mean graduation rate for each custom bin. We use groupby with observed=False, which keeps the categorical bin order.
college.groupby(college.Expend_bin_custom, observed=False)['Grad.Rate'].mean().round(2)
Expend_bin_custom
Low Expenditure 61.79
Medium Expenditure 73.37
High Expenditure 83.20
Name: Grad.Rate, dtype: float64
# Draw bar plot using the groupby result
means_by_custom = college.groupby(college.Expend_bin_custom, observed=False)['Grad.Rate'].mean().round(2)
plt.figure(figsize=(7,4))
sns.barplot(x=means_by_custom.index.astype(str), y=means_by_custom.values, color='skyblue', edgecolor='black')
plt.xlabel('Expenditure bin (custom)')
plt.ylabel('Mean graduation rate')
plt.ylim(0, 100)
plt.tight_layout()
plt.show()
Just as custom bin cutoffs can be specified with the cut() function, custom bin quantiles can be specified with the qcut() function, as in the sketch below.
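Here, the quartile-based cut-points are chosen purely for illustration:
# Custom quantiles: bottom 25%, middle 50%, and top 25% of Expend
custom_quantile_bins = pd.qcut(college['Expend'], q=[0, 0.25, 0.75, 1.0], labels=['Bottom quartile', 'Middle half', 'Top quartile'])
print(custom_quantile_bins.value_counts().sort_index())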
14.4 Encoding Categorical Variables
Many machine learning algorithms require numerical input, so categorical variables (text labels or categories) must be converted to numbers. There are two main approaches: label encoding and one-hot encoding. The choice depends on the nature of the variable and the model you plan to use.
14.4.1 Label Encoding
Label encoding assigns each unique category in a variable a unique integer value. This is simple and memory-efficient, but it can introduce unintended ordinal relationships (e.g., 0 < 1 < 2) that may mislead some models.
When to use:
- For ordinal variables (where categories have a natural order, e.g., Low, Medium, High).
- For target variables in classification tasks, where labels must be numeric.

When to avoid:
- For nominal variables (no order), especially with linear models, as the numeric codes may be misinterpreted as ordered.
Let’s use the Private column (Yes/No) in the college dataset as an example.
# Example: Label encoding a categorical column
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
college['Private_label'] = le.fit_transform(college['Private'])
college[['Private', 'Private_label']].head()
| | Private | Private_label |
|---|---|---|
| Abilene Christian University | Yes | 1 |
| Adelphi University | Yes | 1 |
| Adrian College | Yes | 1 |
| Agnes Scott College | Yes | 1 |
| Alaska Pacific University | Yes | 1 |
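Note that LabelEncoder assigns integer codes in sorted (alphabetical) order, which happens to be reasonable for a binary Yes/No column. For ordinal variables with several levels, an explicit mapping keeps the order under your control; here is a minimal sketch using a hypothetical spend_level column (not part of the College dataset):
# Explicit ordinal encoding: we choose the order instead of relying on alphabetical codes
spend_level = pd.Series(['Low', 'High', 'Medium', 'Low'])  # hypothetical ordinal data
order = {'Low': 0, 'Medium': 1, 'High': 2}
print(spend_level.map(order))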
14.4.2 One-Hot Encoding
One-hot encoding creates a new binary column for each category, with a 1 indicating the presence of that category and 0 otherwise. This avoids introducing any ordinal relationship between categories.
When to use:
- For nominal variables (no natural order), e.g., State, Major, Color.
- For most linear models and neural networks.

When to avoid:
- When the variable has many unique categories (can create too many columns, known as the curse of dimensionality).
14.4.2.1 How to Create Dummy variables
The pandas library in Python has a built-in function, pd.get_dummies(), to generate dummy variables. If a column in a DataFrame has \(k\) distinct values, we will get a DataFrame with \(k\) columns of 0s and 1s.
Let us make dummy variables with the equal-sized bins we created for the average instruction expenditure per student.
#Using the Pandas function qcut() to create bins with the same number of observations
Binned_expend = pd.qcut(college['Expend'],3,retbins = True,labels = ['Low_expend','Med_expend','High_expend'])
college['Expend_bin'] = Binned_expend[0]
#Making dummy variables based on the levels (categories) of the 'Expend_bin' variable
dummy_Expend = pd.get_dummies(college['Expend_bin'], dtype=int)
The dummy data dummy_Expend has a value of \(1\) if the observation corresponds to the category referenced by the column name.
dummy_Expend.head()
| | Low_expend | Med_expend | High_expend |
|---|---|---|---|
| 0 | 1 | 0 | 0 |
| 1 | 0 | 0 | 1 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 |
Adding the dummy columns back to the original dataframe
#Concatenating the dummy variables to the original data
college = pd.concat([college,dummy_Expend],axis=1)
print(college.shape)
college.head()
(777, 26)
| | Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | ... | perc.alumni | Expend | Grad.Rate | z_score | is_z_score_outlier | is_IQR_outlier | Expend_bin | Low_expend | Med_expend | High_expend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | ... | 12 | 7041 | 60 | -0.501910 | False | False | Low_expend | 1 | 0 | 0 |
| 1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | ... | 16 | 10527 | 56 | 0.166110 | False | False | High_expend | 0 | 0 | 1 |
| 2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | ... | 30 | 8735 | 54 | -0.177290 | False | False | Med_expend | 0 | 1 | 0 |
| 3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | ... | 37 | 19016 | 59 | 1.792851 | False | True | High_expend | 0 | 0 | 1 |
| 4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | ... | 2 | 10922 | 15 | 0.241803 | False | False | High_expend | 0 | 0 | 1 |
5 rows × 26 columns
We can find the correlation between the dummy variables and graduation rate to identify if any of the dummy variables will be useful to estimate graduation rate (Grad.Rate).
#Finding if dummy variables will be useful to estimate 'Grad.Rate'
dummy_Expend.corrwith(college['Grad.Rate'])
Low_expend    -0.334456
Med_expend 0.024492
High_expend 0.309964
dtype: float64
The dummy variables Low_expend and High_expend may help explain Grad.Rate in a regression model.
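As a brief illustration (a sketch, not part of this chapter's main workflow), the dummy variables can be used directly as predictors in a linear regression of Grad.Rate. One category is left out as the baseline, which previews the drop_first idea discussed next:
# Minimal sketch: regress Grad.Rate on two of the expenditure dummies (Low_expend is the baseline)
from sklearn.linear_model import LinearRegression
X = dummy_Expend[['Med_expend', 'High_expend']]
y = college['Grad.Rate']
reg = LinearRegression().fit(X, y)
print(dict(zip(X.columns, reg.coef_.round(2))), 'intercept:', round(reg.intercept_, 2))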
14.4.2.2 Using drop_first to Avoid Multicollinearity
In cases like linear regression, you may want to drop one dummy variable to avoid multicollinearity (perfect correlation between variables).
# create dummy variables, dropping the first to avoid multicollinearity
dummy_Expend = pd.get_dummies(college['Expend_bin'], dtype=int, drop_first=True)
# add the dummy variables to the original dataset
college = pd.concat([college, dummy_Expend], axis=1)
college.head()
| | Unnamed: 0 | Private | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | ... | Grad.Rate | z_score | is_z_score_outlier | is_IQR_outlier | Expend_bin | Low_expend | Med_expend | High_expend | Med_expend | High_expend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Abilene Christian University | Yes | 1660 | 1232 | 721 | 23 | 52 | 2885 | 537 | 7440 | ... | 60 | -0.501910 | False | False | Low_expend | 1 | 0 | 0 | 0 | 0 |
| 1 | Adelphi University | Yes | 2186 | 1924 | 512 | 16 | 29 | 2683 | 1227 | 12280 | ... | 56 | 0.166110 | False | False | High_expend | 0 | 0 | 1 | 0 | 1 |
| 2 | Adrian College | Yes | 1428 | 1097 | 336 | 22 | 50 | 1036 | 99 | 11250 | ... | 54 | -0.177290 | False | False | Med_expend | 0 | 1 | 0 | 1 | 0 |
| 3 | Agnes Scott College | Yes | 417 | 349 | 137 | 60 | 89 | 510 | 63 | 12960 | ... | 59 | 1.792851 | False | True | High_expend | 0 | 0 | 1 | 0 | 1 |
| 4 | Alaska Pacific University | Yes | 193 | 146 | 55 | 16 | 44 | 249 | 869 | 7560 | ... | 15 | 0.241803 | False | False | High_expend | 0 | 0 | 1 | 0 | 1 |
5 rows × 28 columns
14.5 Feature Scaling
Feature scaling is a crucial step in data preparation, especially for machine learning algorithms that are sensitive to the scale of input features. Scaling ensures that numerical features contribute equally to model training and prevents features with larger ranges from dominating those with smaller ranges.
Why scale features?
- Many algorithms (e.g., k-nearest neighbors, k-means, principal component analysis, gradient descent-based models) compute distances or rely on feature magnitude.
- Features with larger scales can unduly influence the model if not scaled.
Common scaling methods:
- Min-Max Scaling: Rescales features to a fixed range, usually [0, 1].
- Standardization (Z-score): Centers features by removing the mean and scaling to unit variance.
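In formulas, with \(x_{\min}\) and \(x_{\max}\) the minimum and maximum of a feature, and \(\mu\) and \(\sigma\) its mean and standard deviation:
\[
x_{\text{min-max}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}, \qquad
x_{\text{standardized}} = \frac{x - \mu}{\sigma}
\]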
Note: Only continuous numerical features should be scaled. Do not scale encoded categorical variables (such as label-encoded or one-hot encoded columns). Scaling these can distort their meaning and introduce unintended relationships. For example, one-hot encoded columns are already binary (0/1), and scaling them would break their interpretation as category indicators.
14.5.1 How to Scale Features in Practice
Let’s demonstrate how to scale the continuous numerical features in the college dataset. We’ll use both Min-Max scaling and Standardization (Z-score). We’ll exclude all categorical and encoded categorical variables from scaling, including:
- The original Private column (categorical)
- Any label-encoded columns (e.g., Private_label)
- Any one-hot/dummy variables (e.g., columns created by pd.get_dummies)
Scaling these columns would destroy their categorical meaning. For example, one-hot encoded columns are meant to be 0 or 1, representing category membership. Scaling would turn them into non-binary values, which is not meaningful for models.
# Select only continuous numerical columns for scaling
# Exclude categorical, label-encoded, and dummy variables
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# Identify columns to exclude (categorical and encoded)
categorical_cols = ['Private', 'Private_label']
dummy_cols = [col for col in college.columns if col in ['Low_expend', 'Med_expend', 'High_expend'] or set(col.split('_')).intersection({'Low', 'Med', 'High'})]
# Get all numeric columns
numeric_cols = college.select_dtypes(include=[float, int]).columns.tolist()
# Remove encoded categorical columns from scaling
cols_to_scale = [col for col in numeric_cols if col not in categorical_cols + dummy_cols]
# Min-Max Scaling
minmax_scaler = MinMaxScaler()
college_minmax_scaled = college.copy()
college_minmax_scaled[cols_to_scale] = minmax_scaler.fit_transform(college[cols_to_scale])
# Standardization (Z-score)
standard_scaler = StandardScaler()
college_standard_scaled = college.copy()
college_standard_scaled[cols_to_scale] = standard_scaler.fit_transform(college[cols_to_scale])
# Show the first few rows of the scaled data
college_minmax_scaled[cols_to_scale].head()
| | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abilene Christian University | 0.032887 | 0.044177 | 0.107913 | 0.231579 | 0.472527 | 0.087164 | 0.024548 | 0.263430 | 0.239596 | 0.157754 | 0.297710 | 0.652632 | 0.710526 | 0.418231 | 0.187500 | 0.072671 | 0.462963 |
| Adelphi University | 0.043842 | 0.070531 | 0.075035 | 0.157895 | 0.219780 | 0.080752 | 0.056148 | 0.513430 | 0.736129 | 0.291444 | 0.190840 | 0.221053 | 0.078947 | 0.260054 | 0.250000 | 0.138387 | 0.425926 |
| Adrian College | 0.028055 | 0.039036 | 0.047349 | 0.221053 | 0.450549 | 0.028473 | 0.004488 | 0.460227 | 0.310530 | 0.135472 | 0.139695 | 0.473684 | 0.552632 | 0.278820 | 0.468750 | 0.104605 | 0.407407 |
| Agnes Scott College | 0.006998 | 0.010549 | 0.016045 | 0.621053 | 0.879121 | 0.011776 | 0.002839 | 0.548554 | 0.578499 | 0.157754 | 0.095420 | 0.884211 | 0.960526 | 0.139410 | 0.578125 | 0.298415 | 0.453704 |
| Alaska Pacific University | 0.002333 | 0.002818 | 0.003146 | 0.157895 | 0.384615 | 0.003492 | 0.039753 | 0.269628 | 0.368852 | 0.313725 | 0.190840 | 0.715789 | 0.631579 | 0.252011 | 0.031250 | 0.145833 | 0.046296 |
# Show the first few rows of the standardized (Z-score) scaled data
college_standard_scaled[cols_to_scale].head()
| | Apps | Accept | Enroll | Top10perc | Top25perc | F.Undergrad | P.Undergrad | Outstate | Room.Board | Books | Personal | PhD | Terminal | S.F.Ratio | perc.alumni | Expend | Grad.Rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Abilene Christian University | -0.346882 | -0.321205 | -0.063509 | -0.258583 | -0.191827 | -0.168116 | -0.209207 | -0.746356 | -0.964905 | -0.602312 | 1.270045 | -0.163028 | -0.115729 | 1.013776 | -0.867574 | -0.501910 | -0.318252 |
| Adelphi University | -0.210884 | -0.038703 | -0.288584 | -0.655656 | -1.353911 | -0.209788 | 0.244307 | 0.457496 | 1.909208 | 1.215880 | 0.235515 | -2.675646 | -3.378176 | -0.477704 | -0.544572 | 0.166110 | -0.551262 |
| Adrian College | -0.406866 | -0.376318 | -0.478121 | -0.315307 | -0.292878 | -0.549565 | -0.497090 | 0.201305 | -0.554317 | -0.905344 | -0.259582 | -1.204845 | -0.931341 | -0.300749 | 0.585935 | -0.177290 | -0.667767 |
| Agnes Scott College | -0.668261 | -0.681682 | -0.692427 | 1.840231 | 1.677612 | -0.658079 | -0.520752 | 0.626633 | 0.996791 | -0.602312 | -0.688173 | 1.185206 | 1.175657 | -1.615274 | 1.151188 | 1.792851 | -0.376504 |
| Alaska Pacific University | -0.726176 | -0.764555 | -0.780735 | -0.655656 | -0.596031 | -0.711924 | 0.009005 | -0.716508 | -0.216723 | 1.518912 | 0.235515 | 0.204672 | -0.523535 | -0.553542 | -1.675079 | 0.241803 | -2.939613 |
Summary: When and What to Scale
- Always scale continuous numerical features before using algorithms sensitive to feature magnitude (e.g., k-means, k-NN, PCA, neural networks).
- Never scale categorical variables or their encoded versions (label-encoded or one-hot/dummy variables). Scaling these destroys their categorical meaning and can confuse models.
- For tree-based models (e.g., decision trees, random forests), scaling is not required, as these models are insensitive to feature scale.
14.6 Independent Study
14.6.1 Practice exercise 1
Read survey_data_clean.csv. Transform all columns with categorical values into dummy variables, with each category corresponding to a column of 0s and 1s. Leave the Timestamp column as it is.
Once all categorical columns are transformed into dummy variables, all columns will have numeric values.
What is the total number of columns in the transformed data? What is the total number of columns of the original data?
Find the:
- Top 5 variables having the highest positive correlation with NU_GPA.
- Top 5 variables having the highest negative correlation with NU_GPA.
14.6.2 Practice exercise 2
Consider the dataset survey_data_clean.csv. Find the number of outliers in each column of the dataset based on both the Z-score and IQR criteria. Do not use a for loop.
Which column(s) have the maximum number of outliers using Z-score?
Which column(s) have the maximum number of outliers using IQR?
Do you think the outlying observations identified for those column(s) should be considered outliers? If not, then which types of columns should be considered when finding outliers?