4  Extending Linear Regression (PolynomialFeatures in Sklearn)

4.0.1 Simulate Data

import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(42)

# Number of samples
N = 5000

# Generate features from uniform distributions
x1 = np.random.uniform(-5, 5, N)
x2 = np.random.uniform(-5, 5, N)

# Define the nonlinear relationship and add noise
y = 1.5 * (x1 ** 2) + 0.5 * (x2 ** 3) + np.random.normal(loc=3, scale=3, size=N)

# Create a pandas DataFrame
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# Save to CSV (optional)
df.to_csv('nonlinear_dataset.csv', index=False)

df.head(10)  # Display the first 10 rows
         x1        x2          y
0 -1.254599 -1.063645   0.295770
1  4.507143 -0.265643  30.086577
2  2.319939  3.545474  34.523622
3  0.986585 -1.599956  -1.109427
4 -3.439814  3.696497  49.341010
5 -3.440055 -4.118656 -14.395442
6 -4.419164  2.767984  43.154083
7  3.661761  3.475476  43.267654
8  1.011150 -3.181823  -9.254211
9  2.080726 -0.696535  11.674644
# Create X and y arrays
X = df[['x1', 'x2']].values
y = df['y'].values

4.0.2 Train-Test Split

from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
Training set shape: (4000, 2)
Testing set shape: (1000, 2)

4.0.3 Baseline Model (Original Features)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a linear regression model
baseline_model = LinearRegression()

# Train the model on the original features
baseline_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_baseline = baseline_model.predict(X_test)

# Make predictions on the training set
y_train_pred_baseline = baseline_model.predict(X_train)

# Evaluate the baseline model
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
r2_baseline = r2_score(y_test, y_pred_baseline)

# Evaluate the baseline model on the training set
mse_train_baseline = mean_squared_error(y_train, y_train_pred_baseline)
r2_train_baseline = r2_score(y_train, y_train_pred_baseline)

print("\nBaseline Model Performance:")
print("Training Set:")
print("MSE:", mse_train_baseline)
print("R2:", r2_train_baseline)
print("\nTesting Set:")
print("MSE:", mse_baseline)
print("R2:", r2_baseline)

Baseline Model Performance:
Training Set:
MSE: 218.84992612063238
R2: 0.6720857612554578

Testing Set:
MSE: 218.05222660659857
R2: 0.6745129537467693

4.0.4 Transform Features with PolynomialFeatures (degree = 2)

from sklearn.preprocessing import PolynomialFeatures

# Create PolynomialFeatures object with degree=2 (includes interaction terms)
poly_2 = PolynomialFeatures(degree=2, include_bias=False)

# Transform the training and testing features
X_train_poly_2 = poly_2.fit_transform(X_train)
X_test_poly_2 = poly_2.transform(X_test)

# Display the transformed feature names
print("\nTransformed Feature Names:")
print(poly_2.get_feature_names_out())

Transformed Feature Names:
['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
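
These five columns feed into an ordinary linear regression in the next step: the model stays linear in its coefficients even though it is nonlinear in the original inputs. With scikit-learn's x0/x1 naming for the original x1 and x2 columns, the fitted equation takes the form

$\hat{y} = \beta_0 + \beta_1 x_0 + \beta_2 x_1 + \beta_3 x_0^2 + \beta_4 x_0 x_1 + \beta_5 x_1^2$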

4.0.5 Linear Model with Transformed Features (degree = 2)

# Create a linear regression model for the polynomial features
poly_2_model = LinearRegression()

# Train the model on the transformed features
poly_2_model.fit(X_train_poly_2, y_train)

# Make predictions on the test set
y_pred_poly_2 = poly_2_model.predict(X_test_poly_2)

# Make predictions on the training set
y_train_pred_poly_2 = poly_2_model.predict(X_train_poly_2)

# Evaluate the polynomial model
mse_poly_2 = mean_squared_error(y_test, y_pred_poly_2)
r2_poly_2 = r2_score(y_test, y_pred_poly_2)

# Evaluate the polynomial model on the training set
mse_train_poly_2 = mean_squared_error(y_train, y_train_pred_poly_2)
r2_train_poly_2 = r2_score(y_train, y_train_pred_poly_2)

print("\nPolynomial Model Performance:")
print("Training Set:")
print("MSE:", mse_train_poly_2)
print("R2:", r2_train_poly_2)
print("\nTesting Set:")
print("MSE:", mse_poly_2)
print("R2:", r2_poly_2)

Polynomial Model Performance:
Training Set:
MSE: 95.31378244224537
R2: 0.8571863972474734

Testing Set:
MSE: 90.00139440016171
R2: 0.8656547173222298

4.0.6 Transform Features with PolynomialFeatures (degree = 3)

# Create PolynomialFeatures object with degree=3 (includes interaction terms)
poly_3 = PolynomialFeatures(degree=3, include_bias=False)

# Transform the training and testing features
X_train_poly_3 = poly_3.fit_transform(X_train)
X_test_poly_3 = poly_3.transform(X_test)

# Display the transformed feature names
print("\nTransformed Feature Names:")
print(poly_3.get_feature_names_out())

Transformed Feature Names:
['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2' 'x0^3' 'x0^2 x1' 'x0 x1^2' 'x1^3']

4.0.7 Linear Model with Transformed Features (degree = 3)

# Create a linear regression model for the polynomial features
poly_3_model = LinearRegression()

# Train the model on the transformed features
poly_3_model.fit(X_train_poly_3, y_train)

# Make predictions on the test set
y_pred_poly_3 = poly_3_model.predict(X_test_poly_3)

# Make predictions on the training set
y_pred_train_poly_3 = poly_3_model.predict(X_train_poly_3)

# Evaluate the polynomial model
mse_poly_3 = mean_squared_error(y_test, y_pred_poly_3)
r2_poly_3 = r2_score(y_test, y_pred_poly_3)

# Evaluate the polynomial model on the training set
mse_poly_3_train = mean_squared_error(y_train, y_pred_train_poly_3)
r2_poly_3_train = r2_score(y_train, y_pred_train_poly_3)

print("\nPolynomial Model Performance:")
print("Training Set:")
print("MSE:", mse_poly_3_train)
print("R2:", r2_poly_3_train)
print("\nTesting Set:")
print("MSE:", mse_poly_3)
print("R2:", r2_poly_3)

Polynomial Model Performance:
Training Set:
MSE: 8.64661284180024
R2: 0.9870443297925774

Testing Set:
MSE: 9.034015244301722
R2: 0.9865149052434146
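
Since the degree-3 expansion contains the true generating terms x0^2 and x1^3 (recall that y was simulated as $1.5 x_1^2 + 0.5 x_2^3$ plus noise with mean 3), it is worth checking whether the fit recovers them. A minimal sketch for inspecting the learned coefficients (output not shown; the coefficients on x0^2 and x1^3 should land near 1.5 and 0.5, and the intercept near 3):

# Pair each transformed feature name with its fitted coefficient
for name, coef in zip(poly_3.get_feature_names_out(), poly_3_model.coef_):
    print(f"{name:>10}: {coef: .4f}")
print("Intercept:", poly_3_model.intercept_)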

4.0.8 Putting It All Together

# Create a DataFrame comparing the three models: name, number of features, training/testing MSE, and training/testing R2
models = ['Baseline', 'Polynomial Degree 2', 'Polynomial Degree 3']
features = [X.shape[1], len(poly_2.get_feature_names_out()), len(poly_3.get_feature_names_out())]
training_mse = [mse_train_baseline, mse_train_poly_2, mse_poly_3_train]
testing_mse = [mse_baseline, mse_poly_2, mse_poly_3]
training_r2 = [r2_train_baseline, r2_train_poly_2, r2_poly_3_train]
testing_r2 = [r2_baseline, r2_poly_2, r2_poly_3]

model_comparison = pd.DataFrame({   
    'Model': models,
    'Features': features,
    'Training MSE': training_mse,
    'Testing MSE': testing_mse,
    'Training R2': training_r2,
    'Testing R2': testing_r2
})

model_comparison
                 Model  Features  Training MSE  Testing MSE  Training R2  Testing R2
0             Baseline         2    218.849926   218.052227     0.672086    0.674513
1  Polynomial Degree 2         5     95.313782    90.001394     0.857186    0.865655
2  Polynomial Degree 3         9      8.646613     9.034015     0.987044    0.986515
# Print the feature names and count for the polynomial degree 3 model
print("\nTransformed Feature Names:")
print(poly_3.get_feature_names_out())
len(poly_3.get_feature_names_out())  # number of generated features

Transformed Feature Names:
['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2' 'x0^3' 'x0^2 x1' 'x0 x1^2' 'x1^3']
9
# Print the feature names and count for the polynomial degree 2 model
print("\nTransformed Feature Names:")
print(poly_2.get_feature_names_out())
len(poly_2.get_feature_names_out())  # number of generated features

Transformed Feature Names:
['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2']
5

4.0.9 Degree = 4

# Use polynomial degree 4 to see if it improves the model
poly_4 = PolynomialFeatures(degree=4, include_bias=False)

# Transform the training and testing features
X_train_poly_4 = poly_4.fit_transform(X_train)
X_test_poly_4 = poly_4.transform(X_test)

# Create a linear regression model for the polynomial features
poly_4_model = LinearRegression()

# Train the model on the transformed features
poly_4_model.fit(X_train_poly_4, y_train)

# Make predictions on the test set
y_pred_poly_4 = poly_4_model.predict(X_test_poly_4)

# Make predictions on the training set
y_pred_train_poly_4 = poly_4_model.predict(X_train_poly_4)

# Evaluate the polynomial model
mse_poly_4 = mean_squared_error(y_test, y_pred_poly_4)
r2_poly_4 = r2_score(y_test, y_pred_poly_4)

# Evaluate the polynomial model on the training set
mse_poly_4_train = mean_squared_error(y_train, y_pred_train_poly_4)
r2_poly_4_train = r2_score(y_train, y_pred_train_poly_4)

print("\nPolynomial Model Performance:")
print("Training Set:")
print("MSE:", mse_poly_4_train)
print("R2:", r2_poly_4_train)
print("\nTesting Set:")
print("MSE:", mse_poly_4)
print("R2:", r2_poly_4)

Polynomial Model Performance:
Training Set:
MSE: 8.633776015235346
R2: 0.98706356387817

Testing Set:
MSE: 8.991410070128255
R2: 0.9865785020821749
# get the feature names for the polynomial degree 4 model
print("\nNumber of Features:", len(poly_4.get_feature_names_out()))
print("\nTransformed Feature Names:")
print(poly_4.get_feature_names_out())

Number of Features: 14

Transformed Feature Names:
['x0' 'x1' 'x0^2' 'x0 x1' 'x1^2' 'x0^3' 'x0^2 x1' 'x0 x1^2' 'x1^3' 'x0^4'
 'x0^3 x1' 'x0^2 x1^2' 'x0 x1^3' 'x1^4']
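
The feature counts seen so far (2, 5, 9, 14) follow the usual combinatorial rule: with n original features and degree d, PolynomialFeatures with include_bias=False generates C(n + d, d) - 1 columns (a binomial coefficient minus the bias column). A quick check that these counts match the numbers above:

from math import comb

n = 2  # number of original features
for d in [1, 2, 3, 4]:
    print(f"degree {d}: {comb(n + d, d) - 1} features")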

4.1 Key Takeaway

In scikit-learn, the built-in PolynomialFeatures transformer is somewhat “all or nothing”: by default, it generates all polynomial terms (including interactions) up to a certain degree. You can toggle:

  • interaction_only=True to generate only cross-terms,
  • include_bias=False to exclude the constant (bias) term, and
  • degree to control how high the polynomial powers go.

However, if you want fine-grained control over exactly which terms are generated (for example, only certain interaction terms, or only a subset of the polynomial terms), you will need to create those features manually or write a custom transformer (skipped here as it is beyond the beginner level).

Use interaction_only for Cross Terms Only

If your goal is only to capture interaction terms (i.e., $x_1 x_2$, but no squares, cubes, etc.), you can set:

poly_int = PolynomialFeatures(degree=6,
                              interaction_only=True,
                              include_bias=False)

X_transformed = poly_int.fit_transform(X)

print("\nTransformed Feature Names:")
print(poly_int.get_feature_names_out())

Transformed Feature Names:
['x0' 'x1' 'x0 x1']

If you want to be very selective (say, just add $x_1^2$ and $x_1 x_2$ but not $x_2^2$), the simplest approach is to create the columns by hand. For example:

import numpy as np

X1 = X[:, 0].reshape(-1, 1)  # feature 1
X2 = X[:, 1].reshape(-1, 1)  # feature 2

# Manually create specific transformations
X1_sq = X1**2
X1X2 = X1 * X2

# Combine them as you like
X_new = np.hstack([X1, X2, X1_sq, X1X2])

print("\nTransformed Feature Names:")
print(['x1', 'x2', 'x1^2', 'x1*x2'])

X_new[:5]  # Display the first 5 rows

Transformed Feature Names:
['x1', 'x2', 'x1^2', 'x1*x2']
array([[ -1.25459881,  -1.0636448 ,   1.57401818,   1.3344475 ],
       [  4.50714306,  -0.26564341,  20.3143386 ,  -1.19729284],
       [  2.31993942,   3.54547393,   5.3821189 ,   8.22528473],
       [  0.98658484,  -1.59995614,   0.97334965,  -1.57849247],
       [ -3.4398136 ,   3.69649685,  11.83231757, -12.71526011]])

When using PolynomialFeatures (or any other scikit-learn transformer), the fitting step is always done on the training data—not on the test data. This is a fundamental principle of machine learning pipelines: we do not use the test set for any part of model training (including feature encoding, feature generation, scaling, etc.).
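
One convenient way to bake this rule into the code is scikit-learn's Pipeline, which fits every step on the training data and then only applies the already-fitted transformations to the test data. A minimal sketch using the degree-3 setup from this chapter:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Fitting the pipeline calls fit_transform on X_train only;
# scoring transforms X_test with the already-fitted PolynomialFeatures step.
poly_pipeline = Pipeline([
    ('poly', PolynomialFeatures(degree=3, include_bias=False)),
    ('linreg', LinearRegression())
])

poly_pipeline.fit(X_train, y_train)
print("Test R2:", poly_pipeline.score(X_test, y_test))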