4 Extending Linear Regression (PolynomialFeatures in Sklearn)
4.0.1 Simulate Data
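The code below simulates the data-generating process

$$
y = 1.5\,x_1^2 + 0.5\,x_2^3 + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(3,\ 3^2),
$$

with $x_1, x_2 \sim \text{Uniform}(-5, 5)$, so the response is deliberately nonlinear in both features.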
```python
import numpy as np
import pandas as pd

# Set a random seed for reproducibility
np.random.seed(42)

# Number of samples
N = 5000

# Generate features from uniform distributions
x1 = np.random.uniform(-5, 5, N)
x2 = np.random.uniform(-5, 5, N)

# Define the nonlinear relationship and add noise
y = 1.5 * (x1 ** 2) + 0.5 * (x2 ** 3) + np.random.normal(loc=3, scale=3, size=N)

# Create a pandas DataFrame
df = pd.DataFrame({'x1': x1, 'x2': x2, 'y': y})

# Save to CSV (optional)
df.to_csv('nonlinear_dataset.csv', index=False)

df.head(10)  # Display the first 10 rows
```
```
         x1        x2          y
0 -1.254599 -1.063645   0.295770
1  4.507143 -0.265643  30.086577
2  2.319939  3.545474  34.523622
3  0.986585 -1.599956  -1.109427
4 -3.439814  3.696497  49.341010
5 -3.440055 -4.118656 -14.395442
6 -4.419164  2.767984  43.154083
7  3.661761  3.475476  43.267654
8  1.011150 -3.181823  -9.254211
9  2.080726 -0.696535  11.674644
```
```python
# Create X and y arrays
X = df[['x1', 'x2']].values
y = df['y'].values
```
4.0.2 Train-Test Split
```python
from sklearn.model_selection import train_test_split

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape:", X_train.shape)
print("Testing set shape:", X_test.shape)
```
```
Training set shape: (4000, 2)
Testing set shape: (1000, 2)
```
4.0.3 Baseline Model (Original Features)
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Create a linear regression model
baseline_model = LinearRegression()

# Train the model on the original features
baseline_model.fit(X_train, y_train)

# Make predictions on the test set
y_pred_baseline = baseline_model.predict(X_test)

# Make predictions on the training set
y_train_pred_baseline = baseline_model.predict(X_train)

# Evaluate the baseline model
mse_baseline = mean_squared_error(y_test, y_pred_baseline)
r2_baseline = r2_score(y_test, y_pred_baseline)

# Evaluate the baseline model on the training set
mse_train_baseline = mean_squared_error(y_train, y_train_pred_baseline)
r2_train_baseline = r2_score(y_train, y_train_pred_baseline)

print("\nBaseline Model Performance:")
print("Training Set:")
print("MSE:", mse_train_baseline)
print("R2:", r2_train_baseline)
print("\nTesting Set:")
print("MSE:", mse_baseline)
print("R2:", r2_baseline)
```
```
Baseline Model Performance:
Training Set:
MSE: 218.84992612063238
R2: 0.6720857612554578

Testing Set:
MSE: 218.05222660659857
R2: 0.6745129537467693
```
4.0.4 Transform Features with PolynomialFeatures (degree = 2)
```python
from sklearn.preprocessing import PolynomialFeatures

# Create PolynomialFeatures object with degree=2 (includes interaction terms)
poly_2 = PolynomialFeatures(degree=2, include_bias=False)

# Transform the training and testing features
X_train_poly_2 = poly_2.fit_transform(X_train)
X_test_poly_2 = poly_2.transform(X_test)

# Display the transformed feature names
print("\nTransformed Feature Names:")
print(poly_2.get_feature_names_out())
```
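For $n$ input features and maximum degree $d$, PolynomialFeatures with include_bias=False generates

$$
\binom{n+d}{d} - 1
$$

columns (every monomial up to degree $d$, minus the constant). With $n = 2$ here, degree 2 yields $\binom{4}{2} - 1 = 5$ features ($x_1$, $x_2$, $x_1^2$, $x_1 x_2$, $x_2^2$), degree 3 yields $\binom{5}{3} - 1 = 9$, and degree 4 yields $\binom{6}{4} - 1 = 14$.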
4.0.5 Linear Model with Transformed Features (degree = 2)
```python
# Create a linear regression model for the polynomial features
poly_2_model = LinearRegression()

# Train the model on the transformed features
poly_2_model.fit(X_train_poly_2, y_train)

# Make predictions on the test set
y_pred_poly_2 = poly_2_model.predict(X_test_poly_2)

# Make predictions on the training set
y_train_pred_poly_2 = poly_2_model.predict(X_train_poly_2)

# Evaluate the polynomial model
mse_poly_2 = mean_squared_error(y_test, y_pred_poly_2)
r2_poly_2 = r2_score(y_test, y_pred_poly_2)

# Evaluate the polynomial model on the training set
mse_train_poly_2 = mean_squared_error(y_train, y_train_pred_poly_2)
r2_train_poly_2 = r2_score(y_train, y_train_pred_poly_2)

print("\nPolynomial Model Performance:")
print("Training Set:")
print("MSE:", mse_train_poly_2)
print("R2:", r2_train_poly_2)
print("\nTesting Set:")
print("MSE:", mse_poly_2)
print("R2:", r2_poly_2)
```
```
Polynomial Model Performance:
Training Set:
MSE: 95.31378244224537
R2: 0.8571863972474734

Testing Set:
MSE: 90.00139440016171
R2: 0.8656547173222298
```
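The jump in test $R^2$ from about 0.67 to 0.87 comes from the new $x_1^2$ column, which matches the true $1.5\,x_1^2$ component of the simulated relationship. The remaining error is largely due to the $0.5\,x_2^3$ term, which no degree-2 feature can represent.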
4.0.6 Transform Features with PolynomialFeatures (degree = 3)
```python
# Create PolynomialFeatures object with degree=3 (includes interaction terms)
poly_3 = PolynomialFeatures(degree=3, include_bias=False)

# Transform the training and testing features
X_train_poly_3 = poly_3.fit_transform(X_train)
X_test_poly_3 = poly_3.transform(X_test)

# Display the transformed feature names
print("\nTransformed Feature Names:")
print(poly_3.get_feature_names_out())
```
4.0.7 Linear Model with Transformed Features (degree = 3)
```python
# Create a linear regression model for the polynomial features
poly_3_model = LinearRegression()

# Train the model on the transformed features
poly_3_model.fit(X_train_poly_3, y_train)

# Make predictions on the test set
y_pred_poly_3 = poly_3_model.predict(X_test_poly_3)

# Make predictions on the training set
y_pred_train_poly_3 = poly_3_model.predict(X_train_poly_3)

# Evaluate the polynomial model
mse_poly_3 = mean_squared_error(y_test, y_pred_poly_3)
r2_poly_3 = r2_score(y_test, y_pred_poly_3)

# Evaluate the polynomial model on the training set
mse_poly_3_train = mean_squared_error(y_train, y_pred_train_poly_3)
r2_poly_3_train = r2_score(y_train, y_pred_train_poly_3)

print("\nPolynomial Model Performance:")
print("Training Set:")
print("MSE:", mse_poly_3_train)
print("R2:", r2_poly_3_train)
print("\nTesting Set:")
print("MSE:", mse_poly_3)
print("R2:", r2_poly_3)
```
```
Polynomial Model Performance:
Training Set:
MSE: 8.64661284180024
R2: 0.9870443297925774

Testing Set:
MSE: 9.034015244301722
R2: 0.9865149052434146
```
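With degree 3, the feature set finally contains $x_2^3$, so the model can recover the true functional form. Note that the test MSE of roughly 9 matches the variance of the injected noise ($3^2 = 9$), i.e., the irreducible error, so we should not expect any model to do meaningfully better on this data.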
4.0.8 Putting It All Together
```python
# Create a dataframe to put these 3 models together, including model name,
# features, training mse, testing mse, training r2, testing r2
models = ['Baseline', 'Polynomial Degree 2', 'Polynomial Degree 3']
features = [X.shape[1], len(poly_2.get_feature_names_out()), len(poly_3.get_feature_names_out())]
training_mse = [mse_train_baseline, mse_train_poly_2, mse_poly_3_train]
testing_mse = [mse_baseline, mse_poly_2, mse_poly_3]
training_r2 = [r2_train_baseline, r2_train_poly_2, r2_poly_3_train]
testing_r2 = [r2_baseline, r2_poly_2, r2_poly_3]

model_comparison = pd.DataFrame({
    'Model': models,
    'Features': features,
    'Training MSE': training_mse,
    'Testing MSE': testing_mse,
    'Training R2': training_r2,
    'Testing R2': testing_r2
})
model_comparison
```
```
                 Model  Features  Training MSE  Testing MSE  Training R2  Testing R2
0             Baseline         2    218.849926   218.052227     0.672086    0.674513
1  Polynomial Degree 2         5     95.313782    90.001394     0.857186    0.865655
2  Polynomial Degree 3         9      8.646613     9.034015     0.987044    0.986515
```
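As an aside, the three model-fitting blocks above repeat the same transform/fit/evaluate pattern, so the whole comparison can be rebuilt in a short loop. This is just a sketch reusing the train/test arrays defined earlier; note that with degree=1 and include_bias=False, PolynomialFeatures simply returns the original columns, so the first row reproduces the baseline.

```python
# A compact sketch that reproduces the comparison table in a loop
# (assumes X_train, X_test, y_train, y_test from the split earlier)
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

rows = []
for degree in [1, 2, 3]:
    poly = PolynomialFeatures(degree=degree, include_bias=False)
    Xtr = poly.fit_transform(X_train)   # fit on the training data only
    Xte = poly.transform(X_test)        # reuse the fitted transformer
    model = LinearRegression().fit(Xtr, y_train)
    rows.append({
        'Model': f'Degree {degree}',
        'Features': Xtr.shape[1],
        'Training MSE': mean_squared_error(y_train, model.predict(Xtr)),
        'Testing MSE': mean_squared_error(y_test, model.predict(Xte)),
        'Training R2': r2_score(y_train, model.predict(Xtr)),
        'Testing R2': r2_score(y_test, model.predict(Xte)),
    })
pd.DataFrame(rows)
```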
```python
# Print out the feature names for the polynomial degree 3 model
print("\nTransformed Feature Names:")
print(poly_3.get_feature_names_out())
```
```python
# Use a polynomial degree of 4 to see if it improves the model
poly_4 = PolynomialFeatures(degree=4, include_bias=False)

# Transform the training and testing features
X_train_poly_4 = poly_4.fit_transform(X_train)
X_test_poly_4 = poly_4.transform(X_test)

# Create a linear regression model for the polynomial features
poly_4_model = LinearRegression()

# Train the model on the transformed features
poly_4_model.fit(X_train_poly_4, y_train)

# Make predictions on the test set
y_pred_poly_4 = poly_4_model.predict(X_test_poly_4)

# Make predictions on the training set
y_pred_train_poly_4 = poly_4_model.predict(X_train_poly_4)

# Evaluate the polynomial model
mse_poly_4 = mean_squared_error(y_test, y_pred_poly_4)
r2_poly_4 = r2_score(y_test, y_pred_poly_4)

# Evaluate the polynomial model on the training set
mse_poly_4_train = mean_squared_error(y_train, y_pred_train_poly_4)
r2_poly_4_train = r2_score(y_train, y_pred_train_poly_4)

print("\nPolynomial Model Performance:")
print("Training Set:")
print("MSE:", mse_poly_4_train)
print("R2:", r2_poly_4_train)
print("\nTesting Set:")
print("MSE:", mse_poly_4)
print("R2:", r2_poly_4)
```
```
Polynomial Model Performance:
Training Set:
MSE: 8.633776015235346
R2: 0.98706356387817

Testing Set:
MSE: 8.991410070128255
R2: 0.9865785020821749
```
```python
# Get the feature names for the polynomial degree 4 model
print("\nNumber of Features:", len(poly_4.get_feature_names_out()))
print("\nTransformed Feature Names:")
print(poly_4.get_feature_names_out())
```
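Degree 4 expands the design matrix from 9 to 14 columns but lowers the test MSE only from about 9.03 to 8.99, with $R^2$ essentially unchanged, so the extra terms add nothing meaningful. Because the data were generated from squared and cubed terms only, degree 3 already matches the true functional form, and the more parsimonious degree-3 model is the better choice.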
In scikit-learn, the built-in PolynomialFeatures transformer is somewhat “all or nothing”: by default, it generates all polynomial terms (including interactions) up to the chosen degree. You can toggle:

- interaction_only=True to generate only cross-terms,
- include_bias=False to exclude the constant (bias) term,
- degree to control how high the polynomial powers go.

However, if you want fine-grained control over exactly which terms get generated (for example, only certain interaction terms, or only a subset of polynomial terms), you will need to create those features manually or write a custom transformer (beyond the scope of this beginner-level discussion).
Use interaction_only for Cross Terms Only
If your goal is only to capture interaction terms (i.e., $x_1 x_2$, but no squares, cubes, etc.), you can set interaction_only=True. A minimal sketch (the variable names here are just illustrative):
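```python
# Keep only products of distinct features, no powers
from sklearn.preprocessing import PolynomialFeatures

poly_int = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_train_int = poly_int.fit_transform(X_train)

# With two inputs, this yields the two original columns plus their product
print(poly_int.get_feature_names_out())
```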
If you want to be very selective, say, just add $x_1^2$ and $x_1 x_2$ but not $x_2^2$, the simplest approach is to create the columns by hand. For example:
```python
import numpy as np

X1 = X[:, 0].reshape(-1, 1)  # feature 1
X2 = X[:, 1].reshape(-1, 1)  # feature 2

# Manually create specific transformations
X1_sq = X1 ** 2
X1X2 = X1 * X2

# Combine them as you like
X_new = np.hstack([X1, X2, X1_sq, X1X2])

print("\nTransformed Feature Names:")
print(['x1', 'x2', 'x1^2', 'x1*x2'])
X_new[:5]  # Display the first 5 rows
```
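The resulting X_new can then be passed to LinearRegression exactly like the PolynomialFeatures output; just be sure to apply the same manual transformations to the test set as well.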
When using PolynomialFeatures (or any other scikit-learn transformer), the fitting step is always done on the training data—not on the test data. This is a fundamental principle of machine learning pipelines: we do not use the test set for any part of model training (including feature encoding, feature generation, scaling, etc.).
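A convenient way to enforce this principle is scikit-learn's Pipeline, which chains the transformer and the model so that fit applies fit_transform to the training data only, while predict and score reuse the already-fitted transformer on new data. A minimal sketch with the arrays from this section:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Chain the feature transform and the model into one estimator
poly_pipeline = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LinearRegression()
)

# fit() runs fit_transform on the training data only
poly_pipeline.fit(X_train, y_train)

# score() reuses the transformer fitted on the training data
print("Test R2:", poly_pipeline.score(X_test, y_test))
```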