Appendix D — Assignment 4

Instructions

  1. You may talk to a friend and discuss the questions and potential directions for solving them. However, you must write your own solutions and code separately, not as a group activity.

  2. Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.

  3. Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command: quarto render filename.ipynb --to html. Submit the HTML file.

  4. There is a bonus question worth 11 points.

  5. Five points are allocated to properly formatting the assignment. The breakdown is as follows:

    • Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue and quote the error you get when rendering with Quarto in the comments section of Canvas, and submit the .ipynb file.
    • No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission. (1 point)
    • There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
    • Final answers to each question are written in the Markdown cells. (1 point)
    • There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
  6. The maximum possible score in the assignment is 99 + 11 + 5 = 115 out of 100.

1) Modeling the Radii of Exoplanets (40 points)

For this question, we are interested in predicting the radius of exoplanets (planets outside the Solar System) in kilometers. To achieve this goal, we will use NASA’s Composite Planetary Systems dataset and different regression models. (See https://exoplanetarchive.ipac.caltech.edu for more context.)

Read all three CompositePlanetarySystems datasets; you should have one training and two test datasets. Each row is an exoplanet. The pl_rade column represents the radius of each exoplanet as a proportion of Earth’s radius, which is approximately 6,378 km.

a)

Develop a linear regression model (no non-linear terms) to predict pl_rade using all the raw variables in the data except pl_name, disc_facility and disc_locale. You can use either statsmodels or sklearn. (2 points)
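As a starting point, here is a minimal sketch of this setup with sklearn. The file names are assumptions, since the prompt only gives the dataset family name; adjust them to the actual files.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file names -- adjust to the actual files provided.
train = pd.read_csv("CompositePlanetarySystems_train.csv")
test1 = pd.read_csv("CompositePlanetarySystems_test1.csv")
test2 = pd.read_csv("CompositePlanetarySystems_test2.csv")

# Drop the excluded columns and the response itself from the predictors.
drop_cols = ["pl_name", "disc_facility", "disc_locale", "pl_rade"]
X_train, y_train = train.drop(columns=drop_cols), train["pl_rade"]

lr = LinearRegression().fit(X_train, y_train)
```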

b)

Find the RMSE of the model using both test sets separately. (You need to print two RMSE values.) Note that the library you used should not make a difference here! (2 points)

Print the training RMSE as well for reference. (1 point)
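A minimal sketch of the RMSE computation, reusing the objects from the sketch in part a:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Training RMSE for reference, then the two test RMSEs.
for name, df in [("train", train), ("test1", test1), ("test2", test2)]:
    X, y = df.drop(columns=drop_cols), df["pl_rade"]
    rmse = np.sqrt(mean_squared_error(y, lr.predict(X)))
    print(f"{name} RMSE: {rmse:.4f}")
```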

c)

Compare the training and test RMSEs. (1 point) What is the issue with this model? (1 point)

d)

Train a Ridge regression model to predict pl_rade using the same set of variables as above. Optimize the regularization strength (alpha) using RidgeCV with 5-fold cross-validation, ensuring that the data is shuffled before splitting (random_state=42). Use neg_root_mean_squared_error as the scoring metric.

Note:

  • Scaling is essential for regularization to ensure fair weighting of features.
  • Use the following range of alpha for hyperparameter tuning: alphas = np.logspace(2,0.5,200) (see the sketch below).
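A minimal sketch of this setup, assuming X_train and y_train from part a; the scaler is placed in a pipeline so it is fit before the Ridge step:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

cv = KFold(n_splits=5, shuffle=True, random_state=42)   # shuffled 5-fold splits

ridge_cv = make_pipeline(
    StandardScaler(),                                   # scale before regularizing
    RidgeCV(alphas=np.logspace(2, 0.5, 200), cv=cv,
            scoring="neg_root_mean_squared_error"),
)
ridge_cv.fit(X_train, y_train)
print(ridge_cv[-1].alpha_)                              # selected alpha
```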

e)

Using the optimized model, print the RMSEs for the training set and both test sets. (4 points)

f)

How did the training and test performance change? Explain why the Ridge regression changed the training and test results. (3 points)

g)

Find the predictor whose coefficient is shrunk by far the most by Ridge regularization. (3 points)

Hint: .coef_ and .columns attributes should be helpful.
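One way to use the hinted attributes is sketched below; it assumes the unregularized model is refit on scaled features so its coefficients are on the same footing as the Ridge coefficients.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Refit OLS on scaled features so the two coefficient vectors are comparable.
lr_scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_train, y_train)

# Drop in absolute coefficient size from OLS to Ridge, per predictor.
shrinkage = pd.Series(abs(lr_scaled[-1].coef_) - abs(ridge_cv[-1].coef_),
                      index=X_train.columns)
print(shrinkage.sort_values(ascending=False).head(1))
```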

h)

Why did the coefficient of the predictor identified in the previous question shrink the most? Justify your answer for credit. (2 points)

Hint: Correlation vector/matrix

i)

Visualize how the coefficients change with the change in the hyperparameter value:

  • Create a line plot of coefficient values vs. the hyperparameter value.
  • Color code each predictor’s coefficient values.
  • Use log scale where necessary.
  • Use an alphas vector of np.logspace(7,0,200) for better visualization (see the sketch below).

(5 points)
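A minimal sketch of the coefficient-path plot, assuming the training data from the earlier parts:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

alphas = np.logspace(7, 0, 200)
X_scaled = StandardScaler().fit_transform(X_train)

# One coefficient vector per alpha value.
coefs = [Ridge(alpha=a).fit(X_scaled, y_train).coef_ for a in alphas]

plt.plot(alphas, coefs)            # one color-coded line per predictor
plt.xscale("log")                  # alphas span several orders of magnitude
plt.xlabel("alpha")
plt.ylabel("coefficient value")
plt.legend(X_train.columns, fontsize=6)
plt.show()
```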

j)

Replace the Ridge regression with Lasso regression.

  • Find the optimal hyperparameter using LassoCV (2 points).
    • You need a different hyperparameter array; use np.logspace(0,-2.5,200).
    • Use the same splitting strategy as for the Ridge regression (see the sketch after this list).
    • Note: The LassoCV object does not have a scoring hyperparameter.
  • Using the optimized Lasso model, print the RMSEs for the training set and both test sets. (2 points)
  • Visualize how the Lasso coefficients change with alpha. (2 points)
    • You may use np.logspace(7,-2.5,200) as the range of alpha values for better visualization.
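A minimal sketch of the LassoCV fit, mirroring the Ridge setup; max_iter is raised as a precaution, since Lasso often needs extra iterations to converge:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

lasso_cv = make_pipeline(
    StandardScaler(),
    LassoCV(alphas=np.logspace(0, -2.5, 200),
            cv=KFold(n_splits=5, shuffle=True, random_state=42),
            max_iter=10000),       # extra iterations for convergence
)
lasso_cv.fit(X_train, y_train)
print(lasso_cv[-1].alpha_)         # selected alpha
```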

k)

Using the two figures created in parts i and j, explain how the Ridge and Lasso models behave differently as the hyperparameter value changes. (2 points) What does that difference mean for the usage of the Lasso model? (1 point)

l)

Find the predictors that are eliminated by Lasso regularization. (2 points)

2) Enhancing House Price Prediction with Higher-Order Terms and Cross-Validation (29 points)

In this question, we are interested in improving the prediction performance for house prices using five predictors.

a)

Read the house feature and price files and create the training and test datasets. The response is log-price and the five predictors are the rest of the variables, except house_id. (2 points)

b)

Previously, we observed that a linear model using raw predictors fails to capture the complexity of the problem, resulting in underfitting. Our goal is to examine how training and test performance evolve as model complexity increases.

Task Breakdown:

  1. Generate Higher-Order Features

    • Use PolynomialFeatures from sklearn to create higher-order versions of the predictors (including both transformations and interactions) for both the training and test datasets. (3 points)
  2. Train a Ridge Regression Model

    • Use all predictors (original and transformed) to train a Ridge regression model with alpha = 0.000001. (2 points)
  3. Compute RMSE Scores

    • Store the RMSE for both the training and test sets. (2 points)
  4. Repeat for Different Polynomial Orders

    • Perform this process for polynomial orders ranging from 1 to 6. (2 points)
  5. Visualize the Performance

    • Plot the training and test RMSE values as a function of polynomial order. Ensure:
      • The two curves have distinct colors. (1 point)
      • A legend is included for clarity. (1 point)

Notes:

  • Exclude the bias term that PolynomialFeatures adds by default.
  • Feature scaling is required for regularization. (2 points)
  • Minimal regularization is needed to keep the test RMSE from diverging at higher polynomial orders, as it would under pure (unregularized) linear regression. (A sketch of this workflow follows.)
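A minimal sketch of the loop; X_train, y_train, X_test, and y_test are assumed to be the datasets prepared in part a:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

orders = range(1, 7)
train_rmse, test_rmse = [], []

for d in orders:
    model = make_pipeline(
        PolynomialFeatures(degree=d, include_bias=False),  # drop the bias column
        StandardScaler(),                                  # scale before regularizing
        Ridge(alpha=0.000001),                             # minimal regularization
    )
    model.fit(X_train, y_train)
    train_rmse.append(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
    test_rmse.append(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

plt.plot(orders, train_rmse, color="blue", label="train RMSE")
plt.plot(orders, test_rmse, color="red", label="test RMSE")
plt.xlabel("polynomial order")
plt.ylabel("RMSE")
plt.legend()
plt.show()
```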

c)

Which order has the best test RMSE? (1 point) What is the best test RMSE? (1 point) At which order does the overfitting start? (1 point)

d)

Repeat part b, only this time use RidgeCV to find the best amount of regularization for each order by cross-validation. Use alphas = np.logspace(2,0.5,200) and LOOCV. Use neg_root_mean_squared_error for scoring. Create the same plot as part b. (4 points) Describe the obvious difference between the plot in this part and the plot in part b. (2 points)
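Only the Ridge step of the part-b pipeline needs to change; a minimal sketch of the replacement step:

```python
import numpy as np
from sklearn.linear_model import RidgeCV

# Drop-in replacement for Ridge(alpha=0.000001) in the part-b pipeline.
# With the default cv=None, RidgeCV runs an efficient form of leave-one-out CV.
ridge_step = RidgeCV(alphas=np.logspace(2, 0.5, 200),
                     scoring="neg_root_mean_squared_error")
```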

e)

What is the best test RMSE found by using higher-orders and regularization? (1 point) Which order achieved this test RMSE? (1 point) Why did this order with regularization perform better than any lower order with or (almost) without regularization? (3 points)

3) Systematic Elimination of Interaction Terms (30 points)

In this question, we are interested in predicting whether a client subscribed to a term deposit after a phone call, using the client’s age and education and the day and month the call took place.

Note that this is the same problem as in the previous assignment, however, using sklearn, we aim to make the predictive analysis with interactions more systematic.

a)

Read train.csv, test1.csv, and test2.csv. Prepare the training and two test datasets according to the description above. (2 points)

b)

For all datasets:

  • One-hot-encode the categorical predictors. (2 points)
  • Get the interactions of all the predictors (numeric and one-hot-encoded). (3 points)
    • Note that there is a very quick way of doing this with PolynomialFeatures (degree=2).
    • Don’t forget to exclude the bias.
  • Scale all the predictors. (2 points) (A sketch of these steps follows this list.)
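A minimal sketch of these steps; the column names are assumptions about the files and should be adjusted:

```python
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical column names -- adjust to the actual files.
predictors = ["age", "education", "day", "month"]
X_train = pd.get_dummies(train[predictors])            # one-hot-encodes string columns

prep = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),  # all pairwise interactions
    StandardScaler(),                                  # then scale every term
)
X_train_prep = prep.fit_transform(X_train)
# Transform both test sets with the same fitted objects, after aligning
# their one-hot columns with the training columns.
```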

c)

Train a Logistic Regression model with Lasso penalty. (2 points) The idea is to discard interactions that are not useful. Note that instead of the manual, trial-and-error way of adding interactions in statsmodels, we include all the possible interactions and then discard the useless ones here.

  • Use [0.0001,0.001, 0.01, 0.1, 1, 10, 100, 1000] as the possible C values. (1 point)
  • Use 10-fold cross-validation to optimize the C value. (1 point)
  • Lasso is very useful, but it needs special algorithms, since it includes non-differentiable absolute values. Use saga as the solver. (1 point)
  • The default number of iterations the algorithm takes is usually not enough for Lasso. Use max_iter = 1000. (The default is 100.) (1 point)
  • This will take 10-20 minutes to run. (A sketch of the call follows this list.)
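A minimal sketch of the call, reusing the preprocessed matrix from part b; y_train is the assumed binary subscription indicator:

```python
from sklearn.linear_model import LogisticRegressionCV

lasso_logit = LogisticRegressionCV(
    Cs=[0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000],
    cv=10,                       # 10-fold cross-validation over the C values
    penalty="l1",                # Lasso penalty
    solver="saga",               # handles the non-differentiable L1 term
    max_iter=1000,
)
lasso_logit.fit(X_train_prep, y_train)
print(lasso_logit.C_)            # selected C value
```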

d)

How many models in total are run by this cross-validation process? (2 points)

e)

What is the optimum C value? (1 point) What is the lambda (in the Lasso cost) value it corresponds to? (1 point)

f)

What is the percentage of terms (linear or interaction) that are discarded by Lasso? (Hint: .coef_) (2 points)

g)

Find the five terms that have the highest effect on the log-odds of a subscription. Assume that we are quantifying the effect of a term with the absolute value of its coefficient. (Hint: .get_feature_names_out()) (4 points)
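A sketch of the hinted pattern, reusing prep and lasso_logit from the earlier sketches:

```python
import numpy as np
import pandas as pd

# Recover the term names from the PolynomialFeatures step, then rank by |coefficient|.
feature_names = prep[0].get_feature_names_out(X_train.columns)
effects = pd.Series(np.abs(lasso_logit.coef_.ravel()), index=feature_names)
print(effects.sort_values(ascending=False).head(5))
```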

h)

Come up with real-life explanations on why the terms identified in the previous part are important. (This is an open-ended question, just make sure your answer makes sense.) (2 points)

i)

Lastly, tune the classification threshold so that both test datasets reach above 75% accuracy and 50% recall. Note that you only need to worry about the threshold now; Lasso took care of finding good interactions. (3 points)
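A minimal sketch of a threshold scan on one test set (repeat for the other); X_test1_prep and y_test1 are assumed from the earlier preprocessing:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

probs = lasso_logit.predict_proba(X_test1_prep)[:, 1]   # P(subscribed)
for t in np.arange(0.05, 0.95, 0.05):
    preds = (probs >= t).astype(int)
    print(f"t={t:.2f}  accuracy={accuracy_score(y_test1, preds):.3f}  "
          f"recall={recall_score(y_test1, preds):.3f}")
```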

4) Bonus: ElasticNet (11 points)

The goal of this section is to familiarize you with ElasticNet, which combines both Ridge and Lasso regularization techniques. The balance between the Lasso and Ridge penalties is controlled by a hyperparameter.

a)

For regression, scikit-learn provides both ElasticNet and its cross-validation counterpart, ElasticNetCV, for implementing Elastic Net regularization.

Your tasks:

  • Use the same dataset as in Question 1, ensuring that the specified columns remain dropped.

  • Use the same performance metric: RMSE

  • Apply the same data splitting strategy as in Question 1.

  • Train an Elastic Net model with the following l1_ratio values:

    • 25% Lasso / 75% Ridge (l1_ratio = 0.25)
    • 50% Lasso / 50% Ridge (l1_ratio = 0.50)
    • 75% Lasso / 25% Ridge (l1_ratio = 0.75)
  • Tune the alpha hyperparameter using the range:

    alphas = np.logspace(10, 0.1, 200)
  • Combine the two test sets.

  • Identify the l1_ratio and alpha combination that yields the best test performance (see the sketch below).

(8 points)
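A minimal sketch, reusing the Question 1 training data and splitting strategy; ElasticNetCV searches over both hyperparameters at once:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

enet = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.25, 0.50, 0.75],            # Lasso/Ridge mixes to try
                 alphas=np.logspace(10, 0.1, 200),
                 cv=KFold(n_splits=5, shuffle=True, random_state=42)),
)
enet.fit(X_train, y_train)
# CV-selected combination; evaluate it on the combined test sets as the prompt asks.
print(enet[-1].l1_ratio_, enet[-1].alpha_)
```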

b)

How many models were run in the cross-validation process over the two hyperparameters? (1 point)

c)

Briefly describe how you would implement ElasticNet for Logistic Regression in scikit-learn. (2 points)