Appendix E — Assignment 5
Instructions
You may discuss the questions and potential directions for solving them with a friend. However, you must write your own solutions and code separately, not as a group activity.
Write your code in the Code cells and your answers in the Markdown cells of the Jupyter notebook. Ensure that the solution is written neatly enough for the graders to understand and follow.
Use Quarto to render the .ipynb file as HTML. You will need to open the command prompt, navigate to the directory containing the file, and use the command:

```
quarto render filename.ipynb --to html
```

Submit the HTML file. Five points are for properly formatting the assignment. The breakdown is as follows:
- Must be an HTML file rendered using Quarto (1 point). If you have a Quarto issue, you must mention the issue & quote the error you get when rendering using Quarto in the comments section of Canvas, and submit the .ipynb file.
- No name can be written on the assignment, nor can there be any indicator of the student’s identity—e.g. printouts of the working directory should not be included in the final submission. (1 point)
- There aren’t excessively long outputs of extraneous information (e.g. no printouts of entire data frames without good reason, there aren’t long printouts of which iteration a loop is on, there aren’t long sections of commented-out code, etc.) (1 point)
- Final answers to each question are written in the Markdown cells. (1 point)
- There is no piece of unnecessary / redundant code, and no unnecessary / redundant text. (1 point)
The maximum possible score in the assignment is 100 + 5 = 105 out of 100.
In this assignment, you will practice using low-level cross-validation tools for optimizing regression and classification models!
1) Cross-validation for a Regression Task (50 points)
Read the `soc_ind.csv` file and set the `Index` column as the index. The column names should make it clear what the variables represent. (2 points)
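For reference, a minimal sketch of this step (assuming `soc_ind.csv` is in the working directory; the name `df` is just a placeholder):

```python
import pandas as pd

# Read the data and use the 'Index' column as the row index
df = pd.read_csv('soc_ind.csv', index_col='Index')
```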
a)
`gdpPerCapita` will be the response for this regression analysis. Before anything else, create two density plots to see whether we should use it as-is or use its log-transformed version. Justify your answer with the plots. (4 points)

Hint: `sns.kdeplot`
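One possible sketch of the two plots, assuming the data frame from the previous step is named `df`:

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Density of the raw response next to the density of its log transform
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.kdeplot(df['gdpPerCapita'], ax=axes[0])
axes[0].set_title('gdpPerCapita')
sns.kdeplot(np.log(df['gdpPerCapita']), ax=axes[1])
axes[1].set_title('log(gdpPerCapita)')
plt.show()
```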
b)
Create the proper response variable based on your answer in the previous part. The predictors are the rest of the variables except `geographic_location` and `country`. Create a predictor matrix accordingly. (3 points)

Using `train_test_split` from `sklearn.model_selection`, create the training and test data. Use an 80%-20% train-test split and `random_state=2` for reproducible results. (3 points)
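A sketch of this part, assuming the log transform was chosen in part a (adjust the response if you kept the raw scale):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Response: log-transformed GDP per capita (assumption from part a)
y = np.log(df['gdpPerCapita'])

# Predictors: everything except the response and the two excluded columns
X = df.drop(columns=['gdpPerCapita', 'geographic_location', 'country'])

# 80%-20% split with a fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=2
)
```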
c)
- One-hot encode categorical predictors (2 points)
- Scale all predictors for regularization (2 points)
- Apply these transformations to both the training and test datasets; see the sketch after this list. (2 points)
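One way to sketch these transformations, assuming the train/test objects from part b:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# One-hot encode the categorical predictors, then align the test columns
# with the training columns so the two matrices match
X_train_enc = pd.get_dummies(X_train, dtype=float)
X_test_enc = pd.get_dummies(X_test, dtype=float)
X_test_enc = X_test_enc.reindex(columns=X_train_enc.columns, fill_value=0)

# Fit the scaler on the training data only, then transform both sets
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_enc)
X_test_scaled = scaler.transform(X_test_enc)
```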
d)
- Train a Ridge Regression model using 5-fold cross-validation with a hyperparameter grid of `np.logspace(2, 0.5, 200)`. Use negative mean absolute error (`neg_mean_absolute_error`) as the scoring metric. (4 points)
- Store all cross-validation (CV) scores in a NumPy array; see the sketch after this list. (1 point)
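A sketch of the cross-validation loop, assuming the scaled training data from part c:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

alphas = np.logspace(2, 0.5, 200)

# Mean 5-fold CV score (negative MAE) for each candidate alpha
cv_scores = np.array([
    cross_val_score(Ridge(alpha=a), X_train_scaled, y_train,
                    cv=5, scoring='neg_mean_absolute_error').mean()
    for a in alphas
])
```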
e)
Using the array you created in the previous part, find the optimal hyperparameter value and the best CV score that corresponds to it. (1+1=2 points)
f)
Check the best CV score you found in the previous part. What seems to be the issue with it? Remember that the response is GDP per capita of countries. (We will solve this issue later in this question.) (2 points)
g)
Create a final model with the optimal hyperparameter value you found in the previous part. Return the test MAE. You need to return the test MAE in terms of actual GDP values for credit. (5 points)
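A sketch of the final model, where `best_alpha` is a placeholder for the value found in part e, and the response is assumed to be log-transformed (hence the back-transform with `np.exp`):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error

# Refit on the full training set with the chosen alpha
final_ridge = Ridge(alpha=best_alpha)  # best_alpha: placeholder from part e
final_ridge.fit(X_train_scaled, y_train)

# Back-transform predictions to the actual GDP scale before computing MAE
pred_gdp = np.exp(final_ridge.predict(X_test_scaled))
test_mae = mean_absolute_error(np.exp(y_test), pred_gdp)
print(test_mae)
```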
h)
Now it is time to calculate proper MAE values for the cross-validation results and to optimize the hyperparameter value based on them. Using `cross_val_predict`, return the CV predictions for all hyperparameter values. Use a hyperparameter vector of `np.logspace(2, 0.5, 200)` (same as part d). (6 points) Save all the predictions in a DataFrame. (2 points)
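One possible sketch, collecting one column of out-of-fold predictions per candidate value:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

alphas = np.logspace(2, 0.5, 200)

# Each column holds the 5-fold CV predictions for one alpha
cv_preds = pd.DataFrame({
    a: cross_val_predict(Ridge(alpha=a), X_train_scaled, y_train, cv=5)
    for a in alphas
})
```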
i)
Using the DataFrame you created in the previous part, find the optimal hyperparameter value and the best CV MAE that corresponds to it. (5 points)
Note:

- The MAE should be in terms of actual GDP values for credit.
- No loops are allowed for this question. You may want to refresh your memory on the `.apply()` method; see the sketch after these notes.
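A sketch of the loop-free computation, assuming the `cv_preds` DataFrame from part h and a log-transformed response:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# MAE on the actual GDP scale for every column (i.e., every alpha) at once
cv_maes = cv_preds.apply(
    lambda col: mean_absolute_error(np.exp(y_train), np.exp(col))
)

best_alpha = cv_maes.idxmin()  # alpha with the lowest CV MAE
best_cv_mae = cv_maes.min()
```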
j)
With the hyperparameter value you found in the previous part, train a final model and print its test MAE. (3 points) How does it compare to the test MAE you found with `cross_val_score`? (1 point) Why do you think this is the case? (1 point)
2) Cross-validation for a Classification Task (50 points)
Read the `diabetes_train.csv` and `diabetes_test.csv` files. The `Outcome` variable represents whether the patient has diabetes or not. The rest of the variables are medical predictors we will use to predict the outcome.
a)
Create the training and the test data. (2 points)
b)
Scale the datasets for regularization. (2 points)
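A sketch covering parts a and b, assuming both files are in the working directory:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.read_csv('diabetes_train.csv')
test = pd.read_csv('diabetes_test.csv')

# Separate the predictors from the response
X_train, y_train = train.drop(columns='Outcome'), train['Outcome']
X_test, y_test = test.drop(columns='Outcome'), test['Outcome']

# Fit the scaler on the training predictors only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```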
c)
Using a hyperparameter vector of `Cs = np.logspace(2, -2, 200)`, cross-validate a Lasso Classification model. Use 5 folds and the default scoring metric (which is accuracy). (4 points) Save all your cross-validation (CV) scores in a NumPy array. (2 points)
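A sketch of this step, interpreting "Lasso Classification" as L1-penalized logistic regression (an assumption consistent with the `C` hyperparameter):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

Cs = np.logspace(2, -2, 200)

# Mean 5-fold CV accuracy (the default classifier scoring) per C value
cv_scores = np.array([
    cross_val_score(
        LogisticRegression(penalty='l1', C=c, solver='liblinear'),
        X_train_scaled, y_train, cv=5
    ).mean()
    for c in Cs
])
```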
d)
Using the array you created in the previous part, find the optimal hyperparameter value and the best CV score that corresponds to it. (1+1=2 points)
e)
- Build a final model using the optimal hyperparameter value identified in the previous step. (2 points)
- Evaluate the model on the test set and report the following metrics, using a threshold of 0.5 for the class predictions (see the sketch after this list):
  - Accuracy
  - Recall
  - AUC (Area Under the Curve) (3 points)
- Identify which metric appears problematic and explain why. (2 points)
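A sketch of the evaluation, where `best_C` is a placeholder for the value found in part d:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

final_clf = LogisticRegression(penalty='l1', C=best_C, solver='liblinear')
final_clf.fit(X_train_scaled, y_train)

# Predicted probabilities of class 1, then class labels at threshold 0.5
test_probs = final_clf.predict_proba(X_test_scaled)[:, 1]
test_preds = (test_probs >= 0.5).astype(int)

print('Accuracy:', accuracy_score(y_test, test_preds))
print('Recall:  ', recall_score(y_test, test_preds))
print('AUC:     ', roc_auc_score(y_test, test_probs))
```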
f)
- What threshold did `cross_val_score` use to compute the accuracy scores? (1 point)
- How did this threshold choice contribute to the issue identified in the previous question? (2 points)
g)
Now, it's time to generate cross-validated predictions and optimize the decision threshold.

- Use `cross_val_predict` to obtain the predicted probabilities from cross-validation, using the optimal hyperparameter value identified in part d. (5 points)

Note that no loops are needed, as the best C value has already been determined.
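A sketch of this step, again with `best_C` as a placeholder:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Out-of-fold predicted probabilities of class 1 from 5-fold CV
cv_probs = cross_val_predict(
    LogisticRegression(penalty='l1', C=best_C, solver='liblinear'),
    X_train_scaled, y_train, cv=5, method='predict_proba'
)[:, 1]
```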
h)
Using the predicted probabilities from the previous step, compute and store the accuracy and recall for all possible threshold values ranging from 0 to 1 with a step size of 0.01. (5 points)
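One possible sketch of the threshold sweep, assuming `cv_probs` from part g:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

thresholds = np.arange(0, 1.01, 0.01)

# Accuracy and recall of the CV predictions at each candidate threshold
accuracies = np.array([
    accuracy_score(y_train, (cv_probs >= t).astype(int)) for t in thresholds
])
recalls = np.array([
    recall_score(y_train, (cv_probs >= t).astype(int)) for t in thresholds
])
```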
i)
Plot accuracy and recall against the threshold on the same graph. (4 points) Be sure to include a legend for clarity. (1 point)
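A sketch of the plot, using the arrays from the previous part:

```python
import matplotlib.pyplot as plt

plt.plot(thresholds, accuracies, label='Accuracy')
plt.plot(thresholds, recalls, label='Recall')
plt.xlabel('Threshold')
plt.ylabel('Metric value')
plt.legend()
plt.show()
```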
j)
In the plot, identify the threshold value where accuracy and recall are approximately equal. (5 points)
Important Notes:

- The metric values should be considered equal when rounded to two decimal places (or equivalently, as whole numbers when multiplied by 100).
- Trial-and-error approaches will not receive credit. You must use logical indexing to find the threshold; see the sketch after these notes.
- Helpful functions: `np.where`, `np.round`, and the logical `&` operator.
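A sketch of the logical-indexing approach, assuming the arrays from part h:

```python
import numpy as np

# Indices where accuracy and recall agree to two decimal places; combine
# additional conditions with the logical & operator if you need to narrow
# the search (e.g., excluding the endpoints)
equal_idx = np.where(np.round(accuracies, 2) == np.round(recalls, 2))[0]
print(thresholds[equal_idx])
```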
k)
- Train a final Lasso Classification model using the optimal hyperparameter value from part d and the threshold identified in the previous question. (2 points)
- Evaluate and report the following metrics on the test set:
  - Accuracy
  - Recall
  - AUC (3 points)
- Compare the accuracy and recall to the results from part e. (1 point)
- Did the AUC change? (1 point) If so, explain why. (1 point)