{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Machine Learning Homework 1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Instructions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This homework is due **before class on Friday, March 7.**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Important notes:\n",
"- Please submit the notebook with the output.\n",
"- If the answer is not obvious from the printout, please type it.\n",
"- The notebook should be self contained and we should be able to rerun it.\n",
"- Import all the libraries that you find necessary to answer the questions.\n",
"- If the subquestion is worth 1 point, no half points will be given: full point will be given for the correct answer. Similarly, if the question is worth 2, possible points are 0,1,2.\n",
"- Acknowledge the use of outside sources and code assistants."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 1 "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Total 20 points**\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Import the California house prices dataset from `sklearn.datasets` using `fetch_california_housing` as follows:\n",
" `from sklearn.datasets import fetch_california_housing` \n",
" `housing = fetch_california_housing()`\n",
"1. Print out the dataset description and read it (1pt)\n",
"2. Convert the dataset to pandas dataframe (all the features and the target) (1pt)\n",
"3. Check the number of data points, data types, and print the first 15 lines of the dataset (1pt)\n",
"4. Check the number of missing values per feature (1pt)\n",
"5. Divide the dataset into train and test, where 30% of the dataset will be used for test, with the `random_state=42` (1pt)\n",
"6. Train a Linear Regression model (1pt)\n",
"7. What value of target is predicted by the model, when all the features have value 0 (1pt)\n",
"8. What is the value of the coefficient associated with the feature HouseAge (1pt)\n",
"9. Predict the target values of the data points from the training set (1pt)\n",
"10. What is the value of the cost function for the obtained coefficients and the training dataset (1pt)\n",
"11. Evaluate the model's performance (1pt)\n",
"12. Generate polynomial features up to degree 2 of the training dataset (1pt)\n",
"13. Scale the polynomial features from the previous step using the standard scaler (1pt)\n",
"14. Train a Ridge regression model with the regularization strength equal to 0.001 on the scaled dataset. What would happen to the feature coefficients and the model if alpha approached infinity? (2pt)\n",
"15. Train Lasso Regression with the regularization strength equal to 0.01 on the scaled dataset and set the maximum number of iterations to 100000 (1pt)\n",
"16. Count how many coefficients (excluding intercept) are calculated (1pt)\n",
"17. Check how many features will not be used to predict the target with this model (1pt)\n",
"18. Plot the coefficients of Lasso and Ridge regression on the same plot, with label, Ridge in red, and Lasso in blue (2pt)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import fetch_california_housing\n",
"housing = fetch_california_housing()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.1. Print out the dataset description and read it (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
".. _california_housing_dataset:\n",
"\n",
"California Housing dataset\n",
"--------------------------\n",
"\n",
"**Data Set Characteristics:**\n",
"\n",
":Number of Instances: 20640\n",
"\n",
":Number of Attributes: 8 numeric, predictive attributes and the target\n",
"\n",
":Attribute Information:\n",
" - MedInc median income in block group\n",
" - HouseAge median house age in block group\n",
" - AveRooms average number of rooms per household\n",
" - AveBedrms average number of bedrooms per household\n",
" - Population block group population\n",
" - AveOccup average number of household members\n",
" - Latitude block group latitude\n",
" - Longitude block group longitude\n",
"\n",
":Missing Attribute Values: None\n",
"\n",
"This dataset was obtained from the StatLib repository.\n",
"https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n",
"\n",
"The target variable is the median house value for California districts,\n",
"expressed in hundreds of thousands of dollars ($100,000).\n",
"\n",
"This dataset was derived from the 1990 U.S. census, using one row per census\n",
"block group. A block group is the smallest geographical unit for which the U.S.\n",
"Census Bureau publishes sample data (a block group typically has a population\n",
"of 600 to 3,000 people).\n",
"\n",
"A household is a group of people residing within a home. Since the average\n",
"number of rooms and bedrooms in this dataset are provided per household, these\n",
"columns may take surprisingly large values for block groups with few households\n",
"and many empty houses, such as vacation resorts.\n",
"\n",
"It can be downloaded/loaded using the\n",
":func:`sklearn.datasets.fetch_california_housing` function.\n",
"\n",
".. rubric:: References\n",
"\n",
"- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,\n",
" Statistics and Probability Letters, 33 (1997) 291-297\n",
"\n"
]
}
],
"source": [
"print(housing.DESCR)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.2. Convert the dataset to pandas dataframe (all the features and the target) (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
LinearRegression()
"
],
"text/plain": [
"LinearRegression()"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"lin_reg = LinearRegression()\n",
"lin_reg.fit(X_train,y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.7. What value of target is predicted by the model, when all the features have value 0 (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(-37.056241331525186)"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lin_reg.intercept_"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.8. What is the value of the coefficient associated with the feature HouseAge? (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',\n",
" 'Latitude', 'Longitude', 'target'],\n",
" dtype='object')"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.columns\n",
"#df.columns.get_loc('HouseAge')"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"np.float64(0.009681867985916507)"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"lin_reg.coef_[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.9. Predict the target values of the data points from the training set (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"y_train_pred = lin_reg.predict(X_train)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Ridge(alpha=0.001)
"
],
"text/plain": [
"Ridge(alpha=0.001)"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import Ridge\n",
"\n",
"ridge = Ridge(alpha=0.001)\n",
"ridge.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The feature coefficients would shrink toward 0, culminating in the model being underfit as the high penalty would prevent it from properly capturing the relationships between the input features and the target. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.15. Train Lasso Regression with the regularization strength equal to 0.01 on the scaled dataset and set the maximum number of iterations to 100000 (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
Lasso(alpha=0.01, max_iter=100000)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
Lasso(alpha=0.01, max_iter=100000)
"
],
"text/plain": [
"Lasso(alpha=0.01, max_iter=100000)"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import Lasso\n",
"\n",
"lasso = Lasso(alpha = 1e-2, max_iter = int(1e5))\n",
"lasso.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.16. Count how many coefficients (excluding intercept) are calculated (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Total number of coefficients: 44\n"
]
}
],
"source": [
"print(f\"Total number of coefficients: {len(lasso.coef_)}\") "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.17. Check how many features will not be used to predict the target with this model (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Number of Features with Coefficients equal to 0: 28\n"
]
}
],
"source": [
"print(f\"Number of Features with Coefficients equal to 0: {(lasso.coef_==0).sum()}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1.18. Plot the coefficients of Lasso and Ridge regression on the same plot, with label, Ridge in red, and Lasso in blue (2pt)"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"#plt.figure(figsize=(5,4))\n",
"plt.plot(ridge.coef_, 's', label = 'Ridge Coefficients', color = 'red')\n",
"plt.plot(lasso.coef_, '^', label = 'Lasso Coefficients', color = 'blue')\n",
"plt.xlabel(\"Coefficients' Index\")\n",
"plt.ylabel(\"Coefficients' Magnitude\")\n",
"plt.legend(loc=(1.01,0.915));"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Question 2"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Total 18 points**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this question we will use a dataset with the medical details of patients for predicting the onset of diabetes within 5 years. The target is the last column of the dataset, where value 1 is interpreted as \"tested positive for diabetes\".\n",
" 1. Import the csv file \"diabetes.csv\" into pandas dataframe (1pt)\n",
" 2. How many duplicate rows do we have? If there are any, remove them. (1pt)\n",
" 3. Generate descriptive statistics for all the numerical columns with one line of code (1pt)\n",
" 4. Plot the distribution of the target value, per class percentages (1pt)\n",
 5. Split the dataset into training, validation and test sets, with the ratio 50:30:20, and use `random_state=42` (1pt)\n",
" 6. Train a logistic regression model with `solver='liblinear'`, lasso (L1) regularization with regularization strength equal to 0.01, and a maximum of 700 iterations (2pt) \n",
" 7. Use the validation set to find the value of the threshold that maximizes f1 score (of Class 1). What is that threshold value? (1pt)\n",
" 8. If we use the threshold value found above, how many false negatives do we have on the test dataset? (1pt)\n",
" 9. What is the precision of our model with the value of threshold from step 7? (1pt)\n",
 10. What proportion (approximately) of patients with diabetes would we reach if we decided to contact 60% of the patients in the test set, ordered by decreasing model score? (1pt)\n",
" 11. Use the data available (except the test set) with a cross validation method that finds the value of regularization strength of l1 penalty of Logistic regression that maximizes recall. Check at least 8 different values of the parameter and verify the best cross validation score (4pt)\n",
" 12. What value of C gives the highest recall? (1pt)\n",
" 13. What was the second-best mean test value of recall in cross validation? (1pt)\n",
" 14. What is the f1 score of the best model? (1pt)\n"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.1. Import the csv file \"diabetes.csv\" into pandas dataframe (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('diabetes.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.2. How many duplicate rows do we have? If there are any, remove them. (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"771\n",
"Number of duplicate rows: 3\n",
"768\n"
]
}
],
"source": [
"print(len(df))\n",
"print(f\"Number of duplicate rows: {df.duplicated().sum()}\")\n",
"\n",
"df.drop_duplicates(keep = 'first', inplace=True)\n",
"print(len(df))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.3. Generate descriptive statistics for all the numerical columns with one line of code (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.
"
],
"text/plain": [
"LogisticRegression(C=100.0, max_iter=700, penalty='l1', solver='liblinear')"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"#Standardization could be performed but it was not required. \n",
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"log_reg = LogisticRegression(solver = 'liblinear', C = (1/0.01), penalty = 'l1', max_iter = 700)\n",
"log_reg.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.7. Use the validation set to find the value of the threshold that maximizes f1 score (of Class 1). What is that threshold value? (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import precision_recall_curve\n",
"y_pred_proba_val= log_reg.predict_proba(X_val)\n",
"precision, recall, threshold = precision_recall_curve(y_val, y_pred_proba_val[:,1])"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Best threshold: 0.2626832463017761\n"
]
}
],
"source": [
"import numpy as np\n",
"f1_scores = 2*recall*precision/(recall+precision)\n",
"idx_best=np.nanargmax(f1_scores)\n",
"best_threshold=threshold[idx_best]\n",
"print('Best threshold: ', best_threshold)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.8. If we use the threshold value found above, how many false negatives do we have on the test dataset? (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
"y_pred_proba_test= log_reg.predict_proba(X_test)\n",
"y_pred_new = (y_pred_proba_test[:,1] >= best_threshold).astype(int)\n",
"cm=confusion_matrix(y_test, y_pred_new)\n",
"from sklearn.metrics import ConfusionMatrixDisplay\n",
"ConfusionMatrixDisplay(cm).plot();"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"False negatives: 12\n"
]
}
],
"source": [
"print(f\"False negatives: {cm[1][0]}\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.9. What is the precision of our model with the value of threshold from step 7? (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.5375"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.metrics import precision_score\n",
"\n",
"precision_score(y_test, y_pred_new)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.10. What proportion (approximately) of patients with diabetes would we reach if we decided to contact 60% of the patients in the test set, ordered by the decreasing model score (1pt)"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
""
]
},
"execution_count": 42,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "",
"text/plain": [
"
"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn_evaluation.plot import cumulative_gain\n",
"cumulative_gain(y_test, y_pred_proba_test)"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"If we decided to contact 60% of the patients in the test set, we would reach approximately 80% of the patients with diabetes.\n"
]
}
],
"source": [
"print(\"If we decided to contact 60% of the patients in the test set, we would reach approximately 80% of the patients with diabetes.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2.11. Use the data available (except the test set) with a cross validation method that finds the value of regularization strength of l1 penalty of Logistic regression that maximizes recall. Check at least 8 different values of the parameter and verify the best cross validation score (4pt)"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import GridSearchCV, RandomizedSearchCV"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.002, 0.032, 0.062, 0.092, 0.122, 0.152, 0.182, 0.212])"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.arange(0.002, 0.22, 0.03)"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [],
"source": [
"param_grid = {'C': np.arange(0.002, 0.22, 0.03)}\n",
"\n",
"grid_search = GridSearchCV(LogisticRegression(solver=\"liblinear\", penalty='l1'), param_grid, scoring='recall')"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"