{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning Homework 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instructions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This homework is due **before class on Friday, March 7.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Important notes:\n", "- Please submit the notebook with the output.\n", "- If the answer is not obvious from the printout, please type it.\n", "- The notebook should be self-contained and we should be able to rerun it.\n", "- Import all the libraries that you find necessary to answer the questions.\n", "- If a subquestion is worth 1 point, no half points will be given; the full point is awarded only for a correct answer. Similarly, if a question is worth 2 points, the possible scores are 0, 1, or 2.\n", "- Acknowledge the use of outside sources and code assistants." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Total 20 points**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Import the California house prices dataset from `sklearn.datasets` using `fetch_california_housing` as follows:
\n", " `from sklearn.datasets import fetch_california_housing`
\n", " `housing = fetch_california_housing()`\n", "1. Print out the dataset description and read it (1pt)\n", "2. Convert the dataset to pandas dataframe (all the features and the target) (1pt)\n", "3. Check the number of data points, data types, and print the first 15 lines of the dataset (1pt)\n", "4. Check the number of missing values per feature (1pt)\n", "5. Divide the dataset into train and test, where 30% of the dataset will be used for test, with the `random_state=42` (1pt)\n", "6. Train a Linear Regression model (1pt)\n", "7. What value of target is predicted by the model, when all the features have value 0 (1pt)\n", "8. What is the value of the coefficient associated with the feature HouseAge (1pt)\n", "9. Predict the target values of the data points from the training set (1pt)\n", "10. What is the value of the cost function for the obtained coefficients and the training dataset (1pt)\n", "11. Evaluate the model's performance (1pt)\n", "12. Generate polynomial features up to degree 2 of the training dataset (1pt)\n", "13. Scale the polynomial features from the previous step using the standard scaler (1pt)\n", "14. Train a Ridge regression model with the regularization strength equal to 0.001 on the scaled dataset. What would happen to the feature coefficients and the model if alpha approached infinity? (2pt)\n", "15. Train Lasso Regression with the regularization strength equal to 0.01 on the scaled dataset and set the maximum number of iterations to 100000 (1pt)\n", "16. Count how many coefficients (excluding intercept) are calculated (1pt)\n", "17. Check how many features will not be used to predict the target with this model (1pt)\n", "18. 
Plot the coefficients of Lasso and Ridge regression on the same plot, with label, Ridge in red, and Lasso in blue (2pt)" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "housing = fetch_california_housing()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.1. Print out the dataset description and read it (1pt)" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".. _california_housing_dataset:\n", "\n", "California Housing dataset\n", "--------------------------\n", "\n", "**Data Set Characteristics:**\n", "\n", ":Number of Instances: 20640\n", "\n", ":Number of Attributes: 8 numeric, predictive attributes and the target\n", "\n", ":Attribute Information:\n", " - MedInc median income in block group\n", " - HouseAge median house age in block group\n", " - AveRooms average number of rooms per household\n", " - AveBedrms average number of bedrooms per household\n", " - Population block group population\n", " - AveOccup average number of household members\n", " - Latitude block group latitude\n", " - Longitude block group longitude\n", "\n", ":Missing Attribute Values: None\n", "\n", "This dataset was obtained from the StatLib repository.\n", "https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n", "\n", "The target variable is the median house value for California districts,\n", "expressed in hundreds of thousands of dollars ($100,000).\n", "\n", "This dataset was derived from the 1990 U.S. census, using one row per census\n", "block group. 
A block group is the smallest geographical unit for which the U.S.\n", "Census Bureau publishes sample data (a block group typically has a population\n", "of 600 to 3,000 people).\n", "\n", "A household is a group of people residing within a home. Since the average\n", "number of rooms and bedrooms in this dataset are provided per household, these\n", "columns may take surprisingly large values for block groups with few households\n", "and many empty houses, such as vacation resorts.\n", "\n", "It can be downloaded/loaded using the\n", ":func:`sklearn.datasets.fetch_california_housing` function.\n", "\n", ".. rubric:: References\n", "\n", "- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,\n", " Statistics and Probability Letters, 33 (1997) 291-297\n", "\n" ] } ], "source": [ "print(housing.DESCR)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.2. Convert the dataset to pandas dataframe (all the features and the target) (1pt)" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n",
"<tr><th></th><th>MedInc</th><th>HouseAge</th><th>AveRooms</th><th>AveBedrms</th><th>Population</th><th>AveOccup</th><th>Latitude</th><th>Longitude</th><th>target</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>8.3252</td><td>41.0</td><td>6.984127</td><td>1.023810</td><td>322.0</td><td>2.555556</td><td>37.88</td><td>-122.23</td><td>4.526</td></tr>\n",
"<tr><th>1</th><td>8.3014</td><td>21.0</td><td>6.238137</td><td>0.971880</td><td>2401.0</td><td>2.109842</td><td>37.86</td><td>-122.22</td><td>3.585</td></tr>\n",
"<tr><th>2</th><td>7.2574</td><td>52.0</td><td>8.288136</td><td>1.073446</td><td>496.0</td><td>2.802260</td><td>37.85</td><td>-122.24</td><td>3.521</td></tr>\n",
"<tr><th>3</th><td>5.6431</td><td>52.0</td><td>5.817352</td><td>1.073059</td><td>558.0</td><td>2.547945</td><td>37.85</td><td>-122.25</td><td>3.413</td></tr>\n",
"<tr><th>4</th><td>3.8462</td><td>52.0</td><td>6.281853</td><td>1.081081</td><td>565.0</td><td>2.181467</td><td>37.85</td><td>-122.25</td><td>3.422</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>20635</th><td>1.5603</td><td>25.0</td><td>5.045455</td><td>1.133333</td><td>845.0</td><td>2.560606</td><td>39.48</td><td>-121.09</td><td>0.781</td></tr>\n",
"<tr><th>20636</th><td>2.5568</td><td>18.0</td><td>6.114035</td><td>1.315789</td><td>356.0</td><td>3.122807</td><td>39.49</td><td>-121.21</td><td>0.771</td></tr>\n",
"<tr><th>20637</th><td>1.7000</td><td>17.0</td><td>5.205543</td><td>1.120092</td><td>1007.0</td><td>2.325635</td><td>39.43</td><td>-121.22</td><td>0.923</td></tr>\n",
"<tr><th>20638</th><td>1.8672</td><td>18.0</td><td>5.329513</td><td>1.171920</td><td>741.0</td><td>2.123209</td><td>39.43</td><td>-121.32</td><td>0.847</td></tr>\n",
"<tr><th>20639</th><td>2.3886</td><td>16.0</td><td>5.254717</td><td>1.162264</td><td>1387.0</td><td>2.616981</td><td>39.37</td><td>-121.24</td><td>0.894</td></tr>\n",
"</tbody>\n", "</table>\n", "<p>20640 rows × 9 columns</p>\n", "</div>
" ], "text/plain": [ " MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n", "0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 \n", "1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 \n", "2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 \n", "3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 \n", "4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 \n", "... ... ... ... ... ... ... ... \n", "20635 1.5603 25.0 5.045455 1.133333 845.0 2.560606 39.48 \n", "20636 2.5568 18.0 6.114035 1.315789 356.0 3.122807 39.49 \n", "20637 1.7000 17.0 5.205543 1.120092 1007.0 2.325635 39.43 \n", "20638 1.8672 18.0 5.329513 1.171920 741.0 2.123209 39.43 \n", "20639 2.3886 16.0 5.254717 1.162264 1387.0 2.616981 39.37 \n", "\n", " Longitude target \n", "0 -122.23 4.526 \n", "1 -122.22 3.585 \n", "2 -122.24 3.521 \n", "3 -122.25 3.413 \n", "4 -122.25 3.422 \n", "... ... ... \n", "20635 -121.09 0.781 \n", "20636 -121.21 0.771 \n", "20637 -121.22 0.923 \n", "20638 -121.32 0.847 \n", "20639 -121.24 0.894 \n", "\n", "[20640 rows x 9 columns]" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.DataFrame(housing.data, columns = housing.feature_names)\n", "df['target'] = housing.target\n", "df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.3. 
Check the number of samples, the data types and print the first 15 lines of the dataset (1pt)" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "20640" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape[0]\n", "\n", "#or\n", "#df.info()" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MedInc float64\n", "HouseAge float64\n", "AveRooms float64\n", "AveBedrms float64\n", "Population float64\n", "AveOccup float64\n", "Latitude float64\n", "Longitude float64\n", "target float64\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n",
"<tr><th></th><th>MedInc</th><th>HouseAge</th><th>AveRooms</th><th>AveBedrms</th><th>Population</th><th>AveOccup</th><th>Latitude</th><th>Longitude</th><th>target</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>0</th><td>8.3252</td><td>41.0</td><td>6.984127</td><td>1.023810</td><td>322.0</td><td>2.555556</td><td>37.88</td><td>-122.23</td><td>4.526</td></tr>\n",
"<tr><th>1</th><td>8.3014</td><td>21.0</td><td>6.238137</td><td>0.971880</td><td>2401.0</td><td>2.109842</td><td>37.86</td><td>-122.22</td><td>3.585</td></tr>\n",
"<tr><th>2</th><td>7.2574</td><td>52.0</td><td>8.288136</td><td>1.073446</td><td>496.0</td><td>2.802260</td><td>37.85</td><td>-122.24</td><td>3.521</td></tr>\n",
"<tr><th>3</th><td>5.6431</td><td>52.0</td><td>5.817352</td><td>1.073059</td><td>558.0</td><td>2.547945</td><td>37.85</td><td>-122.25</td><td>3.413</td></tr>\n",
"<tr><th>4</th><td>3.8462</td><td>52.0</td><td>6.281853</td><td>1.081081</td><td>565.0</td><td>2.181467</td><td>37.85</td><td>-122.25</td><td>3.422</td></tr>\n",
"<tr><th>5</th><td>4.0368</td><td>52.0</td><td>4.761658</td><td>1.103627</td><td>413.0</td><td>2.139896</td><td>37.85</td><td>-122.25</td><td>2.697</td></tr>\n",
"<tr><th>6</th><td>3.6591</td><td>52.0</td><td>4.931907</td><td>0.951362</td><td>1094.0</td><td>2.128405</td><td>37.84</td><td>-122.25</td><td>2.992</td></tr>\n",
"<tr><th>7</th><td>3.1200</td><td>52.0</td><td>4.797527</td><td>1.061824</td><td>1157.0</td><td>1.788253</td><td>37.84</td><td>-122.25</td><td>2.414</td></tr>\n",
"<tr><th>8</th><td>2.0804</td><td>42.0</td><td>4.294118</td><td>1.117647</td><td>1206.0</td><td>2.026891</td><td>37.84</td><td>-122.26</td><td>2.267</td></tr>\n",
"<tr><th>9</th><td>3.6912</td><td>52.0</td><td>4.970588</td><td>0.990196</td><td>1551.0</td><td>2.172269</td><td>37.84</td><td>-122.25</td><td>2.611</td></tr>\n",
"<tr><th>10</th><td>3.2031</td><td>52.0</td><td>5.477612</td><td>1.079602</td><td>910.0</td><td>2.263682</td><td>37.85</td><td>-122.26</td><td>2.815</td></tr>\n",
"<tr><th>11</th><td>3.2705</td><td>52.0</td><td>4.772480</td><td>1.024523</td><td>1504.0</td><td>2.049046</td><td>37.85</td><td>-122.26</td><td>2.418</td></tr>\n",
"<tr><th>12</th><td>3.0750</td><td>52.0</td><td>5.322650</td><td>1.012821</td><td>1098.0</td><td>2.346154</td><td>37.85</td><td>-122.26</td><td>2.135</td></tr>\n",
"<tr><th>13</th><td>2.6736</td><td>52.0</td><td>4.000000</td><td>1.097701</td><td>345.0</td><td>1.982759</td><td>37.84</td><td>-122.26</td><td>1.913</td></tr>\n",
"<tr><th>14</th><td>1.9167</td><td>52.0</td><td>4.262903</td><td>1.009677</td><td>1212.0</td><td>1.954839</td><td>37.85</td><td>-122.26</td><td>1.592</td></tr>\n",
"</tbody>\n", "</table>\n", "</div>
" ], "text/plain": [ " MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n", "0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 \n", "1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 \n", "2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 \n", "3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 \n", "4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 \n", "5 4.0368 52.0 4.761658 1.103627 413.0 2.139896 37.85 \n", "6 3.6591 52.0 4.931907 0.951362 1094.0 2.128405 37.84 \n", "7 3.1200 52.0 4.797527 1.061824 1157.0 1.788253 37.84 \n", "8 2.0804 42.0 4.294118 1.117647 1206.0 2.026891 37.84 \n", "9 3.6912 52.0 4.970588 0.990196 1551.0 2.172269 37.84 \n", "10 3.2031 52.0 5.477612 1.079602 910.0 2.263682 37.85 \n", "11 3.2705 52.0 4.772480 1.024523 1504.0 2.049046 37.85 \n", "12 3.0750 52.0 5.322650 1.012821 1098.0 2.346154 37.85 \n", "13 2.6736 52.0 4.000000 1.097701 345.0 1.982759 37.84 \n", "14 1.9167 52.0 4.262903 1.009677 1212.0 1.954839 37.85 \n", "\n", " Longitude target \n", "0 -122.23 4.526 \n", "1 -122.22 3.585 \n", "2 -122.24 3.521 \n", "3 -122.25 3.413 \n", "4 -122.25 3.422 \n", "5 -122.25 2.697 \n", "6 -122.25 2.992 \n", "7 -122.25 2.414 \n", "8 -122.26 2.267 \n", "9 -122.25 2.611 \n", "10 -122.26 2.815 \n", "11 -122.26 2.418 \n", "12 -122.26 2.135 \n", "13 -122.26 1.913 \n", "14 -122.26 1.592 " ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head(15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.4. 
Check the number of missing values per feature (1pt)" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "MedInc 0\n", "HouseAge 0\n", "AveRooms 0\n", "AveBedrms 0\n", "Population 0\n", "AveOccup 0\n", "Latitude 0\n", "Longitude 0\n", "target 0\n", "dtype: int64" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.5. Divide the dataset into train and test, where 30% of the dataset will be used for test, with the random_state=42 (1pt)" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X = df.drop(columns = 'target')\n", "y = df['target']\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, random_state=42, test_size=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(df.iloc[:,:-1], df.iloc[:,-1], random_state=42, test_size=0.3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.6. Train a Linear Regression model (1pt)" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LinearRegression()
" ], "text/plain": [ "LinearRegression()" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "lin_reg = LinearRegression()\n", "lin_reg.fit(X_train,y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.7. What value of target is predicted by the model, when all the features have value 0 (1pt)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(-37.056241331525186)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lin_reg.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.8. What is the value of the coefficient associated with the feature HouseAge? (1pt)" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population', 'AveOccup',\n", " 'Latitude', 'Longitude', 'target'],\n", " dtype='object')" ] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns\n", "#df.columns.get_loc('HouseAge')" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(0.009681867985916507)" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lin_reg.coef_[1]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.9. Predict the target values of the data points from the training set (1pt)" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "y_train_pred = lin_reg.predict(X_train)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "<thead>\n",
"<tr><th></th><th>MedInc</th><th>HouseAge</th><th>AveRooms</th><th>AveBedrms</th><th>Population</th><th>AveOccup</th><th>Latitude</th><th>Longitude</th></tr>\n",
"</thead>\n", "<tbody>\n",
"<tr><th>7061</th><td>4.1312</td><td>35.0</td><td>5.882353</td><td>0.975490</td><td>1218.0</td><td>2.985294</td><td>33.93</td><td>-118.02</td></tr>\n",
"<tr><th>14689</th><td>2.8631</td><td>20.0</td><td>4.401210</td><td>1.076613</td><td>999.0</td><td>2.014113</td><td>32.79</td><td>-117.09</td></tr>\n",
"<tr><th>17323</th><td>4.2026</td><td>24.0</td><td>5.617544</td><td>0.989474</td><td>731.0</td><td>2.564912</td><td>34.59</td><td>-120.14</td></tr>\n",
"<tr><th>10056</th><td>3.1094</td><td>14.0</td><td>5.869565</td><td>1.094203</td><td>302.0</td><td>2.188406</td><td>39.26</td><td>-121.00</td></tr>\n",
"<tr><th>15750</th><td>3.3068</td><td>52.0</td><td>4.801205</td><td>1.066265</td><td>1526.0</td><td>2.298193</td><td>37.77</td><td>-122.45</td></tr>\n",
"<tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"<tr><th>11284</th><td>6.3700</td><td>35.0</td><td>6.129032</td><td>0.926267</td><td>658.0</td><td>3.032258</td><td>33.78</td><td>-117.96</td></tr>\n",
"<tr><th>11964</th><td>3.0500</td><td>33.0</td><td>6.868597</td><td>1.269488</td><td>1753.0</td><td>3.904232</td><td>34.02</td><td>-117.43</td></tr>\n",
"<tr><th>5390</th><td>2.9344</td><td>36.0</td><td>3.986717</td><td>1.079696</td><td>1756.0</td><td>3.332068</td><td>34.03</td><td>-118.38</td></tr>\n",
"<tr><th>860</th><td>5.7192</td><td>15.0</td><td>6.395349</td><td>1.067979</td><td>1777.0</td><td>3.178891</td><td>37.58</td><td>-121.96</td></tr>\n",
"<tr><th>15795</th><td>2.5755</td><td>52.0</td><td>3.402576</td><td>1.058776</td><td>2619.0</td><td>2.108696</td><td>37.77</td><td>-122.42</td></tr>\n",
"</tbody>\n", "</table>\n", "<p>14448 rows × 8 columns</p>\n", "</div>
" ], "text/plain": [ " MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude \\\n", "7061 4.1312 35.0 5.882353 0.975490 1218.0 2.985294 33.93 \n", "14689 2.8631 20.0 4.401210 1.076613 999.0 2.014113 32.79 \n", "17323 4.2026 24.0 5.617544 0.989474 731.0 2.564912 34.59 \n", "10056 3.1094 14.0 5.869565 1.094203 302.0 2.188406 39.26 \n", "15750 3.3068 52.0 4.801205 1.066265 1526.0 2.298193 37.77 \n", "... ... ... ... ... ... ... ... \n", "11284 6.3700 35.0 6.129032 0.926267 658.0 3.032258 33.78 \n", "11964 3.0500 33.0 6.868597 1.269488 1753.0 3.904232 34.02 \n", "5390 2.9344 36.0 3.986717 1.079696 1756.0 3.332068 34.03 \n", "860 5.7192 15.0 6.395349 1.067979 1777.0 3.178891 37.58 \n", "15795 2.5755 52.0 3.402576 1.058776 2619.0 2.108696 37.77 \n", "\n", " Longitude \n", "7061 -118.02 \n", "14689 -117.09 \n", "17323 -120.14 \n", "10056 -121.00 \n", "15750 -122.45 \n", "... ... \n", "11284 -117.96 \n", "11964 -117.43 \n", "5390 -118.38 \n", "860 -121.96 \n", "15795 -122.42 \n", "\n", "[14448 rows x 8 columns]" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([2.13761366, 1.76385736, 2.75114302, ..., 2.03900584, 2.84130506,\n", " 2.27916759], shape=(14448,))" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train_pred" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.10. 
What is the value of the cost function for the obtained coefficients and the training dataset (1pt)" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(7561.471021289252)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "SSE = ((y_train-y_train_pred)**2).sum()\n", "SSE " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(0.5233576288267755)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MSE = ((y_train-y_train_pred)**2).sum()/len(y_train)\n", "MSE " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5233576288267755" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import mean_squared_error\n", "error_train = mean_squared_error(y_train,y_train_pred)\n", "error_train" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.11. Evaluate the model's performance (1pt)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5957702326061662" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lin_reg.score(X_test,y_test) " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.12. Generate polynomial features up to degree 2 of the training set (1pt) " ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import PolynomialFeatures\n", "\n", "poly_converter = PolynomialFeatures(degree=2, include_bias = False)\n", "poly_converter.fit(X_train)\n", "\n", "X_train = poly_converter.transform(X_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.13. 
Scale the polynomial features from the previous step using the standard scaler (1pt)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import StandardScaler\n", "\n", "scaler = StandardScaler()\n", "scaler.fit(X_train)\n", "\n", "X_train = scaler.transform(X_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.14. Train a Ridge Regression with the regularization strength equal to 0.001 on the scaled dataset. What would happen to the feature coefficients and the model if alpha approached infinity? (2pt)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Ridge(alpha=0.001)
" ], "text/plain": [ "Ridge(alpha=0.001)" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import Ridge\n", "\n", "ridge = Ridge(alpha=0.001)\n", "ridge.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As alpha approaches infinity, the penalty term dominates the cost function and all feature coefficients shrink toward 0. Because the intercept is not penalized, the predictions approach a constant (the mean of the training target), so the model underfits: it can no longer capture the relationships between the input features and the target." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.15. Train Lasso Regression with the regularization strength equal to 0.01 on the scaled dataset and set the maximum number of iterations to 100000 (1pt)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Lasso(alpha=0.01, max_iter=100000)
" ], "text/plain": [ "Lasso(alpha=0.01, max_iter=100000)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import Lasso\n", "\n", "lasso = Lasso(alpha = 1e-2, max_iter = int(1e5))\n", "lasso.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.16. Count how many coefficients (excluding intercept) are calculated (1pt)" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Total number of coefficients: 44\n" ] } ], "source": [ "print(f\"Total number of coefficients: {len(lasso.coef_)}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.17. Check how many features will not be used to predict the target with this model (1pt)" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Number of Features with Coefficients equal to 0: 28\n" ] } ], "source": [ "print(f\"Number of Features with Coefficients equal to 0: {(lasso.coef_==0).sum()}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 1.18. Plot the coefficients of Lasso and Ridge regression on the same plot, with labels, Ridge in red, and Lasso in blue (2pt)" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "#plt.figure(figsize=(5,4))\n", "plt.plot(ridge.coef_, 's', label = 'Ridge Coefficients', color = 'red')\n", "plt.plot(lasso.coef_, '^', label = 'Lasso Coefficients', color = 'blue')\n", "plt.xlabel(\"Coefficients' Index\")\n", "plt.ylabel(\"Coefficients' Magnitude\")\n", "plt.legend(loc=(1.01,0.915));" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Total 18 points**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this question we will use a dataset with the medical details of patients for predicting the onset of diabetes within 5 years. The target is the last column of the dataset, where value 1 is interpreted as \"tested positive for diabetes\".\n", " 1. Import the CSV file \"diabetes.csv\" into a pandas dataframe (1pt)\n", " 2. How many duplicate rows do we have? If there are any, remove them. (1pt)\n", " 3. Generate descriptive statistics for all the numerical columns with one line of code (1pt)\n", " 4. Plot the distribution of the target value, per class percentages (1pt)\n", " 5. Split the dataset into training, validation and test sets, with the ratio 50:30:20, and use `random_state=42` (1pt)\n", " 6. Train a logistic regression model with `solver='liblinear'`, lasso (L1) regularization with regularization strength equal to 0.01, and a maximum of 700 iterations (2pt)\n", " 7. Use the validation set to find the value of the threshold that maximizes the F1 score (of class 1). What is that threshold value? (1pt)\n", " 8. If we use the threshold value found above, how many false negatives do we have on the test dataset? (1pt)\n", " 9. What is the precision of our model with the value of threshold from step 7? (1pt)\n", " 10. 
What proportion (approximately) of patients with diabetes would we reach if we decided to contact 60% of the patients in the test set, ordered by the decreasing model score (1pt)\n", " 11. Use the data available (except the test set) with a cross validation method that finds the value of regularization strength of l1 penalty of Logistic regression that maximizes recall. Check at least 8 different values of the parameter and verify the best cross validation score (4pt)\n", " 12. What value of C gives the highest recall (1pt)\n", " 13. What was the second best mean test value of recall in cross validation (1pt)\n", " 14. What is f1 score of the best model (1pt)\n" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.1. Import the csv file \"diabetes.csv\" into pandas dataframe (1pt)" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('diabetes.csv')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.2. How many duplicate rows do we have? If there are any, remove them. (1pt)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "771\n", "Number of duplicate rows: 3\n", "768\n" ] } ], "source": [ "print(len(df))\n", "print(f\"Number of duplicate rows: {df.duplicated().sum()}\")\n", "\n", "df.drop_duplicates(keep = 'first', inplace=True)\n", "print(len(df))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.3. Generate descriptive statistics for all the numerical columns with one line of code (1pt)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
num_pregnancyglucose_concblood_pressureskin_foldserumbody_massdiabetes_pedigreeagetarget
count768.000000768.000000768.000000768.000000768.000000768.000000768.000000768.000000768.000000
mean3.845052120.89453169.10546920.53645879.79947931.9925780.47187633.2408850.348958
std3.36957831.97261819.35580715.952218115.2440027.8841600.33132911.7602320.476951
min0.0000000.0000000.0000000.0000000.0000000.0000000.07800021.0000000.000000
25%1.00000099.00000062.0000000.0000000.00000027.3000000.24375024.0000000.000000
50%3.000000117.00000072.00000023.00000030.50000032.0000000.37250029.0000000.000000
75%6.000000140.25000080.00000032.000000127.25000036.6000000.62625041.0000001.000000
max17.000000199.000000122.00000099.000000846.00000067.1000002.42000081.0000001.000000
\n", "
" ], "text/plain": [ " num_pregnancy glucose_conc blood_pressure skin_fold serum \\\n", "count 768.000000 768.000000 768.000000 768.000000 768.000000 \n", "mean 3.845052 120.894531 69.105469 20.536458 79.799479 \n", "std 3.369578 31.972618 19.355807 15.952218 115.244002 \n", "min 0.000000 0.000000 0.000000 0.000000 0.000000 \n", "25% 1.000000 99.000000 62.000000 0.000000 0.000000 \n", "50% 3.000000 117.000000 72.000000 23.000000 30.500000 \n", "75% 6.000000 140.250000 80.000000 32.000000 127.250000 \n", "max 17.000000 199.000000 122.000000 99.000000 846.000000 \n", "\n", " body_mass diabetes_pedigree age target \n", "count 768.000000 768.000000 768.000000 768.000000 \n", "mean 31.992578 0.471876 33.240885 0.348958 \n", "std 7.884160 0.331329 11.760232 0.476951 \n", "min 0.000000 0.078000 21.000000 0.000000 \n", "25% 27.300000 0.243750 24.000000 0.000000 \n", "50% 32.000000 0.372500 29.000000 0.000000 \n", "75% 36.600000 0.626250 41.000000 1.000000 \n", "max 67.100000 2.420000 81.000000 1.000000 " ] }, "execution_count": 33, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.4. Plot the distribution of the target value, per class percentages (1pt)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "target\n", "0 500\n", "1 268\n", "Name: count, dtype: int64\n" ] }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "print(df['target'].value_counts())\n", "#alternative: unique, counts = np.unique(df['target'], return_counts=True)\n", "\n", "df['target'].value_counts(normalize=True).plot(kind='bar')\n", "plt.xticks(rotation=0)\n", "plt.title('Distribution of target variable in percentage')\n", "plt.show()\n", "\n", "#Alternative code to display plot\n", "#sns.barplot(x = df['target'].value_counts().index, \n", "#y = df['target'].value_counts(normalize=True)).set(title = 'Distribution of target variable (in percentage)');" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.5. Split the dataset into training, validation and test sets with the ratio 50:30:20, using random_state=42 (1pt)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.5\n", "0.3\n", "0.2\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X = df.drop(columns='target')\n", "y = df['target']\n", "\n", "X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size = 0.2, random_state = 42)\n", "\n", "X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size = 0.375, random_state = 42) # validation share of the remaining 80%: 30/(100-20)\n", "\n", "# confirm the split ratios\n", "\n", "print(round(len(X_train)/len(X),2))\n", "print(round(len(X_val)/len(X),2))\n", "print(round(len(X_test)/len(X),2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.6. Train a logistic regression model with solver='liblinear', lasso (L1) regularization with regularization strength equal to 0.01, and at most 700 iterations (2pt) " ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LogisticRegression(C=100.0, max_iter=700, penalty='l1', solver='liblinear')
" ], "text/plain": [ "LogisticRegression(C=100.0, max_iter=700, penalty='l1', solver='liblinear')" ] }, "execution_count": 36, "metadata": {}, "output_type": "execute_result" } ], "source": [ "#Standardization could be performed but it was not required. \n", "\n", "from sklearn.linear_model import LogisticRegression\n", "\n", "log_reg = LogisticRegression(solver = 'liblinear', C = (1/0.01), penalty = 'l1', max_iter = 700)\n", "log_reg.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.7. Use the validation set to find the value of the threshold that maximizes f1 score (of Class 1). What is that threshold value? (1pt)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import precision_recall_curve\n", "y_pred_proba_val= log_reg.predict_proba(X_val)\n", "precision, recall, threshold = precision_recall_curve(y_val, y_pred_proba_val[:,1])" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best threshold: 0.2626832463017761\n" ] } ], "source": [ "import numpy as np\n", "f1_scores = 2*recall*precision/(recall+precision)\n", "idx_best=np.nanargmax(f1_scores)\n", "best_threshold=threshold[idx_best]\n", "print('Best threshold: ', best_threshold)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.8. If we use the threshold value found above, how many false negatives do we have on the test dataset? (1pt)" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "y_pred_proba_test= log_reg.predict_proba(X_test)\n", "# Classify as positive when the class-1 probability reaches the tuned threshold\n", "y_pred_new = (y_pred_proba_test[:,1] >= best_threshold).astype(int)\n", "cm=confusion_matrix(y_test, y_pred_new)\n", "ConfusionMatrixDisplay(cm).plot();" ] }, { "cell_type": "code", "execution_count": 40, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "False negatives: 12\n" ] } ], "source": [ "# Row 1 = actual positives, column 0 = predicted negatives\n", "print(f\"False negatives: {cm[1][0]}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.9. What is the precision of our model with the threshold value from step 2.7? (1pt)" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5375" ] }, "execution_count": 41, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import precision_score\n", "\n", "precision_score(y_test, y_pred_new)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.10. Approximately what proportion of patients with diabetes would we reach if we decided to contact 60% of the patients in the test set, ordered by decreasing model score? (1pt)" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "" ] }, "execution_count": 42, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn_evaluation.plot import cumulative_gain\n", "cumulative_gain(y_test, y_pred_proba_test)" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "If we decided to contact 60% of the patients in the test set, we would reach approximately 80% of the patients with diabetes.\n" ] } ], "source": [ "print(\"If we decided to contact 60% of the patients in the test set, we would reach approximately 80% of the patients with diabetes.\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.11. Use the data available (except the test set) with a cross validation method that finds the value of regularization strength of l1 penalty of Logistic regression that maximizes recall. Check at least 8 different values of the parameter and verify the best cross validation score (4pt)" ] }, { "cell_type": "code", "execution_count": 44, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV, RandomizedSearchCV" ] }, { "cell_type": "code", "execution_count": 45, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([0.002, 0.032, 0.062, 0.092, 0.122, 0.152, 0.182, 0.212])" ] }, "execution_count": 45, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.arange(0.002, 0.22, 0.03)" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [], "source": [ "param_grid = {'C': np.arange(0.002, 0.22, 0.03)}\n", "\n", "grid_search = GridSearchCV(LogisticRegression(solver=\"liblinear\", penalty='l1'), param_grid, scoring='recall')" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
GridSearchCV(estimator=LogisticRegression(penalty='l1', solver='liblinear'),\n",
       "             param_grid={'C': array([0.002, 0.032, 0.062, 0.092, 0.122, 0.152, 0.182, 0.212])},\n",
       "             scoring='recall')
" ], "text/plain": [ "GridSearchCV(estimator=LogisticRegression(penalty='l1', solver='liblinear'),\n", " param_grid={'C': array([0.002, 0.032, 0.062, 0.092, 0.122, 0.152, 0.182, 0.212])},\n", " scoring='recall')" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.fit(X_trainval, y_trainval)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(0.5260243632336655)" ] }, "execution_count": 48, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
RandomizedSearchCV(estimator=LogisticRegression(penalty='l1',\n",
       "                                                solver='liblinear'),\n",
       "                   n_iter=8,\n",
       "                   param_distributions={'C': <scipy.stats._distn_infrastructure.rv_continuous_frozen object at 0x0000013BEF7757F0>},\n",
       "                   scoring='recall')
" ], "text/plain": [ "RandomizedSearchCV(estimator=LogisticRegression(penalty='l1',\n", " solver='liblinear'),\n", " n_iter=8,\n", " param_distributions={'C': },\n", " scoring='recall')" ] }, "execution_count": 49, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy.stats import uniform\n", "distributions= dict(C=uniform(loc=0, scale=4))\n", "random_search = RandomizedSearchCV(LogisticRegression(solver=\"liblinear\", penalty='l1'), distributions, scoring='recall', n_iter=8)\n", "random_search.fit(X_trainval, y_trainval)\n" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "np.float64(0.5685492801771871)" ] }, "execution_count": 50, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_search.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.12. What value of C gives the highest recall (1pt)" ] }, { "cell_type": "code", "execution_count": 51, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'C': np.float64(0.182)}" ] }, "execution_count": 51, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.best_params_" ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'C': np.float64(3.575115902312705)}" ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "or" ] }, { "cell_type": "code", "execution_count": 53, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "0.182\n", "3.575115902312705\n" ] } ], "source": [ "results_grid = pd.DataFrame(grid_search.cv_results_)\n", "results_random = pd.DataFrame(random_search.cv_results_)\n", "print(results_grid[results_grid['mean_test_score'] == results_grid['mean_test_score'].max()]['param_C'].reset_index(drop=True)[0])\n", 
"print(results_random[results_random['mean_test_score'] == results_random['mean_test_score'].max()]['param_C'].reset_index(drop=True)[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### 2.13. What was the second best mean test value of recall in cross validation (1pt)" ] }, { "cell_type": "code", "execution_count": 54, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "
 | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_C | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score
0 | 0.008888 | 0.001085 | 0.011184 | 0.003361 | 0.002 | {'C': 0.002} | 0.142857 | 0.139535 | 0.023256 | 0.046512 | 0.095238 | 0.089480 | 0.048207 | 8
1 | 0.012031 | 0.002145 | 0.014828 | 0.003478 | 0.032 | {'C': 0.032} | 0.357143 | 0.255814 | 0.116279 | 0.255814 | 0.309524 | 0.258915 | 0.080715 | 7
2 | 0.015079 | 0.001570 | 0.013893 | 0.001386 | 0.062 | {'C': 0.062} | 0.380952 | 0.302326 | 0.255814 | 0.279070 | 0.404762 | 0.324585 | 0.058142 | 6
3 | 0.017014 | 0.002564 | 0.013607 | 0.001033 | 0.092 | {'C': 0.092} | 0.476190 | 0.488372 | 0.372093 | 0.372093 | 0.476190 | 0.436988 | 0.053173 | 5
4 | 0.016418 | 0.002078 | 0.011736 | 0.002120 | 0.122 | {'C': 0.122} | 0.547619 | 0.511628 | 0.395349 | 0.418605 | 0.476190 | 0.469878 | 0.056586 | 4
5 | 0.018939 | 0.002137 | 0.012201 | 0.001350 | 0.152 | {'C': 0.152} | 0.619048 | 0.581395 | 0.465116 | 0.441860 | 0.500000 | 0.521484 | 0.067951 | 2
6 | 0.020671 | 0.004331 | 0.013239 | 0.000900 | 0.182 | {'C': 0.182} | 0.595238 | 0.604651 | 0.488372 | 0.441860 | 0.500000 | 0.526024 | 0.063484 | 1
7 | 0.019981 | 0.001429 | 0.012767 | 0.000919 | 0.212 | {'C': 0.212} | 0.571429 | 0.604651 | 0.488372 | 0.441860 | 0.500000 | 0.521262 | 0.058837 | 3
\n", "
" ], "text/plain": [ " mean_fit_time std_fit_time mean_score_time std_score_time param_C \\\n", "0 0.008888 0.001085 0.011184 0.003361 0.002 \n", "1 0.012031 0.002145 0.014828 0.003478 0.032 \n", "2 0.015079 0.001570 0.013893 0.001386 0.062 \n", "3 0.017014 0.002564 0.013607 0.001033 0.092 \n", "4 0.016418 0.002078 0.011736 0.002120 0.122 \n", "5 0.018939 0.002137 0.012201 0.001350 0.152 \n", "6 0.020671 0.004331 0.013239 0.000900 0.182 \n", "7 0.019981 0.001429 0.012767 0.000919 0.212 \n", "\n", " params split0_test_score split1_test_score split2_test_score \\\n", "0 {'C': 0.002} 0.142857 0.139535 0.023256 \n", "1 {'C': 0.032} 0.357143 0.255814 0.116279 \n", "2 {'C': 0.062} 0.380952 0.302326 0.255814 \n", "3 {'C': 0.092} 0.476190 0.488372 0.372093 \n", "4 {'C': 0.122} 0.547619 0.511628 0.395349 \n", "5 {'C': 0.152} 0.619048 0.581395 0.465116 \n", "6 {'C': 0.182} 0.595238 0.604651 0.488372 \n", "7 {'C': 0.212} 0.571429 0.604651 0.488372 \n", "\n", " split3_test_score split4_test_score mean_test_score std_test_score \\\n", "0 0.046512 0.095238 0.089480 0.048207 \n", "1 0.255814 0.309524 0.258915 0.080715 \n", "2 0.279070 0.404762 0.324585 0.058142 \n", "3 0.372093 0.476190 0.436988 0.053173 \n", "4 0.418605 0.476190 0.469878 0.056586 \n", "5 0.441860 0.500000 0.521484 0.067951 \n", "6 0.441860 0.500000 0.526024 0.063484 \n", "7 0.441860 0.500000 0.521262 0.058837 \n", "\n", " rank_test_score \n", "0 8 \n", "1 7 \n", "2 6 \n", "3 5 \n", "4 4 \n", "5 2 \n", "6 1 \n", "7 3 " ] }, "execution_count": 54, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = pd.DataFrame(grid_search.cv_results_)\n", "results" ] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5 0.521484\n", "Name: mean_test_score, dtype: float64" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results[results.rank_test_score==2].mean_test_score" ] }, { "cell_type": "markdown", 
"metadata": {}, "source": [ "### 2.14. What is the F1 score of the best model? (1pt)" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import f1_score" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [], "source": [ "y_pred=grid_search.best_estimator_.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "F1 score is 0.6415094339622641\n" ] } ], "source": [ "print('F1 score is', f1_score(y_test, y_pred))" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "# grid_search.score(X_test, y_test) would yield a different result, because it uses the scoring metric the search was configured with (recall), not F1." ] } ], "metadata": { "kernelspec": { "display_name": "ml25", "language": "python", "name": "ml25" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.13.1" } }, "nbformat": 4, "nbformat_minor": 4 }