{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Machine Learning Homework 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instructions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This homework is due **before class on Friday, March 7.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Important notes:\n",
    "- Please submit the notebook with the output.\n",
    "- If the answer is not obvious from the printout, please type it.\n",
    "- The notebook should be self contained and we should be able to rerun it.\n",
    "- Import all the libraries that you find necessary to answer the questions.\n",
    "- If the subquestion is worth 1 point, no half points will be given: full point will be given for the correct answer. Similarly, if the question is worth 2, possible points are 0,1,2.\n",
    "- Acknowledge the use of outside sources and code assistants."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 1 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Total 20 points**\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Import the California house prices dataset from `sklearn.datasets` using `fetch_california_housing` as follows:</br>\n",
    "    `from sklearn.datasets import fetch_california_housing` <br>\n",
    "     `housing = fetch_california_housing()`\n",
    "1. Print out the dataset description and read it (1pt)\n",
    "2. Convert the dataset to pandas dataframe (all the features and the target) (1pt)\n",
    "3. Check the number of data points, data types, and print the first 15 lines of the dataset (1pt)\n",
    "4. Check the number of missing values per feature (1pt)\n",
    "5. Divide the dataset into train and test, where 30% of the dataset will be used for test, with the `random_state=42` (1pt)\n",
    "6. Train a Linear Regression model (1pt)\n",
    "7. What value of target is predicted by the model, when all the features have value 0 (1pt)\n",
    "8. What is the value of the coefficient associated with the feature HouseAge (1pt)\n",
    "9. Predict the target values of the data points from the training set (1pt)\n",
    "10. What is the value of the cost function for the obtained coefficients and the training dataset (1pt)\n",
    "11. Evaluate the model's performance (1pt)\n",
    "12. Generate polynomial features up to degree 2 of the training dataset (1pt)\n",
    "13. Scale the polynomial features from the previous step using the standard scaler (1pt)\n",
    "14. Train a Ridge regression model with the regularization strength equal to 0.001 on the scaled dataset.  What would happen to the feature coefficients and the model if alpha approached infinity?  (2pt)\n",
    "15. Train Lasso Regression with the regularization strength equal to 0.01 on the scaled dataset and set the maximum number of iterations to 100000 (1pt)\n",
    "16. Count how many coefficients (excluding intercept) are calculated (1pt)\n",
    "17. Check how many features will not be used to predict the target with this model (1pt)\n",
    "18. Plot the coefficients of Lasso and Ridge regression on the same plot, with label, Ridge in red, and Lasso in blue (2pt)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.datasets import fetch_california_housing\n",
    "housing = fetch_california_housing()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Total 18 points**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For this question we will use a dataset with the medical details of patients for predicting the onset of diabetes within 5 years. The target is the last column of the dataset,  where value 1 is interpreted as \"tested positive for diabetes\".\n",
    " 1. Import the csv file \"diabetes.csv\" into pandas dataframe (1pt)\n",
    " 2. How many duplicate rows do we have? If there are any, remove them. (1pt)\n",
    " 3. Generate descriptive statistics for all the numerical columns with one line of code  (1pt)\n",
    " 4. Plot the distribution of the target value, per class percentages (1pt)\n",
    " 5. Split the dataset into training, validation and test set, with the ratio 50:30:20, and use the `random_seed=42` (1pt)\n",
    " 6. Train the logistic regression with solver='liblinear' with regularization strength equal to 0.01 with lasso regularization and that stops converging after 700 iterations (2pt) \n",
    " 7. Use the validation set to find the value of the threshold that maximizes f1 score (of Class 1). What is that threshold value? (1pt)\n",
    " 8. If we use the threshold value found above, how many false negatives do we have on the test dataset? (1pt)\n",
    " 9. What is the precision of our model with the value of threshold from step 7? (1pt)\n",
    " 10. What proportion (approximately) of patients with diabetes would we reach if we decided to contact 60% of the patients in the test set, ordered by the decreasing model score (1pt)\n",
    " 11. Use the data available (except the test set) with a cross validation method that finds the value of regularization strength of l1 penalty of Logistic regression that maximizes recall. Check at least 8 different values of the parameter and verify the best cross validation score (4pt)\n",
    " 12. What value of C gives the highest recall (1pt)\n",
    " 13. What was the second best mean test value of recall in cross validation (1pt)\n",
    " 14. What is f1 score of the best model (1pt)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ml2025",
   "language": "python",
   "name": "ml2025"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.11"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}