{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Introduction to Machine Learning Homework 3"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Instructions"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This homework is due before class on **Friday, May 2nd**."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Important notes:\n",
    "- Please submit the notebook with the output.\n",
    "- If the answer is not obvious from the printout, please type it as markdown.\n",
    "- The notebook should be self contained and we should be able to rerun it.\n",
    "- Import all the libraries that you find necessary to answer the questions.\n",
    "- If the subquestion is worth 1 point, no half points will be given: full point will be given to the correct answer. Similarly if the question is worth 2, possible points are 0,1,2. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import seaborn as sns\n",
    "pd.set_option('display.max_colwidth', None)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import set_config\n",
    "set_config(display=\"text\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Total 26 points**\n",
    "\n",
    "For this question we will use a dataset with the medical details of patients for predicting the onset of diabetes within 5 years. The target is the last column of the dataset,  where value 1 is interpreted as \"tested positive for diabetes\"."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Import the csv file \"HW3_Q1.csv\" into pandas dataframe, check the number of rows, check the data types, view the first 5 rows, and check class distribution  (1pt)\n",
    "2. Plot a heatmap of the Pearson correlations between each pair of features (with annotations). Which pair of variables has the highest correlation? (1pt)\n",
    "3.  Split the data into train and test, leaving 20% for the test set. Make sure that the target distribution is the same in train and test set. Set the random_state to 0 (1pt)\n",
    "4. Using pipeline with randomized search do data standardization, data balancing, and then apply XGBoost (6pts)\n",
    "- For data balancing test no data balancing, oversampling and SMOTE (with random state set to 0 for both).\n",
    "- For XGBoost, for learning rate sample uniformly from the interval between 0.03 and 0.1 (including 0.1), for the maximum tree depth, test the values from 4 to 7 (including 7), and for the number of iterations test the values from 300 to 900 (inclusive), in steps of 100.\n",
    "- Check in total 15 parameter combinations, set random state to 0, do 5 fold cross validation and use f1-score as the criterion for model selection and tuning.\n",
    "\n",
    "What are the best parameters found?\n",
    "\n",
    "5. Visualize the original training dataset using TSNE, coloring the classes in different colors and with 70 as the number of neighbors, and visualize the training dataset after employing the resampling method of the best estimator found in step 4, also with classes in different colors and 70 as the number of neighbors. Set the random state to 42 on both. If the best estimator did not employ any sampling method, use SMOTE (2pt)\n",
    "6. Check the recall of the best model found in step 4 on the test set (1pt) \n",
    "\n",
    "For all the interpretation approaches below use the test set. \n",
    "\n",
    "7. Check feature importance of the best model obtained in step 4 using permutation feature importance on the test set, permuting each feature 30 times and setting the random state to 42. What are the two most important features? (2pt)\n",
    "8. With the test set, for two most important features found in step 7, plot partial dependence plot separately for each feature. Explain how the feature impacts the average predicted target value. Plot partial dependence interactive plot for these two features. Approximately for what ranges of values of these features is the lowest average predicted probability of being tested positive for diabetes? (4pt)\n",
    "9. Find the first point misclassified by the model. What is the probability that it belongs to class 1 found by the model? Use LIME to see the explanation for this prediction by using only 3 features to explain the model, with 200 samples; also, set the random state to 42 when defining the explainer. What are the features that LIME uses to explain the prediction? What are the features that contribute to the wrong class, i.e., the class predicted by the model? (5pt)\n",
    "10. \n",
    "What is the most important feature in the testing set according to Shap. Explain in one sentence how the values of this feature impact the prediction (3pt)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Question 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Total 17 points**\n",
    "\n",
    "In this question we will use the MNIST digits dataset. However, to reduce the dataset size, we will only classify the first five classes. The question must be solved with pytorch and lightning."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "1. Import the csv file \"HW3_Q2data.csv\" into pandas dataframe, check the number of rows, check the data types, view the first 5 rows, check class distribution (1pt)\n",
    "   \n",
    "As the images have been compressed into a flattened format, restore the data to its original format with dimensions of 28 pixels in height and 28 pixels in width by transforming the data and labels to a numpy object using to_numpy() and the data through reshape(-1,28,28)\n",
    "\n",
    "\n",
    "2. Visualize the first image of the dataset (1pt)\n",
    "3. Split the dataset into training, validation, and testing sets with proportions of 60%, 30%, and 10% respectively. After performing the split, print the proportion of each set (1pt)\n",
    "4. Construct a TensorDataset for each set with the resulting tensors. Set the torch and numpy seed to 42. Instantiate a DataLoader for each TensorDataset with a batch size of 64 (2pt)\n",
    "5. Define a fully connected network with one hidden layer containing 64 nodes, and include any other required layers. Specify the appropriate loss function and initialize an Adam optimizer with a learning rate of 0.0005. Set up a model saving checkpoint and early stopping to stop if there isn't any improvement in validation accuracy after 6 epochs. Prepare the function for printing and visualizing the metrics per epoch. (6pt)\n",
    "6. Train the fully connected network for 50 epochs. Set the torch and numpy seed to 42. Plot training and validation accuracy and loss. What was the validation accuracy in the last epoch and why did it stop there? Is the model overfitting? Visualize the model architecture. (5pt)\n",
    "7. Evaluate the best model's accuracy on the test set (1pt)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "ml1111",
   "language": "python",
   "name": "ml1111"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}