{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Machine Learning Homework 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Instructions" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This homework is due **before class on Friday, April 4th.**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Important notes:\n", "- Please submit the notebook with the output.\n", "- If the answer is not obvious from the printout, please type it.\n", "- The notebook should be self contained and we should be able to rerun it.\n", "- Import all the libraries that you find necessary to answer the questions.\n", "- If the subquestion is worth 1 point, no half points will be given: full point will be given to the correct answer. Similarly if the question is worth 2, possible points are 0,1,2. " ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "pd.set_option('display.max_colwidth', None)" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn import set_config\n", "set_config(display=\"text\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Total 14 points**\n", "\n", "The dataset will be used for classification, and the target value is stored in the column named *target*.\n", "\n", "1. Import the csv file \"HW2_Q1data.csv\" into pandas dataframe and view the first 5 rows. (1pt)\n", "2. Which are categorical and which are the numerical features? (1pt)\n", "3. Visualize the relationships between each pair of \"G\" features (G1, G2, G3) and the target. (1pt)\n", "4. 
Split the data into train and test, leaving 25% for the test set. Set the random_state to 42. (1pt)\n", "5. Using a pipeline, do one-hot encoding for all the categorical variables, standardize the numerical features, and perform a GridSearch with 5-fold cross-validation that maximizes the f1 score to find the parameters of logistic regression: (7pt)\n", " - For Logistic Regression: check penalties 'l1' and 'l2', regularization strengths of 0.001 and 0.1, and set the random state to 42;\n", " - For one-hot encoding: drop the first category in each feature. \\\n", "List the parameters of the pipeline.\\\n", "Note: choose the right order of steps in the pipeline. \n", "6. What are the best parameters found in step 5? (1pt)\n", "7. What are the second-best parameters found in step 5? (1pt)\n", "8. Check the f1 score of the best model on the test set and compare it with the cross-validation performance. (1pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Total 18 points**\n", "\n", "In this question we will use the Titanic dataset, which contains demographics and information on passengers and crew. The ship Titanic sank after colliding with an iceberg, killing more than half the people on board. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others. The target in the dataset corresponds to whether the person survived or not." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Import the csv file \"HW2_Q2data.csv\" into a pandas DataFrame, check the number of rows, check the data types, view the first 10 rows, and check the target distribution. (1pt)\n", "2. Check the number of missing values per column and drop any column that has more than 90% of its values missing. (1pt)\n", "3. Split the data into train and test, leaving 30% for the test set. Set the random_state to 42. (1pt)\n", "4. 
Using a pipeline, do the following (not necessarily in this order, but choose the right order of steps in the pipeline) (12pts)\n", " - Use SMOTENC for balancing\n", " - Normalize the numerical variables\n", " - Perform one-hot encoding for all the nominal variables, keeping all the dummy variables\n", " - Use mean imputing for the numerical variables and mode imputing for the nominal ones\n", " - Use randomized search with a random state of 42 and 25 parameter combinations to select the best classifier: try Random Forest and CatBoost\n", " - Use ROC area under the curve (AUC) as the criterion for model selection and tuning\n", " - The cross-validation needs 5 splits that shuffle the data and keep the same proportion of target classes in each split, with a test size of 0.3 and a random_state = 42\n", "\n", "Tip: after using the column transformer, the column names are replaced with indices.\n", "\n", "For Random Forest\n", " - tune the number of trees by testing the values from 400 to 800 (including 800) with a step size of 100\n", " - tune the maximum tree depth by testing the values from 7 to 12 (including 12)\n", " - tune the minimum number of samples required to split an internal node with values from 5 to 14 (including 14); also, set the random state to 42\n", "\n", "For CatBoost\n", " - tune the learning rate by testing the values [0.001, 0.01, 0.1]\n", " - tune the maximum tree depth by testing the values from 4 to 9 (including 9)\n", " - tune the percentage of features to be randomly selected on each iteration by sampling uniformly from the interval of 0.2 to 0.9 (including 0.9)\n", " - tune the weight of the minority class (class 1) by testing the values 1 and 3; also, set the seed to 42.\n", "\n", "Tip: consider whether either of these algorithms does not require one-hot encoding.\n", "\n", "5. What are the parameters of the best classifier found in step 4? Display all the parameters. (1pt)\n", "6. 
Using the test set, plot the ROC curve (including the random-guessing baseline) of the best model found in step 4, and calculate the percentage of those who survived that the model correctly identifies if we classify as class 1 all instances whose probability of belonging to class 1 is above 0.3. (2pt)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 3" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Total 20 points**\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "1. Import the csv file \"HW2_Q3data.csv\" into a pandas DataFrame and view the first 5 rows. (1pt)\n", "2. Standardize the data. (1pt)\n", "3. Run KMeans for the number of clusters varying from 3 to 17, with a smart initialization method. Use three metrics to choose the number of clusters. Choose the number of clusters and justify your decision. Set the random_state to 0. (6pts)\n", "4. Rerun KMeans for the selected number of clusters. (1pt)\n", "5. Visualize the clusters using TSNE for 4 different values of the hyperparameter perplexity. Color each cluster with a different color (clusters from step 3). Set the random_state to 42. Which value of the hyperparameter helps the most to visualize the clusters? Explain your choice. (3pt)\n", "6. Visualize the clusters using UMAP for 4 different values of the hyperparameter nearest neighbors. Color each cluster with a different color (clusters from step 3). Set the random_state to 42. Which value of the hyperparameter helps the most to visualize the clusters? Explain your choice. (3pt)\n", "7. Use the cluster assignments as labels and run a decision tree of depth 2 on the data (no need for a train/test split). Visualize the tree, print the total number of nodes, and create a DataFrame with only one column denoting the feature importances in descending order. (3pts)\n", "8. Run DBSCAN for values of epsilon 1, 2, and 3. 
What percentage of points is not assigned to any cluster (i.e., labeled as noise) for each value of epsilon? (2pts)" ] } ], "metadata": { "kernelspec": { "display_name": "ml2025", "language": "python", "name": "ml2025" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.11" } }, "nbformat": 4, "nbformat_minor": 4 }