{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Contents:\n", "- [KNN Classifier](#KNN-Classifier)\n", " - [Decision boundary for KNN](#Decision-boundary-for-KNN)\n", " - [Tuning the number of neighbors with Grid Search](#Tuning-the-number-of-neighbors-with-Grid-Search)\n", " - [KNN for regression](#KNN-for-regression)\n", " - [Practice question](#Practice-question)\n", "- [Clustering](#Clustering)\n", " - [KMeans Clustering](#KMeans-Clustering)\n", " - [Generating our dataset](#Generating-our-dataset)\n", " - [Feature Scaling](#Feature-Scaling)\n", " - [Conduct KMeans Clustering](#Conduct-KMeans-Clustering)\n", " - [Choosing K, the number of clusters](#Choosing-K,-the-number-of-clusters)\n", " - [Elbow Method](#Elbow-Method)\n", " - [Silhoutte coefficient](#Silhoutte-coefficient)\n", " - [Davies-Bouldin Index](#Davies-Bouldin-Index)\n", " - [Example of using Kmeans for color compression](#Example-of-using-Kmeans-for-color-compression)\n", " - [Density Based clustering](#Density-Based-clustering)\n", " - [Practice question](#Practice-question)\n", "- [Comparing different clustering techniques on the face dataset](#Comparing-different-clustering-techniques-on-the-face-dataset)\n", " - [DBSCAN](#DBSCAN)\n", " - [KMeans](#KMeans)\n", "- [Dimensionality reduction](#Dimensionality-reduction)\n", " - [TSNE](#TSNE)\n", " - [UMAP](#UMAP)\n", " - [Comparison of TSNE and UMAP on the digit dataset](#Comparison-of-TSNE-and-UMAP-on-the-digit-dataset)\n", " - [TSNE](#TSNE)\n", " - [UMAP](#UMAP)\n", " - [Practice question](#Practice-question)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# KNN Classifier" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from collections import Counter\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "KNN (K-Nearest Neighbor) is a simple supervised 
classification algorithm we can use to assign a class to a new data point. \n", "KNN does not fit an explicit parametric model; it is a non-parametric, instance-based method. It keeps all the training data and makes predictions by computing the distance between an input data instance and each training instance.\n", "KNN classification can be summarized as follows:\n", "- Compute the distance between the new data point and every training data point (using, for example, Euclidean or Manhattan distance)\n", "- Pick the K entries from the training dataset that are closest to the new data point\n", "- Take a majority vote, i.e., the most common class/label among those K entries becomes the class of the new data point (alternatively, use distance-weighted voting)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Decision boundary for KNN" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's start with a simple analysis of the decision boundary using a toy dataset created with `sklearn`'s `make_classification`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_classification\n", "X, y = make_classification(n_samples=101, n_features=2, n_informative=2,\n", "                           n_redundant=0, n_repeated=0, n_classes=2,\n", "                           n_clusters_per_class=1, class_sep=0.3, random_state=123)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check how many members of each class we have:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({0: 52, 1: 49})" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Counter(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that for now we only want to analyze the impact of the number of nearest neighbors on the decision boundary, so we will not split the data into train and test sets. We will illustrate the proper use of KNN later on." 
] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "from sklearn.inspection import DecisionBoundaryDisplay" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some parameters of `KNeighborsClassifier`:\n", "- *n_neighbors*: number of neighbors to use.\n", "\n", "- *weights*: weight function used in prediction. Possible values:\n", "\n", " - 'uniform': uniform weights. All points in each neighborhood are weighted equally.\n", "\n", " - 'distance': weight points by the inverse of their distance. In this case, closer neighbors of a query point have a greater influence than neighbors that are further away.\n", " \n", " \n", "Let's now fit the KNN classifier with 9 neighbors and visualize the decision boundary." ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
KNeighborsClassifier(n_neighbors=9)
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('knn', KNeighborsClassifier())]),
             n_jobs=-1, param_grid={'knn__n_neighbors': range(1, 32, 2)},
             return_train_score=True)
Pipeline(steps=[('scaler', StandardScaler()),
                ('knn', KNeighborsClassifier(n_neighbors=7))])
|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_knn__n_neighbors | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | ... | mean_test_score | std_test_score | rank_test_score | split0_train_score | split1_train_score | split2_train_score | split3_train_score | split4_train_score | mean_train_score | std_train_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.015882 | 0.001918 | 0.225033 | 0.016135 | 1 | {'knn__n_neighbors': 1} | 0.9875 | 0.900 | 0.9625 | 0.949367 | ... | 0.942152 | 0.032390 | 13 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 1.000000 | 0.000000 |
| 1 | 0.014655 | 0.003874 | 0.120041 | 0.081644 | 3 | {'knn__n_neighbors': 3} | 0.9750 | 0.925 | 0.9500 | 0.949367 | ... | 0.947215 | 0.016663 | 7 | 0.971698 | 0.981132 | 0.968553 | 0.974922 | 0.984326 | 0.976126 | 0.005843 |

2 rows × 21 columns
KMeans(n_clusters=3, random_state=42)
KMeans(n_clusters=16)
DBSCAN(eps=0.3)
KMeans(n_clusters=2)