{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Machine Learning 2695 Coding Exam" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This part of the exam will carry 50% of the exam grade, and the theoretical part will carry the other 50%. You have 2 hours in total to complete both parts. For the theoretical part, you will complete the Moodle quiz; do not forget to submit it once you are done (multiple submissions are possible; the last one counts).\n", "\n", "You should submit this notebook through the Coding Exam submission on Moodle. \n", "\n", "The notebook should contain the code, output and explanations. The notebook should be re-runnable.\n", "\n", "Total points: **30**\n", "\n", "- Question 1: 10 points\n", "- Question 2: 13 points\n", "- Question 3: 7 points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", "\n", "**10 points**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are hired by the Inventory department of a large retail store to help with the product inventory management. They provided you with a product dataset stored in a file: `Question_1.csv` with the following information:\n", "\n", "- `product_id`: Unique identifier for each product in the inventory.\n", "- `sales_volume`: Total quantity of items sold over a specific period.\n", "- `location`: Warehouse or store location where the product is stored.\n", "- `category`: Type of product (e.g., electronics, clothing, groceries).\n", "- `demand_level`: Level of product demand.\n", "- `inventory_count`: Current inventory count for each item.\n", "- `reorder_frequency`: Number of times an item was reordered in the last 3 months.\n", "- `lead_time`: Average time taken from ordering to receiving the item, in days.\n", "- `stock_priority`: Priority level for restocking.\n", "- `supplier`: Name or code of the supplier providing the product.\n", "\n", "1. 
Your task is to use the most appropriate preprocessing steps and suggest the most appropriate number of product groups. Justify all your decisions.\n", "2. Assuming that a data point (product) is considered a core product if it has 10 neighbors (including itself), and the maximum distance between two data points in the neighborhood is 1.5, how many products would not belong to a single group? How many groups would we end up having? " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " product_id sales_volume location category demand_level \\\n", "0 0 255.0 Rivertown shoes medium \n", "1 1 371.0 Rivertown shoes low \n", "2 2 482.0 Rivertown shoes low \n", "3 3 494.0 Rivertown electronics medium \n", "4 4 298.0 Rivertown accessories medium \n", "\n", " inventory_count reorder_frequency lead_time stock_priority supplier \n", "0 5736.0 2.0 28.0 not_critical Nimbus \n", "1 4951.0 5.0 86.0 not_critical Nimbus \n", "2 4651.0 4.0 22.0 not_critical Vertex \n", "3 3406.0 3.0 163.0 lowest Pinnacle \n", "4 4257.0 3.0 14.0 not_critical Vertex " ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df=pd.read_csv('Question_1.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "product_id 0\n", "sales_volume 30\n", "location 36\n", "category 0\n", "demand_level 0\n", "inventory_count 0\n", "reorder_frequency 0\n", "lead_time 0\n", "stock_priority 0\n", "supplier 0\n", "dtype: int64" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isna().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are no missing values. 
" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['product_id', 'sales_volume', 'location', 'category', 'demand_level',\n", " 'inventory_count', 'reorder_frequency', 'lead_time', 'stock_priority',\n", " 'supplier'],\n", " dtype='object')" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.columns" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "product_id int64\n", "sales_volume float64\n", "location object\n", "category object\n", "demand_level object\n", "inventory_count float64\n", "reorder_frequency float64\n", "lead_time float64\n", "stock_priority object\n", "supplier object\n", "dtype: object" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are both numerical and categorical data." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "df=df.drop(columns='product_id') #non-informative variable" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['sales_volume', 'inventory_count', 'reorder_frequency', 'lead_time'], dtype='object')" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numerical=df.select_dtypes(include=np.number).columns\n", "numerical" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['location', 'category', 'demand_level', 'stock_priority', 'supplier'], dtype='object')" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categorical=df.select_dtypes(exclude=np.number).columns\n", "categorical" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": 
"stream", "text": [ "location\n", "Rivertown 2742\n", "Everton 119\n", "Westvale 103\n", "Name: count, dtype: int64\n", "\n", "category\n", "shoes 1500\n", "accessories 900\n", "groceries 240\n", "electronics 150\n", "cosmetics 150\n", "clothing 60\n", "Name: count, dtype: int64\n", "\n", "demand_level\n", "low 1650\n", "medium 900\n", "high 450\n", "Name: count, dtype: int64\n", "\n", "stock_priority\n", "not_critical 1830\n", "low_critical 600\n", "lowest 420\n", "high_critical 150\n", "Name: count, dtype: int64\n", "\n", "supplier\n", "Nimbus 1800\n", "Pinnacle 600\n", "Vertex 450\n", "Zenith 90\n", "Acme 60\n", "Name: count, dtype: int64\n", "\n" ] } ], "source": [ "for col in categorical:\n", " print(df[col].value_counts())\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Among the categorical variables, `demand_level` and `stock_priority` are ordinal." ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [], "source": [ "ohe_cols=['location', 'category', 'supplier']\n", "ord_cols=['stock_priority', 'demand_level']\n", "categories=[[\"lowest\", \"not_critical\", \"low_critical\", 'high_critical'],\n", " ['low','medium','high']]" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "from sklearn.preprocessing import MinMaxScaler, StandardScaler, OneHotEncoder, OrdinalEncoder\n", "from sklearn.compose import ColumnTransformer\n", "from sklearn.impute import SimpleImputer" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pipeline(steps=[('encoder',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('numerical',\n", " Pipeline(steps=[('imputation_median',\n", " SimpleImputer())]),\n", " Index(['sales_volume', 'inventory_count', 'reorder_frequency', 'lead_time'], dtype='object')),\n", " ('ohe',\n", " Pipeline(steps=[('imputation_mode',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('onehot',\n", " OneHotEncoder(sparse_output=False))]),\n", " ['location', 'category',\n", " 'supplier']),\n", " ('ordinal',\n", " Pipeline(steps=[('ord',\n", " OrdinalEncoder(categories=[['lowest',\n", " 'not_critical',\n", " 'low_critical',\n", " 'high_critical'],\n", " ['low',\n", " 'medium',\n", " 'high']]))]),\n", " ['stock_priority',\n", " 'demand_level'])])),\n", " ('scaler', StandardScaler())])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numeric_preprocessor = Pipeline([ \n", " (\"imputation_median\", SimpleImputer(strategy=\"mean\"))\n", "])\n", "\n", "ohe_preprocessor = Pipeline([\n", " (\"imputation_mode\", SimpleImputer( strategy=\"most_frequent\")),\n", " (\"onehot\", OneHotEncoder(sparse_output=False ))\n", " ])\n", "\n", "ord_preprocessor = Pipeline([\n", "\n", " (\"ord\", OrdinalEncoder(categories=categories ))\n", " ])\n", "\n", "\n", "preprocessor = ColumnTransformer([\n", " (\"numerical\", numeric_preprocessor, numerical),\n", " (\"ohe\", ohe_preprocessor, ohe_cols),\n", " (\"ordinal\", ord_preprocessor, ord_cols)\n", " ], remainder='passthrough') #this step is necessary, otherwise the numerical variables are discarded.\n", "\n", "pipe = Pipeline([\n", " ('encoder', preprocessor),\n", " ('scaler', StandardScaler()), \n", " ])\n", "pipe\n", " \n", " " ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "df_transformed=pipe.fit_transform(df)" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " sales_volume location category demand_level inventory_count \\\n", "0 255.0 Rivertown shoes medium 5736.0 \n", "1 371.0 Rivertown shoes low 4951.0 \n", "2 482.0 Rivertown shoes low 4651.0 \n", "3 494.0 Rivertown electronics medium 3406.0 \n", "4 298.0 Rivertown accessories medium 4257.0 \n", "\n", " reorder_frequency lead_time stock_priority supplier \n", "0 2.0 28.0 not_critical Nimbus \n", "1 5.0 86.0 not_critical Nimbus \n", "2 4.0 22.0 not_critical Vertex \n", "3 3.0 163.0 lowest Pinnacle \n", "4 3.0 14.0 not_critical Vertex " ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.head()" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['numerical__sales_volume', 'numerical__inventory_count',\n", " 'numerical__reorder_frequency', 'numerical__lead_time',\n", " 'ohe__location_Everton', 'ohe__location_Rivertown',\n", " 'ohe__location_Westvale', 'ohe__category_accessories',\n", " 'ohe__category_clothing', 'ohe__category_cosmetics',\n", " 'ohe__category_electronics', 'ohe__category_groceries',\n", " 'ohe__category_shoes', 'ohe__supplier_Acme',\n", " 'ohe__supplier_Nimbus', 'ohe__supplier_Pinnacle',\n", " 'ohe__supplier_Vertex', 'ohe__supplier_Zenith',\n", " 'ordinal__stock_priority', 'ordinal__demand_level'], dtype=object)" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "feature_names=pipe.named_steps['encoder'].get_feature_names_out()\n", "feature_names" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3000, 20)" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_transformed.shape" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "from sklearn.cluster import KMeans\n", "from sklearn.metrics import silhouette_score, davies_bouldin_score\n", 
"scores = {'SSE': [], 'Silhouette Coefficient': [], 'David Bouldin Score': []}\n", "\n", "for k in range(2,16):\n", " \n", " kmeans = KMeans(n_clusters = k, init = 'k-means++', random_state=42)\n", " kmeans.fit(df_transformed)\n", " silhouette_results = silhouette_score(df_transformed, kmeans.labels_)\n", " DB_results = davies_bouldin_score(df_transformed, kmeans.labels_)\n", " \n", " scores['SSE'].append(kmeans.inertia_)\n", " scores['Silhouette Coefficient'].append(silhouette_results)\n", " scores['David Bouldin Score'].append(DB_results)\n" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, axes = plt.subplots(nrows = 1, ncols = 3, figsize=(25,6))\n", "\n", "for score_i, ax in zip(scores, axes.ravel()):\n", " ax.plot(range(2,16), scores[score_i])\n", " ax.set_xlabel('Number of Clusters', fontsize=12)\n", " ax.set_ylabel(score_i, fontsize=15)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "SSE: The most prominent elbows astarts at 4 clusters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Silhouette: Local maximum present at 4, 5, 9 clusters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "David Bouldin: Local minimum at 5 clusters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Therefore, the appropriate number of clusters is 5. " ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "no_cluster=5" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "KMeans(n_clusters=5, random_state=42)" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "kmeans = KMeans(n_clusters=no_cluster, init = 'k-means++', random_state=42)\n", "kmeans.fit(df_transformed)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The count of instances per cluster is as follows:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({1: 1709, 2: 601, 4: 450, 3: 150, 0: 90})" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter\n", "Counter(kmeans.labels_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using TSNE we can get a visual sense of the clusters." ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "from sklearn.manifold import TSNE\n", "tsne = TSNE(random_state=42, perplexity = 30)\n", "X_tsne = tsne.fit_transform(df_transformed)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({0: 1121,\n", " 7: 333,\n", " 4: 237,\n", " 1: 213,\n", " 2: 148,\n", " 5: 147,\n", " 8: 146,\n", " 3: 145,\n", " 13: 99,\n", " 11: 84,\n", " 9: 82,\n", " 12: 60,\n", " 6: 58,\n", " 10: 58,\n", " -1: 57,\n", " 14: 12})" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.cluster import DBSCAN\n", "dbscan = DBSCAN(eps=1.5, min_samples=10)\n", "dbscan.fit(df_transformed)\n", "Counter(dbscan.labels_)" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 't-SNE feature 1')" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(6, 6))\n", "plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=30, c=dbscan.labels_ , cmap=plt.cm.Spectral)\n", "plt.xlabel(\"t-SNE feature 0\")\n", "plt.ylabel(\"t-SNE feature 1\")" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "new_labels=[ 'red' if label==-1 else 'blue' for label in dbscan.labels_ ]\n" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 't-SNE feature 1')" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(6, 6))\n", "plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=30, c=new_labels)\n", "plt.xlabel(\"t-SNE feature 0\")\n", "plt.ylabel(\"t-SNE feature 1\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 2\n", "\n", "**13 points**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are hired by the Sales department of a large retail store to predict future sales volume based on past sales data. They provided you with a dataset stored in a file: `Question_2.csv` with the following information:\n", "- `past_sales`: Historical sales volume data for previous periods.\n", "- `price`: The price of the product during the prediction period.\n", "- `season`: The seasons to capture seasonal trends.\n", "- `product_category`: The category of the product.\n", "- `store_location_type`: The type of the location of the store.\n", "- `advertising_spend`: Amount of money spent on advertising for the product.\n", "- `discount_rate`: The percentage discount applied to the product.\n", "- `competitor_sales_volume`: Sales volumes of similar products sold by competitors.\n", "- `promotion_type`: The type of promotion applied.\n", "- `store_traffic_volume`: The foot traffic levels in the store.\n", "- `sales_volume`: The volume of sales that we wish to predict in the future.\n", "\n", "The goal is to develop two approaches: \n", "- A. One that uses an ensemble method that predicts the exact volume of sales. \n", "- B. 
One that uses an ensemble method to predict the level of sales volume, where the level of sales volume should be: \n", " - 0: if below 4500 (low level), \n", " - 1: if equal or larger than 4500 and less than 6500 (medium level)\n", " - 2: if equal or larger than 6500 (high level)\n", "\n", "At least one parameter of each model should be tuned, and the metric of optimization should be justified.\n", "State the final performance of both models.\n", "\n", "Regarding model B, additionally, the sales team would like to know what percentage of all sales of high level the model wrongly identifies as medium level, out of all the high level ones. Show how you have arrived at the answer and/or display a visual that supports your answer." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To begin with, the data should be inspected." ] }, { "cell_type": "code", "execution_count": 108, "metadata": { "tags": [] }, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " past_sales price season product_category store_location_type \\\n", "0 3712.019206 61.946333 Summer shoes suburban \n", "1 4666.499402 47.336866 Spring electronics downtown \n", "2 6085.299988 54.793826 Spring cosmetics downtown \n", "3 6102.391928 94.596118 Spring clothing downtown \n", "4 3014.237193 65.518271 Summer cosmetics suburban \n", "\n", " advertising_spend discount_rate competitor_sales_volume promotion_type \\\n", "0 682.836888 28.070782 4198.959000 percentage_off \n", "1 572.218433 17.405044 6576.442539 percentage_off \n", "2 466.107517 17.719031 4641.982873 percentage_off \n", "3 388.514092 22.095615 9055.830815 loyalty_points \n", "4 559.756395 15.494180 5762.883494 buy_one_get_one \n", "\n", " store_traffic_volume sales_volume \n", "0 Low 4964.299973 \n", "1 Medium 4006.739512 \n", "2 Low 4295.756331 \n", "3 Low 6978.062512 \n", "4 Low 3499.905963 " ] }, "execution_count": 108, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df=pd.read_csv('Question_2.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(3000, 11)" ] }, "execution_count": 29, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "past_sales float64\n", "price float64\n", "season object\n", "product_category object\n", "store_location_type object\n", "advertising_spend float64\n", "discount_rate float64\n", "competitor_sales_volume float64\n", "promotion_type object\n", "store_traffic_volume object\n", "sales_volume float64\n", "dtype: object" ] }, "execution_count": 30, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Given the data types and the values of each variable in the first 5 rows, it may be inferred that all variables are numerical." 
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "past_sales 15\n", "price 28\n", "season 7\n", "product_category 4\n", "store_location_type 0\n", "advertising_spend 0\n", "discount_rate 11\n", "competitor_sales_volume 0\n", "promotion_type 0\n", "store_traffic_volume 16\n", "sales_volume 0\n", "dtype: int64" ] }, "execution_count": 31, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Missing values should be addressed." ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [past_sales, price, season, product_category, store_location_type, advertising_spend, discount_rate, competitor_sales_volume, promotion_type, store_traffic_volume, sales_volume]\n", "Index: []" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.duplicated()]" ] }, { "cell_type": "code", "execution_count": 109, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2400, 10)\n", "(600, 10)\n" ] } ], "source": [ "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='sales_volume'), df['sales_volume'], random_state=0, test_size=0.2)\n", "\n", "print(X_train.shape)\n", "\n", "print(X_test.shape)" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['past_sales', 'price', 'advertising_spend', 'discount_rate',\n", " 'competitor_sales_volume'],\n", " dtype='object')" ] }, "execution_count": 110, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numerical=X_train.select_dtypes(include=np.number).columns\n", "numerical" ] }, { "cell_type": "code", "execution_count": 111, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['season', 'product_category', 'store_location_type', 'promotion_type',\n", " 'store_traffic_volume'],\n", " dtype='object')" ] }, "execution_count": 111, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categorical=X_train.select_dtypes(exclude=np.number).columns\n", "categorical" ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "season\n", "Autumn 896\n", "Summer 750\n", "Spring 747\n", "Winter 600\n", "Name: count, dtype: int64\n", "\n", "product_category\n", "groceries 750\n", "clothing 747\n", "electronics 450\n", "accessories 450\n", "shoes 300\n", 
"cosmetics 299\n", "Name: count, dtype: int64\n", "\n", "store_location_type\n", "suburban 1350\n", "downtown 1050\n", "rural 600\n", "Name: count, dtype: int64\n", "\n", "promotion_type\n", "percentage_off 1350\n", "loyalty_points 900\n", "buy_one_get_one 750\n", "Name: count, dtype: int64\n", "\n", "store_traffic_volume\n", "Low 1791\n", "High 449\n", "Medium 446\n", "VeryLow 150\n", "VeryHigh 148\n", "Name: count, dtype: int64\n", "\n" ] } ], "source": [ "for col in categorical:\n", " print(df[col].value_counts())\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Among the categorical variables, `season` and `store_traffic_volume` are ordinal." ] }, { "cell_type": "code", "execution_count": 97, "metadata": {}, "outputs": [], "source": [ "ohe_cols=['product_category', 'store_location_type', 'promotion_type']\n", "ord_cols=['season', 'store_traffic_volume']\n", "categories=[['Spring', 'Summer', 'Autumn', 'Winter'],\n", " ['VeryLow', 'Low', 'Medium', 'High', 'VeryHigh']]" ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [], "source": [ "from sklearn.ensemble import RandomForestRegressor\n", "from sklearn.model_selection import RandomizedSearchCV, GridSearchCV \n", "from xgboost import XGBClassifier" ] }, { "cell_type": "code", "execution_count": 99, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pipeline(steps=[('encoder',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('numerical',\n", " Pipeline(steps=[('imputation_median',\n", " SimpleImputer())]),\n", " Index(['past_sales', 'price', 'advertising_spend', 'discount_rate',\n", " 'competitor_sales_volume'],\n", " dtype='object')),\n", " ('ohe',\n", " Pipeline(steps=[('imputation_mode',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('onehot',\n", " OneHotEncoder(sparse_output=False))]),\n", " ['product_category',\n", " 'store_location_type',\n", " 'promotion_type']),\n", " ('ordinal',\n", " Pipeline(steps=[('imputation_mode',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('ord',\n", " OrdinalEncoder(categories=[['Spring',\n", " 'Summer',\n", " 'Autumn',\n", " 'Winter'],\n", " ['VeryLow',\n", " 'Low',\n", " 'Medium',\n", " 'High',\n", " 'VeryHigh']]))]),\n", " ['season',\n", " 'store_traffic_volume'])])),\n", " ('regressor', RandomForestRegressor())])" ] }, "execution_count": 99, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numeric_preprocessor = Pipeline([ \n", " (\"imputation_median\", SimpleImputer(strategy=\"mean\"))\n", "])\n", "\n", "ohe_preprocessor = Pipeline([\n", " (\"imputation_mode\", SimpleImputer( strategy=\"most_frequent\")),\n", " (\"onehot\", OneHotEncoder(sparse_output=False ))\n", " ])\n", "\n", "ord_preprocessor = Pipeline([\n", " (\"imputation_mode\", SimpleImputer( strategy=\"most_frequent\")),\n", " (\"ord\", OrdinalEncoder(categories=categories ))\n", " ])\n", "\n", "\n", "preprocessor = ColumnTransformer([\n", " (\"numerical\", numeric_preprocessor, numerical),\n", " (\"ohe\", ohe_preprocessor, ohe_cols),\n", " (\"ordinal\", ord_preprocessor, ord_cols)\n", " ], remainder='passthrough') #this step is necessary, otherwise the numerical variables are discarded.\n", "\n", "pipe = Pipeline([\n", " ('encoder', preprocessor),\n", " ('regressor', RandomForestRegressor())\n", " ])\n", "pipe" ] }, { "cell_type": 
"code", "execution_count": 100, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'regressor__max_depth': 9}" ] }, "execution_count": 100, "metadata": {}, "output_type": "execute_result" } ], "source": [ "param_grid_rf = [{'regressor__max_depth': range(7, 10) } ]\n", " \n", " \n", "search_rf = GridSearchCV( pipe, param_grid = param_grid_rf, cv=5, scoring='neg_root_mean_squared_error', n_jobs=-1) \n", "search_rf.fit(X_train, y_train)\n", "\n", "model_rf=search_rf.best_estimator_\n", "search_rf.best_params_" ] }, { "cell_type": "code", "execution_count": 101, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-645.8897315794815" ] }, "execution_count": 101, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_rf.best_score_" ] }, { "cell_type": "code", "execution_count": 102, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "-641.9652110146275" ] }, "execution_count": 102, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_rf.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "641.9652110146275" ] }, "execution_count": 103, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.metrics import root_mean_squared_error\n", "y_pred=model_rf.predict(X_test)\n", "RMSE = root_mean_squared_error(y_test, y_pred)\n", "RMSE" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " past_sales price season product_category store_location_type \\\n", "0 3712.019206 61.946333 Summer shoes suburban \n", "1 4666.499402 47.336866 Spring electronics downtown \n", "2 6085.299988 54.793826 Spring cosmetics downtown \n", "3 6102.391928 94.596118 Spring clothing downtown \n", "4 3014.237193 65.518271 Summer cosmetics suburban \n", "\n", " advertising_spend discount_rate competitor_sales_volume promotion_type \\\n", "0 682.836888 28.070782 4198.959000 percentage_off \n", "1 572.218433 17.405044 6576.442539 percentage_off \n", "2 466.107517 17.719031 4641.982873 percentage_off \n", "3 388.514092 22.095615 9055.830815 loyalty_points \n", "4 559.756395 15.494180 5762.883494 buy_one_get_one \n", "\n", " store_traffic_volume sales_volume \n", "0 Low 4964.299973 \n", "1 Medium 4006.739512 \n", "2 Low 4295.756331 \n", "3 Low 6978.062512 \n", "4 Low 3499.905963 " ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df=pd.read_csv('Question_2.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "sales_level\n", "1 1500\n", "0 981\n", "2 519\n", "Name: count, dtype: int64" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df['sales_level']=1\n", "mask=df.sales_volume<4500\n", "df.loc[mask, 'sales_level']=0\n", "\n", "mask=df.sales_volume>6500\n", "df.loc[mask, 'sales_level']=2\n", "df.sales_level.value_counts()" ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [], "source": [ "df=df.drop(columns='sales_volume')" ] }, { "cell_type": "code", "execution_count": 60, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(2400, 10)\n", "(600, 10)\n" ] } ], "source": [ "X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='sales_level'), df['sales_level'], stratify= df['sales_level'], random_state=0, 
test_size=0.2)\n", "\n", "print(X_train.shape)\n", "\n", "print(X_test.shape)" ] }, { "cell_type": "code", "execution_count": 61, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Pipeline(steps=[('encoder',\n", " ColumnTransformer(remainder='passthrough',\n", " transformers=[('numerical',\n", " Pipeline(steps=[('imputation_median',\n", " SimpleImputer())]),\n", " Index(['past_sales', 'price', 'advertising_spend', 'discount_rate',\n", " 'competitor_sales_volume'],\n", " dtype='object')),\n", " ('ohe',\n", " Pipeline(steps=[('imputation_mode',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('oneho...\n", " feature_types=None, gamma=None, grow_policy=None,\n", " importance_type=None,\n", " interaction_constraints=None, learning_rate=None,\n", " max_bin=None, max_cat_threshold=None,\n", " max_cat_to_onehot=None, max_delta_step=None,\n", " max_depth=None, max_leaves=None,\n", " min_child_weight=None, missing=nan,\n", " monotone_constraints=None, multi_strategy=None,\n", " n_estimators=None, n_jobs=None,\n", " num_parallel_tree=None, random_state=None, ...))])" ] }, "execution_count": 61, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe = Pipeline([\n", " ('encoder', preprocessor),\n", " ('classifier', XGBClassifier())\n", " ])\n", "pipe" ] }, { "cell_type": "code", "execution_count": 62, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'classifier__n_estimators': 300}" ] }, "execution_count": 62, "metadata": {}, "output_type": "execute_result" } ], "source": [ "\n", "\n", "param_grid_rf = [{'classifier__n_estimators': range(100, 400, 100) } ]\n", " \n", " \n", "search_xg = GridSearchCV( pipe, param_grid = param_grid_rf, cv=5, scoring='f1_macro', n_jobs=-1) \n", "search_xg.fit(X_train, y_train)\n", "\n", "model_xg=search_xg.best_estimator_\n", "search_xg.best_params_" ] }, { "cell_type": "code", "execution_count": 63, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.8068939277512788" ] }, "execution_count": 63, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_xg.best_score_" ] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": 
[ { "data": { "text/plain": [ "0.8198694997691057" ] }, "execution_count": 64, "metadata": {}, "output_type": "execute_result" } ], "source": [ "search_xg.score(X_test, y_test)" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [], "source": [ "y_pred=model_xg.predict(X_test)" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n", "cm=confusion_matrix(y_test, y_pred)\n", "\n", "ConfusionMatrixDisplay(cm).plot();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " - 0: if below 4500 (low level), \n", " - 1: if equal or larger than 4500 and less than 6500 (medium level)\n", " - 2: if equal or larger than 6500 (high level)\n", "\n", "Regarding model B, the sales team would like to know what percentage of all high sales the model wrongly identifies as medium sales, out of all the high sales." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "- high sales wrongly identified as medium sales: predicted as 1, true 2 : 27\n", "- total high sales: 27+77" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "25.961538461538463" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "27/(27+77)*100" ] }, { "cell_type": "markdown", "metadata": {}, "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 3\n", "\n", "**7 points**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You are hired by the Customer support department to develop a model that sorts customer reviews into those about Clothing and those about Electronics. The reviews stored in a file: `Question_3.csv` with the following information:\n", "\n", "- `Review`\n", "- `Category`\n", "\n", "\n", "The goal is to have develop an approach that distinguishes between these two categories of reviews and evaluate the performance of your approach. 
Check whether it is more useful to use only single words or two consecutive words as features, and use only those features that appear in at least two reviews.\n", "Additionally, the team has asked you to:\n", "- State the percentage of correctly identified clothing reviews (out of all clothing reviews), and the percentage of correctly identified electronics reviews (out of all electronics reviews).\n", "- Show the top 5 tokens associated with each of the categories.\n", "- State the number of features used by your model.\n", "- Write a one-sentence review about clothing and show the probability that your model assigns to that review for each of the categories.\n", " \n", "You need to justify your decisions." ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " Review Category\n", "0 Perfect for my needs. Electronics\n", "1 Very comfortable and stylish. Clothing\n", "2 Not good. Clothing\n", "3 Exceeded my expectations. Electronics\n", "4 Amazing durability. Electronics" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df=pd.read_csv('Question_3.csv')\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(323, 2)" ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "code", "execution_count": 70, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Category\n", "Clothing 173\n", "Electronics 150\n", "Name: count, dtype: int64" ] }, "execution_count": 70, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.Category.value_counts()" ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Category\n", "0 173\n", "1 150\n", "Name: count, dtype: int64" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "mapping={'Clothing':0, 'Electronics':1}\n", "df['Category']=df['Category'].replace(mapping)\n", "df.Category.value_counts()" ] }, { "cell_type": "code", "execution_count": 72, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Review 0\n", "Category 0\n", "dtype: int64" ] }, "execution_count": 72, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Check for duplicates." ] }, { "cell_type": "code", "execution_count": 73, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ "Empty DataFrame\n", "Columns: [Review, Category]\n", "Index: []" ] }, "execution_count": 73, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df[df.duplicated()]" ] }, { "cell_type": "code", "execution_count": 74, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(df['Review'], df['Category'], \n", " stratify= df['Category'], test_size=0.2, random_state=0)" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((258,), (65,))" ] }, "execution_count": 75, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.shape, X_test.shape" ] }, { "cell_type": "code", "execution_count": 76, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Category\n", "0 138\n", "1 120\n", "Name: count, dtype: int64" ] }, "execution_count": 76, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train.value_counts()" ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best cross-validation score: 0.616289592760181\n" ] } ], "source": [ "from sklearn.feature_extraction.text import CountVectorizer\n", "from sklearn.linear_model import LogisticRegression\n", " \n", "\n", "pipe = Pipeline ([ ('tokenizer', CountVectorizer(min_df=2) ),\n", " ('classifier', LogisticRegression(solver='liblinear'))\n", " ] )\n", "param_grid = {\n", " 'tokenizer__ngram_range': [(1, 1), (1, 2)]}\n", "\n", "grid = GridSearchCV(pipe, param_grid, cv=5, n_jobs=-1)\n", "grid.fit(X_train, y_train)\n", "print('Best cross-validation score: ', grid.best_score_)" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "115" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "vectorizer = grid.best_estimator_.named_steps['tokenizer']\n", "len(vectorizer.vocabulary_)" ] }, { 
"cell_type": "code", "execution_count": 79, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " mean_fit_time std_fit_time mean_score_time std_score_time \\\n", "0 0.009803 0.000757 0.002813 0.000512 \n", "1 0.009509 0.000450 0.002400 0.000491 \n", "\n", " param_tokenizer__ngram_range params \\\n", "0 (1, 1) {'tokenizer__ngram_range': (1, 1)} \n", "1 (1, 2) {'tokenizer__ngram_range': (1, 2)} \n", "\n", " split0_test_score split1_test_score split2_test_score split3_test_score \\\n", "0 0.576923 0.653846 0.615385 0.647059 \n", "1 0.538462 0.673077 0.615385 0.607843 \n", "\n", " split4_test_score mean_test_score std_test_score rank_test_score \n", "0 0.588235 0.616290 0.030643 1 \n", "1 0.568627 0.600679 0.045623 2 " ] }, "execution_count": 79, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.DataFrame(grid.cv_results_ ).sort_values(by='mean_test_score', ascending=False)" ] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "y_pred=grid.predict(X_test)\n", "cm=confusion_matrix(y_test, y_pred)\n", "\n", "ConfusionMatrixDisplay(cm).plot();" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "{'Clothing':Negative, 'Electronics':Positive}" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "feature_names=np.array(vectorizer.get_feature_names_out())\n", "coef = grid.best_estimator_.named_steps['classifier'].coef_.ravel()\n", "idx_positive_coefficients = np.argsort(coef)[-5:]\n", "idx_negative_coefficients = np.argsort(coef)[:5]\n", "idx_interesting_coefficients = np.hstack([idx_negative_coefficients, idx_positive_coefficients])\n", "\n", "fig = plt.figure(figsize=(10,5))\n", "plt.bar(feature_names[idx_interesting_coefficients], coef[idx_interesting_coefficients])\n", "plt.xticks(rotation=90, ha=\"right\") ; " ] }, { "cell_type": "code", "execution_count": 82, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " 0 0.69 0.57 0.62 35\n", " 1 0.58 0.70 0.64 30\n", "\n", " accuracy 0.63 65\n", " macro avg 0.64 0.64 0.63 65\n", "weighted avg 0.64 0.63 0.63 65\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "\n", "\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[0.87360927, 0.12639073]])" ] }, "execution_count": 85, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid.predict_proba( ['Cheap fabric'])" ] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [], "source": [ "X_transformed = grid.best_estimator_.named_steps['tokenizer'].transform(X_train)\n", "from sklearn.manifold import TSNE\n", "tsne = TSNE(random_state=42, perplexity = 30, init=\"random\")\n", "X_tsne = tsne.fit_transform(X_transformed)" ] }, { "cell_type": "code", "execution_count": 92, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0, 0.5, 't-SNE feature 1')" ] }, "execution_count": 92, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "plt.figure(figsize=(6, 6))\n", "plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=30, c=y_train)\n", "plt.xlabel(\"t-SNE feature 0\")\n", "plt.ylabel(\"t-SNE feature 1\")" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" }, "toc-showcode": false, "toc-showmarkdowntxt": false }, "nbformat": 4, "nbformat_minor": 4 }