{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Introduction to Machine Learning 2695 Coding Exam" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This part of the exam carries 50% of the exam grade, and the theoretical part carries the other 50%. You have 2 hours in total to complete both parts. For the theoretical part, you will complete the Moodle quiz; do not forget to submit it once you are done (multiple submissions are possible; the last one counts).\n", "\n", "You should submit this notebook through the Coding Exam submission on Moodle.\n", "\n", "The notebook should contain the code, output, and explanations. The notebook should be re-runnable.\n", "\n", "Total points: **30**\n", "\n", "- Question 1: 10 points\n", "- Question 2: 13 points\n", "- Question 3: 7 points" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Question 1\n", "\n", "**10 points**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have been hired by the Inventory department of a large retail store to help with product inventory management. They have provided you with a product dataset stored in the file `Question_1.csv`, with the following information:\n", "\n", "- `product_id`: Unique identifier for each product in the inventory.\n", "- `sales_volume`: Total quantity of items sold over a specific period.\n", "- `location`: Warehouse or store location where the product is stored.\n", "- `category`: Type of product (e.g., electronics, clothing, groceries).\n", "- `demand_level`: Level of product demand.\n", "- `inventory_count`: Current inventory count for each item.\n", "- `reorder_frequency`: Number of times an item was reordered in the last 3 months.\n", "- `lead_time`: Average time taken from ordering to receiving the item, in days.\n", "- `stock_priority`: Priority level for restocking.\n", "- `supplier`: Name or code of the supplier providing the product.\n", "\n", "1. 
Your task is to apply the most appropriate preprocessing steps and suggest the most appropriate number of product groups. Justify all your decisions.\n", "2. Assuming that a data point (product) is considered a core product if it has 10 neighbors (including itself) and the maximum distance between two data points in a neighborhood is 1.5, how many products would not belong to any group? How many groups would we end up with?" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|   | product_id | sales_volume | location | category | demand_level | inventory_count | reorder_frequency | lead_time | stock_priority | supplier |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 255.0 | Rivertown | shoes | medium | 5736.0 | 2.0 | 28.0 | not_critical | Nimbus |
| 1 | 1 | 371.0 | Rivertown | shoes | low | 4951.0 | 5.0 | 86.0 | not_critical | Nimbus |
| 2 | 2 | 482.0 | Rivertown | shoes | low | 4651.0 | 4.0 | 22.0 | not_critical | Vertex |
| 3 | 3 | 494.0 | Rivertown | electronics | medium | 3406.0 | 3.0 | 163.0 | lowest | Pinnacle |
| 4 | 4 | 298.0 | Rivertown | accessories | medium | 4257.0 | 3.0 | 14.0 | not_critical | Vertex |
Pipeline(steps=[
    ('encoder', ColumnTransformer(
        remainder='passthrough',
        transformers=[
            ('numerical',
             Pipeline(steps=[('imputation_median', SimpleImputer())]),
             Index(['sales_volume', 'inventory_count', 'reorder_frequency', 'lead_time'], dtype='object')),
            ('ohe',
             Pipeline(steps=[('imputation_mode', SimpleImputer(strategy='most_frequent')),
                             ('onehot', OneHotEncoder(sparse_output=False))]),
             ['location', 'category', 'supplier']),
            ('ordinal',
             Pipeline(steps=[('ord', OrdinalEncoder(categories=[
                 ['lowest', 'not_critical', 'low_critical', 'high_critical'],
                 ['low', 'medium', 'high']]))]),
             ['stock_priority', 'demand_level'])])),
    ('scaler', StandardScaler())])
|   | sales_volume | location | category | demand_level | inventory_count | reorder_frequency | lead_time | stock_priority | supplier |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 255.0 | Rivertown | shoes | medium | 5736.0 | 2.0 | 28.0 | not_critical | Nimbus |
| 1 | 371.0 | Rivertown | shoes | low | 4951.0 | 5.0 | 86.0 | not_critical | Nimbus |
| 2 | 482.0 | Rivertown | shoes | low | 4651.0 | 4.0 | 22.0 | not_critical | Vertex |
| 3 | 494.0 | Rivertown | electronics | medium | 3406.0 | 3.0 | 163.0 | lowest | Pinnacle |
| 4 | 298.0 | Rivertown | accessories | medium | 4257.0 | 3.0 | 14.0 | not_critical | Vertex |
KMeans(n_clusters=5, random_state=42)
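How the value `n_clusters=5` was chosen is not visible in this output. One common way to justify a number of groups is to compare silhouette scores across candidate values of k. The second part of Question 1 maps onto DBSCAN with `min_samples=10` (scikit-learn counts the point itself) and `eps=1.5`. The sketch below illustrates both on synthetic blobs that stand in for the preprocessed product matrix; the data, the candidate range for k, and all variable names are assumptions, so the printed numbers say nothing about `Question_1.csv`.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption) for the scaled, encoded product features.
X, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
X = StandardScaler().fit_transform(X)

# Part 1: score each candidate k by its mean silhouette (higher is better).
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
print("best k by silhouette:", best_k)

# Part 2: DBSCAN with the parameters stated in the question.
# A point is a core point if at least min_samples points (itself included)
# lie within distance eps of it.
db = DBSCAN(eps=1.5, min_samples=10).fit(X)
n_noise = int(np.sum(db.labels_ == -1))      # products that belong to no group
n_groups = len(set(db.labels_) - {-1})       # groups, excluding the noise label
print("noise points:", n_noise, "groups:", n_groups)
```

In the notebook, `X` would be the output of the preprocessing pipeline rather than synthetic blobs.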
|   | past_sales | price | season | product_category | store_location_type | advertising_spend | discount_rate | competitor_sales_volume | promotion_type | store_traffic_volume | sales_volume |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3712.019206 | 61.946333 | Summer | shoes | suburban | 682.836888 | 28.070782 | 4198.959000 | percentage_off | Low | 4964.299973 |
| 1 | 4666.499402 | 47.336866 | Spring | electronics | downtown | 572.218433 | 17.405044 | 6576.442539 | percentage_off | Medium | 4006.739512 |
| 2 | 6085.299988 | 54.793826 | Spring | cosmetics | downtown | 466.107517 | 17.719031 | 4641.982873 | percentage_off | Low | 4295.756331 |
| 3 | 6102.391928 | 94.596118 | Spring | clothing | downtown | 388.514092 | 22.095615 | 9055.830815 | loyalty_points | Low | 6978.062512 |
| 4 | 3014.237193 | 65.518271 | Summer | cosmetics | suburban | 559.756395 | 15.494180 | 5762.883494 | buy_one_get_one | Low | 3499.905963 |
|   | past_sales | price | season | product_category | store_location_type | advertising_spend | discount_rate | competitor_sales_volume | promotion_type | store_traffic_volume | sales_volume |
|---|---|---|---|---|---|---|---|---|---|---|---|
Pipeline(steps=[
    ('encoder', ColumnTransformer(
        remainder='passthrough',
        transformers=[
            ('numerical',
             Pipeline(steps=[('imputation_median', SimpleImputer())]),
             Index(['past_sales', 'price', 'advertising_spend', 'discount_rate',
                    'competitor_sales_volume'],
                   dtype='object')),
            ('ohe',
             Pipeline(steps=[('imputation_mode', SimpleImputer(strategy='most_frequent')),
                             ('onehot', OneHotEncoder(sparse_output=False))]),
             ['product_category', 'store_location_type', 'promotion_type']),
            ('ordinal',
             Pipeline(steps=[('imputation_mode', SimpleImputer(strategy='most_frequent')),
                             ('ord', OrdinalEncoder(categories=[
                                 ['Spring', 'Summer', 'Autumn', 'Winter'],
                                 ['VeryLow', 'Low', 'Medium', 'High', 'VeryHigh']]))]),
             ['season', 'store_traffic_volume'])])),
    ('regressor', RandomForestRegressor())])
|   | past_sales | price | season | product_category | store_location_type | advertising_spend | discount_rate | competitor_sales_volume | promotion_type | store_traffic_volume | sales_volume |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3712.019206 | 61.946333 | Summer | shoes | suburban | 682.836888 | 28.070782 | 4198.959000 | percentage_off | Low | 4964.299973 |
| 1 | 4666.499402 | 47.336866 | Spring | electronics | downtown | 572.218433 | 17.405044 | 6576.442539 | percentage_off | Medium | 4006.739512 |
| 2 | 6085.299988 | 54.793826 | Spring | cosmetics | downtown | 466.107517 | 17.719031 | 4641.982873 | percentage_off | Low | 4295.756331 |
| 3 | 6102.391928 | 94.596118 | Spring | clothing | downtown | 388.514092 | 22.095615 | 9055.830815 | loyalty_points | Low | 6978.062512 |
| 4 | 3014.237193 | 65.518271 | Summer | cosmetics | suburban | 559.756395 | 15.494180 | 5762.883494 | buy_one_get_one | Low | 3499.905963 |
Pipeline(steps=[('encoder',
                 ColumnTransformer(remainder='passthrough',
                                   transformers=[('numerical',
                                                  Pipeline(steps=[('imputation_median',
                                                                   SimpleImputer())]),
                                                  Index(['past_sales', 'price', 'advertising_spend', 'discount_rate',
                                                         'competitor_sales_volume'],
                                                        dtype='object')),
                                                 ('ohe',
                                                  Pipeline(steps=[('imputation_mode',
                                                                   SimpleImputer(strategy='most_frequent')),
                                                                  ('oneho...
                               feature_types=None, gamma=None, grow_policy=None,
                               importance_type=None,
                               interaction_constraints=None, learning_rate=None,
                               max_bin=None, max_cat_threshold=None,
                               max_cat_to_onehot=None, max_delta_step=None,
                               max_depth=None, max_leaves=None,
                               min_child_weight=None, missing=nan,
                               monotone_constraints=None, multi_strategy=None,
                               n_estimators=None, n_jobs=None,
                               num_parallel_tree=None, random_state=None, ...))])

ColumnTransformer(
    remainder='passthrough',
    transformers=[
        ('numerical',
         Pipeline(steps=[('imputation_median', SimpleImputer())]),
         Index(['past_sales', 'price', 'advertising_spend', 'discount_rate',
                'competitor_sales_volume'],
               dtype='object')),
        ('ohe',
         Pipeline(steps=[('imputation_mode', SimpleImputer(strategy='most_frequent')),
                         ('onehot', OneHotEncoder(sparse_output=False))]),
         ['product_category', 'store_location_type', 'promotion_type']),
        ('ordinal',
         Pipeline(steps=[('imputation_mode', SimpleImputer(strategy='most_frequent')),
                         ('ord', OrdinalEncoder(categories=[
                             ['Spring', 'Summer', 'Autumn', 'Winter'],
                             ['VeryLow', 'Low', 'Medium', 'High', 'VeryHigh']]))]),
         ['season', 'store_traffic_volume'])])

XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, missing=nan, monotone_constraints=None,
              multi_strategy=None, n_estimators=None, n_jobs=None,
              num_parallel_tree=None, random_state=None, ...)
|   | Review | Category |
|---|---|---|
| 0 | Perfect for my needs. | Electronics |
| 1 | Very comfortable and stylish. | Clothing |
| 2 | Not good. | Clothing |
| 3 | Exceeded my expectations. | Electronics |
| 4 | Amazing durability. | Electronics |
|   | Review | Category |
|---|---|---|
|   | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_tokenizer__ngram_range | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.009803 | 0.000757 | 0.002813 | 0.000512 | (1, 1) | {'tokenizer__ngram_range': (1, 1)} | 0.576923 | 0.653846 | 0.615385 | 0.647059 | 0.588235 | 0.616290 | 0.030643 | 1 |
| 1 | 0.009509 | 0.000450 | 0.002400 | 0.000491 | (1, 2) | {'tokenizer__ngram_range': (1, 2)} | 0.538462 | 0.673077 | 0.615385 | 0.607843 | 0.568627 | 0.600679 | 0.045623 | 2 |
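A results table like the one above is what `GridSearchCV.cv_results_` looks like for a pipeline whose step named `tokenizer` is tuned over `ngram_range` with 5-fold cross-validation. The sketch below reproduces that setup under several assumptions: the vectorizer is taken to be `CountVectorizer`, the classifier (step name `clf`, `MultinomialNB`) is hypothetical, accuracy is assumed as the metric, and the review texts are toy examples that only partly overlap with the rows shown earlier. Its scores are therefore not comparable with the table.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy reviews (assumption): the first three per category appear in the
# displayed data, the rest are invented to give each class enough samples.
electronics = ["Perfect for my needs.", "Exceeded my expectations.",
               "Amazing durability.", "Battery lasts long.", "Screen is sharp."]
clothing = ["Very comfortable and stylish.", "Not good.",
            "Fabric feels soft.", "Fits well.", "Colour faded quickly."]
texts = (electronics + clothing) * 2
labels = (["Electronics"] * 5 + ["Clothing"] * 5) * 2

pipe = Pipeline([
    ("tokenizer", CountVectorizer()),   # step name taken from the param column
    ("clf", MultinomialNB()),           # hypothetical classifier choice
])

# Compare unigrams only against unigrams plus bigrams.
param_grid = {"tokenizer__ngram_range": [(1, 1), (1, 2)]}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(texts, labels)

results = pd.DataFrame(search.cv_results_)[
    ["param_tokenizer__ngram_range", "mean_test_score",
     "std_test_score", "rank_test_score"]
]
print(results)
print("best:", search.best_params_)
```

When the mean scores of two settings differ by less than their fold-to-fold standard deviation, as in the table above, the simpler setting is usually a defensible choice.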