{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Project submission\n", "**Due Friday May 16th before class.** Counts for 25% of the final course grade.\n", "\n", "You should address all the questions relevant to your project.\n", "You will not be graded on the values of the model performance, but on whether or not you have applied the right methodology: formulated the business model, translated it into the right machine learning approach, analyzed your data, prepared it for modeling, applied at least 5 different machine learning algorithms, as well as neural networks, used cross validation for model tuning, justified your tuning metric, set up the proper machine learning pipeline without data leakage, evaluated your model using all the relevant metrics, interpreted your model and justified all your decisions.\n", "\n", "If you have tried different approaches, please include them all, and not just the best one.\n", "If some feature engineering has improved your model, please also include all of the steps, not just the most successful ones.\n", "\n", "You should submit the notebook with the code, output and explanations. 
The notebook should be executable and comprehensible.\n", "\n", "Points will be deducted for the following reasons:\n", "- data leakage\n", "- unjustified decisions (no discussion of: choice of metric for optimization, blind removal of features, blind removal of outliers, ...)\n", "- notebook not comprehensible\n", "- notebook with incomplete output\n", "- notebook not executable\n", "- blind copy-pasting from ChatGPT, if the copied code is not suitable for the task\n", "- writing your own code (or copy-pasting it from an outside source) for simple functions that we covered and that already exist in `sklearn` (train-test split, plain grid search, encoding of categorical variables, ...), as this leads to:\n", "    - convoluted code prone to bugs\n", "    - code that is hard to understand and review\n", "    - a waste of a data scientist's time when ready-made simple functions exist\n", "\n", "Additional points will be awarded for trying and testing different relevant approaches, from exploratory data analysis, to feature engineering, to modeling and evaluation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There should be one submission per group, but team member evaluations can be submitted per person. If none is submitted, the default is that all team members have contributed equally to the project and should get the same grade." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Group number:\n", "### Student IDs:\n", "### Project name:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What business problem are you solving?\n", "- Please state clearly what business problem you are solving. (one sentence)\n", "- Explain why this is a relevant problem and what you can do with the model output to create business value, i.e., how the model output is actionable. 
(2-3 paragraphs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What is the machine learning problem that you are solving?\n", "- Please state clearly what the ML problem is. \n", "- If applicable, state your target." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data exploration and preparation \n", "\n", "- How many data instances do you have?\n", "- Do you have duplicates?\n", "- How many features do you have? What types are they?\n", "- If they are categorical, which categories do they have, and how frequent is each?\n", "- If they are numerical, what is their distribution?\n", "- Do you have outliers, and do you need to do anything about them?\n", "- What is the distribution of the target variable?\n", "- If you have a target, you can also check the relationship between the target and the variables.\n", "- Do you have missing data? If yes, how are you going to handle it?\n", "- Can you use the features in their original form, or do you need to alter them in some way?\n", "- What have you learned about your data? Is there anything that can help you in feature engineering or modeling?\n" ] }, { "cell_type": "markdown", "metadata": { "tags": [] }, "source": [ "## Feature engineering\n", "Creating good features is probably the most important step in the machine learning process. \n", "This might involve:\n", "- transformations\n", "- aggregating over data points or over time and space, or finding differences (for example: differences between two monthly bills, time difference between two contacts with the client) \n", "- creating dummy (binary) variables\n", "- discretization\n", "\n", "Business insight is very relevant in this process. If possible, you can also find additional relevant data." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modeling\n", "You should implement AT LEAST FIVE approaches we covered, and tune at least two hyperparameters of each approach.\n", "Do not forget that you should split your data.\n", "You should do model selection and tuning using cross validation on the train set, avoiding data leakage.\n", "Explain and justify the metric you are using for model selection and tuning. If your data is imbalanced, consider using techniques for data balancing.\n", "\n", "Separately, you should train a neural network. Visualize the training and validation loss. Discuss the network performance.\n", "\n", "In model selection, make sure that you compare different models and approaches on the same dataset, although different transformations may be applied to it." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model evaluation\n", "\n", "After selecting your final model, which could be a compromise between performance, interpretability and complexity, you should evaluate its performance on the test set. \n", "You might have tuned your model using a certain metric, but now you should describe the model performance using all relevant metrics. \n", "If you have business insight into why a certain metric is relevant, you should explain it. \n", "Construct a suitable baseline to benchmark your results and put them in context.\n", "Discuss your results: do they seem good enough to be used in practice? If not, what should be improved? Discuss what types of errors your model is making.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Model interpretation\n", "\n", "Use at least two different techniques for model interpretability. Discuss which features are the most important in your model, and how they impact the model performance. Pick a few examples of errors that your model is making, and check which features lead to these errors." 
] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.7" } }, "nbformat": 4, "nbformat_minor": 4 }