{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Contents:\n", "- [Data Preprocessing](#Data-Preprocessing)\n", " - [Data scaling](#Data-scaling)\n", " - [Categorical variables](#Categorical-variables)\n", " - [One-hot encoding](#One-hot-encoding)\n", " - [Method 1](#Method-1)\n", " - [Method 2](#Method-2)\n", " - [Ordinal encoding](#Ordinal-encoding)\n", " - [Target Encoding](#Target-Encoding)\n", " - [Multiple column transformations](#Multiple-column-transformations)\n", "- [Cross validation and Hyperparameter tuning](#Cross-validation-and-Hyperparameter-tuning)\n", " - [Grid search](#Grid-search)\n", " - [Randomized search](#Randomized-search)\n", " - [Successive Halving search](#Successive-Halving-search)\n", " - [Other libraries for hyperparameter tuning](#Other-libraries-for-hyperparameter-tuning)\n", " - [Practice question](#Practice-question)\n", "- [Note](#Note)\n", "- [Algorithm chains and pipelines](#Algorithm-chains-and-pipelines)\n", " - [Building Pipelines](#Building-Pipelines)\n", " - [The General Pipeline Interface](#The-General-Pipeline-Interface)\n", " - [Pipeline and ColumnTransformer](#Pipeline-and-ColumnTransformer)\n", " - [Grid-Searching Which Model To Use](#Grid-Searching-Which-Model-To-Use)\n", "- [Splitting data in cross-validation](#Splitting-data-in-cross-validation)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Data Preprocessing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following exercise is adapted from Chapter 4 of the book *Introduction to Machine Learning with Python*, by Andreas C. Müller, Sarah Guido." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Regardless of the types of features our data consists of, how we represent them can have an enormous effect on the performance of machine learning models. 
The question of how best to represent our data for a particular application is known as feature engineering, and it is one of the main tasks of data scientists trying to solve real-world problems. Representing the data in the right way can have a bigger influence on the performance of a supervised model than the exact parameters we choose." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "from collections import Counter\n", "from sklearn.inspection import DecisionBoundaryDisplay" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data scaling\n", "Some algorithms are very sensitive to the scaling of the data. \n", "Therefore, a common practice is to adjust the features so that the data representation is more suitable for these algorithms.\n", "Let's visualize two different scaling techniques: normalization and standardization.\n", "First, let's generate a small synthetic dataset with sklearn's *make_blobs()*." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Text(0.5, 1.0, 'our dataset')" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.datasets import make_blobs\n", "X, y = make_blobs(n_samples=50, centers=2, random_state=1, cluster_std=3)\n", "plt.figure(figsize=(4,4))\n", "plt.scatter(X[:, 0], X[:, 1], s=30, c=y) \n", "plt.title(\"our dataset\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first feature (the x-axis value) is between -15 and 3. The second feature (the y-axis value) is between -10 and 10.\n", "Now, let's scale the data with two different scalers: *MinMaxScaler* (performs normalization) and *StandardScaler* (performs standardization)." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import MinMaxScaler, StandardScaler\n", "\n", "minmax_scaler = MinMaxScaler()\n", "standard_scaler = StandardScaler()\n", "X_minmax = minmax_scaler.fit_transform(X)\n", "X_standard = standard_scaler.fit_transform(X)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's see the difference:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig, ax = plt.subplots(1, 2, figsize=(13,4))\n", "ax[0].set_title(\"Standard scaler\")\n", "ax[0].scatter(X_standard[:, 0], X_standard[:, 1], s=30, c=y)\n", "ax[0].axis([-2, 2, -2.5, 2.5])\n", "\n", "ax[1].set_title(\"Minmax scaler\")\n", "ax[1].scatter(X_minmax[:, 0], X_minmax[:, 1], s=30, c=y)\n", "ax[1].axis([-2, 2, -2.5, 2.5]);\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *StandardScaler* in *sklearn* ensures that for each feature the mean is 0 and the variance is 1, bringing all features to the same magnitude. However, this scaling does not ensure any particular minimum and maximum values for the features. \n", "\n", "The *MinMaxScaler*, on the other hand, shifts and rescales the data such that each feature lies exactly between 0 and 1. For the two-dimensional dataset this means all of the data is contained within the rectangle created by the x-axis between 0 and 1 and the y-axis between 0 and 1." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Previously, we mentioned that the scaler should be fit only on the training data and then used to transform the test data. This way we avoid any data leakage.\n", "Now, let's also demonstrate why we should apply the same transformation to the training set and the test set for the supervised model." ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X, y = make_blobs(n_samples=50, centers=5, random_state=4, cluster_std=2)\n", "X_train, X_test = train_test_split(X, random_state=5, test_size=.1)\n", "\n", "fig, ax = plt.subplots(1, 3, figsize=(13, 4))\n", "ax[0].scatter(X_train[:, 0], X_train[:, 1], label=\"Training set\", s=30, c = 'blue')\n", "ax[0].scatter(X_test[:, 0], X_test[:, 1], marker='^', label=\"Test set\", s=30,c = 'red')\n", "ax[0].legend(loc='upper left')\n", "ax[0].set_title(\"Original Data\")\n", "\n", "# scale the data using MinMaxScaler\n", "scaler = MinMaxScaler()\n", "scaler.fit(X_train)\n", "X_train_scaled = scaler.transform(X_train)\n", "X_test_scaled = scaler.transform(X_test)\n", "\n", "# visualize the properly scaled data\n", "ax[1].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], label=\"Training set\", s=30, c = 'blue')\n", "ax[1].scatter(X_test_scaled[:, 0], X_test_scaled[:, 1], marker='^', label=\"Test set\", s=30,c = 'red')\n", "ax[1].legend(loc='upper left')\n", "ax[1].set_title(\"Scaled Data\")\n", "\n", "# rescale the test set separately\n", "# so test set min is 0 and test set max is 1\n", "# DO NOT DO THIS! 
For illustration purposes only.\n", "test_scaler = MinMaxScaler()\n", "test_scaler.fit(X_test)\n", "X_test_scaled_badly = test_scaler.transform(X_test)\n", "\n", "# visualize wrongly scaled data\n", "ax[2].scatter(X_train_scaled[:, 0], X_train_scaled[:, 1], label=\"Training set\", s=30, c='blue')\n", "ax[2].scatter(X_test_scaled_badly[:, 0], X_test_scaled_badly[:, 1], marker='^', label=\"Test set\", s=30, c='red')\n", "ax[2].legend(loc='upper left')\n", "ax[2].set_title(\"Improperly Scaled Data\")\n", "\n", "for axi in ax:\n", "    axi.set_xlabel(\"Feature 0\")\n", "    axi.set_ylabel(\"Feature 1\")\n", "fig.tight_layout()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first subplot is an unscaled two-dimensional dataset, with the training set shown as blue circles and the test set shown as red triangles. The second subplot is the same data, but scaled using the *MinMaxScaler*. Here, we called `fit` on the training set, and then called `transform` on the training and test sets. We can see that the dataset in the second subplot looks identical to the first; only the ticks on the axes have changed. Now all the training-set features lie between 0 and 1. We can also see that the minimum and maximum feature values for the test data (the triangles) are not 0 and 1, because the scaler was fit on the training set only.\n", "\n", "The third subplot shows what would happen if we scaled the training set and test set separately. In this case, the minimum and maximum feature values for both the training and the test set are 0 and 1. But now the dataset looks different. The test points moved relative to the training points, as they were scaled differently. We changed the arrangement of the data in an arbitrary way. Clearly this is not what we want to do. Hence, **we should use `fit` on the training set only, and then `transform` on the training and test sets**."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Categorical variables" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this part of the exercise, we will use the dataset of adult incomes in the United States, derived from the 1994 census database. The task is to predict whether a worker has an income of over 50,000 or of at most 50,000 (the *income* column takes the values `>50K` and `<=50K`).\n", "Place the downloaded CSV file *census_data.csv* in the same directory as your notebook and import it.\n", "\n" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n", "<thead>\n", "<tr><th></th><th>age</th><th>workclass</th><th>education</th><th>capital-gain-category</th><th>hours-per-week</th><th>occupation</th><th>income</th></tr>\n", "</thead>\n", "<tbody>\n", "<tr><th>0</th><td>39</td><td>State-gov</td><td>Bachelors</td><td>cat2</td><td>40</td><td>Adm-clerical</td><td>&lt;=50K</td></tr>\n", "<tr><th>1</th><td>50</td><td>Self-emp-not-inc</td><td>Bachelors</td><td>cat1</td><td>13</td><td>Exec-managerial</td><td>&lt;=50K</td></tr>\n", "<tr><th>2</th><td>38</td><td>Private</td><td>HS-grad</td><td>cat1</td><td>40</td><td>Handlers-cleaners</td><td>&lt;=50K</td></tr>\n", "<tr><th>3</th><td>53</td><td>Private</td><td>11th</td><td>cat1</td><td>40</td><td>Handlers-cleaners</td><td>&lt;=50K</td></tr>\n", "<tr><th>4</th><td>28</td><td>Private</td><td>Bachelors</td><td>cat1</td><td>40</td><td>Prof-specialty</td><td>&lt;=50K</td></tr>\n", "</tbody>\n", "</table>
" ], "text/plain": [ " age workclass education capital-gain-category hours-per-week \\\n", "0 39 State-gov Bachelors cat2 40 \n", "1 50 Self-emp-not-inc Bachelors cat1 13 \n", "2 38 Private HS-grad cat1 40 \n", "3 53 Private 11th cat1 40 \n", "4 28 Private Bachelors cat1 40 \n", "\n", " occupation income \n", "0 Adm-clerical <=50K \n", "1 Exec-managerial <=50K \n", "2 Handlers-cleaners <=50K \n", "3 Handlers-cleaners <=50K \n", "4 Prof-specialty <=50K " ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = pd.read_csv('census_data.csv')\n", "# For illustration purposes, we will only select some of the columns\n", "data = data[['age', 'workclass', 'education', 'capital-gain-category', 'hours-per-week', 'occupation', 'income']]\n", "data.head()" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(32561, 7)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.shape" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age int64\n", "workclass object\n", "education object\n", "capital-gain-category object\n", "hours-per-week int64\n", "occupation object\n", "income object\n", "dtype: object" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.dtypes" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The *workclass*, *education*, *capital-gain-category*, and *occupation* features are categorical features. All of them come from a fixed list of possible values, as opposed to a range, and denote a qualitative property, as opposed to a quantity. In our example, the target variable is *income*, and it is also categorical."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### One-hot encoding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By far the most common way to represent categorical variables is one-hot encoding, also called one-out-of-N encoding or dummy variables. The idea behind dummy variables is to replace a categorical variable with one or more new features that can have the values 0 and 1. We can represent any number of categories by introducing one new feature per category. In general, this encoding works well if there are only a few levels.\n", "For illustration purposes, we will one-hot encode all categorical features, even though *education* is ordinal, since the values it can take can be ordered." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the values that the *workclass* feature can take." ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "workclass\n", "Private 22696\n", "Self-emp-not-inc 2541\n", "Local-gov 2093\n", "? 1836\n", "State-gov 1298\n", "Self-emp-inc 1116\n", "Federal-gov 960\n", "Without-pay 14\n", "Never-worked 7\n", "Name: count, dtype: int64" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data.workclass.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Method 1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's apply the pandas function *get_dummies* to perform one-hot encoding."
] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(32561, 7)\n", "(32561, 48)\n", "Index(['age', 'hours-per-week', 'workclass_?', 'workclass_Federal-gov',\n", " 'workclass_Local-gov', 'workclass_Never-worked', 'workclass_Private',\n", " 'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc',\n", " 'workclass_State-gov', 'workclass_Without-pay', 'education_10th',\n", " 'education_11th', 'education_12th', 'education_1st-4th',\n", " 'education_5th-6th', 'education_7th-8th', 'education_9th',\n", " 'education_Assoc-acdm', 'education_Assoc-voc', 'education_Bachelors',\n", " 'education_Doctorate', 'education_HS-grad', 'education_Masters',\n", " 'education_Preschool', 'education_Prof-school',\n", " 'education_Some-college', 'capital-gain-category_cat1',\n", " 'capital-gain-category_cat2', 'capital-gain-category_cat3',\n", " 'capital-gain-category_cat4', 'occupation_?', 'occupation_Adm-clerical',\n", " 'occupation_Armed-Forces', 'occupation_Craft-repair',\n", " 'occupation_Exec-managerial', 'occupation_Farming-fishing',\n", " 'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct',\n", " 'occupation_Other-service', 'occupation_Priv-house-serv',\n", " 'occupation_Prof-specialty', 'occupation_Protective-serv',\n", " 'occupation_Sales', 'occupation_Tech-support',\n", " 'occupation_Transport-moving', 'income_<=50K', 'income_>50K'],\n", " dtype='object')\n", "age int64\n", "hours-per-week int64\n", "workclass_? 
bool\n", "workclass_Federal-gov bool\n", "workclass_Local-gov bool\n", "workclass_Never-worked bool\n", "workclass_Private bool\n", "workclass_Self-emp-inc bool\n", "workclass_Self-emp-not-inc bool\n", "workclass_State-gov bool\n", "workclass_Without-pay bool\n", "education_10th bool\n", "education_11th bool\n", "education_12th bool\n", "education_1st-4th bool\n", "education_5th-6th bool\n", "education_7th-8th bool\n", "education_9th bool\n", "education_Assoc-acdm bool\n", "education_Assoc-voc bool\n", "education_Bachelors bool\n", "education_Doctorate bool\n", "education_HS-grad bool\n", "education_Masters bool\n", "education_Preschool bool\n", "education_Prof-school bool\n", "education_Some-college bool\n", "capital-gain-category_cat1 bool\n", "capital-gain-category_cat2 bool\n", "capital-gain-category_cat3 bool\n", "capital-gain-category_cat4 bool\n", "occupation_? bool\n", "occupation_Adm-clerical bool\n", "occupation_Armed-Forces bool\n", "occupation_Craft-repair bool\n", "occupation_Exec-managerial bool\n", "occupation_Farming-fishing bool\n", "occupation_Handlers-cleaners bool\n", "occupation_Machine-op-inspct bool\n", "occupation_Other-service bool\n", "occupation_Priv-house-serv bool\n", "occupation_Prof-specialty bool\n", "occupation_Protective-serv bool\n", "occupation_Sales bool\n", "occupation_Tech-support bool\n", "occupation_Transport-moving bool\n", "income_<=50K bool\n", "income_>50K bool\n", "dtype: object\n" ] } ], "source": [ "data_dummies = pd.get_dummies(data)\n", "print(data.shape)\n", "print(data_dummies.shape)\n", "print(data_dummies.columns)\n", "print(data_dummies.dtypes)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that the continuous features *age* and *hours-per-week* were not touched, while the categorical features were expanded into one new feature for each possible value. The *get_dummies* function in pandas treats all numbers as continuous and will not create dummy variables for them. 
If a categorical feature is encoded as numbers, we must explicitly list it in the *columns* parameter of *get_dummies* (or convert it to a string dtype first) for it to be dummy-encoded." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's expand the output display for pandas, such that all the columns are shown when we print the dataframe. By default, not all columns will be displayed." ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
" ], "text/plain": [ " age hours-per-week workclass_? workclass_Federal-gov \\\n", "0 39 40 False False \n", "1 50 13 False False \n", "2 38 40 False False \n", "3 53 40 False False \n", "4 28 40 False False \n", "\n", " workclass_Local-gov workclass_Never-worked workclass_Private \\\n", "0 False False False \n", "1 False False False \n", "2 False False True \n", "3 False False True \n", "4 False False True \n", "\n", " workclass_Self-emp-inc workclass_Self-emp-not-inc workclass_State-gov \\\n", "0 False False True \n", "1 False True False \n", "2 False False False \n", "3 False False False \n", "4 False False False \n", "\n", " workclass_Without-pay education_10th education_11th education_12th \\\n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False True False \n", "4 False False False False \n", "\n", " education_1st-4th education_5th-6th education_7th-8th education_9th \\\n", "0 False False False False \n", "1 False False False False \n", "2 False False False False \n", "3 False False False False \n", "4 False False False False \n", "\n", " education_Assoc-acdm education_Assoc-voc education_Bachelors \\\n", "0 False False True \n", "1 False False True \n", "2 False False False \n", "3 False False False \n", "4 False False True \n", "\n", " education_Doctorate education_HS-grad education_Masters \\\n", "0 False False False \n", "1 False False False \n", "2 False True False \n", "3 False False False \n", "4 False False False \n", "\n", " education_Preschool education_Prof-school education_Some-college \\\n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "3 False False False \n", "4 False False False \n", "\n", " capital-gain-category_cat1 capital-gain-category_cat2 \\\n", "0 False True \n", "1 True False \n", "2 True False \n", "3 True False \n", "4 True False \n", "\n", " capital-gain-category_cat3 capital-gain-category_cat4 occupation_? 
\\\n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "3 False False False \n", "4 False False False \n", "\n", " occupation_Adm-clerical occupation_Armed-Forces occupation_Craft-repair \\\n", "0 True False False \n", "1 False False False \n", "2 False False False \n", "3 False False False \n", "4 False False False \n", "\n", " occupation_Exec-managerial occupation_Farming-fishing \\\n", "0 False False \n", "1 True False \n", "2 False False \n", "3 False False \n", "4 False False \n", "\n", " occupation_Handlers-cleaners occupation_Machine-op-inspct \\\n", "0 False False \n", "1 False False \n", "2 True False \n", "3 True False \n", "4 False False \n", "\n", " occupation_Other-service occupation_Priv-house-serv \\\n", "0 False False \n", "1 False False \n", "2 False False \n", "3 False False \n", "4 False False \n", "\n", " occupation_Prof-specialty occupation_Protective-serv occupation_Sales \\\n", "0 False False False \n", "1 False False False \n", "2 False False False \n", "3 False False False \n", "4 True False False \n", "\n", " occupation_Tech-support occupation_Transport-moving income_<=50K \\\n", "0 False False True \n", "1 False False True \n", "2 False False True \n", "3 False False True \n", "4 False False True \n", "\n", " income_>50K \n", "0 False \n", "1 False \n", "2 False \n", "3 False \n", "4 False " ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data_dummies.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this example, we called *get_dummies* on a DataFrame containing both the training and the test data. This is important to ensure categorical values are represented in the same way in the training set and the test set. \n", "We could have encoded the two sets separately, but in that case we would need to ensure that both datasets contain the same categories for all features; otherwise the resulting columns would not match."
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### Method 2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Sklearn* also provides a class for one-hot encoding: *OneHotEncoder*, which treats every input feature as categorical." ] }, { "cell_type": "code", "execution_count": 36, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OneHotEncoder" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us select only the categorical features using the `select_dtypes` method, which returns a subset of the DataFrame’s columns based on the column dtypes. Its *include* and *exclude* parameters take the data types to be included or excluded." ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['workclass', 'education', 'capital-gain-category', 'occupation'], dtype='object')" ] }, "execution_count": 38, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X = data.drop(columns='income')\n", "y = data['income']\n", "categorical = X.select_dtypes(exclude='number').columns\n", "categorical" ] }, { "cell_type": "code", "execution_count": 39, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['age', 'hours-per-week'], dtype='object')" ] }, "execution_count": 39, "metadata": {}, "output_type": "execute_result" } ], "source": [ "numerical = X.select_dtypes(include='number').columns\n", "numerical" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now quickly visualize the different values that each categorical variable can take:" ] }, { "cell_type": "code", "execution_count": 41, "metadata": {}, "outputs": [ { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "
Jlyqjv+B9toBPRrFmzRgnrAgUKuPYxt6H30W0QQggJoc0/ZcqUUqhQIbd1KVKkUDH9en2rVq2ke/fuki5dOiXQO3furIR26dKl1faqVasqId+8eXMZPny4su/37dtXOZGhvYN27drJhAkTpGfPnvLmm2/K+vXrZf78+bJs2TK7T4kQQqKOoDh8HwbCMePHj68mdyECB1E6kyZNcm1PkCCBLF26VNq3b686BXQeLVu2lIEDB7r2QZgnBD3mDIwdO1Yef/xx+fzzz1VbhBBCIkD4YyauGTiCEbOPJTayZ88uy5cvf2C7FSpUkD179th2nIQQ4hSY24cQQhwIhT8hhDgQCn9CCHEgFP6EEOJAKPwJIcSBUPgTQogDofAnhBAHQuFPCCEOhMKfEEIcCIU/IYQ4EAp/QghxIBT+hBDiQCj8CSHEgVD4E0KIA6HwJ4QQB0LhTwghDoTCnxBCHAiFPyGEOBAKf0IIcSAU/oQQ4kAo/AkhxIFQ+BNCiAOh8CeEEAdC4U8IIQ6Ewp8QQhwIhT8hhDgQCn9CCHEgFP6EEOJAKPwJIcSBUPgTQogDofAnhBAHYrvwHzp0qJQsWVJSpkwpGTNmlHr16smRI0fc9rl586Z07NhR0qdPL4888ojUr19fzp4967bPiRMnpFatWpI8eXLVTo8ePeTu3btu+/z444/yzDPPSJIkSSR37twyc+ZMu0+HEEKiEtuF/4YNG5Rg37p1q6xZs0bu3LkjVatWlWvXrrn26datmyxZskQWLFig9j99+rS88sorru337t1Tgv/27duyefNmmTVrlhLs/fr1c+1z9OhRtU/FihVl79690rVrV2ndurWsWrXK7lMihJCoI6HdDa5cudLtO4Q2NPddu3ZJ+fLl5fLlyzJt2jSZO3euVKpUSe0zY8YMyZ8/v+owSpcuLatXr5ZDhw7J2rVrJVOmTFKsWDEZNGiQvPfeezJgwABJnDixTJkyRXLmzCmjRo1SbeDvf/75Z/nkk0+kWrVqdp8WIYREFUG3+UPYg3Tp0qn/0QlgNFClShXXPvny5ZMnnnhCtmzZor7j/8KFCyvBr4FAv3Llihw8eNC1j7kNvY9uwxu3bt1SbZgXQghxIkEV/vfv31fmmOeff14KFSqk1p05c0Zp7mnSpHHbF4Ie2/Q+ZsGvt+ttD9oHAv3GjRux+iNSp07tWrJly2bj2RJCSNwhqMIftv8DBw7I119/LZFA79691UhELydPngz3IRFCSHTY/DWdOnWSpUuXysaNG+Xxxx93rc+cObNy5F66dMlN+0e0D7bpfbZv3+7Wno4GMu/jGSGE76lSpZJkyZJ5PSZEBWEhhBCnY7vmbxiGEvwLFy6U9evXK6esmeLFi0uiRIlk3bp1rnUIBUVoZ5kyZdR3/L9//345d+6cax9EDkGwFyhQwLWPuQ29j26DEEJICDV/mHoQyfP999+rWH9to4eNHRo5/m/VqpV0795dOYEh0Dt37qyENiJ9AEJDIeSbN28uw4cPV2307dtXta0193bt2smECROkZ8+e8uabb6qOZv78+bJs2TK7T4kQQqIO2zX/yZMnK3t6hQoVJEuWLK5l3rx5rn0Qjlm7dm01uQvhnzDhfPfdd67tCRIkUCYj/I9OoVmzZtKiRQsZOHCgax+MKCDooe0XLVpUhXx+/vnnDPMkhJBwaP4w+zyMpEmTysSJE9USG9mzZ5fly5c/sB10MHv27PHrOIlIjl4PHyUdG1YrJMdCCAktzO1DCCEOhMKfEEIcCIU/IYQ4EAp/QghxIBT+hBDiQCj8CSHEgQQtvQNxFr6EjQKGjhISGVDzJ4QQB0LhTwghDoTCnxBCHAht/iTioP+AkOBDzZ8QQhwINX8S9TCBHSExoeZPCCEOhMKfEEIcCM0+hFiAzmgSLVDzJ4QQB0LhTwghDoTCnxBCHAiFPyGEOBAKf0IIcSCM9iEkTDByiIQTav6EEOJAqPkTEiVwJEGsQOFPCIkBO5Loh2YfQg
hxINT8CSFxKrMqRyX2QM2fEEIcCDV/QohjyeHgUQQ1f0IIcSAU/oQQ4kDivNln4sSJMmLECDlz5owULVpUxo8fL88++2y4D4sQ4kByxKGSoXFa8583b550795d+vfvL7t371bCv1q1anLu3LlwHxohhEQ0cVr4jx49Wtq0aSNvvPGGFChQQKZMmSLJkyeX6dOnh/vQCCEkoomzZp/bt2/Lrl27pHfv3q518ePHlypVqsiWLVu8/s2tW7fUorl8+bL6/8qVK651929d9+n3zX/zIOxsLxzHxvO03pav7fFZC15b0XCehfqv8mm/Ax9W89q+YRgP/kMjjnLq1CmcmbF582a39T169DCeffZZr3/Tv39/9TdcuHDhIlG+nDx58oEyNM5q/v6AUQJ8BJr79+/LhQsXJH369BIvXjyvf4NeNFu2bHLy5ElJlSpVwMdgZ3uR2haPLfxtRfKxOeU8w3Vs0Pj/97//SdasWR/YXpwV/hkyZJAECRLI2bNn3dbje+bMmb3+TZIkSdRiJk2aND79Hi62HQ9DMNqL1Lbsbs8px8bzDH97qeL4saVOnTp6Hb6JEyeW4sWLy7p169w0eXwvU6ZMWI+NEEIinTir+QOYcFq2bCklSpRQsf1jxoyRa9euqegfQgghUSr8GzZsKOfPn5d+/fqpSV7FihWTlStXSqZMmWz7DZiJMI/A01wUCe1Fals8tvC3FcnH5pTzjPRjiwevry0tEUIIiTPEWZs/IYQQ/6HwJ4QQB0LhTwghDoTCnxBCHAiFPyGEOJA4HeoZF5PRHT16VHLlyiUJE/LShxqk+saCyYBmihQpErZjIiRcUAKFgOvXr0vnzp1l1qxZ6vtvv/0mTz75pFr32GOPSa9evXxuC3ULEiVKJIULF1bfv//+e5kxY4ZKaT1gwAA18zkcvP3225I7d271v5kJEybIH3/8oSbghQtkf8VkwMOHD7syHSKXEz7j/3v37oXt2CIZpEp599131ax5dJqeUeHhvm6XLl2Sb775Rv7880/p0aOHpEuXTr0fmOeD9yqc3L9/Xz333pSN8uXLSyTAOH8TTz/9dKwJ3jzBQ+YrXbp0kU2bNikBWL16ddm3b58S/hDcENh79uzxua2SJUuqzqJ+/fry119/ScGCBeXll1+WHTt2SK1atSwLWXRIyJOEvwU9e/aUqVOnqs7kq6++kuzZs/vUDl62xYsXq5Qbntepbt268vfff4tVMFt72LBhLuHj+RLh/H0BRX4w2nrvvfeUYPC8x76eo1nozZw5M9bjWr9+fUjPEdfdV3AvfKVGjRpy4sQJ6dSpk2TJkiXGdXvppZckXNcN7xDStyOHzbFjx+TIkSPqnerbt6865tmzZ4f8OdNs3bpVmjRpIsePH4/RYfqibODcfCWQUSs1fxP16tVzfb5586ZMmjRJCUGdKwg39eDBg9KhQwdL7S5atEhVHStdurTbCwTBDa3FChg1YCYzWLBggdIi5s6dqzqXRo0aWRb+H330kUyePFl9Rh0ElMX85JNPZOnSpdKtWzf57rvvfGrnv//+85pMCgmo/v33X/GH1q1by4YNG6R58+ZehY+v4OX99ttv1cjEDtCZQ4ihwyxUqJDfx2XXOZqf2wdhdZTz888/y08//eR63iLpuiG1y+uvvy7Dhw+XlClTutbXrFlTCd5wPGeadu3aqZQzy5Yt86s9XG89MvWGbaNWG1PsRxWtWrUy+vbtG2N9v379jDfeeMNSW8mSJTP+/PNP9fmRRx5xfd67d6+RKlUqS22lTJnS+O2339TnKlWqGGPGjFGfjx8/biRNmtRSW/rY8LegZ8+eRvPmzdXnAwcOGBkyZPC5nYIFCxrjx4+PsX7cuHFG/vz5DX9InTq18fPPPxuB8tJLLxnffPONYRfp06c3li1bZktbdp1jMMB92717d0ReN7w3f/zxR4x36tixY0aSJEnCeg+SJ09u/P77737/Pc7B1yUQqPnHArTqnTt3xljfrFkz1a
tbKRWptQDY+IHWBD7//HPLGUjR1uDBg9WQF9qK1trhSPYnp9EjjzyitPYnnnhCVq9e7ap3kDRpUrlx44bP7eDvYB5ArqVKlSqpdRhGjxo1ym97f9q0aZUdN1BwnWHzP3DggNI44TPx1xQC4FexaxRh1zkGA9w3mBg//fRTyZEjR8Dt2XndkN/GW0UsjIwfffTRsN6DUqVKKXu/v+dq1QzpNwF1HVFMpkyZjBkzZsRYj3UZM2a01NZPP/2ktJN27dop7bxLly7Giy++aKRIkcLYuXOnpbZ++eUXo1ChQkrzGTBggGt9p06djMaNGxtWadKkifHMM8+okQ40ln///Vet//7775U2b4VJkyYZjz32mBEvXjy15MyZ05g1a5bhL1988YXx6quvGteuXTMCYfHixUq708dlXuLHj2+5vZEjRxodOnQw7t+/bwSKXedo5scffzRq165t5MqVSy116tQxNm7c6NPfpkmTxkibNq1rSZw4sbpGeH7N67GE87rhea1Xr55x+/ZtdWx//fWXGsE+/fTT6v0K9T345ZdfXMt3331nFChQQMkKvN/mbVisghEO3u/KlSurpXPnzq5RTyDQ4RsLcAB9+OGHqkA80kWDbdu2KY3/gw8+sBShA2DbR5u//PKLXL16VZ555hnlgNRRO74A+x5s+/gbaCtm4KNAcRtPrdaXiAmcD5xk7du3Vw5pgOyB0NT69OkjVoH2nyxZMjWqCNTpDg0Kjyg0T89z89Xpjr+tXbu2Ok9/M76+8sorMZyT0Bbht/E8rof5SYJxjpovv/xSpTTH8T7//PNqHZ6ZhQsXKnv7w+zhOiLNFzCaCuV1M4P626+++qoaneuqVcjsi5H08uXLJUWKFCG9B/Hjxw+KnX7VqlVqZAo/gPl+Qo4sWbJEXnzxRZ/binFMFP6xM3/+fBk7dqwKEQT58+dXTqvXXnstbMcEcwyOJ2fOnAG3dffuXeXwffPNN+Xxxx+35fgg+BF5AfLly6ciiayADtdX0EH5AhyCe/fuVRE//mKlRgRCb0N9jho8o23btlXOejOjR4+Wzz77zPUshwo7r5s3tCDUChXMob7woc33AJE9wTDroJOqVq2aUhzNQPmEmdaqcuBGwGMH8lAuX77sdbly5Ypx69YtS20VL17cWLt2rW3HBtPT0aNHA27n6tWryhGeIEECl0klYcKExptvvmmrScMfWrRoYXz22WeGE4CZxpuzEeusOkJh7jl79myM9TAN+mMusxOYE2/evBljPd6nQEyNdrBhwwbjzp07MdZjHbZZAfdMB3iYOXLkiOX76QnTO4QA1AmGmcZzwXqYR6AJQLvwjC/2Bpy9mHiDUMx//vlHOb3Mi1UqV66sHMeBAocv2sFQFKYkLJjHgHXvvPOOX20ibhvOaE/QNrb5ylNPPSW9e/dWoYFwQI8bN85tsQoc2jgGT3D9tbM71OeoQYFvc2lTzdq1a9U2K8RmFLh165ZfkwntvG4YUcD04wlMQFYr+T1p8z2oWLGiXLhwIcZ6HC+2WQHOa4xaPcG6jBkzSiAw2icWYJdDvDtMP7CHIzWDGW83NzZga4XtHMJH+w+2b9+u7KuYlAJTyciRI1UEw/vvv//AthDHDGAHNNss/Y37xUQeDCH379+vJmh52kp9jYRBHD1mW1aoUMHtWNG5wUymo5KsgMk73s4HwsfKpDFE+8D/gI7Is6PDNfOclfwwfvzxxxjPg/a7IC4+HOeoQUeL84FweO6551ymETyDMGH6gu4QcW30tdPgWDdu3KhMelax87rp590TXDNfipcH8x7EdmzoYB7mi/AEPkeY8TBXxXw/P/74Y1dknr9Q+D/AJogHHy8TBDSENx4STNhC2UgrQMhD4zT7CurUqaMctwijg6aGUMshQ4Y8VPj/8MMPYid6whpswp5Y6UyQwsKbMxXaCbZZwTxjFQ4v88uM48H1suLzQBisHZhnXh46dEg5GM3HhRKivqYVsPscNXDaZ86cWT1vUF
y0HwCTDH2dkQulRwuxKVOmqEACDTR+OEWxPhzXTTtqsWDUas6RhbZwr3XQQqjvwSv/z7mNY4OiZy63iPZwHbQA9xUEKcBnhfuJ0SuAcxuZAawqLZ7Q4RsLcA5CA8JsRLPDEOsw0xezan0F2i9ufJ48edzW//777yr1AIQjHlpEQFgVlJECXsT06dOrafVwSgPME0BECEZJMDv4CiIngLfoCURiQPjgZUAETyjRER3A22uD+zx+/HjlQPelrUg8RzMwUSBKCObJSLlu2lGL/6GYmUclumNC6hNfzFLxbb4H2twEZQ+KHs7L89igyVsNgjCbtIB5RnMgUPOPBWgnOgwTD5i2L+qQQSvA1jpt2rQYHnus03ZYDAk9wzdjA7ZI/K2O3ECngRfH6nDX2/BbC26rwKSAqAREDaFDA4jCQHvQqqygfR/QupCzyJ+XxcqQ2NuoxxvooCEkYAeG2c48mQgvN0Y5Zi052Of4sGR25ucDGrMV4E8qW7as0mbxGYIS541UEtBqfT1Pu6+bjryBIG3YsKHfz2sw7oGOVsKxwS9n1cQTzCg6rwTkLo5innrqKWPr1q3q8/PPP28MHTpUff7666+NRx991FJbmDCFKIwiRYqoySlYihYtqrz1S5YscU2Q6tat20Pb2rFjh5EuXTo1merll19Wy+OPP66mzu/atcvyed69e9cYOHCgkTVrVhWpo6fJI7XF559/bqktRPVMnTrV6N69u1oQYXP9+nUjHFSoUMFtwaQ4TGLDJCAsiHLCuooVKxrRBKJzcE6IttKTsfC5UqVKxrlz53xqA88YJsUhsqxs2bLquUDaj4YNG6pJYM8995yKVCO+gclely5dMiItio7CPxbee+89Y8iQIS6BjwueO3duJcSxzSqYgYi/0wK7V69efoVY4mV8/fXX3ULJ8Llly5ZGuXLlLLf34YcfGk8++aTx5ZdfuuUgwjmXLl3a53Zu3LhhhIozZ86o4/aVUaNGqVmuFy5ccK3DZ+T8waxTKyCUcN68eUbXrl2NRo0aqQWf58+fbzlsFx3/Bx984Mors27dOqNGjRpGtWrVjE8//dTwh9dee80oUaKEcejQIde6gwcPqnU4Vl+AsmOePY4ZsKVKlXJdt2LFihlvv/225WM7efKk8b///S/GeszStRoCCaVlxIgRRsmSJdVsfH9mH588edI4f/686ztmQWPGO96xpk2bGps3bzbsIFGiRG73wwpt27ZV7+fy5ctdIeLIj4SZ28gYEAgU/j6yZcsWJUSQKiCcID3E4cOHY6zHCw7hbRU8RHregDlBFn4DWp6VhHOIp1+9erVx7949I5ggIZ6VOHOMapCozpP9+/cbWbJk8bkdxMrjRcQ9eOGFF5SgxYLPWAflwNeEXlOmTFEKBbRrjEAgYHENW7dubbz11lvqXuqkfVZAW9u3b4+xftu2bUqb9wWzEgBwPyHA0OkC3GNcU185ffq0EtLQWvUowtwJoF2r8wbQaeLeofPGtR80aJAaUWMEPHbsWJ/aePbZZ10j70WLFqljqFu3rktJwznr7b7g2QGZR1649v6kxcD5/PDDDzHWr1+/3lLiRW9Q+HsBmgiGWtDW7QJaBbSJMmXKGH///bdaN3v2bJX3xwrIK7Rq1aoY61euXGk55xDAi6OzA5qFPzoTmEZ8BflMkB8FgiNz5swqvwrMB/7gmQvFc4HmbUVY4Lxie4GwzVeQRRWjBWhfnmAdtlWtWtWntpD7BSYyfRy4DxMnTnQzFfiTDRXns2fPnhjrkZ0TnYsvZM+e3S3LJYQ3BJg24WHEaiWDLJQCjBzwPKxZs0Z1eBiJ6JEYhD/atwI64aVLl7rOWee6geD3NcdVihQpXO84jm/YsGFu25GlFiZCX8Fx1KpVy5g5c6ZrwX1Ehwcrgl5nBbxP3kYNUGZgxgwECv8HaFB2CX+kE8ZNhFYHO78WsHi4MMy3ApI6wcYPs8yJEyfU8tVXX6l1VhNaASR1g9bpKfxhVsHw1y
qwBU+fPl0lrsNDnydPHksmGqATrsWWiM1qQjZomjly5DC+/fZbNdTHgnuCxHMQTL6Ce4jRQmzs27fP59GXOZU2gJZpbhsC1p+XG5pr+fLljVOnTrnWQdnA6ASJ0HwBzxGSB65YsUJ1TPAhwG9iVjQwYvQVjBIw8tBgZi7McDAf/ffff35p/rg2+vpB2dD+Ljy/vqZJT506tSvRGhQnz6Rr6FCs3AOM+jDCwTNlHtlghAdlyh/gq2nQoIGbWRWdMNYhyVsgUPjHAm7g6NGjbWkLD7mecm4WsNDGYK+0AuzKsLfqbItY0KHA7uxtuvvDwHAXLwG0HjzosKOik0L7GN4HAh54nLvVFxtD3WnTpsWawxw2TyttwjHWvn17dZ30NcP5YR0car4CM8ODzAAwCfpqRkJnrTNtQlCjQzPnukdmTuxjFSgDuOboTKAdY8FnaLDo9HwBggvmLAgtHBccvGZFCCNP+Dh8BRq2Z4oC+KnQGSEIAp2m1WfEjoCMunXrKt8bgJ/F01yEgAUoL1bAeaEuBjpHPXoKRPhDIUDniXcCHQEWfEbAhzdTphUo/GMBNkTYvOvXr2989NFH6sEwL1aAlqedu2bhj//9zc8BgYaXBkugXn8IIZg08NLgWPEyeTMt+QI0FJhlYALBuT3xxBOWHeQwneD6P8jmb9VMACDotenIitA325lhs4VSgDagsWLBZ6xDFFb//v19aqtjx45KsAwePFjZnuGwz5cvn9K2oVkXLlxYRXT4A1Imo+NGIR0sMLX4ey+9OWitgnPxVkxHdwB4RqwKfzsCMg4dOqQEKRQ9PG94N5s1a6baxTo8v97SuvsCnPc4r969e6vO11/hH8woOgr/WICZILYF5gIrYH/9ApqFP0YD/la5ijQgsPDCYMgNIYgoBasRHGb/gTZFeQO2Yiu2U4TZwbzgCdZ5s98/CIyQoN1r05M2Q2Hdxx9/7HM76HzatGmjzCu4VhjRYdQF4YX2YGbxllQtLgJNODZfCDoAaOD+dOZ2BGT88ccfKgoK/hBtWoSwxmhn4cKFAR0TEuDBcQwl8tdffzUiDc7wDQFDhw5VedZRCwD5t5FvHClgkXYXE8Z0hS9fc6I/CCs50QEm3mCCC2bnek4kQ4pcX4tXJ0+eXE2Aa9q0qcrpY7WuQDBB/iKk0/CsvYwUBZjij/thFUxc0mkKkE7BjhTbeqLdnTt3AprFibQEsRUjt1KBzi6QOhwz11HPObbtp06d8jnVMa7PW2+9pd4du667YRiu64UJVJH0/CITANK6eLufVlPNmKHwDwG4xMibj05Ap29A3g/MAhw0aFBYc6Jj5iaEmGeGwLNnz6p8Q0hu5evUcy2wkAwL+Uf09Hm7OXnypJrp6asgQ/EQJMNCjhszv/76qyqQ4S2jYyiOyxMcI8p0mnPCWAVpDwYOHKja8VY8HOkaIg1/rhtmsyPlil3C3857YAbpYZAjDPfCH1CDAfma0CFByTDfT3wOJJ8/hb/F9AC44JhSjvqcSJRlpfYnMhqiYhAKTxQoUMCvSld2oZNaYbo+cpF4S2q1Zs0a15RyK0DDw4vpTzpcX0DaCIxKfE06h2n2yMfkWTUNmUxRb9WufEpWjysY1w1CZvjw4dK8eXOJK/hz3ZAzCtWtPIvWBEoqm59dKEQ4P3/bw2gII1ZU/bMb5vaJhT179qheFQ9k3rx5XcWhkYMEuTUmTZqkEkv9/PPPSpD7AvKY+Lrvg8AwGelxURoSZfnwgJ0+fVo9uL52KBD6ujPzLMdnTmrlD4HqE+Zsi97w1RSlQRrtqVOnquRhnmYfpLEO13F5YoceBgXDaubIYBOM64YkiRjhQFP3lorc34yXRoTpwhcvXpQGDRoEpW1q/rEwZswYlWMcZhRtq0Ryt9atW6uEV8jOB8GLzJXeEpcFy04PXwFS1qLGAEwy6JCgVaC8JL5bSbUbrMRigWo7D6uHajXdNAQEyvuVLFlSZR8FGNngvFEKr1y5cmE5LruvG4
CGCAXAavLBYBKM6/Ygcw/a8rcjTmnDPTBTqFAhWbFiheVCOppWrVqp57Zdu3ZiNxT+sYD84jB7eGrqBw8elKpVqyoHFUYG+Pzvv/+GzE4PjR0PKLJ6wkmrH1SMBNAhwTkUKHD2BpLGF74N2Cn9bQPXHiOr2PLPY1gObc+KsMDfjBgxQv2PVLtFihRR+dE902yH+rjMIE042raaDdJsooRDEGY8nB8WT8elrxlM7STY181O5vp5D8xAMYOw9/S3QNTCvwFf2oMwV5e7du2aumfwHcBs6Xk/A8npT+EfC9CeUCrRXJkKQMgicgQOTmgXsDv6Uz7RXyDwN2/erExRZi0FhWbQUVm1X6MikE6PCzDERFUu2I4RBaPTM4cSVA/DdcWw3hs4Z6Qo9qXspROOy9fSgBBG69evl1AT7OumRZi36lnhIEGCBCoNtmcQBQILsO5hnZyvTuxARjiANv9YQO+PHPmwe2PYBWAmQISOtpcjNznqw4YSvCDeHh5E2PgTHggz0Zw5c9RnjHRQdAWVlVAFqkePHsos4gsooAHbuqdjCs5HXLcFCxb4fEz4XWg8sQFnu78VzRBK6VlKMLYQxFAcF9pDnYfYQjN9ebntru5mN8G6nygchNGcHu3iXcRvWXV2X7PhHvhSxhGBHr7UH7Cr8tzDoOYfC7hRiCTAAwYHK0DJODhHUeYOw0JdWBlaTWzl5nzBSrgWNHRE5sCBCWGPCmEojoHOCsNJq6GeMIHAb4BhKvwGEI4oLYl1iISBw8kXcAzQKr1F1MDejtDRcIHRUM+ePVWH5i2sM5zmhsaNG6u6whBY3kIzcU+sAL8UzsczCg3V1PD8+trRRTowhcCv0alTJxWuCxB8MXHiRBk8eLClKKDGNt0DbX5DYSOYYDH3RYN7sm3bNjUqgA8qEqDm/wCzD2JsIeh1zw/zijmaxpvQ1+jRgd1gJIKKWTDxQFDD6QzNBw7br776ynJ7qB4GOySEPzR+vDgAOoEVoYjO0lvpPNgoQ2kW8wa0QWiWKCKPFxwCAj4bdHKe1dVCDZyBy5YtcwmwQGnUqJHXCW3o+Pyd0BaJIHIL97NFixZu5iVULUN9WyvCf4VN9wARgvrdgdJjfh/wGSZUWA6sYOeIOgbhnF4cV5g7d65fuWCCBabEI/1Bjx49VHKyQHJ9IMcMUvgitw/ynOhcLsgUaiWdLbIZesveiVw3yBwaTrJly+ZK6Yxp/DrnPlJqW82qajdIF+JvoQ9vIPeQt/ZQnwFpN6IF5N3xVjsBCeSs5svKYfM9QLElq2lDYgM5+5G/yxOs8yeFuxkKfx+AwDAXt/CXnTt3KqGNBRk9I6V2AXLKIFOo+ZiQqAydiq8gpwqSayG/j85bjlTKWBdojpRAQVZJnf4X2RB1emFkqrRSsyAY4FlAHYRAk/NpkJk1NmHhT7GfSKVgwYKuxG5mkKAN+ZLCeQ88QUeAd8BbEaaHgboJ3vICoS0rNRW8QbOPDwTqFoETCcNxRArp8EeEUyJK4+uvv3YraB3qXB8wy3gbilqdOQlTw6JFi1Qai2+++cYVTgkH8gsvvCDhBOY6ONHgE8EEPZhAMJResmRJQCGt/uLpD8Ks70yZMqmoK89QPqvT9+2a0BbpII0F/F8bN250mWtgS4fTFvc3nPfgtddek/Llyyt/BOYBIV0EovEgR/C+w5TjK/ChzZs3L8Z7jXYCnTBK4R8CkLgNoaGYI6Dzyxw6dEg5jxGna8VW/7BcH/4mesLxID7ZMxIGdlRfQSwylkgDcy4QTohOqFevXqqjmjBhgkoQFo6492D5gwB8NnCw43y9TWiLFiBA4UCFTw5KB8C7hQg8CPZw3oONGzdKnz59XLmUIPSh7GH+Be6PFeEPpzYmjGI2f6VKlVz3EzIjIHs/CGjc4BBQatGfQil21lXVIEe4Z7m5QIA5CwU1PKtn6XTF0WDa8gQFYVDVy7NyU7SAMo4oRI5SkSiZiJKknsVUSPCAOQZFdQBMn7q+AE
yP/pgZUa4SKaZh0oNfDpXVUOwnUCj8fQQXG5WWdN3RUNdVtdv/oKldu7YqvHL+/Hl1nHB8obNDgRFdacoXkHseDyU6DnPhalQeOnfunBEpoJpVsAvMW631gLzvnly8eNFy3QgnAcXEW70DXEurSktOm+8BivSgoBGCRFAgCYVddBEiCO9IgcLfA2jVffv2dauKhBJvWiNG2UWr5dPsqKuqQXWnyZMnG3aBh1FrwBihaOcSHliUA/QVlP1DUW5z1ASqF2EdimVECnZ3noGCZ8qbEEOFMBQVsQpq2ZodvijTic4dFaVQMCZaiO264R2z6giNZ/M9mDhxogp0QBGXokWLupQNVFUz10L2BYwgzOU3YS1AjeVPP/3UCBTa/D2Ac8UcUwvnJWx4SPIGmyLiiuFs8sWppIF9GbZzOJN0gifE1iPpE4q8PAxzrg/MhoQdUKcoDjTXB2L59cxg+BGQHRSpI5BK1ko6Z8wRgHPXnDMfDinE1CP/UaQQKXMazZkukRjQW0ptf3LVo8gJ/Bp4NjA/BU5R2IxhH8ZkNyQsjMvodwH+LeTJN8+7wXXDuwqnfjjvQYcOHdQESfjQULxJ17VA4MGQIUMstYV5PG3btlXzU1B3A/4cyA3Mysf3QIq5UPP3AL21WXtFzC7sduZycf4U1g6kruqDSkoGUl4SlC1b1hWK2bhxY6N69eqq8DRCNhFOFw7TVjAxl9EMJ2bfiv6sF5RyRIHyBxWLjw2M3lCaUI9idflE3FN/nttIQz/ruE6Yv2F+/nHNcL66sHu47sGDtHj4X6xgLgGJ2uGw/QPU2A7ULEjh/xDhkDdvXjczC5w2gcbXRlrtXTg/AZyCOF88/Jhcsnbt2rCYtoLJRx99pGy5kQKEFvwtdoGOVjt3MXFvzJgxUfncwnzij/8tFPcgNmDzt+qPgIP46NGj6nOdOnVcwR523E+afTzIlSuXGjpiiIZhG3LcIGbXnEDNs97tw4ApBuYaT5MMzEGILw7nUBypIjRIb4zShsgDg7QPVrIkBmraChVI4xwp4b+IB7c7iRdiynW4J/LVIAUCwO8gjj1asDOZ3dH/dw+Q4A3mXLyTKEOKuTlW3vVgFK1BugrM0UAINRIv6rKvMM9alUOeMLGblzh6THCCrRR2ddgBkUJZgxcL8cWYIGQlnzkeDM9JNpg8AoGJDiXUuT6QsdQXrNRVxaMEuz86EAD7P4RQuAlqfpQAC5xA2UDBDsz5wLyNQEGiv6ZNmyrFBYnGUBtXdzZIaod89dGAHfe0QIECKhkckuBBUUFRH8TjIzso4uqRCA8ywFe7fzCK1mBi6Msvv6zyY+EZ0e/j+++/r94zK4WgYhDQuCFKmTZtmjJVtGvXzvjnn3/ctiGXjjaTBJqHBOus5iGxK9cHTDsY7r788svqXGNbAiFSzCvBzI/iL7j+MKshcgPHh6gSmM5gYw5GKOqNGzdUKo9owY57Gs8U5dO0aVNlT7906ZL6jhxXMJvBD+YrWbNmVdFVsQGfmD9zZ+7evRvDxAVTkLcIJStQ+HvJw+HLYgU4TsePHx9jPRy/+fPnD0uujw4dOqhYfIRzwpH033//GYEAW+TXX3/t+t6gQQP1oOOFgK0znAQzP4q/mAUPhDLiwhFSnCBBAnXN3n//fa8KgxWgqITClh1X72k80z148sknVUCGmU2bNimnsq/AJv/BBx/Euh3vAX7Tn0SOCBCZMmWKceXKFbUOvjWdhNFfKPw9MM9ufdBidSSBpFr9+vVTk8Ww4CHBjL2pU6daasvO7JmYtYyMpdBwcCwQ2HAAIzLJKhhF4GUBeIkQpYCIhFatWhkvvviiEU4iMeNobLHlcOThuJBp1R8tMZLnNETaPY0XL55rAiI63P3798eYCW5FOcCkyBUrVsS6HZO+rM7MxTHky5dPvZ9QDPT9RCLGt956ywgECn8PtHDGgjTAENpz5sxxW+/P1OpJkyapjJI6lAxhWrNmzbLcTrCyZ+IhGzBggNKAkE
LCqlZhntKOB7Nt27bq85EjR1RHEE4iMeNobMLfMzQ4GsJaI/WexosXzyhcuLBKXY5r9c0337ht37Bhg3pnwwkm6DVr1kxN0DPfT8im3LlzB9Q2hX+IXyBoGoEO14KR6wOCG5oUOiU88FaPMUuWLC7NH/HR8+fPV58xNI+EOP9g5UfxF4yUvKUU8Ifvv//eqz0/moW/Hfd0wIABbgtGvWbefffdsM9ORw0Gbd4y30/Y/ANN0U3h/xDsfoGGDh0aMY5Qs9kHmjtymiN/kT8OR7uKwhDrwDykzRex5bwhcZM0adKoNCmesgj5twINVmCcf4hBvnvEd4cjj7znFHTkBEdMPsI+kSIW6R38Bal1EeOPkDmE2+lp9//880+MkoLhYteuXXL48GFX/LQvqX/jAqgHgZBEpKqOrXh4tBKt91SD1CiYB4QaDQD3FiVTEcJbs2ZNCQTG+T8EXSTdnxwfsbWHXOuYROYPiBGGoMVkFG/59zFBy9eYZBQ3eVih+YDiiCMEu4vpRBqoWTtw4ECfhH44i9XbSbTfUw2UqerVq6tOHUWcMIlP1+zGZNSMGTOKv1Dz9wBJsMygSHq7du0kRYoUESEUkVQOCa3eeecd6du3ryoagSpBKGhhJckTEtTZqSGiUAUeSF3MpWfPnkpbwUQajCqQKC4aiulEqvCHIMTMVEwanDFjRthHlsEm2u+pBiNzKItIOIn/ofVjUiAm8qFaXiBQ8/dS9ckX8IL525Njxq/O9GcVzAhFZkMIWYwi9u7d61qHoX+4ZnAiEyhSCaDa0JYtW9TMXoxQli5dqmZKhnMEgVnamHlcsmRJt/Wo+oRhNTTGaAHKQY8ePSR58uQSzTjhnt65c0dlKMU7ZM6WaxfU/G0S6g8DDyPSQ2PaOF5OTClHegfkW0Fn4CtI44p0vQB29cuXL6vPtWvXVqmewwU6NeQvAhiFYPo9UtGivmqFChUknKDOsWfqa4B1njWQ4zo6ncP58+ddKbnRMUeLGcRJ9zRRokTK8hAs/FM/iSXgM0C+kI8//lhGjhzp0kqgDVtNNPb4448rJyqAxq/rsiKfSZIkSSRcoCNC7hiAY0Iec5A0aVJVxDqcYDTSpUsXlQxLc+rUKZXDSde5jRaQsx8OfCQmQ0JCLPgMUwG2RQtOuacdO3ZUcuPu3bv2N25DNBJ5CJUrVzZ69OgRI1wLcfEIj7QC6oEOGTJEfUY6BUxqwWQP5B/XtULDAWrGYmYlZvQi7lrHsCMGHbVkwwnmMCCNBfLnYBIbFnxGCKq5SlI0gMl1OL/ly5e7UpEgfDdXrlwqV1W04JR7Wq9ePTVPBvNoUKsAubjMSyDQ5h8i+yRMPNDUzdE+x48fV0PyQIZ2sPMj6yjSMSPUL1xgNAMHNMw/7du3VxEK2gyB4Su2hZNIzThqN3C6w7zoaWpDCmSEGMMcFC3gnqLalg71jMZ7+sZDfJCBmKlp8w8BMMcgJasnqBVg1RYL04rO4w1Bu3z5cmVWQQhYOEF0CXL6axCJgYiLFStWqFjscAt/RDbBFKXNUdHgEPQGTDve8vYjJDBazD6w6c+cOVOZTRHphnuLUGwoWdEyz+H+/fsyYsQIJSMQzg0zF6K6Ao3wccOu4QmJHZhCMHzDFHyYff766y+VwAtDVKT09QWkqtXJvlBtC+lhUUwe7aF0H5I+hStPjWc+FORbQQWiPHnyKFPU9u3bw3pMkZxx1G4qVaqkzg8pnDXXr19X62B+jOsg51GtWrVUXh6YfZB+oWHDhkaRIkXUOuTCiQYGDhyonlGYenBOmIFvtQTkw6DwDwHIEY60B5iqDSGNNLGwT6LsITL9+QJq69auXVvVYkU2P+TfefPNN1UqBixI0VyqVCkjHKDmAdJWwPeAKeedOnVSvgg9LT3cRHLGUbuBkoBODSk20BFgwWc8LwcOHDDiOtOnT1c28PXr18fYtm7dOrXNn4SJkQbeJaRw1iClM/x6dtZ6oM0/hKBqECJ/MFHjmWeesW
SfhC13/fr1UqRIEfX3qVKlUhE+ujoYbNmlS5cOuTkDfgbMNMS8A0w8ga0/QYIEys4P3wYmeYUbDJUxfMaEGUSIwMfy6aefqnWlSpWSixcvSjQB886cOXPc/Bt2TAqKBBDDDxNIr169Yk2fgvKVq1atkrhuKv7jjz9cJVF15BzWIeLPFmzrRkhIUwB7Jpw7c+ZMwPnf/QEjmW7durmKhmsiSfOP9IyjdgGzIqJeDh06ZEQrMHXC5Bkbu3fvVvtEU7I+jTYZ2wUdviECWjoiLpCTxHMSyujRo31qw9ORFQmOLYxmpk2bpkYg0DCbN2+uUg1EWsqOJk2aqIgoOMxr1Kih1u/Zs8c1MS0aCPakoEgAuaseVIge26JhJGcYhrz++utuc3e8pZoJZOY8hX8IwFAU0S4I68TDaRbaVgS4+WHwfBBu3bol4QCmJizIPIj8IygwjcLh6ODWrFmjhq0Ibw0ncSHjqN2TgpD/CWk1og0kpnvQecHkGJQJUSEGOYo8adasma2/QZt/CIDAxwsJ4R2pOYfsBGkFMBr44osvlA8C4ZWLFy8O92E5gpdfflnFvqODQxqQSElIaBfIiYWRW2yz2aEErVy5MmqylwYTCv8QkCVLFuUUhdnBSeAFXLJkiRoNhFP4R3LG0bg0KSgSiEtKUKRD4R8CYGpADhKYRkjoieSMo3ZPCkInG7RJQSSqoPAP0YsJrROhhdA2PbMRRoPwiWSQ3hhhjyhe89577ylb/+zZs1UueKRBiIaUB4MGDVLCHh0bBD5CHRs3bqxGXYR4g1k9QwCKSyDSB5k9kZoB09DNC3FuxlG7QGc2adIkJfSRUhvmNsT6R0t6Y2I/1PxDAKJdUFpO25xJaMEEJ2j+KFkJGz/KX6IThokEKbUxAojrhGRSEIkqqPmHABRuQUZPEh4mTpwoZcqUUeadb7/91pUYDwnnYBqJBhDeCGFvBuZFVIMixBvU/EMAIg8Qfob/o728XlxAZxxFLDw6gGgIC/QWAgnTDxy/dk0KItEFhX8IgLkB5RtxqTHZyNPhi1z/JPgg3BbzD6D9o7oVZv6i3KRnHdi4CEMgiVWibwpgBFKvXr1wH4JjQc1j5H6H0EdNBRQ0wUQgOEUjIemcXVCoE6tQ8ydRS1zIOEpIuKDmH0JgX9Yl5woWLKjMQSR4oIoYwmxRVtJps6sJeRiM9gkByOQJxxtsyxBGWJAFs3LlylExwShSQcZROHdxrZG3H2Um//3333AfFiERAYV/COjcubMSQognR0paLAcOHFA2aHQEJDgg2+hnn32mZvS+9dZbaq4FHL064yjuCSFOhTb/EIBZvGvXro0RVbJ9+3ZVmShai4lHIsw4Ssj/hZp/CICm6RneCbCO0+9Dn+QNifb+/vtvFetPiFOh5h8CXnrpJaVlQtjA7ABOnTqlIlDSpk0rCxcuDPchEkIcBoV/CEAFqbp16yqbv869gvwyKLYBkwNzrxBCQg2Ff4jAZYbdHwnGAGLMEe1DCCHhgDb/IILCISgYomv1wrmYKlUqGTVqlEoo1rZt27DV3iWEOBsK/yAycOBAt3TB+/fvlzZt2qhOoFevXirx1tChQ8N6jIQQZ0KzT5Br90LAlyhRQn3v06ePbNiwQU0+AgsWLJD+/fvLoUOHwnykhBCnQc0/iFy8eFEyZcrk+g7Bj7S7GsT9wxlMCCGhhsI/iEDwHz16VH1GUW2kbsasUw1mmHqL/yeEkGBD4R9EatasqWz7P/30kyoXiEIu5cqVc23ft28fK3wRQsICs3oGkUGDBqmCIS+88IIqIj5r1ixJnDixa/v06dNVegdCCAk1dPiGgMuXLyvhj1zyZpDgDevNHQIhhIQCCn9CCHEgtPkTQogDofAnhBAHQuFPCCEOhMKfEC/8+OOPKh9TJBTayZEjh4wZMybch0GiDAp/QiKEmTNnSpo0aWKs37Fjh0oCSIidMM6fkAjn0UcfDfchkCiEmj9xBCiXiQyqOXPmlGTJkknRok
Xlm2++cW1fvny5PPXUU2pbxYoV5dixY25/P2DAAClWrJjbOphiYJIxg4l7BQsWlCRJkqjEfp06dXJtGz16tCrgkyJFClXUp0OHDnL16lWXmemNN95Qc0JgbsKC3/Rm9kEhIFSHwxwRpAh/7bXX5OzZszGOFXWK8beoId2oUSMWrCduUPgTRwDBP3v2bJkyZYpKs92tWzdp1qyZSraH5HqYiV2nTh3Zu3evtG7dWqXlsMrkyZOlY8eOykSD9N2o0pY7d27X9vjx48u4cePU72O29/r166Vnz55q23PPPacEPIT5P//8o5Z3333XaycGwY8Jgjj2NWvWyF9//SUNGzZ02+/PP/+URYsWqXoSWLDvsGHD/Lp2JErBJC9CopmbN28ayZMnNzZv3uy2vlWrVkbjxo2N3r17GwUKFHDb9t5772Hyo3Hx4kX1vX///kbRokXd9vnkk0+M7Nmzu75nzZrV6NOnj8/HtWDBAiN9+vSu7zNmzDBSp04dYz/8Bn4LrF692kiQIIFx4sQJ1/aDBw+qY92+fbvrWHG+V65cce3To0cPo1SpUj4fG4l+aPMnUc8ff/wh169fV0V0zCDT6tNPPy03btyQUqVKuW0rU6aMpd84d+6cnD59+oGlOVHGEyMQlPK8cuWK3L17V27evKmODUn/fOHw4cPKZKRrQeuSoHAUYxvShAOYe1KmTOnaByYoHCMhGpp9SNSj7erLli1TZh29oIiO2e7/IGCy8cyEcufOHddn+AoeBHwItWvXliJFisi3334ru3btkokTJ7o6IbvxTBUOHwJMRoRoqPmTqAeaMRywcJQiw6on+fPnV/Z5M1u3bo0RcXPmzBnVAUCQAnQgGmjZ0LbXrVunHMaeQNhD+KJ+MzoSMH/+fLd9kODv3r17DzwXHCt8FFi09o9ODPMRcJ6E+AqFP4l6IJjhPIWTFwK4bNmyKqpm06ZNysHarl07JZR79OihnL0Q1Ii5N1OhQgU5f/68DB8+XF599VVZuXKlrFixQv29OcoGbWXMmFFVbEN0DX6jc+fOyvGLkcL48eOVYxnr4Xw2g84DoxR0IIhGginI0xxUpUoVFTHUtGlT5SCG6QhRQ+jUdLlQQnwi3E4HQkLB/fv3jTFjxhh58+Y1EiVKZDz66KNGtWrVjA0bNqjtS5YsMXLnzm0kSZLEKFeunDF9+nQ3hy+YPHmykS1bNiNFihRGixYtjCFDhrg5fMGUKVNcv5ElSxajc+fOrm2jR49W65IlS6Z+e/bs2TF+o127dsoJjPVw3Ho6fMHx48eNunXrquNImTKl0aBBA+PMmTOu7b44pwlhSmdCCHEgdPgSQogDofAnhBAHQuFPCCEOhMKfEEIcCIU/IYQ4EAp/QghxIBT+hBDiQCj8CSHEgVD4E0KIA6HwJ4QQB0LhTwghDoTCnxBCxHn8H8wV0vzLOYnoAAAAAElFTkSuQmCC", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXYAAAGtCAYAAAACp1+JAAAAOnRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjEwLjAsIGh0dHBzOi8vbWF0cGxvdGxpYi5vcmcvlHJYcgAAAAlwSFlzAAAPYQAAD2EBqD+naQAAY2JJREFUeJztvQncVHP///8hrUIKhZSISqhblrKkTYko1Y3bUpSlVEruIpIUcpct2iwt3PZ9C5WQvbITsmW7aUEb7XV+j+fn//3M/1zTXFdzzpxppjOv5+NxHtfMmXN95sxZ3uf9ea/beZ7nGSGEELFh+1zvgBBCiGiRYBdCiJghwS6EEDFDgl0IIWKGBLsQQsQMCXYhhIgZEuxCCBEzJNiFECJmSLALIUTMkGAXIk+ZMmWK2W677cwPP/yQ610R2xgS7ELkmBtvvNE888wzud4NESO2U60YIXJLxYoVTefOna2G7mfjxo1m/fr1pmzZslZzFyJddkh7SyHEVqVUqVJ2ESIoMsWIvOWjjz4ybdu2NTvvvLPValu2bGnee++9ItssW7bMXHbZZWbfffe1mm316tVNly5dzO+//57YZs2aNWbo0KHmwAMPNOXKlTN77rmn6dixo/nuu+/s56+//rrViPnrB9s26/2a9HnnnWf35fvvvzdt2rQxO+64o9lrr73MsGHDTPLk9+abbzZHH320qVKliilfvrxp1KiReeKJJ4psw/h///23ue++++xrFr6jJBv7uHHjTP369e3v5bt79eplj4OfZs2amYMPPth88cUXpnnz5qZChQpm7733NiNHjgx5NsS2hDR2kZfMmzfPHHfccVaoDxw40JQuXdrcddddVmDNmjXLHHXUUeavv/6y23z55ZemW7du5rDDDrMC/bnnnjO//PKL2W233aw5o127dmbmzJnmzDPPNH379jUrV640M2bMMJ9//rnZf//9A+8bY5544ommcePGVlC+/PLL5tprrzUbNmywAt4xevRoc+qpp5qzzz7brFu3zjzyyCPmn//8p3nhhRfMySefbLf573//ay644AJz5JFHmosuusiuK2mfeEBdd911plWrVqZnz55m/vz5Zvz48Wbu3Lnm7bfftsfJsXTpUrufPMROP/10+1C54oorzCGHHGIfmCLGYGMXIt/o0KGDV6ZMGe+7775LrPv111+9nXbayWvatKl9P2TIEFRk76mnntrs/zdt2mT/Tpo0yW5z6623FrvNa6+9Zrfhr58FCxbY9ZMnT06s69q1q13Xp0+fIuOcfPLJdn+XLFmSWL9q1aoi461bt847+OCDvRYtWhRZv+OOO9pxk+F7+S72AxYvXmy/o3Xr1t7GjRsT240ZM8Zux291HH/88Xbd/fffn1i3du1ar1q1al6nTp02+y4RL2SKEXkHGvH06dNNhw4dzH777ZdYjwnlrLPOMm+99ZZZsWKFefLJJ02DBg3MaaedttkYztnINmjuffr0KXabMPTu3bvIOLxHK3/llVcS6zG/+LXn5cuX2xnGhx9+GOo7GZvv6Nevn9l++///1r3wwgvtzGbq1KlFtsdkdM455yTelylTxs4MMCOJeCPBLvKOJUuWmFWrVpk6deps9lm9evXMpk2bzM8//2xt5NiRS4JtGGeHHaKzOiJU/Q8cwH4Pfns4JhfMNdj1K1eubHbffXdrNkHAh+HHH3+0f5OPCwKb/XGfO/A3JD+8dt11V/uQEfFGgl0UPMVp7swcwvLmm29a+zpCHWfniy++aO36zDi2VoRxcRE1inCOP3KeirwDzZYoDhyDyXz11VdWY95nn32skxEHaEmwzezZs208uN+xmKzFQnJkSbIG7GDGgDnDaenw9ddf279E5zgTEEJ92rRpNnrFMXny5NAmoZo1a9q/HBf/jAHzzIIFC6xDVQiQxi
7yDjTN1q1bm2effbaIaWPRokXmoYceMscee6y1KXfq1Ml88skn5umnny5WK2UbImXGjBlT7DYITL7zjTfeKPI5mnZx+MdjHN7z4CAk0/0GBLZf6+e3pMowJWQy+aGSCgQ3Zpc77rijiNY9ceJEa95xkTZCSGMXecn1119vTRcI8UsuucTayAl3XLt2bSIWe8CAATaEjxBCwh2JE//zzz9tuOOECROsY5WY9vvvv9/079/fzJkzxzoviRvHEcm47du3N7vssosd484777TCGC0f+/jixYtT7huaOCGOXbt2tWGXL730knVcXnXVVXa2AQjZW2+91YYbYn5hrLFjx5ratWubTz/9tMh47Df7w/bEpdeqVcuOmwxjDxo0yIY7Mi6mHrR3HkBHHHFEEUepKHByHZYjRHF8+OGHXps2bbyKFSt6FSpU8Jo3b+698847Rbb5448/vN69e3t77723DQWsXr26DR38/fffi4QdXn311V6tWrW80qVL25C/zp07FwmlJEyRMEC+Z9ddd/Uuvvhi7/PPP08Z7kh4Iv9L2CHbV61a1bv22muLhCDCxIkTvQMOOMArW7asV7duXTsO2yXfdl999ZUN4Sxfvrz9zIU+Joc7+sMbGY/fwnf37NnTW7p0aZFtCHesX7/+ZseUsWvWrBnwTIhtDdWKESIAZIUySyA5Soh8RTZ2IYSIGRLsQggRMyTYhRAiZsjGLoQQMUMauxBCxIzYxrGTHfjrr7+anXbaSd1nhBCxAAMLZafJd/AXgisYwY5QJ+1cCCHiBkXwKPJWcIIdTd0dANLPhRBiW4dy1SisTr4VnGB35heEugS7ECJObMm8LOepEELEDAl2IYSIGRLsQggRMyTYhRAiZkiwCyFEzJBgF0KImCHBLoQQMUOCXQghYkZsE5SKY98rp6a13Q83qTGwEGLbRBq7EELEDAl2IYSIGQVniskGMu8IIWKjsd900022GE2/fv0S69asWWN69eplqlSpYipWrGg6depkFi1aVOT/fvrpJ3PyySebChUqmD322MMMGDDAbNiwocg2r7/+ujnssMNM2bJlTe3atc2UKVNMIT0o0l2EECIywT537lxz1113mUMPPbTI+ssuu8w8//zz5vHHHzezZs2yddE7duyY+Hzjxo1WqK9bt86888475r777rNCe8iQIYltFixYYLdp3ry5+fjjj+2D44ILLjDTpk0Lu7tCCFEwhBLsf/31lzn77LPNPffcY3bdddfE+uXLl5uJEyeaW2+91bRo0cI0atTITJ482Qrw9957z24zffp088UXX5gHHnjANGzY0LRt29YMHz7cjB071gp7mDBhgqlVq5a55ZZbTL169Uzv3r1N586dzW233RbV7xZCiNgSSrBjakGjbtWqVZH1H3zwgVm/fn2R9XXr1jU1atQw7777rn3P30MOOcRUrVo1sU2bNm1sAfl58+Yltkkem23cGKlYu3atHcO/CCFEIRLYefrII4+YDz/80Jpiklm4cKEpU6aMqVSpUpH1CHE+c9v4hbr73H1W0jYI69WrV5vy5ctv9t0jRoww1113XdCfI4QQha2x02aub9++5sEHHzTlypUz+cSgQYOsKcgt7KsQQhQigQQ7ppbFixfbaJUddtjBLjhI77jjDvsarRo7+bJly4r8H1Ex1apVs6/5mxwl495vaRta3KXS1oHoGdcGT+3whBCFTCDB3rJlS/PZZ5/ZSBW3HH744daR6l6XLl3azJw5M/E/8+fPt+GNTZo0se/5yxg8IBwzZsywgviggw5KbOMfw23jxhBCCBGRjZ3O2AcffHCRdTvuuKONWXfru3fvbvr3728qV65shXWfPn2sQG7cuLH9vHXr1laAn3vuuWbkyJHWnj548GDrkEXrhh49epgxY8aYgQMHmm7duplXX33VPPbYY2bqVMVtCyHEVs88JSRx++23t4lJRKoQzTJu3LjE56VKlTIvvPCC6dmzpxX4PBi6du1qhg0bltiGUE
eEODHxo0ePNtWrVzf33nuvHUsIIUSWBTsZon5wqhKTzlIcNWvWNC+++GKJ4zZr1sx89NFHme6eEEIUHCoCJoQQMUOCXQghYoYEuxBCxAwJdiGEiBkS7EIIETPUaKNAUDMQIQoHaexCCBEzJNiFECJmSLALIUTMkGAXQoiYIcEuhBAxQ4JdCCFihsIdRWgUQilEfiKNXQghYoYEuxBCxAwJdiGEiBkS7EIIETMk2IUQImZIsAshRMyQYBdCiJghwS6EEDFDgl0IIWKGBLsQQsQMCXYhhIgZEuxCCBEzJNiFECJmSLALIUTMkGAXQoiYIcEuhBAxQ4JdCCFihjooibwh3Y5MoK5MQhSPNHYhhIgZEuxCCBEzJNiFECJmSLALIUTMkGAXQoiYIcEuhBAxQ4JdCCFihgS7EELEDAl2IYSIGRLsQggRMyTYhRCikAX7+PHjzaGHHmp23nlnuzRp0sS89NJLic/XrFljevXqZapUqWIqVqxoOnXqZBYtWlRkjJ9++smcfPLJpkKFCmaPPfYwAwYMMBs2bCiyzeuvv24OO+wwU7ZsWVO7dm0zZcqUTH+nEEIUDIEEe/Xq1c1NN91kPvjgA/P++++bFi1amPbt25t58+bZzy+77DLz/PPPm8cff9zMmjXL/Prrr6Zjx46J/9+4caMV6uvWrTPvvPOOue+++6zQHjJkSGKbBQsW2G2aN29uPv74Y9OvXz9zwQUXmGnTpkX5u4UQIrYEqu54yimnFHl/ww03WC3+vffes0J/4sSJ5qGHHrICHyZPnmzq1atnP2/cuLGZPn26+eKLL8wrr7xiqlataho2bGiGDx9urrjiCjN06FBTpkwZM2HCBFOrVi1zyy232DH4/7feesvcdtttpk2bNlH+diGEiCWhbexo34888oj5+++/rUkGLX79+vWmVatWiW3q1q1ratSoYd599137nr+HHHKIFeoOhPWKFSsSWj/b+Mdw27gximPt2rV2HP8ihBCFSGDB/tlnn1n7OfbvHj16mKefftocdNBBZuHChVbjrlSpUpHtEeJ8Bvz1C3X3ufuspG0Q1KtXry52v0aMGGF22WWXxLLPPvsE/WlCCFGYgr1OnTrW9j179mzTs2dP07VrV2teyTWDBg0yy5cvTyw///xzrndJCCG2jQ5KaOVEqkCjRo3M3LlzzejRo80ZZ5xhnaLLli0rorUTFVOtWjX7mr9z5swpMp6LmvFvkxxJw3uicMqXL1/sfjGDYBFCiEIn4zj2TZs2Wfs2Qr506dJm5syZic/mz59vwxuxwQN/MeUsXrw4sc2MGTOs0Mac47bxj+G2cWMIIYSIUGPH3NG2bVvrEF25cqWNgCHmnFBE7Nrdu3c3/fv3N5UrV7bCuk+fPlYgExEDrVu3tgL83HPPNSNHjrT29MGDB9vYd6dtY7cfM2aMGThwoOnWrZt59dVXzWOPPWamTk2/H6YQQhQygQQ7mnaXLl3Mb7/9ZgU5yUoI9RNOOMF+Tkji9ttvbxOT0OKJZhk3blzi/0uVKmVeeOEFa5tH4O+4447WRj9s2LDENoQ6IsSJicfEQxjlvffeq1BHIYTIhmAnTr0kypUrZ8aOHWuX4qhZs6Z58cUXSxynWbNm5qOPPgqya0IIIf4P1YoRQoiYIcEuhBAxQ4JdCCFihgS7EELEDAl2IYSIGRLsQggRMyTYhRAiZkiwCyFEzJBgF0KImCHBLoQQMUOCXQghYoYEuxBCxAwJdiGEiBkS7EIIETMk2IUQImZIsAshRMyQYBdCiJghwS6EEDFDgl0IIWKGBLsQQsQMCXYhhIgZEuxCCBEzJNiFECJmSLALIUTMkGAXQoiYIcEuhBAxQ4JdCCFihgS7EELEDAl2IYSIGRLsQggRMyTYhRAiZkiwCyFEzJBgF0KImCHBLoQQMUOCXQghYoYEuxBCxAwJdiGEiBkS7EIIETMk2IUQImZIsAshRMyQYBdCiJghwS6EEDFDgl0IIQpZsI8YMcIccc
QRZqeddjJ77LGH6dChg5k/f36RbdasWWN69eplqlSpYipWrGg6depkFi1aVGSbn376yZx88smmQoUKdpwBAwaYDRs2FNnm9ddfN4cddpgpW7asqV27tpkyZUomv1MIIQqGQIJ91qxZVmi/9957ZsaMGWb9+vWmdevW5u+//05sc9lll5nnn3/ePP7443b7X3/91XTs2DHx+caNG61QX7dunXnnnXfMfffdZ4X2kCFDEtssWLDAbtO8eXPz8ccfm379+pkLLrjATJs2LarfLYQQsWWHIBu//PLLRd4jkNG4P/jgA9O0aVOzfPlyM3HiRPPQQw+ZFi1a2G0mT55s6tWrZx8GjRs3NtOnTzdffPGFeeWVV0zVqlVNw4YNzfDhw80VV1xhhg4dasqUKWMmTJhgatWqZW655RY7Bv//1ltvmdtuu820adMm5b6tXbvWLo4VK1aEOR5CCFFYgj0ZBDlUrlzZ/kXAo8W3atUqsU3dunVNjRo1zLvvvmsFO38POeQQK9QdCOuePXuaefPmmX/84x92G/8Ybhs095LMRNddd10mP0fEkH2vnJrWdj/cdHLW90WIvHeebtq0yQraY445xhx88MF23cKFC63GXalSpSLbIsT5zG3jF+ruc/dZSdugha9evTrl/gwaNMg+aNzy888/h/1pQghRmBo7tvbPP//cmkjyAZysLEIIUeiE0th79+5tXnjhBfPaa6+Z6tWrJ9ZXq1bNOkWXLVtWZHuiYvjMbZMcJePeb2mbnXfe2ZQvXz7MLgshRMEQSLB7nmeF+tNPP21effVV6+D006hRI1O6dGkzc+bMxDrCIQlvbNKkiX3P388++8wsXrw4sQ0RNgjtgw46KLGNfwy3jRtDCCFERKYYzC9EvDz77LM2lt3ZxHfZZRerSfO3e/fupn///tahirDu06ePFcg4ToHwSAT4ueeea0aOHGnHGDx4sB3bmVJ69OhhxowZYwYOHGi6detmHyKPPfaYmTo1PUeYEEIUMoE09vHjx1vHZLNmzcyee+6ZWB599NHENoQktmvXziYmEQKJWeWpp55KfF6qVClrxuEvAv+cc84xXbp0McOGDUtsw0wAIY6W3qBBAxv2eO+99xYb6iiEECKkxo4pZkuUK1fOjB071i7FUbNmTfPiiy+WOA4Pj48++ijI7gkhhFCtGCGEiB8ZJSgJUYgo6UnkO9LYhRAiZkiwCyFEzJBgF0KImCHBLoQQMUOCXQghYoYEuxBCxAwJdiGEiBkS7EIIETMk2IUQImZIsAshRMyQYBdCiJghwS6EEDFDRcCE2EaKioEKi4l0kMYuhBAxQ4JdCCFihgS7EELEDAl2IYSIGRLsQggRMyTYhRAiZkiwCyFEzJBgF0KImCHBLoQQMUOCXQghYoYEuxBCxAzVihGigOvPqPZMPJHGLoQQMUOCXQghYoYEuxBCxAwJdiGEiBkS7EIIETMk2IUQImZIsAshRMyQYBdCiJghwS6EEDFDgl0IIWKGBLsQQsQMCXYhhIgZEuxCCBEzJNiFECJmSLALIUShC/Y33njDnHLKKWavvfYy2223nXnmmWeKfO55nhkyZIjZc889Tfny5U2rVq3MN998U2SbP//805x99tlm5513NpUqVTLdu3c3f/31V5FtPv30U3PccceZcuXKmX322ceMHDky7G8UQoiCIrBg//vvv02DBg3M2LFjU36OAL7jjjvMhAkTzOzZs82OO+5o2rRpY9asWZPYBqE+b948M2PGDPPCCy/Yh8VFF12U+HzFihWmdevWpmbNmuaDDz4wo0aNMkOHDjV333132N8phBAFQ+AOSm3btrVLKtDWb7/9djN48GDTvn17u+7+++83VatWtZr9mWeeab788kvz8ssvm7lz55rDDz/cbnPnnXeak046ydx88812JvDggw+adevWmUmTJpkyZcqY+vXrm48//tjceuutRR4AftauXWsX/8NBCCEKkUht7AsWLDALFy605hfHLrvsYo
466ijz7rvv2vf8xfzihDqw/fbbb281fLdN06ZNrVB3oPXPnz/fLF26NOV3jxgxwn6XWzDfCCFEIRJpz1OEOqCh++G9+4y/e+yxR9Gd2GEHU7ly5SLb1KpVa7Mx3Ge77rrrZt89aNAg079//yIau4S7ENGhPqrbDrFpZl22bFm7CCFEoROpKaZatWr276JFi4qs5737jL+LFy8u8vmGDRtspIx/m1Rj+L9DCCHEVtDYMZ8geGfOnGkaNmyYMIlgO+/Zs6d936RJE7Ns2TIb7dKoUSO77tVXXzWbNm2ytni3zdVXX23Wr19vSpcubdcRQVOnTp2UZhghRHxNOyDzTpY1duLNiVBhcQ5TXv/00082rr1fv37m+uuvN88995z57LPPTJcuXWykS4cOHez29erVMyeeeKK58MILzZw5c8zbb79tevfubSNm2A7OOuss6zglvp2wyEcffdSMHj26iA1dCCFERBr7+++/b5o3b55474Rt165dzZQpU8zAgQNtrDthiWjmxx57rA1vJNHIQTgjwrxly5Y2GqZTp0429t1BVMv06dNNr169rFa/22672aSn4kIdhRBCZCDYmzVrZuPViwOtfdiwYXYpDiJgHnrooRK/59BDDzVvvvlm0N0TQoiCR7VihBAiZkiwCyFEzJBgF0KImCHBLoQQMUOCXQghYoYEuxBCxAwJdiGEiBkS7EIIETMk2IUQImZIsAshRMyQYBdCiJgRm0YbQgihLk//H9LYhRAiZkiwCyFEzJBgF0KImCHBLoQQMUOCXQghYoYEuxBCxAwJdiGEiBkS7EIIETMk2IUQImZIsAshRMyQYBdCiJghwS6EEDFDgl0IIWKGBLsQQsQMCXYhhIgZEuxCCBEzJNiFECJmSLALIUTMkGAXQoiYIcEuhBAxQ4JdCCFihgS7EELEDAl2IYSIGRLsQggRMyTYhRAiZkiwCyFEzNgh1zsghBD5zL5XTk1rux9uOtnkC9LYhRAiZkiwCyFEzJBgF0KImJHXgn3s2LFm3333NeXKlTNHHXWUmTNnTq53SQgh8p68FeyPPvqo6d+/v7n22mvNhx9+aBo0aGDatGljFi9enOtdE0KIvCZvo2JuvfVWc+GFF5rzzz/fvp8wYYKZOnWqmTRpkrnyyis3237t2rV2cSxfvtz+XbFiRZHtNq1dldb3J/9fSUQ9ZrrjZWNM/e6tP6Z+d2H97kxw3+F5XskbennI2rVrvVKlSnlPP/10kfVdunTxTj311JT/c+211/JLtWjRosWL+/Lzzz+XKEPzUmP//fffzcaNG03VqlWLrOf9V199lfJ/Bg0aZE03jk2bNpk///zTVKlSxWy33XYlPgH32Wcf8/PPP5udd945kv3fFsbcFvYxG2NuC/uYjTG3hX3MxpgrtoF9DDImmvrKlSvNXnvtVeJ4eSnYw1C2bFm7+KlUqVLa/8/BjOokbUtjbgv7mI0xt4V9zMaY28I+ZmPMnbeBfUx3zF122WXbdJ7utttuplSpUmbRokVF1vO+WrVqOdsvIYTYFshLwV6mTBnTqFEjM3PmzCKmFd43adIkp/smhBD5Tt6aYrCXd+3a1Rx++OHmyCOPNLfffrv5+++/E1EyUYH5hpDKZDNO3MfcFvYxG2NuC/uYjTG3hX3Mxphlt4F9zMaY2+FBNXnKmDFjzKhRo8zChQtNw4YNzR133GETlYQQQmyjgl0IIURMbOxCCCHCI8EuhBAxQ4JdCCFihgS7EELEDAn2CNiwYYO5//77N0uoEkKUzH777Wf++OOPzdYvW7bMfibCoaiYiKhQoYL58ssvTc2aNU0hQ42fzz77zB6HXXfdNfQ43377rfnuu+9M06ZNTfny5W2NjJJq/mxNfvjhB/Prr7/a/IoddogmFeS0005L+ftYRz+C2rVrm7POOsvUqVPHxIntt9/ehjPvscceRdajJNWoUaNIxVaRPtLYI4Kb/OOPPzb5BIK1cuXKaS1h6devn5k4cWJCqB9//P
HmsMMOswWNXn/99cDjob21atXKHHjggeakk04yv/32m13fvXt3c/nll+dcM3z44YfNAQccYI499libPIdQigLqf7z66qu29wDCnOWjjz6y65gR0p+AngRvv/22iQPPPfecXWDatGmJ9yxPP/20GT58uG2yE5ZWrVqZKVOmRFZKd82aNSZKXn75ZfPWW28VaSpErg4P76VLl2Y8fsFq7Fw03bp1M+edd57VDDLlsccesxUmL7vsMlsOYccddyzy+aGHHprWOO5iT4dTTz21xM/vu+++tMciyzcM1atXN88884wVcvzt1auXee2118x///tfK5SCCqIuXbrYZir33nuvqVevnvnkk0+s4OXmJxt53rx5OdUM0ZjPOecc07t3b7s/7733nnn++eetRp0J9BhACJGUx766Mhp9+/Y1O+20k7nhhhtMjx497O/3C4R0BEjFihXtg8gJkHvuucccdNBB9nXQWVVUMwv3G1NRunRpe3/ecsstpl27diYMffv2tfckfRlOPvlke85QFBg7DBTm6tixozn77LNNy5YtS9z/dDjkkEPMf/7zH7tPzHCPOOIIez1x79StW9dMnjw5o/Hzsh771uC2227zGjRoYOu+t2rVynv44Ye9NWvWhB5vu+2222zZfvvtE38zGae4sfOBsmXLJmpDX3jhhV7fvn3t6++//97baaedAo9XtWpV7+OPP7avK1as6H333Xf2NX933HHHwOM9++yzduGY3X///Yn3LE899ZTXq1cv78ADD0x7vAoVKngLFixIvO/WrVvifHzwwQde3bp1Q52b3XbbzZs/f/5m61lXpUoV+/rTTz/1dtlll0DjHnzwwd7UqVMT/8/5GjRokNe4cWPvvPPOC7yfXbt2tftQs2ZNr2PHjnbZd999vUqVKnmnn366V6dOHfsdb731Vlrj8b+///67lw02btzoTZs2ze7zzjvv7O266672Gn399dcDj8W10rlzZ698+fJetWrV7HU+d+7c0PvGteyuI3pJdOrUyb7mGuIeyJSCFewODmSfPn3sjcWJ50ZnXVB++OGHEpd8YvXq1d7y5cuLLGGpUaOGvXk2bNjg7bPPPt4LL7xg13/++ef2Zg8Kwvzrr7/eTLBzE1WuXDnweMkPWf9SpkwZK9Sff/75tMerX7++98orrxRZN2fOHPugWLZsmW0OM2XKlMD7ybFijGRY544jxyXoMY1agFxxxRVez549rdB08Lp37972gbFp0ybvoosu8o455pgtjrVu3TqvRYsWifOd7Wv+scces8pcJkrRihUrvEmTJnknnHCCVQoPOOAA77rrrgs8DrJm3rx59jXH6q677rKvOVc8PDKl4AW7/yK7/fbbrbbBiecCmDhxor1Q48Bff/1lH1q77767/X3JS1gQFmhwaKoIeTfr4dihFQalbdu23uDBgxOCHc0fwfHPf/4zIZTCgGa4ZMkSL1NGjBjhtWvXzosap1zceuut3ptvvmkXXrPu0ksvtdvcc889aQnMbAqQqGcWjJdtwf7bb7/ZGXqjRo3sA/2oo46KZFyOa8OGDUPdP6eccorXpk0bb9iwYV7p0qW9X375xa5HSeJhkSl5W91xa7F+/XrrrMGmNWPGDNO4cWPrqPvll1/MVVddZV555RXz0EMPFWsPb9u2rbXbbck2viV7eHFQ0XLWrFnmp59+MuvWrSvy2aWXXpr2OAMHDrT2u/Hjx5tzzz3X2lf/97//mbvuusvcdNNNJixDhw41Bx98sO388s9//jNRnY56+ql6026JkSNHWhvm+++/b38v+41dmW5YmTgOFyxYYKIgzG9Kh9tuu812COP3u7BZ3uOzueKKK+z71q1bmxNPPDHQuMccc4y13fJ3zpw51gkLX3/9tfWPBAVHLl3McG77YR3Oc8DWnm4EE7ZvnO+ZXIOpWLFihXnyySftvYsTHz8N9nF+//7775+RE5V7nXHxX3COBgwYEHgcfCmXXHKJeeKJJ+w9uffee9v1L7
30UuBznBKvQGEqyvQRLQMt9vLLL/e+/PLLItt89tlnXrly5Yodg6f/okWLEq+jtod/+OGH1p6HfZBpH/vJeEyva9WqFWgszCSvvfaafY3t+5tvvrGvsTujJUc13Y0CTBrXX3+91dLZt6uvvtr79ddfM9aIR48evdn6O++8M+EXyBcyNY/5+fHHH+0M49BDD/XuvffexPp+/frZY5LrmQX3INc32jQmnMsuu6zIEpZy5cp5e+65p/2dmdjCHS+//LLtucy+YhJkX2fNmuXlKwUr2BG2TIWwu2GGKc58EcbBFBXHH3+8dfZginD25p9++slr2rSp9+STTwYai4cBNznsvffe3uzZs+1rTB1hnJIObOtMJ/faay/78HE2ccwpfkGSa9i/999/P+UDnuMRR9avX+/dd9991hQRFZxvHrooHE5x4fUNN9xgPwOusy01W3Y0a9as2KV58+ah9nHTpk3e3Xff7f39999eVGC2wjn8zDPPFCsvgvLtt99apeXMM89MKIgvvvii9U9lSsEK9nxzaKYCO+VXX32VeP3FF1/Y1++9956NPgjCIYcckogGaNmypZ2hAFpsJoINx9F+++3nPfDAA/bid4L9kUceCWVjxzHFwzYZ1oVxSjrwnbhZih/W8VmuWbhwoXfOOedYLZMHZFQ+EM5Jtq71KGcWUbJx40Zrt47Kds8D8o477oj0Acm9yLkhIg8nvrtv8OFk4kvyCt3G3rx5czN37lxTpUqVzRJWSLD5/vvvc2YPd2C7d/GyxF8zLrHdJLNg0w4CnaeICSeBCDvxKaecYu18+BhuvfVWExZKKdx9993WLk6ctYNkGuyuQRkxYoS1+yfD77/oootCx9sTX41NlPhzP9g08yF1nXwKzu8111xj9txzz8iybEmcI9EpGxnRUTdyxq8FYWz/frbffnubREZCGn8zhexi/Bxh/WSp4B68/vrrrf+DPAVHixYt7H2ZMV6B4rePJ2tOPEFzaQ93EFL14IMP2tcXXHCBd+SRR1rNGBMSrzMBLQ5zzieffJLRONgynUboD08kYiCMiQft2R8n7mBdSf6OLUGUDhrSkCFDrLbEcs0119i4dKbtuYZj99FHH0U+7qOPPmpnVPgS3nnnHXu+/UuuZxZo18z6uG/cGMxOMe/5QyqD8txzz3nHHnus9ZNFZRYllDUquDcwgybfN1znUcwgC05j90evkM2I9uvAq0/D7DCpzEQvoAVPmDDBjklGIho3Xn+y4MJw4403mpUrV9rXZB6SldmzZ0+rhUyaNMlkAhpcFFocGYxvvvnmZmPh7f/HP/4ReDw0808//XSzc8BsI3l2FQSyjMku5TiSrg58BxEJHNdcQwmGbCSBn3nmmZvNGJkNuNo7LpIlVzOLq6++OhEVQ+QOkFlLtBURKJyvMHTp0sWsWrXKzhzLlClj6w35IcoqCESwUNKCWUUmmeWOSpUq2XIZtWrVKrKe2ZWLkMkIr8CIOmElG/bwbJCtqBCcSfzem266yWq/o0aNsrMLjuX06dMDjzdw4ECb1fjqq69aZxzLzJkz7TrnF8iUxYsXeytXrvTyCeKXW7dunXK2kglRJ85FPbNA80+VmMV1hcM7LFOmTClxyVVmuYNrmRkFdnsXpUa2LrOroUOHeplScII96oSVVIkWJBgQHgWEUCLw4hwV8sYbb1gnEOYnzB2EuiGowrB27VobfcANgwOMhSn/+eefbz/LFMxv7C8LAj5fIKOUhyFCAuFJYpF/yRfq1atnzY5RgdkhVcITSlImpreoifoBybWMArTDDjskrnXOPWYuF12UCQVbBCxqSB5hmkoBpAsvvNCaE5j+UgyLam2zZ89Oaxwct5iDKM6EKaOkqS6VANOFpJHPP/98s2JVlMclwSjq6nWZQgIN5hem0BRMytRshEmL6TTVGSmu5ZKozjjjDJus5TfJ5YItFWwL6zQGrkFMhCRpvfvuu/ZY3n777dYM0L59+0BjTZ8+3RbnwsGdSfVFx1
FHHWWXO+64o8j6Pn362OAGTJph+e6772ziIX9Hjx5tzXw4yyn6Vr9+fZMPEARBEbC//vrL3u9ROHuhoGzsyRdPSQSNYonKHs6N5rI3O3ToYKIi36NCkiGzMTm7MRMuuOACa7+cOnWqadKkiV2HkMP/cfHFF5tHHnnE5JJMBHdJ4EMYMmSILa/Mdels6th4Ee5BBTsPQmzXZG/SgyC5WmJQ2zWZtlRfJMPbf14QeC+++KIJy6xZs2xWOHb7N954w/52BDvKAjZ9fEC5fED6fSssUVNQGnuyo6I40JLDhDtGCTcgKfQ4ZbgJM4WHC0Kd9GdCqoCZAdoXFyezjHRhNpGu0yydG52QLxyaOKR4XRJhQzMZG2e5K1/rwPFLCjehqlsb0t5dyOCW6oaHDS3EuY3SgZJAWJ0rg8zsrVmzZub333/P+cyCpiXMmlx4LCG9zK722msvE5YmTZrYEhcunND9bsoqUH7XhVaGfUBy/BiPmu8cE8p1BKFTp042FNWVi/A/6JipPP744yYTCkpjj6peSElQS3z+/Pn2NXWVd99991DjYCbAvENXpigEe5RRITwIogRNmnh697o4MonAIKImlbmFdZl0esoEvpfICDRJznGq3xc2esV/zaeKTmJWGOZhlo2ZBQI8bPRLcXz22WcpazxxrIM+zODOO++0dex5QPrr2tCH4N///nfg8ZhFEPmTDLMMlK1MKSjBnk2cDZcpvbsJM7XhYvtm5pDuTGNLYBpiWbJkibVd04AhDFHf3H5tJ6jmky6DBw+22hvT6WrVqtl1NN5gBkPoXi6gEYnrXsXrbLT+49qhs1eyjwKzHJpxPsws8EFhHkGJcbMMEuoy6exVKeJwwqgfkNjUCcNMBtNWJF2fvAKGehZjx461NaYzLT5EJIeLhnGp1rwm1PGMM84ItX8vvfSSLQtK+CVFsKKqoZ7Pdd7/+9//Rlrjw8FxJNqE6IP999/fLrxm3T/+8Y8iS5ygIBdRT5R4ICmGhjLUenGv04FoDX+xu1Rln8OG/VFIi+QkitSddtppdqH8M+syKbJ1ecThhEQDEYKZnFBEqYEw18wRRxyRso47ZbAPO+wwL1MKysbuB/syKcLYybDtoR3TpJjDQWQKGlSubbj+9lt+bS7d6Xm2Imz88LuwE9KGLFVP0aAmBExXq1evtueG5K42bdrYmU+mXHfddWlve+2115qtDU52ysqyRBUZ4XjwwQfttJ/oEGf64HhQnjpdRyROSFLreV0SlKwIAhFP2MMxCbrzzDXD7Pedd96xJpUwrFu3zrZpxAbOeOw7f4laY13Qa4pWjRxDzCQcN95zPCmBwWuXCJYutFPE1s/++H1eRG1hX880cKJgBTuOC+xZXODOuYL9jRsLQYzJIgiEUBFxwYXqh7BH+hoGddZApjcRvw1TA9ELWxJsYYWZ63GK3T5VnXeOZ9B635gJuMCfffZZu+84wRjn6KOPNnGFeuzYhHnA8kDmoYYZz5mNooBoFkwAyX1fcwkmQUxFyT1S8VPR3JmHfL6EEz6Y4QMyGeQFjm1+P8eBQAnuw6APx5R4BQrTKcpmuuQQVyqTfptkOQaFzjQk6fgrwPGabMIJEyZ4uYSEB6a1S5cujXzsbNZ5xyRDbZyTTjrJJu8wjQ4L5Y79pWQpW0zGresolC+QrEM9G8x6JK9QL4jSu/kE1xEJaJjN2Df/EpSjjz46ZQ0W1kXV6cjdA2TM/vnnn16mcF2mqjMVpFokZph0SxuHoWAFO/0eXdo/9jOX1oxgD1O8Kls2XDIkzz77bK9JkyaJ9lkITRocBM3wc0WHoiRbdd4dZAdT9oBeo5mUr8XeynEDZ3flmJIxHKZn5dbg3XffDd16LVtFuyiuxbHDpk4pCZQit4TJkMX2j02dUhSucQevyQzns7AFy/r27ZvoB4BQJxvaFeVzikgQVq1aVcT3Q7Yp7fbCZlj7e9Fmg4KNiqEFHsWGiAzAVEKBH6ZsTz31lP0sKFEmEzlo7Y
V5AzMEU3TCFWH58uV2ChckgSPqCBsHPgoiBjBFEd6JrR0zFzbEsGGamAxoV8jUF7sjCRz/+te/QiWVOIg7Zr+AfcRkRp4AmZSUGyZGOV8g1hqzDG3ciJDAFBWWqIt2cZ8QOsv1h5ksUzivQAvEVJ+FLVj2xBNPWHMWcC1y7eNLIyqKwmNB2yySgIRNnGuF0t5cS0S1EDpJbkVQ0y1lrjG1RpG9mxKvQMGr7bQAOiVdfPHFthlFx44d86YJB9qam976PfHU6gjaYT5bETa0RXPFxWbMmGHre7iG4DQHDwoRRGgz1J2h+TalZqPAryHRSJiiZcBsIx9qkiSbYDDhce4zLVYWddEu6h6563Br1GAJW4+lbNmyCVMHXchcoTtmksw4gkILTWeuJdKIVoOUFaYBDI3cgzJ+/Hhb5pvonYceeshaDPxLphSsYM+W7ZGTfuWVV3p//PFHosiWM6EEhYJaThj5BTt/g9ZsTq5Kl2mYWrbqvJ911lne1KlTIymE5If69YS1YtpCkGNyc+aOfGiNx3lgH3kYYj7J16JdhCNS4z3fqVGjhjWTcB3hB3rhhRfseoQzZqMw96IzOdKL14VM4rvhs6Bko0eyn4I1xUQN0S+tWrWyiUiETZKiT4IFph2mwnQaCgoRERTpSp6uYUIKWt8l6sQfamUQ3tiuXbvEOn4jXn1CIDFNka3n6t6kA9mnJA0RuRBFiKOf//znP+a0004zo0aNsglW1Ol29fmdiSaXEAUSdZijyxKmW08mRbv8PQyo60Kk1RdffGHNWcm1YsJ0GeK3c624BCXMoxQBS46UCcL5559vTj/99IT5iXsTKMaHyTBMraVnnnnGXkOENdN/wWWah0nKcoXosoZXQODccaV6nbOnuCUo9BEdMGDAZtr122+/HSrKBm688UbvoIMOsjXdmT7iWCJKBDMFiRG55MQTT0yYM+DTTz+1JgRKkWKeYZpJskUm5Y+jBu0tOSqCGVEmEQ5RQ2llok1YmO3lQzngkrTLTDXNJ554wl439Md1yYE4tVnHZ5nw+OOP22vRH31CLXaXaBR0LFdal+g3/z3KvZBvFFQcO8V6SCRAiyRJoSRHUtC0eTR1HJxUvfMXHfrxxx+t5hGmLC6nBicVSRA4FIF9pzaFq/cSBJKl0NxwJJEEQWo1ziQcqsmJVVsCTQinFLUyAIcUziBmE8D4aO9odkFAE+I3+utxFAJofsStcwyd0xknHb15KVMRtuZQNssBRwH3C8EBw4YNK7Kea+eBBx5IxIxnwi+//GJjzv0Jf2FgNkmZAmZ7biwc3WjsYWYBnOubb765SCkFZkPHHXecyZhcP1niAlq0s2X6NXa6CFWvXj2jsSnKTw9RQgnDOtPQfrAFolFjn3f7RyhhmHhzxsC+6CCcjFR1vybMcQhK7969bTp5o0aNvIsuuijjUg/ZCvuLGkpSHH744YkQXOCcs+7MM8/08plM8iO4Jl3ugx9mbWFs16nYaaedInX44uwk4CITmJExK+G8E3zAwmtmBa7PcSYUrGDHQee6HPnB4fLiiy8GHq979+5ehw4dvHXr1lmBhvcdZwtx65m0nsvHCBvnnHK1PHjwcBO+8sorRUwzYUxazZo1K3Zp3ry5Fxamy5i1xo0bZ5NfmI77l1zDw2zOnDmbredhTrx4vtTywfxGfLmjc+fO1gxDhy7nkA4CSsWkSZM2W886IoOioKLves+XBwWRNJiJkrnllltCRdkkU7DOUxxKqab7ODX4jHIDQaCGROfOnW26NmnQpAUzdaMORtiSpDhqUpmLWEdHJBw61JpIx8mEg6pp06YpTUhM+YNC7D/HCackTiVimv1TSJzJTLODkq3qjpiIMEWRpp6PcN0lOyKBdZk42qKu5UOjCfILYMaMGbZBBiUgGB8zAnkBQcDZyv598MEHifwRuiZhyiNd3++4DeOYzQZRWK8xh55yyimbrec3XnXVVRmPX7AaOyFvqTK/WJdJj1IqyFEx8j//+Y
+N686Erl27Wm0N5yvx9Sxk5OEQY9pG5UhMInznlqhVq1Zif/waDFo8IXFBwQl93HHHWW0NDeapp54q8nmLFi28q666ygsL03NmVGT8waZNm7x8CvuLmlNPPdVr2rSp97///S+xjjDZ448/3s4Ew3LJJZfY3+5McWjCw4cPt+ZBHPFh7htngrv00kutuczF4YcJI8ymY9bv4IyynEYUMwAy01OVGiG+vXbt2l6mFKxgx/wwc+bMzdYj/LCXBwHzC3bbzz77LMI99Gzcdc+ePW0ihIPX2KEHDRpkhR03FvbtXEXYLFu2LGXMOXH8YZpP//777/ah4G5mdwPRzLp///6h9xMTG1P7bKZxZwLCEnMZNlZq4rDwGlNeJjVFoq7lg4+CSC848MADbYKOaz4dJvEn26xdu9buG/VZooJ7Z82aNRmNgUmQaKUePXrYc8FCkiSKWhS1pQpWsCMQyTR1hcCAi56MMuzlQUEjDmNj3FLoX6oO7qwjE87ZstOxwfIQcHW4nRaE9jV48GAvnzj33HO9Nm3aWGHm14zQ3nkw5TLsL9twjnC286BlyXTGl41aPmQDM4Mk5I9r0Dnzqe2eT7Xs//77b69bt25W4WJx1xFK0YgRI0KNycOBc4LgXbFihV3HDCtsQAOzXJSyypUr24XXUfl7CtbGTm9ByvMSplS9evVEWBR2YkKQgkK4H7Yxwgcz6fySXMKW+hbJTZ1Z52yj2NrTqf/BNuwjdlCSnihjSnhV2C5K2QIbLQkg7pw4SN4hdDQsUbfzywacoxNOOMEu+VrLh/LCJDpRDpd7yF0/hAFSQz0oyWGOyYSt4TNo0CAbcvz666/b+9xBohKld/EPBYFrj3FINqRmE+eIsGZ8TLzH9xDGh8aSDQpWsOM0pJA/DiAuAFcPOZWDMR3GjBljBSbxsrQho/FGpo0sKABGrWceGEcccYRdR6NbYttdn1JiYevXr5/2mBQuQqDnKzj7UhWXoil2kCzWfIvXTgfOLc5jYtqTHaZhm3iTgcn1jTMfYYbDjmuVLN8wY+LMTdXj02ViBoVib37YLx5ENMbA+R5WsD/zzDO2iBoOWb/iw70SJja+b9++NmeDY0n/XAeCOUgjeB6wtOdzY3AuuJfDNisvjoIV7MAJp2E0whyhkUnlu2xUd0Q7qlq1qtWMFi1aZNfxnpvIdTdn//0aiR+q0aULpQ/yAWZMlCZwCVicE4Qcx4BknSBku1dnlPCwpi8rEU6cY/+1mMl16Re4aKvM9ohAIaIKRSYdiEwhSgyh7o9SSUXQyJVUzcs5V1SlzESbXbJkScqGIigOYY4nEVUogsl9Spm90FgmXbAK+CORUNqIMIv6+iuozFM/CAvCEJlCITS//vprO22lvCknK2xXlGzhBFOQCwBtLV0mT55s8gHK61LS1LUnRFDMmzfPauyUWg0SQkm9GUwE3OBkCqa6ocOUhM0GCHOm9Qi0bEH2M6a7oHDsCN11x7E4ojyOlNBmdkHdpTA0bdrUljum5gwmE8JvybDm/TfffGNDNINAe0muP2a7/sxywmg7deqUULyCHEvwjxUlBauxX3/99TbdGk3QP5Wibjn22EwEO7ZGbIe77bZbRHsbTqPMF2EdBI4/D1mmqFz0+AKYedCCjzIGQeDB4Pwd2YqPjwpuePqKRg2CltlAJgqM3yyU9eJV/wc9B1jCcuONN9pZBiUt8FWNHj3avkbr3lLLyVQwM0Yu3H333YmHGNcmpQ/QuPONgtXYmYpSNwXt0P/UZKpKUtHSpUszEsL0MYziKUzDABxeOG1o0BvWbo/dkgs8uYIg2gtT7KwV/BdpgYLx66+/Ru7kRcFAgeEvCgwzIq5L7M98F1U6twQPRx4IKCo02UBIcs9EwR133FHkPeKIWRZBCPgFaDgSlu+++84mIXJvI4SZBWLCTO5LnK4Jhcbq7B/3DPZ2/nJM3njjjbT7yPIAR6l0Tmf2h4CGZCXw0ksvNZlQsIIdZylCHEenX7DzVC
dqgAshLFFNr7joiWRheo6mgGmFixUnGxpskIxWbhJuymQnIoWW6LJO9EA+wBSZi94VJaM59j333GOnwLxmSpwp3Nh0n6IzU76AJkxJXAQovzU5CzWsDyQKBYbzgSmD/8O8hSkhbFGyZJI7eiH4GLtFixY2siWqB0gUoBjxQPQ/KChghixJFxSoLdn4+ZzM1EwoWFMMNw8OEQR7soZMN/N8YNy4cVag0yKMapS0D+PmIlIAm3NQJ1WqqT5RA7179zb5AtoLtmZnZ+3fv79tx4YphddRmJew2xJ9kU+gofEbcRATMZFpCzsHjj2Ee6oHSbrHgAcAwQGNGjWyGiv7WpwwmzRpUqD9YyaZDVq1amVb42HGi8oxSaQOgpwlLGF9BkEpWMGOcER75cLnIkcjop4KERkvvPBCRmOvXLkykn3E/HL00Ufb19xIblzCIBHI2KHTBUGRar+wY+bacZh8o7twTHq+4kDDXorZKR9tmVGBuYTfi9aebwoMszoitJgtch1xzYQpQ51ukAC+EaKDaLgRlvr161uNH38XxxQhz/WTqh5PuucHc4k7PyhZKF0c34cffniz45uLssJF8AoY2qSRQUdaPXU0yPwK23WcjMZUDRtIkQ9b54JsVlffhDK2LtWYfQyaLdmuXTvb0suf/s/rTp065VWjAH4X5WqB83HXXXfZ15QCiKqMK6n09H3NJ6iW+eWXX0Y+LpmMZCZTlZEaSKNGjbKlm8nCJcs1KNQq4pqOCq5JSkcDdYHo+UophSgabWzcuNHeK9Rconom1xb9T19//fXAY1E+wZUgoQ8v1yLXJv1zaReYT2WFoaAFe5SQop9KsJNyHLZZMqUNXG/FMWPG2IuJBxHp8aRLBwFhSQo4xYfOO+88u/Cah1rUNW4ygRuFkgLDhg2zN7jrF8sNyk0fVyjORWE3UuHzWYHxQ9kHfx2jsDWbXCkO6pBTAItjQC0VaudExerVq21dmwYNGoRStPw9TwcOHGhLX7geqpT+yKeywlCwphjH+++/X6SDCXbEMF59pqg4If0p+pg48JiH6a4CTPVceBnOUmyvhGsR233xxRcHGovfhgMM843LtCXjDft6VCUQooD9Y/qMqWD8+PG2yxO89NJLxSZi5bK3ZlRwHWHqIJ4dB1uyySBM5rI/6YsM66jhmso0+guzjrv+cJwTE07mseutGgULFy60XagwKXEPhOlxy31N2WMyRyl7gb8HyAugTHe+UbCCHbsWTkmSDvytyLBpcxEk1yopDmyPwOyHWGF/E2ay1LhJw9SRAGxufrsbbf1YwoIdD3t1PsONk8rH4Y5zWLBfc+wIU8MZ6Op+EzfP+Uag5JJsZC5nmygC6ohMIuQS4Y5g51wA0Tphkqn8tvonn3zShksS8cXDB6cnUS1h+gRQG+aCCy6wfgkil5y/h+S5TEOFyT6NXLnyChSm+0cddZQt6engNY10+Swdnn322URpWjr8JDdKjgLqSDNtppUWtdP9S1gOPvjgIm3t8pWTTjopMls4ZXCvueaazdYPGTLEfhYnttSoPaqqllGYEOhdgD2dfaayqjPtUN2Seyos5cqVsyWG+/Xr582dO9eL4j6ksiV181966aUi14+/JWS+UNBx7Jg1kiMDqKPB1NU1jy4Jf0yvP309KqjAh5ZBzCwhW8n1Q4KGPDqylcYcNVHuJ9N7puHJoX8kmdCcOJ3zvbXINHN5Sw2soyqORpP1nj17hqoSmXzPEQGGVuxMmVOnTrXjhs3GnTFjho3djzTSJEOc+SYdwhZ9M4VuimEKmCqOF7s4Jot0QKAznSckz9UciRLit0kqwnySquKhSJ9mzZrZsL9kwU6tj0i6wkcItmAqKIYV7FurkiXhhFGAX4sFsyimMgryZRr2eUKEpY8BX1lJpFMVNrngGX4Tkp6cjwcTDwpiUD9fKgpWsI8aNco6zshm5GJyjlTKc6Zbj71Hjx6mffv2VqCzVKtWrdhtw8SKE2NPMkjUQh1BFiRbbm
vBRY5NlNRtnIjEBoeNO96We2tGOYkmwxZhwTH1gwOQazJob9/itE5/H17uiTA2Y/YlE2csmaAzZ8602cnMxEtStII6o1EMkvGPn8797a9XhEbOjJTZlcumxq9AdnkUikbBmmI4mEy/ESZklIF7nVxLvSSTB6nZ1GFHGJAVWdy0lIs9KGTN4fA7/fTTTaHAQ4yolUwSPlKR7pQ8Hyo9RmmCojQv9VKSk7twVPKg43uCQGYsQpFjlKxpEv1F5BHHkJlQ0Lr/mf5uHtBE0lSoUME20yhJsFO8KwjJBcmY7aOBU0yN0h6YfYJAtBcP1+ReCtTyoeAYdYMyoWA19qiKLXExs3ChUCY0Su3ahXxRv4b6Jsnaa1DNkouFGy5VI4dMiw5FBaFoaG1RC/atVZUwCqLKXHY+hFQClmsWhSQoThtHiXGp+gg9Ikao70OhsbPOOsvWgacT1tbEL6yHDh0aeWOeVOYeIt+YxTATDBq1Q834ZFgXyfnPtfdWFE+UXdsnT55ssw2JZKBnJRmEbiHDNV949NFHbZQK2Yhk+H3yySdFlriSjczlqJu2w1577ZXIDPZDog6fwQcffJDoyRsEEpT++usvLwpq1aqVMkOW6JYor3eyhcP0jiXBiXvvySeftIleLGTasm9dunTJeL8KVrBz8dEI2p963b59e2/QoEGJEMagPP744zZFmjBKGvv6l1xTvXp1G5aVaaZgtinuIRbmYZYMqeSUViDjloUsV7Iy45q5nI2m7Qix1157bbP1rENpAEIgSZPPx+O5cOFCm9EclGQFg2xZwh6PP/54m8kbFLJre/bs6ZUtW9Ze1ywoXqyL4uFWsKYYMjfpAYmJgxKZZ5xxhrVp40zD9h7UVOMvsfvss89uVmI31/CbsNfnU/jX1qz2R6QJ54Rz7MxORGFgG6VyJuaDXJDNzOVsNG3HFEOk1i233FKkDy9RPC7Jas6cOZs1YC8OWtXhA8DpmcpEGLR8rd8JPm3atCImFI4n35NcKjgdGjZsaM9RsksSR3zQipaAyZbqrQRxuB6sJE4l+/fCUrDOU044TiAOJmViqSjHhcDNjgCkC3sQnJ2dbFa/E8iV2A1SiTH5wqfjS6pGG0Hs4lSjwzYatDt7XKB8wEUXXbRZ02WiE6j37soMbG2ckPnxxx+t4E2VuUxM+1FHHRX6O7jFo2raTk4Fx5AqqAQbAAEHhFiSHYxgwkfihOGW4H7h+qZiKR2ykh2eRKkFwSku26UQwq6hDA+ldu3aBRqX85P8PYQ7Z5IdC/g5EOycD85NZGHTXoHCVPHrr7+2rymQdPvtt9vXFPoJM/WlSNAPP/xgX2O7dIWN+I7KlSuH2kcqO1arVs1WpitVqpQdlykm0+GgdkIqOVLFkalj7969vcsuu6zIkk/cf//93tFHH20zB90xve2226y5LCxMczFBJMM6psO5hixL7L/ZJIqiXY6VK1cmzBK8DguVJ9966y0vavbdd19vyZIlXr6C/b9FixYJE6PL4D3//PO9/v37Zzx+fs/Lswix67SoogUXGoNLiMAUQAx1UIhhd2GR1DshRtqNF3ZShGZE8hPxrTzNGRPNgQSGoNNosgSZkdD3kgYWhGq5xWlY+QCFv4gyIDyP2j0u9JAw0kwimUhIYxqezCuvvJLzTkqEzjEjI3M5mxAdE1WjB0xGaP4sfvNRmLDjbBShW7BgQaQ9hwE5wf1IrD4LUWkkvYW9t5lBcN79kXSYhIM22k6JV6CgaVAzBW3YlcYFtNl//etfOS2x69dmXC0bXn/xxRf29XvvvefVqVMn0FjsB5Ex+U69evW8p59+erNaJJQWDhNp4aAMLFp7jx497IyA5eKLL7bauqtzn0uIKHHnN1tEUdsFx97gwYNtTSUc0Mwc/UtQqIHUuXPnyMsV9+nTxxs9evRm64m26tu3b6j9pKYNpZUZl4XXOGKJ5smkXLH/vPA3TJRNMgXrPEXTQHNNBmeG386ZixK7Dp7ozmZIDRqe7t
iK8Q8E9QGQph227sbWBE0rVWcf9h9/Q1ioacKsCvsqzcGBY0m1vzDJY1HDNYOvBweqS5jLR4hXL8kmng7JWaHYmaMuV/zkk08WcaQ6qN6Kszbo7I8kJBzRfh8NPi58NMOHDw/sfOdaTpXzwqyfaz1T8vcKylHRpTDOEJxI1HMhWsBFHmRaYtfdAEQcHHDAAbYZNY7Y33//3ZqPKDcbBJxQ1CJP7gqfb+BMTJWgxPQ0k1ZpcNppp9klH+E8YyoiG5FIreToiLDNrKMuD0tdfAp0ZaIkbI0SxX/88UfKpCKSqriHgkJ0DmaYZFDcOK5BISoJBzQPBeBBh2LIw4Ps3kyRYI+g6BIaFieExhVRwsPCZaGhMTA+mieCPmiIFSFoRP5Q65w05mTNKArBEQXY19Fe6amJb4L9pqckPgK02ShgPG7IqELLogAfQrZrwlO0C5+Fe3C6GiVb2yYeNJ0/DLVr17bKQHKjdh5MYUoWOB9NchG5sD4a5AWhttSnItqNqDVqu6OxE5mXKRLsERVd4iQxRc206L4fV5zMmWIycaogOIjhzneY6uMoHjx4sI29Z4pLtc3Ro0dnPANyYBojfDCfyhaTop8N+vXrZ2cA3bt3t0KdmR8mQswAPORTFbcqCTRMZo4Ur4qifAYzFTTV5HDO2bNnW5Oo/x4IqiD07t3bpui3aNHCrkMwY4oL44Sn0iqmFx6KrsE8ApgcCK7NoDDjpsYOYdCERxNGyv2JUoOJK2O8AoLGGOvWrcuKU2n8+PE2NPHyyy/3HnroIftd/kUEB4daquzBTMlGj8moWLx4sffmm2/ahdeZsvfeeycaTeCUJoR0/vz51gFKSGlQ6ENKqDDHkOCDTDOsjzjiCJuxnQyp9kceeaSXCePGjbO/32Uw49zNpEHNU089ZbNMCV9m4XUmIbjZpKASlLLZGKOkjM6wFQOxE6IdUe4zVVZe2EYbhU4+NhrBmUYZaeyu7jxzjWJ+wzcSVjvGZ4RzEt8PCVqMg8aKk5oGIxSjClpBMUozC6GSNEBJPhfsHwEOURTEWrJkiZ0FZhKWmQ0wN/LbU93bmZaOLihTTDYbY2SjeiCRB9yUTKOJGgi6v1uqSR1Vs+QoIc4ef4dLMU/WO6IoqYud1TXJzhcwHWDKo2uWc0xSiZPpP2YA4vvDwHVDdVCm95jy3DiYucJEf0VtHycChHOeLNhRujKNDtqwYYPtd0pmp4taocIpDtSwQh57eCpBTO5KEDgXPLRTOXIjKR3tFRDXXnttItNrS0u6UDmP2Ovly5dv9tmyZcu8gw46KHShKaa7LtY1DMTVu+XKK6+0MfuNGzdOZJsSi8w6PssXyI7lmDGNxnTAVNe/ZML69ettVUPi1lesWJEospVJ5mRUEKOfqrjWq6++6u22224ZXfPkQNStW9erUaOGt2bNGrt+4sSJ9lrINWeeeabNhuZecZCByzoK6oXlhx9+sL+5QoUKNmvbmd4uvfRSm78QFDLIjz322M3kRNjidLVr1/YuueQSW5QsGxSUxk6NZhxw6TTGSBemtdSgdrWp/RBuhaOOWNcwXVGoP7N69erQ++bXrnBKov258Cr/NkFj4rMJWirZfOnUGQkCGbsUwyIXYO3atbaWNiYZYsd5P2HCBJNL0KBTZTxjKsykHyvXPM5Tfjf9AlyMNNp6unWDiILB0UfUGFExJc0Cg5oHyaCmTgpROi5/AQclx4Kw3rD07dvXOl4xuZFT4iDclfs1KBT3YwaBwzls/L4fZinM0sJkuaeFV6CgxUaR7YYWVFLGIPWa99lnn1Bjz5kzx9aToNwstSWYFfiXIKCZu9o4fljHZ/kCsx9q5EQNJZnPOeccW5LZ7zxFS0Z7yjWcZzTU1atXJ9atWrXKrmvZsmWoMQkUYNxU5z0IU6ZMSWj6vC5pCZvNetddd1kNluADHJypghyCULly5UTWtv98L1iwwGaFBwXNn3s5KqgJc++993rZoqA09lTaLI4V2nkBrb6www
d98pbUl5OnfKpOKenAbALnlgvXcjj/QBA7HM4jwrOIgffDukwr1EUJMyA0ybvuuivS0FFmAYT5UTHRD99Bb9lcQ8gcfUlxcuLUBLRNzk3YTkRclzjnMsXfHDsbjbLJJ8CxGyWbNm1KeX9QspiZWpg6O2ESm4qDMEdmUFyXqbqjZdrRrGAFO9Nb4lyZ7rkLIEwUAk44+hQmJy44uLHCxqWeffbZ9oTT4DmM8zQ5npnkJpyktJ9zscITJ060kTe5JHl6T4QI5ZQ5B8kXfNhIoKhv9Kghrpk2dg8++KDto+tK2nINZNJ4/JxzzrHnmDT6KMGBmMqJSCRLULgHeZCT3fnuu+9aswwlgHGohi330Lp1a6skUOoDuL6IFUehS+7/Whz+iCFMdiQRkTSYShCnMsVuKUmOLGMe3Dh4/dc/rzMV7AUV7ugH2zdZYzw5k6MQsL+mG4VAiBonhkSLZM0X+zhClBThMKn8CDaqL7qmwZlCjRQ0Q1d7HC0EWySp+kFLFEQJyS7pElZjpGoePg9udAQ5D1xmZwgOIhqylSCUa1wIJTM1qoImZ9vi/wkCvT05B1xDyaIjTDQH9xmKBYoH1VbJvkSgk/jDdUGobxh++eUXOwNiH3lgYm/nL34CmpekE+ZMCLNf4KaKpAszewbqFiFrmJ1mo/lNwQp2TvATTzyxWeYdF9Lpp5+etvkEU8xhhx1mtX1mAE4Io3WNHTvWnnC05DBOEpxKXPStWrUyUYM2gtaANsfNGkUYYT4TxY2ebTAJMlt0D14euFxTmXRQKqnuCAKJMhNBwEzEbOqKK65IOYsM2oQc5QItmPox/vwCZsHcm5mYPzZs2GCLvDEm2jr3aZAZEOGn6UJGbxBwSKMMciyzgleg4EBJ5fSkKS+OkqChVW3btk2EP7kQKNZ9//33offxscces6F/lNt9//33I2nsPGvWLNssl9KgBxxwgHfFFVdYJ23cmzq7cEfKrw4YMMD2lrznnnusgzIfoJExZWGTw1FZx2f5Ao7IVA1LwkJTG9dMxe/kxOGbSa/XWbNm2fOdDOv4LNf069fPu+GGG7I2fsEK9mxEIfz5559WSM6ePdu+zpfGzr/99ps3YsQIG/2xxx572JrzCIxU3ebj2tQ539lvv/28a665ZrP1Q4YMsZ9lCsL45ZdfTjzINm3aFDq6KMoHDVFQLj/BL9jvuOOOjJrAb59FBYFSCj/99FPG9eLJL2jatGlWOpoVrPMUxwpxzVFGIeAEdA1+86WxM1m2mBroEOV+M2ajXMdtb+2mzoDppbjyDLl2IJNpmao6KM5PegSEhbIUmBb53RxbjgGmDrKZuV4pihUEzg02dkwl+GWSnYjppsJTJpsM42xV8/SKySzneGRa1ZMuVHS9ygR6Qbi4fY6lH/U8zRDi2O+++27bY5Aln6bmUUHWHRpAcixzvmns9KhkQWMn7t+9ZznwwAO91q1b285RYeE8cyzoXNOgQQNbzMotmWiGUYHZbtKkSZutZx2/PSznnnuu16ZNG9vv1K8Ro71j5gvKc889ZzXN4maTYTTqBx54wM4m3TgU7gob433aaafZhfFPOumkxHuWU0891V5PHI9cFpGj/zDmoChm9cVRkBo7T1u0P7LIwmSh5QLCqcjIC1q4ikgfHKREROCMo/5MVOVvo8TNTnD2URsexxeaS1R9K4m4oKY9Tr98BE2XfcOR3bhxY7uOukaPP/64Lbzl7wYUpEAUIXXMQF0DGAdRMmTjhomyYRZxzTXXZJQ16Y/ZwKHJQggyTs5MHNmuuYbnedYZ63eUksPAsc30nieLPJMQVGbMhGPiJA9TEz8tvAJla/SYjJJMtQSy+6gPQqlR+jSi0dx+++2Jmin5ADVCcGpSN8XV4uB1r1697GeZQKnZfC3VC6k04Ey1YnfduNma/xqilC/ZmUFhjG+//dbLFH5LFGWJS8os/+uvvyIbD6d7lOM1atTIe+WVV7xsUbDhjoRYUf8i33tMZqPULGF1aP
EkhixbtszG7afqD7k1IfGoSZMmNgsU7c21waMyIQladKkhczSshoNNGf9Hjx49TCFBMg6zNWoEufh9QhKZteFnIOQ3CNjX0VipPZQJxG6jXW/JnpxpaeolGWaWO/g/8lKYLTFjIXQ2THVMf3VHOlpxXlLlFwRNeEqmYAU7xYAoDYuTLls9JqOErFEugqhME84pSZlY2uzlWrCToML5IGkseYpPDX2mrnSpIiMxDDjjSMbBiZyNFO5swEM30yJ1OOY4bsRwE7OOYPK3YAsaR405Cyd8pscRwc44qfqSRpGQtur/Msujqm+PaRBhjGP32Weftf9PSQCUENdRKQj+pKRUSVCZ5pUUrGA///zzS/w8rpmI+Qo1W0grRxNKBTcV2jYRCWGbZBcHNxLp7LmElHWOARmygNB48sknbTmKF198MRG5FYbly5fbDGt/ok7YFmxRHUcEGw/sbCWGXRxRZnlxD42nn37aziT5DvwX1HwPQknJT0TMJPdqDYxXYGzcuNG76aabbFuwww8/3Bs4cGDeRsIQ6zp69OjN1t95551e3759vThRpkwZG7lRHHxWtmxZL64QrfH222/b19OnT/cqVarkTZs2zevevbt3wgknePkAse8kE0VxvxQXZ57v9e0dS5Yssfdh/fr1M46LB3xdVLikVWAU4+W/cTlimEpSo5o0fTzbxE9jh8MckW+gsaUykTD1o6hTmKa8+QomJrTx5OgNf9QMadhxBe3VdbsnWovYc8xPaPHJjZ63RJCKjkGKdjG5J5oGU05yldCgZNtQsCoL9e2dpk6hNsyGnC8KtQX1U/ghPwN/F/c6TdtpaE0pkkwpOMGOzW3cuHF2qgZMpbAX4kTNRjGeTCCZIpUNEsdKlCVE8wFMMFdffbWZMWPGZqV1aYRBeB3JVUEg+QW/BP4TXpdE0GJYUYNTmIYnCAvMToRnOgEY1N5KkxLMIlsSnkFtudwfCHSuy0wFezZaSfpp0qSJreTI/e6K8+H8JHSUz4KCs5kHLrZ1Hrpcj2HGcQ9xipwh0KnZxHhc488884ytnRMFBSfY6STjL9uJ5s4FTi/E4rTFXEEpYG7yZHsbPTvzqRFzFJCJSGEuBAb2X/IMEEzE+vIg5sIP2lGHypguQ5DXxRFl79uwoKnRl9MJzrZt2yb2u7iS0NnMWC4OZooDBgywNupcVgTd2pnlpUqVstVRM42G2VqZ4AXnPOUg8sT0hz25MLCSHEO5APMQQp0byTXbYApIGrhryRcnEEiXXHKJTapxlyVCF2cXTrCgAm5bggcQJZXR2mnD5tLNiQLi+sw0vDDKmQUmCaJEmFklJ+pkGp4YJatWrSpS354Q2kzr22cKodU4cIly8896iC7iwROVxl5wgp3pJNqQ6/0IhPwhOP0hj/kS7ohmhF+AGQVgc8VHkKquSFxYunSprWkCCPM429azQZDQ1SBZrOnUzs9Gh6VMMsvr/V8+RBjwv9HZCS1/S/0U0g3zJJsYEwzlhP2Z4EQoSbBnMcwxX8MdcfCiafiLY4n0oSsTZgRmPKmKgOU63DHKImXJvqJke7vf9BTXOvx777239Z9lItiZwb///vu2GTYKVXEmuzDhslyPCHdm5RQ+4zzg5+nWrVskHb0KTrBvazDlpUMTcbLYYDnpaO84UCXk04foBWKH0ZBSdZmnk1Quueeee+z0nOgguuskt0qjWUsYEG7UoCHT2jn7aD83ePBguw4zV1ioyLhu3bpIMyYLMbN8fjYywTMOmBRZg5jhunXr2sYfVCZ0dT4uvfRS7+KLL8717m1TUJHwrbfe8vKVGjVq2PyKqCHO+s0339xs/RtvvGGvraBQL4XaPbvvvnuino9/yRc6dOhg6wPtueeetjqmv8ojSxDWrVtna+Jnu7YUVR+ffvpp75RTTsl4rPyK7xObaZFEimBz9jt8XDkEEczpl8+2es4x2aZRw0wvVVkCwmjDZPHS0JnSBPh+8FOhERNCSAw2oYX5QqVKlUynTp
1sFAv7xu/1L0HAscnsZGsEdtAiMJLyHpE8akRWoPreV199tVllvgULFtjWfiJYdb7OnTvbGvz5SLdu3bzx48dHPu5xxx1nM1cXLlyYWMdrtFi69wSFWvkuoxON2LXJu//++21N+bhmlt9www1e165dU7bby0fy2/hU4OBAS+XcojFzFA6WuEPIoN9W/e2339psRBxhycWrwtqwo4LoH5JeiJqIskgZzjlmeDVq1EhkthJSSagdCTFBIZzR5VBgT3fhjccee6z1EcQ1s3zu3Ll2lkwo7rZQNFCCPY8hpZx49bvvvtu+R0hRxImMOn+SlUgN09ptBc4xznAcvMkFojjvYQU7DwxyNMjo9cdzu8S8oCDUyTfgQUFIIUk7Rx55pA0ZzrQSZT5nllf6P9POtoKiYvIYNHNshJwiQuGwt/OXyAmy17JVGU+IZAjnY6ZDEhW2YB40CE0yKbk+iR0nXC/X0UXY/b/99tvE7ASIQ2ddmMxyZs30nMXuTQQQ+S7MCHKZ5JQOEuzbQLgj8a7+kqu5zp7bFmEqzU2aXFBr9uzZVlDx0IwrmBCKi99P10TBMaLhtlMmKC+MmQOnIu38mBkEKSi2rWSWDx8+vIhph3IEhM7mY9FAPxLseQy2weI6vlCzGVufSA/MBUR0dO7ceTPbKLXQEfD5MENDM6SeUXJ8eNgiZUSsuDo8qeL3qVYYpn56lB298jmz/IADDjD//ve/NzPtUFAs34oG+pGNPY9BcJO4wIXk5+abb7aONi4ukR602GO2k8rByme5Bo2a9H4EJbZwCmwRjojelWq/04XiUlQSJDGrEEhV0uCcc84piKKBfiTY8xhKzeKwoQwCGhsRCNSIQVune4tIHzS4RYsWbaZhYl7Ih8xE+l+iGaJhow1TnxvtGLNb0HLFftD8w7RuSwZhlqzt50NVzGyXAtmwYUOi7K+DiCVXNTRfkSkmz6FsK9oWZWsR7NiIse+Rdi7SB7soQpx+lS5BhfRtImcQoER35BKE+ccff2x7kJJMRRu3+vXrW3NH+/btQ7cEpJwA0TbM8KI0caQyb+Rj2F+hFQ105F5VESWCU4ppORqcc1pJqAcH81XTpk1NzZo1EyVxEaTEtQet854NEBLOro4tnIxRBDtk0lQF5yahlNiGcW4mx8ena7tPNnFkYt4oZNPO1kIaex5DF3kuIlLhH3jgAfse8wwaBLZTNDsRrKIe9bnRgolwQNChyScLu1zAzAFfCjX2Mckws6AuO5og5xnBHIbmzZsX+xmmFMoDiPghwZ7HMP277LLLbMiVEz5ocgh7sgeJohDxiRMnnJWHDQ+gyy+/3Lzzzjs2KgOtmpmGEOkiwZ7HkIF4/PHHb7aeWGRSpzO1mxYiRMCkCicM2nAiSigbwWwMoZ4P2Zti20eCPQ8hvOrhhx9OOPloENGjR4/ETU9PzOOOOy4vwvS2JY2YmilEFPkbT7jIjlw3nCDygv6u2WjPSLMInMOpHmj55vQT0ZC/EfYFDNltRMH4mwb4e0kSgkVxfpE+pLojNMm+pNP8vHnzbFkGEndoZJJrcJBno4vTI488YsMdeWiQjESYHr8d23rQ8rVi20GCPQ9JnkRpUpU5dA0iA5M6O4SwsVCRcMSIEaELbEXJ9ddfb52m9OkkLHPFihVFlrCgFNAQmxA9mk9T64UEqNNPP90W8hLxRIJdFASYWlypY4S7aw6OUzKXsx8eNjhLMb8RrYOtn4xGImFYML9lEv2Es91lLiPY+S7MTzjlXdVQET8Ux56HbCtZftsSmDoQnJhjSPIaOXKkFXQIt1zWOyHTFP8JTayzAQ+FlStXJho8f/7557ZUBclZq1atysp3itwjwZ6HYHohhtllu5Fkws3vMt389neRHjRvRlt1wpRyszig6UCPHTpXODNbquinKCApi1rsCHNa7+FrwL7OupYtW2blO0XuUVRMHkJtmFzUxSg0cEij0eZyNoStnxo2xVXxjOI3ohjQ95MwWWYqLj6eh5
2S3OKJBLuINd26dUtru1zV10awE52ypYeLPyoqHdJ1uNLeTsQPmWJErKFkrasPk686DKahqEMPcbqmMxPJdfy+yA4S7CLW0GCZZC/6dGLicrV38okzzzwz8jaHfmcsDzSibuj7iQNVxB+ZYkTswdlMhiXmFuzLhP91797dNgvPdbRRcsu5bJGvHY9EdlAcu4g9RBdRxZFIEMowUA73kksusc2ZKbyVS6RXiWwgU4woKHBWulox+WBfTm4uLUQUSGMXBWGKwc5+wgknmAMPPNAWAhszZowtikV3oUIh12YnsfWQxi5iDSYXEpD22WcfG/qIgKekQNzp2LFjkffJSW4OVXeMJ3KeitibXih2RbhjSRpr3AScktwKG2nsItZ06dKlIE0QEtiFjTR2IYSIGXKeCiFEzJBgF0KImCHBLoQQMUOCXQghYoYEuxB5Ak21ieChu5EQmaCoGCFyQLNmzUzDhg3N7bffnli3bt06W3e9atWqBRmiKaJDcexC5An0YK1WrVqud0PEAJlixDZZ++XSSy+1pW7LlStnjj32WDN37tzE5/PmzTPt2rWz3YEoV0tv0++++y7xOeV7qfBI1cc999zT9O7d267/4YcfrKb88ccfJ7bFLMI6zCR+c8nUqVPNoYcear+/cePGtkm0448//rDVJKl9XqFCBdtvlFIGDvrZzpo1y4wePTrRuJzvTmWKefLJJxP7SjXKW265pcixYN2NN95oyyXwW8mypUG3KGwk2MU2x8CBA63Au++++8yHH35oateubdq0aWPNGP/73/9sA2cEIU2bP/jgAyv0NmzYYP93/PjxplevXuaiiy6yxcCee+45+/9BGTBggBWyPFDoV0pz7PXr1yfqsjRq1MgKfwQ+33XuueeaOXPm2M8R6E2aNDEXXnihrcXOQi2bZNj3008/3TbiYF+HDh1qrrnmGtsVyg/7cfjhh5uPPvrI1sahucj8+fNDHl0RC7CxC7Gt8Ndff3mlS5f2HnzwwcS6devWeXvttZc3cuRIb9CgQV6tWrXsulSw3dVXX53yswULFuBv8j766KPEuqVLl9p1r732mn3PX94/8sgjiW3++OMPr3z58t6jjz5a7H6ffPLJ3uWXX554f/zxx3t9+/Ytso0bm++Es846yzvhhBOKbDNgwADvoIMOSryvWbOmd8455yTeb9q0ydtjjz288ePHF7svIv5IYxfbFJhU0IyPOeaYxLrSpUubI4880nz55ZfWjILphXXJLF682Pz666+mZcuWGe8HGreDVnt16tSx3w/UeR8+fLg1wfAZpYGnTZtmywQHgfH8vxN4/8033xSpJY9JyIEpBzs9v1UULnKeilhRvnz5UJ+5SpDgDxRz5pUgjBo1yppbiHhBuFMqt1+/fjbqJRskP8QQ7mrgUdhIYxfbFPvvv7+NHnn77beLCF9s3QcddJDVXt98882UAhnnIs7GmTNnphwbWzlg83b4Hal+3nvvvcTrpUuXmq+//trUq1fPvmff2rdvbxtnN2jQwPYZ5XM//IYtdXBiPP/vdGPTLIReqUIUhzR2sU2B9otzEOclZg6iQEaOHGlWrVplG1Sjqd55553W4Tho0CCzyy67WCGMqQZzCQ5IGk4QUdO2bVuzcuVKKyz79OljNXoiXG666SZTq1Yta84YPHhwyv0YNmyYqVKlio05v/rqq23zjg4dOtjPDjjgAPPEE0/Yxtm77rqrufXWW82iRYvsg8fBA2b27Nk2GgZTDb8lmcsvv9wcccQR1qxzxhlnmHfffdd2fho3blwWj7CIBbk28gsRlNWrV3t9+vTxdtttN69s2bLeMccc482ZMyfx+SeffOK1bt3aq1ChgrfTTjt5xx13nPfdd98lPp8wYYJXp04d64Tdc8897ViOL774wmvSpIl1hjZs2NCbPn16Sufp888/79WvX98rU6aMd+SRR9rv9DtT27dv71WsWNE6MgcPHux16dLFrnPMnz/fa9y4sf0exsNxm+w8hS
eeeMI6S9nXGjVqeKNGjSpyLHCe3nbbbUXWNWjQwLv22msjO95i20OZp0IEgFjz5s2bW/NLpUqVcr07QqRENnYhhIgZEuxCCBEzZIoRQoiYIY1dCCFihgS7EELEDAl2IYSIGRLsQggRMyTYhRAiZkiwCyFEzJBgF0KImCHBLoQQJl78P+24UrqalSSJAAAAAElFTkSuQmCC", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for categorical_variable in categorical:\n", "    X[categorical_variable].value_counts().plot(kind='bar', figsize=(4,3), title=categorical_variable)\n", "    plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*OneHotEncoder* has a parameter *drop* that lets us drop one of the dummy variables of each feature in order to avoid collinearity, for example with *OneHotEncoder(drop='first')*. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model. (When features are highly correlated, small changes in the data can cause large fluctuations in the regression coefficients, and it becomes harder to determine the importance of each feature.) However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance penalized linear classification or regression models, or decision trees.\n", "Let's fit *OneHotEncoder* to the categorical variables only.\n", "We will also change the default value of the parameter *sparse_output*, which defines whether the encoder returns a sparse matrix or a dense array. Since we will later convert the output of the encoder to a pandas DataFrame, we want the output to be an array." 
] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[array(['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private',\n", " 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'],\n", " dtype=object),\n", " array(['10th', '11th', '12th', '1st-4th', '5th-6th', '7th-8th', '9th',\n", " 'Assoc-acdm', 'Assoc-voc', 'Bachelors', 'Doctorate', 'HS-grad',\n", " 'Masters', 'Preschool', 'Prof-school', 'Some-college'],\n", " dtype=object),\n", " array(['cat1', 'cat2', 'cat3', 'cat4'], dtype=object),\n", " array(['?', 'Adm-clerical', 'Armed-Forces', 'Craft-repair',\n", " 'Exec-managerial', 'Farming-fishing', 'Handlers-cleaners',\n", " 'Machine-op-inspct', 'Other-service', 'Priv-house-serv',\n", " 'Prof-specialty', 'Protective-serv', 'Sales', 'Tech-support',\n", " 'Transport-moving'], dtype=object)]" ] }, "execution_count": 43, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ohe = OneHotEncoder(drop='first', sparse_output=False)\n", "ohe.fit(X[categorical])\n", "ohe.categories_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The attribute *categories_* lists the categories of each feature determined during fitting (in order of the features in X and corresponding with the output of transform). This includes the category specified in drop (if any)." 
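, "\n",
"\n",
"To see how *categories_* differs from the columns actually produced, here is a minimal sketch on a made-up toy frame. The *toy* data and the names *toy_ohe* and *toy_encoded* are illustrative assumptions, not part of the census data used in this notebook.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.preprocessing import OneHotEncoder\n",
"\n",
"# Hypothetical toy data, not the dataset used in this notebook\n",
"toy = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})\n",
"\n",
"toy_ohe = OneHotEncoder(drop='first', sparse_output=False)\n",
"toy_encoded = toy_ohe.fit_transform(toy)\n",
"\n",
"# categories_ lists every category seen during fit, including the dropped one\n",
"print(toy_ohe.categories_)\n",
"# drop_idx_ gives, per feature, the position in categories_ of the dropped category\n",
"print(toy_ohe.drop_idx_)\n",
"# only the categories that were kept get an output column\n",
"print(toy_ohe.get_feature_names_out())\n",
"```\n",
"\n",
"On this toy input, *categories_* should still contain all three colours, while *get_feature_names_out()* returns only the two columns that remain after the first category is dropped."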
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's apply `transform`:" ] }, { "cell_type": "code", "execution_count": 46, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['workclass_Federal-gov', 'workclass_Local-gov',\n", " 'workclass_Never-worked', 'workclass_Private',\n", " 'workclass_Self-emp-inc', 'workclass_Self-emp-not-inc',\n", " 'workclass_State-gov', 'workclass_Without-pay', 'education_11th',\n", " 'education_12th', 'education_1st-4th', 'education_5th-6th',\n", " 'education_7th-8th', 'education_9th', 'education_Assoc-acdm',\n", " 'education_Assoc-voc', 'education_Bachelors',\n", " 'education_Doctorate', 'education_HS-grad', 'education_Masters',\n", " 'education_Preschool', 'education_Prof-school',\n", " 'education_Some-college', 'capital-gain-category_cat2',\n", " 'capital-gain-category_cat3', 'capital-gain-category_cat4',\n", " 'occupation_Adm-clerical', 'occupation_Armed-Forces',\n", " 'occupation_Craft-repair', 'occupation_Exec-managerial',\n", " 'occupation_Farming-fishing', 'occupation_Handlers-cleaners',\n", " 'occupation_Machine-op-inspct', 'occupation_Other-service',\n", " 'occupation_Priv-house-serv', 'occupation_Prof-specialty',\n", " 'occupation_Protective-serv', 'occupation_Sales',\n", " 'occupation_Tech-support', 'occupation_Transport-moving'],\n", " dtype=object)" ] }, "execution_count": 46, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_cat = ohe.transform(X[categorical])\n", "encoder_feature_names = ohe.get_feature_names_out(categorical)\n", "encoder_feature_names" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now convert the array to a pandas DataFrame:" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [], "source": [ "X_cat = pd.DataFrame(X_cat, columns=encoder_feature_names)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now concatenate the transformed categorical variables with the unchanged numerical variables." 
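, "\n",
"\n",
"One caveat for this step: *pd.concat(..., axis=1)* aligns rows by index label, not by position. A DataFrame built from a NumPy array gets a fresh 0..n-1 index, so the concatenation only lines up if the numerical frame has that same index. The sketch below uses made-up values and hypothetical names (*num*, *encoded*) to show what can go wrong otherwise, and one possible fix (passing *index=* when building the frame).\n",
"\n",
"```python\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical numerical frame whose index is not 0..n-1 (e.g. after filtering or splitting)\n",
"num = pd.DataFrame({'age': [39, 50, 38]}, index=[10, 11, 12])\n",
"encoded = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])\n",
"\n",
"# Default index 0, 1, 2: no labels match, so concat pads both sides with NaN\n",
"misaligned = pd.concat([num, pd.DataFrame(encoded, columns=['a', 'b'])], axis=1)\n",
"print(misaligned.shape)\n",
"\n",
"# Reusing the original index keeps exactly one row per sample\n",
"aligned = pd.concat([num, pd.DataFrame(encoded, columns=['a', 'b'], index=num.index)], axis=1)\n",
"print(aligned.shape)\n",
"```\n",
"\n",
"If *X* keeps its default index, as appears to be the case in this notebook, the plain concatenation is fine; the *index=* argument only matters once the rows have been filtered, shuffled or split."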
] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "(32561, 42)\n" ] } ], "source": [ "X_num = X[numerical]\n", "X_enc = pd.concat([X_num, X_cat], axis=1)\n", "print(X_enc.shape)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let us now see what the encoded data frame looks like. As we saw, we now have 42 columns: the 2 numerical columns plus 40 one-hot encoded ones." ] }, { "cell_type": "code", "execution_count": 52, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<p>HTML rendering of the first five rows of X_enc (5 rows × 42 columns); the text/plain output of this cell shows the same table.</p>
" ], "text/plain": [ " age hours-per-week workclass_Federal-gov workclass_Local-gov \\\n", "0 39 40 0.0 0.0 \n", "1 50 13 0.0 0.0 \n", "2 38 40 0.0 0.0 \n", "3 53 40 0.0 0.0 \n", "4 28 40 0.0 0.0 \n", "\n", " workclass_Never-worked workclass_Private workclass_Self-emp-inc \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 1.0 0.0 \n", "3 0.0 1.0 0.0 \n", "4 0.0 1.0 0.0 \n", "\n", " workclass_Self-emp-not-inc workclass_State-gov workclass_Without-pay \\\n", "0 0.0 1.0 0.0 \n", "1 1.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " education_11th education_12th education_1st-4th education_5th-6th \\\n", "0 0.0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 0.0 \n", "3 1.0 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 0.0 \n", "\n", " education_7th-8th education_9th education_Assoc-acdm \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " education_Assoc-voc education_Bachelors education_Doctorate \\\n", "0 0.0 1.0 0.0 \n", "1 0.0 1.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 1.0 0.0 \n", "\n", " education_HS-grad education_Masters education_Preschool \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 1.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " education_Prof-school education_Some-college capital-gain-category_cat2 \\\n", "0 0.0 0.0 1.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " capital-gain-category_cat3 capital-gain-category_cat4 \\\n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "\n", " occupation_Adm-clerical occupation_Armed-Forces occupation_Craft-repair \\\n", "0 1.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 0.0 0.0 0.0 \n", "\n", " occupation_Exec-managerial occupation_Farming-fishing \\\n", "0 0.0 0.0 \n", "1 1.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "\n", " occupation_Handlers-cleaners 
occupation_Machine-op-inspct \\\n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 1.0 0.0 \n", "3 1.0 0.0 \n", "4 0.0 0.0 \n", "\n", " occupation_Other-service occupation_Priv-house-serv \\\n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 \n", "\n", " occupation_Prof-specialty occupation_Protective-serv occupation_Sales \\\n", "0 0.0 0.0 0.0 \n", "1 0.0 0.0 0.0 \n", "2 0.0 0.0 0.0 \n", "3 0.0 0.0 0.0 \n", "4 1.0 0.0 0.0 \n", "\n", " occupation_Tech-support occupation_Transport-moving \n", "0 0.0 0.0 \n", "1 0.0 0.0 \n", "2 0.0 0.0 \n", "3 0.0 0.0 \n", "4 0.0 0.0 " ] }, "execution_count": 52, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_enc.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OneHotEncoder supports aggregating infrequent categories into a single output for each feature. The parameters to enable the gathering of infrequent categories are *min_frequency* and *max_categories*.\n", "\n", "- *min_frequency* is either an integer greater or equal to 1, or a float in the interval (0, 1). If *min_frequency* is an integer, categories with a cardinality smaller than *min_frequency* will be considered infrequent. If *min_frequency* is a float, categories with a cardinality smaller than this fraction of the total number of samples will be considered infrequent. The default value is 1, which means every category is encoded separately.\n", "\n", "- *max_categories* is either None or any integer greater than 1. This parameter sets an upper limit to the number of output features for each input feature. *max_categories* includes the feature that combines infrequent categories." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's set for our one-hot encoder that if a category appears in less than 1% of all the data, it should be considered infrequent." 
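, "

As a rough plain-Python sketch of the rule described above (not scikit-learn's actual implementation), a float *min_frequency* can be read as a threshold on each category's share of the samples, and an integer as a threshold on its raw count. The toy column and the helper name `find_infrequent` are hypothetical:

```python
from collections import Counter


def find_infrequent(values, min_frequency):
    '''Return the sorted categories whose count falls below the threshold.

    An integer min_frequency is used as an absolute count; a float is
    interpreted as a fraction of the total number of samples.
    '''
    counts = Counter(values)
    if isinstance(min_frequency, float):
        threshold = min_frequency * len(values)
    else:
        threshold = min_frequency
    return sorted(cat for cat, n in counts.items() if n < threshold)


# Toy column: 'a' appears 6 times, 'b' 3 times, 'c' once (10 samples).
toy = ['a'] * 6 + ['b'] * 3 + ['c']
infrequent_20pct = find_infrequent(toy, 0.2)  # threshold is 2 samples
infrequent_count4 = find_infrequent(toy, 4)   # threshold is 4 samples
print(infrequent_20pct, infrequent_count4)
```

All categories selected by such a rule would share a single output column after encoding."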
] }, { "cell_type": "code", "execution_count": 55, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['workclass_?', 'workclass_Federal-gov', 'workclass_Local-gov',\n", " 'workclass_Private', 'workclass_Self-emp-inc',\n", " 'workclass_Self-emp-not-inc', 'workclass_State-gov',\n", " 'workclass_infrequent_sklearn', 'education_10th', 'education_11th',\n", " 'education_12th', 'education_5th-6th', 'education_7th-8th',\n", " 'education_9th', 'education_Assoc-acdm', 'education_Assoc-voc',\n", " 'education_Bachelors', 'education_Doctorate', 'education_HS-grad',\n", " 'education_Masters', 'education_Prof-school',\n", " 'education_Some-college', 'education_infrequent_sklearn',\n", " 'capital-gain-category_cat1', 'capital-gain-category_cat2',\n", " 'capital-gain-category_cat4',\n", " 'capital-gain-category_infrequent_sklearn', 'occupation_?',\n", " 'occupation_Adm-clerical', 'occupation_Craft-repair',\n", " 'occupation_Exec-managerial', 'occupation_Farming-fishing',\n", " 'occupation_Handlers-cleaners', 'occupation_Machine-op-inspct',\n", " 'occupation_Other-service', 'occupation_Prof-specialty',\n", " 'occupation_Protective-serv', 'occupation_Sales',\n", " 'occupation_Tech-support', 'occupation_Transport-moving',\n", " 'occupation_infrequent_sklearn'], dtype=object)" ] }, "execution_count": 55, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ohe_inf = OneHotEncoder(min_frequency=0.01, sparse_output=False)\n", "ohe_inf.fit(X[categorical])\n", "ohe_inf.get_feature_names_out()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see from the previous output that for the feature occupation we have now a category called 'occupation_infrequent_sklearn'. 
We can check which categories are aggregated in this category:" ] }, { "cell_type": "code", "execution_count": 57, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[array(['Never-worked', 'Without-pay'], dtype=object),\n", " array(['1st-4th', 'Preschool'], dtype=object),\n", " array(['cat3'], dtype=object),\n", " array(['Armed-Forces', 'Priv-house-serv'], dtype=object)]" ] }, "execution_count": 57, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ohe_inf.infrequent_categories_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We see that the two categories 'Armed-Forces' and 'Priv-house-serv' are grouped. Checking their share of the data confirms that each accounts for less than 1% of the samples." ] }, { "cell_type": "code", "execution_count": 59, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "occupation\n", "Prof-specialty 0.127146\n", "Craft-repair 0.125887\n", "Exec-managerial 0.124873\n", "Adm-clerical 0.115783\n", "Sales 0.112097\n", "Other-service 0.101195\n", "Machine-op-inspct 0.061485\n", "? 0.056601\n", "Transport-moving 0.049046\n", "Handlers-cleaners 0.042075\n", "Farming-fishing 0.030527\n", "Tech-support 0.028500\n", "Protective-serv 0.019932\n", "Priv-house-serv 0.004576\n", "Armed-Forces 0.000276\n", "Name: proportion, dtype: float64" ] }, "execution_count": 59, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X['occupation'].value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These levels were grouped together not because they are similar, but because we don't really have enough data to say much about them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we fit the *OneHotEncoder* on the training data only, as we should, a very rare category might be absent from it and appear for the first time when we transform the test data. 
The behavior for this case is handled with the parameter *handle_unknown*:\n", "- error: This is the default behavior, and an error is raised if an unknown category is present during transform.\n", "- ignore: When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. This means that unknown categories will have the same mapping as the dropped category. For this case, we should not drop any category with the *drop* parameter (as the dropped category is represented with all the zeros of the dummy variables). \n", "- infrequent_if_exist: When an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will map to the infrequent category if it exists. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Ordinal encoding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Categorical ordinal variables have categories that follow a logical ordering. Some examples of ordinal data include:\n", "- Socioeconomic status (low income, middle income or high income)\n", "- Education level (high school, bachelor’s degree, master’s degree or PhD)\n", "- Satisfaction rating (extremely dislike, dislike, neutral, like or extremely like).\n", "\n", "Ordinal variables are encoded using scikit-learn *OrdinalEncoder*. In order to use *OrdinalEncoder*, we have to first specify the order in which we would like to encode our ordinal variable. In our case let's encode education level with ordinal encoder (let's assume some ranking between the education levels), and capital-gain-category." 
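, "

To make the role of the category order concrete, here is a small plain-Python sketch (toy data, hypothetical variable names) of the mapping an ordinal encoding applies: each category is replaced by its position in the list we provide.

```python
# Hypothetical ordered categories for a toy satisfaction feature.
ordered_levels = ['dislike', 'neutral', 'like']

# Each category maps to its index in the user-specified order.
level_to_code = {level: float(i) for i, level in enumerate(ordered_levels)}

toy_ratings = ['like', 'dislike', 'neutral', 'like']
encoded = [level_to_code[r] for r in toy_ratings]
print(encoded)  # [2.0, 0.0, 1.0, 2.0]
```

Changing the order of the list changes the codes, which is why the order is specified explicitly rather than left to alphabetical sorting."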
] }, { "cell_type": "code", "execution_count": 64, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import OrdinalEncoder" ] }, { "cell_type": "code", "execution_count": 65, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "education\n", "HS-grad 10501\n", "Some-college 7291\n", "Bachelors 5355\n", "Masters 1723\n", "Assoc-voc 1382\n", "11th 1175\n", "Assoc-acdm 1067\n", "10th 933\n", "7th-8th 646\n", "Prof-school 576\n", "9th 514\n", "12th 433\n", "Doctorate 413\n", "5th-6th 333\n", "1st-4th 168\n", "Preschool 51\n", "Name: count, dtype: int64" ] }, "execution_count": 65, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X['education'].value_counts()" ] }, { "cell_type": "code", "execution_count": 66, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "capital-gain-category\n", "cat1 29849\n", "cat2 1942\n", "cat4 613\n", "cat3 157\n", "Name: count, dtype: int64" ] }, "execution_count": 66, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X['capital-gain-category'].value_counts()" ] }, { "cell_type": "code", "execution_count": 67, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[array(['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th',\n", " '11th', '12th', 'HS-grad', 'Prof-school', 'Assoc-acdm',\n", " 'Assoc-voc', 'Some-college', 'Bachelors', 'Masters', 'Doctorate'],\n", " dtype=object),\n", " array(['cat1', 'cat2', 'cat3', 'cat4'], dtype=object)]" ] }, "execution_count": 67, "metadata": {}, "output_type": "execute_result" } ], "source": [ "categories=[['Preschool','1st-4th','5th-6th','7th-8th','9th','10th','11th','12th','HS-grad','Prof-school','Assoc-acdm','Assoc-voc','Some-college','Bachelors','Masters','Doctorate'],\n", " ['cat1','cat2','cat3','cat4']]\n", "encoder = OrdinalEncoder(categories=categories)\n", "encoder.fit(X[['education','capital-gain-category']] )\n", "encoder.categories_" ] }, { "cell_type": "code", "execution_count": 68, "metadata": {}, "outputs": [ { 
"data": { "text/plain": [ "(32561, 6)" ] }, "execution_count": 68, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_enc = X.copy()\n", "X_enc[['education','capital-gain-category']] = encoder.transform(X[['education','capital-gain-category']] )\n", "X_enc.shape" ] }, { "cell_type": "code", "execution_count": 69, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>age</th><th>workclass</th><th>education</th><th>capital-gain-category</th><th>hours-per-week</th><th>occupation</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>39</td><td>State-gov</td><td>13.0</td><td>1.0</td><td>40</td><td>Adm-clerical</td></tr>\n",
"    <tr><th>1</th><td>50</td><td>Self-emp-not-inc</td><td>13.0</td><td>0.0</td><td>13</td><td>Exec-managerial</td></tr>\n",
"    <tr><th>2</th><td>38</td><td>Private</td><td>8.0</td><td>0.0</td><td>40</td><td>Handlers-cleaners</td></tr>\n",
"    <tr><th>3</th><td>53</td><td>Private</td><td>6.0</td><td>0.0</td><td>40</td><td>Handlers-cleaners</td></tr>\n",
"    <tr><th>4</th><td>28</td><td>Private</td><td>13.0</td><td>0.0</td><td>40</td><td>Prof-specialty</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>
" ], "text/plain": [ " age workclass education capital-gain-category hours-per-week \\\n", "0 39 State-gov 13.0 1.0 40 \n", "1 50 Self-emp-not-inc 13.0 0.0 13 \n", "2 38 Private 8.0 0.0 40 \n", "3 53 Private 6.0 0.0 40 \n", "4 28 Private 13.0 0.0 40 \n", "\n", " occupation \n", "0 Adm-clerical \n", "1 Exec-managerial \n", "2 Handlers-cleaners \n", "3 Handlers-cleaners \n", "4 Prof-specialty " ] }, "execution_count": 69, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_enc.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that here we remained with the same number of features as we had initially." ] }, { "cell_type": "code", "execution_count": 71, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "((32561, 6), (32561, 6))" ] }, "execution_count": 71, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.shape, X_enc.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Target Encoding" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we have categorical features with high cardinality, using one-hot encoding would inflate the feature space making it more computationally expensive for modeling. A classical example of high cardinality categories are location based such as zip code or region. In this case we can use Target Encoding, where each category is encoded based on the average target values for observations belonging to the category, more specifically the encoding scheme mixes the global target mean with the target mean conditioned on the value of the category." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check again how many categories each categorical feature has, using the pandas `nunique()` method:" ] }, { "cell_type": "code", "execution_count": 75, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "workclass has 9 categories\n", "education has 16 categories\n", "capital-gain-category has 4 categories\n", "occupation has 15 categories\n" ] } ], "source": [ "for categorical_variable in categorical:\n", "    print(f'{categorical_variable} has {X[categorical_variable].nunique()} categories')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see how we would encode workclass and occupation with target encoding." ] }, { "cell_type": "code", "execution_count": 77, "metadata": {}, "outputs": [], "source": [ "from sklearn.preprocessing import TargetEncoder" ] }, { "cell_type": "code", "execution_count": 78, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(32561, 6)" ] }, "execution_count": 78, "metadata": {}, "output_type": "execute_result" } ], "source": [ "enc = TargetEncoder(target_type='binary')\n", "X_enc = X.copy()\n", "X_enc[['workclass','occupation']] = enc.fit_transform(X[['workclass','occupation']], y)\n", "X_enc.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check which values the categories were mapped to: for feature i, `encodings_[i]` holds the encodings matching the categories listed in `categories_[i]`."
] }, { "cell_type": "code", "execution_count": 80, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[array(['?', 'Federal-gov', 'Local-gov', 'Never-worked', 'Private',\n", " 'Self-emp-inc', 'Self-emp-not-inc', 'State-gov', 'Without-pay'],\n", " dtype=object),\n", " array(['?', 'Adm-clerical', 'Armed-Forces', 'Craft-repair',\n", " 'Exec-managerial', 'Farming-fishing', 'Handlers-cleaners',\n", " 'Machine-op-inspct', 'Other-service', 'Priv-house-serv',\n", " 'Prof-specialty', 'Protective-serv', 'Sales', 'Tech-support',\n", " 'Transport-moving'], dtype=object)]" ] }, "execution_count": 80, "metadata": {}, "output_type": "execute_result" } ], "source": [ "enc.categories_" ] }, { "cell_type": "code", "execution_count": 81, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[array([0.10406847, 0.38626183, 0.29476285, 0. , 0.21867381,\n", " 0.55696537, 0.28490785, 0.27193089, 0. ]),\n", " array([0.10367319, 0.13450071, 0.11845551, 0.22664396, 0.48393209,\n", " 0.11576456, 0.06281553, 0.12490973, 0.04159133, 0.00676869,\n", " 0.44896578, 0.32495995, 0.26930666, 0.30487686, 0.20039788])]" ] }, "execution_count": 81, "metadata": {}, "output_type": "execute_result" } ], "source": [ "enc.encodings_" ] }, { "attachments": { "546e8541-dffc-45db-b3e6-05bfbbbd6f4e.png": { "image/png": "" } }, "cell_type": "markdown", "metadata": {}, "source": [ "The method *fit_transform* internally relies on a cross fitting scheme to prevent target information from leaking into the training-time representation. Below is a diagram of this process; we will describe cross-validation in more detail in the next section of this notebook.\n", "\n", "
\n", "\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Multiple column transformations" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What if we wanted to scale some numerical features, encode separately nominal and separately ordinal categorical features? So far, we have done column transformation steps separately.\n", "It would be more convenient to have a single transformer able to handle all columns, applying the appropriate transformations to each column. `sklearn` has `ColumnTransformer` for this purpose. Let's use it to apply all the transformations: " ] }, { "cell_type": "code", "execution_count": 85, "metadata": {}, "outputs": [], "source": [ "from sklearn.compose import ColumnTransformer\n", "\n", "ct = ColumnTransformer([\n", " ('scaling', MinMaxScaler(), ['age', 'hours-per-week']),\n", " ('one_hot', OneHotEncoder(sparse_output=False, drop='first', \n", " min_frequency=0.01, handle_unknown='infrequent_if_exist' ), ['workclass']),\n", " ('target_enc', TargetEncoder(target_type='binary'), ['occupation']),\n", " ('ordinal', OrdinalEncoder(categories=categories), ['education','capital-gain-category'] )\n", "], remainder='passthrough')\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The constructor of `ColumnTransfmer` requires a list of tuples, where each tuple contains a name of the transformation, user defined, a transformer and a list of names (or indices) of columns\n", "that the transformer should be applied to. In this example, we specify that the numerical\n", "columns should be transformed using the `MinMaxScaler`, that nominal feature workclass is encoded with `OneHotEncoder`, feature occupation will be encoded with `TargetEncoder` and the ordinal features with `OrdinalEncoder`. 
(We would also need to scale the ordinal encoded features, but for now, we will ignore this step.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we listed all the columns in the dataframe, as we wished to transform all of them. But we could have had a feature that needs no additional processing. By default, only the columns specified in the transformers are transformed and combined in the output, and the non-specified columns are dropped. We can change this default behaviour by specifying the parameter *remainder='passthrough'*; with it, all remaining columns that were not specified in the transformers, but are present in the data passed to fit, are passed through unchanged." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A `ColumnTransformer` can be displayed either as text or as a diagram. The setting is controlled with `set_config(display=...)`, and the default may depend on the installed scikit-learn version. To visualize the diagram in the notebook, we can use `set_config(display='diagram')` and then output the `ColumnTransformer` object." ] }, { "cell_type": "code", "execution_count": 89, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
ColumnTransformer(remainder='passthrough',\n",
       "                  transformers=[('scaling', MinMaxScaler(),\n",
       "                                 ['age', 'hours-per-week']),\n",
       "                                ('one_hot',\n",
       "                                 OneHotEncoder(drop='first',\n",
       "                                               handle_unknown='infrequent_if_exist',\n",
       "                                               min_frequency=0.01,\n",
       "                                               sparse_output=False),\n",
       "                                 ['workclass']),\n",
       "                                ('target_enc',\n",
       "                                 TargetEncoder(target_type='binary'),\n",
       "                                 ['occupation']),\n",
       "                                ('ordinal',\n",
       "                                 OrdinalEncoder(categories=[['Preschool',\n",
       "                                                             '1st-4th',\n",
       "                                                             '5th-6th',\n",
       "                                                             '7th-8th', '9th',\n",
       "                                                             '10th', '11th',\n",
       "                                                             '12th', 'HS-grad',\n",
       "                                                             'Prof-school',\n",
       "                                                             'Assoc-acdm',\n",
       "                                                             'Assoc-voc',\n",
       "                                                             'Some-college',\n",
       "                                                             'Bachelors',\n",
       "                                                             'Masters',\n",
       "                                                             'Doctorate'],\n",
       "                                                            ['cat1', 'cat2',\n",
       "                                                             'cat3', 'cat4']]),\n",
       "                                 ['education', 'capital-gain-category'])])
" ], "text/plain": [ "ColumnTransformer(remainder='passthrough',\n", " transformers=[('scaling', MinMaxScaler(),\n", " ['age', 'hours-per-week']),\n", " ('one_hot',\n", " OneHotEncoder(drop='first',\n", " handle_unknown='infrequent_if_exist',\n", " min_frequency=0.01,\n", " sparse_output=False),\n", " ['workclass']),\n", " ('target_enc',\n", " TargetEncoder(target_type='binary'),\n", " ['occupation']),\n", " ('ordinal',\n", " OrdinalEncoder(categories=[['Preschool',\n", " '1st-4th',\n", " '5th-6th',\n", " '7th-8th', '9th',\n", " '10th', '11th',\n", " '12th', 'HS-grad',\n", " 'Prof-school',\n", " 'Assoc-acdm',\n", " 'Assoc-voc',\n", " 'Some-college',\n", " 'Bachelors',\n", " 'Masters',\n", " 'Doctorate'],\n", " ['cat1', 'cat2',\n", " 'cat3', 'cat4']]),\n", " ['education', 'capital-gain-category'])])" ] }, "execution_count": 89, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn import set_config\n", "set_config(display=\"diagram\")\n", "ct" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's split our data into train and test. Then, let's fit our *ColumnTransformer* to our train data and then transform it." 
] }, { "cell_type": "code", "execution_count": 91, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(24420, 12)" ] }, "execution_count": 91, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.linear_model import LogisticRegression\n", "\n", "from sklearn.model_selection import train_test_split\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n", "\n", "ct.fit(X_train, y_train)\n", "X_train_trans = ct.transform(X_train)\n", "X_train_trans.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that we obtained 12 features: 2 scaled numerical features, 2 ordinal features, 1 target-encoded feature (occupation) and 7 dummy features for workclass. The 9 original workclass categories became 8 after the two infrequent categories were grouped into one, and `drop='first'` then removed the dummy variable of the first category ('?'), leaving 7. Let's see the list of the features for our transformed dataset, using the *get_feature_names_out()* method on the fitted ColumnTransformer." ] }, { "cell_type": "code", "execution_count": 93, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['scaling__age', 'scaling__hours-per-week',\n", " 'one_hot__workclass_Federal-gov', 'one_hot__workclass_Local-gov',\n", " 'one_hot__workclass_Private', 'one_hot__workclass_Self-emp-inc',\n", " 'one_hot__workclass_Self-emp-not-inc',\n", " 'one_hot__workclass_State-gov',\n", " 'one_hot__workclass_infrequent_sklearn', 'target_enc__occupation',\n", " 'ordinal__education', 'ordinal__capital-gain-category'],\n", " dtype=object)" ] }, "execution_count": 93, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct.get_feature_names_out()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's access the one-hot encoder inside the column transformer using the attribute `named_transformers_` to check that we did indeed end up with two categories grouped in the *'one_hot__workclass_infrequent_sklearn'* category:" ] }, { "cell_type": "code", "execution_count": 95, 
"metadata": {}, "outputs": [ { "data": { "text/plain": [ "[array(['Never-worked', 'Without-pay'], dtype=object)]" ] }, "execution_count": 95, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ct.named_transformers_['one_hot'].infrequent_categories_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that we fitted the column transformer with X and y. Here we needed y only because we used traget encoding. If we did not use it, we would fit with only X." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, we can build a *LogisticRegression* model. Since we transformed our train data, the same transformations must be done for the test data. As before, we only apply the *transform* on the test data." ] }, { "cell_type": "code", "execution_count": 98, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " precision recall f1-score support\n", "\n", " <=50K 0.82 0.94 0.88 6159\n", " >50K 0.66 0.38 0.48 1982\n", "\n", " accuracy 0.80 8141\n", " macro avg 0.74 0.66 0.68 8141\n", "weighted avg 0.78 0.80 0.78 8141\n", "\n" ] } ], "source": [ "from sklearn.metrics import classification_report\n", "logreg = LogisticRegression(solver=\"liblinear\", random_state=42)\n", "logreg.fit(X_train_trans, y_train)\n", "\n", "X_test_trans = ct.transform(X_test)\n", "y_pred=logreg.predict(X_test_trans)\n", "print(classification_report(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Cross validation and Hyperparameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To evaluate our supervised models, so far we have split our dataset into a training set and a test set using the `train_test_split` function, built a model on the training set by calling the `fit` method, and evaluated it on the test set using a variety of metrics.\n", "Note, the reason we split our data into training and test sets is that we are interested in measuring how well our model *generalizes* 
to new, previously unseen data. We are not interested in how well our model fits the training set, but rather in how\n", "well it can make predictions for data that was not observed during training." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross-validation is a statistical method of evaluating generalization performance that is more stable and thorough than using a single split into a training and a test set. In cross-validation, the data is instead split repeatedly and multiple models are trained. The\n", "most commonly used version of cross-validation is k-fold cross-validation, where k is a user-specified number, usually 5 or 10. When performing five-fold cross-validation, the data is first partitioned into five parts of (approximately) equal size, called folds.\n", "Next, a sequence of models is trained. The first model is trained using the first fold as the test set, and the remaining folds (2–5) are used as the training set. The model is built using the data in folds 2–5, and then the accuracy is evaluated on fold 1. Then\n", "another model is built, this time using fold 2 as the test set and the data in folds 1, 3, 4, and 5 as the training set. This process is repeated using folds 3, 4, and 5 as test sets. For each of these five splits of the data into training and test sets, we compute the accuracy. In the end, we have collected five accuracy values. Usually, the first fifth of the data is the first fold, the second fifth of the data is the second fold, and so on. (Note: this is not valid for time series.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now import the iris dataset, using the parameter *return_X_y=True* to return the features and labels directly. We will also shuffle the data, since the data is sorted by labels, and we want to have data instances of different labels in each fold. 
Later in this notebook, we will return to this and discuss a better approach, but for now, we will just shuffle the data." ] }, { "cell_type": "code", "execution_count": 103, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import load_iris\n", "from sklearn.utils import shuffle\n", "X, y = load_iris(return_X_y=True)\n", "X, y = shuffle(X, y, random_state=0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's check the distribution of the target variable." ] }, { "cell_type": "code", "execution_count": 105, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Counter({2: 50, 1: 50, 0: 50})" ] }, "execution_count": 105, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from collections import Counter \n", "Counter(y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Cross-validation is implemented in `sklearn` using the `cross_val_score` function from the `model_selection` module. The parameters of the `cross_val_score` function are the model we want to evaluate, the training data, and the ground-truth labels. By default, the score computed at each CV iteration is the score method of the estimator. For logistic regression this is accuracy. The following link gives a list of all the possible values that the `scoring` parameter of `cross_val_score` can take: \\\n", "https://scikit-learn.org/stable/modules/model_evaluation.html \\\n", "If we wanted to use another metric, we could pass another value to the `scoring` parameter, for example `scoring='f1_macro'` (plain `'f1'` applies to binary targets only, and iris has three classes). For now, let's keep the default option.\n", "\n", "Let’s evaluate LogisticRegression on the iris dataset (we will not do scaling for now)." ] }, { "cell_type": "code", "execution_count": 107, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[1. 
0.86666667 0.96666667 0.96666667 0.96666667]\n" ] } ], "source": [ "from sklearn.model_selection import cross_val_score\n", "log_reg = LogisticRegression(solver=\"liblinear\", random_state=42 , max_iter=10000)\n", "scores = cross_val_score(log_reg, X, y)\n", "print(scores)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default, `cross_val_score` performs five-fold cross-validation, returning five\n", "accuracy values. The number of folds can be changed using the `cv` parameter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A common way to summarize the cross-validation accuracy is to compute the mean:\n" ] }, { "cell_type": "code", "execution_count": 110, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Average cross-validation score: 0.9533333333333334\n" ] } ], "source": [ "print(\"Average cross-validation score:\", scores.mean() )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using the mean cross-validation score, we can conclude that we expect the model to be around 95% accurate on average. Looking at all five scores produced by the five-fold cross-validation, we can also conclude that there is a relatively high variance in the\n", "accuracy between folds, ranging from 100% down to about 87%. This could imply that the model is very dependent on the particular folds used for training, but it could also just be a consequence of the small size of the dataset." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that in cross-validation, the folds can be processed in parallel, so for 5-fold CV, if we have the computational resources, we could run 5 processes in parallel. With the parameter `n_jobs`, we can specify the number of jobs (number of concurrent threads or processes) to run in parallel. If this parameter is not specified, only 1 job is run at a time, while `n_jobs=-1` means that all available CPUs should be used. 
However, if the training takes a lot of time, allocating all CPUs to cross-validation will leave us without any resources to do other things on the computer.\n", "\n", "To check how many *logical* CPUs we have, we can use `os.cpu_count()`:" ] }, { "cell_type": "code", "execution_count": 113, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "8" ] }, "execution_count": 113, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import os\n", "os.cpu_count()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So far we have used the default value of the hyperparameter C, which is the inverse of the regularization strength. However, to try to get better model performance, we should try different values. Because hyperparameter tuning is such a common task, there are standard methods in `sklearn` that we can use. We can tune the parameters by using three sets: the training set to build the model, the validation set to select the parameters of the model, and the test set to evaluate the performance of the selected parameters. \n", "After selecting the best parameters using the validation set, **we can rebuild a model using the parameter settings we found, but now training on both the training data and the validation data.** This way, we can use as much data as possible to build our model. The performance of the final model is evaluated on the test set." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grid search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The most commonly used method is grid search, which basically means trying all possible combinations of the\n", "parameters of interest."
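, "

As a small illustration of what trying all possible combinations means, the sketch below only enumerates the candidate settings for two logistic regression hyperparameters (the variable names are hypothetical, and no model is trained here):

```python
from itertools import product

# Every value of one parameter is paired with every value of the other.
param_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.001, 0.01, 0.1, 1, 10],
}

names = list(param_grid)
combinations = [dict(zip(names, values)) for values in product(*param_grid.values())]

print(len(combinations))  # 2 * 5 = 10 candidate settings
print(combinations[0])
```

The number of candidates is the product of the list lengths, so the cost of grid search grows quickly as more parameters or values are added."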
] }, { "cell_type": "code", "execution_count": 117, "metadata": {}, "outputs": [], "source": [ "X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, random_state=0, test_size=0.2)\n", "# split train+validation set into training and validation sets\n", "X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, random_state=1, test_size=0.3)" ] }, { "cell_type": "code", "execution_count": 118, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size of training set: 84, size of validation set: 36 size of test set 30\n" ] } ], "source": [ "print(f'Size of training set: {X_train.shape[0]}, size of validation set: {X_val.shape[0]} size of test set {X_test.shape[0]}')" ] }, { "cell_type": "code", "execution_count": 119, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "{'C': 10, 'penalty': 'l1'}\n" ] } ], "source": [ "best_score = 0\n", "for penalty_value in ['l1','l2']:\n", " for C_value in [0.001, 0.01, 0.1, 1, 10]:\n", " # for each combination of parameters, train a model\n", " log_reg = LogisticRegression(solver=\"liblinear\", random_state=42, penalty=penalty_value, C=C_value, max_iter=10000)\n", " log_reg.fit(X_train, y_train)\n", " # evaluate the model on the validation set\n", " score = log_reg.score(X_val, y_val)\n", " # if we got a better score, store the score and parameters\n", " if score > best_score:\n", " best_score = score\n", " best_parameters = {'C': C_value, 'penalty': penalty_value}\n", "print(best_parameters)" ] }, { "cell_type": "code", "execution_count": 120, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best score on validation set 0.9722222222222222\n", "Test set score with best parameters 0.9666666666666667\n" ] } ], "source": [ "# rebuild a model on the combined training and validation set,\n", "# and evaluate it on the test set\n", "log_reg = LogisticRegression(solver=\"liblinear\", random_state=42, 
penalty=best_parameters['penalty'], C=best_parameters['C'] , max_iter=10000)\n", "log_reg.fit(X_trainval, y_trainval)\n", "test_score = log_reg.score(X_test, y_test)\n", "print('Best score on validation set', best_score)\n", "print(\"Test set score with best parameters\", test_score)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The distinction between the training set, validation set, and test set is fundamentally important to applying machine learning methods in practice. Any choices made based on the test set accuracy \"leak\" information from the test set into the model.\n", "Therefore, it is important to keep a separate test set, which is only used for the final evaluation. It is good practice to do all exploratory analysis and model selection using the combination of a training and a validation set, and reserve the test set for a final evaluation." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "While the method of splitting the data into a training, a validation, and a test set that\n", "we just saw is workable, and relatively commonly used, it is quite sensitive to how\n", "exactly the data is split. For a better estimate of the generalization performance, instead of\n", "using a single split into a training and a validation set, we can use cross-validation to\n", "evaluate the performance of each parameter combination. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Because grid search with cross-validation is such a commonly used method to adjust parameters, `sklearn` provides the `GridSearchCV` class, which implements it in the form of an estimator. To use the `GridSearchCV` class, we first need to specify the\n", "parameters we want to search over using a dictionary. `GridSearchCV` will then perform all the necessary model fits. 
The keys of the dictionary are the names of parameters we want to adjust (as given when constructing the model, in this case, *C* and\n", "*penalty*), and the values are the parameter settings we want to try out. The default metric for evaluation used will be the default score of the estimator, but just like in `cross_val_score`, a different `scoring` metric can be set." ] }, { "cell_type": "code", "execution_count": 124, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV" ] }, { "cell_type": "code", "execution_count": 125, "metadata": {}, "outputs": [], "source": [ "param_grid = {'C': [0.001, 0.01, 0.1, 1, 10],\n", " 'penalty': ['l1','l2']}" ] }, { "cell_type": "code", "execution_count": 126, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.2)" ] }, { "cell_type": "code", "execution_count": 127, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Size of training set: 120, size of test set 30\n" ] } ], "source": [ "print(f'Size of training set: {X_train.shape[0]}, size of test set {X_test.shape[0]}')" ] }, { "cell_type": "code", "execution_count": 128, "metadata": {}, "outputs": [], "source": [ "grid_search = GridSearchCV(LogisticRegression(solver=\"liblinear\", random_state=42, max_iter=10000), param_grid, cv=5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The grid_search object that we created behaves just like a classifier; we can call the standard methods `fit`, `predict`, and `score` on it. However, when we call `fit`, it will run cross-validation for each combination of parameters we specified in the parameter grid. We have 5 different values for *C*, and 2 for *penalty*, giving in total 10 combinations. For each combination 5-fold CV is used, meaning in total 50 models are trained. And here we only have two parameters we want to tune. 
Hence, running a grid search over many parameters and on large datasets can be computationally challenging. One way to speed things up is parallelization. Since evaluating a particular parameter setting on a particular cross-validation split is completely independent of the other parameter settings and splits, the work can again be parallelized, just as with `cross_val_score`. In `GridSearchCV` we can also use the parameter `n_jobs` to define how many jobs are run in parallel." ] }, { "cell_type": "code", "execution_count": 130, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
GridSearchCV(cv=5,\n",
       "             estimator=LogisticRegression(max_iter=10000, random_state=42,\n",
       "                                          solver='liblinear'),\n",
       "             param_grid={'C': [0.001, 0.01, 0.1, 1, 10],\n",
       "                         'penalty': ['l1', 'l2']})
" ], "text/plain": [ "GridSearchCV(cv=5,\n", " estimator=LogisticRegression(max_iter=10000, random_state=42,\n", " solver='liblinear'),\n", " param_grid={'C': [0.001, 0.01, 0.1, 1, 10],\n", " 'penalty': ['l1', 'l2']})" ] }, "execution_count": 130, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that this is fit, we can ask for the best parameters as follows:" ] }, { "cell_type": "code", "execution_count": 132, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'C': 10, 'penalty': 'l2'}" ] }, "execution_count": 132, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The best cross-validation accuracy (the mean accuracy over the different splits for this parameter setting) is stored in `best_score_`:" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.975" ] }, "execution_count": 134, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.best_score_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results of a grid search can be found\n", "in the `cv_results_` attribute, which is a dictionary storing all aspects of the search. We can convert it to a pandas dataframe to view it:" ] }, { "cell_type": "code", "execution_count": 136, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " mean_fit_time std_fit_time mean_score_time std_score_time param_C \\\n", "0 0.002927 0.001664 0.003122 0.001338 0.001 \n", "1 0.004115 0.002087 0.003950 0.001974 0.001 \n", "2 0.003439 0.002083 0.001141 0.000825 0.010 \n", "\n", " param_penalty params split0_test_score \\\n", "0 l1 {'C': 0.001, 'penalty': 'l1'} 0.333333 \n", "1 l2 {'C': 0.001, 'penalty': 'l2'} 0.333333 \n", "2 l1 {'C': 0.01, 'penalty': 'l1'} 0.333333 \n", "\n", " split1_test_score split2_test_score split3_test_score split4_test_score \\\n", "0 0.333333 0.333333 0.333333 0.333333 \n", "1 0.333333 0.333333 0.291667 0.333333 \n", "2 0.333333 0.333333 0.375000 0.375000 \n", "\n", " mean_test_score std_test_score rank_test_score \n", "0 0.333333 0.000000 9 \n", "1 0.325000 0.016667 10 \n", "2 0.350000 0.020412 8 " ] }, "execution_count": 136, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = pd.DataFrame(grid_search.cv_results_)\n", "results.head(3)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "By default an estimator is retrained using the best found parameters on the whole train dataset. 
We can access the model with the `.best_estimator_` attribute and test its performance on the test set:" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9666666666666667" ] }, "execution_count": 138, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.best_estimator_.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or we could have just used the `score` method directly on `grid_search`:" ] }, { "cell_type": "code", "execution_count": 140, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9666666666666667" ] }, "execution_count": 140, "metadata": {}, "output_type": "execute_result" } ], "source": [ "grid_search.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The important thing here is that **we did not use the test set to choose the parameters**, meaning the final test set was only used for the final model evaluation. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Randomized search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we had many different parameters and many values to evaluate, randomized search would be a better option than grid search. In contrast to `GridSearchCV`, not all parameter values are tried out, but rather a fixed number of parameter settings is sampled from the specified distributions. The number of parameter settings that are tried is given by *n_iter*." 
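, "

The sampling idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not how `sklearn` implements it: the helper `sample_setting`, the seed, and the ranges are hypothetical choices made for this sketch.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_iter = 10             # number of parameter settings to sample

def sample_setting(rng):
    # Hypothetical helper: draw one candidate setting.
    # C is drawn log-uniformly from [1e-4, 10] by sampling its exponent uniformly.
    C = 10 ** rng.uniform(-4, 1)
    # penalty is picked uniformly from a fixed list of options
    penalty = rng.choice(['l1', 'l2'])
    return {'C': C, 'penalty': penalty}

# Only n_iter settings are evaluated, however large the search space is
settings = [sample_setting(rng) for _ in range(n_iter)]
for setting in settings:
    print(setting)
```

Unlike a grid, the cost of the search is fixed by `n_iter` rather than by the number of possible value combinations."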
 ] }, { "cell_type": "code", "execution_count": 144, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import RandomizedSearchCV" ] }, { "cell_type": "code", "execution_count": 145, "metadata": {}, "outputs": [], "source": [ "from scipy.stats import loguniform\n", "\n", "distributions = { 'C': loguniform(0.0001, 10), \n", " 'penalty': ['l1', 'l2']}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a loguniform distribution instead of a regular uniform distribution ensures that, over a sufficiently large number of trials, roughly the same number of samples will be drawn from the [0.0001, 0.001] range as from, for example, the [1, 10] range." ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'C': 4.600306804490298, 'penalty': 'l2'}" ] }, "execution_count": 147, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_search = RandomizedSearchCV(LogisticRegression(solver=\"liblinear\", random_state=42, max_iter=10000), distributions, cv=5, n_iter=10, random_state=1)\n", "random_search.fit(X_train, y_train)\n", "random_search.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Running the search with a different seed, especially with a small number of iterations, might lead to different parameters being selected. 
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we evaluate the performance of the model with the best parameters on the test set:" ] }, { "cell_type": "code", "execution_count": 150, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9666666666666667" ] }, "execution_count": 150, "metadata": {}, "output_type": "execute_result" } ], "source": [ "random_search.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like in `GridSearchCV` we could change the `scoring` method, use parallelization with `n_jobs`, view `cv_results_` and access the model through `best_estimator`.\n", "\n", "If we had a large number of parameters, would me much more practical choice than `GridSearchCV`. However, there are more efficient approached than `RandomizedSearchCV`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Successive Halving search" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Sklearn* also provides the *HalvingGridSearchCV* and *HalvingRandomSearchCV* estimators that can be used to search a parameter space using successive halving . Successive halving (SH) is like a tournament among candidate hyperparameter combinations. It is an iterative selection process where all candidates (the hyperparameter combinations) are evaluated with a small amount of resources at the first iteration. Only some of these candidates are selected for the next iteration, which will be allocated more resources. For parameter tuning, the resource is typically the number of training samples, but it can also be an arbitrary numeric parameter." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can summarize the procedure via the following steps:\n", "\n", "1. Draw a large set of candidate configurations (hyperparameter combinations) via random sampling.\n", "2. 
Train the models (i.e., each set of hyperparameter combinations) with limited resources, for example, a small subset of the training data (as opposed to using the entire training set).\n", "3. Discard the bottom 50 percent (typical value) based on predictive performance.\n", "4. Go back to step 2 with an increased amount of available resources for each of the surviving configurations.\n", "\n", "The steps are repeated until only one hyperparameter configuration remains." ] }, { "cell_type": "code", "execution_count": 155, "metadata": {}, "outputs": [], "source": [ "from sklearn.experimental import enable_halving_search_cv" ] }, { "cell_type": "code", "execution_count": 156, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'C': 4.600306804490298, 'penalty': 'l2'}" ] }, "execution_count": 156, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import HalvingRandomSearchCV\n", "hs = HalvingRandomSearchCV(LogisticRegression(solver=\"liblinear\", random_state=42, max_iter=10000), \n", " param_distributions=distributions, \n", "# specifies the resource that will be allocated to each candidate configuration during evaluation. \n", "# In this case, it's set to 'n_samples', indicating that the number of training samples will be used as the resource. 
\n", "# Each candidate configuration will be trained and evaluated using a subset of the training data.\n", " resource='n_samples', \n", "# select half of candidates in each iteration % (default is 3, select 1/3 of candidates) \n", " factor=2, \n", "# The number of candidate parameters to sample, at the first iteration.\n", " n_candidates=10,\n", " random_state=1)\n", "\n", "hs.fit(X_train, y_train)\n", "hs.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Again, we evaluate the performance of the model with the best parameters on the test set:" ] }, { "cell_type": "code", "execution_count": 158, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9666666666666667" ] }, "execution_count": 158, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hs.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `cv_results_` attribute again contains useful information for analyzing the results of a search." ] }, { "cell_type": "code", "execution_count": 160, "metadata": {}, "outputs": [], "source": [ "pd.set_option('display.max_columns', None)" ] }, { "cell_type": "code", "execution_count": 161, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " 
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
\n", "
" ], "text/plain": [ " iter n_resources mean_fit_time std_fit_time mean_score_time \\\n", "0 0 30 0.004911 0.001739 0.003597 \n", "1 0 30 0.009551 0.006414 0.008887 \n", "2 0 30 0.011571 0.016472 0.004193 \n", "3 0 30 0.006710 0.006244 0.001695 \n", "4 0 30 0.007452 0.003027 0.012958 \n", "5 0 30 0.003607 0.003976 0.005699 \n", "6 0 30 0.004812 0.003195 0.000630 \n", "7 0 30 0.004658 0.001181 0.005618 \n", "8 0 30 0.004179 0.003769 0.004323 \n", "9 0 30 0.008003 0.005219 0.002800 \n", "10 1 60 0.005331 0.007196 0.009976 \n", "11 1 60 0.006290 0.005508 0.004830 \n", "12 1 60 0.006185 0.004579 0.003850 \n", "13 1 60 0.005531 0.005763 0.007023 \n", "14 1 60 0.007699 0.002374 0.003544 \n", "15 2 120 0.007281 0.002673 0.007458 \n", "16 2 120 0.005487 0.003290 0.006534 \n", "17 2 120 0.006314 0.004630 0.002907 \n", "\n", " std_score_time param_C param_penalty \\\n", "0 0.002706 0.012165 l1 \n", "1 0.011955 4.600307 l2 \n", "2 0.004823 0.003248 l2 \n", "3 0.001438 0.001515 l2 \n", "4 0.010096 0.000854 l2 \n", "5 0.002087 0.223218 l2 \n", "6 0.001260 0.049441 l1 \n", "7 0.004366 0.003684 l1 \n", "8 0.006101 0.001053 l1 \n", "9 0.003768 0.001406 l1 \n", "10 0.009796 0.003684 l1 \n", "11 0.005275 0.001053 l1 \n", "12 0.004312 0.001406 l1 \n", "13 0.005831 0.223218 l2 \n", "14 0.001680 4.600307 l2 \n", "15 0.003215 0.001406 l1 \n", "16 0.005949 0.223218 l2 \n", "17 0.001485 4.600307 l2 \n", "\n", " params split0_test_score \\\n", "0 {'C': 0.012164941464151846, 'penalty': 'l1'} 0.500000 \n", "1 {'C': 4.600306804490298, 'penalty': 'l2'} 0.833333 \n", "2 {'C': 0.003248350345086679, 'penalty': 'l2'} 0.166667 \n", "3 {'C': 0.0015151125123102904, 'penalty': 'l2'} 0.166667 \n", "4 {'C': 0.0008536916958038761, 'penalty': 'l2'} 0.166667 \n", "5 {'C': 0.223218499287176, 'penalty': 'l2'} 0.666667 \n", "6 {'C': 0.04944059287398676, 'penalty': 'l1'} 0.166667 \n", "7 {'C': 0.0036844068804921994, 'penalty': 'l1'} 0.500000 \n", "8 {'C': 0.001052594868979971, 'penalty': 'l1'} 0.500000 \n", "9 
{'C': 0.0014056787147388348, 'penalty': 'l1'} 0.500000 \n", "10 {'C': 0.0036844068804921994, 'penalty': 'l1'} 0.333333 \n", "11 {'C': 0.001052594868979971, 'penalty': 'l1'} 0.333333 \n", "12 {'C': 0.0014056787147388348, 'penalty': 'l1'} 0.333333 \n", "13 {'C': 0.223218499287176, 'penalty': 'l2'} 0.750000 \n", "14 {'C': 4.600306804490298, 'penalty': 'l2'} 0.916667 \n", "15 {'C': 0.0014056787147388348, 'penalty': 'l1'} 0.333333 \n", "16 {'C': 0.223218499287176, 'penalty': 'l2'} 0.833333 \n", "17 {'C': 4.600306804490298, 'penalty': 'l2'} 0.958333 \n", "\n", " split1_test_score split2_test_score split3_test_score \\\n", "0 0.500000 0.333333 0.333333 \n", "1 1.000000 0.833333 0.666667 \n", "2 0.166667 0.333333 0.166667 \n", "3 0.166667 0.333333 0.166667 \n", "4 0.166667 0.333333 0.166667 \n", "5 0.666667 0.666667 0.500000 \n", "6 0.166667 0.333333 0.166667 \n", "7 0.500000 0.333333 0.333333 \n", "8 0.500000 0.333333 0.333333 \n", "9 0.500000 0.333333 0.333333 \n", "10 0.333333 0.333333 0.333333 \n", "11 0.333333 0.333333 0.333333 \n", "12 0.333333 0.333333 0.333333 \n", "13 0.666667 0.666667 0.833333 \n", "14 1.000000 0.833333 0.916667 \n", "15 0.333333 0.333333 0.333333 \n", "16 0.916667 0.958333 0.875000 \n", "17 1.000000 0.958333 0.958333 \n", "\n", " split4_test_score mean_test_score std_test_score rank_test_score \\\n", "0 0.500000 0.433333 0.081650 7 \n", "1 1.000000 0.866667 0.124722 4 \n", "2 0.333333 0.233333 0.081650 15 \n", "3 0.333333 0.233333 0.081650 15 \n", "4 0.333333 0.233333 0.081650 15 \n", "5 0.833333 0.666667 0.105409 6 \n", "6 0.333333 0.233333 0.081650 15 \n", "7 0.500000 0.433333 0.081650 7 \n", "8 0.500000 0.433333 0.081650 7 \n", "9 0.500000 0.433333 0.081650 7 \n", "10 0.416667 0.350000 0.033333 11 \n", "11 0.416667 0.350000 0.033333 11 \n", "12 0.416667 0.350000 0.033333 11 \n", "13 1.000000 0.783333 0.124722 5 \n", "14 1.000000 0.933333 0.062361 2 \n", "15 0.333333 0.333333 0.000000 14 \n", "16 0.958333 0.908333 0.048591 3 \n", "17 0.958333 
0.966667 0.016667 1 \n", "\n", " split0_train_score split1_train_score split2_train_score \\\n", "0 0.333333 0.250000 0.333333 \n", "1 0.916667 0.875000 0.875000 \n", "2 0.500000 0.541667 0.458333 \n", "3 0.500000 0.541667 0.458333 \n", "4 0.500000 0.541667 0.458333 \n", "5 0.833333 0.791667 0.791667 \n", "6 0.500000 0.541667 0.458333 \n", "7 0.333333 0.250000 0.333333 \n", "8 0.333333 0.250000 0.333333 \n", "9 0.333333 0.250000 0.333333 \n", "10 0.312500 0.291667 0.375000 \n", "11 0.312500 0.291667 0.375000 \n", "12 0.312500 0.291667 0.375000 \n", "13 0.812500 0.791667 0.791667 \n", "14 0.958333 0.958333 0.958333 \n", "15 0.333333 0.333333 0.333333 \n", "16 0.947917 0.927083 0.937500 \n", "17 0.958333 0.968750 0.968750 \n", "\n", " split3_train_score split4_train_score mean_train_score std_train_score \n", "0 0.333333 0.291667 0.308333 0.033333 \n", "1 0.958333 0.958333 0.916667 0.037268 \n", "2 0.458333 0.458333 0.483333 0.033333 \n", "3 0.458333 0.458333 0.483333 0.033333 \n", "4 0.458333 0.458333 0.483333 0.033333 \n", "5 0.791667 0.750000 0.791667 0.026352 \n", "6 0.458333 0.458333 0.483333 0.033333 \n", "7 0.333333 0.291667 0.308333 0.033333 \n", "8 0.333333 0.291667 0.308333 0.033333 \n", "9 0.333333 0.291667 0.308333 0.033333 \n", "10 0.312500 0.312500 0.320833 0.028260 \n", "11 0.312500 0.312500 0.320833 0.028260 \n", "12 0.312500 0.312500 0.320833 0.028260 \n", "13 0.937500 0.854167 0.837500 0.054962 \n", "14 1.000000 0.979167 0.970833 0.016667 \n", "15 0.333333 0.333333 0.333333 0.000000 \n", "16 0.916667 0.895833 0.925000 0.017922 \n", "17 0.979167 0.958333 0.966667 0.007795 " ] }, "execution_count": 161, "metadata": {}, "output_type": "execute_result" } ], "source": [ "results = pd.DataFrame(hs.cv_results_)\n", "results" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the results we can see that in the first round (iter 0), 10 different hyperparameter combinations were used. 
Since a factor of 2 was specified, only 5 combinations were selected for the next round with double the number of resources, i.e., the number of training data points used increased from 30 to 60. Then, half of these 5 combinations (rounded up to 3) were selected for the final round, where the models were trained on all 120 training samples. Out of those 3 combinations, the one with the highest *mean_test_score* (lowest `rank_test_score`) was selected as the best one." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Other libraries for hyperparameter tuning" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since hyperparameter tuning is a very important step in modeling, many different approaches and libraries have been developed to speed up model tuning when there are many different parameters to be tuned. Here, we will illustrate one such simple library, `scikit-optimize`, which implements Bayesian optimization. Bayesian optimization considers the previously seen hyperparameter combinations when determining the next set of hyperparameters to evaluate. This can reduce the number of evaluations needed to identify good hyperparameters, making it effective for optimizing complex models.\n", "Let's install the `scikit-optimize` library with:\n", "\n", "`pip install scikit-optimize`\n", "\n", "Note that, when importing, this library is referred to as `skopt`." ] }, { "cell_type": "code", "execution_count": 165, "metadata": {}, "outputs": [], "source": [ "from skopt import BayesSearchCV\n", "from skopt.space import Real " ] }, { "cell_type": "code", "execution_count": 166, "metadata": {}, "outputs": [], "source": [ "distributions = {\n", " 'penalty': ['l1','l2'],\n", " 'C': Real(low=1e-4, high=10, prior='log-uniform'),\n", "}" ] }, { "cell_type": "code", "execution_count": 167, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
BayesSearchCV(estimator=LogisticRegression(max_iter=10000, random_state=42,\n",
       "                                           solver='liblinear'),\n",
       "              n_iter=10, random_state=1,\n",
       "              search_spaces={'C': Real(low=0.0001, high=10, prior='log-uniform', transform='normalize'),\n",
       "                             'penalty': ['l1', 'l2']})
" ], "text/plain": [ "BayesSearchCV(estimator=LogisticRegression(max_iter=10000, random_state=42,\n", " solver='liblinear'),\n", " n_iter=10, random_state=1,\n", " search_spaces={'C': Real(low=0.0001, high=10, prior='log-uniform', transform='normalize'),\n", " 'penalty': ['l1', 'l2']})" ] }, "execution_count": 167, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.int = int\n", "opt = BayesSearchCV(\n", " LogisticRegression(solver=\"liblinear\", random_state=42, max_iter=10000),\n", " distributions,\n", " n_iter=10,\n", " random_state=1\n", ")\n", "\n", "opt.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see the best parameters:" ] }, { "cell_type": "code", "execution_count": 169, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "OrderedDict([('C', 2.4458265756334576), ('penalty', 'l1')])" ] }, "execution_count": 169, "metadata": {}, "output_type": "execute_result" } ], "source": [ "opt.best_params_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "And, as before, we can access the best model with:" ] }, { "cell_type": "code", "execution_count": 171, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
LogisticRegression(C=2.4458265756334576, max_iter=10000, penalty='l1',\n",
       "                   random_state=42, solver='liblinear')
" ], "text/plain": [ "LogisticRegression(C=2.4458265756334576, max_iter=10000, penalty='l1',\n", " random_state=42, solver='liblinear')" ] }, "execution_count": 171, "metadata": {}, "output_type": "execute_result" } ], "source": [ "opt.best_estimator_" ] }, { "cell_type": "code", "execution_count": 172, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.9666666666666667" ] }, "execution_count": 172, "metadata": {}, "output_type": "execute_result" } ], "source": [ "opt.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Practice question\n", "Using sklearn's make_regression, generate a toy dataset:\n", "\n", "X, y = make_regression(n_samples=100, n_features=10, noise=20, n_informative=5, random_state=1), \n", "\n", "split it into train and test. Then use Ridge regression with grid search cross validation to find the value of $\\alpha$ that minimizes the mean square error. Then evaluate this model on the test set. Hint: check the `scoring` parameter and the score method *'neg_root_mean_squared_error'*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Note" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We have mentioned that no modeling decision should be done on the test dataset, and that we use validation data to make model selection and parameter tuning. Let's think about the preprocessing steps such as scaling. If we can use the `fit` method only on the train dataset, and not the validation dataset, how do we apply CV, without applying `fit` to the testing fold?\n", "\n", "Next, we will introduce `pipelines` which will help us with this issue. But without `pipelines`, in the above code, we did not apply the scaling stage before cross validation, even for the regularized models." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Algorithm chains and pipelines" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The following exercise is adapted from Chapter 6 of *Introduction to Machine Learning with Python* by Andreas C. Müller, Sarah Guido.\n", "\n", "In this section we will review how to chain together many different processing steps and machine learning models by using the *Pipeline* class. \n", " \n", "Let's start with an example of using a scaler before applying a machine learning model, this time to do regression. We will use the California housing dataset from `sklearn`, and apply Ridge Regression. Before applying the model, we will split the data into train and test, fit the `MinMaxScaler` on the training data, and then transform both the training and the testing data. Note that we are scaling here to ensure that the gradient descent moves smoothly towards the minimum and that the steps for gradient descent are updated at the same rate for all the features." ] }, { "cell_type": "code", "execution_count": 178, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ ".. 
_california_housing_dataset:\n", "\n", "California Housing dataset\n", "--------------------------\n", "\n", "**Data Set Characteristics:**\n", "\n", ":Number of Instances: 20640\n", "\n", ":Number of Attributes: 8 numeric, predictive attributes and the target\n", "\n", ":Attribute Information:\n", " - MedInc median income in block group\n", " - HouseAge median house age in block group\n", " - AveRooms average number of rooms per household\n", " - AveBedrms average number of bedrooms per household\n", " - Population block group population\n", " - AveOccup average number of household members\n", " - Latitude block group latitude\n", " - Longitude block group longitude\n", "\n", ":Missing Attribute Values: None\n", "\n", "This dataset was obtained from the StatLib repository.\n", "https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html\n", "\n", "The target variable is the median house value for California districts,\n", "expressed in hundreds of thousands of dollars ($100,000).\n", "\n", "This dataset was derived from the 1990 U.S. census, using one row per census\n", "block group. A block group is the smallest geographical unit for which the U.S.\n", "Census Bureau publishes sample data (a block group typically has a population\n", "of 600 to 3,000 people).\n", "\n", "A household is a group of people residing within a home. Since the average\n", "number of rooms and bedrooms in this dataset are provided per household, these\n", "columns may take surprisingly large values for block groups with few households\n", "and many empty houses, such as vacation resorts.\n", "\n", "It can be downloaded/loaded using the\n", ":func:`sklearn.datasets.fetch_california_housing` function.\n", "\n", ".. rubric:: References\n", "\n", "- Pace, R. 
Kelley and Ronald Barry, Sparse Spatial Autoregressions,\n", " Statistics and Probability Letters, 33 (1997) 291-297\n", "\n" ] } ], "source": [ "from sklearn.linear_model import Ridge\n", "from sklearn.metrics import classification_report\n", "\n", "# load the data\n", "from sklearn.datasets import fetch_california_housing\n", "housing = fetch_california_housing()\n", "print(housing.DESCR)" ] }, { "cell_type": "code", "execution_count": 179, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Ridge(alpha=0.01)
" ], "text/plain": [ "Ridge(alpha=0.01)" ] }, "execution_count": 179, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train, X_test, y_train, y_test = train_test_split(housing.data, housing.target, random_state=0)\n", "# compute minimum and maximum on the training data\n", "scaler = MinMaxScaler().fit(X_train)\n", "# rescale the training data\n", "X_train_scaled = scaler.transform(X_train)\n", "X_test_scaled = scaler.transform(X_test)\n", "\n", "\n", "ridge = Ridge(alpha=0.01)\n", "ridge.fit(X_train_scaled, y_train)" ] }, { "cell_type": "code", "execution_count": 180, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Mean squared error: 0.540554646703696\n", "Root mean squared error: 0.7352242152593289\n", "R2: 0.5910622391340353\n" ] } ], "source": [ "from sklearn.metrics import mean_squared_error, root_mean_squared_error\n", "from sklearn.metrics import r2_score\n", "\n", "y_pred=ridge.predict(X_test_scaled)\n", "print('Mean squared error:', mean_squared_error(y_test, y_pred))\n", "print('Root mean squared error:', root_mean_squared_error(y_test, y_pred))\n", "print('R2: ', r2_score(y_test, y_pred))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " For `RidgeRegression` the default scoring is the R2. Hence, if we were only interested in R2, we could have simply used the *score* function:" ] }, { "cell_type": "code", "execution_count": 182, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5910622391340353" ] }, "execution_count": 182, "metadata": {}, "output_type": "execute_result" } ], "source": [ "ridge.score(X_test_scaled, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's say we want to find better parameters for the ridge using *GridSearchCV*. \n", " \n", "A naive and WRONG approach to doing a grid search with data scaling might look like this:" ] }, { "cell_type": "code", "execution_count": 184, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
GridSearchCV(cv=5, estimator=Ridge(),\n",
       "             param_grid={'alpha': [0.005, 0.01, 0.05, 0.1, 0.5, 1]},\n",
       "             scoring='neg_root_mean_squared_error')
" ], "text/plain": [ "GridSearchCV(cv=5, estimator=Ridge(),\n", " param_grid={'alpha': [0.005, 0.01, 0.05, 0.1, 0.5, 1]},\n", " scoring='neg_root_mean_squared_error')" ] }, "execution_count": 184, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.model_selection import GridSearchCV\n", "# for illustration purposes only, don't use this code!\n", "param_grid = {'alpha': [0.005, 0.01, 0.05, 0.1, 0.5, 1] }\n", "grid = GridSearchCV(Ridge(), param_grid=param_grid, cv=5, scoring='neg_root_mean_squared_error')\n", "grid.fit(X_train_scaled, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we ran the grid search over the parameter of *Ridge* using the scaled data. However, there is a subtle catch in what we just did. When scaling the data, we used ALL the data in the training set to compute the minimum and maximum of the data. We then used the scaled training data to run our grid search using cross-validation. For each split in the cross-validation, some part of the original training set will be declared the training part of the split, and some the test part of the split. The test part is used to measure the performance of a model trained on the training part when applied to new data. However, we already used the information contained in the test part of the split, when scaling the data. \n", "\n", "So, the splits in the cross-validation no longer correctly mirror how new data will look to the modeling process. We already leaked information from these parts of the data into our modeling process. This will lead to overly optimistic results during cross-validation, and possibly the selection of sub-optimal parameters.\n", "\n", "To get around this problem, the splitting of the dataset during cross-validation should be done BEFORE doing any pre-processing. 
Any step that extracts knowledge from the dataset should only ever be fit on the training portion of the dataset, and must therefore be contained inside the cross-validation loop.\n", "\n", "To achieve this in `sklearn` with the `GridSearchCV` class, we can use the `Pipeline` class. The `Pipeline` class allows \"gluing\" together multiple processing steps into a single `sklearn` estimator.\n", "\n", "The `Pipeline` class itself has `fit`, `predict`, and `score` methods and behaves just like any other model in `sklearn`. The most common use case of the `Pipeline` class is in chaining pre-processing steps (like scaling of the data) together with a supervised model like a classifier." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Building Pipelines\n", "First, we build a pipeline object by providing it with a list of steps. Each step is a tuple containing a name (any string of our choosing) and an instance of an estimator:" ] }, { "cell_type": "code", "execution_count": 187, "metadata": {}, "outputs": [], "source": [ "from sklearn.pipeline import Pipeline\n", "pipe = Pipeline([('scaler', MinMaxScaler()), \n", " ('ridge', Ridge(alpha=0.01))])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Just like for `ColumnTransformer`, when the `sklearn` display setting is set to diagram, with `set_config(display='diagram')`, we can visualize the pipeline. Note that clicking on the diagram below allows us to see the details of each step:" ] }, { "cell_type": "code", "execution_count": 189, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('scaler', MinMaxScaler()), ('ridge', Ridge(alpha=0.01))])
" ], "text/plain": [ "Pipeline(steps=[('scaler', MinMaxScaler()), ('ridge', Ridge(alpha=0.01))])" ] }, "execution_count": 189, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, we created two steps: the first, called \"scaler\", is an instance of *MinMaxScaler*, and the second, called \"ridge\", is an instance of *Ridge*. Now, we can fit the pipeline, like any other `sklearn` estimator:" ] }, { "cell_type": "code", "execution_count": 191, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('scaler', MinMaxScaler()), ('ridge', Ridge(alpha=0.01))])
" ], "text/plain": [ "Pipeline(steps=[('scaler', MinMaxScaler()), ('ridge', Ridge(alpha=0.01))])" ] }, "execution_count": 191, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.fit(X_train, y_train)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here, *pipe.fit* first calls fit on the first step (the scaler), then transforms the training data using the scaler, and finally fits the *Ridge* with the scaled data. To evaluate on the test data with the default scoring metric of the regressor, we simply call *pipe.score*:" ] }, { "cell_type": "code", "execution_count": 193, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5910622391340353" ] }, "execution_count": 193, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.score(X_test, y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Calling the `score` method on the pipeline first transforms the test data using the scaler, and then calls the `score` method on the `Ridge` using the scaled test data. As we can see, the result is identical to the one we got from the code by doing the transformations by hand. Using the pipeline, we reduced the code needed for our \"preprocessing + classification\" process. The main benefit of using the pipeline, however, is that we can now use this single estimator in *GridSearchCV*." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using a pipeline in a grid search works the same way as using any other estimator. We define a parameter grid to search over, and construct a *GridSearchCV* from the pipeline and the parameter grid. When specifying the parameter grid, there is a slight change, though. We need to specify for each parameter which step of the pipeline it belongs to. The parameter that we want to adjust, *alpha*, is the parameter of *Ridge*, the second step. We gave this step the name \"ridge\". 
The syntax to define a parameter grid for a pipeline is to specify for each parameter the step name, followed by `__` (a double underscore), followed by the parameter name. To search over the *alpha* parameter of \"ridge\" we therefore have to use `ridge__alpha` as the key in the parameter grid dictionary.\n", "\n", "We could also use the method `get_params` to see the names of all the parameters of an estimator:" ] }, { "cell_type": "code", "execution_count": 196, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'memory': None,\n", " 'steps': [('scaler', MinMaxScaler()), ('ridge', Ridge(alpha=0.01))],\n", " 'transform_input': None,\n", " 'verbose': False,\n", " 'scaler': MinMaxScaler(),\n", " 'ridge': Ridge(alpha=0.01),\n", " 'scaler__clip': False,\n", " 'scaler__copy': True,\n", " 'scaler__feature_range': (0, 1),\n", " 'ridge__alpha': 0.01,\n", " 'ridge__copy_X': True,\n", " 'ridge__fit_intercept': True,\n", " 'ridge__max_iter': None,\n", " 'ridge__positive': False,\n", " 'ridge__random_state': None,\n", " 'ridge__solver': 'auto',\n", " 'ridge__tol': 0.0001}" ] }, "execution_count": 196, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This shows which parameters of the pipeline we built can be tuned. Parameters of the scaling step start with `scaler__`, and parameters of the ridge step start with `ridge__`. Note that *scaler* and *ridge* are just the names we gave to the steps of the pipeline; we could have chosen any other names." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's define the parameters that we wish to tune:" ] }, { "cell_type": "code", "execution_count": 199, "metadata": {}, "outputs": [], "source": [ "param_grid = {'ridge__alpha': [0.005, 0.01, 0.05, 0.1, 0.5, 1]}" ] }, { "cell_type": "code", "execution_count": 200, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best cross-validation score (negative RMSE): -0.7243104320018825\n", "Test set score: -0.7363007117320173\n", "Best parameters: {'ridge__alpha': 0.1}\n" ] } ], "source": [ "grid = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring='neg_root_mean_squared_error')\n", "grid.fit(X_train, y_train)\n", "y_pred=grid.predict(X_test)\n", "\n", "print('Best cross-validation score (negative RMSE): ', grid.best_score_)\n", "print('Test set score: ', grid.score(X_test, y_test))\n", "print('Best parameters: ', grid.best_params_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In contrast to the grid search we did before, now for each split in the cross-validation, the *MinMaxScaler* is refit with ONLY the training splits and no information is leaked from the test split into the parameter search." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In this case since scaling is used to ensure smooth convergence of the gradient descent, there is not much difference when using the wrong and correct way of scaling. However, in many cases, if data leakage occurs it may actually lead to choosing sub-optimal parameters." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## The General Pipeline Interface\n", "The `Pipeline` class is not restricted to pre-processing and classification or regression, but can, in fact, join any number of estimators together. For example, we could build a pipeline containing feature extraction, feature selection, scaling, and classification, for a total of four steps. 
Similarly, the last step could be clustering instead of classification.\n", "\n", "The only requirement for estimators in a pipeline is that all but the last step need to have a `transform` method, so they can produce a new representation of the data that can be used in the next step.\n", "\n", "Internally, during the call to `Pipeline.fit`, the pipeline calls `fit` and then `transform` on each step in turn, with the input given by the output of the transform method of the previous step. For the last step in the pipeline, just `fit` is called." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`Pipeline` also works correctly with `sklearn` transformers whose `fit_transform` is not equivalent to `fit` followed by `transform`. For example, when `TargetEncoder` is a part of a `Pipeline` and the pipeline is fitted, the pipeline will correctly call `TargetEncoder.fit_transform` and use cross fitting when encoding the training data." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Often, we will want to inspect attributes of one of the steps of the pipeline, say, the coefficients of a linear model. 
The easiest way to access the steps in a pipeline is via the *named_steps* attribute, which is a dictionary from the step names to the estimators:" ] }, { "cell_type": "code", "execution_count": 206, "metadata": {}, "outputs": [], "source": [ "pipe = Pipeline([('scaler', StandardScaler()), ('ridge', Ridge(alpha=0.01))])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can check the steps of the pipeline:" ] }, { "cell_type": "code", "execution_count": 208, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[('scaler', StandardScaler()), ('ridge', Ridge(alpha=0.01))]" ] }, "execution_count": 208, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.steps" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Visualize the pipeline:" ] }, { "cell_type": "code", "execution_count": 210, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('scaler', StandardScaler()), ('ridge', Ridge(alpha=0.01))])
" ], "text/plain": [ "Pipeline(steps=[('scaler', StandardScaler()), ('ridge', Ridge(alpha=0.01))])" ] }, "execution_count": 210, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Or get pipe parameters:" ] }, { "cell_type": "code", "execution_count": 212, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'memory': None,\n", " 'steps': [('scaler', StandardScaler()), ('ridge', Ridge(alpha=0.01))],\n", " 'transform_input': None,\n", " 'verbose': False,\n", " 'scaler': StandardScaler(),\n", " 'ridge': Ridge(alpha=0.01),\n", " 'scaler__copy': True,\n", " 'scaler__with_mean': True,\n", " 'scaler__with_std': True,\n", " 'ridge__alpha': 0.01,\n", " 'ridge__copy_X': True,\n", " 'ridge__fit_intercept': True,\n", " 'ridge__max_iter': None,\n", " 'ridge__positive': False,\n", " 'ridge__random_state': None,\n", " 'ridge__solver': 'auto',\n", " 'ridge__tol': 0.0001}" ] }, "execution_count": 212, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.get_params()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can access a certain step of the `pipeline`, using `named_steps`:" ] }, { "cell_type": "code", "execution_count": 214, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Ridge(alpha=0.01)
" ], "text/plain": [ "Ridge(alpha=0.01)" ] }, "execution_count": 214, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.named_steps['ridge']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we will fit the pipeline with the housing data. Since we wantto pass the feature names to the pipeline, we will convert the numpy array to a dataframe:" ] }, { "cell_type": "code", "execution_count": 216, "metadata": {}, "outputs": [], "source": [ "X_pipe = pd.DataFrame(housing.data, columns=housing.feature_names)\n", "y_pipe = housing.target" ] }, { "cell_type": "code", "execution_count": 217, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ridge coefficiets: [ 0.82961904 0.1187523 -0.26552558 0.30569451 -0.00450277 -0.0393263\n", " -0.89987946 -0.87053475]\n" ] } ], "source": [ "# fit the pipeline defined before to the housing dataset\n", "pipe.fit(X_pipe, y_pipe)\n", "# extract the coefficients from the \"ridge\" step\n", "ridge_coefficients = pipe.named_steps[\"ridge\"].coef_\n", "print('ridge coefficiets: ', ridge_coefficients)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In another example, we could access the names of the features that are going out of the scaler step, and into the ridge step." ] }, { "cell_type": "code", "execution_count": 219, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',\n", " 'AveOccup', 'Latitude', 'Longitude'], dtype=object)" ] }, "execution_count": 219, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.named_steps[\"scaler\"].get_feature_names_out()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pipeline and ColumnTransformer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "One of the steps in the `pipeline` can also be `ColumnTransformer`, and `pipeline` can also be a step in `ColumnTransformer`." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's go back to the census dataset, but this time we will import the CSV version of the dataset that has missing values:" ] }, { "cell_type": "code", "execution_count": 223, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<table>\n",
"  <thead>\n",
"    <tr><th></th><th>age</th><th>workclass</th><th>education</th><th>capital-gain-category</th><th>hours-per-week</th><th>occupation</th><th>income</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>39.0</td><td>State-gov</td><td>Bachelors</td><td>NaN</td><td>40.0</td><td>Adm-clerical</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>1</th><td>50.0</td><td>Self-emp-not-inc</td><td>Bachelors</td><td>cat1</td><td>13.0</td><td>Exec-managerial</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>2</th><td>38.0</td><td>Private</td><td>HS-grad</td><td>cat1</td><td>40.0</td><td>Handlers-cleaners</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>3</th><td>53.0</td><td>Private</td><td>11th</td><td>cat1</td><td>40.0</td><td>Handlers-cleaners</td><td>&lt;=50K</td></tr>\n",
"    <tr><th>4</th><td>28.0</td><td>Private</td><td>Bachelors</td><td>cat1</td><td>40.0</td><td>Prof-specialty</td><td>&lt;=50K</td></tr>\n",
"  </tbody>\n",
"</table>\n", "
" ], "text/plain": [ " age workclass education capital-gain-category hours-per-week \\\n", "0 39.0 State-gov Bachelors NaN 40.0 \n", "1 50.0 Self-emp-not-inc Bachelors cat1 13.0 \n", "2 38.0 Private HS-grad cat1 40.0 \n", "3 53.0 Private 11th cat1 40.0 \n", "4 28.0 Private Bachelors cat1 40.0 \n", "\n", " occupation income \n", "0 Adm-clerical <=50K \n", "1 Exec-managerial <=50K \n", "2 Handlers-cleaners <=50K \n", "3 Handlers-cleaners <=50K \n", "4 Prof-specialty <=50K " ] }, "execution_count": 223, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df = pd.read_csv('census_data_missing.csv' )\n", "df.head()" ] }, { "cell_type": "code", "execution_count": 224, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "age 1628\n", "workclass 1628\n", "education 1628\n", "capital-gain-category 6189\n", "hours-per-week 1628\n", "occupation 1628\n", "income 0\n", "dtype: int64" ] }, "execution_count": 224, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.isnull().sum()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recall that we used one hot encoding for feature 'workclass', target encoding for 'occupation', ordinal encoding for 'education' and 'capital-gain-category'. Now let us first impute the missing values of categorical features with mode, and of continuous with the median. Then, we will encode the categorical features, scale all with the StandardScaler and then train the model." ] }, { "cell_type": "code", "execution_count": 226, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preprocessor',\n",
       "                 ColumnTransformer(transformers=[('numerical',\n",
       "                                                  Pipeline(steps=[('imputation_median',\n",
       "                                                                   SimpleImputer(strategy='median'))]),\n",
       "                                                  ['age', 'hours-per-week']),\n",
       "                                                 ('categorical',\n",
       "                                                  Pipeline(steps=[('imputation_mode',\n",
       "                                                                   SimpleImputer(strategy='most_frequent')),\n",
       "                                                                  ('onehot',\n",
       "                                                                   OneHotEncoder(drop='first',\n",
       "                                                                                 handle_unknown='infrequent_if_exist',\n",
       "                                                                                 min_frequ...\n",
       "                                                                   OrdinalEncoder(categories=[['Preschool',\n",
       "                                                                                               '1st-4th',\n",
       "                                                                                               '5th-6th',\n",
       "                                                                                               '7th-8th',\n",
       "                                                                                               '9th',\n",
       "                                                                                               '10th',\n",
       "                                                                                               '11th',\n",
       "                                                                                               '12th',\n",
       "                                                                                               'HS-grad',\n",
       "                                                                                               'Prof-school',\n",
       "                                                                                               'Assoc-acdm',\n",
       "                                                                                               'Assoc-voc',\n",
       "                                                                                               'Some-college',\n",
       "                                                                                               'Bachelors',\n",
       "                                                                                               'Masters',\n",
       "                                                                                               'Doctorate'],\n",
       "                                                                                              ['cat1',\n",
       "                                                                                               'cat2',\n",
       "                                                                                               'cat3',\n",
       "                                                                                               'cat4']]))]),\n",
       "                                                  ['education',\n",
       "                                                   'capital-gain-category'])])),\n",
       "                ('scaler', StandardScaler()),\n",
       "                ('classifier', LogisticRegression(solver='liblinear'))])
" ], "text/plain": [ "Pipeline(steps=[('preprocessor',\n", " ColumnTransformer(transformers=[('numerical',\n", " Pipeline(steps=[('imputation_median',\n", " SimpleImputer(strategy='median'))]),\n", " ['age', 'hours-per-week']),\n", " ('categorical',\n", " Pipeline(steps=[('imputation_mode',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('onehot',\n", " OneHotEncoder(drop='first',\n", " handle_unknown='infrequent_if_exist',\n", " min_frequ...\n", " OrdinalEncoder(categories=[['Preschool',\n", " '1st-4th',\n", " '5th-6th',\n", " '7th-8th',\n", " '9th',\n", " '10th',\n", " '11th',\n", " '12th',\n", " 'HS-grad',\n", " 'Prof-school',\n", " 'Assoc-acdm',\n", " 'Assoc-voc',\n", " 'Some-college',\n", " 'Bachelors',\n", " 'Masters',\n", " 'Doctorate'],\n", " ['cat1',\n", " 'cat2',\n", " 'cat3',\n", " 'cat4']]))]),\n", " ['education',\n", " 'capital-gain-category'])])),\n", " ('scaler', StandardScaler()),\n", " ('classifier', LogisticRegression(solver='liblinear'))])" ] }, "execution_count": 226, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from sklearn.impute import SimpleImputer\n", "numeric_preprocessor = Pipeline([ (\"imputation_median\", SimpleImputer( strategy=\"median\")) ])\n", "\n", "categorical_preprocessor = Pipeline([\n", " (\"imputation_mode\", SimpleImputer( strategy=\"most_frequent\")),\n", " (\"onehot\", OneHotEncoder(sparse_output=False, drop='first', \n", " min_frequency=0.01, handle_unknown='infrequent_if_exist'))\n", " ])\n", "\n", "ordinal_preprocessor = Pipeline([\n", " (\"imputation_mode\", SimpleImputer( strategy=\"most_frequent\")),\n", " (\"onehot\", OrdinalEncoder(categories=categories))\n", " ])\n", "\n", "target_enc_preprocessor = Pipeline([\n", " (\"imputation_mode\", SimpleImputer( strategy=\"most_frequent\")),\n", " ('target_enc', TargetEncoder(target_type='binary')) \n", " ])\n", " \n", "preprocessor = ColumnTransformer([\n", " (\"numerical\", numeric_preprocessor, ['age', 'hours-per-week']),\n", " 
(\"categorical\", categorical_preprocessor, ['workclass']),\n", " (\"cat_target\", target_enc_preprocessor, ['occupation']),\n", " (\"ordinal\", ordinal_preprocessor, ['education','capital-gain-category']),\n", " ])\n", "\n", "pipe = Pipeline([\n", " ('preprocessor', preprocessor),\n", " ('scaler', StandardScaler()), \n", " ('classifier', LogisticRegression(solver='liblinear'))])\n", "\n", "pipe" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that, while `ColumnTransformer` applies the assigned transformations simultaneously, the `Pipeline` executes its steps sequentially." ] }, { "cell_type": "code", "execution_count": 228, "metadata": {}, "outputs": [], "source": [ "X_train, X_test, y_train, y_test = train_test_split(df.drop(columns='income'), df['income'], random_state=0)" ] }, { "cell_type": "code", "execution_count": 229, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Pipeline(steps=[('preprocessor',\n",
       "                 ColumnTransformer(transformers=[('numerical',\n",
       "                                                  Pipeline(steps=[('imputation_median',\n",
       "                                                                   SimpleImputer(strategy='median'))]),\n",
       "                                                  ['age', 'hours-per-week']),\n",
       "                                                 ('categorical',\n",
       "                                                  Pipeline(steps=[('imputation_mode',\n",
       "                                                                   SimpleImputer(strategy='most_frequent')),\n",
       "                                                                  ('onehot',\n",
       "                                                                   OneHotEncoder(drop='first',\n",
       "                                                                                 handle_unknown='infrequent_if_exist',\n",
       "                                                                                 min_frequ...\n",
       "                                                                   OrdinalEncoder(categories=[['Preschool',\n",
       "                                                                                               '1st-4th',\n",
       "                                                                                               '5th-6th',\n",
       "                                                                                               '7th-8th',\n",
       "                                                                                               '9th',\n",
       "                                                                                               '10th',\n",
       "                                                                                               '11th',\n",
       "                                                                                               '12th',\n",
       "                                                                                               'HS-grad',\n",
       "                                                                                               'Prof-school',\n",
       "                                                                                               'Assoc-acdm',\n",
       "                                                                                               'Assoc-voc',\n",
       "                                                                                               'Some-college',\n",
       "                                                                                               'Bachelors',\n",
       "                                                                                               'Masters',\n",
       "                                                                                               'Doctorate'],\n",
       "                                                                                              ['cat1',\n",
       "                                                                                               'cat2',\n",
       "                                                                                               'cat3',\n",
       "                                                                                               'cat4']]))]),\n",
       "                                                  ['education',\n",
       "                                                   'capital-gain-category'])])),\n",
       "                ('scaler', StandardScaler()),\n",
       "                ('classifier', LogisticRegression(solver='liblinear'))])
" ], "text/plain": [ "Pipeline(steps=[('preprocessor',\n", " ColumnTransformer(transformers=[('numerical',\n", " Pipeline(steps=[('imputation_median',\n", " SimpleImputer(strategy='median'))]),\n", " ['age', 'hours-per-week']),\n", " ('categorical',\n", " Pipeline(steps=[('imputation_mode',\n", " SimpleImputer(strategy='most_frequent')),\n", " ('onehot',\n", " OneHotEncoder(drop='first',\n", " handle_unknown='infrequent_if_exist',\n", " min_frequ...\n", " OrdinalEncoder(categories=[['Preschool',\n", " '1st-4th',\n", " '5th-6th',\n", " '7th-8th',\n", " '9th',\n", " '10th',\n", " '11th',\n", " '12th',\n", " 'HS-grad',\n", " 'Prof-school',\n", " 'Assoc-acdm',\n", " 'Assoc-voc',\n", " 'Some-college',\n", " 'Bachelors',\n", " 'Masters',\n", " 'Doctorate'],\n", " ['cat1',\n", " 'cat2',\n", " 'cat3',\n", " 'cat4']]))]),\n", " ['education',\n", " 'capital-gain-category'])])),\n", " ('scaler', StandardScaler()),\n", " ('classifier', LogisticRegression(solver='liblinear'))])" ] }, "execution_count": 229, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.fit(X_train, y_train)" ] }, { "cell_type": "code", "execution_count": 230, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.7940056504114974" ] }, "execution_count": 230, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.score(X_test,y_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We should also do a grid search and fine tune our model: it is also possible to search over the actual steps being performed in the pipeline (say whether to use *StandardScaler* or *MinMaxScaler*). We could also test different imputing strategies, and model parameters. We can also grid search different classifiers, as we will see in the next section. But note, trying all possible solutions is usually not a viable machine learning strategy." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Grid-Searching Which Model To Use\n", "\n", "Here is an example comparing *Ridge*, *Lasso*, and plain *LinearRegression* on the housing dataset.\n", "\n", "We start by defining the pipeline. Here, we explicitly name the steps. We want two steps: one for the pre-processing, followed by a regressor. We can instantiate this using *Ridge* and *MinMaxScaler*:" ] }, { "cell_type": "code", "execution_count": 233, "metadata": {}, "outputs": [], "source": [ "pipe = Pipeline([('preprocessing', MinMaxScaler()), \n", " ('regressor', Ridge())])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can define the `param_grid` to search over. We want the regressor to be either *Ridge*, *Lasso* or *LinearRegression*. Because they have different parameters to tune, and may benefit from different pre-processing, we can use a list of search grids. To assign an estimator to a step, we use the name of the step as the parameter name. When we want to skip a step in the pipeline (for example, because we may not need pre-processing for the *LinearRegression*), we can set that step to *None*. Note that *GridSearchCV* allows the `param_grid` to be a list of dictionaries. Each dictionary in the list is expanded into an independent grid. If we are using *RandomizedSearchCV* with a list of dictionaries, first a dictionary is sampled uniformly, and then a parameter setting is sampled from that dictionary." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's see the pipe parameters:" ] }, { "cell_type": "code", "execution_count": 236, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "{'memory': None,\n", " 'steps': [('preprocessing', MinMaxScaler()), ('regressor', Ridge())],\n", " 'transform_input': None,\n", " 'verbose': False,\n", " 'preprocessing': MinMaxScaler(),\n", " 'regressor': Ridge(),\n", " 'preprocessing__clip': False,\n", " 'preprocessing__copy': True,\n", " 'preprocessing__feature_range': (0, 1),\n", " 'regressor__alpha': 1.0,\n", " 'regressor__copy_X': True,\n", " 'regressor__fit_intercept': True,\n", " 'regressor__max_iter': None,\n", " 'regressor__positive': False,\n", " 'regressor__random_state': None,\n", " 'regressor__solver': 'auto',\n", " 'regressor__tol': 0.0001}" ] }, "execution_count": 236, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pipe.get_params()" ] }, { "cell_type": "code", "execution_count": 237, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import Lasso, LinearRegression\n", "\n", "param_grid = [\n", " {'regressor': [Ridge()], 'preprocessing': [MinMaxScaler(),StandardScaler()],\n", " 'regressor__alpha': [0.01, 0.03, 0.05, 0.1, 0.3, 0.5, 1 ] },\n", " {'regressor': [Lasso()], 'preprocessing': [ MinMaxScaler(),StandardScaler()], \n", " 'regressor__alpha': [0.01, 0.03, 0.05, 0.1, 0.3, 0.5, 1 ]},\n", " {'regressor': [LinearRegression()], 'preprocessing': [None]}\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can instantiate and run the grid search on the housing dataset:" ] }, { "cell_type": "code", "execution_count": 239, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Best params: {'preprocessing': MinMaxScaler(), 'regressor': Ridge(), 'regressor__alpha': 0.1}\n", "Best cross-validation score -0.7243104320018825\n", "Test-set score: -0.7363007117320173\n" ] } ], "source": [ "X_train, X_test, y_train, y_test 
= train_test_split(housing.data, housing.target, random_state=0)\n", "\n", "grid = GridSearchCV(pipe, param_grid, cv=5, scoring='neg_root_mean_squared_error', n_jobs=-1)\n", "grid.fit(X_train, y_train)\n", "\n", "print('Best params: ', grid.best_params_)\n", "print('Best cross-validation score ', grid.best_score_)\n", "print('Test-set score: ', grid.score(X_test, y_test))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The outcome of the grid search is that *Ridge* with *MinMaxScaler* preprocessing and *alpha=0.1* gave the best result. Because the scoring is `neg_root_mean_squared_error`, the reported scores are negated RMSE values, so scores closer to zero are better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Splitting data in cross-validation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "During cross-validation, the data is split into k folds. So far, we have specified how the data is split only by passing the number of cross-validation folds in the parameter `cv`. When we pass the number of folds and the estimator is a classifier with a binary or multiclass `y`, the default behavior of `GridSearchCV` and `RandomizedSearchCV` is to use stratified k-fold cross-validation. This keeps the class distribution in each fold close to the distribution in the complete training dataset. If the data were split without stratification, then under severe class imbalance one or more folds could end up with few or no examples from the minority class. Some, or perhaps many, of the model evaluations would then be misleading, because the model would only need to predict the majority class correctly to score well." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Hence, by default the `StratifiedKFold` cross-validator is used, which provides train/holdout indices to split the data. The folds are made by preserving the percentage of samples for each class, without shuffling the data. Let's illustrate how this cross-validator works using the small imbalanced dataset we generated previously." 
] }, { "cell_type": "code", "execution_count": 244, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import make_classification\n", "\n", "X, y = make_classification(n_samples=1000, n_features=2, n_informative=2,\n", " n_redundant=0, n_repeated=0, n_classes=3,\n", " n_clusters_per_class=1,\n", " weights=[0.01, 0.05, 0.94],\n", " class_sep=0.8, random_state=0)" ] }, { "cell_type": "code", "execution_count": 245, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "2 0.932\n", "1 0.055\n", "0 0.013\n", "Name: proportion, dtype: float64" ] }, "execution_count": 245, "metadata": {}, "output_type": "execute_result" } ], "source": [ "pd.Series(y).value_counts(normalize=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `split` method of this cross-validator generates indices to split the data into training and holdout sets. Let's see what the class distribution of each holdout fold would be if we did 5-fold cross-validation." ] }, { "cell_type": "code", "execution_count": 247, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fold 0: Class distribution of validation instances\n", "2 0.935\n", "1 0.055\n", "0 0.010\n", "Name: proportion, dtype: float64\n", "Fold 1: Class distribution of validation instances\n", "2 0.935\n", "1 0.055\n", "0 0.010\n", "Name: proportion, dtype: float64\n", "Fold 2: Class distribution of validation instances\n", "2 0.930\n", "1 0.055\n", "0 0.015\n", "Name: proportion, dtype: float64\n", "Fold 3: Class distribution of validation instances\n", "2 0.930\n", "1 0.055\n", "0 0.015\n", "Name: proportion, dtype: float64\n", "Fold 4: Class distribution of validation instances\n", "2 0.930\n", "1 0.055\n", "0 0.015\n", "Name: proportion, dtype: float64\n" ] } ], "source": [ "from sklearn.model_selection import StratifiedKFold \n", "skf = StratifiedKFold(n_splits=5)\n", "for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):\n", " print(f'Fold {fold}: Class 
distribution of validation instances')\n", " print(pd.Series(y[val_idx]).value_counts(normalize=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's compare this with the `KFold` cross-validator, which simply splits the dataset into k consecutive folds (without shuffling by default)." ] }, { "cell_type": "code", "execution_count": 249, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Fold 0\n", "Class distribution of validation instances\n", "2 0.945\n", "1 0.050\n", "0 0.005\n", "Name: proportion, dtype: float64\n", "Fold 1\n", "Class distribution of validation instances\n", "2 0.935\n", "1 0.055\n", "0 0.010\n", "Name: proportion, dtype: float64\n", "Fold 2\n", "Class distribution of validation instances\n", "2 0.930\n", "1 0.055\n", "0 0.015\n", "Name: proportion, dtype: float64\n", "Fold 3\n", "Class distribution of validation instances\n", "2 0.92\n", "1 0.07\n", "0 0.01\n", "Name: proportion, dtype: float64\n", "Fold 4\n", "Class distribution of validation instances\n", "2 0.930\n", "1 0.045\n", "0 0.025\n", "Name: proportion, dtype: float64\n" ] } ], "source": [ "from sklearn.model_selection import KFold \n", "kf = KFold(n_splits=5)\n", "for fold, (train_idx, val_idx) in enumerate(kf.split(X, y)):\n", " print(f'Fold {fold}')\n", " print('Class distribution of validation instances')\n", " print(pd.Series(y[val_idx]).value_counts(normalize=True))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now the proportion of the rarest class varies across the validation folds (from 0.005 to 0.025)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another important detail which we have not discussed so far is the shuffling of the data instances when the folds are made. By default, the data is not shuffled, so running cross-validation twice with the default parameters gives exactly the same results. 
However, if the ordering of data instances in the dataset is not arbitrary, for example when data points with the same class label are contiguous, shuffling the data first may be essential to get a meaningful cross-validation result. In this case we can use `ShuffleSplit` or `StratifiedShuffleSplit`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Another common case of ordered data is time series. For example, today's stock price is usually correlated with yesterday's and tomorrow's. In applications, we usually have data up to some point and then try to make predictions for the future; in other words, we use past data to predict future data. For this case we can use `TimeSeriesSplit`, which simulates this by training on increasingly large chunks of past data and making predictions on the next chunk." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To visualize how all these splits work, we will use the code from: https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html\n", "\n", "Here, the goal is only to analyze the plots produced by the code (there is no need to analyze or replicate the code itself)." 
] }, { "cell_type": "code", "execution_count": 254, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import ShuffleSplit, StratifiedShuffleSplit, TimeSeriesSplit\n", "from matplotlib.patches import Patch\n", "cmap_data = plt.cm.Paired\n", "cmap_cv = plt.cm.coolwarm\n", "n_splits = 5\n", "cvs = [\n", " KFold,\n", " ShuffleSplit,\n", " StratifiedKFold,\n", " StratifiedShuffleSplit,\n", " TimeSeriesSplit\n", "]\n", "X = np.random.randn(100, 10)\n", "percentiles_classes = [0.1, 0.3, 0.6]\n", "y = np.hstack([[ii] * int(100 * perc) for ii, perc in enumerate(percentiles_classes)])" ] }, { "cell_type": "code", "execution_count": 255, "metadata": {}, "outputs": [], "source": [ "def plot_cv_indices(cv, X, y, ax, n_splits, lw=10):\n", " \"\"\"Create a sample plot for indices of a cross-validation object.\"\"\"\n", "\n", " # Generate the training/testing visualizations for each CV split\n", " for ii, (tr, tt) in enumerate(cv.split(X=X, y=y )):\n", " # Fill in indices with the training/test groups\n", " indices = np.array([np.nan] * len(X))\n", " indices[tt] = 1\n", " indices[tr] = 0\n", "\n", " # Visualize the results\n", " ax.scatter(\n", " range(len(indices)),\n", " [ii + 0.5] * len(indices),\n", " c=indices,\n", " marker=\"_\",\n", " lw=lw,\n", " cmap=cmap_cv,\n", " vmin=-0.2,\n", " vmax=1.2,\n", " )\n", "\n", " # Plot the data classes and groups at the end\n", " ax.scatter(\n", " range(len(X)), [ii + 1.5] * len(X), c=y, marker=\"_\", lw=lw, cmap=cmap_data\n", " )\n", "\n", "\n", " # Formatting\n", " yticklabels = list(range(n_splits)) + [\"class\"]\n", " ax.set(\n", " yticks=np.arange(n_splits + 1) + 0.5,\n", " yticklabels=yticklabels,\n", " xlabel=\"Data sample index\",\n", " ylabel=\"CV Iteration\",\n", " ylim=[n_splits + 2.2, -0.2],\n", " xlim=[0, 100],\n", " )\n", " ax.set_title(\"{}\".format(type(cv).__name__), fontsize=15)\n", " return ax" ] }, { "cell_type": "code", "execution_count": 256, "metadata": {}, "outputs": [ { "data": { "image/png": 
"", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "", "text/plain": [ "
" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "for cv in cvs:\n", " this_cv = cv(n_splits=n_splits)\n", " fig, ax = plt.subplots(figsize=(6, 3))\n", " plot_cv_indices(this_cv, X, y, ax, n_splits)\n", "\n", " ax.legend(\n", " [Patch(color=cmap_cv(0.8)), Patch(color=cmap_cv(0.02))],\n", " [\"Holdout fold\", \"Training folds\" ],\n", " loc=(1.02, 0.8),\n", " )\n", " # Make the legend fit\n", " plt.tight_layout()\n", " fig.subplots_adjust(right=0.7)\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "ml2025", "language": "python", "name": "ml2025" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 4 }