is_assignments/a2/code/.ipynb_checkpoints/Second IS assignment-checkp...

{
"cells": [
{
"cell_type": "markdown",
"id": "c093ea0c",
"metadata": {},
"source": [
"# Seminar 2: Predicting Biodegradability of Chemicals"
]
},
{
"cell_type": "markdown",
"id": "7aa30d7d",
"metadata": {},
"source": [
"## 1. Introduction\n",
"Chemicals are all around us. Studying their properties by means of machine learning is an active\n",
"research field; matching molecular patterns with their behavior can be a decisive factor in the creation of\n",
"new materials, drugs, and more.\n",
"In this seminar assignment, your task is to explore the data and build machine-learning models that\n",
"predict the biodegradability of chemicals."
]
},
{
"cell_type": "markdown",
"id": "aeab08c8",
"metadata": {},
"source": [
"## 2. Task\n",
"You will work with the data set compiled by Mansouri et al. [data](https://www.openml.org/search?type=data&status=active&id=1494&sort=runs). There are 41 features and one target feature (biodegradability).\n",
"The target variable is encoded as ready biodegradable (1) and not ready biodegradable (2). The data set\n",
"consists of 1055 instances. Features can be either symbolic or numeric.\n",
"IMPORTANT: Use the dataset provided on učilnica and NOT the one posted at the link above. It is\n",
"minimally modified and split into training and test sets.\n"
]
},
{
"cell_type": "markdown",
"id": "a4f197dd",
"metadata": {},
"source": [
"### 2.1 Exploration\n",
"Inspect the dataset. How balanced is the target variable? Are there any missing values present? If there\n",
"are, choose a strategy that takes this into account.\n",
"Most of your data is of the numeric type. Can you identify, by adopting exploratory analysis, whether\n",
"some features are directly related to the target? What about feature pairs? Produce at least three types of\n",
"visualizations of the feature space and be prepared to argue why these visualizations were useful for your\n",
"subsequent analysis."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5bcf6290",
"metadata": {},
"outputs": [],
"source": [
"# Needed imports\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import sklearn\n",
"import seaborn as sns\n",
"import scikitplot as skplt\n",
"import warnings\n",
"warnings.filterwarnings(action='once')\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "18ff4f76",
"metadata": {},
"outputs": [],
"source": [
"df_train = pd.read_csv('train.csv')\n",
"df_test = pd.read_csv('test.csv')"
]
},
{
"cell_type": "markdown",
"id": "ea26bfdf",
"metadata": {},
"source": [
"#### Let's inspect the training and test data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5933f4d7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>V1</th>\n",
" <th>V2</th>\n",
" <th>V3</th>\n",
" <th>V4</th>\n",
" <th>V5</th>\n",
" <th>V6</th>\n",
" <th>V7</th>\n",
" <th>V8</th>\n",
" <th>V9</th>\n",
" <th>V10</th>\n",
" <th>...</th>\n",
" <th>V33</th>\n",
" <th>V34</th>\n",
" <th>V35</th>\n",
" <th>V36</th>\n",
" <th>V37</th>\n",
" <th>V38</th>\n",
" <th>V39</th>\n",
" <th>V40</th>\n",
" <th>V41</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.919</td>\n",
" <td>2.6909</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>31.4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2.949</td>\n",
" <td>1.591</td>\n",
" <td>0</td>\n",
" <td>7.253</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.170</td>\n",
" <td>2.1144</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.8</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3.315</td>\n",
" <td>1.967</td>\n",
" <td>0</td>\n",
" <td>7.257</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3.000</td>\n",
" <td>2.7098</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>20.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3.046</td>\n",
" <td>5.000</td>\n",
" <td>0</td>\n",
" <td>6.690</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>4.214</td>\n",
" <td>2.6272</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2.998</td>\n",
" <td>1.722</td>\n",
" <td>0</td>\n",
" <td>6.770</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>3.942</td>\n",
" <td>2.7719</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>31.6</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3.542</td>\n",
" <td>1.739</td>\n",
" <td>0</td>\n",
" <td>8.127</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V33 V34 V35 \\\n",
"1 3.919 2.6909 0 0 0 0 0 31.4 2 0 ... 0 0 0 \n",
"2 4.170 2.1144 0 0 0 0 0 30.8 1 1 ... 0 0 0 \n",
"4 3.000 2.7098 0 0 0 0 0 20.0 0 2 ... 0 0 1 \n",
"13 4.214 2.6272 0 0 0 0 0 30.0 3 0 ... 0 0 0 \n",
"16 3.942 2.7719 1 0 0 0 0 31.6 2 0 ... 0 0 0 \n",
"\n",
" V36 V37 V38 V39 V40 V41 Class \n",
"1 2.949 1.591 0 7.253 0 0 2 \n",
"2 3.315 1.967 0 7.257 0 0 2 \n",
"4 3.046 5.000 0 6.690 0 0 2 \n",
"13 2.998 1.722 0 6.770 0 0 2 \n",
"16 3.542 1.739 0 8.127 0 1 2 \n",
"\n",
"[5 rows x 42 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1743d191",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>V1</th>\n",
" <th>V2</th>\n",
" <th>V3</th>\n",
" <th>V4</th>\n",
" <th>V5</th>\n",
" <th>V6</th>\n",
" <th>V7</th>\n",
" <th>V8</th>\n",
" <th>V9</th>\n",
" <th>V10</th>\n",
" <th>...</th>\n",
" <th>V33</th>\n",
" <th>V34</th>\n",
" <th>V35</th>\n",
" <th>V36</th>\n",
" <th>V37</th>\n",
" <th>V38</th>\n",
" <th>V39</th>\n",
" <th>V40</th>\n",
" <th>V41</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>821.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>...</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>821.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>4.790476</td>\n",
" <td>3.054551</td>\n",
" <td>0.739953</td>\n",
" <td>0.030451</td>\n",
" <td>0.946809</td>\n",
" <td>0.277778</td>\n",
" <td>1.669031</td>\n",
" <td>37.422813</td>\n",
" <td>1.342790</td>\n",
" <td>1.784870</td>\n",
" <td>...</td>\n",
" <td>0.903073</td>\n",
" <td>1.241135</td>\n",
" <td>0.926714</td>\n",
" <td>3.922100</td>\n",
" <td>2.549406</td>\n",
" <td>0.671395</td>\n",
" <td>8.643191</td>\n",
" <td>0.059102</td>\n",
" <td>0.706856</td>\n",
" <td>1.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.531991</td>\n",
" <td>0.813983</td>\n",
" <td>1.504545</td>\n",
" <td>0.198281</td>\n",
" <td>2.318081</td>\n",
" <td>1.045544</td>\n",
" <td>2.220221</td>\n",
" <td>9.030008</td>\n",
" <td>2.018433</td>\n",
" <td>1.773856</td>\n",
" <td>...</td>\n",
" <td>1.526124</td>\n",
" <td>2.248684</td>\n",
" <td>1.239133</td>\n",
" <td>0.992636</td>\n",
" <td>0.625021</td>\n",
" <td>1.093633</td>\n",
" <td>1.223700</td>\n",
" <td>0.342364</td>\n",
" <td>2.145396</td>\n",
" <td>0.471683</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>2.000000</td>\n",
" <td>0.803900</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>9.100000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.279000</td>\n",
" <td>1.467000</td>\n",
" <td>0.000000</td>\n",
" <td>4.948000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>4.499000</td>\n",
" <td>2.510175</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>30.800000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.497000</td>\n",
" <td>2.101000</td>\n",
" <td>0.000000</td>\n",
" <td>8.009500</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>4.840000</td>\n",
" <td>3.052400</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>37.850000</td>\n",
" <td>1.000000</td>\n",
" <td>1.500000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.732500</td>\n",
" <td>2.461000</td>\n",
" <td>0.000000</td>\n",
" <td>8.508000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>5.119000</td>\n",
" <td>3.415725</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>43.800000</td>\n",
" <td>2.000000</td>\n",
" <td>3.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.980000</td>\n",
" <td>2.861000</td>\n",
" <td>1.000000</td>\n",
" <td>9.019750</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>6.496000</td>\n",
" <td>7.918400</td>\n",
" <td>12.000000</td>\n",
" <td>2.000000</td>\n",
" <td>36.000000</td>\n",
" <td>13.000000</td>\n",
" <td>18.000000</td>\n",
" <td>60.700000</td>\n",
" <td>24.000000</td>\n",
" <td>12.000000</td>\n",
" <td>...</td>\n",
" <td>12.000000</td>\n",
" <td>18.000000</td>\n",
" <td>7.000000</td>\n",
" <td>10.695000</td>\n",
" <td>5.750000</td>\n",
" <td>8.000000</td>\n",
" <td>14.700000</td>\n",
" <td>4.000000</td>\n",
" <td>27.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" V1 V2 V3 V4 V5 V6 \\\n",
"count 846.000000 846.000000 846.000000 821.000000 846.000000 846.000000 \n",
"mean 4.790476 3.054551 0.739953 0.030451 0.946809 0.277778 \n",
"std 0.531991 0.813983 1.504545 0.198281 2.318081 1.045544 \n",
"min 2.000000 0.803900 0.000000 0.000000 0.000000 0.000000 \n",
"25% 4.499000 2.510175 0.000000 0.000000 0.000000 0.000000 \n",
"50% 4.840000 3.052400 0.000000 0.000000 0.000000 0.000000 \n",
"75% 5.119000 3.415725 1.000000 0.000000 1.000000 0.000000 \n",
"max 6.496000 7.918400 12.000000 2.000000 36.000000 13.000000 \n",
"\n",
" V7 V8 V9 V10 ... V33 \\\n",
"count 846.000000 846.000000 846.000000 846.000000 ... 846.000000 \n",
"mean 1.669031 37.422813 1.342790 1.784870 ... 0.903073 \n",
"std 2.220221 9.030008 2.018433 1.773856 ... 1.526124 \n",
"min 0.000000 9.100000 0.000000 0.000000 ... 0.000000 \n",
"25% 0.000000 30.800000 0.000000 0.000000 ... 0.000000 \n",
"50% 1.000000 37.850000 1.000000 1.500000 ... 0.000000 \n",
"75% 3.000000 43.800000 2.000000 3.000000 ... 1.000000 \n",
"max 18.000000 60.700000 24.000000 12.000000 ... 12.000000 \n",
"\n",
" V34 V35 V36 V37 V38 V39 \\\n",
"count 846.000000 846.000000 846.000000 821.000000 846.000000 846.000000 \n",
"mean 1.241135 0.926714 3.922100 2.549406 0.671395 8.643191 \n",
"std 2.248684 1.239133 0.992636 0.625021 1.093633 1.223700 \n",
"min 0.000000 0.000000 2.279000 1.467000 0.000000 4.948000 \n",
"25% 0.000000 0.000000 3.497000 2.101000 0.000000 8.009500 \n",
"50% 0.000000 1.000000 3.732500 2.461000 0.000000 8.508000 \n",
"75% 2.000000 1.000000 3.980000 2.861000 1.000000 9.019750 \n",
"max 18.000000 7.000000 10.695000 5.750000 8.000000 14.700000 \n",
"\n",
" V40 V41 Class \n",
"count 846.000000 846.000000 846.000000 \n",
"mean 0.059102 0.706856 1.333333 \n",
"std 0.342364 2.145396 0.471683 \n",
"min 0.000000 0.000000 1.000000 \n",
"25% 0.000000 0.000000 1.000000 \n",
"50% 0.000000 0.000000 1.000000 \n",
"75% 0.000000 0.000000 2.000000 \n",
"max 4.000000 27.000000 2.000000 \n",
"\n",
"[8 rows x 42 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.describe()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b2689ec0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 846 entries, 3 to 1055\n",
"Data columns (total 42 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 V1 846 non-null float64\n",
" 1 V2 846 non-null float64\n",
" 2 V3 846 non-null int64 \n",
" 3 V4 821 non-null float64\n",
" 4 V5 846 non-null int64 \n",
" 5 V6 846 non-null int64 \n",
" 6 V7 846 non-null int64 \n",
" 7 V8 846 non-null float64\n",
" 8 V9 846 non-null int64 \n",
" 9 V10 846 non-null int64 \n",
" 10 V11 846 non-null int64 \n",
" 11 V12 846 non-null float64\n",
" 12 V13 846 non-null float64\n",
" 13 V14 846 non-null float64\n",
" 14 V15 846 non-null float64\n",
" 15 V16 846 non-null int64 \n",
" 16 V17 846 non-null float64\n",
" 17 V18 846 non-null float64\n",
" 18 V19 846 non-null int64 \n",
" 19 V20 846 non-null int64 \n",
" 20 V21 846 non-null int64 \n",
" 21 V22 830 non-null float64\n",
" 22 V23 846 non-null int64 \n",
" 23 V24 846 non-null int64 \n",
" 24 V25 846 non-null int64 \n",
" 25 V26 846 non-null int64 \n",
" 26 V27 838 non-null float64\n",
" 27 V28 846 non-null float64\n",
" 28 V29 838 non-null float64\n",
" 29 V30 846 non-null float64\n",
" 30 V31 846 non-null float64\n",
" 31 V32 846 non-null int64 \n",
" 32 V33 846 non-null int64 \n",
" 33 V34 846 non-null int64 \n",
" 34 V35 846 non-null int64 \n",
" 35 V36 846 non-null float64\n",
" 36 V37 821 non-null float64\n",
" 37 V38 846 non-null int64 \n",
" 38 V39 846 non-null float64\n",
" 39 V40 846 non-null int64 \n",
" 40 V41 846 non-null int64 \n",
" 41 Class 846 non-null int64 \n",
"dtypes: float64(19), int64(23)\n",
"memory usage: 284.2 KB\n"
]
}
],
"source": [
"df_train.info()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "22003f33",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>V1</th>\n",
" <th>V2</th>\n",
" <th>V3</th>\n",
" <th>V4</th>\n",
" <th>V5</th>\n",
" <th>V6</th>\n",
" <th>V7</th>\n",
" <th>V8</th>\n",
" <th>V9</th>\n",
" <th>V10</th>\n",
" <th>...</th>\n",
" <th>V33</th>\n",
" <th>V34</th>\n",
" <th>V35</th>\n",
" <th>V36</th>\n",
" <th>V37</th>\n",
" <th>V38</th>\n",
" <th>V39</th>\n",
" <th>V40</th>\n",
" <th>V41</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.919</td>\n",
" <td>2.6909</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>31.4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2.949</td>\n",
" <td>1.591</td>\n",
" <td>0</td>\n",
" <td>7.253</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.170</td>\n",
" <td>2.1144</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.8</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3.315</td>\n",
" <td>1.967</td>\n",
" <td>0</td>\n",
" <td>7.257</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3.000</td>\n",
" <td>2.7098</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>20.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3.046</td>\n",
" <td>5.000</td>\n",
" <td>0</td>\n",
" <td>6.690</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>4.214</td>\n",
" <td>2.6272</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2.998</td>\n",
" <td>1.722</td>\n",
" <td>0</td>\n",
" <td>6.770</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>3.942</td>\n",
" <td>2.7719</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>31.6</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3.542</td>\n",
" <td>1.739</td>\n",
" <td>0</td>\n",
" <td>8.127</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V33 V34 V35 \\\n",
"1 3.919 2.6909 0 0 0 0 0 31.4 2 0 ... 0 0 0 \n",
"2 4.170 2.1144 0 0 0 0 0 30.8 1 1 ... 0 0 0 \n",
"4 3.000 2.7098 0 0 0 0 0 20.0 0 2 ... 0 0 1 \n",
"13 4.214 2.6272 0 0 0 0 0 30.0 3 0 ... 0 0 0 \n",
"16 3.942 2.7719 1 0 0 0 0 31.6 2 0 ... 0 0 0 \n",
"\n",
" V36 V37 V38 V39 V40 V41 Class \n",
"1 2.949 1.591 0 7.253 0 0 2 \n",
"2 3.315 1.967 0 7.257 0 0 2 \n",
"4 3.046 5.000 0 6.690 0 0 2 \n",
"13 2.998 1.722 0 6.770 0 0 2 \n",
"16 3.542 1.739 0 8.127 0 1 2 \n",
"\n",
"[5 rows x 42 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d7235214",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>V1</th>\n",
" <th>V2</th>\n",
" <th>V3</th>\n",
" <th>V4</th>\n",
" <th>V5</th>\n",
" <th>V6</th>\n",
" <th>V7</th>\n",
" <th>V8</th>\n",
" <th>V9</th>\n",
" <th>V10</th>\n",
" <th>...</th>\n",
" <th>V33</th>\n",
" <th>V34</th>\n",
" <th>V35</th>\n",
" <th>V36</th>\n",
" <th>V37</th>\n",
" <th>V38</th>\n",
" <th>V39</th>\n",
" <th>V40</th>\n",
" <th>V41</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.00000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>...</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>4.750938</td>\n",
" <td>3.130050</td>\n",
" <td>0.62201</td>\n",
" <td>0.086124</td>\n",
" <td>1.114833</td>\n",
" <td>0.339713</td>\n",
" <td>1.555024</td>\n",
" <td>35.569378</td>\n",
" <td>1.511962</td>\n",
" <td>1.880383</td>\n",
" <td>...</td>\n",
" <td>0.803828</td>\n",
" <td>1.411483</td>\n",
" <td>1.100478</td>\n",
" <td>3.902612</td>\n",
" <td>2.629201</td>\n",
" <td>0.746411</td>\n",
" <td>8.574038</td>\n",
" <td>0.019139</td>\n",
" <td>0.789474</td>\n",
" <td>1.354067</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.603914</td>\n",
" <td>0.897556</td>\n",
" <td>1.27690</td>\n",
" <td>0.406969</td>\n",
" <td>2.393143</td>\n",
" <td>1.182566</td>\n",
" <td>2.246383</td>\n",
" <td>9.471334</td>\n",
" <td>1.721220</td>\n",
" <td>1.784023</td>\n",
" <td>...</td>\n",
" <td>1.498327</td>\n",
" <td>2.374355</td>\n",
" <td>1.320857</td>\n",
" <td>1.029605</td>\n",
" <td>0.714285</td>\n",
" <td>1.077657</td>\n",
" <td>1.315016</td>\n",
" <td>0.195176</td>\n",
" <td>2.589491</td>\n",
" <td>0.479378</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>2.000000</td>\n",
" <td>1.134900</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.267000</td>\n",
" <td>1.576000</td>\n",
" <td>0.000000</td>\n",
" <td>4.917000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>4.414000</td>\n",
" <td>2.494500</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>29.400000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.401000</td>\n",
" <td>2.146000</td>\n",
" <td>0.000000</td>\n",
" <td>7.872000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>4.807000</td>\n",
" <td>3.039300</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>34.200000</td>\n",
" <td>1.000000</td>\n",
" <td>2.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.694000</td>\n",
" <td>2.469000</td>\n",
" <td>0.000000</td>\n",
" <td>8.464000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>5.188000</td>\n",
" <td>3.555400</td>\n",
" <td>1.00000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>41.200000</td>\n",
" <td>2.000000</td>\n",
" <td>3.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>2.000000</td>\n",
" <td>2.000000</td>\n",
" <td>3.991000</td>\n",
" <td>2.967000</td>\n",
" <td>1.000000</td>\n",
" <td>9.017000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>6.253000</td>\n",
" <td>9.177500</td>\n",
" <td>8.00000</td>\n",
" <td>3.000000</td>\n",
" <td>16.000000</td>\n",
" <td>12.000000</td>\n",
" <td>14.000000</td>\n",
" <td>60.000000</td>\n",
" <td>9.000000</td>\n",
" <td>11.000000</td>\n",
" <td>...</td>\n",
" <td>12.000000</td>\n",
" <td>18.000000</td>\n",
" <td>6.000000</td>\n",
" <td>10.355000</td>\n",
" <td>5.825000</td>\n",
" <td>6.000000</td>\n",
" <td>14.030000</td>\n",
" <td>2.000000</td>\n",
" <td>27.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" V1 V2 V3 V4 V5 V6 \\\n",
"count 209.000000 209.000000 209.00000 209.000000 209.000000 209.000000 \n",
"mean 4.750938 3.130050 0.62201 0.086124 1.114833 0.339713 \n",
"std 0.603914 0.897556 1.27690 0.406969 2.393143 1.182566 \n",
"min 2.000000 1.134900 0.00000 0.000000 0.000000 0.000000 \n",
"25% 4.414000 2.494500 0.00000 0.000000 0.000000 0.000000 \n",
"50% 4.807000 3.039300 0.00000 0.000000 0.000000 0.000000 \n",
"75% 5.188000 3.555400 1.00000 0.000000 1.000000 0.000000 \n",
"max 6.253000 9.177500 8.00000 3.000000 16.000000 12.000000 \n",
"\n",
" V7 V8 V9 V10 ... V33 \\\n",
"count 209.000000 209.000000 209.000000 209.000000 ... 209.000000 \n",
"mean 1.555024 35.569378 1.511962 1.880383 ... 0.803828 \n",
"std 2.246383 9.471334 1.721220 1.784023 ... 1.498327 \n",
"min 0.000000 0.000000 0.000000 0.000000 ... 0.000000 \n",
"25% 0.000000 29.400000 0.000000 0.000000 ... 0.000000 \n",
"50% 0.000000 34.200000 1.000000 2.000000 ... 0.000000 \n",
"75% 3.000000 41.200000 2.000000 3.000000 ... 1.000000 \n",
"max 14.000000 60.000000 9.000000 11.000000 ... 12.000000 \n",
"\n",
" V34 V35 V36 V37 V38 V39 \\\n",
"count 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 \n",
"mean 1.411483 1.100478 3.902612 2.629201 0.746411 8.574038 \n",
"std 2.374355 1.320857 1.029605 0.714285 1.077657 1.315016 \n",
"min 0.000000 0.000000 2.267000 1.576000 0.000000 4.917000 \n",
"25% 0.000000 0.000000 3.401000 2.146000 0.000000 7.872000 \n",
"50% 0.000000 1.000000 3.694000 2.469000 0.000000 8.464000 \n",
"75% 2.000000 2.000000 3.991000 2.967000 1.000000 9.017000 \n",
"max 18.000000 6.000000 10.355000 5.825000 6.000000 14.030000 \n",
"\n",
" V40 V41 Class \n",
"count 209.000000 209.000000 209.000000 \n",
"mean 0.019139 0.789474 1.354067 \n",
"std 0.195176 2.589491 0.479378 \n",
"min 0.000000 0.000000 1.000000 \n",
"25% 0.000000 0.000000 1.000000 \n",
"50% 0.000000 0.000000 1.000000 \n",
"75% 0.000000 0.000000 2.000000 \n",
"max 2.000000 27.000000 2.000000 \n",
"\n",
"[8 rows x 42 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test.describe()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9598495e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 209 entries, 1 to 1051\n",
"Data columns (total 42 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 V1 209 non-null float64\n",
" 1 V2 209 non-null float64\n",
" 2 V3 209 non-null int64 \n",
" 3 V4 209 non-null int64 \n",
" 4 V5 209 non-null int64 \n",
" 5 V6 209 non-null int64 \n",
" 6 V7 209 non-null int64 \n",
" 7 V8 209 non-null float64\n",
" 8 V9 209 non-null int64 \n",
" 9 V10 209 non-null int64 \n",
" 10 V11 209 non-null int64 \n",
" 11 V12 209 non-null float64\n",
" 12 V13 209 non-null float64\n",
" 13 V14 209 non-null float64\n",
" 14 V15 209 non-null float64\n",
" 15 V16 209 non-null int64 \n",
" 16 V17 209 non-null float64\n",
" 17 V18 209 non-null float64\n",
" 18 V19 209 non-null int64 \n",
" 19 V20 209 non-null int64 \n",
" 20 V21 209 non-null int64 \n",
" 21 V22 209 non-null float64\n",
" 22 V23 209 non-null int64 \n",
" 23 V24 209 non-null int64 \n",
" 24 V25 209 non-null int64 \n",
" 25 V26 209 non-null int64 \n",
" 26 V27 209 non-null float64\n",
" 27 V28 209 non-null float64\n",
" 28 V29 209 non-null int64 \n",
" 29 V30 209 non-null float64\n",
" 30 V31 209 non-null float64\n",
" 31 V32 209 non-null int64 \n",
" 32 V33 209 non-null int64 \n",
" 33 V34 209 non-null int64 \n",
" 34 V35 209 non-null int64 \n",
" 35 V36 209 non-null float64\n",
" 36 V37 209 non-null float64\n",
" 37 V38 209 non-null int64 \n",
" 38 V39 209 non-null float64\n",
" 39 V40 209 non-null int64 \n",
" 40 V41 209 non-null int64 \n",
" 41 Class 209 non-null int64 \n",
"dtypes: float64(17), int64(25)\n",
"memory usage: 70.2 KB\n"
]
}
],
"source": [
"df_test.info()"
]
},
{
"cell_type": "markdown",
"id": "84e0c414",
"metadata": {},
"source": [
"#### Display the distribution of the target variable **Class** in the training and validation sets."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5ca239ec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_train['Class'], bins=10)\n",
"plt.xlabel('Class')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the target variable in the training set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "c74f9fb5",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABH6UlEQVR4nO3deVxUZf//8fcggogsYgJiLrjlmhYmebsrSWqmabmWuFthrmV5l7mkkZlrmd6VqZVmampmppmYtqi5lm3uu4JbgrggyvX7ox/zbQQUYdiOr+fjMQ+d65y5zuecYZg311znjM0YYwQAAGBRLrldAAAAQHYi7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7NzhRo0aJZvNliPbaty4sRo3bmy//91338lms2nx4sU5sv3u3burbNmyObKtzEpISFDv3r0VGBgom82mQYMG3XYfKc/pmTNnnF8gssWNr42MOnTokGw2m956661bruvs1/qcOXNks9l06NAhp/WZnu7du6tIkSLZvh1YF2HHQlJ++aTcChUqpKCgIIWHh2vatGm6cOGCU7Zz4sQJjRo1Sjt37nRKf86Ul2vLiNdff11z5szRM888o48//lhPPfXUTdddtmxZzhV3g59++kmjRo3S+fPnc62G25Hf6r3TXLp0SaNGjdJ3332XazWsXLlSo0aNytZt5OZ+/vHHHxo1alSOBNQ8x8AyZs+ebSSZMWPGmI8//th8+OGH5vXXXzfNmzc3NpvNlClTxvzyyy8Oj0lKSjKXL1++re1s2bLFSDKzZ8++rcclJiaaxMRE+/1169YZSWbRokW31U9ma7t69aq5cuWK07aVHUJDQ029evUytK6np6eJiIhI1T5y5EgjyZw+fdrJ1TmaMGGCkWQOHjyYrdtxlrxc742vjYw6ePCgkWQmTJhwy3VTfi6c5dq1a+by5csmOTnZKf2dPn3aSDIjR45MtSwiIsJ4eno6ZTs3ExkZ6dRjlJab7Wd2W7RokZFk1q1bl+Pbzm2uuZKwkK1atGih2rVr2+8PHz5c0dHReuSRR/Too4/qzz//lIeHhyTJ1dVVrq7Z+2Nw6dIlFS5cWG5ubtm6nVspWLBgrm4/I06dOqWqVavmdhm5xhijK1eu2H8+rS6vvDYyo0CBAipQoEBulwFkTG6nLThPysjOli1b0lz++uuvG0nmvffes7el9dfeN998Y+rVq2d8fHyMp6enqVSpkhk+fLgx5v9GY268pYykNGrUyFSrVs1s3brVNGjQwHh4eJiBAwfalzVq1Mi+nZS+FixYYIYPH24CAgJM4cKFTevWrc2RI0ccaipTpkyaoxj/7vNWtUVERJgyZco4PD4hIcEMGTLE3H333cbNzc1UqlTJTJgwIdVfq5JMZGSkWbp0qalWrZpxc3MzVatWNV9//XWax/pGsbGxpmfPnsbf39+4u7ube++918yZMyfVsbjxlt4oRFrrphyflOd07969JiIiwvj4+Bhvb2/TvXt3c/HixVR9ffzxx+b+++83hQoVMkWLFjUdO3ZMdfxvlLKN9Or98MMPTZMmTUzx4sWNm5ubqVKlinn33XdT9VOmTBnTqlUrs2rVKhMSEmLc3d3N5MmTjTHGHDp0yLRu3doULlzYFC9e3AwaNMisWrUqzb9MN23aZMLDw423t7fx8PAwDRs2ND/88EOG671RZGSk8fT0TPN4derUyQQEBJhr164ZY4xZtmyZadmypSlRooRxc3Mz5cqVM2PGjLEvT3E7r43ExEQzYsQIc//99xtvb29TuHBhU79+fRMdHe3Q579HdiZNmmRKly5tChUqZBo2bGh27dqV5nN2o8w8/8b83++bfx/DlOfz+++/Nw888IBxd3c3wcHBZu7cuTftK2U/bryljH6kjOwcO3bMtGnTxnh6epq77rrLDB06NNVxvn79upk8ebKpWrWqcXd3N/7+/q
Zv377m3LlzN60hIiIizRput98tW7aY5s2bm2LFiplChQqZsmXLmh49emRoP9Ny9epVM2rUKFOhQgXj7u5u/Pz8TL169cw333zjsN6ff/5p2rdvb4oWLWrc3d1NSEiI+eKLL+zLU56vG293yigPIzt3kKeeekr//e9/9c0336hPnz5prvP777/rkUce0b333qsxY8bI3d1d+/bt048//ihJqlKlisaMGaNXX31Vffv2VYMGDSRJ//nPf+x9nD17Vi1atFCnTp305JNPKiAg4KZ1jRs3TjabTS+++KJOnTqlKVOmKCwsTDt37rytv/AzUtu/GWP06KOPat26derVq5dq1aql1atX64UXXtDx48c1efJkh/V/+OEHLVmyRM8++6y8vLw0bdo0tW/fXkeOHFGxYsXSrevy5ctq3Lix9u3bp/79+ys4OFiLFi1S9+7ddf78eQ0cOFBVqlTRxx9/rMGDB+vuu+/W0KFDJUnFixdPs8+PP/5YvXv3Vp06ddS3b19JUvny5R3W6dChg4KDgxUVFaXt27frgw8+kL+/v8aPH29fZ9y4cRoxYoQ6dOig3r176/Tp03r77bfVsGFD7dixQ76+vmluv127dtqzZ48+/fRTTZ48WXfddZdDvTNmzFC1atX06KOPytXVVV9++aWeffZZJScnKzIy0qGv3bt3q3PnzurXr5/69Omje+65RxcvXlTTpk118uRJDRw4UIGBgZo/f77WrVuXqpbo6Gi1aNFCISEhGjlypFxcXDR79mw1bdpU33//verUqXPLem/UsWNHTZ8+XV999ZWeeOIJe/ulS5f05Zdfqnv37vZRjTlz5qhIkSIaMmSIihQpoujoaL366quKj4/XhAkTHPrN6GsjPj5eH3zwgTp37qw+ffrowoULmjVrlsLDw/Xzzz+rVq1aDut/9NFHunDhgiIjI3XlyhVNnTpVTZs21a5du276+svs838z+/bt0+OPP65evXopIiJCH374obp3766QkBBVq1YtzccUL15cM2bM0DPPPKPHHntM7dq1kyTde++99nWuX7+u8PBwhYaG6q233tK3336riRMnqnz58nrmmWfs6/Xr109z5sxRjx49NGDAAB08eFDvvPOOduzYoR9//DHdEd5+/frpxIkTWrNmjT7++OM0l9+q31OnTql58+YqXry4XnrpJfn6+urQoUNasmRJhvfzRqNGjVJUVJT99R4fH6+tW7dq+/bteuihhyT983u7Xr16KlmypF566SV5enpq4cKFatu2rT7//HM99thjatiwoQYMGKBp06bpv//9r6pUqSJJ9n8tL7fTFpznViM7xhjj4+Nj7rvvPvv9G//amzx58i3ne9xsXkyjRo2MJDNz5sw0l6U1slOyZEkTHx9vb1+4cKGRZKZOnWpvy8jIzq1qu3FkZ9myZUaSGTt2rMN6jz/+uLHZbGbfvn32NknGzc3Noe2XX34xkszbb7+dalv/NmXKFCPJfPLJJ/a2q1evmrp165oiRYo47HvKX8YZcas5Oz179nRof+yxx0yxYsXs9w8dOmQKFChgxo0b57Derl27jKura6r2G91sDsylS5dStYWHh5ty5co5tJUpU8ZIMqtWrXJonzhxopFkli1bZm+7fPmyqVy5ssNfo8nJyaZixYomPDzcYTTu0qVLJjg42Dz00EMZqvdGycnJpmTJkqZ9+/YO7Sk/mxs2bLjpvvbr188ULlzYYY7Y7bw2rl27lmoOz99//20CAgIcnteUkQIPDw9z7Ngxe/vmzZuNJDN48GB7242v9aw+/+mN7Nx4fE6dOmXc3d3N0KFDb9rfrebs6P/PR/y3++67z4SEhNjvf//990aSmTdvnsN6KSOCN7bfKL05Oxntd+nSpbf8HXy7c3Zq1qx5y98JzZo1MzVq1HD4eUtOTjb/+c9/TMWKFe1td/KcHc7GusMUKVLkpmdlpfwl98UXXyg5OTlT23B3d1ePHj0yvH63bt3k5eVlv//444+rRI
kSWrlyZaa2n1ErV65UgQIFNGDAAIf2oUOHyhijr7/+2qE9LCzMYfTk3nvvlbe3tw4cOHDL7QQGBqpz5872toIFC2r
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_test['Class'], bins=10)\n",
"plt.xlabel('Class')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the target variable in the test set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "82afd315",
"metadata": {},
"source": [
"#### Display relationship between features in the training set using the correlation matrix"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "e8cf8eb1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(42.5, -0.5)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABy4AAAe2CAYAAABKEJQUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzddXRU1/rw8e9EZuLubkQIEtyhtBQp0EIFK1KkUOqlCrTQ9vbWBai73gq0pcXd3TUJxBPirhOf948JSSbMJMO9cJPfe5/PWrNWSfaZPD1nn2fvffY5+yg0Go0GIYQQQgghhBBCCCGEEEIIIYRoRybtHYAQQgghhBBCCCGEEEIIIYQQQsjEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBD/h+3bt4/x48fj5eWFQqHgr7/+anObPXv20LNnT1QqFSEhIXz33XfXlPn4448JCAjAwsKCfv36cezYsRsffDMycSmEEEIIIYQQQgghhBBCCCHE/2Hl5eV0796djz/+2KjySUlJjB07luHDh3PmzBmefPJJ5s2bx9atWxvL/PbbbyxatIjly5dz6tQpunfvzqhRo8jJyblZ/xsoNBqN5qZ9uxBCCCGEEEIIIYQQQgghhBDiv0ahULB27VomTJhgsMzzzz/Pxo0buXDhQuPPpkyZQlFREVu2bAGgX79+9OnTh48++giA+vp6fH19eeyxx3jhhRduSuzyxKUQQgghhBBCCCGEEEIIIYQQHUxVVRUlJSU6n6qqqhvy3YcPH2bEiBE6Pxs1ahSHDx8GoLq6mpMnT+qUMTExYcSIEY1lbgazm/bNQgghhBBCCCGEEEIIIYQQ4v8sS7+p7R3C/7Tn54Txyiuv6Pxs+fLlvPzyy//xd2dlZeHu7q7zM3d3d0pKSlCr1RQWFlJXV6e3TGxs7H/89w2RiUshhBBCCCGEEEIIIYQQQgghOpjFixezaNEinZ+pVKp2iua/QyYuhRBCCCGEEEIIIYQQQgghhOhgVCrVTZuo9PDwIDs7W+dn2dnZ2NnZYWlpiampKaampnrLeHh43JSYQN5xKYQQQgghhBBCCCGEEEIIIcT/lAEDBrBz506dn23fvp0BAwYAoFQq6dWrl06Z+vp6du7c2VjmZpCJSyGEEEIIIYQQQgghhBBCCCH+DysrK+PMmTOcOXMGgKSkJM6cOUNqaiqgXXZ25syZjeUfeughEhMTee6554iNjeWTTz5h9erVPPXUU41lFi1axJdffsn3339PTEwMCxcupLy8nNmzZ9+0/w9ZKlYIIYQQQgghhBBCCCGEEEKI/8NOnDjB8OHDG/999d2Ys2bN4rvvviMzM7NxEhMgMDCQjRs38tRTT7Fy5Up8fHz46quvGDVqVGOZyZMnk5uby7Jly8jKyiIqKootW7bg7u5+0/4/FBqNRnPTvl0IIYQQQgghhBBCCCGEEEL8n2TpN7W9Q/ifpk79pb1D+K+TpWKFEEIIIYQQQgghhBBCCCGEEO1OlooVQgghhBBCCCGEEEIIIYQQ11Ao5Pk38d8lNU4IIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e7M2jsAIYQQQgghhBBCCCGEEEII0fEo5Pk38V8mNU4IIY
QQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e7M2jsAIYQQQgghhBBCCCGEEEII0fEoFPL8m/jv6lATl5Z+U9s7BB3q1F/o+fP+9g5Dx6lpQ4j4el97h9EoZu5QpuzuOPEA/Dp8KP3/ONDeYeg4cs9gAp/f0N5h6Eh6axwByza3dxg6kl8dg/8bO9o7DB0pi0fQ65eOlQdOTh1C52861nkXPWcofdd0rPPu2H2DeeborvYOQ8e7/W4lZOIP7R1Go/i1M1l+qmOdc6/0HIFL2JPtHYaOvEsrCFi+pb3D0JH8yugO1R8AbZ/gldMdqz4t7zGCOfv3tHcYOr4Zcgu3bDzY3mHo2DN2EN1+7Fht3bkZQ+izumO1K8cnDabf7x0rpqP3DubRw7vbO4xGHw0YzuC/O9Y+OnDXYO7a0bHq998jhuAc+nh7h6Ej//IqZu7d295h6Phh2DAmdbCx5urhQztkP/yhgx0nDwB8Nmg4kd92rP10cfZQAj/uWHU86ZFhDF3fcfoE+8YP6pDnXKfPO1ZMcQuGMmprx2rrto7qmG1d4DPr2zsMHUnvjuf2LR3nnAPYPnoQfqs6Vm5KfXwYp/M71jXMHs7j2jsEIcR/mUyVCyGEEEIIIYQQQgghhBBCCCHanUxcCiGEEEIIIYQQQgghhBBCCCHanUxcCiGEEEIIIYQQQgghhBBCCCHanUxcCiGEEEIIIYQQQgghhBBCCCHanUxcCiGEEEIIIYQQQgghhBBCCCHanVl7ByCEEEIIIYQQQgghhBBCCCE6HoVCnn8T/11S44QQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7c7sRn1RbW0tGRkZ+Pn53aivFEIIIYQQQgghhBBCCCGEEO1EoVC0dwjif8wNm7i8ePEiPXv2pK6u7oZ836C+4Tz10Dh6dg3C092RSfPeY/22E61uM6R/BG+9NIPOoT5cycznzVVr+en3fTplFsy8nacWjMfd1Z7zMaksWvYdJ84mXFdskzp5MjPCB2dLJZcLy3j7ZAIX88v0lp0Y7MG4QDeCHawAiCko46OzyTrlb/Vx5p5OnkQ42eCgMmfKplNcLiq/rpimRXgyp6svLpZKYgvK+OfhBM7nleote1+YB3eGuNPJURtTdF4ZH5xI1in/+pBQJoZ66Gy3/0oB87deMCqe3D27ydm2lZqSYix9fPGZPBXrwECD5QtPniBz3d9U5+ehcnPHa+I92HftCoCmrpaMv/+i5MIFqvNyMbG0xDY8Au+J92Du4GBUPAD3BHkyPdQbJwsl8cXlvHcmgehC/cct0NaK+ZF+hDvY4GltwQdnE/ktPuOacq4WSh7pGsAAd0dUZiZcKavktRNxxBbp/96WZgzwZ/7QYFxtVcRklvDy3xc5e6Woze3Gdffiw2k92XYxiwU/6D8vXpvYlfv7+/Pq+ot8eyDJqHgAZvT1Y8GgQFxtVMRkl7J8YzRn04vb3G58F08+nBTFtphs5v9yqvHnoyLcub+PH1297HC0UnLHJweIztJfNw2Z2dOH+f38cbVREpNTxvJtlzibWaK37OhQVx4ZGIi/oyXmJiYkFVbw5bEU1l7IaiyTsniE3m1f3xXH50dTjIrpvk6ezAzX5oG4q3mgQP9xD7Kz4qFu/kQ42uBlY8G7pxL45ZJufbIyM2VhN3+G+zjjqDLnUmE5755KINrAd+
ozNcKTOV20eeBSYet54N5QD+4KcSfkah7IL2NFizzwzyGhTOx0bR5YsM24PABwb7An08O8cbZQEldUzrunDZ93QXY
"text/plain": [
"<Figure size 2500x2500 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"correlation_matrix = df_train.corr()\n",
"fig, ax = plt.subplots(figsize=(25, 25))\n",
"\n",
"ax = sns.heatmap(\n",
"    correlation_matrix,\n",
"    annot=True,\n",
"    linewidths=0.5,\n",
"    fmt=\".2f\",\n",
"    cmap=\"YlGnBu\"\n",
")\n",
"\n",
"# Jupyter notebook specific\n",
"bottom_side, top_side = ax.get_ylim()\n",
"ax.set_ylim(bottom_side + 0.5, top_side - 0.5)"
]
},
{
"cell_type": "markdown",
"id": "c2b4a57c",
"metadata": {},
"source": [
"We can see that the highest positive correlation is in the **V14** attribute and the most negative values are in the attributes **V1** and **V27**. So let's look at the distribution of those values in comparison to the class."
]
},
{
"cell_type": "markdown",
"id": "f1918d5b",
"metadata": {},
"source": [
"**V1 vs V27**"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8d4ce9a6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x7fe315fbfb90>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABNoAAANXCAYAAADjAjLCAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAADjfklEQVR4nOzdeVhUZf8G8PucYd8XRcAUcElFchdFc/lVCmqo1Ztmmlrma6aVLb5lZUhWaquVZablmpmlpaihZpobiokbobmElgqSICCyzzm/P8YZGWY7AwMDeH+uy0vnnGfOec4w9b7efZ/nK8iyLIOIiIiIiIiIiIiqRbT3BIiIiIiIiIiIiBoCBm1EREREREREREQ2wKCNiIiIiIiIiIjIBhi0ERERERERERER2QCDNiIiIiIiIiIiIhtg0EZERERERERERGQDDNqIiIiIiIiIiIhsgEEbERERERERERGRDTBoIyIiIiIiIiIisgEGbURERNRgjR8/HqGhoXa5tyAImDVrlk2vuXLlSrRt2xaOjo7w8fGx6bVt5fz58xAEAcuWLbP3VMxKTExEp06d4OLiAkEQkJuba+8p1Sp7/rNBRETUkDFoIyIiqgcEQVD0a9euXfaeqp79+/dj1qxZt12IURNOnTqF8ePHo2XLlli8eDG+/PJLu85n9erVmD9/vl3nUFXZ2dkYMWIEXF1d8dlnn2HlypVwd3c3GDd06FC4ubnh+vXrJq81evRoODk5ITs7GwDw3XffYcyYMWjdujUEQUD//v1r6jFqVFZWFhwcHDBmzBiTY65fvw5XV1c8+OCDAIBDhw5h6tSpaN++Pdzd3dG8eXOMGDECp0+fNnivuX+PDRgwoMaei4iIqKY52HsCREREZNnKlSv1Xq9YsQLbt283ON6uXbvanJZF+/fvR3x8PMaPH19nK7BqSlFRERwcbPd/tXbt2gVJkvDxxx+jVatWNrtuVa1evRqpqamYNm2a3vGQkBAUFRXB0dHRPhNT4NChQ7h+/Tpmz56N++67z+S40aNHIyEhAT/++CPGjh1rcL6wsBAbNmxATEwM/P39AQALFy7E4cOH0b17d134Vh8FBARgwIAB2LBhAwoLC+Hm5mYwZv369SguLtaFcfPmzcO+ffvw8MMPo0OHDsjMzMSCBQvQpUsXHDhwABEREbr3Vv53FwD8/vvv+PjjjzFw4MCaezAiIqIaxqCNiIioHqhcVXLgwAFs377dbLWJUrIso7i4GK6urtW+Ft3i4uJi0+tlZWUBQJ0PLAVBsPmz25rSz3Lo0KHw9PTE6tWrjQZtGzZswI0bNzB69GjdsZUrV6Jp06YQRVEvWKqPRo8ejcTERGzcuBGPPPKIwfnVq1fD29sbQ4YMAQC88MILWL16NZycnHRjRo4cibvuugtz587FqlWrdMeN/btr165dEAQBo0aNqoGnISIiqh1cOkpERNRALF26FPfccw8CAgLg7OyM8PBwLFy40GBcaGgo7r//fmzduhXdunWDq6srFi1aBAC4cOEChg4dCnd3dwQEBOD555/H1q1bjS5LPXjwIGJiYuDt7Q03Nzf069cP+/bt052fNWsWpk+fDgAICwvTLQs7f/680flPnToVHh4eKCwsNDg3atQoBAYGQq1WA9AEHEOGDEFwcDCcnZ3RsmVLzJ49W3feFO1f5Cs/i6l9xU6dOoX//Oc/8PPzg4uLC7p164aNGzeavYdW5T3aZs2aBUEQcPbsWV2Fn7e3Nx5//HGjz1xRaGgo4uLiAACNGzfWu7apveBCQ0Mxfvx43etly5ZBEATs27cPL7zwAho3bgx3d3c88MAD+Pfffw3e//PPP6Nfv37w9PSEl5cXunfvjtWrVwMA+vfvj82bN+PChQu6n6t2vy9Tn+Wvv/6KPn36wN3dHT4+Phg2bBhOnjypN6Y6n5HW999/j65du8LV1RWNGjXCmDFjcOnSJd35/v37Y9y4cQCA7t27QxAEvc+pIu
2yyB07dujCuYpWr14NT09PDB06VHesWbNmEMWq/V/s0tJSvPHGG+jatSu8vb3h7u6OPn36YOfOnXrjtJ/x+++/jy+//BItW7aEs7MzunfvjkOHDhlc96effkJERARcXFwQERGBH3/8UdF8HnjgAbi7u+t+7hVlZWVhx44d+M9//gNnZ2cAQK9evfRCNgBo3bo12rdvb/CzrqykpATr1q1Dv379cMcddyiaHxERUV3EijYiIqIGYuHChWjfvj2GDh0KBwcHJCQk4Omnn4YkSZgyZYre2D///BOjRo3CpEmTMHHiRLRp0wY3btzAPffcg4yMDDz33HMIDAzE6tWrDf6SD2hCk0GDBqFr166Ii4uDKIq6oG/Pnj2IjIzEgw8+iNOnT+Pbb7/FRx99hEaNGgHQBEXGjBw5Ep999hk2b96Mhx9+WHe8sLAQCQkJGD9+PFQqFQBNaOTh4YEXXngBHh4e+PXXX/HGG28gPz8f7733nk0+zz/++AO9e/dG06ZN8corr8Dd3R1r167F8OHDsW7dOjzwwANVuu6IESMQFhaGOXPmICUlBUuWLEFAQADmzZtn8j3z58/HihUr8OOPP2LhwoXw8PBAhw4dqnT/Z555Br6+voiLi8P58+cxf/58TJ06Fd99951uzLJly/DEE0+gffv2mDFjBnx8fHDkyBEkJibi0UcfxWuvvYa8vDxcvHgRH330EQDAw8PD5D1/+eUXDBo0CC1atMCsWbNQVFSETz/9FL1790ZKSorBpvxV+Yy083788cfRvXt3zJkzB1euXMHHH3+Mffv24ciRI/Dx8cFrr72GNm3a4Msvv8Sbb76JsLAwtGzZ0uQ1R48ejeXLl2Pt2rWYOnWq7nhOTg62bt2KUaNG2awaND8/H0uWLMGoUaMwceJEXL9+HV999RWio6ORnJyMTp066Y1fvXo1rl+/jkmTJkEQBLz77rt48MEH8ddff+mW7m7btg0PPfQQwsPDMWfOHGRnZ+Pxxx9XFGa5u7tj2LBh+OGHH5CTkwM/Pz/due+++w5qtVqvms8YWZZx5coVtG/f3uy4LVu2IDc31+L1iIiI6jyZiIiI6p0pU6bIlf9nvLCw0GBcdHS03KJFC71jISEhMgA5MTFR7/gHH3wgA5B/+ukn3bGioiK5bdu2MgB5586dsizLsiRJcuvWreXo6GhZkiS9+4eFhckDBgzQHXvvvfdkAHJ6errFZ5IkSW7atKn80EMP6R1fu3atDEDevXu32WedNGmS7ObmJhcXF+uOjRs3Tg4JCdG93rlzp96zaKWnp8sA5KVLl+qO3XvvvfJdd92ldz1JkuRevXrJrVu3tvg8AOS4uDjd67i4OBmA/MQTT+iNe+CBB2R/f3+L19O+/99//zV7H62QkBB53LhxutdLly6VAcj33Xef3s/t+eefl1UqlZybmyvLsizn5ubKnp6eco8ePeSioiK9a1Z835AhQ/Q+Wy1jn2WnTp3kgIAAOTs7W3fs2LFjsiiK8tixYw2esSqfUWlpqRwQECBHRETozXvTpk0yAPmNN94w+CwOHTpk9pqyLMvl5eVyUFCQHBUVpXf8iy++kAHIW7duNfne9u3by/369bN4j4r3Kikp0Tt27do1uUmTJnqfifYz9vf3l3NycnTHN2zYIAOQExISdMc6deokBwUF6X6+sizL27ZtkwEY/flVtnnzZhmAvGjRIr3jPXv2lJs2bSqr1Wqz71+5cqUMQP7qq6/MjnvooYdkZ2dn+dq1axbnREREVJdx6SgREVEDUbGqJi8vD1evXkW/fv3w119/IS8vT29sWFgYoqOj9Y4lJiaiadOmesvgXFxcMHHiRL1xR48exZkzZ/Doo48iOzsbV69exdWrV3Hjxg3ce++92L17NyRJsnr+giDg4YcfxpYtW1BQUKA7/t1336Fp06a4++67jT7r9evXcfXqVfTp0weFhYU4deqU1feuLCcnB7/++itGjBihu/7Vq1eRnZ2N6OhonDlzRm
85ojWeeuopvdd9+vRBdnY28vPzqz1vJf773/9CEAS9+6vValy4cAEAsH37dly/fh2vvPKKwV5rFd+nVEZGBo4ePYr
"text/plain": [
"<Figure size 1500x1000 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"plt.figure(figsize=(15, 10))\n",
"\n",
"# Scatter the samples whose target class is 1\n",
"plt.scatter(\n",
"    df_train['V1'][df_train['Class'] == 1],\n",
"    df_train['V27'][df_train['Class'] == 1],\n",
")\n",
"\n",
"# Scatter the samples whose target class is 2\n",
"plt.scatter(\n",
"    df_train['V1'][df_train['Class'] == 2],\n",
"    df_train['V27'][df_train['Class'] == 2],\n",
")\n",
"\n",
"plt.title('Target value in function of V1 and V27')\n",
"\n",
"plt.xlabel('V1')\n",
"plt.ylabel('V27')\n",
"plt.legend(['Biodegradable', 'Non-biodegradable'])\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d50d1f44",
"metadata": {},
"outputs": [],
"source": [
"# Splitting the data into features and labels\n",
"X_train = df_train.drop('Class', axis=1)\n",
"y_train = df_train['Class']\n",
"X_test = df_test.drop('Class', axis=1)\n",
"y_test = df_test['Class']"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "f0aa7c9d",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Put models in a dictionary\n",
"models = {\n",
"    \"Logistic Regression\": LogisticRegression(),\n",
"    \"KNN\": KNeighborsClassifier(),\n",
"    \"Random Forest\": RandomForestClassifier()\n",
"}\n",
"\n",
"# Create a function to fit and score models\n",
"def fit_and_score(models, X_train, X_test, y_train, y_test):\n",
"    \"\"\"\n",
"    Fits and evaluates given machine learning models.\n",
"    models: dict of different Scikit-Learn machine learning models\n",
"    X_train: training data (no labels)\n",
"    X_test: testing data (no labels)\n",
"    y_train: training labels\n",
"    y_test: test labels\n",
"    \"\"\"\n",
"\n",
"    # Set random seed\n",
"    np.random.seed(42)\n",
"\n",
"    # Make a dictionary to keep model scores\n",
"    model_scores = {}\n",
"\n",
"    # Loop through models\n",
"    for name, model in models.items():\n",
"        # Fit the model to the data\n",
"        model.fit(X_train, y_train)\n",
"        # Evaluate the model and store its score in model_scores\n",
"        model_scores[name] = model.score(X_test, y_test)\n",
"\n",
"    return model_scores"
]
},
{
"cell_type": "markdown",
"id": "10387356",
"metadata": {},
"source": [
"#### Check if there are any missing values"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "87e277e6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"V4 25\n",
"V22 16\n",
"V27 8\n",
"V29 8\n",
"V37 25\n",
"dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na_counts = df_train.isna().sum()\n",
"na_counts[na_counts > 0]\n"
]
},
{
"cell_type": "markdown",
"id": "cb57434a",
"metadata": {},
"source": [
"#### We can see that there are five attributes that have missing values. Let's inspect them."
]
},
{
"cell_type": "markdown",
"id": "9dbd2c02",
"metadata": {},
"source": [
"##### V4"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ca1e544a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 821.000000\n",
"mean 0.030451\n",
"std 0.198281\n",
"min 0.000000\n",
"25% 0.000000\n",
"50% 0.000000\n",
"75% 0.000000\n",
"max 2.000000\n",
"Name: V4, dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V4'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "9e4d7d1d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0 800\n",
"1.0 17\n",
"2.0 4\n",
"Name: V4, dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V4'].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "3a3191c9",
"metadata": {},
"source": [
"We can see that the majority of entries in this attribute are zeros, so I think it is best to set all the `NaN` values to zero."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "d8489bd4",
"metadata": {},
"outputs": [],
"source": [
"df_train['V4'].fillna(0, inplace=True)\n",
"df_test['V4'].fillna(0, inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "3e84e48b",
"metadata": {},
"source": [
"##### V22"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "a711431d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 830.000000\n",
"mean 1.243898\n",
"std 0.094109\n",
"min 0.898000\n",
"25% 1.187500\n",
"50% 1.248500\n",
"75% 1.298750\n",
"max 1.641000\n",
"Name: V22, dtype: float64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V22'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "f0325325",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.299 9\n",
"1.280 9\n",
"1.296 8\n",
"1.254 8\n",
"1.264 8\n",
" ..\n",
"1.449 1\n",
"1.159 1\n",
"1.363 1\n",
"1.331 1\n",
"1.410 1\n",
"Name: V22, Length: 321, dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V22'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "25a74baf",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjIAAAHHCAYAAACle7JuAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABLi0lEQVR4nO3deVxU1f8/8NcAssgyyg6iiLiFiguW4YqKIhq5FW4lmGYWrlgWnywlLTTLLbf6ZriFW25lKi4ouFYq5FIumOYC4s5moDDn90c/JkeGbZzhzoXX8/G4j4dz751z33NnhBfnnntGIYQQICIiIpIhE6kLICIiItIVgwwRERHJFoMMERERyRaDDBEREckWgwwRERHJFoMMERERyRaDDBEREckWgwwRERHJFoMMERERyRaDTDU0ffp0KBSKSjlWQEAAAgIC1I8PHDgAhUKBH374oVKOHx4ejvr161fKsXSVk5ODUaNGwdXVFQqFAhMnTqxwG0Xv6Z07d/RfIEGhUGD69Onqx3I63ytWrIBCocCVK1cMfqzw8HDY2NgY/DiG9vT7TcaNQUbmin5IFS2WlpZwd3dHUFAQFi5ciOzsbL0cJy0tDdOnT0dKSope2tMnY66tPD777DOsWLECb7/9NlavXo3XX3+91H23bt1aecU94eWXX0bNmjVL/UwNGzYM5ubmuHv3Lu7evYs5c+agc+fOcHJyQq1atfDiiy9i/fr1xZ7322+/YezYsWjWrBmsra1Rr149hIaG4sKFC3p9DUeOHMH06dPx4MEDvbarD8Zc25MePnyI6dOn48CBA5LVsGPHjmobNOLi4jB//nypyzAugmQtNjZWABCffPKJWL16tfjuu+/EZ599Jnr27CkUCoXw9PQUv//+u8ZzHj9+LP75558KHee3334TAERsbGyFnpefny/y8/PVj/fv3y8AiI0bN1aoHV1re/TokcjLy9PbsQyhXbt2okOHDuXa19raWoSFhRVbP23aNAFA3L59W8/V/WfdunUCgFi5cqXW7bm5ucLa2lqEhIQIIYT46aefRI0aNUTfvn3F/PnzxaJFi0TXrl0FAPHxxx9rPHfgwIHC1dVVjBs3Tvzf//2fmDFjhnBxcRHW1tbi9OnTensNc+bMEQDE5cuXK/S8f/75Rzx+/Fj92BDnW9faylJQUCD++ecfoVKp9NLe7du3BQAxbdq0YtvCwsKEtbW1Xo5TmoiICGHIX19Pv9/GpE+fPsLT01PqMoyKmTTxifQtODgYbdu2VT+OiopCQkICXnrpJbz88sv4888/YWVlBQAwMzODmZlh3/qHDx+iZs2aMDc3N+hxylKjRg1Jj18et27dgo+Pj9RllOnll1+Gra0t4uLiMHz48GLbt23bhtzcXAwbNgwA0KxZM1y8eBGenp7qfd555x0EBgZi9uzZmDJlCqytrQEAkZGRiIuL0/i8DBo0CC1atMCsWbOwZs0aA7+64lQqFR49egRLS0tYWlpW+vH1xdTUFKamplKXIZmCggKoVKoK/SyS8/tdLUmdpOjZFPXI/Pbbb1q3f/bZZwKA+Oabb9Triv6afNLu3btFhw4dhFKpFNbW1qJx48YiKipKCPFfL8rTS1EPSJcuXUSzZs3E8ePHRadOnYSVlZWYMGGCeluXLl3Uxylqa926dSIqKkq4uLiImjVripCQEHH16lWNmjw9PbX2PjzZZlm1hYWFFfvrJScnR0RGRgoPDw9hbm4uGjduLObMmVPsL1YAIiIiQmzZskU0a9ZMmJubCx8fH7Fz506t5/ppGRkZ4o033hDOzs7CwsJC+Pr6ihUrVhQ7F08vJf1Frm3fovNT9J5evHhRhIWFCaVSKezs7ER4eLjIzc0t1tbq1atFmzZthKWlpahdu7YYNGhQsfOvTVhYmDAzMxMZGRnFtr300kvC1tZWPHz4sNQ2Fi5cKACIU6dOlXm8Nm3aiDZt2pS53++//y7CwsKEl5eXsLCwEC4uLm
LEiBHizp076n2KzlFJ57vo/V6zZo3w8fERZmZmYsuWLeptT/ZAFLX1559/ildffVXY2toKe3t7MX78eI3ezsuXL5fYW/hkm2XVJoTu71nRz4gn2/L09BR9+vQRBw8eFM8//7ywsLAQXl5eJfa2Pf16nl6KXkdRj8z169dF3759hbW1tXB0dBSTJ08WBQUFGm0VFhaKefPmCR8fH2FhYSGcnZ3F6NGjxb1790qtISwsTGsNT9Y3Z84cMW/ePNGgQQNhYmIikpOTRX5+vvjoo49EmzZthJ2dnahZs6bo2LGjSEhIKHaMkt7v8v7/etqFCxfEgAEDhIuLi7CwsBB16tQRgwYNEg8ePNDYr6z3uEuXLsVeN3tn2CNT5b3++uv43//+h927d+PNN9/Uus/Zs2fx0ksvwdfXF5988gksLCyQmpqKw4cPAwCee+45fPLJJ/j4448xevRodOrUCQDQvn17dRt3795FcHAwBg8ejNdeew0uLi6l1vXpp59CoVDg/fffx61btzB//nwEBgYiJSVF3XNUHuWp7UlCCLz88svYv38/Ro4ciVatWiE+Ph7vvfcebty4gXnz5mnsf+jQIWzevBnvvPMObG1tsXDhQgwcOBBXr16Fg4NDiXX9888/CAgIQGpqKsaOHQsvLy9s3LgR4eHhePDgASZMmIDnnnsOq1evxqRJk+Dh4YHJkycDAJycnLS2uXr1aowaNQovvPACRo8eDQDw9vbW2Cc0NBReXl6IiYnByZMn8e2338LZ2RmzZ89W7/Ppp5/io48+QmhoKEaNGoXbt2/jq6++QufOnZGcnIxatWqV+LqGDRuGlStXYsOGDRg7dqx6/b179xAfH48hQ4aU+f7dvHkTAODo6FjqfkIIZGRkoFmzZqXuBwB79uzBX3/9hREjRsDV1RVnz57FN998g7Nnz+LYsWNQKBQYMGAALly4gLVr12LevHnq4z95vhMSEtSvzdHRscyB4qGhoahfvz5iYmJw7NgxLFy4EPfv38eqVavKrPlJZdX2LO9ZSVJTU/HKK69g5MiRCAsLw3fffYfw8HD4+fmVeM6dnJywdOlSvP322+jfvz8GDBgAAPD19VXvU1hYiKCgILRr1w5ffPEF9u7diy+//BLe3t54++231fu99dZbWLFiBUaMGIHx48fj8uXLWLRoEZKTk3H48OESe1PfeustpKWlYc+ePVi9erXWfWJjY5GXl4fRo0fDwsIC9vb2yMrKwrfffoshQ4bgzTffRHZ2NpYvX46goCD8+uuvaNWqVZnnrDz/v5726NEjBAUFIT8/H+PGjYOrqytu3LiB7du348GDB1AqlQDK9x5/+OGHyMzMxPXr19U/q6rC4OpnJnWSomdTVo+MEEIolUrRunVr9eOne2TmzZtX5vX+0sahFP2VsGzZMq3btPXI1KlTR2RlZanXb9iwQQAQCxYsUK8rT49MWbU93SOzdetWAUDMnDlTY79XXnlFKBQKkZqaql4HQJibm2us+/333wUA8dVXXxU71pPmz58vAIg1a9ao1z169Ej4+/sLGxsbjdde9NdxeZQ1RuaNN97QWN+/f3/h4OCgfnzlyhVhamoqPv30U439Tp8+LczMzIqtf1pBQYFwc3MT/v7+GuuXLVsmAIj4+PhSn3/37l3h7OwsOnXqVOp+Qvz71ykAsXz58jL31dYLtHbtWgFAJCUlqdeVNg4FgDAxMRFnz57Vuk3bX+gvv/yyxn7vvPOOAKAel1beHpnSanvW96ykHpmnz82tW7eEhYWFmDx5cqntlTVGBv9/zN6TWrduLfz8/NSPDx48KACI77//XmO/Xbt2aV3/tJLGyBSdbzs7O3Hr1i2NbQUFBRrj9YQQ4v79+8LFxaXY/5uS3u+y/n9pk5ycXOa4wIq8xxwjUxzvWqoGbGxsSr3TpOivuW3btkGlUul0DAsLC4wYMaLc+w8fPhy2trbqx6+88grc3NywY8cOnY5fXjt27ICpqSnGjx+vsX7y5M
kQQmDnzp0a6wMDAzV6PXx9fWFnZ4e//vqrzOO4urpiyJAh6nU1atTA+PHjkZOTg8TERD28muLGjBmj8bhTp064e/c
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_train['V22'], bins=20)\n",
"plt.xlabel('V22')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the V22 attribute in the train set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "6d6b63fd",
"metadata": {},
"source": [
"The distribution of the feature **V22** looks roughly normal, so I fill the missing values with the column `mean()`."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "2b2b6e2d",
"metadata": {},
"outputs": [],
"source": [
"df_train['V22'].fillna(df_train['V22'].mean(), inplace=True)\n",
"df_test['V22'].fillna(df_test['V22'].mean(), inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "4164f62c",
"metadata": {},
"source": [
"##### V27"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "9a8b64ac",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 838.000000\n",
"mean 2.218153\n",
"std 0.221545\n",
"min 1.000000\n",
"25% 2.107000\n",
"50% 2.251000\n",
"75% 2.359750\n",
"max 2.859000\n",
"Name: V27, dtype: float64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V27'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "1bddfb76",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.000 36\n",
"2.236 31\n",
"2.194 24\n",
"1.848 22\n",
"2.175 21\n",
" ..\n",
"2.294 1\n",
"2.466 1\n",
"2.488 1\n",
"2.372 1\n",
"2.622 1\n",
"Name: V27, Length: 290, dtype: int64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V27'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "f1787f2e",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjIAAAHHCAYAAACle7JuAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABPhElEQVR4nO3deViUVf8/8PeIzohsirI+IBK4oWJGhmgqKoJohmm55AKKaT5gqW3SoqIVmuVSKVZfA7UQl0TLEnIDl9TSJJdSwTQ1QcyFzUBlzu8Pf8zjyDaMM9xzD+/Xdc1Vc+4zZz5nzj3jhzPnPqMQQggQERERyVADqQMgIiIi0hcTGSIiIpItJjJEREQkW0xkiIiISLaYyBAREZFsMZEhIiIi2WIiQ0RERLLFRIaIiIhki4kMERERyRYTmXpozpw5UCgUdfJcgYGBCAwM1NxPT0+HQqHAxo0b6+T5IyIi0KpVqzp5Ln0VFRVh4sSJcHZ2hkKhwLRp02rdRvmY/vPPP4YPsJ47f/48FAoFEhMTNWURERGwtraWLqhaqOv3e8eOHevkuYylsvEm08ZERuYSExOhUCg0t8aNG8PV1RUhISH4+OOPUVhYaJDnuXz5MubMmYPMzEyDtGdIphybLt5//30kJiZiypQpWLNmDcaOHVtt3c2bN9ddcPd5+umn0aRJk2rPqdGjR0OpVOLatWu4du0aFi5ciF69esHBwQFNmzZFt27dsG7dugqPi4iI0DqPH7z9/fffBunDDz/8gDlz5hikLUMz5djuZwrvt6SkJCxZskSy55fS8uXLmWQ9SJCsJSQkCABi7ty5Ys2aNeLLL78U77//vggODhYKhUJ4eHiI3377Tesxd+7cEf/++2+tnueXX34RAERCQkKtHldaWipKS0s193fv3i0AiA0bNtSqHX1ju337tigpKTHYcxmDv7+/6NGjh051raysRHh4eIXy2bNnCwDi6tWrBo7uf5KTkwUAsWrVqkqPFxcXCysrKzF48GAhhBDfffedaNSokQgLCxNLliwRn376qejTp48AIGbNmqX12J9++kmsWbNG67Z69WrRpEkT4ePjY7A+REVFidp+7KnVavHvv/+Ku3fvasrCw8OFlZWVweLSNzZd6PN+r05177fevXuLDh06GOy5qjJo0CDh4eFhlLYrG29T0qFDB9G7d2+pwzApDaVKoMiwQkND8fjjj2vux8TEYNeuXXjqqafw9NNP448//oClpSUAoGHDhmjY0LhDf+vWLTRp0gRKpdKoz1OTRo0aSfr8usjLy4OPj4/UYdTo6aefho2NDZKSkjBu3LgKx7ds2YLi4mKMHj0aANChQwdkZWXBw8NDU+e///0vgoKCsGDBArz++uuwsrICAAQEBCAgIECrvX379uHWrVua9ura3bt3oVaroVQq0bhxY0liMIS6eL+bspKSEiiVSjRooNsXEOUz2yQjUmdS9HDKZ2R++eWXSo+///77AoD4/PPPNWXlf73f78cffxQ9evQQdnZ2wsrKSrRp00bExMQIIf43i/LgrfwvsvK/wg4fPix69uwpLC0txcsvv6w5dv9fD+VtJScni5iYGOHk5CSaNGkiBg8eLC5cuKAVk4eHR6WzD/e3WVNs4eHhFf5yKyoqEjNmzBBubm5CqVSKNm3aiIULFwq1Wq1VD4CIiooSKSkpokOHDkKpVAofHx+xbdu2Sl/rB125ckVMmDBBODo6CpVKJXx9fUViYmKF1+LB27lz5yptr7K65a9P+ZhmZWWJ8PBwYWdnJ2xtbUVERIQoLi6u0NaaNWvEY489Jho3biyaNWsmRowYUeH1r0x4eLho2LChuHLlSoVjTz31lLCxsRG3bt2qto2PP/5YABDHjh2rtt6UKVOEQqGo8vW43549e8Szzz4r3N3dhVKpFG5ubmLatGlasYSHh1f6GgohxLlz5wQAsXDhQrF48WLxyCOPiAYNGoijR49qjt0/A1E+I3P27F
kRHBwsmjRpIlxcXERsbKzWeVQ+xrt379aK98E2q4tNCCHKysrE4sWLhY+Pj1CpVMLR0VFMmjRJXL9+vcbXprL3u77ntq6fBSdPnhSBgYHC0tJSuLq6igULFlRoq6SkRMyaNUt4eXlpxuy1116rcQa1d+/eFZ6//D1eHt/atWvFW2+9JVxdXYVCoRA3btwQ165dE6+88oro2LGjsLKyEjY2NmLAgAEiMzNTq/3qxvvSpUsiLCxMWFlZiRYtWohXXnlFp5mbX375RQQHB4vmzZuLxo0bi1atWonx48dr1dFljD08PCr0nbMznJExe2PHjsWbb76JH3/8ES+88EKldU6ePImnnnoKvr6+mDt3LlQqFbKzs7F//34AQPv27TF37lzMmjULkyZNQs+ePQEA3bt317Rx7do1hIaGYuTIkRgzZgycnJyqjeu9996DQqHAG2+8gby8PCxZsgRBQUHIzMzUzBzpQpfY7ieEwNNPP43du3cjMjISjz76KNLS0vDaa6/h77//xuLFi7Xq79u3D5s2bcJ///tf2NjY4OOPP8awYcNw4cIFNG/evMq4/v33XwQGBiI7OxvR0dHw9PTEhg0bEBERgZs3b+Lll19G+/btsWbNGkyfPh1ubm545ZVXAAAODg6VtrlmzRpMnDgRTzzxBCZNmgQA8PLy0qozfPhweHp6Ii4uDr/++iv+7//+D46OjliwYIGmznvvvYd33nkHw4cPx8SJE3H16lV88skn6NWrF44ePYqmTZtW2a/Ro0dj1apVWL9+PaKjozXl169fR1paGkaNGlXj+OXm5gIAWrRoUWWdO3fuYP369ejevbtOi7U3bNiAW7duYcqUKWjevDl+/vlnfPLJJ7h06RI2bNgAAJg8eTIuX76M7du3Y82aNZW2k5CQgJKSEkyaNAkqlQr29vZQq9WV1i0rK8OAAQPQrVs3fPDBB0hNTcXs2bNx9+5dzJ07t8aY71dTbJMnT0ZiYiLGjx+Pl156CefOncOnn36Ko0ePYv/+/XrNPOpzbuvyfrtx4wYGDBiAoUOHYvjw4di4cSPeeOMNdOrUCaGhoQAAtVqNp59+Gvv27cOkSZPQvn17HD9+HIsXL8aZM2eqXQf21ltvIT8/H5cuXdK8Xx9ceD1v3jwolUq8+uqrKC0thVKpxO+//47Nmzfjueeeg6enJ65cuYLPPvsMvXv3xu+//w5XV9dqX6+ysjKEhITA398fH374IXbs2IGPPvoIXl5emDJlSpWPy8vLQ3BwMBwcHDBz5kw0bdoU58+fx6ZNm7Tq6TLGS5YswdSpU2FtbY233noLAGr8rK0XpM6k6OHUNCMjhBB2dnaiS5cumvsP/oW2ePHiGtdX1PS9OACxYsWKSo9VNiPzn//8RxQUFGjK169fLwCIpUuXasp0mZGpKbYHZ2Q2b94sAIh3331Xq96zzz4rFAqFyM7O1pQBEEqlUqvst99+EwDEJ598UuG57rdkyRIBQHz11Veastu3b4uAgABhbW2t1XcPDw8xaNCgatsrV9MamQkTJmiVP/PMM6J58+aa++fPnxcWFhbivffe06p3/Phx0bBhwwrlD7p7965wcXERAQEBWuUrVqwQAERaWlq1j7927ZpwdHQUPXv2rLbed999JwCI5cuXV1uvXGWzQHFxcUKhUIi//vpLU1bVOpTyv8JtbW1FXl5epcce/AsdgJg6daqmTK1Wi0GDBgmlUql5L+k6I1NdbHv37hUAxNdff61VnpqaWmn5g6qakdH33Nbls2D16tWastLSUuHs7CyGDRumKVuzZo1o0KCB2Lt3r9bjy8+j/fv3VxtDVWtkyl/vRx55pMI5UVJSIsrKyrTKzp07J1QqlZg7d65WWVXjfX89IYTo0qWL8PPzqzbWlJSUGj+jazPGXCNTEa9aqgesra2rvdKk/C/wLVu2VPnXZ01UKhXGjx+vc/1x48bBxsZGc//ZZ5+Fi4sLfvjhB72eX1c//PADLCws8NJLL2mVv/
LKKxBCYNu2bVrlQUFBWrMevr6+sLW1xZ9//lnj8zg7O2PUqFGaskaNGuGll15CUVERMjIyDNCbil588UWt+z179sS
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_train['V27'], bins=20)\n",
"plt.xlabel('V27')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the V27 attribute in the train set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "53b79865",
"metadata": {},
"source": [
"The distribution of the feature **V27** looks roughly normal, so I fill the missing values with the column `mean()`."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "8974127e",
"metadata": {},
"outputs": [],
"source": [
"# Set the nan values to the mean of the column\n",
"df_train['V27'].fillna(df_train['V27'].mean(), inplace=True)\n",
"df_test['V27'].fillna(df_test['V27'].mean(), inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "3afb5a2f",
"metadata": {},
"source": [
"##### V29"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "f410439d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 838.00000\n",
"mean 0.02506\n",
"std 0.15640\n",
"min 0.00000\n",
"25% 0.00000\n",
"50% 0.00000\n",
"75% 0.00000\n",
"max 1.00000\n",
"Name: V29, dtype: float64"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V29'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "2d33e7c4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0 817\n",
"1.0 21\n",
"Name: V29, dtype: int64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V29'].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "515e9e80",
"metadata": {},
"source": [
"We can see that the majority of entries in this attribute are zeros, so I think it is best to set all the `NaN` values to zero."
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "48e8ba49",
"metadata": {},
"outputs": [],
"source": [
"# Set nan values to 0\n",
"df_train['V29'].fillna(0, inplace=True)\n",
"df_test['V29'].fillna(0, inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "f659f8bc",
"metadata": {},
"source": [
"##### V37"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "8515f06b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 821.000000\n",
"mean 2.549406\n",
"std 0.625021\n",
"min 1.467000\n",
"25% 2.101000\n",
"50% 2.461000\n",
"75% 2.861000\n",
"max 5.750000\n",
"Name: V37, dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V37'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "36bc89b5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.167 9\n",
"2.500 9\n",
"2.833 8\n",
"2.667 8\n",
"1.833 7\n",
" ..\n",
"2.029 1\n",
"1.886 1\n",
"2.089 1\n",
"2.197 1\n",
"2.206 1\n",
"Name: V37, Length: 535, dtype: int64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V37'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "02c38a9f",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjMAAAHHCAYAAABKudlQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABNjUlEQVR4nO3deVhUZf8G8HsUZ0SWUZQ1FhEXRMWMTFFT3FA0xNQss8Qt01BTtIx6K5cMrV+KlqKWgVqkYqJlEbmiphaY5Pa6YBqaLOYCgjEo8/z+6GVyZB9nOHPg/lzXuS7nnDPPfOeccbjnmec8oxBCCBARERHJVD2pCyAiIiJ6GAwzREREJGsMM0RERCRrDDNEREQkawwzREREJGsMM0RERCRrDDNEREQkawwzREREJGsMM0RERCRrDDN10Ny5c6FQKGrksQICAhAQEKC7vW/fPigUCmzZsqVGHn/s2LFo3rx5jTyWofLz8zFx4kQ4OTlBoVBgxowZ1W6j5Jz+9ddfxi+wjrt06RIUCgViY2N168aOHQtra2vpiqqGmv7/3r59+xp5LFMp63yT+WOYkbnY2FgoFArd0rBhQ7i4uGDAgAFYvnw5bt++bZTHuXr1KubOnYu0tDSjtGdM5lxbVbz//vuIjY3FlClTsGHDBrz44osV7rtt27aaK+4+Q4YMQaNGjSp8TY0ePRpKpRLXr18HAMycOROPPfYY7Ozs0KhRI7Rt2xZz585Ffn6+3v3Gjh2r9zp+cPnzzz+N8hy+//57zJ071yhtGZs513Y/c/j/FhcXh6ioKMkeX0orV65k0CqLIFmLiYkRAMT8+fPFhg0bxOeffy7ef/99ERgYKBQKhfDw8BC//fab3n3u3r0r/v7772o9TkpKigAgYmJiqnU/jUYjNBqN7vbevXsFABEfH1+tdgytraioSBQWFhrtsUyhS5cuonv37lXa18rKSoSGhpZa/+677woA4tq1a0au7l8bN24UAMS6devK3F5QUCCsrKxEcHCwbl337t3F9OnTxfLly8WaNWvElClThEqlEt27dxfFxcW6/Q4dOiQ2bNigt6xfv140atRI+Pj4GO05hIWFieq+7Wm1WvH333+Le/fu6daFhoYKKysro9VlaG1VYcj/94pU9P+tV69eol27dkZ7rPIMHjxYeHh4mKTtss63OWnXrp3o1auX1GWYHQvJUhQZVVBQEB5//HHd7YiICOzZswdPPfUUhgwZgv/+97+wtLQEAFhYWMDCwrSn/s6dO2jUqBGUSqVJH6cyDRo0kPTxqyInJwc+Pj5Sl1GpIUOGwMbGBnFxcRgzZkyp7du3b0dBQQFGjx6tW3fw4MFS+3l5eWH27Nn45Zdf0LVrVwCAv78//P399fY7ePAg7ty5o9deTbp37x60Wi2USiUaNmwoSQ3GUBP/381ZYWEhlEol6tWr2hcRJT3cJDNSpyl6OCU9MykpKWVuf//99wUAsWbNGt26kk/x9/vxxx9F9+7dhVqtFlZWVqJ169YiIiJCCPFvb8qDS8kns5JPY6mpqeLJJ58UlpaW4tVXX9Vtu/9TRElbGzduFBEREcLR0VE0atRIBAcHi4yMDL2aPDw8yuyFuL/NymoLDQ0t9QkuPz9fhIeHC1dXV6FUKkXr1q3Fhx9+KLRard5+AERYWJhISEgQ7dq1E0qlUvj4+IjExMQyj/WDsrOzxfjx44WDg4NQqVTC19dXxMbGljoWDy4XL14ss72y9i05PiXn9Pz58yI0NFSo1Wpha2srxo4dKwoKCkq1tWHDBvHYY4+Jhg0biiZNmohnn3221PEvS2hoqLCwsBDZ2dmltj311FPCxsZG3Llzp8I2tmzZIgBUehynTJkiFApFucfjfvv37xcjRowQbm5uQqlUCldXVzFjxgy9WkJDQ8s8hkIIcfHiRQFAfPjhh2Lp0qWiRYsWol69euLYsWO6bff3RJT0zFy4cEEEBgaKRo0aCWdnZzFv3jy911
HJOd67d69evQ+2WVFtQghRXFwsli5dKnx8fIRKpRIODg5i0qRJ4saNG5Uem7L+vxv62q7qe8GpU6dEQECAsLS0FC4uLmLx4sWl2iosLBTvvPOO8PLy0p2z1157rdKe1F69epV6/JL/4yX1ffXVV+Ktt94SLi4uQqFQiJs3b4rr16+LWbNmifbt2wsrKythY2MjBg4cKNLS0vTar+h8X7lyRYSEhAgrKyvRrFkzMWvWrCr14KSkpIjAwEDRtGlT0bBhQ9G8eXMxbtw4vX2qco49PDxKPXf20vyj7sb1OuLFF1/Em2++iR9//BEvvfRSmfucOnUKTz31FHx9fTF//nyoVCqkp6fjp59+AgC0bdsW8+fPxzvvvINJkybhySefBAB069ZN18b169cRFBSE5557Di+88AIcHR0rrGvhwoVQKBSYM2cOcnJyEBUVhX79+iEtLU3Xg1QVVantfkIIDBkyBHv37sWECRPw6KOPIikpCa+99hr+/PNPLF26VG//gwcPYuvWrXjllVdgY2OD5cuXY/jw4cjIyEDTpk3Lrevvv/9GQEAA0tPTMXXqVHh6eiI+Ph5jx47FrVu38Oqrr6Jt27bYsGEDZs6cCVdXV8yaNQsAYG9vX2abGzZswMSJE/HEE09g0qRJAP7p5bjfyJEj4enpicjISPz666/47LPP4ODggMWLF+v2WbhwId5++22MHDkSEydOxLVr1/Dxxx+jZ8+eOHbsGBo3blzu8xo9ejTWrVuHzZs3Y+rUqbr1N27cQFJSEkaNGlXq/N27dw+3bt1CUVERTp48if/85z+wsbHBE088Ue7j3L17F5s3b0a3bt2qNIA7Pj4ed+7cwZQpU9C0aVP88ssv+Pjjj3HlyhXEx8cDAF5++WVcvXoVO3fuxIYNG8psJyYmBoWFhZg0aRJUKhXs7Oyg1WrL3Le4uBgDBw5E165d8cEHH+CHH37Au+++i3v37mH+/PmV1ny/ymp7+eWXERsbi3HjxmH69Om4ePEiPvnkExw7dgw//fSTQT2Qhry2q/L/7ebNmxg4cCCGDRuGkSNHYsuWLZgzZw46dOiAoKAgAIBWq8WQIUNw8OBBTJo0CW3btsWJEyewdOlSnDt3rsJxYW+99RZyc3Nx5coV3f/XBwdjL1iwAEqlErNnz4ZGo4FSqcTp06exbds2PPPMM/D09ER2djZWr16NXr164fTp03BxcanweBUXF2PAgAHo0qUL/u///g+7du3CRx99BC8vL0yZMqXc++Xk5CAwMBD29vZ444030LhxY1y6dAlbt27V268q5zgqKgrTpk2DtbU13nrrLQCo9L22zpA6TdHDqaxnRggh1Gq16NSpk+72g5/Uli5dWul4i8q+JwcgVq1aVea2snpmHnnkEZGXl6dbv3nzZgFALFu2TLeuKj0zldX2YM/Mtm3bBADx3nvv6e03YsQIoVAoRHp6um4dAKFUKvXW/fbbbwKA+Pjjj0s91v2ioqIEAPHFF1/o1hUVFQl/f39hbW2t99w9PDzE4MGDK2yvRGVjZsaPH6+3/umnnxZNmzbV3b506ZKoX7++WLhwod5+J06cEBYWFqXWP+jevXvC2dlZ+Pv7661ftWqVACCSkpJK3efw4cN6nyTbtGlTqqfiQd9++60AIFauXFnhfiXK6g2KjIwUCoVC/PHHH7p15Y1LKfk0bmtrK3Jycsrc9uAndQBi2rRpunVarVYMHjxYKJVK3f+lqvbMVFTbgQMHBADx5Zdf6q3/4Ycfylz/oPJ6Zgx9bVflvWD9+vW6dRqNRjg5OYnhw4fr1m3YsEHUq1dPHDhwQO/+Ja+jn376qcIayhszU3K8W7RoUeo1UVhYqDdOS4h/zoNKpRLz58/XW1fe+b5/PyGE6NSpk/Dz86uw1oSEhErfo6tzjjlmpmy8mqkOsLa2rvAKlJJP4tu3by/3U2hlVCoVxo0bV+X9x4wZAxsbG93tESNGwNnZGd9//71Bj19V33//PerXr4/p06
frrZ81axaEEEhMTNRb369fP73eD19fX9ja2uL333+v9HGcnJwwatQo3boGDRpg+vTpyM/PR3JyshGeTWmTJ0/Wu/3
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_test['V37'], bins=20)\n",
"plt.xlabel('V37')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the V37 atribute in the train set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "15f862dd",
"metadata": {},
"source": [
"The distribution of the target variable **V37** is normal, so i could try to fill the missing values with `mean()`."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "e1058d9a",
"metadata": {},
"outputs": [],
"source": [
"df_train['V37'].fillna(df_train['V37'].mean(), inplace=True)\n",
"df_test['V37'].fillna(df_test['V37'].mean(), inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "44ca71d0",
"metadata": {},
"source": [
"### 2.2 Modeling\n",
"Besides the baselines (majority classifier, random classifier), use at least three machine learning algorithms\n",
"to model the target class. Be ready to argue why did you select specific algorithms and how did you find\n",
"the best hyperparameters for them. Consider the following points when creating your models:\n",
"- Create your models using all features and subsets of them using various feature selection techniques.\n",
"- Certain models assume that data follows a particular distribution or may work better with other\n",
"types of variables (e.g., categorical instead of numeric). Explore whether you can come up with feature\n",
"transformations that are more appropriate for your models. Try to construct new features from existing\n",
"ones. Try to explain the results and performance of different models."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "42e83cd5",
"metadata": {},
"outputs": [],
"source": [
"# Spliting the data into features and labels\n",
"X_train = df_train.drop('Class', axis=1).reset_index(drop=True)\n",
"y_train = df_train['Class'].reset_index(drop=True)\n",
"X_test = df_test.drop('Class', axis=1).reset_index(drop=True)\n",
"y_test = df_test['Class'].reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"id": "9544c1ec",
"metadata": {},
"source": [
"#### Using majority classifier and random classifier"
]
},
{
"cell_type": "markdown",
"id": "a07a61d4",
"metadata": {},
"source": [
"##### Majority classifier"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "2f41cf22",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1 0.666667\n",
"2 0.333333\n",
"Name: Class, dtype: float64"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Get the reatio between thhe class we are trying to predict\n",
"y_train.value_counts(normalize=True)\n"
]
},
{
"cell_type": "markdown",
"id": "c3ddae4d",
"metadata": {},
"source": [
"If we were to predict using the majority classifier then we would always predict Ready non-biodegradable."
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "abef9e0c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.645933014354067"
]
},
"execution_count": 36,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_test[y_test == 1].shape[0] / y_test.shape[0]"
]
},
{
"cell_type": "markdown",
"id": "b136180f",
"metadata": {},
"source": [
"We would get the accuracy of 0.645933014354067 if we predicted all the values to be 1."
]
},
{
"cell_type": "markdown",
"id": "bff2a7d3",
"metadata": {},
"source": [
"#### Random classifier"
]
},
{
"cell_type": "markdown",
"id": "a9a5ac3b",
"metadata": {},
"source": [
"We have two classes to predict, so probability of predicting the right class is 50%."
]
},
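{
"cell_type": "markdown",
"id": "b7e1a0c2",
"metadata": {},
"source": [
"Both baselines can also be expressed with scikit-learn's `DummyClassifier`, which makes it easy to score them with the same metrics as the real models. The sketch below uses made-up labels with roughly the 2:1 class ratio seen above; `toy_X` and `toy_y` are hypothetical and are not the assignment data.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from sklearn.dummy import DummyClassifier\n",
"\n",
"# Hypothetical labels with a 2:1 ratio of class 1 to class 2\n",
"toy_y = np.array([1] * 200 + [2] * 100)\n",
"toy_X = np.zeros((len(toy_y), 1))  # dummy models ignore the features\n",
"\n",
"# Majority baseline: always predicts the most frequent class\n",
"majority = DummyClassifier(strategy='most_frequent').fit(toy_X, toy_y)\n",
"print(majority.score(toy_X, toy_y))  # 200 correct out of 300\n",
"\n",
"# Random baseline: guesses each class with equal probability\n",
"random_clf = DummyClassifier(strategy='uniform', random_state=0).fit(toy_X, toy_y)\n",
"print(random_clf.score(toy_X, toy_y))  # near 0.5, varies with the seed\n",
"```\n",
"\n",
"Fitting the same two estimators on `X_train`, `y_train` and scoring them on `X_test`, `y_test` would presumably reproduce the baseline numbers discussed here."
]
},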
{
"cell_type": "markdown",
"id": "5779375e",
"metadata": {},
"source": [
"#### Lets firstly write a simple function that will score all our generated models"
]
},
{
"cell_type": "code",
"execution_count": 37,
"id": "3d716f7b",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import precision_score\n",
"from sklearn.metrics import recall_score\n",
"from sklearn.metrics import f1_score\n",
"from sklearn.metrics import roc_auc_score\n",
"from sklearn.metrics import RocCurveDisplay\n",
"from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
"from sklearn.model_selection import KFold, RepeatedKFold\n",
"all_scores = []\n",
"\n",
"def score_the_model(model, model_name, random_seed, X_train, X_test, y_train, y_test, plot=False):\n",
" \"\"\"\n",
" Fits and evaluates given machine learning models.\n",
" models: dict of different Scikit-Learn machine learning models\n",
" X_train: training data (no labels)\n",
" x_test: testing data (no labels)\n",
" y_train: training labels\n",
" y_test: trest labels\n",
" \"\"\"\n",
"\n",
" # Set random seed\n",
" np.random.seed(random_seed)\n",
"\n",
" # Fit the model to the data\n",
" model.fit(X_train, y_train)\n",
"\n",
" model_score = model.score(X_test, y_test) # Mean accuracy of ``self.predict(X)`` wrt. `y`.\n",
" # Predict the labels\n",
" y_pred = model.predict(X_test)\n",
"\n",
" # Compute scores\n",
" f1 = f1_score(y_test, y_pred)\n",
" precision = precision_score(y_test, y_pred)\n",
" recall = recall_score(y_test, y_pred)\n",
" auc = roc_auc_score(y_test, y_pred)\n",
" # Plot scores\n",
" normal_scores = {\n",
" 'Accuracy': model_score,\n",
" 'F1': f1,\n",
" 'Precision': precision,\n",
" 'Recall': recall,\n",
" 'AUC': auc\n",
" }\n",
"\n",
" def normal_cv(model, X_train, y_train, random_seed):\n",
" # Perform normal cross-validation\n",
" X_train = X_train.copy()\n",
" y_train = y_train.copy()\n",
" kfold = KFold(n_splits=5, shuffle=True, random_state=random_seed)\n",
" scores = []\n",
"\n",
" for train_ix, test_ix in kfold.split(X_train):\n",
" # Split the data\n",
" X_train_cv, X_test_cv = X_train.iloc[train_ix], X_train.iloc[test_ix]\n",
" y_train_cv, y_test_cv = y_train.iloc[train_ix], y_train.iloc[test_ix]\n",
"\n",
" # Fit the model\n",
" model.fit(X_train_cv, y_train_cv)\n",
"\n",
" # Evaluate the model\n",
" y_pred = model.predict(X_test_cv)\n",
" scrs = {\n",
" 'Accuracy': model.score(X_test_cv, y_test_cv),\n",
" 'F1': f1_score(y_test_cv, y_pred),\n",
" 'Precision': precision_score(y_test_cv, y_pred),\n",
" 'Recall': recall_score(y_test_cv, y_pred),\n",
" 'AUC': roc_auc_score(y_test_cv, y_pred)\n",
" }\n",
" scores.append(scrs)\n",
" \n",
" # Plot all the scores\n",
" scores = pd.DataFrame(scores)\n",
" scores.plot(kind='bar', figsize=(10, 8))\n",
" # Plot also the values at the top of the bars\n",
" plt.title(f'Cross-validated scores for {model_name}')\n",
" plt.xlabel('Fold')\n",
" plt.ylabel('Score')\n",
" plt.legend(loc='lower right')\n",
" plt.show()\n",
" return scores\n",
" if type(X_train) == pd.core.frame.DataFrame:\n",
" scores_cv = normal_cv(model, X_train, y_train, random_seed)\n",
"\n",
" def repeated_cv(model, X_train, y_train, random_seed):\n",
" # Perform another cv with 10 folds\n",
" scores_k_fold = []\n",
" rkf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=random_seed)\n",
" for train_index, test_index in rkf.split(X_train):\n",
" model.fit(X_train.iloc[train_index], y_train.iloc[train_index])\n",
" y_pred = model.predict(X_train.iloc[test_index])\n",
" scrs = {\n",
" 'Accuracy': model.score(X_train.iloc[test_index], y_train.iloc[test_index]),\n",
" 'F1': f1_score(y_train.iloc[test_index], y_pred),\n",
" 'Precision': precision_score(y_train.iloc[test_index], y_pred),\n",
" 'Recall': recall_score(y_train.iloc[test_index], y_pred),\n",
" 'AUC': roc_auc_score(y_train.iloc[test_index], y_pred)\n",
" }\n",
"\n",
" scores_k_fold.append(scrs)\n",
" return scores_k_fold\n",
"\n",
" k_fold_scores_mean = {}\n",
" k_fold_scores_std = {}\n",
"\n",
" if type(X_train) == pd.core.frame.DataFrame:\n",
" scores_k_fold = repeated_cv(model, X_train, y_train, random_seed)\n",
" k_fold_scores_mean['acccuracy_mean'] = np.mean([score['Accuracy'] for score in scores_k_fold])\n",
" k_fold_scores_std['accuracy_std'] = np.std([score['Accuracy'] for score in scores_k_fold]) \n",
" k_fold_scores_mean['f1_mean'] = np.mean([score['F1'] for score in scores_k_fold])\n",
" k_fold_scores_std['f1_std'] = np.std([score['F1'] for score in scores_k_fold])\n",
" k_fold_scores_mean['precision_mean'] = np.mean([score['Precision'] for score in scores_k_fold])\n",
" k_fold_scores_std['precision_std'] = np.std([score['Precision'] for score in scores_k_fold])\n",
" k_fold_scores_mean['recall_mean'] = np.mean([score['Recall'] for score in scores_k_fold])\n",
" k_fold_scores_std['recall_std'] = np.std([score['Recall'] for score in scores_k_fold])\n",
" k_fold_scores_mean['auc_mean'] = np.mean([score['AUC'] for score in scores_k_fold])\n",
" k_fold_scores_std['auc_std'] = np.std([score['AUC'] for score in scores_k_fold])\n",
"\n",
" if plot:\n",
" # Plot scores\n",
" fig, ax = plt.subplots(nrows=3, ncols=2, figsize=(15,15))\n",
"\n",
" # Plot the bar chart of Normal cv scores in the first subplot \n",
" ax[0, 0].bar(normal_scores.keys(), normal_scores.values())\n",
" # Display values of the bars\n",
" for i, v in enumerate(normal_scores.values()):\n",
" ax[0, 0].text(i-0.1, v+0.01, str(round(v, 2)))\n",
" ax[0, 0].set_title(f'Default scoring of {model_name}')\n",
" ax[0, 0].set_ylabel('Score')\n",
"\n",
" # Plot the k-fold cv scores in the third subplot\n",
" ax[0, 1].bar(k_fold_scores_mean.keys(), k_fold_scores_mean.values())\n",
" # Display values of the bars\n",
" for i, v in enumerate(k_fold_scores_mean.values()):\n",
" ax[0, 1].text(i-0.1, v+0.01, str(round(v, 2)))\n",
" ax[0, 1].set_title(f'10-fold cross-validated scoring of {model_name} (mean)')\n",
"\n",
" # Plot the k-fold cv scores in the third subplot\n",
" ax[1, 0].bar(k_fold_scores_std.keys(), k_fold_scores_std.values())\n",
" ax[1, 0].set_title(f'10-fold cross-validated scoring of {model_name} (std)')\n",
"\n",
" \n",
" # Plot the ROC curve in the second subplot\n",
" f = RocCurveDisplay.from_estimator(model, X_test, y_test).plot(ax=ax[1, 1])\n",
" \n",
" # Plot the confusion matrix in the third subplot\n",
" cm = confusion_matrix(y_test, y_pred, labels=model.classes_)\n",
" cm_plt = ConfusionMatrixDisplay(cm, display_labels=model.classes_).plot(ax=ax[2, 0])\n",
"\n",
"\n",
" if hasattr(model, 'feature_importances_'):\n",
" # Plot feature importance in the fourth subplot\n",
" feature_dict = dict(zip(X_train.columns, model.feature_importances_))\n",
"\n",
" # Sort the features by their importance\n",
" feature_dict = {k: v for k, v in sorted(feature_dict.items(), key=lambda item: item[1], reverse=True)}\n",
"\n",
" # Plot the feature importance\n",
" ax[2, 1].bar(feature_dict.keys(), feature_dict.values())\n",
" ax[2, 1].set_title(f'Feature importance of {model_name}')\n",
" ax[2, 1].set_ylabel('Importance')\n",
" ax[2, 1].set_xticklabels(feature_dict.keys(), rotation=90)\n",
" else:\n",
" ax[2, 1].set_visible(False)\n",
" \n",
"\n",
" scores = []\n",
" scores.append(normal_scores)\n",
" normal_scores['model_name'] = model_name\n",
" all_scores.append(normal_scores)\n",
" return scores, model"
]
},
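{
"cell_type": "markdown",
"id": "b7e1a0c3",
"metadata": {},
"source": [
"One caveat about the AUC reported by `score_the_model`: it passes hard class predictions to `roc_auc_score`, so the value reflects a single operating point. When a model exposes `predict_proba`, passing the probability of the larger class label is the more usual choice. The sketch below compares both on synthetic data; the dataset, the `LogisticRegression` model and all variable names are assumptions made for illustration only.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.metrics import roc_auc_score\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Synthetic binary problem, relabelled to 1/2 like the assignment data\n",
"X, y = make_classification(n_samples=500, n_features=10, random_state=0)\n",
"y = y + 1\n",
"X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)\n",
"\n",
"clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)\n",
"\n",
"# AUC from hard labels: only one threshold is evaluated\n",
"auc_labels = roc_auc_score(y_te, clf.predict(X_te))\n",
"\n",
"# AUC from scores: column 1 holds the probability of clf.classes_[1] (label 2),\n",
"# which roc_auc_score treats as the positive class\n",
"auc_proba = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])\n",
"print(auc_labels, auc_proba)\n",
"```\n",
"\n",
"The two numbers usually differ, which is worth keeping in mind when comparing the AUC bars across models."
]
},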
{
"cell_type": "markdown",
"id": "d144deb1",
"metadata": {},
"source": [
"### Decision tree model"
]
},
{
"cell_type": "code",
"execution_count": 38,
"id": "63fe4438",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA04AAAK4CAYAAABDHK0xAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABUyUlEQVR4nO3dd3hU1b7G8XfSSUIIkEKAQGgKSAQNxVAEIRQRFFFBPEpRmhBFUBBUQBSNFeEIR0SkiHBoInAOXZoKUSAUqdI7CSDSEiEks+8fXuY4JrBIIRPC9/M881xn7bX2/s3O5ty8WXuvsVmWZQkAAAAAcE1uri4AAAAAAPI7ghMAAAAAGBCcAAAAAMCA4AQAAAAABgQnAAAAADAgOAEAAACAAcEJAAAAAAwITgAAAABgQHACAAAAAAOCEwDkA5MmTZLNZtPBgwcdbY0aNVKjRo2MY1etWiWbzaZVq1bdtPqyIyIiQp07d3Z1GfnKnj171KxZMxUpUkQ2m01z5851dUk3VXaugTfffFM2m+3mFAQAOUBwApDn9u3bpx49eqh8+fLy8fFRQECA6tWrp1GjRumPP/5wdXm3lYULF+rNN990dRm3jU6dOmnr1q165513NGXKFNWsWfOmHevgwYOy2WyOl6enp4KCglS3bl299tprOnz48E079q2mc+fOTufqWi/+EADc3jxcXQCA28uCBQv0xBNPyNvbWx07dlS1atWUmpqqH3/8Uf3799f27ds1btw4V5eZLyxduvSmH2PhwoUaM2YM4SkP/PHHH4qPj9frr7+u2NjYPDtuhw4d1LJlS9ntdv3+++9av369Ro4cqVGjRunLL7/Uk08+edOO/euvv8rNLWt/o33jjTc0cODAm1RR5nr06KGYmBjH+wMHDmjIkCHq3r27GjRo4GivUKFCntYFIH8hOAHIMwcOHNCTTz6psmXLasWKFQoLC3Ns6927t/bu3asFCxZcc7zdbldqaqp8fHzyolyX8/LycnUJt4W0tDTZ7fabfr5PnTolSQoMDMy1fSYnJ8vPz++6fe699149/fTTTm2HDh1Ss2bN1KlTJ1WpUkXVq1fPtZr+ytvbO8tjPDw85OGRt7+eREdHKzo62vF+w4YNGjJkiKKjozOcu7+6kfMPoODgVj0AeeaDDz7QxYsX9eWXXzqFpqsqVqyoPn36ON7bbDbFxsZq6tSpuuuuu+Tt7a3FixdLkjZt2qQHH3xQAQEB8vf3V5MmTfTTTz857e/KlSsaNmyYKlWqJB8fHxUvXlz169fXsmXLHH0SExPVpUsXlS5dWt7e3goLC9Mjjzzi9KzR382ePVs2m02rV6/OsO3zzz+XzWbTtm3bJEm//PKLOnfu7LgtsUSJEnr22Wf122+/Gc9XZs84HT16VG3atJGfn59CQkLUt29fXb58OcPYH374QU888YTKlCkjb29vhYeHq2/fvk63Qnbu3FljxoyRJKfbka6y2+0aOXKk7rrrLvn4+Cg0NFQ9evTQ77//7nQsy7I0fPhwlS5dWr6+vnrggQe0fft24+e7avr06YqKilLhwoUVEBCgyMhIjRo1yqnP2bNn1bdvX0VERMjb21ulS5dWx44ddfr0aUefkydP6rnnnlNoaKh8fHxUvXp1TZ482Wk/V29f++ijjzRy5EhVqFBB3t7e2rFjhyRp165devzxx1WsWDH5+PioZs2amj9/vtM+buS6+rs333xTZcuWlST1799fNptNERERju03cj1ffQ5u9erV6tWrl0JCQlS6dOkbPs9/VbZsWU2aNEmpqan64IMPnLadPXtWL730ksLDw+Xt7a2KFSvq/fffl91ud+pnt9s1atQoRUZGysfHR8HBwWrRooU2bNjg6PP3Z5xu5Nxl9oxTWlqa3n77bcfPKyIiQq+99lqGaz8iIkKtWrXSjz/+qNq1a8vHx0fly5fXV199la3z9Fem879o0SI1aNBAfn5+Kly4sB566KFM/x3cyD
UGIH9ixglAnvnPf/6j8uXLq27dujc8ZsWKFZo5c6ZiY2MVFBSkiIgIbd++XQ0aNFBAQIAGDBggT09Pff7552rUqJFWr16tOnXqSPrzF7C4uDh17dpVtWvX1vnz57VhwwZt3LhRTZs2lSQ99thj2r59u1544QVFRETo5MmTWrZsmQ4fPuz0i+1fPfTQQ/L399fMmTPVsGFDp20zZszQXXfdpWrVqkmSli1bpv3796tLly4qUaKE41bE7du366effsrSQ/B//PGHmjRposOHD+vFF19UyZIlNWXKFK1YsSJD31mzZiklJUXPP/+8ihcvrnXr1unTTz/V0aNHNWvWLEl/3p50/PhxLVu2TFOmTMmwjx49emjSpEnq0qWLXnzxRR04cECjR4/Wpk2btGbNGnl6ekqShgwZouHDh6tly5Zq2bKlNm7cqGbNmik1NdX4mZYtW6YOHTqoSZMmev/99yVJO3fu1Jo1axwh+uLFi2rQoIF27typZ599Vvfee69Onz6t+fPn6+jRowoKCtIff/yhRo0aae/evYqNjVW5cuU0a9Ysde7cWWfPnnUK5JI0ceJEXbp0Sd27d5e3t7eKFSum7du3q169eipVqpQGDhwoPz8/zZw5U23atNE333yjRx99VNKNXVd/17ZtWwUGBqpv376OW+f8/f0l6Yav56t69eql4OBgDRkyRMnJycZzfC3R0dGqUKGCU2hJSUlRw4YNdezYMfXo0UNlypTR2rVrNWjQIJ04cUIjR4509H3uuec0adIkPfjgg+ratavS0tL0ww8/6Keffrrms1vZOXeS1LVrV02ePFmPP/64Xn75Zf3888+Ki4vTzp079e233zr13bt3rx5//HE999xz6tSpkyZMmKDOnTsrKipKd911V7bP11WZnf8pU6aoU6dOat68ud5//32lpKTos88+U/369bVp0ybH/5bc6DUGIJ+yACAPnDt3zpJkPfLIIzc8RpLl5uZmbd++3am9TZs2lpeXl7Vv3z5H2/Hjx63ChQtb999/v6OtevXq1kMPPXTN/f/++++WJOvDDz+88Q/y/zp06GCFhIRYaWlpjrYTJ05Ybm5u1ltvveVoS0lJyTD23//+tyXJ+v777x1tEydOtCRZBw4ccLQ1bNjQatiwoeP9yJEjLUnWzJkzHW3JyclWxYoVLUnWypUrr3vcuLg4y2azWYcOHXK09e7d28rs/xX88MMPliRr6tSpTu2LFy92aj958qTl5eVlPfTQQ5bdbnf0e+211yxJVqdOnTLs+6/69OljBQQEOJ3HvxsyZIglyZozZ06GbVePefXcfP31145tqampVnR0tOXv72+dP3/esizLOnDggCXJCggIsE6ePOm0ryZNmliRkZHWpUuXnPZft25dq1KlSo4203V1LVeP/ffr7Uav56vXSP369a97vkzH+6tHHnnEkmSdO3fOsizLevvtty0/Pz9r9+7dTv0GDhxoubu7W4cPH7Ysy7JWrFhhSbJefPHFDPv863VQtmxZp2vgRs7d0KFDna7JzZs3W5Ksrl27OvV75ZVXLEnWihUrnI73939bJ0+etLy9va2XX375usf9q/Xr11uSrIkTJzrarnX+L1y4YAUGBlrdunVz2kdiYqJVpEgRp/YbvcYA5E/cqgcgT5w/f16SVLhw4SyNa9iwoapWrep4n56erqVLl6pNmzYqX768oz0sLExPPfWUfvzxR8exAgMDtX37du3ZsyfTfRcqVEheXl5atWpVhtvPTNq3b6+TJ086LQE+e/Zs2e12tW/f3ukYV126dEmnT5/WfffdJ0nauHFjlo65cOFChYWF6fHHH3e0+fr6qnv37hn6/vW4ycnJOn36tOrWrSvLsrRp0ybjsWbNmqUiRYqoadOmOn36tOMVFRUlf39/rVy5UpL03XffKTU1VS+88ILT7NlLL710Q58pMDBQycnJ173N7ZtvvlH16tUz/Wv81WMuXLhQJUqUUIcOHRzbPD099eKLL+rixY
sZbqt87LHHFBwc7Hh/5swZrVixQu3atdOFCxccn/e3335T8+bNtWfPHh07dsxR8/Wuq6zIyvV8Vbdu3eTu7p7jY0t
"text/plain": [
"<Figure size 1000x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_189833/343368697.py:160: UserWarning: FixedFormatter should only be used together with FixedLocator\n",
" ax[2, 1].set_xticklabels(feature_dict.keys(), rotation=90)\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"[{'Accuracy': 0.8038277511961722, 'F1': 0.844106463878327, 'Precision': 0.8671875, 'Recall': 0.8222222222222222, 'AUC': 0.7962462462462462, 'model_name': 'Decision Tree'}]\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABNEAAATYCAYAAAAxo1G2AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdd3xO9///8WcSksgWROxoaO3RaNQqbUOMKh0oSqJGi9SIDlrEaKVDNWpT66NVs1VFbdoara1W7VFqb0FCcn5/+OX65pIrubKviMf9drtuXO/zPue8z7jOeeV1znkfO8MwDAEAAAAAAABIlr2tGwAAAAAAAADkdCTRAAAAAAAAACtIogEAAAAAAABWkEQDAAAAAAAArCCJBgAAAAAAAFhBEg0AAAAAAACwgiQaAAAAAAAAYAVJNAAAAAAAAMAKkmgAAAAAAACAFSTRgGxy+PBhNWrUSJ6enrKzs9OiRYuyZD4NGjRQgwYNsmTa2W3IkCGys7OzdTOsWr58uapVqyZnZ2fZ2dnp2rVrtm6SRSdOnJCdnZ1mzJiRpvFy0z4FAI+DrVu3qnbt2nJ1dZWdnZ127dqV6nFnzJghOzs7nThxwmpdPz8/hYaGprudjytL6zi159r169fLzs5O69evz7L2pcejti+kNybKbufPn9frr7+uAgUKyM7OTlFRUbZuUrLSEy+m5XjzOLp165Z8fHz0/fff27opmWLixIkqWbKkYmJibN2UDCGJBvx/CQfxhI+zs7OKFi2q4OBgffPNN7p582aGph8SEqI9e/bo008/1axZs1SjRo1MannK/vvvPw0ZMiRNATRS7/Lly2rdurXy5cuncePGadasWXJ1dbVYN6v3sdzEz8/PbF0l98npwS8ASA/+EIqIiFDjxo3l7e1t9fh14MABNW7cWG5ubvL29laHDh108eLFVM3r3r17atWqla5cuaKvv/5as2bNUqlSpTJpSfAoW7ZsmYYMGWLrZiAN+vbtqxUrVmjAgAGaNWuWGjdunGzdxPFRnjx55O3trYCAAPXu3Vv79+/PxlbnbAkX6a19csIF5NGjR8vd3V1vvPGGrZuSKUJDQxUbG6tJkybZuikZksfWDQBymmHDhql06dK6d++ezp07p/Xr16tPnz4aNWqUFi9erCpVqqR5mnfu3NHmzZv18ccfKywsLAtanbz//vtPQ4cOlZ+fn6pVq5at886ogQMHqn///rZuRoq2bt2qmzdvavjw4QoKCkrVOFmxj6VGqVKldOfOHeXNmzdN461cuTJL2pOSqKgo3bp1y/R92bJl+uGHH/T111+rYMGCpvLatWtne9sAIK0uXbqkYcOGqWTJkqpatWqKdxGdPn1azz33nDw9PTVixAjdunVLI0eO1J49e7RlyxY5OjqmOK+jR4/q5MmTmjJlirp06ZLJS4Kskh3n2mXLlmncuHEk0pT+mCi7rV27Vi1atNB7772XqvoNGzZUx44dZRiGrl+/rt27d2vmzJkaP368Pv/8c4WHh2dZW9OzD3fo0EFvvPGGnJycsqBFlr366qsqU6aM6futW7fUvXt3vfLKK3r11VdN5YULF862Nlly7949jR49Wn379pWDg4NN25JZnJ2dFRISolGjRundd999JJ44soQkGvCQJk2amN0lNmDAAK1du1YvvfSSXn75ZR04cED58uVL0zQTrh57eXllZlNzrejoaLm6uipPnjzKkydnH6YuXLggKW3bNiv2sdRIuPstraz9wZYVWrZsafb93Llz+uGHH9SyZUv5+fklO17CvgMAOUmRIkV09uxZ+fr6atu2bXrmmWeSrTtixAhFR0dr+/btKlmypCQpMDBQDRs21IwZM9StW7cU55We81JOFB8fr9jY2HSdtx5FtjjXPo7u37+v+Ph4OTo6PhL71oULF9L0W37yySf15ptvmpV99tlnat68ufr166dy5cqpadOmmdzKB9KzDzs4OGR7gqhKlSpmF6wvXb
qk7t27q0qVKknWXWJ3796Vo6Oj7O2z52G+JUuW6OLFi2rdunW2zC+7tG7dWl988YXWrVunF154wdbNSRce5wRS4YUXXtCgQYN08uRJfffdd2bD/vnnH73++uvy9vaWs7OzatSoocWLF5uGDxkyxPQYxfvvvy87OztTEuDkyZPq0aOHnnrqKeXLl08FChRQq1atkvQLkFzfYNb6EVi/fr0pUO/UqVOqHoG7efOm+vTpIz8/Pzk5OcnHx0cNGzbUjh07zOr99ddfatq0qfLnzy9XV1dVqVJFo0ePNquzdu1a1atXT66urvLy8lKLFi104MABi8u2f/9+tWvXTvnz51fdunWTXW47OzuFhYVp0aJFqlSpkpycnFSxYkUtX77c4vLXqFFDzs7O8vf316RJk9LUz9r8+fMVEBCgfPnyqWDBgnrzzTd15swZ0/AGDRooJCREkvTMM8/Izs4u3f2BZGQfS3Dt2jX17dvXtO2KFy+ujh076tKlS5Is9/9x7tw5derUScWLF5eTk5OKFCmiFi1aWO2n5cKFC+rcubMKFy4sZ2dnVa1aVTNnzjSrkzC/kSNHavLkyfL395eTk5OeeeYZbd26NV3rKbHQ0FC5ubnp6NGjatq0qdzd3dW+fXtJD/74ioqKUsWKFeXs7KzChQvr7bff1tWrV5NM59dffzXtp+7u7mrWrJn27duX4fYBQAInJyf5+vqmqu7ChQv10ksvmRJokhQUFKQnn3xS8+bNS3Hc0NBQ1a9fX5LUqlWrJI8kpea8bIlhGPrkk09UvHhxubi46Pnnn0/TcTI+Pl6jR49W5cqV5ezsrEKFCqlx48batm2bqU7C+f37779XxYoV5eTkZDq379y5U02aNJGHh4fc3Nz04osv6s8//zSbx7179zR06FCVLVtWzs7OKlCggOrWratVq1aZ6qTmnPewBQsWyM7OTr/99luSYZMmTZKdnZ327t0rSfr7778VGhqqJ554Qs7OzvL19dVbb72ly5cvW11Hls61p0+fVsuWLeXq6iofHx/17dvXYl9Cf/zxh1q1aqWSJUvKyclJJUqUUN++fXXnzh1TndDQUI0bN06S+WN/CVJ73szovjBnzhwFBATI3d1dHh4eqly5cpL40Vo8I6U9DomKijLFIfv377cYEyXEFWfOnFHLli3l5uamQoUK6b333lNcXJzZtC9fvqwOHTrIw8NDXl5eCgkJ0e7du1Pd1cSxY8fUqlUreXt7y8XFRc8++6yWLl1qGp4Q4xuGoXHjxiXZXmlRoEABzZkzR3ny5NGnn35qNiwmJkYREREqU6aMad/54IMPLO5n3333nQIDA+Xi4qL8+fPrueeeM7v7zNI+PGbMGFWsWNE0To0aNTR79uwky/nwb3D8+PGm40DRokXVs2fPJH0ON2jQQJUqVdL+/fv1/PPPy8XFRcWKFdMXX3yRrvWUWELfg3PmzNHAgQNVrFgxubi46MaNG5Ie/B3UuHFjeXp6ysXFRfXr19fGjRuTTOfMmTN66623VLhwYdPfLNOmTUtVGxYtWiQ/Pz/5+/ublSfsp6dOndJLL70kNzc3FStWzPT73rNnj1544QW5urqqVKlSZus7wbVr19SnTx+VKFFCTk5OKlOmjD7//HPFx8eb1Rs5cqRq166tAgUKKF++fAoICNCCBQuSTC8tf58FBATI29tbP//8c6rWQ06Us2/xAHKQDh066KOPPtLKlSvVtWtXSdK+fftUp04dFStWTP3795erq6vmzZunli1bauHChabbgr28vNS3b1+1bdtWTZs2lZubm6QHjwJu2rRJb7zxhooXL64TJ05owoQJatCggfbv3y8XF5cMtbl8+fIaNmyYBg8erG7duqlevXqSUn4E7p133tGCBQsUFhamChUq6PLly9qwYYMOHDigp59+WpK0atUqvfTSSypSpIh69+4tX19fHThwQEuWLFHv3r0lSatXr1aTJk30xBNPaMiQIbpz547GjBmjOnXqaMeOHU
nuJmrVqpXKli2rESNGyDCMFJdrw4YN+vHHH9WjRw+5u7vrm2++0WuvvaZTp06pQIECkh4E3I0bN1aRIkU0dOhQxcX
"text/plain": [
"<Figure size 1500x1500 with 7 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjcAAAGwCAYAAABVdURTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABnwElEQVR4nO3deVhUZf8G8HtYZthBRHYUd0XBNUksTUVxya0yyg2tbFMzfa00d8ulLJfKMjU1+/m+llZmibjgklvuGpuouKAIKCIM+zLz/P5ARkdA5+AMA8P9uS6ummfOOfOd4+DcPss5MiGEABEREZGJMDN2AURERET6xHBDREREJoXhhoiIiEwKww0RERGZFIYbIiIiMikMN0RERGRSGG6IiIjIpFgYu4CqplarcfPmTdjb20Mmkxm7HCIiItKBEAJZWVnw9PSEmdmj+2ZqXbi5efMmfHx8jF0GERERVcL169fh7e39yG1qXbixt7cHUHJyHBwcjFwNERER6UKpVMLHx0fzPf4otS7clA5FOTg4MNwQERHVMLpMKeGEYiIiIjIpDDdERERkUhhuiIiIyKQw3BAREZFJYbghIiIik8JwQ0RERCaF4YaIiIhMCsMNERERmRSGGyIiIjIpDDdERERkUowabv7++28MGDAAnp6ekMlk2Lp162P32b9/P9q3bw+FQoEmTZpg/fr1Bq+TiIiIag6jhpucnBy0adMGK1as0Gn7K1euoH///ujevTvOnj2L999/H2+88QZ27txp4EqJiIiopjDqjTP79u2Lvn376rz9ypUr0bBhQ3z55ZcAgJYtW+LQoUNYunQpQkJCDFUmERER6eiWMh/ZBcVoVM/OaDXUqLuCHz16FMHBwVptISEheP/99yvcp6CgAAUFBZrHSqXSUOURERHVKqnKfETdyERUUiaik0r+eyurAM81r4f1YzoZra4aFW5SUlLg5uam1ebm5galUom8vDxYW1uX2WfhwoWYO3duVZVIRERkklKV+fj3oSBzO6ugzHZmMqCgSG2ECu+rUeGmMqZNm4bJkydrHiuVSvj4+BixIiIioupLCIFUZQGiknQLMk1c7dDayxH+Xo4I8HZESw8H2MiNGy9qVLhxd3dHamqqVltqaiocHBzK7bUBAIVCAYVCURXlERER1ShCCKTcG1oqDTFRSUqkZZcfZJq62t8LMg7wryZBpjzVr6JH6Ny5M8LDw7Xadu/ejc6dOxupIiIioprh4SDz771embTswjLbmsmAZm72mh6Z1l6O8PNwgLXc3AiVS2fUcJOdnY1Lly5pHl+5cgVnz56Fs7Mz6tevj2nTpiEpKQkbNmwAALz99tv45ptv8OGHH+K1117D3r178csvv2D79u3GegtERETVjhACyZn5WsNKFQUZczMZmj4wtFTTgkx5jBpuTp48ie7du2sel86NCQsLw/r165GcnIzExETN8w0bNsT27dsxadIkLF++HN7e3lizZg2XgRMRUa31cJD5917PzJ2cioOMv5cj/L3vBxkry5obZMojE0IIYxdRlZRKJRwdHZGZmQkHBwdjl0NERKQzIQRuZmrPkdE1yPh7lcyRqalBRsr3d42ac0NERFRb3A8yGZqJvtFJmUivIMg0c7Mvmeh7b2ipJgeZJ8VwQ0REZGRCCCRl5GmtWKooyFiYydD0gSDj7+2EFu72tTbIlIfhhoiIqAoJIXDj7oNBpmRo6W5uUZltLTQ9Mo5ofW9oiUHm8RhuiIiIDERqkGnubq8ZVvL3ckRzBplKYbghIiLSg9Ig8/CVfTPKCTKW5g/0yDDI6B3DDRERkUQPB5moG5mIvllxkCmvR0ZhwSBjKAw3REREjyCEwPV07R4ZXYNMgJcTmrnbMchUMYYbIiKie0qDzL9JGfeDTJISmXnlB5kW7g6a3hh/L0cGmWqC4YaIiGolIQQS03O1e2QqCDJyczO08LDXDjJu9pBbmBmhcnochhsiIjJ5Qghcu5Nb5l
5LyvziMtsyyNR8DDdERGRSHg4y/96b7JtVQZBp6aF992sGmZqP4YaIiGostVrgWvoDPTKPCjIWZmjpziBTGzDcEBFRjfBwkPn3RgZibiorDjIeDlr3WmrmZg9LcwaZ2oDhhoiIqh21WuDqnRytOTIxSUpkFTw+yPh7OaGpmx2DTC3GcENEREb1YJCJulESZGJvlh9kFJogc39oiUGGHsZwQ0REVUatFrhyJ0czPyYqKRMxN5XIriDI+Hk6aF3Zt4krgww9HsMNEREZhFotcDktR+umkbESgkxTVztYMMhQJTDcEBHREysTZG5kIuZmJnIKVWW2tbI0g5/HA0HG2xFN6jHIkP4w3BARkSQqtcCVtOx7IUaJ6CTdg0yAtxMa17NlkCGDYrghIqIKPRxkopIyEHtTWW6QsbY0LzO0xCBDxsBwQ0REAEqCzOXb2Vr3Woq5qUSuDkEmwNsRjevZwdxMZoTKibQx3BAR1UKlQebfG/eDTGxyxUGmlecDd79mkKFqjuGGiMjEqdQCCbezNUuvdQky/t73bxrZiEGGahiGGyIiE1KsUiPhtvaVfWNvKpFXVDbI2Mgf6pFhkCETwXBDRFRDlRdkYm5mIr9IXWZbW7k5WnmWLr0umSvT0IVBhkwTww0RUQ1QrFLj0r2hJU2PTLJSxyDjhIYutgwyVGsw3BARVTOSg8wDw0qtvRzRyMUWZgwyVIsx3BARGVGxSo2Lt7K1hpbiKggydgoLzfLrAO+SINOwLoMM0cMYboiIqogmyNy4f6+luGQlCorLDzKt7gUZfwYZIkkYboiIDKBIpcbF1Gytm0Y+LsiU9sb4eznCl0GGqNIYboiInlCRSo0LqVkPBBkl4pKVKCwnyNgrLNDKS/sWBQwyRPrFcENEJIGkIGNlgdae94eV/L0c0cDZhkGGyMAYboiIKlCkUiM+5X6QiU7KRFxKlk5BJsDLEfUZZIiMguGGiAhAYbF2j0x0UibikrNQqCo/yDy49NqfQYaoWmG4IaJapzTIPHj36/MVBBkHK4t7F8O7fy2Z+s42kMkYZIiqK4YbIjJppUHmwbtfx6dUHGQenB/DIENUMzHcEJHJKChW4UJKtlaPTEVBxtHaUmtYyd/LET7O1gwyRCaA4YaIaqSCYhXiU7K0ruwbn5KFIpUosy2DDFHtwnBDRNWelCDjZFM2yHjXYZAhqk0qFW6KioqQkpKC3Nxc1KtXD87Ozvqui4hqqYJiFc4naweZC6kMMkSkO53DTVZWFv7v//4PmzZtwvHjx1FYWAghBGQyGby9vdG7d2+8+eabeOqppwxZLxGZkPyi8ntkitVlg0wdG0utENOaQYaIKqBTuFmyZAnmz5+Pxo0bY8CAAfj444/h6ekJa2trpKenIzo6GgcPHkTv3r0RGBiIr7/+Gk2bNjV07URUg+QXqXC+NMjcuN8j86ggE+B9P8h4OTHIEJFuZEKIsn+zPOTVV1/FjBkz0KpVq0duV1BQgHXr1kEul+O1117TW5H6pFQq4ejoiMzMTDg4OBi7HCKTpAkyNzI0tyi4WEGQcbaV3+uRcWCQIaIKSfn+1incmBKGGyL9yi9SIS5ZqXWvpYqCTF1NkHHUXBjP09GKQYaIHkvK9zdXSxGRzvKLVIgtDTL3hpYu3sqGSocgE+DtCA8GGSKqApLCzblz5/Dnn3/C2dkZL7/8MlxcXDTPKZVKvP/++1i7dq3eiySiqiclyLjYPdQj48UgQ0TGo/Ow1K5duzBgwAA0bdoUWVlZyMnJwebNm9G9e3cAQGpqKjw9PaFSqQxa8JPisBRRWXmFDwSZeyuXdA0yAd6OcHdgkCEiwzLIsNScOXMwZcoUzJ8/H0IILF68GAMHDsTmzZvRp0+fJy6aiKqGtCCj0Jro688gQ0Q1gM7hJiYmBj/99BMAQCaT4cMPP4S3tzdeeuklbNq0ide3IaqGSoJM6bCS8l6QyUI5OeZ+kPF20lxLxs1BwSBDRDWOzuFGoVAgIyNDq2
3YsGEwMzNDaGgovvzyS33XRkQSlAaZB+9+felWdrlBpp69osyVfRlkiMhU6Bxu2rZti3379qFDhw5a7a+88gqEEAg
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Score the model with default parameters\n",
"score_dec_tree, model_dec_tree = score_the_model(\n",
" model=DecisionTreeClassifier(),\n",
" model_name='Decision Tree',\n",
" random_seed=42,\n",
" X_train=X_train,\n",
" X_test=X_test,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=True\n",
")\n",
"\n",
"print(score_dec_tree)\n"
]
},
{
"cell_type": "markdown",
"id": "a72e54f6",
"metadata": {},
"source": [
"Now lets plot the decision tree"
]
},
{
"cell_type": "code",
"execution_count": 39,
"id": "c4fe47bd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Text(0.6062668766876688, 0.9705882352941176, 'V36 <= 3.678\\ngini = 0.443\\nsamples = 762\\nvalue = [510, 252]\\nclass = Ready biodegradable'),\n",
" Text(0.40335283528352833, 0.9117647058823529, 'V1 <= 4.984\\ngini = 0.485\\nsamples = 325\\nvalue = [134, 191]\\nclass = Reday non-biodegradable'),\n",
" Text(0.26755175517551755, 0.8529411764705882, 'V34 <= 1.5\\ngini = 0.451\\nsamples = 279\\nvalue = [96, 183]\\nclass = Reday non-biodegradable'),\n",
" Text(0.16156615661566157, 0.7941176470588235, 'V14 <= 0.401\\ngini = 0.347\\nsamples = 206\\nvalue = [46, 160]\\nclass = Reday non-biodegradable'),\n",
" Text(0.13276327632763277, 0.7352941176470589, 'V37 <= 3.15\\ngini = 0.231\\nsamples = 15\\nvalue = [13, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.11836183618361837, 0.6764705882352942, 'V28 <= 0.121\\ngini = 0.133\\nsamples = 14\\nvalue = [13, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.10396039603960396, 0.6176470588235294, 'gini = 0.0\\nsamples = 13\\nvalue = [13, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.13276327632763277, 0.6176470588235294, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.14716471647164717, 0.6764705882352942, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.19036903690369036, 0.7352941176470589, 'V38 <= 1.5\\ngini = 0.286\\nsamples = 191\\nvalue = [33, 158]\\nclass = Reday non-biodegradable'),\n",
" Text(0.17596759675967596, 0.6764705882352942, 'V3 <= 1.5\\ngini = 0.262\\nsamples = 187\\nvalue = [29, 158]\\nclass = Reday non-biodegradable'),\n",
" Text(0.16156615661566157, 0.6176470588235294, 'V37 <= 1.986\\ngini = 0.236\\nsamples = 183\\nvalue = [25, 158]\\nclass = Reday non-biodegradable'),\n",
" Text(0.043204320432043204, 0.5588235294117647, 'V1 <= 4.776\\ngini = 0.461\\nsamples = 36\\nvalue = [13, 23]\\nclass = Reday non-biodegradable'),\n",
" Text(0.0288028802880288, 0.5, 'V9 <= 0.5\\ngini = 0.404\\nsamples = 32\\nvalue = [9, 23]\\nclass = Reday non-biodegradable'),\n",
" Text(0.0144014401440144, 0.4411764705882353, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.043204320432043204, 0.4411764705882353, 'V2 <= 2.733\\ngini = 0.358\\nsamples = 30\\nvalue = [7, 23]\\nclass = Reday non-biodegradable'),\n",
" Text(0.0288028802880288, 0.38235294117647056, 'gini = 0.0\\nsamples = 10\\nvalue = [0, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.0576057605760576, 0.38235294117647056, 'V27 <= 1.952\\ngini = 0.455\\nsamples = 20\\nvalue = [7, 13]\\nclass = Reday non-biodegradable'),\n",
" Text(0.043204320432043204, 0.3235294117647059, 'gini = 0.0\\nsamples = 3\\nvalue = [3, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.07200720072007201, 0.3235294117647059, 'V2 <= 2.884\\ngini = 0.36\\nsamples = 17\\nvalue = [4, 13]\\nclass = Reday non-biodegradable'),\n",
" Text(0.043204320432043204, 0.2647058823529412, 'V36 <= 3.163\\ngini = 0.444\\nsamples = 3\\nvalue = [2, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.0288028802880288, 0.20588235294117646, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.0576057605760576, 0.20588235294117646, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.10081008100810081, 0.2647058823529412, 'V18 <= 1.142\\ngini = 0.245\\nsamples = 14\\nvalue = [2, 12]\\nclass = Reday non-biodegradable'),\n",
" Text(0.08640864086408641, 0.20588235294117646, 'V39 <= 7.676\\ngini = 0.142\\nsamples = 13\\nvalue = [1, 12]\\nclass = Reday non-biodegradable'),\n",
" Text(0.07200720072007201, 0.14705882352941177, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.10081008100810081, 0.14705882352941177, 'gini = 0.0\\nsamples = 12\\nvalue = [0, 12]\\nclass = Reday non-biodegradable'),\n",
" Text(0.1152115211521152, 0.20588235294117646, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.0576057605760576, 0.5, 'gini = 0.0\\nsamples = 4\\nvalue = [4, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.27992799279927993, 0.5588235294117647, 'V17 <= 0.968\\ngini = 0.15\\nsamples = 147\\nvalue = [12, 135]\\nclass = Reday non-biodegradable'),\n",
" Text(0.23402340234023403, 0.5, 'V1 <= 4.436\\ngini = 0.444\\nsamples = 3\\nvalue = [2, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.21962196219621963, 0.4411764705882353, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.24842484248424843, 0.4411764705882353, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.3258325832583258, 0.5, 'V27 <= 2.223\\ngini = 0.129\\nsamples = 144\\nvalue = [10, 134]\\nclass = Reday non-biodegradable'),\n",
" Text(0.27722772277227725, 0.4411764705882353, 'V18 <= 1.161\\ngini = 0.09\\nsamples = 127\\nvalue = [6, 121]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2232223222322232, 0.38235294117647056, 'V25 <= 0.5\\ngini = 0.051\\nsamples = 114\\nvalue = [3, 111]\\nclass = Reday non-biodegradable'),\n",
" Text(0.18721872187218722, 0.3235294117647059, 'V18 <= 1.109\\ngini = 0.035\\nsamples = 112\\nvalue = [2, 110]\\nclass = Reday non-biodegradable'),\n",
" Text(0.15841584158415842, 0.2647058823529412, 'V15 <= 9.556\\ngini = 0.375\\nsamples = 4\\nvalue = [1, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.14401440144014402, 0.20588235294117646, 'gini = 0.0\\nsamples = 3\\nvalue = [0, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.17281728172817282, 0.20588235294117646, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.21602160216021601, 0.2647058823529412, 'V31 <= 4.092\\ngini = 0.018\\nsamples = 108\\nvalue = [1, 107]\\nclass = Reday non-biodegradable'),\n",
" Text(0.20162016201620162, 0.20588235294117646, 'gini = 0.0\\nsamples = 88\\nvalue = [0, 88]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2304230423042304, 0.20588235294117646, 'V13 <= 3.196\\ngini = 0.095\\nsamples = 20\\nvalue = [1, 19]\\nclass = Reday non-biodegradable'),\n",
" Text(0.21602160216021601, 0.14705882352941177, 'V37 <= 2.303\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.20162016201620162, 0.08823529411764706, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.2304230423042304, 0.08823529411764706, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2448244824482448, 0.14705882352941177, 'gini = 0.0\\nsamples = 18\\nvalue = [0, 18]\\nclass = Reday non-biodegradable'),\n",
" Text(0.25922592259225924, 0.3235294117647059, 'V13 <= 3.089\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.2448244824482448, 0.2647058823529412, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.27362736273627364, 0.2647058823529412, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.33123312331233123, 0.38235294117647056, 'V18 <= 1.165\\ngini = 0.355\\nsamples = 13\\nvalue = [3, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.31683168316831684, 0.3235294117647059, 'V27 <= 1.875\\ngini = 0.5\\nsamples = 6\\nvalue = [3, 3]\\nclass = Ready biodegradable'),\n",
" Text(0.30243024302430244, 0.2647058823529412, 'V2 <= 1.915\\ngini = 0.375\\nsamples = 4\\nvalue = [3, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.28802880288028804, 0.20588235294117646, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.31683168316831684, 0.20588235294117646, 'gini = 0.0\\nsamples = 3\\nvalue = [3, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.33123312331233123, 0.2647058823529412, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.34563456345634563, 0.3235294117647059, 'gini = 0.0\\nsamples = 7\\nvalue = [0, 7]\\nclass = Reday non-biodegradable'),\n",
" Text(0.37443744374437443, 0.4411764705882353, 'V17 <= 0.983\\ngini = 0.36\\nsamples = 17\\nvalue = [4, 13]\\nclass = Reday non-biodegradable'),\n",
" Text(0.36003600360036003, 0.38235294117647056, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.38883888388838883, 0.38235294117647056, 'V36 <= 3.43\\ngini = 0.231\\nsamples = 15\\nvalue = [2, 13]\\nclass = Reday non-biodegradable'),\n",
" Text(0.37443744374437443, 0.3235294117647059, 'V28 <= -0.002\\ngini = 0.444\\nsamples = 3\\nvalue = [2, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.36003600360036003, 0.2647058823529412, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.38883888388838883, 0.2647058823529412, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.40324032403240323, 0.3235294117647059, 'gini = 0.0\\nsamples = 12\\nvalue = [0, 12]\\nclass = Reday non-biodegradable'),\n",
" Text(0.19036903690369036, 0.6176470588235294, 'gini = 0.0\\nsamples = 4\\nvalue = [4, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.20477047704770476, 0.6764705882352942, 'gini = 0.0\\nsamples = 4\\nvalue = [4, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.37353735373537356, 0.7941176470588235, 'V16 <= 0.5\\ngini = 0.432\\nsamples = 73\\nvalue = [50, 23]\\nclass = Ready biodegradable'),\n",
" Text(0.33753375337533753, 0.7352941176470589, 'V1 <= 4.426\\ngini = 0.315\\nsamples = 51\\nvalue = [41, 10]\\nclass = Ready biodegradable'),\n",
" Text(0.30873087308730873, 0.6764705882352942, 'V38 <= 1.5\\ngini = 0.494\\nsamples = 18\\nvalue = [10, 8]\\nclass = Ready biodegradable'),\n",
" Text(0.29432943294329433, 0.6176470588235294, 'gini = 0.0\\nsamples = 7\\nvalue = [0, 7]\\nclass = Reday non-biodegradable'),\n",
" Text(0.32313231323132313, 0.6176470588235294, 'V36 <= 2.928\\ngini = 0.165\\nsamples = 11\\nvalue = [10, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.30873087308730873, 0.5588235294117647, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.33753375337533753, 0.5588235294117647, 'gini = 0.0\\nsamples = 10\\nvalue = [10, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.36633663366336633, 0.6764705882352942, 'V2 <= 1.778\\ngini = 0.114\\nsamples = 33\\nvalue = [31, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.35193519351935193, 0.6176470588235294, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.38073807380738073, 0.6176470588235294, 'V11 <= 7.5\\ngini = 0.061\\nsamples = 32\\nvalue = [31, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.36633663366336633, 0.5588235294117647, 'gini = 0.0\\nsamples = 31\\nvalue = [31, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.3951395139513951, 0.5588235294117647, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.4095409540954095, 0.7352941176470589, 'V27 <= 2.089\\ngini = 0.483\\nsamples = 22\\nvalue = [9, 13]\\nclass = Reday non-biodegradable'),\n",
" Text(0.3951395139513951, 0.6764705882352942, 'gini = 0.0\\nsamples = 3\\nvalue = [3, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.4239423942394239, 0.6764705882352942, 'V36 <= 3.6\\ngini = 0.432\\nsamples = 19\\nvalue = [6, 13]\\nclass = Reday non-biodegradable'),\n",
" Text(0.4095409540954095, 0.6176470588235294, 'gini = 0.0\\nsamples = 10\\nvalue = [0, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.4383438343834383, 0.6176470588235294, 'V13 <= 3.479\\ngini = 0.444\\nsamples = 9\\nvalue = [6, 3]\\nclass = Ready biodegradable'),\n",
" Text(0.4239423942394239, 0.5588235294117647, 'V28 <= -0.017\\ngini = 0.245\\nsamples = 7\\nvalue = [6, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.4095409540954095, 0.5, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.4383438343834383, 0.5, 'gini = 0.0\\nsamples = 6\\nvalue = [6, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.4527452745274527, 0.5588235294117647, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5391539153915391, 0.8529411764705882, 'V12 <= 0.013\\ngini = 0.287\\nsamples = 46\\nvalue = [38, 8]\\nclass = Ready biodegradable'),\n",
" Text(0.5103510351035103, 0.7941176470588235, 'V17 <= 1.059\\ngini = 0.193\\nsamples = 37\\nvalue = [33, 4]\\nclass = Ready biodegradable'),\n",
" Text(0.495949594959496, 0.7352941176470589, 'V30 <= 10.249\\ngini = 0.153\\nsamples = 36\\nvalue = [33, 3]\\nclass = Ready biodegradable'),\n",
" Text(0.4815481548154815, 0.6764705882352942, 'gini = 0.0\\nsamples = 23\\nvalue = [23, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.5103510351035103, 0.6764705882352942, 'V10 <= 3.5\\ngini = 0.355\\nsamples = 13\\nvalue = [10, 3]\\nclass = Ready biodegradable'),\n",
" Text(0.495949594959496, 0.6176470588235294, 'V18 <= 1.143\\ngini = 0.375\\nsamples = 4\\nvalue = [1, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.4815481548154815, 0.5588235294117647, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.5103510351035103, 0.5588235294117647, 'gini = 0.0\\nsamples = 3\\nvalue = [0, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5247524752475248, 0.6176470588235294, 'gini = 0.0\\nsamples = 9\\nvalue = [9, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.5247524752475248, 0.7352941176470589, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5679567956795679, 0.7941176470588235, 'V30 <= 10.694\\ngini = 0.494\\nsamples = 9\\nvalue = [5, 4]\\nclass = Ready biodegradable'),\n",
" Text(0.5535553555355536, 0.7352941176470589, 'V31 <= 0.902\\ngini = 0.32\\nsamples = 5\\nvalue = [1, 4]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5391539153915391, 0.6764705882352942, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.5679567956795679, 0.6764705882352942, 'gini = 0.0\\nsamples = 4\\nvalue = [0, 4]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5823582358235824, 0.7352941176470589, 'gini = 0.0\\nsamples = 4\\nvalue = [4, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8091809180918091, 0.9117647058823529, 'V12 <= -0.595\\ngini = 0.24\\nsamples = 437\\nvalue = [376, 61]\\nclass = Ready biodegradable'),\n",
" Text(0.7155715571557155, 0.8529411764705882, 'V27 <= 2.363\\ngini = 0.497\\nsamples = 72\\nvalue = [39, 33]\\nclass = Ready biodegradable'),\n",
" Text(0.6615661566156615, 0.7941176470588235, 'V16 <= 6.5\\ngini = 0.381\\nsamples = 39\\nvalue = [10, 29]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6255625562556255, 0.7352941176470589, 'V14 <= 1.207\\ngini = 0.5\\nsamples = 18\\nvalue = [9, 9]\\nclass = Ready biodegradable'),\n",
" Text(0.5967596759675967, 0.6764705882352942, 'V25 <= 0.5\\ngini = 0.245\\nsamples = 7\\nvalue = [1, 6]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5823582358235824, 0.6176470588235294, 'gini = 0.0\\nsamples = 6\\nvalue = [0, 6]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6111611161116112, 0.6176470588235294, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.6543654365436543, 0.6764705882352942, 'V31 <= 3.283\\ngini = 0.397\\nsamples = 11\\nvalue = [8, 3]\\nclass = Ready biodegradable'),\n",
" Text(0.63996399639964, 0.6176470588235294, 'V15 <= 8.989\\ngini = 0.198\\nsamples = 9\\nvalue = [8, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.6255625562556255, 0.5588235294117647, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6543654365436543, 0.5588235294117647, 'gini = 0.0\\nsamples = 8\\nvalue = [8, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.6687668766876688, 0.6176470588235294, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6975697569756976, 0.7352941176470589, 'V14 <= 3.315\\ngini = 0.091\\nsamples = 21\\nvalue = [1, 20]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6831683168316832, 0.6764705882352942, 'gini = 0.0\\nsamples = 20\\nvalue = [0, 20]\\nclass = Reday non-biodegradable'),\n",
" Text(0.711971197119712, 0.6764705882352942, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7695769576957696, 0.7941176470588235, 'V31 <= 1.805\\ngini = 0.213\\nsamples = 33\\nvalue = [29, 4]\\nclass = Ready biodegradable'),\n",
" Text(0.7551755175517552, 0.7352941176470589, 'V8 <= 48.35\\ngini = 0.5\\nsamples = 8\\nvalue = [4, 4]\\nclass = Ready biodegradable'),\n",
" Text(0.7407740774077408, 0.6764705882352942, 'gini = 0.0\\nsamples = 4\\nvalue = [4, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7695769576957696, 0.6764705882352942, 'gini = 0.0\\nsamples = 4\\nvalue = [0, 4]\\nclass = Reday non-biodegradable'),\n",
" Text(0.783978397839784, 0.7352941176470589, 'gini = 0.0\\nsamples = 25\\nvalue = [25, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9027902790279028, 0.8529411764705882, 'V39 <= 8.446\\ngini = 0.142\\nsamples = 365\\nvalue = [337, 28]\\nclass = Ready biodegradable'),\n",
" Text(0.8631863186318632, 0.7941176470588235, 'V30 <= 5.124\\ngini = 0.375\\nsamples = 52\\nvalue = [39, 13]\\nclass = Ready biodegradable'),\n",
" Text(0.8271827182718272, 0.7352941176470589, 'V8 <= 45.85\\ngini = 0.214\\nsamples = 41\\nvalue = [36, 5]\\nclass = Ready biodegradable'),\n",
" Text(0.7983798379837984, 0.6764705882352942, 'V1 <= 5.601\\ngini = 0.105\\nsamples = 36\\nvalue = [34, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.783978397839784, 0.6176470588235294, 'V14 <= 1.534\\ngini = 0.056\\nsamples = 35\\nvalue = [34, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.7695769576957696, 0.5588235294117647, 'gini = 0.0\\nsamples = 33\\nvalue = [33, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7983798379837984, 0.5588235294117647, 'V18 <= 1.127\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.783978397839784, 0.5, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8127812781278128, 0.5, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8127812781278128, 0.6176470588235294, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.855985598559856, 0.6764705882352942, 'V14 <= 1.224\\ngini = 0.48\\nsamples = 5\\nvalue = [2, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8415841584158416, 0.6176470588235294, 'gini = 0.0\\nsamples = 3\\nvalue = [0, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8703870387038704, 0.6176470588235294, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8991899189918992, 0.7352941176470589, 'V30 <= 15.793\\ngini = 0.397\\nsamples = 11\\nvalue = [3, 8]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8847884788478848, 0.6764705882352942, 'gini = 0.0\\nsamples = 8\\nvalue = [0, 8]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9135913591359136, 0.6764705882352942, 'gini = 0.0\\nsamples = 3\\nvalue = [3, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9423942394239424, 0.7941176470588235, 'V8 <= 13.25\\ngini = 0.091\\nsamples = 313\\nvalue = [298, 15]\\nclass = Ready biodegradable'),\n",
" Text(0.927992799279928, 0.7352941176470589, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9567956795679567, 0.7352941176470589, 'V14 <= 4.039\\ngini = 0.086\\nsamples = 312\\nvalue = [298, 14]\\nclass = Ready biodegradable'),\n",
" Text(0.9423942394239424, 0.6764705882352942, 'V40 <= 0.5\\ngini = 0.08\\nsamples = 311\\nvalue = [298, 13]\\nclass = Ready biodegradable'),\n",
" Text(0.9135913591359136, 0.6176470588235294, 'V38 <= 0.5\\ngini = 0.069\\nsamples = 306\\nvalue = [295, 11]\\nclass = Ready biodegradable'),\n",
" Text(0.8991899189918992, 0.5588235294117647, 'V30 <= 12.851\\ngini = 0.121\\nsamples = 170\\nvalue = [159, 11]\\nclass = Ready biodegradable'),\n",
" Text(0.8703870387038704, 0.5, 'V15 <= 10.386\\ngini = 0.086\\nsamples = 155\\nvalue = [148, 7]\\nclass = Ready biodegradable'),\n",
" Text(0.855985598559856, 0.4411764705882353, 'V31 <= 5.771\\ngini = 0.154\\nsamples = 83\\nvalue = [76, 7]\\nclass = Ready biodegradable'),\n",
" Text(0.8415841584158416, 0.38235294117647056, 'V30 <= 11.485\\ngini = 0.136\\nsamples = 82\\nvalue = [76, 6]\\nclass = Ready biodegradable'),\n",
" Text(0.8271827182718272, 0.3235294117647059, 'V14 <= 0.861\\ngini = 0.116\\nsamples = 81\\nvalue = [76, 5]\\nclass = Ready biodegradable'),\n",
" Text(0.7767776777677767, 0.2647058823529412, 'V39 <= 10.543\\ngini = 0.238\\nsamples = 29\\nvalue = [25, 4]\\nclass = Ready biodegradable'),\n",
" Text(0.7479747974797479, 0.20588235294117646, 'V35 <= 0.5\\ngini = 0.147\\nsamples = 25\\nvalue = [23, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.7335733573357336, 0.14705882352941177, 'gini = 0.0\\nsamples = 19\\nvalue = [19, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7623762376237624, 0.14705882352941177, 'V28 <= 0.002\\ngini = 0.444\\nsamples = 6\\nvalue = [4, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.7479747974797479, 0.08823529411764706, 'gini = 0.0\\nsamples = 3\\nvalue = [3, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7767776777677767, 0.08823529411764706, 'V27 <= 2.305\\ngini = 0.444\\nsamples = 3\\nvalue = [1, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.7623762376237624, 0.029411764705882353, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7911791179117912, 0.029411764705882353, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8055805580558055, 0.20588235294117646, 'V36 <= 6.88\\ngini = 0.5\\nsamples = 4\\nvalue = [2, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.7911791179117912, 0.14705882352941177, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.81998199819982, 0.14705882352941177, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8775877587758776, 0.2647058823529412, 'V2 <= 2.978\\ngini = 0.038\\nsamples = 52\\nvalue = [51, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.8631863186318632, 0.20588235294117646, 'V2 <= 2.946\\ngini = 0.124\\nsamples = 15\\nvalue = [14, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.8487848784878488, 0.14705882352941177, 'gini = 0.0\\nsamples = 14\\nvalue = [14, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8775877587758776, 0.14705882352941177, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.891989198919892, 0.20588235294117646, 'gini = 0.0\\nsamples = 37\\nvalue = [37, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.855985598559856, 0.3235294117647059, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8703870387038704, 0.38235294117647056, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8847884788478848, 0.4411764705882353, 'gini = 0.0\\nsamples = 72\\nvalue = [72, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.927992799279928, 0.5, 'V14 <= 0.653\\ngini = 0.391\\nsamples = 15\\nvalue = [11, 4]\\nclass = Ready biodegradable'),\n",
" Text(0.9135913591359136, 0.4411764705882353, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9423942394239424, 0.4411764705882353, 'V17 <= 1.038\\ngini = 0.26\\nsamples = 13\\nvalue = [11, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.927992799279928, 0.38235294117647056, 'gini = 0.0\\nsamples = 11\\nvalue = [11, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9567956795679567, 0.38235294117647056, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.927992799279928, 0.5588235294117647, 'gini = 0.0\\nsamples = 136\\nvalue = [136, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9711971197119712, 0.6176470588235294, 'V27 <= 2.302\\ngini = 0.48\\nsamples = 5\\nvalue = [3, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.9567956795679567, 0.5588235294117647, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9855985598559855, 0.5588235294117647, 'gini = 0.0\\nsamples = 3\\nvalue = [3, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9711971197119712, 0.6764705882352942, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable')]"
]
},
"execution_count": 39,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 16000x16000 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.tree import plot_tree\n",
"\n",
"plt.figure(figsize=(40, 40), dpi=400)\n",
"plot_tree(model_dec_tree, filled=True, rounded=True, class_names=['Ready biodegradable', 'Not ready biodegradable'], feature_names=list(X_train.columns))"
]
},
{
"cell_type": "markdown",
"id": "b55c97cd",
"metadata": {},
"source": [
"### Random Forest Classifier"
]
},
{
"cell_type": "code",
"execution_count": 40,
"id": "c9d5676b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 1000x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/tmp/ipykernel_189833/343368697.py:160: UserWarning: FixedFormatter should only be used together with FixedLocator\n",
" ax[2, 1].set_xticklabels(feature_dict.keys(), rotation=90)\n"
]
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABNEAAATYCAYAAAAxo1G2AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeVgW1f//8RegLALiwuaWGO67YRJqqYXiEqVpmfpR3EvFDa20RDQX2jTMTM3cPpZpmpkfNU1JK5fcNct9NxPcN1QImN8f/pgvt4A3KniTPh/XdV8XnDlzzzkzc8997vecc8bOMAxDAAAAAAAAALJkb+sCAAAAAAAAAHkdQTQAAAAAAADACoJoAAAAAAAAgBUE0QAAAAAAAAArCKIBAAAAAAAAVhBEAwAAAAAAAKwgiAYAAAAAAABYQRANAAAAAAAAsIIgGgAAAAAAAGAFQTQghxw8eFBNmjSRh4eH7OzstHjx4lzZTsOGDdWwYcNcee8HbcSIEbKzs7N1MaxasWKFatasKWdnZ9nZ2enSpUu2LtJd6dy5s/z8/GxdDABADtqyZYvq1q0rV1dX2dnZaefOndled9asWbKzs9OxY8es5vXz81Pnzp3vuZyPqsz2cXbbcGvXrpWdnZ3Wrl2ba+W7F/+2c+HYsWOys7PTrFmzbF2UO4qPj1ebNm1UtGhR2dnZKSYmxtZFuit59Xx92KWmpqpq1aoaM2aMrYuSI1asWCE3NzedPXvW1kWxiiAaHhlpjZm0l7Ozs4oXL66QkBB98sknunr16n29f1hYmHbv3q0xY8Zozpw5ql27dg6V/M7+/vtvjRgx4q4az8i+8+fP65VXXpGLi4smTZqkOXPmyNXVNdO8t59j+fLlU4kSJdS5c2edOnXqAZc877p9P6V/DRkyxNbFy9TYsWNzLTAO4OFw7do1RUVFqWnTpipSpIjVH+979+5V06ZN5ebmpiJFiqhjx47Z/vHwzz//6OWXX9aFCxf08ccfa86cOSpdunQO1QT/ZsuXL9eIESNsXQzchYEDB2rlypUaOnSo5syZo6ZNm2aZ9/Z2U8GCBdWgQQMtW7bsAZY478uqnenr62vromXqXj63X3/9tU6ePKnw8PDcKdQD1rRpU5UtW1bR0dG2LopV+WxdAOBBe/fdd1WmTBn9888/iouL09q1azVgwACNHz9eS5YsUfXq1e/6PW/cuKGNGzfqnXfeeeAXsr///lsjR46Un5+fatas+UC3fb+GDRuWZ4MmabZs2aKrV69q1KhRCg4OztY6aefYzZs39dtvv2nWrFlat26d/vjjDzk7O+dyif890vZTelWrVrVRae5s7NixatOmjVq2bGnrogDIo86dO6d3331Xjz32mGrUqHHHXhl//fWXnnnmGXl4eGjs2LG6du2aPvroI+3evVubN2+Wo6PjHbd1+PBhHT9+XNOmTVP37t1zuCbILT/++GOub2P58uWaNGkSgTRJpUuX1o0bN5Q/f35bF+WOfvrpJ7344osaPHhwtvI3btxYnTp1kmEYOn78uCZPnqzQ0FD98MMPCgkJyeXS/nuk7af0XFxcbFSaO7uXz+2HH36oV199VR4eHrlXsAfstdde0+DBgzVy5Ei5u7vbujhZIoiGR06zZs0seokNHTpUP/30k55//nm98MIL2rt3711fYNPuHBcqVCgni/rQSkhIkKurq/Lly6d8+fL2ZejMmTOS7u7Ypj/HunfvLk9PT73//vtasmSJXnnlldwo5r/S7Z/FnJJ2fgHAg1SsWDGdPn1avr6+2rp1q5588sks844dO1YJCQnatm2bHnvsMUlSnTp11LhxY82aNUs9e/a847bu5bspL0pNTVVSUtIjc4PJWnAUOSM5OVmpqalydHT8V5xbZ86cuavPcvny5fWf//zH/L9169aqXLmyJkyYQBAtndv3U05Jf37Zyo4dO7Rr1y6NGzfOZmXIDa1bt1bfvn21YMECde3a1dbFyR
LDOQFJzz77rCIjI3X8+HF9+eWXFsv27dunNm3aqEiRInJ2dlbt2rW1ZMkSc/mIESPMIRRvvPGG7OzszPmnjh8/rt69e6tChQpycXFR0aJF9fLLL2eYgySrucGszVmydu1as5HepUsXs6vynYaPXL16VQMGDJCfn5+cnJzk7e2txo0ba/v27Rb5Nm3apObNm6tw4cJydXVV9erVNWHCBIs8P/30k55++mm5urqqUKFCevHFF7V3795M67Znzx61b99ehQsXVv369bOst52dncLDw7V48WJVrVpVTk5OqlKlilasWJFp/WvXri1nZ2f5+/tr6tSpdzXP2oIFCxQQECAXFxd5enrqP//5j8Wwy4YNGyosLEyS9OSTT8rOzu6e5gJ5+umnJd3qOZAmKSlJw4cPV0BAgDw8POTq6qqnn35aa9assVg3bT6Pjz76SJ9//rn8/f3l5OSkJ598Ulu2bMmwrbT95uzsrKpVq+q7777LtEwJCQkaNGiQSpUqJScnJ1WoUEEfffSRDMOwyJd2PBYsWKDKlSvLxcVFQUFB2r17tyRp6tSpKlu2rJydndWwYcNsza+TXfd7fknSl19+aR7jIkWK6NVXX9XJkyct3uPgwYNq3bq1fH195ezsrJIlS+rVV1/V5cuXzX2QkJCg2bNnm5+xf9OcMAAeDCcnp2wPFfr222/1/PPPmwE0SQoODlb58uX1zTff3HHdzp07q0GDBpKkl19+WXZ2dhbzbGXn2pkZwzA0evRolSxZUgUKFFCjRo30559/Zqs+0q2A2IQJE1StWjU5OzvLy8tLTZs21datW808ad8pX331lapUqSInJyfz+33Hjh1q1qyZChYsKDc3Nz333HP67bffLLbxzz//aOTIkSpXrpycnZ1VtGhR1a9fX6tWrTLzxMXFqUuXLipZsqScnJxUrFgxvfjii3f8flq4cKHs7Oz0888/Z1g2depU2dnZ6Y8//pAk/f777+rcubMef/xxOTs7y9fXV127dtX58+et7qPM5kT766+/1LJlS7m6usrb21sDBw5UYmJihnV//fVXvfzyy3rsscfk5OSkUqVKaeDAgbpx44aZp3Pnzpo0aZIky+FsaVJTUxUTE6MqVarI2dlZPj4+eu2113Tx4kWLbd3vuTBv3jwFBATI3d1dBQsWVLVq1TK0IS9duqSBAwea7dGSJUuqU6dOOnfunJnnzJkz6tatm3x8fOTs7KwaNWpo9uzZFu+Tvp0UExNjtpP27NmT6ZxonTt3lpubm06dOqWWLVvKzc1NXl5eGjx4sFJSUize+/z58+rYsaMKFiyoQoUKKSwsTLt27cr2PGtHjhzRyy+/rCJFiqhAgQJ66qmnLIZdprXzDcPQpEmTMhyv7KpUqZI8PT0t2pmS9P3336tFixYqXry4nJyc5O/vr1GjRmWoZ8OGDVW1alXt2bNHjRo1UoECBVSiRAl98MEHGbaV3fNVst7Olv7veJw4cULPP/+83NzcVKJECfM83r17t5599lm5urqqdOnSmjt37l3vn6zc7/klWf+dKFm/bln73GZm8eLFcnR01DPPPGORntYuPnDggP7zn//Iw8NDXl5eioyMlGEYOnnypF588UUVLFhQvr6+mQbhEhMTFRUVpbJly5rXmjfffDPDcZ45c6aeffZZeXt7y8nJSZUrV9bkyZMzvJ+fn5+ef/55rVu3TnXq1JGzs7Mef/xx/fe//82Q19vbW9WrV9f3339/x/rbWt7uAgI8QB07dtTbb7+tH3/8UT169JAk/fnnn6pXr55KlCihIUOGyNXVVd98841atmypb7/9Vq1atdJLL72kQoUKaeDAgWrXrp2aN28uNzc3SbeGAm7YsEGvvvqqSpYsqWPHjmny5Mlq2LCh9uzZowIFCtxXmStVqqR3331Xw4cPV8+ePc1gTd26dbNc5/XXX9fChQsVHh6uypUr6/z581q3bp327t2rJ554QpK0atUqPf/88ypWrJj69+8vX19f7d
27V0uXLlX//v0lSatXr1azZs30+OOPa8SIEbpx44YmTpyoevXqafv27Rkmsn/55ZdVrlw5jR07NkOg5nbr1q3TokW
"text/plain": [
"<Figure size 1500x1500 with 7 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjcAAAGwCAYAAABVdURTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABfgklEQVR4nO3dd1QUV/8G8GcX2KVIM0gVxd4Fe8SoUVHsLYkmGEWTaOxGY2IXS+zRaKLRxG5+JqhJNL4W7BV7QUURFcFGUYKAdNi9vz982dcNJTu4sLI8n3P2HPfOnZlnR8qXO3dmZEIIASIiIiIjITd0ACIiIiJ9YnFDRERERoXFDRERERkVFjdERERkVFjcEBERkVFhcUNERERGhcUNERERGRVTQwcoaWq1GtHR0bC2toZMJjN0HCIiItKBEAIvXryAq6sr5PLCx2bKXHETHR0Nd3d3Q8cgIiKiInj06BEqVqxYaJ8yV9xYW1sDeHlwbGxsDJyGiIiIdJGcnAx3d3fN7/HClLniJvdUlI2NDYsbIiKiUkaXKSWcUExERERGhcUNERERGRUWN0RERGRUWNwQERGRUWFxQ0REREaFxQ0REREZFRY3REREZFRY3BAREZFRYXFDRERERoXFDRERERkVgxY3J0+eRI8ePeDq6gqZTIZdu3b96zrHjx9H48aNoVQqUb16dWzatKnYcxIREVHpYdDiJjU1FZ6enli1apVO/SMjI9GtWze0a9cOISEh+OKLL/DZZ5/hwIEDxZyUiIiISguDPjizS5cu6NKli87916xZgypVqmDp0qUAgDp16uD06dP47rvv4OvrW1wxiYiMXkJqFtKycgwdg4yEwlQOR2tzg+2/VD0V/OzZs/Dx8dFq8/X1xRdffFHgOpmZmcjMzNS8T05OLq54RERvPCEEniSm42Z0Mm4+SUJodDJuRichLjnz31cm0lHjSnb4c2Qrg+2/VBU3sbGxcHJy0mpzcnJCcnIy0tPTYWFhkWedBQsWYPbs2SUVkYjojaFWC0T9naopYG4+SUZodBIS07Lz7a805TUmpB9mJob9WipVxU1RTJkyBRMmTNC8T05Ohru7uwETERHpX7ZKjXtPU3AzOhmhT5JwMzoJt6KTkZqlytPXVC5DDSdr1He1QX03W9RztUEdFxtYKY3+VwKVEaXqK9nZ2RlxcXFabXFxcbCxscl31AYAlEollEplScQjIioRGdkqhMe+QGh0kub0UljsC2TlqPP0VZrKUcfFBvX+W8jUd7VFDadyMDczMUByopJRqoqbli1bYt++fVpthw4dQsuWLQ2UiIioeKVk5iAs5uVoTOiTl6eX7j5NgUot8vQtpzRFXVcb1He11RQz1SpYwdTApwiISppBi5uUlBTcu3dP8z4yMhIhISEoX748KlWqhClTpuDJkyfYsmULAGD48OFYuXIlvv76a3zyySc4evQotm/fjr179xrqIxAR6c3z1KyXIzHR/53o+yQJkX+nQuStY2BvafbfU0q2qO9mg3qutqhc3hJyuazkgxO9YQxa3Fy6dAnt2rXTvM+dG+Pv749NmzYhJiYGDx8+1CyvUqUK9u7di/Hjx2PFihWoWLEi1q1bx8vAiajUeZqcgdDo/43GhD5JxpPE9Hz7OtuYawqY3BEZF1tzyGQsZIjyIxMiv78JjFdycjJsbW2RlJQEGxsbQ8chIiMnhMDj5+maAia3oIlPyf/S68pvWaKeq81/R2ReFjMO5ThvkEjK7+9SNeeGiKg4ZWSr8j0FpCsBgejEjJeXXUfnzpNJQnJG3pvjyWVAtQrlNAVMPVdb1HW1ga2F2Wt8AiICWNwQURl3/1kK9lyPwZ7r0bgTl1Is+zAzkaGmkzXq/3d+TF1XW9RxsYalgj+CiYoDv7OIqMx5lJCG/1yPxp5rMbgVo9+7lpubyVHXxUZrom9NJ2soeIM8ohLD4oaIyoToxHTs/e8IzbXHSZp2E7kMrao7oHtDF7Sv7Q
iL17z/i7mZCUx4xRKRQbG4ISKj9TQ5A3tvxGDP9RhcfvBc0y6XAW9XfQvdG7qic31nlLdSGDAlEekbixsiMirxKZnYHxqLPdeicSEqQTNBWCYDmlUuj+6eLuhc39mgTywmouLF4oaISr3EtCwEhcZiz/UYnImIx6s3721UyQ7dG7qiWwMXONuyoCEqC1jcEFGplJyRjYM347DnejRO341HzisVTQM3W3Rv6IJuDV1Q0d7SgCmJyBBY3BBRqZGamYPDYXH4z7UYnLzzDFmq/z0osrazNXp4vhyh8XCwMmBKIjI0FjdE9EZLz1Lh6O2n2HM9GkdvP0XmK0++ru5YDt0buqB7Q1dUdyxnwJRE9CZhcUNEry0jW4Vjt59iz/UYnL4Xj6xXCpDXla1Sa51y8njLEt0buqK7pwtqOVnz+UpElAeLGyIqkswcFU7eicee69E4fCsOqVmqYttXRXuLlwVNQxfUc7VhQUNEhWJxQ0Q6y1apcfpePPZci8HBW7F48cozk9zsLNC94cvLrPX5oEcTuYxPwCYiSVjcEFGhclRqnLufgD3XoxF0MxaJadmaZc425ujawAXdPV3QyN2OBQgRvRFY3BBRHhnZKlyMSsCBm7EICo1FfEqWZplDOSW6NnBG94auaFrZHnI+aoCI3jAsbogIQgiExbzA6XvPcOpuPC5EJmhdlWRvaYbO9V3Qo6ELWlR9i89OIqI3GosbojLqaXIGTt2Nx+l78Th1Nx7xKZlay51tzNG2ZgV0begC72pvwcyET7UmotKBxQ1RGZGepcKFqAScuvNydCY87oXWcgszE7xdtTxa16iANjUdUK1COc6hIaJSicUNkZFSqwVuxST/d3TmGS5GPte6o69M9vIxBe9Ud0DrGhXQuLIdlKYmBkxMRKQfLG6IXvHH5cf4/uhd5KjEv3d+w6Vk5iApPVurzdXWHK1rVEDrmg7wruaA8lYKA6UjIio+LG6I/utmdBKm/HlDa3SjtLNSmKBltbdejs7UrICqDlY81URERo/FDRFePpBxzK9XkaVSo31tR4zrUMPQkV6bqYkMNRytoTDlRGAiKltY3BABmPnXTdyPT4WzjTmWfuAJe56uISIqtVjc0BspJTMHp+/GI0dd/KeI7j1NwR9XHkMuA77/qBELGyKiUo7FDb2R5vznJrZfelyi+/zCpyaaVylfovskIiL9Y3FDb6S45Jc3lKtWwQoVrPX3EMaC1Ha2wah21Yt9P0REVPxY3NAbbeS71fFek4qGjkFERKUIL6MgIiIio8LihoiIiIwKixsiIiIyKixuiIiIyKiwuCEiIiKjwqulyKDCY1/g79TMPO2JaVkGSENERMaAxQ0ZzKm7zzBw/YVC+8g5tkhERBKxuCGDeZSQDuDlk6vd7C3yLK9grcQ71SuUdCwiIirlWNyQwXlXd8DaQU0NHYOIiIwEB/2JiIjIqHDkhkrUmYh4rD8ViRy1QHRiuqHjEBGREWJxQyVqzYn7OHnnmVZbSTwYk4iIyo4iFTfZ2dmIjY1FWloaKlSogPLly+s7FxmpHJUaADCgRSU0qmQPMxMZ2td2NHAqIiIyJjoXNy9evMD//d//ITAwEBcuXEBWVhaEEJDJZKhYsSI6deqEYcOGoVmzZsWZl4xE8yrl0cvLzdAxiIjICOk0oXjZsmXw8PDAxo0b4ePjg127diEkJAR37tzB2bNnERAQgJycHHTq1AmdO3fG3bt3izs3ERERUb50Grm5ePEiTp48iXr16uW7vHnz5vjkk0+wZs0abNy4EadOnUKNGjX0GpSIiIhIFzoVN7/99ptOG1MqlRg+fPhrBSIiIiJ6HbzPDRERERkVScXNtWvX8M033+DHH39EfHy81rLk5GR88skneg1HREREJJXOxc3BgwfRvHlzBAYGYtGiRahduzaOHTumWZ6eno7NmzcXS0giIiIiXelc3MyaNQsTJ05EaGgooqKi8PXXX6Nnz54ICgoqznxEREREkuh8n5ubN2
/il19+AQDIZDJ8/fXXqFixIt5//30EBgby/jZERET0RtC5uFEqlUhMTNRq8/Pzg1wuR//+/bF06VJ9ZyMiIiKSTOf
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Score the model with default parameters\n",
"score_rf, model_rf = score_the_model(\n",
" model=RandomForestClassifier(),\n",
" model_name='Random Forest',\n",
" random_seed=42,\n",
" X_train=X_train,\n",
" X_test=X_test,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=True\n",
")"
]
},
{
"cell_type": "markdown",
"id": "bf17a3b5",
"metadata": {},
"source": [
"### KNN"
]
},
{
"cell_type": "code",
"execution_count": 41,
"id": "75fea43a",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAA04AAAK4CAYAAABDHK0xAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABSaUlEQVR4nO3de3yP9f/H8edns/Nsw2yGMadCFjWHnCLmEJF0kA4O5RQTVkjlVLSO0jdKCB34OuVUJFqoWMlEDjmT44bkNIfZdv3+6Ofz7dPGeyf7zPa4327X7dvnfb3f1/W65qrvnt7X9f7YLMuyBAAAAAC4JhdnFwAAAAAA+R3BCQAAAAAMCE4AAAAAYEBwAgAAAAADghMAAAAAGBCcAAAAAMCA4AQAAAAABgQnAAAAADAgOAEAAACAAcEJAHBNM2bMkM1m04EDB+xtTZs2VdOmTY1jV69eLZvNptWrV9+w+rIjLCxM3bp1c3YZ+cru3bvVsmVL+fv7y2azadGiRc4uCQDyHYITAPy/vXv3qnfv3qpYsaI8PT3l5+enhg0b6r333tPFixedXV6hsmzZMo0aNcrZZRQaXbt21ZYtWzR27Fh99tlnql279g0714EDB2Sz2fT22287tFuWpd69e8tms9n/7K+Gb5vNpvj4+HTH6tatm3x9fR3amjZtKpvNpnbt2mX63ACQGUWcXQAA5AdLly7Vww8/LA8PD3Xp0kU1atRQcnKyfvzxRw0ePFjbtm3T5MmTnV1mvrBixYobfo5ly5Zp4sSJhKc8cPHiRcXFxemll15SVFSUU2qwLEt9+/bV5MmTNXz48Az/3EeNGqUvv/wy08f86quvFB8fr4iIiFysFEBhRnACUOjt379fjz76qMqXL6/vvvtOISEh9n39+vXTnj17tHTp0muOT0tLU3Jysjw9PfOiXKdzd3d3dgmFQkpKitLS0m74z/vEiROSpICAgFw7ZlJSknx8fDLdv3///po0aZJeeuklvfLKK+n216pVS1999ZU2btyoO++803i8cuXK6dy5cxo9erSWLFmSpdoB4Fp4VA9Aoffmm2/q/Pnz+vjjjx1C01WVK1fWgAED7J9tNpuioqI0c+ZM3XbbbfLw8NDy5cslSb/++qvuvfde+fn5ydfXV82bN9dPP/3kcLwrV65o9OjRqlKlijw9PVWiRAk1atRIK1eutPdJSEhQ9+7dVbZsWXl4eCgkJET333+/w7tG/zZ//nzZbDatWbMm3b6PPvpINptNW7dulST99ttv6tatm/2xxFKlSumpp57Sn3/+afx5ZfSO0+HDh9WhQwf5+PgoKChIgwYN0uXLl9ON/eGHH/Twww+rXLly8vDwUGhoqAYNGuTwKGS3bt00ceJESbI/pmWz2ez709LSNH78eN12223y9PRUcHCwevfurb/++svhXJZlacyYMSpbtqy8vb11zz33aNu2bcbru2r27NmKiIhQ0aJF5efnp/DwcL333nsOfU6fPq1BgwYpLCxMHh4eKlu2rLp06aKTJ0/a+xw/flxPP/20goOD5enpqZo1a+qTTz5xOM4/HyEbP368KlWqJA8PD23fvl2StGPHDj300EMqXry4PD09Vbt27XSBIDP31b+NGjVK5cuXlyQNHjxYNptNYWFh9v2ZuZ+vvge3Zs0a9e3bV0FBQSpbtmymf84DBgzQxIkTNWzYMI0ZMybDPv3791exYsUyPQNZtGhRDRo0SF9++aU2btyY6VoA4HqYcQJQ6H355ZeqWLGiGjRokOkx3333nebOnauoqCgFBgYqLCxM27ZtU+PGjeXn56chQ4bIzc1NH330kZo2bao1a9aoXr16kv7+ZTUmJkY9evRQ3bp1dfbsWW3YsEEbN25UixYtJEkPPvigtm3bpv79+yssLEzHjx/XypUrdfDgQYdfbP+pbdu28vX11dy5c9WkSROHfXPmzNFtt92mGjVqSJJWrlypffv2qXv37ipVqpT9UcRt27
bpp59+cggqJhcvXlTz5s118OBBPfvssypdurQ+++wzfffdd+n6zps3TxcuXNAzzzyjEiVKaP369Xr//fd1+PBhzZs3T5LUu3dvHT16VCtXrtRnn32W7hi9e/fWjBkz1L17dz377LPav3+/JkyYoF9//VVr166Vm5ubJGnEiBEaM2aM2rRpozZt2mjjxo1q2bKlkpOTjde0cuVKde7cWc2bN9cbb7whSfr999+1du1ae4g+f/68GjdurN9//11PPfWU7rzzTp08eVJLlizR4cOHFRgYqIsXL6pp06bas2ePoqKiVKFCBc2bN0/dunXT6dOnHQK5JE2fPl2XLl1Sr1695OHhoeLFi2vbtm1q2LChypQpoxdeeEE+Pj6aO3euOnTooC+++EIPPPCApMzdV//WsWNHBQQEaNCgQercubPatGljf2cos/fzVX379lXJkiU1YsQIJSUlGX/GkjRo0CD95z//0dChQ/Xaa69ds5+fn58GDRqkESNGZHrWacCAAXr33Xc1atQoZp0A5A4LAAqxM2fOWJKs+++/P9NjJFkuLi7Wtm3bHNo7dOhgubu7W3v37rW3HT161CpatKh1991329tq1qxptW3b9prH/+uvvyxJ1ltvvZX5C/l/nTt3toKCgqyUlBR727FjxywXFxfrlVdesbdduHAh3dj//ve/liTr+++/t7dNnz7dkmTt37/f3takSROrSZMm9s/jx4+3JFlz5861tyUlJVmVK1e2JFmrVq267nljYmIsm81m/fHHH/a2fv36WRn9X9QPP/xgSbJmzpzp0L58+XKH9uPHj1vu7u5W27ZtrbS0NHu/F1980ZJkde3aNd2x/2nAgAGWn5+fw8/x30aMGGFJshYsWJBu39VzXv3ZfP755/Z9ycnJVv369S1fX1/r7NmzlmVZ1v79+y1Jlp+fn3X8+HGHYzVv3twKDw+3Ll265HD8Bg0aWFWqVLG3me6ra7l67n/fb5m9n6/eI40aNbruz+vf5ytfvrwlyRo8ePA1+65atcqSZM2bN886ffq0VaxYMat9+/b2/V27drV8fHwcxjRp0sS67bbbLMuyrNGjR1uSrPj4+OteKwBkBo/qASjUzp49K+nvR3uyokmTJqpevbr9c2pqqlasWKEOHTqoYsWK9vaQkBA99thj+vHHH+3nCggI0LZt27R79+4Mj+3l5SV3d3etXr063eNnJp06ddLx48cdlgCfP3++0tLS1KlTJ4dzXHXp0iWdPHlSd911lyRl+dGmZcuWKSQkRA899JC9zdvbW7169UrX95/nTUpK0smTJ9WgQQNZlqVff/3VeK558+bJ399fLVq00MmTJ+1bRESEfH19tWrVKknSt99+q+TkZPXv399h9mzgwIGZuqaAgAAlJSVd9zG3L774QjVr1rTP+PzT1XMuW7ZMpUqVUufOne373Nzc9Oyzz+r8+fPpHqt88MEHVbJkSfvnU6dO6bvvvtMjjzyic+fO2a/3zz//VKtWrbR7924dOXLEXvP17qusyMr9fFXPnj3l6uqa6XMkJiZKkm655ZZM9ff399fAgQO1ZMmSTN0r0t+zTsWKFdPo0aMzXRcAXAvBCUCh5ufnJ0k6d+5clsZVqFDB4fOJEyd04cIF3Xrrren6VqtWTWlpaTp06JAk6ZVXXtHp06d1yy23KDw8XIMHD9Zvv/1m7+/h4aE33nhDX3/9tYKDg3X33XfrzTffVEJCgr3PmTNnlJCQYN9OnTolSWrdurX8/f01Z84ce985c+aoVq1aDr+gnjp1SgMGDFBwcLC8vLxUsmRJ+zWdOXMmSz+LP/74Q5UrV073eF9GP4uDBw+qW7duKl68uHx9fVWyZEn7Y4WZOe/u3bt15swZBQUFqWTJkg7b+fPndfz4cXtNklSlShWH8SVLllSxYsWM5+nbt69uueUW3XvvvSpbtqyeeuop+3tsV+3du9f+6OO1/PHHH6pSpYpcXBz/77ZatWoOdV717/tqz549sixLw4
cPT3e9I0eOlCT7NZvuq6zIyv18rdpNhg4dqjp16qh3796aP39+psYMGDBAAQEBmX7XKTthCwCuheAEoFDz8/NT6dK
"text/plain": [
"<Figure size 1000x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"# Score the model with default parameters\n",
"score_knn, model_knn = score_the_model(\n",
" model=KNeighborsClassifier(),\n",
" model_name='KNN',\n",
" random_seed=42,\n",
" X_train=X_train,\n",
" X_test=X_test,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=True\n",
")"
]
},
{
"cell_type": "markdown",
"id": "0b925bbf",
"metadata": {},
"source": [
"#### Logistic Regression"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "33d0774a",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"\n",
"# Score the model with default parameters\n",
"score_log_reg, model_log_reg = score_the_model(\n",
" model=LogisticRegression(max_iter=100),\n",
" model_name='Logistic Regression',\n",
" random_seed=42,\n",
" X_train=X_train,\n",
" X_test=X_test,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=True\n",
")"
]
},
{
"cell_type": "markdown",
"id": "641e5a5a",
"metadata": {},
"source": [
"#### SVM"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "96adfe07",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.svm import SVC\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Scale the data\n",
"scaler = StandardScaler()\n",
"X_train_scaled = scaler.fit_transform(X_train)\n",
"X_test_scaled = scaler.transform(X_test)\n",
"# Score the model with default parameters\n",
"\n",
"scores_svm, model_svm = score_the_model(\n",
" model=SVC(),\n",
" model_name='SVM',\n",
" random_seed=42,\n",
" X_train=X_train_scaled,\n",
" X_test=X_test_scaled,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=False\n",
")\n",
"\n",
"print(scores_svm)\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0842608e",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import GradientBoostingClassifier\n",
"\n",
"# Score the model with default parameters\n",
"score_gb, model_gb = score_the_model(\n",
" model=GradientBoostingClassifier(),\n",
" model_name='Gradient Boosting',\n",
" random_seed=42,\n",
" X_train=X_train,\n",
" X_test=X_test,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4c75c0cd",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.ensemble import AdaBoostClassifier\n",
"\n",
"# Score the model with default parameters\n",
"score_ada, model_ada = score_the_model(\n",
" model=AdaBoostClassifier(\n",
" estimator = RandomForestClassifier(),\n",
" n_estimators=500000,\n",
" learning_rate=0.001,\n",
" ),\n",
" model_name='AdaBoost',\n",
" random_seed=42,\n",
" X_train=X_train,\n",
" X_test=X_test,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=True\n",
")\n"
]
},
{
"cell_type": "markdown",
"id": "121d0534",
"metadata": {},
"source": [
"Searching for the best params:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28e3bd1e",
"metadata": {},
"outputs": [],
"source": [
"best_params_gradient_boosting = {\n",
"'validation_fraction': 0.3, \n",
"'tol': 0.0001,\n",
"'subsample': 0.9,\n",
"'n_estimators': 50,\n",
"'min_samples_split': 5, \n",
"'min_samples_leaf': 5,\n",
"'max_features': 'log2', \n",
"'max_depth': 10, \n",
"'learning_rate': 0.1\n",
"}\n",
"\n",
"best_params_grad_boost_scores, best_params_grad_boost_model = score_the_model(\n",
" model=GradientBoostingClassifier(**best_params_gradient_boosting),\n",
" model_name='Gradient Boosting',\n",
" random_seed=42,\n",
" X_train=X_train,\n",
" X_test=X_test,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=True\n",
")\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "57b06b7d",
"metadata": {},
"source": [
"### Comparisson between models performances"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d3fdfaf",
"metadata": {},
"outputs": [],
"source": [
"# Plot Scores of all models\n",
"all_accuracy = [score['Accuracy'] for score in all_scores]\n",
"all_precision = [score['Precision'] for score in all_scores]\n",
"all_recall = [score['Recall'] for score in all_scores]\n",
"all_f1 = [score['F1'] for score in all_scores]\n",
"all_roc_auc = [score['AUC'] for score in all_scores]\n",
"model_names = [score['model_name'] for score in all_scores]\n",
"\n",
"fig, ax = plt.subplots(3, 2, figsize=(25, 20))\n",
"fig.suptitle('Scores of all models', fontsize=20)\n",
"\n",
"ax[0, 0].bar(model_names, all_accuracy)\n",
"ax[0, 0].set_title('Accuracy')\n",
"ax[0, 0].set_ylabel('Accuracy')\n",
"ax[0, 0].set_xticklabels(model_names, rotation=90)\n",
"\n",
"ax[0, 1].bar(model_names, all_precision)\n",
"ax[0, 1].set_title('Precision')\n",
"ax[0, 1].set_ylabel('Precision')\n",
"ax[0, 1].set_xticklabels(model_names, rotation=90)\n",
"\n",
"ax[1, 0].bar(model_names, all_recall)\n",
"ax[1, 0].set_title('Recall')\n",
"ax[1, 0].set_ylabel('Recall')\n",
"ax[1, 0].set_xticklabels(model_names, rotation=90)\n",
"\n",
"ax[1, 1].bar(model_names, all_f1)\n",
"ax[1, 1].set_title('F1')\n",
"ax[1, 1].set_ylabel('F1')\n",
"ax[1, 1].set_xticklabels(model_names, rotation=90)\n",
"\n",
"ax[2, 0].bar(model_names, all_roc_auc)\n",
"ax[2, 0].set_title('ROC AUC')\n",
"ax[2, 0].set_ylabel('ROC AUC')\n",
"ax[2, 0].set_xticklabels(model_names, rotation=90)\n",
"\n",
"ax[2, 1].set_visible(False)\n",
"\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "3dafbf40",
"metadata": {},
"source": [
"### 2.3 Evaluation\n",
"Given that the data set is not in the “big data” category, implement a cross-validation procedure based\n",
"on five folds (approximately equal sized) of your data. Furthermore, repeat the experiment 10 times with\n",
"different folds and average the results (include standard deviation). You are expected to report the following\n",
"metrics:\n",
"\n",
"- F1\n",
"- Precision\n",
"- Recall\n",
"- AUC\n",
"\n",
"Comment on the performance of algorithms and visualize their final scores. How do they perform against\n",
"the random baseline? What about the constant one? How do different learning scenarios impact the final\n",
"score? Are the differences between the models statistically significant?"
]
},
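{
"cell_type": "markdown",
"id": "b7e41c90",
"metadata": {},
"source": [
"The evaluation protocol described above (five folds, ten repetitions, mean and standard deviation) can be sketched without any library as follows. This is a minimal illustration under stated assumptions, not the code used elsewhere in this notebook: the helper names `k_fold_indices`, `f1_score`, `majority_baseline`, `random_baseline` and `repeated_cv` are hypothetical, the labels are toy data, and the two baselines look only at the training labels. A real model would also receive the feature rows of each fold.\n",
"\n",
"```python\n",
"import random\n",
"import statistics\n",
"\n",
"\n",
"def k_fold_indices(n_samples, n_folds, rng):\n",
"    # Shuffle all indices once, then slice them into n_folds near-equal folds.\n",
"    indices = list(range(n_samples))\n",
"    rng.shuffle(indices)\n",
"    return [indices[i::n_folds] for i in range(n_folds)]\n",
"\n",
"\n",
"def f1_score(y_true, y_pred, positive=1):\n",
"    # F1 written in terms of counts; equal to the harmonic mean of P and R.\n",
"    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)\n",
"    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)\n",
"    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)\n",
"    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0\n",
"\n",
"\n",
"def majority_baseline(y_train, n_test, rng):\n",
"    # Constant baseline: always predict the most frequent training label.\n",
"    return [statistics.mode(y_train)] * n_test\n",
"\n",
"\n",
"def random_baseline(y_train, n_test, rng):\n",
"    # Random baseline: pick uniformly among the labels seen in training.\n",
"    labels = sorted(set(y_train))\n",
"    return [rng.choice(labels) for _ in range(n_test)]\n",
"\n",
"\n",
"def repeated_cv(y, predict, n_folds=5, n_repeats=10, seed=0):\n",
"    # Return the mean and standard deviation of F1 over all folds and repeats.\n",
"    rng = random.Random(seed)\n",
"    scores = []\n",
"    for _ in range(n_repeats):\n",
"        for test_idx in k_fold_indices(len(y), n_folds, rng):\n",
"            held_out = set(test_idx)\n",
"            y_train = [y[i] for i in range(len(y)) if i not in held_out]\n",
"            y_test = [y[i] for i in test_idx]\n",
"            scores.append(f1_score(y_test, predict(y_train, len(y_test), rng)))\n",
"    return statistics.mean(scores), statistics.stdev(scores)\n",
"\n",
"\n",
"# Toy labels only (1 = ready biodegradable, 2 = not); not the real data\n",
"y = [1] * 35 + [2] * 65\n",
"for name, predict in [('constant', majority_baseline), ('random', random_baseline)]:\n",
"    mean_f1, std_f1 = repeated_cv(y, predict)\n",
"    print(name, round(mean_f1, 3), round(std_f1, 3))\n",
"```\n",
"\n",
"With class 1 as the positive class and class 2 as the majority, the constant baseline never predicts a positive, so its F1 is 0 in every fold. This is why accuracy alone can make a constant predictor look better than it is."
]
},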
{
"cell_type": "markdown",
"id": "74d18249",
"metadata": {},
"source": [
"### F1 score"
]
},
{
"cell_type": "markdown",
"id": "f6f42fd4",
"metadata": {},
"source": [
"The F1 score combines precision and recall into a single number. It is often used in classification tasks to balance the two, because improving one typically comes at the expense of the other.\n",
"To compute the F1 score, first calculate the precision and recall of the model, where TP, FP and FN denote the numbers of true positives, false positives and false negatives.\n",
"\n",
"**Precision**  \n",
"P = TP / (TP + FP)\n",
"\n",
"**Recall**  \n",
"R = TP / (TP + FN)\n",
"\n",
"The F1 score is the harmonic mean of the two:\n",
"\n",
"F1 = 2 * (P * R) / (P + R)\n",
"\n",
"The F1 score ranges from 0 to 1, with a higher score indicating better performance. A perfect score of 1 is achieved only when precision and recall are both 1."
]
},
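{
"cell_type": "markdown",
"id": "c29d5a17",
"metadata": {},
"source": [
"As a worked illustration of the formulas above, the following sketch computes precision, recall and F1 directly from label lists. The function names `confusion_counts` and `precision_recall_f1` and the toy labels are assumptions made for this example; treating class 1 as the positive class is also an assumption.\n",
"\n",
"```python\n",
"def confusion_counts(y_true, y_pred, positive=1):\n",
"    # Count true positives, false positives and false negatives.\n",
"    pairs = list(zip(y_true, y_pred))\n",
"    tp = sum(1 for t, p in pairs if t == positive and p == positive)\n",
"    fp = sum(1 for t, p in pairs if t != positive and p == positive)\n",
"    fn = sum(1 for t, p in pairs if t == positive and p != positive)\n",
"    return tp, fp, fn\n",
"\n",
"\n",
"def precision_recall_f1(y_true, y_pred, positive=1):\n",
"    # Return (precision, recall, F1); a zero denominator yields 0.0.\n",
"    tp, fp, fn = confusion_counts(y_true, y_pred, positive)\n",
"    precision = tp / (tp + fp) if tp + fp else 0.0\n",
"    recall = tp / (tp + fn) if tp + fn else 0.0\n",
"    total = precision + recall\n",
"    f1 = 2 * precision * recall / total if total else 0.0\n",
"    return precision, recall, f1\n",
"\n",
"\n",
"# Toy labels using the data set's encoding: 1 = ready biodegradable, 2 = not\n",
"y_true = [1, 1, 1, 1, 2, 2, 2, 2, 2, 2]\n",
"y_pred = [1, 1, 2, 2, 1, 2, 2, 2, 2, 2]\n",
"precision, recall, f1 = precision_recall_f1(y_true, y_pred)\n",
"print(round(precision, 3), round(recall, 3), round(f1, 3))\n",
"```\n",
"\n",
"Here TP = 2, FP = 1 and FN = 2, so P = 2/3, R = 1/2 and F1 = 4/7. The harmonic mean lies below the arithmetic mean of P and R and is pulled toward the smaller of the two."
]
},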
{
"cell_type": "markdown",
"id": "2e76c17b",
"metadata": {},
"source": [
"### Precision score"
]
},
{
"cell_type": "markdown",
"id": "05e44481",
"metadata": {},
"source": [
"Precision is a metric that measures the accuracy of a classifier when it predicts the positive class. It is defined as the number of true positive predictions made by the classifier, divided by the total number of positive predictions made by the classifier.\n",
"\n",
"In other words, precision is a measure of the proportion of positive predictions that are actually correct. It is a useful metric to consider when the cost of false positives is high, such as in cases where the classifier is being used to make important decisions (e.g. medical diagnosis, fraud detection).\n",
"\n",
"Precision = True Positives / (True Positives + False Positives)"
]
},
{
"cell_type": "markdown",
"id": "c4323660",
"metadata": {},
"source": [
"### Recall score\n"
]
},
{
"cell_type": "markdown",
"id": "bf24df24",
"metadata": {},
"source": [
"Recall is a metric that measures the ability of a classifier to detect all instances of the positive class. It is defined as the number of true positive predictions made by the classifier, divided by the total number of actual positive cases in the data.\n",
"\n",
"In other words, recall is a measure of the proportion of actual positive cases that the classifier is able to identify. It is a useful metric to consider when the cost of false negatives is high, such as in cases where it is important to identify all instances of the positive class (e.g. cancer diagnosis, intrusion detection).\n",
"\n",
"Recall = True Positives / (True Positives + False Negatives)"
]
},
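{
"cell_type": "markdown",
"id": "d83f6b02",
"metadata": {},
"source": [
"Precision and recall usually pull in opposite directions when the decision threshold moves. The sketch below shows this on made-up scores; the function name `precision_recall_at`, the scores and the labels are all hypothetical and are not taken from the models in this notebook.\n",
"\n",
"```python\n",
"def precision_recall_at(scores, y_true, threshold, positive=1):\n",
"    # Predict the positive class whenever the score reaches the threshold.\n",
"    predicted = [s >= threshold for s in scores]\n",
"    actual = [t == positive for t in y_true]\n",
"    tp = sum(1 for p, a in zip(predicted, actual) if p and a)\n",
"    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)\n",
"    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)\n",
"    precision = tp / (tp + fp) if tp + fp else 0.0\n",
"    recall = tp / (tp + fn) if tp + fn else 0.0\n",
"    return precision, recall\n",
"\n",
"\n",
"# Hypothetical predicted probabilities of class 1 and the true labels\n",
"scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]\n",
"y_true = [1, 1, 2, 1, 1, 2, 2, 2]\n",
"\n",
"for threshold in (0.8, 0.5, 0.25):\n",
"    precision, recall = precision_recall_at(scores, y_true, threshold)\n",
"    print(threshold, round(precision, 2), round(recall, 2))\n",
"```\n",
"\n",
"On this toy data, lowering the threshold from 0.8 to 0.25 raises recall from 0.5 to 1.0 while precision drops from 1.0 to 2/3."
]
},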
{
"cell_type": "markdown",
"id": "aa3f5b17",
"metadata": {},
"source": [
"### AUC score"
]
},
{
"cell_type": "markdown",
"id": "697ee032",
"metadata": {},
"source": [
"The AUC is the area under the ROC curve, which plots the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. The TPR is defined as the number of true positive predictions made by the classifier, divided by the total number of actual positive cases in the data (it is the same quantity as recall). The FPR is defined as the number of false positive predictions made by the classifier, divided by the total number of actual negative cases in the data.\n",
"\n",
"An AUC of 1 indicates a perfect classifier, while an AUC of 0.5 indicates a classifier that is no better than random.\n",
"\n",
"The area can be approximated with the trapezoidal rule over consecutive points of the curve:\n",
"\n",
"AUC = (FPR1 - FPR0) * (TPR1 + TPR0) / 2 + (FPR2 - FPR1) * (TPR2 + TPR1) / 2 + ... + (FPRn - FPRn-1) * (TPRn + TPRn-1) / 2\n",
"\n",
"where TPRi and FPRi are the TPR and FPR at the ith classification threshold, with thresholds ordered from the strictest (FPR0 = TPR0 = 0) to the most permissive (FPRn = TPRn = 1).\n",
"\n"
]
},
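{
"cell_type": "markdown",
"id": "e5a0947c",
"metadata": {},
"source": [
"The trapezoidal computation of the area under the ROC curve can be sketched as follows. The function names `roc_points` and `auc_trapezoid`, the scores and the labels are assumptions made for illustration, and the sketch assumes that both classes occur in `y_true`.\n",
"\n",
"```python\n",
"def roc_points(scores, y_true, positive=1):\n",
"    # Sweep the threshold from high to low and record (FPR, TPR) after each\n",
"    # distinct score value, starting from the point (0, 0).\n",
"    n_pos = sum(1 for t in y_true if t == positive)\n",
"    n_neg = len(y_true) - n_pos\n",
"    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)\n",
"    points = [(0.0, 0.0)]\n",
"    tp = fp = 0\n",
"    for i, (score, label) in enumerate(ranked):\n",
"        if label == positive:\n",
"            tp += 1\n",
"        else:\n",
"            fp += 1\n",
"        # Emit a point only once all samples tied at this score are counted.\n",
"        if i + 1 == len(ranked) or ranked[i + 1][0] != score:\n",
"            points.append((fp / n_neg, tp / n_pos))\n",
"    return points\n",
"\n",
"\n",
"def auc_trapezoid(points):\n",
"    # Sum the trapezoid areas between consecutive ROC points.\n",
"    area = 0.0\n",
"    for (fpr0, tpr0), (fpr1, tpr1) in zip(points, points[1:]):\n",
"        area += (fpr1 - fpr0) * (tpr1 + tpr0) / 2\n",
"    return area\n",
"\n",
"\n",
"# Hypothetical predicted probabilities of class 1 and the true labels\n",
"scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]\n",
"y_true = [1, 1, 2, 1, 1, 2, 2, 2]\n",
"auc = auc_trapezoid(roc_points(scores, y_true))\n",
"print(auc)\n",
"```\n",
"\n",
"On this toy data 14 of the 16 positive-negative pairs are ranked correctly, which matches the computed area of 0.875. A classifier that gives every sample the same score produces only the points (0, 0) and (1, 1) and therefore an area of 0.5."
]
},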
{
"cell_type": "markdown",
"id": "addfc3ea",
"metadata": {},
"source": [
"## Report and presentation\n",
"The assignment has to be submitted in the form of two files: a markdown file and a PDF file created from\n",
"the RStudio markdown file (in RStudio: File → New File → R Markdown), where you write both the code\n",
"and the text of the answers (the echo = T option must be enabled for each code block). Markdown files can\n",
"easily be exported to PDF using the “Knit” button in RStudio. If you are using Python, you can produce a\n",
"similar report with Jupyter Notebook."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.0"
},
"vscode": {
"interpreter": {
"hash": "73efbd7de9807940366a2e2c585910074bc00282bd7f8b3dae7eb06897ea8ebf"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}