is_assignments/a2/code/Second IS assignment.ipynb

2022-12-19 10:09:00 +01:00
{
"cells": [
{
"cell_type": "markdown",
"id": "c093ea0c",
"metadata": {},
"source": [
"# Seminar 2: Predicting Biodegradability of Chemicals"
]
},
{
"cell_type": "markdown",
"id": "7aa30d7d",
"metadata": {},
"source": [
"## 1. Introduction\n",
"Chemicals are all around us. Studying their properties by means of machine learning is an active\n",
"research field; matching molecular patterns with their behavior can be a decisive factor in the creation of\n",
"new materials, drugs, and more.\n",
"In this seminar assignment, your task is to explore the data and build machine-learning models that\n",
"predict the biodegradability of chemicals."
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "aeab08c8",
"metadata": {},
"source": [
"## 2. Task\n",
"You will work with the [data set](https://www.openml.org/search?type=data&status=active&id=1494&sort=runs) compiled by Mansouri et al. It has 41 features and one target variable (biodegradability).\n",
"The target variable is encoded as ready biodegradable (1) and not ready biodegradable (2). The data set\n",
"consists of 1055 instances. Features can be either symbolic or numeric.\n",
"IMPORTANT: Use the data set provided on učilnica and NOT the one posted at the link above. It is\n",
"minimally modified and split into train and test sets.\n"
]
},
{
"cell_type": "markdown",
"id": "a4f197dd",
"metadata": {},
"source": [
"### 2.1 Exploration\n",
"Inspect the dataset. How balanced is the target variable? Are there any missing values present? If there\n",
"are, choose a strategy that takes this into account.\n",
"Most of your data is of the numeric type. Can you identify, by adopting exploratory analysis, whether\n",
"some features are directly related to the target? What about feature pairs? Produce at least three types of\n",
"visualizations of the feature space and be prepared to argue why these visualizations were useful for your\n",
"subsequent analysis."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "5bcf6290",
"metadata": {},
"outputs": [],
"source": [
"# Needed imports\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import sklearn\n",
"import seaborn as sns\n",
"import scikitplot as skplt\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "18ff4f76",
"metadata": {},
"outputs": [],
"source": [
"df_train = pd.read_csv('train.csv')\n",
"df_test = pd.read_csv('test.csv')"
]
},
{
"cell_type": "markdown",
"id": "ea26bfdf",
"metadata": {},
"source": [
"#### Let's inspect the training and test data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "5933f4d7",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>V1</th>\n",
" <th>V2</th>\n",
" <th>V3</th>\n",
" <th>V4</th>\n",
" <th>V5</th>\n",
" <th>V6</th>\n",
" <th>V7</th>\n",
" <th>V8</th>\n",
" <th>V9</th>\n",
" <th>V10</th>\n",
" <th>...</th>\n",
" <th>V33</th>\n",
" <th>V34</th>\n",
" <th>V35</th>\n",
" <th>V36</th>\n",
" <th>V37</th>\n",
" <th>V38</th>\n",
" <th>V39</th>\n",
" <th>V40</th>\n",
" <th>V41</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.919</td>\n",
" <td>2.6909</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>31.4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2.949</td>\n",
" <td>1.591</td>\n",
" <td>0</td>\n",
" <td>7.253</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.170</td>\n",
" <td>2.1144</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.8</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3.315</td>\n",
" <td>1.967</td>\n",
" <td>0</td>\n",
" <td>7.257</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3.000</td>\n",
" <td>2.7098</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>20.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3.046</td>\n",
" <td>5.000</td>\n",
" <td>0</td>\n",
" <td>6.690</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>4.214</td>\n",
" <td>2.6272</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2.998</td>\n",
" <td>1.722</td>\n",
" <td>0</td>\n",
" <td>6.770</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>3.942</td>\n",
" <td>2.7719</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>31.6</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3.542</td>\n",
" <td>1.739</td>\n",
" <td>0</td>\n",
" <td>8.127</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V33 V34 V35 \\\n",
"1 3.919 2.6909 0 0 0 0 0 31.4 2 0 ... 0 0 0 \n",
"2 4.170 2.1144 0 0 0 0 0 30.8 1 1 ... 0 0 0 \n",
"4 3.000 2.7098 0 0 0 0 0 20.0 0 2 ... 0 0 1 \n",
"13 4.214 2.6272 0 0 0 0 0 30.0 3 0 ... 0 0 0 \n",
"16 3.942 2.7719 1 0 0 0 0 31.6 2 0 ... 0 0 0 \n",
"\n",
" V36 V37 V38 V39 V40 V41 Class \n",
"1 2.949 1.591 0 7.253 0 0 2 \n",
"2 3.315 1.967 0 7.257 0 0 2 \n",
"4 3.046 5.000 0 6.690 0 0 2 \n",
"13 2.998 1.722 0 6.770 0 0 2 \n",
"16 3.542 1.739 0 8.127 0 1 2 \n",
"\n",
"[5 rows x 42 columns]"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "1743d191",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>V1</th>\n",
" <th>V2</th>\n",
" <th>V3</th>\n",
" <th>V4</th>\n",
" <th>V5</th>\n",
" <th>V6</th>\n",
" <th>V7</th>\n",
" <th>V8</th>\n",
" <th>V9</th>\n",
" <th>V10</th>\n",
" <th>...</th>\n",
" <th>V33</th>\n",
" <th>V34</th>\n",
" <th>V35</th>\n",
" <th>V36</th>\n",
" <th>V37</th>\n",
" <th>V38</th>\n",
" <th>V39</th>\n",
" <th>V40</th>\n",
" <th>V41</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>821.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>...</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>821.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" <td>846.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>4.790476</td>\n",
" <td>3.054551</td>\n",
" <td>0.739953</td>\n",
" <td>0.030451</td>\n",
" <td>0.946809</td>\n",
" <td>0.277778</td>\n",
" <td>1.669031</td>\n",
" <td>37.422813</td>\n",
" <td>1.342790</td>\n",
" <td>1.784870</td>\n",
" <td>...</td>\n",
" <td>0.903073</td>\n",
" <td>1.241135</td>\n",
" <td>0.926714</td>\n",
" <td>3.922100</td>\n",
" <td>2.549406</td>\n",
" <td>0.671395</td>\n",
" <td>8.643191</td>\n",
" <td>0.059102</td>\n",
" <td>0.706856</td>\n",
" <td>1.333333</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.531991</td>\n",
" <td>0.813983</td>\n",
" <td>1.504545</td>\n",
" <td>0.198281</td>\n",
" <td>2.318081</td>\n",
" <td>1.045544</td>\n",
" <td>2.220221</td>\n",
" <td>9.030008</td>\n",
" <td>2.018433</td>\n",
" <td>1.773856</td>\n",
" <td>...</td>\n",
" <td>1.526124</td>\n",
" <td>2.248684</td>\n",
" <td>1.239133</td>\n",
" <td>0.992636</td>\n",
" <td>0.625021</td>\n",
" <td>1.093633</td>\n",
" <td>1.223700</td>\n",
" <td>0.342364</td>\n",
" <td>2.145396</td>\n",
" <td>0.471683</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>2.000000</td>\n",
" <td>0.803900</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>9.100000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.279000</td>\n",
" <td>1.467000</td>\n",
" <td>0.000000</td>\n",
" <td>4.948000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>4.499000</td>\n",
" <td>2.510175</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>30.800000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.497000</td>\n",
" <td>2.101000</td>\n",
" <td>0.000000</td>\n",
" <td>8.009500</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>4.840000</td>\n",
" <td>3.052400</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>37.850000</td>\n",
" <td>1.000000</td>\n",
" <td>1.500000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.732500</td>\n",
" <td>2.461000</td>\n",
" <td>0.000000</td>\n",
" <td>8.508000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>5.119000</td>\n",
" <td>3.415725</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>43.800000</td>\n",
" <td>2.000000</td>\n",
" <td>3.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>2.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.980000</td>\n",
" <td>2.861000</td>\n",
" <td>1.000000</td>\n",
" <td>9.019750</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>6.496000</td>\n",
" <td>7.918400</td>\n",
" <td>12.000000</td>\n",
" <td>2.000000</td>\n",
" <td>36.000000</td>\n",
" <td>13.000000</td>\n",
" <td>18.000000</td>\n",
" <td>60.700000</td>\n",
" <td>24.000000</td>\n",
" <td>12.000000</td>\n",
" <td>...</td>\n",
" <td>12.000000</td>\n",
" <td>18.000000</td>\n",
" <td>7.000000</td>\n",
" <td>10.695000</td>\n",
" <td>5.750000</td>\n",
" <td>8.000000</td>\n",
" <td>14.700000</td>\n",
" <td>4.000000</td>\n",
" <td>27.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" V1 V2 V3 V4 V5 V6 \\\n",
"count 846.000000 846.000000 846.000000 821.000000 846.000000 846.000000 \n",
"mean 4.790476 3.054551 0.739953 0.030451 0.946809 0.277778 \n",
"std 0.531991 0.813983 1.504545 0.198281 2.318081 1.045544 \n",
"min 2.000000 0.803900 0.000000 0.000000 0.000000 0.000000 \n",
"25% 4.499000 2.510175 0.000000 0.000000 0.000000 0.000000 \n",
"50% 4.840000 3.052400 0.000000 0.000000 0.000000 0.000000 \n",
"75% 5.119000 3.415725 1.000000 0.000000 1.000000 0.000000 \n",
"max 6.496000 7.918400 12.000000 2.000000 36.000000 13.000000 \n",
"\n",
" V7 V8 V9 V10 ... V33 \\\n",
"count 846.000000 846.000000 846.000000 846.000000 ... 846.000000 \n",
"mean 1.669031 37.422813 1.342790 1.784870 ... 0.903073 \n",
"std 2.220221 9.030008 2.018433 1.773856 ... 1.526124 \n",
"min 0.000000 9.100000 0.000000 0.000000 ... 0.000000 \n",
"25% 0.000000 30.800000 0.000000 0.000000 ... 0.000000 \n",
"50% 1.000000 37.850000 1.000000 1.500000 ... 0.000000 \n",
"75% 3.000000 43.800000 2.000000 3.000000 ... 1.000000 \n",
"max 18.000000 60.700000 24.000000 12.000000 ... 12.000000 \n",
"\n",
" V34 V35 V36 V37 V38 V39 \\\n",
"count 846.000000 846.000000 846.000000 821.000000 846.000000 846.000000 \n",
"mean 1.241135 0.926714 3.922100 2.549406 0.671395 8.643191 \n",
"std 2.248684 1.239133 0.992636 0.625021 1.093633 1.223700 \n",
"min 0.000000 0.000000 2.279000 1.467000 0.000000 4.948000 \n",
"25% 0.000000 0.000000 3.497000 2.101000 0.000000 8.009500 \n",
"50% 0.000000 1.000000 3.732500 2.461000 0.000000 8.508000 \n",
"75% 2.000000 1.000000 3.980000 2.861000 1.000000 9.019750 \n",
"max 18.000000 7.000000 10.695000 5.750000 8.000000 14.700000 \n",
"\n",
" V40 V41 Class \n",
"count 846.000000 846.000000 846.000000 \n",
"mean 0.059102 0.706856 1.333333 \n",
"std 0.342364 2.145396 0.471683 \n",
"min 0.000000 0.000000 1.000000 \n",
"25% 0.000000 0.000000 1.000000 \n",
"50% 0.000000 0.000000 1.000000 \n",
"75% 0.000000 0.000000 2.000000 \n",
"max 4.000000 27.000000 2.000000 \n",
"\n",
"[8 rows x 42 columns]"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train.describe()"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "b2689ec0",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 846 entries, 3 to 1055\n",
"Data columns (total 42 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 V1 846 non-null float64\n",
" 1 V2 846 non-null float64\n",
" 2 V3 846 non-null int64 \n",
" 3 V4 821 non-null float64\n",
" 4 V5 846 non-null int64 \n",
" 5 V6 846 non-null int64 \n",
" 6 V7 846 non-null int64 \n",
" 7 V8 846 non-null float64\n",
" 8 V9 846 non-null int64 \n",
" 9 V10 846 non-null int64 \n",
" 10 V11 846 non-null int64 \n",
" 11 V12 846 non-null float64\n",
" 12 V13 846 non-null float64\n",
" 13 V14 846 non-null float64\n",
" 14 V15 846 non-null float64\n",
" 15 V16 846 non-null int64 \n",
" 16 V17 846 non-null float64\n",
" 17 V18 846 non-null float64\n",
" 18 V19 846 non-null int64 \n",
" 19 V20 846 non-null int64 \n",
" 20 V21 846 non-null int64 \n",
" 21 V22 830 non-null float64\n",
" 22 V23 846 non-null int64 \n",
" 23 V24 846 non-null int64 \n",
" 24 V25 846 non-null int64 \n",
" 25 V26 846 non-null int64 \n",
" 26 V27 838 non-null float64\n",
" 27 V28 846 non-null float64\n",
" 28 V29 838 non-null float64\n",
" 29 V30 846 non-null float64\n",
" 30 V31 846 non-null float64\n",
" 31 V32 846 non-null int64 \n",
" 32 V33 846 non-null int64 \n",
" 33 V34 846 non-null int64 \n",
" 34 V35 846 non-null int64 \n",
" 35 V36 846 non-null float64\n",
" 36 V37 821 non-null float64\n",
" 37 V38 846 non-null int64 \n",
" 38 V39 846 non-null float64\n",
" 39 V40 846 non-null int64 \n",
" 40 V41 846 non-null int64 \n",
" 41 Class 846 non-null int64 \n",
"dtypes: float64(19), int64(23)\n",
"memory usage: 284.2 KB\n"
]
}
],
"source": [
"df_train.info()"
]
},
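{
"cell_type": "markdown",
"id": "e1a0f3b2",
"metadata": {},
"source": [
"#### Missing values: an illustrative sketch\n",
"The `info()` output above reports fewer than 846 non-null entries for V4, V22, V27, V29 and V37 in the training set. The cell below is a minimal sketch of one possible strategy, not necessarily the one used in the rest of this notebook: it counts the missing values per column and fills them with training-set medians. Median imputation and the names `feature_cols`, `medians`, `train_imputed` and `test_imputed` are assumptions made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a0f3b3",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: median imputation is an assumed strategy, not one prescribed by the assignment.\n",
"import pandas as pd\n",
"\n",
"# Re-read the files so that this cell can run on its own\n",
"df_train = pd.read_csv('train.csv')\n",
"df_test = pd.read_csv('test.csv')\n",
"\n",
"# Number of missing values per column, showing only the columns that have any\n",
"missing = df_train.isna().sum()\n",
"print(missing[missing > 0])\n",
"\n",
"# Medians come from the training features only and are reused for the test set,\n",
"# so that no information from the test set leaks into preprocessing\n",
"feature_cols = df_train.columns.drop('Class')\n",
"medians = df_train[feature_cols].median()\n",
"train_imputed = df_train.fillna(medians)\n",
"test_imputed = df_test.fillna(medians)\n",
"\n",
"# Sanity check: no missing values should remain in the imputed training set\n",
"print(train_imputed.isna().sum().sum())"
]
},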
{
"cell_type": "code",
"execution_count": 6,
"id": "22003f33",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>V1</th>\n",
" <th>V2</th>\n",
" <th>V3</th>\n",
" <th>V4</th>\n",
" <th>V5</th>\n",
" <th>V6</th>\n",
" <th>V7</th>\n",
" <th>V8</th>\n",
" <th>V9</th>\n",
" <th>V10</th>\n",
" <th>...</th>\n",
" <th>V33</th>\n",
" <th>V34</th>\n",
" <th>V35</th>\n",
" <th>V36</th>\n",
" <th>V37</th>\n",
" <th>V38</th>\n",
" <th>V39</th>\n",
" <th>V40</th>\n",
" <th>V41</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>3.919</td>\n",
" <td>2.6909</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>31.4</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2.949</td>\n",
" <td>1.591</td>\n",
" <td>0</td>\n",
" <td>7.253</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.170</td>\n",
" <td>2.1144</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.8</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3.315</td>\n",
" <td>1.967</td>\n",
" <td>0</td>\n",
" <td>7.257</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3.000</td>\n",
" <td>2.7098</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>20.0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>3.046</td>\n",
" <td>5.000</td>\n",
" <td>0</td>\n",
" <td>6.690</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>13</th>\n",
" <td>4.214</td>\n",
" <td>2.6272</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>30.0</td>\n",
" <td>3</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2.998</td>\n",
" <td>1.722</td>\n",
" <td>0</td>\n",
" <td>6.770</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>16</th>\n",
" <td>3.942</td>\n",
" <td>2.7719</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>31.6</td>\n",
" <td>2</td>\n",
" <td>0</td>\n",
" <td>...</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3.542</td>\n",
" <td>1.739</td>\n",
" <td>0</td>\n",
" <td>8.127</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 ... V33 V34 V35 \\\n",
"1 3.919 2.6909 0 0 0 0 0 31.4 2 0 ... 0 0 0 \n",
"2 4.170 2.1144 0 0 0 0 0 30.8 1 1 ... 0 0 0 \n",
"4 3.000 2.7098 0 0 0 0 0 20.0 0 2 ... 0 0 1 \n",
"13 4.214 2.6272 0 0 0 0 0 30.0 3 0 ... 0 0 0 \n",
"16 3.942 2.7719 1 0 0 0 0 31.6 2 0 ... 0 0 0 \n",
"\n",
" V36 V37 V38 V39 V40 V41 Class \n",
"1 2.949 1.591 0 7.253 0 0 2 \n",
"2 3.315 1.967 0 7.257 0 0 2 \n",
"4 3.046 5.000 0 6.690 0 0 2 \n",
"13 2.998 1.722 0 6.770 0 0 2 \n",
"16 3.542 1.739 0 8.127 0 1 2 \n",
"\n",
"[5 rows x 42 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test.head()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "d7235214",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>V1</th>\n",
" <th>V2</th>\n",
" <th>V3</th>\n",
" <th>V4</th>\n",
" <th>V5</th>\n",
" <th>V6</th>\n",
" <th>V7</th>\n",
" <th>V8</th>\n",
" <th>V9</th>\n",
" <th>V10</th>\n",
" <th>...</th>\n",
" <th>V33</th>\n",
" <th>V34</th>\n",
" <th>V35</th>\n",
" <th>V36</th>\n",
" <th>V37</th>\n",
" <th>V38</th>\n",
" <th>V39</th>\n",
" <th>V40</th>\n",
" <th>V41</th>\n",
" <th>Class</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>count</th>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.00000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>...</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" <td>209.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mean</th>\n",
" <td>4.750938</td>\n",
" <td>3.130050</td>\n",
" <td>0.62201</td>\n",
" <td>0.086124</td>\n",
" <td>1.114833</td>\n",
" <td>0.339713</td>\n",
" <td>1.555024</td>\n",
" <td>35.569378</td>\n",
" <td>1.511962</td>\n",
" <td>1.880383</td>\n",
" <td>...</td>\n",
" <td>0.803828</td>\n",
" <td>1.411483</td>\n",
" <td>1.100478</td>\n",
" <td>3.902612</td>\n",
" <td>2.629201</td>\n",
" <td>0.746411</td>\n",
" <td>8.574038</td>\n",
" <td>0.019139</td>\n",
" <td>0.789474</td>\n",
" <td>1.354067</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>0.603914</td>\n",
" <td>0.897556</td>\n",
" <td>1.27690</td>\n",
" <td>0.406969</td>\n",
" <td>2.393143</td>\n",
" <td>1.182566</td>\n",
" <td>2.246383</td>\n",
" <td>9.471334</td>\n",
" <td>1.721220</td>\n",
" <td>1.784023</td>\n",
" <td>...</td>\n",
" <td>1.498327</td>\n",
" <td>2.374355</td>\n",
" <td>1.320857</td>\n",
" <td>1.029605</td>\n",
" <td>0.714285</td>\n",
" <td>1.077657</td>\n",
" <td>1.315016</td>\n",
" <td>0.195176</td>\n",
" <td>2.589491</td>\n",
" <td>0.479378</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>2.000000</td>\n",
" <td>1.134900</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.267000</td>\n",
" <td>1.576000</td>\n",
" <td>0.000000</td>\n",
" <td>4.917000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>4.414000</td>\n",
" <td>2.494500</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>29.400000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.401000</td>\n",
" <td>2.146000</td>\n",
" <td>0.000000</td>\n",
" <td>7.872000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>4.807000</td>\n",
" <td>3.039300</td>\n",
" <td>0.00000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>34.200000</td>\n",
" <td>1.000000</td>\n",
" <td>2.000000</td>\n",
" <td>...</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>3.694000</td>\n",
" <td>2.469000</td>\n",
" <td>0.000000</td>\n",
" <td>8.464000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>5.188000</td>\n",
" <td>3.555400</td>\n",
" <td>1.00000</td>\n",
" <td>0.000000</td>\n",
" <td>1.000000</td>\n",
" <td>0.000000</td>\n",
" <td>3.000000</td>\n",
" <td>41.200000</td>\n",
" <td>2.000000</td>\n",
" <td>3.000000</td>\n",
" <td>...</td>\n",
" <td>1.000000</td>\n",
" <td>2.000000</td>\n",
" <td>2.000000</td>\n",
" <td>3.991000</td>\n",
" <td>2.967000</td>\n",
" <td>1.000000</td>\n",
" <td>9.017000</td>\n",
" <td>0.000000</td>\n",
" <td>0.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>6.253000</td>\n",
" <td>9.177500</td>\n",
" <td>8.00000</td>\n",
" <td>3.000000</td>\n",
" <td>16.000000</td>\n",
" <td>12.000000</td>\n",
" <td>14.000000</td>\n",
" <td>60.000000</td>\n",
" <td>9.000000</td>\n",
" <td>11.000000</td>\n",
" <td>...</td>\n",
" <td>12.000000</td>\n",
" <td>18.000000</td>\n",
" <td>6.000000</td>\n",
" <td>10.355000</td>\n",
" <td>5.825000</td>\n",
" <td>6.000000</td>\n",
" <td>14.030000</td>\n",
" <td>2.000000</td>\n",
" <td>27.000000</td>\n",
" <td>2.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>8 rows × 42 columns</p>\n",
"</div>"
],
"text/plain": [
" V1 V2 V3 V4 V5 V6 \\\n",
"count 209.000000 209.000000 209.00000 209.000000 209.000000 209.000000 \n",
"mean 4.750938 3.130050 0.62201 0.086124 1.114833 0.339713 \n",
"std 0.603914 0.897556 1.27690 0.406969 2.393143 1.182566 \n",
"min 2.000000 1.134900 0.00000 0.000000 0.000000 0.000000 \n",
"25% 4.414000 2.494500 0.00000 0.000000 0.000000 0.000000 \n",
"50% 4.807000 3.039300 0.00000 0.000000 0.000000 0.000000 \n",
"75% 5.188000 3.555400 1.00000 0.000000 1.000000 0.000000 \n",
"max 6.253000 9.177500 8.00000 3.000000 16.000000 12.000000 \n",
"\n",
" V7 V8 V9 V10 ... V33 \\\n",
"count 209.000000 209.000000 209.000000 209.000000 ... 209.000000 \n",
"mean 1.555024 35.569378 1.511962 1.880383 ... 0.803828 \n",
"std 2.246383 9.471334 1.721220 1.784023 ... 1.498327 \n",
"min 0.000000 0.000000 0.000000 0.000000 ... 0.000000 \n",
"25% 0.000000 29.400000 0.000000 0.000000 ... 0.000000 \n",
"50% 0.000000 34.200000 1.000000 2.000000 ... 0.000000 \n",
"75% 3.000000 41.200000 2.000000 3.000000 ... 1.000000 \n",
"max 14.000000 60.000000 9.000000 11.000000 ... 12.000000 \n",
"\n",
" V34 V35 V36 V37 V38 V39 \\\n",
"count 209.000000 209.000000 209.000000 209.000000 209.000000 209.000000 \n",
"mean 1.411483 1.100478 3.902612 2.629201 0.746411 8.574038 \n",
"std 2.374355 1.320857 1.029605 0.714285 1.077657 1.315016 \n",
"min 0.000000 0.000000 2.267000 1.576000 0.000000 4.917000 \n",
"25% 0.000000 0.000000 3.401000 2.146000 0.000000 7.872000 \n",
"50% 0.000000 1.000000 3.694000 2.469000 0.000000 8.464000 \n",
"75% 2.000000 2.000000 3.991000 2.967000 1.000000 9.017000 \n",
"max 18.000000 6.000000 10.355000 5.825000 6.000000 14.030000 \n",
"\n",
" V40 V41 Class \n",
"count 209.000000 209.000000 209.000000 \n",
"mean 0.019139 0.789474 1.354067 \n",
"std 0.195176 2.589491 0.479378 \n",
"min 0.000000 0.000000 1.000000 \n",
"25% 0.000000 0.000000 1.000000 \n",
"50% 0.000000 0.000000 1.000000 \n",
"75% 0.000000 0.000000 2.000000 \n",
"max 2.000000 27.000000 2.000000 \n",
"\n",
"[8 rows x 42 columns]"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_test.describe()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "9598495e",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 209 entries, 1 to 1051\n",
"Data columns (total 42 columns):\n",
" # Column Non-Null Count Dtype \n",
"--- ------ -------------- ----- \n",
" 0 V1 209 non-null float64\n",
" 1 V2 209 non-null float64\n",
" 2 V3 209 non-null int64 \n",
" 3 V4 209 non-null int64 \n",
" 4 V5 209 non-null int64 \n",
" 5 V6 209 non-null int64 \n",
" 6 V7 209 non-null int64 \n",
" 7 V8 209 non-null float64\n",
" 8 V9 209 non-null int64 \n",
" 9 V10 209 non-null int64 \n",
" 10 V11 209 non-null int64 \n",
" 11 V12 209 non-null float64\n",
" 12 V13 209 non-null float64\n",
" 13 V14 209 non-null float64\n",
" 14 V15 209 non-null float64\n",
" 15 V16 209 non-null int64 \n",
" 16 V17 209 non-null float64\n",
" 17 V18 209 non-null float64\n",
" 18 V19 209 non-null int64 \n",
" 19 V20 209 non-null int64 \n",
" 20 V21 209 non-null int64 \n",
" 21 V22 209 non-null float64\n",
" 22 V23 209 non-null int64 \n",
" 23 V24 209 non-null int64 \n",
" 24 V25 209 non-null int64 \n",
" 25 V26 209 non-null int64 \n",
" 26 V27 209 non-null float64\n",
" 27 V28 209 non-null float64\n",
" 28 V29 209 non-null int64 \n",
" 29 V30 209 non-null float64\n",
" 30 V31 209 non-null float64\n",
" 31 V32 209 non-null int64 \n",
" 32 V33 209 non-null int64 \n",
" 33 V34 209 non-null int64 \n",
" 34 V35 209 non-null int64 \n",
" 35 V36 209 non-null float64\n",
" 36 V37 209 non-null float64\n",
" 37 V38 209 non-null int64 \n",
" 38 V39 209 non-null float64\n",
" 39 V40 209 non-null int64 \n",
" 40 V41 209 non-null int64 \n",
" 41 Class 209 non-null int64 \n",
"dtypes: float64(17), int64(25)\n",
"memory usage: 70.2 KB\n"
]
}
],
"source": [
"df_test.info()"
]
},
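{
"cell_type": "markdown",
"id": "e1a0f3b4",
"metadata": {},
"source": [
"#### Features related to the target: an illustrative sketch\n",
"One simple way to look for features that are directly related to the target is to rank them by the absolute value of their Pearson correlation with `Class`. The cell below is a hedged sketch of that idea; the names `corr_with_class`, `order` and `ranked` are hypothetical. Since `Class` is a binary label coded as 1 and 2, the correlation is only a rough indicator of a linear relationship and may miss non-linear effects."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e1a0f3b5",
"metadata": {},
"outputs": [],
"source": [
"# Sketch only: ranking by Pearson correlation is an assumed, simple screening step.\n",
"import pandas as pd\n",
"\n",
"# Re-read the file so that this cell can run on its own\n",
"df_train = pd.read_csv('train.csv')\n",
"\n",
"# Correlation of every feature with the target; pandas skips missing values pairwise\n",
"corr_with_class = df_train.corr()['Class'].drop('Class')\n",
"\n",
"# Order the features by the absolute strength of the correlation, strongest first\n",
"order = corr_with_class.abs().sort_values(ascending=False).index\n",
"ranked = corr_with_class.reindex(order)\n",
"print(ranked.head(10))"
]
},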
{
"cell_type": "markdown",
"id": "84e0c414",
"metadata": {},
"source": [
"#### Display the distribution of the target variable **Class** in the training and test sets."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "5ca239ec",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_train['Class'], bins=10)\n",
"plt.xlabel('Class')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the target variable in the training set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "c74f9fb5",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjsAAAHHCAYAAABZbpmkAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABH6UlEQVR4nO3deVxUZf//8fcggogsYgJiLrjlmhYmebsrSWqmabmWuFthrmV5l7mkkZlrmd6VqZVmampmppmYtqi5lm3uu4JbgrggyvX7ox/zbQQUYdiOr+fjMQ+d65y5zuecYZg311znjM0YYwQAAGBRLrldAAAAQHYi7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7AAAAEsj7NzhRo0aJZvNliPbaty4sRo3bmy//91338lms2nx4sU5sv3u3burbNmyObKtzEpISFDv3r0VGBgom82mQYMG3XYfKc/pmTNnnF8gssWNr42MOnTokGw2m956661bruvs1/qcOXNks9l06NAhp/WZnu7du6tIkSLZvh1YF2HHQlJ++aTcChUqpKCgIIWHh2vatGm6cOGCU7Zz4sQJjRo1Sjt37nRKf86Ul2vLiNdff11z5szRM888o48//lhPPfXUTdddtmxZzhV3g59++kmjRo3S+fPnc62G25Hf6r3TXLp0SaNGjdJ3332XazWsXLlSo0aNytZt5OZ+/vHHHxo1alSOBNQ8x8AyZs+ebSSZMWPGmI8//th8+OGH5vXXXzfNmzc3NpvNlClTxvzyyy8Oj0lKSjKXL1++re1s2bLFSDKzZ8++rcclJiaaxMRE+/1169YZSWbRokW31U9ma7t69aq5cuWK07aVHUJDQ029evUytK6np6eJiIhI1T5y5EgjyZw+fdrJ1TmaMGGCkWQOHjyYrdtxlrxc742vjYw6ePCgkWQmTJhwy3VTfi6c5dq1a+by5csmOTnZKf2dPn3aSDIjR45MtSwiIsJ4eno6ZTs3ExkZ6dRjlJab7Wd2W7RokZFk1q1bl+Pbzm2uuZKwkK1atGih2rVr2+8PHz5c0dHReuSRR/Too4/qzz//lIeHhyTJ1dVVrq7Z+2Nw6dIlFS5cWG5ubtm6nVspWLBgrm4/I06dOqWqVavmdhm5xhijK1eu2H8+rS6vvDYyo0CBAipQoEBulwFkTG6nLThPysjOli1b0lz++uuvG0nmvffes7el9dfeN998Y+rVq2d8fHyMp6enqVSpkhk+fLgx5v9GY268pYykNGrUyFSrVs1s3brVNGjQwHh4eJiBAwfalzVq1Mi+nZS+FixYYIYPH24CAgJM4cKFTevWrc2RI0ccaipTpkyaoxj/7vNWtUVERJgyZco4PD4hIcEMGTLE3H333cbNzc1UqlTJTJgwIdVfq5JMZGSkWbp0qalWrZpxc3MzVatWNV9//XWax/pGsbGxpmfPnsbf39+4u7ube++918yZMyfVsbjxlt4oRFrrphyflOd07969JiIiwvj4+Bhvb2/TvXt3c/HixVR9ffzxx+b+++83hQoVMkWLFjUdO3ZMdfxvlLKN9Or98MMPTZMmTUzx4sWNm5ubqVKlinn33XdT9VOmTBnTqlUrs2rVKhMSEmLc3d3N5MmTjTHGHDp0yLRu3doULlzYFC9e3AwaNMisWrUqzb9MN23aZMLDw423t7fx8PAwDRs2ND/88EOG671RZGSk8fT0TPN4derUyQQEBJhr164ZY4xZtmyZadmypSlRooRxc3Mz5cqVM2PGjLEvT3E7r43ExEQzYsQIc//99xtvb29TuHBhU79+fRMdHe3Q579HdiZNmmRKly5tChUqZBo2bGh27dqV5nN2o8w8/8b83++bfx/DlOfz+++/Nw888IBxd3c3wcHBZu7cuTftK2U/bryljH6kjOwcO3bMtGnTxnh6epq77rrLDB06NNVxvn79upk8ebKpWrWqcXd3N/7+/q
Zv377m3LlzN60hIiIizRput98tW7aY5s2bm2LFiplChQqZsmXLmh49emRoP9Ny9epVM2rUKFOhQgXj7u5u/Pz8TL169cw333zjsN6ff/5p2rdvb4oWLWrc3d1NSEiI+eKLL+zLU56vG293yigPIzt3kKeeekr//e9/9c0336hPnz5prvP777/rkUce0b333qsxY8bI3d1d+/bt048//ihJqlKlisaMGaNXX31Vffv2VYMGDSRJ//nPf+x9nD17Vi1atFCnTp305JNPKiAg4KZ1jRs3TjabTS+++KJOnTqlKVOmKCwsTDt37rytv/AzUtu/GWP06KOPat26derVq5dq1aql1atX64UXXtDx48c1efJkh/V/+OEHLVmyRM8++6y8vLw0bdo0tW/fXkeOHFGxYsXSrevy5ctq3Lix9u3bp/79+ys4OFiLFi1S9+7ddf78eQ0cOFBVqlTRxx9/rMGDB+vuu+/W0KFDJUnFixdPs8+PP/5YvXv3Vp06ddS3b19JUvny5R3W6dChg4KDgxUVFaXt27frgw8+kL+/v8aPH29fZ9y4cRoxYoQ6dOig3r176/Tp03r77bfVsGFD7dixQ76+vmluv127dtqzZ48+/fRTTZ48WXfddZdDvTNmzFC1atX06KOPytXVVV9++aWeffZZJScnKzIy0qGv3bt3q3PnzurXr5/69Omje+65RxcvXlTTpk118uRJDRw4UIGBgZo/f77WrVuXqpbo6Gi1aNFCISEhGjlypFxcXDR79mw1bdpU33//verUqXPLem/UsWNHTZ8+XV999ZWeeOIJe/ulS5f05Zdfqnv37vZRjTlz5qhIkSIaMmSIihQpoujoaL366quKj4/XhAkTHPrN6GsjPj5eH3zwgTp37qw+ffrowoULmjVrlsLDw/Xzzz+rVq1aDut/9NFHunDhgiIjI3XlyhVNnTpVTZs21a5du276+svs838z+/bt0+OPP65evXopIiJCH374obp3766QkBBVq1YtzccUL15cM2bM0DPPPKPHHntM7dq1kyTde++99nWuX7+u8PBwhYaG6q233tK3336riRMnqnz58nrmmWfs6/Xr109z5sxRjx49NGDAAB08eFDvvPOOduzYoR9//DHdEd5+/frpxIkTWrNmjT7++OM0l9+q31OnTql58+YqXry4XnrpJfn6+urQoUNasmRJhvfzRqNGjVJUVJT99R4fH6+tW7dq+/bteuihhyT983u7Xr16KlmypF566SV5enpq4cKFatu2rT7//HM99thjatiwoQYMGKBp06bpv//9r6pUqSJJ9n8tL7fTFpznViM7xhjj4+Nj7rvvPvv9G//amzx58i3ne9xsXkyjRo2MJDNz5sw0l6U1slOyZEkTHx9vb1+4cKGRZKZOnWpvy8jIzq1qu3FkZ9myZUaSGTt2rMN6jz/+uLHZbGbfvn32NknGzc3Noe2XX34xkszbb7+dalv/NmXKFCPJfPLJJ/a2q1evmrp165oiRYo47HvKX8YZcas5Oz179nRof+yxx0yxYsXs9w8dOmQKFChgxo0b57Derl27jKura6r2G91sDsylS5dStYWHh5ty5co5tJUpU8ZIMqtWrXJonzhxopFkli1bZm+7fPmyqVy5ssNfo8nJyaZixYomPDzcYTTu0qVLJjg42Dz00EMZqvdGycnJpmTJkqZ9+/YO7Sk/mxs2bLjpvvbr188ULlzYYY7Y7bw2rl27lmoOz99//20CAgIcnteUkQIPDw9z7Ngxe/vmzZuNJDN48GB7242v9aw+/+mN7Nx4fE6dOmXc3d3N0KFDb9rfrebs6P/PR/y3++67z4SEhNjvf//990aSmTdvnsN6KSOCN7bfKL05Oxntd+nSpbf8HXy7c3Zq1qx5y98JzZo1MzVq1HD4eUtOTjb/+c9/TMWKFe1td/KcHc7GusMUKVLkpmdlpfwl98UXXyg5OTlT23B3d1ePHj0yvH63bt3k5eVlv//444+rRI
kSWrlyZaa2n1ErV65UgQIFNGDAAIf2oUOHyhijr7/+2qE9LCzMYfTk3nvvlbe3tw4cOHDL7QQGBqpz5872toIFC2r
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_test['Class'], bins=10)\n",
"plt.xlabel('Class')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the target variable in the test set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "82afd315",
"metadata": {},
"source": [
"#### Display relationship between features in the training set using the correlation matrix"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "e8cf8eb1",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(42.5, -0.5)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABy4AAAe2CAYAAABKEJQUAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzddXRU1/rw8e9EZuLubkQIEtyhtBQp0EIFK1KkUOqlCrTQ9vbWBai73gq0pcXd3TUJxBPirhOf948JSSbMJMO9cJPfe5/PWrNWSfaZPD1nn2fvffY5+yg0Go0GIYQQQgghhBBCCCGEEEIIIYRoRybtHYAQQgghhBBCCCGEEEIIIYQQQsjEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBCi3cnEpRBCCCGEEEIIIYQQQgghhBD/h+3bt4/x48fj5eWFQqHgr7/+anObPXv20LNnT1QqFSEhIXz33XfXlPn4448JCAjAwsKCfv36cezYsRsffDMycSmEEEIIIYQQQgghhBBCCCHE/2Hl5eV0796djz/+2KjySUlJjB07luHDh3PmzBmefPJJ5s2bx9atWxvL/PbbbyxatIjly5dz6tQpunfvzqhRo8jJyblZ/xsoNBqN5qZ9uxBCCCGEEEIIIYQQQgghhBDiv0ahULB27VomTJhgsMzzzz/Pxo0buXDhQuPPpkyZQlFREVu2bAGgX79+9OnTh48++giA+vp6fH19eeyxx3jhhRduSuzyxKUQQgghhBBCCCGEEEIIIYQQHUxVVRUlJSU6n6qqqhvy3YcPH2bEiBE6Pxs1ahSHDx8GoLq6mpMnT+qUMTExYcSIEY1lbgazm/bNQgghhBBCCCGEEEIIIYQQ4v8sS7+p7R3C/7Tn54Txyiuv6Pxs+fLlvPzyy//xd2dlZeHu7q7zM3d3d0pKSlCr1RQWFlJXV6e3TGxs7H/89w2RiUshhBBCCCGEEEIIIYQQQgghOpjFixezaNEinZ+pVKp2iua/QyYuhRBCCCGEEEIIIYQQQgghhOhgVCrVTZuo9PDwIDs7W+dn2dnZ2NnZYWlpiampKaampnrLeHh43JSYQN5xKYQQQgghhBBCCCGEEEIIIcT/lAEDBrBz506dn23fvp0BAwYAoFQq6dWrl06Z+vp6du7c2VjmZpCJSyGEEEIIIYQQQgghhBBCCCH+DysrK+PMmTOcOXMGgKSkJM6cOUNqaiqgXXZ25syZjeUfeughEhMTee6554iNjeWTTz5h9erVPPXUU41lFi1axJdffsn3339PTEwMCxcupLy8nNmzZ9+0/w9ZKlYIIYQQQgghhBBCCCGEEEKI/8NOnDjB8OHDG/999d2Ys2bN4rvvviMzM7NxEhMgMDCQjRs38tRTT7Fy5Up8fHz46quvGDVqVGOZyZMnk5uby7Jly8jKyiIqKootW7bg7u5+0/4/FBqNRnPTvl0IIYQQQgghhBBCCCGEEEL8n2TpN7W9Q/ifpk79pb1D+K+TpWKFEEIIIYQQQgghhBBCCCGEEO1OlooVQgghhBBCCCGEEEIIIYQQ11Ao5Pk38d8lNU4IIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e7M2jsAIYQQQgghhBBCCCGEEEII0fEo5Pk38V8mNU4IIY
QQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e5k4lIIIYQQQgghhBBCCCGEEEII0e7M2jsAIYQQQgghhBBCCCGEEEII0fEoFPL8m/jv6lATl5Z+U9s7BB3q1F/o+fP+9g5Dx6lpQ4j4el97h9EoZu5QpuzuOPEA/Dp8KP3/ONDeYeg4cs9gAp/f0N5h6Eh6axwByza3dxg6kl8dg/8bO9o7DB0pi0fQ65eOlQdOTh1C52861nkXPWcofdd0rPPu2H2DeeborvYOQ8e7/W4lZOIP7R1Go/i1M1l+qmOdc6/0HIFL2JPtHYaOvEsrCFi+pb3D0JH8yugO1R8AbZ/gldMdqz4t7zGCOfv3tHcYOr4Zcgu3bDzY3mHo2DN2EN1+7Fht3bkZQ+izumO1K8cnDabf7x0rpqP3DubRw7vbO4xGHw0YzuC/O9Y+OnDXYO7a0bHq998jhuAc+nh7h6Ej//IqZu7d295h6Phh2DAmdbCx5urhQztkP/yhgx0nDwB8Nmg4kd92rP10cfZQAj/uWHU86ZFhDF3fcfoE+8YP6pDnXKfPO1ZMcQuGMmprx2rrto7qmG1d4DPr2zsMHUnvjuf2LR3nnAPYPnoQfqs6Vm5KfXwYp/M71jXMHs7j2jsEIcR/mUyVCyGEEEIIIYQQQgghhBBCCCHanUxcCiGEEEIIIYQQQgghhBBCCCHanUxcCiGEEEIIIYQQQgghhBBCCCHanUxcCiGEEEIIIYQQQgghhBBCCCHanUxcCiGEEEIIIYQQQgghhBBCCCHanVl7ByCEEEIIIYQQQgghhBBCCCE6HoVCnn8T/11S44QQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7U4mLoUQQgghhBBCCCGEEEIIIYQQ7c7sRn1RbW0tGRkZ+Pn53aivFEIIIYQQQgghhBBCCCGEEO1EoVC0dwjif8wNm7i8ePEiPXv2pK6u7oZ836C+4Tz10Dh6dg3C092RSfPeY/22E61uM6R/BG+9NIPOoT5cycznzVVr+en3fTplFsy8nacWjMfd1Z7zMaksWvYdJ84mXFdskzp5MjPCB2dLJZcLy3j7ZAIX88v0lp0Y7MG4QDeCHawAiCko46OzyTrlb/Vx5p5OnkQ42eCgMmfKplNcLiq/rpimRXgyp6svLpZKYgvK+OfhBM7nleote1+YB3eGuNPJURtTdF4ZH5xI1in/+pBQJoZ66Gy3/0oB87deMCqe3D27ydm2lZqSYix9fPGZPBXrwECD5QtPniBz3d9U5+ehcnPHa+I92HftCoCmrpaMv/+i5MIFqvNyMbG0xDY8Au+J92Du4GBUPAD3BHkyPdQbJwsl8cXlvHcmgehC/cct0NaK+ZF+hDvY4GltwQdnE/ktPuOacq4WSh7pGsAAd0dUZiZcKavktRNxxBbp/96WZgzwZ/7QYFxtVcRklvDy3xc5e6Woze3Gdffiw2k92XYxiwU/6D8vXpvYlfv7+/Pq+ot8eyDJqHgAZvT1Y8GgQFxtVMRkl7J8YzRn04vb3G58F08+nBTFtphs5v9yqvHnoyLcub+PH1297HC0UnLHJweIztJfNw2Z2dOH+f38cbVREpNTxvJtlzibWaK37OhQVx4ZGIi/oyXmJiYkFVbw5bEU1l7IaiyTsniE3m1f3xXH50dTjIrpvk6ezAzX5oG4q3mgQP9xD7Kz4qFu/kQ42uBlY8G7pxL45ZJufbIyM2VhN3+G+zjjqDLnUmE5755KINrAd+
ozNcKTOV20eeBSYet54N5QD+4KcSfkah7IL2NFizzwzyGhTOx0bR5YsM24PABwb7An08O8cbZQEldUzrunDZ93QXY
"text/plain": [
"<Figure size 2500x2500 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"correlation_matrix = df_train.corr()\n",
"fig, ax = plt.subplots(figsize=(25, 25))\n",
"\n",
"ax = sns.heatmap(\n",
" correlation_matrix,\n",
" annot=True,\n",
" linewidths=0.5,\n",
" fmt=\".2f\",\n",
" cmap=\"YlGnBu\"\n",
")\n",
"\n",
"# Jupyter notebook specific\n",
"bottom_side, top_side = ax.get_ylim()\n",
"ax.set_ylim(bottom_side + 0.5, top_side - 0.5)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "c2b4a57c",
"metadata": {},
"source": [
"We can see that there is the highest positive correlation in **V14** atribute and the highest negative value in the attributes **V1, V27** So lets see the distribution of those values in comparrison to class."
]
},
{
"cell_type": "markdown",
"id": "f1918d5b",
"metadata": {},
"source": [
"**V14 vs V17**"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "8d4ce9a6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"<matplotlib.legend.Legend at 0x7fe49dcbc5b0>"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABNoAAANXCAYAAADjAjLCAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAADjfklEQVR4nOzdeVhUZf8G8PucYd8XRcAUcElFchdFc/lVCmqo1Ztmmlrma6aVLb5lZUhWaquVZablmpmlpaihZpobiokbobmElgqSICCyzzm/P8YZGWY7AwMDeH+uy0vnnGfOec4w9b7efZ/nK8iyLIOIiIiIiIiIiIiqRbT3BIiIiIiIiIiIiBoCBm1EREREREREREQ2wKCNiIiIiIiIiIjIBhi0ERERERERERER2QCDNiIiIiIiIiIiIhtg0EZERERERERERGQDDNqIiIiIiIiIiIhsgEEbERERERERERGRDTBoIyIiIiIiIiIisgEGbURERNRgjR8/HqGhoXa5tyAImDVrlk2vuXLlSrRt2xaOjo7w8fGx6bVt5fz58xAEAcuWLbP3VMxKTExEp06d4OLiAkEQkJuba+8p1Sp7/rNBRETUkDFoIyIiqgcEQVD0a9euXfaeqp79+/dj1qxZt12IURNOnTqF8ePHo2XLlli8eDG+/PJLu85n9erVmD9/vl3nUFXZ2dkYMWIEXF1d8dlnn2HlypVwd3c3GDd06FC4ubnh+vXrJq81evRoODk5ITs7GwDw3XffYcyYMWjdujUEQUD//v1r6jFqVFZWFhwcHDBmzBiTY65fvw5XV1c8+OCDAIBDhw5h6tSpaN++Pdzd3dG8eXOMGDECp0+fNnivuX+PDRgwoMaei4iIqKY52HsCREREZNnKlSv1Xq9YsQLbt283ON6uXbvanJZF+/fvR3x8PMaPH19nK7BqSlFRERwcbPd/tXbt2gVJkvDxxx+jVatWNrtuVa1evRqpqamYNm2a3vGQkBAUFRXB0dHRPhNT4NChQ7h+/Tpmz56N++67z+S40aNHIyEhAT/++CPGjh1rcL6wsBAbNmxATEwM/P39AQALFy7E4cOH0b17d134Vh8FBARgwIAB2LBhAwoLC+Hm5mYwZv369SguLtaFcfPmzcO+ffvw8MMPo0OHDsjMzMSCBQvQpUsXHDhwABEREbr3Vv53FwD8/vvv+PjjjzFw4MCaezAiIqIaxqCNiIioHqhcVXLgwAFs377dbLWJUrIso7i4GK6urtW+Ft3i4uJi0+tlZWUBQJ0PLAVBsPmz25rSz3Lo0KHw9PTE6tWrjQZtGzZswI0bNzB69GjdsZUrV6Jp06YQRVEvWKqPRo8ejcTERGzcuBGPPPKIwfnVq1fD29sbQ4YMAQC88MILWL16NZycnHRjRo4cibvuugtz587FqlWrdMeN/btr165dEAQBo0aNqoGnISIiqh1cOkpERNRALF26FPfccw8CAgLg7OyM8PBwLFy40GBcaGgo7r//fmzduhXdunWDq6srFi1aBAC4cOEChg4dCnd3dwQEBOD555/H1q1bjS5LPXjwIGJiYuDt7Q03Nzf069cP+/bt052fNWsWpk+fDgAICwvTLQs7f/680flPnToVHh4eKCwsNDg3atQoBAYGQq1WA9AEHEOGDEFwcDCcnZ3RsmVLzJ49W3feFO1f5Cs/i6l9xU6dOoX//Oc/8PPzg4uLC7p164aNGzeavYdW5T3aZs2aBUEQcPbsWV2Fn7e3Nx5//HGjz1xRaGgo4uLiAACNGzfWu7apveBCQ0Mxfvx43etly5ZBEATs27cPL7zwAho3bgx3d3c88MAD+Pfffw3e//PPP6Nfv37w9PSEl5cXunfvjtWrVwMA+vfvj82bN+PChQu6n6t2vy9Tn+Wvv/6KPn36wN3dHT4+Phg2bBhOnjypN6Y6n5HW999/j65du8LV1RWNGjXCmDFjcOnSJd35/v37Y9y4cQCA7t27QxAEvc+pIu
2yyB07dujCuYpWr14NT09PDB06VHesWbNmEMWq/V/s0tJSvPHGG+jatSu8vb3h7u6OPn36YOfOnXrjtJ/x+++/jy+//BItW7aEs7MzunfvjkOHDhlc96effkJERARcXFwQERGBH3/8UdF8HnjgAbi7u+t+7hVlZWVhx44d+M9//gNnZ2cAQK9evfRCNgBo3bo12rdvb/CzrqykpATr1q1Dv379cMcddyiaHxERUV3EijYiIqIGYuHChWjfvj2GDh0KBwcHJCQk4Omnn4YkSZgyZYre2D///BOjRo3CpEmTMHHiRLRp0wY3btzAPffcg4yMDDz33HMIDAzE6tWrDf6SD2hCk0GDBqFr166Ii4uDKIq6oG/Pnj2IjIzEgw8+iNOnT+Pbb7/FRx99hEaNGgHQBEXGjBw5Ep999hk2b96Mhx9+WHe8sLAQCQkJGD9+PFQqFQBNaOTh4YEXXngBHh4e+PXXX/HGG28gPz8f7733nk0+zz/++AO9e/dG06ZN8corr8Dd3R1r167F8OHDsW7dOjzwwANVuu6IESMQFhaGOXPmICUlBUuWLEFAQADmzZtn8j3z58/HihUr8OOPP2LhwoXw8PBAhw4dqnT/Z555Br6+voiLi8P58+cxf/58TJ06Fd99951uzLJly/DEE0+gffv2mDFjBnx8fHDkyBEkJibi0UcfxWuvvYa8vDxcvHgRH330EQDAw8PD5D1/+eUXDBo0CC1atMCsWbNQVFSETz/9FL1790ZKSorBpvxV+Yy083788cfRvXt3zJkzB1euXMHHH3+Mffv24ciRI/Dx8cFrr72GNm3a4Msvv8Sbb76JsLAwtGzZ0uQ1R48ejeXLl2Pt2rWYOnWq7nhOTg62bt2KUaNG2awaND8/H0uWLMGoUaMwceJEXL9+HV999RWio6ORnJyMTp066Y1fvXo1rl+/jkmTJkEQBLz77rt48MEH8ddff+mW7m7btg0PPfQQwsPDMWfOHGRnZ+Pxxx9XFGa5u7tj2LBh+OGHH5CTkwM/Pz/due+++w5qtVqvms8YWZZx5coVtG/f3uy4LVu2IDc31+L1iIiI6jyZiIiI6p0pU6bIlf9nvLCw0GBcdHS03KJFC71jISEhMgA5MTFR7/gHH3wgA5B/+ukn3bGioiK5bdu2MgB5586dsizLsiRJcuvWreXo6GhZkiS9+4eFhckDBgzQHXvvvfdkAHJ6errFZ5IkSW7atKn80EMP6R1fu3atDEDevXu32WedNGmS7ObmJhcXF+uOjRs3Tg4JCdG93rlzp96zaKWnp8sA5KVLl+qO3XvvvfJdd92ldz1JkuRevXrJrVu3tvg8AOS4uDjd67i4OBmA/MQTT+iNe+CBB2R/f3+L19O+/99//zV7H62QkBB53LhxutdLly6VAcj33Xef3s/t+eefl1UqlZybmyvLsizn5ubKnp6eco8ePeSioiK9a1Z835AhQ/Q+Wy1jn2WnTp3kgIAAOTs7W3fs2LFjsiiK8tixYw2esSqfUWlpqRwQECBHRETozXvTpk0yAPmNN94w+CwOHTpk9pqyLMvl5eVyUFCQHBUVpXf8iy++kAHIW7duNfne9u3by/369bN4j4r3Kikp0Tt27do1uUmTJnqfifYz9vf3l3NycnTHN2zYIAOQExISdMc6deokBwUF6X6+sizL27ZtkwEY/flVtnnzZhmAvGjRIr3jPXv2lJs2bSqr1Wqz71+5cqUMQP7qq6/MjnvooYdkZ2dn+dq1axbnREREVJdx6SgREVEDUbGqJi8vD1evXkW/fv3w119/IS8vT29sWFgYoqOj9Y4lJiaiadOmesvgXFxcMHHiRL1xR48exZkzZ/Doo48iOzsbV69exdWrV3Hjxg3ce++92L17NyRJsnr+giDg4YcfxpYtW1BQUKA7/t1336Fp06a4++67jT7r9evXcfXqVfTp0weFhYU4deqU1feuLCcnB7/++itGjBihu/7Vq1eRnZ2N6OhonDlzRm
85ojWeeuopvdd9+vRBdnY28vPzqz1vJf773/9CEAS9+6vValy4cAEAsH37dly/fh2vvPKKwV5rFd+nVEZGBo4ePYr
"text/plain": [
"<Figure size 1500x1000 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"\n",
"plt.figure(figsize=(15, 10))\n",
"\n",
"# Scatter with 1 values of target class\n",
"plt.scatter(\n",
" df_train['V1'][df_train['Class'] == 1],\n",
" df_train['V27'][df_train['Class'] == 1],\n",
")\n",
"\n",
"# Scatter with 2 values of target class\n",
"plt.scatter(\n",
" df_train['V1'][df_train['Class'] == 2],\n",
" df_train['V27'][df_train['Class'] == 2],\n",
")\n",
"\n",
"plt.title('Target value in function of V1 and V27')\n",
"\n",
"plt.xlabel('V1')\n",
"plt.ylabel('V27')\n",
"plt.legend(['Biodegradable', 'Non-biodegradable'])\n"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "d50d1f44",
"metadata": {},
"outputs": [],
"source": [
"# Spliting the data into features and labels\n",
"X_train = df_train.drop('Class', axis=1)\n",
"y_train = df_train['Class']\n",
"X_test = df_test.drop('Class', axis=1)\n",
"y_test = df_test['Class']"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "f0aa7c9d",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Put models in a dictionary\n",
"models = {\n",
" \"Logistic Regression\": LogisticRegression(),\n",
" \"KNN\": KNeighborsClassifier(),\n",
" \"Random Forest\": RandomForestClassifier()\n",
"}\n",
"\n",
"# Create a function to fit and score models\n",
"def fit_and_score(models, X_train, X_test, y_train, y_test):\n",
" \"\"\"\n",
" Fits and evaluates given machine learning models.\n",
" models: dict of different Scikit-Learn machine learning models\n",
" X_train: training data (no labels)\n",
" x_test: testing data (no labels)\n",
" y_train: training labels\n",
" y_test: trest labels\n",
" \"\"\"\n",
"\n",
" # Set random seed\n",
" np.random.seed(42)\n",
"\n",
" # Make a dictioanry to keep model scores\n",
" model_scores = {}\n",
"\n",
" # Loop through models\n",
" for name, model in models.items():\n",
" # Fit the model to the data\n",
" model.fit(X_train, y_train)\n",
" # Evaluate the model and append its score to model_scores\n",
" model_scores[name] = model.score(X_test, y_test)\n",
"\n",
" return model_scores"
]
},
{
"cell_type": "markdown",
"id": "10387356",
"metadata": {},
"source": [
"#### Check if there are any missing values"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "87e277e6",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"V4 25\n",
"V22 16\n",
"V27 8\n",
"V29 8\n",
"V37 25\n",
"dtype: int64"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"na_counts = df_train.isna().sum()\n",
"na_counts[na_counts > 0]\n"
]
},
{
"cell_type": "markdown",
"id": "cb57434a",
"metadata": {},
"source": [
"#### We can see that there are five atributes that have missing values. Lets inspect them."
]
},
{
"cell_type": "markdown",
"id": "9dbd2c02",
"metadata": {},
"source": [
"##### V4"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ca1e544a",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 821.000000\n",
"mean 0.030451\n",
"std 0.198281\n",
"min 0.000000\n",
"25% 0.000000\n",
"50% 0.000000\n",
"75% 0.000000\n",
"max 2.000000\n",
"Name: V4, dtype: float64"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V4'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "9e4d7d1d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0 800\n",
"1.0 17\n",
"2.0 4\n",
"Name: V4, dtype: int64"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V4'].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "3a3191c9",
"metadata": {},
"source": [
"We can see that the majority of entires in that particular atribute are zeros. So I think that it would be best if I set all the `Nan` values to zeros."
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "d8489bd4",
"metadata": {},
"outputs": [],
"source": [
"df_train['V4'].fillna(0, inplace=True)\n",
"df_test['V4'].fillna(0, inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "3e84e48b",
"metadata": {},
"source": [
"##### V22"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "a711431d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 830.000000\n",
"mean 1.243898\n",
"std 0.094109\n",
"min 0.898000\n",
"25% 1.187500\n",
"50% 1.248500\n",
"75% 1.298750\n",
"max 1.641000\n",
"Name: V22, dtype: float64"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V22'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "f0325325",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"1.299 9\n",
"1.280 9\n",
"1.296 8\n",
"1.254 8\n",
"1.264 8\n",
" ..\n",
"1.449 1\n",
"1.159 1\n",
"1.363 1\n",
"1.331 1\n",
"1.410 1\n",
"Name: V22, Length: 321, dtype: int64"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V22'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "25a74baf",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjIAAAHHCAYAAACle7JuAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABLi0lEQVR4nO3deVxU1f8/8NcAssgyyg6iiLiFiguW4YqKIhq5FW4lmGYWrlgWnywlLTTLLbf6ZriFW25lKi4ouFYq5FIumOYC4s5moDDn90c/JkeGbZzhzoXX8/G4j4dz751z33NnhBfnnntGIYQQICIiIpIhE6kLICIiItIVgwwRERHJFoMMERERyRaDDBEREckWgwwRERHJFoMMERERyRaDDBEREckWgwwRERHJFoMMERERyRaDTDU0ffp0KBSKSjlWQEAAAgIC1I8PHDgAhUKBH374oVKOHx4ejvr161fKsXSVk5ODUaNGwdXVFQqFAhMnTqxwG0Xv6Z07d/RfIEGhUGD69Onqx3I63ytWrIBCocCVK1cMfqzw8HDY2NgY/DiG9vT7TcaNQUbmin5IFS2WlpZwd3dHUFAQFi5ciOzsbL0cJy0tDdOnT0dKSope2tMnY66tPD777DOsWLECb7/9NlavXo3XX3+91H23bt1aecU94eWXX0bNmjVL/UwNGzYM5ubmuHv3Lu7evYs5c+agc+fOcHJyQq1atfDiiy9i/fr1xZ7322+/YezYsWjWrBmsra1Rr149hIaG4sKFC3p9DUeOHMH06dPx4MEDvbarD8Zc25MePnyI6dOn48CBA5LVsGPHjmobNOLi4jB//nypyzAugmQtNjZWABCffPKJWL16tfjuu+/EZ599Jnr27CkUCoXw9PQUv//+u8ZzHj9+LP75558KHee3334TAERsbGyFnpefny/y8/PVj/fv3y8AiI0bN1aoHV1re/TokcjLy9PbsQyhXbt2okOHDuXa19raWoSFhRVbP23aNAFA3L59W8/V/WfdunUCgFi5cqXW7bm5ucLa2lqEhIQIIYT46aefRI0aNUTfvn3F/PnzxaJFi0TXrl0FAPHxxx9rPHfgwIHC1dVVjBs3Tvzf//2fmDFjhnBxcRHW1tbi9OnTensNc+bMEQDE5cuXK/S8f/75Rzx+/Fj92BDnW9faylJQUCD++ecfoVKp9NLe7du3BQAxbdq0YtvCwsKEtbW1Xo5TmoiICGHIX19Pv9/GpE+fPsLT01PqMoyKmTTxifQtODgYbdu2VT+OiopCQkICXnrpJbz88sv4888/YWVlBQAwMzODmZlh3/qHDx+iZs2aMDc3N+hxylKjRg1Jj18et27dgo+Pj9RllOnll1+Gra0t4uLiMHz48GLbt23bhtzcXAwbNgwA0KxZM1y8eBGenp7qfd555x0EBgZi9uzZmDJlCqytrQEAkZGRiIuL0/i8DBo0CC1atMCsWbOwZs0aA7+64lQqFR49egRLS0tYWlpW+vH1xdTUFKamplKXIZmCggKoVKoK/SyS8/tdLUmdpOjZFPXI/Pbbb1q3f/bZZwKA+Oabb9Triv6afNLu3btFhw4dhFKpFNbW1qJx48YiKipKCPFfL8rTS1EPSJcuXUSzZs3E8ePHRadOnYSVlZWYMGGCeluXLl3Uxylqa926dSIqKkq4uLiImjVripCQEHH16lWNmjw9PbX2PjzZZlm1hYWFFfvrJScnR0RGRgoPDw9hbm4uGjduLObMmVPsL1YAIiIiQmzZskU0a9ZMmJubCx8fH7Fz506t5/ppGRkZ4o033hDOzs7CwsJC+Pr6ihUrVhQ7F08vJf1Frm3fovNT9J5evHhRhIWFCaVSKezs7ER4eLjIzc0t1tbq1atFmzZthKWlpahdu7YYNGhQsfOvTVhYmDAzMxMZGRnFtr300kvC1tZWPHz4sNQ2Fi5cKACIU6dOlXm8Nm3aiDZt2pS53++//y7CwsKEl5eXsLCwEC4uLm
LEiBHizp076n2KzlFJ57vo/V6zZo3w8fERZmZmYsuWLeptT/ZAFLX1559/ildffVXY2toKe3t7MX78eI3ezsuXL5fYW/hkm2XVJoTu71nRz4gn2/L09BR9+vQRBw8eFM8//7ywsLAQXl5eJfa2Pf16nl6KXkdRj8z169dF3759hbW1tXB0dBSTJ08WBQUFGm0VFhaKefPmCR8fH2FhYSGcnZ3F6NGjxb1790qtISwsTGsNT9Y3Z84cMW/ePNGgQQNhYmIikpOTRX5+vvjoo49EmzZthJ2dnahZs6bo2LGjSEhIKHaMkt7v8v7/etqFCxfEgAEDhIuLi7CwsBB16tQRgwYNEg8ePNDYr6z3uEuXLsVeN3tn2CNT5b3++uv43//+h927d+PNN9/Uus/Zs2fx0ksvwdfXF5988gksLCyQmpqKw4cPAwCee+45fPLJJ/j4448xevRodOrUCQDQvn17dRt3795FcHAwBg8ejNdeew0uLi6l1vXpp59CoVDg/fffx61btzB//nwEBgYiJSVF3XNUHuWp7UlCCLz88svYv38/Ro4ciVatWiE+Ph7vvfcebty4gXnz5mnsf+jQIWzevBnvvPMObG1tsXDhQgwcOBBXr16Fg4NDiXX9888/CAgIQGpqKsaOHQsvLy9s3LgR4eHhePDgASZMmIDnnnsOq1evxqRJk+Dh4YHJkycDAJycnLS2uXr1aowaNQovvPACRo8eDQDw9vbW2Cc0NBReXl6IiYnByZMn8e2338LZ2RmzZ89W7/Ppp5/io48+QmhoKEaNGoXbt2/jq6++QufOnZGcnIxatWqV+LqGDRuGlStXYsOGDRg7dqx6/b179xAfH48hQ4aU+f7dvHkTAODo6FjqfkIIZGRkoFmzZqXuBwB79uzBX3/9hREjRsDV1RVnz57FN998g7Nnz+LYsWNQKBQYMGAALly4gLVr12LevHnq4z95vhMSEtSvzdHRscyB4qGhoahfvz5iYmJw7NgxLFy4EPfv38eqVavKrPlJZdX2LO9ZSVJTU/HKK69g5MiRCAsLw3fffYfw8HD4+fmVeM6dnJywdOlSvP322+jfvz8GDBgAAPD19VXvU1hYiKCgILRr1w5ffPEF9u7diy+//BLe3t54++231fu99dZbWLFiBUaMGIHx48fj8uXLWLRoEZKTk3H48OESe1PfeustpKWlYc+ePVi9erXWfWJjY5GXl4fRo0fDwsIC9vb2yMrKwrfffoshQ4bgzTffRHZ2NpYvX46goCD8+uuvaNWqVZnnrDz/v5726NEjBAUFIT8/H+PGjYOrqytu3LiB7du348GDB1AqlQDK9x5/+OGHyMzMxPXr19U/q6rC4OpnJnWSomdTVo+MEEIolUrRunVr9eOne2TmzZtX5vX+0sahFP2VsGzZMq3btPXI1KlTR2RlZanXb9iwQQAQCxYsUK8rT49MWbU93SOzdetWAUDMnDlTY79XXnlFKBQKkZqaql4HQJibm2us+/333wUA8dVXXxU71pPmz58vAIg1a9ao1z169Ej4+/sLGxsbjdde9NdxeZQ1RuaNN97QWN+/f3/h4OCgfnzlyhVhamoqPv30U439Tp8+LczMzIqtf1pBQYFwc3MT/v7+GuuXLVsmAIj4+PhSn3/37l3h7OwsOnXqVOp+Qvz71ykAsXz58jL31dYLtHbtWgFAJCUlqdeVNg4FgDAxMRFnz57Vuk3bX+gvv/yyxn7vvPOOAKAel1beHpnSanvW96ykHpmnz82tW7eEhYWFmDx5cqntlTVGBv9/zN6TWrduLfz8/NSPDx48KACI77//XmO/Xbt2aV3/tJLGyBSdbzs7O3Hr1i2NbQUFBRrj9YQQ4v79+8LFxaXY/5uS3u+y/n9pk5ycXOa4wIq8xxwjUxzvWqoGbGxsSr3TpOivuW3btkGlUul0DAsLC4wYMaLc+w8fPhy2trbqx6+88grc3NywY8cOnY5fXjt27ICpqSnGjx+vsX7y5M
kQQmDnzp0a6wMDAzV6PXx9fWFnZ4e//vqrzOO4urpiyJAh6nU1atTA+PHjkZOTg8TERD28muLGjBmj8bhTp064e/c
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_test['V22'], bins=20)\n",
"plt.xlabel('V22')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the V22 atribute in the train set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "6d6b63fd",
"metadata": {},
"source": [
"The distribution of the target variable **V22** is normal, so i could try to fill the missing values with `mean()`."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "2b2b6e2d",
"metadata": {},
"outputs": [],
"source": [
"df_train['V22'].fillna(df_train['V22'].mean(), inplace=True)\n",
"df_test['V22'].fillna(df_test['V22'].mean(), inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "4164f62c",
"metadata": {},
"source": [
"##### V27"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "9a8b64ac",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 838.000000\n",
"mean 2.218153\n",
"std 0.221545\n",
"min 1.000000\n",
"25% 2.107000\n",
"50% 2.251000\n",
"75% 2.359750\n",
"max 2.859000\n",
"Name: V27, dtype: float64"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V27'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "1bddfb76",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.000 36\n",
"2.236 31\n",
"2.194 24\n",
"1.848 22\n",
"2.175 21\n",
" ..\n",
"2.294 1\n",
"2.466 1\n",
"2.488 1\n",
"2.372 1\n",
"2.622 1\n",
"Name: V27, Length: 290, dtype: int64"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V27'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "f1787f2e",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjIAAAHHCAYAAACle7JuAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABPhElEQVR4nO3deViUVf8/8PeIzohsirI+IBK4oWJGhmgqKoJohmm55AKKaT5gqW3SoqIVmuVSKVZfA7UQl0TLEnIDl9TSJJdSwTQ1QcyFzUBlzu8Pf8zjyDaMM9xzD+/Xdc1Vc+4zZz5nzj3jhzPnPqMQQggQERERyVADqQMgIiIi0hcTGSIiIpItJjJEREQkW0xkiIiISLaYyBAREZFsMZEhIiIi2WIiQ0RERLLFRIaIiIhki4kMERERyRYTmXpozpw5UCgUdfJcgYGBCAwM1NxPT0+HQqHAxo0b6+T5IyIi0KpVqzp5Ln0VFRVh4sSJcHZ2hkKhwLRp02rdRvmY/vPPP4YPsJ47f/48FAoFEhMTNWURERGwtraWLqhaqOv3e8eOHevkuYylsvEm08ZERuYSExOhUCg0t8aNG8PV1RUhISH4+OOPUVhYaJDnuXz5MubMmYPMzEyDtGdIphybLt5//30kJiZiypQpWLNmDcaOHVtt3c2bN9ddcPd5+umn0aRJk2rPqdGjR0OpVOLatWu4du0aFi5ciF69esHBwQFNmzZFt27dsG7dugqPi4iI0DqPH7z9/fffBunDDz/8gDlz5hikLUMz5djuZwrvt6SkJCxZskSy55fS8uXLmWQ9SJCsJSQkCABi7ty5Ys2aNeLLL78U77//vggODhYKhUJ4eHiI3377Tesxd+7cEf/++2+tnueXX34RAERCQkKtHldaWipKS0s193fv3i0AiA0bNtSqHX1ju337tigpKTHYcxmDv7+/6NGjh051raysRHh4eIXy2bNnCwDi6tWrBo7uf5KTkwUAsWrVqkqPFxcXCysrKzF48GAhhBDfffedaNSokQgLCxNLliwRn376qejTp48AIGbNmqX12J9++kmsWbNG67Z69WrRpEkT4ePjY7A+REVFidp+7KnVavHvv/+Ku3fvasrCw8OFlZWVweLSNzZd6PN+r05177fevXuLDh06GOy5qjJo0CDh4eFhlLYrG29T0qFDB9G7d2+pwzApDaVKoMiwQkND8fjjj2vux8TEYNeuXXjqqafw9NNP448//oClpSUAoGHDhmjY0LhDf+vWLTRp0gRKpdKoz1OTRo0aSfr8usjLy4OPj4/UYdTo6aefho2NDZKSkjBu3LgKx7ds2YLi4mKMHj0aANChQwdkZWXBw8NDU+e///0vgoKCsGDBArz++uuwsrICAAQEBCAgIECrvX379uHWrVua9ura3bt3oVaroVQq0bhxY0liMIS6eL+bspKSEiiVSjRooNsXEOUz2yQjUmdS9HDKZ2R++eWXSo+///77AoD4/PPPNWXlf73f78cffxQ9evQQdnZ2wsrKSrRp00bExMQIIf43i/LgrfwvsvK/wg4fPix69uwpLC0txcsvv6w5dv9fD+VtJScni5iYGOHk5CSaNGkiBg8eLC5cuKAVk4eHR6WzD/e3WVNs4eHhFf5yKyoqEjNmzBBubm5CqVSKNm3aiIULFwq1Wq1VD4CIiooSKSkpokOHDkKpVAofHx+xbdu2Sl/rB125ckVMmDBBODo6CpVKJXx9fUViYmKF1+LB27lz5yptr7K65a9P+ZhmZWWJ8PBwYWdnJ2xtbUVERIQoLi6u0NaaNWvEY489Jho3biyaNWsmRowYUeH1r0x4eLho2LChuHLlSoVjTz31lLCxsRG3bt2qto2PP/5YABDHjh2rtt6UKVOEQqGo8vW43549e8Szzz4r3N3dhVKpFG5ubmLatGlasYSHh1f6GgohxLlz5wQAsXDhQrF48WLxyCOPiAYNGoijR49qjt0/A1E+I3P27F
kRHBwsmjRpIlxcXERsbKzWeVQ+xrt379aK98E2q4tNCCHKysrE4sWLhY+Pj1CpVMLR0VFMmjRJXL9+vcbXprL3u77ntq6fBSdPnhSBgYHC0tJSuLq6igULFlRoq6SkRMyaNUt4eXlpxuy1116rcQa1d+/eFZ6//D1eHt/atWvFW2+9JVxdXYVCoRA3btwQ165dE6+88oro2LGjsLKyEjY2NmLAgAEiMzNTq/3qxvvSpUsiLCxMWFlZiRYtWohXXnlFp5mbX375RQQHB4vmzZuLxo0bi1atWonx48dr1dFljD08PCr0nbMznJExe2PHjsWbb76JH3/8ES+88EKldU6ePImnnnoKvr6+mDt3LlQqFbKzs7F//34AQPv27TF37lzMmjULkyZNQs+ePQEA3bt317Rx7do1hIaGYuTIkRgzZgycnJyqjeu9996DQqHAG2+8gby8PCxZsgRBQUHIzMzUzBzpQpfY7ieEwNNPP43du3cjMjISjz76KNLS0vDaa6/h77//xuLFi7Xq79u3D5s2bcJ///tf2NjY4OOPP8awYcNw4cIFNG/evMq4/v33XwQGBiI7OxvR0dHw9PTEhg0bEBERgZs3b+Lll19G+/btsWbNGkyfPh1ubm545ZVXAAAODg6VtrlmzRpMnDgRTzzxBCZNmgQA8PLy0qozfPhweHp6Ii4uDr/++iv+7//+D46OjliwYIGmznvvvYd33nkHw4cPx8SJE3H16lV88skn6NWrF44ePYqmTZtW2a/Ro0dj1apVWL9+PaKjozXl169fR1paGkaNGlXj+OXm5gIAWrRoUWWdO3fuYP369ejevbtOi7U3bNiAW7duYcqUKWjevDl+/vlnfPLJJ7h06RI2bNgAAJg8eTIuX76M7du3Y82aNZW2k5CQgJKSEkyaNAkqlQr29vZQq9WV1i0rK8OAAQPQrVs3fPDBB0hNTcXs2bNx9+5dzJ07t8aY71dTbJMnT0ZiYiLGjx+Pl156CefOncOnn36Ko0ePYv/+/XrNPOpzbuvyfrtx4wYGDBiAoUOHYvjw4di4cSPeeOMNdOrUCaGhoQAAtVqNp59+Gvv27cOkSZPQvn17HD9+HIsXL8aZM2eqXQf21ltvIT8/H5cuXdK8Xx9ceD1v3jwolUq8+uqrKC0thVKpxO+//47Nmzfjueeeg6enJ65cuYLPPvsMvXv3xu+//w5XV9dqX6+ysjKEhITA398fH374IXbs2IGPPvoIXl5emDJlSpWPy8vLQ3BwMBwcHDBz5kw0bdoU58+fx6ZNm7Tq6TLGS5YswdSpU2FtbY233noLAGr8rK0XpM6k6OHUNCMjhBB2dnaiS5cumvsP/oW2ePHiGtdX1PS9OACxYsWKSo9VNiPzn//8RxQUFGjK169fLwCIpUuXasp0mZGpKbYHZ2Q2b94sAIh3331Xq96zzz4rFAqFyM7O1pQBEEqlUqvst99+EwDEJ598UuG57rdkyRIBQHz11Veastu3b4uAgABhbW2t1XcPDw8xaNCgatsrV9MamQkTJmiVP/PMM6J58+aa++fPnxcWFhbivffe06p3/Phx0bBhwwrlD7p7965wcXERAQEBWuUrVqwQAERaWlq1j7927ZpwdHQUPXv2rLbed999JwCI5cuXV1uvXGWzQHFxcUKhUIi//vpLU1bVOpTyv8JtbW1FXl5epcce/AsdgJg6daqmTK1Wi0GDBgmlUql5L+k6I1NdbHv37hUAxNdff61VnpqaWmn5g6qakdH33Nbls2D16tWastLSUuHs7CyGDRumKVuzZo1o0KCB2Lt3r9bjy8+j/fv3VxtDVWtkyl/vRx55pMI5UVJSIsrKyrTKzp07J1QqlZg7d65WWVXjfX89IYTo0qWL8PPzqzbWlJSUGj+jazPGXCNTEa9aqgesra2rvdKk/C/wLVu2VPnXZ01UKhXGjx+vc/1x48bBxsZGc//ZZ5+Fi4sLfvjhB72eX1c//PADLCws8NJLL2mVv/
LKKxBCYNu2bVrlQUFBWrMevr6+sLW1xZ9//lnj8zg7O2PUqFGaskaNGuGll15CUVERMjIyDNCbil588UWt+z179sS
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_test['V27'], bins=20)\n",
"plt.xlabel('V27')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the V27 atribute in the train set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"id": "53b79865",
"metadata": {},
"source": [
"The distribution of the target variable **V27** is normal, so i could try to fill the missing values with `mean()`."
]
},
{
"cell_type": "code",
"execution_count": 26,
"id": "8974127e",
"metadata": {},
"outputs": [],
"source": [
"# Set the nan values to the mean of the column\n",
"df_train['V27'].fillna(df_train['V27'].mean(), inplace=True)\n",
"df_test['V27'].fillna(df_test['V27'].mean(), inplace=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "3afb5a2f",
"metadata": {},
"source": [
"##### V29"
]
},
{
"cell_type": "code",
"execution_count": 27,
"id": "f410439d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 838.00000\n",
"mean 0.02506\n",
"std 0.15640\n",
"min 0.00000\n",
"25% 0.00000\n",
"50% 0.00000\n",
"75% 0.00000\n",
"max 1.00000\n",
"Name: V29, dtype: float64"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V29'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "2d33e7c4",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.0 817\n",
"1.0 21\n",
"Name: V29, dtype: int64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V29'].value_counts()"
]
},
{
"cell_type": "markdown",
"id": "515e9e80",
"metadata": {},
"source": [
"We can see that the majority of entires in that particular atribute are zeros. So I think that it would be best if I set all the `Nan` values to zeros."
]
},
{
"cell_type": "code",
"execution_count": 29,
"id": "48e8ba49",
"metadata": {},
"outputs": [],
"source": [
"# Set nan values to 0\n",
"df_train['V29'].fillna(0, inplace=True)\n",
"df_test['V29'].fillna(0, inplace=True)"
]
},
{
"cell_type": "markdown",
"id": "f659f8bc",
"metadata": {},
"source": [
"##### V37"
]
},
{
"cell_type": "code",
"execution_count": 30,
"id": "8515f06b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"count 821.000000\n",
"mean 2.549406\n",
"std 0.625021\n",
"min 1.467000\n",
"25% 2.101000\n",
"50% 2.461000\n",
"75% 2.861000\n",
"max 5.750000\n",
"Name: V37, dtype: float64"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V37'].describe()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"id": "36bc89b5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.167 9\n",
"2.500 9\n",
"2.833 8\n",
"2.667 8\n",
"1.833 7\n",
" ..\n",
"2.029 1\n",
"1.886 1\n",
"2.089 1\n",
"2.197 1\n",
"2.206 1\n",
"Name: V37, Length: 535, dtype: int64"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_train['V37'].value_counts()"
]
},
{
"cell_type": "code",
"execution_count": 32,
"id": "02c38a9f",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjMAAAHHCAYAAABKudlQAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABNjUlEQVR4nO3deVhUZf8G8HsUZ0SWUZQ1FhEXRMWMTFFT3FA0xNQss8Qt01BTtIx6K5cMrV+KlqKWgVqkYqJlEbmiphaY5Pa6YBqaLOYCgjEo8/z+6GVyZB9nOHPg/lzXuS7nnDPPfOeccbjnmec8oxBCCBARERHJVD2pCyAiIiJ6GAwzREREJGsMM0RERCRrDDNEREQkawwzREREJGsMM0RERCRrDDNEREQkawwzREREJGsMM0RERCRrDDN10Ny5c6FQKGrksQICAhAQEKC7vW/fPigUCmzZsqVGHn/s2LFo3rx5jTyWofLz8zFx4kQ4OTlBoVBgxowZ1W6j5Jz+9ddfxi+wjrt06RIUCgViY2N168aOHQtra2vpiqqGmv7/3r59+xp5LFMp63yT+WOYkbnY2FgoFArd0rBhQ7i4uGDAgAFYvnw5bt++bZTHuXr1KubOnYu0tDSjtGdM5lxbVbz//vuIjY3FlClTsGHDBrz44osV7rtt27aaK+4+Q4YMQaNGjSp8TY0ePRpKpRLXr18HAMycOROPPfYY7Ozs0KhRI7Rt2xZz585Ffn6+3v3Gjh2r9zp+cPnzzz+N8hy+//57zJ071yhtGZs513Y/c/j/FhcXh6ioKMkeX0orV65k0CqLIFmLiYkRAMT8+fPFhg0bxOeffy7ef/99ERgYKBQKhfDw8BC//fab3n3u3r0r/v7772o9TkpKigAgYmJiqnU/jUYjNBqN7vbevXsFABEfH1+tdgytraioSBQWFhrtsUyhS5cuonv37lXa18rKSoSGhpZa/+677woA4tq1a0au7l8bN24UAMS6devK3F5QUCCsrKxEcHCwbl337t3F9OnTxfLly8WaNWvElClThEqlEt27dxfFxcW6/Q4dOiQ2bNigt6xfv140atRI+Pj4GO05hIWFieq+7Wm1WvH333+Le/fu6daFhoYKKysro9VlaG1VYcj/94pU9P+tV69eol27dkZ7rPIMHjxYeHh4mKTtss63OWnXrp3o1auX1GWYHQvJUhQZVVBQEB5//HHd7YiICOzZswdPPfUUhgwZgv/+97+wtLQEAFhYWMDCwrSn/s6dO2jUqBGUSqVJH6cyDRo0kPTxqyInJwc+Pj5Sl1GpIUOGwMbGBnFxcRgzZkyp7du3b0dBQQFGjx6tW3fw4MFS+3l5eWH27Nn45Zdf0LVrVwCAv78//P399fY7ePAg7ty5o9deTbp37x60Wi2USiUaNmwoSQ3GUBP/381ZYWEhlEol6tWr2hcRJT3cJDNSpyl6OCU9MykpKWVuf//99wUAsWbNGt26kk/x9/vxxx9F9+7dhVqtFlZWVqJ169YiIiJCCPFvb8qDS8kns5JPY6mpqeLJJ58UlpaW4tVXX9Vtu/9TRElbGzduFBEREcLR0VE0atRIBAcHi4yMDL2aPDw8yuyFuL/NymoLDQ0t9QkuPz9fhIeHC1dXV6FUKkXr1q3Fhx9+KLRard5+AERYWJhISEgQ7dq1E0qlUvj4+IjExMQyj/WDsrOzxfjx44WDg4NQqVTC19dXxMbGljoWDy4XL14ss72y9i05PiXn9Pz58yI0NFSo1Wpha2srxo4dKwoKCkq1tWHDBvHYY4+Jhg0biiZNmohnn3221PEvS2hoqLCwsBDZ2dmltj311FPCxsZG3Llzp8I2tmzZIgBUehynTJkiFApFucfjfvv37xcjRowQbm5uQqlUCldXVzFjxgy9WkJDQ8s8hkIIcfHiRQFAfPjhh2Lp0qWiRYsWol69euLYsWO6bff3RJT0zFy4cEEEBgaKRo0aCWdnZzFv3jy911
HJOd67d69evQ+2WVFtQghRXFwsli5dKnx8fIRKpRIODg5i0qRJ4saNG5Uem7L+vxv62q7qe8GpU6dEQECAsLS0FC4uLmLx4sWl2iosLBTvvPOO8PLy0p2z1157rdKe1F69epV6/JL/4yX1ffXVV+Ktt94SLi4uQqFQiJs3b4rr16+LWbNmifbt2wsrKythY2MjBg4cKNLS0vTar+h8X7lyRYSEhAgrKyvRrFkzMWvWrCr14KSkpIjAwEDRtGlT0bBhQ9G8eXMxbtw4vX2qco49PDxKPXf20vyj7sb1OuLFF1/Em2++iR9//BEvvfRSmfucOnUKTz31FHx9fTF//nyoVCqkp6fjp59+AgC0bdsW8+fPxzvvvINJkybhySefBAB069ZN18b169cRFBSE5557Di+88AIcHR0rrGvhwoVQKBSYM2cOcnJyEBUVhX79+iEtLU3Xg1QVVantfkIIDBkyBHv37sWECRPw6KOPIikpCa+99hr+/PNPLF26VG//gwcPYuvWrXjllVdgY2OD5cuXY/jw4cjIyEDTpk3Lrevvv/9GQEAA0tPTMXXqVHh6eiI+Ph5jx47FrVu38Oqrr6Jt27bYsGEDZs6cCVdXV8yaNQsAYG9vX2abGzZswMSJE/HEE09g0qRJAP7p5bjfyJEj4enpicjISPz666/47LPP4ODggMWLF+v2WbhwId5++22MHDkSEydOxLVr1/Dxxx+jZ8+eOHbsGBo3blzu8xo9ejTWrVuHzZs3Y+rUqbr1N27cQFJSEkaNGlXq/N27dw+3bt1CUVERTp48if/85z+wsbHBE088Ue7j3L17F5s3b0a3bt2qNIA7Pj4ed+7cwZQpU9C0aVP88ssv+Pjjj3HlyhXEx8cDAF5++WVcvXoVO3fuxIYNG8psJyYmBoWFhZg0aRJUKhXs7Oyg1WrL3Le4uBgDBw5E165d8cEHH+CHH37Au+++i3v37mH+/PmV1ny/ymp7+eWXERsbi3HjxmH69Om4ePEiPvnkExw7dgw//fSTQT2Qhry2q/L/7ebNmxg4cCCGDRuGkSNHYsuWLZgzZw46dOiAoKAgAIBWq8WQIUNw8OBBTJo0CW3btsWJEyewdOlSnDt3rsJxYW+99RZyc3Nx5coV3f/XBwdjL1iwAEqlErNnz4ZGo4FSqcTp06exbds2PPPMM/D09ER2djZWr16NXr164fTp03BxcanweBUXF2PAgAHo0qUL/u///g+7du3CRx99BC8vL0yZMqXc++Xk5CAwMBD29vZ444030LhxY1y6dAlbt27V268q5zgqKgrTpk2DtbU13nrrLQCo9L22zpA6TdHDqaxnRggh1Gq16NSpk+72g5/Uli5dWul4i8q+JwcgVq1aVea2snpmHnnkEZGXl6dbv3nzZgFALFu2TLeuKj0zldX2YM/Mtm3bBADx3nvv6e03YsQIoVAoRHp6um4dAKFUKvXW/fbbbwKA+Pjjj0s91v2ioqIEAPHFF1/o1hUVFQl/f39hbW2t99w9PDzE4MGDK2yvRGVjZsaPH6+3/umnnxZNmzbV3b506ZKoX7++WLhwod5+J06cEBYWFqXWP+jevXvC2dlZ+Pv7661ftWqVACCSkpJK3efw4cN6nyTbtGlTqqfiQd9++60AIFauXFnhfiXK6g2KjIwUCoVC/PHHH7p15Y1LKfk0bmtrK3Jycsrc9uAndQBi2rRpunVarVYMHjxYKJVK3f+lqvbMVFTbgQMHBADx5Zdf6q3/4Ycfylz/oPJ6Zgx9bVflvWD9+vW6dRqNRjg5OYnhw4fr1m3YsEHUq1dPHDhwQO/+Ja+jn376qcIayhszU3K8W7RoUeo1UVhYqDdOS4h/zoNKpRLz58/XW1fe+b5/PyGE6NSpk/Dz86uw1oSEhErfo6tzjjlmpmy8mqkOsLa2rvAKlJJP4tu3by/3U2hlVCoVxo0bV+X9x4wZAxsbG93tESNGwNnZGd9//71Bj19V33//PerXr4/p06
frrZ81axaEEEhMTNRb369fP73eD19fX9ja2uL333+v9HGcnJwwatQo3boGDRpg+vTpyM/PR3JyshGeTWmTJ0/Wu/3
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"_, _, bars = plt.hist(df_test['V37'], bins=20)\n",
"plt.xlabel('V37')\n",
"plt.ylabel('Frequency')\n",
"plt.title('Distribution of the V37 atribute in the train set')\n",
"plt.bar_label(bars, fmt='%1.0f')\n",
"plt.show()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "15f862dd",
"metadata": {},
"source": [
"The distribution of the target variable **V37** is normal, so i could try to fill the missing values with `mean()`."
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "e1058d9a",
"metadata": {},
"outputs": [],
"source": [
"df_train['V37'].fillna(df_train['V37'].mean(), inplace=True)\n",
"df_test['V37'].fillna(df_test['V37'].mean(), inplace=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "44ca71d0",
"metadata": {},
"source": [
"### 2.2 Modeling\n",
"Besides the baselines (majority classifier, random classifier), use at least three machine learning algorithms\n",
"to model the target class. Be ready to argue why did you select specific algorithms and how did you find\n",
"the best hyperparameters for them. Consider the following points when creating your models:\n",
"- Create your models using all features and subsets of them using various feature selection techniques.\n",
"- Certain models assume that data follows a particular distribution or may work better with other\n",
"types of variables (e.g., categorical instead of numeric). Explore whether you can come up with feature\n",
"transformations that are more appropriate for your models. Try to construct new features from existing\n",
"ones. Try to explain the results and performance of different models."
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "42e83cd5",
"metadata": {},
"outputs": [],
"source": [
"# Spliting the data into features and labels\n",
"X_train = df_train.drop('Class', axis=1).reset_index(drop=True)\n",
"y_train = df_train['Class'].reset_index(drop=True)\n",
"X_test = df_test.drop('Class', axis=1).reset_index(drop=True)\n",
"y_test = df_test['Class'].reset_index(drop=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "5779375e",
"metadata": {},
"source": [
"#### Lets firstly write a simple function that will score all our generated models"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "3d716f7b",
"metadata": {},
"outputs": [],
"source": [
"from sklearn.metrics import precision_score\n",
"from sklearn.metrics import recall_score\n",
"from sklearn.metrics import f1_score\n",
"from sklearn.metrics import roc_auc_score\n",
"from sklearn.metrics import RocCurveDisplay\n",
"from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
"\n",
"def score_the_model(model, model_name, random_seed, X_train, X_test, y_train, y_test, plot=False):\n",
" \"\"\"\n",
" Fits and evaluates given machine learning models.\n",
" models: dict of different Scikit-Learn machine learning models\n",
" X_train: training data (no labels)\n",
" x_test: testing data (no labels)\n",
" y_train: training labels\n",
" y_test: trest labels\n",
" \"\"\"\n",
"\n",
" # Set random seed\n",
" np.random.seed(random_seed)\n",
"\n",
" # Fit the model to the data\n",
" model.fit(X_train, y_train)\n",
"\n",
" model_score = model.score(X_test, y_test) # Mean accuracy of ``self.predict(X)`` wrt. `y`.\n",
" # Predict the labels\n",
" y_pred = model.predict(X_test)\n",
"\n",
" # Compute scores\n",
" f1 = f1_score(y_test, y_pred)\n",
" precision = precision_score(y_test, y_pred)\n",
" recall = recall_score(y_test, y_pred)\n",
" auc = roc_auc_score(y_test, y_pred)\n",
"\n",
" # Plot scores\n",
" scores = {\n",
" 'Accuracy': model_score,\n",
" 'F1': f1,\n",
" 'Precision': precision,\n",
" 'Recall': recall,\n",
" 'AUC': auc\n",
" }\n",
" if plot:\n",
" # Plot scores\n",
" fig, ax = plt.subplots(nrows=2, ncols=2, figsize=(15,15))\n",
"\n",
" # Plot the bar chart in the first subplot\n",
" ax[0, 0].bar(scores.keys(), scores.values())\n",
" # Display values of the bars\n",
" for i, v in enumerate(scores.values()):\n",
" ax[0, 0].text(i-0.1, v+0.01, str(round(v, 2)))\n",
" ax[0, 0].set_title(f'Model performance for {model_name}')\n",
" ax[0, 0].set_ylabel('Score')\n",
" \n",
" # Plot the ROC curve in the second subplot\n",
" f = RocCurveDisplay.from_estimator(model, X_test, y_test)\n",
" f.plot(ax=ax[0, 1])\n",
" \n",
" # Plot the confusion matrix in the third subplot\n",
" cm = confusion_matrix(y_test, y_pred, labels=model.classes_)\n",
" cm_plt = ConfusionMatrixDisplay(cm, display_labels=model.classes_)\n",
" cm_plt.plot(ax=ax[1, 0])\n",
" return scores, model"
]
},
{
"cell_type": "markdown",
"id": "d144deb1",
"metadata": {},
"source": [
"### Decision tree model"
]
},
{
"cell_type": "code",
"execution_count": 61,
"id": "63fe4438",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABNwAAATFCAYAAAB7FctDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeVxU9f7H8fewg7Ipi4oIapqZpgZq7i2Wt7qWrablVtlys0zq3rREK0srb1y6aVqW2ebN23r7pVlGqVmmglppKW6IGwgiu2wz8/sDODmCpThwWF7Px4NHzfecM/M5A+b05vv9fix2u90uAAAAAAAAAE7hYnYBAAAAAAAAQGNC4AYAAAAAAAA4EYEbAAAAAAAA4EQEbgAAAAAAAIATEbgBAAAAAAAATkTgBgAAAAAAADgRgRsAAAAAAADgRARuAAAAAAAAgBMRuAEAAAAAAABOROAGnCWLxaInn3zyrK9LSUmRxWLRkiVLnF7TuXrnnXfUpUsXubu7KyAgwOxyGryVK1eqZ8+e8vLyksViUXZ2ttkl1Zqa/lxfeumluvTSS2ulJgAAAAAwG4EbGqQlS5bIYrHIYrFo3bp1VY7b7XaFh4fLYrHor3/9qwkVNhw7duzQ+PHj1bFjRy1atEivvfaa2SU1aMeOHdOtt94qb29vzZ8/X++8846aNWtWa6938p8Fi8UiLy8vtWnTRsOGDdO///1v5eXl1dprNzSRkZEO79XpvupjKA4AAACgYXEzuwDgXHh5eWnp0qUaOHCgw/iaNWt08OBBeXp6mlRZw7F69WrZbDa99NJLOu+888wup8HbtGmT8vLyNGvWLA0dOrTOXvfpp59W+/btVVpaqrS0NK1evVoPP/yw4uLi9Nlnn+miiy6qldeNiIjQiRMn5O7uflbXffXVV7VSzx+Jj49Xfn6+8XjFihX6z3/+o3/9618KCgoyxvv371/ntQEAAABoXAjc0KBdc801+uCDD/Tvf/9bbm6//zgvXbpUUVFRyszMNLG6+q2goEDNmjXT0aNHJcmpS0kLCwvl4+PjtOdrSGrj/az8Xv2Rq6++WtHR0cbjadOm6ZtvvtFf//pXXXfddfrtt9/k7e3ttJoqVc6qO1seHh5Or+XPjBgxwuFxWlqa/vOf/2jEiBGKjIw87XVn8v4DAAAAwMlYUooGbdSoUTp27JhWrVpljJWUlOjDDz/U6NGjq72moKBAjzzyiMLDw+Xp6anzzz9f//znP2W32x3OKy4u1pQpUxQcHCxfX19dd911OnjwYLXPeejQId15550KDQ2Vp6enLrzwQi1evLhG91S5RHDt2rW699571bJlS/n5+Wns2LE6fvx4lfO/+OILDRo0SM2aNZOvr6+uvfZabd++3eGc8ePHq3nz5tqzZ4+uueYa+fr66vbbb1dkZKRmzpwpSQoODq6yP90rr7yiCy+8UJ6enmrTpo0eeOCBKvuRXXrpperWrZuSkpI0ePBg+fj46PHHHzf29vrnP/+p+fPnq0OHDvLx8dFVV12lAwcOyG63a9asWWrbtq28vb11/fXXKysry+G5//e//+naa69VmzZt5OnpqY4dO2rWrFmyWq3V1vDrr7/qsssuk4+Pj8LCwvTCCy9Ueb+Kior05JNPqnPnzvLy8lLr1q114403as+ePcY5NptN8fHxuvDCC+Xl5aXQ0FDde++91b7/p9Yxbtw4SVLv3r1lsVg0fvx44/gHH3ygqKgoeXt7KygoSHfccYcOHTp0Rt+rmrj88ssVGxur/fv3691333U4tmPHDt18881q0aKFvLy8FB0drc8++6zKc2RnZ2vKlCmKjIyUp6en2rZtq7FjxxphdnV7uKWlpWnChAlq27atPD091bp1a11//fVKSUlxeK9O3cPt6NGjuuuuuxQaGiovLy/16NFDb731lsM5J/9cvfbaa+rYsaM8PT3Vu3dvbdq0qUbv08n+6P0/m5+LM/lzCQAAAKDxYoYbGr
TIyEj169dP//nPf3T11VdLKv8f3ZycHN12223697//7XC+3W7Xddddp2+//VZ33XWXevbsqS+//FJ///vfdejQIf3rX/8yzr377rv17rvvavTo0erfv7+++eYbXXvttVVqSE9P1yWXXCKLxaJJkyYpODhYX3zxhe666y7l5ubq4YcfrtG9TZo0SQEBAXryySe1c+dOLViwQPv379fq1atlsVgklTc7GDdunIYNG6bnn39ehYWFWrBggQYOHKgtW7Y4zNopKyvTsGHDNHDgQP3zn/+Uj4+Pxo8fr7fffluffPKJFixYoObNmxtLD5988kk99dRTGjp0qO6//36jhk2bNun77793WEJ47NgxXX311brtttt0xx13KDQ01Dj23nvvqaSkRA8++KCysrL0wgsv6NZbb9Xll1+u1atX67HHHtPu3bv18ssv69FHH3UIKpcsWaLmzZsrJiZGzZs31zfffKMZM2YoNzdXc+fOdXi/jh8/rr/85S+68cYbdeutt+rDDz/UY489pu7duxs/G1arVX/961+VkJCg2267TZMnT1ZeXp5WrVqlbdu2qWPHjpKke++9V0uWLNGECRP00EMPad++fZo3b562bNlS5d5P9sQTT+j888/Xa6+9ZizxrHzOyufr3bu35syZo/T0dL300kv6/vvvtWXLFocZcdV9r2pqzJgxevzxx/XVV19p4sSJkqTt27drwIABCgsL09SpU9WsWTP997//1YgRI/TRRx/phhtukCTl5+dr0KBB+u2333TnnXfq4osvVmZmpj777DMdPHjQYRnmyW666SZt375dDz74oCIjI3X06FGtWrVKqampp51JduLECV166aXavXu3Jk2apPbt2+uDDz7Q+PHjlZ2drcmTJzucv3TpUuXl5enee++VxWLRCy+8oBtvvFF79+496+Wtpzrd+3+mPxdn8+cSAAAAQCNlBxqgN9980y7JvmnTJvu8efPsvr6+9sLCQrvdbrffcsst9ssuu8xut9vtERER9muvvda47tNPP7VLsj/zzDMOz3fzzTfbLRaLfffu3Xa73W7funWrXZL9b3/7m8N5o0ePtkuyz5w50xi766677K1bt7ZnZmY6nHvbbbfZ/f39jbr27dtnl2R/8803z+jeoqKi7CUlJcb4Cy+8YJdk/9///me32+32vLw8e0BAgH3ixIkO16elpdn9/f0dxseNG2eXZJ86dWqV15s5c6Zdkj0jI8MYO3r0qN3Dw8N+1VVX2a1WqzE+b948uyT74sWLjbEhQ4bYJdkXLlzo8LyV9xscHGzPzs42xqdNm2aXZO/Ro4e9tLTUGB81apTdw8PDXlRUZIxVvncnu/fee+0+Pj4O51XW8PbbbxtjxcXF9latWtlvuukmY2zx4sV2Sfa4uLgqz2uz2ex2u93+3Xff2SXZ33vvPYfjK1eurHb8VCf/bFYqKSmxh4SE2Lt162Y/ceKEMf7555/bJdlnzJhhjP3R9+pMX+9U/v7+9l69ehmPr7jiCnv37t0d3kObzWbv37+/vVOnTsbYjBkz7JLsH3/8cZXnrHy/Tv25Pn78uF2Sfe7cuX9Y95AhQ+xDhgwxHsfHx9sl2d99911jrKSkxN6vXz978+bN7bm5uQ6v17JlS3tWVpZx7v/+9z+7JPv//d///eHrnmzu3Ll2SfZ9+/YZY6d7/8/05+Js/lwCAAAAaLxYUooG79Zbb9WJEyf0+eefKy8vT59//vlpl5OuWLFCrq6ueuihhxzGH3nkEdntdn3xxRfGeZKqnHfqbDW73a6PPvpIw4cPl91uV2ZmpvE1bNgw5eTkaPPmzTW6r3vuucdhps79998vNzc3o7ZVq1YpOztbo0aNcnhdV1dX9e3bV99++22V57z//vvP6LW//vprlZSU6OGHH5aLy+//mZg4caL8/Py0fPlyh/M9PT01YcKEap/rlltukb+/v/G4b9++kqQ77rjDYd+9vn37qqSkxGGJ5cl7juXl5SkzM1ODBg
1SYWGhduzY4fA6zZs31x133GE89vDwUJ8+fbR3715j7KOPPlJQUJAefPDBKnVWzhr84IMP5O/vryuvvNLhfY2KilL
"text/plain": [
"<Figure size 1500x1500 with 5 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjcAAAGwCAYAAABVdURTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABqfklEQVR4nO3dd3iTVfsH8G/SNulOKd1toKyyy6ZSFAQKBZSlKMoqqLgAEV5UULYyFEVQUQQFxJ++KA7kFQRKGQKizLJaCpRRCh2U0qZ7JOf3R5tIaApJSZo2/X6uq5fmyXme3HkIzc059zlHIoQQICIiIrIRUmsHQERERGROTG6IiIjIpjC5ISIiIpvC5IaIiIhsCpMbIiIisilMboiIiMimMLkhIiIim2Jv7QCqm0ajwY0bN+Dm5gaJRGLtcIiIiMgIQgjk5OQgICAAUum9+2bqXHJz48YNKJVKa4dBREREVXDt2jUEBQXds02dS27c3NwAlN0cd3d3K0dDRERExlCpVFAqlbrv8Xupc8mNdijK3d2dyQ0REVEtY0xJCQuKiYiIyKYwuSEiIiKbwuSGiIiIbAqTGyIiIrIpTG6IiIjIpjC5ISIiIpvC5IaIiIhsCpMbIiIisilMboiIiMimMLkhIiIim2LV5ObPP//EoEGDEBAQAIlEgs2bN9/3nL1796Jjx46Qy+Vo2rQp1q9fb/E4iYiIqPawanKTl5eHdu3aYeXKlUa1v3z5Mh577DH06tULsbGxeP311/HCCy9gx44dFo6UiIiIagurbpw5YMAADBgwwOj2q1atQqNGjfDRRx8BAFq2bIkDBw7g448/RmRkpKXCJCIiIiOlqwqRW1SKxt6uVouhVu0KfujQIUREROgdi4yMxOuvv17pOUVFRSgqKtI9VqlUlgqPiIioTlEVluB0cjZir2XhVHIWTiVnIyW7EL2ae2Pd+K5Wi6tWJTepqanw9fXVO+br6wuVSoWCggI4OTlVOGfx4sWYP39+dYVIRERkkwpL1Dh7Q4VTyVk4ea0skbmUkVehnUQCFJSorRDhv2pVclMVM2fOxLRp03SPVSoVlEqlFSMiIiKq2UrVGlxIz8XJa1k4mZyNU8lZSEjNQalGVGir9HRCaJAH2gd5IDRIgTaBCrjIrZte1Krkxs/PD2lpaXrH0tLS4O7ubrDXBgDkcjnkcnl1hEdERFTrCCFw9VY+TiZn4eS1skTmzI1sFJZoKrT1cpWhXZAHQoM8EKpUoF2QBzxdZFaI+t5qVXLTrVs3bNu2Te9YdHQ0unXrZqWIiIiIapd0VWF5jUw2TpbXyWQXlFRo5yq3R9tABdopPdAuSIFQpQcCFI6QSCRWiNo0Vk1ucnNzcfHiRd3jy5cvIzY2Fp6enmjQoAFmzpyJ69evY8OGDQCAl19+GZ999hnefPNNPPfcc9i9ezd+/PFHbN261VpvgYiIqMbKLigr+D15R51MqqqwQjuZnRStAtzLkpggD7RTeqCxlwuk0pqfyBhi1eTm6NGj6NWrl+6xtjYmKioK69evR0pKCpKSknTPN2rUCFu3bsXUqVOxYsUKBAUF4auvvuI0cCIiqvPKCn6zdUNLJ5OzcdlAwa9UAjTzcUM7ZXkiE+SB5n5ukNnbzqYFEiFExeogG6ZSqaBQKJCdnQ13d3drh0NERGSyUrUG59Nyy4eVymplEtJyoDZQ8NvA0xmhQWX1Me2UHmgd4G71gt+qMOX7u/a9OyIiojpECIErt/J1SczJ5CycrbTgV4725T0yoeVDTDWx4NfSmNwQERHVIGm6gt/yot9rWVAVllZo5ya3R9vyBEab0PjXkoJfS2NyQ0REZCXZ+SU4db0sidEmNGmqogrtZPZStA5wL5+GXTaDqVH92lvwa2lMboiIiKpBQXF5wW/5ongnr2Xhyq38Cu2kEiDE102XxLQL8kCIr20V/FoakxsiIiIzK1FrcD4tRzesdDI5G+crKfhtWN+5fNaSQlfw6y
zj1/OD4N0jIiJ6ABqNwJVbebpF8U5ey8LZGyoUlVYs+PV2k5clMUEeCFV6IDRQgXp1sODX0pjcEBERmSA1W7/g91RyJQW/jva6GUtl07AV8HNnwW91YHJDRERUiaz8Yl0CE1u+OF56zr0LfrWL47Hg13qY3BAREaGs4PfMjWzdNgWnku9d8KtdFC80SIHmfm5wsGPBb03B5IaIiOqcErUGCal3Fvxm4UJ6bqUFv3dOwWbBb83HPx0iIrJpGo3A5Vt5uhV+TyVXXvDr4ybXWxQvNEgBD2cW/NY2TG6IiMhmCCGQqirUbVOgLfrNqaTgt90d2xS0V3rAT+FohajJ3JjcEBFRrZWVX1y2KF75WjInk7Nw00DBr7y84FebxIQGKRDMgl+bxeSGiIhqhfziUpy9odItincqOQtXDRT82kkl5QW/5dOwlQqE+LLgty5hckNERDWOtuBXuyjeqfIVfg3U+yK4vnP5rKWyVX5bByjgJLOr/qCpxmByQ0REVqUt+NUmMSfLC36LDRT8+rrL9bYqCA30gMLZwQpRU03G5IaIiKqNEAIp2YV6i+KdTs5GTlHFgl93R3vdOjLaVX5Z8EvGYHJDREQWczuvuHzW0r+r/GbkGi74bROoQGiQorzg1wPB9Z25VQFVCZMbIiIyi/ziUpy5rtItincqORtJmYYLfpv7uum2KQgNYsEvmReTGyIiMllx6b8Fv9rF8S6kGy74beTlUra6b/nMpVb+LPgly2JyQ0RE96TRCFzK0Bb8lk3DjksxXPDr5+6o26YgNEjBgl+yCiY3RESkI4TAjezCfxfFu5aFM9fvXfB7575Lvu4s+CXrY3JDRFSHZWoLfstnLp1MzkJGbnGFdo4OUrQJ+HdRvHZBHmjIgl+qoZjcEBHVEXlFpThzPRunkrMRW14rcy2zoEK7fwt+PXSr/Ib4usKeBb9USzC5ISKyQcWlGpxLVen2XTqVXHnBb+Pygt+yXhkPtA5wh6MDC36p9mJyQ0RUy5UV/ObqFsU7mZyN+BsqFKsNF/xqp2C3C/JA2yAFFE4s+CXbwuSGiKgWEULgelaBbpuCsoJfFXINFPwqnBz0FsVrF6SADwt+qQ5gckNEVINpC361+y6dukfBb9vAfxfFa6/0QANPFvxS3cTkhoiohsjVFfyWLYp3MjkLybcrFvzaSyVo7ueG0CAPtC8fYmrmw4JfIi0mN0REVqAr+C1fT+ZUchYupOdCGCr49XbRW0umlT8LfonuhckNEZGFqTUCl27m6hbFO5WchfiUHIMFv/4Kx7JEpnwtmTaBLPglMhWTGyIiM9IW/J68Y1G808nZyCtWV2jr4eygK/TVJjQ+biz4JXpQTG6IiB7ArdyiskXxyntkTiVn41ZexYJfJwe78oJfBULLF8djwS+RZTC5ISIyUm5RKU6X18doE5rrWYYLflv4lxf8lvfINPVmwS9RdWFyQ0RkQFGpGudScsqnYZclNBdvGi74bXJHwW8oC36JrI7JDRHVeWqNQOLNXN1aMieTsxCfokKJumImE6Bw1G1T0C5IgTZBCrg7suCXqCZhckNEdYoQAsm3C8p2wi6fvXTmeuUFv+2C/t08kgW/RLUDkxsismkZuUU4lZyl23fpVHI2Mg0U/DrL7NAmQKG375LS04kFv0S1EJMbIrIZOYUlOH09W7dNwclr2ZUW/Lb0dy9bFK98iKmpjyvspExkiGxBlZKbkpISpKamIj8/H97e3vD09DR3XERE91RUqkZ8Sk75Cr9lPTKJBgp+JRKgsZdLeY1MWdFvSxb8Etk0o5ObnJwc/N///R82btyIw4cPo7i4GEIISCQSBAUFoV+/fnjxxRfRpUsXS8ZLRHWQWiNwMT23PIkp65E5l2q44DfQw6ls1lKQB9opFWgbqIAbC36J6hSjkptly5Zh4cKFaNKkCQYNGoS3334bAQEBcHJyQmZmJs6cOYP9+/ejX79+CAsLw6effopmzZpZOnYiskHagl/tongnk7
Nx5no28g0U/NZzdkA7pYduld/QIA94u8mtEDUR1SQSIQyt2qDv2WefxaxZs9C6det7tisqKsK6desgk8nw3HPPmS1
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Score the model with default parameters\n",
"scores, model = score_the_model(\n",
" model=DecisionTreeClassifier(),\n",
" model_name='Decision Tree',\n",
" random_seed=42,\n",
" X_train=X_train,\n",
" X_test=X_test,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=True\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "a72e54f6",
"metadata": {},
"source": [
"Now lets plot the decision tree"
]
},
{
"cell_type": "code",
"execution_count": 50,
"id": "c4fe47bd",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Text(0.6615466101694916, 0.9722222222222222, 'V36 <= 3.678\\ngini = 0.444\\nsamples = 846\\nvalue = [564, 282]\\nclass = Ready biodegradable'),\n",
" Text(0.4240819209039548, 0.9166666666666666, 'V1 <= 4.792\\ngini = 0.483\\nsamples = 361\\nvalue = [147, 214]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2803672316384181, 0.8611111111111112, 'V34 <= 1.5\\ngini = 0.435\\nsamples = 285\\nvalue = [91, 194]\\nclass = Reday non-biodegradable'),\n",
" Text(0.1906779661016949, 0.8055555555555556, 'V14 <= 0.673\\ngini = 0.32\\nsamples = 210\\nvalue = [42, 168]\\nclass = Reday non-biodegradable'),\n",
" Text(0.16807909604519775, 0.75, 'V18 <= 1.158\\ngini = 0.278\\nsamples = 12\\nvalue = [10, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.15677966101694915, 0.6944444444444444, 'V28 <= 0.121\\ngini = 0.165\\nsamples = 11\\nvalue = [10, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.14548022598870056, 0.6388888888888888, 'gini = 0.0\\nsamples = 10\\nvalue = [10, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.16807909604519775, 0.6388888888888888, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.17937853107344634, 0.6944444444444444, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2132768361581921, 0.75, 'V38 <= 1.5\\ngini = 0.271\\nsamples = 198\\nvalue = [32, 166]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2019774011299435, 0.6944444444444444, 'V41 <= 1.5\\ngini = 0.247\\nsamples = 194\\nvalue = [28, 166]\\nclass = Reday non-biodegradable'),\n",
" Text(0.1906779661016949, 0.6388888888888888, 'V22 <= 1.265\\ngini = 0.221\\nsamples = 190\\nvalue = [24, 166]\\nclass = Reday non-biodegradable'),\n",
" Text(0.12146892655367232, 0.5833333333333334, 'V32 <= 0.5\\ngini = 0.148\\nsamples = 161\\nvalue = [13, 148]\\nclass = Reday non-biodegradable'),\n",
" Text(0.11016949152542373, 0.5277777777777778, 'V28 <= 0.843\\ngini = 0.139\\nsamples = 160\\nvalue = [12, 148]\\nclass = Reday non-biodegradable'),\n",
" Text(0.09887005649717515, 0.4722222222222222, 'V37 <= 1.95\\ngini = 0.129\\nsamples = 159\\nvalue = [11, 148]\\nclass = Reday non-biodegradable'),\n",
" Text(0.03389830508474576, 0.4166666666666667, 'V22 <= 1.174\\ngini = 0.34\\nsamples = 23\\nvalue = [5, 18]\\nclass = Reday non-biodegradable'),\n",
" Text(0.022598870056497175, 0.3611111111111111, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.04519774011299435, 0.3611111111111111, 'V31 <= 1.358\\ngini = 0.298\\nsamples = 22\\nvalue = [4, 18]\\nclass = Reday non-biodegradable'),\n",
" Text(0.03389830508474576, 0.3055555555555556, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.05649717514124294, 0.3055555555555556, 'V37 <= 1.935\\ngini = 0.245\\nsamples = 21\\nvalue = [3, 18]\\nclass = Reday non-biodegradable'),\n",
" Text(0.04519774011299435, 0.25, 'V35 <= 1.5\\ngini = 0.18\\nsamples = 20\\nvalue = [2, 18]\\nclass = Reday non-biodegradable'),\n",
" Text(0.022598870056497175, 0.19444444444444445, 'V3 <= 0.5\\ngini = 0.105\\nsamples = 18\\nvalue = [1, 17]\\nclass = Reday non-biodegradable'),\n",
" Text(0.011299435028248588, 0.1388888888888889, 'gini = 0.0\\nsamples = 15\\nvalue = [0, 15]\\nclass = Reday non-biodegradable'),\n",
" Text(0.03389830508474576, 0.1388888888888889, 'V22 <= 1.231\\ngini = 0.444\\nsamples = 3\\nvalue = [1, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.022598870056497175, 0.08333333333333333, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.04519774011299435, 0.08333333333333333, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.06779661016949153, 0.19444444444444445, 'V17 <= 0.97\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.05649717514124294, 0.1388888888888889, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.07909604519774012, 0.1388888888888889, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.06779661016949153, 0.25, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.1638418079096045, 0.4166666666666667, 'V9 <= 3.5\\ngini = 0.084\\nsamples = 136\\nvalue = [6, 130]\\nclass = Reday non-biodegradable'),\n",
" Text(0.13559322033898305, 0.3611111111111111, 'V18 <= 1.162\\ngini = 0.059\\nsamples = 131\\nvalue = [4, 127]\\nclass = Reday non-biodegradable'),\n",
" Text(0.11299435028248588, 0.3055555555555556, 'V37 <= 2.292\\ngini = 0.017\\nsamples = 118\\nvalue = [1, 117]\\nclass = Reday non-biodegradable'),\n",
" Text(0.1016949152542373, 0.25, 'V37 <= 2.285\\ngini = 0.087\\nsamples = 22\\nvalue = [1, 21]\\nclass = Reday non-biodegradable'),\n",
" Text(0.0903954802259887, 0.19444444444444445, 'gini = 0.0\\nsamples = 21\\nvalue = [0, 21]\\nclass = Reday non-biodegradable'),\n",
" Text(0.11299435028248588, 0.19444444444444445, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.12429378531073447, 0.25, 'gini = 0.0\\nsamples = 96\\nvalue = [0, 96]\\nclass = Reday non-biodegradable'),\n",
" Text(0.15819209039548024, 0.3055555555555556, 'V2 <= 3.228\\ngini = 0.355\\nsamples = 13\\nvalue = [3, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.14689265536723164, 0.25, 'V11 <= 0.5\\ngini = 0.165\\nsamples = 11\\nvalue = [1, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.13559322033898305, 0.19444444444444445, 'gini = 0.0\\nsamples = 9\\nvalue = [0, 9]\\nclass = Reday non-biodegradable'),\n",
" Text(0.15819209039548024, 0.19444444444444445, 'V16 <= 1.5\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.14689265536723164, 0.1388888888888889, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.1694915254237288, 0.1388888888888889, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.1694915254237288, 0.25, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.192090395480226, 0.3611111111111111, 'V18 <= 1.146\\ngini = 0.48\\nsamples = 5\\nvalue = [2, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.1807909604519774, 0.3055555555555556, 'gini = 0.0\\nsamples = 3\\nvalue = [0, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2033898305084746, 0.3055555555555556, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.12146892655367232, 0.4722222222222222, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.1327683615819209, 0.5277777777777778, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.2598870056497175, 0.5833333333333334, 'V2 <= 2.882\\ngini = 0.471\\nsamples = 29\\nvalue = [11, 18]\\nclass = Reday non-biodegradable'),\n",
" Text(0.22033898305084745, 0.5277777777777778, 'V8 <= 38.8\\ngini = 0.459\\nsamples = 14\\nvalue = [9, 5]\\nclass = Ready biodegradable'),\n",
" Text(0.1977401129943503, 0.4722222222222222, 'V31 <= 1.896\\ngini = 0.32\\nsamples = 5\\nvalue = [1, 4]\\nclass = Reday non-biodegradable'),\n",
" Text(0.1864406779661017, 0.4166666666666667, 'gini = 0.0\\nsamples = 4\\nvalue = [0, 4]\\nclass = Reday non-biodegradable'),\n",
" Text(0.20903954802259886, 0.4166666666666667, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.24293785310734464, 0.4722222222222222, 'V22 <= 1.331\\ngini = 0.198\\nsamples = 9\\nvalue = [8, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.23163841807909605, 0.4166666666666667, 'gini = 0.0\\nsamples = 8\\nvalue = [8, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.2542372881355932, 0.4166666666666667, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2994350282485876, 0.5277777777777778, 'V13 <= 3.298\\ngini = 0.231\\nsamples = 15\\nvalue = [2, 13]\\nclass = Reday non-biodegradable'),\n",
" Text(0.288135593220339, 0.4722222222222222, 'V12 <= 0.841\\ngini = 0.133\\nsamples = 14\\nvalue = [1, 13]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2768361581920904, 0.4166666666666667, 'gini = 0.0\\nsamples = 12\\nvalue = [0, 12]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2994350282485876, 0.4166666666666667, 'V2 <= 3.222\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.288135593220339, 0.3611111111111111, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.3107344632768362, 0.3611111111111111, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.3107344632768362, 0.4722222222222222, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.2132768361581921, 0.6388888888888888, 'gini = 0.0\\nsamples = 4\\nvalue = [4, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.2245762711864407, 0.6944444444444444, 'gini = 0.0\\nsamples = 4\\nvalue = [4, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.3700564971751412, 0.8055555555555556, 'V16 <= 0.5\\ngini = 0.453\\nsamples = 75\\nvalue = [49, 26]\\nclass = Ready biodegradable'),\n",
" Text(0.3163841807909605, 0.75, 'V1 <= 4.107\\ngini = 0.311\\nsamples = 52\\nvalue = [42, 10]\\nclass = Ready biodegradable'),\n",
" Text(0.288135593220339, 0.6944444444444444, 'V34 <= 2.5\\ngini = 0.48\\nsamples = 10\\nvalue = [4, 6]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2768361581920904, 0.6388888888888888, 'gini = 0.0\\nsamples = 6\\nvalue = [0, 6]\\nclass = Reday non-biodegradable'),\n",
" Text(0.2994350282485876, 0.6388888888888888, 'gini = 0.0\\nsamples = 4\\nvalue = [4, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.3446327683615819, 0.6944444444444444, 'V2 <= 2.217\\ngini = 0.172\\nsamples = 42\\nvalue = [38, 4]\\nclass = Ready biodegradable'),\n",
" Text(0.3220338983050847, 0.6388888888888888, 'V38 <= 1.5\\ngini = 0.444\\nsamples = 3\\nvalue = [1, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.3107344632768362, 0.5833333333333334, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.3333333333333333, 0.5833333333333334, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.3672316384180791, 0.6388888888888888, 'V1 <= 4.426\\ngini = 0.097\\nsamples = 39\\nvalue = [37, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.3559322033898305, 0.5833333333333334, 'V7 <= 0.5\\ngini = 0.32\\nsamples = 10\\nvalue = [8, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.3446327683615819, 0.5277777777777778, 'V28 <= 0.029\\ngini = 0.198\\nsamples = 9\\nvalue = [8, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.3333333333333333, 0.4722222222222222, 'gini = 0.0\\nsamples = 7\\nvalue = [7, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.3559322033898305, 0.4722222222222222, 'V28 <= 0.097\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.3446327683615819, 0.4166666666666667, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.3672316384180791, 0.4166666666666667, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.3672316384180791, 0.5277777777777778, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.3785310734463277, 0.5833333333333334, 'gini = 0.0\\nsamples = 29\\nvalue = [29, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.423728813559322, 0.75, 'V27 <= 2.089\\ngini = 0.423\\nsamples = 23\\nvalue = [7, 16]\\nclass = Reday non-biodegradable'),\n",
" Text(0.4124293785310734, 0.6944444444444444, 'gini = 0.0\\nsamples = 3\\nvalue = [3, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.4350282485875706, 0.6944444444444444, 'V28 <= 0.015\\ngini = 0.32\\nsamples = 20\\nvalue = [4, 16]\\nclass = Reday non-biodegradable'),\n",
" Text(0.4124293785310734, 0.6388888888888888, 'V27 <= 2.239\\ngini = 0.124\\nsamples = 15\\nvalue = [1, 14]\\nclass = Reday non-biodegradable'),\n",
" Text(0.4011299435028249, 0.5833333333333334, 'gini = 0.0\\nsamples = 14\\nvalue = [0, 14]\\nclass = Reday non-biodegradable'),\n",
" Text(0.423728813559322, 0.5833333333333334, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.4576271186440678, 0.6388888888888888, 'V36 <= 3.535\\ngini = 0.48\\nsamples = 5\\nvalue = [3, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.4463276836158192, 0.5833333333333334, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.4689265536723164, 0.5833333333333334, 'gini = 0.0\\nsamples = 3\\nvalue = [3, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.5677966101694916, 0.8611111111111112, 'V30 <= 10.221\\ngini = 0.388\\nsamples = 76\\nvalue = [56, 20]\\nclass = Ready biodegradable'),\n",
" Text(0.5254237288135594, 0.8055555555555556, 'V36 <= 3.673\\ngini = 0.201\\nsamples = 44\\nvalue = [39, 5]\\nclass = Ready biodegradable'),\n",
" Text(0.5141242937853108, 0.75, 'V18 <= 1.158\\ngini = 0.133\\nsamples = 42\\nvalue = [39, 3]\\nclass = Ready biodegradable'),\n",
" Text(0.4915254237288136, 0.6944444444444444, 'V12 <= 1.437\\ngini = 0.05\\nsamples = 39\\nvalue = [38, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.480225988700565, 0.6388888888888888, 'gini = 0.0\\nsamples = 37\\nvalue = [37, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.5028248587570622, 0.6388888888888888, 'V12 <= 1.504\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.4915254237288136, 0.5833333333333334, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5141242937853108, 0.5833333333333334, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.536723163841808, 0.6944444444444444, 'V17 <= 1.005\\ngini = 0.444\\nsamples = 3\\nvalue = [1, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5254237288135594, 0.6388888888888888, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.5480225988700564, 0.6388888888888888, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.536723163841808, 0.75, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6101694915254238, 0.8055555555555556, 'V36 <= 3.511\\ngini = 0.498\\nsamples = 32\\nvalue = [17, 15]\\nclass = Ready biodegradable'),\n",
" Text(0.5706214689265536, 0.75, 'V31 <= 1.567\\ngini = 0.278\\nsamples = 12\\nvalue = [2, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.559322033898305, 0.6944444444444444, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.5819209039548022, 0.6944444444444444, 'V2 <= 4.616\\ngini = 0.165\\nsamples = 11\\nvalue = [1, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5706214689265536, 0.6388888888888888, 'gini = 0.0\\nsamples = 10\\nvalue = [0, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.5932203389830508, 0.6388888888888888, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.6497175141242938, 0.75, 'V17 <= 1.025\\ngini = 0.375\\nsamples = 20\\nvalue = [15, 5]\\nclass = Ready biodegradable'),\n",
" Text(0.6271186440677966, 0.6944444444444444, 'V1 <= 4.856\\ngini = 0.124\\nsamples = 15\\nvalue = [14, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.615819209039548, 0.6388888888888888, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6384180790960452, 0.6388888888888888, 'gini = 0.0\\nsamples = 14\\nvalue = [14, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.672316384180791, 0.6944444444444444, 'V31 <= 2.562\\ngini = 0.32\\nsamples = 5\\nvalue = [1, 4]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6610169491525424, 0.6388888888888888, 'gini = 0.0\\nsamples = 4\\nvalue = [0, 4]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6836158192090396, 0.6388888888888888, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8990112994350282, 0.9166666666666666, 'V40 <= 0.5\\ngini = 0.241\\nsamples = 485\\nvalue = [417, 68]\\nclass = Ready biodegradable'),\n",
" Text(0.8206214689265536, 0.8611111111111112, 'V12 <= -0.712\\ngini = 0.185\\nsamples = 457\\nvalue = [410, 47]\\nclass = Ready biodegradable'),\n",
" Text(0.751412429378531, 0.8055555555555556, 'V27 <= 2.363\\ngini = 0.452\\nsamples = 55\\nvalue = [36, 19]\\nclass = Ready biodegradable'),\n",
" Text(0.7401129943502824, 0.75, 'V8 <= 42.5\\ngini = 0.475\\nsamples = 31\\nvalue = [12, 19]\\nclass = Reday non-biodegradable'),\n",
" Text(0.7175141242937854, 0.6944444444444444, 'V22 <= 1.228\\ngini = 0.495\\nsamples = 20\\nvalue = [11, 9]\\nclass = Ready biodegradable'),\n",
" Text(0.7062146892655368, 0.6388888888888888, 'V11 <= 0.5\\ngini = 0.459\\nsamples = 14\\nvalue = [5, 9]\\nclass = Reday non-biodegradable'),\n",
" Text(0.6949152542372882, 0.5833333333333334, 'gini = 0.0\\nsamples = 6\\nvalue = [0, 6]\\nclass = Reday non-biodegradable'),\n",
" Text(0.7175141242937854, 0.5833333333333334, 'V31 <= 1.845\\ngini = 0.469\\nsamples = 8\\nvalue = [5, 3]\\nclass = Ready biodegradable'),\n",
" Text(0.7062146892655368, 0.5277777777777778, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.7288135593220338, 0.5277777777777778, 'V37 <= 3.486\\ngini = 0.278\\nsamples = 6\\nvalue = [5, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.7175141242937854, 0.4722222222222222, 'gini = 0.0\\nsamples = 4\\nvalue = [4, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7401129943502824, 0.4722222222222222, 'V2 <= 3.358\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.7288135593220338, 0.4166666666666667, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.751412429378531, 0.4166666666666667, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7288135593220338, 0.6388888888888888, 'gini = 0.0\\nsamples = 6\\nvalue = [6, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7627118644067796, 0.6944444444444444, 'V3 <= 0.5\\ngini = 0.165\\nsamples = 11\\nvalue = [1, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.751412429378531, 0.6388888888888888, 'gini = 0.0\\nsamples = 10\\nvalue = [0, 10]\\nclass = Reday non-biodegradable'),\n",
" Text(0.7740112994350282, 0.6388888888888888, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7627118644067796, 0.75, 'gini = 0.0\\nsamples = 24\\nvalue = [24, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8898305084745762, 0.8055555555555556, 'V39 <= 8.446\\ngini = 0.13\\nsamples = 402\\nvalue = [374, 28]\\nclass = Ready biodegradable'),\n",
" Text(0.847457627118644, 0.75, 'V30 <= 5.124\\ngini = 0.357\\nsamples = 56\\nvalue = [43, 13]\\nclass = Ready biodegradable'),\n",
" Text(0.8192090395480226, 0.6944444444444444, 'V8 <= 46.1\\ngini = 0.206\\nsamples = 43\\nvalue = [38, 5]\\nclass = Ready biodegradable'),\n",
" Text(0.7966101694915254, 0.6388888888888888, 'V6 <= 1.5\\ngini = 0.1\\nsamples = 38\\nvalue = [36, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.7853107344632768, 0.5833333333333334, 'V14 <= 1.534\\ngini = 0.053\\nsamples = 37\\nvalue = [36, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.7740112994350282, 0.5277777777777778, 'gini = 0.0\\nsamples = 35\\nvalue = [35, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.7966101694915254, 0.5277777777777778, 'V17 <= 0.989\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.7853107344632768, 0.4722222222222222, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.807909604519774, 0.4722222222222222, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.807909604519774, 0.5833333333333334, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8418079096045198, 0.6388888888888888, 'V18 <= 1.105\\ngini = 0.48\\nsamples = 5\\nvalue = [2, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8305084745762712, 0.5833333333333334, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8531073446327684, 0.5833333333333334, 'gini = 0.0\\nsamples = 3\\nvalue = [0, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8757062146892656, 0.6944444444444444, 'V30 <= 15.07\\ngini = 0.473\\nsamples = 13\\nvalue = [5, 8]\\nclass = Reday non-biodegradable'),\n",
" Text(0.864406779661017, 0.6388888888888888, 'gini = 0.0\\nsamples = 8\\nvalue = [0, 8]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8870056497175142, 0.6388888888888888, 'gini = 0.0\\nsamples = 5\\nvalue = [5, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9322033898305084, 0.75, 'V31 <= 9.19\\ngini = 0.083\\nsamples = 346\\nvalue = [331, 15]\\nclass = Ready biodegradable'),\n",
" Text(0.9209039548022598, 0.6944444444444444, 'V8 <= 13.25\\ngini = 0.078\\nsamples = 345\\nvalue = [331, 14]\\nclass = Ready biodegradable'),\n",
" Text(0.9096045197740112, 0.6388888888888888, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9322033898305084, 0.6388888888888888, 'V38 <= 0.5\\ngini = 0.073\\nsamples = 344\\nvalue = [331, 13]\\nclass = Ready biodegradable'),\n",
" Text(0.8926553672316384, 0.5833333333333334, 'V30 <= 12.851\\ngini = 0.118\\nsamples = 191\\nvalue = [179, 12]\\nclass = Ready biodegradable'),\n",
" Text(0.8700564971751412, 0.5277777777777778, 'V15 <= 10.386\\ngini = 0.078\\nsamples = 173\\nvalue = [166, 7]\\nclass = Ready biodegradable'),\n",
" Text(0.8587570621468926, 0.4722222222222222, 'V30 <= 11.485\\ngini = 0.143\\nsamples = 90\\nvalue = [83, 7]\\nclass = Ready biodegradable'),\n",
" Text(0.847457627118644, 0.4166666666666667, 'V14 <= 2.748\\ngini = 0.126\\nsamples = 89\\nvalue = [83, 6]\\nclass = Ready biodegradable'),\n",
" Text(0.8361581920903954, 0.3611111111111111, 'V14 <= 0.861\\ngini = 0.107\\nsamples = 88\\nvalue = [83, 5]\\nclass = Ready biodegradable'),\n",
" Text(0.8135593220338984, 0.3055555555555556, 'V15 <= 10.142\\ngini = 0.231\\nsamples = 30\\nvalue = [26, 4]\\nclass = Ready biodegradable'),\n",
" Text(0.8022598870056498, 0.25, 'V2 <= 2.458\\ngini = 0.391\\nsamples = 15\\nvalue = [11, 4]\\nclass = Ready biodegradable'),\n",
" Text(0.7909604519774012, 0.19444444444444445, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8135593220338984, 0.19444444444444445, 'V36 <= 3.866\\ngini = 0.26\\nsamples = 13\\nvalue = [11, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.8022598870056498, 0.1388888888888889, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8248587570621468, 0.1388888888888889, 'V31 <= 0.973\\ngini = 0.153\\nsamples = 12\\nvalue = [11, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.8135593220338984, 0.08333333333333333, 'V14 <= 0.792\\ngini = 0.5\\nsamples = 2\\nvalue = [1, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.8022598870056498, 0.027777777777777776, 'gini = 0.0\\nsamples = 1\\nvalue = [1, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8248587570621468, 0.027777777777777776, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8361581920903954, 0.08333333333333333, 'gini = 0.0\\nsamples = 10\\nvalue = [10, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8248587570621468, 0.25, 'gini = 0.0\\nsamples = 15\\nvalue = [15, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8587570621468926, 0.3055555555555556, 'V2 <= 2.978\\ngini = 0.034\\nsamples = 58\\nvalue = [57, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.847457627118644, 0.25, 'V2 <= 2.946\\ngini = 0.117\\nsamples = 16\\nvalue = [15, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.8361581920903954, 0.19444444444444445, 'gini = 0.0\\nsamples = 15\\nvalue = [15, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8587570621468926, 0.19444444444444445, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8700564971751412, 0.25, 'gini = 0.0\\nsamples = 42\\nvalue = [42, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.8587570621468926, 0.3611111111111111, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8700564971751412, 0.4166666666666667, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.8813559322033898, 0.4722222222222222, 'gini = 0.0\\nsamples = 83\\nvalue = [83, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9152542372881356, 0.5277777777777778, 'V17 <= 1.038\\ngini = 0.401\\nsamples = 18\\nvalue = [13, 5]\\nclass = Ready biodegradable'),\n",
" Text(0.903954802259887, 0.4722222222222222, 'V14 <= 0.644\\ngini = 0.231\\nsamples = 15\\nvalue = [13, 2]\\nclass = Ready biodegradable'),\n",
" Text(0.8926553672316384, 0.4166666666666667, 'gini = 0.0\\nsamples = 2\\nvalue = [0, 2]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9152542372881356, 0.4166666666666667, 'gini = 0.0\\nsamples = 13\\nvalue = [13, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9265536723163842, 0.4722222222222222, 'gini = 0.0\\nsamples = 3\\nvalue = [0, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9717514124293786, 0.5833333333333334, 'V31 <= 1.042\\ngini = 0.013\\nsamples = 153\\nvalue = [152, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.96045197740113, 0.5277777777777778, 'V30 <= 10.944\\ngini = 0.375\\nsamples = 4\\nvalue = [3, 1]\\nclass = Ready biodegradable'),\n",
" Text(0.9491525423728814, 0.4722222222222222, 'gini = 0.0\\nsamples = 3\\nvalue = [3, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9717514124293786, 0.4722222222222222, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9830508474576272, 0.5277777777777778, 'gini = 0.0\\nsamples = 149\\nvalue = [149, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.943502824858757, 0.6944444444444444, 'gini = 0.0\\nsamples = 1\\nvalue = [0, 1]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9774011299435028, 0.8611111111111112, 'V13 <= 4.278\\ngini = 0.375\\nsamples = 28\\nvalue = [7, 21]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9661016949152542, 0.8055555555555556, 'V22 <= 1.251\\ngini = 0.159\\nsamples = 23\\nvalue = [2, 21]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9548022598870056, 0.75, 'gini = 0.0\\nsamples = 18\\nvalue = [0, 18]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9774011299435028, 0.75, 'V12 <= -0.564\\ngini = 0.48\\nsamples = 5\\nvalue = [2, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9661016949152542, 0.6944444444444444, 'gini = 0.0\\nsamples = 2\\nvalue = [2, 0]\\nclass = Ready biodegradable'),\n",
" Text(0.9887005649717514, 0.6944444444444444, 'gini = 0.0\\nsamples = 3\\nvalue = [0, 3]\\nclass = Reday non-biodegradable'),\n",
" Text(0.9887005649717514, 0.8055555555555556, 'gini = 0.0\\nsamples = 5\\nvalue = [5, 0]\\nclass = Ready biodegradable')]"
]
},
"execution_count": 50,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAEj4AABIgCAYAAABMxjtCAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOz9eZSd+UHf+X9ubdr3tVtqtVrq1i611truNTaEbcDgJgFMxmeIiX+QSQbm/H7DCSSeZIYwmUMCTpgzJgwMgSR2SNgC7g4YsGOM8b1VKu1Lq1tSt1pLa9/3pbb7+8PhOSncQNOW7tXyep3jP1zuqvpUddWt53nO+b5dqtfr9QAAAAAAAAAAAAAAAAAAAAAAADRAS7MHAAAAAAAAAAAAAAAAAAAAAAAATw7hIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAABoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGkb4CAAAAAAAAAAAAAAAAAAAAAAAaBjhIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAABoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGkb4CAAAAAAAAAAAAAAAAAAAAAAAaBjhIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAABoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGkb4CAAAAAAAAAAAAAAAAAAAAAAAaBjhIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAABoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGkb4CAAAAAAAAAAAAAAAAAAAAAAAaBjhIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAA
BoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGkb4CAAAAAAAAAAAAAAAAAAAAAAAaBjhIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAABoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGkb4CAAAAAAAAAAAAAAAAAAAAAAAaBjhIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAABoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGkb4CAAAAAAAAAAAAAAAAAAAAAAAaBjhIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAABoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGkb4CAAAAAAAAAAAAAAAAAAAAAAAaBjhIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAABoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGkb4CAAAAAAAAAAAAAAAAAAAAAAAaBjhIwAAAAAAAAAAAAAAAAAAAAAAoGGEjwAAAAAAAAAAAAAAAAAAAAAAgIYRPgIAAAAAAAAAAAAAAAAAAAAAABpG+AgAAAAAAAAAAAAAAAAAAAAAAGgY4SMAAAAAAAAAAAAAAAAAAAAAAKBhhI8AAAAAAAAAAAAAAAAAAAAAAICGET4CAAAAAAAAAAAAAAAAAAAAAAAaRvgIAAAAAAAAAAAAAAAAAAAAAABoGOEjAAAAAAAAAAAAAAAAAAAAAACgYYSPAAAAAAAAAAAAAAAAAAAAAACAhhE+AgAAAAAAAAAAAAAAAAAAAAAAGqat2QMAAAAAAAAAAAAAAOBRUK/Xs3///pw7d67ZUx5rEydOzLp16zJlypRmTwEAAAAAAB4Q4SMAAAAAAAAAAAAAAPhL1Ov1/NiP/Vg+8YlPNHvKE2HlihX5oy9+MfPnz2/2FAAAAAAA4AEo1ev1erNHAAAAAAAAAAAAAADAw2z79u3p7OzM//6Dfz3f/Q2dKZVKzZ702Dpz6Wq+73/9V/
nuD//N/MIv/EKz5wAAAAAAAA9AW7MHAAAAAAAAAAAAAADAw+706dNJku//tkrmzJj6wD7PD//0v80n//7fSqlUyt/
"text/plain": [
"<Figure size 6000x6000 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.tree import plot_tree\n",
"\n",
"plt.figure(figsize=(60, 60))\n",
"plot_tree(model, filled=True, rounded=True, class_names=['Ready biodegradable', 'Not ready biodegradable'], feature_names=X_train.columns)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "b55c97cd",
"metadata": {},
"source": [
"### Random Forest Classifier"
]
},
{
"cell_type": "code",
"execution_count": 60,
"id": "c9d5676b",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"({'Accuracy': 0.8229665071770335,\n",
" 'F1': 0.8664259927797834,\n",
" 'Precision': 0.8450704225352113,\n",
" 'Recall': 0.8888888888888888,\n",
" 'AUC': 0.7957957957957957},\n",
" RandomForestClassifier())"
]
},
"execution_count": 60,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABNwAAATFCAYAAAB7FctDAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeVhUdeP+8XvYwQU1NkUSNZdMU8PgccssktQwW01LkVKzQkuywlywTKksonKhLDQtn8wy80lzo6hM03KpLPc9ExRNUFAQZn5/9PN8m0BTHDiMvF/XNVfOZ85ynxmo6fac87HYbDabAAAAAAAAADiEi9kBAAAAAAAAgMsJhRsAAAAAAADgQBRuAAAAAAAAgANRuAEAAAAAAAAOROEGAAAAAAAAOBCFGwAAAAAAAOBAFG4AAAAAAACAA1G4AQAAAAAAAA5E4QYAAAAAAAA4EIUb8A8Wi0Xjx4+/6PX27t0ri8WiWbNmOTzTpZozZ46aN28ud3d31apVy+w4Tm/p0qVq06aNvLy8ZLFYdPz4cbMjVbiBAwcqNDTU7BgAAAAAUClRuKFSmjVrliwWiywWi1atWlXidZvNppCQEFksFt12220mJHQeW7du1cCBA9W4cWPNmDFDb7/9ttmRnNrRo0d17733ytvbW1OnTtWcOXNUrVq1ctvf338XLBaL3NzcFBwcrIEDB+rgwYPltl9n88/36e+PhIQEs+OVatKkSVq4cKHZMQAAAACUAzezAwDn4+Xlpblz56pTp052419//bV+//13eXp6mpTMeWRkZMhqter111/XVVddZXYcp/fDDz/oxIkTmjBhgiIjIytsv88//7waNmyo06dP6/vvv9esWbO0atUqbd68WV5eXhWWo7I7+z79XcuWLU1Kc36TJk3S3Xffrd69e5sdBQAAAICDUbihUuvRo4fmz5+vN954Q25u//fjOnfuXIWFhSk7O9vEdJVbXl6eqlWrpsOHD0uSQy8lzc/Pl4+Pj8O250zK4/08+1mdT/fu3dWuXTtJ0qBBg+Tn56eXXnpJixYt0r333uuwLM7u7++TI13IZwQAAAAAZ3FJKSq1vn376ujRo1qxYoUxVlhYqI8//lj9+vUrdZ28vDw9+eSTCgkJkaenp5o1a6ZXXnlFNpvNbrmCggKNGDFC/v7+qlGjhnr16qXff/+91G0ePHhQDz74oAIDA+Xp6alrrrlGaWlpZTqms5e+ffPNN3r44Yd1xRVXqGbNmhowYID+/PPPEst/8cUX6ty5s6pVq6YaNWqoZ8+e+vXXX+2WGThwoKpXr65du3apR48eqlGjhu6//36FhoYqMTFRkuTv71/i/nTTpk3TNddcI09PT9WrV0+PPfZYifuR3XjjjWrZsqXWr1+vG264QT4+Pnr22WeNe9a98sormjp1qho1aiQfHx9169ZNBw4ckM1m04QJE1S/fn15e3vr9ttv17Fjx+y2/dlnn6lnz56qV6+ePD091bhxY02YMEHFxcWlZvjtt9/UtWtX+fj4KDg4WC+//HKJ9+v06dMaP368mjZtKi8vL9WtW1d33nmndu3aZSxjtVqVkpKia665Rl5eXgoMDNTDDz9c6vv/zxwxMTGSpOuvv14Wi0UDBw40Xp8/f77CwsLk7e0tPz8/PfDAAyUu+zzXZ3WxOnfuLEl2x1VYWKhx48YpLCxMvr6+qlatmjp37qyvvvrKbt2/f3Zvv/22GjduLE9PT11//fX64YcfSuxr4cKFatmypby8vNSyZUt9+umnpWa60N89i8WiuLg4zZ8/Xy1atJC3t7fat2+vX375RZL01ltv6aqrrpKXl5duvPFG7d2796Lfn3P58ssvjd+nWrVq6fbbb9eWLVvslhk/frwsFot+++039evXT7Vr17Y7y/b99983Puc6derovvvu04EDB+y2sWPHDt11110KCgqSl5eX6tevr/vuu085OTnGe5CXl6f33nvPuPT17z9LAAAAAJ
wbZ7ihUgsNDVX79u313//+V927d5f0VwGVk5Oj++67T2+88Ybd8jabTb169dJXX32lhx56SG3atNGyZcv01FNP6eDBg3rttdeMZQcNGqT3339f/fr1U4cOHfTll1+qZ8+eJTJkZWXpP//5j1ES+Pv764svvtBDDz2k3NxcPfHEE2U6tri4ONWqVUvjx4/Xtm3bNH36dO3bt08ZGRmyWCyS/prsICYmRlFRUXrppZeUn5+v6dOnq1OnTtq4caPdTeuLiooUFRWlTp066ZVXXpGPj48GDhyo2bNn69NPP9X06dNVvXp1XXvttZL+KhWee+45RUZG6pFHHjEy/PDDD/ruu+/k7u5ubPvo0aPq3r277rvvPj3wwAMKDAw0Xvvggw9UWFioYcOG6dixY3r55Zd177336qabblJGRoaeeeYZ7dy5U2+++aZGjhxpV1TOmjVL1atXV3x8vKpXr64vv/xS48aNU25uriZPnmz3fv3555+69dZbdeedd+ree+/Vxx9/rGeeeUatWrUyfjaKi4t12223KT09Xffdd58ef/xxnThxQitWrNDmzZvVuHFjSdLDDz+sWbNmKTY2VsOHD9eePXs0ZcoUbdy4scSx/93o0aPVrFkzvf3228ali2e3eXZ7119/vZKSkpSVlaXXX39d3333nTZu3Gh3Rlxpn9XFOltC1a5d2xjLzc3VO++8o759+2rw4ME6ceKE3n33XUVFRWndunVq06aN3Tbmzp2rEydO6OGHH5bFYtHLL7+sO++8U7t37zbeg+XLl+uuu+5SixYtlJSUpKNHjyo2Nlb169e329bF/O5J0rfffqtFixbpsccekyQlJSXptttu09NPP61p06bp0Ucf1Z9//qmXX35ZDz74oL788ssLel9ycnJKnPnq5+cnSVq5cqW6d++uRo0aafz48Tp16pTefPNNdezYURs2bCgxCcQ999yjJk2aaNKkSUZpOHHiRI0dO1b33nuvBg0apCNHjujNN9/UDTfcYHzOhYWFioqKUkFBgYYNG6agoCAdPHhQn3/+uY4fPy5fX1/NmTNHgwYNUnh4uIYMGSJJxs8SAAAAgMuADaiEZs6caZNk++GHH2xTpkyx1ahRw5afn2+z2Wy2e+65x9a1a1ebzWazNWjQwNazZ09jvYULF9ok2V544QW77d199902i8Vi27lzp81ms9k2bdpkk2R79NFH7Zbr16+fTZItMTHRGHvooYdsdevWtWVnZ9ste99999l8fX2NXHv27LFJss2cOfOCji0sLMxWWFhojL/88ss2SbbPPvvMZrPZbCdOnLDVqlXLNnjwYLv1MzMzbb6+vnbjMTExNkm2hISEEvtLTEy0SbIdOXLEGDt8+LDNw8PD1q1bN1txcbExPmXKFJskW1pamjHWpUsXmyRbamqq3XbPHq+/v7/t+PHjxvioUaNskmytW7e2nTlzxhjv27evzcPDw3b69Glj7Ox793cPP/ywzcfHx265sxlmz55tjBUUFNiCgoJsd911lzGWlpZmk2RLTk4usV2r1Wqz2Wy2b7/91ibJ9sEHH9i9vnTp0lLH/+nvP5tnFRYW2gICAmwtW7a0nTp1yhj//PPPbZJs48aNM8bO91mdb38rV660HTlyxHbgwAHbxx9/bPP397d5enraDhw4YCxbVFRkKygosFv/zz//tAUGBtoefPBBY+zsZ3fFFVfYjh07Zox/9tlnNkm2//3vf8ZYmzZtbHXr1rX7jJcvX26TZGvQoIExdqG/ezabzSbJ5unpaduzZ48x9tZbb9kk2YKCgmy5ubnG+Nmfp78ve773qbTH348lICDAdvToUWPsp59+srm4uNgGDBhgjJ39nenbt6/dPvbu3WtzdXW1TZw40W78l19+sbm5uRnjGzdutEmyzZ8//7yZq1WrZouJiTnvMgAAAACcE5eUotK79957derUKX3++ec6ceKEPv/883NeTrpkyRK5urpq+PDhduNPPvmkbDabvvjiC2M5SS
WW++fZajabTZ988omio6Nls9mUnZ1tPKKiopSTk6MNGzaU6biGDBlidybVI488Ijc3NyPbihUrdPz4cfXt29duv66
"text/plain": [
"<Figure size 1500x1500 with 5 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAjcAAAGwCAYAAABVdURTAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjYuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8o6BhiAAAACXBIWXMAAA9hAAAPYQGoP6dpAABf50lEQVR4nO3dd1QU198G8GcX2KVIM0hTFHtX7BGjRkWxtySSaBRNotHYIjF2xRJ7bIlGo8aW1wQ1v2hM7L1giQ0rooKIBVBEQDrs3vcPw8aVkh3cZWV5PufsSfbOnZlnR3C/3rkzIxNCCBARERGZCLmxAxARERHpE4sbIiIiMiksboiIiMiksLghIiIik8LihoiIiEwKixsiIiIyKSxuiIiIyKSYGztAUVOr1Xj06BFsbW0hk8mMHYeIiIh0IITA8+fP4e7uDrm84LGZElfcPHr0CB4eHsaOQURERIVw//59lCtXrsA+Ja64sbW1BfDi4NjZ2Rk5DREREekiKSkJHh4emu/xgpS44ibnVJSdnR2LGyIiomJGlyklnFBMREREJoXFDREREZkUFjdERERkUljcEBERkUlhcUNEREQmhcUNERERmRQWN0RERGRSWNwQERGRSWFxQ0RERCaFxQ0RERGZFKMWN8ePH0e3bt3g7u4OmUyGHTt2/Oc6R48eRcOGDaFUKlGlShVs2LDB4DmJiIio+DBqcZOSkoL69etjxYoVOvW/e/cuunTpgjZt2iAkJARffvklPvvsM+zbt8/ASYmIiKi4MOqDMzt16oROnTrp3H/VqlWoWLEiFi1aBACoWbMmTp48iSVLlsDX19dQMYmIiEgHmdlqPE3JQLZKwKO0tdFyFKungp8+fRo+Pj5abb6+vvjyyy/zXScjIwMZGRma90lJSYaKR0REZHJyCpa455l4kpz+z38zEJecgSfPX/w3LjkTcckZSEjNAgB4V34Lvwx+22iZi1VxExMTAxcXF602FxcXJCUlIS0tDVZWVrnWmTt3LmbMmFFUEYmIiN54GdkqPP2nIPm3SMl8qVj5ty0xLUvSts3kMqiFMFBy3RSr4qYwJk6ciICAAM37pKQkeHh4GDERERGR/mVkq16MoLxUoOQULE+SM15qL1zB4lRKAadSyn9ftgqUKaVEGduX2kop4GitgFwuM9Cn1E2xKm5cXV0RGxur1RYbGws7O7s8R20AQKlUQqlUFkU8IiIivXq5YHnyatHy8mmh5xlISs+WtG1zuQxvlVK8Upy8KFDK2CpRppQSTv8sc7CyMHrBIkWxKm6aN2+O3bt3a7UdOHAAzZs3N1IiIipprj1MxG8XHhh92J1Mj0otkJCapRlleZKcgecSCxYLMxnesskZTflnpOWfAiWnrcw/RYx9MStYpDBqcZOcnIw7d+5o3t+9exchISEoXbo0ypcvj4kTJ+Lhw4fYtGkTAGDo0KFYvnw5xo0bh08++QSHDx/G1q1bsWvXLmN9BCIqQR4/T8fA9X8jLjnT2FGoBLEwk2lGVbSKln8KlxenhhSagkUmM82CRQqjFjfnz59HmzZtNO9z5sb4+/tjw4YNiI6ORlRUlGZ5xYoVsWvXLowZMwbLli1DuXLlsHbtWl4GTkQGp1YLfLX1MuKSM1G5jA261HM3diQyMTIADtYWWoVMmVJK2FmZs2CRSCZEyRpbTUpKgr29PRITE2FnZ2fsOERUTPxw9A4W7A2DpYUcf454B1VdbI0diahEkfL9Xazm3BBRyZOZrYZKbdx/g119mIhF+28BAGZ0r83ChugNx+KGiN44z1Iyse96DP66Eo1T4XEwcm2j0a2+O/o05q0kiN50LG6I6I2QlJ6F/ddj8deVRzh5Ow7Zb0pF84/qLraY06sO5z4QFQMsbojIaJIzsnEoNBZ/Xo7G8VtPkKlSa5bVdLND13pu6FzXDc62xr9XlbXCjIUNUTHB4oaIilRapgqHbz7GX1ce4fDNx8jI/regqepcCl3rua
NLPTdUcS5lxJREVJyxuCEyEWcinuJMxFNjxyhQ+JMUHAqNRWqmStNW0ckGXeu5oWs9d1R35URdInp9LG6IirnkjGzM3hWKX/+O+u/Ob4hyjlboWs8dXeu5oba7HU/3EJFesbghKsZOhz/F179dxoNnaQCALvXc4GhtYeRU+bO3skD7Wq6oX86eBQ0RGQyLG6JiKD1Lhfl7b2J9cCQAoKyDFb79oD6aV37LuMGIiN4ALG6IiplLUc/w1bbLiHiSAgD4qKkHJnephVJK/joTEQEsbogMRgiBlcfCsfp4BDKy1P+9go7Ssl5MxnWxU2Lee/XQprqz3rZNRGQKWNwQGYBKLTDjz+vYdPqeQbbf08sdM7rXgf0bPL+GiMhYWNwQ6VlGtgpjtoRg99UYyGTAlC610KGWi962b6Uwg1Mp49/UjojoTcXihkiPnqdnYcimCzgd8RQWZjIs8fNC13ruxo5FRFSisLgh0pPHz9MxcN053IhOQimlOX7s3wgtqjgZOxYRUYnD4oZIDyLjUjBg3d+Iik+FUykFNgxqijpl7Y0di4ioRGJxQ/Sarj1MxMD1fyMuORPlS1tj0ydN4elkY+xYREQlFosbotcQfCcOQzadR0qmCrXc7LDhkyZwtrU0diwiohKNxQ1RIf115RHGbAlBlkrAu/Jb+LF/I9ha8tJsIiJjY3FDlI/4lEwM3nQeMYnpeS5/lJgGIYAudd2w2K8+lOZmRZyQiIjywuKGKB/nIuNx4d6zAvsMaF4Bgd1qw0zOh0ASEb0pWNwQ5UOIF/+t4WqL+e/Vy7Xc3sqCE4eJiN5ALG6I/kMppTnqezgYOwYREemIxQ2ZrPvxqbj8IKHQ61++X/h1iYjIeFjckElSqwV6/RCMuOTM194W59MQERUvLG7IJKmF0BQ2jSs4wtyscAWKuVyOT9+pqM9oRERkYCxuyOT95N8E9ta8/wwRUUkhN3YAIiIiIn3iyA0VqQM3YrHiyB2o1MKg+xEw7PaJiOjNxeKGitSm05EIKcKrkOytLGCl4J2DiYhKEhY3VKRyRmw+b1UJb1d6y+D7q+FmC4U5z74SEZUkLG7IKGq526FNDWdjxyAiIhPE4oYM7snzDNx+/BwAkJCaZeQ0RERk6ljckEFlZKvQfsmxXEWNXMYb4xERkWGwuCGDSslQaQqbqs6lIJMBzraW8K5s+Pk2RERUMrG4oSKz78tWkPNRBkREZGC8jISIiIhMCosbIiIiMiksboiIiMikFGrOTVZWFmJiYpCamooyZcqgdOnS+s5FREREVCg6j9w8f/4cK1euROvWrWFnZwdPT0/UrFkTZcqUQYUKFTB48GCcO3fOkFmJiIiI/pNOxc3ixYvh6emJ9evXw8fHBzt27EBISAhu3bqF06dPIzAwENnZ2ejQoQM6duyI27dvGzo3ERERUZ50Oi117tw5HD9+HLVr185zedOmTfHJJ59g1apVWL9+PU6cOIGqVavqNSgRERGRLnQqbn799VedNqZUKjF06NDXCkRERET0Oni1FBEREZkUScXN5cuX8c033+CHH35AXFyc1rKkpCR88skneg1HREREJJXOxc3+/fvRtGlTBAUFYf78+ahRowaOHDmiWZ6WloaNGzcaJCQRERGRrnQubqZPn46xY8fi2rVriIyMxLhx49C9e3fs3bvXkPmomNtzLRoAYKMwAx8ETkRERUHnm/hdv34dP//8MwBAJpNh3LhxKFeuHN5//30EBQWhSZMmBgtJxVNYzHPM/PMGAGBUu6qQsbohIqIioHNxo1QqkZCQoNXWt29fyOVy+Pn5YdGiRfrORsVYWqYKI365iIxsNVpXK4PBLSsZOxIREZUQOhc3Xl5eOHLkCBo1aqTV/uGHH0IIAX9/f72Ho+Jrxp/XcftxMsrYKrGoT33I5Ry1ISKioqFzcTNs2DAcP348z2UfffQRhBBYs2aN3oJR8XU6/CmCzt2HTAYs9fOCUymlsSMREVEJIhNCCGOHKEpJSU
mwt7dHYmIi7OzsjB3HJG06HYlpf1xH2xrOWDeQc7GIiOj1Sfn+5k38yGAsLfjjRURERY/fPkRERGRSWNwQERGRSWF
"text/plain": [
"<Figure size 640x480 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"from sklearn.ensemble import RandomForestClassifier\n",
"\n",
"# Score the model with default parameters\n",
"score_the_model(\n",
" model=RandomForestClassifier(),\n",
" model_name='Random Forest',\n",
" random_seed=42,\n",
" X_train=X_train,\n",
" X_test=X_test,\n",
" y_train=y_train,\n",
" y_test=y_test,\n",
" plot=True\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 47,
"id": "47db95d0",
"metadata": {},
"outputs": [],
"source": [
"\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"\n",
"# Put models in a dictionary\n",
"models = {\n",
" \"Logistic Regression\": LogisticRegression(max_iter=120),\n",
" \"KNN\": KNeighborsClassifier(),\n",
" \"Random Forest\": RandomForestClassifier(),\n",
" \"Decision Tree\": DecisionTreeClassifier(),\n",
"} \n",
"\n",
"# Create a function to fit and score models\n",
"def fit_and_score(models, X_train, X_test, y_train, y_test):\n",
" \"\"\"\n",
" Fits and evaluates given machine learning models.\n",
" models: dict of different Scikit-Learn machine learning models\n",
" X_train: training data (no labels)\n",
" X_test: testing data (no labels)\n",
" y_train: training labels\n",
" y_test: test labels\n",
" \"\"\"\n",
"\n",
" # Set random seed\n",
" np.random.seed(42)\n",
"\n",
" # Make a dictionary to keep model scores\n",
" model_scores = {}\n",
"\n",
" # Loop through models\n",
" for name, model in models.items():\n",
" # Fit the model to the data\n",
" model.fit(X_train, y_train)\n",
" # Evaluate the model and append its score to model_scores\n",
" model_scores[name] = model.score(X_test, y_test)\n",
"\n",
" return model_scores"
]
},
{
"cell_type": "code",
"execution_count": 48,
"id": "71a1be34",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/home/gasperspagnolo/Documents/faks_git/is_assignments/a2/code/.venv/lib64/python3.10/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):\n",
"STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.\n",
"\n",
"Increase the number of iterations (max_iter) or scale the data as shown in:\n",
" https://scikit-learn.org/stable/modules/preprocessing.html\n",
"Please also refer to the documentation for alternative solver options:\n",
" https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression\n",
" n_iter_i = _check_optimize_result(\n"
]
},
{
"data": {
"text/plain": [
"{'Logistic Regression': 0.84688995215311,\n",
" 'KNN': 0.7511961722488039,\n",
" 'Random Forest': 0.8229665071770335,\n",
" 'Decision Tree': 0.784688995215311}"
]
},
"execution_count": 48,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"fit_and_score(models, X_train, X_test, y_train, y_test)"
]
},
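{
"cell_type": "markdown",
"id": "a1scaling01",
"metadata": {},
"source": [
"The `ConvergenceWarning` above recommends either raising `max_iter` or scaling the data. The following is a minimal sketch of the scaling option, not the approach used in this notebook. It assumes synthetic data from `make_classification` as a stand-in for the biodegradability features; `X_demo`, `y_demo`, `scaled_logreg` and `demo_accuracy` are illustrative names.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.pipeline import make_pipeline\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Synthetic stand-in for the real data (assumption: 41 numeric features)\n",
"X_demo, y_demo = make_classification(n_samples=500, n_features=41, random_state=42)\n",
"X_tr, X_te, y_tr, y_te = train_test_split(\n",
"    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo\n",
")\n",
"\n",
"# Putting the scaler in the pipeline means it is fitted on the training split only\n",
"scaled_logreg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))\n",
"scaled_logreg.fit(X_tr, y_tr)\n",
"\n",
"demo_accuracy = scaled_logreg.score(X_te, y_te)\n",
"print(f'Test accuracy: {demo_accuracy:.3f}')\n",
"```\n",
"\n",
"If this turns out to help on the real data, such a pipeline could take the place of the plain `LogisticRegression` entry in the `models` dictionary."
]
},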
{
"cell_type": "markdown",
"id": "3dafbf40",
"metadata": {},
"source": [
"### 2.3 Evaluation\n",
"Given that the data set is not in the “big data” category, implement a cross-validation procedure based\n",
"on five folds (of approximately equal size) of your data. Furthermore, repeat the experiment 10 times with\n",
"different folds and average the results (include standard deviation). You are expected to report the following\n",
"metrics:\n",
"- F1\n",
"- Precision\n",
"- Recall\n",
"- AUC\n",
"\n",
"Comment on the performance of algorithms and visualize their final scores. How do they perform against\n",
"the random baseline? What about the constant one? How do different learning scenarios impact the final\n",
"score? Are the differences between the models statistically significant?"
]
},
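{
"cell_type": "markdown",
"id": "a2repeatcv01",
"metadata": {},
"source": [
"The procedure described above (five folds, repeated 10 times, mean and standard deviation per metric, compared against a random and a constant baseline) could be sketched roughly as follows. This is only one possible setup: it assumes scikit-learn's `RepeatedStratifiedKFold` and `cross_validate`, uses synthetic data in place of the real data set, and the names `X_demo`, `y_demo`, `candidates` and `summary` are illustrative.\n",
"\n",
"```python\n",
"from sklearn.datasets import make_classification\n",
"from sklearn.dummy import DummyClassifier\n",
"from sklearn.ensemble import RandomForestClassifier\n",
"from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate\n",
"\n",
"# Synthetic stand-in for the real data (assumption)\n",
"X_demo, y_demo = make_classification(n_samples=300, n_features=20, random_state=0)\n",
"\n",
"# 5 folds, repeated 10 times with different splits: 50 scores per metric\n",
"cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)\n",
"scoring = ['f1', 'precision', 'recall', 'roc_auc']\n",
"\n",
"candidates = {\n",
"    'Random Forest': RandomForestClassifier(n_estimators=50, random_state=42),\n",
"    'Constant baseline': DummyClassifier(strategy='most_frequent'),\n",
"    'Random baseline': DummyClassifier(strategy='stratified', random_state=42),\n",
"}\n",
"\n",
"summary = {}\n",
"for name, estimator in candidates.items():\n",
"    results = cross_validate(estimator, X_demo, y_demo, cv=cv, scoring=scoring)\n",
"    # Store (mean, standard deviation) over all 50 folds for each metric\n",
"    summary[name] = {\n",
"        metric: (results[f'test_{metric}'].mean(), results[f'test_{metric}'].std())\n",
"        for metric in scoring\n",
"    }\n",
"\n",
"for name, metrics in summary.items():\n",
"    line = ', '.join(f'{m}: {mean:.3f} +/- {std:.3f}' for m, (mean, std) in metrics.items())\n",
"    print(f'{name}: {line}')\n",
"```\n",
"\n",
"The constant baseline may trigger an undefined-precision warning when it never predicts the positive class. The per-fold scores in `results` could also serve as input to a significance test between models."
]
},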
{
"cell_type": "markdown",
"id": "1bd730c6",
"metadata": {},
"source": [
"TODO: explain here in a bit more detail how these scoring metrics work."
]
},
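{
"cell_type": "markdown",
"id": "a3metrics01",
"metadata": {},
"source": [
"As a starting point for that explanation, here is a small worked example of how the reported metrics are computed. The confusion-matrix counts and the classifier scores are made up for illustration and do not come from the models in this notebook.\n",
"\n",
"```python\n",
"# Hypothetical confusion-matrix counts for the positive class\n",
"tp, fp, fn, tn = 40, 10, 5, 45\n",
"\n",
"precision = tp / (tp + fp)  # share of predicted positives that are correct\n",
"recall = tp / (tp + fn)     # share of actual positives that were found\n",
"f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two\n",
"accuracy = (tp + tn) / (tp + fp + fn + tn)\n",
"\n",
"# AUC: probability that a random positive is scored above a random negative\n",
"# (ties count as one half); the scores below are hypothetical\n",
"pos_scores = [0.9, 0.8, 0.4]\n",
"neg_scores = [0.7, 0.3, 0.2]\n",
"pairs = [(p, n) for p in pos_scores for n in neg_scores]\n",
"auc = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)\n",
"\n",
"print(f'precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}')\n",
"print(f'accuracy={accuracy:.3f} auc={auc:.3f}')\n",
"```\n",
"\n",
"Precision, recall and F1 depend on which class is treated as positive, so the choice of positive label should be stated next to the reported numbers."
]
},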
{
"cell_type": "markdown",
"id": "addfc3ea",
"metadata": {},
"source": [
"## Report and presentation\n",
"The assignment has to be submitted in the form of two files: a markdown file and a PDF file created from\n",
"the R Studio markdown file (in RStudio → file - new file - R Markdown), where you write both the code,\n",
"as well as the text of answers (echo = T option must be enabled for each code block). Markdown files can\n",
"easily be exported to PDF using the “Knit” button in R Studio. If you are using Python, you can produce a\n",
"similar report with Jupyter Notebook."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.8"
},
"vscode": {
"interpreter": {
"hash": "73efbd7de9807940366a2e2c585910074bc00282bd7f8b3dae7eb06897ea8ebf"
}
}
},
"nbformat": 4,
"nbformat_minor": 5
}