{ "cells": [ { "cell_type": "markdown", "id": "c093ea0c", "metadata": {}, "source": [ "# Seminar 2: Predicting Biodegradability of Chemicals" ] }, { "cell_type": "markdown", "id": "7aa30d7d", "metadata": {}, "source": [ "## 1. Introduction\n", "Chemicals are all around us. Studying their properties by means of machine learning is an active\n", "research field; matching molecular patterns with their behavior can be a decisive factor in the creation of\n", "new materials, drugs, and more.\n", "In this seminar assignment, your task is to explore the data and build machine-learning models that\n", "predict the biodegradability of chemicals." ] }, { "attachments": {}, "cell_type": "markdown", "id": "aeab08c8", "metadata": {}, "source": [ "## 2. Task\n", "You will work with the data set compiled by Mansouri et al. [data](https://www.openml.org/search?type=data&status=active&id=1494&sort=runs). There are 41 features and one target feature (biodegradability).\n", "The target variable is encoded as ready biodegradable (1) and not ready biodegradable (2). The data set\n", "consists of 1055 instances. Features can be either symbolic or numeric.\n", "IMPORTANT: Use the dataset provided on učilnica and NOT the one posted at the link above. It is\n", "minimally modified and split into train and test sets.\n" ] }, { "cell_type": "markdown", "id": "a4f197dd", "metadata": {}, "source": [ "### 2.1 Exploration\n", "Inspect the dataset. How balanced is the target variable? Are there any missing values present? If there\n", "are, choose a strategy that takes this into account.\n", "Most of your data is of the numeric type. Can you identify, through exploratory analysis, whether\n", "some features are directly related to the target? What about feature pairs? Produce at least three types of\n", "visualizations of the feature space and be prepared to argue why these visualizations were useful for your\n", "subsequent analysis." 
] }, { "cell_type": "code", "execution_count": 4, "id": "5bcf6290", "metadata": {}, "outputs": [], "source": [ "# Needed imports\n", "import numpy as np\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import sklearn\n", "import seaborn as sns\n", "import scikitplot as skplt\n" ] }, { "cell_type": "code", "execution_count": 5, "id": "18ff4f76", "metadata": {}, "outputs": [], "source": [ "# Load the provided train and test splits\n", "df_train = pd.read_csv('train.csv')\n", "df_test = pd.read_csv('test.csv')" ] }, { "cell_type": "markdown", "id": "ea26bfdf", "metadata": {}, "source": [ "#### Let's inspect the training and test data" ] }, { "cell_type": "code", "execution_count": 6, "id": "5933f4d7", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n",
"| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | V41 | Class |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 1 | 3.919 | 2.6909 | 0 | 0 | 0 | 0 | 0 | 31.4 | 2 | 0 | ... | 0 | 0 | 0 | 2.949 | 1.591 | 0 | 7.253 | 0 | 0 | 2 |\n",
"| 2 | 4.170 | 2.1144 | 0 | 0 | 0 | 0 | 0 | 30.8 | 1 | 1 | ... | 0 | 0 | 0 | 3.315 | 1.967 | 0 | 7.257 | 0 | 0 | 2 |\n",
"| 4 | 3.000 | 2.7098 | 0 | 0 | 0 | 0 | 0 | 20.0 | 0 | 2 | ... | 0 | 0 | 1 | 3.046 | 5.000 | 0 | 6.690 | 0 | 0 | 2 |\n",
"| 13 | 4.214 | 2.6272 | 0 | 0 | 0 | 0 | 0 | 30.0 | 3 | 0 | ... | 0 | 0 | 0 | 2.998 | 1.722 | 0 | 6.770 | 0 | 0 | 2 |\n",
"| 16 | 3.942 | 2.7719 | 1 | 0 | 0 | 0 | 0 | 31.6 | 2 | 0 | ... | 0 | 0 | 0 | 3.542 | 1.739 | 0 | 8.127 | 0 | 1 | 2 |\n", "
5 rows × 42 columns
\n", "\n",
"| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | V41 | Class |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| count | 846.000000 | 846.000000 | 846.000000 | 821.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | ... | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 821.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 | 846.000000 |\n",
"| mean | 4.790476 | 3.054551 | 0.739953 | 0.030451 | 0.946809 | 0.277778 | 1.669031 | 37.422813 | 1.342790 | 1.784870 | ... | 0.903073 | 1.241135 | 0.926714 | 3.922100 | 2.549406 | 0.671395 | 8.643191 | 0.059102 | 0.706856 | 1.333333 |\n",
"| std | 0.531991 | 0.813983 | 1.504545 | 0.198281 | 2.318081 | 1.045544 | 2.220221 | 9.030008 | 2.018433 | 1.773856 | ... | 1.526124 | 2.248684 | 1.239133 | 0.992636 | 0.625021 | 1.093633 | 1.223700 | 0.342364 | 2.145396 | 0.471683 |\n",
"| min | 2.000000 | 0.803900 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.100000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 2.279000 | 1.467000 | 0.000000 | 4.948000 | 0.000000 | 0.000000 | 1.000000 |\n",
"| 25% | 4.499000 | 2.510175 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 30.800000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 3.497000 | 2.101000 | 0.000000 | 8.009500 | 0.000000 | 0.000000 | 1.000000 |\n",
"| 50% | 4.840000 | 3.052400 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 37.850000 | 1.000000 | 1.500000 | ... | 0.000000 | 0.000000 | 1.000000 | 3.732500 | 2.461000 | 0.000000 | 8.508000 | 0.000000 | 0.000000 | 1.000000 |\n",
"| 75% | 5.119000 | 3.415725 | 1.000000 | 0.000000 | 1.000000 | 0.000000 | 3.000000 | 43.800000 | 2.000000 | 3.000000 | ... | 1.000000 | 2.000000 | 1.000000 | 3.980000 | 2.861000 | 1.000000 | 9.019750 | 0.000000 | 0.000000 | 2.000000 |\n",
"| max | 6.496000 | 7.918400 | 12.000000 | 2.000000 | 36.000000 | 13.000000 | 18.000000 | 60.700000 | 24.000000 | 12.000000 | ... | 12.000000 | 18.000000 | 7.000000 | 10.695000 | 5.750000 | 8.000000 | 14.700000 | 4.000000 | 27.000000 | 2.000000 |\n", "
8 rows × 42 columns
\n", "\n",
"| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | V41 | Class |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| 1 | 3.919 | 2.6909 | 0 | 0 | 0 | 0 | 0 | 31.4 | 2 | 0 | ... | 0 | 0 | 0 | 2.949 | 1.591 | 0 | 7.253 | 0 | 0 | 2 |\n",
"| 2 | 4.170 | 2.1144 | 0 | 0 | 0 | 0 | 0 | 30.8 | 1 | 1 | ... | 0 | 0 | 0 | 3.315 | 1.967 | 0 | 7.257 | 0 | 0 | 2 |\n",
"| 4 | 3.000 | 2.7098 | 0 | 0 | 0 | 0 | 0 | 20.0 | 0 | 2 | ... | 0 | 0 | 1 | 3.046 | 5.000 | 0 | 6.690 | 0 | 0 | 2 |\n",
"| 13 | 4.214 | 2.6272 | 0 | 0 | 0 | 0 | 0 | 30.0 | 3 | 0 | ... | 0 | 0 | 0 | 2.998 | 1.722 | 0 | 6.770 | 0 | 0 | 2 |\n",
"| 16 | 3.942 | 2.7719 | 1 | 0 | 0 | 0 | 0 | 31.6 | 2 | 0 | ... | 0 | 0 | 0 | 3.542 | 1.739 | 0 | 8.127 | 0 | 1 | 2 |\n", "
5 rows × 42 columns
\n", "\n",
"| | V1 | V2 | V3 | V4 | V5 | V6 | V7 | V8 | V9 | V10 | ... | V33 | V34 | V35 | V36 | V37 | V38 | V39 | V40 | V41 | Class |\n",
"|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n",
"| count | 209.000000 | 209.000000 | 209.00000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | ... | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 | 209.000000 |\n",
"| mean | 4.750938 | 3.130050 | 0.62201 | 0.086124 | 1.114833 | 0.339713 | 1.555024 | 35.569378 | 1.511962 | 1.880383 | ... | 0.803828 | 1.411483 | 1.100478 | 3.902612 | 2.629201 | 0.746411 | 8.574038 | 0.019139 | 0.789474 | 1.354067 |\n",
"| std | 0.603914 | 0.897556 | 1.27690 | 0.406969 | 2.393143 | 1.182566 | 2.246383 | 9.471334 | 1.721220 | 1.784023 | ... | 1.498327 | 2.374355 | 1.320857 | 1.029605 | 0.714285 | 1.077657 | 1.315016 | 0.195176 | 2.589491 | 0.479378 |\n",
"| min | 2.000000 | 1.134900 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 2.267000 | 1.576000 | 0.000000 | 4.917000 | 0.000000 | 0.000000 | 1.000000 |\n",
"| 25% | 4.414000 | 2.494500 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 29.400000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 3.401000 | 2.146000 | 0.000000 | 7.872000 | 0.000000 | 0.000000 | 1.000000 |\n",
"| 50% | 4.807000 | 3.039300 | 0.00000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 34.200000 | 1.000000 | 2.000000 | ... | 0.000000 | 0.000000 | 1.000000 | 3.694000 | 2.469000 | 0.000000 | 8.464000 | 0.000000 | 0.000000 | 1.000000 |\n",
"| 75% | 5.188000 | 3.555400 | 1.00000 | 0.000000 | 1.000000 | 0.000000 | 3.000000 | 41.200000 | 2.000000 | 3.000000 | ... | 1.000000 | 2.000000 | 2.000000 | 3.991000 | 2.967000 | 1.000000 | 9.017000 | 0.000000 | 0.000000 | 2.000000 |\n",
"| max | 6.253000 | 9.177500 | 8.00000 | 3.000000 | 16.000000 | 12.000000 | 14.000000 | 60.000000 | 9.000000 | 11.000000 | ... | 12.000000 | 18.000000 | 6.000000 | 10.355000 | 5.825000 | 6.000000 | 14.030000 | 2.000000 | 27.000000 | 2.000000 |\n", "
8 rows × 42 columns
\n", "