Chapter 2, Part 2. Use of Machine Learning and Genetic Algorithms¶
Installation¶
Packages/modules required to run the program:
pip install numpy
pip install pandas
pip install install matplotlib
pip install scikit-learn
pip install tpot
We use "import" to load the libraries we will need.
- Numpy, package is used to do scientific calculations.
- Pandas, on the other hand contains tools that help in data analysis.
- Matplotlib, used for generating simple and powerful graphic visualizations.
- Scikit-learn, module for machine learning built on top of SciPy.
- We use the TPOT interface to make it as similar as possible to scikit-learn. It is also possible to use TPOT for regression problems with the TPOTRegressor class. Aside from the class name, a TPOTRegressor is used in the same way as a TPOTClassifier.
This program was run on the Anaconda distribution using a Python kernel
In [4]:
Copied!
# Packages required to run the program
#!pip install numpy
#!pip install pandas
#!pip install install matplotlib
#!pip install scikit-learn
#!pip install xgboost
#!pip install tpot
# Import the necessary modules
import pandas as pd
import numpy as np
#import tpot as tpot
from tpot import TPOTClassifier
from tpot import TPOTRegressor
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
# Packages required to run the program
#!pip install numpy
#!pip install pandas
#!pip install install matplotlib
#!pip install scikit-learn
#!pip install xgboost
#!pip install tpot
# Import the necessary modules
import pandas as pd
import numpy as np
#import tpot as tpot
from tpot import TPOTClassifier
from tpot import TPOTRegressor
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
Reading input files with asteroid family data
- Opening the input files as "fam_nov" is saved with filename2.
- The input files "prop_el_all" is filename for reading asteroids in the family region.
In [7]:
Copied!
filename2='fam_nov'
data2_df=pd.read_csv(str(filename2),
skiprows=1, #Changed to skip reading the index
header=None,
delim_whitespace=True,
index_col=None,
names=['Id_or', 'a', 'e', 'sin(i)', 'H', '?'],
low_memory=False,
dtype={'Id_or':np.int8,
'a':np.float64,
'e':np.float64,
'sin(i)':np.float64,
'H':np.float64,
'?':np.float64,
}
)
#Opening the input file "prop_el_all" as a dataframe for the family region.
filename='prop_el_all' #family region file
data1_df=pd.read_csv(str(filename),
skiprows=0,
header=None,
delim_whitespace=True,
index_col=None,
names=['Id_prop', 'a', 'e', 'sin(i)', 'n', 'g', 's', 'H', 'LE','final','present'],
low_memory=False,
dtype={'Id_prop':np.float64,
'a':np.float64,
'e':np.float64,
'sin(i)':np.float64,
'n':np.float64,
'g':np.float64,
's':np.float64,
'H':np.float64,
'LE':np.float64,
}
)
data1_df['present']=data1_df['Id_prop'].isin(data2_df['Id_or'])
data1_df.loc[(data1_df['present']==False), 'present'] = 0
data1_df.loc[(data1_df['present']==True), 'present'] = 1
print('Input file 1:',data1_df)
n_data = data1_df.shape[0]
X_data = data1_df.iloc[:,1:4].values
y_data = data1_df.present.to_numpy()
y_data=y_data.astype('int')
filename2='fam_nov'
data2_df=pd.read_csv(str(filename2),
skiprows=1, #Changed to skip reading the index
header=None,
delim_whitespace=True,
index_col=None,
names=['Id_or', 'a', 'e', 'sin(i)', 'H', '?'],
low_memory=False,
dtype={'Id_or':np.int8,
'a':np.float64,
'e':np.float64,
'sin(i)':np.float64,
'H':np.float64,
'?':np.float64,
}
)
#Opening the input file "prop_el_all" as a dataframe for the family region.
filename='prop_el_all' #family region file
data1_df=pd.read_csv(str(filename),
skiprows=0,
header=None,
delim_whitespace=True,
index_col=None,
names=['Id_prop', 'a', 'e', 'sin(i)', 'n', 'g', 's', 'H', 'LE','final','present'],
low_memory=False,
dtype={'Id_prop':np.float64,
'a':np.float64,
'e':np.float64,
'sin(i)':np.float64,
'n':np.float64,
'g':np.float64,
's':np.float64,
'H':np.float64,
'LE':np.float64,
}
)
data1_df['present']=data1_df['Id_prop'].isin(data2_df['Id_or'])
data1_df.loc[(data1_df['present']==False), 'present'] = 0
data1_df.loc[(data1_df['present']==True), 'present'] = 1
print('Input file 1:',data1_df)
n_data = data1_df.shape[0]
X_data = data1_df.iloc[:,1:4].values
y_data = data1_df.present.to_numpy()
y_data=y_data.astype('int')
Input file 1: Id_prop a e sin(i) n g \ 0 10.0 3.141802 0.135780 0.088953 64.621686 128.701534 1 16.0 2.922128 0.102900 0.044069 72.049849 76.934743 2 22.0 2.909599 0.087946 0.218235 72.518789 56.847816 3 24.0 3.134510 0.153380 0.018968 64.845876 132.105084 4 33.0 2.866142 0.297306 0.034773 74.166350 83.752129 ... ... ... ... ... ... ... 11273 162468.0 3.122736 0.198642 0.291439 65.218781 71.130360 11274 188330.0 3.187104 0.147355 0.271327 63.253524 126.382193 11275 189818.0 3.203046 0.174452 0.229950 62.775911 191.750135 11276 637410.0 3.164595 0.208793 0.175752 63.923242 139.684601 11277 639326.0 3.141262 0.208871 0.174122 64.639614 117.756444 s H LE final present 0 -97.083999 5.48 14.8 10 1 1 -73.319571 6.05 2000.0 16 1 2 -63.378817 6.50 44.4 22 1 3 -103.436261 7.27 416.7 24 1 4 -108.512172 8.69 10.9 33 1 ... ... ... ... ... ... 11273 -85.444693 13.81 13.2 162468 0 11274 -84.432335 13.78 47.4 188330 0 11275 -102.653468 13.96 23.2 189818 0 11276 -115.005772 -9.99 69.4 2010BF123 0 11277 -110.011323 -9.99 19.3 2010OA121 0 [11278 rows x 11 columns]
2.3.1 Using Genetic algorithms to optimize machine learming prediction.¶
Genetic algorithms are used to identify the most appropriate machine learning process for a particular task. The user must provide simple inputs after manually cleaning the raw user data:
- Generations: number of genetic algorithm training iterations,
- Population size: number of individuals retained in the population each generation,
- Crossvalidation - cv: used to evaluate each pipeline using a simple parameter K, which corresponds to the number of groups into which the data sample is divided, and
- Random state: random number generator seed for reproducibility.
In [9]:
Copied!
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data,
train_size=0.75, test_size=0.25)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
random_state=55, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_exported_pipeline_genetico.py')
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data,
train_size=0.75, test_size=0.25)
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
random_state=55, verbosity=2)
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))
pipeline_optimizer.export('tpot_exported_pipeline_genetico.py')
Optimization Progress: 0%| | 0/120 [00:00<?, ?pipeline/s]
Generation 1 - Current best internal CV score: 0.9976353745947465 Generation 2 - Current best internal CV score: 0.9976353745947465 Generation 3 - Current best internal CV score: 0.9976353745947465 Generation 4 - Current best internal CV score: 0.9976353745947465 Generation 5 - Current best internal CV score: 0.9976353745947465 Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.7000000000000001, min_samples_leaf=13, min_samples_split=17, n_estimators=100) 0.9975177304964539
The tool then automatically produces the machine learning pipeline’s best model.
In [ ]:
Copied!