Chapter 2, Part 2. Use of Machine Learning and Genetic Algorithms

Installation

Packages/modules required to run the program:

pip install numpy
pip install pandas
pip install matplotlib
pip install scikit-learn
pip install tpot

We use "import" to load the libraries we need.

NumPy: package for scientific computing.
pandas: tools for data analysis.
Matplotlib: library for generating simple and powerful graphic visualizations.
Scikit-learn: machine learning module built on top of SciPy.
TPOT: automated machine learning tool whose interface is designed to be as similar as possible to scikit-learn. TPOT can also be used for regression problems through the TPOTRegressor class. Aside from the class name, a TPOTRegressor is used in the same way as a TPOTClassifier.

This program was run on the Anaconda distribution using a Python kernel.
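Before running the notebook, it can help to confirm that the packages listed above are actually installed in the active environment. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` and the `REQUIRED` list are illustrative and not part of the original program.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as they are passed to pip in the list above.
REQUIRED = ["numpy", "pandas", "matplotlib", "scikit-learn", "tpot"]

def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

versions = installed_versions(REQUIRED)
for name, ver in versions.items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Any package reported as "not installed" can then be added with the corresponding pip command.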

In [4]:

            
# Packages required to run the program
#!pip install numpy
#!pip install pandas
#!pip install matplotlib
#!pip install scikit-learn
#!pip install xgboost
#!pip install tpot

# Import the necessary modules
import pandas as pd
import numpy as np
#import tpot as tpot
from tpot import TPOTClassifier
from tpot import TPOTRegressor
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

Reading input files with asteroid family data

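The input file "fam_nov" (variable filename2) is read into the dataframe data2_df.
The input file "prop_el_all" (variable filename) lists the asteroids in the family region and is read into the dataframe data1_df. Each asteroid in the region is then labeled 1 if its identifier also appears in "fam_nov" and 0 otherwise.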

In [7]:

            
#Opening the input file "fam_nov" as a dataframe.
filename2='fam_nov'
data2_df=pd.read_csv(str(filename2),
                 skiprows=1,  #Changed to skip reading the index
                 header=None,
                 sep=r'\s+',  #Columns are separated by whitespace
                 index_col=None,
                 names=['Id_or', 'a', 'e', 'sin(i)', 'H', '?'],
                 low_memory=False,
                 dtype={'Id_or':np.int64,  #64-bit integers: asteroid identifiers exceed the int8 range (-128 to 127)
                        'a':np.float64,
                        'e':np.float64,
                        'sin(i)':np.float64,
                        'H':np.float64,
                        '?':np.float64,
                 }
)

#Opening the input file "prop_el_all" as a dataframe for the family region.
filename='prop_el_all' #family region file
data1_df=pd.read_csv(str(filename),
                 skiprows=0,
                 header=None,
                 sep=r'\s+',  #Columns are separated by whitespace
                 index_col=None,
                 names=['Id_prop', 'a', 'e', 'sin(i)', 'n', 'g', 's', 'H', 'LE','final','present'],
                 low_memory=False,
                 dtype={'Id_prop':np.float64,
                        'a':np.float64,
                        'e':np.float64,
                        'sin(i)':np.float64,
                        'n':np.float64,
                        'g':np.float64,
                        's':np.float64,
                        'H':np.float64,
                        'LE':np.float64,
                 }
)

#Label each asteroid in the region: 1 if its identifier appears in "fam_nov", 0 otherwise.
data1_df['present']=data1_df['Id_prop'].isin(data2_df['Id_or']).astype(int)

print('Input file 1:',data1_df)

n_data = data1_df.shape[0]  #Number of asteroids in the region
X_data = data1_df.iloc[:,1:4].values  #Features: columns a, e, sin(i)
y_data = data1_df['present'].to_numpy()  #Labels: 0 or 1
y_data=y_data.astype('int')

Input file 1:         Id_prop         a         e    sin(i)          n           g  \
0          10.0  3.141802  0.135780  0.088953  64.621686  128.701534   
1          16.0  2.922128  0.102900  0.044069  72.049849   76.934743   
2          22.0  2.909599  0.087946  0.218235  72.518789   56.847816   
3          24.0  3.134510  0.153380  0.018968  64.845876  132.105084   
4          33.0  2.866142  0.297306  0.034773  74.166350   83.752129   
...         ...       ...       ...       ...        ...         ...   
11273  162468.0  3.122736  0.198642  0.291439  65.218781   71.130360   
11274  188330.0  3.187104  0.147355  0.271327  63.253524  126.382193   
11275  189818.0  3.203046  0.174452  0.229950  62.775911  191.750135   
11276  637410.0  3.164595  0.208793  0.175752  63.923242  139.684601   
11277  639326.0  3.141262  0.208871  0.174122  64.639614  117.756444   

                s      H      LE      final present  
0      -97.083999   5.48    14.8         10       1  
1      -73.319571   6.05  2000.0         16       1  
2      -63.378817   6.50    44.4         22       1  
3     -103.436261   7.27   416.7         24       1  
4     -108.512172   8.69    10.9         33       1  
...           ...    ...     ...        ...     ...  
11273  -85.444693  13.81    13.2     162468       0  
11274  -84.432335  13.78    47.4     188330       0  
11275 -102.653468  13.96    23.2     189818       0  
11276 -115.005772  -9.99    69.4  2010BF123       0  
11277 -110.011323  -9.99    19.3  2010OA121       0  

[11278 rows x 11 columns]
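The labeling step in the cell above can be illustrated on a toy table. This is a sketch only: the two in-memory "files" and the names `region_df`, `family_df`, `X_toy` and `y_toy` are invented for the illustration, and the real files have more columns.

```python
import io
import pandas as pd

# Toy stand-ins for the two input files (values invented for illustration).
region_txt = """\
10 3.1418 0.1358 0.0890
16 2.9221 0.1029 0.0441
22 2.9096 0.0879 0.2182
"""
family_txt = """\
10
22
"""

region_df = pd.read_csv(io.StringIO(region_txt), sep=r"\s+", header=None,
                        names=["Id_prop", "a", "e", "sin(i)"])
family_df = pd.read_csv(io.StringIO(family_txt), sep=r"\s+", header=None,
                        names=["Id_or"])

# 1 if the identifier appears in the family list, 0 otherwise.
region_df["present"] = region_df["Id_prop"].isin(family_df["Id_or"]).astype(int)

# Feature matrix (a, e, sin(i)) and label vector, as in the cell above.
X_toy = region_df[["a", "e", "sin(i)"]].to_numpy()
y_toy = region_df["present"].to_numpy()
print(region_df)
```

Here the asteroids with identifiers 10 and 22 receive the label 1, and asteroid 16, which is absent from the toy family list, receives 0.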

2.3.1 Using Genetic algorithms to optimize machine learning prediction.

Genetic algorithms are used to identify the most appropriate machine learning pipeline for a particular task. After manually cleaning the raw data, the user provides a few simple inputs:

Generations: number of iterations of the genetic algorithm,
Population size: number of individuals (pipelines) retained in the population each generation,
Cross-validation (cv): number of folds K used to evaluate each pipeline, that is, the number of groups into which the data sample is divided, and
Random state: seed of the random number generator, for reproducibility.
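To make the role of K concrete, the sketch below splits sample indices into K groups and uses each group once for validation while the others are used for training. It is a simplified, unstratified illustration in plain Python; the function name `kfold_indices` is hypothetical, and library implementations may additionally shuffle the data or stratify the folds by class.

```python
def kfold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds; return (train, validation) pairs."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        # The first `extra` folds get one additional sample each.
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, validation))
    return splits

splits = kfold_indices(10, 5)
for train, validation in splits:
    print("train:", train, "validation:", validation)
```

With 10 samples and K = 5, each validation group holds 2 samples and each training set holds the remaining 8. The score of a pipeline is then typically the average over the K validation groups.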

In [9]:

            
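# Hold out 25% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data,
                                                    train_size=0.75, test_size=0.25)

# Configure the genetic search: 5 generations, 20 pipelines per generation, 5-fold cross-validation.
pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=55, verbosity=2)

# Run the search, then print the score of the best pipeline on the test set.
pipeline_optimizer.fit(X_train, y_train)
print(pipeline_optimizer.score(X_test, y_test))

# Save the best pipeline as a standalone Python script.
pipeline_optimizer.export('tpot_exported_pipeline_genetico.py')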

Optimization Progress:   0%|          | 0/120 [00:00<?, ?pipeline/s]

Generation 1 - Current best internal CV score: 0.9976353745947465

Generation 2 - Current best internal CV score: 0.9976353745947465

Generation 3 - Current best internal CV score: 0.9976353745947465

Generation 4 - Current best internal CV score: 0.9976353745947465

Generation 5 - Current best internal CV score: 0.9976353745947465

Best pipeline: ExtraTreesClassifier(input_matrix, bootstrap=False, criterion=gini, max_features=0.7000000000000001, min_samples_leaf=13, min_samples_split=17, n_estimators=100)
0.9975177304964539
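The total of 120 shown in the progress bar above is consistent with the settings used. As a worked check, assuming TPOT's documented behavior that the search evaluates population_size + generations × offspring_size pipelines and that offspring_size defaults to population_size, the count can be reproduced as follows (the function name `total_pipelines` is illustrative):

```python
def total_pipelines(generations, population_size, offspring_size=None):
    """Number of pipelines evaluated: initial population plus offspring of every generation."""
    if offspring_size is None:
        # Assumed default: as many offspring as individuals in the population.
        offspring_size = population_size
    return population_size + generations * offspring_size

total = total_pipelines(generations=5, population_size=20)
print(total)  # 20 + 5 * 20 = 120
```

Since each pipeline is evaluated with 5-fold cross-validation, increasing the number of generations or the population size raises the run time accordingly.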

TPOT then automatically exports the best pipeline it found as Python code, here to the file 'tpot_exported_pipeline_genetico.py'.
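The best pipeline reported in the output above is a single scikit-learn ExtraTreesClassifier, so it can also be rebuilt directly. The sketch below does this with the hyperparameters copied from the "Best pipeline" line; it is trained on synthetic three-feature data rather than the asteroid sample, so the score it prints is not comparable to the one above. The variable names and the random seeds are assumptions made for the illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split

# Synthetic data with three features, standing in for (a, e, sin(i)).
X_demo, y_demo = make_classification(n_samples=600, n_features=3, n_informative=3,
                                     n_redundant=0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, train_size=0.75,
                                          test_size=0.25, random_state=0)

# Hyperparameters copied from the "Best pipeline" line of the output above.
model = ExtraTreesClassifier(bootstrap=False, criterion="gini",
                             max_features=0.7000000000000001, min_samples_leaf=13,
                             min_samples_split=17, n_estimators=100, random_state=55)
model.fit(X_tr, y_tr)
demo_score = model.score(X_te, y_te)  # accuracy on the held-out synthetic data
print(demo_score)
```

To apply the same model to the asteroid data, X_demo and y_demo would be replaced by X_data and y_data from the earlier cell.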
