import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Steel plates faults data set from OpenML
df = pd.read_csv('https://www.openml.org/data/get_csv/1592296/php9xWOpn')
# The seven binary fault indicator columns
predictors = ['V28', 'V29', 'V30', 'V31', 'V32', 'V33', 'Class']
df['Class'] -= 1
Simple Neural Networks in Python
Neural networks (NNs) have become incredibly popular due to their high level of accuracy. Building an NN can be complicated and can involve a high level of customization. I wanted to explore just the simplest NN that you could create: a framework that can serve as a workhorse for developing new NNs.
The scikit-learn library provides the easiest solution with its Multi-Layer Perceptron family of estimators. It doesn’t provide many of the more advanced features of TensorFlow, like GPU support, but that is not what I’m looking for.
Initialization
For the demonstration, I decided to use a data set on faults found in steel plates from the OpenML website. The data set includes 27 features and 7 binary target columns, which the code refers to as predictors.
Since there are multiple binary target columns, I needed to create a single class variable to represent each class. The Class column doesn’t currently represent this; it marks all faults that don’t fit in the categories V28 to V33. The single class variable was created with the np.argmax function, which returns the index of the highest value across the predictor columns in each row.
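# Collapse the seven indicator columns into one label per row
y = np.argmax(df[predictors].values, axis=1)
# Every column that is not an indicator is a feature
X = df.drop(predictors, axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)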
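To make the np.argmax step concrete, here is a minimal sketch on a made-up indicator matrix. The three columns and four rows are invented for illustration and are not taken from the steel plates data. Each row holds a single 1, and argmax returns the column position of that 1, which becomes the class label.

```python
import numpy as np

# Hypothetical one-hot indicator matrix: 4 samples, 3 fault types.
# These values are invented for illustration only.
indicators = np.array([
    [1, 0, 0],
    [0, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
])

# axis=1 searches across the columns of each row and returns the
# position of the largest value, i.e. the column holding the 1.
labels = np.argmax(indicators, axis=1)
print(labels)  # [0 2 1 2]
```

If a row contained more than one 1, argmax would return the first such column, so this trick relies on the indicator columns being mutually exclusive.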
Modelling
This is the most basic model that I would like to evaluate. I’ve used GridSearchCV so that all combinations of parameters are tested. The only parameter I wanted to examine was the size of the hidden layers. Each entry in the grid is a tuple, where each number is the number of nodes in a single layer. Multiple numbers represent additional layers.
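As a rough illustration of how to read a hidden layer tuple, the sketch below lists the shape of each weight matrix a fully connected network would need. The helper layer_shapes is hypothetical (it is not part of scikit-learn), and the 27 inputs and 7 classes are the counts quoted earlier in this post.

```python
def layer_shapes(n_features, hidden_layer_sizes, n_classes):
    """Return the (inputs, outputs) shape of each weight matrix.

    Hypothetical helper for illustration; it is not part of scikit-learn.
    """
    sizes = [n_features, *hidden_layer_sizes, n_classes]
    # Pair every layer size with the size of the layer that follows it.
    return list(zip(sizes[:-1], sizes[1:]))


# One hidden layer of 100 nodes.
print(layer_shapes(27, (100,), 7))      # [(27, 100), (100, 7)]
# Two hidden layers of 100 nodes each.
print(layer_shapes(27, (100, 100), 7))  # [(27, 100), (100, 100), (100, 7)]
```

Note that in Python (100) is just the integer 100, not a tuple; a one-element tuple needs a trailing comma, as in (100,).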
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV
# (1,) and (100,) need the trailing comma to be one-element tuples
parameters = {'hidden_layer_sizes': [(1,), (100,), (100,100), (100,100,100),
    (100,100,100,100),
    (100,100,100,100,100),
    (100,100,100,100,100,100),
    (100,100,100,100,100,100,100),
    (100,100,100,100,100,100,100,100),
    (100,100,100,100,100,100,100,100,100),
    (100,100,100,100,100,100,100,100,100,100)]}
model = MLPClassifier(random_state=1, max_iter=10000,
                      solver='adam', learning_rate='adaptive')
grid = GridSearchCV(estimator=model, param_grid=parameters)
grid.fit(X_train, y_train)
print(grid.best_score_)
0.4054982817869416
The performance of the best model in the grid is not impressive. It took me a while to realize that I had forgotten to scale the features. I included this error to show how much scaling matters for model performance.
Feature Scaling
The features are simply scaled with StandardScaler. The scaler is fit on the training features only and then applied to both the training and test features. The same model is then used on the scaled features.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
scaler = sc.fit(X_train)
X_train_sc = scaler.transform(X_train)
X_test_sc = scaler.transform(X_test)
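As a rough picture of what the scaling step computes, the following NumPy sketch standardizes a made-up array by hand. It assumes the usual definition of standardization (subtract the column mean, then divide by the column standard deviation). The toy numbers and the toy_ variable names are invented, and this is not the StandardScaler source code.

```python
import numpy as np

# Invented toy data: 2 features on very different scales.
toy_train = np.array([[1.0, 1000.0],
                      [2.0, 2000.0],
                      [3.0, 3000.0]])
toy_test = np.array([[2.0, 4000.0]])

# Learn the column means and standard deviations from the training data only.
toy_mean = toy_train.mean(axis=0)
toy_std = toy_train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same training statistics to both sets.
toy_train_sc = (toy_train - toy_mean) / toy_std
toy_test_sc = (toy_test - toy_mean) / toy_std

print(toy_train_sc.mean(axis=0))  # each column is centered close to 0
print(toy_train_sc.std(axis=0))   # each column has a spread close to 1
print(toy_test_sc)                # test rows reuse the training statistics
```

After scaling, both columns live on a comparable scale, so no single feature dominates the weight updates simply because of its units.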
# (1,) and (100,) need the trailing comma to be one-element tuples
parameters = {'hidden_layer_sizes': [(1,), (100,), (100,100), (100,100,100),
    (100,100,100,100),
    (100,100,100,100,100),
    (100,100,100,100,100,100),
    (100,100,100,100,100,100,100),
    (100,100,100,100,100,100,100,100),
    (100,100,100,100,100,100,100,100,100),
    (100,100,100,100,100,100,100,100,100,100)]}
model = MLPClassifier(random_state=1, max_iter=10000,
                      solver='adam', learning_rate='adaptive')
grid = GridSearchCV(estimator=model, param_grid=parameters, cv=3)
grid.fit(X_train_sc, y_train)
grid.best_score_
0.7553264604810996
The performance of the scaled model is much more impressive. After GridSearchCV finds the parameters for the best model, it retrains that model on the entire training set passed to fit. This is because the search uses cross-validation, so during the search part of the training data was withheld in each fold to compare the different models.
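In the code above, the scaler was fit on the full training set before the grid search, so the validation folds inside GridSearchCV had already contributed to the scaling statistics. One possible refinement, sketched here on synthetic data rather than the steel plates set, is to wrap the scaler and the classifier in a Pipeline so that scaling is refit inside every fold. The step names 'scale' and 'mlp', the demo_ variable names, and the small layer sizes are my own choices for a quick example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the steel plates set) so the sketch runs offline.
demo_X, demo_y = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=1)
demo_X_train, demo_X_test, demo_y_train, demo_y_test = train_test_split(
    demo_X, demo_y, random_state=1)

# The scaler is refit on the training part of every cross-validation fold.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('mlp', MLPClassifier(random_state=1, max_iter=2000)),
])

# Pipeline parameters are addressed as <step name>__<parameter name>.
demo_parameters = {'mlp__hidden_layer_sizes': [(10,), (10, 10)]}

demo_grid = GridSearchCV(estimator=pipe, param_grid=demo_parameters, cv=3)
demo_grid.fit(demo_X_train, demo_y_train)

print(demo_grid.best_params_)
print(demo_grid.best_score_)
# The pipeline scales the test features itself, so the raw features are passed in.
print(demo_grid.score(demo_X_test, demo_y_test))
```

With this layout there is no separate scaled copy of the data to keep track of, which also removes the risk of scoring the model on unscaled features by mistake.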
Conclusion
With our model constructed, we can now test its performance on the original test set. It is important to remember to use the scaled test features, as that is what the model is expecting.
grid.score(X_test_sc, y_test)
0.7304526748971193
The results are pretty satisfactory: a decent level of accuracy without a lot of complicated code. Default values were used whenever they were appropriate. Additional steps could be taken, but this remains a good foundation for future exploratory analysis.
Photo by Alina Grubnyak on Unsplash