If you are following my tutorial series on Kaggle’s Titanic Competition (Part-I and Part-II) or have already participated in the competition, you are familiar with the whole story. If you are not, since this is a follow-up tutorial, I strongly recommend checking out the Competition Page or Part-I and Part-II of this tutorial series. In Part-III (the final part) of the series, I will (i) use natural language processing (NLP) techniques to obtain the titles of the passengers, (ii) create an artificial neural network (ANN or RegularNet) to train the model, and (iii) use grid search cross-validation to tune the ANN so that we get the best results. Let’s start…
Throughout this tutorial series, I have tried to keep things as simple as possible and to develop the story slowly and clearly. In Part-I of the tutorial, we learned to write a Python program with fewer than 20 lines of code to enter Kaggle’s competition. Things were kept as simple as possible: we cleaned the non-numerical parts, took care of the null values, trained our model using the train.csv file, predicted the survival of the passengers in the test.csv file, and saved the predictions as a CSV file for submission.
Since I did not explore the dataset properly in Part-I, I focused on data exploration in Part-II using the visualization libraries Matplotlib and Seaborn. Instead of dropping the null values, I imputed them using aggregate functions, cleaned the data more thoroughly, and generated dummy variables from the categorical variables. I then switched from LogisticRegression to a RandomForestClassifier model, which improved precision by approximately 20% compared to the model in Part-I.
Part-III of the Tutorial
Now, we will use the Name column to derive the passengers’ titles, which played a significant role in survival. In addition, I will create an artificial neural network (ANN or RegularNet) with Keras to obtain better results. Then, to tune the ANN model, I will use GridSearchCV to find the best parameters. Finally, I will generate a new .csv file for submission.
Preparing the Dataset
Similar to what we did in Part-I and Part-II, I will start by cleaning the data and imputing the null values. This time, I will adopt a different approach and combine the two datasets for cleaning and imputation. I already explained in Part-II why I impute the null values the way I do, so I will give you the code straight away. If some operations do not make sense, you may refer to Part-II or comment below. One more thing: since I saw in Part-II that passengers younger than 18 had a greater chance of survival, I have decided to add a new feature to measure this effect.
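A minimal sketch of this step is below. To keep it self-contained, it uses a tiny inline sample in place of the real train.csv/test.csv files (the rows are hypothetical); with the real files you would simply load them with `pd.read_csv` instead. The imputation strategy (group medians for Age, mode/median for the sparse nulls) follows the approach described in Part-II, and the under-18 flag is the new feature.

```python
import numpy as np
import pandas as pd

# Tiny inline sample standing in for train.csv / test.csv (hypothetical rows)
train = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 4.0],
    "Fare": [7.25, 71.28, 7.92],
    "Embarked": ["S", "C", None],
})
test = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [np.nan, 47.0],
    "Fare": [8.05, np.nan],
    "Embarked": ["S", "S"],
})

# Combine the two datasets so cleaning and imputation are applied consistently
df = pd.concat([train, test], sort=False).reset_index(drop=True)

# Impute Age with the median age of each (Sex, Pclass) group
df["Age"] = df.groupby(["Sex", "Pclass"])["Age"].transform(
    lambda s: s.fillna(s.median()))

# Fill the remaining sparse nulls
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
df["Fare"] = df["Fare"].fillna(df["Fare"].median())

# New feature: flag passengers younger than 18
df["Minor"] = (df["Age"] < 18).astype(int)
```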
Deriving Passenger Titles with NLP
Before we drop the unnecessary columns and generate dummies from the categorical variables above, I will derive the titles from the Name column. To understand what we are doing, I will start by running the following code to get the first 10 values of the Name column.
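A self-contained version of this peek is below, using a few sample values in the dataset’s Name format in place of the combined dataframe:

```python
import pandas as pd

# A few sample Name values in the Titanic dataset's format
df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]})

# Inspect the raw Name values (head(10) on the full dataset shows 10 rows)
print(df["Name"].head(10))
```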
And here is what we get:
The structure of a Name value is as follows: `Surname, Title. Given names` — for example, `Braund, Mr. Owen Harris`.
Therefore, I need to split these strings at the comma and the dot and extract the title. We can accomplish this with the following code:
Once we run this code, we will have a Title column. To see what kinds of titles we have, I will run this:
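Counting the occurrences of each title is a one-liner with `value_counts()`; the sample titles below are illustrative stand-ins for the extracted column:

```python
import pandas as pd

# Assuming a Title column has already been extracted (sample values shown)
df = pd.DataFrame({"Title": ["Mr", "Mr", "Mrs", "Miss", "Master", "Dr", "Rev"]})

# Count how many passengers carry each title
print(df["Title"].value_counts())
```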
It seems that we have four major groups — ‘Mr’, ‘Mrs’, ‘Miss’, and ‘Master’ — plus a number of rarer titles. However, before grouping all the other titles as Others, we need to take care of the French titles and convert them to their corresponding English titles with the following code:
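A sketch of the conversion, using the usual mapping for this dataset (‘Mme’ → ‘Mrs’, ‘Mlle’ → ‘Miss’; ‘Ms’ is commonly folded into ‘Miss’ as well):

```python
import pandas as pd

df = pd.DataFrame({"Title": ["Mme", "Mlle", "Ms", "Mr"]})

# Map French titles to their English equivalents
french_to_english = {"Mme": "Mrs", "Mlle": "Miss", "Ms": "Miss"}
df["Title"] = df["Title"].replace(french_to_english)
```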
Now, the remaining rare titles are only officer and royal titles, so it makes sense to combine them as Others. I can achieve this with the following code:
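One way to collapse everything outside the four major titles into ‘Others’ (the officer/royal titles shown are examples):

```python
import pandas as pd

df = pd.DataFrame({"Title": ["Mr", "Dr", "Rev", "Col", "Lady", "Miss"]})

# Keep the four major titles; collapse officer/royal titles into "Others"
major = ["Mr", "Mrs", "Miss", "Master"]
df["Title"] = df["Title"].where(df["Title"].isin(major), "Others")
```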
Final Touch on Data Preparation
Now that our titles are more manageable, we can create dummies and drop the unnecessary columns with the following code:
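A minimal sketch of this step on a one-row sample (the set of columns dropped and encoded is an assumption based on Parts I–II):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Braund, Mr. Owen Harris"],
    "Ticket": ["A/5 21171"],
    "Cabin": [None],
    "Sex": ["male"],
    "Embarked": ["S"],
    "Title": ["Mr"],
    "Pclass": [3],
})

# Drop columns we no longer need, then one-hot encode the categoricals
df = df.drop(columns=["Name", "Ticket", "Cabin"])
df = pd.get_dummies(df, columns=["Sex", "Embarked", "Title", "Pclass"])
```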
Creating an Artificial Neural Network for Training
Standardizing Our Data with Standard Scaler
To get good results, we must scale our data using Scikit-learn’s StandardScaler. StandardScaler standardizes features by removing the mean and scaling to unit variance (i.e. standardization), which is different from MinMaxScaler. The mathematical difference between standardization and min-max normalization is as follows:
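In symbols, with \(\mu\) the feature mean, \(\sigma\) its standard deviation, and \(x_{\min}\), \(x_{\max}\) its extremes:

```latex
z = \frac{x - \mu}{\sigma}
\quad \text{(standardization)}
\qquad\qquad
x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}
\quad \text{(min-max normalization)}
```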
I will choose StandardScaler() for scaling my dataset and run the following code:
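A self-contained sketch of the scaling step, with a tiny synthetic matrix standing in for the prepared feature matrix:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Small synthetic feature matrix standing in for the prepared Titanic features
X = np.array([[22.0, 7.25],
              [38.0, 71.28],
              [26.0, 7.92]])

# Fit on the data and transform it: each column ends up with mean 0, variance 1
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```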
Building the ANN Model
After standardizing our data, we can start building our artificial neural network. I will create one input layer (Dense), one hidden layer (Dense), and one output layer (Dense). After each layer until the output layer, I will apply 0.2 Dropout for regularization to combat overfitting. Finally, I will wrap the model with KerasClassifier so that I can apply GridSearchCV to this neural network. As we have 14 explanatory variables, our input dimension must be 14, and since this is binary classification, the final output must be a single unit that gives a Survived or Not-Survived classification. The number of units in the layers in between is a “try and see” hyperparameter; I selected 10.
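A sketch of the model-building function, assuming `tensorflow.keras`. The function takes the optimizer as a parameter so the scikit-learn wrapper can vary it during the grid search (the wrapper’s import location depends on your setup: historically `keras.wrappers.scikit_learn.KerasClassifier`, and `scikeras.wrappers.KerasClassifier` for modern TensorFlow).

```python
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Sequential


def build_ann(optimizer="adam"):
    """Build the 14-input binary classifier described above."""
    model = Sequential([
        Input(shape=(14,)),                  # 14 explanatory variables
        Dense(10, activation="relu"),        # input layer
        Dropout(0.2),                        # regularization against overfitting
        Dense(10, activation="relu"),        # hidden layer
        Dropout(0.2),
        Dense(1, activation="sigmoid"),      # single unit: Survived probability
    ])
    model.compile(optimizer=optimizer,
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```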
Grid Search Cross Validation
After building the ANN, I will use scikit-learn’s GridSearchCV to find the best parameters and tune my ANN to get the best results. I will try different optimizers, epochs, and batch sizes with the following code:
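A sketch of the search setup. The parameter values are illustrative; `build_ann` stands for a model-building function that accepts an `optimizer` argument, and the wrapper import depends on your environment (shown commented, since fitting the grid trains a model per parameter combination and is the slow part):

```python
from sklearn.model_selection import GridSearchCV

# Hyperparameter grid to search over (illustrative values)
param_grid = {
    "optimizer": ["adam", "rmsprop", "sgd"],
    "epochs": [50, 100],
    "batch_size": [16, 32, 64],
}

# Wrap the Keras model so scikit-learn can drive it, then run the search:
# from scikeras.wrappers import KerasClassifier   # or keras.wrappers.scikit_learn
# classifier = KerasClassifier(model=build_ann, verbose=0)
# grid = GridSearchCV(estimator=classifier, param_grid=param_grid)
# grid_result = grid.fit(X_train, y_train)
# print(grid_result.best_params_)
```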
After I run this code and print out the best parameters, I get the following output:
Please note that I left the cv parameter of GridSearchCV at its default. If you would like to control the cross-validation behavior, pass an explicit cv value to GridSearchCV (e.g. cv=5).
Fitting the Model with Best Parameters
Now that we have found the best parameters, we can re-create our classifier with the best parameter values and fit it to our training dataset with the following code:
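A self-contained sketch of the final fit, assuming `tensorflow.keras`. Synthetic arrays stand in for the scaled training features and labels, and the “best” parameter values are illustrative placeholders — use the values your own grid search reports:

```python
import numpy as np
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.models import Sequential

# Synthetic stand-ins for the scaled training/test features and labels
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 14)).astype("float32")
y_train = rng.integers(0, 2, size=100)
X_test = rng.normal(size=(20, 14)).astype("float32")

# Illustrative best parameters; substitute your grid search results
best = {"optimizer": "adam", "epochs": 5, "batch_size": 32}

# Re-create the same architecture and fit with the best parameters
model = Sequential([
    Input(shape=(14,)),
    Dense(10, activation="relu"),
    Dropout(0.2),
    Dense(10, activation="relu"),
    Dropout(0.2),
    Dense(1, activation="sigmoid"),
])
model.compile(optimizer=best["optimizer"], loss="binary_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=best["epochs"],
          batch_size=best["batch_size"], verbose=0)

# Predicted survival probabilities for the test set
probs = model.predict(X_test, verbose=0)
```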
Once we obtain the predictions, we can carry out the final operations to make them ready for submission. One thing to note is that our ANN outputs survival probabilities, which are continuous numerical values, whereas the submission requires a binary categorical variable. Therefore, I use the lambda function below to convert the continuous values to binary values (0 or 1) and write the results to a CSV file.
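A sketch of the conversion and export (the probabilities and the 0.5 cut-off threshold are illustrative):

```python
import numpy as np
import pandas as pd

# Example probabilities as produced by model.predict (hypothetical values)
probs = np.array([0.12, 0.85, 0.50, 0.49])
passenger_ids = [892, 893, 894, 895]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": probs})

# Convert continuous survival probabilities to binary 0/1 labels
submission["Survived"] = submission["Survived"].apply(
    lambda p: 1 if p >= 0.5 else 0)

# Write the submission file expected by Kaggle
submission.to_csv("submission.csv", index=False)
```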
You have created an artificial neural network to classify the survival of Titanic passengers. Neural networks tend to outperform other machine learning algorithms when a large volume of data is available. Since our dataset consists of only 1,309 rows, well-tuned machine learning algorithms such as gradient-boosted trees or random forests are likely to outperform neural networks here. For large datasets, however, this is no longer the case, as you may see in the chart below:
I would say that the Titanic dataset lies on the left side of the intersection, where traditional algorithms outperform deep learning algorithms. Even so, we can still achieve an accuracy rate above 80%, which is close to the practical ceiling for this dataset.