In Part I of this tutorial, I developed a small Python program of fewer than 20 lines of code that allowed us to enter our first Kaggle competition. However, that model did not perform very well, because we did not explore and prepare the data well enough to understand it and to structure the model better. In this second part of the tutorial, we will explore the dataset with the help of the Seaborn and Matplotlib libraries. We will also introduce and apply new concepts to build a better-performing model. Finally, we will be able to score better in our second submission.
As I mentioned in Part I, you need to install Python on your system to be able to run any Python code. You also need to install libraries such as NumPy, Pandas, Matplotlib, and Seaborn. In addition, you need an IDE or a text editor to write your code. You may use your choice of IDE, of course. However, I strongly recommend using IPython via the Anaconda Distribution (you can download it from here). IPython provides an interactive shell and it’s pretty flexible (for more info, please visit this). So, you should definitely check it out if you are not already using it.
Exploring Our Data
To create a good model, we first need to explore our data. Seaborn, a statistical data visualization library, comes in pretty handy. First, let’s remember what our dataset looks like:
and this is the explanation of the variables you see above:
So, now it is time to explore some of these variables’ effect on survival!
My first suspicion is that there is a relation between a person’s gender (male-female) and his/her survival probability. To be able to check this effect, I am using a count plot of the males and females against survived and not-survived labels:
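The original plotting code is not reproduced here, so the following is only a minimal sketch of such a count plot. It assumes a tiny hand-made sample standing in for the Kaggle train dataframe; the variable names `train` and `survival_rate_by_sex` are illustrative, not taken from the tutorial.

```python
import pandas as pd
import seaborn as sns

# Tiny made-up sample; in the tutorial this would be the Kaggle train dataframe.
train = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Survived": [0, 1, 1, 0, 1, 1],
})

# One bar per (Sex, Survived) combination.
# In a notebook the figure renders automatically; in a script, call
# matplotlib.pyplot.show() afterwards.
ax = sns.countplot(data=train, x="Sex", hue="Survived")
ax.set_title("Survival counts by gender")

# The same relationship as numbers: share of survivors per gender.
survival_rate_by_sex = train.groupby("Sex")["Survived"].mean()
print(survival_rate_by_sex)
```

The same call with `x="Pclass"`, `x="SibSp"`, `x="Parch"`, or `x="Embarked"` would produce the other categorical plots discussed in this section.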
It is clear that females had a greater chance of survival compared to males. Therefore, gender must be an explanatory variable in our model.
Secondly, I suspect that there is a correlation between the passenger class and survival rate as well. When I plot Pclass against Survival, I obtain this:
Just as I suspected, passenger class has a significant influence on one’s survival chance. It seems that someone traveling in third class had a high chance of not surviving. Therefore, Pclass definitely helps explain survival probability.
Thirdly, I also suspect that the number of siblings and spouses aboard (SibSp) and the number of parents and children aboard (Parch) are significant in explaining the survival chance. Therefore, I am plotting the SibSp and Parch variables against Survival and I obtain this:
So, we reach this conclusion: as the number of siblings or spouses on board or the number of parents or children on board increases, the chance of survival increases. In other words, people traveling with their families had a higher chance of survival.
Another potential explanatory variable (feature) of our model is the Embarked variable. When I plot Embarked against the Survival, I obtain this outcome:
It is clearly visible that people who embarked at the port of Southampton were less fortunate compared to the others. Therefore, I will also include this variable in my model.
So far, we have checked 5 categorical variables (Sex, Pclass, SibSp, Parch, Embarked) and it seems that they all played a role in a person’s survival chance.
Now it is time to work on our numerical variables Fare and Age. First of all, I would like to see the effect of Age on Survival chance. Therefore, I am plotting the Age variable (seaborn.distplot):
I can see that survival rate is higher for children below 18 while for people above 18 and below 35, this rate is low. It is evident that Age plays a role in Survival.
Finally, I need to see whether the Fare is helpful in explaining the Survival probability. Therefore, I am plotting Fare variable (seaborn.distplot):
In general, we can see that as the Fare paid by the passenger increases, chance of survival increases as we expected.
I will ignore three columns: Name, Cabin, Ticket. The reason is that including these variables in our model requires more advanced techniques. To give an idea of how to extract features from them: you can tokenize the names of the passengers and derive their titles. Apart from titles like Mr. and Mrs., you will find other titles such as Master or Lady. Surely, this played a role in who was saved that night. Therefore, you can take advantage of the Name column as well as the Cabin and Ticket columns.
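Although the tutorial does not use the Name column, the title extraction mentioned above could look roughly like this. It is a sketch that assumes names follow the "Surname, Title. Given names" pattern; the regular expression is my own choice, not the author's.

```python
import pandas as pd

# A few names in the format used by the Titanic dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture the text between the comma and the first period after it.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss', 'Master']
```

The resulting titles could then be treated as one more categorical feature.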
Checking the Data for Null Values
Null values are our enemies! In the Titanic dataset, we have some missing values. First of all, I will combine the two datasets after dropping the Survived column of the train dataset.
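A minimal sketch of that combination step, assuming two small stand-in dataframes named `train` and `test` (the real ones come from the Kaggle csv files) and using `pandas.concat`, which may differ from the original code:

```python
import pandas as pd

# Stand-ins for the Kaggle dataframes; only a few columns are shown.
train = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Age": [22.0, None, 26.0],
})
test = pd.DataFrame({
    "PassengerId": [892, 893],
    "Age": [34.5, None],
})

# Drop the label from train so both frames share the same columns,
# then stack them into one frame for inspection.
combined = pd.concat([train.drop(columns="Survived"), test], ignore_index=True)
print(combined.shape)  # (5, 2)
```

Note that `drop` returns a new dataframe, so `train` itself still contains the Survived column afterwards.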
I need to find out where the null values are! There are two ways to accomplish this: the info() function and heatmaps (way cooler!). To detect them, I used Seaborn’s heatmap with the following code:
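The original snippet is not included here; a commonly used version of this heatmap looks roughly like the sketch below. The dataframe `combined` is a small made-up stand-in, and the `viridis` colormap is an assumption chosen because it draws missing cells in yellow.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up stand-in for the combined dataframe, with some missing cells.
combined = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.2833, np.nan, 8.05],
})

# isnull() yields a True/False table; the heatmap colors True (missing) cells.
null_mask = combined.isnull()
ax = sns.heatmap(null_mask, yticklabels=False, cbar=False, cmap="viridis")
```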
Here is the outcome. Yellow lines are the missing values.
There are a lot of missing Age and Cabin values. There are two values missing in the Embarked column and one missing in the Fare column. Let’s take care of these first. Btw, alternatively, we can use the info() function to get the same information in text form:
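As a small sketch of the text-based check, again on a made-up stand-in dataframe: `info()` prints the non-null count per column, and `isnull().sum()` (an addition of mine, not mentioned in the text) gives the missing count directly.

```python
import pandas as pd

# Made-up stand-in for the combined dataframe.
combined = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.2833, None, 8.05],
})

# Prints column names, non-null counts, and dtypes.
combined.info()

# Number of missing values per column.
missing_per_column = combined.isnull().sum()
print(missing_per_column)
```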
Reading the Datasets
I am not going into detail on the dataset, as it was explained in Part I of the tutorial. Using the code below, I will import the Pandas and NumPy libraries and read the train and test csv files.
As we know from above, we have null values in both the train and test sets. I decided to impute these null values and prepare the datasets for model fitting and prediction separately.
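A minimal sketch of the loading step. The helper `load_titanic` is hypothetical, and the default file names `train.csv` and `test.csv` assume the Kaggle files sit in the working directory.

```python
import numpy as np  # imported as in the tutorial; used for numerical work later
import pandas as pd


def load_titanic(train_path="train.csv", test_path="test.csv"):
    """Read the two Kaggle csv files into dataframes (paths are assumptions)."""
    train = pd.read_csv(train_path)
    test = pd.read_csv(test_path)
    return train, test


# Usage, once the csv files are in place:
# train, test = load_titanic()
```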
Imputing Null Values
There are two main approaches to solving the missing values problem in a dataset: drop or fill. Dropping is the easy and naive way out, although sometimes it might actually perform better. In our case, we will fill the missing values unless we have decided to drop a whole column altogether.
The initial look of our dataset is as follows:
I will make several imputations and transformations to get a fully numerical and clean dataset that can be fitted to the machine learning model, with the following code (it also contains the imputation):
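The author's exact cleaning code is not shown here, so the following is only one plausible sketch. The specific choices (median for Age and Fare, most frequent value for Embarked, a 0/1 mapping for Sex, one-hot columns for Embarked) are my assumptions, and the function name `clean` is hypothetical.

```python
import pandas as pd


def clean(df):
    """Return a fully numeric copy of df with no missing values (a sketch)."""
    df = df.copy()
    # Fill numeric gaps with the column median (assumed strategy).
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    # Fill the missing ports with the most frequent port (assumed strategy).
    df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
    # Encode gender as 0/1 and the port as one-hot columns.
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
    df = pd.get_dummies(df, columns=["Embarked"], dtype=int)
    # Drop the columns the tutorial decided to ignore.
    return df.drop(columns=["Name", "Cabin", "Ticket"])


# Made-up rows in the shape of the Kaggle data.
raw = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Pclass": [3, 1, 3, 1],
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
             "Heikkinen, Miss. Laina",
             "Futrelle, Mrs. Jacques Heath (Lily May Peel)"],
    "Sex": ["male", "female", "female", "female"],
    "Age": [22.0, None, 26.0, 35.0],
    "SibSp": [1, 1, 0, 1],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"],
    "Fare": [7.25, 71.2833, 7.925, None],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", None, "S"],
})
cleaned = clean(raw)
print(cleaned.isnull().sum().sum())  # 0
```

Writing the steps as a function makes it easy to apply the same preparation to the test dataframe later.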
After running this code on train dataset, I get this:
There are no null values, strings, or categories that would get in my way. Now, I can split the data into two parts, Features (X, or explanatory variables) and Label (y, or response variable), and then I can use sklearn’s train_test_split function to make train and test splits inside the train dataset.
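A minimal sketch of this split on a small made-up, already-cleaned dataframe. The `test_size` and `random_state` values are illustrative assumptions, not the author's settings.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, already-cleaned stand-in for the train dataframe.
train = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 2, 3, 1, 2, 3],
    "Sex": [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
    "Age": [22, 38, 26, 35, 35, 54, 2, 27, 14, 30],
    "Survived": [0, 1, 1, 1, 0, 0, 0, 1, 1, 0],
})

X = train.drop(columns="Survived")  # features
y = train["Survived"]               # label

# Hold out 30% of the rows to evaluate the model later.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```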
Note this: We have another dataset called test. This is quite confusing due to the naming made by Kaggle. We are basically training and testing our model using train dataset by splitting it into X_train, X_test, y_train, y_test dataframes and then, applying the trained model on our test dataset to get a predictions file.
Select and Train the Model
In Part I, I used Logistic Regression as my machine learning algorithm. Another well-known machine learning algorithm is the Gradient Boosting Classifier, and since it often performs better than Logistic Regression, I will select it in this tutorial. The code shared below allows us to import the Gradient Boosting Classifier algorithm, create a model based on it, fit and train the model using the X_train and y_train dataframes, and finally make predictions on X_test.
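The steps above can be sketched as follows. To stay self-contained, the example trains on synthetic data from `make_classification` instead of the Titanic features, and it keeps the classifier's default hyperparameters, which may not match the author's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the Titanic features.
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Create the model, fit it on the training split, predict on the held-out split.
model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(predictions[:10])
```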
Now, we have the predictions and we also know the answers, since X_test is split from the train dataframe. To measure our success, we can use a confusion matrix and a classification report. You can achieve this by running the code below:
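A small sketch of both reports, using short hand-written label lists in place of the real `y_test` and `predictions` so the numbers are easy to verify by eye; `accuracy_score` is an extra call not mentioned in the text.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Hand-written stand-ins for the true labels and the model's predictions.
y_test = [0, 0, 1, 1, 1, 0]
predictions = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, predictions)
print(cm)  # [[2 1]
           #  [1 2]]

# Precision, recall, and F1 score per class.
print(classification_report(y_test, predictions))

# Share of correct predictions: 4 of 6 here.
accuracy = accuracy_score(y_test, predictions)
```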
And this is the output:
We obtain about 82% accuracy which may be considered pretty good although there is still room for improvement.
Create the Prediction File for the Kaggle Competition
Now, we have a trained and working model that we can use to predict the survival of the passengers in the test.csv file.
First, I will clean and prepare the data with the following code (quite similar to how we cleaned the train dataset). Just note that, before removing the PassengerId column, I save it separately under the name ‘ids’.
Finally, I can predict the Survival values of the test dataframe and write to a csv file as required with the following code.
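A rough sketch of this last step, assuming a tiny made-up training set and an already-cleaned stand-in for the Kaggle test dataframe. The output file name `submission.csv` is an assumption; the two-column layout (PassengerId, Survived) follows the format the competition expects.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Tiny made-up training data so the example is self-contained.
X_train = pd.DataFrame({"Pclass": [1, 3, 3, 2, 1, 3], "Sex": [1, 0, 0, 1, 1, 0]})
y_train = [1, 0, 0, 1, 1, 0]
model = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# Stand-in for the already-cleaned Kaggle test dataframe.
test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 1, 2],
    "Sex": [0, 1, 0],
})

# Keep the ids aside, then drop them so the columns match the training features.
ids = test["PassengerId"]
features = test.drop(columns="PassengerId")

# Pair each id with its predicted label and write the file without the index.
submission = pd.DataFrame({"PassengerId": ids, "Survived": model.predict(features)})
submission.to_csv("submission.csv", index=False)
```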
There you have a new and better model for the Kaggle competition. We made several improvements in our code, which increased the accuracy by around 15–20%. As I mentioned above, there is still some room for improvement, and the accuracy can increase up to 85%. However, the scores shown on the scoreboard are not very honest, in my opinion, since many people used dishonest techniques to increase their ranking.