Titanic: Machine Learning from Disaster
If you are interested in machine learning, you have probably heard of Kaggle. Kaggle is a platform where you can learn a lot about machine learning with Python and R, do data science projects, and (this is the most fun part) join machine learning competitions. Competitions change and are updated over time, but currently “Titanic: Machine Learning from Disaster” is the standard beginners’ competition. In this post, we will produce a ready-to-upload submission file with less than 20 lines of Python code. To be able to do this, we will use the Pandas and Scikit-Learn libraries.
Some Wiki Info about Titanic and the Infamous Accident
RMS Titanic was the largest ship afloat at the time she entered service, and she sank on 15 April 1912 after colliding with an iceberg during her maiden voyage from Southampton to New York City. There were 2,224 passengers and crew aboard, and unfortunately 1,502 of them died, making it one of the deadliest peacetime maritime disasters of the 20th century.
One of the reasons for such a high number of casualties was the lack of sufficient lifeboats for the passengers and crew. Although luck played a part in surviving the accident, some groups of people were more likely to survive than others, such as women, children, and the upper class. We will calculate this likelihood, along with the effect of particular features on the likelihood of surviving. And we will accomplish all of this in less than 20 lines of code and end up with a file ready for submission. … Let’s Get Started!
Download the Data
The Titanic dataset is an open dataset that you can obtain from many different repositories and GitHub accounts. However, downloading it from Kaggle is definitely the best choice, as the other sources may host slightly different versions and may not offer separate train and test files. So, please visit this link to download the datasets (Train.csv and Test.csv) and get started.
Normally our Train.csv file looks like this in Excel:
After converting it to a table in Excel (Data -> Text to Columns), we get this view:
Way nicer, right? Now we can clearly see that we have 12 variables. While the “Survived” variable represents whether a particular passenger survived the accident, the rest hold essential information about that passenger: PassengerId, Pclass (ticket class), Name, Sex, Age, SibSp (number of siblings and spouses aboard), Parch (number of parents and children aboard), Ticket, Fare, Cabin, and Embarked (port of embarkation).
Clean, Massage, and Prepare The Data
I am assuming that you already have a Python 3 environment installed. If you have not, you may refer to this link for Windows and this link for Mac. Once you have made sure Python 3 is installed on your system, open your favorite IDE and start coding!
First, we will load the training data and get it cleaned and ready for training our model. I will (i) load the data, (ii) delete the rows with empty values, (iii) select the “Survived” column as my response variable, (iv) drop the for-now irrelevant explanatory variables, (v) convert categorical variables to dummy variables, and I will accomplish all of this with 7 lines of code:
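The steps above can be sketched roughly as follows. The inline rows are hand-made stand-ins for the real file, which you would load with pd.read_csv("train.csv"); the column names match the Kaggle dataset. One practical tweak in this sketch: the sparse Cabin column is dropped before dropna(), otherwise most rows would be discarded.

```python
import pandas as pd

# A few hand-made rows standing in for pd.read_csv("train.csv")
train = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 1],
    "Name": ["Braund, Mr. Owen", "Cumings, Mrs. John",
             "Heikkinen, Miss Laina", "Allen, Mr. William"],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, None],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2", "373450"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Cabin": [None, "C85", None, "C123"],
    "Embarked": ["S", "C", "S", "S"],
})

# (iv) drop the for-now irrelevant columns first, so that (ii) dropping rows
# with empty values does not discard every row with a missing Cabin entry
train = train.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
train = train.dropna()                                # (ii) delete rows with empty values
y = train["Survived"]                                 # (iii) response variable
X = pd.get_dummies(train.drop(columns=["Survived"]))  # (v) categorical -> dummy variables
```

After this, X contains only numeric columns (Sex and Embarked become Sex_female/Sex_male and Embarked_* dummies), which is what scikit-learn expects.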
Create the Model and Train
To be able to discover the relationship between the Survived variable and the other variables (or features, if you will), you need to select a statistical machine learning model and train it with the processed data.
Scikit-learn provides several algorithms for this. I will select the DecisionTreeClassifier, which is a basic but powerful algorithm for machine learning. And get this: we will only need 3 lines of code to reveal the hidden relationship between survival (denoted as y) and the selected explanatory variables (denoted as X).
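Those 3 lines look roughly like this. The small X and y below are stand-ins for the cleaned training data produced in the previous step; only the last three lines are the actual modeling step.

```python
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Stand-in features and labels; in the tutorial these are the X and y
# produced by the cleaning step above
X = pd.DataFrame({"Pclass": [3, 1, 3, 1],
                  "Age": [22.0, 38.0, 26.0, 35.0],
                  "Sex_female": [0, 1, 1, 0],
                  "Sex_male": [1, 0, 0, 1]})
y = pd.Series([0, 1, 1, 0], name="Survived")

# The three advertised lines: import, create the classifier, fit it to the data
model = DecisionTreeClassifier(random_state=1)
model = model.fit(X, y)
print(model.score(X, y))  # training accuracy; a full-depth tree memorizes its training data
```

Note that the training-set score above is optimistic by construction; a train/validation split would give a more honest estimate, but that is beyond this post's 20-line budget.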
Make Predictions and Save Your Results
After revealing the hidden relationship between Survived and the selected explanatory variables, we can prepare our test data for the prediction phase. The Test.csv file is slightly different from the Train.csv file: it does not contain the “Survived” column. This makes sense, because if we knew all the answers, we could just fake our algorithm and submit the correct answers written by hand (wait! some people have somehow already done that?). Anyway, our test data needs almost the same kind of cleaning, massaging, prepping, and preprocessing for the prediction phase. We will accomplish this with 5 lines of code:
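A sketch of those lines, again with hand-made rows standing in for pd.read_csv("test.csv"). One deviation worth flagging: missing values in the test data are filled (here with the column mean) rather than dropped, because Kaggle expects a prediction for every test passenger.

```python
import pandas as pd

# A few hand-made rows standing in for pd.read_csv("test.csv");
# note there is no "Survived" column here
test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Pclass": [3, 3, 2],
    "Name": ["Kelly, Mr. James", "Wilkes, Mrs. James", "Myles, Mr. Thomas"],
    "Sex": ["male", "female", "male"],
    "Age": [34.5, 47.0, None],
    "SibSp": [0, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["330911", "363272", "240276"],
    "Fare": [7.83, 7.00, 9.69],
    "Cabin": [None, None, None],
    "Embarked": ["Q", "S", "Q"],
})

ids = test["PassengerId"]               # save the ids for the submission file
X_test = test.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])
X_test = pd.get_dummies(X_test)         # same dummy encoding as the training data
X_test = X_test.fillna(X_test.mean())   # fill, don't drop: Kaggle wants a
                                        # prediction for every passenger
```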
Now our test data is clean and prepared for prediction. Finally, let's make the predictions for the given test file and save them to memory:
So easy, right? Before saving these predictions, we need to put them in the proper structure so that Kaggle can automatically score them. Remember that we saved the PassengerId column to memory as a separate dataset (a dataframe, if you will)? Now I am going to attach the predictions dataset to the PassengerIds (note that they are both single-column datasets). Finally, I will take the data from memory and save it in csv (comma separated values) format, which is required by Kaggle.
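Here is a compact sketch of this predict-and-save step. The model, test features, and ids below are tiny stand-ins for the objects built in the earlier steps; the last three lines are the part this section describes.

```python
from sklearn.tree import DecisionTreeClassifier
import pandas as pd

# Stand-ins for the objects built earlier: a fitted model,
# the prepared test features, and the saved PassengerId column
X = pd.DataFrame({"Pclass": [3, 1], "Sex_male": [1, 0]})
y = pd.Series([0, 1])
model = DecisionTreeClassifier(random_state=1).fit(X, y)
X_test = pd.DataFrame({"Pclass": [1, 3], "Sex_male": [0, 1]})
ids = pd.Series([892, 893], name="PassengerId")

# Predict, attach the predictions to the PassengerIds, and save as csv
predictions = pd.Series(model.predict(X_test), name="Survived")
submission = pd.concat([ids, predictions], axis=1)
submission.to_csv("submission.csv", index=False)
```

The resulting submission.csv has exactly two columns, PassengerId and Survived, which is the structure Kaggle's scorer expects.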
Now you can go to Kaggle’s Titanic competition page (of course, after logging in) and upload your submission file.
If you are too lazy to write 20 lines of code (not being judgmental here, at all) and are looking for the full code, visit my GitHub.
Will You Make It to the Top?
Definitely not! The goal here was a machine learning workflow that lets you enter a Kaggle competition in the simplest possible way. As you improve on this basic code, you will rank higher with your subsequent submissions.