The Titanic dataset is a collection of data covering the passengers on the ill-fated liner’s voyage. It is widely used as an introduction to Machine Learning, and sites such as Kaggle ( https://www.kaggle.com/c/titanic ) use the dataset as the basis for a competition.
The dataset provides data on each passenger, e.g. name, number of people in their travel group, accommodation class, etc.
The idea of the competition is to process the data, train a model on it, and predict whether a particular passenger survived the sinking or not.
I stumbled across the challenge and have blogged about how I approached the problem. I tried to follow the steps I’d found in another blog, but I hit problems, so it turned into something of a refreshing ‘get my hands dirty’ time, knee deep in some Python.
I approached the problem first using Jupyter Notebooks and then, once I felt comfortable, decided to convert it to Power BI.
What’s in the Dataset?
Good question!
The Kaggle competition provides 3 files:
- train.csv
- test.csv
- gender_submission.csv
The ‘train.csv’ file holds a subset of the passengers, together with an indication of whether each passenger survived or not. It is used to build and train a model.
The ‘test.csv’ file holds the remaining passengers without the survival indication; these are the passengers whose fate the competition asks you to predict. The ‘gender_submission.csv’ file contains an example set of predictions based on the gender of the passengers.
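To make the idea of a gender-based prediction concrete, here is a minimal sketch of such a rule. It assumes the rule is simply "female passengers survived, everyone else did not"; the function name `gender_baseline` and the two passenger rows are made up for illustration and are not taken from the Kaggle files.

```python
def gender_baseline(sex):
    """Predict 1 (survived) for female passengers and 0 otherwise.

    Assumption: this mirrors the simple gender-only rule, nothing more.
    """
    return 1 if sex.strip().lower() == "female" else 0


# Hypothetical passengers, invented purely to exercise the rule.
made_up_passengers = [
    {"PassengerId": 1001, "Sex": "male"},
    {"PassengerId": 1002, "Sex": "female"},
]

# Build (PassengerId, Survived) pairs, the shape a submission would need.
predictions = [
    (p["PassengerId"], gender_baseline(p["Sex"])) for p in made_up_passengers
]

for passenger_id, survived in predictions:
    print(passenger_id, survived)
```

A rule like this makes a handy yardstick: any model worth building should at least do better than guessing from gender alone.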
Looking at the first couple of rows of ‘train.csv’, we see:
| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
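As a quick sanity check on the layout, here is a minimal sketch that parses those two rows using only Python’s standard `csv` module. The sample text is my own transcription of the rows into CSV form; I’m assuming the Name field is quoted, since it contains a comma, and that the first passenger’s Cabin is blank. The helper name `read_passengers` is just something I made up for the example.

```python
import csv
import io

# The two rows from the table, written the way a CSV file would store them.
# Assumptions: Name is quoted because it contains a comma, and the Cabin
# value for passenger 1 is empty.
SAMPLE = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
'''


def read_passengers(handle):
    """Return one dict per passenger, keyed by column name."""
    return list(csv.DictReader(handle))


passengers = read_passengers(io.StringIO(SAMPLE))

for p in passengers:
    # Every value comes back as a string, so compare against "1", not 1.
    outcome = "survived" if p["Survived"] == "1" else "did not survive"
    print(p["PassengerId"], p["Name"], outcome)
```

To try it against the real file, the same function should work with `open("train.csv", newline="")` in place of the `io.StringIO` object, assuming the file sits in the working directory.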
Let’s go load this into a Jupyter Notebook to have a look properly.
