The Titanic Dataset

The Titanic dataset is a collection of data about the passengers on the ill-fated liner’s voyage. It is widely used as an introduction to Machine Learning, and sites such as Kaggle ( https://www.kaggle.com/c/titanic ) use it as the basis for a competition.

The dataset provides details of each passenger, e.g. name, number of persons in the travel group, passenger accommodation/class, etc.

The idea of the competition is to process the data, train a model on it, and predict whether a particular passenger survived the sinking or not.

I stumbled across the challenge and have blogged about how I set about approaching the problem. I tried to follow the steps I’d found in another blog, but I hit problems, so it was something of a refreshing ‘get my hands dirty’ time, knee-deep in some Python.

I approached the problem first using Jupyter Notebooks and then, once I felt comfortable, decided to convert it to Power BI.

What’s in the Dataset?

Good question!

The Kaggle competition provides 3 files:

  • train.csv
  • test.csv
  • gender_submission.csv

The ‘train.csv’ file holds the sample data used to build and train a model. It covers only a subset of the passengers and includes an indication of whether each one survived or not. The ‘test.csv’ file holds the remaining passengers, whose survival is to be predicted for the competition, so it carries no survival indicator. Finally, ‘gender_submission.csv’ contains an example set of predictions based purely on the gender of the passengers.
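To make the gender-based predictions concrete, here is a minimal sketch of that kind of rule: predict "survived" for every female passenger and "did not survive" for everyone else, then write the result out with PassengerId and Survived columns. The helper names (gender_baseline, to_submission_csv) and the cut-down two-row sample are my own for illustration; a real run would read ‘test.csv’ instead.

```python
import csv
import io

# Illustrative sample only: two passengers with a reduced set of columns.
# In practice these rows would be read from test.csv.
SAMPLE = '''PassengerId,Pclass,Name,Sex
1,3,"Braund, Mr. Owen Harris",male
2,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female
'''


def gender_baseline(rows):
    """Predict 1 (survived) for female passengers and 0 for everyone else."""
    return [
        {
            "PassengerId": row["PassengerId"],
            "Survived": 1 if row["Sex"] == "female" else 0,
        }
        for row in rows
    ]


def to_submission_csv(predictions):
    """Render the predictions as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["PassengerId", "Survived"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(predictions)
    return out.getvalue()


predictions = gender_baseline(csv.DictReader(io.StringIO(SAMPLE)))
submission = to_submission_csv(predictions)
print(submission)
```

A rule like this needs no training at all, which makes it a handy yardstick: any model built later should at least beat it.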

Looking at the first couple of rows of ‘train.csv’, we see:

PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked
1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.25 | | S
2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C
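One thing worth noticing in those rows is that the Name value itself contains a comma, so splitting each line on commas by hand would break the columns. Here is a small sketch using Python’s built-in csv module on the same two rows. It assumes the raw file wraps the Name field in double quotes, as CSV files normally do for values containing commas, and the RAW string is just an inline stand-in for the file.

```python
import csv
import io

# Inline stand-in for the first lines of train.csv (assumed quoting of Name).
RAW = '''PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C
'''

# DictReader uses the header row as keys and honours the quoted Name field.
rows = list(csv.DictReader(io.StringIO(RAW)))

for row in rows:
    # An empty Cabin comes back as an empty string, not as a missing key.
    cabin = row["Cabin"] or "(not recorded)"
    print(row["PassengerId"], row["Name"], row["Sex"], row["Age"], cabin)
```

Every value comes back as a string here, including the empty Cabin for the first passenger, so numeric columns such as Age and Fare still need converting before any analysis.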

Let’s go load this into a Jupyter Notebook to have a look properly.