The Titanic Dataset

The Titanic dataset is a collection of data about the passengers on the ill-fated liner’s voyage. It is widely used as an introduction to machine learning, and sites such as Kaggle ( https://www.kaggle.com/c/titanic ) use it as the basis for a competition.

The dataset provides details on each passenger, e.g. name, number of people in their travel group, accommodation class, etc.

The idea of the competition is to process the data, train a model on it, and predict whether a particular passenger survived the sinking.

I stumbled across the challenge and have blogged about how I approached the problem. I tried to follow the steps I’d found in another blog, but I hit problems, so it turned into a refreshing ‘get my hands dirty’ session, knee-deep in some Python.

I approached the problem first using Jupyter Notebooks and then, once I felt comfortable, decided to convert it to Power BI.

What’s in the Dataset?

Good question!

The Kaggle competition provides 3 files:

train.csv
test.csv
gender_submission.csv

‘train.csv’ holds the sample data used to build and train a model, ‘test.csv’ holds the passengers whose survival you have to predict for the competition, and ‘gender_submission.csv’ contains an example set of predictions based on the gender of the passengers.

The ‘train.csv’ file includes an indication of whether each passenger survived. It holds only a subset of all the passengers; ‘test.csv’ holds a separate set of passengers and has no survival column.
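To make the idea behind ‘gender_submission.csv’ concrete, here is a minimal sketch of a gender-only prediction. It assumes the rule is "female passengers survive, male passengers do not", which is my understanding of that file rather than something spelled out here. The function name predict_by_gender is made up for illustration, and the two passengers are the first two rows of ‘train.csv’, trimmed to the fields the rule needs.

```python
def predict_by_gender(passengers):
    """Predict 1 (survived) for female passengers and 0 for everyone else.

    Assumed rule for illustration only; not necessarily how Kaggle built its file.
    """
    return [
        {
            "PassengerId": p["PassengerId"],
            "Survived": 1 if p["Sex"] == "female" else 0,
        }
        for p in passengers
    ]


# First two passengers from train.csv, reduced to the fields used by the rule
passengers = [
    {"PassengerId": 1, "Sex": "male"},
    {"PassengerId": 2, "Sex": "female"},
]

predictions = predict_by_gender(passengers)
for prediction in predictions:
    print(prediction["PassengerId"], prediction["Survived"])
```

A submission is then just these PassengerId and Survived pairs written out for every passenger in ‘test.csv’.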

Looking at the first couple of rows of ‘train.csv’, we see:

PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
1	0	3	Braund, Mr. Owen Harris	male	22	1	0	A/5 21171	7.25		S
2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	0	PC 17599	71.2833	C85	C

Let’s go load this into a Jupyter Notebook to have a look properly.
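As a first notebook cell, something like the sketch below will do. To keep it runnable without the download, it reads the two sample rows shown above from a string using only Python's standard library; the helper name load_rows is my own. With the real file you would open ‘train.csv’ and pass a comma as the delimiter (in a notebook, pandas' read_csv is the more usual choice).

```python
import csv
import io

# The two sample rows shown above, tab-separated as displayed.
# The real train.csv is comma-separated, so use delimiter="," for it.
SAMPLE = (
    "PassengerId\tSurvived\tPclass\tName\tSex\tAge\tSibSp\tParch\tTicket\tFare\tCabin\tEmbarked\n"
    "1\t0\t3\tBraund, Mr. Owen Harris\tmale\t22\t1\t0\tA/5 21171\t7.25\t\tS\n"
    "2\t1\t1\tCumings, Mrs. John Bradley (Florence Briggs Thayer)\tfemale\t38\t1\t0\tPC 17599\t71.2833\tC85\tC\n"
)


def load_rows(handle, delimiter="\t"):
    """Read delimited text into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle, delimiter=delimiter))


rows = load_rows(io.StringIO(SAMPLE))
columns = list(rows[0].keys())

print(columns)
for row in rows:
    # Every value comes back as a string; an empty Cabin is an empty string
    print(row["PassengerId"], row["Name"], row["Survived"], repr(row["Cabin"]))
```

Note that everything is read as text, so numeric columns such as Age and Fare need converting before any analysis, and missing values (like the first passenger's Cabin) show up as empty strings.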
