Analyzing the ‘Titanic’ Data-set with PowerBI

Zainab Balogun O.
5 min readJun 4, 2020

--

Photo by Anthony Metcalfe on Unsplash

A week ago I got the Titanic dataset from a group chat I’m on for us to work on and keep ourselves busy with.

Truth is I’ve been super lazy for a while now, I’ve lost the pace I started my learning with and I didn’t go looking for it. So when the dataset was dropped, I thought of it as an external push. My mind went…“Ovaa whether you like it or not you have to do something on this”.

So let’s get to it. The tool we’ll be making use of obviously is PowerBI (available anywhere online tbh).

Quick intro: PowerBI is a business analytics tool used mainly for interactive visualizations i.e. to display charts and dashboards and create reports. It can also be used to clean up data and write DAX (Data Analysis eXpressions) formulas.

Get said dataset ➡️ here ⬅️

Silly assumption: I assume that if you intend on working on this, you have some basic knowledge of PowerBI already and this isn’t your first encounter.

Data Summary (i.e. columns):

  • Survival (0= No, 1 =Yes)
  • pclass/ticket class (1=1st, 2=2nd, 3=3rd)
  • sex (female, male)
  • age (in years)
  • sibsp (number of siblings/spouses aboard)
  • parch (number of parents/children aboard)
  • ticket number
  • fare
  • cabin number
  • embarked (port of embarkation; C = Cherbourg, Q= Queenstown, S = Southampton)

There, these info. make you understand the data more.

After importing the csv file into PowerBI, and hitting the ‘transform data’ option which will take you to the Power Query Editor (where we clean the data), here are some tidying up I carried out;

  1. I added a conditional column Yes/No to replace 1/0 in the survival column (makes it more understandable)
  2. Added another conditional column to identify the Upper/Middle/Lower class which replaces 1/2/3 in the pclass column. For understandability
  3. Removed column 1 and 2 after replacement.
  4. I noticed there were some ‘nulls’ in the age column so I replaced them all with 0 because I can’t make up the ages on my own.
  5. The cabin column contains some empty rows so I replaced with ‘N/A’
  6. Replaced S/C/Q in the embarked column with their corresponding names Southampton/Cherbourg/Queenstown. For clarity.
  7. I crosschecked the datatypes for each column and corrected wrongly labelled ones.
  8. Hit Close and Apply

Now we are done with the Power Query Editor and we’ve moved to the PowerBI interface where we will write some measures, create new columns and create visuals.

  1. Created an extra column to place the different ages available in categories (Teens; 0–18, Youth; 19–30, Adult; 31–50, Aged; >51), using the nested IF formula to categorize.

PS: remember in my data cleaning, I replaced ‘nulls’ with 0 in the age column. I made sure to not make this visible in my analysis by including in my nested IF formula that 0 = N/A. Then you can hide this category in the filter pane if you need the age category for analysis.

2. Created 2 different measures ‘Died’ and ‘Survived’ using the CALCULATE formula. This is so I can distinctively display the number of people who died and survived on the ship.

3. Created a measure to count the number of passengers present on the ship by using the DISTINCTCOUNT formula.

4. Created another measure to add up the transportation fare using the SUM formula.

…and that’s all I did for that part.

Visualization:

I didn’t do much, I don’t do much lol so I just;

  1. Inserted a text box and placed in my title at the top of the page.
  2. Created 3 different cards to display: a) Total Passengers b) Those who survived c) Those who died.
  3. Used bar charts to display the number of passengers who survived by a) Gender b) Point of embarkation c) Travel class
  4. Used a metric/table visual to display the number of passengers who survived by age category.
  5. Created a slicer which appeared at the top of the page to provide more granular information between Male or Female.

After that, I just carried out some basic formatting like;

  • Eliminating the ‘x-axis’ and some ‘y-axis’ and their titles and turning on data labels for each charts.
  • Repositioning the legends.
  • Editing of colors i.e. red to show death and light blue to show survival.
  • Adding titles to each and every chart (‘cos that’s very important)
  • etc.

And then we have this…

Some insights gotten from the visualization:

  1. A total of 891 passengers boarded the ship. 549 passengers died, 342 survived.
  2. There were 577 males aboard, 489 died, 109 survived.Total number of females aboard were 314, 81 died. 233 survived. Could it be that more rescue preference was given to the females?
  3. The port most passengers got in from is Southampton (644 people), the least; Queenstown. So I guess it’s only natural that the most death will come from Southampton? 66% death there, followed by 61% from Queenstown and 45% from Cherbourg.
  4. Looking at the ‘survival by Travel class’ chart, we can see that we have more passengers in the lower class, then upper, then middle. The most death occurred in the lower class and the least death in the upper class. I think this is because naturally, more preference i.e. live boats/rescue teams will be assigned to the elites before touching those in the lower class.
  5. And lastly, I grouped the survival rate by age category to see the age range that got hit the most. Adult(31–50)>Youth(19–30)>Teen(0–18)>Aged(>50). In that order.

PPS: note that not all passenger’s ages were recorded so we don’t have a complete record for that.

That’s all the insights I got from the data but that’s not all there is. You can explore the data to find out more for yourself.

And please, if you do, share it with me 😁

Find me on Twitter and/or E-mail.

Thanks for reading, I hope you practice with it.

Peace :-)

--

--

Zainab Balogun O.

advocate 4 humanity, quality TV shows & 💤. subscribe to my newsletter, some say it's fun https://zigzagzee.substack.com/