Analyzing Data of Titanic Passengers
I decided to investigate the Titanic passenger dataset, made available via the Kaggle data science community.
What factors made some people more likely to survive? It has been a long-held belief that wealthy passengers were more likely to survive -- perhaps because the crew placed them on lifeboats first -- but this may not necessarily be true.
Additionally, women and children are usually rescued first during any disaster. Was this true for the Titanic as well?
I decided to dig deep into the data to identify any trends, insights, and hopefully, some surprises.
First, I converted the .csv file to a dataframe:
import pandas as pd
titanic_df = pd.read_csv('titanic-data.csv')
Then I calculated how many passengers were included in the data:
len(titanic_df)
Hmmm, 891 records. This is less than half of the 2,224 passengers and crew members who were aboard the ship, according to Kaggle and additional sources.
Therefore, any analysis performed would be incomplete or inconclusive, as the underlying dataset does not contain information on all of the individuals aboard the ship.
It is also not known how and where these 891 records were originally obtained, or whether these 891 records are a random sample of the 2,224 passengers. As such, the statistical validity of this sample is also unknown.
Nonetheless, I jumped, er, dived right in.
I performed the following to get the column headings, even though they were given by Kaggle:
titanic_df.head()
To get some fast statistical data, I performed the following.
titanic_df.describe()
Not all of these results are useful. PassengerId is an identifier and Pclass is a category code, so their means and standard deviations are meaningless. The mean of Survived, a 0/1 flag, is simply the share of passengers who survived.
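One way to avoid the meaningless statistics is to describe only the quantitative columns and to count the categorical ones instead. The following is a minimal sketch, assuming a tiny made-up table: sample_df, quantitative_cols, and every value in it are hypothetical stand-ins, not the Kaggle data.

```python
import pandas as pd

# Hypothetical miniature of the passenger table (values are made up)
sample_df = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 2],
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Fare': [7.25, 71.28, 7.92, 13.0],
})

# Summarize only the columns where a mean and standard deviation make sense
quantitative_cols = ['Age', 'Fare']
summary = sample_df[quantitative_cols].describe()
print(summary)

# Category codes are better summarized with counts per value
class_counts = sample_df['Pclass'].value_counts().sort_index()
print(class_counts)
```

With the real data, the same two calls would be made on titanic_df rather than sample_df.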
Percent of Survivors
To quickly obtain the number of survivors, I declared a variable survivors and set it equal to the count of records in which the Survived column was equal to 1:
# Total survivors
# Making a new dataframe, only selecting those records where 'Survived' is equal to 1.
survivors = len(titanic_df[titanic_df['Survived']==1])
print(survivors)
I then expressed that count as a percentage of all records in the dataset, which came to 38.38 percent:
# Percent that survived
round(((survivors/len(titanic_df)) * 100), 2)
I wanted to determine whether this percentage differs from the survival rate for everyone aboard.
According to Kaggle, 1,502 out of 2,224 passengers and crew members were killed. This means that 722 survived, or 32.46 percent.
As such, we have a higher survival rate in the sample dataset.
Not ideal, but I was curious how the sample data may yield different results than the original underlying data.
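The comparison above is plain arithmetic, so it can be checked in a few lines. This sketch uses only the figures quoted in the text (2,224 aboard, 1,502 killed, 891 records, 38.38 percent sample survival). The variable names are my own.

```python
# Figures quoted in the text above
total_aboard = 2224
total_killed = 1502
sample_size = 891
sample_rate = 38.38  # percent of the 891 records that survived

# Survivors and survival rate for everyone aboard
total_survived = total_aboard - total_killed
overall_rate = round(total_survived / total_aboard * 100, 2)

# Share of everyone aboard that the sample covers
sample_share = round(sample_size / total_aboard * 100, 2)

# Gap between the sample rate and the overall rate, in percentage points
gap = round(sample_rate - overall_rate, 2)

print(total_survived, overall_rate, sample_share, gap)
```

The gap of roughly six percentage points is one more reason to treat the sample's results with caution.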
Age
Nonetheless, I was curious to learn more about the ages of the passengers, and whether this was correlated with survival.
From the results of the describe method used above, I spotted some NaN (Not a Number) values in the Age column. Not good. This means that some values for Age are missing. To determine just how many are missing, I performed a quick calculation:
# isnull() is used here because numpy has not been imported at this point
ageless = len(titanic_df[titanic_df['Age'].isnull()])
print(ageless)
As a percentage of total records, that comes to:
percent_ageless = round(((ageless/len(titanic_df)) * 100), 2)
print(percent_ageless)
With close to 20 percent of age data missing, any conclusion drawn from Age rests on an incomplete picture.
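The same check can be run on every column at once, which shows whether Age is the only column with gaps. This is a minimal sketch on a made-up table: sample_df and its values are hypothetical, although Age, Cabin, and Fare are real column names in the Kaggle file.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with some missing values (not the real data)
sample_df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan, 35.0],
    'Cabin': [None, 'C85', None, None, None],
    'Fare': [7.25, 71.28, 7.92, 8.05, 53.1],
})

# isna() marks missing cells as True; summing counts them per column
missing_counts = sample_df.isna().sum()

# The mean of the True/False flags is the fraction missing per column
missing_percent = (sample_df.isna().mean() * 100).round(2)

print(missing_counts)
print(missing_percent)
```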
Looking at the age data, I wanted to determine the median, mean, and maximum ages for the dataset:
# Select the column first, so that the text columns are never aggregated
titanic_df['Age'].median()
titanic_df['Age'].mean()
titanic_df['Age'].max()
Male or Female
Additionally, I was curious if the passenger's sex had any relationship to survival.
I decided to investigate pandas's groupby method, starting by grouping on Sex:
titanic_df.groupby('Sex')
titanic_df.groupby('Sex').size()
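titanic_df.groupby('Sex').mean(numeric_only=True)
Again, some of the results are meaningless: PassengerId is an identifier and Pclass is a category code, so their means tell us nothing. The mean of Survived, however, is the share of each group that survived.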
So, I pulled out the individual results for Age, Fare, and Survived.
Women paid more on average than men, and the number of female survivors was more than twice that of male survivors.
(Because Survived carries a value of 1 for each survivor and 0 otherwise, performing a simple sum() of the Survived column proved to be a convenient way to make this determination.)
titanic_df.groupby('Sex')['Age'].mean()
titanic_df.groupby('Sex')['Fare'].mean()
titanic_df.groupby('Sex')['Survived'].sum()
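Clearly, women were far more likely to survive than men: they accounted for more than twice as many survivors even though, as the size() output above shows, there were fewer women than men in the dataset.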
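Counts of survivors and likelihood of survival are not the same thing, because the two groups differ in size. Taking the mean of Survived per group gives the survival rate directly. The following is a small sketch on made-up rows: sample_df and its values are hypothetical and chosen only to show how the two measures can diverge.

```python
import pandas as pd

# Hypothetical rows: 3 women (2 survived) and 5 men (1 survived)
sample_df = pd.DataFrame({
    'Sex': ['female', 'female', 'female',
            'male', 'male', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0],
})

grouped = sample_df.groupby('Sex')['Survived']

# Number of survivors per group (sum of the 0/1 flags)
survivor_counts = grouped.sum()

# Survival rate per group, in percent (mean of the 0/1 flags)
survival_rates = (grouped.mean() * 100).round(2)

print(survivor_counts)
print(survival_rates)
```

In this toy example the women have twice as many survivors, but their survival rate is more than three times that of the men, so the rate is the better measure of likelihood.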
Fares
I decided to have a look at the median, mean and maximum Fares:
# Select the column first, so that the text columns are never aggregated
titanic_df['Fare'].median()
titanic_df['Fare'].mean()
titanic_df['Fare'].max()
Wow, what wild swings in ticket prices! The highest fare was about 16 times that of the average.
Port of Embarkation
I had a look at Embarked, the port of embarkation. Most passengers boarded at Southampton, but the highest average fare, about £60, was paid by those who boarded at Cherbourg.
titanic_df.groupby('Embarked')
titanic_df.groupby('Embarked').size()
titanic_df.groupby('Embarked').mean(numeric_only=True)
titanic_df.groupby('Embarked')['Age'].mean()
titanic_df.groupby('Embarked')['Fare'].mean()
Class
It has long been held that wealthier passengers were more likely to survive the Titanic disaster. Grouping the data by Pclass allowed for a closer examination of this assertion.
titanic_df.groupby('Pclass').size()
titanic_df.groupby('Pclass')['Age'].mean()
titanic_df.groupby('Pclass')['Fare'].mean()
titanic_df.groupby('Pclass')['Survived'].sum()
The mean age of passengers in 1st class was the highest, at 38 years old, and the mean fare, at about £84, was also the highest, more than four times that of 2nd class.
The largest group of passengers was in 3rd class, but the greatest number of survivors -- 136 -- were unsurprisingly in 1st class.
However, there were 119 survivors in 3rd class.
print((titanic_df.groupby('Pclass')['Survived'].sum() / titanic_df.groupby('Pclass').size()) * 100)
A more accurate measure of class and likelihood of survival could be obtained by finding the percentage of survivors by class, as calculated above.
Indeed, about 63 percent of passengers in 1st class survived the Titanic disaster, and less than 25 percent of passengers in 3rd class survived.
Clearly, there was a correlation between survival and ticket class.
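The counts and percentages by class can also be gathered into a single table with one groupby call, which keeps the numerator and denominator side by side. This is a minimal sketch on made-up rows: sample_df, class_summary, and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical rows (not the real data)
sample_df = pd.DataFrame({
    'Pclass': [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 1, 0, 0, 0],
})

# Named aggregation: one output column per statistic.
# 'count' counts the non-missing Survived values in each class.
class_summary = sample_df.groupby('Pclass')['Survived'].agg(
    passengers='count',
    survivors='sum',
    share_survived='mean',
)

# Express the share as a percentage
class_summary['percent_survived'] = (class_summary['share_survived'] * 100).round(2)

print(class_summary)
```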
Charting Survivors by Class
To chart the survivors by class, I wrote a function survivors_by_class which takes Pclass as an argument.
# Chart of Survivors by Class
def survivors_by_class(Pclass):
    # Survivors and total passengers in the given class
    num_survivors_by_class = len(titanic_df[(titanic_df['Survived']==1) & (titanic_df['Pclass']==Pclass)])
    num_passengers_by_class = len(titanic_df[titanic_df['Pclass']==Pclass])
    # Percentages are computed for reference only; the chart uses the counts
    percent_survived_by_class = (num_survivors_by_class / num_passengers_by_class) * 100
    total_survivors = len(titanic_df[titanic_df['Survived']==1])
    percent_of_total = (num_survivors_by_class / total_survivors) * 100
    return num_survivors_by_class, num_passengers_by_class
pclass1, pclass1_tot = survivors_by_class(1)
pclass2, pclass2_tot = survivors_by_class(2)
pclass3, pclass3_tot = survivors_by_class(3)
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
tick_list = sorted(list(titanic_df['Pclass'].unique()))
index = np.arange(len(tick_list))
width = .25
survived_pclass = [pclass1, pclass2, pclass3]
total_pclass = [pclass1_tot, pclass2_tot, pclass3_tot]
p1 = plt.bar(index + width, total_pclass, width, color='Blue', label='Total')
p2 = plt.bar(index, survived_pclass, width, color='Green', label='Survivors')
plt.xlabel('Passenger Class')
plt.ylabel('Number of People')
plt.xticks(index + width/2., tick_list)
plt.title('Survivor Ratio Based on Passenger Class')
plt.legend(loc='best')
plt.show()
This bar graph shows the dramatic disparity between those who survived and those who perished, especially in 3rd class.
I then wanted to generate a simple pie chart of survivors:
survived_df = titanic_df.groupby('Survived').size()
print(survived_df)
group_names = ['Did Not Survive','Survived']
survived_df.plot(kind='pie', autopct='%.2f', labels=None)
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()
Charting Survivors by Age
Beyond class, I wanted to investigate whether survival was correlated with the ages of the passengers.
After cleaning the data (removing records with no age data, and converting ages from floats to integers), I created 5 age brackets and copied the records in each bracket into a new dataframe.
I then generated a separate pie chart for each of the 5 age brackets.
# Pie Charts of Survivors by Age
# Drop records with missing age information
# Create age brackets 0-15, 16-30, 31-45, 46-60, 61-100
# Group total people by age brackets
# Group survivors by age brackets
# Create 5 separate pie charts for the age brackets
# Remove records with no age data
titanic_df_age_dropna = titanic_df[~np.isnan(titanic_df['Age'])].copy()
# Convert ages from floats to integers
titanic_df_age_dropna['Age'] = titanic_df_age_dropna['Age'].astype(int)
# For each age bracket, filter by age, make a copy, and save as a new dataframe
children = titanic_df_age_dropna[titanic_df_age_dropna.Age < 16].copy()
younger_adults = titanic_df_age_dropna[(titanic_df_age_dropna.Age >= 16) & (titanic_df_age_dropna.Age < 31)].copy()
adults = titanic_df_age_dropna[(titanic_df_age_dropna.Age >= 31) & (titanic_df_age_dropna.Age < 46)].copy()
older_adults = titanic_df_age_dropna[(titanic_df_age_dropna.Age >= 46) & (titanic_df_age_dropna.Age < 61)].copy()
seniors = titanic_df_age_dropna[titanic_df_age_dropna.Age >= 61].copy()
group_names = ['Did Not Survive','Survived']
survived_df_children = children.groupby('Survived').size()
survived_df_children.plot(kind='pie', autopct='%.2f', labels=None)
plt.title('Survivors, Younger than 16')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()
survived_df_younger_adults = younger_adults.groupby('Survived').size()
survived_df_younger_adults.plot(kind='pie', autopct='%.2f', labels=None)
plt.title('Survivors, Ages 16 - 30')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()
survived_df_adults = adults.groupby('Survived').size()
survived_df_adults.plot(kind='pie', autopct='%.2f', labels=None)
plt.title('Survivors, Ages 31 - 45')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()
survived_df_older_adults = older_adults.groupby('Survived').size()
survived_df_older_adults.plot(kind='pie', autopct='%.2f', labels=None)
plt.title('Survivors, Ages 46 - 60')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()
survived_df_seniors = seniors.groupby('Survived').size()
survived_df_seniors.plot(kind='pie', autopct='%.2f', labels=None)
plt.title('Survivors, Ages 61 and Older')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()
These results were fascinating. More than half, at 59 percent, of children younger than 16 survived the disaster.
Surprisingly, only 23 percent of passengers aged 61 and over survived.
In this sample, then, children were more likely to survive than any other age group, including the elderly.
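The five age brackets can also be assigned in one step with pd.cut, after which a single groupby gives the survival rate for each bracket. This is a minimal sketch on made-up rows: sample_df, the AgeBracket column, and the label strings are hypothetical. The bin edges mirror the brackets used above, and right=False makes each bin include its lower edge, so an age of 30.5 lands in the 16 - 30 bracket just as it does after truncating to an integer.

```python
import pandas as pd

# Hypothetical rows (not the real data); one record has no age
sample_df = pd.DataFrame({
    'Age': [4.0, 15.0, 16.0, 30.5, 31.0, 45.0, 46.0, 60.0, 61.0, 80.0, float('nan')],
    'Survived': [1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1],
})

# Drop records with missing age information
known_age = sample_df.dropna(subset=['Age']).copy()

# Bins are [0, 16), [16, 31), [31, 46), [46, 61), [61, inf)
bins = [0, 16, 31, 46, 61, float('inf')]
labels = ['Younger than 16', '16 - 30', '31 - 45', '46 - 60', '61 and older']
known_age['AgeBracket'] = pd.cut(known_age['Age'], bins=bins, labels=labels, right=False)

grouped = known_age.groupby('AgeBracket', observed=False)['Survived']

# Passengers per bracket, and percent of each bracket that survived
bracket_sizes = grouped.size()
bracket_rates = (grouped.mean() * 100).round(2)

print(bracket_sizes)
print(bracket_rates)
```

Printing the bracket sizes alongside the rates makes it easy to see when a percentage rests on only a handful of records.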
Finally, I wanted to determine if there were any correlations in the data.
I wrote a function correlation that computes Pearson's r by standardizing each variable (subtracting its mean and dividing by its standard deviation) and averaging the products of the standardized values.
SciPy's stats module also provides this calculation as pearsonr, which I used as a cross-check.
After cleaning the data (dropping records with missing values in either column), I set out to determine if there was any correlation between Age and Fare, Survived and Age, and Survived and Pclass. None of the calculations yielded a value close to 1 or -1, the values that would indicate a strong linear relationship.
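def correlation(x, y):
    # Standardize each variable using the population standard deviation (ddof=0)
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    # Pearson's r is the mean of the products of the standardized values
    return (std_x * std_y).mean()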
# Correlation between Age and Fare, using the Pearson Correlation.
from scipy import stats
titanic_data_af_dropna = titanic_df[['Age', 'Fare']].dropna()
age_fare_correlation = stats.pearsonr(titanic_data_af_dropna['Age'], titanic_data_af_dropna['Fare'])
print(age_fare_correlation)
# Correlation between Age and Fare, using the correlation formula.
correlation(titanic_data_af_dropna['Age'], titanic_data_af_dropna['Fare'])
# Correlation between Survived and Age, using the Pearson Correlation.
from scipy import stats
titanic_data_sa_dropna = titanic_df[['Survived', 'Age']].dropna()
age_survival_correlation = stats.pearsonr(titanic_data_sa_dropna['Survived'], titanic_data_sa_dropna['Age'])
print(age_survival_correlation)
# Correlation between Survived and Age, using the correlation formula.
print(correlation(titanic_data_sa_dropna['Survived'], titanic_data_sa_dropna['Age']))
# Correlation between Survived and Pclass, using the Pearson Correlation.
titanic_data_sc_dropna = titanic_df[['Survived', 'Pclass']].dropna()
class_survival_correlation = stats.pearsonr(titanic_data_sc_dropna['Survived'], titanic_data_sc_dropna['Pclass'])
print(class_survival_correlation)
# Correlation between Survived and Pclass, using the correlation formula.
print(correlation(titanic_data_sc_dropna['Survived'], titanic_data_sc_dropna['Pclass']))
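As a sanity check on the hand-written formula, it can be compared with NumPy's corrcoef on a few numbers. This sketch uses made-up arrays: x and y are hypothetical values, not rows from the dataset. The explicit ddof=0 matters because a pandas Series defaults to the sample standard deviation (ddof=1), which would make the mean of the products come out slightly too small.

```python
import numpy as np

def correlation(x, y):
    # Standardize with the population standard deviation (ddof=0)
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    # Pearson's r is the mean of the products of the standardized values
    return (std_x * std_y).mean()

# Hypothetical values (not the real data)
x = np.array([22.0, 38.0, 26.0, 35.0, 54.0])
y = np.array([7.25, 71.28, 7.92, 53.1, 51.86])

manual_r = correlation(x, y)
numpy_r = np.corrcoef(x, y)[0, 1]
print(manual_r, numpy_r)

# A perfectly inverse relationship gives -1, which is just as strong as +1
inverse_r = correlation(x, -x)
print(inverse_r)
```

The last line is a reminder that the strength of a correlation is judged by its distance from 0, so values near -1 matter as much as values near 1.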
Concluding Remarks
The Titanic disaster has been one of the most studied events in modern history, rife with legends and conjecture about why it occurred and how it might have been prevented.
The dataset from Kaggle proved to be a great exercise in wrangling, exploring, and analyzing data. Indeed, the data showed that first-class passengers were more likely to survive, which is no surprise, and that children were more likely to survive as well.
However, because the dataset is incomplete and contains missing values, the analysis delivers limited and inconclusive results.