Analyzing Titanic Passenger Data


P2 Titanic

Analyzing Data of Titanic Passengers

I decided to investigate the Titanic passenger dataset, made available via the Kaggle data science community.

What factors made some people more likely to survive? It has been a long-held belief that wealthy passengers were more likely to survive -- perhaps because the crew placed them on lifeboats first -- but this may not necessarily be true.

Additionally, women and children are usually rescued first during any disaster. Was this true for the Titanic as well?

I decided to dig deep into the data to identify any trends, insights, and hopefully, some surprises.

The Titanic

First, I converted the .csv file to a dataframe:

In [174]:
import pandas as pd

titanic_df = pd.read_csv('titanic-data.csv')

Then I calculated how many passengers were included in the data:

In [175]:
len(titanic_df)
Out[175]:
891

Hmmm, 891 records. This is fewer than half of the 2,224 passengers and crew members who were aboard the ship, according to Kaggle and additional sources.

Therefore, any analysis performed would be incomplete or inconclusive, as the underlying dataset does not contain information on all of the individuals aboard the ship.

It is also not known how and where these 891 records were originally obtained, or whether these 891 records are a random sample of the 2,224 passengers. As such, the statistical validity of this sample is also unknown.

Nonetheless, I jumped, er, dived right in.

I ran the following to see the column headings and the first few records, even though the headings were given by Kaggle:

In [176]:
titanic_df.head()
Out[176]:
   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      0      373450            8.0500   NaN    S

To get some quick summary statistics, I ran the following:

In [177]:
titanic_df.describe()
/Users/jakewengroff/anaconda/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning:

Invalid value encountered in percentile

Out[177]:
       PassengerId    Survived      Pclass         Age       SibSp       Parch        Fare
count   891.000000  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean    446.000000    0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std     257.353842    0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min       1.000000    0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%     223.500000    0.000000    2.000000         NaN    0.000000    0.000000    7.910400
50%     446.000000    0.000000    3.000000         NaN    0.000000    0.000000   14.454200
75%     668.500000    1.000000    3.000000         NaN    1.000000    0.000000   31.000000
max     891.000000    1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

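Not all of these results are useful. The mean and standard deviation of an identifier such as PassengerId, or of a categorical code such as Pclass, tell us little. The mean of Survived is the exception: because the column holds only 0s and 1s, its mean (0.3838) is the fraction of passengers in the dataset who survived.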
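One way to avoid the less useful columns is to summarize only the measurement columns and to count the categorical ones instead. The following is a minimal sketch on a small made-up frame (toy_df and its values are illustrative, not the Titanic data):

```python
import numpy as np
import pandas as pd

# Made-up records for illustration only
toy_df = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Pclass':      [3, 1, 3, 1],
    'Age':         [22.0, 38.0, np.nan, 35.0],
    'Fare':        [7.25, 71.28, 7.93, 53.10],
})

# Summarize only the columns that are genuine measurements
summary = toy_df[['Age', 'Fare']].describe()

# Treat class as a category: count its levels instead of averaging them
class_counts = toy_df['Pclass'].value_counts().sort_index()

print(summary)
print(class_counts)
```

Note that describe() reports a count of 3 for Age here, because missing values are skipped.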

Percent of Survivors

To quickly obtain the number of survivors, I declared a variable survivors and set it equal to the count of the number of records in which the Survived column was equal to 1:

In [178]:
# Total survivors
# Select only the records where 'Survived' equals 1 and count them.

survivors = len(titanic_df[titanic_df['Survived'] == 1])
survivors
Out[178]:
342

I then calculated the percentage of passengers in the dataset who survived, which came to 38.38 percent:

In [179]:
# Percent that survived

round(((survivors/len(titanic_df)) * 100), 2)
Out[179]:
38.38

I wanted to determine if this percentage of survivors differs from the total percentage.

According to Kaggle, 1,502 out of 2,224 passengers and crew members were killed. This means that 722 survived, or 32.46 percent.

As such, we have a higher survival rate in the sample dataset.

Not ideal, but I was curious how the sample data may yield different results than the original underlying data.
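The comparison above comes down to two divisions. Here is a minimal sketch using only the figures quoted in this section (342 of 891 in the sample; 1,502 deaths out of 2,224 overall, per Kaggle). The variable names are my own:

```python
# Figures quoted above: sample counts from the dataframe, overall counts per Kaggle
sample_survivors, sample_total = 342, 891
overall_survivors, overall_total = 2224 - 1502, 2224

sample_rate = round(sample_survivors / sample_total * 100, 2)
overall_rate = round(overall_survivors / overall_total * 100, 2)

print(sample_rate, overall_rate)  # 38.38 32.46

# Gap between the two rates, in percentage points
rate_gap = round(sample_rate - overall_rate, 2)
print(rate_gap)
```

The gap of roughly six percentage points is one more reason to treat results from this sample with caution.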

Age

Next, I wanted to learn more about the ages of the passengers, and whether age was correlated with survival.

From the results of the describe method used above, I spotted some NaN (Not a Number) values in the Age column. Not good. This means that some values for Age are missing. To determine just how many are missing, I performed a quick calculation:

In [180]:
import numpy as np

# Count the records whose Age value is missing (NaN)
ageless = len(titanic_df[np.isnan(titanic_df['Age'])])
print(ageless)
177

As a percentage of total records, this comes to:

In [181]:
percent_ageless = round(((ageless/len(titanic_df)) * 100), 2)
print(percent_ageless)
19.87

With close to 20 percent of age data missing, this could be a potential issue.
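The same check can be run for every column at once. This is a minimal sketch on a small made-up frame (toy_df is illustrative, not the Titanic data), using isnull(), which also works for non-numeric columns:

```python
import numpy as np
import pandas as pd

# Made-up records for illustration only
toy_df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan, 35.0],
    'Fare':  [7.25, 71.28, 7.93, 53.10, 8.05],
    'Cabin': [None, 'C85', None, 'C123', None],
})

# isnull() marks missing cells; sum() counts them per column
missing_counts = toy_df.isnull().sum()

# The mean of the boolean mask is the missing fraction per column
missing_percent = (toy_df.isnull().mean() * 100).round(2)

print(missing_counts)
print(missing_percent)
```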

Looking at the age data, I wanted to determine the median, mean, and maximum ages for the dataset:

In [182]:
titanic_df['Age'].median()
Out[182]:
28.0
In [183]:
titanic_df['Age'].mean()
Out[183]:
29.69911764705882
In [184]:
titanic_df['Age'].max()
Out[184]:
80.0

Male or Female

Additionally, I was curious if the passenger's sex had any relationship to survival.

I decided to investigate the pandas groupby function, starting with grouping by Sex:

In [185]:
titanic_df.groupby('Sex')
Out[185]:
<pandas.core.groupby.DataFrameGroupBy object at 0x11ae5f198>
In [186]:
titanic_df.groupby('Sex').size()
Out[186]:
Sex
female    314
male      577
dtype: int64
In [187]:
titanic_df.groupby('Sex').mean(numeric_only=True)
Out[187]:
        PassengerId  Survived    Pclass        Age     SibSp     Parch       Fare
Sex
female   431.028662  0.742038  2.159236  27.915709  0.694268  0.649682  44.479818
male     454.147314  0.188908  2.389948  30.726645  0.429809  0.235702  25.523893

Again, not every column is meaningful here: the means of PassengerId and Pclass tell us little. The mean of Survived is the exception, since it gives the survival rate for each sex (about 0.74 for women and 0.19 for men).

So, I pulled out the individual results for Age, Fare, and Survived.

Women paid more on average than men, and the number of female survivors was more than twice that of male survivors.

(Because Survived is 1 for survivors and 0 for everyone else, a simple sum() of the Survived column is a convenient way to count survivors.)

In [188]:
titanic_df.groupby('Sex')['Age'].mean()
Out[188]:
Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64
In [189]:
titanic_df.groupby('Sex')['Fare'].mean()
Out[189]:
Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64
In [190]:
titanic_df.groupby('Sex')['Survived'].sum()
Out[190]:
Sex
female    233
male      109
Name: Survived, dtype: int64

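Clearly, women were far more likely to survive than men. Because there were many more men (577) than women (314) in the dataset, the rates tell the story better than the raw counts: about 74 percent of women survived (233 of 314), compared with about 19 percent of men (109 of 577), a rate nearly four times higher.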
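Raw survivor counts can mislead when the groups differ in size, so it helps to turn them into rates. A minimal sketch using the counts from the groupby outputs above (the dictionary layout and names are my own):

```python
# Counts copied from the groupby outputs above: (survivors, passengers)
counts = {'female': (233, 314), 'male': (109, 577)}

# Survival rate per group, as a percentage
survival_rate = {sex: round(s / n * 100, 2) for sex, (s, n) in counts.items()}
print(survival_rate)  # {'female': 74.2, 'male': 18.89}

# How many times higher the female rate is than the male rate
rate_ratio = survival_rate['female'] / survival_rate['male']
print(round(rate_ratio, 2))
```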

Fares

I decided to have a look at the median, mean and maximum Fares:

In [191]:
titanic_df['Fare'].median()
Out[191]:
14.4542
In [192]:
titanic_df['Fare'].mean()
Out[192]:
32.2042079685746
In [193]:
titanic_df['Fare'].max()
Out[193]:
512.32920000000001

Wow, what wild swings in ticket prices! The highest fare was about 16 times that of the average.
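To put numbers on that spread, the three statistics above can be compared directly. A minimal sketch using the values from the outputs above (variable names are my own):

```python
# Fare statistics copied from the outputs above
fare_median = 14.4542
fare_mean = 32.2042079685746
fare_max = 512.3292

max_to_mean = fare_max / fare_mean        # how far the top fare sits above the mean
max_to_median = fare_max / fare_median    # and above the typical (median) fare
mean_to_median = fare_mean / fare_median  # a value above 1 suggests a right-skewed distribution

print(round(max_to_mean, 1), round(max_to_median, 1), round(mean_to_median, 1))
```

A mean more than twice the median suggests that a small number of very expensive tickets pull the average up.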

Port of Embarkation

I had a look at Embarked, or the port of embarkation. Most passengers embarked at Southampton, but the highest average fare, about 60, was paid by those who boarded at Cherbourg.

In [194]:
titanic_df.groupby('Embarked')
Out[194]:
<pandas.core.groupby.DataFrameGroupBy object at 0x11ae5f828>
In [195]:
titanic_df.groupby('Embarked').size()
Out[195]:
Embarked
C    168
Q     77
S    644
dtype: int64
In [196]:
titanic_df.groupby('Embarked').mean(numeric_only=True)
Out[196]:
          PassengerId  Survived    Pclass        Age     SibSp     Parch       Fare
Embarked
C          445.357143  0.553571  1.886905  30.814769  0.386905  0.363095  59.954144
Q          417.896104  0.389610  2.909091  28.089286  0.428571  0.168831  13.276030
S          449.527950  0.336957  2.350932  29.445397  0.571429  0.413043  27.079812
In [197]:
titanic_df.groupby('Embarked')['Age'].mean()
Out[197]:
Embarked
C    30.814769
Q    28.089286
S    29.445397
Name: Age, dtype: float64
In [198]:
titanic_df.groupby('Embarked')['Fare'].mean()
Out[198]:
Embarked
C    59.954144
Q    13.276030
S    27.079812
Name: Fare, dtype: float64

Class

It has long been believed that the wealthier passengers were more likely to survive the Titanic disaster. Grouping the data by Pclass allowed for a closer examination of this assertion.

In [199]:
titanic_df.groupby('Pclass').size()
Out[199]:
Pclass
1    216
2    184
3    491
dtype: int64
In [200]:
titanic_df.groupby('Pclass')['Age'].mean()
Out[200]:
Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64
In [201]:
titanic_df.groupby('Pclass')['Fare'].mean()
Out[201]:
Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64
In [202]:
titanic_df.groupby('Pclass')['Survived'].sum()
Out[202]:
Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64

The mean age of passengers in 1st class was the highest, at 38 years old, and the mean fare, at about 84, was also the highest -- more than four times that of 2nd class.

The largest group of passengers was in 3rd class, but the greatest number of survivors -- 136 -- was, unsurprisingly, in 1st class.

However, there were 119 survivors in 3rd class.

In [203]:
print((titanic_df.groupby('Pclass')['Survived'].sum() / titanic_df.groupby('Pclass').size()) * 100)
Pclass
1    62.962963
2    47.282609
3    24.236253
dtype: float64

A more accurate measure of class and likelihood of survival could be obtained by finding the percentage of survivors by class, as calculated above.

Indeed, about 63 percent of passengers in 1st class survived the Titanic disaster, and less than 25 percent of passengers in 3rd class survived.

Clearly, there was a correlation between survival and ticket class.
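Since Survived holds only 0s and 1s, the percentage of survivors per class can also be read off the group mean. Here is a minimal sketch on made-up records (toy_df is illustrative, not the Titanic data) showing that both routes give the same result:

```python
import pandas as pd

# Made-up records for illustration only
toy_df = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

by_class = toy_df.groupby('Pclass')['Survived']

# Long form: survivors divided by group size
rate_from_counts = by_class.sum() / by_class.size() * 100

# Short form: the mean of a 0/1 column is the fraction of ones
rate_from_mean = by_class.mean() * 100

print(rate_from_mean.round(2))
```

The short form avoids grouping twice, at the cost of being a little less explicit about what is being divided by what.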

Charting Survivors by Class

To chart the survivors by class, I wrote a function survivors_by_class which takes Pclass as an argument.

In [204]:
# Chart of Survivors by Class

def survivors_by_class(Pclass):
    # Return (number of survivors, number of passengers) for the given class
    in_class = titanic_df['Pclass'] == Pclass
    num_survivors_by_class = len(titanic_df[(titanic_df['Survived'] == 1) & in_class])
    num_passengers_by_class = len(titanic_df[in_class])
    return num_survivors_by_class, num_passengers_by_class

pclass1, pclass1_tot = survivors_by_class(1)
pclass2, pclass2_tot = survivors_by_class(2)
pclass3, pclass3_tot = survivors_by_class(3)

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

tick_list = sorted(list(titanic_df['Pclass'].unique()))
index = np.arange(len(tick_list))
width = .25

survived_pclass = [pclass1, pclass2, pclass3]
total_pclass = [pclass1_tot, pclass2_tot, pclass3_tot]

p1 = plt.bar(index + width, total_pclass, width, color='Blue', label='Total')
p2 = plt.bar(index, survived_pclass, width, color='Green', label='Survivors')

plt.xlabel('Passenger Class')
plt.ylabel('Number of People')
plt.xticks(index + width/2., tick_list)
plt.title('Survivor Ratio Based on Passenger Class')
plt.legend(loc='best')

plt.show()

This bar graph shows the dramatic disparity between those who survived and those who perished, especially in 3rd class.

I then wanted to generate a simple pie chart of survivors:

In [205]:
survived_df = titanic_df.groupby('Survived').size()

print(survived_df)
Survived
0    549
1    342
dtype: int64
In [206]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

group_names = ['Did Not Survive','Survived']

survived_df.plot(kind='pie', autopct='%.2f', labels=None)

plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")

plt.show()

Charting Survivors by Age

Beyond class, I wanted to investigate whether survival was correlated with the ages of the passengers.

After cleaning the data (removing records with no age data, and converting ages from floats to integers), I created 5 age brackets, filtered the dataframe, and copied it into new dataframes.

I then generated a separate pie chart for each of the 5 age brackets.

In [207]:
# Pie Charts of Survivors by Age

# Drop records with missing age information
# Create age brackets    0-15, 16-30, 31-45, 46-60, 61-100
# Group total people by age brackets
# Group survivors by age brackets
# Create 5 separate pie charts for the age brackets

# Remove records with no age data
titanic_df_age_dropna = titanic_df[~np.isnan(titanic_df['Age'])].copy()

# Convert ages from floats to integers
titanic_df_age_dropna['Age'] = titanic_df_age_dropna['Age'].astype(int)

# For each age bracket, filter by age, make a copy, and save as a new dataframe

children = titanic_df_age_dropna[titanic_df_age_dropna.Age < 16].copy()
younger_adults = titanic_df_age_dropna[(titanic_df_age_dropna.Age >= 16) & (titanic_df_age_dropna.Age < 31)].copy()
adults = titanic_df_age_dropna[(titanic_df_age_dropna.Age >= 31) & (titanic_df_age_dropna.Age < 46)].copy()
older_adults = titanic_df_age_dropna[(titanic_df_age_dropna.Age >= 46) & (titanic_df_age_dropna.Age < 61)].copy()
seniors = titanic_df_age_dropna[titanic_df_age_dropna.Age >= 61].copy()

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

group_names = ['Did Not Survive','Survived']

# One pie chart per age bracket: (chart title, filtered dataframe)
age_brackets = [
    ('Survivors, Younger than 16', children),
    ('Survivors, Ages 16 - 30', younger_adults),
    ('Survivors, Ages 31 - 45', adults),
    ('Survivors, Ages 46 - 60', older_adults),
    ('Survivors, Ages 61 and Older', seniors),
]

for title, bracket_df in age_brackets:
    # Count the passengers in this bracket by survival outcome (0 or 1)
    survived_counts = bracket_df.groupby('Survived').size()
    survived_counts.plot(kind='pie', autopct='%.2f', labels=None)

    plt.title(title)
    plt.axis('equal')
    plt.ylabel('')
    plt.legend(labels=group_names, loc="best")
    plt.show()

These results were fascinating. More than half, at 59 percent, of children younger than 16 survived the disaster.

Surprisingly, only 23 percent of passengers aged 61 and over survived.

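In this sample, then, children had the highest survival rate of the five age brackets, well above that of the oldest passengers. Keep in mind that close to 20 percent of the records have no age and are excluded from these charts.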
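The same brackets can also be built in one step with pandas.cut, which avoids filtering the dataframe five times. This is a minimal sketch on made-up records (toy_df, the bracket labels, and the AgeBracket column name are my own). It is an alternative to the approach above, not the method used for the charts:

```python
import numpy as np
import pandas as pd

# Made-up records for illustration only
toy_df = pd.DataFrame({
    'Age':      [4.0, 15.0, 16.0, 30.5, 31.0, 45.0, 46.0, 60.0, 61.0, 80.0, np.nan],
    'Survived': [1,   1,    0,    1,    0,    1,    0,    0,    0,    1,    1],
})

# Left-closed bins reproduce the brackets above: <16, 16-30, 31-45, 46-60, 61+
bins = [0, 16, 31, 46, 61, np.inf]
labels = ['0-15', '16-30', '31-45', '46-60', '61+']

# Drop records with no age, then assign each record to a bracket
with_age = toy_df.dropna(subset=['Age']).copy()
with_age['AgeBracket'] = pd.cut(with_age['Age'], bins=bins, labels=labels, right=False)

# Survival rate (percent) per bracket
rate_by_bracket = with_age.groupby('AgeBracket', observed=False)['Survived'].mean() * 100
print(rate_by_bracket)
```

With right=False, a fractional age such as 30.5 lands in the 16-30 bracket, which matches what truncating ages to integers does in the cell above.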

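Correlation

Finally, I wanted to determine if there were any correlations in the data.

I wrote a function correlation that computes Pearson's r by standardizing both variables (subtracting the mean and dividing by the standard deviation) and averaging the products.

SciPy's stats module also provides the Pearson correlation as pearsonr, which I used to cross-check the results.

After cleaning the data (dropping records with missing values in the columns being compared), I looked for correlations between Age and Fare, Survived and Age, and Survived and Pclass. None of the calculations yielded a value close to 1 or -1. The strongest was Survived and Pclass at about -0.34, a moderate negative correlation: a higher class number (that is, a lower class) went with lower survival.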

In [208]:
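def correlation(x, y):
    # Pearson's r: standardize each variable using the population
    # standard deviation (ddof=0), then average the products.
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)

    return (std_x * std_y).mean()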
In [209]:
# Correlation between Age and Fare, using the Pearson Correlation.

from scipy import stats

titanic_data_af_dropna = titanic_df[['Age', 'Fare']].dropna()

age_fare_correlation = stats.pearsonr(titanic_data_af_dropna['Age'], titanic_data_af_dropna['Fare'])

print(age_fare_correlation)
(0.096066691769038898, 0.010216277504446435)
In [210]:
# Correlation between Age and Fare, using the correlation formula.

correlation(titanic_data_af_dropna['Age'], titanic_data_af_dropna['Fare'])
Out[210]:
0.09606669176903883
In [211]:
# Correlation between Survived and Age, using the Pearson Correlation.

from scipy import stats

titanic_data_sa_dropna = titanic_df[['Survived', 'Age']].dropna()

age_survival_correlation = stats.pearsonr(titanic_data_sa_dropna['Survived'], titanic_data_sa_dropna['Age'])

print(age_survival_correlation)
(-0.077221094572177656, 0.039124654013483327)
In [212]:
# Correlation between Survived and Age, using the correlation formula.

print(correlation(titanic_data_sa_dropna['Survived'], titanic_data_sa_dropna['Age']))
-0.0772210945721773
In [213]:
# Correlation between Survived and Pclass, using the Pearson Correlation.

titanic_data_sc_dropna = titanic_df[['Survived', 'Pclass']].dropna()

class_survival_correlation = stats.pearsonr(titanic_data_sc_dropna['Survived'], titanic_data_sc_dropna['Pclass'])

print(class_survival_correlation)
(-0.33848103596101536, 2.5370473879804202e-25)
In [214]:
# Correlation between Survived and Pclass, using the correlation formula.

print(correlation(titanic_data_sc_dropna['Survived'], titanic_data_sc_dropna['Pclass']))
-0.33848103596101325
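The agreement between the hand-written function and pearsonr is easy to confirm on a tiny example where r can be worked out by hand. A minimal sketch, using NumPy's corrcoef as the reference and made-up values for x and y:

```python
import numpy as np

def correlation(x, y):
    # Same formula as the function above: mean of the products of z-scores
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    return (std_x * std_y).mean()

# Made-up values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 1.0, 4.0, 3.0, 5.0])

r_manual = correlation(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]

print(r_manual, r_numpy)
```

For these values, the products of deviations sum to 8 and each variable's squared deviations sum to 10, so r works out to 0.8 by either route.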

Concluding Remarks

The Titanic disaster has been one of the most studied events in modern history, rife with legends and conjecture about why it occurred and how it might have been prevented.

The dataset from Kaggle proved to be a great exercise in wrangling, exploring, and analyzing data. Indeed, it showed that first-class passengers were more likely to survive -- no surprise -- and that women and children were more likely to survive as well.

However, because the dataset is incomplete and contains missing values, the analysis delivers limited and inconclusive results.
