Analyzing Data of Titanic Passengers

I decided to investigate the Titanic passenger dataset, made available via the Kaggle data science community.

What factors made some people more likely to survive? It has been a long-held belief that wealthy passengers were more likely to survive -- perhaps because the crew placed them on lifeboats first -- but this may not necessarily be true.

Additionally, women and children are usually rescued first during any disaster. Was this true for the Titanic as well?

I decided to dig deep into the data to identify any trends, insights, and hopefully, some surprises.

The Titanic

First, I converted the .csv file to a dataframe:

import pandas as pd

titanic_df = pd.read_csv('titanic-data.csv')

Then I calculated how many passengers were included in the data:

len(titanic_df)

891

Hmmm, 891 records. This is less than half of the 2,224 passengers and crew members who were aboard the ship, according to Kaggle and additional sources.

Therefore, any analysis performed would be incomplete or inconclusive, as the underlying dataset does not contain information on all of the individuals aboard the ship.

It is also not known how and where these 891 records were originally obtained, or whether these 891 records are a random sample of the 2,224 passengers. As such, the statistical validity of this sample is also unknown.

Nonetheless, I jumped, er, dived right in.

I ran the following to see the column headings, even though they were already given by Kaggle:

titanic_df.head()

To get some fast statistical data, I performed the following.

titanic_df.describe()

/Users/jakewengroff/anaconda/lib/python3.5/site-packages/numpy/lib/function_base.py:3834: RuntimeWarning:

Invalid value encountered in percentile

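Not all of these results are useful. The mean and standard deviation of an identifier such as PassengerId or a category code such as Pclass are not meaningful. The mean of Survived, however, is informative: because the column is coded 0 or 1, its mean (0.3838) is the share of passengers in the sample who survived.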
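One way to keep describe() focused on quantitative columns is to select them first. The sketch below uses a tiny made-up dataframe (sample_df and its values are my own illustration, not the Kaggle data), so only the call pattern carries over:

```python
import pandas as pd

# Tiny made-up sample (NOT the Kaggle data), just to show the call pattern.
sample_df = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 2],
    'Age': [22.0, 38.0, None, 35.0],
    'Fare': [7.25, 71.28, 7.92, 13.00],
})

# Summarize only the columns where a mean and standard deviation make sense.
quantitative_summary = sample_df[['Age', 'Fare']].describe()
print(quantitative_summary)

# The "count" row also exposes missing values: Age has 3 entries, Fare has 4.
print(quantitative_summary.loc['count'])
```

On the real data, the same selection would simply be applied to titanic_df.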

Percent of Survivors

To quickly obtain the number of survivors, I declared a variable survivors and set it equal to the count of the number of records in which the Survived column was equal to 1:

# Total survivors
# Making a new dataframe, only selecting those records where 'Survived' is equal to 1.

survivors = len(titanic_df[titanic_df['Survived']==1])
survivors

342

I then calculated the percentage of passengers who survived, which came to 38.38 percent:

# Percent that survived

round(((survivors/len(titanic_df)) * 100), 2)

38.38

I wanted to see whether this survival rate differs from the rate for everyone aboard.

According to Kaggle, 1,502 out of 2,224 passengers and crew members were killed. This means that 722 survived, or 32.46 percent.

As such, we have a higher survival rate in the sample dataset.

Not ideal, but I was curious how the sample data may yield different results than the original underlying data.
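The comparison above is plain arithmetic on the counts already quoted, so it can be reproduced in a few lines as a sanity check. The variable names here are my own:

```python
# Counts taken from the text above.
sample_survivors, sample_total = 342, 891
all_aboard, all_killed = 2224, 1502
all_survivors = all_aboard - all_killed          # 722

# Survival rates as percentages, rounded to two decimals.
sample_rate = round(sample_survivors / sample_total * 100, 2)
overall_rate = round(all_survivors / all_aboard * 100, 2)

# Gap between the sample and the full ship, in percentage points.
gap = round(sample_rate - overall_rate, 2)
print(sample_rate, overall_rate, gap)   # 38.38 32.46 5.92
```

The sample's survival rate is about 6 percentage points higher than the rate for everyone aboard.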

Age

Nonetheless, I was curious to learn more about the ages of the passengers, and whether this was correlated with survival.

From the results of the describe method used above, I spotted some NaN (Not a Number) values in the Age column. Not good. This means that some values for Age are missing. To determine just how many are missing, I performed a quick calculation:

import numpy as np

ageless = len(titanic_df[np.isnan(titanic_df['Age'])])
print(ageless)

177

which as a percentage of total records comes to

percent_ageless = round(((ageless/len(titanic_df)) * 100), 2)
print(percent_ageless)

19.87

With close to 20 percent of age data missing, this could be a potential issue.
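The same kind of check can be run on every column at once with isnull(). The dataframe below is made up for illustration (it is not the Kaggle data), so the numbers it prints say nothing about the Titanic records:

```python
import pandas as pd

# Made-up rows (NOT the Kaggle data) with some deliberately missing values.
sample_df = pd.DataFrame({
    'Age': [22.0, None, 26.0, None, 35.0],
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
    'Cabin': [None, 'C85', None, 'C123', None],
})

# isnull() marks each missing cell as True; summing counts them per column.
missing_counts = sample_df.isnull().sum()

# The mean of a True/False column is the fraction that is True.
missing_percent = (sample_df.isnull().mean() * 100).round(2)

print(missing_counts)
print(missing_percent)
```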

Looking at the age data, I wanted to determine the median, mean, and maximum ages for the dataset:

titanic_df['Age'].median()

28.0

titanic_df['Age'].mean()

29.69911764705882

titanic_df['Age'].max()

80.0

Male or Female

Additionally, I was curious if the passenger's sex had any relationship to survival.

I decided to investigate the pandas groupby function, starting by grouping on Sex:

titanic_df.groupby('Sex')

<pandas.core.groupby.DataFrameGroupBy object at 0x11ae5f198>

titanic_df.groupby('Sex').size()

Sex
female    314
male      577
dtype: int64

titanic_df.groupby('Sex').mean(numeric_only=True)

Again, some of the results are not meaningful: the means of PassengerId and Pclass say little. The mean of Survived, on the other hand, is the survival rate within each group.

So, I pulled out the individual results for Age, Fare, and Survived.

Women paid more on average than men, and the number of female survivors was more than twice that of male survivors.

(Because Survived is coded as 1 for survivors and 0 otherwise, a simple sum() of the Survived column proved to be a convenient way to count survivors.)

titanic_df.groupby('Sex')['Age'].mean()

Sex
female    27.915709
male      30.726645
Name: Age, dtype: float64

titanic_df.groupby('Sex')['Fare'].mean()

Sex
female    44.479818
male      25.523893
Name: Fare, dtype: float64

titanic_df.groupby('Sex')['Survived'].sum()

Sex
female    233
male      109
Name: Survived, dtype: int64

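The raw counts understate the gap, because there were far fewer women than men in the data. As a rate, 233 of 314 women survived (about 74 percent) versus 109 of 577 men (about 19 percent), so women were nearly four times as likely to survive as men.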
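Survival rates by sex follow directly from the group sizes and survivor counts printed above. As a sketch, the snippet below rebuilds just the Sex and Survived columns from those counts (sex_df is my own stand-in, not the original file) and takes the mean of Survived per group:

```python
import numpy as np
import pandas as pd

# (passengers, survivors) per sex, copied from the outputs above.
counts = {'female': (314, 233), 'male': (577, 109)}

# Rebuild one row per passenger: 1 for each survivor, 0 for everyone else.
frames = []
for sex, (total, survived) in counts.items():
    frames.append(pd.DataFrame({
        'Sex': sex,
        'Survived': np.repeat([1, 0], [survived, total - survived]),
    }))
sex_df = pd.concat(frames, ignore_index=True)

# Because Survived is 0 or 1, its mean within a group is the survival rate.
survival_rate_by_sex = sex_df.groupby('Sex')['Survived'].mean()
print((survival_rate_by_sex * 100).round(2))
```

With the original dataframe loaded, titanic_df.groupby('Sex')['Survived'].mean() should give the same rates without the rebuilding step.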

Fares

I decided to have a look at the median, mean and maximum Fares:

titanic_df['Fare'].median()

14.4542

titanic_df['Fare'].mean()

32.2042079685746

titanic_df['Fare'].max()

512.32920000000001

Wow, what wild swings in ticket prices! The highest fare was about 16 times that of the average.

Port of Embarkation

I had a look at Embarked, or the port of embarkation. Most embarked on their voyage in Southampton, but the highest average fare was paid at Cherbourg, about $60.

titanic_df.groupby('Embarked')

<pandas.core.groupby.DataFrameGroupBy object at 0x11ae5f828>

titanic_df.groupby('Embarked').size()

Embarked
C    168
Q     77
S    644
dtype: int64

titanic_df.groupby('Embarked').mean(numeric_only=True)

titanic_df.groupby('Embarked')['Age'].mean()

Embarked
C    30.814769
Q    28.089286
S    29.445397
Name: Age, dtype: float64

titanic_df.groupby('Embarked')['Fare'].mean()

Embarked
C    59.954144
Q    13.276030
S    27.079812
Name: Fare, dtype: float64

Class

It has long been believed that wealthier passengers were more likely to survive the Titanic disaster. Grouping the data by Pclass allowed for a closer examination of this assertion.

titanic_df.groupby('Pclass').size()

Pclass
1    216
2    184
3    491
dtype: int64

titanic_df.groupby('Pclass')['Age'].mean()

Pclass
1    38.233441
2    29.877630
3    25.140620
Name: Age, dtype: float64

titanic_df.groupby('Pclass')['Fare'].mean()

Pclass
1    84.154687
2    20.662183
3    13.675550
Name: Fare, dtype: float64

titanic_df.groupby('Pclass')['Survived'].sum()

Pclass
1    136
2     87
3    119
Name: Survived, dtype: int64

The mean age of passengers in 1st class was the highest, at 38 years old, and the mean fare, at $84, was the highest -- more than four times that of 2nd class.

The largest group of passengers was in 3rd class, but the greatest number of survivors -- 136 -- was, unsurprisingly, in 1st class.

However, there were 119 survivors in 3rd class.

print((titanic_df.groupby('Pclass')['Survived'].sum() / titanic_df.groupby('Pclass').size()) * 100)

Pclass
1    62.962963
2    47.282609
3    24.236253
dtype: float64

A more accurate measure of class and likelihood of survival could be obtained by finding the percentage of survivors by class, as calculated above.

Indeed, about 63 percent of passengers in 1st class survived the Titanic disaster, and less than 25 percent of passengers in 3rd class survived.

Clearly, there was a correlation between survival and ticket class.
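A cross-tabulation is another way to lay out the same percentages, showing both outcomes for each class. The sketch below rebuilds the Pclass and Survived columns from the counts printed above (class_df is my own stand-in for the original file) and normalizes each row to 100 percent:

```python
import numpy as np
import pandas as pd

# (passengers, survivors) per class, copied from the outputs above.
class_counts = {1: (216, 136), 2: (184, 87), 3: (491, 119)}

# One row per passenger: the class label, then 1 for survivors and 0 otherwise.
pclass = np.concatenate([np.full(total, c) for c, (total, _) in class_counts.items()])
survived = np.concatenate([
    np.repeat([1, 0], [s, total - s]) for total, s in class_counts.values()
])
class_df = pd.DataFrame({'Pclass': pclass, 'Survived': survived})

# normalize='index' divides each row by its total, giving rates within a class.
rate_table = pd.crosstab(class_df['Pclass'], class_df['Survived'], normalize='index') * 100
print(rate_table.round(2))
```

The column labeled 1 holds the survival percentages, and the column labeled 0 holds the share of each class that did not survive.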

Charting Survivors by Class

To chart the survivors by class, I wrote a function survivors_by_class which takes Pclass as an argument.

# Chart of Survivors by Class

def survivors_by_class(Pclass):
    """Return (number of survivors, number of passengers) for one passenger class."""
    in_class = titanic_df['Pclass'] == Pclass
    num_survivors_by_class = len(titanic_df[(titanic_df['Survived']==1) & in_class])
    num_passengers_by_class = len(titanic_df[in_class])
    return num_survivors_by_class, num_passengers_by_class

pclass1, pclass1_tot = survivors_by_class(1)
pclass2, pclass2_tot = survivors_by_class(2)
pclass3, pclass3_tot = survivors_by_class(3)

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

tick_list = sorted(list(titanic_df['Pclass'].unique()))
index = np.arange(len(tick_list))
width = .25

survived_pclass = [pclass1, pclass2, pclass3]
total_pclass = [pclass1_tot, pclass2_tot, pclass3_tot]

p1 = plt.bar(index + width, total_pclass, width, color='Blue', label='Total')
p2 = plt.bar(index, survived_pclass, width, color='Green', label='Survivors')

plt.xlabel('Passenger Class')
plt.ylabel('Number of People')
plt.xticks(index + width/2., tick_list)
plt.title('Survivor Ratio Based on Passenger Class')
plt.legend(loc='best')

plt.show()

This bar graph shows the dramatic disparity between those who survived and those who perished, especially in 3rd class.

I then wanted to generate a simple pie chart of survivors:

survived_df = titanic_df.groupby('Survived').size()

print(survived_df)

Survived
0    549
1    342
dtype: int64

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

group_names = ['Did Not Survive','Survived']

survived_df.plot(kind='pie', autopct='%.2f', labels=None)

plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")

plt.show()

Charting Survivors by Age

Beyond class, I wanted to investigate whether survival was correlated with the ages of the passengers.

After cleaning the data (removing records with no age data, and converting ages from floats to integers), I created 5 age brackets, filtered the dataframe, and copied it into new dataframes.

I then generated a separate pie chart for each of the 5 age brackets.

# Pie Charts of Survivors by Age

# Drop records with missing age information
# Create age brackets    0-15, 16-30, 31-45, 46-60, 61-100
# Group total people by age brackets
# Group survivors by age brackets
# Create 5 separate pie charts for the age brackets

# Remove records with no age data
titanic_df_age_dropna = titanic_df[~np.isnan(titanic_df['Age'])].copy()

# Convert ages from floats to integers
titanic_df_age_dropna['Age'] = titanic_df_age_dropna['Age'].astype(int)

# For each age bracket, filter by age, make a copy, and save as a new dataframe

children = titanic_df_age_dropna[titanic_df_age_dropna.Age < 16].copy()
younger_adults = titanic_df_age_dropna[(titanic_df_age_dropna.Age >= 16) & (titanic_df_age_dropna.Age < 31)].copy()
adults = titanic_df_age_dropna[(titanic_df_age_dropna.Age >= 31) & (titanic_df_age_dropna.Age < 46)].copy()
older_adults = titanic_df_age_dropna[(titanic_df_age_dropna.Age >= 46) & (titanic_df_age_dropna.Age < 61)].copy()
seniors = titanic_df_age_dropna[titanic_df_age_dropna.Age >= 61].copy()

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

group_names = ['Did Not Survive','Survived']

survived_df_children = children.groupby('Survived').size()
survived_df_children.plot(kind='pie', autopct='%.2f', labels=None)

plt.title('Survivors, Younger than 16')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()

survived_df_younger_adults = younger_adults.groupby('Survived').size()
survived_df_younger_adults.plot(kind='pie', autopct='%.2f', labels=None)

plt.title('Survivors, Ages 16 - 30')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()

survived_df_adults = adults.groupby('Survived').size()
survived_df_adults.plot(kind='pie', autopct='%.2f', labels=None)

plt.title('Survivors, Ages 31 - 45')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()

survived_df_older_adults = older_adults.groupby('Survived').size()
survived_df_older_adults.plot(kind='pie', autopct='%.2f', labels=None)

plt.title('Survivors, Ages 46 - 60')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()

survived_df_seniors = seniors.groupby('Survived').size()
survived_df_seniors.plot(kind='pie', autopct='%.2f', labels=None)

plt.title('Survivors, Ages 61 and Older')
plt.axis('equal')
plt.ylabel('')
plt.legend(labels=group_names, loc="best")
plt.show()

These results were fascinating. More than half, at 59 percent, of children younger than 16 survived the disaster.

Surprisingly, only 23 percent of passengers aged 61 and over survived.

In this sample, then, children were more likely to survive than passengers in any other age bracket, and far more likely than the oldest passengers.
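The five brackets can also be assigned in a single step with pd.cut, which avoids filtering the dataframe five times. This is a sketch on made-up ages and outcomes (sample_df is illustrative, not the Kaggle data), and it assumes ages have already been converted to integers as described above:

```python
import pandas as pd

# Made-up integer ages and outcomes (NOT the Kaggle data), two per bracket.
sample_df = pd.DataFrame({
    'Age':      [4, 15, 16, 30, 31, 45, 46, 60, 61, 80],
    'Survived': [1, 1, 0, 1, 0, 1, 0, 0, 0, 1],
})

# Bin edges are right-inclusive: (-1, 15], (15, 30], (30, 45], (45, 60], (60, 100].
bins = [-1, 15, 30, 45, 60, 100]
labels = ['0-15', '16-30', '31-45', '46-60', '61-100']
sample_df['AgeBracket'] = pd.cut(sample_df['Age'], bins=bins, labels=labels)

# Mean of the 0/1 Survived column per bracket, expressed as a percentage.
rate_by_bracket = sample_df.groupby('AgeBracket', observed=False)['Survived'].mean() * 100
print(rate_by_bracket)
```

One table of rates per bracket is easier to compare at a glance than five separate pie charts, though the charts remain useful for presentation.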

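Correlation

Finally, I wanted to determine if there were any correlations in the data.

I wrote a function, correlation, that computes the Pearson correlation coefficient by standardizing each variable (subtracting its mean and dividing by its standard deviation) and averaging the products.

SciPy's stats module provides the same measure as pearsonr, which I used as a cross-check.

After cleaning the data (dropping records with missing values in the columns being compared), I looked for correlation between Age and Fare, Survived and Age, and Survived and Pclass. None of the calculations yielded a value close to 1 or -1. The strongest was Survived and Pclass, at about -0.34: a moderate negative relationship, meaning that a higher class number (a lower ticket class) went with a lower chance of survival.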

def correlation(x, y):
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)

    return (std_x * std_y).mean()

# Correlation between Age and Fare, using the Pearson Correlation.

from scipy import stats

titanic_data_af_dropna = titanic_df[['Age', 'Fare']].dropna()

age_fare_correlation = stats.pearsonr(titanic_data_af_dropna['Age'], titanic_data_af_dropna['Fare'])

print(age_fare_correlation)

(0.096066691769038898, 0.010216277504446435)

# Correlation between Age and Fare, using the correlation formula.

correlation(titanic_data_af_dropna['Age'], titanic_data_af_dropna['Fare'])

0.09606669176903883

# Correlation between Survived and Age, using the Pearson Correlation.

from scipy import stats

titanic_data_sa_dropna = titanic_df[['Survived', 'Age']].dropna()

age_survival_correlation = stats.pearsonr(titanic_data_sa_dropna['Survived'], titanic_data_sa_dropna['Age'])

print(age_survival_correlation)

(-0.077221094572177656, 0.039124654013483327)

# Correlation between Survived and Age, using the correlation formula.

print(correlation(titanic_data_sa_dropna['Survived'], titanic_data_sa_dropna['Age']))

-0.0772210945721773

# Correlation between Survived and Pclass, using the Pearson Correlation.

titanic_data_sc_dropna = titanic_df[['Survived', 'Pclass']].dropna()

class_survival_correlation = stats.pearsonr(titanic_data_sc_dropna['Survived'], titanic_data_sc_dropna['Pclass'])

print(class_survival_correlation)

(-0.33848103596101536, 2.5370473879804202e-25)

# Correlation between Survived and Pclass, using the correlation formula.

print(correlation(titanic_data_sc_dropna['Survived'], titanic_data_sc_dropna['Pclass']))

-0.33848103596101325
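The close agreement between the hand-written function and pearsonr is expected, since both compute the same quantity. As a small independent check, the sketch below runs the same function on two short made-up series (x and y are my own example values) and compares it with NumPy's corrcoef:

```python
import numpy as np
import pandas as pd

def correlation(x, y):
    # Standardize each variable with the population standard deviation (ddof=0),
    # then average the products of the standardized values.
    std_x = (x - x.mean()) / x.std(ddof=0)
    std_y = (y - y.mean()) / y.std(ddof=0)
    return (std_x * std_y).mean()

# Two short made-up series, chosen so the result is easy to verify by hand.
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 1.0, 4.0, 3.0, 5.0])

r_formula = correlation(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_formula, r_numpy)

# Flipping the sign of one variable flips the sign of the coefficient.
r_flipped = correlation(x, -y)
print(r_flipped)
```

By hand, the deviations from the mean multiply and sum to 8, and each series has a sum of squared deviations of 10, so r = 8 / 10 = 0.8. The ddof=0 setting matters: the standard deviation and the final mean must both divide by n for the result to equal Pearson's r.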

Concluding Remarks

The Titanic disaster has been one of the most studied events in modern history, rife with legends and conjecture about why it occurred and how it might have been prevented.

The dataset from Kaggle proved to deliver a great exercise in wrangling, exploring, and analyzing data. Indeed, it was shown that the rich were more likely to survive -- no surprise -- yet children were more likely to survive as well.

However, because the dataset is incomplete and contains missing values, the analysis delivers limited and inconclusive results.

Output of titanic_df.head():

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Output of titanic_df.describe():

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	NaN	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	NaN	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	NaN	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

Output of the groupby('Sex') mean:

Sex	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
female	431.028662	0.742038	2.159236	27.915709	0.694268	0.649682	44.479818
male	454.147314	0.188908	2.389948	30.726645	0.429809	0.235702	25.523893

Output of the groupby('Embarked') mean:

Embarked	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
C	445.357143	0.553571	1.886905	30.814769	0.386905	0.363095	59.954144
Q	417.896104	0.389610	2.909091	28.089286	0.428571	0.168831	13.276030
S	449.527950	0.336957	2.350932	29.445397	0.571429	0.413043	27.079812
