Load the dataset

First, import the packages required to continue.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Read the dataset using Pandas

Read previously loaded data and store it in a variable named df, display the first few rows with head(), by default head() will return first 5 rows of the dataset, but you can specify any number of rows like head(10).

df = pd.read_csv('train.csv')
df.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Check the Dimension of the Dataset

Let’s check for the basic information about the dataset by running simple commands.

#training_set.shape which returns a number of rows coloumns in dataset
df.shape

(891, 12)

#training_set.columns: It returns column headings.
df.columns

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

#training_set.isnull().sum(): It returns a number of null values in each column.
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Preparing the dataset

Perspectives

The target variable, “Survived,” is what we will forecast after preprocessing our data. Therefore, we keep that column.

Only the “Age,” “Cabin,” and “Embarked” columns lack values.

PassengerId, Name, and Ticket don’t really contribute much to forecasting the target variable.

We can create a new column called “Size of the family” because “ParCh” (Parent/Child) and “SibSp” (Siblings/Spouse) details are related to families.

The category data that needs to be encoded to numerical values are “Sex,” “Cabin,” and “Embarked.”

Columns dropping

We will eliminate the columns with the lowest priority in this step. This category includes columns like “Ticket” and “PassengerId.” To remove the columns, use drop().

#dropping ticket column
df.drop(['Ticket','PassengerId'], 1, inplace=True)

df.head()

	Survived	Pclass	Name	Sex	Age	SibSp	Fare	Cabin	Embarked
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	7.2500	NaN	S
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	71.2833	C85	C
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	7.9250	NaN	S
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	53.1000	C123	S
4	0	3	Allen, Mr. William Henry	male	35.0	0	8.0500	NaN	S

Now, let’s run df.info(), and look at the status of our dataset.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Fare      891 non-null    float64
 8   Cabin     204 non-null    object 
 9   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(4)
memory usage: 69.7+ KB

Which column has missing Value?

‘Cabin’, ‘Embarked’ and ‘Age’ column has missing values.

Creating new classes “Cabin”

Although the Cabin column has 687 missing values, it has a distinctive character that indicates the deck number at the beginning. As a result, we are going to establish a column named Deck to extract this information, which we may utilize in our forecast in the future.

# user defined function 
def assignDeckValue(CabinCode):
    if pd.isnull(CabinCode):
        category = 'Unknown'
    else:
        category = CabinCode[0]
    return category
  
Deck = np.array([assignDeckValue(cabin) for cabin in df['Cabin'].values])

df = df.assign(Deck = Deck)

df.head()

	Survived	Pclass	Name	Sex	Age	SibSp	Fare	Cabin	Embarked	Deck
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	7.2500	NaN	S	Unknown
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	71.2833	C85	C	C
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	7.9250	NaN	S	Unknown
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	53.1000	C123	S	C
4	0	3	Allen, Mr. William Henry	male	35.0	0	8.0500	NaN	S	Unknown

Derive a new column

‘ParCh’ and ‘SibSp’ are the details related to family size, so let’s derive a new column named ‘Size of the Family’.

df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

‘Name’: Instead of dropping right away, from the Name of the Passenger, we need to get only their Title

 #Using expression pattern to extract the Title of the passenger
df['Title'] = df.Name.str.extract(' ([A-Za-z]+)\.', expand=False)

# Changing to common category
df['Title'] = df['Title'].replace(['Dr', 'Rev', 'Col', 'Major', 'Countess', 'Sir', 'Jonkheer', 'Lady', 'Capt', 'Don'], 'Others')
df['Title'] = df['Title'].replace('Ms', 'Miss')
df['Title'] = df['Title'].replace('Mme', 'Mrs')
df['Title'] = df['Title'].replace('Mlle', 'Miss')

df.head()

	Survived	Pclass	Name	Sex	Age	SibSp	Fare	Cabin	Embarked	Deck	FamilySize	Title
0	0	3	Braund, Mr. Owen Harris	male	22.0	1	7.2500	NaN	S	Unknown	2	Mr
1	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	71.2833	C85	C	C	2	Mrs
2	1	3	Heikkinen, Miss. Laina	female	26.0	0	7.9250	NaN	S	Unknown	1	Miss
3	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	53.1000	C123	S	C	2	Mrs
4	0	3	Allen, Mr. William Henry	male	35.0	0	8.0500	NaN	S	Unknown	1	Mr

Now, let’s drop Cabin, Name columns, we have extracted needed information from these two.

df.drop(['Cabin','Name'],1,inplace=True)

This is how our dataset looks like now.

df.head()

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	Deck	FamilySize	Title
0	0	3	male	22.0	1	7.2500	S	Unknown	2	Mr
1	1	1	female	38.0	1	71.2833	C	C	2	Mrs
2	1	3	female	26.0	0	7.9250	S	Unknown	1	Miss
3	1	1	female	35.0	1	53.1000	S	C	2	Mrs
4	0	3	male	35.0	0	8.0500	S	Unknown	1	Mr

Managing missing values

‘Embarked’: For the Embarked column, only two rows are missing their data. Embarked accepts categorical values such as C = Cherbourg, Q = Queenstown, and S = Southampton; in this instance, we may just substitute the most frequently occurring value, which in this example is ‘S,’ for the missing values.

# Returns count for each category
df['Embarked'].value_counts()

# Fills null values with 'S'-most common occurence
common = 'S'
df['Embarked']=df['Embarked'].fillna('S')

# Checking the no of null values now
df['Embarked'].isnull().sum()

df.isnull().sum()

Survived        0
Pclass          0
Sex             0
Age           177
SibSp           0
Parch           0
Fare            0
Embarked        0
Deck            0
FamilySize      0
Title           0
dtype: int64

Age: Using the mean value for each group, we will impute the missing values in the ‘Age’ column. Because there are many age ranges, taking the mean of the entire column can lead to conflicting results.

means = df.groupby('Title')['Age'].mean()

title_list = ['Master','Miss','Mr','Mrs','Others']
def age_missing_replace(means, dframe, title_list):
    for title in title_list:
        temp = dframe['Title'] == title 
        dframe.loc[temp, 'Age'] = dframe.loc[temp, 'Age'].fillna(means[title]) 
        
age_missing_replace(means, df, title_list)

df.head()

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	Deck	FamilySize	Title
0	0	3	male	22.0	1	7.2500	S	Unknown	2	Mr
1	1	1	female	38.0	1	71.2833	C	C	2	Mrs
2	1	3	female	26.0	0	7.9250	S	Unknown	1	Miss
3	1	1	female	35.0	1	53.1000	S	C	2	Mrs
4	0	3	male	35.0	0	8.0500	S	Unknown	1	Mr

Coding categorical characteristics

Without being converted to numerical values, many machine learning algorithms are unable to accommodate categorical variables. Fortunately, there are numerous methods to manage this scenario provided by the sci-kit-learn and pandas python tools.

It’s them,

Finding and Replacing
Labeling
A single hot encode
Individual Binary Encoding
Using Sci-kit Learn’s LabelEncoder

At first, we will only use map to convert the categorical values into numerical data ().

df['Embarked'] = df['Embarked'].map({'C':0, 'Q':1, 'S':2})
df['Sex'] = df['Sex'].map({'male':0, 'female':1})
df['Title'] = df['Title'].map({'Master':0,'Miss':1,'Mr':2,'Mrs':3,'Others':4})

df.head(20)

	Survived	Pclass	Sex	Age	SibSp	Parch	Fare	Embarked	Deck	FamilySize	Title
0	0	3	0	22.000000	1	0	7.2500	2	Unknown	2	2
1	1	1	1	38.000000	1	0	71.2833	0	C	2	3
2	1	3	1	26.000000	0	0	7.9250	2	Unknown	1	1
3	1	1	1	35.000000	1	0	53.1000	2	C	2	3
4	0	3	0	35.000000	0	0	8.0500	2	Unknown	1	2
5	0	3	0	32.368090	0	0	8.4583	1	Unknown	1	2
6	0	1	0	54.000000	0	0	51.8625	2	E	1	2
7	0	3	0	2.000000	3	1	21.0750	2	Unknown	5	0
8	1	3	1	27.000000	0	2	11.1333	2	Unknown	3	3
9	1	2	1	14.000000	1	0	30.0708	0	Unknown	2	3
10	1	3	1	4.000000	1	1	16.7000	2	G	3	1
11	1	1	1	58.000000	0	0	26.5500	2	C	1	1
12	0	3	0	20.000000	0	0	8.0500	2	Unknown	1	2
13	0	3	0	39.000000	1	5	31.2750	2	Unknown	7	2
14	0	3	1	14.000000	0	0	7.8542	2	Unknown	1	1
15	1	2	1	55.000000	0	0	16.0000	2	Unknown	1	3
16	0	3	0	2.000000	4	1	29.1250	1	Unknown	6	0
17	1	2	0	32.368090	0	0	13.0000	2	Unknown	1	2
18	0	3	1	31.000000	1	0	18.0000	2	Unknown	2	3
19	1	3	1	35.788991	0	0	7.2250	0	Unknown	1	3

Manually replacing the categorical value is not the right choice if there are many categories. Let’s do one conversion using LabelEncoder() provided by sklearn.preprocessing library.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
df['Deck'] = le.fit_transform(df['Deck'])

Dataset ready…

df.head()

	Survived	Pclass	Sex	Age	SibSp	Fare	Embarked	Deck	FamilySize	Title
0	0	3	0	22.0	1	7.2500	2	8	2	2
1	1	1	1	38.0	1	71.2833	0	2	2	3
2	1	3	1	26.0	0	7.9250	2	8	1	1
3	1	1	1	35.0	1	53.1000	2	2	2	3
4	0	3	0	35.0	0	8.0500	2	8	1	2

Twitter Facebook LinkedIn

How to do Preprocessing of Dataset Before Applying Machine Learning Algorithms

Load the dataset

Durga Pokharel

Load the dataset

Read the dataset using Pandas

Check the Dimension of the Dataset

Preparing the dataset

Perspectives

Columns dropping

Now, let’s run df.info(), and look at the status of our dataset.

Which column has missing Value?

Creating new classes “Cabin”

Derive a new column

This is how our dataset looks like now.

Managing missing values

Coding categorical characteristics

Dataset ready…

Comments

You May Also Enjoy

How to do Data Science Project

Mathematics Behind Support Vector Machine

Different method of Outliers Detection

Naive Bayes From Scratch