Check the data mtcars with head and save a new data as mtcars.subset after dropping two non-numeric (binary) variables for PCA analysis

data <- mtcars
head(data)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
str(data)
## 'data.frame':    32 obs. of  11 variables:
##  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
##  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
##  $ disp: num  160 160 108 258 360 ...
##  $ hp  : num  110 110 93 110 175 105 245 62 95 123 ...
##  $ drat: num  3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
##  $ wt  : num  2.62 2.88 2.32 3.21 3.44 ...
##  $ qsec: num  16.5 17 18.6 19.4 17 ...
##  $ vs  : num  0 0 1 1 0 1 0 1 1 1 ...
##  $ am  : num  1 1 1 0 0 0 0 0 0 0 ...
##  $ gear: num  4 4 4 3 3 3 3 4 4 4 ...
##  $ carb: num  4 4 1 1 2 1 4 2 2 4 ...

In our data vs and am are binary variable so I drop them here.

library(dplyr)
mtcars.subset <- data[,-c(8,9)]

Fit PCA in the data by mtcars.pca matcars.subset with cor = TRUE and scores = TRUE

mtcars.pca<-prcomp(mtcars.subset, cor = TRUE, scores = TRUE)
## Warning: In prcomp.default(mtcars.subset, cor = TRUE, scores = TRUE) :
##  extra arguments 'cor', 'scores' will be disregarded

Get summary of mtcars.pca

summary(mtcars.pca)
## Importance of components:
##                            PC1      PC2     PC3     PC4     PC5     PC6    PC7
## Standard deviation     136.532 38.14735 3.06642 1.27492 0.90474 0.64734 0.3054
## Proportion of Variance   0.927  0.07237 0.00047 0.00008 0.00004 0.00002 0.0000
## Cumulative Proportion    0.927  0.99938 0.99985 0.99993 0.99997 0.99999 1.0000
##                           PC8    PC9
## Standard deviation     0.2859 0.2159
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion  1.0000 1.0000

From above summary we see when standard deviation is greater propertion of variance is high similarly when standard deviation is low propertion of variation is also low.

Get eigenvalue of the components using standard deviation of mtcars.pca and chose the number of components based on Kaiser’s criteria

mtcars.pca$sdev ^2
## [1] 1.864106e+04 1.455220e+03 9.402948e+00 1.625431e+00 8.185525e-01
## [6] 4.190430e-01 9.327903e-02 8.175127e-02 4.660443e-02

Get scree plot and chose the number of components best on “first bend” of this plot

#Calculating total variance explained by each principal component
var_explained<-mtcars.pca$sdev^2/sum(mtcars.pca$sdev^2)
##Creating scree plot 
library(ggplot2)
qplot(c(1:9), var_explained) + geom_line() + xlab("Principal Component") + ylab("Variance explained") + ggtitle("Scree Plot")

How many components must be retained based on Kaiser’s rule and/or scree plot?

Kisher’s rule suggaest us to use 2 components and Scree plot suggest us to retain 4 components for the problem.

Fit the final PCA model based on the retained components

library(psych)
## Warning: package 'psych' was built under R version 4.1.2
mtcars.pca<- psych::principal(mtcars.subset, nfactors = 4, rotate = "none")
mtcars.pca
## Principal Components Analysis
## Call: psych::principal(r = mtcars.subset, nfactors = 4, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
##        PC1   PC2   PC3   PC4   h2     u2 com
## mpg  -0.93  0.04 -0.16  0.00 0.90 0.0995 1.1
## cyl   0.96  0.02 -0.18  0.02 0.95 0.0504 1.1
## disp  0.94 -0.13 -0.06  0.17 0.94 0.0569 1.1
## hp    0.87  0.39 -0.01  0.04 0.91 0.0854 1.4
## drat -0.74  0.49  0.11  0.44 0.99 0.0062 2.5
## wt    0.89 -0.25  0.32  0.10 0.96 0.0360 1.5
## qsec -0.53 -0.70  0.45 -0.02 0.97 0.0283 2.6
## gear -0.50  0.79  0.15 -0.15 0.92 0.0775 1.8
## carb  0.58  0.70  0.33 -0.11 0.95 0.0525 2.5
## 
##                        PC1  PC2  PC3  PC4
## SS loadings           5.66 2.08 0.50 0.27
## Proportion Var        0.63 0.23 0.06 0.03
## Cumulative Var        0.63 0.86 0.92 0.95
## Proportion Explained  0.66 0.24 0.06 0.03
## Cumulative Proportion 0.66 0.91 0.97 1.00
## 
## Mean item complexity =  1.7
## Test of the hypothesis that 4 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.02 
##  with the empirical chi square  0.96  with prob <  0.99 
## 
## Fit based upon off diagonal values = 1

Get the head of the saved loadings of mtcars.pca

head(mtcars.pca)
## $values
## [1] 5.65593947 2.08210029 0.50421482 0.26502753 0.18315864 0.12379319 0.10506192
## [8] 0.05851375 0.02219038
## 
## $rotation
## [1] "none"
## 
## $n.obs
## [1] 32
## 
## $communality
##       mpg       cyl      disp        hp      drat        wt      qsec      gear 
## 0.9004692 0.9495949 0.9430951 0.9146434 0.9938301 0.9639755 0.9716949 0.9224616 
##      carb 
## 0.9475174 
## 
## $loadings
## 
## Loadings:
##      PC1    PC2    PC3    PC4   
## mpg  -0.935        -0.157       
## cyl   0.957        -0.179       
## disp  0.945 -0.128         0.175
## hp    0.873  0.389              
## drat -0.742  0.493  0.106  0.435
## wt    0.888 -0.248  0.322       
## qsec -0.534 -0.698  0.446       
## gear -0.498  0.795  0.147 -0.145
## carb  0.582  0.699  0.330 -0.110
## 
##                  PC1   PC2   PC3   PC4
## SS loadings    5.656 2.082 0.504 0.265
## Proportion Var 0.628 0.231 0.056 0.029
## Cumulative Var 0.628 0.860 0.916 0.945
## 
## $fit
## [1] 0.9982615

Retain two components, get their loadings

mtcars.pca_2 <- principal(mtcars.subset, nfactors = 2, rotate="none")
mtcars.pca_2$loadings
## 
## Loadings:
##      PC1    PC2   
## mpg  -0.935       
## cyl   0.957       
## disp  0.945 -0.128
## hp    0.873  0.389
## drat -0.742  0.493
## wt    0.888 -0.248
## qsec -0.534 -0.698
## gear -0.498  0.795
## carb  0.582  0.699
## 
##                  PC1   PC2
## SS loadings    5.656 2.082
## Proportion Var 0.628 0.231
## Cumulative Var 0.628 0.860

Principal components (PCs) are constructed by the linear combination of the original variables, where PCA loading are the coefficients. Here, cyl has the weights of 0.957 on PC1 computation but not in PC2. Positive loading in above data indicates a variable and a component are positively correlated. Negative loading indicate a negative correlation between the variable and component. Similarly disp has positive loading with PC1 and negative loading with PC2. Large (either positive or negative) loading indicate that a variable has a strong effect on that principal component. The larger value of cyl indicates the strong effect on PC1.

Get biplot of these two component loadings

biplot(mtcars.pca, col = c("blue", "black"), cex = c(0.5, 1.3))

Get the head of the saved scores of mtcars.pca

head(mtcars.pca$scores)
##                           PC1        PC2        PC3        PC4
## Mazda RX4         -0.27929417  0.8132290 -0.2877380 -0.2447853
## Mazda RX4 Wag     -0.26793045  0.6770476  0.1560073 -0.1664252
## Datsun 710        -0.96699807 -0.2263347 -0.2959516 -0.2110014
## Hornet 4 Drive    -0.09052843 -1.3699797 -0.4639869 -0.5984019
## Hornet Sportabout  0.66729435 -0.5743299 -1.4547534  0.2862895
## Valiant            0.02085807 -1.6956141  0.1574155 -1.6929588

The original ddataset is projected into four principal components.

Get the head of the scores of first two components of mtcars.pca

head(mtcars.pca_2$scores)
##                           PC1        PC2
## Mazda RX4         -0.27929417  0.8132290
## Mazda RX4 Wag     -0.26793045  0.6770476
## Datsun 710        -0.96699807 -0.2263347
## Hornet 4 Drive    -0.09052843 -1.3699797
## Hornet Sportabout  0.66729435 -0.5743299
## Valiant            0.02085807 -1.6956141

Get biplot of these two component scores

biplot(mtcars.pca_2, col = c("blue", "red"))

Here we can observe that hp, cyl, disp and wt contribute to PC1 with higher values. And mpg which has negative loadings is in opposite direction to PC1 with higher values. Gear and carb has higher contribution to PC2 with positive values and qsec has negative value.

Get dissimilar distance of all the variables of mtcars data as mtcars.dist

#Distance calculation
mtcars.dist<- dist(mtcars.subset)

Fit classical multi-dimensional scaling model with the mtcars.dist in 2-dimensional state as cars.mds.2d

# Fitting classical two- dimensional scaling model
cars.mds.2d<-cmdscale(mtcars.dist)
summary(cars.mds.2d)
##        V1                V2          
##  Min.   :-181.07   Min.   :-139.047  
##  1st Qu.:-116.69   1st Qu.: -10.373  
##  Median : -43.99   Median :   2.144  
##  Mean   :   0.00   Mean   :   0.000  
##  3rd Qu.: 132.85   3rd Qu.:  29.375  
##  Max.   : 242.81   Max.   :  52.503

Plot the cars.mds.2d and compare it with the biplot of mtcars.pca

plot(cars.mds.2d, pch = 19)
abline(h = 0, v = 0, lty =2)
mtcars.subset<-mtcars[, 1:2] %>% scale
text(cars.mds.2d, pos = 4, labels = rownames(mtcars.subset), col = "tomato")

Hornet 4 Drive, Pontiac Fire bird etc lies on the positive orthant which means they have positive contribution to the first and second components. However, Lotus Europa and Ferari has opposite but highest weight component 1 and component 2.

Fit classical multi-dimensional scaling model with the mtcars.dist in 3-dimensional state as cars.mds.3d

#Fiting multi-dimensional scaling model with mtcars.dist in 3 - dimensional state
cars.mds.3d<-cmdscale(mtcars.dist, k = 3)
summary(cars.mds.3d)
##        V1                V2                 V3         
##  Min.   :-181.07   Min.   :-139.047   Min.   :-6.8611  
##  1st Qu.:-116.69   1st Qu.: -10.373   1st Qu.:-1.8374  
##  Median : -43.99   Median :   2.144   Median : 0.8492  
##  Mean   :   0.00   Mean   :   0.000   Mean   : 0.0000  
##  3rd Qu.: 132.85   3rd Qu.:  29.375   3rd Qu.: 2.2806  
##  Max.   : 242.81   Max.   :  52.503   Max.   : 5.0029

Create a 3-d scatterplot of cars.mds.3d with type = “h”, pch=20 and lty.hplot=2 and interpret it carefully

library(scatterplot3d)
cars.mds.3d <- data.frame(cmdscale(mtcars.dist, k = 3))
scatterplot3d(cars.mds.3d, type = "h", pch = 19, lty.hplot = 2)

Create a 3-d scatterplot of cars.mds.3d with type = “h”, pch=20, lty.hplot=2 and color=mtcars$cyl

library(scatterplot3d)
cars.mds.3d <- data.frame(cmdscale(mtcars.dist, k = 3))
scatterplot3d(cars.mds.3d, type = "h", pch = 19, lty.hplot = 2, color = mtcars$cyl)

We plotted the principal components in 3- dimensional scatter plot which is distinguished by color cyl. We can use higher dimensions by changing the k argument in the cmdscale() function to a higher value for eg. k = 3 for 3 dimensuon.

Write a summary comparing PCA and MDS fits done above for mtcars data

The input to PCA is the original vectors in n-dimensional space. Similarly, input to MDS is the pairwise distances between points. PCA behaves as an algorithm but MDS is a visualization technique for any factor analysis. MDS applies PCA for the dimensionality reduction. For the mtcars, it shows that two or more but less than or equal to 5 latent features can be generated from the given dataset.

Comments