## Check the data mtcars with head and save a new data as `mtcars.subset`

after dropping two non-numeric (binary) variables for PCA analysis

```
data <- mtcars
head(data)
```

```
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
```

```
str(data)
```

```
## 'data.frame': 32 obs. of 11 variables:
## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ...
## $ disp: num 160 160 108 258 360 ...
## $ hp : num 110 110 93 110 175 105 245 62 95 123 ...
## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ...
## $ wt : num 2.62 2.88 2.32 3.21 3.44 ...
## $ qsec: num 16.5 17 18.6 19.4 17 ...
## $ vs : num 0 0 1 1 0 1 0 1 1 1 ...
## $ am : num 1 1 1 0 0 0 0 0 0 0 ...
## $ gear: num 4 4 4 3 3 3 3 4 4 4 ...
## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...
```

In our data vs and am are binary variable so I drop them here.

```
library(dplyr)
mtcars.subset <- data[,-c(8,9)]
```

## Fit PCA in the data by `mtcars.pca`

`matcars.subset`

with `cor = TRUE`

and `scores = TRUE`

```
mtcars.pca<-prcomp(mtcars.subset, cor = TRUE, scores = TRUE)
```

```
## Warning: In prcomp.default(mtcars.subset, cor = TRUE, scores = TRUE) :
## extra arguments 'cor', 'scores' will be disregarded
```

## Get summary of `mtcars.pca`

```
summary(mtcars.pca)
```

```
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 136.532 38.14735 3.06642 1.27492 0.90474 0.64734 0.3054
## Proportion of Variance 0.927 0.07237 0.00047 0.00008 0.00004 0.00002 0.0000
## Cumulative Proportion 0.927 0.99938 0.99985 0.99993 0.99997 0.99999 1.0000
## PC8 PC9
## Standard deviation 0.2859 0.2159
## Proportion of Variance 0.0000 0.0000
## Cumulative Proportion 1.0000 1.0000
```

From above summary we see when standard deviation is greater propertion of variance is high similarly when standard deviation is low propertion of variation is also low.

## Get eigenvalue of the components using standard deviation of `mtcars.pca`

and chose the number of components based on Kaiser’s criteria

```
mtcars.pca$sdev ^2
```

```
## [1] 1.864106e+04 1.455220e+03 9.402948e+00 1.625431e+00 8.185525e-01
## [6] 4.190430e-01 9.327903e-02 8.175127e-02 4.660443e-02
```

## Get scree plot and chose the number of components best on “first bend” of this plot

```
#Calculating total variance explained by each principal component
var_explained<-mtcars.pca$sdev^2/sum(mtcars.pca$sdev^2)
##Creating scree plot
library(ggplot2)
qplot(c(1:9), var_explained) + geom_line() + xlab("Principal Component") + ylab("Variance explained") + ggtitle("Scree Plot")
```

## How many components must be retained based on Kaiser’s rule and/or scree plot?

Kisher’s rule suggaest us to use 2 components and Scree plot suggest us to retain 4 components for the problem.

## Fit the final PCA model based on the retained components

```
library(psych)
```

```
## Warning: package 'psych' was built under R version 4.1.2
```

```
mtcars.pca<- psych::principal(mtcars.subset, nfactors = 4, rotate = "none")
mtcars.pca
```

```
## Principal Components Analysis
## Call: psych::principal(r = mtcars.subset, nfactors = 4, rotate = "none")
## Standardized loadings (pattern matrix) based upon correlation matrix
## PC1 PC2 PC3 PC4 h2 u2 com
## mpg -0.93 0.04 -0.16 0.00 0.90 0.0995 1.1
## cyl 0.96 0.02 -0.18 0.02 0.95 0.0504 1.1
## disp 0.94 -0.13 -0.06 0.17 0.94 0.0569 1.1
## hp 0.87 0.39 -0.01 0.04 0.91 0.0854 1.4
## drat -0.74 0.49 0.11 0.44 0.99 0.0062 2.5
## wt 0.89 -0.25 0.32 0.10 0.96 0.0360 1.5
## qsec -0.53 -0.70 0.45 -0.02 0.97 0.0283 2.6
## gear -0.50 0.79 0.15 -0.15 0.92 0.0775 1.8
## carb 0.58 0.70 0.33 -0.11 0.95 0.0525 2.5
##
## PC1 PC2 PC3 PC4
## SS loadings 5.66 2.08 0.50 0.27
## Proportion Var 0.63 0.23 0.06 0.03
## Cumulative Var 0.63 0.86 0.92 0.95
## Proportion Explained 0.66 0.24 0.06 0.03
## Cumulative Proportion 0.66 0.91 0.97 1.00
##
## Mean item complexity = 1.7
## Test of the hypothesis that 4 components are sufficient.
##
## The root mean square of the residuals (RMSR) is 0.02
## with the empirical chi square 0.96 with prob < 0.99
##
## Fit based upon off diagonal values = 1
```

## Get the head of the saved loadings of `mtcars.pca`

```
head(mtcars.pca)
```

```
## $values
## [1] 5.65593947 2.08210029 0.50421482 0.26502753 0.18315864 0.12379319 0.10506192
## [8] 0.05851375 0.02219038
##
## $rotation
## [1] "none"
##
## $n.obs
## [1] 32
##
## $communality
## mpg cyl disp hp drat wt qsec gear
## 0.9004692 0.9495949 0.9430951 0.9146434 0.9938301 0.9639755 0.9716949 0.9224616
## carb
## 0.9475174
##
## $loadings
##
## Loadings:
## PC1 PC2 PC3 PC4
## mpg -0.935 -0.157
## cyl 0.957 -0.179
## disp 0.945 -0.128 0.175
## hp 0.873 0.389
## drat -0.742 0.493 0.106 0.435
## wt 0.888 -0.248 0.322
## qsec -0.534 -0.698 0.446
## gear -0.498 0.795 0.147 -0.145
## carb 0.582 0.699 0.330 -0.110
##
## PC1 PC2 PC3 PC4
## SS loadings 5.656 2.082 0.504 0.265
## Proportion Var 0.628 0.231 0.056 0.029
## Cumulative Var 0.628 0.860 0.916 0.945
##
## $fit
## [1] 0.9982615
```

## Retain two components, get their loadings

```
mtcars.pca_2 <- principal(mtcars.subset, nfactors = 2, rotate="none")
mtcars.pca_2$loadings
```

```
##
## Loadings:
## PC1 PC2
## mpg -0.935
## cyl 0.957
## disp 0.945 -0.128
## hp 0.873 0.389
## drat -0.742 0.493
## wt 0.888 -0.248
## qsec -0.534 -0.698
## gear -0.498 0.795
## carb 0.582 0.699
##
## PC1 PC2
## SS loadings 5.656 2.082
## Proportion Var 0.628 0.231
## Cumulative Var 0.628 0.860
```

Principal components (PCs) are constructed by the linear combination of the original variables, where PCA loading are the coefficients. Here, cyl has the weights of 0.957 on PC1 computation but not in PC2. Positive loading in above data indicates a variable and a component are positively correlated. Negative loading indicate a negative correlation between the variable and component. Similarly disp has positive loading with PC1 and negative loading with PC2. Large (either positive or negative) loading indicate that a variable has a strong effect on that principal component. The larger value of cyl indicates the strong effect on PC1.

## Get biplot of these two component loadings

```
biplot(mtcars.pca, col = c("blue", "black"), cex = c(0.5, 1.3))
```

## Get the head of the saved scores of `mtcars.pca`

```
head(mtcars.pca$scores)
```

```
## PC1 PC2 PC3 PC4
## Mazda RX4 -0.27929417 0.8132290 -0.2877380 -0.2447853
## Mazda RX4 Wag -0.26793045 0.6770476 0.1560073 -0.1664252
## Datsun 710 -0.96699807 -0.2263347 -0.2959516 -0.2110014
## Hornet 4 Drive -0.09052843 -1.3699797 -0.4639869 -0.5984019
## Hornet Sportabout 0.66729435 -0.5743299 -1.4547534 0.2862895
## Valiant 0.02085807 -1.6956141 0.1574155 -1.6929588
```

The original ddataset is projected into four principal components.

## Get the head of the scores of first two components of `mtcars.pca`

```
head(mtcars.pca_2$scores)
```

```
## PC1 PC2
## Mazda RX4 -0.27929417 0.8132290
## Mazda RX4 Wag -0.26793045 0.6770476
## Datsun 710 -0.96699807 -0.2263347
## Hornet 4 Drive -0.09052843 -1.3699797
## Hornet Sportabout 0.66729435 -0.5743299
## Valiant 0.02085807 -1.6956141
```

## Get biplot of these two component scores

```
biplot(mtcars.pca_2, col = c("blue", "red"))
```

Here we can observe that hp, cyl, disp and wt contribute to PC1 with higher values. And mpg which has negative loadings is in opposite direction to PC1 with higher values. Gear and carb has higher contribution to PC2 with positive values and qsec has negative value.

## Get dissimilar distance of all the variables of `mtcars`

data as `mtcars.dist`

```
#Distance calculation
mtcars.dist<- dist(mtcars.subset)
```

## Fit classical multi-dimensional scaling model with the `mtcars.dist`

in 2-dimensional state as `cars.mds.2d`

```
# Fitting classical two- dimensional scaling model
cars.mds.2d<-cmdscale(mtcars.dist)
summary(cars.mds.2d)
```

```
## V1 V2
## Min. :-181.07 Min. :-139.047
## 1st Qu.:-116.69 1st Qu.: -10.373
## Median : -43.99 Median : 2.144
## Mean : 0.00 Mean : 0.000
## 3rd Qu.: 132.85 3rd Qu.: 29.375
## Max. : 242.81 Max. : 52.503
```

## Plot the `cars.mds.2d`

and compare it with the biplot of `mtcars.pca`

```
plot(cars.mds.2d, pch = 19)
abline(h = 0, v = 0, lty =2)
mtcars.subset<-mtcars[, 1:2] %>% scale
text(cars.mds.2d, pos = 4, labels = rownames(mtcars.subset), col = "tomato")
```

Hornet 4 Drive, Pontiac Fire bird etc lies on the positive orthant which means they have positive contribution to the first and second components. However, Lotus Europa and Ferari has opposite but highest weight component 1 and component 2.

## Fit classical multi-dimensional scaling model with the mtcars.dist in 3-dimensional state as `cars.mds.3d`

```
#Fiting multi-dimensional scaling model with mtcars.dist in 3 - dimensional state
cars.mds.3d<-cmdscale(mtcars.dist, k = 3)
summary(cars.mds.3d)
```

```
## V1 V2 V3
## Min. :-181.07 Min. :-139.047 Min. :-6.8611
## 1st Qu.:-116.69 1st Qu.: -10.373 1st Qu.:-1.8374
## Median : -43.99 Median : 2.144 Median : 0.8492
## Mean : 0.00 Mean : 0.000 Mean : 0.0000
## 3rd Qu.: 132.85 3rd Qu.: 29.375 3rd Qu.: 2.2806
## Max. : 242.81 Max. : 52.503 Max. : 5.0029
```

## Create a 3-d scatterplot of `cars.mds.3d`

with `type = “h”`

, `pch=20`

and `lty.hplot=2`

and interpret it carefully

```
library(scatterplot3d)
cars.mds.3d <- data.frame(cmdscale(mtcars.dist, k = 3))
scatterplot3d(cars.mds.3d, type = "h", pch = 19, lty.hplot = 2)
```

## Create a 3-d scatterplot of `cars.mds.3d`

with `type = “h”, pch=20, lty.hplot=2`

and `color=mtcars$cyl`

```
library(scatterplot3d)
cars.mds.3d <- data.frame(cmdscale(mtcars.dist, k = 3))
scatterplot3d(cars.mds.3d, type = "h", pch = 19, lty.hplot = 2, color = mtcars$cyl)
```

We plotted the principal components in 3- dimensional scatter plot which is distinguished by color cyl. We can use higher dimensions by changing the k argument in the cmdscale() function to a higher value for eg. k = 3 for 3 dimensuon.

## Write a summary comparing PCA and MDS fits done above for `mtcars`

data

The input to PCA is the original vectors in n-dimensional space. Similarly, input to MDS is the pairwise distances between points. PCA behaves as an algorithm but MDS is a visualization technique for any factor analysis. MDS applies PCA for the dimensionality reduction. For the mtcars, it shows that two or more but less than or equal to 5 latent features can be generated from the given dataset.

## Comments