Calculating Correlations

One way to explore the associations between two or more variables in a dataset is to calculate the correlation between them. In this tutorial, I will explain how to examine correlations in R and how to interpret the results.

For this tutorial we will use one of R's built-in datasets, swiss. To load the dataset into the workspace, run the following R command:

>data(swiss)

To view the variables in the dataset, as well as its help file, we run the following commands one by one:

>?swiss                        # opens the help file

>names(swiss)             # lists the variables in the dataset

There are six variables in the swiss dataset:

[1] "Fertility"   "Agriculture"   "Examination"  "Education"   "Catholic"

[6] "Infant.Mortality"

First of all, we calculate the correlations between all the variables. A matrix of correlation coefficients is generated with the following R function:

>cor(swiss)

                  Fertility Agriculture Examination   Education   Catholic Infant.Mortality
Fertility         1.0000000  0.35307918  -0.6458827 -0.66378886  0.4636847       0.41655603
Agriculture       0.3530792  1.00000000  -0.6865422 -0.63952252  0.4010951      -0.06085861
Examination      -0.6458827 -0.68654221   1.0000000  0.69841530 -0.5727418      -0.11402160
Education        -0.6637889 -0.63952252   0.6984153  1.00000000 -0.1538589      -0.09932185
Catholic          0.4636847  0.40109505  -0.5727418 -0.15385892  1.0000000       0.17549591
Infant.Mortality  0.4165560 -0.06085861  -0.1140216 -0.09932185  0.1754959       1.00000000

This output is hard to read with so many decimal places, so we round the coefficients to two decimal places using the following function:

>round(cor(swiss), 2)  # and what we get is:

                Fertility Agriculture Examination Education Catholic Infant.Mortality
Fertility            1.00        0.35       -0.65     -0.66     0.46       0.42
Agriculture          0.35        1.00       -0.69     -0.64     0.40      -0.06
Examination         -0.65       -0.69        1.00      0.70    -0.57      -0.11
Education           -0.66       -0.64        0.70      1.00    -0.15      -0.10
Catholic             0.46        0.40       -0.57     -0.15     1.00       0.18
Infant.Mortality     0.42       -0.06       -0.11     -0.10     0.18       1.00

The table above shows the correlation coefficients between all six variables side by side. Some interpretations of the results are as follows:

  1. Fertility is strongly negatively associated with Education, with a correlation of -0.66
  2. Similarly, Agriculture has a correlation of -0.69 with Examination
  3. There is a moderate positive correlation (0.46) between Fertility and the Catholic indicator of the respondents
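Before running any formal tests, one quick way to eyeball all of these pairwise relationships at once is a scatterplot matrix, produced with base R's pairs() function:

>pairs(swiss)             # scatterplot matrix; each panel plots one pair of variables

Strongly correlated pairs, such as Examination and Education, show up as clear linear trends in their panels.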

Note, however, that cor() is not a hypothesis test. To run a hypothesis test on one pair of variables at a time, with a confidence interval included in the output, we use the following function in R:

>cor.test(swiss$Fertility, swiss$Education)               # this will return the following results:

Pearson's product-moment correlation
data:  swiss$Fertility and swiss$Education
t = -5.9536, df = 45, p-value = 3.659e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7987075 -0.4653206
sample estimates:
 cor 
-0.6637889

The output above shows that the observed correlation between these two variables is -0.66 with p-value < 0.001; the 95% confidence interval (-0.80, -0.47) does not include zero, so the correlation is significantly different from zero.
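If you want to use these numbers programmatically rather than read them off the console, cor.test() returns an "htest" object whose components can be extracted directly:

>ct <- cor.test(swiss$Fertility, swiss$Education)
>ct$estimate              # the correlation coefficient, -0.6637889
>ct$p.value               # 3.659e-07
>ct$conf.int              # the 95% confidence interval, -0.7987075 to -0.4653206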

But if you want the full correlation matrix with the corresponding p-values between variables, you need to install the external package "Hmisc" and all of its dependencies (if any) using the following command (if it is not already installed on your computer):

>install.packages("Hmisc", dependencies=TRUE)

We now load it into our workspace:

>require("Hmisc")

Once the package is attached to our working environment, we use its rcorr() function to get the correlation matrix with the corresponding p-values:

>rcorr(as.matrix(swiss))

Running the above function produces two matrices: one of correlation coefficients and one of p-values:

                 Fertility Agriculture Examination Education Catholic Infant.Mortality
Fertility             1.00        0.35       -0.65     -0.66     0.46             0.42
Agriculture           0.35        1.00       -0.69     -0.64     0.40            -0.06
Examination          -0.65       -0.69        1.00      0.70    -0.57            -0.11
Education            -0.66       -0.64        0.70      1.00    -0.15            -0.10
Catholic              0.46        0.40       -0.57     -0.15     1.00             0.18
Infant.Mortality      0.42       -0.06       -0.11     -0.10     0.18             1.00

n= 47

P
                 Fertility Agriculture Examination Education Catholic Infant.Mortality
Fertility                  0.0149      0.0000      0.0000    0.0010   0.0036
Agriculture      0.0149                0.0000      0.0000    0.0052   0.6845
Examination      0.0000    0.0000                  0.0000    0.0000   0.4454
Education        0.0000    0.0000      0.0000                0.3018   0.5065
Catholic         0.0010    0.0052      0.0000      0.3018             0.2380
Infant.Mortality 0.0036    0.6845      0.4454      0.5065    0.2380

We can see that most of the correlations have significant p-values, with a few exceptions such as Infant.Mortality and Agriculture (p = 0.6845); this makes sense, as the coefficient between these two variables is very close to zero (-0.06).
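The printed output of rcorr() can also be captured as an object; per the Hmisc documentation, the result is a list with components r (coefficients), n (number of observations), and P (p-values), each of which can be indexed like an ordinary matrix:

>res <- rcorr(as.matrix(swiss))
>res$r                                  # matrix of correlation coefficients
>res$P                                  # matrix of p-values (NA on the diagonal)
>res$r["Fertility", "Education"]        # -0.6637889, matching the cor.test() result above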

I hope this tutorial was easy to follow and helpful for performing basic correlation tests in R on a dataset with two or more variables.
