Correlation

Difference between cor() and cor.test()

The cor() function will calculate:

  • a correlation between 2 variables
  • a correlation matrix between more than 2 variables
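
As a minimal sketch, both uses can be tried on the built-in mtcars data set. The columns mpg, wt and hp are chosen only for illustration.

```r
# Correlation between 2 variables: returns a single number
r_single <- cor(mtcars$mpg, mtcars$wt)
r_single

# Correlation matrix between more than 2 variables:
# pass a data frame (or matrix) with numeric columns
r_matrix <- cor(mtcars[, c("mpg", "wt", "hp")])
round(r_matrix, 2)
```

The matrix is symmetric with 1 on the diagonal, since every variable correlates perfectly with itself.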

Among statisticians there are 3 popular opinions about the choice of correlation coefficient (Pearson, Kendall or Spearman):
1. it depends on the type of data:

  • continuous measurements -> Pearson correlation
  • discrete measurements or ranked categories -> Kendall or Spearman correlation

2. it depends on normality:

  • normally distributed, continuous measurements -> Pearson correlation
  • continuous measurements that are not normally distributed, discrete measurements or ranked categories -> Kendall or Spearman correlation

3. it depends on the relation between the 2 variables:

  • linear -> Pearson correlation
  • monotonic -> Kendall or Spearman correlation

See this blog for an explanation on the difference between a linear and a monotonic relation.
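
The third opinion can be illustrated with made-up data (x and y below are hypothetical, not taken from a real data set): y always increases with x, but not along a straight line. The method argument of cor() selects the coefficient.

```r
x <- 1:20
y <- exp(x / 4)  # monotonic but clearly not linear

r_pearson  <- cor(x, y, method = "pearson")
r_spearman <- cor(x, y, method = "spearman")
r_kendall  <- cor(x, y, method = "kendall")

c(pearson = r_pearson, spearman = r_spearman, kendall = r_kendall)
```

Spearman and Kendall only use the order of the values, so they reach 1 for any strictly increasing relation, while Pearson stays below 1 because the points do not lie on a straight line.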

The cor.test() function will only work on 2 variables, not on more than 2 variables. It will calculate:

  • correlation coefficient
  • p-value for the test of whether this correlation is significantly different from 0
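
A short sketch of how these values can be pulled out of the result, again using mtcars only as an example. cor.test() returns a list of class htest, so the coefficient and the p-value are available as the elements estimate and p.value.

```r
res <- cor.test(mtcars$mpg, mtcars$wt, method = "pearson")

res$estimate   # correlation coefficient
res$p.value    # p-value for H0: true correlation is 0
res$conf.int   # 95% confidence interval (only given for Pearson)
```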

Normality is important here because the p-value of the Pearson test relies on it, especially multivariate normality (for 2 variables: bivariate normality):

  • multivariate normality -> Pearson
  • no multivariate normality -> Kendall or Spearman

Warning when you use Kendall or Spearman

You will see this warning when you run a Kendall or Spearman test on data that contain ties (= the same value appearing multiple times in the data set). It tells you that an exact p-value cannot be computed with ties, so the reported p-value is based on an approximation.

cor.test(mtcars$mpg,mtcars$wt,method="kendall")

data: mtcars$mpg and mtcars$wt
z = -5.7981, p-value = 6.706e-09
alternative hypothesis: true tau is not equal to 0
sample estimates:
tau
-0.7278321

Warning message:
In cor.test.default(mtcars$mpg, mtcars$wt, method = "kendall") :
Cannot compute exact p-value with ties
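
As a sketch of what triggers the warning: mtcars$mpg contains repeated values, so the exact p-value is not available. Requesting the approximate p-value explicitly with exact = FALSE should give the same result without the warning.

```r
# Ties are present: some mpg values occur more than once
n_ties <- sum(duplicated(mtcars$mpg))
n_ties

# Request the approximate p-value explicitly; no warning is raised
res_approx <- cor.test(mtcars$mpg, mtcars$wt,
                       method = "kendall", exact = FALSE)
res_approx$p.value
```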

Output in APA style

You can write the correlation matrix to a document in APA style:

library(apaTables)
apa.cor.table(mtcars[1:5],filename="Table1_APA.doc",table.number=1)

This generates a Word document (Table1_APA.doc) in your working directory containing the correlation table.

Add correlation coefficient to the scatter plot

To add the correlation coefficient to a plot use the ggpubr package. First create the scatter plot.

library(ggplot2)
p <- ggplot(mtcars,aes(hp,mpg)) + geom_point()

Then add a regression line and the Pearson correlation coefficient.

library(ggpubr)
p + geom_smooth(method="lm",se=FALSE) + stat_cor(method="pearson")
Plot of hp versus mpg with correlation coefficient, p-value and regression line.

stat_cor() automatically adds the Pearson correlation coefficient and the p-value of cor.test() to the plot.

Linear transformations will not change the correlation coefficient

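Adding or subtracting a constant k, or multiplying or dividing by a positive constant k, will not change the correlation. Multiplying or dividing by a negative constant flips its sign but not its size.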

p2 <- ggplot(mtcars,aes(hp+100,mpg)) + geom_point()
p2 + geom_smooth(method="lm",se=FALSE) + stat_cor(method="pearson")
Same plot as above after linear transformation of hp. Correlation doesn’t change.

p3 <- ggplot(mtcars,aes(hp,mpg*2)) + geom_point()
p3 + geom_smooth(method="lm",se=FALSE) + stat_cor(method="pearson")
Same plot as above after linear transformation of mpg. Correlation doesn’t change.
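
The same point can be checked numerically without plotting. This sketch also includes a negative multiplier (-2 is an arbitrary choice) to show that the sign of the coefficient flips while its size stays the same.

```r
r0 <- cor(mtcars$hp, mtcars$mpg)

r_shift <- cor(mtcars$hp + 100, mtcars$mpg)  # add a constant
r_scale <- cor(mtcars$hp, mtcars$mpg * 2)    # multiply by a positive constant
r_neg   <- cor(mtcars$hp, mtcars$mpg * -2)   # multiply by a negative constant

all.equal(r0, r_shift)  # TRUE
all.equal(r0, r_scale)  # TRUE
all.equal(r0, -r_neg)   # TRUE: same size, opposite sign
```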

Non-linear transformations can improve the correlation coefficient

A non-linear transformation (log, square root…) can strengthen the Pearson correlation, i.e. move it closer to -1 or 1, provided the relation between X and Y is non-linear. Check the scatter plot to see if there is a non-linear relation between X and Y. If the relation looks linear don’t do a non-linear transformation.

p4 <- ggplot(mtcars,aes(log(hp),mpg)) + geom_point()
p4 + geom_smooth(method="lm",se=FALSE) + stat_cor(method="pearson")
Same plot as above after non-linear transformation of hp. Correlation improves.
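
To compare the coefficients directly, the sketch below computes them before and after the log transformation. It also computes Spearman's coefficient, which depends only on ranks and is therefore unaffected by a strictly increasing transformation such as log.

```r
# Pearson correlation before and after transforming hp
r_raw <- cor(mtcars$hp, mtcars$mpg)
r_log <- cor(log(mtcars$hp), mtcars$mpg)
c(raw = r_raw, log = r_log)

# Rank-based coefficients do not change under log()
rho_raw <- cor(mtcars$hp, mtcars$mpg, method = "spearman")
rho_log <- cor(log(mtcars$hp), mtcars$mpg, method = "spearman")
all.equal(rho_raw, rho_log)  # TRUE
```

If Pearson changes noticeably after the transformation while Spearman stays the same, that is a hint that the original relation was monotonic rather than linear.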