Sunday, April 13, 2014

graphical meaning of the correlation in linear regression


First, guess: which plot has the higher correlation?

par(mfrow = c(2, 1))
x1 = rnorm(1000) + seq(0.01, 10, 0.01)
y1 = rnorm(1000, 0, 2) + seq(0.01, 10, 0.01)
plot(x1, y1, xlim = c(-2, 12), ylim = c(-2, 12))
abline(lm(y1 ~ x1))
abline(0, 1, col = "pink")

x2 = rnorm(1000) + seq(0.01, 10, 0.01)
y2 = seq(-1, 0.998, 0.002) + seq(0.001, 1, 0.001)
plot(x2, y2, xlim = c(-2, 12), ylim = c(-2, 12), col = "red")
abline(lm(y2 ~ x2))
abline(0, 1, col = "pink")

plot of chunk unnamed-chunk-1

The answer is the second plot!

par(mfrow = c(2, 1))
plot(x1, y1, main = paste("correlation = ", signif(cor(x1, y1))), xlim = c(-2, 
    12), ylim = c(-2, 12))
abline(lm(y1 ~ x1))
plot(x2, y2, main = paste("correlation = ", signif(cor(x2, y2))), xlim = c(-2, 
    12), ylim = c(-2, 12), col = "red")
abline(lm(y2 ~ x2))

plot of chunk unnamed-chunk-2

Then, how about this?

par(mfrow = c(3, 1))
x3 = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y3 = rnorm(1000, 0, 0.01) + seq(0.01, 10, 0.01)
plot(x3, y3, xlim = c(-2, 12), ylim = c(-2, 12), main = paste("correlation =", 
    cor(x3, y3)))
abline(lm(y3 ~ x3))

x4 = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y4 = rnorm(1000, 0, 0.01) + seq(0.001, 1, 0.001)
plot(x4, y4, xlim = c(-2, 12), ylim = c(-2, 12), col = "red", main = paste("correlation =", 
    cor(x4, y4)))
abline(lm(y4 ~ x4))

x5 = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y5 = rnorm(1000, 0, 0.01)
plot(x5, y5, xlim = c(-2, 12), ylim = c(-2, 12), col = "blue", main = paste("correlation =", 
    cor(x5, y5)))
abline(lm(y5 ~ x5))

plot of chunk unnamed-chunk-3

Then why is the correlation of the third plot almost zero?

What does the correlation stand for: the slope, or the density? Maybe the density, meaning how tightly the dots concentrate around the linear regression line.
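One way to probe the "density" idea is to compare the squared correlation with the share of the variance of y that the fitted line leaves unexplained. The sketch below is only an illustration on simulated data similar to the third and second panels above; the names `y_flat`, `y_tilt`, and `unexplained`, and the seed, are assumptions made for this example.

```r
# Simulated data (illustrative): same x, same noise level, different trend
set.seed(3)
x = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y_flat = rnorm(1000, 0, 0.01)                         # no trend at all
y_tilt = rnorm(1000, 0, 0.01) + seq(0.001, 1, 0.001)  # slight upward trend

# Share of the variance of y that remains after fitting the line
unexplained = function(x, y) var(resid(lm(y ~ x))) / var(y)

r2_flat = cor(x, y_flat)^2
r2_tilt = cor(x, y_tilt)^2

# In simple linear regression, cor(x, y)^2 equals 1 - unexplained share
print(c(r2_flat, 1 - unexplained(x, y_flat)))
print(c(r2_tilt, 1 - unexplained(x, y_tilt)))
```

Both series sit equally close to their fitted lines in absolute terms, so the absolute scatter alone does not decide the correlation. What matters is the scatter relative to the total spread of y: in the flat case almost all of the spread of y is noise, so the squared correlation is close to zero.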

In linear regression, the fitted line is y = b0 + b1 * x, where b1 = cor(x, y) * sd(y)/sd(x).
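The slope formula can be checked numerically. This is a minimal sketch on simulated data; the names `fit`, `b1_lm`, `b1_formula`, `b0_lm`, and `b0_formula`, and the seed, are illustrative choices rather than part of the plots above.

```r
# Simulated data (illustrative): a linear trend plus noise
set.seed(1)
x = rnorm(1000) + seq(0.01, 10, 0.01)
y = rnorm(1000, 0, 2) + seq(0.01, 10, 0.01)

fit = lm(y ~ x)
b1_lm = unname(coef(fit)[2])             # slope estimated by lm()
b1_formula = cor(x, y) * sd(y) / sd(x)   # slope from the formula

# The intercept follows from the means: b0 = mean(y) - b1 * mean(x)
b0_lm = unname(coef(fit)[1])
b0_formula = mean(y) - b1_formula * mean(x)

print(c(b1_lm, b1_formula))
print(all.equal(b1_lm, b1_formula))
```

The two slopes should agree up to floating-point error, since the least-squares slope is cov(x, y)/var(x), which is the same quantity written differently.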

In other words, the correlation is the slope of lm(z-transformed y ~ z-transformed x), because both z-transformed variables have a standard deviation of 1.
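A quick numerical check of this statement, as a sketch on simulated data: the names `zx`, `zy`, `zfit`, `z_slope`, and `z_intercept`, and the seed, are assumptions made for this example.

```r
# Simulated data (illustrative): a small slope with very little noise
set.seed(2)
x = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y = rnorm(1000, 0, 0.01) + seq(0.001, 1, 0.001)

# z-transform: subtract the mean, divide by the standard deviation
zx = (x - mean(x)) / sd(x)
zy = (y - mean(y)) / sd(y)

zfit = lm(zy ~ zx)
z_slope = unname(coef(zfit)[2])
z_intercept = unname(coef(zfit)[1])

# The slope on the z-scale should match the correlation of the raw data
print(c(slope = z_slope, correlation = cor(x, y)))
```

The intercept of the z-transformed fit is zero up to rounding, because both variables are centered at zero.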

Let's see.

par(mfrow = c(2, 1))
plot(x4, y4, main = paste(" Original plot (correlation =", cor(x4, y4), ")"))
abline(lm(y4 ~ x4))
abline(0, 1, col = "pink")

zx4 = (x4 - mean(x4))/sd(x4)  # z-transformed x
zy4 = (y4 - mean(y4))/sd(y4)  # z-transformed y
plot(zx4, zy4, main = paste("Z-transformed x, y plot (correlation=", 
    cor(x4, y4), ")"))
# abline() needs a fitted model, not a bare formula
abline(lm(zy4 ~ zx4))
abline(0, 1, col = "pink")

plot of chunk unnamed-chunk-4

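That is why the correlation is almost zero when the dots lie along a horizontal line: after the z-transformation, the slope of the regression line, which equals the correlation, is almost zero.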