Throw a stone at me

Monday, July 14, 2014

FA (factor analysis) & PCA (principal component analysis)

Basically, study linear algebra first (you need to know it at least up through SVD).

PCA
http://sites.stat.psu.edu/~ajw13/stat505/fa06/16_princomp/

FA
http://sites.stat.psu.edu/~ajw13/stat505/fa06/17_factor/

PCA vs FA
1. by program
http://stats.stackexchange.com/questions/102882/steps-done-in-factor-analysis-compared-to-steps-done-in-pca/102999#102999

2. by plot
http://stats.stackexchange.com/questions/95038/how-does-factor-analysis-explain-the-covariance-and-pca-explains-the-variance/95106#95106

3. by theory
http://stats.stackexchange.com/questions/94048/pca-and-exploratory-factor-analysis-on-the-same-data-set/94104#94104

more
http://stats.stackexchange.com/questions/50745/best-factor-extraction-methods-with-reference-to-spss

Sunday, April 13, 2014

graphical meaning of the correlation in linear regression

First, take a guess: which plot has the higher correlation?

par(mfrow = c(2, 1))
x1 = rnorm(1000) + seq(0.01, 10, 0.01)
y1 = rnorm(1000, 0, 2) + seq(0.01, 10, 0.01)
plot(x1, y1, xlim = c(-2, 12), ylim = c(-2, 12))
abline(lm(y1 ~ x1))
abline(0, 1, col = "pink")

x2 = rnorm(1000) + seq(0.01, 10, 0.01)
y2 = seq(-1, 0.998, 0.002) + seq(0.001, 1, 0.001)
plot(x2, y2, xlim = c(-2, 12), ylim = c(-2, 12), col = "red")
abline(lm(y2 ~ x2))
abline(0, 1, col = "pink")

[Figure: scatter plots of (x1, y1) and (x2, y2), each with its fitted regression line and the pink reference line y = x]

The answer is the second plot!

par(mfrow = c(2, 1))
plot(x1, y1, main = paste("correlation = ", signif(cor(x1, y1))), xlim = c(-2, 
    12), ylim = c(-2, 12))
abline(lm(y1 ~ x1))
plot(x2, y2, main = paste("correlation = ", signif(cor(x2, y2))), xlim = c(-2, 
    12), ylim = c(-2, 12), col = "red")
abline(lm(y2 ~ x2))

[Figure: the same two scatter plots, with each correlation shown in the title]

Then how about these?

par(mfrow = c(3, 1))
x3 = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y3 = rnorm(1000, 0, 0.01) + seq(0.01, 10, 0.01)
plot(x3, y3, xlim = c(-2, 12), ylim = c(-2, 12), main = paste("correlation =", 
    cor(x3, y3)))
abline(lm(y3 ~ x3))

x4 = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y4 = rnorm(1000, 0, 0.01) + seq(0.001, 1, 0.001)
plot(x4, y4, xlim = c(-2, 12), ylim = c(-2, 12), col = "red", main = paste("correlation =", 
    cor(x4, y4)))
abline(lm(y4 ~ x4))

x5 = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y5 = rnorm(1000, 0, 0.01)
plot(x5, y5, xlim = c(-2, 12), ylim = c(-2, 12), col = "blue", main = paste("correlation =", 
    cor(x5, y5)))
abline(lm(y5 ~ x5))

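[Figure: scatter plots of (x3, y3), (x4, y4) and (x5, y5), each with its fitted regression line and its correlation in the title]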

So why is the correlation in the third plot almost zero?

What does the correlation stand for: the slope, or the density? Probably density, meaning how tightly the dots are concentrated around the linear regression line relative to the overall spread of y.
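One way to make the "density" idea concrete, offered as a sketch and not as part of the original analysis: in simple linear regression the squared correlation equals 1 minus the residual sum of squares divided by the total sum of squares of y. The data, the seed, and the helper name `r2_from_residuals` below are made up for illustration.

```r
set.seed(2)  # arbitrary seed, only for reproducibility
x = seq(0.01, 10, 0.01)
y_tight = x + rnorm(1000, 0, 0.5)  # dots close to the line
y_loose = x + rnorm(1000, 0, 3)    # same trend, dots widely scattered

# hypothetical helper: 1 - (residual sum of squares) / (total sum of squares)
r2_from_residuals = function(x, y) {
    fit = lm(y ~ x)
    1 - sum(resid(fit)^2) / sum((y - mean(y))^2)
}

r2_tight = r2_from_residuals(x, y_tight)
r2_loose = r2_from_residuals(x, y_loose)

# each pair should agree: residual-based value vs. squared correlation
print(c(r2_tight, cor(x, y_tight)^2))
print(c(r2_loose, cor(x, y_loose)^2))
```

Both data sets follow the same underlying line, so the difference between the two values comes only from how scattered the dots are around it.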

In linear regression, the fitted line is y = b0 + b1 * x, where b1 = cor(x, y) * sd(y) / sd(x).

In other words, the correlation is the slope of lm(z-transformed y ~ z-transformed x), because after the z-transformation both standard deviations are 1.
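The two statements above can be checked numerically. This is a minimal sketch: the data, the seed, and the variable names are assumptions for illustration, and `scale()` is used for the z-transformation.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x = rnorm(1000) + seq(0.01, 10, 0.01)
y = rnorm(1000, 0, 2) + seq(0.01, 10, 0.01)

# slope estimated by lm() vs. the formula cor(x, y) * sd(y) / sd(x)
b1_lm = unname(coef(lm(y ~ x))[2])
b1_formula = cor(x, y) * sd(y) / sd(x)
print(all.equal(b1_lm, b1_formula))

# slope of the regression on z-transformed variables vs. the correlation
zx = as.numeric(scale(x))
zy = as.numeric(scale(y))
b1_z = unname(coef(lm(zy ~ zx))[2])
print(all.equal(b1_z, cor(x, y)))
```

Both comparisons should agree up to floating-point tolerance, since the identities are exact for ordinary least squares with an intercept.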

Let's see:

par(mfrow = c(2, 1))
plot(x4, y4, main = paste(" Original plot (correlation =", cor(x4, y4), ")"))
abline(lm(y4 ~ x4))
abline(0, 1, col = "pink")

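# z-transform first: arithmetic such as "/" has a special meaning inside a formula
zx4 = (x4 - mean(x4))/sd(x4)
zy4 = (y4 - mean(y4))/sd(y4)
plot(zx4, zy4, main = paste("Z-transformed x, y plot (correlation =", cor(x4, y4), ")"))
# abline() needs the fitted model from lm(), not a bare formula
abline(lm(zy4 ~ zx4))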
abline(0, 1, col = "pink")

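[Figure: original (x4, y4) plot and z-transformed (x4, y4) plot, each with its fitted regression line and the pink reference line y = x]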

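That is why the correlation is zero for a horizontal line: the correlation is the slope after z-transformation, and a flat cloud of dots still has a slope of about zero once both axes are standardized.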
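The same point can be read off the slope formula by rearranging it to cor(x, y) = b1 * sd(x) / sd(y). The sketch below uses made-up data and a hypothetical helper `r_from_slope`. It compares a flat cloud with a gently tilted one that has the same noise.

```r
set.seed(3)  # arbitrary seed, only for reproducibility
x = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y_flat = rnorm(1000, 0, 0.01)            # horizontal: no trend at all
y_tilt = rnorm(1000, 0, 0.01) + 0.1 * x  # same noise plus a gentle slope

# hypothetical helper: recover the correlation from the fitted slope
r_from_slope = function(x, y) {
    b1 = unname(coef(lm(y ~ x))[2])
    b1 * sd(x) / sd(y)
}

cor_flat = cor(x, y_flat)
cor_tilt = cor(x, y_tilt)
print(c(r_from_slope(x, y_flat), cor_flat))
print(c(r_from_slope(x, y_tilt), cor_tilt))
```

In the flat case b1 is close to zero, so the correlation is close to zero no matter how small the noise is. In the tilted case b1 is small too, but sd(y) is small as well, so the ratio stays near 1.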