Study linear algebra first (you need to know it at least up to SVD).
PCA
http://sites.stat.psu.edu/~ajw13/stat505/fa06/16_princomp/
FA
http://sites.stat.psu.edu/~ajw13/stat505/fa06/17_factor/
PCA vs FA
1. by program
http://stats.stackexchange.com/questions/102882/steps-done-in-factor-analysis-compared-to-steps-done-in-pca/102999#102999
2. by plot
http://stats.stackexchange.com/questions/95038/how-does-factor-analysis-explain-the-covariance-and-pca-explains-the-variance/95106#95106
3. by theory
http://stats.stackexchange.com/questions/94048/pca-and-exploratory-factor-analysis-on-the-same-data-set/94104#94104
more
http://stats.stackexchange.com/questions/50745/best-factor-extraction-methods-with-reference-to-spss
Monday, July 14, 2014
Sunday, April 13, 2014
graphical meaning of the correlation in linear regression
First, guess: which plot has the higher correlation?
par(mfrow = c(2, 1))
x1 = rnorm(1000) + seq(0.01, 10, 0.01)
y1 = rnorm(1000, 0, 2) + seq(0.01, 10, 0.01)
plot(x1, y1, xlim = c(-2, 12), ylim = c(-2, 12))
abline(lm(y1 ~ x1))
abline(0, 1, col = "pink")
x2 = rnorm(1000) + seq(0.01, 10, 0.01)
y2 = seq(-1, 0.998, 0.002) + seq(0.001, 1, 0.001)
plot(x2, y2, xlim = c(-2, 12), ylim = c(-2, 12), col = "red")
abline(lm(y2 ~ x2))
abline(0, 1, col = "pink")
The answer is the second plot!
par(mfrow = c(2, 1))
plot(x1, y1, main = paste("correlation = ", signif(cor(x1, y1))), xlim = c(-2,
12), ylim = c(-2, 12))
abline(lm(y1 ~ x1))
plot(x2, y2, main = paste("correlation = ", signif(cor(x2, y2))), xlim = c(-2,
12), ylim = c(-2, 12), col = "red")
abline(lm(y2 ~ x2))
Then how about these?
par(mfrow = c(3, 1))
x3 = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y3 = rnorm(1000, 0, 0.01) + seq(0.01, 10, 0.01)
plot(x3, y3, xlim = c(-2, 12), ylim = c(-2, 12), main = paste("correlation =",
cor(x3, y3)))
abline(lm(y3 ~ x3))
x4 = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y4 = rnorm(1000, 0, 0.01) + seq(0.001, 1, 0.001)
plot(x4, y4, xlim = c(-2, 12), ylim = c(-2, 12), col = "red", main = paste("correlation =",
cor(x4, y4)))
abline(lm(y4 ~ x4))
x5 = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y5 = rnorm(1000, 0, 0.01)
plot(x5, y5, xlim = c(-2, 12), ylim = c(-2, 12), col = "blue", main = paste("correlation =",
cor(x5, y5)))
abline(lm(y5 ~ x5))
So why is the correlation in the third plot almost zero?
What does the correlation stand for: the slope, or the density? Maybe density, that is, how tightly the dots concentrate around the linear regression line.
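The density guess can be probed by comparing the residual spread around the fitted line with the correlation. The following is only a sketch: the seed, the helper name `spread_and_cor`, and the regenerated data (built the same way as the three plots above) are assumptions made for illustration.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

# same construction as x3, x4, x5
x = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)
y_steep = rnorm(1000, 0, 0.01) + seq(0.01, 10, 0.01)   # built like y3
y_mild  = rnorm(1000, 0, 0.01) + seq(0.001, 1, 0.001)  # built like y4
y_flat  = rnorm(1000, 0, 0.01)                          # built like y5

# hypothetical helper: sd of the residuals around the fitted line, and cor(x, y)
spread_and_cor = function(x, y) {
  fit = lm(y ~ x)
  c(residual_sd = sd(resid(fit)), correlation = cor(x, y))
}

result = rbind(steep = spread_and_cor(x, y_steep),
               mild  = spread_and_cor(x, y_mild),
               flat  = spread_and_cor(x, y_flat))
print(result)
```

The residual spread is about 0.01 in all three cases, so in raw units the dots are equally dense around each line. What differs is how that spread compares with the overall spread of y, which is large in the first two cases and no larger than the noise in the flat one.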
In linear regression the fitted line is y = b0 + b1 * x, where b1 = cor(x, y) * sd(y) / sd(x).
In other words, the correlation is the slope of lm(z-transformed y ~ z-transformed x).
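The step from the slope formula to the z-transformed statement can be written out. A short derivation, using r for cor(x, y) and s_x, s_y for sd(x), sd(y) (this notation is introduced here for convenience):

```latex
\begin{aligned}
b_1 &= \frac{\operatorname{cov}(x, y)}{s_x^2}
     = \frac{r\, s_x s_y}{s_x^2}
     = r\,\frac{s_y}{s_x}, \\
z_x &= \frac{x - \bar{x}}{s_x}, \quad
z_y  = \frac{y - \bar{y}}{s_y}
     \;\Rightarrow\; s_{z_x} = s_{z_y} = 1,\quad
     \operatorname{cor}(z_x, z_y) = r, \\
b_1^{(z)} &= r\,\frac{s_{z_y}}{s_{z_x}} = r .
\end{aligned}
```

Since both z-transformed variables have mean zero, the intercept of that regression is zero as well.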
Let's see.
par(mfrow = c(2, 1))
plot(x4, y4, main = paste(" Original plot (correlation =", cor(x4, y4), ")"))
abline(lm(y4 ~ x4))
abline(0, 1, col = "pink")
zx4 = (x4 - mean(x4))/sd(x4)
zy4 = (y4 - mean(y4))/sd(y4)
plot(zx4, zy4, main = paste("Z-transformed x, y plot (correlation =", cor(x4, y4), ")"))
abline(lm(zy4 ~ zx4))
abline(0, 1, col = "pink")
That is why the correlation is almost zero for the horizontal line: the slope of the regression on z-transformed x and y is almost zero.
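The same claim can be checked numerically rather than by eye. A minimal sketch, assuming a fixed seed and data regenerated in the same way as x4, y4 and x5, y5 (the helper `z_slope` is a hypothetical name introduced here):

```r
set.seed(42)  # assumed seed, only so the sketch is reproducible

x = seq(-1, 0.998, 0.002) + seq(0.01, 10, 0.01)        # same construction as x4, x5
y_mild = rnorm(1000, 0, 0.01) + seq(0.001, 1, 0.001)   # built like y4
y_flat = rnorm(1000, 0, 0.01)                           # built like y5

# hypothetical helper: slope of lm(z-transformed y ~ z-transformed x)
z_slope = function(x, y) {
  zx = (x - mean(x)) / sd(x)
  zy = (y - mean(y)) / sd(y)
  unname(coef(lm(zy ~ zx))[2])
}

slopes = c(mild = z_slope(x, y_mild), flat = z_slope(x, y_flat))
cors   = c(mild = cor(x, y_mild),     flat = cor(x, y_flat))
print(rbind(z_slope = slopes, correlation = cors))
```

The two rows should agree up to floating-point error, with the flat case close to zero.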