Throw a stone at me: PLINK

GWAS에 대한 전체 적인 윤곽을 잡고 세부적으로 필요한 내용을 정리하고자 한다. 접근 방법이 논문으로 하는 것이 일반적이기는 하나 여러 문제로 인하여 좀 더 와닿는 실용적인 방법으로 접근해 보려 한다. 한 방법은 application을 이용해 보면서 용어, 기법들을 숙지하는 것.

PLINK에 대해 알아보자. 아래 내용은 http://pngu.mgh.harvard.edu/~purcell/plink/ 와 그 외의 필요한 자료를 정리한 것으로 한다.

PLINK tutorial (manual chapter 2)

<<Abstract>>
89 Asian HapMap individuals 의 83,534 autosomal SNPs을 가지고 간략한 GWAS 분석(case-control GWAS)을 해본다(tutorial data download 필요).
특정 관심 trait는 인종과 특정 SNP(rs2222162) 그리고 random component의 복합적인 효과로 만들어져있음을 가정한다.
trait와 인종, trait와 rs2222162와의 관계는 아래와 같다 (quantitative trait 는 특정 trait 에 대한 측정 값이 "qt.phe" 파일에 나타나 있고, 이 수치의 median 값인 6.828753를 기준으로 이 값보다 크면 case, 작으면 control로 구분하여 disease trait를 정하였다).

<population-disease table>

<rs2222162-disease table>

<<Input Data>>
PLINK의 input은 기본적으로 PED file과 MAP 파일이 있다.

MAP file : marker (SNP) 에 대한 설명 파일, 아래와 같은 4개의 column으로 구성

chromosome (1-22, X, Y, or 0 if unplaced)
rs# or snp identifier
Genetic distance (morgans)
Base-pair position (bp units)

PED file : 각 individual의 genotype 정보가 들어잇는 파일, 아래 6개 column은 의무적으로 들어가 있으며 그 뒤 genotype 의 column이 들어가게 된다.

Family ID
Individual ID
Paternal ID
Maternal ID
Sex (1= male; 2=female; other=unknown)
Phenotype

위 MAP, PED 파일과 두개의 phenotype file ( 위 table의 quantitative trait와 disease trait를 위한 파일, phe 확장자)이 이번 tutorial의 input이 되겠다.

phe file : phenotype을 나타내는 파일로 아래 3개의 column으로 구성됨

Family ID
Individual ID
trait value

<<Overview>>
이 tutorial 에서는 GWAS 의 아래와 같은 사항에 대해 test 해 본다.

getting started ;; 기본적인 명령어 사용법
making a binary PED file ;; PED 파일을 binary 파일인 BED 변환하여 빠른 PLINK 실행 도모
working with the binary PED file ;; BED 파일 사용법
summary statistics : missing rates ;; PED 파일의 간략한 서머리로 missing 된 SNP나 individual 의
summary statistics : allele frequencies ;;
basic association analysis ;;
genotypic association models ;;
stratification analysis ;;
association analysis, accounting for clusters ;;
quantitative trait association analysis
extracting a SNP of interest ;;

위 사항 중 몇가지만 더 자세하게 알아본다.

<<Detail>>
-summary statistics-
기본적으로 PLINK에서는 분석에 앞서 SNP과 individual의 filtering을 하게 된다. MIND, GENO, MAF 의 3가지 옵션에 의해 filtering 이 되는데 MIND의 경우 individual을 filtering 하는데, GENO와 MAF는 SNP를 filtering 하는데 사용된다(default :: MIND>0.1, GENO > 0.1, MAF < 0.01). MIND는 individual의 missing SNP의 비율을 나타내고, GENO의 경우 한 SNP가 전체 population에서의 missing 된 비율을 나타내며, MAF는 minor allele frequency 를 의미한다. 그렇기 때문에 MIND와 GENO의 옵션의 경우 값이 클수록 loose 한 filtering이 되며 MAF의 경우 값이 작을 수록 loose 하게 된다.

-Basic Association Analysis-

plink --bfile hapmap1 --assoc --out as1 #bfile은 binary PED를 사용하라는 의미이고 --assoc는 association test를 하라는 의미

아래와 같은 결과 파일(as1.assoc) 생성

A1 : code for allele 1(the minor allele)
F_A : case (affected) 에서의 A1 frequency
F_U : control (unaffected) 에서의 A1 frequency
A2 : code for other allele
CHISQ : chi-squared statistic
P : asymptotic significance value
OR : odds ratio

<<command line>>
plink --file hapmap1
plink --file hapmap1 --make-bed --out hapmap1
plink --file hapmap1 --make-bed --mind 0.05 --out highgeno
plink --bfile hapmap1 --missing --out miss_stat
plink --bfile hapmap1 --chr 1 --out res1 --missing
plink --bfile hapmap1 --freq --out freq_stat
plink --bfile hapmap1 --freq --within pop.phe --out freq_stat
plink --bfile hapmap1 --snp rs1891905 --freq --within pop.phe --out snp1_freq_stat
plink --bfile hapmap1 --assoc --out as1
plink --bfile hapmap1 --assoc --adjust --out as2
plink --bfile hapmap1 --pheno pop.phe --assoc --adjust --out as3
plink --bfile hapmap1 --model --snp rs2222162 -out mod1
plink --bfile hapmap1 --model --cell 0 --snp rs2222162 --out mod2
plink --bfile hapmap1 --cluster --mc 2 --ppc 0.05 --out str1
plink --bfile hapmap1 --mh --within str1.cluster2 --adjust --out aac1
plink --bfile hapmap1 --cluster --cc --ppc 0.01 --out version2
plink --bfile hapmap1 --mh --within version2.cluster2 --adjust --out aac2
plink --bfile hapmap1 --cluster --K 2 --out version3
plink --bfile hapmap1 --mh --within pop.phe --adjust --out aac3
plink --bfile hapmap1 --cluster --matrix --out ibd_view
plink --bfile hapmap1 --assoc --pheno qt.phe --out quant1
plink --bfile hapmap1 --assoc --pheno qt.phe --perm --within str1.clulster2 --out quant2
plink --bfile hapmap1 --assoc --pheno qt.phe --mperm 1000 --within str1.cluster2 --out quant3
plink --bfile hapmap1 --pheno qt.phe --gxe --covar pop.phe --snp rs2222162 --out quant3
plink --bfile hapmap1 --snp rs2222162 --recodeAD --out rec_snp1

---------------------------------------------------------------------------------------------------
<<Additional Information>>
-chi-square test (Pearson's chi-squared test)-
chi-square test는 expected frequency와 observed frequency에 차이가 있는 지를 검정하는 test.
observed frequency는 말 그대로 관측값을 의미. expected frequency를 구하는 방법은 두가지가 가능하다. 각 category 별로 균등하게 나누는 방법과 prior knowledge로 나누는 방법. 예를 들면 공대에 100명의 신입이 들어왔을 때 남여 의 expected frequency는 균등하게 50:50으로 예측할 수도 있고 혹은 작년의 기준 (prior knowledge)를 바탕으로 60:40 으로 예측할 수도 있다.
아래의 예로 chi-square test를 해본다(Thai 라는 자동차 딜러가 자동차 제고를 줄이기 위해 자동차 색깔의 판매수에 대해 자신의 예측과 실제 자동차 판매 댓수의 차이를 검정한다).

위의 expected frequency는 20:30:10:10:30 의 prior knowledge(작년 판매수)를 기반으로 했다.
이 두 데이터가 실제로 차이가 있는 것인지 혹은 단순 observed frequency의 random sampling에 의한 것인지 판단하기 위해 chi-square test를 하게 된다. 이때 null hypothesis는 "차이가 없다"이며 alternative hypothesis는 "차이가 있다"가 된다. 아래 식 대로 chi-square statistic 을 구하게 되고 이 값의 chi-square distribution에 위치에 따라 차이가 유의함을 판단하게 된다.

위 공식대로 계산하면 아래와 같은 값을 얻게 된다.

chi-square statistic이 26.95 임을 계산하게 되고 위 분포는 4 df (N-1 degree; 여기서 N은 color의 갯수) 의 chi-square distribution을 따른다고 가정하게 되므로 P=0.05 로 하였을때 chi-statistic, 9.49 보다 크게 되므로 null hypothesis를 reject 하게 된다. 곧 이 차이는 유의하다고 판단하게 된다.

http://www.enviroliteracy.org/pdf/materials/1210.pdf : chi-square 설명
http://acb.qfab.org/acb/ws09/presentations/Day3_JStankovich.pdf : chi-square, odd ratio 설명

Throw a stone at me

Tuesday, April 23, 2013

PLINK

No comments:

Post a Comment