Thursday, January 20, 2011

1000 genome supplementary note


This post summarizes the supplementary information of the 1000 Genomes paper published in Nature.


2. Samples
YRI (Yoruba in Ibadan, Nigeria), CEU (ancestry from Northern and Western Europe), CHB (Han Chinese in Beijing, China), JPT (Japanese in Tokyo, Japan), LWK (Luhya in Webuye, Kenya), TSI (Toscani in Italia), CHD (Chinese in Metropolitan Denver, CO, USA)


4. Read mapping and generation of BAM files
-quality recalculated -> remap -> merge lanes from the same library (Picard MergeSamFiles) -> remove duplicates (samtools: rmdup for paired-end, rmdupse for single-end) -> merge libraries to the platform level -> remove duplicates (Picard MarkDuplicates)
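The duplicate-removal step in the pipeline above can be illustrated with a toy sketch. This is not how samtools or Picard are implemented (paired-end handling in particular uses more elaborate criteria); it only shows the basic idea for single-end reads: reads that start at the same position on the same strand are treated as copies of one fragment, and one is kept. The `Read` record and the choice of mapping quality as the tie-breaker are assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical minimal read record; real pipelines operate on BAM records.
Read = namedtuple("Read", ["name", "chrom", "pos", "strand", "mapq"])

def remove_duplicates(reads):
    """Keep one read per (chrom, pos, strand), preferring higher mapping quality."""
    best = {}
    for read in reads:
        key = (read.chrom, read.pos, read.strand)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    # Return the surviving reads in coordinate order.
    return sorted(best.values(), key=lambda r: (r.chrom, r.pos, r.strand))

reads = [
    Read("r1", "chr1", 100, "+", 30),
    Read("r2", "chr1", 100, "+", 60),  # same start and strand as r1: duplicate
    Read("r3", "chr1", 100, "-", 20),  # opposite strand: not a duplicate
    Read("r4", "chr1", 250, "+", 40),
]
kept = remove_duplicates(reads)
print([r.name for r in kept])
```

Running the removal twice (once per library, once after merging to the platform level) follows the same logic; only the set of reads being compared changes.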


4.1 Reference genome
-NCBI36, with the revised Cambridge Reference Sequence used in place of the mtDNA sequence. Sex-specific reference (Y chr only for males, pseudoautosomal region masked in Y chr)


4.2 Mapping of Illumina Data
-Maq v0.7 -u -a 1000


4.5 Recalibration of Base Quality Values
-recalibrate base quality after the initial alignment. This algorithm (covariate-aware base quality recalibration) is implemented in the GATK software.
http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration 
-effect of recalibration: the total number of variants called decreased by 2.8%, and the Ti/Tv ratio changed from 1.07 to 1.96 (true variants give a ratio around 2; random errors give 0.5)
Ti/Tv ratio:
http://www.cbs.dtu.dk/staff/dave/roanoke/genetics980415f.htm
http://paup.csit.fsu.edu/paupfaq/paupans.html
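A small sketch of how the Ti/Tv ratio is computed, and why random errors give 0.5: each base has one possible transition (purine to purine or pyrimidine to pyrimidine) and two possible transversions. The function names are made up for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """A<->G and C<->T are transitions; every other substitution is a transversion."""
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES

def ti_tv_ratio(snps):
    """snps is a list of (ref, alt) base pairs."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

# All 12 possible substitutions, each equally likely: 4 transitions, 8 transversions.
all_substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
random_ratio = ti_tv_ratio(all_substitutions)
print(random_ratio)
```

A call set whose ratio moves from near the random value toward 2 is therefore losing mostly false positives.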


4.6 Comparison of Read Data to known HapMap Genotypes
-genotype log likelihood (samtools pileup -g) was used to check that the data matched the expected genotype; if the best genotype did not separate well from the others (1.2 separation), the data were removed.
likelihood:
http://www.aistudy.co.kr/math/likelihood.htm
genotype likelihood : maq paper(http://graphy21.blogspot.com/2011/01/maq.html)
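A rough sketch of the matching idea: sum the per-site log likelihood of each candidate sample's known genotypes, pick the best-scoring sample, and check how clearly it beats the runner-up. The notes above do not say how the 1.2 separation is measured, so treating it as a ratio between the two best scores is purely an assumption here, as are the sample IDs, the data layout, and the function name.

```python
def best_matching_sample(site_loglik, known_genotypes, min_ratio=1.2):
    """
    site_loglik: one dict per site, mapping genotype -> log likelihood of the reads.
    known_genotypes: sample ID -> list of known genotypes, one per site.
    Returns (best sample, whether it is well separated from the runner-up).
    """
    scores = {
        sample: sum(ll[g] for ll, g in zip(site_loglik, genotypes))
        for sample, genotypes in known_genotypes.items()
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    # Scores are log likelihoods (<= 0). ASSUMPTION: "separation" is the ratio
    # runner_up / best, which exceeds 1 when the best sample fits clearly better.
    separated = scores[best] < 0 and scores[runner_up] / scores[best] >= min_ratio
    return best, separated

site_loglik = [
    {"AA": -0.1, "AG": -3.0, "GG": -9.0},
    {"CC": -8.0, "CT": -0.2, "TT": -4.0},
    {"GG": -0.1, "GT": -2.5, "TT": -7.0},
]
known_genotypes = {  # hypothetical sample IDs and genotypes
    "sample_1": ["AA", "CT", "GG"],
    "sample_2": ["AG", "CT", "GG"],
    "sample_3": ["GG", "CC", "TT"],
}
best, separated = best_matching_sample(site_loglik, known_genotypes)
print(best, separated)
```

If two candidate samples score almost equally, `separated` is false and the data would be dropped under this reading of the rule.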


5. SNP calling
-SNPs are called using the genotype likelihood from Maq, GLij(g) = P(Bij,Qij | Gij = g).
P(Gij = g | Bij,Qij) = P(Bij,Qij | Gij = g) P(Gij = g) / Kij, where Kij = Σg P(Bij,Qij | Gij = g) P(Gij = g)
In other words, the genotype likelihood is computed with the formula from the Maq paper, and Bayes' rule then gives the posterior probability, i.e. an estimate of which genotype is most likely given the observed data.
Once SNPs are called, a post-processing step removes false positives and the calls are stored in VCF (variant call format).
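The Bayes step in the formula above can be written out in a few lines. The likelihood and prior values below are invented for illustration; in the real pipeline the likelihoods come from Maq and the priors from the calling model.

```python
def genotype_posteriors(likelihoods, priors):
    """
    likelihoods: genotype -> P(B, Q | G = g)
    priors:      genotype -> P(G = g)
    Returns genotype -> P(G = g | B, Q).
    """
    joint = {g: likelihoods[g] * priors[g] for g in likelihoods}
    k = sum(joint.values())  # the normalizing constant K in the formula
    return {g: p / k for g, p in joint.items()}

# Hypothetical numbers: the reads fit a heterozygote best, but the prior
# strongly favors the homozygous-reference genotype.
likelihoods = {"AA": 1e-6, "AG": 1e-2, "GG": 1e-4}
priors = {"AA": 0.998, "AG": 0.001, "GG": 0.001}

posteriors = genotype_posteriors(likelihoods, priors)
call = max(posteriors, key=posteriors.get)
print(call, posteriors)
```

Here the data outweigh the prior, so the heterozygous genotype ends up with the largest posterior.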
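-post-processing filtering {
--sites with depth far below or above the expected depth (half or twice the mean depth) are removed, since they are probably reads mis-mapped to paralogs because of CNVs
--local realignment around SNP calls, to prevent misalignment caused by indels (the gap open penalty is usually larger than the mismatch penalty)
--remove poor mapping quality; because the reference itself is not perfect, reads from unrepresented regions can be mapped incorrectly (regions that are empirically mis-mapped are described in 6.1). }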
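The depth criterion in the filtering list above is simple enough to sketch. The function name and the example depths are made up; only the half and twice-the-mean bounds come from the notes.

```python
def passes_depth_filter(depth, mean_depth, low=0.5, high=2.0):
    """Keep a site only if its depth is between half and twice the mean depth."""
    return low * mean_depth <= depth <= high * mean_depth

mean_depth = 40  # hypothetical mean depth
site_depths = {"site_a": 38, "site_b": 12, "site_c": 95}
kept_sites = [s for s, d in site_depths.items() if passes_depth_filter(d, mean_depth)]
print(kept_sites)
```

Unusually deep sites such as `site_c` are the ones suspected of collecting reads from paralogous copies.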


5.1 Low-Coverage SNP calling


5.3 Exon project SNP calls 
