Throw a stone at me: 08/13/13

STRUCTURE is a program that estimates population structure from genotype data at unlinked markers, using a model-based clustering method (http://pritch.bsd.uchicago.edu/structure.html). Markers are unlinked when the recombination frequency between them is 50%, i.e. they lie far apart; from version 2.0 on, weakly linked markers are reportedly handled as well. This post is based primarily on the manual (http://pritch.bsd.uchicago.edu/structure_software/release_versions/v2.3.4/structure_doc.pdf) and, where possible, also tries to cover the paper (http://pritch.bsd.uchicago.edu/publications/structure.pdf).

As an introduction, it is worth running the example from Hong Chang-beom's blog (http://hongiiv.tistory.com/610).

how to format the data files

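The very first line (the one above the underscore line) was added only as an aid to understanding; the STRUCTURE input format proper begins on the line after it.
The rows are described below, starting from row 1.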

Marker Names
Recessive Alleles
Inter-Marker Distances
Phase Information
Individual/Genotype data

Label
PopData
PopFlag
LocData
Phenotype
Extra Columns
Genotype Data
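
To make the row and column layout above more concrete, here is a minimal sketch that writes a toy input file. It assumes the common layout of one marker-name row followed by two consecutive rows per diploid individual, with only the Label, PopData and PopFlag columns before the genotypes; the marker names, labels and allele codes are invented, and -9 is used as the missing-data code. Which optional rows and columns are actually present has to match what the parameter file declares, so treat this as an illustration rather than a reference.

```python
# Hypothetical toy data: 2 diploid individuals, 3 loci. All names and values are invented.
MISSING = -9  # assumed missing-data code; it must match the MISSING setting in use

marker_names = ["loc_a", "loc_b", "loc_c"]
individuals = [
    # (label, popdata, popflag, [(allele copy 1, allele copy 2) per locus])
    ("ind1", 1, 1, [(102, 104), (7, 7), (MISSING, MISSING)]),
    ("ind2", 2, 0, [(102, 102), (7, 9), (55, 57)]),
]


def format_structure(marker_names, individuals):
    """Return the file text: a marker-name row, then two rows per individual."""
    lines = [" ".join(marker_names)]
    for label, popdata, popflag, genotypes in individuals:
        for copy in (0, 1):  # one row per allele copy
            alleles = [str(g[copy]) for g in genotypes]
            lines.append(" ".join([label, str(popdata), str(popflag)] + alleles))
    return "\n".join(lines) + "\n"


text = format_structure(marker_names, individuals)
print(text)
```

The optional rows (recessive alleles, inter-marker distances, phase information) and the LocData, Phenotype and extra columns are left out to keep the sketch short.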

how to choose appropriate models
Ancestry models

No admixture model : individuals are discretely from one population or another. Each individual derives entirely from a single population (this does not mean that there is only one population; it means that each individual's genome comes from exactly one population rather than being a mixture of several). In this case the program reports the posterior probability, P(model|data), that individual i belongs to population k.
Admixture model : each individual draws some fraction of his/her genome from each of the K populations, i.e. an individual's genome is a mixture of several populations.
Linkage model : like the admixture model, but linked loci are more likely to come from the same population.
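
As a rough illustration of the posterior membership probability reported under the no admixture model, the sketch below applies Bayes' rule with a uniform prior over populations. It assumes the allele frequencies in each population are already known, that loci are independent, and that genotypes within a population follow Hardy-Weinberg proportions; STRUCTURE itself estimates the frequencies and memberships jointly by MCMC, so this is a simplification and not the program's algorithm. The frequencies and the genotype are invented.

```python
from math import prod

# Hypothetical frequencies of allele "A" at 3 biallelic loci in K = 2 populations.
freq_A = {
    "pop1": [0.9, 0.8, 0.7],
    "pop2": [0.2, 0.3, 0.4],
}
# One diploid individual: number of "A" copies (0, 1 or 2) at each locus.
genotype = [2, 1, 2]


def genotype_probability(p, n_A):
    """Hardy-Weinberg probability of carrying n_A copies of A when f(A) = p."""
    return {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}[n_A]


def posterior_membership(freq_A, genotype):
    """P(population k | genotype) with a uniform prior and independent loci."""
    likelihood = {
        pop: prod(genotype_probability(p, g) for p, g in zip(freqs, genotype))
        for pop, freqs in freq_A.items()
    }
    total = sum(likelihood.values())
    return {pop: value / total for pop, value in likelihood.items()}


post = posterior_membership(freq_A, genotype)
print(post)
```

Under the admixture model the analogous output would be a vector of ancestry fractions per individual rather than a single membership probability.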
Using prior population information

LOCPRIOR model
USEPOPINFO model

Allele frequency models

Estimating λ
Correlated allele frequencies model

how to interpret the results

how to estimate K (the number of populations)

What follows is background knowledge to help in understanding the paper "Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection".

First, some terminology...

"""We identified a high level of linkage disequilibrium in the soybean genome, suggesting that marker-assisted breeding of soybean will be less challenging than map-based cloning"""

"""We identified a set of 205,614 tag SNPs that may be useful for QTL mapping"""
QTL mapping : http://graphy21.blogspot.kr/2013/08/qtl-mapping.html

"""From this analysis, we identified a total of 6,318,109 SNPs and 186,177 PAVs"""
PAV (presence-absence variation) : describes sequences that are present in one genome but entirely missing in the other genome.

"""We constructed a rooted phylogenetic tree using Lotus japonicus as the outgroup"""
outgroup : a monophyletic group or species used as the reference when determining the evolutionary relationships among three or more monophyletic groups or species (monophyletic means a group made up of one ancestral species and the species descended from it). Among the groups, it is the one that branched off earliest from the most recent common ancestor (= root). An outgroup is used when building a rooted tree: once an unrooted tree has been built (see http://graphy21.blogspot.kr/2013/08/coursera-computational-molecular.html), choosing an outgroup allows it to be transformed into a rooted tree.

transform unrooted tree to rooted tree with outgroup
The content below is based on http://cabbagesofdoom.blogspot.kr/2012/06/how-to-root-phylogenetic-tree.html.

"""Using the Bayesian clustering program STRUCTURE, with K changing progressively from 2-7"""
STRUCTURE : see http://graphy21.blogspot.kr/2013/08/structure-software.html

"""Whole-genome SNP analysis, using the parameter Θπ, also identified a lower level of genetic diversity in cultivated soybeans compared to wild soybeans"""

""" Calculation of the divergence index(Fst) value between wild and cultivated soybeans allowed us to identify genomic regions of large Fst value, which signified areas having a high degree of diversification between wild and cultivated soybeans"""
Because Fst expresses the degree of differentiation between subpopulations within a total population, computing Fst per genomic region for wild versus cultivated soybeans is a way to find the genomic regions that separate the two groups.

Fst(fixation index) : developed as a special case of Wright's F-statistics
Fst = 0 means no divergence between populations (interbreeding freely); Fst = 1 means complete isolation. See F-statistics below.

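Inbreeding coefficient : the degree to which heterozygosity is reduced by non-random mating within a subpopulation. It is written F_IS, though it is sometimes written simply as F (the two must be told apart from context; the F-statistics section below uses F and F_IS as distinct symbols).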

F_IS= (H_S - H_I) / H_S

H_I= mean observed heterozygosity per individual within subpopulations
H_S= mean expected heterozygosity within random mating subpopulations
See F-statistics below.
It ranges from -1 to 1: -1 means every individual is heterozygous, 1 means no individual is heterozygous (0 means the observed and expected numbers of heterozygotes are equal).

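F-statistics : describe the statistically expected level of heterozygosity in a population. F measures how far heterozygosity falls short of the Hardy-Weinberg expectation.
For a genomic locus with two alleles, A and a, F is computed as follows.

F = 1 - O(f(Aa)) / E(f(Aa)) = 1 - observed f(Aa) / 2pq

f(Aa) is the frequency of the heterozygous genotype, and p and q are the frequencies of A and a. In other words, the F-statistic is 1 - observed / expected. A high F means that the observed frequency is small compared with the expected frequency (the gap is large).
For example: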
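
A minimal worked sketch of this calculation, using made-up genotype counts (they are not taken from any dataset): the allele frequencies are estimated from the counts, the Hardy-Weinberg expected heterozygosity 2pq is computed, and F is 1 - observed / expected.

```python
# Made-up genotype counts at one biallelic locus (hypothetical, for illustration only)
counts = {"AA": 50, "Aa": 20, "aa": 30}


def inbreeding_f(counts):
    """F = 1 - observed heterozygosity / Hardy-Weinberg expected heterozygosity."""
    n = sum(counts.values())
    p = (2 * counts["AA"] + counts["Aa"]) / (2 * n)  # frequency of allele A
    q = 1 - p                                        # frequency of allele a
    observed_het = counts["Aa"] / n
    expected_het = 2 * p * q
    return 1 - observed_het / expected_het


F = inbreeding_f(counts)
print(round(F, 3))  # 0.583
```

Here p = 0.6, the expected heterozygote frequency is 0.48 and the observed one is 0.20, so the population has far fewer heterozygotes than Hardy-Weinberg proportions predict.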

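F-statistics are divided into F_IS, F_ST and F_IT, depending on which level of population structure they focus on.

I stands for individual, S for subpopulation and T for total population.
F_IS and F_IT are computed with the F-statistics method above, and F_ST is computed from the relation below.

(1 - F_IS)(1 - F_ST) = 1 - F_IT

F_ST reflects the Wahlund effect, and F_IS reflects inbreeding.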

""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
To help in understanding F_IS, F_ST and F_IT, the following material is used as a reference.
http://www.library.auckland.ac.nz/subject-guides/bio/pdfs/733Pop-g-stats2.pdf

H denotes heterozygosity.

H_I= mean observed heterozygosity per individual within subpopulations
H_S= mean expected heterozygosity within random mating subpopulations

H_T = expected heterozygosity in random mating total population

inbreeding coefficient = F_IS= (H_S - H_I) / H_S

the degree of reduction in H due to non-random mating within subpopulations
the degree of genetic inbreeding within subpopulations
-1 <= F_IS <= 1 (-1: all individuals heterozygous, 1: no heterozygotes)

fixation index = F_ST = (H_T - H_S) / H_T

the degree of reduction in H in the subpopulations relative to the total population (due to genetic drift)
indicates genetic differentiation among subpopulations
0 <= F_ST <= 1 (0: no differentiation, 1: complete differentiation between subpopulations)
The formula above is simply 1 - H_S / H_T, where H_S is the mean expected H of the subpopulations. So Fst = 0 means that the mean expected H of the subpopulations equals the expected H of the total population, i.e. the subpopulations do not differ from one another.
For example, an Fst of 0.39 means that 39% of the total genetic variation is due to differences between subpopulations and 61% is due to variation within subpopulations.

overall fixation index = F_IT= (H_T - H_I) / H_T

the degree of reduction in H of individuals relative to the total population

Looking at the formulas above, one may wonder how a single H_I or H_S can be chosen when there are two or more subpopulations; remember that these values are averages over the subpopulations.
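
The three definitions can be checked numerically. The sketch below uses invented values for two subpopulations of equal size (an allele frequency and an observed heterozygote frequency for each), averages over the subpopulations to get H_I and H_S, and takes H_T from the pooled allele frequency. Equal subpopulation sizes and a single biallelic locus are assumptions made to keep it short.

```python
# Hypothetical data for two equally sized subpopulations:
# (frequency of allele A, observed heterozygote frequency)
subpops = [(0.8, 0.24), (0.4, 0.40)]


def expected_het(p):
    """Expected heterozygosity 2pq under random mating."""
    return 2 * p * (1 - p)


def f_statistics(subpops):
    k = len(subpops)
    h_i = sum(obs for _, obs in subpops) / k            # mean observed H
    h_s = sum(expected_het(p) for p, _ in subpops) / k  # mean expected H within subpops
    p_bar = sum(p for p, _ in subpops) / k              # pooled allele frequency
    h_t = expected_het(p_bar)                           # expected H in total population
    return {
        "F_IS": (h_s - h_i) / h_s,
        "F_ST": (h_t - h_s) / h_t,
        "F_IT": (h_t - h_i) / h_t,
    }


stats = f_statistics(subpops)
print({name: round(value, 3) for name, value in stats.items()})
```

With these numbers H_I = 0.32, H_S = 0.40 and H_T = 0.48, giving F_IS = 0.2, F_ST of about 0.167 and F_IT of about 0.333; the product (1 - F_IS)(1 - F_ST) equals 1 - F_IT, as the definitions imply.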

Wright proposed the following guideline for interpreting Fst by its value:
0.00~0.05 : little genetic differentiation, 0.05~0.15 : moderate, 0.15~0.25 : great, 0.25~ : very great

The above is the Fst calculation based on allele frequencies.
Next, a way to compute Fst from DNA sequence data (there are several; this is one of the simpler and easier to follow).
It was devised by Nei in 1982 and uses nucleotide diversity (π) in place of H (heterozygosity).

πT = Σij pi pj πij

πT is the nucleotide diversity in the total population; pi and pj are the frequencies of haplotype i and haplotype j; and πij is the genetic distance between haplotype i and haplotype j (this can be the simple proportion of differing DNA sites, or a distance based on the substitution rates of the Jukes-Cantor method or the Kimura 2-parameter model).

Once πT and the average nucleotide diversity within subpopulations, πS-bar, have been computed, the Fst-like nucleotide measure of subpopulation differentiation, γst, can be obtained (πB is the average nucleotide diversity between subpopulations).
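
A minimal sketch of this sequence-based calculation follows. It assumes the usual form π = Σij pi pj πij with πij taken as the simple proportion of differing sites, equal-sized subpopulations, no sample-size correction, and γst = (πT - πS-bar) / πT; that last form is an assumption about the missing formula (it treats πB as πT - πS-bar), chosen because it parallels F_ST = (H_T - H_S) / H_T. The haplotype sequences and frequencies are invented.

```python
from itertools import product


def p_distance(a, b):
    """Proportion of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def nucleotide_diversity(freqs):
    """pi = sum over ordered haplotype pairs of p_i * p_j * pi_ij."""
    return sum(freqs[i] * freqs[j] * p_distance(i, j)
               for i, j in product(freqs, repeat=2))


def pooled_frequencies(subpops):
    """Haplotype frequencies in the total population (equal-sized subpopulations)."""
    total = {}
    for freqs in subpops:
        for hap, f in freqs.items():
            total[hap] = total.get(hap, 0.0) + f / len(subpops)
    return total


# Hypothetical haplotypes (10 aligned sites) and their frequencies per subpopulation
h1, h2 = "AAAAAAAAAA", "AAAAAAAATT"
subpops = [{h1: 0.5, h2: 0.5}, {h2: 1.0}]

pi_s_bar = sum(nucleotide_diversity(f) for f in subpops) / len(subpops)
pi_t = nucleotide_diversity(pooled_frequencies(subpops))
gamma_st = (pi_t - pi_s_bar) / pi_t  # assumed form, see the note above
print(round(pi_s_bar, 3), round(pi_t, 3), round(gamma_st, 3))
```

For real data a corrected distance (Jukes-Cantor, Kimura 2-parameter) could replace `p_distance` without changing the rest of the sketch.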

Ways of computing Fst other than the formula above

"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""

Wahlund effect : the reduction in heterozygosity that arises when a population is made up of several subpopulations. For example, if a population contains two subpopulations and each is in HW equilibrium, then unless the two subpopulations have identical frequencies of the same allele, fewer heterozygotes are observed than expected for the population as a whole.
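
The effect is easy to reproduce with numbers. The sketch below assumes two equally sized subpopulations, each in Hardy-Weinberg equilibrium, with invented allele frequencies of 0.9 and 0.1. It compares the heterozygosity actually present (the mean of the within-subpopulation values) with what would be expected if the pooled population were mating at random.

```python
def het(p):
    """Hardy-Weinberg heterozygote frequency 2pq for allele frequency p."""
    return 2 * p * (1 - p)


# Hypothetical allele frequencies in two equally sized subpopulations
freqs = [0.9, 0.1]

mean_within = sum(het(p) for p in freqs) / len(freqs)  # heterozygosity actually present
p_pooled = sum(freqs) / len(freqs)                     # allele frequency after pooling
pooled_expected = het(p_pooled)                        # expected if pooled and random mating
variance = sum((p - p_pooled) ** 2 for p in freqs) / len(freqs)
deficit = pooled_expected - mean_within

print(round(mean_within, 2), round(pooled_expected, 2), round(deficit, 2))
```

The heterozygote deficit (0.32 here) equals twice the variance of the allele frequency among subpopulations, so it vanishes only when the subpopulations share the same allele frequency.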

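Linkage disequilibrium (LD) : suppose there are two loci, A and B, each with two allele types.

Haplotype frequencies:
A1B1 : x11
A1B2 : x12
A2B1 : x21
A2B2 : x22

Allele frequencies:
A1 : p1 = x11 + x12
A2 : p2 = x21 + x22
B1 : q1 = x11 + x21
B2 : q2 = x12 + x22

Given the haplotype frequencies in the first table, the allele frequencies at each locus can be computed as in the second table. From these allele frequencies, the haplotype frequencies can be recomputed under the assumption that the two loci are independent (x11 = p1*q1, and so on).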

The difference between the observed haplotype frequency and the haplotype frequency expected under independence, D = x11 - p1q1, is called linkage disequilibrium (LD); if it is 0 the loci are in 'linkage equilibrium', otherwise in 'linkage disequilibrium'. (To compute D from x12, use p1q2 - x12 = D: since x11 + x12 = p1, x11 = p1 - x12, and therefore x11 - p1q1 = p1 - x12 - p1q1 = p1(1-q1) - x12 = p1q2 - x12 = D.)
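Because D depends on the allele frequencies (the largest value |D| can reach is set by p1, p2, q1 and q2, so raw D values are not comparable between pairs of loci), D' or r (the correlation coefficient) is used instead.

D' = D / Dmax, where Dmax = min(p1q1, p2q2) if D < 0 and Dmax = min(p1q2, p2q1) if D > 0

or

r = D / sqrt(p1 p2 q1 q2)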
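
The three LD measures can be sketched together. The function below takes the four haplotype frequencies, derives the allele frequencies, and returns D, D' and r using the standard definitions D = x11 - p1q1, D' = D / Dmax and r = D / sqrt(p1 p2 q1 q2). The example frequencies are invented, and both loci are assumed to be polymorphic (otherwise r is undefined).

```python
from math import sqrt


def ld_measures(x11, x12, x21, x22):
    """Return (D, D', r) from the four haplotype frequencies."""
    p1, p2 = x11 + x12, x21 + x22  # allele frequencies at locus A
    q1, q2 = x11 + x21, x12 + x22  # allele frequencies at locus B
    d = x11 - p1 * q1
    if d > 0:
        d_max = min(p1 * q2, p2 * q1)
    elif d < 0:
        d_max = min(p1 * q1, p2 * q2)
    else:
        d_max = 0.0
    d_prime = d / d_max if d_max else 0.0
    r = d / sqrt(p1 * p2 * q1 * q2)  # assumes both loci are polymorphic
    return d, d_prime, r


# Hypothetical haplotype frequencies (they sum to 1)
d, d_prime, r = ld_measures(0.5, 0.1, 0.1, 0.3)
print(round(d, 3), round(d_prime, 3), round(r, 3))
```

With these frequencies p1 = q1 = 0.6, so D = 0.5 - 0.36 = 0.14; the same value comes out of p1q2 - x12 = 0.24 - 0.10.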

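The latter part of http://en.wikipedia.org/wiki/Linkage_disequilibrium gives an example of an LD test using HLA (presumably, because the A1 and B8 genes are dominant, a square root is taken when computing gfi). D is computed, then the SE (standard error) of D, and the result is converted to a t-statistic for testing.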

Hardy-Weinberg expectation : the Hardy-Weinberg principle states that, in the absence of any evolutionary pressure or influence, allele and genotype frequencies remain constant. The simplest example is a locus with two alleles (call them A and a): when the frequency of the A allele is f(A) = p and f(a) = q, the expected genotype frequencies are f(AA) = p^2, f(aa) = q^2 and f(Aa) = 2pq. Then, assuming the gametes that form the next generation combine at random, the frequencies of A and a in the next generation are unchanged, and likewise the genotype frequencies.
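
The constancy claim can be checked in a few lines. The sketch below starts from an arbitrary allele frequency (0.7, chosen only for illustration), computes the expected genotype frequencies p^2, 2pq and q^2, and then recovers the allele frequency passed to the next generation by counting A alleles among the gametes. Random mating and the absence of selection, mutation, migration and drift are assumed.

```python
def hw_genotypes(p):
    """Expected genotype frequencies for a biallelic locus with f(A) = p."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


def next_generation_p(genotypes):
    """Frequency of A among gametes: all from AA plus half of those from Aa."""
    return genotypes["AA"] + genotypes["Aa"] / 2


p = 0.7  # illustrative starting frequency of allele A
gen1 = hw_genotypes(p)
p_next = next_generation_p(gen1)
gen2 = hw_genotypes(p_next)

print(gen1)
print(p_next)
```

The allele frequency comes back as 0.7 (up to floating-point rounding), so the genotype frequencies in the second generation match those of the first.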


Tuesday, August 13, 2013

structure software

something about population genomics