Tuesday, January 27, 2015

Estimated Breeding Values (EBV)

EBV (Estimated Breeding Value) 란 상대적 유전값 (relative genetic value)으로 육종(breeding), 즉 인간이 원하는 방향으로 생물을 개량하는 데 객관적인 유전적 지표로 사용되어 진다. 아래 전체적인 EBV를 구하는 방법과 그 적용 예를 들어본다.

Contents


Introduction
Theoretical Background
Statistical Model
Computing and Using EBV's
How to use ebb's to make selection decisions
Multiple-trait model

Introduction

EBV 란 statistical prediction of the relative genetic merit, 곧 개체의 유전적인 유리함이 어떤지에 대한 통계적 예측값이다. 이 값은 육종에 사용된다. 원하는 형질의 EBV가 높은 개체만이 생식하게 나고 그렇지 않은 개체의 생식은 막음으로서 population 이 특정 형질에 특화되는 방향으로 흘러가게 하는 것이다.

Theoretical Background

표현형(phenotype) 은 고기의 지방함량, 사료대비 성장률등의 관측가능한 육체적 형질을 의미한다. 이 phenotype은 아래식과 같이 크게 유전적 요인과 비유전적요인으로 구성되어진다. 유전적 요인은 자손에게 유전이 되며 비유전적 요인은 환경적요인과 그밖에 그 개체에 존재하는 특성들을 의미한다. 비유전적 요인은 크게 fixed effect (sex, year of birth, management condition) 과 그밖에 측정하기 힘든 effect로 나뉜다.

\[ \mbox{Phenotypic observation = Environmental effects + Genetic effects + Residual effects} \]
이를 통계적 모델로 표현하면 아래와 같다.
\[ y_{ij} = \mu_i + g_i + e_{ij} \quad (1)\]

\( y_{ij}\) = i번째 개체의 j 번째 관측치
\( \mu_i \) = i번째 개체의 fixed environmental effect
\( g_i \) = i번째 개체의 유전형의 addictive(\(g_a\)), dominance(\(g_d\)), epistatic(\(g_e\)) genectic value의 총 합
\( e_{ij} \) = i 번째 개체의 j번째 관측에서의 random environmental effect의 총합

여기서 \(g_i\)를 (\(g_a\)), (\(g_d\)), (\(g_e\))로 대체하게 되면
\[\begin{eqnarray} y_{ij} & = & \mu_i + (g_a)_i + (g_d)_i + (g_e)_i + e_{ij} \\ & = & \mu_i + (g_a)_i + e_{ij}^* & (2) \end{eqnarray} \]
\(e_{ij}^* = (g_d)_i + (g_e)_i + e_{ij}\)

위 2번 식이 \(\mu_i\) 를 알고 있다는 가정하 EBV를 계산하는 기본식이다. 그러나 \(\mu_i\) 를 정확하게 알고 있는 것은 거의 불가능하기 때문에 같은 데이터를 가지고 \(\mu_i\) 와 \((g_a)_i\)를 동시에 구해야 한다. 이 두가지 값을 동시에 구하는 방법이 BLUP (Best Linear Unbiased Prediction) 이다.

BLUP은 동시에 fixed effect에 의해 발생하는 차이 조절하고 집단에서의 가계도에 있는 유전적 관계를 설명하는 과정에서 EBV를 구한다.
BLUP을 이용하기 위해서는 형질에 대한 heritability를 알고 있어야 한다.
BLUP을 표현하는 식으로 mixed model equation 이 있다.
\[ Y = X\beta + Zu + e \quad (3)\]
\(Y \) = vector of observed value
\(\beta \) = vector of fixed effect
\(u \) = vector of random genetic effect (2번식에서 \((q_a)i\) 와 같은 의미)
\(e \) = vector of residual error
\(X, Z \) = design matrix

3번식은 여러 단계를 거쳐 아래와 같은 4번식으로 유도가 된다.
\[ \left[\begin{array}{cc} X'X & X'Z \\ Z'X & Z'Z + A^{-1}\alpha \end{array}\right] \left[\begin{array}{cc} \beta \\ u \end{array}\right] = \left[\begin{array}{cc} X'Y \\ Z'Y \end{array}\right] \quad (4) \]
\(A^{-1}\) = \(A\)의 inverse matrix, A는 가계도를 바탕으로 개체간의 유전적 관계를 수학적으로 나타낸 matrix
\( \alpha \) = (1 - \(h^2\)) / \(h^2\), \(h\) 는 heritability
개체간의 유전적 관계에서 breeding value을 추측할 수 있는 BLUP의 핵심은 \(A^{-1}\) 에 있다.

Animal ID	Sire ID	Dam ID	Sex	Contemporary Group	physical trait Test Score
1	Unknown	Unknown	M
2	Unknown	Unknown	F
3	1	Unknown	M
4	1	Unknown	F
5	Unknown	Unknown	M
6	3	2	M	1	2
7	3	2	F	2	4
8	5	4	M	1	3
9	5	4	F	2	5

table 1
9마리의 개체가 있고 개체간의 가계도는 표1에서 와 같다고 하자. \(A\)는 개체간의 유전적 관계를 나타내는 matrix 로 9X9 diagonal matrix 이다. \(A\)를 구하는 방법은 아래와 같다.

먼저 위의 표와 같이 가족 관계를 표를 만드는데 부모가 되는 개체가 자식이 되는 개체보다 앞이 되도록 표를 만든다.

for i to n,
for j to i,

두 부모(s, d)를 모두 아는 경우 \[ \left\{\begin{array}{r1} if \: j < i, a_{ij} = a_{ji} = 0.5(a_{js} + a_{jd}) \\ if \: j = i, a_{ii} = 1 + 0.5a_{sd} \end{array}\right. \]
한쪽 부모(s)만 아는 경우 \[ \left\{\begin{array}{r1}if \: j < i, a_{ij} = a_{ji} = 0.5(a_{js} \\ if \: j = i, a_{ii} = 1 \end{array}\right. \]
양쪽 부모를 다 모르는 경우 \[ \left\{\begin{array}{r1}if \: j < i, a_{ij} = a_{ji} = 0 \\ if \: = i, a_{ii} = 1 \end{array}\right. \]

위 방법을 이용하게 되면 \(A\)는 아래와 같이 구해진다.

table2
위의 값이 클수록 두 개체간의 유전적 관계가 밀접한 것을 의미한다. 개체수가 많아질 경우 A row, column의 수는 기하급수적으로 많아지고 \(A^{-1} \) 의 계산은 더욱 힘들어 진다. 1970 중반에 \(A\) 를 구하지 않고 \(A^{-1} \) 를 바로 구할 수 있는 방법이 고안되면서 수십만 개체의 genetic relationship matrix 의 계산이 가능해졌다.

Statistical Model

BLUP이 fixed effect와 genetic effect를 동시에 고려하기 때문에 타당한 fixed effect 의 선택이 되어야만 개체간의 올바른 유전적 비교가 가능하다. fixed effect가 제대로 선택되지 않으면 fixed effect에 대한 효과가 residual error에 포함되어지고 그만큼 모델의 설명력은 떨어지게 된다. fixed effect에는 sex, 그리고 physical interpretation을 수행한 평가자, 나이, 평가 시기 등이 있다.

Computing and Using EBV's

heritability(h) 를 20%, fixed effect 를 contemporary group이라는 가정하에 예제 표1의 EBV를 계산하면 아래와 같다.
\[ Y = X\beta + Zu + e \]
은
\[ \left[\begin{array}{c} y_6 \\ y_7 \\ y_8 \\ y_9 \end{array}\right] = \left[\begin{array}{c} 2 \\ 4 \\ 3 \\ 5 \end{array}\right] = \left[\begin{array}{cc} 1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1 \end{array}\right] \left[\begin{array}{c} b_1 \\ b_2 \end{array}\right] + \left[\begin{array}{r1} 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \end{array}\right] \left[\begin{array}{c} a_1 \\ a_2 \\ a_3 \\ a_4 \\ a_5 \\ a_6 \\ a_7 \\ a_8 \\ a_9 \end{array}\right] + \left[\begin{array}{c} e_6 \\ e_7 \\ e_8 \\ e_9 \end{array}\right]\]
가 된다.
\[ \left[\begin{array}{cc} X'X & X'Z \\ Z'X & Z'Z + A^{-1}\alpha \end{array}\right] \left[\begin{array}{cc} \beta \\ u \end{array}\right] = \left[\begin{array}{cc} X'Y \\ Z'Y \end{array}\right] \quad (4) \]
을 이용하기 위해서 \(A^{-1}\) 와 \( \alpha \)를 구하게 되면

\(A^{-1}\) 은 아래와 같고

table 3
heritability = 0.2 로 가정하였기 때문에 \( \alpha \) = (1 - 0.2) / 0.2 = 4 가 된다.

4번식에 \(X, Z, A^{-1}, \alpha \) 값을 넣고 계산하게 되면 아래와 같이 fixed effect의 값과 개체의 EBV를 구하게 된다.
\[ \left[\begin{array}{c} \beta \\ u \end{array}\right] = \left[\begin{array}{c} \mbox{Contemporary group 1 average} \\ \mbox{Contemporary group 2 average} \\ EBV 1 \\ EBV 2 \\ EBV 3 \\ EBV 4 \\ EBV 5 \\ EBV 6 \\ EBV 7 \\ EBV 8 \\ EBV 9 \end{array}\right] = \left[\begin{array}{c} 2.5 \\ 4.5 \\ 0 \\ -0.09 \\ -0.07 \\ 0.07 \\ 0.09 \\ -0.13 \\ -0.13 \\ 0.13 \\ 0.13 \end{array}\right] \]

How to use EBV's to Make Selection Decisions

EBV 는 그저 germplasm(= collection of genetic resources) 일 뿐 EBV가 높다고 항상 좋은 것은 아니다. phenotype을 disease susceptibility 로 두었다면 EBV 이 최소값을 같는 개체를 선발하는 것이 옳을 것이다.
위의 예의 physical trait를 사료 대비 성장률이라고 성장률을 높이길 원한다는 가정하면 먼저 성별로 개체를 나누고 EBV를 내림차순으로 정렬한다 (표 4).

Male ID	Male EBV	Female ID	Female EBV
8	0.13	9	0.13
5	0.09	4	0.07
1	0.00	2	-0.09
3	-0.07	7
6	-0.13

table 4
trait 을 더욱 강화하고자 한다는 가정하에서는 male은 8번 개체를 female 개체는 9번 개체를 선택한다.

* 위 내응은 https://www.seeingeye.org/images/HowToUseEBV.pdf 를 바탕으로 함.

GWAS 2

일전에 GWAS (Genome-Wide Association Study) 에 대해 간략하게 포스팅을 한적이 있었는데 이번에 조금 더 자세하게 정리해 보려 한다.

Content


Important Questions in Human Genetics
Concepts Underlying the Study Design
Capturing Common Variation
Genotyping Technologies
Study Design
Association Test
Replication and Meta Analysis

Important Questions in Human Genetics

human genetics 의 핵심적인 목표는 common complex disease의 유전적 위험 요소를 찾는것. 여러 기술과 study design이 있는데 그중 하나가 GWAS.

유전적 위험 요소를 찾는 GWAS의 적용사례 중 하나가 age-related macular degeneration (AMD) 의 원인이 되는 Complement Factor H gene을 찾은 것

GWAS를 통한 personalized medicine 분야의 적용도 또다른 human genetics의 목표라 할 수 있음

GWAS를 통한 항응고제인 wafarin 복용량 결정이 한 예

Concetps Underlying the Study Design

가장 일반적인 SNP을 이용함. SNP 은 일반적으로 MAF (minor allele frequency) 가 높은 single base substitution 을 뜻함
rare disorder 의 경우 특정 소수 혹은 단일의 유전자에 의한 것이 많기 때문에 몇몇의 가계도를 분석하는 linkage analysis를 통해 genetic risk factor 가 잘찾아짐. 하지만 common disease의 경우 linkage analysis 가 효과가 없었음
common disease는 다르게 접근해야 한다는 생각과 common disease의 원인 유전자가 몇몇 common variant (MAF가 높은 SNP) 이라는 발견과 맞물려 common disease의 원인은 common variant 라는 CD/CV (common disease / common variant) 가정이 성립됨. 이를 바탕으로 하는 것이 GWAS

CD/ CV 에 서 분파된 하나의 생각은 common variant가 disease에 영향을 주기 때문에 그 영향의 크기는 상대적으로(rare variant to rare disease에 비해) 적을 것이다(=low penetrance)
common variant가 low penentrance이지만 질병이 유전된기때문에 multiple SNP이 원인이 되는게 아닐까 생각하게 됨.
위 두가지 생각으로 family-based에서 population-based study로 초점이 옮겨가게 됨
SNP의 allele frequency와 그 SNP에 의한 질병의 위험도(effect size)가 중요 요소가 genetic study design의 핵심. 위 그림에서 우측 하단이 GWAS 연구하기 좋은 케이스
The National Human Genome Institute GWAS catalog 에 GWAS 를 통한 disease causal SNP 이 정리 되어 있음

Capturing Common Variation

CD / CV 가설을 제대로 테스트 하기 위해서는 몇가지 체계적 접근이 필요함

Common variant (SNP) 의 파악 필요.

연구해야 할 SNP들이 어떤것이 있는지 또한 genome 상의 위치 의 정보 필요

population 간 SNP의 차이에 대한 이해 필요

population 간에 차이를 고려 하지 않으면 GWAS연구에서 target trait 관련 SNP이 아니라 population 간의 차이가 나는 SNP이 뽑힐 수 있음

SNP간의 correlation 정보 필요

중복된 정보를 제거하기 위해 혹은 효율적인 연구를 위해

위 3가지 목적에서 생긴 것이 The International HapMap Project

1.6 M SNPs
11 human population
LD (linkage disequilbrium) 측정

SNP 이 chromosome의 동일선상에 있을 경우 연관되어 있는 정도를 나타내는 것
african이 generation 이 다른 population에 비해 높기 때문에 LD 영역이 상대적으로 적음
recombination 이 일어나면 두 SNP은 LD가 깨짐, 곧 linkage equilibrium
LD의 지표로 \(D' \) 와 \( r^2\) 를 많이 사용

\(D'\)는 0 ~ 1 사이값을 갖음. 1 이 complete LD를 의미, 0 은 HWE 상태 하에 두 SNP이 독립일 경우.
\[ D' = \left\{\begin{array}{r1} \frac{\pi_{AB}\pi_{ab} - \pi_{Ab}\pi_{aB}}{\mbox{min}(\pi_A\pi_b,\pi_a\pi_B)} & if \pi_{AB}\pi_{ab} - \pi_{Ab}\pi_{aB} > 0 \\ \frac{\pi_{AB}\pi_{ab} - \pi_{Ab}\pi_{aB}}{\mbox{min}(\pi_A\pi_B,\pi_a\pi_b)} & if \pi_{AB}\pi_{ab} - \pi_{Ab}\pi_{aB} < 0 \end{array} \right\} \]
\( r^2\) 은 correlation 식
\[r^2 = \frac{(\pi_{AB}\pi_{ab} - \pi_{Ab}\pi_{aB})^2}{\pi_A\pi_b\pi_a\pi_B} \]
LD 분석을 통해 tagSNP 이 선택되어짐(LD block내 대표 SNP). population마다 LD가 다르기 때문에 tagSNP도 달라짐
이 LD로 인해 GWAS로 뽑힌 결과가 꼭 질병과 direct하게 연관되어 있다는 보장은 없음. 뽑힌 SNP의 LD 관계에 있는 SNP이 질병의 직접적인 원인 SNP 일 수 있음.

Genotyping Technologies

Illumina 와 Affymerix 사의 SNP chip이 많이 사용됨. 요즘은 sequencing으로 많이 함.

Study Design

case control design 과 quantitative design

quantitative design

target trait (혹은 disease)의 지표가 quantitative 함. 곧 양적으로 측정됨
통계적으로 유의한 genetic effect를 파악하기 쉬운 경향이 있음
heart disease의 지표가 될수 있는 HDL, LDL 등의 lipid level이 quantitative trait의 예

case control design

trait가 affected VS unaffected 식으로 이원화 된 경우의 study

Association Test

사실상 GWAS분석이라는 것은 SNP 하나하나를 독립적으로 phenotype과 연관성이 있나 검사하는 것임

quantitative design

GLM(depedent variable를 함수로 변형시켜 linear 하게 만든 것) 또는 ANOVA로 분석

case control design

contingency table method(chi-square test, Fisher's exact test), logistic regression을 사용
logistic regression은 covariates 의 보정을 할수 있어 contingency table 보다 더 많이 쓰임

genotype data를 숫자로 변경하는 "data encoding" 하는 방식 선택도 테스트의 통계적 검정력에 영향이 있음(genotype-based group의 숫자에 따라 자유도 변화등).

allelic VS genotypic

allelic : SNP 의 하나의 allele과 phenotype의 연관성을 봄
genotypic : 한쌍의 allele 을 하나의 값으로 치환함.

dominant, recessive, multiplicative, addictive model이 있음.

sex, age, study site, population substructure등의 covariate 보정필요

population substructure를 고려해야 하는 이유 : 예를 들어 폐암 환자그룹에 특정 population이 많이 있다면 폐암 관련 유전자가 아니라 population specific한 SNP이 뽑힐것임.
STRUCTURE나 ENGENSTART 방법으로 보정 필요. HapMap의 인종별 allele freuncy를 참고하거나 아니면 PCA를 이용.

SNP하나하나에 대한 통계적 분석이기 떄문에 multiple test correction 필요

Bonferroni correction : 모든 SNP이 independent 하다는 가정
FDR (false discover)
permutation testing
genome-wide significance : 특정 population 의 LD의 분포를 바탕으로 독립적인 "effective" genetic region 이 있음을 바탕으로 correction 을 해야하는 test 갯수를 기준으로 p-value threshold를 정함, 유럽인의 경우 7.2e-8

Replication and Meta Analysis

replication : GWAS 결과에서 얻어진 SNP을 검증하는 실험
meta analysis : 여러 study를 합쳐서 분석하는 방법

* 위 내용은 http://www.ncbi.nlm.nih.gov/pubmed/23300413 과 http://bioinformatics.org.au/ws09/presentations/ 의 day3 jstankovich의 pdf를 바탕으로 함

Wednesday, January 21, 2015

Statistical models in human genetics (lesson 1)

Contents


genetic mapping
data for genetic study
3 common question in genetic mapping
common designs for genetic studies

Genetic Mapping
형질의 유전패턴과 chromosomal region의 유전 패턴을 비교하여 "positional cloning" 하는 것
- positional cloning : 어떤 유전자인지를 아는 것과 상관없이 어디에 위치하는지를 찾는 것.

Data for a Genetic Study

pedigree : 관계가 알려진 개체 모임 (예: 가족)
genetic markers : 손쉽게 측정 가능한 genetic variant
phgenotype : mendelian phenotype 혹은 complex phenotype

3 Common Question in Genetic mapping

어떠한 형질에 영향을 주는 유전자가 있느냐? Is a trait genetic? => Epidemiological study

집단과 친척관계에서 형질의 분포를 조사함.
Crohn's disease 의 경우 전체 인구에서의 발병률과 twins of affected individuals 차이가 남

그 유전자는 어디에 있느냐? Where are those genes? => Linkage analysis

Crohn's 이 있는 가계도와 genetic marker를 비교해 보니 "D16S3136" marker가 연관됨을 확인. "D16S3136" marker 의 위치는 알려져 있음.

그 유전자는 무엇이냐? What are those genes? => Association analysis

Crohn's 환자 중 NOD2의 유전자의 mutation 을 갖고 있는 환자가 많음.

Common Designs for Genetic Studies

parametric linkage analysis

특정 모델와 위치를 평가하는 방식(disease loci의 allele frequency, genotype 별 발병률)
Identify variants with relatively large contributions to disease risk

allele-sharing analysis
association analysis

Identify genetic variants with relatively small individual contributions to disease risk
가장 단순한 방법으로는 case와 control의 allele frequency 비교
genome-wide하게 수백만 개의 marker를 가지고 살수도 있고 몇개의 candidate에 초점을 맞춰서 할수도 있음

* 위 내용은 http://www.sph.umich.edu/csg/abecasis/class/666.01.pdf 를 바탕으로 함

Tuesday, January 20, 2015

ipython notebook to pdf

1. ipython nbconvert sample.ipynb --to latex --post PDF
trouble: 에러 발생 (원인 불명)

2. ipython nbconvert sample.ipynb --to latex 후 pdflatex sample.tex
trouble : 한글의 경우 utf-8 인대도 에러 발생

3. brew install Caskroom/cask/wkhtmltopdf 깔고
ipython nbconvert sample.ipynb --to html 후 wkhtmltopdf sample.html sample.pdf
trouble : plot이 page 넘어가는 부분에 있을 경우 plot이 쪼개짐

Friday, January 16, 2015

Model of Evolution

아래 내용은 http://www.evolution-textbook.org/content/free/contents/ch28.html 을 요약한 내용임

Mathematical theory can be used in two distinct ways

진화 생물학을 이해하는데 도움을 주는 수학적 기술은 두가지 방향으로 이용된다.

1. 다양한 진화 과정의 결과를 산출 및 예측하고 진화 과정에서의 핵심 인자를 찾는데 이용된다 (위 그림 A, 점은 gene을, 색깔은 allele을 뜻한다.).

여러 유전자의 영향을 받는 complex trait 가 어떻게 진화할것인가? 혹은 random drift, gene flow, selection이 어떻게 집단을 변화시킬것인가?
수학적 기술은 구체적이고 정확한 양적 예측을 위해 이용되기보다도 위 질문들에 대한 대답을 찾기 위한 이해를 돕는데 이용되어 진다.
수학적 기술을 이용한 "toy models"을 통해 최소한 원리적으로라도 언어로 주장되어 지는 것들의 효과를 명확하게 나타낼 수 있다.

2. 거꾸로 시간을 거슬러 집단 내에서 샘플링된 개체 혹은 유전자의 기원을 찾는데 이용된다 (위 그림 B).

범죄현장의 샘플에서 그 샘플의 인종을 추정한다던지, 유전병의 원인이 되는 변이의 나이를 추정한다던지.

Deterministic processes

A population is described by the proportions of the genotypes it contains
집단이란 여러 타입의 entity의 집합이므로 각 타입의 비율을 나타내는 것이 중요하다. entity는 개체가 될수도 있고 유전자가 될수도 있다. 유전자가 entity일 경우 allele 이 유전자의 다른 타입이 되겠다. 그리고 집단에 대한 모델을 구축할때 variable과 parameter를 구분하는게 중요하다.

variable은 allele frequency와 같은 것
parameter는 selection coefficient, recombination rates, mutation rate와 같은 집단이 어떻게 진화할지슬 결정하는 요인.

Reproducing populations tend to grow exponentially
Gradual evolution can be described by differential equations
populations tend to evolve toward stable equilibria
stability is determined by the leading eigenvalue
stability anlysis extends to several variables

Random processes

Our understanding of random events is relatively recent
A random process is described by its probability distribution
Probability can be thought of in several ways
Probabilities can be assigned to unique events
The Increase of Independently reproducing individuals is a branching process
The sum of many independent event follows a normal distribution
Random walks can be approximated by diffusion

Wednesday, January 14, 2015

SNP chip analysis (from raw data to phylogenetic tree)

아래 과정은 snp chip에서 case-control GWAS study까지의 대략적인 코드이다.

Parse raw file

describe parsing script.

In [1]:

#parse raw file

import csv

handle = open("snp_chips.csv")
reader = csv.reader(handle)
replicate = 4
start_column = -16 * replicate + replicate - 1
samples = reader.next()[start_column : : replicate]
samples = ['snp_id'] + [sample.split('.')[0] for sample in samples]
print 'samples      ::', samples
print 'sample count ::', len(samples) - 1


str_output_ori = '{}\n'.format(','.join(samples))
str_output_val = str_output_ori
str_output_val_filtered = str_output_val
str_output_ori = 'chr,pos,' + str_output_ori
pure_row_count = 0
for sp in reader:
    if sp[0][0] == '#':
        continue
    snps = [sp[1]] + sp[start_column :: replicate]
    str_snps = ','.join(snps)
    str_output_ori += '{0},{1},{2}\n'.format(int(sp[53][-2:]),
                                             sp[56], str_snps)
    str_snps = str_snps.replace("AAAA","0")
    str_snps = str_snps.replace("AAAC","1")
    str_snps = str_snps.replace("AACC","2")
    str_snps = str_snps.replace("ACCC","3")
    str_snps = str_snps.replace("CCCC","4")
    
    str_snps = str_snps.replace("AAAG","1")
    str_snps = str_snps.replace("AAGG","2")
    str_snps = str_snps.replace("AGGG","3")
    str_snps = str_snps.replace("GGGG","4")
    
    str_snps = str_snps.replace("TTTT","0")
    str_snps = str_snps.replace("TTTC","1")
    str_snps = str_snps.replace("TTCC","2")
    str_snps = str_snps.replace("TCCC","3")
    str_snps = str_snps.replace("CCCC","4")
    
    str_snps = str_snps.replace("TTTG","1")
    str_snps = str_snps.replace("TTGG","2")
    str_snps = str_snps.replace("TGGG","3")
    str_snps = str_snps.replace("GGGG","4")
    
    str_snps = str_snps.replace("X","nan")
    str_snps = str_snps.replace("NC","nan")
    str_output_val += '{}\n'.format(str_snps)
    if 'nan' not in str_snps:
        pure_row_count += 1
        str_output_val_filtered += '{}\n'.format(str_snps)

#print 'snps which all samples have value :: ', pure_row_count

handle.close()

handle_w = open('snp_chips.parsed.csv','w')
handle_w.write(str_output_ori)
handle_w.close()
handle_w = open('snp_chips.parsed.value.csv','w')
handle_w.write(str_output_val)
handle_w.close()
handle_w = open('snp_chips.parsed.value.filterd.csv','w')
handle_w.write(str_output_val_filtered)
handle_w.close()

samples      :: ['snp_id', 'sample1', 'sample2', 'sample9', 'sample3', 'sample5', 'sample14', 'sample13', 'sample11', 'sample16', 'sample12', 'sample15', 'sample6', 'sample8', 'sample7', 'sample4', 'sample10']
sample count :: 16

SNP simple stats

In [2]:

#SNP simple stats

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('snp_chips.parsed.value.csv', index_col=0)
## hist duplicated probes
duplicated_snps = []
indexes = list(data.index)
for snp in indexes:
    duplicated_count = indexes.count(snp)
    if duplicated_count > 1:
        duplicated_snps.append(snp)
print "duplicate probes :: ", len(set(duplicated_snps))

data["index"] = data.index
data.drop_duplicates(cols='index', take_last=True, inplace=True) # remove
                                                        # duplicate row
del data["index"]

## plot non-nan count of samples
snp_count_per_sample = data.apply(lambda x: sum(x.notnull()))
fig, ax = plt.subplots(figsize=(15,4))
ax.plot(snp_count_per_sample, marker='o', ls='-', color='blue')
ax.set_xticks(range(0,16))
ax.set_xticklabels(snp_count_per_sample.index, rotation=-45)
ax.set_xlim([-1,16])
ax.grid(True)
ax.set_title('# of SNPs with value per sample')
ax.set_xlabel('sample names')
ax.set_ylabel('# of SNPs with value')
for i in range(len(snp_count_per_sample.index)):
    ax.text(i, snp_count_per_sample[i], snp_count_per_sample[i], 
            ha='center', va='bottom')

duplicate probes ::  455

In [3]:

data = data.loc[data.apply(lambda x: sum(x.isnull()),axis=1) == 0,:]
print 'filtered data ::', data.shape

filtered data :: (2494, 16)

PCA dimension reduction

In [4]:

#PCA dimension reduction

from sklearn.decomposition import PCA
import math
import numpy as np


pca = PCA(n_components=3)
pca.fit(data.T)
tc = pca.transform(data.T)

colors = cm.rainbow(np.linspace(0, 1, len(tc)))
fig, ax = plt.subplots(figsize=(10,10))
for i, sample in enumerate(data.columns):
    ax.plot(tc[i, 0], tc[i, 1], marker='o', ls='*', color=colors[i])
    ax.text(tc[i, 0], tc[i, 1], sample, ha='left', va='bottom', 
            color=colors[i])
ax.set_title('PCA reduction')
ax.set_xlabel('principle component 1')
ax.set_ylabel('principle component 2')

Out[4]:

<matplotlib.text.Text at 0x107666510>

Phylogenetic tree

Calculate genetic distance

Calculate Genetic distance (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1002451)
: Genetic distance (D) between pair-wise combinations of individuals was calculated as PLINK where

\[ D = 1−{(IBS2+0.5 \times IBS1) \over N} \]
IBS2 and IBS1 are the number of loci that share either 2 or 1 alleles identical by state (IBS), respectively, and N is the number of loci tested.
Make distance matrix file (phylip input file)
: The input format for distance data is straightforward. The first line of the input file contains the number of species. There follows species data, starting, as with all other programs, with a species name. The species name is ten characters long, and must be padded out with blanks if shorter. And every cell should be space delimited.

In [7]:

## calculate genetic distance
import math

distance_file = 'distance_matrix.txt'

def calculate_distance(s1, s2, allele_num):
    return 1 - (allele_num - (s1 - s2).apply(math.fabs)).sum() / (allele_num * len(s1))
distance_matrix = data.apply(lambda x1: data.apply(
                        lambda x2: calculate_distance(x1, x2, 4)))
str_output = '{0}\n'.format(len(distance_matrix))
for sample in distance_matrix.index:
    sample_name = sample[:10] if len(sample) > 10 else sample + ' '*(10-len(sample))
    str_output += '{0} {1}\n'.format(sample_name,
        ' '.join(['{:.4f}'.format(v) for v in distance_matrix.loc[sample,]]))

handle = open(distance_file, 'w')
handle.write(str_output)
handle.close()

Convert distance matrix to newick format

Use Phylip (check "Running the programs in background or under control of a command file" in http://evolution.genetics.washington.edu/phylip/doc/main.html).

In [8]:

import os
from IPython.display import Image

if os.path.isfile('outfile'):
    os.remove('outfile')
if os.path.isfile('outtree'):
    os.remove('outtree')
str_output = '{0}\nY\n'.format(distance_file)
handle = open('fitch_input','w')
handle.write(str_output)
handle.close()
os.system('fitch < fitch_input > fitch_result')

Out[8]:

Draw tree

use phylip drawgram, drawtree

In [9]:

## draw tree with python package "ete2"

from ete2 import Tree, TreeStyle

handle = open('outtree')
lines = handle.readlines()
handle.close()
newick = ''.join([line.strip() for line in lines])
t = Tree(newick)
t.show()
#t.render('myTree.png', w=150, h = 150, units='mm')
#Image(filename='myTree.png')

No module named MySQLdb

 MySQLdb module could not be loaded

In [10]:

## draw tree with phylip "drawtree"
## phylip "drawtree" make ps format file. Converting file format is needed to render image.

loc_fontfile = 'font1' # check the location of fontfile
handle = open('drawtree_input', 'w')
handle.write('outtree\n{0}\nY\n'.format(loc_fontfile)) 
handle.close()
commands = []
commands.append('drawtree < drawtree_input > drawtree_result') # use phylip "drawtree"
commands.append('mv plotfile drawtree.ps')
commands.append('convert drawtree.ps drawtree.png') # use "convert"
for command in commands:
    os.system(command)
Image(filename='drawtree.png')

Out[10]:

In [11]:

# draw tree with phylip "fitch" result

handle = open('outfile')
lines = handle.readlines()
handle.close()
start_index_str = 'Negative branch lengths not allowed\n'
end_index_str = 'remember: this is an unrooted tree!\n'
print '<Unrooted Tree>'
print ''.join(lines[lines.index(start_index_str)+1: 
                                    lines.index(end_index_str)])

<Unrooted Tree>


        +--sample16  
      +-7 
      ! +--sample2   
    +-9 
    ! ! +---sample3   
    ! +-3 
  +-2   !  +--sample4   
  ! !   +-13  
  ! !      +---sample5   
  ! ! 
  ! +----sample6   
  ! 
  !    +-----sample7   
  !    !  
  ! +-12     +---sample8   
  ! !  !  +-11  
  ! !  !  !  +----sample9   
  ! !  +-10  
  ! !     !  +----sample10  
  ! !     !  !  
  ! !     +-14   +---sample11  
  4-1        ! +-6 
  ! !        ! ! ! +---sample12  
  ! !        +-5 +-8 
  ! !          !   +----sample13  
  ! !          ! 
  ! !          +----sample14  
  ! ! 
  ! +----sample15  
  ! 
  +----sample1

Significant diff SNPs

use plink (consider all of heterozygosity as 1) up to manhattan plot PLINK case-control association test result("assoc") fields summary

CHR     Chromosome
SNP     SNP ID
BP      Physical position (base-pair)
A1      Minor allele name (based on whole sample)
F_A     Frequency of this allele in cases
F_U     Frequency of this allele in controls
A2      Major allele name
CHISQ   Basic allelic test chi-square (1df)
P       Asymptotic p-value for this test
OR      Estimated odds ratio (for A1, i.e. A2 is reference)

In [12]:

# make PLINK input files
group1 = ['sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6']
group2 = ['sample7', 'sample8', 'sample9', 'sample10', 'sample11', 
          'sample12', 'sample13', 'sample14', 'sample15','sample16']

## make MAP file
data_plink = pd.read_csv('snp_chips.parsed.csv', index_col=2)
data_plink = data_plink.loc[data.index,:]
data_plink['id'] = data_plink.index
data_plink.drop_duplicates(cols='id', take_last=True, inplace=True)
data_map = data_plink[['chr', 'id', 'pos']]
data_map.to_csv('data.map', sep=' ', index=False, header=False)

## make PED file
data_plink = data_plink.drop(['chr', 'pos', 'id'], axis=1)
data_ped = data_plink.apply(lambda x: ' '.join('{0} {1}'.format(
                                        snp[0], snp[-1]) for snp in x))
str_output = ''
for sample in data_ped.index:
    str_output += '{0} {1} {2}\n'.format(sample, 1 if sample in group1
                                         else 2, data_ped[sample])
handle = open('data.ped', 'w')
handle.write(str_output)
handle.close()

# execute plink
plink_command = 'plink --file data --assoc --no-fid --no-parents '\
                                '--allow-no-sex --no-sex --map3 --adjust'
os.system(plink_command)

Out[12]:

In [13]:

# draw Manhattan plot
import matplotlib.cm as cm

plink_map = pd.read_csv('data.map', header=None, sep=' ', index_col=1)
plink_map.columns = ['chr', 'pos']
handle = open('plink.assoc')
lines = handle.readlines()
handle.close()
plink_pvalue = pd.DataFrame([{'id': line.strip().split()[1], 
            'pvalue':line.strip().split()[8]} for line\
            in lines[1:] if line.strip().split()[8] != 'NA' ])
plink_pvalue.index = plink_pvalue['id']
plink_map = plink_map.loc[plink_pvalue.index, :]
result = pd.concat([plink_map, plink_pvalue], join='outer', axis=1)
chr_x_axis = result.pivot_table(cols='chr', aggfunc='max')
chr_x_axis = chr_x_axis.loc['pos',].cumsum()
x_axis = result.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                      if x['chr'] != 0else x['pos'], axis=1)
y_axis = pd.Series([-math.log(float(x), 10) for x in result['pvalue']])

fig, ax = plt.subplots(figsize=(15,4))

colors = cm.rainbow(np.linspace(0, 1, len(chr_x_axis)))
for i, chr in enumerate(chr_x_axis.index):
    rt = result.loc[result['chr'] == chr, :]
    x_axis = rt.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                      if x['chr'] != 0else x['pos'], axis=1)
    y_axis = pd.Series([-math.log(float(x), 10) for x in rt['pvalue']])
    
    ax.plot(x_axis, y_axis, ls='*', marker='o',color=colors[i])
ax.set_xticks([0] + [i for i in chr_x_axis.values])
ax.set_xticklabels(['chr{}'.format(i) for i in chr_x_axis.index],
                   rotation=-45)
ax.grid(True)
ax.set_title('Manhattan plot')
ax.set_xlabel('chromosome (bp)')
ax.set_ylabel('pvalue (-log10)')

Out[13]:

<matplotlib.text.Text at 0x10a23cdd0>

SNPs diversity

ROD

\[ ROD = 1 - \pi_{cul} / \pi_{wild} \] \[ \pi = \Sigma_{ij}x_ix_j\pi_{ij} = 2* \Sigma_{i=1}^n\Sigma_{j=1}^{i-1}x_ix_j\pi_{ij} \]
where \(x_i\) and \(x_j\) are the respective frequencies of the ith and jth sequences, \(\pi_{ij}\) is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample.
- calculation example : http://www.vcu.edu/csbc/lfsc520/week7/Week7LectureNotes.pdf

selective sweep analysis\({}^1\)

We used allele counts at SNP positions to identify signatures of selection in sliding 40-kb windows, for pools of sequence data. For each pool and SNP, we determined the numbers of reads corresponding to the most and least abundant allele (\(n_{MAJ}\) and \(n_{MIN}\)). For each window in each breed pool, we calculated a pooled heterozygosity score: \(H_p = 2\Sigma n_{MAJ}\Sigma n_{MIN}/ (\Sigma n_{MAJ} + \Sigma n_{MIN}){}^2\), where SnMAJ and SnMIN are the sums of \(n_{MAJ}\) and, respectively, \(n_{MIN}\) for all SNPs in the window. Individual \(H_p\) values were then Z-transformed as follows: \(ZH_p = (H_p - \mu H_p)/ \sigma H_p\). \({}_{1. http://www.nature.com/nature/journal/v464/n7288/abs/nature08832.html}\)

In []: