Wednesday, January 21, 2015

Statistical models in human genetics (lesson 1)

Contents 
  • genetic mapping
  • data for genetic study
  • 3 common questions in genetic mapping
  • common designs for genetic studies



Genetic Mapping
Comparing the inheritance pattern of a trait with the inheritance pattern of chromosomal regions, i.e. "positional cloning"
- positional cloning : finding where a gene is located, regardless of knowing which gene it is.


Data for a Genetic Study
  • pedigree : a set of individuals with known relationships (e.g., a family)
  • genetic markers : genetic variants that can be measured easily
  • phenotype : a Mendelian phenotype or a complex phenotype

3 Common Questions in Genetic Mapping
  • Is a trait genetic? => Epidemiological study
      • Examine the distribution of the trait in the population and among relatives.
      • For Crohn's disease, the prevalence in the general population differs from the prevalence among twins of affected individuals.
  • Where are those genes? => Linkage analysis
      • Comparing pedigrees with Crohn's disease against genetic markers showed linkage to the "D16S3136" marker, whose position is known.
  • What are those genes? => Association analysis
      • Many Crohn's patients carry a mutation in the NOD2 gene.

    Common Designs for Genetic Studies
    • parametric linkage analysis
      • Evaluates a specific model and location (allele frequencies at the disease loci, penetrance for each genotype)
      • Identify variants with relatively large contributions to disease risk
    • allele-sharing analysis
    • association analysis
      • Identify genetic variants with relatively small individual contributions to disease risk
      • The simplest approach compares allele frequencies between cases and controls (see the sketch after this list)
      • Can be run genome-wide with millions of markers or focused on a few candidate genes
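
    As a concrete illustration of the simplest association test above, here is a minimal sketch of the allelic chi-square test comparing allele counts between cases and controls; the counts are made up for illustration.

    # minimal sketch of an allelic case-control association test (hypothetical counts)
    from scipy.stats import chi2_contingency

    # rows: cases, controls; columns: counts of allele A1 and allele A2
    table = [[120,  80],   # cases
             [ 90, 110]]   # controls
    chi2, pvalue, dof, expected = chi2_contingency(table, correction=False)
    print('chi2 = {:.2f}, p = {:.4f}'.format(chi2, pvalue))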




    * The above is based on http://www.sph.umich.edu/csg/abecasis/class/666.01.pdf

    Tuesday, January 20, 2015

    ipython notebook to pdf

    1. ipython nbconvert sample.ipynb --to latex --post PDF
    trouble: an error occurs (cause unknown)

    2. ipython nbconvert sample.ipynb --to latex, then pdflatex sample.tex
    trouble: for Korean text, an error occurs even though the file is utf-8

    3. brew install Caskroom/cask/wkhtmltopdf, then
    ipython nbconvert sample.ipynb --to html, then wkhtmltopdf sample.html sample.pdf
    trouble: if a plot falls on a page break, the plot gets split

    Friday, January 16, 2015

    Model of Evolution

    The following is a summary of http://www.evolution-textbook.org/content/free/contents/ch28.html

    Mathematical theory can be used in two distinct ways

    Mathematical descriptions that help us understand evolutionary biology are used in two ways.



    1. To compute and predict the outcomes of various evolutionary processes and to identify the key factors in those processes (Figure A above; dots represent genes, colors represent alleles).
    • How will a complex trait influenced by many genes evolve? How will random drift, gene flow, and selection change a population?
    • Mathematical descriptions are used here less for concrete, precise quantitative predictions than to aid the understanding needed to answer such questions.
    • "Toy models" built with mathematics make explicit, at least in principle, the effects of arguments that are otherwise only stated verbally.

    2. To go backward in time and trace the origin of individuals or genes sampled from a population (Figure B above).
    • For example, estimating the ancestry of a sample from a crime scene, or estimating the age of a mutation that causes a genetic disease.

    Deterministic processes

    A population is described by the proportions of the genotypes it contains
    Since a population is a collection of entities of several types, representing the proportion of each type is important. The entity can be an individual or a gene; when the entity is a gene, alleles are its different types. When building a model of a population, it is important to distinguish variables from parameters.

    • variables are quantities such as allele frequencies
    • parameters are the factors that determine how the population evolves, such as selection coefficients, recombination rates, and mutation rates (see the sketch below)
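
    As a minimal sketch of this deterministic view (allele frequency as the variable, a selection coefficient as the parameter), the one-locus haploid selection recurrence below iterates frequency change over generations; the parameter values are arbitrary.

    # minimal sketch: deterministic change of an allele frequency p (a variable)
    # driven by a selection coefficient s (a parameter)
    def next_freq(p, s):
        # one generation of haploid selection: p' = p(1+s) / (1 + s*p)
        return p * (1 + s) / (1 + s * p)

    p, s = 0.01, 0.05   # arbitrary starting frequency and selection coefficient
    for generation in range(200):
        p = next_freq(p, s)
    print('allele frequency after 200 generations: {:.3f}'.format(p))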


    Reproducing populations tend to grow exponentially
    Gradual evolution can be described by differential equations
    Populations tend to evolve toward stable equilibria
    Stability is determined by the leading eigenvalue
    Stability analysis extends to several variables

    Random processes

    Our understanding of random events is relatively recent
    A random process is described by its probability distribution
    Probability can be thought of in several ways
    Probabilities can be assigned to unique events
    The increase of independently reproducing individuals is a branching process
    The sum of many independent events follows a normal distribution
    Random walks can be approximated by diffusion
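
    To make these headings concrete, here is a minimal sketch of genetic drift as a Wright-Fisher binomial random walk in allele frequency; the population size and starting frequency are arbitrary, and this is the kind of process the diffusion approximation describes.

    # minimal sketch: allele frequency as a random walk (Wright-Fisher drift)
    import numpy as np

    np.random.seed(0)
    N, p, generations = 100, 0.5, 200   # arbitrary diploid population size and start
    trajectory = [p]
    for _ in range(generations):
        # each generation, 2N allele copies are drawn binomially at the current frequency
        p = np.random.binomial(2 * N, p) / float(2 * N)
        trajectory.append(p)
    print('final allele frequency: {:.3f}'.format(trajectory[-1]))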

    Wednesday, January 14, 2015

    SNP chip analysis (from raw data to phylogenetic tree)


    The steps below are rough code going from SNP chip raw data to a case-control GWAS study.


    Parse raw file

    The parsing script below keeps one column per sample from the raw SNP chip CSV (each sample spans four replicate columns), encodes each four-letter genotype call as a numeric value, and writes the original, numeric, and fully-called (no missing value) tables.
    In [1]:
    #parse raw file
    
    import csv
    
    handle = open("snp_chips.csv")
    reader = csv.reader(handle)
    replicate = 4
    start_column = -16 * replicate + replicate - 1
    samples = reader.next()[start_column : : replicate]
    samples = ['snp_id'] + [sample.split('.')[0] for sample in samples]
    print 'samples      ::', samples
    print 'sample count ::', len(samples) - 1
    
    
    str_output_ori = '{}\n'.format(','.join(samples))
    str_output_val = str_output_ori
    str_output_val_filtered = str_output_val
    str_output_ori = 'chr,pos,' + str_output_ori
    pure_row_count = 0
    for sp in reader:
        if sp[0][0] == '#':
            continue
        snps = [sp[1]] + sp[start_column :: replicate]
        str_snps = ','.join(snps)
        str_output_ori += '{0},{1},{2}\n'.format(int(sp[53][-2:]),
                                                 sp[56], str_snps)
        str_snps = str_snps.replace("AAAA","0")
        str_snps = str_snps.replace("AAAC","1")
        str_snps = str_snps.replace("AACC","2")
        str_snps = str_snps.replace("ACCC","3")
        str_snps = str_snps.replace("CCCC","4")
        
        str_snps = str_snps.replace("AAAG","1")
        str_snps = str_snps.replace("AAGG","2")
        str_snps = str_snps.replace("AGGG","3")
        str_snps = str_snps.replace("GGGG","4")
        
        str_snps = str_snps.replace("TTTT","0")
        str_snps = str_snps.replace("TTTC","1")
        str_snps = str_snps.replace("TTCC","2")
        str_snps = str_snps.replace("TCCC","3")
        str_snps = str_snps.replace("CCCC","4")
        
        str_snps = str_snps.replace("TTTG","1")
        str_snps = str_snps.replace("TTGG","2")
        str_snps = str_snps.replace("TGGG","3")
        str_snps = str_snps.replace("GGGG","4")
        
        str_snps = str_snps.replace("X","nan")
        str_snps = str_snps.replace("NC","nan")
        str_output_val += '{}\n'.format(str_snps)
        if 'nan' not in str_snps:
            pure_row_count += 1
            str_output_val_filtered += '{}\n'.format(str_snps)
    
    #print 'snps which all samples have value :: ', pure_row_count
    
    handle.close()
    
    handle_w = open('snp_chips.parsed.csv','w')
    handle_w.write(str_output_ori)
    handle_w.close()
    handle_w = open('snp_chips.parsed.value.csv','w')
    handle_w.write(str_output_val)
    handle_w.close()
    handle_w = open('snp_chips.parsed.value.filtered.csv','w')
    handle_w.write(str_output_val_filtered)
    handle_w.close()
    
    samples      :: ['snp_id', 'sample1', 'sample2', 'sample9', 'sample3', 'sample5', 'sample14', 'sample13', 'sample11', 'sample16', 'sample12', 'sample15', 'sample6', 'sample8', 'sample7', 'sample4', 'sample10']
    sample count :: 16
    
    

    SNP simple stats

    In [2]:
    #SNP simple stats
    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    data = pd.read_csv('snp_chips.parsed.value.csv', index_col=0)
    ## hist duplicated probes
    duplicated_snps = []
    indexes = list(data.index)
    for snp in indexes:
        duplicated_count = indexes.count(snp)
        if duplicated_count > 1:
            duplicated_snps.append(snp)
    print "duplicate probes :: ", len(set(duplicated_snps))
    
    data["index"] = data.index
    data.drop_duplicates(cols='index', take_last=True, inplace=True) # remove
                                                            # duplicate row
    del data["index"]
    
    ## plot non-nan count of samples
    snp_count_per_sample = data.apply(lambda x: sum(x.notnull()))
    fig, ax = plt.subplots(figsize=(15,4))
    ax.plot(snp_count_per_sample, marker='o', ls='-', color='blue')
    ax.set_xticks(range(0,16))
    ax.set_xticklabels(snp_count_per_sample.index, rotation=-45)
    ax.set_xlim([-1,16])
    ax.grid(True)
    ax.set_title('# of SNPs with value per sample')
    ax.set_xlabel('sample names')
    ax.set_ylabel('# of SNPs with value')
    for i in range(len(snp_count_per_sample.index)):
        ax.text(i, snp_count_per_sample[i], snp_count_per_sample[i], 
                ha='center', va='bottom')
    
    duplicate probes ::  455
    
    
    In [3]:
    data = data.loc[data.apply(lambda x: sum(x.isnull()),axis=1) == 0,:]
    print 'filtered data ::', data.shape
    
    filtered data :: (2494, 16)
    
    

    PCA dimension reduction

    In [4]:
    #PCA dimension reduction
    
    from sklearn.decomposition import PCA
    import matplotlib.cm as cm   # cm.rainbow is used below
    import math
    import numpy as np
    
    
    pca = PCA(n_components=3)
    pca.fit(data.T)
    tc = pca.transform(data.T)
    
    colors = cm.rainbow(np.linspace(0, 1, len(tc)))
    fig, ax = plt.subplots(figsize=(10,10))
    for i, sample in enumerate(data.columns):
        ax.plot(tc[i, 0], tc[i, 1], marker='o', ls='', color=colors[i])
        ax.text(tc[i, 0], tc[i, 1], sample, ha='left', va='bottom', 
                color=colors[i])
    ax.set_title('PCA reduction')
    ax.set_xlabel('principal component 1')
    ax.set_ylabel('principal component 2')
    
    Out[4]:
    <matplotlib.text.Text at 0x107666510>
    

    Phylogenetic tree

    1. Calculate genetic distance

    • Calculate Genetic distance (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1002451)
      : Genetic distance (D) between pair-wise combinations of individuals was calculated as in PLINK, where

      \[ D = 1 - {(IBS2 + 0.5 \times IBS1) \over N} \]
      IBS2 and IBS1 are the number of loci that share either 2 or 1 alleles identical by state (IBS), respectively, and N is the number of loci tested (a made-up worked example follows this list).
    • Make distance matrix file (phylip input file)
      : The input format for distance data is straightforward. The first line of the input file contains the number of species. There follows species data, starting, as with all other programs, with a species name. The species name is ten characters long, and must be padded out with blanks if shorter. And every cell should be space delimited.
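
    For a quick worked example with made-up counts: if N = 1000 loci are tested, of which IBS2 = 800 share both alleles and IBS1 = 150 share one allele IBS, then \( D = 1 - (800 + 0.5 \times 150)/1000 = 0.125 \).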
    In [7]:
    ## calculate genetic distance
    import math
    
    distance_file = 'distance_matrix.txt'
    
    # distance = mean absolute difference of the 0-4 genotype codes, scaled by allele_num
    def calculate_distance(s1, s2, allele_num):
        return 1 - (allele_num - (s1 - s2).apply(math.fabs)).sum() / (allele_num * len(s1))
    distance_matrix = data.apply(lambda x1: data.apply(
                            lambda x2: calculate_distance(x1, x2, 4)))
    str_output = '{0}\n'.format(len(distance_matrix))
    for sample in distance_matrix.index:
        sample_name = sample[:10] if len(sample) > 10 else sample + ' '*(10-len(sample))
        str_output += '{0} {1}\n'.format(sample_name,
            ' '.join(['{:.4f}'.format(v) for v in distance_matrix.loc[sample,]]))
    
    handle = open(distance_file, 'w')
    handle.write(str_output)
    handle.close()
    

    2. Convert distance matrix to newick format

    Use Phylip (check "Running the programs in background or under control of a command file" in http://evolution.genetics.washington.edu/phylip/doc/main.html).
    In [8]:
    import os
    from IPython.display import Image
    
    if os.path.isfile('outfile'):
        os.remove('outfile')
    if os.path.isfile('outtree'):
        os.remove('outtree')
    str_output = '{0}\nY\n'.format(distance_file)
    handle = open('fitch_input','w')
    handle.write(str_output)
    handle.close()
    os.system('fitch < fitch_input > fitch_result')
    
    Out[8]:
    0
    

    3. Draw tree

    Use phylip drawgram or drawtree.
    In [9]:
    ## draw tree with python package "ete2"
    
    from ete2 import Tree, TreeStyle
    
    handle = open('outtree')
    lines = handle.readlines()
    handle.close()
    newick = ''.join([line.strip() for line in lines])
    t = Tree(newick)
    t.show()
    #t.render('myTree.png', w=150, h = 150, units='mm')
    #Image(filename='myTree.png') 
    
    No module named MySQLdb
    
    
     MySQLdb module could not be loaded
    
    
    In [10]:
    ## draw tree with phylip "drawtree"
    ## phylip "drawtree" make ps format file. Converting file format is needed to render image.
    
    loc_fontfile = 'font1' # check the location of fontfile
    handle = open('drawtree_input', 'w')
    handle.write('outtree\n{0}\nY\n'.format(loc_fontfile)) 
    handle.close()
    commands = []
    commands.append('drawtree < drawtree_input > drawtree_result') # use phylip "drawtree"
    commands.append('mv plotfile drawtree.ps')
    commands.append('convert drawtree.ps drawtree.png') # use "convert"
    for command in commands:
        os.system(command)
    Image(filename='drawtree.png')
    
    Out[10]:
    In [11]:
    # draw tree with phylip "fitch" result
    
    handle = open('outfile')
    lines = handle.readlines()
    handle.close()
    start_index_str = 'Negative branch lengths not allowed\n'
    end_index_str = 'remember: this is an unrooted tree!\n'
    print '<Unrooted Tree>'
    print ''.join(lines[lines.index(start_index_str)+1: 
                                        lines.index(end_index_str)])
    
    <Unrooted Tree>
    
    
            +--sample16  
          +-7 
          ! +--sample2   
        +-9 
        ! ! +---sample3   
        ! +-3 
      +-2   !  +--sample4   
      ! !   +-13  
      ! !      +---sample5   
      ! ! 
      ! +----sample6   
      ! 
      !    +-----sample7   
      !    !  
      ! +-12     +---sample8   
      ! !  !  +-11  
      ! !  !  !  +----sample9   
      ! !  +-10  
      ! !     !  +----sample10  
      ! !     !  !  
      ! !     +-14   +---sample11  
      4-1        ! +-6 
      ! !        ! ! ! +---sample12  
      ! !        +-5 +-8 
      ! !          !   +----sample13  
      ! !          ! 
      ! !          +----sample14  
      ! ! 
      ! +----sample15  
      ! 
      +----sample1   
    
    
    
    
    

    Significant diff SNPs

    Use PLINK (treating all heterozygous calls as a single heterozygote class), up through a Manhattan plot. Summary of the fields in the PLINK case-control association test result ("assoc"):
    CHR     Chromosome
    SNP     SNP ID
    BP      Physical position (base-pair)
    A1      Minor allele name (based on whole sample)
    F_A     Frequency of this allele in cases
    F_U     Frequency of this allele in controls
    A2      Major allele name
    CHISQ   Basic allelic test chi-square (1df)
    P       Asymptotic p-value for this test
    OR      Estimated odds ratio (for A1, i.e. A2 is reference)
    
    In [12]:
    # make PLINK input files
    group1 = ['sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6']
    group2 = ['sample7', 'sample8', 'sample9', 'sample10', 'sample11', 
              'sample12', 'sample13', 'sample14', 'sample15','sample16']
    
    ## make MAP file
    data_plink = pd.read_csv('snp_chips.parsed.csv', index_col=2)
    data_plink = data_plink.loc[data.index,:]
    data_plink['id'] = data_plink.index
    data_plink.drop_duplicates(cols='id', take_last=True, inplace=True)
    data_map = data_plink[['chr', 'id', 'pos']]
    data_map.to_csv('data.map', sep=' ', index=False, header=False)
    
    ## make PED file
    data_plink = data_plink.drop(['chr', 'pos', 'id'], axis=1)
    data_ped = data_plink.apply(lambda x: ' '.join('{0} {1}'.format(
                                            snp[0], snp[-1]) for snp in x))
    str_output = ''
    for sample in data_ped.index:
        str_output += '{0} {1} {2}\n'.format(sample, 1 if sample in group1
                                             else 2, data_ped[sample])
    handle = open('data.ped', 'w')
    handle.write(str_output)
    handle.close()
    
    # execute plink
    plink_command = 'plink --file data --assoc --no-fid --no-parents '\
                                    '--allow-no-sex --no-sex --map3 --adjust'
    os.system(plink_command)
    
    Out[12]:
    0
    
    In [13]:
    # draw Manhattan plot
    import matplotlib.cm as cm
    
    plink_map = pd.read_csv('data.map', header=None, sep=' ', index_col=1)
    plink_map.columns = ['chr', 'pos']
    handle = open('plink.assoc')
    lines = handle.readlines()
    handle.close()
    plink_pvalue = pd.DataFrame([{'id': line.strip().split()[1], 
                'pvalue':line.strip().split()[8]} for line\
                in lines[1:] if line.strip().split()[8] != 'NA' ])
    plink_pvalue.index = plink_pvalue['id']
    plink_map = plink_map.loc[plink_pvalue.index, :]
    result = pd.concat([plink_map, plink_pvalue], join='outer', axis=1)
    chr_x_axis = result.pivot_table(cols='chr', aggfunc='max')
    chr_x_axis = chr_x_axis.loc['pos',].cumsum()
    x_axis = result.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                          if x['chr'] != 0 else x['pos'], axis=1)
    y_axis = pd.Series([-math.log(float(x), 10) for x in result['pvalue']])
    
    fig, ax = plt.subplots(figsize=(15,4))
    
    colors = cm.rainbow(np.linspace(0, 1, len(chr_x_axis)))
    for i, chr in enumerate(chr_x_axis.index):
        rt = result.loc[result['chr'] == chr, :]
        x_axis = rt.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                          if x['chr'] != 0 else x['pos'], axis=1)
        y_axis = pd.Series([-math.log(float(x), 10) for x in rt['pvalue']])
        
        ax.plot(x_axis, y_axis, ls='', marker='o', color=colors[i])
    ax.set_xticks([0] + [i for i in chr_x_axis.values])
    ax.set_xticklabels(['chr{}'.format(i) for i in chr_x_axis.index],
                       rotation=-45)
    ax.grid(True)
    ax.set_title('Manhattan plot')
    ax.set_xlabel('chromosome (bp)')
    ax.set_ylabel('pvalue (-log10)')
    
    Out[13]:
    <matplotlib.text.Text at 0x10a23cdd0>
    

    SNPs diversity

    ROD

    \[ ROD = 1 - \pi_{cul} / \pi_{wild} \] \[ \pi = \sum_{ij} x_i x_j \pi_{ij} = 2 \sum_{i=1}^{n} \sum_{j=1}^{i-1} x_i x_j \pi_{ij} \]
    where \(x_i\) and \(x_j\) are the respective frequencies of the ith and jth sequences, \(\pi_{ij}\) is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample.
    - calculation example : http://www.vcu.edu/csbc/lfsc520/week7/Week7LectureNotes.pdf
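
    As a rough illustration of the \(\pi\) formula above, here is a minimal sketch that computes nucleotide diversity for a small, made-up set of aligned sequences (each sampled sequence taken with frequency 1/n); ROD would then be \(1 - \pi_{cul}/\pi_{wild}\) computed from two such samples.

    # minimal sketch: nucleotide diversity (pi) from a small, made-up alignment
    seqs = ['ACGTACGT',
            'ACGTACGA',
            'ACCTACGA']
    n = len(seqs)

    def pi_ij(a, b):
        # proportion of sites that differ between two sequences
        return sum(1 for x, y in zip(a, b) if x != y) / float(len(a))

    # pi = 2 * sum over pairs i > j of x_i * x_j * pi_ij, with x_i = 1/n for each sequence
    pi = 2 * sum(pi_ij(seqs[i], seqs[j]) * (1.0 / n) * (1.0 / n)
                 for i in range(n) for j in range(i))
    print('pi = {:.4f}'.format(pi))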

    selective sweep analysis\({}^1\)

    We used allele counts at SNP positions to identify signatures of selection in sliding 40-kb windows, for pools of sequence data. For each pool and SNP, we determined the numbers of reads corresponding to the most and least abundant allele (\(n_{MAJ}\) and \(n_{MIN}\)). For each window in each breed pool, we calculated a pooled heterozygosity score, \(H_p = 2\Sigma n_{MAJ}\Sigma n_{MIN} / (\Sigma n_{MAJ} + \Sigma n_{MIN})^2\), where \(\Sigma n_{MAJ}\) and \(\Sigma n_{MIN}\) are the sums of \(n_{MAJ}\) and \(n_{MIN}\), respectively, for all SNPs in the window. Individual \(H_p\) values were then Z-transformed as follows: \(ZH_p = (H_p - \mu H_p) / \sigma H_p\). (1. http://www.nature.com/nature/journal/v464/n7288/abs/nature08832.html)
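
    Below is a minimal sketch of this pooled heterozygosity calculation for one pool, assuming a pandas DataFrame with columns 'chr', 'pos', 'n_maj' and 'n_min' (the column names and the 20-kb window step are assumptions, not taken from the paper).

    # minimal sketch: pooled heterozygosity Hp in sliding 40-kb windows, then Z-transform
    import pandas as pd

    def hp_windows(snps, window=40000, step=20000):
        # snps: DataFrame with columns 'chr', 'pos', 'n_maj', 'n_min' (assumed layout)
        rows = []
        for chrom, grp in snps.groupby('chr'):
            start = 0
            while start <= grp['pos'].max():
                w = grp[(grp['pos'] >= start) & (grp['pos'] < start + window)]
                if len(w) > 0:
                    s_maj, s_min = w['n_maj'].sum(), w['n_min'].sum()
                    hp = 2.0 * s_maj * s_min / (s_maj + s_min) ** 2
                    rows.append({'chr': chrom, 'start': start, 'Hp': hp})
                start += step
        result = pd.DataFrame(rows)
        # Z-transform: ZHp = (Hp - mean(Hp)) / std(Hp)
        result['ZHp'] = (result['Hp'] - result['Hp'].mean()) / result['Hp'].std()
        return result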