Wednesday, January 14, 2015

SNP chip analysis (from raw data to phylogenetic tree)

아래 과정은 snp chip에서 case-control GWAS study까지의 대략적인 코드이다.

Parse raw file

describe parsing script.

In [1]:

#parse raw file

import csv

handle = open("snp_chips.csv")
reader = csv.reader(handle)
replicate = 4
start_column = -16 * replicate + replicate - 1
samples = reader.next()[start_column : : replicate]
samples = ['snp_id'] + [sample.split('.')[0] for sample in samples]
print 'samples      ::', samples
print 'sample count ::', len(samples) - 1


str_output_ori = '{}\n'.format(','.join(samples))
str_output_val = str_output_ori
str_output_val_filtered = str_output_val
str_output_ori = 'chr,pos,' + str_output_ori
pure_row_count = 0
for sp in reader:
    if sp[0][0] == '#':
        continue
    snps = [sp[1]] + sp[start_column :: replicate]
    str_snps = ','.join(snps)
    str_output_ori += '{0},{1},{2}\n'.format(int(sp[53][-2:]),
                                             sp[56], str_snps)
    str_snps = str_snps.replace("AAAA","0")
    str_snps = str_snps.replace("AAAC","1")
    str_snps = str_snps.replace("AACC","2")
    str_snps = str_snps.replace("ACCC","3")
    str_snps = str_snps.replace("CCCC","4")
    
    str_snps = str_snps.replace("AAAG","1")
    str_snps = str_snps.replace("AAGG","2")
    str_snps = str_snps.replace("AGGG","3")
    str_snps = str_snps.replace("GGGG","4")
    
    str_snps = str_snps.replace("TTTT","0")
    str_snps = str_snps.replace("TTTC","1")
    str_snps = str_snps.replace("TTCC","2")
    str_snps = str_snps.replace("TCCC","3")
    str_snps = str_snps.replace("CCCC","4")
    
    str_snps = str_snps.replace("TTTG","1")
    str_snps = str_snps.replace("TTGG","2")
    str_snps = str_snps.replace("TGGG","3")
    str_snps = str_snps.replace("GGGG","4")
    
    str_snps = str_snps.replace("X","nan")
    str_snps = str_snps.replace("NC","nan")
    str_output_val += '{}\n'.format(str_snps)
    if 'nan' not in str_snps:
        pure_row_count += 1
        str_output_val_filtered += '{}\n'.format(str_snps)

#print 'snps which all samples have value :: ', pure_row_count

handle.close()

handle_w = open('snp_chips.parsed.csv','w')
handle_w.write(str_output_ori)
handle_w.close()
handle_w = open('snp_chips.parsed.value.csv','w')
handle_w.write(str_output_val)
handle_w.close()
handle_w = open('snp_chips.parsed.value.filterd.csv','w')
handle_w.write(str_output_val_filtered)
handle_w.close()

samples      :: ['snp_id', 'sample1', 'sample2', 'sample9', 'sample3', 'sample5', 'sample14', 'sample13', 'sample11', 'sample16', 'sample12', 'sample15', 'sample6', 'sample8', 'sample7', 'sample4', 'sample10']
sample count :: 16

SNP simple stats

In [2]:

#SNP simple stats

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('snp_chips.parsed.value.csv', index_col=0)
## hist duplicated probes
duplicated_snps = []
indexes = list(data.index)
for snp in indexes:
    duplicated_count = indexes.count(snp)
    if duplicated_count > 1:
        duplicated_snps.append(snp)
print "duplicate probes :: ", len(set(duplicated_snps))

data["index"] = data.index
data.drop_duplicates(cols='index', take_last=True, inplace=True) # remove
                                                        # duplicate row
del data["index"]

## plot non-nan count of samples
snp_count_per_sample = data.apply(lambda x: sum(x.notnull()))
fig, ax = plt.subplots(figsize=(15,4))
ax.plot(snp_count_per_sample, marker='o', ls='-', color='blue')
ax.set_xticks(range(0,16))
ax.set_xticklabels(snp_count_per_sample.index, rotation=-45)
ax.set_xlim([-1,16])
ax.grid(True)
ax.set_title('# of SNPs with value per sample')
ax.set_xlabel('sample names')
ax.set_ylabel('# of SNPs with value')
for i in range(len(snp_count_per_sample.index)):
    ax.text(i, snp_count_per_sample[i], snp_count_per_sample[i], 
            ha='center', va='bottom')

duplicate probes ::  455

In [3]:

data = data.loc[data.apply(lambda x: sum(x.isnull()),axis=1) == 0,:]
print 'filtered data ::', data.shape

filtered data :: (2494, 16)

PCA dimension reduction

In [4]:

#PCA dimension reduction

from sklearn.decomposition import PCA
import math
import numpy as np


pca = PCA(n_components=3)
pca.fit(data.T)
tc = pca.transform(data.T)

colors = cm.rainbow(np.linspace(0, 1, len(tc)))
fig, ax = plt.subplots(figsize=(10,10))
for i, sample in enumerate(data.columns):
    ax.plot(tc[i, 0], tc[i, 1], marker='o', ls='*', color=colors[i])
    ax.text(tc[i, 0], tc[i, 1], sample, ha='left', va='bottom', 
            color=colors[i])
ax.set_title('PCA reduction')
ax.set_xlabel('principle component 1')
ax.set_ylabel('principle component 2')

Out[4]:

<matplotlib.text.Text at 0x107666510>

Phylogenetic tree

Calculate genetic distance

Calculate Genetic distance (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1002451)
: Genetic distance (D) between pair-wise combinations of individuals was calculated as PLINK where

\[ D = 1−{(IBS2+0.5 \times IBS1) \over N} \]
IBS2 and IBS1 are the number of loci that share either 2 or 1 alleles identical by state (IBS), respectively, and N is the number of loci tested.
Make distance matrix file (phylip input file)
: The input format for distance data is straightforward. The first line of the input file contains the number of species. There follows species data, starting, as with all other programs, with a species name. The species name is ten characters long, and must be padded out with blanks if shorter. And every cell should be space delimited.

In [7]:

## calculate genetic distance
import math

distance_file = 'distance_matrix.txt'

def calculate_distance(s1, s2, allele_num):
    return 1 - (allele_num - (s1 - s2).apply(math.fabs)).sum() / (allele_num * len(s1))
distance_matrix = data.apply(lambda x1: data.apply(
                        lambda x2: calculate_distance(x1, x2, 4)))
str_output = '{0}\n'.format(len(distance_matrix))
for sample in distance_matrix.index:
    sample_name = sample[:10] if len(sample) > 10 else sample + ' '*(10-len(sample))
    str_output += '{0} {1}\n'.format(sample_name,
        ' '.join(['{:.4f}'.format(v) for v in distance_matrix.loc[sample,]]))

handle = open(distance_file, 'w')
handle.write(str_output)
handle.close()

Convert distance matrix to newick format

Use Phylip (check "Running the programs in background or under control of a command file" in http://evolution.genetics.washington.edu/phylip/doc/main.html).

In [8]:

import os
from IPython.display import Image

if os.path.isfile('outfile'):
    os.remove('outfile')
if os.path.isfile('outtree'):
    os.remove('outtree')
str_output = '{0}\nY\n'.format(distance_file)
handle = open('fitch_input','w')
handle.write(str_output)
handle.close()
os.system('fitch < fitch_input > fitch_result')

Out[8]:

Draw tree

use phylip drawgram, drawtree

In [9]:

## draw tree with python package "ete2"

from ete2 import Tree, TreeStyle

handle = open('outtree')
lines = handle.readlines()
handle.close()
newick = ''.join([line.strip() for line in lines])
t = Tree(newick)
t.show()
#t.render('myTree.png', w=150, h = 150, units='mm')
#Image(filename='myTree.png')

No module named MySQLdb

 MySQLdb module could not be loaded

In [10]:

## draw tree with phylip "drawtree"
## phylip "drawtree" make ps format file. Converting file format is needed to render image.

loc_fontfile = 'font1' # check the location of fontfile
handle = open('drawtree_input', 'w')
handle.write('outtree\n{0}\nY\n'.format(loc_fontfile)) 
handle.close()
commands = []
commands.append('drawtree < drawtree_input > drawtree_result') # use phylip "drawtree"
commands.append('mv plotfile drawtree.ps')
commands.append('convert drawtree.ps drawtree.png') # use "convert"
for command in commands:
    os.system(command)
Image(filename='drawtree.png')

Out[10]:

In [11]:

# draw tree with phylip "fitch" result

handle = open('outfile')
lines = handle.readlines()
handle.close()
start_index_str = 'Negative branch lengths not allowed\n'
end_index_str = 'remember: this is an unrooted tree!\n'
print '<Unrooted Tree>'
print ''.join(lines[lines.index(start_index_str)+1: 
                                    lines.index(end_index_str)])

<Unrooted Tree>


        +--sample16  
      +-7 
      ! +--sample2   
    +-9 
    ! ! +---sample3   
    ! +-3 
  +-2   !  +--sample4   
  ! !   +-13  
  ! !      +---sample5   
  ! ! 
  ! +----sample6   
  ! 
  !    +-----sample7   
  !    !  
  ! +-12     +---sample8   
  ! !  !  +-11  
  ! !  !  !  +----sample9   
  ! !  +-10  
  ! !     !  +----sample10  
  ! !     !  !  
  ! !     +-14   +---sample11  
  4-1        ! +-6 
  ! !        ! ! ! +---sample12  
  ! !        +-5 +-8 
  ! !          !   +----sample13  
  ! !          ! 
  ! !          +----sample14  
  ! ! 
  ! +----sample15  
  ! 
  +----sample1

Significant diff SNPs

use plink (consider all of heterozygosity as 1) up to manhattan plot PLINK case-control association test result("assoc") fields summary

CHR     Chromosome
SNP     SNP ID
BP      Physical position (base-pair)
A1      Minor allele name (based on whole sample)
F_A     Frequency of this allele in cases
F_U     Frequency of this allele in controls
A2      Major allele name
CHISQ   Basic allelic test chi-square (1df)
P       Asymptotic p-value for this test
OR      Estimated odds ratio (for A1, i.e. A2 is reference)

In [12]:

# make PLINK input files
group1 = ['sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6']
group2 = ['sample7', 'sample8', 'sample9', 'sample10', 'sample11', 
          'sample12', 'sample13', 'sample14', 'sample15','sample16']

## make MAP file
data_plink = pd.read_csv('snp_chips.parsed.csv', index_col=2)
data_plink = data_plink.loc[data.index,:]
data_plink['id'] = data_plink.index
data_plink.drop_duplicates(cols='id', take_last=True, inplace=True)
data_map = data_plink[['chr', 'id', 'pos']]
data_map.to_csv('data.map', sep=' ', index=False, header=False)

## make PED file
data_plink = data_plink.drop(['chr', 'pos', 'id'], axis=1)
data_ped = data_plink.apply(lambda x: ' '.join('{0} {1}'.format(
                                        snp[0], snp[-1]) for snp in x))
str_output = ''
for sample in data_ped.index:
    str_output += '{0} {1} {2}\n'.format(sample, 1 if sample in group1
                                         else 2, data_ped[sample])
handle = open('data.ped', 'w')
handle.write(str_output)
handle.close()

# execute plink
plink_command = 'plink --file data --assoc --no-fid --no-parents '\
                                '--allow-no-sex --no-sex --map3 --adjust'
os.system(plink_command)

Out[12]:

In [13]:

# draw Manhattan plot
import matplotlib.cm as cm

plink_map = pd.read_csv('data.map', header=None, sep=' ', index_col=1)
plink_map.columns = ['chr', 'pos']
handle = open('plink.assoc')
lines = handle.readlines()
handle.close()
plink_pvalue = pd.DataFrame([{'id': line.strip().split()[1], 
            'pvalue':line.strip().split()[8]} for line\
            in lines[1:] if line.strip().split()[8] != 'NA' ])
plink_pvalue.index = plink_pvalue['id']
plink_map = plink_map.loc[plink_pvalue.index, :]
result = pd.concat([plink_map, plink_pvalue], join='outer', axis=1)
chr_x_axis = result.pivot_table(cols='chr', aggfunc='max')
chr_x_axis = chr_x_axis.loc['pos',].cumsum()
x_axis = result.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                      if x['chr'] != 0else x['pos'], axis=1)
y_axis = pd.Series([-math.log(float(x), 10) for x in result['pvalue']])

fig, ax = plt.subplots(figsize=(15,4))

colors = cm.rainbow(np.linspace(0, 1, len(chr_x_axis)))
for i, chr in enumerate(chr_x_axis.index):
    rt = result.loc[result['chr'] == chr, :]
    x_axis = rt.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                      if x['chr'] != 0else x['pos'], axis=1)
    y_axis = pd.Series([-math.log(float(x), 10) for x in rt['pvalue']])
    
    ax.plot(x_axis, y_axis, ls='*', marker='o',color=colors[i])
ax.set_xticks([0] + [i for i in chr_x_axis.values])
ax.set_xticklabels(['chr{}'.format(i) for i in chr_x_axis.index],
                   rotation=-45)
ax.grid(True)
ax.set_title('Manhattan plot')
ax.set_xlabel('chromosome (bp)')
ax.set_ylabel('pvalue (-log10)')

Out[13]:

<matplotlib.text.Text at 0x10a23cdd0>

SNPs diversity

ROD

\[ ROD = 1 - \pi_{cul} / \pi_{wild} \] \[ \pi = \Sigma_{ij}x_ix_j\pi_{ij} = 2* \Sigma_{i=1}^n\Sigma_{j=1}^{i-1}x_ix_j\pi_{ij} \]
where \(x_i\) and \(x_j\) are the respective frequencies of the ith and jth sequences, \(\pi_{ij}\) is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample.
- calculation example : http://www.vcu.edu/csbc/lfsc520/week7/Week7LectureNotes.pdf

selective sweep analysis\({}^1\)

We used allele counts at SNP positions to identify signatures of selection in sliding 40-kb windows, for pools of sequence data. For each pool and SNP, we determined the numbers of reads corresponding to the most and least abundant allele (\(n_{MAJ}\) and \(n_{MIN}\)). For each window in each breed pool, we calculated a pooled heterozygosity score: \(H_p = 2\Sigma n_{MAJ}\Sigma n_{MIN}/ (\Sigma n_{MAJ} + \Sigma n_{MIN}){}^2\), where SnMAJ and SnMIN are the sums of \(n_{MAJ}\) and, respectively, \(n_{MIN}\) for all SNPs in the window. Individual \(H_p\) values were then Z-transformed as follows: \(ZH_p = (H_p - \mu H_p)/ \sigma H_p\). \({}_{1. http://www.nature.com/nature/journal/v464/n7288/abs/nature08832.html}\)

In []:

Tuesday, January 13, 2015

Genetic distance

genetic distance 란 같은 종안의 집단간의 genetic divergence 를 측정하는 것.

evolution model에 따라 여러가지 방법이 있는데 아래는 그 중 Nei's standard genetic distance 의 방법의 예이다.

- Neil's genetic distance

\[ D{}_N = -\ln{I_N} \]

\[ I_N ={\sum^m_{i=1}(p_{ix}p_{iy}) \over [(\sum^m_{i=1}p^2_{ix})(\sum^m_{i=1}p^2_{iy})]^{1/2}} \]

여기서 \(p_{ix}\) 는 x 집단의 i 번째 allele 의 frequency를 의미.

예)
멸종 위기의 딱따구리가 아래와 같이 3개의 집단으로 구성되어 있고 각 집단의 dehydrogenase(LDH) locus 에서 allele (B,C,D) 의 frequency가 아래와 같다고 할때, Venon 집단과 Apalachicola 집단의 Nei's genetic distatnce 를 구해보면,

LDH allele	Venon	Apalachicola	Gameland
B	0.023	0.019	0.981
C	0.977	0.885	0.019
D	0.000	0.096	0.000

\( \Sigma p^2_{ix} = 0.023^2 + 0.977^2 + 0.000^2 = 0.955 \),
\( \Sigma p^2_{iy} = 0.019^2 + 0.885^2 + 0.096^2 = 0.793 \),
\( \Sigma p_{iy} \Sigma p_{ix} = 0.023 \times 0.019 + 0.977 \times 0.885 + 0 \times 0.096 = 0.793 \),
\(D_N = -\ln{I_n} = -\ln{0.994} = 0.006\).

이 D 값은 크면 클수록 집단간의 거리가 멀다는 의미. 초파리의 경우 allozyme genetic distance를 예로 보자면 isolated population (D=0.03), sub-species(D=0.23), distinct species(D=1.21).
그러나 이 값은 종다마 다르고 또한 유전자 마다 다르기 때문에 정확하진 않음.

* 위 내용은 Introduction to Conservation Genetics 책을 바탕으로 함.

Monday, January 12, 2015

mathjax in blogger

http://weiqigao.blogspot.kr/2013/03/enabling-mathjax-in-blogger.html 에 중단에 있는 내용을 head안에 카피.

아래는 테스트

When \(a \ne 0\), there are two solutions to \(ax^2 + bx + c = 0\) and they are \[x = {-b \pm \sqrt{b^2 - 4ac} \over 2a}.\]

Tuesday, January 6, 2015

binomial test confidence interval

http://www.sigmazone.com/binomial_confidence_interval.htm

binomial test 의 confidence interval을 구하는 방법

1. Normal Approximation Method of the Binomial Confidence Interval

2. Exact Confidence Interval

베이지안 통계추론(오만숙 저)을 보는데..

동전을 10번 던졌을때 앞이 2가 나온경우 사전분포를 uniform distribution (a,b가 각각 1인 beta distribution)으로 가정하고 우도함수를 binomial distribution으로 계산하면 사후분포가 beta distribution 이 나오고 이 사후 분포의 95% 신뢰구간을 구하면 (0.06, 0.52) 가 나온다.

이를 고전적인 95% 신뢰구간을 구하면 (0.025, 0.56) 라고 하는데 이는

binom.test(2, 10, 0.5) 함수를 사용하면 구할 수 있다.

문젠 추정치가 0.2 일때 이것의 95% 의 신뢰구간을 구하는 normal approximation 방법으로 구한 값과 차이가 나는데..

02 +- 1.96 * ((0.2 * 0.8)/10)^0.5 의 값과는 다르다.

이는

Normal Approximation Method of the Binomial Confidence Interval 과 Exact Confidence Interval 의 차이였던듯..

아래를 참조하면 잘 설명이 되어 있음.

http://www.sigmazone.com/binomial_confidence_interval.htm

Sunday, December 28, 2014

d3 참고 자료

1. https://medium.com/@c_behrens/enter-update-exit-6cafc6014c36

D3 에서 DOM element를 select하고 data를 bind하고 초과되는 data에 sub-selection (enter) 를 통한 새로운 DOM element 생성 혹은 exit 를 통한 제거가 필요한 data의 sub-selection 의 설명이 잘 되어 있음.
enter 와 exit 메소드의 설명이 명확히 되어 있음

2. http://christopheviau.com/d3_tutorial/

transition().delay() 를 통한 animation chaining 예제. this 키워드를 이용하여 callback 함수가 현재 이벤트가 일어난 context를 접근할 수 있음. text로 데이터를 가져와서 표현하는 예제도 포함

Thursday, November 20, 2014

odds ratio VS relative risk

odds ratio(OR), relative risk(RR).
어떤 연구는 odds ratio를 쓰고 어떤 연구는 relative risk를 쓰고.

대략적으로 구글링을 해보면 cohort 연구와 같은 prospective study (특정 결과에 영향이 있는 원인을 기준으로 샘플을 분류하고 시간이 흐름에 따라 결과가 어떻게 나오는지 연구하는 것) 에서는 RR을 쓰는 것이 맞고 case-control 과 같은 retrospective study 에서는 RR대신에 OR을 쓴다 정도로 설명이 나온다.

이 설명이 별로 와닿지가 않는데.. 그냥 수치를 넣고 테이블 값을 변형해서 OR과 RR 값을 구해보면 왜 RR이 아닌 OR을 사용하는지 이해될 수 있다.

흡연과 폐암의 예를 들어보자.

실제적 전체인구가 20000 명인데 흡연과 폐암간의 관계가 아래 표와 같다고 하자 (RR이 2, OR 이 2.11 이다).

	cancer	non-cancer
smoker	1000	9000	10000
non-smoker	500	9500	10000
	1500	18500

prospective study 는 smoker와 non-smoker 를 나누고 시간이 흐른뒤에 각각의 group에서 cancer, non-cancer 의 비율을 본다. 전체 인구에서 smoker 100 명을 sampling하고 non-smoker 100명을 샘플링 해서 오랜시간 동안 cancer의 발생 환자 수를 확인해보면 ideal한 경우에 아래 표와 같은 결과가 나타날 것이다.

	cancer	non-cancer
smoker	10	90	100
non-smoker	5	95	100
	15	185

여기서는 smoker 그룹을 먼저 샘플링 했기때문에, 곧 smoker 100명의 집단을 정해놓고 그 안에서 cancer 가 발생한 사람수를 확인 한 것이기때문에 smoker일 경우 cancer 가 걸릴 확률 (10/100) 을 구해도 괜찮다. 곧 RR을 구해도 된다. RR을 구하면 2가 나오고 OR 을 구하면 2.11이 나온다.

그런데 case-control study 처럼 cancer 환자군을 모집하고 그 안에서 smoker와 non-smoker를 나누고 마찬가지로 non-cancer 집단을 모집해서 그안에서 smoker와 non-smoker를 구해서 prospective study와 같은 확률 구하면 어떻게 될까? (cancer 환자 100명을 모집하면 모집단의 표에서 smoker와 non-smoker의 비율이 2:1이기에 대략적으로 67:33 명의 smoker 와 non-smoker가 뽑힐 것이다)

	cancer	non-cancer
smoker	67	49	116
non-smoker	33	51	84
	100	100

위의 표 수치를 가지고 smoker 긴데 cancer 가 걸릴 확률을 구하면 67/116 으로 말도 안되는 값이 나온다. 당연히 smoker 그룹을 정해놓고 그안에서 cancer의 발병 확률을 계산한게 아니라 cancer 환자수 non-cancer 정상인 따로따로 사람을 모집하고 smoker를 기준으로 위와 같은 계산을 하니 엉뚱한 값이 나온다. RR은 1.47이 나온다. OR은 2.11 이 나온다(cancer 내의 smoker, non-smoker 수를 반올림 하였기 때문에 ideal 값과는 약간의 차이가 난다).

예에서 볼수 있듯이 OR이 RR에 비해 robust한 것을 알수 있다. 물론 RR이 확률값의 비율을 나타낸 것이기에 직관적이고 정확하나 case-control study와 같은 경우에는 RR 을 구하면 엉뚱한 값을 구하게 된다.

Monday, July 14, 2014

FA (factor analysis) & PCA (principal component analysis)

기본적으로 linear algebra 를 공부한 뒤 (SVD까지는 알아야 함)

PCA
http://sites.stat.psu.edu/~ajw13/stat505/fa06/16_princomp/

FA
http://sites.stat.psu.edu/~ajw13/stat505/fa06/17_factor/

PCA vs FA
1. by program
http://stats.stackexchange.com/questions/102882/steps-done-in-factor-analysis-compared-to-steps-done-in-pca/102999#102999

2. by plot
http://stats.stackexchange.com/questions/95038/how-does-factor-analysis-explain-the-covariance-and-pca-explains-the-variance/95106#95106

3. by theory
http://stats.stackexchange.com/questions/94048/pca-and-exploratory-factor-analysis-on-the-same-data-set/94104#94104

more
http://stats.stackexchange.com/questions/50745/best-factor-extraction-methods-with-reference-to-spss