아래 과정은 snp chip에서 case-control GWAS study까지의 대략적인 코드이다.

Parse raw file

describe parsing script.

In [1]:

#parse raw file

import csv

handle = open("snp_chips.csv")
reader = csv.reader(handle)
replicate = 4
start_column = -16 * replicate + replicate - 1
samples = reader.next()[start_column : : replicate]
samples = ['snp_id'] + [sample.split('.')[0] for sample in samples]
print 'samples      ::', samples
print 'sample count ::', len(samples) - 1


str_output_ori = '{}\n'.format(','.join(samples))
str_output_val = str_output_ori
str_output_val_filtered = str_output_val
str_output_ori = 'chr,pos,' + str_output_ori
pure_row_count = 0
for sp in reader:
    if sp[0][0] == '#':
        continue
    snps = [sp[1]] + sp[start_column :: replicate]
    str_snps = ','.join(snps)
    str_output_ori += '{0},{1},{2}\n'.format(int(sp[53][-2:]),
                                             sp[56], str_snps)
    str_snps = str_snps.replace("AAAA","0")
    str_snps = str_snps.replace("AAAC","1")
    str_snps = str_snps.replace("AACC","2")
    str_snps = str_snps.replace("ACCC","3")
    str_snps = str_snps.replace("CCCC","4")
    
    str_snps = str_snps.replace("AAAG","1")
    str_snps = str_snps.replace("AAGG","2")
    str_snps = str_snps.replace("AGGG","3")
    str_snps = str_snps.replace("GGGG","4")
    
    str_snps = str_snps.replace("TTTT","0")
    str_snps = str_snps.replace("TTTC","1")
    str_snps = str_snps.replace("TTCC","2")
    str_snps = str_snps.replace("TCCC","3")
    str_snps = str_snps.replace("CCCC","4")
    
    str_snps = str_snps.replace("TTTG","1")
    str_snps = str_snps.replace("TTGG","2")
    str_snps = str_snps.replace("TGGG","3")
    str_snps = str_snps.replace("GGGG","4")
    
    str_snps = str_snps.replace("X","nan")
    str_snps = str_snps.replace("NC","nan")
    str_output_val += '{}\n'.format(str_snps)
    if 'nan' not in str_snps:
        pure_row_count += 1
        str_output_val_filtered += '{}\n'.format(str_snps)

#print 'snps which all samples have value :: ', pure_row_count

handle.close()

handle_w = open('snp_chips.parsed.csv','w')
handle_w.write(str_output_ori)
handle_w.close()
handle_w = open('snp_chips.parsed.value.csv','w')
handle_w.write(str_output_val)
handle_w.close()
handle_w = open('snp_chips.parsed.value.filterd.csv','w')
handle_w.write(str_output_val_filtered)
handle_w.close()

samples      :: ['snp_id', 'sample1', 'sample2', 'sample9', 'sample3', 'sample5', 'sample14', 'sample13', 'sample11', 'sample16', 'sample12', 'sample15', 'sample6', 'sample8', 'sample7', 'sample4', 'sample10']
sample count :: 16

SNP simple stats

In [2]:

#SNP simple stats

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('snp_chips.parsed.value.csv', index_col=0)
## hist duplicated probes
duplicated_snps = []
indexes = list(data.index)
for snp in indexes:
    duplicated_count = indexes.count(snp)
    if duplicated_count > 1:
        duplicated_snps.append(snp)
print "duplicate probes :: ", len(set(duplicated_snps))

data["index"] = data.index
data.drop_duplicates(cols='index', take_last=True, inplace=True) # remove
                                                        # duplicate row
del data["index"]

## plot non-nan count of samples
snp_count_per_sample = data.apply(lambda x: sum(x.notnull()))
fig, ax = plt.subplots(figsize=(15,4))
ax.plot(snp_count_per_sample, marker='o', ls='-', color='blue')
ax.set_xticks(range(0,16))
ax.set_xticklabels(snp_count_per_sample.index, rotation=-45)
ax.set_xlim([-1,16])
ax.grid(True)
ax.set_title('# of SNPs with value per sample')
ax.set_xlabel('sample names')
ax.set_ylabel('# of SNPs with value')
for i in range(len(snp_count_per_sample.index)):
    ax.text(i, snp_count_per_sample[i], snp_count_per_sample[i], 
            ha='center', va='bottom')

duplicate probes ::  455

In [3]:

data = data.loc[data.apply(lambda x: sum(x.isnull()),axis=1) == 0,:]
print 'filtered data ::', data.shape

filtered data :: (2494, 16)

PCA dimension reduction

In [4]:

#PCA dimension reduction

from sklearn.decomposition import PCA
import math
import numpy as np


pca = PCA(n_components=3)
pca.fit(data.T)
tc = pca.transform(data.T)

colors = cm.rainbow(np.linspace(0, 1, len(tc)))
fig, ax = plt.subplots(figsize=(10,10))
for i, sample in enumerate(data.columns):
    ax.plot(tc[i, 0], tc[i, 1], marker='o', ls='*', color=colors[i])
    ax.text(tc[i, 0], tc[i, 1], sample, ha='left', va='bottom', 
            color=colors[i])
ax.set_title('PCA reduction')
ax.set_xlabel('principle component 1')
ax.set_ylabel('principle component 2')

Out[4]:

<matplotlib.text.Text at 0x107666510>

Phylogenetic tree

Calculate genetic distance

Calculate Genetic distance (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1002451)
: Genetic distance (D) between pair-wise combinations of individuals was calculated as PLINK where

\[ D = 1−{(IBS2+0.5 \times IBS1) \over N} \]
IBS2 and IBS1 are the number of loci that share either 2 or 1 alleles identical by state (IBS), respectively, and N is the number of loci tested.
Make distance matrix file (phylip input file)
: The input format for distance data is straightforward. The first line of the input file contains the number of species. There follows species data, starting, as with all other programs, with a species name. The species name is ten characters long, and must be padded out with blanks if shorter. And every cell should be space delimited.

In [7]:

## calculate genetic distance
import math

distance_file = 'distance_matrix.txt'

def calculate_distance(s1, s2, allele_num):
    return 1 - (allele_num - (s1 - s2).apply(math.fabs)).sum() / (allele_num * len(s1))
distance_matrix = data.apply(lambda x1: data.apply(
                        lambda x2: calculate_distance(x1, x2, 4)))
str_output = '{0}\n'.format(len(distance_matrix))
for sample in distance_matrix.index:
    sample_name = sample[:10] if len(sample) > 10 else sample + ' '*(10-len(sample))
    str_output += '{0} {1}\n'.format(sample_name,
        ' '.join(['{:.4f}'.format(v) for v in distance_matrix.loc[sample,]]))

handle = open(distance_file, 'w')
handle.write(str_output)
handle.close()

Convert distance matrix to newick format

Use Phylip (check "Running the programs in background or under control of a command file" in http://evolution.genetics.washington.edu/phylip/doc/main.html).

In [8]:

import os
from IPython.display import Image

if os.path.isfile('outfile'):
    os.remove('outfile')
if os.path.isfile('outtree'):
    os.remove('outtree')
str_output = '{0}\nY\n'.format(distance_file)
handle = open('fitch_input','w')
handle.write(str_output)
handle.close()
os.system('fitch < fitch_input > fitch_result')

Out[8]:

Draw tree

use phylip drawgram, drawtree

In [9]:

## draw tree with python package "ete2"

from ete2 import Tree, TreeStyle

handle = open('outtree')
lines = handle.readlines()
handle.close()
newick = ''.join([line.strip() for line in lines])
t = Tree(newick)
t.show()
#t.render('myTree.png', w=150, h = 150, units='mm')
#Image(filename='myTree.png')

No module named MySQLdb

 MySQLdb module could not be loaded

In [10]:

## draw tree with phylip "drawtree"
## phylip "drawtree" make ps format file. Converting file format is needed to render image.

loc_fontfile = 'font1' # check the location of fontfile
handle = open('drawtree_input', 'w')
handle.write('outtree\n{0}\nY\n'.format(loc_fontfile)) 
handle.close()
commands = []
commands.append('drawtree < drawtree_input > drawtree_result') # use phylip "drawtree"
commands.append('mv plotfile drawtree.ps')
commands.append('convert drawtree.ps drawtree.png') # use "convert"
for command in commands:
    os.system(command)
Image(filename='drawtree.png')

Out[10]:

In [11]:

# draw tree with phylip "fitch" result

handle = open('outfile')
lines = handle.readlines()
handle.close()
start_index_str = 'Negative branch lengths not allowed\n'
end_index_str = 'remember: this is an unrooted tree!\n'
print '<Unrooted Tree>'
print ''.join(lines[lines.index(start_index_str)+1: 
                                    lines.index(end_index_str)])

<Unrooted Tree>


        +--sample16  
      +-7 
      ! +--sample2   
    +-9 
    ! ! +---sample3   
    ! +-3 
  +-2   !  +--sample4   
  ! !   +-13  
  ! !      +---sample5   
  ! ! 
  ! +----sample6   
  ! 
  !    +-----sample7   
  !    !  
  ! +-12     +---sample8   
  ! !  !  +-11  
  ! !  !  !  +----sample9   
  ! !  +-10  
  ! !     !  +----sample10  
  ! !     !  !  
  ! !     +-14   +---sample11  
  4-1        ! +-6 
  ! !        ! ! ! +---sample12  
  ! !        +-5 +-8 
  ! !          !   +----sample13  
  ! !          ! 
  ! !          +----sample14  
  ! ! 
  ! +----sample15  
  ! 
  +----sample1

Significant diff SNPs

use plink (consider all of heterozygosity as 1) up to manhattan plot PLINK case-control association test result("assoc") fields summary

CHR     Chromosome
SNP     SNP ID
BP      Physical position (base-pair)
A1      Minor allele name (based on whole sample)
F_A     Frequency of this allele in cases
F_U     Frequency of this allele in controls
A2      Major allele name
CHISQ   Basic allelic test chi-square (1df)
P       Asymptotic p-value for this test
OR      Estimated odds ratio (for A1, i.e. A2 is reference)

In [12]:

# make PLINK input files
group1 = ['sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6']
group2 = ['sample7', 'sample8', 'sample9', 'sample10', 'sample11', 
          'sample12', 'sample13', 'sample14', 'sample15','sample16']

## make MAP file
data_plink = pd.read_csv('snp_chips.parsed.csv', index_col=2)
data_plink = data_plink.loc[data.index,:]
data_plink['id'] = data_plink.index
data_plink.drop_duplicates(cols='id', take_last=True, inplace=True)
data_map = data_plink[['chr', 'id', 'pos']]
data_map.to_csv('data.map', sep=' ', index=False, header=False)

## make PED file
data_plink = data_plink.drop(['chr', 'pos', 'id'], axis=1)
data_ped = data_plink.apply(lambda x: ' '.join('{0} {1}'.format(
                                        snp[0], snp[-1]) for snp in x))
str_output = ''
for sample in data_ped.index:
    str_output += '{0} {1} {2}\n'.format(sample, 1 if sample in group1
                                         else 2, data_ped[sample])
handle = open('data.ped', 'w')
handle.write(str_output)
handle.close()

# execute plink
plink_command = 'plink --file data --assoc --no-fid --no-parents '\
                                '--allow-no-sex --no-sex --map3 --adjust'
os.system(plink_command)

Out[12]:

In [13]:

# draw Manhattan plot
import matplotlib.cm as cm

plink_map = pd.read_csv('data.map', header=None, sep=' ', index_col=1)
plink_map.columns = ['chr', 'pos']
handle = open('plink.assoc')
lines = handle.readlines()
handle.close()
plink_pvalue = pd.DataFrame([{'id': line.strip().split()[1], 
            'pvalue':line.strip().split()[8]} for line\
            in lines[1:] if line.strip().split()[8] != 'NA' ])
plink_pvalue.index = plink_pvalue['id']
plink_map = plink_map.loc[plink_pvalue.index, :]
result = pd.concat([plink_map, plink_pvalue], join='outer', axis=1)
chr_x_axis = result.pivot_table(cols='chr', aggfunc='max')
chr_x_axis = chr_x_axis.loc['pos',].cumsum()
x_axis = result.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                      if x['chr'] != 0else x['pos'], axis=1)
y_axis = pd.Series([-math.log(float(x), 10) for x in result['pvalue']])

fig, ax = plt.subplots(figsize=(15,4))

colors = cm.rainbow(np.linspace(0, 1, len(chr_x_axis)))
for i, chr in enumerate(chr_x_axis.index):
    rt = result.loc[result['chr'] == chr, :]
    x_axis = rt.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                      if x['chr'] != 0else x['pos'], axis=1)
    y_axis = pd.Series([-math.log(float(x), 10) for x in rt['pvalue']])
    
    ax.plot(x_axis, y_axis, ls='*', marker='o',color=colors[i])
ax.set_xticks([0] + [i for i in chr_x_axis.values])
ax.set_xticklabels(['chr{}'.format(i) for i in chr_x_axis.index],
                   rotation=-45)
ax.grid(True)
ax.set_title('Manhattan plot')
ax.set_xlabel('chromosome (bp)')
ax.set_ylabel('pvalue (-log10)')

Out[13]:

<matplotlib.text.Text at 0x10a23cdd0>

SNPs diversity

ROD

\[ ROD = 1 - \pi_{cul} / \pi_{wild} \] \[ \pi = \Sigma_{ij}x_ix_j\pi_{ij} = 2* \Sigma_{i=1}^n\Sigma_{j=1}^{i-1}x_ix_j\pi_{ij} \]
where \(x_i\) and \(x_j\) are the respective frequencies of the ith and jth sequences, \(\pi_{ij}\) is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample.
- calculation example : http://www.vcu.edu/csbc/lfsc520/week7/Week7LectureNotes.pdf

selective sweep analysis\({}^1\)

We used allele counts at SNP positions to identify signatures of selection in sliding 40-kb windows, for pools of sequence data. For each pool and SNP, we determined the numbers of reads corresponding to the most and least abundant allele (\(n_{MAJ}\) and \(n_{MIN}\)). For each window in each breed pool, we calculated a pooled heterozygosity score: \(H_p = 2\Sigma n_{MAJ}\Sigma n_{MIN}/ (\Sigma n_{MAJ} + \Sigma n_{MIN}){}^2\), where SnMAJ and SnMIN are the sums of \(n_{MAJ}\) and, respectively, \(n_{MIN}\) for all SNPs in the window. Individual \(H_p\) values were then Z-transformed as follows: \(ZH_p = (H_p - \mu H_p)/ \sigma H_p\). \({}_{1. http://www.nature.com/nature/journal/v464/n7288/abs/nature08832.html}\)

In []:

genetic distance 란 같은 종안의 집단간의 genetic divergence 를 측정하는 것.

evolution model에 따라 여러가지 방법이 있는데 아래는 그 중 Nei's standard genetic distance 의 방법의 예이다.

- Neil's genetic distance

\[ D{}_N = -\ln{I_N} \]

\[ I_N ={\sum^m_{i=1}(p_{ix}p_{iy}) \over [(\sum^m_{i=1}p^2_{ix})(\sum^m_{i=1}p^2_{iy})]^{1/2}} \]

여기서 \(p_{ix}\) 는 x 집단의 i 번째 allele 의 frequency를 의미.

예)
멸종 위기의 딱따구리가 아래와 같이 3개의 집단으로 구성되어 있고 각 집단의 dehydrogenase(LDH) locus 에서 allele (B,C,D) 의 frequency가 아래와 같다고 할때, Venon 집단과 Apalachicola 집단의 Nei's genetic distatnce 를 구해보면,

LDH allele	Venon	Apalachicola	Gameland
B	0.023	0.019	0.981
C	0.977	0.885	0.019
D	0.000	0.096	0.000

\( \Sigma p^2_{ix} = 0.023^2 + 0.977^2 + 0.000^2 = 0.955 \),
\( \Sigma p^2_{iy} = 0.019^2 + 0.885^2 + 0.096^2 = 0.793 \),
\( \Sigma p_{iy} \Sigma p_{ix} = 0.023 \times 0.019 + 0.977 \times 0.885 + 0 \times 0.096 = 0.793 \),
\(D_N = -\ln{I_n} = -\ln{0.994} = 0.006\).

이 D 값은 크면 클수록 집단간의 거리가 멀다는 의미. 초파리의 경우 allozyme genetic distance를 예로 보자면 isolated population (D=0.03), sub-species(D=0.23), distinct species(D=1.21).
그러나 이 값은 종다마 다르고 또한 유전자 마다 다르기 때문에 정확하진 않음.

* 위 내용은 Introduction to Conservation Genetics 책을 바탕으로 함.

Throw a stone at me

Wednesday, January 14, 2015

SNP chip analysis (from raw data to phylogenetic tree)

아래 과정은 snp chip에서 case-control GWAS study까지의 대략적인 코드이다.

Parse raw file

SNP simple stats

PCA dimension reduction

Phylogenetic tree

Calculate genetic distance

Convert distance matrix to newick format

Draw tree

Significant diff SNPs

SNPs diversity

ROD

selective sweep analysis\({}^1\)

Tuesday, January 13, 2015

Genetic distance

Monday, January 12, 2015

mathjax in blogger

Tuesday, January 6, 2015

binomial test confidence interval

http://www.sigmazone.com/binomial_confidence_interval.htm

binomial test 의 confidence interval을 구하는 방법

1. Normal Approximation Method of the Binomial Confidence Interval

2. Exact Confidence Interval

Normal Approximation Method of the Binomial Confidence Interval 과 Exact Confidence Interval 의 차이였던듯..