Wednesday, January 21, 2015

Statistical models in human genetics (lesson 1)

Contents 
  • genetic mapping
  • data for genetic study
  • 3 common questions in genetic mapping
  • common designs for genetic studies



Genetic Mapping
Comparing the inheritance pattern of a trait with the inheritance pattern of chromosomal regions, i.e. "positional cloning"
- positional cloning : finding where a gene is located, regardless of knowing which gene it is.


Data for a Genetic Study
  • pedigree : a set of individuals with known relationships (e.g., a family)
  • genetic markers : genetic variants that can be measured easily
  • phenotype : a Mendelian phenotype or a complex phenotype

3 Common Questions in Genetic Mapping
  • Is a trait genetic? => Epidemiological study
      • Examine the distribution of the trait in the population and among relatives.
      • For Crohn's disease, the prevalence in the general population differs from the prevalence among twins of affected individuals.
  • Where are those genes? => Linkage analysis
      • Comparing pedigrees with Crohn's disease against genetic markers showed linkage to the "D16S3136" marker, whose position is known.
  • What are those genes? => Association analysis
      • Many Crohn's patients carry a mutation in the NOD2 gene.

    Common Designs for Genetic Studies
    • parametric linkage analysis
      • Evaluates a specific model and location (allele frequencies at the disease loci, penetrance for each genotype)
      • Identify variants with relatively large contributions to disease risk
    • allele-sharing analysis
    • association analysis
      • Identify genetic variants with relatively small individual contributions to disease risk
      • The simplest approach compares allele frequencies between cases and controls (see the sketch after this list)
      • Can be run genome-wide with millions of markers or focused on a few candidate genes
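
    As a concrete illustration of the simplest association test above, here is a minimal sketch of the allelic chi-square test comparing allele counts between cases and controls; the counts are made up for illustration.

    # minimal sketch of an allelic case-control association test (hypothetical counts)
    from scipy.stats import chi2_contingency

    # rows: cases, controls; columns: counts of allele A1 and allele A2
    table = [[120,  80],   # cases
             [ 90, 110]]   # controls
    chi2, pvalue, dof, expected = chi2_contingency(table, correction=False)
    print('chi2 = {:.2f}, p = {:.4f}'.format(chi2, pvalue))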




    * The above is based on http://www.sph.umich.edu/csg/abecasis/class/666.01.pdf

    Tuesday, January 20, 2015

    ipython notebook to pdf

    1. ipython nbconvert sample.ipynb --to latex --post PDF
    trouble: an error occurs (cause unknown)

    2. ipython nbconvert sample.ipynb --to latex, then pdflatex sample.tex
    trouble: for Korean text, an error occurs even though the file is utf-8

    3. brew install Caskroom/cask/wkhtmltopdf, then
    ipython nbconvert sample.ipynb --to html, then wkhtmltopdf sample.html sample.pdf
    trouble: if a plot falls on a page break, the plot gets split

    Friday, January 16, 2015

    Model of Evolution

    The following is a summary of http://www.evolution-textbook.org/content/free/contents/ch28.html

    Mathematical theory can be used in two distinct ways

    Mathematical descriptions that help us understand evolutionary biology are used in two ways.



    1. To compute and predict the outcomes of various evolutionary processes and to identify the key factors in those processes (Figure A above; dots represent genes, colors represent alleles).
    • How will a complex trait influenced by many genes evolve? How will random drift, gene flow, and selection change a population?
    • Mathematical descriptions are used here less for concrete, precise quantitative predictions than to aid the understanding needed to answer such questions.
    • "Toy models" built with mathematics make explicit, at least in principle, the effects of arguments that are otherwise only stated verbally.

    2. To go backward in time and trace the origin of individuals or genes sampled from a population (Figure B above).
    • For example, estimating the ancestry of a sample from a crime scene, or estimating the age of a mutation that causes a genetic disease.

    Deterministic processes

    A population is described by the proportions of the genotypes it contains
    Since a population is a collection of entities of several types, representing the proportion of each type is important. The entity can be an individual or a gene; when the entity is a gene, alleles are its different types. When building a model of a population, it is important to distinguish variables from parameters.

    • variables are quantities such as allele frequencies
    • parameters are the factors that determine how the population evolves, such as selection coefficients, recombination rates, and mutation rates (see the sketch below)
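
    As a minimal sketch of this deterministic view (allele frequency as the variable, a selection coefficient as the parameter), the one-locus haploid selection recurrence below iterates frequency change over generations; the parameter values are arbitrary.

    # minimal sketch: deterministic change of an allele frequency p (a variable)
    # driven by a selection coefficient s (a parameter)
    def next_freq(p, s):
        # one generation of haploid selection: p' = p(1+s) / (1 + s*p)
        return p * (1 + s) / (1 + s * p)

    p, s = 0.01, 0.05   # arbitrary starting frequency and selection coefficient
    for generation in range(200):
        p = next_freq(p, s)
    print('allele frequency after 200 generations: {:.3f}'.format(p))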


    Reproducing populations tend to grow exponentially
    Gradual evolution can be described by differential equations
    Populations tend to evolve toward stable equilibria
    Stability is determined by the leading eigenvalue
    Stability analysis extends to several variables

    Random processes

    Our understanding of random events is relatively recent
    A random process is described by its probability distribution
    Probability can be thought of in several ways
    Probabilities can be assigned to unique events
    The increase of independently reproducing individuals is a branching process
    The sum of many independent events follows a normal distribution
    Random walks can be approximated by diffusion
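
    To make these headings concrete, here is a minimal sketch of genetic drift as a Wright-Fisher binomial random walk in allele frequency; the population size and starting frequency are arbitrary, and this is the kind of process the diffusion approximation describes.

    # minimal sketch: allele frequency as a random walk (Wright-Fisher drift)
    import numpy as np

    np.random.seed(0)
    N, p, generations = 100, 0.5, 200   # arbitrary diploid population size and start
    trajectory = [p]
    for _ in range(generations):
        # each generation, 2N allele copies are drawn binomially at the current frequency
        p = np.random.binomial(2 * N, p) / float(2 * N)
        trajectory.append(p)
    print('final allele frequency: {:.3f}'.format(trajectory[-1]))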

    Wednesday, January 14, 2015

    SNP chip analysis (from raw data to phylogenetic tree)


    The steps below are rough code going from SNP chip raw data to a case-control GWAS study.


    Parse raw file

    The parsing script below keeps one column per sample from the raw SNP chip CSV (each sample spans four replicate columns), encodes each four-letter genotype call as a numeric value, and writes the original, numeric, and fully-called (no missing value) tables.
    In [1]:
    #parse raw file
    
    import csv
    
    handle = open("snp_chips.csv")
    reader = csv.reader(handle)
    replicate = 4
    start_column = -16 * replicate + replicate - 1
    samples = reader.next()[start_column : : replicate]
    samples = ['snp_id'] + [sample.split('.')[0] for sample in samples]
    print 'samples      ::', samples
    print 'sample count ::', len(samples) - 1
    
    
    str_output_ori = '{}\n'.format(','.join(samples))
    str_output_val = str_output_ori
    str_output_val_filtered = str_output_val
    str_output_ori = 'chr,pos,' + str_output_ori
    pure_row_count = 0
    for sp in reader:
        if sp[0][0] == '#':
            continue
        snps = [sp[1]] + sp[start_column :: replicate]
        str_snps = ','.join(snps)
        str_output_ori += '{0},{1},{2}\n'.format(int(sp[53][-2:]),
                                                 sp[56], str_snps)
        str_snps = str_snps.replace("AAAA","0")
        str_snps = str_snps.replace("AAAC","1")
        str_snps = str_snps.replace("AACC","2")
        str_snps = str_snps.replace("ACCC","3")
        str_snps = str_snps.replace("CCCC","4")
        
        str_snps = str_snps.replace("AAAG","1")
        str_snps = str_snps.replace("AAGG","2")
        str_snps = str_snps.replace("AGGG","3")
        str_snps = str_snps.replace("GGGG","4")
        
        str_snps = str_snps.replace("TTTT","0")
        str_snps = str_snps.replace("TTTC","1")
        str_snps = str_snps.replace("TTCC","2")
        str_snps = str_snps.replace("TCCC","3")
        str_snps = str_snps.replace("CCCC","4")
        
        str_snps = str_snps.replace("TTTG","1")
        str_snps = str_snps.replace("TTGG","2")
        str_snps = str_snps.replace("TGGG","3")
        str_snps = str_snps.replace("GGGG","4")
        
        str_snps = str_snps.replace("X","nan")
        str_snps = str_snps.replace("NC","nan")
        str_output_val += '{}\n'.format(str_snps)
        if 'nan' not in str_snps:
            pure_row_count += 1
            str_output_val_filtered += '{}\n'.format(str_snps)
    
    #print 'snps which all samples have value :: ', pure_row_count
    
    handle.close()
    
    handle_w = open('snp_chips.parsed.csv','w')
    handle_w.write(str_output_ori)
    handle_w.close()
    handle_w = open('snp_chips.parsed.value.csv','w')
    handle_w.write(str_output_val)
    handle_w.close()
    handle_w = open('snp_chips.parsed.value.filtered.csv','w')
    handle_w.write(str_output_val_filtered)
    handle_w.close()
    
    samples      :: ['snp_id', 'sample1', 'sample2', 'sample9', 'sample3', 'sample5', 'sample14', 'sample13', 'sample11', 'sample16', 'sample12', 'sample15', 'sample6', 'sample8', 'sample7', 'sample4', 'sample10']
    sample count :: 16
    
    

    SNP simple stats

    In [2]:
    #SNP simple stats
    
    import pandas as pd
    import matplotlib.pyplot as plt
    
    data = pd.read_csv('snp_chips.parsed.value.csv', index_col=0)
    ## hist duplicated probes
    duplicated_snps = []
    indexes = list(data.index)
    for snp in indexes:
        duplicated_count = indexes.count(snp)
        if duplicated_count > 1:
            duplicated_snps.append(snp)
    print "duplicate probes :: ", len(set(duplicated_snps))
    
    data["index"] = data.index
    data.drop_duplicates(cols='index', take_last=True, inplace=True) # remove
                                                            # duplicate row
    del data["index"]
    
    ## plot non-nan count of samples
    snp_count_per_sample = data.apply(lambda x: sum(x.notnull()))
    fig, ax = plt.subplots(figsize=(15,4))
    ax.plot(snp_count_per_sample, marker='o', ls='-', color='blue')
    ax.set_xticks(range(0,16))
    ax.set_xticklabels(snp_count_per_sample.index, rotation=-45)
    ax.set_xlim([-1,16])
    ax.grid(True)
    ax.set_title('# of SNPs with value per sample')
    ax.set_xlabel('sample names')
    ax.set_ylabel('# of SNPs with value')
    for i in range(len(snp_count_per_sample.index)):
        ax.text(i, snp_count_per_sample[i], snp_count_per_sample[i], 
                ha='center', va='bottom')
    
    duplicate probes ::  455
    
    
    In [3]:
    data = data.loc[data.apply(lambda x: sum(x.isnull()),axis=1) == 0,:]
    print 'filtered data ::', data.shape
    
    filtered data :: (2494, 16)
    
    

    PCA dimension reduction

    In [4]:
    #PCA dimension reduction
    
    from sklearn.decomposition import PCA
    import matplotlib.cm as cm   # cm.rainbow is used below
    import math
    import numpy as np
    
    
    pca = PCA(n_components=3)
    pca.fit(data.T)
    tc = pca.transform(data.T)
    
    colors = cm.rainbow(np.linspace(0, 1, len(tc)))
    fig, ax = plt.subplots(figsize=(10,10))
    for i, sample in enumerate(data.columns):
        ax.plot(tc[i, 0], tc[i, 1], marker='o', ls='', color=colors[i])
        ax.text(tc[i, 0], tc[i, 1], sample, ha='left', va='bottom', 
                color=colors[i])
    ax.set_title('PCA reduction')
    ax.set_xlabel('principal component 1')
    ax.set_ylabel('principal component 2')
    
    Out[4]:
    <matplotlib.text.Text at 0x107666510>
    

    Phylogenetic tree

    1. Calculate genetic distance

    • Calculate Genetic distance (http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1002451)
      : Genetic distance (D) between pair-wise combinations of individuals was calculated as in PLINK, where

      \[ D = 1 - {(IBS2 + 0.5 \times IBS1) \over N} \]
      IBS2 and IBS1 are the number of loci that share either 2 or 1 alleles identical by state (IBS), respectively, and N is the number of loci tested (a made-up worked example follows this list).
    • Make distance matrix file (phylip input file)
      : The input format for distance data is straightforward. The first line of the input file contains the number of species. There follows species data, starting, as with all other programs, with a species name. The species name is ten characters long, and must be padded out with blanks if shorter. And every cell should be space delimited.
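
    For a quick worked example with made-up counts: if N = 1000 loci are tested, of which IBS2 = 800 share both alleles and IBS1 = 150 share one allele IBS, then \( D = 1 - (800 + 0.5 \times 150)/1000 = 0.125 \).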
    In [7]:
    ## calculate genetic distance
    import math
    
    distance_file = 'distance_matrix.txt'
    
    # distance = mean absolute difference of the 0-4 genotype codes, scaled by allele_num
    def calculate_distance(s1, s2, allele_num):
        return 1 - (allele_num - (s1 - s2).apply(math.fabs)).sum() / (allele_num * len(s1))
    distance_matrix = data.apply(lambda x1: data.apply(
                            lambda x2: calculate_distance(x1, x2, 4)))
    str_output = '{0}\n'.format(len(distance_matrix))
    for sample in distance_matrix.index:
        sample_name = sample[:10] if len(sample) > 10 else sample + ' '*(10-len(sample))
        str_output += '{0} {1}\n'.format(sample_name,
            ' '.join(['{:.4f}'.format(v) for v in distance_matrix.loc[sample,]]))
    
    handle = open(distance_file, 'w')
    handle.write(str_output)
    handle.close()
    

    2. Convert distance matrix to newick format

    Use Phylip (check "Running the programs in background or under control of a command file" in http://evolution.genetics.washington.edu/phylip/doc/main.html).
    In [8]:
    import os
    from IPython.display import Image
    
    if os.path.isfile('outfile'):
        os.remove('outfile')
    if os.path.isfile('outtree'):
        os.remove('outtree')
    str_output = '{0}\nY\n'.format(distance_file)
    handle = open('fitch_input','w')
    handle.write(str_output)
    handle.close()
    os.system('fitch < fitch_input > fitch_result')
    
    Out[8]:
    0
    

    3. Draw tree

    Use phylip drawgram or drawtree.
    In [9]:
    ## draw tree with python package "ete2"
    
    from ete2 import Tree, TreeStyle
    
    handle = open('outtree')
    lines = handle.readlines()
    handle.close()
    newick = ''.join([line.strip() for line in lines])
    t = Tree(newick)
    t.show()
    #t.render('myTree.png', w=150, h = 150, units='mm')
    #Image(filename='myTree.png') 
    
    No module named MySQLdb
    
    
     MySQLdb module could not be loaded
    
    
    In [10]:
    ## draw tree with phylip "drawtree"
    ## phylip "drawtree" make ps format file. Converting file format is needed to render image.
    
    loc_fontfile = 'font1' # check the location of fontfile
    handle = open('drawtree_input', 'w')
    handle.write('outtree\n{0}\nY\n'.format(loc_fontfile)) 
    handle.close()
    commands = []
    commands.append('drawtree < drawtree_input > drawtree_result') # use phylip "drawtree"
    commands.append('mv plotfile drawtree.ps')
    commands.append('convert drawtree.ps drawtree.png') # use "convert"
    for command in commands:
        os.system(command)
    Image(filename='drawtree.png')
    
    Out[10]:
    In [11]:
    # draw tree with phylip "fitch" result
    
    handle = open('outfile')
    lines = handle.readlines()
    handle.close()
    start_index_str = 'Negative branch lengths not allowed\n'
    end_index_str = 'remember: this is an unrooted tree!\n'
    print '<Unrooted Tree>'
    print ''.join(lines[lines.index(start_index_str)+1: 
                                        lines.index(end_index_str)])
    
    <Unrooted Tree>
    
    
            +--sample16  
          +-7 
          ! +--sample2   
        +-9 
        ! ! +---sample3   
        ! +-3 
      +-2   !  +--sample4   
      ! !   +-13  
      ! !      +---sample5   
      ! ! 
      ! +----sample6   
      ! 
      !    +-----sample7   
      !    !  
      ! +-12     +---sample8   
      ! !  !  +-11  
      ! !  !  !  +----sample9   
      ! !  +-10  
      ! !     !  +----sample10  
      ! !     !  !  
      ! !     +-14   +---sample11  
      4-1        ! +-6 
      ! !        ! ! ! +---sample12  
      ! !        +-5 +-8 
      ! !          !   +----sample13  
      ! !          ! 
      ! !          +----sample14  
      ! ! 
      ! +----sample15  
      ! 
      +----sample1   
    
    
    
    
    

    Significant diff SNPs

    Use PLINK (treating all heterozygous calls as a single heterozygote class), up through a Manhattan plot. Summary of the fields in the PLINK case-control association test result ("assoc"):
    CHR     Chromosome
    SNP     SNP ID
    BP      Physical position (base-pair)
    A1      Minor allele name (based on whole sample)
    F_A     Frequency of this allele in cases
    F_U     Frequency of this allele in controls
    A2      Major allele name
    CHISQ   Basic allelic test chi-square (1df)
    P       Asymptotic p-value for this test
    OR      Estimated odds ratio (for A1, i.e. A2 is reference)
    
    In [12]:
    # make PLINK input files
    group1 = ['sample1', 'sample2', 'sample3', 'sample4', 'sample5', 'sample6']
    group2 = ['sample7', 'sample8', 'sample9', 'sample10', 'sample11', 
              'sample12', 'sample13', 'sample14', 'sample15','sample16']
    
    ## make MAP file
    data_plink = pd.read_csv('snp_chips.parsed.csv', index_col=2)
    data_plink = data_plink.loc[data.index,:]
    data_plink['id'] = data_plink.index
    data_plink.drop_duplicates(cols='id', take_last=True, inplace=True)
    data_map = data_plink[['chr', 'id', 'pos']]
    data_map.to_csv('data.map', sep=' ', index=False, header=False)
    
    ## make PED file
    data_plink = data_plink.drop(['chr', 'pos', 'id'], axis=1)
    data_ped = data_plink.apply(lambda x: ' '.join('{0} {1}'.format(
                                            snp[0], snp[-1]) for snp in x))
    str_output = ''
    for sample in data_ped.index:
        str_output += '{0} {1} {2}\n'.format(sample, 1 if sample in group1
                                             else 2, data_ped[sample])
    handle = open('data.ped', 'w')
    handle.write(str_output)
    handle.close()
    
    # execute plink
    plink_command = 'plink --file data --assoc --no-fid --no-parents '\
                                    '--allow-no-sex --no-sex --map3 --adjust'
    os.system(plink_command)
    
    Out[12]:
    0
    
    In [13]:
    # draw Manhattan plot
    import matplotlib.cm as cm
    
    plink_map = pd.read_csv('data.map', header=None, sep=' ', index_col=1)
    plink_map.columns = ['chr', 'pos']
    handle = open('plink.assoc')
    lines = handle.readlines()
    handle.close()
    plink_pvalue = pd.DataFrame([{'id': line.strip().split()[1], 
                'pvalue':line.strip().split()[8]} for line\
                in lines[1:] if line.strip().split()[8] != 'NA' ])
    plink_pvalue.index = plink_pvalue['id']
    plink_map = plink_map.loc[plink_pvalue.index, :]
    result = pd.concat([plink_map, plink_pvalue], join='outer', axis=1)
    chr_x_axis = result.pivot_table(cols='chr', aggfunc='max')
    chr_x_axis = chr_x_axis.loc['pos',].cumsum()
    x_axis = result.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                          if x['chr'] != 0 else x['pos'], axis=1)
    y_axis = pd.Series([-math.log(float(x), 10) for x in result['pvalue']])
    
    fig, ax = plt.subplots(figsize=(15,4))
    
    colors = cm.rainbow(np.linspace(0, 1, len(chr_x_axis)))
    for i, chr in enumerate(chr_x_axis.index):
        rt = result.loc[result['chr'] == chr, :]
        x_axis = rt.apply(lambda x: (x['pos'] + chr_x_axis[x['chr']-1])\
                          if x['chr'] != 0 else x['pos'], axis=1)
        y_axis = pd.Series([-math.log(float(x), 10) for x in rt['pvalue']])
        
        ax.plot(x_axis, y_axis, ls='', marker='o', color=colors[i])
    ax.set_xticks([0] + [i for i in chr_x_axis.values])
    ax.set_xticklabels(['chr{}'.format(i) for i in chr_x_axis.index],
                       rotation=-45)
    ax.grid(True)
    ax.set_title('Manhattan plot')
    ax.set_xlabel('chromosome (bp)')
    ax.set_ylabel('pvalue (-log10)')
    
    Out[13]:
    <matplotlib.text.Text at 0x10a23cdd0>
    

    SNPs diversity

    ROD

    \[ ROD = 1 - \pi_{cul} / \pi_{wild} \] \[ \pi = \sum_{ij} x_i x_j \pi_{ij} = 2 \sum_{i=1}^{n} \sum_{j=1}^{i-1} x_i x_j \pi_{ij} \]
    where \(x_i\) and \(x_j\) are the respective frequencies of the ith and jth sequences, \(\pi_{ij}\) is the number of nucleotide differences per nucleotide site between the ith and jth sequences, and n is the number of sequences in the sample.
    - calculation example : http://www.vcu.edu/csbc/lfsc520/week7/Week7LectureNotes.pdf
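
    As a rough illustration of the \(\pi\) formula above, here is a minimal sketch that computes nucleotide diversity for a small, made-up set of aligned sequences (each sampled sequence taken with frequency 1/n); ROD would then be \(1 - \pi_{cul}/\pi_{wild}\) computed from two such samples.

    # minimal sketch: nucleotide diversity (pi) from a small, made-up alignment
    seqs = ['ACGTACGT',
            'ACGTACGA',
            'ACCTACGA']
    n = len(seqs)

    def pi_ij(a, b):
        # proportion of sites that differ between two sequences
        return sum(1 for x, y in zip(a, b) if x != y) / float(len(a))

    # pi = 2 * sum over pairs i > j of x_i * x_j * pi_ij, with x_i = 1/n for each sequence
    pi = 2 * sum(pi_ij(seqs[i], seqs[j]) * (1.0 / n) * (1.0 / n)
                 for i in range(n) for j in range(i))
    print('pi = {:.4f}'.format(pi))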

    selective sweep analysis\({}^1\)

    We used allele counts at SNP positions to identify signatures of selection in sliding 40-kb windows, for pools of sequence data. For each pool and SNP, we determined the numbers of reads corresponding to the most and least abundant allele (\(n_{MAJ}\) and \(n_{MIN}\)). For each window in each breed pool, we calculated a pooled heterozygosity score, \(H_p = 2\Sigma n_{MAJ}\Sigma n_{MIN} / (\Sigma n_{MAJ} + \Sigma n_{MIN})^2\), where \(\Sigma n_{MAJ}\) and \(\Sigma n_{MIN}\) are the sums of \(n_{MAJ}\) and \(n_{MIN}\), respectively, for all SNPs in the window. Individual \(H_p\) values were then Z-transformed as follows: \(ZH_p = (H_p - \mu H_p) / \sigma H_p\). (1. http://www.nature.com/nature/journal/v464/n7288/abs/nature08832.html)
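
    Below is a minimal sketch of this pooled heterozygosity calculation for one pool, assuming a pandas DataFrame with columns 'chr', 'pos', 'n_maj' and 'n_min' (the column names and the 20-kb window step are assumptions, not taken from the paper).

    # minimal sketch: pooled heterozygosity Hp in sliding 40-kb windows, then Z-transform
    import pandas as pd

    def hp_windows(snps, window=40000, step=20000):
        # snps: DataFrame with columns 'chr', 'pos', 'n_maj', 'n_min' (assumed layout)
        rows = []
        for chrom, grp in snps.groupby('chr'):
            start = 0
            while start <= grp['pos'].max():
                w = grp[(grp['pos'] >= start) & (grp['pos'] < start + window)]
                if len(w) > 0:
                    s_maj, s_min = w['n_maj'].sum(), w['n_min'].sum()
                    hp = 2.0 * s_maj * s_min / (s_maj + s_min) ** 2
                    rows.append({'chr': chrom, 'start': start, 'Hp': hp})
                start += step
        result = pd.DataFrame(rows)
        # Z-transform: ZHp = (Hp - mean(Hp)) / std(Hp)
        result['ZHp'] = (result['Hp'] - result['Hp'].mean()) / result['Hp'].std()
        return result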