Throw a stone at me: NGS

Showing posts with label NGS. Show all posts

Thursday, July 7, 2011

Bismark manual 정리

음.. 이 포스팅 역시 보건원 과제에 의한 것으로.. BS-Seq 데이터를 reference에 mapping 할때 쓰는 프로그램인 Bismark의 manual을 정리한 것이다. 일반적으로 BS-Seq 한 논문들 보면 딱히 published 된 프로그램 말고 그냥 reference C->T로 바꾸고 read도 바꾸고 해서 그냥 각종 mapping 프로그램 사용하는데.. 어짜피 내가 짜는 것보단 published 된게 더 나을거 같다 싶어서 가장 최신으로 보이는 Bismark를 사용하기로 한다. 근데 뭐 잘 안맞으면 새로 만들어야 하니까..

사이트는 여기. 음 manual 보니까 Quick reference가 있다.

Tuesday, June 14, 2011

installation of assemblers & some tips(?)

< AllPaths-LG>
-installation-
denovo assembler 중에 한가지인 AllPaths-LG의 셋팅과 사용법에 대해 정리해보겠다. velvet이랑 newbler를 사용해 본적은 있는데 이건 어떨런지.. broad institute에서 나왔다니.. 왠지 메이져의 느낌이. 여튼 RNA-Seq도 다 못봤지만 박사님의 지시라. 보자.
논문에 대한 소개은 다음 포스팅에 있다. 여기선 설치와 사용 방법 결과 설명한 하도록 한다.

일단 AllPaths-LG의 사이트는 여기.
이거 설치는 위 사이트 download 탭에 pdf로 된 manual에 보면 나와있다.
기본적으로 셋팅 되어 있어야 하는게 있다. boost, graphviz, addr2line, picard.

boost : 그냥 다운 받아서 풀어 놓기만 한다. 그런데 ./bootstrap.sh, ./bjam install 이것도 해야 하는것 같네.
graphviz : ./configure --disable-swig, make, make install (dot 만 쓴다니까)
addr2line : 이건 깔려 있고 (man addr2line 쳐본다).
picard : 다운받아서 풀어서 path 잡아준다.

그담에 AllPaths-LG의 build는
./configure --with-boost=/path/to/boost
make
make install

test 파일을 다운받아서 실행해본다. test 폴더 안의 README를 읽고 prepare.sh 파일을 변경한 한다. 그리고 돌리기 전에 ulimit -s 100000을 해서 stack memory를 100MB까지 잡게 한다. 그리고 prepare.sh랑 assembly.sh 실행.

-output format-
test 파일을 돌리게 되면 reference/data/run/ASSEMBLIES/subdir 식의 폴더 구조가 생성된다. 상위폴더에 있는 데이터는 하위폴더에 중복되지 않는다.
reference는 organism 이다. 뭔소린가 하면 예를 들어 두개의 E.coli 를 reference로 assemble을 한다면(reference가 꼭 필요한건 아니지만 reference로 삼을 만한게 있으면 이용이 된단다) 일단은 E.coli라는 reference 폴더 안에 두개의 게놈과 데이터들이 존재하게 되는것.
data 폴더는 말그대로 데이터. assembly를 시도했던 read data를 담게 된다. data 폴더안에 여러개의 run 폴더가 존재 할수 있는데 이들 run 폴더간의 차이는 parameter setting이 다른 경우.
run 폴더는 final assembly stage를 위한 intermediate file이 들어 있는 폴더
ASSEMBLIES 폴더는 실제 assembly 가 있는 폴더인데 유일하게 폴더명이 ASSEMBLIES로 정해져 있다.
subdir 폴더는 음.. localized assembly가 생성된 폴더라는데..

-preparing data-
AllPaths의 input 파일이 완전 자기네들 스타일인데..
read data를 그들 형식으로 바꾸고 이 read data를 설명하는 metadata를 더해서 input 파일을 만든다. 그리고 reference 있으면 이것도 추가해야 한다.
일단 AllPaths는 paired end만 취급한다. 그리고 fragment library(180bp) 랑 jumping library(3-10kbp)가 반드시 있어야 한다. long jumping library는 option. 각 library마다 paired-end의 방향이 다르다. fragment, long jumping은 inward

jumping library는 outward, (mate-pair 방식)

data fold 안에는 반드시 library 마다 *.fastb(fasta binary), *.qualb(quality binary), *.pairs(read pair information) 파일과 ploidy 파일이 있어야 한다.
ploidy 파일은 1은 haploid, 2는 diploid genome을 의미. 나머지는 지원 안됨.
*.fastb, *.qualb, *.pairs 는 PrepareAllPathsInput.pl 으로 생성(.bam, .fastaq, .fasta, .fastab 파일 지원).
PrepareAllPathsInputs.pl의 input 파일로 in_groups.csv랑 in_libs.csv 가 필요.
*.fastq 을 변환하려면 paired read가 한 파일안에 존재 해야 한다. 그러니까 1pair의 1read record가 나오고 그 담에 바로 1pair의 2read record가 나오는 식으로.
quality score가 어떤식으로 encode됐는지 확인해서 PHRED_64 옵션으로 조정해야 한다.
in_group.csv 이랑 in_libs.csv 파일은 메뉴얼 보자

Example in_libs.cvs

나머진 manual 보자. 잘나왔다.

정리하자면
1. fastq 파일 합치고(read1, read2 를 하나의 파일로 합친다.)
2. in_libs.csv랑 in_group.csv 파일 만들고
3. PrepareAllPathsInputs.pl 돌려서 input 파일 만들고(IN_GROUPS_CSV, IN_LIBS_CSV, PHRED_64, PLOIDY,PICARD_TOOLS_DIR, DATA_DIR 파라미터 확인)
4. RunAllPathsLG 파이프 라인을 돌린다.

< SOAPdenovo>
-installation-
make

< Velvet>
-installation-
make (velvet manual 특히 for impatient people에 잘 되어 있음)

< Abyss>
-installation-
google sparsehash, pthread, openmp 설치
Google sparsehash 설치. 구글 코드 사이트에서 다운 받아서 설치
pthread 랑 openmp 서버에 있는거 같다. 찾아보니(c 코드로 #include 랑 #include 넣고 짯을 때 에러 안나면 그 헤더 파일이 시스템에 잡혀 있는것)..

Thursday, May 26, 2011

RNA-seq 에 관한 이것 저것

<expression unit>
RPKM : reads per kilobase of transcript per million of sequenced read
FPKM : expected fragments per kilobase of transcript per million fragments sequenced

음.. 정확한 개념은 ERANGE 논문과 Cufflinks 논문을 참조하면 된다.

간단한 intro는 Cufflinks 논문 intro에 잘 나와있는데.. RPKM이라는 개념이 gene 단위 레벨에서의 expression unit으로 생각한다면 FPKM은 transcript 단위에서의(하나의 gene에서도 여러 isoform의 transcript가 존재 가능) expression unit으로 생각하면 된다.

내가 여기서 이해가 안갔던 단어가 fragment인데.. 이건 뭐냐면 RPKM은 read 하나하나에 의미를 두는데 반해 FPKM은 paired-read의 한 쌍이 fragment를 의미한다 (별거 아닌건데..).

자세한 내용은 cufflinks 논문의 supplementary 데이터에 나오는데.. 이해 되면 다시 정리 하도록 하련다.

참고로 KOBIC에서도 NEUMA 라고 NAR에 논문을 낸게 있는데 이것도 보면 좋을듯. 이것 역시 정리되면 다시 올린기로 한다.

<bias>
EMBL의 자료를 보면 strand를 정보를 살리기 위한 두가지 방법과 그냥 strand를 안살리고 한거랑 tiling array로 비교한 그림이 있는데 bias가 상당하다. random hexamer bias도 있고.. 관련 자료를 좀 훑어봐야 할듯.

Wednesday, May 25, 2011

Trinity

논문 소개는 여기서 했고 실질적으로 setting 하면서 기억해야 할거 command line에서 주의 해야 할것 등을 기록하려 한다. 워낙 붕어 머리라.

일단 trinity download 및 documentation은 여기.

<install>
그냥 tar 랑 gzip 풀고 기본 폴더에서 make 한번 눌러주면 된다. 그럼 inchworm 이랑 chrysalis 가 컴파일 된다(c++로 되어 있음). butterfly 는 java인데 precompiled software라 할 필요 없단다.

<running>
기본 명령어는 sample_data 폴더 안에 runMe.sh 에 있다. run 하는 batch 파일은 기본 폴더에 있는 Trinity.pl 파일이다.
위 링크에 걸려 있는 documentation을 요약하자면..

strand-specific sequencing 일때는 -SS_lib_type 옵션을 사용하란다(default 는 strand 안가린다).
또 fungal genome 처럼 gene이 compact 한 genome에서는 UTR 부위에서 transcript의 overlap이 생기기 때문에 -jaccard_clip 옵션을 사용하란다. 요 옵션은 Bowtie가 깔려 있어야 한단다(정확하게는 jaccard_clip의 옵션의 기능을 모르겠다).
computing grid 일 때는 --run_butterfly 옵션을 사용하지 말고, grid 가 아닐때 단순시 cpu가 많을땐 --run_butterfly 옵션이랑 --cpu 옵션 넣어서 돌려라. --run_butterfly 옵션 없을때 결과 폴더 안에 Chrysalis 폴더 안에 butterfly_commnads 가 있는데 이걸 이용해서 grid를 사용하라.

<hardware requirements>
1G of RAM per 1M pairs of Illumina reads. Chrysalis 단계가 deep recursion 땜시 stack 메모리가 많이 필요하단다. 이럴 때 세팅은 여기처럼 하란다(Q&A에 나온다).
전체 시간은 1 hour per million pairs of reads(이 표현이 정확이 1Mbp 당인지 아니면 1M paired-end count를 이야기 하는건지 정확히 모르겠네).

<output file>
--output 옵션에 지정한 대로 아니면 default 인 Trinity_dir 폴더안에 Trinity.fasta 파일이 생기는데 이는 full-length transcript의 fasta 파일.

좀더 자세하게 알고 싶으면 여기, advanced guide

<Inchworm>
read를 k-mer로 쪼개서 hash table에 저장한다. 이때 base는 2-bit encoding을 사용한다(A:00,G:01 이런식). k-mer는 25 로 고정(package로 돌릴때), 단 따로 Inchworm만 돌리면 32까지는 허용.
monitor.out 이란 파일이 inchworm의 log 파일.
inchworm.K25.L48.fa 가 inchworm의 결과 파일. K는 k-mer를 뜻하고 L은 contig의 minimum Length를 의미. 그 안에 보면 각 contig의 이름에 >a1;142 K: 25 length: 3697로 되어 있는데 a1뒤의 숫자는 k-mer의 average frequency. 이는 나중에 expression quantity에 사용된다.

<Chrysalis>
Chrysalis는 inchworm에서 생긴 contig들을 alternative spliced transcript 나 paralogous transcript들끼리 clustering 하는 역할. 각 cluster의 contig를 사용해서 de bruijn graph 생성하고 read를 mapping한다. 각 cluster의 de bruijn graph는 butterfly 단계를 위해 각각 다른 파일로 저장됨
이때 생기는 파일은 chrysalis/RawComps.0 폴더 안에 있는데 파일들은 이렇단다.

*chrysalis intermediate outputs*
chrysalis/RawComps.0/comp9.raw.graph :de Bruijn graph based on Inchworm contigs only
chrysalis/RawComps.0/comp9.raw.fasta :raw RNA-Seq reads that map to the Inchworm contigs based on k-mer composition
*chrysalis outputs to be used by Butterfly as inputs*
chrysalis/RawComps.0/comp9.out :the de Bruijn graph with edge weights incorporating the mapped reads
chrysalis/RawComps.0/comp9.reads :the read sequences and anchor points within the above graph

이것과 함께 butterfly가 돌수 있는 명령문이 있는 butterfly_commands와 butterfly_commands.adj

가 생성되는데 butterfly_commands.adj가 heap size랑 parameter등이 있는 파일로 butterfly 실행시 이 파일을 실행해야 한다.

<Butterfly>
butterfly는 이제 chrysalis에서 생긴 *.out과 *.reads 파일을 보고 가능성 있어 보이는 trascript를 fasta 파일로 report 한다.
Trinity.pl 에다가 --run_butterfly를 줬으면 이건 /util/cmd_process_forker.pl을 돌리는 거다(forker란걸 보니 프로세스 늘려서 실행 시키는 프로그램인 갑다).
결과로 comp*_allProPaths.fasta 라는 파일이 생기는데 이게 possible transcrip의 fasta 파일. accession에 대해선 그냥 메뉴얼보자(쓰기 귀찮다).
butterfly 돌릴때 그러니까 butterfly_commands.adj에 있는 명령문에서 -V 5 라고 하면 돌아가는 상황이 투명하게 보여지는데, 뭔소리냐면 compacted graph structure를 report 하고 graph 순회하면서 들린 node를 report한다. 그 파일이 comp*_justBeforeFidingPaths.dot. 이 파일은 GraphViz 로 볼 수있단다. 이 파일의 accession도 설명하는데 것도 manual 보자

Tuesday, April 5, 2011

sff 파일 파싱해보기(?)

바이너리 파일 핸들링 두번째 시간으로(첫번째는 여기) 저번에 어떻게 binary file 핸들링 하는지 대충 감잡았으니까 이번에는 실전으로다가 지금 필요한 일을 하면서 적용해 보려한다.

이번 미션의 목표 : FLX의 sff 파일을 변경하기. 정확하게 이야기 하자면.. sff 라 함은 the Standard Flowgram File의 약자인데 여기에는 GS FLX 기기의 read와 그것의 trace data가 들어 있다. 보통 read를 trim할때 sff 파일을 fastq 파일로다가 전환해서 quality trimming을 하는데, 회사 과장님 말씀으로는 fastq 파일로 newbler로 assembly 한거랑 sff 파일로 assembly 한거랑은 결과가 다르다는데(fastq가 아닌가.. 그냥 fasta 파일로 했을 때 다르다는건가. 음 근데 다르다면 sff 파일에 뭔가 더 들어 있어서 그 정보를 이용한다는 건데.. 음).. 해서 sff 파일의 format을 이해하고 원하는 방향(trimming)으로다가 변형을 하고자 한다.

sff file format
http://sequence.otago.ac.nz/download/GS_FLX_Software_Manual.pdf page 528부터
여기에 따르면 sff 파일에는 각 homopolymer의 길이에 대한 측정값이 있단다(이건 좀 더 뒷 내용을 봐야 할 것 같다).
일단 sff 파일은 3 section으로 구성
1. common header section : 파일에 한번 나옴, byte size는 31 + number_of_flows_per_read + key_length보다 크거나 같은 가장 가까운 8의 배수. number_of_flows_per_read 가 800이고 flow_char를 보면 TACG가 200번씩 있는걸 보아하니.. 이 flow 라는게 read 그러니까 emulsion PCR이 된 bead를 시퀀싱 할때 TACG 순서대로 각각 200 번씩 nucleotide를 보내줬다는 의미.
2. read header section : byte size는 16 + name_length 보다 크거나 같은 최소의 8배수, 이 section에서 나타내는 position 숫자는 1-based indexing.
3. read data section : byte size는 2*number_of_flows + 3*number_of_bases 보다 크거나 같은 최소의 8배수, number_of_flows는 common header section에 number_of_bases는 read header section에 있다.

% uint8_t, uint16_t, uint32_t, uint64_t 는 각 각 1,2,4,8 byte numeric value를 뜻함, big endian byteorder, character field는 sinlge-byte ASCII character임, 모든 section은 "eight_byte_padding" 이라는 field로 끝난다(이 필드는 0 to 7 bytes of padding, 이게 뭔소리냐, 각 section의 byte가 8 배수가 되게끔 8배수가 되기에 모자라는 byte 만큼 0 으로 padding이 되어 있다.. 뭐 이런 소리). 나머지는 pdf의 표 참조.

보니까 각 section별로 byte size 만큼 읽어 들여서 필요한 정보만 빼오는데.. 내가 원하는게 read clipping(manual 보니까 trimming이란 용어보다 clipping이란 용어를 사용하는거 같다) 이므로 건드려야 할 부분이 read header section의 clip_qual_left, clip_qual_right, clip_adapter_left, clip_adapter_right 부분이다.

Tuesday, March 22, 2011

나중에 de novo assembly 할때 필요할게 될거 같은 논문

이건 음.. 부분부분 tagging 해서 parallel 하게 시퀀싱 하는거
http://www.nature.com/nmeth/journal/v7/n2/full/nmeth.1416.html

hybrid assembly
http://genome.cshlp.org/content/19/2/294.long

Wednesday, March 2, 2011

RNA-seq 분석을 위한 논문 탐험

예전에 RNA-seq 한번 리뷰하고 거기서 ERNAGE라는 프로그램에 관한 논문을 정리 한적이 있었다. 이번에 정말로 RNA-seq 데이터를 다뤄야 하고 예전에는 거의 초점이 eukaryote에 맞춰졌기 때문에 bacteria의 transcriptome에 관련하여 좀더 논문들을 정리해 보고자 한다.

start point
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3025319/

내가 분석해야 하는 대상이 미생물 균주, bacteria 이기 때문에 우선 위 논문을 시작으로 한다. 위 리뷰논문을 보면 관련 논문을 잘 정리해 놨다. 마지막 limitations에 관한 내용을 보면 RNA secondary structure, random hexamer priming, second strand synthesis, PCR amplication stage에 의해 문제가 유발되는데 이는 ion-catalyzed hydrolysis와 direct RNA sequencing으로 어느정도 해결 가능하다고 마무리.. 아 그리고 figure1 이 실험종류와 단계에 대한 정리를 잘해놨다(결론적으로 directional 이나 아니냐 둘로 나뉘는데.. 확실히 directional로 실험을 해야 맞는거 같은데.. 안타깝게도 우리 데이터는 아닌듯). 오른쪽 그림 참조

second paper
http://www.biomedcentral.com/1471-2180/8/72
음.. 우리가 non-directional 로다가 실험을 했기에 또 기기가 FLX인 관계로 위 first point 의 reference 중 상황이 가장 유사한 논문이 바로 위 url. 일단 genome size 가 대략 3Mb, megaplasmid 가 있는거 빼곤 우리 상황이랑 비슷하다.
아.. 이 논문은 아닌거 같다. BMC microbiology에 나온건데.. 초창기 논문이라 그런지 아마도 FLX로 다가 transcriptome을 거의 처음 했다는 이유로.. rRNA depletion을 했는데도 read의 90 % 가 rRNA에 mapping되니. 그래서 아마도 논문의 방향을 novel gene finding으로 전환한듯. 여튼 패스..

third paper
http://www.sciencemag.org/content/326/5957/1268.short
좀 뭔가 의미 있는 내용을 보기 위해 그냥 선택한 논문. science니까. 첫번째 논문에서 인용도 많이 한거 같고. 음.. 읽어보니 확실히 사이언스다. 두번째 논문 봤을 때는 이거 일주일이면 하겠다고 생각했는데..
spotted array, tiling array, rna-seq (rna-seq 도 directional 한 방법을 이용) 모두 이용해서 operon의 boundary를 정한다 (expression이 급격히 떨어지는 region). 그리고 operon이 poly- | mono- cistronic 인지 확인한다 (rna-seq만 이용했을때 false positive가 얼마인지도 조사). operon을 정하고 나면 promoter region을 찾고 대략적인 TSS와 CDS와의 거리도 조사한다. 또한 trascriotion end site의 2차구조를 봐서 hairpin 구조가 transcription termination에 영향이 있는지 확인한다. polycistronic operon에 있는 gene들의 decay behavior도 관측한다. 여까지는 대략적인 transcriptome landscape라고 할까.
그 뒤 여러 다양한 조건에서의 expression 변화를 가지고 context-dependent modulation of operon structure involving repression or activation of operon internal or end-located genes (아.. 그러니까.. 음.. 하나의 operon 안에 여러 유전자가 위치에 따라 module화가 되어 (suboperon 마냥) 상황에 따라 오듈 단위로 다르게 expression 한다 뭐 이런.. 맞나..) 을 봤다. 또 이와 같은 이유는 eukaryote 에서 처럼 다양한 factor의 작용에 의한 것이 아닐까 추측 그리고 factor가 될만한 후보자들을 가지고 expression clustering. 해서 아.. factor가 많음 갑다라고 추측.

중요한건 the proteome organiztion is not explainable by the genome organizaion. 그리고 the expression heterogeneity within operon 이 아마도 생각했던것보다 bacteria의 transcriptional regulation이 eukaryote과 많이 유사하기 때문이지 않나 싶다는 것.

fourth paper
http://bioinformatics.oxfordjournals.org/content/early/2009/10/24/bioinformatics.btp612.full.pdf
다음은 분석 툴에 대한 논문이다. 정확히 이야기 하면 R package.

checklist
1. Segmentation algorithm : for identification of uncovered region in tiling array
2. Local convolution method : finding operon boundary
3. How do they decide polycistronic & monocystronic operon (maybe by DSSS)
4. In polycistronic how they divide genes in operon
5. Sigma 70 promoter region (TSS located within 60bp from CDS start site)

Sunday, January 16, 2011

maq

short read alignment의 대표 프로그램, 1000 genome project에서 사용되었다. 사실 이 논문을 볼생각은 없었는데, 1000 genome 논문 정리하다가
http://genome.gov/Pages/Research/DER/1000GenomesProjectTutorials/DataDescription-GaborMarth.pdf
genotype likelihood라는 용어가 나오는데 이것의 수식이 maq 논문에서 나왔기 때문에 어찌보면 가장 먼저 봐야 할논문이 아닌가 해서 정리한다. 완벽 이해는 아직 한계가 있지만 최대로 끌어 올려본다.

maq 설명

--alignment stage :
1. ungapped match with lowest mismatch score(the sum of qualities at mismatching bases) in first 28bp(less than 2 mismatch)
2. failed in matching but whose mate pair is mapped are searching gapped alignment
3. maq assigns individual alignment a phred-scaled quality score
4. if read aligned multiple position, then maq peak one randomly and that mapping quality is 0.

--SNP calling stage :
1. produce a consensus genotype sequence inferred from a bayesian statistical model (그리고 이런 genotype consensus에다가 틀릴수 있는 확률을 phred quality로 할당한다)
2. compare consensus genotype sequence with reference to detect SNP
3. filter called SNP in step 2 by predefined rules(for compensation of simplificaton and assumptions used in the statistical model, treating neighbor positions independently)

Method

<--single end read mapping-->
1. build multipe hash tables to index the reads
2. scan the reference against the hash tables to find the hits(potential read alignment postion)
In More Detail
1. indexing
-모든 read를 메모리에 올린뒤, read를 indexing (1st template, 2nd template(complementary sequence of 1st template)의 sequence를 24bit integer로 hashing한뒤 이를 interger를 키로 해서 sorting & grouping), 이 정보를 hash table이 갖고 있음, 이런 hash table을 6개 만든다(sequence를 4조각내서 combination for allowance of 2 mismatch).
2. scan reference
-forward, reverse로 base-by-base로 28bp subsequence를 취해서 위의 1step에서와 마찬가지로 indexing한뒤 위 1번 step의 hash tables 에서 hit를 찾는다.
-hit가 발견되면 28bp(seed) 뒤로 전체 시퀀스를 gap 없이 align했을때 mismatch의 quality를 합한다(q).
-coordinate of hit과 read ID를 또 다른 24-bit integer (h) 로 hash한다. 이는 나중에 mutiple alignment되었을때 같은 최저 q를 갖을때 random하게 position을 고를수 있게 한다.

<--Single end mapping qualities-->
mapping quality(Qs) = -10 log10 Pr{read is wrongly mapped}
Qs가 30 이란건 1/1000확률로 잘못 alignment될 확률을 뜻함
x를 reference, z를 read, u 를 position이라고 할때
p(z|x,u)는 reference x의 posiotion u에서 read z가 나올 확률인데 이는 단순하게 생각하면 mismatch의 quality의 곱과 같다. 여튼 구하고자 하는 mapping quality,Qs(u|x,z) = -10 log10[1-ps(u|x,z)] 로 p(z|x,u) 를 베이지안으로 이용하면 구할수 있다 (x때문에 헷갈릴수 있는데 x는 reference 시퀀스이니 항상 주어져 있다고 생각하면 그냥 x를 제거해서 생각하면 된다). 그런데 이 베이지안을 구하기 위해서는 p(z|x) 를 구해야 하는데 이는 reference x의 모든 position에서의 p(z,x)의 합을 구해야 하는데 이는 실질적으로 불가능하다. 그래서 나온게
Qs = min{q2 - q1 - 4.343 logn2, 4+(3-k')(qave -14)-4.343logp1(3-k',28)}
이는 supplementary에 유도되어 있음.
문제는 snp가 mapping시 base error와 같이 취급된다는 것인데, we should set the minimum base error probability as the rate of differences between the reference and the reads(이해한걸론 reference에 비해 실제 sample이 snp를 갖을 비율이 0.001이면, 곧 1000베이스당 1나의 베이스가 snp일 수 있다면 0,001 보다 큰 error확률을 갖는 mismatch base 만을 mapping quality에 사용해야 한다는 것). 그러나 이 역시 approximation이므로, 첫번째 mapping하고 reference를 변형한뒤 다시 mapping해서 mapping quality를 구하길 권장

<--Paired-end read alignment-->

<--Detecting short indels-->

<--Consensus genotype calling-->
MAQ assumes the sample is diploid.
1.before consensus calling, MAQ combines mapping quality & base quality (base quality used in SNP calling cannot exceed the mapping quality of the reads) and reasigns the quality as the smaller value between mapping quality & base quality.
2. genotype likelihood
계산시 최대 많이 나온 2 종류의 nucleotide에 대해서만 고려(나머지는 error로 간주)
이 역시 supplementary 참조
genotype likelihood = Pr(D|genotype) , the probability of data given each possible genotype.
3. prior of genotypes
heterozygote의 확률을 r이라고 가정하면 각각의 homozygote는 (1-r)/2 로 한다. known SNP에 대해선 r은 0.2로 novel SNP에 대해선 0.001을 설정
4. posterior probability
실질적으로 구해야 하는것 P(g|D), 데이터(reads) 가 주어졌을때 genotype의 확률로 g^ = argmax P(g|D) 를 genotype으로 취하고 그 observing quality를 Qg = 10log10 [1 - P(g^|D)]

위랑은 크게 관련 있다고 봐야 할지는 모르겠지만. 그냥 참조
hash : http://internet512.chonbuk.ac.kr/datastructure/hash/hash2.htm

Friday, October 29, 2010

miRNA-seq

miRNA seq에 대한 분석을 리뷰 하기위해.. 다음과 같은 논문을 정리한다.

1.discovering microRNAs from deep sequencing data using miRDeep (Nature computational biology 2008)
2.Current tools for the identification of miRNA genes and their targets (Nucleic Acids Research 2009)
3.miRTRAP, a computational method for the systematic identification of miRNAs from high throughput sequencing data (2010 Genome Biology)
4.DSAP: deep-sequencing small RNA analysis pipeline (2010 Nucleic Acids Research)

위의 1번 논문은 거의 NGS 데어터를 이용한 miRNA 분석의 초창기 논문으로 볼수 있으며
2번 논문의 경우 거의 computational prediction에 가까운 논문이고 3번은 전혀 들여가 보지 않았으며 4번의 경우 web-server로 최근에 나온 논문.

다른건 모르겠는데.. isomiRs를 찾기위한(?) alignment 및 multiple alignment에 대한 공부가 필요한 것 같다(한때 BSA 책 공부 할때 바짝 up 됐었는데...).

PPT

Wednesday, October 20, 2010

Reference-Free validation of Short Read Data

plos one에 얼마전에 나온 논문
NGS 데이터 파이프 라인을 구축하는데 가장 처음의 step으로 들어가게 될 프로그램이라 생각해서 리뷰를 해봤는데...
genome bias 에 대한 이야긴 줄 알았는데 그건 아니다. 하긴 read 만 가지고 전체 genome이 아닌 bias된 게놈만이 시퀀싱 된 줄 알수는 없다만..

이 논문은 리드 자체에 bias가 있는지를 체크 해볼 수 있는 프로그램이라고 보면 될거 같다.
read 별 각 position 별로 AGTC가 잘 분포 되어 있는지 또 k-mer로 정했을때 그 k-mer또한 분포가 position별로 동일한지를 체크 할수 있다 (체크 할수 있다기 보다는 그림으로 그릴수 있다 (우리같이 수주를 하는 사람들한테는 가능한 그림이 많이 들어가는게 좋아보이기 때문에).

java로 된 프로그램인데 사실 내용은 너무나 단순하다. 그런데 그 단순한 프로그램에다가 왜 두번째 analysis는 넣지 않은건지.. java 전혀 모르는 나에게 고생하면서 k-mer 일 때의 frequency가 나오게 만들게 하다니.. 아 그리고 quality 파일 만들어주는 변수가 int형으로 선언되는 바람에 read가 무자게 많으면 overflow나서 음수값이 출력된다는거..(스크립트 언어에 익숙한 나로는 이거 찾느라 피곤하게 됐다.)

알게된 내용:
1. mac에서 개발할 프로그램을 archive로 packing하면 __MACOSX가 딸려 생성된다는 것(첨에 프로그램 풀고 나서 이 폴더가 뭔가 했다.http://floatingsun.net/2007/02/07/whats-with-__macosx-in-zip-files/ : 이것에 대해 한소리 하는 블로거).
2. solexa 의 read가 첫 10 base의 퀄리티는 좋을지 몰라도 상당히 시퀀스가 안정적이지 못하다(뒷부분의 퀄리티가 안좋아 져서 항상 그것만 고려해서 trimming을 햇는데 앞부분의 시퀀스도 썩 훌륭한 경향을 보이지 않다니... 얼마나 짤라내야 하는거야...).

PPT

Monday, September 13, 2010

Bowtie (short read alignment program)

I should have reviewed about mapper program (maq, bowtie..) earlier. Most of the first step in NGS data analysis is mapping read to reference genome. So knowing this process is prior to the others. But I just do this work now.

Here is the PPT.

ah.. I have to confess that some of slides in PPT come from mqoqol's ppt in SEQanswers.
Thanks mqoqol!

I found a blog today.
http://hackmap.blogspot.com/2010/07/aligners-since-starting-methylcoder.html

In the blog, the owner addressed shortcoming of Bowtie. And he introduced GSNAP. Actually in our firm, we made a tentative arrangement to use GSNAP for aligner.

I will also have to review the paper (GSNAP) ASAP.

Wednesday, August 18, 2010

SAM (sequence alignment/map format)

With advent of NGS, there are some projects such as 1000 genome project which can be possible by the technology. There are several sequencing machines and tools for alignment, this cause confusion and trivial handling works of result data for downstream analysis. In my firm, we also felt this problem when members use different tool for mapping the reads. While we tried to find or decide a formal format, I found SAM, so I planed to introduce this to members in my firm.
what a pity! We found this after one year passed from when it was made. In these days I am disappointed a lot to people in firm and my stuffiness. I am really sick of the fact that what I can do is just following someone's work. I really hope to be a leader in front of development of science.

Anyway,

Because I am not familiar with binary format and compression file, I just skipped that part of the format specification.
Actually the ppt here is just brief introduction.
Here is the PPT.

list of presentation,
1.The sequence alignment/map format and SAMtools
2.Sequence Alignment/map format document

<sam flag explanation>
http://picard.sourceforge.net/explain-flags.html

<pysam 사용시 주의점>
pysam 에서는 mapping pos가 0 based 이다. 그래서 fetch를 할때 유전자의 정보가 bed 파일이 아닌이상 start position에 -1 값으로 fetch 를 해야 한다.

Saturday, August 7, 2010

overview of discovering structural variation with NGS

I think this presentation will be the last of three consecutive presentation in my firm.

so far, I have reviewed ChIP-seq, RNA-seq, de novo assembly (I didn't do posting of this subject, but I already made my own pipeline). I expect that after finishing this posting I can look over overall utilization of NGS, of course, I know this conclusion should be arrogant.

In my plan, these papers below will be introduced in presentation.

1. Computational methods for discovering structural variation with next-generation sequencing
2. one of the paper which is referred in paper 1.

I decide the second paper for presentation. that is beakDancer "BreakDancer: an algorithm for high-resolution mapping of genomic structural variation". haha Isn't it fascinated? Their sense for naming.. anyway It will come soon.

Wednesday, August 4, 2010

RNA-seq analysis overview

after finishing ChIP-seq analysis overview, for next presentation, I will prepare mRNA-seq. this consists of one review paper and two articles. I will be back

the list of papers which are presented in this post.

1. Computation for ChIP-seq and RNA-seq studies
2. Mapping and quantifying mammalian transcriptome by RNA-seq
3. Dynamic transcriptomes during neural differentiation of human embryonic stem cells reveals by short, long, and paired-end sequencing

I finally made PPT.

ChIP-seq analysis overview

I already reviewed some papers and set up the pipeline for analysis of ChIP-Seq.
I just hope this website for uploading ppt or other file.

anyway I will post this later

I found that. keke I am an idiot. just use hyperlink. that's all. no.no... for link that should be in web. one of way to solve this problem is use google site. upload file in google site and link that web site in this blog. I don't why blogger didn't make button for uploading data. or I couldn't find the button.

anyway.

Here is PPT for this posting.

The list of papers which are presented in this ppt.

1. Computation for ChIP-seq and RNA-seq studies
2. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls

Monday, July 19, 2010

De novo assembly of a 40 Mb Eukaryotic Genome from Short Sequence Reads: Sordaria macrospora, a Model Organism for Fungal Morphogenesis

To see the most recent de novo assembly paper, I found this paper(http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1000891). I will just focus on genomic assembly part in this paper.

They use solexa and FLX data. They produce a single and two paired-end data of different insert size libraries from solexa and single read from FLX. The amount of data showed in figures. A little bit doubtful point is the amount of read is too small compared with my data. This maybe come from the difference of version of solexa.

assembly process

The assembly process is simple. They assembled the total raw data from both system and contig from FLX data with Velvet. They said that FLX data reduced the gaps in contigs and addition of paired-end data improved N50 size.

confirmation of assembly

Checking the existence of synteny block between draft genome and closed species. Predicted protein from draft genome mapped to relatively closed species.

Monday, June 28, 2010

Whole-Genome Sequencing Breaks the Cost Barrier

Someday I talked with my superior about direction of NGS and its affection to daily life on lunch break. I just told what I have previously from conversation with Keum(http://goldbio.blogspot.com). Maybe because of this, today morning suddenly my superior gave me a paper which is about trend of WGS. Anyhow I will summarize this one in this writing. The paper was published on June 11, 2010 in Cell .

Whole-Genome Sequencing Breaks the Cost Barrier

Laura Bonetta*

*Laura Bonetta is a freelance writer. she has written news and feature article in many top journals (http://www.linkedin.com/in/laurabonetta).

hmm.. It was originally planed to summarize this article.. but I think there is nothing new.

anyway..

the author introduced some research which is related to elucidation on causes of mendelian disease. Richard Gibbs of the Baylor College of Mdeicine in Texas sequenced the whole genome of his colleague who is suffering Charcot-Marie-Tooth disease (http://en.wikipedia.org/wiki/Charcot-Marie-Tooth_disease). He narrowed down the candidate genetic variations and confirmed that variations through identification from his colleague's siblings.

And the other research is exome sequencing of 500 genes that predispose to several childhood diseases by Leroy Hood and David Galas at the Institute for systems Biology in Seatttle. It will be one of the first gene sequencing-based tests to come on the market. Galas are preparing to sequence 30 individuals over 4 generation to answer the question "What does the mutation rate depend on?".

last one is researching on various aspect (transcriptome, epigenome, genome data) of identical twins discordant for multiple sclerosis. Although the researcher couldn't find any significant result from the their work, I think It's helpful to see article for reference how they approached on mixing multi-dimensional data.

And the author pointed out the obstacle of present sequencing tech. I will omit this part. this is pretty obvious tale.

About complex Disease.. the author introduce two points of view. From failure of GWAS, someone believe(Richard Lifton at Yale Univ.) individually rare variants with relatively large effects will play a substantial role on complex diseases or traits, while Kari Stefansson, president of the Icelandic company DeCode Genomics, think rare confluence of variants rather than just individual rare variants giving large effects. Well.. I think any opinion is not always right. That's biology and that's the reason for taking much time to solve biological system as you know.

It is sure that already the most research are using whole genome sequencing data or sort of that. The price of whole genome sequencing is gonna plummet, and the sequencing data will overflow. Preparing is needed to survive (Actually I don't like use the expression 'survive', because that sounds against one's will).