Throw a stone at me: de novo assembly

Showing posts with label de novo assembly. Show all posts

Wednesday, August 10, 2011

RNA-Seq denovo assembly pipeline 1

problem
RNA-Seq 데이터를 de novo assembly 할려고 하는데 가장 최근에 나온 것이 trinity. 사실 이게 돌아가면 끝나는 문제이긴 한데.. 이게 의외로 시간도 많이 걸리고 안돌아 가는 경우가 생긴다(그 원인에 대해서는 아직 조사하지 않았음).

solution
그래서 차선책으로 생각한 것이 다음 논문의 "additive Multiple-k" 방법. false positive가 존재할 확률에 대한 정확한 estimation이 없기 때문에 잘못된 transcript를 assembly 할 경우가 생기겠지만 일단은 sensitity를 높이는데 목적을 둔다.

procedure
내 붕어랑 동급인지라 기록을 안해놓으면 다 까먹어서 명령어만 쭉 나열해 본다. 중간 중간 in-house script(python으로 시작하는 건 전부)가 존재하고 프로그램들의 option은 아직 최적화 되어 있지 않다. draft라 생각하면 되겠다. 시간되면 이어놔야지.

perl popoolation_1.017/basic-pipeline/trim-fastq.pl --input1 Pum2_1.fastq --input2 Pum2_2.fastq --output result_trim #velvet을 사용할 것이기 때문에 quality trim의 영향을 많이 받는다. 해서 velvet을 하기 전에 popoolation 패키지의 프로그램을 사용해서 trimming 한다.
python /home/bi/code/mergeWOmemory4velvet.py result_trim_1 result_trim_2 Pum2_combine.fa 71 #velvet의 input파일을 생성하기 위해 paired end 데이터를 합치는데 multiple-k 방법을 사용하기 때문에 trimmed read의 길이가 최대 k의 길이보다 긴 것들만 골라낸다.
velveth Pum2_velveth 41,71,4 -shortPaired Pum2_combine.fasta
python runVelvet.py Pum2 206 #velvetg를 돌리는 batch file. insert size 만 고려했다.
python combineContigs.py Pum2 #velvet을 위처럼 돌리면 여러 폴더고 각각의 k-mer 사이즈마다의 contigs가 생기는데 이를 하나의 multifasta 합치는 것.
cd-hit-est -i contigs.fa -o result_cdHit -c 0.95 -n 8
bowtie #read를 cd-hit으로 만든 cluster의 ref seq에다가 mapping 한다. 이는 RPKM을 구하기 위한 것
python stats.py #assembly 결과 정리

Monday, July 18, 2011

using k-mer distribution in de novo assembly

assembly를 하기 전에 k-mer의 depth별 frequency로 genome에 대한 정보를 얻을 수 있는데 그에 대해 간략하게 이야기 하려한다.

Genome assembly 를 하기 전에 순수 read를 이용하여 genome의 대략적인 size를 예측할 수 있다. 전체 read를 특정 k-mer size (bp)로 hashing을 해서 각 k-mer의 depth 별 frequency (=depth)를 계산했을 때, 가장 frequency가 높게 나온 depth (M), read length (L), k-mer length (K), sequencing depth (N) 간에는 다음과 같은 관계가 성립하게 된다.

M = N * (L – K + 1) / L

여기서 K와 L은 주어진 값이고 M은 계산으로 아래와 같은 plot으로 계산이 가능하기 때문에 N, 즉 sequencing depth를 구할 수 있게 된다. 이때 구해진 sequencing depth로 전체 read의 length로 나누어 주게 되면 assembly를 하고 있는 genome의 대략적인 genome size를 예측할 수 있는 것이다.

아래 그림이 그 예. BGI의 giant panda 논문에서 발췌

또한 위의 k-mer의 depth 별 frequency를 이용하여 genome의 heterozygosity를 판단할 수 있다. 아래 그림에서 보듯이 k-mer의 depth 별 frequency에 대해 plotting을 했을 때 위의 그림과 같이 하나의 peak가 아닌 2개의 peak가 보일 시에 genome이 highly heterozygous 하다고 볼 수 있다. 이는 두 haploid가 동일하다면(=homozygosity) 각각의 haploid에서의 k-mer plot이 동일해야 하기 때문에 하나의 peak가 나타나는 형태여야 하지만 두 haploid가 heterozygous 하다면 서로 다른 위치에서 peak가 생성되기 때문에 bi-modal distribution이 생성되게 된다.
아래 그림(potato genome sequencing 논문에서 발췌)이 그 예로 RH의 경우 higher heterozygous 함을 알 수 있다.

Tuesday, June 14, 2011

installation of assemblers & some tips(?)

< AllPaths-LG>
-installation-
denovo assembler 중에 한가지인 AllPaths-LG의 셋팅과 사용법에 대해 정리해보겠다. velvet이랑 newbler를 사용해 본적은 있는데 이건 어떨런지.. broad institute에서 나왔다니.. 왠지 메이져의 느낌이. 여튼 RNA-Seq도 다 못봤지만 박사님의 지시라. 보자.
논문에 대한 소개은 다음 포스팅에 있다. 여기선 설치와 사용 방법 결과 설명한 하도록 한다.

일단 AllPaths-LG의 사이트는 여기.
이거 설치는 위 사이트 download 탭에 pdf로 된 manual에 보면 나와있다.
기본적으로 셋팅 되어 있어야 하는게 있다. boost, graphviz, addr2line, picard.

boost : 그냥 다운 받아서 풀어 놓기만 한다. 그런데 ./bootstrap.sh, ./bjam install 이것도 해야 하는것 같네.
graphviz : ./configure --disable-swig, make, make install (dot 만 쓴다니까)
addr2line : 이건 깔려 있고 (man addr2line 쳐본다).
picard : 다운받아서 풀어서 path 잡아준다.

그담에 AllPaths-LG의 build는
./configure --with-boost=/path/to/boost
make
make install

test 파일을 다운받아서 실행해본다. test 폴더 안의 README를 읽고 prepare.sh 파일을 변경한 한다. 그리고 돌리기 전에 ulimit -s 100000을 해서 stack memory를 100MB까지 잡게 한다. 그리고 prepare.sh랑 assembly.sh 실행.

-output format-
test 파일을 돌리게 되면 reference/data/run/ASSEMBLIES/subdir 식의 폴더 구조가 생성된다. 상위폴더에 있는 데이터는 하위폴더에 중복되지 않는다.
reference는 organism 이다. 뭔소린가 하면 예를 들어 두개의 E.coli 를 reference로 assemble을 한다면(reference가 꼭 필요한건 아니지만 reference로 삼을 만한게 있으면 이용이 된단다) 일단은 E.coli라는 reference 폴더 안에 두개의 게놈과 데이터들이 존재하게 되는것.
data 폴더는 말그대로 데이터. assembly를 시도했던 read data를 담게 된다. data 폴더안에 여러개의 run 폴더가 존재 할수 있는데 이들 run 폴더간의 차이는 parameter setting이 다른 경우.
run 폴더는 final assembly stage를 위한 intermediate file이 들어 있는 폴더
ASSEMBLIES 폴더는 실제 assembly 가 있는 폴더인데 유일하게 폴더명이 ASSEMBLIES로 정해져 있다.
subdir 폴더는 음.. localized assembly가 생성된 폴더라는데..

-preparing data-
AllPaths의 input 파일이 완전 자기네들 스타일인데..
read data를 그들 형식으로 바꾸고 이 read data를 설명하는 metadata를 더해서 input 파일을 만든다. 그리고 reference 있으면 이것도 추가해야 한다.
일단 AllPaths는 paired end만 취급한다. 그리고 fragment library(180bp) 랑 jumping library(3-10kbp)가 반드시 있어야 한다. long jumping library는 option. 각 library마다 paired-end의 방향이 다르다. fragment, long jumping은 inward

jumping library는 outward, (mate-pair 방식)

data fold 안에는 반드시 library 마다 *.fastb(fasta binary), *.qualb(quality binary), *.pairs(read pair information) 파일과 ploidy 파일이 있어야 한다.
ploidy 파일은 1은 haploid, 2는 diploid genome을 의미. 나머지는 지원 안됨.
*.fastb, *.qualb, *.pairs 는 PrepareAllPathsInput.pl 으로 생성(.bam, .fastaq, .fasta, .fastab 파일 지원).
PrepareAllPathsInputs.pl의 input 파일로 in_groups.csv랑 in_libs.csv 가 필요.
*.fastq 을 변환하려면 paired read가 한 파일안에 존재 해야 한다. 그러니까 1pair의 1read record가 나오고 그 담에 바로 1pair의 2read record가 나오는 식으로.
quality score가 어떤식으로 encode됐는지 확인해서 PHRED_64 옵션으로 조정해야 한다.
in_group.csv 이랑 in_libs.csv 파일은 메뉴얼 보자

Example in_libs.cvs

나머진 manual 보자. 잘나왔다.

정리하자면
1. fastq 파일 합치고(read1, read2 를 하나의 파일로 합친다.)
2. in_libs.csv랑 in_group.csv 파일 만들고
3. PrepareAllPathsInputs.pl 돌려서 input 파일 만들고(IN_GROUPS_CSV, IN_LIBS_CSV, PHRED_64, PLOIDY,PICARD_TOOLS_DIR, DATA_DIR 파라미터 확인)
4. RunAllPathsLG 파이프 라인을 돌린다.

< SOAPdenovo>
-installation-
make

< Velvet>
-installation-
make (velvet manual 특히 for impatient people에 잘 되어 있음)

< Abyss>
-installation-
google sparsehash, pthread, openmp 설치
Google sparsehash 설치. 구글 코드 사이트에서 다운 받아서 설치
pthread 랑 openmp 서버에 있는거 같다. 찾아보니(c 코드로 #include 랑 #include 넣고 짯을 때 에러 안나면 그 헤더 파일이 시스템에 잡혀 있는것)..

Tuesday, May 24, 2011

De Bruijn Graph

genome이건 transcriptome이건 de novo assembly에 사용되는 알고리즘인데..
wiki에 나온건 간단하게 그냥 그 graph의 특징만 나왔다.
그런데 다음 링크는 코드까지 있다. 내 오늘은 그냥 링크만 기록하지만 코드 분석 후에 다시 posting을 하도록 하겠다.

음 코드를 봤는데 생각보다 별게 없다. De bruijn graph를 그리는 것도 아니고.. 링크에 댓글처럼 나도 댓글 달아서 graph까지 그리는 코드를 달라고 해볼까나..

그래서 다른 코드도 찾아봤다.
새로 찾은 블로그는 왠지..

이 코드가 그나마 de bruijn graph를 그리긴 한다. 단.. 거기까지만..
visualization 하는 코드도 있는데 xml-rpc인가로 원격으로 하는것 같다.

Tuesday, March 22, 2011

나중에 de novo assembly 할때 필요할게 될거 같은 논문

이건 음.. 부분부분 tagging 해서 parallel 하게 시퀀싱 하는거
http://www.nature.com/nmeth/journal/v7/n2/full/nmeth.1416.html

hybrid assembly
http://genome.cshlp.org/content/19/2/294.long

Thursday, December 2, 2010

gap closing by phrap & consed

요즘 FLX로 de novo assemble 한 scaffold의 gap closing 중인데. gap 인 N부분을 메꾸기 위해서 sager sequencing 을 이용하고 그 gap 부위의 assembly를 phrap을 이용하고 consed로 확인을 한다. phill green 이라는 사람이 만들었는데 프로그램을 얻기가 여느 프로그램 중에 젤 까다롭다. phred랑 phrap은 바로 첨부파일로 보내주지만 consed는 파일이 크기 때문에 접속할 ip를 academic 한 이유로 프로그램을 이용한다는 동의서와 함께보내야한다.
뭐 여튼.. 기본적인 make 만 하면 되고 make install도 없다 다만 실행 파일을 모아놓은 bin을 path로 걸어놓기만 하면 끝.

gap이 무자게 많은 관계로 가능한 automation을 하고 싶어서 이것저것 짜고 있는데.. 결론은 아무리 정리를 해도 눈으로 확인해봐야 한다는... 여튼 눈으로 확인하기 전에 최대한 정보를 정리해서 summary를 해야 하는 까닭에... 필요한 데이터를 링크, 정리는 아직 모르겠고

-ace file format
http://bcr.musc.edu/manuals/CONSED.txt

autofish라고 있는데.. 이게 뭔지는 자세히 들여다 봐야 할듯..
http://www.phrap.org/consed/autofinish.html

Tuesday, July 27, 2010

velvet VS newbler

To make sure which perform more accurately, I tested two programs in two strategies. Actually the reason why I did do this test is that most of people in my firm say that for de novo assembly FLX is much more suitable than solexa without any evidence. So I decided to test two system, but there is no real data which I can use, so I just do simulation test of the two program which is representative for each system.

Both of them, velvet and newbler are used for de novo assembly. In case of velvet, by using De Bruijn graph methodology It carry out short read assembly with data from solexa, solid. On the other hand, newbler is software from Roche, FLX and it is based on overlap layout consensus methodology (for seeing about the algorithm refer http://www.ncbi.nlm.nih.gov/pubmed/20211242).

I will compare the results from both program in two strategies. All of read data in these tests are simulation data which were made from reference genome by computational simulation.

First, velvet with paired-end read of which length is 78 bp and insert size is 300. newbler with single read of which length is 300 bp.

second, in case of velvet from first test adding long insert library with same condition

Monday, July 19, 2010

De novo assembly of a 40 Mb Eukaryotic Genome from Short Sequence Reads: Sordaria macrospora, a Model Organism for Fungal Morphogenesis

To see the most recent de novo assembly paper, I found this paper(http://www.plosgenetics.org/article/info:doi/10.1371/journal.pgen.1000891). I will just focus on genomic assembly part in this paper.

They use solexa and FLX data. They produce a single and two paired-end data of different insert size libraries from solexa and single read from FLX. The amount of data showed in figures. A little bit doubtful point is the amount of read is too small compared with my data. This maybe come from the difference of version of solexa.

assembly process

The assembly process is simple. They assembled the total raw data from both system and contig from FLX data with Velvet. They said that FLX data reduced the gaps in contigs and addition of paired-end data improved N50 size.

confirmation of assembly

Checking the existence of synteny block between draft genome and closed species. Predicted protein from draft genome mapped to relatively closed species.