Throw a stone at me: 05/25/11

Wednesday, May 25, 2011

Trinity

논문 소개는 여기서 했고 실질적으로 setting 하면서 기억해야 할거 command line에서 주의 해야 할것 등을 기록하려 한다. 워낙 붕어 머리라.

일단 trinity download 및 documentation은 여기.

<install>
그냥 tar 랑 gzip 풀고 기본 폴더에서 make 한번 눌러주면 된다. 그럼 inchworm 이랑 chrysalis 가 컴파일 된다(c++로 되어 있음). butterfly 는 java인데 precompiled software라 할 필요 없단다.

<running>
기본 명령어는 sample_data 폴더 안에 runMe.sh 에 있다. run 하는 batch 파일은 기본 폴더에 있는 Trinity.pl 파일이다.
위 링크에 걸려 있는 documentation을 요약하자면..

strand-specific sequencing 일때는 -SS_lib_type 옵션을 사용하란다(default 는 strand 안가린다).
또 fungal genome 처럼 gene이 compact 한 genome에서는 UTR 부위에서 transcript의 overlap이 생기기 때문에 -jaccard_clip 옵션을 사용하란다. 요 옵션은 Bowtie가 깔려 있어야 한단다(정확하게는 jaccard_clip의 옵션의 기능을 모르겠다).
computing grid 일 때는 --run_butterfly 옵션을 사용하지 말고, grid 가 아닐때 단순시 cpu가 많을땐 --run_butterfly 옵션이랑 --cpu 옵션 넣어서 돌려라. --run_butterfly 옵션 없을때 결과 폴더 안에 Chrysalis 폴더 안에 butterfly_commnads 가 있는데 이걸 이용해서 grid를 사용하라.

<hardware requirements>
1G of RAM per 1M pairs of Illumina reads. Chrysalis 단계가 deep recursion 땜시 stack 메모리가 많이 필요하단다. 이럴 때 세팅은 여기처럼 하란다(Q&A에 나온다).
전체 시간은 1 hour per million pairs of reads(이 표현이 정확이 1Mbp 당인지 아니면 1M paired-end count를 이야기 하는건지 정확히 모르겠네).

<output file>
--output 옵션에 지정한 대로 아니면 default 인 Trinity_dir 폴더안에 Trinity.fasta 파일이 생기는데 이는 full-length transcript의 fasta 파일.

좀더 자세하게 알고 싶으면 여기, advanced guide

<Inchworm>
read를 k-mer로 쪼개서 hash table에 저장한다. 이때 base는 2-bit encoding을 사용한다(A:00,G:01 이런식). k-mer는 25 로 고정(package로 돌릴때), 단 따로 Inchworm만 돌리면 32까지는 허용.
monitor.out 이란 파일이 inchworm의 log 파일.
inchworm.K25.L48.fa 가 inchworm의 결과 파일. K는 k-mer를 뜻하고 L은 contig의 minimum Length를 의미. 그 안에 보면 각 contig의 이름에 >a1;142 K: 25 length: 3697로 되어 있는데 a1뒤의 숫자는 k-mer의 average frequency. 이는 나중에 expression quantity에 사용된다.

<Chrysalis>
Chrysalis는 inchworm에서 생긴 contig들을 alternative spliced transcript 나 paralogous transcript들끼리 clustering 하는 역할. 각 cluster의 contig를 사용해서 de bruijn graph 생성하고 read를 mapping한다. 각 cluster의 de bruijn graph는 butterfly 단계를 위해 각각 다른 파일로 저장됨
이때 생기는 파일은 chrysalis/RawComps.0 폴더 안에 있는데 파일들은 이렇단다.

*chrysalis intermediate outputs*
chrysalis/RawComps.0/comp9.raw.graph :de Bruijn graph based on Inchworm contigs only
chrysalis/RawComps.0/comp9.raw.fasta :raw RNA-Seq reads that map to the Inchworm contigs based on k-mer composition
*chrysalis outputs to be used by Butterfly as inputs*
chrysalis/RawComps.0/comp9.out :the de Bruijn graph with edge weights incorporating the mapped reads
chrysalis/RawComps.0/comp9.reads :the read sequences and anchor points within the above graph

이것과 함께 butterfly가 돌수 있는 명령문이 있는 butterfly_commands와 butterfly_commands.adj

가 생성되는데 butterfly_commands.adj가 heap size랑 parameter등이 있는 파일로 butterfly 실행시 이 파일을 실행해야 한다.

<Butterfly>
butterfly는 이제 chrysalis에서 생긴 *.out과 *.reads 파일을 보고 가능성 있어 보이는 trascript를 fasta 파일로 report 한다.
Trinity.pl 에다가 --run_butterfly를 줬으면 이건 /util/cmd_process_forker.pl을 돌리는 거다(forker란걸 보니 프로세스 늘려서 실행 시키는 프로그램인 갑다).
결과로 comp*_allProPaths.fasta 라는 파일이 생기는데 이게 possible transcrip의 fasta 파일. accession에 대해선 그냥 메뉴얼보자(쓰기 귀찮다).
butterfly 돌릴때 그러니까 butterfly_commands.adj에 있는 명령문에서 -V 5 라고 하면 돌아가는 상황이 투명하게 보여지는데, 뭔소리냐면 compacted graph structure를 report 하고 graph 순회하면서 들린 node를 report한다. 그 파일이 comp*_justBeforeFidingPaths.dot. 이 파일은 GraphViz 로 볼 수있단다. 이 파일의 accession도 설명하는데 것도 manual 보자