Throw a stone at me: samtools pileup format

Wednesday, July 10, 2013

samtools pileup format

See this earlier post:
http://graphy2111.blogspot.kr/2011/07/sanger-fastq-file-format-for-sequences.html

<conversion illumina 1.3+ quality ascii to phred score>

To convert the ASCII characters of an Illumina 1.3+ FASTQ quality string to quality scores,
in Python:

ord(ch) - 64  # ch is a single quality character; the offset is 64
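
The one-liner above extends naturally to a whole quality string. This is a minimal sketch; the function name `illumina13_to_phred` and the sample string are made up for illustration.

```python
def illumina13_to_phred(quality_string):
    """Convert an Illumina 1.3+ (Phred+64) quality string to integer scores."""
    # Each character encodes one score as chr(score + 64).
    return [ord(ch) - 64 for ch in quality_string]


print(illumina13_to_phred("hhgB"))  # [40, 40, 39, 2]
```

A negative score in the output usually means the string was not encoded with offset 64 in the first place.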

<pileup format>
http://samtools.sourceforge.net/pileup.shtml

Example:


seq1 272 T 24  ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
<chromosome> <1-based coordinate> <reference base> <# of reads> <read bases> <base qualities>

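<base qualities> :: ASCII-encoded Phred scores, decoded as in the conversion above but with an offset of 33 rather than 64 (characters such as "&" and "+" in the example are below 64, so offset 64 would give negative scores).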
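
The six columns can be split on whitespace and the quality column decoded in one step. This is a rough sketch: `PileupRecord` and `parse_pileup_line` are hypothetical names, and the default offset of 33 is an assumption based on the example line, whose quality characters fall below 64.

```python
from typing import List, NamedTuple


class PileupRecord(NamedTuple):
    chrom: str
    pos: int          # 1-based coordinate
    ref: str          # reference base
    depth: int        # number of reads covering the position
    bases: str        # raw read-bases column
    quals: List[int]  # decoded base qualities


def parse_pileup_line(line, offset=33):
    """Split one six-column pileup line and decode its base qualities."""
    chrom, pos, ref, depth, bases, quals = line.split()
    scores = [ord(ch) - offset for ch in quals]
    return PileupRecord(chrom, int(pos), ref, int(depth), bases, scores)


rec = parse_pileup_line(
    "seq1 272 T 24  ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&"
)
print(rec.chrom, rec.pos, rec.ref, rec.depth)  # seq1 272 T 24
print(rec.quals[:5])                           # [27, 27, 27, 10, 26]
```

Splitting on whitespace works here because neither the read-bases column nor the quality column can contain a space when the offset is 33.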

<read bases> ::


"." :: match to ref base on the forward strand
"," :: match to ref base on the reverse strand
"ACGTN" :: mismatch on the forward strand
"acgtn" :: mismatch on the reverse strand
"\+[0-9]+[ACGTNacgtn]+" :: an insertion between this position and the next reference position
"\-[0-9]+[ACGTNacgtn]+" :: a deletion between this position and the next reference position

"^" :: start of a read segment; the character after "^" is the mapping quality as an ASCII code (subtract 33)
"$" :: end of a read segment

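The mapping quality that follows each "^" can be pulled out with a short regular expression. This is only a sketch, and `mapping_qualities` is a made-up helper name.

```python
import re


def mapping_qualities(read_bases):
    """Return the mapping quality of every read that starts at this position."""
    # "^" is always followed by exactly one character encoding mapq + 33.
    return [ord(m.group(1)) - 33 for m in re.finditer(r"\^(.)", read_bases)]


print(mapping_qualities(",.$.....,,.,.,...,,,.,..^+."))  # [10]
```

In the example line, "^+" marks one read starting at this position with mapping quality ord("+") - 33 = 10.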

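The important point is where each marker sits relative to the base it belongs to: "^" and its mapping-quality character come before the base, while insertion, deletion, and "$" markers come after the base. Each read contributes exactly one base symbol (".", ",", "ACGTN", or "acgtn").

Example:


seq2 156 A 11  .$......+2AG.+2AG.+2AGGG    <975;:<<<<< 
means (.$)(.)(.)(.)(.)(.)(.+2AG)(.+2AG)(.+2AG)(G)(G).



As shown above, the n characters that follow +n are the inserted bases, and the symbol before the "+" describes the read's base at that position. So .+2AG means the base at this position matches the reference and two bases, AG, are inserted between this position and the next.

                  |  <===== consider the pileup at this position

reference :: AGTTCG--G

read      :: AGTTCGAGG

This is written as .+2AG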
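
To check a grouping like the one above mechanically, the read-bases column can be split into one token per read. This is a minimal sketch, not samtools' own code. It assumes the token order described on the samtools pileup page: an optional "^" plus mapping-quality character, then the base symbol, then an optional indel, then an optional "$". The function name `split_read_bases` is made up.

```python
def split_read_bases(read_bases):
    """Split a pileup read-bases string into one token per read."""
    tokens = []
    i, n = 0, len(read_bases)
    while i < n:
        start = i
        if read_bases[i] == "^":
            i += 2  # skip "^" and the mapping-quality character
        i += 1      # the base symbol itself (".", ",", a letter, ...)
        if i < n and read_bases[i] in "+-":
            j = i + 1
            while j < n and read_bases[j].isdigit():
                j += 1
            indel_len = int(read_bases[i + 1:j])
            i = j + indel_len  # skip the inserted or deleted bases
        if i < n and read_bases[i] == "$":
            i += 1  # the read ends at this position
        tokens.append(read_bases[start:i])
    return tokens


tokens = split_read_bases(".$......+2AG.+2AG.+2AGGG")
print(tokens)
# ['.$', '.', '.', '.', '.', '.', '.+2AG', '.+2AG', '.+2AG', 'G', 'G']
print(len(tokens))  # 11
```

The token count matches the <# of reads> column (11), which is a handy sanity check when parsing real pileup output.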
