Wednesday, August 18, 2010

SAM (sequence alignment/map format)

With the advent of NGS, projects such as the 1000 Genomes Project have become possible. There are several sequencing machines and several alignment tools, which causes confusion and a lot of tedious handling of result data for downstream analysis. In my firm, we ran into this problem too when members used different tools for mapping the reads. While we were trying to find or decide on a standard format, I found SAM, so I planned to introduce it to the members of my firm.
What a pity! We found it a year after it was made. These days I am very disappointed in the people in my firm and in my own stuffiness. I am really sick of the fact that all I can do is follow someone else's work. I really hope to be a leader at the forefront of the development of science.


Anyway,


Because I am not familiar with binary formats and compressed files, I skipped that part of the format specification.
Actually, the PPT here is just a brief introduction.
Here is the PPT.


List of documents presented:
1. The Sequence Alignment/Map format and SAMtools
2. Sequence Alignment/Map format document






<SAM flag explanation>
http://picard.sourceforge.net/explain-flags.html
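
As a rough illustration of what the flag explanation page does, here is a minimal sketch that decodes a SAM FLAG value into its component bits. The bit meanings are the eleven defined in the SAM specification at the time of this post (newer versions add 0x800 for supplementary alignments); the name `explain_flag` and the wording of the labels are my own, not part of Picard or SAMtools.

```python
# Bit meanings of the SAM FLAG field (labels are paraphrased, not official text).
SAM_FLAGS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
]


def explain_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value.

    `explain_flag` is a hypothetical helper written for this post.
    """
    return [label for bit, label in SAM_FLAGS if flag & bit]


if __name__ == "__main__":
    # 99 = 0x1 + 0x2 + 0x20 + 0x40
    for label in explain_flag(99):
        print(label)
```

A flag of 4 decodes to just "read unmapped", and a flag of 0 decodes to an empty list (a single-end read mapped to the forward strand).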








<Caveats when using pysam>
In pysam, mapping positions are 0-based. So when you fetch, unless the gene coordinates come from a BED file, you have to fetch with the start position minus 1.
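
A minimal sketch of that conversion, assuming the gene coordinates are 1-based and inclusive (as in GFF/GTF) and that the fetch call expects a 0-based, half-open interval. The helper name `to_fetch_interval` is hypothetical, and the pysam call in the comment is only an assumed usage, not something tested here.

```python
def to_fetch_interval(start_1based, end_1based):
    """Convert a 1-based inclusive interval to a 0-based half-open one.

    Only the start shifts by one: the 1-based inclusive end and the
    0-based exclusive end are the same number.
    """
    if start_1based < 1 or end_1based < start_1based:
        raise ValueError("expected 1 <= start <= end")
    return start_1based - 1, end_1based


if __name__ == "__main__":
    start, end = to_fetch_interval(100, 200)
    print(start, end)
    # Assumed usage with pysam (file name and contig are placeholders):
    #   samfile.fetch("chr1", start, end)
```

Coordinates taken from a BED file are already 0-based and half-open, so they should be passed through unchanged.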

Tuesday, August 17, 2010

gzip compression algorithm & Burrows-Wheeler Transform

I often see documents that mention compression and indexing, such as bowtie and BGZF.
That is why I prepared this post. Ah.. it is just links to references.

Gzip compression algorithm
http://dalmasian.tistory.com/46

A short overall introduction to compression algorithms
http://blog.naver.com/altools?Redirect=Log&logNo=150019572403

Coding zip with zlib
http://blog.naver.com/ksw7998?Redirect=Log&logNo=100011414029

Burrows-Wheeler Transform (bzip2)
http://james.fabpedigree.com/bwt.htm
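
To get a feel for the transform itself, here is a naive sketch of the Burrows-Wheeler Transform and its inverse. It sorts all rotations explicitly, which is fine for a toy string but is not how bzip2 or bowtie do it in practice; the function names and the choice of "$" as the end marker are my own assumptions.

```python
def bwt(text, sentinel="$"):
    """Naive BWT: sort all rotations of text + sentinel, take the last column."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, sentinel="$"):
    """Naive inverse: repeatedly prepend the last column and re-sort."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(c + row for c, row in zip(last_column, table))
    for row in table:
        if row.endswith(sentinel):
            return row[:-1]
    raise ValueError("sentinel not found in input")


if __name__ == "__main__":
    transformed = bwt("banana")
    print(transformed)
    print(inverse_bwt(transformed))
```

The transform tends to group identical characters together, which is what makes the output easier to compress and also what the index structures in short-read aligners build on.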

Saturday, August 7, 2010

overview of discovering structural variation with NGS

I think this presentation will be the last of three consecutive presentations in my firm.

So far, I have reviewed ChIP-seq, RNA-seq, and de novo assembly (I didn't post on that subject, but I have already made my own pipeline). I expect that after finishing this post I will be able to look over the overall uses of NGS. Of course, I know this conclusion may be arrogant.

I plan to introduce the papers below in the presentation.

1. Computational methods for discovering structural variation with next-generation sequencing
2. One of the papers referred to in paper 1.

I decided on the second paper for the presentation. It is BreakDancer, "BreakDancer: an algorithm for high-resolution mapping of genomic structural variation". Haha, isn't it fascinating? Their sense for naming.. Anyway, it will come soon.

Wednesday, August 4, 2010

RNA-seq analysis overview

After finishing the ChIP-seq analysis overview, I will prepare mRNA-seq for the next presentation. It consists of one review paper and two articles. I will be back.

The list of papers presented in this post:

1. Computation for ChIP-seq and RNA-seq studies
2. Mapping and quantifying mammalian transcriptomes by RNA-Seq
3. Dynamic transcriptomes during neural differentiation of human embryonic stem cells revealed by short, long, and paired-end sequencing

I finally made the PPT.

ChIP-seq analysis overview

I have already reviewed some papers and set up the pipeline for ChIP-seq analysis.
I just wish this website allowed uploading a PPT or other files.

Anyway, I will post this later.

I found it. Keke, I am an idiot. Just use a hyperlink, that's all. No, no... for a link, the file has to be on the web. One way to solve this problem is to use Google Sites: upload the file to a Google site and link to that site from this blog. I don't know why Blogger didn't make a button for uploading data. Or maybe I just couldn't find the button.

anyway.

Here is the PPT for this post.

The list of papers presented in this PPT:

1. Computation for ChIP-seq and RNA-seq studies
2. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls

Friday, July 30, 2010

q value

I always forget the concept of the q value.
Several months ago I found a website with a good explanation of the q value, but I cannot remember that site, so in this post I record the trail I followed to understand the q value.

how to interpret q value
http://www.nonlinear.com/support/progenesis/samespots/faq/pq-values.aspx


how to calculate q value
http://courses.ttu.edu/isqs6348-westfall/images/6348/BonHolmBenHoch.htm
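
As a note to myself, here is a minimal sketch of one common way to get such values: Benjamini-Hochberg adjusted p-values, which are often reported as q values. This is an assumption about which definition is meant; Storey's q value additionally estimates the proportion of true null hypotheses, which this sketch does not do. The function name `bh_qvalues` is hypothetical.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    For the p-value with rank r (1 = smallest) out of m tests, the raw
    adjustment is p * m / r; a running minimum taken from the largest
    p-value downwards keeps the result monotone in p.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues


if __name__ == "__main__":
    for p, q in zip([0.01, 0.04, 0.03, 0.005], bh_qvalues([0.01, 0.04, 0.03, 0.005])):
        print(p, round(q, 4))
```

Calling a feature significant at q < 0.05 then means accepting that roughly 5% of the features called significant are expected to be false positives.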

Tuesday, July 27, 2010

velvet VS newbler

To find out which performs more accurately, I tested two programs with two strategies. The reason I did this test is that most people in my firm say, without any evidence, that FLX is much more suitable than Solexa for de novo assembly. So I decided to test the two systems, but there is no real data I can use, so I just ran a simulation test of the two programs that are representative of each system.

Both velvet and newbler are used for de novo assembly. Velvet uses the de Bruijn graph methodology to carry out short-read assembly with data from Solexa and SOLiD. Newbler, on the other hand, is the software from Roche for FLX, and it is based on the overlap-layout-consensus methodology (for details on the algorithms, refer to http://www.ncbi.nlm.nih.gov/pubmed/20211242).
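
To make the de Bruijn idea a little more concrete, here is a toy sketch that builds the graph from a set of reads: every k-mer becomes an edge from its (k-1)-base prefix to its (k-1)-base suffix. This is only an illustration of the general methodology, not velvet's actual implementation; the function name `de_bruijn_graph` and the choice to collapse repeated k-mers are my own assumptions.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph: node = (k-1)-mer, edge = k-mer.

    Repeated k-mers are collapsed into a single edge, so coverage
    information is not kept in this sketch.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    edges = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            edges[kmer[:-1]].add(kmer[1:])
    # Sorted lists make the output deterministic and easy to print.
    return {node: sorted(successors) for node, successors in edges.items()}


if __name__ == "__main__":
    graph = de_bruijn_graph(["ACGT", "CGTA"], k=3)
    for node, successors in sorted(graph.items()):
        print(node, "->", ", ".join(successors))
```

An assembler then looks for paths through this graph, whereas an overlap-layout-consensus assembler compares whole reads against each other to find overlaps, which is more practical with fewer, longer reads.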

I will compare the results from both programs under two strategies. All read data in these tests are simulated data, generated computationally from a reference genome.
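
For reference, here is a minimal sketch of how such reads could be simulated. It is not the simulator used for these tests: it assumes error-free reads, a fixed insert size, and uniform sampling along the reference, and every name in it is hypothetical. The default read length (78) and insert size (300) simply mirror the velvet settings in this post.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def simulate_paired_reads(reference, n_pairs, read_len=78, insert_size=300, seed=0):
    """Sample error-free paired-end reads from fixed-size fragments.

    Read 1 is the start of the fragment; read 2 is the reverse complement
    of the end of the fragment.
    """
    if read_len > insert_size or insert_size > len(reference):
        raise ValueError("need read_len <= insert_size <= len(reference)")
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        start = rng.randrange(len(reference) - insert_size + 1)
        fragment = reference[start:start + insert_size]
        pairs.append((fragment[:read_len], reverse_complement(fragment[-read_len:])))
    return pairs


if __name__ == "__main__":
    rng = random.Random(1)
    reference = "".join(rng.choice("ACGT") for _ in range(1000))
    for read1, read2 in simulate_paired_reads(reference, n_pairs=2):
        print(read1)
        print(read2)
```

Single long reads, as in the newbler setting, could be sampled the same way by taking only the first read with a longer read length.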

First, velvet with paired-end reads of length 78 bp and insert size 300, against newbler with single reads of length 300 bp.

Second, velvet as in the first test, with a long-insert library added under the same conditions