Throw a stone at me

Monday, September 6, 2010

high-throughput sequencing data submission to NCBI (GEO, SRA)

Most papers upload their data to GEO or SRA, so it helps to understand the formats those databases support. Here are links describing the formats.

SOFT file format:
http://www.ncbi.nlm.nih.gov/geo/info/soft-seq.html

Submitting sequencing data:
http://www.ncbi.nlm.nih.gov/geo/info/seq.html

Why some NGS data are in the SRA database while others are in GEO: whole-genome sequencing, metagenome, and survey sequencing data, as well as sequence files in their original short-read format, belong in SRA.

A SOFT (Simple Omnibus Format in Text) file is essentially a description of the submitted data. The actual raw data (FASTQ) may or may not be included.

Tip for checking what a SOFT file includes:
1. Raw data: if the SOFT file contains raw data, there should be a line starting with "!Sample_raw_file...".
2. Processed data: there should be a line starting with "!Sample_supplementary_file...".
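
The check in this tip can be scripted. Below is a minimal Python sketch, assuming the SOFT file is plain text with one attribute per line. The function name `soft_file_contents` and the sample lines are made up for illustration; only the two attribute prefixes come from the tip above.

```python
def soft_file_contents(lines):
    """Report whether SOFT-formatted lines mention raw and/or processed files."""
    has_raw = False
    has_processed = False
    for line in lines:
        line = line.strip()
        if line.startswith("!Sample_raw_file"):
            has_raw = True
        elif line.startswith("!Sample_supplementary_file"):
            has_processed = True
    return {"raw": has_raw, "processed": has_processed}


# Hypothetical SOFT fragment, for illustration only.
example = [
    "^SAMPLE = example_sample",
    "!Sample_title = example",
    "!Sample_supplementary_file = example_peaks.bed",
]
print(soft_file_contents(example))
```

For a real file, the same function should work on an open file handle, e.g. `with open(path) as fh: soft_file_contents(fh)`, where `path` is wherever you saved the SOFT file.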

Friday, September 3, 2010

NP-hard

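When reading bioinformatics papers, I keep running into the term NP-hard. Go dig through Google for it. It feels like it will be the death of you, lol.
The following link explains it very kindly.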
http://blog.naver.com/dekarno?Redirect=Log&logNo=140019592031

Thursday, September 2, 2010

cloud computing, grid computing... parallelism

Nature Reviews Genetics published a paper titled 'Computational solutions to large-scale data management and analysis'. With the arrival of third-generation sequencing machines, it looks from a computational point of view at the problem of how to handle enormous volumes of data and high-dimensional data.
To summarize briefly: understand the characteristics of your own data well, choose a platform such as cluster computing, cloud computing, grid computing, or heterogeneous computing, and solve the problem by parallelizing your algorithms.
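
As a toy illustration of what parallelizing an algorithm means here, below is a minimal Python sketch that splits a list of reads into chunks, counts G/C bases in each chunk concurrently, and sums the results. This is my own example, not something from the paper: the function names and the reads are made up, and a thread pool is used only to keep the sketch self-contained. For CPU-bound work on real data one would more likely swap in a process pool or one of the cluster/cloud platforms mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_count(reads):
    """Count G and C bases in one chunk of reads."""
    return sum(r.upper().count("G") + r.upper().count("C") for r in reads)


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_gc_count(reads, workers=2):
    """Map gc_count over the chunks concurrently, then reduce by summing."""
    chunks = chunked(reads, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(gc_count, chunks))


reads = ["ACGT", "GGCC", "ATAT", "CGCG"]  # made-up reads
print(parallel_gc_count(reads))
```

The point of the design is that each chunk can be processed without knowing anything about the others, so the per-chunk step is independent and only the final sum needs all the results.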

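Grid, cloud, or cluster: they all seem much the same to me and I can't tell them apart, so I'm linking some reference material.
Especially what the difference between grid and cloud is.. I still.. don't quite.. lol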

http://blog.naver.com/happypcb?Redirect=Log&logNo=90077847232

Below is a brief introduction to parallel computing (classified by memory access method):
http://blog.naver.com/belief_jesus?Redirect=Log&logNo=120102897585
In short, depending on the memory access method

Sunday, August 29, 2010

Genome-Wide Evolutionary Analysis of Eukaryotic DNA Methylation

Recently I had an idea about the evolution of methylation, prompted by the fact that methylation patterns are conserved in orthologous regions between species. I decided to dig into this concept, so I googled it first. I had barely started searching when a paper with the title of this post turned up.

The paper was published in Science.

Here is the PPT.

This link is also worth reading.
http://blog.lib.umn.edu/denis036/thisweekinevolution/2010/05/evolution_of_dna_methylation_i.html

"(열혈강의) 오용철의 데이터베이스 모델링" (Oh Yong-cheol's Database Modeling, from the 열혈강의 series)

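As an undergraduate I studied and used SQL while working with UniGene data, but I felt I lacked a systematic grasp of databases as a subject, so I read this book. I still have about two chapters left at the end (bottom-up design and integrated design), but I'll review it ahead of time.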

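How does this book feel? A bit like when I went over to the computer science department as a sophomore and took data structures. Once you have read all of it you realize it explains things fairly comfortably, but the introduction is neither engaging nor friendly, so a complete beginner would probably find it boring and keep asking "why?".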

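For someone like me, who knows a little about databases and wants to organize that knowledge, it is a very easy read, but I would not recommend it to someone who really knows nothing at all.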

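Briefly, the contents can be summarized as in the figure on the right.
1. The part of the world you want to turn into a database is organized through data collection and analysis.
2. It first goes through conceptual design, which produces an ER model (diagram).
3. Next, logical design (top-down, bottom-up, or integrated) produces an implementation data model (this book explains the relational model).
4. Finally, physical design produces the actual physical model.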
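
To make these steps concrete, here is a minimal Python sketch using the standard `sqlite3` module. The entities (`cluster`, `sequence`) and their columns are hypothetical, loosely inspired by UniGene-style data, and are not taken from the book. The first comment stands in for a conceptual ER model, the `CREATE TABLE` statements are one possible relational (logical) model, and the index stands in for a physical design decision.

```python
import sqlite3

# Conceptual model (hypothetical): one CLUSTER groups many SEQUENCEs (1:N).
conn = sqlite3.connect(":memory:")

# Logical model: each entity becomes a table, and the 1:N relationship
# becomes a foreign key on the "many" side.
conn.executescript("""
CREATE TABLE cluster (
    cluster_id INTEGER PRIMARY KEY,
    title      TEXT NOT NULL
);
CREATE TABLE sequence (
    accession  TEXT PRIMARY KEY,
    cluster_id INTEGER NOT NULL REFERENCES cluster(cluster_id)
);
-- Physical design choice: index the column used for lookups and joins.
CREATE INDEX idx_sequence_cluster ON sequence(cluster_id);
""")

# Made-up rows, for illustration only.
conn.execute("INSERT INTO cluster VALUES (1, 'example cluster')")
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?)",
    [("SEQ0001", 1), ("SEQ0002", 1)],
)

# Count the sequences in each cluster by joining across the relationship.
rows = conn.execute("""
    SELECT c.title, COUNT(*)
    FROM cluster c JOIN sequence s ON s.cluster_id = c.cluster_id
    GROUP BY c.cluster_id, c.title
""").fetchall()
print(rows)
```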

Each step comes with an explanation and practical examples. The important keywords in the book that I had not known are normalization, indexes, PL/SQL, triggers, and cursors.

Finally, the downsides: the figures contain many typos, and the schemas at each stage (conceptual, logical, physical) use different terms for the same concept in a confusing way.

Monday, August 23, 2010

homologous recombination

Note: this post was originally written in Korean.

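Today Science carried a paper titled "re-replication may be a contributor to gene copy number changes". The mechanism is shown on the right. It seems to say that gene copy number changes through NAHR (non-allelic homologous recombination), though I only looked at the figure and did not read the paper. The figure made me look up what NAHR is. It is simply HR (homologous recombination) between highly similar sequences that are not alleles of each other, such as repeats at different positions.
So what is HR? Recombination is something I vaguely remember hearing about in biology class long ago. I started by looking for information in two places.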

Sanger and Wikipedia.
The Sanger site says very little, so I turned to Wikipedia, which goes into more detail. The key point (left figure) is that HR arises through the DSBR pathway, one of the pathways of double-strand break repair, and that the result is either a crossover or gene conversion.
When the double Holliday junctions are cut by a nicking endonuclease, horizontal resolution (to borrow the Sanger site's wording) leads to gene conversion, and vertical resolution leads to a crossover.

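The problem is that when you look up gene conversion in general, the figure looks like the one on the right. In the process above, when gene conversion occurs, it is true that a DNA fragment from the other duplex goes into one duplex whole, as a double strand. But the duplex that did not undergo gene conversion also receives a DNA fragment from the other side, as a single strand. In the figure on the right, however, one side is drawn as if no DNA strands were mixed in at all. Did I misunderstand, or was the figure on the right simply drawn that way for convenience?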

Can anyone answer this question?

Darwin's theory of evolution

Several months ago, I was fascinated by the theory of evolution. I collected many documents and web sites over a week and tried to read most of them. But whether because of my own limits in interpreting them or because they lacked detailed explanation, my understanding of the theory was not clear at that time.
I found this book by chance last week and read it over the weekend. Through this book I could organize my thinking about the theory of evolution and its history.
In particular, a section written thirty years ago by Motoo Kimura, who developed the neutral theory of molecular evolution, was really impressive. I read that part as if I were hearing a lecture from Kimura himself.

I believe that ALL analysis of biological data is based on the theory of evolution, and that biologists who lack this concept are scientists without spirit.
As mentioned in Darwin 2.0 (the book? written by a professor at Ewha univ.), Darwin's theory can be considered the greatest theory in biology, and its greatness is comparable to that of the theory of relativity in physics.

In this book I could find the theory of evolution, the meaning of population genetics, and applications of evolution. Of course these are not covered very concretely, but it is enough to establish the basic concepts.