Throw a stone at me: microbe

Wednesday, April 20, 2011

submission of genome - 4

Q&A and a final wrap-up from submitting to NCBI

There are no fixed rules for how the genomes in GenBank are annotated. Each submitter annotates with some logic of their own, and the only thing NCBI checks is that the sequence ids follow a scheme in which none of them collide.

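NCBI recommends that the protein_id follow the locus tag. After the genome is submitted, NCBI assigns accession numbers to the proteins and notifies you of them later.

<Things to watch out for in Sequin>
1. Change the topology.
2. Change the codon table.
3. The protein page and the annotation page can be skipped. You can load a Sequin feature table later.
4. When a gene is on the reverse strand, the positions are swapped in the annotation input file: the numerically larger coordinate comes first.
5. When you think everything is done, run validate from the search menu.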

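If Sequin misbehaves, the input file is wrong. (In my case the genome was circular and one ORF on the minus strand ran off the end of the genome and continued at the beginning. I built the annotation input file without thinking, following item 3 above, and kept getting errors. In the end I reset the genome start and changed every annotation position.)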
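What I ended up doing, re-cutting the circular genome and shifting every position, can be sketched roughly like this. It is only a sketch under assumptions: coordinates are 1-based, `new_origin` is a hypothetical choice of the new first base, and both function names are made up for illustration.

```python
def rotate_position(pos, new_origin, genome_len):
    """Map a 1-based position to its coordinate after the circular genome
    is re-cut so that `new_origin` becomes position 1."""
    return (pos - new_origin) % genome_len + 1


def rotate_feature(first, last, strand, new_origin, genome_len):
    """Re-base one feature.

    `first` and `last` are the feature's bounds read along the plus strand,
    so an origin-spanning feature has first > last (e.g. 990 and 20).
    Returns (col1, col2) in feature-table order: for a minus-strand feature
    the higher coordinate is placed first.
    """
    low = rotate_position(first, new_origin, genome_len)
    high = rotate_position(last, new_origin, genome_len)
    if low > high:
        # The chosen cut point falls inside this feature.
        raise ValueError("feature spans the new origin; choose another cut point")
    return (high, low) if strand == "-" else (low, high)


# A minus-strand ORF running from 990 through the end and on to 20
# in a 1000 bp genome, re-cut at position 900:
print(rotate_feature(990, 20, "-", 900, 1000))
```

Picking a cut point that sits in an intergenic region is the part this sketch leaves to you; it only refuses a cut point that lands inside a feature.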

Monday, April 18, 2011

submission of genome - 3

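I still haven't figured out everything about submitting to NCBI (my grip on Sequin in particular is still weak), but this time I want to look more closely at the genome announcements in the Journal of Bacteriology and read a couple of the papers published there to decide what mine should contain.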

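Before reading anything, here is a checklist:
1. Check whether only genomes already registered in a database such as NCBI can be published.
2. It has to fit in 500 words, so what should go in?
3. How did they handle references for the annotation?

<Skimming genome announcement papers>
Which strain it is doesn't matter here.
First paper reviewed
1. Intro to the strain: starts from the genus, why it matters, where it came from, what traits it has, and so on.
2. Overview of the de novo assembly and the annotation method: which machine produced how much data and which assembler was used. CDSs were predicted with CRITICA and Glimmer2. The next part I don't quite follow, but they classified the genes with GO and used some commercial program.
3. Overview of the genome: genome size, number of ORFs, and a comparison of length and ORF count with other species in the same genus. GC content.
4. Overview of the ORF annotation: report of the nr and COG BLAST results, plus genes with functions that other species lack.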

Second paper reviewed
1. Intro to the strain: this is a beneficial strain, that sort of thing.
2. More about the strain.
3. Sequencing method and annotation: unusually, they used an AB3700 DNA analyzer, although they did confirm with Solexa. ORFs were predicted with YACOP and annotated with UniProt, COG, KEGG, and TIGRFAMs.
4. Description of the genome: genome size, GC content, number of structural RNAs, and how many genes have a putative function.
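Both papers report the same handful of genome-level numbers, so here is a minimal sketch of how the size and GC content could be computed from the finished sequence. The function name is hypothetical, and ignoring non-ACGT characters in the GC denominator is my own assumption, not something either paper states.

```python
def genome_summary(sequence):
    """Return (length in bp, GC percentage) for a nucleotide string.

    Ambiguous bases such as N count toward the length but are left out
    of the GC calculation (an assumption made for this sketch).
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    gc_percent = 100.0 * gc / (gc + at) if (gc + at) else 0.0
    return len(seq), gc_percent


length, gc = genome_summary("ATGCGCGNATAT")
print(f"{length} bp, GC content {gc:.1f}%")
```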

So it looks like I need some comment on the genes, along the lines of which genes turned up that other species do not have.

<List of what has to go into the genome announcement paper>
Experimental method (FLX + Sanger 3730), amount of FLX data (reads, depth, paired-end), de novo assembler (GS De Novo Assembler, with version), number and total size of scaffolds from the first de novo assembly, number of reads and bp used for gap closing on the Sanger 3730, which programs were used (phred/phrap/consed), annotation tools (Glimmer3, RNAmmer, tRNAscan) and methods (nr blastp, COG rpsblast, SignalP, Pfam).

Keep the reference list to about 10~15 entries and cut every unnecessary word. Write it to fit the actual purpose of the paper: only what I have actually done.

Sunday, April 17, 2011

submission of genome - 2

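I got back from a trip and, it being spring... ugh... flu or a cold, I can't tell, but I'm basically a zombie. I feel like I'm dying. I wanted to take leave, but I had already used two days on the trip and wanted to save the rest, so I tried to tough it out, got dehydrated, and ended up flat on my back in the emergency room. I don't think I've ever been this sick in my life. My head is still spinning, but what good is lying at home watching TV? It won't cure anything. They say Victor Hugo had his servant take his gown away so that he would be forced to write. In the same spirit, I've made it to the office, so I might as well get something done before I go.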

In the last post I registered the genome project and the locus tag prefix, so now I need to read the annotation section more closely and learn how to use Sequin.

<annotation>
http://www.ncbi.nlm.nih.gov/genbank/genomesubmit_annotation.html#disrupted_genes
First, every feature that goes into the annotation (gene, CDS, and so on) has to be in a feature table (five-column tab-delimited table) file. So what is this feature table file? It appears to be the input file for Sequin or tbl2asn. And what goes in the five columns? 1. start location of feature, 2. stop location of feature, 3. feature key, 4. qualifier key, 5. qualifier value.
Nothing else stands out, except one important thing: the first line of the feature table has to be a > line carrying the seqid, and it must match the seqid in the FASTA file. So what do you pick as the seqid? Aha, it's temporary. Anything will do; NCBI staff replace it with the accession number during review. The protein id can apparently be the same as the locus id. Every CDS seems to require a product qualifier (the protein name), and there are naming rules to watch: do not put information such as function or cellular location in the name (that goes in a note instead). When the protein is unknown, use the term hypothetical protein. If need be, use the same name as the gene symbol, but capitalize the first letter. Members of a multigene family are distinguished by numbers, and plural words should not be used. (This part is a bit odd: the guidelines list multigene families separately from homology based on sequence similarity or shared function. Aren't those the same thing?) A protein of unknown function that has a defined domain can be given a name ending in -containing protein.
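To make the five columns concrete for myself, here is a minimal sketch of a writer for such a table. Several things in it are assumptions rather than anything the page above spells out: the `>Feature <seqid>` form of the header line, the `gnl|mylab|...` shape of the protein_id, the dictionary keys, and the function name are all mine, so check them against the NCBI feature table documentation before relying on them.

```python
def feature_table(seqid, genes):
    """Return the text of a five-column, tab-delimited feature table.

    Each gene is a dict with 'low', 'high', 'strand', 'locus_tag',
    'protein_id' and an optional 'product' (hypothetical key names).
    """
    lines = [f">Feature {seqid}"]  # seqid must match the FASTA seqid
    for g in genes:
        # Minus-strand features are written with the higher coordinate first.
        if g["strand"] == "-":
            start, stop = g["high"], g["low"]
        else:
            start, stop = g["low"], g["high"]
        product = g.get("product") or "hypothetical protein"
        # Feature line: start, stop, feature key.
        lines.append(f"{start}\t{stop}\tgene")
        # Qualifier line: three empty columns, then key and value.
        lines.append(f"\t\t\tlocus_tag\t{g['locus_tag']}")
        lines.append(f"{start}\t{stop}\tCDS")
        lines.append(f"\t\t\tproduct\t{product}")
        lines.append(f"\t\t\tprotein_id\t{g['protein_id']}")
    return "\n".join(lines) + "\n"


example = [
    {"low": 1, "high": 1050, "strand": "+", "locus_tag": "ABC_0001",
     "protein_id": "gnl|mylab|ABC_0001", "product": "DnaA"},
    {"low": 2000, "high": 2900, "strand": "-", "locus_tag": "ABC_0002",
     "protein_id": "gnl|mylab|ABC_0002"},
]
print(feature_table("genome1", example))
```

The second gene has no product, so it falls back to hypothetical protein, which is the naming rule for unknown proteins described above.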
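The guidelines say to avoid putting sequence similarity to database entries in the note.
For a tRNA, state which amino acid it corresponds to; if you are not sure, call it tRNA-Xxx.
Also, the /experimental and /inference qualifiers were defined at a 2005 meeting; the explanation is here.
My reading: when the INSD collaborators (DDBJ, EMBL, and GenBank) met in 2005, they settled something new for the evidence field of features. The old evidence=experimental qualifier was replaced by two qualifiers, /experimental=text and /inference=TYPE:text. The text is structured text (explained shortly), and TYPE is picked from a fixed list. The experimental qualifier is, as the name says, for describing the experiment, but briefly. The inference qualifier states non-experimental evidence. TYPE is one of 11 values (follow the link for the list).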
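The tRNA naming rule and the TYPE:text shape of the inference qualifier could be sketched as below. Treat it as a rough illustration: the helper names are hypothetical, only the 20 standard amino acids are handled, and the TYPE string "ab initio prediction" used in the example is my recollection of one entry in the INSD list, so verify it against the linked page.

```python
AMINO_ACIDS = {
    "Ala", "Arg", "Asn", "Asp", "Cys", "Gln", "Glu", "Gly", "His", "Ile",
    "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val",
}


def trna_product(amino_acid=None):
    """Product name for a tRNA; tRNA-Xxx when the amino acid is unknown."""
    if amino_acid is None:
        return "tRNA-Xxx"
    aa = amino_acid.capitalize()
    if aa not in AMINO_ACIDS:
        raise ValueError(f"not a standard three-letter code: {amino_acid!r}")
    return f"tRNA-{aa}"


def inference_line(evidence_type, text):
    """Feature-table qualifier line for /inference=TYPE:text."""
    return f"\t\t\tinference\t{evidence_type}:{text}"


print(trna_product("ala"))
print(trna_product())
print(inference_line("ab initio prediction", "Glimmer:3.02"))
```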

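?? The question this ultimately raises: does that mean there is no restriction on how you annotate? Can one person predict ORFs one way and then name the predicted proteins by one particular method, say sequence similarity alone? Oh, or do you just state the program you used in the inference qualifier? Anyway, I've emailed the staff, so an answer should come. Then again, I asked about COG once before and I think I was ignored, so I'm not sure I'll get a proper reply.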

<Sequin>
http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm
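This program is really no big deal (at least for what I'm trying to do). It converts the feature table above into GenBank format. (The pipeline I built already produces a GenBank file with Biopython when it runs, but just in case, I've decided to use this instead.) Anyway.
It starts by reading the FASTA file, and one thing puzzles me: if the nucleotide sequence encodes one or more protein products, you supposedly need two files, one for the nucleotides and one for the proteins. I don't get what that means; I'll have to read further. Also, you can put information into the FASTA title, that is, the first line (the one with >), using various modifiers, and if you don't know which modifier to use, just use the note modifier (the staff will change it for you).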
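A minimal sketch of building such a title line, assuming the bracketed `[key=value]` modifier syntax from the Sequin documentation. The function names, the placeholder organism and strain values, and the 60-column line width are assumptions made for illustration.

```python
def fasta_defline(seqid, **modifiers):
    """Build a FASTA title line with [key=value] modifiers after the seqid."""
    parts = [f">{seqid}"]
    for key, value in modifiers.items():
        text = str(value)
        if "[" in text or "]" in text:
            # Brackets inside a value would break the modifier syntax.
            raise ValueError(f"brackets are not allowed in modifier {key!r}")
        parts.append(f"[{key}={text}]")
    return " ".join(parts)


def fasta_record(defline, sequence, width=60):
    """Return the title line followed by the sequence wrapped at `width`."""
    rows = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([defline] + rows) + "\n"


title = fasta_defline(
    "genome1",
    organism="Genus species",       # placeholder organism name
    strain="X1",                    # placeholder strain
    note="anything I am unsure about",
)
print(fasta_record(title, "ATGC" * 40))
```

The seqid here is the same temporary id that the feature table header has to repeat.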

Monday, April 11, 2011

submission of genome

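Hmm. More and more I feel I should wrap up what I've done at the company over the past year and a half. I feel I've been in one place too long and may now just be treading water. So I'm going to try writing a paper on the two strains whose genome assemblies I finished recently.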

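The paper I have in mind is a Journal of Bacteriology genome announcement (http://jb.asm.org/misc/about.dtl). It's the kind of thing that makes you ask whether it even counts as a paper. The purpose of this feature is basically just to put a genome in GenBank and have it certified (this article seems to explain the purpose best).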

Right now there are two things to do: find out how to submit a bacterial genome to NCBI, and look at the target journal's report format and write.

<NCBI submission instruction>

Let's start with the NCBI submission instructions.

Register your Project

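Before uploading to GenBank you have to register a genome project. At that point you also have to choose a locus_tag prefix, and the proposal for it is as follows. In short: the locus_tag prefix is 3 or more letters and digits (symbols are not allowed) and must start with a letter; every gene (including structural RNAs, excluding repeat regions) gets its own unique locus_tag; and the multiple features of one gene share the same locus_tag.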
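The prefix rules as summarized above can be checked with a short regular expression. This sketch encodes only the rules stated here (NCBI may impose more, such as a maximum length), and the `PREFIX_0001` numbering with an underscore separator and four digits is an assumed layout, not something the proposal summary above requires.

```python
import re

# First character a letter, then at least two more letters or digits.
PREFIX_RE = re.compile(r"[A-Za-z][A-Za-z0-9]{2,}")


def is_valid_prefix(prefix):
    """True if the prefix satisfies the rules summarized above."""
    return PREFIX_RE.fullmatch(prefix) is not None


def locus_tags(prefix, count, width=4):
    """Generate sequential locus_tags such as ABC_0001 (assumed layout)."""
    if not is_valid_prefix(prefix):
        raise ValueError(f"invalid locus_tag prefix: {prefix!r}")
    return [f"{prefix}_{i:0{width}d}" for i in range(1, count + 1)]


for candidate in ["ABC", "A1", "1AB", "AB-C"]:
    print(candidate, is_valid_prefix(candidate))
print(locus_tags("ABC", 3))
```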
The nucleotide sequence is in FASTA format.

Annotation
For a complete genome, annotation is mandatory. Gene names (biological names) follow the standard bacterial nomenclature rule (three lower case letters), with different loci distinguished by an upper-case suffix. Genomes in the same genome project must use the same locus_tag prefix, and each gene must have a unique locus_tag (systematic identifier).
A CDS is a protein-coding region and must have a product qualifier (protein name); if need be, just use the same name as the gene (but capitalize the first letter). If the protein is not known, write hypothetical protein, so that once the record is released it is identified by its locus_tag in searches.
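The fallback logic for the product qualifier in the paragraph above is small enough to write down. A minimal sketch, with a hypothetical function name, that covers only the two cases mentioned there:

```python
def product_name(gene_symbol=None):
    """Pick a product name for a CDS.

    With a gene symbol, reuse it with the first letter capitalized;
    without one, fall back to 'hypothetical protein'.
    """
    if not gene_symbol:
        return "hypothetical protein"
    # Only the first letter changes; the locus suffix keeps its case.
    return gene_symbol[0].upper() + gene_symbol[1:]


print(product_name("nhaA"))
print(product_name())
```

Using `str.capitalize()` here would be a mistake, because it lower-cases the rest of the string and would turn nhaA into Nhaa.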
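One important CDS qualifier is the protein id, but hmm... (I need to read this part more carefully.)
Structural RNA means tRNA and rRNA only. These need a locus_tag too. (The proposal above recommends the same locus_tag numbering scheme for RNAs and CDSs alike, but if you really want that information in the locus_tag, it says to put it after an underscore, as in _t112.)
See the following for details.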
Create your submission
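There are two programs for building the submission file, Sequin and tbl2asn. I don't fully understand what they are yet, but the instructions say the two biggest differences are that Sequin is a GUI while tbl2asn is command-line, and that tbl2asn is therefore easier to use when the assembly is unfinished and has many contigs or when there are many chromosomes. Oh, and if the assembly is not complete yet, it has to be submitted to WGS. My genome is complete, so I'll try Sequin. I have to read the following (so much to read... ugh).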
Sequin Quick Guide :

submitting
Use FTP or the Genomes Submission Tool. If you submit frequently, NCBI will set up an FTP account for you; email them to ask.
What happens next
Once you submit, NCBI reviews the submission and, if there are no problems, sends the accession number. After that the annotation goes through its own review. The public release can happen immediately, or, if publication is an issue, it can be held for a specified period.

In summary
1. Register the genome project (and the locus_tag prefix).
2. Create the submission file (.sqn) with Sequin or tbl2asn.
3. Check the annotation file for errors with the Discrepancy Report and the Genome submission check tool.
4. Submit to NCBI with the Genomes Submission Tool.

<Instructions to authors (journal of bacteriology) >

Ah, about this: genome announcements are being discontinued in July. There isn't much time.

Monday, December 20, 2010

Additional items for microbe annotation

BER
TIGRFAMs,
equivalog-level HMM
TIGR role category scheme
PROSITE motifs
XDOM

Further things to consider

Wednesday, June 30, 2010

The Zymomonas mobilis regulator hfq contributes to tolerance against multiple lignocellulosic pretreatment inhibitors

I will summarize the article and pull together other sources, both to understand it and to prepare our own paper.

Shihui Yang, Steven D Brown*

*They work at Oak Ridge National Laboratory (http://www.esd.ornl.gov/) and also study Zymomonas mobilis. Last year they published a brief paper in Nature on a new annotation of Zymomonas mobilis, and they announced the genome sequence of AcR (an acetate-tolerant strain, although in reality it looks tolerant to sodium) in PNAS.

In the previous PNAS paper (http://www.pnas.org/content/107/23/10395.full) they compared the AcR genome with ZM4 and found a 1.5 kb deletion in AcR that truncated ZMO0117 and removed DNA upstream of ZMO0119 (nhaA; sodium-proton antiporter). They proposed that, because of the deletion, the ZMO0117 promoter affects nhaA expression, and that this causes the sodium acetate tolerance.

-background-

1. Demand for microbial engineering: alternative energy is needed -> one approach is agricultural biofuel from lignocellulosic biomass, which contains cellulose -> for fermentation by microbes, the biomass has to be pretreated, breaking the cellulose down into smaller molecules such as 5- or 6-carbon sugars (http://biotech.about.com/b/2008/06/11/pretreatment-of-cellulosic-biomass.htm) -> this pretreatment produces compounds that inhibit the microbes -> improved strains that tolerate these inhibitors are being developed by mutation.

2. Z. mobilis (Zymomonas mobilis): ethanol tolerance, a virtually unique property among bacteria; 3~5 times higher productivity than S. cerevisiae; and an ethanol yield reaching 97% of the theoretical maximum (http://www.nature.com/nbt/journal/v23/n1/full/nbt0105-40.html).

It uses the Entner-Doudoroff pathway to ferment glucose (a 6-carbon sugar, although some improved strains also use 5-carbon sugars). This pathway yields 1 ATP per glucose converted into 2 ethanol, whereas glycolysis yields 2 ATP. Because a low ATP yield means less cell mass, and so more of the sugar ends up as ethanol, Z. mobilis has higher potential than S. cerevisiae.
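To put the 97% figure in concrete terms, here is a small worked calculation. It assumes the usual stoichiometry of one glucose giving two ethanol plus two CO2 and uses rounded molar masses (180.16 and 46.07 g/mol); the function name is made up for this note.

```python
GLUCOSE_G_PER_MOL = 180.16   # C6H12O6, rounded
ETHANOL_G_PER_MOL = 46.07    # C2H5OH, rounded


def ethanol_yield(fraction_of_theoretical=1.0):
    """Grams of ethanol per gram of glucose at a given fraction of the
    theoretical maximum (1 glucose -> 2 ethanol + 2 CO2)."""
    theoretical = 2 * ETHANOL_G_PER_MOL / GLUCOSE_G_PER_MOL
    return fraction_of_theoretical * theoretical


print(f"theoretical maximum: {ethanol_yield():.3f} g/g")
print(f"at 97%:              {ethanol_yield(0.97):.3f} g/g")
```

So the theoretical ceiling is a little over half a gram of ethanol per gram of glucose, and 97% of it is just under half a gram.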

-the aim of this study-

To investigate the role of the hfq gene (ZMO0347) in tolerance to multiple pretreatment inhibitors. hfq is expressed more strongly in anaerobic stationary phase than under aerobic conditions (shown by the same authors in BMC Genomics, http://www.biomedcentral.com/1471-2164/10/34). This gene is a global regulator that acts as an RNA chaperone and is involved in coordinating regulatory responses to multiple stresses.

The paper also covers other work, such as the use of a specific plasmid and the role of LSM proteins in S. cerevisiae, but I will omit those.

-results-

Using BLASTP, they found that hfq in ZM4 is similar to the E. coli global regulator Hfq and to Sm proteins in S. cerevisiae. ---> Interestingly, ZM4's Hfq has two Sm-like domains.

They made AcRIM0347 by introducing an hfq insertion mutation into AcR (the Z. mobilis acetate-tolerant strain). They also introduced plasmid p42-0347 (expressing hfq) into ZM4, AcR, and AcRIM0347. ---> These are written ZM4(p42-0347), AcR(p42-0347), and AcRIM0347(p42-0347).

They tested growth of the strains above with various acetate counter-ions (NaCl, NaAc, NH4OAc, KAc) and with pretreatment inhibitors (vanillin, furfural, HMF). ---> AcRIM0347 grew more slowly than AcR. ZM4(p42-0347) was able to grow in NaAc like AcR. AcRIM0347(p42-0347) recovered growth to a certain degree with the acetate counter-ions and the inhibitors.

-conclusion-

hfq plays an important role in tolerance to multiple biomass pretreatment inhibitors.

Wednesday, June 23, 2010

Genomes | Klebsiella oxytoca M5al | The Genome Center at Washington University

This is one of the microbes I need to keep an eye on.
