<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:trackback="http://madskills.com/public/xml/rss/module/trackback/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/"><channel><title>C++ Blog - Toward Greatness, Toward Great Heights - Article Category - Bioinformatics</title><link>http://www.cppblog.com/ewre/category/18250.html</link><description>knowing how to do is different from learning how to do</description><language>zh-cn</language><lastBuildDate>Mon, 13 May 2013 11:15:49 GMT</lastBuildDate><pubDate>Mon, 13 May 2013 11:15:49 GMT</pubDate><ttl>60</ttl><item><title>About Corona Lite</title><link>http://www.cppblog.com/ewre/articles/161163.html</link><dc:creator>ewre</dc:creator><author>ewre</author><pubDate>Tue, 29 Nov 2011 07:44:00 GMT</pubDate><guid>http://www.cppblog.com/ewre/articles/161163.html</guid><wfw:comment>http://www.cppblog.com/ewre/comments/161163.html</wfw:comment><comments>http://www.cppblog.com/ewre/articles/161163.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/ewre/comments/commentRss/161163.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/ewre/services/trackbacks/161163.html</trackback:ping><description><![CDATA[<div><p>When using AB's Corona Lite, be sure to read the descriptions in the reference and pay attention to the parameters of every function in it. Write out as many of them as you can and give each one an explicit value. In my case, I had not specified the tracking queue for the submit_script Perl program (I assumed the default would work), and I struggled for three or four days before finding out that this was the mistake.</p></div><img src ="http://www.cppblog.com/ewre/aggbug/161163.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/ewre/" target="_blank">ewre</a> 2011-11-29 15:44 <a href="http://www.cppblog.com/ewre/articles/161163.html#Feedback" target="_blank" 
style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>On large-scale data operations</title><link>http://www.cppblog.com/ewre/articles/161160.html</link><dc:creator>ewre</dc:creator><author>ewre</author><pubDate>Tue, 29 Nov 2011 07:42:00 GMT</pubDate><guid>http://www.cppblog.com/ewre/articles/161160.html</guid><wfw:comment>http://www.cppblog.com/ewre/comments/161160.html</wfw:comment><comments>http://www.cppblog.com/ewre/articles/161160.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/ewre/comments/commentRss/161160.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/ewre/services/trackbacks/161160.html</trackback:ping><description><![CDATA[<div>Operating on large-scale data raises two concerns: efficiency and the use of hardware resources. These two goals naturally trade off against each other.<br /> However, by preprocessing the data in advance, we can to some extent have both.<br /> Common preprocessing methods:<br /> 1. Multi-level sorting with an index.<br /> Sort the data by a hierarchy of keys, sorting again within each level by some further key, and build an index table during the sort that records where each level begins.<br /> 2. Use existing formats.<br /> The value of reusing existing resources has been restated and stressed more than once.<br /> Existing formats for large-scale genome-related data include GTF, GFF, and so on.</div><img src ="http://www.cppblog.com/ewre/aggbug/161160.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/ewre/" target="_blank">ewre</a> 2011-11-29 15:42 <a href="http://www.cppblog.com/ewre/articles/161160.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>linux disk usage command-du</title><link>http://www.cppblog.com/ewre/articles/161151.html</link><dc:creator>ewre</dc:creator><author>ewre</author><pubDate>Tue, 29 Nov 2011 07:37:00 
GMT</pubDate><guid>http://www.cppblog.com/ewre/articles/161151.html</guid><wfw:comment>http://www.cppblog.com/ewre/comments/161151.html</wfw:comment><comments>http://www.cppblog.com/ewre/articles/161151.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/ewre/comments/commentRss/161151.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/ewre/services/trackbacks/161151.html</trackback:ping><description><![CDATA[<div><p>$ man du</p><p><br />NAME<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; du - estimate file space usage<br /><br />SYNOPSIS<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; du [OPTION]... [FILE]...<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; du [OPTION]... --files0-from=F<br /><br />DESCRIPTION<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Summarize disk usage of each FILE, recursively for directories.<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Mandatory arguments to long options are mandatory for short options too.<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -a, --all<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; write counts for all files, not just directories<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; --apparent-size<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  print&nbsp; apparent&nbsp; sizes,&nbsp; rather than disk usage; although the apparent  size is usually smaller, it may be larger due to holes in ('sparse')  files, internal fragmenta-<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; tion, indirect blocks, and the like<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -B, --block-size=SIZE use SIZE-byte blocks<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -b, --bytes<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; equivalent to '--apparent-size --block-size=1'<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -c, --total<br 
/>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; produce a grand total<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -D, --dereference-args<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; dereference FILEs that are symbolic links<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; --files0-from=F<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; summarize disk usage of the NUL-terminated file names specified in file F<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -H&nbsp;&nbsp;&nbsp;&nbsp; like --si, but also evokes a warning; will soon change to be equivalent to --dereference-args (-D)<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -h, --human-readable<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; print sizes in human readable format (e.g., 1K 234M 2G)<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; --si&nbsp;&nbsp; like -h, but use powers of 1000 not 1024<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -k&nbsp;&nbsp;&nbsp;&nbsp; like --block-size=1K<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -l, --count-links<br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; count sizes many times if hard linked<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -m&nbsp;&nbsp;&nbsp;&nbsp; like --block-size=1M<br /><br />&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -L, --dereference</p><p>&nbsp;</p><p>To show the sizes of the files and folders under the current directory in units of G:</p><p>$ du -a -B g . 
(or du -sh . to print only a human-readable total for the directory)</p>2G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057636_2.fastq<br />5G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057631.fastq.sam<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057638_1.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057638_2.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057639_1.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057639_2.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057640_1.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057640_2.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057641_1.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057641_2.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057642_1.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057642_2.fastq<br />4G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057643_2.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057644_2.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057645_2.fastq<br />3G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057646_2.fastq<br />2G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057647_2.fastq<br />2G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057629.fastq.sam<br />1G&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ./SRR057632.fastq.sam<br />128G&nbsp;&nbsp;&nbsp; .</div><img src ="http://www.cppblog.com/ewre/aggbug/161151.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/ewre/" target="_blank">ewre</a> 2011-11-29 
15:37 <a href="http://www.cppblog.com/ewre/articles/161151.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>The KEGG database now charges a fee</title><link>http://www.cppblog.com/ewre/articles/161145.html</link><dc:creator>ewre</dc:creator><author>ewre</author><pubDate>Tue, 29 Nov 2011 07:35:00 GMT</pubDate><guid>http://www.cppblog.com/ewre/articles/161145.html</guid><wfw:comment>http://www.cppblog.com/ewre/comments/161145.html</wfw:comment><comments>http://www.cppblog.com/ewre/articles/161145.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/ewre/comments/commentRss/161145.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/ewre/services/trackbacks/161145.html</trackback:ping><description><![CDATA[<div><p>The maintainers of the KEGG database say that, for funding reasons, the KEGG pathway DB <a target="_blank" href="http://www.genome.jp/kegg/docs/plea.html">no longer provides a free service</a> as of 2011-07-01. What a pity.</p><p>KEGG's pathway data is of very high quality, and once it starts charging, its traffic may well drop. Personally, I do not think this model is right.</p><p>The handful of databases that are most widely used today should join forces, from funding through to product. That would both cut costs and improve the quality of the databases.</p><p>&nbsp;</p></div><img src ="http://www.cppblog.com/ewre/aggbug/161145.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/ewre/" target="_blank">ewre</a> 2011-11-29 15:35 <a href="http://www.cppblog.com/ewre/articles/161145.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>If you built a public analysis tool, please maintain it</title><link>http://www.cppblog.com/ewre/articles/161144.html</link><dc:creator>ewre</dc:creator><author>ewre</author><pubDate>Tue, 29 Nov 2011 07:35:00 
GMT</pubDate><guid>http://www.cppblog.com/ewre/articles/161144.html</guid><wfw:comment>http://www.cppblog.com/ewre/comments/161144.html</wfw:comment><comments>http://www.cppblog.com/ewre/articles/161144.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/ewre/comments/commentRss/161144.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/ewre/services/trackbacks/161144.html</trackback:ping><description><![CDATA[<div><p>While doing some bioinformatics analyses, I found that some ideas are very good and also have usable implementations, mostly in the form of websites.</p><p>But I also noticed a pattern: many tools are abandoned as soon as the paper is published, and many of them still run on a database version from 2006.</p><p>If you built the tool, please maintain it.</p></div><img src ="http://www.cppblog.com/ewre/aggbug/161144.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/ewre/" target="_blank">ewre</a> 2011-11-29 15:35 <a href="http://www.cppblog.com/ewre/articles/161144.html#Feedback" target="_blank" style="text-decoration:none;">Post a comment</a></div>]]></description></item><item><title>GTF and GFF file formats</title><link>http://www.cppblog.com/ewre/articles/161130.html</link><dc:creator>ewre</dc:creator><author>ewre</author><pubDate>Tue, 29 Nov 2011 07:26:00 GMT</pubDate><guid>http://www.cppblog.com/ewre/articles/161130.html</guid><wfw:comment>http://www.cppblog.com/ewre/comments/161130.html</wfw:comment><comments>http://www.cppblog.com/ewre/articles/161130.html#Feedback</comments><slash:comments>0</slash:comments><wfw:commentRss>http://www.cppblog.com/ewre/comments/commentRss/161130.html</wfw:commentRss><trackback:ping>http://www.cppblog.com/ewre/services/trackbacks/161130.html</trackback:ping><description><![CDATA[<div>GFF file format:<br /> <p>Fields are: &lt;seqname&gt; &lt;source&gt; &lt;feature&gt; &lt;start&gt; &lt;end&gt; &lt;score&gt; &lt;strand&gt; &lt;frame&gt; [attributes] [comments]</p> <dl><dt> &lt;seqname&gt; </dt><dd> The name of the sequence. 
Having an explicit sequence name allows a feature file to be prepared for a data set of multiple sequences. Normally the seqname will be the identifier of the sequence in an accompanying fasta format file. An alternative is that &lt;seqname&gt; is the identifier for a sequence in a public database, such as an EMBL/Genbank/DDBJ accession number. Which is the case, and which file or database to use, should be explained in accompanying information. </dd><dt> &lt;source&gt; </dt><dd> The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc. </dd><dt> &lt;feature&gt; </dt><dd> The feature type name. We hope to suggest a standard set of features, to facilitate import/export, comparison etc. Of course, people are free to define new ones as needed. For example, Genie splice detectors account for a region of DNA, and multiple detectors may be available for the same site, as shown above. </dd><dd> We would like to enforce a standard nomenclature for common GFF features. This does not forbid the use of other features, rather, just that if the feature is obviously described in the standard list, that the standard label should be used. For this standard table we propose to fall back on the international public standards for genomic database feature annotation, specifically, the <a target="_blank" title="* This link opens in a new window" href="http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html">DDBJ/EMBL/GenBank feature table documentation</a>. 
</dd><dt>               &lt;start&gt;, &lt;end&gt;             </dt><dd>                Integers. &lt;start&gt; must be less than or equal to   &lt;end&gt;. Sequence numbering starts at 1, so these numbers                should be between 1 and the length of the relevant  sequence,  inclusive. (<strong>Version 2 change</strong>: version               2  condones values of &lt;start&gt; and &lt;end&gt; that  extend outside  the reference sequence. This is often more               natural when  dumping from acedb, rather than clipping. It  means that some software  using the files may need to clip               for itself.)             </dd><dt>               &lt;score&gt;             </dt><dd>                A floating point value. When there is no score (i.e. for a  sensor that just records the possible presence of a                signal, as for the EMBL features above) you should use '.'. (<strong>Version 2 change</strong>: in version 1 of GFF               you had to write 0 in such circumstances.)             </dd><dt>               &lt;strand&gt;             </dt><dd>                One of '+', '-' or '.'. '.' should be used when strand is  not relevant, e.g. for dinucleotide repeats.               <strong>Version 2 change</strong>: This field is left empty '.' for RNA and protein features.             </dd><dt>               &lt;frame&gt;             </dt><dd>                One of '0', '1', '2' or '.'. '0' indicates that the  specified region is in frame, i.e. that its first base                corresponds to the first base of a codon. '1' indicates that there is  one extra base, i.e. that the second base of               the region  corresponds to the first base of a codon, and '2' means that the third  base of the region is the first               base of a codon. 
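<p>The frame values defined here can be sketched in Python as follows. This is only an illustration under stated assumptions, not part of the specification: first_codon_base and cds_frames are hypothetical helper names, the first CDS segment is assumed to start in frame 0, and segments are assumed to be listed in translation (5'-to-3') order.</p>

```python
def first_codon_base(start, end, strand, frame):
    """Coordinate of the first base of the first whole codon in a region.

    frame is '0', '1', '2' or '.'; None is returned when frame is '.'.
    On the '-' strand the region is read from end down to start.
    """
    if frame == ".":
        return None
    extra = int(frame)  # number of bases before the first whole codon
    if strand == "-":
        return end - extra
    return start + extra


def cds_frames(lengths, first_frame=0):
    """Frame of each CDS segment, given segment lengths in translation order.

    Assumption: the first segment starts in frame first_frame (0 by default).
    """
    frames = []
    frame = first_frame
    for length in lengths:
        frames.append(frame)
        # Bases left over from this segment decide the next segment's frame.
        frame = (frame - length) % 3
    return frames
```

<p>For example, first_codon_base(501, 650, "+", 2) gives 503, and cds_frames([38, 140, 108, 206]) gives the frames of four consecutive segments with those illustrative lengths.</p>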
If the strand is '-', then the first base of the region is the value of &lt;end&gt;, because the corresponding coding region will run from &lt;end&gt; to &lt;start&gt; on the reverse strand. As with &lt;strand&gt;, if the frame is not relevant then set &lt;frame&gt; to '.'. It has been pointed out that "phase" might be a better descriptor than "frame" for this field. </dd><dd> <strong>Version 2 change</strong>: This field is left empty '.' for RNA and protein features. </dd><dt> [attributes] </dt><dd> From version 2 onwards, the attribute field must have a tag-value structure following the syntax used within objects in a .ace file, flattened onto one line by semicolon separators. Tags must be standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must be quoted with double quotes. <em>Note: all non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc.) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as '\n', tabs as '\t').</em> As in ACEDB, multiple values can follow a specific tag. The aim is to establish consistent use of particular tags, corresponding to an underlying implied ACEDB model if you want to think that way (but acedb is not required). Examples of these would be: <pre>seq1     BLASTX  similarity   101  235 87.1 + 0  Target "HBA_HUMAN" 11 55 ; E_value 0.0003<br />dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 Sequence "dJ102G20.C1.1"</pre> The semantics of tags in attribute field tag-value pairs has intentionally not been formalized. 
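<p>The tag-value structure shown in the two records above can be split apart with a short Python sketch. It is a minimal illustration, not a conforming parser: parse_attributes and split_outside_quotes are hypothetical names, every value is kept as a string, and backslash escapes inside quoted text are left exactly as written.</p>

```python
import re

# A token is either a double-quoted string (backslash escapes allowed)
# or a run of non-whitespace characters.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|(\S+)')


def split_outside_quotes(text, sep=";"):
    """Split text on sep, ignoring separators that sit inside double quotes."""
    parts, current = [], []
    in_quotes = False
    escaped = False
    for ch in text:
        if escaped:
            current.append(ch)
            escaped = False
        elif ch == "\\" and in_quotes:
            current.append(ch)
            escaped = True
        elif ch == '"':
            in_quotes = not in_quotes
            current.append(ch)
        elif ch == sep and not in_quotes:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def parse_attributes(field):
    """Return {tag: [values]} for one attribute field."""
    result = {}
    for chunk in split_outside_quotes(field):
        tokens = []
        for match in TOKEN.finditer(chunk):
            if match.group(2) is not None:
                tokens.append(match.group(2))  # bare token
            else:
                tokens.append(match.group(1))  # quoted text without the quotes
        if not tokens:
            continue  # empty chunk, e.g. after a trailing semicolon
        result.setdefault(tokens[0], []).extend(tokens[1:])
    return result
```

<p>Called on the attribute field of the first record, parse_attributes returns the tag Target with the three values HBA_HUMAN, 11 and 55, and the tag E_value with the single value 0.0003.</p>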
Two useful guidelines are to use DDBJ/EMBL/GenBank feature 'qualifiers' (see <a target="_blank" title="* This link opens in a new window" href="http://www.ebi.ac.uk/Services/WebFeat/">DDBJ/EMBL/GenBank feature table documentation</a>), or the features that ACEDB generates when it dumps GFF. </dd><dd> <strong>Version 1 note</strong> In version 1 the attribute field was called the group field, with the following specification: </dd><dd> An optional string-valued field that can be used as a name to group together a set of records. Typical uses might be to group the introns and exons in one gene prediction (or experimentally verified gene structure), or to group multiple regions of match to another sequence, such as an EST or a protein. </dd></dl> <p>All of the above described fields should be separated by TAB characters ('\t'). All values of the mandatory fields should not include whitespace (i.e. the strings for &lt;seqname&gt;, &lt;source&gt; and &lt;feature&gt; fields).</p> <p><strong>Version 1 note</strong> In version 1 each string had to be under 256 characters long, and the whole line should be under 32k long. This was to make things easier for guaranteed conforming parsers, but seemed unnecessary given modern languages.</p> <h3>Comments</h3> <p>Comments are allowed, starting with "#" as in Perl, awk etc. Everything following # until the end of the line is ignored. Effectively this can be used in two ways. Either it must be at the beginning of the line (after any whitespace), to make the whole line a comment, or the comment could come after all the required fields on the line.</p> <h3>## comment lines for meta information</h3> <p>There is a set of standardised (i.e. parsable) ## line types that can be used optionally at the top of a gff file. 
The philosophy  is a little like the special set of %% lines at the top of postscript  files, used for example to give             the BoundingBox for EPS  files.</p> <p>Current proposed ## lines are:</p> <dl><dt>               gff-version             </dt><dd> <pre>##gff-version 2</pre> GFF version - in case it is a real success and we want to change it. The  current default version is 2, so if this line is not present version 2  is assumed.             </dd><dt>               source-version             </dt><dd> <pre>##source-version &lt;source&gt; &lt;version text&gt;</pre> So that people can record what version of a program or package was used  to make the data in this file. I suggest the version is text without  whitespace. That allows things like 1.3, 4a etc. There should be at most  one source-version line per source.             </dd><dt>               date             </dt><dd> <pre>##date &lt;date&gt;</pre> The date the file was made, or perhaps that the prediction programs were  run. We suggest to use astronomical format: 1997-11-08 for 8th November  1997, first because these sort properly, and second to avoid any  US/European bias.             </dd><dt>               Type             </dt><dd> <pre>##Type &lt;type&gt; [&lt;seqname&gt;]</pre> The type of host sequence described by the features. Standard  types are  'DNA', 'Protein' and 'RNA'. The optional &lt;seqname&gt; allows  multiple ##Type definitions describing multiple  GFF sets in one file,  each of which have a distinct type. If the name is not provided, then  all the features in the file are of  the given type. Thus, with this  meta-comment, a single file could contain DNA, RNA and Protein features,  for example, representing a  single genomic locus or 'gene', alongside  type-specific features of its transcribed mRNA and translated protein  sequences. If no  ##Type meta-comment is provided for a given GFF file,  then the type is assumed to be DNA.             
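<p>A minimal Python sketch of reading the four ## lines described so far, with the defaults the text states (version 2 when no gff-version line is present, type DNA when no Type line is present). The function name parse_meta and the layout of the returned dictionary are assumptions made for illustration, and any other ## line is simply ignored.</p>

```python
def parse_meta(lines):
    """Collect gff-version, source-version, date and Type meta lines."""
    meta = {"gff_version": 2, "source_versions": {}, "date": None, "types": {}}
    for line in lines:
        if not line.startswith("##"):
            continue  # ordinary comment or feature line
        parts = line[2:].split()
        if not parts:
            continue
        key, args = parts[0], parts[1:]
        if key == "gff-version" and args:
            meta["gff_version"] = int(args[0])
        elif key == "source-version" and len(args) == 2:
            # One version string (no whitespace) per source.
            meta["source_versions"][args[0]] = args[1]
        elif key == "date" and args:
            meta["date"] = args[0]
        elif key == "Type" and args:
            # The seqname is optional; None stands for the whole file.
            seqname = args[1] if len(args) == 2 else None
            meta["types"][seqname] = args[0]
    if not meta["types"]:
        meta["types"][None] = "DNA"  # default stated in the text
    return meta
```

<p>Keying the types by sequence name lets a single file carry several ##Type definitions, as the description of that line allows.</p>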
</dd><dt>               DNA             </dt><dd> <pre><br /> ##DNA &lt;seqname&gt;<br /> ##acggctcggattggcgctggatgatagatcagacgac<br /> ##...<br /> ##end-DNA</pre> To give a DNA sequence. Several people have pointed out that it may be  convenient to include the sequence in the file. It should not become  mandatory to do so, and in our experience this has been very little  used. Often the seqname will be a well-known identifier, and the  sequence can easily be retrieved from a database, or an accompanying  file.             </dd><dt>               RNA             </dt><dd> <pre><br /> ##RNA &lt;seqname&gt;<br /> ##acggcucggauuggcgcuggaugauagaucagacgac<br /> ##...<br /> ##end-RNA</pre> Similar to DNA. Creates an implicit ##Type RNA &lt;seqname&gt; directive.             </dd><dt>               Protein             </dt><dd> <pre><br /> ##Protein &lt;seqname&gt;<br /><br /> ##MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF<br /> ##...<br /> ##end-Protein</pre> Similar to DNA. Creates an implicit ##Type Protein &lt;seqname&gt; directive.             </dd><dt>               sequence-region             </dt><dd> <pre>##sequence-region &lt;seqname&gt; &lt;start&gt; &lt;end&gt;</pre> To indicate that this file only contains entries for the specified subregion of a sequence.             </dd></dl> <p>Please feel free to propose new ## lines.</p> <p>The ## line proposal came out of some discussions including Anders  Krogh, David Haussler, people at the Newton             Institute on  1997-10-29 and some email from Suzanna Lewis. Of course, naive programs  can ignore all of these...</p> <h3>File Naming</h3> <p>We propose that the format is called "GFF", with conventional file name ending ".gff".</p> <h3>Semantics</h3> <p>We have intentionally avoided overspecifying the semantics of the  format. For example, we have not restricted the             items  expressible in GFF to a specified set of feature types (splice sites,  exons etc.) with defined semantics.             
Therefore, in order for  the information in a gff file to be useful to somebody else, the person  producing the             features must describe the meaning of the  features.</p> <p>In the example given above the feature "splice5" indicates  that  there is a candidate 5' splice site between positions             172  and 173. The "sp5-20" feature is a prediction based on a  window of 20  bp for the same splice site. To use either             of these, you  must know the position within the feature of  the predicted splice site.  This only needs to be given             once, possibly in comments at  the head of the file, or in a  separate document.</p> <p>Another example is the scoring scheme; we ourselves would  like the  score to be a log-odds likelihood score in bits to             a defined  null model, but that is not required, because  different methods take  different approaches.</p> <p>Avoiding a prespecified feature set also leaves open the possibility  for GFF to be used for new feature types, such             as CpG  islands, hypersensitive sites, promoter/enhancer elements, etc.</p> <br /> <br /> GTF2(gene transfer format) file format:<br /> &lt;seqname&gt; &lt;source&gt; &lt;feature&gt; &lt;start&gt; &lt;end&gt;  &lt;score&gt;  &lt;strand&gt; &lt;frame&gt; [attributes] [comments] <p>Here is a simple example with 3 translated exons. 
Order of rows is not important.</p> <pre>AB000381 Twinscan&nbsp; CDS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 380&nbsp;&nbsp; 401&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; 0&nbsp; gene_id "001"; transcript_id "001.1";<br />AB000381 Twinscan&nbsp; CDS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 501&nbsp;&nbsp; 650&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; 2&nbsp; gene_id "001"; transcript_id "001.1";<br />AB000381 Twinscan&nbsp; CDS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 700&nbsp;&nbsp; 707&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; 2&nbsp; gene_id "001"; transcript_id "001.1";<br />AB000381 Twinscan&nbsp; start_codon&nbsp; 380&nbsp;&nbsp; 382&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; 0&nbsp; gene_id "001"; transcript_id "001.1";<br />AB000381 Twinscan&nbsp; stop_codon&nbsp;&nbsp; 708&nbsp;&nbsp; 710&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; 0&nbsp; gene_id "001"; transcript_id "001.1";</pre> The whitespace in this example is provided only for readability. In GTF,  fields must be separated by a single TAB and no white space. <p><strong>&lt;seqname&gt;</strong> <br /> The FPC contig ID from the Golden Path.</p> <p><strong>&lt;source&gt;</strong> <br /> The source column should be a unique label indicating where the  annotations came from --- typically the name of either a prediction  program or a public database.</p> <p><strong>&lt;feature&gt;</strong> <br /> The following feature types are required: "CDS", "start_codon",  "stop_codon". The feature "exon" is optional, since this project will  not evaluate predicted splice sites outside of protein coding regions.  All other features will be ignored.</p> <p>CDS represents the coding sequence starting with the first translated  codon and proceeding to the last translated codon. 
Unlike Genbank  annotation, the stop codon is not included in the CDS for the terminal  exon.</p> <p><strong>&lt;start&gt; &lt;end&gt;</strong> <br /> Integer start and end coordinates of the feature relative to the  beginning of the sequence named in &lt;seqname&gt;.&nbsp; &lt;start&gt; must  be less than or equal to &lt;end&gt;. Sequence numbering starts at 1.  Values of &lt;start&gt; and &lt;end&gt; that extend outside the  reference sequence are technically acceptable, but they are discouraged  for purposes of this project.</p> <p><strong>&lt;score&gt;</strong> <br /> The score field will not be used for this project, so you can either provide a meaningful float or replace it by a dot.</p> <p><strong>&lt;frame&gt;</strong> <br /> 0 indicates that the first whole codon of the reading frame is located  at 5'-most base. 1 means that there is one extra base before the first  codon and 2 means that there are two extra bases before the first codon.  Note that the frame is not the length of the CDS mod 3.</p> <p>Here are the details excised from the <a href="http://www.sanger.ac.uk/Software/formats/GFF/GFF_Spec.shtml" target="_blank">GFF spec</a>. <strong>Important: Note comment on reverse strand.</strong></p> <blockquote>'0' indicates that the specified region is in frame, i.e.  that its first base corresponds to the first base of a codon. '1'  indicates that there is one extra base, i.e. that the second base of the  region corresponds to the first base of a codon, and '2' means that the  third base of the region is the first base of a codon. 
<strong>If the strand is '-', then the first base of the region is value of &lt;end&gt;</strong>, because the corresponding coding region will run from &lt;end&gt; to &lt;start&gt; on the reverse strand.</blockquote>  <strong>[attributes]</strong> <br /> All four features have the same two mandatory attributes at the end of the record: <ul><li><em>gene_id value;</em>&nbsp;&nbsp;&nbsp;&nbsp; A globally unique identifier for the genomic source of the transcript</li><li><em>transcript_id value;</em>&nbsp;&nbsp;&nbsp;&nbsp; A globally unique identifier for the predicted transcript.</li></ul> These attributes are designed for handling multiple transcripts from the  same genomic region. Any other attributes or comments must appear after  these two and will be ignored. <p>Attributes must end in a semicolon which must then be separated from  the start of any subsequent attribute by exactly one space character  (NOT a tab character).</p> <p>Textual attributes should be surrounded by doublequotes.</p> <p>Here is an example of a gene on the negative strand. Larger  coordinates are 5' of smaller coordinates. Thus, the start codon is 3 bp  with largest coordinates among all those bp that fall within the CDS  regions. 
Similarly, the stop codon is the 3 bp with coordinates just  less than the smallest coordinates within the CDS regions.</p> <p><tt>AB000123&nbsp;&nbsp;&nbsp; Twinscan&nbsp;&nbsp;&nbsp;&nbsp; CDS&nbsp;&nbsp;&nbsp; 193817&nbsp;&nbsp;&nbsp; 194022&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp; -&nbsp;&nbsp;&nbsp;  2&nbsp;&nbsp;&nbsp; gene_id "AB000123.1"; transcript_id "AB00123.1.2";</tt> <br /> <tt>AB000123&nbsp;&nbsp;&nbsp; Twinscan&nbsp;&nbsp;&nbsp;&nbsp; CDS&nbsp;&nbsp;&nbsp; 199645&nbsp;&nbsp;&nbsp; 199752&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp; -&nbsp;&nbsp;&nbsp;  2&nbsp;&nbsp;&nbsp; gene_id "AB000123.1"; transcript_id "AB00123.1.2";</tt> <br /> <tt>AB000123&nbsp;&nbsp;&nbsp; Twinscan&nbsp;&nbsp;&nbsp;&nbsp; CDS&nbsp;&nbsp;&nbsp; 200369&nbsp;&nbsp;&nbsp; 200508&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp; -&nbsp;&nbsp;&nbsp;  1&nbsp;&nbsp;&nbsp; gene_id "AB000123.1"; transcript_id "AB00123.1.2";</tt> <br /> <tt>AB000123&nbsp;&nbsp;&nbsp; Twinscan&nbsp;&nbsp;&nbsp;&nbsp; CDS&nbsp;&nbsp;&nbsp; 215991&nbsp;&nbsp;&nbsp; 216028&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp; -&nbsp;&nbsp;&nbsp;  0&nbsp;&nbsp;&nbsp; gene_id "AB000123.1"; transcript_id "AB00123.1.2";</tt> <br /> <tt>AB000123&nbsp;&nbsp;&nbsp; Twinscan&nbsp;&nbsp;&nbsp;&nbsp; start_codon&nbsp;&nbsp; 216026&nbsp;&nbsp;&nbsp; 216028&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp; -&nbsp;&nbsp;&nbsp;  .&nbsp;&nbsp;&nbsp; gene_id&nbsp;&nbsp;&nbsp; "AB000123.1"; transcript_id "AB00123.1.2";</tt> <br /> <tt>AB000123&nbsp;&nbsp;&nbsp; Twinscan&nbsp;&nbsp;&nbsp;&nbsp; stop_codon&nbsp;&nbsp;&nbsp; 193814&nbsp;&nbsp;&nbsp; 193816&nbsp;&nbsp;&nbsp; .&nbsp;&nbsp;&nbsp; -&nbsp;&nbsp;&nbsp;  .&nbsp;&nbsp;&nbsp; gene_id&nbsp;&nbsp;&nbsp; "AB000123.1"; transcript_id "AB00123.1.2";</tt></p> <p>Note the frames of the coding exons. 
For example:</p> <ol><li>The first CDS (from 216028 to 215991) always has frame zero.</li><li>Frame of the 1st CDS =0, length =38.&nbsp; (frame - length) % 3&nbsp; = 1, the frame of the 2nd CDS.</li><li>Frame of the 2nd CDS=1, length=140. (frame - length) % 3&nbsp; = 2, the frame of the 3rd CDS.</li><li>Frame of the 3rd CDS=2, length=108. (frame - length) % 3&nbsp; =&nbsp; 2, the frame of the terminal CDS.</li><li>Alternatively, the frame of terminal CDS can be calculated  without the rest of the gene. Length of the terminal CDS=206. length % 3  =2, the frame of the terminal CDS.</li></ol> Here is an example in which the "exon" feature is used. It is a 5 exon gene with 3 translated exons. <tt>AB000381 Twinscan&nbsp; exon&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 150&nbsp;&nbsp; 200&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; .&nbsp; gene_id "AB000381.000"; transcript_id "AB000381.000.1";</tt>  <br /> <tt>AB000381 Twinscan&nbsp; exon&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 300&nbsp;&nbsp; 401&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; .&nbsp; gene_id "AB000381.000"; transcript_id "AB000381.000.1";</tt> <br /> <tt>AB000381 Twinscan&nbsp; CDS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  380&nbsp;&nbsp; 401&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; 0&nbsp; gene_id "AB000381.000"; transcript_id "AB000381.000.1";</tt> <br /> <tt>AB000381 Twinscan&nbsp; exon&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 501&nbsp;&nbsp; 650&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; .&nbsp; gene_id "AB000381.000"; transcript_id "AB000381.000.1";</tt>  <br /> <tt>AB000381 Twinscan&nbsp; CDS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 501&nbsp;&nbsp; 650&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; 2&nbsp; gene_id "AB000381.000"; transcript_id "AB000381.000.1";</tt> <br /> <tt>AB000381 Twinscan&nbsp; exon&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;  700&nbsp;&nbsp; 800&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; .&nbsp; gene_id "AB000381.000"; transcript_id 
"AB000381.000.1";</tt> <br /> <tt>AB000381 Twinscan&nbsp; CDS&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 700&nbsp;&nbsp; 707&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; 2&nbsp; gene_id "AB000381.000"; transcript_id "AB000381.000.1";</tt>  <br /> <tt>AB000381 Twinscan&nbsp; exon&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 900&nbsp; 1000&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; .&nbsp; gene_id "AB000381.000"; transcript_id "AB000381.000.1";</tt> <br /> <tt>AB000381 Twinscan&nbsp; start_codon&nbsp; 380&nbsp;&nbsp; 382&nbsp;&nbsp;  .&nbsp;&nbsp; +&nbsp;&nbsp; 0&nbsp; gene_id "AB000381.000"; transcript_id "AB000381.000.1";</tt> <br /> <tt>AB000381 Twinscan&nbsp; stop_codon&nbsp;&nbsp; 708&nbsp;&nbsp; 710&nbsp;&nbsp; .&nbsp;&nbsp; +&nbsp;&nbsp; 0&nbsp; gene_id "AB000381.000"; transcript_id "AB000381.000.1";<br /> <br /> attention:related content are referred from related websites mainteined by related orgnization, refer them when neccessary.<br /> 注意：相关内容引自维护该格式的组织网站，如有必要请注明出处。<br /> </tt></div><img src ="http://www.cppblog.com/ewre/aggbug/161130.html" width = "1" height = "1" /><br><br><div align=right><a style="text-decoration:none;" href="http://www.cppblog.com/ewre/" target="_blank">ewre</a> 2011-11-29 15:26 <a href="http://www.cppblog.com/ewre/articles/161130.html#Feedback" target="_blank" style="text-decoration:none;">发表评论</a></div>]]></description></item></channel></rss>