C++ Blog - 以致宏大，以致高远 - Category: Bioinformatics

About Corona Lite

ewre — Tue, 29 Nov 2011 07:44:00 GMT

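With AB's Corona Lite, be sure to read the descriptions in the reference and pay attention to the parameters of every function in it; write out each parameter and set its value explicitly wherever possible. In my case, I did not specify the tracking queue for the submit_script Perl program (I assumed the default would do) and spent 3 or 4 days struggling before finally discovering that this was the mistake...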

ewre, posted 2011-11-29 15:44

About large-scale data operations

ewre — Tue, 29 Nov 2011 07:42:00 GMT

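Operating on large-scale data raises questions of both efficiency and hardware resource usage, and these two goals usually trade off against each other: you cannot have both the fish and the bear's paw.
However, by preprocessing the data in advance we can, to some extent, get both.
Common preprocessing methods:
1. Multi-level sorting with an index.
Sort the data by a hierarchy of keys, sorting again by another key within each level, and, while sorting, build an index table that records where each level starts.
2. Use existing formats.
The value of reusing existing resources has been restated and stressed more than once.
Existing formats for large-scale genome-related data include GTF, GFF, and others.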
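
The first method (sort by a hierarchy of keys, then record where each level starts) can be sketched as follows. This is only a minimal illustration under stated assumptions: the records are hypothetical (chrom, start, end) tuples held in memory, and the names sort_and_index and lookup are made up for this example rather than taken from any library.

```python
def sort_and_index(records):
    """Sort (chrom, start, end) tuples and record the slice of each chrom.

    Returns the sorted list and a dict mapping chrom -> [first, last + 1].
    """
    # Level 1: chromosome name; level 2: start coordinate within it.
    ordered = sorted(records, key=lambda r: (r[0], r[1]))
    index = {}
    for pos, (chrom, _start, _end) in enumerate(ordered):
        if chrom not in index:
            index[chrom] = [pos, pos + 1]
        else:
            index[chrom][1] = pos + 1
    return ordered, index


def lookup(ordered, index, chrom, position):
    """Return the records on chrom that cover position.

    Only the slice belonging to chrom is scanned, and the scan stops as
    soon as a record starts beyond the queried position.
    """
    if chrom not in index:
        return []
    first, stop = index[chrom]
    hits = []
    for rec in ordered[first:stop]:
        if rec[1] > position:
            break  # records are sorted by start, nothing later can match
        if rec[2] >= position:
            hits.append(rec)
    return hits
```

The index costs one small dictionary, and in exchange a query touches only one chromosome's slice instead of the whole data set, which is the kind of compromise between speed and memory described above.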

ewre, posted 2011-11-29 15:42

Linux disk usage command: du

ewre — Tue, 29 Nov 2011 07:37:00 GMT

$ man du

NAME
       du - estimate file space usage

SYNOPSIS
       du [OPTION]... [FILE]...
       du [OPTION]... --files0-from=F

DESCRIPTION
       Summarize disk usage of each FILE, recursively for directories.

       Mandatory arguments to long options are mandatory for short options too.

       -a, --all
              write counts for all files, not just directories

       --apparent-size
              print apparent sizes, rather than disk usage; although the apparent size is usually smaller, it may be larger due to holes in ('sparse') files, internal
              fragmentation, indirect blocks, and the like

       -B, --block-size=SIZE
              use SIZE-byte blocks

       -b, --bytes
              equivalent to '--apparent-size --block-size=1'

       -c, --total
              produce a grand total

       -D, --dereference-args
              dereference FILEs that are symbolic links

       --files0-from=F
              summarize disk usage of the NUL-terminated file names specified in file F

       -H     like --si, but also evokes a warning; will soon change to be equivalent to --dereference-args (-D)

       -h, --human-readable
              print sizes in human readable format (e.g., 1K 234M 2G)

       --si   like -h, but use powers of 1000 not 1024

       -k     like --block-size=1K

       -l, --count-links
              count sizes many times if hard linked

       -m     like --block-size=1M

       -L, --dereference

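To show the sizes of the files and folders under the current directory in units of G:

$ du -a -B g . (du -sh . instead prints only the directory total, in human-readable units)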

2G      ./SRR057636_2.fastq
5G      ./SRR057631.fastq.sam
3G      ./SRR057638_1.fastq
3G      ./SRR057638_2.fastq
3G      ./SRR057639_1.fastq
3G      ./SRR057639_2.fastq
3G      ./SRR057640_1.fastq
3G      ./SRR057640_2.fastq
3G      ./SRR057641_1.fastq
3G      ./SRR057641_2.fastq
3G      ./SRR057642_1.fastq
3G      ./SRR057642_2.fastq
4G      ./SRR057643_2.fastq
3G      ./SRR057644_2.fastq
3G      ./SRR057645_2.fastq
3G      ./SRR057646_2.fastq
2G      ./SRR057647_2.fastq
2G      ./SRR057629.fastq.sam
1G      ./SRR057632.fastq.sam
128G    .
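
For scripting the same kind of report, here is a rough Python sketch. It is not a reimplementation of du: it sums apparent file sizes (closer to du --apparent-size), ignores the space taken by directory entries themselves, and the function names apparent_sizes, to_units and report are hypothetical. Rounding up to whole units is meant to mimic how the listing above shows whole G values.

```python
import os

GIB = 1024 ** 3


def apparent_sizes(root):
    """Yield (path, size_in_bytes) for every regular file under root."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # symbolic links are not followed
            yield path, os.path.getsize(path)


def to_units(size_bytes, unit=GIB):
    """Round a byte count up to whole units using integer arithmetic."""
    return -(-size_bytes // unit)


def report(root, unit=GIB, suffix="G"):
    """Return one line per file plus a final line with the total."""
    total = 0
    lines = []
    for path, size in sorted(apparent_sizes(root)):
        total += size
        lines.append(f"{to_units(size, unit)}{suffix}\t{path}")
    lines.append(f"{to_units(total, unit)}{suffix}\t{root}")
    return lines
```

Calling print("\n".join(report("."))) gives a listing in the same shape as the du output, with the total on the last line.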

ewre, posted 2011-11-29 15:37

The KEGG database now charges fees

ewre — Tue, 29 Nov 2011 07:35:00 GMT

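The maintainers of the KEGG database say that, for funding reasons, the KEGG pathway DB stopped providing free service on 2011-07-01. What a shame...

KEGG's pathway data is of very high quality, but once it starts charging, its traffic may well drop. Personally, I do not think this model is quite right.

The handful of databases in widest use today should join forces, from funding through to product; that would both cut costs and improve the quality of the databases.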

ewre, posted 2011-11-29 15:35

If you build a public analysis tool, please maintain it

ewre — Tue, 29 Nov 2011 07:35:00 GMT

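While doing bioinformatics analyses, I have found that some ideas are very good and already have usable implementations, mostly in the form of websites. But I have also noticed a pattern: many tools, once built and published in a paper, are simply abandoned, and many of them still run on database versions from 2006.

If you built the tool, please maintain it...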

ewre, posted 2011-11-29 15:35

GTF and GFF file formats

ewre — Tue, 29 Nov 2011 07:26:00 GMT

GFF file format:

Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

<seqname>

The name of the sequence. Having an explicit sequence name allows a feature file to be prepared for a data set of multiple sequences. Normally the seqname will be the identifier of the sequence in an accompanying fasta format file. An alternative is that <seqname> is the identifier for a sequence in a public database, such as an EMBL/Genbank/DDBJ accession number. Which is the case, and which file or database to use, should be explained in accompanying information.

<source>

The source of this feature. This field will normally be used to indicate the program making the prediction, or if it comes from public database annotation, or is experimentally verified, etc.

<feature>

The feature type name. We hope to suggest a standard set of features, to facilitate import/export, comparison etc.. Of course, people are free to define new ones as needed. For example, Genie splice detectors account for a region of DNA, and multiple detectors may be available for the same site, as shown above.

We would like to enforce a standard nomenclature for common GFF features. This does not forbid the use of other features, rather, just that if the feature is obviously described in the standard list, that the standard label should be used. For this standard table we propose to fall back on the international public standards for genomic database feature annotation, specifically, the DDBJ/EMBL/GenBank feature table documentation.

<start>, <end>

Integers. <start> must be less than or equal to <end>. Sequence numbering starts at 1, so these numbers should be between 1 and the length of the relevant sequence, inclusive. (Version 2 change: version 2 condones values of <start> and <end> that extend outside the reference sequence. This is often more natural when dumping from acedb, rather than clipping. It means that some software using the files may need to clip for itself.)

<score>

A floating point value. When there is no score (i.e. for a sensor that just records the possible presence of a signal, as for the EMBL features above) you should use '.'. (Version 2 change: in version 1 of GFF you had to write 0 in such circumstances.)

<strand>

One of '+', '-' or '.'. '.' should be used when strand is not relevant, e.g. for dinucleotide repeats. Version 2 change: This field is left empty '.' for RNA and protein features.

<frame>

One of '0', '1', '2' or '.'. '0' indicates that the specified region is in frame, i.e. that its first base corresponds to the first base of a codon. '1' indicates that there is one extra base, i.e. that the second base of the region corresponds to the first base of a codon, and '2' means that the third base of the region is the first base of a codon. If the strand is '-', then the first base of the region is value of <end>, because the corresponding coding region will run from <end> to <start> on the reverse strand. As with <strand>, if the frame is not relevant then set <frame> to '.'. It has been pointed out that "phase" might be a better descriptor than "frame" for this field.

Version 2 change: This field is left empty '.' for RNA and protein features.

[attribute]

From version 2 onwards, the attribute field must have a tag-value structure following the syntax used within objects in a .ace file, flattened onto one line by semicolon separators. Tags must be standard identifiers ([A-Za-z][A-Za-z0-9_]*). Free text values must be quoted with double quotes. Note: all non-printing characters in such free text value strings (e.g. newlines, tabs, control characters, etc) must be explicitly represented by their C (UNIX) style backslash-escaped representation (e.g. newlines as '\n', tabs as '\t'). As in ACEDB, multiple values can follow a specific tag. The aim is to establish consistent use of particular tags, corresponding to an underlying implied ACEDB model if you want to think that way (but acedb is not required). Examples of these would be:

seq1     BLASTX  similarity   101  235 87.1 + 0  Target "HBA_HUMAN" 11 55 ; E_value 0.0003
dJ102G20 GD_mRNA coding_exon 7105 7201   .  - 2 Sequence "dJ102G20.C1.1"

The semantics of tags in attribute field tag-value pairs has intentionally not been formalized. Two useful guidelines are to use DDBJ/EMBL/GenBank feature 'qualifiers' (see DDBJ/EMBL/GenBank feature table documentation), or the features that ACEDB generates when it dumps GFF.

Version 1 note: In version 1 the attribute field was called the group field, with the following specification:

An optional string-valued field that can be used as a name to group together a set of records. Typical uses might be to group the introns and exons in one gene prediction (or experimentally verified gene structure), or to group multiple regions of match to another sequence, such as an EST or a protein.

All of the above described fields should be separated by TAB characters ('\t'). All values of the mandatory fields should not include whitespace (i.e. the strings for <seqname>, <source> and <feature> fields).
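
To make the tab-separated layout concrete, here is a minimal parsing sketch. The column order (seqname, source, feature, start, end, score, strand, frame, then the optional attribute text) follows the GFF specification; the GffRecord type and the parse_gff_line function are hypothetical names for this illustration, and the sketch deliberately does not handle a trailing comment placed after the fields.

```python
from typing import NamedTuple, Optional


class GffRecord(NamedTuple):
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]  # None when the file has '.'
    strand: str
    frame: str
    attributes: str  # raw attribute text, left unparsed here


def parse_gff_line(line):
    """Parse one GFF line; return None for blank and comment lines."""
    line = line.rstrip("\n")
    if not line.strip() or line.lstrip().startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame = fields[:8]
    if strand not in ("+", "-", "."):
        raise ValueError(f"bad strand: {strand!r}")
    if frame not in ("0", "1", "2", "."):
        raise ValueError(f"bad frame: {frame!r}")
    return GffRecord(
        seqname, source, feature,
        int(start), int(end),
        None if score == "." else float(score),
        strand, frame,
        fields[8] if len(fields) > 8 else "",
    )
```

Note that the example rows printed in this post are aligned with spaces for readability, so they would need real TAB separators before a parser like this accepts them.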

Version 1 note: In version 1 each string had to be under 256 characters long, and the whole line should be under 32k long. This was to make things easier for guaranteed conforming parsers, but seemed unnecessary given modern languages.

Comments

Comments are allowed, starting with "#" as in Perl, awk etc. Everything following # until the end of the line is ignored. Effectively this can be used in two ways. Either it must be at the beginning of the line (after any whitespace), to make the whole line a comment, or the comment could come after all the required fields on the line.

## comment lines for meta information

There is a set of standardised (i.e. parsable) ## line types that can be used optionally at the top of a gff file. The philosophy is a little like the special set of %% lines at the top of postscript files, used for example to give the BoundingBox for EPS files.

Current proposed ## lines are:

gff-version

##gff-version 2

GFF version - in case it is a real success and we want to change it. The current default version is 2, so if this line is not present version 2 is assumed.

source-version

##source-version <source> <version text>

So that people can record what version of a program or package was used to make the data in this file. I suggest the version is text without whitespace. That allows things like 1.3, 4a etc. There should be at most one source-version line per source.

date

##date <date>

The date the file was made, or perhaps that the prediction programs were run. We suggest to use astronomical format: 1997-11-08 for 8th November 1997, first because these sort properly, and second to avoid any US/European bias.

Type

##Type <type> [<seqname>]

The type of host sequence described by the features. Standard types are 'DNA', 'Protein' and 'RNA'. The optional <seqname> allows multiple ##Type definitions describing multiple GFF sets in one file, each of which have a distinct type. If the name is not provided, then all the features in the file are of the given type. Thus, with this meta-comment, a single file could contain DNA, RNA and Protein features, for example, representing a single genomic locus or 'gene', alongside type-specific features of its transcribed mRNA and translated protein sequences. If no ##Type meta-comment is provided for a given GFF file, then the type is assumed to be DNA.

DNA


 ##DNA <seqname>
 ##acggctcggattggcgctggatgatagatcagacgac
 ##...
 ##end-DNA

To give a DNA sequence. Several people have pointed out that it may be convenient to include the sequence in the file. It should not become mandatory to do so, and in our experience this has been very little used. Often the seqname will be a well-known identifier, and the sequence can easily be retrieved from a database, or an accompanying file.

RNA


 ##RNA <seqname>
 ##acggcucggauuggcgcuggaugauagaucagacgac
 ##...
 ##end-RNA

Similar to DNA. Creates an implicit ##Type RNA directive.

Protein


 ##Protein <seqname>
 ##MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF
 ##...
 ##end-Protein

Similar to DNA. Creates an implicit ##Type Protein directive.

sequence-region

##sequence-region <seqname> <start> <end>

To indicate that this file only contains entries for the specified subregion of a sequence.
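
A small sketch of how a reader might pick up these ## lines. It assumes each directive has the shape "##name" followed by whitespace-separated arguments, and it applies the stated default of version 2 when no gff-version line is present. The function names parse_directive and gff_version are hypothetical. One known limitation: sequence data lines inside a ##DNA, ##RNA or ##Protein block also begin with ##, so a real reader would have to track those blocks separately.

```python
def parse_directive(line):
    """Split a '##name arg ...' line into (name, [args]).

    Returns None for ordinary comments, feature lines and a bare '##'.
    """
    stripped = line.strip()
    if not stripped.startswith("##"):
        return None
    parts = stripped[2:].split()
    if not parts:
        return None
    return parts[0], parts[1:]


def gff_version(lines, default=2):
    """Return the declared GFF version, or the default if none is declared."""
    for line in lines:
        parsed = parse_directive(line)
        if parsed and parsed[0] == "gff-version" and parsed[1]:
            return int(parsed[1][0])
    return default
```

For example, a file whose header holds only a ##date line would be treated as version 2.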

Please feel free to propose new ## lines.

The ## line proposal came out of some discussions including Anders Krogh, David Haussler, people at the Newton Institute on 1997-10-29 and some email from Suzanna Lewis. Of course, naive programs can ignore all of these...

File Naming

We propose that the format is called "GFF", with conventional file name ending ".gff".

Semantics

We have intentionally avoided overspecifying the semantics of the format. For example, we have not restricted the items expressible in GFF to a specified set of feature types (splice sites, exons etc.) with defined semantics. Therefore, in order for the information in a gff file to be useful to somebody else, the person producing the features must describe the meaning of the features.

In the example given above the feature "splice5" indicates that there is a candidate 5' splice site between positions 172 and 173. The "sp5-20" feature is a prediction based on a window of 20 bp for the same splice site. To use either of these, you must know the position within the feature of the predicted splice site. This only needs to be given once, possibly in comments at the head of the file, or in a separate document.

Another example is the scoring scheme; we ourselves would like the score to be a log-odds likelihood score in bits to a defined null model, but that is not required, because different methods take different approaches.

Avoiding a prespecified feature set also leaves open the possibility for GFF to be used for new feature types, such as CpG islands, hypersensitive sites, promoter/enhancer elements, etc.

GTF2(gene transfer format) file format:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

Here is a simple example with 3 translated exons. Order of rows is not important.

AB000381 Twinscan  CDS          380   401   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          501   650   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  CDS          700   707   .   +   2  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  start_codon  380   382   .   +   0  gene_id "001"; transcript_id "001.1";
AB000381 Twinscan  stop_codon   708   710   .   +   0  gene_id "001"; transcript_id "001.1";

The whitespace in this example is provided only for readability. In GTF, fields must be separated by a single TAB and no white space.

<seqname>

The FPC contig ID from the Golden Path.

<source>

The source column should be a unique label indicating where the annotations came from --- typically the name of either a prediction program or a public database.

<feature>

The following feature types are required: "CDS", "start_codon", "stop_codon". The feature "exon" is optional, since this project will not evaluate predicted splice sites outside of protein coding regions. All other features will be ignored.

CDS represents the coding sequence starting with the first translated codon and proceeding to the last translated codon. Unlike Genbank annotation, the stop codon is not included in the CDS for the terminal exon.

<start>, <end>

Integer start and end coordinates of the feature relative to the beginning of the sequence named in <seqname>. <start> must be less than or equal to <end>. Sequence numbering starts at 1. Values of <start> and <end> that extend outside the reference sequence are technically acceptable, but they are discouraged for purposes of this project.

<score>

The score field will not be used for this project, so you can either provide a meaningful float or replace it by a dot.

<frame>

0 indicates that the first whole codon of the reading frame is located at 5'-most base. 1 means that there is one extra base before the first codon and 2 means that there are two extra bases before the first codon. Note that the frame is not the length of the CDS mod 3.

Here are the details excised from the GFF spec. Important: Note comment on reverse strand.

'0' indicates that the specified region is in frame, i.e. that its first base corresponds to the first base of a codon. '1' indicates that there is one extra base, i.e. that the second base of the region corresponds to the first base of a codon, and '2' means that the third base of the region is the first base of a codon. If the strand is '-', then the first base of the region is value of <end>, because the corresponding coding region will run from <end> to <start> on the reverse strand.

[attributes]
All four features have the same two mandatory attributes at the end of the record:

gene_id value; A globally unique identifier for the genomic source of the transcript
transcript_id value; A globally unique identifier for the predicted transcript.

These attributes are designed for handling multiple transcripts from the same genomic region. Any other attributes or comments must appear after these two and will be ignored.

Attributes must end in a semicolon which must then be separated from the start of any subsequent attribute by exactly one space character (NOT a tab character).

Textual attributes should be surrounded by double quotes.
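
The attribute rules above (key, value, terminating semicolon, quoted text values) can be sketched with a small regular expression. This is an illustrative parser, not the reference one: parse_gtf_attributes is a hypothetical name, escaped quotes inside values are not handled, and the parser is more lenient about whitespace than the rules require, so it also accepts a TAB between key and value.

```python
import re

# key, whitespace, then either a "quoted value" or a bare token, then ';'
_ATTR_RE = re.compile(r'\s*(\w+)\s+("([^"]*)"|[^;\s]+)\s*;')

REQUIRED = ("gene_id", "transcript_id")


def parse_gtf_attributes(text):
    """Return a dict of attribute name -> value (quotes removed)."""
    attrs = {}
    for match in _ATTR_RE.finditer(text):
        key = match.group(1)
        quoted = match.group(3)
        value = quoted if quoted is not None else match.group(2)
        attrs.setdefault(key, value)  # keep the first value if a key repeats
    return attrs


def missing_required(attrs):
    """List the mandatory attributes that are absent."""
    return [name for name in REQUIRED if name not in attrs]
```

Running missing_required on the parsed result is a cheap way to reject records that lack gene_id or transcript_id before any further processing.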

Here is an example of a gene on the negative strand. Larger coordinates are 5' of smaller coordinates. Thus, the start codon is the 3 bp with the largest coordinates among all those bp that fall within the CDS regions. Similarly, the stop codon is the 3 bp with coordinates just less than the smallest coordinates within the CDS regions.

AB000123    Twinscan     CDS    193817    194022    .    -    2    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     CDS    199645    199752    .    -    2    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     CDS    200369    200508    .    -    1    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     CDS    215991    216028    .    -    0    gene_id "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     start_codon   216026    216028    .    -    .    gene_id    "AB000123.1"; transcript_id "AB00123.1.2";
AB000123    Twinscan     stop_codon    193814    193816    .    -    .    gene_id    "AB000123.1"; transcript_id "AB00123.1.2";

Note the frames of the coding exons. For example:

The first CDS (from 216028 to 215991) always has frame zero.
Frame of the 1st CDS = 0, length = 38. (frame - length) % 3 = 1, the frame of the 2nd CDS.
Frame of the 2nd CDS = 1, length = 140. (frame - length) % 3 = 2, the frame of the 3rd CDS.
Frame of the 3rd CDS = 2, length = 108. (frame - length) % 3 = 2, the frame of the terminal CDS.
Alternatively, the frame of terminal CDS can be calculated without the rest of the gene. Length of the terminal CDS=206. length % 3 =2, the frame of the terminal CDS.
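
The frame arithmetic above is easy to check in code. The sketch below assumes the CDS segments are given in translation order (5' to 3', so descending coordinates for a minus-strand gene); cds_frames and segment_length are hypothetical helper names. It relies on Python's % returning a non-negative result for a positive modulus; in languages where % can return a negative value, the expression would need adjusting.

```python
def segment_length(start, end):
    """Length of a feature with inclusive, 1-based coordinates."""
    return end - start + 1


def cds_frames(cds_lengths):
    """Frames for CDS segments listed in translation order.

    The first segment has frame 0; each following frame is
    (previous frame - previous length) % 3.
    """
    frames = []
    frame = 0
    for length in cds_lengths:
        frames.append(frame)
        frame = (frame - length) % 3
    return frames


# Minus-strand example: segments ordered from largest to smallest coordinates.
minus_strand_cds = [(215991, 216028), (200369, 200508),
                    (199645, 199752), (193817, 194022)]
lengths = [segment_length(s, e) for s, e in minus_strand_cds]
print(lengths)              # [38, 140, 108, 206]
print(cds_frames(lengths))  # [0, 1, 2, 2]
```

The same function reproduces the frames of the plus-strand example (CDS lengths 22, 150 and 8 give frames 0, 2 and 2).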

Here is an example in which the "exon" feature is used. It is a 5 exon gene with 3 translated exons.

AB000381 Twinscan exon         150   200   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         300   401   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS          380   401   .   +   0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         501   650   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS          501   650   .   +   2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         700   800   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan CDS          700   707   .   +   2 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan exon         900 1000   .   +   . gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan start_codon 380   382   .   +   0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";
AB000381 Twinscan stop_codon   708   710   .   +   0 gene_id "AB000381.000"; transcript_id "AB000381.000.1";

Note: the content above is quoted from the websites of the organizations that maintain these formats; cite the source where necessary.

ewre, posted 2011-11-29 15:26