
Bioinformatics home

 
 
 


 
 

AGP of Genome annotation  

2011-07-07 02:54:38 | Category: Bioinformatics Programming

AGP Information

AGP Specification (pdf)


AGP File Specification (v. 1.1)
Introduction
File Format
Describing Breaks and Continuity
Describing Scaffolds with Unknown Orientation
Validation
Examples
Introduction:

What it is: Describes the assembly of an object. This object can be a contig, a scaffold (supercontig), or a chromosome. Each line (row) of the AGP file describes a different piece of the object, and has the column entries defined below. Extended comments follow. The format was initially developed during the early assembly phase of the human genome by UCSC, EBI and NCBI. Special thanks to UCSC for their nice web site (where I was able to obtain additional information).

What it is not: a description of the alignments between components used to construct the larger molecule. Not all of the information in proprietary assembly files can be represented in the AGP format. It is also not for recording the spans of features like repeats or genes.


Definitions:
Contig:
a non-redundant sequence formed by joining, based on sequence overlap, one or more smaller sequences. The smaller sequences can be individual sequence reads (commonly called traces) or entire clone sequences. There should be no gaps in a contig (although there may be short runs of Ns due to ambiguous base calls).
Scaffold (supercontig):
a non-redundant sequence formed by joining one or more contig sequences. The distinction is that no sequence overlap is required to construct the larger sequence. Additional information, such as clone end analysis, can support the relationship. There can be, and typically there are, gaps in a scaffold.
Gap:
a subregion within an object where there is no known sequence. Generally represented as a run of the letter ‘N’.
Component:
a sequence used to construct a larger sequence.

File Format:

One feature of the AGP file is that column definitions change depending on whether the line is a component line or a gap line. There is a single column definition up to column 5, then each column will have two definitions, depending on the value in column 5.

AGP File Format
column content description
1 object This is the identifier for the object being assembled. This can be a chromosome, scaffold or contig. If the object is a chromosome and an accession.version identifier is not used to describe the object, then the naming convention is to precede the chromosome number with “chr” (if a chromosome) or “LG” (if a linkage group). For example: chr1. If the object is a contig or scaffold, then the identifier needs to be unique within the assembly.
2 object_beg The starting coordinates of the component/gap on the object in column 1. These are the location in the object’s coordinate system, not the component’s.
3 object_end The ending coordinates of the component/gap on the object in column 1. These are the location in the object’s coordinate system, not the component’s.
4 part_number The line count for the components/gaps that make up the object described in column 1.
5 component_type The sequencing status of the component. These typically correspond to keywords in the International Sequence Database (GenBank/EMBL/DDBJ) submission. Current acceptable values are:
A: Active Finishing
D: Draft HTG (often phase1 and phase2 are called Draft, whether or not they have the draft keyword).
F: Finished HTG (phase 3)
G: Whole Genome Finishing
N: gap with specified size
O: Other sequence (typically means no HTG keyword)
P: Pre Draft
U: gap of unknown size, typically defaulting to predefined values.
W: WGS contig
6a component_id If column 5 not equal to N: This is a unique identifier for the sequence component contributing to the object described in column 1. Ideally this will be a valid accession.version identifier assigned by GenBank/EMBL/DDBJ. If the sequence has not been submitted to a public repository yet, a local identifier should be used.
6b gap_length If column 5 equal to N: This column represents the length of the gap.
7a component_beg If column 5 not equal to N: This column specifies the beginning of the part of the component sequence that contributes to the object in column 1 (in component coordinates).
7b gap_type

If column 5 equal to N: This column specifies the gap type. The combination of gap type and linkage (column 8b) indicates whether the gap is captured or uncaptured. In some cases, the gap types are assigned a biological value (e.g. centromere).

Accepted values:

fragment: gap between two sequence contigs (also called a ‘sequence gap’).
clone: a gap between two clones that do not overlap.
contig: a gap between clone contigs (also called a ‘layout gap’).
centromere: a gap inserted for the centromere.
short_arm: a gap inserted at the start of an acrocentric chromosome.
heterochromatin: a gap inserted for an especially large region of heterochromatic sequence (may also include the centromere).
telomere: a gap inserted for the telomere.
repeat: an unresolvable repeat.
8a component_end If column 5 not equal to N: This column specifies the end of the part of the component that contributes to the object in column 1 (in component coordinates).
8b linkage

If column 5 equal to N: This column indicates if there is evidence of linkage between the adjacent lines.

Values:


yes
no
9a orientation

If column 5 not equal to N: This column specifies the orientation of the component relative to the object in column 1.

Values:

+: plus
-: minus
0 (zero): unknown
na: irrelevant

By default, components with unknown orientation (0 or na) are treated as if they had + orientation.

9b   If column 5 equal to N: This column is empty; there is no filler. A tab should still be inserted after the 8th column so that all lines have 9 columns.
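
To make the two column layouts concrete, here is a minimal Python sketch that splits one AGP line into either a component record or a gap record, keyed on column 5. The class and function names and the sample identifiers are hypothetical, not part of the specification. The sketch also assumes that lines of type U use the gap layout, like N lines.

```python
from dataclasses import dataclass
from typing import Union

# Assumption: type U (gap of unknown size) is laid out like type N.
GAP_COMPONENT_TYPES = {"N", "U"}


@dataclass
class Component:
    obj: str
    object_beg: int
    object_end: int
    part_number: int
    component_type: str
    component_id: str    # column 6a
    component_beg: int   # column 7a
    component_end: int   # column 8a
    orientation: str     # column 9a


@dataclass
class Gap:
    obj: str
    object_beg: int
    object_end: int
    part_number: int
    component_type: str
    gap_length: int      # column 6b
    gap_type: str        # column 7b
    linkage: str         # column 8b (column 9b is empty)


def parse_agp_line(line: str) -> Union[Component, Gap]:
    """Parse one non-comment AGP line into a Component or a Gap."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(cols)}")
    # Columns 1-5 have the same meaning on every line.
    obj, beg, end, part, ctype = cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4]
    if ctype in GAP_COMPONENT_TYPES:
        return Gap(obj, beg, end, part, ctype, int(cols[5]), cols[6], cols[7])
    return Component(obj, beg, end, part, ctype, cols[5], int(cols[6]), int(cols[7]), cols[8])
```

For example, parse_agp_line("chr1\t501\t600\t2\tN\t100\tfragment\tyes\t\n") yields a Gap whose gap_length is 100. Note the trailing tab that keeps the line at 9 columns.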
Extended comments:
Columns should be tab delimited. Lines end with a newline (\n). There should be no extra space around the individual tokens.
All coordinates given in the file are 1-based inclusive (not 0-based). i.e. the first base of an object is 1 (not 0).
Evidence of linkage. In general, evidence of linkage is provided by end pairs (sometimes referred to as mate pairs), although other evidence, such as transcript alignments, could be used. In some cases, evidence of linkage may be indirect. For example, given the following scaffold:
     A--B--C--D
Where A, B, C and D are components, there could be end pairs linking A and B and end pairs linking A and C. There might be no pairs linking B and C but their linkage is implied.
If the object is a contig or scaffold, the object should not start with a gap line. A chromosome will frequently start or end with one or more biological gap types (e.g. telomere or short_arm).
A gap of type fragment will usually be flanked by components and not by other gap lines. Typically, successive gap lines are not encouraged, except in the case of gaps implying some biologically defined entity (such as centromere, heterochromatin, etc.).
Coordinates of the object are all with respect to the plus strand, no matter the orientation of the component.
object_beg (column 2) should always be less than or equal to object_end (column 3).
component_beg (column 7) should always be less than or equal to component_end (column 8).
Each object must start with a part_num of 1 (column 4) and an object_beg coordinate of 1 (column 2).
Gap lengths must be positive. Negative gaps and gap lines with zero length are not valid.
For negative gaps or gaps of unknown size, use 100 as the gap size, as that is the GenBank/EMBL/DDBJ standard for gaps of unknown size.
In the case of a GenBank/EMBL/DDBJ submission, the object identifier should be unique not only within the assembly but also across different versions of the assembly. For example, chrUn01.0001 in the first version of a genome and chrUn02.0001 in the second version.
Any text after a # symbol is assumed to be a comment.
The use of comment lines at the head of the file is encouraged. Useful information to include in such headers is:
organism name
assembly name
a description of any non-standard object identifiers
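
As a rough illustration of the coordinate rules in the extended comments (1-based inclusive coordinates, object coordinates always on the plus strand), the following Python sketch assembles an object sequence from its AGP rows. The function names and sample sequences are hypothetical. The sketch assumes gaps are rendered as runs of N and that orientations of 0 or na are treated as +, which is the default stated for column 9a.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return seq.translate(str.maketrans("ACGTNacgtn", "TGCANtgcan"))[::-1]


def build_object(rows, component_seqs):
    """Concatenate the parts of ONE object.

    rows: list of 9-column lists of strings, in file order.
    component_seqs: dict mapping component_id to its full sequence.
    """
    parts = []
    for cols in rows:
        if cols[4] in ("N", "U"):
            # Gap line: column 6b holds the gap length.
            parts.append("N" * int(cols[5]))
            continue
        beg, end = int(cols[6]), int(cols[7])
        # 1-based inclusive [beg, end] maps to the Python slice [beg - 1:end].
        piece = component_seqs[cols[5]][beg - 1:end]
        if cols[8] == "-":
            piece = reverse_complement(piece)
        parts.append(piece)
    return "".join(parts)


rows = [
    ["s1", "1", "4", "1", "W", "c1", "2", "5", "+"],
    ["s1", "5", "7", "2", "N", "3", "fragment", "yes", ""],
    ["s1", "8", "10", "3", "W", "c2", "1", "3", "-"],
]
seqs = {"c1": "AACGTT", "c2": "AAC"}
print(build_object(rows, seqs))  # prints ACGTNNNGTT
```

The resulting string has length 10, matching the object_end of the last row.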
Describing breaks and continuity:

Information about continuity is provided by the combination of the values in columns 7b and 8b, which describe how the object is built. The first version of this specification did not define how to use these columns, so their usage has diverged. Below is a proposal for how this information should be encoded.

column 7b column 8b Interpretation and description
contig no Break scaffold
A contig gap suggests a break between adjacent scaffolds implying no linkage.
contig yes Invalid
clone no Break scaffold
There may be a clone slated for the gap, but there is no evidence of linkage suggesting how this clone relates to its neighbors.
clone yes Do not break scaffold
There is evidence linking a clone to sequence on both sides of the gap. Default size is 50000 (entered in column 6b)
fragment no Do not break scaffold
A fragment gap implies that there is clone coverage across the gap, and therefore implies linkage. The ‘no’ in 8b suggests the adjacent sequences have unknown order or orientation. For example, gaps between sequence contigs in an HTGS_PHASE1 BAC clone will typically be ‘fragment no’ type gaps. Default gap size is 100 (entered in column 6b).
fragment yes Do not break scaffold
Same as above, although the ‘yes’ here suggests the adjacent sequences have known order and orientation. Gaps between sequence contigs that are linked by mate-pair evidence will typically be ‘fragment yes’ type gaps. For example, gaps between sequence contigs in scaffolds from a WGS assembly.
repeat no Break scaffold
If an unresolvable repeat unit is not spanned by clones, the linkage will be ‘no’.
centromere/ short_arm/ heterochromatin/telomere no Break scaffold
centromere/ short_arm/ heterochromatin/telomere yes Invalid
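
The proposal above can be sketched as a small lookup table in Python. The names INTERPRETATION and interpret_gap and the result labels "break", "keep" and "invalid" are hypothetical. Combinations that the table does not list (for example repeat with linkage yes) are deliberately left out and raise an error rather than being guessed.

```python
BIOLOGICAL_GAPS = ("centromere", "short_arm", "heterochromatin", "telomere")

# (gap_type, linkage) -> "break" (break scaffold), "keep" (do not break) or "invalid".
INTERPRETATION = {
    ("contig", "no"): "break",
    ("contig", "yes"): "invalid",
    ("clone", "no"): "break",
    ("clone", "yes"): "keep",
    ("fragment", "no"): "keep",
    ("fragment", "yes"): "keep",
    ("repeat", "no"): "break",
}
for gap_type in BIOLOGICAL_GAPS:
    INTERPRETATION[(gap_type, "no")] = "break"
    INTERPRETATION[(gap_type, "yes")] = "invalid"


def interpret_gap(gap_type: str, linkage: str) -> str:
    """Look up how a gap line (columns 7b and 8b) affects scaffold continuity."""
    try:
        return INTERPRETATION[(gap_type, linkage)]
    except KeyError:
        raise ValueError(
            f"combination ({gap_type!r}, {linkage!r}) is not covered by the table"
        ) from None
```

A caller splitting a chromosome into scaffolds would start a new scaffold whenever interpret_gap returns "break" and reject the file when it returns "invalid".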

Describing scaffolds with unknown orientation:

Scaffolds can sometimes be positioned along a chromosome or linkage group without there being sufficient data to orient the scaffold. Such placed but unoriented scaffolds can be indicated in an AGP that specifies how a chromosome or linkage group is assembled from scaffolds by using ‘0’ in the orientation column (9a) (see the example “chromosome from scaffolds”). It is not appropriate to use an orientation of ‘0’ in an AGP that specifies how a chromosome is assembled from contigs, except for any contigs that are not scaffolded to other components (singletons). Using an orientation of ‘0’ for all the contigs in a multi-component scaffold is misleading because doing so implies that each contig lies at the position indicated but could be in either orientation. Depending on the orientation of the scaffold, however, the contigs in an unoriented multi-component scaffold either lie at the indicated position in the ‘+’ orientation (the default) or at a different position in the ‘-’ orientation. The preferred method to indicate that scaffolds have been placed but their orientation is unknown is to provide a chromosome-from-scaffold AGP. Alternatively, a separate file listing placed but unoriented scaffolds can be provided to supplement a chromosome-from-contig AGP that uses the default orientation of ‘+’ for the component contigs in unoriented scaffolds.
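
The point about position depending on scaffold orientation can be checked with a little arithmetic. The Python sketch below uses a hypothetical function name and made-up coordinates, and assumes the whole scaffold is placed on the chromosome. It computes where a contig lands on the chromosome for each possible orientation of its scaffold.

```python
def contig_on_chromosome(scaf_beg, scaf_end, ctg_beg, ctg_end, scaffold_orientation):
    """Map a contig from scaffold coordinates to chromosome coordinates.

    scaf_beg, scaf_end: where the scaffold sits on the chromosome (1-based inclusive).
    ctg_beg, ctg_end: where the contig sits within the scaffold (1-based inclusive).
    """
    if scaffold_orientation == "+":
        return scaf_beg + ctg_beg - 1, scaf_beg + ctg_end - 1
    if scaffold_orientation == "-":
        # Flipping the scaffold mirrors the contig around the scaffold's midpoint.
        return scaf_end - ctg_end + 1, scaf_end - ctg_beg + 1
    raise ValueError("scaffold_orientation must be '+' or '-'")


# A 1000-base scaffold placed at chromosome positions 1001..2000,
# with a contig covering scaffold positions 1..300.
print(contig_on_chromosome(1001, 2000, 1, 300, "+"))  # (1001, 1300)
print(contig_on_chromosome(1001, 2000, 1, 300, "-"))  # (1701, 2000)
```

The two results differ, which is why marking every contig of an unoriented scaffold with ‘0’ would misstate where the contigs may lie.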


Validation:

File structure needs to be validated in the following ways:


Columns are tab delimited
All columns of numeric data must contain positive integers
Accession identifiers (and versions) must be valid
Columns with controlled values must only use those values
All columns must have some data (except 9b)
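
A minimal Python sketch of these structural checks for a single line follows. The function and constant names are hypothetical. Accession validity is not checked here because that requires a lookup against GenBank/EMBL/DDBJ. The sketch assumes that U lines use the gap layout, like N lines.

```python
COMPONENT_TYPES = set("ADFGNOPUW")
GAP_TYPES = {"fragment", "clone", "contig", "centromere",
             "short_arm", "heterochromatin", "telomere", "repeat"}
LINKAGE_VALUES = {"yes", "no"}
ORIENTATIONS = {"+", "-", "0", "na"}


def is_positive_int(text: str) -> bool:
    return text.isascii() and text.isdigit() and int(text) > 0


def structure_errors(line: str) -> list:
    """Return a list of structural problems found in one AGP line."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return [f"expected 9 tab-delimited columns, found {len(cols)}"]
    if cols[4] not in COMPONENT_TYPES:
        return [f"column 5 has unknown component_type {cols[4]!r}"]
    errors = []
    is_gap = cols[4] in ("N", "U")
    # 0-based indexes of the numeric columns for each line layout.
    numeric = (1, 2, 3, 5) if is_gap else (1, 2, 3, 6, 7)
    for i in numeric:
        if not is_positive_int(cols[i]):
            errors.append(f"column {i + 1} must be a positive integer")
    for i in range(8):
        if cols[i] == "":
            errors.append(f"column {i + 1} is empty")
    if is_gap:
        if cols[6] not in GAP_TYPES:
            errors.append(f"column 7 has unknown gap_type {cols[6]!r}")
        if cols[7] not in LINKAGE_VALUES:
            errors.append(f"column 8 has unknown linkage {cols[7]!r}")
        if cols[8] != "":
            errors.append("column 9 must be empty on gap lines")
    elif cols[8] not in ORIENTATIONS:
        errors.append(f"column 9 has unknown orientation {cols[8]!r}")
    return errors
```

An empty list means the line passed every check in this sketch; a file-level validator would call it on each non-comment line.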

File content needs to be validated in the following ways:


Each object must start with a part_num of 1 and an object_beg coordinate of 1.
All object ranges should be sequential and non-overlapping
object_beg must be less than or equal to object_end
component_beg must be less than or equal to component_end
The span specified for a component must be valid.
The length of the span specified for the component (in columns 7 and 8) must match the length of the span specified for the object (in columns 2 and 3).
If no gap lines exist between components, the defined switch points should be consistent with an alignment of the two components.
All gap lengths must be 1 base or longer.
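
Several of the content checks above can be sketched in Python for the rows of a single object. The function name and sample rows are hypothetical. The sketch reads "sequential and non-overlapping" as "each part begins one base after the previous part ends" and assumes part numbers increase by one per line; both are interpretations rather than statements from the specification. It also applies the length-match rule to gap lines, and it skips the checks that need the component sequences themselves (span validity and alignment at switch points).

```python
def content_errors(rows):
    """Check the rows (9-column lists of strings) of ONE object, in file order."""
    errors = []
    expected_beg = 1
    for n, cols in enumerate(rows, start=1):
        obj_beg, obj_end, part = int(cols[1]), int(cols[2]), int(cols[3])
        if part != n:
            errors.append(f"line {n}: part_number is {part}, expected {n}")
        if obj_beg != expected_beg:
            errors.append(f"line {n}: object_beg is {obj_beg}, expected {expected_beg}")
        if obj_beg > obj_end:
            errors.append(f"line {n}: object_beg is greater than object_end")
        if cols[4] in ("N", "U"):
            length = int(cols[5])
            if length < 1:
                errors.append(f"line {n}: gap length must be 1 base or longer")
        else:
            comp_beg, comp_end = int(cols[6]), int(cols[7])
            if comp_beg > comp_end:
                errors.append(f"line {n}: component_beg is greater than component_end")
            length = comp_end - comp_beg + 1
        if length != obj_end - obj_beg + 1:
            errors.append(f"line {n}: part length does not match the object span")
        expected_beg = obj_end + 1
    return errors


rows = [
    ["s1", "1", "500", "1", "W", "c1", "1", "500", "+"],
    ["s1", "501", "600", "2", "N", "100", "fragment", "yes", ""],
    ["s1", "601", "900", "3", "W", "c2", "101", "400", "-"],
]
print(content_errors(rows))  # prints []
```

Because the first expected object_beg is 1 and the first expected part number is 1, the rule that each object starts at part 1 and coordinate 1 is covered by the same loop.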

Examples:


Scaffold from contig (WGS)
Chromosome from scaffold (WGS)
Chromosome from contig (WGS)
Chromosome from contig (BAC)
© 1997-2017 NetEase, Inc. All rights reserved.