Genome Information Research Center, Osaka Univ.


FEATURES

FEATURES        - Table containing information on portions of the
sequence that code for proteins and RNA molecules and information on
experimentally determined sites of biological significance. Optional
keyword/one or more records.

FEATURES Format

  GenBank releases use a new feature table format designed jointly by
GenBank, the EMBL Nucleotide Sequence Data Library, and the DNA Data
Bank of Japan. This format is now used by all three data banks.

  The feature table contains information about genes and gene products,
as well as regions of biological significance reported in the
sequence, including regions that code for proteins and RNA molecules.
It also enumerates differences between different reports of the same
sequence, and provides cross-references to other data collections, as
described in more detail below.

  The first line of the feature table is a header that includes the
keyword `FEATURES' and the column header `Location/Qualifiers.' Each
feature consists of a descriptor line containing a feature key and a
location (see sections below for details). If the location does not
fit on this line, a continuation line may follow. If further
information about the feature is required, one or more lines
containing feature qualifiers may follow the descriptor line.

  The feature key begins in column 6 and may be no more than 15
characters in length. The location begins in column 22. Feature
qualifiers begin on subsequent lines at column 22. Location,
qualifier, and continuation lines may extend from column 22 to 80.
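
  The column layout described above can be sketched in Python as
follows. This is an illustration only, not part of the format
definition: the function names are hypothetical, columns are converted
from the 1-based numbering used here to 0-based string indices, and
continuation lines are simply concatenated (which suits wrapped
locations and translations, but free text would need a joining space).

```python
def split_feature_line(line):
    """Split one feature table line into (key, rest) by column position.

    Column 6 is string index 5 and column 22 is string index 21.
    """
    padded = line.rstrip("\n").ljust(21)
    key = padded[5:20].strip()    # columns 6-20: feature key, may be empty
    rest = padded[21:].strip()    # columns 22-80: location or qualifier
    return key, rest


def group_features(lines):
    """Group descriptor, continuation, and qualifier lines into features."""
    features = []
    for line in lines:
        if line[:5].strip():
            continue              # header line or another record type
        key, rest = split_feature_line(line)
        if key:                   # descriptor line starts a new feature
            features.append({"key": key, "location": rest, "qualifiers": []})
        elif not features:
            continue              # nothing to attach this line to yet
        elif rest.startswith("/"):
            features[-1]["qualifiers"].append(rest)
        elif features[-1]["qualifiers"]:
            features[-1]["qualifiers"][-1] += rest    # qualifier continuation
        else:
            features[-1]["location"] += rest          # location continuation
    return features
```

  A continuation line of free text that happens to begin with a slash
would be misread as a new qualifier by this sketch.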

  Feature tables are optional. When present, however, a feature table
must include one header line and at least one feature descriptor line.

  The sections below provide a brief introduction to the new feature
table format. For a thorough description of the new feature table
format, see the document `The DDBJ/EMBL/GenBank Feature Table:
Definition.' If you would like a copy of this publication, contact
GenBank at the address shown on the front page of these Release Notes.

Feature Key Names

  The first column of the feature descriptor line contains the feature
key. It starts at column 6 and can continue to column 20. The list of
valid feature keys is shown below.

allele          Related strain contains alternative gene form
attenuator      Sequence related to transcription termination
C_region        Span of the C immunological feature
CAAT_signal     `CAAT box' in eukaryotic promoters
CDS             Sequence coding for amino acids in protein (includes
                stop codon)
cellular        Region of cellular DNA
conflict        Independent determinations differ
D-loop          Displacement loop
D_region        Span of the D immunological feature
enhancer        Cis-acting enhancer of promoter function
exon            Region that codes for part of spliced mRNA
GC_signal       `GC box' in eukaryotic promoters
iDNA            Intervening DNA eliminated by recombination
insertion_seq   Insertion sequence (IS), a small transposon
intron          Transcribed region excised by mRNA splicing
J_region        Span of the J immunological feature
LTR             Long terminal repeat
mat_peptide     Mature peptide coding region (does not include stop codon)
misc_binding    Miscellaneous binding site
misc_difference Miscellaneous difference feature
misc_feature    Region of biological significance that cannot be described
                by any other feature
misc_recomb     Miscellaneous recombination feature
misc_RNA        Miscellaneous transcript feature not defined by other RNA keys
misc_signal     Miscellaneous signal
misc_structure  Miscellaneous DNA or RNA structure
modified_base   The indicated base is a modified nucleotide
mRNA            Messenger RNA
mutation        A mutation alters the sequence here
N_region        Span of the N immunological feature
old_sequence    Presented sequence revises a previous version
polyA_signal    Signal for cleavage & polyadenylation
polyA_site      Site at which polyadenine is added to mRNA
precursor_RNA   Any RNA species that is not yet the mature RNA product
prim_transcript Primary (unprocessed) transcript
primer          Primer binding region used with PCR
primer_bind     Non-covalent primer binding site
promoter        A region involved in transcription initiation
protein_bind    Non-covalent protein binding site on DNA or RNA
provirus        Proviral sequence
RBS             Ribosome binding site
rep_origin      Replication origin for duplex DNA
repeat_region   Sequence containing repeated subsequences
repeat_unit     One repeated unit of a repeat_region
rRNA            Ribosomal RNA
S_region        Span of the S immunological feature
satellite       Satellite repeated sequence
scRNA           Small cytoplasmic RNA
sig_peptide     Signal peptide coding region
snRNA           Small nuclear RNA
stem_loop       Hair-pin loop structure in DNA or RNA
STS             Sequence Tagged Site; operationally unique sequence that
                identifies the combination of primer spans used in a PCR assay
TATA_signal     `TATA box' in eukaryotic promoters
terminator      Sequence causing transcription termination
transit_peptide Transit peptide coding region
transposon      Transposable element (TN)
tRNA            Transfer RNA
unsure          Authors are unsure about the sequence in this region
V_region        Span of the V immunological feature
variation       A related population contains stable mutation
virion          Virion (encapsidated) viral sequence
- (hyphen)      Placeholder
-10_signal      `Pribnow box' in prokaryotic promoters
-35_signal      `-35 box' in prokaryotic promoters
3'clip          3'-most region of a precursor transcript removed in processing
3'UTR           3' untranslated region (trailer)
5'clip          5'-most region of a precursor transcript removed in processing
5'UTR           5' untranslated region (leader)

Feature Location

  The second column of the feature descriptor line designates the
location of the feature in the sequence. The location descriptor
begins at column 22. Several conventions are used to indicate
sequence location.

  Base numbers in location descriptors refer to numbering in the entry,
which is not necessarily the same as the numbering scheme used in the
published report. The first base in the presented sequence is numbered
base 1. Sequences are presented in the 5' to 3' direction.

Location descriptors can be one of the following:

1. A single base;

2. A contiguous span of bases;

3. A site between two bases;

4. A single base chosen from a range of bases;

5. A single base chosen from among two or more specified bases;

6. A joining of sequence spans;

7. A reference to an entry other than the one to which the feature
belongs (i.e., a remote entry), followed by a location descriptor
referring to the remote sequence;

8. A literal sequence (a string of bases enclosed in quotation marks).

  A site between two residues, such as an endonuclease cleavage site, is
indicated by listing the two bases separated by a caret (e.g., 23^24).

  A single residue chosen from a range of residues is indicated by the
number of the first and last bases in the range separated by a single
period (e.g., 23.79). The symbols < and > indicate that the end point
of the range is beyond the specified base number.

  A contiguous span of bases is indicated by the number of the first and
last bases in the range separated by two periods (e.g., 23..79). The
symbols < and > indicate that the end point of the range is beyond the
specified base number. Starting and ending positions can be indicated
by base number or by one of the operators described below.
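
  The simple descriptors above (single base, span, single base chosen
from a range, and site between two bases) can be recognized with one
regular expression. The following Python sketch assumes endpoints are
plain base numbers, not operators; the function name and the field
names of the returned dictionary are hypothetical.

```python
import re

# One endpoint: an optional < or > marker followed by a base number.
_POINT = r"([<>]?)(\d+)"
_SIMPLE = re.compile(rf"^{_POINT}(?:(\.\.|\.|\^){_POINT})?$")

_KINDS = {None: "base", "..": "span", ".": "one_of_range", "^": "between"}


def parse_simple_location(text):
    """Parse a descriptor such as 258, 23..79, 23.79, 23^24, or <1..267."""
    match = _SIMPLE.match(text.strip())
    if match is None:
        raise ValueError(f"not a simple location: {text!r}")
    start_mark, start, sep, end_mark, end = match.groups()
    if sep is None:               # single base: start and end coincide
        end_mark, end = "", start
    return {
        "kind": _KINDS[sep],
        "start": int(start),
        "end": int(end),
        "beyond_start": start_mark == "<",   # feature begins before start
        "beyond_end": end_mark == ">",       # feature continues past end
    }
```

  For instance, parse_simple_location("1..>66") reports a span from 1
to 66 with beyond_end set to True.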

  Operators are prefixes that specify what must be done to the indicated
sequence to locate the feature. The following are the operators
available, along with their most common format and a description.

complement (location): The feature is complementary to the location
indicated. Complementary strands are read 5' to 3'.

join (location, location, ... location): The indicated elements should
be placed end to end to form one contiguous sequence.

order (location, location, ... location): The elements are found in the
specified order in the 5' to 3' direction, but nothing is implied about
whether it is reasonable to join them.

group (location, location, ... location): The elements are related and
should be grouped together, but no order is implied.

one-of (location, location, ... location): The element can be any one,
but only one, of the items listed.

replace (location, location): The first location indicated should be
replaced by the sequence from the second location; used for
insertions, deletions, and variants.
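
  Because operators can be nested, a location is naturally read as a
small tree. The Python sketch below is one possible way to do this; it
is not a reference parser. It assumes well-formed input, leaves the
innermost descriptors as unparsed strings, and uses hypothetical
function names.

```python
OPERATORS = {"complement", "join", "order", "group", "one-of", "replace"}


def split_top_level(text):
    """Split on commas that are outside parentheses and quotation marks."""
    parts, current = [], []
    depth, in_quote = 0, False
    for ch in text:
        if ch == '"':
            in_quote = not in_quote
        elif not in_quote:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            elif ch == "," and depth == 0:
                parts.append("".join(current).strip())
                current = []
                continue          # do not keep the separating comma
        current.append(ch)
    parts.append("".join(current).strip())
    return parts


def parse_location(text):
    """Return a nested dict for operator expressions, else the plain string."""
    text = text.strip()
    name, paren, rest = text.partition("(")
    name = name.strip()
    if paren and name in OPERATORS and rest.endswith(")"):
        args = split_top_level(rest[:-1])
        return {"op": name, "args": [parse_location(arg) for arg in args]}
    return text
```

  Applied to replace(258..258,"t"), this yields the operator name
replace with the two arguments 258..258 and "t" (quotation marks kept).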

Feature Qualifiers

  Qualifiers provide additional information about features. They take
the form of a slash (/) followed by a qualifier name and, if
applicable, an equal sign (=) and a qualifier value. Feature
qualifiers begin at column 22.
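
  The slash/name/value form can be split apart as in the following
Python sketch. The function name is hypothetical, and the value is
returned exactly as written (including any quotation marks); a
qualifier without an equal sign is given the value None.

```python
def parse_qualifier(text):
    """Split '/name=value' into (name, value); value is None if absent."""
    text = text.strip()
    if not text.startswith("/"):
        raise ValueError(f"qualifier must begin with a slash: {text!r}")
    # Only the first equal sign separates name from value.
    name, equals, value = text[1:].partition("=")
    return name, (value if equals else None)
```

  For example, parse_qualifier("/number=1") gives ("number", "1"), and
parse_qualifier("/pseudo") gives ("pseudo", None).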

Qualifiers convey many types of information. Their values can,
therefore, take several forms:

1. Free text;
2. Controlled vocabulary or enumerated values;
3. Citations or reference numbers;
4. Sequences;
5. Feature labels.

  Text qualifier values must be enclosed in double quotation marks. The
text can consist of any printable characters (ASCII values 32-126
decimal). If the text string includes double quotation marks, each set
must be `escaped' by placing a double quotation mark in front of it
(e.g., /note="This is an example of ""escaped"" quotation marks").
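
  The quoting rule above can be expressed as a pair of small Python
helpers. The names are hypothetical, and the sketch assumes the value
has already been joined onto a single line.

```python
def quote_text_value(text):
    """Enclose free text in double quotation marks, doubling any inside."""
    if any(not 32 <= ord(ch) <= 126 for ch in text):
        raise ValueError("text must be printable ASCII (32-126 decimal)")
    return '"' + text.replace('"', '""') + '"'


def unquote_text_value(value):
    """Reverse quote_text_value: strip the outer marks, undouble the rest."""
    if len(value) < 2 or value[0] != '"' or value[-1] != '"':
        raise ValueError("value is not enclosed in double quotation marks")
    return value[1:-1].replace('""', '"')
```

  Quoting the text This is an example of "escaped" quotation marks
reproduces the value shown in the /note example above.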

  Some qualifiers require values selected from a limited set of choices.
For example, the `/direction' qualifier has only three values: `left,'
`right,' or `both.' These are called controlled vocabulary qualifier
values. Controlled qualifier values are not case sensitive; they can
be entered in any combination of upper- and lowercase without changing
their meaning.

  Citation or published reference numbers for the entry should be
enclosed in square brackets ([]) to distinguish them from other
numbers. Multiple citations are separated by commas (e.g.,
[1],[2],[3]).
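
  A citation value in this bracketed form can be converted to a list
of reference numbers as sketched below in Python. The function name is
hypothetical, and the sketch assumes no spaces appear between the
bracketed numbers.

```python
import re

# One or more bracketed numbers separated by commas, e.g. [1],[2],[3].
_CITATIONS = re.compile(r"^\[\d+\](?:,\[\d+\])*$")


def parse_citations(value):
    """Return the reference numbers in a value such as [1],[2],[3]."""
    value = value.strip()
    if not _CITATIONS.match(value):
        raise ValueError(f"not a citation list: {value!r}")
    return [int(number) for number in re.findall(r"\[(\d+)\]", value)]
```

  With this sketch, parse_citations("[1],[2],[3]") returns [1, 2, 3].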

  A literal sequence of bases (e.g., "atgcatt") should be enclosed in
quotation marks. Literal sequences are distinguished from free text by
context. Qualifiers that take free text as their values do not take
literal sequences, and vice versa.

  The `/label=' qualifier takes a feature label as its value.
Although feature labels are optional, they allow unambiguous
references to the feature. The feature label identifies a feature
within an entry; when combined with the accession number and the name
of the data bank from which it came, it is a unique tag for that
feature. Feature labels must be unique within an entry, but can be the
same as a feature label in another entry. Feature labels are not case
sensitive; they can be entered in any combination of upper- and
lowercase without changing their meaning.

The following is a list of valid feature qualifiers.

/anticodon      Location of the anticodon of tRNA and the amino acid
                for which it codes
/bound_moiety   Moiety bound
/citation       Reference to a citation providing the claim of or
                evidence for a feature
/codon          Specifies a codon that is different from any found in the
                reference genetic code
/codon_start    Indicates the first base of the first complete codon
                in a CDS (as 1 or 2 or 3)
/cons_splice    Identifies intron splice sites that do not conform to
                the 5'-GT... AG-3' splice site consensus
/direction      Direction of DNA replication
/EC_number      Enzyme Commission number for the enzyme product of the
                sequence
/evidence       Value indicating the nature of supporting evidence
/frequency      Frequency of the occurrence of a feature
/function       Function attributed to a sequence
/gene           Symbol of the gene corresponding to a sequence region (usable
                with all features)
/label          A label used to permanently identify a feature
/map            Map position of the feature in free-format text
/mod_base       Abbreviation for a modified nucleotide base
/note           Any comment or additional information
/number         A number indicating the order of genetic elements
                (e.g., exons or introns) in the 5' to 3' direction
/organism       Name of organism if different from that contained in
                the entry's ORGANISM field
/partial        Differentiates between complete regions and partial ones
/phenotype      Phenotype conferred by the feature
/product        Name of a product encoded by the sequence
/pseudo         Indicates that this feature is a non-functional
                version of the element named by the feature key
/rpt_family     Type of repeated sequence; Alu or Kpn, for example
/rpt_type       Organization of repeated sequence
/rpt_unit       Identity of repeat unit that constitutes a repeat_region
/standard_name  Accepted standard name for this feature
/transl_except  Translational exception: single codon, the translation
                of which does not conform to the reference genetic code
/translation    Amino acid translation of coding region (automatically
                generated)
/type           Name of a strain if different from that in the SOURCE field
/usedin         Indicates that feature is used in a compound feature
                in another entry

Cross-Reference Information

  One type of information in the feature table lists cross-references to
the annual compilation of transfer RNA sequences in Nucleic Acids
Research, which has kindly been sent to us on CD-ROM by Dr. Sprinzl.
Each tRNA entry of the feature table contains a /note= qualifier that
includes a reference such as `(NAR: 1234)' to identify code 1234 in
the NAR compilation. When such a cross-reference appears in an entry
that contains a gene coding for a transfer RNA molecule, it refers to
the code in the tRNA gene compilation. Similar cross-references in
entries containing mature transfer RNA sequences refer to the
companion compilation of tRNA sequences published by D.H. Gauss and M.
Sprinzl in Nucleic Acids Research.
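
  A reference of the form `(NAR: 1234)' can be pulled out of a /note
value with a regular expression, as in this Python sketch. The
function name is hypothetical, and the pattern assumes the code is a
plain decimal number as in the example above.

```python
import re

# Matches "(NAR: 1234)" and captures the numeric code.
_NAR_REF = re.compile(r"\(NAR:\s*(\d+)\)")


def nar_codes(note):
    """Return every NAR compilation code mentioned in a /note value."""
    return [int(code) for code in _NAR_REF.findall(note)]
```

  A note without such a reference simply yields an empty list.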

Feature Table Examples

  In the first example a number of key names, feature locations, and
qualifiers are illustrated, taken from different sequences. The first
table entry is a coding region consisting of a simple span of bases
and including a /gene qualifier. In the second table entry, an NAR
cross-reference is given (see the previous section for a discussion of
these cross-references). The third and fourth table entries use the
symbols `<' and `>' to indicate that the beginning or end of the
feature is beyond the range of the presented sequence. In the fifth
table entry, the symbol `^' indicates that the feature is between
bases. In the sixth table entry, the replace operator is shown.

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
     CDS             5..1261
                     /product="alpha-1-antitrypsin precursor"
                     /map="14q32.1"
                     /gene="PI"
     tRNA            1..87
                     /note="Leu-tRNA-CAA (NAR: 1057)"
                     /anticodon=(pos:35..37,aa:Leu)
     mRNA            1..>66
                     /note="alpha-1-acid glycoprotein mRNA"
     transposon      <1..267
                     /note="insertion element IS5"
     misc_recomb     105^106
                     /note="B.subtilis DNA end/IS5 DNA start"
     conflict        replace(258..258,"t")
                     /citation=[2]
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 10. Feature Table Entries


The next example shows the representation for a CDS that spans more
than one entry.

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
LOCUS       HUMPGAMM1    3688 bp ds-DNA             PRI       15-OCT-1990
DEFINITION  Human phosphoglycerate mutase (muscle specific isozyme) (PGAM-M)
            gene, 5' end.
ACCESSION   M55673 M25818 M27095
KEYWORDS    phosphoglycerate mutase.
SEGMENT     1 of 2
  .
  .
  .
FEATURES             Location/Qualifiers
     CAAT_signal     1751..1755
                     /gene="PGAM-M"
     TATA_signal     1791..1799
                     /gene="PGAM-M"
     exon            1820..2274
                     /number=1
                     /EC_number="5.4.2.1"
                     /gene="PGAM-M"
     intron          2275..2377
                     /number=1
                     /gene="PGAM2"
     exon            2378..2558
                     /number=2
                     /gene="PGAM-M"
  .
  .
  .
//
LOCUS       HUMPGAMM2     677 bp ds-DNA             PRI       15-OCT-1990
DEFINITION  Human phosphoglycerate mutase (muscle specific isozyme) (PGAM-M),
            exon 3.
ACCESSION   M55674 M25818 M27096
KEYWORDS    phosphoglycerate mutase.
SEGMENT     2 of 2
  .
  .
  .
FEATURES             Location/Qualifiers
     exon            255..457
                     /number=3
                     /gene="PGAM-M"
     intron          order(M55673:2559..>3688,<1..254)
                     /number=2
                     /gene="PGAM-M"
     mRNA            join(M55673:1820..2274,M55673:2378..2558,255..457)
                     /gene="PGAM-M"
     CDS             join(M55673:1861..2274,M55673:2378..2558,255..421)
                     /note="muscle-specific isozyme"
                     /gene="PGAM2"
                     /product="phosphoglycerate mutase"
                     /codon_start=1
                     /translation="MATHRLVMVRHGESTWNQENRFCGWFDAELSEKGTEEAKRGAKA
                     IKDAKMEFDICYTSVLKRAIRTLWAILDGTDQMWLPVVRTWRLNERHYGGLTGLNKAE
                     TAAKHGEEQVKIWRRSFDIPPPPMDEKHPYYNSISKERRYAGLKPGELPTCESLKDTI
                     ARALPFWNEEIVPQIKAGKRVLIAAHGNSLRGIVKHLEGMSDQAIMELNLPTGIPIVY
                     ELNKELKPTKPMQFLGDEETVRKAMEAVAAQGKAK"
  .
  .
  .
//
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 11. Joining Sequences
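
  A join that mixes remote and local spans, like the mRNA and CDS
locations in Example 11, can be resolved by looking up each span in
the appropriate entry and concatenating the pieces. The Python sketch
below is illustrative only: the function name and the sequences
dictionary (accession number mapped to sequence string) are
hypothetical, base numbers are treated as 1-based and inclusive, and
the symbols < and > and the other operators are not handled.

```python
def resolve_join(location, local_accession, sequences):
    """Concatenate the spans named in a flat join(...) location.

    Spans written as ACCESSION:start..end are taken from that entry;
    spans without an accession are taken from the local entry.
    """
    if not (location.startswith("join(") and location.endswith(")")):
        raise ValueError(f"not a join location: {location!r}")
    pieces = []
    for part in location[5:-1].split(","):
        accession, colon, span = part.strip().rpartition(":")
        if not colon:                      # no prefix: local entry
            accession = local_accession
        start, end = (int(number) for number in span.split(".."))
        # Convert 1-based inclusive coordinates to a Python slice.
        pieces.append(sequences[accession][start - 1:end])
    return "".join(pieces)
```

  With toy data such as sequences = {"A1": "aaaaacccccggggg", "B2":
"tttttgggggaaaaa"}, the call resolve_join("join(A1:1..3,A1:6..8,2..4)",
"B2", sequences) returns "aaacccttt".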
-----------------------------------------------------------------------