FEATURES

FEATURES - Table containing information on portions of the sequence that code for proteins and RNA molecules and information on experimentally determined sites of biological significance. Optional keyword/one or more records.

FEATURES Format

GenBank releases use a new feature table format designed jointly by GenBank, the EMBL Nucleotide Sequence Data Library, and the DNA Data Bank of Japan. This format is now used by all three data banks.

The feature table contains information about genes and gene products, including regions of the sequence that code for proteins and RNA molecules, as well as other regions of biological significance reported in the sequence. It also enumerates differences between different reports of the same sequence, and provides cross-references to other data collections, as described in more detail below.

The first line of the feature table is a header that includes the keyword `FEATURES' and the column header `Location/Qualifiers'. Each feature consists of a descriptor line containing a feature key and a location (see sections below for details). If the location does not fit on this line, a continuation line may follow. If further information about the feature is required, one or more lines containing feature qualifiers may follow the descriptor line.

The feature key begins in column 6 and may be no more than 15 characters in length. The location begins in column 22. Feature qualifiers begin on subsequent lines at column 22. Location, qualifier, and continuation lines may extend from column 22 to 80.

Feature tables are optional, but a feature table that is present must include one header line and at least one feature descriptor line.

The sections below provide a brief introduction to the new feature table format. For a thorough description of the format, see the document `The DDBJ/EMBL/GenBank Feature Table: Definition.' If you would like a copy of this publication, contact GenBank at the address shown on the front page of these Release Notes.
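The column layout described above (feature key starting in column 6, location and qualifiers starting in column 22) can be read with plain string slicing. The following is a minimal sketch, not an official parser: the function name `parse_feature_table`, the helper `fmt`, and the dictionary layout are hypothetical. It also assumes that a continuation line is appended directly to the preceding location or qualifier, which suits locations and sequences but loses word breaks in wrapped free text, and it would misread a wrapped free-text line that happens to begin with a slash.

```python
from typing import Dict, List

KEY_START = 5     # 0-based index of column 6, where the feature key begins
VALUE_START = 21  # 0-based index of column 22, where locations/qualifiers begin


def parse_feature_table(lines: List[str]) -> List[Dict]:
    """Group feature-table lines into features using column positions only."""
    features: List[Dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip() or line.startswith("FEATURES"):
            continue  # skip blank lines and the header line
        key = line[KEY_START:VALUE_START].strip()
        value = line[VALUE_START:].strip()
        if key:
            # Descriptor line: feature key plus location.
            features.append({"key": key, "location": value, "qualifiers": []})
        elif not features:
            raise ValueError("qualifier or continuation line before any feature")
        elif value.startswith("/"):
            features[-1]["qualifiers"].append(value)
        elif features[-1]["qualifiers"]:
            # Assumption: continuation of the previous qualifier value.
            features[-1]["qualifiers"][-1] += value
        else:
            # Assumption: continuation of a long location.
            features[-1]["location"] += value
    return features


def fmt(key: str, value: str) -> str:
    """Hypothetical helper: key at column 6, value at column 22."""
    return " " * KEY_START + key.ljust(VALUE_START - KEY_START) + value


sample = [
    "FEATURES             Location/Qualifiers",
    fmt("CDS", "5..1261"),
    fmt("", '/product="alpha-1-antitrypsin precursor"'),
    fmt("", '/gene="PI"'),
    fmt("tRNA", "1..87"),
]
for feature in parse_feature_table(sample):
    print(feature)
```

Slicing by column rather than splitting on whitespace keeps the sketch independent of what a location or qualifier value contains, since values may themselves include spaces.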

Feature Key Names

The first column of the feature descriptor line contains the feature key. It starts at column 6 and can continue to column 20. The list of valid feature keys is shown below.

allele           Related strain contains alternative gene form
attenuator       Sequence related to transcription termination
C_region         Span of the C immunological feature
CAAT_signal      `CAAT box' in eukaryotic promoters
CDS              Sequence coding for amino acids in protein (includes
                 stop codon)
cellular         Region of cellular DNA
conflict         Independent determinations differ
D-loop           Displacement loop
D_region         Span of the D immunological feature
enhancer         Cis-acting enhancer of promoter function
exon             Region that codes for part of spliced mRNA
GC_signal        `GC box' in eukaryotic promoters
iDNA             Intervening DNA eliminated by recombination
insertion_seq    Insertion sequence (IS), a small transposon
intron           Transcribed region excised by mRNA splicing
J_region         Span of the J immunological feature
LTR              Long terminal repeat
mat_peptide      Mature peptide coding region (does not include stop
                 codon)
misc_binding     Miscellaneous binding site
misc_difference  Miscellaneous difference feature
misc_feature     Region of biological significance that cannot be
                 described by any other feature
misc_recomb      Miscellaneous recombination feature
misc_RNA         Miscellaneous transcript feature not defined by other
                 RNA keys
misc_signal      Miscellaneous signal
misc_structure   Miscellaneous DNA or RNA structure
modified_base    The indicated base is a modified nucleotide
mRNA             Messenger RNA
mutation         A mutation alters the sequence here
N_region         Span of the N immunological feature
old_sequence     Presented sequence revises a previous version
polyA_signal     Signal for cleavage & polyadenylation
polyA_site       Site at which polyadenine is added to mRNA
precursor_RNA    Any RNA species that is not yet the mature RNA product
prim_transcript  Primary (unprocessed) transcript
primer           Primer binding region used with PCR
primer_bind      Non-covalent primer binding site
promoter         A region involved in transcription initiation
protein_bind     Non-covalent protein binding site on DNA or RNA
provirus         Proviral sequence
RBS              Ribosome binding site
rep_origin       Replication origin for duplex DNA
repeat_region    Sequence containing repeated subsequences
repeat_unit      One repeated unit of a repeat_region
rRNA             Ribosomal RNA
S_region         Span of the S immunological feature
satellite        Satellite repeated sequence
scRNA            Small cytoplasmic RNA
sig_peptide      Signal peptide coding region
snRNA            Small nuclear RNA
stem_loop        Hair-pin loop structure in DNA or RNA
STS              Sequence Tagged Site; operationally unique sequence
                 that identifies the combination of primer spans used
                 in a PCR assay
TATA_signal      `TATA box' in eukaryotic promoters
terminator       Sequence causing transcription termination
transit_peptide  Transit peptide coding region
transposon       Transposable element (TN)
tRNA             Transfer RNA
unsure           Authors are unsure about the sequence in this region
V_region         Span of the V immunological feature
variation        A related population contains stable mutation
virion           Virion (encapsidated) viral sequence
- (hyphen)       Placeholder
-10_signal       `Pribnow box' in prokaryotic promoters
-35_signal       `-35 box' in prokaryotic promoters
3'clip           3'-most region of a precursor transcript removed in
                 processing
3'UTR            3' untranslated region (trailer)
5'clip           5'-most region of a precursor transcript removed in
                 processing
5'UTR            5' untranslated region (leader)

Feature Location

The second column of the feature descriptor line designates the location of the feature in the sequence. The location descriptor begins at column 22. Several conventions are used to indicate sequence location.

Base numbers in location descriptors refer to numbering in the entry, which is not necessarily the same as the numbering scheme used in the published report. The first base in the presented sequence is numbered base 1. Sequences are presented in the 5' to 3' direction.

Location descriptors can be one of the following:

  1. A single base;
  2. A contiguous span of bases;
  3. A site between two bases;
  4. A single base chosen from a range of bases;
  5. A single base chosen from among two or more specified bases;
  6. A joining of sequence spans;
  7. A reference to an entry other than the one to which the feature
     belongs (i.e., a remote entry), followed by a location descriptor
     referring to the remote sequence;
  8. A literal sequence (a string of bases enclosed in quotation marks).

A site between two residues, such as an endonuclease cleavage site, is indicated by listing the two bases separated by a caret (e.g., 23^24).

A single residue chosen from a range of residues is indicated by the numbers of the first and last bases in the range separated by a single period (e.g., 23.79).

A contiguous span of bases is indicated by the numbers of the first and last bases in the range separated by two periods (e.g., 23..79). In either kind of range, the symbols < and > indicate that the end point of the range is beyond the specified base number. Starting and ending positions can be indicated by base number or by one of the operators described below.

Operators are prefixes that specify what must be done to the indicated sequence to locate the feature. The following are the operators available, along with their most common format and a description.

complement (location):
    The feature is complementary to the location indicated.
    Complementary strands are read 5' to 3'.

join (location, location, .. location):
    The indicated elements should be placed end to end to form one
    contiguous sequence.

order (location, location, .. location):
    The elements are found in the specified order in the 5' to 3'
    direction, but nothing is implied about the rationality of joining
    them.

group (location, location, .. location):
    The elements are related and should be grouped together, but no
    order is implied.

one-of (location, location, .. location):
    The element can be any one, but only one, of the items listed.

replace (location, location):
    The first location indicated should be replaced by the sequence
    from the second location; used for insertions, deletions, and
    variants.
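Because operators can be nested inside one another, a location is naturally handled by a small recursive routine. The sketch below is illustrative only and covers a subset of the descriptors above: single bases, contiguous spans, and the complement, join, and order operators. The names `parse_location` and `split_top_level` and the `(first, last, strand)` tuple are assumptions made for this example. The `<` and `>` symbols are accepted but not recorded, and any other form (such as `23^24`, `23.79`, or remote references) raises an error rather than being guessed at.

```python
import re
from typing import List, Tuple

Span = Tuple[int, int, int]  # (first base, last base, strand: +1 or -1)


def split_top_level(text: str) -> List[str]:
    """Split on commas that are not nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])
    return [p.strip() for p in parts]


def parse_location(loc: str) -> List[Span]:
    """Return the spans of a location in 5' to 3' reading order."""
    loc = loc.strip()
    m = re.fullmatch(r"(complement|join|order)\((.*)\)", loc)
    if m:
        op, inner = m.groups()
        if op == "complement":
            # Reverse the span order and flip the strand so that the
            # result is still listed in 5' to 3' reading order.
            return [(s, e, -strand)
                    for s, e, strand in reversed(parse_location(inner))]
        spans: List[Span] = []
        for part in split_top_level(inner):
            spans.extend(parse_location(part))
        return spans
    m = re.fullmatch(r"<?(\d+)\.\.>?(\d+)", loc)
    if m:
        # Contiguous span; the < and > markers are dropped in this sketch.
        return [(int(m.group(1)), int(m.group(2)), 1)]
    if re.fullmatch(r"\d+", loc):
        return [(int(loc), int(loc), 1)]  # single base
    raise ValueError(f"unsupported location: {loc!r}")


print(parse_location("23..79"))
print(parse_location("join(1..5,complement(7..9))"))
print(parse_location("complement(join(1..5,7..9))"))
```

Note that this sketch treats join and order identically; a fuller implementation would need to keep the operator, since order makes no claim that the pieces form one contiguous sequence.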

Feature Qualifiers

Qualifiers provide additional information about features. They take the form of a slash (/) followed by a qualifier name and, if applicable, an equal sign (=) and a qualifier value. Feature qualifiers begin at column 22.

Qualifiers convey many types of information. Their values can, therefore, take several forms:

  1. Free text;
  2. Controlled vocabulary or enumerated values;
  3. Citations or reference numbers;
  4. Sequences;
  5. Feature labels.

Text qualifier values must be enclosed in double quotation marks. The text can consist of any printable characters (ASCII values 32-126 decimal). If the text string includes double quotation marks, each one must be `escaped' by placing a second double quotation mark in front of it (e.g., /note="This is an example of ""escaped"" quotation marks").

Some qualifiers require values selected from a limited set of choices. For example, the `/direction' qualifier has only three values: `left,' `right,' or `both.' These are called controlled vocabulary qualifier values. Controlled qualifier values are not case sensitive; they can be entered in any combination of upper- and lowercase without changing their meaning.

Citation or published reference numbers for the entry should be enclosed in square brackets ([]) to distinguish them from other numbers. Multiple citations are separated by commas (e.g., [1],[2],[3]).

A literal sequence of bases (e.g., "atgcatt") should be enclosed in quotation marks. Literal sequences are distinguished from free text by context. Qualifiers that take free text as their values do not take literal sequences, and vice versa.

The `/label=' qualifier takes a feature label as its value. Although feature labels are optional, they allow unambiguous references to the feature. The feature label identifies a feature within an entry; when combined with the accession number and the name of the data bank from which it came, it is a unique tag for that feature. Feature labels must be unique within an entry, but can be the same as a feature label in another entry. Feature labels are not case sensitive; they can be entered in any combination of upper- and lowercase without changing their meaning.

The following is a list of valid feature qualifiers.

/anticodon       Location of the anticodon of tRNA and the amino acid
                 for which it codes
/bound_moiety    Moiety bound
/citation        Reference to a citation providing the claim of or
                 evidence for a feature
/codon           Specifies a codon that is different from any found in
                 the reference genetic code
/codon_start     Indicates the first base of the first complete codon
                 in a CDS (as 1 or 2 or 3)
/cons_splice     Identifies intron splice sites that do not conform to
                 the 5'-GT... AG-3' splice site consensus
/direction       Direction of DNA replication
/EC_number       Enzyme Commission number for the enzyme product of the
                 sequence
/evidence        Value indicating the nature of supporting evidence
/frequency       Frequency of the occurrence of a feature
/function        Function attributed to a sequence
/gene            Symbol of the gene corresponding to a sequence region
                 (usable with all features)
/label           A label used to permanently identify a feature
/map             Map position of the feature in free-format text
/mod_base        Abbreviation for a modified nucleotide base
/note            Any comment or additional information
/number          A number indicating the order of genetic elements
                 (e.g., exons or introns) in the 5' to 3' direction
/organism        Name of organism if different from that contained in
                 the entry's ORGANISM field
/partial         Differentiates between complete regions and partial
                 ones
/phenotype       Phenotype conferred by the feature
/product         Name of a product encoded by the sequence
/pseudo          Indicates that this feature is a non-functional
                 version of the element named by the feature key
/rpt_family      Type of repeated sequence; Alu or Kpn, for example
/rpt_type        Organization of repeated sequence
/rpt_unit        Identity of repeat unit that constitutes a
                 repeat_region
/standard_name   Accepted standard name for this feature
/transl_except   Translational exception: single codon, the translation
                 of which does not conform to the reference genetic
                 code
/translation     Amino acid translation of coding region (automatically
                 generated)
/type            Name of a strain if different from that in the SOURCE
                 field
/usedin          Indicates that feature is used in a compound feature
                 in another entry
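The slash, name, optional equal sign, and value structure described in this section, together with the doubled-quote escape rule for free text, can be illustrated with a short function. This is a minimal sketch under stated assumptions: the name `parse_qualifier` is hypothetical, the qualifier is assumed to have already been reassembled onto a single line, and no check is made against the list of valid qualifier names.

```python
from typing import Optional, Tuple


def parse_qualifier(text: str) -> Tuple[str, Optional[str]]:
    """Split '/name=value' into (name, value).

    The value is None for a qualifier written without '=' (for example
    /pseudo). Quoted free text has its enclosing quotation marks removed
    and each doubled quotation mark ("") restored to a single one.
    """
    if not text.startswith("/"):
        raise ValueError("a qualifier must begin with a slash")
    name, sep, value = text[1:].partition("=")
    if not sep:
        return name, None
    if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
        value = value[1:-1].replace('""', '"')
    return name, value


print(parse_qualifier('/note="This is an example of ""escaped"" quotation marks"'))
print(parse_qualifier("/codon_start=1"))
print(parse_qualifier("/citation=[2]"))
print(parse_qualifier("/pseudo"))
```

Values that are not quoted, such as numbers, citations in square brackets, and controlled vocabulary terms, are returned unchanged as strings; interpreting them is left to the caller.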

Cross-Reference Information

One type of information in the feature table lists cross-references to the annual compilation of transfer RNA sequences in Nucleic Acids Research, which has kindly been sent to us on CD-ROM by Dr. Sprinzl. Each tRNA entry of the feature table contains a /note= qualifier that includes a reference such as `(NAR: 1234)' to identify code 1234 in the NAR compilation. When such a cross-reference appears in an entry that contains a gene coding for a transfer RNA molecule, it refers to the code in the tRNA gene compilation. Similar cross-references in entries containing mature transfer RNA sequences refer to the companion compilation of tRNA sequences published by D.H. Gauss and M. Sprinzl in Nucleic Acids Research.
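A cross-reference of the form `(NAR: 1234)' can be pulled out of a /note= value with a regular expression. The sketch below assumes that the code is written as decimal digits, as in the example above; the pattern and the function name `nar_code` are illustrative and are not part of the format definition.

```python
import re
from typing import Optional

# Assumption: the code inside the parentheses consists of digits only.
NAR_REF = re.compile(r"\(NAR:\s*(\d+)\)")


def nar_code(note: str) -> Optional[int]:
    """Return the NAR compilation code found in a note, or None."""
    match = NAR_REF.search(note)
    return int(match.group(1)) if match else None


print(nar_code("Leu-tRNA-CAA (NAR: 1057)"))
print(nar_code("insertion element IS5"))
```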

Feature Table Examples

In the first example a number of key names, feature locations, and qualifiers are illustrated, taken from different sequences. The first table entry is a coding region consisting of a simple span of bases and including a /gene qualifier. In the second table entry, an NAR cross-reference is given (see the previous section for a discussion of these cross-references). The third and fourth table entries use the symbols `<' and `>' to indicate that the beginning or end of the feature is beyond the range of the presented sequence. In the fifth table entry, the symbol `^' indicates that the feature is between bases. In the sixth table entry, the replace operator is shown.

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
     CDS             5..1261
                     /product="alpha-1-antitrypsin precursor"
                     /map="14q32.1"
                     /gene="PI"
     tRNA            1..87
                     /note="Leu-tRNA-CAA (NAR: 1057)"
                     /anticodon=(pos:35..37,aa:Leu)
     mRNA            1..>66
                     /note="alpha-1-acid glycoprotein mRNA"
     transposon      <1..267
                     /note="insertion element IS5"
     misc_recomb     105^106
                     /note="B.subtilis DNA end/IS5 DNA start"
     conflict        replace(258..258,"t")
                     /citation=[2]
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 10. Feature Table Entries

The next example shows the representation for a CDS that spans more than one entry.

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
LOCUS       HUMPGAMM1    3688 bp ds-DNA             PRI       15-OCT-1990
DEFINITION  Human phosphoglycerate mutase (muscle specific isozyme) (PGAM-M)
            gene, 5' end.
ACCESSION   M55673 M25818 M27095
KEYWORDS    phosphoglycerate mutase.
SEGMENT     1 of 2
  . . .
FEATURES             Location/Qualifiers
     CAAT_signal     1751..1755
                     /gene="PGAM-M"
     TATA_signal     1791..1799
                     /gene="PGAM-M"
     exon            1820..2274
                     /number=1
                     /EC_number="5.4.2.1"
                     /gene="PGAM-M"
     intron          2275..2377
                     /number=1
                     /gene="PGAM2"
     exon            2378..2558
                     /number=2
                     /gene="PGAM-M"
  . . .
//
LOCUS       HUMPGAMM2     677 bp ds-DNA             PRI       15-OCT-1990
DEFINITION  Human phosphoglycerate mutase (muscle specific isozyme) (PGAM-M),
            exon 3.
ACCESSION   M55674 M25818 M27096
KEYWORDS    phosphoglycerate mutase.
SEGMENT     2 of 2
  . . .
FEATURES             Location/Qualifiers
     exon            255..457
                     /number=3
                     /gene="PGAM-M"
     intron          order(M55673:2559..>3688,<1..254)
                     /number=2
                     /gene="PGAM-M"
     mRNA            join(M55673:1820..2274,M55673:2378..2558,255..457)
                     /gene="PGAM-M"
     CDS             join(M55673:1861..2274,M55673:2378..2558,255..421)
                     /note="muscle-specific isozyme"
                     /gene="PGAM2"
                     /product="phosphoglycerate mutase"
                     /codon_start=1
                     /translation="MATHRLVMVRHGESTWNQENRFCGWFDAELSEKGTEEAKRGAKA
                     IKDAKMEFDICYTSVLKRAIRTLWAILDGTDQMWLPVVRTWRLNERHYGGLTGLNKAE
                     TAAKHGEEQVKIWRRSFDIPPPPMDEKHPYYNSISKERRYAGLKPGELPTCESLKDTI
                     ARALPFWNEEIVPQIKAGKRVLIAAHGNSLRGIVKHLEGMSDQAIMELNLPTGIPIVY
                     ELNKELKPTKPMQFLGDEETVRKAMEAVAAQGKAK"
  . . .
//
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 11. Joining Sequences
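The CDS location in Example 11 mixes spans from a remote entry (prefixed with the accession number M55673 and a colon) with a span from the current entry. As a rough illustration, the sketch below splits such a join into its segments and adds up their lengths. The names `join_segments` and `total_length` are hypothetical, only a flat join of simple spans is handled, and fetching the remote sequence itself is outside the scope of the sketch.

```python
import re
from typing import List, Optional, Tuple

# (accession of the remote entry, or None for the current entry,
#  first base, last base)
Segment = Tuple[Optional[str], int, int]

SEGMENT = re.compile(r"(?:(\w+):)?<?(\d+)\.\.>?(\d+)")


def join_segments(location: str) -> List[Segment]:
    """Split a flat join(...) location into its local and remote spans."""
    m = re.fullmatch(r"join\((.*)\)", location.strip())
    if not m:
        raise ValueError("expected a join(...) location")
    segments: List[Segment] = []
    for part in m.group(1).split(","):
        s = SEGMENT.fullmatch(part.strip())
        if not s:
            raise ValueError(f"unsupported segment: {part!r}")
        segments.append((s.group(1), int(s.group(2)), int(s.group(3))))
    return segments


def total_length(segments: List[Segment]) -> int:
    """Number of bases covered; both end points of each span are included."""
    return sum(last - first + 1 for _, first, last in segments)


cds = "join(M55673:1861..2274,M55673:2378..2558,255..421)"
segments = join_segments(cds)
for accession, first, last in segments:
    print(accession or "(this entry)", first, last)
print(total_length(segments))  # 414 + 181 + 167 = 762 bases
```

The total of 762 bases is a multiple of three, as expected for a complete coding region with /codon_start=1.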
-----------------------------------------------------------------------