DDBJ Release Notes


                          DNA Data Bank of Japan

                              DNA Database

Release 107.0, Dec. 2016: 790,211,658 entries, 2,144,818,812,438 bases
Last published date in the present release: November 25, 2016

------------------------------------------------------------------------------- 
Table of contents
-------------------------------------------------------------------------------

  1. Introduction
    1.1. Announcement for changes in the present release
    1.2. Announcement for the forthcoming changes

  2. Data categories
    2.1. Categories for conventional sequence data
    2.2. Categories for bulk sequence data
    2.3. Notice for data derived from Patent Offices

  3. Statistics and files
    3.1. Files for conventional sequence data
    3.2. Files for bulk sequence data

  4. Citation

  5. DDBJ staff

  6. Acknowledgment

  7. Disclaimer

  8. DDBJ flat file format
    8.1.  LOCUS line
    8.2.  DEFINITION line
    8.3.  ACCESSION line
    8.4.  VERSION line
    8.5.  DBLINK line
    8.6.  KEYWORDS line
    8.7.  SOURCE line
    8.8.  REFERENCE line
    8.9.  COMMENT line
    8.10. FEATURES line
    8.11. BASE COUNT line
    8.12. ORIGIN line

  9. Sample of the contents in each file
    9.1. Part of the contents in the file 'ddbjbct1.seq'
    9.2. Part of the contents in the accession number index file 'ddbjacc1.idx'
    9.3. Part of the contents in the gene name index 'ddbjgen1.idx'

  10. Release history

-------------------------------------------------------------------------------


1. Introduction

The present release contains the newest data prepared by the DNA Data Bank of 
Japan (DDBJ), GenBank (*), and EMBL-Bank/European Bioinformatics Institute 
(EMBL-Bank/EBI) as of November 25, 2016.  This unified database was made 
possible thanks to the international collaboration among the three data banks.
All the entries have accordingly been annotated using the feature keys common 
to them.  

In 2005, DDBJ, EMBL-Bank and GenBank agreed to call their collaboration 
"the International Nucleotide Sequence Database Collaboration (INSDC); 
http://www.insdc.org" and to call the unified nucleotide sequence database 
"the International Nucleotide Sequence Database (INSD)".  

* 'GenBank' is a trademark of NIH, USA, and is operated by National Center for 
Biotechnology Information (NCBI) at NIH.


1.1. Announcement for changes in the present release

Statistical information was changed: 

Previously, the DDBJ periodical release did not contain entries with 
four-letter prefix accession numbers, such as WGS data and a large part of 
the TSA data.  Because new types of high-throughput data with WGS-like 
accession numbers will be added in the near future, the statistical 
information has been modified, starting from the present release, to include 
entries with four-letter prefix accession numbers as 'bulk sequence data'.  
See also '2. Data categories'.  


Revision of the DDBJ/ENA/GenBank Feature Table Definition:  

Following the agreement at the INSD collaborative meeting in 2016, the 
document, DDBJ/ENA/GenBank Feature Table Definition, was revised in 
November 2016.  See also '8.10. FEATURES line' below.  

The revised points are introduced in advance on the following URL; 
http://www.ddbj.nig.ac.jp/insdc/icm2016-e.html#ft


1.2. Announcement for the forthcoming changes

A new data type of bulk sequence data, TLS, will be introduced: 

Data from large-scale sequencing studies of special marker genes will be 
included as a new data type of bulk sequence data, Targeted Locus Study (TLS), 
from the next periodical release, 108.  

TLS data include sequences of 16S rRNAs or other targeted loci that are to be 
clustered into operational taxonomic units.  TLS entries are assigned 
accession numbers with four-letter prefixes.  


2. Data categories

The sequence data of the periodical DDBJ release are divided into two main 
groups, conventional sequence data and bulk sequence data.  The former 
includes data whose entries are assigned accession numbers with one- or 
two-letter prefixes.  The latter includes ultra-high throughput data sets 
whose entries are assigned accession numbers with four-letter prefixes.  See 
also '8.3. ACCESSION line'.  


2.1. Categories for conventional sequence data

The conventional sequence data in the present release are divided into 21 
categories, called 'divisions', covering organisms and other kinds of data.  
The contents of the divisions are as follows.  

HUM; human  
PRI; primates other than human 
ROD; rodents 
MAM; mammals other than primates and rodents 
VRT; vertebrates other than mammals 
INV; invertebrates (animals other than vertebrates) 
PLN; plants, fungi, plastids (eukaryotes other than animals) 
BCT; bacteria (including both Eubacteria and Archaea) 
VRL; viruses 
PHG; bacteriophages 
ENV; sequences obtained via environmental sampling methods 
SYN; synthetic construct; artificially constructed sequences 
EST; expressed sequence tag; short single pass cDNA sequence 
TSA; transcriptome shotgun assembly; 
     Assembled RNA transcripts/cDNA sequences.
HTC; high throughput cDNA sequence; 
     The sequence submitted from cDNA sequencing projects except for EST.  
     This division is to include unfinished high throughput cDNA sequences, 
     each of which has 5'UTR and 3'UTR at both ends and part of a coding region.
     The sequence may also include introns.  When the sequence becomes finished 
     later, it moves to the corresponding taxonomic division.  
GSS; genome survey sequence; short single pass genomic sequence 
HTG; high throughput genomic sequence; 
     The sequence submitted mainly from genome sequencing projects which 
     regarded a clone as a sequencing unit.  
STS; sequence tagged site; 
     A tag site for genome sequencing.  Chromosome and map information is 
     mandatory for this division.  
PAT; sequence data derived from Patent Offices; 
     The data collected, processed and released by the Japan Patent Office 
     (JPO), the United States Patent and Trademark Office (USPTO), the 
     European Patent Office (EPO), and the Korean Intellectual Property 
     Office (KIPO).  
     See also '2.3. Notice for data derived from Patent Offices' below.  
UNA; the sequence data not annotated; 
     The UNA division is not used for recently submitted sequences.  
CON; Contig / Constructed; 
     To join a series of entries, such as those submitted from a genome 
     project, each of the three data banks constructs an entry and assigns an 
     accession number to a large-scale sequence dataset.  Such entries are 
     classified into the CON division.  The entry in the CON division has the 
     information of joined accession numbers instead of the sequence data.  
     The corresponding entries of the CON entry have been submitted to other 
     divisions.  The entries and bases in the CON division are not counted in 
     the released numbers given on the top of the release note.  


2.2. Categories for bulk sequence data

The bulk sequence data in the present release are divided into 2 categories, 
called 'data types'.  The contents of the data types are as follows.  

TSA; transcriptome shotgun assembly; 
     Assembled RNA transcripts/cDNA sequences.
WGS; whole genome shotgun; 
     The draft genomic sequences of various organisms determined by whole 
     genome shotgun approach. 

Note that TSA is at once a division of conventional sequence data and a data 
type of bulk sequence data.  


2.3. Notice for data derived from Patent Offices

This release includes the PAT division for sequence data derived from Patent 
Offices, as described above.  These data were collected, processed and 
released by the Japan Patent Office (JPO), the United States Patent and 
Trademark Office (USPTO), the European Patent Office (EPO), and the Korean 
Intellectual Property Office (KIPO).  

The prefixes of accession numbers for the PAT division can be found at the 
following URL; http://www.ddbj.nig.ac.jp/sub/prefix.html  

Note also that unauthorized use of the patented data may cause legal issues 
for which DDBJ takes no responsibility.  See also '7. Disclaimer'.  


3. Statistics and files

The statistics of the present release are shown in the following table:  

-------------------------------------------------------
categories    number of entries       number of bases
-------------------------------------------------------
BCT                      1496340            29232018629
ENV                      8069531             5273599021
EST                     76572230            42789346444
GSS                     40054600            25842088149
HTC                       546835              637990352
HTG                       174702            27540604129
HUM                       700050             5574265065
INV                      6844391            16766612792
MAM                       481390             3844202948
PAT                     35649049            18516869245
PHG                        11665              306357871
PLN                      4293860            15433636347
PRI                       140916             2251564092
ROD                       527037             4506476789
STS                      1346867              640874549
SYN                       216015             1069065170
TSA                    149076157           132576915262
UNA                          376                 266598
VRL                      2153947             3097174312
VRT                      2625721             6988560250
WGS                    459229979          1801930324424
-------------------------------------------------------
Total                  790211658          2144818812438

CON                     31364775          1045764520067

The entries and bases in the CON division are not counted in the numbers given 
on the top of the release note or 'total' on the above table.
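
The totals can be cross-checked by summing the per-category rows.  The 
following is a minimal Python sketch of that check.  The numbers are copied 
from the table above; the names STATISTICS and sum_statistics are 
illustrative only and are not part of any DDBJ tool.  

```python
from typing import Tuple

# Rows copied from the statistics table above (CON is deliberately excluded,
# because CON entries are not counted in the release totals).
STATISTICS = """\
BCT 1496340 29232018629
ENV 8069531 5273599021
EST 76572230 42789346444
GSS 40054600 25842088149
HTC 546835 637990352
HTG 174702 27540604129
HUM 700050 5574265065
INV 6844391 16766612792
MAM 481390 3844202948
PAT 35649049 18516869245
PHG 11665 306357871
PLN 4293860 15433636347
PRI 140916 2251564092
ROD 527037 4506476789
STS 1346867 640874549
SYN 216015 1069065170
TSA 149076157 132576915262
UNA 376 266598
VRL 2153947 3097174312
VRT 2625721 6988560250
WGS 459229979 1801930324424
"""


def sum_statistics(table: str) -> Tuple[int, int]:
    """Return (total entries, total bases) summed over all category rows."""
    entries = 0
    bases = 0
    for row in table.splitlines():
        _category, n_entries, n_bases = row.split()
        entries += int(n_entries)
        bases += int(n_bases)
    return entries, bases


total_entries, total_bases = sum_statistics(STATISTICS)
print(total_entries, total_bases)  # prints 790211658 2144818812438
```

Both sums agree with the 'Total' row and with the numbers on the top of the 
release note.  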


3.1. Files for conventional sequence data

The conventional sequence data in this release covers 21 categories (See also 
'2.1. Categories for conventional sequence data') of organisms and others as 
follows: 

------------------------------------------------------------------------------
ddbjbct; Category for BCT
ddbjcon; Category for CON
ddbjenv; Category for ENV
ddbjest; Category for EST
ddbjgss; Category for GSS
ddbjhtc; Category for HTC
ddbjhtg; Category for HTG
ddbjhum; Category for HUM
ddbjinv; Category for INV
ddbjmam; Category for MAM
ddbjpat; Category for PAT
ddbjphg; Category for PHG
ddbjpln; Category for PLN
ddbjpri; Category for PRI
ddbjrod; Category for ROD
ddbjsts; Category for STS
ddbjsyn; Category for SYN
ddbjtsa; Category for TSA
ddbjuna; Category for UNA
ddbjvrl; Category for VRL
ddbjvrt; Category for VRT
------------------------------------------------------------------------------

In the present release, each of the above categories is recorded in multiple 
ddbj***###.seq files, each of which is at most 1.5 GB in size.  The number of 
files per category is as follows.  

-------------------------------
file prefix     number of files
-------------------------------
ddbjbct                      45
ddbjcon                      55
ddbjenv                      15
ddbjest                     168
ddbjgss                      79
ddbjhtc                       2
ddbjhtg                      25
ddbjhum                       7
ddbjinv                      24
ddbjmam                       5
ddbjpat                      45
ddbjphg                       1
ddbjpln                      22
ddbjpri                       3
ddbjrod                       5
ddbjsts                       4
ddbjsyn                       2
ddbjtsa                      37
ddbjuna                       1
ddbjvrl                       7
ddbjvrt                      10
-------------------------------

The files contain nucleotide sequence data in DDBJ flat file format.  See 
also '8. DDBJ flat file format'.  

The index files included in this release are ddbjacc#.idx and ddbjgen.idx.  
The accession number index is recorded in multiple ddbjacc#.idx files, each 
of which is at most 1.5 GB in size.  

The file lists of conventional sequence data in this release are arranged in 
the file, 'ddbj107_filelist.txt'.  The file provides the lists of the sequence 
data files and the index files.  The file list of sequence data consists of 
four columns; "file name", "number of entries", "number of bases" and "file 
size".  The list of index files consists of two columns; "file name" and 
"file size".  

From the present periodical release to the next one, daily updates of 
conventional sequence data are available at the following directory; 
ftp://ftp.ddbj.nig.ac.jp/ddbj_database/ddbjnew/


3.2. Files for bulk sequence data

The latest files of bulk sequence data are available at the following sites; 
------------------------------------------------------------------------------
WGS: ftp://ftp.ddbj.nig.ac.jp/ddbj_database/wgs/
TSA: ftp://ftp.ddbj.nig.ac.jp/ddbj_database/tsa/
------------------------------------------------------------------------------
The files of bulk sequence data are named by their prefixes of accession 
numbers.  They contain nucleotide sequence data in DDBJ flat file format.  
See also '8. DDBJ flat file format'.  

Because the directories of bulk sequence data are updated daily, the 
statistics of bulk sequence data in the present release are snapshots of the 
above directories taken on the last published date, November 25, 2016.  
The statistics are available in the following files: 
------------------------------------------------------------------------------
WGS: ddbj107_wgs_filelist.txt
TSA: ddbj107_tsa_filelist.txt
------------------------------------------------------------------------------

The list tables in the files consist of four columns; "file name", "number of 
entries", "number of bases" and "file size".  Please note that the values in 
the "file name" and "file size" columns refer to the files after they have 
been decompressed from the "****.gz" files.  


4. Citation

When you use DDBJ in your research, we would appreciate it if you would 
include a reference to DDBJ in your publications related to your research.  

When citing an entry in the DDBJ database, it is appropriate to give its 
accession number.  Also, it is recommended to cite the first publication in 
REFERENCE of the entry other than submitter information.  
 
DDBJ suggests authors add a reference for DDBJ itself.  The following 
publication, which describes the recent activities of the DDBJ center, 
would be appropriate to be cited:
 
  Mashima J, Kodama Y, Kosuge T, Fujisawa T, Katayama T, Nagasaki H, Okuda Y, 
  Kaminuma E, Ogasawara O, Okubo K, Nakamura Y and Takagi T.
  DNA data bank of Japan (DDBJ) progress report.
  Nucleic Acids Res. 44 (Database issue), D51-D57 (2016)
  DOI: 10.1093/nar/gkv1105

The following sentence is an example of how to cite an entry in the DDBJ 
database:  
-----------------------------------------------------------------------------
"We searched the DDBJ database (1) by sequence similarities and found a 
nucleotide sequence (2), with DDBJ accession number AB000714, which had 
significant similarity with ..."

 (1) Mashima, J. et al, Nucleic Acids Res. 44(Database issue), D51-D57 (2016). 
 (2) Katahira, J. et al, J. Biol. Chem. 272, 26652-26658 (1997). 
------------------------------------------------------------------------------


5. DDBJ staff

This release is published by the following DDBJ staff.  

Jun Mashima, Hideo Aono, Yuji Ashizawa, Yukino Dobashi, Mayumi Ejima, 
Masahiro Fujimoto, Asami Fukuda, Tomohiro Hirai, Naofumi Ishikawa, 
Chiharu Kawagoe, Yuichi Kodama, Junko Kohira, Takehide Kosuge, 
Kyungbum Lee, Mika Maki, Hisako Mashima, Fujitaka Matsumori, 
Kimiko Mimura, Shiho Mukaida, Naoko Murakata, Toshihisa Okido, 
Yoshihiro Okuda, Katsunaga Sakai, Makoto Sato, Aimi Shiida, 
Rie Sugita, Kimiko Suzuki, Toshiaki Tokimatsu, Haru Tsutsui, 
Koji Watanabe, Tomoka Watanabe, Tomohiko Yasuda, Emi Yokoyama, 
Masanori Arita, Eli Kaminuma, Osamu Ogasawara, Kosaku Okubo, 
Toshihisa Takagi, and Yasukazu Nakamura

DNA Data Bank of Japan
DDBJ Center
National Institute of Genetics
Research Organization of Information and Systems

Mishima, 411-8540, Japan 
Phone:  +81 55 981 6853
FAX:    +81 55 981 6849
E-mail: [email protected] (for general inquiry)
        [email protected] (for data submission)
        [email protected] (for updates and notification of publication)
WWW:    http://www.ddbj.nig.ac.jp/ 


6. Acknowledgment

We are grateful to NCBI and EBI for a firm friendship and an excellent 
collaboration with us.  We thank JPO and KIPO for a steady cooperation with 
us.  We also thank Byungwook Lee at Korean Bioinformation Center for proper 
process of the sequence data in patent claims to KIPO.  

The operation of DDBJ is supported by the Ministry of Education, Culture, 
Sports, Science and Technology, and we would gratefully note this here.  
DDBJ uses the Super-SINET computer network for data collection, data exchange 
and various services.   


7. Disclaimer

While DDBJ endeavors to keep its data correct, DDBJ makes no representations 
or warranties of any kind about the completeness, accuracy or reliability 
with respect to the entries contained in the DDBJ periodical release.  DDBJ 
also makes no legal liability or responsibility of merchantability or fitness 
for a particular purpose or that the use of the sequence data will not 
infringe any patent or other rights.  Any receipt, reliance or use you place 
on such data is therefore strictly at your own risk.  


8. DDBJ flat file format

The database is a collection of "entries", each of which is a unit of data.  
The entries submitted to the data banks are processed and published in the 
DDBJ format for distribution (flat file).  The flat file includes the 
sequence, information on submitters, references and source organisms, 
"feature" information, and so on.  A sample entry is shown below, and its 
items are explained in the following subsections; 

-------------------------------------------------------------------------------
LOCUS       AB000000                 450 bp    mRNA    linear   HUM 08-JUL-2002
DEFINITION  Homo sapiens GAPD mRNA for glyceraldehyde-3-phosphate
            dehydrogenase, partial cds.
ACCESSION   AB000000
VERSION     AB000000.1
DBLINK      BioProject:PRJDA12345
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. 
REFERENCE   1  (bases 1 to 450)
  AUTHORS   Mishima,H. and Shizuoka,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-NOV-2000) to the DDBJ/EMBL/GenBank databases.
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; 1111, Yata,
            Mishima, Shizuoka 411-8540, Japan
REFERENCE   2  
  AUTHORS   Mishima,H., Shizuoka,T. and Fuji,I.
  TITLE     Glyceraldehyde-3-phosphate dehydrogenase expressed in human liver
  JOURNAL   Unpublished (2002)
COMMENT     Human cDNA sequencing project.
FEATURES             Location/Qualifiers
     source          1..450
                     /chromosome="12"
                     /clone="GT200015"
                     /clone_lib="lambda gt11 human liver cDNA (GeneTech.
                     No.20)"
                     /map="12p13"
                     /mol_type="mRNA"
                     /organism="Homo sapiens"
                     /tissue_type="liver"
     CDS             86..>450
                     /codon_start=1
                     /gene="GAPD"
                     /product="glyceraldehyde-3-phosphate dehydrogenase"
                     /protein_id="BAA12345.1"
                     /transl_table=1
                     /translation="MAKIKIGINGFGRIGRLVARVALQSDDVELVAVNDPFITTDYMT
                     YMFKYDTVHGQWKHHEVKVKDSKTLLFGEKEVTVFGCRNPKEIPWGETSAEFVVEYTG
                     VFTDKDKAVAQLKGGAKKV"
BASE COUNT          102 a          119 c          131 g           98 t
ORIGIN
        1 cccacgcgtc cggtcgcatc gcacttgtag ctctcgaccc ccgcatctca tccctcctct
       61 cgcttagttc agatcgaaat cgcaaatggc gaagattaag atcgggatca atgggttcgg
      121 gaggatcggg aggctcgtgg ccagggtggc cctgcagagc gacgacgtcg agctcgtcgc
      181 cgtcaacgac cccttcatca ccaccgacta catgacatac atgttcaagt atgacactgt
      241 gcacggccag tggaagcatc atgaggttaa ggtgaaggac tccaagaccc ttctcttcgg
      301 tgagaaggag gtcaccgtgt tcggctgcag gaaccctaag gagatcccat ggggtgagac
      361 tagcgctgag tttgttgtgg agtacactgg tgttttcact gacaaggaca aggccgttgc
      421 tcaacttaag ggtggtgcta agaaggtctg
//
-------------------------------------------------------------------------------
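
As the sample shows, an entry ends with a line containing only "//", and each 
top-level item starts with a keyword in the first column.  The following 
Python sketch splits flat file lines into entries and collects the first line 
of each top-level item.  It assumes, based on the sample above only, that the 
keyword occupies the first 12 columns; split_entries and top_level_fields are 
illustrative names, not part of any DDBJ tool.  

```python
from typing import Dict, Iterable, Iterator, List


def split_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per entry; a '//' line terminates an entry."""
    entry: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip() == "//":
            if entry:
                yield entry
            entry = []
        else:
            entry.append(line)
    if entry:  # trailing lines without a '//' terminator are yielded as well
        yield entry


def top_level_fields(entry: List[str]) -> Dict[str, str]:
    """Map each top-level keyword to the text on its first line.

    Lines that start with a space (continuations, sub-items such as ORGANISM
    or AUTHORS, feature lines, sequence lines) are skipped.  For keywords that
    repeat, such as REFERENCE, only the first occurrence is kept.
    """
    fields: Dict[str, str] = {}
    for line in entry:
        if line and not line[0].isspace():
            keyword = line[:12].strip()  # assumed keyword column width
            fields.setdefault(keyword, line[12:].strip())
    return fields
```

For a file read line by line, top_level_fields(entry)["ACCESSION"] would give 
the text of the ACCESSION line of each entry yielded by split_entries.  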


8.1. LOCUS line

The format of LOCUS line in the flat file is shown below; 
---------  --------
Positions  Contents
---------  --------
  01-05    'LOCUS'
  06-12     spaces
  13-28     Locus name
  29-29     space
  30-40     Length of sequence, right-justified
  41-41     space
  42-43     'bp'
  44-47     spaces
  48-54     DNA, RNA, mRNA, rRNA, tRNA or cRNA, left justified
  55-55     space
  56-63     'linear' followed by two spaces, or 'circular'
  64-64     space
  65-67     The division code (See '2. Data categories.')
  68-68     space
  69-79     Date, in the form dd-MMM-yyyy (e.g., 08-JUL-2002)
------------------------------------------------------------------------------
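
Because the LOCUS line uses fixed column positions, it can be read by slicing.  
The following Python sketch builds and parses a LOCUS line according to the 
positions in the table above.  The function and field names are illustrative 
only.  

```python
from typing import Dict

# (field name, first column, last column), 1-based and inclusive,
# taken from the positions table above.
LOCUS_FIELDS = [
    ("locus_name", 13, 28),
    ("length", 30, 40),
    ("molecule", 48, 54),
    ("topology", 56, 63),
    ("division", 65, 67),
    ("date", 69, 79),
]


def format_locus_line(locus_name: str, length: int, molecule: str,
                      topology: str, division: str, date: str) -> str:
    """Assemble a 79-column LOCUS line from its parts."""
    return (
        "LOCUS".ljust(12)          # 01-12: 'LOCUS' and spaces
        + locus_name.ljust(16)     # 13-28: locus name
        + " "                      # 29
        + str(length).rjust(11)    # 30-40: length, right-justified
        + " bp"                    # 41-43: space and 'bp'
        + " " * 4                  # 44-47
        + molecule.ljust(7)        # 48-54: molecule type, left-justified
        + " "                      # 55
        + topology.ljust(8)        # 56-63: 'linear  ' or 'circular'
        + " "                      # 64
        + division                 # 65-67: division code
        + " "                      # 68
        + date                     # 69-79: dd-MMM-yyyy
    )


def parse_locus_line(line: str) -> Dict[str, str]:
    """Slice a LOCUS line into its fields; all values are returned as text."""
    if not line.startswith("LOCUS"):
        raise ValueError("not a LOCUS line")
    return {name: line[first - 1:last].strip()
            for name, first, last in LOCUS_FIELDS}


line = format_locus_line("AB000000", 450, "mRNA", "linear", "HUM", "08-JUL-2002")
print(parse_locus_line(line)["division"])  # prints HUM
```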


8.2. DEFINITION line

The definition briefly describes the information of gene(s).  "DEFINITION" is 
constructed by each of the three data banks.  


8.3. ACCESSION line

This line shows the accession number of the entry.  
Each of the three data banks issues a unique accession number to the data 
submitted to it.  An accession number is composed of 1 alphabetic character 
and 5 digits (ex. A12345), 2 alphabetic characters and 6 digits (ex. 
AB123456), or 4 alphabetic characters and 8-10 digits (ex. AAAA01012345).  
The first style was used in the 1980s; the second and the third styles were 
introduced later because of the explosive growth of data.  See also the 
following URL; 
http://www.ddbj.nig.ac.jp/sub/acc_def-e.html
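
The three styles can be recognized with regular expressions.  The following 
Python sketch also labels each accession number as conventional or bulk 
sequence data according to '2. Data categories'.  It assumes that the 
alphabetic part is written in upper case; classify_accession is an 
illustrative name only.  

```python
import re
from typing import Optional

# 1 letter + 5 digits, or 2 letters + 6 digits
CONVENTIONAL = re.compile(r"[A-Z][0-9]{5}|[A-Z]{2}[0-9]{6}")
# 4 letters + 8 to 10 digits
BULK = re.compile(r"[A-Z]{4}[0-9]{8,10}")


def classify_accession(accession: str) -> Optional[str]:
    """Return 'conventional', 'bulk', or None if no style matches."""
    if CONVENTIONAL.fullmatch(accession):
        return "conventional"
    if BULK.fullmatch(accession):
        return "bulk"
    return None


for example in ("A12345", "AB123456", "AAAA01012345"):
    print(example, classify_accession(example))
```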

The alphabet part of accession number is called "prefix".  You can find the 
prefix list of the accession numbers at the following URL; 
http://www.ddbj.nig.ac.jp/sub/prefix.html

If multiple entries are merged into one entry, or if an entry is extensively 
modified after the submission, the responsible data bank may assign a new 
accession number to it.  In these cases, the new accession number is called 
the primary accession number, and the old accession number(s) is/are 
called the secondary accession number(s).  In the flat file, the primary 
accession number is indicated first, followed by the secondary accession 
number(s).  The updated entry can be found with either the primary or a 
secondary accession number.  


8.4. VERSION line

This line consists of an accession number and a version number, like 
"AB123456.1", in which the digit(s) after the period are the version number.  
Data opened to the public for the first time are given version number "1".  
VERSION was added because a released sequence is sometimes revised by the 
submitter, so the accession number alone cannot specify the sequence in 
question, which causes trouble for users.  The version number is increased 
by one every time a revised sequence is made public.  
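
The two parts can be separated at the period.  The following Python sketch 
splits a VERSION value into its accession number and version number; 
split_version is an illustrative name only, and the pattern assumes an 
accession number made of letters followed by digits.  

```python
import re
from typing import Tuple

VERSION_PATTERN = re.compile(r"([A-Za-z]+[0-9]+)\.([0-9]+)")


def split_version(identifier: str) -> Tuple[str, int]:
    """Split 'AB123456.1' into ('AB123456', 1)."""
    match = VERSION_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError("not an accession.version value: %r" % identifier)
    return match.group(1), int(match.group(2))


accession, version = split_version("AB123456.1")
print(accession, version)  # prints AB123456 1
```

Comparing the integer version numbers of two values with the same accession 
number shows which of them is the later revision.  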


8.5. DBLINK line

The DBLINK line provides links to records of other databases with accession 
numbers of BioProject, BioSample, Sequence Read Archive and so on.  


8.6. KEYWORDS line

The data banks fill in this line, if necessary.  In many cases, the 
categories of the data (EST, HTG etc.), gene names and product names are 
included in "KEYWORDS".  


8.7. SOURCE line

This line shows the scientific name (and a corresponding common name, if 
defined as "Genbank common name" in the taxonomy database) of the organism 
from which the sequence was obtained, and an organelle type if the sequence 
is derived from an organelle other than the nucleus.  


8.8. REFERENCE line

Information on the submitters and on references related to the submitted 
sequence is given in the REFERENCE line.  


8.9. COMMENT line

This line contains information about an entry that cannot be described using 
FEATURES or the other fields.  


8.10. FEATURES line

Biological features of submitted sequence data are described with a 
"Feature" key (the biological nature of the annotated feature), a "Location" 
(the region of the sequence that corresponds to the Feature), and 
"Qualifiers" (supplementary information about the Feature).  The "Feature" 
and "Qualifier" keys used in the present release are defined by the 
DDBJ/ENA/GenBank Feature Table Definition Version 10.6 (November, 2016).  
In principle, the document is updated every year.  You can find its newest 
version at the following URL; 
http://www.ddbj.nig.ac.jp/FT/full_index.html


8.11. BASE COUNT line

In the BASE COUNT line of the DDBJ flat file, 9 digits are allocated for each 
number of a (adenine), c (cytosine), g (guanine) and t (thymine).  In the case 
of RNA sequence, uracil is indicated as "t" according to the rule of the 
international nucleotide database.  
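
The counts on this line can be compared with the sequence itself.  The 
following Python sketch reads the four numbers from a BASE COUNT line and 
counts the bases of a sequence.  It matches each number by the base letter 
that follows it rather than by column position; parse_base_count and 
count_bases are illustrative names only.  

```python
import re
from collections import Counter
from typing import Dict


def parse_base_count(line: str) -> Dict[str, int]:
    """Read the a, c, g and t counts from a BASE COUNT line."""
    if not line.startswith("BASE COUNT"):
        raise ValueError("not a BASE COUNT line")
    pairs = re.findall(r"([0-9]+)\s+([acgt])\b", line)
    return {base: int(number) for number, base in pairs}


def count_bases(sequence: str) -> Dict[str, int]:
    """Count a, c, g and t in a sequence (uracil is already written as 't')."""
    counts = Counter(sequence.lower())
    return {base: counts[base] for base in "acgt"}


line = "BASE COUNT          102 a          119 c          131 g           98 t"
print(sum(parse_base_count(line).values()))  # prints 450
```

For the sample entry in '8. DDBJ flat file format', the four counts add up to 
450, the sequence length given on its LOCUS line.  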


8.12. ORIGIN line

The sequence data start on the line following ORIGIN.  The sequence is 
written in lower case letters, with a space after every 10 bases and a new 
line after every 60 bases.  The number at the left of each line is the 
position of the first base on that line.  
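
The sequence can be recovered by dropping the leading number of each line and 
joining the blocks.  The following Python sketch does this and also checks 
that each leading number agrees with the number of bases read so far; 
read_origin is an illustrative name only.  

```python
from typing import Iterable, List


def read_origin(lines: Iterable[str]) -> str:
    """Join the sequence lines that follow ORIGIN into one string."""
    chunks: List[str] = []
    total = 0
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip empty lines
        position = int(parts[0])
        if position != total + 1:
            raise ValueError(
                "line starts at %d, expected %d" % (position, total + 1))
        chunk = "".join(parts[1:])
        chunks.append(chunk)
        total += len(chunk)
    return "".join(chunks)


sample = [
    "        1 cccacgcgtc cggtcgcatc gcacttgtag ctctcgaccc ccgcatctca tccctcctct",
    "       61 cgcttagttc",
]
print(len(read_origin(sample)))  # prints 70
```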


9. Sample of the contents in each file

9.1. Part of the contents in the file 'ddbjbct1.seq'

This shows all pieces of information on one entry in DDBJ format.  
------------------------------------------------------------------------------
LOCUS       D87069                   993 bp    mRNA    linear   BCT 05-OCT-2006
DEFINITION  Escherichia coli mRNA for RNA polymerase sigma subunit, truncated
            form of sigma-38, complete cds.
ACCESSION   D87069
VERSION     D87069.1
KEYWORDS    RNA polymerase sigma subunit, truncated form of sigma-38.
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 993)
  AUTHORS   Jishage,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (14-AUG-1996) to the DDBJ/EMBL/GenBank databases.
            Contact:Miki Jishage
            National Institute of Genetics, Molecular Genetics; Yata 1111,
            Mishima, Shizuoka 411, Japan
REFERENCE   2  
  AUTHORS   Jishage,M. and Ishihama,A.
  TITLE     Variation in RNA polymerase sigma subunit composition within
            different stocks of Escherichia coli starin W3110
  JOURNAL   Unpublished (1996)
REFERENCE   3  
  AUTHORS   Ivanova,A., Renshaw,M., Guntaka,R. and Eisenstark,A.
  TITLE     DNA base sequence variability in katF (putative sigma factor) gene
            Escherichia coli
  JOURNAL   Nucleic Acids Res. 20, 5479-5480 (1992)
REFERENCE   4  
  AUTHORS   Takayanagi,Y., Tanaka,K. and Takahashi,H.
  TITLE     Structure of the 5' upstream region and the regulation of the rpoS
            gene of Escherichia coli
  JOURNAL   Mol. Gen. Genet. 243, 525-531 (1994)
COMMENT     
FEATURES             Location/Qualifiers
     source          1..993
                     /db_xref="taxon:562"
                     /mol_type="mRNA"
                     /organism="Escherichia coli"
                     /strain="W3110"
     CDS             1..810
                     /note="the gene has four single base changes, resulting
                     in two amino acid substitutions and an amber mutation"
                     /product="RNA polymerase sigma subunit, truncated form of
                     sigma-38"
                     /protein_id="BAA13238.1"
                     /transl_table=11
                     /translation="MSQNTLKVHDLNEDAEFDENGVEVFDEKALVEYEPSDNDLAEEE
                     LLSQGATQRVLDATQLYLGEIGYSPLLTAEEEVYFARRALRGDVASRRRMIESNLRLV
                     VKIARRYGNRGLALLDLIEEGNLGLIRAVEKFDPERGFRFSTYATWWIRQTIERAIMN
                     QTRTIRLPIHIVKELNVYLRTARELSHKLDHEPSAEEIAEQLDKPVDDVSRMLRLNER
                     ITSVDTPLGGDSEKALLDILADEKENGPEDTTQDDDMKQSIVKWLFELNAK"
     variation       75
                     /citation=[3]
                     /replace="t"
     variation       97
                     /citation=[3]
                     /replace="t"
     variation       99
                     /citation=[3]
                     /replace="t"
     variation       808
                     /citation=[3]
                     /replace="t"
BASE COUNT          254 a          223 c          291 g          225 t
ORIGIN      
        1 atgagtcaga atacgctgaa agttcatgat ttaaatgaag atgcggaatt tgatgagaac
       61 ggagttgagg tttttgacga aaaggcctta gtagaatatg aacccagtga taacgatttg
      121 gccgaagagg aactgttatc gcagggagcc acacagcgtg tgttggacgc gactcagctt
      181 taccttggtg agattggtta ttcaccactg ttaacggccg aagaagaagt ttattttgcg
      241 cgtcgcgcac tgcgtggaga tgtcgcctct cgccgccgga tgatcgagag taacttgcgt
      301 ctggtggtaa aaattgcccg ccgttatggc aatcgtggtc tggcgttgct ggaccttatc
      361 gaagagggca acctggggct gatccgcgcg gtagagaagt ttgacccgga acgtggtttc
      421 cgcttctcaa catacgcaac ctggtggatt cgccagacga ttgaacgggc gattatgaac
      481 caaacccgta ctattcgttt gccgattcac atcgtaaagg agctgaacgt ttacctgcga
      541 accgcacgtg agttgtccca taagctggac catgaaccaa gtgcggaaga gatcgcagag
      601 caactggata agccagttga tgacgtcagc cgtatgcttc gtcttaacga gcgcattacc
      661 tcggtagaca ccccgctggg tggtgattcc gaaaaagcgt tgctggacat cctggccgat
      721 gaaaaagaga acggtccgga agataccacg caagatgacg atatgaagca gagcatcgtc
      781 aaatggctgt tcgagctgaa cgccaaatag cgtgaagtgc tggcacgtcg attcggtttg
      841 ctggggtacg aagcggcaac actggaagat gtaggtcgtg aaattggcct cacccgtgaa
      901 cgtgttcgcc agattcaggt tgaaggcctg cgccgtttgc gcgaaatcct gcaaacgcag
      961 gggctgaata tcgaagcgct gttccgcgag taa
//
------------------------------------------------------------------------------


9.2. Part of the contents in the accession number index file 'ddbjacc1.idx' 

The following excerpt from the accession number index file illustrates the
format of the index.  
------------------------------------------------------------------------------
A00001	A00001	PAT	A00001
A00002	A00002	PAT	A00002
A00003	A00003	PAT	A00003
A00004	A00004	PAT	A00004
A00005	A00005	PAT	A00005
A00006	A00006	PAT	A00006
A00008	A00008	PAT	A00008
A00009	A00009	PAT	A00009
A00010	A00010	PAT	A00010
------------------------------------------------------------------------------

The accession number index file consists of four columns delimited by tab 
characters.  The first column gives a secondary accession number or, if the 
entry has no secondary accession number, the primary accession number.  The 
following columns are the locus name, the division and the primary accession 
number, respectively.  
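
With this layout, any accession number in the first column can be resolved to 
its primary accession number.  The following Python sketch loads index lines 
into a lookup table; AccessionIndexRow and load_accession_index are 
illustrative names only.  

```python
from typing import Dict, Iterable, NamedTuple


class AccessionIndexRow(NamedTuple):
    accession: str          # secondary accession number, or primary if none
    locus_name: str
    division: str
    primary_accession: str


def load_accession_index(lines: Iterable[str]) -> Dict[str, AccessionIndexRow]:
    """Map the first-column accession number to its index row."""
    index: Dict[str, AccessionIndexRow] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        columns = line.split("\t")
        if len(columns) != 4:
            raise ValueError("expected 4 tab-delimited columns: %r" % line)
        row = AccessionIndexRow(*columns)
        index[row.accession] = row
    return index


index = load_accession_index([
    "A00001\tA00001\tPAT\tA00001",
    "A00002\tA00002\tPAT\tA00002",
])
print(index["A00002"].primary_accession)  # prints A00002
```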


9.3. Part of the contents in the gene name index file 'ddbjgen1.idx'

This file lists all the gene names that appear in the feature table.  
------------------------------------------------------------------------------
 2	AJ431263	PLN	AJ431263
 B epsilon	AJ276037	PLN	AJ276037
 B epsilon	AJ276037	PLN	AJ276037
 B epsilon	AJ276037	PLN	AJ276037
 B epsilon	AJ276037	PLN	AJ276037
 D beta 	Z22855	ROD	Z22855
 D beta 1	Z22854	ROD	Z22854
 D34	Z93215	HUM	Z93215
 H5 	X15387	INV	X15387
 H5 	X15387	INV	X15387
 HLA-DBR1	X68272	HUM	X68272
------------------------------------------------------------------------------

The gene name index file consists of four columns; gene name, locus name, 
division and primary accession number, in that order.  Columns are delimited 
by tab characters.  
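
As the excerpt shows, the same gene name can appear on several lines.  The 
following Python sketch groups the primary accession numbers by gene name.  
It strips the spaces around gene names, which is an assumption based on the 
excerpt above; load_gene_index is an illustrative name only.  

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def load_gene_index(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each gene name to the set of primary accession numbers listed."""
    genes: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        gene, _locus_name, _division, primary = line.split("\t")
        genes[gene.strip()].add(primary)
    return dict(genes)


genes = load_gene_index([
    " B epsilon\tAJ276037\tPLN\tAJ276037",
    " B epsilon\tAJ276037\tPLN\tAJ276037",
    " D beta \tZ22855\tROD\tZ22855",
])
print(sorted(genes))  # prints ['B epsilon', 'D beta']
```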


10. Release history

Release  Date      Entries              Bases  Comments
107     12/16  790,211,658  2,144,818,812,438  bulk sequence data inclusion 
                                               started
106     09/16  196,421,345    218,729,237,634
105     06/16  194,599,140    213,484,513,978
104     03/16  191,094,643    207,977,078,920
103     12/15  189,264,014    204,119,485,393
102     09/15  187,785,897    200,654,335,022
101     06/15  184,281,713    192,506,352,252
100     03/15  181,941,277    187,798,813,739
 99     12/14  178,825,615    184,410,381,191
 98     09/14  174,391,281    166,692,710,729
 97     06/14  172,402,324    161,078,598,329
 96     03/14  171,164,046    158,539,702,882
 95     12/13  169,094,459    156,527,217,715
 94     09/13  167,480,294    154,916,713,861
 93     06/13  165,072,766    152,702,928,183
 92     03/13  163,017,305    150,760,062,903
 91     12/12  160,729,709    148,418,537,672
 90     09/12  156,952,755    144,754,534,372
 89     06/12  153,273,314    141,016,380,296  Part of index files terminated
 88     12/11  145,861,965    134,956,109,049
 87     09/11  142,339,601    131,276,394,833
 86     06/11  138,030,308    128,745,918,079
 85     03/11  132,302,771    124,516,775,718
 84     12/10  128,607,035    120,919,136,706
 83     09/10  124,079,491    117,728,717,442
 82     06/10  120,034,097    115,169,689,543
 81     03/10  116,720,237    112,394,932,676  TPA excluded
 80     12/09  112,314,250    109,636,862,252  SOURCE line modified
 79     09/09  108,593,519    106,684,379,504  DBLINK line started 
                                               PROJECT line terminated
 78     06/09  105,737,359    104,597,360,291
 77     03/09  102,099,156    101,765,388,414
 76     12/08   98,220,409     98,741,908,446
 75     09/08   92,840,037     95,219,505,205  TSA division started
 74     06/08   87,903,140     91,294,770,939
 73     03/08   83,167,582     86,099,950,395  KIPO inclusion started
 72     12/07   79,004,098     82,592,245,487  Most of E-mail addresses 
                                               discarded
 71     09/07   76,273,345     79,706,204,461
 70     06/07   72,801,679     76,788,510,646
 69     03/07   67,523,680     71,775,679,500  PROJECT line started
                                               Indexes for categories 
                                               terminated
 68     12/06   64,267,978     68,259,314,742  1.5 GB storage started
 67     09/06   61,144,621     65,443,024,193
 66     06/06   58,176,628     62,945,843,881
 65     03/06   55,890,995     60,564,721,635  TPA subcategories started
 64     12/05   52,272,669     56,098,558,378  Some index files split
 63     09/05   47,741,593     52,246,110,341
 62     06/05   45,249,444     49,158,155,283  ENV division started
                                               Version for release note started
 61     03/05   43,118,204     47,099,081,750  Changed style of release note
 60     12/04   40,583,945     44,416,752,273  /db_xref="H-inv:**" started
 59     09/04   37,926,117     42,245,956,937
 58     06/04   34,917,581     39,812,635,108
 57     03/04   32,693,678     38,008,449,840
 56     12/03   30,405,173     36,079,046,032
 55     09/03   27,753,140     34,280,225,489
 54     06/03   25,149,821     32,162,041,177
 53     02/03   23,250,813     29,711,299,332
 52     12/02   20,354,812     26,931,456,316
 51     09/02   18,401,358     22,782,404,136  TPA started
 50     06/02   17,260,693     20,158,357,982
 49     04/02   16,503,157     18,579,627,226
 48     01/02   15,016,100     16,197,713,855
 47     10/01   13,266,610     14,145,671,645
 46     07/01   12,313,759     13,037,646,166
 45     04/01   11,434,113     12,207,092,905  HTC division started
 44     01/01   10,165,597     11,136,298,841
 43     10/00    8,666,551     10,034,532,698
 42     07/00    7,554,995      8,880,721,093
 41     04/00    5,962,608      6,409,581,885  CON division started
 40     01/00    5,388,125      4,762,696,173  RNA division terminated
 39     10/99    4,810,773      3,728,000,562  NID and PID discarded
 38     07/99    4,294,369      3,098,519,597
 37     03/99    3,311,627      2,375,261,951  VERSION, /protein_id started
 36     01/99    3,073,166      2,190,425,560
 35     10/98    2,759,261      1,957,341,169
 34     07/98    2,412,785      1,708,580,623
 33     04/98    2,174,769      1,479,303,279
 32     01/98    1,956,669      1,300,950,613
 31     10/97    1,731,532      1,139,869,464  Adoption of the unified 
                                               taxonomy database
 30     07/97    1,534,115        992,788,339  NID and PID terminated
 29     04/97    1,270,194        841,415,232
 28     01/97    1,154,120        756,785,219  HTG division started
                                               ORG division terminated
 27     10/96      936,697        608,103,057  GSS division started
 26     07/96      835,552        551,932,448
 25     04/96      744,490        499,300,364  /translation started
 24     01/96      637,508        431,771,652
 23     10/95      569,757        390,694,350
 22     07/95      437,588        322,982,425  HUM division started
 21     04/95      274,596        250,875,023
 20     01/95      239,689        231,299,557
 19     10/94      204,332        205,274,131
 18     07/94      185,230        192,473,021
 17     04/94      169,957        179,942,209
 16     01/94      154,626        165,017,628
 15     10/93      131,649        147,224,690
 14     07/93      120,350        138,686,333  JPO inclusion started
 13     04/93      112,067        129,784,445
 12     01/93       97,683        120,815,244  EST division started
 11     07/92       65,693         84,839,075
 10     01/92       59,317         77,805,556  GenBank/EMBL inclusion started
  9     07/91        1,130          2,002,124
  8     01/91          879          1,573,442
  7     07/90          681          1,154,211
  6     01/90          496            841,236
  5     07/89          395            679,378
  4     01/89          302            535,985
  3     07/88          230            345,850
  2     01/88          142            199,392
  1     07/87           66            108,970  Started with DDBJ only


------------------
Since release 89
------------------
Index files have been changed:  
Previously, the DDBJ periodical release included index files for accession 
numbers, keyword phrases, journal citations, and gene names.  
After the index files were reorganized, the index files for keyword phrases 
and journal citations were terminated, and the formats of the index files for 
accession numbers and gene names were changed.  See also "9.2. Part of the 
contents in the accession number index file 'ddbjacc1.idx'" and "9.3. Part of 
the contents in the gene name index file 'ddbjgen1.idx'".  


------------------
Since release 81
------------------
TPA category data have been excluded from the DDBJ periodical release:  
From September 2002 (DDBJ release 51), DDBJ periodical releases included TPA 
category data.  However, this was potentially confusing, because TPA category 
data are not primary nucleotide sequence data.  Therefore, DDBJ has stopped 
including TPA data in the release.  TPA data are available from a separate 
FTP site.  See the following page for details.  
URL: http://www.ddbj.nig.ac.jp/whatsnew/whatsnew2009-e.html#090828


------------------
Since release 80
------------------
The format of the SOURCE line in the DDBJ flat file has been changed:  
The SOURCE line in some DDBJ flat files now includes a common name, as in the 
GenBank flat file.  The change is shown below.  

----------------
Old (-rel. 79)
----------------
Format:  
SOURCE      <scientific name> [<organelle>]
Example:  
SOURCE      Homo sapiens mitochondrion

----------------
New (rel. 80-)
----------------
Format:  
SOURCE      [<organelle>] <scientific name> [(<common name>)]
Example:  
SOURCE      mitochondrion Homo sapiens (human)

See also '8. DDBJ flat file format'.  
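
As an illustration of the new format, the sketch below separates a trailing 
common name in parentheses from the rest of a SOURCE line.  It does not try 
to tell an organelle from a scientific name, and it assumes that the common 
name itself contains no parentheses.  The name split_source is hypothetical.  

```python
import re
from typing import Optional, Tuple

# Everything after the keyword, optionally ending in "(common name)".
_SOURCE_RE = re.compile(
    r"^SOURCE\s+(?P<name>.*?)(?:\s+\((?P<common>[^()]*)\))?\s*$"
)


def split_source(line: str) -> Tuple[str, Optional[str]]:
    """Return (organism text, common name or None) for a SOURCE line."""
    match = _SOURCE_RE.match(line)
    if match is None:
        raise ValueError(f"not a SOURCE line: {line!r}")
    return match.group("name"), match.group("common")


# The two example lines shown above.
new_style = split_source("SOURCE      mitochondrion Homo sapiens (human)")
old_style = split_source("SOURCE      Homo sapiens mitochondrion")
```

A line without a parenthesized suffix, such as the old-style example, simply 
yields None for the common name.  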


------------------
Since release 79
------------------
A new line, DBLINK, has replaced the PROJECT line:  

Following the agreement at the INSD collaborative meeting in 2008, the scope 
of the project ID has been expanded to include projects that are not 
necessarily targeted to the sequencing of a complete genome.  In addition, 
there are other resources, such as the Trace Assembly Archive at the NCBI.  

Therefore, we have decided to replace the PROJECT line with a new line format, 
"DBLINK".  

The replacement is illustrated below.  

Before, with the PROJECT line (-release 78):
-------------------------------------------------------------------------------
LOCUS       AP000000             4700000 bp    DNA     circular BCT 27-FEB-2009
DEFINITION  Escherichia coli DDBJ genomic DNA, complete genome.
ACCESSION   AP000000
VERSION     AP000000.1
PROJECT     GenomeProject:99999
KEYWORDS    .
-------------------------------------------------------------------------------

After, with the DBLINK line (release 79-):
-------------------------------------------------------------------------------
LOCUS       AP000000             4700000 bp    DNA     circular BCT 27-FEB-2009
DEFINITION  Escherichia coli DDBJ genomic DNA, complete genome.
ACCESSION   AP000000
VERSION     AP000000.1
DBLINK      Project:99999
KEYWORDS    .
-------------------------------------------------------------------------------
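
The rewrite shown in the two examples above can be sketched as a single line 
transformation.  The function name project_to_dblink is hypothetical, and 
the 12-character keyword column is inferred from the examples rather than 
from a formal specification.  

```python
import re

_PROJECT_RE = re.compile(r"^PROJECT\s+GenomeProject:(\d+)\s*$")


def project_to_dblink(line: str) -> str:
    """Rewrite a PROJECT line as a DBLINK line; return other lines unchanged."""
    match = _PROJECT_RE.match(line)
    if match is None:
        return line
    # Keyword padded to 12 columns, as in the examples above.
    return f"{'DBLINK':<12}Project:{match.group(1)}"


converted = project_to_dblink("PROJECT     GenomeProject:99999")
untouched = project_to_dblink("KEYWORDS    .")
```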


------------------
Since release 75
------------------
A new division for assembled mRNA sequences, Transcriptome Shotgun Assembly 
(TSA), has been included since release 75.  

With new sequencing technologies in use, INSDC has received many requests to 
accept assembled EST sequences.  Such sequence data have become more useful 
than they used to be, although they may not be correctly assembled or may not 
exist in nature.  Therefore, INSDC decided to collect assembled EST sequences 
and classify them into the new division 'TSA'.  

TSA sequences are shotgun assemblies of primary sequences deposited in the 
EST division of INSDC, Trace Archive (TA) or Short-Read Archive (SRA).  Two 
specific keywords, "TSA" and "Transcriptome Shotgun Assembly", are present 
in all TSA entries.  The new division code, "TSA", also appears in the LOCUS 
line of all TSA entries.
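
The two markers described above (the division code and the keyword pair) can 
be checked as in the following sketch.  It assumes that the division code is 
the second-to-last token of the LOCUS line and that keywords are separated by 
semicolons and terminated by a period.  The sample lines are hypothetical, 
not taken from a real entry.  

```python
from typing import List

TSA_KEYWORDS = {"TSA", "Transcriptome Shotgun Assembly"}


def locus_division(locus_line: str) -> str:
    """Return the division code, assumed to precede the date."""
    tokens = locus_line.split()
    if len(tokens) < 3 or tokens[0] != "LOCUS":
        raise ValueError(f"not a LOCUS line: {locus_line!r}")
    return tokens[-2]


def keyword_list(keywords_line: str) -> List[str]:
    """Split the text after KEYWORDS on ';' and drop the final period."""
    body = keywords_line[len("KEYWORDS"):].strip().rstrip(".")
    return [k.strip() for k in body.split(";") if k.strip()]


def looks_like_tsa(locus_line: str, keywords_line: str) -> bool:
    """True when both the division code and the keyword pair are present."""
    has_keywords = TSA_KEYWORDS <= set(keyword_list(keywords_line))
    return locus_division(locus_line) == "TSA" and has_keywords


# Hypothetical lines, for illustration only.
locus = "LOCUS       XX000001     500 bp    mRNA    linear   TSA 01-JAN-2009"
keywords = "KEYWORDS    TSA; Transcriptome Shotgun Assembly."
is_tsa = looks_like_tsa(locus, keywords)
```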

No format changes in the flat file are anticipated for the TSA division.  
Note, however, that TSA entries make use of the same PRIMARY line that is 
described for the entries in the TPA category.  The PRIMARY block contains 
references to the underlying reads/transcripts that are assembled to construct 
a TSA record.  

Note that a TSA submission requires the sequence data of the primary 
transcripts to be submitted to the EST division of INSDC, TA, or SRA.  More 
information about how to submit a TSA entry is available at the following 
URL: http://www.ddbj.nig.ac.jp/sub/tsa-e.html


------------------
Since release 73
------------------
Introduction of the sequence data from the Korean Intellectual Property Office:

The nucleotide sequence data transferred from the Korean Intellectual Property 
Office (KIPO) have been included in the DDBJ release.  See also '2. Data 
categories' and '2.3. Notice for data derived from Patent Offices'.  


------------------
Since release 72
------------------
Deletion of E-mail addresses, phone and fax numbers from the DDBJ flat file:  

To comply with the Japanese law on the protection of personal information, 
DDBJ deleted phone and fax numbers and E-mail addresses from the flat files of 
the entries submitted to DDBJ.  This also helps to protect DDBJ releases 
against senders of spam mail.  
By periodical release 72, DDBJ had retrofitted most of the entries submitted 
to DDBJ (not those submitted to GenBank or EMBL).  

Before release 72, the submitter information was described in the JOURNAL line 
of REFERENCE 1 as follows:  
--------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Taro Mishima, DNA Data Bank of Japan, National Institute of
            Genetics; 1111, Yata, Mishima, Shizuoka 411-8540, Japan
            (E-mail:[email protected], URL:http://www.ddbj.nig.ac.jp/,
            Tel:81-12-345-6789, Fax:81-12-345-9876)
--------------------------------------------------------------------------------

After the deletion of the information in question, the DDBJ flat file takes 
one of the following two forms:  

Type 1: Phone and fax numbers and E-mail address are deleted.  
--------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Contact:Taro Mishima
            DNA Data Bank of Japan, National Institute of Genetics; 1111, 
            Yata, Mishima, Shizuoka 411-8540, Japan
            URL    :http://www.ddbj.nig.ac.jp/
-------------------------------------------------------------------------------

Type 2: When the submitters wish to keep their contact information disclosed, 
it is described as follows:  
-------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Contact:Taro Mishima
            DNA Data Bank of Japan, National Institute of Genetics; 1111, 
            Yata, Mishima, Shizuoka 411-8540, Japan
            URL    :http://www.ddbj.nig.ac.jp/
            E-mail :[email protected]
            Phone  :81-12-345-6789
            Fax    :81-12-345-9876
-------------------------------------------------------------------------------
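
The difference between the two types can be sketched as a filter that drops 
the labelled contact lines from a Type 2 block, leaving a Type 1 block.  This 
is only an illustration of the layout, not the procedure DDBJ used.  The 
name strip_contact_lines and the address user@example.org are placeholders.  

```python
import re
from typing import List

# Continuation lines labelled "E-mail :", "Phone  :" or "Fax    :".
_CONTACT_RE = re.compile(r"^\s*(E-mail|Phone|Fax)\s*:")


def strip_contact_lines(lines: List[str]) -> List[str]:
    """Remove labelled E-mail, phone and fax lines; keep everything else."""
    return [line for line in lines if not _CONTACT_RE.match(line)]


type2 = [
    "            Contact:Taro Mishima",
    "            DNA Data Bank of Japan, National Institute of Genetics; 1111,",
    "            Yata, Mishima, Shizuoka 411-8540, Japan",
    "            URL    :http://www.ddbj.nig.ac.jp/",
    "            E-mail :user@example.org",
    "            Phone  :81-12-345-6789",
    "            Fax    :81-12-345-9876",
]
type1 = strip_contact_lines(type2)
```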


------------------
Since release 69
------------------
Introduction of the project ID at the PROJECT line in the DDBJ flat file: 
Following the agreement at the INSD collaborative meeting in 2006, INSDC has 
started to assign project IDs to submissions from sequencing projects.  
The project ID is described as follows:  
----------------------------------------------------------------------------
  A unique identifier, assigned at the time of the submission by a sequencing 
  project that informed INSDC of the submission beforehand.  It is recommended 
  that the submitter quotes the assigned project ID in all communication with 
  INSDC databases to allow for easier and faster tracking of issues.  
  The project ID field provides an umbrella identifier that points to all 
  related sequence data for the project.  
----------------------------------------------------------------------------
The PROJECT line contains the INSDC-assigned ID for the sequencing project.  
It appears between the VERSION and KEYWORDS lines in DDBJ flat files from 
DDBJ periodical release 69 onward, as shown below.  See also '8. DDBJ flat 
file format'.  
----------------------------------------------------------------------------
ACCESSION   AB012345
VERSION     AB012345.1
PROJECT     GenomeProject:123
KEYWORDS    .
----------------------------------------------------------------------------

The index files for each category are no longer provided.  


------------------
Since release 68
------------------
Split of files:  
We changed the maximum file size from 300 MB to 1.5 GB, because network 
capacity has increased remarkably.  Each file named ddbj***##.seq is at most 
1.5 GB in size.  See also the section '3.1. Files for conventional sequence 
data'.  


------------------
Since release 64
------------------
Split of index files:  
In the present release, some of the index files (ddbjacc.idx, ddbjjou.idx, and 
ddbjkey.idx) have exceeded 2 GB in file size.  These have therefore been 
split into multiple ddbj****.idx files, each of which is at most 1.5 GB in 
size.  See also 3., 9.2., 9.3. and 9.4.  


------------------
Since release 62
------------------
Release version number is introduced:  
DDBJ has started to include the item 'version' in its release note, which 
indicates the version of its periodical release.  It is expressed like '62.0', 
in which the digit(s) after the period is the version number.  The version 
number was added because released data are sometimes revised to make urgent 
and necessary corrections.  The number is increased by one every time a 
revised periodical release is made public, until the next release.  

Introduction of ENV division:  
Recently, the submissions of the sequences derived from environmental samples 
have rapidly increased.  To accommodate such submissions, a new division, ENV, 
has been created (See also '2.1. Categories for conventional sequence data').  
This division contains the sequences obtained via direct molecular isolation 
such as PCR, DGGE, or any anonymous method.  In the past, the sequences 
derived from environmental samples belonged to taxonomic divisions, mainly 
BCT.  At DDBJ, the retrofit to transfer relevant entries from taxonomic 
divisions to the ENV division starts in the present release, and ends by the 
periodical release 62.  

Strand information is removed:  
The strand information in the LOCUS line of the flat file has been removed, as 
shown below.  See also '8.1. LOCUS line'.  
----------------------------------------------------------------------------
Old (-rel. 61):
  44-44     space
  45-47     spaces, ss- (single-stranded), ds- (double-stranded), or 
             ms- (mixed-stranded)
New (rel. 62-):
  44-47     spaces
----------------------------------------------------------------------------
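
The column positions above are 1-based, so columns 45-47 correspond to the 
slice [44:47] in a zero-based language.  The sketch below reads that field.  
The sample lines are built for illustration from the rel. 51- LOCUS example 
in this release note, and the "ss-" variant is hypothetical.  

```python
def strand_field(locus_line: str) -> str:
    """Return columns 45-47 (1-based) of a LOCUS line, without padding."""
    return locus_line[44:47].strip()


# Columns 1-12 keyword, 13-20 name, 38-43 length and "bp", 44-47 spaces.
new_style = (
    "LOCUS".ljust(12) + "AB000001" + " " * 17 + "660 bp"
    + " " * 4 + "DNA     linear   PLN 01-FEB-2001"
)

# Hypothetical older line with a strand prefix placed in columns 45-47.
old_style = new_style[:44] + "ss-" + new_style[47:]

new_strand = strand_field(new_style)
old_strand = strand_field(old_style)
```

For lines in the newer layout the field is blank, so the function returns an 
empty string.  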


------------------
Since release 61
------------------
The style of release note (this file) has been changed.  

Some entries now use a range format for the secondary accession numbers in 
the ACCESSION line, in order to shorten the list of secondary accession 
numbers.  For example;
------------------------------------------------------------------------------
Before;
ACCESSION   AB000802 D85885 D85886 D85887
After;
ACCESSION   AB000802 D85885-D85887
------------------------------------------------------------------------------
See also '8.3. ACCESSION line'.  
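
A reader of the flat file may need the individual accession numbers back.  
The following sketch expands a range such as D85885-D85887, assuming that 
both ends share the same letter prefix and the same number of digits.  The 
name expand_accessions is hypothetical.  

```python
import re
from typing import List

_ACCESSION_RE = re.compile(r"^([A-Za-z_]+)(\d+)$")


def expand_accessions(accession_line: str) -> List[str]:
    """Expand ranges like 'D85885-D85887' into individual accessions."""
    tokens = accession_line.split()
    if tokens and tokens[0] == "ACCESSION":
        tokens = tokens[1:]
    result: List[str] = []
    for token in tokens:
        if "-" not in token:
            result.append(token)
            continue
        first, last = token.split("-", 1)
        m1, m2 = _ACCESSION_RE.match(first), _ACCESSION_RE.match(last)
        if (m1 is None or m2 is None
                or m1.group(1) != m2.group(1)
                or len(m1.group(2)) != len(m2.group(2))):
            raise ValueError(f"cannot expand range: {token!r}")
        prefix, width = m1.group(1), len(m1.group(2))
        start, end = int(m1.group(2)), int(m2.group(2))
        if start > end:
            raise ValueError(f"descending range: {token!r}")
        # Keep the original zero padding of the numeric part.
        result.extend(f"{prefix}{n:0{width}d}" for n in range(start, end + 1))
    return result


expanded = expand_accessions("ACCESSION   AB000802 D85885-D85887")
```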


------------------
Since release 60
------------------
The cross-reference to the H-invitational has been included.


------------------
Since release 56
------------------
The three data banks have agreed that the maximum length limitation (350 kb)
of a submitted sequence be relaxed.

The BASE COUNT line of the DDBJ flat file format has been changed, 
following the relaxation of the maximum sequence length restriction per entry 
that had been in force at the DDBJ/EMBL/GenBank International Nucleotide 
Sequence Databases.  In the BASE COUNT line of the DDBJ flat file, 6 digits 
had been allocated for each of the counts of a, c, g, t and other bases in the 
sequence.  In the new flat file format, 9 digits are allocated for each of the 
counts of a, c, g and t, while the count of other bases is removed.  
In accordance with the relaxation of the sequence length limitation, GenBank 
had already dropped the BASE COUNT line from its flat file format as of 
GenBank Release 138 (Oct. 2003).  DDBJ has decided to maintain the BASE COUNT 
line in its flat file format, on the view that GC content is still important 
information for characterizing a sequence.  The changes in the BASE COUNT line 
are shown below.  
----------------------------------------------------------------------------
Old (-rel. 55): 
    1    6   11   16   21   26   31   36   41   46   51   56   61   66   71
    |----|----|----|----|----|----|----|----|----|----|----|----|----|----|
    BASE COUNT   123456 a 123456 c 123456 g 123456 t 123456 others

New (rel. 56-): 
    1    6   11   16   21   26   31   36   41   46   51   56   61   66   71
    |----|----|----|----|----|----|----|----|----|----|----|----|----|----|
    BASE COUNT    123456789 a    123456789 c    123456789 g    123456789 t
----------------------------------------------------------------------------
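
The new layout can be sketched as follows.  The field width is read off the 
ruler above, and right-justifying the counts within their 9 digits is an 
assumption.  The GC content helper illustrates why the line is kept.  Both 
function names are hypothetical.  

```python
from collections import Counter


def base_count_line(sequence: str) -> str:
    """Format a BASE COUNT line in the rel. 56- layout (a, c, g, t only)."""
    counts = Counter(sequence.lower())
    # Each field: 4 spaces, a 9-digit right-justified count, a space, the base.
    fields = "".join(f"{counts[base]:>13} {base}" for base in "acgt")
    return "BASE COUNT" + fields


def gc_content(sequence: str) -> float:
    """Fraction of g and c among the a, c, g and t bases."""
    counts = Counter(sequence.lower())
    total = sum(counts[base] for base in "acgt")
    return (counts["g"] + counts["c"]) / total if total else 0.0


line = base_count_line("aacgtn")   # 'n' is not counted in the new format
gc = gc_content("aacgtn")
```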


------------------
Since release 54
------------------
'/sequenced_mol' qualifier has been changed to '/mol_type' qualifier.  
We accordingly completed retrofitting the pertinent entries.  
This change was made on the agreement at the INSD collaborative meeting in 2002.


------------------
Since release 51
------------------
The format of LOCUS line in the flat file has been changed as shown below 
to adjust to the GenBank format.  
------------------------------------------------------------------------------
Old (-rel. 50): 
LOCUS       AB000001      660 bp    DNA             PLN       01-FEB-2001
New (rel. 51-): 
LOCUS       AB000001                 660 bp    DNA     linear   PLN 01-FEB-2001
------------------------------------------------------------------------------
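
Because the fields of the new LOCUS line are separated by spaces, a simple 
reader can split on whitespace instead of relying on column positions.  The 
sketch below does so for the rel. 51- example above.  The names Locus and 
parse_locus are hypothetical, and the expectation of exactly eight tokens is 
an assumption based on that single example.  

```python
from typing import NamedTuple


class Locus(NamedTuple):
    name: str
    length: int
    molecule: str
    topology: str
    division: str
    date: str


def parse_locus(line: str) -> Locus:
    """Parse a rel. 51- style LOCUS line by splitting on whitespace."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS line: {line!r}")
    _, name, length, _, molecule, topology, division, date = tokens
    return Locus(name, int(length), molecule, topology, division, date)


new_line = ("LOCUS       AB000001                 660 bp    DNA"
            "     linear   PLN 01-FEB-2001")
locus = parse_locus(new_line)
```

The old-style line has no topology field, so this reader rejects it with a 
ValueError rather than guessing a value.  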


------------------
Since release 45
------------------
The HTC (High Throughput cDNA) division has been included.  This is to include 
unfinished high throughput cDNA sequences, each of which has 5'UTR and 3'UTR 
at both ends and part of a coding region.  The sequence may also include 
introns.  When the sequence becomes finished later, it moves to the 
corresponding taxonomic division.  The sequence is accompanied with a keyword, 
HTC (High Throughput cDNA), which is dropped when the sequence is finished and 
moved to a taxonomic division.  


------------------
Since release 41
------------------
The CON division has been included.  This division shows the order of 
related sequences in a genome, expressed by a join of the accession numbers 
of the sequences.  The contents of the CON division are compiled by the three 
data banks, not by the data submitter.  


------------------
Since release 40
------------------
The RNA division was terminated.  


------------------
Since release 37
------------------
The three data banks include the item VERSION in the flat file, which 
indicates the version of a submitted nucleotide sequence.  It is expressed 
like AB123456.1, in which the digit(s) after the period is the version number.  
VERSION was added because a released sequence is sometimes revised by the 
submitter, so the accession number alone cannot specify the sequence in 
question, which causes trouble for the user.  The number is increased by 
one every time a revised sequence is made public.  

Accordingly, the translated protein sequence is accompanied by a 
/protein_id, which is expressed like BAA12345.1, in which the digit(s) after 
the period is again a version number.  The number is increased by one when the 
corresponding nucleotide sequence is revised and the protein sequence is 
changed as a result, and when the revised protein sequence is made public.
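
Both identifiers share the "accession, period, version" shape, so one small 
helper can split either of them.  This is a minimal sketch, and 
split_versioned_id is a hypothetical name.  

```python
from typing import Tuple


def split_versioned_id(identifier: str) -> Tuple[str, int]:
    """Split 'AB123456.1' into ('AB123456', 1)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession or not (version.isascii() and version.isdigit()):
        raise ValueError(f"no version suffix in {identifier!r}")
    return accession, int(version)


nucleotide = split_versioned_id("AB123456.1")
protein = split_versioned_id("BAA12345.1")
```

Comparing the integer parts of two identifiers with the same accession shows 
which revision is the newer one.  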


------------------
Since release 31
------------------
We have started adopting the unified taxonomy database to unify the biological 
source of the sequence.  The database is made up of scientific names, IDs of 
unidentified organisms, synthetic constructs, etc.  


------------------
Since release 30
------------------
NID and PID were terminated.  This change was made on the agreement at the 
INSD collaborative meeting in 1999.  


------------------
Since release 28
------------------
The HTG (High Throughput Genomic sequence) division has been included.  
We terminated the ORG (Organelle) division.  


------------------
Since release 27
------------------
The GSS division has been included.  GSS stands for Genome Survey Sequence, 
which is similar to EST, except that GSS is genomic DNA whereas EST is cDNA.  


------------------
Since release 25
------------------
DDBJ release contains amino acid sequences that were translated from the 
corresponding nucleotide sequences of the database.  


------------------
Since release 22
------------------
The HUM division has been included.  We have the human (HUM) division solely 
for human sequences and the primate (PRI) division for non-human primate 
sequences.  


------------------
Since release 12
------------------
The EST (Expressed Sequence Tag) division has been included.  


------------------
Since release 10
------------------
The sequences submitted to GenBank or EMBL have been included in the release.