DDBJ Release Notes


DNA Data Bank of Japan
DNA Database
Release 107.0, Dec. 2016: 790,211,658 entries, 2,144,818,812,438 bases
Last published date in the present release: November 25, 2016

-------------------------------------------------------------------------------
Table of contents
-------------------------------------------------------------------------------
 1. Introduction
    1.1. Announcement for changes in the present release
    1.2. Announcement for the forthcoming changes
 2. Data categories
    2.1. Categories for conventional sequence data
    2.2. Categories for bulk sequence data
    2.3. Notice for data derived from Patent Offices
 3. Statistics and files
    3.1. Files for conventional sequence data
    3.2. Files for bulk sequence data
 4. Citation
 5. DDBJ staff
 6. Acknowledgment
 7. Disclaimer
 8. DDBJ flat file format
    8.1. LOCUS line
    8.2. DEFINITION line
    8.3. ACCESSION line
    8.4. VERSION line
    8.5. DBLINK line
    8.6. KEYWORDS line
    8.7. SOURCE line
    8.8. REFERENCE line
    8.9. COMMENT line
    8.10. FEATURES line
    8.11. BASE COUNT line
    8.12. ORIGIN line
 9. Sample of the contents in each file
    9.1. Part of the contents in the file 'ddbjbct1.seq'
    9.2. Part of the contents in the accession number index file 'ddbjacc1.idx'
    9.3. Part of the contents in the gene name index 'ddbjgen1.idx'
10. Release history
-------------------------------------------------------------------------------

1. Introduction

The present release contains the newest data prepared by the DNA Data Bank of
Japan (DDBJ), GenBank (*), and EMBL-Bank/European Bioinformatics Institute
(EMBL-Bank/EBI) as of November 25, 2016. This unified database was made
possible thanks to the international collaboration among the three data banks.
All the entries have accordingly been annotated using the feature keys common
to them.
In 2005, DDBJ, EMBL-Bank and GenBank agreed to call their collaboration "the
International Nucleotide Sequence Database Collaboration (INSDC);
http://www.insdc.org" and to call the unified nucleotide sequence database
"the International Nucleotide Sequence Database (INSD)".

* 'GenBank' is a trademark of NIH, USA, and is operated by the National Center
  for Biotechnology Information (NCBI) at NIH.

1.1. Announcement for changes in the present release

Statistical information was changed:
In the past, the DDBJ periodical release did not contain entries with
four-letter prefix accession numbers, such as WGS and a large part of the TSA
data. Since new types of high throughput data assigned WGS-like accession
numbers will be added in the near future, we have modified the statistical
information to include entries with four-letter prefix accession numbers as
'bulk sequence data' from the present release. See also '2. Data categories'.

Revision of the DDBJ/ENA/GenBank Feature Table Definition:
Following the agreement at the INSD collaborative meeting in 2016, the
document, DDBJ/ENA/GenBank Feature Table Definition, was revised in November
2016. See also '8.10. FEATURES line' below. The revised points are introduced
in advance at the following URL;
  http://www.ddbj.nig.ac.jp/insdc/icm2016-e.html#ft

1.2. Announcement for the forthcoming changes

A new data type of bulk sequence data, TLS, will be introduced:
Data from large-scale sequencing studies of special marker genes will be
included as a data type of bulk sequence data, Targeted Locus Study (TLS),
from the next periodical release, 108. TLS data comprise sequences of 16S
rRNAs or other targeted loci that are to be clustered into operational
taxonomic units; they are assigned accession numbers with four-letter
prefixes.

2. Data categories

The sequence data of the periodical DDBJ release are divided into two main
groups, conventional sequence data and bulk sequence data.
The former includes data whose entries are assigned accession numbers with
one- or two-letter prefixes. The latter includes ultra-high throughput data
sets whose entries are assigned accession numbers with four-letter prefixes.
See also '8.3. ACCESSION line'.

2.1. Categories for conventional sequence data

The conventional sequence data in the present release are divided into 21
categories of organisms and others, called 'divisions'. The contents of the
divisions are as follows.

HUM; human
PRI; primates other than human
ROD; rodents
MAM; mammals other than primates and rodents
VRT; vertebrates other than mammals
INV; invertebrates (animals other than vertebrates)
PLN; plants, fungi, plastids (eukaryotes other than animals)
BCT; bacteria (including both Eubacteria and Archaea)
VRL; viruses
PHG; bacteriophages
ENV; sequences obtained via environmental sampling methods
SYN; synthetic construct; artificially constructed sequences
EST; expressed sequence tag; short single pass cDNA sequence
TSA; transcriptome shotgun assembly; Assembled RNA transcripts/cDNA
     sequences.
HTC; high throughput cDNA sequence; Sequences submitted from cDNA sequencing
     projects, except for ESTs. This division holds unfinished high
     throughput cDNA sequences, each of which has 5'UTR and 3'UTR at both
     ends and part of a coding region. The sequence may also include introns.
     When the sequence is finished later, it moves to the corresponding
     taxonomic division.
GSS; genome survey sequence; short single pass genomic sequence
HTG; high throughput genomic sequence; Sequences submitted mainly from
     genome sequencing projects that regard a clone as a sequencing unit.
STS; sequence tagged site; The tag site for genome sequencing. Chromosome
     and map information is mandatory for this division.
PAT; sequence data derived from Patent Offices; Data collected, processed and
     released by the Japan Patent Office (JPO), the United States Patent and
     Trademark Office (USPTO), the European Patent Office (EPO), and the
     Korean Intellectual Property Office (KIPO). See also '2.3. Notice for
     data derived from Patent Offices' below.
UNA; the sequence data not annotated; The UNA division is not used for
     recently submitted sequences.
CON; Contig / Constructed; To join a series of entries, such as those
     submitted from a genome project, each of the three data banks constructs
     an entry and assigns an accession number to a large-scale sequence
     dataset. Such entries are classified into the CON division. An entry in
     the CON division holds the joined accession numbers instead of the
     sequence data. The entries joined by a CON entry have been submitted to
     other divisions. The entries and bases in the CON division are not
     counted in the release totals given at the top of the release note.

2.2. Categories for bulk sequence data

The bulk sequence data in the present release are divided into 2 categories,
called 'data types'. The contents of the data types are as follows.

TSA; transcriptome shotgun assembly; Assembled RNA transcripts/cDNA
     sequences.
WGS; whole genome shotgun; The draft genomic sequences of various organisms
     determined by the whole genome shotgun approach.

Note that TSA is both a division of conventional sequence data and a data
type of bulk sequence data.

2.3. Notice for data derived from Patent Offices

This release includes the PAT division for sequence data derived from Patent
Offices, as described above. These data were collected, processed and
released by the Japan Patent Office (JPO), the United States Patent and
Trademark Office (USPTO), the European Patent Office (EPO), and the Korean
Intellectual Property Office (KIPO).
The prefixes of accession numbers for the PAT division can be found at the
following URL;
  http://www.ddbj.nig.ac.jp/sub/prefix.html

Note also that unauthorized use of the patented data may cause legal issues
for which DDBJ takes no responsibility. See also '7. Disclaimer'.

3. Statistics and files

The statistics of the present release are shown in the following table:

-------------------------------------------------------
categories   number of entries         number of bases
-------------------------------------------------------
BCT                    1496340             29232018629
ENV                    8069531              5273599021
EST                   76572230             42789346444
GSS                   40054600             25842088149
HTC                     546835               637990352
HTG                     174702             27540604129
HUM                     700050              5574265065
INV                    6844391             16766612792
MAM                     481390              3844202948
PAT                   35649049             18516869245
PHG                      11665               306357871
PLN                    4293860             15433636347
PRI                     140916              2251564092
ROD                     527037              4506476789
STS                    1346867               640874549
SYN                     216015              1069065170
TSA                  149076157            132576915262
UNA                        376                  266598
VRL                    2153947              3097174312
VRT                    2625721              6988560250
WGS                  459229979           1801930324424
-------------------------------------------------------
Total                790211658           2144818812438

CON                   31364775           1045764520067

The entries and bases in the CON division are not counted in the numbers
given on the top of the release note or 'total' on the above table.

3.1. Files for conventional sequence data

The conventional sequence data in this release covers 21 categories (See also '2.1.
Categories for conventional sequence data') of organisms and others as follows:

------------------------------------------------------------------------------
ddbjbct; Category for BCT
ddbjcon; Category for CON
ddbjenv; Category for ENV
ddbjest; Category for EST
ddbjgss; Category for GSS
ddbjhtc; Category for HTC
ddbjhtg; Category for HTG
ddbjhum; Category for HUM
ddbjinv; Category for INV
ddbjmam; Category for MAM
ddbjpat; Category for PAT
ddbjphg; Category for PHG
ddbjpln; Category for PLN
ddbjpri; Category for PRI
ddbjrod; Category for ROD
ddbjsts; Category for STS
ddbjsyn; Category for SYN
ddbjtsa; Category for TSA
ddbjuna; Category for UNA
ddbjvrl; Category for VRL
ddbjvrt; Category for VRT
------------------------------------------------------------------------------

In the present release, each of the above is recorded in multiple
ddbj***###.seq files, each of which is at most 1.5 GB in size. The number of
files per category is as follows.

-------------------------------
file prefix     number of files
-------------------------------
ddbjbct                      45
ddbjcon                      55
ddbjenv                      15
ddbjest                     168
ddbjgss                      79
ddbjhtc                       2
ddbjhtg                      25
ddbjhum                       7
ddbjinv                      24
ddbjmam                       5
ddbjpat                      45
ddbjphg                       1
ddbjpln                      22
ddbjpri                       3
ddbjrod                       5
ddbjsts                       4
ddbjsyn                       2
ddbjtsa                      37
ddbjuna                       1
ddbjvrl                       7
ddbjvrt                      10
-------------------------------

The files contain nucleotide sequence data in DDBJ flat file format. See also
'8. DDBJ flat file format'.

The index files included in this release are ddbjacc#.idx and ddbjgen.idx.
All of them are recorded in multiple ddbjacc#.idx files, each of which is at
most 1.5 GB in size.

The file lists of conventional sequence data in this release are given in the
file 'ddbj107_filelist.txt'. The file provides the lists of the sequence data
files and the index files. The list of sequence data files consists of four
columns; "file name", "number of entries", "number of bases" and "file size".
The list of index files consists of two columns; "file name" and "file size".
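
The file naming described above can be sketched in Python as follows. This is
only an illustration, not a DDBJ tool: the counts are copied from the table in
this section, the function name is hypothetical, and the assumption that files
are numbered from 1 without zero padding is inferred solely from the sample
file name 'ddbjbct1.seq' used later in this note.

```python
# Number of .seq files per prefix, copied from the table in section 3.1.
FILE_COUNTS = {
    "ddbjbct": 45, "ddbjcon": 55, "ddbjenv": 15, "ddbjest": 168,
    "ddbjgss": 79, "ddbjhtc": 2, "ddbjhtg": 25, "ddbjhum": 7,
    "ddbjinv": 24, "ddbjmam": 5, "ddbjpat": 45, "ddbjphg": 1,
    "ddbjpln": 22, "ddbjpri": 3, "ddbjrod": 5, "ddbjsts": 4,
    "ddbjsyn": 2, "ddbjtsa": 37, "ddbjuna": 1, "ddbjvrl": 7,
    "ddbjvrt": 10,
}


def sequence_file_names(counts):
    """Yield the expected sequence file names, e.g. 'ddbjbct1.seq'.

    Assumption: numbering starts at 1 and is not zero padded.
    """
    for prefix, number_of_files in counts.items():
        for index in range(1, number_of_files + 1):
            yield f"{prefix}{index}.seq"


names = list(sequence_file_names(FILE_COUNTS))
print(len(names), names[0], names[-1])
```

A list built this way could be compared against 'ddbj107_filelist.txt' to
check that a local mirror is complete.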

From the present periodical release to the next one, daily updates of
conventional sequence data are available at the following directory;
  ftp://ftp.ddbj.nig.ac.jp/ddbj_database/ddbjnew/

3.2. Files for bulk sequence data

The latest files of bulk sequence data are available at the following sites;

------------------------------------------------------------------------------
WGS: ftp://ftp.ddbj.nig.ac.jp/ddbj_database/wgs/
TSA: ftp://ftp.ddbj.nig.ac.jp/ddbj_database/tsa/
------------------------------------------------------------------------------

The files of bulk sequence data are named by the prefixes of their accession
numbers. They contain nucleotide sequence data in DDBJ flat file format. See
also '8. DDBJ flat file format'.

Since the directories of bulk sequence data are updated daily, the statistics
of bulk sequence data in the present release are snapshots of the above
directories at the last published date, November 25, 2016. The statistics are
available in the following files:

------------------------------------------------------------------------------
WGS: ddbj107_wgs_filelist.txt
TSA: ddbj107_tsa_filelist.txt
------------------------------------------------------------------------------

The list tables in the files consist of four columns; "file name", "number of
entries", "number of bases" and "file size". Please note that both the "file
name" and "file size" columns give the values for the files after they have
been uncompressed from "****.gz".

4. Citation

When you use DDBJ in your research, we would appreciate it if you would
include a reference to DDBJ in the publications related to that research.
When citing an entry in the DDBJ database, it is appropriate to give its
accession number. It is also recommended to cite the first publication in
REFERENCE of the entry other than the submitter information. DDBJ suggests
authors add a reference for DDBJ itself.
The following publication, which describes the recent activities of the DDBJ
center, would be appropriate to cite:

  Mashima J, Kodama Y, Kosuge T, Fujisawa T, Katayama T, Nagasaki H, Okuda Y,
  Kaminuma E, Ogasawara O, Okubo K, Nakamura Y and Takagi T.
  DNA data bank of Japan (DDBJ) progress report.
  Nucleic Acids Res. 44 (Database issue), D51-D57 (2016)
  DOI: 10.1093/nar/gkv1105

The following sentence is an example of how to cite an entry in the DDBJ
database:

-----------------------------------------------------------------------------
"We searched the DDBJ database (1) by sequence similarities and found a
nucleotide sequence (2), with DDBJ accession number AB000714, which had
significant similarity with ..."

(1) Mashima, J. et al., Nucleic Acids Res. 44(Database issue), D51-D57 (2016).
(2) Katahira, J. et al., J. Biol. Chem. 272, 26652-26658 (1997).
------------------------------------------------------------------------------

5. DDBJ staff

This release is published by the following DDBJ staff.

  Jun Mashima, Hideo Aono, Yuji Ashizawa, Yukino Dobashi, Mayumi Ejima,
  Masahiro Fujimoto, Asami Fukuda, Tomohiro Hirai, Naofumi Ishikawa,
  Chiharu Kawagoe, Yuichi Kodama, Junko Kohira, Takehide Kosuge,
  Kyungbum Lee, Mika Maki, Hisako Mashima, Fujitaka Matsumori,
  Kimiko Mimura, Shiho Mukaida, Naoko Murakata, Toshihisa Okido,
  Yoshihiro Okuda, Katsunaga Sakai, Makoto Sato, Aimi Shiida, Rie Sugita,
  Kimiko Suzuki, Toshiaki Tokimatsu, Haru Tsutsui, Koji Watanabe,
  Tomoka Watanabe, Tomohiko Yasuda, Emi Yokoyama, Masanori Arita,
  Eli Kaminuma, Osamu Ogasawara, Kosaku Okubo, Toshihisa Takagi,
  and Yasukazu Nakamura

  DNA Data Bank of Japan
  DDBJ Center
  National Institute of Genetics
  Research Organization of Information and Systems
  Mishima, 411-8540, Japan
  Phone:  +81 55 981 6853
  FAX:    +81 55 981 6849
  E-mail: [email protected] (for general inquiry)
          [email protected] (for data submission)
          [email protected] (for updates and notification of publication)
  WWW:    http://www.ddbj.nig.ac.jp/

6. Acknowledgment

We are grateful to NCBI and EBI for their firm friendship and excellent
collaboration with us. We thank JPO and KIPO for their steady cooperation
with us. We also thank Byungwook Lee at the Korean Bioinformation Center for
properly processing the sequence data in patent claims to KIPO. The operation
of DDBJ is supported by the Ministry of Education, Culture, Sports, Science
and Technology, and we gratefully note this here. DDBJ uses the Super-SINET
computer network for data collection, data exchange and various services.

7. Disclaimer

While DDBJ endeavors to keep its data correct, DDBJ makes no representations
or warranties of any kind about the completeness, accuracy or reliability of
the entries contained in the DDBJ periodical release. DDBJ also accepts no
legal liability or responsibility for merchantability or fitness for a
particular purpose, or for whether the use of the sequence data infringes any
patent or other rights. Any receipt of, reliance on, or use of such data is
therefore strictly at your own risk.

8. DDBJ flat file format

The database is a collection of "entries", each of which is a unit of the
data. The entries submitted to the data banks are processed and publicized
according to the DDBJ format for distribution (flat file). The flat file
includes the sequence and the information on submitters, references, source
organisms, "feature" information, etc. The items of the DDBJ flat file are
explained in the following;

-------------------------------------------------------------------------------
LOCUS       AB000000                 450 bp    mRNA    linear   HUM 08-JUL-2002
DEFINITION  Homo sapiens GAPD mRNA for glyceraldehyde-3-phosphate
            dehydrogenase, partial cds.
ACCESSION   AB000000
VERSION     AB000000.1
DBLINK      BioProject:PRJDA12345
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 450)
  AUTHORS   Mishima,H. and Shizuoka,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-NOV-2000) to the DDBJ/EMBL/GenBank databases.
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; 1111, Yata,
            Mishima, Shizuoka 411-8540, Japan
REFERENCE   2
  AUTHORS   Mishima,H., Shizuoka,T. and Fuji,I.
  TITLE     Glyceraldehyde-3-phosphate dehydrogenase expressed in human liver
  JOURNAL   Unpublished (2002)
COMMENT     Human cDNA sequencing project.
FEATURES             Location/Qualifiers
     source          1..450
                     /chromosome="12"
                     /clone="GT200015"
                     /clone_lib="lambda gt11 human liver cDNA (GeneTech.
                     No.20)"
                     /map="12p13"
                     /mol_type="mRNA"
                     /organism="Homo sapiens"
                     /tissue_type="liver"
     CDS             86..>450
                     /codon_start=1
                     /gene="GAPD"
                     /product="glyceraldehyde-3-phosphate dehydrogenase"
                     /protein_id="BAA12345.1"
                     /transl_table=1
                     /translation="MAKIKIGINGFGRIGRLVARVALQSDDVELVAVNDPFITTDYMT
                     YMFKYDTVHGQWKHHEVKVKDSKTLLFGEKEVTVFGCRNPKEIPWGETSAEFVVEYTG
                     VFTDKDKAVAQLKGGAKKV"
BASE COUNT        102 a      119 c      131 g       98 t
ORIGIN
        1 cccacgcgtc cggtcgcatc gcacttgtag ctctcgaccc ccgcatctca tccctcctct
       61 cgcttagttc agatcgaaat cgcaaatggc gaagattaag atcgggatca atgggttcgg
      121 gaggatcggg aggctcgtgg ccagggtggc cctgcagagc gacgacgtcg agctcgtcgc
      181 cgtcaacgac cccttcatca ccaccgacta catgacatac atgttcaagt atgacactgt
      241 gcacggccag tggaagcatc atgaggttaa ggtgaaggac tccaagaccc ttctcttcgg
      301 tgagaaggag gtcaccgtgt tcggctgcag gaaccctaag gagatcccat ggggtgagac
      361 tagcgctgag tttgttgtgg agtacactgg tgttttcact gacaaggaca aggccgttgc
      421 tcaacttaag ggtggtgcta agaaggtctg
//
-------------------------------------------------------------------------------

8.1. LOCUS line

The format of the LOCUS line in the flat file is shown below;

---------   --------
Positions   Contents
---------   --------
01-05       'LOCUS'
06-12       spaces
13-28       Locus name
29-29       space
30-40       Length of sequence, right-justified
41-41       space
42-43       'bp'
44-47       spaces
48-54       DNA, RNA, mRNA, rRNA, tRNA or cRNA, left-justified
55-55       space
56-63       'linear' followed by two spaces, or 'circular'
64-64       space
65-67       The division code (See '2. Data categories.')
68-68       space
69-79       Date, in the form dd-MMM-yyyy (e.g., 08-JUL-2002)
------------------------------------------------------------------------------

8.2. DEFINITION line

The definition briefly describes the information on the gene(s).
"DEFINITION" is constructed by each of the three data banks.

8.3. ACCESSION line

This line shows the accession number of the entry. Each of the three data
banks issues a unique accession number to the data submitted to it. The
accession number is composed of 1 alphabetic character and 5 digits (e.g.,
A12345), 2 alphabetic characters and 6 digits (e.g., AB123456) or 4
alphabetic characters and 8-10 digits (e.g., AAAA01012345). The first style
was used in the 1980s; the second and the third styles were introduced later
because of the data explosion. See also the following URL;
  http://www.ddbj.nig.ac.jp/sub/acc_def-e.html

The alphabetic part of the accession number is called the "prefix". You can
find the prefix list of the accession numbers at the following URL;
  http://www.ddbj.nig.ac.jp/sub/prefix.html

If multiple entries are united into one entry, or if an entry is extensively
modified after submission, the responsible data bank may assign a new
accession number to it. In these cases, the new accession number is called
the primary accession number, and the old accession number(s) is/are called
the secondary accession number(s). In the flat file, the primary accession
number is given first, followed by the secondary accession number(s). You can
find the same updated entry with either the primary or the secondary
accession numbers.

8.4. VERSION line

This line consists of an accession number and a version number, like
"AB123456.1", in which the digit(s) after the period is the version number.
Data made public for the first time are given version number "1".
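
As a rough illustration of the accession number styles in 8.3 and the
"accession.version" form above, the following Python sketch classifies an
accession number and splits a VERSION value. The regular expressions are only
a reading of the digit counts stated in the text, and the function names are
hypothetical; the authoritative definitions are at the URLs given in 8.3.

```python
import re

# Three styles described in section 8.3. The patterns are an illustrative
# reading of that text, not an official DDBJ/INSDC validator.
ACCESSION_STYLES = [
    ("1 letter + 5 digits", re.compile(r"[A-Z]\d{5}")),
    ("2 letters + 6 digits", re.compile(r"[A-Z]{2}\d{6}")),
    ("4 letters + 8-10 digits", re.compile(r"[A-Z]{4}\d{8,10}")),
]


def classify_accession(accession):
    """Return a label for the matching style, or None if nothing matches."""
    for label, pattern in ACCESSION_STYLES:
        if pattern.fullmatch(accession):
            return label
    return None


def split_version(value):
    """Split a VERSION value such as 'AB123456.1' into (accession, version)."""
    accession, separator, version = value.rpartition(".")
    if not separator or not version.isdigit():
        raise ValueError(f"not an accession.version value: {value!r}")
    return accession, int(version)


accession, version = split_version("AB123456.1")
print(accession, version, classify_accession(accession))
```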
VERSION was added because a released sequence is sometimes revised by the
submitter, so the accession number alone cannot identify the sequence in
question, which causes trouble for users. The version number is increased by
one every time a revised sequence is made public.

8.5. DBLINK line

The DBLINK line provides links to records in other databases, using accession
numbers of BioProject, BioSample, Sequence Read Archive and so on.

8.6. KEYWORDS line

The data banks fill in this line if necessary. In many cases, the categories
of the data (EST, HTG etc.), gene names and product names are included in
"KEYWORDS".

8.7. SOURCE line

This line shows the scientific name (and the corresponding common name, if
one is defined as "Genbank common name" in the taxonomy database) of the
organism from which the sequence was obtained, and an organelle type if the
sequence is derived from an organelle other than the nucleus.

8.8. REFERENCE line

The information on the submitters and on references related to the submitted
sequence is given in the REFERENCE line.

8.9. COMMENT line

This line gives information about an entry that cannot be described using
FEATURES or the other fields.

8.10. FEATURES line

Biological features of submitted sequence data are described with a "Feature"
key (the biological nature of the annotated feature), a "Location" (the
region of the sequence which corresponds to the Feature), and "Qualifiers"
(supplementary information about the Feature). The "Feature" and "Qualifier"
keys used in the present release are defined by the DDBJ/ENA/GenBank Feature
Table Definition Version 10.6 (November, 2016). The document is, in
principle, updated every year. You can find its newest version at the URL;
  http://www.ddbj.nig.ac.jp/FT/full_index.html

8.11. BASE COUNT line

In the BASE COUNT line of the DDBJ flat file, 9 digits are allocated for each
of the counts of a (adenine), c (cytosine), g (guanine) and t (thymine).
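
A minimal Python sketch of how such a line could be produced from a sequence
is given below. It assumes a 12-column keyword field followed by a 9-character
right-justified count and the base letter for each base; the text above only
states the 9-digit allocation, so the rest of the layout and the function name
are assumptions.

```python
from collections import Counter


def base_count_line(sequence):
    """Build a BASE COUNT line for a sequence given in flat-file letters.

    Assumed layout: 'BASE COUNT' padded to 12 columns, then for each of
    a, c, g, t a count right-justified in 9 characters, a space, and the
    base letter. Characters other than a, c, g, t are ignored.
    """
    counts = Counter(sequence.lower())
    fields = "".join(f"{counts[base]:>9} {base}" for base in "acgt")
    return "BASE COUNT  " + fields


# Spaces between blocks of 10 bases, as in the ORIGIN section, are ignored.
print(base_count_line("atgagtcaga atacgctgaa"))
```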
In the case of an RNA sequence, uracil is indicated as "t" according to the
rule of the international nucleotide database.

8.12. ORIGIN line

The sequence data start on the line after ORIGIN. The sequence is shown in
lower-case letters, delimited by a space every 10 bases, with a new line
every 60 bases. The number at the left of each line is the position of the
first base on that line.

9. Sample of the contents in each file

9.1. Part of the contents in the file 'ddbjbct1.seq'

This shows all pieces of information on one entry in DDBJ format.

------------------------------------------------------------------------------
LOCUS       D87069                   993 bp    mRNA    linear   BCT 05-OCT-2006
DEFINITION  Escherichia coli mRNA for RNA polymerase sigma subunit, truncated
            form of sigma-38, complete cds.
ACCESSION   D87069
VERSION     D87069.1
KEYWORDS    RNA polymerase sigma subunit, truncated form of sigma-38.
SOURCE      Escherichia coli
  ORGANISM  Escherichia coli
            Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 993)
  AUTHORS   Jishage,M.
  TITLE     Direct Submission
  JOURNAL   Submitted (14-AUG-1996) to the DDBJ/EMBL/GenBank databases.
            Contact:Miki Jishage
            National Institute of Genetics, Molecular Genetics; Yata 1111,
            Mishima, Shizuoka 411, Japan
REFERENCE   2
  AUTHORS   Jishage,M. and Ishihama,A.
  TITLE     Variation in RNA polymerase sigma subunit composition within
            different stocks of Escherichia coli starin W3110
  JOURNAL   Unpublished (1996)
REFERENCE   3
  AUTHORS   Ivanova,A., Renshaw,M., Guntaka,R. and Eisenstark,A.
  TITLE     DNA base sequence variability in katF (putative sigma factor) gene
            Escherichia coli
  JOURNAL   Nucleic Acids Res. 20, 5479-5480 (1992)
REFERENCE   4
  AUTHORS   Takayanagi,Y., Tanaka,K. and Takahashi,H.
  TITLE     Structure of the 5' upstream region and the regulation of the rpoS
            gene of Escherichia coli
  JOURNAL   Mol. Gen. Genet. 243, 525-531 (1994)
COMMENT
FEATURES             Location/Qualifiers
     source          1..993
                     /db_xref="taxon:562"
                     /mol_type="mRNA"
                     /organism="Escherichia coli"
                     /strain="W3110"
     CDS             1..810
                     /note="the gene has four single base changes, resulting in
                     two amino acid substitutions and an amber mutation"
                     /product="RNA polymerase sigma subunit, truncated form of
                     sigma-38"
                     /protein_id="BAA13238.1"
                     /transl_table=11
                     /translation="MSQNTLKVHDLNEDAEFDENGVEVFDEKALVEYEPSDNDLAEEE
                     LLSQGATQRVLDATQLYLGEIGYSPLLTAEEEVYFARRALRGDVASRRRMIESNLRLV
                     VKIARRYGNRGLALLDLIEEGNLGLIRAVEKFDPERGFRFSTYATWWIRQTIERAIMN
                     QTRTIRLPIHIVKELNVYLRTARELSHKLDHEPSAEEIAEQLDKPVDDVSRMLRLNER
                     ITSVDTPLGGDSEKALLDILADEKENGPEDTTQDDDMKQSIVKWLFELNAK"
     variation       75
                     /citation=[3]
                     /replace="t"
     variation       97
                     /citation=[3]
                     /replace="t"
     variation       99
                     /citation=[3]
                     /replace="t"
     variation       808
                     /citation=[3]
                     /replace="t"
BASE COUNT        254 a      223 c      291 g      225 t
ORIGIN
        1 atgagtcaga atacgctgaa agttcatgat ttaaatgaag atgcggaatt tgatgagaac
       61 ggagttgagg tttttgacga aaaggcctta gtagaatatg aacccagtga taacgatttg
      121 gccgaagagg aactgttatc gcagggagcc acacagcgtg tgttggacgc gactcagctt
      181 taccttggtg agattggtta ttcaccactg ttaacggccg aagaagaagt ttattttgcg
      241 cgtcgcgcac tgcgtggaga tgtcgcctct cgccgccgga tgatcgagag taacttgcgt
      301 ctggtggtaa aaattgcccg ccgttatggc aatcgtggtc tggcgttgct ggaccttatc
      361 gaagagggca acctggggct gatccgcgcg gtagagaagt ttgacccgga acgtggtttc
      421 cgcttctcaa catacgcaac ctggtggatt cgccagacga ttgaacgggc gattatgaac
      481 caaacccgta ctattcgttt gccgattcac atcgtaaagg agctgaacgt ttacctgcga
      541 accgcacgtg agttgtccca taagctggac catgaaccaa gtgcggaaga gatcgcagag
      601 caactggata agccagttga tgacgtcagc cgtatgcttc gtcttaacga gcgcattacc
      661 tcggtagaca ccccgctggg tggtgattcc gaaaaagcgt tgctggacat cctggccgat
      721 gaaaaagaga acggtccgga agataccacg caagatgacg atatgaagca gagcatcgtc
      781 aaatggctgt tcgagctgaa cgccaaatag cgtgaagtgc tggcacgtcg attcggtttg
      841 ctggggtacg aagcggcaac actggaagat gtaggtcgtg aaattggcct cacccgtgaa
      901 cgtgttcgcc agattcaggt tgaaggcctg cgccgtttgc gcgaaatcct gcaaacgcag
      961 gggctgaata tcgaagcgct gttccgcgag taa
//
------------------------------------------------------------------------------

9.2. Part of the contents in the accession number index file 'ddbjacc1.idx'

The following excerpt from the accession number index file illustrates the
format of the index.

------------------------------------------------------------------------------
A00001    A00001    PAT    A00001
A00002    A00002    PAT    A00002
A00003    A00003    PAT    A00003
A00004    A00004    PAT    A00004
A00005    A00005    PAT    A00005
A00006    A00006    PAT    A00006
A00008    A00008    PAT    A00008
A00009    A00009    PAT    A00009
A00010    A00010    PAT    A00010
------------------------------------------------------------------------------

The accession number index file consists of four columns delimited by tab
characters. The first column gives the secondary accession number; if there
is no secondary accession number, it gives the primary accession number. The
following columns are the locus name, the division and the primary accession
number, respectively.

9.3. Part of the contents in the gene name index file 'ddbjgen1.idx'

This file lists all the gene names that appear in the feature table.

------------------------------------------------------------------------------
2            AJ431263    PLN    AJ431263
B epsilon    AJ276037    PLN    AJ276037
B epsilon    AJ276037    PLN    AJ276037
B epsilon    AJ276037    PLN    AJ276037
B epsilon    AJ276037    PLN    AJ276037
D beta       Z22855      ROD    Z22855
D beta 1     Z22854      ROD    Z22854
D34          Z93215      HUM    Z93215
H5           X15387      INV    X15387
H5           X15387      INV    X15387
HLA-DBR1     X68272      HUM    X68272
------------------------------------------------------------------------------

The gene name index file consists of four columns: gene name, locus name,
division and primary accession number, respectively. Columns are delimited by
tab characters.

10. Release history

Release  Date       Entries              Bases  Comments
    107  12/16  790,211,658  2,144,818,812,438  bulk sequence data inclusion started
    106  09/16  196,421,345    218,729,237,634
    105  06/16  194,599,140    213,484,513,978
    104  03/16  191,094,643    207,977,078,920
    103  12/15  189,264,014    204,119,485,393
    102  09/15  187,785,897    200,654,335,022
    101  06/15  184,281,713    192,506,352,252
    100  03/15  181,941,277    187,798,813,739
     99  12/14  178,825,615    184,410,381,191
     98  09/14  174,391,281    166,692,710,729
     97  06/14  172,402,324    161,078,598,329
     96  03/14  171,164,046    158,539,702,882
     95  12/13  169,094,459    156,527,217,715
     94  09/13  167,480,294    154,916,713,861
     93  06/13  165,072,766    152,702,928,183
     92  03/13  163,017,305    150,760,062,903
     91  12/12  160,729,709    148,418,537,672
     90  09/12  156,952,755    144,754,534,372
     89  06/12  153,273,314    141,016,380,296  Part of index files terminated
     88  12/11  145,861,965    134,956,109,049
     87  09/11  142,339,601    131,276,394,833
     86  06/11  138,030,308    128,745,918,079
     85  03/11  132,302,771    124,516,775,718
     84  12/10  128,607,035    120,919,136,706
     83  09/10  124,079,491    117,728,717,442
     82  06/10  120,034,097    115,169,689,543
     81  03/10  116,720,237    112,394,932,676  TPA excluded
     80  12/09  112,314,250    109,636,862,252  SOURCE line modified
     79  09/09  108,593,519    106,684,379,504  DBLINK line started
                                                PROJECT line terminated
     78  06/09  105,737,359    104,597,360,291
     77  03/09  102,099,156    101,765,388,414
     76  12/08   98,220,409     98,741,908,446
     75  09/08   92,840,037     95,219,505,205  TSA division started
     74  06/08   87,903,140     91,294,770,939
     73  03/08   83,167,582     86,099,950,395  KIPO inclusion started
     72  12/07   79,004,098     82,592,245,487  Most of E-mail addresses discarded
     71  09/07   76,273,345     79,706,204,461
     70  06/07   72,801,679     76,788,510,646
     69  03/07   67,523,680     71,775,679,500  PROJECT line started
                                                Indexes for categories terminated
     68  12/06   64,267,978     68,259,314,742  1.5 GB storage started
     67  09/06   61,144,621     65,443,024,193
     66  06/06   58,176,628     62,945,843,881
     65  03/06   55,890,995     60,564,721,635  TPA subcategories started
     64  12/05   52,272,669     56,098,558,378  Some
                                                index files split
     63  09/05   47,741,593     52,246,110,341
     62  06/05   45,249,444     49,158,155,283  ENV division started
                                                Version for release note started
     61  03/05   43,118,204     47,099,081,750  Changed style of release note
     60  12/04   40,583,945     44,416,752,273  /db_xref="H-inv:**" started
     59  09/04   37,926,117     42,245,956,937
     58  06/04   34,917,581     39,812,635,108
     57  03/04   32,693,678     38,008,449,840
     56  12/03   30,405,173     36,079,046,032
     55  09/03   27,753,140     34,280,225,489
     54  06/03   25,149,821     32,162,041,177
     53  02/03   23,250,813     29,711,299,332
     52  12/02   20,354,812     26,931,456,316
     51  09/02   18,401,358     22,782,404,136  TPA started
     50  06/02   17,260,693     20,158,357,982
     49  04/02   16,503,157     18,579,627,226
     48  01/02   15,016,100     16,197,713,855
     47  10/01   13,266,610     14,145,671,645
     46  07/01   12,313,759     13,037,646,166
     45  04/01   11,434,113     12,207,092,905  HTC division started
     44  01/01   10,165,597     11,136,298,841
     43  10/00    8,666,551     10,034,532,698
     42  07/00    7,554,995      8,880,721,093
     41  04/00    5,962,608      6,409,581,885  CON division started
     40  01/00    5,388,125      4,762,696,173  RNA division terminated
     39  10/99    4,810,773      3,728,000,562  NID and PID discarded
     38  07/99    4,294,369      3,098,519,597
     37  03/99    3,311,627      2,375,261,951  VERSION, /protein_id started
     36  01/99    3,073,166      2,190,425,560
     35  10/98    2,759,261      1,957,341,169
     34  07/98    2,412,785      1,708,580,623
     33  04/98    2,174,769      1,479,303,279
     32  01/98    1,956,669      1,300,950,613
     31  10/97    1,731,532      1,139,869,464  Adoption of the unified taxonomy database
     30  07/97    1,534,115        992,788,339  NID and PID terminated
     29  04/97    1,270,194        841,415,232
     28  01/97    1,154,120        756,785,219  HTG division started
                                                ORG division terminated
     27  10/96      936,697        608,103,057  GSS division started
     26  07/96      835,552        551,932,448
     25  04/96      744,490        499,300,364  /translation started
     24  01/96      637,508        431,771,652
     23  10/95      569,757        390,694,350
     22  07/95      437,588        322,982,425  HUM division started
     21  04/95      274,596        250,875,023
     20  01/95      239,689        231,299,557
     19  10/94      204,332        205,274,131
     18  07/94      185,230        192,473,021
     17  04/94      169,957        179,942,209
     16  01/94      154,626        165,017,628
     15  10/93      131,649        147,224,690
     14  07/93      120,350        138,686,333  JPO inclusion started
     13  04/93      112,067        129,784,445
     12  01/93       97,683        120,815,244  EST division started
     11  07/92       65,693         84,839,075
     10  01/92       59,317         77,805,556  GenBank/EMBL inclusion started
      9  07/91        1,130          2,002,124
      8  01/91          879          1,573,442
      7  07/90          681          1,154,211
      6  01/90          496            841,236
      5  07/89          395            679,378
      4  01/89          302            535,985
      3  07/88          230            345,850
      2  01/88          142            199,392
      1  07/87           66            108,970  Started with DDBJ only

------------------
Since release 89
------------------
Index files have been changed:
Previously, the DDBJ periodical release included index files for accession
numbers, keyword phrases, journal citations, and gene names. After the
rearrangement of the index files, the index files for keyword phrases and
journal citations were terminated, and the formats of the index files for
accession numbers and gene names were changed. See also "9.2. Part of the
contents in the accession number index file 'ddbjacc1.idx'" and "9.3. Part of
the contents in the gene name index 'ddbjgen1.idx'"

------------------
Since release 81
------------------
TPA category data have been excluded from the DDBJ periodical release:
From September 2002 (DDBJ release 51), DDBJ periodical releases included TPA
category data. However, this was potentially confusing, because the TPA
category is not primary nucleotide sequence data. Therefore, DDBJ stopped
including TPA data. TPA data are available from another FTP site. See the
following site for details.
  URL; http://www.ddbj.nig.ac.jp/whatsnew/whatsnew2009-e.html#090828

------------------
Since release 80
------------------
The format of the SOURCE line in the DDBJ flat file has been changed:
The SOURCE lines in some DDBJ flat files now include a common name, as in
the GenBank flat file. The change is shown below.

----------------
Old (-rel. 79)
----------------
Format:  SOURCE []
Example: SOURCE Homo sapiens mitochondrion

----------------
New (rel. 80-)
----------------
Format:  SOURCE [] [()]
Example: SOURCE mitochondrion Homo sapiens (human)

See also '8. DDBJ flat file format'.

------------------
Since release 79
------------------
A new line, DBLINK, has replaced the PROJECT line:
Following the agreement at the INSD collaborative meeting in 2008, the scope
of the project ID has expanded to include projects that are not necessarily
targeted at the sequencing of a complete genome. In addition, there are other
resources such as the Trace Assembly Archive at the NCBI and the like.
Therefore, we have decided to replace the PROJECT line by a new line format,
"DBLINK". The replacement is illustrated in the following;

From the use of the PROJECT line (-release 78);
-------------------------------------------------------------------------------
LOCUS       AP000000             4700000 bp    DNA     circular BCT 27-FEB-2009
DEFINITION  Escherichia coli DDBJ genomic DNA, complete genome.
ACCESSION   AP000000
VERSION     AP000000.1
PROJECT     GenomeProject:99999
KEYWORDS    .
-------------------------------------------------------------------------------

To the DBLINK line format (release 79-);
-------------------------------------------------------------------------------
LOCUS       AP000000             4700000 bp    DNA     circular BCT 27-FEB-2009
DEFINITION  Escherichia coli DDBJ genomic DNA, complete genome.
ACCESSION   AP000000
VERSION     AP000000.1
DBLINK      Project:99999
KEYWORDS    .
-------------------------------------------------------------------------------

------------------
Since release 75
------------------
A new division for assembled mRNA sequences, Transcriptome Shotgun Assembly
(TSA), has been included since release 75. With new sequencing technologies
in use, INSDC has faced many requests to accept assembled EST sequences.
These sequence data have become more useful than they used to be, although
they may not be correctly assembled or may not exist in nature. Therefore,
INSDC decided to collect assembled EST sequences and classify them into the
new division 'TSA'.
TSA sequences are shotgun assemblies of primary sequences deposited in the
EST division of INSDC, Trace Archive (TA) or Short-Read Archive (SRA).
Two specific keywords, "TSA" and "Transcriptome Shotgun Assembly", are
present in all TSA entries. The new division code, "TSA", is also described
in the LOCUS line in all TSA entries.
No format changes in the flat file are anticipated for the TSA division.
Note, however, that TSA entries make use of the same PRIMARY line that is
described for the entries in the TPA category. The PRIMARY block contains
references to the underlying reads/transcripts that are assembled to
construct a TSA record.
Note that a TSA submission requires the sequence data of the primary
transcripts to be submitted to the EST division of INSDC, TA, or SRA.
More information about how to submit a TSA entry is provided via the
following URL;
http://www.ddbj.nig.ac.jp/sub/tsa-e.html

------------------
Since release 73
------------------
Introduction of the sequence data from the Korean Intellectual Property
Office:
The nucleotide sequence data transferred from the Korean Intellectual
Property Office (KIPO) have been included in the DDBJ release.
See also '2. Data categories' and '2.3. Notice for data derived from Patent
Offices'.

------------------
Since release 72
------------------
Deletion of E-mail address, phone and fax numbers from DDBJ flat file:
To comply with the Japanese law on the protection of personal information,
DDBJ deleted phone numbers, fax numbers, and E-mail addresses from the flat
files of the entries submitted to DDBJ. This also helps protect DDBJ
releases against SPAM mail senders.
DDBJ retrofitted almost all entries submitted to DDBJ (not those submitted
to GenBank or EMBL) by DDBJ periodical release 72.

Before release 72, the submitter information was described in the JOURNAL
line at REFERENCE 1 as,
--------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Taro Mishima, DNA Data Bank of Japan, National Institute of
            Genetics; 1111, Yata, Mishima, Shizuoka 411-8540, Japan
            (E-mail:[email protected], URL:http://www.ddbj.nig.ac.jp/,
            Tel:81-12-345-6789, Fax:81-12-345-9876)
--------------------------------------------------------------------------------

After the deletion of the information in question, the DDBJ flat file takes
one of the following two forms;

Type 1: Phone and fax numbers and E-mail address are deleted.
--------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Contact:Taro Mishima
            DNA Data Bank of Japan, National Institute of Genetics; 1111,
            Yata, Mishima, Shizuoka 411-8540, Japan
            URL    :http://www.ddbj.nig.ac.jp/
-------------------------------------------------------------------------------

Type 2: When the submitters wish to keep their contact information
disclosed, it is described as,
-------------------------------------------------------------------------------
REFERENCE   1  (bases 1 to 1200)
  AUTHORS   Mishima,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-Jan-1990) to the DDBJ/EMBL/GenBank databases.
            Contact:Taro Mishima
            DNA Data Bank of Japan, National Institute of Genetics; 1111,
            Yata, Mishima, Shizuoka 411-8540, Japan
            URL    :http://www.ddbj.nig.ac.jp/
            E-mail :[email protected]
            Phone  :81-12-345-6789
            Fax    :81-12-345-9876
-------------------------------------------------------------------------------

------------------
Since release 69
------------------
Introduction of the project ID at PROJECT line in DDBJ flat file:
Following the agreement at the INSD collaborative meeting in 2006, INSDC
has started to assign the project ID for submissions from sequencing
projects.
The project ID is described as follows;
----------------------------------------------------------------------------
A unique identifier, assigned at the time of the submission by a sequencing
project that informed INSDC of the submission beforehand. It is recommended
that the submitter quotes the assigned project ID in all communication with
INSDC databases to allow for easier and faster tracking of issues. The
project ID field provides an umbrella identifier that points to all related
sequence data for the project.
----------------------------------------------------------------------------
The PROJECT line contains the INSDC-assigned ID for the sequencing project.
It appears between the VERSION and KEYWORDS lines in DDBJ flat files from
DDBJ periodical release 69, as shown below.
See also '8. DDBJ flat file format'.
----------------------------------------------------------------------------
ACCESSION   AB012345
VERSION     AB012345.1
PROJECT     GenomeProject:123
KEYWORDS    .
----------------------------------------------------------------------------

Termination of providing the index files for each category:

------------------
Since release 68
------------------
Split of files:
We changed the maximum file size from 300 MB to 1.5 GB, because network
capacity has increased remarkably. Each file named ddbj***##.seq holds at
most 1.5 GB.
See also section '3.1. Files for conventional sequence data'.

------------------
Since release 64
------------------
Split of index files:
In the present release, some of the index files (ddbjacc.idx, ddbjjou.idx,
and ddbjkey.idx) have exceeded 2 GB in size. They have therefore been split
into multiple ddbj****.idx files, each of which holds at most 1.5 GB.
See also 3., 9.2., 9.3. and 9.4.
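
The numbered file names produced by the splits in releases 68 and 64 can be
handled with a short script. The following is only a minimal sketch in
Python, not a DDBJ tool: it assumes that a split file name consists of
"ddbj", a three-letter code, a serial number, and the extension ".seq" or
".idx", as in 'ddbjbct1.seq' and 'ddbjacc1.idx'. The function names are
hypothetical, and unnumbered names such as ddbjacc.idx are rejected.

```python
import re

# Assumed naming pattern (inferred from 'ddbjbct1.seq' and 'ddbjacc1.idx'):
# "ddbj" + three-letter code + serial number + ".seq" or ".idx".
_RELEASE_FILE = re.compile(r"^ddbj([a-z]{3})(\d+)\.(seq|idx)$")


def parse_release_filename(name):
    """Return (code, serial, extension) for a split release file name."""
    match = _RELEASE_FILE.match(name)
    if match is None:
        raise ValueError(f"unrecognized release file name: {name!r}")
    code, serial, ext = match.groups()
    return code, int(serial), ext


def sort_release_files(names):
    """Sort file names by code, then by serial number as an integer."""
    return sorted(names, key=parse_release_filename)
```

Sorting on the parsed serial number rather than on the raw string keeps
ddbjbct2.seq ahead of ddbjbct10.seq, which plain alphabetical sorting would
not.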

------------------
Since release 62
------------------
Release version number is introduced:
DDBJ has started to include a 'version' item in its release note, which
indicates the version of the periodical release. It is expressed like
'62.0', in which the digit(s) after the period are the version number. The
version number was added because released data are sometimes revised when
urgent and necessary corrections are required. The number is increased by
one each time a revised periodical release is made public, until the next
release.

Introduction of ENV division:
Recently, submissions of sequences derived from environmental samples have
increased rapidly. To accommodate such submissions, a new division, ENV,
has been created (see also '2.1. Categories for conventional sequence
data'). This division contains the sequences obtained via direct molecular
isolation such as PCR, DGGE, or any anonymous method. In the past, the
sequences derived from environmental samples belonged to taxonomic
divisions, mainly BCT. At DDBJ, the retrofit to transfer relevant entries
from taxonomic divisions to the ENV division starts in the present release,
and ends by the periodical release 62.

Strand information is removed:
The strand information in the LOCUS line of the flat file has been removed
as shown below. See also '8.1. LOCUS line'.
----------------------------------------------------------------------------
Old (-rel. 61):
  44-44  space
  45-47  spaces, ss- (single-stranded), ds- (double-stranded),
         or ms- (mixed-stranded)
New (rel. 62-):
  44-47  spaces
----------------------------------------------------------------------------

------------------
Since release 61
------------------
The style of release note (this file) has been changed.

Some entries use a range format for the secondary accession numbers in the
ACCESSION line, in order to shorten the list of secondary accession numbers.
For example;
------------------------------------------------------------------------------
Before; ACCESSION   AB000802 D85885 D85886 D85887
After;  ACCESSION   AB000802 D85885-D85887
------------------------------------------------------------------------------
See also '8.3. ACCESSION line'.

------------------
Since release 60
------------------
The cross-reference to the H-Invitational database has been included.

------------------
Since release 56
------------------
The three data banks have agreed that the maximum length limitation (350 kb)
of a submitted sequence be relaxed.

The BASE COUNT line of the DDBJ flat file format has been changed,
corresponding to the relaxation of the maximum sequence length restriction
on entries that had been practiced at the DDBJ/EMBL/GenBank International
Nucleotide Sequence Databases.
In the BASE COUNT line of the DDBJ flat file, 6 digits had been allocated
for each number of a, c, g, t and other bases in the sequence. In the new
flat file format, 9 digits are allocated for each number of a, c, g and t,
while the number of other bases is removed.
In accordance with the relaxation of the sequence length limitation, GenBank
had already dropped the BASE COUNT line from its flat file format as of
GenBank Release 138 (Oct. 2003). DDBJ has decided to keep the BASE COUNT
line in its flat file format, on the view that GC content is still
important information for characterizing a sequence.
The changes in the BASE COUNT line are shown below.
----------------------------------------------------------------------------
Old (-rel. 55):
1    6    11   16   21   26   31   36   41   46   51   56   61   66   71
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
BASE COUNT 123456 a 123456 c 123456 g 123456 t 123456 others

New (rel. 56-):
1    6    11   16   21   26   31   36   41   46   51   56   61   66   71
|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
BASE COUNT 123456789 a 123456789 c 123456789 g 123456789 t
----------------------------------------------------------------------------

------------------
Since release 54
------------------
The '/sequenced_mol' qualifier has been changed to the '/mol_type'
qualifier. We accordingly completed retrofitting the pertinent entries.
This change was based on the agreement at the INSD collaborative meeting in
2002.

------------------
Since release 51
------------------
The format of the LOCUS line in the flat file has been changed as shown
below to match the GenBank format.
------------------------------------------------------------------------------
Old (-rel. 50):
LOCUS       AB000001     660 bp    DNA             PLN       01-FEB-2001
New (rel. 51-):
LOCUS       AB000001     660 bp    DNA     linear  PLN       01-FEB-2001
------------------------------------------------------------------------------

------------------
Since release 45
------------------
The HTC (High Throughput cDNA) division has been included. It holds
unfinished high throughput cDNA sequences, each of which has 5'UTR and 3'UTR
at both ends and part of a coding region. The sequence may also include
introns. When the sequence is finished later, it moves to the corresponding
taxonomic division. The sequence is accompanied by a keyword, HTC (High
Throughput cDNA), which is dropped when the sequence is finished and moved
to a taxonomic division.

------------------
Since release 41
------------------
The CON division has been included. This division shows the order of
related sequences in a genome, and is expressed using 'join' and the
accession numbers of the sequences. The contents of the CON division are
compiled by the three data banks, not by the data submitter.

------------------
Since release 40
------------------
The RNA division was terminated.
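
Because DDBJ keeps the BASE COUNT line for the sake of GC content (see
'Since release 56' above), the following minimal Python sketch shows one
way to read that line and derive the GC fraction. It is an illustration
under stated assumptions, not a DDBJ tool: the function names are
hypothetical, and the line is read as whitespace-separated count/label
pairs rather than by fixed column positions, so both the old layout (with
'others') and the new layout are accepted.

```python
import re


def parse_base_count(line):
    """Return a dict such as {'a': 10, 'c': 20, 'g': 30, 't': 40}.

    Assumption: each count is followed by its label (a, c, g, t, or
    'others' in the pre-release-56 layout), separated by whitespace.
    """
    keyword = "BASE COUNT"
    if not line.startswith(keyword):
        raise ValueError("not a BASE COUNT line")
    pairs = re.findall(r"(\d+)\s+([a-z]+)", line[len(keyword):])
    return {label: int(count) for count, label in pairs}


def gc_content(counts):
    """Return (g + c) / (a + c + g + t), or 0.0 for an empty sequence."""
    total = sum(counts.get(base, 0) for base in "acgt")
    if total == 0:
        return 0.0
    return (counts.get("g", 0) + counts.get("c", 0)) / total
```

For example, a line reading 'BASE COUNT 100 a 200 c 300 g 400 t' yields a
GC fraction of 0.5. Bases counted under 'others' in the old layout are
ignored when the fraction is computed.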

------------------
Since release 37
------------------
The three data banks include the item VERSION in the flat file, which
indicates the version of a submitted nucleotide sequence. It is expressed
like AB123456.1, in which the digit(s) after the period are the version
number. VERSION was added because a released sequence is sometimes revised
by the submitter, so the accession number alone cannot specify the sequence
in question, which causes trouble for the user. The number is increased by
one each time a revised sequence is made public.
Accordingly, the translated protein sequence is accompanied by a
/protein_id, which is expressed like BAA12345.1, in which the digit(s) after
the period are again a version number. The number is increased by one when
the corresponding nucleotide sequence is revised, the protein sequence
changes as a result, and the revised protein sequence is made public.

------------------
Since release 31
------------------
We have started adopting the unified taxonomy database to unify the
biological source of the sequence. The database is made up of scientific
names, IDs of unidentified organisms, synthetic constructs, etc.

------------------
Since release 30
------------------
NID and PID were terminated. This change was based on the agreement at the
INSD collaborative meeting in 1999.

------------------
Since release 28
------------------
The HTG (High Throughput Genomic sequence) division has been included.
We terminated the ORG (Organelle) division.

------------------
Since release 27
------------------
The GSS division has been included. GSS stands for Genome Survey Sequence,
which is similar to EST, except that GSS is genomic DNA whereas EST is cDNA.

------------------
Since release 25
------------------
The DDBJ release contains amino acid sequences that were translated from the
corresponding nucleotide sequences of the database.

------------------
Since release 22
------------------
The HUM division has been included. We have the human (HUM) division solely
for human sequences and the primate (PRI) division for non-human primate
sequences.

------------------
Since release 12
------------------
The EST (Expressed Sequence Tag) division has been included.

------------------
Since release 10
------------------
The sequences submitted to GenBank or EMBL have been included in the
release.
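
As an illustration of the VERSION convention introduced in release 37, the
minimal Python sketch below splits an identifier such as AB123456.1 or
BAA12345.1 into its accession part and its version number, and compares two
versions of the same record. The function names are hypothetical, and the
only format assumed is the one stated above: the digits after the last
period are the version number.

```python
def split_version(identifier):
    """Split 'AB123456.1' into ('AB123456', 1)."""
    accession, sep, version = identifier.rpartition(".")
    if not sep or not accession:
        raise ValueError(f"no version suffix in {identifier!r}")
    if not (version.isascii() and version.isdigit()):
        raise ValueError(f"version is not a number in {identifier!r}")
    return accession, int(version)


def is_newer(candidate, reference):
    """Return True if candidate is a later version of the same accession."""
    cand_acc, cand_ver = split_version(candidate)
    ref_acc, ref_ver = split_version(reference)
    if cand_acc != ref_acc:
        raise ValueError("identifiers refer to different accessions")
    return cand_ver > ref_ver
```

Comparing the versions as integers rather than as strings means that
version 10 is correctly treated as later than version 9.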