Formatdb README ------------------ Table of Contents Introduction Command Line Options Formatdb Notes/Troubleshooting A The -o option and identifiers B "SORTFiles failed" message C Formatting large FASTA files D Piping a database to formatdb without uncompressing E Creating custom databases. F General troubleshooting tips. G "SeqIdParse Failure" error H "FileOpen" error Appendix 1: The Files Produced by Formatdb Introduction ------------ Formatdb must be used in order to format protein or nucleotide source databases before these databases can be searched by blastall, blastpgp or MegaBLAST. The source database may be in either FASTA or ASN.1 format. Although the FASTA format is most often used as input to formatdb, the use of ASN.1 is advantageous for those who are using ASN.1 as the common source for other formats such as the GenBank report. Once a source database file has been formatted by formatdb it is not needed by BLAST. Please note that if you are going to apply periodic updates to your BLAST databases using fmerge (ftp://ncbi.nlm.nih.gov/blast/fmerge/README), you will need to keep the source database file. If you are having problems formatting a BLAST databases please scroll down to the "Formatdb Notes/Troubleshooting" section below. Or contact the BLAST Desk at blast-help@ncbi.nlm.nih.gov Command Line Options -------------------- A list of the command line options for formatdb may be obtained by executing formatdb without options, as in: formatdb - The formatdb options are summarized below: formatdb arguments: -t Title for database file [String] Optional -i Input file for formatting (this parameter must be set) [File In] -l Logfile name: [File Out] Optional default = formatdb.log -p Type of file T - protein F - nucleotide [T/F] Optional default = T -o Parse options T - True: Parse SeqId and create indexes. F - False: Do not parse SeqId. Do not create indexes. [T/F] Optional default = F If the "-o" option is TRUE (and the source database is in FASTA format), then the database identifiers in the FASTA definition line must follow the convention of the FASTA Defline Format. Please see section "F Note on creating custom databases" below. -a Input file is database in ASN.1 format (otherwise FASTA is expected) T - True, F - False. [T/F] Optional default = F -b ASN.1 database in binary mode T - binary, F - text mode. [T/F] Optional default = F A source ASN.1 database may be represented in two formats - ascii text and binary. The "-b" option, if TRUE, specifies that input ASN.1 database is in binary format. The option is ignored in case of FASTA input database. -e Input is a Seq-entry [T/F] Optional default = F A source ASN.1 database (either text ascii or binary) may contain a Bioseq-set or just one Bioseq. In the latter case the "-e" switch should be set to TRUE. -n Base name for BLAST files [String] Optional This options allows one to produce BLAST databases with a different name than that of the original FASTA file. For instance, one could have a file named 'ecoli.nuc.txt' and and format it as 'ecoli': formatdb -i ecoli.nuc.txt -p F -o T -n ecoli uncompress -c nr.z | formatdb -i stdin -o T -n nr This can be used in situations where the original FASTA file is not required other than by formatdb. This can help in a situation where disk-space is tight. -v Number of sequence bases to be created in the volume [Integer] Optional default = 0 This option breaks up large FASTA files into 'volumes' (each with a maximum size of 2 billion letters). As part of the creation of a volume formatdb writes a new type of BLAST database file, called an alias file, with the extension 'nal' or 'pal'. -s Create indexes limited only to accessions - sparse [T/F] Optional default = F This option limits the indices for the string identifiers (used by formatdb) to accessions (i.e., no locus names). This is especially useful for sequences sets like the EST's where the accession and locus names are identical. Formatdb runs faster and produces smaller temporary files if this option is used. It is strongly recommended for EST's, STS's, GSS's, and HTGS's. -A Create ASN.1 structured deflines [T/F] Optional default = F This option produces a new BLAST database format. The BLAST programs can read this format as well as the current format. The program automatically identifies which version it should work with. This new format stores the sequence definition lines in a structured manner (as ASN.1), this will allow future versions of BLAST to better present taxonomic information as well as information about other resources (e.g., UniGene, LocusLink) for a database sequence. -L Create an alias file with this name use the gifile arg (below) if set to calculate db size use the BLAST db specified with -i (above) [File Out] Optional This option produces a BLAST database alias file using a specified database, but limiting the sequences searched to those in the GI list given by the -F argument. See the section "Note on creating an alias file for a GI list" for more information. -F Gifile (file containing list of gi's) [File In] Optional This option can be used to specify the GI list for the alias file construction (-L option above) or to produce a binary GI list if the -B option (below) is set. -B Binary Gifile produced from the Gifile specified above [File Out] Optional This option specifies the name of a binary GI list file. This option should be used with the -F option. A text GI list may be specified with the -F option and the -B option will produce that GI list in binary format. The binary file is smaller and BLAST does not need to convert it, so it can be read faster. Formatdb Notes/Troubleshooting: ------------------------------- A.) Note on -o option: It is always advantageous to use the '-o' option if the database identifiers are in the format specified at ftp://ncbi.nlm.nih.gov/blast/db/README. If the database identifiers are in this parseable format, formatdb produces additional indices allowing retrieval from the databases by identifier. The databases on the NCBI FTP site contain parseable identifiers. It is sufficient if the first word on the FASTA definition line is a unique identifier (e.g., ">3091 Alcoho de..."). It is necessary to use parseable identifiers for the following cases: 1.) ASN.1 is to be produced from blastall or blastpgp, then "-o" must be TRUE. 2.) query-anchored alignments are desired (i.e., the '-m' option with a non-zero value is used). 3.) The gi's are desired as part of the output (i.e., '-I' is used). 4.) fastacmd will be used to fetch sequences from the database by accession or gi. See Appendix 1: The Files Produced by Formatdb for more information in the -o T option. B.) Note on "SORTFiles failed" message: Formatdb will use the 'standard' temporary directory to sort the string indices on disk. Under UNIX this directory is often /var/tmp and if there is not enough space there, then the error message: "ERROR: [000.000] SORTFiles failed" will be issued. This can be avoided by setting the TMPDIR environment variable to a partition with more free space. This message may also often be avoided by using the sparse option (-s) for formatdb described above. C.) Note on formatting large (4 Gig and larger) FASTA files: A single BLAST database can contain up to 4 billion letters. If one wishes to formatdb a FASTA file containing more letters than this, several databases, each of a maximum size of 4 billion bases, will be produced. This will be done automatically if the -v argument is not set. One may also specify a smaller size for the volume databases by using the -v option: formatdb -i hugefasta -p F -v 2000000000 This command line will format the "hugefasta" FASTA file as a number of database "volumes," each containing a maximum of two billion base pairs, as specified by the "-v" option. Two billion is the current limitation on the NCBI toolkit command-line parser. The volumes will have names consisting of the root database name, "hugefasta" followed by a two-digit volume extension, followed by the usual BLAST database extensions. These smaller databases can be searched as if they were a single entity using: blastall -i infile -d hugefasta -p blastn -o out In this case, BLAST recognizes that the database "hugefasta" has been partitioned into several volumes because it detects a file with the name of the root database followed by an extension of "nal" (for protein databases, the extension is "pal"). This file specifies a database list to be searched when the root database name is specified to BLAST. BLAST sequentially searches each database listed in this "nal" file and generates output that is indistinguishable from that of a single database search. A sample "nal" file, resulting from formatting the datafile "hugefasta" into three volumes, is given below. The "DBLIST" line can also be edited to specify additional databases to be searched. # # Alias file created Tue Jan 18 13:12:24 2000 # # TITLE hugefasta # DBLIST hugefasta.00 hugefasta.01 hugefasta.02 # #GILIST # #OIDLIST # The "nal" and "pal" files can also be used to simplify searches of multiple databases created separately. For instance, a file called "multi.nal" containing the following lines could be created from scratch using a text editor. # # Alias file created Tue Jan 18 13:12:24 2000 # # TITLE multi # DBLIST part1 part2 part3 # #GILIST # #OIDLIST # The "multi.nal" file would allow the three databases, "part1", "part2", and "part3", to be searched by specifying a single database name, "multi", on the blastall command line as follows: blastall -i infile -d multi -p blastn -o out The reason for using database volumes, as opposed to simply making the indices in the BLAST databases large enough to handle all conceivable databases with an eight-byte 'integer', is that this would have doubled the size of the indices for all searches no matter how small the database. Hence very large FASTA files are broken down into a couple of databases. Formatdb must be able to open files larger than 2 Gig in order to work on very large files. This is not a problem on a 64-bit OS and on certain 32-bit OS that allows binaries to be made large-file aware. The 32-bit Solaris formatdb binary on the NCBI FTP site is now compiled large file aware. D.) Note on running formatdb on a database without uncompressing it: Under UNIX it is possible to uncompress a database on the fly and pipe it to formatdb. This can reduce the disk-space needed for running formatdb on a large database. In addition, some operating systems cannot write files larger than 2 Gig to disk. To circumvent this on Unix or Linux systems, use a "pipe" system such as: uncompress -c nt.Z|formatdb -i stdin -o T -p T -n "nt" -v 100000000 In this case, no file is written which is larger than 1 Gig and an arbitrarily large database is formatted as a set of 1 Gig volumes. Note the use of the '-n' option that specifies the name of the resulting BLAST database. Note also that 'stdin' specifies that input will be coming from 'standard input'. The nt FASTA file is not needed for running BLAST searches and nt.Z may be deleted after formatdb has been run. E) Note on creating custom databases With Standalone BLAST it is possible to take any custom file of FASTA sequences and use these as a database source file for searching. All BLAST database source files must be in FASTA format. In order to use the formatdb option -o T, especially for use with NCBI tool kit retrieval tools the FASTA defline must follow a specific format. F) Note on creating an alias file for a GI list. Formatdb can now produce a BLAST database alias file that specifies a (real) BLAST database to search as well as a GI list to limit the search. This is useful if one often searches a subset of a database (e.g., based on organism or a curated list). The alias file makes the search appear as if one were searching a real database rather than the subset of one. The procedure to produce an alias file for searching (protein) nr limiting it to a list of zebrafish GI's would be: 1.) obtain the list of zebrafish GI's from Entrez or some other source and keep it in a file called "zebrafish.gi.in". 2.) invoke formatdb to convert the text GI list to the more efficient binary format: formatdb -F zebrafish.gi.in -B zebrafish.gi 3.) invoke formatdb with the following options: formatdb -i nr -p T -L zebrafish -F zebrafish.gi -t "My zebrafish database" This will produce the alias file zebrafish.pal listing the database title, the real database to be searched, the GI file, and some statistics: # # Alias file created Thu Jul 5 15:04:29 2001 # # TITLE My zebrafish database # DBLIST nr # GILIST zebrafish.gi # #OIDLIST # NSEQ 1836 LENGTH 640724 One can search this by invoking (for example): blastall -p blastp -d zebrafish -i MYQUERY -o MYOUTPUT FASTA Defline Format -------------------- The syntax of FASTA Deflines used by the NCBI BLAST server depends on the database from which each sequence was obtained. The table below lists the identifiers for the databases from which the sequences were derived. Database Name Identifier Syntax GenBank gb|accession|locus EMBL Data Library emb|accession|locus DDBJ, DNA Database of Japan dbj|accession|locus NBRF PIR pir||entry Protein Research Foundation prf||name SWISS-PROT sp|accession|entry name Brookhaven Protein Data Bank pdb|entry|chain Patents pat|country|number GenInfo Backbone Id bbs|number General database identifier gnl|database|identifier NCBI Reference Sequence ref|accession|locus Local Sequence identifier lcl|identifier For example, an identifier might be "gb|M73307|AGMA13GT", where the "gb" tag indicates that the identifier refers to a GenBank sequence, "M73307" is its GenBank ACCESSION, and "AGMA13GT" is the GenBank LOCUS. "gi" identifiers are being assigned by NCBI for all sequences contained within NCBI's sequence databases. The 'gi' identifier provides a uniform and stable naming convention whereby a specific sequence is assigned its unique gi identifier. If a nucleotide or protein sequence changes, however, a new gi identifier is assigned, even if the accession number of the record remains unchanged. Thus gi identifiers provide a mechanism for identifying the exact sequence that was used or retrieved in a given search. For searches of the nr protein database where the sequences are derived from conceptual translations of sequences from the nucleotide databases the following syntax is used: gi|gi_identifier An example would be: gi|451623 (U04987) env [Simian immunodeficiency..." where '451623' is the gi identifier and the 'U04987' is the accession number of the nucleotide sequence from which it was derived. Users are encouraged to use the '-I' option for Blast output which will produce a header line with the gi identifier concatenated with the database identifier of the database from which it was derived, for example, from a nucleotide database: gi|176485|gb|M73307|AGMA13GT And similarly for protein databases: gi|129295|sp|P01013|OVAX_CHICK The gnl ('general') identifier allows databases not on the above list to be identified with the same syntax. An example here is the PID identifier: gnl|PID|e1632 PID stands for Protein-ID; the "e" (in e1632) indicates that this ID was issued by EMBL. As mentioned above, use of the "-I" option produces the NCBI gi (in addition to the PID) which users can also retrieve on. The bar ("|") separates different fields as listed in the above table. In some cases a field is left empty, even though the original specification called for including this field. To make these identifiers backwards-compatiable for older parsers the empty field is denoted by an additional bar ("||"). BLAST Databases without GI's or GenBank accessions -------------------------------------------------- Some BLAST users wish to format a FASTA file of sequences that do not contain NCBI ID's such as accessions or GI numbers. This may be the case if the sequences have not yet been submitted to GenBank or are proprietary. If the only goal is to perform BLAST searches and format a standard BLAST report then the simplest solution is to not set the "-o" option when running formatdb and indices for the identifiers will not be constructed. If one wishes to produce XML or ASN.1 output or wants to fetch sequences by ID with fastacmd, it is necessary to observe a few simple rules when constructing the ID's. These rules are necessary to ensure that the ID's can be reliably parsed to make the ID indices. 1.) ID's of type "local" or "general" should be used. This means that the ID's will have the syntax "lcl|IDENTIFIER" (for "local") or "gnl|DATABASE|IDENTIFIER" (for "general"). The tokens DATABASE and IDENTIFIER should be assigned by the user here. The local ID has only one user provided token, the general ID requires two. The fields are separated by vertical bars ("|").. 2.) Letters, numbers, underscores ("_"), dashes, and periods may be used. Uppercase and lowercase letters are treated as being distinct. No spaces are allowed in the ID, this indicates the end of the ID. 3.) All ID's should be unique, if the entire ID is examined. As an example consider the following four ID's: gnl|H.sapiens|seq1 gnl|H.sapiens|seq2 gnl|M.Mus|seq1 lcl|seq1 All of these ID's are considered unique. The first two might be sequences one and two of a collection of Human sequences; the fourth might be the first sequence in a collection of mouse sequences; the fourth is simply identified as the first sequence. ID's must fit into the framework described above to ensure that they can be reliably parsed and indices built for them. This means that it is not possible for users to invent new ID formats on the fly. Examples of illegal ID's would be: H.sapiens|seq1 gnl|H.sapiens|seq1|A The first identifier is missing a database tag (e.g., no "lcl"), the second identifier has an extra field since vertical bars separate fields. Illegal ID's will not be processed by formatdb if the "-o" option is used. F. General troubleshooting tips. The Latest Version: Make sure you are using the latest version of the formatdb executable. Earlier versions of formatdb may not recognize changes in the ASN.1 or FASTA definition line format of current BLAST databases or other sources of NCBI sequences. G. Troubleshooting "SeqIdParse Failure" errors The most frequent cause of SeqIdParse Failure errors come from the syntax of the FASTA definition lines in the source database file. Many third parties do not follow the syntax in section F. If you are getting a SeqIdParse error this can often be eliminated by formatting the database with the -o option set to FALSE (e.g. -o F). The -o option is really not important for BLAST searching unless you are going to use the results to parse out the identifiers for searching Entrez and downloading the sequences. If you need to use the -o T option then your best option is to examine the definition lines of the database sequences and attempt to make them conform the FASTA syntax. H. Troubleshooting "FileOpen" errors. This is caused when the formatdb program can not find the /data subdirectory. When you download and extract the Standalone BLAST executables, the formatdb program is located in the same directory as the /data subdirectory. If either if these move, formatdb will not function without a .ncbirc file telling it where the /data subdirectory resides. Create a text file in the same directory as formatdb that contains the following lines: [NCBI] Data="path/data/" Where "path/data/" is the path to the location of the Standalone BLAST "data" subdirectory. For Example: Data=/root/blast/data For PC's this would be [NCBI] Data="C:\path\data\" Where "C:path\data\" is the path to the location of the Standalone BLAST "data" subdirectory. For example: Data=C:\blast\data For Macintosh this would be a simpletext file called "ncbi.cnf", placed in System folder, that contains: [BLAST] BLASTDB=root:blast:data Where "root:blast:data" is the path to the location of the Standalone BLAST "data" subdirectory. For example on the machine names "LabMac": BLASTDB=LabMac:blast:data Appendix 1: The Files Produced by Formatdb ------------------------------------------ Using formatdb without the "-o T" indexing option results in three BLAST database files (.nhr, .nin, ,nsq). Using the "-o T" option will result in additional files. If gi's are present in the FASTA definition lines of the source file, there will be four additional files created (.nsd, nsi, nni, nnd). These are ISAM indices for mapping a sequence identifier to a particular sequence in the BLAST database If gi's are not use there will be only two additional files created (.nsd, .nsi). These files are listed below for both an example nucleotide and a protein sequence database. The actual sequence data is stored in the files with extension "nsq" or "psq". The compression ratio for these files is about 4:1 for nucleotides and 1:1 for protein sequence source files. Extension Content Format --------------------------------------------- Nucleotide database formatted without "-o T" nhr deflines binary nin ISAM binary nsq ISAM binary Nucleotide database formatted with "-o T" add these: nsd parsed binary deflines nsi gis binary nni ISAM binary nnd ISAM binary The formatdb index files involving deflines are small relative to the source database due to entries such as the one below in which the defline is much shorter than the sequence. >gi|5819095|ref|NC_001321.1| Balaenoptera physalus mitochondrion, complete genome GTTAATTACTAATCAGCCCATGATCATAACATAACTGAGGTTTCATACATTTGGTATTTTTTTATTTTTTTTGGGGGGCT TGCACGGACTCCCCTATGACCCTAAAGGGTCTCGTCGCAGTCAGATAAATTGTAGCTGGGCCTGGATGTATTTGTTATTT GACTAGCACAACCAACATGTGCAGTTAAATTAATGGTTACAGGACATAGTACTCCACTATTCCCCCCGGGCTCAAAAAAC TGTATGTCTTAGAGGACCAAACCCCCCTCCTTCCATACAATACTAACCCTCTGCTTAGATATTCACCACCCCCCTAGACA GGCTCGTCCCTAGATTTAAAAGCCATTTTATTTATAAATCAATACTAAATCTGACACAAGCCCAATAATGAAAATACATG AACGCCATCCCTATCCAATACGTTGATGTAGCTTAAACACTTACAAAGCAAGACACTGAAAATGTCTAGATGGGTCTAGC CAACCCCATTGACATTAAAGGTTTGGTCCCAGCCTTTCTATTAGTTCTTAACAGACTTACACATGCAAGTATCCACATCC CAGTGAGAACGCCCTCTAAATCATAAAGATTAAAAGGAGCGGGTATCAAGCACGCTAGCACTAGCAGCTCACAACGCCTC GCTTAGCCACGCCCCCACGGGACACAGCAGTGATAAAAATTAAGCTATAAACGAAAGTTCGACTAAGTCATGTTAATTTA ....16398 bp total Extension Content Format ------------------------------------------ Protein database formatted without "-o T phr deflines binary pin ISAM binary psq ISAM binary Protein database formatted with "-o T" add these: pnd ISAM binary pni ISAM binary psd parsed binary psi gis binary The formatdb index files involving deflines are large relative to the source database due to entries such as the one below in which the defline is much longer than the sequence. >gi|229659|pdb|1AAP|A Chain A, Protease Inhibitor Domain Of Alzheimer's Amyloid Beta-Protein Precursor (APPI)gi|229660|pdb|1AAP|B Chain B, Protease Inhibitor Domain Of Alzheimer's Amyloid Beta-Protein Precursor (APPI) VREVCSEQAETGPCRAMISRWYFDVTEGKCAPFFYGGCGGNRNNFDTEEYCMAVCGSA DISCLAIMER: The internal structure of the BLAST databases is subject to change with little or no notice. The readdb API should be used to extract data from the BLAST databases. Readdb is part of the the NCBI toolkit (ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools/), readdb.h contains a list of supported function calls. Last updated April 02 2001