README for stand-alone BLAST

(last updated 7/5/2001)

 

 

This document provides information on stand-alone BLAST. Topics covered are

setting up stand-alone BLAST, command-line options for stand-alone BLAST,

and a release history of the different versions.

BLAST binaries are provided for IRIX6.2, Solaris2.6 (Sparc), Solaris2.7 (Intel),

DEC OSF1 (ver. 5.1), LINUX/Intel, HPUX, Macintosh, and Win32 systems.

We will attempt to produce binaries for other platforms upon request.

Stand-alone binaries are available from ftp://ncbi.nlm.nih.gov/blast/executables/

Please remember to FTP in binary mode.

 

Setting up Standalone BLAST for UNIX:

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST

executable for the UNIX platform.

1) Download the UNIX binary, uncompress and untar the file. It is

suggested that you do this in a separate directory, perhaps called

"blast".

2) Create a .ncbirc file. In order for Standalone BLAST to operate, you

will need a .ncbirc file that contains the following lines:

[NCBI]

Data="path/data/"

Where "path/data/" is the path to the location of the Standalone BLAST

"data" subdirectory. For example:

Data=/root/blast/data

The data subdirectory should automatically appear in the directory where

the downloaded file was extracted. Please note that in many cases it may

be necessary to give the full path, including the machine name and/or

the network you are located on. Your system administrator can help

you if you do not know the full path to the data subdirectory.

Make sure that your .ncbirc file is either in the directory that you

call the Standalone BLAST program from or in your home directory.

3) Format your BLAST database files. The main advantage of Standalone

BLAST is to be able to create your own BLAST databases. This can be done

with any file of FASTA formatted protein or nucleotide sequences. If you

are interested in creating your own database files you should refer to

the sections "Non-redundant defline syntax" and "Appendix 1: Sequence

Identifier Syntax" of the README in the BLAST database directory

(ftp://ncbi.nlm.nih.gov/blast/db/). You can also refer to the FASTA

description available from the BLAST search pages

(http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should download one of the NCBI

databases and run a search against it.

In the BLAST database FTP directory (ftp://ncbi.nlm.nih.gov/blast/db/)

you will find the downloadable BLAST database files. For your first

search we recommend downloading something relatively small like

ecoli.nt.Z (1349 Kb). This is a FASTA formatted file of nucleotide

sequences which is also compressed. Once uncompressed, you will need to

format the database using the 'formatdb' program which comes with your

Standalone BLAST executable. The list of arguments for this program and

all other BLAST programs is located at the end of the README in the

Standalone BLAST FTP directory (ftp://ncbi.nlm.nih.gov/blast/executable/). Or

you can get these arguments by running each of the BLAST programs (formatdb,

blastall etc.) with a single hyphen as the argument (Example: formatdb -). For

this document we are just going to show you the basic commands for formatting

the database and running your first search.

To format the ecoli.nt database run the following from the command

line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to

perform the searches and produce results. The ecoli.nt file is not

needed after formatdb has run, so you can delete it.

Next create a test nucleotide file to run against the new database. It

may be easier to 'cheat' here and just extract a portion of a

nucleotide sequence you know is in the downloaded ecoli.nt database.

Make a text file called test.txt with the following sequence:

>Test

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA

TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC

ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG

CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA

GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC

AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG

AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

To run the first search enter the following command from the UNIX

command line in your BLAST directory:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone

BLAST directory.

Now you are ready to create your own databases and run BLAST searches.

For more information you should refer to the Standalone BLAST README

(ftp://ncbi.nlm.nih.gov/blast/executable/) and the BLAST literature.

This will give you some idea of all the programs BLAST supports and the

use of different parameters for increasing or decreasing the stringency

of your results.

If you have any questions please send them to the

blast-help@ncbi.nlm.nih.gov e-mail address.
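
The UNIX setup steps above can be sketched in Python as follows. This is
only an illustration: the helper names (write_ncbirc, setup_commands) and
the temporary directory are assumptions, while the formatdb and blastall
arguments are copied from the commands shown above. The commands are only
assembled here, not executed.

```python
import tempfile
from pathlib import Path


def write_ncbirc(directory, data_dir):
    """Write a minimal .ncbirc pointing at the BLAST "data" subdirectory."""
    rc_path = Path(directory) / ".ncbirc"
    rc_path.write_text(f"[NCBI]\nData={data_dir}\n")
    return rc_path


def setup_commands(fasta="ecoli.nt", query="test.txt", output="test.out"):
    """Return the formatdb and blastall argument lists used in this section."""
    formatdb = ["formatdb", "-i", fasta, "-p", "F", "-o", "T"]
    blastall = ["blastall", "-p", "blastn", "-d", fasta,
                "-i", query, "-o", output]
    return formatdb, blastall


if __name__ == "__main__":
    # A temporary directory stands in for the "blast" directory (assumption).
    with tempfile.TemporaryDirectory() as blast_dir:
        rc = write_ncbirc(blast_dir, f"{blast_dir}/data")
        print(rc.read_text())
        for command in setup_commands():
            print(" ".join(command))
```

Once the BLAST binaries are installed, each list could be passed to
subprocess.run(command, check=True) from the BLAST directory.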

 

Setting up Standalone BLAST for Windows

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

There are three steps needed to set up the Standalone BLAST

executable.

1) Download and extract the Standalone BLAST Windows binary

blastcz.exe. We suggest doing this in its own directory, perhaps called

blast. This is a 'self-extracting' archive and all you need to do is run

it either through a Command Prompt (DOS Prompt) or by selecting "Run"

from the Windows "Start" button and browsing to the blastcz.exe file.

2) Create an ncbi.ini file. In order for Standalone BLAST to operate,

you will need an ncbi.ini file that contains the following

lines:

[NCBI]

Data="C:\path\data\"

Where "C:\path\data\" is the path to the location of the Standalone

BLAST "data" subdirectory. For example:

Data=C:\blast\data

This data subdirectory should automatically appear in the directory

where the downloaded file was extracted.

Make sure that your ncbi.ini file is in the Windows or WINNT directory

on your machine. Note: If you already have an ncbi.ini file on your

machine from installing other NCBI software (Network Entrez, Sequin, etc.)

you can skip this step. However, if you see the following error

message, you should rename the old ncbi.ini file to something like

ncbi.bak and follow the instructions in step 2 above.

Abrupt: code=1

FATAL ERROR: FindPath failed.

3) Format your BLAST database files. The main advantage of Standalone

BLAST is to be able to create your

own BLAST databases. This can be done with any file of FASTA formatted

protein or nucleotide sequences. If you are interested in creating your

own database you should refer to the sections "Non-redundant defline

syntax" and "Appendix 1: Sequence Identifier Syntax" of the README in

the BLAST database directory (ftp://ncbi.nlm.nih.gov/blast/db/). You can

also refer to the FASTA description available from the BLAST search

pages (http://www.ncbi.nlm.nih.gov/BLAST/fasta.html).

However, for testing purposes you should download one of the NCBI

databases and run a search against it.

In the BLAST database FTP directory (ftp://ncbi.nlm.nih.gov/blast/db/)

you will find the downloadable BLAST database files. For your first

search we recommend downloading something relatively small like

ecoli.nt.Z (1349 Kb). This is a FASTA formatted file of nucleotide

sequences which is also compressed. (If you do not have a copy of UNIX

"uncompress" for your Windows PC contact NCBI Info at

info@ncbi.nlm.nih.gov).

Once uncompressed, you will now need to format the database using the

'formatdb' program which comes with your Standalone BLAST executable.

The list of arguments for this program and all other BLAST programs is

located at the end of the README in the Standalone BLAST FTP directory

(ftp://ncbi.nlm.nih.gov/blast/executable/). Or you can get these

arguments by running each of the BLAST programs (formatdb, blastall

etc.) with a single hyphen as the argument (Example: formatdb -). For

this document we are just going to show you the basic commands for

formatting the database and running your first search.

To format the ecoli.nt database run the following from the command

line:

formatdb -i ecoli.nt -p F -o T

This will create seven index files that Standalone BLAST needs to

perform the searches and produce results. The ecoli.nt file can be

removed once formatdb has been run.

Next create a test nucleotide file to run against the new database. It

may be easier to 'cheat' here and just extract a portion of a

nucleotide sequence you know is in the downloaded ecoli.nt database.

So make a text file called test.txt with the following sequence:

>Test

AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC

TTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAA

TATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTACACAACATCCATGAAACGCATTAGCACCACC

ATTACCACCACCATCACCATTACCACAGGTAACGGTGCGGGCTGACGCGTACAGGAAACACAGAAAAAAG

CCCGCACCTGACAGTGCGGGCTTTTTTTTTCGACCAAAGGTAACGAGGTAACAACCATGCGAGTGTTGAA

GTTCGGCGGTACATCAGTGGCAAATGCAGAACGTTTTCTGCGTGTTGCCGATATTCTGGAAAGCAATGCC

AGGCAGGGGCAGGTGGCCACCGTCCTCTCTGCCCCCGCCAAAATCACCAACCACCTGGTGGCGATGATTG

AAAAAACCATTAGCGGCCAGGATGCTTTACCCAATATCAGCGATGCCGAACGTATTTTTGCCGAACTTTT

To run the first search enter the following command:

blastall -p blastn -d ecoli.nt -i test.txt -o test.out

This should generate a results file called test.out in the Standalone

BLAST directory. Now you are ready to create your own databases and run

BLAST searches. For more information you should refer to the Standalone

BLAST README (ftp://ncbi.nlm.nih.gov/blast/executable/) and the BLAST

literature. This will give you some idea of all the programs BLAST

supports and the use of different parameters for increasing or

decreasing the stringency of your results.

If you have any questions please send them to the

blast-help@ncbi.nlm.nih.gov e-mail address.
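
When the "FindPath failed" error shown above appears, a quick check of the
ncbi.ini file can help narrow down the cause. The following is a minimal
sketch, not part of BLAST: the function names are hypothetical, and it
assumes the file is a plain INI file with a [NCBI] section and a Data
entry, as described in step 2.

```python
import configparser
from pathlib import Path


def read_data_dir(ini_path):
    """Return the Data path from the [NCBI] section, or None if it is missing."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read(ini_path)
    if not parser.has_option("NCBI", "Data"):
        return None
    # The template in step 2 shows the path in quotation marks; strip them.
    return parser.get("NCBI", "Data").strip('"')


def check_ini(ini_path):
    """Describe the most likely configuration problem, or return "ok"."""
    if not Path(ini_path).is_file():
        return "ncbi.ini not found"
    data_dir = read_data_dir(ini_path)
    if data_dir is None:
        return "no Data entry in the [NCBI] section"
    if not Path(data_dir).is_dir():
        return f"Data directory does not exist: {data_dir}"
    return "ok"
```

For example, check_ini(r"C:\WINNT\ncbi.ini") would report whether the Data
entry points at an existing directory; the path used here is only an example.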

 

SGI Note:

---------

SGI recommends the following threads patches on IRIX6 systems:

For 6.2 systems, install SG0001404, SG0001645, SG0002000, SG0002420 and SG0002458 (in that order)

For 6.3 systems, install SG0001645, SG0002420 and SG0002458 (in that order)

For 6.4 systems, install SG0002194, SG0002420 and SG0002458 (in that order)

These patches can be obtained by calling SGI customer service or from the web: http://support.sgi.com/

System recommendations:

----------------------

BLAST uses memory-mapped files (on UNIX and NT systems), so it runs best if

it can read the entire BLAST database into memory, then keep on using it

there. Resources consumed reading a database into memory can easily

outweigh the cost of a BLAST search, so that the memory of a machine is

normally more important than the CPU speed. This means that one should have

sufficient memory for the largest BLAST database one will use, then run all

the searches against this database in serial, then run queries against

another database in serial. This guarantees that the database will be read

into memory only once. As of Aug. 1997 the EST FASTA file is about 500 Meg,

which translates to about 170-200 Meg of BLAST database. At least another

100-200 Meg should be allowed for memory consumed by the actual BLAST

program. All of the FASTA databases together are about 1.5 Gig; the BLAST

databases produced from this will probably be about another Gig or so. 4 Gig

of disk space, to make room for software and output, is probably a pretty

good bet.
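
The advice above, to run all searches against one database before moving
on to the next, amounts to grouping the pending jobs by database. A small
sketch of that ordering follows; the job list and file names are invented
for illustration, and the function is not part of BLAST.

```python
from itertools import groupby
from operator import itemgetter


def order_by_database(jobs):
    """Group (query_file, database) jobs so each database is searched in one run."""
    ordered = sorted(jobs, key=itemgetter(1))  # stable sort keeps query order
    return [(db, [query for query, _ in group])
            for db, group in groupby(ordered, key=itemgetter(1))]


# Hypothetical jobs: queries arrive interleaved across two databases.
jobs = [("q1.fa", "est"), ("q2.fa", "nr"), ("q3.fa", "est"), ("q4.fa", "nr")]
for database, queries in order_by_database(jobs):
    print(database, queries)
```

Running the queries in this order means each database only needs to be read
into memory once, provided the machine has enough memory to hold it.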

 

BLAST OPTIONS

-------------

Formatdb

--------

There is now a separate document describing formatdb (README.formatdb). Please

refer to it for information on formatting FASTA files for BLAST searches.

 

Blastall

--------

Blastall may be used to perform all five flavors of blast comparison. One

may obtain the blastall options by executing 'blastall -' (note the dash). A

typical use of blastall, to perform a blastn search (nucl. vs. nucl.)

with a query file called QUERY, would be:

blastall -p blastn -d nr -i QUERY -o out.QUERY

The output is placed into the output file out.QUERY and the search is performed

against the 'nr' database. If a protein vs. protein search is desired,

then 'blastn' should be replaced with 'blastp' etc.

Some of the most commonly used blastall options are:

blastall arguments:

-p Program Name [String]

Input should be one of "blastp", "blastn", "blastx", "tblastn", or "tblastx".

-d Database [String]

default = nr

The database specified must first be formatted with formatdb.

Multiple database names (enclosed in quotation marks) will be accepted.

An example would be

-d "nr est"

which will search both the nr and est databases, presenting the results as if one

'virtual' database consisting of all the entries from both were searched. The

statistics are based on the 'virtual' database of nr and est.

-i Query File [File In]

default = stdin

The query should be in FASTA format. If multiple FASTA entries are in the input

file, all queries will be searched.

-e Expectation value (E) [Real]

default = 10.0

-o BLAST report Output File [File Out] Optional

default = stdout

-F Filter query sequence (DUST with blastn, SEG with others) [String]

default = T

BLAST 2.0 and 2.1 use the dust low-complexity filter for blastn and seg for the

other programs. Both 'dust' and 'seg' are integral parts of the NCBI toolkit

and are accessed automatically.

If one uses "-F T" then normal filtering by seg or dust (for blastn)

occurs (likewise "-F F" means no filtering whatsoever).

This option also takes a string as an argument. One may use such a

string to change the specific parameters of seg or invoke other filters.

Please see the "Filtering Strings" section (below) for details.

-S Query strands to search against database (for blast[nx], and tblastx). 3 is both, 1 is top, 2 is bottom [Integer]

default = 3

-T Produce HTML output [T/F]

default = F

-l Restrict search of database to list of GI's [String] Optional

This option specifies that only a subset of the database should be

searched, determined by the list of gi's (i.e., NCBI identifiers) in a

file. One can obtain a list of gi's for a given Entrez query from

http://www.ncbi.nlm.nih.gov/Entrez/batch.html. This file should

be in the same directory as the database, or in the directory that

BLAST is called from.

-U Use lower case filtering of FASTA sequence [T/F] Optional

default = F

This option specifies that any lower-case letters in the input FASTA file

should be masked.
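
For scripted use, the options listed above can be assembled into an
argument list before calling blastall. The helper below is a sketch under
that assumption; the function name and its keyword arguments are
hypothetical, and only options and defaults described in this section are
used.

```python
BLASTALL_PROGRAMS = ("blastp", "blastn", "blastx", "tblastn", "tblastx")


def blastall_args(program, query, database="nr", evalue=10.0,
                  output=None, filter_query="T"):
    """Assemble a blastall argument list from the options described above."""
    if program not in BLASTALL_PROGRAMS:
        raise ValueError(f"unknown program: {program!r}")
    args = ["blastall", "-p", program, "-d", database, "-i", query,
            "-e", str(evalue), "-F", filter_query]
    if output is not None:  # without -o the report goes to stdout
        args += ["-o", output]
    return args


print(" ".join(blastall_args("blastn", "QUERY", output="out.QUERY")))
```

When the list is passed to subprocess.run, a multiple-database value such as
"nr est" stays a single argument, so no extra shell quoting is needed.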

 

Documentation for PSI-TBLASTN

-----------------------------

PSI-TBLASTN is a variant of blastall that searches a protein query

sequence against a nucleotide sequence database using a position

specific matrix created by PSI-BLAST. The nucleotide sequence database

is dynamically translated in all reading frames during a PSI-TBLASTN

search. Using a position specific matrix may enable finding more

distantly related sequences.

Programs:

blastpgp [takes a protein query and performs a PSI-BLAST search to

create a position specific matrix using a protein

database]

blastall [reads the position specific matrix and performs the PSI-TBLASTN

search]

Usage:

A user would typically run blastpgp to create and save a position

specific matrix, followed by a run of blastall for the PSI-TBLASTN search.

blastpgp must be executed with the -C option followed by a file name to

save the position specific score matrix.

blastall with the "-p psitblastn" option executes the PSI-TBLASTN search; the

-R option must be followed by the name of the file that contains the

position specific score matrix. All other options that apply when

using "blastall -p tblastn ..." also apply when using "blastall -p

psitblastn ...", but there are some restrictions to parameters: 1) The

query must be the same as the one used in blastpgp for creating a

position specific matrix. 2) By default, blastpgp has filtering off

(-F F) and blastall has filtering on (-F T). To ensure consistent

usage of the blastpgp/psitblastn combination, the -F option should be

explicitly set in one or the other run.

 

Example:

One may run PSI-BLAST to create and save a position specific score matrix

as follows:

blastpgp -d nr -i ff.chd -j 2 -C ff.chd.ckp

The position specific score matrix is saved in ff.chd.ckp. Then, using

this matrix, one may run a PSI-TBLASTN search:

blastall -i ff.chd -d yeast -p psitblastn -R ff.chd.ckp

Note that this allows the score matrix to be constructed using one

database (nr in the example) and then used to search a second database

(yeast in the example). Even if the two database names are the same,

blastpgp uses the protein version while "blastall -p psitblastn" uses

the DNA version.
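
The two-step procedure above can be wrapped so that both runs share the
same query and the same explicit -F setting. The sketch below is an
illustration only: the function name is hypothetical, and setting -F in
both runs (with "F" as the default) is one way of satisfying the
consistency requirement described under Usage.

```python
def psitblastn_commands(query, protein_db, nucleotide_db, checkpoint,
                        rounds=2, filter_query="F"):
    """Build the blastpgp and blastall argument lists for a PSI-TBLASTN search."""
    # Step 1: blastpgp builds the position specific matrix and saves it with -C.
    build = ["blastpgp", "-d", protein_db, "-i", query,
             "-j", str(rounds), "-C", checkpoint, "-F", filter_query]
    # Step 2: blastall reads the matrix with -R; same query, same -F setting.
    search = ["blastall", "-i", query, "-d", nucleotide_db,
              "-p", "psitblastn", "-R", checkpoint, "-F", filter_query]
    return build, search


build, search = psitblastn_commands("ff.chd", "nr", "yeast", "ff.chd.ckp")
print(" ".join(build))
print(" ".join(search))
```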

 

 

Blastpgp

--------

Blastpgp performs gapped blastp searches and can be used to perform

iterative searches in psi-blast and phi-blast mode. See the PSI-Blast and

PHI-BLAST sections (below) for a description of this binary. The options may be

obtained by executing 'blastpgp -'.

-T Produce HTML output [T/F]

default = F

-Q Output File for PSI-BLAST Matrix in ASCII [File Out] Optional

Bl2seq

------

Bl2seq performs a comparison between two sequences using either the blastn or

blastp algorithm. Both sequences must be either nucleotides or proteins.

The options may be obtained by executing 'bl2seq -'.

-i First sequence [File In]

-j Second sequence [File In]

-p Program name: blastp, blastn, blastx. For blastx 1st argument should be nucleotide [String]

default = blastp

-g Gapped [T/F]

default = T

-o alignment output file [File Out]

default = stdout

-d theor. db size (zero is real size) [Integer]

default = 0

-a SeqAnnot output file [File Out] Optional

-G Cost to open a gap (zero invokes default behavior) [Integer]

default = 0

-E Cost to extend a gap (zero invokes default behavior) [Integer]

default = 0

-X X dropoff value for gapped alignment (in bits) (zero invokes default behavior) [Integer]

default = 0

-W Wordsize (zero invokes default behavior) [Integer]

default = 0

-M Matrix [String]

default = BLOSUM62

-q Penalty for a nucleotide mismatch (blastn only) [Integer]

default = -3

-r Reward for a nucleotide match (blastn only) [Integer]

default = 1

-F Filter query sequence (DUST with blastn, SEG with others) [String]

default = T

-e Expectation value (E) [Real]

default = 10.0

-S Query strands to search against database (blastn only). 3 is both, 1 is top, 2 is bottom [Integer]

default = 3

-T Produce HTML output [T/F]

default = F

 

Fastacmd

--------

Fastacmd retrieves FASTA formatted sequences from a BLAST database, if it was formatted

using the '-o' option. An example fastacmd call would be:

fastacmd -d nr -s p38398

The fastacmd options are:

fastacmd arguments:

-d Database [String]

default = nr

-s Search string: GIs, accessions and locuses may be used, delimited

by comma or space [String] Optional

-i Input file with GIs/accessions/locuses for batch retrieval [String] Optional

-a Retrieve duplicated accessions [T/F] Optional

default = F

-l Line length for sequence [Integer] Optional

default = 80

 

 

Filtering Strings

-----------------

The -F argument can take a string as input specifying that seg should be

run with certain values or that other non-standard filters should be used.

This section describes this syntax.

The seg options can be changed by using:

-F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5.

A coiled-coil filter, based on the work of Lupas et al. (Science, vol 252, pp. 1162-4 (1991))

and written by John Kuzio (Wilson et al., J Gen Virol, vol. 76, pp. 2923-32 (1995)), may be invoked

by specifying:

-F "C"

There are three parameters for this: window, cutoff (prob of a coiled coil), and

linker (distance between two coiled-coil regions that should be linked

together). These are now set to

window: 22

cutoff: 40.0

linker: 32

One may also change the coiled-coil parameters in a manner analogous to

that of seg:

-F "C 28 40.0 32" will change the window to 28.

One may also run both seg and the coiled-coil filter together by using a ";":

-F "C;S"

Filtering by dust may also be specified by:

-F "D"

It is possible to specify that the masking should only be done during

the process of building the initial words by starting the filtering

command with 'm', e.g.:

-F "m S"

which specifies that seg (with default arguments) should be used for masking,

but that the masking should only be done when the words are being built.

This masking option is available with all filters.

If the -U option (to mask any lower-case sequence in the input FASTA file) is used and

one does not wish any other filtering, but does wish to mask when building the lookup tables

then one should specify:

-F "m"

This is the only case where "m" should be specified alone.
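
The filtering strings described in this section follow a small, regular
syntax, so they can be composed programmatically. The helper below is a
sketch that reproduces only the forms shown above; the function names are
hypothetical. Combining "m" with several filters is not described in the
text, so the sketch refuses that case rather than guessing.

```python
def _filter(letter, params):
    """Return the bare filter letter, or the letter followed by its parameters."""
    if all(p is None for p in params):
        return letter
    if any(p is None for p in params):
        raise ValueError("give all parameters for a filter, or none")
    return " ".join([letter, *(str(p) for p in params)])


def seg(window=None, locut=None, hicut=None):
    return _filter("S", (window, locut, hicut))


def coiled_coil(window=None, cutoff=None, linker=None):
    return _filter("C", (window, cutoff, linker))


def dust():
    return "D"


def filter_string(*filters, lookup_only=False):
    """Combine filters with ";"; lookup_only adds the leading "m"."""
    if lookup_only:
        if len(filters) > 1:
            raise ValueError("'m' with several filters is not covered by the text")
        return " ".join(["m", *filters])
    # With no filters at all, "F" switches filtering off (as in "-F F").
    return ";".join(filters) if filters else "F"


print(filter_string(seg(10, 1.0, 1.5)))
print(filter_string(coiled_coil(), seg()))
print(filter_string(seg(), lookup_only=True))
```

The returned string is the value to pass after -F, for example as one
element of an argument list.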

 

PSI-Blast

---------

The blastpgp program can do an iterative search in which

sequences found in one round of searching are used to build

a score model for the next round of searching. In this usage,

the program is called Position-Specific Iterated BLAST, or PSI-BLAST.

As explained in the accompanying paper, the BLAST algorithm is

not tied to a specific score matrix. Traditionally, it has been

implemented using an AxA substitution matrix where A is the alphabet size.

PSI-BLAST instead uses a QxA matrix, where Q is the length of the query

sequence; at each position the cost of a letter depends on the position

w.r.t. the query and the letter in the subject sequence.
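
The difference between an AxA substitution matrix and a QxA
position-specific matrix can be shown with a toy example. Everything in
the sketch below is invented for illustration: the two-letter alphabet,
the scores, and the function names. It scores an ungapped alignment only
and is not how blastpgp itself is implemented.

```python
def score_with_substitution_matrix(query, subject, matrix):
    """A x A scoring: the score depends only on the pair of letters."""
    return sum(matrix[q][s] for q, s in zip(query, subject))


def score_with_position_matrix(subject, pssm):
    """Q x A scoring: row i of pssm holds the scores for query position i."""
    return sum(row[s] for row, s in zip(pssm, subject))


# Toy alphabet {A, C} with invented scores.
substitution = {"A": {"A": 4, "C": -2}, "C": {"A": -2, "C": 9}}
query = "ACA"

# Start with one row per query position, copied from the substitution
# matrix row for the query letter at that position.
pssm = [dict(substitution[q]) for q in query]

# A position can then be given its own scores; here position 2 is
# assumed to tolerate A as well as C.
pssm[1]["A"] = 1

print(score_with_substitution_matrix(query, "AAA", substitution))  # 4 - 2 + 4
print(score_with_position_matrix("AAA", pssm))                      # 4 + 1 + 4
```

The same subject sequence scores differently once the matrix carries
position-specific information, which is what lets later rounds find more
distant relatives.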

The position-specific matrix for round i+1 is built from a constrained

multiple alignment among the query and the sequences found with

sufficiently low e-value in round i. The top part of the output for

each round divides the sequences into two groups: sequences found

previously and used in the score model, and sequences not used in the

score model. The output currently includes lots of diagnostics

requested by users at NCBI. To skip quickly from the output of

one round to the next, search for the string "producing", which is

part of the header for each round and likely does not appear elsewhere

in the output. PSI-BLAST "converges" and stops if all sequences

found at round i+1 below the e-value threshold were already in

the model at the beginning of the round.
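
The convergence rule in the paragraph above can be written as a simple set
comparison. The sketch below is an illustration, not the blastpgp code:
the function name and the (identifier, e-value) hit tuples are
assumptions, and whether a hit exactly at the threshold counts as included
is also an assumption.

```python
def has_converged(model_ids, round_hits, threshold=0.001):
    """True if every hit within the threshold is already in the model."""
    # Assumption: a hit whose e-value equals the threshold is included.
    included = {seq_id for seq_id, evalue in round_hits if threshold >= evalue}
    return included.issubset(model_ids)


model = {"seqA", "seqB"}
print(has_converged(model, [("seqA", 1e-30), ("seqB", 1e-8), ("seqC", 0.5)]))
print(has_converged(model, [("seqA", 1e-30), ("seqC", 1e-5)]))
```

In the first call the only new sequence is above the threshold, so the
search would stop; in the second a new sequence falls below it, so another
round would be run.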

There are several blastpgp parameters specifically for PSI-BLAST:

-j is the maximum number of rounds (default 1; i.e., regular BLAST)

-h is the e-value threshold for including sequences in the

score matrix model (default 0.001)

-c is the "constant" used in the pseudocount formula specified in the

paper (default 10)

The -C and -R flags provide a "checkpointing" facility whereby

a score model can be stored and later reused.

-C stores the query and frequency count ratio matrix in a

file

-R restarts from a file stored previously.

When using -R, it is required that the query specified on the command line

match exactly the query in the restart file.

The checkpoint files are stored in a byte-encoded (not human readable)

format, so as to prevent roundoff error between writing and reading

the checkpoint.

Users who also develop their own sequence analysis software may wish

to develop their own scoring systems. For this purpose the code

in posit.c that writes out the checkpoint can be easily adapted to

write out scoring systems derived by other algorithms in such

a way that PSI-BLAST can read the files in later.

The checkpoint structure is general in the sense that it can handle

any position-specific matrix that fits in the Karlin-Altschul

statistical framework for BLAST scoring.

The -B flag provides a way to jump start PSI-BLAST from a master-slave

multiple alignment computed outside PSI-BLAST. The multiple alignment

must include the query sequence as one of the sequences, but it need

not be the first sequence. The multiple alignment must be specified

in a format that is derived from Clustal, but without some headers and

trailers. See example below. The rules are also described by the

following words. Suppose the multiple alignment has N sequences. It

may be presented in 1 or more blocks, where each block presents a

range of columns from the multiple alignment. E.g., the first block

might have columns 1-60, the second block might have columns 61-95,

the third block might have columns 96-128. Each block should have N

rows, 1 row per sequence. The sequences should be in the same order

in every block. Blocks are separated by 1 or more blank lines.

Within a block there are no blank lines, and each line consists of 1

sequence identifier followed by some white space followed by

characters (and gaps) for that sequence in the multiple alignment. In

each column, all letters must be in upper case, or all letters must be

in lower case. Upper case means that this column is to be given

position-specific scores. Lower-case means to use the underlying

matrix (specified by -M) for this column; e.g., if the query sequence

has an 'l' residue in the column, then the standard scores for

matching an L are used in the column.
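
The formatting rules above can be checked before an alignment file is
handed to -B. The following is a rough sketch of such a check, not a
description of how blastpgp parses the file: the function names are
hypothetical, and the requirement that all rows in a block have the same
length is an assumption drawn from each block covering one range of
columns.

```python
def parse_blocks(text):
    """Split alignment text into blocks of (identifier, residues) rows."""
    blocks, current = [], []
    for line in text.splitlines():
        if line.strip():
            parts = line.split()
            if len(parts) != 2:
                raise ValueError(f"expected 'identifier residues': {line!r}")
            current.append((parts[0], parts[1]))
        elif current:  # a blank line closes the current block
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def check_alignment(text):
    """Return a list of problems found; an empty list means the rules hold."""
    blocks = parse_blocks(text)
    if not blocks:
        return ["no alignment rows found"]
    problems = []
    order = [identifier for identifier, _ in blocks[0]]
    for number, block in enumerate(blocks, start=1):
        if [identifier for identifier, _ in block] != order:
            problems.append(f"block {number}: sequences differ from block 1")
        rows = [residues for _, residues in block]
        if len({len(row) for row in rows}) != 1:
            problems.append(f"block {number}: rows have different lengths")
            continue
        for column, letters in enumerate(zip(*rows), start=1):
            # Gap characters are ignored; letters must agree in case.
            cases = {ch.isupper() for ch in letters if ch.isalpha()}
            if len(cases) > 1:
                problems.append(f"block {number}, column {column}: mixed case")
    return problems


toy = "seqA  ACDe\nseqB  A-Dg\n\nseqA  kLM\nseqB  -LM\n"
print(check_alignment(toy))
```

The toy alignment has two sequences in two blocks and passes the check, so
an empty list is printed.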

A sample usage would be:

blastpgp -i seq1 -B align1 -j 2 -d nr

where seq1 is the query

align1 is the alignment file

-j 2 indicates to do 2 rounds

-d nr indicates to use the nr database

The example files

seq1

align1

copied below were kindly supplied by L. Aravind from a paper

he and Chris Ponting published in Protein Science:

Aravind L, Ponting CP, Homologues of 26S proteasome subunits

are regulators of transcription and translation, Protein Science

7(1998) 1250-1254.

L. Aravind (aravind@ncbi.nlm.nih.gov) was the first user

and helped define how -B should work. Y. Wolf (wolf@ncbi.nlm.nih.gov)

helped design a more flexible input format for the alignments.

If you like how -B works, let them know.

If you do not like how -B works, complain to

A. Schaffer(schaffer@helix.nih.gov) who did the implementation.

seq1

----

> 26SPS9_Hs

IHAAEEKDWKTAYSYFYEAFEGYDSIDSPKAITSLKYMLLCKIMLNTPEDVQALVSGKLALRYAGRQTEA

LKCVAQASKNRSLADFEKALTDYRAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKL

SKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP

 

align1

------

26SPS9_Hs IHAAEEKDWKTAYSYFYEAFEGYdsidspkaitslkymllckimlntpedvqalvsgklalryagrqtealkcvaqasknr

F57B9_Ce LHAADEKDFKTAFSYFYEAFEGYdsvdekvsaltalkymllckvmldlpdevnsllsaklalkyngsdldamkaiaaaaqk

YDL097c_Sc ILHCEDKDYKTAFSYFFESFESYhnltthnsyekacqvlkymllskimlnliddvknilnakytketyqsrgidamkavae

YMJ5_Ce LYSAEERDYKTSFSYFYEAFEGFasigdkinatsalkymilckimlneteqlagllaakeivayqkspriiairsmadafr

FUS6_ARATH KNYIRTRDYCTTTKHIIHMCMNAilvsiemgqfthvtsyvnkaeqnpetlepmvnaklrcasglahlelkkyklaarkfld

COS41.8_Ci SLDYKLKTYLTIARLYLEDEDPVqaemyinrasllqnetadeqlqihykvcyarvldyrrkfleaaqrynelsyksaihet

644879 KCYSRARDYCTSAKHVINMCLNVikvsvylqnwshvlsyvskaestpeiaeqrgerdsqtqailtklkcaaglaelaarky

YPR108w_Sc IHCLAVRNFKEAAKLLVDSLATFtsieltsyesiatyasvtglftlertdlkskvidspellslisttaalqsissltisl

eif-3p110_Hs SKAMKMGDWKTCHSFIINEKMNGkvw-------------------------------------------------------

T23D8.4_Ce SKAMLNGDWKKCQDYIVNDKMNQkvw-------------------------------------------------------

YD95_Sp IYLMSIRNFSGAADLLLDCMSTFsstellpyydvvryavisgaisldrvdvktkivdspevlavlpqnesmssleacinsl

KIAA0107_Hs LYCVAIRDFKQAAELFLDTVSTFtsyelmdyktfvtytvyvsmialerpdlrekvikgaeilevlhslpavrqylfslyec

F49C12.8_Hs LYRMSVRDFAGAADLFLEAVPTFgsyelmtyenlilytvitttfaldrpdlrtkvircnevqeqltggglngtlipvreyl

Int-6_Mm KFQYECGNYSGAAEYLYFFRVLVpatdrnalsslwgklaseilmqnwdaamedltrlketidnnsvssplqslqqrtwlih

26SPS9_Hs sladfekaltdy-----------------------------------------------------------------------------------

F57B9_Ce rslkdfqvafgsf----------------------------------------------------------------------------------

YDL097c_Sc aynnrslldfntalkqy------------------------------------------------------------------------------

YMJ5_Ce krslkdfvkalaeh---------------------------------------------------------------------------------

FUS6_ARATH vnpelgnsyneviapqdiatygglcalasfdrselkqkvidninfrnflelvpdvrelindfyssryascleylasl------------------

COS41.8_Ci eqtkalekalncailapagqqrsrmlatlfkdercqllpsfgilekmfldriiksdemeefar--------------------------------

644879 kqaakclllasfdhcdfpellspsnvaiygglcalatfdrqelqrnvissssfklflelepqvrdiifkfyeskyasclkmldem----------

YPR108w_Sc yasdyasyfpyllety-------------------------------------------------------------------------------

eif-3p110_Hs -----------------------------------------------------------------------------------------------

T23D8.4_Ce -----------------------------------------------------------------------------------------------

YD95_Sp ylcdysgffrtladve-------------------------------------------------------------------------------

KIAA0107_Hs rysvffqslavv-----------------------------------------------------------------------------------

F49C12.8_Hs esyydchydrffiqlaale----------------------------------------------------------------------------

Int-6_Mm wslfvffnhpkgrdniidlflyqpqylnaiqtmcphilrylttavitnkdvrkrrqvlkdlvkviqqesytykdpitefveclyvnfdfdgaqkk

26SPS9_Hs ----RAELRDDPIISTHLAKLYDNLLEQNLIRVIEPFSRVQIEHISSLIKLSKADVERKLSQMILDKKFHGILDQGEGVLIIFDEPP

F57B9_Ce ----PQELQMDPVVRKHFHSLSERMLEKDLCRIIEPYSFVQIEHVAQQIGIDRSKVEKKLSQMILDQKLSGSLDQGEGMLIVFEIAV

YDL097c_Sc ----EKELMGDELTRSHFNALYDTLLESNLCKIIEPFECVEISHISKIIGLDTQQVEGKLSQMILDKIFYGVLDQGNGWLYVYETPN

YMJ5_Ce ----KIELVEDKVVAVHSQNLERNMLEKEISRVIEPYSEIELSYIARVIGMTVPPVERAIARMILDKKLMGSIDQHGDTVVVYPKAD

FUS6_ARATH ----KSNLLLDIHLHDHVDTLYDQIRKKALIQYTLPFVSVDLSRMADAFKTSVSGLEKELEALITDNQIQARIDSHNKILYARHADQ

COS41.8_Ci ----QLMPHQKAITADGSNILHRAVTEHNLLSASKLYNNIRFTELGALLEIPHQMAEKVASQMICESRMKGHIDQIDGIVFFERRET

644879 ----KDNLLLDMYLAPHVRTLYTQIRNRALIQYFSPYVSADMHRMAAAFNTTVAALEDELTQLILEGLISARVDSHSKILYARDVDQ

YPR108w_Sc ----ANVLIPCKYLNRHADFFVREMRRKVYAQLLESYKTLSLKSMASAFGVSVAFLDNDLGKFIPNKQLNCVIDRVNGIVETNRPDN

eif-3p110_Hs ----DLFPEADKVRTMLVRKIQEESLRTYLFTYSSVYDSISMETLSDMFELDLPTVHSIISKMIINEELMASLDQPTQTVVMHRTEP

T23D8.4_Ce ----NLFHNAETVKGMVVRRIQEESLRTYLLTYSTVYATVSLKKLADLFELSKKDVHSIISKMIIQEELSATLDEPTDCLIMHRVEP

YD95_Sp ----VNHLKCDQFLVAHYRYYVREMRRRAYAQLLESYRALSIDSMAASFGVSVDYIDRDLASFIPDNKLNCVIDRVNGVVFTNRPDE

KIAA0107_Hs ----EQEMKKDWLFAPHYRYYVREMRIHAYSQLLESYRSLTLGYMAEAFGVGVEFIDQELSRFIAAGRLHCKIDKVNEIVETNRPDS

F49C12.8_Hs ----SERFKFDRYLSPHFNYYSRGMRHRAYEQFLTPYKTVRIDMMAKDFGVSRAFIDRELHRLIATGQLQCRIDAVNGVIEVNHRDS

Int-6_Mm lrecESVLVNDFFLVACLEDFIENARLFIFETFCRIHQCISINMLADKLNMTPEEAERWIVNLIRNARLDAKIDSKLGHVVMGNNAV

 

 

 

 

PHI-Blast

---------

PHI-BLAST (Pattern-Hit Initiated BLAST) is a search

program that combines matching of regular expressions

with local alignments surrounding the match.

The most important features of the program have been

incorporated into the BLAST software framework

partly for user convenience and partly so that

PHI-BLAST may be combined seamlessly with PSI-BLAST.

Other features that do not fit into the BLAST framework

will be released later as a separate program and/or

separate Web page query options.

One very restrictive way to identify protein motifs

is by regular expressions that must contain each instance

of the motif. The PROSITE database is a compilation of

restricted regular expressions that describe protein motifs.

Given a protein sequence S and a regular expression pattern P

occurring in S, PHI-BLAST helps answer the question:

What other protein sequences both contain an occurrence of P

and are homologous to S in the vicinity of the pattern occurrences?

PHI-BLAST may be preferable to just searching for pattern occurrences

because it filters out those cases where the pattern occurrence is

probably random and not indicative of homology.

PHI-BLAST may be preferable to other flavors of BLAST because

it is faster and because it allows the user to express

a rigid pattern occurrence requirement.

The pattern search methods in PHI-BLAST are based on the

algorithms in:

R. Baeza-Yates and G. Gonnet, Communications of the ACM 35(1992), pp. 74-82.

S. Wu and U. Manber, Communications of the ACM 35(1992), pp. 83-91.

The calculation of local alignments is done using a method

very similar to (and much of the same code as) gapped BLAST.

However, the method of evaluating statistical significance is different, and

is described below.

In the stand-alone mode the typical PHI-BLAST usage looks like:

blastpgp -i <query_file> -k <pattern_file> -p patseedp

where -i is followed by the file containing the query in FASTA format,

-k is followed by the file containing the pattern in the syntax given below,

and "patseedp" indicates the mode of usage and does not name any file.

The syntax for the query sequence is FASTA format as for all other

BLAST queries. The syntax for patterns follows the rules of

PROSITE and is documented in detail below.

The specified pattern is not required to be in the PROSITE list.

Most of the other BLAST flags can be used with PHI-BLAST.

One important exception is that PHI-BLAST requires gapped

alignments (i.e. forbids -g F in the flags) because ungapped

alignments do not make sense for almost all patterns in PROSITE.

There is a second mode of PHI-BLAST usage that is important when

the specified pattern occurs more than 1 time in the query.

In this case, the user may be interested in restricting the

search for local alignments to a subset of the pattern occurrences.

This can be done with a search that looks like:

blastpgp -i <query_file> -k <pattern_file> -p seedp

in which case the use of the "seedp" option requires the user to

specify the location(s) of the interesting pattern occurrence(s)

in the pattern file. The syntax for how to specify pattern

occurrences is below. When there are multiple pattern occurrences in the

query it may be important to decide how many are of interest because

the E-value for matches is effectively multiplied by the number

of interesting pattern occurrences.

The PHI-BLAST Web page supports only the "patseedp" option.

PHI-BLAST is integrated with PSI-BLAST. In the command-line

mode, PSI-BLAST can be invoked by using the -j option, as usual.

When this is done as:

blastpgp -i <query_file> -k <pattern_file> -p patseedp -j <number_of_rounds>

then the first round of searching uses PHI-BLAST and all subsequent

rounds use PSI-BLAST.

In the Web page setting, the user must explicitly invoke one round

at a time, and the PHI-BLAST Web page provides the option to

initiate a PSI-BLAST round with the PHI-BLAST results.

To describe a combined usage, use the term "PHI-PSI-BLAST"

(Pattern-Hit Initiated, Position-Specific Iterated BLAST).

Determining statistical significance.

When a query sequence Q matches a database sequence D in PHI-BLAST,

it is useful to subdivide Q and D into 3 disjoint pieces

Qleft Qpattern Qright

Dleft Dpattern Dright

The substrings Qpattern and Dpattern contain the pattern specified

in the pattern file. The pieces Qpattern and Dpattern are aligned

and that alignment is displayed as part of the PHI-BLAST output,

but the score for that alignment is mostly ignored.

The "reduced" score r of an alignment is the sum of the scores obtained

by aligning Qleft with Dleft and by aligning Qright with Dright.

The expected number of alignments with a reduced score >= x

is given by:

C * N * (Lambda*x + 1) * e^(-Lambda*x)

where:

C and Lambda are "constants" depending on the score matrix and the

gap costs.

N is (number of occurrences of pattern in database) * (number of

occurrences of pattern in Q)

e is the base of the natural logarithm.
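
As an illustration only, the formula above can be evaluated with a few lines
of Python. The function name phi_blast_expect and all numeric values below
(C, Lambda, and the occurrence counts) are made up for the example; real
values depend on the score matrix, the gap costs, and the database searched.

```python
import math


def phi_blast_expect(x, C, lam, n_db, n_query):
    """Expected number of alignments with reduced score >= x.

    Implements C * N * (Lambda*x + 1) * e^(-Lambda*x), where
    N = (pattern occurrences in database) * (pattern occurrences in Q).
    """
    N = n_db * n_query
    return C * N * (lam * x + 1.0) * math.exp(-lam * x)


# Hypothetical constants, chosen only to show the shape of the formula.
one_occurrence = phi_blast_expect(x=40, C=0.5, lam=0.3, n_db=200, n_query=1)
two_occurrences = phi_blast_expect(x=40, C=0.5, lam=0.3, n_db=200, n_query=2)
print(one_occurrence, two_occurrences)
```

Because N is a product, doubling the number of pattern occurrences in Q that
are of interest doubles the expected number of alignments for the same
reduced score.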

It is important to understand that this method of computing

the statistical significance of a PHI-BLAST alignment is mathematically

different from the method used for BLAST and PSI-BLAST alignments.

However, both methods provide E-values, so the E-values are

displayed with a similar output syntax.

Rules for pattern syntax for PHI-BLAST.

The syntax for patterns in PHI-BLAST follows the conventions

of PROSITE. When using the stand-alone program, it

is permissible to have multiple patterns in a file separated

by a blank line between patterns. When using the Web-page

only one pattern is allowed per query.

Valid protein characters for PHI-BLAST patterns:

ABCDEFGHIKLMNPQRSTVWXYZU

Valid DNA characters for PHI-BLAST patterns:

ACGT

Other useful delimiters:

[ ] means any one of the characters enclosed in the brackets

e.g., [LFYT] means one occurrence of L or F or Y or T

- means nothing (this is a spacer character used by PROSITE)

x with nothing following means any residue

x(5) means 5 positions in which any residue is allowed (and similarly for any other

single number in parentheses after x)

x(2,4) means 2 to 4 positions where any residue is allowed,

and similarly for any other two numbers separated by a comma;

the first number should be < the second number.

> can occur only at the end of a pattern and means nothing

(it may occur before a period;

it is another spacer used by PROSITE)

. may be used at the end of the pattern and means nothing
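
A minimal sketch, in Python, of how the rules above might be translated into
an ordinary regular expression. This is not the PHI-BLAST implementation
(PHI-BLAST uses the pattern-search algorithms cited earlier); the function
name prosite_to_regex is hypothetical, and the sketch handles only the
elements listed above.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    # Trailing '>' and '.' characters are spacers that mean nothing.
    core = pattern.strip().rstrip(">.")
    parts = []
    for element in core.split("-"):  # '-' is only a spacer between elements
        element = element.strip()
        if not element:
            continue
        wildcard = re.fullmatch(r"x(?:\((\d+)(?:,(\d+))?\))?", element)
        if wildcard:
            low, high = wildcard.group(1), wildcard.group(2)
            if low is None:
                parts.append(".")                         # x
            elif high is None:
                parts.append(".{%s}" % low)               # x(5)
            else:
                parts.append(".{%s,%s}" % (low, high))    # x(2,4)
        elif re.fullmatch(r"\[[A-Z]+\]", element):
            parts.append(element)                         # [LFYT]
        elif re.fullmatch(r"[A-Z]", element):
            parts.append(element)                         # single residue
        else:
            raise ValueError("unsupported pattern element: %r" % element)
    return "".join(parts)


print(prosite_to_regex("[LFYT]-x(2,4)-G."))
```

The lower-case x is the wildcard; an upper-case X is treated as the literal
residue code, since X appears in the list of valid protein characters.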

When using the stand-alone program, the pattern should

be in a file, with the first line starting:

ID

followed by 2 spaces and a text string giving the pattern a name.

There should also be a line starting

PA

followed by 2 spaces followed by the pattern description.

All other PROSITE codes in the first two columns are allowed,

but only the HI code, described below is relevant to PHI-BLAST.

Here is an example from PROSITE.

ID  CNMP_BINDING_2; PATTERN.

AC  PS00889;

DT  OCT-1993 (CREATED); OCT-1993 (DATA UPDATE); NOV-1995 (INFO UPDATE).

DE  Cyclic nucleotide-binding domain signature 2.

PA  [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].

NR  /RELEASE=32,49340;

NR  /TOTAL=57(36); /POSITIVE=57(36); /UNKNOWN=0(0); /FALSE_POS=0(0);

NR  /FALSE_NEG=1; /PARTIAL=1;

CC  /TAXO-RANGE=??EP?; /MAX-REPEAT=2;

The line starting

ID

gives the pattern a name.

The lines starting

AC, DT, DE, NR, NR, CC

are relevant to PROSITE users, but irrelevant to PHI-BLAST.

These lines are tolerated, but ignored by PHI-BLAST.

The line starting

PA

describes the pattern as:

one of LIVMF

followed by

G

followed by

E

followed by

any single character

followed by

one of GAS

followed by

one of LIVM

followed by

any 5 to 11 characters

followed by

R

followed by

one of STAQ

followed by

A

followed by

any single character

followed by

one of LIVMA

followed by

any single character

followed by

one of STACV

In this case the pattern ends with a period.

It can end with nothing after the last specifying symbol

or any number of > signs or periods or combination thereof.

Here is another example, illustrating the use of an HI line.

ID  ER_TARGET; PATTERN.

PA  [KRHQSA]-[DENQ]-E-L>.

HI  (19 22)

HI  (201 204)

In this example, the HI lines specify that the pattern

occurs twice, once from positions 19 through 22 in the

sequence and once from positions 201 through 204 in the

sequence.

These specifications are relevant when stand-alone PHI-BLAST is

used with the

seedp

option, in which the interesting occurrences of the pattern

in the sequence are specified. In this case the

HI lines specify which occurrence(s) of the pattern

should be used to find good alignments.

In general, the seedp option is more useful than the

standard patseedp option ONLY when the

pattern occurs K > 1 times in the sequence AND

the user is interested in matching to J < K of those

occurrences.

Then using the HI lines enables the user to specify which

occurrences are of interest.
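
The file layout described above (ID, PA, and optional HI lines, with other
two-letter codes tolerated but ignored, and a blank line between patterns)
can be read with a short Python sketch such as the following. The function
name parse_pattern_file and the dictionary keys are hypothetical; this is
not the parser used by blastpgp, and it assumes each pattern fits on a
single PA line.

```python
import re


def parse_pattern_file(text):
    """Return one dict per pattern with keys 'id', 'pattern', and 'hits'."""
    records = []
    current = None
    for line in text.splitlines():
        if not line.strip():
            current = None          # a blank line separates patterns
            continue
        if current is None:
            current = {"id": None, "pattern": None, "hits": []}
            records.append(current)
        code, rest = line[:2], line[2:].strip()
        if code == "ID":
            current["id"] = rest
        elif code == "PA":
            current["pattern"] = rest
        elif code == "HI":
            m = re.fullmatch(r"\((\d+)\s+(\d+)\)", rest)
            if m is None:
                raise ValueError("malformed HI line: %r" % line)
            current["hits"].append((int(m.group(1)), int(m.group(2))))
        # Any other code (AC, DT, DE, NR, CC, ...) is tolerated but ignored.
    return records


EXAMPLE = """ID  ER_TARGET; PATTERN.
PA  [KRHQSA]-[DENQ]-E-L>.
HI  (19 22)
HI  (201 204)
"""

for record in parse_pattern_file(EXAMPLE):
    print(record["id"], record["pattern"], record["hits"])
```

With the seedp option, only the occurrences listed in 'hits' would be used
as seeds; with patseedp the HI lines are not needed.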

Additional functionality related to PHI-BLAST.

PHI-BLAST takes as input both a sequence and a pattern occurring in

that sequence and searches a sequence database for

other sequences containing the same pattern and having a good alignment.

One may be interested in asking two related, simpler questions:

1. Given a sequence and a database of patterns, which patterns occur

in the sequence and where?

2. Given a pattern and a sequence database, which sequences contain the

pattern and where?

These queries can be answered with software closely related to PHI-BLAST,

but they do not fit into the output framework of BLAST because the

answers are simple lists without alignments and with no notion of

statistical significance.

The NCBI toolbox includes another program, currently called

seedtop

to answer the two queries above.

Query 1 can be asked with:

seedtop -i <query_file> -k <pattern_file> -p patmatchp

Query 2 can be asked with:

seedtop -d <database> -k <pattern_file> -p patternp

The -k argument is used similarly in all queries and the file

format is always the same. The standard pattern database is

PROSITE, but others (or a subset) can be used.
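
To make the two queries concrete, here is a small Python sketch of what each
one computes. It is not how seedtop is implemented. The function names are
hypothetical, the patterns are assumed to have been translated by hand into
Python regular expressions, and the sequences are invented. Positions are
reported 1-based and inclusive, in the style of the HI lines.

```python
import re


def find_occurrences(regex, sequence):
    """Return 1-based (start, end) pairs for every start position that matches.

    A lookahead is used so that overlapping occurrences are found. For
    variable-length patterns only the longest (greedy) match at each start
    position is reported.
    """
    return [(m.start() + 1, m.start() + len(m.group(1)))
            for m in re.finditer("(?=(%s))" % regex, sequence)]


def patterns_in_sequence(patterns, sequence):
    """Query 1: which patterns occur in the sequence, and where?"""
    found = {}
    for name, regex in patterns.items():
        hits = find_occurrences(regex, sequence)
        if hits:
            found[name] = hits
    return found


def sequences_with_pattern(regex, database):
    """Query 2: which database sequences contain the pattern, and where?"""
    found = {}
    for seq_id, sequence in database.items():
        hits = find_occurrences(regex, sequence)
        if hits:
            found[seq_id] = hits
    return found


# Invented data; "[KRHQSA][DENQ]EL" is a hand translation of the ER_TARGET pattern.
patterns = {"ER_TARGET": "[KRHQSA][DENQ]EL", "TOY": "WWW"}
database = {"seq1": "MKDELAAAKDEL", "seq2": "MAAAA"}
print(patterns_in_sequence(patterns, database["seq1"]))
print(sequences_with_pattern(patterns["ER_TARGET"], database))
```

As the text notes, the answers are plain lists of positions with no
alignments and no statistical significance attached.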

There are plans afoot to offer the patmatchp query (number 1) on

the PHI-BLAST web page or in its vicinity, but this would

be restricted to having PROSITE as the pattern database.

References

Zhang, Zheng, Alejandro A. Schäffer, Webb Miller, Thomas L. Madden,

David J. Lipman, Eugene V. Koonin, and Stephen F. Altschul (1998),

"Protein sequence similarity searches using patterns as seeds", Nucleic

Acids Res. 26:3986-3990.

Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schäffer,

Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),

"Gapped BLAST and PSI-BLAST: a new generation of protein database

search programs", Nucleic Acids Res. 25:3389-3402.

Karlin, Samuel and Stephen F. Altschul (1990), "Methods for

assessing the statistical significance of molecular sequence

features by using general scoring schemes", Proc. Natl. Acad.

Sci. USA 87:2264-2268.

Karlin, Samuel and Stephen F. Altschul (1993), "Applications

and statistics for multiple high-scoring segments in molecular

sequences", Proc. Natl. Acad. Sci. USA 90:5873-5877.

 

Release History

---------------

Notes for 2.2.1 release:

Enhancements:

1.) BLAST and PSI-BLAST improvements as described in

Schaffer et al., Nucleic Acids Research 2001 Jul 15;29(14).

These include improvements to the use of composition-based statistics

and improvements to the edge-correction effects. Composition-based

statistics were initially implemented in release 2.1.1, but the

implementation is improved in release 2.2.1.

2.) Formatdb automatically produces database volumes for input

consisting of more than 4 billion letters.

3.) Formatdb can produce an alias file for a given database and GI list

as well as convert a GI list to the more efficient binary format. See

details in README.formatdb.

4.) RPSBLAST now works properly with 'scaled' databases. The scaling factor must

be set when executing the program 'makemat' (which takes PSI-BLAST checkpoints

as input). Scaling-up the matrix improves the precision of the (integer) calculations.

5.) Tabular output has now been added to blastpgp and rpsblast; use the "-m 8" option.

6.) Blastpgp will now process multiple queries.

Bug fixes:

1.) A problem with the -K option (for culling) that caused BLAST to crash has been fixed.

2.) A problem with the "gnl" identifier and multi-volume databases has been fixed.

3.) A problem that caused BLASTN to very rarely find suboptimal alignments has been fixed.

4.) A problem that could cause makemat to crash has been fixed.

5.) Some multi-threading problems pointed out by Henry Gabb of KAI were fixed.

6.) Some PC-lint errors and warnings pointed out by Russ Williams of United Devices

were fixed.

 

Notes for 2.1.3 release:

Enhancements:

1.) Addition of PSI-TBLASTN ability to blastall, see description in

README.bls.

2.) Database sequences over 5 million bases in length are now broken

into chunks to keep memory usage reasonable.

3.) Blastall now allows one to enter a location if it is desired

to search a subsequence of the query.

4.) Formatdb can produce a new BLAST database format using the -A option.

The BLAST programs can read this format as well as the current format (the

program automatically identifies which version it should work with). This

new format stores the sequence definition lines in a structured manner

(as ASN.1); this will allow future versions of BLAST to better present

taxonomic information as well as information about other resources (e.g.,

UniGene, LocusLink) for a database sequence.

5.) Blastall can now produce tab-delimited output; use "-m 8" to specify this.

6.) Improved Karlin-Altschul parameters are now being used; they were

calculated using the "island" method.

7.) A "gapped" check was added to BLASTN to ensure that if a hit is low-scoring

after an ungapped extension, but high-scoring after a gapped extension, it will

not be missed.

8.) The formatdb error messages have been improved for the case of illegal

characters in the sequence.

9.) The number of HSP's saved in an ungapped search has been increased to 400 from 200.

Bug fixes:

1.) A problem with XML output was fixed.

2.) A problem with the seg filtering under LINUX was

fixed (many thanks to Eric Cabot at GCG for pointing this out).

3.) A problem with format of BLAST reports if the "-o" flag

was not used when the database was produced was fixed

(thanks again to Eric Cabot).

4.) A problem with reading the BLAST database caused by a 4-byte signed integer

that should have been unsigned was fixed (thanks to Haruna Cofer at SGI

for pointing this out).

5.) A problem with copymat under NT and IRIX was fixed.

 

Notes for 2.1.2 release:

Enhancements:

1.) Release of rpsblast. Rpsblast performs a search against a database

of profiles. See README.rps for full details.

2.) Release of blastclust. BLASTCLUST automatically and systematically clusters protein sequences

based on pairwise matches found using the BLAST algorithm. See README.bcl for

full details.

3.) Release of megablast. Megablast uses the greedy algorithm of Webb Miller et al.

for nucleotide sequence alignment search and concatenates many queries to save

time spent scanning the database. See README.mbl for full details.

4.) XML output can now be produced. Use the '-m 7' option for this.

The XML output is still experimental.

5.) The default behavior of the culling (-K) option has been changed. Previously

this option was set to 100, meaning that if more than 100 HSP's had a

hit to a region, lower scoring ones would be dropped. The option is now

zero, which turns off this behavior. In a few cases this change will

result in more database sequences being reported. The previous behavior can

be recovered by using '-K 100' on the command-line.

Bug fixes:

1.) A bug that caused only the last SeqAnnot to be written (if the -O option

was used) when multiple sequences were searched has been fixed. All

SeqAnnots are printed out.

2.) A bug that caused the search space (set on the command line with the -Y option)

to be ignored for some blastx and tblastn calculations has been fixed.

3.) A failure to close a file if a GI list was used (using the -l option) was

fixed. Many thanks to David Mathog at CalTech for spotting this problem

and suggesting a fix.

4.) A bug that caused all the database names listed in an alias file to be

printed, rather than the "TITLE" field has been fixed.

 

 

Notes for 2.1.1:

Enhancements:

1.) Addition of composition-based statistics:

BLAST and PSI-BLAST now permit calculated E-values to take into account the amino acid composition of the individual database sequences involved in reported

alignments. This improves E-value accuracy, thereby reducing the number of false positive results.

The improved statistics are achieved with a scaling procedure [1,2] which in effect employs a slightly different scoring system for each database sequence. As a result,

raw BLAST alignment scores in general will not correspond precisely to those implied by any standard substitution matrix. Furthermore, identical alignments can receive

different scores, based upon the compositions of the sequences they involve. The improved statistics are now used by default for all rounds of searching on the

PSI-BLAST page, but not on the BLAST page. Therefore, if one uses default settings, the results of the first round of searching will be different on the BLAST and

PSI-BLAST pages.

In addition, adjustments have been made to two PSI-BLAST parameters: the pseudocount constant default has been changed from 10 to 7, and the E-value threshold for

including matches in the PSI-BLAST model has been changed from 0.001 to 0.002.

1. Altschul, S.F. et al. (1997) Nucl. Acids Res. 25:3389-3402.

2. Schäffer, A.A. et al. (1999) Bioinformatics 15:1000-1011.

 

Notes for 2.0.14 release:

 

Bug fixes:

1.) Extra line returns between sequences in a FASTA file

caused formatdb to produce corrupted databases.

2.) A ";" at the beginning of a line was not being treated as a comment.

3.) A problem with the formatter caused BLAST to core-dump if

the FASTA definition line contained only an identifier and

no description.

4.) A problem in the ungapped extension for protein sequences

caused problems in rare cases.

5.) The '-U' option, which causes lower-case sequence to be masked,

did not work correctly for blastx.

 

Notes for 2.0.13 release:

Enhancements:

1.) The output format for pairwise alignments was changed to

put each new gi (if the sequence has redundant gi's) on a

new line. If HTML output is specified then each gi is hyperlinked.

Bug fixes:

1.) An NCBI toolkit problem parsing the new RefSeq format in FASTA files

(two bars instead of three) was fixed. This fix applies to all

BLAST binaries (formatdb, blastall, blastpgp, etc.).

2.) A problem that caused BLAST version 2.0.12 under NT to freeze in

multithreaded mode has been fixed.

Notes for 2.0.12 release:

Enhancements:

1.) Bl2seq can now perform nucleotide-protein (blastx style) comparisons.

This necessitated changing the '-p' option from a Boolean to a

string. Valid arguments are "blastn", "blastp", or "blastx".

Bug fixes:

1.) A problem in the NCBI threads library that caused BLAST to sometimes

stick was corrected. Many thanks to Haruna Cofer and colleagues at SGI

for providing a fix.

2.) A problem that caused BLAST to core-dump (especially on long queries)

has been fixed. Many thanks to Gary Williams for providing examples.

3.) A problem that prevented the search of multiple multivolume databases

has been fixed.

 

 

Notes for 2.0.11 release:

Enhancements:

1.) Optimizations were contributed by Chris Joerg of COMPAQ. These changes

reduce the number of cache misses, unroll loops, and make some instructions

unnecessary. These improvements can speed up BLAST for long sequences

several-fold.

2.) A database is now only memory-mapped while being searched. If multiple databases

are searched and the total exceeds the allowed memory-map limit this allows

all databases to be searched as memory-mapped files. If a database cannot

be memory-mapped it is read as an ordinary file, rather than causing an error.

Bug fixes:

1.) Formatdb was fixed to correct a problem with FASTA string identifiers under NT.

2.) Blastpgp was fixed to prevent a core-dump under LINUX.

3.) BLASTN was found to miss some hits near the expect value cutoff. This has been

corrected.

 

 

Notes for 2.0.10 release:

Enhancements:

1.) Bl2seq, a utility to compare two sequences using the blastn or blastp approach,

is included in the archive. See the full description in the README.bls for details.

2.) A 'sparse' option ('-s') has been added to formatdb. This option limits the indices

for the string identifiers (used by formatdb) to accessions (i.e., no locus names).

This is especially useful for sequences sets like the EST's where the accession and locus

names are identical. Formatdb runs faster and produces smaller temporary files if this

option is used. It is strongly recommended for EST's, STS's, GSS's, and HTGS's.

3.) A volume option ('-v') has been added to formatdb. This option breaks up large

FASTA files into 'volumes' (each with a maximum size of 2 billion letters).

As part of the creation of volumes, formatdb writes a new type of BLAST database

file, called an alias file, with the extension 'nal' or 'pal'. This option

should be used if one wishes to format large databases (e.g., over 2 billion

base pairs).

4.) It is now possible to jump start the command line version of PSI-BLAST (blastpgp)

from a multiple alignment that includes the query sequence using the -B option. Details

are in README.bls.

5.) The maximum wordsize limit for BLASTN has been removed.
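
The volume option in item 3 above can be pictured with a short Python
sketch. This only illustrates the idea of capping the number of letters per
volume; it is an assumption about the general approach, not a description of
how formatdb actually decides where to split, and the function name
split_into_volumes is hypothetical.

```python
def split_into_volumes(records, max_letters):
    """Group (identifier, sequence) records into consecutive volumes.

    A new volume is started whenever adding the next sequence would push
    the current volume past max_letters. A single sequence longer than
    max_letters is placed in a volume of its own.
    """
    volumes, current, used = [], [], 0
    for identifier, sequence in records:
        length = len(sequence)
        if current and used + length > max_letters:
            volumes.append(current)
            current, used = [], 0
        current.append((identifier, sequence))
        used += length
    if current:
        volumes.append(current)
    return volumes


# Tiny invented example with a cap of 7 letters instead of 2 billion.
records = [("a", "ACGT"), ("b", "ACG"), ("c", "AC"), ("d", "ACGTA")]
for number, volume in enumerate(split_into_volumes(records, 7)):
    print(number, [identifier for identifier, _ in volume])
```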

Bug fixes:

1.) A problem if the database length, set by the '-z' option was greater than

2 billion, was fixed.

2.) A core-dump that resulted from the use of the coiled-coil masking

('-F C') was fixed by including a file needed for the data directory.

3.) A bug was fixed that caused some very short alignments to be assigned incorrect

expect values.

4.) A bug was fixed that caused formatdb to produce incorrect BLAST databases if

the input was ASN.1.

5.) A serious performance problem with BLASTN and longer words (greater than 16)

was fixed.

Notes for 2.0.9 release:

Enhancements:

1.) Two new options have been added to blastall: to produce output in HTML and

to search a subset of the database based upon a list of GI's. Please see

the options section for full information.

2.) Two new options have been added to blastpgp: to produce HTML output and to

produce an ASCII version of the PSI-BLAST Matrix. Please see the options section

for more information.

3.) Formatdb has a new option to allow specification of a 'base' name. See the options

section for full details.

4.) It is possible to mask only during the phase when the lookup table is being built,

but not during the extensions. See the options section for full details.

Bug fixes:

1.) A problem that occurred when too many HSP's aligned to the same part

of the query from one database sequence has been fixed.

2.) A problem that caused seedtop to not perform pattern-matching for DNA

sequences has been fixed.

3.) The number of HSP's saved for ungapped BLAST and tblastx is now limited to

200 to prevent problems with memory and speed.

4.) A missing thread join that caused problems under DEC Alpha has been added.

5.) A formatting problem with the database summary at the beginning of the

BLAST output (if multiple databases totaling over 2 Gig were searched) has been fixed.

6.) A bug in formatdb that caused a core-dump if the total number of sequences was an

exact multiple of 100000 was fixed.

 

Notes for 2.0.8 release:

Enhancements:

1.) Frame and strand information was added to the output. Examples of the

new output format may be found at http://www.ncbi.nlm.nih.gov/BLAST/example.html.

2.) An option that specifies the query strand to be searched (for blastn, blastx, and tblastx)

has been added. The option is '-S'.

Bug fixes:

1.) The problem with the 'too-wide' parameter input screen under NT was fixed.

2.) BLAST no longer core-dumps when the query is NULL.

3.) BLAST no longer core-dumps when the query contains an '@' and blastx or tblastx is selected.

Notes for 2.0.7 release:

Bug fixes:

1.) BLAST now multi-threads properly under LINUX.

2.) A problem with very redundant databases and psi-blast was fixed.

3.) A problem with the formatting of the number of identities and positives

was fixed. This affected results on the minus strand only and did not

affect the expect value or scores.

4.) A problem that caused tblastn to core-dump very occasionally was corrected.

5.) A problem with multiple patterns in PHI-BLAST was fixed.

6.) A limit on the number of HSP's that were saved (100) was removed.

Notes for 2.0.6 release:

Enhancements:

1.) PHI-BLAST is included in this release. Please see notes on PHI-BLAST for

details.

2.) SEG has become an integral part of the NCBI toolkit and it is no longer necessary

to install it separately. It is also now supported under non-UNIX platforms.

3.) Access to filtering options.

If one uses "-F T" then normal filtering by seg or dust (for blastn)

occurs (likewise "-F F" means no filtering whatsoever). The seg options

can be changed by using:

-F "S 10 1.0 1.5"

which specifies a window of 10, locut of 1.0 and hicut of 1.5. One may

also specify coiled-coil filtering by specifying:

-F "C"

There are three parameters for this: window, cutoff (probability of a coiled-coil), and

linker (distance between two coiled-coil regions that should be linked

together). These are now set to

window: 22

cutoff: 40.0

linker: 32

One may also change the coiled-coil parameters in a manner analogous to

that of seg:

-F "C 28 40.0 32" will change the window to 28.

One may also run both seg and coiled-coil filtering together by using a ";":

-F "C;S"

4.) BLAST has been changed to reduce the number of redundant hits that a user

may see. This is achieved by keeping track of the number of hits completely

contained in a certain region and eliminating those lower scoring hits that

are redundant with others. This behavior may be controlled with the -K and -L

options:

-K Number of best hits from a region to keep [Integer]

default = 50

-L Length of region used to judge hits [Integer]

default = 20

Setting -K to zero turns off this feature. This is the default only on blastall.
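
The -F filter strings described in item 3 above follow a simple grammar,
which the following Python sketch spells out. It is only an illustration of
the syntax as documented here, not the parser used by BLAST; the function
name parse_filter_string and the returned structure are hypothetical. The
seg defaults are not stated in this section, so the sketch records them as
None rather than guessing.

```python
# Coiled-coil defaults as listed in item 3 above.
COIL_DEFAULTS = {"window": 22, "cutoff": 40.0, "linker": 32}


def parse_filter_string(value):
    """Parse the argument of -F into a dict keyed by filter letter."""
    value = value.strip()
    if value == "T":
        return {"default": None}   # normal filtering (seg, or dust for blastn)
    if value == "F":
        return {}                  # no filtering whatsoever
    filters = {}
    for part in value.split(";"):  # ";" combines several filters
        fields = part.split()
        if not fields:
            continue
        kind, args = fields[0], fields[1:]
        if args and len(args) != 3:
            raise ValueError("expected three parameters in %r" % part)
        if kind == "S":            # seg: window, locut, hicut
            filters["S"] = None if not args else {
                "window": int(args[0]),
                "locut": float(args[1]),
                "hicut": float(args[2]),
            }
        elif kind == "C":          # coiled-coil: window, cutoff, linker
            filters["C"] = dict(COIL_DEFAULTS) if not args else {
                "window": int(args[0]),
                "cutoff": float(args[1]),
                "linker": int(args[2]),
            }
        else:
            raise ValueError("unknown filter %r" % kind)
    return filters


for example in ("S 10 1.0 1.5", "C", "C 28 40.0 32", "C;S"):
    print(example, parse_filter_string(example))
```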

Bug fixes:

1.) There was a problem with the procedure that called the external utility seg;

it showed up under LINUX. The need to fix this was obviated by the integration

of seg into the toolkit.

2.) There was a memory problem with formatdb that has been fixed. This showed up

mostly under NT and LINUX.

3.) A problem with running in multi-processing mode under IRIX6.5 (as a non-root user)

was fixed.

Notes for 2.0.5 release:

Enhancements:

1.) The BLAST version is printed by formatdb in its log file.

2.) Multi-database searches no longer require that the -o option be used when

preparing the databases (i.e., with formatdb).

Bugs fixed:

1.) A serious bug with multi-database iterative searches was fixed (thanks to

Steve Brenner for providing an example).

2.) 'lcl' is not formatted in the BLAST report when the sequence identifier

is a local identifier or does not contain a bar ("|").

3.) A large memory leak in formatdb was fixed.

4.) An unnecessary cast that caused formatdb to fail on Solaris 2.5 machines

if the binary was made under 2.6 was fixed.

5.) Better error checking was added to protect against core-dumps.

6.) Some problems with the sum statistics treatment of the blastx and tblastn

programs reported by D. Rozenbaum were fixed. The number of alignments

involved in a sum group was misrepresented. Also the incorrect length for

the database sequence was used, sometimes causing a slight change in the

value reported.

7.) A problem with blastpgp was fixed that reported incorrect values for

matrices other than BLOSUM62 during iterative searches.

Notes for 2.0.4 release:

Enhancements:

1.) multiple database searches:

Version 2.0.4 will accept multiple database names (bracketed by quotation marks).

An example would be

-d "nr est"

which will search both the nr and est databases, presenting the results as if one

'virtual' database consisting of all the entries from both were searched. The

statistics are based on the 'virtual' database.

2.) new options:

-W Word size, default if zero [Integer]

default = 0

-z Effective length of the database (use zero for the real size) [Integer]

default = 0

3.) The number of identities, positives, and gaps are now printed out before the

alignments for gapped blastx, tblastn, and tblastx. Additionally this feature is

now also enabled for ungapped BLAST.

4.) Formatdb now accepts ASN.1, as well as FASTA, as input.

Bugs fixed:

1.) In blastx, tblastn, and tblastx a codon was incorrectly formatted as a start codon in

some cases.

2.) The last alignment of the last sequence being presented was incorrectly dropped

in some cases. This change could affect the statistical significance of the last database

sequence if the dropped alignment had a lower e-value than any other alignments from the

same database sequence.