The UniVec Database

UniVec is a database for quickly identifying segments within nucleic acid sequences that may be of vector origin (vector contamination). Screening with UniVec is efficient because redundant subsequences have been eliminated, so the database contains only one copy of every unique sequence segment from a large number of vectors.

In addition to vector sequences, UniVec also contains sequences for those adapters, linkers, and primers commonly used in the process of cloning cDNA or genomic DNA. This enables contamination with these oligonucleotide sequences to be found during the vector screen.

UniVec can be obtained from the NCBI FTP directory: ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/.
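As a rough illustration of retrieving the files, the sketch below lists the contents of that FTP directory using Python's standard `ftplib`. It assumes the server accepts anonymous FTP logins and that the network allows FTP connections; the function names `split_ftp_url` and `list_univec_files` are hypothetical helpers, and no particular file names in the directory are assumed.

```python
from ftplib import FTP
from urllib.parse import urlparse

# Directory URL taken from the text above.
UNIVEC_URL = "ftp://ftp.ncbi.nlm.nih.gov/pub/UniVec/"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, path)."""
    parts = urlparse(url)
    return parts.hostname, parts.path


def list_univec_files(url=UNIVEC_URL, timeout=30):
    """Return the names found in the UniVec FTP directory.

    Assumption: anonymous login is permitted by the server.
    """
    host, path = split_ftp_url(url)
    with FTP(host, timeout=timeout) as ftp:
        ftp.login()      # anonymous login
        ftp.cwd(path)    # change to the UniVec directory
        return ftp.nlst()
```

Calling `list_univec_files()` requires network access; the returned names can then be fetched individually with `FTP.retrbinary`.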

Eliminating the Redundancy from Vector Sequences

Many vectors have the same backbone or share common functional cassettes. Consequently, databases with the full sequence for each vector contain multiple copies of such elements. A single copy of each unique element is sufficient to allow that sequence to be recognized as vector contamination. The size of a database designed for screening can therefore be greatly reduced by eliminating the redundant copies of any sequence (see statistics for the current UniVec build).

The UniVec database is built by sequentially processing each input sequence. The input sequence is first compared to all the sequences already in the database. The location of any segment identical to database sequences is recorded. This information is used to extract only those segments of the input sequence that contain novel sequence. These novel elements are then added to the database. This cycle is repeated for each sequence that is to be represented in the finished database.
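The build cycle described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual UniVec build procedure: it assumes that "identical segments" are detected by exact matching of fixed-length windows (k-mers), it ignores the reverse strand, and the names `novel_segments` and `build_nonredundant` are hypothetical. A small window size is used in the example so the result is easy to follow.

```python
def novel_segments(seq, seen, k):
    """Return the parts of seq covered by k-mers not yet in `seen`.

    `seen` is updated in place with every k-mer of seq.
    """
    if len(seq) < k:
        # Sequences shorter than the window are treated as a single unit.
        if seq in seen:
            return []
        seen.add(seq)
        return [seq]

    covered = [False] * len(seq)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer not in seen:
            seen.add(kmer)
            for j in range(i, i + k):
                covered[j] = True  # this position belongs to novel sequence

    # Collect maximal runs of covered positions as segments.
    segments, start = [], None
    for pos, flag in enumerate(covered + [False]):
        if flag and start is None:
            start = pos
        elif not flag and start is not None:
            segments.append(seq[start:pos])
            start = None
    return segments


def build_nonredundant(sequences, k=50):
    """Process sequences in order, keeping only novel segments of each."""
    seen, database = set(), []
    for seq in sequences:
        database.extend(novel_segments(seq.upper(), seen, k))
    return database


# Two toy "vectors" sharing the backbone AAAACCCC, with a window of 4 bases.
toy_db = build_nonredundant(["AAAACCCCGGGG", "AAAACCCCTTTT"], k=4)
# -> ['AAAACCCCGGGG', 'CCCTTTT']
```

The second toy vector contributes only the segment that contains windows not seen before; its shared backbone is already represented by the first entry.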

Benefits of a Non-Redundant Database for Screening

Elimination of redundant sequence segments reduces UniVec to less than 20% of the size of an equivalent database containing the full sequences for the same set of vectors. This has two major benefits for screening:

The computation time required to screen a query sequence is reduced significantly.
Analysis of the results is facilitated because redundant hits to multiple copies of the same sequence are largely eliminated.

Pseudo-Circularization

Most vectors are circular, but their sequences are represented in linear form by opening the sequence at one particular location (the circular junction). Because programs such as BLAST are unable to extend a match across the end/beginning of the linear sequence, contamination with a segment of vector that spans the circular junction may be missed, or its full extent and strength may be underestimated. To circumvent this limitation, a copy of the first 49 bases of the sequence for a circular vector is appended to the end of the sequence before it is processed for addition to UniVec. This "pseudo-circularization" allows matches that span the circular junction to be identified correctly.
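A minimal sketch of pseudo-circularization, assuming sequences are held as plain Python strings, is shown here. The function name `pseudo_circularize` and the toy sequence are illustrative only. The default overlap of 49 comes from the text; one plausible reading is that it equals one base less than a 50-base window, so that every window spanning the junction fits inside the extended sequence.

```python
def pseudo_circularize(seq, overlap=49):
    """Append a copy of the first `overlap` bases to the end of seq."""
    return seq + seq[:overlap]


# Toy circular vector, opened at an arbitrary junction.
vector = "AAAACCCCGGGGTTTT"

# A contaminating fragment that spans the junction: last 3 + first 3 bases.
fragment = vector[-3:] + vector[:3]   # "TTTAAA"

found_linear = fragment in vector                                   # False
found_circular = fragment in pseudo_circularize(vector, overlap=5)  # True
```

In the linear representation the fragment cannot be found, because its two halves sit at opposite ends of the string; after pseudo-circularization it appears as one contiguous match.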

Vectors Represented in the UniVec Database

UniVec contains the unique segments, as well as a single copy of each of the shared segments, from all the vector, adapter, linker, and primer sequences that were used to build the database. The sequences used to build the current version of UniVec are listed in the current UniVec representation list.

Screening a query sequence against UniVec will result in the detection of significant contamination with any sequence from the current UniVec representation list. Contamination with vectors not on this list can also be detected if the vector is similar to one of the represented vectors, although in such cases the full extent of the contamination may not be reported.

UniVec will be periodically updated with additional vector sequences.

Send suggestions for additional sequences to include in UniVec to the NCBI Service Desk. Please give a brief description of the vector or oligonucleotide, indicate where the sequence can be obtained, and provide references to a detailed description if available. Vectors and oligonucleotides that are commonly used for cloning and/or amplification will be given the highest priority for inclusion in future versions of UniVec.

Sources of the Sequences in UniVec

Most of the sequences in UniVec were derived from GenBank entries. In these cases, the parent sequence and its annotation (when available) can be obtained from Entrez Nucleotide using the GenBank accession.version identifier given in the UniVec definition line.

The sequences for some commercial vectors, linkers, adapters, and primers that are not available in GenBank were obtained from company Web sites or product literature. UniVec entries derived from such non-GenBank sequences have a definition line containing an identifier of the form NGBxxxxx.x. The most up-to-date versions of these non-GenBank sequences, and in many cases annotations, can be obtained from the Web site of the company concerned.
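To separate the two kinds of entries programmatically, an identifier can be tested against the NGBxxxxx.x pattern. The sketch below assumes that each "x" stands for a digit (five digits, a dot, then a numeric version) and that the identifier has already been extracted from the definition line; the exact layout of the definition line is not assumed here. The function name and the sample identifiers are made up for illustration.

```python
import re

# Assumed reading of "NGBxxxxx.x": NGB, five digits, a dot, a numeric version.
NON_GENBANK_ID = re.compile(r"NGB\d{5}\.\d+")


def is_non_genbank_id(identifier):
    """Return True if the identifier has the NGBxxxxx.x form."""
    return NON_GENBANK_ID.fullmatch(identifier) is not None


# Made-up identifiers, used only to exercise the pattern.
print(is_non_genbank_id("NGB12345.1"))  # True
print(is_non_genbank_id("X00000.1"))    # False
```

Identifiers that do not match would then be treated as GenBank accession.version values and looked up in Entrez Nucleotide.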

Limitations of the UniVec Database

UniVec was built so that every unique sequence of 50 (or fewer) contiguous bases from the input sequences is represented in the database. Longer stretches of sequence are not necessarily represented as one contiguous piece in UniVec. This particular construction places certain limitations on the use of the database and on the interpretation of results from a search against the database.
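The consequence of this construction can be illustrated with a toy check, again in Python. The sketch assumes exact substring matching and uses a window of 4 bases in place of 50 so the example stays readable; the database contents and the helper name `all_kmers_represented` are hypothetical.

```python
def all_kmers_represented(seq, database, k):
    """Return True if every k-base window of seq occurs in some database entry."""
    n_windows = max(1, len(seq) - k + 1)
    return all(
        any(seq[i:i + k] in segment for segment in database)
        for i in range(n_windows)
    )


# Toy non-redundant database: one full vector, plus the novel part of a second.
database = ["AAAACCCCGGGG", "CCCTTTT"]
second_vector = "AAAACCCCTTTT"

every_4mer_present = all_kmers_represented(second_vector, database, k=4)  # True
every_5mer_present = all_kmers_represented(second_vector, database, k=5)  # False
contiguous_copy = any(second_vector in segment for segment in database)   # False
```

Every window of the design length is present, yet a longer window (here 5 bases, containing CCCCT) is split across two entries, and the second vector never appears as one contiguous piece. This mirrors why hits against UniVec are guaranteed only up to the design length.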

Searches using UniVec will not indicate the identity of the vector having the strongest match to the query sequence. The full extent of the match to any individual vector will not be apparent because the sequence for most vectors in UniVec is not present as one contiguous piece. The most likely source(s) of vector contamination can be deduced from the cloning history of the sequenced DNA (more details are available in Interpretation of VecScreen Results). If it is necessary to identify the vector that has the best match to the query sequence, a BLAST search should be made using a database that contains a contiguous sequence for each vector, such as the artificial sequences subset of NCBI's nr/nt database.

UniVec should not be used for a search in which the criteria for a significant hit require an alignment of more than 50 bases.