ProSplign is a global alignment tool developed by Dr. Boris Kiryutin. It produces accurate spliced alignments and can align distantly related proteins with low similarity. Extra effort is made to locate frameshift positions.
The ProSplign algorithm is an integral component of the NCBI Eukaryotic Genome Annotation Pipeline, which has been used to annotate the genomes of many plant and animal species (such as human, mouse, and cow). The Pipeline was used by the Sea Urchin Genome Sequencing center for sequence analysis of the 814-megabase genome of the sea urchin Strongylocentrotus purpuratus, which was published in Science in 2006. The integration of ProSplign with the genome annotation pipeline significantly improved the quality of genome annotation over the methods available at the time. Following that success, it was used to annotate Tribolium castaneum (Nature, 2008), taurine cattle (Science, 2009), Acyrthosiphon pisum (PLoS Biology, 2010), Nasonia (Science, 2010), and many other genomes.
ProSplign is also a central part of the automatic pipeline for Influenza virus genomes, an important part of the Influenza Genome Sequencing Project. Sponsored by the National Institutes of Health, the Influenza Project is an international collaboration of critical importance for public health. It has already led to multiple new discoveries about the recent evolution and pathogenesis of influenza, which have been published in leading journals including Journal of Virology, PLoS Biology, and Nature.
ProSplign is a utility for computing the alignment of proteins to genomic nucleotide sequence. This alignment can include eukaryotic splicing. At the heart of the program is a global alignment algorithm that specifically accounts for introns and splice signals. This algorithm is what makes ProSplign accurate in determining splice sites and tolerant of sequencing errors.
ProSplign uses BLAST hits to identify possible locations of genes and their duplications on genomic sequences and then to speed up the core dynamic programming.
This web site is a single-point source of information on ProSplign, the tool for computing protein-to-genomic alignments that include an effort to account for mRNA splicing. ProSplign was developed with the following goals in mind:
- Accuracy in determining splice signals
- Recognition of short exons and non-consensus splices where feasible
- Ability to identify and separate multiple compartments typically representing gene copying events
- Frameshift detection
ProSplign is used in the NCBI Eukaryotic Genome Annotation Pipeline to compute spliced protein alignments and in the NCBI Prokaryotic Genome Annotation Pipeline to find frameshifted genes and to locate frameshift positions on the genome.
ProSplign is available in several forms. There is no online version of ProSplign. You must download and install the console version, which is available for Linux (and may also be available for a few other platforms on request). You can also link to the ProSplign library from your own applications in a portable way, since ProSplign is part of the NCBI C++ Toolkit. Finally, ProSplign is available as a plugin for NCBI Genome Workbench.
Reference: ProSplign - Protein to Genomic Alignment Tool. B. Kiryutin, A. Souvorov, T. Tatusova. Manuscript in preparation
Binaries (updated 02/23/15)
Pre-built executables are available for
Linux/i386 (64bit)
Sources
ProSplign was written for gene prediction at NCBI. No effort is made to maintain backward compatibility between versions.
ProSplign is included in the NCBI C++ Toolkit. For details on how to download, configure, and build the Toolkit, please consult the NCBI C++ Toolkit book.
You can browse the Toolkit's code through the LXR or Doxygen source browsers. Search for the C/C++ symbol CProSplign to go directly to the ProSplign sources.
Graphical view
NCBI Genome Workbench provides graphical alignment views. Watch the NCBI Genome Workbench tutorial for ProSplign.
A video tutorial is also available on YouTube.
Using the console version
The console version of ProSplign can be launched in two modes: pairwise and batch. Pairwise mode is useful if you need to quickly align a few sequences and do not want to compute separate BLAST hits for them. Batch mode is the better choice for large-scale alignment jobs, e.g. as part of your genome annotation process. To see the parameters, run "./prosplign -help". Most of the parameters are intended for the internal NCBI gene prediction process.
In pairwise mode, put your protein query and nucleic acid subject sequences in two files (only the first sequence in each file will be aligned) and run the command line "./prosplign -full -nfa nuc.fa -pfa prot.fa -out aln.txt -fasn aln.asn". The nfa parameter names the file with the nucleic acid subject, and the pfa parameter names the file with the protein query. Text output is written to the file specified by the out parameter, and ASN.1 output is written to the file specified by the fasn parameter.
Batch mode is organized in three steps.
- Run BLAST to generate 12-column, tab-separated output. Make sure the output is sorted by subject and then by query.
For example (input FASTA files can be found here):
makeblastdb -dbtype nucl -in subj.fa
tblastn -query query.fa -db subj.fa -outfmt 6 | sort -k 2,2 -k 1,1 > blast.hit
resulting in:
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 35.000 140 57 5 58 163 20639910 20639491 2.87e-11 62.0
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 42.400 125 39 3 58 149 20602325 20601951 1.35e-15 74.7
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 42.400 125 39 3 58 149 20625221 20624847 1.44e-14 71.6
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 45.455 88 44 3 108 191 20647262 20646999 2.94e-12 64.7
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 47.500 40 20 1 58 96 20610519 20610400 1.44e-05 45.1
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 52.500 40 19 0 22 61 20602657 20602538 1.20e-05 45.1
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 52.500 40 19 0 22 61 20625553 20625434 1.44e-05 45.1
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 52.500 40 19 0 22 61 20640242 20640123 3.08e-05 43.9
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 55.000 40 17 1 58 96 20647507 20647388 6.43e-08 52.0
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 56.897 58 23 2 108 163 20610274 20610101 4.94e-11 61.2
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 60.976 41 16 0 22 62 20610837 20610715 5.15e-10 58.2
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 63.235 68 24 1 149 216 20609895 20609695 5.39e-23 96.3
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 63.235 68 24 1 149 216 20639285 20639085 4.97e-21 90.5
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 65.000 40 14 0 22 61 20647824 20647705 7.01e-10 57.8
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 66.176 68 22 1 149 216 20601700 20601500 4.58e-23 96.3
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 66.176 68 22 1 149 216 20624596 20624396 4.58e-23 96.3
gi|6679997|ref|NP_032143.1| gi|37544107|ref|NT_010783.14|Hs17_10940 67.647 68 21 1 149 216 20646883 20646683 5.12e-25 102
- Run the compartment tool to find the approximate locations of the protein instances on the nucleic acid sequence (cat blast.hit | ./procompart -t > comp). Each line of the output file represents a single instance, or 'compartment'.
1 NT_010783.14 NP_032143.1 20601000 20603157 - 195 210.778
2 NT_010783.14 NP_032143.1 20609195 20611337 - 184 238.625
3 NT_010783.14 NP_032143.1 20623896 20626053 - 195 207.712
4 NT_010783.14 NP_032143.1 20638585 20640742 - 195 183.236
5 NT_010783.14 NP_032143.1 20646183 20648324 - 184 238.046
The tab-separated columns are: compartment number, genomic id, protein id, compartment from, compartment to, strand, protein coverage, compartment BLAST score.
The last two columns are for internal use and are ignored by ProSplign.
- Run ProSplign with the compartment file and the FASTA files to generate an alignment for each compartment (./prosplign -i comp -fasta subj.fa,query.fa -nogenbank -o pro.asn -eo pro.txt).
The .asn file contains alignments in ASN.1 format. The .txt file is designed for human reading.
When ProSplign is run without the '-full' option, the output file shows 'partial' alignments. A partial alignment is made from the full global alignment by discarding low-identity portions of the alignment and keeping the conserved portions. The conserved portions are marked in the text file 'pro.txt' with stars in the status line. Introns are marked with dots in the protein line.
For example, the following fragment
1 NT_010783.14 NP_032143.1 20601000 20603157 -
20602957 CCTTTGGGCACAACGTGTCCTGAGGGGAGAGGCAGCGCCCTGTAGATGGGACGGGGGCACTAACCCTCAGGTTTGGGGCTTATGAATGTGAGTATCGCCA 20602858
------------------ M A T D ----------------------------------------------------------------------
20602857 TCTAAGGCCAGATATTTGGCCAATCTCTGAATGTTCCTGGTCTCTGGAGGGATGGAGAGAGAGAAAAAAACAAACAGCTCCTGGAGCAGGGAGAGCGCTG 20602758
----------------------------------------------------------------------------------------------------
20602757 GCCTCTTCCTCTCCGGCTCCCTCCATTGCCCTCCGGTTTCTCCCCAGGCTCCCGGACGTCCCTGCTCCTGGCTTTTGCCCTGCTCTGCCTGCCCTGGCTT 20602658
S R T S L L L A F A L L C L P W L
| | | | | | + | | | | |
------------------------------------------------- S R T S W L L T V S L L C L L W P
***************************************************
20602657 CAAGAGGCTGGTGCCGTCCAAACCGTTCCGTTATCCAGGCTTTTTGACCACGCTATGCTCCAAGCCCATCGCGCGCACCAGCTGGCCATTGACACCTACC 20602558
Q E A G A V Q T V P L S R L F D H A M L Q A H R A H Q L A I D T Y
| | | | + | | | | | + | + | + | | | | | | | |
Q E A S A F P A M P L S S L F S N A V L R A Q H L H Q L A A D T Y
****************************************************************************************************
20602557 AGGAGTTTGTAAGTTCTTGGGGAATGGGTGCGGGTCAGGGGTGGCAAGAAGGGGTGACTTTCCCCCACTGGGGAAGTAATGGGAGGAGACTAAGGAGCTC 20602458
Q E F
+ | |
K E F ............................................................................................
****************************************************************************************************
20602457 AGGGTTGTTTTCTGAAGCGAAAATGCAGGCAGATGAGCATAGGCTGAGCCAGGTTCCCAGAAAAGCAACAATGGGAGCTGGTCTCCAGCATAGAAACCAG 20602358
....................................................................................................
****************************************************************************************************
20602357 CAGTCCTTCTTGGTGGGGGGTCCTTCTCCTAGGAAGAAACCTATATCCCAAAGGACCAGAAGTATTCATTCCTGCATGACTCCCAGACCTCCTTCTGCTT 20602258
E E T Y I P K D Q K Y S F L H D S Q T S F C F
| | | | + | + | | + + + | + | | |
................................ E R A Y I P E G Q R Y S --- I Q N A Q A A F C F
****************************************************************************************************
...
means that the first four amino acids (MATD) were not aligned. The alignment starts with SRTS... on the protein.
The first exon ends at KEF. The second exon starts with ERA... on the protein. The intron, which has a GT/AG splice, is marked with dots.
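The two intermediate files of the batch steps above (the BLAST hit file and the compartment file) are plain tab-separated text, so they are easy to inspect before the final ProSplign run. The following is a minimal sketch, not part of ProSplign: the class and function names are made up for illustration, the hit columns follow the default BLAST tabular layout (-outfmt 6), and the compartment columns follow the list given above. The sort check uses Python string ordering, which can differ from the ordering produced by `sort` under some locales.

```python
from typing import Iterable, List, NamedTuple, Optional


class BlastHit(NamedTuple):
    """One row of default 12-column BLAST tabular output (-outfmt 6)."""
    query: str
    subject: str
    pident: float
    length: int
    mismatch: int
    gapopen: int
    qstart: int
    qend: int
    sstart: int
    send: int
    evalue: float
    bitscore: float


class Compartment(NamedTuple):
    """One row of the compartment file, columns as listed in the text."""
    number: int
    genomic_id: str
    protein_id: str
    start: int
    end: int
    strand: str
    coverage: Optional[float]  # internal use, may be absent
    score: Optional[float]     # internal use, may be absent


def parse_hits(lines: Iterable[str]) -> List[BlastHit]:
    hits = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 12:
            raise ValueError(f"expected 12 columns, got {len(fields)}")
        query, subject, pident, *ints, evalue, bits = fields
        hits.append(BlastHit(query, subject, float(pident),
                             *map(int, ints), float(evalue), float(bits)))
    return hits


def is_sorted_by_subject_then_query(hits: List[BlastHit]) -> bool:
    keys = [(h.subject, h.query) for h in hits]
    return keys == sorted(keys)


def parse_compartments(lines: Iterable[str]) -> List[Compartment]:
    comps = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # skip blank or incomplete lines
        # The last two columns are optional here; pad with None if missing.
        extra = [float(x) for x in fields[6:8]] + [None, None]
        comps.append(Compartment(int(fields[0]), fields[1], fields[2],
                                 int(fields[3]), int(fields[4]), fields[5],
                                 extra[0], extra[1]))
    return comps
```

A typical use would be `parse_hits(open("blast.hit"))` followed by the sort check, and `parse_compartments(open("comp"))` to list the intervals that ProSplign will be asked to align.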
Algorithmic details
ProSplign works with input sequences on a pairwise basis. In other words, exon/intron structures are determined independently for each query and subject.
Dynamic programming alone is accurate in determining splice junctions but is computationally expensive. Also, if several copies of a gene lie on the same genomic sequence and strand, applying it directly may produce incorrect results by connecting exons from different copies.
Thus, for every input query/subject pair, it is important to localize genes on the genomic sequence, which ProSplign achieves by grouping the BLAST hits into compartments.
The compartmentization step starts with computing protein-to-genomic BLAST hits. These give initial insight into the structure of compartments. Hits are separated into two same-strand sets, and compartments are then identified within each strand. To do so, we formally define the optimization problem in terms of genomic sequence coverage and then solve it with a dynamic programming algorithm whose running time is short compared to the core dynamic programming described above.
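To make the idea of separating hits by strand and then grouping them concrete, here is a deliberately simplified sketch. It is not the ProSplign algorithm: instead of the coverage-based optimization described above, it uses a greedy rule that merges hits whose genomic distance is at most a hypothetical `max_gap`, and it ignores protein coordinates entirely. The function names and the `max_gap` parameter are assumptions made for illustration. The only BLAST convention relied on is that, in tabular tblastn output, a subject start greater than the subject end indicates a minus-strand hit.

```python
from typing import List, Tuple

# (sstart, send) as they appear in columns 9 and 10 of BLAST tabular output
Hit = Tuple[int, int]
Interval = Tuple[int, int]  # (low, high) with low <= high


def split_by_strand(hits: List[Hit]) -> Tuple[List[Interval], List[Interval]]:
    """Separate hits into plus and minus strand sets, normalizing coordinates."""
    plus = [(s, e) for s, e in hits if s <= e]
    minus = [(e, s) for s, e in hits if s > e]
    return plus, minus


def group_by_gap(intervals: List[Interval], max_gap: int) -> List[Interval]:
    """Greedy stand-in for compartmentization: merge same-strand intervals
    separated by at most max_gap bases (hypothetical threshold)."""
    groups: List[Interval] = []
    for low, high in sorted(intervals):
        if groups and low - groups[-1][1] <= max_gap:
            # Close enough to the current group: extend it.
            groups[-1] = (groups[-1][0], max(groups[-1][1], high))
        else:
            # Too far away: start a new group.
            groups.append((low, high))
    return groups
```

A greedy gap rule like this can split a gene with a very long intron or merge two closely spaced copies, which is the kind of mistake a formally defined optimization is meant to avoid.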
Frequently Asked Questions
Q: Why am I getting "Unable to locate XXX" exceptions?
A: Please make sure that the sequence identifiers in the input hit file match those in the index file. When indexing your FASTA files, ProSplign records sequence IDs exactly as they appear after the leading '>', while your BLAST program may have printed them slightly differently.
Q: What does the 'No compartment found' log file message mean? What is a compartment?
A: A compartment is a localized interval on the genomic sequence that provides bounds for ProSplign in its search for exons. Compartments are identified from the input BLAST hits, so this message is generated when there are not enough hits, or when the hits are too weak or too inconsistent with each other to form a compartment.
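For the "Unable to locate XXX" question above, a quick way to find the offending identifiers is to compare the IDs in the FASTA headers with the IDs in the first two columns of the hit file. The sketch below is an illustration only, and its function names are hypothetical. It assumes that a sequence ID is the header text after '>' up to the first whitespace, which may not match exactly how ProSplign indexes headers.

```python
from typing import Iterable, List, Set


def fasta_ids(lines: Iterable[str]) -> Set[str]:
    """Collect IDs from FASTA header lines.

    Assumption: the ID is the text after '>' up to the first whitespace.
    """
    ids = set()
    for line in lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def ids_missing_from_fasta(fasta_lines: Iterable[str],
                           hit_lines: Iterable[str]) -> List[str]:
    """Return IDs used in the first two hit columns but absent from the FASTA."""
    known = fasta_ids(fasta_lines)
    used = set()
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            used.update(fields[:2])  # query ID and subject ID
    return sorted(used - known)
```

If the FASTA header reads `>NP_032143.1` but the hit file contains `gi|6679997|ref|NP_032143.1|`, the second form is reported as missing, and one of the two files needs to be adjusted so that the identifiers agree.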