CDART Help Document

	This help document describes how to use CDART, including detailed descriptions of the input required, output displays, and the program's features and functions. The Conserved Domains resources page describes additional, related resources and provides "How To" guides that illustrate how those resources can be used.

What is CDART?

Conserved Domain Architecture Retrieval Tool

Quick start guide

illustrated example showing the 1-2-3 step process for using the tool

Input Options

enter query directly into CDART home page

as a protein sequence

protein unique identifiers (UIDs)
protein sequence data

as a set of conserved domains (CDs)

conserved domain superfamily cluster IDs
PSSM IDs for specific domain models
mix of superfamily cluster IDs and PSSM IDs

enter multiple queries

retrieve sequence record from Entrez Protein

follow the "Domain relatives" link

Output Display

graphical summary of similar domain architectures

query
list of similar domain architectures

Filter your results

refine your results to include/exclude conserved domain superfamilies

information provided for each domain architecture

title
taxonomy span
similarity score
total nr sequences
lookup sequences in Entrez

References

Citing CDART

What is CDART?

The Conserved Domain Architecture Retrieval Tool (CDART) performs similarity searches of the Entrez Protein database based on domain architecture. A domain architecture is defined as the sequential order of conserved domains (functional units) in a protein sequence.

In this way, CDART finds protein similarities across significant evolutionary distances using sensitive domain profiles rather than direct sequence similarity.

Given a query sequence, CDART shows the conserved domains that make up a protein and then lists proteins with a similar domain architecture. The conserved domains in a sequence are found by RPS-BLAST, which represents each domain as a position-specific scoring matrix (PSSM), a set of probabilities of amino acids occurring at each position of the domain. RPS-BLAST performs a "profile" search, which is a sensitive way to look for sequence homologues. Proteins similar to the query are then grouped and scored by domain architecture.
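To make the idea of a profile concrete, the following toy sketch scores short windows of a sequence against a small table of per-position amino acid probabilities. It is not RPS-BLAST: the three-column profile, the floor probability, and the function names are all invented for illustration, and real PSSM searches use log-odds scores and gapped alignment.

```python
import math

# Hypothetical 3-position profile: probability of each residue per position.
PROFILE = [
    {"G": 0.8, "A": 0.2},
    {"K": 0.6, "R": 0.4},
    {"S": 0.5, "T": 0.5},
]
# Assumed small probability for residues not listed at a position.
FLOOR = 0.01


def window_score(window):
    """Sum of log probabilities of the window's residues under the profile."""
    return sum(math.log(col.get(res, FLOOR)) for col, res in zip(PROFILE, window))


def best_window(sequence):
    """Return the start index of the best-scoring window in the sequence."""
    width = len(PROFILE)
    starts = range(len(sequence) - width + 1)
    return max(starts, key=lambda i: window_score(sequence[i:i + width]))
```

In this sketch, a window whose residues are all likely at their positions outscores windows containing unlisted residues, which is the intuition behind locating a domain footprint on a protein.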

You can either search CDART directly with a query protein sequence, or retrieve a protein sequence record from the Entrez Protein database and select "Domain Relatives" from the "Related Information" menu in the right margin of the page to see the precalculated CDART results. Relying on domain profiles allows CDART to be fast and, because it relies on annotated functional domains, informative.

This tool is designed for interactive use. Scripting is not supported.

(A related tool, SPARCLE, the Subfamily Protein Architecture Labeling Engine, is a resource for the functional characterization and labeling of protein sequences that have been grouped by their characteristic conserved domain architecture. The SPARCLE Help document provides a comparison of CDD, CDART, and SPARCLE and includes examples of how each resource can be used.)

Quick Start Guide

The illustration below shows the easy, 1-2-3 step process for using the Conserved Domain Architecture Retrieval Tool (CDART). Click on any frame of the image below to link to subsequent sections in this help document, which provide additional details about the input options and output display.

If you would like to try this example yourself, open the CDART home page and enter NP_002917 (regulator of G-protein signaling 12 isoform 2) as the query, or retrieve the sequence record from the Entrez Protein database and then follow the link for "Domain Relatives" that appears under "Related Information" in the right margin.

Input Options

| enter query directly into CDART home page as a protein sequence, set of conserved domains, or multiple queries |
| retrieve sequence record from Entrez Protein → follow "Domain Relatives" link |

Enter query directly into CDART home page

Illustration of the CDART home page, where you can input a query as a protein sequence, as a set of conserved domains, or as multiple queries. See the corresponding text for details and examples. Click on this image to see the complete illustration of the steps in using CDART, featured in the Quick Start Guide.

One way to retrieve proteins with similar domain architectures is to enter your query as a protein sequence, or as a set of conserved domains, directly into the CDART home page in any of the following formats:

Protein sequence

You can submit a protein sequence as:

a protein unique identifier (UID) - enter the Accession or GI number of any protein that is in the Entrez Protein database.

protein sequence data - enter the sequence in FASTA format or as bare sequence data.

The CDART results will show the functional domains found in the query protein and will list proteins with a similar domain architecture. The similar proteins must include at least one of the conserved domain superfamilies in the query sequence. The similarity score of each domain architecture indicates the number of domain superfamilies in the architecture that match domain superfamilies in the query protein, and is used to rank the search results.

Set of conserved domains (CDs)

As an alternative to submitting a protein sequence as a query, you can specify a query as a set of one or more conserved domains, using any of the identifiers below to specify your domains of interest. Enter the identifiers on a single line, separated by commas and surrounded by square brackets [], as in the examples below:

conserved domain superfamily cluster IDs - As explained in the Conserved Domain Database help document, a superfamily cluster is a set of conserved domain models that generate overlapping annotation on the same protein sequences. These models are assumed to represent evolutionarily related domains and may be redundant with each other.

A superfamily ID (accession number) begins with the prefix "cl" for "cluster," and can be entered in CDART as the complete alphanumeric accession or as digits only (with or without the leading zeros). For example, a query to retrieve proteins with domain architectures that include superfamilies cl00075 (HATPase_c Superfamily) and cl02783 (TopoII_MutL_Trans Superfamily) can be entered in any of the following ways:

[cl00075,cl02783]
[00075,02783]
[75,2783]
[cl00075,2783]
etc.
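The equivalence of these forms can be illustrated with a short sketch that reduces each entry to one canonical spelling. The function names are hypothetical, and the five-digit zero padding is an assumption taken from the examples above (cl00075, cl02783); this is not CDART's own parsing code.

```python
def normalize_cluster_id(token):
    """Return an assumed canonical 'clNNNNN' form of a superfamily cluster ID."""
    text = token.strip().lower()
    # Accept the ID with or without the "cl" prefix.
    digits = text[2:] if text.startswith("cl") else text
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"not a cluster ID: {token!r}")
    # int() drops leading zeros; the format spec pads back to five digits.
    return f"cl{int(digits):05d}"


def parse_cluster_set(query):
    """Parse a bracketed, comma-separated set of cluster IDs."""
    inner = query.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError("a domain set must be surrounded by square brackets")
    return [normalize_cluster_id(t) for t in inner[1:-1].split(",")]
```

Under these assumptions, all four example queries above parse to the same list of two cluster IDs.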

Accession numbers or PSSM IDs for specific domain models - If you are interested in a specific conserved domain model, you can enter its conserved domain accession number or position-specific scoring matrix ID (PSSM ID). If you enter a PSSM ID, be sure to include a leading "p" so that it is not interpreted as a cluster ID. Note: The PSSM ID is displayed in the "Statistics" box of a domain model's summary page in the Conserved Domain Database. For example, a query for the domain models pfam02518 (whose current PSSM ID is 190334) and cd03483 (whose current PSSM ID is 48471) can be entered as:

[pfam02518,cd03483]
or
[p190334,p48471]

mix of superfamily cluster IDs, conserved domain accessions, PSSM IDs - Use the same syntax rules as above. For example, a query to retrieve proteins with domain architectures that include superfamily cl00075 and domain model cd03483 (whose current PSSM ID is 48471) can be entered in any of the following ways:

[cl00075,cd03483]
[cl00075,p48471]
[00075,p48471]
[75,p48471]
etc.
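The syntax rules for mixed queries can be sketched as a small classifier that decides how each entry would be read. This is an illustrative reading of the rules above, not CDART's implementation: the function names are hypothetical, and the pattern used for domain model accessions (letters followed by digits) is an assumption based on the examples pfam02518 and cd03483.

```python
import re


def classify(token):
    """Classify one entry of a domain set as a PSSM ID, cluster ID, or accession."""
    t = token.strip()
    # A leading "p" followed only by digits marks a PSSM ID.
    if re.fullmatch(r"p\d+", t):
        return ("pssm_id", int(t[1:]))
    # Digits alone, or digits with a "cl" prefix, are read as a cluster ID.
    if re.fullmatch(r"(cl)?\d+", t):
        return ("cluster_id", int(t[2:] if t.startswith("cl") else t))
    # Assumed shape of a domain model accession: letters, then digits.
    if re.fullmatch(r"[A-Za-z]+\d+", t):
        return ("domain_accession", t)
    raise ValueError(f"unrecognized entry: {token!r}")


def parse_domain_set(query):
    """Parse a bracketed, comma-separated domain set into classified entries."""
    inner = query.strip()
    if not (inner.startswith("[") and inner.endswith("]")):
        raise ValueError("a domain set must be surrounded by square brackets")
    return [classify(t) for t in inner[1:-1].split(",")]
```

The sketch also shows why the leading "p" matters: without it, 48471 would be classified as a cluster ID rather than a PSSM ID.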

Note: The proteins that are returned by CDART will include at least one of the domains you have specified. The similarity score of each domain architecture indicates the number of domain superfamilies in the architecture that match domain superfamilies in your query, and is used to rank the search results.

Regardless of how you specify the conserved domains in your query (as superfamily cluster IDs or as the accessions or PSSM IDs of individual domain models), the CDART search results will display the superfamilies to which those models belong, and not the individual domain models themselves. However, you can see superfamilies and individual domain models by following the "domain details" link that appears in the expanded view of any domain architecture.

If you enter a single conserved domain as a query, you will retrieve all the domain architectures that contain the domain, ranked by the number of non-redundant proteins that have a given architecture.

Multiple queries

You can enter multiple queries using any of the formats above (i.e., as protein sequences or sets of conserved domains) or a mix of those formats. Note that:

Each protein Accession or GI number should be on a separate line

FASTA formatted sequence data and bare sequences can occupy multiple lines. (The FASTA format definition line, however, should occupy a single line).

If one of your queries is a set of conserved domains, they should be entered on a single line, separated by commas, and surrounded by square brackets [], as in the third line of the example below.

If you include a bare sequence as one of the queries, use a blank line to separate it from the query that precedes it, as in the last part of the example below.

Example: The example below includes six queries, in the following order: (1) protein GI number; (2) protein accession number; (3) set of conserved domain superfamily cluster IDs; (4) protein sequence in FASTA format; (5) another protein sequence in FASTA format; (6) protein as bare sequence data:

   269849668
   EDV04934
   [75,2783]
   >gi|239592572|gb|EEQ75153.1| asparaginase [Ajellomyces dermatitidis SLH14081]
   MSPPIPQPRQRTRSQPLFKPAVILHGGAGNIQHSRLPPELYKQYRTSLLTYLRSTTALLNADIEEEEPSI
   NAKNDAVDDNMRISPASALNAAVHAVSLMEDNELFNCGRGSVFTSAGTIEMEASVMVASLLNDEDSVDDF
   NNSEVNCLASEKTPGSIKRGAGVMLVRNVRHPIQLAKEVLLRTGYASDGDGDGGNMHSQLSGEYVEGLAR
   DWGMEFCPDDWFWTKKRWDEHRRGLKKGKTRGRMTDGRNMGADVEVRGEGEADDGDGLYLSQGTVGCVCL
   DRWGNIAVATSTGGLTNKCPGRIGDTPTLGAGFWAEAWDVEGVEGLSNMSDSSNSVCASGRDRSKGCIQL
   KRDTMNYQTQDGRDNLLAYQASSSTTTTTSSYRMGSQWRSDFDSNSAFTLIRDCFSSSPPPPGYAALEPS
   KYPVEKFPLGKSTSSPHTDFNPHRYSQPQRRRILALSGTGNGDSFLRTAATRTAAAMVRFGSAQNSISLA
   QAVTAVAGPGGELQRSAGRRWGKTGEGEGGIIGIEAEVETDEQTLGEGKLRRGKVVFDFNSTGMFRAWME
   EKDGKDVERMMVFRDDYE
   >gi|336020358|ref|NP_001229488.1| mitogen-activated protein kinase kinase kinase kinase 4 isoform 4 [Homo sapiens]
   MANDSPAKSLVDIDLSSLRDPAGIFELVEVVGNGTYGQVYKGRHVKTGQLAAIKVMDVTEDEEEEIKLEI
   NMLKKYSHHRNIATYYGAFIKKSPPGHDDQLWLVMEFCGAGSITDLVKNTKGNTLKEDWIAYISREILRG
   LAHLHIHHVIHRDIKGQNVLLTENAEVKLVDFGVSAQLDRTVGRRNTFIGTPYWMAPEVIACDENPDATY
   DYRSDLWSCGITAIEMAEGAPPLCDMHPMRALFLIPRNPPPRLKSKKWSKKFFSFIEGCLVKNYMQRPST
   EQLLKHPFIRDQPNERQVRIQLKDHIDRTRKKRGEKDETEYEYSGSEEEEEEVPEQEGEPSSIVNVPGES
   TLRRDFLRLQQENKERSEALRRQQLLQEQQLREQEEYKRQLLAERQKRIEQQKEQRRRLEEQQRREREAR
   RQQEREQRRREQEEKRRLEELERRRKEEEERRRAEEEKRRVEREQEYIRRQLEEEQRHLEVLQQQLLQEQ
   AMLLECRWREMEEHRQAERLQRQLQQEQAYLLSLQHDHRRPHPQHSQQPPPPQQERSKPSFHAPEPKAHY
   EPADRAREVEDRFRKTNHSSPEAQSKQTGRVLEPPVPSRSESFSNGNSESVHPALQRPAEPQVPVRTTSR
   SPVLSRRDSPLQGSGQQNSQAGQRNSTSIEPRLLWERVEKLVPRPGSGSSSGSSNSGSQPGSHPGSQSGS
   GERFRVRSSSKSEGSPSQRLENAVKKPEDKKEVFRPLKPADLTALAKELRAVEDVRPPHKVTDYSSSSEE
   SGTTDEEDDDVEQEGADESTSGPEDTRAASSLNLSNGETESVKTMIVHDDVESEPAMTPSKEGTLIVRQT
   QSASSTLQKHKSSSSFTPFIDPRLLQISPSSGTTVTSVVGFSCDGMRPEAIRQDPTRKGSVVNVNPTNTR
   PQSDTPEIRKYKKRFNSEILCAALWGVNLLVGTESGLMLLDRSGQGKVYPLINRRRFQQMDVLEGLNVLV
   TISGKKDKLRVYYLSWLRNKILHNDPEVEKKQGWTTVGDLEGCVHYKVVKYERIKFLVIALKSSVEVYAW
   APKPYHKFMAFKSFGELVHKPLLVDLTVEEGQRLKVIYGSCAGFHAVDVDSGSVYDIYLPTHIQCSIKPH
   AIIILPNTDGMELLVCYEDEGVYVNTYGRITKDVVLQWGEMPTSVAYIRSNQTMGWGEKAIEIRSVETGH
   LDGVFMHKRAQRLKFLCERNDKVFFASVRSGGSSQVYFMTLGRTSLLSW

   MEQDPKPPRLRLWALIPWLPRKQRPRISQTSLPVPGPGSGPQRDSDEGVLKEISITHHVKAGSEKADPSH
   FELLKVLGQGSFGKVFLVRKVTRPDSGHLYAMKVLKKATLKVRDRVRTKMERDILADVNHPFVVKLHYAF
   QTEGKLYLILDFLRGGDLFTRLSKEVMFTEEDVKFYLAELALGLDHLHSLGIIYRDLKPENILLDEEGHI
   KLTDFGLSKEAIDHEKKAYSFCGTVEYMAPEVVNRQGHSHSADWWSYGVLMFEMLTGSLPFQGKDRKETM
   TLILKAKLGMPQFLSTEAQSLLRALFKRNPANRLGSGPDGAEEIKRHVFYSTIDWNKLYRREIKPPFKPA
   VAQPDDTFYFDTEFTSRTPKDSPGIPPSAGAHQLFRGFSFVATGLMEDDGKPRAPQAPLHSVVQQLHGKN
   LVFSDGYVVKETIGVGSYSECKRCVHKATNMEYAVKVIDKSKRDPSEEIEILLRYGQHPNIITLKDVYDD
   GKHVYLVTELMRGGELLDKILRQKFFSEREASFVLHTIGKTVEYLHSQGVVHRDLKPSNILYVDESGNPE
   CLRICDFGFAKQLRAENGLLMTPCYTANFVAPEVLKRQGYDEGCDIWSLGILLYTMLAGYTPFANGPSDT
   PEEILTRIGSGKFTLSGGNWNTVSETAKDLVSKMLHVDPHQRLTAKQVLQHPWVTQKDKLPQSQLSHQDL
   QLVKGAMAATYSALNSSKPTPQLKPIESSILAQRRVRKLPSTTL
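The layout rules for multiple queries can be sketched as a small splitter that turns pasted input into individual queries. This is only one possible reading of the rules above, not CDART's parser. The function name and the identifier pattern (digits only, or a short letter prefix followed by digits) are assumptions; lines that follow a FASTA definition line are treated as part of that record until the next ">" line or blank line.

```python
import re

# Assumed shape of a GI number or accession: digits, or letters then digits,
# optionally with an underscore and a version suffix (e.g. NP_002917.1).
ID_RE = re.compile(r"^(\d+|[A-Za-z]{1,6}_?\d+(\.\d+)?)$")


def split_queries(text):
    """Split multi-query input into (kind, text) pairs."""
    queries = []
    current = None  # (kind, lines) of the sequence being accumulated

    def flush():
        nonlocal current
        if current is not None:
            queries.append((current[0], "\n".join(current[1])))
            current = None

    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            flush()                      # a blank line ends any open sequence
        elif line.startswith(">"):
            flush()
            current = ("fasta", [line])  # definition line starts a FASTA record
        elif current is not None:
            current[1].append(line)      # continuation of the open sequence
        elif line.startswith("[") and line.endswith("]"):
            queries.append(("domain_set", line))
        elif ID_RE.match(line):
            queries.append(("identifier", line))
        else:
            current = ("bare_sequence", [line])
    flush()
    return queries
```

Applied to input laid out like the example above, this sketch yields two identifiers, one domain set, two FASTA records, and one bare sequence, in that order. The blank line before the bare sequence is what keeps it from being appended to the preceding FASTA record.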

Search Entrez Protein → link to "Domain Relatives"

Illustration of a sample protein sequence record (regulator of G-protein signaling 12 isoform 2, NP_002917) from the Entrez Protein database, where you can follow the link for Domain Relatives to view a list of proteins with similar domain architectures. Click on this image to see the complete illustration of the steps in using CDART, featured in the Quick Start Guide.

A second way to access CDART is to start by retrieving a record of interest from the Entrez Protein database, then follow the "Domain Relatives" link in the right margin of the sequence record. That will open the precalculated CDART results for the protein.

Note that "Domain Relatives" is one of four links available from a protein sequence record to conserved domain annotations, allowing you to choose: (a) the format in which you want to view the conserved domains (e.g., in graphical format as domain footprints aligned to the protein sequence; as a list of records from the Conserved Domain Database, each of which includes a multiple sequence alignment of the proteins used to create the domain model; or as a list of proteins with similar domain architectures), and (b) the level of redundancy in the list of conserved domain models (e.g., a concise list of the top scoring models or a full list of all models that have a statistically significant RPS-BLAST hit to the protein).

The number of conserved domain models retrieved, and the order in which they are sorted/presented, depends upon the view you select:

Domain Relatives -- opens a graphical display of similar domain architectures, as determined by the CDART tool. A domain architecture is defined as the sequential order of conserved domains in a protein query sequence. The score for each CDART hit represents the number of domains that match those found in the query protein. (The CDART paper provides additional details.)

CDD Search Results -- opens a graphical display (illustrated example) of conserved domain model footprints on the query protein, ranked by their RPS-BLAST score and hit type. A model may appear more than once if it aligns to multiple regions of the query sequence. A concise display showing only the top-scoring hits is presented by default, and it can be changed to a full display of all hits if desired. (The CDD help document provides additional details.)

Conserved Domains (Concise) -- opens a concise list of the conserved domain models that are the top-scoring RPS-BLAST hits to the query protein. Each domain model is listed only once, even if a model had a hit to more than one region on the query sequence. (The CDD help document provides additional details.)

Conserved Domains (Full) -- opens a full list of all the conserved domain models that have a statistically significant RPS-BLAST hit to the query protein. Each domain model is listed only once, even if a model had a hit to more than one region on the query sequence. (The CDD help document provides additional details.)

Output

Graphical summary of similar domain architectures

| query | list of similar domain architectures | filter your results | information provided for each domain architecture |

When the query is successful, a graphical interface is provided for the user to navigate through the results.

Illustration of CDART search results, which list proteins that have domain architectures similar to your query protein sequence (NP_002917, regulator of G-protein signaling 12 isoform 2, in this example). Click on this image to see the complete illustration of the steps in using CDART, featured in the Quick Start Guide.

Query
- The query you entered is displayed on a yellow background at the top of the CDART search results.
  - If your query was a protein sequence, the graphic shows the length of the protein in amino acids and the footprints of the highest scoring conserved domain superfamily(ies) found in the query sequence by RPS-BLAST. On the left side of the graphic is the description of the query and the total number of domain architectures found in the Entrez Protein database that contain at least one superfamily from the query. If you used any search result filters, a second number will show the number of remaining architectures after applying the filter. The two numbers may be the same.
  - If your query was a set of conserved domains rather than a protein sequence, the domains will be shown in the same order in which they were input, without a scale showing length. The "total architectures" statistic will indicate the number of domain architectures found in the Entrez Protein database that contain all of the domain superfamilies in your query.
- The Download button to the right of the query graph lets you download all the similar architectures found for this query (after applying the filters, if specified). Clicking the button opens a Save As dialog that lets you save the results as a text file.

List of similar domain architectures
- A domain architecture is defined as the sequential order of conserved domains in a protein sequence. Each domain architecture displayed by CDART therefore represents a unique set and order of conserved domain superfamilies found among sequences in the Entrez Protein database.
- The CDART results list the proteins with a similar domain architecture to your query. The similar proteins must include at least one of the conserved domain superfamilies in the query sequence. The similarity score of each domain architecture indicates the number of domain superfamilies in the architecture that match domain superfamilies in the query protein, and is used to rank the search results. The results are paginated at 10 architectures per page; you can navigate to any page using the page selector or the back/forward arrows in the bottom row of the table.
- Architectures are displayed as graphs. Different superfamilies are displayed as different shape/color combinations. Mouse over a superfamily footprint to open a pop-up that shows the superfamily accession number (cluster ID), title, and description. Click on a footprint to open the Conserved Domain Database summary page for that superfamily, which lists the individual conserved domain models that belong to the superfamily.
- Protein sequences that share the same domain architecture are grouped together, and only one from each group is shown. To see all proteins in the group, follow the Lookup sequences in Entrez link to the right of the architecture graph.

Filter your results
- The "Filter your results" bar at the top of a CDART search results page allows you to refine the results by including/excluding architectures that contain specific domain superfamily(ies). The contents of the "Filter your results" dialog box are generated dynamically and represent the domain superfamilies that were found in your CDART search results. Select the desired superfamilies from the lists provided in the dialog box to include or exclude superfamilies from your CDART search results display.
- As an alternative to selecting domain superfamilies from the lists and then using the "Include" or "Exclude" buttons, you can type the desired parameters directly into the "Filter your results" text box, in a format such as:
  - INCLSFAM[xxx,yyy,zzz]
    Include only sequences that contain the conserved domain superfamilies with cluster IDs xxx AND yyy AND zzz.
  - EXCLSFAM[xxx,yyy,zzz]
    Exclude sequences that contain the conserved domain superfamilies with cluster IDs xxx AND yyy AND zzz.
    
    The above operations can be combined using the logical operators NOT, AND and OR, as well as parentheses. For example:
    
    INCLSFAM[aaa,bbb] AND NOT INCLSFAM[xxx]
    
    will filter your search results so they include only architectures that contain superfamilies aaa AND bbb, but not xxx.
    
    The logical operators are executed with the following precedence:
    
    () > NOT > AND > OR
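The precedence rules above can be illustrated with a small evaluator that decides whether one architecture, given as a set of superfamily cluster IDs, passes a filter expression. This is a sketch under stated assumptions, not CDART's filter engine: the function names are hypothetical, cluster IDs are compared as plain strings, and EXCLSFAM[...] is read as "the architecture does not contain all of the listed superfamilies".

```python
import re

TOKEN_RE = re.compile(r"\s*(?:(INCLSFAM|EXCLSFAM)\[([^\]]*)\]|(NOT|AND|OR|\(|\)))")


def tokenize(expr):
    """Turn a filter expression into (kind, ids) tokens."""
    expr = expr.strip()
    tokens, pos = [], 0
    while pos < len(expr):
        m = TOKEN_RE.match(expr, pos)
        if not m:
            raise ValueError(f"unexpected text: {expr[pos:]!r}")
        if m.group(1):
            ids = frozenset(s.strip() for s in m.group(2).split(",") if s.strip())
            tokens.append((m.group(1), ids))
        else:
            tokens.append((m.group(3), None))
        pos = m.end()
    return tokens


def passes_filter(expr, superfamilies):
    """Evaluate expr for one architecture; precedence is () > NOT > AND > OR."""
    tokens, sf, pos = tokenize(expr), set(superfamilies), 0

    def peek():
        return tokens[pos][0] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def parse_or():                      # lowest precedence
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        value = parse_not()
        while peek() == "AND":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():                    # highest precedence
        if peek() is None:
            raise ValueError("unexpected end of expression")
        kind, ids = advance()
        if kind == "(":
            value = parse_or()
            if peek() != ")":
                raise ValueError("missing closing parenthesis")
            advance()
            return value
        if kind == "INCLSFAM":
            return ids <= sf             # all listed superfamilies present
        if kind == "EXCLSFAM":
            return not (ids <= sf)       # assumed reading, see lead-in
        raise ValueError(f"unexpected token {kind!r}")

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result
```

For instance, with placeholder IDs, `passes_filter("INCLSFAM[aaa,bbb] AND NOT INCLSFAM[xxx]", {"aaa", "bbb"})` evaluates to true in this sketch, and adding "xxx" to the architecture makes it false.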

Information provided for each domain architecture
- Title - The title of the representative protein sequence that contains this domain architecture.
- Taxonomy span - The highest taxonomic node common to all sequences with this domain architecture.
- Similarity score - The number of conserved domain superfamilies in this domain architecture that match superfamilies in the query sequence. The score does not count repeats of superfamilies within a domain architecture; rather, it counts each superfamily only once, regardless of how many times that superfamily appears in a protein sequence.
  
  The domain architectures shown on a CDART search results page are ranked/sorted by score. If two or more architectures have the same score, they are ranked by the number of non-redundant protein sequences that contain the architecture, so architectures supported by few sequences, which may be spurious hits, appear at the bottom.
- Total nr (non-redundant) sequences -
  - Protein sequences that share the same domain architecture are grouped together.
  - If two or more of the proteins are identical in length and composition, they are placed in the same identical protein group ("IPG"). CDART displays one representative sequence from each IPG in order to produce a non-redundant (nr) list of sequences that contain a specific domain architecture. To find all sequences in this group, follow the Lookup sequences in Entrez link.
  - The "total nr sequences" statistic represents the total number of IPGs with the domain architecture.
- Lookup sequences in Entrez - Retrieves the non-redundant set of sequence records that contain the domain architecture. Following the link leaves the search results page, so consider right-clicking it and opening it in a new window.
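The similarity score and ranking described in this list can be sketched as follows. The data layout (an architecture as an ordered list of superfamily cluster IDs paired with its total nr sequence count) and the function names are assumptions made for illustration; the counts in the usage note are invented.

```python
def similarity_score(query_superfamilies, architecture):
    """Count distinct query superfamilies present in the architecture.

    Using sets means a superfamily that repeats within an architecture
    is counted only once, as described above.
    """
    return len(set(query_superfamilies) & set(architecture))


def rank_architectures(query_superfamilies, architectures):
    """Sort (architecture, nr_count) pairs by score, then by nr_count, descending."""
    return sorted(
        architectures,
        key=lambda item: (-similarity_score(query_superfamilies, item[0]), -item[1]),
    )
```

For a query with superfamilies cl00075 and cl02783, an architecture containing both ranks above one containing only cl00075, even if the latter is found in more sequences. Among architectures with equal scores, the one with the larger nr sequence count comes first.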

Navigate multiple queries

The graphical interface can only navigate results of one query at a time. If multiple queries have been specified, some extra controls will appear.

The View Query selector lets you choose which query's results to view. The Apply to all button applies a filter to the results of all queries. The Download All button lets you download the similar architectures from all queries at once.

References

Citing CDART:

Geer LY, Domrachev M, Lipman DJ, Bryant SH. CDART: protein homology by domain architecture. Genome Res. 2002 Oct;12(10):1619-23.

Revised 09 August 2017