This database provides a snapshot of the current taxonomic distribution of nucleotide sequences in GenBank.
Its purpose is to convey information about the potential phylogenetic data sets (clusters, or sets of homologous sequences) that can be constructed from the database for taxa of interest. It mirrors the NCBI taxonomy tree.
The number of clusters is estimated by all-against-all BLAST searches and sequence clustering algorithms (for all nodes with < 35,000 sequences, and excluding sequences > 25,000 nt in length).
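The size limits above can be sketched as a simple pre-filter. This is only an illustration, not the database's actual pipeline: the `Sequence` class, the function name, and the constants are hypothetical, and it assumes the node-size limit is checked on the raw sequence count before over-long sequences are dropped (the text does not say which comes first).

```python
from dataclasses import dataclass

# Thresholds taken from the text; the names are illustrative only.
MAX_SEQS_PER_NODE = 35_000   # a node is clustered only if it has fewer sequences than this
MAX_SEQ_LENGTH_NT = 25_000   # sequences longer than this are excluded


@dataclass
class Sequence:
    gi: int          # sequence identifier (hypothetical field name)
    length_nt: int   # sequence length in nucleotides


def sequences_to_cluster(node_sequences):
    """Return the sequences eligible for clustering at one node.

    Assumption: the node-size check uses the raw count of sequences,
    and an oversized node contributes nothing.
    """
    if len(node_sequences) >= MAX_SEQS_PER_NODE:
        return []
    return [s for s in node_sequences if s.length_nt <= MAX_SEQ_LENGTH_NT]
```

The eligible sequences would then be passed to the all-against-all BLAST and clustering steps, which are not reproduced here.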
Model organisms are defined as any node (not subtree) having more than 100 clusters or more than 10,000 sequences. By default, sequence tallies for model organisms propagate upward in the tree along with those for non-model organisms, but this information can be excluded so that users can get a sense of the taxonomic breadth of the sequence diversity in the database. Note, however, that the bulk of "genomic" data for model organisms is not entered in the database at all (see below for types of sequences included).
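The model-organism rule and the upward propagation of tallies can be sketched as follows. The `Node` class, its field names, and both functions are hypothetical; the sketch assumes that excluding a model organism means leaving its own sequences out of every ancestor's tally.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    n_seqs: int = 0        # sequences assigned directly to this node
    n_clusters: int = 0    # clusters built from this node alone (not its subtree)
    children: list = field(default_factory=list)


def is_model(node):
    """Model organism: a node with >100 clusters or more than 10,000 sequences."""
    return node.n_clusters > 100 or node.n_seqs > 10_000


def subtree_tally(node, include_model=True):
    """Sum sequence counts over a subtree, optionally skipping model organisms."""
    own = node.n_seqs if (include_model or not is_model(node)) else 0
    return own + sum(subtree_tally(child, include_model) for child in node.children)
```

With made-up counts, a genus holding one model species (50,000 sequences) and one non-model species (300 sequences) would tally 50,300 by default and 300 with model organisms excluded.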
Cluster tallies are linked to a view of the data availability matrix for that node in the taxonomy tree, which can provide useful guidance for supermatrix and supertree construction. Sequences for each cluster can be downloaded as an unaligned FASTA file for further analysis. Provisional alignments and phylogenetic trees are under construction.
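A data availability matrix of this kind can be thought of as a taxon-by-cluster presence/absence table. The sketch below is a minimal illustration under that assumption; the function name and the input format (a mapping from cluster ID to the set of taxon IDs with at least one sequence in that cluster) are hypothetical and do not describe the database's own schema.

```python
def availability_matrix(clusters):
    """Build a taxon-by-cluster presence/absence matrix.

    clusters: dict mapping cluster ID -> set of taxon IDs that have
    at least one sequence in that cluster (assumed input format).
    Returns (taxa, cluster_ids, matrix) where matrix[i][j] is 1 if
    taxa[i] is represented in cluster_ids[j], else 0.
    """
    cluster_ids = sorted(clusters)
    taxa = sorted({ti for members in clusters.values() for ti in members})
    matrix = [
        [1 if ti in clusters[cid] else 0 for cid in cluster_ids]
        for ti in taxa
    ]
    return taxa, cluster_ids, matrix
```

Row sums give the number of clusters available per taxon and column sums give the number of taxa per cluster, which are the quantities one would typically inspect when choosing loci and taxa for a supermatrix.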
To see a list of "biodiversity research hotspots" (families with the largest increase in species since the last release), click here (New!).
For a list of model organisms, click here.
For more information on how the clustering was implemented, click here.
For more information on the database structure, including downloads of this or previous releases of the entire database, click here (New!).
Finally, for more information about the developers, how to cite, etc., click here.
Types of sequences included: Only "core" nucleotide data are included, which excludes ESTs, STSs, and other kinds of bulk or high-throughput sequences.
Taxonomic coverage: At present the database contains sequences from eukaryotes. These represent the PLN, MAM, PRI, ROD, VRT, and INV divisions of GenBank.
GenBank release: 172 (June 15, 2009)
Number of sequences in this database: 3982579
Number of nodes in our subtree(s) of the NCBI taxonomy tree: 328550
Number of terminal nodes: 256674
Number of nodes clustered (usually terminal taxa): 250863
Number of subtrees clustered (always internal nodes): 69969
Number of nodes with sequences that can be clustered: 315461
Clusters:
Total number of clusters: 2183344
Number of phylogenetically informative clusters (TIs >= 4): 123242
Number of singleton clusters (GIs = 1): 1601134
Number of large clusters (GIs >= 100): 17549
Number of large clusters (TIs >= 100): 4754
Size of largest cluster (w.r.t. GIs): 14022
Size of largest cluster (w.r.t. TIs): 5312
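
The cluster categories in the tallies above follow directly from two counts per cluster. The sketch below assumes that "GIs" counts the sequences in a cluster and "TIs" counts the distinct taxa it covers; the function name and category labels are hypothetical. Note that the categories overlap, so one cluster can fall into several of them.

```python
def cluster_categories(n_gis, n_tis):
    """Return the set of summary categories a cluster falls into.

    n_gis: number of sequences in the cluster (assumed meaning of "GIs")
    n_tis: number of distinct taxa in the cluster (assumed meaning of "TIs")
    """
    categories = set()
    if n_gis == 1:
        categories.add("singleton")
    if n_tis >= 4:
        categories.add("phylogenetically informative")
    if n_gis >= 100:
        categories.add("large (GIs)")
    if n_tis >= 100:
        categories.add("large (TIs)")
    return categories
```

For example, a cluster of 150 sequences drawn from only 3 taxa would count as large by GIs but would not be phylogenetically informative under the TIs >= 4 rule.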
Questions or comments? Contact Mike Sanderson (sanderm at email dot arizona dot edu)