This database provides a snapshot of the current taxonomic distribution of nucleotide sequences in GenBank.
Its purpose is to convey information about the potential phylogenetic data sets (clusters, or sets of homologous sequences) that can be constructed from the database for taxa of interest. The database mirrors the structure of the NCBI taxonomy tree.
The number of clusters is estimated by all-against-all BLAST searches and sequence clustering algorithms (for all nodes with < 35,000 sequences, and excluding sequences > 25,000 nt in length).
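The size thresholds above, followed by a grouping step, can be sketched as follows. This is only an illustration under stated assumptions: it assumes BLAST hits are already available as pairs of sequence identifiers, and it groups them by single linkage with a union-find structure, which may not match the clustering algorithm the database actually uses. The names `MAX_NODE_SEQS`, `MAX_SEQ_LEN`, `eligible_sequences`, and `single_linkage_clusters` are hypothetical, and the example sequences are made up.

```python
from collections import defaultdict

MAX_NODE_SEQS = 35_000  # only nodes with fewer sequences than this are clustered
MAX_SEQ_LEN = 25_000    # sequences longer than this (in nt) are excluded


def eligible_sequences(seq_lengths):
    """Return the IDs of a node's sequences to cluster, or [] if the node is too large.

    seq_lengths maps sequence ID -> length in nucleotides.
    """
    if len(seq_lengths) >= MAX_NODE_SEQS:
        return []
    return [sid for sid, length in seq_lengths.items() if length <= MAX_SEQ_LEN]


def single_linkage_clusters(seq_ids, hits):
    """Group sequences joined by any chain of pairwise hits (assumed linkage rule)."""
    parent = {sid: sid for sid in seq_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in hits:
        # Ignore hits that involve a sequence removed by the length filter.
        if a in parent and b in parent:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for sid in seq_ids:
        groups[find(sid)].add(sid)
    # Largest clusters first; ties broken by smallest member ID.
    return sorted(groups.values(), key=lambda g: (-len(g), min(g)))


# Made-up example: s3 is too long and is dropped before clustering.
lengths = {"s1": 900, "s2": 1200, "s3": 30_000, "s4": 700}
ids = eligible_sequences(lengths)
clusters = single_linkage_clusters(ids, [("s1", "s2"), ("s2", "s3")])
```

Here `s1` and `s2` end up in one cluster and `s4` is left as a singleton, the kind of cluster counted separately in the statistics for this release.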
Model organisms are defined as any node (not subtree) with more than 100 clusters or more than 10,000 sequences. By default, sequence tallies for model organisms propagate upward in the tree along with those for nonmodel organisms, but this information can be excluded so that users can get a sense of the taxonomic breadth of the sequence diversity in the database. Note, however, that the bulk of "genomic" data for model organisms is not entered in the database at all (see below for types of sequences included).
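The model-organism definition and the optional exclusion of model organisms from upward tallies can be sketched as below. This is a minimal sketch, not the database's own code: the `Node` class, its field names, and the functions `is_model_organism` and `subtree_tally` are hypothetical, and the counts in the example are invented for illustration.

```python
from dataclasses import dataclass, field

MODEL_MIN_CLUSTERS = 100      # a model organism has more than 100 clusters ...
MODEL_MIN_SEQUENCES = 10_000  # ... or more than 10,000 sequences


@dataclass
class Node:
    name: str
    n_sequences: int = 0  # sequences assigned directly to this node
    n_clusters: int = 0   # clusters built from this node's own sequences
    children: list = field(default_factory=list)


def is_model_organism(node):
    """Apply the definition to the node itself, not to its subtree."""
    return (node.n_clusters > MODEL_MIN_CLUSTERS
            or node.n_sequences > MODEL_MIN_SEQUENCES)


def subtree_tally(node, include_models=True):
    """Sum sequence counts over a subtree, optionally skipping model organisms."""
    own = node.n_sequences
    if not include_models and is_model_organism(node):
        own = 0
    return own + sum(subtree_tally(child, include_models) for child in node.children)


# Invented counts: one heavily sequenced taxon next to a sparsely sequenced one.
model_taxon = Node("taxon A", n_sequences=50_000, n_clusters=900)
other_taxon = Node("taxon B", n_sequences=4_000, n_clusters=60)
parent = Node("parent node", children=[model_taxon, other_taxon])

with_models = subtree_tally(parent)
without_models = subtree_tally(parent, include_models=False)
```

With the model organism excluded, the parent's tally reflects only the sequences contributed by the remaining taxa, which is the "taxonomic breadth" view described above.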
Cluster tallies are linked to a view of the data availability matrix for that node in the taxonomy tree, which can provide useful guidance for supermatrix and supertree construction. Sequences for each cluster can be downloaded as an unaligned FASTA file for further analysis. Provisional alignments and phylogenetic trees are under construction.
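A data availability matrix of the kind mentioned above records, for each taxon and each cluster, whether the taxon has at least one sequence in that cluster. The sketch below builds one from cluster membership; the function names `availability_matrix` and `coverage`, the input layout, and the example identifiers are assumptions made for illustration and do not describe the database's own schema.

```python
def availability_matrix(cluster_taxa):
    """Build a taxa x clusters presence/absence matrix.

    cluster_taxa maps cluster ID -> set of taxon IDs with a sequence in that cluster.
    Returns (taxa, clusters, rows), where rows[i][j] is 1 if taxa[i] has a
    sequence in clusters[j] and 0 otherwise.
    """
    clusters = sorted(cluster_taxa)
    taxa = sorted(set().union(*cluster_taxa.values()))
    rows = [[1 if taxon in cluster_taxa[c] else 0 for c in clusters] for taxon in taxa]
    return taxa, clusters, rows


def coverage(rows):
    """Fraction of matrix cells that are filled (1.0 means no missing data)."""
    cells = sum(len(row) for row in rows)
    return sum(map(sum, rows)) / cells if cells else 0.0


# Made-up membership for three clusters and three taxa.
membership = {"c1": {"A", "B", "C"}, "c2": {"B", "C"}, "c3": {"A"}}
taxa, clusters, rows = availability_matrix(membership)
filled = coverage(rows)
```

A sparsely filled matrix suggests that a supermatrix would contain a lot of missing data, in which case separate per-cluster trees combined as a supertree may be worth considering.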
To see a list of "biodiversity research hotspots" (families with the largest increase in species since the last release), click here (New!).
For a list of model organisms, click here.
For more information on how the clustering was implemented, click here.
For more information on the database structure, including downloads of this or previous releases of the entire database, click here (New!).
Finally, for more information about the developers, how to cite, etc., click here.
Types of sequences included: Only "core" nucleotide data are included, which excludes ESTs, STSs, and other kinds of bulk or high-throughput sequences.
Taxonomic coverage: At present the database contains sequences from eukaryotes. These represent the PLN, MAM, PRI, ROD, VRT, and INV divisions of GenBank.
GenBank release: 176 (February 23, 2010)
Number of sequences in this database: 4466273
Number of nodes in our subtree(s) of the NCBI taxonomy tree: 348761
Number of terminal nodes: 274078
Number of nodes clustered (usually terminal taxa): 274055
Number of subtrees clustered (always internal nodes): 73439
Number of nodes with sequences that can be clustered: 341638
Clusters:
Total number of clusters: 2380619
Number of phylogenetically informative clusters (TIs >= 4): 134595
Number of singleton clusters (GIs = 1): 1734658
Number of large clusters (GIs >= 100): 20471
Number of large clusters (TIs >= 100): 5224
Size of largest cluster (w.r.t. GIs): 16739
Size of largest cluster (w.r.t. TIs): 6065
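To put the cluster counts above in proportion, the short calculation below expresses the singleton and phylogenetically informative clusters as shares of all clusters. It uses only the figures listed for this release; the variable and function names are illustrative, and it reads "GIs" as sequence identifiers and "TIs" as taxon identifiers, which is an assumption about the abbreviations.

```python
# Figures copied from the cluster statistics for this release.
total_clusters = 2_380_619
informative_clusters = 134_595  # clusters with TIs >= 4
singleton_clusters = 1_734_658  # clusters with GIs = 1


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


singleton_pct = percent(singleton_clusters, total_clusters)
informative_pct = percent(informative_clusters, total_clusters)

print(f"Singleton clusters: {singleton_pct:.1f}% of all clusters")
print(f"Phylogenetically informative clusters: {informative_pct:.1f}% of all clusters")
```

The calculation makes the skew explicit: most clusters contain a single sequence, and only a small share span enough taxa to be phylogenetically informative.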
Questions or comments? Contact Mike Sanderson (sanderm at email dot arizona dot edu)