Schema Overview

From BioSQL
Jump to: navigation, search

This document describes some of the tables and fields in the BioSQL schema. It also aims to demonstrate functional capabilities using example SQL. Design philosophies and expectations are presented with reasoning.


Weak-Typing Paradigm

The BioSQL model follows the weak-typing paradigm. As opposed to the strong-typing paradigm, in which entities and their attributes exhaustively define the object type(s) that they store, in a weakly-typed model a few relatively generic entities can hold any number of specializations (derived entities, in an object-oriented sense), and the attributes that apply to only some of those specializations are attached to the row through tag/value associations (vertical storage).

To identify what particular object type a row is of, many entities have a type. The type as well as the tag of tag/value pairs come from controlled vocabularies (ontologies), which allows assigning arbitrarily rich semantics to both the type of a tuple (row) as well as to the attributes.

Bioentry with Taxon and Namespace


This is the core entity of the BioSQL schema; a bioentry is any single entry or record in a biological database. The bioentry contains information about the record's public name, public accession and version, its description and an identifier field. Finally, for working convenience with GenBank records, the division of GenBank can be specified in a 3 character field.

For example, this truncated GenBank record:

LOCUS       S63169S6                  22 bp    DNA     linear   PRI 25-AUG-1993
DEFINITION  NDP=Norrie disease {first three exons, microdeletion regions}
VERSION     S63178.1  GI:386456

Would be stored in bioentry as:

name:		S63169S6
accession:	S63178
identifier:     386456
division:       PRI
description:	NDP=Norrie disease {first three exons, microdeletion regions}
version:	1

bioentries need not come from a public database; a bioentry from a private lab database might look like this:

name:		MyFavGene1
accession:	MFD12345
identifier:     902772
division:       ion_ch
description:	Gene prediction from my secret organism
version:	10

In this case, the identifier 902772 is not an NCBI GI number, but is a key to lookup this entry in the private database, "My Favorite Database" (MFD).


A biodatabase is simply a collection of bioentries; one bioentry may only belong to one biodatabase, but one biodatabase may contain many bioentries. biodatabase entities can be identified by their name: "GenBank", "trembl", "MyFavoriteGenes", etc. Databases may also be further identified by an authority, the organization under which this database name is officially mandated.

SQL example - fetch the accessions of all sequences from SwissProt:

 SELECT DISTINCT bioentry.accession
 FROM   bioentry JOIN biodatabase USING (biodatabase_id)
 WHERE = 'swiss'  -- or 'Swiss-Prot'

SQL example: Find the database, 'GenBank' or 'GenPept', that contains the GI number 386456:

 FROM   bioentry JOIN biodatabase USING (biodatabase_id)
 WHERE  bioentry.identifier = '386456'
   AND IN ('genbank', 'genpept')

SQL example - how many unique entries are there in GenBank:

 SELECT COUNT(DISTINCT bioentry.accession)
 FROM   bioentry JOIN biodatabase USING (biodatabase_id)
 WHERE = 'genbank'

SQL example - fetch the locus names for the latest versions of all entries where the biodatabase name is 'swiss' (Mysql syntax):

 SELECT MID(MAX(CONCAT(RPAD(LPAD(bioentry.version,5,'?'),10,'?'),,11)
        FROM bioentry JOIN biodatabase USING (biodatabase_id)
 WHERE = 'swiss'


In BioSQL, all databases have bioentries, but not all bioentries need have raw sequence data associated with the entry. The biosequence table contains the raw sequence information associated with a bioentry, and alphabet information ('protein', 'dna', 'rna'). One bioentry may have only one biosequence associated with it, and vice versa: a given biosequence applies to only one bioentry. Sequences may have their own version number, independent of its bioentry version information. The length of the sequence is also stored for pre-calculated convenience.

Note: while the schema's basic structure might imply that bioentries could be associated with multiple biosequences in a one-to-many relationship between bioentry and biosequence, this is not the case: the bioentry_id foreign key present in the biosequence table is constrained to be unique, thus enforcing the one-to-one relationship between the two tables.

Example SQL - what is the description of the longest sequence in GenBank?

 SELECT   bioentry.description
 FROM     bioentry
          JOIN biodatabase USING (biodatabase_id)
          JOIN biosequence USING (bioentry_id)
 WHERE = 'genbank'
 ORDER BY biosequence.length DESC

Example SQL - find all bioentries with protein sequences containing "ELVIS":

 SELECT  bioentry.*
 FROM    bioentry JOIN biosequence USING (bioentry_id)
 WHERE   biosequence.seq LIKE "%ELVIS%"
   AND   biosequence.alphabet = 'protein'


Bioentries may themselves be related to one another (e.g., a PDB record may be composed of multiple subrecords for separate chains, or multiple SwissProt records may be associated with a given PFAM domain entry). These relationships are "typed" via links to ontology terms using the term_id field.


These are tables to store basic taxonomic information about the organism to which a given bioentry refers, and they reflect the structure of NCBI's taxonomy database. Each bioentry can be associated with only one taxon, but many bioentries can be associated with the same taxon. In order to get the most value from these tables it's recommended that you use the BioSQL script to populate them.

The taxon_name.taxon_id field is meant to store an NCBI taxon id. The name_class field stores tags to describe taxonomic names (e.g. "scientific name") and the name field stores the value (e.g "Homo sapiens"). This flexibility allows us to store such things as synonyms and common names as well as the expected binomial.

The taxon table is designed to store the taxonomic relationship between taxons found in the taxon_name table. The node_rank field stores the class of the taxon (e.g. "species", "kingdom"). The parent_taxon_id contains the taxon id of the parent taxon, since there should only be one parent in the taxonomic tree. The right_value and left_value fields store values that are calculated and entered by the script. These arbitrary values are the upper and lower bounds of "nested sets", one set for each taxa, where the set of the child taxa is contained within the larger set of the parent taxon. An example would be the set for the species Procyon lotor, 365816 to 365823, contained within the set for the genus Procyon, 365815 to 365828.

Note: Taxon is optional for bioentries. This is because certain bioentries may not have a clearly identified taxon, or because the concept of taxon may not be meaningful for the bioentry.

Example SQL - find the taxon id of the parent taxon for 'Homo sapiens' using a self-join.

 SELECT parent.ncbi_taxon_id
 FROM   taxon AS parent
        JOIN taxon AS child
        ON child.parent_taxon_id = parent.ncbi_taxon_id
        JOIN taxon_name
        ON taxon_name.taxon_id = child.ncbi_taxon_id
 WHERE = 'Homo sapiens';

Example SQL - find all human sequences:

 SELECT * FROM biosequence
        JOIN bioentry USING (bioentry_id)
        JOIN taxon_name USING (taxon_id)
 WHERE = 'Homo sapiens'

Example SQL -find the taxon id's of all the parental taxa in the Primate lineage using a self-join:

 SELECT b.taxon_id FROM taxon AS a
        JOIN taxon AS b
        ON (a.left_value < b.right_value AND a.left_value > b.left_value)
        JOIN taxon_name
        ON a.taxon_id = taxon_name.taxon_id
 WHERE = 'primate'  -- 'Primata' or 'Primates'

Or list all species(final nodes) of Fungi (so no all hierarchy of parent nodes but lists the downstream, child path instead):

 SELECT DISTINCT include.ncbi_taxon_id FROM taxon
    INNER JOIN taxon AS include ON
      (include.left_value BETWEEN taxon.left_value
        AND taxon.right_value)
 WHERE taxon.taxon_id IN
   (SELECT taxon_id FROM taxon_name
    WHERE name LIKE '%fungi%')

Sequence Features with Location and Annotation


More information pertaining to a bioentry is stored as a generic "feature" of the sequence, the semantics of which are defined by associations with a specific "source" term and optional qualifiers (see below under TERM).


The location of each seqfeature (or sub-seqfeature) is defined by a location entity, describing the stop and start coordinates and strand. A seqfeature may have multiple locations (i.e. split locations are handled). Start and stop coordinates may be left NULL to accommodate some forms of "fuzzy" locations. Additionally, a location may refer to a "remote" sequence, i.e. not the sequence associated with the bioentry, this is accomplished by a dbxref_id link.


Sequence features may also have associated sub-seqfeatures (with potentially many-to-many parent-child relationships). These relationships are also "typed" via links to ontology terms using the term_id fields.

Ontology Terms and Relationships


An ontology (in the current usage) is essentially a dictionary of terms in a somewhat-controlled vocabulary. An ontology_term is used to "label" a seqfeature's name ("exon", "CDS", "5' UTR", etc), as well as its source ("GeneWise", "Glimmer", etc), and to define the types of relationships between seqfeatures and their sub-seqfeatures (e.g. "is composed of", "gives rise to", "transmembrane segments of" - see seqfeature_relationship below). While a seqfeature may have only one term to describe its type and source, relationships between seqfeatures and sub-seqfeatures may have multiple terms associated with them.


However, the powerful utility of ontology terms is that they can be associated with each other in hierarchies; e.g. a "sequence similarity search" is a general term that includes more specific terms like "BLAST result" or "HMMER PFAM result". "BLAST result" may also be a more specific term for "pairwise sequence alignment". One might like to further qualify these relationships by putting names on them: a "BLAST result" is a "result from" a "sequence similarity search" and a "example of" a "pairwise sequence alignment". We refer to these as "subject", "predicate", "object" (or, "parent", "relationship type", "child"). The term_relationship table performs this mapping between terms, using the subject/predicate/object terminology. This mapping (or "rule set") must itself be given an ontological namespace (ontology_id), as the mapping may relate terms between separate ontologies; we must keep track of where each "rule" comes from.

Finally, ontology terms themselves can be linked to external databases via a many-to-many relationship using the term_dbxref table.

Transitive closure tables over hierarchical structures


All three of these tables are meant to store the "transitive closure" of the respective *_relationship data; that is, if A is related to B, and B is related to C, then A is related to C, and will have a row in the table. The definition of the type of relationship between A and C depends greatly on the semantics of the individual relationships between A and B and B and C (including the possibility that A and C aren't actually related by any meaningful type, and should therefore not appear in the table). We leave it to individual implementors to define the policy for building these paths.

A very generic policy would be to use the ontology of relationship types involved between A and B and B and C, and choose the greatest common denominator between the two relationship types (e.g. when the two relationship types are the same, then A and C are related by the same type; when the two relationship types differ, then A and C are related by the first "supertype" that includes both relationship types).

Annotation Bundle

Annotations are similar to Sequence Features in that they describe a sequence, but unlike Sequence Features they have no locations on the sequence, they are associated with the entire sequence. Annotations may come with references and database ids.


A given literature reference may be associated with many bioentries, and a given bioentry may be associated with multiple references (thus calling for the linking "bioentry_reference" table to map the associations between each). Furthermore, the rank field may be used to define the order of the references for each associated bioentry. Lastly, start_pos and end_pos fields may be used to associate references with specific locations on the bioentry.


Each bioentry can have one or more simple textual comments associated with it, and the order of the comments may be specified by the rank field.


Furthermore, ontology terms may be used to qualify dbxrefs, bioentries, seqfeatures, and locations. Multiple qualifier values can be associated with each entity. Together, this allows one to put meaningful "labelled" data on these otherwise generic objects. For example, a SwissProt dbxref might have an additional qualifier value that was the SwissProt name (GTM1_HUMAN) of the dbxref.

An alternative design philosophy is to break out overtly common entities into their own entity table with explicitly named fields (as has been done for literature references and taxa). While an ontology-driven seqfeature "meta-table" is theoretically capable of storing any information about a bioentry, it is sometimes more useful to have extraordinarily common entities represented by their own tables. In a "view-capable" relational database, pre-computed SQL SELECT statements may be used to generate a read-only view to obtain entity-specific tables. For example, a "SwissProt_dbxref" view could be made that had all the dbxref fields plus a "SwissProt_name" field containing the qualifier value discussed previously.


Entries in a database may have cross-references to the literature. The reference table stores each journal article, book chapter, etc. that may be associated with a bioentry (or multiple bioentries). A reference's location refers to the journal (including volume, index, and possibly pages) or book in which the reference is found. Neither the location nor the authors fields have any canonical format, they are as found in the bioentry record. To help ensure uniqueness, a calculated checksum (crc) is kept over the author, location, title fields. Also, if provided by the data source, dbxref_id will contain the MEDLINE number, or any other identifier if the reference is indexed in another resource than MEDLINE.


Database cross references are links to records in other databases (whether they be sequence databases or not). The relationship between bioentries and dbxrefs is many-to-many: one bioentry may have multiple associated dbxrefs, and one dbxref may be associated with many bioentries.

Personal tools