It would be great if we could ask computers questions this easily. Unfortunately, computers lack our natural linguistic skills. But this doesn't mean they are incapable of answering such questions—the questions just need to be rephrased.
SQL (pronounced either as "S.Q.L." or "sequel", and standing for "Structured Query Language") is the most commonly-used language for querying relational databases. This past week I have been adapting the phylogenetic operations in my 2007 paper into SQL (specifically the PostgreSQL version) as part of my Names on Nodes project. There have been a few difficulties to do with the fact that the database only records taxa (sets of organisms), while my published algorithms work at the organism level. But this is a minor difficulty so far, and things seem to be working.
To get an idea of how the system works, let's look at the question posited at the beginning of this post. How can the question be translated into a SQL query?
First, a quick summary of what's in the database: taxa (sets of organisms) of all kinds are represented by a table of "signifiers". Each signifier references an "authority", which could be anything from a publication to a nomenclatural code to a systematics dataset to a personal opinion. Each signifier also has a name which is unique under that authority. For example, under the authority of the International Code of Zoological Nomenclature (ICZN for short), the name "Homo sapiens" signifies a particular taxon (a species). Under the authority of the Yale Peabody Museum's Vertebrate Paleontology collection, the name "1450" refers to a specimen, and by proxy the set of organisms (in this case, just one) which that specimen represents. Under the authority of a particular NEXUS file, "data/flight/present" may refer to a character state, and by proxy the set of all organisms exhibiting that state.
Each authority is represented by one or more Universal Resource Identifiers, or URIs. Website addresses (URLs) are one kind of URI, so, for example, the Yale Peabody Museum's Vertebrate Paleontology collection can be represented as
http://www.peabody.yale.edu/collections/vp. ISBN numbers are another type of URI, so the ICZN can be represented as
urn:isbn:0853010064. Systematics files are a little trickier; my current solution is to use a custom schema and a unique "hash" string of characters programmatically derived from the data in the file, e.g., something like
biofile:08192A3(actually longer, but I'll keep it short for this discussion).
Each signifier can be represented as a "qualified name" (or "QName"), which combines the authority's URI and the signifier's name. Examples:
Definitions may be stored as MathML, as indicated in my 2007 paper. Let's suppose that someone defines Dinosauria as, "the final common ancestor of Megalosaurus bucklandii von Meyer 1832, Hylaeosaurus armatus Mantell 1833, and Iguanodon bernissartensis Boulenger in Beneden 1881, plus all descendants of that ancestor." This can be rendered as:
Basically, this means, "Apply the
nodeCladefunction to the the union of the signified taxa." The
nodeCladefunction corresponds to the phrase, "the final common ancestors and all descendants thereof". It relies on three other phylogenetic functions:
commonAncestors. ("Maximal" in this context means "all members of a set which are not ancestral to any other members in that set"—a more precise way to say "final".)
Names on Nodes will take a MathML definition like this and convert it into a SQL query like this:
SELECT * FROM node_clade(The
ARRAY(SELECT * FROM resolve_signifier_identities(
resolve_signifier_identitiesfunction converts an array of qualified names into a set of identity numbers corresponding to signifiers, using a query like this:
SELECT identity_idThis is a relatively simple SQL query. The grammar is fairly intuitive (at least for a computer language). The query for the
WHERE qname = ANY(qname_list);
node_cladefunction is far more complex, though. To discuss that, I'll have to go into the database structure a bit more.
A phylogenetic definition cannot be applied without some kind of phylogenetic hypothesis. Names on Nodes is capable of reading phylogenies encoded as Newick trees, including those in NEXUS files. Let's suppose someone uploaded a NEXUS file with this Newick tree in it:
((Iguano_bern, Hylaeo_arma), (Megalo_buck, (Tyrann_rex, (Anas_plat, Passer_dome))))
Let's suppose the URI for this file is
biofile:08192A3. Then this file authorizes some new signifiers:
urn:isbn:0679428518::House_Sparrow, respectively) Equated signifiers share the same identity. This file also authorizes some less visible signifiers: the hypothetical ancestors, such as the ancestor of
Hylaeo_arma, the ancestor of
Passer_dome, etc. Names on Nodes assigns these ancestors arbitrary, numerical names.
The phylogeny in the Newick tree can be represented as a set of arcs, each pointing from a parent signifier (or "head") to a child signifier (or "tail"):
~1 → ~2These get stored in the database as rows representing parentage relations. Similar relations are used for supersets and subsets (e.g., to indicate that Tyrannosaurus includes Tyrannosaurus rex). Phylogenetic functions, such as
~1 → ~3
~2 → Iguano_bern
~2 → Hylaeo_arma
~3 → Megalo_buck
~3 → ~4
~4 → Tyrann_rex
~4 → ~5
~5 → Anas_plat
~5 → Passer_dome
common_ancestors, utilize the data in these relations.
Some of these functions use another function called
find_parentage_arcsto get a list of all parent-child relations which involve any of a given set of signifiers. For example,
maximaluses a query like this:
SELECT idIn other words, select all signifier identities from a given list, except those which are the parent (head) in any parent-child arc whose parent and child are in that list. (The
WHERE id = ANY(identity_list)
FROM find_parentage_arcs(identity_list, context_id)
context_idvariable specifies which datasets to use and which to ignore, but that's an essay unto itself.)
node_cladefunction uses a query like this:
SELECT * FROM all_descendants(ARRAY(In other words, select all descendants/members of all maximal members of the common ancestors of the specified signifiers. (Again,
SELECT * FROM maximal(ARRAY(
SELECT * FROM common_ancestors(identity_list,
context_idspecifies which datasets to use; a value of zero means that all datasets should be used.)
Running the Dinosauria SQL query mentioned earlier will now yield a set including all of the signifiers mentioned so far. In other words, according to the provided definition and the provided phylogenetic hypothesis, I. bernissartensis, H. armatus, M. bucklandii, T. rex, A. platyrhynchos, and P. domesticus are all members of Dinosauria (as are the hypothetical ancestors in the phylogenetic tree).
Suppose the aforementioned NEXUS file also had a character matrix including the character "flight". This would be scored as "present" in
Passer_dome. Thus, the table of inclusions would include these arcs:
flight/present → Anas_plat(Names on Nodes also extrapolates the arc
flight/present → Passer_dome
flight/present → ~5, but, again, that's another essay.)
A function called
subsetsretrieves all signifiers indicated as subsets of a given signifier. Thus, the question that began this essay can be written in SQL as:
SELECT COUNT(dinosaur) > 0 AS verdictThe verdict:
ARRAY(SELECT * FROM resolve_signifier_identities(
0) AS dinosaur
0) AS flier ON dinosaur = flier;