"Computer, can any dinosaurs fly?"It would be great if we could ask computers questions this easily. Unfortunately, computers lack our natural linguistic skills. But this doesn't mean they are incapable of answering such questions—the questions just need to be rephrased.
SQL (pronounced either as "S.Q.L." or "sequel", and standing for "Structured Query Language") is the most commonly-used language for querying relational databases. This past week I have been adapting the phylogenetic operations in
my 2007 paper into SQL (specifically the
PostgreSQL version) as part of my
Names on Nodes project. There have been a few difficulties to do with the fact that the database only records taxa (sets of organisms), while my published algorithms work at the organism level. But this is a minor difficulty so far, and things seem to be working.
To get an idea of how the system works, let's look at the question posited at the beginning of this post. How can the question be translated into a SQL query?
First, a quick summary of what's in the database: taxa (sets of organisms) of all kinds are represented by a table of "signifiers". Each signifier references an "authority", which could be anything from a publication to a nomenclatural code to a systematics dataset to a personal opinion. Each signifier also has a name which is unique under that authority. For example, under the authority of the
International Code of Zoological Nomenclature (
ICZN for short), the name "
Homo sapiens" signifies a particular taxon (a species). Under the authority of the
Yale Peabody Museum's Vertebrate Paleontology collection, the name "1450" refers to a specimen, and by proxy the set of organisms (in this case, just one) which that specimen represents. Under the authority of a particular
NEXUS file, "data/flight/present" may refer to a character state, and by proxy the set of all organisms exhibiting that state.
Each authority is represented by one or more
Universal Resource Identifiers, or URIs. Website addresses (
URLs) are one kind of URI, so, for example, the Yale Peabody Museum's Vertebrate Paleontology collection can be represented as
http://www.peabody.yale.edu/collections/vp
. ISBN numbers are another type of URI, so the
ICZN can be represented as
urn:isbn:0853010064
. Systematics files are a little trickier; my current solution is to use a custom schema and a unique "hash" string of characters programmatically derived from the data in the file, e.g., something like
biofile:08192A3
(actually longer, but I'll keep it short for this discussion).
Each signifier can be represented as a "qualified name" (or "QName"), which combines the authority's URI and the signifier's name. Examples:
http://www.peabody.yale.edu/collections/vp::1450
urn:isbn:0853010064::Homo_sapiens
biofile:08192A3::data/flight/present
(A fuller discussion of the
Names on Nodes database can be found in
this earlier post.)
Definitions may be stored as
MathML, as indicated in my 2007 paper. Let's suppose that someone defines
Dinosauria as, "the final common ancestor of
Megalosaurus bucklandii von Meyer 1832,
Hylaeosaurus armatus Mantell 1833, and
Iguanodon bernissartensis Boulenger in Beneden 1881, plus all descendants of that ancestor." This can be rendered as:
<apply xmlns="http://www.w3.org/1998/Math/MathML">
<csymbol definitionURL="http://namesonnodes.org/phylo/math::nodeClade"/>
<csymbol definitionURL="urn:isbn:0853010064::Hylaeosaurus_armatus"/>
<csymbol definitionURL="urn:isbn:0853010064::Iguanodon_bernissartensis"/>
<csymbol definitionURL="urn:isbn:0853010064::Megalosaurus_bucklandii"/>
</apply>
Basically, this means, "Apply the
nodeClade
function to the the union of the signified taxa." The
nodeClade
function corresponds to the phrase, "the final common ancestors and all descendants thereof". It relies on three other phylogenetic functions:
allDescendants
,
maximal
, and
commonAncestors
. ("Maximal" in this context means "all members of a set which are not ancestral to any other members in that set"—a more precise way to say "final".)
Names on Nodes will take a MathML definition like this and convert it into a SQL query like this:
SELECT * FROM node_clade(
ARRAY(SELECT * FROM resolve_signifier_identities(
ARRAY[
'urn:isbn:0853010064::Hylaeosaurus_armatus',
'urn:isbn:0853010064::Iguanodon_bernissartensis',
'urn:isbn:0853010064::Megalosarus_bucklandii'
])),
0);
The
resolve_signifier_identities
function converts an array of qualified names into a set of identity numbers corresponding to signifiers, using a query like this:
SELECT identity_id
FROM signifier
WHERE qname = ANY(qname_list);
This is a relatively simple SQL query. The grammar is fairly intuitive (at least for a computer language). The query for the
node_clade
function is far more complex, though. To discuss that, I'll have to go into the database structure a bit more.
A phylogenetic definition cannot be applied without some kind of phylogenetic hypothesis.
Names on Nodes is capable of reading phylogenies encoded as
Newick trees, including those in NEXUS files. Let's suppose someone uploaded a NEXUS file with this Newick tree in it:
((Iguano_bern, Hylaeo_arma), (Megalo_buck, (Tyrann_rex, (Anas_plat, Passer_dome))))
Let's suppose the URI for this file is
biofile:08192A3
. Then this file authorizes some new signifiers:
biofile:08192A3::taxa/Iguano_bern
biofile:08192A3::taxa/Hylaeo_arma
biofile:08192A3::taxa/Megalo_buck
biofile:08192A3::taxa/Tyrann_rex
biofile:08192A3::taxa/Anas_plat
biofile:08192A3::taxa/Passer_dome
These can be objectively equated with other signifiers:
urn:isbn:0853010064::Iguanodon_bernissartensis
urn:isbn:0853010064::Hylaeosaurus_armatus
urn:isbn:0853010064::Megalosaurus_bucklandii
urn:isbn:0853010064::Tyrannosaurus_rex
urn:isbn:0853010064::Anas_platyrhynchos
urn:isbn:0853010064::Passer_domesticus
(The last two could conceivably be equated with vernacular names from a field guide as well, e.g.,
urn:isbn:0679428518::Mallard
and
urn:isbn:0679428518::House_Sparrow
, respectively) Equated signifiers share the same identity. This file also authorizes some less visible signifiers: the hypothetical ancestors, such as the ancestor of
Iguano_bern
and
Hylaeo_arma
, the ancestor of
Anas_plat
and
Passer_dome
, etc.
Names on Nodes assigns these ancestors arbitrary, numerical names.
The phylogeny in the Newick tree can be represented as a set of arcs, each pointing from a parent signifier (or "head") to a child signifier (or "tail"):
~1 → ~2
~1 → ~3
~2 → Iguano_bern
~2 → Hylaeo_arma
~3 → Megalo_buck
~3 → ~4
~4 → Tyrann_rex
~4 → ~5
~5 → Anas_plat
~5 → Passer_dome
These get stored in the database as rows representing parentage relations. Similar relations are used for supersets and subsets (e.g., to indicate that
Tyrannosaurus includes
Tyrannosaurus rex). Phylogenetic functions, such as
maximal
,
all_ancestors
, and
common_ancestors
, utilize the data in these relations.
Some of these functions use another function called
find_parentage_arcs
to get a list of all parent-child relations which involve any of a given set of signifiers. For example,
maximal
uses a query like this:
SELECT id
FROM signifieridentity
WHERE id = ANY(identity_list)
EXCEPT
SELECT arc.head
FROM find_parentage_arcs(identity_list, context_id)
AS arc;
In other words, select all signifier identities from a given list, except those which are the parent (head) in any parent-child arc whose parent and child are in that list. (The
context_id
variable specifies which datasets to use and which to ignore, but that's an essay unto itself.)
The
node_clade
function uses a query like this:
SELECT * FROM all_descendants(ARRAY(
SELECT * FROM maximal(ARRAY(
SELECT * FROM common_ancestors(identity_list,
context_id)),
context_id)),
context_id);
In other words, select all descendants/members of all maximal members of the common ancestors of the specified signifiers. (Again,
context_id
specifies which datasets to use; a value of zero means that all datasets should be used.)
Running the
Dinosauria SQL query mentioned earlier will now yield a set including all of the signifiers mentioned so far. In other words, according to the provided definition and the provided phylogenetic hypothesis,
I. bernissartensis,
H. armatus,
M. bucklandii,
T. rex,
A. platyrhynchos, and
P. domesticus are all members of
Dinosauria (as are the hypothetical ancestors in the phylogenetic tree).
Suppose the aforementioned NEXUS file also had a character matrix including the character "flight". This would be scored as "present" in
Anas_plat
and
Passer_dome
. Thus, the table of inclusions would include these arcs:
flight/present → Anas_plat
flight/present → Passer_dome
(
Names on Nodes also extrapolates the arc
flight/present → ~5
, but, again, that's another essay.)
A function called
subsets
retrieves all signifiers indicated as subsets of a given signifier. Thus, the question that began this essay can be written in SQL as:
SELECT COUNT(dinosaur) > 0 AS verdict
FROM node_clade(
ARRAY(SELECT * FROM resolve_signifier_identities(
ARRAY[
'urn:isbn:0853010064::Hylaeosaurus_armatus',
'urn:isbn:0853010064::Iguanodon_bernissartensis',
'urn:isbn:0853010064::Megalosarus_bucklandii'
])),
0) AS dinosaur
JOIN subsets(
resolve_signifier_identity(
'biofile:08192A3::data/flight/present'
),
0) AS flier ON dinosaur = flier;
The verdict:
true
.