08 January 2009

Using SQL to make phylogenetic queries

"Computer, can any dinosaurs fly?"

It would be great if we could ask computers questions this easily. Unfortunately, computers lack our natural linguistic skills. But this doesn't mean they are incapable of answering such questions—the questions just need to be rephrased.

SQL (pronounced either as "S.Q.L." or "sequel", and standing for "Structured Query Language") is the most commonly-used language for querying relational databases. This past week I have been adapting the phylogenetic operations in my 2007 paper into SQL (specifically the PostgreSQL version) as part of my Names on Nodes project. There have been a few difficulties to do with the fact that the database only records taxa (sets of organisms), while my published algorithms work at the organism level. But this is a minor difficulty so far, and things seem to be working.

To get an idea of how the system works, let's look at the question posited at the beginning of this post. How can the question be translated into a SQL query?

First, a quick summary of what's in the database: taxa (sets of organisms) of all kinds are represented by a table of "signifiers". Each signifier references an "authority", which could be anything from a publication to a nomenclatural code to a systematics dataset to a personal opinion. Each signifier also has a name which is unique under that authority. For example, under the authority of the International Code of Zoological Nomenclature (ICZN for short), the name "Homo sapiens" signifies a particular taxon (a species). Under the authority of the Yale Peabody Museum's Vertebrate Paleontology collection, the name "1450" refers to a specimen, and by proxy the set of organisms (in this case, just one) which that specimen represents. Under the authority of a particular NEXUS file, "data/flight/present" may refer to a character state, and by proxy the set of all organisms exhibiting that state.

Each authority is represented by one or more Universal Resource Identifiers, or URIs. Website addresses (URLs) are one kind of URI, so, for example, the Yale Peabody Museum's Vertebrate Paleontology collection can be represented as http://www.peabody.yale.edu/collections/vp. ISBN numbers are another type of URI, so the ICZN can be represented as urn:isbn:0853010064. Systematics files are a little trickier; my current solution is to use a custom schema and a unique "hash" string of characters programmatically derived from the data in the file, e.g., something like biofile:08192A3 (actually longer, but I'll keep it short for this discussion).

Each signifier can be represented as a "qualified name" (or "QName"), which combines the authority's URI and the signifier's name. Examples:
  • http://www.peabody.yale.edu/collections/vp::1450
  • urn:isbn:0853010064::Homo_sapiens
  • biofile:08192A3::data/flight/present
(A fuller discussion of the Names on Nodes database can be found in this earlier post.)

Definitions may be stored as MathML, as indicated in my 2007 paper. Let's suppose that someone defines Dinosauria as, "the final common ancestor of Megalosaurus bucklandii von Meyer 1832, Hylaeosaurus armatus Mantell 1833, and Iguanodon bernissartensis Boulenger in Beneden 1881, plus all descendants of that ancestor." This can be rendered as:

<apply xmlns="http://www.w3.org/1998/Math/MathML">
<csymbol definitionURL="http://namesonnodes.org/phylo/math​::nodeClade"/>
<csymbol definitionURL="urn:isbn:0853010064​::Hylaeosaurus_armatus"/>
<csymbol definitionURL="urn:isbn:0853010064​::Iguanodon_bernissartensis"/>
<csymbol definitionURL="urn:isbn:0853010064​::Megalosaurus_bucklandii"/>
</apply>

Basically, this means, "Apply the nodeClade function to the the union of the signified taxa." The nodeClade function corresponds to the phrase, "the final common ancestors and all descendants thereof". It relies on three other phylogenetic functions: allDescendants, maximal, and commonAncestors. ("Maximal" in this context means "all members of a set which are not ancestral to any other members in that set"—a more precise way to say "final".)

Names on Nodes will take a MathML definition like this and convert it into a SQL query like this:
  SELECT * FROM node_clade(
ARRAY(SELECT * FROM resolve_signifier_identities(
ARRAY[
'urn:isbn:0853010064::Hylaeosaurus_armatus',
'urn:isbn:0853010064::Iguanodon_bernissartensis',
'urn:isbn:0853010064::Megalosarus_bucklandii'
])),
0);
The resolve_signifier_identities function converts an array of qualified names into a set of identity numbers corresponding to signifiers, using a query like this:
  SELECT identity_id
FROM signifier
WHERE qname = ANY(qname_list);
This is a relatively simple SQL query. The grammar is fairly intuitive (at least for a computer language). The query for the node_clade function is far more complex, though. To discuss that, I'll have to go into the database structure a bit more.

A phylogenetic definition cannot be applied without some kind of phylogenetic hypothesis. Names on Nodes is capable of reading phylogenies encoded as Newick trees, including those in NEXUS files. Let's suppose someone uploaded a NEXUS file with this Newick tree in it:

((Iguano_bern, Hylaeo_arma), (Megalo_buck, (Tyrann_rex, (Anas_plat, Passer_dome))))

Let's suppose the URI for this file is biofile:08192A3. Then this file authorizes some new signifiers:
  • biofile:08192A3::taxa/Iguano_bern
  • biofile:08192A3::taxa/Hylaeo_arma
  • biofile:08192A3::taxa/Megalo_buck
  • biofile:08192A3::taxa/Tyrann_rex
  • biofile:08192A3::taxa/Anas_plat
  • biofile:08192A3::taxa/Passer_dome
These can be objectively equated with other signifiers:
  • urn:isbn:0853010064::Iguanodon_bernissartensis
  • urn:isbn:0853010064::Hylaeosaurus_armatus
  • urn:isbn:0853010064::Megalosaurus_bucklandii
  • urn:isbn:0853010064::Tyrannosaurus_rex
  • urn:isbn:0853010064::Anas_platyrhynchos
  • urn:isbn:0853010064::Passer_domesticus
(The last two could conceivably be equated with vernacular names from a field guide as well, e.g., urn:isbn:0679428518::Mallard and urn:isbn:0679428518::House_Sparrow, respectively) Equated signifiers share the same identity. This file also authorizes some less visible signifiers: the hypothetical ancestors, such as the ancestor of Iguano_bern and Hylaeo_arma, the ancestor of Anas_plat and Passer_dome, etc. Names on Nodes assigns these ancestors arbitrary, numerical names.

The phylogeny in the Newick tree can be represented as a set of arcs, each pointing from a parent signifier (or "head") to a child signifier (or "tail"):
  ~1 → ~2
~1 → ~3
~2 → Iguano_bern
~2 → Hylaeo_arma
~3 → Megalo_buck
~3 → ~4
~4 → Tyrann_rex
~4 → ~5
~5 → Anas_plat
~5 → Passer_dome
These get stored in the database as rows representing parentage relations. Similar relations are used for supersets and subsets (e.g., to indicate that Tyrannosaurus includes Tyrannosaurus rex). Phylogenetic functions, such as maximal, all_ancestors, and common_ancestors, utilize the data in these relations.

Some of these functions use another function called find_parentage_arcs to get a list of all parent-child relations which involve any of a given set of signifiers. For example, maximal uses a query like this:
  SELECT id
FROM signifieridentity
WHERE id = ANY(identity_list)
EXCEPT
SELECT arc.head
FROM find_parentage_arcs(identity_list, context_id)
AS arc;
In other words, select all signifier identities from a given list, except those which are the parent (head) in any parent-child arc whose parent and child are in that list. (The context_id variable specifies which datasets to use and which to ignore, but that's an essay unto itself.)

The node_clade function uses a query like this:
  SELECT * FROM all_descendants(ARRAY(
SELECT * FROM maximal(ARRAY(
SELECT * FROM common_ancestors(identity_list,
context_id)),
context_id)),
context_id);
In other words, select all descendants/members of all maximal members of the common ancestors of the specified signifiers. (Again, context_id specifies which datasets to use; a value of zero means that all datasets should be used.)

Running the Dinosauria SQL query mentioned earlier will now yield a set including all of the signifiers mentioned so far. In other words, according to the provided definition and the provided phylogenetic hypothesis, I. bernissartensis, H. armatus, M. bucklandii, T. rex, A. platyrhynchos, and P. domesticus are all members of Dinosauria (as are the hypothetical ancestors in the phylogenetic tree).

Suppose the aforementioned NEXUS file also had a character matrix including the character "flight". This would be scored as "present" in Anas_plat and Passer_dome. Thus, the table of inclusions would include these arcs:
  flight/present → Anas_plat
flight/present → Passer_dome
(Names on Nodes also extrapolates the arc flight/present → ~5, but, again, that's another essay.)

A function called subsets retrieves all signifiers indicated as subsets of a given signifier. Thus, the question that began this essay can be written in SQL as:
  SELECT COUNT(dinosaur) > 0 AS verdict
FROM node_clade(
ARRAY(SELECT * FROM resolve_signifier_identities(
ARRAY[
'urn:isbn:0853010064::Hylaeosaurus_armatus',
'urn:isbn:0853010064::Iguanodon_bernissartensis',
'urn:isbn:0853010064::Megalosarus_bucklandii'
])),
0) AS dinosaur
JOIN subsets(
resolve_signifier_identity(
'biofile:08192A3::data/flight/present'
),
0) AS flier ON dinosaur = flier;
The verdict: true.

No comments:

Post a Comment