05 April 2010

Sketch of a Phylogenetic Query Language

Names on Nodes uses MathML for two primary purposes:
  • Delineating phylogenetic hypotheses (as directed acyclic graphs).
  • Associating identifiers with definitions.
In some ways this works out to be a bit like a query language: you can use it to set up data constructs and then search them for groups of interest. For example, suppose you wanted a list of all stem-humans from Kenya, that is, every Kenyan member of the human total group other than Homo sapiens itself. Assuming that your dataset included (1) a taxonomic unit called Homo sapiens, (2) a group called extant containing all extant taxonomic units, and (3) a group called Kenya containing all Kenyan taxonomic units, that query might look like this:
<apply xmlns="http://www.w3.org/1998/Math/MathML">
   <intersect/>
   <ci>Kenya</ci>
   <apply>
      <setdiff/>
      <apply>
         <csymbol definitionURL="http://namesonnodes.org/ns/math/2009#def-Total"/>
         <ci>Homo sapiens</ci>
         <ci>extant</ci>
      </apply>
      <ci>Homo sapiens</ci>
   </apply>
</apply>
MathML is flexible and extensible enough to cover concepts like this, but it's also really verbose. That has been fine for my purposes so far, but it may be cumbersome for others. So I've been playing around with a more succinct way to write these expressions. Today I tossed up some rough ideas here:


This is a plain-text format loosely inspired by mathematical notation, the C language, etc. Using it, the above query becomes:
"Kenya" & (total("Homo sapiens", "extant") - "Homo sapiens")
...which is quite a bit shorter.
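To make the set logic of that query concrete, here is a rough Python sketch. It is not how Names on Nodes evaluates anything; it only mimics the query on a toy tree. Every name except "Homo sapiens" is a placeholder, the helper functions are made up for illustration, and I am assuming that total(x, extant) means "every ancestor of x that is not also an ancestor of some other extant unit, plus all descendants of those ancestors."

```python
# Toy phylogeny as parent -> children arcs. Hypothetical data:
# only "Homo sapiens" comes from the post; the rest are placeholders.
CHILDREN = {
    "ancestor 0": ["extant relative", "ancestor 1"],
    "ancestor 1": ["fossil A", "ancestor 2"],
    "ancestor 2": ["fossil B", "Homo sapiens"],
}
EXTANT = {"extant relative", "Homo sapiens"}
KENYA = {"fossil A", "extant relative"}


def descendants_or_self(node):
    """Return the node together with everything reachable from it."""
    found = {node}
    for child in CHILDREN.get(node, []):
        found |= descendants_or_self(child)
    return found


def ancestors_or_self(node):
    """Return the node together with everything it is reachable from."""
    found = {node}
    for parent, kids in CHILDREN.items():
        if node in kids:
            found |= ancestors_or_self(parent)
    return found


def total(specifier, extant):
    """Assumed reading of 'total': the specifier's exclusive ancestors
    (those shared with no other extant unit) and all their descendants."""
    shared = set()
    for unit in extant - {specifier}:
        shared |= ancestors_or_self(unit)
    exclusive = ancestors_or_self(specifier) - shared
    result = set()
    for node in exclusive:
        result |= descendants_or_self(node)
    return result


# Same shape as: "Kenya" & (total("Homo sapiens", "extant") - "Homo sapiens")
stem_humans_from_kenya = KENYA & (total("Homo sapiens", EXTANT) - {"Homo sapiens"})
print(sorted(stem_humans_from_kenya))  # ['fossil A']
```

Python's set operators happen to use the same `&` and `-` symbols as the shorthand above, so the last expression reads almost identically to the plain-text query.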

This is still in very early stages, so I thought I'd post it to get some feedback.

Here are a few of the simpler clade definition examples:
"Aves"       := clade("Struthio camelus" | "Tetrao major" |
                      "Vultur gryphus").

"Saurischia" := clade("Megalosaurus bucklandii" <-
                      "Iguanodon bernissartensis").

"Avialae"    := clade("wings used for powered flight" @
                      "Vultur gryphus").
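For readers who think better in code, here is a hedged Python sketch of how the first two kinds of definition could be evaluated on a toy tree. I am assuming that `|` joins specifiers for a node-based clade (the most recent common ancestor and all its descendants) and that `<-` separates an internal from an external specifier for a branch-based clade. The tree, the unit names, and the functions `node_clade` and `branch_clade` are all hypothetical. The `@` form is left out because it would also need character data.

```python
# Toy tree as parent -> children arcs; all names are placeholders.
CHILDREN = {
    "root": ["outgroup", "n1"],
    "n1": ["A", "n2"],
    "n2": ["B", "C"],
}


def descendants_or_self(node):
    """Return the node together with everything reachable from it."""
    found = {node}
    for child in CHILDREN.get(node, []):
        found |= descendants_or_self(child)
    return found


def ancestors_or_self(node):
    """Return the node together with everything it is reachable from."""
    found = {node}
    for parent, kids in CHILDREN.items():
        if node in kids:
            found |= ancestors_or_self(parent)
    return found


def node_clade(specifiers):
    """Assumed meaning of clade(x | y | ...): the most recent common
    ancestor(s) of all specifiers, plus all their descendants."""
    common = set.intersection(*(ancestors_or_self(s) for s in specifiers))
    # Keep only common ancestors with no other common ancestor below them.
    mrcas = {a for a in common if not (descendants_or_self(a) - {a}) & common}
    return set().union(*(descendants_or_self(a) for a in mrcas))


def branch_clade(internal, external):
    """Assumed meaning of clade(x <- y): ancestors of x that are not
    ancestors of y, plus all their descendants."""
    exclusive = ancestors_or_self(internal) - ancestors_or_self(external)
    return set().union(*(descendants_or_self(a) for a in exclusive))


print(sorted(node_clade({"B", "C"})))   # ['B', 'C', 'n2']
print(sorted(branch_clade("B", "A")))   # ['B', 'C', 'n2']
```

On this particular tree the two definitions pick out the same group, which will not be true in general: a branch-based clade also takes in any stem units that sit between the node and the external specifier.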

2 comments:

  1. All very impressive, and I can see its uses. However, I can't really comment on its efficacy, as I am still getting my head around "simple" SQL.

  2. This proposal actually has very little in common with SQL. I did write on using SQL to make phylogenetic queries earlier here and here. As you can see, SQL makes things considerably more complex....
