A Three-Pound Monkey Brain: MathML


14 February 2013

Mathematical expressions as JSON (and phyloreferencing)

For Names on Nodes I did a lot of work with MathML (specifically MathML-Content), an application of XML for representing mathematical concepts. But now, as XML wanes and JSON waxes, I've started to look at ideas for porting Names on Nodes concepts over to JSON.

I've been drawing up a very basic and extensible way to interpret JSON mathematically. Each of the core JSON values translates like so:

Null, Boolean, and Number values are interpreted as themselves.
Strings are interpreted as qualified identifiers (if they include ":") or local identifiers (otherwise).
Arrays are interpreted as the application of an operation, where the first element is a string identifying the operation and the remaining elements are arguments.
Objects are interpreted either as:

a set of declarations, where each key is a [local] identifier and each value is an evaluable JSON expression (see above), or
a namespace, where each key is a URI and each value is a series of declarations (see previous).

Examples

Here's a simple object declaring some mathematical constants (approximately):

{
    "e": 2.718281828459045,
    "pi": 3.141592653589793
}

Supposing we had declared some operations (only possible in JavaScript, since JSON doesn't have functions) equivalent to those of MathML (whose namespace URI is "http://www.w3.org/1998/Math/MathML"), we could do this:

{
    "x":
        ["http://www.w3.org/1998/Math/MathML:plus",
            1,
            2
        ],
    "y":
        ["http://www.w3.org/1998/Math/MathML:sin",
            ["http://www.w3.org/1998/Math/MathML:divide",
                "http://www.w3.org/1998/Math/MathML:pi",
                2
            ]
        ]
}

Once evaluated, x would be 3 and y would be 1 (or close to it, given that this is floating-point math).
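To make the interpretation rules concrete, here is a minimal evaluator sketch, written in Python rather than JavaScript purely for illustration. The `OPERATIONS` and `CONSTANTS` tables and the `evaluate` function are hypothetical names of my own, not part of any existing library, and the sketch covers only the few MathML operations used in the example (namespace objects are left out entirely).

```python
import math

MATHML = "http://www.w3.org/1998/Math/MathML"

# Hypothetical lookup tables for qualified identifiers (assumed for this sketch).
OPERATIONS = {
    MATHML + ":plus": lambda *args: sum(args),
    MATHML + ":divide": lambda a, b: a / b,
    MATHML + ":sin": math.sin,
}
CONSTANTS = {MATHML + ":pi": math.pi}


def evaluate(expr, declarations):
    """Evaluate a parsed JSON value against a dict of local declarations."""
    # Null, Boolean, and Number values are interpreted as themselves.
    if expr is None or isinstance(expr, (bool, int, float)):
        return expr
    # Strings are qualified identifiers (with ":") or local identifiers.
    if isinstance(expr, str):
        if ":" in expr:
            return CONSTANTS[expr]
        return evaluate(declarations[expr], declarations)
    # Arrays apply the operation named by the first element to the rest.
    if isinstance(expr, list):
        operation = OPERATIONS[expr[0]]
        return operation(*(evaluate(arg, declarations) for arg in expr[1:]))
    raise TypeError("cannot evaluate %r" % (expr,))


declarations = {
    "x": [MATHML + ":plus", 1, 2],
    "y": [MATHML + ":sin", [MATHML + ":divide", MATHML + ":pi", 2]],
}
x = evaluate("x", declarations)
y = evaluate("y", declarations)
```

Here `x` comes out as 3 and `y` as 1, within floating-point error.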

Now for the interesting stuff. Suppose we had declared Names on Nodes operations and some taxa using LSIDs:

{
    "Homo sapiens": "urn:lsid:ubio.org:namebank:109086",
    "Ornithorhynchus anatinus": "urn:lsid:ubio.org:namebank:7094675",
    "Mammalia":
        ["http://namesonnodes.org/ns/math/2013:clade",
            ["http://www.w3.org/1998/Math/MathML:union",
                "Homo sapiens",
                "Ornithorhynchus anatinus"
            ]
        ]
}

Voilà, a phylogenetic definition of Mammalia in JSON!

I think this could be pretty useful. My one issue is the repetition of long URIs. It would be nice to have a mechanism to import them using shorter handles. Maybe something like this?

{
    "mathml":   "http://www.w3.org/1998/Math/MathML:*",
    "namebank": "urn:lsid:ubio.org:namebank:*",
    "NoN":      "http://namesonnodes.org/ns/math/2013:*",

    "Mammalia":
        ["NoN:clade",
            ["mathml:union",
                "namebank:109086",
                "namebank:7094675"
            ]
        ]
}
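One way such an import mechanism might work is sketched below in Python. The function name `expand_prefixes` is hypothetical, and so is the rule it applies: any top-level string value ending in ":*" is treated as a handle declaration rather than an ordinary one. That rule is only my guess at how this idea could be pinned down.

```python
def expand_prefixes(document):
    """Replace short handles such as "mathml:union" with full identifiers."""
    # Assumed convention: a string value ending in ":*" declares a handle.
    prefixes = {
        key: value[:-1]  # drop the trailing "*", keep the ":"
        for key, value in document.items()
        if isinstance(value, str) and value.endswith(":*")
    }

    def expand(value):
        if isinstance(value, str):
            handle, separator, local = value.partition(":")
            if separator and handle in prefixes:
                return prefixes[handle] + local
            return value
        if isinstance(value, list):
            return [expand(item) for item in value]
        return value

    # Handle declarations themselves are dropped from the result.
    return {
        key: expand(value)
        for key, value in document.items()
        if key not in prefixes
    }


document = {
    "mathml": "http://www.w3.org/1998/Math/MathML:*",
    "namebank": "urn:lsid:ubio.org:namebank:*",
    "NoN": "http://namesonnodes.org/ns/math/2013:*",
    "Mammalia": [
        "NoN:clade",
        ["mathml:union", "namebank:109086", "namebank:7094675"],
    ],
}
expanded = expand_prefixes(document)
```

The result holds a single "Mammalia" declaration whose identifiers are all written out in full, matching the long-form version.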

Something to ponder. Another thing to ponder: what should I call this? MathON? MaSON?

02 April 2012

An Idea for the EOL Phylogenetic Tree Challenge

Earlier this year, the Encyclopedia of Life announced the EOL Phylogenetic Tree Challenge. The goal: to produce "a very large, phylogenetically-organized set of scientific names suitable for ingestion into the Encyclopedia of Life as an alternate browsing hierarchy". The prize: an all-expenses-paid trip to iEvoBio 2012 in Ottawa!

This interested me greatly, because:

It's exactly the sort of thing I'm working on for PhyloPic.
I can't really justify paying for a trip to iEvoBio this year. (Phyloinformatics is my hobby, not my profession!)

After reading Rod Page's thoughts on the challenge, I came up with a basic idea, and started to implement it. Unfortunately, now that we're two weeks from the deadline, I'm realizing that:

I do not have the time to complete it.
Even if it were paid for, I can't justify a trip on my own out of town right now.

Why not? Simply put, this.

So, instead, I'm going to outline the general approach I was going to take, and if someone else wants to run with it, knock yourself out. (Just give me partial credit.)

pymathema, a Python tool for evaluating MathML

Lately I've been learning the programming language Python, and I've really been enjoying it. In particular, as a dynamic language (i.e., having loose types), it's really well-suited for mathematical tools. (Having sets and tuples as native types doesn't hurt, either.)

I started creating a MathML-Content evaluator in Python, with an extension for Names on Nodes which implements phyloreferencing expressions. As part of this I am working on Version 2.0 of the Names on Nodes MathML Definitions, which will expand upon the current ones.

Basic functionality is pretty much complete, although there are some niceties to add. If you'd like to check it out and maybe collaborate, have a look here: PYMATHEMA.

27 May 2010

Upcoming Names on Nodes Presentation

I'll also be presenting Names on Nodes at iEvoBio, at the Software Bazaar on June 29. Here's the abstract:

Names on Nodes: Automating the Application of Taxonomic Names within a Phylogenetic Context
Names on Nodes¹ is an open-source² Flex application which utilizes a mathematical approach to automate the application of phylogenetically-defined names to phylogenetic hypotheses. Phylogenetic hypotheses are modeled as directed, acyclic graphs, and may be read from bioinformatics or graph files (Nexus, NexML, Newick, and GraphML) or created de novo. Hypotheses may also be merged from multiple sources. Names on Nodes stores hypotheses as MathML, an XML-based language for representing mathematical content and presentation. Phylogenetic definitions may be constructed using a visual editor and exported in MathML. Thus, it is possible to create a dictionary of defined names and automatically apply them to phylogenetic hypotheses. In the current version of the application, such dictionaries exist only as MathML files, but in future versions definitions may also be loaded from databases (e.g., RegNum).
Additional functionality in Names on Nodes includes the ability to coarsen a phylogenetic graph (thereby simplifying it while still reflecting the overall structure) or to export it as an image file (raster or vector, potentially with semantic annotations).
¹ Source code available at: http://bitbucket.org/keesey/namesonnodes-sa/

² MIT license

I have my work cut out for me....

21 May 2010

Names on Nodes Issue Tracker

Yesterday I transferred the list of remaining Names on Nodes issues from my whiteboard to the bitbucket issue tracker. My goal is to get through most of these by the end of June. (Some "nice-to-haves", like DOT or HTML 5 exporting, may have to wait.)

Essential features left to implement, complete or fix:

FILES AND FORMATS

Certain formats for import, especially NexML and NEXUS. (Currently only Newick can be imported. MathML files can be loaded as well.)
Certain formats for export, especially NexML. (Currently only PNG can be exported. MathML files can be saved as well.)
Ability to save just the definitions or just the phylogeny to a MathML file.
Ability to import definitions from a MathML file.
MathML tweaks. (Use csymbol instead of ci for taxa. Normalize presentation.)
Ability to write in Newick strings directly.

DISPLAY

Skin various components (sliders, steppers, checkboxes, etc.).
Fix line breaks in MathML formulas.
Various scrollbar issues.
Special character issues.

NAMES

Rich editor for taxon labels, including ability to edit taxon URIs.

NODES

Arc bisection tool.
Fix node merging (i.e., synonymization).
Add ability to select definition type when creating a name.
Node Pane Control Bar revisions. (Change Resolution Slider to a stepper. Add Zoom Slider.)

DEFINITIONS

Definition Editor tweaks/fixes. (Some actions are blocked that should be possible. Textual Editor does not always update. Various layout issues.)

OTHER

About/Help Panel.

02 April 2010

Names on Nodes: MathML Definitions (Version 1.2)

After the epiphany that Names on Nodes did not have to be associated with a database, I set to work creating a "standalone" version of the application. Progress has been pretty good, and if you are interested in the details (or collaborating), you can check the project out at its new home on Bitbucket (which also houses the related project, ASMathema).

I've just updated the Names on Nodes website based on these revisions to the project, most notably the MathML Definitions document. Most of the changes have actually been removals: no more mentions of rank-based taxonomy (which may be covered in future versions but not in this one), qualified names as taxonomic identifiers (no longer a necessary feature), etc. So if you didn't read it before because it was too long and dense ... well, it's still pretty long and dense, actually. But less so!

I've also added an example MathML document as a supplement. This document:

Defines a phylogenetic context (the same one used in the MathML Definitions examples), arranging taxonomic units as vertices in a directed, acyclic graph.
Defines sets based on characters ("wings used for powered flight" and "extant")
Refers a specimen (YPM-VP 1450) to a taxonomic unit (Ichthyornis).
Equates several species names as synonyms.
Defines some hybrid formulas as referring to specific taxonomic units.
Defines a number of clade names.

This file can be opened with Names on Nodes: Standalone Version, which I am currently developing and hope to release this year.

05 March 2010

Names on Nodes: Cutting Out the Fat

While pondering the headaches of homonymy recently, I started to ask myself, What am I doing with my life? Why am I worrying about this?

Seriously, though. I've been working on Names on Nodes on and off for about three years, and I still haven't launched it. And it's because I've been so focused on getting things like this right. (Well, that and having a day job.)

But things like this aren't part of the core functionality. The core functionality is the automated evaluation of phylogenetic definitions (encoded as MathML) within the context of phylogenetic hypotheses (modeled as directed, acyclic graphs). That part of it's been done for quite a while. So why am I wasting time on the rest?

By cutting out the entire database portion of the project, I could actually have something launched this year. Sure, it'd be nice to have a repository of taxonomic names, definitions, authorities, etc. But it's not necessary. It's phase II, not phase I. In fact, by the time I'm ready for a phase II, there will almost certainly be other services out there that already perform those things.

So, here on out, I'm going to be focusing on getting a lean, mean version of Names on Nodes up. Here's a quick summary of what you'll be able to do with it:

Open bioinformatics files (NEXUS to start with, other formats like NexML going forward).
View the phylogenies in a pretty graphical interface.
Merge phylogenies.
Tweak phylogenies (adding or removing parent-child relations, adding or removing taxonomic units, equating taxonomic units, etc.).
Formulate phylogenetic definitions using a spiffy interface.
Apply these definitions to the phylogeny.
Save your work as MathML files.

And that should be manageable (although there is still much work to be done, especially for the user interface). Once that's launched and working, then I'll look into connecting to other services and/or launching an associated database.

25 February 2010

Tricksy Definitions Expressed Mathematically

Just for fun, here are a few definitions of nonstandard type to go along with those in the previous post. As any practitioner of phylogenetic nomenclature knows, most definitions are node-, branch-, or apomorphy-based, but there have been a few that don't fall into these categories.

Here are Wagner's (2004) definitions of Panbiota and Biota:

   Panbiota := (Clade ∘ prc_∩)(Homo sapiens).

   Biota := Crown(Panbiota, "extant as of or after 2004").

This is one of the few cases where it makes more sense to define the crown clade based on the total clade rather than vice versa. (Maybe the only case? Not sure.) Technically, Wagner's wording for the definition of Panbiota might be better translated as (suc_∪ ∘ min ∘ prc_∩)(Homo sapiens), but it works out to the same thing.

And here's Clarke's (2004) definition of Ichthyornis:

   Let M := "apomorphy 2" ∩ "apomorphy 5" ∩ "apomorphy 6" ∩ "apomorphy 7" ∩ "apomorphy 8".
   (These refer to apomorphies in Clarke's Ichthyornis dispar Diagnosis.)

   Ichthyornithes := Clade(YPM 1450 ← Struthio camelus ∪ Tinamus major ∪ Vultur gryphus).
   ("YPM" refers to the Yale Peabody Museum's Vertebrate Paleontology collection. YPM 1450 is the Ichthyornis dispar holotype specimen.)

   Ichthyornis := Clade((M @ YPM 1450) ∩ Ichthyornithes).

Names on Nodes: MathML Definitions (Version 1.1)

After posting Version 1.0 earlier this week, I had a revelation: the cladogen functions are completely unnecessary, and everything would work a lot nicer if I just tossed them. I also realized that there really was no reason I couldn't include the various relations (precedence, immediate precedence, proper precedence, etc.), just in case anyone wanted to do some seriously non-standard definitions. After some significant revisions, I present Version 1.1.

Some examples of the updated notation, using humans (Homo sapiens), platypuses (Ornithorhynchus anatinus), and Dimetrodon grandis, a stem-mammal:

Union. Homo sapiens ∪ Ornithorhynchus anatinus = all humans and all platypuses (polyphyletic taxon, also monothetic)

Exclusive Predecessors. Homo sapiens ← Ornithorhynchus anatinus = humans and all of their ancestors, except for the ancestors shared with platypuses (lineage)

Synapomorphic Predecessors. "milk glands" @ Homo sapiens = humans and all human ancestors to possess milk glands synapomorphic with those in humans (lineage)

Node-Based Clade. Clade(Homo sapiens ∪ Ornithorhynchus anatinus) = Mammalia

Branch-Based Clade (simple). Clade(Homo sapiens ← Ornithorhynchus anatinus) = "Pan-Theria"

Branch-Based Clade (multiple external specifiers). Clade(Homo sapiens ← Ornithorhynchus anatinus ∪ Dimetrodon grandis) = "Pan-Theria"

Branch-Based Clade (multiple internal specifiers). Clade(Homo sapiens ∪ Ornithorhynchus anatinus ← Dimetrodon grandis) = (unnamed clade comprised mostly of Therapsida)

Null Branch-Based Definition (multiple internal specifiers). Clade(Homo sapiens ∪ Dimetrodon grandis ← Ornithorhynchus anatinus) = ∅

Apomorphy-Based Clade. Clade("milk glands" @ Homo sapiens) = "Apo-Mammalia"

Node-Modified Crown Clade. Crown(Homo sapiens ∪ Dimetrodon grandis, "extant as of or after 2010") = Mammalia

Branch-Modified Crown Clade. Crown(Homo sapiens ← Ornithorhynchus anatinus, "extant as of or after 2010") = Theria

Apomorphy-Modified Crown Clade. Crown("milk glands" @ Homo sapiens, "extant as of or after 2010") = Mammalia

Total Clade. Total(Mammalia, "extant as of or after 2010") = Synapsida (or "Pan-Mammalia")

Image showing a node-based clade (Mammalia) under a given phylogenetic hypothesis.

21 February 2010

Names on Nodes: MathML Definitions (Version 1.0)

I've just posted version 1.0 of the MathML definition for Names on Nodes. This document provides the foundation for the mathematical entities and operations in Names on Nodes. Previously I had posted an incomplete draft version—this is the first complete version, and also the first version with illustrations. It won't be the last version, but it (or a slightly edited version) will be associated with the first release of Names on Nodes.

This document refines and rectifies concepts laid out in my 2007 paper. It's an important milestone toward completing Names on Nodes, a project I've been working on for almost six years.

One of the illustrations, showing how the Clade function works.

07 September 2009

Glimpses of Stuff I've Been Working On

Two long-term projects of mine should see the light of day soon. Here are some previews (click to see full size):

(Why does everything I do lately involve directed, acyclic graphs?)

23 July 2009

Two "Names on Nodes"-Related Launches

I'm still a clear way away from launching the beta application, but I've just made a couple of launches related to my long-time work-in-progress, Names on Nodes.

First up, and probably of more interest to most people, I've begun the documentation for the MathML definitions used by Names on Nodes. The document includes general reviews of relevant mathematical and biological concepts, a quick review of MathML and the technologies it's based on, some comments on correlating mathematical and biological concepts, and definitions for all entities (including operations) used by Names on Nodes. Note that this covers a lot of the same ground as in my 2007 paper, with a few minor changes in the symbols and terminology (e.g., I now call the ancestor of a clade a "cladogen" rather than a "cladogenetic set").

Secondly, I've made the project open-source, by moving it to Google Code. If you are a developer interested in checking this out, go here. It's incomplete, so I don't know if anyone will have any real interest in looking at it yet. (Honestly, I mostly posted so that, on the off chance that I unexpectedly kick the bucket, my magnum opus won't be lost forever.)

This information is also on the new Names on Nodes home page.

13 February 2009

Using Conservation Status to Automatically Apply Phylogenetic Definitions

To briefly summarize some relevant points in the last post (Extinct or Extant?):

Some phylogenetic definitions require a definition of the term "extant".
The International Union for Conservation of Nature maintains a database of species and their conservation status, as assessed for a particular year.

Since 2001, the IUCN Red List has used the following categories:

EX: Extinct
EW: Extinct in the Wild
CR: Critically Endangered
EN: Endangered
VU: Vulnerable
NT: Near Threatened
LC: Least Concern
DD: Data Deficient
NE: Not Evaluated

As mentioned in earlier posts, Names on Nodes uses URIs (URLs, ISBN numbers, DOIs, etc.) for authorities and qualified names (URI + unique local name) for taxonomic signifiers. Thus, these states can be stored as signifiers in the Names on Nodes database. Examples for the 2008 assessment:

urn:isbn:2831706335::categories:EX:2008
urn:isbn:2831706335::categories:CR:2008
urn:isbn:2831706335::categories:EN:2008
urn:isbn:2831706335::categories:VU:2008
urn:isbn:2831706335::categories:NT:2008
urn:isbn:2831706335::categories:LC:2008
urn:isbn:2831706335::categories:DD:2008
urn:isbn:2831706335::categories:NE:2008

One wonderful thing about the IUCN database is that you can export query results as XML (also CSV). Here's an example of an entry:

<species id="148296">
  <scientific_name>
    Zosterops xanthochroa
  </scientific_name> 
  <kingdom_name>
    ANIMALIA
  </kingdom_name> 
  <phylum_name>
    CHORDATA
  </phylum_name> 
  <class_name>
    AVES
  </class_name> 
  <order_name>
    Passeriformes
  </order_name> 
  <family_name>
    Zosteropidae
  </family_name> 
  <genus_name>
    Zosterops
  </genus_name> 
  <species_name>
    xanthochroa
  </species_name> 
  <authority>
    Gray, 1859
  </authority> 
  <synonyms>
    <synonym>
      <scientific_name>
        Zosterops xanthochrous
      </scientific_name> 
      <genus_name>
        Zosterops
      </genus_name> 
      <species_name>
        xanthochrous
      </species_name> 
    </synonym>
  </synonyms>
  <common_names>
    <name lang="Eng">
      Green-backed White-eye
    </name> 
  </common_names>
  <assessment
      version="3.1"
      year="2008">
    <category>
      LC
    </category> 
  </assessment>
</species>

This provides a source not only for the conservation status of species, but also for the species themselves and some of their higher taxa as well. This one XML snippet can provide all of the following signifiers:

Animalia
- urn:isbn:0853010064::Animalia
Chordata
- urn:isbn:0853010064::Chordata
Aves
- urn:isbn:0853010064::Aves
Passeriformes
- urn:isbn:0853010064::Passeriformes
Zosteropoidea
- urn:isbn:0853010064::Zosteropoidea
Zosteropidae
- urn:isbn:0853010064::Zosteropidae
Zosteropinae
- urn:isbn:0853010064::Zosteropinae
Zosteropini
- urn:isbn:0853010064::Zosteropini
Zosteropina
- urn:isbn:0853010064::Zosteropina
Zosterops
- urn:isbn:0853010064::Zosterops
Zosterops (Zosterops)
- urn:isbn:0853010064::Zosterops+%28Zosterops%29
Zosterops xanthochroa/Zosterops xanthochrous/Green-backed White-eye
- urn:isbn:0853010064::Zosterops+xanthochroa
- urn:isbn:0853010064::Zosterops+xanthochrous
- http://iucnredlist.org::species:148296
- http://iucnredlist.org::common_name:Eng:Green-backed+White-eye

It also authorizes a number of superset-subset relations, e.g., "Zosterops includes Zosterops xanthochroa" and "Least Concern (2008) includes Zosterops xanthochroa". The latter identifies Z. xanthochroa as an extant species during 2008. Because of relations like this, we can build a MathML set for the set of all organisms (or populations, whatever) which were extant in 2008 according to the IUCN Red List:

<apply xmlns="http://www.w3.org/1998/Math/MathML">
  <union/>
  <csymbol
    definitionURL="urn:isbn:2831706335::categories:EW:2008"/>
  <csymbol
    definitionURL="urn:isbn:2831706335::categories:CR:2008"/>
  <csymbol
    definitionURL="urn:isbn:2831706335::categories:EN:2008"/>
  <csymbol
    definitionURL="urn:isbn:2831706335::categories:VU:2008"/>
  <csymbol
    definitionURL="urn:isbn:2831706335::categories:NT:2008"/>
  <csymbol
    definitionURL="urn:isbn:2831706335::categories:LC:2008"/>
</apply>

Presto, now I can apply modified node-based definitions and total group definitions! Thanks, IUCN, for helping to enable the automated application of phylogenetic definitions! (And, you know, also for all the "saving threatened species from extinction" stuff.)
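As a rough sketch of the bookkeeping involved, the Python below reads a trimmed-down copy of the entry above, builds the category signifier for its assessment, and checks it against the categories used in the union. The names `category_signifier`, `is_extant`, and `EXTANT_CATEGORIES` are hypothetical, and the sample is abridged; this is not how Names on Nodes itself does it.

```python
import xml.etree.ElementTree as ET

RED_LIST = "urn:isbn:2831706335"

# The categories that appear in the MathML union above (everything but EX, DD, NE).
EXTANT_CATEGORIES = {"EW", "CR", "EN", "VU", "NT", "LC"}

# Abridged copy of the exported entry, keeping only the fields used here.
ENTRY = """<species id="148296">
  <scientific_name>
    Zosterops xanthochroa
  </scientific_name>
  <assessment version="3.1" year="2008">
    <category>
      LC
    </category>
  </assessment>
</species>"""


def category_signifier(species):
    """Build the qualified name of the category a species was assessed under."""
    assessment = species.find("assessment")
    category = assessment.findtext("category").strip()
    year = assessment.get("year")
    return f"{RED_LIST}::categories:{category}:{year}"


def is_extant(species):
    """True if the assessed category is one of those in the union."""
    category = species.find("assessment").findtext("category").strip()
    return category in EXTANT_CATEGORIES


species = ET.fromstring(ENTRY)
signifier = category_signifier(species)
extant = is_extant(species)
```

For this entry the signifier is urn:isbn:2831706335::categories:LC:2008, and the species counts as extant.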

08 January 2009

Using SQL to make phylogenetic queries

"Computer, can any dinosaurs fly?"

It would be great if we could ask computers questions this easily. Unfortunately, computers lack our natural linguistic skills. But this doesn't mean they are incapable of answering such questions—the questions just need to be rephrased.

SQL (pronounced either as "S.Q.L." or "sequel", and standing for "Structured Query Language") is the most commonly-used language for querying relational databases. This past week I have been adapting the phylogenetic operations in my 2007 paper into SQL (specifically the PostgreSQL version) as part of my Names on Nodes project. There have been a few difficulties to do with the fact that the database only records taxa (sets of organisms), while my published algorithms work at the organism level. But this is a minor difficulty so far, and things seem to be working.

To get an idea of how the system works, let's look at the question posited at the beginning of this post. How can the question be translated into a SQL query?

First, a quick summary of what's in the database: taxa (sets of organisms) of all kinds are represented by a table of "signifiers". Each signifier references an "authority", which could be anything from a publication to a nomenclatural code to a systematics dataset to a personal opinion. Each signifier also has a name which is unique under that authority. For example, under the authority of the International Code of Zoological Nomenclature (ICZN for short), the name "Homo sapiens" signifies a particular taxon (a species). Under the authority of the Yale Peabody Museum's Vertebrate Paleontology collection, the name "1450" refers to a specimen, and by proxy the set of organisms (in this case, just one) which that specimen represents. Under the authority of a particular NEXUS file, "data/flight/present" may refer to a character state, and by proxy the set of all organisms exhibiting that state.

Each authority is represented by one or more Uniform Resource Identifiers, or URIs. Website addresses (URLs) are one kind of URI, so, for example, the Yale Peabody Museum's Vertebrate Paleontology collection can be represented as http://www.peabody.yale.edu/collections/vp. ISBNs can also be written as URIs, so the ICZN can be represented as urn:isbn:0853010064. Systematics files are a little trickier; my current solution is to use a custom scheme and a unique "hash" string of characters programmatically derived from the data in the file, e.g., something like biofile:08192A3 (actually longer, but I'll keep it short for this discussion).

Each signifier can be represented as a "qualified name" (or "QName"), which combines the authority's URI and the signifier's name. Examples:

http://www.peabody.yale.edu/collections/vp::1450
urn:isbn:0853010064::Homo_sapiens
biofile:08192A3::data/flight/present
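In code, a qualified name of this kind is just two strings joined by "::". A minimal Python sketch follows; `make_qname` and `split_qname` are hypothetical helpers, and the sketch assumes that an authority URI never itself contains "::".

```python
SEPARATOR = "::"


def make_qname(authority_uri, local_name):
    """Join an authority URI and a local name into a qualified name."""
    return authority_uri + SEPARATOR + local_name


def split_qname(qname):
    """Split a qualified name at the first "::" into (authority, name)."""
    authority_uri, separator, local_name = qname.partition(SEPARATOR)
    if not separator:
        raise ValueError("not a qualified name: %r" % (qname,))
    return authority_uri, local_name


specimen = split_qname("http://www.peabody.yale.edu/collections/vp::1450")
species_qname = make_qname("urn:isbn:0853010064", "Homo_sapiens")
```

Note that the single colons inside the URI (as in "http:" or "urn:isbn:") do not interfere, since only the doubled colon is treated as the separator.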

(A fuller discussion of the Names on Nodes database can be found in this earlier post.)

Definitions may be stored as MathML, as indicated in my 2007 paper. Let's suppose that someone defines Dinosauria as, "the final common ancestor of Megalosaurus bucklandii von Meyer 1832, Hylaeosaurus armatus Mantell 1833, and Iguanodon bernissartensis Boulenger in Beneden 1881, plus all descendants of that ancestor." This can be rendered as:

<apply xmlns="http://www.w3.org/1998/Math/MathML">
  <csymbol definitionURL="http://namesonnodes.org/phylo/math::nodeClade"/>
  <csymbol definitionURL="urn:isbn:0853010064::Hylaeosaurus_armatus"/>
  <csymbol definitionURL="urn:isbn:0853010064::Iguanodon_bernissartensis"/>
  <csymbol definitionURL="urn:isbn:0853010064::Megalosaurus_bucklandii"/>
</apply>

Basically, this means, "Apply the nodeClade function to the union of the signified taxa." The nodeClade function corresponds to the phrase, "the final common ancestors and all descendants thereof". It relies on three other phylogenetic functions: allDescendants, maximal, and commonAncestors. ("Maximal" in this context means "all members of a set which are not ancestral to any other members in that set", a more precise way to say "final".)

Names on Nodes will take a MathML definition like this and convert it into a SQL query like this:

  SELECT * FROM node_clade(
    ARRAY(SELECT * FROM resolve_signifier_identities(
      ARRAY[
        'urn:isbn:0853010064::Hylaeosaurus_armatus',
        'urn:isbn:0853010064::Iguanodon_bernissartensis',
        'urn:isbn:0853010064::Megalosaurus_bucklandii'
      ])),
  0);
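The conversion step itself might look something like the Python sketch below. This is not the actual Names on Nodes code: `definition_to_query` and the `SQL_FUNCTIONS` table are hypothetical, only the nodeClade operation is handled, and the "%s" placeholder assumes a database driver that accepts a Python list as an array parameter.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "{http://www.w3.org/1998/Math/MathML}"

# Hypothetical mapping from operation identifiers to SQL function names.
SQL_FUNCTIONS = {
    "http://namesonnodes.org/phylo/math::nodeClade": "node_clade",
}

DEFINITION = """<apply xmlns="http://www.w3.org/1998/Math/MathML">
  <csymbol definitionURL="http://namesonnodes.org/phylo/math::nodeClade"/>
  <csymbol definitionURL="urn:isbn:0853010064::Hylaeosaurus_armatus"/>
  <csymbol definitionURL="urn:isbn:0853010064::Iguanodon_bernissartensis"/>
  <csymbol definitionURL="urn:isbn:0853010064::Megalosaurus_bucklandii"/>
</apply>"""


def definition_to_query(mathml):
    """Turn a flat <apply> element into (SQL text, parameters)."""
    apply_element = ET.fromstring(mathml)
    if apply_element.tag != MATHML_NS + "apply":
        raise ValueError("expected a MathML <apply> element")
    symbols = [
        child.get("definitionURL")
        for child in apply_element.findall(MATHML_NS + "csymbol")
    ]
    # The first csymbol names the operation; the rest are the specifiers.
    operation, specifiers = symbols[0], symbols[1:]
    sql = (
        "SELECT * FROM " + SQL_FUNCTIONS[operation] + "("
        "ARRAY(SELECT * FROM resolve_signifier_identities(%s)), 0);"
    )
    return sql, (specifiers,)


sql, parameters = definition_to_query(DEFINITION)
```

Passing the qualified names as a parameter, rather than pasting them into the query text, keeps user-supplied names from being interpreted as SQL.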

The resolve_signifier_identities function converts an array of qualified names into a set of identity numbers corresponding to signifiers, using a query like this:

  SELECT identity_id
  FROM signifier
  WHERE qname = ANY(qname_list);

This is a relatively simple SQL query. The grammar is fairly intuitive (at least for a computer language). The query for the node_clade function is far more complex, though. To discuss that, I'll have to go into the database structure a bit more.

A phylogenetic definition cannot be applied without some kind of phylogenetic hypothesis. Names on Nodes is capable of reading phylogenies encoded as Newick trees, including those in NEXUS files. Let's suppose someone uploaded a NEXUS file with this Newick tree in it:

((Iguano_bern, Hylaeo_arma), (Megalo_buck, (Tyrann_rex, (Anas_plat, Passer_dome))))

Let's suppose the URI for this file is biofile:08192A3. Then this file authorizes some new signifiers:

biofile:08192A3::taxa/Iguano_bern
biofile:08192A3::taxa/Hylaeo_arma
biofile:08192A3::taxa/Megalo_buck
biofile:08192A3::taxa/Tyrann_rex
biofile:08192A3::taxa/Anas_plat
biofile:08192A3::taxa/Passer_dome

These can be objectively equated with other signifiers:

urn:isbn:0853010064::Iguanodon_bernissartensis
urn:isbn:0853010064::Hylaeosaurus_armatus
urn:isbn:0853010064::Megalosaurus_bucklandii
urn:isbn:0853010064::Tyrannosaurus_rex
urn:isbn:0853010064::Anas_platyrhynchos
urn:isbn:0853010064::Passer_domesticus

(The last two could conceivably be equated with vernacular names from a field guide as well, e.g., urn:isbn:0679428518::Mallard and urn:isbn:0679428518::House_Sparrow, respectively.) Equated signifiers share the same identity. This file also authorizes some less visible signifiers: the hypothetical ancestors, such as the ancestor of Iguano_bern and Hylaeo_arma, the ancestor of Anas_plat and Passer_dome, etc. Names on Nodes assigns these ancestors arbitrary, numerical names.

The phylogeny in the Newick tree can be represented as a set of arcs, each pointing from a parent signifier (or "head") to a child signifier (or "tail"):

  ~1 → ~2
  ~1 → ~3
  ~2 → Iguano_bern
  ~2 → Hylaeo_arma
  ~3 → Megalo_buck
  ~3 → ~4
  ~4 → Tyrann_rex
  ~4 → ~5
  ~5 → Anas_plat
  ~5 → Passer_dome
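For the curious, here is a small Python sketch of how such an arc list can be derived from a Newick string. It is a toy parser of my own for illustration (`newick_to_arcs` is a hypothetical name): it assumes no branch lengths, no quoted labels, and no labels on internal nodes, and it numbers the hypothetical ancestors in the order their opening parentheses appear.

```python
def newick_to_arcs(newick):
    """Return (parent, child) arcs for a simple Newick tree."""
    text = newick.replace(" ", "").rstrip(";")
    arcs = []
    counter = 0
    position = 0

    def parse_node():
        nonlocal counter, position
        if text[position] == "(":
            # An opening parenthesis starts a new hypothetical ancestor.
            counter += 1
            name = "~%d" % counter
            position += 1
            while True:
                child = parse_node()
                arcs.append((name, child))
                if text[position] == ",":
                    position += 1
                    continue
                break
            if text[position] != ")":
                raise ValueError("expected ')' at position %d" % position)
            position += 1
            return name
        # Otherwise read a leaf label up to the next delimiter.
        start = position
        while position < len(text) and text[position] not in ",()":
            position += 1
        return text[start:position]

    parse_node()
    return arcs


arcs = newick_to_arcs(
    "((Iguano_bern, Hylaeo_arma), "
    "(Megalo_buck, (Tyrann_rex, (Anas_plat, Passer_dome))))"
)
```

The arcs come out in a different order than listed above, but they are the same ten parent-to-child pairs.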

These get stored in the database as rows representing parentage relations. Similar relations are used for supersets and subsets (e.g., to indicate that Tyrannosaurus includes Tyrannosaurus rex). Phylogenetic functions, such as maximal, all_ancestors, and common_ancestors, utilize the data in these relations.

Some of these functions use another function called find_parentage_arcs to get a list of all parent-child relations which involve any of a given set of signifiers. For example, maximal uses a query like this:

  SELECT id
    FROM signifieridentity
    WHERE id = ANY(identity_list)
  EXCEPT
  SELECT arc.head
    FROM find_parentage_arcs(identity_list, context_id)
      AS arc;

In other words, select all signifier identities from a given list, except those which are the parent (head) in any parent-child arc whose parent and child are in that list. (The context_id variable specifies which datasets to use and which to ignore, but that's an essay unto itself.)

The node_clade function uses a query like this:

  SELECT * FROM all_descendants(ARRAY(
    SELECT * FROM maximal(ARRAY(
      SELECT * FROM common_ancestors(identity_list,
      context_id)),
    context_id)),
  context_id);

In other words, select all descendants/members of all maximal members of the common ancestors of the specified signifiers. (Again, context_id specifies which datasets to use; a value of zero means that all datasets should be used.)

Running the Dinosauria SQL query mentioned earlier will now yield a set including all of the signifiers mentioned so far. In other words, according to the provided definition and the provided phylogenetic hypothesis, I. bernissartensis, H. armatus, M. bucklandii, T. rex, A. platyrhynchos, and P. domesticus are all members of Dinosauria (as are the hypothetical ancestors in the phylogenetic tree).
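The same computation can be sketched outside the database. The Python below is a plain in-memory version over the arc list, written only to illustrate the logic. It is not the SQL implementation: it ignores context_id, it treats every node as its own ancestor and descendant, and its `maximal` checks all proper descendants rather than just direct arcs. The short taxon labels stand in for resolved signifier identities.

```python
ARCS = [
    ("~1", "~2"), ("~1", "~3"),
    ("~2", "Iguano_bern"), ("~2", "Hylaeo_arma"),
    ("~3", "Megalo_buck"), ("~3", "~4"),
    ("~4", "Tyrann_rex"), ("~4", "~5"),
    ("~5", "Anas_plat"), ("~5", "Passer_dome"),
]


def reachable(nodes, arcs, forward):
    """All nodes reachable from `nodes`, the starting nodes included."""
    found = set(nodes)
    stack = list(nodes)
    while stack:
        node = stack.pop()
        for head, tail in arcs:
            source, target = (head, tail) if forward else (tail, head)
            if source == node and target not in found:
                found.add(target)
                stack.append(target)
    return found


def all_descendants(nodes, arcs):
    return reachable(nodes, arcs, forward=True)


def common_ancestors(nodes, arcs):
    """Nodes that are ancestral to (or identical with) every given node."""
    ancestor_sets = [reachable({node}, arcs, forward=False) for node in nodes]
    return set.intersection(*ancestor_sets)


def maximal(nodes, arcs):
    """Members of `nodes` with no proper descendant also in `nodes`."""
    nodes = set(nodes)
    return {
        node for node in nodes
        if not (all_descendants({node}, arcs) - {node}) & nodes
    }


def node_clade(specifiers, arcs):
    return all_descendants(maximal(common_ancestors(specifiers, arcs), arcs), arcs)


dinosauria = node_clade({"Iguano_bern", "Hylaeo_arma", "Megalo_buck"}, ARCS)
```

With these three specifiers the only common ancestor is ~1, so the clade takes in every node in the tree.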

Suppose the aforementioned NEXUS file also had a character matrix including the character "flight". This would be scored as "present" in Anas_plat and Passer_dome. Thus, the table of inclusions would include these arcs:

  flight/present → Anas_plat
  flight/present → Passer_dome

(Names on Nodes also extrapolates the arc flight/present → ~5, but, again, that's another essay.)

A function called subsets retrieves all signifiers indicated as subsets of a given signifier. Thus, the question that began this essay can be written in SQL as:

  SELECT COUNT(dinosaur) > 0 AS verdict
  FROM node_clade(
    ARRAY(SELECT * FROM resolve_signifier_identities(
      ARRAY[
        'urn:isbn:0853010064::Hylaeosaurus_armatus',
        'urn:isbn:0853010064::Iguanodon_bernissartensis',
        'urn:isbn:0853010064::Megalosaurus_bucklandii'
    ])),
  0) AS dinosaur
  JOIN subsets(
    resolve_signifier_identity(
      'biofile:08192A3::data/flight/present'
    ),
  0) AS flier ON dinosaur = flier;

The verdict: true.
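Stripped of the SQL, the JOIN amounts to a set intersection. A tiny Python sketch, with the members of Dinosauria written out by hand (as they come out under the tree above) and a hypothetical `subsets` helper standing in for the database function:

```python
# Members of Dinosauria under the example tree, written out by hand.
dinosauria = {
    "~1", "~2", "~3", "~4", "~5",
    "Iguano_bern", "Hylaeo_arma", "Megalo_buck",
    "Tyrann_rex", "Anas_plat", "Passer_dome",
}

# Inclusion arcs from the character matrix (superset, subset).
INCLUSIONS = [
    ("flight/present", "Anas_plat"),
    ("flight/present", "Passer_dome"),
]


def subsets(signifier, inclusions):
    """Everything recorded as a direct subset of the given signifier."""
    return {sub for sup, sub in inclusions if sup == signifier}


fliers = subsets("flight/present", INCLUSIONS)
flying_dinosaurs = dinosauria & fliers
verdict = len(flying_dinosaurs) > 0
```

The intersection holds the two birds, so the verdict agrees with the query.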

21 October 2008

Six Ways to Say the Same Thing

Prose

"'Aves' refers to the crown clade stemming from the most recent common ancestor of Ratitae (Struthio camelus Linnaeus 1758), Tinamidae (Tetrao [Tinamus] major Gmelin 1789), and Neognathae (Vultur gryphus Linnaeus 1758)."

—Jacques Gauthier & Kevin de Queiroz 2001 December

Simple Mathematical Formula

Aves := Clade(Struthio camelus + Tetrao major + Vultur gryphus)

Complex Mathematical Formula

Aves Linnaeus 1758 [Gauthier & de Queiroz 2001] := (AD ∘ max ∘ CA)(Struthio camelus ∪ Tetrao major ∪ Vultur gryphus)

Ridiculously Complex Mathematical Formula

C := {x : (∀y ∈ (Struthio camelus ∪ Tetrao major ∪ Vultur gryphus))[x ≼ y]}
A := {x ∈ C : (∀y ∈ C)[x ⊀ y]}
Aves := {x : (∃y ∈ A)[x ≽ y]}
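
The three set-builder lines above translate almost word for word into Python set comprehensions. The sketch below is only an illustration under simplifying assumptions: the universe is a tiny single-parent tree rather than a graph of individual organisms, and the node names "stem", "crown", "palaeognath", and "neognath", as well as the choice of outgroup, are made up for the example.

```python
# Hypothetical child -> parent table; the root ("stem") has no entry.
PARENT = {
    "crown": "stem",
    "Tyrannosaurus rex": "stem",
    "palaeognath": "crown",
    "neognath": "crown",
    "Struthio camelus": "palaeognath",
    "Tetrao major": "palaeognath",
    "Vultur gryphus": "neognath",
    "Anas platyrhynchos": "neognath",
}
UNIVERSE = set(PARENT) | set(PARENT.values())


def preceq(x, y):
    """x ≼ y: x is y itself or an ancestor of y."""
    while y is not None:
        if x == y:
            return True
        y = PARENT.get(y)
    return False


def prec(x, y):
    """x ≺ y: x is a proper ancestor of y."""
    return x != y and preceq(x, y)


specifiers = {"Struthio camelus", "Tetrao major", "Vultur gryphus"}

# C: everything that precedes (or equals) every specifier.
C = {x for x in UNIVERSE if all(preceq(x, y) for y in specifiers)}
# A: members of C that do not precede any other member of C.
A = {x for x in C if all(not prec(x, y) for y in C)}
# Aves: everything that some member of A precedes (or equals).
Aves = {x for x in UNIVERSE if any(preceq(y, x) for y in A)}

print(sorted(Aves))
```

Here C holds the two common ancestors, A keeps only the later one, and Aves is that ancestor plus everything descended from it, which leaves out the stem node and the outgroup.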

Simple MathML-Content

<apply
    xmlns="http://www.w3.org/1998/Math/MathML">
  <csymbol
    definitionURL="http://namesonnodes.org/2008/phylo/math/nodeClade"/>
  <csymbol
    definitionURL="urn:isbn:0-85301-006-4/Struthio+camelus"/>
  <csymbol
    definitionURL="urn:isbn:0-85301-006-4/Tetrao+major"/>
  <csymbol
    definitionURL="urn:isbn:0-85301-006-4/Vultur+gryphus"/>
</apply>
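
As a rough sketch of how an application might read such a fragment, the Python below parses it with the standard library's xml.etree.ElementTree and splits it into an operator URI and its argument URIs. The function name read_application is hypothetical, and the assumption that the first csymbol is always the operator is mine; this is not the Names on Nodes parser.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

SOURCE = """
<apply xmlns="http://www.w3.org/1998/Math/MathML">
  <csymbol definitionURL="http://namesonnodes.org/2008/phylo/math/nodeClade"/>
  <csymbol definitionURL="urn:isbn:0-85301-006-4/Struthio+camelus"/>
  <csymbol definitionURL="urn:isbn:0-85301-006-4/Tetrao+major"/>
  <csymbol definitionURL="urn:isbn:0-85301-006-4/Vultur+gryphus"/>
</apply>
"""


def read_application(xml_text):
    """Return (operator URI, list of argument URIs) from an <apply> element.

    Assumes every child is a <csymbol> and that the first one names the
    operation; real MathML-Content allows much more than this.
    """
    root = ET.fromstring(xml_text.strip())
    if root.tag != f"{{{MATHML_NS}}}apply":
        raise ValueError("expected a MathML <apply> element")
    urls = [
        child.get("definitionURL")
        for child in root.findall(f"{{{MATHML_NS}}}csymbol")
    ]
    if not urls:
        raise ValueError("<apply> has no <csymbol> children")
    return urls[0], urls[1:]


operator, specifiers = read_application(SOURCE)
print(operator)
for uri in specifiers:
    print("  ", uri)
```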

Complex MathML within Custom Markup

<pn:definition
    xmlns="http://www.w3.org/1998/Math/MathML"
    xmlns:pn="http://namesonnodes.org/2008/phylo/names">
  <apply>
    <csymbol
        definitionURL="http://namesonnodes.org/2008/phylo/math/clade">
      <mi form="prefix">Clade</mi>
    </csymbol>
    <apply>
      <csymbol
          definitionURL="http://namesonnodes.org/2008/phylo/math/nodeAncestors">
        <mo form="infix">+</mo>
      </csymbol>
      <csymbol
          definitionURL="urn:isbn:0-85301-006-4/Struthio+camelus">
        <![CDATA[<i>Ratitae</i> (<i>Struthio camelus</i> Linnaeus 1758)]]>
      </csymbol>
      <csymbol
          definitionURL="urn:isbn:0-85301-006-4/Tetrao+major">
        <![CDATA[<i>Tinamidae</i> (<i>Tetrao</i> [<i>Tinamus</i>] <i>major</i> Gmelin 1789)]]>
      </csymbol>
      <csymbol
          definitionURL="urn:isbn:0-85301-006-4/Vultur+gryphus">
        <![CDATA[<i>Neognathae</i> (<i>Vultur gryphus</i> Linnaeus 1758)]]>
      </csymbol>
    </apply>
  </apply>
</pn:definition>

References

urn:isbn:0-912532-57-2/1 (Jacques Gauthier & Kevin de Queiroz 2001 December)
urn:doi:10.1111/j.1463-6409.2007.00302.x (T. Michael Keesey 2007 November)

03 March 2008

Names on NEXUS: Under the Hood

I nearly have the basic data model and data processing functions pinned down for Names on NEXUS. Once again, that's my project, hinted at in a paper of mine (Keesey 2007), to relate the data in NEXUS files (Maddison et al. 1997) to definitions of names as governed by the PhyloCode.

I've had to learn some new technologies and code packages to accomplish this. Here's a rundown of some key ones:

BioJava
This is the most recent addition. Originally I had built my own library in ActionScript 3.0 to parse NEXUS files. But it had some limitations. NEXUS is a rather old format (as bioinformatics formats go), and different applications produce somewhat different versions. So rather than use my own ad hoc library, I decided I should get an open source one.

There aren't any in ActionScript, of course, but there are some in Java. This meant I had to switch NEXUS parsing from the front end to the back end, but in some ways that's better. It means I can store parsed data in the database instead of having the client application parse NEXUS data every time. In fact, it means that the client never has to actually see raw NEXUS data; it can just fetch the pre-parsed data.

I first looked into using the NEXUS-parsing code in Mesquite, an open-source phylogenetic analysis program. But it's not set up for simply using the parsing engine on its own, since the parser is tied into a whole file-browsing package. Then I found BioJava, which had exactly what I needed. Just look at this package!

Unfortunately there are still some problems with opening certain NEXUS files. I downloaded some samples from TreeBASE and the parser flagged errors in the TREES section. The reason, as I found after hours of searching and considering whether it might be better just to write my own parser after all, turns out to be an extra comma in the TRANSLATE section. Still not exactly sure how I'm going to solve that one. But it works when I remove the comma!
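
One possible workaround, sketched here in Python rather than Java, is to clean the text before handing it to the parser. This assumes the extra comma is a trailing one sitting right before the semicolon that ends the TRANSLATE command, which the files I describe may or may not match. The function name is hypothetical, and the regular expression does not handle quoted labels that contain semicolons.

```python
import re

# A TRANSLATE command, lazily matched up to a comma that is followed only by
# whitespace and the terminating semicolon.
TRAILING_COMMA = re.compile(r"(\btranslate\b[^;]*?),(\s*;)", re.IGNORECASE)


def strip_trailing_translate_comma(nexus_text):
    """Remove a comma that directly precedes the ';' ending TRANSLATE."""
    return TRAILING_COMMA.sub(r"\1\2", nexus_text)


sample = (
    "BEGIN TREES;\n"
    "  TRANSLATE\n"
    "    1 Anas_plat,\n"
    "    2 Passer_dome,\n"
    "  ;\n"
    "  TREE t1 = (1,2);\n"
    "END;\n"
)
cleaned = strip_trailing_translate_comma(sample)
print(cleaned)
```

Commas between entries and inside the tree description are left alone, and running the function on an already clean file changes nothing.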

Hibernate
Remember how I wrote a post a while ago about building classes that map from the Java back-end to the database? Turns out that was all unnecessary. Hibernate is a persistence layer that provides pretty seamless integration between Java and a database (in this case, a PostgreSQL database). Augmented by Hibernate Annotations and Hibernate Validator, it makes it fairly easy to set up and use a complex, well-organized database.

Well, okay, there's a bit of a learning curve first, but it's totally worth it. Incidentally, the book I used to learn it has what is possibly the best title ever.

Flex Data Management Services
Basically, Hibernate is to Java and databases as mx.data is to Flex and Java. It provides a persistence layer so that I don't have to keep track of whether or not I need to request certain data from the Java back-end. I just create DataService objects, tie them to Assembler classes on the back end, and it's all taken care of.

FlexUnit and JUnit
I've already extolled the virtues of unit testing. These wonderful (and, yes, comically-named) packages (huh huh) make it possible. I haven't built enough unit tests, really, but the few I have have been enormously useful in hunting down peculiar errors. And aside from that, since Eclipse can run JUnit tests natively, I can even use them to perform certain important tasks, such as setting up the database from annotated classes via Hibernate.

So What's Left For Me To Do?
Plenty. Although these premade packages help out enormously, I've still had to build an entire mathematics library, a MathML parser, and some tools for handling URIs. I've still got tons of work left to do on the user interface. (Event bubbling is helping a lot with that, by the way.) And, even when stuff is already built, just hooking up one pipe to another pipe can be more complicated than it seems.

Here's a rough list of what's left:

Finalize the servlet for uploading and parsing NEXUS data. (I'm very close on this one.)
Finish the required behind-the-scenes "search" features. Some of these might be a bit involved, like the ones that suggest possible links between NEXUS taxa and species or specimens or between NEXUS character states and apomorphies.
Overhaul the way Names on NEXUS entities (particularly specifiers) are referenced in MathML.
Finish the user interface. So far I just have a few forms. I still have to do tree visualization, stylesheets, high-level navigation, transitions, etc.
Constrain access to certain functionality. Names on NEXUS is going to be a pretty open, collaborative tool, but I need to set a few boundaries. (E.g., I can't have any old person delete data.)
Make sure the server's all optimized, with a static, JNDI-named Hibernate factory, etc.

And here are some things that aren't, strictly speaking, essential, but would be awfully nice:

Create a servlet to provide permanent links for Names on NEXUS entities.
Create unit tests for all relevant classes.
Add JavaDoc and ASDoc comments to all code.

Part of me is also thinking about renaming the project. I mean, it's a good name for what it does right now, but what if I start to bring formats other than NEXUS into the fold? (Not that there are many, but....) Well, I'll probably cross that bridge when I come to it.

My goal is to get an alpha version online sometime this Spring and go open source with it by the Fall. We'll see....

05 November 2007

My First Paper

The inauguration of this blog was just barely in time for me to report my first paper as primary (and sole) author:

KEESEY, T. M. 2007. A mathematical approach to defining clade names, with potential applications to computer storage and processing. Zoologica Scripta 36 (6): 607–621. doi:10.1111/j.1463-6409.2007.00302.x

Here's the abstract, also available here:

Clade names may be objectively defined based on conditions of phylogeny. Definitions usually take one of three forms — node-, branch- or apomorphy-based — but other forms and complex permutations of these forms are also possible. Some database projects have attempted to store definitions of clade names in a manner accessible to computer applications, but, so far, they have only provided ways of storing the most common types of definition. To create a more extensible system, I have taken a mathematical approach to defining clade names. To render definitions accessible to computer storage and analysis, I propose using Mathematical Markup Language (MATHML) with extensions. Since the mathematical approach is granular to the level of the organism, not to fuzzy higher levels such as population or species, it sheds light on some theoretical difficulties with defining clade names. For example, some definitions do not resolve to a single organism as the ancestor, but to sets of organisms which are not ancestral to each other and share common descendants. I term such sets ‘cladogenetic sets’.

If you made it through that, congratulations. Now you may have some questions.

What is a "clade"?

An ancestor and all of its descendants. As an example, mammals form a clade. Fish do not form a clade, since they exclude some descendants (tetrapods). Hoofed mammals ("ungulates") do not form a clade, since their common ancestors were not hoofed (instead, hooves have evolved several times among placental mammals).

What is "branch-based", again?

The PhyloCode is a set of rules being put together to deal with the naming of clades. It recommends certain forms of definition. The main ones (but certainly not the only ones), with examples, are:

node-based. "Mammalia is the final common ancestor of platypuses and humans, and all descendants of that ancestor."
branch-based. "Synapsida is the initial ancestor of humans which is not also ancestral to sand lizards, and all descendants of that ancestor." (The image below represents two branch-based clades, one in red and one in yellow. White dots represent organisms in both clades.)
apomorphy-based. "Avialae is the first ancestor of Andean condors to possess powered flight homologous with that in Andean condors, and all descendants of that ancestor."

(Actual definitions would use proper scientific names instead of "platypuses", "humans", etc. but you get the idea.)
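
To show how mechanical these three forms are, here is a small Python sketch that resolves each one on a made-up tree. Everything in it is an assumption for illustration: the tree shape, the node names, the set of flying organisms, and the function names. It also simplifies by giving each node a single parent, so it cannot produce the multi-organism ancestor sets discussed in the paper.

```python
# Hypothetical child -> parent table for a toy amniote tree.
PARENT = {
    "amniote": None,
    "stem_synapsid": "amniote",
    "mammal_anc": "stem_synapsid",
    "platypus": "mammal_anc",
    "human": "mammal_anc",
    "reptile_anc": "amniote",
    "sand_lizard": "reptile_anc",
    "early_theropod": "reptile_anc",
    "first_flier": "early_theropod",
    "condor": "first_flier",
    "sparrow": "first_flier",
}
HAS_FLIGHT = {"first_flier", "condor", "sparrow"}


def lineage(node):
    """The node followed by its ancestors, most recent first."""
    out = []
    while node is not None:
        out.append(node)
        node = PARENT[node]
    return out


def clade(ancestor):
    """An ancestor and all of its descendants."""
    return {n for n in PARENT if ancestor in lineage(n)}


def node_based(a, b):
    """Final common ancestor of a and b, plus its descendants."""
    shared = [n for n in lineage(a) if n in lineage(b)]
    return clade(shared[0])


def branch_based(internal, external):
    """Initial ancestor of internal not ancestral to external, plus descendants."""
    excluded = set(lineage(external))
    own = [n for n in lineage(internal) if n not in excluded]
    return clade(own[-1])


def apomorphy_based(specifier, has_trait):
    """First ancestor in the specifier's unbroken run of trait-bearers."""
    run = []
    for n in lineage(specifier):
        if n not in has_trait:
            break
        run.append(n)
    if not run:
        raise ValueError("specifier does not have the trait")
    return clade(run[-1])


mammalia = node_based("platypus", "human")
synapsida = branch_based("human", "sand_lizard")
avialae = apomorphy_based("condor", HAS_FLIGHT)
print(sorted(mammalia), sorted(synapsida), sorted(avialae))
```

On this tree the branch-based clade is larger than the node-based one because it also takes in the stem ancestor.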

This stands in contrast to the current taxonomic codes, which are rank-based. Definitions under rank-based codes look more like, "Homo is the genus that includes Homo sapiens." There is a very important difference between these two styles of definition. Rank-based definitions are based (at least partly) on subjective opinions, since the ranks (with the possible, but contentious, exception of species) do not have any objective meaning. We all probably learned about kingdoms, classes, orders, families, and genera in biology class, but these ranks don't have any intrinsic meaning. A family of birds might include a few closely related species, while a family of insects might include thousands, with more distant common ancestry.

Phylogenetic definitions, on the other hand, proceed directly from our knowledge of phylogeny. When two researchers disagree on the content of a rank-based taxon, they might be arguing about aesthetics, actual relationships, or both. When they disagree about the content of a phylogenetic taxon, they can only be arguing about actual relationships.

So, what did you do?

Since phylogenetic definitions are based directly on phylogeny, without need for opinions, they can be expressed in completely unambiguous language. This includes:

Mathematical formulas.
Computer languages.

As I discuss in the paper, some people have created unambiguous shorthand formulas and unambiguous database schemas for representing phylogenetic definitions. But the previous efforts have all focused on simple definitional formats, ignoring other formats and complex permutations.

Well, la-ti-da. So what?

This means more of the taxonomic process can be automated. With rank-based definitions, there has to be an expert to "feel out" how expansive a genus, family, order, etc. should be. But with phylogenetic definitions, you can feed a computer application the phylogeny encoded in a popular file format (e.g., NEXUS) and taxonomic definitions encoded in a popular file format (MathML), and it can figure out the content referred to by a taxonomic name in fractions of a second.

Okay, so where's the application?

I'm still working on one, called Names on NEXUS. So far it's going well; I just need to refactor and complete the server-side application and touch up the client-side application. Should have some time for that next year.