02 April 2012

An Idea for the EOL Phylogenetic Tree Challenge

Earlier this year, the Encyclopedia of Life announced the EOL Phylogenetic Tree Challenge. The goal: to produce "a very large, phylogenetically-organized set of scientific names suitable for ingestion into the Encyclopedia of Life as an alternate browsing hierarchy". The prize: an all-expenses-paid trip to iEvoBio 2012 in Ottawa!

This interested me greatly, because:

  1. It's exactly the sort of thing I'm working on for PhyloPic.
  2. I can't really justify paying for a trip to iEvoBio this year. (Phyloinformatics is my hobby, not my profession!)

After reading Rod Page's thoughts on the challenge, I came up with a basic idea and started to implement it. Unfortunately, now that we're two weeks from the deadline, I'm realizing that:
  1. I do not have the time to complete it.
  2. Even if it were paid for, I can't justify a trip out of town on my own right now.

Why not? Simply put, this.

So, instead, I'm going to outline the general approach I was going to take, and if someone else wants to run with it, knock yourself out. (Just give me partial credit.)

The Problem

As Rod Page said, "If you want a simple tree to navigate, then I'd argue that the NCBI tree is a pretty good start, and EOL already has this." Absolutely true. And in fact PhyloPic uses the NCBI tree (and others) to form the basis of its phylogeny.

But PhyloPic's current approach, while pragmatic, is completely backwards! It takes taxonomic names, organized into a taxonomy, and infers the phylogeny from that. This is the exact opposite of proper phylogenetic nomenclature (à la the PhyloCode), where you start with a phylogeny and then apply defined names to it.

The Proper Approach

Ideally, forming a gigantic phylogenetic taxonomy (or any phylogenetic taxonomy) should involve these steps:
  1. Assemble a phylogeny.
  2. Assemble a list of phylogenetic name definitions.
  3. Apply the names, using their definitions.

We have a good resource for step 1: TreeBASE. I have a tool for automating step 3: Names on Nodes (part of Pymathema). And, in the near future, we should have a good resource for step 2: RegNum (the PhyloCode's registration database).
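
To make the three steps concrete, here is a minimal Python sketch of applying simple node-based definitions to a toy phylogeny. It is not how Names on Nodes works internally. The tree, the node labels (`n1`, `n2`, `n3`), and the specifiers chosen for each name are assumptions made up for illustration, and only the simple node-based case is handled.

```python
def ancestors(parent, node):
    """Return the path from node up to the root, starting with node itself."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def node_based_clade_root(parent, specifiers):
    """Return the most recent common ancestor of all specifiers."""
    first, *rest = specifiers
    common = ancestors(parent, first)
    for spec in rest:
        seen = set(ancestors(parent, spec))
        # Keep only shared ancestors, preserving tip-to-root order.
        common = [n for n in common if n in seen]
    return common[0]


# Step 1: a toy phylogeny as a child -> parent map (node labels are arbitrary).
parent = {
    "Homo sapiens": "n1", "Pan troglodytes": "n1",
    "n1": "n2", "Gorilla gorilla": "n2",
    "n2": "n3", "Pongo pygmaeus": "n3",
}

# Step 2: name definitions as lists of specifiers (illustrative, not official).
definitions = {
    "Homininae": ["Homo sapiens", "Gorilla gorilla"],
    "Hominidae": ["Homo sapiens", "Pongo pygmaeus"],
}

# Step 3: apply each name to the node its definition picks out.
applied = {name: node_based_clade_root(parent, specs)
           for name, specs in definitions.items()}
```

With this toy data, `applied` maps each name to the internal node that is the last common ancestor of its specifiers.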

So, in the near future, we should be able to do this. But what about the present?

Automated Definitions

My plan was to create a tool, called Dephyne, that would automatically generate phylogenetic definitions for names. I actually implemented a working version of this, as part of Pymathema (although the implementation could use some refinement). Dephyne reads nomenclatural and taxonomic data, and converts it into MathML definitions for taxonomic names. Here are some of the rules it uses:
  • If a taxon has no subtaxa, consider it a unit, and therefore usable as a specifier.
  • If a taxon has more immediate subtaxa than sister taxa, and all of its subtaxa are extant, use a branch-modified node-based definition.
  • Otherwise, use a simple node-based definition.
  • Use one specifier for each immediate subtaxon (simple node-based definitions) and one specifier for each sister taxon (branch-modified node-based definitions).
  • To determine which specifier to use for a given taxon, follow these rules:
    • Favor eponymous names.
    • Favor type species.
    • Favor names of extant taxa over names of extinct taxa.
    • Favor more widely used names.
    • Favor older names.
    • When all else fails, use alphabetical order.

Using these rules, I was able to take a data dump of the mammalian taxonomy from ITIS and convert it into a MathML file defining the names of all major mammalian taxa. Most of the definitions it generated were not half bad!
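
The rules listed above can be sketched roughly like this in Python. This is not the Dephyne code itself, just an illustration of the logic. The `Taxon` class and all of its fields are hypothetical, "eponymous" and "type species" are supplied as plain flags rather than computed, and "all of its subtaxa are extant" is read as applying to the immediate subtaxa only.

```python
from dataclasses import dataclass, field


@dataclass
class Taxon:
    name: str
    extant: bool = True
    eponymous: bool = False      # flag: name shares its stem with the taxon being defined
    type_species: bool = False   # flag: name is a type species
    usage: int = 0               # hypothetical measure of how widely the name is used
    year: int = 9999             # year the name was published
    subtaxa: list = field(default_factory=list)


def definition_kind(taxon, sister_count):
    """Choose the kind of definition for a taxon, per the first three rules."""
    if not taxon.subtaxa:
        return "unit"
    if len(taxon.subtaxa) > sister_count and all(s.extant for s in taxon.subtaxa):
        return "branch-modified node-based"
    return "node-based"


def specifier_rank(candidate):
    """Sort key for the specifier rules: lower tuples are preferred."""
    return (
        not candidate.eponymous,     # favor eponymous names
        not candidate.type_species,  # favor type species
        not candidate.extant,        # favor extant over extinct
        -candidate.usage,            # favor more widely used names
        candidate.year,              # favor older names
        candidate.name,              # fall back to alphabetical order
    )


def pick_specifier(candidates):
    """Return the preferred specifier among the candidate names."""
    return min(candidates, key=specifier_rank)


candidates = [
    Taxon("Pan", usage=50, year=1816),
    Taxon("Homo", eponymous=True, year=1758),
    Taxon("Gorilla", usage=80, year=1852),
]
chosen = pick_specifier(candidates)
```

Encoding the preferences as a single tuple key keeps the tie-breaking order explicit: each rule only matters when all the earlier ones tie.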

Next Steps

My next big challenge was going to be synthesizing as many TreeBASE phylogenies as possible into a gigantic phylogeny. One of the key difficulties here is that not all phylogenies use the same resolution. One might use genera while another uses species, for example. So the tree synthesis would have to be taxonomically informed (probably using the EOL API), and break the units in all phylogenies down to the same level.
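
As a rough sketch of what "breaking the units down to the same level" might look like, the Python below replaces each genus-level tip with an unresolved group of its species. Trees are nested tuples with strings as tips. The `species_of` lookup is a hypothetical stand-in for a taxonomic service (no real API is called), and its contents are illustrative.

```python
def expand_to_species(tree, species_of):
    """Replace each higher-level tip with an unresolved group of its species."""
    if isinstance(tree, str):
        species = species_of.get(tree)
        if not species:
            return tree                 # already a species, or unknown: keep as is
        if len(species) == 1:
            return species[0]           # a single species needs no grouping
        return tuple(sorted(species))   # polytomy of all listed species
    return tuple(expand_to_species(child, species_of) for child in tree)


# Hypothetical taxonomy lookup: genus name -> list of species names.
species_of = {
    "Pan": ["Pan troglodytes", "Pan paniscus"],
    "Homo": ["Homo sapiens"],
}

genus_tree = (("Homo", "Pan"), "Gorilla gorilla")
species_tree = expand_to_species(genus_tree, species_of)
```

Once every source tree has been expanded this way, the trees share the same kind of tips, which is a precondition for any synthesis step.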

The other big challenge is that not all taxa are in a TreeBASE phylogeny. So some of the specifiers used by the generated definitions might not be in the synthesized megaphylogeny. In these cases, it would be good to use a taxonomic resource (again, probably the EOL API) as a backup.
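
One possible shape for that backup, sketched in Python: if a specifier is missing from the megaphylogeny, walk up the taxonomy until reaching an ancestor that is in the tree, and attach the specifier there. This assumes the phylogeny's nodes already carry taxon names, which is a simplification. Both parent maps are hypothetical toy data, not output from EOL or TreeBASE.

```python
def attach_missing(phylo_parent, taxonomy_parent, specifier):
    """Return a copy of phylo_parent with specifier attached under its
    closest taxonomic ancestor that is already a node in the phylogeny."""
    nodes = set(phylo_parent) | set(phylo_parent.values())
    if specifier in nodes:
        return dict(phylo_parent)       # nothing to do

    # Climb the taxonomy until we reach a name the phylogeny knows about.
    ancestor = taxonomy_parent.get(specifier)
    while ancestor is not None and ancestor not in nodes:
        ancestor = taxonomy_parent.get(ancestor)
    if ancestor is None:
        raise LookupError(f"no ancestor of {specifier!r} is in the phylogeny")

    result = dict(phylo_parent)
    result[specifier] = ancestor
    return result


# Toy phylogeny (child -> parent) with named internal nodes.
phylo_parent = {
    "Homo sapiens": "Hominini",
    "Pan troglodytes": "Hominini",
}

# Toy taxonomy (child -> parent) used only as a fallback.
taxonomy_parent = {
    "Pan paniscus": "Pan",
    "Pan": "Hominini",
}

patched = attach_missing(phylo_parent, taxonomy_parent, "Pan paniscus")
```

The attached taxon lands in a polytomy at the nearest known ancestor, so its placement is only as precise as the taxonomy allows.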

I am planning to work more on this, with an eye toward one day using it to create PhyloPic's taxonomy. But I'm not going to be able to get it done for this particular challenge. If you like the idea, contact me!


  1. Sorry you won't be competing, Mike. If I understand your approach, you are generating internal node names based on the subtaxa beneath them. That's a start.

    Rather than trying to synthesize across all TreeBASE phylogenies (a larger task that definitely wouldn't be worth it for a trip to iEvoBio), one could stitch together particularly large or promising ones, convert to Darwin Core Archive, and submit that.

  2. Hmm, like, just grab the ones relevant to Mammalia, or something like that? Could work. Again, though, dunno if I have time this year.