This interested me greatly, because:
- It's exactly the sort of thing I'm working on for PhyloPic.
- I can't really justify paying for a trip to iEvoBio this year. (Phyloinformatics is my hobby, not my profession!)
After reading Rod Page's thoughts on the challenge, I came up with a basic idea, and started to implement it. Unfortunately, now that we're two weeks from the deadline, I'm realizing that:
- I do not have the time to complete it.
- Even if it were paid for, I couldn't justify a trip out of town on my own right now.
Why not? Simply put, this.
So, instead, I'm going to outline the general approach I was going to take, and if someone else wants to run with it, knock yourself out. (Just give me partial credit.)
As Rod Page said, "If you want a simple tree to navigate, then I'd argue that the NCBI tree is a pretty good start, and EOL already has this." Absolutely true. And in fact PhyloPic uses the NCBI tree (and others) to form the basis of its phylogeny.
But PhyloPic's current approach, while pragmatic, is completely backwards! It takes taxonomic names, organized into a taxonomy, and infers the phylogeny from that. This is the exact opposite of proper phylogenetic nomenclature (à la the PhyloCode), where you start out with a phylogeny and then apply defined names to it.
The Proper Approach
- Assemble a phylogeny.
- Assemble a list of phylogenetic name definitions.
- Apply the names, using their definitions.
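To make those three steps concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the toy phylogeny, the placeholder taxa ("Aus bus" and friends), the made-up definitions, and the function names. It only handles simple node-based definitions, where a name refers to the clade stemming from the most recent common ancestor of its specifiers.

```python
# Step 1: a toy phylogeny (hypothetical), stored as child -> parent.
# Internal nodes are deliberately unnamed ("n1", "n2", "root").
PARENT = {
    "root": None,
    "n1": "root",
    "n2": "root",
    "Aus bus": "n1",
    "Aus cus": "n1",
    "Xus yus": "n2",
}

# Step 2: node-based definitions (hypothetical), as name -> specifiers.
DEFINITIONS = {
    "Aus": ["Aus bus", "Aus cus"],
    "Ausoidea": ["Aus bus", "Xus yus"],
}


def ancestors(node, parent):
    """Return the path from a node up to the root, starting with the node."""
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


def node_based_clade(specifiers, parent):
    """Return the most recent common ancestor of all the specifiers."""
    paths = [ancestors(s, parent) for s in specifiers]
    shared = set(paths[0]).intersection(*(set(p) for p in paths[1:]))
    # Walking up from the first specifier, the first shared node is the MRCA.
    for node in paths[0]:
        if node in shared:
            return node
    raise ValueError("specifiers share no common ancestor")


def apply_names(definitions, parent):
    """Step 3: attach each defined name to a node of the phylogeny."""
    return {
        name: node_based_clade(specifiers, parent)
        for name, specifiers in definitions.items()
    }


APPLIED = apply_names(DEFINITIONS, PARENT)
```

Note that the names land on whatever nodes the phylogeny happens to have. Swap in a different tree and the same definitions may pick out different clades, which is the whole point.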
So, in the near future, we should be able to do this. But what about the present?
My plan was to create a tool, called Dephyne, that would automatically generate phylogenetic definitions for names. I actually implemented a working version of this, as part of Pymathema (although the implementation could use some refinement). Dephyne reads nomenclatural and taxonomic data, and converts it into MathML definitions for taxonomic names. Here are some of the rules it uses:
- If a taxon has no subtaxa, consider it a unit, and therefore useable as a specifier.
- If a taxon has more immediate subtaxa than sister taxa, and all of its subtaxa are extant, use a branch-modified node-based definition.
- Otherwise, use a simple node-based definition.
- For simple node-based definitions, use one specifier for each immediate subtaxon. For branch-modified node-based definitions, use one specifier for each sister taxon.
- To determine which specifier to use for a given taxon, follow these rules, in order:
  - Favor eponymous names.
  - Favor type species.
  - Favor names of extant taxa over names of extinct taxa.
  - Favor more widely used names.
  - Favor older names.
  - When all else fails, use alphabetical order.
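As a rough illustration of how rules like these can be coded, here is a sketch in Python. It is not Dephyne's actual code. The `Taxon` class, its fields, and the function names are all hypothetical, and it assumes the input data already flags eponymy, type status, usage counts, and publication years. It also reads "all of its subtaxa are extant" as referring to the immediate subtaxa, which is my guess for the purposes of the example.

```python
from dataclasses import dataclass, field


@dataclass
class Taxon:
    # All fields are assumptions about what the input data provides.
    name: str
    extant: bool = True
    eponymous: bool = False  # name is eponymous with the taxon being defined
    is_type: bool = False    # name is a type species
    usage: int = 0           # some measure of how widely the name is used
    year: int = 9999         # year the name was published
    subtaxa: list = field(default_factory=list)


def is_unit(taxon):
    """A taxon with no subtaxa is a unit, and so usable as a specifier."""
    return not taxon.subtaxa


def units(taxon):
    """Collect every unit contained in a taxon."""
    if is_unit(taxon):
        return [taxon]
    return [u for sub in taxon.subtaxa for u in units(sub)]


def definition_kind(taxon, sister_count):
    """Choose the kind of definition to generate for a taxon."""
    if is_unit(taxon):
        return "unit"
    if len(taxon.subtaxa) > sister_count and all(s.extant for s in taxon.subtaxa):
        return "branch-modified node-based"
    return "node-based"


def pick_specifier(taxon):
    """Pick the preferred unit, applying the tie-breaking rules in order."""
    return min(
        units(taxon),
        key=lambda u: (
            not u.eponymous,  # favor eponymous names
            not u.is_type,    # favor type species
            not u.extant,     # favor extant over extinct
            -u.usage,         # favor more widely used names
            u.year,           # favor older names
            u.name,           # alphabetical order as the last resort
        ),
    )


# Placeholder data, purely for illustration.
genus = Taxon("Aus", subtaxa=[
    Taxon("Aus bus", is_type=True, year=1758),
    Taxon("Aus cus", usage=10, year=1800),
    Taxon("Aus dus", extant=False, year=1750),
])
chosen = pick_specifier(genus)
kind = definition_kind(genus, sister_count=1)
```

Sorting on a tuple keeps the precedence explicit: a later rule only matters when every earlier rule is tied.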
Using these rules, I was able to take a data dump of the mammalian taxonomy from ITIS and convert it into a MathML file defining the names of all major mammalian taxa. Most of the definitions it generated were not half bad!
My next big challenge was going to be synthesizing as many TreeBASE phylogenies as possible into a gigantic phylogeny. One of the key difficulties here is that not all phylogenies use the same resolution. One might use genera while another uses species, for example. So the tree synthesis would have to be taxonomically informed (probably using the EOL API), and break the units in all phylogenies down to the same level.
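Here is one way the "break everything down to the same level" step might look, sketched in Python. The nested-tuple tree format, the `expand_tips` function, and the `TAXONOMY` lookup table are all assumptions for illustration. In practice the lookup would come from a taxonomic resource rather than a hard-coded dictionary. A genus-level tip is simply replaced by an unresolved cluster of its species.

```python
# Hypothetical lookup: higher taxon -> the units it contains.
TAXONOMY = {
    "Aus": ["Aus bus", "Aus cus"],
    "Xus": ["Xus yus"],
}


def expand_tips(tree, taxonomy):
    """Replace each coarse tip with its finer units, recursively.

    A tree is either a tip (a string) or a tuple of subtrees.
    """
    if isinstance(tree, str):
        members = taxonomy.get(tree)
        if not members:
            return tree  # already a unit, or unknown to the taxonomy
        expanded = tuple(expand_tips(m, taxonomy) for m in members)
        # Avoid creating a pointless single-child node.
        return expanded[0] if len(expanded) == 1 else expanded
    return tuple(expand_tips(child, taxonomy) for child in tree)


genus_level = ("Aus", ("Xus", "Zus zus"))
species_level = expand_tips(genus_level, TAXONOMY)
```

The expanded genus comes out as a polytomy, since the taxonomy says nothing about relationships within it. Another phylogeny in the synthesis could then resolve that polytomy.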
The other big challenge is that not all taxa are in a TreeBASE phylogeny. So some of the specifiers used by the generated definitions might not be in the synthesized megaphylogeny. In these cases, it would be good to use a taxonomic resource (again, probably the EOL API) as a backup.
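One possible form for that backup, sketched in Python under some loud assumptions: the `locate` function, the child-to-parent taxonomy table, and the idea of falling back to the nearest enclosing taxon found in the tree are all my own illustration, not something already built.

```python
def locate(specifier, tree_nodes, taxonomy_parent):
    """Find where a specifier sits, preferring the phylogeny itself.

    Returns (node, source), where source records how the node was found.
    """
    if specifier in tree_nodes:
        return specifier, "phylogeny"
    # Fall back: climb the taxonomy until reaching a taxon the tree knows.
    taxon = taxonomy_parent.get(specifier)
    while taxon is not None:
        if taxon in tree_nodes:
            return taxon, "taxonomy"
        taxon = taxonomy_parent.get(taxon)
    return None, "unplaced"


# Placeholder data, purely for illustration.
TREE_NODES = {"Aus bus", "Aus", "Ausidae"}
TAXONOMY_PARENT = {"Aus cus": "Aus", "Aus": "Ausidae", "Xus yus": "Xus"}

found = locate("Aus bus", TREE_NODES, TAXONOMY_PARENT)
fallback = locate("Aus cus", TREE_NODES, TAXONOMY_PARENT)
missing = locate("Xus yus", TREE_NODES, TAXONOMY_PARENT)
```

Keeping track of the source matters, because a definition resting on a taxonomic guess deserves less trust than one resting on the phylogeny.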
I am planning to work more on this, with an eye toward one day using it to create PhyloPic's taxonomy. But I'm not going to be able to get it done for this particular challenge. If you like the idea, contact me!