03 March 2008

Names on NEXUS: Under the Hood

I nearly have the basic data model and data processing functions pinned down for Names on NEXUS. Once again, that's my project, hinted at in a paper of mine (Keesey 2007), to relate the data in NEXUS files (Maddison et al. 1997) to definitions of names as governed by the PhyloCode.

I've had to learn some new technologies and code packages to accomplish this. Here's a rundown of some key ones:

BioJava
This is the most recent addition. Originally I had built my own library in ActionScript 3.0 to parse NEXUS files, but it had some limitations. NEXUS is a rather old format (as bioinformatics formats go), and different applications produce somewhat different versions. So rather than keep using my own ad hoc library, I decided I should get an open-source one.

There aren't any in ActionScript, of course, but there are some in Java. This meant I had to move NEXUS parsing from the front end to the back end, but in some ways that's better. It means I can store parsed data in the database instead of having the client application parse NEXUS data every time. In fact, it means that the client never has to see raw NEXUS data at all; it can just fetch the pre-parsed data.
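As a rough illustration of the "parse once on the back end, serve parsed data afterwards" idea, here is a minimal Java sketch. Everything in it is hypothetical: `ParsedNexusStore`, `upload`, and `fetch` are made-up names, the in-memory map merely stands in for the real database, and the toy line-counting "parser" stands in for a real NEXUS parser.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical in-memory stand-in for "parse on upload, store the result,
 * serve only parsed data". A real back end would persist to a database.
 */
class ParsedNexusStore<T> {
    private final Map<String, T> parsedByFileId = new ConcurrentHashMap<>();
    private final Function<String, T> parser;

    ParsedNexusStore(Function<String, T> parser) {
        this.parser = parser;
    }

    /** Called once per uploaded file: parse the raw text and keep only the result. */
    void upload(String fileId, String rawNexus) {
        parsedByFileId.put(fileId, parser.apply(rawNexus));
    }

    /** Called by the client: returns the stored parsed data (or null if unknown). */
    T fetch(String fileId) {
        return parsedByFileId.get(fileId);
    }

    public static void main(String[] args) {
        int[] parseCalls = {0};
        // Toy placeholder parser: just counts lines. Not a real NEXUS parser.
        ParsedNexusStore<Integer> store = new ParsedNexusStore<>(raw -> {
            parseCalls[0]++;
            return raw.split("\n").length;
        });
        store.upload("example", "#NEXUS\nBEGIN TREES;\nTREE t = (A,B);\nEND;\n");
        store.fetch("example");
        store.fetch("example");
        System.out.println("fetches: 2, parser runs: " + parseCalls[0]);
    }
}
```

The point of the design is that parsing cost is paid at upload time only; every later fetch is a plain lookup.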

I first looked into using the NEXUS-parsing code in Mesquite, an open-source phylogenetic analysis program. But it's not set up for simply using the parsing engine on its own: the parser is tied into a whole file-browsing package. Then I found BioJava, which had exactly what I needed. Just look at this package!

Unfortunately there are still some problems with opening certain NEXUS files. I downloaded some samples from TreeBASE and the parser flagged errors in the TREES section. The reason, as I found after hours of searching (and considering whether it might be better just to write my own parser after all), turns out to be an extra comma in the TRANSLATE section. I'm still not exactly sure how I'm going to solve that one. But it works when I remove the comma!
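One conceivable workaround is to clean the text before handing it to the parser. The sketch below is only that, a sketch: it assumes the extra comma is a trailing one sitting between the last TRANSLATE entry and the terminating semicolon, and the names `TranslateCommaFix` and `removeTrailingComma` are made up for illustration (they are not part of BioJava). It uses nothing but `java.util.regex`.

```java
import java.util.regex.Pattern;

/** Hypothetical pre-processing step; not part of BioJava or any NEXUS library. */
class TranslateCommaFix {
    // Group 1: "translate" plus everything up to the last comma before the
    // terminating semicolon. Group 2: optional whitespace and that semicolon.
    // [^;] keeps the match inside a single TRANSLATE command.
    private static final Pattern TRAILING_COMMA = Pattern.compile(
            "(\\btranslate\\b[^;]*?),(\\s*;)", Pattern.CASE_INSENSITIVE);

    /** Returns the text with a trailing comma removed from each TRANSLATE command. */
    static String removeTrailingComma(String nexus) {
        return TRAILING_COMMA.matcher(nexus).replaceAll("$1$2");
    }

    public static void main(String[] args) {
        String raw = "BEGIN TREES;\n"
                + "  TRANSLATE\n"
                + "    1 Taxon_A,\n"
                + "    2 Taxon_B,\n"   // assumed form of the offending comma
                + "  ;\n"
                + "  TREE t1 = (1,2);\n"
                + "END;\n";
        System.out.print(removeTrailingComma(raw));
    }
}
```

This is deliberately naive: a quoted taxon label containing a semicolon, or the word "translate" inside a comment, would confuse it, so it is a stopgap rather than a real fix in the parser.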

Hibernate
Remember how I wrote a post a while ago about building classes that map from the Java back end to the database? Turns out that was all unnecessary. Hibernate is a persistence layer that provides pretty seamless integration between Java and a database (in this case, a PostgreSQL database). Augmented by Hibernate Annotations and Hibernate Validator, it makes it fairly easy to set up and use a complex, well-organized database.

Well, okay, there's a bit of a learning curve first, but it's totally worth it. Incidentally, the book I used to learn it has what is possibly the best title ever.

Flex Data Management Services
Basically, Hibernate is to Java and databases as mx.data is to Flex and Java. It provides a persistence layer so that I don't have to keep track of whether or not I need to request certain data from the Java back end. I just create DataService objects, tie them to Assembler classes on the back end, and it's all taken care of.

FlexUnit and JUnit
I've already extolled the virtues of unit testing. These wonderful (and, yes, comically named) packages (huh huh) make it possible. I haven't built enough unit tests, really, but the few I have written have been enormously useful in hunting down peculiar errors. And aside from that, since Eclipse can run JUnit tests natively, I can even use them to perform certain important tasks, such as setting up the database from annotated classes via Hibernate.

So What's Left For Me To Do?
Plenty. Although these premade packages help out enormously, I've still had to build an entire mathematics library, a MathML parser, and some tools for handling URIs. I've still got tons of work left to do on the user interface. (Event bubbling is helping a lot with that, by the way.) And, even when stuff is already built, just hooking up one pipe to another pipe can be more complicated than it seems.

Here's a rough list of what's left:
  • Finalize the servlet for uploading and parsing NEXUS data. (I'm very close on this one.)
  • Finish the required behind-the-scenes "search" features. Some of these might be a bit involved, like the ones that suggest possible links between NEXUS taxa and species or specimens or between NEXUS character states and apomorphies.
  • Overhaul the way Names on NEXUS entities (particularly specifiers) are referenced in MathML.
  • Finish the user interface. So far I just have a few forms. I still have to do tree visualization, stylesheets, high-level navigation, transitions, etc.
  • Constrain access to certain functionality. Names on NEXUS is going to be a pretty open, collaborative tool, but I need to set a few boundaries. (E.g., I can't have any old person delete data.)
  • Make sure the server's all optimized, with a static, JNDI-named Hibernate factory, etc.

And here are some things that aren't, strictly speaking, essential, but would be awfully nice:
  • Create a servlet to provide permanent links for Names on NEXUS entities.
  • Create unit tests for all relevant classes.
  • Add JavaDoc and ASDoc comments to all code.

Part of me is also thinking about renaming the project. I mean, it's a good name for what it does right now, but what if I start to bring formats other than NEXUS into the fold? (Not that there are many, but....) Well, I'll probably cross that bridge when I come to it.

My goal is to get an alpha version online sometime this Spring and go open source with it by the Fall. We'll see....


  1. Hi Michael - I've wanted to dig through your paper for a while - I guess I'll have to now. I just wanted to point you to one of the Summer of Code project ideas from NESCent, which, provided we are accepted, is going to extend the RegNum database, which you are probably aware of. See the "Enhancing the PhyloCode Registry" project. (Any thoughts or suggestions you have would be greatly welcome.)

    We will also develop a PhyloCode parser and PhyloCode->Topology query translators in the process of this and another project (PhyloWS), and these will all be open-source from the first line being written (which personally I think is the only way to go, but I'm also a huge believer in open development). If you can consider doing the same, there might be some synergies.

    BTW, if you use a moderately decent application server, you don't have to worry about any of the JNDI mappings; the server will do it for you.

  2. Well, this is extremely similar to what I'm doing. I'll be happy to contribute.

    (As for the JNDI mappings, I'm only having trouble connecting from Eclipse -- connecting from the deployed application is fine.)