A Three-Pound Monkey Brain: Refactoring "Names on Nodes" Entities, Part II

As I discussed previously, the Names on Nodes project had reached a point where the schema just wasn't working out. I went through a list of what was wrong with it: confusing nomenclature, various unnecessary classes, unnecessary references, and major practical problems with looking up contextual relations.

Another big problem was the home-brewed keyword search system I had going. Synchronizing the keyword lists was becoming problematic, and I realized there are already perfectly good (better, even) tools out there such as Hibernate Search. That's a chief rule of programming: don't reinvent something that people smarter than you, with more time on their hands, have already invented.

After a clear, honest look at the contextual relations, I came to a realization: they should be in the client, not the back end. No need to bog down the server with computing definition applications when it can be done in the client. That simplified things a great deal.

Another thing I didn't really need was categories. They were basically an ad hoc form of class inheritance, e.g., a species name is a nomen, a nomenclatural code is a publication, etc. For a little while I considered implementing this as a class hierarchy, as I had in earlier versions. But, really, this is irrelevant data—Names on Nodes doesn't really need to know what category an identifier falls in.

Finally, I had another problem in the way datasets and taxon identifiers (=signifiers) used qualified names. Each one was supposed to have a unique qualified name. While I was able to guarantee uniqueness within datasets and within taxon identifiers, I wasn't able to guarantee that qualified names would be unique between datasets and taxon identifiers.

So, here's the new version:

Again, white arrows indicate "is-a" relationships ("inheritance")—so a PhyloDefinition is a type of Definition, a Dataset is a type of Qualified object, etc. And black diamonds indicate "has-a" relationships ("composition")—so a TaxonIdentifier has one (and only one) Taxon, an Equation has at least two TaxonIdentifier objects, etc. (I've left out a few non-core classes, like BioFile and UserAccount.)

Brief discussions of each class:

Authority.—An authority can be a publication, a person, a bioinformatics file, a database, a specimen catalogue, etc. Each authority has a canonical name (e.g., "Yale Peabody Museum: Vertebrate Paleontology Collection") and an optional abbreviation (e.g., "YPM-VP").

AuthorityIdentifier.—One or more identifiers may be used to indicate an authority, each one associated with a unique URI. Examples:

<urn:isbn:0853010064> (The International Code of Zoological Nomenclature, 4th Edition)
<http://iczn.org/iczn> (Another way of referring to the ICZN.)
<mailto:keesey@gmail.com> (myself)
<http://peabody.yale.edu/collections/vp> (Yale Peabody Museum: Vertebrate Paleontology Collection)
<urn:sha1:bc0ccc8a379edc44cf91b013d2da6238d4258a56> (a bioinformatics file, indicated by its SHA-1 hash key)

Qualified.—This new abstract class makes it possible for qualified names to be unique across all classes that use them. Each refers to an authority identifier and contains a local name, which is unique to that identifier. When combined, the identifier's URI and the local name form a qualified name, e.g., <urn:isbn:0853010064::Homo+sapiens> or <http://peabody.yale.edu/collections/vp::1450>.
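To make the idea concrete, here is a minimal sketch of how a shared Qualified base class could build qualified names and keep them unique across datasets and taxon identifiers alike. The class, field, and method names (including the Registry helper) are my own illustrative assumptions, not the actual Names on Nodes code, and the in-memory set merely stands in for whatever uniqueness constraint the database enforces. It uses records, so it assumes Java 16 or later.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only: these names are assumptions, not the real Names on Nodes classes.
final class QualifiedNameSketch {

    /** An identifier for an authority, tied to a unique URI. */
    record AuthorityIdentifier(URI uri) {}

    /** Base class for anything that carries a qualified name. */
    abstract static class Qualified {
        private final AuthorityIdentifier authority;
        private final String localName;

        Qualified(AuthorityIdentifier authority, String localName) {
            this.authority = authority;
            this.localName = localName;
        }

        /** Joins the authority URI and the local name with "::". */
        final String qualifiedName() {
            return authority.uri() + "::" + localName;
        }
    }

    static final class Dataset extends Qualified {
        Dataset(AuthorityIdentifier authority, String localName) {
            super(authority, localName);
        }
    }

    static final class TaxonIdentifier extends Qualified {
        TaxonIdentifier(AuthorityIdentifier authority, String localName) {
            super(authority, localName);
        }
    }

    /** One registry for every Qualified subclass, so names cannot clash across classes. */
    static final class Registry {
        private final Set<String> taken = new HashSet<>();

        /** Returns false if the qualified name is already in use. */
        boolean register(Qualified item) {
            return taken.add(item.qualifiedName());
        }
    }

    public static void main(String[] args) {
        AuthorityIdentifier iczn = new AuthorityIdentifier(URI.create("urn:isbn:0853010064"));
        Registry registry = new Registry();
        TaxonIdentifier name = new TaxonIdentifier(iczn, "Homo+sapiens");
        System.out.println(name.qualifiedName());
        System.out.println(registry.register(name));
        // A dataset reusing the same authority and local name is rejected.
        System.out.println(registry.register(new Dataset(iczn, "Homo+sapiens")));
    }
}
```

Because both subclasses go through the same registry, a dataset and a taxon identifier can no longer end up with the same qualified name, which was the gap in the old schema.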

TaxonIdentifier & Taxon.—Formerly called "signifiers", taxon identifiers are qualified objects that each refer to a taxon. Taxon identifiers may be scientific names, vernacular names, specimen identifiers, character state descriptions, etc. As with authorities, each taxon may have more than one identifier referring to it. For example, the following qualified names all refer to the same species: <urn:isbn:0853010064::Abeillia+abeillei>, <http://iucnredlist.org::species:142883>, and <http://iucnredlist.org::common_name:Eng:Emerald-chinned+Hummingbird>.
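The many-identifiers-to-one-taxon relationship can be sketched as a simple lookup table. This is only an illustration under my own assumptions: the TaxonLookupSketch class and its identify and resolve methods are hypothetical, and qualified names are treated as plain strings rather than full entities.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: a hypothetical in-memory stand-in for the persisted entities.
final class TaxonLookupSketch {

    /** A taxon has identity only; its names live in the identifiers that point at it. */
    static final class Taxon {}

    private final Map<String, Taxon> taxonByQualifiedName = new HashMap<>();

    /** Binds a qualified name to a taxon; each identifier refers to exactly one taxon. */
    void identify(String qualifiedName, Taxon taxon) {
        Taxon previous = taxonByQualifiedName.putIfAbsent(qualifiedName, taxon);
        if (previous != null && previous != taxon) {
            throw new IllegalStateException(qualifiedName + " already refers to another taxon");
        }
    }

    /** Looks up the taxon a qualified name refers to, if any. */
    Optional<Taxon> resolve(String qualifiedName) {
        return Optional.ofNullable(taxonByQualifiedName.get(qualifiedName));
    }

    public static void main(String[] args) {
        TaxonLookupSketch lookup = new TaxonLookupSketch();
        Taxon hummingbird = new Taxon();
        lookup.identify("urn:isbn:0853010064::Abeillia+abeillei", hummingbird);
        lookup.identify("http://iucnredlist.org::species:142883", hummingbird);
        lookup.identify("http://iucnredlist.org::common_name:Eng:Emerald-chinned+Hummingbird", hummingbird);
        // All three identifiers resolve to the same Taxon instance.
        System.out.println(lookup.resolve("http://iucnredlist.org::species:142883").orElseThrow()
                == lookup.resolve("urn:isbn:0853010064::Abeillia+abeillei").orElseThrow());
    }
}
```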

Label.—Authorities, datasets, and taxon identifiers are all labelled entities, possessing one label object. Each label has a name, an optional abbreviation, and a flag telling whether it should be italicized. Labels are merely cosmetic, and need not be unique. They are used as the targets of searches, using Hibernate Search.

Definition.—Each definition has one taxon identifier, and only one definition pertains to that taxon identifier. How do I accommodate differing definitions, then? I use a concept from the PhyloCode: conversion. Consider the name "Aves". Under the ICZN, it refers to a suprafamilial ranked taxon with no type. According to Sereno's TaxonSearch, it refers to a node-based clade including Archaeopteryx. According to Gauthier and de Queiroz (2001), it refers to a crown group. But instead of having multiple definitions for the same identifier, I consider each definition to define a different identifier, each indicating a (potentially) different taxon: <urn:isbn:0853010064::Aves>, <http://www.taxonsearch.org/Archive/stem-archosauria-1.0.php::Aves>, and <urn:bici:0912532572(200112)%3C7:FDFDCD%3E2.0.TX;2-H::Aves>, respectively. In cases of conversion, the definition also indicates the original identifier.
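The one-definition-per-identifier rule, plus conversion, might look something like the following sketch. Everything here is hypothetical: the DefinitionSketch class, its define and convert methods, and the free-text "meaning" field are stand-ins for the real entities. In particular, treating the ICZN identifier as the original that the other two "Aves" identifiers were converted from is my assumption for the sake of the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Illustrative sketch only: names and structure are assumptions, not the project's entities.
final class DefinitionSketch {

    /** One definition per identifier; convertedFrom is set only for conversions. */
    record Definition(String identifier, String meaning, Optional<String> convertedFrom) {}

    private final Map<String, Definition> byIdentifier = new LinkedHashMap<>();

    /** Defines an identifier from scratch. */
    void define(String identifier, String meaning) {
        add(new Definition(identifier, meaning, Optional.empty()));
    }

    /** Defines a new identifier as a conversion of an existing one. */
    void convert(String original, String identifier, String meaning) {
        add(new Definition(identifier, meaning, Optional.of(original)));
    }

    private void add(Definition definition) {
        // An identifier can never acquire a second definition.
        if (byIdentifier.putIfAbsent(definition.identifier(), definition) != null) {
            throw new IllegalStateException("already defined: " + definition.identifier());
        }
    }

    Optional<Definition> definitionOf(String identifier) {
        return Optional.ofNullable(byIdentifier.get(identifier));
    }

    public static void main(String[] args) {
        String iczn = "urn:isbn:0853010064::Aves";
        String taxonSearch = "http://www.taxonsearch.org/Archive/stem-archosauria-1.0.php::Aves";
        String crown = "urn:bici:0912532572(200112)%3C7:FDFDCD%3E2.0.TX;2-H::Aves";

        DefinitionSketch definitions = new DefinitionSketch();
        definitions.define(iczn, "suprafamilial ranked taxon with no type");
        // Assumption: both later usages are modeled as conversions of the ICZN name.
        definitions.convert(iczn, taxonSearch, "node-based clade including Archaeopteryx");
        definitions.convert(iczn, crown, "crown group");

        System.out.println(definitions.definitionOf(crown).orElseThrow());
    }
}
```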

PhyloDefinition & RankDefinition.—These have not changed much, except that they now refer directly to their specifiers and types, respectively. No more useless "Anchor" class.

Dataset.—Instead of storing a bunch of relations of unspecified type, each type of relation falls within its own set. I've also added optional ratios for converting weights in phylogenetic networks to generations and/or years.
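As a rough sketch of how the optional ratios might be applied, assume each ratio is a simple multiplier: generations (or years) per unit of weight. That direction, along with the DatasetRatioSketch class, its field names, and the million-years-per-unit figure in the example, is purely my assumption for illustration.

```java
import java.util.OptionalDouble;

// Illustrative sketch only: the ratio direction and all names are assumptions.
final class DatasetRatioSketch {

    /** A dataset with optional multipliers for turning arc weights into time. */
    record Dataset(OptionalDouble generationsPerUnit, OptionalDouble yearsPerUnit) {

        /** Converts a weight to generations, or returns empty if no ratio is set. */
        OptionalDouble toGenerations(double weight) {
            return scale(weight, generationsPerUnit);
        }

        /** Converts a weight to years, or returns empty if no ratio is set. */
        OptionalDouble toYears(double weight) {
            return scale(weight, yearsPerUnit);
        }

        private static OptionalDouble scale(double weight, OptionalDouble ratio) {
            return ratio.isPresent()
                    ? OptionalDouble.of(weight * ratio.getAsDouble())
                    : OptionalDouble.empty();
        }
    }

    public static void main(String[] args) {
        // Hypothetical dataset whose weights are in millions of years, with no generation ratio.
        Dataset dataset = new Dataset(OptionalDouble.empty(), OptionalDouble.of(1_000_000.0));
        System.out.println(dataset.toYears(2.5));
        System.out.println(dataset.toGenerations(2.5));
    }
}
```

Keeping the ratios optional means a dataset with unitless weights simply declines to answer, rather than reporting a made-up duration.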

Equation.—I almost called this "Synonymy". This is a new type of relation, which asserts that two or more identifiers refer to the same taxon.
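One way to reason about equations is as merges of identifier groups. The sketch below uses a union-find (disjoint-set) structure for that, which is my own choice of technique for illustration and not necessarily how Names on Nodes resolves equations. The EquationSketch class and its method names are hypothetical, and the qualified names in the example are made-up placeholders.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: union-find is an assumed technique, not the project's approach.
final class EquationSketch {

    // Maps each identifier to its parent; a root is its own parent.
    private final Map<String, String> parent = new HashMap<>();

    /** Finds the representative of an identifier's group, compressing the path. */
    private String find(String identifier) {
        parent.putIfAbsent(identifier, identifier);
        String root = identifier;
        while (!parent.get(root).equals(root)) {
            root = parent.get(root);
        }
        String current = identifier;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    /** Asserts that two or more identifiers refer to the same taxon. */
    void equate(List<String> identifiers) {
        if (identifiers.size() < 2) {
            throw new IllegalArgumentException("an equation needs at least two identifiers");
        }
        String root = find(identifiers.get(0));
        for (String other : identifiers.subList(1, identifiers.size())) {
            parent.put(find(other), root);
        }
    }

    /** True if some chain of equations links the two identifiers. */
    boolean sameTaxon(String a, String b) {
        return find(a).equals(find(b));
    }

    public static void main(String[] args) {
        EquationSketch equations = new EquationSketch();
        // Placeholder qualified names, not real records.
        equations.equate(List.of("urn:example:a::X", "urn:example:b::X"));
        equations.equate(List.of("urn:example:b::X", "urn:example:c::X"));
        System.out.println(equations.sameTaxon("urn:example:a::X", "urn:example:c::X"));
    }
}
```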

Heredity & Inclusion.—Heredity was previously called "Parentage". The new nomenclature better reflects its real meaning, since it models ancestor-descendant relationships, not necessarily parent-child. These two classes are little changed, except that now they don't both descend from a useless Relation class, so their nomenclature can be clearer (predecessor and superset used to be "a"; successor and subset used to be "b").
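Here is a small sketch of the two relation types with their clearer field names, along with a check that follows heredity arcs transitively. The RelationSketch class, the precedes helper, and the string-based taxon labels are all hypothetical simplifications; the real entities presumably refer to taxa, not strings.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only: taxa are plain strings and all names are assumptions.
final class RelationSketch {

    /** An ancestor-descendant assertion (not necessarily parent-child). */
    record Heredity(String predecessor, String successor) {}

    /** A set-containment assertion between two taxa. */
    record Inclusion(String superset, String subset) {}

    /** True if a chain of heredity arcs leads from ancestor to descendant. */
    static boolean precedes(Collection<Heredity> arcs, String ancestor, String descendant) {
        Deque<String> pending = new ArrayDeque<>(List.of(ancestor));
        Set<String> visited = new HashSet<>();
        while (!pending.isEmpty()) {
            String current = pending.pop();
            if (!visited.add(current)) {
                continue; // already explored; also guards against cycles
            }
            for (Heredity arc : arcs) {
                if (arc.predecessor().equals(current)) {
                    if (arc.successor().equals(descendant)) {
                        return true;
                    }
                    pending.push(arc.successor());
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Placeholder taxa A, B, C.
        List<Heredity> arcs = List.of(new Heredity("A", "B"), new Heredity("B", "C"));
        System.out.println(precedes(arcs, "A", "C"));
        System.out.println(precedes(arcs, "C", "A"));
    }
}
```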

This schema is much cleaner, and will make for a more efficient server side. I've already implemented the entities, removed deprecated code, and updated the relevant code. After some hiccups with a Hibernate upgrade, unit tests are working again. The back end should be complete fairly soon (pending some ideas about user accounts), and then it will be time to look at some massive refactorings for the front end!

2 comments:

Anonymous, 18 March, 2009 13:00
I don't understand all this, but it's fascinating nonetheless.
Mike Keesey, 18 March, 2009 13:31
Hehe, cool! I'm mostly just airing my own thoughts so I can clear my thinking. I imagine most people will be more interested in the results than these formative processes, though.

That said, feel free to ask questions.

17 March 2009
