19 October 2008

This is why I never get anything done.

I was getting pretty close to presentable with Names on Nodes, when I had a revelation. Now I have to rewrite most of it.

The revelation was this: nomenclatural codes, bioinformatics files, publications, specimen collections, and people are all the same thing. They are authorities.

Scientific names, taxonomic units, character states, and specimens are all the same thing. They are signifiers. They each signify a taxon (a set of organisms).

Signifiers are authorized by an authority. For example, Homo sapiens is a species authorized by the International Code of Zoological Nomenclature. YPM-VP 1450 is a specimen authorized by the Yale Peabody Museum's Vertebrate Paleontology Collection. "Wings used for powered flight," is a character state authorized by Gauthier & de Queiroz (2001). "30. Number of stamens: ten or fewer," is authorized by the NEXUS file registered as M331 in TreeBASE, as are the taxonomic units Phytolaccaceae and Lardizabalaceae.

Signifiers may share the same identity. For example, Tyrannosaurus bataar (ICZN) and Tarbosaurus bataar (ICZN) signify the same taxon, no matter what. The identity is only accessible to the signifiers themselves, which means that signifiers can be equated and differentiated without disrupting references to them. (A similar identity property holds for authorities.)
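To make the identity idea concrete, here is a minimal Python sketch. It is not the Names on Nodes implementation: `IdentityRegistry` and its methods are hypothetical names, and the opaque integer identities are an assumption. The point is that callers only ever hold signifier names, so names can be equated or split apart without any reference changing.

```python
import itertools


class IdentityRegistry:
    """Hypothetical registry mapping signifier names to hidden identities."""

    def __init__(self):
        self._ids = {}                      # signifier name -> opaque identity
        self._fresh = itertools.count()     # source of new identities

    def _identity(self, name):
        # Unknown signifiers start out with an identity of their own.
        if name not in self._ids:
            self._ids[name] = next(self._fresh)
        return self._ids[name]

    def equate(self, a, b):
        """Make everything that shares b's identity share a's instead."""
        keep, drop = self._identity(a), self._identity(b)
        for name, ident in list(self._ids.items()):
            if ident == drop:
                self._ids[name] = keep

    def differentiate(self, name):
        """Give one signifier a fresh identity of its own."""
        self._ids[name] = next(self._fresh)

    def same_identity(self, a, b):
        return self._identity(a) == self._identity(b)


ICZN = "urn:isbn:0-85301-006-4"
registry = IdentityRegistry()
registry.equate(f"{ICZN}::Tyrannosaurus+bataar", f"{ICZN}::Tarbosaurus+bataar")
```

The identity numbers never leave the registry, which is what lets the two names above be differentiated again later without touching anything that refers to either of them.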

Every authority may be associated with an absolute URI (uniform resource identifier). Publications (including nomenclatural codes) may be associated with DOIs, ISBNs, etc. People may be associated with OpenIDs. Anything may be associated with a web address. It's a bit trickier for NEXUS files, but I figure that they can be uniquely identified by an ad hoc URI scheme plus a SHA-1 hash of their textual data.

Examples:
  • http://www.peabody.yale.edu/collections/vp Yale Peabody Museum: The Collections: Vertebrate Paleontology
  • http://uppsaladomkyrka.se Uppsala domkyrka (cathedral)
  • urn:isbn:0080-0694/146 The International Code of Botanical Nomenclature (Vienna Code)
  • http://openid-provider.appspot.com/keesey Timothy Michael Keesey
  • http://threelbmonkeybrain.blogspot.com Timothy Michael Keesey (also!)
  • urn:isbn:0-912532-57-2/chapter1 Gauthier & de Queiroz 2001
  • biofile:5b2f349967c18006233fc89b8643ff6c57be2858 the NEXUS file of Rodman & al. 1984
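As a rough sketch of how a NEXUS file's URI could be derived, assuming the hash is taken over the file's text encoded as UTF-8 with no other normalization (the post does not specify either detail, and `biofile_uri` is a hypothetical helper name):

```python
import hashlib


def biofile_uri(text: str) -> str:
    """Build a 'biofile:' URI from the SHA-1 hash of a file's text.

    Assumption: the text is hashed as UTF-8 bytes, exactly as given.
    """
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return f"biofile:{digest}"


# A tiny stand-in for real NEXUS content, for illustration only.
example_uri = biofile_uri("#NEXUS\nBEGIN TAXA;\nEND;\n")
```

Since even a changed line ending alters the hash, a real version would probably want to normalize whitespace before hashing so that trivially different copies of the same file get the same URI.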
Each signifier, then, can have a unique local name under its associated authority, which forms a unique qualified name when combined with the authority's URI. Examples:
  • http://www.peabody.yale.edu/collections/vp::1450 a specimen
  • http://uppsaladomkyrka.se::Carolus+Linnaeus a specimen
  • urn:isbn:0-85301-006-4::Homo+sapiens a species
  • urn:isbn:0-912532-57-2/chapter1::wings+used+for+powered+flight a character state
  • biofile:5b2f349967c18006233fc89b8643ff6c57be2858::CHARACTERS/19._Crassulacean_acid_metabolis/present_in_at_least_some_specie a character state
  • biofile:5b2f349967c18006233fc89b8643ff6c57be2858::TAXA/Menispermaceae a taxonomic unit
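The qualified names above could be built and parsed along these lines. This is a sketch under assumptions: the `QualifiedName` type is hypothetical, and encoding local names with `urllib.parse.quote_plus` (spaces become `+`, slashes kept) is a guess that happens to match the examples. The NEXUS local names are treated as already using the file's own underscore tokens.

```python
from typing import NamedTuple
from urllib.parse import quote_plus, unquote_plus


class QualifiedName(NamedTuple):
    """Hypothetical pairing of an authority URI with a local name."""

    authority: str   # absolute URI of the authority
    local_name: str  # name that is unique under that authority

    def __str__(self):
        # Encode the local name so it cannot contain a literal '::'.
        return f"{self.authority}::{quote_plus(self.local_name, safe='/')}"

    @classmethod
    def parse(cls, text):
        # Split on the last '::'; the encoded local name never contains one.
        authority, separator, local = text.rpartition("::")
        if not separator:
            raise ValueError(f"not a qualified name: {text!r}")
        return cls(authority, unquote_plus(local))


homo_sapiens = QualifiedName("urn:isbn:0-85301-006-4", "Homo sapiens")
specimen = QualifiedName("http://www.peabody.yale.edu/collections/vp", "1450")
```

Here `str(homo_sapiens)` gives `urn:isbn:0-85301-006-4::Homo+sapiens`, and `QualifiedName.parse` turns that string back into the same pair.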
This basically means four things:
  1. I don't have to track that much information about each thing, since that information is held in other resources. I really just need to reference other resources (authorities and signifiers) and maybe provide a convenient name for each one (a canonical name in the Names on Nodes database).
  2. It is possible to create an extremely flexible data model capable of accommodating just about any data set, nomenclatural act, or taxonomic opinion.
  3. When using Names on Nodes, you'll be able to filter out authorities you don't want to use.
  4. I gotta redo a lot of stuff.
One thing I still have to completely figure out is the idea of relators. A relator is an entity which contains a set of relations, each of which relates one signifier to another. Two major types of relations are inclusion and precedence (i.e., ancestry). Examples:
  1. Precedence: nexus:5b2f349967c18006233fc89b8643ff6c57be2858::TREES/Fig._2/a (a hypothetical ancestor) is ancestral to nexus:5b2f349967c18006233fc89b8643ff6c57be2858::TAXA/Caryophyllaceae according to nexus:5b2f349967c18006233fc89b8643ff6c57be2858::TREES/Fig._2.
  2. Inclusion: urn:isbn:0-85301-006-4::Homo includes urn:isbn:0-85301-006-4::Homo+sapiens according to the rank-based definition authorized by urn:isbn:0-85301-006-4.
  3. Inclusion: urn:isbn:0-85301-006-4::Homo+sapiens includes http://uppsaladomkyrka.se::Carolus+Linnaeus according to the rank-based definition authorized by urn:isbn:0-85301-006-4.
In case #1, the relator is a tree in a NEXUS file. In cases #2 and #3, the relators are rank-based definitions authorized by the International Code of Zoological Nomenclature.
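Since relators are still being worked out, the following is only one possible shape for them, sketched in Python. The class names, the field names, and the `relations_from` helper are all hypothetical. The helper also shows how filtering out unwanted authorities might work at the relator level.

```python
from dataclasses import dataclass, field
from enum import Enum

NEXUS = "nexus:5b2f349967c18006233fc89b8643ff6c57be2858"
ICZN = "urn:isbn:0-85301-006-4"


class RelationKind(Enum):
    INCLUSION = "inclusion"
    PRECEDENCE = "precedence"


@dataclass(frozen=True)
class Relation:
    kind: RelationKind
    source: str  # the signifier that includes, or is ancestral to, the target
    target: str


@dataclass
class Relator:
    name: str       # label for the relator (the rank-based label is made up)
    authority: str  # URI of the authority behind this relator
    relations: list = field(default_factory=list)


tree = Relator(f"{NEXUS}::TREES/Fig._2", NEXUS, [
    Relation(RelationKind.PRECEDENCE,
             f"{NEXUS}::TREES/Fig._2/a", f"{NEXUS}::TAXA/Caryophyllaceae"),
])
rank_based = Relator("rank-based definitions", ICZN, [
    Relation(RelationKind.INCLUSION, f"{ICZN}::Homo", f"{ICZN}::Homo+sapiens"),
    Relation(RelationKind.INCLUSION, f"{ICZN}::Homo+sapiens",
             "http://uppsaladomkyrka.se::Carolus+Linnaeus"),
])


def relations_from(relators, trusted_authorities):
    """Collect relations only from relators whose authority is trusted."""
    return [relation
            for relator in relators
            if relator.authority in trusted_authorities
            for relation in relator.relations]
```

Calling `relations_from([tree, rank_based], {ICZN})` keeps only the two inclusion relations and drops the tree.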

So, once this is set up, the application will be able to automatically apply phylogenetic definitions, given a certain set of relators. This set will typically include a tree (or network) and a character matrix (optionally). But it could also include many trees, or a custom phylogeny. It gets a bit complex, though, since definitions themselves are relators (mandating the inclusion of types or internal specifiers), as are contextual applications of definitions (indicating other, non-essential inclusions).
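As an illustration of what "automatically apply" could mean, here is a small Python sketch that applies a node-based definition (the last common ancestor of the specifiers, plus all of its descendants) to precedence relations. It assumes the relations form a single rooted tree given as (ancestor, descendant) pairs. Networks, character matrices, and other definition types are not handled, and the function name and toy node labels are hypothetical.

```python
def node_based_clade(edges, specifiers):
    """Return the clade a node-based definition picks out on a rooted tree.

    edges: (ancestor, descendant) pairs of immediate precedence relations.
    specifiers: the internal specifiers named by the definition.
    """
    specifiers = list(specifiers)
    if not specifiers:
        raise ValueError("a definition needs at least one specifier")

    parent = {child: ancestor for ancestor, child in edges}
    children = {}
    for ancestor, child in edges:
        children.setdefault(ancestor, []).append(child)

    def lineage(node):
        # The node itself, followed by each successive ancestor up to the root.
        path = [node]
        while path[-1] in parent:
            path.append(parent[path[-1]])
        return path

    # Nodes that lie on every specifier's lineage.
    common = set(lineage(specifiers[0]))
    for specifier in specifiers[1:]:
        common &= set(lineage(specifier))
    if not common:
        raise ValueError("specifiers share no ancestor in this tree")

    # Walking up from a specifier, the first shared node is the last
    # common ancestor.
    ancestor = next(node for node in lineage(specifiers[0]) if node in common)

    # The clade is that ancestor plus everything descended from it.
    clade, stack = set(), [ancestor]
    while stack:
        node = stack.pop()
        clade.add(node)
        stack.extend(children.get(node, []))
    return clade


# Toy tree: root -> (a -> (A, B), C)
toy_edges = [("root", "a"), ("root", "C"), ("a", "A"), ("a", "B")]
```

On the toy tree, specifiers A and B select the node a and its two descendants, while specifiers A and C select the whole tree.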

Still some details to work out, but I think I'm on a good track here.
