21 October 2008

Six Ways to Say the Same Thing

"'Aves' refers to the crown clade stemming from the most recent common ancestor of Ratitae (Struthio camelus Linnaeus 1758), Tinamidae (Tetrao [Tinamus] major Gmelin 1789), and Neognathae (Vultur gryphus Linnaeus 1758)."
—Jacques Gauthier & Kevin de Queiroz 2001 December

Simple Mathematical Formula

Aves := Clade(Struthio camelus + Tetrao major + Vultur gryphus)

Complex Mathematical Formula

Aves Linnaeus 1758 [Gauthier & de Queiroz 2001] := (AD ∘ max ∘ CA)(Struthio camelus ∪ Tetrao major ∪ Vultur gryphus)

Ridiculously Complex Mathematical Formula

C := {x : (∀y ∈ (Struthio camelus ∪ Tetrao major ∪ Vultur gryphus))[x ≤ y]}
A := {x ∈ C : (∀y ∈ C)[x ≮ y]}
Aves := {x : (∃y ∈ A)[x ≥ y]}
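
The same definition can be sketched as code. This is only an illustration, assuming that CA, max, and AD stand for "common ancestors", "maximal (latest) members", and "all descendants". The toy phylogeny, its node names "root", "a", "b", and "outgroup", and the function names are all made up for the example.

```python
from typing import Dict, Set

# Hypothetical toy phylogeny: each node maps to its immediate descendants.
CHILDREN: Dict[str, Set[str]] = {
    "root": {"outgroup", "a"},
    "a": {"b", "Vultur gryphus"},
    "b": {"Struthio camelus", "Tetrao major"},
    "outgroup": set(),
    "Struthio camelus": set(),
    "Tetrao major": set(),
    "Vultur gryphus": set(),
}


def descendants(node: str, children: Dict[str, Set[str]]) -> Set[str]:
    """Return the node itself plus everything descended from it."""
    seen = {node}
    stack = [node]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def clade(specifiers: Set[str], children: Dict[str, Set[str]]) -> Set[str]:
    """Node-based clade: all descendants of the latest common ancestors."""
    desc = {node: descendants(node, children) for node in children}
    # CA: nodes that are ancestral to (or identical to) every specifier.
    common = {node for node in children if specifiers <= desc[node]}
    # max: common ancestors with no other common ancestor descended from them.
    maximal = {
        x for x in common
        if not any(y != x and y in desc[x] for y in common)
    }
    # AD: everything descended from the maximal common ancestors.
    result: Set[str] = set()
    for x in maximal:
        result |= desc[x]
    return result


aves = clade({"Struthio camelus", "Tetrao major", "Vultur gryphus"}, CHILDREN)
```

In this toy tree, `aves` contains the three specifiers plus the hypothetical ancestors "a" and "b", but not "root" or "outgroup".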

Simple MathML-Content

Complex MathML within Custom Markup
<mi form="prefix">Clade</mi>
<mo form="infix">+</mo>
<![CDATA[<i>Ratitae</i> (<i>Struthio camelus</i> Linnaeus 1758)]]>
<![CDATA[<i>Tinamidae</i> (<i>Tetrao</i> [<i>Tinamus</i>] <i>major</i> Gmelin 1789)]]>
<![CDATA[<i>Neognathae</i> (<i>Vultur gryphus</i> Linnaeus 1758)]]>


19 October 2008

This is why I never get anything done.

I was getting pretty close to presentable with Names on Nodes, when I had a revelation. Now I have to rewrite most of it.

The revelation was this: nomenclatural codes, bioinformatics files, publications, specimen collections, and people are all the same thing. They are authorities.

Scientific names, taxonomic units, character states, and specimens are all the same thing. They are signifiers. They each signify a taxon (a set of organisms).

Signifiers are authorized by an authority. For example, Homo sapiens is a species authorized by the International Code of Zoological Nomenclature. YPM-VP 1450 is a specimen authorized by the Yale Peabody Museum's Vertebrate Paleontology Collection. "Wings used for powered flight" is a character state authorized by Gauthier & de Queiroz (2001). "30. Number of stamens: ten or fewer" is a character state authorized by the NEXUS file registered as M331 in TreeBASE, as are the taxonomic units Phytolaccaceae and Lardizabalaceae.

Signifiers may share the same identity. For example, Tyrannosaurus bataar (ICZN) and Tarbosaurus bataar (ICZN) signify the same taxon, no matter what. The identity is only accessible to the signifiers themselves, which means that signifiers can be equated and differentiated without disrupting references to them. (A similar identity property holds for authorities.)

Every authority may be associated with an absolute URI (uniform resource identifier). Publications (including nomenclatural codes) may be associated with DOIs, ISBNs, etc. People may be associated with OpenIDs. Anything may be associated with a web address. It's a bit trickier for NEXUS files, but I figure that they can be uniquely identified by an ad hoc scheme plus a SHA-1 hash of their textual data.

  • http://www.peabody.yale.edu/collections/vp Yale Peabody Museum: The Collections: Vertebrate Paleontology
  • http://uppsaladomkyrka.se Uppsala domkyrka (cathedral)
  • urn:isbn:0080-0694/146 The International Code of Botanical Nomenclature (Vienna Code)
  • http://openid-provider.appspot.com/keesey Timothy Michael Keesey
  • http://threelbmonkeybrain.blogspot.com Timothy Michael Keesey (also!)
  • urn:isbn:0-912532-57-2/chapter1 Gauthier & de Queiroz 2001
  • biofile:5b2f349967c18006233fc89b8643ff6c57be2858 the NEXUS file of Rodman & al. 1984
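
As a rough sketch of the hash-based identifier for NEXUS files, the following Python builds a "biofile:" URI from a file's text. The function name is made up for illustration, and hashing the UTF-8 bytes without any normalization (of line endings or whitespace, say) is an assumption rather than a settled rule.

```python
import hashlib


def biofile_uri(text: str) -> str:
    """Build an ad hoc 'biofile:' URI from the SHA-1 hash of a file's text.

    Assumption: the text is hashed as UTF-8 bytes exactly as given.
    """
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return "biofile:" + digest
```

Two files with identical text get the same URI, and any change to the text yields a different one, so any normalization step would have to be agreed on before identifiers are shared.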
Each signifier, then, can have a unique local name under its associated authority, which forms a unique qualified name when combined with the authority's URI. Examples:
  • http://www.peabody.yale.edu/collections/vp::1450 a specimen
  • http://uppsaladomkyrka.se::Carolus+Linnaeus a specimen
  • urn:isbn:0-85301-006-4::Homo+sapiens a species
  • urn:isbn:0-912532-57-2/chapter1::wings+used+for+powered+flight a character state
  • biofile:5b2f349967c18006233fc89b8643ff6c57be2858::CHARACTERS/19._Crassulacean_acid_metabolis/present_in_at_least_some_specie a character state
  • biofile:5b2f349967c18006233fc89b8643ff6c57be2858::TAXA/Menispermaceae a taxonomic unit
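
A qualified name of this kind is easy to model. The sketch below is illustrative only: the class and function names are hypothetical, and encoding spaces as "+" with `quote_plus` is just one encoding that reproduces the examples above.

```python
from dataclasses import dataclass
from urllib.parse import quote_plus


@dataclass(frozen=True)
class Signifier:
    """A signifier: a local name scoped to an authority's URI."""
    authority_uri: str
    local_name: str

    @property
    def qualified_name(self) -> str:
        # Authority URI and local name joined by a double colon.
        return f"{self.authority_uri}::{self.local_name}"


def from_label(authority_uri: str, label: str) -> Signifier:
    """Make a signifier from a human-readable label (spaces become '+')."""
    return Signifier(authority_uri, quote_plus(label))


def parse(qualified_name: str) -> Signifier:
    """Split on the last '::' (assumes local names never contain '::')."""
    authority_uri, _, local_name = qualified_name.rpartition("::")
    return Signifier(authority_uri, local_name)


homo_sapiens = from_label("urn:isbn:0-85301-006-4", "Homo sapiens")
```

Here `homo_sapiens.qualified_name` is `urn:isbn:0-85301-006-4::Homo+sapiens`, and `parse` turns that string back into an equal `Signifier`.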
This basically means four things:
  1. I don't have to track that much information about each thing, since that information is held in other resources. I really just need to reference other resources (authorities and signifiers) and maybe provide a convenient name for each one (a canonical name in the Names on Nodes database).
  2. It is possible to create an extremely flexible data model capable of accommodating just about any data set, nomenclatural act, or taxonomic opinion.
  3. When using Names on Nodes, you'll be able to filter out authorities you don't want to use.
  4. I gotta redo a lot of stuff.
One thing I still have to completely figure out is the idea of relators. A relator is an entity which contains a set of relations, each of which relates one signifier to another. Two major types of relations are inclusion and precedence (i.e., ancestry). Examples:
  1. Precedence.—biofile:5b2f349967c18006233fc89b8643ff6c57be2858::TREES/Fig._2/a (a hypothetical ancestor) is ancestral to biofile:5b2f349967c18006233fc89b8643ff6c57be2858::TAXA/Caryophyllaceae according to biofile:5b2f349967c18006233fc89b8643ff6c57be2858::TREES/Fig._2.
  2. Inclusion.—urn:isbn:0-85301-006-4::Homo includes urn:isbn:0-85301-006-4::Homo+sapiens according to the rank-based definition authorized by urn:isbn:0-85301-006-4.
  3. Inclusion.—urn:isbn:0-85301-006-4::Homo+sapiens includes http://uppsaladomkyrka.se::Carolus+Linnaeus according to the rank-based definition authorized by urn:isbn:0-85301-006-4.
In case #1, the relator is a tree in a NEXUS file. In cases #2 and #3, the relators are rank-based definitions authorized by the International Code of Zoological Nomenclature.
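
One possible shape for relators, sketched in Python. Everything here is an assumption for illustration: the class names, the `pairs` helper, the relator's local name, and the simplification of bundling cases #2 and #3 into a single relator.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Iterable, Set, Tuple


class RelationKind(Enum):
    INCLUSION = "includes"
    PRECEDENCE = "precedes"


@dataclass(frozen=True)
class Relation:
    """One signifier related to another, both given as qualified names."""
    kind: RelationKind
    subject: str
    target: str


@dataclass(frozen=True)
class Relator:
    """A named source (tree, definition, ...) asserting a set of relations."""
    name: str
    relations: FrozenSet[Relation]


def pairs(relators: Iterable[Relator], kind: RelationKind) -> Set[Tuple[str, str]]:
    """Collect (subject, target) pairs of one kind from the chosen relators."""
    return {
        (relation.subject, relation.target)
        for relator in relators
        for relation in relator.relations
        if relation.kind is kind
    }


ICZN = "urn:isbn:0-85301-006-4"
rank_based = Relator(
    name=ICZN + "::rank-based+definitions",  # hypothetical local name
    relations=frozenset({
        Relation(RelationKind.INCLUSION, ICZN + "::Homo", ICZN + "::Homo+sapiens"),
        Relation(RelationKind.INCLUSION, ICZN + "::Homo+sapiens",
                 "http://uppsaladomkyrka.se::Carolus+Linnaeus"),
    }),
)
inclusions = pairs([rank_based], RelationKind.INCLUSION)
```

Choosing which relators to pass to `pairs` is where filtering by authority would happen: leave a relator out and its relations simply never enter the result.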

So, once this is set up, the application will be able to automatically apply phylogenetic definitions, given a certain set of relators. This set will typically include a tree (or network) and a character matrix (optionally). But it could also include many trees, or a custom phylogeny. It gets a bit complex, though, since definitions themselves are relators (mandating the inclusion of types or internal specifiers), as are contextual applications of definitions (indicating other, non-essential inclusions).

Still some details to work out, but I think I'm on a good track here.

08 October 2008

Three-Pound Monkey Brain Subpackages

So I'm looking at fleshing out the threelbmonkeybrain package, and specifically at moving a lot of reusable code from Names on Nodes into threelbmonkeybrain. It's looking like it could be a pretty gigantic package. So big that I started thinking of breaking it down into the Three-Pound Monkey Brain family of packages.

Then I thought of some cutesy names.

Basic classes: relations, assertions, utilities, data filtering, basic geometry, basic collections.
Dependencies: Flash, Flex (some utilities)

Data transfer: Internet operations (email, streaming media), file operations (assets, load tracking)
Dependencies: Flash, Brainstem

Mathematics: operations, advanced collections, MathML translation, formula rendering, etc.
Dependencies: Flash, Flex (some displays), Brainstem

Data modeling and persistence: value objects, form generation, metadata description, validation, CRUD services, uploads, etc.
Dependencies: Flash, Flex, Brainstem

Motor Cortex
Animation: motion blur, beacons, drawing, constraints, locators, controls, etc.
Dependencies: Flash, Brainstem, ?Calculia

Of course, I could use some more predictable, boring names: base, net, calc, persist, anim. I dunno, what do you guys think?

01 October 2008

Nomenclature vs. Science

Recently, due to an electronic submission SNAFU, an unreviewed paper naming a new species was accidentally published online. The new species is of great interest to those of us who study the general group that it belongs to. But we find ourselves morally obligated to avoid public discussion of it, because its publication was inadvertent. Now we must wait for the ponderous phases of review and publication to take place before we can discuss what we already know. In essence, the process of nomenclature is impeding the process of science.

Does this seem backward? Why shouldn't we be able to discuss new data as soon as they are available? Nomenclature is essential to proper communication, but should it be allowed to slow the march of science?

More to the point, why does nomenclature even have the opportunity to impede science? Why would we even set up a system that allowed that to happen? Why can't we publish data as soon as they're available (perhaps with an efficient review process)? Why do we place nomenclature on such a high pedestal?

Well, really, it's just one aspect of nomenclature that is placed on a pedestal: priority. Whoever publishes a name for a specimen first gets to be the NAMER OF THE TAXON, and any Johnnies-come-lately are mere footnotes. For this reason, researchers must keep their data under wraps to avoid "claim jumps".

Of course, objectively, they don't have to. It's really just that we, as humans, assign some sort of importance to the coiners of names. Naming is power. Naming is awesome.

So, in essence, it is human egotism that allows nomenclature to hinder science. That's all.

As I see it, there are two solutions: 1) we stop caring so much about who names stuff and get on with our lives, or 2) we revise the system. Option #1 is the ideal, but, like so many ideals, it's pretty unrealistic. So what about option #2?

Well, here's an idea. What if the nomenclatural codes allowed "specimen claims"? That is, what if you could register a specimen as "yours to name" for a specific amount of time, after which someone else could challenge you for the claim? Then no new taxa dependent on those specimens (or on species typified by those specimens) could be named by someone else.

Here are a couple of possible "use cases" under this idea:

Use Case 1.—Early publication of scientific data, later publication of nomenclature.
1) Researcher discovers specimen.
2) Specimen is catalogued in an institution.
3) Specimen is registered under the nomenclatural code's database. Researcher now has X amount of time to name taxa based on the specimen.
4) Researcher publishes a preliminary report on the specimen, noting its registration information.
5) Researcher spends more time assessing the relationships of the organism(s) represented by the specimen. Based on this, Researcher decides that the specimen represents a new species and also decides that a new clade should be named using that species as a specifier.
6) Researcher names the new species, typified by the specimen, and the new clade in a publication which is published before X amount of time has passed.

Use Case 2.—Differing taxonomic opinions.
1) Researcher A discovers specimen.
2) Specimen is catalogued in an institution.
3) Specimen is registered under the nomenclatural code's database. Researcher A now has X amount of time to name taxa based on the specimen.
4) Researcher A publishes a preliminary report on the specimen, noting its registration information.
5) Researcher A spends more time assessing the relationships of the organism(s) represented by the specimen. Based on this, Researcher A decides that it belongs to a preexisting species and decides to publish a paper assigning it to that species.
6) Researcher B reads the preliminary report, and notes data that indicate that it may belong to a new species.
7) Researcher B notes that Researcher A has a hold on naming taxa based on the specimen, and communicates with Researcher A, learning Researcher A's taxonomic opinion.
8) Researcher B maintains disagreement, and decides to name a new species based on the specimen.
9) Researcher B challenges Researcher A's claim, via the nomenclatural code's database.
10) Researcher A relinquishes the claim.
11) Researchers A and B publish their respective papers with their differing opinions.
Alternate course of events.
10) Researcher A maintains the claim.
11) Researcher A publishes the paper placing the specimen in a preexisting species.
12) The claim expires after X amount of time.
13) Researcher B publishes a paper placing the specimen in a new species.

Use Case 3.—Renewal.
1) Researcher discovers specimen.
2) Specimen is catalogued in an institution.
3) Specimen is registered under the nomenclatural code's database. Researcher now has X amount of time to name taxa based on the specimen.
4) Researcher publishes a preliminary report on the specimen, noting its registration information.
5) Researcher spends more time assessing the relationships of the organism(s) represented by the specimen. Based on this, Researcher decides that it belongs to a new species and decides to publish a paper naming the new species.
6) Writing the paper takes more time than expected. Researcher applies for an extension to the claim through the nomenclatural code's database.
7) The extension is automatically approved, since nobody else has filed a challenge.
8) Researcher publishes the paper naming the new species.
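
The three use cases boil down to a small set of rules, which can be sketched as a toy model. This is only an illustration of the idea: the class, its field names, and the specific rules (for instance, that any pending challenge blocks an extension) are assumptions drawn from the use cases above, not part of any actual nomenclatural code.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class SpecimenClaim:
    """A time-limited claim to name taxa based on one specimen."""
    specimen: str
    claimant: str
    registered: date
    term: timedelta                   # the "X amount of time"
    challenger: Optional[str] = None
    relinquished: bool = False

    @property
    def expires(self) -> date:
        return self.registered + self.term

    def is_active(self, today: date) -> bool:
        return not self.relinquished and today < self.expires

    def may_name(self, researcher: str, today: date) -> bool:
        """The claimant may always name; others only once the claim lapses."""
        return researcher == self.claimant or not self.is_active(today)

    def challenge(self, researcher: str) -> None:
        self.challenger = researcher

    def relinquish(self) -> None:
        self.relinquished = True

    def extend(self, extra: timedelta) -> bool:
        """Approve an extension automatically unless a challenge is pending."""
        if self.challenger is not None:
            return False
        self.term += extra
        return True


claim = SpecimenClaim("specimen 1", "Researcher A", date(2008, 10, 1), timedelta(days=365))
```

Under this model, Researcher B cannot name a taxon based on "specimen 1" while the claim is active, but can as soon as Researcher A relinquishes it or the term runs out.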

Note that the current process is possible under this scheme. That is, the researcher can forgo registration if they plan to keep the data under wraps until the new taxa are published. Registration simply allows the researcher to get the scientific data out ASAP.

It does optionally involve a few extra steps, but this scheme allows researchers to get their data out as quickly as possible, and then take some time in establishing the nomenclature. That seems like an eminently desirable outcome.