
15 February 2013

JSEN: JavaScript Expression Notation

That idea I was talking about yesterday? Storing mathematical expressions as JSON? I went ahead and made it as a TypeScript project and released it on GitHub:

JavaScript Expression Notation (JSEN)

Still need to complete the unit test coverage and add a couple more features. I made a change from my original post to the syntax for namespace references. (The reason? I realized I needed to be able to use "*" as a local identifier for multiplication.) ~~They work within Namespace declaration blocks, but I need to make them work at the higher level of Namespaces declaration blocks as well.~~ (Done.) ~~I also want to allow functions to be used as namespaces.~~ (Done.)

This is possible right now:

jsen.decl('my-fake-namespace', {
   'js': 'http://ecma-international.org/ecma-262/5.1',

   'x': 10,
   'y': ['js:Array', 1, 2, 3],
   'z': ['js:[]', 'y', 1]
});

jsen.eval('my-fake-namespace', 'x'); // 10
jsen.eval('my-fake-namespace', 'y'); // [1, 2, 3]
jsen.eval('my-fake-namespace', 'z'); // 2

jsen.expr('my-fake-namespace', 'x'); // 10 // Deprecated
jsen.expr('my-fake-namespace', 'y'); // Deprecated
    // ["http://ecma-international.org/ecma-262/5.1:Array", 1, 2, 3]
jsen.expr('my-fake-namespace', 'z'); // Deprecated
    // ["http://ecma-international.org/ecma-262/5.1:[]", "y", 1]

Eventually something like this will be possible as well:

Mathematical expressions as JSON (and phyloreferencing)

For Names on Nodes I did a lot of work with MathML (specifically MathML-Content), an application of XML for representing mathematical concepts. But now, as XML wanes and JSON waxes, I've started to look at ideas for porting Names on Nodes concepts over to JSON.

I've been drawing up a very basic and extensible way to interpret JSON mathematically. Each of the core JSON values translates like so:

Null, Boolean, and Number values are interpreted as themselves.
Strings are interpreted as qualified identifiers (if they include ":") or local identifiers (otherwise).
Arrays are interpreted as the application of an operation, where the first element is a string identifying the operation and the remaining elements are arguments.
Objects are interpreted either as:

a set of declarations, where each key is a [local] identifier and each value is an evaluable JSON expression (see above), or
a namespace, where each key is a URI and each value is a series of declarations (see previous).
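
As a rough illustration, the rules above might be applied along the following lines in TypeScript. This is only a sketch under stated assumptions, not the JSEN implementation: the names `Expr`, `Value`, `Declarations`, `operations`, and `evaluate` are hypothetical, only local identifiers are resolved, and a single MathML-style `plus` operation is supplied by hand.

```typescript
// Hypothetical types for this sketch; none of these names come from JSEN.
type Expr = null | boolean | number | string | ExprArray;
interface ExprArray extends Array<Expr> {}
type Value = null | boolean | number | ValueArray;
interface ValueArray extends Array<Value> {}
type Operation = (...args: Value[]) => Value;
type Declarations = { [localName: string]: Expr };

const MATHML = "http://www.w3.org/1998/Math/MathML";

// Operations cannot be written in JSON, so they are supplied from code,
// keyed by qualified identifier ("<namespace URI>:<local name>").
const operations: { [qualifiedName: string]: Operation } = {
    [`${MATHML}:plus`]: (...args: Value[]): Value =>
        args.reduce<number>((sum, arg) => sum + Number(arg), 0),
};

function has(obj: object, key: string): boolean {
    return Object.prototype.hasOwnProperty.call(obj, key);
}

// Note: cyclic declarations are not detected in this sketch.
function evaluate(expr: Expr, decls: Declarations): Value {
    // Null, Boolean, and Number values are interpreted as themselves.
    if (expr === null || typeof expr === "boolean" || typeof expr === "number") {
        return expr;
    }
    // Strings are identifiers; this sketch only resolves local ones.
    if (typeof expr === "string") {
        if (!has(decls, expr)) {
            throw new Error(`Unknown identifier: ${expr}`);
        }
        return evaluate(decls[expr], decls);
    }
    // Arrays apply the operation named by the first element to the rest.
    const head = expr[0];
    if (typeof head !== "string" || !has(operations, head)) {
        throw new Error(`Unknown operation: ${String(head)}`);
    }
    const args = expr.slice(1).map(arg => evaluate(arg, decls));
    return operations[head](...args);
}

const decls: Declarations = {
    "x": [`${MATHML}:plus`, 1, 2],
    "y": "x",
};
const x = evaluate("x", decls); // 3
const y = evaluate("y", decls); // 3, resolved through the local identifier "x"
```

Keeping operations in a separate table reflects the point that functions have to live on the JavaScript side, while the declarations themselves stay plain JSON.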

Examples

Here's a simple object declaring some mathematical constants (approximately):

{
    "e": 2.718281828459045,
    "pi": 3.141592653589793
}

Supposing we had declared some operations (only possible in JavaScript, since JSON doesn't have functions) equivalent to those of MathML (whose namespace URI is "http://www.w3.org/1998/Math/MathML"), we could do this:

{
    "x":
        ["http://www.w3.org/1998/Math/MathML:plus",
            1,
            2
        ],
    "y":
        ["http://www.w3.org/1998/Math/MathML:sin",
            ["http://www.w3.org/1998/Math/MathML:divide",
                "http://www.w3.org/1998/Math/MathML:pi",
                2
            ]
        ]
}

Once evaluated, x would be 3 and y would be 1 (or close to it, given that this is floating-point math).

Now for the interesting stuff. Suppose we had declared Names on Nodes operations and some taxa using LSIDs:

{
    "Homo sapiens": "urn:lsid:ubio.org:namebank:109086",
    "Ornithorhynchus anatinus": "urn:lsid:ubio.org:namebank:7094675",
    "Mammalia":
        ["http://namesonnodes.org/ns/math/2013:clade",
            ["http://www.w3.org/1998/Math/MathML:union",
                "Homo sapiens",
                "Ornithorhynchus anatinus"
            ]
        ]
}

Voilà, a phylogenetic definition of Mammalia in JSON!

I think this could be pretty useful. My one issue is the repetition of long URIs. It would be nice to have a mechanism to import them using shorter handles. Maybe something like this?

{
    "mathml":   "http://www.w3.org/1998/Math/MathML:*",
    "namebank": "urn:lsid:ubio.org:namebank:*",
    "NoN":      "http://namesonnodes.org/ns/math/2013:*",

    "Mammalia":
        ["NoN:clade",
            ["mathml:union",
                "namebank:109086",
                "namebank:7094675"
            ]
        ]
}
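
One way such handles could be resolved is a preprocessing pass that rewrites each "handle:name" string into its full form before evaluation. The sketch below assumes the "URI:*" convention proposed here; `expandHandles` and the type names are hypothetical and not part of any released API.

```typescript
// Hypothetical sketch of the proposed import-handle syntax.
type Json = null | boolean | number | string | JsonArray;
interface JsonArray extends Array<Json> {}
type Decls = { [name: string]: Json };

const HANDLE_SUFFIX = ":*";

function owns(obj: object, key: string): boolean {
    return Object.prototype.hasOwnProperty.call(obj, key);
}

function expandHandles(decls: Decls): Decls {
    // Pass 1: collect handles, e.g. "mathml" -> "http://www.w3.org/1998/Math/MathML:".
    const prefixes: { [handle: string]: string } = {};
    for (const name of Object.keys(decls)) {
        const value = decls[name];
        if (typeof value === "string" && value.slice(-HANDLE_SUFFIX.length) === HANDLE_SUFFIX) {
            prefixes[name] = value.slice(0, -1); // drop "*", keep the trailing ":"
        }
    }

    // Rewrites "handle:local" strings; recurses into arrays; leaves the rest alone.
    const expand = (expr: Json): Json => {
        if (typeof expr === "string") {
            const colon = expr.indexOf(":");
            if (colon > 0) {
                const handle = expr.slice(0, colon);
                if (owns(prefixes, handle)) {
                    return prefixes[handle] + expr.slice(colon + 1);
                }
            }
            return expr;
        }
        return Array.isArray(expr) ? expr.map(expand) : expr;
    };

    // Pass 2: rewrite every declaration that is not itself a handle.
    const result: Decls = {};
    for (const name of Object.keys(decls)) {
        if (!owns(prefixes, name)) {
            result[name] = expand(decls[name]);
        }
    }
    return result;
}

const expanded = expandHandles({
    "mathml": "http://www.w3.org/1998/Math/MathML:*",
    "namebank": "urn:lsid:ubio.org:namebank:*",
    "NoN": "http://namesonnodes.org/ns/math/2013:*",
    "Mammalia": ["NoN:clade", ["mathml:union", "namebank:109086", "namebank:7094675"]],
});
// expanded["Mammalia"] is now:
// ["http://namesonnodes.org/ns/math/2013:clade",
//     ["http://www.w3.org/1998/Math/MathML:union",
//         "urn:lsid:ubio.org:namebank:109086",
//         "urn:lsid:ubio.org:namebank:7094675"]]
```

Strings whose prefix is not a declared handle (such as a full "urn:lsid:..." identifier) pass through unchanged.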

Something to ponder. Another thing to ponder: what should I call this? MathON? MaSON?

28 January 2013

Using TypeScript to Define JSON Data

JSON has gradually been wearing away at XML's position as the primary format for data communication on the Web. In some ways, that's a good thing: JSON is much more compact and readable. In other ways, it's not so great: JSON lacks some of XML's features.

One of these features is document type definitions. For XML, there are a variety of formats (DTD, XML Schema, RELAX NG, etc.) for specifying exactly what your XML data looks like: what are the tag names, possible attributes, etc. JSON is a lot more loosey-goosey here.

Okay, that's not entirely true: there is JSON Schema. I've never known anyone to use it, but it's there. It's awfully verbose, though. (So are the definitional formats for XML, but it's XML — you expect it!)

I was thinking about this the other day, and I realized that there is actually a great definitional format for JSON already in existence: TypeScript! If you haven't heard of it, TypeScript is a superset of JavaScript which introduces optional strict typing. And since JSON is a subset of JavaScript, TypeScript is applicable to JSON as well.

One of the great features of TypeScript is that interface implementation is implicit. In Java or ActionScript, you have to specifically say that a type "implements MyInterface". In TypeScript, if it fits, it fits. For example:

interface List
{
    length: number;
}

function isEmpty(list: List): bool
{
    return list.length === 0;
}

console.log(isEmpty("")); // true
console.log(isEmpty("foo")); // false
console.log(isEmpty({ length: 0 })); // true
console.log(isEmpty({ length: 3 })); // false
console.log(isEmpty({ size: 1})); // Compiler error!

(Note: for some reason that I can't fathom, isEmpty() doesn't work on arrays. Well, TypeScript is still in development — version 0.8.2 right now. Update: I filed this as a bug.)

Note that you can use interfaces even on plain objects. So of course you can use it to describe a JSON format. Here's an example from a project I hope to release before too long:

interface Model
{
    uid: string;
}

interface Name extends Model
{
    citationStart?: number;
    html?: string;
    namebankID?: string;
    root?: bool;
    string?: string;
    type?: string;
    uri?: string;
    votes?: number;
}

interface Taxon
{
    canonicalName?: Name;
    illustrated?: bool;
    names?: Name[];
}

Now, for example, I can declare that an API search method will return data as an array of Taxon objects (Taxon[]). And look how compact and readable it is!

Note that there is one drawback here: there is no way to enforce this at run-time. JSON Schema might be a better choice if that's what you need. But for compile-time checking and documentation, it's a pretty great tool.
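
If some run-time checking is wanted without leaving TypeScript, one option is to hand-write small validation functions alongside the interfaces. The sketch below is only an illustration and assumes a much newer compiler than 0.8.2 (one with the `unknown` type and type predicates, and where `bool` is spelled `boolean`). The interfaces are trimmed copies of the ones above, with `uid` folded into `Name`, and `isDict`, `isOptional`, `isName`, and `isTaxon` are hypothetical helper names.

```typescript
// Trimmed copies of the interfaces above, for a self-contained sketch.
interface Name {
    uid: string;
    string?: string;
    votes?: number;
}

interface Taxon {
    canonicalName?: Name;
    illustrated?: boolean;
    names?: Name[];
}

type Dict = { [key: string]: unknown };

// True for plain (non-null, non-array) objects.
function isDict(value: unknown): value is Dict {
    return typeof value === "object" && value !== null && !Array.isArray(value);
}

// True if the value is absent or has the given primitive type.
function isOptional(value: unknown, type: "string" | "number" | "boolean"): boolean {
    return value === undefined || typeof value === type;
}

function isName(value: unknown): value is Name {
    return isDict(value)
        && typeof value["uid"] === "string"
        && isOptional(value["string"], "string")
        && isOptional(value["votes"], "number");
}

function isTaxon(value: unknown): value is Taxon {
    if (!isDict(value)) {
        return false;
    }
    const names = value["names"];
    return (value["canonicalName"] === undefined || isName(value["canonicalName"]))
        && isOptional(value["illustrated"], "boolean")
        && (names === undefined || (Array.isArray(names) && names.every(isName)));
}

// Filter an untrusted response down to the objects that fit the interface.
const response: unknown = JSON.parse('[{"names": [{"uid": "a1"}]}, {"illustrated": "yes"}]');
const taxa: Taxon[] = Array.isArray(response) ? response.filter(isTaxon) : [];
// taxa.length is 1: the second object is rejected because "illustrated" is not a boolean.
```

The obvious cost is that the checks duplicate the interface by hand and can drift out of sync with it, which is where a schema language may still be the better fit.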

02 May 2012

The PhyloCode Will Not Be Amended

At least for now.

In a 10-1 decision, the Committee on Phylogenetic Nomenclature voted to reject the wholesale adoption of a proposal to amend the PhyloCode that would have greatly changed how it handles species and species names. However, the CPN has decided to discuss the possibility of using some ideas in the proposal.

An Idea for the EOL Phylogenetic Tree Challenge

Earlier this year, the Encyclopedia of Life announced the EOL Phylogenetic Tree Challenge. The goal: to produce "a very large, phylogenetically-organized set of scientific names suitable for ingestion into the Encyclopedia of Life as an alternate browsing hierarchy". The prize: an all-expenses-paid trip to iEvoBio 2012 in Ottawa!

This interested me greatly, because:

It's exactly the sort of thing I'm working on for PhyloPic.
I can't really justify paying for a trip to iEvoBio this year. (Phyloinformatics is my hobby, not my profession!)

After reading Rod Page's thoughts on the challenge, I came up with a basic idea, and started to implement it. Unfortunately, now that we're two weeks from the deadline, I'm realizing that:

I do not have the time to complete it.
Even if it were paid for, I can't justify a trip on my own out of town right now.

Why not? Simply put, this.

So, instead, I'm going to outline the general approach I was going to take, and if someone else wants to run with it, knock yourself out. (Just give me partial credit.)

Amending the PhyloCode: The Species Problem

Earlier I mentioned a proposal by Cellinese, Baum, and Mishler to make a major revision to the PhyloCode, removing pretty much all mention of "species". In this post I'm going to take a high-level look at some of the proposed changes.

What Is and Is Not a Stem Group

In recent years, I've noticed a trend: the prefix "stem-" is becoming more and more popular for stem groups. For those who don't know what a "stem group" is:

A crown group is the last common ancestor of two or more extant taxa, and all descendants thereof.
A total group is the first ancestor of a crown group that is not also ancestral to any other extant taxa, and all descendants thereof.
A stem group is a total group minus its crown group. (Which means, of course, that a total group is a crown group plus its stem group.)
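
For the programmatically inclined, these three definitions can be sketched over a toy tree. Everything in the block below is an assumption made for illustration: the topology is heavily simplified, the ancestor labels `n0` to `n3` are invented, the use of Monotremata and Theria as the extant specifiers is just one convenient choice, and the function names are hypothetical.

```typescript
// Toy tree (child -> parent, null for the root). Illustrative only.
const parentOf: { [node: string]: string | null } = {
    n0: null,
    Aves: "n0",
    n1: "n0",
    Dimetrodon: "n1",
    n2: "n1",
    Cynognathus: "n2",
    n3: "n2",
    Monotremata: "n3",
    Theria: "n3",
};
const extant = ["Aves", "Monotremata", "Theria"];

// A node followed by its ancestors, nearest first.
function lineage(node: string): string[] {
    const path: string[] = [];
    let current: string | null = node;
    while (current !== null) {
        path.push(current);
        current = parentOf[current] || null;
    }
    return path;
}

// A node together with everything descended from it.
function descendantsOf(node: string): string[] {
    return Object.keys(parentOf).filter(n => lineage(n).indexOf(node) >= 0);
}

function lastCommonAncestor(taxa: string[]): string {
    if (taxa.length === 0) {
        throw new Error("At least one taxon is required.");
    }
    const shared = lineage(taxa[0]).filter(
        ancestor => taxa.every(t => lineage(t).indexOf(ancestor) >= 0));
    if (shared.length === 0) {
        throw new Error("No common ancestor found.");
    }
    return shared[0];
}

// Crown group: the last common ancestor of the extant taxa, and all descendants.
function crownGroup(taxa: string[]): string[] {
    return descendantsOf(lastCommonAncestor(taxa));
}

// Total group: climb from the crown ancestor while no other extant taxon is swept in.
function totalGroup(taxa: string[]): string[] {
    const crown = crownGroup(taxa);
    let top = lastCommonAncestor(taxa);
    for (let parent = parentOf[top]; parent; parent = parentOf[top]) {
        const sweepsInOthers = descendantsOf(parent).some(
            n => extant.indexOf(n) >= 0 && crown.indexOf(n) < 0);
        if (sweepsInOthers) {
            break;
        }
        top = parent;
    }
    return descendantsOf(top);
}

// Stem group: total group minus crown group.
function stemGroup(taxa: string[]): string[] {
    const crown = crownGroup(taxa);
    return totalGroup(taxa).filter(n => crown.indexOf(n) < 0);
}

const stemMammals = stemGroup(["Monotremata", "Theria"]);
// ["n1", "Dimetrodon", "n2", "Cynognathus"]
```

In this toy tree the climb stops below `n0`, because going any higher would sweep in Aves, an extant taxon outside the crown group.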

Or, to put it more simply, an extinct organism is a stem-X if it does not belong to X, but it shares more ancestry with X than with any extant organisms outside of X. Real-life examples:

Velociraptor mongoliensis, a stem-avian.
Illustration by myself (Mike Keesey).

Stem-mammals: Dimetrodon, Moschops, Cynognathus, Castorocauda.
Stem-avians: Marasuchus, Psittacosaurus, Plateosaurus, Tyrannosaurus, Velociraptor, Archaeopteryx, Hesperornis.
Stem-humans: Ardipithecus(?), Australopithecus, Paranthropus, Homo habilis, Homo erectus.
Stem-cetaceans: Ambulocetus, Pakicetus, Maiacetus, Basilosaurus.
Stem-felines: Proailurus, Smilodon.
~~Stem-pterygotes~~ Stem-neopterans: Dictyoneura, Lithomantis.

This is a great convention. It's consistently useful in every area of the Tree of Life. It's concise. It communicates instantly the general area we're talking about, and sets us up to make proper phylogenetic inferences (when the fossil data is lacking).

So I'm glad this trend is becoming more popular. Unfortunately, I've also noticed another trend: rampant misuse!

Case in point:

CABREIRA & al. (2011). New stem-sauropodomorph (Dinosauria, Saurischia) from the Triassic of Brazil. Naturwissenschaften (online early). doi:10.1007/s00114-011-0858-0

This looks to be an excellent paper on a very interesting find, so it's unfortunate that there's a glaring error in the title, but there it is: "stem-sauropodomorph". There is no such thing, because Sauropodomorpha is not a crown group. It doesn't even include a crown group (sadly—it'd be very cool if it did). Rather, all sauropodomorphs are part of the avian stem group.

Panphagia protos, a stem-avian
(not a "stem-sauropodomorph").
Photo by Eva K.
Used under the GFDL.

I see a lot of people making this mistake. I think what's happening is that they're using the basic concept of a stem group, but replacing "total group" with "some large clade" and "crown group" with "an interesting subclade". In this case, Sauropodomorpha is "some large clade" and Sauropoda is "an interesting subclade". (And in that case, the usage is even wronger, because it should at least be "stem-sauropod".)

This misuse is unfortunate because it is subjective, while the proper usage is objective. One could make the argument that the real "interesting subclade" of Sauropodomorpha is Titanosauria, or Neosauropoda, or whatever, and then the terminology would mean something very different. By contrast, e.g., "stem-crocodylian" very clearly indicates a particular paraphyletic group.

So, please, people, use the "stem-" prefix, but use it correctly!

23 December 2010

What is a human?

Find the human! Pretty easy, right? RIGHT??

It is obvious what is "human" and what is not if we just look at living organisms. There's a clear gap between us and our closest living relatives, the chimpanzees. No danger of mistaking one for the other.

But this clarity vanishes as soon as we look at the fossil record. There's a gradient of forms between us and things that are not clearly closer to us or chimpanzees (Ardipithecus, Orrorin, Sahelanthropus). Which ones are "human" and which are not? Is Praeanthropus afarensis human? What about Homo habilis? Homo ergaster? Neandertals? Homo sapiens idaltu?

Find the human! Or is there more than one?
Or are they all human?

This issue crops up for all kinds of taxa. Much time has been spent arguing about what is and is not, e.g., avian or mammalian. The issue is more common within vertebrates than many other taxa, since vertebrates have an especially good and well-studied fossil record. But it applies, in theory or practice, to every extant taxon.

I subscribe to the school of thought that names born from neontology (the study of extant organisms) are best restricted to the crown group (that is, to the living forms, their final common ancestor, and all descendants of that ancestor). Arguments for restricting common names to crown groups were first laid out by de Queiroz and Gauthier (1992). The primary reason for doing this is that it prevents unjustified inferences about stem groups (that is, the extinct taxa which are not part of the crown group, but are closer to it than to anything else extant). For example, we currently have no way of knowing whether the statement, "Within all mammalian species, mothers produce milk," is true if we include things like Docodon as mammals (or, as a few have done, even earlier things like Dimetrodon). However, if we restrict Mammalia to the last common ancestor of monotremes and therians (marsupials and placentals) and all descendants of that ancestor, then the statement unambiguously holds.

This system also gives us a very easy way to refer to any stem group: just add the prefix "stem-". Some examples:

stem-avians: Pterodactylus, Iguanodon, Diplodocus, Eoraptor, Coelophysis, Tyrannosaurus, Oviraptor, Velociraptor, Archaeopteryx, Ichthyornis
stem-mammals: Casea, Dimetrodon, Moschops, Cynognathus, Docodon
stem-whales: Indohyus, Ambulocetus, Pakicetus, Basilosaurus, Dorudon
stem-humans: Ardipithecus(?), Praeanthropus, Australopithecus, Homo habilis, Homo ergaster

stem-humans

This is a nice, neat system. However, for humans, it gets a little sloppy the closer we get to the crown group.

For a long time, there was a debate in paleoanthropology as to how our species originated. We are distributed across the globe, so it's not immediately obvious where we are from. As the hominin fossil record gradually came to light during the 20th century, it became clearer that the earliest roots of the human total group were in Africa, since that's where the oldest remains are found. Everything before two million years ago is African, and only after that time period do we start to see remains in Eurasia, all of them belonging to the genus Homo. Remains in Australia and America don't occur until very late, and only modern humans appear in those regions.

But this leaves open the question of our own species' origin. Homo had spread all over the Old World by the time modern humans appeared, so we could have come from anywhere in Africa or Eurasia. Two major hypotheses were formed. The Out of Africa Hypothesis suggested that the ancestors of humans originated in Africa and then spread out over the globe, displacing all other populations of Homo: the Neandertals in West Eurasia, Peking Man in Asia, Java Man in Malaya, etc. The Multiregional Hypothesis, on the other hand, suggested that modern human races evolved more or less in their current areas: Negroids were descended from Rhodesian Man, Caucasoids from Neandertal Man, and Mongoloids from Peking Man.

These hypotheses competed with each other until the advent of genetic analysis. When scientists were finally able to study the mitochondrial genome, which is copied from mother to child, they found that all living humans shared a relatively recent matrilineal ancestor, much more recent than the splits between Rhodesian, Neandertal, and Peking fossils. Furthermore, the matrilineal family tree strongly points to an ancestor in Africa, where the most divergence is found. Study of the Y chromosome, which is copied from father to son, indicated an even more recent patrilineal ancestor, also African. The case seemed closed. Out of Africa had won.

The case seemed further bolstered when the Neandertal mitochondrial genome was recovered. It revealed a signature which clearly placed it outside the modern human group (Teschler-Nicola & al. 2006). Earlier this year, mitochondrial DNA was also retrieved from an indeterminate fossil from Denisova, Siberia, indicating that it represented a matrilineage even further out, preceding the human-Neandertal split (Krause & al. 2010).

This would give us a pretty nice, clean series of splits. And it would mean that Neandertals, Denisovans, etc. are stem-humans.

But there is more to ancestry than just the matrilineage and the patrilineage. Most of our ancestral lineages include members of both sexes (think of your mother's father and your father's mother). The matrilineage and patrilineage are the only ones that can be studied with clarity, since all other chromosomes undergo a shuffling process. But those other lineages exist nonetheless.

Only very recently has evidence come to light which challenges Out of Africa, at least in its strong form. Earlier this year, a study suggested that all humans except for Sub-Saharan Africans have inherited 1–4% of their DNA from Neandertal ancestors (Green & al. 2010). And just yesterday, a new analysis of Denisovan nuclear DNA showed that Melanesians have inherited 4–6% of their DNA from Denisovans (Reich & al. 2010). This nuclear DNA seems to originate from an ancestor close to the human-Neandertal split, but somewhat on the Neandertal side.

Long story short, the picture has gotten a lot more complicated. It's no longer, "Out of Africa, yes, Multiregional, no." Now it's, "Out of Africa, mostly; Multiregional, somewhat."

So what does this mean for the term "human"? Are Neandertals and Denisovans human? After all, they seem to be ancestral to some, but not all, modern human populations.

Well, they can only belong to the crown clade if they are the final common ancestor of all living humans, or descended from it. Neither of these criteria appears to hold. So, for now, I would still say that they are not human, only very close to human. (Note that this does not mean that people descended, in part, from Neandertals and/or Denisovans are somehow "less human" than those with pure African ancestry. The African ancestors are also not humans but stem-humans under this usage. This usage is discrete; you're either human or you aren't.)

Still, at this level of resolution, we start to see a problem with the crown clade usage. What is the final common ancestor? Many would assume it to be the last-occurring common ancestor, but this is problematic, and not just because that ancestor probably lived within recorded history (making, e.g., the Sumerians inhuman!). When I say "final" I'm really referring to something a bit more complex—the maximal members of a predecessor union. (More discussion here.) But determining what that is, exactly, requires better datasets than we have.

I still think it's a good convention, and if its application is a bit vague, so be it—our knowledge is a bit vague. For now I would say that humans are a clade of large, gracile hominins with high-vaulted crania that emerged roughly 150,000 years ago in Africa, and then spread out. They are descended from not one but at least three major populations of stem-human. One of these, the African population (idaltu, helmei, etc.), forms the majority of the ancestry, up to 100% in some populations. The others, Neandertals and Denisovans, only form a small part of the ancestry of some humans.

I feel this convention is useful because it prevents unjustified inferences. For example, we know that all living human populations have languages with highly complex grammar. We really don't know whether Neandertals and Denisovans had such languages, or whether the immediate African predecessors of humans did, for that matter. So it's good to be able to categorize them as stem-humans, because it reminds us that we don't have as much data available on them as we do for the crown group. We have to be more clever in figuring these things out.

And if we ever cloned a Neandertal? Well, ask me again once that happens.

References

de Queiroz & Gauthier (1992). Phylogenetic taxonomy. Annual Review of Ecology and Systematics 23:449–480.
Green & al. (2010). A draft sequence of the Neandertal genome. Science 328:710–722. doi:10.1126/science.1188021
Krause & al. (2010). The complete mitochondrial DNA genome of an unknown hominin from southern Siberia. Nature 464(7290):894–897. doi:10.1038/nature08976
Reich & al. (2010). Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468:1053–1060. doi:10.1038/nature09710
Teschler-Nicola & al. (2006). No evidence of Neandertal mtDNA contribution to early modern humans. Pages 491–503 in Early Modern Humans at the Moravian Gate. Springer Vienna. doi:10.1007/978-3-211-49294-9_17

21 December 2010

The Purpose of Generic Names: Or, Everyone's a Homo

Vrſvs Lotor Linn.

It can be surprising for a modern-day biology student to look at 18th-century texts and see how broad the genera are. Consider Linnaeus: he named raccoons as a species of bear (Ursus lotor, the "washer bear"—still called "tvättbjörn" ["washbear"] in Swedish). Nowadays raccoons aren't even placed in the same family as bears, and bears are split into anywhere from roughly three to seven extant genera. Or consider bats, nowadays comprising about 60 extant genera, while Linnaeus classified them as just one (Vespertilio). These aren't even the craziest examples.

Was Linnaeus nuts? Of course not. He saw that the task of creating a unique name for every species would be extremely difficult, so he decided it would be okay to reuse the same names if they were prefaced by the name of a more general category—a genus. Thus, for example, he was able to call the house mouse Mus musculus and the blue whale Balaena musculus (now either Balaenoptera or Sibbaldus). Although they have the same specific epithet, those epithets are unique within their general category. As long as homonymy is avoided, there's no nomenclatural need to restrict genera, so why not make them broad?

In this way, genera function as what we in the computer science world call namespaces. Different things are allowed to have the same local name as long as they are within different namespaces. The qualified name, which combines a namespace identifier with the local name, is globally unique, even if the local name is not.

Biological Nomenclature

generic name + specific epithet = species name

Computer Science

namespace identifier + local name = qualified name
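
The analogy can be made concrete with a few lines of TypeScript. This is only a sketch: `QualifiedName`, `qualify`, and `homonyms` are hypothetical names, and lumping both species into a single "Mammalia" namespace is a thought experiment, not an actual nomenclatural proposal.

```typescript
// Hypothetical model of a species name as namespace + local name.
interface QualifiedName {
    namespace: string; // generic name
    localName: string; // specific epithet
}

// Combine the two parts into the full (qualified) name.
function qualify(name: QualifiedName): string {
    return `${name.namespace} ${name.localName}`;
}

// Qualified names that occur more than once, i.e. clashes within one namespace.
function homonyms(names: QualifiedName[]): string[] {
    const counts: { [qualified: string]: number } = {};
    for (const name of names) {
        const qualified = qualify(name);
        counts[qualified] = (counts[qualified] || 0) + 1;
    }
    return Object.keys(counts).filter(qualified => counts[qualified] > 1);
}

const linnaean: QualifiedName[] = [
    { namespace: "Mus", localName: "musculus" },
    { namespace: "Balaena", localName: "musculus" },
];
const separate = homonyms(linnaean);
// []: same epithet, but different genera keep the qualified names unique

// Thought experiment: put both species in one broad namespace.
const lumped = homonyms(linnaean.map(name => ({ ...name, namespace: "Mammalia" })));
// ["Mammalia musculus"]: the shared epithet now collides
```

The local name alone never has to be unique; only the combination does.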

Recently, this got me to thinking—why have we restricted our genera so much, when this was their original purpose? If we just want to be sure that each species has a globally unique name, we could have much larger genera—in some cases, even larger than Linnaeus'.

Consider our own genus, Homo. It has exactly one extant species, Homo sapiens, and that species has an epithet that is already, as far as I know, globally unique. How is that useful? Even Linnaeus thought this was rather silly—he would have included humans in his anthropoid genus, Simia, except he feared backlash. (Even including us in the same order as other primates was controversial at the time.)

How far could we extend our genus and retain its usefulness as a namespace? How far out can we go without having duplicate local names? Within Homo we already have local names like sapiens, erectus, habilis, etc. We actually do have at least one duplicate name, Homo capensis, but it's universally considered a junior synonym. (Although this case is a bit complicated.) If we only consider valid, non-synonymous names, how far can we push our genus?

If we include all stem-humans there's no problem. We add things like Homo afarensis and Homo robustus.

Left to right: Homo sapiens, gorilla, troglodytes, lar, & pygmaeus.

If we push it out to the crown clade of African apes there's still no problem. We get things like Homo gorilla (western gorillas) and Homo troglodytes (common chimpanzees). (Admittedly Homo troglodytes was already named by Linnaeus, but it's a nomen oblitum without any specimens or certainty as to what, exactly, it was supposed to indicate.)

Still no problem if we push it out to the great ape crown clade, adding things like Homo pygmaeus (Bornean orangutans) and Homo indicus (otherwise Sivapithecus).

Pushing it out to the ape crown clade still works, as we add things like Homo lar (lar gibbons) and Homo syndactylus (siamangs). (Interestingly enough, Homo lar is Linnaeus' original name for the species. Only later was it given a new genus, Hylobates, by Illiger, where it resides to this day. I'm not quite sure why Linnaeus classified it this way, but my guess is he wasn't that familiar with the animal in question, as was often the case.)

But if we go beyond that, we hit a duplicate: Homo africanus (Hopwood 1933, originally Proconsul) and Homo africanus (Dart 1925, originally Australopithecus). (We actually already hit Meganthropus africanus Weinert, 1950 a while ago, but that's universally considered a synonym.)

So there we go, Homo could be used as the generic name for all crown-group apes without any problem. (I'm willing to bet I missed something, though, and I'm looking forward to some commenter correcting me.) We have restricted our genera far more than they need to be restricted in order for species names to be unique.

Are we nuts? Of course not. A genus is much more than just a namespace. We also use genera on their own, as groups in their own right. Expanding Homo to embrace all extant apes would ruin its utility as a name for a certain subclade of the human total group, and make it redundant with a name that we already have (Hominoidea). (In computer science namespaces are also often narrower than they technically could be.)

Then again, maybe it is a bit nuts to have one thing performing two jobs. Why not allow other, larger taxa to be used as namespaces? Well, under the PhyloCode, that will actually be a possibility. We can refer to Hominoidea syndactylus and Hominoidea sapiens if we want. In fact, I think you could use Synapsida syndactylus (avoiding homonymy with Bleda syndactylus, the red-tailed bristlebill) and Biota sapiens (this epithet being globally unique already, as previously mentioned). These particular examples will probably never be popular and I wouldn't use them myself, but I think it's neat that the possibility exists.

14 December 2010

pymathema, a Python tool for evaluating MathML

Lately I've been learning the programming language Python, and I've really been enjoying it. In particular, as a dynamic language (i.e., having loose types), it's really well-suited for mathematical tools. (Having sets and tuples as native types doesn't hurt, either.)

I started creating a MathML-Content evaluator in Python, with an extension for Names on Nodes which implements phyloreferencing expressions. As part of this I am working on Version 2.0 of the Names on Nodes MathML Definitions, which will expand upon the current ones.

Basic functionality is pretty much complete, although there are some niceties to add. If you'd like to check it out and maybe collaborate, have a look here: PYMATHEMA.

26 June 2010

Names on Nodes is finally online.

A month ago I got notice that my abstract had been accepted, and that I would be demoing Names on Nodes at the iEvoBio conference's Software Bazaar on June 29. This is the first time Names on Nodes has ever truly had a hard deadline. Since it's a personal project, until now I have had the luxury of languidly rebuilding and polishing and rebuilding and polishing. But now I have to get something up. So it's up.

Names on Nodes

There's still a lot left to do, but this will have to do for now. You can load NexML and Newick files (not NEXUS for now, sorry—although, really, you should be sorry for still using it when NexML is available). You can save as MathML and export PNG image files. You can create phylogenies and phylogenetic definitions on the fly using a visual interface that emphasizes drag-and-drop. Or you can type them in (as Newick and MathML, respectively), should you prefer that.

There are still a lot of bugs, and a lot of unimplemented features. If you come across issues or if you have feature requests, please feel free to submit an issue. And if you want to look at the code, it's open source (MIT license) and available on BitBucket.

27 May 2010

Upcoming Names on Nodes Presentation

I'll also be presenting Names on Nodes at iEvoBio, at the Software Bazaar on June 29. Here's the abstract:

Names on Nodes: Automating the Application of Taxonomic Names within a Phylogenetic Context
Names on Nodes¹ is an open-source² Flex application which utilizes a mathematical approach to automate the application of phylogenetically-defined names to phylogenetic hypotheses. Phylogenetic hypotheses are modeled as directed, acyclic graphs, and may be read from bioinformatics or graph files (Nexus, NexML, Newick, and GraphML) or created de novo. Hypotheses may also be merged from multiple sources. Names on Nodes stores hypotheses as MathML, an XML-based language for representing mathematical content and presentation. Phylogenetic definitions may be constructed using a visual editor and exported in MathML. Thus, it is possible to create a dictionary of defined names and automatically apply them to phylogenetic hypotheses. In the current version of the application, such dictionaries exist only as MathML files, but in future versions definitions may also be loaded from databases (e.g., RegNum).
Additional functionality in Names on Nodes includes the ability to coarsen a phylogenetic graph (thereby simplifying it while still reflecting the overall structure) or to export it as an image file (raster or vector, potentially with semantic annotations).
¹ Source code available at: http://bitbucket.org/keesey/namesonnodes-sa/
² MIT license

I have my work cut out for me....

26 May 2010

Upcoming Talk: Toward a Complete Phyloreferencing Language

I'll be giving a “Lightning Talk” (five minutes) at the iEvoBio Conference in Portland, Oregon. Here's the abstract:

Toward a Complete Phyloreferencing Language

A phyloreference is a statement indicating a taxon within a phylogenetic context. A common use for phyloreferences is in phylogenetic definitions, which tie taxonomic names to taxa via such statements. Several conventions for writing phyloreferences have been proposed, but most only cover a few “standard” forms (node-, branch-, and perhaps apomorphy-based clades) without the capacity to represent more “exotic” forms (e.g., ancestor-based clades and qualified/modified references). In order to build a complete phyloreferencing language, the mathematical underpinnings of phylogenetic contexts must be clarified. A phylogenetic context may be modeled as a directed, acyclic graph, wherein nodes model taxonomic units and directed edges model immediate descent. Higher taxa are modeled as unions of nodes. A phyloreferencing language must minimally allow for certain classes of entity: Boolean values, sets (including taxa, relations, and the empty set), and lists (including graphs and functions). It must also minimally allow for basic operations related to logic, set theory, and graph theory. Higher structures such as declarations and piecewise constructs must also be possible. With these as a basis, functions related to phylogeny can be defined: maximal, minimal, predecessor union/intersection, successor union/intersection, exclusive predecessors, synapomorphic predecessors, clade, crown clade, and total clade. I show how such a language may be used to represent various types of phyloreference, both “standard” and “exotic”.

Now to figure out how to condense that into a five-minute talk....

21 May 2010

Names on Nodes Issue Tracker

Yesterday I transferred the list of remaining Names on Nodes issues from my whiteboard to the Bitbucket issue tracker. My goal is to get through most of these by the end of June. (Some "nice-to-haves", like DOT or HTML5 exporting, may have to wait.)

Essential features left to implement, complete or fix:

FILES AND FORMATS

Certain formats for import, especially NexML and NEXUS. (Currently only Newick can be imported. MathML files can be loaded as well.)
Certain formats for export, especially NexML. (Currently only PNG can be exported. MathML files can be saved as well.)
Ability to save just the definitions or just the phylogeny to a MathML file.
Ability to import definitions from a MathML file.
MathML tweaks. (Use csymbol instead of ci for taxa. Normalize presentation.)
Ability to write in Newick strings directly.

DISPLAY

Skin various components (sliders, steppers, checkboxes, etc.).
Fix line breaks in MathML formulas.
Various scrollbar issues.
Special character issues.

NAMES

Rich editor for taxon labels, including ability to edit taxon URIs.

NODES

Arc bisection tool.
Fix node merging (i.e., synonymization).
Add ability to select definition type when creating a name.
Node Pane Control Bar revisions. (Change Resolution Slider to a stepper. Add Zoom Slider.)

DEFINITIONS

Definition Editor tweaks/fixes. (Some actions are blocked that should be possible. Textual Editor does not always update. Various layout issues.)

OTHER

About/Help Panel.

05 April 2010

Sketch of a Phylogenetic Query Language

Names on Nodes uses MathML for two primary purposes:

Delineating phylogenetic hypotheses (as directed, acyclic graphs).
Associating identifiers with definitions.

In some ways this works out to be a bit like a query language. You can use it to set up data constructs, and then search them for groups of interest. For example, suppose you wanted a list of all stem-humans from Kenya. Assuming that your dataset included 1) a taxonomic unit called Homo sapiens, 2) a group called extant for all extant taxonomic units, and 3) a group called Kenya for all Kenyan taxonomic units, that query might look like this:

<apply xmlns="http://www.w3.org/1998/Math/MathML">
   <intersect/>
   <ci>Kenya</ci>
   <apply>
      <setdiff/>
      <apply>
         <csymbol definitionURL="http://namesonnodes.org/ns/math/2009#def-Total"/>
         <ci>Homo sapiens</ci>
         <ci>extant</ci>
      </apply>
      <ci>Homo sapiens</ci>
   </apply>
</apply>

MathML is great for being flexible and extensible enough to cover concepts like this. But ... it's also really verbose. This is fine for my purposes so far, but it may be cumbersome for other purposes. So I've been playing around with a more succinct way to write these expressions. Today I tossed up some rough ideas here:

Phylogenetic Query Script (Rough Draft)

This is a plain-text format loosely inspired by mathematical notation, the C language, etc. Using it, the above query becomes:

"Kenya" & (total("Homo sapiens", "extant") - "Homo sapiens")

...which is quite a bit shorter.
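
Because the script's operators line up with ordinary set operations, the same query can be mimicked with Python sets. This is only a rough sketch, not how Names on Nodes evaluates anything. The toy phylogeny, the "ancestor N" labels, the Kenya and extant groupings, and every helper name are made up for illustration. In particular, total() is an assumed reading of a total clade: everything descended from those ancestors of the specifier that are not also ancestors of another extant unit.

```python
# Toy phylogeny as a directed acyclic graph: unit -> immediate descendants.
# The topology and the "ancestor N" labels are invented for illustration.
CHILDREN = {
    "ancestor 1": {"Pan troglodytes", "ancestor 2"},
    "ancestor 2": {"Australopithecus anamensis", "ancestor 3"},
    "ancestor 3": {"Homo neanderthalensis", "Homo sapiens"},
}

# Hypothetical groupings that a real dataset would have to supply.
EXTANT = {"Pan troglodytes", "Homo sapiens"}
KENYA = {"Australopithecus anamensis", "Homo sapiens"}
HOMO_SAPIENS = {"Homo sapiens"}


def all_units(children):
    """Every unit mentioned in the graph, as parent or as child."""
    units = set(children)
    for kids in children.values():
        units |= kids
    return units


def successors(children, unit):
    """The unit itself plus everything descended from it."""
    found, stack = set(), [unit]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def predecessors(children, unit):
    """The unit itself plus everything it is descended from."""
    return {u for u in all_units(children) if unit in successors(children, u)}


def total(children, specifier, extant):
    """Assumed total clade: descendants of the specifier's ancestors that
    are not also ancestors of any other extant unit."""
    own = set()
    for unit in specifier:
        own |= predecessors(children, unit)
    shared = set()
    for unit in extant - specifier:
        shared |= predecessors(children, unit)
    clade = set()
    for ancestor in own - shared:
        clade |= successors(children, ancestor)
    return clade


# "Kenya" & (total("Homo sapiens", "extant") - "Homo sapiens")
stem_humans_from_kenya = KENYA & (total(CHILDREN, HOMO_SAPIENS, EXTANT) - HOMO_SAPIENS)
print(stem_humans_from_kenya)  # {'Australopithecus anamensis'}
```

In this toy data, Homo sapiens is Kenyan but is subtracted out as the crown, and Homo neanderthalensis is a stem-human but is not in the Kenya group, so only one unit survives the intersection.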

This is still in very early stages, so I thought I'd post it to get some feedback.

Here are a few of the simpler clade definition examples:

"Aves"       := clade("Struthio camelus" | "Tetrao major" |
                      "Vultur gryphus").

"Saurischia" := clade("Megalosaurus bucklandii" <-
                      "Iguanodon bernissartensis").

"Avialae"    := clade("wings used for powered flight" @
                      "Vultur gryphus").
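
To make the first two forms concrete, here is a small Python sketch of a node-based clade (the | form) and a branch-based clade (the <- form) over a toy graph. The topology, the "ancestor N" labels, and the function names node_based and branch_based are assumptions for illustration only. They are not taken from the draft specification, and the apomorphy-based (@) form is left out.

```python
# Toy phylogeny: unit -> immediate descendants. Invented for illustration.
CHILDREN = {
    "ancestor 0": {"Iguanodon bernissartensis", "ancestor 1"},
    "ancestor 1": {"Megalosaurus bucklandii", "ancestor 2"},
    "ancestor 2": {"Vultur gryphus", "ancestor 3"},
    "ancestor 3": {"Struthio camelus", "Tetrao major"},
}


def all_units(children):
    """Every unit mentioned in the graph, as parent or as child."""
    units = set(children)
    for kids in children.values():
        units |= kids
    return units


def successors(children, unit):
    """The unit itself plus everything descended from it."""
    found, stack = set(), [unit]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(children.get(current, ()))
    return found


def predecessors(children, unit):
    """The unit itself plus everything it is descended from."""
    return {u for u in all_units(children) if unit in successors(children, u)}


def node_based(children, specifiers):
    """Clade of the last common ancestor(s) of all specifiers."""
    common = set.intersection(*(predecessors(children, s) for s in specifiers))
    # Keep only the latest common ancestors: those with no other common
    # ancestor among their descendants.
    latest = {a for a in common if successors(children, a) & common == {a}}
    return set().union(*(successors(children, a) for a in latest))


def branch_based(children, internal, external):
    """Clade of everything sharing more recent ancestry with the internal
    specifier than with the external specifier."""
    stem = predecessors(children, internal) - predecessors(children, external)
    return set().union(*(successors(children, a) for a in stem))


aves = node_based(CHILDREN, {"Struthio camelus", "Tetrao major", "Vultur gryphus"})
saurischia = branch_based(
    CHILDREN, "Megalosaurus bucklandii", "Iguanodon bernissartensis"
)
print(sorted(aves))
print(sorted(saurischia))
```

Both functions work on any directed, acyclic graph, so a node-based clade may originate with more than one "last" common ancestor when the graph is not a tree.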

02 April 2010

Names on Nodes: MathML Definitions (Version 1.2)

After the epiphany that Names on Nodes did not have to be associated with a database, I set to work creating a "standalone" version of the application. Progress has been pretty good, and if you are interested in the details (or collaborating), you can check the project out at its new home on Bitbucket (which also houses the related project, ASMathema).

I've just updated the Names on Nodes website based on these revisions to the project, most notably the MathML Definitions document. Most of the changes have actually been removals: no more mentions of rank-based taxonomy (which may be covered in future versions but not in this one), qualified names as taxonomic identifiers (no longer a necessary feature), etc. So if you didn't read it before because it was too long and dense ... well, it's still pretty long and dense, actually. But less so!

I've also added an example MathML document as a supplement. This document:

Defines a phylogenetic context (the same one used in the MathML Definitions examples), arranging taxonomic units as vertices in a directed, acyclic graph.
Defines sets based on characters ("wings used for powered flight" and "extant").
Refers a specimen (YPM-VP 1450) to a taxonomic unit (Ichthyornis).
Equates several species names as synonyms.
Defines some hybrid formulas as referring to specific taxonomic units.
Defines a number of clade names.

This file can be opened with Names on Nodes: Standalone Version, which I am currently developing and hope to release this year.

05 March 2010

Names on Nodes: Cutting Out the Fat

While pondering the headaches of homonymy recently, I started to ask myself, What am I doing with my life? Why am I worrying about this?

Seriously, though. I've been working on Names on Nodes on and off for about three years, and I still haven't launched it. And it's because I've been so focused on getting things like this right. (Well, that and having a day job.)

But things like this aren't part of the core functionality. The core functionality is the automated evaluation of phylogenetic definitions (encoded as MathML) within the context of phylogenetic hypotheses (modeled as directed, acyclic graphs). That part of it's been done for quite a while. So why am I wasting time on the rest?

By cutting out the entire database portion of the project, I could actually have something launched this year. Sure, it'd be nice to have a repository of taxonomic names, definitions, authorities, etc. But it's not necessary. It's phase II, not phase I. In fact, by the time I'm ready for a phase II, there will almost certainly be other services out there that already perform those things.

So, from here on out, I'm going to be focusing on getting a lean, mean version of Names on Nodes up. Here's a quick summary of what you'll be able to do with it:

Open bioinformatics files (NEXUS to start with, other formats like NexML going forward).
View the phylogenies in a pretty graphical interface.
Merge phylogenies.
Tweak phylogenies (adding or removing parent-child relations, adding or removing taxonomic units, equating taxonomic units, etc.).
Formulate phylogenetic definitions using a spiffy interface.
Apply these definitions to the phylogeny.
Save your work as MathML files.

And that should be manageable (although there is still much work to be done, especially for the user interface). Once that's launched and working, then I'll look into connecting to other services and/or launching an associated database.

01 March 2010

The Great PhyloCode Land Run

Sometime in the near future, the PhyloCode will be enacted. For this to happen, two things need to occur concurrently:

1. The registration database (called "RegNum") must be completed and opened to the public. This is necessary because the PhyloCode requires all names to be registered electronically.

2. Phylonyms: a Companion to the PhyloCode must be published. This is a multi-authored volume that will include the earliest definitions under the PhyloCode.

Which names will be defined in Phylonyms? The original goal was to cover the most historically important names (what Alain Dubois calls "sozonyms"). However, proponents of phylogenetic nomenclature tend to be clustered in several fields (most notably vascular plant botany and vertebrate zoology—note that the code's authorship reflects this). This means certain parts of the Tree of Life (e.g., entomology) will unfortunately be underrepresented, due to lack of interest in those fields. (The alternative, having non-specialists define such names in Phylonyms, does not bear consideration.) So Phylonyms will be less about providing coverage and more about providing sturdy, well-reasoned definitions that can serve as examples.

What about all the names that it omits? What will happen to those once the PhyloCode is enacted? That will be interesting to see.

One thing I could envision is a sort of "land run". I picture it working this way. Let's consider a field, say, anthropology, where phylogenetic nomenclature has not taken much of a hold. Currently there is debate about how to use some taxonomic names related to the field. Some workers like to use the familial name "Hominidae" to refer to a large taxon, including humans and great apes. Others prefer to restrict it to the human total clade (i.e., humans and everything closer to them than to other extant taxa). Similarly, some workers use the generic name "Homo" in a broad sense to include short, small-brained species like Homo habilis, while others prefer to restrict it to the tall, large-brained clade (relegating H. habilis to another genus, e.g., Australopithecus).

Let's say there's a researcher out there named Dr. Statler, who prefers a strict usage for "Hominidae" and a broad use for "Homo". But his colleague, Dr. Waldorf, prefers a broad usage for "Hominidae". Dr. Waldorf isn't really that interested in phylogenetic nomenclature, but when he notes that "Hominidae" is not in the registration database, he sees an opportunity. He writes a quick paper defining "Hominidae" as a node-based clade: "The clade originating with the last common ancestor of humans (Homo sapiens Linnaeus 1758), Bornean orangutans (Pongo pygmaeus Linnaeus 1760), common chimpanzees (Pan troglodytes Oken 1816, originally Simia troglodytes Blumenbach 1775), and western gorillas (Gorilla gorilla Geoffroy 1852, originally Troglodytes gorilla Savage 1847)."

Dr. Statler is, of course, outraged. Not that he cares that much about phylogenetic nomenclature, but what if anthropologists do start using it? What if someone ruins another taxonomic name? His colleagues Drs. Honeydew and Beaker prefer a strict definition of "Homo"—what if they author a paper cementing that definition under the PhyloCode?

This cannot come to pass! Dr. Statler does some reading on the code and decides that a branch-based definition would work nicely for his broader usage. He defines "Homo" as, "The clade consisting of Homo sapiens Linnaeus 1758 and all organisms that share a more recent common ancestor with H. sapiens than with Australopithecus africanus Dart 1925, Paranthropus robustus Broom 1938, Zinjanthropus boisei Leakey 1959, or Australopithecus afarensis Johanson & White 1978." This sets off another anthropologist, and soon all sorts of anthropological/primatological names are being defined under the PhyloCode, as workers struggle to assert their usages.

This is not an ideal situation. It would be much nicer if a group of anthropologists were to come together, discuss the matters rationally, and arrive at an agreement which they then publish together. But it's still not a horrible situation—at least people are defining phylogenetic names and at least interest in phylogenetic nomenclature is being spread. I can't predict the future, but I feel like this sort of "land run" is bound to occur at least in some fields—and maybe that's okay.

27 February 2010

Defining Rank-Based Taxa Mathematically

Let U be the set of all individuals.

Let ranks be represented by a contiguous series of natural numbers (ℕ). Let 1 represent the lowest (finest) rank and let some natural number n represent the highest (coarsest) rank.

Let T be a sequence of n sets of type individuals (i.e., individuals represented by type specimens). Let each set in the sequence (other than the last set) be a superset of the next set, i.e., T₁ ⊇ T₂ ⊇ … ⊇ T_n.

Let d be a metric function measuring some distance between any two individuals: d(x, y) ∈ ℝ₀⁺ (the set of nonnegative real numbers). Note that, because it is a metric, d(x, x) = 0 and d(x, y) = d(y, x).

For each rank level r, let p_r be a function mapping each member, t, of T_r to a taxon (set of individuals): p_r(t) := {x ∈ U | for all s ∈ T_r, d(x, t) ≤ d(x, s)}. Let P_r be the image of p_r. Then P_r is the taxonomy of rank level r.

Note that some individuals may be placed in multiple taxa of the same rank if they are equidistant from their nearest type individuals. These individuals may be considered unclassifiable for that rank. Let U′ be the set of all individuals except for those which are unclassifiable for some rank. Similarly, let P′_r be P_r but with all unclassifiable individuals removed from each member taxon. P′_r is a partition of U′. For any two rank levels q and r, if q < r, then P′_q is intended to be a refinement of (or equal to) P′_r, though this does not follow from T_q ⊇ T_r alone: two individuals that share a nearest type at rank q need not share one at rank r.
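
As a rough illustration of a single rank level, here is a Python sketch of p_r and of the unclassifiable individuals. Everything concrete in it is an assumption made for the example: individuals are stood in for by integers on a line, d is the absolute difference, and the type set {2, 6, 9} is arbitrary. The taxonomy is stored as a mapping from each type to its taxon, whose values correspond to P_r.

```python
def taxon(universe, types, distance, t):
    """p_r(t): every individual at least as close to t as to any type."""
    return frozenset(
        x for x in universe
        if all(distance(x, t) <= distance(x, s) for s in types)
    )


def taxonomy(universe, types, distance):
    """Map each type individual to its taxon; the values make up P_r."""
    return {t: taxon(universe, types, distance, t) for t in types}


def unclassifiable(taxa):
    """Individuals that fall into more than one taxon of this rank."""
    seen, repeated = set(), set()
    for members in taxa.values():
        repeated |= seen & members
        seen |= members
    return repeated


def d(x, y):
    """A stand-in metric: absolute difference on the number line."""
    return abs(x - y)


U = set(range(11))   # hypothetical individuals 0..10
T_r = {2, 6, 9}      # hypothetical type individuals for one rank

P_r = taxonomy(U, T_r, d)
for t in sorted(P_r):
    print(t, sorted(P_r[t]))
print(sorted(unclassifiable(P_r)))  # [4]
```

Individual 4 is equidistant from types 2 and 6, so it lands in both taxa and is the one unclassifiable individual at this rank.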

25 February 2010

Tricksy Definitions Expressed Mathematically

Just for fun, here are a few definitions of nonstandard type to go along with those in the previous post. As any practitioner of phylogenetic nomenclature knows, most definitions are node-, branch-, or apomorphy-based, but there have been a few that don't fall into these categories.

Here are Wagner's (2004) definitions of Panbiota and Biota:

   Panbiota := (Clade ∘ prc_∩)(Homo sapiens).

   Biota := Crown(Panbiota, "extant as of or after 2004").

This is one of the few cases where it makes more sense to define the crown clade based on the total clade rather than vice versa. (Maybe the only case? Not sure.) Technically, Wagner's wording for the definition of Panbiota might be better translated as (suc_∪ ∘ min ∘ prc_∩)(Homo sapiens), but it works out to the same thing.
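
The two translations of Panbiota can be compared on a toy graph with a few lines of Python. This is a sketch under stated assumptions, not the definitions from the MathML Definitions document. It assumes that prc_∩ returns the common predecessors of a set, that suc_∪ returns the union of successors, that min and max pick the earliest and latest members of a set, and that Clade is suc_∪ ∘ max ∘ prc_∩. The graph and its "ancestor N" labels are invented.

```python
from functools import reduce

# Toy phylogeny: unit -> immediate descendants. Invented for illustration.
CHILDREN = {
    "ancestor 0": {"Escherichia coli", "ancestor 1"},
    "ancestor 1": {"Saccharomyces cerevisiae", "Homo sapiens"},
}
UNITS = set(CHILDREN).union(*CHILDREN.values())


def successors(unit):
    """The unit itself plus everything descended from it."""
    found, stack = set(), [unit]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(CHILDREN.get(current, ()))
    return found


def predecessors(unit):
    """The unit itself plus everything it is descended from."""
    return {u for u in UNITS if unit in successors(u)}


def prc_intersection(units):
    """Assumed prc_∩: predecessors shared by every member of a nonempty set."""
    return set.intersection(*(predecessors(u) for u in units))


def suc_union(units):
    """Assumed suc_∪: union of the successors of every member."""
    return set().union(*(successors(u) for u in units))


def minimal(units):
    """Assumed min: members with no other member among their predecessors."""
    return {u for u in units if predecessors(u) & units == {u}}


def maximal(units):
    """Assumed max: members with no other member among their successors."""
    return {u for u in units if successors(u) & units == {u}}


def compose(*functions):
    """compose(f, g)(x) == f(g(x)), matching the ∘ notation."""
    return lambda x: reduce(lambda value, f: f(value), reversed(functions), x)


# Assumed reading of Clade as a node-based clade.
clade = compose(suc_union, maximal, prc_intersection)

panbiota = compose(clade, prc_intersection)({"Homo sapiens"})
alternative = compose(suc_union, minimal, prc_intersection)({"Homo sapiens"})
print(panbiota == alternative)  # True
```

In this toy graph both compositions return every unit, since the earliest ancestor of Homo sapiens is also the root of the whole graph.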

And here's Clarke's (2004) definition of Ichthyornis:

   Let M := "apomorphy 2" ∩ "apomorphy 5" ∩ "apomorphy 6" ∩ "apomorphy 7" ∩ "apomorphy 8".
   (These refer to apomorphies in Clarke's Ichthyornis dispar Diagnosis.)

   Ichthyornithes := Clade(YPM 1450 ← Struthio camelus ∪ Tinamus major ∪ Vultur gryphus).
   ("YPM" refers to the Yale Peabody Museum's Vertebrate Paleontology collection. YPM 1450 is the Ichthyornis dispar holotype specimen.)

   Ichthyornis := Clade((M @ YPM 1450) ∩ Ichthyornithes).