23 December 2010

What is a human?

Find the human! Pretty easy, right? RIGHT??
It is obvious what is "human" and what is not if we just look at living organisms. There's a clear gap between us and our closest living relatives, the chimpanzees. No danger of mistaking one for the other.

But this clarity vanishes as soon as we look at the fossil record. There's a gradient of forms between us and things that are not clearly closer to us or chimpanzees (Ardipithecus, Orrorin, Sahelanthropus). Which ones are "human" and which are not? Is Praeanthropus afarensis human? What about Homo habilis? Homo ergaster? Neandertals? Homo sapiens idaltu?
Find the human! Or is there more than one?
Or are they all human?

This issue crops up for all kinds of taxa. Much time has been spent arguing what is and is not e.g., avian, or mammalian. The issue is more common within vertebrates than many other taxa, since vertebrates have an especially good and well-studied fossil record. But it applies, in theory or practice, to every extant taxon.

I subscribe to the school of thought that names born from neontology (the study of extant organisms) are best restricted to the crown group (that is, to the living forms, their final common ancestor, and all descendants of that ancestor). Arguments for restricting common names to crown groups were first laid out by de Queiroz and Gauthier (1992). The primary reason for doing this is that it prevents unjustified inferences about stem groups (that is, the extinct taxa which are not part of the crown group, but are closer to it than to anything else extant). For example, we currently have no way of knowing whether the statement, "Within all mammalian species, mothers produce milk," is true if we include things like Docodon as mammals (or, as a few have done, even earlier things like Dimetrodon). However, if we restrict Mammalia to the last common ancestor of monotremes and therians (marsupials and placentals) and all descendants of that ancestor, then the statement unambiguously holds.

This system also gives us a very easy way to refer to any stem group: just add the prefix "stem-". Some examples:
  • stem-avians: Pterodactylus, Iguanodon, Diplodocus, Eoraptor, Coelophysis, Tyrannosaurus, Oviraptor, Velociraptor, Archaeopteryx, Ichthyornis
  • stem-mammals: Casea, Dimetrodon, Moschops, Cynognathus, Docodon
  • stem-whales: Indohyus, Ambulocetus, Pakicetus, Basilosaurus, Dorudon
  • stem-humans: Ardipithecus(?), Praeanthropus, Australopithecus, Homo habilis, Homo ergaster
This is a nice, neat system. However, for humans, it gets a little sloppy the closer we get to the crown group.

For a long time, there was a debate in paleoanthropology as to how our species originated. We are distributed across the globe, so it's not immediately obvious where we are from. As the hominin fossil record gradually came to light during the 20th century, it became clearer that the earliest roots of the human total group were in Africa, since that's where the oldest remains are found. Everything before two million years ago is African, and only after that time period do we start to see remains in Eurasia, all of them belonging to the genus Homo. Remains in Australia and America don't occur until very late, and only modern humans appear in those regions.

But this leaves open the question of our own species' origin. Homo had spread all over the Old World by the time modern humans appeared, so we could have come from anywhere in Africa or Eurasia. Two major hypotheses were formed. The Out of Africa Hypothesis suggested that the ancestors of humans originated in Africa and then spread out over the globe, displacing all other populations of Homo: the Neandertals in West Eurasia, Peking Man in Asia, Java Man in Malaya, etc. The Multiregional Hypothesis, on the other hand, suggested that modern human races evolved more or less in their current areas: Negroids were descended from Rhodesian Man, Caucasoids from Neandertal Man, and Mongoloids from Peking Man.

These hypotheses competed with each other until the advent of genetic analysis. When scientists were finally able to study the mitochondrial genome, which is copied from mother to child, they found that all living humans shared a relatively recent matrilineal ancestor, much more recent than the splits between Rhodesian, Neandertal, and Peking fossils. Furthermore, the matrilineal family tree strongly points to an ancestor in Africa, where the most divergence is found. Study of the Y chromosome, which is copied from father to son, indicated an even more recent patrilineal ancestor, also African. The case seemed closed. Out of Africa had won.

The case seemed further bolstered when the Neandertal mitochondrial genome was recovered. It revealed a signature which clearly placed it outside the modern human group (Teschler-Nicola & al. 2006). Earlier this year, mitochondrial DNA was also retrieved from an indeterminate fossil from Denisova, Siberia, indicating that it represented a matrilineage even further out, preceding the human-Neandertal split (Krause & al. 2010).

This would give us a pretty nice, clean series of splits. And it would mean that Neandertals, Denisovans, etc. are stem-humans.

But there is more to ancestry than just the matrilineage and the patrilineage. Most of our ancestral lineages include members of both sexes (think of your mother's father and your father's mother). The matrilineage and patrilineage are the only ones that can be studied with clarity, since all other chromosomes undergo a shuffling process. But those other lineages exist nonetheless.

Only very recently has evidence come to light which challenges Out of Africa, at least in its strong form. Earlier this year, a study suggested that all humans except for Sub-Saharan Africans have inherited 1–4% of their DNA from Neandertal ancestors (Green & al. 2010). And just yesterday, a new analysis of Denisovan nuclear DNA showed that Melanesians have inherited 4–6% of their DNA from Denisovans (Reich & al. 2010). This nuclear DNA seems to originate from an ancestor close to the human-Neandertal split, but somewhat on the Neandertal side.

Long story short, the picture has gotten a lot more complicated. It's no longer, "Out of Africa, yes, Multiregional, no." Now it's, "Out of Africa, mostly; Multiregional, somewhat."

So what does this mean for the term "human"? Are Neandertals and Denisovans human? After all, they seem to be ancestral to some, but not all, modern human populations.

Well, they can only belong to the crown clade if they are the final common ancestor of all living humans, or descended from it. Neither of these criteria appears to hold. So, for now, I would still say that they are not human, only very close to human. (Note that this does not mean that people descended, in part, from Neandertals and/or Denisovans are somehow "less human" than those with pure African ancestry. The African ancestors are also not humans but stem-humans under this usage. This usage is discrete; you're either human or you aren't.)

Still, at this level of resolution, we start to see a problem with the crown clade usage. What is the final common ancestor? Many would assume it to be the last-occurring common ancestor, but this is problematic, and not just because that ancestor probably lived within recorded history (making, e.g., the Sumerians inhuman!). When I say "final" I'm really referring to something a bit more complex: the maximal members of a predecessor union. (More discussion here.) But determining what that is, exactly, requires better datasets than we have.

I still think it's a good convention, and if its application is a bit vague, so be it; our knowledge is a bit vague. For now I would say that humans are a clade of large, gracile hominins with high-vaulted crania that emerged roughly 150,000 years ago in Africa, and then spread out. They are descended from not one but at least three major populations of stem-human. One of these, the African population (idaltu, helmei, etc.), forms the majority of the ancestry, up to 100% in some populations. The others, Neandertals and Denisovans, only form a small part of the ancestry of some humans.

I feel this convention is useful because it prevents unjustified inferences. For example, we know that all living human populations have languages with highly complex grammar. We really don't know whether Neandertals and Denisovans had such languages, or whether the immediate African predecessors of humans did, for that matter. So it's good to be able to categorize them as stem-humans, because it reminds us that we don't have as much data available on them as we do for the crown group. We have to be more clever in figuring these things out.

And if we ever cloned a Neandertal? Well, ask me again once that happens.

  • de Queiroz & Gauthier (1992). Phylogenetic taxonomy. Annual Review of Ecology and Systematics 23:449–480.
  • Green & al. (2010). A draft sequence of the Neandertal genome. Science 328:710–722. doi:10.1126/science.1188021
  • Krause & al. (2010). The complete mitochondrial DNA genome of an unknown hominin from southern Siberia. Nature 464(7290):894–897. doi:10.1038/nature08976
  • Reich & al. (2010). Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature 468:1053–1060. doi:10.1038/nature09710
  • Teschler-Nicola & al. (2006). No evidence of Neandertal mtDNA contribution to early modern humans. Pages 491–503 in Early Modern Humans at the Moravian Gate. Springer Vienna. doi:10.1007/978-3-211-49294-9_17

21 December 2010

The Purpose of Generic Names: Or, Everyone's a Homo

Vrſvs Lotor Linn.
It can be surprising for a modern-day biology student to look at 18th-century texts and see how broad the genera are. Consider Linnaeus: he named raccoons as a species of bear (Ursus lotor, the "washer bear", still called "tvättbjörn" ["washbear"] in Swedish). Nowadays raccoons aren't even placed in the same family as bears, and bears are split into anywhere from roughly three to seven extant genera. Or consider bats, nowadays comprising about 60 extant genera, while Linnaeus classified them as just one (Vespertilio). These aren't even the craziest examples.

Was Linnaeus nuts? Of course not. He saw that the task of creating a unique name for every species would be extremely difficult, so he decided it would be okay to reuse the same names if they were prefaced by the name of a more general category—a genus. Thus, for example, he was able to call the house mouse Mus musculus and the blue whale Balaena musculus (now either Balaenoptera or Sibbaldus). Although they have the same specific epithet, those epithets are unique within their general category. As long as homonymy is avoided, there's no nomenclatural need to restrict genera, so why not make them broad?

In this way, genera function as what we in the computer science world call namespaces. Different things are allowed to have the same local name as long as they are within different namespaces. The qualified name, which combines a namespace identifier with the local name, is globally unique, even if the local name is not.

Biological Nomenclature
generic name + specific epithet = species name

Computer Science
namespace identifier + local name = qualified name
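
To make the analogy concrete, here is a minimal Python sketch. The data structure and the qualified_name helper are illustrative assumptions, not any real nomenclatural database; the species are the ones mentioned above. It shows that a local name (the epithet) can repeat while the qualified name stays unique.

```python
from collections import Counter

# Hypothetical illustration: each species is a (namespace, local name) pair.
species = [
    ("Mus", "musculus"),      # house mouse
    ("Balaena", "musculus"),  # blue whale, under Linnaeus' original genus
    ("Homo", "sapiens"),
    ("Ursus", "lotor"),       # raccoon, as Linnaeus named it
]

def qualified_name(genus, epithet):
    """Combine the namespace identifier (genus) with the local name (epithet)."""
    return f"{genus} {epithet}"

# Local names may repeat across namespaces...
epithet_counts = Counter(epithet for _, epithet in species)
repeated_epithets = sorted(e for e, n in epithet_counts.items() if n > 1)

# ...but every qualified name is distinct.
qualified = [qualified_name(g, e) for g, e in species]
all_unique = len(set(qualified)) == len(qualified)

print(repeated_epithets)  # ['musculus']
print(all_unique)         # True
```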

Recently, this got me to thinking—why have we restricted our genera so much, when this was their original purpose? If we just want to be sure that each species has a globally unique name, we could have much larger genera—in some cases, even larger than Linnaeus'.

Consider our own genus, Homo. It has exactly one extant species, Homo sapiens, and that species has an epithet that is already, as far as I know, globally unique. How is that useful? Even Linnaeus thought this was rather silly—he would have included humans in his anthropoid genus, Simia, except he feared backlash. (Even including us in the same order as other primates was controversial at the time.)

How far could we extend our genus and retain its usefulness as a namespace? How far out can we go without having duplicate local names? Within Homo we already have local names like sapiens, erectus, habilis, etc. We actually do have at least one duplicate name, Homo capensis, but it's universally considered a junior synonym. (Although this case is a bit complicated.) If we only consider valid, non-synonymous names, how far can we push our genus?

If we include all stem-humans there's no problem. We add things like Homo afarensis and Homo robustus.

Left to right: Homo sapiens, gorilla, troglodytes, lar, & pygmaeus.
If we push it out to the crown clade of African apes there's still no problem. We get things like Homo gorilla (western gorillas) and Homo troglodytes (common chimpanzees). (Admittedly Homo troglodytes was already named by Linnaeus, but it's a nomen oblitum without any specimens or certainty as to what, exactly, it was supposed to indicate.)

Still no problem if we push it out to the great ape crown clade, adding things like Homo pygmaeus (Bornean orangutans) and Homo indicus (otherwise Sivapithecus).

Pushing it out to the ape crown clade still works, as we add things like Homo lar (lar gibbons) and Homo syndactylus (siamangs). (Interestingly enough, Homo lar is Linnaeus' original name for the species. Only later was it given a new genus, Hylobates, by Illiger, where it resides to this day. I'm not quite sure why Linnaeus classified it this way, but my guess is he wasn't that familiar with the animal in question, as was often the case.)

But if we go beyond that, we hit a duplicate: Homo africanus (Hopwood 1933, originally Proconsul) and Homo africanus (Dart 1925, originally Australopithecus). (We actually already hit Meganthropus africanus Weinert, 1950 a while ago, but that's universally considered a synonym.)

So there we go, Homo could be used as the generic name for all crown-group apes without any problem. (I'm willing to bet I missed something, though, and I'm looking forward to some commenter correcting me.) We have restricted our genera far more than they need to be restricted in order for species names to be unique.
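
The search for duplicates described above can be sketched in a few lines of Python. The find_homonyms function and the species list are assumptions made for illustration: the list is a tiny hand-picked sample of the names mentioned in this post, so an empty result here says nothing about the full set of valid names.

```python
from collections import defaultdict

def find_homonyms(names):
    """Return {epithet: [genera]} for epithets that would collide
    if every species in `names` were moved into a single genus."""
    genera_by_epithet = defaultdict(list)
    for genus, epithet in names:
        genera_by_epithet[epithet].append(genus)
    return {e: g for e, g in genera_by_epithet.items() if len(g) > 1}

# A small sample of crown-group apes, as (current genus, epithet).
crown_apes = [
    ("Homo", "sapiens"),
    ("Australopithecus", "africanus"),
    ("Gorilla", "gorilla"),
    ("Pan", "troglodytes"),
    ("Pongo", "pygmaeus"),
    ("Hylobates", "lar"),
    ("Symphalangus", "syndactylus"),
]

# Going one step beyond the ape crown clade adds Proconsul.
beyond_crown = crown_apes + [("Proconsul", "africanus")]

print(find_homonyms(crown_apes))    # {}
print(find_homonyms(beyond_crown))  # {'africanus': ['Australopithecus', 'Proconsul']}
```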

Are we nuts? Of course not. A genus is much more than just a namespace. We also use genera on their own, as groups in their own right. Expanding Homo to embrace all extant apes would ruin its utility as a name for a certain subclade of the human total group, and make it redundant with a name that we already have (Hominoidea). (In computer science namespaces are also often narrower than they technically could be.)

Then again, maybe it is a bit nuts to have one thing performing two jobs. Why not allow other, larger taxa to be used as namespaces? Well, under the PhyloCode, that will actually be a possibility. We can refer to Hominoidea syndactylus and Hominoidea sapiens if we want. In fact, I think you could use Synapsida syndactylus (avoiding homonymy with Bleda syndactylus, the red-tailed bristlebill) and Biota sapiens (this epithet being globally unique already, as previously mentioned). These particular examples will probably never be popular and I wouldn't use them myself, but I think it's neat that the possibility exists.

14 December 2010

pymathema, a Python tool for evaluating MathML

Lately I've been learning the programming language Python, and I've really been enjoying it. In particular, as a dynamic language (i.e., having loose types), it's really well-suited for mathematical tools. (Having sets and tuples as native types doesn't hurt, either.)

I started creating a MathML-Content evaluator in Python, with an extension for Names on Nodes which implements phyloreferencing expressions. As part of this I am working on Version 2.0 of the Names on Nodes MathML Definitions, which will expand upon the current ones.

Basic functionality is pretty much complete, although there are some niceties to add. If you'd like to check it out and maybe collaborate, have a look here: PYMATHEMA.
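
For readers unfamiliar with MathML-Content, here is a rough sketch of what evaluating it involves. This is not pymathema's code or API: the evaluate function and the OPERATORS table are hypothetical, it only handles integers, sets, and four operators, and it simply strips any XML namespace from the tag names.

```python
import operator
import xml.etree.ElementTree as ET
from functools import reduce

# A few MathML-Content operator elements mapped to Python functions.
OPERATORS = {
    "plus": lambda args: sum(args),
    "times": lambda args: reduce(operator.mul, args, 1),
    "union": lambda args: frozenset().union(*args),
    "intersect": lambda args: reduce(frozenset.intersection, args),
}

def evaluate(element):
    """Recursively evaluate a (very small) subset of MathML-Content."""
    tag = element.tag.rsplit("}", 1)[-1]  # drop a namespace prefix, if any
    if tag == "math":
        return evaluate(element[0])
    if tag == "cn":
        return int(element.text.strip())  # integers only in this sketch
    if tag == "set":
        return frozenset(evaluate(child) for child in element)
    if tag == "apply":
        operator_element, *operands = list(element)
        name = operator_element.tag.rsplit("}", 1)[-1]
        return OPERATORS[name]([evaluate(o) for o in operands])
    raise ValueError(f"Unsupported element: {tag}")

arithmetic = (
    "<math><apply><plus/><cn>2</cn>"
    "<apply><times/><cn>3</cn><cn>4</cn></apply></apply></math>"
)
sets = (
    "<math><apply><intersect/>"
    "<set><cn>1</cn><cn>2</cn><cn>3</cn></set>"
    "<set><cn>2</cn><cn>3</cn><cn>4</cn></set></apply></math>"
)

print(evaluate(ET.fromstring(arithmetic)))    # 14
print(sorted(evaluate(ET.fromstring(sets))))  # [2, 3]
```

Python's native set types are what make the second example so short, which is the point made above about the language suiting this kind of tool.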

12 October 2010

What I Do For A Living: TRONiverse

We recently launched a Flash app for the upcoming film TRON: Legacy:

It pulls posts from Twitter, Facebook, and Flickr and displays them orbiting a 3D globe. You can click on the globe to find posts near that area. Post about TRON and your post might appear, too!

Also, we added a neat little Easter egg: click on the globe and then hold down the "M", "C", and "P" keys. (People who've seen the original film might have an idea what to expect....)

17 August 2010

An Example of Why We Need the PhyloCode

I just ran Radish on Zea mays (maize, a.k.a. corn). Look what the combined taxonomies from uBio look like:

What a mess! Keep in mind that the multiplicity of paths is not due to differing phylogenies (they all seem to agree on that), but to differing nomenclature. Even if uBio were to add some of the more obvious synonymies (e.g., Embryophyta and "Embryophytes"), it'd still be pretty wild.

Eventually I plan to have Radish work with automatically-generated taxonomies, made by placing PhyloCode names (from RegNum) onto TreeBase phylogenies using Names on Nodes algorithms, but until then I guess this is the best option.

And Zea mays is just a particularly egregious example. In contrast, here's a nice, neat "radish" for Scarabaeus sacer:

Apart from the one errant use of Animalia, pretty nice!

Radish: Pulling Up Taxonomic Hierarchies From Leaf to Root

I've just released a small open-source Java library called Radish. It contains tools for interfacing with uBio (and hopefully other services in the future), with the purpose of looking up a taxon's hierarchy. It does this by synthesizing multiple taxonomies into a single graph. Here, for example, is the graph produced by looking up Homo sapiens Linnaeus 1758 (click the image to enlarge):

This is a synthesis of several taxonomies. Some have more ranks than others, but they all converge on a single chain, with the exception of the taxa above the "phylum level", where at least two of them use different schemes. (Although it may be tempting to synonymize Biota and "Cellular life", or Animalia and Metazoa, these are not objective synonymies, and arguably the former is a bad idea anyway.) Ideally the library would be able to figure out that "Cellular life" includes Biota (if the latter is a crown clade), and that Animalia either includes or is a synonym of Metazoa, but it can only work with the data it's given.
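
The basic idea of synthesizing several taxonomies into one graph can be sketched as follows. This is a Python illustration only, not Radish's Java API; the function names are hypothetical and the two hierarchies are abbreviated, made-up stand-ins rather than actual uBio output.

```python
def merge_hierarchies(hierarchies):
    """Merge leaf-to-root chains into a single set of (child, parent) edges."""
    edges = set()
    for chain in hierarchies:
        # Pair each taxon with the next one up in the same chain.
        edges.update(zip(chain, chain[1:]))
    return edges

def parents_of(edges, taxon):
    """All immediate supertaxa that any source gives for `taxon`."""
    return sorted(parent for child, parent in edges if child == taxon)

# Two hypothetical sources that agree up to the "phylum level" and then diverge.
source_a = ["Homo sapiens", "Homo", "Hominidae", "Primates",
            "Mammalia", "Chordata", "Animalia", "Biota"]
source_b = ["Homo sapiens", "Homo", "Hominidae", "Primates",
            "Mammalia", "Chordata", "Metazoa", "Cellular life"]

graph = merge_hierarchies([source_a, source_b])

print(len(graph))                     # 9
print(parents_of(graph, "Homo"))      # ['Hominidae']
print(parents_of(graph, "Chordata"))  # ['Animalia', 'Metazoa']
```

Because the edges are stored in a set, links that every source agrees on collapse into one, and disagreements show up as taxa with more than one parent.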

Why the name "Radish"? From the project's wiki:
The English word "radish" is derived from Latin "radicem" (nominative: "radix"), meaning "root". (This is also where the word "radical" comes from.) The idea of the Radish library is that by "grabbing a leaf" (i.e., selecting a smaller taxon), you can "pull up the root" (i.e., extract that taxon's hierarchy of supertaxa, up to the root of all life).
Radish is part of a larger project I am working on ... I should have more to say about that in the near future....

02 August 2010

Phun Phylogenies

Pete Buchholz recently started compiling a phylogeny of edible plants, based on the APG III system. I ran it through Names on Nodes and produced a diagram:

A Phylogeny of Edible Plants

(Unfortunately, this version strips out the clade labels—I'll try and rectify that at some point.)

When I saw this, I thought, what a fantastic way to learn plant phylogeny! It's something I don't know much about (apart from basics, like the difference between gymnosperms and angiosperms), and so I found it fascinating to see the ways the foods I eat are related to each other.

It's such a good idea, I couldn't resist doing another version for edible animals and fungi:

And this reminded me of another project I'd been meaning to start for a while, so I finally took a stab at it. A phylogeny of cartoon animals!

(That's right, there's a stuffed tiger clade.)

26 July 2010

3D Visualization of the Fossil Distribution of the Human-Chimpanzee Total Clade

What it says.
Click on the image to open the visualization.
I've been compiling data on "pan-mangani" fossils. This is my first March of Man toolshop post in a while: a 3D visualization of that data, where the horizontal axis is longitude, the vertical axis is latitude, and depth (the z-axis) is age. The "blobs" each represent a fossilized individual, and you can mouse over them to see what their taxon is.

Some data is missing, notably a lot of entries for our own species. Other data needs to be refined—some of the better-known species (ahem, Neandertals) are big clouds that need to be tied down to specific sites. Also, I obviously need to do more work on that present-day distribution map. But it's a decent start.

Fun things to do:

  • See if you can find the oldest individual (the lone specimen of Sahelanthropus tchadensis).
  • Try to find its Chadian compatriots.
  • Find the earliest non-African individuals (hint: East Europe and the Malayan Archipelago).
  • Wonder what the heck that thing in India is.
  • Look for the single cluster of extinct chimpanzees (Pan sp.).
  • Find the three subspecies of Homo sapiens other than our own. (Note: these may not be distinct from each other—I just prefer to err on the side of splitting for projects like these. Easier to revise later.)
  • Marvel at how easy it is to become sympathetic to multiregionalism when you just view the distribution data without any morphological context and ignore the fact that not all regions are good for preservation.
  • Wonder how people can possibly believe in baraminology in the face of such ample evidence. (Adding morphological data to this would help a lot—there really aren't any good "cutoff" points for our lineage.)

Better version here.

06 July 2010

A Plea to Providers of Open Webservices

Let's suppose you have a webservice, and you have decided, out of the goodness of your heart, to make it "open". Anyone can browse it, search it, and utilize the data it provides. This is something you want: for anyone to be able to utilize the data and present it in new, creative ways.

I'm willing to bet you missed a step.

Consider my Flash application, Names on Nodes. Right now it's a standalone program—it can import and export files locally from your system. But it doesn't read data from any webservice. There are many places in the program where it would be extremely useful to do so. For example, instead of having to enter LSIDs manually, it could search for them in uBio. Instead of having to open NexML files locally, it could pull them from TreeBase. Wouldn't that be nice? Why haven't I done that?

Well, fact is, I can't.

The Flash Player has a security mechanism whereby it will not load data across different domains. If my SWF resides on namesonnodes.org, then that is the only place it can load data from. Of course, there are ways around this: a good way, a less good way, and a really stupid way.

The Good Way

If your domain has a file called crossdomain.xml in its root, then the Flash Player will read that to see if cross-domain permission has been allowed. Here's an example of a crossdomain.xml file that allows maximum access across all domains:

<?xml version="1.0"?>
<!DOCTYPE cross-domain-policy SYSTEM "http://www.adobe.com/xml/dtds/cross-domain-policy.dtd">
<cross-domain-policy>
   <site-control permitted-cross-domain-policies="all"/>
   <allow-access-from domain="*" secure="false"/>
   <allow-http-request-headers-from domain="*" headers="*" secure="false"/>
</cross-domain-policy>

(Or, download it here.)

For more information, see the Cross-Domain Policy File Specification.
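
If you want to sanity-check a policy file before deploying it, something like the following Python sketch could help. The allows_all_domains function is a hypothetical helper, and it only looks for a wildcard allow-access-from rule inside a cross-domain-policy root; it does not reproduce everything the Flash Player checks.

```python
import xml.etree.ElementTree as ET

def allows_all_domains(policy_xml):
    """Return True if the policy has a wildcard allow-access-from rule."""
    root = ET.fromstring(policy_xml)
    if root.tag != "cross-domain-policy":
        return False  # without the root element the file is not a valid policy
    return any(
        rule.get("domain") == "*"
        for rule in root.findall("allow-access-from")
    )

open_policy = (
    '<cross-domain-policy>'
    '<site-control permitted-cross-domain-policies="all"/>'
    '<allow-access-from domain="*" secure="false"/>'
    '</cross-domain-policy>'
)
restricted_policy = (
    '<cross-domain-policy>'
    '<allow-access-from domain="namesonnodes.org"/>'
    '</cross-domain-policy>'
)

print(allows_all_domains(open_policy))        # True
print(allows_all_domains(restricted_policy))  # False
```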

(Note that the folks at TreeBase are looking into this. I'm on tenterhooks.)

The Less Good Way

I could also make Names on Nodes as an AIR application. This is less good because it means the user would have to install it as an application on their local system, and they couldn't just access it online. But it would allow for more possibilities—not just the ability to load data from anywhere regardless of policy files, but also drag-and-drop from other applications, etc. I am actually planning to create an AIR version someday, but it's not my top priority right now.

The Really Stupid Way

Of course, I could also make a little server-side component on my domain and simply have it pull data from your domain before feeding it to the Flash application. This is really stupid because it greatly increases traffic on my webserver while still causing just as much traffic on your webserver as it would if you had just allowed cross-domain access with a policy file.

In Summary

It takes a matter of seconds to place a cross-domain policy file in your server's root directory (assuming you have access), there are no drawbacks if your service is intended to be open anyway, and there are huge benefits for us Flash developers who want to use your service. I mean, really, why not?

02 July 2010

Wrapping Up In Portland

Yesterday was the last day of the iEvoBio Conference, and tonight is my last night in Portland. The conference was quite illuminating and I would definitely like to return for another one.

I've had some time tonight to address a few issues I would have liked to tend to before the conference. For one, I hadn't updated the Names on Nodes website to link to the application! This has been remedied. In the process, I removed links to the NEXUS demo application, since it's a bit old and out of sync with the rest by now. I'll probably add it back after adding support for the NEXUS format to Names on Nodes.

I also made a few simple additions to Names on Nodes, including:

  • Fullscreen mode
  • Ability to specify which characters in a matrix you wish to import
The latter in particular should prove useful; some people were finding it impossible to open NexML files with character matrices. Since Names on Nodes only uses characters as part of apomorphy-based definitions, it's usually a big waste of time and memory to import entire matrices when only a few nomenclaturally-relevant characters (if any) are needed.

Once again I'd like to encourage people to check out the application and make feature requests and report issues on the BitBucket page.

26 June 2010

Names on Nodes is finally online.

A month ago I got notice that my abstract had been accepted, and that I would be demoing Names on Nodes at the iEvoBio conference's Software Bazaar on June 29. This is the first time Names on Nodes has ever truly had a hard deadline. Since it's a personal project, until now I have had the luxury of languidly rebuilding and polishing and rebuilding and polishing. But now I have to get something up. So it's up.

There's still a lot left to do, but this will have to do for now. You can load NexML and Newick files (not NEXUS for now, sorry, although, really, you should be sorry for still using it when NexML is available). You can save as MathML and export PNG image files. You can create phylogenies and phylogenetic definitions on the fly using a visual interface that emphasizes drag-and-drop. Or you can type them in (as Newick and MathML, respectively), should you prefer that.

There are still a lot of bugs, and a lot of unimplemented features. If you come across issues or if you have feature requests, please feel free to submit an issue. And if you want to look at the code, it's open source (MIT license) and available on BitBucket.

28 May 2010

gautengensis in the sediba phylogeny

Here's the phylogeny/taxonomy from the Australopithecus sediba paper overlaid with the taxonomy from the Homo gautengensis paper:

(click to enlarge)

I've highlighted the taxonomic units that Curnoe referred to Homo gautengensis. Note that, by Berger & al.'s phylogeny, Homo gautengensis is polyphyletic. Each of those units represents a single specimen, so this could potentially be explained by individual variation, age differences, sexual dimorphism, etc. Or the new species is overextended; I'm not really qualified to judge.

Note also that Homo is polyphyletic in this phylogeny. One way to fix this is to move sediba into Homo.

A Homo gautengensis by any other name...

A new species of our genus, Homo, was recently published:

Curnoe (2010). A review of early Homo in southern Africa focusing on cranial, mandibular and dental remains, with the description of a new species (Homo gautengensis sp. nov.). HOMO Journal of Comparative Human Biology (online early). doi:10.1016/j.jchb.2010.04.002

Here's the abstract:
The southern African sample of early Homo is playing an increasingly important role in understanding the origins, diversity and adaptations of the human genus. Yet, the affinities and classification of these remains continue to be in a state of flux. The southern African sample derives from five karstic palaeocave localities and represents more than one-third of the total African sample for this group; sampling an even wider range of anatomical regions than the eastern African collection. Morphological and phenetic comparisons of southern African specimens covering dental, mandibular and cranial remains demonstrate this sample to contain a species distinct from known early Homo taxa. The new species Homo gautengensis sp. nov. is described herein: type specimen Stw 53; Paratypes SE 255, SE 1508, Stw 19b/33, Stw 75–79, Stw 80, Stw 84, Stw 151, SK 15, SK 27, SK 45, SK 847, SKX 257/258, SKX 267/268, SKX 339, SKX 610, SKW 3114 and DNH 70. H. gautengensis is identified from fossils recovered at three palaeocave localities with current best ages spanning ~2.0 to 1.26–0.82 million years BP. Thus, H. gautengensis is probably the earliest recognised species in the human genus and its longevity is apparently well in excess of H. habilis.
The holotype, Stw 53, has previously been referred to either Homo habilis Leakey & al. 1964 or Australopithecus africanus Dart 1925. Interestingly, though, one of the paratypes, SK 15, is already the holotype of Telanthropus capensis Robinson 1949! So Homo gautengensis would appear to be a junior subjective synonym.

It gets even more interesting (for us nomenclature buffs, anyway): if Telanthropus capensis were to be transferred to Homo, then it would be a junior homonym of Homo capensis Broom 1918 (a.k.a. "Boskop Man"), which itself is a junior subjective synonym of Homo sapiens (the type specimen probably representing an early Khoisan individual).

I'm not sure if any of this is discussed by Curnoe, because I don't have access to the paper. If anyone has a PDF, feel free to e-mail to keesey [at] gmail [dot] com.

UPDATE: I have the paper now and will be looking it over.

ANOTHER UPDATE: Telanthropus is mentioned in passing, but the synonymy is not discussed.

27 May 2010

Upcoming Names on Nodes Presentation

I'll also be presenting Names on Nodes at iEvoBio, at the Software Bazaar on June 29. Here's the abstract:

Names on Nodes: Automating the Application of Taxonomic Names within a Phylogenetic Context

Names on Nodes[1] is an open-source[2] Flex application which utilizes a mathematical approach to automate the application of phylogenetically-defined names to phylogenetic hypotheses. Phylogenetic hypotheses are modeled as directed, acyclic graphs, and may be read from bioinformatics or graph files (Nexus, NexML, Newick, and GraphML) or created de novo. Hypotheses may also be merged from multiple sources. Names on Nodes stores hypotheses as MathML, an XML-based language for representing mathematical content and presentation. Phylogenetic definitions may be constructed using a visual editor and exported in MathML. Thus, it is possible to create a dictionary of defined names and automatically apply them to phylogenetic hypotheses. In the current version of the application, such dictionaries exist only as MathML files, but in future versions definitions may also be loaded from databases (e.g., RegNum).

Additional functionality in Names on Nodes includes the ability to coarsen a phylogenetic graph (thereby simplifying it while still reflecting the overall structure) or to export it as an image file (raster or vector, potentially with semantic annotations).

  1. Source code available at: http://bitbucket.org/keesey/namesonnodes-sa/
  2. MIT license

I have my work cut out for me....

26 May 2010

Upcoming Talk: Toward a Complete Phyloreferencing Language

I'll be giving a “Lightning Talk” (five minutes) at the iEvoBio Conference in Portland, Oregon. Here's the abstract:

Toward a Complete Phyloreferencing Language

A phyloreference is a statement indicating a taxon within a phylogenetic context. A common use for phyloreferences is in phylogenetic definitions, which tie taxonomic names to taxa via such statements. Several conventions for writing phyloreferences have been proposed, but most only cover a few “standard” forms (node-, branch-, and perhaps apomorphy-based clades) without the capacity to represent more “exotic” forms (e.g., ancestor-based clades and qualified/modified references). In order to build a complete phyloreferencing language, the mathematical underpinnings of phylogenetic contexts must be clarified. A phylogenetic context may be modeled as a directed, acyclic graph, wherein nodes model taxonomic units and directed edges model immediate descent. Higher taxa are modeled as unions of nodes. A phyloreferencing language must minimally allow for certain classes of entity: Boolean values, sets (including taxa, relations, and the empty set), and lists (including graphs and functions). It must also minimally allow for basic operations related to logic, set theory, and graph theory. Higher structures such as declarations and piecewise constructs must also be possible. With these as a basis, functions related to phylogeny can be defined: maximal, minimal, predecessor union/intersection, successor union/intersection, exclusive predecessors, synapomorphic predecessors, clade, crown clade, and total clade. I show how such a language may be used to represent various types of phyloreference, both “standard” and “exotic”.

Now to figure out how to condense that into a five-minute talk....
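
To make the graph model in the abstract a little more concrete, here is a minimal Python sketch (not the Names on Nodes implementation, which is a Flex application). It assumes a phylogeny stored as a plain dict from each node to its set of children, and it assumes that "maximal" picks out the members of a set that have no descendants within that set. The function names and the toy graph are hypothetical.

```python
def closure(edges, start):
    """Return start plus everything reachable from it along edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return seen


def invert(children):
    """Turn a parent -> children mapping into a child -> parents mapping."""
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)
    return parents


def predecessor_intersection(children, specifiers):
    """Nodes that are ancestors of (or identical to) every specifier."""
    parents = invert(children)
    return set.intersection(*(closure(parents, s) for s in specifiers))


def maximal(children, nodes):
    """Members of nodes with no proper descendant in nodes (assumed meaning)."""
    return {n for n in nodes if not ((closure(children, n) - {n}) & nodes)}


def node_based_clade(children, specifiers):
    """Last common ancestor(s) of the specifiers plus all their descendants."""
    clade = set()
    for ancestor in maximal(children, predecessor_intersection(children, specifiers)):
        clade |= closure(children, ancestor)
    return clade


# Hypothetical toy phylogeny: root -> A, n1; n1 -> B, C.
toy = {"root": {"A", "n1"}, "n1": {"B", "C"}}
small = node_based_clade(toy, ["B", "C"])  # n1 and its descendants
large = node_based_clade(toy, ["A", "B"])  # the whole toy graph
```

Because ancestors are gathered through every parent, the same functions work unchanged on graphs where a node has more than one parent.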

21 May 2010

Names on Nodes Issue Tracker

Yesterday I transferred the list of remaining Names on Nodes issues from my whiteboard to the bitbucket issue tracker. My goal is to get through most of these by the end of June. (Some "nice-to-haves", like DOT or HTML 5 exporting, may have to wait.)

Essential features left to implement, complete or fix:


  • Certain formats for import, especially NexML and NEXUS. (Currently only Newick can be imported. MathML files can be loaded as well.)
  • Certain formats for export, especially NexML. (Currently only PNG can be exported. MathML files can be saved as well.)
  • Ability to save just the definitions or just the phylogeny to a MathML file.
  • Ability to import definitions from a MathML file.
  • MathML tweaks. (Use csymbol instead of ci for taxa. Normalize presentation.)
  • Ability to write in Newick strings directly.
  • Skin various components (sliders, steppers, checkboxes, etc.).
  • Fix line breaks in MathML formulas.
  • Various scrollbar issues.
  • Special character issues.
  • Rich editor for taxon labels, including ability to edit taxon URIs.
  • Arc bisection tool.
  • Fix node merging (i.e., synonymization).
  • Add ability to select definition type when creating a name.
  • Node Pane Control Bar revisions. (Change Resolution Slider to a stepper. Add Zoom Slider.)
  • Definition Editor tweaks/fixes. (Some actions are blocked that should be possible. Textual Editor does not always update. Various layout issues.)
  • About/Help Panel.

12 May 2010

Why HTML 5 Canvas Will Not Be Replacing Flash That Soon

Previously I mentioned a tool, PhyloPainter, which uses the HTML 5 <canvas> element to draw a phylogenetic graph. Here's what it looks like on my iPhone:

Not only are the arrowheads missing (as they are on Safari on all platforms, not just the iPhone), but the labels have bizarrely been placed outside the canvas, flipped upside-down! The tool works fine on Firefox and Chrome. (Internet Explorer has not implemented <canvas> yet, and I haven't played enough with the interim solution, ExplorerCanvas, to get it working.)

I think the <canvas> element is a cool idea, and I'll continue to play with it. But it has a long way to go to compete with a cross-platform tool like Flash. HTML 5 may be "open"—but it also needs to "work".

06 May 2010

PhyloPainter: Happy Little Trees

The whole Flash/Apple fracas has been rather distasteful to me. But I'm not going to dwell on that right now. Instead, I am trying to keep an open mind by trying out some of the technologies that are competing with my favored development tools. First up: HTML 5.

I'll probably write more on the topic later, but suffice it to say for now that working with HTML 5 feels like I've traveled in time back to 2001, the days of ActionScript 1.0. JavaScript is a poor language for anything complicated. Canvas has covered the basics of vector drawing well, but little else. That said, I see potential and I'm pretty certain the tools will improve.

For my first HTML 5 app, I ported some basic functionality from Names on Nodes, namely, the ability to read Newick tree strings and the ability to draw graphs. I give you:

It's a bit rough right now. For one thing, it doesn't work in Internet Explorer (despite the inclusion of a workaround JavaScript tool—the current version of IE doesn't support HTML 5 Canvas). But it's a start.

Give it a try—paint some happy little trees!

09 April 2010

Biota: Another Example of Coarse vs. Fine Phylogenies

Yesterday I posted an example of how graph-coarsening algorithms can be used to make the high-level patterns of a phylogeny more immediately visible. That example used a phylogenetic hypothesis about placental mammals. The hypothesis involves a lot of nodes (i.e., taxonomic units), but not much branching complexity. By which I mean each node has only a single parent.

So here's an example where nodes may have multiple parents. This is a phylogeny of Biota, i.e., Life:

Eukaryotes (organisms with cellular nuclei, i.e., plants [Embryophyta, etc.], animals [Metazoa], fungi [Eumycota], and "protists") have been highlighted in yellow. Some nodes have multiple parents due to one of two phenomena:

  • Lateral transfer. Many organisms (especially bacteria) are capable of acquiring genetic material from unrelated organisms.
  • Endosymbiosis. Some organisms have evolved into organelles within the cells of other organisms, notably mitochondria (descended from proteobacteria related to those that cause typhus) and plastids (photosynthesizing organelles in plants, descended from cyanobacteria). In these cases, the organelle often retains its own DNA, although much of it may have leapt over to the "host's" nuclear DNA. In some cases, all of it may have leapt over (as with mitochondrial descendants like mitosomes).
Both lateral transfer and endosymbiosis are considered valid forms of descent in this hypothesis.

Here's the graph coarsened one step:

We can see the general patterns more clearly here. Eukaryotes share a relationship with archaeans, but also have descent from proteobacteria (via mitochondria). One clade of eukaryotes (Plastida) is also descended from a basal form of cyanobacteria (via plastids). A few cases of lateral transfer are visible, but not in detail. We can also see that there is a lot of bacterial diversity, although the details are not spelled out.

Here's the graph coarsened another step:

The endosymbiosis is made even clearer, although most other relationships are obscured.

Disclaimer: This hypothesis was cobbled together from a number of sources and does not represent any rigorous research on my part. I suspect parts of it are outdated, but this area of the Tree of Life is not my bailiwick. I just wanted to throw something together for a demonstration.

08 April 2010

Australopithecus sediba and its place among stem-humans

A new stem-human species, Australopithecus sediba Berger et al. 2010, has just been announced. The paper's supplementary information contains the results of a cladistic analysis of stem-humans. For fun, I thought I'd plug the most parsimonious tree into the in-development version of Names on Nodes:

Stw 53 and SK 847 are specimens that are not readily assignable to named species. (SK 847 might be Homo ergaster.) Our own species, sapiens, is presumably descended from the SK 847-erectus node.

The analysis finds sediba as a sister taxon to Homo (which includes habilis, rudolfensis, SK 847, and erectus), and possibly ancestral to it. Which raises the question: why not place it in Homo? If this hypothesis is correct, it shares more ancestry with the type species of Homo (sapiens) than it does with the type species of Australopithecus (africanus). Even Stw 53, which is here placed outside the sediba-Homo clade, has been attributed to Homo in the past.

Viewing Phylogenies at Different Graph Resolution

Although I've been primarily reining in features on the next version of Names on Nodes, there was a new feature I couldn't resist adding. I think it's coming along pretty well.

A common problem with working with phylogenies is that many of them are gigantic, far too big to view all at once. As an example, consider Figure 1 from Beck et al. (2006). It models a hypothesis about placental mammal phylogeny, at an arbitrary resolution ("family-level"). Here's how the current version of Names on Nodes renders it:

When you look at it "zoomed out", it's almost impossible to know what's going on. When you look at it full size, you can see various local areas, but you lose a sense of what's going on with the larger image. Note that I've highlighted our own species' twig on the tree (Hominidae, the great ape clade) in yellow.

Earlier I used the term "resolution" to refer to the size of the graph's nodes. We can refer to a graph with very small nodes (e.g., each node representing an individual organism) as being "fine" and a graph with very large nodes (e.g., "class-level") as being "coarse". Thinking about the problem from this angle, I had the idea to create a control for coarsening or refining the viewed graph.

I implemented a simple graph-coarsening algorithm*, and then created an algorithm for picking the best name for the new, coarser graph's nodes. And here is the phylogeny at near-maximum coarseness:

This is placental phylogeny boiled down to its basics: rodents, laurasiatheres, and a bunch of other junk (including us). The node labelled "Placentalia*" contains the placental ancestor but not all descendants—it lacks an unnamed clade that includes most non-afrothere placentals. The unnamed greenish node includes all members of that unnamed clade except for rodents and laurasiatheres. (This happens to include Hominidae, which is why it has that greenish color.)

Let's refine it one step:

We're starting to get a better idea of the hypothesis. Finer:

Now we can see the basal split between afrotheres and other placentals, as well as developing complexity in Rodentia and Laurasiatheria. Finer:

Getting a little bit on the big side, now, but we can see more details. There are a lot of unnamed clades within Hystricoidea and Chiroptera—we can see that those clades are diverse, although we can't see details. Finer:

This has about 2/5 as many nodes as the base graph. It's a bit large, but still much easier to view than the base graph. Many important details are visible (e.g., the platyrrhine-catarrhine split), while others are just suggested (e.g., lots of diversity in Caviomorpha).

Obviously this works best if lots of clades have been named. I think it'll be useful for boiling a phylogeny down to an appropriate level: coarser for quick overviews, finer for in-depth discussion.

* Basic summary of the coarsening algorithm:
  1. Look through all nodes that have children, and find the ones whose children are all terminal (sinks).
  2. Merge each of those nodes with its children to create a "supernode".
  3. Merge all overlapping supernodes. (This is important for graphs where nodes may have multiple ancestors, although it doesn't come into play in this example.)
  4. Remove the supernodes from the graph and repeat from step 1. Keep going until no nodes are left.
  5. Add the supernodes to a new graph. A supernode is ancestral to another supernode if any of its subnodes are ancestral to any of the other supernode's subnodes.
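
The steps above can be sketched in Python as follows. This is an illustration rather than the actual Names on Nodes code: the dict-of-children graph format and the function name coarsen are assumptions, nodes left over with no children at all are treated as single-node supernodes (the summary does not say what happens to them), and step 5 is simplified to linking two supernodes whenever an original parent-child arc crosses between them.

```python
def coarsen(children):
    """Coarsen a DAG given as {node: set of child nodes}.

    Returns (supernodes, edges): a list of frozensets of original nodes,
    and a set of (parent_supernode, child_supernode) pairs.
    """
    remaining = set(children)
    for kids in children.values():
        remaining |= set(kids)

    supernodes = []
    while remaining:
        # Children of each remaining node, ignoring nodes already removed.
        live = {n: {c for c in children.get(n, ()) if c in remaining}
                for n in remaining}
        # Steps 1-2: nodes whose children are all sinks, merged with them.
        groups = [{n} | kids for n, kids in live.items()
                  if kids and all(not live[c] for c in kids)]
        if not groups:
            # Assumption: childless leftovers become singleton supernodes.
            groups = [{n} for n in remaining]
        # Step 3: merge overlapping supernodes.
        merged = []
        for group in groups:
            group = set(group)
            disjoint = []
            for other in merged:
                if other & group:
                    group |= other
                else:
                    disjoint.append(other)
            merged = disjoint + [group]
        # Step 4: remove this round's supernodes and go again.
        for group in merged:
            supernodes.append(frozenset(group))
            remaining -= group

    # Step 5 (simplified): link supernodes joined by an original arc.
    owner = {node: s for s in supernodes for node in s}
    edges = {(owner[p], owner[c])
             for p, kids in children.items() for c in kids
             if owner[p] != owner[c]}
    return supernodes, edges


# Drastically simplified toy graph, for illustration only.
toy = {
    "Placentalia": {"Afrotheria", "Boreoeutheria"},
    "Boreoeutheria": {"Rodentia", "Laurasiatheria"},
    "Rodentia": {"Mus", "Cavia"},
    "Laurasiatheria": {"Canis", "Bos"},
}
supernodes, edges = coarsen(toy)
```

Calling coarsen again on the resulting supernode graph would give the next coarser level, which is one way a resolution control could step through levels.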

05 April 2010

Sketch of a Phylogenetic Query Language

Names on Nodes uses MathML for two primary purposes:
  • Delineating phylogenetic hypotheses (as directed, acyclic graphs).
  • Associating identifiers with definitions.
In some ways this works out to be a bit like a query language. You can use it to set up data constructs, and then search them for groups of interest. For example, suppose you wanted a list of all stem-humans from Kenya. Assuming that your dataset included 1) a taxonomic unit called Homo sapiens, 2) a group called extant for all extant taxonomic units, and 3) a group called Kenya for all Kenyan taxonomic units, that query might look like this:
<apply xmlns="http://www.w3.org/1998/Math/MathML">
   <intersect/>
   <ci>Kenya</ci>
   <apply>
      <setdiff/>
      <apply>
         <csymbol definitionURL="http://namesonnodes.org/ns/math/2009#def-Total"/>
         <ci>Homo sapiens</ci>
         <ci>extant</ci>
      </apply>
      <ci>Homo sapiens</ci>
   </apply>
</apply>
MathML is great for being flexible and extensible enough to cover concepts like this. But ... it's also really verbose. This is fine for my purposes so far, but it may be cumbersome for other purposes. So I've been playing around with a more succinct way to write these expressions. Today I tossed up some rough ideas here:

This is a plain-text format loosely inspired by mathematical notation, the C language, etc. Using it, the above query becomes:
"Kenya" & (total("Homo sapiens", "extant") - "Homo sapiens")
...which is quite a bit shorter.
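
Here is roughly what that query computes, as a minimal Python sketch over a made-up dataset. It assumes that total(x, extant) means: take every ancestor of x (x included) that is not also an ancestor of some other extant unit, then collect all descendants of those ancestors. The graph, the placeholder unit names, and the membership of the Kenya group are invented for illustration and are not real data.

```python
def closure(edges, start):
    """Return start plus everything reachable from it along edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(edges.get(node, ()))
    return seen


def invert(children):
    """Turn a parent -> children mapping into a child -> parents mapping."""
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)
    return parents


def total(children, specifier, extant):
    """Total clade of specifier relative to the other extant units (assumed meaning)."""
    parents = invert(children)
    own_ancestors = closure(parents, specifier)
    other_ancestors = set()
    for unit in extant - {specifier}:
        other_ancestors |= closure(parents, unit)
    result = set()
    for ancestor in own_ancestors - other_ancestors:
        result |= closure(children, ancestor)
    return result


# Hypothetical phylogeny: parent -> children.
phylogeny = {
    "root": {"pan stem", "stem 1"},
    "pan stem": {"Pan troglodytes", "outgroup K"},
    "stem 1": {"stem K1", "stem 2"},
    "stem 2": {"stem K2", "Homo sapiens"},
}
extant = {"Homo sapiens", "Pan troglodytes"}
kenya = {"stem K1", "stem K2", "outgroup K", "Homo sapiens"}

# "Kenya" & (total("Homo sapiens", "extant") - "Homo sapiens")
stem_humans_from_kenya = kenya & (
    total(phylogeny, "Homo sapiens", extant) - {"Homo sapiens"}
)
```

With this toy data the result is the two Kenyan units on the human side of the split; "outgroup K" is excluded because it falls on the chimpanzee side.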

This is still in very early stages, so I thought I'd post it to get some feedback.

Here are a few of the simpler clade definition examples:
"Aves"       := clade("Struthio camelus" | "Tetrao major" |
                      "Vultur gryphus").

"Saurischia" := clade("Megalosaurus bucklandii" <-
                      "Iguanodon bernissartensis").

"Avialae"    := clade("wings used for powered flight" @
                      "Vultur gryphus").