Why we trust in DNA to tell us how organisms are related

All organisms living on the planet today are thought to originate from a single ancestral organism. It’s an extraordinary claim, but it’s backed up by the extraordinary fact that all organisms are fundamentally similar at the molecular level. We all use nucleic acids (DNA or RNA) to store our hereditary material, and we all rely on almost identical sets of amino acids to build the proteins that carry out cellular processes. Major chemical pathways, such as those for protein synthesis and carbohydrate metabolism, are largely conserved throughout the realms of life. It seems highly implausible that such similarities should have arisen independently several times.

Charles Darwin suggested a common origin of life without knowing any of this, and while you may have trouble spotting your resemblance to, say, a palm tree, it’s easier with most animals. You might take offense if I likened you to a frog, but you both have two eyes, four limbs, a brain, a heart and a liver. Before the advent of fast molecular techniques, natural scientists had to rely on similarities in physical appearance to infer evolutionary relationships. Now that DNA sequencing has become fast and cheap, molecular methods have almost completely replaced this approach. Why is that?

The use of physical appearance (or “morphology” in scientific language) suffers from several drawbacks. Consider this simplified example: You compare three organisms and find that A and B have more teeth than C, whereas B and C have darker skin than A. Which is more important in inferring relationships, number of teeth or skin color? Perhaps one of these differences arose from a mutation in a single gene, whereas the other involved five genes. It’s not easy to tell. A single gene can explain the size difference between a terrier and a Great Dane: https://www.nature.com/news/2006/061009/full/news061009-12.html

Image sources:
https://www.publicdomainpictures.net/view-image.php?image=53880&jazyk=SE
https://pixabay.com/en/yorkshire-terrier-dog-terrier-pet-790361/

Morphologists do not take sides. They assume equal importance of the characters they choose, they do their best to avoid characters that depend on each other (such as “egg diameter” and “egg weight”), and they choose a large number of characters in the hope that differences in complexity will cancel out. But there is no guarantee that this will happen.

Another problem with morphology is convergent evolution. That is, distantly related organisms may acquire similar traits to deal with similar challenges in their environment. A classic example is the body shape of whales and fish. Another is the wings of butterflies and birds. Sometimes, body parts used for one function in one organism take on a completely different function in another. The mouthparts of crustaceans, such as crabs and lobsters, are modified walking legs. Lungless salamanders rely entirely on their skin and mouth for breathing.

Morphology, at least the part of it that is relevant in determining relationships, ultimately results from genetics. When we use DNA, each character is a base in a DNA sequence. There are four different bases, abbreviated A, T, C and G. Gene sequences can be very long, and there are many genes to choose from. Consequently, we can infer relationships on the basis of huge numbers of characters, and the individual characters are less ambiguous than morphological ones, because we know that each of them pertains to just a single gene.

There are still problems, though. One is that as mutations accumulate over time, individual bases may mutate more than once. When we look at a position in the sequence and spot a T in one organism and a C in another, we cannot tell whether there has been a single mutation from T to C or a mutation from T to A and then to C. Mathematical methods exist for estimating how many of these hidden, repeated mutations have occurred. But the fact that different genes mutate at different rates can also be used to our advantage.
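To give a flavor of what such a correction can look like, here is a minimal sketch of one well-known model, the Jukes-Cantor correction. It is only one of several methods, and it rests on strong simplifying assumptions: all four bases are equally common, and every base is equally likely to mutate into any other. The function name and the example values are my own and purely illustrative.

```python
import math


def jukes_cantor_distance(p):
    """Estimate mutations per position from the observed fraction p of
    positions that differ between two aligned sequences.

    Assumes the Jukes-Cantor model (equal base frequencies, equal rates
    for every kind of substitution). The formula is only defined for
    p below 0.75.
    """
    if not 0 <= p < 0.75:
        raise ValueError("p must be at least 0 and less than 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)


# Illustrative values only: the more differences we observe, the more
# hidden repeated mutations the model adds on top.
for p in (0.01, 0.10, 0.30, 0.50):
    print(f"observed {p:.2f} -> estimated {jukes_cantor_distance(p):.3f}")
```

For small observed differences the estimate stays close to the raw count, but as sequences diverge the estimate grows faster than the observed fraction, reflecting mutations that have overwritten earlier ones.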

Why do genes mutate at different rates? Most mutations are neutral or harmful. How harmful depends on the nature of the mutation but also on the importance of the mutated sequence. Over the course of many generations, an unimportant and an important sequence may mutate the same number of times, but fewer mutations are preserved in the important sequence, because organisms with mutations in this sequence are less likely to produce offspring.

The most useful genes for inferring relationships are the ones that contain both conserved sequences and sequences that store mutations. The latter provide us with characters that we can use to tell organisms apart and to rank their relationship to one another. The conserved sequences allow us to verify that we are comparing versions of the same gene in the first place.
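As a rough illustration of the difference between conserved and variable positions, the sketch below labels each position in a small set of aligned sequences. The three sequences are invented for the example and do not come from any real gene, and the function name is my own.

```python
def classify_sites(aligned_sequences):
    """Label each position 'conserved' if every sequence has the same
    base there, and 'variable' otherwise."""
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("sequences must be aligned to the same length")
    labels = []
    # zip(*...) walks through the alignment one column at a time.
    for column in zip(*aligned_sequences):
        labels.append("conserved" if len(set(column)) == 1 else "variable")
    return labels


# Hypothetical aligned fragments of the same gene from three organisms.
sequences = [
    "ATGGCATTC",
    "ATGGCTTAC",
    "ATGGCGTCC",
]

labels = classify_sites(sequences)
variable_positions = [i for i, label in enumerate(labels) if label == "variable"]
print(variable_positions)
```

The conserved positions are what reassure us that the fragments belong to the same gene, while the variable positions are the characters that carry information about relationships.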

Counting base differences between two DNA sequences sounds simple but can actually be tricky. Consider the following two sequences:

ATTGCGA_
AATTGCGA

A base-by-base count, with the gap counted as a difference, would tell you that the sequences differ in six places. But all the differences can be explained by just one simple change: the deletion of a base in the upper sequence. If instead we compare the sequences like this:

A_TTGCGA
AATTGCGA

the sequences differ at just a single position. Again, algorithms exist for dealing with these ambiguities. My next post will explain one of them.
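In the meantime, the position-by-position comparison itself is easy to sketch. The snippet below simply counts mismatching positions for the two gap placements shown above, treating the gap character as a difference. It does not search for the best placement of the gap, which is the hard part. The function name is my own.

```python
def count_differences(seq_a, seq_b):
    """Count the positions at which two equally long sequences differ.
    A gap ('_') opposite a base counts as a difference."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


lower = "AATTGCGA"

# Gap placed at the end of the upper sequence.
gap_at_end = count_differences("ATTGCGA_", lower)

# Gap placed after the first base of the upper sequence.
gap_after_first_base = count_differences("A_TTGCGA", lower)

print(gap_at_end, gap_after_first_base)
```

The same two sequences give very different counts depending on where the gap is placed, which is why the placement has to be chosen carefully before any differences are counted.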