Practical Taxonomic Computing

Richard J. Pankhurst

Cambridge University Press, Cambridge. 1991. xi+202 pp. ISBN 0-521-41760-0.

Some books strike you as being good from beginning to end, while others annoy you from start to finish; the rest are a bit of both. Unfortunately, this book fits into the last category - parts of it are very useful, parts of it are pretty erratic, and parts of it made me want to tear the pages out and throw them into the paper-recycling bin.

The book itself is basically an update and re-working of Richard Pankhurst's earlier book, Biological Identification. The Principles and Practice of Identification Methods in Biology (Edward Arnold, 1978). However, this time, as well as the four chapters about identification methods, there is a chapter on databases, one on classification, and one on expert systems. This expansion of topics is a clear indication, if nothing else, of the inroads that computers have made into the working life of taxonomists.

These inroads are not, of course, confined to the relatively specialized world of taxonomy. Computers have come out of whatever dark and musty hole they have been hiding in for the previous 50 years, and have spread like a plague into all facets of modern life. It is not all that long ago that computers were huge machines that took up entire rooms, and required all sorts of complex instructions to perform even menial tasks. Then Apple released the first completely-assembled personal computers in 1977 (while I was a university undergraduate), and IBM released the first personal computer suitable for business people in 1981 (while I was a postgraduate). Since then, computers have bred in a way that makes rabbits look celibate, and have spread in a way that makes wildfires appear to move in slow motion.

What are we taxonomists, conservative people that we are, to do about this? Not a lot, unfortunately. Scientists have, by their very nature, always been at the forefront of computer usage, and so it should not come as any surprise that computers have been welcomed with open arms in systematics. What has perhaps caught more than a few systematists by surprise is the speed with which computer usage has spread in the past 10-15 years, and the depth to which it has influenced our work practices.

With this background in mind, it becomes obvious that what we need is a book about the uses to which computers can be put in taxonomic work. After all, computers are (at least at the moment) just a tool - a "productivity" tool, as the computer people express it. They are supposed to help increase our productivity at work, by being a tool that can relieve us of some of the more tedious parts of our work (such as mathematical calculations). They should also open up new possibilities for us, by allowing us to undertake tasks that we would never have even contemplated before (such as collating and inter-relating huge amounts of information - organizations such as IOPI would have been unthinkable even a short time ago). This is why many people have referred to the current restructuring of society as the computer "revolution", because the social consequences of these changes in work practice are potentially as large as those of the so-called "industrial revolution".

So, if computers are supposed to be of some use in taxonomy, someone had better tell us what these uses are; and this is apparently what Richard Pankhurst has set out to do. Unfortunately, it's very hard to work out who this book is actually aimed at. If it is intended to be used by the computer novice, then far too much of the information is either wrong, or left out entirely, to be of much use to them; if it is supposed to be aimed at current users who want to know more about computing, then large slabs of the book are redundant; and if it is for computer experts, then the book falls far short of what they need.

The book often seems to be a rather arbitrary collection of bits and pieces of information that the author knows something about from personal experience, and has decided to put in as a result. It is therefore rather difficult to discern any unifying theme to the book. The closest that I can get is to point out that it is often based around worked examples from a particular computer program for each type of computer usage in taxonomy. Thus, the book is about applications rather than explanations - the theoretical introduction to each section is sometimes a bit skimpy, and a lot of the information is therefore imparted by working through the example. This is certainly one way to do it, but there is very little satisfaction to be had from actually reading a book set out in this way.

The production quality of the book is quite good, although there is the usual collection of typographical errors. There are a number of illustrations to break up the text, but many of these contain a great deal of text themselves. The index is quite comprehensive, and the list of references is usefully broad in scope. Those of you who are female will, however, find no reference to yourselves anywhere in the text, as all computer users are treated as being male. The price of the book is a bit steep, even for a hardback.

The book has eight chapters, plus an appendix. It starts, not unexpectedly, with a General Introduction (10 pages). This is where the impact of computers could be usefully assessed, where the enormous possibilities of computers could be explained, and where some enthusiasm for the topic could be engendered. Unfortunately, no such things occur - the impact, possibilities, and enthusiasm are all assumed a priori. What we get instead is a bit of background information, some terminology, and an explanation of the data set that is used as the primary example throughout the rest of the book. This is all very necessary, but it is far short of what is needed - after all, the most obvious audience for this book is people who need to be made aware of what computers can do for them, and this audience is told no such thing.

The second chapter is about Databases (33 pages). It explains their uses quite well, and how they are set up, and how they are queried. However, the choice of topics is a somewhat eclectic mix; in particular, the section on database languages seems bizarrely out-of-place. Nevertheless, this is one of the chapters where the potential uses of computer technology are made most obvious, especially in the section on example applications.

The third chapter, on Classification (44 pages), is the low point of the book. The book is titularly about "practical" computing, and this section is a very uncomfortable mix of necessary background information about classification theory with the practical application of this theory to computer technology. It certainly doesn't get the mix right - the explanations of phenetic and cladistic methodology are woeful, although cladistics fares much the worse of the two. Anyone who has only learnt about classification techniques from this book needs to clear their mind completely, and then needs to go and read a book that is specifically about these topics. This is not to say that there aren't extremely useful sections in this chapter, as indeed there are - it's just that they are interspersed with such complete inaccuracies and inadequacies that they are lost to anyone who doesn't already have a strong background in these topics.

The inadequacies of the chapter range from simple things like a list of useful references for biological nomenclature that doesn't mention Charles Jeffrey's excellent book on the topic; to statements about objects being "split off one at a time" in a single-linkage clustering (which is agglomerative, not divisive); to the lack of discussion about the robustness of the various phenetic techniques to violation of their underlying mathematical assumptions, in spite of the fact that the assumptions are discussed in the book and that several published studies have evaluated them. Furthermore, the list of available computer programs does not include the widely-used Australian package PATN (or even its predecessor, NTP).
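To make the single-linkage point concrete, here is a minimal sketch in Python (not taken from the book, and not a replacement for a proper clustering package). The four taxa and their distances are invented purely for illustration, and the function name `single_linkage` is my own. Every object starts in its own cluster, and the two closest clusters are merged at each step, so nothing is ever "split off".

```python
def single_linkage(dist, labels):
    """Agglomerative single-linkage clustering.

    dist maps frozenset({a, b}) to the distance between objects a and b.
    Returns the merges in order, as (cluster_1, cluster_2, distance).
    """
    # Start with every object in its own cluster (the agglomerative start point).
    clusters = [frozenset([x]) for x in labels]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: cluster distance is that of the closest pair.
                d = min(dist[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Invented distances between four hypothetical taxa.
PAIRS = {("A", "B"): 1, ("A", "C"): 4, ("A", "D"): 6,
         ("B", "C"): 3, ("B", "D"): 5, ("C", "D"): 2}
DIST = {frozenset(pair): d for pair, d in PAIRS.items()}

merges = single_linkage(DIST, ["A", "B", "C", "D"])
for left, right, d in merges:
    print(sorted(left), "+", sorted(right), "at distance", d)
```

With these made-up distances, A joins B first, then C joins D, and finally the two pairs are joined at the distance of their closest members (B and C).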

However, it is the section on cladistics that leaves most to be desired, and these are the pages that I was tempted to rip out. There appear to be two explanations in the chapter for many of the topics - there is first a brief introduction, along with a completely faked example, and then there is a more detailed explanation. It is the first part that is the most woeful, and in many ways it contradicts the later amplifications. For example, synapomorphies are defined (twice) as "correlated character sets"; the author equates "genealogical distance with genetic distance" rather than with classification rank; only outgroup analysis for character polarity is mentioned; there is much confusion of cladograms with character-state trees; and the worked example uses a non-monophyletic group along with arbitrary coding of characters and their polarity (all of the other examples in the book are realistic - this is the only one that is forced; the arbitrary assignment of polarity means that there is no necessary cladistic structure in the data set and therefore there is no phylogenetic information to be analysed).

The later section doesn't necessarily fare too well, either. For example, many of the so-called "problems" claimed for cladistic analysis also apply to all other classification techniques; there is much confusion of evolutionary parsimony with descriptive parsimony; the author naively claims that "if suitable fossils were available for study, polarity could be decided without any doubt"; and there is no mention of the fact that the a priori rationale for the procedures allows an objective assessment of what constitutes the "best" tree, whereas phenetics has no such assessment. The author also claims that groups defined by symplesiomorphies "might be paraphyletic", whereas they must be, since all taxa with the related apomorphy have to be included in the clade to make it monophyletic, unless some other character supports the monophyly. Furthermore, only parsimony techniques are discussed, and methods for molecular data are specifically excluded.

Only a few other problems in this section can be listed here, or you'll become bored by this tirade. The reasons for rejecting outgroup analysis for character polarity (p. 76) are bizarre, to say the least; and the author places great weight on the "statistical significance" of tree structure (e.g. p. 77), which is only useful if the procedures used are an appropriate model for phylogenetic data. The following statements appear to be philosophically naive (the first is a universal truism, the second is ridiculous): "This kind of argument may be plausible, but it is not direct evidence; it is just a credible hypothesis" (p. 74); "Statements about character compatibility are facts and not hypotheses" (p. 83). Finally, the author ends by suggesting that phenetics has "few subjective decisions", in spite of the plethora of techniques available to choose from, and that taxonomists should "experiment" with both cladistics and phenetics because "each type of method has different aims" - but wouldn't it be better to decide on your aims first, and then choose the appropriate method?

Chapters 4-7 are about identification methods, and they are a definite improvement on the previous chapter. They start with Conventional Identification Methods (22 pages), continue with Computerised Identification Methods (53 pages) and a History of Identification Methods (5 pages), and end with Applications in Computerised Identification (6 pages). Chapter 4 is a good introduction to biological identification methods, although some of the examples insist that Fig. 1.1 has missing data (which doesn't seem right to me). The chapter starts from scratch, and then develops all of the necessary theory, which is far more detail than we got in the chapter on databases. Also, I can't see why chapter 6 isn't the introduction to chapter 4.

Chapter 5 is interesting, as it makes clear the way in which computer identification has developed over the past 20 years. For example, the original on-line identification programs of the 1970s were developed from manual punched-card polyclaves, but it is now obvious that they are simply one version of computer expert systems - in other words, they are a particular application of a much more general class of computer applications. In a similar way, construction of keys is simply an application of decision trees. The chapter is quite effective, although I would have thought that the "practical" part should include things like a discussion of the trial-and-error nature of choosing character weights. It is also somewhat inconsistent in presentation. For example, the section on interactive keys describes how an example program works, while the section on computer-generated keys describes the programming considerations instead. The summary comparison of the different methods also seems to ignore the advent of laptop computers, which is certainly unexpected.
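For readers who have not met these ideas, the following Python sketch shows both of them in miniature. It is my own illustration under stated assumptions, not the method of any program described in the book: the data matrix, the taxon names, and the functions `identify` and `best_character` are all hypothetical. The first function mimics a polyclave (or an on-line identification session) by eliminating taxa that conflict with the observed states; the second picks the character that divides the remaining taxa most evenly, which is the basic step in building a key as a decision tree.

```python
# Hypothetical data matrix: taxon -> {character: state}; None marks missing data.
MATRIX = {
    "Taxon A": {"leaf": "simple", "petals": 4, "fruit": "berry"},
    "Taxon B": {"leaf": "simple", "petals": 5, "fruit": "capsule"},
    "Taxon C": {"leaf": "compound", "petals": 5, "fruit": None},
    "Taxon D": {"leaf": "compound", "petals": 4, "fruit": "capsule"},
}


def identify(matrix, observed):
    """Return the taxa that do not conflict with the observed states.

    A taxon with missing data for a character is never eliminated by it.
    """
    remaining = []
    for taxon, states in matrix.items():
        if all(states.get(ch) is None or states[ch] == val
               for ch, val in observed.items()):
            remaining.append(taxon)
    return remaining


def best_character(matrix, taxa):
    """Pick the character whose largest state group is smallest.

    Taxa with missing data are counted in every group, because they
    would have to be keyed out under every lead. Ties are broken by
    character name; returns None if no character separates the taxa.
    """
    characters = sorted({ch for t in taxa for ch in matrix[t]})
    best = None
    for ch in characters:
        values = [matrix[t].get(ch) for t in taxa]
        states = {v for v in values if v is not None}
        if len(states) < 2:
            continue  # this character cannot divide the taxa
        missing = values.count(None)
        score = max(values.count(s) for s in states) + missing
        if best is None or score < best[0]:
            best = (score, ch)
    return None if best is None else best[1]


candidates = identify(MATRIX, {"leaf": "compound", "fruit": "capsule"})
print(candidates)
print(identify(MATRIX, {"leaf": "compound", "fruit": "capsule", "petals": 4}))
print(best_character(MATRIX, list(MATRIX)))
```

In this toy example the first query leaves two candidates (Taxon C survives because its fruit is unknown), adding the petal number leaves one, and the most even first split of all four taxa is on the leaf character. A real program would also need character weights, which is where the trial and error comes in.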

Chapter 7 is probably the most "practical" one in the book, as it makes mention of specific computer applications for identification. However, it would be even better if it included mention of a wider range of the available programs individually, and it would also be improved if there were more references from the 1980s and 1990s. Once again, the referencing is inadequate, in that there is mention of lists of identification guides without mention of David Frodin's mammoth tome about floras of the world.

Chapter 8, on Expert Systems (6 pages), is a brief introduction to the topic, and does not go into the depth that the other chapters do. It is almost a token gesture.

In the final analysis, this book is a bit too erratic to safely recommend to anyone. The content is basically a somewhat eclectic collation of computer-related information that contains much that is useful hidden among things that are either irrelevant or just plain wrong. Furthermore, the author provides no real overview of the topic, nor does he engender any enthusiasm for the potential uses of computers in taxonomy. As an introductory book, it therefore fails to live up to its promise, and it certainly can't be considered as anything more than an introduction. Given the lack of any competition, this book may find a useful place in any institutional library, but given its price and its oddities it certainly isn't value-for-money for an individual taxonomist.

David Morrison
Department of Environmental Biology & Horticulture
University of Technology, Sydney

Originally published in Australian Systematic Botany Society Newsletter 75: 18-21 (1993).