Botany Hunter - site information

About BH

This web site is an ongoing project to design a new type of plant identification key. The long-term goal is the visual identification of any plant worldwide. The Objectives and How it works sections below describe how I hope such an ambitious goal can be accomplished.

This is a work in progress, so if you have any comments or suggestions please contact me.

History
Objectives
How it works
The plant database
The plant database - re-use and access
Sources
Contact

History

The project started in February 2010 as an undergraduate project at the Australian National University under the guidance of Professor Mike Crisp and his team. The site went live in April 2010 with just a small database covering the genus Banksia. Over the course of that year, the data for most Australian plants were added from four main sources: the Flora of New South Wales, the Flora of Western Australia, the Flora of South Australia and the Flora of Australia.

The project as university coursework ended in November 2010 with an unfortunately low rate of success in identifying plants. I attributed the poor result to lack of data; the reader, the program that gathers the data from the descriptions, was only pulling in about 10 descriptive items per plant. I focussed on improving that result over the next two years.

Over the course of 2011 and 2012, I improved the reader so that it now reads, on average, 40 descriptive items per plant for 58,000 described plants. That is 2.3 million described traits. Unfortunately, even that appears not to be enough. I am now finding that the problem is not that the reader is missing data, but rather that many attributes are simply not in the original descriptions. So, in 2013, I changed the algorithm. First, I generate generic and familial descriptions by agglomerating the specific descriptions. Thus, as long as a trait was described for at least one of the sister species, it will make it into the larger clades. Currently, the search is then narrowed back down to the genera and species by using a rather crude scoring technique. I hope to fine-tune this in 2014 to use a more defensible maximum-likelihood approach.
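The agglomeration step can be pictured with a short sketch. This is only an illustration under assumptions, not the code used on the site: the function name agglomerate, the (attribute, value) trait format and the trait values are all made up for the example.

```python
from collections import defaultdict

def agglomerate(child_traits, parent_of):
    """Union the traits of every child taxon into its parent taxon.

    child_traits: dict mapping taxon name -> set of (attribute, value) pairs
    parent_of:    dict mapping taxon name -> name of the parent taxon
    Both structures are assumptions made for this sketch.
    """
    parent_traits = defaultdict(set)
    for taxon, traits in child_traits.items():
        parent_traits[parent_of[taxon]] |= set(traits)
    return dict(parent_traits)

# Illustrative (made-up) species descriptions.
species_traits = {
    "Banksia aemula": {("growth form", "tree"), ("growth form", "shrub")},
    "Banksia serrata": {("growth form", "tree"), ("leaf margin", "serrate")},
}
genus_of = {"Banksia aemula": "Banksia", "Banksia serrata": "Banksia"}

# Species -> genus, then genus -> family, using the same function.
genus_traits = agglomerate(species_traits, genus_of)
family_traits = agglomerate(genus_traits, {"Banksia": "Proteaceae"})
```

In this sketch a trait described for only one species (here the leaf margin) still reaches the genus and family descriptions, which is the point of the agglomeration.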

Progress made in 2011 was the increase in the read rate from 10 descriptive items per taxon to 25. I added 20,000 North American species and 20,000 Chinese species along with 40,000 more descriptions. Together this resulted in an increase from 200,000 to 1.4 million descriptive traits in the database.

Progress made in 2012 was the increase in the read rate from 25 descriptive items per taxon to 40. This resulted in an increase from 1.4 to 2.3 million descriptive traits in the database. I have also run several programs to clean out orthographic variants from the taxa list. I suspect there are still plenty of non-orthographic synonyms, and this will be addressed in the future. Also in 2012, another couple of thousand pictures were added.

Progress made in 2013 was the change from a bottom-up approach (search all species) to a top-down approach (choose possible families first). Navigation was improved so that the web browser's back button functions as expected. The European flora was added from Flora Europaea (thank you, Richard Pankhurst). And finally, another thousand or so pictures were added.

The police-sketch-artist input for leaf descriptions, where users draw the leaf on an intuitive canvas and the system converts the drawing to botany-speak, did not make any significant progress during 2013 and is still quite crude. But I still have hope for this. A beta version is available to Chrome users from the main screen.

The reader that converts scanned copies of classic, out-of-copyright botany books into corrected texts, which can be posted on-line as well as used as new sources of plants and descriptions, has made a little progress. I am still working on Hooker's Flora of British India (3000 pages!) as an initial test. So far I have only made it through 120 pages. I have posted these; they can be seen here.

Objectives

The objective of the web-site is the identification of plants from visual information. The first version of the program envisions users entering information into the system from plants they have observed. The system will read their descriptions, compare them with the descriptions of all plants and report back the most likely match. Because the descriptions in the database come from botanists, the user description will be more successful if it uses botanical jargon. The catch-22 is that users who can write in botanical terms may not need a computerized plant identification tool.

Version 2.0, which is in progress, will include a friendlier user interface that will guide non-botanist users in the description of the plant features. It is imagined that something akin to a criminal sketch artist would be incorporated. Still, once the user pushes the search button, the system would translate the given information into a botanical description and use the same pathways as were developed in version 1.0 to make the identification.

The long-term goal would be a phone app where a picture could be taken of a plant and an identification would be returned. I believe that two quite different pathways could be contemplated to accomplish this goal. The first would ignore the botany and focus on image mapping. The second would interpret the image in terms of its botanical features and then use this information for the identification. My intention is to follow the second path; not because I believe it is the best, but rather because it is the more interesting to me personally.

If the system becomes good at plant identification and thus becomes popular, I am contemplating adding a feature that allows users to upload their photos of plant sightings. The foreseen benefit is that the number of taxa in the system with photos attached could grow quite quickly.

How it works

There are two main challenges: getting the data and matching.

The approach being used to gather the data is quite simple in theory. A program has been written to read botanical descriptions of plants that are available on-line. It reads the descriptions and breaks the information down into chunks. For example, the Flora of New South Wales describes Banksia aemula as "Bushy shrub robust tree to 8 m high". The reader breaks this down into "Growth form Tree", "Growth form Shrub" and "Height < 8 m". Theoretically, this is simple; practically, it has proven more difficult. The reader currently pulls apart correctly 50% of a botanical description that is focussed on the plant features. It does less well when the describing botanist discusses items such as similarities to other species or the ecology of the plant.
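To make the Banksia aemula example concrete, here is a toy sketch of how such a description could be broken into chunks. It is not the actual reader: the vocabulary list, the regular expression and the function name read_description are assumptions made for this illustration, and a real reader has to cope with far more varied wording.

```python
import re

# Assumed, deliberately tiny vocabulary of growth forms.
GROWTH_FORMS = ("tree", "shrub", "herb", "vine")

# Assumed pattern for an upper height limit such as "to 8 m high".
HEIGHT_RE = re.compile(r"\bto\s+(\d+(?:\.\d+)?)\s*(cm|m)\b\s+high")

def read_description(text):
    """Break a free-text description into descriptive items."""
    lowered = text.lower()
    items = []
    for form in GROWTH_FORMS:
        # Match the growth form as a whole word, singular or plural.
        if re.search(rf"\b{form}s?\b", lowered):
            items.append(f"Growth form {form.capitalize()}")
    match = HEIGHT_RE.search(lowered)
    if match:
        items.append(f"Height < {match.group(1)} {match.group(2)}")
    return items

items = read_description("Bushy shrub robust tree to 8 m high")
print(items)
# ['Growth form Tree', 'Growth form Shrub', 'Height < 8 m']
```

Words the sketch does not know, such as "bushy" and "robust", are simply skipped, which hints at why a real reader recovers only part of each description.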

There is also the challenge that there are only a handful of sites with large stashes of on-line descriptions. After these, it will become increasingly difficult to gather additional large quantities of data. I am working on a system that can read the scanned versions of the classic botany texts. Once this is working, the amount of data to which I have access will increase significantly.

The second challenge, matching the user's specimen to the taxa in the database, has its own technical obstacles. A few examples are:

1) Missing data - if a user describes a plant with opposite leaf arrangement, how should the system deal with plants in the database that are silent on leaf arrangement?
2) Measurements - these take many forms in the botanical descriptions, such as 2 to 5m tall, less than 5m tall or c. 3m tall; these need to be homogenized.
3) Differences in wording - for example "white flowers" vs. "white petals", or a user describing something as "red" where the describing botanist said "pink".
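As an illustration of the second obstacle, the three measurement forms mentioned above could be homogenized into a common (low, high) range in metres. This is a sketch under assumptions, not the site's code: the function name parse_height, the patterns and the 25% tolerance given to "c." values are all choices made only for this example.

```python
import re
from typing import Optional, Tuple

NUM = r"(\d+(?:\.\d+)?)"  # an integer or decimal number

def parse_height(text: str) -> Optional[Tuple[float, float]]:
    """Return a (low, high) range in metres, or None if not recognised."""
    t = text.lower().strip()

    # "2 to 5m tall" or "2-5m tall"
    m = re.match(rf"{NUM}\s*(?:to|-)\s*{NUM}\s*m\b", t)
    if m:
        return float(m.group(1)), float(m.group(2))

    # "less than 5m tall": no lower bound is stated, so assume zero
    m = re.match(rf"less than\s*{NUM}\s*m\b", t)
    if m:
        return 0.0, float(m.group(1))

    # "c. 3m tall": approximate value, widened by an assumed 25% each way
    m = re.match(rf"c\.\s*{NUM}\s*m\b", t)
    if m:
        value = float(m.group(1))
        return 0.75 * value, 1.25 * value

    return None

for phrase in ("2 to 5m tall", "less than 5m tall", "c. 3m tall"):
    print(phrase, "->", parse_height(phrase))
# 2 to 5m tall -> (2.0, 5.0)
# less than 5m tall -> (0.0, 5.0)
# c. 3m tall -> (2.25, 3.75)
```

Once every description is in the same (low, high) form, a single comparison routine can be used regardless of how the botanist phrased the measurement.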

For these reasons, I decided to take a probabilistic likelihood approach. This way, if a user says 3m tall, a higher likelihood is given to a plant described as 2-5m tall than to one described as 1-3m tall. Similarly, when queried for a red flower, a pink-flowered plant would have a lower likelihood than the red-flowered plants, but a higher likelihood than a blue-flowered plant.
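One way such likelihoods could be computed is sketched below. None of this is taken from the site's code: the normal curve centred on the middle of the described range, the colour table and its numbers, and the neutral value used when a description is silent on a trait are all assumptions chosen to reproduce the ordering described above.

```python
import math

def height_likelihood(observed_m, low_m, high_m):
    """Normal density centred on the midpoint of the described range.

    Assumption: the described range spans roughly two standard
    deviations either side of the mean.
    """
    mean = (low_m + high_m) / 2
    sd = max((high_m - low_m) / 4, 0.1)  # floor avoids division by zero
    z = (observed_m - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

# Assumed likelihoods of a user saying one colour given the described colour.
COLOUR_LIKELIHOOD = {
    ("red", "red"): 0.80,
    ("red", "pink"): 0.15,
    ("red", "blue"): 0.01,
}
UNLISTED = 0.01  # assumed value for colour pairs not in the table
NEUTRAL = 0.10   # assumed value when the description is silent

def colour_likelihood(user_colour, described_colour):
    if described_colour is None:
        # Missing data: neither reward nor heavily penalise the taxon.
        return NEUTRAL
    return COLOUR_LIKELIHOOD.get((user_colour, described_colour), UNLISTED)

# A user reports a plant 3 m tall with red flowers.
print(height_likelihood(3, 2, 5) > height_likelihood(3, 1, 3))  # True
print(colour_likelihood("red", "pink") > colour_likelihood("red", "blue"))  # True
```

The per-trait values could then be multiplied (or their logarithms summed) to rank candidate taxa; how best to treat taxa that are silent on a trait remains an open design choice.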

The plant database

The plant database contains a full list of Orders and Families for Embryophyta (land plants). The sources used for the phylogeny are as follows:

Hornworts - the classification per Duff et al. (2007) except I put all five families into a single order, the Anthocerotales

Mosses - the classification per the Goffinet Lab website (www.eeb.uconn.edu/people/goffinet/Classificationmosses.html accessed 10 Sep 2010)

Liverworts - the classification per Crandall-Stotler et al. (2009) in Bryophyte Biology, Cambridge University Press.

Lycopods, Ferns and Gymnosperms - the classification per Judd et al. (2008) except I changed the name of Lindseaceae to Lindsaeaceae

A comparison of these families to those in The Plant List is as follows:

Judd did not recognize Taxodiaceae or Boweniaceae. TPL does.
Judd did recognize Lindsaeaceae and Sciadopityaceae. TPL does not.

Angiosperms - the classification per APG III

A comparison of the families in APG III to those in The Plant List is as follows:

APG III uses Asteraceae/Fabaceae; TPL uses Compositae/Leguminosae
APG III uses the spelling Ripogonaceae; TPL Rhipogonaceae
APG III uses the spelling Campynemataceae; TPL Campyneumataceae
APG III does not recognize Peraceae; TPL does.
APG III includes the following families; TPL does not: Trimeniaceae, Pennantiaceae, Anacampserotaceae, Gerrardinaceae, Lophopyxidaceae, Phellinaceae and Tetracarpaeaceae.

The lower-ranked taxa come from on-line floras, which are listed under Sources.

The plant database - re-use and access

The information in the database might be interesting to ecologists and systematists for purposes other than plant identification. In these cases, please contact me and I will try to provide the requested data.

Sources

(G) A census of the vascular plants of Tasmania.
ed. A.M.Buchanan, 2009 edition.

(G) Checklist of NT (Northern Territory) Vascular Plant Species.
eds. R.A.Kerrigan and D.E.Albrecht, 2007.

(D) eFloras.org - Flora of China.
http://efloras.org/flora_page.aspx?flora_id=2

(D) eFloras.org - Flora of North America.
http://efloras.org/flora_page.aspx?flora_id=1

(GD) eFloraSA - Electronic Flora of South Australia.
http://flora.sa.gov.au/

(G) Flora Digital de Portugal
http://www.jb.utad.pt/pt/herbario/cons_reg.asp

(G) Flora Europaea
http://rbg-web2.rbge.org.uk/FE/fe.html

(D) Flora of Australia Online
http://www.environment.gov.au/biodiversity/abrs/online-resources/flora/main/

(GD) FloraBase - the Western Australian Flora.
http://florabase.calm.wa.gov.au/

(G) FloraWeb - the Flora of Germany. Bundesamt für Naturschutz.
http://floraweb.de/

(O) IPNI - The International Plant Names Index.
http://www.ipni.org/

(GD) PlantNET - the Flora of New South Wales.
http://plantnet.rbgsyd.nsw.gov.au/

(G) PLANTS Database - USDA PLANTS.
http://plants.usda.gov/java/

(G) Swiss Web Flora
http://www.wsl.ch/land/products/webflora/welcome-en.ehtml

(D) UConn Plant Pages
http://www.hort.uconn.edu/Plants/index.html

D - These floras included descriptions of the taxa they covered.

G - These floras included distribution information.

O - IPNI was the primary source for the full names of authors; it was also used extensively to sort out many issues with names of taxa.

Contact

My name is Steve Hunter. I would welcome any comments, suggestions or reports of errors. I can be reached at bh1@botanyhunter.com