Searching multiple collections for vertebrate palaeontology specimens

19th May 2005

1. Introduction
        1.1. Motivating example
        1.2. Relevance Ranking
2. Implementation
        2.1. Thesauri
        2.2. Distributed Search
        2.3. Heuristic Semantic Analysis

1. Introduction

1.1. Motivating example

Imagine trying to express a search like this:

I want to find specimens of stegosaur metacarpals from the Kimmeridgian, found on the Isle of Wight and held in the Natural History Museum.

If the exact material you want doesn't exist, there are five degrees of freedom that a clever search-engine could slide along to find papers that would be interesting to you:

Taxon.
Anatomy.
Geological age.
Locality.
Collection.

In the absence of better hits, such an engine might offer up information on Tithonian ankylosaur manual phalanges from Dorset held in the OUMNH.
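
One way to picture the five degrees of freedom is as a record with one field per axis. The following is only a minimal sketch: the class name SpecimenQuery, its field names and the helper slipped_axes are hypothetical, and plain string comparison stands in for whatever matching a real engine would do.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SpecimenQuery:
    """One value per axis of the search (all names here are illustrative)."""
    taxon: str
    anatomy: str
    age: str
    locality: str
    collection: str


def slipped_axes(wanted, found):
    """Return the names of the axes on which `found` differs from `wanted`."""
    return [f.name for f in fields(wanted)
            if getattr(wanted, f.name) != getattr(found, f.name)]


wanted = SpecimenQuery("stegosaur", "metacarpal", "Kimmeridgian",
                       "Isle of Wight", "Natural History Museum")
fallback = SpecimenQuery("ankylosaur", "manual phalanx", "Tithonian",
                         "Dorset", "OUMNH")

# The fallback from the example above has slipped on every axis.
print(slipped_axes(wanted, fallback))
```

Exact string equality is the crudest possible test; the thesauri described in section 2.1 are what would let an engine say how far a value has slipped rather than merely that it has.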

1.2. Relevance Ranking

### Rank by number of degrees of slippage?

### Allow users to specify which axes are most/least significant.
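
The two ideas above could be combined by giving each axis a weight and summing the weights of the axes that have slipped. This is a sketch under stated assumptions rather than a settled design: the names AXES, slippage_score and rank are hypothetical, records are plain dictionaries, and a default weight of 1.0 per axis reduces the score to a simple count of slipped axes.

```python
AXES = ("taxon", "anatomy", "age", "locality", "collection")


def slippage_score(wanted, found, weights=None):
    """Sum the weights of the axes on which `found` differs from `wanted`.

    Any axis missing from `weights` counts 1.0, so with no weights at all
    the score is just the number of slipped axes.
    """
    weights = weights or {}
    return sum(weights.get(axis, 1.0)
               for axis in AXES if wanted[axis] != found[axis])


def rank(wanted, candidates, weights=None):
    """Order candidates from least to most slippage (ties keep input order)."""
    return sorted(candidates, key=lambda c: slippage_score(wanted, c, weights))


wanted = {"taxon": "stegosaur", "anatomy": "metacarpal", "age": "Kimmeridgian",
          "locality": "Isle of Wight", "collection": "NHM"}
other_museum = dict(wanted, collection="OUMNH")                 # slips on 1 axis
other_taxon = dict(wanted, taxon="ankylosaur")                  # slips on 1 axis
other_time_place = dict(wanted, age="Tithonian", locality="Dorset")  # slips on 2

hits = [other_taxon, other_time_place, other_museum]
by_count = rank(wanted, hits)
# A user who cares most about the taxon can make that axis expensive to slip on.
taxon_first = rank(wanted, hits, weights={"taxon": 3.0})
```

With equal weights the two single-axis hits come first; once the taxon axis is weighted at 3.0, the hit with the wrong taxon drops to the bottom even though it differs on only one axis.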

### View and rotate a 3D slice of the slippage space to see which areas are best represented (and which areas, because they're sparsely populated, will make good research subjects).

2. Implementation

2.1. Thesauri

To make this work, the searching system would need to have five ``thesauri'' (in the most general sense of structured collections of authority records):

An ordered list of geological ages; or a tree indicating the containment of ages within epochs, etc.
A tree indicating the phylogeny of the supported taxa, indicating (for example) the containment of Maniraptora within Tetanurae.
A graph representing the osteological components of vertebrate skeletons and the linkages between them - both physical links (metacarpals are next to manual phalanges) and analogical links (metacarpals are analogous to metatarsals).
A grid of locations indicating the distance between them.
A grid of collections indicating the distance between them and maybe also a tree or graph indicating institutional connections.
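
To make the shapes of the first three thesauri concrete, here is a minimal sketch of each with a distance function. The data is a tiny, heavily simplified fragment chosen for illustration, and every name (AGES, PARENT, EDGES and the three functions) is hypothetical; a real thesaurus would be far larger and supplied by experts.

```python
from collections import deque

# Ordered list of ages: the Late Jurassic, oldest first.
AGES = ["Oxfordian", "Kimmeridgian", "Tithonian"]

# Phylogeny as a child -> parent map (simplified for illustration).
PARENT = {
    "Stegosauria": "Thyreophora",
    "Ankylosauria": "Thyreophora",
    "Thyreophora": "Ornithischia",
    "Ornithischia": "Dinosauria",
    "Maniraptora": "Tetanurae",
    "Tetanurae": "Theropoda",
    "Theropoda": "Saurischia",
    "Saurischia": "Dinosauria",
}

# Osteology as a graph whose edges are labelled with the kind of link.
EDGES = [
    ("metacarpal", "manual phalanx", "physical"),
    ("metacarpal", "metatarsal", "analogical"),
    ("metatarsal", "pedal phalanx", "physical"),
]


def age_distance(a, b):
    """Number of steps between two ages in the ordered list."""
    return abs(AGES.index(a) - AGES.index(b))


def ancestors(taxon):
    """Return [taxon, parent, grandparent, ...] up to the root."""
    chain = [taxon]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def taxon_distance(a, b):
    """Number of tree edges between a and b via their nearest common ancestor."""
    up_a, up_b = ancestors(a), ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    raise ValueError("no common ancestor")


def anatomy_distance(a, b):
    """Fewest links (of either kind) between two bones, or None if unconnected."""
    neighbours = {}
    for x, y, _kind in EDGES:
        neighbours.setdefault(x, set()).add(y)
        neighbours.setdefault(y, set()).add(x)
    seen, queue = {a}, deque([(a, 0)])
    while queue:  # breadth-first search
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```

In this fragment, Stegosauria and Ankylosauria are two edges apart, Kimmeridgian and Tithonian one step apart, and a metacarpal is one link from both a manual phalanx and a metatarsal. Whether physical and analogical links should count equally is an open design question that this sketch does not try to answer.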

These thesauri would need to be provided by experts in the field. Experience shows that building them is usually more work than people expect, and is in any case an inexact science. That's OK: even a vague, imprecise and error-strewn thesaurus will yield useful results.

2.2. Distributed Search

### New sites can "nuzzle up to" the network.

2.3. Heuristic Semantic Analysis

### Guess which bits of title/abstract are author, taxon, etc.
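
The simplest form such guessing could take is a lookup of each word against the thesauri. The sketch below assumes that approach purely for illustration: VOCAB is a hypothetical stand-in for vocabularies derived from the thesauri of section 2.1, guess_roles is an invented name, and nothing here attempts the harder cases such as author names or multi-word terms.

```python
import re

# Hypothetical word lists, standing in for terms drawn from the thesauri.
VOCAB = {
    "taxon": {"stegosaur", "ankylosaur"},
    "anatomy": {"metacarpal", "metacarpals", "phalanx", "phalanges"},
    "age": {"kimmeridgian", "tithonian"},
    "locality": {"dorset"},
}


def guess_roles(text):
    """Pair each word with the first axis whose vocabulary contains it.

    Words that match no vocabulary are paired with None.
    """
    tags = []
    for word in re.findall(r"[A-Za-z]+", text):
        role = next((axis for axis, words in VOCAB.items()
                     if word.lower() in words), None)
        tags.append((word, role))
    return tags


tags = guess_roles("Tithonian ankylosaur phalanges from Dorset")
```

For this title the sketch tags "Tithonian" as an age, "ankylosaur" as a taxon, "phalanges" as anatomy and "Dorset" as a locality, and leaves "from" untagged.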

Feedback to <mike@miketaylor.org.uk> is welcome!