Is my model fit for purpose?

Matching data and species distribution models to applications

By Gurutzeta Guillera-Arroita & José Lahoz-Monfort (University of Melbourne)


KEY MESSAGES:
  • Species distribution models aim to ‘reconstruct’ the distribution of species using a sample of data
  • The type of data available for a species affects the interpretation and reliability of SDM outputs
  • It is essential that users consider carefully whether their SDM outputs are suitable for their intended application

Knowing where a species occurs, or could occur, is important for a wide range of conservation applications. However, we rarely have complete information about species distributions, and we normally need to infer them through modelling approaches. By building species distribution models (SDMs), we aim to ‘reconstruct’ the distribution of species, based on a sample of data.

Species distribution models are used for all sorts of purposes in conservation planning and management. For example, they have been used to understand the invasion of cane toads in Australia. (Cane toad image by Ben Phillips).

These models are often correlative, that is, they work by relating the observed pattern of species presence/absence to some explanatory variable(s).

Species distribution modelling is becoming a fundamental tool in our discipline. For instance, SDMs are used to identify areas suitable for reintroduction of threatened species and sites at risk of biological invasions, or to direct the search for new populations of species.

There are many considerations involved in building useful correlative SDMs. For an SDM to have good predictive ability we need to identify critical environmental predictors. For example, do average temperature, average rainfall and soil pH accurately capture why this plant species occurs here and not there? Defining a suitable extent for the model is also fundamental. Am I interested in describing the habitat preferences of this mammal species at a continental scale, or do I want to understand its preferences at a local scale? There is a lot written about these and other aspects of building SDMs (and Brendan Wintle has developed an excellent checklist of the basics in Decision Point #67).

But how does the type of data available for a species affect the interpretation and reliability of SDM outputs? This is a critical question in the practice of species distribution modelling, yet it is often overlooked. Users often underestimate the strong links between data type, model output and suitability for end-use. Species distribution models can lead to suboptimal conservation outcomes and misguided theory if the underlying data are not suited to the intended application.

Data types and biases in SDMs

Often, the only data available about the occurrence of a species are ‘presence’ records from databases or from museum/herbarium collections. Sometimes, data about both species presences and absences are available. These are often produced through planned surveys, but can also be obtained from other sources such as checklists of volunteer contributors. Presence/absence data may also be augmented to include information about the detection process (eg, how long it took to detect the species).

The level and reliability of information that we can extract from an SDM strongly depends on which of these types of species data we have available, and how we use them:

  • Presence-only methods (PO): There are methods to study species distributions that make use of presence-only records paired with information about the environmental conditions at those presence locations (eg, BIOCLIM). While these methods can provide interesting insights about environmental conditions where a species can exist, they have important limitations because species habitat preferences and habitat availability in the landscape are confounded. If many occurrences of a species come from areas with similar characteristics, this could be because these represent a real habitat preference, but it could also be that such conditions are simply very common in the landscape in general.
  • Presence-background methods (PB): A more powerful way to utilise species presence records is to analyse these in conjunction with information about the characteristics of the environment in the wider landscape. These methods provide a more accurate picture about species habitat preferences, as they can compare the types of environmental conditions where the species was detected to how common these conditions are in the landscape. Examples include the very popular MaxEnt and point-process methods. Yet, the modelling of species distributions based on presence-background data has important caveats. As presence-background data do not contain information about sampling effort, presence-background methods are very susceptible to estimation biases induced by sampling bias. Furthermore, presence-background methods cannot provide a robust quantification of prevalence or of probabilities of occurrence; from such data one cannot tell whether few species records are due to species rarity or due to little survey effort. Hence, presence-background methods at most only provide information about relative habitat preferences of the species. The output of presence-background methods is therefore NOT a probability of occurrence.
  • Presence-absence methods (PA): Data sets that also include species absence records are informative about sampling effort, hence they are much more robust than presence-background methods to biases in sampling and they can provide an estimation of species occurrence probabilities. However, presence-absence data can be affected by imperfect detection of the species (as are presence-only and presence-background data). Two types of errors can arise in species-occurrence data: false negatives and false positives. The first is the most prevalent in ecological surveys and occurs when species are missed in searches of occupied sites. Disregarding imperfect detection can lead to biased inference about species distributions.
  • Occupancy-detection methods (DET): Augmenting presence-absence data by collecting information about the detectability of the species helps account for imperfect detection and hence obtain a more robust estimation of probabilities of species occurrence. Information about detectability can be obtained for instance by conducting replicate visits to the sites or, within one visit, by recording data from multiple independent observers, or recording times to detection.
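
The effect of imperfect detection described above can be illustrated with a minimal sketch in Python. It is not the authors' analysis: it assumes, purely for illustration, a constant per-visit detection probability p, independent replicate visits and no false positives, and the function names are hypothetical. Under those assumptions, the probability of detecting the species at least once at an occupied site after K visits is 1 - (1 - p)^K, and an analysis that treats every non-detection as an absence would on average report ψ multiplied by that quantity rather than ψ itself.

```python
def cumulative_detection(p: float, visits: int) -> float:
    """Probability of at least one detection at an occupied site.

    Assumes a constant per-visit detection probability `p` and
    independent visits (an illustrative simplification).
    """
    return 1.0 - (1.0 - p) ** visits


def naive_occupancy(psi: float, p: float, visits: int) -> float:
    """Expected share of sites with at least one detection.

    This is what a presence-absence analysis would tend to estimate
    if false negatives were ignored.
    """
    return psi * cumulative_detection(p, visits)


if __name__ == "__main__":
    psi, p = 0.6, 0.3  # hypothetical true occupancy and per-visit detectability
    for k in (1, 2, 5):
        print(
            f"{k} visit(s): detection at occupied sites = "
            f"{cumulative_detection(p, k):.3f}, "
            f"naive occupancy = {naive_occupancy(psi, p, k):.3f}"
        )
```

With these made-up values, a single visit yields a naive occupancy of about 0.18 against a true value of 0.6, and even five visits only reach about 0.50. Replicate visits are what make it possible to estimate detectability and correct for this gap.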

In summary, there is a hierarchy in terms of the robustness of PO/PB/PA/DET methods and the quantities they can estimate (this is illustrated in Figure 1). It is essential that users consider carefully whether their SDM outputs are suitable for their intended application. Building models with unsuitable data can waste valuable resources and deliver outputs that do not solve the problem at hand.
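
The point that presence-background data cannot separate species rarity from low survey effort can be made concrete with a small Python sketch. It rests on a deliberately simple assumption that is ours, not the article's: the expected number of presence records at a site is the survey effort multiplied by the probability of occurrence there. The function name and all numbers are hypothetical.

```python
def expected_records(psi, effort):
    """Expected presence records per site.

    Assumes records accumulate in proportion to survey effort times
    the probability of occurrence (an illustrative simplification).
    """
    return [effort * s for s in psi]


# A common species surveyed lightly...
common_species = [0.8, 0.4, 0.2]
# ...and a species half as prevalent, surveyed twice as intensively.
rare_species = [0.4, 0.2, 0.1]

records_common = expected_records(common_species, effort=1.0)
records_rare = expected_records(rare_species, effort=2.0)

# Both scenarios are expected to produce the same pattern of records,
# so records alone cannot reveal which species is rarer.
print(records_common)
print(records_rare)
```

Because effort is not recorded in presence-only collections, the two scenarios are indistinguishable from the records themselves. Only the relative pattern across sites is informative, which is why the output is a relative measure rather than a probability of occurrence.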

Figure 1: Synthesis of how the type of survey data interacts with sampling bias and imperfect detection to determine what a correlative SDM can estimate. Dark arrows denote the default level of information that can be achieved with each type of survey data (PA, PB, DET). Light arrows indicate under which conditions higher levels of information can be achieved from those data types. ψ denotes the probability of species occurrence at a site, and p* the probability of detecting the species at a site where present (given all the survey effort applied per site). PO is not included as this type of data cannot distinguish preferences from availability in the landscape

In addition, it is important to consider the implications of reducing SDM outputs to a binary categorization based on thresholds, a step often conducted but rarely with clearly articulated justifications. In Box 1, we provide an illustration of these important considerations.

More examples can be found in Guillera-Arroita et al, 2015, together with a comprehensive table that discusses data type implications for a wide range of applications in ecology, conservation and biogeography.


Box 1: Prioritising invasive species

The potential distribution of an exotic species is a key indicator of its future capacity to cause damage. Examining the potential distribution of a range of candidate species can help in the prioritisation of management actions to prevent invasions.
However, as we show here, estimates of the relative likelihood of occupancy are not suitable for prioritising species according to their potential area of occurrence.
Let’s consider a set of 25 simulated species (Figure 2). We sample their distributions randomly and build SDMs based on PA and PB datasets. We assume perfect detection and large sample sizes. In statistical terms, the sum of estimated occupancy probabilities across the region gives us the expected value of the area of occurrence of the species.
This is a quantity we can obtain from PA data. However, if the output of the SDM is a relative likelihood of species occupancy (from PB data), the area of occurrence cannot be estimated. Crucially, the quantities obtained are not comparable across species, and hence species cannot be prioritized based on these data. Applying a binary conversion to the SDM output (the species is assumed ‘present’ at sites with estimates above a given threshold, and ‘absent’ if below it) does not solve the problem. It does not change the fact that prevalence cannot be estimated without absence data.
Furthermore, binary conversion is detrimental compared with using the actual probabilities of occurrence when available. This is because a binary categorization represents a coarse interpretation of species occurrence probabilities and reduces the information content compared with using the full range of values provided by the SDM.
Figure 2: Estimated area of occupancy vs true area of occupancy for each of 25 simulated species, based on presence-absence data (top row) and presence-background data (bottom row). In column 1, the continuous output is used. The other two columns use a binary conversion prior to computing AOO [threshold 1: sensitivity = specificity; threshold 2: max(sensitivity + specificity)]
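
The reasoning in Box 1 can be sketched with a toy example in Python. This is not the simulation behind Figure 2: the two four-site "species" are invented, the function names are hypothetical, and the presence-background output is mimicked by rescaling each species' probabilities to its own maximum, standing in for the unknown constant that such data cannot resolve.

```python
def expected_area(psi):
    """Expected number of occupied sites: the sum of occurrence probabilities."""
    return sum(psi)


def relative_output(psi):
    """Stand-in for a presence-background output (hypothetical scaling).

    Only ratios between sites are retained; the absolute level is lost.
    """
    top = max(psi)
    return [v / top for v in psi]


def thresholded_area(values, threshold):
    """Number of sites classed as 'present' after a binary conversion."""
    return sum(1 for v in values if v >= threshold)


widespread = [0.8, 0.6, 0.4, 0.2]  # true probabilities, expected area about 2 sites
restricted = [0.4, 0.3, 0.2, 0.1]  # same pattern, expected area about 1 site

for name, psi in (("widespread", widespread), ("restricted", restricted)):
    rel = relative_output(psi)
    print(
        name,
        "| area from probabilities:", round(expected_area(psi), 2),
        "| sum of relative output:", round(sum(rel), 2),
        "| sites above 0.6 (relative):", thresholded_area(rel, 0.6),
        "| sites above 0.5 (probabilities):", thresholded_area(psi, 0.5),
    )
```

In this toy case the probabilities separate the two species (about 2 sites versus about 1), while the relative outputs are the same for both, before and after thresholding. Thresholding the true probabilities at 0.5 gives 2 and 0 sites, a coarser answer than the sums themselves.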


More info: Gurutzeta Guillera-Arroita gurutzeta.guillera@unimelb.edu.au, José Lahoz-Monfort José.lahoz@unimelb.edu.au

Reference: Guillera-Arroita G, JJ Lahoz-Monfort, J Elith, A Gordon, H Kujala, PE Lentini, MA McCarthy, R Tingley & BA Wintle (2015). Is my species distribution model fit for purpose? Matching data and models to applications. Global Ecology and Biogeography 24: 276-292. 
