The Global Biodiversity Information Facility (GBIF) is an aggregator for biodiversity data. The big advantage of using an aggregator like GBIF over getting data directly from the original source is that it provides a single point of entry to many data sets, so an analysis written for one data set can, technically, be applied to any other.
There are two ways to get data from GBIF:
- using GBIF’s own R library to search and download the data directly into an R script
- searching the GBIF website and downloading the data for offline use
Both approaches have advantages and disadvantages, illustrated here using two different bird species: the bald ibis and the tawny owl.
Searching GBIF from inside R
GBIF provides a package for searching and reading data directly into your R script. This package is particularly suitable if the data set you are interested in is relatively small. The following code will search and read all occurrences of the bald ibis (Geronticus eremita), a relatively rare species for which GBIF holds about 1000 records, into an R variable:
```r
library("rgbif")

# search for species key
key <- name_backbone(name = "Geronticus eremita")$speciesKey

# get occurrences
occurrences <- occ_search(taxonKey = key, limit = 1000)
head(occurrences$data)

# write data to disk for reuse (the data/ directory must already exist)
write.csv(occurrences$data, "data/occurrences.csv")
print("done writing data")
```
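Before pulling records, it can help to check how many occurrences GBIF holds for a species, since that number decides which of the two approaches is practical. The following is a minimal sketch, assuming rgbif is installed and a network connection is available. It uses `occ_count()`, which returns only the number of matching records rather than the records themselves. The variable name `n_records` is my own choice.

```r
library("rgbif")

# look up the species key, as in the search above
key <- name_backbone(name = "Geronticus eremita")$speciesKey

# ask GBIF for the number of matching occurrence records only
n_records <- occ_count(taxonKey = key)
print(n_records)
```

If the count is in the low thousands, searching from inside R is comfortable. For counts in the hundreds of thousands, the download route described next is the better fit.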
Downloading data for offline use
GBIF holds more than 500,000 occurrence records for the tawny owl. Trying to fetch such a large data set through a direct HTTP request is likely to end in a network timeout, so GBIF provides better ways of downloading large data sets.
Occurrence data can be searched by going to the GBIF website (http://gbif.org) and opening the occurrences tab (above the search box). I selected Strix aluco under “Scientific name”. After signing in, I chose to download the data in the “simple” format, which is a zipped text file containing one occurrence per row. You can leave the data zipped, as R can read compressed files without unpacking them first. I stored the data on my computer in the file data/GBIF_Strix.zip. Don’t forget to cite your data; a citation snippet will be generated for you.
Once the data is stored on your local computer, you can load it into your R script. The package “readr” can read data from a zipped file directly into a data frame. Note that GBIF data is in tab-separated format, so the function to use is read_tsv.
```r
library(readr)

# read the zipped, tab-separated GBIF download without unpacking it
occurrences <- read_tsv("data/GBIF_Strix.zip")
head(occurrences)
```
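By default, readr guesses each column's type from the data, which may not always give the type you want in a file with many sparsely filled columns. The sketch below shows how column types could be set explicitly. It reads a small invented sample from a string instead of the real download, and it assumes readr 2.0 or later for the `I()` wrapper. The column names `gbifID`, `species`, `decimalLatitude` and `decimalLongitude` follow the GBIF simple format, but the rows are made up for illustration.

```r
library(readr)

# invented sample standing in for the tab-separated GBIF file
sample_tsv <- paste0(
  "gbifID\tspecies\tdecimalLatitude\tdecimalLongitude\n",
  "1\tStrix aluco\t52.1\t5.3\n",
  "2\tStrix aluco\t\t\n"
)

# declare the coordinate columns as numbers and everything else as text
occurrences <- read_tsv(
  I(sample_tsv),
  col_types = cols(
    decimalLatitude = col_double(),
    decimalLongitude = col_double(),
    .default = col_character()
  )
)
print(occurrences)
```

The same `col_types` argument can be passed when reading data/GBIF_Strix.zip. Empty coordinate fields, as in the second sample row, are read as NA.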
Cleaning up the data
GBIF data, like any data, needs to be cleaned before use. Here, I will remove occurrences that do not have coordinates.
```r
occurrences <- occurrences[
  !is.na(occurrences$decimalLatitude) & !is.na(occurrences$decimalLongitude),
]
```
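To see what this filter does, here is a small self-contained sketch on an invented data frame. The rows and the helper names `toy`, `has_coords` and `cleaned` are made up for illustration. Only the two coordinate column names come from the GBIF data.

```r
# invented rows: the second has no coordinates, the third lacks a longitude
toy <- data.frame(
  gbifID = c("1", "2", "3"),
  decimalLatitude = c(52.1, NA, 48.9),
  decimalLongitude = c(5.3, NA, NA)
)

# TRUE only where both coordinates are present
has_coords <- !is.na(toy$decimalLatitude) & !is.na(toy$decimalLongitude)

# number of rows that will be dropped
print(sum(!has_coords))

# keep only the rows with complete coordinates
cleaned <- toy[has_coords, ]
print(nrow(cleaned))
```

Counting the dropped rows before filtering is a cheap way to notice when a cleaning step removes far more data than expected.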
In the next post, I will plot the occurrence data on a map.
Chamberlain S, Boettiger C (2017). “R Python, and Ruby clients for GBIF species occurrence data.” PeerJ PrePrints.
GBIF.org (17 August 2020) GBIF Occurrence Download https://doi.org/10.15468/dl.nt5hgg
Hadley Wickham, Jim Hester and Romain Francois (2018). readr: Read Rectangular Text Data. R package version 1.3.1. https://CRAN.R-project.org/package=readr